-
I'm also not quite sure whether the number of buffers I chose is correct, but in the all_gather operation all of them should be pulled from the NVMe at once, so the small value (5) in combination with the buffer_size of 1e9 that the documentation provides also leads to an error in my case.
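For context, the offload_param section from the documentation looks roughly like this (a sketch; buffer_count and buffer_size are the documented values referred to above, the path is just a placeholder):

```json
"offload_param": {
    "device": "nvme",
    "nvme_path": "/local_nvme",
    "buffer_count": 5,
    "buffer_size": 1e9
}
```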
-
Hey there, I wanted to ask if somebody can explain something I've encountered lately when using DeepSpeed ZeRO-Infinity with transformers.
So I use the vanilla integration and configured DeepSpeed with the following json-file:
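(a minimal sketch of the relevant zero_optimization / offload_param part; buffer_size and nvme_path are the values discussed further down in this post, every other field is an assumption)

```json
{
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/home/local/mem",
            "pin_memory": true,
            "buffer_count": 5,
            "buffer_size": 4194304
        }
    },
    "fp16": { "enabled": true },
    "train_micro_batch_size_per_gpu": "auto"
}
```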
while the corresponding yaml looks like this:
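(again only a sketch of the Accelerate side; the config file name, process counts and precision are assumptions)

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: ds_config.json  # the JSON file above
  zero3_init_flag: true
machine_rank: 0
mixed_precision: fp16
num_machines: 1
num_processes: 1
use_cpu: false
```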
Accelerate launches DeepSpeed as it should and starts to offload parameters.
Now I tried to apply DeepSpeed to Qwen/Qwen3-0.6B with PEFT + the GRPOTrainer from trl.
The thing I wonder about is:
You might have noticed the buffer_size of 4194304 in my config.json, which differs widely from the parameters that the documentation recommends.
If I look into the swap folder
/home/local/mem
I can see several (197) .swp files, each of which stands for a layer that has to be offloaded. They are quite small except for the first one (0): the token embedding layer (0) has around 311,000,000 bytes (311 MB) in FP16 for Qwen3-0.6B, while the linear layers are 2, 4, or 6 MB, so that is a huge difference. This is also visible in the sizes of the files saved in the memory dir.
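A quick way to verify this, independent of DeepSpeed itself, is to walk the swap folder and sort the .swp files by size (a small helper sketch, nothing DeepSpeed-specific):

```python
import os

# Walk the NVMe swap folder and print the largest .swp files, so the
# oversized token-embedding buffer stands out against the small linear layers.
swap_dir = "/home/local/mem"  # the nvme_path from the config above

sizes = []
for root, _, files in os.walk(swap_dir):
    for name in files:
        if name.endswith(".swp"):
            path = os.path.join(root, name)
            sizes.append((os.path.getsize(path), path))

for size, path in sorted(sizes, reverse=True)[:5]:
    print(f"{size / 1e6:8.1f} MB  {path}")
```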
As every .swp file reflects a single buffer in the all_gather operation, I have 196 buffers which are small and one that is large.
The only choice I can make for buffer_size in the JSON configuration is a single parameter count (which is multiplied by the bytes per parameter internally), and it needs to fit all of them.
So if I choose "4194304", any of the Qwen layers to be offloaded will fit, except for the token embedding layer, which also seems to be marked swappable by DeepSpeed.
Persistence on other devices is checked against the total parameter count, but only as a lower bound, so anything smaller than x will never be offloaded.
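Presumably this refers to stage3_param_persistence_threshold, so the two knobs in play look roughly like this (the threshold value is only an illustration, not a recommendation):

```json
{
    "zero_optimization": {
        "stage": 3,
        "stage3_param_persistence_threshold": 1e5,
        "offload_param": {
            "device": "nvme",
            "buffer_size": 4194304
        }
    }
}
```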
A little dig into the code:
In "deepspeed/runtime/zero/partition_parameters.py" (line 1665 for the 0.17.2 branch) this is the place where parameters are prepared for nvme offloading.
When I run DeepSpeed I get the following error:
```
[rank0]: AssertionError: More elements 155582464 than buffer size 4194304
```
This is because the token embedding layer exceeds the buffer, whereas all of the other layers would fit.
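The numbers line up with Qwen3-0.6B's published dimensions (vocab_size 151936, hidden_size 1024; these two values are assumptions taken from the model card, not from the log):

```python
# Rough arithmetic behind the assertion; model dimensions are assumptions
# based on Qwen3-0.6B (vocab_size=151936, hidden_size=1024).
vocab_size, hidden_size = 151_936, 1_024

embedding_elements = vocab_size * hidden_size   # 155_582_464 -> the element count in the assertion
embedding_bytes = embedding_elements * 2        # FP16 -> ~311 MB, the big .swp file
buffer_size = 4_194_304                         # elements per swap buffer from the config

print(embedding_elements, f"{embedding_bytes / 1e6:.0f} MB", embedding_elements > buffer_size)
```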
If I change the condition so that a parameter that does not fit into a single buffer is no longer treated as swappable, DeepSpeed-Infinity goes to work and runs fine: execution continues in the else branch, and as a consequence the large layer is kept either in RAM or in GPU memory, but not on the NVMe, and I can use DeepSpeed.
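Purely as an illustration of the idea (the names below are hypothetical, not DeepSpeed's actual identifiers): the guard amounts to taking the NVMe-swap path only when the partitioned parameter fits into a single swap buffer, and falling through to the RAM/GPU path otherwise.

```python
# Illustration only, not the actual DeepSpeed source: a sketch of the kind of
# condition that routes oversized parameters away from the NVMe swapper.
def should_swap_to_nvme(param_numel: int, buffer_size_elements: int) -> bool:
    """Hypothetical helper mirroring the modified condition."""
    return param_numel <= buffer_size_elements

# Qwen3-0.6B token embedding vs. the configured buffer_size:
print(should_swap_to_nvme(155_582_464, 4_194_304))  # False -> keep in RAM/GPU
print(should_swap_to_nvme(4_194_304, 4_194_304))    # True  -> eligible for NVMe swap
```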