Fix Non-Contiguous Tensor Issue in Checkpoint Consolidation #708
Conversation
Thank you very much for the pull request. It is ok for me, but gently pinging @michaelbenayoun for confirmation.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@michaelbenayoun gentle reminder ...
Thanks for taking care of that!
Could you just run the
@michaelbenayoun np! Where can I see instructions on how to do that?
The steps to follow are:
@michaelbenayoun done
Something went wrong when you ran the command, because it changed many files.
What does this PR do?
This PR fixes an issue in the `consolidate_tensor_parallel_checkpoints` function where non-contiguous tensors caused a `ValueError` when saving consolidated model checkpoints, particularly with the Safetensors format. The fix ensures that tensors are made contiguous before and after concatenation, resolving memory layout issues during saving.

Error Example:
When running the following command:
The following error occurs:
```
ValueError: You are trying to save a non contiguous tensor: model.layers.0.self_attn.o_proj.weight which is not allowed.
```
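For context, a minimal standalone snippet that triggers the same class of error (the tensor name is taken from the error message above; everything else is illustrative, not the PR's actual command):

```python
import torch
from safetensors.torch import save_file

# A transposed view shares storage with the original tensor and is
# therefore non-contiguous in memory.
weight = torch.randn(8, 16).t()
assert not weight.is_contiguous()

# safetensors refuses to serialize non-contiguous tensors and raises the
# ValueError shown above.
save_file({"model.layers.0.self_attn.o_proj.weight": weight}, "model.safetensors")
```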
This PR solves the issue by applying the `.contiguous()` method to tensors at key points in the consolidation process, ensuring the consolidated model can be saved without errors.

Motivation and Context:
When consolidating distributed model checkpoints (e.g., from model shards), tensors may not be contiguous in memory. Safetensors, which is a memory-efficient way to store models, does not allow saving non-contiguous tensors. This PR modifies the `consolidate_tensor_parallel_checkpoints` function to apply `.contiguous()` to tensors to ensure that the checkpoints can be properly consolidated and saved.

This change impacts the overall consolidation pipeline, as `consolidate_model_parallel_checkpoints_to_unified_checkpoint` depends on the lower-level functions to process tensors correctly.
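As a rough sketch of the pattern described above (not the actual optimum-neuron code; the shard tensors here are made up for illustration):

```python
import torch
from safetensors.torch import save_file

# Hypothetical tensor-parallel shards. In the real consolidation code these
# come from per-rank checkpoint files, and slicing/transposition along the
# way can leave them non-contiguous.
shards = [torch.randn(8, 4).t() for _ in range(2)]  # views -> non-contiguous

# Make each shard contiguous before concatenation and the result contiguous
# after, so the consolidated tensor has a packed memory layout.
full_weight = torch.cat([shard.contiguous() for shard in shards], dim=0).contiguous()

# Saving now succeeds instead of raising the ValueError.
save_file({"model.layers.0.self_attn.o_proj.weight": full_weight}, "consolidated.safetensors")
```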