
Fix Non-Contiguous Tensor Issue in Checkpoint Consolidation #708

Conversation

gioannides (Contributor)

What does this PR do?

This PR fixes an issue in the consolidate_tensor_parallel_checkpoints function where non-contiguous tensors caused a ValueError when saving consolidated model checkpoints in the Safetensors format. The fix ensures that tensors are made contiguous before and after concatenation, resolving the memory layout issue during saving.

Error Example:

When running the following command:

from pathlib import Path
from optimum.neuron.distributed.checkpointing import consolidate_model_parallel_checkpoints_to_unified_checkpoint

consolidate_model_parallel_checkpoints_to_unified_checkpoint(
    Path("./model_sharded/model/shards"),
    "./consolidated",
    "safetensors",
)

The following error occurs:

ValueError: You are trying to save a non contiguous tensor: model.layers.0.self_attn.o_proj.weight which is not allowed.
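The failure can be reproduced with safetensors alone, independent of the consolidation code. Below is a minimal sketch (the tensor name and output file are illustrative) that triggers the same error:

import torch
from safetensors.torch import save_file

# A transpose returns a view with a non-contiguous memory layout.
weight = torch.randn(4, 8).t()
assert not weight.is_contiguous()

try:
    save_file({"model.layers.0.self_attn.o_proj.weight": weight}, "repro.safetensors")
except ValueError as e:
    print(e)  # You are trying to save a non contiguous tensor: ...

# Calling .contiguous() first copies the data into a dense layout, and the save succeeds.
save_file({"model.layers.0.self_attn.o_proj.weight": weight.contiguous()}, "repro.safetensors")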

This PR solves the issue by applying the .contiguous() method to tensors at key points in the consolidation process, ensuring the consolidated model can be saved without errors.
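Schematically, the change amounts to the sketch below. This is not the actual optimum-neuron code: merge_tp_shards, shards, and partition_dim are hypothetical stand-ins for the values that consolidate_tensor_parallel_checkpoints derives from the sharding metadata.

import torch

def merge_tp_shards(shards: list[torch.Tensor], partition_dim: int) -> torch.Tensor:
    # Hypothetical helper mirroring the PR's fix. Shards loaded from
    # tensor-parallel checkpoints may be non-contiguous views, so normalize
    # their layout before concatenating...
    shards = [shard.contiguous() for shard in shards]
    # ...and make the merged result contiguous as well, so it is safe to
    # pass to safetensors when saving the consolidated checkpoint.
    return torch.cat(shards, dim=partition_dim).contiguous()

Note that .contiguous() returns the tensor itself when the layout is already contiguous, so the extra calls only copy data when a copy is actually needed.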

Motivation and Context:

When consolidating distributed model checkpoints (e.g., from model shards), tensors may not be contiguous in memory. Safetensors stores each tensor as a flat memory buffer, so it does not allow saving non-contiguous tensors. This PR modifies the consolidate_tensor_parallel_checkpoints function to apply .contiguous() to tensors so that the checkpoints can be properly consolidated and saved.

This change impacts the overall consolidation pipeline, as consolidate_model_parallel_checkpoints_to_unified_checkpoint depends on the lower-level functions to process tensors correctly.

@dacorvo (Collaborator) left a comment

Thank you very much for the pull-request. It is ok for me, but gently pinging @michaelbenayoun for confirmation.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dacorvo (Collaborator) commented Oct 14, 2024

@michaelbenayoun gentle reminder ...

@michaelbenayoun (Member) left a comment

Thanks for taking care of that!

@michaelbenayoun (Member)

Could you just run the make style command to fix check_code_quality, please?

@gioannides (Contributor, Author) commented Oct 25, 2024

@michaelbenayoun np! Where can I see instructions on how to do that?

@michaelbenayoun (Member)

The steps to follow are:

  1. Put yourself inside the optimum-neuron directory
  2. Run pip install -e ".[quality]"
  3. Run make style

@gioannides (Contributor, Author)

@michaelbenayoun done

@dacorvo (Collaborator) commented Nov 14, 2024

Something went wrong when you ran the command, because it changed many files.
Rebasing and merging as #736

@dacorvo dacorvo closed this Nov 14, 2024