Differing component implementation logic across recipes #2307

Open
EugenHotaj opened this issue Jan 29, 2025 · 3 comments

Comments

@EugenHotaj (Contributor)

I've noticed that torchtune has different implementations for some components across recipes. A couple of examples:

Gradient accumulation:

  • full_distributed_finetune.py correctly normalizes the loss across ranks / gradient accumulation steps (here).
  • lora_dpo_distributed.py (and Full DPO Distributed #2275) use the "old", incorrect implementation (here). Roughly the idea is sketched below.
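For concreteness, here's a minimal sketch of the "correct" normalization I mean (not torchtune's actual code; the function and field names are made up, label shifting for causal LM is omitted, and torch.distributed is assumed to be initialized):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def accumulated_step(model, optimizer, micro_batches, ignore_index=-100):
    # Sketch: normalize the loss by the number of unmasked tokens summed over
    # *all* ranks and *all* accumulation steps, instead of averaging per micro-batch.
    running_loss = torch.tensor(0.0, device="cuda")
    num_tokens = torch.tensor(0.0, device="cuda")
    for batch in micro_batches:
        labels = batch["labels"]
        logits = model(batch["tokens"])
        # Sum (not mean) the per-token loss so it can be normalized globally later.
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            labels.view(-1),
            ignore_index=ignore_index,
            reduction="sum",
        )
        loss.backward()
        running_loss += loss.detach()
        num_tokens += (labels != ignore_index).sum()
    # Count unmasked tokens across every rank, not just the local one.
    dist.all_reduce(num_tokens)
    dist.all_reduce(running_loss)
    # Assumes gradients are summed across ranks; if your wrapper averages them
    # (e.g. DDP's default), the scale factor needs an extra world_size term.
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(num_tokens)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return (running_loss / num_tokens).item()
```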

Tokens per second:

  • full_distributed_finetune.py computes tokens per second (tps) by counting unmasked tokens only.
  • lora_dpo_distributed.py computes tps by counting all tokens (see the sketch below).
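For comparison, a toy illustration of the two tps conventions (hypothetical helper names, assuming labels are padded with ignore_index = -100):

```python
def tokens_per_second_unmasked(labels, step_time, ignore_index=-100):
    # Only count tokens that actually contribute to the loss.
    return (labels != ignore_index).sum().item() / step_time

def tokens_per_second_all(labels, step_time):
    # Count every token in the batch, padding included, which inflates tps
    # whenever batches are heavily padded.
    return labels.numel() / step_time
```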

There could be more, but these are the main ones that jumped out.

It would be awesome for torchtune to have more uniformity across the recipes, at least for the key ones that are heavily used (SFT, DPO, maybe PPO). Right now it can get a bit confusing to switch recipes and see metrics you don't expect.

@felipemello1 (Contributor) commented Jan 29, 2025

Hey @EugenHotaj, 100% agreed. As we added new functionality, it became harder to keep all recipes fully aligned.

One of our top priorities for this year is to reduce complexity in the recipes and make them easier to maintain. One way to help with that is to modularize some functionality into utilities. Another is to increase test coverage so the burden of manual testing is reduced (e.g. across combinations of quantization, LoRA, distributed, compile, performance flags, packed datasets, checkpointing, etc.). We also plan on investing more in RL this year.

For now, please continue raising tickets / submitting PRs as you find issues like that. Thanks for all of your contributions!

felipemello1 added the bug, best practice, better engineering, and triaged labels on Jan 29, 2025
@SalmanMohammadi (Collaborator)

@EugenHotaj Thanks for raising this. There are a few more things I've noticed that are out of date in the LoRA DPO distributed recipe. If you'd be up for putting up a PR for this, I'm happy to review - I found I could catch a bunch of these just by looking at the diff between the SFT LoRA distributed recipe and the DPO LoRA recipe (and crossing my eyes real hard until the diff merged into the matrix). Unfortunately, I won't have time to put up a fix this week.

@EugenHotaj (Contributor, Author)

@SalmanMohammadi no worries, it's not urgent; I just wanted to raise awareness of the issue and hope it can get prioritized in the future 🙂
