Differing component implementation logic across recipes #2307
Labels
- best practice: Things we should be doing but aren't
- better engineering: Tasks which help improve eng productivity, e.g. building tools, cleaning up code, writing docs
- bug: Something isn't working
- triaged: This issue has been assigned an owner and appropriate label
I've noticed that torchtune has different implementations for some components across recipes. A couple of examples:
- Gradient accumulation (sketched below):
  - `full_distributed_finetune.py` correctly computes the loss across ranks / gradient accumulation steps (here).
  - `lora_dpo_distributed.py` (and Full DPO Distributed #2275) uses the "old" but incorrect implementation (here).
- Tokens per second (sketched below):
  - `full_distributed_finetune.py` computes `tps` by considering unmasked tokens only.
  - `lora_dpo_distributed.py` computes `tps` by considering all tokens.

There could be more, but these are the main ones that jumped out.
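For illustration, here is a minimal sketch of the "correct" gradient accumulation normalization: keep the loss as a sum over tokens, count the unmasked tokens across all micro-batches and ranks, and normalize exactly once before the optimizer step. This is not torchtune's actual code; names like `grad_accum_step`, `loss_fn`, and `ignore_index` are placeholders, and `loss_fn` is assumed to be a sum-reduction cross entropy.

```python
# Sketch only: assumes torch.distributed is initialized and that
# `loss_fn` is e.g. nn.CrossEntropyLoss(reduction="sum", ignore_index=-100).
import torch
import torch.distributed as dist


def grad_accum_step(model, optimizer, loss_fn, micro_batches, ignore_index=-100):
    """One optimizer step over several micro-batches.

    The loss is accumulated as a *sum* over tokens and normalized exactly
    once, by the total number of unmasked tokens across all micro-batches
    and all ranks. That makes the effective loss invariant to the gradient
    accumulation factor and the world size.
    """
    device = next(model.parameters()).device
    num_tokens = torch.tensor(0.0, device=device)
    for batch in micro_batches:
        logits = model(batch["tokens"])
        labels = batch["labels"]
        # Sum of per-token losses; tokens equal to ignore_index contribute 0.
        loss = loss_fn(logits.flatten(0, 1), labels.flatten())
        num_tokens += (labels != ignore_index).sum()
        loss.backward()
    # Total unmasked tokens across all data-parallel ranks.
    dist.all_reduce(num_tokens)
    # DDP/FSDP average gradients over ranks, so the gradients currently hold
    # (1 / world_size) * d(sum of token losses)/d(params). Rescale so the
    # result is a true mean over all unmasked tokens.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(world_size / num_tokens)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```

The "old" version normalizes each micro-batch's loss by its own token count (a mean of means), which biases updates whenever micro-batches contain different numbers of unmasked tokens.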
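And a rough sketch of the tokens-per-second convention used by `full_distributed_finetune.py`, where only tokens that contribute to the loss are counted. The function name and the `ignore_index` default are assumptions for illustration.

```python
import torch


def tokens_per_second(labels: torch.Tensor, step_time_s: float, ignore_index: int = -100) -> float:
    # Count only loss-bearing tokens; padding / prompt tokens marked with
    # ignore_index are excluded. Counting labels.numel() instead (all tokens)
    # is what makes the DPO recipes report a higher, non-comparable tps.
    num_unmasked = (labels != ignore_index).sum().item()
    return num_unmasked / step_time_s
```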
It would be awesome for torchtune to have more uniformity across recipes, at least the key, highly used ones (SFT, DPO, maybe PPO). Right now it can get confusing to switch recipes and see metrics you don't expect.