Differing component implementation logic across recipes #2307
Labels
- best practice: Things we should be doing but aren't
- better engineering: Tasks which help improve eng productivity, e.g. building tools, cleaning up code, writing docs
- bug: Something isn't working
- triaged: This issue has been assigned an owner and appropriate label
I've noticed that torchtune has different implementations for some components across recipes. A couple of examples:
- Gradient accumulation (sketched below):
  - `full_distributed_finetune.py` correctly computes the loss across ranks / gradient accumulation steps (here).
  - `lora_dpo_distributed.py` (and Full DPO Distributed #2275) uses the "old" but incorrect implementation (here).
- Tokens per second (sketched below):
  - `full_distributed_finetune.py` computes `tps` by considering unmasked tokens only.
  - `lora_dpo_distributed.py` computes `tps` by considering all tokens.

There could be more, but these are the main ones that jumped out.
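For illustration, here is a minimal sketch of the "correct" gradient accumulation normalization: keep the loss as a sum over tokens, count the unmasked tokens across all micro-batches and ranks, and normalize exactly once before the optimizer step. This is not torchtune's actual code; names like `grad_accum_step`, `loss_fn`, and `ignore_index` are placeholders, and `loss_fn` is assumed to be a sum-reduction cross entropy.

```python
# Sketch only: assumes torch.distributed is initialized and that
# `loss_fn` is e.g. nn.CrossEntropyLoss(reduction="sum", ignore_index=-100).
import torch
import torch.distributed as dist


def grad_accum_step(model, optimizer, loss_fn, micro_batches, ignore_index=-100):
    """One optimizer step over several micro-batches.

    The loss is accumulated as a *sum* over tokens and normalized exactly
    once, by the total number of unmasked tokens across all micro-batches
    and all ranks. That makes the effective loss invariant to the gradient
    accumulation factor and the world size.
    """
    device = next(model.parameters()).device
    num_tokens = torch.tensor(0.0, device=device)
    for batch in micro_batches:
        logits = model(batch["tokens"])
        labels = batch["labels"]
        # Sum of per-token losses; tokens equal to ignore_index contribute 0.
        loss = loss_fn(logits.flatten(0, 1), labels.flatten())
        num_tokens += (labels != ignore_index).sum()
        loss.backward()
    # Total unmasked tokens across all data-parallel ranks.
    dist.all_reduce(num_tokens)
    # DDP/FSDP average gradients over ranks, so the gradients currently hold
    # (1 / world_size) * d(sum of token losses)/d(params). Rescale so the
    # result is a true mean over all unmasked tokens.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(world_size / num_tokens)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```

The "old" version normalizes each micro-batch's loss by its own token count (a mean of means), which biases updates whenever micro-batches contain different numbers of unmasked tokens.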
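And a rough sketch of the tokens-per-second convention used by `full_distributed_finetune.py`, where only tokens that contribute to the loss are counted. The function name and the `ignore_index` default are assumptions for illustration.

```python
import torch


def tokens_per_second(labels: torch.Tensor, step_time_s: float, ignore_index: int = -100) -> float:
    # Count only loss-bearing tokens; padding / prompt tokens marked with
    # ignore_index are excluded. Counting labels.numel() instead (all tokens)
    # is what makes the DPO recipes report a higher, non-comparable tps.
    num_unmasked = (labels != ignore_index).sum().item()
    return num_unmasked / step_time_s
```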
It would be awesome for torchtune to have more uniformity across recipes, at least the key, highly used ones (SFT, DPO, maybe PPO). Right now it can get confusing to switch recipes and see metrics you don't expect.