Loss gen tokens #3677

dakinggg · 2024-10-22T22:05:11Z

What does this PR do?

This PR distinguishes loss-generating tokens from total tokens by allowing get_num_tokens_in_batch to return a dict with total and loss_generating keys. It only has an effect if you are using the accumulate_batch_on_tokens flag. Existing functionality is neither broken nor changed: if get_num_tokens_in_batch returns a single number, as it does by default, we accumulate using that value. Behavior changes only if the user changes their get_num_tokens_in_batch to return a dictionary.
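For illustration, here is a minimal sketch of a token-counting function that returns the dict form described above. Only the function name get_num_tokens_in_batch and the total and loss_generating keys come from this PR. The batch layout (input_ids and labels as lists of lists), the ignore index of -100, and the helper tokens_for_accumulation are assumptions made for the example and are not Composer's actual code.

```python
from typing import Dict, List, Union

# Assumption: labels equal to IGNORE_INDEX do not contribute to the loss.
IGNORE_INDEX = -100

Batch = Dict[str, List[List[int]]]


def get_num_tokens_in_batch(batch: Batch) -> Dict[str, int]:
    """Count all tokens and, separately, the tokens that generate loss."""
    total = sum(len(seq) for seq in batch["input_ids"])
    loss_generating = sum(
        1 for seq in batch["labels"] for label in seq if label != IGNORE_INDEX
    )
    return {"total": total, "loss_generating": loss_generating}


def tokens_for_accumulation(num_tokens: Union[int, Dict[str, int]]) -> int:
    """Hypothetical helper: pick the count used to weight a microbatch.

    A plain number is used as-is (the unchanged default behavior).
    For the dict form, this sketch assumes the loss-generating count is used.
    """
    if isinstance(num_tokens, dict):
        return num_tokens["loss_generating"]
    return num_tokens


example_batch = {
    "input_ids": [[11, 12, 13], [21, 22, 23]],
    "labels": [[IGNORE_INDEX, 12, 13], [IGNORE_INDEX, IGNORE_INDEX, 23]],
}
counts = get_num_tokens_in_batch(example_batch)
```

In this sketch, a function that keeps returning a single integer passes through tokens_for_accumulation untouched, which mirrors the backward-compatibility claim above.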

Related PR in LLM Foundry: mosaicml/llm-foundry#1610

Manual testing (using Foundry draft PR):

Before, with all tokens loss-generating: microbatching is deterministic.

Before, with not all tokens loss-generating: microbatching is not deterministic.

After, with not all tokens loss-generating: microbatching is deterministic.
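A toy calculation can show why the weighting matters. It reads "deterministic" as "the accumulated loss does not depend on how the batch is split into microbatches", which is one plausible interpretation of the results above. The per-token losses, the use of None for tokens that generate no loss, and the function names are all invented for the example. This is not Composer's implementation.

```python
from typing import List, Optional

# A microbatch is a flat list of per-token losses.
# None marks a token that does not generate loss (e.g. padding or prompt).
Microbatch = List[Optional[float]]


def mean_loss(tokens: Microbatch) -> float:
    """Mean loss over the loss-generating tokens only."""
    losses = [t for t in tokens if t is not None]
    return sum(losses) / len(losses) if losses else 0.0


def accumulate(microbatches: List[Microbatch], weight_by: str) -> float:
    """Combine microbatch mean losses, weighting each by a token count."""

    def count(mb: Microbatch) -> int:
        if weight_by == "total":
            return len(mb)
        return sum(t is not None for t in mb)  # "loss_generating"

    denom = sum(count(mb) for mb in microbatches)
    return sum(mean_loss(mb) * count(mb) / denom for mb in microbatches)


batch: Microbatch = [1.0, 2.0, 3.0, None, 6.0, None, None, None]
split = [batch[:4], batch[4:]]  # two microbatches of four tokens each

full = mean_loss(batch)                          # no microbatching
by_total = accumulate(split, "total")            # weights 4/8 and 4/8
by_loss = accumulate(split, "loss_generating")   # weights 3/4 and 1/4
```

Here the full-batch loss is (1 + 2 + 3 + 6) / 4 = 3.0. Weighting the two microbatch means (2.0 and 6.0) by total tokens gives 4.0, which differs from the full-batch value, while weighting by loss-generating tokens gives 3.0 and matches it.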

mvpatel2000

Some concern on warning spam, otherwise fine

composer/core/data_spec.py

dakinggg added 10 commits October 16, 2024 10:40

should work (f138c1c)
fix (7a1c83f)
fix (d2a33f3)
fixes (3eda1c7)
fixes (9400f69)
pc and tests (b64d91a)
finish simplification (bfe347c)
try again (b0cdfe2)
fix tests (da67342)
fix test (2777f37)

dakinggg marked this pull request as ready for review October 23, 2024 03:37

dakinggg requested review from mvpatel2000 and irenedea October 23, 2024 03:37

Merge branch 'main' into loss-gen-tokens

e98d733

mvpatel2000 reviewed Oct 23, 2024

composer/core/data_spec.py: two review comments (outdated, resolved)

rm warning

0d4bb31

mvpatel2000 approved these changes Oct 23, 2024


dakinggg merged commit 5aaa8c9 into mosaicml:main Oct 23, 2024
14 checks passed

This was referenced Oct 24, 2024

Add loss generating token counts mosaicml/llm-foundry#1610

Merged

Change accumulate_train_batch_on_tokens default to True mosaicml/llm-foundry#1618

Merged
