Label Leakage in Gemma 2 Finetuning
System Info
transformers version: 4.42.1
Who can help?
@ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Loss curves
We initially found that the training loss quickly dropped to 0 when fine-tuning the Gemma 2 models.
This behavior strongly suggests label leakage.
Breakpoints
We changed the following lines of Gemma2DecoderLayer (transformers/src/transformers/models/gemma2/modeling_gemma2.py, lines 618 to 633 at commit 1c68f2c) to the form below.
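The modified snippet was lost in rendering, so here is a hedged sketch of the change: the sliding-window branch of Gemma2DecoderLayer.forward is paraphrased from the referenced lines (the exact code may differ), and the all-zero check with breakpoint() is the instrumentation we added.

```python
# Inside Gemma2DecoderLayer.forward (paraphrased excerpt, not verbatim).
if self.is_sliding and attention_mask is not None:
    # The additive causal mask (0.0 = attend, dtype-min = masked) is
    # multiplied by a 0/1 lower-triangular band shifted by the window size.
    attention_mask = attention_mask * torch.tril(
        torch.ones_like(attention_mask), diagonal=-self.sliding_window
    )
    if attention_mask.shape[-1] <= 1:  # decoding a single token
        attention_mask = attention_mask[:, :, :, -self.sliding_window:]
    # Our added instrumentation: stop whenever the mask comes out all zero,
    # i.e. no position is masked at all for this layer.
    if torch.all(attention_mask == 0):
        breakpoint()
```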
Then we executed the following code to trigger the breakpoint.
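The trigger snippet is also missing here; a minimal sketch follows, assuming the google/gemma-2-9b checkpoint and an arbitrary short prompt (both are placeholders for whatever was actually run):

```python
# Minimal sketch of a trigger script; model id and prompt are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)

# Any prompt far shorter than the 4096-token sliding window suffices.
inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    model(**inputs)  # reaches the instrumented branch in the sliding layers
```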
The result: the breakpoint fires, confirming that the attention mask passed to the sliding-window layers is all zeros.
Expected behavior
The attention mask should not be all zeros; otherwise, it will use full attention instead of causal attention when calculating the loss, leading to label leakage.
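To make the failure mode concrete, here is a self-contained sketch (our own construction, not library code) of why multiplying the additive mask by the shifted band zeroes everything out whenever the sequence is shorter than the sliding window:

```python
import torch

seq_len, sliding_window = 8, 4096  # any seq_len < sliding_window
min_dtype = torch.finfo(torch.float32).min

# Additive causal mask convention: 0.0 = attend, min_dtype = masked future.
causal_mask = torch.full((1, 1, seq_len, seq_len), min_dtype).triu(diagonal=1)

# A lower-triangular band shifted down by sliding_window is all zeros for
# short sequences, so the product wipes out the causal masking entirely.
band = torch.tril(torch.ones_like(causal_mask), diagonal=-sliding_window)
print(torch.all(causal_mask * band == 0))  # tensor(True) -> full attention
```

With an all-zero additive mask, every token attends to every other token, including the future tokens that carry the labels.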