Label Leakage in Gemma 2 Finetuning #31673

Closed
hiyouga opened this issue Jun 27, 2024 · 1 comment · Fixed by #31674
@hiyouga (Contributor) commented Jun 27, 2024

System Info

  • transformers version: 4.42.1
  • Platform: Linux-5.4.0-147-generic-x86_64-with-glibc2.31
  • Python version: 3.10.13
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.1
  • PyTorch version (GPU?): 2.3.0+cu121 (True)

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Loss curves

We initially found that the training loss quickly dropped to 0 when fine-tuning the gemma2 models.

[Figure: training loss curve, dropping rapidly to 0]

This behavior strongly suggests label leakage: if each position can attend to the token it is supposed to predict, the model can simply copy it, and the training loss collapses to zero.

Breakpoints

We changed the following lines of Gemma2DecoderLayer.forward, shown first as released and then with print statements added around the sliding-window mask computation.

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        cache_position: Optional[torch.LongTensor] = None,
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
        if self.is_sliding and attention_mask is not None:  # efficient SDPA and no padding
            attention_mask = attention_mask * torch.tril(
                torch.ones_like(attention_mask), diagonal=-self.sliding_window
            )
            if attention_mask.shape[1] <= 1:  # when decoding
                attention_mask = attention_mask[:, -self.sliding_window :]

        if self.is_sliding and attention_mask is not None:  # efficient SDPA and no padding
            print("before", attention_mask)
            attention_mask = attention_mask * torch.tril(
                torch.ones_like(attention_mask), diagonal=-self.sliding_window
            )
            print("after", attention_mask)
            if attention_mask.shape[1] <= 1:  # when decoding
                attention_mask = attention_mask[:, -self.sliding_window :]
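
For reference, the effect of this multiplication can be reproduced without loading the model at all. The sketch below is not part of transformers; the values are chosen to mirror the real case (Gemma 2's sliding window of 4096 and a 3-token sequence). It shows that torch.tril(torch.ones_like(mask), diagonal=-sliding_window) is all zeros whenever the sequence is shorter than the window, so the product erases the causal entries of the additive mask:

import torch

sliding_window = 4096  # Gemma 2's sliding window size
seq_len = 3            # any length <= sliding_window shows the same effect
min_value = torch.finfo(torch.bfloat16).min

# Additive causal mask: 0 where attention is allowed, the dtype minimum where it is not.
causal_mask = torch.full((1, 1, seq_len, seq_len), min_value, dtype=torch.bfloat16).triu(diagonal=1)
print("before", causal_mask)

# tril(..., diagonal=-sliding_window) keeps only positions more than sliding_window
# steps in the past; for a short sequence that is nothing at all, so the
# multiplication wipes out the causal entries as well.
sliding_mask = torch.tril(torch.ones_like(causal_mask), diagonal=-sliding_window)
print("after", causal_mask * sliding_mask)  # all zeros -> unrestricted attention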

Then we execute the following code to trigger these prints.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it", torch_dtype="auto", device_map="auto", attn_implementation="eager"
)
inputs = torch.tensor([[0, 1, 2]]).to(model.device)
labels = inputs.clone()
outputs = model(input_ids=inputs, labels=labels)

The result is:

before tensor([[[[ 0.0000e+00, -3.3895e+38, -3.3895e+38],
          [ 0.0000e+00,  0.0000e+00, -3.3895e+38],
          [ 0.0000e+00,  0.0000e+00,  0.0000e+00]]]], device='cuda:0',
       dtype=torch.bfloat16)
after tensor([[[[0., -0., -0.],
          [0., 0., -0.],
          [0., 0., 0.]]]], device='cuda:0', dtype=torch.bfloat16)
before tensor([[[[ 0.0000e+00, -3.3895e+38, -3.3895e+38],
          [ 0.0000e+00,  0.0000e+00, -3.3895e+38],
          [ 0.0000e+00,  0.0000e+00,  0.0000e+00]]]], device='cuda:0',
       dtype=torch.bfloat16)
after tensor([[[[0., -0., -0.],
          [0., 0., -0.],
          [0., 0., 0.]]]], device='cuda:0', dtype=torch.bfloat16)
...
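
The leakage can also be observed without editing the library source. As an additional check (not part of the original report), one can reuse the model and inputs defined above and request the attention probabilities, which eager attention exposes via output_attentions=True; any attention mass strictly above the diagonal means a token attends to a future position, which a causal mask should make impossible:

# Reuses `model` and `inputs` from the snippet above.
# outputs.attentions holds one (batch, heads, q_len, kv_len) tensor per layer.
outputs = model(input_ids=inputs, output_attentions=True)
for idx, attn in enumerate(outputs.attentions):
    future_mass = attn[0].triu(diagonal=1).sum().item()
    if future_mass > 0:  # expected only on the affected sliding-window layers
        print(f"layer {idx}: attention mass on future tokens = {future_mass:.4f}")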

Expected behavior

The attention mask should not be reduced to all zeros for sequences shorter than the sliding window. With an all-zero additive mask, the sliding-window layers apply full (bidirectional) attention instead of causal attention when computing the loss, so every position can see the label it is supposed to predict, which is label leakage.
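
For illustration only, here is a sketch of one possible repair (not necessarily the change made in #31674): apply the sliding-window restriction by masking out positions beyond the window with the dtype minimum, e.g. via torch.where, instead of multiplying, so the existing causal entries are preserved. The helper name apply_sliding_window is hypothetical:

import torch

def apply_sliding_window(attention_mask: torch.Tensor, sliding_window: int) -> torch.Tensor:
    # Hypothetical helper for illustration: positions more than sliding_window
    # steps in the past are set to the dtype minimum; everything else, including
    # the existing causal entries, is left untouched.
    min_value = torch.finfo(attention_mask.dtype).min
    outside_window = torch.tril(
        torch.ones_like(attention_mask, dtype=torch.bool), diagonal=-sliding_window
    )
    return torch.where(outside_window, torch.full_like(attention_mask, min_value), attention_mask)

# With the 3-token causal mask from the reproduction above, the upper triangle
# keeps its large negative values instead of collapsing to zero.
causal_mask = torch.full((1, 1, 3, 3), torch.finfo(torch.bfloat16).min, dtype=torch.bfloat16).triu(diagonal=1)
print(apply_sliding_window(causal_mask, sliding_window=4096))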

@ArthurZucker (Collaborator) commented:

Great catch, and sorry that it affected you!
