
Fine-tuning: loss is 0 when training with fp16, but normal with fp32 #1370

Open
zhangxiangchn opened this issue Feb 11, 2025 · 1 comment

Comments

zhangxiangchn commented Feb 11, 2025

Model

jina-embeddings-v2-base-zh

Fine-tuning reference

Modified from the scripts under FlagEmbedding/examples/finetune/embedder/encoder_only/

Fine-tuning command:

export WANDB_MODE=disabled
train_data="/xxx/xxx/data/finetune_data_score_v2.jsonl"
num_train_epochs=4
per_device_train_batch_size=256
num_gpus=2
if [ -z "$HF_HUB_CACHE" ]; then
export HF_HUB_CACHE="$HOME/.cache/huggingface/hub"
fi
model_args="
--model_name_or_path /SharedNFS/LLM_model/jina-embeddings-v2-base-zh
--cache_dir $HF_HUB_CACHE
--trust_remote_code True
"
data_args="
--train_data $train_data
--cache_path ~/.cache
--train_group_size 15
--query_max_len 32
--passage_max_len 32
--pad_to_multiple_of 8
--query_instruction_for_retrieval 'Represent this sentence for searching relevant passages: '
--query_instruction_format '{}{}'
--knowledge_distillation True
"
training_args="
--output_dir ./knowledge_distillation_agent_minedHN_score_test_encoder_only_base_jina-embeddings-v2-base-zh
--overwrite_output_dir
--learning_rate 1e-5
--fp16 \ # 唯一区别就是这里要不要指定fp16进行训练
--num_train_epochs $num_train_epochs
--per_device_train_batch_size $per_device_train_batch_size
--dataloader_drop_last True
--warmup_ratio 0.1
--gradient_checkpointing
--deepspeed ../../ds_stage0.json
--logging_steps 1
--save_steps 1000
--negatives_cross_device
--temperature 0.02
--sentence_pooling_method mean
--normalize_embeddings True
--kd_loss_type kl_div
"
cmd="torchrun --nproc_per_node $num_gpus
-m FlagEmbedding.finetune.embedder.encoder_only.base
$model_args
$data_args
$training_args
"
echo $cmd
eval $cmd
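
For context: with --temperature 0.02, cosine similarities in [-1, 1] are divided by 0.02, so the contrastive-loss logits reach roughly ±50, while fp16 tops out at ~65504 (already exp(12) overflows it). A minimal Python sketch of that dynamic-range problem (illustrative values only, assuming a recent PyTorch; this is not the actual FlagEmbedding loss code):

import torch

# Hypothetical cosine similarities, scaled by the same temperature as above.
sim = torch.tensor([0.9, 0.5, 0.1])
temperature = 0.02

logits_fp16 = sim.half() / temperature   # values up to 45 still fit in fp16...
logits_fp32 = sim.float() / temperature

print(logits_fp16.exp())  # tensor([inf, inf, ~148.4]): exp(45) overflows fp16's max of 65504
print(logits_fp32.exp())  # all finite: fp32's exponent range reaches ~3.4e38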

Log output (fp16):

/data/anaconda3/envs/c-mteb/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:632: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
return fn(*args, **kwargs)
0%| | 1/452 [00:04<32:47, 4.36s/it]

{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.01}
0%| | 1/452 [00:04<32:47, 4.36s/it]
0%| | 2/452 [00:07<26:18, 3.51s/it]

{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.02}
0%| | 2/452 [00:07<26:18, 3.51s/it]
1%| | 3/452 [00:10<24:00, 3.21s/it]

{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.03}
1%| | 3/452 [00:10<24:00, 3.21s/it]
1%| | 4/452 [00:13<25:15, 3.38s/it]

{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.04}
1%| | 4/452 [00:13<25:15, 3.38s/it]
1%| | 5/452 [00:16<24:40, 3.31s/it]

{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 0, 'epoch': 0.04}
1%| | 5/452 [00:16<24:40, 3.31s/it]
1%|▏ | 6/452 [00:19<23:28, 3.16s/it]

Log output (fp32):

/data/anaconda3/envs/c-mteb/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:632: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
return fn(*args, **kwargs)
0%| | 1/452 [00:07<59:15, 7.88s/it]

{'loss': 5.9738, 'learning_rate': 0.0, 'epoch': 0.01}
0%| | 1/452 [00:07<59:15, 7.88s/it]
0%| | 2/452 [00:14<52:21, 6.98s/it]

{'loss': 5.2496, 'learning_rate': 1.8104259678004022e-06, 'epoch': 0.02}
0%| | 2/452 [00:14<52:21, 6.98s/it]
1%| | 3/452 [00:20<50:08, 6.70s/it]

{'loss': 5.9845, 'learning_rate': 2.8694572692954448e-06, 'epoch': 0.03}
1%| | 3/452 [00:20<50:08, 6.70s/it]
1%| | 4/452 [00:28<54:47, 7.34s/it]

{'loss': 5.684, 'learning_rate': 3.6208519356008044e-06, 'epoch': 0.04}
1%| | 4/452 [00:28<54:47, 7.34s/it]
1%| | 5/452 [00:36<54:48, 7.36s/it]

{'loss': 5.5646, 'learning_rate': 4.203678918349396e-06, 'epoch': 0.04}
1%| | 5/452 [00:36<54:48, 7.36s/it]
1%|▏ | 6/452 [00:42<52:31, 7.07s/it]
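
A note on reading the fp16 log: loss 0.0 together with grad_norm 0.0 and a learning_rate stuck at 0 on every step is the typical signature of the fp16 dynamic loss scaler (DeepSpeed's here, given --deepspeed ds_stage0.json) detecting gradient overflow and skipping every optimizer step, rather than the model genuinely reaching zero loss. A hypothetical debugging helper (not part of FlagEmbedding) for locating where tensors first go non-finite:

import torch

def assert_finite(name: str, t: torch.Tensor) -> None:
    # Hypothetical helper: fail fast once a tensor has overflowed to inf/nan.
    if not torch.isfinite(t).all():
        raise FloatingPointError(f"{name} is non-finite (dtype={t.dtype})")

# Usage sketch: sprinkle calls inside the model's forward / loss computation,
# e.g. assert_finite("scores", scores) right after the temperature division.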

Also, I'd like to ask: roughly what loss value indicates the model has more or less converged?

zhangxiangchn (Author) commented

It was probably an overflow; training works normally with the bf16 format.
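
For reference, bf16 keeps fp32's 8-bit exponent (max ~3.4e38) and only shortens the mantissa, which is why it avoids the overflow fp16 hits. Before swapping --fp16 for --bf16 (the corresponding HF TrainingArguments flag), it is worth confirming hardware support, since bf16 needs an Ampere-or-newer NVIDIA GPU. A quick check, assuming a recent PyTorch:

import torch

# bf16 requires hardware support (NVIDIA Ampere / compute capability >= 8.0);
# on older GPUs it either errors out or falls back to slow emulation.
print(torch.cuda.is_bf16_supported())  # True -> safe to train with --bf16

Depending on how ../../ds_stage0.json is written, its fp16/bf16 sections may also need to match the flag passed to the trainer.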
