
How should the training performance metrics in the training log be analyzed? #6167

@tensorflowt

Description

Reminder

  • I have read the README and searched the existing issues.

System Info

I am currently training the llama3-8b model on an A800 machine (8 × 80 GB). My training log is as follows:
[screenshot of the training log]
The log shows only 40 training samples per second. How does this sample count correspond to tokens? If it means 40 tokens per second, that is very slow! But if it means cutoff_len*10=409600, that number seems far too large and doesn't feel right either. Could someone help analyze this?
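For context on the question above: the `train_samples_per_second` metric emitted by the Hugging Face Trainer counts training examples, not tokens, so a rough token throughput is samples/second multiplied by the average tokenized length of a sample (bounded above by `cutoff_len`). Below is a minimal sketch of that conversion, not taken from the issue itself; it assumes the tokenizer can be loaded from the model path in the config, and the two example prompts are purely illustrative.

```python
# Minimal sketch (assumptions noted in comments, not from the issue itself):
# convert the Trainer's train_samples_per_second into an approximate token rate.
from transformers import AutoTokenizer

# Assumption: the tokenizer resolves from the same path used in the config
# (it may instead need a local directory, depending on your model hub / cache).
tokenizer = AutoTokenizer.from_pretrained("LLM-Research/Meta-Llama-3-8B-Instruct")

# Illustrative prompts standing in for real training samples.
samples = [
    "Give three tips for staying healthy.",
    "Explain the difference between a list and a tuple in Python.",
]
avg_tokens = sum(len(tokenizer(s)["input_ids"]) for s in samples) / len(samples)

train_samples_per_second = 40  # the value reported in the log screenshot
approx_tokens_per_second = train_samples_per_second * avg_tokens
print(f"approx. throughput: {approx_tokens_per_second:.0f} tokens/s "
      f"(upper bound: {train_samples_per_second * 1024} tokens/s at cutoff_len=1024)")
```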

Reproduction

My configuration file is as follows:

```yaml
### model
model_name_or_path: LLM-Research/Meta-Llama-3-8B-Instruct
cache_dir: /worker

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /data/saves/llama3-8b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
```
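For reference, the config above implies an effective batch of `per_device_train_batch_size × gradient_accumulation_steps × number of GPUs` samples per optimizer step, which is what sits behind the samples/second number. The following back-of-the-envelope sketch is not from the issue; it assumes 8 GPUs (read from the "A800 8 × 80 GB" system info) and takes the 40 samples/s from the log at face value.

```python
# Back-of-the-envelope sketch; the GPU count is an assumption, everything else
# comes from the config above and the value reported in the log.
cutoff_len = 1024                   # from the config
per_device_train_batch_size = 8     # from the config
gradient_accumulation_steps = 8     # from the config
num_gpus = 8                        # assumption: 8 x A800 80GB

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
train_samples_per_second = 40       # reported in the log screenshot

seconds_per_optimizer_step = effective_batch / train_samples_per_second
max_tokens_per_second = train_samples_per_second * cutoff_len  # if every sample hit cutoff_len

print(f"effective batch size: {effective_batch} samples/step")            # 512
print(f"time per optimizer step: {seconds_per_optimizer_step:.1f} s")     # 12.8
print(f"upper-bound token throughput: {max_tokens_per_second} tokens/s")  # 40960
```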

Expected behavior

No response

Others

No response

Labels: wontfix (this will not be worked on)