Reminder
- I have read the README and searched the existing issues.
System Info
Hello, while trying to train on custom data, I hit the following error in the `preprocess_sp_dataset` function during dataset loading:
len(seq_ids) = 131072, world_size = 2
len(seq_ids) = 131072, world_size = 2
len(seq_ids) = 1, world_size = 2
['/mnt/dolphinfs/hdd_pool/docker/user/hadoop-basecv/zhongyufeng02/data/agent/MultiUI/v0.6_5M/crawled_data_vTraining8_pc1280/01986958_fortwaltonmetalrecycling.com/image_inputs_anyres/start0_0_1280_1418_crop.png']
len(seq_ids) = 2, world_size = 2
['/mnt/dolphinfs/hdd_pool/docker/user/hadoop-basecv/zhongyufeng02/data/agent/MultiUI/v0.6_5M/crawled_data_vTraining8_pc1280/08713441_www.raysahelian.com_water.html/image_inputs_anyres/start0_0_1280_1768_crop.png', '/mnt/dolphinfs/hdd_pool/docker/user/hadoop-basecv/zhongyufeng02/data/agent/MultiUI/v0.6_5M/crawled_data_vTraining8_pc1280/07875683_www.locktoncharity.org.uk_post_edinburg/image_inputs_anyres/start0_0_1280_1051_crop.png']
[rank1]: multiprocess.pool.RemoteTraceback:
[rank1]: """
[rank1]: Traceback (most recent call last):
[rank1]: File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/software/anaconda3/envs/llama_factory_qwen25_ring/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
[rank1]: result = (True, func(*args, **kwds))
[rank1]: File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/software/anaconda3/envs/llama_factory_qwen25_ring/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
[rank1]: for i, result in enumerate(func(**kwargs)):
[rank1]: File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/software/anaconda3/envs/llama_factory_qwen25_ring/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3458, in _map_single
[rank1]: batch = apply_function_on_filtered_inputs(
[rank1]: File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/software/anaconda3/envs/llama_factory_qwen25_ring/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3320, in apply_function_on_filtered_inputs
[rank1]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank1]: File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/code/agent/llama_factory_sequence_parallel/360-LLaMA-Factory/src/llamafactory/data/processors/sequence_parallel.py", line 46, in sp_split
[rank1]: preprocess_sp_dataset(row, model_args.sequence_parallel_size, model_args.sequence_parallel_mode)
[rank1]: File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/code/agent/llama_factory_sequence_parallel/360-LLaMA-Factory/src/llamafactory/data/data_utils.py", line 103, in preprocess_sp_dataset
[rank1]: value_chunks = [seq_ids[s : s + step] for s in range(0, len(seq_ids), step)]
[rank1]: ValueError: range() arg 3 must not be zero
[rank1]: """
The exact location of the error is shown below; to make debugging easier, `seq_ids` and related info have been printed in the log above:
```python
def preprocess_sp_dataset(seq_ids, world_size, sequence_parallel_mode):
    if sequence_parallel_mode == "zigzag-ring":
        step = len(seq_ids) // (2 * world_size)
        value_chunks = [seq_ids[s : s + step] for s in range(0, len(seq_ids), step)]
        ...
```
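For context, the failing calls are the ones in the log where `len(seq_ids)` is 1 or 2: since `step = len(seq_ids) // (2 * world_size)` uses floor division, any list shorter than `2 * world_size` (here 4) yields `step = 0`, and `range()` rejects a zero step. (The lists printed for those calls are image paths, which suggests the splitter is also being applied to the `images` column, whose length is the number of images rather than the token count.) Below is a minimal standalone sketch of the arithmetic; the guard is hypothetical and not part of the repository code:

```python
def split_zigzag_ring(seq_ids, world_size):
    """Sketch mirroring preprocess_sp_dataset's zigzag-ring chunking."""
    step = len(seq_ids) // (2 * world_size)  # floor division: 2 // 4 == 0
    if step == 0:
        # Hypothetical guard: a list needs at least 2 * world_size elements
        # to form zigzag chunks; shorter inputs would have to be padded or
        # skipped before reaching this point.
        raise ValueError(f"len(seq_ids)={len(seq_ids)} < 2 * world_size = {2 * world_size}")
    return [seq_ids[s : s + step] for s in range(0, len(seq_ids), step)]

print(len(split_zigzag_ring(list(range(131072)), world_size=2)))  # 4 chunks of 32768
split_zigzag_ring(["/aaa/bbb.png"], world_size=2)  # raises: step = 1 // 4 = 0
```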
The dataset I am using is in the OpenAI format, structured as follows:
```json
[
  {
    "messages": [{"role": "user", "content": "<image>xxx"}, {"role": "assistant", "content": "xxx"}],
    "images": ["/aaa/bbb.png"]
  },
  ...
]
```

What is the cause of this problem, and how should it be resolved?
Reproduction
NNODES=1
GPUS_PER_NODE=4
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=12345
torchrun --nproc_per_node="${GPUS_PER_NODE}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" src/train.py \
--deepspeed $DS_CONFIG_PATH \
--flash_attn fa2 \
--sequence_parallel_size 2 \
--sequence_parallel_mode zigzag-ring \
--stage sft \
--do_train \
--model_name_or_path ${PRETRAINED} \
--dataset $DATA \
--dataset_dir ${DATASET_DIR} \
--template qwen2_vl_orig \
--finetuning_type full \
--output_dir ${output_root}/${exp_name} \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--per_device_train_batch_size ${per_device_train_batch_size} \
--gradient_accumulation_steps 1 \
--ddp_timeout 500000 \
--lr_scheduler_type cosine \
--logging_steps 1 \
--cutoff_len 131072 \
--save_steps 4000 \
--save_total_limit 10 \
--plot_loss \
--overwrite_cache \
--num_train_epochs ${num_train_epochs} \
--bf16 \
--preprocessing_num_workers 8 \
--preprocessing_batch_size 8 \
--packing True \
--tf32 True \
--cache_dir /home/user/360-LLaMA_Factory/exps/cache
Expected behavior
No response
Others
No response