
Error when loading the dataset #76

@notFoundThisPerson

Description

Reminder

  • I have read the README and searched the existing issues.

System Info

Hello, when I try to train with custom data, the `preprocess_sp_dataset` function raises the following error during dataset loading:

len(seq_ids) = 131072, world_size = 2
len(seq_ids) = 131072, world_size = 2
len(seq_ids) = 1, world_size = 2
['/mnt/dolphinfs/hdd_pool/docker/user/hadoop-basecv/zhongyufeng02/data/agent/MultiUI/v0.6_5M/crawled_data_vTraining8_pc1280/01986958_fortwaltonmetalrecycling.com/image_inputs_anyres/start0_0_1280_1418_crop.png']
len(seq_ids) = 2, world_size = 2
['/mnt/dolphinfs/hdd_pool/docker/user/hadoop-basecv/zhongyufeng02/data/agent/MultiUI/v0.6_5M/crawled_data_vTraining8_pc1280/08713441_www.raysahelian.com_water.html/image_inputs_anyres/start0_0_1280_1768_crop.png', '/mnt/dolphinfs/hdd_pool/docker/user/hadoop-basecv/zhongyufeng02/data/agent/MultiUI/v0.6_5M/crawled_data_vTraining8_pc1280/07875683_www.locktoncharity.org.uk_post_edinburg/image_inputs_anyres/start0_0_1280_1051_crop.png']

[rank1]: multiprocess.pool.RemoteTraceback: 
[rank1]: """
[rank1]: Traceback (most recent call last):
[rank1]:   File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/software/anaconda3/envs/llama_factory_qwen25_ring/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
[rank1]:     result = (True, func(*args, **kwds))
[rank1]:   File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/software/anaconda3/envs/llama_factory_qwen25_ring/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
[rank1]:     for i, result in enumerate(func(**kwargs)):
[rank1]:   File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/software/anaconda3/envs/llama_factory_qwen25_ring/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3458, in _map_single
[rank1]:     batch = apply_function_on_filtered_inputs(
[rank1]:   File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/software/anaconda3/envs/llama_factory_qwen25_ring/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3320, in apply_function_on_filtered_inputs
[rank1]:     processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank1]:   File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/code/agent/llama_factory_sequence_parallel/360-LLaMA-Factory/src/llamafactory/data/processors/sequence_parallel.py", line 46, in sp_split
[rank1]:     preprocess_sp_dataset(row, model_args.sequence_parallel_size, model_args.sequence_parallel_mode)
[rank1]:   File "/mnt/dolphinfs/ssd_pool/docker/user/hadoop-basecv-hl/hadoop-basecv/user/zhengliming04/code/agent/llama_factory_sequence_parallel/360-LLaMA-Factory/src/llamafactory/data/data_utils.py", line 103, in preprocess_sp_dataset
[rank1]:     value_chunks = [seq_ids[s : s + step] for s in range(0, len(seq_ids), step)]
[rank1]: ValueError: range() arg 3 must not be zero
[rank1]: """

The exact location of the error is shown below; to help with debugging, `seq_ids` and related information have been printed in the log above:

def preprocess_sp_dataset(seq_ids, world_size, sequence_parallel_mode):
    if sequence_parallel_mode == "zigzag-ring":
        step = len(seq_ids) // (2 * world_size)  # integer division: step == 0 when len(seq_ids) < 2 * world_size
        value_chunks = [seq_ids[s : s + step] for s in range(0, len(seq_ids), step)]  # range() rejects step == 0
        ...
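Isolating the split logic makes the failure easy to reproduce: whenever a packed sequence is shorter than `2 * world_size`, the integer division yields `step == 0`. A standalone sketch of the quoted logic (not the actual library function):

```python
# Standalone sketch of the zigzag-ring split quoted above (not the actual
# library code), to demonstrate why short sequences fail.
def split_zigzag(seq_ids, world_size):
    step = len(seq_ids) // (2 * world_size)  # 0 when len(seq_ids) < 2 * world_size
    return [seq_ids[s : s + step] for s in range(0, len(seq_ids), step)]

print(split_zigzag(list(range(8)), 2))  # [[0, 1], [2, 3], [4, 5], [6, 7]]

try:
    split_zigzag([42], 2)  # len(seq_ids) == 1 < 4, so step == 0
except ValueError as e:
    print(e)  # range() arg 3 must not be zero
```

This matches the log above: the two entries with `len(seq_ids) = 1` and `len(seq_ids) = 2` are both shorter than `2 * world_size = 4`.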

The dataset I am using is in the OpenAI format, structured as follows:

[
  {
    "messages": [{"role": "user", "content": "<image>xxx"}, {"role": "assistant", "content": "xxx"}],
    "images": ["/aaa/bbb.png"]
  },
  ...
]

What is the cause of this problem, and how should it be resolved?
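As an untested guess at a workaround, right-padding every packed sequence up to a multiple of `2 * world_size` before splitting would avoid the zero step. `pad_for_zigzag` and `pad_id` below are placeholders of my own, not names the project actually exposes:

```python
# Hypothetical workaround sketch: right-pad a token id list so its length is
# a multiple of 2 * world_size before the zigzag split. pad_id is a placeholder.
def pad_for_zigzag(seq_ids, world_size, pad_id=0):
    multiple = 2 * world_size
    remainder = len(seq_ids) % multiple
    if remainder:
        seq_ids = seq_ids + [pad_id] * (multiple - remainder)
    return seq_ids

print(len(pad_for_zigzag([42], 2)))  # 4, so step becomes 1 instead of 0
```

Is something like this the intended fix, or should short samples be filtered or merged during packing instead?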

Reproduction

NNODES=1
GPUS_PER_NODE=4
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=12345
torchrun --nproc_per_node="${GPUS_PER_NODE}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" src/train.py \
    --deepspeed $DS_CONFIG_PATH \
    --flash_attn fa2 \
    --sequence_parallel_size 2 \
    --sequence_parallel_mode zigzag-ring \
    --stage sft \
    --do_train \
    --model_name_or_path ${PRETRAINED} \
    --dataset $DATA \
    --dataset_dir ${DATASET_DIR} \
    --template qwen2_vl_orig \
    --finetuning_type full \
    --output_dir ${output_root}/${exp_name} \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --gradient_accumulation_steps 1 \
    --ddp_timeout 500000 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --cutoff_len 131072 \
    --save_steps 4000 \
    --save_total_limit 10 \
    --plot_loss \
    --overwrite_cache \
    --num_train_epochs ${num_train_epochs} \
    --bf16 \
    --preprocessing_num_workers 8 \
    --preprocessing_batch_size 8 \
    --packing True \
    --tf32 True \
    --cache_dir /home/user/360-LLaMA_Factory/exps/cache 

Expected behavior

No response

Others

No response
