Reminder
System Info
With streaming loading enabled, training fails with: TypeError: IterableDataset.map() got an unexpected keyword argument 'num_proc'
The problem is here: https://github.com/Qihoo360/360-LLaMA-Factory/blob/3bc07289eefcf8c8ea05f553e4ef0b82008419e4/src/llamafactory/data/loader.py#L224.
After checking the Datasets library, the map function of IterableDataset does not accept the three arguments passed in kwargs:
kwargs = dict(
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=(not data_args.overwrite_cache) or (training_args.local_process_index != 0),
    desc="Running sequence parallel split on dataset",
)
Reproduction
Enabling streaming loading is enough to trigger it: --streaming True
Expected behavior
The map function of a regular Dataset accepts these arguments:

The map function of IterableDataset, used for streaming loading:

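Fix: add an extra check in _get_sequence_parallel_dataset so that these arguments are passed only when streaming is disabled. With this change it runs fine locally for me.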
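A minimal sketch of the kind of check described above, not the actual patch. The helper name build_map_kwargs and its parameters are hypothetical. It assumes the streaming flag is available where the kwargs are built, and it mirrors the kwargs shown earlier in this report:

```python
from typing import Any, Dict, Optional


def build_map_kwargs(
    streaming: bool,
    preprocessing_num_workers: Optional[int],
    overwrite_cache: bool,
    local_process_index: int,
) -> Dict[str, Any]:
    """Return the extra keyword arguments for .map() (hypothetical helper)."""
    if streaming:
        # IterableDataset.map() rejects num_proc, load_from_cache_file and desc,
        # so pass none of them when streaming.
        return {}
    # Regular Dataset.map() accepts all three arguments.
    return dict(
        num_proc=preprocessing_num_workers,
        load_from_cache_file=(not overwrite_cache) or (local_process_index != 0),
        desc="Running sequence parallel split on dataset",
    )
```

Inside _get_sequence_parallel_dataset the result would then be unpacked into the existing map call with **kwargs. Where the streaming flag comes from in that function is an assumption here, since the report does not show the function's signature.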
Others
No response