How to use load_dataset to split data #3308

ex-yanminmin001 · 2025-02-27T08:53:00Z

Before training a model I want to split my data, and the repo has a function that can do this:
def load_dataset(
    datasets: Union[List[str], str],
    *,
    split_dataset_ratio: float = 0.,
    seed: Union[int, np.random.RandomState, None] = None,
    num_proc: int = 1,
    streaming: bool = False,
    use_hf: Optional[bool] = None,
    hub_token: Optional[str] = None,
    strict: bool = False,
    download_mode: Literal['force_redownload', 'reuse_dataset_if_exists'] = 'reuse_dataset_if_exists',
    columns: Optional[Dict[str, str]] = None,
    remove_unused_columns: bool = True,
    # self-cognition
    model_name: Union[Tuple[str, str], List[str], None] = None,  # zh, en
    model_author: Union[Tuple[str, str], List[str], None] = None,
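Because I am using a custom dataset, the load step here uses AutoPreprocessor(), so I cannot apply my own preprocess function.
My local datasets also change dynamically, so how can I register datasets dynamically?
Any guidance would be appreciated.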
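
To make the question concrete, here is a minimal, library-independent sketch of the behaviour being asked for: read a local JSONL file at call time, apply a custom preprocess function, then do a seeded train/validation split by ratio. This is not the repo's `load_dataset` implementation. The names `my_preprocess` and `load_and_split`, and the `question`/`answer` column names, are hypothetical and chosen only for illustration; only `split_ratio` and `seed` are meant to mirror `split_dataset_ratio` and `seed` from the signature above.

```python
import json
import random
from pathlib import Path
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]


def my_preprocess(row: Record) -> Record:
    """Hypothetical custom preprocessing: rename raw columns and trim whitespace."""
    return {"query": row["question"].strip(), "response": row["answer"].strip()}


def load_and_split(
    path: str,
    preprocess: Callable[[Record], Record],
    split_ratio: float = 0.0,
    seed: Optional[int] = None,
) -> Tuple[List[Record], List[Record]]:
    """Load a JSONL file, preprocess every row, and split into (train, val).

    The file is read on every call, so a dataset that changes on disk is
    picked up without any separate registration step (an assumption of this
    sketch, not a statement about the library).
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    rows = [preprocess(json.loads(line)) for line in lines if line.strip()]

    # Shuffle indices with a private RNG so the split is reproducible per seed.
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)

    n_val = int(len(rows) * split_ratio)
    val = [rows[i] for i in indices[:n_val]]
    train = [rows[i] for i in indices[n_val:]]
    return train, val
```

Calling `load_and_split("data.jsonl", my_preprocess, split_ratio=0.1, seed=42)` would return a train list and a validation list; with `split_ratio=0.0` the validation list is empty.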
