How to split data using load_dataset #3308

Open
ex-yanminmin001 opened this issue Feb 27, 2025 · 0 comments

Before training the model I want to split my data. The repo has a function that can do this:
def load_dataset(
    datasets: Union[List[str], str],
    *,
    split_dataset_ratio: float = 0.,
    seed: Union[int, np.random.RandomState, None] = None,
    num_proc: int = 1,
    streaming: bool = False,
    use_hf: Optional[bool] = None,
    hub_token: Optional[str] = None,
    strict: bool = False,
    download_mode: Literal['force_redownload', 'reuse_dataset_if_exists'] = 'reuse_dataset_if_exists',
    columns: Optional[Dict[str, str]] = None,
    remove_unused_columns: bool = True,
    # self-cognition
    model_name: Union[Tuple[str, str], List[str], None] = None,  # zh, en
    model_author: Union[Tuple[str, str], List[str], None] = None,
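Because I am using a custom dataset, the load here goes through AutoPreprocessor(), so I cannot use my own custom preprocess.
My local dataset also changes dynamically, so how can I register a dataset dynamically?
Any guidance would be appreciated.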
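
To make the desired behavior concrete, here is a minimal standalone sketch of the split, not the repo's implementation. It assumes `split_dataset_ratio` means the fraction of rows held out for validation and that `seed` makes the shuffle reproducible. `split_by_ratio`, `my_preprocess`, and the `question`/`answer` columns are hypothetical names used only for illustration.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, str]


def my_preprocess(row: Row) -> Row:
    """Hypothetical custom preprocess: rename local columns and trim whitespace."""
    return {"query": row["question"].strip(), "response": row["answer"].strip()}


def split_by_ratio(
    rows: List[Row],
    split_dataset_ratio: float = 0.0,
    seed: Optional[int] = None,
    preprocess: Optional[Callable[[Row], Row]] = None,
) -> Tuple[List[Row], List[Row]]:
    """Return (train, val); val holds roughly split_dataset_ratio of the rows.

    Assumption: the ratio is the validation fraction. This is a sketch,
    not the repo's load_dataset.
    """
    if not 0.0 <= split_dataset_ratio <= 1.0:
        raise ValueError("split_dataset_ratio must be in [0, 1]")
    if preprocess is not None:
        # Apply the custom preprocess before splitting.
        rows = [preprocess(r) for r in rows]
    indices = list(range(len(rows)))
    # A local RNG keeps the shuffle reproducible without touching global state.
    random.Random(seed).shuffle(indices)
    n_val = int(len(rows) * split_dataset_ratio)
    val = [rows[i] for i in indices[:n_val]]
    train = [rows[i] for i in indices[n_val:]]
    return train, val


# Rows as they might look after reading a local file that changes between runs.
rows = [{"question": f" q{i} ", "answer": f" a{i} "} for i in range(10)]
train, val = split_by_ratio(
    rows, split_dataset_ratio=0.2, seed=42, preprocess=my_preprocess
)
print(len(train), len(val))  # 8 2
```

Because the rows are read and preprocessed at call time, nothing has to be registered ahead of time in this sketch. Whether the repo's `load_dataset` can be made to behave the same way is exactly the open question.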
