Pickling tokenizer fails due to builtins.CoreBPE #231
+1. Fails with this error when used in a multiprocessing context, e.g.:

train_dataset = train_dataset.map(
    tokenization_function,
    batched=True,
    batch_size=1000,
    num_proc=args.num_proc,
    load_from_cache_file=not args.overwrite_cache,
    desc=f"Running tokenizer on train dataset with {len(train_dataset)} items",
)
Related issue: #181
I've allowed Encoding to be pickled in tiktoken 0.6. Please let me know if the implementation doesn't work well for you!
@hauntsaninja Works for me. Thanks!
I am using tiktoken in a dataset preprocessing step for a PyTorch DataLoader. DataLoader supports multiprocessing when creating batches, which spawns worker processes. This fails with the exception:
TypeError: cannot pickle 'builtins.CoreBPE' object
I am not familiar with Rust, but this thread seems to suggest that a few methods in the Rust implementation would enable pickling the tokenizer.