Pickling tokenizer fails due to builtins.CoreBPE #231
+1. Fails with this error when used in a multiprocessing context, e.g.:

train_dataset = train_dataset.map(
    tokenization_function,
    batched=True,
    batch_size=1000,
    num_proc=args.num_proc,
    load_from_cache_file=not args.overwrite_cache,
    desc=f"Running tokenizer on train dataset with {len(train_dataset)} items",
)
Related issue: #181
I've allowed Encoding to be pickled in tiktoken 0.6. Please let me know if the implementation doesn't work well for you!
@hauntsaninja Works for me. Thanks!
I am using tiktoken in a dataset preprocessing step for a PyTorch DataLoader. DataLoader supports multiprocessing when creating batches, which spawns worker processes. This fails with the exception:
TypeError: cannot pickle 'builtins.CoreBPE' object
I am not familiar with Rust, but this thread seems to suggest that a few methods in the Rust implementation would enable pickling the tokenizer.