Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenization sanitizer script #242

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Tokenization sanitizer script #242

wants to merge 4 commits into from

Conversation

soldni
Copy link
Member

@soldni soldni commented Feb 19, 2025

This PR adds a sanitization utility to escape special tokens while still allowing subsequent text transformations.

I picked `U+10F0F0` for sanitization.

In order for sanitization to work, we must use a tokenizer that is designed to split of this token as part of its pre-tokenization strategy.
In HuggingFace Tokenizers library, we achieve that by adding the follow pre-tokenization rule:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: there has been talk of moving to a different tokenizer in the future. Will we be able to preserve this behavior in the options we're considering?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! we would need to extend future tokenizers with this pretokenization rule, though.

#[arg(short, long, required = true)]
output: PathBuf,

/// Substitutions in the form KEY=VALUE

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be a file with pairs instead of
This seems more persistent that trying to remember the incantation at every invocation of this fxn

Comment on lines +127 to +133
fn apply_substitutions(s: &str, subs: &[(String, String)]) -> Result<String, Error> {
let mut result = s.to_string();
for (key, val) in subs {
result = result.replace(key, val);
}
Ok(result)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this need not be a "Result<...>", since nothing can error here

Comment on lines +175 to +180
all_src.into_iter().zip(all_dst.into_iter()).collect::<Vec<_>>().par_iter().for_each(
|(src_path, dst_path)|{
process_single(src_path, dst_path, &args.substitutions).unwrap();
pbar.inc(1);
}
);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterating over indices
like
(0...all_src.len()).into_par_iter().for_each() saves you the zip and collect step. Doesn't matter here, but if you had millions/billions of things you were iterating over, maybe nice to keep in mind

@@ -0,0 +1,10 @@
{"added":"2023-04-07T15:24:38.743534+00:00","created":"2020-03-29T09:04:10Z","id":"http://100kinvesting.com/2016/11/28/first-teleconference-calls-invoice-factoring/","metadata":{"bucket":"head","cc_segment":"crawl-data/CC-MAIN-2020-16/segments/1585370494064.21/wet/CC-MAIN-20200329074745-20200329104745-00210.warc.wet.gz","date_download":"2020-03-29T09:04:10Z","digest":"sha1:DNJVF5QUTTJGEF56WA5VCYUVVD3OW7XL","language":"en","language_score":0.95,"length":569,"line_ids":[41,42,56,57,59],"nlines":5,"original_length":2734,"original_nlines":105,"perplexity":304.6,"provenance":"cc_en_head-0112.json.gz:1","source_domain":"100kinvesting.com","title":"First Teleconference Calls for Invoice Factoring, and more. — 100K Investing, LLC.","url":"http://100kinvesting.com/2016/11/28/first-teleconference-calls-invoice-factoring/"},"source":"common-crawl","text":"We will be starting to hold weekly teleconferences for businesses that may be interested in invoice factoring. In addition, there will be guest speakers to discuss seller financed mortgages, annuity sales, probate advances, and more.\nIf you're in or around any of these areas, check out the teleconference for invoice factoring. Note that these are open to the public, and anyone with a business that has invoices is welcome to attend."}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ain't reading all that.

Love that we have test cases. But would prefer synthetic test cases that are more easily manually inspectable. Now I'm just trusting these are correct

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops sorry this is a file I added by mistake. lemme remove.

Copy link
Contributor

@cmwilhelm cmwilhelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM as a standalone script, but what do you think about exposing some python bindings in the dolma package? This would make it a lot easier to operationalize in mise-en-ray and elsewhere. Wouldn't need a dolma CLI entrypoint, just a util fn.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants