-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tokenization sanitizer script #242
base: main
Are you sure you want to change the base?
Conversation
I picked `U+10F0F0` for sanitization. | ||
|
||
In order for sanitization to work, we must use a tokenizer that is designed to split of this token as part of its pre-tokenization strategy. | ||
In HuggingFace Tokenizers library, we achieve that by adding the follow pre-tokenization rule: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: there has been talk of moving to a different tokenizer in the future. Will we be able to preserve this behavior in the options we're considering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes! we would need to extend future tokenizers with this pretokenization rule, though.
#[arg(short, long, required = true)] | ||
output: PathBuf, | ||
|
||
/// Substitutions in the form KEY=VALUE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be a file with pairs instead of
This seems more persistent that trying to remember the incantation at every invocation of this fxn
fn apply_substitutions(s: &str, subs: &[(String, String)]) -> Result<String, Error> { | ||
let mut result = s.to_string(); | ||
for (key, val) in subs { | ||
result = result.replace(key, val); | ||
} | ||
Ok(result) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically this need not be a "Result<...>", since nothing can error here
all_src.into_iter().zip(all_dst.into_iter()).collect::<Vec<_>>().par_iter().for_each( | ||
|(src_path, dst_path)|{ | ||
process_single(src_path, dst_path, &args.substitutions).unwrap(); | ||
pbar.inc(1); | ||
} | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iterating over indices
like
(0...all_src.len()).into_par_iter().for_each()
saves you the zip and collect step. Doesn't matter here, but if you had millions/billions of things you were iterating over, maybe nice to keep in mind
@@ -0,0 +1,10 @@ | |||
{"added":"2023-04-07T15:24:38.743534+00:00","created":"2020-03-29T09:04:10Z","id":"http://100kinvesting.com/2016/11/28/first-teleconference-calls-invoice-factoring/","metadata":{"bucket":"head","cc_segment":"crawl-data/CC-MAIN-2020-16/segments/1585370494064.21/wet/CC-MAIN-20200329074745-20200329104745-00210.warc.wet.gz","date_download":"2020-03-29T09:04:10Z","digest":"sha1:DNJVF5QUTTJGEF56WA5VCYUVVD3OW7XL","language":"en","language_score":0.95,"length":569,"line_ids":[41,42,56,57,59],"nlines":5,"original_length":2734,"original_nlines":105,"perplexity":304.6,"provenance":"cc_en_head-0112.json.gz:1","source_domain":"100kinvesting.com","title":"First Teleconference Calls for Invoice Factoring, and more. — 100K Investing, LLC.","url":"http://100kinvesting.com/2016/11/28/first-teleconference-calls-invoice-factoring/"},"source":"common-crawl","text":"We will be starting to hold weekly teleconferences for businesses that may be interested in invoice factoring. In addition, there will be guest speakers to discuss seller financed mortgages, annuity sales, probate advances, and more.\nIf you're in or around any of these areas, check out the teleconference for invoice factoring. Note that these are open to the public, and anyone with a business that has invoices is welcome to attend."} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ain't reading all that.
Love that we have test cases. But would prefer synthetic test cases that are more easily manually inspectable. Now I'm just trusting these are correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops sorry this is a file I added by mistake. lemme remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM as a standalone script, but what do you think about exposing some python bindings in the dolma package? This would make it a lot easier to operationalize in mise-en-ray
and elsewhere. Wouldn't need a dolma CLI entrypoint, just a util fn.
This PR adds a sanitization utility to escape special tokens while still allowing subsequent text transformations.