Subword tokenization #162

bonham79 · 2024-02-08T17:50:10Z

What are people's thoughts on adding preprocessing scripts to allow BPE-like tokenization of characters? Technically we already support this (just tokenize your input and use the delineation function). But I wonder whether it's worthwhile to also write the scripts so that this can be managed by the repo.

kylebgorman · 2024-02-08T17:53:16Z

I am weakly opposed. It is a big source of complexity in FairSeq and we don't have any reason to suppose it improves things on this task. (That said, fork and try it out and if it works better than expected...)

The one context where I could imagine something vaguely similar is if we support using pretrained encoders, which we should. (I think there's an existing issue for that.) Then you'd just delegate the tokenization to the model's tokenizer.

Adamits · 2024-02-14T15:54:00Z

I think maybe an example (in /examples) would be appropriate if we want to do this, where you use existing or custom code to tokenize your data with your tokenizer of choice, write it to a new train/dev/test file, and then run yoyodyne on the data?
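A rough sketch of what such an example script could look like, using a toy pure-Python BPE learner rather than any particular tokenizer library. Everything here is an assumption for illustration: the function names (`learn_bpe`, `apply_merge`, `encode`, `tokenize_tsv_line`) are hypothetical, the data is assumed to be a TSV with source in the first column and target in the second, and a space is assumed as the symbol delimiter, so the downstream run would need to be told to split on it.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replaces each adjacent occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learns up to `num_merges` merges from an iterable of words."""
    # Each word starts out as a tuple of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Counts adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrites the vocabulary with the new merge applied.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(apply_merge(symbols, best))] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segments a word by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def tokenize_tsv_line(line, merges, sep=" "):
    """Tokenizes the first two columns (assumed source and target)."""
    source, target, *rest = line.rstrip("\n").split("\t")
    columns = [
        sep.join(encode(source, merges)),
        sep.join(encode(target, merges)),
        *rest,
    ]
    return "\t".join(columns)
```

The idea would be to learn merges on the training split only, run `tokenize_tsv_line` over each line of train/dev/test, write the results to new files, and point the training run at those. Predictions then need the delimiter stripped before scoring against the untokenized gold data.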

kylebgorman · 2024-02-14T16:29:01Z

examples is the wild west, do what you will there, within reason ;)

bonham79 · 2024-02-15T16:43:14Z

Those were my exact thoughts. Use if wanted, drop if not necessary. Probably will do decently on deep orthography inflection tasks.

bonham79 self-assigned this Feb 8, 2024

kylebgorman added the enhancement label Feb 17, 2024
