Subword tokenization #162

Open
bonham79 opened this issue Feb 8, 2024 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@bonham79
Collaborator

bonham79 commented Feb 8, 2024

What are people's thoughts on adding preprocessing scripts to allow BPE-like subword tokenization? Technically we already support this (just tokenize your input and use the delineation function). But I wonder whether it is worthwhile to also write up the scripts so that they can be managed by the repo.

@bonham79 bonham79 self-assigned this Feb 8, 2024
@kylebgorman
Contributor

I am weakly opposed. It is a big source of complexity in FairSeq and we don't have any reason to suppose it improves things on this task. (That said, fork and try it out and if it works better than expected...)

The one context where I could imagine something vaguely similar is if we support using pretrained encoders---which we should. (I think there's an existing issue for that.) Then you'd just delegate the tokenization to the model's tokenizer.

@Adamits
Collaborator

Adamits commented Feb 14, 2024

I think maybe an example (in /examples) would be appropriate if we want to do this, where you use existing or custom code to tokenize your data with your tokenizer of choice, write it to new train/dev/test files, and then run yoyodyne on the data?
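
To make that workflow concrete, here is a rough sketch of what such an example script might look like. Everything in it is an assumption for illustration: the function names (`learn_bpe`, `segment`, `rewrite_tsv`) are hypothetical, the toy BPE learner stands in for whatever tokenizer you would actually use, and the data is assumed to be a TSV whose first two columns are source and target strings. The separator written between subwords would have to match whatever delimiter yoyodyne is configured to split symbols on.

```python
import collections
from typing import List, Tuple

Pair = Tuple[str, str]


def _merge_pair(symbols: List[str], pair: Pair) -> List[str]:
    """Replaces each adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(words: List[str], num_merges: int) -> List[Pair]:
    """Learns up to `num_merges` BPE merges from a list of word tokens."""
    vocab = collections.Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Counts adjacent symbol pairs, weighted by word frequency.
        pair_counts = collections.Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Ties are broken lexicographically so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = collections.Counter()
        for symbols, count in vocab.items():
            new_vocab[tuple(_merge_pair(list(symbols), best))] += count
        vocab = new_vocab
    return merges


def segment(word: str, merges: List[Pair]) -> List[str]:
    """Splits `word` into subwords by replaying the merges in learned order."""
    symbols = list(word)
    for pair in merges:
        symbols = _merge_pair(symbols, pair)
    return symbols


def rewrite_tsv(
    in_path: str, out_path: str, merges: List[Pair], sep: str = " "
) -> None:
    """Writes a copy of the TSV with columns 0 and 1 split into subwords.

    Assumes (hypothetically) that column 0 is the source and column 1 the
    target; any further columns (e.g., features) are copied unchanged.
    """
    with open(in_path, encoding="utf-8") as source:
        with open(out_path, "w", encoding="utf-8") as sink:
            for line in source:
                columns = line.rstrip("\n").split("\t")
                for i in range(min(2, len(columns))):
                    columns[i] = sep.join(segment(columns[i], merges))
                sink.write("\t".join(columns) + "\n")
```

You would learn the merges on the training strings only, call `rewrite_tsv` once each for train, dev, and test, and then point yoyodyne at the rewritten files. Sharing one merge table between source and target is just a simplification here; separate tables per side may well be the better choice.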

@kylebgorman
Contributor

examples is the wild west, do what you will there, within reason ;)

@bonham79
Collaborator Author

Those were my exact thoughts. Use if wanted, drop if not necessary. Probably will do decently on deep orthography inflection tasks.

@kylebgorman kylebgorman added the enhancement New feature or request label Feb 17, 2024

3 participants