What are people's thoughts on adding preprocessing scripts to allow BPE-like tokenization of characters? Technically we already support this (just tokenize your input and use the delineation function), but I wonder whether it would also be worthwhile to write up the scripting so it can be maintained in the repo.
I am weakly opposed. It is a big source of complexity in FairSeq and we don't have any reason to suppose it improves things on this task. (That said, feel free to fork and try it out; if it works better than expected...)
The one context where I could imagine something vaguely similar is if we support using pretrained encoders, which we should. (I think there's an existing issue for that.) Then you'd just delegate the tokenization to the model's tokenizer.
I think an example (in /examples) would be appropriate if we want to do this: use existing or custom code to tokenize your data with your tokenizer of choice, write the result to new train/dev/test files, and then run yoyodyne on those files.
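As a rough illustration of what such an example might look like, here is a minimal sketch that pre-tokenizes TSV data with a SentencePiece BPE model and writes out new files whose source column is whitespace-delimited BPE pieces. The file names, column layout, and vocabulary size are all assumptions, not anything the repo currently ships.

```python
# Sketch only: train a small BPE model on the source column of train.tsv
# (assumed to be column 0), then rewrite each split with space-separated
# BPE pieces so downstream tools can treat each piece as a single symbol.
import csv

import sentencepiece as spm

# Collect the source strings to train the BPE model on.
with open("train.tsv", encoding="utf-8") as f:
    sources = [row[0] for row in csv.reader(f, delimiter="\t")]
with open("sources.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sources))

# Train a BPE model; vocab_size=1000 is an arbitrary placeholder.
spm.SentencePieceTrainer.train(
    input="sources.txt", model_prefix="bpe", vocab_size=1000, model_type="bpe"
)
sp = spm.SentencePieceProcessor(model_file="bpe.model")

# Rewrite each split, replacing the source column with its BPE pieces.
for split in ("train", "dev", "test"):
    with open(f"{split}.tsv", encoding="utf-8") as fin, open(
        f"{split}.bpe.tsv", "w", newline="", encoding="utf-8"
    ) as fout:
        writer = csv.writer(fout, delimiter="\t")
        for row in csv.reader(fin, delimiter="\t"):
            row[0] = " ".join(sp.encode(row[0], out_type=str))
            writer.writerow(row)
```

The output files then just need yoyodyne told that source symbols are whitespace-separated (via whatever delineation/separator option applies), so each BPE piece is treated as one "character".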