tokenization (punctuation) during training and inference #59

Open · kubapok opened this issue Jan 29, 2021 · 0 comments

kubapok commented Jan 29, 2021

The sample data and the full training and testing data contain tokenized sentences (tokenized by TweetTokenizer, I suppose):
what are you doing for a living ? i am a admin .
rather than untokenized sentences:
what are you doing for a living? i am a admin.

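For reference, a minimal sketch of how that spacing would arise, assuming the data was pre-tokenized with NLTK's TweetTokenizer (that choice of tokenizer is my assumption):

```python
# Hypothetical reproduction of the spacing seen in the training data;
# assumes pre-tokenization with NLTK's TweetTokenizer.
from nltk.tokenize import TweetTokenizer

sentence = "what are you doing for a living? i am a admin."
tokens = TweetTokenizer().tokenize(sentence)
print(" ".join(tokens))
# -> what are you doing for a living ? i am a admin .
```
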
During inference, the model output seems to be correct (detokenized) regardless of whether the input is tokenized or not.
The third-party decoding scripts in the README do not use any tokenization.
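
Roughly the kind of comparison I mean, as a minimal sketch assuming the model is loaded through Hugging Face transformers (the checkpoint name below is only an example, not taken from this repo):

```python
# Sketch: feed a pre-tokenized and an untokenized prompt and compare the replies.
# The checkpoint name is an assumption used for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

for prompt in [
    "what are you doing for a living ?",  # pre-tokenized, training-data style
    "what are you doing for a living?",   # raw, untokenized
]:
    input_ids = tokenizer.encode(prompt + tokenizer.eos_token, return_tensors="pt")
    output_ids = model.generate(
        input_ids, max_length=50, pad_token_id=tokenizer.eos_token_id
    )
    reply = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
    print(repr(reply))
```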

What is the correct way to use the model? Should I tokenize the input or detokenize the output? Is the tokenizer exactly the same as the GPT-2 tokenizer, or was it trained from scratch on the Reddit data?
