Is the bag of words case-sensitive? #42

Open

yananchen1989 opened this issue Jan 11, 2022 · 1 comment

yananchen1989 commented Jan 11, 2022

Hello, I noticed that some words in the bag of words are cased while others are uncased.
The cased and uncased forms of a word have different token ids in the GPT tokenizer's vocabulary.

What is the appropriate way to process the words?
Thanks.

kizunasunhy commented Sep 23, 2022

It seems there is no better way to solve this than to include every case variant of each word in the bag of words.
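
A minimal sketch of that workaround, not the project's own code: expand each word into its lowercase, capitalized, and uppercase forms and tokenize every form. The names `case_variants`, `expand_bag_of_words`, and `toy_encode` are hypothetical. The sketch also adds a leading-space form of each variant, on the assumption that the tokenizer is a GPT-2 style byte-level BPE, where " word" and "word" map to different ids. `toy_encode` is only a stand-in so the example runs without downloading a vocabulary.

```python
from typing import Callable, Iterable, List, Set, Tuple


def case_variants(word: str) -> List[str]:
    """Return the distinct original, lower, Capitalized, and UPPER forms of a word."""
    variants: List[str] = []
    for candidate in (word, word.lower(), word.capitalize(), word.upper()):
        if candidate not in variants:
            variants.append(candidate)
    return variants


def expand_bag_of_words(
    words: Iterable[str],
    encode: Callable[[str], List[int]],
) -> Set[Tuple[int, ...]]:
    """Tokenize every case variant of every word, with and without a leading space.

    `encode` is any function mapping text to a list of token ids.
    Each bag entry is returned as a tuple of ids, since one word may
    split into several tokens.
    """
    bag: Set[Tuple[int, ...]] = set()
    for word in words:
        for variant in case_variants(word.strip()):
            # Assumption: a leading space changes the ids (GPT-2 style BPE).
            for text in (variant, " " + variant):
                bag.add(tuple(encode(text)))
    return bag


def toy_encode(text: str) -> List[int]:
    """Stand-in tokenizer (one id per character), for illustration only."""
    return [ord(ch) for ch in text]


bag = expand_bag_of_words(["war"], toy_encode)
print(len(bag))  # 6: three case forms, each with and without a leading space
```

With a real tokenizer, its `encode` method would be passed in place of `toy_encode`, for example `tokenizer.encode` from a Hugging Face `GPT2Tokenizer`. The expanded bag is larger, so any code that pads or masks bag entries would need to handle the extra rows.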
