Parsing Rules for the Glove.42B.300D #200

Open
hontimzam opened this issue Dec 15, 2021 · 0 comments

Hello, I am Tim.
I have some questions about the pre-trained vectors in glove.42B.300d.txt.

I am working on some text that I would like to convert to vectors using glove.42B.300d.txt in Python. I wrote my own parsing rules/tokenizer (with the spaCy library) for this text, but the tokens my rules produce do not always match the words in the GloVe vocabulary.

For example, take this text:
"New York is a big city and there are many stores. The items in the stores are non-expensive. There are 5.5-billion peoples in the world"

After applying my parsing rules, I get:
"new york", "is", "a", "big", "city", "and", "there", "are", "many", "stores", "the", "items", "in", "the", "stores", "are", "non", "expensive",
"there", "are", "5.5 billion", "peoples", "in", "the", "world".

However, in glove.42B.300d.txt:
1). there is no "new york", but there is "new-york";
2). "non" and "expensive" are both present, but so is "non-expensive" (with a different vector);
3). even with hyphens, there is no "5.5-billion", although similar entries such as "9.5-billion" and "4.5-billion" do exist;
4). there are other similar exceptional cases.

As a result, only 65% of my tokens are found in the GloVe vocabulary. This is not because anything is wrong with the vocabulary; my parsing rules are simply not good enough. My question is: how can I modify the parsing rules so that the tokens fit the vocabulary well? Are there existing parsing rules I could reuse?
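To make the coverage figure concrete, here is a minimal sketch of how it could be measured. It assumes the GloVe file is plain text with one entry per line, the word first and the vector values after it, separated by spaces. The function names `load_vocab` and `coverage` and the file path are hypothetical, chosen only for illustration.

```python
from collections import Counter


def load_vocab(path):
    """Collect only the words (first field of each line) from a GloVe text file."""
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Assumed line layout: "<word> <v1> <v2> ... <v300>"
            vocab.add(line.split(" ", 1)[0])
    return vocab


def coverage(tokens, vocab):
    """Return (fraction of token occurrences found in vocab, sorted missing tokens)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    hits = sum(n for tok, n in counts.items() if tok in vocab)
    missing = sorted(tok for tok in counts if tok not in vocab)
    return (hits / total if total else 0.0), missing


# Toy stand-in for the real vocabulary, so the example runs without the file.
# With the real file this would be: vocab = load_vocab("glove.42B.300d.txt")
vocab = {"new-york", "is", "a", "big", "city", "non", "expensive", "non-expensive"}
tokens = ["new york", "is", "a", "big", "city", "non", "expensive"]

ratio, missing = coverage(tokens, vocab)
print(f"coverage: {ratio:.2%}, missing: {missing}")
```

Listing the missing tokens next to the ratio makes it easier to see which kinds of mismatch (spaces versus hyphens, numbers, and so on) account for most of the gap.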

I have tried inspecting the unmatched words and fixing them case by case. However, that does not guarantee that new exceptional cases will not show up in the future.
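One possible alternative to case-by-case fixes is to let the vocabulary itself drive the normalization: look a token up as-is, then try a hyphen-joined form, then fall back to its parts. The sketch below is only an illustration under assumptions: it assumes the vocabulary is lowercase, the function name `to_vocab_tokens` is hypothetical, and the order of the fallbacks is a design choice rather than a rule taken from GloVe.

```python
import re


def to_vocab_tokens(token, vocab):
    """Map one tokenizer output to a list of tokens that exist in vocab.

    Returns an empty list when nothing matches.
    """
    token = token.lower()  # assumption: the vocabulary is lowercase
    if token in vocab:
        return [token]

    # Multi-word tokens may be stored with hyphens: "new york" -> "new-york".
    hyphenated = re.sub(r"\s+", "-", token)
    if hyphenated in vocab:
        return [hyphenated]

    # Otherwise back off to the individual parts that are in the vocabulary,
    # e.g. "5.5 billion" -> ["5.5", "billion"] when both parts are known.
    parts = [p for p in re.split(r"[\s\-]+", token) if p]
    return [p for p in parts if p in vocab]


# Toy vocabulary mirroring the cases listed above (not the real file).
vocab = {"new-york", "non", "expensive", "non-expensive", "5.5", "billion"}
for tok in ["new york", "non-expensive", "5.5 billion", "xyzzy"]:
    print(tok, "->", to_vocab_tokens(tok, vocab))
```

This version prefers the longest form found in the vocabulary, so "non-expensive" stays one token even though "non" and "expensive" also exist. Whether the joined vector or the separate vectors suit a given task better remains an open choice.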
