Update list of stopwords #85

Closed
nathanbegbie opened this issue Dec 10, 2019 · 1 comment
@nathanbegbie
Contributor

The stopword list was extended to include country names, a change that was introduced when we attempted the clustering. We should not remove country names in the stopwords approach, so that they stay in the text and give more context.

@tomwilsonsco
Contributor

Testing revealed that country names were also missing from the list of 'English words', so a list of countries had to be added here to ensure they make it through pre-processing.
The next consideration Nathan raised was countries with spaces in their names, e.g. "Democratic Republic of the Congo". These will still not survive pre-processing intact, because our TDM stores only unigrams at the moment, not bigrams etc.
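
To illustrate the first point, here is a minimal sketch of a pre-processing filter in which a token survives only if it is a known word and not a stopword. The word lists and the `preprocess` function are hypothetical stand-ins, not the project's actual code; the point is only that country names need to be in the allowed vocabulary and absent from the stopwords.

```python
# Hypothetical word lists for illustration; the real lists are not shown in this issue.
ENGLISH_WORDS = {"aid", "funding", "for", "in", "the", "health"}
STOPWORDS = {"for", "in", "the"}
COUNTRY_NAMES = {"kenya", "nepal"}


def preprocess(text):
    """Keep tokens that are known words (English or country) and not stopwords."""
    allowed = (ENGLISH_WORDS | COUNTRY_NAMES) - STOPWORDS
    return [token for token in text.lower().split() if token in allowed]


print(preprocess("Health funding for Kenya"))
```

Without `COUNTRY_NAMES` in the allowed set, "kenya" would be dropped along with any other word missing from the English word list.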

As a short-term solution, we can split the multi-word country names into separate words in the input so that their key words are retained, e.g. a search would pick up the word "Congo".
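
A rough sketch of that split, assuming the country list is a plain list of names and that ordinary stopwords such as "of" and "the" should still be dropped. The function name `country_unigrams` and the sample inputs are made up for illustration.

```python
def country_unigrams(country_names, stopwords):
    """Split each country name on whitespace and collect the non-stopword parts."""
    words = set()
    for name in country_names:
        for word in name.lower().split():
            if word not in stopwords:
                words.add(word)
    return words


# Hypothetical inputs for illustration only.
countries = ["Democratic Republic of the Congo", "Kenya"]
stopwords = {"of", "the"}
print(sorted(country_unigrams(countries, stopwords)))
```

One side effect to be aware of: generic parts such as "democratic" and "republic" are kept as well, so they will match documents that have nothing to do with the country.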
