Update list of stopwords #85

nathanbegbie · 2019-12-10T15:32:59Z

Country names were added to the stopword list when we attempted the clustering. We should not remove them in the stopwords approach, because they give more context.
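
A minimal sketch of that change, assuming the stopword list is a plain set of lowercase strings. `BASE_STOPWORDS`, `COUNTRY_NAMES` and `remove_stopwords` are hypothetical names used for illustration, not taken from the project code:

```python
# Hypothetical data: the real project lists are not shown in this issue.
BASE_STOPWORDS = {"the", "of", "and", "in", "kenya", "nepal"}
COUNTRY_NAMES = {"kenya", "nepal"}

# Drop country names from the stopword set so they are no longer filtered out.
STOPWORDS = BASE_STOPWORDS - COUNTRY_NAMES


def remove_stopwords(tokens):
    """Return the tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


print(remove_stopwords(["Floods", "in", "Kenya", "and", "Nepal"]))
# → ['Floods', 'Kenya', 'Nepal']
```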

tomwilsonsco · 2019-12-13T08:47:38Z

Testing revealed that country names were also missing from the list of 'English words', so we needed to add a list of countries here to ensure they make it through pre-processing.
The next consideration Nathan raised was countries with spaces in their names, e.g. "Democratic Republic of the Congo". These will still not survive pre-processing intact, because our TDM currently stores unigrams, not bigrams etc.

As a short-term solution, we can split the multi-word country names into separate words in the input so that their key words are retained, e.g. a search would pick up the word "Congo".
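
The short-term fix could look roughly like the sketch below. It is only an illustration: `ENGLISH_WORDS`, `COUNTRIES` and `preprocess` are hypothetical names, the word lists are tiny stand-ins, and the real pre-processing may tokenise differently.

```python
import re

# Hypothetical stand-ins for the project's word lists.
ENGLISH_WORDS = {"flooding", "democratic", "republic", "reported"}
COUNTRIES = ["Kenya", "Democratic Republic of the Congo"]

# Split multi-word country names into unigrams so each word can pass the filter.
COUNTRY_TOKENS = {word.lower() for name in COUNTRIES for word in name.split()}

ALLOWED = ENGLISH_WORDS | COUNTRY_TOKENS


def preprocess(text):
    """Lowercase, tokenise on letters, and keep only allowed unigrams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t in ALLOWED]


tokens = preprocess("Flooding reported in the Democratic Republic of the Congo")
print("congo" in tokens)
# → True
```

One side effect of splitting is that filler words from country names ("of", "the") also enter the allowed set, so they would need to be dropped by the stopword step if that runs afterwards.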

nathanbegbie assigned kevincarolan Dec 10, 2019

nathanbegbie assigned tomwilsonsco Dec 13, 2019

nathanbegbie closed this as completed Dec 20, 2019
