@kvakil kvakil commented Oct 11, 2019

Some entries in the wordsegment/bigrams.txt file used to be duplicated.
In particular, each bigram was lowercased, but since some bigrams
appeared in both uppercase and lowercase forms, the same bigram ended
up in the file twice in lowercase. The code only uses one of these
entries, causing the frequency of these bigrams to be underestimated.
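
To illustrate the underestimate, here is a minimal sketch. It assumes a
tab-separated "ngram<TAB>count" format and a dict-based loader in which a
later duplicate overwrites an earlier one; the actual loader may keep a
different entry. The function names and counts are hypothetical.

```python
# Hypothetical data: the same bigram appears twice after lowercasing.
LINES = ["hello world\t100", "hello world\t40"]


def load_naive(lines):
    """Load counts into a dict; a repeated key overwrites the earlier value."""
    counts = {}
    for line in lines:
        ngram, count = line.split("\t")
        counts[ngram] = float(count)
    return counts


def load_merged(lines):
    """Load counts into a dict, summing the counts of repeated keys."""
    counts = {}
    for line in lines:
        ngram, count = line.split("\t")
        counts[ngram] = counts.get(ngram, 0.0) + float(count)
    return counts
```

With this data, load_naive keeps only one of the two counts, while
load_merged reports their sum.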

The attached program lowercase_ngrams.py lowercases its input while
merging the frequencies correctly. The wordsegment/bigrams.txt file is
updated using this program. The wordsegment/unigrams.txt file did not
have this issue, so it was not changed.
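
The merge step could look roughly like the sketch below. This is not the
attached lowercase_ngrams.py itself; it assumes one "ngram<TAB>count"
entry per line with integer counts, and the function name
merge_lowercased is hypothetical.

```python
def merge_lowercased(lines):
    """Lowercase each n-gram and sum the counts of entries that collide.

    Assumes each non-empty line is "ngram<TAB>count" with an integer count.
    Output preserves the order in which each lowercased n-gram first appears.
    """
    totals = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so the n-gram itself may contain spaces.
        ngram, count = line.rsplit("\t", 1)
        key = ngram.lower()
        totals[key] = totals.get(key, 0) + int(count)
    return [f"{ngram}\t{count}" for ngram, count in totals.items()]
```

Passing the lines of a file through merge_lowercased and writing the
result back out would yield a file with one entry per lowercased n-gram.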

A new test was added to tests/test_coverage.py, showing that "helloworld"
is now correctly segmented as "hello world". Previous versions would
leave it as the single word "helloworld" because the frequency of the
bigram "hello world" was underestimated.
