Correctly merge lowercase and uppercase bigrams #24

kvakil · 2019-10-11T19:10:36Z

Some entries in the wordsegment/bigrams.txt file were duplicated.
Each bigram was lowercased, but because some bigrams appeared in both
uppercase and lowercase forms, the same lowercase bigram ended up in the
file twice. The code only uses one of these entries, so the frequency of
these bigrams was underestimated.
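
To illustrate the failure mode, here is a minimal sketch. It assumes the counts are tab-separated and loaded into a plain dict, so a later duplicate line overwrites an earlier one. The `load_counts` helper and the numbers are hypothetical and are not taken from wordsegment or its data files.

```python
# Hypothetical counts, for illustration only (not real corpus data).
duplicated_lines = [
    "hello world\t100",  # originally "Hello World", lowercased
    "hello world\t40",   # originally "hello world"
]

def load_counts(lines):
    """Load "ngram<TAB>count" lines into a dict (assumed loader behaviour)."""
    counts = {}
    for line in lines:
        ngram, count = line.split("\t")
        # A repeated key replaces the earlier value instead of adding to it.
        counts[ngram] = float(count)
    return counts

counts = load_counts(duplicated_lines)
```

Under these assumptions only one of the two counts survives, where the merged frequency should be their sum.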

The attached program lowercase_ngrams.py lowercases its input while
merging the frequencies correctly. The wordsegment/bigrams.txt file is
updated using this program. The wordsegment/unigrams.txt file did not
have this issue, so it was not changed.
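
The merging step can be sketched roughly as follows. This is not the attached lowercase_ngrams.py itself, only an approximation of the idea: it assumes one "ngram<TAB>count" entry per line with integer counts, and the function names `merge_lowercased` and `format_counts` are made up for this example.

```python
from collections import Counter

def merge_lowercased(lines):
    """Lowercase each n-gram and sum the counts of entries that collide."""
    totals = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        ngram, count = line.rsplit("\t", 1)
        totals[ngram.lower()] += int(count)
    return totals

def format_counts(totals):
    """Render the merged counts, most frequent first, ties broken by n-gram."""
    ordered = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [f"{ngram}\t{count}" for ngram, count in ordered]

# Hypothetical input, for illustration only.
sample = ["Hello World\t100", "hello world\t40", "of the\t7"]
merged = merge_lowercased(sample)
```

With this sample input, the two "hello world" variants collapse into a single entry whose count is the sum of both, rather than two separate lines.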

A new test was added to tests/test_coverage.py, showing that "helloworld"
is now correctly segmented as "hello world". Previous versions left this
unsegmented as "helloworld" because the frequency of the bigram was
underestimated.
