Description
Bug Report
CIB Mango Tree version:
v0.9.0
Current behavior:
A "repeated" phrase in the Korean-language dataset is reported as repeating many times across posts, even though it may occur in only one or two posts. See screenshots below:
Expected behavior:
Our n-gram analyzer should count a repeated phrase at most once per post. A phrase that appears many times within a single post should not add more than one to its count. Repeating the same phrase many times in one post is more typical of human behavior than repeating it across many posts.
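The expected counting rule can be sketched as follows. This is only an illustration, not the analyzer's actual code: the function names `ngrams` and `count_posts_per_ngram` are hypothetical, and whitespace tokenization is an assumption that may not match how the analyzer tokenizes Korean text.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-gram in a token list, duplicates included."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_posts_per_ngram(posts, n=2):
    """Count, for each n-gram, the number of posts that contain it.

    Each post contributes at most 1 per n-gram, no matter how many
    times the n-gram repeats inside that post.
    """
    counts = Counter()
    for text in posts:
        tokens = text.split()  # assumption: simple whitespace tokenization
        counts.update(set(ngrams(tokens, n)))  # set() drops within-post repeats
    return counts


# Made-up sample posts, for illustration only.
posts = [
    "buy now buy now buy now",  # phrase repeated three times in one post
    "buy now please",
    "hello world",
]
post_counts = count_posts_per_ngram(posts, n=2)
```

With these sample posts, `("buy", "now")` gets a count of 2 (two posts contain it) rather than 4 (its total number of occurrences), which is the behavior this report asks for.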
Steps to reproduce:
Use this dataset.
https://drive.google.com/drive/u/0/folders/1wmwQK5Bj92-NM2FCXK7r1l542Db7B-ww
Be sure to adjust the import settings, because the file does not import correctly by default. Choose to modify the import settings, select the column separator as the setting to change, then pick comma as the column separator.
Related code:
Other information: