bug: Korean dataset seems to be miscounting repetitions #241

@sandytribal

Bug Report

CIB Mango Tree version:

v0.9.0

Current behavior:

A "repeated" phrase in the Korean-language dataset is counted as repeated many times across posts, even though it may occur in only one or two posts. See screenshots below:

[Two screenshots attached]

Expected behavior:

Our ngram analyzer should count a repeated phrase at most once per post. A phrase that is repeated many times within the same post should not be counted more than once, because repeating a phrase many times in one post is more typical human behavior than repeating it many times across posts.
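
The expected counting rule can be illustrated with a minimal sketch. This is not the project's actual analyzer code: the function names `ngrams` and `count_posts_per_ngram` are hypothetical, and splitting on whitespace is an assumption that may not match how the analyzer tokenizes Korean text.

```python
from collections import Counter
from typing import Iterator


def ngrams(tokens: list[str], n: int) -> Iterator[tuple[str, ...]]:
    """Yield every contiguous n-token window, in order."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_posts_per_ngram(posts: list[str], n: int = 2) -> Counter:
    """For each n-gram, count the number of distinct posts that contain it."""
    counts: Counter = Counter()
    for post in posts:
        tokens = post.split()  # assumption: whitespace tokenization
        # set() collapses repeats within one post, so each post adds at most 1
        counts.update(set(ngrams(tokens, n)))
    return counts


posts = [
    "좋아요 좋아요 좋아요 좋아요",  # the same bigram occurs three times in this post
    "오늘 좋아요 좋아요",
    "오늘 날씨 좋아요",
]
counts = count_posts_per_ngram(posts, n=2)
print(counts[("좋아요", "좋아요")])  # 2 posts contain it, not 4 raw occurrences
```

Deduplicating with a set per post before updating the counter is what keeps heavy repetition inside a single post from inflating the cross-post count.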

Steps to reproduce:

Use this dataset.

https://drive.google.com/drive/u/0/folders/1wmwQK5Bj92-NM2FCXK7r1l542Db7B-ww

Be sure to adjust the import settings, because the dataset does not import correctly by default. Choose to modify the import settings, select the column separator as the setting to change, and then select comma as the column separator.

Related code:

Other information:

Labels: bugfix