
Feat: Improve chardict.py to add n-gram counts #9

Open
wants to merge 12 commits into base: main

Conversation


@Ced-C Ced-C commented Dec 18, 2024

Warning: format change to count n-gram occurrences.

Can go up to a large n-gram parameter by changing the `NGRAM_MAX_LENGTH` constant.

Switch from the os module to pathlib to manage paths.

Fixes #7

@fabi1cazenave (Contributor) left a comment

Nice work. Sadly, for some reason it doesn’t work from the Makefile yet.

This JSON format change also requires some tweaks in merge.py, and they should be addressed in the same PR.

@fabi1cazenave (Contributor)

Thinking out loud, I think the name field in the JSON dict should be replaced by three sets (= lists with unique elements):

  • corpora: list of corpora collection identifiers (e.g. [Gutenberg], [Leipzig])
  • files: list of corpus files
  • languages: list of languages

And the merge.py script should concatenate these sets. WDYT?
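
Something along these lines, as a rough sketch (the dict shape and the `merge_metadata` helper name are illustrative, not the actual format):

```python
def merge_metadata(corpora_dicts: list[dict]) -> dict:
    """Union the metadata sets of all input corpora (sketch only)."""
    merged = {"corpora": set(), "files": set(), "languages": set()}
    for corpus in corpora_dicts:
        for key in merged:
            merged[key] |= set(corpus.get(key, []))
    # JSON has no set type, so serialize the sets as sorted lists
    return {key: sorted(values) for key, values in merged.items()}
```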

@Ced-C (Author) commented Dec 23, 2024

> Thinking out loud, I think the name field in the JSON dict should be replaced by three sets (= lists with unique elements):
>
> • `corpora`: list of corpora collection identifiers (e.g. `[Gutenberg]`, `[Leipzig]`)
> • `files`: list of corpus files
> • `languages`: list of languages
>
> And the merge.py script should concatenate these sets. WDYT?

I think it should have both:

  • the lists of files and languages are useful to source the corpus;
  • the name is useful to identify the corpus and display it in a UI.

Edit:
maybe a “source” object makes sense, but it can be tricky to fill it and to define what is actually needed.

I will work on merge so that it works with the proposed changes. I think I will implement two ways of merging (sketched below):

  • concatenate (merge on char count): useful for merging several books by the same author, for instance, or to grow a corpus that is too short
  • mix (merge on %): useful to make a custom corpus that has 10% emails, 50% books, and 40% chat, for instance

Does that seem right?
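
A sketch of the two modes (the flat `freq` dict shape and the function names are assumptions for illustration):

```python
def merge_concat(corpora: list[dict]) -> dict[str, int]:
    """Concatenate: raw counts add up, so bigger corpora weigh more."""
    merged: dict[str, int] = {}
    for corpus in corpora:
        for ngram, count in corpus["freq"].items():
            merged[ngram] = merged.get(ngram, 0) + count
    return merged

def merge_mix(corpora: list[dict], weights: list[float]) -> dict[str, float]:
    """Mix: each corpus contributes a fixed share, e.g. [0.1, 0.5, 0.4]."""
    merged: dict[str, float] = {}
    for corpus, weight in zip(corpora, weights):
        total = sum(corpus["freq"].values())
        for ngram, count in corpus["freq"].items():
            # normalize each corpus to a frequency, then scale by its share
            merged[ngram] = merged.get(ngram, 0.0) + weight * count / total
    return merged
```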

@Ced-C Ced-C requested a review from fabi1cazenave December 23, 2024 16:53
@fabi1cazenave (Contributor)

Let’s keep this PR as minimal as possible please. :-)
It’s not a way to dismiss your proposals, it’s just a good practice to keep changes as atomic as possible.

> the name is useful to identify the corpus and display it in a UI

If you intend to keep the filename in the JSON data, then you can forget the corpora identification I’ve suggested for now. We’ll do that in another PR.

> I will work on merge so that it works with the proposed changes. I think I will implement two ways of merging:

For this PR we just want to support the previous behavior, i.e. making an average of all input n-gram frequencies.
I guess it means that during a merge, the name field becomes the filename of the new JSON file and that the count field becomes the sum of all count inputs.
The two merging modes you mention should be implemented in another PR.
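
In other words, something like this (a sketch; only the `name`, `count`, and `freq` field names come from the current format, the rest is illustrative):

```python
def merge(corpora: list[dict], out_name: str) -> dict:
    """Average the input n-gram frequencies and sum the counts (sketch)."""
    freq: dict[str, float] = {}
    for corpus in corpora:
        for ngram, value in corpus["freq"].items():
            freq[ngram] = freq.get(ngram, 0.0) + value / len(corpora)
    return {
        "name": out_name,  # filename of the new JSON file
        "count": sum(corpus["count"] for corpus in corpora),
        "freq": freq,
    }
```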

@fabi1cazenave (Contributor)

Oh, I didn’t see you already did the job. Forget my previous comment, looking at your work right now. :-)

@fabi1cazenave (Contributor) left a comment

Running make does create the set of JSON files we expect in the /json directory, but I’m worried by two things:

  • the name field contains output; I’d suggest dropping this name field for this PR
  • trigrams don’t seem to be sorted by frequencies, and they don’t seem to be rounded to a readable number of decimal digits either (see the sketch below).
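
For that second point, a dict comprehension along these lines would do (illustrative; assumes `freq` maps n-grams to frequencies):

```python
# sort by descending frequency and round to 3 decimal digits (illustrative)
freq = {
    ngram: round(value, 3)
    for ngram, value in sorted(freq.items(), key=lambda kv: kv[1], reverse=True)
}
```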

Besides that, I’m afraid the merge function introduces regressions:

  • I’m not sure merge is still able to compute average frequencies of N input corpora (= our current need), because of its implementation
  • it has become too complex to use; I’ve left a comment about that.

I’d suggest to keep a single and simple merge function for this PR and implement other merge methods in a follow-up.

Last but not least, a type description of the new corpus format is needed. It can be a simple addition to the main docstring. (A proper mypy description would be awesome but I don’t know how to do this.)

…mpler and only implement current merge behaviour.
@Ced-C Ced-C requested a review from fabi1cazenave December 24, 2024 10:55
@fabi1cazenave (Contributor) left a comment

Almost there. This looks better and better, congrats! :-)

The two main remaining points are:

  • the use of the name field as a supposedly unique key
  • the use of an empty list as a default argument (illustrated below).

Please rebase on the latest main branch when you’re done with my review comments.
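
On that second point, the classic pitfall looks like this (a minimal illustration, not the actual code):

```python
def collect(ngram: str, ngrams: list = []):  # one shared list across all calls!
    ngrams.append(ngram)
    return ngrams

collect("th")  # ['th']
collect("he")  # ['th', 'he']  (state leaked from the first call)

# The usual fix: default to None and create the list inside the function.
def collect_fixed(ngram: str, ngrams: list | None = None):
    if ngrams is None:
        ngrams = []
    ngrams.append(ngram)
    return ngrams
```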

Comment on lines +51 to +63
```python
# get all n-grams
for ngram_start in range(len(txt)):
    for ngram_length in range(NGRAM_MAX_LENGTH):
        _ngram = get_ngram(txt, ngram_start, ngram_length)

        if not _ngram:  # _ngram is ""
            continue

        if _ngram not in ngrams[ngram_length]:
            ngrams[ngram_length][_ngram] = 0

        ngrams[ngram_length][_ngram] += 1
        ngrams_count[ngram_length] += 1
```
@fabi1cazenave (Contributor)

This works and the logic is simple (I like that), but I’m not a fan of how it’s implemented. If that’s okay with you:

  • please address the rest of this review
  • rebase on the latest main branch (this PR is based on a previous version)
  • and I’ll add a commit to your PR to suggest another implementation of the same logic.

@Ced-C (Author)

Eager to see what changes you are thinking about.
I was pretty sure this was simple enough to be accepted, but maybe it’s the performance you would like to improve?

@fabi1cazenave fabi1cazenave force-pushed the main branch 2 times, most recently from d63e3fb to 34b9633 on December 26, 2024 00:45
@Ced-C Ced-C requested a review from fabi1cazenave December 28, 2024 00:13
@Ced-C (Author) commented Jan 2, 2025

Thinking back about this PR, there is an inconsistency that bugs me:

  • in chardict, the `n` key for `corpus["freq"][n][ngram]` is an int (representing the length of an n-gram)
  • in merge, the same key is now a string, because the JSON dump converts the int to a string
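
The conversion is easy to reproduce with the standard json module:

```python
import json

data = {"freq": {1: {"e": 13}}}
print(json.loads(json.dumps(data)))
# prints {'freq': {'1': {'e': 13}}}: int keys come back as strings
```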

I would tend to modify chardict to make it a string everywhere for the sake of consistency. In that case, instead of just "1", "2", …, I think it would make the file more human-readable to have "1-gram", "2-gram", …

The JSON format would then be:

```json
"freq": {
    "1-gram": { "e": 13, … },
    "2-gram": { "th": 1.7, … },
    …
},
"count": {
    "1-gram": 999,
    "2-gram": 777,
    …
}
```

Is it a bad idea?
