
Feat: Improve chardict.py to add n-gram counts #9

Open
wants to merge 12 commits into base: main

Conversation


@Ced-C Ced-C commented Dec 18, 2024

Warning: format change to count n-gram occurrences.

Can go up to a large n-gram parameter by changing the `NGRAM_MAX_LENGTH` constant.

Switch from the os module to pathlib to manage paths.

Fixes #7

@fabi1cazenave (Contributor) left a comment

Nice work. Sadly, for some reason it doesn’t work from the Makefile yet.

This JSON format change also requires some tweaks in merge.py, and they should be addressed in the same PR.

@fabi1cazenave (Contributor)

Thinking out loud, I think the name field in the JSON dict should be replaced by three sets (= lists with unique elements):

  • corpora: list of corpora collection identifiers (e.g. [Gutenberg], [Leipzig])
  • files: list of corpus files
  • languages: list of languages

And the merge.py script should concatenate these sets. WDYT?
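
Something along these lines, as a rough sketch (the dict shape and the `merge_metadata` helper name are illustrative, not the actual format):

```python
def merge_metadata(corpora_dicts: list[dict]) -> dict:
    """Union the metadata sets of all input corpora (sketch only)."""
    merged = {"corpora": set(), "files": set(), "languages": set()}
    for corpus in corpora_dicts:
        for key in merged:
            merged[key] |= set(corpus.get(key, []))
    # JSON has no set type, so serialize the sets as sorted lists
    return {key: sorted(values) for key, values in merged.items()}
```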

@Ced-C (Author) commented Dec 23, 2024

> Thinking out loud, I think the name field in the JSON dict should be replaced by three sets (= lists with unique elements):
>
> • `corpora`: list of corpora collection identifiers (e.g. `[Gutenberg]`, `[Leipzig]`)
> • `files`: list of corpus files
> • `languages`: list of languages
>
> And the merge.py script should concatenate these sets. WDYT?

I think it should have both:

  • the lists of files and languages are useful to source the corpus;
  • the name is useful to identify the corpus and display it in a UI.

Edit:
maybe a “source” object makes sense, but it can be tricky to fill it and to define what is actually needed.

I will work on merge so that it works with the proposed changes. I think I will implement two ways of merging (sketched below):

  • concatenate (merge on char count): useful for merging several books by the same author, for instance, or to grow a corpus that is too short
  • mix (merge on %): useful to make a custom corpus that has 10% emails, 50% books, and 40% chat, for instance

Does that seem right?
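
A sketch of the two modes (the flat `freq` dict shape and the function names are assumptions for illustration):

```python
def merge_concat(corpora: list[dict]) -> dict[str, int]:
    """Concatenate: raw counts add up, so bigger corpora weigh more."""
    merged: dict[str, int] = {}
    for corpus in corpora:
        for ngram, count in corpus["freq"].items():
            merged[ngram] = merged.get(ngram, 0) + count
    return merged

def merge_mix(corpora: list[dict], weights: list[float]) -> dict[str, float]:
    """Mix: each corpus contributes a fixed share, e.g. [0.1, 0.5, 0.4]."""
    merged: dict[str, float] = {}
    for corpus, weight in zip(corpora, weights):
        total = sum(corpus["freq"].values())
        for ngram, count in corpus["freq"].items():
            # normalize each corpus to a frequency, then scale by its share
            merged[ngram] = merged.get(ngram, 0.0) + weight * count / total
    return merged
```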

@Ced-C Ced-C requested a review from fabi1cazenave December 23, 2024 16:53
@fabi1cazenave (Contributor)

Let’s keep this PR as minimal as possible please. :-)
It’s not a way to dismiss your proposals, it’s just a good practice to keep changes as atomic as possible.

> the name is useful to identify the corpus and display it in a UI

If you intend to keep the filename in the JSON data, then you can forget the corpora identification I’ve suggested for now. We’ll do that in another PR.

> I will work on merge so that it works with the proposed changes. I think I will implement two ways of merging:

For this PR we just want to support the previous behavior, i.e. making an average of all input n-gram frequencies.
I guess it means that during a merge, the name field becomes the filename of the new JSON file and that the count field becomes the sum of all count inputs.
The two merging modes you mention should be implemented in another PR.
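
In other words, something like this (a sketch; only the `name`, `count`, and `freq` field names come from the current format, the rest is illustrative):

```python
def merge(corpora: list[dict], out_name: str) -> dict:
    """Average the input n-gram frequencies and sum the counts (sketch)."""
    freq: dict[str, float] = {}
    for corpus in corpora:
        for ngram, value in corpus["freq"].items():
            freq[ngram] = freq.get(ngram, 0.0) + value / len(corpora)
    return {
        "name": out_name,  # filename of the new JSON file
        "count": sum(corpus["count"] for corpus in corpora),
        "freq": freq,
    }
```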

@fabi1cazenave (Contributor)

Oh, I didn’t see you already did the job. Forget my previous comment, looking at your work right now. :-)

@fabi1cazenave (Contributor) left a comment

Running make does create the set of JSON files we expect in the /json directory, but I’m worried by two things:

  • the name field contains output; I’d suggest dropping this name field for this PR
  • trigrams don’t seem to be sorted by frequencies, and they don’t seem to be rounded to a readable number of decimal digits either (see the sketch below).
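
For that second point, a dict comprehension along these lines would do (illustrative; assumes `freq` maps n-grams to frequencies):

```python
# sort by descending frequency and round to 3 decimal digits (illustrative)
freq = {
    ngram: round(value, 3)
    for ngram, value in sorted(freq.items(), key=lambda kv: kv[1], reverse=True)
}
```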

Besides that, I’m afraid the merge function introduces regressions:

  • I’m not sure merge is still able to compute average frequencies of N input corpora (= our current need), because of its implementation
  • it has become too complex to use; I’ve left a comment about that.

I’d suggest to keep a single and simple merge function for this PR and implement other merge methods in a follow-up.

Last but not least, a type description of the new corpus format is needed. It can be a simple addition to the main docstring. (A proper mypy description would be awesome but I don’t know how to do this.)

…mpler and only implement current merge behaviour.
@Ced-C Ced-C requested a review from fabi1cazenave December 24, 2024 10:55
@fabi1cazenave (Contributor) left a comment

Almost there. This looks better and better, congrats! :-)

The two main remaining points are:

  • the use of the name field as a supposedly unique key
  • the use of an empty list as a default argument (illustrated below).

Please rebase on the latest main branch when you’re done with my review comments.
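
On that second point, the classic pitfall looks like this (a minimal illustration, not the actual code):

```python
def collect(ngram: str, ngrams: list = []):  # one shared list across all calls!
    ngrams.append(ngram)
    return ngrams

collect("th")  # ['th']
collect("he")  # ['th', 'he']  (state leaked from the first call)

# The usual fix: default to None and create the list inside the function.
def collect_fixed(ngram: str, ngrams: list | None = None):
    if ngrams is None:
        ngrams = []
    ngrams.append(ngram)
    return ngrams
```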

Comment on lines +51 to +63
```python
# get all n-grams
for ngram_start in range(len(txt)):
    for ngram_length in range(NGRAM_MAX_LENGTH):
        _ngram = get_ngram(txt, ngram_start, ngram_length)

        if not _ngram:  # _ngram is ""
            continue

        if _ngram not in ngrams[ngram_length]:
            ngrams[ngram_length][_ngram] = 0

        ngrams[ngram_length][_ngram] += 1
        ngrams_count[ngram_length] += 1
```
@fabi1cazenave (Contributor)

This works and the logic is simple (I like that), but I’m not a fan of how it’s implemented. If that’s okay with you:

  • please address the rest of this review
  • rebase on the latest main branch (this PR is based on a previous version)
  • and I’ll add a commit to your PR to suggest another implementation of the same logic.

@Ced-C (Author)

Eager to see what changes you are thinking about.
I was pretty sure this was simple enough to be accepted, but maybe it’s the performance you would like to improve?

@fabi1cazenave fabi1cazenave force-pushed the main branch 2 times, most recently from d63e3fb to 34b9633 on December 26, 2024 00:45
@Ced-C Ced-C requested a review from fabi1cazenave December 28, 2024 00:13
@Ced-C (Author) commented Jan 2, 2025

Thinking back about this PR, there is an inconsistency that bugs me:

  • in chardict, the `n` key for `corpus["freq"][n][ngram]` is an int (representing the length of an n-gram)
  • in merge, the same key is now a string, because the JSON dump converts the int to a string
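
The conversion is easy to reproduce with the standard json module:

```python
import json

data = {"freq": {1: {"e": 13}}}
print(json.loads(json.dumps(data)))
# prints {'freq': {'1': {'e': 13}}}: int keys come back as strings
```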

I would tend to modify chardict to make it a string everywhere for the sake of consistency. In that case, instead of just "1", "2", …, I think it would make the file more human-readable to have "1-gram", "2-gram", …

The JSON format would then be:

```json
"freq": {
    "1-gram": { "e": 13, … },
    "2-gram": { "th": 1.7, … },
    …
},
"count": {
    "1-gram": 999,
    "2-gram": 777,
    …
}
```

Is it a bad idea?
