-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Improve chardict.py
to add n-gram counts
#9
base: main
Are you sure you want to change the base?
Conversation
Warning format change to count n-gram occurences Can go up to large n-gram parametter by changing `NGRAM_MAX_LENGTH` constant. Switch from os module to pathlib to manage paths. Fixes OneDeadKey#7
e1a6466
to
cdc8b29
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work. Sadly, for some reason it doesn’t work from the Makefile
yet.
This JSON format change also requires some tweaks in merge.py
as well, and they should be addressed in the same PR.
f69cf2a
to
e7a27d5
Compare
Thinking out loud, I think the
And the |
I think it should have both.
Edit : I will work on
Does that seem right ? |
Let’s keep this PR as minimal as possible please. :-)
If you intend to keep the filename it the JSON data, then you can forget the corpora identification I’ve suggested for now. We’ll do that in another PR.
For this PR we just want to support the previous behavior, i.e. making an average of all input n-gram frequencies. |
Oh, I didn’t see you already did the job. Forget my previous comment, looking at your work right now. :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running make
does create the set of JSON files we expect in the /json
directory but I’m worried by two things:
- the
name
field containsoutput
— I’d suggest to drop thisname
field for this PR - trigrams don’t seem to be sorted by frequencies, and they don’t seem to be rounded to a readable number of decimal digits either.
Besides that, I’m afraid the merge
function introduces regressions:
- I’m not sure
merge
is still able to compute average frequencies of N input corpora (= our current need) because of its implementation - it has become too complex to use — I’ve left a comment about that.
I’d suggest to keep a single and simple merge
function for this PR and implement other merge methods in a follow-up.
Last but not least, a type description of the new corpus format is needed. It can be a simple addition to the main docstring. (A proper mypy description would be awesome but I don’t know how to do this.)
99a06eb
to
f69cf2a
Compare
…mpler and only implement current merge behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there. This looks better and better, congrats! :-)
The two main remaining points are:
- the use of the
name
field as a supposedly unique key - the use of an empty list as a default argument.
Please rebase on the latest main
branch when you’re done with my review comments.
# get all n-grams | ||
for ngram_start in range(len(txt)): | ||
for ngram_length in range(NGRAM_MAX_LENGTH): | ||
_ngram = get_ngram(txt, ngram_start, ngram_length) | ||
|
||
if not _ngram: # _ngram is "" | ||
continue | ||
|
||
if _ngram not in ngrams[ngram_length]: | ||
ngrams[ngram_length][_ngram] = 0 | ||
|
||
ngrams[ngram_length][_ngram] += 1 | ||
ngrams_count[ngram_length] += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works and the logic is simple (I like that), but I’m not a fan of how it’s implemented. If that’s okay with you:
- please address the rest of this review
- rebase on the latest
main
branch (this PR is based on a previous version) - and I’ll add a commit to your PR to suggest another implementation of the same logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eager to see what changes you are thinking about.
I was pretty sure this was simple enough to be accepted but maybe it’s performance-wise that you would like to improve things ?
d63e3fb
to
34b9633
Compare
… empty list argument)
thinking back about this pr, there is an inconsistency that bugs me :
I would tend to modify The json format would then be :
is it a bad idea ? |
Warning format change to count n-gram occurences
Can go up to large n-gram parametter by changing
NGRAM_MAX_LENGTH
constant.Switch from os module to pathlib to manage paths.
Fixes #7