Dependencies: Upgrade Stanza to 1.8.2; Utils: 1. Add Stanza's Armenian (Classical) sentence tokenizer, word tokenizer, part-of-speech tagger, lemmatizer, and dependency parser 2. Add Stanza's English (Old) sentence tokenizer, word tokenizer, part-of-speech tagger, lemmatizer, and dependency parser
BLKSerene committed Jun 23, 2024
1 parent 6d61265 commit a546a3e
Showing 37 changed files with 261 additions and 83 deletions.
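
For reference, the processors named in the commit message correspond to Stanza's standard pipeline stages. Below is a minimal sketch, not code from this commit, of how one of the newly supported pipelines might be invoked; the language codes "xcl" (Armenian, Classical) and "ang" (English, Old) are assumptions and should be checked against Stanza 1.8.2's model list.

```python
# Sketch only; NOT code from the Wordless repository. The language codes below
# ("xcl" for Armenian (Classical), "ang" for English (Old)) are assumed, not verified.
import stanza

lang = 'ang'  # or 'xcl'

# Download the models once (network access required), then build a pipeline with
# the processors this commit wires up: sentence/word tokenization, POS tagging,
# lemmatization, and dependency parsing.
stanza.download(lang)
nlp = stanza.Pipeline(lang, processors='tokenize,pos,lemma,depparse')

doc = nlp('Hwæt! We Gardena in geardagum þeodcyninga þrym gefrunon.')

for sentence in doc.sentences:
    for word in sentence.words:
        # Each word exposes its lemma, universal POS tag, and dependency head/relation.
        print(word.text, word.lemma, word.upos, word.head, word.deprel)
```
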
2 changes: 1 addition & 1 deletion ACKS.md
@@ -49,7 +49,7 @@ As Wordless stands on the shoulders of giants, I hereby extend my sincere gratit
25|[simplemma](https://github.com/adbar/simplemma)|1.0.0|Adrien Barbaresi|[MIT](https://github.com/adbar/simplemma/blob/main/LICENSE)
26|[spaCy](https://spacy.io/)|3.7.5|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
27|[spacy-pkuseg](https://github.com/explosion/spacy-pkuseg)|0.0.33|Ruixuan Luo (罗睿轩), Jingjing Xu (许晶晶),<br>Xuancheng Ren (任宣丞), Yi Zhang (张艺),<br>Zhiyuan Zhang (张之远), Bingzhen Wei (位冰镇),<br>Xu Sun (孙栩)<br>Adriane Boyd, Ines Montani|[MIT](https://github.com/explosion/spacy-pkuseg/blob/master/LICENSE)
-28|[Stanza](https://github.com/stanfordnlp/stanza)|1.7.0|Peng Qi (齐鹏), Yuhao Zhang (张宇浩),<br>Yuhui Zhang (张钰晖), Jason Bolton,<br>Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
+28|[Stanza](https://github.com/stanfordnlp/stanza)|1.8.2|Peng Qi (齐鹏), Yuhao Zhang (张宇浩),<br>Yuhui Zhang (张钰晖), Jason Bolton,<br>Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
29|[SudachiPy](https://github.com/WorksApplications/sudachi.rs)|0.6.8|Works Applications Co., Ltd.|[Apache-2.0](https://github.com/WorksApplications/sudachi.rs/blob/develop/LICENSE)
30|[Underthesea](https://undertheseanlp.com/)|6.8.4|Vu Anh|[GPL-3.0-or-later](https://github.com/undertheseanlp/underthesea/blob/main/LICENSE)
31|[VADER](https://github.com/cjhutto/vaderSentiment)|3.3.2|C.J. Hutto|[MIT](https://github.com/cjhutto/vaderSentiment/blob/master/LICENSE.txt)
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -29,6 +29,8 @@
- Utils: Add Pyphen's Basque syllable tokenizer
- Utils: Add PyThaiNLP's Han-solo
- Utils: Add spaCy's Faroese and Norwegian (Nynorsk) word tokenizers
+- Utils: Add Stanza's Armenian (Classical) sentence tokenizer, word tokenizer, part-of-speech tagger, lemmatizer, and dependency parser
+- Utils: Add Stanza's English (Old) sentence tokenizer, word tokenizer, part-of-speech tagger, lemmatizer, and dependency parser
- Utils: Add Stanza's Sindhi part-of-speech tagger
- Utils: Add VADER's sentiment analyzers
- Work Area: Add Colligation Extractor - Filter results - Node/Colligation length
@@ -77,7 +79,7 @@
- Dependencies: Upgrade simplemma to 1.0.0
- Dependencies: Upgrade spaCy to 3.7.5
- Dependencies: Upgrade spacy-pkuseg to 0.0.33
-- Dependencies: Upgrade Stanza to 1.7.0
+- Dependencies: Upgrade Stanza to 1.8.2
- Dependencies: Upgrade SudachiPy to 0.6.8
- Dependencies: Upgrade Underthesea to 6.8.4
- Dependencies: Upgrade wordcloud to 1.9.3
2 changes: 1 addition & 1 deletion doc/trs/zho_cn/ACKS.md
@@ -49,7 +49,7 @@
25|[simplemma](https://github.com/adbar/simplemma)|1.0.0|Adrien Barbaresi|[MIT](https://github.com/adbar/simplemma/blob/main/LICENSE)
26|[spaCy](https://spacy.io/)|3.7.5|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
27|[spacy-pkuseg](https://github.com/explosion/spacy-pkuseg)|0.0.33|罗睿轩, 许晶晶, 任宣丞, 张艺, 张之远, 位冰镇, 孙栩<br>Adriane Boyd, Ines Montani|[MIT](https://github.com/explosion/spacy-pkuseg/blob/master/LICENSE)
-28|[Stanza](https://github.com/stanfordnlp/stanza)|1.7.0|齐鹏, 张宇浩, 张钰晖,<br>Jason Bolton, Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
+28|[Stanza](https://github.com/stanfordnlp/stanza)|1.8.2|齐鹏, 张宇浩, 张钰晖,<br>Jason Bolton, Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
29|[SudachiPy](https://github.com/WorksApplications/sudachi.rs)|0.6.8|Works Applications Co., Ltd.|[Apache-2.0](https://github.com/WorksApplications/sudachi.rs/blob/develop/LICENSE)
30|[Underthesea](https://undertheseanlp.com/)|6.8.4|Vu Anh|[GPL-3.0-or-later](https://github.com/undertheseanlp/underthesea/blob/main/LICENSE)
31|[VADER](https://github.com/cjhutto/vaderSentiment)|3.3.2|C.J. Hutto|[MIT](https://github.com/cjhutto/vaderSentiment/blob/master/LICENSE.txt)
2 changes: 1 addition & 1 deletion doc/trs/zho_tw/ACKS.md
@@ -49,7 +49,7 @@
25|[simplemma](https://github.com/adbar/simplemma)|1.0.0|Adrien Barbaresi|[MIT](https://github.com/adbar/simplemma/blob/main/LICENSE)
26|[spaCy](https://spacy.io/)|3.7.5|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
27|[spacy-pkuseg](https://github.com/explosion/spacy-pkuseg)|0.0.33|罗睿轩, 许晶晶, 任宣丞, 张艺, 张之远, 位冰镇, 孙栩<br>Adriane Boyd, Ines Montani|[MIT](https://github.com/explosion/spacy-pkuseg/blob/master/LICENSE)
-28|[Stanza](https://github.com/stanfordnlp/stanza)|1.7.0|齐鹏, 张宇浩, 张钰晖,<br>Jason Bolton, Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
+28|[Stanza](https://github.com/stanfordnlp/stanza)|1.8.2|齐鹏, 张宇浩, 张钰晖,<br>Jason Bolton, Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
29|[SudachiPy](https://github.com/WorksApplications/sudachi.rs)|0.6.8|Works Applications Co., Ltd.|[Apache-2.0](https://github.com/WorksApplications/sudachi.rs/blob/develop/LICENSE)
30|[Underthesea](https://undertheseanlp.com/)|6.8.4|Vu Anh|[GPL-3.0-or-later](https://github.com/undertheseanlp/underthesea/blob/main/LICENSE)
31|[VADER](https://github.com/cjhutto/vaderSentiment)|3.3.2|C.J. Hutto|[MIT](https://github.com/cjhutto/vaderSentiment/blob/master/LICENSE.txt)
2 changes: 1 addition & 1 deletion requirements/requirements_tests.txt
@@ -28,7 +28,7 @@ pyphen == 0.15.0
pythainlp == 5.0.4
sacremoses == 0.1.1
simplemma == 1.0.0
-stanza == 1.7.0
+stanza == 1.8.2
underthesea == 6.8.4
vaderSentiment == 3.3.2

4 changes: 2 additions & 2 deletions tests/tests_nlp/test_lemmatization.py
@@ -74,7 +74,7 @@ def test_lemmatize(lang, lemmatizer):
case 'bul':
results = ['бъ̀лгарският', 'езѝк', 'съм', 'индоевропейски', 'език', 'от', 'група', 'на', 'южнославянскит', 'език', ',', 'като', 'образувам', 'негова', 'източен', 'подгрупа', '.']
case 'cat':
-results = ['ell', 'català', '(', 'denominació', 'oficial', 'a', 'Catalunya', ',', 'a', 'el', 'illa', 'balear', ',', 'a', 'Andorra', ',', 'a', 'el', 'ciutat', 'de', "l'Alguer", 'i', 'tradicional', 'a', 'Catalunya', 'del', 'nord', ')', 'o', 'valencià', '(', 'denominació', 'oficial', 'al', 'pair', 'valencià', 'i', 'tradicional', 'al', 'Carxe', ')', 'ser', 'un', 'llengua', 'romànic', 'parlar', 'a', 'Catalunya', ',', 'ell', 'pair', 'valencià', '(', 'treure', "d'algunes", 'comarca', 'i', 'localitat', 'de', "l'interior", ')', ',', 'el', 'illa', 'balear', '(', 'on', 'també', 'rebre', 'ell', 'nòmer', 'de', 'mallorquí', ',', 'menorquí', ',', 'eivissenc', 'o', 'formenterer', 'segon', "l'illa", ')', ',', 'Andorra', ',', 'el', 'franjar', 'de', 'pondre', '(', 'a', "l'Aragó", ')', ',', 'el', 'ciutat', 'de', "l'Alguer", '(', 'a', "l'illa", 'de', 'Sardenya', ')', ',', 'el', 'Catalunya', 'del', 'nord', ',', '[', '8', ']', 'ell', 'Carxe', '(', 'un', 'petit', 'territori', 'de', 'Múrcia', 'habitar', 'per', 'poblador', 'valencià', ')', ',', '[', '9', ']', '[', '10', ']', 'i', 'en', 'comunitat', 'arreu', 'del', 'món', '(', 'entrar', 'el', 'qual', 'destacar', 'el', 'de', "l'Argentina", ',', 'amb', '200.000', 'parlant', ')', '.', '[', '11', ']']
+results = ['ell', 'català', 'tenir', 'cinc', 'gran', 'dialecte', '(', 'valencià', ',', 'nord-occidental', ',', 'central', ',', 'balear', 'i', 'rossellonès', ')', 'que', 'juntament', 'amb', "l'alguerès", ',', 'ell', 'dividir', 'fi', 'a', 'vint-i-un', 'varietat', 'i', "s'agrupen", 'en', 'dosar', 'gran', 'bloc', ':', 'ell', 'català', 'occidental', 'i', 'ell', 'català', 'oriental', '.']
case 'hrv':
results = ['hrvatski', 'jezik', '(', 'ISO', '639-3', ':', 'hrv', ')', 'skupni', 'ju', 'naziv', 'за', 'nacionalni', 'standardni', 'jezik', 'Hrvat', ',', 'ti', 'за', 'skup', 'narječje', 'i', 'govora', 'kojima', 'govoriti', 'ili', 'biti', 'nekada', 'govoriti', 'Hrvat', '.']
case 'ces':
@@ -240,7 +240,7 @@ def test_lemmatize(lang, lemmatizer):
case _:
tests_lang_util_skipped = True
case 'urd':
-results = ['اُردُو[8', ']', 'برصغیر', 'کم', 'معیاری', 'زبان', 'میں', 'سے', 'ایک', 'ہونا', '۔']
+results = ['1837ء', 'میں', '،', 'اردو', 'برطانوی', 'ایسٹ', 'انڈیا', 'کمپنی', 'کم', 'سرکاری', 'زبان', 'بننا', 'جانا', '،', 'کمپنی', 'کم', 'دور', 'میں', 'پورا', 'شمالی', 'ہندوستان', 'میں', 'فارسی', 'کم', 'جگہ', 'لینا', 'جانا', '۔']
case 'cym':
results = ['yn', 'cyfrifiad', 'yr', 'tu', '(', '2011', ')', ',', 'darganfod', 'bodio', '19', '%', '(', '562,000', ')', 'prpers', 'preswylwr', 'cymru', '(', 'tair', 'blwydd', 'a', 'trosodd', ')', 'bod', 'gallu', 'siarad', 'cymraeg', '.']
case _:
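
The updated expectations above presumably come from re-running the lemmatizers against the upgraded models. Below is a minimal sketch of how such reference lemmas could be regenerated with Stanza, assuming (unverified) that the 'urd' case is backed by Stanza's Urdu ("ur") models; the sample sentence is made up for illustration.

```python
# Hypothetical helper for regenerating expected lemmas; not part of Wordless's
# test suite. Assumes Stanza's Urdu ("ur") models back the 'urd' test case.
import stanza

stanza.download('ur')  # fetch the Urdu models once
nlp = stanza.Pipeline('ur', processors='tokenize,pos,lemma')

text = 'اردو برصغیر کی معیاری زبانوں میں سے ایک ہے۔'  # sample sentence, not the test fixture
doc = nlp(text)

# Flatten the document into the list-of-lemmas shape used in the test expectations.
lemmas = [word.lemma for sentence in doc.sentences for word in sentence.words]
print(lemmas)
```
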