Dependencies: Upgrade Stanza to 1.8.2; Utils: 1. Add Stanza's Armenian (Classical) sentence tokenizer, word tokenizer, part-of-speech tagger, lemmatizer, and dependency parser 2. Add Stanza's English (Old) sentence tokenizer, word tokenizer, part-of-speech tagger, lemmatizer, and dependency parser
BLKSerene committed Jun 23, 2024
1 parent 6d61265 commit a546a3e
Showing 37 changed files with 261 additions and 83 deletions.
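
For reference, the processors named in the commit message correspond to Stanza's standard pipeline stages. Below is a minimal sketch, not code from this commit, of how one of the newly supported pipelines might be invoked; the language codes "xcl" (Armenian, Classical) and "ang" (English, Old) are assumptions and should be checked against Stanza 1.8.2's model list.

```python
# Sketch only; NOT code from the Wordless repository. The language codes below
# ("xcl" for Armenian (Classical), "ang" for English (Old)) are assumed, not verified.
import stanza

lang = 'ang'  # or 'xcl'

# Download the models once (network access required), then build a pipeline with
# the processors this commit wires up: sentence/word tokenization, POS tagging,
# lemmatization, and dependency parsing.
stanza.download(lang)
nlp = stanza.Pipeline(lang, processors='tokenize,pos,lemma,depparse')

doc = nlp('Hwæt! We Gardena in geardagum þeodcyninga þrym gefrunon.')

for sentence in doc.sentences:
    for word in sentence.words:
        # Each word exposes its lemma, universal POS tag, and dependency head/relation.
        print(word.text, word.lemma, word.upos, word.head, word.deprel)
```
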
2 changes: 1 addition & 1 deletion ACKS.md
@@ -49,7 +49,7 @@ As Wordless stands on the shoulders of giants, I hereby extend my sincere gratit
25|[simplemma](https://github.com/adbar/simplemma)|1.0.0|Adrien Barbaresi|[MIT](https://github.com/adbar/simplemma/blob/main/LICENSE)
26|[spaCy](https://spacy.io/)|3.7.5|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
27|[spacy-pkuseg](https://github.com/explosion/spacy-pkuseg)|0.0.33|Ruixuan Luo (罗睿轩), Jingjing Xu (许晶晶),<br>Xuancheng Ren (任宣丞), Yi Zhang (张艺),<br>Zhiyuan Zhang (张之远), Bingzhen Wei (位冰镇),<br>Xu Sun (孙栩)<br>Adriane Boyd, Ines Montani|[MIT](https://github.com/explosion/spacy-pkuseg/blob/master/LICENSE)
-28|[Stanza](https://github.com/stanfordnlp/stanza)|1.7.0|Peng Qi (齐鹏), Yuhao Zhang (张宇浩),<br>Yuhui Zhang (张钰晖), Jason Bolton,<br>Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
+28|[Stanza](https://github.com/stanfordnlp/stanza)|1.8.2|Peng Qi (齐鹏), Yuhao Zhang (张宇浩),<br>Yuhui Zhang (张钰晖), Jason Bolton,<br>Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
29|[SudachiPy](https://github.com/WorksApplications/sudachi.rs)|0.6.8|Works Applications Co., Ltd.|[Apache-2.0](https://github.com/WorksApplications/sudachi.rs/blob/develop/LICENSE)
30|[Underthesea](https://undertheseanlp.com/)|6.8.4|Vu Anh|[GPL-3.0-or-later](https://github.com/undertheseanlp/underthesea/blob/main/LICENSE)
31|[VADER](https://github.com/cjhutto/vaderSentiment)|3.3.2|C.J. Hutto|[MIT](https://github.com/cjhutto/vaderSentiment/blob/master/LICENSE.txt)
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -29,6 +29,8 @@
- Utils: Add Pyphen's Basque syllable tokenizer
- Utils: Add PyThaiNLP's Han-solo
- Utils: Add spaCy's Faroese and Norwegian (Nynorsk) word tokenizers
+- Utils: Add Stanza's Armenian (Classical) sentence tokenizer, word tokenizer, part-of-speech tagger, lemmatizer, and dependency parser
+- Utils: Add Stanza's English (Old) sentence tokenizer, word tokenizer, part-of-speech tagger, lemmatizer, and dependency parser
- Utils: Add Stanza's Sindhi part-of-speech tagger
- Utils: Add VADER's sentiment analyzers
- Work Area: Add Colligation Extractor - Filter results - Node/Colligation length
@@ -77,7 +79,7 @@
- Dependencies: Upgrade simplemma to 1.0.0
- Dependencies: Upgrade spaCy to 3.7.5
- Dependencies: Upgrade spacy-pkuseg to 0.0.33
-- Dependencies: Upgrade Stanza to 1.7.0
+- Dependencies: Upgrade Stanza to 1.8.2
- Dependencies: Upgrade SudachiPy to 0.6.8
- Dependencies: Upgrade Underthesea to 6.8.4
- Dependencies: Upgrade wordcloud to 1.9.3
2 changes: 1 addition & 1 deletion doc/trs/zho_cn/ACKS.md
@@ -49,7 +49,7 @@
25|[simplemma](https://github.com/adbar/simplemma)|1.0.0|Adrien Barbaresi|[MIT](https://github.com/adbar/simplemma/blob/main/LICENSE)
26|[spaCy](https://spacy.io/)|3.7.5|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
27|[spacy-pkuseg](https://github.com/explosion/spacy-pkuseg)|0.0.33|罗睿轩, 许晶晶, 任宣丞, 张艺, 张之远, 位冰镇, 孙栩<br>Adriane Boyd, Ines Montani|[MIT](https://github.com/explosion/spacy-pkuseg/blob/master/LICENSE)
-28|[Stanza](https://github.com/stanfordnlp/stanza)|1.7.0|齐鹏, 张宇浩, 张钰晖,<br>Jason Bolton, Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
+28|[Stanza](https://github.com/stanfordnlp/stanza)|1.8.2|齐鹏, 张宇浩, 张钰晖,<br>Jason Bolton, Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
29|[SudachiPy](https://github.com/WorksApplications/sudachi.rs)|0.6.8|Works Applications Co., Ltd.|[Apache-2.0](https://github.com/WorksApplications/sudachi.rs/blob/develop/LICENSE)
30|[Underthesea](https://undertheseanlp.com/)|6.8.4|Vu Anh|[GPL-3.0-or-later](https://github.com/undertheseanlp/underthesea/blob/main/LICENSE)
31|[VADER](https://github.com/cjhutto/vaderSentiment)|3.3.2|C.J. Hutto|[MIT](https://github.com/cjhutto/vaderSentiment/blob/master/LICENSE.txt)
2 changes: 1 addition & 1 deletion doc/trs/zho_tw/ACKS.md
@@ -49,7 +49,7 @@
25|[simplemma](https://github.com/adbar/simplemma)|1.0.0|Adrien Barbaresi|[MIT](https://github.com/adbar/simplemma/blob/main/LICENSE)
26|[spaCy](https://spacy.io/)|3.7.5|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
27|[spacy-pkuseg](https://github.com/explosion/spacy-pkuseg)|0.0.33|罗睿轩, 许晶晶, 任宣丞, 张艺, 张之远, 位冰镇, 孙栩<br>Adriane Boyd, Ines Montani|[MIT](https://github.com/explosion/spacy-pkuseg/blob/master/LICENSE)
-28|[Stanza](https://github.com/stanfordnlp/stanza)|1.7.0|齐鹏, 张宇浩, 张钰晖,<br>Jason Bolton, Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
+28|[Stanza](https://github.com/stanfordnlp/stanza)|1.8.2|齐鹏, 张宇浩, 张钰晖,<br>Jason Bolton, Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
29|[SudachiPy](https://github.com/WorksApplications/sudachi.rs)|0.6.8|Works Applications Co., Ltd.|[Apache-2.0](https://github.com/WorksApplications/sudachi.rs/blob/develop/LICENSE)
30|[Underthesea](https://undertheseanlp.com/)|6.8.4|Vu Anh|[GPL-3.0-or-later](https://github.com/undertheseanlp/underthesea/blob/main/LICENSE)
31|[VADER](https://github.com/cjhutto/vaderSentiment)|3.3.2|C.J. Hutto|[MIT](https://github.com/cjhutto/vaderSentiment/blob/master/LICENSE.txt)
2 changes: 1 addition & 1 deletion requirements/requirements_tests.txt
@@ -28,7 +28,7 @@ pyphen == 0.15.0
pythainlp == 5.0.4
sacremoses == 0.1.1
simplemma == 1.0.0
-stanza == 1.7.0
+stanza == 1.8.2
underthesea == 6.8.4
vaderSentiment == 3.3.2

4 changes: 2 additions & 2 deletions tests/tests_nlp/test_lemmatization.py
@@ -74,7 +74,7 @@ def test_lemmatize(lang, lemmatizer):
case 'bul':
results = ['бъ̀лгарският', 'езѝк', 'съм', 'индоевропейски', 'език', 'от', 'група', 'на', 'южнославянскит', 'език', ',', 'като', 'образувам', 'негова', 'източен', 'подгрупа', '.']
case 'cat':
-results = ['ell', 'català', '(', 'denominació', 'oficial', 'a', 'Catalunya', ',', 'a', 'el', 'illa', 'balear', ',', 'a', 'Andorra', ',', 'a', 'el', 'ciutat', 'de', "l'Alguer", 'i', 'tradicional', 'a', 'Catalunya', 'del', 'nord', ')', 'o', 'valencià', '(', 'denominació', 'oficial', 'al', 'pair', 'valencià', 'i', 'tradicional', 'al', 'Carxe', ')', 'ser', 'un', 'llengua', 'romànic', 'parlar', 'a', 'Catalunya', ',', 'ell', 'pair', 'valencià', '(', 'treure', "d'algunes", 'comarca', 'i', 'localitat', 'de', "l'interior", ')', ',', 'el', 'illa', 'balear', '(', 'on', 'també', 'rebre', 'ell', 'nòmer', 'de', 'mallorquí', ',', 'menorquí', ',', 'eivissenc', 'o', 'formenterer', 'segon', "l'illa", ')', ',', 'Andorra', ',', 'el', 'franjar', 'de', 'pondre', '(', 'a', "l'Aragó", ')', ',', 'el', 'ciutat', 'de', "l'Alguer", '(', 'a', "l'illa", 'de', 'Sardenya', ')', ',', 'el', 'Catalunya', 'del', 'nord', ',', '[', '8', ']', 'ell', 'Carxe', '(', 'un', 'petit', 'territori', 'de', 'Múrcia', 'habitar', 'per', 'poblador', 'valencià', ')', ',', '[', '9', ']', '[', '10', ']', 'i', 'en', 'comunitat', 'arreu', 'del', 'món', '(', 'entrar', 'el', 'qual', 'destacar', 'el', 'de', "l'Argentina", ',', 'amb', '200.000', 'parlant', ')', '.', '[', '11', ']']
+results = ['ell', 'català', 'tenir', 'cinc', 'gran', 'dialecte', '(', 'valencià', ',', 'nord-occidental', ',', 'central', ',', 'balear', 'i', 'rossellonès', ')', 'que', 'juntament', 'amb', "l'alguerès", ',', 'ell', 'dividir', 'fi', 'a', 'vint-i-un', 'varietat', 'i', "s'agrupen", 'en', 'dosar', 'gran', 'bloc', ':', 'ell', 'català', 'occidental', 'i', 'ell', 'català', 'oriental', '.']
case 'hrv':
results = ['hrvatski', 'jezik', '(', 'ISO', '639-3', ':', 'hrv', ')', 'skupni', 'ju', 'naziv', 'за', 'nacionalni', 'standardni', 'jezik', 'Hrvat', ',', 'ti', 'за', 'skup', 'narječje', 'i', 'govora', 'kojima', 'govoriti', 'ili', 'biti', 'nekada', 'govoriti', 'Hrvat', '.']
case 'ces':
@@ -240,7 +240,7 @@ def test_lemmatize(lang, lemmatizer):
case _:
tests_lang_util_skipped = True
case 'urd':
-results = ['اُردُو[8', ']', 'برصغیر', 'کم', 'معیاری', 'زبان', 'میں', 'سے', 'ایک', 'ہونا', '۔']
+results = ['1837ء', 'میں', '،', 'اردو', 'برطانوی', 'ایسٹ', 'انڈیا', 'کمپنی', 'کم', 'سرکاری', 'زبان', 'بننا', 'جانا', '،', 'کمپنی', 'کم', 'دور', 'میں', 'پورا', 'شمالی', 'ہندوستان', 'میں', 'فارسی', 'کم', 'جگہ', 'لینا', 'جانا', '۔']
case 'cym':
results = ['yn', 'cyfrifiad', 'yr', 'tu', '(', '2011', ')', ',', 'darganfod', 'bodio', '19', '%', '(', '562,000', ')', 'prpers', 'preswylwr', 'cymru', '(', 'tair', 'blwydd', 'a', 'trosodd', ')', 'bod', 'gallu', 'siarad', 'cymraeg', '.']
case _:
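
The updated expectations above presumably come from re-running the lemmatizers against the upgraded models. Below is a minimal sketch of how such reference lemmas could be regenerated with Stanza, assuming (unverified) that the 'urd' case is backed by Stanza's Urdu ("ur") models; the sample sentence is made up for illustration.

```python
# Hypothetical helper for regenerating expected lemmas; not part of Wordless's
# test suite. Assumes Stanza's Urdu ("ur") models back the 'urd' test case.
import stanza

stanza.download('ur')  # fetch the Urdu models once
nlp = stanza.Pipeline('ur', processors='tokenize,pos,lemma')

text = 'اردو برصغیر کی معیاری زبانوں میں سے ایک ہے۔'  # sample sentence, not the test fixture
doc = nlp(text)

# Flatten the document into the list-of-lemmas shape used in the test expectations.
lemmas = [word.lemma for sentence in doc.sentences for word in sentence.words]
print(lemmas)
```
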