[ENH] Filter - Use ISO language in StopwordsFilter #1024

Merged on Nov 24, 2023 (4 commits)

Conversation

PrimozGodec
Collaborator

@PrimozGodec PrimozGodec commented Nov 17, 2023

Issue

This PR is part of #963, which I am splitting into smaller pieces for easier review.
The main motivation is to make Preprocess work with the language stored in the Corpus.

Description of changes

This PR prepares the stop word filter to accept and return languages as ISO codes, which is necessary for using the language from the Corpus (the Corpus stores languages as ISO codes).

After changing the stop word filter to work with ISO language codes, I also had to adapt the Preprocess widget to store its settings as ISO codes and to call StopwordsFilter with an ISO language code.
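To illustrate the idea of a filter that speaks ISO codes externally while using resource-specific names internally, here is a minimal sketch. The `ISO2NLTK` table and both helper functions are hypothetical and are not taken from this PR; the names on the right-hand side are assumed to follow NLTK's stop word file naming.

```python
# Hypothetical subset of a mapping from ISO 639-1 codes to the names
# used by the stop word resource (assumed to follow NLTK file names).
ISO2NLTK = {"en": "english", "it": "italian", "sl": "slovene"}


def supported_languages():
    """Return the ISO codes the filter can offer to callers, sorted."""
    return sorted(ISO2NLTK)


def nltk_name(iso_code):
    """Translate an ISO code into the resource-specific language name."""
    try:
        return ISO2NLTK[iso_code]
    except KeyError:
        # Fail with a message that names the offending code.
        raise ValueError(
            f"No stop word list for language code {iso_code!r}"
        ) from None
```

With this shape, callers such as a widget only ever see ISO codes, and the translation to resource names stays inside the filter.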

Includes
  • Code changes
  • Tests
  • Documentation

@codecov-commenter

codecov-commenter commented Nov 17, 2023

Codecov Report

Merging #1024 (3b5004f) into master (87a7580) will increase coverage by 0.08%.
The diff coverage is 96.22%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1024      +/-   ##
==========================================
+ Coverage   82.10%   82.19%   +0.08%     
==========================================
  Files          93       93              
  Lines       12257    12292      +35     
  Branches     1660     1668       +8     
==========================================
+ Hits        10064    10103      +39     
+ Misses       1881     1879       -2     
+ Partials      312      310       -2     

@PrimozGodec PrimozGodec changed the title from "[ENH] Filter - language from corpus in StopwordsFilter" to "[ENH] Filter - Use ISO language in StopwordsFilter" Nov 17, 2023
@PrimozGodec PrimozGodec marked this pull request as ready for review November 17, 2023 14:26
Contributor

@VesnaT VesnaT left a comment

I get the following error when opening an old workflow:

----------------------------- KeyError Exception ------------------------------
Traceback (most recent call last):
  File "/Users/vesna/orange-widget-base/orangewidget/settings.py", line 592, in _migrate_settings
    self.widget_class.migrate_settings(
  File "/Users/vesna/orange3-text/orangecontrib/text/widgets/owpreprocess.py", line 1386, in migrate_settings
    pp["language"] = StopwordsFilter.lang_to_iso(pp["language"])
  File "/Users/vesna/orange3-text/orangecontrib/text/preprocess/filter.py", line 118, in lang_to_iso
    return LANG2ISO[StopwordsFilter.NLTK2LANG.get(language, language)]
KeyError: 'it'
-------------------------------------------------------------------------------
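One plausible reading of the traceback is that the stored setting already held an ISO code (`'it'`), so a lookup keyed by language name failed. The sketch below shows a conversion that tolerates that case. The table contents and the name `lang_to_iso_tolerant` are hypothetical stand-ins for the `LANG2ISO` and `NLTK2LANG` mappings named in the traceback, and this is not necessarily the fix that was merged.

```python
# Hypothetical stand-ins for the mappings named in the traceback;
# the real tables in orange3-text are larger and may differ.
LANG2ISO = {"English": "en", "Italian": "it", "Slovenian": "sl"}
NLTK2LANG = {"slovene": "Slovenian"}  # resource names that differ


def lang_to_iso_tolerant(language):
    """Return an ISO code for a language name, resource name, or ISO code."""
    # Values that are already ISO codes pass through unchanged,
    # which avoids the KeyError when the setting was migrated before.
    if language in LANG2ISO.values():
        return language
    # Otherwise normalize resource-specific names, then look up the code.
    name = NLTK2LANG.get(language, language)
    return LANG2ISO[name]
```

A settings migration could call such a function on the stored value without first checking which format the old workflow used.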

@VesnaT VesnaT merged commit b4367d5 into biolab:master Nov 24, 2023
12 checks passed
@PrimozGodec PrimozGodec deleted the language-filter branch November 24, 2023 09:41