Textblob not finding the downloaded corpora #474

Closed
cagan-elden opened this issue Sep 26, 2024 · 8 comments

Comments

@cagan-elden

python -m textblob.download_corpora

Although I downloaded the corpora as instructed by the error message, it still does not work.
I'm not sure whether the NLTK library is the cause, because I've installed that too.

@shoaib-fixes

I found that upgrading from NLTK 3.8.1 to 3.9.1 broke my project. I now get errors asking me to run:

python -m textblob.download_corpora

Previously you could download the textblob corpora on one account and they could be found by another account. This is no longer the case.

Moving back to NLTK 3.8.1 fixed it. I can reproduce the issue by upgrading to 3.9.1 again.
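
One way to check whether the corpora are visible to the account running the code is to list which data directories actually contain each resource. The sketch below is a minimal, mostly stdlib helper, not part of TextBlob or NLTK: `find_resource_dirs` is a hypothetical name, and it assumes a resource sits either unpacked or as a `.zip` archive under `<data dir>/<category>/<name>`, which is how the NLTK downloader usually lays files out. In real use the search list would be `nltk.data.path`.

```python
from pathlib import Path


def find_resource_dirs(search_dirs, category, name):
    """Return the data directories that contain category/name.

    Assumption: a resource is either an unpacked directory/file or a
    .zip archive directly under <data dir>/<category>/.
    """
    hits = []
    for base in search_dirs:
        folder = Path(base) / category
        if (folder / name).exists() or (folder / f"{name}.zip").exists():
            hits.append(str(base))
    return hits


if __name__ == "__main__":
    try:
        import nltk  # optional: only needed to read the real search path
        search_dirs = list(nltk.data.path)
        print("NLTK version:", nltk.__version__)
    except ImportError:
        search_dirs = []
        print("NLTK is not installed in this environment")

    # Resource names taken from the list posted later in this thread.
    for category, name in [("tokenizers", "punkt_tab"),
                           ("taggers", "averaged_perceptron_tagger_eng")]:
        found = find_resource_dirs(search_dirs, category, name)
        print(f"{category}/{name}:", found or "not found")
```

If a resource shows up only under one user's home directory, a process running under a different account will not see it unless that directory is added to the search path.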

@Ajaychaki2004

The problem is due to the NLTK version. Moving back to NLTK 3.8.1 can help rectify the error.

@shoaib-fixes

To follow up on this, I fixed it by specifying the NLTK data path and telling NLTK where to look like this:

def download_nltk_resources(self):
    """
    Downloads required NLTK resources if not already present.
    """
    import nltk
    import os
    
    # Use the environment variable or fall back to default
    nltk_data_path = os.getenv('NLTK_DATA', '/usr/local/share/nltk_data')
    
    # Ensure the directory exists
    os.makedirs(nltk_data_path, exist_ok=True)
    
    # Add our path to NLTK's data path
    nltk.data.path.insert(0, nltk_data_path)
    
    print(f"Using NLTK data path: {nltk_data_path}")
    
    required_resources = {
        'averaged_perceptron_tagger': ('taggers', 'averaged_perceptron_tagger'),
        'averaged_perceptron_tagger_eng': ('taggers', 'averaged_perceptron_tagger_eng'),
        'punkt': ('tokenizers', 'punkt'),
        'punkt_tab': ('tokenizers/punkt_tab', 'english'),
        'movie_reviews': ('corpora', 'movie_reviews'),
        'brown': ('corpora', 'brown'),
        'conll2000': ('corpora', 'conll2000'),
        'wordnet': ('corpora', 'wordnet')
    }
    
    # Download and verify all resources
    for resource, (folder, name) in required_resources.items():
        try:
            nltk.data.find(f'{folder}/{name}')
        except LookupError:
            print(f"Downloading {resource}...")
            nltk.download(resource, download_dir=nltk_data_path, quiet=True)

with NLTK_DATA specified as an environment variable.

Then do something like this:

# Inside a TextParser method: download resources only once at the start
if not hasattr(TextParser, '_resources_checked'):
    self.download_nltk_resources()
    TextParser._resources_checked = True
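
The once-only check above can be tried in isolation. The following is a minimal sketch, assuming a `TextParser` class whose body is invented here for illustration: the download method is a placeholder that only counts calls, and the flag is declared as an explicit class attribute instead of being probed with `hasattr`.

```python
class TextParser:
    """Stand-in for the class in the comment above (hypothetical body)."""

    _resources_checked = False  # class-level flag shared by all instances
    download_calls = 0          # counter for demonstration only

    def __init__(self):
        # Run the (potentially slow) resource check once per process,
        # not once per instance.
        if not TextParser._resources_checked:
            self.download_nltk_resources()
            TextParser._resources_checked = True

    def download_nltk_resources(self):
        # Placeholder: the real method would call nltk.data.find / nltk.download.
        TextParser.download_calls += 1


first = TextParser()
second = TextParser()
print(TextParser.download_calls)  # prints 1: only the first instance ran the check
```

Because the flag lives on the class, it resets with every new Python process, so each process still performs one check at startup.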

@jimedevelopers

How can this issue be solved?

@Ajaychaki2004

> Moving back to NLTK 3.8.1 fixed it. I can reproduce the issue by upgrading to 3.9.1 again.

By downgrading NLTK you can solve the issue.

@shoaib-fixes

> > Moving back to NLTK 3.8.1 fixed it. I can reproduce the issue by upgrading to 3.9.1 again.
>
> By downgrading NLTK you can solve the issue.

Just be aware that NLTK <3.9 contains a critical security vulnerability, so you're better off specifying the data path as I suggested rather than using an older, insecure version.
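
To confirm that an environment is not running an NLTK release older than 3.9, the installed version can be read from package metadata. This is a minimal sketch under stated assumptions: `version_tuple` and `nltk_is_at_least` are hypothetical helpers, and the parser compares only the leading major and minor numbers, so it is not a full version-specifier implementation.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text, width=2):
    """Parse the leading numeric components, e.g. '3.9.1' -> (3, 9)."""
    parts = []
    for piece in text.split(".")[:width]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < width:
        parts.append(0)
    return tuple(parts)


def nltk_is_at_least(minimum="3.9"):
    """Return True/False for the installed NLTK, or None if it is not installed."""
    try:
        installed = version("nltk")
    except PackageNotFoundError:
        return None
    return version_tuple(installed) >= version_tuple(minimum)


print(nltk_is_at_least("3.9"))
```

A result of `False` means the environment is still on the older release line and would need the data-path approach after upgrading.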

@Ajaychaki2004

Can you explain the solution in detail?

sloria added a commit that referenced this issue Jan 13, 2025
* fix: update corpora module names

Updates corpora module names to fix a missing corpora error when
running:

python -m textblob.download_corpora

This should fix CI errors and #482 and #474

* chore: update Python versions and CI

Updates supported Python versions to be 3.9-3.13 and updates CI to use
the built-in textblob.download_corpora command

* fix: corpora download in CI

* fix: bring back lowest env in Tox/CI

Adds back the "lowest" env in Tox/CI to ensure support in the lowest supported Python + NLTK versions

* chore: add johnfraney to Authors.rst

* Update changelog

* Update changelog

---------

Co-authored-by: Steven Loria <sloria1@gmail.com>
@sloria
Owner

sloria commented Jan 13, 2025

Fixed in 0.19.0.
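
After upgrading and re-running `python -m textblob.download_corpora`, a quick smoke test can show whether the corpora are now found. The sketch below is an illustration only: `corpora_status` is a hypothetical helper, and it assumes the default tokenizer and part-of-speech tagger are in use.

```python
def corpora_status(sample="TextBlob finds its corpora."):
    """Return 'ok', 'missing corpora', or 'textblob not installed'."""
    try:
        from textblob import TextBlob
        from textblob.exceptions import MissingCorpusError
    except ImportError:
        return "textblob not installed"
    try:
        # Part-of-speech tagging needs both tokenizer and tagger data.
        TextBlob(sample).tags
    except (MissingCorpusError, LookupError):
        return "missing corpora"
    return "ok"


print(corpora_status())
```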

@sloria sloria closed this as completed Jan 13, 2025