FIX: propagate dtype to idf_ for large corpuses in PandasNormalizedTfidfVectorizer #26

Merged
merged 17 commits into from
Oct 7, 2024


chrispyl
Collaborator

@chrispyl chrispyl commented Oct 6, 2024

Issue

Fitting PandasNormalizedTfidfVectorizer on a large X results in idf_ having dtype np.float64 regardless of the dtype provided. The cause is in TfidfTransformer, which relies on whatever dtype np.log returns instead of applying the dtype parameter explicitly.

Reproduction

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import uuid

# small data works fine
small_data = [str(uuid.uuid4()) for _ in range(100)]
X = pd.Series(small_data)
vectorizer = TfidfVectorizer(dtype=np.float32)
vectorizer.fit(X)
print(vectorizer.idf_.dtype)

# large data does not preserve dtype
large_data = [str(uuid.uuid4()) for _ in range(1000000)]
X = pd.Series(large_data)
vectorizer = TfidfVectorizer(dtype=np.float32)
vectorizer.fit(X)
print(vectorizer.idf_.dtype)

Solution

Cast idf_ to the requested dtype right after fitting, before it is used in any other operation, to avoid surprises. This is a temporary workaround until the issue is fixed in scikit-learn.
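A minimal sketch of the cast described above, using only NumPy. The helper name `cast_idf` is hypothetical and is not taken from this PR or from scikit-learn; the array stands in for an idf_ vector that came back as np.float64. How the result is assigned back to a fitted vectorizer is left out, since that may depend on the scikit-learn version.

```python
import numpy as np


def cast_idf(idf, dtype):
    """Return idf as an array of the requested floating dtype.

    Hypothetical helper for illustration, not part of scikit-learn.
    """
    dtype = np.dtype(dtype)
    if not np.issubdtype(dtype, np.floating):
        raise TypeError(f"idf_ needs a floating dtype, got {dtype}")
    # copy=False skips the copy when the array already has the right dtype
    return np.asarray(idf).astype(dtype, copy=False)


# Stand-in for an idf_ vector that ended up as float64 after fitting
idf_float64 = np.log(np.array([10.0, 4.0, 2.0], dtype=np.float64)) + 1.0
idf_float32 = cast_idf(idf_float64, np.float32)
print(idf_float64.dtype, "->", idf_float32.dtype)
```

Doing the cast once, immediately after fitting, keeps later operations on idf_ in the requested dtype.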

Sklearn issue background

Not propagating dtype in TfidfTransformer is a known issue. Passing dtype to np.log is not an option: in this thread they mention that it breaks functionality when dtype is an integer type. Adding .astype(dtype) after the log operation was proposed, but it was later removed.
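The snippet below illustrates, as a sketch with made-up input values, why an integer dtype is awkward for both options. It is one plausible reading of the concern raised in the scikit-learn discussion, not a reproduction of it.

```python
import numpy as np

# Made-up n_samples / df ratios, for illustration only
ratios = np.array([1.0, 2.0, 8.0])

# Requesting an integer output dtype from np.log fails,
# because the ufunc has no integer loop
try:
    np.log(ratios, dtype=np.int64)
    log_with_int_dtype_failed = False
except TypeError:
    log_with_int_dtype_failed = True

# Casting after the log runs, but truncates the idf values toward zero
idf_float = np.log(ratios) + 1.0
idf_int = idf_float.astype(np.int64)

print(log_with_int_dtype_failed, idf_float, idf_int)
```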

I have opened a new issue in scikit-learn.

@chrispyl chrispyl changed the title Correct propagation of dtype to idf_ for large corpuses in PandasNormalizedTfidfVectorizer FIX: Correct propagation of dtype to idf_ for large corpuses in PandasNormalizedTfidfVectorizer Oct 6, 2024
@chrispyl chrispyl changed the title FIX: Correct propagation of dtype to idf_ for large corpuses in PandasNormalizedTfidfVectorizer FIX: propagate dtype to idf_ for large corpuses in PandasNormalizedTfidfVectorizer Oct 6, 2024
@mbaak mbaak merged commit 07dfd10 into ing-bank:main Oct 7, 2024
4 checks passed