-
-
Notifications
You must be signed in to change notification settings - Fork 258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix simple impute #788
base: main
Are you sure you want to change the base?
Fix simple impute #788
Conversation
CI failing for linting issues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. A few questions / comments.
@@ -70,12 +70,18 @@ def _fit_frame(self, X): | |||
if self.strategy == "mean": | |||
avg = X.mean(axis=0).values | |||
elif self.strategy == "median": | |||
avg = X.quantile().values | |||
avg = [np.median(X[col].dropna()) for col in X.columns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this will eagerly compute the values, thanks to np.median
. Since that's done in a list comprehension, we'd end up executing the graph for X
once per column. We want to delay computation till the end.
I also think this will end up pulling all the data for a column into a single ndarray, to do the median, which we also want to avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using delayed
here?
avg = [dask.delayed(np.median(X[col].dropna())) for col in X.columns]
for col in X.columns: | ||
val_counts = X[col].value_counts().reset_index() | ||
if isinstance(X, dd.DataFrame): | ||
x = val_counts.to_dask_array(lengths=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need lengths here? This also triggers a computation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed to compute chunk sizes ... any suggestion on how to avoid it? Thanks,
Fix #787.
Also related to #779