KFold.split doesn't support dask dataframes. With the recent integration of dask into libraries such as xgboost and optuna, it would be very useful if it did. The error message acknowledges that dataframes are not supported and suggests converting them to dask arrays. For modern ML workflows this isn't ideal, since datasets commonly contain columns of many types (float, int, bool, categorical) that a single array does not represent well.
TypeError Traceback (most recent call last)
Cell In[12], line 1
----> 1 for train, test in k_folder.split(ddf):
2 pass
File ~/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/model_selection/_split.py:241, in KFold.split(self, X, y, groups)
240 def split(self, X, y=None, groups=None):
--> 241 X = check_array(X)
242 n_samples = X.shape[0]
243 n_splits = self.n_splits
File ~/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/utils.py:197, in check_array(array, accept_dask_array, accept_dask_dataframe, accept_unknown_chunks, accept_multiple_blocks, preserve_pandas_dataframe, remove_zero_chunks, *args, **kwargs)
195 elif isinstance(array, dd.DataFrame):
196 if not accept_dask_dataframe:
--> 197 raise TypeError(
198 "This estimator does not support dask dataframes. "
199 "This might be resolved with one of\n\n"
200 " 1. ddf.to_dask_array(lengths=True)\n"
201 " 2. ddf.to_dask_array() # may cause other issues because "
202 "of unknown chunk sizes"
203 )
204 # TODO: sample?
205 return array
TypeError: This estimator does not support dask dataframes. This might be resolved with one of
1. ddf.to_dask_array(lengths=True)
2. ddf.to_dask_array() # may cause other issues because of unknown chunk sizes
Anything else we need to know?:
We recently worked around this limitation with the following:
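The snippet that followed is not preserved in this copy of the issue. As an illustration only, and not necessarily the workaround the poster used, here is a minimal sketch of one way to get k-fold splits that stay dask dataframes: assign whole partitions to folds. The helper names `kfold_partition_indices` and `kfold_split_ddf` are hypothetical; the only dask calls assumed are `ddf.npartitions`, `ddf.get_partition`, and `dask.dataframe.concat`.

```python
from typing import Iterator, List, Tuple


def kfold_partition_indices(
    n_partitions: int, n_splits: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) lists of partition indices, one pair per fold."""
    if not 2 <= n_splits <= n_partitions:
        raise ValueError("need 2 <= n_splits <= n_partitions")
    # Spread partitions over folds as evenly as possible: the first
    # `extra` folds receive one more partition than the others.
    base, extra = divmod(n_partitions, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = [i for i in range(n_partitions) if not start <= i < stop]
        yield train, test
        start = stop


def kfold_split_ddf(ddf, n_splits: int = 5):
    """Yield (train_ddf, test_ddf) pairs that remain dask dataframes.

    Hypothetical helper: folds follow partition boundaries, so column
    dtypes (including categoricals) are preserved, but rows are not shuffled.
    """
    import dask.dataframe as dd  # imported lazily; only needed in this helper

    for train_idx, test_idx in kfold_partition_indices(ddf.npartitions, n_splits):
        train = dd.concat([ddf.get_partition(i) for i in train_idx])
        test = dd.concat([ddf.get_partition(i) for i in test_idx])
        yield train, test
```

With this sketch, `for train, test in kfold_split_ddf(ddf, n_splits=5): ...` mirrors the loop in the traceback above. Because folds are built from whole partitions, they are only as balanced as the partitions themselves, and any ordering in the data carries over into the folds; repartitioning or shuffling beforehand may be needed.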