I've been trying to use dask-ml to train large models with multidimensional inputs, using Incremental to pass chunks of the dask array sequentially for training. Unfortunately, it seems that Incremental, or one of the downstream libraries it calls, cannot handle data with more than two dimensions. When X.ndim <= 2, Incremental correctly passes each chunk of X sequentially, as the underlying numpy array, to the partial_fit of the estimator, which is the advertised behaviour. However, when X.ndim > 2, Incremental instead passes a tuple containing the dask task key string and location, and there seems to be no obvious way of retrieving the underlying data.
As a workaround, is there a way of retrieving the underlying data using the supplied information?
Alternatively, the obvious workaround is to reshape the multidimensional array to 2D before calling fit, and then unpack it back to the correct shape inside partial_fit. The array is chunked exclusively along the first dimension (and we would only roll the remaining dimensions), which from my understanding should not be prohibitively expensive. However, this seems like unnecessary overhead at each training step.
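A minimal sketch of the reshape workaround described above, using a plain numpy array as a stand-in for a single chunk. The names ReshapingEstimator, flatten_samples, sample_shape and seen_shapes are hypothetical and are not part of dask-ml. The sketch only illustrates the flatten-then-restore round trip. For use with Incremental, the wrapper would presumably also need get_params and set_params, as in the example estimator in this report.

```python
import numpy as np


def flatten_samples(X):
    """Collapse all trailing dimensions into one, keeping the sample axis."""
    return X.reshape((X.shape[0], -1))


class ReshapingEstimator:
    """Hypothetical wrapper that restores the per-sample shape in partial_fit."""

    def __init__(self, model, sample_shape):
        self.model = model
        self.sample_shape = tuple(sample_shape)  # shape of one sample, e.g. (100, 100)
        self.seen_shapes = []  # recorded for illustration only

    def partial_fit(self, X, y=None, **kwargs):
        # X arrives as (n_samples, prod(sample_shape)); restore the trailing dims.
        X_nd = X.reshape((X.shape[0],) + self.sample_shape)
        self.seen_shapes.append(X_nd.shape)
        if self.model is not None:
            self.model.partial_fit(X_nd, y, **kwargs)
        return self


# Stand-in for one chunk of a 3-D array: 2 samples, each of shape (3, 4).
chunk = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
flat = flatten_samples(chunk)

est = ReshapingEstimator(model=None, sample_shape=chunk.shape[1:])
est.partial_fit(flat)
```

With a dask array, the analogous flattening call would be X.reshape((X.shape[0], -1)). When the array is chunked only along the first axis, this should not require moving data between chunks, though that is an expectation rather than something measured here.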
Minimal Complete Verifiable Example:
    from dask_ml.wrappers import Incremental
    import dask.array as da

    # Make minimalist scikit-learn style estimator.
    class IncrementalEstimator():
        def __init__(self, model):
            self.model = model

        def partial_fit(self, X, y=None):
            # Print what Incremental actually hands to the estimator.
            print('X : {}'.format(X))
            print('Type X: {}'.format(type(X)))
            print('y : {}'.format(y))
            print('Type y: {}'.format(type(y)))

        def fit(self, X, y=None):
            raise NotImplementedError('Use partial_fit instead')

        def predict(self, X):
            return self.model.predict(X)

        def score(self, X, y):
            raise NotImplementedError('Use predict instead')

        def get_params(self, deep=True):
            return {'model': self.model}

        def set_params(self, **params):
            for key, value in params.items():
                # setattr assigns to the named attribute; self.key = value would not.
                setattr(self, key, value)
            return self

    # Dummy data
    y = da.ones((10,), chunks=(1,))
    X = da.random.random(size=(10, 100, 100, 10, 10), chunks=(1, 100, 100, 10, 10))

    # Subsample such that X.ndim <= 2. This will work.
    X_in = X[:, :, 0, 0, 0]
    estimator = Incremental(estimator=IncrementalEstimator(None))
    estimator.fit(X_in, y=y)

    # Now subsample such that X.ndim = 3. This will fail and pass a tuple
    # with the dask task graph name instead.
    X_in = X[:, :, :, 0, 0]
    estimator = Incremental(estimator=IncrementalEstimator(None))
    estimator.fit(X_in, y=y)
Anything else we need to know?:
If there is a better way of accomplishing what I'm trying to do using the dask ecosystem, let me know! :)
Thanks for the report. Can you post the traceback too?
I think the only requirement should be that the array is exclusively chunked along the first dimension (samples). But it's possible we're not handling higher-dimensional inputs correctly.
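The chunking requirement mentioned above can be checked from an array's chunks attribute, which on a dask array is a tuple holding one tuple of chunk sizes per axis. A small sketch, where the helper name chunked_only_along_first_axis is hypothetical and the chunk tuples are written out by hand rather than taken from a real array:

```python
def chunked_only_along_first_axis(chunks):
    """Return True if every axis after the first consists of a single chunk.

    `chunks` is expected in the layout of a dask array's .chunks attribute:
    one tuple of chunk sizes per axis.
    """
    return all(len(axis_chunks) == 1 for axis_chunks in chunks[1:])


# Layout matching X[:, :, :, 0, 0] from the example: 10 chunks of 1 sample each.
sample_chunked = ((1,) * 10, (100,), (100,))

# A layout that is also split along the second axis.
feature_chunked = ((5, 5), (50, 50))

ok = chunked_only_along_first_axis(sample_chunked)
not_ok = chunked_only_along_first_axis(feature_chunked)
```

For a real array this would be called as chunked_only_along_first_axis(X.chunks).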
Thank you for your response, @TomAugspurger. There is no traceback per se from Incremental or upstream. When ndim > 2, the partial_fit of the estimator receives a tuple of graph key and location in place of the actual array chunk:
X : ('getitem-f2dd5ea095519c95bc220af40bcdd853', 6, 0)
Type X: <class 'tuple'>
When ndim <= 2, partial_fit correctly receives the chunk as a numpy array:
X : [[...]]
Type X: <class 'numpy.ndarray'>
Any traceback comes from downstream, when partial_fit tries to fit to something that is not a numpy array.