Support dask persist #13

Open
wants to merge 83 commits into base: master
Conversation

@kayibal (Owner) commented Jun 4, 2019

No description provided.

kayibal and others added 30 commits April 17, 2017 16:34
distributed assign support
Add property columns and index to dask.SparseFrame and increase version to 0.8.0
…ules downgrade to 0.19 until dask releases patch
This is useful mainly to avoid dask processes sharing really big arrays in case the categories get really big
was broken if loc returned a single location or integers were used as indexers
kayibal and others added 25 commits April 19, 2018 17:44
implements sort_index for the sparsity dask collection.
implements distributed rename method and adds quicker routines to groupby_sum if divisions are known. Adds support for joining sp.SparseFrames onto a distributed SparseFrame.
implement distributed set_index.
Implements to_npz for the distributed collection.
Also fixes a small issue with the optimized distributed join.
Removes some deprecation warnings by updating calls from _keys() to __dask_keys__(), as well as updating the import from dask.optimize to dask.optimization

* Update indexer instantiation. Allow loc on index with duplicates.

* Support latest versions of pandas (>=0.23.0)

* Update circleci configuration to v2

* fix indexing error with older scipy versions (<1.0.0)

* Support column indexing in _xs method

* raise error if sparse frame is indexed (__getitem__) with None
This resolves problems that appeared after changing drtools' FileSystems behaviour.

Eventually this should be handled more elegantly. Currently there's some duplicated code, identical to the filesystem module in drtools. Maybe we should make FileSystems a separate (open-source) package and use it in both sparsity and drtools?
* Raise error when initializing with unaligned indices
Now it detects whether pandas appended 2 description rows at the end
and removes them only if necessary.
Previously original DataFrame's index/columns would be preserved
and passed index/columns would be ignored.

Now passed index/columns are used but a SyntaxWarning is issued.

Fixes #52.
`data` currently can't be a list anyway. Its `.shape` attribute is used
at the very beginning of init method, so it has to be array-like.
- column names are preserved in groupby_agg
- when groupby_agg is used with Multiindex and level=, resulting
index has values only for specified level
- when grouping by column, this column is not present in result

Fixes #58.
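The groupby_agg fixes above align it with standard pandas groupby semantics. A minimal sketch of the target behavior, using plain pandas rather than sparsity's own groupby_agg:

```python
import pandas as pd

# Sketch of the behavior the commit aims for (plain pandas, not sparsity):
# column names survive the aggregation, and the grouping column moves into
# the result's index instead of remaining among the columns.
df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 2, 3]})
res = df.groupby("g").sum()
print(list(res.columns))  # ['v'] -- 'g' is the index, not a column
```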
* More info in setup.py

* Fix link in readme.
* enable tracking on documentation page
* Update documentation link.
* Implement distributed groupby sum and apply_concat_apply function for SparseFrame

* add test for different index datatypes

* implement sort_index

* implement __len__

* implement rename, optimize groupby_sum and join

implements distributed rename method and adds quicker routines to groupby_sum if divisions are known. Adds support for joining sp.SparseFrames onto a distributed SparseFrame.

* implement distributed set_index

* number of lines output in __repr__ changed.

* Create folders when writing to local filesystem

* Fix empty dtype

* Implement distributed drop.

* Always add npz extension when writing SparseFrame to npz format

* Fix metadata handling on set_index method

* Add method for dask SparseFrame and tuple divisions type

* Support empty divisions

* Pass on divisions on sort_index

* More restrictive pandas version as .drop method fails with pandas==0.20.3

* Fix bug where empty dataframe would create wrongly sized shuffle array

* Fix bug where join with in memory sparse frame would return rows from meta_nonempty

* Update dask version in setup.py

* Update deprecated set_options call

* Fix moto and boto versions

* Update test dependencies
Fix behaviour when passing Index to __getitem__
Fixes #74
This adds support for dask persist method.
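For context, dask's persist computes a collection's task graph and keeps the results in memory while the object stays a lazy collection. A minimal illustration with dask.delayed (assuming dask is installed; this is not sparsity-specific code):

```python
import dask

# persist() executes the graph and returns an equivalent collection whose
# tasks are replaced by the already-computed, in-memory results.
lazy = dask.delayed(lambda x: x + 1)(41)
(persisted,) = dask.persist(lazy)
print(persisted.compute())  # 42
```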
@kayibal (Owner, Author) commented Jun 4, 2019

Meh, I was on the wrong branch; will cherry-pick this later.

kayibal and others added 4 commits June 4, 2019 14:24
* Rename io modules to io_ and fix some version conflicts

Numpy 1.16.* is not compatible with sparsity 0.20.*, thus we need to fix
the setup.py. When using scipy<1.0.0, empty column access does not work,
so that dependency had to be adjusted here as well.
This also renames the io modules to io_ to avoid clashes with Python's
internal io module.

* Fix incompatibility with numpy>=1.16.0, potential security issue.

Due to a security issue (CVE-2019-6446), numpy changed the default value
 of allow_pickle in np.load to False, which led to errors when reading
 sparse frames from npz archives. This commit fixes it by explicitly
 allowing pickled objects, so reading sparse frames from unknown sources
 is still a security risk.
# Conflicts:
#	sparsity/dask/core.py
#	sparsity/test/test_sparse_frame.py
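The fix described above boils down to passing allow_pickle=True when reading archives that contain object arrays. A minimal sketch with plain numpy (the file name and index contents are illustrative, not taken from sparsity):

```python
import os
import tempfile

import numpy as np

# npz archives holding object arrays (e.g. string indices) need
# allow_pickle=True with recent numpy, where np.load defaults to
# allow_pickle=False because of CVE-2019-6446.
idx = np.array(["a", "b", "c"], dtype=object)
path = os.path.join(tempfile.mkdtemp(), "frame.npz")
np.savez(path, index=idx)

loaded = np.load(path, allow_pickle=True)  # fails without the flag
print(loaded["index"].tolist())  # ['a', 'b', 'c']
```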