Support dask persist #13

Open
wants to merge 83 commits into base: master
Conversation

@kayibal (Owner) commented Jun 4, 2019

No description provided.

kayibal and others added 30 commits April 17, 2017 16:34
distributed assign support
Add property columns and index to dask.SparseFrame and increase version to 0.8.0
…ules downgrade to 0.19 until dask releases patch
This is useful mainly to avoid dask processes sharing really big arrays in case the categories get really big
was broken if loc returned a single location or integers were used as indexers
kayibal and others added 25 commits April 19, 2018 17:44
implements sort_index for the sparsity dask collection.
implements distributed rename method and adds quicker routines to groupby_sum if divisions are known. Adds support for joining sp.SparseFrames onto a distributed SparseFrame.
implement distributed set_index.
Implements to_npz for the distributed collection.
Also fixes a small issue with the optimized distributed join.
Removes some deprecation warnings by updating calls from _keys() to __dask_keys__(), as well as updating the import from dask.optimize to dask.optimization

* Update indexer instantiation. Allow loc on index with duplicates.

* Support latest versions of pandas (>=0.23.0)

* Update circleci configuration to v2

* fix indexing error with older scipy versions (<1.0.0)

* Support column indexing in _xs method

* raise error if sparse frame is indexed (__getitem__) with None
This resolves problems that appeared after changing drtools' FileSystems behaviour.

Eventually this should be handled more elegantly. Currently there's some duplicated code, identical to the filesystem module in drtools. Maybe we should make FileSystems a separate (open-source) package and use it in both sparsity and drtools?
* Raise error when initializing with unaligned indices
Now it detects whether pandas appended 2 description rows at the end
and removes them only if necessary.
Previously original DataFrame's index/columns would be preserved
and passed index/columns would be ignored.

Now passed index/columns are used but a SyntaxWarning is issued.

Fixes #52.
`data` currently can't be a list anyway. Its `.shape` attribute is used
at the very beginning of init method, so it has to be array-like.
- column names are preserved in groupby_agg
- when groupby_agg is used with Multiindex and level=, resulting
index has values only for specified level
- when grouping by column, this column is not present in result

Fixes #58.
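The groupby_agg fixes above align it with standard pandas groupby semantics. A minimal sketch of the target behavior, using plain pandas rather than sparsity's own groupby_agg:

```python
import pandas as pd

# Sketch of the behavior the commit aims for (plain pandas, not sparsity):
# column names survive the aggregation, and the grouping column moves into
# the result's index instead of remaining among the columns.
df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 2, 3]})
res = df.groupby("g").sum()
print(list(res.columns))  # ['v'] -- 'g' is the index, not a column
```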
* More info in setup.py

* Fix link in readme.
* enable tracking on documentation page
* Update documentation link.
* Implement distributed groupby sum and apply_concat_apply function for SparseFrame

* add test for different index datatypes

* implement sort_index

* implement __len__

* implement rename, optimize groupby_sum and join

implements distributed rename method and adds quicker routines to groupby_sum if divisions are known. Adds support for joining sp.SparseFrames onto a distributed SparseFrame.

* implement distributed set_index

* number of lines output in __repr__ changed.

* Create folders when writing to local filesystem

* Fix empty dtype

* Implement distributed drop.

* Always add npz extension when writing SparseFrame to npz format

* Fix metadata handling on set_index method

* Add method for dask SparseFrame and tuple divisions type

* Support empty divisions

* Pass on divisions on sort_index

* More restrictive pandas version as .drop method fails with pandas==0.20.3

* Fix bug where empty dataframe would create wrongly sized shuffle array

* Fix bug where join with in memory sparse frame would return rows from meta_nonempty

* Update dask version in setup.py

* Update deprecated set_options call

* Fix moto and boto versions

* Update test dependencies
Fix behaviour when passing Index to __getitem__
Fixes #74
This adds support for dask persist method.
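For context, dask's persist computes a collection's task graph and keeps the results in memory while the object stays a lazy collection. A minimal illustration with dask.delayed (assuming dask is installed; this is not sparsity-specific code):

```python
import dask

# persist() executes the graph and returns an equivalent collection whose
# tasks are replaced by the already-computed, in-memory results.
lazy = dask.delayed(lambda x: x + 1)(41)
(persisted,) = dask.persist(lazy)
print(persisted.compute())  # 42
```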
@kayibal (Owner, Author) commented Jun 4, 2019

Meh, I was on the wrong branch; will cherry-pick this later.

kayibal and others added 4 commits June 4, 2019 14:24
* Rename io modules to io_ and fix some version conflicts

Numpy 1.16.* is not compatible with sparsity 0.20.*, thus we need to fix
the setup.py. When using scipy<1.0.0, empty column access does not work,
so that dependency had to be adjusted here as well.
This also renames the io modules to io_ to avoid clashes with Python's
internal io module.

* Fix incompatibility with numpy>=1.16.0, potential security issue.

Due to a security issue (CVE-2019-6446), numpy changed the default value
 of allow_pickle in np.load to False, which led to errors when reading
 sparse frames from npz archives. This commit fixes it by explicitly
 allowing pickled objects, so reading sparse frames from unknown sources
 is still a security risk.
# Conflicts:
#	sparsity/dask/core.py
#	sparsity/test/test_sparse_frame.py
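The fix described above boils down to passing allow_pickle=True when reading archives that contain object arrays. A minimal sketch with plain numpy (the file name and index contents are illustrative, not taken from sparsity):

```python
import os
import tempfile

import numpy as np

# npz archives holding object arrays (e.g. string indices) need
# allow_pickle=True with recent numpy, where np.load defaults to
# allow_pickle=False because of CVE-2019-6446.
idx = np.array(["a", "b", "c"], dtype=object)
path = os.path.join(tempfile.mkdtemp(), "frame.npz")
np.savez(path, index=idx)

loaded = np.load(path, allow_pickle=True)  # fails without the flag
print(loaded["index"].tolist())  # ['a', 'b', 'c']
```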