Update deps #14

Open · wants to merge 87 commits into base: master

Commits (87)
45376b4
add groupby_agg method
kayibal Apr 17, 2017
4c3eec2
read version from a single location
kayibal Apr 17, 2017
eb745aa
Add package_data to include VERSION file in distribution
kayibal Apr 18, 2017
e339052
Fix package data values must be iterables
kayibal Apr 18, 2017
60e83d8
Add installation test to sparsity
kayibal Apr 18, 2017
3f54e0d
Added test_assign_column for dask.SparseFrame
Talmaj Apr 21, 2017
176228b
Added working assign column for dask.SparseFrame
Talmaj Apr 21, 2017
1da4035
Increase version number:
kayibal Apr 24, 2017
f4602d0
Added property columns and index to dask.SparseFrame
Talmaj Apr 24, 2017
9e33ab3
Increase version, add *.egg-info to .gitignore
Talmaj Apr 25, 2017
e8838cf
Merge pull request #6 from datarevenue-berlin/feature/meta
Talmaj Apr 25, 2017
c776436
Add multiply method to SparseFrame and its test.
Talmaj Apr 25, 2017
98007fd
Fixed ci on a branch.
Talmaj Apr 27, 2017
77ceb6b
Change toarray() for 1 dimensional SparseFrames, add tests for multip…
Talmaj Apr 27, 2017
9b4740b
Change back to_array functionality and mimic pd.DataFrame.multiply me…
Talmaj May 2, 2017
2a701bb
bump version
kayibal May 3, 2017
21dfe7c
Hotfix pandas-0.20.0 breaks backwards compatibility with internal mod…
kayibal May 5, 2017
69a48b5
adds support to pandas>=0.20
kayibal May 5, 2017
a4bba4a
Adjust pandas dependency on setup.py
kayibal May 5, 2017
71dbd6b
Hotfix: Accept kwargs in SparseFrame.add for congruency with pandas.D…
kayibal May 18, 2017
f8fc149
Version 0.9.3
kayibal May 18, 2017
ba89a8d
Merge pull request #8 from datarevenue-berlin/hotfix/add-signature
May 18, 2017
fe0ff68
Added fillna method
michcio1234 May 18, 2017
e83b1e8
Version bump to 0.10.0
michcio1234 May 18, 2017
69f2aa1
Merge pull request #9 from datarevenue-berlin/feature/fillna
michcio1234 May 18, 2017
8763fef
Add reading categories from path
kayibal Jun 16, 2017
54bd778
Bump version to 0.11.0 and remove test from setup.py packages
kayibal Jun 19, 2017
e2408fa
Fix CodeCov reports on project coverage
kayibal Jun 19, 2017
1f4f5e9
Read categories from path
Jun 19, 2017
ed6ae45
Fix multiindex loc indexer (#11)
Jul 10, 2017
1c10bed
Fix to_npz (#12)
Jul 11, 2017
9d09f2f
Fix/empty attribute (#13)
Jul 11, 2017
5234740
Bump version to 0.11.1
kayibal Jul 11, 2017
4acf016
One-hot encode multiple columns (#16)
michcio1234 Jul 21, 2017
e3a09e6
Added possibility to drop single or multiple columns.
michcio1234 Jul 21, 2017
9b8ffc2
Remove unused code.
michcio1234 Jul 21, 2017
9ee76a6
Use versioneer for dynamic version strings (#17)
Jul 21, 2017
1774cc6
Merge pull request #18 from datarevenue-berlin/feature/drop-columns
michcio1234 Jul 21, 2017
5b1d05f
Add support to save npz files on s3 (#15)
Jul 25, 2017
a5a713b
Bugfix: one-hot encoding column of category dtype
michcio1234 Oct 25, 2017
a62753d
It's possible to raise an exception or to ignore the situation when g…
michcio1234 Nov 2, 2017
0e0ec76
add private take method to support indexing in pandas 0.21.0 (#20)
Nov 6, 2017
5c3ceb3
Cleaner code, better documentation and more tests.
michcio1234 Nov 8, 2017
05867f6
One-hot-encode categorical column (#22)
michcio1234 Nov 9, 2017
060d09f
Implement multipart upload with default block_size=100MB (#23)
Nov 14, 2017
660bbb0
support empty frames in elementwise operations (#21)
Nov 28, 2017
e61fa5f
add support for arbitrary remote storages (#24)
Nov 28, 2017
8fafddc
Support list like label based indexing (#27)
Nov 30, 2017
d6d30b8
Add support for new dask custom collection interface. (#29)
Dec 22, 2017
f10254f
Getitem and loc failed when all labels were requested. (#28)
michcio1234 Dec 27, 2017
01328de
Elementwise comparison for arrays with different length is deprecated…
michcio1234 Jan 4, 2018
c0c7671
Distributed join (#34)
Feb 15, 2018
6de787c
add from_ddf method (#32)
Feb 17, 2018
87f9928
Distributed groupby sum operation (#35)
Mar 19, 2018
4d85502
Sort index (#37)
Apr 19, 2018
9352ea3
Optimization of distributed procedures (#38)
Apr 19, 2018
401c4c6
Set index (#36)
Apr 20, 2018
b6a5938
To npz (#39)
Apr 20, 2018
f29af39
update dask imports and __dask__keys usage (#40)
Apr 20, 2018
4d5fd2b
Drop support for pandas>=0.23.0 as api changes break iloc functionali…
Jun 1, 2018
8d2f8f6
Update indexer instantiation. Allow loc from index with duplicates. (…
Aug 22, 2018
ce1ac3a
Accept pathlib objects in io module. (#48)
michcio1234 Aug 28, 2018
4c09026
Raise error when initialising with unaligned indices (#51)
Sep 4, 2018
011fd3e
Fix __repr__ (#60)
michcio1234 Sep 5, 2018
eb777fd
Fix joining with axis=0 with different columns (#57)
michcio1234 Sep 6, 2018
c042db9
Fix init from pd.DataFrame with passed index/columns (#61)
michcio1234 Sep 6, 2018
f3cd306
Removed unused code (#62)
michcio1234 Sep 6, 2018
2c2cdd7
Swap behaviour for axis=0/1 in .multiply (#63)
michcio1234 Sep 6, 2018
20e3bc4
Better index/columns handling in groupby operations (#64)
michcio1234 Sep 7, 2018
b52a270
Remove traildb (#41)
Sep 7, 2018
96e57f1
Sphinx doc (#47)
Sep 7, 2018
3d10dc8
Require pandas not higher than 0.23.4
michcio1234 Sep 7, 2018
fe33ab0
Add BSD 3-clause license
michcio1234 Sep 7, 2018
87ab3d5
Pypi (#65)
michcio1234 Sep 7, 2018
0972823
Add Google Analytics ID (#66)
Sep 10, 2018
43d1a44
Compatibility with dask version 0.19.3 (#70)
Oct 10, 2018
4751444
Refactor/binning (#69)
Dec 4, 2018
b27df42
Support getitem with Index (#75)
michcio1234 Jan 19, 2019
23f091b
Rename io modules to io_ and fix some version conflicts (#78)
Jun 4, 2019
6736452
Add support for dask persist (#77)
Jun 5, 2019
e8fa03f
DaskSparseFrame getitem, todense and bugfix (#79)
michcio1234 Jun 14, 2019
86bf9df
Add test for local version (#83)
michcio1234 Jul 12, 2019
816afae
Bugfix/81 get missing column (#82)
michcio1234 Jul 12, 2019
73c690f
Support latest pandas version.
kayibal Aug 5, 2019
49e1d71
Fix some test failures that occurred with latest dask & co versions
kayibal Aug 5, 2019
200a554
Add alias to ensure_index import
michcio1234 Aug 6, 2019
25fad8f
FIXME: remove raise_missing=True
michcio1234 Aug 6, 2019
12 changes: 12 additions & 0 deletions .circleci/config.yml
@@ -0,0 +1,12 @@
version: 2
jobs:
build:
working_directory: ~/sparsity
docker:
- image: drtools/dask:latest
steps:
- checkout
- run: pip install boto3==1.7.84 botocore==1.10.84 moto==1.3.6
- run: pip install pytest pytest-cov dask==1.0.0 .
- run: py.test --cov sparsity --cov-report xml sparsity
- run: bash <(curl -s https://codecov.io/bash)
2 changes: 1 addition & 1 deletion .coveragerc
@@ -1,2 +1,2 @@
[run]
omit = sparsity/test/*, */__init__.py
omit = sparsity/test/*, */__init__.py, */_version.py
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
sparsity/_version.py export-subst
2 changes: 2 additions & 0 deletions .gitignore
@@ -4,3 +4,5 @@ build/
*.so
traildb_sparse.c
__pycache__
*.egg-info
*.npz
24 changes: 24 additions & 0 deletions LICENSE
@@ -0,0 +1,24 @@
Copyright (c) 2018, Data Revenue
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of the copyright holder nor the
names of its contributors may be used to endorse or promote products
derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,2 @@
include versioneer.py
include sparsity/_version.py
8 changes: 0 additions & 8 deletions Makefile

This file was deleted.

117 changes: 6 additions & 111 deletions README.md
@@ -2,121 +2,16 @@
[![CircleCI](https://circleci.com/gh/datarevenue-berlin/sparsity.svg?style=svg)](https://circleci.com/gh/datarevenue-berlin/sparsity)
[![Codecov](https://img.shields.io/codecov/c/github/datarevenue-berlin/sparsity.svg)](https://codecov.io/gh/datarevenue-berlin/sparsity)

Sparse data processing toolbox. It builds on top of pandas and scipy to provide a DataFrame-like
API to work with sparse categorical data.

It also provides an extremely fast C-level interface to read from traildb databases. This makes
it a highly performant package for data-processing jobs such as log processing, clickstream,
or click-through data.
Sparse data processing toolbox. It builds on top of pandas and scipy to provide a
DataFrame-like API to work with sparse data.

In combination with dask, it provides support for executing complex operations on a
concurrent/distributed level.

## Attention
**Not ready for production**

# Motivation
Many tasks, especially in the data analytics and machine learning domains, make use of sparse
data structures to support the input of high-dimensional data.

This project was started to build an efficient, homogeneous sparse data processing pipeline.
As of today, dask has no support for anything like a sparse dataframe. We process large
amounts of high-dimensional data on a daily basis at [datarevenue](http://datarevenue.com),
and our favourite language and ETL framework are Python and dask. After chaining many
function calls on scipy.sparse csr matrices that involved handling indices and column names
to produce a sparse data pipeline, I decided to start this project.

This package might be especially useful to you if you have very large amounts of sparse data,
such as clickstream data, categorical time series, or log data.

# Traildb access?
[Traildb](http://traildb.io/) is an amazing log-style database, released recently by AdRoll.
It compresses event-like data extremely efficiently. Furthermore, it provides a fast C-level
API to query it.

Traildb also has Python bindings, but you still might need to iterate over many millions of
users or trails (or both), which carries considerable overhead in Python. Therefore, sparsity
provides high-speed access to the database in the form of SparseFrame objects. These are fast,
efficient, and intuitive enough to do further processing on.

*At the moment, uuid and timestamp information is lost, but it will be provided as a
pandas.MultiIndex handled by the SparseFrame in a (very soon) future release.*

````
In [1]: from sparsity import SparseFrame

In [2]: sdf = SparseFrame.read_traildb('pydata.tdb', field="title")

In [3]: sdf.head()
Out[3]:
0 1 2 3 4 ... 37388 37389 37390 37391 37392
0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
2 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
3 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0
4 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0

[5 rows x 37393 columns]

In [6]: %%timeit
...: sdf = SparseFrame.read_traildb("/Users/kayibal/Code/traildb_to_sparse/traildb_to_sparse/traildb_to_sparse/sparsity/test/pydata.tdb", field="title")
...:
10 loops, best of 3: 73.8 ms per loop

In [4]: sdf.shape
Out[4]: (109626, 37393)
````

# But wait, pandas has SparseDataFrames and SparseSeries
Pandas has its own implementation of sparse data structures. Unfortunately, these structures
perform quite badly with a groupby-sum aggregation, which we also use often. Furthermore,
doing a groupby on a pandas SparseDataFrame returns a dense DataFrame. This makes chaining
many groupby operations over multiple files cumbersome and less efficient. Consider the
following example:

```
In [1]: import sparsity
...: import pandas as pd
...: import numpy as np
...:

In [2]: data = np.random.random(size=(1000,10))
...: data[data < 0.95] = 0
...: uids = np.random.randint(0,100,1000)
...: combined_data = np.hstack([uids.reshape(-1,1),data])
...: columns = ['id'] + list(map(str, range(10)))
...:
...: sdf = pd.SparseDataFrame(combined_data, columns = columns, default_fill_value=0)
...:

In [3]: %%timeit
...: sdf.groupby('id').sum()
...:
1 loop, best of 3: 462 ms per loop

In [4]: res = sdf.groupby('id').sum()
...: res.values.nbytes
...:
Out[4]: 7920

In [5]: data = np.random.random(size=(1000,10))
...: data[data < 0.95] = 0
...: uids = np.random.randint(0,100,1000)
...: sdf = sparsity.SparseFrame(data, columns=np.asarray(list(map(str, range(10)))), index=uids)
...:

In [6]: %%timeit
...: sdf.groupby_sum()
...:
The slowest run took 4.20 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.25 ms per loop

In [7]: res = sdf.groupby_sum()
...: res.__sizeof__()
...:
Out[7]: 6128
```

I'm not quite sure if there is some cached result, but I don't think so. This only uses a
smart csr matrix multiplication to do the operation.

More information and examples can be found in the [documentation](https://sparsity.readthedocs.io).

## Installation

```
$ pip install sparsity
```
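
The "smart csr matrix multiplication" mentioned above can be made concrete. Here is a minimal
sketch of the trick, assuming nothing about sparsity's internals (the helper name and structure
are ours, not the library's): build a one-hot grouping matrix `G` of shape (n_groups, n_rows)
so that the whole groupby-sum collapses into a single sparse matrix product `G.dot(X)`.

```
import numpy as np
import scipy.sparse as sp

def groupby_sum_via_matmul(data, labels):
    """Group rows of a sparse matrix by label and sum them in one matmul."""
    groups, inverse = np.unique(labels, return_inverse=True)
    n_rows = data.shape[0]
    # one_hot[g, i] == 1 exactly where row i belongs to group g.
    one_hot = sp.csr_matrix(
        (np.ones(n_rows), (inverse, np.arange(n_rows))),
        shape=(len(groups), n_rows),
    )
    return groups, one_hot.dot(data)  # sparse result, one row per group

data = sp.random(1000, 10, density=0.05, format='csr')
uids = np.random.randint(0, 100, 1000)
keys, sums = groupby_sum_via_matmul(data, uids)
print(keys.shape, sums.shape)  # e.g. (100,) and (100, 10)
```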
20 changes: 0 additions & 20 deletions circle.yml

This file was deleted.

155 changes: 155 additions & 0 deletions docs/Makefile
@@ -0,0 +1,155 @@
# Makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = _build

# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .

.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext

help:
@echo "Please use \`make <target>' where <target> is one of"
@echo " html to make standalone HTML files"
@echo " dirhtml to make HTML files named index.html in directories"
@echo " singlehtml to make a single large HTML file"
@echo " pickle to make pickle files"
@echo " json to make JSON files"
@echo " htmlhelp to make HTML files and a HTML help project"
@echo " qthelp to make HTML files and a qthelp project"
@echo " devhelp to make HTML files and a Devhelp project"
@echo " epub to make an epub"
@echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
@echo " latexpdf to make LaTeX files and run them through pdflatex"
@echo " text to make text files"
@echo " man to make manual pages"
@echo " texinfo to make Texinfo files"
@echo " info to make Texinfo files and run them through makeinfo"
@echo " gettext to make PO message catalogs"
@echo " changes to make an overview of all changed/added/deprecated items"
@echo " linkcheck to check all external links for integrity"
@echo " doctest to run all doctests embedded in the documentation (if enabled)"

clean:
-rm -rf $(BUILDDIR)/*

html:
$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

apidoc:
sphinx-apidoc -fME -o api ../sparsity
dirhtml:
$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."

singlehtml:
$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
@echo
@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."

pickle:
$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
@echo
@echo "Build finished; now you can process the pickle files."

json:
$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
@echo
@echo "Build finished; now you can process the JSON files."

htmlhelp:
$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
@echo
@echo "Build finished; now you can run HTML Help Workshop with the" \
".hhp project file in $(BUILDDIR)/htmlhelp."

qthelp:
$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
@echo
@echo "Build finished; now you can run "qcollectiongenerator" with the" \
".qhcp project file in $(BUILDDIR)/qthelp, like this:"
@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/sparsity.qhcp"
@echo "To view the help file:"
@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/sparsity.qhc"

devhelp:
$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
@echo
@echo "Build finished."
@echo "To view the help file:"
@echo "# mkdir -p $$HOME/.local/share/devhelp/sparsity"
@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/sparsity"
@echo "# devhelp"

epub:
$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
@echo
@echo "Build finished. The epub file is in $(BUILDDIR)/epub."

latex:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo
@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
@echo "Run \`make' in that directory to run these through (pdf)latex" \
"(use \`make latexpdf' here to do that automatically)."

latexpdf:
$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
@echo "Running LaTeX files through pdflatex..."
$(MAKE) -C $(BUILDDIR)/latex all-pdf
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

text:
$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
@echo
@echo "Build finished. The text files are in $(BUILDDIR)/text."

man:
$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
@echo
@echo "Build finished. The manual pages are in $(BUILDDIR)/man."

texinfo:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo
@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
@echo "Run \`make' in that directory to run these through makeinfo" \
"(use \`make info' here to do that automatically)."

info:
$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
@echo "Running Texinfo files through makeinfo..."
make -C $(BUILDDIR)/texinfo info
@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."

gettext:
$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
@echo
@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."

changes:
$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
@echo
@echo "The overview file is in $(BUILDDIR)/changes."

linkcheck:
$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
@echo
@echo "Link check complete; look for any errors in the above output " \
"or in $(BUILDDIR)/linkcheck/output.txt."

doctest:
$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
@echo "Testing of doctests in the sources finished, look at the " \
"results in $(BUILDDIR)/doctest/output.txt."
24 changes: 24 additions & 0 deletions docs/api/dask-sparseframe-api.rst
@@ -0,0 +1,24 @@
Dask SparseFrame API
====================

.. py:currentmodule:: sparsity.dask.core

.. autosummary::
SparseFrame
SparseFrame.assign
SparseFrame.compute
SparseFrame.columns
SparseFrame.get_partition
SparseFrame.index
SparseFrame.join
SparseFrame.known_divisions
SparseFrame.map_partitions
SparseFrame.npartitions
SparseFrame.persist
SparseFrame.repartition
SparseFrame.set_index
SparseFrame.rename
SparseFrame.sort_index
SparseFrame.to_delayed
SparseFrame.to_npz
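
For orientation, a hedged usage sketch of this lazy collection interface. Only the method
names come from the summary above; the import alias and the `to_npz` path semantics are
assumptions, not confirmed by this page, which is why the sketch takes an already-constructed
frame as an argument rather than guessing at a constructor.

```
import sparsity.dask.core as sdc  # module path from the currentmodule directive above

def sort_persist_save(dsf: "sdc.SparseFrame", path: str):
    """Drive a dask-backed SparseFrame using only the methods listed above."""
    print(dsf.npartitions, dsf.known_divisions)  # partition metadata
    dsf = dsf.sort_index().persist()   # sort lazily, then keep results in memory
    print(dsf.get_partition(0))        # inspect a single-partition SparseFrame
    dsf.to_npz(path)                   # write npz output; path semantics assumed
    return dsf.compute()               # materialize as an in-memory SparseFrame
```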