thomcom
Contributor

@thomcom thomcom commented Mar 26, 2019

This PR adds the MultiIndex class to cudf. MultiIndex (MI) is used for slicing and manipulating dataframes with more than two dimensions. It is particularly important for groupby, which uses it in both an index form and a column form.

This PR creates the class and implements roughly 90% of the public MultiIndex methods from Pandas. The core functionality (MI codes) is implemented using cudf dataframes, since one code row must be created for each row or column in the dataframe the index is attached to.
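
For readers unfamiliar with the levels/codes layout, here is a minimal pure-Python sketch of the idea. It is not the cudf implementation (which stores codes in a cudf dataframe); `build_levels_and_codes` is a hypothetical helper used only for illustration, and it assumes every tuple has the same length.

```python
def build_levels_and_codes(tuples):
    """Factorize equal-length tuples into per-level unique values and integer codes.

    Hypothetical helper for illustration only; not part of cudf or pandas.
    """
    if not tuples:
        return [], []
    n_levels = len(tuples[0])
    # One sorted list of unique values per level.
    levels = [sorted({t[i] for t in tuples}) for i in range(n_levels)]
    # Map each value to its position within its level.
    lookups = [{value: pos for pos, value in enumerate(level)} for level in levels]
    # One code row per index entry, one column per level.
    codes = [[lookups[i][t[i]] for i in range(n_levels)] for t in tuples]
    return levels, codes


tuples = [("a", 1), ("a", 2), ("b", 1), ("b", 2)]
levels, codes = build_levels_and_codes(tuples)
print(levels)  # [['a', 'b'], [1, 2]]
print(codes)   # [[0, 0], [0, 1], [1, 0], [1, 1]]
```

Each code row lines up with one row (or column) of the dataframe the index is attached to, which is why the codes are naturally stored as a table.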

I've left a few important tasks incomplete, since the basic functionality and groupby support are both available now for 0.7. The MI work checklist is below:

MultiIndex checklist

  • Create MultiIndex with levels/codes args
  • Do MI validation like Pandas
  • MultiIndex datatypes:
    • int
    • float
    • string
  • MultiIndex as Index
    • Set MultiIndex as DataFrame index
    • Set MultiIndex as Series index
    • MultiIndex .loc row access tuples
    • .index Shape test
    • .index Find indices
    • .index Rebuild index
  • StringIndex
  • MultiIndex iterator
    • getitem
    • take
  • MultiColumn
    • Set MultiIndex as DataFrame columns
    • .columns Shape test
    • .columns Find indices
    • MultiIndex column access tuples
    • .columns Rebuild index
  • Convenience methods
    • from_tuples
    • from_dataframe
    • from_product
  • Backwards compatibility with Pandas 0.23.4
  • MultiIndex slicing
    • None / Wildcards
    • Range / validity_mask0 to validity_mask1
  • Deep copy of MI Members
  • groupby level into MultiIndex support
  • Set item via tuple and add to levels/codes objects
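
As a rough illustration of two checklist items (`from_product` and Pandas-style validation), here is a pure-Python sketch built on the levels/codes idea. The function names and error messages are hypothetical and do not come from this PR; the only Pandas behavior assumed is that a code of -1 marks a missing value.

```python
from itertools import product


def from_product_sketch(iterables):
    """Build levels and codes for the Cartesian product of the inputs.

    Hypothetical helper for illustration only.
    """
    levels = [sorted(set(values)) for values in iterables]
    # Every combination of per-level positions becomes one code row.
    codes = [list(row) for row in product(*(range(len(level)) for level in levels))]
    return levels, codes


def validate_sketch(levels, codes):
    """Raise ValueError if any code row is inconsistent with the levels."""
    for row in codes:
        if len(row) != len(levels):
            raise ValueError("each code row needs one code per level")
        for level, code in zip(levels, row):
            # Assumption: -1 marks a missing value, as in Pandas.
            if not -1 <= code < len(level):
                raise ValueError("code out of bounds for its level")


prod_levels, prod_codes = from_product_sketch([["x", "y"], [1, 2, 3]])
validate_sketch(prod_levels, prod_codes)  # passes silently
print(len(prod_codes))  # 6
```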

This will fix #483
Fixes #1337
Fixes rapidsai/dask-cudf#191
Fixes rapidsai/dask-cudf#125
Fixes rapidsai/dask-cudf#132

@raydouglass
Contributor

rerun tests

@kkraus14 kkraus14 added the "2 - In Progress" (Currently a work in progress) and "Python" (Affects Python cuDF API.) labels Apr 1, 2019
@kkraus14
Collaborator

kkraus14 commented Apr 1, 2019

@thomcom given we're still working on the correct approach for .loc on normal indexes, don't worry about it for MultiIndexes yet 😄, otherwise this is looking pretty good thus far.

@thomcom
Contributor Author

thomcom commented Apr 1, 2019

I'm writing the first implementation dumb/slow/naive, based on the assumption that we can use a libcudf gather call for the efficient method in the near future. Once I have 100% of the tests passing we can talk about the more efficient solution and add bindings for gather to cudf. Sound good? More progress coming today.
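
To make the naive-then-gather plan concrete, here is a pure-Python sketch of what I mean: scan the code rows for matches, then gather the matching rows by position. The helpers `find_positions` and `gather` are hypothetical stand-ins (the real `gather` would be a libcudf call), and the sketch assumes keys are full tuples or prefixes of them.

```python
def find_positions(levels, codes, key):
    """Naive scan: return positions whose leading codes match `key`.

    Hypothetical helper for illustration only.
    """
    try:
        # Translate each key element into its code within its level.
        key_codes = [level.index(k) for level, k in zip(levels, key)]
    except ValueError:
        return []  # some key element is not present in its level
    n = len(key_codes)
    return [pos for pos, row in enumerate(codes) if row[:n] == key_codes]


def gather(column, positions):
    """Pick out the values at the given positions (stand-in for a gather call)."""
    return [column[p] for p in positions]


mi_levels = [["a", "b"], [1, 2]]
mi_codes = [[0, 0], [0, 1], [1, 0], [1, 1]]
values = [10, 20, 30, 40]

print(gather(values, find_positions(mi_levels, mi_codes, ("b",))))    # [30, 40]
print(gather(values, find_positions(mi_levels, mi_codes, ("a", 2))))  # [20]
```

The scan is linear in the number of index entries, which is the "slow" part; the gather step is the piece that should map onto libcudf directly.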

@kkraus14
Collaborator

kkraus14 commented Apr 1, 2019

> I'm writing the first implementation dumb/slow/naive, based on the assumption that we can use a libcudf gather call for the efficient method in the near future. Once I have 100% of the tests passing we can talk about the more efficient solution and add bindings for gather to cudf. Sound good? More progress coming today.

Sounds perfect, thanks for leading the charge on this!

@thomcom thomcom requested a review from kkraus14 April 26, 2019 19:51
@thomcom
Contributor Author

thomcom commented Apr 26, 2019

Tests are failing in CI because of obscure circular dependency problems (I think), but I'm able to continue development and run tests locally without any issues. I'm asking for comments at this time. Before the end of the 0.6 dev cycle I intend to add proper MultiIndex output to groupby results, and I hope to add slicing. Of the two, slicing is the more likely to get bumped.

@thomcom thomcom requested a review from a team as a code owner April 30, 2019 22:04
@thomcom thomcom requested a review from a team April 30, 2019 22:04
@thomcom thomcom requested a review from a team as a code owner April 30, 2019 22:04
@thomcom thomcom requested review from dantegd and quasiben and removed request for a team May 1, 2019 16:07
@thomcom thomcom changed the title [WIP] Add MultiIndex support for Dataframes and Series [REVIEW] Add MultiIndex support for Dataframes and Series May 1, 2019
Collaborator

@kkraus14 kkraus14 left a comment

Mostly formatting changes requested, and we need an offline code review of some of the groupby code, but otherwise this looks amazing. Great work @thomcom!

@thomcom thomcom merged commit beabaaf into rapidsai:branch-0.7 May 2, 2019