[REVIEW] Add MultiIndex support for Dataframes and Series #1301
rerun tests |
@thomcom given we're still working on the correct approach for |
I'm writing the first implementation dumb/slow/naive, based on the assumption that we can use a libcudf gather call for the efficient method in the near future. Once I have 100% test passing we can talk about the more efficient solution and make bindings for gather into cudf. Sound good? More progress coming today. |
Sounds perfect, thanks for leading the charge on this! |
…instead of a list of lists.
…ring.py test that now passes.
Tests aren't passing in CI because of obscure circular-dependency problems (I think), but I'm able to continue development and run tests locally without any issues. I'm asking for comments at this time. Before the end of the 0.6 dev cycle I intend to add proper MultiIndex output to groupby results, and I hope to add slicing. Slicing is the most likely to get bumped. |
Mostly formatting changes, and we need an offline code review of some of the groupby code, but otherwise this looks amazing. Great work @thomcom!
This PR adds the MultiIndex class to cudf. MultiIndex (MI) is used for slicing and manipulating dataframes with more than two dimensions. It is of particular importance for groupby, which uses it in both index form and column form.
This PR creates the class and adds 90% of the public methods from Pandas to the MultiIndex. The core functionality (MI codes) is implemented using cudf dataframes, as one code row must be created for each row or column in the dataframe the index is attached to.
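To illustrate the "one code row per row" idea, here is a minimal sketch of a levels-and-codes representation in plain Python. It is not the cudf implementation: the helper names `build_levels_and_codes` and `decode_row` are hypothetical, and ordinary lists stand in for the cudf dataframes that hold the codes in this PR.

```python
from typing import Hashable, List, Sequence, Tuple


def build_levels_and_codes(
    columns: Sequence[Sequence[Hashable]],
) -> Tuple[List[list], List[List[int]]]:
    """Factorize each key column into sorted unique values (levels)
    and one integer code per row pointing into those levels."""
    levels: List[list] = []
    codes: List[List[int]] = []
    for values in columns:
        uniques = sorted(set(values))
        position = {value: i for i, value in enumerate(uniques)}
        levels.append(uniques)
        codes.append([position[value] for value in values])
    return levels, codes


def decode_row(levels: List[list], codes: List[List[int]], row: int) -> tuple:
    """Rebuild the key tuple for a single row from its codes."""
    return tuple(level[code[row]] for level, code in zip(levels, codes))


# Two key columns describing four rows (hypothetical data).
key_columns = [["a", "a", "b", "b"], [1, 2, 1, 2]]
levels, codes = build_levels_and_codes(key_columns)

print(levels)                        # [['a', 'b'], [1, 2]]
print(codes)                         # [[0, 0, 1, 1], [0, 1, 0, 1]]
print(decode_row(levels, codes, 2))  # ('b', 1)
```

Each inner list in `codes` has exactly one entry per row of the indexed data, which is why a dataframe-shaped container is a natural fit for storing them.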
I've left a few important tasks incomplete, since the basic functionality and groupby support are both available now for 0.7. The MI work checklist is below:
MultiIndex checklist
This will fix #483
Fixes #1337
Fixes rapidsai/dask-cudf#191
Fixes rapidsai/dask-cudf#125
Fixes rapidsai/dask-cudf#132