
Add multi-node-multi-gpu Logistic Regression in C++ #5477

Merged
merged 12 commits into rapidsai:branch-23.08 on Jul 24, 2023

Conversation

lijinf2
Contributor

@lijinf2 lijinf2 commented Jun 26, 2023

This PR enables multi-node multi-GPU Logistic Regression, mostly reusing the existing code (i.e. GLMWithData and min_lbfgs) of single-GPU Logistic Regression. No existing code is changed.

Added pytest code for a Spark cluster; the tests run successfully with 2 GPUs on a random dataset. The resulting coef_ and intercept_ are the same as single-GPU cuml.LogisticRegression.fit. The pytest code can be found here: https://github.com/lijinf2/spark-rapids-ml/blob/lr/python/tests/test_logistic_regression.py
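
As a rough, hedged sketch of the correctness check described above (not taken from the linked test file): the MNMG results can be compared against single-GPU cuml.LogisticRegression, assuming X and y are host NumPy arrays and using an illustrative tolerance.

```python
# Hedged sketch only: compare MNMG-produced coefficients against the
# single-GPU cuml.LogisticRegression reference. X and y are assumed to be
# host NumPy arrays so coef_ / intercept_ come back as NumPy arrays; the
# helper name and tolerance are illustrative, not from the PR.
import numpy as np
from cuml import LogisticRegression


def check_against_single_gpu(mg_coef, mg_intercept, X, y, tol=1e-3):
    sg = LogisticRegression(fit_intercept=True)
    sg.fit(X, y)
    np.testing.assert_allclose(np.asarray(mg_coef), np.asarray(sg.coef_), rtol=tol)
    np.testing.assert_allclose(np.asarray(mg_intercept), np.asarray(sg.intercept_), rtol=tol)
```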

@rapids-bot

rapids-bot bot commented Jun 26, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write permissions or greater before CI can begin.

@github-actions github-actions bot added the "Cython / Python" and "CMake" labels on Jun 26, 2023
@lijinf2 lijinf2 changed the title from "Add multi-node-multi-gpu Logistic Regression in C++" to "[DRAFT] Add multi-node-multi-gpu Logistic Regression in C++" on Jun 26, 2023
@cjnolet cjnolet added the "improvement" and "non-breaking" labels on Jun 27, 2023
@lijinf2
Contributor Author

lijinf2 commented Jun 28, 2023

Rebased onto the latest branch-23.08.

Member

@cjnolet cjnolet left a comment

Thanks for this new feature @lijinf2! This is going to be useful not just in Spark but in Dask and other places as well!

I think overall the C++ code is structured pretty well. There are some general details around the communicator and handle, documentation, and input validation that should still be addressed.

I think the Cython can be simplified further, and the Dask piece should be pretty straightforward to add. The Dask Linear Regression wrapper should be helpful there. Also, as discussed offline, we will need tests, and the Dask Linear Regression pytests should be helpful there. More specifically, you should be able to use the Dask make_classification data generator for that.

Thanks again for this contribution!
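
A minimal sketch of the make_classification suggestion above, assuming a local two-GPU cluster started with dask-cuda; the parameter values are illustrative only.

```python
# Hedged sketch: generate a distributed binary classification dataset with
# cuml's Dask data generator on a local two-GPU cluster. Parameter values
# are illustrative, not taken from this PR's tests.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.datasets.classification import make_classification

cluster = LocalCUDACluster(n_workers=2)
client = Client(cluster)

X, y = make_classification(n_samples=10000, n_features=20,
                           n_informative=10, n_classes=2)
```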

cpp/src/glm/qn/glm_base_mg.cuh (5 outdated review threads, resolved)
cpp/src/glm/qn_mg.cu (outdated review thread, resolved)
# the cdef was copied from cuml.linear_model.qn
cdef extern from "cuml/linear_model/glm.hpp" namespace "ML::GLM" nogil:

cdef enum qn_loss_type "ML::GLM::qn_loss_type":
Member

Since this isn't MG-specific, we should centralize it so that the SG and MG code can use the same one.

Contributor Author

This will affect the single-GPU logistic_regression.pyx. Is it OK to do this in the next PR to avoid this one getting too long?

Member

I agree with the logic of adding this change in a follow-up, but could you open a GH issue capturing this (and any other small follow-up items)?

Member

Also, can you add a little TODO here and reference the GitHub issue number? This helps us map it back to the code so it doesn't get lost or forgotten.

QN_LOSS_ABS "ML::GLM::QN_LOSS_ABS"
QN_LOSS_UNKNOWN "ML::GLM::QN_LOSS_UNKNOWN"

cdef struct qn_params:
Member

Same here: this isn't MG-specific, so we should use the SG version.

int n_ranks) except +


class LogisticRegressionMG(LogisticRegression):
Member

My suggestion is to use the design from LinearRegressionMG as much as possible to model the relationship between LogisticRegression and LogisticRegressionMG. There are some more classes and mixins that can reduce boilerplate further.

Contributor Author

@lijinf2 lijinf2 Jul 1, 2023

I find it a bit challenging to inherit the mixins, for two reasons:

(1) min_lbfgs accepts float*, while the mixins' fit accepts vector<float*>.
(2) min_lbfgs supports multiple classes, so coef_ can be a matrix, whereas the mixins assume coef_ to be a vector of length D + pams.fit_intercept.

Are there existing APIs to address (1) and (2) that would simplify the implementation?

Contributor Author

(3) Another issue that pops up relates to mixins._fit and (1). It seems mixins.fit converts the input vectors and labels into a special type (i.e. X_arg, y_arg) using opg.build_data_t(X_arys). Is there a way to convert X_arg and y_arg to float* (min_lbfgs requires float*)?

Contributor Author

Combined the design from 'LinearRegressionMG' and single-GPU 'LogisticRegression'. Specifically:
(1) Made self._coef initialization support both the LinearRegressionMG style (i.e. a 1-D array) and the LogisticRegression style (i.e. a 2-D array).
(2) Converted the types of the input vectors and input labels from the LinearRegressionMG style (vector of Matrix::Data type) to the LogisticRegression style (float*).
(3) Set n_classes to 2 in this PR; multi-class support will be added in follow-up PRs.

self.solver_model._coef_ = CumlArray.zeros(
coef_size, dtype=self.dtype, order='C')

def fit(self, X, y, rank, n_ranks, n_samples, n_classes, convert_dtype=False) -> "LogisticRegressionMG":
Member

As mentioned above, I think this could be simplified further to look more like this.

Contributor Author

Revised the code; _fit now looks exactly the same. fit looks the same except for one additional argument (i.e. order='F'), because LinearRegressionMG uses 'F' order while cuml.LogisticRegression uses 'C' order.

@lijinf2
Contributor Author

lijinf2 commented Jun 29, 2023

Thanks so much for reviewing the PR, @cjnolet! Your comments and links are very helpful, and I will start working on the revision following your suggestions. One thing I feel uncertain about relates to the num_classes calculation. Is there an existing C++ API to compute num_classes from labels across GPUs? This would help move the num_classes variable into the C++ implementation and remove one argument from the Cython fit wrapper.

@cjnolet
Member

cjnolet commented Jun 30, 2023

Is there an existing C++ API to calculate num_classes from labels across GPUs?

I don't know if we have a way to do this on multiple GPUs yet. We could use an nunique primitive and potentially do a distributed reduction. Maybe something we should think about a little more. Since time is important here, if we can't think of anything very soon, I would suggest we create a GitHub issue for it and reference that issue in a TODO in the code so we don't lose sight of it. I'll be the first to say that we are in need of a nice collection of MNMG primitives that are fully C++.

@cjnolet
Member

cjnolet commented Jun 30, 2023

/ok to test

@lijinf2
Contributor Author

lijinf2 commented Jun 30, 2023

Is there an existing C++ API to calculate num_classes from labels across GPUs?

Figured out it is possible with the C++ raft::label::detail::getUniquelabel and comm.allgather. But it turns out the Python wrapper logistic_regression_mg.pyx also needs this num_classes on the Python side to initialize coef_, so a wrapper around the C++ implementation would be required.

Maybe in this PR we can assume num_classes to be 2 and remove num_classes from the Cython fit argument list. In a future PR we can implement the C++ num_classes calculation and a wrapper to use it in logistic_regression_mg.pyx.
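
For illustration only, here is a hedged Python-side analogue of the same idea (per-partition unique labels, then a global union), using a Dask array rather than the raft::label::detail::getUniquelabel + comm.allgather path named above; this is not the C++ implementation discussed here, and the names are illustrative.

```python
# Hedged sketch, Python-side analogue only (NOT the raft/C++ path): the
# workers' partitions contribute their unique labels, and the client takes
# the union to obtain the number of classes.
import dask.array as da


def num_classes_from_labels(y):
    # y: a Dask array of labels distributed across the workers
    return int(da.unique(y).compute().size)
```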

@lijinf2
Contributor Author

lijinf2 commented Jul 5, 2023

@cjnolet Thanks so much for the comments! They are very helpful. After addressing them, I find the implementation of the Dask class (cuml.dask.linear_model.LogisticRegression) and its pytests become much simpler. I would like to have your review again when you have time.

The revised PR mainly adds two new files (dask/linear_model/LogisticRegression and tests/dask/test_dask_logistic_regression.py). The added pytests pass on my 2-GPU workstation. I still need to run the full cuml pytest suite.
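
A hedged sketch of what a Dask-level test along these lines might look like; the actual tests live in tests/dask/test_dask_logistic_regression.py, and the client fixture, parameters, and assertions here are illustrative rather than copied from that file.

```python
# Hedged sketch only: fit the new Dask wrapper on generated data and check
# that the fitted attributes are populated. The `client` fixture is assumed
# to come from the test suite's conftest; it is not defined here.
from cuml.dask.datasets.classification import make_classification
from cuml.dask.linear_model import LogisticRegression


def test_lr_fit(client):
    X, y = make_classification(n_samples=5000, n_features=10,
                               n_informative=5, n_classes=2)
    lr = LogisticRegression()
    lr.fit(X, y)
    assert lr.coef_ is not None
    assert lr.intercept_ is not None
```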

@lijinf2 lijinf2 marked this pull request as ready for review July 11, 2023 00:18
@lijinf2 lijinf2 requested review from a team as code owners July 11, 2023 00:18
@lijinf2 lijinf2 changed the title from "[DRAFT] Add multi-node-multi-gpu Logistic Regression in C++" to "Add multi-node-multi-gpu Logistic Regression in C++" on Jul 11, 2023
@dantegd
Member

dantegd commented Jul 11, 2023

/ok to test

@lijinf2
Contributor Author

lijinf2 commented Jul 11, 2023

/ok to test

@csadorf
Contributor

csadorf commented Jul 11, 2023

/ok to test

Member

@dantegd dantegd left a comment

Changes look good from my side.

Member

@cjnolet cjnolet left a comment

Implementation looks great and it's coming together. My remaining comments are very minor at this point. I think we'll be able to get this into 23.08.

cpp/src/glm/qn/glm_base_mg.cuh (outdated review thread, resolved)
cpp/src/glm/qn_mg.cu (2 outdated review threads, resolved)
# the cdef was copied from cuml.linear_model.qn
cdef extern from "cuml/linear_model/glm.hpp" namespace "ML::GLM" nogil:

cdef enum qn_loss_type "ML::GLM::qn_loss_type":
Member

Also, can you add a little TODO here and reference the GitHub issue number? This helps us map it back to the code so it doesn't get lost or forgotten.

cpp/src/glm/qn_mg.cu (review thread, resolved)
@lijinf2
Contributor Author

lijinf2 commented Jul 19, 2023

Thanks so much for reviewing again! I have revised the PR and pushed the latest changes.

@cjnolet
Member

cjnolet commented Jul 19, 2023

/ok to test

Member

@cjnolet cjnolet left a comment

LGTM. Thanks!

Member

@cjnolet cjnolet left a comment

Sorry @lijinf2, that approval was premature and meant for a different PR. I think this will be getting approved shortly, though.

@lijinf2
Contributor Author

lijinf2 commented Jul 20, 2023

/ok to test

@lijinf2
Contributor Author

lijinf2 commented Jul 20, 2023

The checks seem to fail on unrelated test cases (e.g. "FAILED test_hdbscan.py::test_all_points_membership_vectors_circles[1000-knn-leaf-0-True-0.5-500-5-1000] - TypeError: 'numpy.float64' object cannot be interpreted as an integer"). Any idea? test_hdbscan.py passes locally on my workstation. Perhaps rerun "/ok to test" to reproduce the error.

@csadorf
Contributor

csadorf commented Jul 21, 2023

We are currently experiencing some CI issues due to changes in our dependencies (see #5514). Once those are resolved, we can rerun tests here.

@cjnolet
Member

cjnolet commented Jul 24, 2023

/merge

@rapids-bot rapids-bot bot merged commit e23167c into rapidsai:branch-23.08 Jul 24, 2023
50 checks passed
rapids-bot bot pushed a commit that referenced this pull request Jul 29, 2023
This is a follow-up PR to [PR 5477](#5477). It adds a predict API to MNMG logistic regression, plus tests to verify correctness.

Please review the code change from commit 171aef2 with the message "add predict operator". The implementation is straightforward once the dependency PR 5477 is merged.

Authors:
  - Jinfeng Li (https://github.com/lijinf2)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #5516
@lijinf2 lijinf2 deleted the lr_mg branch June 26, 2024 21:58