
Add multi-node-multi-gpu Logistic Regression in C++ #5477

Merged
merged 12 commits into rapidsai:branch-23.08 on Jul 24, 2023

Conversation

lijinf2
Contributor

@lijinf2 lijinf2 commented Jun 26, 2023

This PR enables multi-node multi-GPU Logistic Regression, mostly reusing the existing code (i.e. GLMWithData and min_lbfgs) of single-GPU Logistic Regression. No existing code is changed.

Added pytest code for a Spark cluster; the tests run successfully with 2 GPUs on a random dataset. The resulting coef_ and intercept_ are the same as single-GPU cuml.LogisticRegression.fit. The pytest code can be found here: https://github.com/lijinf2/spark-rapids-ml/blob/lr/python/tests/test_logistic_regression.py
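
As a rough, hedged sketch of the correctness check described above (not taken from the linked test file): the MNMG results can be compared against single-GPU cuml.LogisticRegression, assuming X and y are host NumPy arrays and using an illustrative tolerance.

```python
# Hedged sketch only: compare MNMG-produced coefficients against the
# single-GPU cuml.LogisticRegression reference. X and y are assumed to be
# host NumPy arrays so coef_ / intercept_ come back as NumPy arrays; the
# helper name and tolerance are illustrative, not from the PR.
import numpy as np
from cuml import LogisticRegression


def check_against_single_gpu(mg_coef, mg_intercept, X, y, tol=1e-3):
    sg = LogisticRegression(fit_intercept=True)
    sg.fit(X, y)
    np.testing.assert_allclose(np.asarray(mg_coef), np.asarray(sg.coef_), rtol=tol)
    np.testing.assert_allclose(np.asarray(mg_intercept), np.asarray(sg.intercept_), rtol=tol)
```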

@rapids-bot

rapids-bot bot commented Jun 26, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write permissions or greater before CI can begin.

@github-actions github-actions bot added the "Cython / Python" and "CMake" labels on Jun 26, 2023
@lijinf2 lijinf2 changed the title from "Add multi-node-multi-gpu Logistic Regression in C++" to "[DRAFT] Add multi-node-multi-gpu Logistic Regression in C++" on Jun 26, 2023
@cjnolet cjnolet added the "improvement" and "non-breaking" labels on Jun 27, 2023
@lijinf2
Contributor Author

lijinf2 commented Jun 28, 2023

Rebased onto the latest branch-23.08.

Member

@cjnolet cjnolet left a comment

Thanks for this new feature @lijinf2! This is going to be useful not just in Spark but in Dask and other places as well!

I think overall the C++ code is structured pretty well. There are some general details around the communicator and handle, documentation, and input validation that should still be addressed.

I think the Cython can be simplified further, and the Dask piece should be pretty straightforward to add. The Dask Linear Regression wrapper should be helpful there. Also, as discussed offline, we will need tests, and the Dask Linear Regression pytests should be helpful there. More specifically, you should be able to use the Dask make_classification data generator for that.

Thanks again for this contribution!
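
A minimal sketch of the make_classification suggestion above, assuming a local two-GPU cluster started with dask-cuda; the parameter values are illustrative only.

```python
# Hedged sketch: generate a distributed binary classification dataset with
# cuml's Dask data generator on a local two-GPU cluster. Parameter values
# are illustrative, not taken from this PR's tests.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.datasets.classification import make_classification

cluster = LocalCUDACluster(n_workers=2)
client = Client(cluster)

X, y = make_classification(n_samples=10000, n_features=20,
                           n_informative=10, n_classes=2)
```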

cpp/src/glm/qn/glm_base_mg.cuh (5 outdated review threads, resolved)
cpp/src/glm/qn_mg.cu (outdated review thread, resolved)
# the cdef was copied from cuml.linear_model.qn
cdef extern from "cuml/linear_model/glm.hpp" namespace "ML::GLM" nogil:

cdef enum qn_loss_type "ML::GLM::qn_loss_type":
Member

Since this isn't MG-specific, we should centralize it so that the SG and MG code can use the same one.

Contributor Author

This will affect the single-GPU logistic_regression.pyx. Is it OK to do this in the next PR to avoid this one getting too long?

Member

I agree with the logic of adding this change in a follow-up, but could you open a GH issue capturing this (and any other small follow-up items)?

Member

Also, can you add a little TODO here and reference the GitHub issue number? This helps us map it back to the code so it doesn't get lost or forgotten.

QN_LOSS_ABS "ML::GLM::QN_LOSS_ABS"
QN_LOSS_UNKNOWN "ML::GLM::QN_LOSS_UNKNOWN"

cdef struct qn_params:
Member

Same here: this isn't MG-specific, so we should use the SG version.

int n_ranks) except +


class LogisticRegressionMG(LogisticRegression):
Member

My suggestion is to use the design from LinearRegressionMG as much as possible to model the relationship between LogisticRegression and LogisticRegressionMG. There are some more classes and mixins that can reduce boilerplate further.

Contributor Author

@lijinf2 lijinf2 Jul 1, 2023

I find it a bit challenging to inherit the mixins, for two reasons:

(1) min_lbfgs accepts float*, while the mixins' fit accepts vector<float*>.
(2) min_lbfgs supports multiple classes, so coef_ can be a matrix, whereas the mixins assume coef_ to be a vector of length D + pams.fit_intercept.

Are there existing APIs to address (1) and (2) that would simplify the implementation?

Contributor Author

(3) Another issue that pops up relates to mixins._fit and (1). It seems mixins.fit converts the input vectors and labels into a special type (i.e. X_arg, y_arg) using opg.build_data_t(X_arys). Is there a way to convert X_arg and y_arg to float* (min_lbfgs requires float*)?

Contributor Author

Combined the design from 'LinearRegressionMG' and single-GPU 'LogisticRegression'. Specifically:
(1) Made self._coef initialization support both the LinearRegressionMG style (i.e. a 1-D array) and the LogisticRegression style (i.e. a 2-D array).
(2) Converted the types of the input vectors and input labels from the LinearRegressionMG style (vector of Matrix::Data type) to the LogisticRegression style (float*).
(3) Set n_classes to 2 in this PR; multi-class support will be added in follow-up PRs.

self.solver_model._coef_ = CumlArray.zeros(
coef_size, dtype=self.dtype, order='C')

def fit(self, X, y, rank, n_ranks, n_samples, n_classes, convert_dtype=False) -> "LogisticRegressionMG":
Member

As mentioned above, I think this could be simplified further to look more like this.

Contributor Author

Revised the code; _fit now looks exactly the same. fit looks the same except for one additional argument (i.e. order='F'), because LinearRegressionMG uses 'F' order while cuml.LogisticRegression uses 'C' order.

@lijinf2
Contributor Author

lijinf2 commented Jun 29, 2023

Thanks so much for reviewing the PR, @cjnolet! Your comments and links are very helpful, and I will start working on the revision following your suggestions. One thing I feel uncertain about relates to the num_classes calculation. Is there an existing C++ API to compute num_classes from labels across GPUs? This would help move the num_classes variable into the C++ implementation and remove one argument from the Cython fit wrapper.

@cjnolet
Member

cjnolet commented Jun 30, 2023

Is there an existing C++ API to calculate num_classes from labels across GPUs?

I don't know if we have a way to do this on multiple GPUs yet. We could use an nunique primitive and potentially do a distributed reduction. Maybe something we should think about a little more. Since time is important here, if we can't think of anything very soon, I would suggest we create a GitHub issue for it and reference that issue in a TODO in the code so we don't lose sight of it. I'll be the first to say that we are in need of a nice collection of MNMG primitives that are fully C++.

@cjnolet
Member

cjnolet commented Jun 30, 2023

/ok to test

@lijinf2
Contributor Author

lijinf2 commented Jun 30, 2023

Is there an existing C++ API to calculate num_classes from labels across GPUs?

Figured out it is possible with the C++ raft::label::detail::getUniquelabel and comm.allgather. But it turns out the Python wrapper logistic_regression_mg.pyx also needs this num_classes on the Python side to initialize coef_, so a wrapper around the C++ implementation would be required.

Maybe in this PR we can assume num_classes to be 2 and remove num_classes from the Cython fit argument list. In a future PR we can implement the C++ num_classes calculation and a wrapper to use it in logistic_regression_mg.pyx.
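
For illustration only, here is a hedged Python-side analogue of the same idea (per-partition unique labels, then a global union), using a Dask array rather than the raft::label::detail::getUniquelabel + comm.allgather path named above; this is not the C++ implementation discussed here, and the names are illustrative.

```python
# Hedged sketch, Python-side analogue only (NOT the raft/C++ path): the
# workers' partitions contribute their unique labels, and the client takes
# the union to obtain the number of classes.
import dask.array as da


def num_classes_from_labels(y):
    # y: a Dask array of labels distributed across the workers
    return int(da.unique(y).compute().size)
```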

@lijinf2
Contributor Author

lijinf2 commented Jul 5, 2023

@cjnolet Thanks so much for the comments! They are very helpful. After addressing them, I find the implementation of the Dask class (cuml.dask.linear_model.LogisticRegression) and its pytests become much simpler. I would like to have your review again when you have time.

The revised PR mainly adds two new files (dask/linear_model/LogisticRegression and tests/dask/test_dask_logistic_regression.py). The added pytests pass on my 2-GPU workstation. I still need to run the full cuml pytest suite.
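
A hedged sketch of what a Dask-level test along these lines might look like; the actual tests live in tests/dask/test_dask_logistic_regression.py, and the client fixture, parameters, and assertions here are illustrative rather than copied from that file.

```python
# Hedged sketch only: fit the new Dask wrapper on generated data and check
# that the fitted attributes are populated. The `client` fixture is assumed
# to come from the test suite's conftest; it is not defined here.
from cuml.dask.datasets.classification import make_classification
from cuml.dask.linear_model import LogisticRegression


def test_lr_fit(client):
    X, y = make_classification(n_samples=5000, n_features=10,
                               n_informative=5, n_classes=2)
    lr = LogisticRegression()
    lr.fit(X, y)
    assert lr.coef_ is not None
    assert lr.intercept_ is not None
```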

@lijinf2 lijinf2 marked this pull request as ready for review July 11, 2023 00:18
@lijinf2 lijinf2 requested review from a team as code owners July 11, 2023 00:18
@lijinf2 lijinf2 changed the title from "[DRAFT] Add multi-node-multi-gpu Logistic Regression in C++" to "Add multi-node-multi-gpu Logistic Regression in C++" on Jul 11, 2023
@dantegd
Member

dantegd commented Jul 11, 2023

/ok to test

@lijinf2
Contributor Author

lijinf2 commented Jul 11, 2023

/ok to test

@csadorf
Contributor

csadorf commented Jul 11, 2023

/ok to test

Member

@dantegd dantegd left a comment

Changes look good from my side.

Member

@cjnolet cjnolet left a comment

Implementation looks great and it's coming together. My remaining comments are very minor at this point. I think we'll be able to get this into 23.08.

cpp/src/glm/qn/glm_base_mg.cuh (outdated review thread, resolved)
cpp/src/glm/qn_mg.cu (2 outdated review threads, resolved)
# the cdef was copied from cuml.linear_model.qn
cdef extern from "cuml/linear_model/glm.hpp" namespace "ML::GLM" nogil:

cdef enum qn_loss_type "ML::GLM::qn_loss_type":
Member

Also, can you add a little TODO here and reference the GitHub issue number? This helps us map it back to the code so it doesn't get lost or forgotten.

cpp/src/glm/qn_mg.cu (review thread, resolved)
@lijinf2
Contributor Author

lijinf2 commented Jul 19, 2023

Thanks so much for reviewing again! I have revised the PR and pushed the latest changes.

@cjnolet
Member

cjnolet commented Jul 19, 2023

/ok to test

Member

@cjnolet cjnolet left a comment

LGTM. Thanks!

Member

@cjnolet cjnolet left a comment

Sorry @lijinf2, that approval was premature and meant for a different PR. I think this will be getting approved shortly, though.

@lijinf2
Contributor Author

lijinf2 commented Jul 20, 2023

/ok to test

@lijinf2
Contributor Author

lijinf2 commented Jul 20, 2023

The checks seem to fail on unrelated test cases (e.g. "FAILED test_hdbscan.py::test_all_points_membership_vectors_circles[1000-knn-leaf-0-True-0.5-500-5-1000] - TypeError: 'numpy.float64' object cannot be interpreted as an integer"). Any idea? test_hdbscan.py passes locally on my workstation. Perhaps rerun "/ok to test" to reproduce the error.

@csadorf
Contributor

csadorf commented Jul 21, 2023

We are currently experiencing some CI issues due to changes in our dependencies (see #5514). Once those are resolved, we can rerun tests here.

@cjnolet
Member

cjnolet commented Jul 24, 2023

/merge

@rapids-bot rapids-bot bot merged commit e23167c into rapidsai:branch-23.08 Jul 24, 2023
50 checks passed
rapids-bot bot pushed a commit that referenced this pull request Jul 29, 2023
This is a follow-up PR to [PR 5477](#5477). It adds a predict API to MNMG logistic regression, plus tests to verify correctness.

Please review the code change from commit 171aef2 with the message "add predict operator". The implementation is straightforward once the dependency PR 5477 is merged.

Authors:
  - Jinfeng Li (https://github.com/lijinf2)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #5516
@lijinf2 lijinf2 deleted the lr_mg branch June 26, 2024 21:58