[SYSTEMDS-2951] Multi-GPU Support for End-to-End ML Pipelines #2050

Open · wants to merge 25 commits into base: main

Conversation

@WDRshadow commented Jul 13, 2024

Main updates:

  1. Fixed a bug where using multiple GPUs with CUDA caused a ConcurrentModificationException in ParForProgramBlock, because GPU memory was modified while it was being freed. Multiple GPUs can now be used to accelerate parfor and other functions that use multiple workers and threads (a minimal sketch of the underlying pattern follows this list).
  2. Resolved the mismatch between _numThreads and the actual number of threads when multiple GPU devices are present but only a single GPU is allowed (sysds.gpu.availableGPUs=1) to run parfor with multiple workers and threads.
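
The general pattern behind the ConcurrentModificationException fix is to iterate over a snapshot (or a concurrent collection) instead of the live list while entries are being freed. A minimal, illustrative Java sketch, not the actual SystemDS code, with hypothetical names (GpuObjectRegistry, GpuObject):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class GpuObjectRegistry {
    // Hypothetical registry of per-GPU objects; a concurrent collection lets one
    // thread free entries while worker threads still add or remove them.
    private final List<GpuObject> gpuObjects = new CopyOnWriteArrayList<>();

    public void register(GpuObject obj) {
        gpuObjects.add(obj);
    }

    public void freeAll() {
        // Iterating a CopyOnWriteArrayList never throws ConcurrentModificationException,
        // because its iterator works on an immutable snapshot of the list.
        for (GpuObject obj : gpuObjects) {
            obj.free();
            gpuObjects.remove(obj); // removal changes the live list, not the snapshot
        }
    }

    // Placeholder for whatever per-GPU handle is being tracked.
    public interface GpuObject {
        void free();
    }
}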

Other bugs in the multi-GPU code path:

  1. In a multi-GPU environment, when initializing the first GPUContext instance for cudnnHandle, cublasHandle, and cusparseHandle with JCuda 10.2.0, the native code freezes inside the JCuda call and the program cannot continue. This error occurs easily in our "4070 + 1080" dual-card test environment, although it sometimes works fine. The bug is not present in JCuda 11.8.0. We presume this is a JCuda issue and cannot be fixed in SystemDS.
  • We have tested and concluded that JCuda 10.2.0 does not support GPUs with the Ampere architecture or newer (A100, H100 or products of the same or later generation, RTX 30 and 40 series or higher).

Other bugs found outside the multi-GPU code path:

  1. The example script scripts/nn/examples/AttentionExample.dml cannot be run even with a single GPU. The error message is RuntimeException -- Unsupported operator:MAP. A function (most likely the map function) passes a _map operator to the TernaryOp class, where it is not categorized as a GPU operation.

TODO List:

  • We will write a test class to verify that multi-GPU support in parfor is actually being used.

@phaniarnab (Contributor)

Thanks, @WDRshadow, for initiating the project.
As discussed before, please experiment with realistic use cases such as parallel scoring and training. You can use our DNN built-ins.

@WDRshadow (Author)

Thank you for your comment. My partner @KexingLi22 is writing the test classes; we will have them soon. For DNN testing, we face the awkward situation of not having suitable GPUs: as mentioned above, newer graphics cards cannot run JCuda 10.2.0. To be precise, CUDA 10.2 is not supported by the RTX 30 series, the A100, and newer cards, and our test environment lacks older ones. Could you help us test in a multi-GPU environment with suitable GPUs once we have written the test classes, or could you provide a testing environment for us?

@phaniarnab (Contributor)

Thanks for clarifying. Unfortunately, at this point, we cannot provide a setup. Once you are done with the project, I can run some performance tests along with our performance test suites, but during the development period it is not feasible to try every change on our shared node.
Without a proper setup of two GPUs, it will be very hard to complete this project. I can offer two possible directions from here:

  1. Try running SystemDS on this setup with one GPU at CUDA 10.2 and the other at 11. CUDA 11 has some API differences and may not be able to execute all CUDA methods, but you may still have a functioning system. However, I have never tried this myself and am unsure about the behavior.
  2. Instead of multi-GPU, first implement a multi-stream single-GPU parfor. You need a single GPU with CUDA 10.2. You can use the JCuda API to create multiple GPU streams and assign a stream to each parfor thread (see the sketch after this list). This is probably a better alternative.
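
A minimal JCuda sketch of the stream-per-thread idea (illustrative only; the worker count is a hypothetical parameter, and the wiring of a stream into SystemDS instructions is not shown):

import jcuda.runtime.JCuda;
import jcuda.runtime.cudaStream_t;

public class StreamPerWorkerSketch {
    public static void main(String[] args) {
        final int numWorkers = 4; // hypothetical parfor degree of parallelism

        // Create one CUDA stream per worker thread on the single device.
        cudaStream_t[] streams = new cudaStream_t[numWorkers];
        for (int i = 0; i < numWorkers; i++) {
            streams[i] = new cudaStream_t();
            JCuda.cudaStreamCreate(streams[i]);
        }

        // Each parfor worker would enqueue its kernels and memory copies on its own
        // stream, so independent iterations can overlap on a single GPU.

        // Synchronize and release the streams once the loop is done.
        for (cudaStream_t s : streams) {
            JCuda.cudaStreamSynchronize(s);
            JCuda.cudaStreamDestroy(s);
        }
    }
}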

@WDRshadow (Author)

We got a dual RTX 2080 Ti server and tested the scripts in scripts/nn/examples. Except that AttentionExample cannot recognize the _map operator and Example-MNIST_2NN_Leaky_ReLu_Softmax cannot find the source file mnist_2NN.dml, the others run well. However, none of them are optimized for multiple GPUs; the only function currently optimized for multiple GPUs is parfor. We will keep testing the scripts in src/test/java/org/apache/sysds/test/functions/parfor and write new test scripts for multi-GPU cases.

@phaniarnab (Contributor)

Thanks. You do not have to optimize all NN workloads for multi-GPU; implementing robust parfor support is sufficient for this project.
Please write a scoring scenario using parfor: create a random matrix of test images and take one of the models. For each row, call the forward pass from within a parfor loop, allowing parallel scoring, and store the inferred class in a separate vector. I hope to see some performance improvement from utilizing multiple GPUs.
The existing parfor tests are not ideal for this project, as the operations in those scripts were not targeted at GPUs, so you may not see any speedups. However, you can use those tests for unit testing.
Did you verify that you are actually using both GPUs?

@WDRshadow (Author)

Thanks for the suggestion. It will be helpful for @KexingLi22 when writing the test instances.

There is no doubt that SystemDS uses multiple GPUs for the parfor computation. We have verified this in two ways:

  1. We used parfor to multiply two 10000x10000 matrices in a simple test case, and there is a significant reduction in runtime with two GPUs compared to a single GPU.

  2. In the Java debugger we can clearly see that, during the parallel computation in the executeLocalParFor function of the ParForProgramBlock class, the LocalParWorker and its thread corresponding to each of the two GPUs take on several computation tasks. In one matrix computation with parfor, for example, the RTX 4070 processed 8 tasks while the GTX 1080 processed 4.

We will demonstrate this in our test code.

@KexingLi22

Thanks for your suggestion, @phaniarnab.

We have written a test class, MultiGPUTest.java, with a single-GPU test case and a multi-GPU test case that run a script in which an EfficientNet model is trained and then predicts using parfor.

Everything works well; the execution time is 35 s 121 ms with a single GPU and 27 s 378 ms with multiple GPUs.

As advised by @WDRshadow, I also tried adding logger instances to both ParForBody and GPUContext to trace the threads and the GPUContexts, and I have already added the following to log4j.properties:

# Enable detailed logging for specific classes
log4j.logger.org.apache.sysds.runtime.controlprogram.parfor.ParForBody=DEBUG
log4j.logger.org.apache.sysds.runtime.instructions.gpu.context.GPUContext=DEBUG

But when I run the test .dml script with the parfor function, nothing like the message I expect shows up:
24/07/16 10:00:00 DEBUG ParForBody - Thread Thread-1 assigned to GPU context 0

How can I solve this problem?
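
For reference, such a message only appears if the class actually emits it; below is a rough sketch of the kind of statement that would need to exist (hypothetical class and method, assuming commons-logging as used elsewhere in SystemDS):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ParForGpuLoggingSketch {
    // The logger name must match an entry in log4j.properties (here a hypothetical
    // class name is used purely for illustration).
    private static final Log LOG = LogFactory.getLog(ParForGpuLoggingSketch.class.getName());

    public void assignWorkerToGpu(String threadName, int gpuContextId) {
        // Emits only if DEBUG is enabled for this logger in log4j.properties.
        if (LOG.isDebugEnabled())
            LOG.debug("Thread " + threadName + " assigned to GPU context " + gpuContextId);
    }
}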

@phaniarnab (Contributor)

Thanks. The numbers do not look very good. Train just once and write the model to disk. In a separate script, read the model and infer the test instances within a parfor loop; here is an example script [1]. You can even use a randomly initialized model, as we are not measuring accuracy here. I expect at least a 2x improvement. Vary the test size (i.e., the number of iterations of the parfor loop) from 10k to 100k.

First focus on the development, unit testing, and experiments. The logger can be delayed. Instead, extend the ParForStatistics class to report the number of GPUs used by the parfor and other relevant details. These will be printed when -stats is passed.

codecov bot commented Jul 18, 2024

Codecov Report

Attention: Patch coverage is 12.50000% with 7 lines in your changes missing coverage. Please review.

Project coverage is 68.82%. Comparing base (f81b76d) to head (cf08084).
Report is 1 commit behind head on main.

Files Patch % Lines
...sds/runtime/controlprogram/ParForProgramBlock.java 0.00% 3 Missing and 2 partials ⚠️
.../runtime/controlprogram/caching/CacheableData.java 33.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2050      +/-   ##
============================================
- Coverage     68.84%   68.82%   -0.02%     
- Complexity    40711    40756      +45     
============================================
  Files          1440     1440              
  Lines        161565   161693     +128     
  Branches      31418    31450      +32     
============================================
+ Hits         111232   111292      +60     
- Misses        41258    41346      +88     
+ Partials       9075     9055      -20     


Comment on lines 80 to 83
if (multiGPUs) {
assert extractedNumThreads > 1 : "Test failed: _numThreads is not greater than 1";
} else {
assert extractedNumThreads == 1 : "Test failed: _numThreads is not equal to 1";
@phaniarnab (Contributor) commented Jul 18, 2024

How does this assertion confirm the use of multiple GPUs?

Comment on lines 49 to 57
parfor(i in 1:iters) {
beg = ((i-1) * batch_size) %% N + 1
end = min(N, beg + batch_size - 1)
X_batch = images[beg:end,]
y_batch = labels[beg:end,]

pred = eff::netPredict(X_batch, model, 1, 28, 28)
partial_accuracies[i,1] = mean(rowIndexMax(pred) == rowIndexMax(y_batch))
}
Contributor

This is good. Run it from 10k to 100k mini-batches and plot the execution time. Compare single-GPU parfor and multi-GPU parfor. For one of the data points (e.g., 10k batches), also report the CPU time, using all available CPU cores.

@WDRshadow (Author)

@phaniarnab We have implemented test cases based on EfficientNet. Each test contains exactly the same training procedure. Parfor-based forward tests are performed using random datasets with the same seed, with the number of iterations ranging from 10k to 500k. The results using one and two GPUs are shown below:

test_id              num_iterations  1_gpu_exec_time_sec  2_gpu_exec_time_sec
test01_gpuTest_10k            10000               11.723               11.189
test01_gpuTest_20k            20000               13.714               12.866
test01_gpuTest_50k            50000               19.755               15.616
test01_gpuTest_100k          100000               29.141               23.026
test01_gpuTest_200k          200000               49.409               37.987
test01_gpuTest_500k          500000              108.874               77.917

Test environment:

  • CPU: 24 vCPU Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
  • GPU: RTX2080Ti * 2
  • RAM: 80G
  • OS: Ubuntu 18.04
  • CUDA: 10.2

@phaniarnab (Contributor)

@WDRshadow, thanks for putting the numbers here.
Did you take an average of 3 runs to capture the execution time? If not, please do that to avoid JIT compilation and GC overheads. I assume the numbers reported in this table measure only the total inference time and not the training time.

The speedup from 2 GPUs is far less than I expected. Can you explain why the speedup is not consistently 2x? If you are scoring n images, then each GPU gets n/2 images, which should lead to a 2x speedup. I do not anticipate any additional overhead for two GPUs in this use case.

@WDRshadow (Author)

Thanks. Your assumption is inaccurate: this time is the total execution time, which includes exactly the same training process before the parfor loop executes. That is one reason. I am not familiar with .dml files and have no time to learn them, so I do not know how to store and read a trained model.

@phaniarnab (Contributor)

Okay. In that case, try one of two options: (1) write the model to disk and create a separate dml script for inference in which you read the model and immediately start the parfor loop; you can find plenty of read/write examples in the test scripts and the reproducibility scripts I shared with you. (2) Use the time() method before and after the parfor and report only the inference time; you can find an example of using time() here: https://github.com/damslab/reproducibility/blob/master/vldb2022-UPLIFT-p2528/FTBench/systemds/T1.dml

For either option, make sure the intermediates are already materialized before the loop starts. The SystemDS compiler sometimes delays operations until they are used; you can print the sum of a matrix to force materialization.

@WDRshadow (Author) commented Jul 21, 2024

@phaniarnab We have changed our code and tested again. The reported time now includes only the parfor execution. The parfor was run 3 times and we used the average time. Here is the record:

test_id              num_iterations  1_gpu_time_sec  2_gpu_time_sec  boost_rate
test01_gpuTest_10k            10000             2.0             1.7       15.0%
test01_gpuTest_20k            20000             4.0             3.0       25.0%
test01_gpuTest_50k            50000            11.0             7.3       33.7%
test01_gpuTest_100k          100000            22.3            15.0       32.7%
test01_gpuTest_200k          200000            46.0            31.3       31.9%
test01_gpuTest_500k          500000           109.3            79.3       27.5%
Total                        880000           194.6           137.6       29.3%

Test environment:

  • CPU: 24 vCPU Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
  • GPU: RTX2080Ti * 2
  • RAM: 80G
  • OS: Ubuntu 18.04
  • CUDA: 10.2

Comments:

From the table it can be seen that the boost_rate does not reach the desired 50%. This is likely due to under-optimization of LocalParWorker or of GPU memory management. We have observed the following factors that may affect multi-GPU optimization (an illustrative partitioning sketch follows this list):

  1. Multiple GPUs share a storage space protected by synchronization locks. For example, _gpuObjects stores the caches in each task for the GPUs to read and record data; each time a GPU reads that data it can cause blocking.
  2. The TaskPartitioner design may not be optimal. When few tasks are allocated for a large amount of data but a small number of threads, each individual task becomes larger. If a GPU computation fails, the task has to be recomputed, which costs more time when that failing task is "big". This can be mitigated by improving task allocation.
  3. The speedup improves greatly when the tasks are divided equally between the GPUs and there are no errors in the computation. However, we have observed that with dual graphics cards, one card may execute more tasks than the other.
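
An illustrative sketch (not SystemDS code; all names are hypothetical) of the kind of round-robin task partitioning that keeps individual tasks small and the per-GPU load balanced:

import java.util.ArrayList;
import java.util.List;

public class RoundRobinPartitionSketch {
    // Splits the iteration range [1, numIterations] into fixed-size tasks and assigns
    // them to GPUs in round-robin order, so both cards receive a similar load and a
    // failed task stays cheap to recompute.
    public static List<List<int[]>> partition(int numIterations, int taskSize, int numGpus) {
        List<List<int[]>> perGpuTasks = new ArrayList<>();
        for (int g = 0; g < numGpus; g++)
            perGpuTasks.add(new ArrayList<>());

        int taskId = 0;
        for (int begin = 1; begin <= numIterations; begin += taskSize, taskId++) {
            int end = Math.min(numIterations, begin + taskSize - 1);
            perGpuTasks.get(taskId % numGpus).add(new int[] {begin, end});
        }
        return perGpuTasks;
    }

    public static void main(String[] args) {
        // Example: 100 iterations, tasks of 10 iterations each, 2 GPUs -> 5 tasks per GPU.
        List<List<int[]>> tasks = partition(100, 10, 2);
        for (int g = 0; g < tasks.size(); g++)
            System.out.println("GPU " + g + " gets " + tasks.get(g).size() + " tasks");
    }
}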

@phaniarnab (Contributor)

Okay. Thanks, @WDRshadow, @KexingLi22.
Did you manage to set up CUDA version 10.2 on both GPUs for these experiments?
Please make sure the regression tests (GitHub Actions) are not failing due to any of your changes. I have rerun the failed tests.

@WDRshadow (Author)

The GPU driver and CUDA version are shared by all GPUs in the same machine, so yes, both GPUs are set up with CUDA version 10.2.

@phaniarnab (Contributor)

Looks like all tests are passing now.
