[SYSTEMDS-2951] Multi-GPU Support for End-to-End ML Pipelines #2050

Open · wants to merge 25 commits into base: main

Conversation

@WDRshadow commented Jul 13, 2024

Main updates:

  1. Fixed a bug where using multiple GPUs with CUDA caused a ConcurrentModificationException in ParForProgramBlock, because GPU memory was modified while it was being freed. Multiple GPUs can now be used to accelerate parfor and other functions that use multiple workers and threads (a minimal sketch of the underlying pattern follows this list).
  2. Resolved the mismatch between _numThreads and the actual number of threads when multiple GPU devices are present but only a single GPU is allowed (sysds.gpu.availableGPUs=1) to run parfor with multiple workers and threads.
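
The general pattern behind the ConcurrentModificationException fix is to iterate over a snapshot (or a concurrent collection) instead of the live list while entries are being freed. A minimal, illustrative Java sketch, not the actual SystemDS code, with hypothetical names (GpuObjectRegistry, GpuObject):

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class GpuObjectRegistry {
    // Hypothetical registry of per-GPU objects; a concurrent collection lets one
    // thread free entries while worker threads still add or remove them.
    private final List<GpuObject> gpuObjects = new CopyOnWriteArrayList<>();

    public void register(GpuObject obj) {
        gpuObjects.add(obj);
    }

    public void freeAll() {
        // Iterating a CopyOnWriteArrayList never throws ConcurrentModificationException,
        // because its iterator works on an immutable snapshot of the list.
        for (GpuObject obj : gpuObjects) {
            obj.free();
            gpuObjects.remove(obj); // removal changes the live list, not the snapshot
        }
    }

    // Placeholder for whatever per-GPU handle is being tracked.
    public interface GpuObject {
        void free();
    }
}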

Other bugs in the multi-GPU code path:

  1. In a multi-GPU environment, when initializing the first GPUContext instance for cudnnHandle, cublasHandle, and cusparseHandle with JCuda 10.2.0, the native code freezes inside the JCuda call and the program cannot continue. This error occurs easily in our "4070 + 1080" dual-card test environment, although it sometimes works fine. The bug is not present in JCuda 11.8.0. We presume this is a JCuda issue and cannot be fixed in SystemDS.
  • We have tested and concluded that JCuda 10.2.0 does not support GPUs with the Ampere architecture or newer (A100, H100 or products of the same or later generation, RTX 30 and 40 series or higher).

Other bugs found outside the multi-GPU code path:

  1. The example script scripts/nn/examples/AttentionExample.dml cannot be run even with a single GPU. The error message is RuntimeException -- Unsupported operator:MAP. A function (most likely the map function) passes a _map operator to the TernaryOp class, where it is not categorized as a GPU operation.

TODO List:

  • We will write a test class to verify that multi-GPU support in parfor is actually being used.

@phaniarnab (Contributor)

Thanks, @WDRshadow, for initiating the project.
As discussed before, please experiment with realistic use cases such as parallel scoring and training. You can use our DNN built-ins.

@WDRshadow (Author)

Thank you for your comment. My partner @KexingLi22 is writing the test classes; we will have them soon. For DNN testing, we face the awkward situation of not having suitable GPUs: as mentioned above, newer graphics cards cannot run JCuda 10.2.0. To be precise, CUDA 10.2 is not supported by the RTX 30 series, the A100, and newer cards, and our test environment lacks older ones. Could you help us test in a multi-GPU environment with suitable GPUs once we have written the test classes, or could you provide a testing environment for us?

@phaniarnab (Contributor)

Thanks for clarifying. Unfortunately, at this point, we cannot provide a setup. Once you are done with the project, I can run some performance tests along with our performance test suites, but during the development period it is not feasible to try every change on our shared node.
Without a proper setup of two GPUs, it will be very hard to complete this project. I can offer two possible directions from here:

  1. Try running SystemDS on this setup with one GPU at CUDA 10.2 and the other at 11. CUDA 11 has some API differences and may not be able to execute all CUDA methods, but you may still have a functioning system. However, I have never tried this myself and am unsure about the behavior.
  2. Instead of multi-GPU, first implement a multi-stream single-GPU parfor. You need a single GPU with CUDA 10.2. You can use the JCuda API to create multiple GPU streams and assign a stream to each parfor thread (see the sketch after this list). This is probably a better alternative.
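
A minimal JCuda sketch of the stream-per-thread idea (illustrative only; the worker count is a hypothetical parameter, and the wiring of a stream into SystemDS instructions is not shown):

import jcuda.runtime.JCuda;
import jcuda.runtime.cudaStream_t;

public class StreamPerWorkerSketch {
    public static void main(String[] args) {
        final int numWorkers = 4; // hypothetical parfor degree of parallelism

        // Create one CUDA stream per worker thread on the single device.
        cudaStream_t[] streams = new cudaStream_t[numWorkers];
        for (int i = 0; i < numWorkers; i++) {
            streams[i] = new cudaStream_t();
            JCuda.cudaStreamCreate(streams[i]);
        }

        // Each parfor worker would enqueue its kernels and memory copies on its own
        // stream, so independent iterations can overlap on a single GPU.

        // Synchronize and release the streams once the loop is done.
        for (cudaStream_t s : streams) {
            JCuda.cudaStreamSynchronize(s);
            JCuda.cudaStreamDestroy(s);
        }
    }
}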

@WDRshadow (Author)

We got a dual RTX 2080 Ti server and tested the scripts in scripts/nn/examples. Except that AttentionExample cannot recognize the _map operator and Example-MNIST_2NN_Leaky_ReLu_Softmax cannot find the source file mnist_2NN.dml, the others run well. However, none of them are optimized for multiple GPUs; the only function currently optimized for multiple GPUs is parfor. We will keep testing the scripts in src/test/java/org/apache/sysds/test/functions/parfor and write new test scripts for multi-GPU cases.

@phaniarnab (Contributor)

Thanks. You do not have to optimize all NN workloads for multi-GPU; implementing robust parfor support is sufficient for this project.
Please write a scoring scenario using parfor: create a random matrix of test images and take one of the models. For each row, call the forward pass from within a parfor loop, allowing parallel scoring, and store the inferred class in a separate vector. I hope to see some performance improvement from utilizing multiple GPUs.
The existing parfor tests are not ideal for this project, as the operations in those scripts were not targeted at GPUs, so you may not see any speedups. However, you can use those tests for unit testing.
Did you verify that you are actually using both GPUs?

@WDRshadow (Author)

Thanks for the suggestion. It will be helpful for @KexingLi22 when writing the test instances.

There is no doubt that SystemDS uses multiple GPUs for the parfor computation. We have verified this in two ways:

  1. We used parfor to multiply two 10000x10000 matrices in a simple test case, and there is a significant reduction in runtime with two GPUs compared to a single GPU.

  2. In the Java debugger we can clearly see that, during the parallel computation in the executeLocalParFor function of the ParForProgramBlock class, the LocalParWorker and its thread corresponding to each of the two GPUs take on several computation tasks. In one matrix computation with parfor, for example, the RTX 4070 processed 8 tasks while the GTX 1080 processed 4.

We will demonstrate this in our test code.

@KexingLi22

Thanks for your suggestion, @phaniarnab.

We have written a test class, MultiGPUTest.java, with a single-GPU test case and a multi-GPU test case that run a script in which an EfficientNet model is trained and then predicts using parfor.

Everything works well; the execution time is 35 s 121 ms with a single GPU and 27 s 378 ms with multiple GPUs.

As advised by @WDRshadow, I also tried adding logger instances to both ParForBody and GPUContext to trace the threads and the GPUContexts, and I have already added the following to log4j.properties:

# Enable detailed logging for specific classes
log4j.logger.org.apache.sysds.runtime.controlprogram.parfor.ParForBody=DEBUG
log4j.logger.org.apache.sysds.runtime.instructions.gpu.context.GPUContext=DEBUG

But when I run the test .dml script with the parfor function, nothing like the message I expect shows up:
24/07/16 10:00:00 DEBUG ParForBody - Thread Thread-1 assigned to GPU context 0

How can I solve this problem?
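
For reference, such a message only appears if the class actually emits it; below is a rough sketch of the kind of statement that would need to exist (hypothetical class and method, assuming commons-logging as used elsewhere in SystemDS):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ParForGpuLoggingSketch {
    // The logger name must match an entry in log4j.properties (here a hypothetical
    // class name is used purely for illustration).
    private static final Log LOG = LogFactory.getLog(ParForGpuLoggingSketch.class.getName());

    public void assignWorkerToGpu(String threadName, int gpuContextId) {
        // Emits only if DEBUG is enabled for this logger in log4j.properties.
        if (LOG.isDebugEnabled())
            LOG.debug("Thread " + threadName + " assigned to GPU context " + gpuContextId);
    }
}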

@phaniarnab (Contributor)

Thanks. The numbers do not look very good. Train just once and write the model to disk. In a separate script, read the model and infer the test instances within a parfor loop; here is an example script [1]. You can even use a randomly initialized model, as we are not measuring accuracy here. I expect at least a 2x improvement. Vary the test size (i.e., the number of iterations of the parfor loop) from 10k to 100k.

First focus on the development, unit testing, and experiments. The logger can be delayed. Instead, extend the ParForStatistics class to report the number of GPUs used by the parfor and other relevant details. These will be printed when -stats is passed.

codecov bot commented Jul 18, 2024

Codecov Report

Attention: Patch coverage is 12.50000% with 7 lines in your changes missing coverage. Please review.

Project coverage is 68.82%. Comparing base (f81b76d) to head (cf08084).
Report is 1 commit behind head on main.

Files Patch % Lines
...sds/runtime/controlprogram/ParForProgramBlock.java 0.00% 3 Missing and 2 partials ⚠️
.../runtime/controlprogram/caching/CacheableData.java 33.33% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2050      +/-   ##
============================================
- Coverage     68.84%   68.82%   -0.02%     
- Complexity    40711    40756      +45     
============================================
  Files          1440     1440              
  Lines        161565   161693     +128     
  Branches      31418    31450      +32     
============================================
+ Hits         111232   111292      +60     
- Misses        41258    41346      +88     
+ Partials       9075     9055      -20     


Comment on lines 80 to 83
if (multiGPUs) {
assert extractedNumThreads > 1 : "Test failed: _numThreads is not greater than 1";
} else {
assert extractedNumThreads == 1 : "Test failed: _numThreads is not equal to 1";
@phaniarnab (Contributor) commented Jul 18, 2024

How does this assertion confirm the use of multiple GPUs?

Comment on lines 49 to 57
parfor(i in 1:iters) {
beg = ((i-1) * batch_size) %% N + 1
end = min(N, beg + batch_size - 1)
X_batch = images[beg:end,]
y_batch = labels[beg:end,]

pred = eff::netPredict(X_batch, model, 1, 28, 28)
partial_accuracies[i,1] = mean(rowIndexMax(pred) == rowIndexMax(y_batch))
}
Contributor

This is good. Run it from 10k to 100k mini-batches and plot the execution time. Compare single-GPU parfor and multi-GPU parfor. For one of the data points (e.g., 10k batches), also report the CPU time, using all available CPU cores.

@WDRshadow (Author)

@phaniarnab We have implemented test cases based on EfficientNet. Each test contains exactly the same training procedure. Parfor-based forward tests are performed using random datasets with the same seed, with the number of iterations ranging from 10k to 500k. The results using one and two GPUs are shown below:

test_id              num_iterations  1_gpu_exec_time_sec  2_gpu_exec_time_sec
test01_gpuTest_10k            10000               11.723               11.189
test01_gpuTest_20k            20000               13.714               12.866
test01_gpuTest_50k            50000               19.755               15.616
test01_gpuTest_100k          100000               29.141               23.026
test01_gpuTest_200k          200000               49.409               37.987
test01_gpuTest_500k          500000              108.874               77.917

Test environment:

  • CPU: 24 vCPU Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
  • GPU: RTX2080Ti * 2
  • RAM: 80G
  • OS: Ubuntu 18.04
  • CUDA: 10.2

@phaniarnab (Contributor)

@WDRshadow, thanks for putting the numbers here.
Did you take an average of 3 runs to capture the execution time? If not, please do that to avoid JIT compilation and GC overheads. I assume the numbers reported in this table measure only the total inference time and not the training time.

The speedup from 2 GPUs is far less than I expected. Can you explain why the speedup is not consistently 2x? If you are scoring n images, then each GPU gets n/2 images, which should lead to a 2x speedup. I do not anticipate any additional overhead for two GPUs in this use case.

@WDRshadow (Author)

Thanks. Your assumption is inaccurate: this time is the total execution time, which includes exactly the same training process before the parfor loop executes. That is one reason. I am not familiar with .dml files and have no time to learn them, so I do not know how to store and read a trained model.

@phaniarnab (Contributor)

Okay. In that case, try one of two options: (1) write the model to disk and create a separate dml script for inference in which you read the model and immediately start the parfor loop; you can find plenty of read/write examples in the test scripts and the reproducibility scripts I shared with you. (2) Use the time() method before and after the parfor and report only the inference time; you can find an example of using time() here: https://github.com/damslab/reproducibility/blob/master/vldb2022-UPLIFT-p2528/FTBench/systemds/T1.dml

For either option, make sure the intermediates are already materialized before the loop starts. The SystemDS compiler sometimes delays operations until they are used; you can print the sum of a matrix to force materialization.

@WDRshadow (Author) commented Jul 21, 2024

@phaniarnab We have changed our code and tested again. The reported time now includes only the parfor execution. The parfor was run 3 times and we used the average time. Here is the record:

test_id              num_iterations  1_gpu_time_sec  2_gpu_time_sec  boost_rate
test01_gpuTest_10k            10000             2.0             1.7       15.0%
test01_gpuTest_20k            20000             4.0             3.0       25.0%
test01_gpuTest_50k            50000            11.0             7.3       33.7%
test01_gpuTest_100k          100000            22.3            15.0       32.7%
test01_gpuTest_200k          200000            46.0            31.3       31.9%
test01_gpuTest_500k          500000           109.3            79.3       27.5%
Total                        880000           194.6           137.6       29.3%

Test environment:

  • CPU: 24 vCPU Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
  • GPU: RTX2080Ti * 2
  • RAM: 80G
  • OS: Ubuntu 18.04
  • CUDA: 10.2

Comments:

From the table it can be seen that the boost_rate does not reach the desired 50%. This is likely due to under-optimization of LocalParWorker or of GPU memory management. We have observed the following factors that may affect multi-GPU optimization (an illustrative partitioning sketch follows this list):

  1. Multiple GPUs share a storage space protected by synchronization locks. For example, _gpuObjects stores the caches in each task for the GPUs to read and record data; each time a GPU reads that data it can cause blocking.
  2. The TaskPartitioner design may not be optimal. When few tasks are allocated for a large amount of data but a small number of threads, each individual task becomes larger. If a GPU computation fails, the task has to be recomputed, which costs more time when that failing task is "big". This can be mitigated by improving task allocation.
  3. The speedup improves greatly when the tasks are divided equally between the GPUs and there are no errors in the computation. However, we have observed that with dual graphics cards, one card may execute more tasks than the other.
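
An illustrative sketch (not SystemDS code; all names are hypothetical) of the kind of round-robin task partitioning that keeps individual tasks small and the per-GPU load balanced:

import java.util.ArrayList;
import java.util.List;

public class RoundRobinPartitionSketch {
    // Splits the iteration range [1, numIterations] into fixed-size tasks and assigns
    // them to GPUs in round-robin order, so both cards receive a similar load and a
    // failed task stays cheap to recompute.
    public static List<List<int[]>> partition(int numIterations, int taskSize, int numGpus) {
        List<List<int[]>> perGpuTasks = new ArrayList<>();
        for (int g = 0; g < numGpus; g++)
            perGpuTasks.add(new ArrayList<>());

        int taskId = 0;
        for (int begin = 1; begin <= numIterations; begin += taskSize, taskId++) {
            int end = Math.min(numIterations, begin + taskSize - 1);
            perGpuTasks.get(taskId % numGpus).add(new int[] {begin, end});
        }
        return perGpuTasks;
    }

    public static void main(String[] args) {
        // Example: 100 iterations, tasks of 10 iterations each, 2 GPUs -> 5 tasks per GPU.
        List<List<int[]>> tasks = partition(100, 10, 2);
        for (int g = 0; g < tasks.size(); g++)
            System.out.println("GPU " + g + " gets " + tasks.get(g).size() + " tasks");
    }
}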

@phaniarnab (Contributor)

Okay. Thanks, @WDRshadow, @KexingLi22.
Did you manage to set up CUDA version 10.2 on both GPUs for these experiments?
Please make sure the regression tests (GitHub Actions) are not failing due to any of your changes. I have rerun the failed tests.

@WDRshadow (Author)

The GPU driver and CUDA version are shared by all GPUs in the same machine, so yes, both GPUs are set up with CUDA version 10.2.

@phaniarnab (Contributor)

Looks like all tests are passing now.
