MPS with cuQuantum #2168
Because |
It fails when running with the MPS method on GPU, with the following error message:
Yes, I know. I am actively working on this; please give me some time and I
will let you know when I am done.
…On Fri, 14 Jun, 2024, 7:22 am Jun Doi, ***@***.***> wrote:
It fails running with MPS method on GPU with error message as following,
ERROR TensorNet::contractor : CUTENSORNET_STATUS_INVALID_VALUE
@doichanj, please see if this is good enough :) Extremely sorry for the delay; doing any development in a cloud VM is not a good experience.
I have absolutely no idea why the macOS tests are failing.
Looks like the problem is an old cvxpy version; #2169 should fix it?
I have got my hands on an NVIDIA 3060 12 GB.
I am going through more lectures of docker container when I do
FAILED: qiskit_aer/backends/wrappers/CMakeFiles/controller_wrappers.dir/__/__/__/src/simulators/statevector/qv_avx2.cpp.o
/usr/local/cuda-12.5/bin/nvcc
-forward-unknown-to-host-compiler
-DAER_CUSTATEVEC
-DAER_CUTENSORNET
-DAER_THRUST_SUPPORTED=TRUE
-DSPDLOG_COMPILED_LIB
-DSPDLOG_FMT_EXTERNAL
-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA
-Dcontroller_wrappers_EXPORTS
-I/usr/include/python3.11
-I/root/.venvs/qiskit-aer-dev/lib64/python3.11/site-packages/pybind11/include
-I/home/qiskit-aer/src
-isystem /root/.conan/data/nlohmann_json/3.1.1/_/_/package/5ab84d6acfe1f23c4fae0ab88f26e3a396351ac9/include
-isystem /root/.conan/data/spdlog/1.9.2/_/_/package/0ddf38f27cb22953937f1e00a3576684c190cabc/include
-isystem /root/.conan/data/fmt/8.0.1/_/_/package/300cb578a6fb15627a390ac70b832e82fd040283/include
-O3
-DNDEBUG
-std=c++14
"--generate-code=arch=compute_86,code=[compute_86,sm_86]"
-Xcompiler=-fPIC
-Xcompiler=-fvisibility=hidden
--compiler-options
-fopenmp
-gencode
arch=compute_86,code=sm_86
-DAER_THRUST_GPU
-DAER_THRUST_CUDA
-I/home/qiskit-aer/src
-isystem /home/qiskit-aer/src/third-party/headers
-use_fast_math
--expt-extended-lambda
--compiler-options
-mfma
--compiler-options
-mavx2
--compiler-options
-mno-avx512bf16
-MD
-MT
qiskit_aer/backends/wrappers/CMakeFiles/controller_wrappers.dir/__/__/__/src/simulators/statevector/qv_avx2.cpp.o
-MF qiskit_aer/backends/wrappers/CMakeFiles/controller_wrappers.dir/__/__/__/src/simulators/statevector/qv_avx2.cpp.o.d
-x cu
-c /home/qiskit-aer/src/simulators/statevector/qv_avx2.cpp
-o qiskit_aer/backends/wrappers/CMakeFiles/controller_wrappers.dir/__/__/__/src/simulators/statevector/qv_avx2.cpp.o
/usr/lib/gcc/x86_64-redhat-linux/12/include/avx512bf16vlintrin.h(53): error: identifier "__builtin_ia32_cvtne2ps2bf16_v16hi" is undefined
return (__m256bh)__builtin_ia32_cvtne2ps2bf16_v16hi(__A, __B);
I cannot figure out what enables 'AVX512-BF16'.
@doichanj, please see if the PR is on track; I am waiting for your comments :)
@doichanj, your comments would be highly valuable to me.
I'm sorry, but I'm on vacation. I will review this next year.
// difference between the speed of CPU and GPU involved. Even if matrices
// are big, they are not big enough to make speed of PCIe a bottleneck.
// In this particular case CPU was `Zen+` and GPU was `NVIDIA Ampere`.
if ((num_qubits_ > 13) && (MPS::mps_device_.compare("GPU") == 0) &&
Even though MPI acceleration is not available for MPS, I still compiled with AER_MPI=True.
I re-tested the code against different CPUs:
Cascade Lake 4-cores + CUDA Arch 89, SapphireRapids 4-cores, Zen+ 8-cores + CUDA Arch 86
For very low qubit counts such as 2 or 3, the GPU version is 1.5-2.5 seconds slower; this may be the GPU initialization time.
Offloading the SVD to the GPU does accelerate the computation (in all cases!).
It will still be faster to offload to CUDA Arch 89 even with a SapphireRapids CPU.
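To make the SVD offload rule easier to discuss, here is a minimal sketch of the check as a standalone function. It models only the two clauses visible in the diff (`num_qubits_ > 13` and the device string being "GPU"); the rest of the original condition is cut off in the snippet above and is not reproduced. The names `prefer_gpu_svd`, `kMinQubitsForGpuSvd`, and `self_check` are hypothetical and do not appear in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical constant: 13 is the qubit threshold shown in the diff.
constexpr std::uint64_t kMinQubitsForGpuSvd = 13;

// Hypothetical helper, not the PR's code. Only the two visible clauses of the
// original condition are modelled; any further clauses are left out.
bool prefer_gpu_svd(std::uint64_t num_qubits, const std::string &device) {
  return (num_qubits > kMinQubitsForGpuSvd) && (device.compare("GPU") == 0);
}

// Usage illustration: small circuits stay on the CPU, so they do not pay the
// GPU start-up cost; larger circuits on a GPU device are offloaded.
void self_check() {
  assert(!prefer_gpu_svd(3, "GPU"));   // too few qubits
  assert(!prefer_gpu_svd(13, "GPU"));  // threshold is strict (> 13)
  assert(prefer_gpu_svd(14, "GPU"));   // offloaded
  assert(!prefer_gpu_svd(14, "CPU"));  // device is not GPU
}
```

Keeping the threshold in one named constant would make it simple to raise it later for faster CPUs.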
// invoking needed GPU routines.
// If the matrix is not big enough, the multiplication
// will be done on CPU using openblas zgemm_ routine.
if ((mat1_rows > 128) && (mat1_cols > 128) && (mat2_cols > 128)) {
Even though MPI acceleration is not available for MPS, I still compiled with AER_MPI=True.
I re-tested the code against different CPUs:
Cascade Lake 4-cores + CUDA Arch 89, SapphireRapids 4-cores, Zen+ 8-cores + CUDA Arch 86
For very low qubit counts such as 2 or 3, the GPU version is 1.5-2.5 seconds slower; this may be the GPU initialization time.
Offloading the matrix multiplication to the GPU does accelerate the computation.
But I suppose this should only work for a combination of CUDA Arch >= 86 with any CPU whose 'phoronix-test-suite' benchmark score for test 'pts/scimark2' '[Computational Test: Dense LU Matrix Factorization]' is <= 600 Mflops.
I believe most non-enterprise users will fall into this category.
No test has been done on the combination of CUDA Arch 89 with SapphireRapids, whose 'phoronix-test-suite' benchmark score for test 'pts/scimark2' '[Computational Test: Dense LU Matrix Factorization]' is 1300 Mflops. Even if it doesn't work for this combination, all we have to do is increase the num_qubits threshold to maybe 17 or 18 and increase the rows/cols threshold to 256 or more. I have refrained from making this modification because it would make MPS slower on low-end CPUs.
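A rough sketch of how the matrix-size check could be made tunable, in case the 128 threshold ever needs to be raised for faster CPUs. This is an illustration only: `MatmulOffloadPolicy` and `offload_matmul_to_gpu` are hypothetical names that are not in the PR, and the default of 128 is simply the value hard-coded in the diff.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tuning knob, not part of the PR. The diff hard-codes 128;
// the comment above suggests 256 or more might suit faster CPUs.
struct MatmulOffloadPolicy {
  std::size_t min_dim = 128;
};

// Mirrors the condition in the diff: all three dimensions must exceed the
// threshold, otherwise the multiplication stays on the CPU (zgemm_ path).
bool offload_matmul_to_gpu(std::size_t mat1_rows, std::size_t mat1_cols,
                           std::size_t mat2_cols,
                           const MatmulOffloadPolicy &policy =
                               MatmulOffloadPolicy()) {
  // Sanity check: a zero threshold would offload every non-empty product.
  assert(policy.min_dim > 0);
  return (mat1_rows > policy.min_dim) && (mat1_cols > policy.min_dim) &&
         (mat2_cols > policy.min_dim);
}
```

With the default policy a 200x200 by 200x200 product would go to the GPU; with `min_dim` set to 256 the same product would stay on the CPU, which is the trade-off described above for low-end versus high-end CPUs.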
@doichanj, please have a look at my last two comments and review the code when you have time :)
Summary
This PR adds matrix-product-state simulation on NVIDIA GPUs using cuTensorNet from cuQuantum.
Details and comments
It shows performance gains: roughly a 12x speedup on bigger circuits, though I am still not satisfied!
Fixes #2112