
MPS with cuQuantum #2168

Open · wants to merge 36 commits into main
Conversation

@MozammilQ commented Jun 6, 2024

Summary

This PR adds support for matrix-product-state simulation on Nvidia GPUs using the cuTENSOR component of cuQuantum.

Details and comments

Shows performance gains (benchmark screenshot attached: Screenshot_20241203_173422).

I got a ~12x speedup on larger circuits, but I am still not satisfied!

fixes #2112

@doichanj (Collaborator)

For 3 days I have been fighting this:

(screenshot attached: Screenshot_20240610_052824_n)

Here, too, the test failed with only this ImportError: /tmp/tmp.pzwv2zFlTV/venv/lib/python3.12/site-packages/qiskit_aer/backends/controller_wrappers.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3AER21cutensor_csvd_wrapperER6matrixISt7complexIdEES4_RSt6vectorIdSaIdEES4_

After this PR I am missing Rust/cargo even more. This PR is 5% actual work and 95% fighting with this library error and that error.

Anyways, I really enjoyed this. I am looking forward to more contributions :) Thanks @doichanj :)

Because cutensor_csvd_wrapper is defined in namespace TensorNetwork, the call should be TensorNetwork::cutensor_csvd_wrapper.

@doichanj (Collaborator)

It fails when running with the MPS method on GPU, with the following error message:
ERROR TensorNet::contractor : CUTENSORNET_STATUS_INVALID_VALUE

@MozammilQ (Author) commented Jun 14, 2024 via email

@MozammilQ (Author)

@doichanj , please see if this is good enough :)

And extremely sorry for the delay; doing development in a cloud VM is not a good experience.

@MozammilQ (Author)

I have absolutely no idea why the macOS tests are failing.
For years the only OS I have known is Linux; regarding macOS and Windows, I only know their spellings :)

@Randl commented Jun 30, 2024

Looks like the problem is an old cvxpy version; #2169 should fix it?

@MozammilQ (Author)

I have got my hands on an Nvidia 3060 12 GB.
I am actively working on the PR to solve the performance issue :)

@MozammilQ (Author) commented Jul 4, 2024

I am going through more lectures on tensor networks.
I am working on it!
Trying to accelerate contraction.

@doichanj ,

docker container
Fedora:37
gcc: 12.3.1
python: 3.11.6
cuda-12-5
cuquantum-12
inside the venv: numpy version: 2.1.2
Make a virtual env and get into the venv.

When I run pip uninstall -y qiskit-aer && rm -frv ./qiskit_aer.egg-info/ && python setup.py clean && python setup.py develop,
I get this error:

FAILED: qiskit_aer/backends/wrappers/CMakeFiles/controller_wrappers.dir/__/__/__/src/simulators/statevector/qv_avx2.cpp.o
/usr/local/cuda-12.5/bin/nvcc 
-forward-unknown-to-host-compiler 
-DAER_CUSTATEVEC 
-DAER_CUTENSORNET 
-DAER_THRUST_SUPPORTED=TRUE 
-DSPDLOG_COMPILED_LIB 
-DSPDLOG_FMT_EXTERNAL 
-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA 
-Dcontroller_wrappers_EXPORTS 
-I/usr/include/python3.11 
-I/root/.venvs/qiskit-aer-dev/lib64/python3.11/site-packages/pybind11/include
-I/home/qiskit-aer/src 
-isystem /root/.conan/data/nlohmann_json/3.1.1/_/_/package/5ab84d6acfe1f23c4fae0ab88f26e3a396351ac9/include 
-isystem /root/.conan/data/spdlog/1.9.2/_/_/package/0ddf38f27cb22953937f1e00a3576684c190cabc/include
-isystem /root/.conan/data/fmt/8.0.1/_/_/package/300cb578a6fb15627a390ac70b832e82fd040283/include 
-O3 
-DNDEBUG 
-std=c++14 
"--generate-code=arch=compute_86,code=[compute_86,sm_86]" 
-Xcompiler=-fPIC 
-Xcompiler=-fvisibility=hidden  
--compiler-options 
-fopenmp  
-gencode 
arch=compute_86,code=sm_86 
-DAER_THRUST_GPU 
-DAER_THRUST_CUDA 
-I/home/qiskit-aer/src 
-isystem /home/qiskit-aer/src/third-party/headers 
-use_fast_math 
--expt-extended-lambda  
--compiler-options 
-mfma 
--compiler-options 
-mavx2 
--compiler-options 
-mno-avx512bf16 
-MD 
-MT 
qiskit_aer/backends/wrappers/CMakeFiles/controller_wrappers.dir/__/__/__/src/simulators/statevector/qv_avx2.cpp.o 
-MF qiskit_aer/backends/wrappers/CMakeFiles/controller_wrappers.dir/__/__/__/src/simulators/statevector/qv_avx2.cpp.o.d 
-x cu 
-c /home/qiskit-aer/src/simulators/statevector/qv_avx2.cpp 
-o qiskit_aer/backends/wrappers/CMakeFiles/controller_wrappers.dir/__/__/__/src/simulators/statevector/qv_avx2.cpp.o

/usr/lib/gcc/x86_64-redhat-linux/12/include/avx512bf16vlintrin.h(53): error: identifier "__builtin_ia32_cvtne2ps2bf16_v16hi" is undefined
    return (__m256bh)__builtin_ia32_cvtne2ps2bf16_v16hi(__A, __B);
    
    

I cannot figure out what enables AVX512-BF16.
I have a Zen+ CPU; it only has AVX2, not even AVX-512, let alone AVX512-BF16.
Why do I get this error? Is something really turning on AVX512-BF16 somewhere? Where?
What do I do here?

@doichanj , please see if the PR is on track; waiting for your comments :)

@MozammilQ requested a review from doichanj, December 3, 2024 07:38
@MozammilQ (Author)

@doichanj , your comments would be highly valuable to me.
Please review the code if you have time :)

@doichanj (Collaborator)

I'm sorry, but I'm on vacation; I will review next year.
I just want to confirm whether this version can accelerate MPS simulation compared to running on the CPU.

// difference between the speed of CPU and GPU involved. Even if matrices
// are big, they are not big enough to make the speed of PCIe a bottleneck.
// In this particular case the CPU was `Zen+` and the GPU was `NVIDIA Ampere`.
if ((num_qubits_ > 13) && (MPS::mps_device_.compare("GPU") == 0) &&
@MozammilQ (Author), Jan 6, 2025

Even though MPI acceleration is not available for MPS, I still compiled with AER_MPI=True.
Upon re-testing the code against different CPUs (Cascade Lake 4-core + CUDA Arch 89, Sapphire Rapids 4-core, Zen+ 8-core + CUDA Arch 86):
For very low qubit counts like 2 or 3, the GPU version is 1.5-2.5 seconds slower; maybe this is the GPU initialization time.
Offloading the SVD to the GPU does accelerate the computation (in all cases!).
It is still faster to offload to CUDA Arch 89 even with a Sapphire Rapids CPU.

// invoking the needed GPU routines.
// If the matrix is not big enough, the multiplication
// will be done on the CPU using the OpenBLAS zgemm_ routine.
if ((mat1_rows > 128) && (mat1_cols > 128) && (mat2_cols > 128)) {
@MozammilQ (Author), Jan 6, 2025

Even though MPI acceleration is not available for MPS, I still compiled with AER_MPI=True.
Upon re-testing the code against different CPUs (Cascade Lake 4-core + CUDA Arch 89, Sapphire Rapids 4-core, Zen+ 8-core + CUDA Arch 86):
For very low qubit counts like 2 or 3, the GPU version is 1.5-2.5 seconds slower; maybe this is the GPU initialization time.
Offloading the matrix multiplication to the GPU does accelerate the computation.
But I suppose this should only help for a combination of >=CUDA Arch 86 with a CPU whose phoronix-test-suite score for pts/scimark2 [Computational Test: Dense LU Matrix Factorization] is <=600 Mflops.
I believe most non-enterprise users will fall into this category.
No test has been done on the combination of CUDA Arch 89 with Sapphire Rapids, whose pts/scimark2 [Computational Test: Dense LU Matrix Factorization] score is 1300 Mflops. Even if it doesn't work for this combination, all we have to do is increase num_qubits to maybe 17 or 18 and increase the number of rows/cols to 256 or more. I have refrained from making this change because it would make MPS slower on low-end CPUs.

@MozammilQ (Author)

@doichanj , please have a look at my last two comments and review the code if you have time :)


Successfully merging this pull request may close these issues.

Accelerate MPS simulator by using cuQuantum