Array4: __cuda_array_interface__ v3 #30
Conversation
Update: a patch that unlocks that broken compiler range is in pybind/pybind11#4220:

```shell
cmake -S . -B build -DAMReX_GPU_BACKEND=CUDA \
  -DpyAMReX_pybind11_repo=https://github.com/ax3l/pybind11.git \
  -DpyAMReX_pybind11_branch=fix-nvcc-11.4-11.8
```
@RemiLehe the build logic from README.md is this. So concretely:

```shell
# Python packages if not already installed as described
python3 -m pip install -U pip setuptools wheel
python3 -m pip install -U cmake pytest
python3 -m pip install -U -r requirements.txt

# depending on what you try
python3 -m pip install cupy-cuda11x
python3 -m pip install numba
python3 -m pip install torch

# configure once (unless changing backend or versions heavily)
cmake -S . -B build -DAMReX_GPU_BACKEND=CUDA \
  -DpyAMReX_pybind11_repo=https://github.com/ax3l/pybind11.git \
  -DpyAMReX_pybind11_branch=fix-nvcc-11.4-11.8

# rinse & repeat: builds, packages & runs pip install
cmake --build build --target pip_install -j 8
```

and tests:

```shell
# Run all tests
python3 -m pytest tests/

# Run tests from a single file
python3 -m pytest tests/test_array4.py

# Run a single test (useful during debugging)
python3 -m pytest tests/test_array4.py::test_array4_cupy
python3 -m pytest tests/test_multifab.py::test_mfab_ops_cuda_cupy

# Run all tests, do not capture "print" output and be verbose
python3 -m pytest -s -vvvv tests/test_array4.py
```

and with Nsight (GUI):
Start implementing the `__cuda_array_interface__` for zero-copy data exchange on Nvidia CUDA GPUs.
Since `for` loops create no scope in Python, we need to trigger finalize logic, including stream syncs, before the destructors of `MultiFab` iterators are called.
incl. 3D kernel launch
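The commit message above refers to triggering finalize logic explicitly. A minimal sketch of that pattern, using a stand-in iterator class (`FakeMFIter` and its members are hypothetical, not the real pyAMReX binding): because the loop variable survives the loop in Python, the iterator runs its own finalize step when iteration is exhausted instead of relying on destructor timing.

```python
# Sketch with stand-in types: the iterator calls finalize() itself when
# it raises StopIteration, since Python will not destroy it at loop end.

class FakeMFIter:
    """Hypothetical stand-in for a MultiFab iterator."""
    def __init__(self, n):
        self.n = n
        self.i = -1
        self.finalized = False

    def __iter__(self):
        return self

    def __next__(self):
        self.i += 1
        if self.i >= self.n:
            self.finalize()  # e.g. stream sync in a real GPU binding
            raise StopIteration
        return self.i

    def finalize(self):
        self.finalized = True

it = FakeMFIter(3)
boxes = [b for b in it]
# finalize ran when iteration ended, not whenever GC collects `it`
assert it.finalized
```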
A bit tricky to implement this caster as a new constructor. Not currently needed, but comments were added where to do this.

Wuup, wuup. First part done.
Start implementing the `__cuda_array_interface__` for zero-copy data exchange on Nvidia CUDA GPUs.

Optional: accessing an external `__cuda_array_interface__` object in a non-owning manner as an AMReX `Array4`: https://github.com/cupy/cupy/blob/a5b24f91d4d77fa03e6a4dd2ac954ff9a04e21f4/cupy/core/core.pyx#L2478-L2514
`mfab` and `mfab_device` need to become functions, not fixtures. Otherwise they will be cached and outlive `amrex.finalize()`: AMReX Initialize/Finalize as Context Manager #81, MultiFab: Fix Fixture Lifetime #84

pyamrex/src/Base/MultiFab.cpp, lines 72 to 75 in 78bbbc7

depends on MFIter::Finalize amrex#2983 and MFIter: Make Finalize Public amrex#2985
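The caching problem can be sketched generically (all names here are illustrative stand-ins, not the real fixtures): a cached provider hands back the same object after teardown has invalidated it, whereas a plain function returns a fresh object each call.

```python
# Sketch: why a cached fixture outlives finalize(), illustrated with a
# stand-in class instead of a real MultiFab.

class FakeMultiFab:
    def __init__(self):
        self.valid = True

    def finalize(self):
        self.valid = False

_cache = {}

def cached_mfab():
    # behaves like a cached pytest fixture: one object, reused forever
    if "mf" not in _cache:
        _cache["mf"] = FakeMultiFab()
    return _cache["mf"]

def fresh_mfab():
    # plain function: a new object per test, nothing survives teardown
    return FakeMultiFab()

a = cached_mfab()
a.finalize()                      # teardown runs...
assert not cached_mfab().valid    # ...but the cached object is stale
assert fresh_mfab().valid         # a fresh object is always valid
```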
`if` and `for` do not create a scope in Python (they do in C++):
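A short demonstration of that scoping difference:

```python
# In Python, names bound inside `for` (or `if`) remain visible, and
# alive, after the block ends. In C++, a range-for loop variable is
# destroyed at the end of each iteration.
for i in range(3):
    last = i * i
print(i, last)  # -> 2 4  (both still exist after the loop)

# Consequence for MultiFab iteration: the iterator object stays alive
# after the loop body, so stream syncs / finalize logic must be
# triggered explicitly rather than left to destructor timing.
```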