HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION #1101

Open
pgrete opened this issue Jun 11, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@pgrete
Collaborator

pgrete commented Jun 11, 2024

On Frontier I see the following error (or, to be more specific, many instances of it)
:0:rocdevice.cpp :2660: 556940992572 us: 32834: [tid:0x7f9e41945700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
when running the following command and input file:

$ srun -N 128 -n 1024 -c 1 --gpus-per-node=8 --gpu-bind=closest /ccs/proj/ast146/pgrete/src/athenapk/external/parthenon/build-bisect-latest-next-dev/example/advection/advection-example -i parthinput.advection_smaller
<parthenon/job>
problem_id = advection

<parthenon/mesh>
refinement  = static
nghost = 2

nx1        = 1024       # Number of zones in X1-direction
x1min      =-3.2     # minimum value of X1
x1max      = 3.2     # maximum value of X1
ix1_bc     = periodic        # inner-X1 boundary flag
ox1_bc     = periodic        # outer-X1 boundary flag

nx2        = 1024       # Number of zones in X2-direction
x2min      =-3.2     # minimum value of X2
x2max      = 3.2     # maximum value of X2
ix2_bc     = periodic        # inner-X2 boundary flag
ox2_bc     = periodic        # outer-X2 boundary flag

nx3        = 1024       # Number of zones in X3-direction
x3min      =-3.2     # minimum value of X3
x3max      = 3.2     # maximum value of X3
ix3_bc     = periodic        # inner-X3 boundary flag
ox3_bc     = periodic        # outer-X3 boundary flag

<parthenon/meshblock>
nx1        = 128        # Number of zones in X1-direction
nx2        = 128        # Number of zones in X2-direction
nx3        = 128        # Number of zones in X3-direction


<parthenon/static_refinement4>
x1min = -0.4 
x1max =  0.4
x2min = -0.4
x2max =  0.4
x3min = -0.4
x3max =  0.4
level = 4


<parthenon/static_refinement5>
x1min = -0.2 
x1max =  0.2
x2min = -0.2
x2max =  0.2
x3min = -0.2
x3max =  0.2
level = 5


#<parthenon/static_refinement6>
#x1min = -0.1125 
#x1max =  0.1125
#x2min = -0.1125
#x2max =  0.1125
#x3min = -0.1125
#x3max =  0.1125
#level = 6




<parthenon/time>
tlim = 1.0
integrator = rk1
nlim = 100
ncycle_out_mesh = -100000


<Advection>
cfl = 0.30
vx = 1.0
vy = 2.0
vz = 3.0
profile = smooth_gaussian
ang_2 = 0.0
ang_3 = 0.0
ang_2_vert = false
ang_3_vert = false
amp = 1.0 

num_vars = 5
#vec_size = 5

refine_tol = 1.01    # control the package specific refinement tagging function
derefine_tol = 1.001
compute_error = true

<parthenon/output0>
file_type = rst 
dt = 1.0 

This is on current develop (b28c738).

Changing

num_vars = 5
#vec_size = 5

to

#num_vars = 5
vec_size = 5

shows no issues.

pgrete added the bug label on Jun 11, 2024
@BenWibking
Collaborator

BenWibking commented Jun 11, 2024

We've seen this error in AMReX codes due to a HIP compiler bug (e.g.: AMReX-Astro/Microphysics#1386 (comment))

Adding -mllvm -amdgpu-function-calls=true to the HIP compiler flags works around that issue. Does that help for this case?
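
For concreteness, one way to thread that flag through a CMake configure step looks roughly like this (a sketch only; apart from the -mllvm -amdgpu-function-calls=true flag itself, the options shown are placeholders for whatever configuration you normally use):

# hypothetical configure line; only the -amdgpu-function-calls flag is the actual workaround
cmake -S . -B build-hip \
  -DCMAKE_CXX_COMPILER=hipcc \
  -DCMAKE_CXX_FLAGS="-mllvm -amdgpu-function-calls=true" \
  -DKokkos_ENABLE_HIP=ON
cmake --build build-hip --parallel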

@pgrete
Collaborator Author

pgrete commented Jun 11, 2024

Which compiler are you using?
I just tried with Cray (which I've been using so far) and it didn't help.

@BenWibking
Collaborator

BenWibking commented Jun 11, 2024

I think I've used only hipcc/amdclang++ for HIP builds recently (i.e., -DCMAKE_CXX_COMPILER=hipcc). But I think I had the PrgEnv-cray modules loaded, so I don't know what it's actually doing 🤷 .

@BenWibking
Collaborator

Although we only saw this problem for very large kernels (e.g., with reaction networks), so it may not be related.

@BenWibking
Collaborator

I've also tried https://rocm.docs.amd.com/en/latest/conceptual/using-gpu-sanitizer.html#compiling-for-address-sanitizer to debug these memory errors. This sometimes worked, but it also produces some false positives with global vars...
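
If it helps, my recollection of the recipe on that page boils down to roughly the following (a sketch; the exact flags and runtime settings depend on the ROCm version, so check the linked doc before relying on it):

# build host and device code with the address sanitizer (per the linked ROCm doc; double-check for your ROCm version)
export CXXFLAGS="-fsanitize=address -shared-libsan -g"
export LDFLAGS="-fsanitize=address -shared-libsan"
# the instrumented device code needs XNACK enabled at run time
export HSA_XNACK=1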

@pgrete
Collaborator Author

pgrete commented Jun 11, 2024

I now tried the WarpX recommendations, i.e.,

# from https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html#frontier-olcf

module load cmake/3.23.2
module load craype-accel-amd-gfx90a
module load rocm/5.2.0  # waiting for 5.6 for next bump
module load cray-mpich
module load cce/15.0.0
module load ninja
module load hdf5/1.14.0

# compiler environment hints
export CC=$(which hipcc)
export CXX=$(which hipcc)
export FC=$(which ftn)
export CFLAGS="-I${ROCM_PATH}/include"
export CXXFLAGS="-I${ROCM_PATH}/include -Wno-pass-failed"
export LDFLAGS="-L${ROCM_PATH}/lib -lamdhip64 ${PE_MPICH_GTL_DIR_amd_gfx90a} -lmpi_gtl_hsa"

export MPICH_GPU_SUPPORT_ENABLED=1

Still the same issue.

@BenWibking
Collaborator

Ah, well, nevermind :/

@BenWibking
Collaborator

BenWibking commented Jun 14, 2024

Does running with -DENABLE_ASAN=ON show anything?
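
I.e., something along the lines of (a sketch; ENABLE_ASAN is the option named above, while the build directory and the trailing run arguments are placeholders):

cmake -S . -B build-asan -DCMAKE_CXX_COMPILER=hipcc -DENABLE_ASAN=ON
cmake --build build-asan --parallel
# optional: silence host-side leak reports so device errors stand out
ASAN_OPTIONS=detect_leaks=0 srun ... ./example/advection/advection-example -i parthinput.advection_smaller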
