Incline Test Failures #92

Open
3 of 13 tasks
jaelynlitz opened this issue Nov 30, 2023 · 2 comments

@jaelynlitz
Contributor

Issue type

  • New feature
  • Bug
  • Discussion
  • Other

Relates to

  • OPFLOW
  • SOPFLOW
  • SCOPFLOW
  • TCOPFLOW
  • CMake build system
  • Spack configuration
  • Manual
  • Web docs
  • Other

Summary

There are two isolated test failures on Incline: one segmentation fault and one timeout. They do not occur on Deception or Newell; other AMD platforms are TBD. They were potentially introduced with [email protected]

Creating a separate issue for these failures to isolate from #3 and #43 and let #84 continue without these tests blocking.

Exact commands to reproduce, if applicable

  • The tests are currently skipped in CI. To reproduce, either run the tests manually or delete the incline-skip tag from those tests in the CMake configuration.
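
One way to run only the two affected tests manually is to pass their names to `ctest -R`. The sketch below is a hypothetical helper, not part of ExaGO: it only builds the command line, and it assumes the command is run from the CMake build directory on Incline and that the test names match those shown in the logs in this issue. The 600-second timeout is an arbitrary illustrative value.

```python
import shlex

# Test names as reported in this issue; adjust if the names in your build differ.
FAILING_TESTS = [
    "FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE",
    "FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_RAJA_GPU_TOML",
]


def ctest_command(tests, timeout_s=None):
    """Build a ctest invocation that selects the given tests by name.

    ctest -R takes a regular expression and runs every test whose name
    contains a match, so the names are joined with '|'.
    """
    cmd = ["ctest", "-R", "|".join(tests), "--output-on-failure"]
    if timeout_s is not None:
        # Optional per-test timeout in seconds (useful for the hanging test).
        cmd += ["--timeout", str(timeout_s)]
    return cmd


if __name__ == "__main__":
    # Print the command so it can be pasted into a shell in the build directory.
    print(shlex.join(ctest_command(FAILING_TESTS, timeout_s=600)))
```

Because `-R` matches substrings, any other test whose name contains one of these strings would also be selected; running `ctest -N -R <pattern>` first lists the matching tests without executing them.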

Relevant logs and/or screenshots, if applicable

  1. FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE
20/57 Test #20: FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE .................***Failed    2.76 sec
[ExaGO] Creating OPFlow Functionality Test
Test Description: datafiles/case9/case9mod.m base case
[Warning] Hiop does not understand option 'dualsInitialization' and will ignore its value 'zero'.
[Warning] Detected 1 fixed variables out of a total of 24.
===============
Hiop SOLVER
===============
Using 1 MPI ranks.
---------------
Problem Summary
---------------
Total number of variables: 24
     lower/upper/lower_and_upper bounds: 16 / 16 / 16
Total number of equality constraints: 18
Total number of inequality constraints: 18
     lower/upper/lower_and_upper bounds: 18 / 18 / 18
iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
   0  1.0318125e+04 1.800e+00  4.460e+03  -1.00  0.000e+00  0.000e+00  -(-)
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
  2. FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_RAJA_GPU_TOML
@abhyshr
Collaborator

abhyshr commented Dec 5, 2023

Is this issue only on Ascent OR does this happen on other platforms too?

@jaelynlitz
Contributor Author

> Is this issue only on Ascent OR does this happen on other platforms too?

This behavior is only happening on Incline (not Deception, Newell, or Ascent). @nkoukpaizan was also seeing similar failures on Frontier in #89, so it is likely AMD related.

@cameronrutherford cameronrutherford added this to the 1.6.2 Release milestone Dec 12, 2023