Test failures on Frontier #152

Open
4 of 13 tasks
nkoukpaizan opened this issue Jul 29, 2024 · 6 comments

@nkoukpaizan
Collaborator

nkoukpaizan commented Jul 29, 2024

Issue type

  • New feature
  • Bug
  • Discussion
  • Other

Relates to

  • OPFLOW
  • SOPFLOW
  • SCOPFLOW
  • TCOPFLOW
  • CMake build system
  • Spack configuration
  • Manual
  • Web docs
  • Other

Summary
Issue associated with test failures on Frontier previously reported in #89. I reproduced most of these again with the latest build on Frontier with exago@develop and hiop@develop (#151).

The behavior is different for Debug versus Release builds:

  • With CMAKE_BUILD_TYPE=Debug, 10 tests fail.
	  2 - UNIT_TESTS_OPFLOW_case118.m (Failed)
	  3 - UNIT_TESTS_OPFLOW_case_ACTIVSg200.m (Failed)
	 18 - FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_TOML_TESTSUITE (Failed)
	 20 - FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE (Failed)
	 21 - FUNCTIONALITY_TEST_OPFLOW_IPOPT_POLAR_TOML_TESTSUITE (Failed)
	 35 - FUNCTIONALITY_TEST_SCOPFLOW_HIOP_MPI_TESTSUITE (Failed)
	 37 - FUNCTIONALITY_TEST_SCOPFLOW_HIOP_RAJA_TESTSUITE (Failed)
	 49 - FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_RAJA_GPU_TOML (Failed)
	 50 - FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_MPI_RAJA_GPU_TOML (Failed)

See exago.frontier.debug.log.
Note that FUNCTIONALITY_TEST_SCOPFLOW_HIOP_SERIAL_TESTSUITE failed in #89, but not here.

  • With CMAKE_BUILD_TYPE=Release, 2 tests fail.
	 20 - FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE (Failed)
	 21 - FUNCTIONALITY_TEST_OPFLOW_IPOPT_POLAR_TOML_TESTSUITE (Failed)

See exago.frontier.release.log.
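
The difference between the two failure sets can be checked mechanically. The following is a minimal Python sketch, not part of ExaGO's tooling, that parses the two failure summaries pasted above and splits them into tests that fail in both build types and tests that fail only in Debug. The helper name failed_tests and the regular expression are assumptions made for illustration.

```python
import re

# Failure summaries copied from the Debug and Release ctest runs above.
DEBUG_SUMMARY = """
  2 - UNIT_TESTS_OPFLOW_case118.m (Failed)
  3 - UNIT_TESTS_OPFLOW_case_ACTIVSg200.m (Failed)
 18 - FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_TOML_TESTSUITE (Failed)
 20 - FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE (Failed)
 21 - FUNCTIONALITY_TEST_OPFLOW_IPOPT_POLAR_TOML_TESTSUITE (Failed)
 35 - FUNCTIONALITY_TEST_SCOPFLOW_HIOP_MPI_TESTSUITE (Failed)
 37 - FUNCTIONALITY_TEST_SCOPFLOW_HIOP_RAJA_TESTSUITE (Failed)
 49 - FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_RAJA_GPU_TOML (Failed)
 50 - FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_MPI_RAJA_GPU_TOML (Failed)
"""

RELEASE_SUMMARY = """
 20 - FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE (Failed)
 21 - FUNCTIONALITY_TEST_OPFLOW_IPOPT_POLAR_TOML_TESTSUITE (Failed)
"""

# Matches lines of the form "<number> - <test name> (Failed)".
FAILED_LINE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\(Failed\)\s*$", re.MULTILINE)


def failed_tests(summary: str) -> dict[int, str]:
    """Map test number to test name for every '(Failed)' line (hypothetical helper)."""
    return {int(num): name for num, name in FAILED_LINE.findall(summary)}


debug = failed_tests(DEBUG_SUMMARY)
release = failed_tests(RELEASE_SUMMARY)

# Tests failing in both build types, and tests failing only in Debug.
both = sorted(debug.keys() & release.keys())
debug_only = sorted(debug.keys() - release.keys())

print("fail in Debug and Release:", [debug[n] for n in both])
print("fail in Debug only:", [debug[n] for n in debug_only])
```

With the lists above, tests 20 and 21 are the common failures and the remaining seven appear only in the Debug build. The pasted Debug list contains nine entries, so one of the ten reported failures is not shown there.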

The failure of FUNCTIONALITY_TEST_OPFLOW_IPOPT_POLAR_TOML_TESTSUITE is less concerning because it stems from a different number of iterations, although the difference falls outside the tolerance that would produce only a warning.

FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE fails with the following error:

[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
 ...
MPICH ERROR [Rank 0] [job id 2148652.119] [Mon Jul 29 11:36:58 2024] [frontier10152] - Abort(59) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 59) - process 0

The backtrace is:

Thread 1 "test_opflow_fun" received signal SIGSEGV, Segmentation fault.
0x000000000073a1d7 in hiop::hiopKKTLinSysCompressedSparseXDYcYd::build_kkt_matrix(hiop::hiopPDPerturbation const&) ()
(gdb) backtrace
#0  0x000000000073a1d7 in hiop::hiopKKTLinSysCompressedSparseXDYcYd::build_kkt_matrix(hiop::hiopPDPerturbation const&) ()
#1  0x0000000000719bff in hiop::hiopKKTLinSysCurvCheck::factorize() ()
#2  0x000000000071ab50 in hiop::hiopKKTLinSysCompressedXDYcYd::update(hiop::hiopIterate const*, hiop::hiopVector const*, hiop::hiopMatrix const*, hiop::hiopMatrix const*, hiop::hiopMatrix*) ()
#3  0x000000000070c980 in hiop::hiopAlgFilterIPMNewton::run() ()
#4  0x000000000065543d in OPFLOWSolverSolve_HIOPSPARSEGPU(_p_OPFLOW*) ()
#5  0x00000000005c94b4 in OPFLOWSolve ()
#6  0x000000000052780b in OpflowFunctionalityTests::run_test_case(OpflowFunctionalityTestParameters&) ()
#7  0x0000000000524c45 in FunctionalityTestContext<OpflowFunctionalityTestParameters>::run_all_test_cases() ()
#8  0x0000000000523c74 in main ()
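
To make the location of the crash easier to see, the backtrace can be split into HiOp frames and calling frames. This is a small illustrative Python sketch, assuming the gdb frame format shown above; the regular expression and variable names are not part of any ExaGO or HiOp tool.

```python
import re

# Backtrace copied verbatim from the gdb session above.
BACKTRACE = """\
#0  0x000000000073a1d7 in hiop::hiopKKTLinSysCompressedSparseXDYcYd::build_kkt_matrix(hiop::hiopPDPerturbation const&) ()
#1  0x0000000000719bff in hiop::hiopKKTLinSysCurvCheck::factorize() ()
#2  0x000000000071ab50 in hiop::hiopKKTLinSysCompressedXDYcYd::update(hiop::hiopIterate const*, hiop::hiopVector const*, hiop::hiopMatrix const*, hiop::hiopMatrix const*, hiop::hiopMatrix*) ()
#3  0x000000000070c980 in hiop::hiopAlgFilterIPMNewton::run() ()
#4  0x000000000065543d in OPFLOWSolverSolve_HIOPSPARSEGPU(_p_OPFLOW*) ()
#5  0x00000000005c94b4 in OPFLOWSolve ()
#6  0x000000000052780b in OpflowFunctionalityTests::run_test_case(OpflowFunctionalityTestParameters&) ()
#7  0x0000000000524c45 in FunctionalityTestContext<OpflowFunctionalityTestParameters>::run_all_test_cases() ()
#8  0x0000000000523c74 in main ()
"""

# "#<n>  0x<address> in <symbol> ()" as printed by gdb without debug info.
FRAME = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (.+?) \(\)$", re.MULTILINE)

frames = [(int(num), symbol) for num, symbol in FRAME.findall(BACKTRACE)]

# Frames whose symbol lives in the hiop:: namespace.
hiop_frames = [num for num, symbol in frames if symbol.startswith("hiop::")]

# First frame outside hiop::, i.e. the caller that enters HiOp.
boundary = next((num, symbol) for num, symbol in frames
                if not symbol.startswith("hiop::"))

print("HiOp frames:", hiop_frames)
print("Entry point into HiOp: frame", boundary[0], boundary[1])
```

The innermost four frames are inside HiOp, and the first frame outside it is OPFLOWSolverSolve_HIOPSPARSEGPU, the ExaGO routine that calls the HiOp solver.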

cc @cameronrutherford @pelesh

@cameronrutherford
Contributor

I am just brainstorming here:

  • It would be worthwhile to get a build with [email protected] (the latest pinned HiOp version), as well as [email protected], to see whether only the latest develop branch is buggy
  • It is likely that there is something that needs to be updated in ExaGO as a result of the new HiOp development. cc @nychiang
  • In preparing for a future xSDK release, there should be a released version of ExaGO that supports the latest released version of HiOp. If we need to make a change to work with HiOp@develop that does not work with [email protected], we will also need a new HiOp release. cc @cnpetra
  • @abhyshr it's possible this is just an ExaGO bug

@nkoukpaizan
Collaborator Author

I doubt the HiOp version will make a difference. These tests were failing [email protected] (#89). I'll also note that all tests passed in my latest build on Summit.

@pelesh
Collaborator

pelesh commented Jul 30, 2024

This error comes from ExaGO's sparse GPU-enabled interface, which was left in a more or less experimental state, as far as I remember. It looks as if data is handed over to HiOp on the wrong device. The other possibility is that this configuration uses the Ginkgo linear solver. Ginkgo in HiOp still expects data on the host, while this configuration keeps everything on the device.

Does this test pass anywhere else? If so, what is the ExaGO configuration there?

@cameronrutherford
Contributor

Does this test pass anywhere else? If so, what is the ExaGO configuration there?

I assume the same tests are being run on Deception and Newell, so the test is then only failing on AMD platforms. We don't have Incline CI quite back online yet, but I would have to assume identical behavior to Frontier...

@pelesh
Collaborator

pelesh commented Jul 30, 2024

Does this test pass anywhere else? If so, what is the ExaGO configuration there?

I assume the same tests are being run on Deception and Newell, so the test is then only failing on AMD platforms. We don't have Incline CI quite back online yet, but I would have to assume identical behavior to Frontier...

This is what I guessed. I suggest disabling this test on AMD platforms because we don't have a complete software stack that can support it. At least not until we interface the newest version of Re::Solve with the software stack.
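
Until the test is disabled in the build system, one possible stopgap is to exclude it when invoking ctest, using ctest's -E (exclude regex) option. The Python sketch below only builds such a command line; the helper name ctest_command is hypothetical, and choosing when to apply the exclusion (for example, only on AMD machines) is left to the caller.

```python
import re
import shlex

# Name of the failing sparse GPU test, taken from the report above.
GPU_SPARSE_TEST = "FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE"


def ctest_command(excluded: list[str]) -> list[str]:
    """Build a ctest invocation that skips the named tests (hypothetical helper).

    The names are anchored with ^...$ so that only exact matches are excluded.
    """
    cmd = ["ctest", "--output-on-failure"]
    if excluded:
        pattern = "^(" + "|".join(re.escape(name) for name in excluded) + ")$"
        cmd += ["-E", pattern]
    return cmd


# Print a shell-ready command line that skips only the sparse GPU test.
cmd = ctest_command([GPU_SPARSE_TEST])
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run from the build directory. This only hides the test from a given run; it does not change which tests CMake registers.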

@pelesh
Collaborator

pelesh commented Jul 30, 2024

When trying to build this on Frontier, I get the following error:

[ 56%] Linking CXX executable opflow
ld.lld: error: undefined symbol: mc19ad_
>>> referenced by IpEquilibrationScaling.cpp
>>>               IpEquilibrationScaling.o:(Ipopt::EquilibrationScaling::DetermineScalingParametersImpl(Ipopt::SmartPtr<Ipopt::VectorSpace const>, Ipopt::SmartPtr<Ipopt::VectorSpace const>, Ipopt::SmartPtr<Ipopt::VectorSpace const>, Ipopt::SmartPtr<Ipopt::MatrixSpace const>, Ipopt::SmartPtr<Ipopt::MatrixSpace const>, Ipopt::SmartPtr<Ipopt::SymMatrixSpace const>, Ipopt::Matrix const&, Ipopt::Vector const&, Ipopt::Matrix const&, Ipopt::Vector const&, double&, Ipopt::SmartPtr<Ipopt::Vector>&, Ipopt::SmartPtr<Ipopt::Vector>&, Ipopt::SmartPtr<Ipopt::Vector>&)) in archive /lustre/orion/eng145/world-shared/spack-install/linux-sles15-x86_64/clang-17.0.0-rocm5.7.1-mixed/ipopt-3.12.10-2zdjszoeppgewi5zojnlqevjcqenp66u/lib/libipopt.a
>>> referenced by IpMc19TSymScalingMethod.cpp
>>>               IpMc19TSymScalingMethod.o:(Ipopt::Mc19TSymScalingMethod::ComputeSymTScalingFactors(int, int, int const*, int const*, double const*, double*)) in archive /lustre/orion/eng145/world-shared/spack-install/linux-sles15-x86_64/clang-17.0.0-rocm5.7.1-mixed/ipopt-3.12.10-2zdjszoeppgewi5zojnlqevjcqenp66u/lib/libipopt.a

ld.lld: error: undefined symbol: ma86_finalise_d
>>> referenced by IpMa86SolverInterface.cpp
>>>               IpMa86SolverInterface.o:(Ipopt::Ma86SolverInterface::~Ma86SolverInterface()) in archive /lustre/orion/eng145/world-shared/spack-install/linux-sles15-x86_64/clang-17.0.0-rocm5.7.1-mixed/ipopt-3.12.10-2zdjszoeppgewi5zojnlqevjcqenp66u/lib/libipopt.a
>>> referenced by IpMa86SolverInterface.cpp
>>>               IpMa86SolverInterface.o:(Ipopt::Ma86SolverInterface::InitializeStructure(int, int, int const*, int const*)) in archive /lustre/orion/eng145/world-shared/spack-install/linux-sles15-x86_64/clang-17.0.0-rocm5.7.1-mixed/ipopt-3.12.10-2zdjszoeppgewi5zojnlqevjcqenp66u/lib/libipopt.a
(...)

I believe this is the same bug I reported in #127.
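
For long linker logs like the one above, it can help to reduce the output to the missing symbols and the source files that reference them. The following is a rough Python sketch under the assumption that the log follows the ld.lld format shown above; the log excerpt is abbreviated (the long "in archive" lines are dropped) and the function name undefined_symbols is made up for this example.

```python
import re

# Abbreviated excerpt of the linker output above (archive paths omitted).
LINK_LOG = """\
ld.lld: error: undefined symbol: mc19ad_
>>> referenced by IpEquilibrationScaling.cpp
>>> referenced by IpMc19TSymScalingMethod.cpp

ld.lld: error: undefined symbol: ma86_finalise_d
>>> referenced by IpMa86SolverInterface.cpp
>>> referenced by IpMa86SolverInterface.cpp
"""

UNDEFINED = re.compile(r"^ld\.lld: error: undefined symbol: (\S+)")
REFERENCED = re.compile(r"^>>> referenced by (\S+)")


def undefined_symbols(log: str) -> dict[str, list[str]]:
    """Map each undefined symbol to the sorted, de-duplicated files referencing it."""
    refs: dict[str, set[str]] = {}
    current = None
    for line in log.splitlines():
        match = UNDEFINED.match(line)
        if match:
            # Start collecting references for a new symbol.
            current = match.group(1)
            refs.setdefault(current, set())
            continue
        match = REFERENCED.match(line)
        if match and current is not None:
            refs[current].add(match.group(1))
    return {symbol: sorted(files) for symbol, files in refs.items()}


missing = undefined_symbols(LINK_LOG)
for symbol, files in missing.items():
    print(symbol, "<-", ", ".join(files))
```

In this excerpt both missing symbols are referenced from Ipopt's MC19 scaling and MA86 solver interface sources inside libipopt.a.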
