Test failures on Frontier #152
Comments
I am just brainstorming here:
I doubt the HiOp version will make a difference. These tests were failing [email protected] (#89). I'll also note that all tests passed in my latest build on Summit.
This error comes from ExaGO's sparse GPU-enabled interface, which was left in a more or less experimental state, as far as I remember. It looks to me as if data is handed over to HiOp on the wrong device. The other possibility is that this configuration is using the Ginkgo linear solver. Ginkgo in HiOp still expects data on the host, while this configuration keeps everything on the device. Does this test pass anywhere else? If so, what is the ExaGO configuration there?
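The host/device mismatch hypothesized above can be illustrated with a small toy model. This is only a sketch in Python, not ExaGO, HiOp, or Ginkgo code: `MemSpace`, `Buffer`, `to_host`, and `host_only_solve` are hypothetical names invented for illustration. It assumes a solver that only accepts host-resident data and shows it rejecting device-tagged buffers until they are explicitly copied back.

```python
from dataclasses import dataclass
from enum import Enum


class MemSpace(Enum):
    """Hypothetical tag for where a buffer lives."""
    HOST = "host"
    DEVICE = "device"


@dataclass
class Buffer:
    """Hypothetical array wrapper that records its memory space."""
    values: list
    space: MemSpace


def to_host(buf):
    """Return a host-tagged copy (stands in for a device-to-host copy)."""
    return Buffer(list(buf.values), MemSpace.HOST)


def host_only_solve(diag, rhs):
    """Toy diagonal solve that, like a host-only solver, refuses device data."""
    for name, buf in (("diag", diag), ("rhs", rhs)):
        if buf.space is not MemSpace.HOST:
            raise ValueError(f"{name} is in {buf.space.value} memory, expected host")
    return [b / d for d, b in zip(diag.values, rhs.values)]


# The configuration keeps everything on the device.
diag = Buffer([2.0, 4.0], MemSpace.DEVICE)
rhs = Buffer([1.0, 2.0], MemSpace.DEVICE)

try:
    host_only_solve(diag, rhs)
    mismatch = None
except ValueError as err:
    mismatch = str(err)

# Copying to the host first satisfies the solver's assumption.
solution = host_only_solve(to_host(diag), to_host(rhs))
```

In the real stack the mismatch would surface as a runtime memory error rather than a clean exception, which is why checking the memory space at the interface boundary is useful when debugging.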
I assume the same tests are being run on Deception and Newell, so the test is only failing on AMD platforms. We don't have Incline CI quite back online yet, but I would have to assume identical behavior to Frontier...
This is what I guessed. I suggest disabling this test on AMD platforms because we don't have a complete software stack that can support it. At least not until we interface the newest version of Re::Solve with the software stack.
When trying to build this on Frontier, I get the following error:
I believe this is the same bug I reported in #127.
Issue type
Relates to
Summary
Issue associated with test failures on Frontier previously reported in #89. I reproduced most of these again with the latest build on Frontier with exago@develop and hiop@develop (#151).
The behavior is different for Debug versus Release builds:
CMAKE_BUILD_TYPE=Debug: 10 tests fail. See exago.frontier.debug.log. Note that FUNCTIONALITY_TEST_SCOPFLOW_HIOP_SERIAL_TESTSUITE failed in #89, but not here.
CMAKE_BUILD_TYPE=Release: 2 tests fail. See exago.frontier.release.log.
The failure of FUNCTIONALITY_TEST_OPFLOW_IPOPT_POLAR_TOML_TESTSUITE is less concerning, because it relates to a different number of iterations, though it is outside the allowed tolerance, which warrants a warning.
FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE fails with the following error:
The backtrace is:
cc @cameronrutherford @pelesh