@nhanford
Contributor

Hello @soumagne,

This PR adds a variant that builds and persistently installs the performance tests for Mercury.
Please let me know in which versions these CMake flags were introduced.

Preliminary results from a system similar to El Capitan at LLNL (200Gbps/HSA):

na_bw_get -c ofi -p cxi -b -l 1000
...
# [859097.930837] mercury->op [warning] /var/tmp/nhanford/spack-stage/spack-stage-mercury-master-ks4eadvpsbbqrwjxzxkebaxqymczvpry/spack-src/src/na/na_ofi.c:6950 na_ofi_cq_readerr() fi_cq_readerr() got err: 5 (Input/output error), prov_errno: 18 (ENTRY_NOT_FOUND)
# [859097.930842] mercury->op [error] /var/tmp/nhanford/spack-stage/spack-stage-mercury-master-ks4eadvpsbbqrwjxzxkebaxqymczvpry/spack-src/src/na/na_ofi.c:7137 na_ofi_cq_process_error() error event on operation ID 0x555555667670 (NA_CB_GET), fi_readmsg(iov_count=1, desc[0]=0x555555608680, msg_iov[0].iov_base=0x5555976f1000, msg_iov[0].iov_len=16384, addr=1, rma_iov_count=1, rma_iov[0].addr=0x3e000000, rma_iov[0].len=16384, rma_iov[0].key=0x900006b999e70000, context=0x5555556678b0, data=0) failed, rc: 5 (Input/output error)
16384                     15224.32                    1.03
32768                     22036.48                    1.42
65536                     22591.44                    2.77
131072                    22581.05                    5.54
262144                    23008.03                   10.87
524288                    23060.44                   21.68
1048576                   23117.53                   43.26
2097152                   23139.96                   86.43
4194304                   23152.97                  172.76
8388608                   23159.87                  345.43
16777216                  23163.12                  690.75

Thanks,
Nate
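
As a rough cross-check of the numbers above, the sketch below assumes the three columns are message size in bytes, bandwidth in MiB/s, and average time in microseconds. The header rows are elided in the pasted output, so these units are an assumption, not something the log states. Under that reading, bandwidth can be re-derived from size and time and converted to Gbit/s for comparison with the 200 Gbps link rate.

```python
# Assumed column meaning (the header is elided in the log above):
#   size in bytes, bandwidth in MiB/s, average time in microseconds.
ROWS = [
    (1048576, 23117.53, 43.26),
    (16777216, 23163.12, 690.75),
]

MIB = 1 << 20


def derived_mib_per_s(size_bytes, time_us):
    """Bandwidth implied by one transfer of size_bytes taking time_us."""
    return (size_bytes / MIB) / (time_us * 1e-6)


def mib_per_s_to_gbps(mib_per_s):
    """Convert MiB/s to decimal gigabits per second."""
    return mib_per_s * MIB * 8 / 1e9


for size, reported, time_us in ROWS:
    derived = derived_mib_per_s(size, time_us)
    print(f"{size:>9} B  reported {reported:.2f}  derived {derived:.2f} MiB/s  "
          f"~{mib_per_s_to_gbps(reported):.1f} Gbit/s")
```

If the units are as assumed, the reported and derived bandwidths agree closely, and the large-message rows sit a few percent below the 200 Gbps line rate.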

alalazo previously approved these changes Jan 7, 2026
@alalazo self-assigned this Jan 7, 2026
@soumagne
Contributor

soumagne commented Jan 9, 2026

@nhanford Apologies for the slow response. BUILD_TESTING_PERF was added in 2.3.0. You can add it next to that:

        if "@2.3.0:" in spec:
            cmake_args.append(define("BUILD_TESTING_UNIT", self.run_tests))

BUILD_TESTING has always been there as far as I recall. For some reason I had the impression that this was already supported in the spack recipe...

@soumagne
Contributor

soumagne commented Jan 9, 2026

Also, unrelated to that: @nhanford, the error you're getting when running that perf test with cxi is something we've been investigating. Could you please let me know whether you see it frequently when running that particular benchmark? Thanks.

@nhanford
Contributor Author

nhanford commented Jan 9, 2026

@soumagne Thanks for the info. This version should be a lot cleaner. Unfortunately, I was not able to test it on my system because the build failed for me due to some CMake issues, but this should work.
Yes, I observed that failure for smaller message sizes persistently across many tests. The system under test is identical to LLNL El Capitan (Cray EX, Slingshot, AMD CPU/APU) except that it uses Slingshot Host Software 13 and CXI VNIs set using Flux.


        if "@2.3.0:" in spec:
            cmake_args.append(define("BUILD_TESTING_UNIT", self.run_tests))
            cmake_args.append(define("BUILD_TESTING_PERF", self.run_tests))

I think we could add a perf variant instead? The idea behind having separate variables was that, in most cases, users want to install the perf utilities without building the whole test suite.

So, having instead something like:

Suggested change
-            cmake_args.append(define("BUILD_TESTING_PERF", self.run_tests))
+            cmake_args.append(define_from_variant("BUILD_TESTING_PERF", "perf"))
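
To make the intent of the suggestion concrete, here is a minimal standalone sketch of the resulting flag selection. It is a stand-in for the recipe logic, not Spack's API: `cmake_defines`, the version tuple, and the `-D...:BOOL=` string format are illustrative assumptions, and `perf` models the proposed variant. The point it shows is that the perf utilities can be enabled independently of `run_tests`, with both flags gated on 2.3.0 or later.

```python
def cmake_defines(version, run_tests, perf):
    """Return the CMake definitions discussed above for a Mercury version.

    Hypothetical helper: `version` is a plain (major, minor, patch) tuple,
    `run_tests` mirrors self.run_tests, and `perf` models the proposed
    `perf` variant. None of this is Spack's real API.
    """

    def define(name, value):
        # Mimics the shape of a CMake boolean cache definition.
        return f"-D{name}:BOOL={'ON' if value else 'OFF'}"

    args = []
    # Both options exist only from 2.3.0 onward, per the thread.
    if version >= (2, 3, 0):
        args.append(define("BUILD_TESTING_UNIT", run_tests))
        args.append(define("BUILD_TESTING_PERF", perf))
    return args


# Perf utilities requested, unit tests not run.
print(cmake_defines((2, 3, 0), run_tests=False, perf=True))
# Older release: neither flag is passed.
print(cmake_defines((2, 2, 0), run_tests=True, perf=True))
```

In the actual recipe, the variant would presumably also need to be declared and restricted to the versions that support it; that part is not shown here.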

@soumagne
Contributor

soumagne commented Jan 9, 2026

> @soumagne Thanks for the info. This version should be a lot cleaner. Unfortunately I was not able to test it on my system because the build failed for me due to some CMake issues, but this should work. Yes I observed that failure for smaller message sizes persistently across many tests. The system under test is identical to LLNL El Capitan (Cray EX, Slingshot, AMD CPU/APU) except it uses Slingshot Host Software 13 and CXI VNIs set using Flux.

Thanks, this is very useful information. If you are able to reproduce this error consistently, could you please run with the environment variables HG_LOG_LEVEL=warn and HG_LOG_SUBSYS=hg,na,libfabric set, and file an issue on Mercury's GitHub? We can follow up there. Thanks!
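
One way to script that rerun is sketched below. It only assembles the command line from the original report and the two logging variables named in this comment. The helper name `repro_command` is hypothetical, and the sketch assumes `na_bw_get` is on `PATH`; any job launcher the system needs is not shown.

```python
import os


def repro_command(base_env=None):
    """Return (argv, env) for rerunning the benchmark with Mercury logging on.

    Hypothetical helper: copies the given environment (or the current one)
    and adds the two logging variables requested in the thread.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HG_LOG_LEVEL"] = "warn"
    env["HG_LOG_SUBSYS"] = "hg,na,libfabric"
    # Same invocation as in the original report.
    argv = ["na_bw_get", "-c", "ofi", "-p", "cxi", "-b", "-l", "1000"]
    return argv, env


argv, env = repro_command()
print(f"HG_LOG_LEVEL={env['HG_LOG_LEVEL']} "
      f"HG_LOG_SUBSYS={env['HG_LOG_SUBSYS']} " + " ".join(argv))
```

The returned pair can be passed to `subprocess.run(argv, env=env)` on a node where the benchmark is installed, with the output captured for the issue report.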
