ScaFFold benchmark fixes and torchrun-hpc launcher option #1234

Open
michaelmckinsey1 wants to merge 13 commits into develop from fix-scaffold
@michaelmckinsey1 michaelmckinsey1 commented Feb 6, 2026

Description

  • Fix the broken input file link.
  • Enable torchrun-hpc as the launcher for the flux and slurm schedulers (the native launchers will not work).
    • Usage: benchpark experiment init ... allocation=torchrun-hpc
  • Now working on Tuolumne: trained to 1600 epochs successfully on one node and generated a Caliper file.
    • Hangs on more than one node, probably because caliper=rocm is turned on (hangs have been seen before).
    • Needs an option to limit epochs to 1.
  • Matrix is failing to build mpi4py.

@michaelmckinsey1 michaelmckinsey1 self-assigned this Feb 6, 2026
@github-actions github-actions bot added the feature, experiment, system, and application labels Feb 6, 2026
@github-actions github-actions bot added the ci label Feb 6, 2026
@michaelmckinsey1 commented

The failure in https://lc.llnl.gov/gitlab/benchpark/benchpark/-/jobs/3346658 suggests that I would need to set export CALI_SERVICES_ENABLE=roctx.
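
If that environment variable does turn out to be the fix, a minimal sketch of setting it at the shell level before the launch might look like the following. Where the export actually belongs in the Benchpark experiment or system configuration is not settled in this PR; the snippet only assumes a job script that exports the variable and checks that child processes can see it.

```shell
# Assumption: this runs in the job script before the application is launched.
export CALI_SERVICES_ENABLE=roctx

# Sanity check: confirm the value is inherited by child processes.
sh -c 'echo "CALI_SERVICES_ENABLE=${CALI_SERVICES_ENABLE}"'
```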

@codecov-commenter commented

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.43%. Comparing base (41fa7f6) to head (ee5cbf1).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1234      +/-   ##
===========================================
+ Coverage    62.83%   64.43%   +1.60%     
===========================================
  Files           48       48              
  Lines         3643     3644       +1     
  Branches       279      279              
===========================================
+ Hits          2289     2348      +59     
+ Misses        1345     1287      -58     
  Partials         9        9              
Files with missing lines | Coverage Δ
lib/benchpark/experiment.py | 87.24% <100.00%> (+2.39%) ⬆️
lib/benchpark/test/caliper.py | 97.43% <ø> (ø)
lib/benchpark/test/experiment.py | 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

