Purpose: Document the two-phase validation approach used in the hotSpring study
See also: STUDY.md for complete results
Every published workload passes through two independent phases:
Phase A (Python control): Reproduce published computational physics results using the original Python implementations on our hardware. This establishes:
- Correctness: our hardware and software environment reproduce the published physics
- Performance baseline: Traditional scientific Python stack (NumPy, SciPy, Numba, mystic)
- Bug inventory: Upstream issues found during reproduction
Phase A uses the original authors' code wherever possible. When code is inaccessible (Code Ocean gating), we reconstruct from the paper's description plus available data.
Phase B (BarraCuda): Re-implement the same computations using only BarraCuda native functions (pure Rust + WGSL). Compare:
- Accuracy: chi2/datum on identical experimental datasets
- Throughput: evaluations per second
- Evaluation efficiency: function evaluations needed to reach solution quality
- Dependencies: external library count (BarraCuda target: zero)
- Reproducibility: deterministic results from seed
- Physical constraints: Nuclear Matter Properties within published ranges
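As a concrete reading of the accuracy metric, chi2/datum is the mean sigma-weighted squared residual. A minimal sketch (the arrays below are toy values, not the study's datasets):

```python
import numpy as np

def chi2_per_datum(model: np.ndarray, data: np.ndarray, sigma: np.ndarray) -> float:
    """chi^2 per datum: mean of sigma-weighted squared residuals."""
    residuals = (model - data) / sigma
    return float(np.mean(residuals ** 2))

# Toy example with hypothetical values:
data = np.array([1.0, 2.0, 3.0])
model = np.array([1.1, 1.9, 3.2])
sigma = np.array([0.1, 0.1, 0.1])
print(chi2_per_datum(model, data, sigma))  # ≈ 2.0
```

Identical datasets and identical sigmas on both substrates make this number directly comparable across Python and BarraCuda runs.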
Source: Murillo Group, Michigan State University
Code: github.com/murillo-group/sarkas
Reference data: Dense Plasma Properties Database
Phase A: Reproduce Dynamic Structure Factor S(q,omega) for Yukawa and Coulomb one-component plasmas. 12 cases spanning kappa = 0-3, Gamma = 10-1510. Five observables validated per case: DSF, SSF, VACF, g(r), total energy.
Phase C (GPU MD): Re-execute the 9 PP Yukawa cases entirely on GPU using f64 WGSL shaders (SHADER_F64). Validate energy conservation, RDF, VACF, SSF, and diffusion coefficient against physical expectations. N=2000, Velocity-Verlet integrator, Berendsen thermostat, periodic boundary conditions.
Acceptance criteria:
- Phase A: Plasmon peak frequencies within 15% of reference spectra
- Phase C: Energy drift < 5%, RDF tail |g(r)-1| < 0.15, monotonic observable trends with coupling
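The Phase C integrator and the energy-drift criterion can be sketched in outline with a minimal NVE Velocity-Verlet loop; a 1D harmonic oscillator stands in for the Yukawa system here, and this is not the WGSL implementation:

```python
def velocity_verlet(x, v, force, dt, steps, m=1.0):
    """Velocity-Verlet in kick-drift-kick form, the integration scheme named above."""
    f = force(x)
    for _ in range(steps):
        v = v + 0.5 * dt * f / m      # half kick
        x = x + dt * v                # drift
        f = force(x)
        v = v + 0.5 * dt * f / m      # half kick
    return x, v

# Toy check on a harmonic oscillator (k = m = 1): drift stays tiny.
force = lambda x: -x
x0, v0 = 1.0, 0.0
x, v = velocity_verlet(x0, v0, force, dt=0.01, steps=10_000)
e0 = 0.5 * v0 ** 2 + 0.5 * x0 ** 2
e1 = 0.5 * v ** 2 + 0.5 * x ** 2
drift = abs(e1 - e0) / e0
print(f"relative energy drift: {drift:.2e}")  # well under the 5% criterion
```

The symplectic structure of Velocity-Verlet is what makes the < 5% drift criterion achievable over 80k steps; a naive Euler integrator would fail it.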
Source: Murillo Group / UCLA
Code: github.com/MurilloGroupMSU/Two-Temperature-Model
Reproduce electron-ion temperature equilibration for noble gas plasmas (Ar, Xe, He). Both 0D (local) and 1D (hydrodynamic) formulations.
Acceptance criterion: Te-Ti equilibrium achieved within physically reasonable timescales.
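The 0D formulation reduces to a pair of coupled relaxation ODEs. A minimal sketch with a constant coupling rate and equal heat capacities (illustrative values, not the study's rate coefficients):

```python
def equilibrate(Te, Ti, nu, dt, steps):
    """0D two-temperature relaxation:
    dTe/dt = -nu (Te - Ti),  dTi/dt = +nu (Te - Ti)  (equal heat capacities)."""
    for _ in range(steps):
        dT = Te - Ti
        Te -= nu * dT * dt
        Ti += nu * dT * dt
    return Te, Ti

Te, Ti = equilibrate(Te=100.0, Ti=1.0, nu=0.5, dt=0.01, steps=2000)
print(Te, Ti)  # both approach (100 + 1) / 2 = 50.5
```

The temperature difference decays as exp(-2 nu t) while the sum is conserved, which is the "physically reasonable timescale" being checked: equilibration time is set by the coupling rate, not by numerics.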
Source: Diaw et al. (2024), Nature Machine Intelligence
Code: Code Ocean DOI:10.24433/CO.1152070.v1 — gated, registration denied
Data: Zenodo DOI:10.5281/zenodo.10908462 — open, CC-BY
Reproduce the iterative SparsitySampler + RBF surrogate workflow on 9 benchmark functions plus a physics-based EOS from validated Sarkas MD. The nuclear EOS headline result was reconstructed from first principles because the Code Ocean capsule is inaccessible.
Acceptance criterion: RBF surrogates converge (chi2 decreasing over rounds).
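The RBF surrogate step can be sketched with a plain Gaussian-kernel interpolant; this is a minimal stand-in for the SparsitySampler's surrogate, with hypothetical sample points:

```python
import numpy as np

def rbf_fit(X, y, eps=1.0):
    """Fit a Gaussian RBF interpolant: s(x) = sum_i w_i exp(-eps^2 |x - x_i|^2)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Phi = np.exp(-(eps ** 2) * d2)
    return np.linalg.solve(Phi, y)

def rbf_eval(X, w, x, eps=1.0):
    """Evaluate the fitted surrogate at a single point x."""
    d2 = np.sum((x[None, :] - X) ** 2, axis=-1)
    return float(np.exp(-(eps ** 2) * d2) @ w)

# Toy 1D objective sampled at a few points; the surrogate is exact at the nodes.
X = np.array([[0.0], [0.5], [1.0], [1.5]])
y = np.sin(2 * np.pi * X[:, 0])
w = rbf_fit(X, y)
print(rbf_eval(X, w, np.array([0.5])))  # ≈ y[1]
```

In the iterative workflow, each round refits the surrogate on all evaluations so far and proposes new points where the surrogate (plus an exploration term) looks promising, which is how chi2 decreases over rounds.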
Problem: Fit 10 Skyrme interaction parameters to minimize chi2 between calculated and experimental nuclear binding energies:
objective = chi2_BE/datum + lambda * chi2_NMP/datum
over 52 nuclei from the AME2020 mass evaluation (IAEA Nuclear Data Services).
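The L1 fidelity level uses the analytic SEMF. A minimal sketch of the textbook formula with standard literature coefficients (illustrative values, not the study's fitted Skyrme parameters):

```python
def semf_binding(A, Z, aV=15.75, aS=17.8, aC=0.711, aA=23.7, aP=11.18):
    """Textbook semi-empirical mass formula, binding energy in MeV.
    Coefficients are common literature values for illustration only."""
    N = A - Z
    B = (aV * A                              # volume term
         - aS * A ** (2 / 3)                 # surface term
         - aC * Z * (Z - 1) / A ** (1 / 3)   # Coulomb term
         - aA * (A - 2 * Z) ** 2 / A)        # asymmetry term
    # Pairing: +delta for even-even, -delta for odd-odd, 0 otherwise.
    delta = aP / A ** 0.5
    if Z % 2 == 0 and N % 2 == 0:
        B += delta
    elif Z % 2 == 1 and N % 2 == 1:
        B -= delta
    return B

# Fe-56: experimental binding energy per nucleon is about 8.8 MeV.
print(semf_binding(56, 26) / 56)  # ≈ 8.8
```

Evaluating this for all 52 nuclei and forming chi2_BE/datum against AME2020 values gives the L1 objective's first term; the lambda * chi2_NMP/datum term adds the nuclear-matter constraints.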
| Level | Physics | Fidelity | Cost per eval | Phase A | Phase B |
|---|---|---|---|---|---|
| L1 | SEMF (analytic) | Low | ~microseconds | Python | BarraCuda |
| L2 | Spherical HFB (SCF) | Medium | ~seconds | Python | BarraCuda |
| L3 | Deformed HFB (SCF, 2D) | High | ~minutes | — | BarraCuda (complete) |
| Method | Algorithm | Library |
|---|---|---|
| DirectSampler | Round-based multi-start Nelder-Mead | BarraCuda |
| SparsitySampler | RBF surrogate-guided + exploration | BarraCuda / mystic |
| Python/mystic | SparsitySampler + NelderMeadSimplexSolver | SciPy ecosystem |
From published literature (Chabanat 1998, Bender 2003):
| Property | Target | Sigma | Physical meaning |
|---|---|---|---|
| rho0 | 0.16 fm^-3 | 0.005 | Nuclear saturation density |
| E/A | -15.97 MeV | 0.5 | Binding energy per nucleon |
| K_inf | 230 MeV | 20 | Incompressibility |
| m*/m | 0.69 | 0.1 | Effective mass ratio |
| J | 32 MeV | 2 | Symmetry energy coefficient |
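A minimal sketch of the 2-sigma acceptance check against these targets (the `computed` values are hypothetical):

```python
# Literature targets from the table above: (target, sigma) per property.
targets = {
    "rho0":  (0.16,   0.005),
    "E/A":   (-15.97, 0.5),
    "K_inf": (230.0,  20.0),
    "m*/m":  (0.69,   0.1),
    "J":     (32.0,   2.0),
}

def nmp_within_2sigma(computed: dict) -> int:
    """Count nuclear-matter properties within two sigma of their target."""
    return sum(abs(computed[k] - mu) <= 2 * s for k, (mu, s) in targets.items())

# Hypothetical computed NMP values, for illustration only:
computed = {"rho0": 0.158, "E/A": -15.8, "K_inf": 245.0, "m*/m": 0.75, "J": 33.5}
print(nmp_within_2sigma(computed))  # 5
```

This is the "NMP within 2sigma: 5/5 properties" check that appears in the L2 acceptance criteria below.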
- Run Python reference with default settings, record chi2/datum and NMP
- Run BarraCuda with same experimental data, same parameter bounds, varied seeds
- Compare on identical metrics: accuracy, NMP correctness, evaluation count, wall time
- Measure energy: CPU via Intel RAPL, GPU via nvidia-smi polling (100ms interval), integrated to Joules
- Report per-region accuracy breakdown (light A<56, medium 56-100, heavy 100-200, very heavy 200+)
- Report bootstrap confidence intervals on chi2/datum
- All results saved to JSON (structured benchmark reports with hardware inventory) for exact reproduction
- Seed-deterministic: same seed produces identical results
- Three-substrate comparison: Python, BarraCuda CPU, BarraCuda GPU — same physics, different hardware cost
- GPU MD validation (Phase C): energy conservation, RDF physical trends, VACF diffusion, SSF compressibility — validated against known plasma physics expectations
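The energy measurement step, polling instantaneous power and integrating to Joules, reduces to trapezoidal integration of the sample stream. A minimal sketch with hypothetical 100 ms samples:

```python
import numpy as np

def integrate_energy(t_s: np.ndarray, power_w: np.ndarray) -> float:
    """Trapezoidal integration of power samples (W) over time (s) -> Joules."""
    dt = np.diff(t_s)
    avg = 0.5 * (power_w[1:] + power_w[:-1])
    return float(np.sum(dt * avg))

# Hypothetical 100 ms polling of a GPU drawing a steady ~200 W for 1 s.
t = np.arange(0.0, 1.01, 0.1)     # 11 samples, 100 ms apart
p = np.full_like(t, 200.0)
print(integrate_energy(t, p))  # ≈ 200 J
```

The same reduction applies to the RAPL counters on the CPU side, so the two substrates report energy in the same units.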
All experiments run on a single consumer workstation:
| Component | Specification |
|---|---|
| CPU | Intel i9-12900K (8P+8E, 24 threads) |
| RAM | 32 GB DDR5-4800 |
| OS | Pop!_OS 22.04 (Linux 6.17) |
| Rust | stable (rustc) |
| Python | 3.9 (Sarkas), 3.10 (surrogate, TTM) via micromamba |
| GPU (current) | RTX 4070 (SHADER_F64 confirmed, ~2x FP64:FP32 ratio via wgpu/Vulkan) |
| GPU (incoming) | Titan V (7.4 TFLOPS FP64) — planned for compute-bound L3 |
Three systems that could not be accessed during this study:
| System | Gate | Impact | Our Response |
|---|---|---|---|
| Code Ocean capsule (Diaw et al.) | Registration denied ("OS is denied") | Cannot run paper's nuclear EOS | Rebuilt from first principles |
| HFBTHO (ORNL Fortran solver) | Institutional access required | No axially-deformed reference | Built L3 from textbook equations |
| LANL nuclear simulation data | Restricted | No ground truth EOS | Used public AME2020 instead |
In each case, we replaced the gated system with an open equivalent built from public data and published equations. This is more work but produces a fully transparent, independently reproducible result.
| Level | Metric | Target |
|---|---|---|
| L1 | chi2_BE/datum | < 10 (SEMF physics floor) |
| L2 | chi2_BE/datum | < 5 (spherical HFB physics floor) |
| L2 | NMP within 2sigma | 5/5 properties |
| L3 | chi2_BE/datum | < 0.5 (deformed HFB physics floor) |
| L3 | RMS (MeV) | < 1 MeV |
Current status:
- L1: target exceeded (2.27 via DirectSampler, 1.52 via a GPU extended run, chi2_BE = 0.69 on the full 2,042-nucleus Pareto sweep)
- L2: GPU-batched HFB operational (BatchedEighGpu: 2,042 nuclei in 66 min, 99.85% convergence, chi2 = 224.52 on the full dataset with SLy4). Best accuracy on the 52-nucleus set: 16.11 via DirectSampler vs 1.93 for the Python SparsitySampler, a sampling-strategy gap. GPU FP64 validated to sub-ULP precision (4.55e-13 MeV)
- L3: first full-scale run: 295/2036 nuclei improved over L2, best-of-both chi2 = 13.92; numerical stabilization in progress
| Observable | Criterion | Status (80k steps) |
|---|---|---|
| Energy drift | < 5% | All 9 cases <= 0.006% |
| RDF tail | abs(g(r)-1) < 0.15 | All 9 cases <= 0.0017 |
| RDF peak trend | Increases with Gamma | Verified all 9 |
| Diffusion D* | Decreases with Gamma | Verified all 9 |
| SSF S(k->0) | Physical compressibility range | Verified all 9 |
Result: 9/9 PP Yukawa cases pass at N=2000, 80k production steps on RTX 4070. 149-259 steps/s sustained. 3.7x GPU speedup vs CPU. 3.4x less energy per step. The 80k long run confirms energy conservation stability well beyond the 30k steps used in the original Sarkas study.
- Native WGSL builtins (sqrt, exp on f64): 2-6× throughput improvement
- N-scaling: 500 → 20,000, 0.000% drift at all N
- Cell-list O(N) + branch-based wrapping (i32 % bug fixed)
- Paper parity: N=10,000 in 5.3 minutes
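The cell-list scheme above can be sketched in Python; this illustrates the O(N) binning idea with periodic boundaries and minimum-image distances, not the WGSL implementation:

```python
import numpy as np

def cell_list_pairs(pos: np.ndarray, box: float, rcut: float):
    """O(N) neighbor search: bin particles into cells of side >= rcut,
    then test only the 27 neighboring cells (periodic boundaries)."""
    ncell = max(1, int(box / rcut))
    size = box / ncell
    cells = {}
    for i, p in enumerate(pos):
        key = tuple((p // size).astype(int) % ncell)
        cells.setdefault(key, []).append(i)
    pairs = set()
    for (cx, cy, cz), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nb = ((cx + dx) % ncell, (cy + dy) % ncell, (cz + dz) % ncell)
                    for i in members:
                        for j in cells.get(nb, []):
                            if i < j:
                                d = pos[i] - pos[j]
                                d -= box * np.round(d / box)   # minimum image
                                if np.dot(d, d) < rcut * rcut:
                                    pairs.add((i, j))
    return sorted(pairs)

# Cross-check against an O(N^2) brute-force pass on a small configuration.
rng = np.random.default_rng(0)
pos = rng.random((50, 3)) * 10.0
pl = cell_list_pairs(pos, box=10.0, rcut=2.5)
brute = []
for i in range(50):
    for j in range(i + 1, 50):
        d = pos[i] - pos[j]
        d -= 10.0 * np.round(d / 10.0)
        if np.dot(d, d) < 2.5 ** 2:
            brute.append((i, j))
print(pl == sorted(brute))  # True
```

The "branch-based wrapping" note above refers to the same modular arithmetic: in WGSL, `%` on i32 follows truncated division, so negative cell indices must be wrapped explicitly rather than relying on the operator.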
| Observable | Criterion | Status (N=10k, 80k steps) |
|---|---|---|
| Energy drift | < 5% | All 9 cases: 0.000-0.002% |
| All-pairs vs cell-list | Physics-driven mode selection | 4.1× speedup for cell-list |
| Total wall time | All 9 cases | 3.66 hours, $0.044 |
Result: 9/9 PP Yukawa cases pass at N=10,000, 80k production steps — exact paper configuration. Cell-list (κ=2,3) runs at 118.5 steps/s avg; all-pairs (κ=1) at 28.8 steps/s avg. Toadstool GPU ops (BatchedEighGpu, SsfGpu, PppmGpu) wired into pipeline.
Phase F extends the validated nuclear EOS engine to the full AME2020 dataset (2,042 nuclei). Acceptance criteria shift from pass/fail to characterization — the goal is to document where the models work, where they fail, and why.
| Level | Metric | Criterion | Status |
|---|---|---|---|
| L1 Pareto | Pareto frontier generated | 3+ lambda values, 3+ seeds each | 7 lambda x 5 seeds = 35 runs |
| L1 Pareto | chi2_BE range identified | Frontier spans > 10x range | 0.69 to 15.38 (22x range) |
| L1 Pareto | NMP trade-off characterized | At least one lambda with 4/5 NMP within 2sigma | lambda=25: 4/5, lambda=100: 4/5 |
| L2 GPU | Full dataset processed | All 2,042 nuclei attempted | 2,042/2,042 attempted, 2,039 converged |
| L2 GPU | Convergence rate | > 95% of HFB nuclei converge | 99.85% (2039/2042) |
| L2 GPU | GPU dispatches | BatchedEighGpu operational | 101 dispatches (mega-batch), 40.9 min |
| L3 Deformed | Full dataset attempted | All 2,042 nuclei | 2,042 attempted |
| L3 Deformed | Improvement over L2 | > 0 nuclei where L3 < L2 error | 295/2036 nuclei (14.5%) |
| L3 Deformed | Mass-region analysis | Breakdown by A regions | 4 regions documented |
Phase F quantitative checks: 9 (3 L1 Pareto + 3 L2 GPU + 3 L3 Deformed)
Result: 9/9 Phase F characterization checks pass. The checks confirm the infrastructure works at scale — the physics accuracy gaps (L2 chi2=224, L3 numerical overflow) are documented as current model limitations, not validation failures.
Full-system profiling of the L3 GPU-hybrid run (nvidia-smi + vmstat, 2,823 + 3,093 samples). Finding: 79.3% GPU utilization, yet 16× slower than CPU-only. Root cause: ~145,000 synchronous dispatches, each with its own buffer allocation and a blocking readback. The GPU is "busy doing overhead, not physics." The architectural fix is to batch all eigensolves across all nuclei, persist GPU buffers, and use async readback for convergence flags only. See experiments/004_GPU_DISPATCH_OVERHEAD_L3.md.
Applied the mega-batch remedy to L2 spherical HFB. Dispatches reduced 206→101, wall time 66.3→40.9 min, GPU utilization rose to 94.9%. But CPU-only L2 finishes in 35.1s — 70× faster. Root cause: eigensolve is ~1% of total SCF iteration time. The remaining 99% (H-build, BCS, density) runs on CPU. This is Amdahl's Law applied to small matrices (4×4 to 12×12).
Complexity boundary: the GPU breaks even at matrix dimension ~30–50. Below this, CPU L1 cache coherence beats GPU dispatch overhead; above it (L3 and beyond mean-field), the GPU dominates. The path to pure-GPU-faster-than-CPU: move ALL physics to WGSL shaders (a GPU-resident SCF loop with zero CPU↔GPU round-trips). See experiments/005_L2_MEGABATCH_COMPLEXITY_BOUNDARY.md.
This is a methodological lesson: GPU profiling must track dispatch count and round-trip overhead, not just utilization percentage. High utilization with small synchronous dispatches is a false positive for GPU efficiency.
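The Amdahl's Law argument can be made concrete with a two-line model; the fractions below restate the ~1% eigensolve share from the L2 analysis:

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is accelerated by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# L2 case: the eigensolve is ~1% of SCF iteration time. Even an infinitely
# fast GPU eigensolver caps the overall speedup near 1.01x.
print(amdahl_speedup(0.01, 1e9))   # ~1.0101
# If 99% of the work moved to a 10x-faster GPU path, the picture inverts.
print(amdahl_speedup(0.99, 10.0))  # ~9.17
```

This is why the remedy is architectural (move the whole SCF loop on-GPU) rather than a faster eigensolver kernel.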
| Component | Version | Notes |
|---|---|---|
| Rust | stable (1.77+) | rustc --version |
| wgpu | 0.19+ | Vulkan backend, SHADER_F64 |
| Python | 3.9 (Sarkas), 3.10 (surrogate, TTM) | via micromamba |
| Sarkas | v1.0.0 (pinned, fd908c41) | 3 patches applied |
| NumPy | 1.26+ | 2.x compat patches |
| PyTorch | 2.1+ (CUDA) | GPU RBF only |
| nalgebra | 0.32+ | CPU eigensolvers |
| rayon | 1.8+ | CPU parallel HFB |
| OS | Pop!_OS 22.04 (Linux 6.17) | |
End-to-end validation of BarraCuda's abstracted GPU ops through hotSpring's
ValidationHarness. Proves the ToadStool op layer produces correct physics
by cross-validating GPU results against CPU f64 references.
| Binary | Ops Tested | Checks | Key Result |
|---|---|---|---|
| validate_barracuda_pipeline | YukawaForceF64, VelocityVerletKickDrift/HalfKick, BerendsenThermostat, KineticEnergy | 12 | 0.000% energy drift, force error 1.86e-7 |
| validate_barracuda_hfb | BcsBisectionGpu, BatchedEighGpu, BCS with degeneracy | 14 | BCS μ error 6.2e-11, eigenvalue error 2.4e-12 |
| Observable | Criterion | Status |
|---|---|---|
| D* MSD ≈ VACF | Within 50% | Verified |
| Energy conservation | NVE drift < 5% | Verified |
| Temperature stability | T* fluctuations < 30% | Verified |
| D* vs analytical fit | Sarkas-calibrated Daligault within 80% | Verified |
| Physical ordering | D*(κ=2) < D*(κ=1) | Verified |
| Positivity | All D*, η*, λ* > 0 | Verified |
Result: 13/13 checks pass (validate_stanton_murillo). A GPU-only variant of the pipeline, validate_transport_gpu_only, completes in ~493 s with zero readback.
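The D* estimate via the MSD route (Einstein relation) can be sketched as follows; a random walk stands in for the Yukawa MD trajectory, so this is a toy check of the estimator, not the study's data:

```python
import numpy as np

def diffusion_from_msd(positions: np.ndarray, dt: float) -> float:
    """Einstein relation in 3D: D = lim_t MSD(t) / (6 t).
    `positions` has shape (steps, n_particles, 3), unwrapped coordinates."""
    disp = positions - positions[0]
    msd = np.mean(np.sum(disp ** 2, axis=-1), axis=1)  # average over particles
    t = np.arange(len(msd)) * dt
    # Fit the linear regime (skip the early, ballistic-like part).
    slope = np.polyfit(t[len(t) // 2:], msd[len(t) // 2:], 1)[0]
    return slope / 6.0

# Toy check: a random walk with per-axis step variance sigma^2 has
# D = sigma^2 / (2 dt), i.e. the 3D MSD slope is 6 D.
rng = np.random.default_rng(1)
steps, n, dt, sigma = 4000, 200, 1.0, 0.1
walk = np.cumsum(rng.normal(0.0, sigma, (steps, n, 3)), axis=0)
print(diffusion_from_msd(walk, dt))  # ≈ sigma**2 / (2 * dt) = 0.005
```

The VACF route integrates the velocity autocorrelation instead (Green-Kubo); the "D* MSD ≈ VACF within 50%" criterion above checks that the two independent estimators agree.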
| Observable | Criterion | Status |
|---|---|---|
| Cold plaquette | Near 1.0 (weak coupling) | Verified |
| Hot plaquette | < Cold (thermalized) | Verified |
| HMC acceptance | > 50% at physical β | 96-100% |
| Dirac CG convergence | Residual < 1e-10 | Verified |
| Reunitarization | det(U) ≈ 1, U†U ≈ I | Verified |
Result: 12/12 checks pass (validate_pure_gauge).
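The reunitarization criterion (det(U) ≈ 1, U†U ≈ I) can be illustrated with an SVD-based projection of a drifted link matrix back onto SU(N); this is a common technique and a sketch only, not necessarily the method used in the validator:

```python
import numpy as np

def reunitarize(U: np.ndarray) -> np.ndarray:
    """Project a drifted link matrix onto SU(N): SVD polar decomposition
    gives the nearest unitary, then the determinant phase is scaled out."""
    W, _, Vh = np.linalg.svd(U)
    V = W @ Vh                                   # nearest unitary matrix
    n = U.shape[0]
    return V * np.linalg.det(V) ** (-1.0 / n)    # force det = 1

# Toy check: perturb a unitary 2x2 matrix, reunitarize, verify both criteria.
rng = np.random.default_rng(2)
U = np.linalg.qr(rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)))[0]
U_drift = U + 1e-3 * (rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)))
V = reunitarize(U_drift)
print(np.allclose(V.conj().T @ V, np.eye(2)), np.isclose(np.linalg.det(V), 1.0))
```

In a long HMC run, accumulated floating-point drift slowly pushes links off the group manifold, which is why periodic reunitarization is checked explicitly.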
| Observable | Criterion | Status |
|---|---|---|
| Trace anomaly peak | Near T_c ≈ 155 MeV | Verified |
| Pressure monotonicity | Increasing with T | Verified |
| Speed of sound | → 1/3 at high T | Verified |
Result: Thermodynamic consistency validated (validate_hotqcd_eos).
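The high-temperature limit in the speed-of-sound check follows from the ideal ultrarelativistic EOS, where ε = 3p implies c_s² = dp/dε = 1/3. A numeric sketch of that consistency check:

```python
import numpy as np

# Stefan-Boltzmann-like limit: p ∝ T^4, ε = 3 p (arbitrary units).
T = np.linspace(1.0, 2.0, 1001)
p = T ** 4                   # pressure
eps = 3.0 * p                # energy density, ideal ultrarelativistic gas
cs2 = np.gradient(p) / np.gradient(eps)   # c_s^2 = dp/dε, finite differences
print(cs2.mean())  # ≈ 1/3
```

Near T_c the interacting EOS dips well below 1/3 (the trace anomaly peak above); approaching 1/3 from below at high T is the consistency condition being validated.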
| Phase | Checks | Description |
|---|---|---|
| A (Python control) | 86 | 60 MD + 6 TTM + 15 surrogate + 5 EOS |
| C (GPU MD, N=2000) | 45 | 9 cases × 5 observables |
| D (N-scaling + builtins) | 16 | 5 N values + 6 cell-list diag + 5 native builtins |
| E (Paper-parity + rewire) | 13 | 9 long-run cases + 1 profiling + 3 GPU ops |
| F (Full-scale nuclear EOS) | 9 | 3 L1 Pareto + 3 L2 GPU + 3 L3 deformed |
| G (Transport + Lattice QCD) | 25 | 13 transport + 12 pure gauge |
| Pipeline (BarraCuda ops) | 26 | 12 MD pipeline + 14 HFB pipeline |
| Total | 220+ | All pass |