Replies: 1 comment
Great data, Tim. I ran into the same issue - spent ~$50 on 1×H100 runs before realizing the rankings don't transfer. What I found: on a single GPU you get ~500 training steps vs 8,000+ on 8×H100. That means techniques that need more steps to differentiate (QAT, LoRA TTT, EMA) look worse on 1×GPU than they actually are. Architecture changes show up earlier because they affect loss-per-step, not total-steps. Your heuristic is right - architecture first on single GPU, then validate the token-hungry methods on 8×H100 when you're confident enough to spend the $3.50/run. I built a pod benchmarking script that might help - even within 8×H100 pods, performance varies by location. Iceland pods hit 742 TFLOPS vs ~650 in Kansas City.
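The pod-benchmarking idea above boils down to timing large matmuls and converting to achieved TFLOPS. Here is a minimal sketch of that logic; on a real 8×H100 pod you would time large bf16 `torch.matmul` calls on each CUDA device, but NumPy on CPU stands in here so the structure is runnable anywhere. The function name and sizes are my own illustrative choices, not the script mentioned in the comment.

```python
import time

import numpy as np


def measure_tflops(n: int = 1024, iters: int = 10) -> float:
    """Time dense n x n matmuls and report achieved TFLOPS.

    A matmul of two n x n matrices costs ~2 * n**3 FLOPs, so
    achieved throughput is total FLOPs / elapsed seconds.
    """
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up so allocation/dispatch cost isn't timed
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iters
    return flops / elapsed / 1e12  # TFLOPS


if __name__ == "__main__":
    print(f"achieved: {measure_tflops():.3f} TFLOPS")
```

Running this on each candidate pod before committing a long job is how you would catch the Iceland-vs-Kansas-City spread the comment describes.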
Over the weekend, I ran a few OpenAI parameter-golf experiments on a single RTX 4090 with a 60-minute budget. The final ranking I got did not match the 8×H100 ranking.
My current guess is that methods like QAT and SWA need more tokens before their gains become visible, so they may be undervalued in a single-GPU ranking. Architecture changes, on the other hand, seem easier to validate under a tight local budget.
So for single-GPU iteration, my current heuristic is to prioritize architecture changes first, while keeping in mind that some token-hungry methods may become more competitive at larger scale.
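To make the step-budget gap concrete, here is the back-of-the-envelope arithmetic. The per-step times are illustrative assumptions I am plugging in (not measured numbers from either setup), chosen so the totals roughly match the ~500 vs 8,000+ step counts mentioned in this thread.

```python
def steps_in_budget(budget_min: float, sec_per_step: float) -> int:
    """How many optimizer steps fit in a wall-clock budget."""
    return int(budget_min * 60 / sec_per_step)


# Placeholder step times (assumptions, not measurements):
# ~7 s/step on one RTX 4090, ~0.4 s/step on 8xH100.
single_gpu = steps_in_budget(60, 7.0)   # ~514 steps
eight_h100 = steps_in_budget(60, 0.4)   # 9000 steps
```

Under any step times in this ballpark, a 60-minute single-GPU run gives a method far fewer steps to show a gain, which is exactly where slow-to-differentiate techniques like QAT and SWA get penalized.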
Curious whether others have seen the same pattern.