Scaling Plots & Analysis as an Executor Step #2243
Conversation
rjpower left a comment
Thanks, I like the idea!
I had a lot of feedback about the structure, which I acknowledge is sort of inherited, so feel free to push back on how much you want to balance telling me I'm wrong, fixing things now, or filing issues to handle later.
In general I'd encourage pruning the tests a bit to focus on integration and end-to-end tests that validate a specific output we want to be reproducible (vs. constants or "variable is set" tests), and to explicitly separate the handling of wandb vs. executor steps vs. the actual math.
eric-czech left a comment
Thanks @Helw150! This is extremely helpful for understanding how to connect all of these things together.
My only somewhat-blocking concern is the assumption that FLOPs from W&B will be on the same scale as the FLOPs budgets used in configs. Assuming equal scaling exponents might be another potential issue; let me know if I'm misunderstanding something there.
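A quick way to surface this kind of mismatch, sketched here with hypothetical names (logged_flops would come from whatever key the W&B runs actually report), is to compare the logged value against the 6·N·D approximation the budgets are defined in:

def check_flops_scale(logged_flops: float, params: int, tokens: int, tol: float = 3.0) -> None:
    # If W&B reports FLOPs on the same scale as the config budgets, the logged value
    # should sit within a small factor of the 6*N*D approximation.
    expected = 6.0 * params * tokens
    ratio = logged_flops / expected
    if not (1.0 / tol <= ratio <= tol):
        raise ValueError(
            f"Logged FLOPs ({logged_flops:.2e}) differ from 6*N*D ({expected:.2e}) "
            f"by a factor of {ratio:.2f}; check that both use the same scale."
        )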
logger.warning(f"No valid candidates found for budget {target_flops:.2e}")
return None

best = min(candidates, key=lambda c: abs(c.tokens - optimal_tokens))
This seems dangerous to me in a scenario where the config with compute-optimal parameter counts, rather than token counts, didn't actually end up being included in the hidden_size loop. I suspect that's more of a problem for experiments like mine where the scaling exponents don't come out equal (and at the limit, models become huge compared to token counts with more compute), but either way I think it would be ideal if this code at least calculated compute-optimal param counts and used them to check that the config chosen from optimal token counts is in the ballpark. Or maybe it could just fail if the optimal config ends up at either boundary, as a quicker route to some guardrails?
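A minimal sketch of the boundary guardrail suggested above (function and field names are illustrative, matching the candidates/optimal_tokens identifiers in the snippet rather than the actual code):

def pick_optimal_candidate(candidates, optimal_tokens):
    # Candidates are ordered by token count (equivalently, inversely by model size at
    # fixed compute); if the chosen config sits at either end of the sweep, the true
    # optimum may lie outside the hidden_size grid, so fail loudly instead of returning it.
    ordered = sorted(candidates, key=lambda c: c.tokens)
    best = min(ordered, key=lambda c: abs(c.tokens - optimal_tokens))
    if best is ordered[0] or best is ordered[-1]:
        raise ValueError(
            f"Optimal candidate ({best.tokens:.3e} tokens) is at the boundary of the "
            "swept configs; the compute-optimal point may not be covered by the sweep."
        )
    return best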
Put somewhat differently, and to check my understanding, that would mean that this function would need to define optimal_tokens using the param scaling exponent because the
I'm doing the following:
- Find D_opt at each compute budget by fitting parabolas to actual loss values
- Fit D_opt ~ C^b from those empirical minima (from Approach 2 in Chinchilla)
- Given D_opt and C, we can compute N_opt directly as C/(6D), so we don't need to fit alpha.
This should work regardless of whether beta == alpha.
The screenshot you have there corresponds to Approach 3, I believe (the parametric loss form), rather than the parabola-based isoFLOP approach I'm using here.
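A minimal numpy sketch of that procedure, assuming runs_by_budget maps each compute budget C to a list of (tokens, loss) pairs from the isoFLOP runs (names are illustrative, not the actual library API):

import numpy as np

def fit_d_opt_per_budget(budgets, runs_by_budget):
    # Chinchilla "Approach 2": at each budget, fit loss as a parabola in log(tokens)
    # and take the vertex as the empirical optimum D_opt(C).
    d_opts = []
    for c in budgets:
        tokens, losses = zip(*runs_by_budget[c])
        a, b, _ = np.polyfit(np.log(tokens), losses, 2)   # loss ~ a*x^2 + b*x + const
        d_opts.append(np.exp(-b / (2 * a)))               # vertex of the parabola
    return np.asarray(d_opts)

def fit_token_scaling(budgets, d_opts):
    # Fit D_opt ~ k * C^b in log-log space.
    b, log_k = np.polyfit(np.log(budgets), np.log(d_opts), 1)
    return b, np.exp(log_k)

def predict_optimal(c, b, k):
    # Given a new budget C, predict D_opt and derive N_opt = C / (6 * D_opt),
    # so no separate fit for the parameter exponent (alpha) is needed.
    d_opt = k * c**b
    return d_opt, c / (6.0 * d_opt)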
Ah, so the alpha being used to define optimal_tokens is actually the same
On a related note, what do you typically call
rjpower left a comment
So this looks much better, thanks!
A few high-level recommendations:
I'd move the "recipe" protocol to isoflop_sweep itself and replace the use of LlamaConfig with opaque types or protocols as needed, e.g. isoflops might take:
from typing import Protocol

class ModelConfiguration(Protocol):
    def num_params(self) -> int: ...
    # ... whatever ...

class ScalingCallbacks(Protocol):
    def build_model_config(...) -> ModelConfiguration: ...
    # ...
(or put the num params callback here as well if you want)
- Consider grugifying even more, and favor copy-pasting the code from isoflop_sweep into the scaling_ladder experiment. Feel free to file an issue if that's a bridge too far, but that's the direction I'd encourage us to lean towards: experiments are copy-paste, library code is agnostic to the model. (If copy-paste is too painful to contemplate, then we consider "how do I refactor this to be more model-agnostic so I feel okay importing it", etc.)
- There's a bit of inherited magic around the schema of the evaluation code. We should try to put the loading logic for the eval results wherever the code that generated them lives, or maybe there's a way to share the schema that was used to generate them. (Though it's good you're trying to separate "load this specific eval format" from "analyze an eval table".)
Overall it's easier for me to follow and less enmeshed with the specifics of Marin/Levanter configuration, which is nice!
rjpower left a comment
Thanks, this is much better.
I'd recommend merging some of the data types and recipes but otherwise I think it's looking good!
while still working with LlamaConfig, QwenConfig, etc.
"""

def flops_per_token(self, vocab_size: int, seq_len: int) -> float:
it might be a bit simpler to flatten everything to be part of the "Recipe" protocol, e.g.:
def flops_per_token(candidate: CandidateConfig) -> float
def total_trainable_params(candidate: CandidateConfig) -> float
then the user only needs to define the one protocol
This one I'll push back on a bit. This protocol defines things that any model config class should have independent of scaling laws (right now in practice our base class is LlamaConfig, which already had these defined).
I don't think things that are features of the model itself in the abstract, that already existed on our base class before scaling, and that are used independently of scaling should live on the recipe. It'll just force the user to implement needless boilerplate wrapper methods.
Sure, it's definitely not a big deal. By your argument, estimate_memory_bytes should be here too, since it seems naturally a function of the model configuration, not the scaling recipe?
No, because the memory usage in bytes is dependent on both the model configuration and the training specifics (e.g. batch size, which optimizer, etc.). The same model will use different amounts of memory depending on those things.
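For illustration only (a rough back-of-envelope with made-up constants, not the repo's estimate_memory_bytes): the same parameter count yields very different footprints once batch size and optimizer state are included.

def rough_memory_bytes(params: int, batch_size: int, seq_len: int, hidden_size: int,
                       optimizer_moments: int = 2) -> int:
    # bf16 weights and grads, fp32 optimizer moments (e.g. Adam keeps two), plus a crude
    # activation term that grows with batch size; the constants are illustrative.
    weights_and_grads = 2 * 2 * params
    optimizer_state = 4 * optimizer_moments * params
    activations = 2 * batch_size * seq_len * hidden_size * 12
    return weights_and_grads + optimizer_state + activations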
This protocol defines things that any model config class should have independent of scaling laws
+1 to separating model properties from scaling-law training configuration like this, and to being able to contextualize "model properties" for specific scaling experiments (i.e. these shouldn't be functions on LlamaConfig). This feels like the right place to ablate the more controversial aspects of "effective" param counting, like embedding inclusion, gated units, and weight tying, or maybe sparse vs. active params and quantization scaling down the road. I can imagine a lot of boilerplate around creating ScalingRecipes otherwise.
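A hedged sketch of that separation (all names here are illustrative, not the PR's actual protocols): model properties live on their own protocol, with knobs like embedding inclusion ablated there, while training-dependent concerns stay on the recipe.

from typing import Protocol

class ModelProperties(Protocol):
    def total_trainable_params(self, include_embeddings: bool = True) -> int: ...
    def flops_per_token(self, seq_len: int) -> float: ...

class ScalingRecipe(Protocol):
    # Training-side concerns stay here, since they depend on more than the model alone.
    def candidate_configs(self, budget_flops: float) -> list[ModelProperties]: ...
    def estimate_memory_bytes(self, model: ModelProperties, batch_size: int) -> int: ...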
This refactor moves vocab_size entirely inside the recipe, simplifying the API by removing it from function signatures throughout the scaling laws library.

Key changes:
- Add vocab_size as a required property on the ScalingRecipe protocol
- Add a vocab_size field to Marin2025Recipe (defaults to 128256)
- Remove the vocab_size parameter from all ScalingRecipe methods:
  - build_model_config()
  - estimate_memory_bytes()
  - build_optimizer_config()
  - candidate_configs()
  - compute_training_schedule()
- Remove the vocab_size parameter from library functions:
  - generate_isoflop_train_args()
  - predict_optimal_config()
  - predict_optimal_configs_for_budgets()
  - pick_v5p_type()
- Remove the MARIN_TOKENIZER_VOCAB_SIZE constant (no longer needed)
- Remove vocab_size lookup and threading from scaling_ladder.py
- Remove the tokenizer field from ScalingLadderRungConfig (the recipe owns this)
- Update tests to use the new API

Benefits:
- Simpler API: no vocab_size threading through 10+ function calls
- Single source of truth: the recipe owns its vocab_size
- Less error-prone: impossible to accidentally mix vocab_sizes
- Better encapsulation: the recipe is self-contained for model config

The low-level helpers (compute_training_flops, solve_for_batch_size, solve_for_train_steps) still take vocab_size, since the recipe calls them with self.vocab_size.
Instead of hardcoding vocab_size, derive it from the tokenizer field using the get_vocab_size_for_tokenizer utility. This maintains the connection between tokenizer and vocab_size and makes it clear where the value comes from. The utility function is already cached with @lru_cache, so repeated calls are efficient.
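A minimal sketch of what this looks like (the names follow the commit messages above, but the function body and dataclass layout are assumptions, not the actual implementation):

from dataclasses import dataclass
from functools import lru_cache

@lru_cache
def get_vocab_size_for_tokenizer(tokenizer: str) -> int:
    # Assumed body: load the tokenizer once and report its vocab size; the cache keeps
    # repeated recipe property accesses cheap, as noted above.
    from transformers import AutoTokenizer
    return len(AutoTokenizer.from_pretrained(tokenizer))

@dataclass(frozen=True)
class Marin2025Recipe:
    tokenizer: str  # a Llama-3-family tokenizer would give vocab_size == 128256

    @property
    def vocab_size(self) -> int:
        # Derived from the tokenizer rather than hardcoded, so the recipe stays the
        # single source of truth for its vocabulary.
        return get_vocab_size_for_tokenizer(self.tokenizer)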
Closes #2166