
Conversation

Member

@Helw150 Helw150 commented Dec 16, 2025

Closes #2166

@Helw150 Helw150 force-pushed the will/scaling_plots branch 3 times, most recently from 02186f9 to dda4739 on December 24, 2025 15:50
@Helw150 Helw150 marked this pull request as ready for review December 26, 2025 19:34
@Helw150 Helw150 requested review from dlwh and eric-czech December 26, 2025 19:35
Collaborator

@rjpower rjpower left a comment

Thanks, I like the idea!

I had a lot of feedback about the structure, which I acknowledge is sort of inherited, so feel free to push back on how you want to balance telling me I'm wrong, fixing things now, or filing issues to handle later.

In general I'd encourage pruning the tests a bit to focus on integration and end-to-end tests that validate a specific output we want to be reproducible (vs. constants or "variable is set" tests), and to explicitly separate out the wandb handling, the executor steps, and the math itself.

@Helw150 Helw150 force-pushed the will/scaling_plots branch from 2e6059e to 9730107 on January 5, 2026 23:57
Contributor

@eric-czech eric-czech left a comment

Thanks @Helw150! This is extremely helpful for understanding how to connect all of these things together.

My only somewhat blocking concern is the assumption that FLOPs reported by W&B will be on the same scale as the FLOPs budgets used in configs. Assuming equal scaling exponents might be another potential issue -- let me know if I'm misunderstanding something there.

logger.warning(f"No valid candidates found for budget {target_flops:.2e}")
return None

best = min(candidates, key=lambda c: abs(c.tokens - optimal_tokens))
Contributor

This seems dangerous to me in a scenario where the config with the compute-optimal parameter count, rather than token count, did not actually end up being included in the hidden_size loop. I suspect that's more of a problem for experiments like mine where the scaling exponents don't come out equal (and, in the limit, models become huge relative to token counts as compute grows), but either way I think it would be ideal if this code at least calculated compute-optimal param counts and used them to check that the config chosen from optimal token counts is in the ballpark. Or maybe it could just fail if the optimal config ends up at either boundary, as a quicker way to get some guardrails?
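
A sketch of such a boundary guardrail (illustrative only; the variable names mirror the snippet above, but the check itself is hypothetical, not code from this PR):

best = min(candidates, key=lambda c: abs(c.tokens - optimal_tokens))
token_grid = sorted(c.tokens for c in candidates)
if best.tokens in (token_grid[0], token_grid[-1]):
    # The fitted optimum fell at (or beyond) the edge of the swept configs, so
    # "best" is really just the nearest boundary point rather than a true minimum.
    raise ValueError(
        f"Optimal candidate for budget {target_flops:.2e} lies at the sweep boundary; "
        "widen the candidate grid before trusting this fit."
    )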

I suppose $\alpha = \beta$ (equal exponents) is actually a pretty fundamental assumption of all this code, yeah? Otherwise, I'm having a hard time seeing why this Chinchilla calculation wouldn't apply, where the optimal token count as a function of compute actually scales with the parameter exponent and vice versa:

[screenshot: Chinchilla compute-optimal calculation]

cf.

[screenshot: related excerpt, for comparison]
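
For reference, the standard Chinchilla result (stated here directly rather than taken from the screenshots above): minimizing $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ subject to $C = 6ND$ gives

$$N_{opt}(C) = G\left(\tfrac{C}{6}\right)^{a}, \qquad D_{opt}(C) = G^{-1}\left(\tfrac{C}{6}\right)^{b}, \qquad a = \frac{\beta}{\alpha+\beta}, \quad b = \frac{\alpha}{\alpha+\beta},$$

with $G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)}$, so the token exponent $b$ collapses to $\tfrac{1}{2}$ only when $\alpha = \beta$.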

Put somewhat differently, and to check my understanding, that would mean that this function would need to define optimal_tokens using the param scaling exponent because the $C=6ND$ constraint only makes a simultaneous solution necessary when $\alpha \ne \beta$.

Member Author

I'm doing the following:

  • Find D_opt at each compute budget by fitting parabolas to actual loss values
  • Fit D_opt ~ C^b from those empirical minima (from Approach 2 in Chinchilla)
  • Given D_opt and C, we can compute N_opt directly as C/(6D), so we don't need to fit alpha.

This should work regardless of whether $\beta = \alpha$.

The screenshot you have there corresponds to Approach 3, I believe (the parametric loss form), rather than the parabola-based isoFLOP fit I'm using here.
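
A minimal numpy sketch of that procedure (hypothetical names, not the PR's actual code; assumes each sweep run has been reduced to a (flops_budget, tokens, final_loss) triple):

import numpy as np

def fit_isoflop_scaling(runs):
    """runs: list of (flops_budget, tokens, final_loss) triples from the sweep."""
    budgets = sorted({c for c, _, _ in runs})
    d_opt = []
    for c in budgets:
        pts = [(np.log(t), loss) for b, t, loss in runs if b == c]
        x, y = map(np.array, zip(*pts))
        a2, a1, _ = np.polyfit(x, y, 2)        # parabola in log-token space
        d_opt.append(np.exp(-a1 / (2 * a2)))   # vertex = empirical D_opt for this budget
    # Approach 2: fit D_opt ~ A * C^b as a line in log-log space
    b, log_A = np.polyfit(np.log(budgets), np.log(d_opt), 1)
    return b, np.exp(log_A)

def predict_optimal(c, b, A):
    d = A * c**b        # optimal token count at budget c
    n = c / (6 * d)     # params follow from C = 6ND; no separate alpha fit needed
    return n, d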

Contributor

Ah, so the alpha being used to define optimal_tokens is actually the same $b$ as in Approach 3, which should be equal to one half if $\alpha=\beta$ and $\frac{\alpha}{\alpha + \beta}$ otherwise.

On a related note, what do you typically call $\alpha$ and $\beta$ if $a$ and $b$ are the scaling exponents? Loss exponents?

@Helw150 Helw150 force-pushed the will/scaling_plots branch from 6910b7a to b5916c4 on January 7, 2026 05:17
@Helw150 Helw150 force-pushed the will/scaling_plots branch from 042b3a9 to 4c54f12 on January 8, 2026 04:48
@Helw150 Helw150 requested a review from eric-czech January 8, 2026 04:48
@Helw150 Helw150 force-pushed the will/scaling_plots branch from a05b597 to 15bf75e on January 9, 2026 00:46
Collaborator

@rjpower rjpower left a comment

So this looks much better, thanks!

A few high-level recommendations:

I'd move the "recipe" protocol to isoflop_sweep itself and replace the use of the LlamaConfig with opaque types or protocols as needed. e.g. isoflops might take:

from typing import Protocol

class ModelConfiguration(Protocol):
    def num_params(self) -> int: ...
    # ... whatever ...

class ScalingCallbacks(Protocol):
    def build_model_config(self) -> ModelConfiguration: ...  # plus whatever args the sweep needs
    # ...

(or put the num params callback here as well if you want)
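
For illustration only (hypothetical names, not part of this PR), an experiment could then satisfy such protocols with plain dataclasses, since Protocols are matched structurally:

from dataclasses import dataclass

@dataclass
class ToyModelConfig:
    hidden_size: int
    num_layers: int
    vocab_size: int

    def num_params(self) -> int:
        # rough dense-transformer estimate: 12 * L * d^2 plus the embedding table
        return 12 * self.num_layers * self.hidden_size**2 + self.vocab_size * self.hidden_size

@dataclass
class ToyCallbacks:
    vocab_size: int

    def build_model_config(self, hidden_size: int, num_layers: int) -> ToyModelConfig:
        return ToyModelConfig(hidden_size, num_layers, self.vocab_size)

so isoflop_sweep itself never needs to import LlamaConfig.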

  • Consider grugifying even more, and favor copy-pasting the code from isoflop_sweep into the scaling_ladder experiment. Feel free to file an issue if that's a bridge too far, but that's the direction I'd recommend we lean toward: experiments are copy-paste, library code is agnostic to the model. (If copy-paste is too painful to contemplate, then we should consider "how do I refactor this to be more model-agnostic so I feel okay importing it", etc.)

  • There's a bit of inherited magic around the schema of the evaluation code. We should try to put the loading logic for the eval results alongside the code that generated them, or maybe there's a way to share the schema that was used to generate them. (Though it's good you're trying to separate "load this specific eval format" from "analyze an eval table".)

Overall it's easier for me to follow and less enmeshed with the specifics of Marin/Levanter configuration, which is nice!

@Helw150 Helw150 force-pushed the will/scaling_plots branch from 33f0270 to d0d9964 on January 12, 2026 19:29
Collaborator

@rjpower rjpower left a comment

Thanks, this is much better.

I'd recommend merging some of the data types and recipes but otherwise I think it's looking good!

while still working with LlamaConfig, QwenConfig, etc.
"""

def flops_per_token(self, vocab_size: int, seq_len: int) -> float:
Collaborator

it might be a bit simpler to flatten everything into the "Recipe" protocol, e.g.:

def flops_per_token(self, candidate: CandidateConfig) -> float: ...
def total_trainable_params(self, candidate: CandidateConfig) -> float: ...

then the user only needs to define the one protocol

Member Author

@Helw150 Helw150 Jan 13, 2026

This one I'll push back on a bit. This protocol defines things that any model config class should have independent of scaling laws (right now in practice our base class is LlamaConfig, which already had these defined).

I don't think things that are features of the model itself in the abstract, that already existed on our base class before the scaling work, and that are used independently of scaling should live on the recipe. It'll just force the user to implement needless boilerplate wrapper methods.

Collaborator

Sure, it's definitely not a big deal. By your argument, estimate_memory_bytes should be here too, since it seems naturally a function of the model configuration, not the scaling recipe?

Member Author

No, because memory usage in bytes depends on both the model configuration and the training specifics (e.g. batch size, choice of optimizer, etc.). The same model will use different amounts of memory depending on those things.
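
To make that dependence concrete, a hypothetical estimator might look like this (illustrative signature, not the PR's actual estimate_memory_bytes; assumes the config exposes num_params, hidden_size, and num_layers):

def estimate_memory_bytes(model_config, batch_size: int, seq_len: int,
                          optimizer_states_per_param: int = 2) -> int:
    params = model_config.num_params()
    # fp32 weights + grads + optimizer state (Adam keeps two extra copies per param)
    param_bytes = params * 4 * (2 + optimizer_states_per_param)
    # activations scale with batch size and sequence length, not just the model
    act_bytes = 2 * batch_size * seq_len * model_config.hidden_size * model_config.num_layers * 4
    return param_bytes + act_bytes

The same model_config yields different totals as batch_size or the optimizer changes, which is the argument for keeping it on the recipe.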

Contributor

> This protocol defines things that any model config class should have independent of scaling laws

+1 to separating model properties from scaling-law training configuration like this, and to being able to contextualize "model properties" to specific scaling experiments (i.e. these shouldn't be functions on LlamaConfig). This feels like the right place to ablate the more controversial aspects of "effective" param counting, like embedding inclusion, gated units, weight tying, or maybe sparse-vs-active and quantization scaling params down the road. I can imagine a lot of boilerplate around creating ScalingRecipes otherwise.
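
A sketch of what parameterizing that could look like (purely hypothetical names, just to illustrate the kind of knobs meant here):

from dataclasses import dataclass

@dataclass
class ParamCountingPolicy:
    include_embeddings: bool = False

    def effective_params(self, model_config) -> int:
        # model_config is assumed to expose total and embedding parameter counts;
        # weight tying, gated units, and sparse-vs-active counting could become
        # further knobs here rather than on every individual model config.
        total = model_config.num_params()
        if not self.include_embeddings:
            total -= model_config.embedding_params()
        return total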

@Helw150 Helw150 force-pushed the will/scaling_plots branch 3 times, most recently from f9384e3 to 3f0e51e on January 14, 2026 04:53
Helw150 and others added 25 commits January 17, 2026 13:20
This refactor moves vocab_size entirely inside the recipe, simplifying
the API by removing it from function signatures throughout the scaling
laws library.

Key changes:
- Add vocab_size as a required property on ScalingRecipe protocol
- Add vocab_size field to Marin2025Recipe (defaults to 128256)
- Remove vocab_size parameter from all ScalingRecipe methods:
  - build_model_config()
  - estimate_memory_bytes()
  - build_optimizer_config()
  - candidate_configs()
  - compute_training_schedule()
- Remove vocab_size parameter from library functions:
  - generate_isoflop_train_args()
  - predict_optimal_config()
  - predict_optimal_configs_for_budgets()
  - pick_v5p_type()
- Remove MARIN_TOKENIZER_VOCAB_SIZE constant (no longer needed)
- Remove vocab_size lookup and threading from scaling_ladder.py
- Remove tokenizer field from ScalingLadderRungConfig (recipe owns this)
- Update tests to use new API

Benefits:
- Simpler API: no vocab_size threading through 10+ function calls
- Single source of truth: recipe owns its vocab_size
- Less error-prone: impossible to accidentally mix vocab_sizes
- Better encapsulation: recipe is self-contained for model config

The low-level helper functions (compute_training_flops, solve_for_batch_size,
solve_for_train_steps) still take vocab_size as they're called by the recipe
with self.vocab_size.

Instead of hardcoding vocab_size, derive it from the tokenizer field
using the get_vocab_size_for_tokenizer utility. This maintains the
connection between tokenizer and vocab_size and makes it clear where
the value comes from.

The utility function is already cached with @lru_cache, so repeated
calls are efficient.
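
For reference, the shape being described (a sketch assuming a Hugging Face tokenizer; the real utility in the codebase may differ):

from functools import lru_cache
from transformers import AutoTokenizer

@lru_cache(maxsize=None)
def get_vocab_size_for_tokenizer(tokenizer_name: str) -> int:
    # len() includes added special tokens; caching makes repeated recipe calls cheap
    return len(AutoTokenizer.from_pretrained(tokenizer_name))

# e.g. the Llama 3 tokenizer family gives 128256, matching the recipe default above
vocab_size = get_vocab_size_for_tokenizer("meta-llama/Llama-3.1-8B")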
@Helw150 Helw150 merged commit dab4934 into main Jan 19, 2026
15 checks passed
@Helw150 Helw150 deleted the will/scaling_plots branch January 19, 2026 22:11

Development

Successfully merging this pull request may close these issues.

Utility to Convert a Completed ISOFlop Run into a Compute Optimal Config for a Target Budget