Scaling Plots & Analysis as an Executor Step #2243
Conversation
rjpower left a comment
Thanks, I like the idea!
I had a lot of feedback about the structure, which I acknowledge is sort of inherited, so feel free to push back on how much you want to balance telling me I'm wrong, fixing things now, or filing issues to handle later.
In general I'd encourage pruning the tests a bit to focus on integration and end-to-end tests that validate a specific output we want to be reproducible (vs. constants or "variable is set" tests), and to explicitly separate the handling of wandb vs. executor steps vs. the actual math.
eric-czech left a comment
Thanks @Helw150! This is extremely helpful for understanding how to connect all of these things together.
My only somewhat-blocking concern is the assumption that FLOPs from W&B will be on the same scale as the FLOPs budgets used in configs. Assuming equal scaling exponents might be another potential issue; let me know if I'm misunderstanding something there.
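A quick way to surface this kind of mismatch, sketched here with hypothetical names (logged_flops would come from whatever key the W&B runs actually report), is to compare the logged value against the 6·N·D approximation the budgets are defined in:

def check_flops_scale(logged_flops: float, params: int, tokens: int, tol: float = 3.0) -> None:
    # If W&B reports FLOPs on the same scale as the config budgets, the logged value
    # should sit within a small factor of the 6*N*D approximation.
    expected = 6.0 * params * tokens
    ratio = logged_flops / expected
    if not (1.0 / tol <= ratio <= tol):
        raise ValueError(
            f"Logged FLOPs ({logged_flops:.2e}) differ from 6*N*D ({expected:.2e}) "
            f"by a factor of {ratio:.2f}; check that both use the same scale."
        )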
logger.warning(f"No valid candidates found for budget {target_flops:.2e}")
return None

best = min(candidates, key=lambda c: abs(c.tokens - optimal_tokens))
This seems dangerous to me in a scenario where the config with compute-optimal parameter counts, rather than token counts, didn't actually end up being included in the hidden_size loop. I suspect that's more of a problem for experiments like mine where the scaling exponents don't come out equal (and at the limit, models become huge compared to token counts with more compute), but either way I think it would be ideal if this code at least calculated compute-optimal param counts and used them to check that the config chosen from optimal token counts is in the ballpark. Or maybe it could just fail if the optimal config ends up at either boundary, as a quicker route to some guardrails?
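A minimal sketch of the boundary guardrail suggested above (function and field names are illustrative, matching the candidates/optimal_tokens identifiers in the snippet rather than the actual code):

def pick_optimal_candidate(candidates, optimal_tokens):
    # Candidates are ordered by token count (equivalently, inversely by model size at
    # fixed compute); if the chosen config sits at either end of the sweep, the true
    # optimum may lie outside the hidden_size grid, so fail loudly instead of returning it.
    ordered = sorted(candidates, key=lambda c: c.tokens)
    best = min(ordered, key=lambda c: abs(c.tokens - optimal_tokens))
    if best is ordered[0] or best is ordered[-1]:
        raise ValueError(
            f"Optimal candidate ({best.tokens:.3e} tokens) is at the boundary of the "
            "swept configs; the compute-optimal point may not be covered by the sweep."
        )
    return best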
Put somewhat differently, and to check my understanding, that would mean that this function would need to define optimal_tokens using the param scaling exponent because the
I'm doing the following:
- Find D_opt at each compute budget by fitting parabolas to actual loss values
- Fit D_opt ~ C^b from those empirical minima (from Approach 2 in Chinchilla)
- Given D_opt and C, we can compute N_opt directly as C/(6D), so we don't need to fit alpha.
This should work regardless of whether beta == alpha.
The screenshot you have there corresponds to Approach 3, I believe (the parametric loss form), rather than the parabola-based isoFLOP approach I'm using here.
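A minimal numpy sketch of that procedure, assuming runs_by_budget maps each compute budget C to a list of (tokens, loss) pairs from the isoFLOP runs (names are illustrative, not the actual library API):

import numpy as np

def fit_d_opt_per_budget(budgets, runs_by_budget):
    # Chinchilla "Approach 2": at each budget, fit loss as a parabola in log(tokens)
    # and take the vertex as the empirical optimum D_opt(C).
    d_opts = []
    for c in budgets:
        tokens, losses = zip(*runs_by_budget[c])
        a, b, _ = np.polyfit(np.log(tokens), losses, 2)   # loss ~ a*x^2 + b*x + const
        d_opts.append(np.exp(-b / (2 * a)))               # vertex of the parabola
    return np.asarray(d_opts)

def fit_token_scaling(budgets, d_opts):
    # Fit D_opt ~ k * C^b in log-log space.
    b, log_k = np.polyfit(np.log(budgets), np.log(d_opts), 1)
    return b, np.exp(log_k)

def predict_optimal(c, b, k):
    # Given a new budget C, predict D_opt and derive N_opt = C / (6 * D_opt),
    # so no separate fit for the parameter exponent (alpha) is needed.
    d_opt = k * c**b
    return d_opt, c / (6.0 * d_opt)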
Ah, so the alpha being used to define optimal_tokens is actually the same
On a related note, what do you typically call
rjpower left a comment
So this looks much better, thanks!
A few high-level recommendations:
I'd move the "recipe" protocol to isoflop_sweep itself and replace the use of LlamaConfig with opaque types or protocols as needed, e.g. isoflops might take:
from typing import Protocol

class ModelConfiguration(Protocol):
    def num_params(self) -> int: ...
    # ... whatever ...

class ScalingCallbacks(Protocol):
    def build_model_config(...) -> ModelConfiguration: ...
    # ...
(or put the num params callback here as well if you want)
- Consider grugifying even more, and favor copy-pasting the code from isoflop_sweep into the scaling_ladder experiment. Feel free to file an issue if that's a bridge too far, but that's the direction I'd encourage us to lean towards: experiments are copy-paste, library code is agnostic to the model. (If copy-paste is too painful to contemplate, then we consider "how do I refactor this to be more model-agnostic so I feel okay importing it", etc.)
- There's a bit of inherited magic around the schema of the evaluation code. We should try to put the loading logic for the eval results wherever the code that generated them lives, or maybe there's a way to share the schema that was used to generate them. (Though it's good you're trying to separate "load this specific eval format" from "analyze an eval table".)
Overall it's easier for me to follow and less enmeshed with the specifics of Marin/Levanter configuration, which is nice!
rjpower left a comment
Thanks, this is much better.
I'd recommend merging some of the data types and recipes but otherwise I think it's looking good!
while still working with LlamaConfig, QwenConfig, etc.
"""

def flops_per_token(self, vocab_size: int, seq_len: int) -> float:
it might be a bit simpler to flatten everything to be part of the "Recipe" protocol, e.g.:
def flops_per_token(candidate: CandidateConfig) -> float
def total_trainable_params(candidate: CandidateConfig) -> float
then the user only needs to define the one protocol
This one I'll push back on a bit. This protocol defines things that any model config class should have independent of scaling laws (right now in practice our base class is LlamaConfig, which already had these defined).
I don't think things that are features of the model itself in the abstract, that already existed on our base class before scaling, and that are used independently of scaling should live on the recipe. It'll just force the user to implement needless boilerplate wrapper methods.
Sure, it's definitely not a big deal. By your argument, estimate_memory_bytes should be here too, since it seems naturally a function of the model configuration, not the scaling recipe?
No, because the memory usage in bytes is dependent on both the model configuration and the training specifics (e.g. batch size, which optimizer, etc.). The same model will use different amounts of memory depending on those things.
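For illustration only (a rough back-of-envelope with made-up constants, not the repo's estimate_memory_bytes): the same parameter count yields very different footprints once batch size and optimizer state are included.

def rough_memory_bytes(params: int, batch_size: int, seq_len: int, hidden_size: int,
                       optimizer_moments: int = 2) -> int:
    # bf16 weights and grads, fp32 optimizer moments (e.g. Adam keeps two), plus a crude
    # activation term that grows with batch size; the constants are illustrative.
    weights_and_grads = 2 * 2 * params
    optimizer_state = 4 * optimizer_moments * params
    activations = 2 * batch_size * seq_len * hidden_size * 12
    return weights_and_grads + optimizer_state + activations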
This protocol defines things that any model config class should have independent of scaling laws
+1 to separating model properties from scaling-law training configuration like this, and to being able to contextualize "model properties" for specific scaling experiments (i.e. these shouldn't be functions on LlamaConfig). This feels like the right place to ablate the more controversial aspects of "effective" param counting, like embedding inclusion, gated units, and weight tying, or maybe sparse vs. active params and quantization scaling down the road. I can imagine a lot of boilerplate around creating ScalingRecipes otherwise.
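A hedged sketch of that separation (all names here are illustrative, not the PR's actual protocols): model properties live on their own protocol, with knobs like embedding inclusion ablated there, while training-dependent concerns stay on the recipe.

from typing import Protocol

class ModelProperties(Protocol):
    def total_trainable_params(self, include_embeddings: bool = True) -> int: ...
    def flops_per_token(self, seq_len: int) -> float: ...

class ScalingRecipe(Protocol):
    # Training-side concerns stay here, since they depend on more than the model alone.
    def candidate_configs(self, budget_flops: float) -> list[ModelProperties]: ...
    def estimate_memory_bytes(self, model: ModelProperties, batch_size: int) -> int: ...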
This refactor moves vocab_size entirely inside the recipe, simplifying the API by removing it from function signatures throughout the scaling laws library.

Key changes:
- Add vocab_size as a required property on the ScalingRecipe protocol
- Add a vocab_size field to Marin2025Recipe (defaults to 128256)
- Remove the vocab_size parameter from all ScalingRecipe methods:
  - build_model_config()
  - estimate_memory_bytes()
  - build_optimizer_config()
  - candidate_configs()
  - compute_training_schedule()
- Remove the vocab_size parameter from library functions:
  - generate_isoflop_train_args()
  - predict_optimal_config()
  - predict_optimal_configs_for_budgets()
  - pick_v5p_type()
- Remove the MARIN_TOKENIZER_VOCAB_SIZE constant (no longer needed)
- Remove vocab_size lookup and threading from scaling_ladder.py
- Remove the tokenizer field from ScalingLadderRungConfig (the recipe owns this)
- Update tests to use the new API

Benefits:
- Simpler API: no vocab_size threading through 10+ function calls
- Single source of truth: the recipe owns its vocab_size
- Less error-prone: impossible to accidentally mix vocab_sizes
- Better encapsulation: the recipe is self-contained for model config

The low-level helpers (compute_training_flops, solve_for_batch_size, solve_for_train_steps) still take vocab_size, since the recipe calls them with self.vocab_size.
Instead of hardcoding vocab_size, derive it from the tokenizer field using the get_vocab_size_for_tokenizer utility. This maintains the connection between tokenizer and vocab_size and makes it clear where the value comes from. The utility function is already cached with @lru_cache, so repeated calls are efficient.
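A minimal sketch of what this looks like (the names follow the commit messages above, but the function body and dataclass layout are assumptions, not the actual implementation):

from dataclasses import dataclass
from functools import lru_cache

@lru_cache
def get_vocab_size_for_tokenizer(tokenizer: str) -> int:
    # Assumed body: load the tokenizer once and report its vocab size; the cache keeps
    # repeated recipe property accesses cheap, as noted above.
    from transformers import AutoTokenizer
    return len(AutoTokenizer.from_pretrained(tokenizer))

@dataclass(frozen=True)
class Marin2025Recipe:
    tokenizer: str  # a Llama-3-family tokenizer would give vocab_size == 128256

    @property
    def vocab_size(self) -> int:
        # Derived from the tokenizer rather than hardcoded, so the recipe stays the
        # single source of truth for its vocabulary.
        return get_vocab_size_for_tokenizer(self.tokenizer)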
Closes #2166