Models will always be initialized without dropout layers in self-tuning ruleset #753

Open
georgedahl opened this issue Apr 4, 2024 · 2 comments
Labels
🛑 AlgoPerf Leaderboard Blocking rolling AlgoPerf Leaderboard 👷 In Progress Issue is being worked on

Comments

@georgedahl
Contributor

In submission_runner.py, when running under the self-tuning ruleset, the hyperparameters argument to train_once will always be None.

Then, in this code snippet:

    dropout_rate = None
    aux_dropout_rate = None
    if hasattr(hyperparameters, 'dropout_rate'):
      dropout_rate = hyperparameters.dropout_rate
    if hasattr(hyperparameters, 'aux_dropout_rate'):
      aux_dropout_rate = hyperparameters.aux_dropout_rate
    model_params, model_state = workload.init_model_fn(
        model_init_rng, dropout_rate, aux_dropout_rate)

workload.init_model_fn will always get None for dropout_rate and aux_dropout_rate, so Dropout layers won't ever be added to the model.
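
To make the failure mode concrete, here is a minimal standalone sketch of the lookup from the snippet above. The helper name `resolve_dropout_rates` and the example values are hypothetical; only the `hasattr` logic is taken from the snippet. Since `hasattr(None, ...)` is `False` for both attributes, both rates stay `None`.

```python
from types import SimpleNamespace


def resolve_dropout_rates(hyperparameters):
  """Hypothetical helper mirroring the hasattr lookup in the snippet above."""
  dropout_rate = None
  aux_dropout_rate = None
  if hasattr(hyperparameters, 'dropout_rate'):
    dropout_rate = hyperparameters.dropout_rate
  if hasattr(hyperparameters, 'aux_dropout_rate'):
    aux_dropout_rate = hyperparameters.aux_dropout_rate
  return dropout_rate, aux_dropout_rate


# External tuning: hyperparameters carry values (example values are made up).
external = SimpleNamespace(dropout_rate=0.1, aux_dropout_rate=0.2)
external_rates = resolve_dropout_rates(external)

# Self-tuning: hyperparameters is None, so neither attribute is found.
self_tuning_rates = resolve_dropout_rates(None)
```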

Although submissions could call workload.init_model_fn again themselves to make use of its side effect of setting workload._model, this is awkward and also challenging for workloads near the memory limit since it involves superfluously reconstructing model_params again on device.

@priyakasimbeg
Contributor

Our current API has two dropout-related limitations:

  1. In the external tuning ruleset, we read the dropout value from the hparam config and pass it to the model initialization functions. In the self-tuning ruleset, there is no convenient way to specify the dropout value at model initialization.
  2. There is no way to change the dropout value during training.

Having a workload function that submitters can call to change the dropout value would remove both of these limitations.

@priyakasimbeg priyakasimbeg added 👷 In Progress Issue is being worked on 🛑 AlgoPerf Leaderboard Blocking rolling AlgoPerf Leaderboard labels Jan 7, 2025
@Niccolo-Ajroldi
Contributor

Niccolo-Ajroldi commented Jan 21, 2025

Some considerations about changing the dropout implementation.

Current situation

The dropout probability value is provided as a hyperparameter in the JSON search space. It is then used in submission_runner.py as follows:

    model_params, model_state = workload.init_model_fn(
        model_init_rng, dropout_rate, aux_dropout_rate)

After initializing the model, we torch.compile it and initialize the optimizer.

Current limitations

  1. Self-tuning submissions cannot specify a dropout probability value.
  2. It is not possible to change the dropout value during training.

How can we address these problems?

I can see several possibilities; some require major changes, others are less disruptive.

(A) extend the submission module API to provide initial dropout value ⭐

A submission would provide a function, model_init_hyperparams, that returns the hyperparameters used at initialization, such as dropout. This would be analogous to get_batch_size, but for dropout. It would address (1) but not (2).
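
A rough sketch of how option (A) could look, to make the proposal easier to discuss. Everything here is an assumption: the hook's signature (taking a workload name, by analogy with get_batch_size), its dict return format, the runner-side helper `resolve_init_dropout`, and the workload name and rate values are all hypothetical.

```python
from types import SimpleNamespace


def model_init_hyperparams(workload_name):
  """Hypothetical submission hook returning init-time hyperparameters."""
  defaults = {'dropout_rate': 0.1, 'aux_dropout_rate': 0.1}
  # Per-workload overrides; 'example_workload' is a made-up name.
  overrides = {'example_workload': {'dropout_rate': 0.0}}
  return {**defaults, **overrides.get(workload_name, {})}


def resolve_init_dropout(submission_module, hyperparameters, workload_name):
  """Hypothetical runner-side lookup: prefer the hook, else hyperparameters."""
  hook = getattr(submission_module, 'model_init_hyperparams', None)
  if hook is not None:
    init_hparams = hook(workload_name)
    return (init_hparams.get('dropout_rate'),
            init_hparams.get('aux_dropout_rate'))
  # Fall back to the existing behavior when the submission has no hook.
  return (getattr(hyperparameters, 'dropout_rate', None),
          getattr(hyperparameters, 'aux_dropout_rate', None))


# Self-tuning submission that defines the hook: hyperparameters is None.
submission = SimpleNamespace(model_init_hyperparams=model_init_hyperparams)
rates = resolve_init_dropout(submission, None, 'example_workload')

# Submission without the hook keeps today's behavior.
legacy_rates = resolve_init_dropout(SimpleNamespace(), None, 'example_workload')
```

Falling back to the hyperparameters object would keep external tuning submissions working unchanged, which is one reason this option looks less disruptive.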

(B) re-init and re-compile the model

We could add a change_dropout method to each workload for the submission to call. When triggered, it would re-initialize the model with the new dropout probability. However, in torch we would also have to recompile the model, which currently happens in submission_runner, not in the submitter's code. It is also non-trivial in torch to keep the old parameters while initializing a new model without hitting an OOM error, since both copies must temporarily be held in memory.

(C) pass dropout to the model fwd call

Not trivial: this would require modifying all model implementations.
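
For illustration only, a framework-free toy of what option (C) means: the dropout rate becomes a per-call argument instead of something fixed at construction. `ToyModel` and `dropout` are hypothetical stand-ins written in plain Python, not the workload models or any library's dropout layer.

```python
import random


def dropout(values, rate, rng):
  """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
  if rate is None or rate == 0.0:
    return list(values)
  keep_prob = 1.0 - rate
  return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


class ToyModel:
  """Stand-in model; note that no dropout rate is stored at construction."""

  def __init__(self, scale):
    self.scale = scale

  def __call__(self, inputs, dropout_rate=0.0, train=False, rng=None):
    hidden = [self.scale * x for x in inputs]
    if train:
      # The rate is supplied by the caller on every forward call.
      hidden = dropout(hidden, dropout_rate, rng)
    return hidden


model = ToyModel(scale=2.0)
rng = random.Random(0)

eval_out = model([1.0, 2.0, 3.0])  # Evaluation: dropout is skipped.
early = model([1.0] * 1000, dropout_rate=0.5, train=True, rng=rng)
late = model([1.0] * 1000, dropout_rate=0.0, train=True, rng=rng)
```

With this shape, a submission could change the rate between steps without re-initializing the model, which is what would address (2); the cost is touching every model's forward signature.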

Conclusion

My suggested option is (A), but I am happy to discuss!
