
BUG: Running lightning with default strategy 'DDP' breaks learncurve function #742

Closed
NickleDave opened this issue Apr 3, 2024 · 2 comments
Labels
BUG Something isn't working

Comments

@NickleDave
Collaborator

NickleDave commented Apr 3, 2024

Because the 'DDP' strategy spawns multiple processes:

https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html#distributed-data-parallel

This Lightning implementation of DDP calls your script under the hood multiple times with the correct environment variables:

This ends up causing vak to create multiple results directories (one created by each process), and then vak looks in the wrong results directory when it tries to find checkpoints:

ValueError: did not find a single checkpoint path, instead found:
[]

A workaround for now is to set an environment variable to force vak / lightning to run on a single GPU:

$ export CUDA_VISIBLE_DEVICES=0
$ vak learncurve config.toml

An admittedly clumsy fix for this might be to make learncurve one giant function instead of calling train then eval.
I'm not sure I can engineer something smarter (i.e., an alternative strategy) that would make the CLI work relatively painlessly.

@NickleDave NickleDave added the BUG Something isn't working label Apr 3, 2024
@NickleDave
Collaborator Author

Another quick fix might be to default to single-device training for now, since this is fine for most of our models.

If someone needs all the GPUs, we should document: "this is a case where you'll need to move from the CLI to using vak in a script."

I can't actually figure out if it's easy to just tell lightning "use a single GPU", e.g., whether there's a string I can pass in to "strategy":

https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.SingleDeviceStrategy.html
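As a sketch of what the quick fix could look like, assuming we build keyword arguments for `lightning.pytorch.Trainer` ourselves (`single_device_trainer_kwargs` is a hypothetical helper, not existing vak API):

```python
def single_device_trainer_kwargs(accelerator: str = "gpu", device_index: int = 0) -> dict:
    """Build Trainer kwargs that force single-device training, so lightning
    never falls back to the default 'ddp' strategy that spawns processes."""
    if accelerator == "cpu":
        # on CPU, `devices` is a count of processes and must be a positive int
        return {"accelerator": "cpu", "devices": 1}
    # a one-element list pins training to exactly one GPU;
    # lightning then uses a single-device strategy internally
    return {"accelerator": accelerator, "devices": [device_index]}


# e.g.: trainer = lightning.pytorch.Trainer(**single_device_trainer_kwargs())
```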

NickleDave added a commit that referenced this issue May 5, 2024
* WIP: Add config/trainer.py with TrainerConfig

* Rename common.device -> common.accelerator, return 'gpu' not 'cuda' if torch.cuda.is_available

* Fix config section in doc/api/index.rst

* Import trainer and TrainerConfig in src/vak/config/__init__.py, add to __all__

* Add pytorch-lightning to intersphinx in doc/conf.py

* Fix cross-ref in docstring in src/vak/prep/frame_classification/make_splits.py: :constant: -> :const:

* Make lightning a dependency, instead of pytorch_lightning; import lightning.pytorch everywhere instead of pytorch_lightning as lightning -- trying to make it so we can resolve API correctly in docstrings

* Fix in doc/api/index.rst: common.device -> common.accelerator

* Finish writing TrainerConfig class

* Add tests for TrainerConfig class

* Add trainer sub-table to all configs in tests/data_for_tests/configs

* Add trainer sub-table to all configs in doc/toml

* Add trainer sub-table in config/valid-version-1.0.toml, rename -> valid-version-1.1.toml

* Remove device key from top-level tables in config/valid-version-1.1.toml

* Remove device key from top-level tables in tests/data_for_tests/configs

* Remove 'device' key from configs in doc/toml

* Add 'trainer' attribute to EvalConfig, an instance of TrainerConfig; remove 'device' attribute

* Add 'trainer' attribute to PredictConfig, an instance of TrainerConfig; remove 'device' attribute

* Add 'trainer' attribute to TrainConfig, an instance of TrainerConfig; remove 'device' attribute

* Fix typo in docstring in src/vak/config/train.py

* Add 'trainer' attribute to LearncurveConfig, an instance of TrainerConfig; remove 'device' attribute. Also clean up docstring, removing attributes that no longer exist

* Remove device attribute from TrainConfig docstring

* Fix VALID_TOML_PATH in config/validators.py -> 'valid-version-1.1.toml'

* Fix how we instantiate TrainerConfig classes in from_config_dict method of EvalConfig/LearncurveConfig/PredictConfig/TrainConfig

* Fix typo in src/vak/config/valid-version-1.1.toml: predictor -> predict

* Fix unit tests after adding trainer attribute that is instance of TrainerConfig

* Change src/vak/train/frame_classification.py to take trainer_config argument

* Change src/vak/train/parametric_umap.py to take trainer_config argument

* Change src/vak/train/train_.py to take trainer_config argument

* Fix src/vak/cli/train.py to pass trainer_config.asdict() into vak.train.train_.train

* Replace 'device' with 'trainer_config' in vak/eval

* Fix cli.eval to pass trainer_config into eval.eval_.eval

* Replace 'device' with 'trainer_config' in vak/predict

* Fix cli.predict to pass trainer_config into predict.predict_.predict

* Replace 'device' with 'trainer_config' in vak/learncurve

* Fix cli.learncurve to pass trainer_config into learncurve.learncurve.learning_curve

* Rename/replace 'device' fixture with 'trainer' fixture in tests/

* Use config.table.trainer attribute throughout tests, remove config.table.device attribute that no longer exists

* Fix value for devices in fixtures/trainer.py: when accelerator is 'cpu', devices must be > 0

* Fix default devices value for when accelerator is cpu in TrainerConfig

* Fix unit tests for TrainerConfig after fixing default devices for accelerator=cpu

* Fix default value for 'devices' set to -1 in some unit tests where we over-ride config in toml file

* fixup use config.table.trainer attribute throughout tests -- missed one place in tests/test_eval/

* Add back 'device' fixture so we can use it to test Model class

* Fix unit tests in test_models/test_base.py that literally used device to put tensors on device, not to change a config

* Fix assertion in tests/test_models/test_tweetynet.py, from where we switched to using lightning as the dependency

* Fix test for DiceLoss, change trainer_type fixture back to device fixture
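Based on the commit messages above, the new trainer sub-table in a config file presumably looks something like this (a hypothetical sketch; the actual table and key names in vak's TOML schema may differ):

```toml
# hypothetical example of the trainer sub-table added to each top-level table
[vak.train.trainer]
accelerator = "gpu"  # replaces the old top-level 'device' key
devices = 1          # a single device avoids the DDP multi-process issue
```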
@NickleDave
Collaborator Author

Fixed by #752
