
BUG: Running lightning with default strategy 'DDP' breaks learncurve function #742

Closed
NickleDave opened this issue Apr 3, 2024 · 2 comments
Labels
BUG Something isn't working

Comments

@NickleDave
Collaborator

NickleDave commented Apr 3, 2024

Because the 'DDP' strategy spawns multiple processes:

https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html#distributed-data-parallel

This Lightning implementation of DDP calls your script under the hood multiple times with the correct environment variables:

This ends up causing vak to create multiple results directories (one created by each process), and then vak looks in the wrong results directory when it tries to find checkpoints:

ValueError: did not find a single checkpoint path, instead found:
[]

A workaround for now is to set an environment variable to force vak / lightning to run on a single GPU:

$ export CUDA_VISIBLE_DEVICES=0
$ vak learncurve config.toml

An admittedly clumsy fix for this might be to make learncurve one giant function instead of calling train then eval.
I'm not sure I can engineer something smarter (i.e., an alternative strategy) that would make the CLI work relatively painlessly.

@NickleDave NickleDave added the BUG Something isn't working label Apr 3, 2024
@NickleDave
Collaborator Author

Another quick fix might be to default to single-device training for now, since this is fine for most of our models.

If someone needs all the GPUs, we should document: "this is a case where you'll need to move from the CLI to using vak in a script."

I can't actually figure out if it's easy to just tell lightning "use a single GPU", e.g., whether there's a string I can pass in to "strategy":

https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.SingleDeviceStrategy.html
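As a sketch of what the quick fix could look like, assuming we build keyword arguments for `lightning.pytorch.Trainer` ourselves (`single_device_trainer_kwargs` is a hypothetical helper, not existing vak API):

```python
def single_device_trainer_kwargs(accelerator: str = "gpu", device_index: int = 0) -> dict:
    """Build Trainer kwargs that force single-device training, so lightning
    never falls back to the default 'ddp' strategy that spawns processes."""
    if accelerator == "cpu":
        # on CPU, `devices` is a count of processes and must be a positive int
        return {"accelerator": "cpu", "devices": 1}
    # a one-element list pins training to exactly one GPU;
    # lightning then uses a single-device strategy internally
    return {"accelerator": accelerator, "devices": [device_index]}


# e.g.: trainer = lightning.pytorch.Trainer(**single_device_trainer_kwargs())
```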

NickleDave added a commit that referenced this issue May 5, 2024
* WIP: Add config/trainer.py with TrainerConfig

* Rename common.device -> common.accelerator, return 'gpu' not 'cuda' if torch.cuda.is_available

* Fix config section in doc/api/index.rst

* Import trainer and TrainerConfig in src/vak/config/__init__.py, add to __all__

* Add pytorch-lightning to intersphinx in doc/conf.py

* Fix cross-ref in docstring in src/vak/prep/frame_classification/make_splits.py: :constant: -> :const:

* Make lightning a dependency, instead of pytorch_lightning; import lightning.pytorch everywhere instead of pytorch_lightning as lightning -- trying to make it so we can resolve API correctly in docstrings

* Fix in doc/api/index.rst: common.device -> common.accelerator

* Finish writing TrainerConfig class

* Add tests for TrainerConfig class

* Add trainer sub-table to all configs in tests/data_for_tests/configs

* Add trainer sub-table to all configs in doc/toml

* Add trainer sub-table in config/valid-version-1.0.toml, rename -> valid-version-1.1.toml

* Remove device key from top-level tables in config/valid-version-1.1.toml

* Remove device key from top-level tables in tests/data_for_tests/configs

* Remove 'device' key from configs in doc/toml

* Add 'trainer' attribute to EvalConfig, an instance of TrainerConfig; remove 'device' attribute

* Add 'trainer' attribute to PredictConfig, an instance of TrainerConfig; remove 'device' attribute

* Add 'trainer' attribute to TrainConfig, an instance of TrainerConfig; remove 'device' attribute

* Fix typo in docstring in src/vak/config/train.py

* Add 'trainer' attribute to LearncurveConfig, an instance of TrainerConfig; remove 'device' attribute. Also clean up docstring, removing attributes that no longer exist

* Remove device attribute from TrainConfig docstring

* Fix VALID_TOML_PATH in config/validators.py -> 'valid-version-1.1.toml'

* Fix how we instantiate TrainerConfig classes in from_config_dict method of EvalConfig/LearncurveConfig/PredictConfig/TrainConfig

* Fix typo in src/vak/config/valid-version-1.1.toml: predictor -> predict

* Fix unit tests after adding trainer attribute that is instance of TrainerConfig

* Change src/vak/train/frame_classification.py to take trainer_config argument

* Change src/vak/train/parametric_umap.py to take trainer_config argument

* Change src/vak/train/train_.py to take trainer_config argument

* Fix src/vak/cli/train.py to pass trainer_config.asdict() into vak.train.train_.train

* Replace 'device' with 'trainer_config' in vak/eval

* Fix cli.eval to pass trainer_config into eval.eval_.eval

* Replace 'device' with 'trainer_config' in vak/predict

* Fix cli.predict to pass trainer_config into predict.predict_.predict

* Replace 'device' with 'trainer_config' in vak/learncurve

* Fix cli.learncurve to pass trainer_config into learncurve.learncurve.learning_curve

* Rename/replace 'device' fixture with 'trainer' fixture in tests/

* Use config.table.trainer attribute throughout tests, remove config.table.device attribute that no longer exists

* Fix value for devices in fixtures/trainer.py: when accelerator is 'cpu', devices must be > 0

* Fix default devices value for when accelerator is cpu in TrainerConfig

* Fix unit tests for TrainerConfig after fixing default devices for accelerator=cpu

* Fix default value for 'devices' set to -1 in some unit tests where we over-ride config in toml file

* fixup use config.table.trainer attribute throughout tests -- missed one place in tests/test_eval/

* Add back 'device' fixture so we can use it to test Model class

* Fix unit tests in test_models/test_base.py that literally used device to put tensors on device, not to change a config

* Fix assertion in tests/test_models/test_tweetynet.py, from where we switched to using lightning as the dependency

* Fix test for DiceLoss, change trainer_type fixture back to device fixture
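Based on the commit messages above, the new trainer sub-table in a config file presumably looks something like this (a hypothetical sketch; the actual table and key names in vak's TOML schema may differ):

```toml
# hypothetical example of the trainer sub-table added to each top-level table
[vak.train.trainer]
accelerator = "gpu"  # replaces the old top-level 'device' key
devices = 1          # a single device avoids the DDP multi-process issue
```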
@NickleDave
Collaborator Author

Fixed by #752
