📋 [TASK] Implement Multi-GPU Training Support #2258

Open
4 of 11 tasks
Tracked by #2364
samet-akcay opened this issue Aug 19, 2024 · 7 comments


samet-akcay commented Aug 19, 2024

Implement Multi-GPU Support in Anomalib

Depends on:

Background

Anomalib currently uses PyTorch Lightning under the hood, which provides built-in support for multi-GPU training. However, Anomalib itself does not yet expose this functionality to users. Implementing multi-GPU support would significantly enhance the library's capabilities, allowing for faster training on larger datasets and more complex models.

Proposed Feature

Enable multi-GPU support in Anomalib, allowing users to easily utilize multiple GPUs for training without changing their existing code structure significantly.

Example Usage

Users should be able to enable multi-GPU training by simply specifying the number of devices in the Engine configuration:

from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import EfficientAd

datamodule = MVTec(train_batch_size=1)
model = EfficientAd()
engine = Engine(max_epochs=1, accelerator="gpu", devices=2)
engine.fit(model=model, datamodule=datamodule)

This configuration should automatically distribute the training across two GPUs.
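
For context, these arguments map onto options that PyTorch Lightning's Trainer already accepts, so the Engine mainly needs to forward them. Below is a minimal sketch of the plain-Lightning equivalent, assuming explicit DDP strategy selection; the strategy argument is an illustration, not a confirmed part of the Engine API:

# Plain PyTorch Lightning equivalent of the Engine call above (sketch).
from lightning.pytorch import Trainer

trainer = Trainer(
    max_epochs=1,
    accelerator="gpu",
    devices=2,
    strategy="ddp",  # Lightning also selects a DDP strategy automatically for multi-GPU runs
)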

Implementation Goals

  1. Seamless integration with existing Anomalib APIs
  2. Minimal code changes required from users to enable multi-GPU training
  3. Proper utilization of PyTorch Lightning's multi-GPU capabilities
  4. Consistent performance improvements when using multiple GPUs

Implementation Steps

  1. Review PyTorch Lightning's multi-GPU implementation and best practices
  2. Modify the Engine class to properly handle multi-GPU configurations
  3. Ensure all Anomalib models are compatible with distributed training
  4. Update data loading mechanisms to work efficiently with multiple GPUs
  5. Implement proper synchronization of metrics and logging across devices (see the sketch after this list)
  6. Add multi-GPU tests to the test suite
  7. Update documentation with multi-GPU usage instructions and best practices
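
For step 5, here is a minimal sketch of what synchronized metric logging could look like inside a Lightning module, assuming torchmetrics-based metrics; the class and metric names are illustrative, not Anomalib's actual implementation:

# Illustrative sketch only; not Anomalib's actual metric code.
import torch
from lightning.pytorch import LightningModule
from torchmetrics import MeanMetric


class ExampleAnomalyModule(LightningModule):  # hypothetical module name
    def __init__(self) -> None:
        super().__init__()
        # torchmetrics objects gather their state across ranks when compute() is called
        self.val_score = MeanMetric()

    def validation_step(self, batch, batch_idx):
        score = torch.rand(())  # placeholder for a real anomaly score
        self.val_score.update(score)
        # sync_dist=True reduces the logged value across all GPUs before logging
        self.log("val_score", score, sync_dist=True, prog_bar=True)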

Potential Challenges

  • Ensuring all models in Anomalib are compatible with distributed training
  • Handling model-specific operations that may not be distribution-friendly
  • Managing different GPU memory capacities and load balancing
  • Debugging training issues specific to multi-GPU setups

Discussion Points

  • Should we support different distributed training strategies (DP, DDP, etc.)?
  • How do we ensure reproducibility across single and multi-GPU training?
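
On the reproducibility question, a sketch of the knobs Lightning itself provides (fixed seeds plus deterministic kernels); how or whether the Engine exposes these directly is still an open design point:

# Reproducibility sketch using PyTorch Lightning's own utilities.
from lightning.pytorch import Trainer, seed_everything

seed_everything(42, workers=True)  # seeds Python, NumPy, and torch, including dataloader workers

trainer = Trainer(
    accelerator="gpu",
    devices=2,
    deterministic=True,  # prefer deterministic CUDA kernels where available
)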

Next Steps

  • Conduct a thorough review of PyTorch Lightning's multi-GPU capabilities
  • Create a detailed technical design document for the implementation
  • Implement a prototype with a single model and test performance gains (see the timing sketch after this list)
  • Discuss potential impacts on existing features and user workflows
  • Plan for gradual rollout, starting with a subset of models
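
For the prototype mentioned above, a rough timing harness built on the Engine API from the example; running the script once per device count (rather than looping in-process) is an assumption made because DDP relaunches the script in worker processes:

# Rough, illustrative benchmark: run once per device count, e.g.
#   python benchmark_mgpu.py 1
#   python benchmark_mgpu.py 2
import sys
import time

from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import EfficientAd

num_devices = int(sys.argv[1]) if len(sys.argv) > 1 else 1

datamodule = MVTec(train_batch_size=1)
model = EfficientAd()
engine = Engine(max_epochs=1, accelerator="gpu", devices=num_devices)

start = time.perf_counter()
engine.fit(model=model, datamodule=datamodule)
print(f"devices={num_devices}: {time.perf_counter() - start:.1f}s for one epoch")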

Additional Considerations

  • Performance benchmarking: single GPU vs multi-GPU for various models and datasets
  • Impact on memory usage and potential optimizations
  • Handling of model checkpointing and resuming training in multi-GPU setups
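
On the checkpointing point, Lightning already writes checkpoints from rank zero only, so the open question is how resuming is exposed. A sketch assuming the Engine forwards a ckpt_path argument to Trainer.fit (that pass-through, and the path below, are assumptions):

# Sketch of resuming multi-GPU training; the ckpt_path pass-through mirrors Trainer.fit(ckpt_path=...).
from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import EfficientAd

datamodule = MVTec(train_batch_size=1)
model = EfficientAd()
engine = Engine(max_epochs=10, accelerator="gpu", devices=2)

# Lightning writes checkpoints from rank zero only, so a single checkpoint file is produced.
engine.fit(model=model, datamodule=datamodule)

# Later: resume on the same number of devices from the saved checkpoint.
engine.fit(
    model=model,
    datamodule=datamodule,
    ckpt_path="path/to/last.ckpt",  # hypothetical path
)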

We welcome input from the community on this feature. Please share your thoughts, concerns, or suggestions regarding the implementation of multi-GPU support in Anomalib.


haimat commented Sep 18, 2024

Hey guys, this is presumably one of the most important missing features in Anomalib.
Do you have any ideas when v1.2 with multi-GPU training will be released?

@samet-akcay

Hi @haimat, I agree with you, but to enable multi-GPU training we have had to go through a number of refactors here and there. You can check the PRs against the feature/design-simplifications branch.

What is left to enable multi-GPU training is the metric refactor and the visualization refactor, which we are currently working on.


haimat commented Sep 18, 2024

That sounds great, thanks for the update.
Do you have an estimate of when this whole change might be ready?


haimat commented Oct 2, 2024

@samet-akcay Hello, do you have any idea when this might be released?

@samet-akcay

@haimat, we figured out that this requires quite a few changes within AnomalyModule. The required changes unfortunately break backwards compatibility, which is why we decided to release this as part of v2.0. We are currently working on it on the feature/design-simplifications branch, which will be released as v2.0.0.

@samet-akcay samet-akcay modified the milestones: v1.2.0, v2.0 Oct 14, 2024
@samet-akcay samet-akcay changed the title Feature: Multi-GPU Support in Anomalib 📋 [TASK] Implement Multi-GPU Training Support Oct 14, 2024

haimat commented Oct 14, 2024

@samet-akcay Thanks for the update.
Do you have an estimate of when you plan to release version 2.0?

@samet-akcay

We aim to release it by the end of this quarter.
