Warmstart infrastructure switch #254

Merged on Sep 17, 2024 (93 commits)


@le1nux commented Sep 9, 2024

What does this PR do?

This PR mainly addresses the warmstart of model training, e.g., after GPU crashes.

General Changes

  • Fixes issue #242 (Warmstart checkpoints have correct amount of steps but wrong amount of tokens)
  • Warmstarts with changing infrastructure (e.g., a different number of GPUs) are now supported.
  • Restructures the settings part of the configs to
  • Adds various checks for consistency of model training (e.g., target tokens and number of dataset tokens mismatch)
  • Refactors all configs to be runnable again
  • Adds an interactive Jupyter notebook-based tutorial on how to use Modalities (merged from PR #239, Interactive getting started tutorial)
  • Adds a warmstart tutorial
  • Adds a TrainingReportGenerator that creates a report on the training setup and prints warnings in case of inconsistencies
  • Makes activation checkpointing a component
  • Adds further NumberConversion routines
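
The token mismatch fixed above and the switch to a different number of GPUs both come down to the same arithmetic: the number of tokens consumed per optimizer step depends on the parallel setup. The following is a minimal sketch of that arithmetic, not the Modalities implementation. `ParallelSetup`, `tokens_seen`, and `remaining_steps` are hypothetical names introduced here for illustration, and the formula assumes plain data parallelism with fixed-length sequences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelSetup:
    """Hypothetical description of a data-parallel training setup."""

    num_ranks: int  # number of GPUs (data-parallel ranks)
    local_micro_batch_size: int  # samples per rank per forward pass
    sequence_length: int  # tokens per sample
    gradient_accumulation_steps: int = 1

    @property
    def tokens_per_step(self) -> int:
        # Tokens consumed across all ranks by one optimizer step.
        return (
            self.num_ranks
            * self.local_micro_batch_size
            * self.sequence_length
            * self.gradient_accumulation_steps
        )


def tokens_seen(setup: ParallelSetup, num_steps: int) -> int:
    """Tokens consumed after num_steps optimizer steps under the given setup."""
    return setup.tokens_per_step * num_steps


def remaining_steps(new_setup: ParallelSetup, seen_tokens: int, target_tokens: int) -> int:
    """Steps still needed after a warmstart under a (possibly different) setup.

    Raises ValueError if the token budget is inconsistent with the new setup,
    which is the kind of mismatch a consistency check would report.
    """
    remaining = target_tokens - seen_tokens
    if remaining < 0:
        raise ValueError("Checkpoint has already seen more tokens than the target.")
    if remaining % new_setup.tokens_per_step != 0:
        raise ValueError(
            f"{remaining} remaining tokens are not divisible by "
            f"{new_setup.tokens_per_step} tokens per step."
        )
    return remaining // new_setup.tokens_per_step


# Example: a run on 8 GPUs crashes after 100 steps and is resumed on 4 GPUs.
old = ParallelSetup(num_ranks=8, local_micro_batch_size=4, sequence_length=2048)
new = ParallelSetup(num_ranks=4, local_micro_batch_size=4, sequence_length=2048)

seen = tokens_seen(old, num_steps=100)  # 6_553_600 tokens
target = tokens_seen(old, num_steps=200)  # original budget of 13_107_200 tokens
print(remaining_steps(new, seen, target))  # 200 steps on the smaller setup
```

Tracking progress in tokens rather than steps is what makes the resume independent of the hardware: the step count of the old run says nothing about how much data was consumed once the number of ranks changes.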

Breaking Changes

  • The settings part of the configs has been completely refactored.

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

le1nux and others added 30 commits August 29, 2024 13:41
…o a high value lead to huge memory allocation
@le1nux marked this pull request as ready for review September 15, 2024 15:01

@flxst left a comment

Added a few comments, mainly about the "Modalities in 15 mins" tutorial and README.md.

examples/modalities_in_15_mins/modalities_demo.ipynb (outdated, resolved)
README.md (outdated, resolved)
README.md (outdated, resolved)
examples/warmstart/README.md (outdated, resolved)
src/modalities/__main__.py (outdated, resolved)
src/modalities/__main__.py (outdated, resolved)
README.md (outdated, resolved)

Would still be nice to change this to "Multimodal Foundation Model Training" (without the "s")

Co-authored-by: Felix Stollenwerk <[email protected]>

@fromm-m left a comment

looks good to me

@flxst left a comment

LGTM :)

@le1nux merged commit 8158de7 into main Sep 17, 2024
3 checks passed
@le1nux deleted the warmstart_infrastructure_switch branch September 17, 2024 11:59
Labels: bug, documentation, enhancement

4 participants