Warmstart infrastructure switch #254

Merged: 93 commits, Sep 17, 2024 (changes shown from 88 commits)

Commits
2e0fedf
refactor: introduced ResultItem to EvaluationResultBatch
le1nux Aug 29, 2024
fc9d7b5
fix: fixed max_length warning in tokenizer
le1nux Aug 29, 2024
3b7d74d
refactor: removed excessive print statements
le1nux Aug 29, 2024
cf598a0
refactor: added ResultItem to other components
le1nux Aug 29, 2024
c0bcad4
chore: removed more print statements
le1nux Aug 29, 2024
99bd571
refactor: removed unused parameter from IndexGenerator constructor
le1nux Aug 29, 2024
431766a
feat: added configs for demo
le1nux Aug 31, 2024
335c783
feat: added demo diagrams
le1nux Aug 31, 2024
1867232
feat: added tokenizer config
le1nux Aug 31, 2024
8d2f0b2
feat: added demo jupyternobook
le1nux Aug 31, 2024
0dab792
feat: added img
le1nux Aug 31, 2024
3dd1f7b
refactor: more demo adaptations
le1nux Sep 2, 2024
6398d62
chore: added banner
le1nux Sep 4, 2024
70edf9d
chore: moved the diagrams to new tutorial
le1nux Sep 6, 2024
e6c5a59
feat: added notebooks disclaimer
le1nux Sep 6, 2024
1f83d68
feat: added tokenizer and training config
le1nux Sep 6, 2024
fce1daa
feat: added getting started jupyter notebook
le1nux Sep 6, 2024
fce1d5f
refactor: updated modalities demo
le1nux Sep 6, 2024
1899bfd
feat: added tokenizer configs for tutorial
le1nux Sep 6, 2024
cf2f7db
feat: added wandb_storage to gitignore
le1nux Sep 6, 2024
4fce366
chore: renamed tutorial folder
le1nux Sep 6, 2024
2f80136
chore: removed old debug print statements
le1nux Sep 6, 2024
9281405
chore: Merge branch 'main' into live_demo
le1nux Sep 6, 2024
42e7b1c
fix: removed the max_length tag in huggingfae tokenizer. Setting it t…
le1nux Sep 6, 2024
5af079a
fix: fixed failing warmstart test
le1nux Sep 6, 2024
226719c
Update src/modalities/config/component_factory.py
le1nux Sep 8, 2024
7f3f8fa
feat: added optional rounding for metrics
le1nux Sep 8, 2024
61dba25
refactor: lr now logged with full precision
le1nux Sep 8, 2024
fe6e38a
feat: added evaluator logging
le1nux Sep 8, 2024
78fd763
refactor: added logging of number parameters again
le1nux Sep 8, 2024
d44049a
chore: added gitkeep files
le1nux Sep 8, 2024
39db2bf
refactor: added huggingface dataset download to modalities_in_15_mins…
mali-git Sep 8, 2024
7462198
chore: minor corrections in README.md
flxst Sep 9, 2024
0b9a436
chore: merge sections usage and entry points in README.md
flxst Sep 9, 2024
cefb910
chore: change order of sections in README.md
flxst Sep 9, 2024
5ba0846
refactor: dataloaders are now never shuffled. Samplers do the shuffli…
le1nux Sep 9, 2024
a9812f3
feat: added more number conversion functions
le1nux Sep 9, 2024
dace200
Merge pull request #251 from Modalities/readme_updates
le1nux Sep 10, 2024
5d29535
refactor: moved activation checkpointing to FSDP model factory
le1nux Sep 10, 2024
576086e
refactor: refactored the instantiation model s.t. it separates traini…
le1nux Sep 10, 2024
6d550b6
feat: added train progress class
le1nux Sep 10, 2024
291f557
feat: introduced ActivationCheckpointedModel to allow for checkpointi…
le1nux Sep 11, 2024
bcc20f3
refactor: BatchProgressSubscriber now gets the number of train steps …
le1nux Sep 11, 2024
872d4a0
refactor: calling BatchProgress only Progress from now on
le1nux Sep 11, 2024
6061220
refactor: refactored warmstart functionality in __main__.py
le1nux Sep 11, 2024
939dd3f
refactor: imlemented checkpointing based on TrainingProgress instead …
le1nux Sep 11, 2024
0ced8b4
feat: added further number conversion functions
le1nux Sep 11, 2024
b4ce789
feat: added pydantic type for FSDP wrapped model
le1nux Sep 11, 2024
bb91e9e
refactor: refactored the trainer to work with the new TrainingProgres…
le1nux Sep 11, 2024
1ad1fb0
refactor: introduced a clean separation of training and warmstart set…
le1nux Sep 11, 2024
a70c7c1
fix: fixed dataloader iteration (needed num batches not num steps)
le1nux Sep 11, 2024
ac1171f
fix: repaired number conversion tests
le1nux Sep 11, 2024
4fbcf19
fix: fixed bug FSDPCheckpointSaving._get_paths_to_delete and respecti…
le1nux Sep 11, 2024
a3f4f6d
fix: fixed failing test_checkpoint_strategy_k
le1nux Sep 11, 2024
ddae58a
refactor: improved the settings configuration
le1nux Sep 12, 2024
7db236d
feat: added NumberConversion get_num_tokens_from_packed_mem_map_datas…
le1nux Sep 12, 2024
d87bcc1
fix: fixed all failing unit tests
le1nux Sep 12, 2024
e8e5d76
refactor: refactored config lorem ipsum
le1nux Sep 12, 2024
3d9c0a1
fix: fixed two failing multi-gpu tests
le1nux Sep 12, 2024
a77f932
refactor: removed get_num_tokens_from_num_steps_callable from checkpo…
le1nux Sep 12, 2024
372f34a
fix: fixed configs for other multi-gpu tets
le1nux Sep 12, 2024
97a8ae7
refactor: removed NumberConversion function get_num_tokens_from_num_s…
le1nux Sep 12, 2024
1bebb61
feat: added test for activation checkpointing
le1nux Sep 13, 2024
24dfc75
feat: added debugger function for testing distributed, multi-gpu tests
le1nux Sep 13, 2024
f831d8a
chore: add debugpy dependency
flxst Sep 13, 2024
9f04f8a
fix: getting started example config
flxst Sep 13, 2024
00031df
feat: added missing number conversion tests
le1nux Sep 13, 2024
2544a21
chore: Merge branch 'warmstart_infrastructure_switch' of github.com:M…
le1nux Sep 13, 2024
d065456
feat: added NumberConversion get_num_steps_from_raw_dataset_index
le1nux Sep 14, 2024
e21f64e
feat: introduced get_raw_index in DatasetFactory
le1nux Sep 14, 2024
d105618
chore: minor print fix
le1nux Sep 14, 2024
77bdada
refactor: refactored the library usage example
le1nux Sep 14, 2024
cf638bf
refactor: adapted more configs to the new setttings design
le1nux Sep 14, 2024
7f9bd09
refactor: reduced the coca_config_initialization.yaml
le1nux Sep 14, 2024
b783810
Merge pull request #239 from Modalities/live_demo
le1nux Sep 14, 2024
5b1e7e9
feat: added TrainingReportGenerator
le1nux Sep 15, 2024
77e0c10
refactor: adapted the modalities_in_15_mins config to latest changes
le1nux Sep 15, 2024
3ee086c
feat: added conststency checks for remaining steps
le1nux Sep 15, 2024
d5885bc
fix: fixed coda example config
le1nux Sep 15, 2024
522694d
feat: added information on missed out tokens percentages
le1nux Sep 15, 2024
55db9a2
feat: added warmstart tutorial
le1nux Sep 15, 2024
02b09f2
feat: updated components.md
le1nux Sep 15, 2024
d183fdd
chore: removed unnecessary math.ceil call
le1nux Sep 16, 2024
be0d424
chore: Merge branch 'main' into warmstart_infrastructure_switch
le1nux Sep 16, 2024
23bb463
chore: added short description for modalities in 15mins tutorial to R…
flxst Sep 16, 2024
e71f537
feat: added README to getting started tutorial
le1nux Sep 16, 2024
4756f39
chore: further shortened path explanations in jupyter notebook
le1nux Sep 16, 2024
c9c1ce4
Update README.md
le1nux Sep 16, 2024
658d0d0
refactor: renamed examples to tutorials
le1nux Sep 16, 2024
f70e7b1
chore: fixed type in variable name
le1nux Sep 16, 2024
f7923f7
Update README.md
le1nux Sep 16, 2024
17b7a0c
refactor: consistent usage of progress_subscriber name
le1nux Sep 16, 2024
9a3ff8c
chore: minor config renaming
le1nux Sep 16, 2024
2 changes: 1 addition & 1 deletion .gitignore
@@ -160,5 +160,5 @@ pyenv*
noteboks/*

tests/tmp/*
*wandb_storage*
.coverage/*
wandb_storage/
177 changes: 85 additions & 92 deletions README.md
@@ -20,14 +20,14 @@ Modalities is a PyTorch-native framework for distributed training of Large Langu

We successfully scaled Modalities up to 2048 GPUs on two HPC centers, namely [Leonardo Booster](https://leonardo-supercomputer.cineca.eu/hpc-system/) and [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), featuring Nvidia A100 and H100 GPUs, respectively. The results of our scaling experiments can be found [here](#scaling-experiments).

Besides its scalabilty, Modalities allows to seamlessly integrate new components and features, such as custom attention mechanisms, loss functions, optimizers or models. We provide a series of tutorials to help you get started with training and evaluating models using Modalities. We achieve this level of extensibility by having clear interfaces for each component type (e.g., model, optimizer, etc.), that a component must implement to be registered within in Modalities at runtime.
Besides its scalability, Modalities allows you to seamlessly integrate new components and features, such as custom attention mechanisms, loss functions, optimizers or models. We provide a series of tutorials to help you get started with training and evaluating models using Modalities. We achieve this level of extensibility through clear interfaces for each component type (e.g., model, optimizer, etc.) that a component must implement to be registered within Modalities at runtime.

## Getting Started
For training and evaluation a model, feel free to checkout [this](https://github.com/Modalities/modalities/blob/main/examples/getting_started/README.md) getting started tutorial, in which we train a small, 60M-parameter GPT model on a tiny subset of the Redpajama V2 dataset.
For training and evaluation of a model, feel free to check out [this](https://github.com/Modalities/modalities/blob/main/examples/getting_started/README.md) getting started tutorial, in which we train a small, 60M-parameter GPT model on a tiny subset of the Redpajama V2 dataset.

## Installation

There are two ways to install modalities. If you want to use the latest nightly version, or if you want to modify the code base itself, we recommend installing modalities directly from source.
There are two ways to install Modalities. If you want to use the latest nightly version, or if you want to modify the code base itself, we recommend installing Modalities directly from source.

If you want to use Modalities as a library and register your custom components with Modalities, you can install it directly via pip which provides you with the latest stable version.

@@ -69,7 +69,7 @@ pip install -e .

### Option 2: Installation via pip

To install modalities via pip, run
To install Modalities via pip, run

```sh
pip install torch
@@ -78,16 +78,39 @@ pip install modalities

Note that, here as well, torch has to be installed before installing Modalities due to flash attention's dependency management.


## Usage
For running the training endpoint on multiple GPUs run
```sh
CUDA_VISIBLE_DEVICES=2,3 torchrun --nnodes 1 --nproc_per_node 2 --rdzv-endpoint=0.0.0.0:29502 modalities run --config_file_path config_files/config.yaml
Modalities provides several entry points to interact with the framework. The following section lists the available entry points and their respective functionalities.


### Model Training

For model pretraining, we have to pass a configuration file that specifies the model architecture, optimizer, dataset, dataloader, and other training components. Additionally, we specify the number of nodes, the number of processes per node, and the rendezvous endpoint.

Example:
```sh
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --rdzv-endpoint localhost:29515 \
--nnodes 1 \
--nproc_per_node 4 \
$(which modalities) run --config_file_path configs/pretraining_config.yaml
```

In the example above, we use `torchrun` to run the training endpoint on two GPUs. The `--nnodes` argument specifies the number of nodes in the cluster, `--nproc_per_node` specifies the number of processes per node, and `--rdzv-endpoint` specifies the rendezvous endpoint. The `modalities run` command specifies the training endpoint, and `--config_file_path` specifies the path to the configuration file. The configuraton file contains the exhaustive parameterization for all the training components (e.g., dataset, model, optimize, etc.), making training fully reproducible. A full list of all the components already available in Modalities can be found [here](docs/components/components.md).
Explanation:

Or, if you are a VSCode user, add this to your `launch.json`
* `CUDA_VISIBLE_DEVICES=0,1,2,3`: This environment variable specifies which GPUs will be used for the job. In the example above, the four GPUs with IDs 0, 1, 2, 3 are selected for training.

* `torchrun`: This is a utility from PyTorch used to launch distributed training. It automatically manages multiple processes for distributed training.

* `--rdzv-endpoint localhost:29515`: Specifies the rendezvous endpoint. Here, localhost is the machine's address, and 29515 is the port. The rendezvous endpoint coordinates the processes involved in distributed training.

* `--nnodes 1`: Specifies the number of nodes to be used in the distributed setup. In the example above, a single-node setup is used.

* `--nproc_per_node 4`: This argument tells torchrun how many processes to launch on each node. In the example above, 4 processes are launched per node, corresponding to the 4 GPUs (IDs 0, 1, 2, 3) specified by CUDA_VISIBLE_DEVICES.

* `$(which modalities) run`: This part dynamically finds the path to the Modalities executable and runs it. The run command triggers the main process to start the training.

* `--config_file_path configs/pretraining_config.yaml`: The `--config_file_path` argument provides the path to the configuration file for the training job. In the example above, it is given by `configs/pretraining_config.yaml`. A configuration file contains an exhaustive parameterization for all the training components (e.g., dataset, model, optimizer, etc.), making training fully reproducible. An example configuration file can be found [here](examples/getting_started/example_config.yaml), and a complete list of components available in Modalities is provided [here](docs/components/components.md). The sketch after this list illustrates the general config-driven pattern.
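
A minimal sketch of this config-driven pattern: a YAML file is parsed and validated into typed settings. The class and field names below are invented for illustration and do not reflect Modalities' actual configuration schema.

```python
# Illustrative only: a generic YAML-to-typed-settings pattern, not Modalities' schema.
import yaml
from pydantic import BaseModel


class OptimizerSettings(BaseModel):  # hypothetical fields
    variant: str
    lr: float


class TrainingSettings(BaseModel):  # hypothetical fields
    num_training_steps: int
    sequence_length: int
    optimizer: OptimizerSettings


def load_config(path: str) -> TrainingSettings:
    """Parse a YAML config and validate it into typed settings."""
    with open(path, "r", encoding="utf-8") as f:
        raw = yaml.safe_load(f)
    return TrainingSettings(**raw)  # raises a validation error on malformed configs
```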

If you are a VSCode user, you may want to add this to your `launch.json`:
```json

{
@@ -96,28 +119,75 @@ Or, if you are a VSCode user, add this to your `launch.json`
"request": "launch",
"module": "torch.distributed.run",
"env": {
"CUDA_VISIBLE_DEVICES": "0"
"CUDA_VISIBLE_DEVICES": "0,1,2,3"
},
"args": [
"--nnodes",
"1",
"--nproc_per_node",
"2",
"--rdzv-endpoint=0.0.0.0:29503",
"--rdzv-endpoint=0.0.0.0:29515",
"src/modalities/__main__.py",
"run",
"--config_file_path",
"config_files/config.yaml",
"config_files/pretraining_config.yaml",
],
"console": "integratedTerminal",
"justMyCode": true,
"envFile": "${workspaceFolder}/.env"
}
```
which will allow you to run the training endpoint directly from VSCode and debug it.
It will allow you to run the training endpoint directly from VSCode and debug it.
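
For multi-GPU runs, another generic option (independent of Modalities and of VSCode's launch configuration) is to have one rank pause until a debugger attaches. Below is a minimal sketch using `debugpy`; the port choice and rank handling are illustrative assumptions, not part of the Modalities API.

```python
# Generic debugpy pattern for debugging one rank of a torchrun job; illustrative only.
import os

import debugpy


def wait_for_debugger(port: int = 5678) -> None:
    """Open a debug server on rank 0 and block until a debugger client attaches."""
    if int(os.environ.get("RANK", "0")) == 0:  # RANK is set by torchrun
        debugpy.listen(("0.0.0.0", port))
        print(f"Waiting for debugger to attach on port {port} ...")
        debugpy.wait_for_client()
```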

### Raw Training Dataset Indexation

The goal of the indexation process is to determine the starting byte position and length of each document in the raw data file. Subsequently, the index file is used to efficiently access the raw data during tokenization.

Example:
```sh
modalities data create_raw_index --index_path data/preprocessed/fineweb_edu_num_docs_483606.idx \
data/raw/fineweb_edu_num_docs_483606.jsonl
```

Explanation:

The `modalities data create_raw_index` command triggers the process of creating the index from the raw data. The `--index_path` argument specifies the location where the generated index file will be saved. In this example, the index will be stored at `data/preprocessed/fineweb_edu_num_docs_483606.idx`. The last part, i.e., `data/raw/fineweb_edu_num_docs_483606.jsonl` is the input file in JSONL (JSON Lines) format containing the raw data. The command will process this file to create the index.
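
Conceptually, such an index is a list of (byte offset, byte length) pairs, one per JSONL document, that later allows jumping straight to any document without scanning the whole file. The sketch below is a simplified illustration of that idea, not the actual Modalities index format or implementation; the pickle serialization and function names are assumptions.

```python
# Simplified illustration of a raw-data index; not the Modalities on-disk format.
import json
import pickle


def build_index(jsonl_path: str, index_path: str) -> None:
    """Store the starting byte offset and byte length of every document."""
    index: list[tuple[int, int]] = []
    offset = 0
    with open(jsonl_path, "rb") as f:
        for line in f:
            index.append((offset, len(line)))
            offset += len(line)
    with open(index_path, "wb") as f:
        pickle.dump(index, f)


def read_document(jsonl_path: str, offset: int, length: int) -> dict:
    """Jump directly to one document using its index entry."""
    with open(jsonl_path, "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length))
```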

### Raw Training Dataset Tokenization

Tokenization is the process of converting raw text data into a sequence of tokens that can be used as input to the model. Tokenization requires a configuration file that fully describes the process, making it reproducible. An example tokenization config can be found [here](examples/getting_started/example_dataset_config_train.yaml).

Example:
```sh
modalities data pack_encoded_data configs/tokenization_config.yaml
```
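
For intuition, the sketch below shows what this step produces conceptually: each document is tokenized and the resulting token ids are appended to one flat, packed sequence. The Hugging Face GPT-2 tokenizer, the EOS separator, and the uint32 binary output are assumptions for illustration; the actual tokenizer and packed-data format are defined by the tokenization config.

```python
# Conceptual sketch of packing tokenized documents; not Modalities' packed-data format.
import json

import numpy as np
from transformers import AutoTokenizer  # example tokenizer backend (assumption)


def pack_encoded_data(jsonl_path: str, out_path: str, text_key: str = "text") -> None:
    """Tokenize every document in a JSONL file and append the ids to one flat array."""
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    token_ids: list[int] = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            token_ids.extend(tokenizer.encode(doc[text_key]))
            token_ids.append(tokenizer.eos_token_id)  # mark the document boundary
    np.array(token_ids, dtype=np.uint32).tofile(out_path)  # packed binary output
```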

### Inference

For inference on a model checkpoint, we have to pass a configuration file that specifies the full inference setup. An example inference config can be found [here](examples/getting_started/example_text_generation_config.yaml).

Example:

```sh
modalities generate_text --config_file_path example_text_generation_config.yaml

```
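
Under the hood, text generation boils down to repeatedly picking (or sampling) the next token from the model's output distribution. The sketch below is a minimal greedy-decoding loop for intuition only, not the Modalities generation code; it assumes a model that returns logits of shape `[batch, seq_len, vocab_size]`.

```python
# Minimal greedy decoding loop for intuition; not the Modalities inference code.
import torch


@torch.no_grad()
def greedy_generate(model, input_ids: torch.Tensor, max_new_tokens: int = 50) -> torch.Tensor:
    """Append the most likely next token max_new_tokens times."""
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(input_ids)  # assumed output shape: [batch, seq_len, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids
```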

## Tutorials
Even though Modalities significantly simplifies LLM training, there is still some technical complexity left. We provide a series of tutorials to help you get started with training and evaluating models using Modalities.

- [Getting Started](examples/getting_started/README.md)</br>
Brief overview on how to get started with Modalities by training a small GPT model on a tiny subset of the Redpajama V2 dataset.

- [Library Usage](examples/library_usage/README.md)</br>
How to use Modalities as a library and register custom components with Modalities.

- [Modalities in 15mins](examples/modalities_in_15_mins/README.md) </br>
Train a dense model with Modalities in 15 minutes


## Supported Features
In the following, we list the already implemented, planned and in-progress features w.r.t. to improving downstream performance, throughput, multi-modality, and alignment.
In the following, we list the most important features of Modalities.

### Throughput Features

@@ -157,83 +227,6 @@ In the following, we list the already implemented, planned and in-progress featu
| Knowledge Distillation | planned | Transfers knowledge from a larger, complex model to a smaller, more efficient model, improving the smaller model's performance without the computational cost of the larger model.|
| Hyperparameter Optimization | planned | Grid search for various hyperparameter such as LR, Optimizer arguments etc. Also the integration of µP might be interesting |

## Tutorials
Even though Modalities significantly simplifies LLM training, there is still some technical complexity left. We provide a series of tutorials to help you get started with training and evaluating models using Modalities.

- [Getting Started](examples/getting_started/README.md)</br>
Brief overview on how to get started with Modalities by training a small GPT model on a tiny subset of the Redpajama V2 dataset.

- [Library Usage](examples/library_usage/README.md)</br>
How to use Modalities as a library and register custom components with Modalities.

- [Modalities in 15mins] </br>
Jupyter notebook will be added soon

## Entry Points
Modalities provides several entry points to interact with the framework. The following section lists the available entry points and their respective functionalities.


### Model training

For model pretraining, we have to pass a configuration file that specifies the model architecture, optimizer, dataset, dataloader, and other training components. Additionally, we specify the number of nodes, the number of processes per node, and the rendezvous endpoint.

Example:
```sh
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --rdzv-endpoint localhost:29515 \
--nnodes 1 \
--nproc_per_node 4 \
$(which modalities) run --config_file_path configs/pretraining_config.yaml
```

Explanation:

* `CUDA_VISIBLE_DEVICES=0,1,2,3`: This environment variable specifies which GPUs will be used for the job. In this case, GPUs with IDs 0, 1, 2, 3 are selected for training.

* `torchrun`: This is a utility from PyTorch used to launch distributed training. It automatically manages multiple processes for distributed training.

* `--rdzv-endpoint localhost:29515`: Specifies the rendezvous endpoint. Here, localhost is the machine's address, and 29515 is the port. The rendezvous endpoint coordinates the processes involved in distributed training.

* `--nnodes 1`: Specifies the number of nodes to be used in the distributed setup. Since this is a single-node setup, 1 is used.

* `--nproc_per_node 4`: This argument tells torchrun how many processes to launch on each node. In this case, 4 processes are launched per node, corresponding to the 4 GPUs (IDs 0, 1, 2, 3) specified by CUDA_VISIBLE_DEVICES.

* `$(which modalities) run`: This part dynamically finds the path to the Modalities executable and runs it. The run command triggers the main process to start the training.

* `--config_file_path configs/pretraining_config.yaml`: The --config_file_path argument provides the path to the configuration file for the training job. In this example, the configuration is provided in configs/pretraining_config.yaml, which includes settings like model architecture, optimizer, dataset, dataloader and other training components. An example config file can be found [here](examples/getting_started/example_config.yaml).

### Raw Training Dataset Indexation

The goal of the indexation process is to determine the starting byte position and length of each document in the raw data file. Subsequently, the index file is used to efficiently access the raw data during tokenization.

Example:
```sh
modalities data create_raw_index --index_path data/preprocessed/fineweb_edu_num_docs_483606.idx \
data/raw/fineweb_edu_num_docs_483606.jsonl
```

Explanation:

The `modalities data create_raw_index` command triggers the process of creating the index from the raw data. The `--index_path` argument specifies the location where the generated index file will be saved. In this example, the index will be stored at `data/preprocessed/fineweb_edu_num_docs_483606.idx`. The last part, i.e., `data/raw/fineweb_edu_num_docs_483606.jsonl` is the input file in JSONL (JSON Lines) format containing the raw data. The command will process this file to create the index.

### Raw Training Dataset Tokenization

Tokenization is the process of converting raw text data into a sequence of tokens that can be used as input to the model. The tokenization requires a configuration file, fully describing the tokenization process, making it fully reproducible. An example tokenization config can be found [here](examples/getting_started/example_dataset_config_train.yaml).

Example:
```sh
modalities data pack_encoded_data configs/tokenization_config.yaml
```

### Inference

For inference on a model checkpoint, we have to pass a configuration file that specifies the full inference setup. An example inference config can be found [here](examples/getting_started/example_text_generation_config.yaml).

Example:

```sh
modalities generate_text --config_file_path example_text_generation_config.yaml

```

## Scaling Experiments
