Warmstart infrastructure switch #254

Merged: 93 commits, Sep 17, 2024 (changes shown from 88 commits)

Commits
2e0fedf
refactor: introduced ResultItem to EvaluationResultBatch
le1nux Aug 29, 2024
fc9d7b5
fix: fixed max_length warning in tokenizer
le1nux Aug 29, 2024
3b7d74d
refactor: removed excessive print statements
le1nux Aug 29, 2024
cf598a0
refactor: added ResultItem to other components
le1nux Aug 29, 2024
c0bcad4
chore: removed more print statements
le1nux Aug 29, 2024
99bd571
refactor: removed unused parameter from IndexGenerator constructor
le1nux Aug 29, 2024
431766a
feat: added configs for demo
le1nux Aug 31, 2024
335c783
feat: added demo diagrams
le1nux Aug 31, 2024
1867232
feat: added tokenizer config
le1nux Aug 31, 2024
8d2f0b2
feat: added demo jupyternobook
le1nux Aug 31, 2024
0dab792
feat: added img
le1nux Aug 31, 2024
3dd1f7b
refactor: more demo adaptations
le1nux Sep 2, 2024
6398d62
chore: added banner
le1nux Sep 4, 2024
70edf9d
chore: moved the diagrams to new tutorial
le1nux Sep 6, 2024
e6c5a59
feat: added notebooks disclaimer
le1nux Sep 6, 2024
1f83d68
feat: added tokenizer and training config
le1nux Sep 6, 2024
fce1daa
feat: added getting started jupyter notebook
le1nux Sep 6, 2024
fce1d5f
refactor: updated modalities demo
le1nux Sep 6, 2024
1899bfd
feat: added tokenizer configs for tutorial
le1nux Sep 6, 2024
cf2f7db
feat: added wandb_storage to gitignore
le1nux Sep 6, 2024
4fce366
chore: renamed tutorial folder
le1nux Sep 6, 2024
2f80136
chore: removed old debug print statements
le1nux Sep 6, 2024
9281405
chore: Merge branch 'main' into live_demo
le1nux Sep 6, 2024
42e7b1c
fix: removed the max_length tag in huggingfae tokenizer. Setting it t…
le1nux Sep 6, 2024
5af079a
fix: fixed failing warmstart test
le1nux Sep 6, 2024
226719c
Update src/modalities/config/component_factory.py
le1nux Sep 8, 2024
7f3f8fa
feat: added optional rounding for metrics
le1nux Sep 8, 2024
61dba25
refactor: lr now logged with full precision
le1nux Sep 8, 2024
fe6e38a
feat: added evaluator logging
le1nux Sep 8, 2024
78fd763
refactor: added logging of number parameters again
le1nux Sep 8, 2024
d44049a
chore: added gitkeep files
le1nux Sep 8, 2024
39db2bf
refactor: added huggingface dataset download to modalities_in_15_mins…
mali-git Sep 8, 2024
7462198
chore: minor corrections in README.md
flxst Sep 9, 2024
0b9a436
chore: merge sections usage and entry points in README.md
flxst Sep 9, 2024
cefb910
chore: change order of sections in README.md
flxst Sep 9, 2024
5ba0846
refactor: dataloaders are now never shuffled. Samplers do the shuffli…
le1nux Sep 9, 2024
a9812f3
feat: added more number conversion functions
le1nux Sep 9, 2024
dace200
Merge pull request #251 from Modalities/readme_updates
le1nux Sep 10, 2024
5d29535
refactor: moved activation checkpointing to FSDP model factory
le1nux Sep 10, 2024
576086e
refactor: refactored the instantiation model s.t. it separates traini…
le1nux Sep 10, 2024
6d550b6
feat: added train progress class
le1nux Sep 10, 2024
291f557
feat: introduced ActivationCheckpointedModel to allow for checkpointi…
le1nux Sep 11, 2024
bcc20f3
refactor: BatchProgressSubscriber now gets the number of train steps …
le1nux Sep 11, 2024
872d4a0
refactor: calling BatchProgress only Progress from now on
le1nux Sep 11, 2024
6061220
refactor: refactored warmstart functionality in __main__.py
le1nux Sep 11, 2024
939dd3f
refactor: imlemented checkpointing based on TrainingProgress instead …
le1nux Sep 11, 2024
0ced8b4
feat: added further number conversion functions
le1nux Sep 11, 2024
b4ce789
feat: added pydantic type for FSDP wrapped model
le1nux Sep 11, 2024
bb91e9e
refactor: refactored the trainer to work with the new TrainingProgres…
le1nux Sep 11, 2024
1ad1fb0
refactor: introduced a clean separation of training and warmstart set…
le1nux Sep 11, 2024
a70c7c1
fix: fixed dataloader iteration (needed num batches not num steps)
le1nux Sep 11, 2024
ac1171f
fix: repaired number conversion tests
le1nux Sep 11, 2024
4fbcf19
fix: fixed bug FSDPCheckpointSaving._get_paths_to_delete and respecti…
le1nux Sep 11, 2024
a3f4f6d
fix: fixed failing test_checkpoint_strategy_k
le1nux Sep 11, 2024
ddae58a
refactor: improved the settings configuration
le1nux Sep 12, 2024
7db236d
feat: added NumberConversion get_num_tokens_from_packed_mem_map_datas…
le1nux Sep 12, 2024
d87bcc1
fix: fixed all failing unit tests
le1nux Sep 12, 2024
e8e5d76
refactor: refactored config lorem ipsum
le1nux Sep 12, 2024
3d9c0a1
fix: fixed two failing multi-gpu tests
le1nux Sep 12, 2024
a77f932
refactor: removed get_num_tokens_from_num_steps_callable from checkpo…
le1nux Sep 12, 2024
372f34a
fix: fixed configs for other multi-gpu tets
le1nux Sep 12, 2024
97a8ae7
refactor: removed NumberConversion function get_num_tokens_from_num_s…
le1nux Sep 12, 2024
1bebb61
feat: added test for activation checkpointing
le1nux Sep 13, 2024
24dfc75
feat: added debugger function for testing distributed, multi-gpu tests
le1nux Sep 13, 2024
f831d8a
chore: add debugpy dependency
flxst Sep 13, 2024
9f04f8a
fix: getting started example config
flxst Sep 13, 2024
00031df
feat: added missing number conversion tests
le1nux Sep 13, 2024
2544a21
chore: Merge branch 'warmstart_infrastructure_switch' of github.com:M…
le1nux Sep 13, 2024
d065456
feat: added NumberConversion get_num_steps_from_raw_dataset_index
le1nux Sep 14, 2024
e21f64e
feat: introduced get_raw_index in DatasetFactory
le1nux Sep 14, 2024
d105618
chore: minor print fix
le1nux Sep 14, 2024
77bdada
refactor: refactored the library usage example
le1nux Sep 14, 2024
cf638bf
refactor: adapted more configs to the new setttings design
le1nux Sep 14, 2024
7f9bd09
refactor: reduced the coca_config_initialization.yaml
le1nux Sep 14, 2024
b783810
Merge pull request #239 from Modalities/live_demo
le1nux Sep 14, 2024
5b1e7e9
feat: added TrainingReportGenerator
le1nux Sep 15, 2024
77e0c10
refactor: adapted the modalities_in_15_mins config to latest changes
le1nux Sep 15, 2024
3ee086c
feat: added conststency checks for remaining steps
le1nux Sep 15, 2024
d5885bc
fix: fixed coda example config
le1nux Sep 15, 2024
522694d
feat: added information on missed out tokens percentages
le1nux Sep 15, 2024
55db9a2
feat: added warmstart tutorial
le1nux Sep 15, 2024
02b09f2
feat: updated components.md
le1nux Sep 15, 2024
d183fdd
chore: removed unnecessary math.ceil call
le1nux Sep 16, 2024
be0d424
chore: Merge branch 'main' into warmstart_infrastructure_switch
le1nux Sep 16, 2024
23bb463
chore: added short description for modalities in 15mins tutorial to R…
flxst Sep 16, 2024
e71f537
feat: added README to getting started tutorial
le1nux Sep 16, 2024
4756f39
chore: further shortened path explanations in jupyter notebook
le1nux Sep 16, 2024
c9c1ce4
Update README.md
le1nux Sep 16, 2024
658d0d0
refactor: renamed examples to tutorials
le1nux Sep 16, 2024
f70e7b1
chore: fixed type in variable name
le1nux Sep 16, 2024
f7923f7
Update README.md
le1nux Sep 16, 2024
17b7a0c
refactor: consistent usage of progress_subscriber name
le1nux Sep 16, 2024
9a3ff8c
chore: minor config renaming
le1nux Sep 16, 2024
2 changes: 1 addition & 1 deletion .gitignore
@@ -160,5 +160,5 @@ pyenv*
noteboks/*

tests/tmp/*
*wandb_storage*
.coverage/*
wandb_storage/
177 changes: 85 additions & 92 deletions README.md
@@ -20,14 +20,14 @@ Modalities is a PyTorch-native framework for distributed training of Large Langu

We successfully scaled Modalities up to 2048 GPUs on two HPC centers, namely [Leonardo Booster](https://leonardo-supercomputer.cineca.eu/hpc-system/) and [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), featuring Nvidia A100 and H100 GPUs, respectively. The results of our scaling experiments can be found [here](#scaling-experiments).

Besides its scalabilty, Modalities allows to seamlessly integrate new components and features, such as custom attention mechanisms, loss functions, optimizers or models. We provide a series of tutorials to help you get started with training and evaluating models using Modalities. We achieve this level of extensibility by having clear interfaces for each component type (e.g., model, optimizer, etc.), that a component must implement to be registered within in Modalities at runtime.
Besides its scalability, Modalities allows you to seamlessly integrate new components and features, such as custom attention mechanisms, loss functions, optimizers or models. We provide a series of tutorials to help you get started with training and evaluating models using Modalities. We achieve this level of extensibility through clear interfaces for each component type (e.g., model, optimizer, etc.) that a component must implement to be registered within Modalities at runtime.

## Getting Started
For training and evaluation a model, feel free to checkout [this](https://github.com/Modalities/modalities/blob/main/examples/getting_started/README.md) getting started tutorial, in which we train a small, 60M-parameter GPT model on a tiny subset of the Redpajama V2 dataset.
For training and evaluation of a model, feel free to check out [this](https://github.com/Modalities/modalities/blob/main/examples/getting_started/README.md) getting started tutorial, in which we train a small, 60M-parameter GPT model on a tiny subset of the Redpajama V2 dataset.

## Installation

There are two ways to install modalities. If you want to use the latest nightly version, or if you want to modify the code base itself, we recommend installing modalities directly from source.
There are two ways to install Modalities. If you want to use the latest nightly version, or if you want to modify the code base itself, we recommend installing Modalities directly from source.

If you want to use Modalities as a library and register your custom components with Modalities, you can install it directly via pip which provides you with the latest stable version.

@@ -69,7 +69,7 @@ pip install -e .

### Option 2: Installation via pip

To install modalities via pip, run
To install Modalities via pip, run

```sh
pip install torch
@@ -78,16 +78,39 @@ pip install modalities

Note that, here as well, torch has to be installed before installing Modalities due to flash attention's dependency management.


## Usage
For running the training endpoint on multiple GPUs run
```sh
CUDA_VISIBLE_DEVICES=2,3 torchrun --nnodes 1 --nproc_per_node 2 --rdzv-endpoint=0.0.0.0:29502 modalities run --config_file_path config_files/config.yaml
Modalities provides several entry points to interact with the framework. The following section lists the available entry points and their respective functionalities.


### Model Training

For model pretraining, we have to pass a configuration file that specifies the model architecture, optimizer, dataset, dataloader, and other training components. Additionally, we specify the number of nodes, the number of processes per node, and the rendezvous endpoint.

Example:
```sh
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --rdzv-endpoint localhost:29515 \
--nnodes 1 \
--nproc_per_node 4 \
$(which modalities) run --config_file_path configs/pretraining_config.yaml
```

In the example above, we use `torchrun` to run the training endpoint on two GPUs. The `--nnodes` argument specifies the number of nodes in the cluster, `--nproc_per_node` specifies the number of processes per node, and `--rdzv-endpoint` specifies the rendezvous endpoint. The `modalities run` command specifies the training endpoint, and `--config_file_path` specifies the path to the configuration file. The configuraton file contains the exhaustive parameterization for all the training components (e.g., dataset, model, optimize, etc.), making training fully reproducible. A full list of all the components already available in Modalities can be found [here](docs/components/components.md).
Explanation:

Or, if you are a VSCode user, add this to your `launch.json`
* `CUDA_VISIBLE_DEVICES=0,1,2,3`: This environment variable specifies which GPUs will be used for the job. In the example above, the four GPUs with IDs 0, 1, 2, 3 are selected for training.

* `torchrun`: This is a utility from PyTorch used to launch distributed training. It automatically manages multiple processes for distributed training.

* `--rdzv-endpoint localhost:29515`: Specifies the rendezvous endpoint. Here, localhost is the machine's address, and 29515 is the port. The rendezvous endpoint coordinates the processes involved in distributed training.

* `--nnodes 1`: Specifies the number of nodes to be used in the distributed setup. In the example above, a single-node setup is used.

* `--nproc_per_node 4`: This argument tells torchrun how many processes to launch on each node. In the example above, 4 processes are launched per node, corresponding to the 4 GPUs (IDs 0, 1, 2, 3) specified by CUDA_VISIBLE_DEVICES.

* `$(which modalities) run`: This part dynamically finds the path to the Modalities executable and runs it. The run command triggers the main process to start the training.

* `--config_file_path configs/pretraining_config.yaml`: The `--config_file_path` argument provides the path to the configuration file for the training job. In the example above, it is given by `configs/pretraining_config.yaml`. A configuration file contains an exhaustive parameterization for all the training components (e.g., dataset, model, optimizer, etc.), making training fully reproducible. An example configuration file can be found [here](examples/getting_started/example_config.yaml), and a complete list of components available in Modalities is provided [here](docs/components/components.md). The sketch after this list illustrates the general config-driven pattern.
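
A minimal sketch of this config-driven pattern: a YAML file is parsed and validated into typed settings. The class and field names below are invented for illustration and do not reflect Modalities' actual configuration schema.

```python
# Illustrative only: a generic YAML-to-typed-settings pattern, not Modalities' schema.
import yaml
from pydantic import BaseModel


class OptimizerSettings(BaseModel):  # hypothetical fields
    variant: str
    lr: float


class TrainingSettings(BaseModel):  # hypothetical fields
    num_training_steps: int
    sequence_length: int
    optimizer: OptimizerSettings


def load_config(path: str) -> TrainingSettings:
    """Parse a YAML config and validate it into typed settings."""
    with open(path, "r", encoding="utf-8") as f:
        raw = yaml.safe_load(f)
    return TrainingSettings(**raw)  # raises a validation error on malformed configs
```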

If you are a VSCode user, you may want to add this to your `launch.json`:
```json

{
@@ -96,28 +119,75 @@ Or, if you are a VSCode user, add this to your `launch.json`
"request": "launch",
"module": "torch.distributed.run",
"env": {
"CUDA_VISIBLE_DEVICES": "0"
"CUDA_VISIBLE_DEVICES": "0,1,2,3"
},
"args": [
"--nnodes",
"1",
"--nproc_per_node",
"2",
"--rdzv-endpoint=0.0.0.0:29503",
"--rdzv-endpoint=0.0.0.0:29515",
"src/modalities/__main__.py",
"run",
"--config_file_path",
"config_files/config.yaml",
"config_files/pretraining_config.yaml",
],
"console": "integratedTerminal",
"justMyCode": true,
"envFile": "${workspaceFolder}/.env"
}
```
which will allow you to run the training endpoint directly from VSCode and debug it.
It will allow you to run the training endpoint directly from VSCode and debug it.
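
For multi-GPU runs, another generic option (independent of Modalities and of VSCode's launch configuration) is to have one rank pause until a debugger attaches. Below is a minimal sketch using `debugpy`; the port choice and rank handling are illustrative assumptions, not part of the Modalities API.

```python
# Generic debugpy pattern for debugging one rank of a torchrun job; illustrative only.
import os

import debugpy


def wait_for_debugger(port: int = 5678) -> None:
    """Open a debug server on rank 0 and block until a debugger client attaches."""
    if int(os.environ.get("RANK", "0")) == 0:  # RANK is set by torchrun
        debugpy.listen(("0.0.0.0", port))
        print(f"Waiting for debugger to attach on port {port} ...")
        debugpy.wait_for_client()
```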

### Raw Training Dataset Indexation

The goal of the indexation process is to determine the starting byte position and length of each document in the raw data file. Subsequently, the index file is used to efficiently access the raw data during tokenization.

Example:
```sh
modalities data create_raw_index --index_path data/preprocessed/fineweb_edu_num_docs_483606.idx \
data/raw/fineweb_edu_num_docs_483606.jsonl
```

Explanation:

The `modalities data create_raw_index` command triggers the process of creating the index from the raw data. The `--index_path` argument specifies the location where the generated index file will be saved. In this example, the index will be stored at `data/preprocessed/fineweb_edu_num_docs_483606.idx`. The last part, i.e., `data/raw/fineweb_edu_num_docs_483606.jsonl` is the input file in JSONL (JSON Lines) format containing the raw data. The command will process this file to create the index.
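
Conceptually, such an index is a list of (byte offset, byte length) pairs, one per JSONL document, that later allows jumping straight to any document without scanning the whole file. The sketch below is a simplified illustration of that idea, not the actual Modalities index format or implementation; the pickle serialization and function names are assumptions.

```python
# Simplified illustration of a raw-data index; not the Modalities on-disk format.
import json
import pickle


def build_index(jsonl_path: str, index_path: str) -> None:
    """Store the starting byte offset and byte length of every document."""
    index: list[tuple[int, int]] = []
    offset = 0
    with open(jsonl_path, "rb") as f:
        for line in f:
            index.append((offset, len(line)))
            offset += len(line)
    with open(index_path, "wb") as f:
        pickle.dump(index, f)


def read_document(jsonl_path: str, offset: int, length: int) -> dict:
    """Jump directly to one document using its index entry."""
    with open(jsonl_path, "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length))
```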

### Raw Training Dataset Tokenization

Tokenization is the process of converting raw text data into a sequence of tokens that can be used as input to the model. Tokenization requires a configuration file that fully describes the process, making it reproducible. An example tokenization config can be found [here](examples/getting_started/example_dataset_config_train.yaml).

Example:
```sh
modalities data pack_encoded_data configs/tokenization_config.yaml
```
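
For intuition, the sketch below shows what this step produces conceptually: each document is tokenized and the resulting token ids are appended to one flat, packed sequence. The Hugging Face GPT-2 tokenizer, the EOS separator, and the uint32 binary output are assumptions for illustration; the actual tokenizer and packed-data format are defined by the tokenization config.

```python
# Conceptual sketch of packing tokenized documents; not Modalities' packed-data format.
import json

import numpy as np
from transformers import AutoTokenizer  # example tokenizer backend (assumption)


def pack_encoded_data(jsonl_path: str, out_path: str, text_key: str = "text") -> None:
    """Tokenize every document in a JSONL file and append the ids to one flat array."""
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    token_ids: list[int] = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            token_ids.extend(tokenizer.encode(doc[text_key]))
            token_ids.append(tokenizer.eos_token_id)  # mark the document boundary
    np.array(token_ids, dtype=np.uint32).tofile(out_path)  # packed binary output
```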

### Inference

For inference on a model checkpoint, we have to pass a configuration file that specifies the full inference setup. An example inference config can be found [here](examples/getting_started/example_text_generation_config.yaml).

Example:

```sh
modalities generate_text --config_file_path example_text_generation_config.yaml

```
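
Under the hood, text generation boils down to repeatedly picking (or sampling) the next token from the model's output distribution. The sketch below is a minimal greedy-decoding loop for intuition only, not the Modalities generation code; it assumes a model that returns logits of shape `[batch, seq_len, vocab_size]`.

```python
# Minimal greedy decoding loop for intuition; not the Modalities inference code.
import torch


@torch.no_grad()
def greedy_generate(model, input_ids: torch.Tensor, max_new_tokens: int = 50) -> torch.Tensor:
    """Append the most likely next token max_new_tokens times."""
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(input_ids)  # assumed output shape: [batch, seq_len, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids
```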

## Tutorials
Even though Modalities significantly simplifies LLM training, there is still some technical complexity left. We provide a series of tutorials to help you get started with training and evaluating models using Modalities.

- [Getting Started](examples/getting_started/README.md)</br>
Brief overview on how to get started with Modalities by training a small GPT model on a tiny subset of the Redpajama V2 dataset.

- [Library Usage](examples/library_usage/README.md)</br>
How to use Modalities as a library and register custom components with Modalities.

- [Modalities in 15mins](examples/modalities_in_15_mins/README.md) </br>
Train a dense model with Modalities in 15 minutes


## Supported Features
In the following, we list the already implemented, planned and in-progress features w.r.t. to improving downstream performance, throughput, multi-modality, and alignment.
In the following, we list the most important features of Modalities.

### Throughput Features

@@ -157,83 +227,6 @@ In the following, we list the already implemented, planned and in-progress featu
| Knowledge Distillation | planned | Transfers knowledge from a larger, complex model to a smaller, more efficient model, improving the smaller model's performance without the computational cost of the larger model.|
| Hyperparameter Optimization | planned | Grid search for various hyperparameter such as LR, Optimizer arguments etc. Also the integration of µP might be interesting |

## Tutorials
Even though Modalities significantly simplifies LLM training, there is still some technical complexity left. We provide a series of tutorials to help you get started with training and evaluating models using Modalities.

- [Getting Started](examples/getting_started/README.md)</br>
Brief overview on how to get started with Modalities by training a small GPT model on a tiny subset of the Redpajama V2 dataset.

- [Library Usage](examples/library_usage/README.md)</br>
How to use Modalities as a library and register custom components with Modalities.

- [Modalities in 15mins] </br>
Jupyter notebook will be added soon

## Entry Points
Modalities provides several entry points to interact with the framework. The following section lists the available entry points and their respective functionalities.


### Model training

For model pretraining, we have to pass a configuration file that specifies the model architecture, optimizer, dataset, dataloader, and other training components. Additionally, we specify the number of nodes, the number of processes per node, and the rendezvous endpoint.

Example:
```sh
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --rdzv-endpoint localhost:29515 \
--nnodes 1 \
--nproc_per_node 4 \
$(which modalities) run --config_file_path configs/pretraining_config.yaml
```

Explanation:

* `CUDA_VISIBLE_DEVICES=0,1,2,3`: This environment variable specifies which GPUs will be used for the job. In this case, GPUs with IDs 0, 1, 2, 3 are selected for training.

* `torchrun`: This is a utility from PyTorch used to launch distributed training. It automatically manages multiple processes for distributed training.

* `--rdzv-endpoint localhost:29515`: Specifies the rendezvous endpoint. Here, localhost is the machine's address, and 29515 is the port. The rendezvous endpoint coordinates the processes involved in distributed training.

* `--nnodes 1`: Specifies the number of nodes to be used in the distributed setup. Since this is a single-node setup, 1 is used.

* `--nproc_per_node 4`: This argument tells torchrun how many processes to launch on each node. In this case, 4 processes are launched per node, corresponding to the 4 GPUs (IDs 0, 1, 2, 3) specified by CUDA_VISIBLE_DEVICES.

* `$(which modalities) run`: This part dynamically finds the path to the Modalities executable and runs it. The run command triggers the main process to start the training.

* `--config_file_path configs/pretraining_config.yaml`: The --config_file_path argument provides the path to the configuration file for the training job. In this example, the configuration is provided in configs/pretraining_config.yaml, which includes settings like model architecture, optimizer, dataset, dataloader and other training components. An example config file can be found [here](examples/getting_started/example_config.yaml).

### Raw Training Dataset Indexation

The goal of the indexation process is to determine the starting byte position and length of each document in the raw data file. Subsequently, the index file is used to efficiently access the raw data during tokenization.

Example:
```sh
modalities data create_raw_index --index_path data/preprocessed/fineweb_edu_num_docs_483606.idx \
data/raw/fineweb_edu_num_docs_483606.jsonl
```

Explanation:

The `modalities data create_raw_index` command triggers the process of creating the index from the raw data. The `--index_path` argument specifies the location where the generated index file will be saved. In this example, the index will be stored at `data/preprocessed/fineweb_edu_num_docs_483606.idx`. The last part, i.e., `data/raw/fineweb_edu_num_docs_483606.jsonl` is the input file in JSONL (JSON Lines) format containing the raw data. The command will process this file to create the index.

### Raw Training Dataset Tokenization

Tokenization is the process of converting raw text data into a sequence of tokens that can be used as input to the model. The tokenization requires a configuration file, fully describing the tokenization process, making it fully reproducible. An example tokenization config can be found [here](examples/getting_started/example_dataset_config_train.yaml).

Example:
```sh
modalities data pack_encoded_data configs/tokenization_config.yaml
```

### Inference

For inference on a model checkpoint, we have to pass a configuration file that specifies the full inference setup. An example inference config can be found [here](examples/getting_started/example_text_generation_config.yaml).

Example:

```sh
modalities generate_text --config_file_path example_text_generation_config.yaml

```

## Scaling Experiments
