feat: group-query-attention implementation #41

Closed · wants to merge 9 commits into from
7 changes: 7 additions & 0 deletions .github/workflows/tests.yml
@@ -27,4 +27,11 @@ jobs:
      - name: Run tests
        run: |
          pytest
      - name: Upload coverage data to coveralls.io
        run: |
          python -m pip install coveralls[toml]
          coveralls --service=github
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}


1 change: 1 addition & 0 deletions .gitignore
@@ -55,6 +55,7 @@ htmlcov/
.cache
nosetests.xml
coverage.xml
coverage_html_report
*.cover
.hypothesis/
.pytest_cache/
1 change: 1 addition & 0 deletions CONTRIBUTING.md
@@ -31,6 +31,7 @@ pre-commit install --install-hooks

- Make sure your code passes all the tests and pre-commit hooks. Use `pytest` from within the root of your local repository.

- For VS Code users, disable pytest coverage in `settings.json` to enable pytest debugging: `"python.testing.pytestArgs": ["--no-cov"]` (see the sketch below)

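A minimal `settings.json` sketch along those lines (a hypothetical workspace-settings fragment, assuming coverage is otherwise enabled through the project's pytest configuration):

```json
{
    "python.testing.pytestArgs": ["--no-cov"]
}
```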
## Commit Guidelines

28 changes: 15 additions & 13 deletions README.md
@@ -1,9 +1,10 @@
# Modalities

[![Coverage Status](https://coveralls.io/repos/github/Modalities/modalities/badge.svg)](https://coveralls.io/github/Modalities/modalities)

# Getting started
For training and evaluating a model, feel free to check out [this](https://github.com/Modalities/modalities/blob/main/examples/getting_started/getting_started_example.md) getting-started tutorial, in which we train a small, 60M-parameter GPT model on a tiny subset of the Redpajama V2 dataset.
Also, see our WIki and API reference documentation: https://modalities.github.io/modalities/
Also, see our Wiki and API reference documentation: https://modalities.github.io/modalities/

# Installation

@@ -19,7 +20,7 @@ then, install the repository via
pip install -e .
```

If you want to contribute, have look at `CONTRIBUTING.md`.
If you want to contribute, have a look at `CONTRIBUTING.md`.



@@ -56,12 +57,12 @@ Or, if you are a VsCode user, add this to your `launch.json`:

# Pydantic and ClassResolver

The mechanismn introduced to instantiate classes via `type_hint` in the `config.yaml`, utilizes
The mechanism introduced to instantiate classes via `type_hint` in the `config.yaml`, utilizes
1) Omegaconf to load the config yaml file
2) Pydantic for the validation of the config
3) ClassResolver to instantiate the correct, concrete class of a class hierarchy.

Firstly, Omegaconf loads the config yaml file and resolves internal refrences such as `${subconfig.attribue}`.
Firstly, Omegaconf loads the config yaml file and resolves internal references such as `${subconfig.attribute}`.
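As a minimal sketch of this resolution step (hypothetical keys, not the repository's actual config):

```python
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {"data": {"sequence_len": 1024},
     "model": {"block_size": "${data.sequence_len}"}}
)
# the interpolation is resolved on conversion (or on attribute access)
assert OmegaConf.to_container(cfg, resolve=True)["model"]["block_size"] == 1024
```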

Then, Pydantic validates the whole config as-is and checks that each of the sub-configs is a `pydantic.BaseModel` class.
For configs that allow different concrete classes to be instantiated by `ClassResolver`, the special member names `type_hint` and `config` are introduced.
@@ -79,7 +80,7 @@ activation_kwargs={...}
activation_resolver.make(type_hint, activation_kwargs),
```

In our implmentation we go a step further, as both,
In our implementation we go a step further, as both,
* a `type_hint` in a `BaseModel` config must be of type `modalities.config.lookup_types.LookupEnum` and
* `config` is a union of allowed concrete configs of base type `BaseModel`.
`config` hereby replaces `activation_kwargs` from the example above with pydantic-validated `BaseModel` configs.
@@ -88,7 +89,8 @@ With this, a mapping between type hint strings needed for `class-resolver`, and

```python
from enum import Enum
from pydantic import BaseModel, PositiveInt, PositiveFloat, conint, confloat
from typing import Annotated
from pydantic import BaseModel, PositiveInt, PositiveFloat, Field

class LookupEnum(Enum):
    @classmethod
@@ -101,8 +103,8 @@ class SchedulerTypes(LookupEnum):
    ConstantLR = torch.optim.lr_scheduler.ConstantLR

class StepLRConfig(BaseModel):
    step_size: conint(ge=1)
    gamma: confloat(ge=0.0)
    step_size: Annotated[int, Field(strict=True, ge=1)]
    gamma: Annotated[float, Field(strict=True, ge=0.0)]


class ConstantLRConfig(BaseModel):
@@ -115,7 +117,7 @@ class SchedulerConfig(BaseModel):
    config: StepLRConfig | ConstantLRConfig
```
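With pydantic v2, the `Annotated`/`Field` constraints above reject out-of-range values at validation time. A small usage sketch (hypothetical values):

```python
from pydantic import ValidationError

StepLRConfig(step_size=1, gamma=0.1)      # passes validation
try:
    StepLRConfig(step_size=0, gamma=0.1)  # violates ge=1
except ValidationError as err:
    print(err.errors()[0]["type"])        # 'greater_than_equal'
```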

To allow a user-friendly instantiation, all class resolvers are defined in the `ResolverRegistry` and `build_component_by_config` as convenience function is introduced. Dependecies can be passed-through with the `extra_kwargs` argument:
To allow a user-friendly instantiation, all class resolvers are defined in the `ResolverRegistry`, and `build_component_by_config` is introduced as a convenience function. Dependencies can be passed through with the `extra_kwargs` argument:
```python
resolvers = ResolverRegister(config=config)
optimizer = ... # our example dependency
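# (snippet truncated in the diff view; a plausible continuation, based on the
# `build_component_by_config` / `extra_kwargs` description above, could be:)
scheduler = resolvers.build_component_by_config(
    config=config.scheduler, extra_kwargs=dict(optimizer=optimizer)
)
```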
@@ -187,20 +189,20 @@ Alternatively, directly use `src/modalities/__main__.py do_stuff --config_file_p
The `MemMapDataset` requires an index file providing the necessary pointers into the raw data file. The `MemMapDataset` can create the index file lazily; however, it is advised to create it beforehand. This can be done by running

```sh
modalities create_memmap_index <path/to/jsonl/file>
modalities data create_raw_index <path/to/jsonl/file>
```

The index will be created in the same directory as the raw data file. For further options you may look into the usage documentation via `modalities create_memmap_index --help`.
The index will be created in the same directory as the raw data file. For further options you may look into the usage documentation via `modalities data create_raw_index --help`.

## Packed Dataset Generator

The `PackedMemMapDatasetContinuous` and `PackedMemMapDatasetMegatron` require a packed data file. To create the data file, you first have to generate a `MemMapDataset` index file as described [above](#memmapdataset-index-generator). Assuming the index and raw data are located in the same directory, you can simply execute the following command:

```sh
modalities create_packed_data <path/to/jsonl/file>
modalities data pack_encoded_data <path/to/jsonl/file>
```

The packed data file will be created in the same directory as the raw data file. For further options you may look into the usage documentation via `modalities create_packed_data --help`.
The packed data file will be created in the same directory as the raw data file. For further options you may look into the usage documentation via `modalities data pack_encoded_data --help`.
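Putting both steps together, a typical preparation flow might look as follows (output file names inferred from the benchmark script in this PR; treat them as illustrative):

```sh
modalities data create_raw_index data/sample.jsonl   # writes data/sample.idx
modalities data pack_encoded_data data/sample.jsonl  # writes data/sample.pbin
```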

### Packed Data Format

77 changes: 77 additions & 0 deletions benchmarks/dataloader/README.md
@@ -0,0 +1,77 @@
# Benchmarking of Dataset Implementations

## Motivation
We want to include a storage-efficient, fast, and generic dataset implementation in this repository.
Previous work and ideas were based on MegatronLM and its dataset implementation.

Unfortunately, its usage is quite opaque and it regularly causes unexpected side effects.
Those problems are hard to trace, as we are not the original authors of the code.

Therefore we want to provide our own implementation, which comes with all the benefits mentioned above.
Most importantly, it should be at least as fast as MegatronLM's implementation.


## Benchmark Overview

We want to evaluate multiple aspects of the dataset implementations:
* preparation speed - all datasets need some initial preparation steps, such as tokenization and indexing.
* initialization speed - the time to construct a respective `Dataset` object inside the code.
* iteration speed - the time to access elements (in random order) in the respective datasets.


## Used Example Dataset

The experiments were conducted on a small sample of OpenWebText. The data is provided in `.jsonl` format.
The relevant payload lives under the `"text"` key and is text-only.
Each dataset with X samples refers to the first X lines of the full OpenWebText data,
as it can be obtained from Hugging Face.


## Experimental Setup

We relied on the functions provided in `launch_benchmark.sh`. One can reproduce the measurements by calling, e.g.:

```shell
. launch_benchmark.sh

INPUT_DIR=<path-to-your-example-dataset.jsonl>

echo "MegatronLM:"
measure_megatronLM_iteration
echo "Modalities:"
measure_modalities_iteration
```

> For launching the preparation of MegatronLM's dataset, refer to:
> https://github.com/OpenGPTX/opengptx_data/tree/docs/modalities-vs-megatronlm-dl and look at the `launch_benchmark.sh`
> script.

### Glossary

* **preparation:** refers here to the task of turning raw data (e.g. jsonl-encoded text) into a binary file,
  which can be loaded later for training.
  For MegatronLM this means tokenizing and packing everything according to their defined format.
  For Modalities it means indexing the raw data and afterwards packing it as token IDs.
* **initialization:** refers to the process of initializing a Python object
  which represents the respective dataset (mostly exposed via the `torch.Dataset` interface).
* **iteration:** refers to the process of iterating over the respective datasets - once sequentially and once shuffled.

## Results


| Evaluation Aspect     | Implementation |   Required Time    | # Samples in Data |
|-----------------------|----------------|:------------------:|-------------------|
| preparation speed     | MegatronLM     | `0 min 16.965 sec` | `20000 (OWT)`     |
| preparation speed     | Modalities     | `0 min 13.904 sec` | `20000 (OWT)`     |
| preparation speed     | MegatronLM     | `2 min 11.856 sec` | `200000 (OWT)`    |
| preparation speed     | Modalities     | `0 min 38.738 sec` | `200000 (OWT)`    |
| initialization speed  | MegatronLM     | `19.3 msec`        | `20000 (OWT)`     |
| initialization speed  | Modalities     | `5.85 msec`        | `20000 (OWT)`     |
| initialization speed  | MegatronLM     | `180 msec`         | `200000 (OWT)`    |
| initialization speed  | Modalities     | `58 msec`          | `200000 (OWT)`    |
| iteration speed       | MegatronLM     | `52.4 msec`        | `20000 (OWT)`     |
| iteration speed       | Modalities     | `66.8 msec`        | `20000 (OWT)`     |
| iteration speed       | MegatronLM     | `426 msec`         | `200000 (OWT)`    |
| iteration speed       | Modalities     | `545 msec`         | `200000 (OWT)`    |


87 changes: 87 additions & 0 deletions benchmarks/dataloader/launch_benchmark.sh
@@ -0,0 +1,87 @@
#!/bin/bash



# Path to an example .jsonl dataset; adjust before sourcing and running the benchmarks.
INPUT_DIR="/tmp/i-do-not-exist.jsonl"


measure_modalities_preparation() {
  time (
    set -e
    test -f "$INPUT_DIR"
    rm -f "${INPUT_DIR/.jsonl/.idx}"
    modalities data create_raw_index "$INPUT_DIR" &> /dev/null
    echo "finished memmap index creation"
    rm -f "${INPUT_DIR/.jsonl/.pbin}"
    modalities data pack_encoded_data "$INPUT_DIR" &> /dev/null
    echo "finished memmap packing"
  )
}


measure_modalities_initialization() {
  input_file=${INPUT_DIR/.jsonl/.pbin}
  python -m timeit -n 50 -r 5 -s "
import sys, io
null_device = io.StringIO()
from modalities.dataloader.dataset import PackedMemMapDatasetMegatron
from pathlib import Path
p = Path(\"${input_file}\")
" -- "
sys.stdout = null_device  # deactivate stdout to avoid getting spammed
PackedMemMapDatasetMegatron(raw_data_path=p, block_size=1024, sample_key=\"sample\")
sys.stdout = sys.__stdout__  # reactivate stdout for timeit
"
}

measure_megatronLM_initialization() {
  input_file="${INPUT_DIR/.jsonl/.megLM.bin_text_document}"
  python -m timeit -n 50 -r 5 -s "
import sys, io
null_device = io.StringIO()
from modalities.dataloader.open_gptx_dataset.mmap_dataset import MMapIndexedDataset
p = \"${input_file}\"
" -- "
sys.stdout = null_device  # deactivate stdout to avoid getting spammed
MMapIndexedDataset(p)
sys.stdout = sys.__stdout__  # reactivate stdout for timeit
"
}

measure_modalities_iteration() {
  input_file=${INPUT_DIR/.jsonl/.pbin}
  python -m timeit -n 5 -r 3 -s "
import random, sys, io
null_device = io.StringIO()
from modalities.dataloader.dataset import PackedMemMapDatasetMegatron
from pathlib import Path
p = Path(\"${input_file}\")
sys.stdout = null_device  # deactivate stdout to avoid getting spammed
dataset = PackedMemMapDatasetMegatron(raw_data_path=p, block_size=1024, sample_key=\"sample\")
random_indices = random.sample(range(len(dataset)), len(dataset))
sys.stdout = sys.__stdout__  # reactivate stdout for timeit
" -- "
list(dataset)  # sequential access
for i in random_indices:
    dataset[i]  # shuffled access
"
}


measure_megatronLM_iteration() {
  input_file="${INPUT_DIR/.jsonl/.megLM.bin_text_document}"
  python -m timeit -n 5 -r 3 -s "
import random, sys, io
null_device = io.StringIO()
from modalities.dataloader.open_gptx_dataset.mmap_dataset import MMapIndexedDataset
p = \"${input_file}\"
sys.stdout = null_device  # deactivate stdout to avoid getting spammed
dataset = MMapIndexedDataset(p)
random_indices = random.sample(range(len(dataset)), len(dataset))
sys.stdout = sys.__stdout__  # reactivate stdout for timeit
" -- "
list(dataset)  # sequential access
for i in random_indices:
    dataset[i]  # shuffled access
"
}
8 changes: 3 additions & 5 deletions config_files/config.yaml
@@ -142,15 +142,13 @@ model:
  prediction_key: "logits"
  block_size: ${data.sequence_len}
  vocab_size: 50304 # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
  n_layer: 12
  n_head: 12
  n_layer_q: 12
  n_head_kv: 12
  ffn_hidden: 2048
  n_embd: 768
  dropout: 0.0
  bias: true # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
  attention:
    attention_type: pytorch_flash_attention
    scaling_factor: 3
  attention_type: pytorch_flash_attention
  activation: gelu
  epsilon: 1e-5
  weight_init:
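For context on the `n_head_kv` setting: grouped-query attention (GQA) lets several query heads share one key/value head. A minimal sketch of the idea (hypothetical shapes and names, not this PR's actual implementation; assumes the number of query heads is a multiple of `n_head_kv`):

```python
import torch

def grouped_query_attention(q, k, v, n_head_q, n_head_kv):
    # q: (B, T, n_head_q, d); k, v: (B, T, n_head_kv, d)
    B, T, _, d = q.shape
    group_size = n_head_q // n_head_kv
    # replicate each key/value head for its group of query heads
    k = k.repeat_interleave(group_size, dim=2)
    v = v.repeat_interleave(group_size, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (B, H, T, d)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5     # (B, H, T, T)
    out = scores.softmax(dim=-1) @ v                  # (B, H, T, d)
    return out.transpose(1, 2).reshape(B, T, n_head_q * d)
```

With `n_head_kv == n_head_q` this reduces to standard multi-head attention; with `n_head_kv == 1` it becomes multi-query attention.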