feat: group-query-attention implementation #41

Closed · wants to merge 9 commits into from
7 changes: 7 additions & 0 deletions .github/workflows/tests.yml
@@ -27,4 +27,11 @@ jobs:
      - name: Run tests
        run: |
          pytest
      - name: Upload coverage data to coveralls.io
        run: |
          python -m pip install coveralls[toml]
          coveralls --service=github
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}


1 change: 1 addition & 0 deletions .gitignore
@@ -55,6 +55,7 @@ htmlcov/
.cache
nosetests.xml
coverage.xml
coverage_html_report
*.cover
.hypothesis/
.pytest_cache/
1 change: 1 addition & 0 deletions CONTRIBUTING.md
@@ -31,6 +31,7 @@ pre-commit install --install-hooks

- Make sure your code passes all the tests and pre-commit hooks. Use `pytest` from within the root of your local repository.

- For VS Code users, disable pytest coverage in `settings.json` to enable pytest debugging: `"python.testing.pytestArgs": ["--no-cov"]` (see the sketch below)

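A minimal `settings.json` sketch along those lines (a hypothetical workspace-settings fragment, assuming coverage is otherwise enabled through the project's pytest configuration):

```json
{
    "python.testing.pytestArgs": ["--no-cov"]
}
```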
## Commit Guidelines

28 changes: 15 additions & 13 deletions README.md
@@ -1,9 +1,10 @@
# Modalities

[![Coverage Status](https://coveralls.io/repos/github/Modalities/modalities/badge.svg)](https://coveralls.io/github/Modalities/modalities)

# Getting started
For training and evaluating a model, feel free to check out [this](https://github.com/Modalities/modalities/blob/main/examples/getting_started/getting_started_example.md) getting-started tutorial, in which we train a small, 60M-parameter GPT model on a tiny subset of the Redpajama V2 dataset.
Also, see our WIki and API reference documentation: https://modalities.github.io/modalities/
Also, see our Wiki and API reference documentation: https://modalities.github.io/modalities/

# Installation

@@ -19,7 +20,7 @@ then, install the repository via
pip install -e .
```

If you want to contribute, have look at `CONTRIBUTING.md`.
If you want to contribute, have a look at `CONTRIBUTING.md`.



@@ -56,12 +57,12 @@ Or, if you are a VsCode user, add this to your `launch.json`:

# Pydantic and ClassResolver

The mechanismn introduced to instantiate classes via `type_hint` in the `config.yaml`, utilizes
The mechanism introduced to instantiate classes via `type_hint` in the `config.yaml`, utilizes
1) Omegaconf to load the config yaml file
2) Pydantic for the validation of the config
3) ClassResolver to instantiate the correct, concrete class of a class hierarchy.

Firstly, Omegaconf loads the config yaml file and resolves internal refrences such as `${subconfig.attribue}`.
Firstly, Omegaconf loads the config yaml file and resolves internal references such as `${subconfig.attribute}`.
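As a minimal sketch of this resolution step (hypothetical keys, not the repository's actual config):

```python
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {"data": {"sequence_len": 1024},
     "model": {"block_size": "${data.sequence_len}"}}
)
# the interpolation is resolved on conversion (or on attribute access)
assert OmegaConf.to_container(cfg, resolve=True)["model"]["block_size"] == 1024
```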

Then, Pydantic validates the whole config as-is and checks that each of the sub-configs is a `pydantic.BaseModel` class.
For configs that allow different concrete classes to be instantiated by `ClassResolver`, the special member names `type_hint` and `config` are introduced.
@@ -79,7 +80,7 @@ activation_kwargs={...}
activation_resolver.make(type_hint, activation_kwargs),
```

In our implmentation we go a step further, as both,
In our implementation we go a step further, as both,
* a `type_hint` in a `BaseModel` config must be of type `modalities.config.lookup_types.LookupEnum` and
* `config` is a union of allowed concrete configs of base type `BaseModel`.
`config` hereby replaces `activation_kwargs` from the example above with pydantic-validated `BaseModel` configs.
@@ -88,7 +89,8 @@ With this, a mapping between type hint strings needed for `class-resolver`, and

```python
from enum import Enum
from pydantic import BaseModel, PositiveInt, PositiveFloat, conint, confloat
from typing import Annotated
from pydantic import BaseModel, PositiveInt, PositiveFloat, Field

class LookupEnum(Enum):
    @classmethod
@@ -101,8 +103,8 @@ class SchedulerTypes(LookupEnum):
    ConstantLR = torch.optim.lr_scheduler.ConstantLR

class StepLRConfig(BaseModel):
    step_size: conint(ge=1)
    gamma: confloat(ge=0.0)
    step_size: Annotated[int, Field(strict=True, ge=1)]
    gamma: Annotated[float, Field(strict=True, ge=0.0)]


class ConstantLRConfig(BaseModel):
@@ -115,7 +117,7 @@ class SchedulerConfig(BaseModel):
    config: StepLRConfig | ConstantLRConfig
```
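With pydantic v2, the `Annotated`/`Field` constraints above reject out-of-range values at validation time. A small usage sketch (hypothetical values):

```python
from pydantic import ValidationError

StepLRConfig(step_size=1, gamma=0.1)      # passes validation
try:
    StepLRConfig(step_size=0, gamma=0.1)  # violates ge=1
except ValidationError as err:
    print(err.errors()[0]["type"])        # 'greater_than_equal'
```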

To allow a user-friendly instantiation, all class resolvers are defined in the `ResolverRegistry` and `build_component_by_config` as convenience function is introduced. Dependecies can be passed-through with the `extra_kwargs` argument:
To allow a user-friendly instantiation, all class resolvers are defined in the `ResolverRegistry`, and `build_component_by_config` is introduced as a convenience function. Dependencies can be passed through with the `extra_kwargs` argument:
```python
resolvers = ResolverRegister(config=config)
optimizer = ... # our example dependency
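# (snippet truncated in the diff view; a plausible continuation, based on the
# `build_component_by_config` / `extra_kwargs` description above, could be:)
scheduler = resolvers.build_component_by_config(
    config=config.scheduler, extra_kwargs=dict(optimizer=optimizer)
)
```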
@@ -187,20 +189,20 @@ Alternatively, directly use `src/modalities/__main__.py do_stuff --config_file_p
The `MemMapDataset` requires an index file providing the necessary pointers into the raw data file. The `MemMapDataset` can create the index file lazily; however, it is advised to create it beforehand. This can be done by running

```sh
modalities create_memmap_index <path/to/jsonl/file>
modalities data create_raw_index <path/to/jsonl/file>
```

The index will be created in the same directory as the raw data file. For further options you may look into the usage documentation via `modalities create_memmap_index --help`.
The index will be created in the same directory as the raw data file. For further options you may look into the usage documentation via `modalities data create_raw_index --help`.

## Packed Dataset Generator

The `PackedMemMapDatasetContinuous` and `PackedMemMapDatasetMegatron` require a packed data file. To create the data file, you first have to generate a `MemMapDataset` index file as described [above](#memmapdataset-index-generator). Assuming the index and raw data are located in the same directory, you can simply execute the following command:

```sh
modalities create_packed_data <path/to/jsonl/file>
modalities data pack_encoded_data <path/to/jsonl/file>
```

The packed data file will be created in the same directory as the raw data file. For further options you may look into the usage documentation via `modalities create_packed_data --help`.
The packed data file will be created in the same directory as the raw data file. For further options you may look into the usage documentation via `modalities data pack_encoded_data --help`.
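Putting both steps together, a typical preparation flow might look as follows (output file names inferred from the benchmark script in this PR; treat them as illustrative):

```sh
modalities data create_raw_index data/sample.jsonl   # writes data/sample.idx
modalities data pack_encoded_data data/sample.jsonl  # writes data/sample.pbin
```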

### Packed Data Format

77 changes: 77 additions & 0 deletions benchmarks/dataloader/README.md
@@ -0,0 +1,77 @@
# Benchmarking of Dataset Implementations

## Motivation
We want to include a storage-efficient, fast, and generic dataset implementation in this repository.
Previous work and ideas were based on MegatronLM and its dataset implementation.

Unfortunately, its usage is quite opaque and it regularly causes unexpected side effects.
Those problems are hard to trace, as we are not the original authors of the code.

Therefore we want to provide our own implementation, which comes with all the benefits mentioned above.
Most importantly, it should be at least as fast as MegatronLM's implementation.


## Benchmark Overview

We want to evaluate multiple aspects of the dataset implementations:
* preparation speed - all datasets need some initial preparation steps, such as tokenization and indexing.
* initialization speed - the time to construct a respective `Dataset` object inside the code.
* iteration speed - the time to access elements (in random order) in the respective datasets.


## Used Example Dataset

The experiments were conducted on a small sample of OpenWebText. The data is provided in `.jsonl` format.
The relevant payload lives under the `"text"` key and is text-only.
Each dataset with X samples refers to the first X lines of the full OpenWebText data,
as it can be obtained from Hugging Face.


## Experimental Setup

We relied on the functions provided in `launch_benchmark.sh`. One can reproduce the measurements by calling, e.g.:

```shell
. launch_benchmark.sh

INPUT_DIR=<path-to-your-example-dataset.jsonl>

echo "MegatronLM:"
measure_megatronLM_iteration
echo "Modalities:"
measure_modalities_iteration
```

> For launching the preparation of MegatronLM's dataset, refer to:
> https://github.com/OpenGPTX/opengptx_data/tree/docs/modalities-vs-megatronlm-dl and look at the `launch_benchmark.sh`
> script.

### Glossary

* **preparation:** refers here to the task of turning raw data (e.g. jsonl-encoded text) into a binary file,
  which can be loaded later for training.
  For MegatronLM this means tokenizing and packing everything according to their defined format.
  For Modalities it means indexing the raw data and afterwards packing it as token IDs.
* **initialization:** refers to the process of initializing a Python object
  which represents the respective dataset (mostly exposed via the `torch.Dataset` interface).
* **iteration:** refers to the process of iterating over the respective datasets - once sequentially and once shuffled.

## Results


| Evaluation Aspect     | Implementation |   Required Time    | # Samples in Data |
|-----------------------|----------------|:------------------:|-------------------|
| preparation speed     | MegatronLM     | `0 min 16.965 sec` | `20000 (OWT)`     |
| preparation speed     | Modalities     | `0 min 13.904 sec` | `20000 (OWT)`     |
| preparation speed     | MegatronLM     | `2 min 11.856 sec` | `200000 (OWT)`    |
| preparation speed     | Modalities     | `0 min 38.738 sec` | `200000 (OWT)`    |
| initialization speed  | MegatronLM     | `19.3 msec`        | `20000 (OWT)`     |
| initialization speed  | Modalities     | `5.85 msec`        | `20000 (OWT)`     |
| initialization speed  | MegatronLM     | `180 msec`         | `200000 (OWT)`    |
| initialization speed  | Modalities     | `58 msec`          | `200000 (OWT)`    |
| iteration speed       | MegatronLM     | `52.4 msec`        | `20000 (OWT)`     |
| iteration speed       | Modalities     | `66.8 msec`        | `20000 (OWT)`     |
| iteration speed       | MegatronLM     | `426 msec`         | `200000 (OWT)`    |
| iteration speed       | Modalities     | `545 msec`         | `200000 (OWT)`    |


87 changes: 87 additions & 0 deletions benchmarks/dataloader/launch_benchmark.sh
@@ -0,0 +1,87 @@
#!/bin/bash



# Path to an example .jsonl dataset; adjust before sourcing and running the benchmarks.
INPUT_DIR="/tmp/i-do-not-exist.jsonl"


measure_modalities_preparation() {
  time (
    set -e
    test -f "$INPUT_DIR"
    rm -f "${INPUT_DIR/.jsonl/.idx}"
    modalities data create_raw_index "$INPUT_DIR" &> /dev/null
    echo "finished memmap index creation"
    rm -f "${INPUT_DIR/.jsonl/.pbin}"
    modalities data pack_encoded_data "$INPUT_DIR" &> /dev/null
    echo "finished memmap packing"
  )
}


measure_modalities_initialization() {
  input_file=${INPUT_DIR/.jsonl/.pbin}
  python -m timeit -n 50 -r 5 -s "
import sys, io
null_device = io.StringIO()
from modalities.dataloader.dataset import PackedMemMapDatasetMegatron
from pathlib import Path
p = Path(\"${input_file}\")
" -- "
sys.stdout = null_device  # deactivate stdout to avoid getting spammed
PackedMemMapDatasetMegatron(raw_data_path=p, block_size=1024, sample_key=\"sample\")
sys.stdout = sys.__stdout__  # reactivate stdout for timeit
"
}

measure_megatronLM_initialization() {
  input_file="${INPUT_DIR/.jsonl/.megLM.bin_text_document}"
  python -m timeit -n 50 -r 5 -s "
import sys, io
null_device = io.StringIO()
from modalities.dataloader.open_gptx_dataset.mmap_dataset import MMapIndexedDataset
p = \"${input_file}\"
" -- "
sys.stdout = null_device  # deactivate stdout to avoid getting spammed
MMapIndexedDataset(p)
sys.stdout = sys.__stdout__  # reactivate stdout for timeit
"
}

measure_modalities_iteration() {
  input_file=${INPUT_DIR/.jsonl/.pbin}
  python -m timeit -n 5 -r 3 -s "
import random, sys, io
null_device = io.StringIO()
from modalities.dataloader.dataset import PackedMemMapDatasetMegatron
from pathlib import Path
p = Path(\"${input_file}\")
sys.stdout = null_device  # deactivate stdout to avoid getting spammed
dataset = PackedMemMapDatasetMegatron(raw_data_path=p, block_size=1024, sample_key=\"sample\")
random_indices = random.sample(range(len(dataset)), len(dataset))
sys.stdout = sys.__stdout__  # reactivate stdout for timeit
" -- "
list(dataset)  # sequential access
for i in random_indices:
    dataset[i]  # shuffled access
"
}


measure_megatronLM_iteration() {
  input_file="${INPUT_DIR/.jsonl/.megLM.bin_text_document}"
  python -m timeit -n 5 -r 3 -s "
import random, sys, io
null_device = io.StringIO()
from modalities.dataloader.open_gptx_dataset.mmap_dataset import MMapIndexedDataset
p = \"${input_file}\"
sys.stdout = null_device  # deactivate stdout to avoid getting spammed
dataset = MMapIndexedDataset(p)
random_indices = random.sample(range(len(dataset)), len(dataset))
sys.stdout = sys.__stdout__  # reactivate stdout for timeit
" -- "
list(dataset)  # sequential access
for i in random_indices:
    dataset[i]  # shuffled access
"
}
8 changes: 3 additions & 5 deletions config_files/config.yaml
@@ -142,15 +142,13 @@ model:
  prediction_key: "logits"
  block_size: ${data.sequence_len}
  vocab_size: 50304 # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
  n_layer: 12
  n_head: 12
  n_layer_q: 12
  n_head_kv: 12
  ffn_hidden: 2048
  n_embd: 768
  dropout: 0.0
  bias: true # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
  attention:
    attention_type: pytorch_flash_attention
    scaling_factor: 3
  attention_type: pytorch_flash_attention
  activation: gelu
  epsilon: 1e-5
  weight_init:
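For context on the `n_head_kv` setting: grouped-query attention (GQA) lets several query heads share one key/value head. A minimal sketch of the idea (hypothetical shapes and names, not this PR's actual implementation; assumes the number of query heads is a multiple of `n_head_kv`):

```python
import torch

def grouped_query_attention(q, k, v, n_head_q, n_head_kv):
    # q: (B, T, n_head_q, d); k, v: (B, T, n_head_kv, d)
    B, T, _, d = q.shape
    group_size = n_head_q // n_head_kv
    # replicate each key/value head for its group of query heads
    k = k.repeat_interleave(group_size, dim=2)
    v = v.repeat_interleave(group_size, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (B, H, T, d)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5     # (B, H, T, T)
    out = scores.softmax(dim=-1) @ v                  # (B, H, T, d)
    return out.transpose(1, 2).reshape(B, T, n_head_q * d)
```

With `n_head_kv == n_head_q` this reduces to standard multi-head attention; with `n_head_kv == 1` it becomes multi-query attention.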