Merge pull request #112 from Modalities/fix-getting-started-example
Fix getting started example
le1nux committed Apr 29, 2024
2 parents 7a8e495 + 6c86006 commit bd06dab
Showing 6 changed files with 432 additions and 139 deletions.
85 changes: 63 additions & 22 deletions examples/getting_started/README.md
As a reference, this example has the following folder structure. Folders in <> will be created during the run.
```
│ └── redpajama_v2_samples_512_train.jsonl
├── getting_started_example.md
├── tokenizer
│ ├── tokenizer.json
│ └── tokenizer_config.json
└── wandb
└── <wandb_logs>
```

## 1. Preprocessing

### Raw Data Format
A single line of the Redpajama V2 JSONL file has the structure denoted below. Since we are not interested in the metadata and quality signals for this minimal example, we use only the `raw_content` field from each line, without any filtering, for model training.
```json
{
    "raw_content": "...",
    "meta": "...",
    "quality_signals": "..."
}
```
The two raw dataset splits for training and evaluation can be found in
The two raw dataset splits for training and evaluation can be found in
`data/raw/redpajama_v2_samples_512_train.jsonl` and `data/raw/redpajama_v2_samples_512_test.jsonl`
and need to be preprocessed into the [MemMap dataset format](https://github.com/Modalities/modalities/blob/main/src/modalities/dataloader/dataset.py).
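As a quick sanity check, we can peek at the first raw sample with a few lines of Python (a hedged sketch; only the `raw_content` field is taken from the format above):

```python
import json

# Read and parse the first line of the raw training split.
with open("data/raw/redpajama_v2_samples_512_train.jsonl") as f:
    sample = json.loads(f.readline())

print(sample["raw_content"][:80])  # first characters of the raw text
```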

### Config File
To do so, we employ the `example_dataset_config_train.yaml` and `example_dataset_config_test.yaml` configuration files, which contain the paths of the input and output files, the path of the tokenizer, and some configurable parameters:
```yaml
# example_dataset_config_train.yaml

settings:
  src_path: data/raw/redpajama_v2_samples_512_train.jsonl
  dst_path: data/mem_map/redpajama_v2_samples_512_train.pbin
  index_path: data/mem_map/redpajama_v2_samples_512_train.idx
  jq_pattern: .raw_content
  num_cpus: ${node_env:num_cpus}
  eod_token: <|endoftext|>

tokenizer:
  component_key: tokenizer
  variant_key: pretrained_hf_tokenizer
  config:
    pretrained_model_name_or_path: tokenizer
    padding: false
    max_length: 512
```
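Since this is plain YAML, it can be loaded and inspected directly; a minimal sketch, assuming PyYAML is installed (the manual load below is purely for inspection, modalities parses the file itself):

```python
import yaml  # assumption: PyYAML is available (pip install pyyaml)

with open("example_dataset_config_train.yaml") as f:
    cfg = yaml.safe_load(f)

# The interpolation ${node_env:num_cpus} is resolved by modalities at runtime;
# here it simply loads as a literal string.
print(cfg["settings"]["src_path"])               # data/raw/redpajama_v2_samples_512_train.jsonl
print(cfg["tokenizer"]["config"]["max_length"])  # 512
```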
### Step 1: Create Index
First, we change into the example directory:
```sh
cd modalities/examples/getting_started/
```
Then, we create the dataset index via

```sh
# train split
modalities data create_raw_index --index_path data/mem_map/redpajama_v2_samples_512_train.idx \
    data/raw/redpajama_v2_samples_512_train.jsonl

# test split
modalities data create_raw_index --index_path data/mem_map/redpajama_v2_samples_512_test.idx \
    data/raw/redpajama_v2_samples_512_test.jsonl
```

In this step, we read the JSONL file as a binary file, iterate over all characters, and store the start and end positions of the samples as determined by the `\n` character positions. The sample index is stored in the specified `index_path`. Internally, the `create_raw_index` command
instantiates and calls the [IndexGenerator](https://github.com/Modalities/modalities/blob/main/src/modalities/dataloader/create_index.py#L14).
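To illustrate what such an index contains, here is a minimal Python sketch of newline-based indexing; the `build_index` helper and the `(start_byte, length)` tuple layout are illustrative assumptions, not the actual `IndexGenerator` implementation or its on-disk format:

```python
import pickle

def build_index(jsonl_path: str, index_path: str) -> None:
    """Record (start_byte, length) for every newline-terminated sample."""
    index = []
    start = 0
    with open(jsonl_path, "rb") as src:
        for line in src:  # iterating a binary file splits on b"\n"
            index.append((start, len(line)))
            start += len(line)
    with open(index_path, "wb") as dst:
        pickle.dump(index, dst)

# Write to a scratch path so the real .idx file is untouched.
build_index("data/raw/redpajama_v2_samples_512_train.jsonl", "/tmp/example.idx")
```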

### Step 2: Pack Dataset

After having determined the index, we create the packed dataset by leveraging the tokenizer, the JSONL file, and the created index, like so:

```sh
# train split
modalities data pack_encoded_data example_dataset_config_train.yaml

# test split
modalities data pack_encoded_data example_dataset_config_test.yaml
```
This will create the following file structure, which we can directly load into the [PackedMemMapDataset](https://github.com/Modalities/modalities/blob/main/src/modalities/dataloader/dataset.py#L65).
```
data/mem_map/
├── redpajama_v2_samples_512_test.idx
├── redpajama_v2_samples_512_test.pbin
├── redpajama_v2_samples_512_train.idx
└── redpajama_v2_samples_512_train.pbin
```
The packed dataset tokenizes the entire text first and then divides it into chunks of size context-length.
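The chunking idea can be sketched in a few lines of Python (an illustration of the "tokenize first, then chunk" strategy, not the actual memory-mapped implementation):

```python
def chunk_tokens(token_ids: list[int], context_length: int) -> list[list[int]]:
    """Split a pre-tokenized stream into fixed-length samples, dropping the tail."""
    n_chunks = len(token_ids) // context_length
    return [token_ids[i * context_length:(i + 1) * context_length]
            for i in range(n_chunks)]

print(chunk_tokens(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```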



## 2. Training & Evaluation

### Config File
In modalities, we describe the entire training and evaluation setup (i.e., components such as the model, trainer, evaluator, dataloader, etc.) within a single config file. Not only does this increase reproducibility, but it also allows us to keep entire training runs under version control.

The example config file for this experiment can be found in `examples/getting_started/example_config.yaml`.

### Training
Having created the dataset and defined the experiment in the configuration file, we can now start the training by running the following command.

```sh
# Hypothetical launch command (assumption): a single-GPU torchrun invocation of
# the modalities CLI; adjust CUDA_VISIBLE_DEVICES and --nproc_per_node to your setup.
CUDA_VISIBLE_DEVICES=0 torchrun --nnodes 1 --nproc_per_node 1 --rdzv-endpoint localhost:29515 \
    $(which modalities) run --config_file_path example_config.yaml
```

Already during training, the checkpoints can be found locally in `checkpoints/`, and the loss and metric curves can be inspected online in [Weights&Biases](https://wandb.ai/).

### Evaluation

In order to let the model generate text, we need to specify the last training checkpoint under `model_path` in the config file `example_text_generation_config.yaml`:

```yaml
# example_text_generation_config.yaml
settings:
  referencing_keys:
    sample_key: input_ids
    prediction_key: logits
  model_path: ./checkpoints/<checkpoint_name>.bin
  device: 0
  context_length: 512
[..]
```
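To fill in `<checkpoint_name>`, we can simply pick the newest checkpoint from the training run; a small helper sketch (the glob pattern is an assumption about the checkpoint layout):

```python
from pathlib import Path

# Pick the most recently written checkpoint under checkpoints/.
latest = max(Path("checkpoints").rglob("*.bin"), key=lambda p: p.stat().st_mtime)
print(latest)  # paste this path into model_path in example_text_generation_config.yaml
```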

Subsequently, given the checkpoint and tokenizer, we can load the model for text generation as follows:

```sh
modalities generate_text --config_file_path example_text_generation_config.yaml
```

This opens an interactive command-line chat interface.

```
enter prompt> Once upon a time,
```