Merge pull request #96 from Modalities/tokenization
Tokenization
le1nux committed Apr 4, 2024
2 parents 60feafe + b903df2 commit 4d9218f
Showing 25 changed files with 301 additions and 299 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -155,4 +155,4 @@ data
docs/source/generated
docs/source/api
pyenv*
.devcontainer/*
.devcontainer/
34 changes: 34 additions & 0 deletions Dataset.md
@@ -0,0 +1,34 @@
# MemMap Datasets

## MemMapDataset Index Generator

The `MemMapDataset` requires an index file that provides the necessary pointers into the raw data file. The `MemMapDataset` can create the index file lazily; however, it is advisable to create it beforehand. This can be done by running

```sh
modalities data create_raw_index <path/to/jsonl/file>
```

The index is created in the same directory as the raw data file. For further options, see the usage documentation via `modalities data create_raw_index --help`.
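
For example, the repository's Lorem Ipsum sample data can be indexed as shown below; the packed-dataset config added in this commit references the resulting index as `data/lorem_ipsum.idx`.

```sh
modalities data create_raw_index data/lorem_ipsum.jsonl
```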

## Packed Dataset Generator

The `PackedMemMapDatasetContinuous` and `PackedMemMapDatasetMegatron` require a packed data file. To create the data file, you first have to generate a `MemMapDataset` index file as described [above](#memmapdataset-index-generator). Assuming the index and raw data are located in the same directory, you can simply execute the following command:

```sh
modalities data pack_encoded_data <path/to/config>
```

The packed data file is created in the same directory as the raw data file. For further options, see the usage documentation via `modalities data pack_encoded_data --help`.
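
For example, using the packed-dataset config added in this commit, whose `dst_path` points to `data/lorem_ipsum.pbin`:

```sh
modalities data pack_encoded_data config_files/data_preparation/packed_dataset_config.yaml
```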

### Packed Data Format

The packed data file is a bytestream containing both the tokenized data and an index denoting the start and length of each tokenized document inside the bytestream. The data file consists of three concatenated parts (see the reader sketch after the list for how they fit together):

header segment | data segment | index segment

* **header segment**: This section is an 8-byte integer that encodes the length of the data segment in bytes.
* **data segment**: This section contains the concatenation of all documents in the form of 4-byte tokens.
  An end-of-sequence token is placed between consecutive documents.
* **index segment**: This section contains a pickled index that locates the documents inside the data segment.
  The index is a list of tuples, where each tuple holds the start position and length in bytes of the
  corresponding document, e.g., `[(start_doc1, len_doc1), (start_doc2, len_doc2), ...]`.
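
To make the layout concrete, the snippet below is a minimal Python sketch of a reader for this format. It is not the library's own implementation: the little-endian byte order, the unsigned 4-byte token dtype, and the assumption that index offsets are relative to the start of the data segment are not stated above and would have to be verified against the actual writer. In practice, `PackedMemMapDatasetContinuous` and `PackedMemMapDatasetMegatron` read this file for you.

```python
import pickle

import numpy as np

HEADER_SIZE_IN_BYTES = 8  # header: length of the data segment in bytes
TOKEN_SIZE_IN_BYTES = 4   # each token is stored as a 4-byte integer


def read_packed_file(path: str) -> list[np.ndarray]:
    """Minimal reader for the header | data | index layout described above."""
    with open(path, "rb") as f:
        raw = f.read()

    # Header segment: size of the data segment in bytes
    # (little-endian byte order is an assumption of this sketch).
    data_segment_length = int.from_bytes(raw[:HEADER_SIZE_IN_BYTES], byteorder="little")

    # Data segment: all documents concatenated as 4-byte tokens,
    # separated by an end-of-sequence token.
    data_start = HEADER_SIZE_IN_BYTES
    data_segment = raw[data_start : data_start + data_segment_length]

    # Index segment: pickled list of (start, length) tuples in bytes.
    index = pickle.loads(raw[data_start + data_segment_length :])

    documents = []
    for start, length in index:
        # Assumption: offsets are relative to the start of the data segment
        # and tokens are unsigned little-endian 4-byte integers.
        assert length % TOKEN_SIZE_IN_BYTES == 0
        token_bytes = data_segment[start : start + length]
        documents.append(np.frombuffer(token_bytes, dtype="<u4"))
    return documents
```

With the example config in this commit, the sketch would be called as `read_packed_file("data/lorem_ipsum.pbin")`; the pickled per-document index is what allows a reader to jump directly to any document without scanning the bytestream.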
170 changes: 0 additions & 170 deletions config_files/config.yaml

This file was deleted.

15 changes: 15 additions & 0 deletions config_files/data_preparation/packed_dataset_config.yaml
@@ -0,0 +1,15 @@
settings:
  src_path: data/lorem_ipsum.jsonl
  dst_path: data/lorem_ipsum.pbin
  index_path: data/lorem_ipsum.idx
  jq_pattern: .text
  num_cpus: ${node_env:num_cpus}
  eod_token: <|endoftext|>

tokenizer:
  component_key: tokenizer
  variant_key: pretrained_hf_tokenizer
  config:
    pretrained_model_name_or_path: data/tokenizer/hf_gpt2
    padding: false
    max_length: 512
File renamed without changes.
@@ -5,7 +5,6 @@ settings:
target_key: target_ids
training:
callback_interval_in_samples: 6
global_num_training_samples: 12
global_num_seen_samples: 0
do_apply_activation_checkpointing: true
gradient_acc_steps: 1
@@ -20,11 +19,13 @@ settings:
world_size: ${cuda_env:WORLD_SIZE}
paths:
checkpointing_path: data/checkpoints

tokenizer:
component_key: tokenizer
variant_key: gpt2_tokenizer_fast
variant_key: pretrained_hf_tokenizer
config:
tokenizer_file: data/tokenizer/tokenizer_gpt2.json
pretrained_model_name_or_path: /workspaces/modalities/data/tokenizer/hf_gpt2
max_length: ${settings.training.sequence_length}

collate_fn:
component_key: collate_fn
@@ -35,10 +36,10 @@ collate_fn:

train_dataset:
component_key: dataset
variant_key: mem_map_dataset
variant_key: packed_mem_map_dataset_continuous
config:
raw_data_path: data/lorem_ipsum.jsonl
index_path: data/lorem_ipsum.idx
raw_data_path: /workspaces/modalities/data/lorem_ipsum.pbin
index_path: /workspaces/modalities/data/lorem_ipsum.idx
block_size: ${settings.training.sequence_length}
jq_pattern: ".text"
sample_key: ${settings.referencing_keys.sample_key}
@@ -62,7 +63,7 @@ train_dataloader:
variant_key: default
config:
batch_size: ${settings.training.local_train_micro_batch_size}
drop_last: false
drop_last: true
sampler:
component_key: sampler
variant_key: distributed_sampler
@@ -93,7 +94,7 @@ val_dataloader:
variant_key: default
config:
batch_size: 3
drop_last: false
drop_last: true
sampler:
component_key: sampler
variant_key: distributed_sampler
@@ -124,7 +125,7 @@ test_dataloader:
variant_key: default
config:
batch_size: 3
drop_last: false
drop_last: true
sampler:
component_key: sampler
variant_key: distributed_sampler
@@ -244,7 +245,7 @@ scheduler:
max_lr: 6e-4
div_factor: 10
final_div_factor: 1
total_steps: 4
total_steps: 5
pct_start: 0.01
anneal_strategy: cos

Binary file modified data/lorem_ipsum.idx