added data preprocessing docs
Signed-off-by: Dushyant Behl <[email protected]>
dushyantbehl committed Dec 20, 2024
1 parent d19b8ec commit 615ed74
Showing 2 changed files with 45 additions and 3 deletions.
13 changes: 11 additions & 2 deletions README.md
@@ -149,15 +149,22 @@ Example: For a JSON dataset like `Train.jsonl`
Pass a dataset with single/multi turn chat interactions. Your dataset could be supplied like

```
$ head -n 1 train.jsonl
{"messages": [{"content": "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.", "role": "system"}, {"content": "Look up a word that rhymes with exist", "role": "user"}, {"content": "I found a word that rhymes with \"exist\":\n1\\. Mist", "role": "assistant"}], "group": "lab_extension", "dataset": "base/full-extension", "metadata": "{\"num_turns\": 1}"}
```

containing single/multi turn chat.
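Each line of such a file is a self-contained JSON object; a minimal stdlib sketch of reading one record (field names taken from the example above, content strings shortened here for brevity):

```python
import json

# One line from train.jsonl, shaped like the example above.
line = json.dumps({
    "messages": [
        {"content": "You are a cautious assistant.", "role": "system"},
        {"content": "Look up a word that rhymes with exist", "role": "user"},
        {"content": "I found a word that rhymes with \"exist\":\n1. Mist", "role": "assistant"},
    ],
    "metadata": "{\"num_turns\": 1}",
})

record = json.loads(line)
turns = [m["role"] for m in record["messages"]]
print(turns)  # ['system', 'user', 'assistant']

# Note that "metadata" is itself a JSON-encoded string in the example.
meta = json.loads(record["metadata"])
print(meta["num_turns"])  # 1
```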

The chat template used to render this data will be `tokenizer.chat_template` from the model's default tokenizer config, or it can be overridden using the `--chat_template <chat-template-string>` argument.
As an example, for models like [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct), which contain a [chat template](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188) as part of their `tokenizer_config.json`, users need not pass a chat template to process the data.
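Real rendering should go through `tokenizer.apply_chat_template`; purely to illustrate the kind of string such a template produces, here is a rough hand-rolled approximation (the role-marker strings are the granite ones shown below in this README; the `<|end_of_text|>` terminator is an assumption, not the exact template):

```python
def render_granite_style(messages):
    """Rough approximation of a granite-style chat template.
    Real code should call tokenizer.apply_chat_template instead."""
    parts = [
        f"<|start_of_role|>{m['role']}<|end_of_role|>{m['content']}<|end_of_text|>"
        for m in messages
    ]
    return "\n".join(parts)

chat = [
    {"role": "user", "content": "Look up a word that rhymes with exist"},
    {"role": "assistant", "content": "Mist"},
]
print(render_granite_style(chat))
```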

Users do need to pass `--response_template` and `--instruction_template`, which are pieces of text representing the start of the
`assistant` and `human` responses inside the formatted chat template.
For the [granite model above](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188), for example, the values would be:
```
--instruction_template "<|start_of_role|>user<|end_of_role|>"
--response_template "<|start_of_role|>assistant<|end_of_role|>"
```

The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to perform masking of the text, ensuring the model learns only on the `assistant` responses for both single and multi turn chat.
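Conceptually, the collator scans the token stream for the two templates and masks everything outside assistant spans to `-100` so the loss ignores it. A simplified sketch over toy token ids (this is an illustration of the idea, not the actual TRL implementation):

```python
IGNORE_INDEX = -100

def mask_labels(input_ids, response_ids, instruction_ids):
    """Keep labels only inside assistant responses: tokens after a
    response-template match and before the next instruction-template
    match are learned; everything else is masked to -100."""
    labels = [IGNORE_INDEX] * len(input_ids)

    def matches(pos, pat):
        return input_ids[pos:pos + len(pat)] == pat

    learning = False
    i = 0
    while i < len(input_ids):
        if matches(i, instruction_ids):
            learning = False
        if matches(i, response_ids):
            learning = True
            i += len(response_ids)  # do not learn the template itself
            continue
        if learning:
            labels[i] = input_ids[i]
        i += 1
    return labels

# Toy ids: 1 = instruction template, 2 = response template.
ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
print(mask_labels(ids, response_ids=[2], instruction_ids=[1]))
# [-100, -100, -100, -100, 20, 21, -100, -100, -100, 22]
```

Only the assistant tokens (`20, 21` and `22`) survive as labels, across both turns.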

@@ -169,7 +176,9 @@ Users can also pass a pretokenized dataset (containing `input_ids` and `labels`)
```
python tuning/sft_trainer.py ... --training_data_path twitter_complaints_tokenized_with_maykeye_tinyllama_v0.arrow
```

### 4. Advanced data preprocessing

For advanced data preprocessing support, including mixing and custom preprocessing of datasets, please see [this document](./docs/advanced-data-preprocessing.md).

## Supported Models

35 changes: 34 additions & 1 deletion docs/advanced-data-preprocessing.md
@@ -1 +1,34 @@
# Advanced Data Processing
Our library also supports a powerful data processing backend which users can use to perform custom data preprocessing, including:
1. Providing multiple datasets.
1. Creating custom data processing pipelines for the datasets.
1. Combining multiple datasets into one, even when they have different formats.
1. Mixing datasets as required and sampling each with different weights.
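For instance, two datasets with different schemas can be mapped to a common format before being combined; a hypothetical sketch (both record shapes and field names here are made up for illustration):

```python
def normalize_instruct(rec):
    # Hypothetical "instruction/output" style record.
    return {"input": rec["instruction"], "output": rec["output"]}

def normalize_chat(rec):
    # Chat-style record: last user turn as input, last assistant reply as output.
    user = [m for m in rec["messages"] if m["role"] == "user"][-1]
    asst = [m for m in rec["messages"] if m["role"] == "assistant"][-1]
    return {"input": user["content"], "output": asst["content"]}

instruct_rows = [{"instruction": "Say hi", "output": "Hi!"}]
chat_rows = [{"messages": [{"role": "user", "content": "Ping"},
                           {"role": "assistant", "content": "Pong"}]}]

combined = ([normalize_instruct(r) for r in instruct_rows]
            + [normalize_chat(r) for r in chat_rows])
print(combined)
# [{'input': 'Say hi', 'output': 'Hi!'}, {'input': 'Ping', 'output': 'Pong'}]
```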

These things are supported via what we call a [`data_config`](#data-config), which can be passed as an argument to the sft trainer. We explain the data config in detail next.

## Data Config

Data config is a configuration file which users can provide to `sft_trainer.py`.
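As a purely hypothetical illustration (the actual schema is defined by the library and may differ; every key below is an assumption), a data config could look something like:

```
# Hypothetical sketch only; consult the schema documentation for the real keys.
datasets:
  - name: dataset_a
    data_paths:
      - /data/a.jsonl
    sampling_weight: 0.7
  - name: dataset_b
    data_paths:
      - /data/b.arrow
    sampling_weight: 0.3
```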

What is the data config schema?

How can users write data configs?

What are data handlers?

Preexisting data handlers

Extra data handlers

How can users pass the datasets?

What kind of datasets can be passed?

How can a user perform sampling?
- What does sampling mean?
- How will it affect the datasets?
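To make the sampling idea concrete, weighted mixing of datasets can be sketched in plain Python (this illustrates the concept only, not the library's mechanism):

```python
import random

def sample_mixture(datasets, weights, n, seed=0):
    """Draw n examples, choosing the source dataset for each draw
    according to the given weights (sampling with replacement)."""
    rng = random.Random(seed)
    names = list(datasets)
    out = []
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]
        out.append(rng.choice(datasets[name]))
    return out

# Dataset "a" is drawn roughly 4x as often as "b".
mix = sample_mixture({"a": [1, 2, 3], "b": [10, 20]}, weights=[0.8, 0.2], n=5)
print(len(mix))  # 5
```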

How can a user create a data config for the existing use cases?

Corner cases which need attention.
