added data preprocessing docs
Signed-off-by: Dushyant Behl <[email protected]>
dushyantbehl committed Dec 20, 2024
1 parent d19b8ec commit 615ed74
Showing 2 changed files with 45 additions and 3 deletions.
13 changes: 11 additions & 2 deletions README.md
@@ -149,15 +149,22 @@ Example: For a JSON dataset like `Train.jsonl`
Pass a dataset with single/multi turn chat interactions. Your dataset could be supplied like

```
$ head -n 1 train.jsonl
{"messages": [{"content": "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.", "role": "system"}, {"content": "Look up a word that rhymes with exist", "role": "user"}, {"content": "I found a word that rhymes with \"exist\":\n1\\. Mist", "role": "assistant"}], "group": "lab_extension", "dataset": "base/full-extension", "metadata": "{\"num_turns\": 1}"}
```

containing single/multi turn chat.
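Each line of such a file is a self-contained JSON object; a minimal stdlib sketch of reading one record (field names taken from the example above, content strings shortened here for brevity):

```python
import json

# One line from train.jsonl, shaped like the example above.
line = json.dumps({
    "messages": [
        {"content": "You are a cautious assistant.", "role": "system"},
        {"content": "Look up a word that rhymes with exist", "role": "user"},
        {"content": "I found a word that rhymes with \"exist\":\n1. Mist", "role": "assistant"},
    ],
    "metadata": "{\"num_turns\": 1}",
})

record = json.loads(line)
turns = [m["role"] for m in record["messages"]]
print(turns)  # ['system', 'user', 'assistant']

# Note that "metadata" is itself a JSON-encoded string in the example.
meta = json.loads(record["metadata"])
print(meta["num_turns"])  # 1
```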

The chat template used to render this data will be `tokenizer.chat_template` from the model's default tokenizer config, or it can be overridden using the `--chat_template <chat-template-string>` argument.
As an example, for models like [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct), which contain a [chat template](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188) as part of their `tokenizer_config.json`, users need not pass a chat template to process the data.
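Real rendering should go through `tokenizer.apply_chat_template`; purely to illustrate the kind of string such a template produces, here is a rough hand-rolled approximation (the role-marker strings are the granite ones shown below in this README; the `<|end_of_text|>` terminator is an assumption, not the exact template):

```python
def render_granite_style(messages):
    """Rough approximation of a granite-style chat template.
    Real code should call tokenizer.apply_chat_template instead."""
    parts = [
        f"<|start_of_role|>{m['role']}<|end_of_role|>{m['content']}<|end_of_text|>"
        for m in messages
    ]
    return "\n".join(parts)

chat = [
    {"role": "user", "content": "Look up a word that rhymes with exist"},
    {"role": "assistant", "content": "Mist"},
]
print(render_granite_style(chat))
```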

Users do need to pass `--response_template` and `--instruction_template`, which are pieces of text representing the start of the
`assistant` and `human` responses inside the formatted chat template.
For the [granite model above](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188), for example, the values would be:
```
--instruction_template "<|start_of_role|>user<|end_of_role|>"
--response_template "<|start_of_role|>assistant<|end_of_role|>"
```

The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to perform masking of the text, ensuring the model learns only on the `assistant` responses for both single and multi turn chat.
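Conceptually, the collator scans the token stream for the two templates and masks everything outside assistant spans to `-100` so the loss ignores it. A simplified sketch over toy token ids (this is an illustration of the idea, not the actual TRL implementation):

```python
IGNORE_INDEX = -100

def mask_labels(input_ids, response_ids, instruction_ids):
    """Keep labels only inside assistant responses: tokens after a
    response-template match and before the next instruction-template
    match are learned; everything else is masked to -100."""
    labels = [IGNORE_INDEX] * len(input_ids)

    def matches(pos, pat):
        return input_ids[pos:pos + len(pat)] == pat

    learning = False
    i = 0
    while i < len(input_ids):
        if matches(i, instruction_ids):
            learning = False
        if matches(i, response_ids):
            learning = True
            i += len(response_ids)  # do not learn the template itself
            continue
        if learning:
            labels[i] = input_ids[i]
        i += 1
    return labels

# Toy ids: 1 = instruction template, 2 = response template.
ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
print(mask_labels(ids, response_ids=[2], instruction_ids=[1]))
# [-100, -100, -100, -100, 20, 21, -100, -100, -100, 22]
```

Only the assistant tokens (`20, 21` and `22`) survive as labels, across both turns.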

@@ -169,7 +176,9 @@ Users can also pass a pretokenized dataset (containing `input_ids` and `labels`)
```
python tuning/sft_trainer.py ... --training_data_path twitter_complaints_tokenized_with_maykeye_tinyllama_v0.arrow
```

### 4. Advanced data preprocessing

For advanced data preprocessing support, including mixing and custom preprocessing of datasets, please see [this document](./docs/advanced-data-preprocessing.md).

## Supported Models

35 changes: 34 additions & 1 deletion docs/advanced-data-preprocessing.md
@@ -1 +1,34 @@
# Advanced Data Processing
Our library also supports a powerful data processing backend which users can use to perform custom data preprocessing, including:
1. Providing multiple datasets.
1. Creating custom data processing pipelines for the datasets.
1. Combining multiple datasets into one, even when they have different formats.
1. Mixing datasets as required and sampling each with different weights.
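For instance, two datasets with different schemas can be mapped to a common format before being combined; a hypothetical sketch (both record shapes and field names here are made up for illustration):

```python
def normalize_instruct(rec):
    # Hypothetical "instruction/output" style record.
    return {"input": rec["instruction"], "output": rec["output"]}

def normalize_chat(rec):
    # Chat-style record: last user turn as input, last assistant reply as output.
    user = [m for m in rec["messages"] if m["role"] == "user"][-1]
    asst = [m for m in rec["messages"] if m["role"] == "assistant"][-1]
    return {"input": user["content"], "output": asst["content"]}

instruct_rows = [{"instruction": "Say hi", "output": "Hi!"}]
chat_rows = [{"messages": [{"role": "user", "content": "Ping"},
                           {"role": "assistant", "content": "Pong"}]}]

combined = ([normalize_instruct(r) for r in instruct_rows]
            + [normalize_chat(r) for r in chat_rows])
print(combined)
# [{'input': 'Say hi', 'output': 'Hi!'}, {'input': 'Ping', 'output': 'Pong'}]
```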

These things are supported via what we call a [`data_config`](#data-config), which can be passed as an argument to the sft trainer. We explain the data config in detail next.

## Data Config

Data config is a configuration file which users can provide to `sft_trainer.py`.
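As a purely hypothetical illustration (the actual schema is defined by the library and may differ; every key below is an assumption), a data config could look something like:

```
# Hypothetical sketch only; consult the schema documentation for the real keys.
datasets:
  - name: dataset_a
    data_paths:
      - /data/a.jsonl
    sampling_weight: 0.7
  - name: dataset_b
    data_paths:
      - /data/b.arrow
    sampling_weight: 0.3
```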

What is the data config schema?

How can users write data configs?

What are data handlers?

Preexisting data handlers

Extra data handlers

How can users pass the datasets?

What kind of datasets can be passed?

How can a user perform sampling?
- What does sampling mean?
- How will it affect the datasets?
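To make the sampling idea concrete, weighted mixing of datasets can be sketched in plain Python (this illustrates the concept only, not the library's mechanism):

```python
import random

def sample_mixture(datasets, weights, n, seed=0):
    """Draw n examples, choosing the source dataset for each draw
    according to the given weights (sampling with replacement)."""
    rng = random.Random(seed)
    names = list(datasets)
    out = []
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]
        out.append(rng.choice(datasets[name]))
    return out

# Dataset "a" is drawn roughly 4x as often as "b".
mix = sample_mixture({"a": [1, 2, 3], "b": [10, 20]}, weights=[0.8, 0.2], n=5)
print(len(mix))  # 5
```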

How can a user create a data config for the existing use cases?

Corner cases which need attention.
