diff --git a/README.md b/README.md
index 72b7882cb..11b20b866 100644
--- a/README.md
+++ b/README.md
@@ -149,15 +149,22 @@ Example: For a JSON dataset like, `Train.jsonl`
 Pass a dataset containing single/multi turn chat data. Your dataset could be supplied like
 ```
+$ head -n 1 train.jsonl
 {"messages": [{"content": "You are an AI language model developed by IBM Research. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior.", "role": "system"}, {"content": "Look up a word that rhymes with exist", "role": "user"}, {"content": "I found a word that rhymes with \"exist\":\n1\\. Mist", "role": "assistant"}], "group": "lab_extension", "dataset": "base/full-extension", "metadata": "{\"num_turns\": 1}"}
 ```
 containing single/multi turn chat. The chat template used to render this data will be `tokenizer.chat_template` from the model's default tokenizer config, or it can be overridden using the `--chat_template` argument.
+As an example, for models like [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct), which contain a [chat template](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188) as part of their `tokenizer_config.json`, users need not pass a chat template to process the data.
-Users also need to pass `--response_template` and `--instruction_template` which are pieces of text representing start of
+Users do need to pass `--response_template` and `--instruction_template`, which are pieces of text marking the start of the
 `assistant` and `human` responses inside the formatted chat template.
+For the [granite model above](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct/blob/main/tokenizer_config.json#L188), for example, the values would be:
+```
+--instruction_template "<|start_of_role|>user<|end_of_role|>"
+--response_template "<|start_of_role|>assistant<|end_of_role|>"
+```
 The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to perform masking of text, ensuring the model learns only on the `assistant` responses for both single and multi turn chat.
@@ -169,7 +176,9 @@ Users can also pass a pretokenized dataset (containing `input_ids` and `labels`
 python tuning/sft_trainer.py ... --training_data_path twitter_complaints_tokenized_with_maykeye_tinyllama_v0.arrow
 ```
-For advanced data preprocessing support please see [this document](./docs/advanced-data-preprocessing.md).
+### 4. Advanced data preprocessing
+
+For advanced data preprocessing support, including mixing and custom preprocessing of datasets, please see [this document](./docs/advanced-data-preprocessing.md).
 
 ## Supported Models
 
diff --git a/docs/advanced-data-preprocessing.md b/docs/advanced-data-preprocessing.md
index a2c7c1dfa..13a45076d 100644
--- a/docs/advanced-data-preprocessing.md
+++ b/docs/advanced-data-preprocessing.md
@@ -1 +1,34 @@
-# Advanced Data Processing
\ No newline at end of file
+# Advanced Data Processing
+Our library also supports a powerful data processing backend which users can use to perform custom data preprocessing, including:
+1. Providing multiple datasets.
+1. Creating custom data processing pipelines for the datasets.
+1. Combining multiple datasets into one, even when they have different formats.
+1. Mixing datasets as required, sampling each with different weights if needed.
+
+These capabilities are supported via what we call a [`data_config`](#data-config), which can be passed as an argument to the sft trainer.
+We explain the data config in detail next.
+
+## Data Config
+
+A data config is a configuration file which users can provide to `sft_trainer.py`.
+
+What is the data config schema
+
+How users can write data configs
+
+What are data handlers
+
+Preexisting data handlers
+
+Extra data handlers
+
+How users can pass the datasets
+
+What kinds of datasets can be passed
+
+How users can perform sampling
+ - What does sampling mean?
+ - How will it affect the datasets?
+
+How users can create a data config for the existing use cases
+
+Corner cases which need attention.
\ No newline at end of file
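The completion-only masking described in the README hunk above can be sketched in plain Python. This is a simplified illustration of the idea behind `DataCollatorForCompletionOnlyLM`, not trl's actual implementation; the token ids are made up, and real multi-turn handling masks every non-assistant segment, not just a single prefix:

```python
IGNORE_INDEX = -100  # label value that causes the loss function to skip a token

def mask_non_assistant(input_ids, response_template_ids):
    """Return labels where everything up to and including the response
    template is masked, so loss is computed only on the assistant
    completion. Simplified single-turn illustration."""
    labels = list(input_ids)
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            # mask the prompt and the response template itself
            for i in range(start + n):
                labels[i] = IGNORE_INDEX
            break
    return labels

# made-up token ids: [prompt ... | response template | assistant tokens]
ids = [5, 6, 7, 100, 101, 42, 43, 44]
print(mask_non_assistant(ids, [100, 101]))
# -> [-100, -100, -100, -100, -100, 42, 43, 44]
```

The assistant tokens keep their original ids as labels, so only they contribute to the training loss.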
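To see why the `--instruction_template` and `--response_template` values above look the way they do, here is a hypothetical, heavily simplified sketch of granite-style chat rendering. The real rendering is performed by `tokenizer.apply_chat_template` using the Jinja template in `tokenizer_config.json`; this toy function only shows how the role markers end up in the formatted text that the templates must match:

```python
def render(messages):
    """Toy stand-in for chat-template rendering: wrap each message in the
    granite-style role markers. Not the real template logic."""
    return "".join(
        f"<|start_of_role|>{m['role']}<|end_of_role|>{m['content']}<|end_of_text|>"
        for m in messages
    )

chat = [{"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"}]
print(render(chat))
```

Because every turn begins with `<|start_of_role|>role<|end_of_role|>`, those marker strings uniquely identify where the user instruction and the assistant response start in the rendered text.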