Synthetic Datasets

Synthetic data is artificially generated data that mimics real-world usage. It helps overcome data limitations by expanding or enhancing existing datasets. Although synthetic data was already used for some applications, large language models have made synthetic datasets more popular for pre-training, post-training, and the evaluation of language models.

We'll use distilabel, a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers. For a deeper dive into the package and best practices, check out the documentation.

Module Overview

Synthetic data for language models can be grouped into three categories: instructions, preferences, and critiques. We will focus on the first two, which cover generating datasets for instruction tuning and preference alignment. Within both, we will also touch on the third category, which improves existing data through model critiques and rewrites.

[Figure: Synthetic Data Taxonomies]

Contents

Learn how to generate instruction datasets for instruction tuning. We will explore creating instruction tuning datasets through basic prompting and through more refined prompting techniques from papers. Instruction tuning datasets with seed data for in-context learning can be created through methods like SelfInstruct and Magpie. Additionally, we will explore instruction evolution through EvolInstruct. Start learning.
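To make the idea of seed data for in-context learning concrete, here is a minimal, library-free sketch. It does not use distilabel's API and is not the actual SelfInstruct or Magpie prompt; the function names (`build_seed_prompt`, `generate_instructions`) and the `fake_llm` stand-in are hypothetical and exist only for illustration. In practice, `fake_llm` would be replaced by a call to a real model.

```python
from typing import Callable, List


def build_seed_prompt(seed_instructions: List[str], num_new: int = 3) -> str:
    """Build a few-shot prompt that shows seed instructions as in-context examples."""
    examples = "\n".join(
        f"{i}. {text}" for i, text in enumerate(seed_instructions, start=1)
    )
    return (
        "Here are example instructions:\n"
        f"{examples}\n\n"
        f"Write {num_new} new, diverse instructions in the same style, one per line."
    )


def generate_instructions(
    seed_instructions: List[str],
    llm: Callable[[str], str],
    num_new: int = 3,
) -> List[str]:
    """Ask the model for new instructions and parse one instruction per line."""
    raw_output = llm(build_seed_prompt(seed_instructions, num_new))
    lines = [line.strip() for line in raw_output.splitlines()]
    # Drop empty lines and keep at most `num_new` instructions.
    return [line for line in lines if line][:num_new]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned text."""
    return (
        "Explain how a hash map works.\n"
        "\n"
        "Summarize the main causes of inflation.\n"
        "Write a haiku about autumn."
    )


seeds = [
    "Describe the water cycle in two sentences.",
    "Translate 'good morning' into French.",
]
new_instructions = generate_instructions(seeds, fake_llm)
print(new_instructions)
```

The model call is passed in as a plain function so the prompt construction and output parsing can be checked without network access; the generated instructions would then be paired with model responses to form an instruction tuning dataset.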

Learn how to generate preference datasets for preference alignment. We will build on the methods and techniques introduced in section 1 by generating additional responses. Next, we will learn how to improve such responses with the EvolQuality prompt. Finally, we will explore how to evaluate responses with the UltraFeedback prompt, which produces a score and a critique, allowing us to create preference pairs. Start learning.
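The last step above, turning scored responses into preference pairs, can be sketched in plain Python. This is an assumption-laden illustration rather than distilabel's implementation: the helper name `to_preference_pair`, the `chosen`/`rejected` field names, and the rule of pairing the highest-scored response with the lowest-scored one are all choices made here for the example, and the scores are made up.

```python
from typing import Dict, List, Optional


def to_preference_pair(
    instruction: str,
    responses: List[str],
    scores: List[float],
) -> Optional[Dict[str, str]]:
    """Pick the best- and worst-scored responses as a (chosen, rejected) pair.

    Returns None when no meaningful preference can be derived, i.e. when
    there are fewer than two responses or all scores are tied.
    """
    if len(responses) != len(scores):
        raise ValueError("responses and scores must have the same length")
    if len(responses) < 2:
        return None

    # Sort by score only, highest first.
    ranked = sorted(zip(scores, responses), key=lambda item: item[0], reverse=True)
    best_score, chosen = ranked[0]
    worst_score, rejected = ranked[-1]
    if best_score == worst_score:
        return None

    return {"instruction": instruction, "chosen": chosen, "rejected": rejected}


# Illustrative data: three candidate responses with judge scores (made up).
pair = to_preference_pair(
    instruction="Explain recursion in one sentence.",
    responses=[
        "Recursion is when a function calls itself on a smaller input until a base case is reached.",
        "Recursion is a loop.",
        "It is a thing in programming.",
    ],
    scores=[5, 2, 1],
)
print(pair)
```

Skipping tied scores avoids creating pairs where the judge expressed no preference; other selection rules (for example, requiring a minimum score gap) are equally plausible.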

Exercise Notebooks

| Title | Description | Exercise | Link | Colab |
|-------|-------------|----------|------|-------|
| Instruction Dataset | Generate a dataset for instruction tuning | 🐢 Generate an instruction tuning dataset <br> 🐕 Generate a dataset for instruction tuning with seed data <br> 🦁 Generate a dataset for instruction tuning with seed data and with instruction evolution | Link | Colab |
| Preference Dataset | Generate a dataset for preference alignment | 🐢 Generate a preference alignment dataset <br> 🐕 Generate a preference alignment dataset with response evolution <br> 🦁 Generate a preference alignment dataset with response evolution and critiques | Link | Colab |

Resources