Commit 0009d1c

[FEATURE] basic use of pipeline to generate SFT dataset from documents (#1076)

Authored by burtenshaw, with pre-commit-ci[bot] and davidberenstein1957.
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: David Berenstein <[email protected]>

1 parent: f5ddbc6

File tree: 14 files changed (+396 -36 lines)

Diff for: docs/api/step_gallery/hugging_face.md (+2 -1)

@@ -5,4 +5,5 @@ This section contains the existing steps integrated with `Hugging Face` so as to
 ::: distilabel.steps.LoadDataFromDisk
 ::: distilabel.steps.LoadDataFromFileSystem
 ::: distilabel.steps.LoadDataFromHub
-::: distilabel.steps.PushToHub
+::: distilabel.steps.PushToHub
+::: distilabel.steps.HuggingFaceHubCheckpointer

Diff for: docs/sections/getting_started/quickstart.md (+30 -1)

@@ -28,7 +28,11 @@ To install the latest release with `hf-inference-endpoints` extra of the package
 pip install distilabel[hf-inference-endpoints] --upgrade
 ```

-## Use a generic pipeline
+## Use a generic pipeline template
+
+Distilabel comes with built-in templates for tasks like Supervised Fine-Tuning. You can use these templates to generate data for your tasks. The templates are built using the `InstructionResponsePipeline` class, which uses the `InferenceEndpointsLLM` class to generate data based on the input data and the model.
+
+### Generate Instructions and Responses

 To use a generic pipeline for an ML task, you can use the `InstructionResponsePipeline` class. This class is a generic pipeline that can be used to generate data for supervised fine-tuning tasks. It uses the `InferenceEndpointsLLM` class to generate data based on the input data and the model.

@@ -41,6 +45,31 @@ dataset = pipeline.run()

 The `InstructionResponsePipeline` class will use the `InferenceEndpointsLLM` class with the model `meta-llama/Meta-Llama-3.1-8B-Instruct` to generate data based on the system prompt. The output data will be a dataset with the columns `instruction` and `response`. The class uses a generic system prompt, but you can customize it by passing the `system_prompt` parameter to the class.

+### Generate based on seed data
+
+You can also use distilabel to generate data based on seed data. This is useful when you have an unstructured dataset that represents your domain and you want instruction-response pairs for fine-tuning a model. You can use the `DatasetInstructionResponsePipeline` class with the `dataset` parameter to generate data based on the seed data.
+
+```python
+from datasets import Dataset
+from distilabel.pipeline import DatasetInstructionResponsePipeline
+
+pipeline = DatasetInstructionResponsePipeline(num_instructions=5)  # define the number of instructions to generate per sample
+
+distiset = pipeline.run(
+    use_cache=False,
+    dataset=Dataset.from_list(
+        mapping=[
+            {
+                "input": "<document>",
+            }
+        ]
+    ),
+)
+```
+
 !!! note
     We're actively working on building more pipelines for different tasks. If you have any suggestions or requests, please let us know! We're currently working on pipelines for classification, Direct Preference Optimization, and Information Retrieval tasks.
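For reference, a minimal sketch of the `system_prompt` customization the quickstart mentions but does not show; it uses only the constructor parameter documented above, and the prompt text itself is illustrative rather than taken from the commit:

```python
from distilabel.pipeline import InstructionResponsePipeline

# Override the generic system prompt to steer the generated
# instruction/response pairs toward a domain.
pipeline = InstructionResponsePipeline(
    system_prompt="You are an AI assistant specialised in technical writing.",
)

dataset = pipeline.run()
```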

Diff for: docs/sections/how_to_guides/advanced/checkpointing.md (+4 -4)

@@ -1,14 +1,14 @@
 # Push data to the hub while the pipeline is running

-Long-running pipelines can be resource-intensive, and ensuring everything is functioning as expected is crucial. To make this process seamless, the [HuggingFaceHubCheckpointer][distilabel.steps.checkpointer.HuggingFaceHubCheckpointer] step has been designed to integrate directly into the pipeline workflow.
+Long-running pipelines can be resource-intensive, and ensuring everything is functioning as expected is crucial. To make this process seamless, the [`HuggingFaceHubCheckpointer`][distilabel.steps.HuggingFaceHubCheckpointer] step has been designed to integrate directly into the pipeline workflow.

-The [`HuggingFaceHubCheckpointer`](https://distilabel.argilla.io/dev/sections/getting_started/quickstart/) allows you to periodically save your generated data as a Hugging Face Dataset at configurable intervals (every `input_batch_size` examples generated).
+The [`HuggingFaceHubCheckpointer`][distilabel.steps.HuggingFaceHubCheckpointer] allows you to periodically save your generated data as a Hugging Face Dataset at configurable intervals (every `input_batch_size` examples generated).

-Just add the [`HuggingFaceHubCheckpointer`](https://distilabel.argilla.io/dev/sections/getting_started/quickstart/) as any other step in your pipeline.
+Just add the [`HuggingFaceHubCheckpointer`][distilabel.steps.HuggingFaceHubCheckpointer] as any other step in your pipeline.

 ## Sample pipeline with dummy data to see the checkpoint strategy in action

-The following pipeline starts from a fake dataset with dummy data, passes that through a fake `DoNothing` step (any other step/s work here, but this can be useful to explore the behavior), and makes use of the [`HuggingFaceHubCheckpointer`](https://distilabel.argilla.io/dev/sections/getting_started/quickstart/) step to push the data to the hub.
+The following pipeline starts from a fake dataset with dummy data, passes that through a fake `DoNothing` step (any other step/s work here, but this can be useful to explore the behavior), and makes use of the [`HuggingFaceHubCheckpointer`][distilabel.steps.HuggingFaceHubCheckpointer] step to push the data to the hub.

 ```python
 from datasets import Dataset
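The sample pipeline is truncated in this diff. A minimal sketch of the pattern the guide describes, assuming `LoadDataFromDicts` for the dummy data and `repo_id`/`private` constructor parameters on the checkpointer; check the class docstring for the exact signature:

```python
from distilabel.pipeline import Pipeline
from distilabel.steps import HuggingFaceHubCheckpointer, LoadDataFromDicts

with Pipeline(name="checkpoint-demo") as pipeline:
    # Fake dataset with dummy data, standing in for a long-running generator.
    loader = LoadDataFromDicts(data=[{"text": "dummy"}] * 100)
    # Assumed parameters: a Hub dataset repo to push checkpoints to.
    # Data is pushed every `input_batch_size` examples generated.
    checkpointer = HuggingFaceHubCheckpointer(
        repo_id="username/pipeline-checkpoints",  # hypothetical repo
        private=True,
        input_batch_size=50,
    )
    loader >> checkpointer

distiset = pipeline.run(use_cache=False)
```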

Diff for: src/distilabel/pipeline/__init__.py (+2)

@@ -19,10 +19,12 @@
     sample_n_steps,
 )
 from distilabel.pipeline.templates import (
+    DatasetInstructionResponsePipeline,
     InstructionResponsePipeline,
 )

 __all__ = [
+    "DatasetInstructionResponsePipeline",
     "InstructionResponsePipeline",
     "Pipeline",
     "RayPipeline",

Diff for: src/distilabel/pipeline/templates/__init__.py (+1)

@@ -12,4 +12,5 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+from .dataset_instruction import DatasetInstructionResponsePipeline  # noqa: F401
 from .instruction import InstructionResponsePipeline  # noqa: F401

Diff for: src/distilabel/pipeline/templates/base.py (new file, +17)

+# Copyright 2023-present, Argilla, Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+class BasePipelineTemplate:  # defined for recursive subclass finder mkdocs
+    pass
Diff for: src/distilabel/pipeline/templates/dataset_instruction.py (new file, +167)

+# Copyright 2023-present, Argilla, Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import Optional
+
+from distilabel.distiset import Distiset
+from distilabel.llms import LLM, InferenceEndpointsLLM
+from distilabel.pipeline import Pipeline
+from distilabel.pipeline.templates.base import BasePipelineTemplate
+from distilabel.steps import ExpandColumns, KeepColumns
+from distilabel.steps.tasks import SelfInstruct, TextGeneration
+
+MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
+
+
+class DatasetInstructionResponsePipeline(BasePipelineTemplate):
+    """Generates instructions and responses for a dataset with input documents.
+
+    This example pipeline can be used for a Supervised Fine-Tuning dataset which you
+    could use to train or evaluate a model. The pipeline generates instructions using the
+    SelfInstruct step and responses using the TextGeneration step.
+
+    Attributes:
+        llm: The LLM to use for generating instructions and responses. Defaults to
+            InferenceEndpointsLLM with Meta-Llama-3.1-8B-Instruct.
+        system_prompt: The system prompt to use for generating instructions and responses.
+            Defaults to "You are a creative AI Assistant writer."
+        hf_token: The Hugging Face token to use for accessing the model. Defaults to None.
+        num_instructions: The number of instructions to generate. Defaults to 2.
+        batch_size: The batch size to use for generation. Defaults to 1.
+
+    Input columns:
+        - input (`str`): The input document to generate instructions and responses for.
+
+    Output columns:
+        - conversation (`ChatType`): the generated conversation which is a list of chat
+            items with a role and a message.
+        - instruction (`str`): the generated instructions if `only_instruction=True`.
+        - response (`str`): the generated response if `n_turns==1`.
+        - system_prompt_key (`str`, optional): the key of the system prompt used to generate
+            the conversation or instruction. Only if `system_prompt` is a dictionary.
+        - model_name (`str`): The model name used to generate the `conversation` or `instruction`.
+
+    References:
+        - [Self-Instruct: Aligning Language Models with Self-Generated Instructions](https://arxiv.org/abs/2212.10560)
+
+    Examples:
+        Generate instructions and responses for a dataset of documents:
+
+        ```python
+        from datasets import Dataset
+        from distilabel.pipeline import DatasetInstructionResponsePipeline
+
+        pipeline = DatasetInstructionResponsePipeline(num_instructions=5)
+
+        distiset = pipeline.run(
+            use_cache=False,
+            dataset=Dataset.from_list(
+                mapping=[
+                    {
+                        "input": "<document>",
+                    }
+                ]
+            ),
+        )
+        ```
+    """
+
+    def __init__(
+        self,
+        llm: Optional[LLM] = None,
+        system_prompt: str = "You are a creative AI Assistant writer.",
+        hf_token: Optional[str] = None,
+        num_instructions: int = 2,
+        batch_size: int = 1,
+    ) -> None:
+        """Initializes the pipeline.
+
+        Args:
+            llm (Optional[LLM], optional): The language model to use. Defaults to None.
+            system_prompt (str, optional): The system prompt to use. Defaults to "You are a creative AI Assistant writer.".
+            hf_token (Optional[str], optional): The Hugging Face API token to use. Defaults to None.
+            num_instructions (int, optional): The number of instructions to generate. Defaults to 2.
+            batch_size (int, optional): The batch size to use. Defaults to 1.
+        """
+        if llm is None:
+            self.llm: LLM = InferenceEndpointsLLM(
+                model_id=MODEL,
+                tokenizer_id=MODEL,
+                generation_kwargs={
+                    "temperature": 0.9,
+                    "do_sample": True,
+                    "max_new_tokens": 2048,
+                },
+                api_key=hf_token,
+            )
+        else:
+            self.llm = llm
+
+        self.pipeline: Pipeline = self._get_pipeline(
+            system_prompt=system_prompt,
+            num_instructions=num_instructions,
+            batch_size=batch_size,
+        )
+
+    def run(self, dataset, **kwargs) -> Distiset:
+        """Runs the pipeline and returns a Distiset.
+
+        Args:
+            dataset: The dataset to run the pipeline on.
+            **kwargs: Additional arguments to pass to the pipeline.
+        """
+        return self.pipeline.run(dataset, **kwargs)
+
+    def _get_pipeline(
+        self, system_prompt: str, num_instructions: int, batch_size: int
+    ) -> Pipeline:
+        """Returns a pipeline that generates instructions and responses for a given system prompt."""
+        with Pipeline(name="dataset_chat") as pipeline:
+            self_instruct = SelfInstruct(
+                llm=self.llm,
+                num_instructions=num_instructions,
+            )
+
+            expand_columns = ExpandColumns(
+                columns=["instructions"],
+                output_mappings={"instructions": "instruction"},
+            )
+
+            keep_instruction = KeepColumns(
+                columns=["instruction", "input"],
+            )
+
+            response_generation = TextGeneration(
+                name="exam_generation",
+                system_prompt=system_prompt,
+                template="Respond to the instruction based on the document. Document:\n{{ input }} \nInstruction: {{ instruction }}",
+                llm=self.llm,
+                input_batch_size=batch_size,
+                output_mappings={"generation": "response"},
+            )
+
+            keep_response = KeepColumns(
+                columns=["input", "instruction", "response"],
+            )
+
+            (
+                self_instruct
+                >> expand_columns
+                >> keep_instruction
+                >> response_generation
+                >> keep_response
+            )
+
+        return pipeline
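Beyond the docstring example, a hedged sketch of running this template over several documents, using only the parameters visible in the `__init__` signature above; the document contents are placeholders:

```python
from datasets import Dataset

from distilabel.pipeline import DatasetInstructionResponsePipeline

# num_instructions is forwarded to SelfInstruct (instructions per document);
# batch_size is forwarded to the response TextGeneration step.
pipeline = DatasetInstructionResponsePipeline(num_instructions=3, batch_size=2)

documents = Dataset.from_list(
    mapping=[
        {"input": "First source document about distributed systems."},
        {"input": "Second source document about data pipelines."},
    ]
)

distiset = pipeline.run(dataset=documents, use_cache=False)
```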

Diff for: src/distilabel/pipeline/templates/instruction.py (+23 -3)

@@ -17,23 +17,43 @@
 from distilabel.distiset import Distiset
 from distilabel.llms import LLM, InferenceEndpointsLLM
 from distilabel.pipeline import Pipeline
+from distilabel.pipeline.templates.base import BasePipelineTemplate
 from distilabel.steps.tasks import MagpieGenerator

 MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"


-class InstructionResponsePipeline:
+class InstructionResponsePipeline(BasePipelineTemplate):
     """Generates instructions and responses for a given system prompt.

     This example pipeline can be used for a Supervised Fine-Tuning dataset which you
     could use to train or evaluate a model. The pipeline generates instructions using the
     MagpieGenerator and responses for a given system prompt. The pipeline then keeps only
     the instruction, response, and model_name columns.

+    Attributes:
+        llm: The LLM to use for generating instructions and responses. Defaults to
+            InferenceEndpointsLLM with Meta-Llama-3.1-8B-Instruct.
+        system_prompt: The system prompt to use for generating instructions and responses.
+            Defaults to "You are a creative AI Assistant writer."
+        hf_token: The Hugging Face token to use for accessing the model. Defaults to None.
+        n_turns: The number of turns to generate for each conversation. Defaults to 1.
+        num_rows: The number of rows to generate. Defaults to 10.
+        batch_size: The batch size to use for generation. Defaults to 1.
+
+    Output columns:
+        - conversation (`ChatType`): the generated conversation which is a list of chat
+            items with a role and a message.
+        - instruction (`str`): the generated instructions if `only_instruction=True`.
+        - response (`str`): the generated response if `n_turns==1`.
+        - system_prompt_key (`str`, optional): the key of the system prompt used to generate
+            the conversation or instruction. Only if `system_prompt` is a dictionary.
+        - model_name (`str`): The model name used to generate the `conversation` or `instruction`.
+
     References:
-        - [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464)
+        - [Self-Instruct: Aligning Language Models with Self-Generated Instructions](https://arxiv.org/abs/2212.10560)

-    Example:
+    Examples:
         Generate instructions and responses for a given system prompt:
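Both templates accept a custom `llm` in place of the default `InferenceEndpointsLLM`. A minimal sketch, assuming `OpenAILLM` is importable from `distilabel.llms` in this version; the model name is illustrative:

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import InstructionResponsePipeline

# Any distilabel LLM satisfying the `LLM` interface can be injected.
pipeline = InstructionResponsePipeline(
    llm=OpenAILLM(model="gpt-4o-mini"),  # assumes OPENAI_API_KEY is set
)

dataset = pipeline.run()
```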

Diff for: src/distilabel/steps/checkpointer.py (+1 -2)

@@ -16,15 +16,14 @@
 import tempfile
 from typing import TYPE_CHECKING, Optional

+from huggingface_hub import HfApi
 from pydantic import PrivateAttr

 from distilabel.steps.base import Step, StepInput

 if TYPE_CHECKING:
     from distilabel.typing import StepOutput

-from huggingface_hub import HfApi
-

 class HuggingFaceHubCheckpointer(Step):
     """Special type of step that uploads the data to a Hugging Face Hub dataset.

Diff for: src/distilabel/utils/export_components_info.py (+21)

@@ -18,6 +18,7 @@
 from distilabel.models.embeddings.base import Embeddings
 from distilabel.models.image_generation.base import ImageGenerationModel
 from distilabel.models.llms.base import LLM
+from distilabel.pipeline.templates.base import BasePipelineTemplate
 from distilabel.steps.base import _Step
 from distilabel.steps.tasks.base import _Task
 from distilabel.steps.tasks.generate_embeddings import GenerateEmbeddings

@@ -68,6 +69,13 @@ def export_components_info() -> ComponentsInfo:
             }
             for embeddings_type in _get_embeddings()
         ],
+        "pipelines": [
+            {
+                "name": pipeline_type.__name__,
+                "docstring": parse_google_docstring(pipeline_type),
+            }
+            for pipeline_type in _get_pipelines()
+        ],
     }

@@ -148,6 +156,19 @@ def _get_embeddings() -> List[Type["Embeddings"]]:
     ]


+def _get_pipelines() -> List[Type["BasePipelineTemplate"]]:
+    """Get all `BasePipelineTemplate` subclasses that are not abstract classes.
+
+    Returns:
+        A list of `BasePipelineTemplate` subclasses.
+    """
+    return [
+        pipeline_type
+        for pipeline_type in _recursive_subclasses(BasePipelineTemplate)
+        if not inspect.isabstract(pipeline_type)
+    ]
+
+
 # Reference: https://adamj.eu/tech/2024/05/10/python-all-subclasses/
 def _recursive_subclasses(klass: Type[T]) -> Generator[Type[T], None, None]:
     """Recursively get all subclasses of a class.

 (0)