Commit 0009d1c

[FEATURE] basic use of pipeline to generate SFT dataset from documents (#1076)

Authored by burtenshaw, with pre-commit-ci[bot] and davidberenstein1957.
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: David Berenstein <[email protected]>

1 parent: f5ddbc6

File tree: 14 files changed (+396 -36 lines)

Diff for: docs/api/step_gallery/hugging_face.md (+2 -1)

@@ -5,4 +5,5 @@ This section contains the existing steps integrated with `Hugging Face` so as to
 ::: distilabel.steps.LoadDataFromDisk
 ::: distilabel.steps.LoadDataFromFileSystem
 ::: distilabel.steps.LoadDataFromHub
-::: distilabel.steps.PushToHub
+::: distilabel.steps.PushToHub
+::: distilabel.steps.HuggingFaceHubCheckpointer

Diff for: docs/sections/getting_started/quickstart.md (+30 -1)

@@ -28,7 +28,11 @@ To install the latest release with `hf-inference-endpoints` extra of the package
 pip install distilabel[hf-inference-endpoints] --upgrade
 ```

-## Use a generic pipeline
+## Use a generic pipeline template
+
+Distilabel comes with built-in templates for tasks like Supervised Fine-Tuning. You can use these templates to generate data for your tasks. The templates are built using the `InstructionResponsePipeline` class, which uses the `InferenceEndpointsLLM` class to generate data based on the input data and the model.
+
+### Generate Instructions and Responses

 To use a generic pipeline for an ML task, you can use the `InstructionResponsePipeline` class. This class is a generic pipeline that can be used to generate data for supervised fine-tuning tasks. It uses the `InferenceEndpointsLLM` class to generate data based on the input data and the model.

@@ -41,6 +45,31 @@ dataset = pipeline.run()

 The `InstructionResponsePipeline` class will use the `InferenceEndpointsLLM` class with the model `meta-llama/Meta-Llama-3.1-8B-Instruct` to generate data based on the system prompt. The output data will be a dataset with the columns `instruction` and `response`. The class uses a generic system prompt, but you can customize it by passing the `system_prompt` parameter to the class.

+### Generate based on seed data
+
+You can also use distilabel to generate data based on seed data. This is useful when you have an unstructured dataset that represents your domain and you want instruction-response pairs for fine-tuning a model. You can use the `DatasetInstructionResponsePipeline` class with the `dataset` parameter to generate data based on the seed data.
+
+```python
+from datasets import Dataset
+from distilabel.pipeline import DatasetInstructionResponsePipeline
+
+pipeline = DatasetInstructionResponsePipeline(num_instructions=5)  # define the number of instructions to generate per sample
+
+distiset = pipeline.run(
+    use_cache=False,
+    dataset=Dataset.from_list(
+        mapping=[
+            {
+                "input": "<document>",
+            }
+        ]
+    ),
+)
+```
+
 !!! note
     We're actively working on building more pipelines for different tasks. If you have any suggestions or requests, please let us know! We're currently working on pipelines for classification, Direct Preference Optimization, and Information Retrieval tasks.
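For reference, a minimal sketch of the `system_prompt` customization the quickstart mentions but does not show; it uses only the constructor parameter documented above, and the prompt text itself is illustrative rather than taken from the commit:

```python
from distilabel.pipeline import InstructionResponsePipeline

# Override the generic system prompt to steer the generated
# instruction/response pairs toward a domain.
pipeline = InstructionResponsePipeline(
    system_prompt="You are an AI assistant specialised in technical writing.",
)

dataset = pipeline.run()
```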

Diff for: docs/sections/how_to_guides/advanced/checkpointing.md (+4 -4)

@@ -1,14 +1,14 @@
 # Push data to the hub while the pipeline is running

-Long-running pipelines can be resource-intensive, and ensuring everything is functioning as expected is crucial. To make this process seamless, the [HuggingFaceHubCheckpointer][distilabel.steps.checkpointer.HuggingFaceHubCheckpointer] step has been designed to integrate directly into the pipeline workflow.
+Long-running pipelines can be resource-intensive, and ensuring everything is functioning as expected is crucial. To make this process seamless, the [`HuggingFaceHubCheckpointer`][distilabel.steps.HuggingFaceHubCheckpointer] step has been designed to integrate directly into the pipeline workflow.

-The [`HuggingFaceHubCheckpointer`](https://distilabel.argilla.io/dev/sections/getting_started/quickstart/) allows you to periodically save your generated data as a Hugging Face Dataset at configurable intervals (every `input_batch_size` examples generated).
+The [`HuggingFaceHubCheckpointer`][distilabel.steps.HuggingFaceHubCheckpointer] allows you to periodically save your generated data as a Hugging Face Dataset at configurable intervals (every `input_batch_size` examples generated).

-Just add the [`HuggingFaceHubCheckpointer`](https://distilabel.argilla.io/dev/sections/getting_started/quickstart/) as any other step in your pipeline.
+Just add the [`HuggingFaceHubCheckpointer`][distilabel.steps.HuggingFaceHubCheckpointer] as any other step in your pipeline.

 ## Sample pipeline with dummy data to see the checkpoint strategy in action

-The following pipeline starts from a fake dataset with dummy data, passes that through a fake `DoNothing` step (any other step/s work here, but this can be useful to explore the behavior), and makes use of the [`HuggingFaceHubCheckpointer`](https://distilabel.argilla.io/dev/sections/getting_started/quickstart/) step to push the data to the hub.
+The following pipeline starts from a fake dataset with dummy data, passes that through a fake `DoNothing` step (any other step/s work here, but this can be useful to explore the behavior), and makes use of the [`HuggingFaceHubCheckpointer`][distilabel.steps.HuggingFaceHubCheckpointer] step to push the data to the hub.

 ```python
 from datasets import Dataset
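The sample pipeline is truncated in this diff. A minimal sketch of the pattern the guide describes, assuming `LoadDataFromDicts` for the dummy data and `repo_id`/`private` constructor parameters on the checkpointer; check the class docstring for the exact signature:

```python
from distilabel.pipeline import Pipeline
from distilabel.steps import HuggingFaceHubCheckpointer, LoadDataFromDicts

with Pipeline(name="checkpoint-demo") as pipeline:
    # Fake dataset with dummy data, standing in for a long-running generator.
    loader = LoadDataFromDicts(data=[{"text": "dummy"}] * 100)
    # Assumed parameters: a Hub dataset repo to push checkpoints to.
    # Data is pushed every `input_batch_size` examples generated.
    checkpointer = HuggingFaceHubCheckpointer(
        repo_id="username/pipeline-checkpoints",  # hypothetical repo
        private=True,
        input_batch_size=50,
    )
    loader >> checkpointer

distiset = pipeline.run(use_cache=False)
```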

Diff for: src/distilabel/pipeline/__init__.py (+2)

@@ -19,10 +19,12 @@
     sample_n_steps,
 )
 from distilabel.pipeline.templates import (
+    DatasetInstructionResponsePipeline,
     InstructionResponsePipeline,
 )

 __all__ = [
+    "DatasetInstructionResponsePipeline",
     "InstructionResponsePipeline",
     "Pipeline",
     "RayPipeline",

Diff for: src/distilabel/pipeline/templates/__init__.py (+1)

@@ -12,4 +12,5 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+from .dataset_instruction import DatasetInstructionResponsePipeline  # noqa: F401
 from .instruction import InstructionResponsePipeline  # noqa: F401

Diff for: src/distilabel/pipeline/templates/base.py (new file, +17)

+# Copyright 2023-present, Argilla, Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+class BasePipelineTemplate:  # defined for recursive subclass finder mkdocs
+    pass
Diff for: src/distilabel/pipeline/templates/dataset_instruction.py (new file, +167)

+# Copyright 2023-present, Argilla, Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import Optional
+
+from distilabel.distiset import Distiset
+from distilabel.llms import LLM, InferenceEndpointsLLM
+from distilabel.pipeline import Pipeline
+from distilabel.pipeline.templates.base import BasePipelineTemplate
+from distilabel.steps import ExpandColumns, KeepColumns
+from distilabel.steps.tasks import SelfInstruct, TextGeneration
+
+MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
+
+
+class DatasetInstructionResponsePipeline(BasePipelineTemplate):
+    """Generates instructions and responses for a dataset with input documents.
+
+    This example pipeline can be used for a Supervised Fine-Tuning dataset which you
+    could use to train or evaluate a model. The pipeline generates instructions using the
+    SelfInstruct step and responses using the TextGeneration step.
+
+    Attributes:
+        llm: The LLM to use for generating instructions and responses. Defaults to
+            InferenceEndpointsLLM with Meta-Llama-3.1-8B-Instruct.
+        system_prompt: The system prompt to use for generating instructions and responses.
+            Defaults to "You are a creative AI Assistant writer."
+        hf_token: The Hugging Face token to use for accessing the model. Defaults to None.
+        num_instructions: The number of instructions to generate. Defaults to 2.
+        batch_size: The batch size to use for generation. Defaults to 1.
+
+    Input columns:
+        - input (`str`): The input document to generate instructions and responses for.
+
+    Output columns:
+        - conversation (`ChatType`): the generated conversation which is a list of chat
+            items with a role and a message.
+        - instruction (`str`): the generated instructions if `only_instruction=True`.
+        - response (`str`): the generated response if `n_turns==1`.
+        - system_prompt_key (`str`, optional): the key of the system prompt used to generate
+            the conversation or instruction. Only if `system_prompt` is a dictionary.
+        - model_name (`str`): The model name used to generate the `conversation` or `instruction`.
+
+    References:
+        - [Self-Instruct: Aligning Language Models with Self-Generated Instructions](https://arxiv.org/abs/2212.10560)
+
+    Examples:
+        Generate instructions and responses for a dataset of documents:
+
+        ```python
+        from datasets import Dataset
+        from distilabel.pipeline import DatasetInstructionResponsePipeline
+
+        pipeline = DatasetInstructionResponsePipeline(num_instructions=5)
+
+        distiset = pipeline.run(
+            use_cache=False,
+            dataset=Dataset.from_list(
+                mapping=[
+                    {
+                        "input": "<document>",
+                    }
+                ]
+            ),
+        )
+        ```
+    """
+
+    def __init__(
+        self,
+        llm: Optional[LLM] = None,
+        system_prompt: str = "You are a creative AI Assistant writer.",
+        hf_token: Optional[str] = None,
+        num_instructions: int = 2,
+        batch_size: int = 1,
+    ) -> None:
+        """Initializes the pipeline.
+
+        Args:
+            llm (Optional[LLM], optional): The language model to use. Defaults to None.
+            system_prompt (str, optional): The system prompt to use. Defaults to "You are a creative AI Assistant writer.".
+            hf_token (Optional[str], optional): The Hugging Face API token to use. Defaults to None.
+            num_instructions (int, optional): The number of instructions to generate. Defaults to 2.
+            batch_size (int, optional): The batch size to use. Defaults to 1.
+        """
+        if llm is None:
+            self.llm: LLM = InferenceEndpointsLLM(
+                model_id=MODEL,
+                tokenizer_id=MODEL,
+                generation_kwargs={
+                    "temperature": 0.9,
+                    "do_sample": True,
+                    "max_new_tokens": 2048,
+                },
+                api_key=hf_token,
+            )
+        else:
+            self.llm = llm
+
+        self.pipeline: Pipeline = self._get_pipeline(
+            system_prompt=system_prompt,
+            num_instructions=num_instructions,
+            batch_size=batch_size,
+        )
+
+    def run(self, dataset, **kwargs) -> Distiset:
+        """Runs the pipeline and returns a Distiset.
+
+        Args:
+            dataset: The dataset to run the pipeline on.
+            **kwargs: Additional arguments to pass to the pipeline.
+        """
+        return self.pipeline.run(dataset, **kwargs)
+
+    def _get_pipeline(
+        self, system_prompt: str, num_instructions: int, batch_size: int
+    ) -> Pipeline:
+        """Returns a pipeline that generates instructions and responses for a given system prompt."""
+        with Pipeline(name="dataset_chat") as pipeline:
+            self_instruct = SelfInstruct(
+                llm=self.llm,
+                num_instructions=num_instructions,
+            )
+
+            expand_columns = ExpandColumns(
+                columns=["instructions"],
+                output_mappings={"instructions": "instruction"},
+            )
+
+            keep_instruction = KeepColumns(
+                columns=["instruction", "input"],
+            )
+
+            response_generation = TextGeneration(
+                name="exam_generation",
+                system_prompt=system_prompt,
+                template="Respond to the instruction based on the document. Document:\n{{ input }} \nInstruction: {{ instruction }}",
+                llm=self.llm,
+                input_batch_size=batch_size,
+                output_mappings={"generation": "response"},
+            )
+
+            keep_response = KeepColumns(
+                columns=["input", "instruction", "response"],
+            )
+
+            (
+                self_instruct
+                >> expand_columns
+                >> keep_instruction
+                >> response_generation
+                >> keep_response
+            )
+
+        return pipeline
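Beyond the docstring example, a hedged sketch of running this template over several documents, using only the parameters visible in the `__init__` signature above; the document contents are placeholders:

```python
from datasets import Dataset

from distilabel.pipeline import DatasetInstructionResponsePipeline

# num_instructions is forwarded to SelfInstruct (instructions per document);
# batch_size is forwarded to the response TextGeneration step.
pipeline = DatasetInstructionResponsePipeline(num_instructions=3, batch_size=2)

documents = Dataset.from_list(
    mapping=[
        {"input": "First source document about distributed systems."},
        {"input": "Second source document about data pipelines."},
    ]
)

distiset = pipeline.run(dataset=documents, use_cache=False)
```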

Diff for: src/distilabel/pipeline/templates/instruction.py (+23 -3)

@@ -17,23 +17,43 @@
 from distilabel.distiset import Distiset
 from distilabel.llms import LLM, InferenceEndpointsLLM
 from distilabel.pipeline import Pipeline
+from distilabel.pipeline.templates.base import BasePipelineTemplate
 from distilabel.steps.tasks import MagpieGenerator

 MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"


-class InstructionResponsePipeline:
+class InstructionResponsePipeline(BasePipelineTemplate):
     """Generates instructions and responses for a given system prompt.

     This example pipeline can be used for a Supervised Fine-Tuning dataset which you
     could use to train or evaluate a model. The pipeline generates instructions using the
     MagpieGenerator and responses for a given system prompt. The pipeline then keeps only
     the instruction, response, and model_name columns.

+    Attributes:
+        llm: The LLM to use for generating instructions and responses. Defaults to
+            InferenceEndpointsLLM with Meta-Llama-3.1-8B-Instruct.
+        system_prompt: The system prompt to use for generating instructions and responses.
+            Defaults to "You are a creative AI Assistant writer."
+        hf_token: The Hugging Face token to use for accessing the model. Defaults to None.
+        n_turns: The number of turns to generate for each conversation. Defaults to 1.
+        num_rows: The number of rows to generate. Defaults to 10.
+        batch_size: The batch size to use for generation. Defaults to 1.
+
+    Output columns:
+        - conversation (`ChatType`): the generated conversation which is a list of chat
+            items with a role and a message.
+        - instruction (`str`): the generated instructions if `only_instruction=True`.
+        - response (`str`): the generated response if `n_turns==1`.
+        - system_prompt_key (`str`, optional): the key of the system prompt used to generate
+            the conversation or instruction. Only if `system_prompt` is a dictionary.
+        - model_name (`str`): The model name used to generate the `conversation` or `instruction`.
+
     References:
-        - [Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](https://arxiv.org/abs/2406.08464)
+        - [Self-Instruct: Aligning Language Models with Self-Generated Instructions](https://arxiv.org/abs/2212.10560)

-    Example:
+    Examples:
         Generate instructions and responses for a given system prompt:
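Both templates accept a custom `llm` in place of the default `InferenceEndpointsLLM`. A minimal sketch, assuming `OpenAILLM` is importable from `distilabel.llms` in this version; the model name is illustrative:

```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import InstructionResponsePipeline

# Any distilabel LLM satisfying the `LLM` interface can be injected.
pipeline = InstructionResponsePipeline(
    llm=OpenAILLM(model="gpt-4o-mini"),  # assumes OPENAI_API_KEY is set
)

dataset = pipeline.run()
```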

Diff for: src/distilabel/steps/checkpointer.py (+1 -2)

@@ -16,15 +16,14 @@
 import tempfile
 from typing import TYPE_CHECKING, Optional

+from huggingface_hub import HfApi
 from pydantic import PrivateAttr

 from distilabel.steps.base import Step, StepInput

 if TYPE_CHECKING:
     from distilabel.typing import StepOutput

-from huggingface_hub import HfApi
-

 class HuggingFaceHubCheckpointer(Step):
     """Special type of step that uploads the data to a Hugging Face Hub dataset.

Diff for: src/distilabel/utils/export_components_info.py (+21)

@@ -18,6 +18,7 @@
 from distilabel.models.embeddings.base import Embeddings
 from distilabel.models.image_generation.base import ImageGenerationModel
 from distilabel.models.llms.base import LLM
+from distilabel.pipeline.templates.base import BasePipelineTemplate
 from distilabel.steps.base import _Step
 from distilabel.steps.tasks.base import _Task
 from distilabel.steps.tasks.generate_embeddings import GenerateEmbeddings

@@ -68,6 +69,13 @@ def export_components_info() -> ComponentsInfo:
             }
             for embeddings_type in _get_embeddings()
         ],
+        "pipelines": [
+            {
+                "name": pipeline_type.__name__,
+                "docstring": parse_google_docstring(pipeline_type),
+            }
+            for pipeline_type in _get_pipelines()
+        ],
     }

@@ -148,6 +156,19 @@ def _get_embeddings() -> List[Type["Embeddings"]]:
     ]


+def _get_pipelines() -> List[Type["BasePipelineTemplate"]]:
+    """Get all `BasePipelineTemplate` subclasses that are not abstract classes.
+
+    Returns:
+        A list of `BasePipelineTemplate` subclasses.
+    """
+    return [
+        pipeline_type
+        for pipeline_type in _recursive_subclasses(BasePipelineTemplate)
+        if not inspect.isabstract(pipeline_type)
+    ]
+
+
 # Reference: https://adamj.eu/tech/2024/05/10/python-all-subclasses/
 def _recursive_subclasses(klass: Type[T]) -> Generator[Type[T], None, None]:
     """Recursively get all subclasses of a class.

 (0)