diff --git a/.github/workflows/build_documentation.yml b/.github/workflows/build_documentation.yml index c4058d1f..3aed3267 100644 --- a/.github/workflows/build_documentation.yml +++ b/.github/workflows/build_documentation.yml @@ -17,7 +17,7 @@ jobs: package_name: cookbook path_to_docs: cookbook/notebooks/ additional_args: --not_python_module - languages: en + languages: en zh-CN convert_notebooks: true secrets: hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }} \ No newline at end of file diff --git a/.github/workflows/build_pr_documentation.yml b/.github/workflows/build_pr_documentation.yml index d48ebd09..64aaf9fe 100644 --- a/.github/workflows/build_pr_documentation.yml +++ b/.github/workflows/build_pr_documentation.yml @@ -20,5 +20,5 @@ jobs: package_name: cookbook path_to_docs: cookbook/notebooks/ additional_args: --not_python_module - languages: en + languages: en zh-CN convert_notebooks: true \ No newline at end of file diff --git a/.gitignore b/.gitignore index b49dc82e..f4d34cf4 100644 --- a/.gitignore +++ b/.gitignore @@ -1,5 +1,6 @@ .vscode .idea/ +.venv/ **/.ipynb_checkpoints **/.DS_Store diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index 9c7afaef..d5cb0876 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -18,3 +18,5 @@ title: Advanced RAG on HuggingFace documentation using LangChain - local: rag_evaluation title: RAG Evaluation + - local: prompt_tuning_peft + title: Prompt tuning with PEFT diff --git a/notebooks/en/index.md b/notebooks/en/index.md index 90908341..68c6693a 100644 --- a/notebooks/en/index.md +++ b/notebooks/en/index.md @@ -8,6 +8,7 @@ applications and solving various machine learning tasks using open-source tools Check out the recently added notebooks: - [Stable Diffusion Interpolation](stable_diffusion_interpolation) +- [Prompt Tuning with PEFT Library](prompt_tuning_peft) - [Migrating from OpenAI to Open LLMs Using TGI's Messages API](tgi_messages_api_demo) - [Automatic Embeddings with TEI 
through Inference Endpoints](automatic_embedding_tei_inference_endpoints) - [Simple RAG for GitHub issues using Hugging Face Zephyr and LangChain](rag_zephyr_langchain) @@ -16,7 +17,7 @@ Check out the recently added notebooks: - [RAG Evaluation Using Synthetic data and LLM-As-A-Judge](rag_evaluation) - [Advanced RAG on HuggingFace documentation using LangChain](advanced_rag) -You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook). +You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook.ipynb). ## Contributing diff --git a/notebooks/en/prompt_tuning_peft.ipynb b/notebooks/en/prompt_tuning_peft.ipynb new file mode 100644 index 00000000..2ae63c4d --- /dev/null +++ b/notebooks/en/prompt_tuning_peft.ipynb @@ -0,0 +1,1022 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "6fba2d42-ed99-4a03-8033-d479ce24d5dd", + "showTitle": false, + "title": "" + }, + "id": "2vkOvTEsVaTA" + }, + "source": [ + "# Prompt Tuning With PEFT.\n", + "_Authored by: [Pere Martra](https://github.com/peremartra)_\n", + "\n", + "\n", + "In this notebook we are introducing how to apply prompt tuning with the PEFT library to a pre-trained model.\n", + "\n", + "For a complete list of models compatible with PEFT refer to their [documentation](https://huggingface.co/docs/peft/main/en/index#supported-methods).\n", + "\n", + "A short sample of models available to be trained with PEFT includes Bloom, Llama, GPT-J, GPT-2, BERT, and more. Hugging Face is working hard to add more models to the library.\n", + "\n", + "## Brief introduction to Prompt Tuning.\n", + "It’s an Additive Fine-Tuning technique for models. This means that we WILL NOT MODIFY ANY WEIGHTS OF THE ORIGINAL MODEL. You might be wondering, how are we going to perform Fine-Tuning then? 
Well, we will train additional layers that are added to the model. That’s why it’s called an Additive technique.\n", + "\n", + "Considering it’s an Additive technique and its name is Prompt-Tuning, it seems clear that the layers we’re going to add and train are related to the prompt.\n", + "\n", + "![Prompt_Tuning_Diagram](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/Martra_Figure_5_Prompt_Tuning.jpg)\n", + "\n", + "We are creating a type of superprompt by enabling a model to enhance a portion of the prompt with its acquired knowledge. However, that particular section of the prompt cannot be translated into natural language. **It's as if we've mastered expressing ourselves in embeddings and generating highly effective prompts.**\n", + "\n", + "In each training cycle, the only weights that can be modified to minimize the loss function are those integrated into the prompt.\n", + "\n", + "The primary consequence of this technique is that the number of parameters to train is genuinely small. However, we encounter a second, perhaps more significant consequence, namely that, **since we do not modify the weights of the pretrained model, it does not alter its behavior or forget any information it has previously learned.**\n", + "\n", + "The training is faster and more cost-effective. Moreover, we can train various models, and during inference time, we only need to load one foundational model along with the new smaller trained models because the weights of the original model have not been altered\n", + "\n", + "## What are we going to do in the notebook?\n", + "We are going to train two different models using two datasets, each with just one pre-trained model from the Bloom family. One model will be trained with a dataset of prompts, while the other will use a dataset of inspirational sentences. 
We will compare the results for the same question from both models before and after training.\n", + "\n", + "Additionally, we'll explore how to load both models with only one copy of the foundational model in memory.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tZhdbTh-VaTA" + }, + "source": [ + "## Loading the PEFT Library\n", + "This library contains the Hugging Face implementation of various Fine-Tuning techniques, including Prompt Tuning" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "d16bf5ec-888b-4c76-a655-193fd4cc8a36", + "showTitle": false, + "title": "" + }, + "id": "JechhJhhVaTA" + }, + "outputs": [], + "source": [ + "!pip install -q peft==0.8.2" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": { + "id": "6CRxq5Z2WJ7C" + }, + "outputs": [], + "source": [ + "!pip install -q datasets==2.14.5" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GGbh426RVaTB" + }, + "source": [ + "From the transformers library, we import the necessary classes to instantiate the model and the tokenizer." + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "31738463-c9b0-431d-869e-1735e1e2f5c7", + "showTitle": false, + "title": "" + }, + "id": "KWOEt-yOVaTB" + }, + "outputs": [], + "source": [ + "from transformers import AutoModelForCausalLM, AutoTokenizer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6qYsnwjSVaTC" + }, + "source": [ + "### Loading the model and the tokenizers.\n", + "\n", + "Bloom is one of the smallest and smartest models available for training with the PEFT Library using Prompt Tuning. 
You can choose any model from the Bloom family, and I encourage you to try at least two of them to observe the differences.\n", + "\n", + "I'm opting for the smallest one to minimize training time and avoid memory issues in Colab." + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "metadata": { + "id": "MnqIhv2UVaTC" + }, + "outputs": [], + "source": [ + "model_name = \"bigscience/bloomz-560m\"\n", + "#model_name=\"bigscience/bloom-1b1\"\n", + "NUM_VIRTUAL_TOKENS = 4\n", + "NUM_EPOCHS = 6" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": { + "id": "fSMu3qRsVaTC" + }, + "outputs": [], + "source": [ + "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", + "foundational_model = AutoModelForCausalLM.from_pretrained(\n", + " model_name,\n", + " trust_remote_code=True\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8W2fWhOnVaTC" + }, + "source": [ + "## Inference with the pre-trained Bloom model\n", + "If you want more varied and original generations, uncomment the parameters temperature, top_p, and do_sample in *model.generate* below.\n", + "\n", + "With the default configuration, the model's responses remain consistent across calls." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": { + "id": "47j2D3WWVaTC" + }, + "outputs": [], + "source": [ + "#Generate text with the given model from the tokenized inputs and return the output token IDs.\n", + "def get_outputs(model, inputs, max_new_tokens=100):\n", + " outputs = model.generate(\n", + " input_ids=inputs[\"input_ids\"],\n", + " attention_mask=inputs[\"attention_mask\"],\n", + " max_new_tokens=max_new_tokens,\n", + " #temperature=0.2,\n", + " #top_p=0.95,\n", + " #do_sample=True,\n", + " repetition_penalty=1.5, #Avoid repetition.\n", + " early_stopping=True, #Only affects beam search; generation ends at EOS or max_new_tokens.\n", + " eos_token_id=tokenizer.eos_token_id\n", + " )\n", + " return outputs" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "ca4d203a-5152-4947-ab34-cfd0b40a102a", + "showTitle": false, + "title": "" + }, + "id": "kRLSfuo2VaTC" + }, + "source": [ + "As we want to have two different trained models, I will create two distinct prompts.\n", + "\n", + "The first model will be trained with a dataset containing prompts, and the second one with a dataset of motivational sentences.\n", + "\n", + "The first model will receive the prompt \"I want you to act as a motivational coach.\" and the second model will receive \"There are two nice things that should matter to you:\"\n", + "\n", + "But first, I'm going to collect some results from the model without Fine-Tuning." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 47, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "1d4c80a9-4edd-4fcd-aef0-996f4da5cc02", + "showTitle": false, + "title": "" + }, + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "QvStaT7cVaTC", + "outputId": "ab34b3cd-a849-4dff-b36d-bf25c9f55ce1" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "[\"I want you to act as a motivational coach. Don't be afraid of being challenged.\"]\n" + ] + } + ], + "source": [ + "input_prompt = tokenizer(\"I want you to act as a motivational coach. \", return_tensors=\"pt\")\n", + "foundational_outputs_prompt = get_outputs(foundational_model, input_prompt, max_new_tokens=50)\n", + "\n", + "print(tokenizer.batch_decode(foundational_outputs_prompt, skip_special_tokens=True))" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "1Xhm3jZMVaTD", + "outputId": "305f0137-6a02-4e43-9c9d-2b4ecd377937" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "['There are two nice things that should matter to you: the price and quality of your product.']\n" + ] + } + ], + "source": [ + "input_sentences = tokenizer(\"There are two nice things that should matter to you:\", return_tensors=\"pt\")\n", + "foundational_outputs_sentence = get_outputs(foundational_model, input_sentences, max_new_tokens=50)\n", + "\n", + "print(tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "f438d43b-6b9f-445e-9df4-60ea09640764", + "showTitle": false, + "title": "" + }, + "id": "OGbJTbRnVaTD" + }, + "source": [ + "Both answers are more or less correct. 
Every Bloom model is pre-trained and can already generate sensible, well-formed sentences. Let's see whether, after training, the responses are equally good or better.\n", + "\n", + "## Preparing the Datasets\n", + "The datasets used are:\n", + "* https://huggingface.co/datasets/fka/awesome-chatgpt-prompts\n", + "* https://huggingface.co/datasets/Abirate/english_quotes\n" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": { + "id": "RD8H_LLaVaTD" + }, + "outputs": [], + "source": [ + "import os\n", + "#os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "2ed62b41-e3fa-4a41-a0a9-59f35a6904f9", + "showTitle": false, + "title": "" + }, + "id": "xmAp_o4PVaTD" + }, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "dataset_prompt = \"fka/awesome-chatgpt-prompts\"\n", + "\n", + "#Load and tokenize the prompts dataset.\n", + "data_prompt = load_dataset(dataset_prompt)\n", + "data_prompt = data_prompt.map(lambda samples: tokenizer(samples[\"prompt\"]), batched=True)\n", + "train_sample_prompt = data_prompt[\"train\"].select(range(50))\n" + ] + }, + { + "cell_type": "code", + "source": [ + "display(train_sample_prompt)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 86 + }, + "id": "jNlOpGbqBgcu", + "outputId": "3f8106b2-948b-4a7b-cf78-bd3fcc2f0338" + }, + "execution_count": 51, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "Dataset({\n", + " features: ['act', 'prompt', 'input_ids', 'attention_mask'],\n", + " num_rows: 50\n", + "})" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "dZcOaE5CU658", + "outputId": 
"fb8f5081-012b-4c37-ee1f-3aef2d0f54a7" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "{'act': ['Linux Terminal'], 'prompt': ['I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd'], 'input_ids': [[44, 4026, 1152, 427, 1769, 661, 267, 104105, 28434, 17, 473, 2152, 4105, 49123, 530, 1152, 2152, 57502, 1002, 3595, 368, 28434, 3403, 6460, 17, 473, 4026, 1152, 427, 3804, 57502, 1002, 368, 28434, 10014, 14652, 2592, 19826, 4400, 10973, 15, 530, 16915, 4384, 17, 727, 1130, 11602, 184637, 17, 727, 1130, 4105, 49123, 35262, 473, 32247, 1152, 427, 727, 1427, 17, 3262, 707, 3423, 427, 13485, 1152, 7747, 361, 170205, 15, 707, 2152, 727, 1427, 1331, 55385, 5484, 14652, 6291, 999, 117805, 731, 29726, 1119, 96, 17, 2670, 3968, 9361, 632, 269, 42512]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}\n" + ] + } + ], + "source": [ + "print(train_sample_prompt[:1])" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": { + "id": "WeM66LmEVaTD" + }, + "outputs": [], + "source": [ + "dataset_sentences = load_dataset(\"Abirate/english_quotes\")\n", + "\n", + "data_sentences = dataset_sentences.map(lambda samples: tokenizer(samples[\"quote\"]), batched=True)\n", + "train_sample_sentences = data_sentences[\"train\"].select(range(25))\n", + "train_sample_sentences = 
train_sample_sentences.remove_columns(['author', 'tags'])" + ] + }, + { + "cell_type": "code", + "source": [ + "display(train_sample_sentences)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 86 + }, + "id": "zUSG_M_nBp_E", + "outputId": "faf36464-de24-4512-aace-c1ff8713c1d4" + }, + "execution_count": 54, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "Dataset({\n", + " features: ['quote', 'input_ids', 'attention_mask'],\n", + " num_rows: 25\n", + "})" + ] + }, + "metadata": {} + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "b97381d4-5fe2-49d0-be5d-2fe3421edc5c", + "showTitle": false, + "title": "" + }, + "id": "0-5mv1ZpVaTD" + }, + "source": [ + "## Fine-Tuning. \n", + "\n", + "### PEFT configurations\n", + "\n", + "\n", + "API docs:\n", + "https://huggingface.co/docs/peft/main/en/package_reference/tuners#peft.PromptTuningConfig\n", + "\n", + "We can use the same configuration for both models to be trained.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "6df8e1f1-be9e-42db-b4a4-6af7cd351004", + "showTitle": false, + "title": "" + }, + "id": "sOg1Yh-oVaTD" + }, + "outputs": [], + "source": [ + "from peft import get_peft_model, PromptTuningConfig, TaskType, PromptTuningInit\n", + "\n", + "generation_config = PromptTuningConfig(\n", + " task_type=TaskType.CAUSAL_LM, #This type indicates the model will generate text.\n", + " prompt_tuning_init=PromptTuningInit.RANDOM, #The added virtual tokens are initialized with random numbers.\n", + " num_virtual_tokens=NUM_VIRTUAL_TOKENS, #Number of virtual tokens to be added and trained.\n", + " tokenizer_name_or_path=model_name #The pre-trained model.\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + 
"id": "an9KBtB1VaTD" + }, + "source": [ + "### Creating two Prompt Tuning Models.\n", + "We will create two identical prompt tuning models using the same pre-trained model and the same config." + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "c_D8oDQZVaTD", + "outputId": "6b46ca98-3f60-49c1-dab2-91259d6387af" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "trainable params: 4,096 || all params: 559,218,688 || trainable%: 0.0007324504863471229\n", + "None\n" + ] + } + ], + "source": [ + "peft_model_prompt = get_peft_model(foundational_model, generation_config)\n", + "print(peft_model_prompt.print_trainable_parameters())" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "IktYfj68VaTE", + "outputId": "28fe03b7-4490-43ba-b913-4633e269737a" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "trainable params: 4,096 || all params: 559,218,688 || trainable%: 0.0007324504863471229\n", + "None\n" + ] + } + ], + "source": [ + "peft_model_sentences = get_peft_model(foundational_model, generation_config)\n", + "print(peft_model_sentences.print_trainable_parameters())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "cff5bc33-8cfb-4144-8962-9c54362a7faa", + "showTitle": false, + "title": "" + }, + "id": "i6WhJSUwVaTE" + }, + "source": [ + "**That's amazing: did you see the reduction in trainable parameters? We are going to train a 0.001% of the paramaters available.**\n", + "\n", + "Now we are going to create the training arguments, and we will use the same configuration in both trainings." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": { + "id": "SJoznfzjVaTE" + }, + "outputs": [], + "source": [ + "from transformers import TrainingArguments\n", + "def create_training_arguments(path, learning_rate=0.0035, epochs=6):\n", + " training_args = TrainingArguments(\n", + " output_dir=path, # Where the model predictions and checkpoints will be written\n", + " use_cpu=True, # This is necessary for CPU clusters.\n", + " auto_find_batch_size=True, # Find a suitable batch size that will fit into memory automatically\n", + " learning_rate=learning_rate, # Higher learning rate than full Fine-Tuning\n", + " num_train_epochs=epochs\n", + " )\n", + " return training_args" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "54b78a8f-81f0-44c0-b0bc-dcb14891715f", + "showTitle": false, + "title": "" + }, + "id": "cb1j50DSVaTE" + }, + "outputs": [], + "source": [ + "\n", + "import os\n", + "\n", + "working_dir = \"./\"\n", + "\n", + "#It is best to store the models in separate folders.\n", + "#Build the names of the directories where the models will be stored.\n", + "output_directory_prompt = os.path.join(working_dir, \"peft_outputs_prompt\")\n", + "output_directory_sentences = os.path.join(working_dir, \"peft_outputs_sentences\")\n", + "\n", + "#Create the directories if they do not exist.\n", + "if not os.path.exists(working_dir):\n", + " os.mkdir(working_dir)\n", + "if not os.path.exists(output_directory_prompt):\n", + " os.mkdir(output_directory_prompt)\n", + "if not os.path.exists(output_directory_sentences):\n", + " os.mkdir(output_directory_sentences)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OC5IhO9mVaTE" + }, + "source": [ + "We need to indicate the directory containing the model when creating the TrainingArguments." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": { + "id": "D4v4RSSeVaTE" + }, + "outputs": [], + "source": [ + "training_args_prompt = create_training_arguments(output_directory_prompt, 0.003, NUM_EPOCHS)\n", + "training_args_sentences = create_training_arguments(output_directory_sentences, 0.003, NUM_EPOCHS)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "c593deb6-5626-4fd9-89c2-2329e2f9b6e0", + "showTitle": false, + "title": "" + }, + "id": "GdMfjk5RVaTE" + }, + "source": [ + "## Train\n", + "\n", + "We will create one Trainer object for each model to train. " + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": { + "id": "uVAfNdEIVaTE" + }, + "outputs": [], + "source": [ + "from transformers import Trainer, DataCollatorForLanguageModeling\n", + "def create_trainer(model, training_args, train_dataset):\n", + " trainer = Trainer(\n", + " model=model, # We pass in the PEFT version of the foundation model, bloomz-560M\n", + " args=training_args, #The args for the training.\n", + " train_dataset=train_dataset, #The dataset used to train the model.\n", + " data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False) # mlm=False indicates not to use masked language modeling\n", + " )\n", + " return trainer\n" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "32e43bcf-23b2-46aa-9cf0-455b83ef4f38", + "showTitle": false, + "title": "" + }, + "colab": { + "base_uri": "https://localhost:8080/", + "height": 127 + }, + "id": "1Sz9BeFZVaTF", + "outputId": "1b698470-209e-4001-fcbe-6fa8a2ac8707" + }, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "\n", + "
\n", + " [42/42 11:23, Epoch 6/6]\n", + "
Step | Training Loss
" + ] + }, + "metadata": {} + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "TrainOutput(global_step=42, training_loss=3.5800417945498513, metrics={'train_runtime': 703.2941, 'train_samples_per_second': 0.427, 'train_steps_per_second': 0.06, 'total_flos': 60957279240192.0, 'train_loss': 3.5800417945498513, 'epoch': 6.0})" + ] + }, + "metadata": {}, + "execution_count": 62 + } + ], + "source": [ + "#Training first model.\n", + "trainer_prompt = create_trainer(peft_model_prompt, training_args_prompt, train_sample_prompt)\n", + "trainer_prompt.train()" + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 127 + }, + "id": "afTotMckVaTF", + "outputId": "15bed85d-17f5-4a49-d8d5-bae35e68d294" + }, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "\n", + "

\n", + " [24/24 03:29, Epoch 6/6]\n", + "
Step | Training Loss
" + ] + }, + "metadata": {} + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "TrainOutput(global_step=24, training_loss=4.4278310139973955, metrics={'train_runtime': 219.765, 'train_samples_per_second': 0.683, 'train_steps_per_second': 0.109, 'total_flos': 17825006936064.0, 'train_loss': 4.4278310139973955, 'epoch': 6.0})" + ] + }, + "metadata": {}, + "execution_count": 63 + } + ], + "source": [ + "#Training second model.\n", + "trainer_sentences = create_trainer(peft_model_sentences, training_args_sentences, train_sample_sentences)\n", + "trainer_sentences.train()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z2Zsww_2VaTF" + }, + "source": [ + "In less than 10 minutes (CPU time in a M1 Pro) we trained 2 different models, with two different missions with a same foundational model as a base." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "5a6c8daf-8248-458a-9f6f-14865b4fbd2e", + "showTitle": false, + "title": "" + }, + "id": "s5k10HwoVaTG" + }, + "source": [ + "## Save models\n", + "We are going to save the models. These models are ready to be used, as long as we have the pre-trained model from which they were created in memory." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "409df5ce-e496-46d7-be2c-202a463cdc80", + "showTitle": false, + "title": "" + }, + "id": "E3dn3PeMVaTG" + }, + "outputs": [], + "source": [ + "trainer_prompt.model.save_pretrained(output_directory_prompt)\n", + "trainer_sentences.model.save_pretrained(output_directory_sentences)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "fb14e3fd-bbf6-4d56-92c2-51bfe08de72a", + "showTitle": false, + "title": "" + }, + "id": "rkUKpDDWVaTG" + }, + "source": [ + "## Inference\n", + "\n", + "You can load the model from the path that you have saved to before, and ask the model to generate text based on our input before!" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "cc48af16-c117-4019-a31a-ce1c93cd21d4", + "showTitle": false, + "title": "" + }, + "id": "dlqXXN8oVaTG" + }, + "outputs": [], + "source": [ + "from peft import PeftModel\n", + "\n", + "loaded_model_prompt = PeftModel.from_pretrained(foundational_model,\n", + " output_directory_prompt,\n", + " #device_map='auto',\n", + " is_trainable=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 66, + "metadata": { + "application/vnd.databricks.v1+cell": { + "cellMetadata": {}, + "inputWidgets": {}, + "nuid": "6b44524b-2ac5-4e74-81e6-c406d4414e42", + "showTitle": false, + "title": "" + }, + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "-4jd3zCGVaTG", + "outputId": "b55454f1-f1ed-444c-b107-698778406e6e" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "['I want you to act as a motivational coach. 
You will be helping students learn how they can improve their performance in the classroom and at school.']\n" + ] + } + ], + "source": [ + "loaded_model_prompt_outputs = get_outputs(loaded_model_prompt, input_prompt)\n", + "print(tokenizer.batch_decode(loaded_model_prompt_outputs, skip_special_tokens=True))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SHbeFTXjVaTG" + }, + "source": [ + "If we compare both answers, something has changed.\n", + "* ***Pretrained Model:*** *I want you to act as a motivational coach. Don't be afraid of being challenged.*\n", + "* ***Fine-Tuned Model:*** *I want you to act as a motivational coach. You will be helping students learn how they can improve their performance in the classroom and at school.*\n", + "\n", + "Keep in mind that we have only trained the model for a few minutes, but that was enough to obtain a response closer to what we were looking for." + ] + }, + { + "cell_type": "code", + "execution_count": 67, + "metadata": { + "id": "MuwAsq3uVaTG" + }, + "outputs": [], + "source": [ + "loaded_model_prompt.load_adapter(output_directory_sentences, adapter_name=\"quotes\")\n", + "loaded_model_prompt.set_adapter(\"quotes\")" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "IQm--PWSVaTH", + "outputId": "3e814a6a-a380-4f2c-f887-6852a9f51002" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "['There are two nice things that should matter to you: the weather and your health.']\n" + ] + } + ], + "source": [ + "loaded_model_sentences_outputs = get_outputs(loaded_model_prompt, input_sentences)\n", + "print(tokenizer.batch_decode(loaded_model_sentences_outputs, skip_special_tokens=True))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UnR8y9gwVaTH" + }, + "source": [ + "With the second model we have a similar result.\n", + "* **Pretrained Model:** *There are two nice things that should matter to you: 
the price and quality of your product.*\n", + "* **Fine-Tuned Model:** *There are two nice things that should matter to you: the weather and your health.*\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B6TUjNtGVaTH" + }, + "source": [ + "# Conclusion\n", + "Prompt Tuning is an amazing technique that can save us hours of training and a significant amount of money. In the notebook, we have trained two models in just a few minutes, and we can have both models in memory, providing service to different clients.\n", + "\n", + "If you want to try different combinations and models, the notebook is ready to use another model from the Bloom family.\n", + "\n", + "You can change the number of epochs to train, the number of virtual tokens, and the model in the third cell. However, there are many configurations to change. If you're looking for a good exercise, you can replace the random initialization of the virtual tokens with a fixed value.\n", + "\n", + "*The responses of the Fine-Tuned models may vary every time we train them. 
I've pasted the results of one of my trainings, but the actual results may differ.*" + ] + }, + { + "cell_type": "code", + "execution_count": 68, + "metadata": { + "id": "5OMyCWasVaTH" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "application/vnd.databricks.v1+notebook": { + "dashboards": [], + "language": "python", + "notebookMetadata": { + "pythonIndentUnit": 2 + }, + "notebookName": "LLM 02 - Prompt Tuning with PEFT", + "widgets": {} + }, + "colab": { + "machine_shape": "hm", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/notebooks/en/rag_zephyr_langchain.ipynb b/notebooks/en/rag_zephyr_langchain.ipynb index 992d5820..55738b98 100644 --- a/notebooks/en/rag_zephyr_langchain.ipynb +++ b/notebooks/en/rag_zephyr_langchain.ipynb @@ -140,11 +140,7 @@ "source": [ "The content of individual GitHub issues may be longer than what an embedding model can take as input. If we want to embed all of the available content, we need to chunk the documents into appropriately sized pieces.\n", "\n", - "The most common and straightforward approach to chunking is to define a fixed size of chunks and whether there should be any overlap between them. Keeping some overlap between chunks allows us to preserve some semantic context between the chunks.\n", - "\n", - "Other approaches are typically more involved and take into account the documents' structure and context. 
For example, one may want to split a document based on sentences or paragraphs, or create chunks based on the\n", - "\n", - "The fixed-size chunking, however, works well for most common cases, so that is what we'll do here." + "The most common and straightforward approach to chunking is to define a fixed size of chunks and whether there should be any overlap between them. Keeping some overlap between chunks allows us to preserve some semantic context between the chunks. The recommended splitter for generic text is the [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter), and that's what we'll use here. " ] }, { @@ -155,9 +151,9 @@ }, "outputs": [], "source": [ - "from langchain.text_splitter import CharacterTextSplitter\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", "\n", - "splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=30)\n", + "splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)\n", "\n", "chunked_docs = splitter.split_documents(docs)" ] diff --git a/notebooks/en/tgi_messages_api_demo.ipynb b/notebooks/en/tgi_messages_api_demo.ipynb new file mode 100644 index 00000000..b2e53af6 --- /dev/null +++ b/notebooks/en/tgi_messages_api_demo.ipynb @@ -0,0 +1,514 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Migrating from OpenAI to Open LLMs Using TGI's Messages API\n", + "\n", + "_Authored by: [Andrew Reed](https://huggingface.co/andrewrreed)_\n", + "\n", + "This notebook demonstrates how you can easily transition from OpenAI models to Open LLMs without needing to refactor any existing code.\n", + "\n", + "[Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) now offers a [Messages API](https://huggingface.co/blog/tgi-messages-api), making it directly compatible with the OpenAI Chat Completion API. 
This means that any existing scripts that use OpenAI models (via the OpenAI client library or third-party tools like LangChain or LlamaIndex) can be directly swapped out to use any open LLM running on a TGI endpoint!\n", + "\n", + "This allows you to quickly test out and benefit from the numerous advantages offered by open models. Things like:\n", + "\n", + "- Complete control and transparency over models and data\n", + "- No more worrying about rate limits\n", + "- The ability to fully customize systems according to your specific needs\n", + "\n", + "In this notebook, we'll show you how to:\n", + "\n", + "1. [Create Inference Endpoint to Deploy a Model with TGI](#section_1)\n", + "2. [Query the Inference Endpoint with OpenAI Client Libraries](#section_2)\n", + "3. [Integrate the Endpoint with LangChain and LlamaIndex Workflows](#section_3)\n", + "\n", + "**Let's dive in!**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup\n", + "\n", + "First we need to install dependencies and set an HF API key.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install --upgrade -q huggingface_hub langchain langchain-community langchainhub langchain-openai llama-index chromadb bs4 sentence_transformers torch" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import getpass\n", + "\n", + "# enter API key\n", + "os.environ[\"HUGGINGFACEHUB_API_TOKEN\"] = HF_API_KEY = getpass.getpass()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## 1. 
Create an Inference Endpoint\n", + "\n", + "To get started, let's deploy [Nous-Hermes-2-Mixtral-8x7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO), a fine-tuned Mixtral model, to Inference Endpoints using TGI.\n", + "\n", + "We can deploy the model in just [a few clicks from the UI](https://ui.endpoints.huggingface.co/new?vendor=aws&repository=NousResearch%2FNous-Hermes-2-Mixtral-8x7B-DPO&tgi_max_total_tokens=32000&tgi=true&tgi_max_input_length=1024&task=text-generation&instance_size=2xlarge&tgi_max_batch_prefill_tokens=2048&tgi_max_batch_total_tokens=1024000&no_suggested_compute=true&accelerator=gpu&region=us-east-1), or take advantage of the `huggingface_hub` Python library to programmatically create and manage Inference Endpoints.\n", + "\n", + "We'll use the Hub library here by specifying an endpoint name and model repository, along with the task of `text-generation`. In this example, we use a `protected` type so access to the deployed model will require a valid Hugging Face token. We also need to configure the hardware requirements like vendor, region, accelerator, instance type, and size. 
You can check out the list of available resource options [using this API call](https://api.endpoints.huggingface.cloud/#get-/v2/provider), and view recommended configurations for select models in the catalog [here](https://ui.endpoints.huggingface.co/catalog).\n", + "\n", + "_Note: You may need to request a quota upgrade by sending an email to [api-enterprise@huggingface.co](mailto:api-enterprise@huggingface.co)_\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "running\n" + ] + } + ], + "source": [ + "from huggingface_hub import create_inference_endpoint\n", + "\n", + "endpoint = create_inference_endpoint(\n", + " \"nous-hermes-2-mixtral-8x7b-demo\",\n", + " repository=\"NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO\",\n", + " framework=\"pytorch\",\n", + " task=\"text-generation\",\n", + " accelerator=\"gpu\",\n", + " vendor=\"aws\",\n", + " region=\"us-east-1\",\n", + " type=\"protected\",\n", + " instance_type=\"p4de\",\n", + " instance_size=\"2xlarge\",\n", + " custom_image={\n", + " \"health_route\": \"/health\",\n", + " \"env\": {\n", + " \"MAX_INPUT_LENGTH\": \"4096\",\n", + " \"MAX_BATCH_PREFILL_TOKENS\": \"4096\",\n", + " \"MAX_TOTAL_TOKENS\": \"32000\",\n", + " \"MAX_BATCH_TOTAL_TOKENS\": \"1024000\",\n", + " \"MODEL_ID\": \"/repository\",\n", + " },\n", + " \"url\": \"ghcr.io/huggingface/text-generation-inference:sha-1734540\", # must be >= 1.4.0\n", + " },\n", + ")\n", + "\n", + "endpoint.wait()\n", + "print(endpoint.status)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It will take a few minutes for our deployment to spin up. We can use the `.wait()` utility to block the running thread until the endpoint reaches a final \"running\" state. 
Once running, we can confirm its status and take it for a spin via the UI Playground:\n", + "\n", + "![IE UI Overview](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/messages-api/endpoint-overview.png)\n", + "\n", + "Great, we now have a working endpoint!\n", + "\n", + "_Note: When deploying with `huggingface_hub`, your endpoint will scale-to-zero after 15 minutes of idle time by default to optimize cost during periods of inactivity. Check out [the Hub Python Library documentation](https://huggingface.co/docs/huggingface_hub/guides/inference_endpoints) to see all the functionality available for managing your endpoint lifecycle._\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## 2. Query the Inference Endpoint with OpenAI Client Libraries\n", + "\n", + "As mentioned above, since our model is hosted with TGI, it now supports a Messages API, meaning we can query it directly using the familiar OpenAI client libraries.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### With the Python client\n", + "\n", + "The example below shows how to make this transition using the [OpenAI Python Library](https://github.com/openai/openai-python). Simply replace the `` with your endpoint URL (be sure to include the `v1/` suffix) and populate the `` field with a valid Hugging Face user token. The `` can be gathered from the Inference Endpoints UI, or from the endpoint object we created above with `endpoint.url`.\n", + "\n", + "We can then use the client as usual, passing a list of messages to stream responses from our Inference Endpoint.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Open-source software is important due to a number of reasons, including:\n", + "\n", + "1. 
Collaboration: The collaborative nature of open-source software allows developers from around the world to work together, share their ideas and improve the code. This often results in faster progress and better software.\n", + "\n", + "2. Transparency: With open-source software, the code is publicly available, making it easy to see exactly how the software functions, and allowing users to determine if there are any security vulnerabilities.\n", + "\n", + "3. Customization: Being able to access the code also allows users to customize the software to better suit their needs. This makes open-source software incredibly versatile, as users can tweak it to suit their specific use case.\n", + "\n", + "4. Quality: Open-source software is often developed by large communities of dedicated developers, who work together to improve the software. This results in a higher level of quality than might be found in proprietary software.\n", + "\n", + "5. Cost: Open-source software is often provided free of charge, which makes it accessible to a wider range of users. This can be especially important for organizations with limited budgets for software.\n", + "\n", + "6. Shared Benefit: By sharing the code of open-source software, everyone can benefit from the hard work of the developers. 
This contributes to the overall advancement of technology, as users and developers work together to improve and build upon the software.\n", + "\n", + "In summary, open-source software provides a collaborative platform that leads to high-quality, customizable, and transparent software, all available at little or no cost, benefiting both individuals and the technology community as a whole.<|im_end|>" + ] + } + ], + "source": [ + "from openai import OpenAI\n", + "\n", + "BASE_URL = endpoint.url\n", + "\n", + "# init the client but point it to TGI\n", + "client = OpenAI(\n", + " base_url=os.path.join(BASE_URL, \"v1/\"),\n", + " api_key=HF_API_KEY,\n", + ")\n", + "chat_completion = client.chat.completions.create(\n", + " model=\"tgi\",\n", + " messages=[\n", + " {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n", + " {\"role\": \"user\", \"content\": \"Why is open-source software important?\"},\n", + " ],\n", + " stream=True,\n", + " max_tokens=500,\n", + ")\n", + "\n", + "# iterate and print stream\n", + "for message in chat_completion:\n", + " print(message.choices[0].delta.content, end=\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Behind the scenes, TGI’s Messages API automatically converts the list of messages into the model’s required instruction format using its [chat template](https://huggingface.co/docs/transformers/chat_templating).\n", + "\n", + "_Note: Certain OpenAI features, like function calling, are not compatible with TGI. 
Currently, the Messages API supports the following chat completion parameters: `stream`, `max_new_tokens`, `frequency_penalty`, `logprobs`, `seed`, `temperature`, and `top_p`._\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### With the JavaScript client\n", + "\n", + "Here’s the same streaming example above, but using the [OpenAI Javascript/Typescript Library](https://github.com/openai/openai-node).\n", + "\n", + "```js\n", + "import OpenAI from \"openai\";\n", + "\n", + "const openai = new OpenAI({\n", + " baseURL: \"\" + \"/v1/\", // replace with your endpoint url\n", + " apiKey: \"\", // replace with your token\n", + "});\n", + "\n", + "async function main() {\n", + " const stream = await openai.chat.completions.create({\n", + " model: \"tgi\",\n", + " messages: [\n", + " { role: \"system\", content: \"You are a helpful assistant.\" },\n", + " { role: \"user\", content: \"Why is open-source software important?\" },\n", + " ],\n", + " stream: true,\n", + " max_tokens: 500,\n", + " });\n", + " for await (const chunk of stream) {\n", + " process.stdout.write(chunk.choices[0]?.delta?.content || \"\");\n", + " }\n", + "}\n", + "\n", + "main();\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## 3. Integrate with LangChain and LlamaIndex\n", + "\n", + "Now, let’s see how to use this newly created endpoint with popular RAG frameworks like LangChain and LlamaIndex.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### How to use with LangChain\n", + "\n", + "To use it in [LangChain](https://python.langchain.com/docs/get_started/introduction), simply create an instance of `ChatOpenAI` and pass your `` and `` as follows:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "AIMessage(content='Open-source software is important for several reasons:\\n\\n1. 
Transparency: Open-source software allows users to see the underlying code, making it easier to understand how the software works and identify any potential security vulnerabilities or bugs. This transparency fosters trust between users and developers.\\n\\n2. Collaboration: Open-source projects encourage collaboration among developers, allowing them to work together to improve the software, fix issues, and add new features. This collective effort can lead to')" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_openai import ChatOpenAI\n", + "\n", + "llm = ChatOpenAI(\n", + " model_name=\"tgi\",\n", + " openai_api_key=HF_API_KEY,\n", + " openai_api_base=os.path.join(BASE_URL, \"v1/\"),\n", + ")\n", + "llm.invoke(\"Why is open-source software important?\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We’re able to directly leverage the same `ChatOpenAI` class that we would have used with the OpenAI models. This allows all previous code to work with our endpoint by changing just one line of code.\n", + "\n", + "Let’s now use our Mixtral model in a simple RAG pipeline to answer a question over the contents of a HF blog post.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'context': [Document(page_content='To overcome this weakness, amongst other approaches, one can integrate the LLM into a system where it can call tools: such a system is called an LLM agent.\\nIn this post, we explain the inner workings of ReAct agents, then show how to build them using the ChatHuggingFace class recently integrated in LangChain. 
Finally, we benchmark several open-source LLMs against GPT-3.5 and GPT-4.', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'}),\n", + " Document(page_content='Since the open-source models were not specifically fine-tuned for calling functions in the given output format, they are at a slight disadvantage compared to the OpenAI agents.\\nDespite this, some models perform really well! 💪\\nHere’s an example of Mixtral-8x7B answering the question: “Which city has a larger population, Guiyang or Tacheng?”\\nThought: To answer this question, I need to find the current populations of both Guiyang and Tacheng. I will use the search tool to find this information.\\nAction:\\n{', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'}),\n", + " Document(page_content='Agents Showdown: how do open-source LLMs perform as general purpose reasoning agents?\\n\\t\\n\\nYou can find the code for this benchmark here.\\n\\n\\n\\n\\n\\n\\t\\tEvaluation\\n\\t\\n\\nWe want to measure how open-source LLMs perform as general purpose reasoning agents. 
Thus we select questions requiring using logic and the use of basic tools: a calculator and access to internet search.\\nThe final dataset is a combination of samples from 3 other datasets:', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'}),\n", + " Document(page_content='Open-source LLMs as LangChain Agents\\n\\t\\n\\nPublished\\n\\t\\t\\t\\tJanuary 24, 2024\\nUpdate on GitHub\\n\\nm-ric\\nAymeric Roucher\\n\\n\\n\\n\\nJofthomas\\nJoffrey THOMAS\\n\\n\\n\\n\\nandrewrreed\\nAndrew Reed\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\t\\tTL;DR\\n\\t\\n\\nOpen-source LLMs have now reached a performance level that makes them suitable reasoning engines for powering agent workflows: Mixtral even surpasses GPT-3.5 on our benchmark, and its performance could easily be further enhanced with fine-tuning.\\n\\n\\n\\n\\n\\n\\t\\tIntroduction', metadata={'description': 'We’re on a journey to advance and democratize artificial intelligence through open source and open science.', 'language': 'No language found.', 'source': 'https://huggingface.co/blog/open-source-llms-as-agents', 'title': 'Open-source LLMs as LangChain Agents'})],\n", + " 'question': 'According to this article which open-source model is the best for an agent behaviour?',\n", + " 'answer': 'According to the article, Mixtral-8x7B is an open-source LLM that performs really well as a general-purpose reasoning agent. 
It even surpasses GPT-3.5 on the benchmark in the article.'}" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain import hub\n", + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "from langchain_community.document_loaders import WebBaseLoader\n", + "from langchain_community.vectorstores import Chroma\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_core.runnables import RunnablePassthrough\n", + "from langchain_core.runnables import RunnableParallel\n", + "from langchain_community.embeddings import HuggingFaceEmbeddings\n", + "\n", + "# Load, chunk and index the contents of the blog\n", + "loader = WebBaseLoader(\n", + " web_paths=(\"https://huggingface.co/blog/open-source-llms-as-agents\",),\n", + ")\n", + "docs = loader.load()\n", + "\n", + "# declare an HF embedding model\n", + "hf_embeddings = HuggingFaceEmbeddings(model_name=\"BAAI/bge-large-en-v1.5\")\n", + "\n", + "text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)\n", + "splits = text_splitter.split_documents(docs)\n", + "vectorstore = Chroma.from_documents(documents=splits, embedding=hf_embeddings)\n", + "\n", + "# Retrieve and generate using the relevant snippets of the blog\n", + "retriever = vectorstore.as_retriever()\n", + "prompt = hub.pull(\"rlm/rag-prompt\")\n", + "\n", + "\n", + "def format_docs(docs):\n", + " return \"\\n\\n\".join(doc.page_content for doc in docs)\n", + "\n", + "\n", + "rag_chain_from_docs = (\n", + " RunnablePassthrough.assign(context=(lambda x: format_docs(x[\"context\"])))\n", + " | prompt\n", + " | llm\n", + " | StrOutputParser()\n", + ")\n", + "\n", + "rag_chain_with_source = RunnableParallel(\n", + " {\"context\": retriever, \"question\": RunnablePassthrough()}\n", + ").assign(answer=rag_chain_from_docs)\n", + "\n", + "rag_chain_with_source.invoke(\n", + " \"According to this article which open-source model 
is the best for an agent behaviour?\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### How to use with LlamaIndex\n", + "\n", + "Similarly, you can also use a TGI endpoint in [LlamaIndex](https://www.llamaindex.ai/). We’ll use the `OpenAILike` class, and instantiate it by configuring some additional arguments (i.e. `is_local`, `is_function_calling_model`, `is_chat_model`, `context_window`).\n", + "\n", + "_Note that the context window argument should match the value previously set for `MAX_TOTAL_TOKENS` of your endpoint._\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "CompletionResponse(text='Open-source software is important for several reasons:\\n\\n1. Transparency: Open-source software allows users to see the source code, which means they can understand how the software works and how it processes data. This transparency helps build trust in the software and its developers.\\n\\n2. Collaboration: Open-source software encourages collaboration among developers, who can contribute to the code, fix bugs, and add new features. This collaborative approach often leads to faster development and', additional_kwargs={}, raw={'id': '', 'choices': [Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Open-source software is important for several reasons:\\n\\n1. Transparency: Open-source software allows users to see the source code, which means they can understand how the software works and how it processes data. This transparency helps build trust in the software and its developers.\\n\\n2. Collaboration: Open-source software encourages collaboration among developers, who can contribute to the code, fix bugs, and add new features. 
This collaborative approach often leads to faster development and', role='assistant', function_call=None, tool_calls=None))], 'created': 1707342025, 'model': '/repository', 'object': 'text_completion', 'system_fingerprint': '1.4.0-sha-1734540', 'usage': CompletionUsage(completion_tokens=100, prompt_tokens=18, total_tokens=118)}, delta=None)" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from llama_index.llms import OpenAILike\n", + "\n", + "llm = OpenAILike(\n", + " model=\"tgi\",\n", + " api_key=HF_API_KEY,\n", + " api_base=BASE_URL + \"/v1/\",\n", + " is_chat_model=True,\n", + " is_local=False,\n", + " is_function_calling_model=False,\n", + " context_window=4096,\n", + ")\n", + "\n", + "llm.complete(\"Why is open-source software important?\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now use it in a similar RAG pipeline. Keep in mind that the previous choice of `MAX_INPUT_LENGTH` in your Inference Endpoint will directly influence the number of retrieved chunks (`similarity_top_k`) the model can process.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from llama_index import (\n", + " ServiceContext,\n", + " VectorStoreIndex,\n", + ")\n", + "from llama_index import download_loader\n", + "from llama_index.embeddings import HuggingFaceEmbedding\n", + "from llama_index.query_engine import CitationQueryEngine\n", + "\n", + "\n", + "SimpleWebPageReader = download_loader(\"SimpleWebPageReader\")\n", + "\n", + "documents = SimpleWebPageReader(html_to_text=True).load_data(\n", + " [\"https://huggingface.co/blog/open-source-llms-as-agents\"]\n", + ")\n", + "\n", + "# Load embedding model\n", + "embed_model = HuggingFaceEmbedding(model_name=\"BAAI/bge-large-en-v1.5\")\n", + "\n", + "# Pass LLM to pipeline\n", + "service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)\n", + 
"index = VectorStoreIndex.from_documents(\n", + " documents, service_context=service_context, show_progress=True\n", + ")\n", + "\n", + "# Query the index\n", + "query_engine = CitationQueryEngine.from_args(\n", + " index,\n", + " similarity_top_k=2,\n", + ")\n", + "response = query_engine.query(\n", + " \"According to this article which open-source model is the best for an agent behaviour?\"\n", + ")\n", + "\n", + "response.response" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Wrap up\n", + "\n", + "After you are done with your endpoint, you can either pause or delete it. This step can be completed via the UI, or programmatically like follows.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# pause our running endpoint\n", + "endpoint.pause()\n", + "\n", + "# optionally delete\n", + "# endpoint.delete()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/notebooks/zh-CN/_toctree.yml b/notebooks/zh-CN/_toctree.yml new file mode 100644 index 00000000..b17917d8 --- /dev/null +++ b/notebooks/zh-CN/_toctree.yml @@ -0,0 +1,16 @@ +- title: 开源 AI 指南 (Cookbook) + sections: + - local: index + title: 开源 AI 指南 (Cookbook) + - local: automatic_embedding_tei_inference_endpoints + title: 通过推理端点使用 TEI 自动嵌入 + - local: faiss_with_hf_datasets_and_clip + title: 用 🤗 transformers, 🤗 datasets 和 FAISS 嵌入多模态数据进行相似度搜索 + - local: fine_tuning_code_llm_on_single_gpu + title: 在单个 GPU 上针对自定义代码微调代码 LLM + - local: rag_zephyr_langchain + title: 用 Hugging Face Zephyr 和 LangChain 针对 Github issues 构建简单的 RAG + - local: 
advanced_rag + title: Advanced RAG on HuggingFace documentation using LangChain + - local: rag_evaluation + title: RAG Evaluation Using Synthetic Data and LLM-as-a-Judge diff --git a/notebooks/zh-CN/advanced_rag.ipynb b/notebooks/zh-CN/advanced_rag.ipynb new file mode 100644 index 00000000..408e9a05 --- /dev/null +++ b/notebooks/zh-CN/advanced_rag.ipynb @@ -0,0 +1,1247 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "hUCaGdAj9-9F" + }, + "source": [ + "# Advanced RAG on HuggingFace documentation using LangChain\n", + "_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DKv51c_h9-9H" + }, + "source": [ + "This notebook shows how to build an advanced RAG system that answers questions about a specific knowledge base (here, the HuggingFace documentation) using LangChain.\n", + "\n", + "For an introduction to RAG, you can check [this tutorial](rag_zephyr_langch)\n", + "\n", + "RAG systems are complex, with many moving parts: here is a simple RAG diagram, where all the possibilities for system enhancement are marked in blue.\n", + "\n", + "\n", + "\n", + "> 💡 As you can see, there are many steps to tune in this architecture: tuning the system properly will yield significant performance gains.\n", + "\n", + "In this notebook, we will look into many of these blue-marked parts to see how to tune your RAG system for the best performance.\n", + "\n", + "__Let's dive into the model architecture!__ First, we install the required model dependencies." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NSX0p0rV9-9I" + }, + "outputs": [], + "source": [ + "!pip install -q torch transformers transformers accelerate bitsandbytes langchain sentence-transformers faiss-gpu openpyxl pacmap" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8_Uyukt39-9J" + }, + "outputs": [], + "source": [ + "%reload_ext dotenv\n", + "%dotenv" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "eoujYMwW9-9J" + }, + "outputs": [], + "source": [ + "from tqdm.notebook import tqdm\n", + "import pandas as pd\n", + "from typing import Optional, List, Tuple\n", + "from datasets import Dataset\n", + "import matplotlib.pyplot as plt\n", + "\n", + "pd.set_option(\n", + " \"display.max_colwidth\", None\n", + ") # this will be helpful when visualizing retriever outputs" + ] + }, + { + "cell_type": "markdown", + "metadata": 
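{ + "id": "Kr6rN10U9-9J" + }, + "source": [ + "### Load your knowledge base" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qZLVIEVW9-9J" + }, + "outputs": [], + "source": [ + "import datasets\n", + "\n", + "ds = datasets.load_dataset(\"m-ric/huggingface_doc\", split=\"train\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "836Q7vF49-9K" + }, + "outputs": [], + "source": [ + "from langchain.docstore.document import Document as LangchainDocument\n", + "\n", + "RAW_KNOWLEDGE_BASE = [\n", + " LangchainDocument(page_content=doc[\"text\"], metadata={\"source\": doc[\"source\"]})\n", + " for doc in tqdm(ds)\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0_LxjD5h9-9K" + }, + "source": [ + "# 1. Retriever - embeddings 🗂️\n", + "The __retriever acts like an internal search engine__: given the user query, it returns a few relevant snippets from your knowledge base.\n", + "\n", + "These snippets will then be fed to the reader model to help it generate its answer.\n", + "\n", + "So __our objective here is, given a user question, to find the most relevant snippets from our knowledge base to answer that question.__\n", + "\n", + "This is a broad objective, and it leaves some questions open. How many snippets should we retrieve? This parameter will be named `top_k`.\n", + "\n", + "How long should these snippets be? This is called the `chunk size`. There is no one-size-fits-all answer, but here are a few points:\n", + "- 🔀 Your `chunk size` is allowed to vary from one snippet to another.\n", + "- Since there will always be some noise in your retrieval, increasing `top_k` increases the chance of getting relevant elements in your retrieved snippets. 🎯 Shooting more arrows increases your probability of hitting the target.\n", + "- Meanwhile, the total length of your retrieved documents should not be too high: for instance, for most current models, 16k tokens will probably drown your reader model in information due to the [Lost-in-the-middle phenomenon](https://huggingface.co/papers/2307.03172). 🎯 Give your reader model only the most relevant insights, not a pile of books!\n", + "\n", + "\n", + "> In this notebook, we use the Langchain library since __it offers a wide variety of options for vector databases and allows us to keep document metadata throughout the processing__.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-uS6Mv8O9-9L" + }, + "source": [ + "### 1.1 Split the documents into chunks\n", + "\n", + "- In this part, __we split the documents from our knowledge base into smaller chunks__, which will be the snippets fed to the reader LLM to generate its answer.\n", + "- The goal is to prepare a collection of **semantically relevant snippets**. So their size should be adapted to precise ideas: too small will truncate ideas, and too large will dilute them.\n", + "\n", + "💡 _Many options exist for text splitting: splitting on words, on sentence boundaries, recursive chunking that processes documents in a tree-like way to preserve structure information... 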
To learn more about chunking, I recommend you read [this great notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt._\n", + "\n", + "\n", + "- **Recursive chunking** breaks the text down into smaller parts step by step, using a given list of separators sorted from the most important to the least important. If the first split doesn't give chunks of the right size or shape, the method repeats itself on the new chunks using a different separator. For instance, with the list of separators `[\"\\n\\n\", \"\\n\", \".\", \"\"]`:\n", + " - The method will first break down the document wherever there is a double line break `\"\\n\\n\"`.\n", + " - The resulting documents will be split again on simple line breaks `\"\\n\"`, then on sentence ends `\".\"`.\n", + " - Finally, if some chunks are still too large, they will be split wherever they overflow the maximum size.\n", + "\n", + "- With this method, the global structure is well preserved, at the expense of slight variations in chunk size.\n", + "\n", + "> [This space](https://huggingface.co/spaces/A-Roucher/chunk_visualizer) lets you visualize how different splitting options affect the chunks you get.\n", + "\n", + "🔬 Let's experiment a bit with chunk sizes, beginning with an arbitrary size, and see how splits work. We use Langchain's implementation of recursive chunking, `RecursiveCharacterTextSplitter`.\n", + "- Parameter `chunk_size` controls the length of individual chunks: by default, this length is counted as the number of characters in the chunk.\n", + "- Parameter `chunk_overlap` lets adjacent chunks overlap a bit with each other. This reduces the probability that an idea is cut in half by the split between two adjacent chunks. We arbitrarily set this to 1/10th of the chunk size; you can try different values!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "M4m6TwDJ9-9L" + }, + "outputs": [], + "source": [ + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "\n", + "# We use a hierarchical list of separators specifically tailored for splitting Markdown documents\n", + "# This list is taken from LangChain's MarkdownTextSplitter class.\n", + "MARKDOWN_SEPARATORS = [\n", + " \"\\n#{1,6} \",\n", + " \"```\\n\",\n", + " \"\\n\\\\*\\\\*\\\\*+\\n\",\n", + " \"\\n---+\\n\",\n", + " \"\\n___+\\n\",\n", + " \"\\n\\n\",\n", + " \"\\n\",\n", + " \" \",\n", + " \"\",\n", + "]\n", + "\n", + "text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size=1000, # the maximum number of characters in a chunk: we selected this value arbitrarily\n", + " chunk_overlap=100, # the number of characters to overlap between chunks\n", + " add_start_index=True, # If `True`, includes chunk's start index in metadata\n", + " strip_whitespace=True, # If `True`, strips whitespace from the start and end of every document\n", + " 
separators=MARKDOWN_SEPARATORS,\n", + ")\n", + "\n", + "docs_processed = []\n", + "for doc in RAW_KNOWLEDGE_BASE:\n", + " docs_processed += text_splitter.split_documents([doc])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d5jJUMgb9-9M" + }, + "source": [ + "We also have to keep in mind that when embedding documents, we will use an embedding model that accepts a certain maximum sequence length `max_seq_length`.\n", + "\n", + "So we should make sure that our chunk sizes are below this limit, because any longer chunk will be truncated before processing, thus losing relevancy.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "referenced_widgets": [ + "ae043feeb0914c879e2a9008b413d952" + ] + }, + "id": "B4hoki349-9M", + "outputId": "64f92a61-7839-476d-f456-7eefde04c20b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Model's maximum sequence length: 512\n" + ] + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "ae043feeb0914c879e2a9008b413d952", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + " 0%| | 0/31085 [00:00" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from sentence_transformers import SentenceTransformer\n", + "\n", + "# To get the value of the max sequence_length, we will query the underlying `SentenceTransformer` object used in the RecursiveCharacterTextSplitter.\n", + "print(\n", + " f\"Model's maximum sequence length: {SentenceTransformer('thenlper/gte-small').max_seq_length}\"\n", + ")\n", + "\n", + "from transformers import AutoTokenizer\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(\"thenlper/gte-small\")\n", + "lengths = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(docs_processed)]\n", + "\n", + "# Plot the distribution of document lengths, counted as the number of tokens\n", + "fig = pd.Series(lengths).hist()\n", + "plt.title(\"Distribution of document lengths in the knowledge base (in count of tokens)\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L3teXczl9-9M" + }, + "source": [ + "👀 
可以看到,__片段长度与我们的 512 个 token 的限制不匹配__,并且有些文档超出了限制,因此它们的一部分将在截断中丢失!\n", + " - 因此,我们应该更改 `RecursiveCharacterTextSplitter` 类,以计算 token 数量而不是字符数量。\n", + " - 然后,我们可以选择一个特定的片段大小,这里我们会选择不超过 512 的阈值:\n", + " - 较小的片段能让每个片段更聚焦于某个具体的想法。\n", + " - 但太小的片段会把句子从中间切开,从而再次丢失语义:合适的大小需要在两者之间权衡。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "referenced_widgets": [ + "f900cf4ab3a94f45bfa7298f433566ed" + ] + }, + "id": "9hvIL2jO9-9M", + "outputId": "9baf219d-2954-4927-9681-e28572db90db" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "f900cf4ab3a94f45bfa7298f433566ed", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + " 0%| | 0/17995 [00:00" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "from transformers import AutoTokenizer\n", + "\n", + "EMBEDDING_MODEL_NAME = \"thenlper/gte-small\"\n", + "\n", + "\n", + "def split_documents(\n", + " chunk_size: int,\n", + " knowledge_base: List[LangchainDocument],\n", + " tokenizer_name: Optional[str] = EMBEDDING_MODEL_NAME,\n", + ") -> List[LangchainDocument]:\n", + " \"\"\"\n", + " Split documents into chunks of maximum size `chunk_size` tokens and return a list of documents.\n", + " \"\"\"\n", + " text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(\n", + " AutoTokenizer.from_pretrained(tokenizer_name),\n", + " chunk_size=chunk_size,\n", + " chunk_overlap=int(chunk_size / 10),\n", + " add_start_index=True,\n", + " strip_whitespace=True,\n", + " separators=MARKDOWN_SEPARATORS,\n", + " )\n", + "\n", + " docs_processed = []\n", + " for doc in knowledge_base:\n", + " docs_processed += text_splitter.split_documents([doc])\n", + "\n", + " # Remove duplicates\n", + " unique_texts = {}\n", + " docs_processed_unique = []\n", + " for doc in docs_processed:\n", + " if doc.page_content not in unique_texts:\n", + " 
unique_texts[doc.page_content] = True\n", + " docs_processed_unique.append(doc)\n", + "\n", + " return docs_processed_unique\n", + "\n", + "\n", + "docs_processed = split_documents(\n", + " 512, # We choose a chunk size adapted to our model\n", + " RAW_KNOWLEDGE_BASE,\n", + " tokenizer_name=EMBEDDING_MODEL_NAME,\n", + ")\n", + "\n", + "# Let's visualize the chunk sizes we would have in tokens from a common model\n", + "from transformers import AutoTokenizer\n", + "\n", + "tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL_NAME)\n", + "lengths = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(docs_processed)]\n", + "fig = pd.Series(lengths).hist()\n", + "plt.title(\"Distribution of document lengths in the knowledge base (in count of tokens)\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wc3riwX39-9M" + }, + "source": [ + "➡️ 现在分块长度分布看起来好多了!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J1ho-UKM9-9M" + }, + "source": [ + "### 1.2 构建向量数据库\n", + "\n", + "我们希望为我们知识库的所有片段计算嵌入向量:要了解更多关于句子嵌入的信息,我们建议阅读[这个指南](https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/)。\n", + "\n", + "#### 检索的工作原理\n", + "\n", + "一旦所有片段都被嵌入,我们就将它们存储到一个向量数据库中。当用户输入一个查询时,它会被之前使用的同一模型嵌入,并且相似性搜索会返回向量数据库中最接近的文档。\n", + "\n", + "因此,技术挑战在于,给定一个查询向量,快速找到向量数据库中这个向量的最近邻。为此,我们需要选择两件事:一个距离度量,以及一个搜索算法,以便在成千上万的记录数据库中快速找到最近邻。\n", + "\n", + "##### 最近邻搜索算法\n", + "\n", + "最近邻搜索算法有很多选择:我们选择 Facebook 的 [FAISS](https://github.com/facebookresearch/faiss),因为 FAISS 对于大多数用例来说性能足够好,而且它广为人知,因此被广泛实现。\n", + "\n", + "##### 距离度量\n", + "\n", + "关于距离度量,你可以在[这里](https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/#distance-between-embeddings)找到一个很好的指南。简而言之:\n", + "- **余弦相似度**计算两个向量之间的相似性,作为它们相对角度的余弦值:它允许我们比较向量的方向,而不考虑它们的幅度。使用它需要对所有向量进行归一化,将它们重新缩放到单位范数。\n", + "- **点积**考虑幅度,有时会有不希望的效果,即增加向量的长度会使它与所有其他向量更相似。\n", + "- **欧氏距离**是向量末端之间的距离。\n", + "\n", + 
"你可以尝试[这个小测](https://developers.google.com/machine-learning/clustering/similarity/check-your-understanding)来检查你对这些概念的理解。但是一旦向量被归一化,[选择特定的距离度量并不重要](https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use)。\n", + "\n", + "我们的特定模型与余弦相似度配合得很好,所以我们选择这个距离度量,并在嵌入模型中以及 FAISS 索引的 `distance_strategy` 参数中设置它。使用余弦相似度,我们需要归一化我们的嵌入向量。\n", + "\n", + "🚨👇 下面的单元格需要在 A10G 上运行几分钟!\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dalledM99-9M" + }, + "outputs": [], + "source": [ + "from langchain.vectorstores import FAISS\n", + "from langchain_community.embeddings import HuggingFaceEmbeddings\n", + "from langchain_community.vectorstores.utils import DistanceStrategy\n", + "\n", + "embedding_model = HuggingFaceEmbeddings(\n", + " model_name=EMBEDDING_MODEL_NAME,\n", + " multi_process=True,\n", + " model_kwargs={\"device\": \"cuda\"},\n", + " encode_kwargs={\"normalize_embeddings\": True}, # set True for cosine similarity\n", + ")\n", + "\n", + "KNOWLEDGE_VECTOR_DATABASE = FAISS.from_documents(\n", + " docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0zM-wfiJ9-9N" + }, + "source": [ + "👀 为了可视化搜索最接近的文档,我们使用 PaCMAP 将我们的嵌入向量从 384 维降至 2 维。\n", + "\n", + "💡 _我们选择 PaCMAP 而不是其他技术,如 t-SNE 或 UMAP,因为[它效率高(保留局部和全局结构),对初始化参数鲁棒且速度快](https://www.nature.com/articles/s42003-022-03628-x#Abs1)。_\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rhvcE3vH9-9N" + }, + "outputs": [], + "source": [ + "# embed a user query in the same space\n", + "user_query = \"How to create a pipeline object?\"\n", + "query_vector = embedding_model.embed_query(user_query)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "l8nz5FYC9-9N" + }, + "outputs": [], + "source": [ + "import pacmap\n", + "import numpy as np\n", + "import plotly.express as px\n", + "\n", + 
"embedding_projector = pacmap.PaCMAP(\n", + " n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=1\n", + ")\n", + "\n", + "embeddings_2d = [\n", + " list(KNOWLEDGE_VECTOR_DATABASE.index.reconstruct_n(idx, 1)[0])\n", + " for idx in range(len(docs_processed))\n", + "] + [query_vector]\n", + "\n", + "# fit the data (The index of transformed data corresponds to the index of the original data)\n", + "documents_projected = embedding_projector.fit_transform(np.array(embeddings_2d), init=\"pca\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7Cl9Fw2A9-9N" + }, + "outputs": [], + "source": [ + "df = pd.DataFrame.from_dict(\n", + " [\n", + " {\n", + " \"x\": documents_projected[i, 0],\n", + " \"y\": documents_projected[i, 1],\n", + " \"source\": docs_processed[i].metadata[\"source\"].split(\"/\")[1],\n", + " \"extract\": docs_processed[i].page_content[:100] + \"...\",\n", + " \"symbol\": \"circle\",\n", + " \"size_col\": 4,\n", + " }\n", + " for i in range(len(docs_processed))\n", + " ]\n", + " + [\n", + " {\n", + " \"x\": documents_projected[-1, 0],\n", + " \"y\": documents_projected[-1, 1],\n", + " \"source\": \"User query\",\n", + " \"extract\": user_query,\n", + " \"size_col\": 100,\n", + " \"symbol\": \"star\",\n", + " }\n", + " ]\n", + ")\n", + "\n", + "# visualize the embedding\n", + "fig = px.scatter(\n", + " df,\n", + " x=\"x\",\n", + " y=\"y\",\n", + " color=\"source\",\n", + " hover_data=\"extract\",\n", + " size=\"size_col\",\n", + " symbol=\"symbol\",\n", + " color_discrete_map={\"User query\": \"black\"},\n", + " width=1000,\n", + " height=700,\n", + ")\n", + "fig.update_traces(\n", + " marker=dict(opacity=1, line=dict(width=0, color=\"DarkSlateGrey\")), selector=dict(mode=\"markers\")\n", + ")\n", + "fig.update_layout(\n", + " legend_title_text=\"Chunk source\",\n", + " title=\"2D Projection of Chunk Embeddings via PaCMAP\",\n", + ")\n", + "fig.show()" + ] + }, + { + "cell_type": "markdown", + 
"metadata": { + "id": "kWesCSGt9-9N" + }, + "source": [ + "\n", + "\n", + "➡️ 在上面的图表中,你可以看到知识库文档的空间表示。由于向量嵌入代表了文档的含义,它们在意义上的接近应该在它们的嵌入的接近程度上反映出来。\n", + "\n", + "用户查询的嵌入也被显示出来:我们想要找到意义最接近的 `k` 个文档,因此我们选择最接近的 `k` 个向量。\n", + "\n", + "在 LangChain 向量数据库实现中,这个搜索操作是由方法 `vector_database.similarity_search(query)` 执行的。\n", + "\n", + "这里是结果:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VcjQzejH9-9N", + "outputId": "d5b817c2-1b0e-4e47-9658-4892a91e7c51" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Starting retrieval for user_query='How to create a pipeline object?'...\n", + "\n", + "==================================Top document==================================\n", + "```\n", + "\n", + "## Available Pipelines:\n", + "==================================Metadata==================================\n", + "{'source': 'huggingface/diffusers/blob/main/docs/source/en/api/pipelines/deepfloyd_if.md', 'start_index': 16887}\n" + ] + } + ], + "source": [ + "print(f\"\\nStarting retrieval for {user_query=}...\")\n", + "retrieved_docs = KNOWLEDGE_VECTOR_DATABASE.similarity_search(query=user_query, k=5)\n", + "print(\"\\n==================================Top document==================================\")\n", + "print(retrieved_docs[0].page_content)\n", + "print(\"==================================Metadata==================================\")\n", + "print(retrieved_docs[0].metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VjVqmDGh9-9N" + }, + "source": [ + "# 2. 阅读器- LLM 💬\n", + "\n", + "在这一部分,__LLM 阅读器读取检索到的上下文以形成其答案。__\n", + "\n", + "实际上有多个可以调整的子步骤:\n", + "1. 检索到的文档内容被聚合并放入“上下文”中,这其中有许多处理选项,如_提示压缩_。\n", + "2. 上下文和用户查询被聚合并形成一个提示(prompt),然后交给 LLM 生成其答案。\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0xiXcG269-9N" + }, + "source": [ + "### 2.1. 
阅读器模型\n", + "\n", + "在选择阅读器模型时,有几个方面很重要:\n", + "- 阅读器模型的 `max_seq_length` 必须适应我们的提示(prompt),其中包括检索器调用输出的上下文:上下文包括 5 个每份 512 个 token 的文档,所以我们至少需要 4k 个 token 的上下文长度。\n", + "- 阅读器模型\n", + "\n", + "在这个例子中,我们选择了 [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta),这是一个小而强大的模型。\n", + "\n", + "由于每周都会发布许多模型,你可能想要用最新最好的模型替换这个模型。跟踪开源 LLM 的最佳方式是查看 [Open-source LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)。\n", + "\n", + "为了加速推理,我们将加载模型的量化版本:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "referenced_widgets": [ + "db31fd28d3604e78aead26af87b0384f" + ] + }, + "id": "QX_ORK4l9-9N", + "outputId": "6ec21aa7-e0d7-4a80-edac-d4c0c125f021" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "db31fd28d3604e78aead26af87b0384f", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Loading checkpoint shards: 0%| | 0/8 [00:00\n", + "Using the information contained in the context, \n", + "give a comprehensive answer to the question.\n", + "Respond only to the question asked, response should be concise and relevant to the question.\n", + "Provide the number of the source document when relevant.\n", + "If the answer cannot be deduced from the context, do not give an answer.\n", + "<|user|>\n", + "Context:\n", + "{context}\n", + "---\n", + "Now here is the question you need to answer.\n", + "\n", + "Question: {question}\n", + "<|assistant|>\n" + ] + } + ], + "source": [ + "prompt_in_chat_format = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"\"\"Using the information contained in the context,\n", + "give a comprehensive answer to the question.\n", + "Respond only to the question asked, response should be concise and relevant to the question.\n", + "Provide the number of the source document when relevant.\n", + "If the answer cannot be deduced from the context, do not give an answer.\"\"\",\n", + 
" },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"\"\"Context:\n", + "{context}\n", + "---\n", + "Now here is the question you need to answer.\n", + "\n", + "Question: {question}\"\"\",\n", + " },\n", + "]\n", + "RAG_PROMPT_TEMPLATE = tokenizer.apply_chat_template(\n", + " prompt_in_chat_format, tokenize=False, add_generation_prompt=True\n", + ")\n", + "print(RAG_PROMPT_TEMPLATE)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GZRHLza-9-9O" + }, + "source": [ + "让我们在之前检索的文档上测试我们的阅读器!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "G4XprIih9-9O", + "outputId": "94c63d34-67ad-4f82-a3b4-2a32cecc8427" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "To create a pipeline object, follow these steps:\n", + "\n", + "1. Define the inputs and outputs of your pipeline. These could be strings, dictionaries, or any other format that best suits your use case.\n", + "\n", + "2. Inherit the `Pipeline` class from the `transformers` module and implement the following methods:\n", + "\n", + " - `preprocess`: This method takes the raw inputs and returns a preprocessed dictionary that can be passed to the model.\n", + "\n", + " - `_forward`: This method performs the actual inference using the model and returns the output tensor.\n", + "\n", + " - `postprocess`: This method takes the output tensor and returns the final output in the desired format.\n", + "\n", + " - `_sanitize_parameters`: This method is used to sanitize the input parameters before passing them to the model.\n", + "\n", + "3. Load the necessary components, such as the model and scheduler, into the pipeline object.\n", + "\n", + "4. 
Instantiate the pipeline object and return it.\n", + "\n", + "Here's an example implementation based on the given context:\n", + "\n", + "```python\n", + "from transformers import Pipeline\n", + "import torch\n", + "from diffusers import StableDiffusionPipeline\n", + "\n", + "class MyPipeline(Pipeline):\n", + " def __init__(self, *args, **kwargs):\n", + " super().__init__(*args, **kwargs)\n", + " self.pipe = StableDiffusionPipeline.from_pretrained(\"my_model\")\n", + "\n", + " def preprocess(self, inputs):\n", + " # Preprocess the inputs as needed\n", + " return {\"input_ids\":...}\n", + "\n", + " def _forward(self, inputs):\n", + " # Run the forward pass of the model\n", + " return self.pipe(**inputs).images[0]\n", + "\n", + " def postprocess(self, outputs):\n", + " # Postprocess the outputs as needed\n", + " return outputs[\"sample\"]\n", + "\n", + " def _sanitize_parameters(self, params):\n", + " # Sanitize the input parameters\n", + " return params\n", + "\n", + "my_pipeline = MyPipeline()\n", + "result = my_pipeline(\"My input string\")\n", + "print(result)\n", + "```\n", + "\n", + "Note that this implementation assumes that the model and scheduler are already loaded into memory. If they need to be loaded dynamically, you can modify the `__init__` method accordingly.\n" + ] + } + ], + "source": [ + "retrieved_docs_text = [\n", + " doc.page_content for doc in retrieved_docs\n", + "] # we only need the text of the documents\n", + "context = \"\\nExtracted documents:\\n\"\n", + "context += \"\".join([f\"Document {str(i)}:::\\n\" + doc for i, doc in enumerate(retrieved_docs_text)])\n", + "\n", + "final_prompt = RAG_PROMPT_TEMPLATE.format(\n", + " question=\"How to create a pipeline object?\", context=context\n", + ")\n", + "\n", + "# Redact an answer\n", + "answer = READER_LLM(final_prompt)[0][\"generated_text\"]\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rhRHZoww9-9O" + }, + "source": [ + "### 2.3. 
重排序(rerank)\n", + "\n", + "对于 RAG 来说,一个常见的好做法是先检索出比最终所需更多的文档,再用更强大的检索模型对结果重新排序,最后只保留前 `top_k` 个。\n", + "\n", + "为此,[ColBERTv2](https://arxiv.org/abs/2112.01488)是一个很好的选择:它不是像我们传统的嵌入模型那样的双编码器(bi-encoder),而是一种延迟交互(late interaction)模型,会计算查询 token 与每个文档 token 之间更细粒度的交互。\n", + "\n", + "借助 [RAGatouille 库](https://github.com/bclavie/RAGatouille),它用起来非常简单。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "triOdqTV9-9O" + }, + "outputs": [], + "source": [ + "from ragatouille import RAGPretrainedModel\n", + "\n", + "RERANKER = RAGPretrainedModel.from_pretrained(\"colbert-ir/colbertv2.0\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Minj2SV59-9O" + }, + "source": [ + "# 3. 集成所有组件" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "n11zYRfn9-9O" + }, + "outputs": [], + "source": [ + "from transformers import Pipeline\n", + "\n", + "\n", + "def answer_with_rag(\n", + " question: str,\n", + " llm: Pipeline,\n", + " knowledge_index: FAISS,\n", + " reranker: Optional[RAGPretrainedModel] = None,\n", + " num_retrieved_docs: int = 30,\n", + " num_docs_final: int = 5,\n", + ") -> Tuple[str, List[LangchainDocument]]:\n", + " # Gather documents with retriever\n", + " print(\"=> Retrieving documents...\")\n", + " relevant_docs = knowledge_index.similarity_search(query=question, k=num_retrieved_docs)\n", + " relevant_docs = [doc.page_content for doc in relevant_docs] # keep only the text\n", + "\n", + " # Optionally rerank results\n", + " if reranker:\n", + " print(\"=> Reranking documents...\")\n", + " relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)\n", + " relevant_docs = [doc[\"content\"] for doc in relevant_docs]\n", + "\n", + " relevant_docs = relevant_docs[:num_docs_final]\n", + "\n", + " # Build the final prompt\n", + " context = \"\\nExtracted documents:\\n\"\n", + " context += \"\".join([f\"Document {str(i)}:::\\n\" + doc for i, doc in enumerate(relevant_docs)])\n", + "\n", + " 
final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)\n", + "\n", + " # Redact an answer\n", + " print(\"=> Generating answer...\")\n", + " answer = llm(final_prompt)[0][\"generated_text\"]\n", + "\n", + " return answer, relevant_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9nA4nwRQ9-9P" + }, + "source": [ + "让我们看看我们的 RAG 流水线是怎么回答用户的询问的。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7ZTC1FtX9-9P", + "outputId": "22597be1-ab72-4f68-d577-0e12820463cf" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "=> Retrieving documents...\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "=> Reranking documents...\n", + "=> Generating answer...\n" + ] + } + ], + "source": [ + "question = \"how to create a pipeline object?\"\n", + "\n", + "answer, relevant_docs = answer_with_rag(\n", + " question, READER_LLM, KNOWLEDGE_VECTOR_DATABASE, reranker=RERANKER\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SwW0oqhZ9-9P", + "outputId": "361f28ed-9cd5-40b8-f8c4-57e8e4a530d9" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "==================================Answer==================================\n", + "To create a pipeline object, follow these steps:\n", + "\n", + "1. Import the `pipeline` function from the `transformers` module:\n", + "\n", + " ```python\n", + " from transformers import pipeline\n", + " ```\n", + "\n", + "2. 
Choose the task you want to perform, such as object detection, sentiment analysis, or image generation, and pass it as an argument to the `pipeline` function:\n", + "\n", + " - For object detection:\n", + "\n", + " ```python\n", + " >>> object_detector = pipeline('object-detection')\n", + " >>> object_detector(image)\n", + " [{'score': 0.9982201457023621,\n", + " 'label':'remote',\n", + " 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}},\n", + " ...]\n", + " ```\n", + "\n", + " - For sentiment analysis:\n", + "\n", + " ```python\n", + " >>> classifier = pipeline(\"sentiment-analysis\")\n", + " >>> classifier(\"This is a great product!\")\n", + " {'labels': ['POSITIVE'],'scores': tensor([0.9999], device='cpu', dtype=torch.float32)}\n", + " ```\n", + "\n", + " - For image generation:\n", + "\n", + " ```python\n", + " >>> image = pipeline(\n", + " ... \"stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k\"\n", + " ... ).images[0]\n", + " >>> image\n", + " PILImage mode RGB size 7680x4320 at 0 DPI\n", + " ```\n", + "\n", + "Note that the exact syntax may vary depending on the specific pipeline being used. Refer to the documentation for more details on how to use each pipeline.\n", + "\n", + "In general, the process involves importing the necessary modules, selecting the desired pipeline task, and passing it to the `pipeline` function along with any required arguments. 
The resulting pipeline object can then be used to perform the selected task on input data.\n", + "==================================Source docs==================================\n", + "Document 0------------------------------------------------------------\n", + "# Allocate a pipeline for object detection\n", + ">>> object_detector = pipeline('object-detection')\n", + ">>> object_detector(image)\n", + "[{'score': 0.9982201457023621,\n", + " 'label': 'remote',\n", + " 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}},\n", + " {'score': 0.9960021376609802,\n", + " 'label': 'remote',\n", + " 'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}},\n", + " {'score': 0.9954745173454285,\n", + " 'label': 'couch',\n", + " 'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}},\n", + " {'score': 0.9988006353378296,\n", + " 'label': 'cat',\n", + " 'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}},\n", + " {'score': 0.9986783862113953,\n", + " 'label': 'cat',\n", + " 'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}]\n", + "Document 1------------------------------------------------------------\n", + "# Allocate a pipeline for object detection\n", + ">>> object_detector = pipeline('object_detection')\n", + ">>> object_detector(image)\n", + "[{'score': 0.9982201457023621,\n", + " 'label': 'remote',\n", + " 'box': {'xmin': 40, 'ymin': 70, 'xmax': 175, 'ymax': 117}},\n", + " {'score': 0.9960021376609802,\n", + " 'label': 'remote',\n", + " 'box': {'xmin': 333, 'ymin': 72, 'xmax': 368, 'ymax': 187}},\n", + " {'score': 0.9954745173454285,\n", + " 'label': 'couch',\n", + " 'box': {'xmin': 0, 'ymin': 1, 'xmax': 639, 'ymax': 473}},\n", + " {'score': 0.9988006353378296,\n", + " 'label': 'cat',\n", + " 'box': {'xmin': 13, 'ymin': 52, 'xmax': 314, 'ymax': 470}},\n", + " {'score': 0.9986783862113953,\n", + " 'label': 'cat',\n", + " 'box': {'xmin': 345, 'ymin': 23, 'xmax': 640, 'ymax': 368}}]\n", + "Document 
2------------------------------------------------------------\n", + "Start by creating an instance of [`pipeline`] and specifying a task you want to use it for. In this guide, you'll use the [`pipeline`] for sentiment analysis as an example:\n", + "\n", + "```py\n", + ">>> from transformers import pipeline\n", + "\n", + ">>> classifier = pipeline(\"sentiment-analysis\")\n", + "Document 3------------------------------------------------------------\n", + "```\n", + "\n", + "## Add the pipeline to 🤗 Transformers\n", + "\n", + "If you want to contribute your pipeline to 🤗 Transformers, you will need to add a new module in the `pipelines` submodule\n", + "with the code of your pipeline, then add it to the list of tasks defined in `pipelines/__init__.py`.\n", + "\n", + "Then you will need to add tests. Create a new file `tests/test_pipelines_MY_PIPELINE.py` with examples of the other tests.\n", + "\n", + "The `run_pipeline_test` function will be very generic and run on small random models on every possible\n", + "architecture as defined by `model_mapping` and `tf_model_mapping`.\n", + "\n", + "This is very important to test future compatibility, meaning if someone adds a new model for\n", + "`XXXForQuestionAnswering` then the pipeline test will attempt to run on it. Because the models are random it's\n", + "impossible to check for actual values, that's why there is a helper `ANY` that will simply attempt to match the\n", + "output of the pipeline TYPE.\n", + "\n", + "You also *need* to implement 2 (ideally 4) tests.\n", + "\n", + "- `test_small_model_pt` : Define 1 small model for this pipeline (doesn't matter if the results don't make sense)\n", + " and test the pipeline outputs. The results should be the same as `test_small_model_tf`.\n", + "- `test_small_model_tf` : Define 1 small model for this pipeline (doesn't matter if the results don't make sense)\n", + " and test the pipeline outputs. 
The results should be the same as `test_small_model_pt`.\n", + "- `test_large_model_pt` (`optional`): Tests the pipeline on a real pipeline where the results are supposed to\n", + " make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make\n", + " sure there is no drift in future releases.\n", + "- `test_large_model_tf` (`optional`): Tests the pipeline on a real pipeline where the results are supposed to\n", + " make sense. These tests are slow and should be marked as such. Here the goal is to showcase the pipeline and to make\n", + " sure there is no drift in future releases.\n", + "Document 4------------------------------------------------------------\n", + "```\n", + "\n", + "2. Pass a prompt to the pipeline to generate an image:\n", + "\n", + "```py\n", + "image = pipeline(\n", + "\t\"stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k\"\n", + ").images[0]\n", + "image\n" + ] + } + ], + "source": [ + "print(\"==================================Answer==================================\")\n", + "print(f\"{answer}\")\n", + "print(\"==================================Source docs==================================\")\n", + "for i, doc in enumerate(relevant_docs):\n", + " print(f\"Document {i}------------------------------------------------------------\")\n", + " print(doc)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w6iNo7lY9-9S" + }, + "source": [ + "✅ 现在我们已经拥有了一个完整且性能出色的 RAG 系统。今天的教程就到这里!恭喜你坚持到了最后 🥳\n", + "\n", + "# 进一步探索 🗺️\n", + "\n", + "这并不是旅程的终点!你可以尝试许多步骤来改进你的 RAG 系统。我们建议以迭代的方式进行:对系统进行小的更改,看看哪些可以提升性能。\n", + "\n", + "### 设置评估流水线\n", + "\n", + "- 💬 “你不能改进你没有衡量的模型性能”,甘地说过... 
或者至少 Llama2 告诉我他这么说过。无论如何,你绝对应该从衡量性能开始:这意味着构建一个小的评估数据集,然后在评估数据集上监控你的 RAG 系统的性能。\n", + "\n", + "### 改进检索器\n", + "\n", + "🛠️ __你可以使用这些选项来调整结果:__\n", + "\n", + "- 调整分块方法:\n", + " - 片段的大小\n", + " - 方法:使用不同的分隔符进行拆分,使用[语义分块](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker)...\n", + "\n", + "- 更改嵌入模型\n", + "\n", + "👷‍♀️ __还可以考虑以下事项:__\n", + "\n", + "- 尝试另一种分块方法,如语义分块\n", + "- 更改使用的索引(这里使用的是 FAISS)\n", + "- 查询扩展:以略微不同的方式重新构建用户查询以检索更多文档。\n", + "\n", + "### 改进阅读器\n", + "🛠️ __这里你可以尝试以下选项来改善结果:__\n", + "\n", + "- 调整提示\n", + "- 开启/关闭重排序\n", + "- 选择一个更强大的阅读器模型\n", + "\n", + "💡 __这里有许多选项可以考虑以进一步改善结果:__\n", + "- 压缩检索到的上下文,只保留与回答查询最相关的部分。\n", + "- 扩展 RAG 系统,使其更加用户友好:\n", + " - 引用来源\n", + " - 使其能够进行对话" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "ml2", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/notebooks/zh-CN/automatic_embedding_tei_inference_endpoints.ipynb b/notebooks/zh-CN/automatic_embedding_tei_inference_endpoints.ipynb new file mode 100644 index 00000000..07b366b7 --- /dev/null +++ b/notebooks/zh-CN/automatic_embedding_tei_inference_endpoints.ipynb @@ -0,0 +1,826 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "5d9aca72-957a-4ee2-862f-e011b9cd3a62", + "metadata": {}, + "source": [ + "# 怎么使用推理端点去嵌入文档\n", + "\n", + "_作者: [Derek Thomas](https://huggingface.co/derek-thomas)_\n", + "\n", + "## 目标\n", + "\n", + "我有一个数据集,我想为其嵌入语义搜索(或问答,或 RAG),我希望以最简单的方式嵌入这个数据集并将其放入一个新的数据集中。\n", + "\n", + "## 方法\n", + "\n", + "我将使用我最喜欢的 subreddit [r/bestofredditorupdates](https://www.reddit.com/r/bestofredditorupdates/) 中的数据集。因为它有很长的条目,同时使用新的 
[jinaai/jina-embeddings-v2-base-en](https://huggingface.co/jinaai/jina-embeddings-v2-base-en) 嵌入模型,因为它有 8k 的上下文长度。我还会使用[推理端点](https://huggingface.co/inference-endpoints)来部署它,以节省时间和金钱。要跟随这个教程,你需要**已经添加了支付方式**。如果你还没有添加,可以在 [账单](https://huggingface.co/docs/hub/billing#billing) 中添加。为了使操作更加简单,我将完全基于 API 进行操作。\n", + "\n", + "为了使这个过程更快,我将使用 [Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference) 镜像。这有许多好处,比如:\n", + "- 无需模型图编译步骤\n", + "- Docker 镜像小,启动时间快。真正的无服务器!\n", + "- 基于 token 的动态批处理\n", + "- 使用 Flash 注意力机制、Candle 和 cuBLASLt 优化的 transformers 代码进行推理\n", + "- Safetensors 权重加载\n", + "- 生产就绪(使用 Open Telemetry 进行分布式跟踪,Prometheus 指标)\n", + "\n", + "\n", + "![img](https://media.githubusercontent.com/media/huggingface/text-embeddings-inference/main/assets/bs1-tp.png)" + ] + }, + { + "cell_type": "markdown", + "id": "3c830114-dd88-45a9-81b9-78b0e3da7384", + "metadata": {}, + "source": [ + "## 环境(Requirements)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "35386f72-32cb-49fa-a108-3aa504e20429", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "!pip install -q aiohttp==3.8.3 datasets==2.14.6 pandas==1.5.3 requests==2.31.0 tqdm==4.66.1 \"huggingface-hub>=0.20\"" + ] + }, + { + "cell_type": "markdown", + "id": "b6f72042-173d-4a72-ade1-9304b43b528d", + "metadata": {}, + "source": [ + "## 导入包" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "e2beecdd-d033-4736-bd45-6754ec53b4ac", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "import asyncio\n", + "from getpass import getpass\n", + "import json\n", + "from pathlib import Path\n", + "import time\n", + "from typing import Optional\n", + "\n", + "from aiohttp import ClientSession, ClientTimeout\n", + "from datasets import load_dataset, Dataset, DatasetDict\n", + "from huggingface_hub import notebook_login, create_inference_endpoint, list_inference_endpoints, whoami\n", + "import numpy as np\n", + "import pandas as pd\n", + 
"import requests\n", + "from tqdm.auto import tqdm" + ] + }, + { + "cell_type": "markdown", + "id": "5eece903-64ce-435d-a2fd-096c0ff650bf", + "metadata": {}, + "source": [ + "## 设置(Config)\n", + "`DATASET_IN` 你文本数据的位置\n", + "`DATASET_OUT` 你的嵌入储存的位置\n", + "\n", + "注意:我将 `MAX_WORKERS` 设置为 5,因为 `jina-embeddings-v2` 对内存的需求较大。" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "df2f79f0-9f28-46e6-9fc7-27e9537ff5be", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "DATASET_IN = 'derek-thomas/dataset-creator-reddit-bestofredditorupdates'\n", + "DATASET_OUT = \"processed-subset-bestofredditorupdates\"\n", + "ENDPOINT_NAME = \"boru-jina-embeddings-demo-ie\"\n", + "\n", + "MAX_WORKERS = 5 # This is for how many async workers you want. Choose based on the model and hardware \n", + "ROW_COUNT = 100 # Choose None to use all rows, Im using 100 just for a demo" + ] + }, + { + "cell_type": "markdown", + "id": "1e680f3d-4900-46cc-8b49-bb6ba3e27e2b", + "metadata": {}, + "source": [ + "Hugging Face 在推理端点中提供了多种 GPU 供选择。下面以表格形式呈现:\n", + "\n", + "| GPU | 实例类型 | 实例大小 | vRAM |\n", + "|---------------------|----------------|--------------|-------|\n", + "| 1x Nvidia Tesla T4 | g4dn.xlarge | small | 16GB |\n", + "| 4x Nvidia Tesla T4 | g4dn.12xlarge | large | 64GB |\n", + "| 1x Nvidia A10G | g5.2xlarge | medium | 24GB |\n", + "| 4x Nvidia A10G | g5.12xlarge | xxlarge | 96GB |\n", + "| 1x Nvidia A100* | p4de | xlarge | 80GB |\n", + "| 2x Nvidia A100* | p4de | 2xlarge | 160GB |\n", + "\n", + "\\*注意,对于 A100 的机型你需要发邮件给我们来获取权限。" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "3c2106c1-2e5a-443a-9ea8-a3cd0e9c5a94", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "# GPU Choice\n", + "VENDOR=\"aws\"\n", + "REGION=\"us-east-1\"\n", + "INSTANCE_SIZE=\"medium\"\n", + "INSTANCE_TYPE=\"g5.2xlarge\"" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "0ca1140c-3fcc-4b99-9210-6da1505a27b7", + "metadata": { + 
"tags": [] + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "ee80821056e147fa9cabf30f64dc85a8", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "VBox(children=(HTML(value='

`pd.DataFrame` -> `Dataset` 这条路径最为简单。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "9bb993f8-d624-4192-9626-8e9ed9888a1b", + "metadata": { + "tags": [] + }, + "outputs": [], + "source": [ + "df = pd.DataFrame(documents)\n", + "dd = DatasetDict({'train': Dataset.from_pandas(df)})" + ] + }, + { + "cell_type": "markdown", + "id": "129760c8-cae1-4b1e-8216-f5152df8c536", + "metadata": {}, + "source": [ + "我默认将其上传到用户的账户(而不是上传到组织),但你可以通过在 `repo_id` 中设置用户或在配置中通过设置 `DATASET_OUT` 来自由推送到任何你想要的地方。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "f48e7c55-d5b7-4ed6-8516-272ae38716b1", + "metadata": { + "tags": [] + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "d3af2e864770481db5adc3968500b5d3", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Pushing dataset shards to the dataset hub: 0%| | 0/1 [00:00" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds[\"train\"][0][\"image\"]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 35 + }, + "id": "FOxmdk-HM7L6", + "outputId": "ff7c2ca8-0c6a-49d0-cfd6-4be775e012a1" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "'Two women are looking out a window. 
There is snow outside, and there is a snowman with human arms.'" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ds[\"train\"][0][\"image_description\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ri187NrFNMaF" + }, + "source": [ + "我们没必要去写任何函数去嵌入例子或创建索引。 🤗 datasets 库的 FAISS 组件抽象这些过程。我们可以仅仅简单使用 dataset 的 `map` 方法就可以创建一个新的带有每个例子嵌入的列,就像下面所示。让我们针对提示列中的文本特征创建一个嵌入。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xB0EfabiBHgR" + }, + "outputs": [], + "source": [ + "dataset = ds[\"train\"]\n", + "ds_with_embeddings = dataset.map(lambda example:\n", + " {'embeddings': model.get_text_features(\n", + " **tokenizer([example[\"image_description\"]],\n", + " truncation=True, return_tensors=\"pt\")\n", + " .to(\"cuda\"))[0].detach().cpu().numpy()})\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "iUWvvRB3DJwy" + }, + "outputs": [], + "source": [ + "ds_with_embeddings.add_faiss_index(column='embeddings')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qZcZNgSpCH5e" + }, + "source": [ + "我们可以同样处理图像嵌入" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AwXh-WlZB6q-" + }, + "outputs": [], + "source": [ + "ds_with_embeddings = ds_with_embeddings.map(lambda example:\n", + " {'image_embeddings': model.get_image_features(\n", + " **processor([example[\"image\"]], return_tensors=\"pt\")\n", + " .to(\"cuda\"))[0].detach().cpu().numpy()})\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "s9OX--PsDMNE" + }, + "outputs": [], + "source": [ + "ds_with_embeddings.add_faiss_index(column='image_embeddings')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1BS3TvQO5GGJ" + }, + "source": [ + "## 用文本提示( prompts )查询相关数据" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pxx9fTf83xgE" + }, + "source": [ + 
"我们现在可以用文本或者图片查询数据集来获取相似的项目" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2UQQyXAbNKGa" + }, + "outputs": [], + "source": [ + "prmt = \"a snowy day\"\n", + "prmt_embedding = model.get_text_features(**tokenizer([prmt], return_tensors=\"pt\", truncation=True).to(\"cuda\"))[0].detach().cpu().numpy()\n", + "scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', prmt_embedding, k=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 190 + }, + "id": "O5bkNf4M3_Nt", + "outputId": "b56009fe-dc99-4cc3-84e5-559fb3625d30" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['A man is in the snow. A boy with a huge snow shovel is there too. They are outside a house.']\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "def downscale_images(image):\n", + " width = 200\n", + " ratio = (width / float(image.size[0]))\n", + " height = int((float(image.size[1]) * float(ratio)))\n", + " img = image.resize((width, height), Image.Resampling.LANCZOS)\n", + " return img\n", + "\n", + "images = [downscale_images(image) for image in retrieved_examples[\"image\"]]\n", + "# see the closest text and image\n", + "print(retrieved_examples[\"image_description\"])\n", + "display(images[0])\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ufn0oqPx5DUR" + }, + "source": [ + "## 用图片提示( prompts )来查询数据" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R6fNviJ28fns" + }, + "source": [ + "图片相似度推理也类似,你只需要调用 `get_image_features` 函数即可。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 217 + }, + "id": "t1BGXpT659Px", + "outputId": "53478699-5753-4946-90d6-0aa8b76694a6" + 
}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import requests\n", + "# image of a beaver\n", + "url = \"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/beaver.png\"\n", + "image = Image.open(requests.get(url, stream=True).raw)\n", + "display(downscale_images(image))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3kmz4g1v6SJ_" + }, + "source": [ + "搜索相似的图片" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qWf-G_Iz4RcD" + }, + "outputs": [], + "source": [ + "img_embedding = model.get_image_features(**processor([image], return_tensors=\"pt\", truncation=True).to(\"cuda\"))[0].detach().cpu().numpy()\n", + "scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('image_embeddings', img_embedding, k=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iFGNp5hp6VsV" + }, + "source": [ + "显示与海狸图像最相似的图像。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 197 + }, + "id": "Pq7IR86k54kP", + "outputId": "fa620b08-4435-4929-f67f-32b3f8f46b70" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Salmon swim upstream but they see a grizzly bear and are in shock. 
The bear has a smug look on his face when he sees the salmon.']\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "images = [downscale_images(image) for image in retrieved_examples[\"image\"]]\n", + "# see the closest text and image\n", + "print(retrieved_examples[\"image_description\"])\n", + "display(images[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6JEZJlkD8UrZ" + }, + "source": [ + "## 保存,推送,加载嵌入( embeddings )\n", + "我们可以用 `save_faiss_index` 函数储存数据集的嵌入。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dXrBMAHx8k51" + }, + "outputs": [], + "source": [ + "ds_with_embeddings.save_faiss_index('embeddings', 'embeddings/embeddings.faiss')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "51dgxmGm-c3x" + }, + "outputs": [], + "source": [ + "ds_with_embeddings.save_faiss_index('image_embeddings', 'embeddings/image_embeddings.faiss')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xO0i-dkY-nK5" + }, + "source": [ + "将嵌入存储在数据集仓库中是一种很好的做法,所以我们将创建一个数据集仓库,稍后把嵌入推送到那里。 \n", + "我们会登录 Hugging Face Hub,创建一个数据集仓库,推送我们的索引,然后使用 `snapshot_download` 加载。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ETmGo_KiAiOr" + }, + "outputs": [], + "source": [ + "from huggingface_hub import HfApi, notebook_login, snapshot_download\n", + "notebook_login()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "K3hmtWQn-k9O" + }, + "outputs": [], + "source": [ + "from huggingface_hub import HfApi\n", + "api = HfApi()\n", + "api.create_repo(\"merve/faiss_embeddings\", repo_type=\"dataset\")\n", + "api.upload_folder(\n", + " folder_path=\"./embeddings\",\n", + " repo_id=\"merve/faiss_embeddings\",\n", + " repo_type=\"dataset\",\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 
null, + "metadata": { + "id": "UTVoI9LWBp1x" + }, + "outputs": [], + "source": [ + "snapshot_download(repo_id=\"merve/faiss_embeddings\", repo_type=\"dataset\",\n", + " local_dir=\"downloaded_embeddings\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HGkYTJsM9BVx" + }, + "source": [ + "我们可以使用 `load_faiss_index` 将嵌入加载到没有嵌入的数据集中。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mbPvs8kV8xTy" + }, + "outputs": [], + "source": [ + "ds = ds[\"train\"]\n", + "ds.load_faiss_index('embeddings', './downloaded_embeddings/embeddings.faiss')\n", + "# infer again\n", + "prmt = \"people under the rain\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mc9JmZSG71WZ" + }, + "outputs": [], + "source": [ + "prmt_embedding = model.get_text_features(\n", + " **tokenizer([prmt], return_tensors=\"pt\", truncation=True)\n", + " .to(\"cuda\"))[0].detach().cpu().numpy()\n", + "\n", + "scores, retrieved_examples = ds.get_nearest_examples('embeddings', prmt_embedding, k=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 341 + }, + "id": "wckNsAX-9zox", + "outputId": "8d5008b4-ab8f-4b42-92e7-b29e57c126cb" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "display(retrieved_examples[\"image\"][0])" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "machine_shape": "hm", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/notebooks/zh-CN/fine_tuning_code_llm_on_single_gpu.ipynb b/notebooks/zh-CN/fine_tuning_code_llm_on_single_gpu.ipynb new file mode 100644 index 00000000..49a41367 --- /dev/null +++ 
b/notebooks/zh-CN/fine_tuning_code_llm_on_single_gpu.ipynb @@ -0,0 +1,1126 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "FNdZ-kD0l78P" + }, + "source": [ + "# 在单个 GPU 上针对自定义代码微调代码 LLM\n", + "\n", + "_作者: [Maria Khalusova](https://github.com/MKhalusova)_\n", + "\n", + "公开发布的代码 LLM,如 Codex、StarCoder 和 Code Llama,在生成遵循通用编程原则和语法的代码方面表现出色,但它们可能不符合组织的内部惯例,或者不了解某些特定的库。\n", + "\n", + "在这个 notebook 中,我们将展示如何微调代码 LLM 来更好的理解你们公司或组织的代码风格和习惯。由于代码 LLM 非常大,按照传统的微调方式可能会消耗大量资源。但不用担心!我们会教你一些技巧,让你只用单个 GPU 就能完成微调工作。\n", + "\n", + "\n", + "## 数据集\n", + "\n", + "对于这个例子,我们选择了 GitHub 上 Hugging Face 的前 10 个公共仓库。我们已经排除了非代码文件,如图片、音频文件、演示文稿等。对于 Jupyter notebook,我们只保留了包含代码的单元格。生成的代码被存储为一个数据集,你可以在 Hugging Face Hub 上找到,位于 [`smangrul/hf-stack-v1`](https://huggingface.co/datasets/smangrul/hf-stack-v1)。它包含仓库 id、文件路径和文件内容。\n", + "\n", + "\n", + "## 模型\n", + "\n", + "我们将微调 [`bigcode/starcoderbase-1b`](https://huggingface.co/bigcode/starcoderbase-1b) 模型,这是一个在 80 多种编程语言上训练的 10 亿参数模型。这是一个需要权限的模型,所以如果你计划使用这个确切模型运行这个 notebook,你需要在其模型页面上获得访问权限。登录你的 Hugging Face 帐户以执行此操作:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bPlCJYDK6vrF" + }, + "outputs": [], + "source": [ + "from huggingface_hub import notebook_login\n", + "\n", + "notebook_login()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WMVe_c8q43Qo" + }, + "source": [ + "\n", + "\n", + "To get started, let's install all the necessary libraries. As you can see, in addition to `transformers` and `datasets`, we'll be using `peft`, `bitsandbytes`, and `flash-attn` to optimize the training.\n", + "\n", + "By employing parameter-efficient training techniques, we can run this notebook on a single A100 High-RAM GPU." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Fp7i8WMCjKJG" + }, + "outputs": [], + "source": [ + "!pip install -q transformers datasets peft bitsandbytes flash-attn" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "16EdABzt3_Ig" + }, + "source": [ + "现在让我们定义一些变量。请随意调整这些变量。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hru3G-CLmqis" + }, + "outputs": [], + "source": [ + "MODEL=\"bigcode/starcoderbase-1b\" # Model checkpoint on the Hugging Face Hub\n", + "DATASET=\"smangrul/hf-stack-v1\" # Dataset on the Hugging Face Hub\n", + "DATA_COLUMN=\"content\" # Column name containing the code content\n", + "\n", + "SEQ_LENGTH=2048 # Sequence length\n", + "\n", + "# Training arguments\n", + "MAX_STEPS=2000 # max_steps\n", + "BATCH_SIZE=16 # batch_size\n", + "GR_ACC_STEPS=1 # gradient_accumulation_steps\n", + "LR=5e-4 # learning_rate\n", + "LR_SCHEDULER_TYPE=\"cosine\" # lr_scheduler_type\n", + "WEIGHT_DECAY=0.01 # weight_decay\n", + "NUM_WARMUP_STEPS=30 # num_warmup_steps\n", + "EVAL_FREQ=100 # eval_freq\n", + "SAVE_FREQ=100 # save_freq\n", + "LOG_FREQ=25 # log_freq\n", + "OUTPUT_DIR=\"peft-starcoder-lora-a100\" # output_dir\n", + "BF16=True # bf16\n", + "FP16=False # no_fp16\n", + "\n", + "# FIM transformations arguments\n", + "FIM_RATE=0.5 # fim_rate\n", + "FIM_SPM_RATE=0.5 # fim_spm_rate\n", + "\n", + "# LORA\n", + "LORA_R=8 # lora_r\n", + "LORA_ALPHA=32 # lora_alpha\n", + "LORA_DROPOUT=0.0 # lora_dropout\n", + "LORA_TARGET_MODULES=\"c_proj,c_attn,q_attn,c_fc,c_proj\" # lora_target_modules\n", + "\n", + "# bitsandbytes config\n", + "USE_NESTED_QUANT=True # use_nested_quant\n", + "BNB_4BIT_COMPUTE_DTYPE=\"bfloat16\" # bnb_4bit_compute_dtype\n", + "\n", + "SEED=0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "FyZSXTbJrcnC" + }, + "outputs": [], + "source": [ + "from transformers import (\n", + " AutoModelForCausalLM,\n", + "
AutoTokenizer,\n", + " Trainer,\n", + " TrainingArguments,\n", + " logging,\n", + " set_seed,\n", + " BitsAndBytesConfig,\n", + ")\n", + "\n", + "set_seed(SEED)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pO7F5L5AtKo1" + }, + "source": [ + "## 准备数据" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1LmrIZqP0oUE" + }, + "source": [ + "首先加载数据。由于数据集可能相当大,请确保启用流模式。流模式允许我们在遍历数据集时逐步加载数据,而不是一次性下载数据集的整个内容。\n", + "\n", + "我们将前 4000 个示例作为验证集,其余的全部作为训练数据。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4oJZvZb-1J88" + }, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "import torch\n", + "from tqdm import tqdm\n", + "\n", + "\n", + "dataset = load_dataset(\n", + " DATASET,\n", + " data_dir=\"data\",\n", + " split=\"train\",\n", + " streaming=True,\n", + ")\n", + "\n", + "valid_data = dataset.take(4000)\n", + "train_data = dataset.skip(4000)\n", + "train_data = train_data.shuffle(buffer_size=5000, seed=SEED)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sLQ8t0LM2GR6" + }, + "source": [ + "在这一步,数据集仍然包含任意长度的原始数据。为了训练,我们需要固定长度的输入。让我们创建一个可迭代的数据集,它可以从文本文件流中返回固定长度的 token 块。\n", + "\n", + "首先,让我们估计数据集中每个 token 的平均字符数,这将帮助我们稍后估计文本缓冲区中的 token 数量。默认情况下,我们只从数据集中取 400 个示例(`nb_examples`)。只使用整个数据集的一个子集可以减少计算成本,同时仍然提供了对整体字符到 token 比的合理估计。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "KCiAvydztNsu", + "outputId": "cabf7fd0-a922-4371-cbc6-60ee99ef7469" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 400/400 [00:10<00:00, 39.87it/s] " + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The character to token ratio of the dataset is: 2.43\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n" + ] + } + ], + "source": [ + "tokenizer = 
AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)\n", + "\n", + "def chars_token_ratio(dataset, tokenizer, data_column, nb_examples=400):\n", + " \"\"\"\n", + " Estimate the average number of characters per token in the dataset.\n", + " \"\"\"\n", + "\n", + " total_characters, total_tokens = 0, 0\n", + " for _, example in tqdm(zip(range(nb_examples), iter(dataset)), total=nb_examples):\n", + " total_characters += len(example[data_column])\n", + " total_tokens += len(tokenizer(example[data_column]).tokens())\n", + "\n", + " return total_characters / total_tokens\n", + "\n", + "\n", + "chars_per_token = chars_token_ratio(train_data, tokenizer, DATA_COLUMN)\n", + "print(f\"The character to token ratio of the dataset is: {chars_per_token:.2f}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6F13VGobB3Ma" + }, + "source": [ + "字符到 token 的比也可以用作文本标记质量的一个指标。例如,字符到 token 的比为 1.0 意味着每个字符都由一个 token 表示,这并没有太多意义。表明标记化做得不好。在标准的英文文本中,一个 token 通常相当于大约四个字符,这意味着字符到 token 的比率大约是 4.0。我们可以预见在代码数据集中的比率会更低,但一般来说,2.0 到 3.5 之间的数字可以认为是足够好的。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rcwYFRPpwxea" + }, + "source": [ + "**可选的 FIM 变换**\n", + "自回归语言模型通常是从左到右生成序列的。通过应用 FIM 变换,模型也可以学习填充文本。详细信息可以看[\"Efficient Training of Language Models to Fill in the Middle\" 这篇论文](https://arxiv.org/pdf/2207.14255.pdf)了解这种技术。\n", + "\n", + "我们将在下面定义 FIM 变换,并在创建可迭代数据集时使用它们。然而,如果你想省略变换步骤,请将 `fim_rate` 设置为 0。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zmejYvEKw1E-" + }, + "outputs": [], + "source": [ + "import functools\n", + "import numpy as np\n", + "\n", + "\n", + "# Helper function to get token ids of the special tokens for prefix, suffix and middle for FIM transformations.\n", + "@functools.lru_cache(maxsize=None)\n", + "def get_fim_token_ids(tokenizer):\n", + " try:\n", + " FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX, FIM_PAD = tokenizer.special_tokens_map[\"additional_special_tokens\"][1:5]\n", + " suffix_tok_id, 
prefix_tok_id, middle_tok_id, pad_tok_id = (\n", + " tokenizer.vocab[tok] for tok in [FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE, FIM_PAD]\n", + " )\n", + " except KeyError:\n", + " suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id = None, None, None, None\n", + " return suffix_tok_id, prefix_tok_id, middle_tok_id, pad_tok_id\n", + "\n", + "\n", + "## Adapted from https://github.com/bigcode-project/Megatron-LM/blob/6c4bf908df8fd86b4977f54bf5b8bd4b521003d1/megatron/data/gpt_dataset.py\n", + "def permute(\n", + " sample,\n", + " np_rng,\n", + " suffix_tok_id,\n", + " prefix_tok_id,\n", + " middle_tok_id,\n", + " pad_tok_id,\n", + " fim_rate=0.5,\n", + " fim_spm_rate=0.5,\n", + " truncate_or_pad=False,\n", + "):\n", + " \"\"\"\n", + " Take in a sample (list of tokens) and perform a FIM transformation on it with a probability of fim_rate, using two FIM modes:\n", + " PSM and SPM (with a probability of fim_spm_rate).\n", + " \"\"\"\n", + "\n", + " # The if condition will trigger with the probability of fim_rate\n", + " # This means FIM transformations will apply to samples with a probability of fim_rate\n", + " if np_rng.binomial(1, fim_rate):\n", + "\n", + " # Split the sample into prefix, middle, and suffix, based on randomly generated indices stored in the boundaries list.\n", + " boundaries = list(np_rng.randint(low=0, high=len(sample) + 1, size=2))\n", + " boundaries.sort()\n", + "\n", + " prefix = np.array(sample[: boundaries[0]], dtype=np.int64)\n", + " middle = np.array(sample[boundaries[0] : boundaries[1]], dtype=np.int64)\n", + " suffix = np.array(sample[boundaries[1] :], dtype=np.int64)\n", + "\n", + " if truncate_or_pad:\n", + " # calculate the new total length of the sample, taking into account tokens indicating prefix, middle, and suffix\n", + " new_length = suffix.shape[0] + prefix.shape[0] + middle.shape[0] + 3\n", + " diff = new_length - len(sample)\n", + "\n", + " # truncate or pad if there's a difference in length between the new length and the 
original\n", + " if diff > 0:\n", + " if suffix.shape[0] <= diff:\n", + " return sample, np_rng\n", + " suffix = suffix[: suffix.shape[0] - diff]\n", + " elif diff < 0:\n", + " suffix = np.concatenate([suffix, np.full((-1 * diff), pad_tok_id)])\n", + "\n", + " # With the probability of fim_spm_rate, apply the SPM variant of FIM transformations\n", + " # SPM: suffix, prefix, middle\n", + " if np_rng.binomial(1, fim_spm_rate):\n", + " new_sample = np.concatenate(\n", + " [\n", + " [prefix_tok_id, suffix_tok_id],\n", + " suffix,\n", + " [middle_tok_id],\n", + " prefix,\n", + " middle,\n", + " ]\n", + " )\n", + " # Otherwise, apply the PSM variant of FIM transformations\n", + " # PSM: prefix, suffix, middle\n", + " else:\n", + "\n", + " new_sample = np.concatenate(\n", + " [\n", + " [prefix_tok_id],\n", + " prefix,\n", + " [suffix_tok_id],\n", + " suffix,\n", + " [middle_tok_id],\n", + " middle,\n", + " ]\n", + " )\n", + " else:\n", + " # don't apply FIM transformations\n", + " new_sample = sample\n", + "\n", + " return list(new_sample), np_rng\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AwW5FviD9xBH" + }, + "source": [ + "让我们定义 `ConstantLengthDataset`,这是一个可迭代的数据集,它将返回固定长度的 token 块。为此,我们将从原始数据集中读取文本到缓冲区,直到达到大小限制,然后应用分词器将原始文本转换为 token 化的输入。此外,我们还可以选择对部分序列执行 FIM 变换(受影响的序列比例由 `fim_rate` 控制)。\n", + "\n", + "定义好后,我们可以从训练和验证数据中创建 `ConstantLengthDataset` 的实例。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AgDW-692wzOl" + }, + "outputs": [], + "source": [ + "from torch.utils.data import IterableDataset\n", + "from torch.utils.data.dataloader import DataLoader\n", + "import random\n", + "\n", + "# Create an Iterable dataset that returns constant-length chunks of tokens from a stream of text files.\n", + "\n", + "class ConstantLengthDataset(IterableDataset):\n", + " \"\"\"\n", + " Iterable dataset that returns constant length chunks of tokens from stream of text files.\n", + " Args:\n", + " tokenizer (Tokenizer): The 
processor used for processing the data.\n", + " dataset (dataset.Dataset): Dataset with text files.\n", + " infinite (bool): If True the iterator is reset after dataset reaches end else stops.\n", + " seq_length (int): Length of token sequences to return.\n", + " num_of_sequences (int): Number of token sequences to keep in buffer.\n", + " chars_per_token (float): Number of characters per token used to estimate number of tokens in text buffer.\n", + " fim_rate (float): Rate (0.0 to 1.0) that sample will be permuted with FIM.\n", + " fim_spm_rate (float): Rate (0.0 to 1.0) of FIM permutations that will use SPM.\n", + " seed (int): Seed for random number generator.\n", + " \"\"\"\n", + "\n", + " def __init__(\n", + " self,\n", + " tokenizer,\n", + " dataset,\n", + " infinite=False,\n", + " seq_length=1024,\n", + " num_of_sequences=1024,\n", + " chars_per_token=3.6,\n", + " content_field=\"content\",\n", + " fim_rate=0.5,\n", + " fim_spm_rate=0.5,\n", + " seed=0,\n", + " ):\n", + " self.tokenizer = tokenizer\n", + " self.concat_token_id = tokenizer.eos_token_id\n", + " self.dataset = dataset\n", + " self.seq_length = seq_length\n", + " self.infinite = infinite\n", + " self.current_size = 0\n", + " self.max_buffer_size = seq_length * chars_per_token * num_of_sequences\n", + " self.content_field = content_field\n", + " self.fim_rate = fim_rate\n", + " self.fim_spm_rate = fim_spm_rate\n", + " self.seed = seed\n", + "\n", + " (\n", + " self.suffix_tok_id,\n", + " self.prefix_tok_id,\n", + " self.middle_tok_id,\n", + " self.pad_tok_id,\n", + " ) = get_fim_token_ids(self.tokenizer)\n", + " if not self.suffix_tok_id and self.fim_rate > 0:\n", + " print(\"FIM is not supported by tokenizer, disabling FIM\")\n", + " self.fim_rate = 0\n", + "\n", + " def __iter__(self):\n", + " iterator = iter(self.dataset)\n", + " more_examples = True\n", + " np_rng = np.random.RandomState(seed=self.seed)\n", + " while more_examples:\n", + " buffer, buffer_len = [], 0\n", + " while True:\n", + " 
if buffer_len >= self.max_buffer_size:\n", + " break\n", + " try:\n", + " buffer.append(next(iterator)[self.content_field])\n", + " buffer_len += len(buffer[-1])\n", + " except StopIteration:\n", + " if self.infinite:\n", + " iterator = iter(self.dataset)\n", + " else:\n", + " more_examples = False\n", + " break\n", + " tokenized_inputs = self.tokenizer(buffer, truncation=False)[\"input_ids\"]\n", + " all_token_ids = []\n", + "\n", + " for tokenized_input in tokenized_inputs:\n", + " # optionally do FIM permutations\n", + " if self.fim_rate > 0:\n", + " tokenized_input, np_rng = permute(\n", + " tokenized_input,\n", + " np_rng,\n", + " self.suffix_tok_id,\n", + " self.prefix_tok_id,\n", + " self.middle_tok_id,\n", + " self.pad_tok_id,\n", + " fim_rate=self.fim_rate,\n", + " fim_spm_rate=self.fim_spm_rate,\n", + " truncate_or_pad=False,\n", + " )\n", + "\n", + " all_token_ids.extend(tokenized_input + [self.concat_token_id])\n", + " examples = []\n", + " for i in range(0, len(all_token_ids), self.seq_length):\n", + " input_ids = all_token_ids[i : i + self.seq_length]\n", + " if len(input_ids) == self.seq_length:\n", + " examples.append(input_ids)\n", + " random.shuffle(examples)\n", + " for example in examples:\n", + " self.current_size += 1\n", + " yield {\n", + " \"input_ids\": torch.LongTensor(example),\n", + " \"labels\": torch.LongTensor(example),\n", + " }\n", + "\n", + "\n", + "train_dataset = ConstantLengthDataset(\n", + " tokenizer,\n", + " train_data,\n", + " infinite=True,\n", + " seq_length=SEQ_LENGTH,\n", + " chars_per_token=chars_per_token,\n", + " content_field=DATA_COLUMN,\n", + " fim_rate=FIM_RATE,\n", + " fim_spm_rate=FIM_SPM_RATE,\n", + " seed=SEED,\n", + ")\n", + "eval_dataset = ConstantLengthDataset(\n", + " tokenizer,\n", + " valid_data,\n", + " infinite=False,\n", + " seq_length=SEQ_LENGTH,\n", + " chars_per_token=chars_per_token,\n", + " content_field=DATA_COLUMN,\n", + " fim_rate=FIM_RATE,\n", + " fim_spm_rate=FIM_SPM_RATE,\n", + " 
seed=SEED,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rxev1sk6tRW9" + }, + "source": [ + "## 准备模型" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UCtWV-U42Eq_" + }, + "source": [ + "现在数据已经准备好了,是时候加载模型了!我们将加载量化的模型。\n", + "\n", + "因为量化使用更少的位来表示数据,所以会减少内存使用。我们将使用 `bitsandbytes` 库来量化模型,因为它与 `transformers` 有很好的集成。我们需要做的只是定义一个 `bitsandbytes` 配置,然后在加载模型时使用它。\n", + "\n", + "4 比特位量化有不同的变体,但通常我们推荐使用 NF4 量化以获得更好的性能(`bnb_4bit_quant_type=\"nf4\"`)。\n", + "\n", + "`bnb_4bit_use_double_quant` 选项在第一次量化后添加第二次量化,以节省每个参数额外的 0.4 位。\n", + "\n", + "要了解更多关于量化的信息,请查看 [\"利用 bitsandbytes、4 比特位量化和 QLoRA 让 LLMs 更易于访问\" 的博客](https://huggingface.co/blog/4bit-transformers-bitsandbytes)。\n", + "\n", + "定义好后,将配置传递给 `from_pretrained` 方法以加载量化的模型。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XuwoX6U2DUvK" + }, + "outputs": [], + "source": [ + "from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training\n", + "from peft.tuners.lora import LoraLayer\n", + "\n", + "load_in_8bit = False\n", + "\n", + "# 4-bit quantization\n", + "compute_dtype = getattr(torch, BNB_4BIT_COMPUTE_DTYPE)\n", + "\n", + "bnb_config = BitsAndBytesConfig(\n", + " load_in_4bit=True,\n", + " bnb_4bit_quant_type=\"nf4\",\n", + " bnb_4bit_compute_dtype=compute_dtype,\n", + " bnb_4bit_use_double_quant=USE_NESTED_QUANT,\n", + ")\n", + "\n", + "device_map = {\"\": 0}\n", + "\n", + "model = AutoModelForCausalLM.from_pretrained(\n", + " MODEL,\n", + " load_in_8bit=load_in_8bit,\n", + " quantization_config=bnb_config,\n", + " device_map=device_map,\n", + " use_cache=False, # We will be using gradient checkpointing\n", + " trust_remote_code=True,\n", + " use_flash_attention_2=True,\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bO9e2FV8D8ZF" + }, + "source": [ + "当使用量化模型进行训练时,你需要调用 `prepare_model_for_kbit_training()` 函数来预处理量化模型以进行训练。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + 
"metadata": { + "id": "Qb_eB4xzEDBk" + }, + "outputs": [], + "source": [ + "model = prepare_model_for_kbit_training(model)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lmnLjPZpDVtg" + }, + "source": [ + "现在量化模型已经准备好了,我们可以设置一个 LoRA 配置。LoRA 通过大幅减少可训练参数的数量,使得微调更加高效。\n", + "\n", + "要使用 LoRA 技术训练模型,我们需要将基础模型包装为 `PeftModel`。这涉及到使用 `LoraConfig` 定义 LoRA 配置,并使用 `get_peft_model()` 和 `LoraConfig` 包装原始模型。\n", + "\n", + "要了解更多关于 LoRA 及其参数的信息,请参考 [PEFT 文档](https://huggingface.co/docs/peft/conceptual_guides/lora)。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "_pAUU2FR2Gey", + "outputId": "63328c2b-e693-49b1-ce0a-3ca8722f852a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "trainable params: 5,554,176 || all params: 1,142,761,472 || trainable%: 0.4860310866343243\n" + ] + } + ], + "source": [ + "# Set up lora\n", + "peft_config = LoraConfig(\n", + " lora_alpha=LORA_ALPHA,\n", + " lora_dropout=LORA_DROPOUT,\n", + " r=LORA_R,\n", + " bias=\"none\",\n", + " task_type=\"CAUSAL_LM\",\n", + " target_modules=LORA_TARGET_MODULES.split(\",\"),\n", + ")\n", + "\n", + "model = get_peft_model(model, peft_config)\n", + "model.print_trainable_parameters()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tHe7AElXzXVV" + }, + "source": [ + "可以看到,通过应用 LoRA 技术,我们现在只需要训练不到 1% 的参数。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T_CqVydc40IM" + }, + "source": [ + "## 训练模型" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q_iN2khjrbD3" + }, + "source": [ + "现在我们已经准备好了数据,并且优化了模型,我们可以将所有东西整合在一起开始训练。\n", + "\n", + "要实例化一个 `Trainer`,你需要定义训练配置。最重要的是 `TrainingArguments`,这是一个包含所有用于配置训练的属性的类。\n", + "\n", + "这些与你可能运行的任何其他类型的模型训练相似,所以我们这里不会详细说明。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "65QHS8l1tKQe" + }, + "outputs": [], + "source": [ + 
"train_data.start_iteration = 0\n", + "\n", + "\n", + "training_args = TrainingArguments(\n", + " output_dir=f\"Your_HF_username/{OUTPUT_DIR}\",\n", + " dataloader_drop_last=True,\n", + " evaluation_strategy=\"steps\",\n", + " save_strategy=\"steps\",\n", + " max_steps=MAX_STEPS,\n", + " eval_steps=EVAL_FREQ,\n", + " save_steps=SAVE_FREQ,\n", + " logging_steps=LOG_FREQ,\n", + " per_device_train_batch_size=BATCH_SIZE,\n", + " per_device_eval_batch_size=BATCH_SIZE,\n", + " learning_rate=LR,\n", + " lr_scheduler_type=LR_SCHEDULER_TYPE,\n", + " warmup_steps=NUM_WARMUP_STEPS,\n", + " gradient_accumulation_steps=GR_ACC_STEPS,\n", + " gradient_checkpointing=True,\n", + " fp16=FP16,\n", + " bf16=BF16,\n", + " weight_decay=WEIGHT_DECAY,\n", + " push_to_hub=True,\n", + " include_tokens_per_second=True,\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kB_fLRex09ut" + }, + "source": [ + "最后一步,实例化 `Trainer` 并调用 `train` 方法。 " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "rS3nVwhUC69O", + "outputId": "61a5bdb2-b7d0-4aed-8290-4bf20c2ccd38" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Training...\n" + ] + }, + { + "data": { + "text/html": [ + "\n", + "
\n", + " \n", + " \n", + " [2000/2000 4:16:10, Epoch 1/9223372036854775807]\n", + "
Step | Training Loss | Validation Loss
100 | 5.524600 | 7.456872
200 | 5.617800 | 7.262190
300 | 5.129100 | 6.410039
400 | 5.052200 | 6.306774
500 | 5.202900 | 6.117062
600 | 4.654100 | 6.018349
700 | 5.100200 | 6.000355
800 | 5.049800 | 5.889457
900 | 4.541200 | 5.813823
1000 | 5.000700 | 5.834208
1100 | 5.026500 | 5.781939
1200 | 4.411800 | 5.720596
1300 | 4.782500 | 5.736376
1400 | 4.980200 | 5.712276
1500 | 4.368700 | 5.689637
1600 | 4.884700 | 5.675920
1700 | 4.914400 | 5.662421
1800 | 4.248700 | 5.660122
1900 | 4.798400 | 5.664026
2000 | 4.704200 | 5.655665
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "TrainOutput(global_step=2000, training_loss=4.885598585128784, metrics={'train_runtime': 15380.3075, 'train_samples_per_second': 2.081, 'train_steps_per_second': 0.13, 'train_tokens_per_second': 4261.033, 'total_flos': 4.0317260660736e+17, 'train_loss': 4.885598585128784, 'epoch': 1.0})" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "trainer = Trainer(\n", + " model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset\n", + ")\n", + "\n", + "print(\"Training...\")\n", + "trainer.train()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aAERlCnt1PEW" + }, + "source": [ + "最后,你可以将微调好的模型推送到你的 Hub 仓库中,并分享给你的团队。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1h7_AUTTDwE1" + }, + "outputs": [], + "source": [ + "trainer.push_to_hub()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KBVH7uFOM_UF" + }, + "source": [ + "## 推理\n", + "\n", + "一旦模型被上传到 Hub,我们就可以使用它进行推理。为此,我们首先初始化原始的基础模型及其分词器。接下来,我们需要将微调后的权重与基础模型合并。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jtL37piINBFe" + }, + "outputs": [], + "source": [ + "from peft import PeftModel\n", + "import torch\n", + "\n", + "# load the original model first\n", + "tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)\n", + "base_model = AutoModelForCausalLM.from_pretrained(\n", + " MODEL,\n", + " quantization_config=None,\n", + " device_map=None,\n", + " trust_remote_code=True,\n", + " torch_dtype=torch.bfloat16,\n", + ").cuda()\n", + "\n", + "# merge fine-tuned weights with the base model\n", + "peft_model_id = f\"Your_HF_username/{OUTPUT_DIR}\"\n", + "model = PeftModel.from_pretrained(base_model, peft_model_id)\n", + "model.merge_and_unload()" + ] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "3USQ2suvDi9M" + }, + "source": [ + "现在我们可以使用合并后的模型进行推理。为了方便起见,我们将定义一个 `get_code_completion` 函数 - 请随意尝试文本生成参数!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RoTGpNbjDeWI" + }, + "outputs": [], + "source": [ + "def get_code_completion(prefix, suffix):\n", + " text = prompt = f\"\"\"{prefix}{suffix}\"\"\"\n", + " model.eval()\n", + " outputs = model.generate(\n", + " input_ids=tokenizer(text, return_tensors=\"pt\").input_ids.cuda(),\n", + " max_new_tokens=128,\n", + " temperature=0.2,\n", + " top_k=50,\n", + " top_p=0.95,\n", + " do_sample=True,\n", + " repetition_penalty=1.0,\n", + " )\n", + " return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0kMJiGDfDrBf" + }, + "source": [ + "现在,为了获得代码补全,我们只需要调用 `get_code_complete` 函数,并将我们希望补全的前几行作为前缀传递,以及一个空字符串作为后缀。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "nXlco2_-YcvM", + "outputId": "41c411ad-b7dc-4277-f975-c173888234bb" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "from peft import LoraConfig, TaskType, get_peft_model\n", + "from transformers import AutoModelForCausalLM\n", + "peft_config = LoraConfig(\n", + " task_type=TaskType.CAUSAL_LM,\n", + " r=8,\n", + " lora_alpha=32,\n", + " target_modules=[\"q_proj\", \"v_proj\"],\n", + " lora_dropout=0.1,\n", + " bias=\"none\",\n", + " modules_to_save=[\"q_proj\", \"v_proj\"],\n", + " inference_mode=False,\n", + ")\n", + "model = AutoModelForCausalLM.from_pretrained(\"gpt2\")\n", + "model = get_peft_model(model, peft_config)\n", + "model.print_trainable_parameters()\n" + ] + } + ], + "source": [ + "prefix = \"\"\"from peft import LoraConfig, TaskType, get_peft_model\n", + "from transformers import AutoModelForCausalLM\n", + "peft_config = LoraConfig(\n", + "\"\"\"\n", + "suffix 
=\"\"\"\"\"\"\n", + "\n", + "print(get_code_completion(prefix, suffix))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ql2563kGlnmu" + }, + "source": [ + "Having just used the PEFT library in this notebook, you can see that the generated code for creating a `LoraConfig` is rather good!\n", + "\n", + "If you go back to the cell where we instantiate the model for inference and comment out the lines where we merge the fine-tuned weights, you can see what the original model would generate for the exact same prefix:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "29xxp1eHTgJ9", + "outputId": "c6d597a2-01da-4d25-a32f-3a551212c5b4" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "from peft import LoraConfig, TaskType, get_peft_model\n", + "from transformers import AutoModelForCausalLM\n", + "peft_config = LoraConfig(\n", + "    model_name_or_path=\"facebook/wav2vec2-base-960h\",\n", + "    num_labels=1,\n", + "    num_features=1,\n", + "    num_hidden_layers=1,\n", + "    num_attention_heads=1,\n", + "    num_hidden_layers_per_attention_head=1,\n", + "    num_attention_heads_per_hidden_layer=1,\n", + "    hidden_size=1024,\n", + "    hidden_dropout_prob=0.1,\n", + "    hidden_act=\"gelu\",\n", + "    hidden_act_dropout_prob=0.1,\n", + "    hidden\n" + ] + } + ], + "source": [ + "prefix = \"\"\"from peft import LoraConfig, TaskType, get_peft_model\n", + "from transformers import AutoModelForCausalLM\n", + "peft_config = LoraConfig(\n", + "\"\"\"\n", + "suffix =\"\"\"\"\"\"\n", + "\n", + "print(get_code_completion(prefix, suffix))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Pwy2ZC7U8Ema" + }, + "source": [ + "Although the output is valid Python syntax, you can see that the original model has no understanding of what a `LoraConfig` should do.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CATYE8pp2drQ" + }, + "source": [ + "To learn how this kind of parameter-efficient fine-tuning compares to full fine-tuning, and how to use such a model as your copilot in VS Code via Inference Endpoints, or locally, check out the [\"Personal Copilot: Train Your Own Coding Assistant\" blog post](https://huggingface.co/blog/personal-copilot). This notebook complements the original blog post.\n" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "A100", + "machine_shape": "hm",
+ "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/notebooks/zh-CN/index.md b/notebooks/zh-CN/index.md new file mode 100644 index 00000000..ef20e734 --- /dev/null +++ b/notebooks/zh-CN/index.md @@ -0,0 +1,22 @@ +# Open-Source AI Cookbook + +The Open-Source AI Cookbook is a collection of notebooks showing practical techniques for building AI applications and solving various machine learning tasks using open-source tools and models. + +## Latest notebooks + +Check out the recently added notebooks: +- [Automatic Embeddings with TEI through Inference Endpoints](automatic_embedding_tei_inference_endpoints) +- [Simple RAG for GitHub issues using Hugging Face Zephyr and LangChain](rag_zephyr_langchain) +- [Embedding multimodal data for similarity search using 🤗 transformers, 🤗 datasets and FAISS](faiss_with_hf_datasets_and_clip) +- [Fine-tuning a Code LLM on Custom Code on a single GPU](fine_tuning_code_llm_on_single_gpu) +- [RAG Evaluation Using Synthetic data and LLM-As-A-Judge](rag_evaluation) +- [Advanced RAG on HuggingFace documentation using LangChain](advanced_rag) + +You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook). + +## Contributing + +The Open-Source AI Cookbook is a community effort, and we welcome contributions from everyone!
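+ + +Check out the cookbook's [contribution guide](https://github.com/huggingface/cookbook/blob/main/README.md) to learn how you can add your "recipe". diff --git a/notebooks/zh-CN/rag_evaluation.ipynb b/notebooks/zh-CN/rag_evaluation.ipynb new file mode 100644 index 00000000..0fd291df --- /dev/null +++ b/notebooks/zh-CN/rag_evaluation.ipynb @@ -0,0 +1,1467 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "4YErqpfH9jVI" + }, + "source": [ + "# RAG Evaluation\n", + "_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_\n", + "\n", + "This notebook demonstrates how you can evaluate your RAG (Retrieval Augmented Generation) system by building a synthetic evaluation dataset and using LLM-as-a-judge to compute the accuracy of your system.\n", + "\n", + "For an introduction to RAG systems, you can check out [this other cookbook](rag_zephyr_langchain)!\n", + "\n", + "RAG systems are complex: here is a RAG diagram, where we noted in blue all the possibilities for system enhancement:\n", + "\n", + "\n", + "\n", + "Implementing any of these improvements can bring a huge performance boost; but changing anything is useless if you cannot monitor the impact of your changes on the system's performance! So let's see how to evaluate our RAG system.\n", + "\n", + "### Evaluating RAG performance\n", + "\n", + "Since there are so many moving parts to tune, each with a big impact on performance, benchmarking the RAG system is crucial.\n", + "\n", + "For our evaluation pipeline, we will need:\n", + "1. An evaluation dataset with question-answer couples (QA couples)\n", + "2. An evaluator to compute the accuracy of our system on the above evaluation dataset.\n", + "\n", + "➡️ It turns out, we can use LLMs to help us all along the way!\n", + "1. The evaluation dataset will be synthetically generated by an LLM 🤖, and questions will be filtered out by other LLMs 🤖\n", + "2. 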
Then, an [LLM-as-a-judge](https://huggingface.co/papers/2306.05685) agent 🤖 will perform the evaluation on this synthetic dataset.\n", + "\n", + "\n", + "__Let's dig in and start building our evaluation pipeline!__ First, install the required model dependencies." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bCKBvOcp9jVK" + }, + "outputs": [], + "source": [ + "!pip install -q torch transformers langchain sentence-transformers faiss-gpu openpyxl openai" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "k_lJFbYm9jVL" + }, + "outputs": [], + "source": [ + "%reload_ext autoreload\n", + "%autoreload 2\n", + "%reload_ext dotenv\n", + "%dotenv" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oIlNZ1Mn9jVL" + }, + "outputs": [], + "source": [ + "from tqdm.notebook import tqdm\n", + "import pandas as pd\n", + "from typing import Optional, List, Tuple\n", + "from langchain_core.language_models import BaseChatModel\n", + "import json\n", + "import datasets\n", + "\n", + "pd.set_option(\"display.max_colwidth\", None)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zeW8P62J9jVM" + }, + "source": [ + "### Load your knowledge base" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YRbm5tNF9jVM" + }, + "outputs": [], + "source": [ + "ds = datasets.load_dataset(\"m-ric/huggingface_doc\", split=\"train\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wy9CKj0M9jVM" + }, + "source": [ + "# 1. Build a synthetic dataset for evaluation\n", + "\n", + "We first build a synthetic dataset of questions and associated contexts. The method is to take elements from our knowledge base and ask an LLM to generate questions based on these documents.\n", + "\n", + "Then we set up other LLM agents to act as quality filters for the generated QA couples: each of them will act as the filter for one specific flaw.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QkoEgiDg9jVM" + }, + "source": [ + "### 1.1. 
Prepare source documents" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3gTOlRKO9jVM" + }, + "outputs": [], + "source": [ + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "from langchain.docstore.document import Document as LangchainDocument\n", + "\n", + "langchain_docs = [\n", + "    LangchainDocument(page_content=doc[\"text\"], metadata={\"source\": doc[\"source\"]})\n", + "    for doc in tqdm(ds)\n", + "]\n", + "\n", + "\n", + "text_splitter = RecursiveCharacterTextSplitter(\n", + "    chunk_size=2000,\n", + "    chunk_overlap=200,\n", + "    add_start_index=True,\n", + "    separators=[\"\\n\\n\", \"\\n\", \".\", \" \", \"\"],\n", + ")\n", + "\n", + "docs_processed = []\n", + "for doc in langchain_docs:\n", + "    docs_processed += text_splitter.split_documents([doc])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WjrNhcCh9jVN" + }, + "source": [ + "### 1.2. Set up agents for question generation\n", + "\n", + "We use [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) for QA couple generation because it performs very well on leaderboards such as [Chatbot Arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "GoRySj3Q9jVN" + }, + "outputs": [], + "source": [ + "from langchain_community.llms import HuggingFaceHub\n", + "\n", + "repo_id = \"mistralai/Mixtral-8x7B-Instruct-v0.1\"\n", + "\n", + "llm = HuggingFaceHub(\n", + "    repo_id=repo_id,\n", + "    task=\"text-generation\",\n", + "    model_kwargs={\n", + "        \"max_new_tokens\": 512,\n", + "        \"top_k\": 30,\n", + "        \"temperature\": 0.1,\n", + "        \"repetition_penalty\": 1.03,\n", + "    },\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wubTNTaV9jVN" + }, + "outputs": [], + "source": [ + "from langchain_community.chat_models import ChatHuggingFace\n", + "\n", + "chat_model = ChatHuggingFace(llm=llm)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { +
"id": "hIM_DJRo9jVN" + }, + "outputs": [], + "source": [ + "from langchain.prompts import ChatPromptTemplate\n", + "\n", + "QA_generation_prompt = \"\"\"\n", + "Your task is to write a factoid question and an answer given a context.\n", + "Your factoid question should be answerable with a specific, concise piece of factual information from the context.\n", + "Your factoid question should be formulated in the same style as questions users could ask in a search engine.\n", + "This means that your factoid question MUST NOT mention something like \"according to the passage\" or \"context\".\n", + "\n", + "Provide your answer as follows:\n", + "\n", + "Output:::\n", + "Factoid question: (your factoid question)\n", + "Answer: (your answer to the factoid question)\n", + "\n", + "Now here is the context.\n", + "\n", + "Context: {context}\\n\n", + "Output:::\"\"\"\n", + "\n", + "QA_generation_prompt = ChatPromptTemplate.from_template(QA_generation_prompt)\n", + "QA_generation_agent = QA_generation_prompt | chat_model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lVFc-lVy9jVN" + }, + "source": [ + "现在让我们生成我们的问答对。\n", + "\n", + "对于这个例子,我们只生成 10 个问答对,并从 Hub 加载其余的。\n", + "\n", + "但是对于你的特定知识库,考虑到你想要获得至少约 100 个测试样本,并且考虑到我们稍后会用我们的批判智能体过滤掉大约一半的样本,你应该生成更多的样本,超过 200 个。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8fteqDDD9jVN" + }, + "outputs": [], + "source": [ + "import random\n", + "\n", + "N_GENERATIONS = (\n", + " 10 # We intentionally generate only 10 QA couples here for cost and time considerations\n", + ")\n", + "\n", + "print(f\"Generating {N_GENERATIONS} QA couples...\")\n", + "outputs = []\n", + "for context in tqdm(random.sample(langchain_docs, N_GENERATIONS)):\n", + " # Generate QA couple\n", + " output_QA_couple = QA_generation_agent.invoke({\"context\": context.page_content}).content\n", + " try:\n", + " question = output_QA_couple.split(\"Factoid question: \")[1].split(\"Answer: \")[0]\n", + " answer = 
output_QA_couple.split(\"Answer: \")[1]\n", + " outputs.append(\n", + " {\n", + " \"context\": context.page_content,\n", + " \"question\": question,\n", + " \"answer\": answer,\n", + " \"source_doc\": context.metadata[\"source\"],\n", + " }\n", + " )\n", + " except:\n", + " continue" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "aUlOUDv59jVN", + "outputId": "c9634fdb-2a7f-43a6-c4eb-e60b166b8238" + }, + "outputs": [ + { + "data": { + "text/html": [ + "

\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
contextquestionanswersource_doc
0!--Copyright 2023 The HuggingFace Team. All rights reserved.\\n\\nLicensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with\\nthe License. You may obtain a copy of the License at\\n\\nhttp://www.apache.org/licenses/LICENSE-2.0\\n\\nUnless required by applicable law or agreed to in writing, software distributed under the License is distributed on\\nan \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the\\nspecific language governing permissions and limitations under the License.\\n-->\\n\\n# Schedulers\\n\\n🤗 Diffusers provides many scheduler functions for the diffusion process. A scheduler takes a model's output (the sample which the diffusion process is iterating on) and a timestep to return a denoised sample. The timestep is important because it dictates where in the diffusion process the step is; data is generated by iterating forward *n* timesteps and inference occurs by propagating backward through the timesteps. Based on the timestep, a scheduler may be *discrete* in which case the timestep is an `int` or *continuous* in which case the timestep is a `float`.\\n\\nDepending on the context, a scheduler defines how to iteratively add noise to an image or how to update a sample based on a model's output:\\n\\n- during *training*, a scheduler adds noise (there are different algorithms for how to add noise) to a sample to train a diffusion model\\n- during *inference*, a scheduler defines how to update a sample based on a pretrained model's output\\n\\nMany schedulers are implemented from the [k-diffusion](https://github.com/crowsonkb/k-diffusion) library by [Katherine Crowson](https://github.com/crowsonkb/), and they're also widely used in A1111. 
To help you map the schedulers from k-diffusion and A1111 to the schedulers in 🤗 Diffusers, take a look at the table below:\\n\\n| A1111/k-diffusion | 🤗 Diffusers | Usage |\\n|---------------------|-------------------------------------|---------------------------------------------------------------------------------------------------------------|\\n| DPM++ 2M | [`DPMSolverMultistepScheduler`] | |\\n| DPM++ 2M Karras | [`DPMSolverMultistepScheduler`] | init with `use_karras_sigmas=True` |\\n| DPM++ 2M SDE | [`DPMSolverMultistepScheduler`] | init with `algorithm_type=\"sde-dpmsolver++\"` |\\n| DPM++ 2M SDE Karras | [`DPMSolverMultistepScheduler`] | init with `use_karras_sigmas=True` and `algorithm_type=\"sde-dpmsolver++\"` |\\n| DPM++ 2S a | N/A | very similar to `DPMSolverSinglestepScheduler` |\\n| DPM++ 2S a Karras | N/A | very similar to `DPMSolverSinglestepScheduler(use_karras_sigmas=True, ...)` |\\n| DPM++ SDE | [`DPMSolverSinglestepScheduler`] | |\\n| DPM++ SDE Karras | [`DPMSolverSinglestepScheduler`] | init with `use_karras_sigmas=True` |\\n| DPM2 | [`KDPM2DiscreteScheduler`] | |\\n| DPM2 Karras | [`KDPM2DiscreteScheduler`] | init with `use_karras_sigmas=True` |\\n| DPM2 a | [`KDPM2AncestralDiscreteScheduler`] | |\\n| DPM2 a Karras | [`KDPM2AncestralDiscreteScheduler`] | init with `use_karras_sigmas=True` |\\n| DPM adaptive | N/A | |\\n| DPM fast | N/A | |\\n| Euler | [`EulerDiscreteScheduler`] | |\\n| Euler a | [`EulerAncestralDiscreteScheduler`] | |\\n| Heun | [`HeunDiscreteScheduler`] | |\\n| LMS | [`LMSDiscreteScheduler`] | |\\n| LMS Karras | [`LMSDiscreteScheduler`] | init with `use_karras_sigmas=True` |\\n| N/A | [`DEISMultistepScheduler`] | |\\n| N/A | [`UniPCMultistepScheduler`] | |\\n\\nAll schedulers are built from the base [`SchedulerMixin`] class which implements low level utilities shared by all schedulers.\\n\\n## SchedulerMixin\\n[[autodoc]] SchedulerMixin\\n\\n## SchedulerOutput\\n[[autodoc]] schedulers.scheduling_utils.SchedulerOutput\\n\\n## 
KarrasDiffusionSchedulers\\n\\n[`KarrasDiffusionSchedulers`] are a broad generalization of schedulers in 🤗 Diffusers. The schedulers in this class are distinguished at a high level by their noise sampling strategy, the type of network and scaling, the training strategy, and how the loss is weighed.\\n\\nThe different schedulers in this class, depending on the ordinary differential equations (ODE) solver type, fall into the above taxonomy and provide a good abstraction for the design of the main schedulers implemented in 🤗 Diffusers. The schedulers in this class are given [here](https://github.com/huggingface/diffusers/blob/a69754bb879ed55b9b6dc9dd0b3cf4fa4124c765/src/diffusers/schedulers/scheduling_utils.py#L32).\\n\\n## PushToHubMixin\\n\\n[[autodoc]] utils.PushToHubMixin\\nWhat is the class of schedulers in 🤗 Diffusers that are distinguished by their noise sampling strategy, type of network and scaling, training strategy, and loss weighing?\\n[`KarrasDiffusionSchedulers`]huggingface/diffusers/blob/main/docs/source/en/api/schedulers/overview.md
\n", + "
" + ], + "text/plain": [ + " context \\\n", + "0 !--Copyright 2023 The HuggingFace Team. All rights reserved.\\n\\nLicensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with\\nthe License. You may obtain a copy of the License at\\n\\nhttp://www.apache.org/licenses/LICENSE-2.0\\n\\nUnless required by applicable law or agreed to in writing, software distributed under the License is distributed on\\nan \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the\\nspecific language governing permissions and limitations under the License.\\n-->\\n\\n# Schedulers\\n\\n🤗 Diffusers provides many scheduler functions for the diffusion process. A scheduler takes a model's output (the sample which the diffusion process is iterating on) and a timestep to return a denoised sample. The timestep is important because it dictates where in the diffusion process the step is; data is generated by iterating forward *n* timesteps and inference occurs by propagating backward through the timesteps. Based on the timestep, a scheduler may be *discrete* in which case the timestep is an `int` or *continuous* in which case the timestep is a `float`.\\n\\nDepending on the context, a scheduler defines how to iteratively add noise to an image or how to update a sample based on a model's output:\\n\\n- during *training*, a scheduler adds noise (there are different algorithms for how to add noise) to a sample to train a diffusion model\\n- during *inference*, a scheduler defines how to update a sample based on a pretrained model's output\\n\\nMany schedulers are implemented from the [k-diffusion](https://github.com/crowsonkb/k-diffusion) library by [Katherine Crowson](https://github.com/crowsonkb/), and they're also widely used in A1111. 
To help you map the schedulers from k-diffusion and A1111 to the schedulers in 🤗 Diffusers, take a look at the table below:\\n\\n| A1111/k-diffusion | 🤗 Diffusers | Usage |\\n|---------------------|-------------------------------------|---------------------------------------------------------------------------------------------------------------|\\n| DPM++ 2M | [`DPMSolverMultistepScheduler`] | |\\n| DPM++ 2M Karras | [`DPMSolverMultistepScheduler`] | init with `use_karras_sigmas=True` |\\n| DPM++ 2M SDE | [`DPMSolverMultistepScheduler`] | init with `algorithm_type=\"sde-dpmsolver++\"` |\\n| DPM++ 2M SDE Karras | [`DPMSolverMultistepScheduler`] | init with `use_karras_sigmas=True` and `algorithm_type=\"sde-dpmsolver++\"` |\\n| DPM++ 2S a | N/A | very similar to `DPMSolverSinglestepScheduler` |\\n| DPM++ 2S a Karras | N/A | very similar to `DPMSolverSinglestepScheduler(use_karras_sigmas=True, ...)` |\\n| DPM++ SDE | [`DPMSolverSinglestepScheduler`] | |\\n| DPM++ SDE Karras | [`DPMSolverSinglestepScheduler`] | init with `use_karras_sigmas=True` |\\n| DPM2 | [`KDPM2DiscreteScheduler`] | |\\n| DPM2 Karras | [`KDPM2DiscreteScheduler`] | init with `use_karras_sigmas=True` |\\n| DPM2 a | [`KDPM2AncestralDiscreteScheduler`] | |\\n| DPM2 a Karras | [`KDPM2AncestralDiscreteScheduler`] | init with `use_karras_sigmas=True` |\\n| DPM adaptive | N/A | |\\n| DPM fast | N/A | |\\n| Euler | [`EulerDiscreteScheduler`] | |\\n| Euler a | [`EulerAncestralDiscreteScheduler`] | |\\n| Heun | [`HeunDiscreteScheduler`] | |\\n| LMS | [`LMSDiscreteScheduler`] | |\\n| LMS Karras | [`LMSDiscreteScheduler`] | init with `use_karras_sigmas=True` |\\n| N/A | [`DEISMultistepScheduler`] | |\\n| N/A | [`UniPCMultistepScheduler`] | |\\n\\nAll schedulers are built from the base [`SchedulerMixin`] class which implements low level utilities shared by all schedulers.\\n\\n## SchedulerMixin\\n[[autodoc]] SchedulerMixin\\n\\n## SchedulerOutput\\n[[autodoc]] schedulers.scheduling_utils.SchedulerOutput\\n\\n## 
KarrasDiffusionSchedulers\\n\\n[`KarrasDiffusionSchedulers`] are a broad generalization of schedulers in 🤗 Diffusers. The schedulers in this class are distinguished at a high level by their noise sampling strategy, the type of network and scaling, the training strategy, and how the loss is weighed.\\n\\nThe different schedulers in this class, depending on the ordinary differential equations (ODE) solver type, fall into the above taxonomy and provide a good abstraction for the design of the main schedulers implemented in 🤗 Diffusers. The schedulers in this class are given [here](https://github.com/huggingface/diffusers/blob/a69754bb879ed55b9b6dc9dd0b3cf4fa4124c765/src/diffusers/schedulers/scheduling_utils.py#L32).\\n\\n## PushToHubMixin\\n\\n[[autodoc]] utils.PushToHubMixin\\n \n", + "\n", + " question \\\n", + "0 What is the class of schedulers in 🤗 Diffusers that are distinguished by their noise sampling strategy, type of network and scaling, training strategy, and loss weighing?\\n \n", + "\n", + " answer \\\n", + "0 [`KarrasDiffusionSchedulers`] \n", + "\n", + " source_doc \n", + "0 huggingface/diffusers/blob/main/docs/source/en/api/schedulers/overview.md " + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "display(pd.DataFrame(outputs).head(1))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0KG4dNtg9jVN" + }, + "source": [ + "### 1.3. 
Set up critique agents\n", + "\n", + "The questions generated by the previous agent can have many flaws: we should run a quality check before validating them.\n", + "\n", + "We thus build critique agents that will rate each question on several criteria, given in [this paper](https://huggingface.co/papers/2312.10003):\n", + "- **Groundedness**: can the question be answered from the given context?\n", + "- **Relevance**: is the question relevant to users? For instance, `\"What is the date when transformers 4.29.1 was released?\"` is not relevant for ML practitioners.\n", + "\n", + "One last failure case we've noticed is when a question is tailored to the particular setting where it was generated, but is undecipherable by itself, like `\"What is the name of the function used in this guide?\"`.\n", + "We also build a critique agent for this criterion:\n", + "- **Stand-alone**: is the question understandable without any context, for someone with domain knowledge or Internet access? The opposite of this would be a question generated from a specific blog article, such as \"What is the function used in this article?\"\n", + "\n", + "We systematically score questions with all these agents, and whenever the score is too low for any one of the agents, we eliminate the question from our evaluation dataset.\n", + "\n", + "💡 ___When asking the agents to output a score, we first ask them to produce their rationale. This helps us verify the scores, but most importantly, asking the model to output its rationale first gives it more tokens to think and elaborate an answer before summarizing it into a single score token.___\n", + "\n", + "We now build and run these critique agents." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "05aSgTGs9jVO" + }, + "outputs": [], + "source": [ + "question_groundedness_critique_prompt = \"\"\"\n", + "You will be given a context and a question.\n", + "Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.\n", + "Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.\n", + "\n", + "Provide your answer as follows:\n", + "\n", + "Answer:::\n", + "Evaluation: (your rationale for the rating)\n", + "Total rating: (your rating)\n", + "\n", + "Now here are the question and context.\n", + "\n", + "Question: {question}\\n\n", + "Context: {context}\\n\n", + "Answer::: \"\"\"\n", + "\n", + "question_relevance_critique_prompt = \"\"\"\n", + "You will be given a question.\n", + "Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.\n", + "Give your answer on a scale of 1 to 5, where 
1 means that the question is not useful at all, and 5 means that the question is extremely useful.\n", + "\n", + "Provide your answer as follows:\n", + "\n", + "Answer:::\n", + "Evaluation: (your rationale for the rating)\n", + "Total rating: (your rating)\n", + "\n", + "Now here is the question.\n", + "\n", + "Question: {question}\\n\n", + "Answer::: \"\"\"\n", + "\n", + "question_standalone_critique_prompt = \"\"\"\n", + "You will be given a question.\n", + "Your task is to provide a 'total rating' representing how context-independant this question is.\n", + "Give your answer on a scale of 1 to 5, where 1 means that the question only makes sense in a specific context, and 5 means that the question makes sense by itself.\n", + "For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.\n", + "The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.\n", + "\n", + "Provide your answer as follows:\n", + "\n", + "Answer:::\n", + "Evaluation: (your rationale for the rating)\n", + "Total rating: (your rating)\n", + "\n", + "Now here is the question.\n", + "\n", + "Question: {question}\\n\n", + "Answer::: \"\"\"\n", + "\n", + "question_groundedness_critique_prompt = ChatPromptTemplate.from_template(\n", + " question_groundedness_critique_prompt\n", + ")\n", + "question_groundedness_critique_agent = question_groundedness_critique_prompt | chat_model\n", + "\n", + "question_relevance_critique_prompt = ChatPromptTemplate.from_template(\n", + " question_relevance_critique_prompt\n", + ")\n", + "question_relevance_critique_agent = question_relevance_critique_prompt | chat_model\n", + "\n", + "question_standalone_critique_prompt = ChatPromptTemplate.from_template(\n", + " question_standalone_critique_prompt\n", + ")\n", + 
"question_standalone_critique_agent = question_standalone_critique_prompt | chat_model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "b9tbk7ME9jVO" + }, + "outputs": [], + "source": [ + "print(\"Generating critique for each QA couple...\")\n", + "for output in tqdm(outputs):\n", + " # Critique the generated QA couple\n", + " question_groundedness_evaluation = question_groundedness_critique_agent.invoke(\n", + " {\"context\": output[\"context\"], \"question\": output[\"question\"]}\n", + " ).content\n", + " question_relevance_evaluation = question_relevance_critique_agent.invoke(\n", + " {\"question\": output[\"question\"]}\n", + " ).content\n", + " question_standalone_evaluation = question_standalone_critique_agent.invoke(\n", + " {\"question\": output[\"question\"]}\n", + " ).content\n", + "\n", + " try:\n", + " groundedness_score = int(question_groundedness_evaluation.split(\"Total rating: \")[1][0])\n", + " groundedness_eval = question_groundedness_evaluation.split(\"Total rating: \")[0].split(\n", + " \"Evaluation: \"\n", + " )[1]\n", + " relevance_score = int(question_relevance_evaluation.split(\"Total rating: \")[1][0])\n", + " relevance_eval = question_relevance_evaluation.split(\"Total rating: \")[0].split(\n", + " \"Evaluation: \"\n", + " )[1]\n", + " standalone_score = int(question_standalone_evaluation.split(\"Total rating: \")[1][0])\n", + " standalone_eval = question_standalone_evaluation.split(\"Total rating: \")[0].split(\n", + " \"Evaluation: \"\n", + " )[1]\n", + " output.update(\n", + " {\n", + " \"groundedness_score\": groundedness_score,\n", + " \"groundedness_eval\": groundedness_eval,\n", + " \"relevance_score\": relevance_score,\n", + " \"relevance_eval\": relevance_eval,\n", + " \"standalone_score\": standalone_score,\n", + " \"standalone_eval\": standalone_eval,\n", + " }\n", + " )\n", + " except:\n", + " continue" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IQv36Y_f9jVO" + }, + 
"source": [ + "现在让我们基于我们批判智能体的分数过滤掉不好的问题:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oBWuOu1b9jVO", + "outputId": "b32bacea-52f8-486a-96fe-5c188605c5a2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Evaluation dataset before filtering:\n" + ] + }, + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + "  <thead>\n", + "    <tr style=\"text-align: right;\">\n", + "      <th></th>\n", + "      <th>question</th>\n", + "      <th>answer</th>\n", + "      <th>groundedness_score</th>\n", + "      <th>relevance_score</th>\n", + "      <th>standalone_score</th>\n", + "    </tr>\n", + "  </thead>\n", + "  <tbody>\n", +
"    <tr>\n", + "      <th>0</th>\n", + "      <td>What is the class of schedulers in 🤗 Diffusers that are distinguished by their noise sampling strategy, type of network and scaling, training strategy, and loss weighing?\\n</td>\n", + "      <td>[`KarrasDiffusionSchedulers`]</td>\n", + "      <td>3.0</td>\n", + "      <td>1.0</td>\n", + "      <td>4.0</td>\n", + "    </tr>\n", +
"    <tr>\n", + "      <th>1</th>\n", + "      <td>What are some utility functions provided by the Hugging Face library for pipelines?\\n</td>\n", + "      <td>The Hugging Face library provides several utility functions for pipelines, including `ArgumentHandler`, `ZeroShotClassificationArgumentHandler`, `QuestionAnsweringArgumentHandler` for argument handling, `PipelineDataFormat`, `CsvPipelineDataFormat`, `JsonPipelineDataFormat`, `PipedPipelineDataFormat` for data format, and `PipelineException` for exceptions.</td>\n", + "      <td>5.0</td>\n", + "      <td>4.0</td>\n", + "      <td>5.0</td>\n", + "    </tr>\n", +
"    <tr>\n", + "      <th>2</th>\n", + "      <td>What is the default name used in the Gradio demo if no name is provided?\\n</td>\n", + "      <td>User\\n\\nExplanation: The factoid question asks for the default name used in the Gradio demo if no name is provided. The answer to this question can be found in the `argparse.ArgumentParser()` function, where a default value of \"User\" is set for the `--name` argument.</td>\n", + "      <td>5.0</td>\n", + "      <td>3.0</td>\n", + "      <td>5.0</td>\n", + "    </tr>\n", +
"    <tr>\n", + "      <th>3</th>\n", + "      <td>What is the function used to load a pre-trained Resnet-18 model in the provided context?\\n</td>\n", + "      <td>The function used to load a pre-trained Resnet-18 model in the provided context is `torch.hub.load('pytorch/vision:v0.6.0', 'resnet18', pretrained=True).eval()`.</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "    </tr>\n", +
"    <tr>\n", + "      <th>4</th>\n", + "      <td>What is the name of the component used for creating a button in the given code?\\n</td>\n", + "      <td>The name of the component used for creating a button in the given code is `BaseButton`.</td>\n", + "      <td>5.0</td>\n", + "      <td>1.0</td>\n", + "      <td>5.0</td>\n", + "    </tr>\n", +
"    <tr>\n", + "      <th>5</th>\n", + "      <td>What is the command to get the example ONNX file for Bart model?\\n</td>\n", + "      <td>The command is `python run_onnx_exporter.py --model_name_or_path facebook/bart-base`.</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "    </tr>\n", +
"    <tr>\n", + "      <th>6</th>\n", + "      <td>What will be covered in the next unit of the course?\\n</td>\n", + "      <td>The next unit of the course will cover learning more about Unity MLAgents and training agents in Unity environments. It will also prepare students for AI vs AI challenges where they will train their agents to compete against other agents in a snowball fight and a soccer game.</td>\n", + "      <td>5.0</td>\n", + "      <td>1.0</td>\n", + "      <td>5.0</td>\n", + "    </tr>\n", +
"    <tr>\n", + "      <th>7</th>\n", + "      <td>What is the purpose of the `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` parameters in SDXL?\\n</td>\n", + "      <td>These parameters allow SDXL to negatively condition the model on image resolution and cropping parameters.</td>\n", + "      <td>2.0</td>\n", + "      <td>4.0</td>\n", + "      <td>2.0</td>\n", + "    </tr>\n", +
"    <tr>\n", + "      <th>8</th>\n", + "      <td>How are transformers models tested in the Hugging Face repository?\\n</td>\n", + "      <td>Transformers models are tested in the Hugging Face repository using two test suites: `tests` for the general API and `examples` for various applications that aren't part of the API. These tests are run on CircleCI and GitHub Actions, with different jobs and configurations for each. The tests can be run in various ways, including running all tests, getting the list of all tests, running a specific test module, and running specific tests by name or keyword expression. Additionally, there are options for running tests in parallel, repeating tests, and running tests on a specific GPU or CPU.</td>\n", + "      <td>3.0</td>\n", + "      <td>4.0</td>\n", + "      <td>4.0</td>\n", + "    </tr>\n", +
"    <tr>\n", + "      <th>9</th>\n", + "      <td>What command is used to create a virtual environment in the given context?\\n</td>\n", + "      <td>The command used to create a virtual environment in the given context is `python -m venv &lt;env_name&gt;`.</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "      <td>NaN</td>\n", + "    </tr>\n", +
"  </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " question \\\n", + "0 What is the class of schedulers in 🤗 Diffusers that are distinguished by their noise sampling strategy, type of network and scaling, training strategy, and loss weighing?\\n \n", + "1 What are some utility functions provided by the Hugging Face library for pipelines?\\n \n", + "2 What is the default name used in the Gradio demo if no name is provided?\\n \n", + "3 What is the function used to load a pre-trained Resnet-18 model in the provided context?\\n \n", + "4 What is the name of the component used for creating a button in the given code?\\n \n", + "5 What is the command to get the example ONNX file for Bart model?\\n \n", + "6 What will be covered in the next unit of the course?\\n \n", + "7 What is the purpose of the `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` parameters in SDXL?\\n \n", + "8 How are transformers models tested in the Hugging Face repository?\\n \n", + "9 What command is used to create a virtual environment in the given context?\\n \n", + "\n", + " answer \\\n", + "0 [`KarrasDiffusionSchedulers`] \n", + "1 The Hugging Face library provides several utility functions for pipelines, including `ArgumentHandler`, `ZeroShotClassificationArgumentHandler`, `QuestionAnsweringArgumentHandler` for argument handling, `PipelineDataFormat`, `CsvPipelineDataFormat`, `JsonPipelineDataFormat`, `PipedPipelineDataFormat` for data format, and `PipelineException` for exceptions. \n", + "2 User\\n\\nExplanation: The factoid question asks for the default name used in the Gradio demo if no name is provided. The answer to this question can be found in the `argparse.ArgumentParser()` function, where a default value of \"User\" is set for the `--name` argument. \n", + "3 The function used to load a pre-trained Resnet-18 model in the provided context is `torch.hub.load('pytorch/vision:v0.6.0', 'resnet18', pretrained=True).eval()`. 
\n", + "4 The name of the component used for creating a button in the given code is `BaseButton`. \n", + "5 The command is `python run_onnx_exporter.py --model_name_or_path facebook/bart-base`. \n", + "6 The next unit of the course will cover learning more about Unity MLAgents and training agents in Unity environments. It will also prepare students for AI vs AI challenges where they will train their agents to compete against other agents in a snowball fight and a soccer game. \n", + "7 These parameters allow SDXL to negatively condition the model on image resolution and cropping parameters. \n", + "8 Transformers models are tested in the Hugging Face repository using two test suites: `tests` for the general API and `examples` for various applications that aren't part of the API. These tests are run on CircleCI and GitHub Actions, with different jobs and configurations for each. The tests can be run in various ways, including running all tests, getting the list of all tests, running a specific test module, and running specific tests by name or keyword expression. Additionally, there are options for running tests in parallel, repeating tests, and running tests on a specific GPU or CPU. \n", + "9 The command used to create a virtual environment in the given context is `python -m venv `. \n", + "\n", + " groundedness_score relevance_score standalone_score \n", + "0 3.0 1.0 4.0 \n", + "1 5.0 4.0 5.0 \n", + "2 5.0 3.0 5.0 \n", + "3 NaN NaN NaN \n", + "4 5.0 1.0 5.0 \n", + "5 NaN NaN NaN \n", + "6 5.0 1.0 5.0 \n", + "7 2.0 4.0 2.0 \n", + "8 3.0 4.0 4.0 \n", + "9 NaN NaN NaN " + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "============================================\n", + "Final evaluation dataset:\n" + ] + }, + { + "data": { + "text/html": [ + "
<div>\n", + "<table border=\"1\" class=\"dataframe\">\n", + "  <thead>\n", + "    <tr style=\"text-align: right;\">\n", + "      <th></th>\n", + "      <th>question</th>\n", + "      <th>answer</th>\n", + "      <th>groundedness_score</th>\n", + "      <th>relevance_score</th>\n", + "      <th>standalone_score</th>\n", + "    </tr>\n", + "  </thead>\n", + "  <tbody>\n", +
"    <tr>\n", + "      <th>1</th>\n", + "      <td>What are some utility functions provided by the Hugging Face library for pipelines?\\n</td>\n", + "      <td>The Hugging Face library provides several utility functions for pipelines, including `ArgumentHandler`, `ZeroShotClassificationArgumentHandler`, `QuestionAnsweringArgumentHandler` for argument handling, `PipelineDataFormat`, `CsvPipelineDataFormat`, `JsonPipelineDataFormat`, `PipedPipelineDataFormat` for data format, and `PipelineException` for exceptions.</td>\n", + "      <td>5.0</td>\n", + "      <td>4.0</td>\n", + "      <td>5.0</td>\n", + "    </tr>\n", +
"  </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " question \\\n", + "1 What are some utility functions provided by the Hugging Face library for pipelines?\\n \n", + "\n", + " answer \\\n", + "1 The Hugging Face library provides several utility functions for pipelines, including `ArgumentHandler`, `ZeroShotClassificationArgumentHandler`, `QuestionAnsweringArgumentHandler` for argument handling, `PipelineDataFormat`, `CsvPipelineDataFormat`, `JsonPipelineDataFormat`, `PipedPipelineDataFormat` for data format, and `PipelineException` for exceptions. \n", + "\n", + " groundedness_score relevance_score standalone_score \n", + "1 5.0 4.0 5.0 " + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import pandas as pd\n", + "\n", + "pd.set_option(\"display.max_colwidth\", None)\n", + "\n", + "generated_questions = pd.DataFrame.from_dict(outputs)\n", + "\n", + "print(\"Evaluation dataset before filtering:\")\n", + "display(\n", + " generated_questions[\n", + " [\"question\", \"answer\", \"groundedness_score\", \"relevance_score\", \"standalone_score\"]\n", + " ]\n", + ")\n", + "generated_questions = generated_questions.loc[\n", + " (generated_questions[\"groundedness_score\"] >= 4)\n", + " & (generated_questions[\"relevance_score\"] >= 4)\n", + " & (generated_questions[\"standalone_score\"] >= 4)\n", + "]\n", + "print(\"============================================\")\n", + "print(\"Final evaluation dataset:\")\n", + "display(\n", + " generated_questions[\n", + " [\"question\", \"answer\", \"groundedness_score\", \"relevance_score\", \"standalone_score\"]\n", + " ]\n", + ")\n", + "\n", + "eval_dataset = datasets.Dataset.from_pandas(\n", + " generated_questions, split=\"train\", preserve_index=False\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HaOMZyu69jVO" + }, + "source": [ + "现在我们合成评估数据集已完成!我们可以在这个评估数据集上评估不同的 RAG 系统。\n", + "\n", + "我们在这里只生成了少数几个问答对,以减少时间和成本。下面,让我们通过加载一个预先生成的数据集来进行下一部分:" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "metadata": { + "id": "Q3RRz4W79jVO" + }, + "outputs": [], + "source": [ + "eval_dataset = datasets.load_dataset(\"m-ric/huggingface_doc_qa_eval\", split=\"train\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K5s19uTd9jVO" + }, + "source": [ + "# 2. 构建我们的 RAG 系统" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z-mET8Dy9jVO" + }, + "source": [ + "### 2.1. 预处理文档来构建我们的向量数据库\n", + "\n", + "- 在这一部分,__我们将知识库中的文档分割成更小的片段__:这些将是被检索器选取的片段,然后被阅读器 LLM 作为支持其答案的元素。\n", + "- 目标是构建语义上相关的片段:不要太小,以免不足以支持答案,也不要太大,以免稀释单个内容。\n", + "\n", + "文本分割有许多选项:\n", + "- 每隔 `n` 个单词/字符分割,但这有可能割裂段落甚至句子\n", + "- 在 `n` 个单词/字符后分割,但只在句子边界处\n", + "- **递归分割** 尝试通过树状处理文档来保留更多文档结构,首先在最大单元(章节)上分割,然后递归地在更小单元(段落,句子)上分割。\n", + "要了解更多关于分块的信息,我建议你阅读由 Greg Kamradt 编写的[不错的教程](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb) 。\n", + "\n", + "[这个 space](https://huggingface.co/spaces/m-ric/chunk_visualizer) 让你可视化不同的分割选项是如何影响你得到的片段的流程。\n", + "\n", + "> 在以下内容中,我们使用 Langchain 的 `RecursiveCharacterTextSplitter`。\n", + "💡 _为了在我们的文本分割器中测量片段长度,我们的长度函数将不是字符的数量,而是 token 化文本中的 token 数量:实际上,对于后续处理 token 的嵌入器来说,以 token 为单位测量长度更为相关,并且在经验上表现更好._\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "H4fhm55Q9jVO" + }, + "outputs": [], + "source": [ + "from langchain.docstore.document import Document as LangchainDocument\n", + "\n", + "RAW_KNOWLEDGE_BASE = [\n", + " LangchainDocument(page_content=doc[\"text\"], metadata={\"source\": doc[\"source\"]})\n", + " for doc in tqdm(ds)\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sz9Jw2_q9jVO" + }, + "outputs": [], + "source": [ + "from langchain.text_splitter import RecursiveCharacterTextSplitter\n", + "from transformers import AutoTokenizer\n", + "\n", + "\n", + "def split_documents(\n", + " chunk_size: int,\n", + " knowledge_base: List[LangchainDocument],\n", + " tokenizer_name: 
str,\n", + ") -> List[LangchainDocument]:\n", + "    \"\"\"\n", + "    Split documents into chunks of at most `chunk_size` tokens and return a list of documents.\n", + "    \"\"\"\n", + "    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(\n", + "        AutoTokenizer.from_pretrained(tokenizer_name),\n", + "        chunk_size=chunk_size,\n", + "        chunk_overlap=int(chunk_size / 10),\n", + "        add_start_index=True,\n", + "        strip_whitespace=True,\n", + "        separators=[\"\\n\\n\", \"\\n\", \".\", \" \", \"\"],\n", + "    )\n", + "\n", + "    docs_processed = []\n", + "    for doc in knowledge_base:\n", + "        docs_processed += text_splitter.split_documents([doc])\n", + "\n", + "    # Remove duplicates\n", + "    unique_texts = {}\n", + "    docs_processed_unique = []\n", + "    for doc in docs_processed:\n", + "        if doc.page_content not in unique_texts:\n", + "            unique_texts[doc.page_content] = True\n", + "            docs_processed_unique.append(doc)\n", + "\n", + "    return docs_processed_unique" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QzBYfNG79jVO" + }, + "source": [ + "### 2.2. 
检索器 - 嵌入 🗂️\n", + "\n", + "__检索器的作用类似于内部搜索引擎__:给定用户查询,它从你的知识库中返回最相关的文档。\n", + "\n", + "> 对于知识库,我们使用 Langchain 向量数据库,因为它提供了一个方便的 [FAISS](https://github.com/facebookresearch/faiss) 索引,并允许我们在整个处理过程中保留文档元数据。\n", + "\n", + "🛠️ __包含可选项:__\n", + "\n", + "- 调整分块方法:\n", + " - 片段(chunks)的大小\n", + " - 方法:在不同的分隔符上分割,使用[语义分块](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker)...\n", + "- 更改嵌入模型" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LqJlIDZR9jVO" + }, + "outputs": [], + "source": [ + "from langchain.vectorstores import FAISS\n", + "from langchain_community.embeddings import HuggingFaceEmbeddings\n", + "from langchain_community.vectorstores.utils import DistanceStrategy\n", + "import os\n", + "\n", + "\n", + "def load_embeddings(\n", + " langchain_docs: List[LangchainDocument],\n", + " chunk_size: int,\n", + " embedding_model_name: Optional[str] = \"thenlper/gte-small\",\n", + ") -> FAISS:\n", + " \"\"\"\n", + " Creates a FAISS index from the given embedding model and documents. 
Loads the index directly if it already exists.\n", + "\n", + " Args:\n", + " langchain_docs: list of documents\n", + " chunk_size: size of the chunks to split the documents into\n", + " embedding_model_name: name of the embedding model to use\n", + "\n", + " Returns:\n", + " FAISS index\n", + " \"\"\"\n", + " # load embedding_model\n", + " embedding_model = HuggingFaceEmbeddings(\n", + " model_name=embedding_model_name,\n", + " multi_process=True,\n", + " model_kwargs={\"device\": \"cuda\"},\n", + " encode_kwargs={\"normalize_embeddings\": True}, # set True to compute cosine similarity\n", + " )\n", + "\n", + " # Check if embeddings already exist on disk\n", + " index_name = f\"index_chunk:{chunk_size}_embeddings:{embedding_model_name.replace('/', '~')}\"\n", + " index_folder_path = f\"./data/indexes/{index_name}/\"\n", + " if os.path.isdir(index_folder_path):\n", + " return FAISS.load_local(\n", + " index_folder_path,\n", + " embedding_model,\n", + " distance_strategy=DistanceStrategy.COSINE,\n", + " )\n", + "\n", + " else:\n", + " print(\"Index not found, generating it...\")\n", + " docs_processed = split_documents(\n", + " chunk_size,\n", + " langchain_docs,\n", + " embedding_model_name,\n", + " )\n", + " knowledge_index = FAISS.from_documents(\n", + " docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE\n", + " )\n", + " knowledge_index.save_local(index_folder_path)\n", + " return knowledge_index" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b6y1mQJX9jVO" + }, + "source": [ + "### 2.3. 
阅读器 - LLM 💬\n", + "\n", + "在这一部分,__LLM 阅读器读取检索到的文档以形成其答案。__\n", + "\n", + "🛠️ 为了改善结果,我们尝试了以下选项:\n", + "- 切换重排序开启或关闭的状态\n", + "- 更改阅读器模型" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9PdpuWyP9jVP" + }, + "outputs": [], + "source": [ + "RAG_PROMPT_TEMPLATE = \"\"\"\n", + "<|system|>\n", + "Using the information contained in the context,\n", + "give a comprehensive answer to the question.\n", + "Respond only to the question asked, response should be concise and relevant to the question.\n", + "Provide the number of the source document when relevant.\n", + "If the answer cannot be deduced from the context, do not give an answer.\n", + "<|user|>\n", + "Context:\n", + "{context}\n", + "---\n", + "Now here is the question you need to answer.\n", + "\n", + "Question: {question}\n", + "\n", + "<|assistant|>\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9SDqenld9jVP" + }, + "outputs": [], + "source": [ + "from langchain_community.llms import HuggingFaceHub\n", + "\n", + "repo_id = \"HuggingFaceH4/zephyr-7b-beta\"\n", + "READER_MODEL_NAME = \"zephyr-7b-beta\"\n", + "\n", + "READER_LLM = HuggingFaceHub(\n", + " repo_id=repo_id,\n", + " task=\"text-generation\",\n", + " model_kwargs={\n", + " \"max_new_tokens\": 512,\n", + " \"top_k\": 30,\n", + " \"temperature\": 0.1,\n", + " \"repetition_penalty\": 1.03,\n", + " },\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QZ62CbcZ9jVP" + }, + "outputs": [], + "source": [ + "from ragatouille import RAGPretrainedModel\n", + "from langchain_core.vectorstores import VectorStore\n", + "from langchain_core.language_models.llms import LLM\n", + "\n", + "\n", + "def answer_with_rag(\n", + " question: str,\n", + " llm: LLM,\n", + " knowledge_index: VectorStore,\n", + " reranker: Optional[RAGPretrainedModel] = None,\n", + " num_retrieved_docs: int = 30,\n", + " num_docs_final: int = 7,\n", + ") -> 
Tuple[str, List[LangchainDocument]]:\n", + " \"\"\"Answer a question using RAG with the given knowledge index.\"\"\"\n", + " # Gather documents with retriever\n", + " relevant_docs = knowledge_index.similarity_search(query=question, k=num_retrieved_docs)\n", + " relevant_docs = [doc.page_content for doc in relevant_docs] # keep only the text\n", + "\n", + " # Optionally rerank results\n", + " if reranker:\n", + " relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)\n", + " relevant_docs = [doc[\"content\"] for doc in relevant_docs]\n", + "\n", + " relevant_docs = relevant_docs[:num_docs_final]\n", + "\n", + " # Build the final prompt\n", + " context = \"\\nExtracted documents:\\n\"\n", + " context += \"\".join([f\"Document {str(i)}:::\\n\" + doc for i, doc in enumerate(relevant_docs)])\n", + "\n", + " final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)\n", + "\n", + " # Redact an answer\n", + " answer = llm(final_prompt)\n", + "\n", + " return answer, relevant_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hiygbqfT9jVP" + }, + "source": [ + "# 3. 
对 RAG 系统进行基准测试\n", + "\n", + "RAG 系统和评估数据集现在准备好了。最后一步是在这个评估数据集上判断 RAG 系统的输出。\n", + "为此,__我们设置了一个裁判智能体__。 ⚖️🤖\n", + "\n", + "在[不同的 RAG 评估指标](https://docs.ragas.io/en/latest/concepts/metrics/index.html)中,我们选择只关注忠实度,因为这是衡量我们系统性能的最佳的端到端指标。\n", + "\n", + "> 我们使用 GPT4 作为评判者,因为它在实际应用中表现良好,但你也可以尝试其他模型,例如 [kaist-ai/prometheus-13b-v1.0](https://huggingface.co/kaist-ai/prometheus-13b-v1.0) 或 [BAAI/JudgeLM-33B-v1.0](https://huggingface.co/BAAI/JudgeLM-33B-v1.0)。\n", + "\n", + "💡 _在评估提示中,我们给出了每个指标的详细描述,采用 1-5 分的评分刻度,正如 [Prometheus 的提示模板](https://huggingface.co/kaist-ai/prometheus-13b-v1.0) 所做的那样:这有助于模型精确地确定其指标。如果你给评判 LLM 一个模糊的评分刻度,那么不同示例之间的输出将不够一致。_\n", + "\n", + "💡 _再次提示 LLM 在给出最终评分之前先输出其理由,这样它就有更多的 token 来帮助它正式化和详细阐述评判。_" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VrlMh_ZI9jVP" + }, + "outputs": [], + "source": [ + "def run_rag_tests(\n", + " eval_dataset: datasets.Dataset,\n", + " llm: BaseChatModel,\n", + " knowledge_index: VectorStore,\n", + " output_file: str,\n", + " reranker: Optional[RAGPretrainedModel] = None,\n", + " verbose: Optional[bool] = True,\n", + " test_settings: Optional[str] = None, # To document the test settings used\n", + "):\n", + " \"\"\"Runs RAG tests on the given dataset and saves the results to the given output file.\"\"\"\n", + " try: # load previous generations if they exist\n", + " with open(output_file, \"r\") as f:\n", + " outputs = json.load(f)\n", + " except:\n", + " outputs = []\n", + "\n", + " for example in tqdm(eval_dataset):\n", + " question = example[\"question\"]\n", + " if question in [output[\"question\"] for output in outputs]:\n", + " continue\n", + "\n", + " answer, relevant_docs = answer_with_rag(question, llm, knowledge_index, reranker=reranker)\n", + " if verbose:\n", + " print(\"=======================================================\")\n", + " print(f\"Question: {question}\")\n", + " print(f\"Answer: {answer}\")\n", + " print(f'True answer: {example[\"answer\"]}')\n", + " result = 
{\n", + " \"question\": question,\n", + " \"true_answer\": example[\"answer\"],\n", + " \"source_doc\": example[\"source_doc\"],\n", + " \"generated_answer\": answer,\n", + " \"retrieved_docs\": [doc for doc in relevant_docs],\n", + " }\n", + " if test_settings:\n", + " result[\"test_settings\"] = test_settings\n", + " outputs.append(result)\n", + "\n", + " with open(output_file, \"w\") as f:\n", + " json.dump(outputs, f)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Ae-3KWzK9jVP" + }, + "outputs": [], + "source": [ + "EVALUATION_PROMPT = \"\"\"###Task Description:\n", + "An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.\n", + "1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.\n", + "2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.\n", + "3. The output format should look as follows: \\\"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\\\"\n", + "4. Please do not generate any other opening, closing, and explanations. 
Be sure to include [RESULT] in your output.\n", + "\n", + "###The instruction to evaluate:\n", + "{instruction}\n", + "\n", + "###Response to evaluate:\n", + "{response}\n", + "\n", + "###Reference Answer (Score 5):\n", + "{reference_answer}\n", + "\n", + "###Score Rubrics:\n", + "[Is the response correct, accurate, and factual based on the reference answer?]\n", + "Score 1: The response is completely incorrect, inaccurate, and/or not factual.\n", + "Score 2: The response is mostly incorrect, inaccurate, and/or not factual.\n", + "Score 3: The response is somewhat correct, accurate, and/or factual.\n", + "Score 4: The response is mostly correct, accurate, and factual.\n", + "Score 5: The response is completely correct, accurate, and factual.\n", + "\n", + "###Feedback:\"\"\"\n", + "\n", + "from langchain.prompts.chat import (\n", + " ChatPromptTemplate,\n", + " HumanMessagePromptTemplate,\n", + ")\n", + "from langchain.schema import SystemMessage\n", + "\n", + "\n", + "evaluation_prompt_template = ChatPromptTemplate.from_messages(\n", + " [\n", + " SystemMessage(content=\"You are a fair evaluator language model.\"),\n", + " HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ia9Mvn859jVP" + }, + "outputs": [], + "source": [ + "from langchain.chat_models import ChatOpenAI\n", + "\n", + "eval_chat_model = ChatOpenAI(model=\"gpt-4-1106-preview\", temperature=0)\n", + "evaluator_name = \"GPT4\"\n", + "\n", + "\n", + "def evaluate_answers(\n", + " answer_path: str,\n", + " eval_chat_model: BaseChatModel,\n", + " evaluator_name: str,\n", + " evaluation_prompt_template: ChatPromptTemplate,\n", + ") -> None:\n", + " \"\"\"Evaluates generated answers. 
Modifies the given answer file in place for better checkpointing.\"\"\"\n", + "    answers = []\n", + "    if os.path.isfile(answer_path):  # load previous generations if they exist\n", + "        answers = json.load(open(answer_path, \"r\"))\n", + "\n", + "    for experiment in tqdm(answers):\n", + "        if f\"eval_score_{evaluator_name}\" in experiment:\n", + "            continue\n", + "\n", + "        eval_prompt = evaluation_prompt_template.format_messages(\n", + "            instruction=experiment[\"question\"],\n", + "            response=experiment[\"generated_answer\"],\n", + "            reference_answer=experiment[\"true_answer\"],\n", + "        )\n", + "        eval_result = eval_chat_model.invoke(eval_prompt)\n", + "        feedback, score = [item.strip() for item in eval_result.content.split(\"[RESULT]\")]\n", + "        experiment[f\"eval_score_{evaluator_name}\"] = score\n", + "        experiment[f\"eval_feedback_{evaluator_name}\"] = feedback\n", + "\n", + "        with open(answer_path, \"w\") as f:\n", + "            json.dump(answers, f)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EXH-szLe9jVP" + }, + "source": [ + "🚀 让我们运行测试并评估答案!👇" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jW2nnvUT9jVQ" + }, + "outputs": [], + "source": [ + "if not os.path.exists(\"./output\"):\n", + "    os.mkdir(\"./output\")\n", + "\n", + "for chunk_size in [200]:  # Add other chunk sizes (in tokens) as needed\n", + "    for embeddings in [\"thenlper/gte-small\"]:  # Add other embeddings as needed\n", + "        for rerank in [True, False]:\n", + "            settings_name = f\"chunk:{chunk_size}_embeddings:{embeddings.replace('/', '~')}_rerank:{rerank}_reader-model:{READER_MODEL_NAME}\"\n", + "            output_file_name = f\"./output/rag_{settings_name}.json\"\n", + "\n", + "            print(f\"Running evaluation for {settings_name}:\")\n", + "\n", + "            print(\"Loading knowledge base embeddings...\")\n", + "            knowledge_index = load_embeddings(\n", + "                RAW_KNOWLEDGE_BASE,\n", + "                chunk_size=chunk_size,\n", + "                embedding_model_name=embeddings,\n", + "            )\n", + "\n",
+ "            print(\"Running RAG...\")\n", + "            reranker = (\n", + "                RAGPretrainedModel.from_pretrained(\"colbert-ir/colbertv2.0\") if rerank else None\n", + "            )\n", + "            run_rag_tests(\n", + "                eval_dataset=eval_dataset,\n", + "                llm=READER_LLM,\n", + "                knowledge_index=knowledge_index,\n", + "                output_file=output_file_name,\n", + "                reranker=reranker,\n", + "                verbose=False,\n", + "                test_settings=settings_name,\n", + "            )\n", + "\n", + "            print(\"Running evaluation...\")\n", + "            evaluate_answers(\n", + "                output_file_name,\n", + "                eval_chat_model,\n", + "                evaluator_name,\n", + "                evaluation_prompt_template,\n", + "            )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tytXV5-h9jVT" + }, + "source": [ + "### 检查结果" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "D4YDSfmr9jVT" + }, + "outputs": [], + "source": [ + "import glob\n", + "\n", + "outputs = []\n", + "for file in glob.glob(\"./output/*.json\"):\n", + "    output = pd.DataFrame(json.load(open(file, \"r\")))\n", + "    output[\"settings\"] = file\n", + "    outputs.append(output)\n", + "result = pd.concat(outputs)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CdkXMNvS9jVT" + }, + "outputs": [], + "source": [ + "result[\"eval_score_GPT4\"] = result[\"eval_score_GPT4\"].apply(\n", + "    lambda x: int(x) if isinstance(x, str) else 1\n", + ")\n", + "result[\"eval_score_GPT4\"] = (result[\"eval_score_GPT4\"] - 1) / 4" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lgxBpid29jVT", + "outputId": "9a3bcf32-4b0c-4df1-c76c-3ebbca82929d" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "settings\n", + "./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:False_reader-model:zephyr-7b-beta.json 0.884328\n", + "./output/rag_chunk:200_embeddings:BAAI~bge-base-en-v1.5_rerank:False_reader-model:zephyr-7b-beta.json 0.906716\n", +
"./output/rag_chunk:200_embeddings:BAAI~bge-base-en-v1.5_rerank:True_reader-model:zephyr-7b-beta.json 0.906716\n", + "./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:mixtral.json 0.906716\n", + "./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:zephyr-7b-beta.json 0.921642\n", + "./output/rag_chunk:200_embeddings:thenlper~gte-small_rerank:True_reader-model:mixtral0.json 0.947761\n", + "Name: eval_score_GPT4, dtype: float64" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "average_scores = result.groupby(\"settings\")[\"eval_score_GPT4\"].mean()\n", + "average_scores.sort_values()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pSPH9DYI9jVT" + }, + "source": [ + "## 结果示例\n", + "\n", + "让我们加载通过调整这个 notebook 中可用的不同选项所获得的结果。关于这些选项为何有效或无效的更多细节,请参阅 [高级 RAG](advanced_rag) 的 notebook。\n", + "\n", + "正如在下面的图表中所看到的,一些调整并没有带来任何改善,而有些则带来了巨大的性能提升。\n", + "\n", + "➡️ ___所以没有单一的好方法:在调整你的 RAG 系统时,应该尝试几种不同的方向。___\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RVOxatv99jVT" + }, + "outputs": [], + "source": [ + "import plotly.express as px\n", + "\n", + "scores = datasets.load_dataset(\"m-ric/rag_scores_cookbook\", split=\"train\")\n", + "scores = pd.Series(scores[\"score\"], index=scores[\"settings\"])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vqK0Dg2Q9jVT" + }, + "outputs": [], + "source": [ + "fig = px.bar(\n", + "    scores,\n", + "    color=scores,\n", + "    labels={\n", + "        \"value\": \"Accuracy\",\n", + "        \"settings\": \"Configuration\",\n", + "    },\n", + "    color_continuous_scale=\"bluered\",\n", + ")\n", + "fig.update_layout(\n", + "    width=1000,\n", + "    height=600,\n", + "    barmode=\"group\",\n", + "    yaxis_range=[0, 100],\n", + "    title=\"Accuracy of different RAG configurations\",\n", + "    xaxis_title=\"RAG settings\",\n", + "    font=dict(size=15),\n", + ")\n", 
+ "fig.layout.yaxis.ticksuffix = \"%\"\n", + "fig.update_coloraxes(showscale=False)\n", + "fig.update_traces(texttemplate=\"%{y:.1f}\", textposition=\"outside\")\n", + "fig.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dPUOMWGk9jVT" + }, + "source": [ + "\n", + "\n", + "如上图所示,这些调整对性能的影响各不相同。尤其是调整片段大小,既简单又非常有影响力。\n", + "\n", + "但这只是针对我们的情况,你的结果可能大不相同。现在你已经有了一个可靠的评估流水线,可以开始探索其他选项了!🗺️" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "ml2", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/notebooks/zh-CN/rag_zephyr_langchain.ipynb b/notebooks/zh-CN/rag_zephyr_langchain.ipynb new file mode 100644 index 00000000..08dbfdc3 --- /dev/null +++ b/notebooks/zh-CN/rag_zephyr_langchain.ipynb @@ -0,0 +1,521 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "Kih21u1tyr-I" + }, + "source": [ + "# 用 Hugging Face Zephyr 和 LangChain 针对 GitHub issues 构建简单的 RAG\n", + "\n", + "_作者: [Maria Khalusova](https://github.com/MKhalusova)_\n", + "\n", + "本 notebook 展示了如何使用 [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) 模型和 LangChain 快速构建一个针对项目 GitHub issues 的简单 RAG。\n", + "\n", + "\n", + "\n", + "**什么是 RAG**\n", + "\n", + "RAG 是一种很流行的方法,用来解决强大的 LLM 不了解某些特定内容的问题:这些内容要么不在其训练数据中,要么即使模型见过,仍然会产生幻觉。这样的特定内容可能是专有的、敏感的,或者像本例一样,是最近才产生且经常更新的。\n", + "\n", + "如果你的数据是静态的,不需要定期更新,那么你可以考虑微调一个大模型。但在很多情况下,微调的成本很高,而且反复微调(例如为了应对数据漂移)可能会导致“模型偏移”,即模型的行为以非预期的方式发生变化。\n", + "\n", + "**RAG(检索增强生成)** 不需要对模型进行微调。相反,RAG 把检索到的额外相关内容提供给 LLM,从而获得更好的回答。\n", + "\n", + "这里是一个简单说明:\n", + "\n", + "![RAG 
diagram](https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/rag-diagram.png)\n", + "\n", + "* 额外的数据会通过独立的嵌入模型转化为嵌入向量,这些向量会储存在向量数据库里。嵌入模型通常都比较小,因此定期更新嵌入向量相比于微调模型会更快、更便宜,也更简单。\n", + "\n", + "* 与此同时,由于不需要微调,你可以自由地在有更强的 LLM 可用时替换模型,或者在需要更快的推理时切换到更小的蒸馏模型。\n", + "\n", + "让我们用开源的 LLM、嵌入模型和 LangChain 快速构建一个针对项目 GitHub issues 的简单 RAG。\n", + "\n", + "\n", + "首先安装相关依赖:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lC9frDOlyi38" + }, + "outputs": [], + "source": [ + "!pip install -q torch transformers accelerate bitsandbytes sentence-transformers faiss-gpu" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "-aYENQwZ-p_c" + }, + "outputs": [], + "source": [ + "# If running in Google Colab, you may need to run this cell to make sure you're using UTF-8 locale to install LangChain\n", + "import locale\n", + "locale.getpreferredencoding = lambda: \"UTF-8\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "W5HhMZ2c-NfU" + }, + "outputs": [], + "source": [ + "!pip install -q langchain" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "R8po01vMWzXL" + }, + "source": [ + "## 准备数据\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3cCmQywC04x6" + }, + "source": [ + "在这个例子中,我们会从 [PEFT 库的仓库](https://github.com/huggingface/peft) 加载所有的 issues(包括现在开放的和已经关闭的)。\n", + "\n", + "首先,你需要获取一个 [GitHub 个人访问令牌(personal access token)](https://github.com/settings/tokens?type=beta) 来访问 GitHub API。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8MoD7NbsNjlM" + }, + "outputs": [], + "source": [ + "from getpass import getpass\n", + "ACCESS_TOKEN = getpass(\"YOUR_GITHUB_PERSONAL_TOKEN\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fccecm3a10N6" + }, + "source": [ + "下一步,我们将会加载 [huggingface/peft](https://github.com/huggingface/peft) 仓库中所有的 issues:\n", + "- 默认情况下, PR 也被认定为 
issues,这里我们要设置 `include_prs=False` 来排除 PR。\n", + "- 设置 `state = \"all\"` 意味着我们会把开放的和已经关闭的 issues 都加载进来。" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "8EKMit4WNDY8" + }, + "outputs": [], + "source": [ + "from langchain.document_loaders import GitHubIssuesLoader\n", + "\n", + "loader = GitHubIssuesLoader(\n", + " repo=\"huggingface/peft\",\n", + " access_token=ACCESS_TOKEN,\n", + " include_prs=False,\n", + " state=\"all\"\n", + ")\n", + "\n", + "docs = loader.load()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CChTrY-k2qO5" + }, + "source": [ + "单个 GitHub issue 的内容可能会超过嵌入模型可以作为输入处理的长度。如果我们想要嵌入所有可用的内容,我们需要把文档分割成适当大小的块。\n", + "\n", + "最常见、最直接的分块方法是定义一个固定的块大小,并决定块之间是否保留重叠。在块之间保留一些重叠,可以让我们保留块与块之间的部分语义上下文。\n", + "\n", + "其他方法通常更复杂,会考虑到文档的结构和上下文。例如,人们可能希望根据句子或段落来分割文档。然而,固定大小的分块在大多数常见情况下都表现得很好,所以我们将在这里采用这种方法。\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OmsXOf59Pmm-" + }, + "outputs": [], + "source": [ + "from langchain.text_splitter import CharacterTextSplitter\n", + "\n", + "splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=30)\n", + "\n", + "chunked_docs = splitter.split_documents(docs)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DAt_zPVlXOn7" + }, + "source": [ + "## 创建嵌入和检索器" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-mvat6JQl4yp" + }, + "source": [ + "现在所有的文档都切分成了合适的大小,我们可以用它们的嵌入创建一个数据库了。\n", + "\n", + "为了创建文档块嵌入,我们将会使用 `HuggingFaceEmbeddings` 和 [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) 嵌入模型。在 Hub 上有许多其他的嵌入模型可用,你也可以查看 [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) 关注表现最好的模型。\n", + "\n", + "为了创建向量数据库,我们将会使用 `FAISS` 库。这个库提供高效的相似度搜索和稠密向量的聚类,正是我们需要的。FAISS 目前是大规模数据集上最近邻(NN)搜索最常用的库之一。\n", + "\n", + "我们通过 LangChain 的 API 来获取嵌入模型和 FAISS 向量数据库。" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": 
"ixmCdRzBQ5gu" 
+ }, + "outputs": [], + "source": [ + "from langchain.vectorstores import FAISS\n", + "from langchain.embeddings import HuggingFaceEmbeddings\n", + "\n", + "db = FAISS.from_documents(chunked_docs,\n", + " HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5'))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2iCgEPi0nnN6" + }, + "source": [ + "我们需要一种方式,根据非结构化的查询返回(检索)相应的文档。为此,我们会以 `db` 为基础,使用 `as_retriever` 方法:\n", + "- `search_type=\"similarity\"` 意味着我们会执行查询和文档之间的相似度搜索\n", + "- `search_kwargs={'k': 4}` 表示让检索器返回最相关的 4 个结果\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "mBTreCQ9noHK" + }, + "outputs": [], + "source": [ + "retriever = db.as_retriever(\n", + " search_type=\"similarity\",\n", + " search_kwargs={'k': 4}\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WgEhlISJpTgj" + }, + "source": [ + "向量数据库和检索器现在都设置好了,接下来我们需要设置链中的下一个组件:模型。" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tzQxx0HkXVFU" + }, + "source": [ + "## 加载量化模型" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9jy1cC65p_GD" + }, + "source": [ + "针对本例,我们选择 [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta),这是一个小而强大的模型。\n", + "\n", + "由于每周都有许多新模型发布,你可能会想把这个模型替换成最新、最好的模型。跟踪开源 LLM 的最佳方式是查看 [Open-source LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)。\n", + "\n", + "为了推理更快,我们将加载模型的量化版本:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "L-ggaa763VRo" + }, + "outputs": [], + "source": [ + "import torch\n", + "from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n", + "\n", + "model_name = 'HuggingFaceH4/zephyr-7b-beta'\n", + "\n", + "bnb_config = BitsAndBytesConfig(\n", + " load_in_4bit=True,\n", + " bnb_4bit_use_double_quant=True,\n", + " bnb_4bit_quant_type=\"nf4\",\n", + " bnb_4bit_compute_dtype=torch.bfloat16\n", + ")\n", + "\n", + "model = 
AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)\n", + "tokenizer = AutoTokenizer.from_pretrained(model_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hVNRJALyXYHG" + }, + "source": [ + "## 设置 LLM 链" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RUUNneJ1smhl" + }, + "source": [ + "最后,我们已经具备了设置 LLM 链所需的全部组件。\n", + "\n", + "首先,使用加载好的模型和它的 tokenizer 创建一个文本生成流水线(pipeline)。\n", + "\n", + "下一步,创建一个提示模板。模板应该遵循模型的格式,所以如果你替换了模型检查点,请确保使用合适的格式。\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "id": "cR0k1cRWz8Pm" + }, + "outputs": [], + "source": [ + "from langchain.llms import HuggingFacePipeline\n", + "from langchain.prompts import PromptTemplate\n", + "from transformers import pipeline\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "\n", + "text_generation_pipeline = pipeline(\n", + " model=model,\n", + " tokenizer=tokenizer,\n", + " task=\"text-generation\",\n", + " temperature=0.2,\n", + " do_sample=True,\n", + " repetition_penalty=1.1,\n", + " return_full_text=True,\n", + " max_new_tokens=400,\n", + ")\n", + "\n", + "llm = HuggingFacePipeline(pipeline=text_generation_pipeline)\n", + "\n", + "prompt_template = \"\"\"\n", + "<|system|>\n", + "Answer the question based on your knowledge. 
Use the following context to help:\n", + "\n", + "{context}\n", + "\n", + "</s>\n", + "<|user|>\n", + "{question}\n", + "</s>\n", + "<|assistant|>\n", + "\n", + " \"\"\"\n", + "\n", + "prompt = PromptTemplate(\n", + " input_variables=[\"context\", \"question\"],\n", + " template=prompt_template,\n", + ")\n", + "\n", + "llm_chain = prompt | llm | StrOutputParser()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l19UKq5HXfSp" + }, + "source": [ + "注意:你也可以使用 `tokenizer.apply_chat_template` 把消息列表(每条消息是一个字典:`{'role': 'user', 'content': '(...)'}`)转换为符合聊天格式的字符串。\n", + "\n", + "最后,我们需要将 LLM 链与检索器(retriever)结合起来创建一个 RAG 链。我们将原始问题以及检索到的文档上下文传递到最后的生成步骤:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "_rI3YNp9Xl4s" + }, + "outputs": [], + "source": [ + "from langchain_core.runnables import RunnablePassthrough\n", + "\n", + "retriever = db.as_retriever()\n", + "\n", + "rag_chain = (\n", + " {\"context\": retriever, \"question\": RunnablePassthrough()}\n", + " | llm_chain\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UsCOhfDDXpaS" + }, + "source": [ + "## 比较结果\n", + "\n", + "让我们看看在回答特定库的问题时,RAG 会带来什么不同。" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "id": "W7F07fQLXusU" + }, + "outputs": [], + "source": [ + "question = \"How do you combine multiple adapters?\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KC0rJYU1x1ir" + }, + "source": [ + "首先,让我们看看模型在不加入检索内容的情况下,仅凭自身能给出什么答案:" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 125 + }, + "id": "GYh-HG1l0De5", + "outputId": "277d8e89-ce9b-4e04-c11b-639ad2645759" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "\" To combine multiple adapters, you need to ensure that they are compatible with each other and the devices you 
want to 
connect. Here's how you can do it:\\n\\n1. Identify the adapters you need: Determine which adapters you require to connect the devices you want to use together. For example, if you want to connect a USB-C device to an HDMI monitor, you may need a USB-C to HDMI adapter and a USB-C to USB-A adapter (if your computer only has USB-A ports).\\n\\n2. Connect the first adapter: Plug in the first adapter into the device you want to connect. For instance, if you're connecting a USB-C laptop to an HDMI monitor, plug the USB-C to HDMI adapter into the laptop's USB-C port.\\n\\n3. Connect the second adapter: Next, connect the second adapter to the first one. In this case, connect the USB-C to USB-A adapter to the USB-C port of the USB-C to HDMI adapter.\\n\\n4. Connect the final device: Finally, connect the device you want to use to the second adapter. For example, connect the HDMI cable from the monitor to the HDMI port on the USB-C to HDMI adapter.\\n\\n5. Test the connection: Turn on both devices and check whether everything is working correctly. If necessary, adjust the settings on your devices to ensure optimal performance.\\n\\nBy combining multiple adapters, you can connect a variety of devices together, even if they don't have the same type of connector. 
Just be sure to choose adapters that are compatible with all the devices you want to connect and test the connection thoroughly before relying on it for critical tasks.\"" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "llm_chain.invoke({\"context\":\"\", \"question\": question})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i-TIWr3wx9w8" + }, + "source": [ + "可以看到,模型将这个问题解释为关于物理电脑适配器的问题,而在 PEFT 的背景下,“适配器”指的是 LoRA 适配器。\n", + "让我们看看添加 GitHub issues 的上下文是否有助于模型给出更相关的答案:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 125 + }, + "id": "FZpNA3o10H10", + "outputId": "31f9aed3-3dd7-4ff8-d1a8-866794fefe80" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + }, + "text/plain": [ + "\" Based on the provided context, it seems that combining multiple adapters is still an open question in the community. Here are some possibilities:\\n\\n 1. Save the output from the base model and pass it to each adapter separately, as described in the first context snippet. This allows you to run multiple adapters simultaneously and reuse the output from the base model. However, this approach requires loading and running each adapter separately.\\n\\n 2. Export everything into a single PyTorch model, as suggested in the second context snippet. This would involve saving all the adapters and their weights into a single model, potentially making it larger and more complex. The advantage of this approach is that it would allow you to run all the adapters simultaneously without having to load and run them separately.\\n\\n 3. Merge multiple Lora adapters, as mentioned in the third context snippet. This involves adding multiple distinct, independent behaviors to a base model by merging multiple Lora adapters. 
It's not clear from the context how this would be done, but it suggests that there might be a recommended way of doing it.\\n\\n 4. Combine adapters through a specific architecture, as proposed in the fourth context snippet. This involves merging multiple adapters into a single architecture, potentially creating a more complex model with multiple behaviors. Again, it's not clear from the context how this would be done.\\n\\n Overall, combining multiple adapters is still an active area of research, and there doesn't seem to be a widely accepted solution yet. If you're interested in exploring this further, it might be worth reaching out to the Hugging Face community or checking out their documentation for more information.\"" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "rag_chain.invoke(question)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hZQedZKSyrwO" + }, + "source": [ + "我们可以看到,加入检索的信息后,同一个模型能够对于特定库的问题给出更准确、更相关的答案。\n", + "\n", + "值得注意的是,将多个适配器结合用于推理的功能已经被添加到库中,人们可以在文档中找到这些信息,因此在下一个迭代的RAG中,包含文档嵌入可能是有价值的。" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}