From 230f0da88dd7c5b7ceeae261a1178b20b37fcb6a Mon Sep 17 00:00:00 2001
From: jacoblee93 <jacoblee93@gmail.com>
Date: Sat, 17 Feb 2024 08:47:28 -0800
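Subject: [PATCH 1/5] Update simple RAG use-case to support streaming

Replace LLMChain with an LCEL pipeline (prompt | llm | StrOutputParser())
so that the chain output is a plain string that can be streamed.

Most of the diff is notebook JSON re-serialization: keys are now sorted
within each cell, the top-level metadata moves to the end of the file,
and the stored cell outputs are cleared.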

---
 notebooks/en/rag_zephyr_langchain.ipynb | 367 +++++++++++-------------
 1 file changed, 174 insertions(+), 193 deletions(-)

diff --git a/notebooks/en/rag_zephyr_langchain.ipynb b/notebooks/en/rag_zephyr_langchain.ipynb
index f07c9ea7..0864976a 100644
--- a/notebooks/en/rag_zephyr_langchain.ipynb
+++ b/notebooks/en/rag_zephyr_langchain.ipynb
@@ -1,23 +1,10 @@
 {
-  "nbformat": 4,
-  "nbformat_minor": 0,
-  "metadata": {
-    "colab": {
-      "provenance": [],
-      "gpuType": "T4"
-    },
-    "kernelspec": {
-      "name": "python3",
-      "display_name": "Python 3"
-    },
-    "language_info": {
-      "name": "python"
-    },
-    "accelerator": "GPU"
-  },
   "cells": [
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "Kih21u1tyr-I"
+      },
       "source": [
         "# Simple RAG for GitHub issues using Hugging Face Zephyr and LangChain\n",
         "\n",
@@ -43,10 +30,7 @@
         "Let's illustrate building a RAG using an open-source LLM, embeddings model, and LandChain.\n",
         "\n",
         "First, install the required dependencies:"
-      ],
-      "metadata": {
-        "id": "Kih21u1tyr-I"
-      }
+      ]
     },
     {
       "cell_type": "code",
@@ -61,73 +45,78 @@
     },
     {
       "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-aYENQwZ-p_c"
+      },
+      "outputs": [],
       "source": [
         "# If running in Google Colab, you may need to run this cell to make sure you're using UTF-8 locale to install LangChain\n",
         "import locale\n",
         "locale.getpreferredencoding = lambda: \"UTF-8\""
-      ],
-      "metadata": {
-        "id": "-aYENQwZ-p_c"
-      },
-      "execution_count": null,
-      "outputs": []
+      ]
     },
     {
       "cell_type": "code",
-      "source": [
-        "!pip install -q langchain"
-      ],
+      "execution_count": null,
       "metadata": {
         "id": "W5HhMZ2c-NfU"
       },
-      "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "source": [
+        "!pip install -q langchain"
+      ]
     },
     {
       "cell_type": "markdown",
-      "source": [
-        "## Prepare the data\n"
-      ],
       "metadata": {
         "id": "R8po01vMWzXL"
-      }
+      },
+      "source": [
+        "## Prepare the data\n"
+      ]
     },
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "3cCmQywC04x6"
+      },
       "source": [
         "In this example, we'll load all of the issues (both open and closed) from [PEFT library's repo](https://github.com/huggingface/peft).\n",
         "\n",
         "First, you need to acquire a [GitHub personal access token](https://github.com/settings/tokens?type=beta) to access the GitHub API."
-      ],
-      "metadata": {
-        "id": "3cCmQywC04x6"
-      }
+      ]
     },
     {
       "cell_type": "code",
-      "source": [
-        "from getpass import getpass\n",
-        "ACCESS_TOKEN = getpass(\"YOUR_GITHUB_PERSONAL_TOKEN\")"
-      ],
+      "execution_count": null,
       "metadata": {
         "id": "8MoD7NbsNjlM"
       },
-      "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "source": [
+        "from getpass import getpass\n",
+        "ACCESS_TOKEN = getpass(\"YOUR_GITHUB_PERSONAL_TOKEN\")"
+      ]
     },
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "fccecm3a10N6"
+      },
       "source": [
         "Next, we'll load all of the issues in the [huggingface/peft](https://github.com/huggingface/peft) repo:\n",
         "- By default, pull requests are considered issues as well, here we chose to exclude them from data with by setting `include_prs=False`\n",
         "- Setting `state = \"all\"` means we will load both open and closed issues."
-      ],
-      "metadata": {
-        "id": "fccecm3a10N6"
-      }
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "8EKMit4WNDY8"
+      },
+      "outputs": [],
       "source": [
         "from langchain.document_loaders import GitHubIssuesLoader\n",
         "\n",
@@ -139,15 +128,13 @@
         ")\n",
         "\n",
         "docs = loader.load()"
-      ],
-      "metadata": {
-        "id": "8EKMit4WNDY8"
-      },
-      "execution_count": null,
-      "outputs": []
+      ]
     },
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "CChTrY-k2qO5"
+      },
       "source": [
         "The content of individual GitHub issues may be longer than what an embedding model can take as input. If we want to embed all of the available content, we need to chunk the documents into appropriately sized pieces.\n",
         "\n",
@@ -156,37 +143,37 @@
         "Other approaches are typically more involved and take into account the documents' structure and context. For example, one may want to split a document based on sentences or paragraphs, or create chunks based on the\n",
         "\n",
         "The fixed-size chunking, however, works well for most common cases, so that is what we'll do here."
-      ],
-      "metadata": {
-        "id": "CChTrY-k2qO5"
-      }
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "OmsXOf59Pmm-"
+      },
+      "outputs": [],
       "source": [
         "from langchain.text_splitter import CharacterTextSplitter\n",
         "\n",
         "splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=30)\n",
         "\n",
         "chunked_docs = splitter.split_documents(docs)"
-      ],
-      "metadata": {
-        "id": "OmsXOf59Pmm-"
-      },
-      "execution_count": null,
-      "outputs": []
+      ]
     },
     {
       "cell_type": "markdown",
-      "source": [
-        "## Create the embeddings + retriever"
-      ],
       "metadata": {
         "id": "DAt_zPVlXOn7"
-      }
+      },
+      "source": [
+        "## Create the embeddings + retriever"
+      ]
     },
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "-mvat6JQl4yp"
+      },
       "source": [
         "Now that the docs are all of the appropriate size, we can create a database with their embeddings.\n",
         "\n",
@@ -196,84 +183,86 @@
         "To create the vector database, we'll use `FAISS`, a library developed by Facebook AI. This library offers efficient similarity search and clustering of dense vectors, which is what we need here. FAISS is currently one of the most used libraries for NN search in massive datasets.\n",
         "\n",
         "We'll access both the embeddings model and FAISS via LangChain API."
-      ],
-      "metadata": {
-        "id": "-mvat6JQl4yp"
-      }
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "ixmCdRzBQ5gu"
+      },
+      "outputs": [],
       "source": [
         "from langchain.vectorstores import FAISS\n",
         "from langchain.embeddings import HuggingFaceEmbeddings\n",
         "\n",
         "db = FAISS.from_documents(chunked_docs,\n",
         "                          HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5'))"
-      ],
-      "metadata": {
-        "id": "ixmCdRzBQ5gu"
-      },
-      "execution_count": null,
-      "outputs": []
+      ]
     },
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "2iCgEPi0nnN6"
+      },
       "source": [
         "We need a way to return(retrieve) the documents given an unstructured query. For that, we'll use the `as_retriever` method using the `db` as a backbone:\n",
         "- `search_type=\"similarity\"` means we want to perform similarity search between the query and documents\n",
         "- `search_kwargs={'k': 4}` instructs the retriever to return top 4 results.\n"
-      ],
-      "metadata": {
-        "id": "2iCgEPi0nnN6"
-      }
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "mBTreCQ9noHK"
+      },
+      "outputs": [],
       "source": [
         "retriever = db.as_retriever(\n",
         "    search_type=\"similarity\",\n",
         "    search_kwargs={'k': 4}\n",
         ")"
-      ],
-      "metadata": {
-        "id": "mBTreCQ9noHK"
-      },
-      "execution_count": null,
-      "outputs": []
+      ]
     },
     {
       "cell_type": "markdown",
-      "source": [
-        "The vector database and retriever are now set up, next we need to set up the next piece of the chain - the model."
-      ],
       "metadata": {
         "id": "WgEhlISJpTgj"
-      }
+      },
+      "source": [
+        "The vector database and retriever are now set up, next we need to set up the next piece of the chain - the model."
+      ]
     },
     {
       "cell_type": "markdown",
-      "source": [
-        "## Load quantized model"
-      ],
       "metadata": {
         "id": "tzQxx0HkXVFU"
-      }
+      },
+      "source": [
+        "## Load quantized model"
+      ]
     },
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "9jy1cC65p_GD"
+      },
       "source": [
         "For this example, we chose [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a small but powerful model.\n",
         "\n",
         "With many models being released every week, you may want to substitute this model to the latest and greatest. The best way to keep track of open source LLMs is to check the [Open-source LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).\n",
         "\n",
         "To make inference faster, we will load the quantized version of the model:"
-      ],
-      "metadata": {
-        "id": "9jy1cC65p_GD"
-      }
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "L-ggaa763VRo"
+      },
+      "outputs": [],
       "source": [
         "import torch\n",
         "from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n",
@@ -289,42 +278,42 @@
         "\n",
         "model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)\n",
         "tokenizer = AutoTokenizer.from_pretrained(model_name)"
-      ],
-      "metadata": {
-        "id": "L-ggaa763VRo"
-      },
-      "execution_count": null,
-      "outputs": []
+      ]
     },
     {
       "cell_type": "markdown",
-      "source": [
-        "## Setup the LLM chain"
-      ],
       "metadata": {
         "id": "hVNRJALyXYHG"
-      }
+      },
+      "source": [
+        "## Setup the LLM chain"
+      ]
     },
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "RUUNneJ1smhl"
+      },
       "source": [
         "Finally, we have all the pieces we need to set up the LLM chain.\n",
         "\n",
         "First, create a text_generation pipeline using the loaded model and its tokenizer.\n",
         "\n",
         "Next, create a prompt template - this should follow the format of the model, so if you substitute the model checkpoint, make sure to use the appropriate formatting."
-      ],
-      "metadata": {
-        "id": "RUUNneJ1smhl"
-      }
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "cR0k1cRWz8Pm"
+      },
+      "outputs": [],
       "source": [
         "from langchain.llms import HuggingFacePipeline\n",
         "from langchain.prompts import PromptTemplate\n",
         "from transformers import pipeline\n",
-        "from langchain.chains import LLMChain\n",
+        "from langchain_core.output_parsers import StrOutputParser\n",
         "\n",
         "text_generation_pipeline = pipeline(\n",
         "    model=model,\n",
@@ -357,28 +346,28 @@
         "    template=prompt_template,\n",
         ")\n",
         "\n",
-        "llm_chain = LLMChain(llm=llm, prompt=prompt)"
-      ],
-      "metadata": {
-        "id": "cR0k1cRWz8Pm"
-      },
-      "execution_count": null,
-      "outputs": []
+        "llm_chain = prompt | llm | StrOutputParser()"
+      ]
     },
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "l19UKq5HXfSp"
+      },
       "source": [
         "Note: _You can also use `tokenizer.apply_chat_template` to convert a list of messages (as dicts: `{'role': 'user', 'content': '(...)'}`) into a string with the appropriate chat format._\n",
         "\n",
         "\n",
-        "Finally, we need to combine the `llm_chain` with the retriever to create the RAG:"
-      ],
-      "metadata": {
-        "id": "l19UKq5HXfSp"
-      }
+        "Finally, we need to combine the `llm_chain` with the retriever to create a RAG chain. We pass the original question through to the final generation step, as well as the retrieved context docs:"
+      ]
     },
     {
       "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "_rI3YNp9Xl4s"
+      },
+      "outputs": [],
       "source": [
         "from langchain.schema.runnable import RunnablePassthrough\n",
         "\n",
@@ -388,49 +377,42 @@
         " {\"context\": retriever, \"question\": RunnablePassthrough()}\n",
         "    | llm_chain\n",
         ")\n"
-      ],
-      "metadata": {
-        "id": "_rI3YNp9Xl4s"
-      },
-      "execution_count": null,
-      "outputs": []
+      ]
     },
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "UsCOhfDDXpaS"
+      },
       "source": [
         "## Compare the results\n",
         "\n",
         "Let's see the difference RAG makes in generating answers to the library-specific questions."
-      ],
-      "metadata": {
-        "id": "UsCOhfDDXpaS"
-      }
+      ]
     },
     {
       "cell_type": "code",
-      "source": [
-        "question = \"How do you combine multiple adapters?\""
-      ],
+      "execution_count": null,
       "metadata": {
         "id": "W7F07fQLXusU"
       },
-      "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "source": [
+        "question = \"How do you combine multiple adapters?\""
+      ]
     },
     {
       "cell_type": "markdown",
-      "source": [
-        "First, let's see what kind of answer we can get with just the model itself, no context added:"
-      ],
       "metadata": {
         "id": "KC0rJYU1x1ir"
-      }
+      },
+      "source": [
+        "First, let's see what kind of answer we can get with just the model itself, no context added:"
+      ]
     },
     {
       "cell_type": "code",
-      "source": [
-        "llm_chain.invoke({\"context\":\"\", \"question\": question})['text']\n"
-      ],
+      "execution_count": null,
       "metadata": {
         "colab": {
           "base_uri": "https://localhost:8080/",
@@ -439,38 +421,24 @@
         "id": "GYh-HG1l0De5",
         "outputId": "549e0bdd-b186-4d16-e7fa-90b3865d6f83"
       },
-      "execution_count": null,
-      "outputs": [
-        {
-          "output_type": "execute_result",
-          "data": {
-            "text/plain": [
-              "\" To combine multiple adapters, you need to ensure that they are compatible with each other and the devices you want to connect. Here's how you can do it:\\n\\n1. Identify the adapters you need: Determine which adapters you require to connect the devices you want to use. For example, if you want to connect a USB-C device to an HDMI monitor, you may need a USB-C to HDMI adapter and a USB-C to USB-A adapter (if your computer doesn't have a USB-C port).\\n\\n2. Connect the first adapter: Plug in the first adapter into the device you want to connect. For instance, if you're connecting a laptop to a monitor, plug the USB-C to HDMI adapter into your laptop's USB-C port.\\n\\n3. Connect the second adapter: If necessary, connect the second adapter to the first one. In our example, you would connect the USB-C to USB-A adapter to the USB-C port on the USB-C to HDMI adapter.\\n\\n4. Connect the final device: Finally, connect the device you want to use to the second adapter. In our case, you would connect the HDMI cable from the monitor to the HDMI port on the USB-C to HDMI adapter.\\n\\n5. Test the connection: Turn on both devices and check whether everything is working correctly. You should now be able to use the connected device as normal.\\n\\nRemember to always check compatibility before purchasing any adapters to ensure they will work together and with your specific devices.\""
-            ],
-            "application/vnd.google.colaboratory.intrinsic+json": {
-              "type": "string"
-            }
-          },
-          "metadata": {},
-          "execution_count": 13
-        }
+      "outputs": [],
+      "source": [
+        "llm_chain.invoke({\"context\":\"\", \"question\": question})['text']\n"
       ]
     },
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "i-TIWr3wx9w8"
+      },
       "source": [
         "As you can see, the model interpreted the question as one about physical computer adapters, while in the context of PEFT, \"adapters\" refer to LoRA adapters.\n",
         "Let's see if adding context from GitHub issues helps the model give a more relevant answer:"
-      ],
-      "metadata": {
-        "id": "i-TIWr3wx9w8"
-      }
+      ]
     },
     {
       "cell_type": "code",
-      "source": [
-        "rag_chain.invoke(question)['text']"
-      ],
+      "execution_count": null,
       "metadata": {
         "colab": {
           "base_uri": "https://localhost:8080/",
@@ -479,33 +447,46 @@
         "id": "FZpNA3o10H10",
         "outputId": "9ddc0eef-0503-445d-8f70-26be3ec19de6"
       },
-      "execution_count": null,
-      "outputs": [
-        {
-          "output_type": "execute_result",
-          "data": {
-            "text/plain": [
-              "\" Based on the provided context, it seems like combining multiple adapters is still being explored and discussed within the community. Here are some insights from the issues raised:\\n\\n  1. In issue #1040, AlbertoZerbinati asks about merging multiple adapters, suggesting that it might be useful to add multiple distinct behaviors to a base model by merging multiple Lora adapters. However, no clear recommendation was given.\\n\\n  2. In issue #1025, Ali1858 encountered a ValueError while trying to load multiple adapters simultaneously for inference. This suggests that currently, loading multiple adapters at once may not be supported or straightforward.\\n\\n  3. In issue #449, TheShy-Dream expressed interest in incorporating multimodal information into an adapter they were creating themselves. It's unclear whether this involves combining multiple adapters or just modifying a single one.\\n\\n   Overall, it seems that combining multiple adapters is still an open question, and more exploration and experimentation is needed to determine how best to do it. If you're interested in contributing to this discussion, you might consider joining the conversation in these issues or opening a new one with your own ideas and questions.\""
-            ],
-            "application/vnd.google.colaboratory.intrinsic+json": {
-              "type": "string"
-            }
-          },
-          "metadata": {},
-          "execution_count": 14
-        }
+      "outputs": [],
+      "source": [
+        "rag_chain.invoke(question)['text']"
       ]
     },
     {
       "cell_type": "markdown",
+      "metadata": {
+        "id": "hZQedZKSyrwO"
+      },
       "source": [
         "As we can see, the added context, really helps the exact same model, provide a much more relevant and informed answer to the library-specific question.\n",
         "\n",
         "Notably, combining multiple adapters for inference has been added to the library, and one can find this information in the documentation, so for the next iteration of this RAG it may be worth including documentation embeddings."
-      ],
-      "metadata": {
-        "id": "hZQedZKSyrwO"
-      }
+      ]
     }
-  ]
-}
\ No newline at end of file
+  ],
+  "metadata": {
+    "accelerator": "GPU",
+    "colab": {
+      "gpuType": "T4",
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.11.3"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}

From a03d0db229e42e6467e13bb1aef269fa39d7006a Mon Sep 17 00:00:00 2001
From: jacoblee93 <jacoblee93@gmail.com>
Date: Sat, 17 Feb 2024 08:51:47 -0800
Subject: [PATCH 2/5] Add streaming example

---
 notebooks/en/rag_zephyr_langchain.ipynb | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)
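
Note on the streaming cell added below: `rag_chain.stream(question)` most likely returns an iterator of string chunks, because the chain ends in `StrOutputParser`. Indexing that iterator with `['text']` would then raise before anything is printed. The following is a minimal stdlib-only sketch of the difference, not the notebook's code. `fake_stream` is a hypothetical stand-in for the real chain and is not part of LangChain.

```python
from typing import Iterator


def fake_stream(answer: str) -> Iterator[str]:
    """Hypothetical stand-in for rag_chain.stream(question): yields text chunks."""
    for word in answer.split(" "):
        yield word + " "


# Subscripting the generator itself fails before any chunk is produced.
try:
    fake_stream("merge the LoRA adapters")["text"]
    subscript_error = None
except TypeError as exc:
    subscript_error = type(exc).__name__

# Iterating the stream directly handles each chunk as it becomes available.
chunks = []
for chunk in fake_stream("merge the LoRA adapters"):
    print(chunk, end="", flush=True)
    chunks.append(chunk)
print()

streamed_text = "".join(chunks)
```

If the real chain behaves like the stand-in, the cell would loop over `rag_chain.stream(question)` directly and print each chunk with `end=""`.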

diff --git a/notebooks/en/rag_zephyr_langchain.ipynb b/notebooks/en/rag_zephyr_langchain.ipynb
index 0864976a..13e0158c 100644
--- a/notebooks/en/rag_zephyr_langchain.ipynb
+++ b/notebooks/en/rag_zephyr_langchain.ipynb
@@ -423,7 +423,7 @@
       },
       "outputs": [],
       "source": [
-        "llm_chain.invoke({\"context\":\"\", \"question\": question})['text']\n"
+        "llm_chain.invoke({\"context\":\"\", \"question\": question})['text']"
       ]
     },
     {
@@ -460,8 +460,27 @@
       "source": [
         "As we can see, the added context, really helps the exact same model, provide a much more relevant and informed answer to the library-specific question.\n",
         "\n",
-        "Notably, combining multiple adapters for inference has been added to the library, and one can find this information in the documentation, so for the next iteration of this RAG it may be worth including documentation embeddings."
+        "Notably, combining multiple adapters for inference has been added to the library, and one can find this information in the documentation, so for the next iteration of this RAG it may be worth including documentation embeddings.\n",
+        "\n",
+        "We can also stream the output to get output tokens more quickly:"
       ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "for chunk in rag_chain.stream(question)['text']:\n",
+        "    print(chunk)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": []
     }
   ],
   "metadata": {

From fd200de3a265fbc6bf01293b3157dc7b5d266925 Mon Sep 17 00:00:00 2001
From: jacoblee93 <jacoblee93@gmail.com>
Date: Sat, 17 Feb 2024 09:12:17 -0800
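Subject: [PATCH 3/5] Add back in outputs, remove streaming

Restore executed outputs for the two comparison cells and drop the
streaming code cell again.

Also:
- fix the "LandChain" typo in the introduction
- pass do_sample=True so that temperature=0.2 takes effect
- import RunnablePassthrough from langchain_core.runnables
- stop indexing chain results with ['text'], since the chain now ends
  in StrOutputParser and returns a plain string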

---
 notebooks/en/rag_zephyr_langchain.ipynb | 86 ++++++++++++++-----------
 1 file changed, 49 insertions(+), 37 deletions(-)

diff --git a/notebooks/en/rag_zephyr_langchain.ipynb b/notebooks/en/rag_zephyr_langchain.ipynb
index 13e0158c..353b8105 100644
--- a/notebooks/en/rag_zephyr_langchain.ipynb
+++ b/notebooks/en/rag_zephyr_langchain.ipynb
@@ -27,7 +27,7 @@
         "\n",
         "* At the same time, the fact that fine-tuning is not required gives you the freedom to swap your LLM for a more powerful one when it becomes available, or switch to a smaller distilled version, should you need faster inference.\n",
         "\n",
-        "Let's illustrate building a RAG using an open-source LLM, embeddings model, and LandChain.\n",
+        "Let's illustrate building a RAG using an open-source LLM, embeddings model, and LangChain.\n",
         "\n",
         "First, install the required dependencies:"
       ]
@@ -45,7 +45,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": null,
+      "execution_count": 2,
       "metadata": {
         "id": "-aYENQwZ-p_c"
       },
@@ -112,7 +112,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": null,
+      "execution_count": 5,
       "metadata": {
         "id": "8EKMit4WNDY8"
       },
@@ -213,7 +213,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": null,
+      "execution_count": 8,
       "metadata": {
         "id": "mBTreCQ9noHK"
       },
@@ -304,7 +304,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": null,
+      "execution_count": 15,
       "metadata": {
         "id": "cR0k1cRWz8Pm"
       },
@@ -320,6 +320,7 @@
         "    tokenizer=tokenizer,\n",
         "    task=\"text-generation\",\n",
         "    temperature=0.2,\n",
+        "    do_sample=True,\n",
         "    repetition_penalty=1.1,\n",
         "    return_full_text=True,\n",
         "    max_new_tokens=400,\n",
@@ -363,13 +364,13 @@
     },
     {
       "cell_type": "code",
-      "execution_count": null,
+      "execution_count": 17,
       "metadata": {
         "id": "_rI3YNp9Xl4s"
       },
       "outputs": [],
       "source": [
-        "from langchain.schema.runnable import RunnablePassthrough\n",
+        "from langchain_core.runnables import RunnablePassthrough\n",
         "\n",
         "retriever = db.as_retriever()\n",
         "\n",
@@ -392,7 +393,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": null,
+      "execution_count": 18,
       "metadata": {
         "id": "W7F07fQLXusU"
       },
@@ -412,18 +413,32 @@
     },
     {
       "cell_type": "code",
-      "execution_count": null,
+      "execution_count": 20,
       "metadata": {
         "colab": {
           "base_uri": "https://localhost:8080/",
-          "height": 216
+          "height": 125
         },
         "id": "GYh-HG1l0De5",
-        "outputId": "549e0bdd-b186-4d16-e7fa-90b3865d6f83"
+        "outputId": "277d8e89-ce9b-4e04-c11b-639ad2645759"
       },
-      "outputs": [],
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "\" To combine multiple adapters, you need to ensure that they are compatible with each other and the devices you want to connect. Here's how you can do it:\\n\\n1. Identify the adapters you need: Determine which adapters you require to connect the devices you want to use together. For example, if you want to connect a USB-C device to an HDMI monitor, you may need a USB-C to HDMI adapter and a USB-C to USB-A adapter (if your computer only has USB-A ports).\\n\\n2. Connect the first adapter: Plug in the first adapter into the device you want to connect. For instance, if you're connecting a USB-C laptop to an HDMI monitor, plug the USB-C to HDMI adapter into the laptop's USB-C port.\\n\\n3. Connect the second adapter: Next, connect the second adapter to the first one. In this case, connect the USB-C to USB-A adapter to the USB-C port of the USB-C to HDMI adapter.\\n\\n4. Connect the final device: Finally, connect the device you want to use to the second adapter. For example, connect the HDMI cable from the monitor to the HDMI port on the USB-C to HDMI adapter.\\n\\n5. Test the connection: Turn on both devices and check whether everything is working correctly. If necessary, adjust the settings on your devices to ensure optimal performance.\\n\\nBy combining multiple adapters, you can connect a variety of devices together, even if they don't have the same type of connector. Just be sure to choose adapters that are compatible with all the devices you want to connect and test the connection thoroughly before relying on it for critical tasks.\""
+            ],
+            "application/vnd.google.colaboratory.intrinsic+json": {
+              "type": "string"
+            }
+          },
+          "metadata": {},
+          "execution_count": 20
+        }
+      ],
       "source": [
-        "llm_chain.invoke({\"context\":\"\", \"question\": question})['text']"
+        "llm_chain.invoke({\"context\":\"\", \"question\": question})"
       ]
     },
     {
@@ -438,18 +453,32 @@
     },
     {
       "cell_type": "code",
-      "execution_count": null,
+      "execution_count": 21,
       "metadata": {
         "colab": {
           "base_uri": "https://localhost:8080/",
-          "height": 198
+          "height": 125
         },
         "id": "FZpNA3o10H10",
-        "outputId": "9ddc0eef-0503-445d-8f70-26be3ec19de6"
+        "outputId": "31f9aed3-3dd7-4ff8-d1a8-866794fefe80"
       },
-      "outputs": [],
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "\" Based on the provided context, it seems that combining multiple adapters is still an open question in the community. Here are some possibilities:\\n\\n  1. Save the output from the base model and pass it to each adapter separately, as described in the first context snippet. This allows you to run multiple adapters simultaneously and reuse the output from the base model. However, this approach requires loading and running each adapter separately.\\n\\n  2. Export everything into a single PyTorch model, as suggested in the second context snippet. This would involve saving all the adapters and their weights into a single model, potentially making it larger and more complex. The advantage of this approach is that it would allow you to run all the adapters simultaneously without having to load and run them separately.\\n\\n  3. Merge multiple Lora adapters, as mentioned in the third context snippet. This involves adding multiple distinct, independent behaviors to a base model by merging multiple Lora adapters. It's not clear from the context how this would be done, but it suggests that there might be a recommended way of doing it.\\n\\n  4. Combine adapters through a specific architecture, as proposed in the fourth context snippet. This involves merging multiple adapters into a single architecture, potentially creating a more complex model with multiple behaviors. Again, it's not clear from the context how this would be done.\\n\\n   Overall, combining multiple adapters is still an active area of research, and there doesn't seem to be a widely accepted solution yet. If you're interested in exploring this further, it might be worth reaching out to the Hugging Face community or checking out their documentation for more information.\""
+            ],
+            "application/vnd.google.colaboratory.intrinsic+json": {
+              "type": "string"
+            }
+          },
+          "metadata": {},
+          "execution_count": 21
+        }
+      ],
       "source": [
-        "rag_chain.invoke(question)['text']"
+        "rag_chain.invoke(question)"
       ]
     },
     {
@@ -462,25 +491,8 @@
         "\n",
         "Notably, combining multiple adapters for inference has been added to the library, and one can find this information in the documentation, so for the next iteration of this RAG it may be worth including documentation embeddings.\n",
         "\n",
-        "We can also stream the output to get output tokens more quickly:"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {},
-      "outputs": [],
-      "source": [
-        "for chunk in rag_chain.stream(question)['text']:\n",
-        "    print(chunk)"
+        "We can also stream the output to access tokens as they are generated by the model:"
       ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {},
-      "outputs": [],
-      "source": []
     }
   ],
   "metadata": {
@@ -508,4 +520,4 @@
   },
   "nbformat": 4,
   "nbformat_minor": 0
-}
+}
\ No newline at end of file

From c31af9aee0b473c98299057c52907e71a9307eb9 Mon Sep 17 00:00:00 2001
From: jacoblee93 <jacoblee93@gmail.com>
Date: Mon, 19 Feb 2024 08:43:57 -0800
Subject: [PATCH 4/5] Remove extra line

---
 notebooks/en/rag_zephyr_langchain.ipynb | 28 ++++++++++++-------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/notebooks/en/rag_zephyr_langchain.ipynb b/notebooks/en/rag_zephyr_langchain.ipynb
index 849832cf..d34b2fdb 100644
--- a/notebooks/en/rag_zephyr_langchain.ipynb
+++ b/notebooks/en/rag_zephyr_langchain.ipynb
@@ -424,17 +424,17 @@
       },
       "outputs": [
         {
-          "output_type": "execute_result",
           "data": {
-            "text/plain": [
-              "\" To combine multiple adapters, you need to ensure that they are compatible with each other and the devices you want to connect. Here's how you can do it:\\n\\n1. Identify the adapters you need: Determine which adapters you require to connect the devices you want to use together. For example, if you want to connect a USB-C device to an HDMI monitor, you may need a USB-C to HDMI adapter and a USB-C to USB-A adapter (if your computer only has USB-A ports).\\n\\n2. Connect the first adapter: Plug in the first adapter into the device you want to connect. For instance, if you're connecting a USB-C laptop to an HDMI monitor, plug the USB-C to HDMI adapter into the laptop's USB-C port.\\n\\n3. Connect the second adapter: Next, connect the second adapter to the first one. In this case, connect the USB-C to USB-A adapter to the USB-C port of the USB-C to HDMI adapter.\\n\\n4. Connect the final device: Finally, connect the device you want to use to the second adapter. For example, connect the HDMI cable from the monitor to the HDMI port on the USB-C to HDMI adapter.\\n\\n5. Test the connection: Turn on both devices and check whether everything is working correctly. If necessary, adjust the settings on your devices to ensure optimal performance.\\n\\nBy combining multiple adapters, you can connect a variety of devices together, even if they don't have the same type of connector. Just be sure to choose adapters that are compatible with all the devices you want to connect and test the connection thoroughly before relying on it for critical tasks.\""
-            ],
             "application/vnd.google.colaboratory.intrinsic+json": {
               "type": "string"
-            }
+            },
+            "text/plain": [
+              "\" To combine multiple adapters, you need to ensure that they are compatible with each other and the devices you want to connect. Here's how you can do it:\\n\\n1. Identify the adapters you need: Determine which adapters you require to connect the devices you want to use together. For example, if you want to connect a USB-C device to an HDMI monitor, you may need a USB-C to HDMI adapter and a USB-C to USB-A adapter (if your computer only has USB-A ports).\\n\\n2. Connect the first adapter: Plug in the first adapter into the device you want to connect. For instance, if you're connecting a USB-C laptop to an HDMI monitor, plug the USB-C to HDMI adapter into the laptop's USB-C port.\\n\\n3. Connect the second adapter: Next, connect the second adapter to the first one. In this case, connect the USB-C to USB-A adapter to the USB-C port of the USB-C to HDMI adapter.\\n\\n4. Connect the final device: Finally, connect the device you want to use to the second adapter. For example, connect the HDMI cable from the monitor to the HDMI port on the USB-C to HDMI adapter.\\n\\n5. Test the connection: Turn on both devices and check whether everything is working correctly. If necessary, adjust the settings on your devices to ensure optimal performance.\\n\\nBy combining multiple adapters, you can connect a variety of devices together, even if they don't have the same type of connector. Just be sure to choose adapters that are compatible with all the devices you want to connect and test the connection thoroughly before relying on it for critical tasks.\""
+            ]
           },
+          "execution_count": 20,
           "metadata": {},
-          "execution_count": 20
+          "output_type": "execute_result"
         }
       ],
       "source": [
@@ -464,17 +464,17 @@
       },
       "outputs": [
         {
-          "output_type": "execute_result",
           "data": {
-            "text/plain": [
-              "\" Based on the provided context, it seems that combining multiple adapters is still an open question in the community. Here are some possibilities:\\n\\n  1. Save the output from the base model and pass it to each adapter separately, as described in the first context snippet. This allows you to run multiple adapters simultaneously and reuse the output from the base model. However, this approach requires loading and running each adapter separately.\\n\\n  2. Export everything into a single PyTorch model, as suggested in the second context snippet. This would involve saving all the adapters and their weights into a single model, potentially making it larger and more complex. The advantage of this approach is that it would allow you to run all the adapters simultaneously without having to load and run them separately.\\n\\n  3. Merge multiple Lora adapters, as mentioned in the third context snippet. This involves adding multiple distinct, independent behaviors to a base model by merging multiple Lora adapters. It's not clear from the context how this would be done, but it suggests that there might be a recommended way of doing it.\\n\\n  4. Combine adapters through a specific architecture, as proposed in the fourth context snippet. This involves merging multiple adapters into a single architecture, potentially creating a more complex model with multiple behaviors. Again, it's not clear from the context how this would be done.\\n\\n   Overall, combining multiple adapters is still an active area of research, and there doesn't seem to be a widely accepted solution yet. If you're interested in exploring this further, it might be worth reaching out to the Hugging Face community or checking out their documentation for more information.\""
-            ],
             "application/vnd.google.colaboratory.intrinsic+json": {
               "type": "string"
-            }
+            },
+            "text/plain": [
+              "\" Based on the provided context, it seems that combining multiple adapters is still an open question in the community. Here are some possibilities:\\n\\n  1. Save the output from the base model and pass it to each adapter separately, as described in the first context snippet. This allows you to run multiple adapters simultaneously and reuse the output from the base model. However, this approach requires loading and running each adapter separately.\\n\\n  2. Export everything into a single PyTorch model, as suggested in the second context snippet. This would involve saving all the adapters and their weights into a single model, potentially making it larger and more complex. The advantage of this approach is that it would allow you to run all the adapters simultaneously without having to load and run them separately.\\n\\n  3. Merge multiple Lora adapters, as mentioned in the third context snippet. This involves adding multiple distinct, independent behaviors to a base model by merging multiple Lora adapters. It's not clear from the context how this would be done, but it suggests that there might be a recommended way of doing it.\\n\\n  4. Combine adapters through a specific architecture, as proposed in the fourth context snippet. This involves merging multiple adapters into a single architecture, potentially creating a more complex model with multiple behaviors. Again, it's not clear from the context how this would be done.\\n\\n   Overall, combining multiple adapters is still an active area of research, and there doesn't seem to be a widely accepted solution yet. If you're interested in exploring this further, it might be worth reaching out to the Hugging Face community or checking out their documentation for more information.\""
+            ]
           },
+          "execution_count": 21,
           "metadata": {},
-          "execution_count": 21
+          "output_type": "execute_result"
         }
       ],
       "source": [
@@ -489,9 +489,7 @@
       "source": [
         "As we can see, the added context really helps the exact same model provide a much more relevant and informed answer to the library-specific question.\n",
         "\n",
-        "Notably, combining multiple adapters for inference has been added to the library, and one can find this information in the documentation, so for the next iteration of this RAG it may be worth including documentation embeddings.\n",
-        "\n",
-        "We can also stream the output to access tokens as they are generated by the model:"
+        "Notably, combining multiple adapters for inference has been added to the library, and one can find this information in the documentation, so for the next iteration of this RAG it may be worth including documentation embeddings."
       ]
     }
   ],

From 555e90053339279872d23c3b6b4b8808a8c9bb55 Mon Sep 17 00:00:00 2001
From: jacoblee93 <jacoblee93@gmail.com>
Date: Mon, 19 Feb 2024 08:46:54 -0800
Subject: [PATCH 5/5] Reapply author addition

---
 notebooks/en/rag_zephyr_langchain.ipynb | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/notebooks/en/rag_zephyr_langchain.ipynb b/notebooks/en/rag_zephyr_langchain.ipynb
index d34b2fdb..992d5820 100644
--- a/notebooks/en/rag_zephyr_langchain.ipynb
+++ b/notebooks/en/rag_zephyr_langchain.ipynb
@@ -8,6 +8,8 @@
       "source": [
         "# Simple RAG for GitHub issues using Hugging Face Zephyr and LangChain\n",
         "\n",
+        "_Authored by: [Maria Khalusova](https://github.com/MKhalusova)_\n",
+        "\n",
         "This notebook demonstrates how you can quickly build a RAG (Retrieval Augmented Generation) for a project's GitHub issues using [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) model, and LangChain.\n",
         "\n",
         "\n",