Skip to content

Commit

Permalink
Merge pull request #69 from rnckp/main
Browse files Browse the repository at this point in the history
Fix typos and update several links in notebook «Advanced RAG on Hugging Face»
  • Loading branch information
merveenoyan committed May 7, 2024
2 parents 3c0cff3 + 88ff73c commit a2d4bcc
Showing 1 changed file with 36 additions and 36 deletions.
72 changes: 36 additions & 36 deletions notebooks/en/advanced_rag.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
"id": "hUCaGdAj9-9F"
},
"source": [
"# Advanced RAG on HuggingFace documentation using LangChain\n",
"# Advanced RAG on Hugging Face documentation using LangChain\n",
"_Authored by: [Aymeric Roucher](https://huggingface.co/m-ric)_"
]
},
Expand All @@ -20,7 +20,7 @@
"\n",
"For an introduction to RAG, you can check [this other cookbook](rag_zephyr_langchain)!\n",
"\n",
"RAG systems are complex, with many moving parts: here a RAG diagram, where we noted in blue all possibilities for system enhancement:\n",
"RAG systems are complex, with many moving parts: here is a RAG diagram, where we noted in blue all possibilities for system enhancement:\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/RAG_workflow.png\" height=\"700\">\n",
"\n",
Expand Down Expand Up @@ -70,7 +70,7 @@
"\n",
"pd.set_option(\n",
" \"display.max_colwidth\", None\n",
") # this will be helpful when visualizing retriever outputs"
") # This will be helpful when visualizing retriever outputs"
]
},
{
Expand Down Expand Up @@ -122,13 +122,13 @@
"\n",
"These snippets will then be fed to the Reader Model to help it generate its answer.\n",
"\n",
"So __our objective here is, given a user question, to find the most snippets from our knowledge base to answer that question.__\n",
"So __our objective here is, given a user question, to find the most relevant snippets from our knowledge base to answer that question.__\n",
"\n",
"This is a wide objective, it leaves open some questions. How many snippets should we retrieve? This parameter will be named `top_k`.\n",
"\n",
"How long should these snippets be? This is called the `chunk size`. There's no one-size-fits-all answers, but here are a few elements:\n",
"- 🔀 Your `chunk size` is allowed to vary from one snippet to the other.\n",
"- Since there will always be some noise in your retrieval, increasing the `top_k` increases the chance to get relevant elements in your retrieved snippets. 🎯 Shooting more arrows increases your probability to hit your target.\n",
"- Since there will always be some noise in your retrieval, increasing the `top_k` increases the chance to get relevant elements in your retrieved snippets. 🎯 Shooting more arrows increases your probability of hitting your target.\n",
"- Meanwhile, the summed length of your retrieved documents should not be too high: for instance, for most current models 16k tokens will probably drown your Reader model in information due to [Lost-in-the-middle phenomenon](https://huggingface.co/papers/2307.03172). 🎯 Give your reader model only the most relevant insights, not a huge pile of books!\n",
"\n",
"\n",
Expand All @@ -144,15 +144,15 @@
"### 1.1 Split the documents into chunks\n",
"\n",
"- In this part, __we split the documents from our knowledge base into smaller chunks__ which will be the snippets on which the reader LLM will base its answer.\n",
"- The goal is to prepare a collection of **semantically relevant snippets**. So their size should be adapted to precise ideas: too small will truncate ideas, too large will dilute them.\n",
"- The goal is to prepare a collection of **semantically relevant snippets**. So their size should be adapted to precise ideas: too small will truncate ideas, and too large will dilute them.\n",
"\n",
"💡 _Many options exist for text splitting: splitting on words, on sentence boundaries, recursive chunking that processes documents in a tree-like way to preserve structure information... To learn more about chunking, I recommend you read [this great notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt._\n",
"💡 _Many options exist for text splitting: splitting on words, on sentence boundaries, recursive chunking that processes documents in a tree-like way to preserve structure information... To learn more about chunking, I recommend you read [this great notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) by Greg Kamradt._\n",
"\n",
"\n",
"- **Recursive chunking** breaks down the text into smaller parts step by step using a given list of separators sorted from the most important to the least important separator. If the first split doesn't give the right size or shape chunks, the method repeats itself on the new chunks using a different separator. For instance with the list of separators `[\"\\n\\n\", \"\\n\", \".\", \"\"]`:\n",
"- **Recursive chunking** breaks down the text into smaller parts step by step using a given list of separators sorted from the most important to the least important separator. If the first split doesn't give the right size or shape of chunks, the method repeats itself on the new chunks using a different separator. For instance with the list of separators `[\"\\n\\n\", \"\\n\", \".\", \"\"]`:\n",
" - The method will first break down the document wherever there is a double line break `\"\\n\\n\"`.\n",
" - Resulting documents will be split again on simple line breaks `\"\\n\"`, then on sentence ends `\".\"`.\n",
" - And finally, if some chunks are still too big, they will be split whenever they overflow the maximum size.\n",
" - Finally, if some chunks are still too big, they will be split whenever they overflow the maximum size.\n",
"\n",
"- With this method, the global structure is well preserved, at the expense of getting slight variations in chunk size.\n",
"\n",
Expand All @@ -174,7 +174,7 @@
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
"# We use a hierarchical list of separators specifically tailored for splitting Markdown documents\n",
"# This list is taken from LangChain's MarkdownTextSplitter class.\n",
"# This list is taken from LangChain's MarkdownTextSplitter class\n",
"MARKDOWN_SEPARATORS = [\n",
" \"\\n#{1,6} \",\n",
" \"```\\n\",\n",
Expand All @@ -188,8 +188,8 @@
"]\n",
"\n",
"text_splitter = RecursiveCharacterTextSplitter(\n",
" chunk_size=1000, # the maximum number of characters in a chunk: we selected this value arbitrarily\n",
" chunk_overlap=100, # the number of characters to overlap between chunks\n",
" chunk_size=1000, # The maximum number of characters in a chunk: we selected this value arbitrarily\n",
" chunk_overlap=100, # The number of characters to overlap between chunks\n",
" add_start_index=True, # If `True`, includes chunk's start index in metadata\n",
" strip_whitespace=True, # If `True`, strips whitespace from the start and end of every document\n",
" separators=MARKDOWN_SEPARATORS,\n",
Expand All @@ -206,9 +206,9 @@
"id": "d5jJUMgb9-9M"
},
"source": [
"We also have to keep in mind that when embedding documents, we will use an embedding model that has accepts a certain maximum sequence length `max_seq_length`.\n",
"We also have to keep in mind that when embedding documents, we will use an embedding model that accepts a certain maximum sequence length `max_seq_length`.\n",
"\n",
"So we should make sure that our chunk sizes are below this limit, because any longer chunk will be truncated before processing, thus losing relevancy."
"So we should make sure that our chunk sizes are below this limit because any longer chunk will be truncated before processing, thus losing relevancy."
]
},
{
Expand Down Expand Up @@ -259,7 +259,7 @@
"source": [
"from sentence_transformers import SentenceTransformer\n",
"\n",
"# To get the value of the max sequence_length, we will query the underlying `SentenceTransformer` object used in the RecursiveCharacterTextSplitter.\n",
"# To get the value of the max sequence_length, we will query the underlying `SentenceTransformer` object used in the RecursiveCharacterTextSplitter\n",
"print(\n",
" f\"Model's maximum sequence length: {SentenceTransformer('thenlper/gte-small').max_seq_length}\"\n",
")\n",
Expand All @@ -269,7 +269,7 @@
"tokenizer = AutoTokenizer.from_pretrained(\"thenlper/gte-small\")\n",
"lengths = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(docs_processed)]\n",
"\n",
"# Plot the distrubution of document lengths, counted as the number of tokens\n",
"# Plot the distribution of document lengths, counted as the number of tokens\n",
"fig = pd.Series(lengths).hist()\n",
"plt.title(\"Distribution of document lengths in the knowledge base (in count of tokens)\")\n",
"plt.show()"
Expand Down Expand Up @@ -398,23 +398,23 @@
"source": [
"### 1.2 Building the vector database\n",
"\n",
"We want to compute the embeddings for all the chunks of our knowledge base: to learn more on sentence embeddings, we recommend reading [this guide](https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/).\n",
"We want to compute the embeddings for all the chunks of our knowledge base: to learn more about sentence embeddings, we recommend reading [this guide](https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/).\n",
"\n",
"#### How does retrieval work ?\n",
"#### How does retrieval work?\n",
"\n",
"Once the chunks are all embedded, we store them into a vector database. When the user types in a query, it gets embedded by the same model previously used, and a similarity search returns the closest documents from the vector database.\n",
"Once the chunks are all embedded, we store them in a vector database. When the user types in a query, it gets embedded by the same model previously used, and a similarity search returns the closest documents from the vector database.\n",
"\n",
"The technical challenge is thus, given a query vector, to quickly find the nearest neighbours of this vector in the vector database. To do this, we need to choose two things: a distance, and a search algorithm to find the nearest neighbors quickly within a database of thousands of records.\n",
"The technical challenge is thus, given a query vector, to quickly find the nearest neighbors of this vector in the vector database. To do this, we need to choose two things: a distance, and a search algorithm to find the nearest neighbors quickly within a database of thousands of records.\n",
"\n",
"##### Nearest Neighbor search algorithm\n",
"\n",
"There are plentiful choices for the nearest neighbor search algorithm: we go with Facebook's [FAISS](https://github.com/facebookresearch/faiss), since FAISS is performant enough for most use cases, and it is well known thus widely implemented.\n",
"There are plentiful choices for the nearest neighbor search algorithm: we go with Facebook's [FAISS](https://github.com/facebookresearch/faiss) since FAISS is performant enough for most use cases, and it is well known and thus widely implemented.\n",
"\n",
"##### Distances\n",
"\n",
"Regarding distances, you can find a good guide [here](https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/#distance-between-embeddings). In short:\n",
"\n",
"- **Cosine similarity** computes similarity between two vectors as the cosinus of their relative angle: it allows us to compare vector directions are regardless of their magnitude. Using it requires to normalize all vectors, to rescale them into unit norm.\n",
"- **Cosine similarity** computes the similarity between two vectors as the cosinus of their relative angle: it allows us to compare vector directions regardless of their magnitude. Using it requires normalizing all vectors, to rescale them into unit norm.\n",
"- **Dot product** takes into account magnitude, with the sometimes undesirable effect that increasing a vector's length will make it more similar to all others.\n",
"- **Euclidean distance** is the distance between the ends of vectors.\n",
"\n",
Expand All @@ -441,7 +441,7 @@
" model_name=EMBEDDING_MODEL_NAME,\n",
" multi_process=True,\n",
" model_kwargs={\"device\": \"cuda\"},\n",
" encode_kwargs={\"normalize_embeddings\": True}, # set True for cosine similarity\n",
" encode_kwargs={\"normalize_embeddings\": True}, # Set `True` for cosine similarity\n",
")\n",
"\n",
"KNOWLEDGE_VECTOR_DATABASE = FAISS.from_documents(\n",
Expand All @@ -468,7 +468,7 @@
},
"outputs": [],
"source": [
"# embed a user query in the same space\n",
"# Embed a user query in the same space\n",
"user_query = \"How to create a pipeline object?\"\n",
"query_vector = embedding_model.embed_query(user_query)"
]
Expand All @@ -494,7 +494,7 @@
" for idx in range(len(docs_processed))\n",
"] + [query_vector]\n",
"\n",
"# fit the data (The index of transformed data corresponds to the index of the original data)\n",
"# Fit the data (the index of transformed data corresponds to the index of the original data)\n",
"documents_projected = embedding_projector.fit_transform(\n",
" np.array(embeddings_2d), init=\"pca\"\n",
")"
Expand Down Expand Up @@ -532,7 +532,7 @@
" ]\n",
")\n",
"\n",
"# visualize the embedding\n",
"# Visualize the embedding\n",
"fig = px.scatter(\n",
" df,\n",
" x=\"x\",\n",
Expand Down Expand Up @@ -565,9 +565,9 @@
"<img src=\"https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/PaCMAP_embeddings.png\" height=\"700\">\n",
"\n",
"\n",
"➡️ On the graph above, you can see a spatial representation of the kowledge base documents. As the vector embeddings represent the document's meaning, their closeness in meaning should be reflected in their embedding's closeness.\n",
"➡️ On the graph above, you can see a spatial representation of the knowledge base documents. As the vector embeddings represent the document's meaning, their closeness in meaning should be reflected in their embedding's closeness.\n",
"\n",
"The user query's embedding is also shown : we want to find the `k` document that have the closest meaning, thus we pick the `k` closest vectors.\n",
"The user query's embedding is also shown: we want to find the `k` documents that have the closest meaning, thus we pick the `k` closest vectors.\n",
"\n",
"In the LangChain vector database implementation, this search operation is performed by the method `vector_database.similarity_search(query)`.\n",
"\n",
Expand Down Expand Up @@ -619,9 +619,9 @@
"\n",
"In this part, the __LLM Reader reads the retrieved context to formulate its answer.__\n",
"\n",
"There are actually substeps that can all be tuned:\n",
"There are substeps that can all be tuned:\n",
"1. The content of the retrieved documents is aggregated together into the \"context\", with many processing options like _prompt compression_.\n",
"2. The context and the user query are aggregated into a prompt then given to the LLM to generate its answer."
"2. The context and the user query are aggregated into a prompt and then given to the LLM to generate its answer."
]
},
{
Expand All @@ -632,8 +632,8 @@
"source": [
"### 2.1. Reader model\n",
"\n",
"The choice of a reader model is important on a few aspects:\n",
"- the reader model's `max_seq_length` must accomodate our prompt, which includes the context output by the retriever call: the context consists in 5 documents of 512 tokens each, so we aim for a context length of 4k tokens at least.\n",
"The choice of a reader model is important in a few aspects:\n",
"- the reader model's `max_seq_length` must accommodate our prompt, which includes the context output by the retriever call: the context consists of 5 documents of 512 tokens each, so we aim for a context length of 4k tokens at least.\n",
"- the reader model\n",
"\n",
"For this example, we chose [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a small but powerful model.\n",
Expand Down Expand Up @@ -885,7 +885,7 @@
"source": [
"retrieved_docs_text = [\n",
" doc.page_content for doc in retrieved_docs\n",
"] # we only need the text of the documents\n",
"] # We only need the text of the documents\n",
"context = \"\\nExtracted documents:\\n\"\n",
"context += \"\".join(\n",
" [f\"Document {str(i)}:::\\n\" + doc for i, doc in enumerate(retrieved_docs_text)]\n",
Expand Down Expand Up @@ -961,7 +961,7 @@
" relevant_docs = knowledge_index.similarity_search(\n",
" query=question, k=num_retrieved_docs\n",
" )\n",
" relevant_docs = [doc.page_content for doc in relevant_docs] # keep only the text\n",
" relevant_docs = [doc.page_content for doc in relevant_docs] # Keep only the text\n",
"\n",
" # Optionally rerank results\n",
" if reranker:\n",
Expand Down Expand Up @@ -1193,7 +1193,7 @@
"id": "w6iNo7lY9-9S"
},
"source": [
"✅ We now have a fully functional, performant RAG sytem. That's it for today! Congratulations for making it to the end 🥳\n",
"✅ We now have a fully functional, performant RAG system. That's it for today! Congratulations for making it to the end 🥳\n",
"\n",
"\n",
"# To go further 🗺️\n",
Expand All @@ -1202,7 +1202,7 @@
"\n",
"### Setting up an evaluation pipeline\n",
"\n",
"- 💬 \"You cannot improve the model performance that you do not measure\", said Gandhi... or at least Llama2 told me he said it. Anyway, you should absolutely start by measuring performance: this means building a small evaluation dataset, then monitor the performance of your RAG system on this evaluation dataset.\n",
"- 💬 \"You cannot improve the model performance that you do not measure\", said Gandhi... or at least Llama2 told me he said it. Anyway, you should absolutely start by measuring performance: this means building a small evaluation dataset, and then monitor the performance of your RAG system on this evaluation dataset.\n",
"\n",
"### Improving the retriever\n",
"\n",
Expand Down

0 comments on commit a2d4bcc

Please sign in to comment.