diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml
index ff302fd4..523625f3 100644
--- a/notebooks/en/_toctree.yml
+++ b/notebooks/en/_toctree.yml
@@ -43,6 +43,8 @@
       title: Implementing semantic cache to improve a RAG system.
     - local: structured_generation
       title: RAG with source highlighting using Structured generation
+    - local: rag_with_unstructured_data
+      title: Building RAG with Custom Unstructured Data
 
 - title: Computer Vision Recipes
   isExpanded: false
diff --git a/notebooks/en/index.md b/notebooks/en/index.md
index 40c3b994..717d0b91 100644
--- a/notebooks/en/index.md
+++ b/notebooks/en/index.md
@@ -11,6 +11,7 @@ Check out the recently added notebooks:
 - [Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation](llm_judge)
 - [Create a legal preference dataset](pipeline_notus_instructions_preferences_legal)
 - [Suggestions for Data Annotation with SetFit in Zero-shot Text Classification](labelling_feedback_setfit)
+- [Building RAG with Custom Unstructured Data](rag_with_unstructured_data)
 
 You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook).
diff --git a/notebooks/en/rag_with_unstructured_data.ipynb b/notebooks/en/rag_with_unstructured_data.ipynb
new file mode 100644
index 00000000..fccca497
--- /dev/null
+++ b/notebooks/en/rag_with_unstructured_data.ipynb
@@ -0,0 +1,533 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "provenance": [],
+ "gpuType": "T4"
+ },
+ "kernelspec": {
+ "name": "python3",
+ "display_name": "Python 3"
+ },
+ "language_info": {
+ "name": "python"
+ },
+ "accelerator": "GPU"
+ },
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Building RAG with Custom Unstructured Data\n",
+ "\n",
+ "_Authored by: [Maria Khalusova](https://github.com/MKhalusova)_\n",
+ "\n",
+ "If you're new to RAG, please explore the basics of RAG first in [this other notebook](https://huggingface.co/learn/cookbook/rag_zephyr_langchain), and then come back here to learn about building RAG with custom data.\n",
+ "\n",
+ "Whether you're building your own RAG-based personal assistant, a pet project, or an enterprise RAG system, you will quickly discover that a lot of important knowledge is stored in various formats like PDFs, emails, Markdown files, PowerPoint presentations, HTML pages, Word documents, and so on.\n",
+ "\n",
+ "How do you preprocess all of this data so that you can use it for RAG?\n",
+ "In this quick tutorial, you'll learn how to build a RAG system that incorporates data from multiple document types. You'll use [Unstructured](https://github.com/Unstructured-IO/unstructured) for data preprocessing, open-source models from the Hugging Face Hub for embeddings and text generation, ChromaDB as a vector store, and LangChain to bring everything together.\n",
+ "\n",
+ "Let's go! We'll begin by installing the required dependencies:"
+ ],
+ "metadata": {
+ "id": "CFP5sQVU_OsU"
+ }
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": true,
+ "id": "MBxI5B35_NqW"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install -q torch transformers accelerate bitsandbytes sentence-transformers unstructured[all-docs] langchain chromadb langchain_community"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Next, let's get a mix of documents. Suppose I want to build a RAG system that'll help me manage pests in my garden. 
For this purpose, I'll use diverse documents that cover the topic of IPM (integrated pest management):\n",
+ "* PDF: `https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf`\n",
+ "* PowerPoint: `https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx`\n",
+ "* EPUB: `https://www.gutenberg.org/ebooks/45957`\n",
+ "* HTML: `https://blog.fifthroom.com/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html`\n",
+ "\n",
+ "Feel free to use your own documents for your topic of choice from the list of document types supported by Unstructured: `.eml`, `.html`, `.md`, `.msg`, `.rst`, `.rtf`, `.txt`, `.xml`, `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.heic`, `.csv`, `.doc`, `.docx`, `.epub`, `.odt`, `.pdf`, `.ppt`, `.pptx`, `.tsv`, `.xlsx`."
+ ],
+ "metadata": {
+ "id": "y9OYqTQjEXu5"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "!mkdir -p \"./documents\"\n",
+ "!wget https://www.gov.nl.ca/ecc/files/env-protection-pesticides-business-manuals-applic-chapter7.pdf -O \"./documents/env-protection-pesticides-business-manuals-applic-chapter7.pdf\"\n",
+ "!wget https://ipm.ifas.ufl.edu/pdfs/Citrus_IPM_090913.pptx -O \"./documents/Citrus_IPM_090913.pptx\"\n",
+ "!wget https://www.gutenberg.org/ebooks/45957.epub3.images -O \"./documents/45957.epub\"\n",
+ "!wget https://blog.fifthroom.com/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html -O \"./documents/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html\""
+ ],
+ "metadata": {
+ "collapsed": true,
+ "id": "Y6lrfx-pEJgJ"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Unstructured data preprocessing\n",
+ "\n",
+ "You can use the Unstructured library to preprocess documents one by one and write your own script to walk through a directory, but it's easier to use the Local source connector to ingest all documents in a given directory. Unstructured can ingest documents from local directories, S3 buckets, blob storage, SFTP, and many other places where your documents might be stored. Ingestion from these sources works very similarly, differing mostly in authentication options.\n",
+ "Here you'll use the Local source connector, but feel free to explore other options in the [Unstructured documentation](https://docs.unstructured.io/open-source/ingest/source-connectors/overview).\n",
+ "\n",
+ "Optionally, you can also choose a [destination](https://docs.unstructured.io/open-source/ingest/destination-connectors/overview) for the processed documents - this could be MongoDB, Pinecone, Weaviate, etc. In this notebook, we'll keep everything local."
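+ ,
+ "\n",
+ "By the way, if you ever want to preprocess a single document by hand instead of running the full ingestion pipeline, partitioning is a one-liner. Here's a minimal sketch, assuming the HTML file downloaded above (it is not part of the pipeline we'll set up next):\n",
+ "\n",
+ "```python\n",
+ "# Minimal sketch: partition one file without the ingestion pipeline.\n",
+ "# `partition` auto-detects the file type and routes it to the matching parser.\n",
+ "from unstructured.partition.auto import partition\n",
+ "\n",
+ "elements = partition(filename=\"./documents/what-to-do-about-harmful-garden-and-plant-insects-and-pests.html\")\n",
+ "print(elements[:5])\n",
+ "```"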
+ ], + "metadata": { + "id": "zWB-b7Dv_ofZ" + } + }, + { + "cell_type": "code", + "source": [ + "# Optional cell to reduce the amount of logs\n", + "\n", + "import logging\n", + "\n", + "logger = logging.getLogger(\"unstructured.ingest\")\n", + "logger.root.removeHandler(logger.root.handlers[0])" + ], + "metadata": { + "id": "WPpj1J8VVy_D" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "import os\n", + "\n", + "from unstructured.ingest.connector.local import SimpleLocalConfig\n", + "from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig\n", + "from unstructured.ingest.runner import LocalRunner\n", + "\n", + "output_path = \"./local-ingest-output\"\n", + "\n", + "runner = LocalRunner(\n", + " processor_config=ProcessorConfig(\n", + " # logs verbosity\n", + " verbose=True,\n", + " # the local directory to store outputs\n", + " output_dir=output_path,\n", + " num_processes=2,\n", + " ),\n", + " read_config=ReadConfig(),\n", + " partition_config=PartitionConfig(\n", + " partition_by_api=True,\n", + " api_key=\"YOUR_UNSTRUCTURED_API_KEY\",\n", + " ),\n", + " connector_config=SimpleLocalConfig(\n", + " input_path=\"./documents\",\n", + " # whether to get the documents recursively from given directory\n", + " recursive=False,\n", + " ),\n", + " )\n", + "runner.run()\n" + ], + "metadata": { + "id": "-cE2mU_b_q7Q", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "e5fc9afb-85d5-4b44-cc21-7217f634f94c" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "INFO: NumExpr defaulting to 2 threads.\n" + ] + }, + { + "output_type": "stream", + "name": "stderr", + "text": [ + "2024-06-04 13:08:20,411 MainProcess INFO running pipeline: DocFactory -> Reader -> Partitioner -> Copier with config: {\"reprocess\": false, \"verbose\": true, \"work_dir\": \"/root/.cache/unstructured/ingest/pipeline\", \"output_dir\": \"./local-ingest-output\", \"num_processes\": 2, \"raise_on_error\": false}\n", + "2024-06-04 13:08:20,554 MainProcess INFO Running doc factory to generate ingest docs. Source connector: {\"processor_config\": {\"reprocess\": false, \"verbose\": true, \"work_dir\": \"/root/.cache/unstructured/ingest/pipeline\", \"output_dir\": \"./local-ingest-output\", \"num_processes\": 2, \"raise_on_error\": false}, \"read_config\": {\"download_dir\": \"\", \"re_download\": false, \"preserve_downloads\": false, \"download_only\": false, \"max_docs\": null}, \"connector_config\": {\"input_path\": \"./documents\", \"recursive\": false, \"file_glob\": null}}\n", + "2024-06-04 13:08:20,577 MainProcess INFO processing 4 docs via 2 processes\n", + "2024-06-04 13:08:20,581 MainProcess INFO Calling Reader with 4 docs\n", + "2024-06-04 13:08:20,583 MainProcess INFO Running source node to download data associated with ingest docs\n", + "2024-06-04 13:08:23,632 MainProcess INFO Calling Partitioner with 4 docs\n", + "2024-06-04 13:08:23,633 MainProcess INFO Running partition node to extract content from json files. 
Config: {\"pdf_infer_table_structure\": false, \"strategy\": \"auto\", \"ocr_languages\": null, \"encoding\": null, \"additional_partition_args\": {}, \"skip_infer_table_types\": null, \"fields_include\": [\"element_id\", \"text\", \"type\", \"metadata\", \"embeddings\"], \"flatten_metadata\": false, \"metadata_exclude\": [], \"metadata_include\": [], \"partition_endpoint\": \"https://api.unstructured.io/general/v0/general\", \"partition_by_api\": true, \"api_key\": \"*******\", \"hi_res_model_name\": null}, partition kwargs: {}]\n", + "2024-06-04 13:08:23,637 MainProcess INFO Creating /root/.cache/unstructured/ingest/pipeline/partitioned\n", + "2024-06-04 13:09:41,468 MainProcess INFO Calling Copier with 4 docs\n", + "2024-06-04 13:09:41,469 MainProcess INFO Running copy node to move content to desired output location\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Let's take a closer look at the configs that we have here.\n", + "\n", + "`ProcessorConfig` controls various aspects of the processing pipeline, including output locations, number of workers, error handling behavior, logging verbosity and more. The only mandatory parameter here is the `output_dir` - the local directory where you want to store the outputs.\n", + "\n", + "`ReadConfig` can be used to customize the data reading process for different scenarios, such as re-downloading data, preserving downloaded files, or limiting the number of documents processed. In most cases the default `ReadConfig` will work.\n", + "\n", + "In the `PartitionConfig` you can choose whether to partition the documents locally or via API. This example uses API, and for this reason requires Unstructured API key. You can get yours [here](https://unstructured.io/api-key-free). The free Unstructured API is capped at 1000 pages, and offers better OCR models for image-based documents than a local installation of Unstructured.\n", + "If you remove these two parameters, the documents will be processed locally, but you may need to install additional dependencies if the documents require OCR and/or document understanding models. Namely, you may need to install poppler and tesseract in this case, which you can get with brew:\n", + "\n", + "```\n", + "!brew install poppler\n", + "!brew install tesseract\n", + "```\n", + "\n", + "If you're on Windows, you can find alternative installation instructions in the [Unstructured docs](https://docs.unstructured.io/open-source/installation/full-installation). \n", + "\n", + "Finally, in the `SimpleLocalConfig` you need to specify where your original documents reside, and whether you want to walk through the directory recursively." + ], + "metadata": { + "id": "68WTKNVSzgVw" + } + }, + { + "cell_type": "markdown", + "source": [ + "Once the documents are processed you'll find 4 json files in the `local-ingest-output` directory, one per document that was processed.\n", + "Unstructured partitions all types of documents in a uniform manner, and returns json with document elements.\n", + "\n", + "[Document elements](https://docs.unstructured.io/api-reference/api-services/document-elements) have a type, e.g. `NarrativeText`, `Title`, or `Table`, they contain the extracted text, and metadata that Unstructured was able to obtain. Some metadata is common for all elements, such as filename of the document the element is from. Other metadata depends on file type or element type. 
For example, a `Table` element will contain the table's representation as HTML in its metadata, while metadata for emails will contain information about senders and recipients.\n",
+ "\n",
+ "Let's import the element objects from these JSON files."
+ ],
+ "metadata": {
+ "id": "AJ4TbyjDTvJG"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from unstructured.staging.base import elements_from_json\n",
+ "\n",
+ "elements = []\n",
+ "\n",
+ "for filename in os.listdir(output_path):\n",
+ "    filepath = os.path.join(output_path, filename)\n",
+ "    elements.extend(elements_from_json(filepath))"
+ ],
+ "metadata": {
+ "id": "SFYTNEV3Toi5"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Now that you have extracted the elements from the documents, you can chunk them to fit the context window of the embeddings model."
+ ],
+ "metadata": {
+ "id": "NNxdUhBpgEP0"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Chunking\n",
+ "\n",
+ "If you are familiar with chunking methods that split long text documents into smaller chunks, you'll notice that Unstructured's chunking methods differ slightly, since the partitioning step already divides an entire document into its structural elements: titles, list items, tables, text, etc. By partitioning documents this way, you avoid a situation where unrelated pieces of text end up in the same element, and then in the same chunk.\n",
+ "\n",
+ "When you chunk the document elements with Unstructured, individual elements are already small, so they will only be split if they exceed the desired maximum chunk size; otherwise, they remain as is. You can also optionally combine consecutive text elements, such as individual list items, that will together fit within the chunk size limit.\n"
+ ],
+ "metadata": {
+ "id": "7Qkqf-1vcHkj"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from unstructured.chunking.title import chunk_by_title\n",
+ "\n",
+ "chunked_elements = chunk_by_title(\n",
+ "    elements,\n",
+ "    # maximum chunk size\n",
+ "    max_characters=512,\n",
+ "    # You can choose to combine consecutive elements that are too small,\n",
+ "    # e.g. individual list items\n",
+ "    combine_text_under_n_chars=200,\n",
+ ")"
+ ],
+ "metadata": {
+ "collapsed": true,
+ "id": "b5TQXKevflgD"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "The chunks are ready for RAG. To use them with LangChain, you can easily convert Unstructured elements to LangChain documents."
+ ],
+ "metadata": {
+ "id": "oqLV_c58UccF"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from langchain_core.documents import Document\n",
+ "\n",
+ "documents = []\n",
+ "for chunked_element in chunked_elements:\n",
+ "    metadata = chunked_element.metadata.to_dict()\n",
+ "    # LangChain expects a \"source\" field in the metadata for provenance\n",
+ "    metadata[\"source\"] = metadata[\"filename\"]\n",
+ "    # drop list-valued metadata that the vector store can't index\n",
+ "    del metadata[\"languages\"]\n",
+ "    documents.append(Document(page_content=chunked_element.text, metadata=metadata))"
+ ],
+ "metadata": {
+ "id": "PXL6O-mqUeQA"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## Setting up the retriever"
+ ],
+ "metadata": {
+ "id": "QC_wbI0khYrS"
+ }
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "This example uses ChromaDB as a vector store and the [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) embeddings model. Feel free to use any other vector store or embeddings model."
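+ ,
+ "\n",
+ "Once the retriever exists, it's worth sanity-checking it in isolation before wiring up the full chain. A minimal sketch, assuming the cell below has already run (the query string is just an example, and `get_relevant_documents` comes from LangChain's retriever interface):\n",
+ "\n",
+ "```python\n",
+ "# Sketch: inspect what the retriever returns for a sample query.\n",
+ "for doc in retriever.get_relevant_documents(\"How do I control aphids?\"):\n",
+ "    print(doc.metadata[\"source\"], \"->\", doc.page_content[:100])\n",
+ "```"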
+ ],
+ "metadata": {
+ "id": "j-b291hb05zn"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from langchain_community.vectorstores import Chroma\n",
+ "from langchain.embeddings import HuggingFaceEmbeddings\n",
+ "\n",
+ "from langchain.vectorstores import utils as chromautils\n",
+ "\n",
+ "# ChromaDB doesn't support complex metadata, e.g. lists, so we drop it here.\n",
+ "# If you're using a different vector store, you may not need to do this\n",
+ "docs = chromautils.filter_complex_metadata(documents)\n",
+ "\n",
+ "embeddings = HuggingFaceEmbeddings(model_name=\"BAAI/bge-base-en-v1.5\")\n",
+ "vectorstore = Chroma.from_documents(docs, embeddings)\n",
+ "retriever = vectorstore.as_retriever(search_type=\"similarity\", search_kwargs={\"k\": 3})"
+ ],
+ "metadata": {
+ "id": "Z6Nm67BohXF8",
+ "collapsed": true
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "If you plan to use a gated model from the Hugging Face Hub, be it an embeddings or text generation model, you'll need to authenticate yourself with your Hugging Face token, which you can get in your Hugging Face profile's settings."
+ ],
+ "metadata": {
+ "id": "5t8kHHor1DfX"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from huggingface_hub import notebook_login\n",
+ "\n",
+ "notebook_login()"
+ ],
+ "metadata": {
+ "id": "J21Oj3trhinC"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "## RAG with LangChain\n",
+ "\n",
+ "Let's bring everything together and build RAG with LangChain.\n",
+ "In this example, we'll use [`Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) from Meta. To make sure it can run smoothly in the free T4 GPU runtime in Google Colab, you'll need to quantize it."
+ ],
+ "metadata": {
+ "id": "0pYCTJ8s1QJd"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "import torch\n",
+ "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline\n",
+ "from langchain.prompts import PromptTemplate\n",
+ "from langchain.llms import HuggingFacePipeline\n",
+ "from langchain.chains import RetrievalQA"
+ ],
+ "metadata": {
+ "id": "J14vrinjh2N5"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "model_name = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n",
+ "\n",
+ "bnb_config = BitsAndBytesConfig(\n",
+ "    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type=\"nf4\", bnb_4bit_compute_dtype=torch.bfloat16\n",
+ ")\n",
+ "\n",
+ "model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)\n",
+ "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
+ "\n",
+ "# Llama 3 marks the end of a turn with <|eot_id|> in addition to the EOS token\n",
+ "terminators = [\n",
+ "    tokenizer.eos_token_id,\n",
+ "    tokenizer.convert_tokens_to_ids(\"<|eot_id|>\")\n",
+ "]\n",
+ "\n",
+ "text_generation_pipeline = pipeline(\n",
+ "    model=model,\n",
+ "    tokenizer=tokenizer,\n",
+ "    task=\"text-generation\",\n",
+ "    temperature=0.2,\n",
+ "    do_sample=True,\n",
+ "    repetition_penalty=1.1,\n",
+ "    return_full_text=False,\n",
+ "    max_new_tokens=200,\n",
+ "    eos_token_id=terminators,\n",
+ ")\n",
+ "\n",
+ "llm = HuggingFacePipeline(pipeline=text_generation_pipeline)\n",
+ "\n",
+ "prompt_template = \"\"\"\n",
+ "<|start_header_id|>user<|end_header_id|>\n",
+ "You are an assistant for answering questions using provided context.\n",
+ "You are given the extracted parts of a long document and a question. 
Provide a conversational answer.\n", + "If you don't know the answer, just say \"I do not know.\" Don't make up an answer.\n", + "Question: {question}\n", + "Context: {context}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n", + "\"\"\"\n", + "\n", + "prompt = PromptTemplate(\n", + " input_variables=[\"context\", \"question\"],\n", + " template=prompt_template,\n", + ")\n", + "\n", + "\n", + "qa_chain = RetrievalQA.from_chain_type(\n", + " llm,\n", + " retriever=retriever,\n", + " chain_type_kwargs={\"prompt\": prompt}\n", + ")" + ], + "metadata": { + "id": "tLe4Y3aBh4A3", + "collapsed": true + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Results and next steps\n", + "\n", + "Now that you have your RAG chain, let's ask it about aphids. Are they a pest in my garden?" + ], + "metadata": { + "id": "_hvjRpOe1qYp" + } + }, + { + "cell_type": "code", + "source": [ + "question = \"Are aphids a pest?\"\n", + "\n", + "qa_chain.invoke(question)['result']" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 89 + }, + "id": "whll1qGuyDnC", + "outputId": "31ca901b-bae7-487a-88c6-1d245ef6cdfb" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "\"Yes, aphids are considered pests because they feed on the nutrient-rich liquids within plants, causing damage and potentially spreading disease. In fact, they're known to multiply quickly, which is why it's essential to control them promptly. As mentioned in the text, aphids can also attract ants, which are attracted to the sweet, sticky substance they produce called honeydew. So, yes, aphids are indeed a pest that requires attention to prevent further harm to your plants!\"" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "string" + } + }, + "metadata": {}, + "execution_count": 12 + } + ] + }, + { + "cell_type": "markdown", + "source": [ + "Output:\n", + "\n", + "```bash\n", + "Yes, aphids are considered pests because they feed on the nutrient-rich liquids within plants, causing damage and potentially spreading disease. In fact, they're known to multiply quickly, which is why it's essential to control them promptly. As mentioned in the text, aphids can also attract ants, which are attracted to the sweet, sticky substance they produce called honeydew. So, yes, aphids are indeed a pest that requires attention to prevent further harm to your plants!\n", + "```" + ], + "metadata": { + "id": "CYWNJ9DGVkg0" + } + }, + { + "cell_type": "markdown", + "source": [ + "This looks like a promising start! Now that you know the basics of preprocessing complex unstructured data for RAG, you can continue improving upon this example. Here are some ideas:\n", + "\n", + "* You can connect to a different source to ingest the documents from, for example, an S3 bucket.\n", + "* You can add `return_source_documents=True` in the `qa_chain` arguments to make the chain return the documents that were passed to the prompt as context. 
This can be useful to understand which sources were used to generate the answer.\n",
+ "* If you want to leverage the elements' metadata at the retrieval stage, consider using Hugging Face agents and creating a custom retriever tool as described in [this other notebook](https://huggingface.co/learn/cookbook/agents#2--rag-with-iterative-query-refinement--source-selection).\n",
+ "* There are many things you could do to improve search results. For instance, you could use hybrid search instead of a single similarity-search retriever. Hybrid search combines multiple search algorithms to improve the accuracy and relevance of search results; typically, it combines keyword-based search algorithms with vector search methods.\n",
+ "\n",
+ "Have fun building RAG applications with your unstructured data!"
+ ],
+ "metadata": {
+ "id": "bOh5z28I10Te"
+ }
+ }
+ ]
+}