Merge pull request #41 from abhikvarma/fix-charactertextsplitter-chunking

use RecursiveCharacterTextSplitter to better split docs
huggingface · Feb 21, 2024 · 9e8806d
2 parents 598aa52 + df59e3f
Showing 1 changed file with 3 additions and 7 deletions.
diff --git a/notebooks/en/rag_zephyr_langchain.ipynb b/notebooks/en/rag_zephyr_langchain.ipynb
@@ -140,11 +140,7 @@
       "source": [
         "The content of individual GitHub issues may be longer than what an embedding model can take as input. If we want to embed all of the available content, we need to chunk the documents into appropriately sized pieces.\n",
         "\n",
-        "The most common and straightforward approach to chunking is to define a fixed size of chunks and whether there should be any overlap between them. Keeping some overlap between chunks allows us to preserve some semantic context between the chunks.\n",
-        "\n",
-        "Other approaches are typically more involved and take into account the documents' structure and context. For example, one may want to split a document based on sentences or paragraphs, or create chunks based on the\n",
-        "\n",
-        "The fixed-size chunking, however, works well for most common cases, so that is what we'll do here."
+        "The most common and straightforward approach to chunking is to define a fixed size of chunks and whether there should be any overlap between them. Keeping some overlap between chunks allows us to preserve some semantic context between the chunks. The recommended splitter for generic text is the [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter), and that's what we'll use here. "
       ]
     },
     {
@@ -155,9 +151,9 @@
       },
       "outputs": [],
       "source": [
-        "from langchain.text_splitter import CharacterTextSplitter\n",
+        "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
         "\n",
-        "splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=30)\n",
+        "splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)\n",
         "\n",
         "chunked_docs = splitter.split_documents(docs)"
       ]
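
Note on the change: the motivation for swapping splitters can be illustrated with a small pure-Python sketch. It is a simplified model and not LangChain's implementation: the function names `split_single_separator` and `split_recursive` are hypothetical, `chunk_overlap` is left out for brevity, and the separator list `("\n\n", "\n", " ", "")` is assumed to match the recursive splitter's defaults. The idea is that a splitter with a single separator cannot break a piece that contains no such separator, while a recursive splitter falls back to finer separators until each chunk fits.

```python
# Simplified illustration only; this is NOT LangChain's implementation.
# Overlap between chunks is intentionally omitted.

def split_single_separator(text, chunk_size, separator="\n\n"):
    """Split on one separator and greedily merge pieces up to chunk_size.

    A piece that is itself longer than chunk_size is emitted unchanged,
    because there is no finer separator to fall back on.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def split_recursive(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Try separators from coarse to fine until every chunk fits."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is still too long: recurse with the finer separators.
            chunks.extend(split_recursive(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


# A made-up issue body: one long paragraph with no blank lines.
issue_body = " ".join(["word"] * 300)

single = split_single_separator(issue_body, chunk_size=512)
recursive = split_recursive(issue_body, chunk_size=512)

print(len(single), max(len(c) for c in single))
print(len(recursive), max(len(c) for c in recursive))
```

In this sketch the single-separator splitter returns the whole paragraph as one oversized chunk, while the recursive version keeps every chunk within the 512-character limit. The real `RecursiveCharacterTextSplitter` additionally handles overlap and other options not modeled here.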