Commit
Merge pull request #41 from abhikvarma/fix-charactertextsplitter-chun…
…king

use RecursiveCharacterTextSplitter to better split docs
MKhalusova committed Feb 21, 2024
2 parents 598aa52 + df59e3f commit 9e8806d
Showing 1 changed file with 3 additions and 7 deletions.
10 changes: 3 additions & 7 deletions notebooks/en/rag_zephyr_langchain.ipynb
@@ -140,11 +140,7 @@
"source": [
"The content of individual GitHub issues may be longer than what an embedding model can take as input. If we want to embed all of the available content, we need to chunk the documents into appropriately sized pieces.\n",
"\n",
-"The most common and straightforward approach to chunking is to define a fixed size of chunks and whether there should be any overlap between them. Keeping some overlap between chunks allows us to preserve some semantic context between the chunks.\n",
-"\n",
-"Other approaches are typically more involved and take into account the documents' structure and context. For example, one may want to split a document based on sentences or paragraphs, or create chunks based on the\n",
-"\n",
-"The fixed-size chunking, however, works well for most common cases, so that is what we'll do here."
+"The most common and straightforward approach to chunking is to define a fixed size of chunks and whether there should be any overlap between them. Keeping some overlap between chunks allows us to preserve some semantic context between the chunks. The recommended splitter for generic text is the [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter), and that's what we'll use here. "
]
},
{
@@ -155,9 +151,9 @@
},
"outputs": [],
"source": [
-"from langchain.text_splitter import CharacterTextSplitter\n",
+"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
-"splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=30)\n",
+"splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)\n",
"\n",
"\n",
"chunked_docs = splitter.split_documents(docs)"
]
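
The motivation for this change can be illustrated without LangChain. The sketch below is a simplified pure-Python model, not LangChain's implementation. It assumes, as in LangChain's defaults, that CharacterTextSplitter splits on the single separator "\n\n", while RecursiveCharacterTextSplitter falls back through "\n\n", "\n", " ", and "" until the pieces fit. The function names are hypothetical, and chunk overlap is left out to keep the example short.

```python
def split_on_single_separator(text, chunk_size, separator="\n\n"):
    """Split on one separator only, then greedily merge pieces up to chunk_size.

    A piece that is longer than chunk_size and contains no separator
    cannot be broken down further, so it is emitted as one oversized chunk.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def split_recursively(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Try each separator in turn; recurse into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        if len(piece) > chunk_size:
            # Flush what we have, then split the long piece with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(split_recursively(piece, chunk_size, remaining))
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# A short paragraph followed by one long paragraph with no blank lines inside it.
sample_text = "short intro\n\n" + " ".join(["word"] * 200)

single_chunks = split_on_single_separator(sample_text, chunk_size=100)
recursive_chunks = split_recursively(sample_text, chunk_size=100)

print("single separator, longest chunk:", max(len(c) for c in single_chunks))
print("recursive fallback, longest chunk:", max(len(c) for c in recursive_chunks))
```

In this model the single-separator splitter returns the long paragraph as one chunk far above the requested size, whereas the recursive version keeps every chunk within the limit. That is the behavior this commit relies on when it swaps the splitter for the GitHub issue documents.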