Merged
8 changes: 7 additions & 1 deletion notebooks/document-chunking/tokenization.ipynb
@@ -15,7 +15,13 @@
"\n",
"For users of Elasticsearch it is important to know how texts are broken up into tokens because currently only the [first 512 tokens per field](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512) are considered. This means that when you index longer texts, all tokens after the 512th are ignored in your semantic search. Hence it is valuable to know the number of tokens for your input texts before choosing the right model and indexing method.\n",
"\n",
"Currently it is not possible to get the token count information via the API, so here we share the code for calculating token counts. This notebook also shows how to break longer text up into chunks of the right size so that no information is lost during indexing. Currently (as of version 8.12) this has to be done by the user. Future versions will remove this necessity and Elasticsearch will automatically create chunks behind the scenes."
"Currently it is not possible to get the token count information via the API, so here we share the code for calculating token counts. This notebook also shows how to break longer text up into chunks of the right size so that no information is lost during indexing.\n",
"\n",
"# Prefer the `semantic_text` field type\n",
"\n",
    "**Elasticsearch version 8.14 introduced the [`semantic_text`](https://www.elastic.co/guide/en/elasticsearch/reference/master/semantic-text.html) field type, which handles the chunking process behind the scenes. Before continuing with this notebook, we highly recommend looking into it:**\n",
"\n",
"**<https://www.elastic.co/search-labs/blog/semantic-search-simplified-semantic-text>**"
@maxjakob (Contributor, Author) commented on Sep 17, 2024:
@SharonRosencwaig1 @ElishevaStern Do we usually add tracking parameters for links from notebooks to Search Labs posts?

]
},
{
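The notebook text in the hunk above explains that only the first 512 tokens per field are considered and that, before `semantic_text`, longer texts had to be chunked by the user. A minimal sketch of that chunking step (not taken from the notebook itself), operating on an already-tokenized text:

```python
# Sketch (illustrative, not the notebook's code): split an already-tokenized
# text into windows that fit the 512-token limit, with a small overlap so
# content cut at a window boundary still appears whole in one of the chunks.
# `tokens` stands in for the IDs produced by the model's tokenizer.

def chunk_tokens(tokens, max_len=512, overlap=50):
    if max_len <= overlap:
        raise ValueError("max_len must be larger than overlap")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the input.
        if start + max_len >= len(tokens):
            break
    return chunks
```

For an input of 1,200 tokens this yields three chunks of at most 512 tokens each, with 50 tokens shared between neighbouring chunks.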
8 changes: 7 additions & 1 deletion notebooks/document-chunking/with-index-pipelines.ipynb
@@ -13,7 +13,13 @@
"This interactive notebook will:\n",
    "- load the model \"sentence-transformers__all-minilm-l6-v2\" from Hugging Face into an Elasticsearch ML node\n",
"- create an index and ingest pipeline that will chunk large fields into smaller passages and vectorize them using the model\n",
"- perform a search and return docs with the most relevant passages"
"- perform a search and return docs with the most relevant passages\n",
"\n",
"# Prefer the `semantic_text` field type\n",
"\n",
    "**Elasticsearch version 8.14 introduced the [`semantic_text`](https://www.elastic.co/guide/en/elasticsearch/reference/master/semantic-text.html) field type, which handles the chunking process behind the scenes. Before continuing with this notebook, we highly recommend looking into it:**\n",
"\n",
"**<https://www.elastic.co/search-labs/blog/semantic-search-simplified-semantic-text>**"
]
},
{
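The note added in this hunk points readers to `semantic_text`, which moves chunking server-side. As a rough sketch of what using it looks like on the mapping side (the index name and the inference endpoint id here are hypothetical, not from the PR):

```python
# Sketch: an index mapping that delegates chunking to Elasticsearch via the
# semantic_text field type (8.14+). "my-elser-endpoint" is a hypothetical
# inference endpoint id that would have to be created beforehand.
mapping = {
    "mappings": {
        "properties": {
            "content": {
                "type": "semantic_text",
                "inference_id": "my-elser-endpoint",
            }
        }
    }
}

# With the elasticsearch-py client this would be applied roughly as:
# es.indices.create(index="my-index", mappings=mapping["mappings"])
```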
8 changes: 7 additions & 1 deletion notebooks/document-chunking/with-langchain-splitters.ipynb
@@ -12,7 +12,13 @@
"This interactive notebook will:\n",
    "- load the model \"sentence-transformers__all-minilm-l6-v2\" from Hugging Face into an Elasticsearch ML node\n",
    "- use LangChain splitters to chunk the passages into sentences and index them into Elasticsearch with nested dense vectors\n",
"- perform a search and return docs with the most relevant passages"
"- perform a search and return docs with the most relevant passages\n",
"\n",
"# Prefer the `semantic_text` field type\n",
"\n",
    "**Elasticsearch version 8.14 introduced the [`semantic_text`](https://www.elastic.co/guide/en/elasticsearch/reference/master/semantic-text.html) field type, which handles the chunking process behind the scenes. Before continuing with this notebook, we highly recommend looking into it:**\n",
"\n",
"**<https://www.elastic.co/search-labs/blog/semantic-search-simplified-semantic-text>**"
]
},
{
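This third notebook's intro mentions indexing chunks into Elasticsearch as nested dense vectors. A sketch of the document and mapping shape that implies, assuming the 384-dimensional output of all-MiniLM-L6-v2 (field names are illustrative, not taken from the notebook):

```python
# Sketch: nested passages, each chunk carrying its own dense vector.
# all-MiniLM-L6-v2 produces 384-dimensional embeddings; the zero vectors
# below are placeholders for real model output.
DIMS = 384

doc = {
    "title": "Example document",
    "passages": [
        {"text": "First chunk of the original text.", "vector": [0.0] * DIMS},
        {"text": "Second chunk of the original text.", "vector": [0.0] * DIMS},
    ],
}

mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "passages": {
                "type": "nested",
                "properties": {
                    "text": {"type": "text"},
                    "vector": {"type": "dense_vector", "dims": DIMS},
                },
            },
        }
    }
}
```

The `nested` type keeps each passage's text and vector paired during search, so a hit can be traced back to the specific chunk that matched.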