What is the bug?
The text chunking processor in an ingest pipeline, when connected to an external embedding model such as nvidia/nv-embedqa-mistral-7b-v2, does not send data to the external model correctly. Instead of sending each chunk individually, it sends the whole list of chunks under the input key of the payload to the remote model. The external model sees that payload as a single token stream and complains that the token length exceeds its limit.
How can one reproduce the bug?
Here is the connector definition:
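(A rough sketch only: the following shows what an ML Commons HTTP connector for an NVIDIA-style embeddings endpoint might look like. The endpoint URL, parameter names, and request_body are illustrative assumptions, not the reporter's actual connector.)

CONNECTOR_BODY = {
    "name": "nvidia-embedding-connector",
    "description": "Connector to the NVIDIA embeddings API",
    "version": 1,
    "protocol": "http",
    "parameters": {
        "model": "nvidia/nv-embedqa-mistral-7b-v2",
        "input_type": "passage"
    },
    "credential": {
        "api_key": "<NVIDIA_API_KEY>"  # placeholder credential
    },
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "url": "https://integrate.api.nvidia.com/v1/embeddings",  # assumed endpoint
            "headers": {"Authorization": "Bearer ${credential.api_key}"},
            # ${parameters.input} is filled in by ML Commons with the text to embed;
            # the bug is that it arrives as the full list of chunks in one request.
            "request_body": "{ \"input\": ${parameters.input}, \"model\": \"${parameters.model}\", \"input_type\": \"${parameters.input_type}\" }"
        }
    ]
}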
Pipeline definition is
NS_PIPE_LINE_BODY = {
    "description": "Pipeline for generating embeddings with a neural model, coupled with a text chunking pre-processor",
    "processors": [
        {
            "text_chunking": {
                "algorithm": {
                    "fixed_token_length": {
                        "token_limit": 500,
                        "overlap_rate": 0.2,
                        "tokenizer": "standard"
                    }
                },
                "field_map": {
                    "text": "text_chunks"
                }
            }
        },
        {
            "text_embedding": {
                "field_map": {
                    "text_chunks": "embeddings"
                },
                "batch_size": 1
            }
        },
        {
            "script": {
                "source": """
                    if (ctx.text_chunks != null && ctx.embeddings != null) {
                        ctx.nested_chunks_embeddings = [];
                        // Pair each chunk with its embedding as a nested object.
                        // Note: Painless lists use size(), not .length.
                        for (int i = 0; i < ctx.text_chunks.size(); i++) {
                            ctx.nested_chunks_embeddings.add(
                                ['chunk': ctx.text_chunks[i], 'embedding': ctx.embeddings[i].knn]
                            );
                        }
                    }
                    ctx.remove('text_chunks');
                    ctx.remove('embeddings');
                """
            }
        }
    ]
}
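The pipeline body above is registered under the name referenced by default_pipeline in the index settings below. A minimal sketch using the Python opensearch-py client (host, port, and the absence of auth are assumptions for illustration):

from opensearchpy import OpenSearch

# Assumed local, unauthenticated cluster for illustration.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Register the ingest pipeline under the name used by default_pipeline.
client.ingest.put_pipeline(id="neural-search-pipeline", body=NS_PIPE_LINE_BODY)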
Index definition is
INDEX_SETTINGS = {
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "knn": True,  # Enable k-NN vector indexing for this index
            "default_pipeline": "neural-search-pipeline"
        }
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "nested_chunks_embeddings": {
                "type": "nested",
                "properties": {
                    "chunk": {"type": "text"},
                    "embedding": {
                        "type": "knn_vector",  # Vector type field
                        "dimension": int(CSS_EMBEDDING_OPENAI_DIMENSION),  # Number of dimensions from the embedding model
                        "method": {
                            "name": "hnsw",  # HNSW graph method for vector search
                            "space_type": "l2",  # Euclidean distance for similarity
                            "engine": "lucene"  # Use Lucene as the vector search engine
                        }
                    }
                }
            }
        }
    }
}
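To trigger the bug, create the index and ingest a document long enough to be split into several 500-token chunks. A sketch, reusing the client from the registration snippet above ("docs" is a hypothetical index name, and CSS_EMBEDDING_OPENAI_DIMENSION must be set to the remote model's embedding dimension beforehand):

# The default_pipeline setting routes ingested documents through
# neural-search-pipeline automatically.
client.indices.create(index="docs", body=INDEX_SETTINGS)

# Any sufficiently long text works; it just needs to produce multiple chunks.
long_text = "word " * 5000
client.index(index="docs", body={"text": long_text}, refresh=True)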
What is the expected behavior?
The expected behaviour is to send each chunk individually to the external model, receive its embedding, and pass the resulting array of embeddings on to the post-processor (the script processor) in the ingest pipeline.
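Concretely, instead of one request carrying the whole list, the remote model would receive one request per chunk, along the lines of (the input field name is an assumption carried over from the payload description above):

{ "input": ["chunk 1 text ..."] }
{ "input": ["chunk 2 text ..."] }
{ "input": ["chunk 3 text ..."] }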
What is your host/environment?
OpenSearch: 2.17
I was able to reproduce the error. It seems there is a potential bug in the text chunking/embedding processors for remote models; I will look deeper into this issue.
Hi @layavadi, I did some debugging on my side and found that decreasing the token limit fixes the error. When the token limit is decreased, the text chunking and embedding work properly. Can you try changing the token_limit field in your pipeline definition to a smaller value (e.g., 100)?
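For reference, only the text_chunking processor's algorithm block in the pipeline definition above needs to change:

{
    "text_chunking": {
        "algorithm": {
            "fixed_token_length": {
                "token_limit": 100,  # reduced from 500
                "overlap_rate": 0.2,
                "tokenizer": "standard"
            }
        },
        "field_map": {
            "text": "text_chunks"
        }
    }
}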