[BUG] Text Chunking processor with Embedding doesn't work for nVIDIA model nvidia/nv-embedqa-mistral-7b-v2 #3383

Open
layavadi opened this issue Jan 13, 2025 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@layavadi

layavadi commented Jan 13, 2025

What is the bug?
When an ingest pipeline combines the Text Chunking processor with an external embedding model such as nvidia/nv-embedqa-mistral-7b-v2, the data is not sent to the external model correctly. Instead of sending each chunk to the external model individually, the pipeline sends the whole list of chunks under the input key, as shown below.

Original payload sent to the remote model:

{
    "model": "nvidia/nv-embedqa-mistral-7b-v2",
    "input": [
        "Cloudera Operational Database . Accessing Data Date published: 2020-08-14 Date modified: 2023-01-12 https://docs.cloudera.com/Legal Notice © Cloudera Inc. 2024. All rights reserved. The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property rights. No license under copyright or any other intellectual property right is granted herein. Unless otherwise noted, scripts and sample code are licensed under the Apache License, Version 2.0. Copyright information for Cloudera software may be found within the documentation accompanying each component in a particular release. Cloudera software includes software from various open source or other third party projects, and may be released under the Apache Software License 2.0 (“ASLv2”), the Affero General Public License version 3 (AGPLv3), or other license terms. Other software included may be released under the terms of alternative open source licenses. Please review the license and notice files accompanying the software for additional licensing information. Please visit the Cloudera software product page for more information on Cloudera software. For more information on Cloudera support services, please visit either the Support or Sales page. Feel free to contact us directly to discuss your specific needs. Cloudera reserves the right to change any products at any time, and without notice. Cloudera assumes no responsibility nor liability arising from the use of products, except as expressly agreed to in writing by Cloudera. Cloudera, Cloudera Altus, HUE, Impala, Cloudera Impala, and other Cloudera marks are registered or unregistered trademarks in the United States and other countries. All other trademarks are the property of their respective owners. 
Disclaimer: EXCEPT AS EXPRESSLY PROVIDED IN A WRITTEN AGREEMENT WITH CLOUDERA, CLOUDERA DOES NOT MAKE NOR GIVE ANY REPRESENTATION, WARRANTY, NOR COVENANT OF ANY KIND, WHETHER EXPRESS OR IMPLIED, IN CONNECTION WITH CLOUDERA TECHNOLOGY OR RELATED SUPPORT PROVIDED IN CONNECTION THEREWITH. CLOUDERA DOES NOT WARRANT THAT CLOUDERA PRODUCTS NOR SOFTWARE WILL OPERATE UNINTERRUPTED NOR THAT IT WILL BE FREE FROM DEFECTS NOR ERRORS, THAT IT WILL PROTECT YOUR DATA FROM LOSS, CORRUPTION NOR UNAVAILABILITY, NOR THAT IT WILL MEET ALL OF CUSTOMER’S BUSINESS REQUIREMENTS. WITHOUT LIMITING THE FOREGOING, AND TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, CLOUDERA EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, NON-INFRINGEMENT, TITLE, AND FITNESS FOR A PARTICULAR PURPOSE AND ANY REPRESENTATION, WARRANTY, OR COVENANT BASED ON COURSE OF DEALING OR USAGE IN TRADE.Cloudera Operational Database | Contents | iii Contents Accessing Hue from COD .4 Accessing HBase REST API from COD .6 Accessing SQLLine from COD .7Cloudera Operational Database Accessing Hue from COD Accessing Hue from COD Hue is a web-based interactive SQL editor that enables you to interact with data stored in Cloudera Operational Database (COD). You can access the Hue user interface from the COD web user interface to create and browse HBase tables. Procedure 1.Click Databases , and then select a database from the list. 2.Click Hue. Example You can use Hue to quickly browse large tables, create new tables, add data, ",
        "Contents | iii Contents Accessing Hue from COD .4 Accessing HBase REST API from COD .6 Accessing SQLLine from COD .7Cloudera Operational Database Accessing Hue from COD Accessing Hue from COD Hue is a web-based interactive SQL editor that enables you to interact with data stored in Cloudera Operational Database (COD). You can access the Hue user interface from the COD web user interface to create and browse HBase tables. Procedure 1.Click Databases , and then select a database from the list. 2.Click Hue. Example You can use Hue to quickly browse large tables, create new tables, add data, modify existing cells, and also filter data using the auto-complete search. 4Cloudera Operational Database Accessing Hue from COD Related Information Hue Overview Using Hue Use the Hue HBase app 5Cloudera Operational Database Accessing HBase REST API from COD Accessing HBase REST API from COD You can use the Apache HBase REST server to interact with Cloudera Operational Database (COD). Interactions happen using URLs and the REST API. REST uses HTTP to perform various actions, and this makes it easy to interface with COD using a wide range of programming languages. Procedure 1.Click Databases , and then select a database from the list. 2.Click Connect HBase REST . 3.From the HBase REST Server URL field, copy the URL to the HBase REST Server to connect to the selected database. What to do next Use the HBase REST API to interact with the HBase services, tables, and regions using HTTP endpoints. You can create tables, delete tables, and perform other operations that have the REST endpoints. For more information, see Using the HBase REST API using the link in the related information section. 
Related Information Use the HBase REST server Using the REST API 6Cloudera Operational Database Accessing SQLLine from COD Accessing SQLLine from COD SQLLine is a command-line utility included with Cloudera Operational Database (COD) that enables you to connect and execute SQL commands using Phoenix from an edge node. Procedure 1.Download the client JAR files for your thick or thin clients using this URL syntax. You can get the [*** Phoenix Maven URL ***], [*** Phoenix_(Thin)_Client_Version ***], and the [***Phoenix_(Thick)_Client_Version ***] information from the Database connectivity page. The URL is in the following format: URL for the Phoenix thick client in Cloudera Runtime 7.2.9 (environment) and higher: [***Phoenix MAVEN URL ***]/org/apache/phoenix/phoenix-client-hbase-2.2/ [***Phoenix THICK CLIENT VERSION ***]/phoenix-client-hbase-2.2-[*** Phoenix THICK CLIENT VERSION ***].jar URL for the Phoenix thick client in Cloudera Runtime 7.2.8 (environment) and lower: [***Phoenix MAVEN URL ***]/org/apache/phoenix/phoenix-client/ [***Phoenix THICK CLIENT VERSION***]/phoenix-client-[*** Phoenix THICK CLIENT VERSION ***].jar For the Phoenix thin client: [***Phoenix MAVEN URL ***]/org/apache/phoenix/phoenix-queryserver-client/ [***Phoenix THIN CLIENT VERSION ***]/phoenix-queryserver-client-[*** Phoenix THIN CLIENT VERSION ***].jar You can use Maven to download the Phoenix client JAR files. If you only need the JAR files for SQLLine connection, you can use the curl tool to download ",
        "Phoenix THICK CLIENT VERSION ***]/phoenix-client-hbase-2.2-[*** Phoenix THICK CLIENT VERSION ***].jar URL for the Phoenix thick client in Cloudera Runtime 7.2.8 (environment) and lower: [***Phoenix MAVEN URL ***]/org/apache/phoenix/phoenix-client/ [***Phoenix THICK CLIENT VERSION***]/phoenix-client-[*** Phoenix THICK CLIENT VERSION ***].jar For the Phoenix thin client: [***Phoenix MAVEN URL ***]/org/apache/phoenix/phoenix-queryserver-client/ [***Phoenix THIN CLIENT VERSION ***]/phoenix-queryserver-client-[*** Phoenix THIN CLIENT VERSION ***].jar You can use Maven to download the Phoenix client JAR files. If you only need the JAR files for SQLLine connection, you can use the curl tool to download the JAR files using the following command: curl -L -f -o \"phoenix-client.jar\" \"[*** PHOENIX CLIENT JAR FILE URL ***]\" 7Cloudera Operational Database Accessing SQLLine from COD 2.From the Databases page, download HBase client configuration Zip file using the client configuration URL . Note: You cannot download the client configuration file using a web browser. You must copy and paste the HBase client configuration URL to your CLI to download this file. For example, curl -f -o \"hbase-co nfig.zip\" -u \"csso_[*** username ***]\" \"https://[*** download url ***]\". 3.From the Databases page, copy the JDBC connection URL for the Phoenix (Thick) or Phoenix (Thin) client to use in the next step. 4.Run this command from your CLI: java $PHOENIX_OPTS -cp \"[*** HBase-CONFIGURATION ***]:[***PHOENIX-CLIENT- JAR**]]\" sqlline.SqlLine -d org.apache.phoenix.jdbc.PhoenixDriver -u [***JDBC-CONNECTION-URL ***] -n none -p none --color\u003dtrue --fastConnect\u003d false --verbose\u003dtrue --incremental\u003dfalse --isolation\u003dTRANSACTION_READ_COMMITTED Related Information SQLLine Command Reference 8"
    ],
    "input_type": "query"
}

The external model treats this as a single token stream and complains that the token length exceeds its limit.

How can one reproduce the bug?
Here is the connector definition

CONNECTOR_FOR_EMBEDDING_BODY =  {
            "name": CSS_CONNECTOR_NAME_EMBEDDING,
            "description": "The connector CML Embedding model service ",
            "version": 1,
            "protocol": "http",
            "parameters": {
                "endpoint": CSS_EMBEDDING_OPENAI_ENDPOINT,
                "model": CSS_EMBEDDING_OPENAI_MODEL,
                "api_version": CSS_EMBEDDING_OPENAI_VERSION
            },
            "credential": {
                "openAI_key": CSS_EMBEDDING_OPENAI_KEY
            },
            "actions": [
                {
                    "action_type": "predict",
                    "method": "POST",
                    "url": "${parameters.endpoint}",
                    "headers": {
                        "Authorization" : "Bearer ${credential.openAI_key}",
                        "Content-type": "application/json"
                    },
                    "request_body": "{ \"model\": \"${parameters.model}\", \"input\": ${parameters.input}, \"input_type\": \"query\" }",
                    "pre_process_function": "connector.pre_process.openai.embedding",
                    "post_process_function": "connector.post_process.openai.embedding"
                }
            ]
        }

Pipeline definition is

 NS_PIPE_LINE_BODY =  {
            "description": "Pipeline for generating embeddings with neural model couple with pre-processor for text chunking",
            "processors": [
                {
                    "text_chunking": {
                        "algorithm": {
                            "fixed_token_length": {
                                "token_limit": 500,
                                "overlap_rate": 0.2,
                                "tokenizer": "standard"
                            }
                        },
                        "field_map": {
                            "text": "text_chunks"
                        }
                    }
                },
                {
                    "text_embedding": {
                        "field_map": {
                            "text_chunks": "embeddings"
                        },
                        "batch_size": 1
                    }
                },
                {
                    "script": {
                        "source": """
                        if (ctx.text_chunks != null && ctx.embeddings != null) {
                            ctx.nested_chunks_embeddings = [];
                            for (int i = 0; i < ctx.text_chunks.length; i++) {
                                ctx.nested_chunks_embeddings.add(
                                    ['chunk': ctx.text_chunks[i], 'embedding': ctx.embeddings[i].knn]
                                );
                            }
                        }
                        ctx.remove('text_chunks');
                        ctx.remove('embeddings');
                        """
                    }
                }
            ]
    }
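For readers unfamiliar with the fixed_token_length algorithm, the following is a rough sketch of how a window of token_limit tokens with an overlap_rate could produce the overlapping chunks seen in the payload above. It is an illustration only: it assumes whitespace-separated tokens and an overlap of floor(token_limit * overlap_rate) tokens, and the function name fixed_token_length_chunks is hypothetical. The actual processor uses the configured analyzer (here "standard") and may differ in detail.

```python
import math


def fixed_token_length_chunks(text, token_limit=500, overlap_rate=0.2):
    """Split text into overlapping chunks of at most token_limit tokens.

    Assumptions: tokens are whitespace-separated words, and consecutive
    chunks share floor(token_limit * overlap_rate) tokens.
    """
    tokens = text.split()
    overlap = math.floor(token_limit * overlap_rate)
    step = token_limit - overlap  # how far the window advances each time
    if step <= 0:
        raise ValueError("overlap_rate must leave a positive step size")

    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + token_limit]))
        if start + token_limit >= len(tokens):
            break  # this window already reached the end of the text
        start += step
    return chunks


chunks = fixed_token_length_chunks(
    "a b c d e f g h i j", token_limit=5, overlap_rate=0.2
)
print(chunks)
```

With token_limit 500 and overlap_rate 0.2, this sketch would repeat the last 100 tokens of each chunk at the start of the next one, which matches the repeated text at the chunk boundaries in the payload.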

Index definition is

  INDEX_SETTINGS = {
                "settings": {
                    "index": {
                        "number_of_shards": 1,
                        "number_of_replicas": 0,
                        "knn": True,  # Enable k-Nearest Neighbors search on this index
                        "default_pipeline": "neural-search-pipeline"
                    }
                },
                "mappings": {
                    "properties": {
                        "text": {"type": "text"},
                        "nested_chunks_embeddings": {
                            "type": "nested",
                            "properties" : {
                                "chunk": {"type": "text"},
                                "embedding": {
                                    "type": "knn_vector",  # Vector type field
                                    "dimension": int(CSS_EMBEDDING_OPENAI_DIMENSION),  # Number of dimensions from the embedding model
                                    "method": {
                                        "name": "hnsw",  # Method for the vector search
                                        "space_type": "l2",  # Euclidean distance for similarity
                                        "engine": "lucene"  # Use Lucene as the vector search engine
                                    }
                                }
                            }
                        }
                    }
                }
               
            }

What is the expected behavior?
The expected behavior is to send each chunk individually to the external model, receive its embedding, and pass the array of embeddings to the post-processor in the ingest pipeline.
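To make the expectation concrete, here is a minimal client-side sketch of the request bodies the model would receive if each chunk were sent on its own. The helper name per_chunk_payloads is hypothetical and is not part of the connector or the pipeline. Whether the endpoint expects a one-element list or a bare string in input is also an assumption.

```python
import json


def per_chunk_payloads(model, chunks, input_type="query"):
    """Build one request body per chunk instead of one body for the whole list."""
    return [
        {"model": model, "input": [chunk], "input_type": input_type}
        for chunk in chunks
    ]


payloads = per_chunk_payloads(
    "nvidia/nv-embedqa-mistral-7b-v2",
    ["first chunk", "second chunk", "third chunk"],
)
for body in payloads:
    # Each body carries exactly one chunk, so the model's token limit
    # would apply to a single chunk rather than to all of them together.
    print(json.dumps(body))
```

The embeddings returned for each body would then be collected, in order, into the array that the script processor reads.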

What is your host/environment?

  • OS: 2.17
layavadi added the bug and untriaged labels on Jan 13, 2025
@yuye-aws
Member

Hi @ylwu-amzn, this is actually a bug related to the model and the connector pre-process function. Can you provide more context?

@ylwu-amzn
Collaborator

ylwu-amzn commented Jan 17, 2025

@layavadi, I suggest formatting the issue description to make it more readable. I have formatted it for you.

Are you using OpenAI Embedding model or nvidia/nv-embedqa-mistral-7b-v2?

@nathaliellenaa
Contributor

nathaliellenaa commented Jan 18, 2025

I was able to reproduce the error. There seems to be a potential bug in the text chunking/embedding processor for remote models. I will look deeper into this issue.

@ylwu-amzn
Collaborator

Thanks @nathaliellenaa, I have assigned this issue to you.

@nathaliellenaa
Contributor

Hi @layavadi, I did some debugging on my side and found that decreasing the token limit fixes the error. When the token limit is decreased, the text chunking and embedding work properly. Can you try changing the token_limit field in your pipeline definition to a smaller value (e.g., 100)?
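As a sketch of the suggested change, the helper below returns a copy of a pipeline body with a smaller token_limit. The helper name with_token_limit is hypothetical, and the pipeline dict is a trimmed copy of the text_chunking processor from the issue description. It only illustrates the edit and does not call any OpenSearch API.

```python
import copy


def with_token_limit(pipeline_body, token_limit):
    """Return a copy of the pipeline body with a new fixed_token_length token_limit."""
    updated = copy.deepcopy(pipeline_body)
    for processor in updated.get("processors", []):
        algorithm = processor.get("text_chunking", {}).get("algorithm", {})
        if "fixed_token_length" in algorithm:
            algorithm["fixed_token_length"]["token_limit"] = token_limit
    return updated


pipeline = {
    "processors": [
        {
            "text_chunking": {
                "algorithm": {
                    "fixed_token_length": {
                        "token_limit": 500,
                        "overlap_rate": 0.2,
                        "tokenizer": "standard",
                    }
                },
                "field_map": {"text": "text_chunks"},
            }
        }
    ]
}

smaller = with_token_limit(pipeline, 100)
print(smaller["processors"][0]["text_chunking"]["algorithm"]["fixed_token_length"])
```

The original pipeline dict is left unchanged, so both variants can be registered under different names and compared.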
