Server error due to Chroma query index dimensionality mismatch #296

cowile · 2023-05-31T18:11:17Z

I've been trying to set up an example Chroma database that can be queried with the retrieval plugin.

I generate the database with this code:

import chromadb
from chromadb.config import Settings
import os

cl = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=os.path.expanduser("~/Code/chatgpt-retrieval-plugin/openai")
))
co = cl.create_collection('openaiembeddings')

for c in range(ord('A'), ord('Z')+1):
    co.add(documents=chr(c), ids=chr(c))

Then I run the retrieval plugin with poetry run dev and try the plugin with ChatGPT.

After the ChatGPT session, the log says:

INFO:     Will watch for changes in these directories: ['/home/cwl/Code/chatgpt-retrieval-plugin']
INFO:     Uvicorn running on http://localhost:3333 (Press CTRL+C to quit)
INFO:     Started reloader process [103737] using WatchFiles
INFO:     Started server process [103748]
INFO:     Waiting for application startup.
Using embedded DuckDB with persistence: data will be stored in: openai
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
INFO:     Application startup complete.
Error: Dimensionality of (1536) does not match index dimensionality (384)
INFO:     127.0.0.1:57770 - "POST /query HTTP/1.1" 500 Internal Server Error


atroyn · 2023-05-31T21:40:38Z

The issue is that when you added the documents, you used the built-in default embedding function.
If you want to use Chroma in this way, you should use the OpenAI embedding function when adding documents.

I'll add that to the chroma specific README.
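
As a rough illustration of the suggestion above, here is a minimal sketch that computes embeddings up front and passes them to `add` explicitly, so the collection never falls back to its default embedding function. `add_with_embeddings`, `embed`, and `EXPECTED_DIM` are hypothetical names, not part of Chroma or the plugin. `collection` is assumed to be a Chroma collection (whose `add` accepts `documents`, `embeddings`, and `ids`), and `embed` is assumed to wrap a call to the OpenAI embeddings API that returns 1536-dimensional vectors.

```python
from typing import Callable, List, Sequence

# Assumption: the plugin queries with 1536-dimensional OpenAI embeddings,
# as the error message in the log suggests.
EXPECTED_DIM = 1536


def add_with_embeddings(
    collection,
    texts: Sequence[str],
    embed: Callable[[List[str]], List[List[float]]],
    expected_dim: int = EXPECTED_DIM,
) -> None:
    """Embed texts with the caller's function and add them with explicit vectors."""
    docs = list(texts)
    vectors = embed(docs)
    if len(vectors) != len(docs):
        raise ValueError("embed() must return one vector per document")
    for vec in vectors:
        if len(vec) != expected_dim:
            raise ValueError(
                f"embedding has {len(vec)} dimensions, expected {expected_dim}"
            )
    # Passing embeddings explicitly means the collection's own
    # embedding function is not called for these documents.
    # As in the original snippet, each document doubles as its own id.
    collection.add(documents=docs, embeddings=vectors, ids=docs)
```

With the collection from the original snippet, this could be called as `add_with_embeddings(co, [chr(c) for c in range(ord('A'), ord('Z') + 1)], embed)`, where `embed` is whatever function you use to call OpenAI.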

cowile · 2023-06-05T22:08:12Z

Then the codebase should use the OpenAI embedding function in chroma_datastore.py. Currently that also uses the default when creating a chroma client.

atroyn · 2023-06-05T22:35:26Z

For the retrieval plugin, the embeddings are handled above the data store interface. Embeddings for upsert are handled here: https://github.com/openai/chatgpt-retrieval-plugin/blob/742fdf7cfcd1ca6082de1ee2ee5dc5e14dc00e0f/datastore/datastore.py#LL18C15-L18C21

and for query, here (chatgpt-retrieval-plugin/datastore/datastore.py, line 59 in 742fdf7):

query_embeddings = get_embeddings(query_texts)

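Because you populated your index directly, we did not know that the OpenAI embedding function should be used, so we used our default (the SentenceTransformerEmbeddingFunction named in your log). That is why the index has 384 dimensions while the plugin's query embeddings have 1536.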

In chroma_datastore.py the function is deliberately set to None as it should never be called directly for collections created via the retrieval plugin.
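
To make the division of labor described above concrete, here is a simplified, self-contained sketch. It is not the plugin's actual code, and every name in it is a hypothetical stand-in: `fake_openai_embed` and `fake_default_embed` mimic a 1536-dimensional and a 384-dimensional embedding function, `ToyIndex` mimics an index that fixes its dimensionality on the first insert, and `ToyDataStore` mimics a data store layer that embeds text before it reaches the index.

```python
from typing import Callable, List

Vector = List[float]
EmbedFn = Callable[[List[str]], List[Vector]]


def fake_openai_embed(texts: List[str]) -> List[Vector]:
    # Stand-in for an OpenAI embedding call (1536 dimensions assumed).
    return [[0.0] * 1536 for _ in texts]


def fake_default_embed(texts: List[str]) -> List[Vector]:
    # Stand-in for a default sentence-transformer function (384 dimensions assumed).
    return [[0.0] * 384 for _ in texts]


class ToyIndex:
    """Hypothetical index whose dimensionality is fixed by the first vector seen."""

    def __init__(self) -> None:
        self.dim = None
        self.vectors: List[Vector] = []

    def add(self, vectors: List[Vector]) -> None:
        for vec in vectors:
            self._check(vec)
            self.vectors.append(vec)

    def query(self, vector: Vector) -> int:
        self._check(vector)
        return len(self.vectors)  # placeholder result: number of candidates

    def _check(self, vec: Vector) -> None:
        if self.dim is None:
            self.dim = len(vec)
        elif len(vec) != self.dim:
            raise ValueError(
                f"Dimensionality of ({len(vec)}) does not match "
                f"index dimensionality ({self.dim})"
            )


class ToyDataStore:
    """Embeds text above the index, as the plugin is described as doing."""

    def __init__(self, index: ToyIndex, embed: EmbedFn) -> None:
        self.index = index
        self.embed = embed

    def upsert(self, texts: List[str]) -> None:
        self.index.add(self.embed(texts))

    def query(self, text: str) -> int:
        return self.index.query(self.embed([text])[0])
```

In this sketch, populating a `ToyIndex` directly with `fake_default_embed` and then querying it through `ToyDataStore(index, fake_openai_embed)` raises the same kind of dimensionality error as in the log, while upserting through the data store keeps both sides on the same embedding function.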
