Index Loading not working if dumped from different process #154

KCaverly opened this issue Jun 9, 2023 · 3 comments

KCaverly commented Jun 9, 2023

Running the example works fine if you generate, dump, and then load the index all within the same process. However, if you generate and dump the index, you cannot reload it in a new process without adding the documents again. Running a query on a loaded index leads to missing-document errors.

Do you have to call add_documents again after load? Since the add_documents method generates the embeddings itself, doesn't this lead to redundant calls to OpenAI, where you have to regenerate the embeddings a second time on load?

@Pablo1785 (Collaborator)

Hi, thank you for filing the issue.

Obviously, I cannot see your code, but I'm assuming you are using the defaults from the example. The issue, as I see it, is that we do not have a persistent DocumentStore implementation; we only have an InMemoryDocumentStore. So what effectively happens is that the HNSW index itself (the embedded vectors) does get saved to the file, but the documents (the contents) only ever lived in your original process and are not saved to file when dumping the index.
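To make that concrete, here is a minimal, self-contained sketch of the situation (hypothetical, simplified types and a made-up file format, not the library's actual API): dump() writes the vectors to disk, but the document contents only exist in an in-memory map, so a fresh process that loads the file gets vectors whose documents can no longer be resolved.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufWriter, Result, Write};

// Hypothetical, simplified stand-ins for the real types.
struct InMemoryDocumentStore {
    // Document contents keyed by id; this map lives only in the current process.
    docs: HashMap<usize, String>,
}

struct HnswVectorStore {
    // Embedded vectors keyed by the same ids.
    vectors: HashMap<usize, Vec<f32>>,
    document_store: InMemoryDocumentStore,
}

impl HnswVectorStore {
    // Dumping persists the index (the vectors) but not the documents,
    // so a new process can load the vectors yet has no way to resolve
    // the documents that a query's nearest neighbours point at.
    fn dump(&self, path: &str) -> Result<()> {
        let mut out = BufWriter::new(File::create(path)?);
        for (id, vector) in &self.vectors {
            let values: Vec<String> = vector.iter().map(|v| v.to_string()).collect();
            writeln!(out, "{} {}", id, values.join(" "))?;
        }
        // Note: self.document_store.docs is never written anywhere.
        Ok(())
    }
}

fn main() -> Result<()> {
    let store = HnswVectorStore {
        vectors: HashMap::from([(0, vec![0.1, 0.2]), (1, vec![0.3, 0.4])]),
        document_store: InMemoryDocumentStore {
            docs: HashMap::from([(0, "first doc".into()), (1, "second doc".into())]),
        },
    };
    // Only the vectors survive this call; the docs map dies with the process.
    store.dump("index.hnsw.txt")
}
```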

For now, if you want persistence quickly, I would recommend the QdrantVectorStore.

KCaverly (Author) commented Jun 9, 2023

Thanks for the response.

Given this, would it be possible (and would you be open to a PR) to load documents into a vector store without embedding them, assuming of course that the vector store already has the embeddings/index from the loaded .hnsw files?

HNSW indexes are great for POCs and lightweight use without reaching for a full vector DB solution, so I'm hesitant to move to Qdrant at this time.

@Pablo1785 (Collaborator)

Sure, we are always open to new PRs, and this is definitely a blind spot.

I'm not entirely sure that a method which simply adds a document to the vector store without embedding it would be sound; it could lead to invalid states if the user misuses the API.

I think it might be better to have some dump_docs()/load_docs() type of methods implemented specifically for HNSWVectorStore with InMemoryDocstore. That is my first idea, but it is definitely not the only solution.

I think the most important thing is that it should never be possible for a VectorStore to hold a Document without an embedded vector, or vice versa: a vector without a corresponding Document.
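As a rough illustration of that direction (made-up names and a naive file format, not a concrete design proposal), dump_docs()/load_docs() could persist the in-memory document store next to the .hnsw file and refuse to load any state where the vectors and documents don't line up:

```rust
use std::collections::HashMap;
use std::fs;
use std::io::{Error, ErrorKind, Result};

struct InMemoryDocumentStore {
    docs: HashMap<usize, String>,
}

impl InMemoryDocumentStore {
    // Persist documents next to the .hnsw file, one "id<TAB>content" line each.
    fn dump_docs(&self, path: &str) -> Result<()> {
        let mut out = String::new();
        for (id, content) in &self.docs {
            out.push_str(&format!("{}\t{}\n", id, content.replace('\n', "\\n")));
        }
        fs::write(path, out)
    }

    // Reload documents, then check them against the ids already present in
    // the loaded index, so the store never ends up with a vector lacking a
    // Document or a Document lacking a vector.
    fn load_docs(path: &str, index_ids: &[usize]) -> Result<Self> {
        let mut docs = HashMap::new();
        for line in fs::read_to_string(path)?.lines() {
            let (id, content) = line
                .split_once('\t')
                .ok_or_else(|| Error::new(ErrorKind::InvalidData, "malformed line"))?;
            let id: usize = id
                .parse()
                .map_err(|_| Error::new(ErrorKind::InvalidData, "bad document id"))?;
            docs.insert(id, content.replace("\\n", "\n"));
        }
        for id in index_ids {
            if !docs.contains_key(id) {
                let msg = format!("vector {} has no corresponding document", id);
                return Err(Error::new(ErrorKind::InvalidData, msg));
            }
        }
        for id in docs.keys() {
            if !index_ids.contains(id) {
                let msg = format!("document {} has no corresponding vector", id);
                return Err(Error::new(ErrorKind::InvalidData, msg));
            }
        }
        Ok(Self { docs })
    }
}

fn main() -> Result<()> {
    let store = InMemoryDocumentStore {
        docs: HashMap::from([(0, "first doc".to_string()), (1, "second doc".to_string())]),
    };
    store.dump_docs("index.docs.txt")?;

    // In the new process: load the .hnsw index first (ids 0 and 1 here),
    // then restore the documents without re-embedding anything.
    let restored = InMemoryDocumentStore::load_docs("index.docs.txt", &[0, 1])?;
    assert_eq!(restored.docs.len(), 2);
    Ok(())
}
```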
