+mdb atlas vectordb [clean_final] #3000
base: main
Conversation
Hi @ranfysvalle02, thank you. I can see that the notebook issues have not been addressed. See the comments here: #2996
Co-authored-by: HRUSHIKESH DOKALA <[email protected]>
The simple change of
highlighted that I need to create a "wrapper" class around the MongoDB collection, similar to what pgvector did, but for MongoDB. I'll be working on this, @Hk669.
@thinkall - Can you please tell me how to 'fix' the notebook? Or perhaps have it as a 'suggested commit'? I'll be addressing the notebook and any final touches later today. |
How about running it successfully in your local env and removing only the sensitive info? A new user should be able to run it by filling in the missing information, which should only be the
So the MongoDB connection string should not be empty; the one I suggested in your last PR worked for me. Does it work for you? The one you previously used didn't work for me and did not connect to the Docker container. The output of the chat in the last cell is not correct. Could you please check my previous comments and the pgvector notebook example?
It's OK, no need to wrap a |
I see what you mean @thinkall ! I found the issue with the notebook and notebook output!
vs
We are close!!! I'll push the fix/code later today |
No errors here. I've fixed this and made a commit. |
"VectorDB returns doc_ids: [[]]\n", | ||
"\u001b[32mNo more context, will terminate.\u001b[0m\n", | ||
"\u001b[33mragproxyagent\u001b[0m (to assistant):\n", | ||
"\n", | ||
"TERMINATE\n", |
retrieve_docs is not working as expected: no documents are returned. Either the query pipeline or the Atlas local env is not functional.
On it! Thank you so much for helping debug! There is something wrong with the implementation I believe -- will debug this shortly.
@thinkall - It is an implementation bug! It's an issue with the index_name :) Will fix shortly.
@thinkall finally tracked this down: it's all about the index! The create_collection method does not use 'index_name' or 'similarity', which I had added. Working on a fix!
@thinkall - I finally got it to run, but I had to add an arbitrary programmatic delay for things to work. I am working on a more elegant solution. With a delay of ~15 seconds it works; anything of 5 seconds or less fails.
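One way to replace a fixed delay is to poll the search index until it reports itself as queryable, which is the `'queryable': True` field visible in the index description printed in the notebook output. The following is only a sketch under assumptions: `wait_until_queryable` and its parameters are hypothetical names, not part of this PR, and the index listing is passed in as a callable so the loop does not depend on a live cluster.

```python
import time
from typing import Any, Callable, Iterable, Mapping


def wait_until_queryable(
    list_indexes: Callable[[], Iterable[Mapping[str, Any]]],
    index_name: str,
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the named search index reports queryable=True, or time out.

    `list_indexes` is any zero-argument callable returning index descriptions
    (dict-like objects with at least "name" and "queryable" keys).
    """
    deadline = clock() + timeout
    while True:
        for index in list_indexes():
            if index.get("name") == index_name and index.get("queryable"):
                return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With PyMongo 4.5 or later, the callable could be something like `lambda: collection.list_search_indexes()`, assuming `collection` is the collection that owns the vector index.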
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3000       +/-   ##
===========================================
- Coverage   32.45%   19.88%   -12.58%
===========================================
  Files          93       95        +2
  Lines       10109    10426      +317
  Branches     2172     2388      +216
===========================================
- Hits         3281     2073     -1208
- Misses       6544     8214     +1670
+ Partials      284      139      -145
Flags with carried forward coverage won't be shown.
https://github.com/microsoft/autogen/actions/runs/9629801423/job/26560601555?pr=3000 -> Can you look into the failing tests? Thanks, @ranfysvalle02.
PYTHON-4506 Expanded tests and simplified vector search pipelines
index_name change; keeping track of lucene indexes is tricky
Hi @ranfysvalle02, there are still some issues with the notebook and the tests. Could you please help investigate? Also, please run pre-commit run --all-files to make sure the formatting is good. Thank you so much!
@@ -171,7 +171,7 @@ def get_docs_by_ids( | |||
ids: List[ItemID] | A list of document ids. If None, will return all the documents. Default is None. | |||
collection_name: str | The name of the collection. Default is None. | |||
include: List[str] | The fields to include. Default is None. | |||
If None, will include ["metadatas", "documents"], ids will always be included. | |||
If None, will include ["metadata", "content"], ids will always be included. # TODO - Confirm keys |
Changing this would require changes to all the other databases. It's better to update the keys in the results of the MongoDB wrapper.
" mongodb:\n", | ||
" image: mongodb/mongodb-atlas-local:latest\n", | ||
" restart: unless-stopped\n", | ||
" ports:\n", | ||
" - \"27017:27017\"\n", | ||
" environment:\n", | ||
" MONGODB_INITDB_ROOT_USERNAME: mongodb_user\n", | ||
" MONGODB_INITDB_ROOT_PASSWORD: mongodb_password\n", |
With this setting, I always get empty returns like below:
2024-06-30 21:18:02,516 - autogen.agentchat.contrib.vectordb.mongodb - INFO - Search index vector_index created successfully.
VectorDB returns doc_ids: [[]]
No more context, will terminate.
Are you sure the free version of mongodb works?
" \"vector_db\": \"mongodb\", # MongoDB Atlas database\n", | ||
" \"collection_name\": \"flaml_collection\",\n", | ||
" \"db_config\": {\n", | ||
" \"connection_string\": \"<connection_string>\", # MongoDB Atlas connection string\n", |
Can you run the notebook with a mongodb instance deployed with the docker-compose.yml provided at the beginning and share the actual connection_string you used?
"\u001b[32mUpdating context and resetting conversation.\u001b[0m\n", | ||
"index is ready to use.\n", | ||
"{'id': '6677781cbb83ea33c40099e1', 'name': 'default_index', 'type': 'vectorSearch', 'status': 'READY', 'queryable': True, 'latestDefinitionVersion': {'version': 0, 'createdAt': datetime.datetime(2024, 6, 23, 1, 19, 24, 336000)}, 'latestDefinition': {'fields': [{'type': 'vector', 'numDimensions': 384, 'path': 'embedding', 'similarity': 'cosine'}]}, 'statusDetail': [{'hostname': 'shared-shard-00-search-onamml', 'status': 'READY', 'queryable': True, 'mainIndex': {'status': 'READY', 'queryable': True, 'definitionVersion': {'version': 0, 'createdAt': datetime.datetime(2024, 6, 23, 1, 19, 24)}, 'definition': {'fields': [{'type': 'vector', 'path': 'embedding', 'numDimensions': 384, 'similarity': 'cosine'}]}}}, {'hostname': 'shared-shard-00-search-6xag8e', 'status': 'READY', 'queryable': True, 'mainIndex': {'status': 'READY', 'queryable': True, 'definitionVersion': {'version': 0, 'createdAt': datetime.datetime(2024, 6, 23, 1, 19, 24)}, 'definition': {'fields': [{'type': 'vector', 'path': 'embedding', 'numDimensions': 384, 'similarity': 'cosine'}]}}}]}\n", | ||
"Now running pipeline: [{'$vectorSearch': {'index': 'default_index', 'limit': 60, 'numCandidates': 60, 'queryVector': [-0.08256451040506363, -0.07900252193212509, -0.05290786176919937, 0.021982736885547638, 0.046406690031290054, 0.027769701555371284, -0.02768588438630104, -0.020102187991142273, -0.05407266318798065, -0.061684805899858475, -0.03940979018807411, -0.029285598546266556, -0.1118478998541832, -0.03136416897177696, -0.04099257290363312, -0.07897000014781952, -0.02522769570350647, 0.043702732771635056, -0.030820483341813087, -0.041595760732889175, 0.10552595555782318, 0.0023172772489488125, 0.08983399718999863, 0.10865391790866852, -0.06146957352757454, 0.04154617711901665, 0.015428234823048115, 0.016568025574088097, 0.013623313046991825, -0.06059451401233673, 0.08428270369768143, 0.009563339874148369, -0.002620439976453781, 0.016997039318084717, -0.07201018929481506, -0.010901586152613163, -0.030768705531954765, -0.04398634657263756, -0.026716720312833786, -0.019298473373055458, 0.029043301939964294, -0.03137688338756561, -0.0516120120882988, -0.033414166420698166, 0.05385608226060867, -0.025596346706151962, -0.02077491395175457, -0.0634346529841423, 0.03223349153995514, 0.02784794755280018, -0.06079091876745224, -0.012161108665168285, -0.0933445394039154, -0.018985357135534286, -0.022000310942530632, 0.08059032261371613, 0.03905639797449112, 0.008981743827462196, -0.04856802150607109, -0.0195226538926363, -0.016003113240003586, -0.10165907442569733, -0.004733760375529528, 0.030122995376586914, -0.038355227559804916, 0.03839924931526184, -0.028533125296235085, 0.01822500303387642, 0.0707336813211441, -0.02592848241329193, 0.02241864986717701, 0.022557010874152184, 0.007257631979882717, 0.03511698544025421, 0.008497730828821659, 0.06233576685190201, 0.06869452446699142, 0.06520985811948776, -0.018009020015597343, 0.008016299456357956, -0.09440284222364426, -0.06914988905191422, -0.016991959884762764, -0.004849573597311974, 0.015289856120944023, 
-0.05368100106716156, -0.07648778706789017, 0.04355047643184662, -0.013986689038574696, 0.03536888584494591, 0.03178128972649574, 0.03904074802994728, 0.027542345225811005, 0.021311746910214424, -0.08981165289878845, 0.050620175898075104, 0.006543598137795925, 0.07310184836387634, -0.033499374985694885, -0.01851048693060875, -0.07171830534934998, -0.07049573212862015, -0.02946554869413376, 0.04081925004720688, -0.015752671286463737, -0.05440584942698479, -0.00638421019539237, -0.027693038806319237, -0.015809008851647377, -0.0794110968708992, 0.08307767659425735, -0.010127314366400242, 0.031197702512145042, -0.0325561985373497, 0.028586456552147865, 0.05326930806040764, -0.04397851228713989, -0.06359461694955826, 0.003676487598568201, 0.06998850405216217, -0.02999182790517807, 0.03461074084043503, 0.05651488155126572, -0.05784572660923004, 0.02231559529900551, -0.07732831686735153, -0.029416916891932487, 1.8518434945716996e-33, 0.0358523465692997, -0.002374001545831561, 0.009263500571250916, -0.05580880120396614, 0.030508413910865784, -0.037797845900058746, 0.01508091390132904, 0.02779262885451317, -0.04756521061062813, 0.010429342277348042, -0.005697719287127256, 0.03368696570396423, -0.014907917007803917, -0.02615354210138321, -0.05337945744395256, -0.08737822622060776, 0.04612358659505844, 0.016435381025075912, -0.03597096726298332, -0.06492944061756134, 0.11139646172523499, -0.04470240697264671, 0.013333962298929691, 0.06944458186626434, 0.04924115538597107, 0.021988168358802795, -0.0033458129037171602, -0.021327221766114235, 0.04618706554174423, 0.09092214703559875, -0.009819227270781994, 0.03574197739362717, -0.02589249238371849, 0.015359507873654366, 0.01923568733036518, 0.009884021244943142, -0.0687863752245903, 0.008688706904649734, 0.0003024878678843379, 0.006991893518716097, -0.07505182921886444, -0.045765507966279984, 0.005778071004897356, 0.0200499240309, -0.07049272209405899, -0.06168036535382271, 0.044801026582717896, 0.026470575481653214, 
0.01803005486726761, 0.04355733096599579, 0.034672655165195465, -0.08011800795793533, 0.03965161740779877, 0.08112046867609024, 0.07237163931131363, 0.07554267346858978, -0.0966770201921463, 0.05703232064843178, 0.007653184700757265, 0.09404793381690979, 0.02874479629099369, 0.032439567148685455, -0.006544401869177818, 0.0747322142124176, -0.06779398024082184, -0.03769124671816826, 0.018574388697743416, -0.0027497054543346167, 0.05186106637120247, 0.045869190245866776, 0.052037931978702545, 0.00877095852047205, 0.00956355594098568, 0.06010708585381508, 0.07063814997673035, -0.05281956121325493, 0.11385682970285416, 0.0014734964352101088, -0.13000114262104034, 0.04160114377737045, 0.002756801201030612, -0.03354136645793915, -0.012316903099417686, -0.04667062684893608, -0.021649040281772614, 0.009122663177549839, 0.07305404543876648, 0.050488732755184174, 0.0037498027086257935, 0.06742933392524719, -0.09808871150016785, -0.02533995360136032, 0.07752660661935806, -0.008930775336921215, -0.020734407007694244, -8.718873943854186e-34, 0.030775681138038635, -0.04046367108821869, -0.07485030591487885, 0.06837300956249237, 0.03777360916137695, 0.03171695023775101, 0.038366734981536865, -0.009698187932372093, -0.06721752882003784, 0.03483430668711662, -0.03264770656824112, -0.004821446258574724, 0.017873667180538177, -0.01217806525528431, -0.06693356484174728, -0.042935941368341446, 0.07182027399539948, -0.023592444136738777, 0.010779321193695068, 0.03270953893661499, -0.03838556632399559, -0.010096886195242405, -0.058566078543663025, -0.06304068863391876, -0.013382021337747574, -0.011351224966347218, -0.08517401665449142, 0.007304960861802101, -0.04197632893919945, -0.008837309665977955, 0.000581165833864361, 0.009765408001840115, -0.02323746308684349, -0.07040572166442871, -0.0630621388554573, -0.01030951738357544, 0.07319610565900803, -0.002567168092355132, -0.00982675701379776, 0.08009836822748184, 0.06278694421052933, -0.053986601531505585, -0.13036444783210754, 
-0.05632428079843521, -0.012127791531383991, -0.00034488266101107, -0.05524465814232826, -0.019998280331492424, -0.041557829827070236, 0.07457990199327469, -0.004864905495196581, 0.0744631364941597, -0.038698967546224594, 0.11076352000236511, 0.08321533352136612, -0.1319902539253235, 0.05189663544297218, -0.08637715131044388, -0.047119464725255966, 0.0712425485253334, 0.038989413529634476, -0.06715074181556702, 0.0770900622010231, -0.016237575560808182, 0.16853967308998108, -0.003975923638790846, 0.11307050287723541, 0.07726389169692993, -0.028748558834195137, 0.04492560029029846, 0.0768602192401886, 0.0852692499756813, 0.021246735006570816, 0.11719376593828201, 0.0029091970063745975, -0.011192459613084793, -0.09389575570821762, 0.021549541503190994, -0.0055024465546011925, 0.032183919101953506, 0.0651387944817543, -0.0652405172586441, 0.03021097555756569, 0.1095665693283081, -0.02563057281076908, 0.05070950835943222, 0.09074468910694122, 0.08164751529693604, 0.039858028292655945, -0.045717816799879074, -0.01968374475836754, -0.01942502148449421, 0.020252034068107605, 0.028495490550994873, -0.014108758419752121, -2.6071681702433125e-08, -0.004948799964040518, -0.03374723717570305, -0.006966953631490469, 0.04770921543240547, 0.060589514672756195, 0.039017271250486374, -0.06870992481708527, 0.04758283868432045, -0.04153140261769295, -0.009761914610862732, 0.05678777024149895, -0.024886248633265495, 0.08310353755950928, 0.04019981995224953, 0.04347654804587364, -0.016476230695843697, 0.02281028777360916, 0.044384729117155075, 0.012391419149935246, 0.03150279074907303, 0.03414358198642731, 0.023670021444559097, -0.035867370665073395, 0.00584121560677886, 0.03878429904580116, -0.03416749835014343, 0.0317315049469471, 0.014832393266260624, 0.06329585611820221, -0.07007385790348053, -0.11312873661518097, -0.0667077898979187, 0.031542230397462845, 0.03318323940038681, -0.05146196484565735, -0.04369741305708885, 0.030556850135326385, 0.05148332566022873, 
-0.09324397146701813, 0.08804989606142044, -0.05473781377077103, 0.02356131188571453, -0.0072563826106488705, -0.013308629393577576, 0.022258494049310684, 0.047823697328567505, -0.014027439057826996, -0.018331162631511688, -0.02744504064321518, 0.027262693271040916, -0.03694259002804756, 0.04492212459445, 0.04835069552063942, 0.09086570143699646, -0.0022586847189813852, -0.03940355032682419, -0.005774456076323986, -0.06551025062799454, -0.04700932279229164, -0.00200175354257226, -0.039275478571653366, -0.04998438432812691, -0.08698498457670212, 0.015872927382588387], 'path': 'embedding'}}, {'$project': {'score': {'$meta': 'vectorSearchScore'}}}, {'$lookup': {'from': 'flaml_collection', 'localField': '_id', 'foreignField': '_id', 'as': 'full_document_array'}}, {'$addFields': {'full_document': {'$arrayElemAt': [{'$map': {'input': '$full_document_array', 'as': 'doc', 'in': {'id': '$$doc.id', 'content': '$$doc.content'}}}, 0]}}}, {'$project': {'full_document_array': 0, 'embedding': 0}}]\n", |
Can you avoid printing out all the embeddings?
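A possible way to keep the pipeline log readable is to replace the query vector with a short placeholder before printing. This is a minimal sketch, assuming the pipeline is a list of stage dicts shaped like the one in the output above; `redact_query_vectors` is a hypothetical helper name, not something from this PR.

```python
import copy
from typing import Any, Dict, List


def redact_query_vectors(pipeline: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a deep copy of the pipeline that is safe to log.

    Any `queryVector` inside a `$vectorSearch` stage is replaced with a
    placeholder that only records the vector length.
    """
    safe = copy.deepcopy(pipeline)
    for stage in safe:
        search = stage.get("$vectorSearch")
        if isinstance(search, dict) and "queryVector" in search:
            search["queryVector"] = f"<{len(search['queryVector'])} floats>"
    return safe
```

The original pipeline is left untouched, so it can still be passed to `aggregate` after the redacted copy has been logged.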
|
||
logger = logging.getLogger(__name__) | ||
|
||
MONGODB_URI = os.environ.get("MONGODB_URI") |
A default value should be given.
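For illustration, a fallback could be supplied as the second argument to `os.environ.get`. The default shown here (a local, unauthenticated MongoDB on the standard port) is an assumption for the sketch, not the value used in this PR, and `get_mongodb_uri` is a hypothetical helper.

```python
import os
from typing import Mapping

# Hypothetical default: a local, unauthenticated MongoDB on the standard port.
DEFAULT_MONGODB_URI = "mongodb://localhost:27017/"


def get_mongodb_uri(env: Mapping[str, str] = os.environ) -> str:
    """Read MONGODB_URI from the environment, falling back to the default."""
    return env.get("MONGODB_URI", DEFAULT_MONGODB_URI)
```

Passing the environment as a parameter keeps the lookup easy to exercise in tests without modifying `os.environ`.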
Hi @ranfysvalle02, I've fixed the MongoDB test errors. It looks like distance_threshold is not working. I'm not sure whether it fails only on the free MongoDB tier or whether there is an issue in the code.
|
||
# Empty list of queries returns empty list of results | ||
queries = ["Some good pets", "What kind of Sandwich?"] | ||
results = db.retrieve_docs(queries=queries, collection_name=MONGODB_COLLECTION, n_results=2) |
Please add a test with distance_threshold set. It's not working in my local env.
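As a starting point for such a test, the thresholding step can be checked on its own, without a database. The sketch below rests on assumptions that are not confirmed by this thread: results are `(document, distance)` pairs, a smaller distance means a closer match, and a negative threshold disables filtering. `filter_by_distance` is a hypothetical helper; a similarity score returned by the search would first have to be converted to a distance, which is not shown here.

```python
from typing import Any, Dict, List, Tuple

QueryResult = Tuple[Dict[str, Any], float]


def filter_by_distance(results: List[QueryResult], distance_threshold: float = -1) -> List[QueryResult]:
    """Keep only results whose distance is below the threshold.

    Assumption: a negative threshold means "no filtering".
    """
    if distance_threshold < 0:
        return list(results)
    return [(doc, dist) for doc, dist in results if dist < distance_threshold]
```

A test against the real wrapper would then call `retrieve_docs` with `distance_threshold` set and assert the same property on the returned distances.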
# Compute embedding vector from semantic query | ||
query_vector = np.array(self.embedding_function([query_text])).tolist()[0] | ||
# Find documents with similar vectors using the specified index | ||
query_result = _vector_search( |
Don't return embedding by default.
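One way to do this is to drop the embedding field in the aggregation pipeline unless the caller asks for it. This is a sketch only: `build_search_pipeline` and its parameters are hypothetical, and the `$vectorSearch` fields mirror those in the pipeline printed in the notebook output. The score is attached with `$set` so that the final `$project` can be a pure exclusion.

```python
from typing import Any, Dict, List


def build_search_pipeline(
    query_vector: List[float],
    index_name: str,
    limit: int,
    embedding_field: str = "embedding",
    include_embedding: bool = False,
) -> List[Dict[str, Any]]:
    """Build a vector search pipeline that omits embeddings by default."""
    pipeline: List[Dict[str, Any]] = [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": embedding_field,
                "queryVector": query_vector,
                "numCandidates": limit,
                "limit": limit,
            }
        },
        # Attach the similarity score to each returned document.
        {"$set": {"score": {"$meta": "vectorSearchScore"}}},
    ]
    if not include_embedding:
        # Exclude the (large) embedding vector from the results.
        pipeline.append({"$project": {embedding_field: 0}})
    return pipeline
```

Callers that do need the vectors can pass `include_embedding=True`, which leaves the documents as returned by the search stage.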
# if overwrite is False and get_or_create is False, raise a ValueError
if not overwrite and not get_or_create:
    raise ValueError("If overwrite is False, get_or_create must be True.")

collection_names = self.db.list_collection_names()
if collection_name not in collection_names:
    # Create a new collection
    return self.db.create_collection(collection_name)

if overwrite:
    self.db.drop_collection(collection_name)

if get_or_create:
    # The collection already exists, return it.
    return self.db[collection_name]
else:
    # get_or_create is False and the collection already exists, raise an error.
    raise ValueError(f"Collection {collection_name} already exists.")
# If overwrite is True, drop any existing collection first
if overwrite:
    self.db.drop_collection(collection_name)
collection_names = self.db.list_collection_names()
if collection_name not in collection_names:
    # Create a new collection
    return self.db.create_collection(collection_name)
if get_or_create:
    # The collection already exists, return it.
    return self.db[collection_name]
else:
    # get_or_create is False and the collection already exists, raise an error.
    raise ValueError(f"Collection {collection_name} already exists.")
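The reordered logic can be exercised without a MongoDB server. The sketch below uses a hypothetical in-memory stand-in, `FakeDatabase`, which only mimics the four database operations the snippet relies on; it is not a PyMongo class, and the free function `create_collection` is an illustrative copy of the suggested method body rather than the code in this PR.

```python
from typing import Dict, List


class FakeDatabase:
    """Hypothetical in-memory stand-in for the database object."""

    def __init__(self) -> None:
        self._collections: Dict[str, List[dict]] = {}

    def list_collection_names(self) -> List[str]:
        return list(self._collections)

    def create_collection(self, name: str) -> List[dict]:
        self._collections[name] = []
        return self._collections[name]

    def drop_collection(self, name: str) -> None:
        self._collections.pop(name, None)

    def __getitem__(self, name: str) -> List[dict]:
        return self._collections[name]


def create_collection(db, collection_name: str, overwrite: bool = False, get_or_create: bool = True):
    """Drop first when overwriting, then create or return the collection."""
    if overwrite:
        db.drop_collection(collection_name)
    if collection_name not in db.list_collection_names():
        # Create a new collection
        return db.create_collection(collection_name)
    if get_or_create:
        # The collection already exists, return it.
        return db[collection_name]
    # get_or_create is False and the collection already exists.
    raise ValueError(f"Collection {collection_name} already exists.")
```

Because the drop happens before the existence check, `overwrite=True` always yields a fresh collection, even when `get_or_create` is False.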
Why are these changes needed?
MongoDB Atlas Vector Search received the highest developer NPS among vector databases in Retool's State of AI 2023 survey (https://www.mongodb.com/blog/post/atlas-vector-search-commands-highest-developer-nps-retool-state-ai-2023-survey), so it is worth adding MongoDB vector search as an option for AutoGen RAG.
You can get started with MongoDB vector search on a free-tier M0 MongoDB Atlas cluster, which provides the full functionality of MongoDB vector search: https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-overview/
But why is MongoDB such a standout? Well, there are a few key reasons.
As such, implementing MongoDB as a Retrieval Agent can unlock new potential in your AI applications, bringing the full power of vector storage to bear.
Related issue number: 711
Closes #711
Closes #2996
Checks