Added hybrid search demo #87

Open · wants to merge 2 commits into main
78 changes: 77 additions & 1 deletion apps/README.md
@@ -1 +1,77 @@
JavaScript and Python apps and demos showcasing how to use MongoDB in GenAI applications.
# Hybrid Search Demo

Performs a configurable hybrid search that lets you control the scoring weights used in the reciprocal rank fusion (RRF) approach.

Uses a set of financial statements from major vehicle manufacturers as the source data. The demo lets you show the following:

- Benefits of vector search
- Pre-filtering on the vector side
- Hybrid search capabilities
- Ability to influence the weighting of data in the RRF approach

## Setup

Run the following commands:

```shell
python -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt
```

### Loading Data

MongoDB employees can use the link below to download the dump file from the existing database, then use the command that follows to load the data:
[Google Drive for Dump Files](https://drive.google.com/drive/folders/15m-7-Mp8jTZn0IP-AXvfN9pd3p1ubJh9?usp=drive_link)

```shell
mongorestore --uri="<connection string>" --db langchain --dir=<folder for data>
```
That will load the existing data (with precomputed embeddings) into your cluster and avoid the need to make additional calls to the embedding model.
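
To confirm the restore, you can count the documents in the collection the app queries (assuming the dump restores into the `langchain` database's `financial_statements` collection, which is what `app.py` expects):

```shell
mongosh "<connection string>" --eval 'db.getSiblingDB("langchain").financial_statements.countDocuments()'
```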

Others (or those wishing to use a different embedding model) can submit a `POST` request to the `http://localhost:5000/generate` endpoint, and the app will generate the embeddings itself.
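
For example, once the app is running (see "Start the app" below), the following request kicks off parsing, chunking, and embedding generation. It calls the embedding model once per chunk, so it can take a while:

```shell
curl -X POST http://localhost:5000/generate
```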

## Execute demo

The demo runs as a Flask app; you submit API requests to it, which makes it easy to change the input parameters and get back the results.

### Start the app

```shell
source ./venv/bin/activate
export FLASK_APP=app.py
export FLASK_ENV=development
flask run
```


To run a query, submit a JSON payload like the example below as a `POST` request to `http://localhost:5000`:

```json
{
  "prompt": "What was combined revenue in 2022",
  "company": "Ford",
  "pageNum": 3,
  "vectorWeight": 0.2,
  "textWeight": 0.8,
  "textBoost": 3
}
```
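
For example, with `curl` (assuming the app is running locally on the default Flask port, as in the URL above):

```shell
curl -X POST http://localhost:5000/ \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What was combined revenue in 2022", "company": "Ford", "pageNum": 3, "vectorWeight": 0.2, "textWeight": 0.8, "textBoost": 3}'
```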

Changing the weight values (and the question) lets you show the benefits of hybrid search.
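Under the hood, each result's combined score is the sum of its weighted reciprocal-rank scores from the two searches: `vectorWeight * 1/(rank + 60)` from the vector search plus `textWeight * 1/(rank + 60)` from the lexical search (see the aggregation pipeline in `app.py`).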

The `pageNum` attribute is used for pre-filtering. The code applies a simple `{"metadata.page": {"$lte": <pageNum>}}` filter to the vector search. It's simplistic, but it does the job of showing the benefits.
The `company` attribute is used by the lexical search. You can choose between `Stellantis`, `GM`, and `Ford`.
12 changes: 12 additions & 0 deletions apps/hybrid-search/Lang-Chain.iml
@@ -0,0 +1,12 @@
<?xml version="1.0" encoding="UTF-8"?>
<module type="PYTHON_MODULE" version="4">
  <component name="NewModuleRootManager" inherit-compiler-output="true">
    <exclude-output />
    <content url="file://$MODULE_DIR$">
      <sourceFolder url="file://$MODULE_DIR$" isTestSource="false" />
      <excludeFolder url="file://$MODULE_DIR$/venv" />
    </content>
    <orderEntry type="jdk" jdkName="Python 3.9 (Lang-Chain)" jdkType="Python SDK" />
    <orderEntry type="sourceFolder" forTests="false" />
  </component>
</module>
77 changes: 77 additions & 0 deletions apps/hybrid-search/README.md
@@ -0,0 +1,77 @@
# Hybrid Search Demo

Performs a configurable hybrid search that lets you control the scoring weights used in the reciprocal rank fusion (RRF) approach.

Uses a set of financial statements from major vehicle manufacturers as the source data. The demo lets you show the following:

- Benefits of vector search
- Pre-filtering on the vector side
- Hybrid search capabilities
- Ability to influence the weighting of data in the RRF approach

## Setup

Run the following commands:

```shell
python -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt
```

### Loading Data

MongoDB employees can use the link below to download the dump file from the existing database, then use the command that follows to load the data:
[Google Drive for Dump Files](https://drive.google.com/drive/folders/15m-7-Mp8jTZn0IP-AXvfN9pd3p1ubJh9?usp=drive_link)

```shell
mongorestore --uri="<connection string>" --db langchain --dir=<folder for data>
```
That will load the existing data (with precomputed embeddings) into your cluster and avoid the need to make additional calls to the embedding model.
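
To confirm the restore, you can count the documents in the collection the app queries (assuming the dump restores into the `langchain` database's `financial_statements` collection, which is what `app.py` expects):

```shell
mongosh "<connection string>" --eval 'db.getSiblingDB("langchain").financial_statements.countDocuments()'
```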

Others (or those wishing to use a different embedding model) can submit a `POST` request to the `http://localhost:5000/generate` endpoint, and the app will generate the embeddings itself.
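
For example, once the app is running (see "Start the app" below), the following request kicks off parsing, chunking, and embedding generation. It calls the embedding model once per chunk, so it can take a while:

```shell
curl -X POST http://localhost:5000/generate
```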

## Execute demo

The demo runs as a Flask app; you submit API requests to it, which makes it easy to change the input parameters and get back the results.

### Start the app

```shell
source ./venv/bin/activate
export FLASK_APP=app.py
export FLASK_ENV=development
flask run
```


To run a query, submit a JSON payload like the example below as a `POST` request to `http://localhost:5000`:

```json
{
  "prompt": "What was combined revenue in 2022",
  "company": "Ford",
  "pageNum": 3,
  "vectorWeight": 0.2,
  "textWeight": 0.8,
  "textBoost": 3
}
```
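
For example, with `curl` (assuming the app is running locally on the default Flask port, as in the URL above):

```shell
curl -X POST http://localhost:5000/ \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What was combined revenue in 2022", "company": "Ford", "pageNum": 3, "vectorWeight": 0.2, "textWeight": 0.8, "textBoost": 3}'
```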

Changing the weight values (and the question) lets you show the benefits of hybrid search.
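Under the hood, each result's combined score is the sum of its weighted reciprocal-rank scores from the two searches: `vectorWeight * 1/(rank + 60)` from the vector search plus `textWeight * 1/(rank + 60)` from the lexical search (see the aggregation pipeline in `app.py`).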

The `pageNum` attribute is used for pre-filtering. The code applies a simple `{"metadata.page": {"$lte": <pageNum>}}` filter to the vector search. It's simplistic, but it does the job of showing the benefits.
The `company` attribute is used by the lexical search. You can choose between `Stellantis`, `GM`, and `Ford`.
196 changes: 196 additions & 0 deletions apps/hybrid-search/app.py
@@ -0,0 +1,196 @@
import os

from flask import Flask, jsonify, request
from flask_pymongo import PyMongo
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI

app = Flask(__name__)
app.config["DEBUG"] = True
app.config["MONGO_URI"] = os.getenv("ATLAS_URL")  # Atlas connection string
client = PyMongo(app).cx["langchain"]  # "langchain" database on the cluster
openai = OpenAI()  # reads OPENAI_API_KEY from the environment


@app.route("/generate", methods=["POST"])
def generateEmbeddings():
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200, add_start_index=True
)
for file in os.listdir("./finacialstatements"):
loader = PyPDFLoader(f"./finacialstatements/{file}")
docs = loader.load()
splits = text_splitter.split_documents(docs)
i = 0
docs = []
for split in splits:
vector = (
openai.embeddings.create(
input=[str(split.page_content)], model="text-embedding-3-small"
)
.data[0]
.embedding
)
dbDoc = {
"content": split.page_content,
"vectors": vector,
"metadata": {
"page": split.metadata.get("page"),
"start_index": split.metadata.get("start_index"),
"fileName": file,
},
}
docs.append(dbDoc)
i += 1
if i % 20 == 0:
client.financial_statements.insert_many(docs)
docs = []
client.financial_statements.insert_many(docs)
return jsonify({"results" "Docs successfully parsed and loaded"})


@app.route("/", methods=["POST"])
def getAnswers():
data = request.get_json()
prompt = data.get("prompt")
company = data.get("company")
pageNum = data.get("pageNum")
vcWeight = data.get("vectorWeight")
textWeight = data.get("textWeight")
textBoost = data.get("textBoost")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
promptEmbedding = embeddings.embed_query(prompt)
if company:
vectorWeight = vcWeight if vcWeight else 0.9
fullTextWeight = textWeight if textWeight else 0.1
textBoostValue = textBoost if textBoost else 1

pipeline = [
{
"$vectorSearch": {
"index": "vector_index",
"path": "vectors",
"queryVector": promptEmbedding,
"numCandidates": 100,
"limit": 20,
}
},
{"$group": {"_id": None, "docs": {"$push": "$$ROOT"}}},
{"$unwind": {"path": "$docs", "includeArrayIndex": "rank"}},
{
"$addFields": {
"vs_score": {
"$multiply": [
vectorWeight,
{"$divide": [1.0, {"$add": ["$rank", 60]}]},
]
}
}
},
{
"$project": {
"vs_score": 1,
"_id": "$docs._id",
"content": "$docs.content",
"metadata": "$docs.metadata",
}
},
{
"$unionWith": {
"coll": "financial_statements",
"pipeline": [
{
"$search": {
"index": "rrf-full-text-search",
"text": {
"query": company,
"path": "metadata.fileName",
"score": {"boost": {"value": textBoostValue}},
},
}
},
{"$limit": 20},
{"$group": {"_id": None, "docs": {"$push": "$$ROOT"}}},
{"$unwind": {"path": "$docs", "includeArrayIndex": "rank"}},
{
"$addFields": {
"fts_score": {
"$multiply": [
fullTextWeight,
{"$divide": [1.0, {"$add": ["$rank", 60]}]},
]
}
}
},
{
"$project": {
"fts_score": 1,
"_id": "$docs._id",
"content": "$docs.content",
"metadata": "$docs.metadata",
}
},
],
}
},
{
"$group": {
"_id": "$_id",
"metadata": {"$first": "$metadata"},
"content": {"$first": "$content"},
"vs_score": {"$max": "$vs_score"},
"fts_score": {"$max": "$fts_score"},
}
},
{
"$project": {
"_id": 1,
"content": 1,
"metadata": 1,
"vs_score": {"$ifNull": ["$vs_score", 0]},
"fts_score": {"$ifNull": ["$fts_score", 0]},
}
},
{
"$project": {
"score": {"$add": ["$fts_score", "$vs_score"]},
"_id": 1,
"content": 1,
"metadata": 1,
"vs_score": 1,
"fts_score": 1,
}
},
{"$sort": {"score": -1}},
{"$limit": 10},
]
else:
pipeline = [
{
"$vectorSearch": {
"queryVector": promptEmbedding,
"path": "vectors",
"numCandidates": 100,
"index": "vector_index",
"limit": 20,
"exact": False,
}
},
{"$project": {"page": 1, "content": 1, "metadata": 1}},
]
if pageNum:
pipeline[0].get("$vectorSearch")["filter"] = {
"metadata.page": {"$lte": pageNum}
}

results = client.financial_statements.aggregate(pipeline)

return jsonify(results)


if __name__ == "__main__":
app.run(debug=True)
13 binary files not shown.
9 changes: 9 additions & 0 deletions apps/hybrid-search/requirements.txt
@@ -0,0 +1,9 @@
langchain-community
pypdf
pymongo
langchain-aws
langchain-openai
python-dotenv
langchain-mongodb
flask
flask-pymongo