Added hybrid search demo #87

Open · wants to merge 2 commits into main
78 changes: 77 additions & 1 deletion apps/README.md
@@ -1 +1,77 @@
JavaScript and Python apps and demos showcasing how to use MongoDB in GenAI applications.
# Hybrid Search Demo

Performs a configurable hybrid search that lets you control the scoring weights used in the reciprocal rank fusion (RRF) approach.

Uses a set of financial statements from major vehicle manufacturers as the source data. The demo lets you show the following:

- Benefits of vector search
- Pre-filtering on the vector side
- Hybrid search capabilities
- Ability to influence the weighting of data in the RRF approach

## Setup

Run the following commands:

```shell
python -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt
```

### Loading Data

MongoDB employees can use the link below to download the dump file from the existing database, then use the command that follows to load the data:
[Google Drive for Dump Files](https://drive.google.com/drive/folders/15m-7-Mp8jTZn0IP-AXvfN9pd3p1ubJh9?usp=drive_link)

```shell
mongorestore --uri="<connection string>" --db langchain --dir=<folder for data>
```
That will load the existing data (with precomputed embeddings) into your cluster and avoid the need to make additional calls to the embedding model.
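
To confirm the restore, you can count the documents in the collection the app queries (assuming the dump restores into the `langchain` database's `financial_statements` collection, which is what `app.py` expects):

```shell
mongosh "<connection string>" --eval 'db.getSiblingDB("langchain").financial_statements.countDocuments()'
```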

Others (or those wishing to use a different embedding model) can submit a `POST` request to the `http://localhost:5000/generate` endpoint, and the app will generate the embeddings itself.
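
For example, once the app is running (see "Start the app" below), the following request kicks off parsing, chunking, and embedding generation. It calls the embedding model once per chunk, so it can take a while:

```shell
curl -X POST http://localhost:5000/generate
```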

## Execute demo

The demo runs as a Flask app; you submit API requests to it, which makes it easy to change the input parameters and get back the results.

### Start the app

```shell
source ./venv/bin/activate
export FLASK_APP=app.py
export FLASK_ENV=development
flask run
```


To run a query, submit a JSON payload like the example below as a `POST` request to `http://localhost:5000`:

```json
{
  "prompt": "What was combined revenue in 2022",
  "company": "Ford",
  "pageNum": 3,
  "vectorWeight": 0.2,
  "textWeight": 0.8,
  "textBoost": 3
}
```
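
For example, with `curl` (assuming the app is running locally on the default Flask port, as in the URL above):

```shell
curl -X POST http://localhost:5000/ \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What was combined revenue in 2022", "company": "Ford", "pageNum": 3, "vectorWeight": 0.2, "textWeight": 0.8, "textBoost": 3}'
```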

Changing the weight values (and the question) lets you show the benefits of hybrid search.
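Under the hood, each result's combined score is the sum of its weighted reciprocal-rank scores from the two searches: `vectorWeight * 1/(rank + 60)` from the vector search plus `textWeight * 1/(rank + 60)` from the lexical search (see the aggregation pipeline in `app.py`).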

The `pageNum` attribute is used for pre-filtering. The code applies a simple `{"metadata.page": {"$lte": <pageNum>}}` filter to the vector search. It's simplistic, but it does the job of showing the benefits.
The `company` attribute is used by the lexical search. You can choose between `Stellantis`, `GM`, and `Ford`.
12 changes: 12 additions & 0 deletions apps/hybrid-search/Lang-Chain.iml
@@ -0,0 +1,12 @@
<?xml version="1.0" encoding="UTF-8"?>
<module type="PYTHON_MODULE" version="4">
  <component name="NewModuleRootManager" inherit-compiler-output="true">
    <exclude-output />
    <content url="file://$MODULE_DIR$">
      <sourceFolder url="file://$MODULE_DIR$" isTestSource="false" />
      <excludeFolder url="file://$MODULE_DIR$/venv" />
    </content>
    <orderEntry type="jdk" jdkName="Python 3.9 (Lang-Chain)" jdkType="Python SDK" />
    <orderEntry type="sourceFolder" forTests="false" />
  </component>
</module>
77 changes: 77 additions & 0 deletions apps/hybrid-search/README.md
@@ -0,0 +1,77 @@
# Hybrid Search Demo

Performs a configurable hybrid search that lets you control the scoring weights used in the reciprocal rank fusion (RRF) approach.

Uses a set of financial statements from major vehicle manufacturers as the source data. The demo lets you show the following:

- Benefits of vector search
- Pre-filtering on the vector side
- Hybrid search capabilities
- Ability to influence the weighting of data in the RRF approach

## Setup

Run the following commands:

```shell
python -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt
```

### Loading Data

MongoDB employees can use the link below to download the dump file from the existing database, then use the command that follows to load the data:
[Google Drive for Dump Files](https://drive.google.com/drive/folders/15m-7-Mp8jTZn0IP-AXvfN9pd3p1ubJh9?usp=drive_link)

```shell
mongorestore --uri="<connection string>" --db langchain --dir=<folder for data>
```
That will load the existing data (with precomputed embeddings) into your cluster and avoid the need to make additional calls to the embedding model.
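
To confirm the restore, you can count the documents in the collection the app queries (assuming the dump restores into the `langchain` database's `financial_statements` collection, which is what `app.py` expects):

```shell
mongosh "<connection string>" --eval 'db.getSiblingDB("langchain").financial_statements.countDocuments()'
```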

Others (or those wishing to use a different embedding model) can submit a `POST` request to the `http://localhost:5000/generate` endpoint, and the app will generate the embeddings itself.
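
For example, once the app is running (see "Start the app" below), the following request kicks off parsing, chunking, and embedding generation. It calls the embedding model once per chunk, so it can take a while:

```shell
curl -X POST http://localhost:5000/generate
```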

## Execute demo

The demo runs as a Flask app; you submit API requests to it, which makes it easy to change the input parameters and get back the results.

### Start the app

```shell
source ./venv/bin/activate
export FLASK_APP=app.py
export FLASK_ENV=development
flask run
```


To run a query, submit a JSON payload like the example below as a `POST` request to `http://localhost:5000`:

```json
{
  "prompt": "What was combined revenue in 2022",
  "company": "Ford",
  "pageNum": 3,
  "vectorWeight": 0.2,
  "textWeight": 0.8,
  "textBoost": 3
}
```
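
For example, with `curl` (assuming the app is running locally on the default Flask port, as in the URL above):

```shell
curl -X POST http://localhost:5000/ \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What was combined revenue in 2022", "company": "Ford", "pageNum": 3, "vectorWeight": 0.2, "textWeight": 0.8, "textBoost": 3}'
```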

Changing the weight values (and the question) lets you show the benefits of hybrid search.
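Under the hood, each result's combined score is the sum of its weighted reciprocal-rank scores from the two searches: `vectorWeight * 1/(rank + 60)` from the vector search plus `textWeight * 1/(rank + 60)` from the lexical search (see the aggregation pipeline in `app.py`).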

The `pageNum` attribute is used for pre-filtering. The code applies a simple `{"metadata.page": {"$lte": <pageNum>}}` filter to the vector search. It's simplistic, but it does the job of showing the benefits.
The `company` attribute is used by the lexical search. You can choose between `Stellantis`, `GM`, and `Ford`.
196 changes: 196 additions & 0 deletions apps/hybrid-search/app.py
@@ -0,0 +1,196 @@
import os

from flask import Flask, jsonify, request
from flask_pymongo import PyMongo
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI

app = Flask(__name__)
app.config["DEBUG"] = True
app.config["MONGO_URI"] = os.getenv("ATLAS_URL")  # Atlas connection string
client = PyMongo(app).cx["langchain"]  # "langchain" database on the cluster
openai = OpenAI()  # reads OPENAI_API_KEY from the environment


@app.route("/generate", methods=["POST"])
def generateEmbeddings():
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200, add_start_index=True
)
for file in os.listdir("./finacialstatements"):
loader = PyPDFLoader(f"./finacialstatements/{file}")
docs = loader.load()
splits = text_splitter.split_documents(docs)
i = 0
docs = []
for split in splits:
vector = (
openai.embeddings.create(
input=[str(split.page_content)], model="text-embedding-3-small"
)
.data[0]
.embedding
)
dbDoc = {
"content": split.page_content,
"vectors": vector,
"metadata": {
"page": split.metadata.get("page"),
"start_index": split.metadata.get("start_index"),
"fileName": file,
},
}
docs.append(dbDoc)
i += 1
if i % 20 == 0:
client.financial_statements.insert_many(docs)
docs = []
client.financial_statements.insert_many(docs)
return jsonify({"results" "Docs successfully parsed and loaded"})


@app.route("/", methods=["POST"])
def getAnswers():
data = request.get_json()
prompt = data.get("prompt")
company = data.get("company")
pageNum = data.get("pageNum")
vcWeight = data.get("vectorWeight")
textWeight = data.get("textWeight")
textBoost = data.get("textBoost")

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
promptEmbedding = embeddings.embed_query(prompt)
if company:
vectorWeight = vcWeight if vcWeight else 0.9
fullTextWeight = textWeight if textWeight else 0.1
textBoostValue = textBoost if textBoost else 1

pipeline = [
{
"$vectorSearch": {
"index": "vector_index",
"path": "vectors",
"queryVector": promptEmbedding,
"numCandidates": 100,
"limit": 20,
}
},
{"$group": {"_id": None, "docs": {"$push": "$$ROOT"}}},
{"$unwind": {"path": "$docs", "includeArrayIndex": "rank"}},
{
"$addFields": {
"vs_score": {
"$multiply": [
vectorWeight,
{"$divide": [1.0, {"$add": ["$rank", 60]}]},
]
}
}
},
{
"$project": {
"vs_score": 1,
"_id": "$docs._id",
"content": "$docs.content",
"metadata": "$docs.metadata",
}
},
{
"$unionWith": {
"coll": "financial_statements",
"pipeline": [
{
"$search": {
"index": "rrf-full-text-search",
"text": {
"query": company,
"path": "metadata.fileName",
"score": {"boost": {"value": textBoostValue}},
},
}
},
{"$limit": 20},
{"$group": {"_id": None, "docs": {"$push": "$$ROOT"}}},
{"$unwind": {"path": "$docs", "includeArrayIndex": "rank"}},
{
"$addFields": {
"fts_score": {
"$multiply": [
fullTextWeight,
{"$divide": [1.0, {"$add": ["$rank", 60]}]},
]
}
}
},
{
"$project": {
"fts_score": 1,
"_id": "$docs._id",
"content": "$docs.content",
"metadata": "$docs.metadata",
}
},
],
}
},
{
"$group": {
"_id": "$_id",
"metadata": {"$first": "$metadata"},
"content": {"$first": "$content"},
"vs_score": {"$max": "$vs_score"},
"fts_score": {"$max": "$fts_score"},
}
},
{
"$project": {
"_id": 1,
"content": 1,
"metadata": 1,
"vs_score": {"$ifNull": ["$vs_score", 0]},
"fts_score": {"$ifNull": ["$fts_score", 0]},
}
},
{
"$project": {
"score": {"$add": ["$fts_score", "$vs_score"]},
"_id": 1,
"content": 1,
"metadata": 1,
"vs_score": 1,
"fts_score": 1,
}
},
{"$sort": {"score": -1}},
{"$limit": 10},
]
else:
pipeline = [
{
"$vectorSearch": {
"queryVector": promptEmbedding,
"path": "vectors",
"numCandidates": 100,
"index": "vector_index",
"limit": 20,
"exact": False,
}
},
{"$project": {"page": 1, "content": 1, "metadata": 1}},
]
if pageNum:
pipeline[0].get("$vectorSearch")["filter"] = {
"metadata.page": {"$lte": pageNum}
}

results = client.financial_statements.aggregate(pipeline)

return jsonify(results)


if __name__ == "__main__":
app.run(debug=True)
13 binary files not shown.
9 changes: 9 additions & 0 deletions apps/hybrid-search/requirements.txt
@@ -0,0 +1,9 @@
langchain-community
pypdf
pymongo
langchain-aws
langchain-openai
python-dotenv
langchain-mongodb
flask
flask-pymongo