[FEATURE REQUEST]: Being able to handle embeddings with Bedrock #2539

boroju · 2024-10-17T14:06:37Z

Contact Details

Feature Description

Leveraging Bedrock for Embeddings in "Insurance PDF Search"

Use Case Description

I'm a member of the Industry Solutions team at MongoDB. Today, we had a meeting with Timo where we discussed the possibility of refactoring our existing 'Insurance PDF Search' demo to use Bedrock embeddings instead of the OpenAI ones (OpenAIEmbedding). The code, which was written several months ago, still calls the 'superduperdb' library, which I believe is legacy since the library/company name has changed.

Considering our desire to move away from OpenAI embeddings and start leveraging Bedrock, perhaps we can collaborate to make this happen. By the way, Timo mentioned that we should open a feature request to explore how to handle this using 'plugins'.

Organization

Refactoring our existing 'Insurance PDF Search' demo to use Bedrock embeddings and to remove the legacy imports and dependencies on superduperdb.

Who are the stake-holders?

MongoDB Industry Solutions team, Luca Napoli, Jeff Needham

blythed · 2024-10-17T21:14:49Z

Hi @boroju, happy to help! There are some examples in the /templates directory, which should work fairly well out of the box.

boroju · 2024-10-18T07:01:31Z

Hey @blythed, perfect! I'll look into it and reach out if I have further questions. Thanks for replying!

boroju · 2024-10-31T09:58:51Z

Hey Timo, hope you're well ☺️, just a quick question:

Where can I find the classes "Table" or "Collection" within the superduper framework? According to your documentation, they should be here:

from superduper import Table

https://docs.superduper.io/docs/execute_api/data_encodings_and_schemas#create-a-table-with-a-schema
"In MongoDB this Table refers to a MongoDB collection, otherwise to an SQL table."

Previously, when the framework was "superduperdb," we were able to import "Collection" this way:
from superduperdb.backends.mongodb import Collection

BUT I couldn't find either "Collection" or "Table" as described in your current documentation. I think you should take a look. 👀

This is basically what I'm trying to do:

from superduper import Table

def save_pdfs(db, pdf_path: str = PDF_PATH):

    db.add(unstructured_encoder)

    file_path = get_file_path()
    logging.info(f"File Path: {file_path}")

    # Concatenating current file path with pdf path
    pdf_folder = file_path + pdf_path
    logging.info(f"PDF Folder: {pdf_folder}")

    pdf_names = [pdf for pdf in os.listdir(pdf_folder) if pdf.endswith(".pdf")]
    logging.info(f"PDF Names: {pdf_names}")

    pdf_paths = [os.path.join(pdf_folder, pdf) for pdf in pdf_names]

    to_insert = [
        Document({"elements": unstructured_encoder(pdf_path)}) for pdf_path in pdf_paths
    ]

    logging.info(f"Documents to insert: {to_insert}")

    # In MongoDB this Table refers to a MongoDB collection, otherwise to an SQL table.
    # https://docs.superduper.io/docs/execute_api/data_encodings_and_schemas#create-a-table-with-a-schema
    collection = Table(COLLECTION_NAME_SOURCE)

    # Inserting documents into the MongoDB
    logging.info(f"Inserting documents in {COLLECTION_NAME_SOURCE} collection over MongoDB database")
    db.execute(collection.insert_many(to_insert))

And I am getting this error:

 from superduper import Table
ImportError: cannot import name 'Table' from 'superduper' (/Users/julian.boronat/Github/mongodb/mongodb-industry-solutions/is-insurance-pdf-search/.venv/lib/python3.11/site-packages/superduper/__init__.py)

The package version I am using is:
superduper-framework = "0.3.0"

I know this is the latest available version because I just added it.
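For anyone reproducing this, a small check can confirm which version is installed and whether the package exposes a given name at the top level. This is only a sketch: `describe_export` is a hypothetical helper, not part of superduper, and it assumes the distribution is named `superduper-framework` while the import name is `superduper`, as in the snippets above.

```python
import importlib
from importlib import metadata


def describe_export(import_name, attribute, dist_name=None):
    """Report importability, installed version, and whether `attribute` is exposed.

    Hypothetical helper for debugging ImportError; not part of any library.
    """
    try:
        module = importlib.import_module(import_name)
    except ImportError:
        # The package itself cannot be imported in this environment.
        return {"importable": False, "version": None, "has_attribute": False}
    try:
        # The distribution name can differ from the import name.
        version = metadata.version(dist_name or import_name)
    except metadata.PackageNotFoundError:
        version = None
    return {
        "importable": True,
        "version": version,
        "has_attribute": hasattr(module, attribute),
    }


if __name__ == "__main__":
    # Assumed names, taken from the comment above.
    print(describe_export("superduper", "Table", dist_name="superduper-framework"))
```

If `has_attribute` comes back `False` while the version matches the one the documentation targets, the name has probably moved or the documentation describes a different release.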

The same functionality was working before with the older "superduperdb" framework version. Please see our previous code below:

from superduperdb.backends.mongodb import Collection
from superduperdb import Document
from superduperdb.ext.unstructured.encoder import unstructured_encoder
from pdf2image import convert_from_path

def save_pdfs(db, pdf_folder):

    db.add(unstructured_encoder)

    pdf_names = [pdf for pdf in os.listdir(pdf_folder) if pdf.endswith(".pdf")]

    pdf_paths = [os.path.join(pdf_folder, pdf) for pdf in pdf_names]
    collection = Collection(COLLECTION_NAME_SOURCE)
    to_insert = [
        Document({"elements": unstructured_encoder(pdf_path)}) for pdf_path in pdf_paths
    ]
    db.execute(collection.insert_many(to_insert))

I'm going to add this comment as part of the open issue:
#2539

I would like to point out that in the /templates directory, there is nothing for connecting to Bedrock and generating embeddings from one of the models there. This was mentioned to me by @blythed in response.
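For reference, here is a minimal sketch of what generating an embedding through Bedrock could look like with boto3. It is not taken from the superduper templates and does not show how to wire the call into a superduper plugin. It assumes AWS credentials and a region are already configured, and that the account has access to an Amazon Titan text embedding model; the model ID and the `embed_text` helper are illustrative choices.

```python
import json

# Assumption: a Titan text embedding model is enabled for the account.
TITAN_MODEL_ID = "amazon.titan-embed-text-v1"


def embed_text(text, client=None, model_id=TITAN_MODEL_ID):
    """Return the embedding vector for `text` as a list of floats.

    `client` is expected to behave like a boto3 "bedrock-runtime" client.
    """
    if client is None:
        # Imported lazily so the function can also be exercised with a stub client.
        import boto3

        client = boto3.client("bedrock-runtime")

    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps({"inputText": text}),
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream containing a JSON document.
    payload = json.loads(response["body"].read())
    return payload["embedding"]
```

Passing the client in as a parameter keeps the function easy to test without network access, and the request body shown is specific to Titan; other embedding models on Bedrock expect different payloads.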

If you want to connect and pair up to fix/debug the code, I’m willing to do so with you. 😊

Thank you!

blythed · 2024-10-31T13:08:37Z

Hi @boroju, we are about to release 0.4.0 in the next couple of days. In this new version, adding PDFs (and other data) is much easier than before, and is decoupled from whichever data backend you use.

Here is a PDF example, which you can run and update directly in the notebook: https://github.com/superduper-io/superduper/tree/main/templates/pdf_rag/build.ipynb.

This is similar code to the code we used to build the "volvo" demo. I hope that helps!

boroju · 2024-11-04T10:52:12Z

Hi @blythed,

I've been following the examples you mentioned, and now I am looking at the "volvo" demo, which uses the old framework version. I am wondering what the alternative is now for the unstructured encoder:

from superduperdb.ext.unstructured.encoder import unstructured_encoder

The line above is no longer supported in the newer version of superduper.

The same issue occurs with "Collection," but I believe I can use "Table" instead:

from superduper import Table

Would you mind clarifying this for me?

boroju · 2024-11-04T11:01:42Z

@blythed This is basically what I'm trying to do here:

def parse_and_store(db):

    db.add(unstructured_encoder)

    pdf_folder = 'pdf-folders'

    pdf_paths = [os.path.join(pdf_folder, pdf) for pdf in os.listdir(pdf_folder)]
    collection = Table("source")
    to_insert = [
        Document({"elements": unstructured_encoder(pdf_path)}) for pdf_path in pdf_paths
    ]
    db.execute(collection.insert_many(to_insert))
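
One detail worth noting: unlike the earlier versions, this snippet passes every entry in the folder to the encoder, not only PDFs. A possible way to collect the paths, independent of superduper, is sketched below. `list_pdf_paths` is a hypothetical helper, and it matches the extension case-insensitively, which is slightly broader than the original `endswith(".pdf")` check.

```python
from pathlib import Path


def list_pdf_paths(folder):
    """Return sorted paths of the PDF files directly inside `folder`."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        # Skip sub-directories and any file that is not a PDF.
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

The result can replace the `pdf_paths` list comprehension, for example `pdf_paths = list_pdf_paths(pdf_folder)`.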

blythed · 2024-12-12T11:23:26Z

@boroju do you think that you will create a bedrock plugin with the findings we discussed offline?
