[FEATURE REQUEST]: Being able to handle embeddings with Bedrock #2539

Open
boroju opened this issue Oct 17, 2024 · 7 comments

Comments

boroju commented Oct 17, 2024

Contact Details

[email protected]; [email protected]

Feature Description

Leveraging Bedrock for Embeddings in "Insurance PDF Search"

Use Case Description

I'm a member of the Industry Solutions team at MongoDB. Today, we had a meeting with Timo where we discussed the possibility of refactoring our existing 'Insurance PDF Search' demo to use Bedrock embeddings instead of the OpenAI ones (OpenAIEmbedding). The code, which was written several months ago, still calls the 'superduperdb' library, which I believe is legacy since the library/company name has changed.

Considering our desire to move away from OpenAI embeddings and start leveraging Bedrock, perhaps we can collaborate to make this happen. By the way, Timo mentioned that we should open a feature request to explore how to handle this using 'plugins'.

Organization

Refactoring our existing 'Insurance PDF Search' demo to use Bedrock embeddings and to remove the legacy imports/dependencies that call superduperdb.

Who are the stakeholders?

MongoDB Industry Solutions team, Luca Napoli [email protected], Jeff Needham [email protected]

Collaborator

blythed commented Oct 17, 2024

Hi @boroju, happy to help! There are some examples in the /templates directory, which should work fairly well out of the box.

Author

boroju commented Oct 18, 2024

Hey @blythed, perfect! I'll look into it and reach out if I have further questions. Thanks for replying!

Author

boroju commented Oct 31, 2024

Hey Timo, hope you're well ☺️, just a quick question:

Where can I find the classes "Table" or "Collection" within the superduper framework? According to your documentation, they should be here:

from superduper import Table

https://docs.superduper.io/docs/execute_api/data_encodings_and_schemas#create-a-table-with-a-schema
"In MongoDB this Table refers to a MongoDB collection, otherwise to an SQL table."

Previously, when the framework was "superduperdb," we were able to import "Collection" this way:
from superduperdb.backends.mongodb import Collection

BUT I couldn't find either "Collection" or "Table" as described in your current documentation. I think you should take a look. 👀

This is basically what I'm trying to do:

from superduper import Table

def save_pdfs(db, pdf_path: str = PDF_PATH):

    db.add(unstructured_encoder)

    file_path = get_file_path()
    logging.info(f"File Path: {file_path}")

    # Concatenating current file path with pdf path
    pdf_folder = file_path + pdf_path
    logging.info(f"PDF Folder: {pdf_folder}")

    pdf_names = [pdf for pdf in os.listdir(pdf_folder) if pdf.endswith(".pdf")]
    logging.info(f"PDF Names: {pdf_names}")

    pdf_paths = [os.path.join(pdf_folder, pdf) for pdf in pdf_names]

    to_insert = [
        Document({"elements": unstructured_encoder(pdf_path)}) for pdf_path in pdf_paths
    ]

    logging.info(f"Documents to insert: {to_insert}")
    logging.info(to_insert)

    # In MongoDB this Table refers to a MongoDB collection, otherwise to an SQL table.
    # https://docs.superduper.io/docs/execute_api/data_encodings_and_schemas#create-a-table-with-a-schema
    collection = Table(COLLECTION_NAME_SOURCE)

    # Inserting documents into the MongoDB
    logging.info(f"Inserting documents in {COLLECTION_NAME_SOURCE} collection over MongoDB database")
    db.execute(collection.insert_many(to_insert))
    

And I am getting this error:

 from superduper import Table
ImportError: cannot import name 'Table' from 'superduper' (/Users/julian.boronat/Github/mongodb/mongodb-industry-solutions/is-insurance-pdf-search/.venv/lib/python3.11/site-packages/superduper/__init__.py)

The package version I am using is:
superduper-framework = "0.3.0"

I know this is the latest available version because I just added it.
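To double-check which version is actually installed in the virtualenv and whether the top-level package exposes a given name, a small standard-library sketch like the following may help. `describe_export` is a hypothetical helper name, not part of superduper; the distribution name `superduper-framework` and the import name `superduper` are taken from the dependency line and the traceback above.

```python
import importlib
from importlib import metadata
from typing import Optional


def describe_export(package: str, name: str, dist: Optional[str] = None) -> dict:
    """Report the installed distribution version and whether `package` exposes `name`."""
    try:
        version = metadata.version(dist or package)
    except metadata.PackageNotFoundError:
        version = None  # nothing installed under that distribution name
    module = importlib.import_module(package)
    return {"version": version, "exported": hasattr(module, name)}


if __name__ == "__main__":
    # Guarded so the script still runs where superduper is not installed.
    try:
        print(describe_export("superduper", "Table", dist="superduper-framework"))
    except ImportError as exc:
        print(f"superduper is not importable here: {exc}")
```

If `exported` comes back `False`, the name is not available at the top level of the installed version, which would match the ImportError above.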

The same functionality was working before with the older "superduperdb" framework version. Please see our previous code below:

from superduperdb.backends.mongodb import Collection
from superduperdb import Document
from superduperdb.ext.unstructured.encoder import unstructured_encoder
from pdf2image import convert_from_path

def save_pdfs(db, pdf_folder):

    db.add(unstructured_encoder)

    pdf_names = [pdf for pdf in os.listdir(pdf_folder) if pdf.endswith(".pdf")]

    pdf_paths = [os.path.join(pdf_folder, pdf) for pdf in pdf_names]
    collection = Collection(COLLECTION_NAME_SOURCE)
    to_insert = [
        Document({"elements": unstructured_encoder(pdf_path)}) for pdf_path in pdf_paths
    ]
    db.execute(collection.insert_many(to_insert))

I'm going to add this comment as part of the open issue:
#2539

I would like to point out that the /templates directory, which @blythed pointed me to above, has nothing for connecting to Bedrock and generating embeddings with one of the models there.
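For reference, here is a minimal sketch of what generating an embedding from Bedrock looks like when calling boto3 directly, outside of superduper. It is not a superduper plugin and makes several assumptions: the model ID is an Amazon Titan text-embedding model (swap in whichever embedding model is enabled in your account), the region is a placeholder, AWS credentials are already configured, and `make_client` and `embed_text` are hypothetical helper names.

```python
import json
from typing import Any, List

# Assumption: a Titan text-embedding model ID; replace it with the
# embedding model enabled in your AWS account and region.
DEFAULT_MODEL_ID = "amazon.titan-embed-text-v2:0"


def make_client(region_name: str = "us-east-1") -> Any:
    """Create a Bedrock runtime client (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so embed_text can be tested without boto3

    return boto3.client("bedrock-runtime", region_name=region_name)


def embed_text(client: Any, text: str, model_id: str = DEFAULT_MODEL_ID) -> List[float]:
    """Return the embedding vector for `text` using the Titan request format."""
    response = client.invoke_model(
        modelId=model_id,
        body=json.dumps({"inputText": text}),
        accept="application/json",
        contentType="application/json",
    )
    # The response body is a stream containing a JSON document.
    payload = json.loads(response["body"].read())
    return payload["embedding"]


# Example usage (performs a real AWS call, so it is left commented out):
# client = make_client()
# vector = embed_text(client, "What does my policy cover?")
```

Passing the client in as an argument keeps the embedding call easy to stub in tests and easy to wrap in whatever model interface a plugin ends up needing.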

If you want to connect and pair up to fix/debug the code, I’m willing to do so with you. 😊

Thank you!

Collaborator

blythed commented Oct 31, 2024

Hi @boroju we are about to release 0.4.0 in the next couple of days. In this new version, adding pdfs (and other data) is much easier than before, and is decoupled from which data-backend you would use.

Here is a pdf example, which you can play with directly or update in the notebook: https://github.com/superduper-io/superduper/tree/main/templates/pdf_rag/build.ipynb.

This is similar code to the code we used to build the "volvo" demo. I hope that helps!

Author

boroju commented Nov 4, 2024

Hi @blythed,

I've been following the examples you mentioned, and now I am looking at the "volvo" demo, which uses the old framework version. I am wondering what the alternative is now for the unstructured encoder:

from superduperdb.ext.unstructured.encoder import unstructured_encoder

The line above is no longer supported in the newer version of superduper.

The same issue occurs with "Collection," but I believe I can use "Table" instead:

from superduper import Table

Would you mind clarifying this for me?

Author

boroju commented Nov 4, 2024

@blythed This is basically what I'm trying to do here:

def parse_and_store(db):

    db.add(unstructured_encoder)

    pdf_folder = 'pdf-folders'

    pdf_paths = [os.path.join(pdf_folder, pdf) for pdf in os.listdir(pdf_folder)]
    collection = Table("source")
    to_insert = [
        Document({"elements": unstructured_encoder(pdf_path)}) for pdf_path in pdf_paths
    ]
    db.execute(collection.insert_many(to_insert))
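One detail worth noting in the snippet above: unlike the earlier `save_pdfs` version, the listing does not filter on the `.pdf` extension, so any other file in the folder would also be passed to the encoder. A small standard-library sketch of a filtered listing follows; `list_pdf_paths` is a hypothetical helper name, and the case-insensitive match is an assumption about what is wanted.

```python
from pathlib import Path
from typing import List


def list_pdf_paths(folder: str) -> List[str]:
    """Return the sorted paths of regular files in `folder` ending in .pdf."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        # Skip sub-directories and non-PDF files; match .PDF as well as .pdf.
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

With this, `pdf_paths = list_pdf_paths(pdf_folder)` could stand in for the `os.listdir` comprehension.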

Collaborator

blythed commented Dec 12, 2024

@boroju do you think that you will create a bedrock plugin with the findings we discussed offline?
