Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add interfaces and tests for nearest neighbors and embeddings #50

Merged
merged 14 commits into from
Dec 19, 2024

Conversation

gschoeni
Copy link
Collaborator

@gschoeni gschoeni commented Dec 13, 2024

from oxen import Workspace, DataFrame

# Create a workspace off main
workspace = Workspace(remote_repo, "main")

# Create a data frame index
remote_df = DataFrame(workspace, "question_embeddings.parquet")

# Check if nearest neighbors is enabled
column = "question_embeddings"
is_enabled = remote_df.is_nearest_neighbors_enabled(column=column)

# Index the embeddings column
remote_df.enable_nearest_neighbors(column=column)

# Do a nearest neighbors search
rows = remote_df.query(
    find_embedding_where={"title": "A"}, sort_by_similarity_to="question_embeddings"
)

assert len(rows) > 1
assert rows[0]["id"] == "290"
assert rows[0]["title"] == "A"

@gschoeni
Copy link
Collaborator Author

gschoeni commented Dec 18, 2024

@jcelliott @EloyMartinez there are test parquet files in the oxen/tests/data/QuestionEmbeddings/ folder if you want to give it a spin with the python snippet above. There are also a few unit tests for reference.

Let me know if the APIs make sense and are intuitive to use or if we should iterate on them.

Also we should do a python deep dive sometime this week with the team.

@@ -20,6 +20,8 @@ dependencies = [
"requests",
"toml",
"tqdm",
"torch",
"tensorflow",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will make package managers try to install torch and tensorflow as dependencies for oxen. Is that what we want?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was debating, because they are used in our data loaders (that are undocumented) but are a heavy install. Maybe we do just keep them in the dev dependencies for now...

@jcelliott
Copy link
Collaborator

Worked for me, and I think the interface makes sense.

Not really related to this PR, but I ran into that issue with the PyRemoteRepo vs RemoteRepo. It looks like remote_repo.get_repo actually returns a PyRemoteRepo, but the docs say it returns a RemoteRepo.

@gschoeni gschoeni merged commit 9a5ff75 into main Dec 19, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants