Embeddings search experimental API #1164

Open

mlin wants to merge 10 commits into main from mlin/similarity-search-api

Contributor

mlin commented May 27, 2024 (edited)

Adds two new functions to cellxgene_census.experimental:

find_nearest_obs uses TileDB-Vector-Search indexes of Census embeddings to find the nearest neighbors of given embedding vectors (in an AnnData obsm layer). See #1114, "Census cell similarity search: experimental Python API for searching given AnnData".
predict_obs_metadata uses the nearest neighbors to predict metadata attributes such as cell_type and tissue_general for the query cells. The initial implementation is naive and intended only as a starting point for experimentation. See #1115, "Census cell similarity search: experimental Python API for metadata prediction".

TileDB-Vector-Search query speed seems to be very sensitive to S3 latency, even more so than typical Census queries. Queries run many times faster from within AWS us-west-2 than from outside it.
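To make the two steps concrete, here is a minimal, self-contained sketch of the general idea: find the k nearest reference cells for a query vector, then predict a label by majority vote among them. It is not the code in this PR. It uses a brute-force scan in place of the TileDB-Vector-Search index, and the names nearest_neighbors and predict_label, the toy vectors, and the labels are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def nearest_neighbors(
    query: Sequence[float],
    reference: Dict[int, Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (reference id, Euclidean distance) for the k closest vectors, nearest first."""
    # Brute-force scan; a real vector index avoids comparing against every reference vector.
    scored = ((math.dist(query, vec), ref_id) for ref_id, vec in reference.items())
    return [(ref_id, dist) for dist, ref_id in heapq.nsmallest(k, scored)]


def predict_label(
    neighbors: List[Tuple[int, float]],
    labels: Dict[int, str],
) -> Tuple[str, float]:
    """Majority vote over neighbor labels; also return the winning fraction as a rough confidence."""
    votes = Counter(labels[ref_id] for ref_id, _ in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbors)


# Toy reference set (hypothetical): ids map to 2-D embedding vectors and cell-type labels.
reference = {0: (0.0, 0.0), 1: (0.1, 0.0), 2: (5.0, 5.0), 3: (5.1, 5.0), 4: (0.0, 0.2)}
labels = {0: "T cell", 1: "T cell", 2: "B cell", 3: "B cell", 4: "T cell"}

neighbors = nearest_neighbors((0.0, 0.0), reference, k=3)
prediction = predict_label(neighbors, labels)
```

In this toy case the three nearest reference ids all carry the same label, so the vote is unanimous. A distance-weighted vote would be a natural refinement, but that is a design choice this sketch does not attempt.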

mlin mentioned this pull request

Census cell similarity search: experimental Python API for searching given AnnData #1114

Open

codecov bot commented Jun 9, 2024 (edited)

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.29%. Comparing base (443845d) to head (e73102b).

❗ Current head e73102b differs from pull request most recent head 874b2eb

Please upload reports for the commit 874b2eb to get more accurate results.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1164      +/-   ##
==========================================
+ Coverage   91.11%   91.29%   +0.17%     
==========================================
  Files          77       79       +2     
  Lines        5923     6044     +121     
==========================================
+ Hits         5397     5518     +121     
  Misses        526      526

Flag	Coverage Δ
unittests	`91.29% <100.00%> (+0.17%)`	⬆️


          squash for PR

639e64c

mlin force-pushed the mlin/similarity-search-api branch from 6bf1181 to 639e64c Compare

June 10, 2024 01:18

mlin marked this pull request as ready for review

June 10, 2024 07:32

mlin requested review from ebezzi, ivirshup and pablo-gar

June 10, 2024 07:33

Contributor Author

mlin commented Jun 10, 2024

@ebezzi Putting this up for initial review since it's working well, but we still need to plan action on #1181 -- this still copies the approach of hard-coding the base S3 URI.

mlin requested a review from prathapsridharan

June 10, 2024 17:43

ivirshup reviewed

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py (outdated, resolved)

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py

		"""


		def find_nearest_obs(

Contributor

ivirshup Jun 10, 2024

On the API side, it would be nice if this could produce output that can be used directly with sklearn-style classes. For example, if this returned a KNeighborsTransformer subclass, that could be used directly with the KNeighborsClassifier and KNeighborsRegressor classes.

Contributor Author

mlin Jun 17, 2024

@ivirshup I like this idea very much, but I'm not quite sure it's workable (albeit I'm not as familiar with those APIs)...

Those scikit-learn classes seem oriented around the scenario where you're providing either all the points (in the "universe") or the complete distance matrix for them. Here we're working with a more limited view of the query points and their neighbor distances; we don't have or want the complete distance matrix, and actually we don't even have the coordinates of the neighbors immediately handy.

Do you think the shoe fits? I see there's some stuff about the "K neighbors graph" that might be relevant, but I'm not personally familiar enough to use them in an unconventional/advanced way like this.
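One way to picture the "K neighbors graph" idea: scikit-learn's neighbor-based estimators can take a sparse neighbors graph with metric="precomputed", where each row is a query point, each column an indexed point, and the stored values are distances. The sketch below only packs per-query neighbor ids and distances into CSR arrays of that shape, using plain Python. The function name neighbors_to_csr and the toy values are assumptions, and whether the scikit-learn estimators are practical at Census scale is left open, as in the discussion above.

```python
from typing import List, Sequence, Tuple


def neighbors_to_csr(
    neighbor_ids: Sequence[Sequence[int]],
    distances: Sequence[Sequence[float]],
) -> Tuple[List[int], List[int], List[float]]:
    """Pack per-query neighbor lists into CSR arrays (indptr, indices, data)."""
    indptr: List[int] = [0]
    indices: List[int] = []
    data: List[float] = []
    for ids_row, dist_row in zip(neighbor_ids, distances):
        # Store each row's neighbors in order of increasing distance.
        for dist, ref_id in sorted(zip(dist_row, ids_row)):
            indices.append(ref_id)
            data.append(dist)
        # indptr marks where each query's neighbors end in the flat arrays.
        indptr.append(len(indices))
    return indptr, indices, data


# Two query cells with two neighbors each (toy values).
indptr, indices, data = neighbors_to_csr(
    neighbor_ids=[[7, 2], [5, 9]],
    distances=[[0.4, 0.1], [0.3, 0.2]],
)
```

The three arrays could then be wrapped as scipy.sparse.csr_matrix((data, indices, indptr), shape=(n_queries, n_indexed)). Only the query-to-neighbor entries are stored, so neither the full distance matrix nor the neighbor coordinates are needed.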

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py Outdated



		def find_nearest_obs(
		embedding_metadata: Dict[str, Any],

Contributor

ivirshup Jun 10, 2024

Why does this use a different way to specify an embedding than get_embedding does?

Contributor Author

mlin Jun 17, 2024

get_embedding is a little low-level in that it wants the full URI of the embeddings TileDB array, which isn't actually needed to find the index. embedding_metadata is the information returned by get_embedding_metadata(), which seems like the appropriate level (especially in view of #1181, in which we will put the relative URIs of the index arrays in there), although of course it would be nice if it were more type-safe.

Collaborator

pablo-gar Jun 24, 2024

@mlin can we change it to the following?

   embedding_name: str, 
   organism: str, 
   census_version: str

I think that provides an easier entry point for users, and it aligns with get_embedding_metadata_by_name
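A rough sketch of what such an entry point could look like: the caller passes the three strings and the function looks up the metadata itself. Everything here is hypothetical, including the in-memory registry that stands in for the real metadata lookup, the n_features field, and the names lookup_embedding_metadata and find_nearest_obs_sketch. The search itself is omitted.

```python
from typing import Any, Dict, Sequence, Tuple

# Hypothetical stand-in for the metadata lookup; keys and fields are illustrative only.
_EXAMPLE_METADATA: Dict[Tuple[str, str, str], Dict[str, Any]] = {
    ("example_embedding", "homo_sapiens", "example-version"): {"n_features": 3},
}


def lookup_embedding_metadata(
    embedding_name: str, organism: str, census_version: str
) -> Dict[str, Any]:
    """Resolve metadata from the three identifying strings (hypothetical registry)."""
    key = (embedding_name, organism, census_version)
    if key not in _EXAMPLE_METADATA:
        raise ValueError(f"no embedding registered for {key}")
    return _EXAMPLE_METADATA[key]


def find_nearest_obs_sketch(
    embedding_name: str,
    organism: str,
    census_version: str,
    query_vectors: Sequence[Sequence[float]],
) -> Dict[str, Any]:
    """Accept user-facing identifiers, look up metadata internally, validate the query."""
    metadata = lookup_embedding_metadata(embedding_name, organism, census_version)
    for vec in query_vectors:
        if len(vec) != metadata["n_features"]:
            raise ValueError("query vector length does not match the embedding")
    # The nearest-neighbor search would go here; this sketch returns the metadata only.
    return metadata


resolved = find_nearest_obs_sketch(
    "example_embedding", "homo_sapiens", "example-version", [(0.1, 0.2, 0.3)]
)
```

The point of the shape is that callers never handle the metadata dictionary, so the function can fail early with a clear error when the name, organism, or version does not match.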

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py (resolved)

mlin added 4 commits

June 16, 2024 22:27


          use DEFAULT_TILEDB_CONFIGURATION

8088f9e


          workaround

fc91d2d


          workaround

c8bb01a

fix

e73102b

mlin mentioned this pull request

[python] fix mypy complaint in experimental/_embedding.py #1197

Merged


Merge remote-tracking branch 'origin/main' into mlin/similarity-search-api

fcc05b4

pablo-gar requested changes

Collaborator

pablo-gar left a comment

Looks good to me, except for one comment on the API signature.

mlin added 3 commits

June 26, 2024 13:07


          resolve indexes through JSONs

ca8d44e


          lint

ee6c184


          API refactoring

a1e4daa

mlin commented

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py

+              def _resolve_embedding_index(
+                  embedding_metadata: Dict[str, Any],
+                  mirror: Optional[str] = None,
+              ) -> Optional[Tuple[str, str]]:

Contributor Author

mlin Jun 27, 2024

@ebezzi new index resolution method here

Contributor Author

mlin commented Jun 27, 2024

@ebezzi @pablo-gar @ivirshup Updated this to resolve indexes through the mirrors/contributions JSON and remove the need for the caller to use get_embedding_metadata_by_name() on their own. Please take another pass, including the prior discussion. Unfortunately we currently have known CI issues, but I've run the new test cases locally. 🙏
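For readers following along, a minimal sketch of what resolving an index through a mirrors document might look like. The JSON layout (a "default" key plus per-mirror base_uri and region), the "indexes" and "relative_uri" fields, and the interpretation of the returned pair as (index URI, region) are all assumptions for illustration. Only the Optional[Tuple[str, str]] return shape is taken from the signature in this PR.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical mirrors document; real field names and values may differ.
MIRRORS: Dict[str, Any] = {
    "default": "mirror-a",
    "mirror-a": {"base_uri": "s3://example-bucket-a/", "region": "us-west-2"},
    "mirror-b": {"base_uri": "s3://example-bucket-b", "region": "eu-west-1"},
}


def resolve_embedding_index(
    embedding_metadata: Dict[str, Any],
    mirrors: Dict[str, Any],
    mirror: Optional[str] = None,
) -> Optional[Tuple[str, str]]:
    """Return (index URI, region), or None if the embedding lists no index."""
    indexes = embedding_metadata.get("indexes") or []
    if not indexes:
        return None
    name = mirror or mirrors["default"]
    if name == "default" or name not in mirrors:
        raise KeyError(f"unknown mirror: {name}")
    entry = mirrors[name]
    # Join the mirror base URI and the relative index URI with exactly one slash.
    uri = entry["base_uri"].rstrip("/") + "/" + indexes[0]["relative_uri"].lstrip("/")
    return uri, entry["region"]


metadata = {"indexes": [{"relative_uri": "/indexes/example_embedding"}]}
default_location = resolve_embedding_index(metadata, MIRRORS)
other_location = resolve_embedding_index(metadata, MIRRORS, mirror="mirror-b")
```

Keeping only relative URIs in the metadata means no base S3 URI is hard-coded, which is the concern raised earlier about #1181.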

pablo-gar approved these changes


Merge remote-tracking branch 'origin/main' into mlin/similarity-search-api

874b2eb
