Draft SpatialData.filter() #626

Draft · wants to merge 1 commit into base: main
Conversation

aeisenbarth
Contributor

(In reference to #620)

This PR implements more advanced filtering options than subset, allowing the creation of a new SpatialData object that contains only specific tables, layers, obs keys, and var keys.

Use cases

  • One can extract parts of a concatenated SpatialData.
  • When testing an operation that adds elements or table columns, one can extract the input data from an expected reference dataset, pass it to the operation, and then compare the processed data against the reference.

Closes #280
Closes #284
Closes #556


codecov bot commented Jul 8, 2024

Codecov Report

Attention: Patch coverage is 10.71429% with 25 lines in your changes missing coverage. Please review.

Project coverage is 91.59%. Comparing base (95d69ff) to head (d9c1e0e).
Report is 49 commits behind head on main.

Files with missing lines Patch % Lines
src/spatialdata/_core/spatialdata.py 10.71% 25 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #626      +/-   ##
==========================================
- Coverage   91.93%   91.59%   -0.35%     
==========================================
  Files          44       44              
  Lines        6661     6688      +27     
==========================================
+ Hits         6124     6126       +2     
- Misses        537      562      +25     
Files with missing lines Coverage Δ
src/spatialdata/_core/spatialdata.py 88.51% <10.71%> (-2.47%) ⬇️

@aeisenbarth
Contributor Author

In its current state, it does not yet resolve all of the issues it aims to address.

  • Subset spatialdata by list of cell ids #556:
    • A parameter instances could be added. If provided, the rows for these instances are selected in the table; if not provided, all instances are returned. It should only allow a single region/element (otherwise it gets complicated).
    • Shapes/points elements should also be filtered (easy)
    • Still unanswered is what effect this should have on the labels: shall we create a new labels image with the instances eliminated (0 = background), or leave the labels unchanged and just have no reference to them in the table?
  • Filter spatialData  #280:
    • This involves a condition. In my opinion, implementing a parameter that takes a condition as a function or query expression is out of scope due to its complexity. I would do this in two steps: users use pandas to get a list of instances, then pass the instances to the SpatialData filter function.
    • The user also asked about adjusting shapes to match the filtered instances in the table.
  • Feature request: spatial cropping from select table rows #284:
    • Filtering labels/shapes/points elements
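
The two-step idea above (express the condition with pandas, then pass the resulting instances on) can be sketched in plain pandas. The column names and the final `sdata.filter(instances=...)` call are hypothetical, not part of any existing API:

```python
import pandas as pd

# Toy stand-in for table.obs; column names are made up for illustration.
obs = pd.DataFrame(
    {"instance_id": [0, 1, 2, 3, 4], "area": [10.0, 55.0, 30.0, 80.0, 5.0]}
)

# Step 1: the user expresses the condition with pandas and obtains instance ids.
instances = obs.loc[obs["area"] > 50.0, "instance_id"].tolist()
print(instances)  # [1, 3]

# Step 2 (hypothetical): the ids would then be passed to something like
# sdata.filter(instances=instances).
```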

@LucaMarconato
Member

Thanks @aeisenbarth. After discussing with @melonora, we are going to first turn the code in #627 into an internal function, merge it, and then continue working on your PR. The idea is to provide a single entry point for filtering, filter(), and to use, for instance, subset() or the function from Wouter internally.

@LucaMarconato LucaMarconato marked this pull request as draft July 12, 2024 17:55
@owenwilkins

Is there somewhere in the documentation that now describes how to filter a SpatialData object by cell IDs? This is valuable for several reasons, e.g. filtering out cells removed by QC in analyses using other libraries.

@LucaMarconato
Member

LucaMarconato commented Jan 27, 2025

I went back to this and to #627 today and realized that we may not need to add a new API, since all the points covered by this PR and by the linked PR, including all the points listed in this message: #626 (comment), are essentially covered by the example below, which uses the currently available APIs:

##
# constructing the example data
from spatialdata.datasets import blobs_annotating_element
from spatialdata import concatenate
from spatialdata import join_spatialelement_table
from spatialdata import SpatialData

sdata1 = blobs_annotating_element("blobs_polygons")
sdata2 = blobs_annotating_element("blobs_polygons")

sdata = concatenate({"sdata1": sdata1, "sdata2": sdata2}, concatenate_tables=True)
print(sdata)

##
# filtering the data
table_name = "table"
filtered_table = sdata[table_name][sdata[table_name].obs.instance_id < 3]
annotated_regions = sdata.get_annotated_regions(sdata[table_name])
elements, table = join_spatialelement_table(
    sdata, spatial_element_names=annotated_regions, table=filtered_table, how="inner"
)
sdata_filtered = SpatialData.init_from_elements(elements | {table_name: table})
print(sdata_filtered)

Explicitly, the code above first filters the table with standard pandas/anndata operations (and thus is very general), and then reuses the join operations to filter the SpatialData object (which again are very general and allow for several cases). In doing this:

  • we limit code redundancy
  • we can use any query/condition to filter the object, including:
    • we can filter by a threshold
    • we can filter by instances
    • we can manually subset certain obs/var/layers of the table
  • shapes and points are filtered to match the table
  • if a table annotates multiple elements, they are all filtered
  • filtering labels is delegated to the join operations (currently not supported, but if it becomes supported in the future, it will also work here)
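
A minimal pandas analogy of the inner-join filtering described in the bullets above (this is not the SpatialData API itself; the instance_id column mirrors the example data):

```python
import pandas as pd

# Stand-in for a shapes element with five instances.
shapes = pd.DataFrame({"instance_id": [0, 1, 2, 3, 4]})
# Stand-in for the table after filtering with instance_id < 3.
table = pd.DataFrame({"instance_id": [0, 1, 2], "value": [0.1, 0.2, 0.3]})

# An inner join keeps only the element rows matching the filtered table,
# which is how the elements end up filtered alongside the table.
joined = shapes.merge(table, on="instance_id", how="inner")
print(joined["instance_id"].tolist())  # [0, 1, 2]
```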

I think we could proceed by choosing one of the following strategies:

  1. we do not add any new API, and make the example above visible in the docs
  2. we add a very minimalistic API that essentially reproduces the example above; convenient for most cases, and for more general cases the user can modify the code
  3. we add a feature-complete API (similar to this PR) and use the code above internally so we don't have code redundancy

Any preference?

@aeisenbarth
Contributor Author

This PR indeed overlaps with existing APIs.

I agree that redundancy should be avoided, but I think an API should help minimize glue code. The example above solves the task, but it contains two filtering steps and two intermediate function calls.

So from my side, I would favor extending the existing API to be more feature-complete, though I would not consider it high priority.

@LucaMarconato
Member

LucaMarconato commented Feb 2, 2025

I agree that redundancy should be avoided, but I think an API should help minimize glue code. The example above solves the task, but it contains two filtering steps and two intermediate function calls.

You are right, that approach is not ergonomic enough. I thought about this, and in the linked PR #627 I now introduce a new API, match_sdata_to_table(), that bundles the code (@aeisenbarth, could you please review the PR?).

All the use cases mentioned in your message are included in the tests. I wonder if the function is now ergonomic enough or if we should still add a filter() function.

Here are my thoughts on this:

Ergonomic limitations of match_sdata_to_table()

  • you need to import the function and pass the sdata argument to it; a method on SpatialData could be added to fix this.
  • while its usage is powerful and simple, as the tests linked above show, it may not be immediately clear how to use it without looking at the tests. We could add some examples to the docstring for this.
  • the name says exactly what the function does, but it's not what people would look for when trying to subset a SpatialData object by a threshold on obs/var.

Limitations of SpatialData.filter()

  • filter may suggest threshold-based filtering, which we probably do not want to implement, since the same result can be achieved via pandas/anndata APIs + match_sdata_to_table(), as shown in this test.
  • code redundancy

My preferred take

I think my preferred approach would be to gather feedback from users and then provide a new function (what would be the best name: filter()? query()? subset()? match()?) that simplifies the usage of match_sdata_to_table(). This is in line with what is reported here: #225. I'd consider closing the current PR and, when we are ready, reworking the code into a new function that eventually also covers the bounding box and polygon query cases. But I'm happy to hear what you think @aeisenbarth @melonora. In particular, if you think it would be much more intuitive to already have a filter()/query()/subset() function that mimics what this PR does and calls match_sdata_to_table() internally, then we could do this directly in the context of this PR.

@srivarra

srivarra commented Feb 11, 2025

@aeisenbarth @LucaMarconato

Hi, I saw this discussion and I have some input.

I made a small package that provides an ergonomic API for filtering, selecting, and group-bys on AnnData: annsel. You provide a Narwhals-compatible expression (Narwhals implements a subset of the Polars API), it runs the query, and it returns the subset of the AnnData object. It uses an accessor-style API, like spatialdata-plot or xarray's accessors.

For example, here is how filter is used:

import annsel as an # registers the AnnData accessor
adata.an.filter(obs=an.col(["Cell_label"]) == "CD8+CD103+ tissue resident memory T cells")

and it's equivalent to

adata[adata.obs["Cell_label"] == "CD8+CD103+ tissue resident memory T cells", :]

and you can add as many expressions as you like:

adata.an.filter(
    obs=(
        an.col(["Cell_label"]).is_in(["Classical Monocytes", "CD8+CD103+ tissue resident memory T cells"]),
        an.col(["Genotype"]).is_in(["APL", "FLT3-ITD,NPM1-mut"]),
    ),
    var=(an.col(["vst.mean"]) >= 3, an.col("feature_type").is_in(["IG_C_gene", "lncRNA", "protein_coding"])),
)

The equivalent using standard indexing:

# First create the observation (obs) filters
cell_label_filter = adata.obs["Cell_label"].isin(["Classical Monocytes", "CD8+CD103+ tissue resident memory T cells"])
genotype_filter = adata.obs["Genotype"].isin(["APL", "FLT3-ITD,NPM1-mut"])
obs_index_filter = cell_label_filter & genotype_filter

# Then create the variable (var) filters
vst_mean_filter = adata.var["vst.mean"] >= 3
feature_type_filter = adata.var["feature_type"].isin(["IG_C_gene", "lncRNA", "protein_coding"])
var_index_filter = vst_mean_filter & feature_type_filter

# Apply both indices to get the final subset
filtered_adata = adata[obs_index_filter, var_index_filter]

You can also filter on var_names, obs_names, the X expression matrix, and by layer.

I plan on adding SpatialData support, and AnnData with Dask support, down the line, but I'd like to see if this might help out.

@LucaMarconato
Member

LucaMarconato commented Feb 11, 2025

Hi @srivarra! Your package looks super cool! (CC @giovp) The syntax reads great and it removes a lot of verbosity.

A question, to loop in the discussion above: what's your opinion on filtering the AnnData object followed by calling match_sdata_to_table() (see tests here: https://github.com/scverse/spatialdata/pull/627/files#diff-98b5bcd7dc2c08e7cd12d22b25e446f191b57f71b7be3f8255c813f58771d154)? Do you think this 2-step process would be convenient enough for users who want to subset a SpatialData object by matching it to a table, or should there be a single function allowing for all of it, and maybe more (subsetting to certain elements, spatial queries, etc.)?

My opinion is that for the moment we could merge the linked PR, which introduces match_sdata_to_table(), and make a notebook showing what we do in the tests above. We could use your package in the notebook so that the syntax is even shorter.

I'd then close this PR and instead try things out for a while. Maybe with the pipe functionality that you wrote a while ago (#722) we would not even need a single function that allows filtering everything, and we could still achieve an ergonomic workflow. The pipe functionality would in fact allow combining subset(), bounding_box_query(), subsetting the AnnData table, and matching it to elements in "one line".
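
For readers unfamiliar with the pattern, pandas' own DataFrame.pipe gives a loose analogy of how such chaining reads (this is not the SpatialData pipe from #722, just an illustration with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"x": [0, 1, 2, 3], "y": [5, 6, 7, 8]})

def threshold(d: pd.DataFrame, col: str, value: int) -> pd.DataFrame:
    # Keep rows where `col` exceeds `value`.
    return d[d[col] > value]

# Two independent filtering steps combined into one fluent expression.
out = df.pipe(threshold, "x", 0).pipe(threshold, "y", 6)
print(out["x"].tolist())  # [2, 3]
```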

WDYT?

By the way, if you have some time to leave a review for #627, it would be fantastic 😊

@srivarra

srivarra commented Feb 12, 2025

I've given this some thought, and while the two-step approach can work, my perspective as a user is that a one-step filtering method, similar to those found in DataFrame libraries like pandas and xarray, would make for a more intuitive experience. Many users of SpatialData are likely familiar with those patterns, so providing a single, unified entry point seems preferable.

At first glance, the name match_sdata_to_table in #627 doesn't immediately convey its purpose. After reading the docs, I have a rough idea of what it does, but renaming it to something like filter_sdata_by_table would better express its purpose (though naming is a secondary concern; I understand it does do some table matching / joins, but on the user side the experience is effectively a filter).

The proposed 2-step process (filtering the AnnData object first, then calling match_sdata_to_table) feels like a temporary compromise.

Thanks @aeisenbarth. After discussing with @melonora, we are going to first turn the code in #627 into an internal function, merge it, and then continue working on your PR. The idea is to provide a single entry point for filtering, filter(), and to use, for instance, subset() or the function from Wouter internally.

I like this idea of making match_sdata_to_table an internal function to help handle the details.

My ideal unified filter function with expression functionality would look like the following:

from collections.abc import Iterable
from typing import Literal

Expressions = Expression | Iterable[Expression]

class SpatialData:
    # Other methods and attributes...

    def filter(
        self,
        elements: Iterable[str] | None,      # Elements to filter on (optional)
        # regions: users would optionally set this in `obs_expr` if needed
        obs_expr: Expressions | None,        # Narwhals expression(s) for observations
        var_expr: Expressions | None,        # Same as obs_expr, but for variable selection
        x_expr: Expressions | None,          # For filtering based on variable expression values
        obs_names_expr: Expressions | None,  # For filtering based on table.obs_names
        var_names_expr: Expressions | None,  # For filtering based on table.var_names
        table_name: str,                     # Ideally, this works on a single table at a time to keep it simple
        layer: str | None,                   # Filter based on a specific layer of the table, if provided, using `x_expr`
        how: Literal["left", "left_exclusive", "inner", "right", "right_exclusive"] = "right",
    ):
        sdata_subset = self.subset(element_names=elements, filter_tables=True) if elements else self

        filtered_adata = sdata_subset[table_name].an.filter(
            obs_expr, var_expr, x_expr, var_names_expr, obs_names_expr, layer
        )

        filtered_sdata = match_sdata_to_table(sdata=sdata_subset, table=filtered_adata, how=how)

        return filtered_sdata

It mostly looks like the current approach with match_sdata_to_table. Of course, there would be a handful of extra checks and handling for corner cases, but this is the general idea.

Maybe with the pipe functionality that you wrote a while ago #722

Oh man, I completely forgot about this. It looks like I accidentally closed it when going through and deleting some forks I had. I'll see if I can recover that PR somehow.

If the current 2-step process were to be merged in today, I'd be satisfied using it, but I'd hope it would be touched up later down the line into one step.

And, if you'd like, I could try to whip up a tiny implementation of filtering with annsel / Narwhals expressions within a couple of weeks.

Thanks for all your great work, and let me know your thoughts!

@LucaMarconato
Member

Thanks for the discussion! I really like how the API would look, and the arguments are convincing. I will check with the other devs and try to first merge the other PRs, keep this PR as a discussion (closed PR), and ideally go for the new approach you described.

Oh man, I completely forgot about this. It looks like I accidentally closed it when going through and deleting some forks I had. I'll see if I can recover that PR somehow.

Thanks a lot!

@melonora
Collaborator

@srivarra I agree with you that ultimately it would be nice to have one entry point. Personally, I have been a fan of the Polars API, but then again a lot of people are used to the pandas API, so good tutorials / education are a must, though I find it cleaner than the pandas API. Regarding the function name, I would probably not name it plain filter but something like filter_by_table_query. I know it is a bit more verbose, but it is more informative about what kind of filtering is being performed. Overall, I am in favour of this.

I did have a look at #627 and would merge that one beforehand. Regarding the name match_sdata_to_table, I think it follows the function naming we already had. It is hard to get feedback here from users on whether they found the other function names intuitive, so I am quite neutral about it.

@srivarra

@melonora

Personally, I have been a fan of the Polars API, but then again a lot of people are used to the pandas API, so good tutorials / education are a must, though I find it cleaner than the pandas API.

Yeah I love the Polars API, it's super expressive and pleasant to use.

Regarding the function name, I would probably not name it plain filter but something like filter_by_table_query. I know it is a bit more verbose, but it is more informative about what kind of filtering is being performed. Overall, I am in favour of this.

Yeah I think that's a very good name and I agree.

Regarding the name match_sdata_to_table, I think it follows the function naming we already had. It is hard to get feedback here from users on whether they found the other function names intuitive, so I am quite neutral about it.

I see that it follows similar functions like match_element_to_table() and match_table_to_element(), so it makes sense to me.
