Draft SpatialData.filter() #626
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #626 +/- ##
==========================================
- Coverage 91.93% 91.59% -0.35%
==========================================
Files 44 44
Lines 6661 6688 +27
==========================================
+ Hits 6124 6126 +2
- Misses 537 562 +25
|
In its current state, this PR does not yet resolve the issues it was aimed at.
|
Thanks @aeisenbarth. After discussing with @melonora, we are going to first turn the code in #627 into an internal function, merge it, and then continue working on your PR. The idea is to provide a single entry point for filtering |
Is there somewhere in the documentation that now describes how to filter a SpatialData object by cell IDs? This is valuable for several reasons, e.g. filtering cells removed by QC in analysis using other libraries. |
I went back to this and to #627 today and realized that we may not need to add a new API, since all the points covered by this PR and by the linked PR, including all the points listed in this message here: #626 (comment), are essentially covered by the example below, which uses the currently available APIs:
##
# constructing the example data
from spatialdata.datasets import blobs_annotating_element
from spatialdata import concatenate
from spatialdata import join_spatialelement_table
from spatialdata import SpatialData
sdata1 = blobs_annotating_element("blobs_polygons")
sdata2 = blobs_annotating_element("blobs_polygons")
sdata = concatenate({"sdata1": sdata1, "sdata2": sdata2}, concatenate_tables=True)
print(sdata)
##
# filtering the data
table_name = "table"
filtered_table = sdata[table_name][sdata[table_name].obs.instance_id < 3]
annotated_regions = sdata.get_annotated_regions(sdata[table_name])
elements, table = join_spatialelement_table(
    sdata, spatial_element_names=annotated_regions, table=filtered_table, how="inner"
)
sdata_filtered = SpatialData.init_from_elements(elements | {table_name: table})
print(sdata_filtered)
Explicitly, the code above first filters the table with standard
I think we could proceed by choosing one of the following strategies:
Any preference? |
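The two steps in the example above (filter the table, then inner-join it back to the annotated elements) can be sketched on plain Python data. This is only a toy model: the dicts stand in for elements and the table, and `filter_rows` and `inner_join` are hypothetical helpers, not spatialdata functions.

```python
# Toy model only: plain dicts stand in for SpatialData elements and the table.
# None of this is spatialdata's real implementation.

def filter_rows(table, predicate):
    """Step 1: keep only the table rows for which predicate(row) is true."""
    return [row for row in table if predicate(row)]


def inner_join(elements, table):
    """Step 2: keep only instances present in both the elements and the table."""
    # Instance ids still referenced by the table, grouped by region (element name).
    kept = {}
    for row in table:
        kept.setdefault(row["region"], set()).add(row["instance_id"])
    joined_elements = {
        name: {i: geom for i, geom in instances.items() if i in kept.get(name, set())}
        for name, instances in elements.items()
    }
    joined_table = [
        row for row in table
        if row["instance_id"] in elements.get(row["region"], {})
    ]
    return joined_elements, joined_table


elements = {
    "polygons_a": {0: "poly-a0", 1: "poly-a1", 5: "poly-a5"},
    "polygons_b": {0: "poly-b0", 4: "poly-b4"},
}
table = [
    {"region": "polygons_a", "instance_id": 0},
    {"region": "polygons_a", "instance_id": 1},
    {"region": "polygons_a", "instance_id": 5},
    {"region": "polygons_b", "instance_id": 0},
    {"region": "polygons_b", "instance_id": 4},
]

filtered_table = filter_rows(table, lambda row: row["instance_id"] < 3)
filtered_elements, joined_table = inner_join(elements, filtered_table)
print(filtered_elements)
```

The point of the sketch is that the table filter alone does not touch the elements; the join is what propagates the selection to them.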
This PR indeed overlaps with existing APIs. I agree that redundancy should be avoided, but I think an API should help to minimize glue code. The above example solves the task, but contains two filtering steps and two intermediate function calls. So from my side, I would favor extending the existing API to be more feature-complete, though I would not see it as high priority. |
You are right, that approach is not ergonomic enough. I thought about this and now in the linked PR #627 I introduce a new API All the use cases mentioned in your message are included in the tests. I wonder if now the function is ergonomic enough or if we should still add a Here are my thoughts on this: Ergonomics limitation of the
Limitations of the
My preferred take
I think my preferred approach would be to gather feedback from the users, and then actually provide a new function (what would be the best name |
Hi, I saw this discussion and have some input. I made a small package, annsel, which provides an ergonomic API for filtering, selecting, and group-bys on AnnData. You provide a Narwhals-compatible expression (Narwhals uses a subset of the Polars API) and it runs the query, then returns the subset of the AnnData object to the user. It uses an accessor-style API like
For example, here is how filter is used:
import annsel as an  # registers the AnnData accessor
adata.an.filter(obs=an.col(["Cell_label"]) == "CD8+CD103+ tissue resident memory T cells")
and it's equivalent to
adata[adata.obs["Cell_label"] == "CD8+CD103+ tissue resident memory T cells", :]
and you can add as many expressions as you feel like:
adata.an.filter(
    obs=(
        an.col(["Cell_label"]).is_in(["Classical Monocytes", "CD8+CD103+ tissue resident memory T cells"]),
        an.col(["Genotype"]).is_in(["APL", "FLT3-ITD,NPM1-mut"]),
    ),
    var=(an.col(["vst.mean"]) >= 3, an.col("feature_type").is_in(["IG_C_gene", "lncRNA", "protein_coding"])),
)
Using standard indexing:
# First create the observation (obs) filters
cell_label_filter = adata.obs["Cell_label"].isin(["Classical Monocytes", "CD8+CD103+ tissue resident memory T cells"])
genotype_filter = adata.obs["Genotype"].isin(["APL", "FLT3-ITD,NPM1-mut"])
obs_index_filter = cell_label_filter & genotype_filter
# Then create the variable (var) filters
vst_mean_filter = adata.var["vst.mean"] >= 3
feature_type_filter = adata.var["feature_type"].isin(["IG_C_gene", "lncRNA", "protein_coding"])
var_index_filter = vst_mean_filter & feature_type_filter
# Apply both indices to get the final subset
filtered_adata = adata[obs_index_filter, var_index_filter]
You can also filter on I plan on adding |
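To illustrate why several expressions passed together behave like boolean masks combined with `&`, here is a minimal pure-Python sketch. `Col` and `filter_rows` are hypothetical names invented for this illustration; they are not the annsel or Narwhals API, and the rows are made-up toy data.

```python
# Toy sketch: `Col` and `filter_rows` are hypothetical, not annsel or Narwhals.

class Col:
    """A column reference that builds row predicates from comparisons."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        return lambda row: row[self.name] == value

    def __ge__(self, value):
        return lambda row: row[self.name] >= value

    def is_in(self, values):
        allowed = set(values)
        return lambda row: row[self.name] in allowed


def filter_rows(rows, *predicates):
    """Keep the rows that satisfy every predicate (logical AND)."""
    return [row for row in rows if all(p(row) for p in predicates)]


obs = [
    {"Cell_label": "Classical Monocytes", "Genotype": "APL"},
    {"Cell_label": "Classical Monocytes", "Genotype": "WT"},
    {"Cell_label": "B cells", "Genotype": "APL"},
]

# Expression style: each argument is one condition, all must hold.
with_expressions = filter_rows(
    obs,
    Col("Cell_label").is_in(["Classical Monocytes"]),
    Col("Genotype") == "APL",
)

# Mask style: build one boolean list per condition, then combine them.
label_mask = [row["Cell_label"] in {"Classical Monocytes"} for row in obs]
genotype_mask = [row["Genotype"] == "APL" for row in obs]
with_masks = [row for row, a, b in zip(obs, label_mask, genotype_mask) if a and b]

print(with_expressions == with_masks)
```

Both styles select the same rows; the expression style only defers evaluation until the filter is applied.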
Hi @srivarra! Your package looks super cool! (CC @giovp) The syntax reads great and it removes a lot of verbosity. A question, to loop in the discussion above. What's your opinion of filtering the My opinion is that for the moment we could merge the linked PR, which introduces I'd then close this PR and instead try things out for a while. Maybe with the pipe functionality that you wrote a while ago in #722 we would not even need a single function that filters everything, and we could still achieve an ergonomic workflow. The pipe functionality would in fact allow combining WDYT? By the way, if you have some time to leave a review for #627, it would be fantastic 😊 |
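As a rough illustration of the pipe idea mentioned above, the sketch below chains a table filter and a matching step on a toy container. Everything here is an assumption made for illustration: the `Data` class, its `pipe` method, `filter_table` and `match_to_table` are hypothetical and do not reflect the code in #722 or #627.

```python
# Toy sketch: `Data`, `filter_table` and `match_to_table` are hypothetical.

class Data(dict):
    """A dict with a pandas-style pipe method for chaining functions."""

    def pipe(self, func, *args, **kwargs):
        # Pass this object as the first argument and return the result.
        return func(self, *args, **kwargs)


def filter_table(data, predicate):
    """Return a copy whose table keeps only the rows matching predicate."""
    return Data({**data, "table": [row for row in data["table"] if predicate(row)]})


def match_to_table(data):
    """Return a copy whose shapes keep only the ids still present in the table."""
    ids = {row["instance_id"] for row in data["table"]}
    return Data({**data, "shapes": {i: s for i, s in data["shapes"].items() if i in ids}})


data = Data(
    shapes={0: "circle-0", 1: "circle-1", 2: "circle-2"},
    table=[{"instance_id": i, "qc_pass": i != 1} for i in range(3)],
)

# Two single-purpose steps read as one left-to-right workflow.
result = data.pipe(filter_table, lambda row: row["qc_pass"]).pipe(match_to_table)
print(sorted(result["shapes"]))
```

With this pattern each step stays a small function, and the chaining provides the ergonomics that a single all-in-one filter function would otherwise have to offer.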
I’ve given this some thought, and while the two-step approach can work, my perspective as a user is that a one-step filtering method, similar to those found in DataFrame libraries like pandas and xarray, would make for a more intuitive experience. Many users of SpatialData are likely familiar with those patterns, so providing a single, unified entry point seems preferable. At first glance, the name The proposed two-step process, filtering the AnnData object first, then calling
I like this idea of making My ideal unified filter function with expression functionality would look like the following:
from collections.abc import Iterable
from typing import Literal
Expressions = Expression | Iterable[Expression]
class SpatialData:
    # Other methods and attributes...
    def filter(
        self,
        elements: Iterable[str] | None,  # Elements to filter on (optional)
        # regions: Users would optionally set this in `obs_expr` if needed
        obs_expr: Expressions | None,  # Accept Narwhals expressions or an iterable of strings for observations
        var_expr: Expressions | None,  # Same as obs_expr, but for variable selection
        x_expr: Expressions | None,  # For filtering based on variable expression values
        obs_names_expr: Expressions | None,  # For filtering based on table.obs_names
        var_names_expr: Expressions | None,  # For filtering based on table.var_names
        table_name: str,  # Ideally, this would work on a single table at a time to keep it simple
        layer: str | None,  # Filter based on a specific layer of the table, if provided, and using `x_expr`
        how: Literal["left", "left_exclusive", "inner", "right", "right_exclusive"] = "right",
    ):
        # Restrict to the requested elements first; otherwise work on the whole object
        sdata_subset = self.subset(element_names=elements, filter_tables=True) if elements else self
        filtered_adata = sdata_subset[table_name].an.filter(obs_expr, var_expr, x_expr, var_names_expr, obs_names_expr, layer)
        filtered_sdata = match_sdata_to_table(sdata=sdata_subset, table=filtered_adata, how=how)
        return filtered_sdata
It mostly looks like the current one with the
Oh man, I completely forgot about this; it looks like I accidentally closed it while cleaning up some forks I had. I'll see if I can recover that PR somehow. If the current two-step process were merged today, I'd be satisfied using it, but I hope it would later be touched up into a single step. And, if you'd like, I could try to whip up a tiny implementation of filtering with annsel / Narwhals expressions within a couple of weeks. Thanks for all your great work, and let me know your thoughts! |
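The `how` options in the proposed signature above can be read as set operations on instance ids. The sketch below is a simplified reading of those join modes for a single element and a single table; `join_ids` is a hypothetical helper, and the library's actual behavior (row ordering, return types for the exclusive modes) may differ.

```python
# Toy sketch of the join modes as set operations; simplified and hypothetical.

def join_ids(element_ids, table_ids, how="right"):
    """Return (kept_element_ids, kept_table_ids) for one element and one table."""
    element_ids, table_ids = set(element_ids), set(table_ids)
    matched = element_ids & table_ids
    if how == "inner":
        # Only instances present on both sides survive.
        return matched, matched
    if how == "left":
        # All element instances are kept; the table keeps its matching rows.
        return element_ids, matched
    if how == "left_exclusive":
        # Only element instances that have no table row.
        return element_ids - table_ids, set()
    if how == "right":
        # All table rows are kept; the element keeps its matching instances.
        return matched, table_ids
    if how == "right_exclusive":
        # Only table rows that have no element instance.
        return set(), table_ids - element_ids
    raise ValueError(f"unknown join mode: {how!r}")


element_ids = {0, 1, 2, 3}
table_ids = {2, 3, 4}
for mode in ("left", "left_exclusive", "inner", "right", "right_exclusive"):
    print(mode, join_ids(element_ids, table_ids, how=mode))
```

Under this reading, a default of "right" means the filtered table drives the result: every remaining table row is kept, and the elements are trimmed to the instances it still annotates.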
Thanks for the discussion! I really like how the API would look, and the arguments are convincing. I will check with other devs and try to first merge the other PRs, keep this PR as a discussion (closed PR), and ideally go for the new approach you described.
Thanks a lot! |
@srivarra I agree with you that ultimately it would be nice to have one entry point. Personally, I have been a fan of the Polars API, which I find cleaner than the pandas API, but a lot of people are used to pandas, so good tutorials and education are a must. Regarding the function name, I would probably not name it plain I did have a look at #627 and would merge this one beforehand. Regarding the name |
Yeah I love the Polars API, it's super expressive and pleasant to use.
Yeah I think that's a very good name and I agree.
I see that it follows similar functions like |
(In reference to #620)
This PR implements more advanced filtering options than subset, allowing the creation of a new SpatialData object that contains only specific tables, layers, obs keys, and var keys.
Use cases
Closes #280
Closes #284
Closes #556