I'm wondering whether much thought has been given to how metadata for combined datasets is handled. Here I'm thinking about multiple datasets that measure the same variables and have been combined.
Typically, this becomes a single concatenated object with a "batch" or "dataset" annotation. However, it could also be represented as a collection of objects.
Can (or should) there be a convention for maintaining experiment-level metadata when multiple experiments are combined? This is trivial for the "collection of experiments" object, but is more complicated for the concatenated object.
For a more concrete example: what happens to the dataset id, and to external data identified by the dataset id, when we concatenate? Another example is the "files" from a muon.atac-generated AnnData: scverse/mudata#20
squidpy's solution for concatenated objects
A similar issue came up in squidpy, which we addressed by essentially requiring a "library_id" annotation for the observations. Image data is stored under .uns/spatial/{library_id}/ to avoid conflicts when merging. For example:
# These do not conflict
uns/spatial/library1/images/hires: "image1.png"
uns/spatial/library2/images/hires: "image2.png"
# These do
uns/spatial/images/hires: "image1.png"
uns/spatial/images/hires: "image2.png"
Relevant docs:
The tutorial on setting up an AnnData to work with squidpy, for a more in-depth description.
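To make the namespacing idea above concrete, here is a minimal sketch that uses plain nested dictionaries as a stand-in for .uns. The helper `merge_uns` is hypothetical and written only for illustration; it is not part of squidpy or anndata, and the real merge behavior of those libraries may differ. It shows that entries keyed by library id merge cleanly, while entries that share the same path collide.

```python
from collections.abc import Mapping


def merge_uns(a, b, path=()):
    """Recursively merge two nested dicts, raising on conflicting leaves.

    Hypothetical helper for illustration only; not a squidpy or anndata API.
    """
    merged = dict(a)  # shallow copy so the inputs are not mutated
    for key, value in b.items():
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], Mapping) and isinstance(value, Mapping):
            # Both sides are nested: descend and merge their children.
            merged[key] = merge_uns(merged[key], value, path + (key,))
        elif merged[key] != value:
            # Same path, different values: there is no safe way to keep both.
            raise ValueError("conflict at " + "/".join(path + (key,)))
    return merged


# Namespaced by library id: the two entries live under different paths.
namespaced_1 = {"spatial": {"library1": {"images": {"hires": "image1.png"}}}}
namespaced_2 = {"spatial": {"library2": {"images": {"hires": "image2.png"}}}}
merged = merge_uns(namespaced_1, namespaced_2)

# Not namespaced: both entries claim uns/spatial/images/hires.
flat_1 = {"spatial": {"images": {"hires": "image1.png"}}}
flat_2 = {"spatial": {"images": {"hires": "image2.png"}}}
try:
    merge_uns(flat_1, flat_2)
    conflict = None
except ValueError as err:
    conflict = str(err)

print(sorted(merged["spatial"]))
print(conflict)
```

The namespaced inputs produce a single "spatial" mapping with one entry per library, whereas the flat inputs cannot be merged without dropping or overwriting one of the images.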
Collection of objects
A collection of objects sidesteps this issue because each constituent object holds its own metadata. However, my impression is that far more tools expect a single concatenated object. There is also less tooling for collections of objects, though this has been changing (e.g. anndata.AnnCollection, snapatac2.AnnDataSet).
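As a rough illustration of the trade-off, the sketch below contrasts the two representations using plain Python structures. Everything here is an assumption made for the example: the `Experiment` class, the `concatenate` function, the "datasets" key, and the file names are hypothetical and do not describe an existing anndata convention. The idea is simply that a concatenated object can keep experiment-level metadata if it is stored in a table keyed by the same id used in the per-observation "dataset" annotation.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """Hypothetical stand-in for one dataset with its own metadata."""
    dataset_id: str
    obs_names: list
    metadata: dict = field(default_factory=dict)


def concatenate(experiments):
    """Combine experiments into one object, keeping metadata keyed by dataset id.

    Illustrative only: 'obs' mimics a per-observation annotation table and
    'uns' mimics unstructured metadata on a concatenated object.
    """
    obs = []
    datasets = {}
    for exp in experiments:
        if exp.dataset_id in datasets:
            raise ValueError(f"duplicate dataset id: {exp.dataset_id}")
        datasets[exp.dataset_id] = exp.metadata
        for name in exp.obs_names:
            obs.append({"obs_name": name, "dataset": exp.dataset_id})
    return {"obs": obs, "uns": {"datasets": datasets}}


# A "collection of objects": each experiment carries its own metadata.
collection = [
    Experiment("dataset_a", ["cell1", "cell2"], {"files": ["fragments_a.tsv.gz"]}),
    Experiment("dataset_b", ["cell3"], {"files": ["fragments_b.tsv.gz"]}),
]

# The concatenated form: metadata survives only because it is keyed by id.
combined = concatenate(collection)

# Look up the experiment-level metadata for a single observation.
row = combined["obs"][2]
row_metadata = combined["uns"]["datasets"][row["dataset"]]
print(row["obs_name"], row_metadata["files"])
```

In the collection, the lookup is just an attribute access on the constituent object. In the concatenated form, it goes through the "dataset" annotation, which is essentially the role "library_id" plays in the squidpy approach described above.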
Question
Should there be conventions for maintaining metadata with concatenated objects? Should we insist on collections of objects if we want to maintain metadata?
Relating to #3, what would the obs_subset of a concatenated object be?