Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistencies with R SummarizedExperiment and RangedSummarizedExperiment #77

Open
DiDeoxy opened this issue Sep 10, 2024 · 2 comments

Comments

@DiDeoxy
Copy link

DiDeoxy commented Sep 10, 2024

Making SummarizedExperiments

In Bioconductor there is no RangedSummarizedExperiment that does not have row ranges, if you try to build a RangedSummarizedExperiment without rowRanges in Bioconductor you get a SummarizedExperiment back instead. I think in BiocPy the best way to handle this issue is to make row_ranges required in RangedSummarizedExperiment and/or a SummarizedExperiment/RangedSummarizedExperiment constructor function.

Row data in RangedSummarizedExperiment is incorrect:

In Bioconductor the rowData accessor calls the mcols accessor on the SummarizedExperiment: https://github.com/Bioconductor/SummarizedExperiment/blob/c5b7ca2f8d975af13a18b3b5931449f092657f5c/R/SummarizedExperiment-class.R#L121

In Bioconductor the mcols accessor of RangedSummarizedExperiment calls the mcols accessor of the rowRanges : https://github.com/Bioconductor/SummarizedExperiment/blob/c5b7ca2f8d975af13a18b3b5931449f092657f5c/R/RangedSummarizedExperiment-class.R#L215

Thus in Bioconductor the rowData of a RangedSummarizedExperiment is the mcols of the rowRanges.

In BiocPy the get_row_data method accesses the _rows BiocFrame attribute of the RangedSummarizedExperiment. It is likely necessary to remove the row_data init parameter of RangedSummarizedExperiment as part of fixing this issue, alternatively you could have any supplied row_data be column concatenated to the mcols of the required row_ranges.

There are 4+ row indices in RangedSummarizedExperiment

This relates to the issue raised about GenomicRanges: BiocPy/GenomicRanges#121

The IRanges of the GenomicRanges has a names index, the GenomicRanges object has a names index, the mcols of the GenomicRanges has a row_names index, the row_data of the RangedSummarizedExperiment has a row_names index. None of these are cross validated to ensure they have the same values in the same order, this will likely lead to confusion by users in the future. Which one of these is the correct one? It would be good to resolve this upon object construction.

@jkanche
Copy link
Member

jkanche commented Sep 23, 2024

This behavior is managed differently in the R package through a dedicated constructor method. However, in the Python package, there is no equivalent constructor function. Users are expected to directly instantiate objects by invoking the appropriate class constructors.

Making SummarizedExperiments
In Bioconductor there is no RangedSummarizedExperiment that does not have row ranges, if you try to build a RangedSummarizedExperiment without rowRanges in Bioconductor you get a SummarizedExperiment back instead. I think in BiocPy the best way to handle this issue is to make row_ranges required in RangedSummarizedExperiment and/or a SummarizedExperiment/RangedSummarizedExperiment constructor function.

Given the complicated relationship between row_data and row_ranges in RangedSummarizedExperiment, we decided not to replicate this behavior in the Python versions of these classes. There is an open issue in the Bioconductor package’s GitHub repository that addresses the inconsistent behavior during mutations of these objects.

Row data in RangedSummarizedExperiment is incorrect:
...

it's not immediately clear why they require validation. I don't believe the R implementations enforce validation either. These names are primarily human-readable identifiers for intervals or regions; they are not supposed to be indices.

There are 4+ row indices in RangedSummarizedExperiment
This relates to the issue raised about GenomicRanges: BiocPy/GenomicRanges#121

The IRanges of the GenomicRanges has a names index, the GenomicRanges object has a names index, the mcols of the GenomicRanges has a row_names index, the row_data of the RangedSummarizedExperiment has a row_names index. None of these are cross validated to ensure they have the same values in the same order, this will likely lead to confusion by users in the future. Which one of these is the correct one? It would be good to resolve this upon object construction.

@DiDeoxy
Copy link
Author

DiDeoxy commented Oct 1, 2024

The seqnames of the GRanges and the names of the IRanges object in contains are not validated in Bioconductor, point taken.

However, since the mcols of RangedSummarizedExperiment are the mcols of its GRanges component and the mcols of a GRanges object is not allowed to have rownames: https://code.bioconductor.org/browse/GenomicRanges/blob/RELEASE_3_19/R/GRanges-class.R#L55. It could therefore make sense to prevent the mcols of RangedSummarizedExperiment in BiocPy from having row_names given the choice to not fully replicate the structure of the Bioconductor RangedSummarizedExperiment version in Biocpy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants