Commit

Implement combine generic (#29)
- Implement `combine` generic for GenomicRanges and GenomicRangesList
- use rich to print instances of gr and grl objects
- Update sphinx setup to use `sphinx-autodoc-typehints` extension
- Update documentation, README
- Add tests for combine
jkanche authored Oct 17, 2023
1 parent c9ba493 commit 71c8b53
Showing 12 changed files with 448 additions and 179 deletions.
106 changes: 89 additions & 17 deletions README.md
@@ -4,39 +4,40 @@

# GenomicRanges

GenomicRanges is a Python container class designed to represent genomic locations and support genomic analysis. It is similar to Bioconductor's [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html).
GenomicRanges provides container classes designed to represent genomic locations and support genomic analysis. It is similar to Bioconductor's [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html).

## Install
**_Intervals are inclusive on both ends and start at 1._**
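
To make the coordinate convention concrete, here is a minimal sketch in plain Python. The helper names `interval_width` and `from_half_open` are hypothetical and not part of the package; they only illustrate what 1-based, closed intervals imply for widths and for converting 0-based, half-open (BED-style) coordinates.

```python
from typing import Tuple


def interval_width(start: int, end: int) -> int:
    """Number of positions covered by a 1-based interval closed on both ends."""
    # Both endpoints belong to the interval, hence the +1.
    return end - start + 1


def from_half_open(start0: int, end0: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open interval [start0, end0) to 1-based, closed."""
    # Shifting the start by one changes the base; the exclusive end already
    # equals the last included 1-based position.
    return start0 + 1, end0


start, end = from_half_open(100, 112)
print(start, end, interval_width(start, end))  # 101 112 12
```

Under this convention an interval whose start equals its end covers exactly one position.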

Package is published to [PyPI](https://pypi.org/project/genomicranges/)
To get started, install the package from [PyPI](https://pypi.org/project/genomicranges/):

```shell
pip install genomicranges
```

## Usage
## `GenomicRanges`

The package provides several ways to represent genomic annotations and intervals.
`GenomicRanges` is the base class to represent and operate over genomic regions and annotations.

### Initialize a `GenomicRanges` object
### From UCSC or GTF file

#### From UCSC or GTF file

You can easily access UCSC genomes or load a genome annotation from a GTF file using the following methods:
You can easily download and parse genome annotations from UCSC or load a genome annotation from a GTF file:

```python
import genomicranges

gr = genomicranges.from_gtf(<PATH TO GTF>)
gr = genomicranges.read_gtf(<PATH TO GTF>)
# OR
gr = genomicranges.from_ucsc(genome="hg19")
```
#### Pandas DataFrame
gr = genomicranges.read_ucsc(genome="hg19")

A common representation in Python is a pandas DataFrame for all tabular datasets. You can convert a DataFrame into a `GenomicRanges` object. Please note that intervals are inclusive on both ends, and your DataFrame must contain columns seqnames, starts, and ends to represent genomic coordinates.
print(gr)
## output
## GenomicRanges with 1760959 intervals & 10 metadata columns.
## ... truncating the console print ...
```

Here's an example:
### Pandas DataFrame

Tabular datasets in Python are commonly represented as pandas `DataFrame` objects. The `DataFrame` must contain the columns "seqnames", "starts", and "ends" to represent genomic intervals. Here's an example:

```python
import genomicranges
@@ -54,11 +55,23 @@ df = pd.DataFrame(
)

gr = genomicranges.from_pandas(df)
print(gr)
```

## output
GenomicRanges with 5 intervals & 2 metadata columns
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ row_names ┃ seqnames <list> ┃ starts <list> ┃ ends <list> ┃ strand <list> ┃ score <list> ┃ GC <list>           ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ 0         │ chr1            │ 101           │ 112         │ *             │ 0            │ 0.22617584001235103 │
│ 1         │ chr2            │ 102           │ 103         │ -             │ 1            │ 0.25464256182466394 │
│ ...       │ ...             │ ...           │ ...         │ ...           │ ...          │ ...                 │
│ 4         │ chr2            │ 109           │ 111         │ -             │ 4            │ 0.5414168889911801  │
└───────────┴─────────────────┴───────────────┴─────────────┴───────────────┴──────────────┴─────────────────────┘
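
The column requirement for the pandas input can be checked before conversion. The following is a hedged sketch: `missing_columns` is a hypothetical helper, not a function of the package, and it uses a plain dictionary of lists as a stand-in for the `DataFrame` columns.

```python
REQUIRED_COLUMNS = ("seqnames", "starts", "ends")


def missing_columns(columns) -> list:
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Stand-in for tabular data: column name -> list of values.
table = {
    "seqnames": ["chr1", "chr2"],
    "starts": [101, 102],
    "ends": [112, 103],
    "score": [0, 1],
}

print(missing_columns(table))                  # []
print(missing_columns(["seqnames", "score"]))  # ['starts', 'ends']
```

The helper only iterates over names, so passing the column index of a pandas `DataFrame` (`df.columns`) should behave the same way.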

### Interval Operations

GenomicRanges currently supports most commonly used [interval based operations](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html).
`GenomicRanges` supports most [interval-based operations](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html).

```python
subject = genomicranges.from_ucsc(genome="hg38")
@@ -77,8 +90,67 @@ hits = subject.nearest(query)
print(hits)
```

For more usage examples, check out the [documentation](https://biocpy.github.io/GenomicRanges/).
## `GenomicRangesList`

As the name suggests, a `GenomicRangesList` is a named, list-like object. Why is this class needed? A `GenomicRanges` object specifies multiple genomic elements, usually where genes start and end. Genes are themselves made of many sub-regions, e.g. exons. `GenomicRangesList` allows us to represent this nested structure.

**Currently, this class is limited in functionality.**

To construct a `GenomicRangesList`:

```python
from genomicranges import GenomicRanges, GenomicRangesList

# Intervals for the first element ("gene1").
gr1 = GenomicRanges(
    {
        "seqnames": ["chr1", "chr2", "chr1", "chr3"],
        "starts": [1, 3, 2, 4],
        "ends": [10, 30, 50, 60],
        "strand": ["-", "+", "*", "+"],
        "score": [1, 2, 3, 4],
    }
)

# Intervals for the second element ("gene2").
gr2 = GenomicRanges(
    {
        "seqnames": ["chr2", "chr4", "chr5"],
        "starts": [3, 6, 4],
        "ends": [30, 50, 60],
        "strand": ["-", "+", "*"],
        "score": [2, 3, 4],
    }
)

grl = GenomicRangesList(ranges=[gr1, gr2], names=["gene1", "gene2"])
print(grl)
```

## output
GenomicRangesList with 2 genomic elements

Name: gene1
GenomicRanges with 4 intervals & 1 metadata columns
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ seqnames <list> ┃ starts <list> ┃ ends <list> ┃ strand <list> ┃ score <list> ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ chr1            │ 1             │ 10          │ -             │ 1            │
│ chr2            │ 3             │ 30          │ +             │ 2            │
│ chr1            │ 2             │ 50          │ *             │ 3            │
│ chr3            │ 4             │ 60          │ +             │ 4            │
└─────────────────┴───────────────┴─────────────┴───────────────┴──────────────┘

Name: gene2
GenomicRanges with 3 intervals & 1 metadata columns
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ seqnames <list> ┃ starts <list> ┃ ends <list> ┃ strand <list> ┃ score <list> ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ chr2            │ 3             │ 30          │ -             │ 2            │
│ chr4            │ 6             │ 50          │ +             │ 3            │
│ chr5            │ 4             │ 60          │ *             │ 4            │
└─────────────────┴───────────────┴─────────────┴───────────────┴──────────────┘
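
Conceptually, the nesting can be pictured as a mapping from a name to its own collection of intervals. The sketch below uses plain Python rather than the package's classes; `Interval` and `gene_span` are illustrative names (assumptions, not package API), and the coordinates reuse a subset of the example above.

```python
from typing import Dict, List, NamedTuple, Tuple


class Interval(NamedTuple):
    seqname: str
    start: int  # 1-based, inclusive
    end: int    # inclusive


def gene_span(exons: List[Interval]) -> Dict[str, Tuple[int, int]]:
    """Smallest (start, end) range covering the intervals on each sequence."""
    spans: Dict[str, Tuple[int, int]] = {}
    for exon in exons:
        lo, hi = spans.get(exon.seqname, (exon.start, exon.end))
        spans[exon.seqname] = (min(lo, exon.start), max(hi, exon.end))
    return spans


# Outer level: named elements (e.g. genes); inner level: their sub-regions.
genes = {
    "gene1": [Interval("chr1", 1, 10), Interval("chr1", 2, 50), Interval("chr2", 3, 30)],
    "gene2": [Interval("chr2", 3, 30), Interval("chr4", 6, 50)],
}

for name, exons in genes.items():
    print(name, gene_span(exons))
```

Keeping the inner collections separate per name is what lets a per-element summary like this be computed without mixing sub-regions from different genes.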

## Further information

- [Tutorial](https://biocpy.github.io/GenomicRanges/tutorial.html)
- [API documentation](https://biocpy.github.io/GenomicRanges/api/modules.html)
- [Bioc/GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html)

<!-- pyscaffold-notes -->

1 change: 1 addition & 0 deletions docs/conf.py
@@ -72,6 +72,7 @@
"sphinx.ext.ifconfig",
"sphinx.ext.mathjax",
"sphinx.ext.napoleon",
"sphinx_autodoc_typehints",
]

# Add any paths that contain templates here, relative to this directory.
1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -5,3 +5,4 @@ furo
# sphinx_rtd_theme
myst-parser[linkify]
sphinx>=3.2.1
sphinx-autodoc-typehints