LongPASS

Long-read RNA sequencing data of PolyAdenylation sites and transcription Start Sites identification and quantification

LongPASS is an annotation-free, hierarchical parametric-clustering strategy. It iteratively divides raw transcription start sites (TSSs) and polyadenylation sites (PASs) from full-length reads into TSS/PAS clusters and labels the most highly expressed site in each cluster as the dominant site.

Requirements

minimap2
samtools (used to sort the alignments in step 0.0)
pysam
pybedtools

Usage

0. Prepare the Genome Alignment bam file using minimap2

0.0 Mapping using minimap2

minimap2  -t $ncpu --secondary=no -a -x splice $fastq $infile  --splice-flank=$flank | samtools sort -@ $ncpu > $outfile

0.1 Filter out supplementary alignments

python filter_reads_for_clustering.py ${in.bam} ${out.bam}
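
The exact criteria used by filter_reads_for_clustering.py are not spelled out here. As a rough illustration of what dropping supplementary alignments means at the SAM-flag level, the sketch below keeps only mapped, primary records. The helper name keep_for_clustering and the choice to also drop unmapped and secondary records are assumptions, not the script's documented behaviour.

```python
# SAM flag bits, as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def keep_for_clustering(flag: int) -> bool:
    """Return True for a mapped, primary, non-supplementary alignment.

    Hypothetical helper: the real script may apply different or extra rules.
    """
    return not (flag & (FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY))


# 0 and 16 are primary alignments (forward / reverse strand);
# 4 is unmapped, 256 secondary, 2048 and 2064 supplementary.
flags = [0, 16, 4, 256, 2048, 2064]
kept = [f for f in flags if keep_for_clustering(f)]
```

When reading a BAM file with pysam, the same bits are available as the is_unmapped, is_secondary and is_supplementary attributes of each AlignedSegment.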

(Optional: we also recommend the Fulquant pipeline, which we developed previously, for adapter removal, mapping, and read alignment filtering: Fulquant steps 1->6.)

(Optional: a polyA/T trimming script is also included here.)

Usage:

python trimpolyA.py in.fasta out.fasta
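
For readers who want a feel for what polyA/T trimming does to a single read, here is a minimal sketch. It is not the logic of trimpolyA.py: the function name, the regular expressions, and the minimum run length of 10 are all assumptions chosen for illustration.

```python
import re

MIN_RUN = 10  # assumed minimum homopolymer length; not a trimpolyA.py default

POLYA_TAIL = re.compile(r"A{%d,}$" % MIN_RUN)  # poly(A) run at the 3' end
POLYT_HEAD = re.compile(r"^T{%d,}" % MIN_RUN)  # poly(T) run at the 5' end


def trim_poly_a_t(seq: str) -> str:
    """Remove a trailing poly(A) run and a leading poly(T) run from one read."""
    seq = POLYA_TAIL.sub("", seq.upper())
    return POLYT_HEAD.sub("", seq)


read = "GATTACAGGC" + "A" * 15
trimmed = trim_poly_a_t(read)
```

Runs shorter than MIN_RUN are left untouched, so short genomic A or T stretches at read ends are not clipped by this sketch.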

1. Generate a TBS (transcript boundary) file from the BAM file

python misc/bamtotbs.py [in.bam] [out.tbs]

in.bam: input bam file

out.tbs: output tbs file

tbsfile:

The TBS file is a custom file type that records the coordinates and counts of read boundary (TSS/PAS) sites.

The TBS file contains 7 columns:

Chromosome
Genomic coordinate (1-based)
Strand (inferred from the ts tag in BAM files generated by minimap2)
Count of TSS sites
Count of PAS sites
Read IDs corresponding to TSS sites
Read IDs corresponding to PAS sites
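
To make the seven columns concrete, the sketch below collapses a few reads into rows with the same seven fields. It is an illustration only, not the code in misc/bamtotbs.py. It assumes each read is given as (chrom, start0, end0, strand, read_id) with 0-based, half-open alignment coordinates and an already known transcript strand; the function name boundary_rows and the in-memory row layout are hypothetical, and the on-disk delimiters of a real TBS file are not covered.

```python
from collections import defaultdict


def boundary_rows(reads):
    """Collapse reads into 7-field rows shaped like the TBS columns.

    reads: iterable of (chrom, start0, end0, strand, read_id) with 0-based,
    half-open alignment coordinates (an assumed input layout).
    """
    tss_ids = defaultdict(list)
    pas_ids = defaultdict(list)
    for chrom, start0, end0, strand, read_id in reads:
        left, right = start0 + 1, end0  # both alignment ends as 1-based positions
        # On the plus strand the transcript starts at the left end;
        # on the minus strand it starts at the right end.
        tss, pas = (left, right) if strand == "+" else (right, left)
        tss_ids[(chrom, tss, strand)].append(read_id)
        pas_ids[(chrom, pas, strand)].append(read_id)
    rows = []
    for key in sorted(set(tss_ids) | set(pas_ids)):
        chrom, pos, strand = key
        rows.append((chrom, pos, strand,
                     len(tss_ids[key]), len(pas_ids[key]),
                     tss_ids[key], pas_ids[key]))
    return rows


reads = [
    ("chr1", 99, 500, "+", "r1"),
    ("chr1", 99, 480, "+", "r2"),
    ("chr1", 200, 500, "-", "r3"),
]
rows = boundary_rows(reads)
```

Here r1 and r2 share a TSS at position 100 on the plus strand, so the first row carries a TSS count of 2 and a PAS count of 0.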

2. Run LongPASS

python longPass/LongPass.py --params [param.txt] --clustering [paraclu,distclu] --tbsfile [tbsfile]  -o [outfile] --normalization [simplecpm,raw]

--params: Parameter file specifying the clustering parameters; see the example in the Parameter File section below.

--clustering: Clustering algorithm, options: paraclu, distclu.

--tbsfile: Transcript boundary file generated in step 1.

--outfile: Path for output file.

--normalization: Normalization method. simplecpm: simple counts per million (CPM); powerlaw: power-law-based normalization¹.
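
As a quick illustration of counts-per-million scaling, the sketch below rescales a list of raw site counts so that they sum to one million. The function name is hypothetical, and which total LongPASS divides by (for example TSS and PAS counts separately or together) is not stated here, so treat the choice of denominator as an assumption.

```python
def simple_cpm(counts):
    """Scale raw counts so that they sum to one million."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c * 1_000_000 / total for c in counts]


cpm = simple_cpm([5, 15, 80])
```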

Clustering algorithms:

(1) distclu: A straightforward distance-based clustering method in which two adjacent sites (TSSs or PASs) are merged if the distance between them does not exceed a predefined threshold (maxdist).

(2) paraclu: Parametric clustering of TSS/PAS sites based on signal density².
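
The density-based idea behind paraclu can be sketched as a recursive split: each segment is cut at its weakest prefix or suffix, and the density at which it breaks becomes its maximum density, while the density at which its parent broke becomes its minimum density. The code below is a compact sketch of that idea as described by Frith et al., not the implementation shipped with LongPASS. It assumes sites on one strand with distinct, sorted positions, and it breaks ties in favour of the prefix; the function name and tuple layout are hypothetical.

```python
import math


def paraclu(sites, min_density=-math.inf):
    """Yield (start, end, n_sites, total, min_density, max_density) clusters.

    sites: list of (position, count), sorted by position, distinct positions.
    """
    if not sites:
        return
    total = sum(count for _, count in sites)
    best_d, best_i = math.inf, None
    # Weakest prefix: lowest density of a leading block, measured up to the next site.
    running = 0.0
    for i in range(1, len(sites)):
        running += sites[i - 1][1]
        d = running / (sites[i][0] - sites[0][0])
        if d < best_d:
            best_d, best_i = d, i
    # Weakest suffix: the same scan from the right-hand end.
    running = 0.0
    for i in range(len(sites) - 1, 0, -1):
        running += sites[i][1]
        d = running / (sites[-1][0] - sites[i - 1][0])
        if d < best_d:
            best_d, best_i = d, i
    max_density = best_d  # density at which this segment falls apart
    if max_density > min_density:
        yield (sites[0][0], sites[-1][0], len(sites), total,
               min_density, max_density)
    if best_i is not None:
        floor = max(min_density, max_density)
        yield from paraclu(sites[:best_i], floor)
        yield from paraclu(sites[best_i:], floor)


clusters = list(paraclu([(100, 5), (101, 10), (102, 4), (200, 1)]))
```

In this toy input the isolated site at position 200 is split off first, leaving the tight group 100-102 as a cluster whose maximum density is far above its minimum density. Single sites end up with an infinite maximum density because they cannot be split further.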

Parameter File

distclu:
    maxdist = 20
paraclu:
    minStability = 1
    maxLength = 150
    removeSingletons = True
    keepSingletonAbove = 0.1
    reducetoNoneoverlap = True
    peak_threshold = 3
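
The layout above (a section header ending in a colon, followed by indented key = value lines) is simple enough to read with a few lines of Python. The parser below is a sketch under that reading of the format; it is not the parser used inside LongPASS, and the function name parse_params is hypothetical.

```python
import ast


def parse_params(text):
    """Parse 'section:' headers followed by indented 'key = value' lines."""
    params, section = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            section = line[:-1]
            params[section] = {}
        elif "=" in line and section is not None:
            key, value = (part.strip() for part in line.split("=", 1))
            try:
                # Turns 20 into int, 0.1 into float, True into bool.
                params[section][key] = ast.literal_eval(value)
            except (ValueError, SyntaxError):
                params[section][key] = value  # keep unparsable values as text
    return params


example = """\
distclu:
    maxdist = 20
paraclu:
    minStability = 1
    maxLength = 150
    removeSingletons = True
    keepSingletonAbove = 0.1
"""
params = parse_params(example)
```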

distclu:

maxdist: Two sites will be merged if their distance is less than or equal to this value.
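
The merging rule for maxdist can be sketched in a few lines. This is an illustrative version rather than the LongPASS implementation; it works on bare positions from one chromosome and strand, and the function name is hypothetical.

```python
def distclu(positions, maxdist=20):
    """Group positions; a site joins the current cluster when its gap
    to the previous site is less than or equal to maxdist."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= maxdist:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


groups = distclu([100, 105, 125, 146, 300], maxdist=20)
```

The gap between 105 and 125 is exactly 20, so 125 joins the first cluster, whereas the gap of 21 between 125 and 146 starts a new one.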

paraclu:

minStability: Minimum stability, defined as the ratio of the maximum density to the minimum density of a segment.

maxLength: Retain segments shorter than this length.

removeSingletons: Whether to remove single-base segments (True/False).

keepSingletonAbove: If removeSingletons is set to True, single-base segments below this threshold will be filtered out.

reducetoNoneoverlap: Whether to merge overlapping segments.

peak_threshold: Sites with a count below this value will be filtered out.
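
One way to read the paraclu parameters together is as a per-segment filter. The sketch below encodes that reading; it is not the LongPASS code. It assumes a segment is a tuple (start, end, n_sites, total, min_density, max_density), that length means end - start + 1, and that keepSingletonAbove is compared against the segment's total signal. All three are assumptions, as is the function name.

```python
import math


def passes_filters(cluster, min_stability=1, max_length=150,
                   remove_singletons=True, keep_singleton_above=0.1):
    """cluster: (start, end, n_sites, total, min_density, max_density)."""
    start, end, _n_sites, total, min_d, max_d = cluster
    length = end - start + 1  # assumed definition of segment length
    if length >= max_length:
        return False  # maxLength: retain only shorter segments
    if length == 1:
        # Singletons survive if removal is off or their signal clears the threshold.
        return (not remove_singletons) or total >= keep_singleton_above
    if min_d <= 0 or math.isinf(min_d):
        return False  # stability is undefined without a positive, finite minimum
    return max_d / min_d >= min_stability


ok = passes_filters((100, 102, 3, 19, 0.5, 4.0))
```

Raising min_stability keeps only segments whose density range is wide, which is the usual way to favour sharply delimited clusters over diffuse ones.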

Output Description

The output file contains the following columns:

Chromosome
Gene strand
Genomic coordinate of cluster start site (0-based)
Genomic coordinate of cluster end site (0-based)
Number of sites
Genomic coordinate of dominant site in cluster
Normalized count at dominant site
Normalized total count in cluster
Maximum density of the cluster
Minimum density of the cluster
Cluster type (TSS/PAS)
Read IDs associated with this cluster
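
Several of these columns are simple summaries of the sites inside one cluster. The sketch below shows how the site number, dominant site, and total could be derived from a list of (position, count) pairs. It is illustrative only: the function name and dictionary keys are hypothetical, positions are passed through without any 0-based conversion, and ties for the dominant site are resolved in favour of the leftmost site, which is an assumption.

```python
def summarize_cluster(sites):
    """sites: list of (position, count) in one cluster, sorted by position."""
    # max() returns the first maximal element, so ties go to the leftmost site.
    dominant_pos, dominant_count = max(sites, key=lambda s: s[1])
    return {
        "start": sites[0][0],
        "end": sites[-1][0],
        "n_sites": len(sites),
        "dominant": dominant_pos,
        "dominant_count": dominant_count,
        "total": sum(count for _, count in sites),
    }


summary = summarize_cluster([(100, 5), (101, 10), (102, 4)])
```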

Reference

Citation

Song, X., Yan, H., Hong, Y., Huang, J., et al. (2024). Co-regulation of alternative splicing with transcription initiation and termination revealed by long-read RNA sequencing.

Contributors

This package is developed and maintained by Xiaodong Song, Yanhong Hong (hongyanhong2020@sibs.ac.cn), and Wu Wei (wuwei@lglab.ac.cn).

Footnotes

1. Balwierz, P.J., Carninci, P., Daub, C.O. et al. Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data. Genome Biol 10, R79 (2009). https://doi.org/10.1186/gb-2009-10-7-r79
2. Frith, M.C., Valen, E., Krogh, A., Hayashizaki, Y., Carninci, P., Sandelin, A. A code for transcription initiation in mammalian genomes. Genome Res 18(1), 1-12 (2008). https://doi.org/10.1101/gr.6831208
