Identification and quantification of PolyAdenylation sites and transcription Start Sites from Long-read RNA sequencing data
LongPASS is an annotation-free hierarchical parametric-clustering strategy that iteratively divides raw TSS/PASs from full-length reads into specific TSS/PAS clusters and labels the highest-expressed site in each cluster as the dominant site.
minimap2
pysam
pybedtools
minimap2 -t $ncpu --secondary=no -a -x splice $fastq $infile --splice-flank=$flank | samtools sort -@ $ncpu > $outfile
python filter_reads_for_clustering.py [in.bam] [out.bam]
(Optional: We also recommend the Fulquant pipeline, which we developed previously, for adapter removal, mapping, and read alignment filtering: Fulquant steps 1->6)
(Optional: We also include a polyA/T trimming pipeline here)
usage:
python trimpolyA.py in.fasta out.fasta
python misc/bamtotbs.py [in.bam] [out.tbs]
in.bam: input bam file
out.tbs: output tbs file
The tbs file is a custom file type that records the coordinates and counts of read boundary (TSS/PAS) sites.
The tbs file contains 7 columns:
- Chromosome
- Genomic coordinate (1-based)
- Gene strand information (inferred from the TS tag in BAM files generated by minimap2)
- Count of TSS sites
- Count of PAS sites
- Read IDs corresponding to TSS sites
- Read IDs corresponding to PAS sites
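As an illustration of the 7-column layout above, one line of a tbs file could be parsed as follows. This is a minimal sketch, not part of LongPASS: it assumes tab-separated columns, comma-separated read IDs, and an empty field when a site has no reads of a given type, none of which is stated above. The names `TbsRecord` and `parse_tbs_line` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TbsRecord:
    chrom: str
    pos: int              # 1-based genomic coordinate
    strand: str           # inferred from the minimap2 TS tag
    tss_count: int
    pas_count: int
    tss_reads: List[str]
    pas_reads: List[str]


def parse_tbs_line(line: str, id_sep: str = ",") -> TbsRecord:
    """Parse one tbs line. ASSUMPTION: tab-separated, IDs joined by id_sep."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}")
    chrom, pos, strand, n_tss, n_pas, tss_ids, pas_ids = fields

    def split_ids(text: str) -> List[str]:
        # An empty field yields an empty list rather than [""].
        return [read_id for read_id in text.split(id_sep) if read_id]

    return TbsRecord(chrom, int(pos), strand, int(n_tss), int(n_pas),
                     split_ids(tss_ids), split_ids(pas_ids))


# Made-up example line: two reads start here, none end here.
example = "chr1\t14362\t-\t2\t0\tread_a,read_b\t"
record = parse_tbs_line(example)
print(record)
```

If the real file uses a different delimiter or a placeholder for empty read lists, only `split("\t")` and `split_ids` would need to change.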
python longPass/LongPass.py --params [param.txt] --clustering [paraclu,distclu] --tbsfile [tbsfile] -o [outfile] --normalization [simplecpm,raw]
--params: Parameter file, specifying parameters for clustering, see example at:
--clustering: Clustering algorithm, options: paraclu, distclu.
--tbsfile: Transcript boundary file generated from step1.
--outfile: Path for output file.
--normalization: Normalization method. simplecpm: simple counts per million (CPM); powerlaw: power-law based normalization [1].
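The simple CPM option scales raw counts to counts per million. A minimal sketch of that calculation follows; it assumes the scaling factor is the total count over all sites passed in, since the exact denominator LongPASS uses (for example, TSS and PAS counts pooled or handled separately) is not stated here. `simple_cpm` is a hypothetical helper, not a LongPASS function.

```python
def simple_cpm(counts):
    """Scale raw counts so that they sum to one million."""
    total = sum(counts)
    if total == 0:
        # Nothing to normalize; avoid dividing by zero.
        return [0.0 for _ in counts]
    return [count * 1_000_000 / total for count in counts]


raw = [5, 15, 30, 50]      # made-up raw site counts
cpm = simple_cpm(raw)
print(cpm)
```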
(1) distclu: A straightforward distance-based clustering method where two adjacent TSSs are merged if their distance is less than a predefined threshold.
(2) paraclu: Parametric clustering of TSS/PAS sites based on signal density [2].
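The distance-based method can be sketched in a few lines. This is an illustration rather than the LongPASS implementation: it assumes the sites of one chromosome and strand are given as `(position, count)` pairs, merges neighbours whose gap is at most `maxdist` (following the `maxdist` parameter description), and reports the highest-count site as dominant. The function name `distclu` and the returned tuple layout are choices made for this example.

```python
def distclu(sites, maxdist=20):
    """Group (position, count) sites whose neighbours are <= maxdist apart.

    Returns a list of (start, end, dominant_position, total_count) tuples.
    """
    clusters = []
    current = []
    for pos, count in sorted(sites):
        # Start a new cluster when the gap to the previous site is too large.
        if current and pos - current[-1][0] > maxdist:
            clusters.append(current)
            current = []
        current.append((pos, count))
    if current:
        clusters.append(current)

    summary = []
    for members in clusters:
        # Dominant site = highest count; the leftmost site wins ties.
        dominant = max(members, key=lambda member: member[1])
        total = sum(count for _, count in members)
        summary.append((members[0][0], members[-1][0], dominant[0], total))
    return summary


sites = [(100, 3), (105, 10), (120, 2), (200, 7), (260, 1)]  # made-up data
clusters = distclu(sites, maxdist=20)
print(clusters)
```

Because only adjacent gaps are checked, a long chain of closely spaced sites forms a single cluster regardless of its total width.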
distclu:
maxdist = 20
paraclu:
minStability = 1
maxLength = 150
removeSingletons = True
keepSingletonAbove = 0.1
reducetoNoneoverlap = True
peak_threshold = 3
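For readers who want to inspect such a parameter file programmatically, a minimal parsing sketch is shown here. It assumes the layout is exactly as in the example above (a `name:` line opens a section and `key = value` lines follow it); how LongPASS itself reads the file is not documented here, and `parse_params` is a hypothetical helper. With this flat layout, `peak_threshold` ends up under the `paraclu` section, which may or may not match the tool's own interpretation.

```python
import ast


def parse_params(text):
    """Parse 'section:' headers and 'key = value' lines into nested dicts."""
    params = {}
    section = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            section = line[:-1].strip()
            params[section] = {}
        elif "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            try:
                # Turns "20" into 20, "0.1" into 0.1, "True" into True.
                value = ast.literal_eval(value)
            except (ValueError, SyntaxError):
                pass  # leave anything else as a plain string
            target = params[section] if section is not None else params
            target[key] = value
        else:
            raise ValueError(f"unrecognized line: {raw_line!r}")
    return params


example = """\
distclu:
maxdist = 20
paraclu:
minStability = 1
maxLength = 150
removeSingletons = True
keepSingletonAbove = 0.1
reducetoNoneoverlap = True
peak_threshold = 3
"""
params = parse_params(example)
print(params)
```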
distclu:
maxdist: Two sites will be merged if their distance is less than or equal to this value.
paraclu:
minStability: Minimum stability (defined as the ratio of the maximum density to the minimum density of a segment).
maxLength: Retain segments shorter than this length.
removeSingletons: Whether to remove segments of single-base length (True/False).
keepSingletonAbove: If removeSingletons is set to True, single-base segments below this threshold will be filtered out.
reducetoNoneoverlap: Whether to merge overlapping segments.
peakThreshold: Sites with a count below this value will be filtered out.
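How the paraclu filters might combine can be sketched as below. This is a hedged illustration, not the LongPASS code. It assumes that stability is the ratio of a segment's maximum to minimum density (as in the original paraclu method and the density columns of the output), that segment coordinates are inclusive, that the singleton threshold is compared against the segment's normalized signal, and that singletons are exempt from the stability test. The dictionary keys and the function name `filter_segments` are hypothetical.

```python
def filter_segments(segments, min_stability=1, max_length=150,
                    remove_singletons=True, keep_singleton_above=0.1):
    """Keep segments that pass the (assumed) paraclu-style filters."""
    kept = []
    for seg in segments:
        length = seg["end"] - seg["start"] + 1  # ASSUMPTION: inclusive ends
        if length == 1:
            # Singletons: drop only weak ones, and only if removal is enabled.
            if remove_singletons and seg["signal"] < keep_singleton_above:
                continue
            kept.append(seg)
            continue
        if seg["min_d"] <= 0:
            continue  # stability undefined; skipped in this sketch
        stability = seg["max_d"] / seg["min_d"]
        if stability >= min_stability and length < max_length:
            kept.append(seg)
    return kept


segments = [  # made-up candidate segments
    {"start": 100, "end": 120, "signal": 15.0, "min_d": 0.5, "max_d": 2.0},
    {"start": 300, "end": 300, "signal": 0.05, "min_d": 0.05, "max_d": 0.05},
    {"start": 400, "end": 400, "signal": 2.0, "min_d": 2.0, "max_d": 2.0},
    {"start": 500, "end": 900, "signal": 40.0, "min_d": 0.1, "max_d": 0.3},
]
kept = filter_segments(segments)
print([seg["start"] for seg in kept])
```

In this example the weak singleton at 300 and the 401-base segment starting at 500 are discarded, while the compact cluster at 100 and the well-supported singleton at 400 are kept.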
The output file contains 12 columns:
- Chromosome
- Gene strand
- Genomic coordinate of cluster start site (0-based)
- Genomic coordinate of cluster end site (0-based)
- Number of sites
- Genomic coordinate of dominant site in cluster
- Normalized count at dominant site
- Normalized total count in cluster
- Maximum density of the cluster
- Minimum density of the cluster
- Cluster type (TSS/PAS)
- Read IDs associated with this cluster
Song, X., Yan, H., Hong, Y., Huang, J., et al. (2024). Co-regulation of alternative splicing with transcription initiation and termination revealed by long-read RNA sequencing.
This package is developed and maintained by Xiaodong Song, Yanhong Hong ([email protected]) and Wu Wei ([email protected]).
Footnotes
1. Balwierz, P.J., Carninci, P., Daub, C.O. et al. Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data. Genome Biol 10, R79 (2009). https://doi.org/10.1186/gb-2009-10-7-r79
2. Frith, M.C., Valen, E., Krogh, A., Hayashizaki, Y., Carninci, P., Sandelin, A. A code for transcription initiation in mammalian genomes. Genome Res 18, 1-12 (2008). https://doi.org/10.1101/gr.6831208