LongPASS

Long-read RNA sequencing data of PolyAdenylation sites and transcription Start Sites identification and quantification

LongPASS is an annotation-free, hierarchical parametric-clustering strategy that iteratively divides raw TSSs/PASs from full-length reads into specific TSS/PAS clusters and labels the most highly expressed site in each cluster as the dominant site.

Requirement

minimap2
pysam
pybedtools

Usage

0. Prepare the genome alignment BAM file using minimap2
0.0 Map reads with minimap2
minimap2 -t $ncpu --secondary=no -a -x splice $fastq $infile --splice-flank=$flank | samtools sort -@ $ncpu > $outfile
0.1 Filter out supplementary alignments
python filter_reads_for_clustering.py [in.bam] [out.bam]
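
As a rough illustration of what step 0.1 does, the sketch below drops supplementary alignments using SAM flag bits. It is not the code of filter_reads_for_clustering.py: the function names are hypothetical, and also dropping secondary and unmapped records is an assumption beyond what this README states.

```python
# Sketch only: the real filter_reads_for_clustering.py may apply different rules.
UNMAPPED = 0x4         # SAM flag: read is unmapped
SECONDARY = 0x100      # SAM flag: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag: supplementary alignment


def keep_for_clustering(flag: int) -> bool:
    """Return True for mapped, primary, non-supplementary alignments.

    Only the supplementary filter comes from the README; excluding
    secondary and unmapped records is an assumption of this sketch.
    """
    return not (flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY))


def filter_bam(in_bam: str, out_bam: str) -> int:
    """Copy alignments that pass keep_for_clustering; return how many were written."""
    import pysam  # imported here so the flag helper can be used without pysam

    kept = 0
    with pysam.AlignmentFile(in_bam, "rb") as src, \
            pysam.AlignmentFile(out_bam, "wb", template=src) as dst:
        # until_eof=True reads the file sequentially, so no index is needed
        for read in src.fetch(until_eof=True):
            if keep_for_clustering(read.flag):
                dst.write(read)
                kept += 1
    return kept
```

Calling `filter_bam("in.bam", "out.bam")` would write the retained alignments in their original order, so a coordinate-sorted input stays sorted.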

(Optional: We also recommend using the Fulquant pipeline we developed previously for adapter removal, mapping, and read alignment filtering: Fulquant steps 1->6)

(Optional: We also include a polyA/T trimming pipeline here)

usage:

python trimpolyA.py in.fasta out.fasta
1. Generate a TBS file (transcript boundary file) from the BAM file
python misc/bamtotbs.py [in.bam] [out.tbs]

in.bam: input bam file

out.tbs: output tbs file

tbsfile:

The TBS file is a custom file type that records the coordinates and counts of read boundary (TSS/PAS) sites.

The TBS file contains 7 columns:

  • Chromosome
  • Genomic coordinate (1-based)
  • Gene strand (inferred from the ts tag in BAM files generated by minimap2)
  • Count of TSS sites
  • Count of PAS sites
  • Read IDs corresponding to TSS sites
  • Read IDs corresponding to PAS sites
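
The seven columns above can be read into a small record type. This is a sketch under assumptions not stated in this README: fields are taken to be tab-separated, counts are taken to be integers, and read IDs are taken to be comma-separated. The names `TbsRecord` and `parse_tbs_line` are hypothetical.

```python
from typing import List, NamedTuple


class TbsRecord(NamedTuple):
    chrom: str
    pos: int              # 1-based genomic coordinate
    strand: str
    tss_count: int
    pas_count: int
    tss_reads: List[str]
    pas_reads: List[str]


def parse_tbs_line(line: str, id_sep: str = ",") -> TbsRecord:
    """Parse one TBS line (assumed tab-separated; id_sep is an assumption)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}")
    chrom, pos, strand, tss_n, pas_n, tss_ids, pas_ids = fields

    def split_ids(text: str) -> List[str]:
        # An empty column means no reads support this site type
        return [x for x in text.split(id_sep) if x]

    return TbsRecord(chrom, int(pos), strand, int(tss_n), int(pas_n),
                     split_ids(tss_ids), split_ids(pas_ids))
```

If the delimiters in a real TBS file differ, only the `split` calls need to change.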
2. Run LongPASS
python longPass/LongPass.py --params [param.txt] --clustering [paraclu,distclu] --tbsfile [tbsfile]  -o [outfile] --normalization [simplecpm,raw]

--params: Parameter file, specifying parameters for clustering, see example at:

--clustering: Clustering algorithm, options: paraclu, distclu.

--tbsfile: Transcript boundary file generated from step1.

--outfile: Path for output file.

--normalization: Normalization method. simplecpm: simple counts per million (CPM); powerlaw: power-law-based normalization [1].
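
For orientation, counts-per-million scaling can be sketched as follows. The function name is hypothetical, and the choice of denominator (the sum of all counts passed in) is an assumption; LongPASS may compute the library size differently, for example separately for TSS and PAS counts.

```python
from typing import Iterable, List


def simple_cpm(counts: Iterable[float]) -> List[float]:
    """Scale raw counts so that they sum to one million.

    Assumption: the library size is the sum of the counts given here.
    """
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        # Avoid division by zero for an empty or all-zero input
        return [0.0 for _ in counts]
    return [c * 1_000_000 / total for c in counts]
```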

Algorithm for clustering:

(1) distclu: A straightforward distance-based clustering method in which two adjacent sites (TSSs or PASs) are merged if the distance between them is within a predefined threshold (maxdist).

(2) paraclu: Parametric clustering of TSS/PAS sites based on signal density [2].

Parameter File
distclu:
    maxdist = 20
paraclu:
    minStability = 1
    maxLength = 150
    removeSingletons = True
    keepSingletonAbove = 0.1
    reducetoNoneoverlap = True
    peak_threshold = 3

distclu:

maxdist: Two sites will be merged if their distance is less than or equal to this value.
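
The merging rule can be sketched in a few lines. This is an illustration, not the LongPASS implementation: the function names are hypothetical, the input is assumed to be (position, count) pairs from a single chromosome and strand, and ties for the dominant site are broken in favour of the lowest coordinate.

```python
from typing import Iterable, List, Tuple

Site = Tuple[int, float]  # (position, count)


def distclu(sites: Iterable[Site], maxdist: int = 20) -> List[List[Site]]:
    """Group sites whose gap to the previous site is <= maxdist."""
    clusters: List[List[Site]] = []
    for pos, count in sorted(sites):
        if clusters and pos - clusters[-1][-1][0] <= maxdist:
            clusters[-1].append((pos, count))  # extend the current cluster
        else:
            clusters.append([(pos, count)])    # start a new cluster
    return clusters


def dominant_site(cluster: List[Site]) -> int:
    """Return the position with the highest count (first one on ties)."""
    return max(cluster, key=lambda site: site[1])[0]
```

With `maxdist = 20`, sites at 100, 110 and 130 form one cluster because each gap is at most 20, while a site at 200 starts a new one.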

paraclu:

minStability: Minimum stability (defined as the ratio of the maximum density to the minimum density of a segment).

maxLength: Retain segments shorter than this length.

removeSingletons: Whether to remove segments of single-base length (True/False).

keepSingletonAbove: If removeSingletons is set to True, single-base segments below this threshold will be filtered out.

reducetoNoneoverlap: Whether to merge overlapping segments.

peak_threshold: Sites with a count below this value will be filtered out.
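
The sketch below shows one way the paraclu parameters could act as filters on candidate segments. It is an illustration under assumptions: the `Segment` class and function name are hypothetical, segment length is computed inclusively, stability is taken as maximum density divided by minimum density as in paraclu, and a singleton is compared against keepSingletonAbove by its total count. LongPASS may apply these rules differently.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int          # first position of the segment
    end: int            # last position of the segment (inclusive, assumed)
    total: float        # summed (normalized) count in the segment
    min_density: float
    max_density: float


def passes_filters(seg: Segment,
                   min_stability: float = 1.0,
                   max_length: int = 150,
                   remove_singletons: bool = True,
                   keep_singleton_above: float = 0.1) -> bool:
    """Return True if the segment survives the parameter-file filters."""
    length = seg.end - seg.start + 1
    # Assumption: singletons are judged by their total count
    if length == 1 and remove_singletons and seg.total < keep_singleton_above:
        return False
    # "Retain segments shorter than this length"
    if length >= max_length:
        return False
    # Guard against division by zero; treated here as maximally stable
    if seg.min_density <= 0:
        return True
    return seg.max_density / seg.min_density >= min_stability
```

The default arguments mirror the values in the example parameter file above.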

Output Description

The output file contains 12 columns:

  • Chromosome
  • Gene strand
  • Genomic coordinate of cluster start site (0-based)
  • Genomic coordinate of cluster end site (0-based)
  • Number of sites in the cluster
  • Genomic coordinate of dominant site in cluster
  • Normalized count at dominant site
  • Normalized total count in cluster
  • Maximum density of the cluster
  • Minimum density of the cluster
  • Cluster type (TSS/PAS)
  • Read IDs associated with this cluster
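
For downstream analysis, the columns listed above can be loaded into a typed record. This is a sketch under assumptions: the output is taken to be tab-separated with no header line, read IDs are taken to be comma-separated, and the names `ClusterRow` and `parse_cluster_line` are hypothetical.

```python
from typing import List, NamedTuple


class ClusterRow(NamedTuple):
    chrom: str
    strand: str
    start: int            # 0-based cluster start
    end: int              # 0-based cluster end
    n_sites: int
    dominant_pos: int
    dominant_count: float
    total_count: float
    max_density: float
    min_density: float
    cluster_type: str     # "TSS" or "PAS"
    read_ids: List[str]


def parse_cluster_line(line: str, id_sep: str = ",") -> ClusterRow:
    """Parse one output line (assumed tab-separated, 12 columns)."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError(f"expected 12 columns, got {len(f)}")
    return ClusterRow(
        f[0], f[1], int(f[2]), int(f[3]), int(f[4]), int(f[5]),
        float(f[6]), float(f[7]), float(f[8]), float(f[9]),
        f[10], [x for x in f[11].split(id_sep) if x],
    )
```

Rows can then be filtered by type, for example `[r for r in rows if r.cluster_type == "TSS"]`.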

Reference

Citation

Song, X., Yan, H., Hong, Y., Huang, J., et al. (2024). Co-regulation of alternative splicing with transcription initiation and termination revealed by long-read RNA sequencing.

Contributors

This package is developed and maintained by Xiaodong Song, Yanhong Hong ([email protected]) and Wu Wei ([email protected]).

Footnotes

  1. Balwierz, P.J., Carninci, P., Daub, C.O. et al. Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data. Genome Biol 10, R79 (2009). https://doi.org/10.1186/gb-2009-10-7-r79

  2. Frith, M.C., Valen, E., Krogh, A., Hayashizaki, Y., Carninci, P., Sandelin, A. A code for transcription initiation in mammalian genomes. Genome Res 18, 1-12 (2008). https://doi.org/10.1101/gr.6831208