LongPASS

Long-read RNA sequencing data of PolyAdenylation sites and transcription Start Sites identification and quantification

LongPASS is an annotation-free hierarchical parametric-clustering strategy that iteratively divides raw TSSs/PASs from full-length reads into specific TSS/PAS clusters and labels the most highly expressed site in each cluster as the dominant site.

Requirements

minimap2
pysam
pybedtools
samtools (used to sort the alignments in step 0.0)

Usage

0. Prepare the genome alignment BAM file using minimap2
0.0 Map reads with minimap2 ($genome: reference genome FASTA; $fastq: input reads)
minimap2 -t $ncpu --secondary=no -a -x splice --splice-flank=$flank $genome $fastq | samtools sort -@ $ncpu > $outfile
0.1 Filter out supplementary alignments
python filter_reads_for_clustering.py [in.bam] [out.bam]
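
For readers unfamiliar with the term, a supplementary alignment is a SAM/BAM record whose FLAG has bit 0x800 set. The sketch below only illustrates that test on SAM text lines, so that it runs without extra dependencies. It is not the code of filter_reads_for_clustering.py, which works on BAM files and may apply further criteria; the function name drop_supplementary and the example records are hypothetical.

```python
from typing import Iterable, Iterator

SUPPLEMENTARY_FLAG = 0x800  # SAM FLAG bit that marks a supplementary alignment


def drop_supplementary(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and every alignment that is not supplementary."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not flag & SUPPLEMENTARY_FLAG:
            yield line


# Hypothetical SAM records: read1 has a primary (0) and a supplementary (2048) record.
sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\t1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "read1\t2048\t1\t900\t60\t20M\t*\t0\t0\t*\t*",
    "read2\t16\t1\t500\t60\t50M\t*\t0\t0\t*\t*",
]
kept = list(drop_supplementary(sam))
```

Here kept holds the header, the primary record of read1, and read2; the record with FLAG 2048 is dropped.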

(Optional: We also recommend using the FulQuant pipeline we developed previously for adapter removal, mapping, and read alignment filtering: FulQuant steps 1->6)

(Optional: We also include a polyA/T trimming pipeline here)

usage:

python trimpolyA.py in.fasta out.fasta
1. Generate a TPS (TSS/PAS) file from BAM
python misc/bamtotps.py [in.bam] [out.tps]

in.bam: input BAM file

out.tps: output TPS file

TPS file:

The TPS file records the coordinates and counts of TSS/PAS sites from the reads.

The TPS file contains 7 columns of information:

  • Chromosome
  • Genomic coordinate (1-based)
  • Gene strand information (inferred from the TS tag in BAM files generated by minimap2)
  • Count of TSS sites
  • Count of PAS sites
  • Read IDs corresponding to TSS sites
  • Read IDs corresponding to PAS sites

Example of a TPS file:

1       14363   -       0       13      ,       ONT_BMK211213-AS036-01O0005_clean:P:-:140:21540a7e-b3cb-4f37-bbc8-643210b7ab9c:P-T-P:127:2110:1984:2180:Q11.69,ONT_BMK211213-AS036-01O0007_clean:P:-:85:c9b15af0-6b0c-4512-9995-5473d5c5d4ff:P-T-P:108:1864:1757:1933:Q13.71,ONT_BMK211213-AS036-01O0007_clean:P:-:181:204a3fc0-18fb-45f8-8e33-60d19f337fba:P-T-P:138:1635:1498:1707:Q10.92,ONT_BMK211213-AS036-01O0007_clean:P:-:152:9c82af8b-2ac7-47d2-8638-acf705c27c4b:P-T-P:136:1677:1542:1726:Q10.86,ONT_BMK211213-AS036-01O0007_clean:P:-:128:ac00e1a0-f1fa-4306-9e8f-ea2705a45414:P-T-P:133:1848:1716:1897:Q12.58,ONT_BMK211213-AS036-01O0008_clean:P:-:22:1a613155-7d21-4c38-a472-f93fffb1c8de:P-T-P:149:827:679:877:Q12.46,ONT_BMK211213-AS036-01O0005_clean:P:+:47:6dd4d7c4-2ff9-476a-8f53-f165af59f77d:P-A-P:85:712:628:779:Q11.92,ONT_BMK211213-AS036-01O0006_clean:P:+:83:d5fb033e-459a-42b1-90cb-7a1b5d1263f8:P-A-P:86:1433:1348:1564:Q13.26,ONT_BMK211213-AS036-01O0007_clean:P:+:196:9247e219-7bd4-4b27-8510-af856f20a152:P-A-P:79:1476:1398:1588:Q8.86,ONT_BMK211213-AS036-01O0009_clean:P:+:103:bb4cee07-b8d7-4085-946d-57fa63dfe26a:P-A-P:84:3204:3121:3329:Q14.25,ONT_BMK211213-AS036-01O0009_clean:P:+:130:11e419e6-fdeb-494c-8e8b-6274f771c228:P-A-P:92:927:836:1010:Q9.57,ONT_BMK220926-BB670-02O0002_clean.fq.gz:P:+:91:c07ca0c1-ad57-4247-96cc-424a450b2dc2:P-A-P:83:1523:1441:1629:Q12.85,ONT_BMK211213-AS036-01O0002_clean:P:+:216:32b0d5a7-81ed-49f2-b355-d032d89d134e:P-A-P:89:1377:1289:1436:Q8.54,
1       24877   -       1       0       ONT_BMK211213-AS036-01O0005_clean:P:-:140:21540a7e-b3cb-4f37-bbc8-643210b7ab9c:P-T-P:127:2110:1984:2180:Q11.69, ,
1       29347   -       8       0       ONT_BMK211213-AS036-01O0007_clean:P:-:152:9c82af8b-2ac7-47d2-8638-acf705c27c4b:P-T-P:136:1677:1542:1726:Q10.86,ONT_BMK211213-AS036-01O0007_clean:P:+:175:e78c3cc1-c467-43cc-bcce-b89748fa1b32:P-A-P:81:1617:1537:1737:Q10.64,ONT_BMK211213-AS036-01O0008_clean:P:-:97:4360e005-0bee-4e7a-8312-a1b16d8a5288:P-T-P:110:995:886:1067:Q10.48,ONT_BMK211213-AS036-01O0001_clean:P:+:317:7609282f-ae06-4613-8635-2dfbb6dd27f5:P-A-P:76:2710:2635:2814:Q9.85,ONT_BMK211213-AS036-01O0007_clean:P:+:101:108c45ea-d2ad-4992-a3d4-c6290e3740b4:P-A-P:83:1787:1705:1908:Q12.54,ONT_BMK211213-AS036-01O0007_clean:P:-:70:cfc6f093-ca3c-42a8-bac1-1ec6179f051d:P-T-P:105:1214:1110:1287:Q12.41,ONT_BMK211213-AS036-01O0007_clean:P:-:65:165a7bc0-d52d-45f1-a753-30e95f8ca184:P-T-P:121:966:846:1039:Q12.11,ONT_BMK211213-AS036-01O0001_clean:P:-:63:bfbd3f68-7982-419e-966f-43eaf6d13c5a:P-N-P:86:2022:1937:2092:Q15.01,     ,
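
The column layout above can be read with a few lines of Python. This is a minimal sketch, not part of LongPASS: it assumes that columns are separated by whitespace, that read IDs contain no whitespace, and that an empty read-ID list is written as a single comma, as in the example lines. The names TpsRecord and parse_tps_line, and the read ID read_a, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TpsRecord:
    chrom: str
    pos: int              # 1-based genomic coordinate
    strand: str
    tss_count: int
    pas_count: int
    tss_reads: List[str]  # read IDs supporting a TSS at this position
    pas_reads: List[str]  # read IDs supporting a PAS at this position


def _split_ids(field: str) -> List[str]:
    # Read-ID lists are comma-separated with a trailing comma; "," means no reads.
    return [read_id for read_id in field.split(",") if read_id.strip()]


def parse_tps_line(line: str) -> TpsRecord:
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, found {len(fields)}")
    chrom, pos, strand, n_tss, n_pas, tss_ids, pas_ids = fields
    return TpsRecord(chrom, int(pos), strand, int(n_tss), int(n_pas),
                     _split_ids(tss_ids), _split_ids(pas_ids))


# Hypothetical line modelled on the second example record above.
record = parse_tps_line("1\t24877\t-\t1\t0\tread_a,\t,")
```

A quick sanity check on real data is that len(record.tss_reads) equals record.tss_count and len(record.pas_reads) equals record.pas_count, which holds for the example lines shown above.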
2. Run LongPASS
python longPass/LongPass.py --params [param.txt] --clustering [paraclu,distclu] --tpsfile [TPS file]  -o [outfile] --normalization [simplecpm,raw]

--params: Parameter file specifying the clustering parameters; see the example here.

--clustering: Clustering algorithm. Options: paraclu, distclu.

--tpsfile: TPS (transcript boundary) file generated in step 1.

--outfile: Path for output file.

--normalization: Normalization method. simplecpm: simple counts per million (CPM); powerlaw: power-law-based normalization [1].
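
As a rough illustration of what counts-per-million scaling does, the sketch below divides each site count by the library total and multiplies by one million. It assumes the library total is the sum of all site counts passed in; how LongPASS defines the total (for example, per site type or per sample) is not stated here, so treat this as an assumption. The function name simple_cpm and the counts are hypothetical.

```python
from typing import Dict


def simple_cpm(counts: Dict[int, int]) -> Dict[int, float]:
    """Scale raw per-site counts to counts per million (CPM)."""
    total = sum(counts.values())
    if total == 0:
        # Nothing to scale; avoid dividing by zero.
        return {pos: 0.0 for pos in counts}
    return {pos: count * 1_000_000 / total for pos, count in counts.items()}


# Hypothetical raw counts keyed by genomic position.
raw_counts = {100: 2, 105: 6, 300: 2}
cpm = simple_cpm(raw_counts)
```

With these counts the total is 10, so position 105 receives 600000 CPM and the other two positions 200000 CPM each.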

Algorithm for clustering:

(1) distclu: A straightforward distance-based clustering method in which two adjacent sites (TSSs or PASs) are merged if the distance between them does not exceed a predefined threshold (maxdist).

(2) paraclu: Parametric clustering of TSS/PAS sites based on signal density [2].

Parameter File
distclu:
    maxdist = 20
paraclu:
    minStability = 1
    maxLength = 150
    removeSingletons = True
    keepSingletonAbove = 0.1
    reducetoNoneoverlap = True
    peak_threshold = 3

distclu:

maxdist: Two sites will be merged if their distance is less than or equal to this value.
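
The merging rule can be sketched as a single pass over sorted positions. This is a minimal illustration under stated assumptions, not the LongPASS implementation: all sites are taken to lie on one chromosome and strand, each site is a (position, count) pair, and the names distclu and dominant_site as Python functions are hypothetical.

```python
from typing import List, Tuple

Site = Tuple[int, float]  # (genomic position, count)


def distclu(sites: List[Site], maxdist: int = 20) -> List[List[Site]]:
    """Group sites so that neighbours at most maxdist apart share a cluster."""
    clusters: List[List[Site]] = []
    for site in sorted(sites):
        # Compare with the last site of the current cluster.
        if clusters and site[0] - clusters[-1][-1][0] <= maxdist:
            clusters[-1].append(site)
        else:
            clusters.append([site])
    return clusters


def dominant_site(cluster: List[Site]) -> Site:
    """Return the site with the highest count (first one on ties)."""
    return max(cluster, key=lambda site: site[1])


# Hypothetical sites on one chromosome and strand.
sites = [(100, 2), (110, 9), (130, 1), (200, 4)]
clusters = distclu(sites, maxdist=20)
dominants = [dominant_site(cluster) for cluster in clusters]
```

In this example the sites at 100, 110, and 130 form one cluster (130 is exactly 20 bp from 110), the site at 200 forms its own cluster, and the dominant sites are at 110 and 200.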

paraclu:

minStability: Minimum stability for a segment. Default: 1.

The stability ratio of each cluster is defined as the ratio:

[Image: formula for the stability ratio]

A high cluster stability ratio indicates that the cluster remains significant over a wide range of d values, implying it is a stable and robust cluster. A small stability score indicates that the break in this layer does not effectively separate the two segments.
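
The formula itself is not available in text form here. In the original paraclu method [2], the stability of a cluster is the ratio of its maximum density to its minimum density, that is, the range of d values over which the cluster persists. The sketch below assumes LongPASS follows that definition and keeps clusters whose ratio is at least minStability; both points are assumptions, and the function names are hypothetical.

```python
def stability(min_density: float, max_density: float) -> float:
    """Stability ratio as defined in paraclu: max density / min density."""
    if min_density <= 0:
        raise ValueError("min_density must be positive")
    return max_density / min_density


def is_stable(min_density: float, max_density: float,
              min_stability: float = 1.0) -> bool:
    # Assumption: a cluster is kept when its ratio is >= minStability.
    return stability(min_density, max_density) >= min_stability


# Hypothetical densities for two clusters.
wide_range = stability(0.5, 2.0)         # persists over a 4-fold range of d
narrow_range = stability(1.5, 1.8)       # persists over a narrow range of d
kept_at_two = is_stable(1.5, 1.8, min_stability=2.0)
```

The minimum and maximum density used here correspond to the two density columns reported in the LongPASS output.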

maxLength: The maximum length of retained clusters; clusters longer than this are split further. Default: 150.

removeSingletons: Whether to remove single-bp clusters (True/False). Default: True.

keepSingletonAbove: If removeSingletons is set to True, single-bp clusters with a count below this threshold are filtered out. Default: 0.1.

peak_threshold: Sites with a count below this value are filtered out. Default: 3.
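
To check a parameter file before a run, the layout shown above (a section name followed by indented key = value lines) can be read into a nested dictionary. This is a minimal sketch and not LongPASS's own parser, whose behaviour may differ; the function name parse_params is hypothetical.

```python
import ast
from typing import Any, Dict


def parse_params(text: str) -> Dict[str, Dict[str, Any]]:
    """Parse 'section:' headers and 'key = value' lines into nested dicts."""
    params: Dict[str, Dict[str, Any]] = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            section = line[:-1].strip()
            params[section] = {}
        elif "=" in line:
            if section is None:
                raise ValueError(f"parameter outside a section: {line!r}")
            key, value = (part.strip() for part in line.split("=", 1))
            try:
                # Converts 20 -> int, 0.1 -> float, True -> bool.
                params[section][key] = ast.literal_eval(value)
            except (ValueError, SyntaxError):
                params[section][key] = value  # keep unparsed values as strings
        else:
            raise ValueError(f"unrecognised line: {line!r}")
    return params


example = """distclu:
    maxdist = 20
paraclu:
    minStability = 1
    maxLength = 150
    removeSingletons = True
    keepSingletonAbove = 0.1
    reducetoNoneoverlap = True
    peak_threshold = 3
"""
params = parse_params(example)
```

After parsing, params["paraclu"]["maxLength"] is the integer 150 and params["paraclu"]["removeSingletons"] is the boolean True.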

Output Description

The output file contains 12 columns:

  • Chromosome
  • Gene strand
  • Genomic coordinate of cluster start site (0-based)
  • Genomic coordinate of cluster end site (0-based)
  • Site number
  • Genomic coordinate of dominant site in cluster
  • Normalized count at dominant site
  • Normalized total count in cluster
  • Maximum density of the cluster
  • Minimum density of the cluster
  • Cluster type (TSS/PAS)
  • Read IDs associated with this cluster

Reference

Contributors

This package is developed and maintained by Xiaodong Song, Yanhong Hong ([email protected]) and Wu Wei ([email protected]).

Footnotes

  1. Balwierz, P.J., Carninci, P., Daub, C.O. et al. Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data. Genome Biol 10, R79 (2009). https://doi.org/10.1186/gb-2009-10-7-r79

  2. Frith MC, Valen E, Krogh A, Hayashizaki Y, Carninci P, Sandelin A. A code for transcription initiation in mammalian genomes. Genome Res. 2008 Jan;18(1):1-12. doi: 10.1101/gr.6831208. Epub 2007 Nov 21. PMID: 18032727; PMCID: PMC2134772.