ASJA: a program for Assembling Splice Junctions Analysis
--------------------------------------------
What is ASJA
----------------
RNA splicing may generate different kinds of splice junctions, such as linear, back-splice and fusion junctions. Only a limited number of programs are available for detection and quantification of splice junctions.
Here, we present Assembling Splice Junctions Analysis (ASJA), a software package that identifies and characterizes all splice junctions from high-throughput RNA sequencing (RNA-seq) data. ASJA processes assembled transcripts and chimeric alignments from the STAR aligner and StringTie assembler.
ASJA provides the unique position and normalized expression level of each junction. Annotations and integrative analysis of the junctions enable additional filtering. It is also appropriate for the identification of novel junctions.
Implementation and Dependencies
-------------------------------
ASJA was developed in perl (v5) and shell (bash). Before running the program, check that the following perl packages are available and install any that are missing:
*File::Basename;
*Getopt::Long;
*List::Util qw/min sum max/;
ASJA also depends on STAR, StringTie, featureCounts and sambamba, so these tools must be installed and their locations added to PATH in ~/.bashrc:
* STAR (version <= 2.5)
* StringTie (version <= 1.2.3)
* featureCounts (version >= 1.5.0)
* sambamba (version >=0.6.6)
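Before running the scripts, it can help to confirm that the tools listed above are reachable from PATH and that the perl modules load. The following is a minimal sketch, not part of ASJA: the function names are hypothetical, and the executable names in the example (STAR, stringtie, featureCounts, sambamba) are assumed to match a typical installation.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not shipped with ASJA.

# Print every tool name that cannot be found on PATH.
# The exit status is non-zero if at least one tool is missing.
missing_tools() (
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    exit "$rc"
)

# Succeed if the given perl module can be loaded.
perl_module_ok() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Example (executable names are assumptions):
#   missing_tools perl STAR stringtie featureCounts sambamba
#   perl_module_ok List::Util || echo "List::Util is not installed"
```

The check only tests for presence; the version limits listed above still have to be verified by hand.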
ASJA Installation
------------
Download the package and unzip it on Linux (CentOS or Ubuntu).
Documentation
-------------
This section describes the commands, their arguments and the output files. Each command is given after 'usage'.
Files Needed:
------------
1. Human genome sequence (hg38.fasta) and a GTF file, used to generate the STAR index. We recommend GENCODE; the program reports an error if a GTF from UCSC is used.
2. Raw RNA-seq data (fastq.gz)
Commands and arguments
--------------------------
* Note: absolute paths are required when running the scripts
--------------------------------------------------------------------------------------------------------------------------------------------------------------
1. Mapping of RNA-seq data
usage: perl runSTAR.pl [OPTIONS]
The arguments of runSTAR.pl are as follows. To work with single-end reads, see the STAR documentation at https://github.com/alexdobin/STAR:
-f1 <FASTA1>
Name of the read 1 file (Illumina paired-end reads); required.
-f2 <FASTA2>
Name of the read 2 file (Illumina paired-end reads); required.
-fq_dir <fastq dir>
Specifies the path to the directory containing the sequence files to be mapped.
-G <path_and_gtf>
Specifies the path to the file with annotated transcripts in the standard GTF format.
-GA <genomeFastaFiles>
Specifies one or more FASTA files with the genome reference sequences.
-O <outdir>
Specifies the path to the directory where the alignment results are stored.
-pass
Run STAR in the 2-pass alignment mode.
-index
Generate the STAR genome index with default settings.
-SI_dir <genome index dir>
Specifies the path to the genome directory where the genome indexes were generated.
-I <path>
Specifies the path to the ASJA installation directory.
-S <sample>
Name of the sample.
### Generating genome indexes ###
usage: perl runSTAR.pl -I path/to/ASJA -index -SI_dir path/to/star_index -G path/to/gencode.annotation.gtf -GA path/to/GRCh38.primary_assembly.genome.fa
### running STAR in the 2-pass mode [Kahles et al., 2018, Cancer Cell 34, 1–14] ###
usage: perl runSTAR.pl -I path/to/ASJA -pass -SI_dir /path/to/star_index -f1 R1.fq.gz -f2 R2.fq.gz -fq_dir path/to/fastq -GA path/to/GRCh38.primary_assembly.genome.fa -O path/to/out_dir -S sample_name
Output: sample_mapped_reads.bam
Chimeric.out.junction
SJ.out.tab
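For orientation, a direct STAR call that yields the same kinds of files (a sorted BAM, Chimeric.out.junction and SJ.out.tab) could look like the sketch below. It only prints the command and runs nothing. The options are documented STAR options, but the combination is an assumption about a reasonable configuration, not the command runSTAR.pl actually issues; the function name, the thread count and the --chimSegmentMin value are illustrative.

```shell
# Hypothetical helper: print a two-pass, paired-end STAR command with
# chimeric detection enabled. Nothing is executed.
# Arguments: index_dir read1 read2 out_prefix
print_star_cmd() {
    printf '%s\n' "STAR --runThreadN 4 --genomeDir $1 \
--readFilesIn $2 $3 --readFilesCommand zcat \
--twopassMode Basic --chimSegmentMin 20 \
--outSAMtype BAM SortedByCoordinate --outFileNamePrefix $4"
}

# Example:
#   print_star_cmd /path/to/star_index R1.fq.gz R2.fq.gz /path/to/out_dir/sample_name_
```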
--------------------------------------------------------------------------------------------------------------------------------------------------------------
2. Extraction and processing of junctions
* We provide step-by-step processing (ASJA.pl, filtering.pl, integration.pl) and quick processing (ASJA-all.pl) to obtain junctions. However, the annotation file can only be prepared with ASJA.pl -setup, and the transcripts for linear junctions can only be generated with StringTie.
****** step-by-step processing ******
usage: perl ASJA.pl [options]
The arguments of ASJA.pl are as follows:
-I <ASJA dir>
Specifies the path to the ASJA installation directory.
-G <path_and_gtf>
Specifies the path to the file with annotated transcripts in the standard GTF format.
-setup
Prepare the reference file used to annotate junctions.
-linear
Extract linear junctions.
-backsplicing
Extract back splicing junctions.
-fusion
Extract fusion junctions.
-CI <alignment dir>
Specifies the path to the directory containing the STAR alignment results.
-SI <path_and_file>
Name(s), with path, of the file(s) containing the transcripts generated by StringTie.
This path is also used as the output directory.
-ann
Annotate the junctions.
-ratio
Calculate the junction ratios.
****** The mapped reads are then assembled into transcripts by StringTie using reference-based transcriptome assembly. Please see http://ccb.jhu.edu/software/stringtie/ ******
usage: stringtie input_mapped_reads.bam -f 0.1 -o path/to/stringtie_assembly.gtf -p 4 -G path/to/gencode.v29.annotation.gtf
### Preparing the file for junction annotation ###
usage: perl ASJA.pl -I /path/to/ASJA -G path/to/ref/gencode.v29.annotation.gtf -setup
### The extraction of linear junctions from stringtie_assembly ###
usage: perl ASJA.pl -I path/to/ASJA -linear -G path/to/gencode.v29.annotation.gtf -SI path/to/example/assembly/input/stringtie_assembly.gtf -CI path/to/example/alignment/input -ann -ratio
### The extraction of back splicing junctions from Chimeric.out.junction ###
usage: perl ASJA.pl -I path/to/ASJA -backsplicing -G path/to/gencode.v29.annotation.gtf -SI path/to/example/assembly/input/stringtie_assembly.gtf -CI path/to/example/alignment/input -ann -ratio
### The extraction of fusion junctions from Chimeric.out.junction ###
usage: perl ASJA.pl -I path/to/ASJA -fusion -G path/to/gencode.v29.annotation.gtf -SI path/to/example/assembly/input/stringtie_assembly.gtf -CI path/to/example/alignment/input -ann -ratio
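To illustrate what extracting linear junctions from an assembled GTF involves, the sketch below derives intron coordinates from consecutive exon records of each transcript. It is not ASJA.pl's implementation: the function name and the chr:start-end:strand identifier are hypothetical, and it assumes that the exon lines of a transcript are adjacent and sorted by coordinate.

```shell
# Hypothetical helper, not part of ASJA.
# Usage: gtf_to_junctions stringtie_assembly.gtf
# Prints one unique junction per line as chr:intron_start-intron_end:strand.
gtf_to_junctions() (
    awk -F '\t' '
        $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
            # drop the 15-character prefix (transcript_id, space, quote)
            # and the closing quote to keep only the identifier
            tid = substr($9, RSTART + 15, RLENGTH - 16)
            exon_start = $4 + 0; exon_end = $5 + 0
            if (tid == prev_tid) {
                if (exon_start > prev_end)        # exons in ascending order
                    printf "%s:%d-%d:%s\n", $1, prev_end + 1, exon_start - 1, $7
                else if (exon_end < prev_start)   # exons in descending order
                    printf "%s:%d-%d:%s\n", $1, exon_end + 1, prev_start - 1, $7
            }
            prev_tid = tid; prev_start = exon_start; prev_end = exon_end
        }
    ' "$1" | LC_ALL=C sort -u
)
```

The intron start and end produced this way use the same convention as columns 2 and 3 of STAR's SJ.out.tab (first and last base of the intron, 1-based), which is what makes a read-count lookup against that file possible.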
usage: perl filtering.pl [options]
The arguments of filtering.pl are as follows:
-read <1>
Sets the filtering threshold on the number of junction reads (optional; e.g. 1).
-ratio <0.01>
Sets the filtering threshold on the junction ratio (linear weight ratio / back splicing ratio / fused ratio) (optional; e.g. 0.01).
-linear
Filter linear junctions.
-backsplicing
Filter back splicing junctions.
-fusion
Filter fusion junctions.
-IN <input file>
Name(s), with path, of the file(s) to filter.
-O <output file>
Name(s), with path, of the result file(s).
### Generating high-confidence junctions ###
usage: perl filtering.pl -read 1 -ratio 0.08 -linear -IN path/to/Linear.txt -O path/to/F_linear.txt
* NOTE: Each type of junction needs a threshold to obtain high-confidence junctions.
For example, we consider a linear junction high-confidence when its ratio is greater than 0.01 and its read count is greater than 1.
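The thresholding rule in the note above can be sketched with awk. This is not filtering.pl: the function name is hypothetical, the comparison is strictly "greater than" to follow the wording of the note, and the input is assumed to be tab-delimited with a header line, the read count in column 3 and the weight ratio in column 8 (the column order of the linear junction format described later in this file).

```shell
# Hypothetical helper, not filtering.pl.
# Usage: filter_junctions FILE MIN_READ MIN_RATIO [READ_COL] [RATIO_COL]
# Keeps the header plus every row with read > MIN_READ and ratio > MIN_RATIO.
filter_junctions() (
    awk -F '\t' -v min_read="$2" -v min_ratio="$3" \
        -v read_col="${4:-3}" -v ratio_col="${5:-8}" '
        NR == 1 { print; next }    # keep the header line
        ($read_col + 0) > (min_read + 0) && ($ratio_col + 0) > (min_ratio + 0)
    ' "$1"
)

# Example:
#   filter_junctions path/to/Linear.txt 1 0.01 > path/to/F_linear.txt
```

Passing the column numbers as optional arguments lets the same function be pointed at back splicing or fusion tables, whose read and ratio columns sit elsewhere.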
### The integration of the three types of junctions ###
usage: perl integration.pl -A linear.txt -B circRNA.txt -C fusion.txt -O all.txt
* Note: The input junctions need to be annotated.
****** quick processing ******
usage: perl ASJA-all.pl [options]
-I <dir ASJA>
Specifies the path to the ASJA installation directory.
-G <path_and_gtf>
Specifies the path to the file with annotated transcripts in the standard GTF format.
-CI <dir alignment>
Specifies the path to the directory containing the STAR alignment results.
-SI <path_and_file>
Name(s), with path, of the file(s) containing the transcripts generated by StringTie.
-O <outdir>
Specifies the path to the directory where the results are stored.
### Quickly get the three types of junctions using default parameters ###
usage: perl ASJA-all.pl -I /path/to/ASJA -G path/to/gencode.v29.annotation.gtf -CI /path/to/example/alignment/input -SI path/to/example/assembly/input/stringtie_assembly.gtf -O path/to/result
--------------------------------------------------------------------------------------------------------------------------------------------------------------
3. Other programs
### Gene-level read counts can be calculated with featureCounts. Please see http://subread.sourceforge.net/ ###
usage: featureCounts -p -T 6 -a gencode.annotation.gtf -o path/to/featurecount.txt sample_mapped_reads.bam
### The calculation of TPM from the featureCounts output ###
usage: perl TPM.pl -A featurecount -B featurecount.summary -O TPM.txt
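For reference, the standard TPM definition (count divided by gene length, rescaled so that the values sum to one million) can be computed from a featureCounts table as sketched below. This is not TPM.pl, which also takes the .summary file and may normalize differently; the function name is hypothetical, and the input is assumed to have the usual featureCounts layout with the gene length in column 6 and a single sample's counts in column 7.

```shell
# Hypothetical helper, not TPM.pl.
# Usage: featurecounts_to_tpm featurecount.txt
# Prints: Geneid <TAB> TPM
featurecounts_to_tpm() (
    awk -F '\t' '
        /^#/ || $1 == "Geneid" { next }                    # skip comment and header
        NR == FNR { if ($6 > 0) total += $7 / $6; next }   # pass 1: sum of count/length
        {                                                  # pass 2: scale to one million
            tpm = (total > 0 && $6 > 0) ? ($7 / $6) / total * 1e6 : 0
            printf "%s\t%.2f\n", $1, tpm
        }
    ' "$1" "$1"
)
```

The file is read twice (it is passed to awk two times) so that the normalizing sum is known before any value is printed.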
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Description of the output files generated by ASJA
Linear junction primary format
* junctions: A unique identifier for a linear junction.
* CPT: The expression level of the junction in a custom format (CPT).
* read: The read count of the junction as matched in SJ.out.tab.
* transID: The transcript_id in the reference annotation that the junction matched.
* geneID: The gene_id in the reference annotation that the junction matched.
* gene: The gene_name in the reference annotation that the junction matched.
* type: The gene_type in the reference annotation that the junction matched.
* Weight ratio: The weight of the junction within its annotated gene.
Back splicing junction primary format
* circID: A unique identifier for a back splicing junction.
* read: The sum of GT_AG_read and CT_AC_read.
* GT_AG_read: The number of back splicing reads with junction type 1 (see the STAR manual).
* CT_AC_read: The number of back splicing reads with junction type 2 (see the STAR manual).
* left_backratio: The 5' ratio of the circRNA.
* right_backratio: The 3' ratio of the circRNA.
* annotation: The annotation of the circRNA, including gene_id;trans_id;gene_type;gene_name.
* length_exon: The length of the exon.
* pos_exon: The position of the exon.
Fusion junction primary format
* fusionID: A unique identifier for a fusion junction.
* read: The sum of GT_AG_read and CT_AC_read.
* GT_AG_read: The number of fusion reads with junction type 1 (see the STAR manual).
* CT_AC_read: The number of fusion reads with junction type 2 (see the STAR manual).
* Leftbackratio: The ratio of the acceptor.
* Rightbackratio: The ratio of the donor.
* left_type: The annotation type of the acceptor.
* leftann: The annotation of the acceptor, including gene_id;transcript_id;gene_type;gene_name;exon_number.
* right_type: The annotation type of the donor.
* rightann: The annotation of the donor, including gene_id;transcript_id;gene_type;gene_name;exon_number.
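The GT_AG_read, CT_AC_read and read fields of the back splicing and fusion tables can be illustrated by tallying column 7 (junction type) of STAR's Chimeric.out.junction per junction. The sketch below is not ASJA's code: the function name and the junction key (donor chr:position joined to acceptor chr:position) are hypothetical, and it does not separate back splicing from fusion junctions.

```shell
# Hypothetical helper, not part of ASJA.
# Usage: count_chimeric_reads Chimeric.out.junction
# Prints: junction_key <TAB> GT_AG_read <TAB> CT_AC_read <TAB> read
count_chimeric_reads() (
    awk 'BEGIN { FS = OFS = "\t" }
         $7 == 1 || $7 == 2 {               # type 1 = GT/AG, type 2 = CT/AC
             key = $1 ":" $2 "|" $4 ":" $5  # donor chr:pos | acceptor chr:pos
             if (!(key in seen)) { seen[key] = 1; order[++n] = key }
             if ($7 == 1) gt_ag[key]++; else ct_ac[key]++
         }
         END {
             for (i = 1; i <= n; i++) {
                 k = order[i]
                 print k, gt_ag[k] + 0, ct_ac[k] + 0, gt_ag[k] + ct_ac[k]
             }
         }' "$1"
)
```

Rows with junction type -1 (the junction lies between the mates) carry no splice motif and are skipped, as are header or comment lines, whose seventh column is not 1 or 2.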
Integration output:
* Gene_name: Gene symbol.
* Linear junctions: A unique identifier for a linear junction.
* circRNA: Unique identifier(s) of the circRNAs related to the linear junction and gene; multiple circRNAs are separated by semicolons.
* fusion: Unique identifier(s) of the fusions related to the linear junction and gene; multiple fusions are separated by semicolons.
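Because the circRNA field may hold several identifiers in one cell, downstream tools often need one row per identifier. A minimal sketch follows, assuming the integration file is tab-delimited with a header line and the four columns in the order listed above; the function name is hypothetical.

```shell
# Hypothetical helper, not part of ASJA.
# Usage: expand_circrna all.txt
# Prints: Gene_name <TAB> linear junction <TAB> one circRNA ID per line
expand_circrna() (
    awk -F '\t' -v OFS='\t' '
        NR == 1 { next }                    # skip the header line
        {
            n = split($3, ids, ";")         # column 3: semicolon-separated IDs
            for (i = 1; i <= n; i++)
                if (ids[i] != "") print $1, $2, ids[i]
        }
    ' "$1"
)
```

Rows with an empty circRNA field produce no output; the same approach applies to the fusion column by splitting column 4 instead.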