Further requirements:
- A reference GTF file containing transcript annotations is required; it can be downloaded from [here](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz).
- A file listing all genes that DeepRVAT should consider, together with a unique integer id for each gene. This file may be created manually by the user, or generated automatically from the GTF file, in which case it contains all protein-coding genes. See [here](#geneid) for more details.

## Configure the annotation pipeline
The config above would use the following directory structure:
|--reference
||-- fasta file
||-- GTF file
||-- gene id file

|-- preprocessing_workdir
||-- norm

A GTF file as described in [requirements](#requirements) and the FASTA file used
The output is stored in the `output_dir/annotations` folder and any temporary files in the `tmp` subfolder. All repositories used, including VEP with its corresponding cache as well as plugins, are stored in `repo_dir`.
Data for VEP plugins and the CADD cache are stored in `annotation_data`.

(running)=
## Running the annotation pipeline on example data
```
af_mode : 'af_gnomadg'
```
to the config file.

(geneid)=
## Gene id file
As mentioned in the [requirements](#requirements) section, the pipeline expects a parquet file containing all genes that DeepRVAT should consider, together with a unique integer id for each gene.
This file can be created automatically using a GTF file as input. The output is then a parquet file in the expected format containing all protein-coding genes of the provided GTF file.
To create the gene id file automatically, make sure the annotation environment (mentioned [here](#running)) is active and run
with `deeprvat/example/annotations/reference/gencode.v44.annotation.gtf.gz` pointing to any downloaded GTF file and `deeprvat/example/annotations/reference/protein_coding_genes.parquet` pointing to the desired output path, which also has to be specified in the config file.
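
For readers who want to see what such a file contains, the conversion described above can be approximated in a few lines of Python. This is only a sketch and not the pipeline's own implementation: the function names `protein_coding_genes` and `write_gene_id_file` are hypothetical, ids are assumed to be consecutive integers starting at 0, and the `gene` column is assumed to hold the GTF `gene_name` attribute (pass `name_attr="gene_id"` to use Ensembl ids instead). It relies on the `gene_type` attribute used in GENCODE GTF files; writing the parquet file additionally needs pandas with a parquet engine such as pyarrow.

```python
import gzip
import re

# Matches quoted GTF attributes such as: gene_name "OR4F5"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def protein_coding_genes(gtf_lines, name_attr="gene_name"):
    """Return [{'gene': str, 'id': int}, ...] for protein-coding gene records.

    Assumption: ids are assigned in file order, starting at 0.
    """
    rows = []
    seen = set()
    for line in gtf_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) != 9 or fields[2] != "gene":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        # GENCODE uses "gene_type"; other GTF sources may name this differently.
        if attrs.get("gene_type") != "protein_coding":
            continue
        name = attrs.get(name_attr)
        if name is None or name in seen:  # keep each gene only once
            continue
        seen.add(name)
        rows.append({"gene": name, "id": len(rows)})
    return rows


def write_gene_id_file(gtf_path, out_path):
    """Read a gzipped GTF file and write the gene/id table as parquet."""
    import pandas as pd  # imported lazily; requires pandas and a parquet engine

    with gzip.open(gtf_path, "rt") as handle:
        rows = protein_coding_genes(handle)
    pd.DataFrame(rows, columns=["gene", "id"]).to_parquet(out_path, index=False)
```

Calling `write_gene_id_file` with the GTF path and the desired output path would then produce a two-column parquet file; whether its contents match the pipeline's own output exactly is not guaranteed.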
Alternatively, when the user wants to select a specific set of genes to consider, the gene id file may be created manually. The file is expected to have two columns:
- column `gene`: `str` name for each gene
- column `id`: `int` unique id for each gene

Each row represents a gene the user wants to include in the analysis.
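
A manually curated gene list can be turned into the two-column layout above along these lines. This is a sketch under assumptions: `validate_gene_table`, `build_gene_table` and `save_gene_table` are hypothetical helpers, the gene names in the usage note are placeholders, and ids are simply assigned in input order starting at 0 (the description above only requires that they are unique integers). Writing the file needs pandas with a parquet engine such as pyarrow.

```python
def validate_gene_table(rows):
    """Raise ValueError unless every row has a str `gene` and a unique int `id`."""
    ids = set()
    for row in rows:
        if set(row) != {"gene", "id"}:
            raise ValueError(f"unexpected columns: {sorted(row)}")
        if not isinstance(row["gene"], str):
            raise ValueError(f"gene must be a str: {row['gene']!r}")
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(row["id"], int) or isinstance(row["id"], bool):
            raise ValueError(f"id must be an int: {row['id']!r}")
        if row["id"] in ids:
            raise ValueError(f"duplicate id: {row['id']}")
        ids.add(row["id"])


def build_gene_table(genes):
    """Assign ids 0, 1, 2, ... to the given gene names, dropping repeated names."""
    unique_genes = dict.fromkeys(genes)  # preserves input order
    rows = [{"gene": gene, "id": i} for i, gene in enumerate(unique_genes)]
    validate_gene_table(rows)
    return rows


def save_gene_table(rows, out_path):
    """Write the validated table as a parquet file with columns `gene` and `id`."""
    import pandas as pd  # imported lazily; requires pandas and a parquet engine

    validate_gene_table(rows)
    pd.DataFrame(rows, columns=["gene", "id"]).to_parquet(out_path, index=False)
```

For example, `save_gene_table(build_gene_table(["BRCA1", "TP53"]), "my_genes.parquet")` would write a two-row file; the output path then has to be set in the config file as described above.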