Skip to content

Commit 46f0247

Browse files
authored
Made gene_id_file mandatory. Added instructions to create it manually in docs
1 parent 0d23c55 commit 46f0247

File tree

5 files changed

+27
-17
lines changed

5 files changed

+27
-17
lines changed

Diff for: deeprvat/annotations/annotations.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -1988,7 +1988,7 @@ def add_gene_ids(gene_id_file: str, annotations_path: str, out_file: str):
19881988
"""
19891989
genes = pd.read_parquet(gene_id_file)
19901990
genes[["gene_base", "feature"]] = genes["gene"].str.split(".", expand=True)
1991-
genes.drop(columns=["feature", "gene", "gene_name", "gene_type"], inplace=True)
1991+
genes=genes[['id','gene_base']]
19921992
genes.rename(columns={"id": "gene_id"}, inplace=True)
19931993
annotations = pd.read_parquet(annotations_path)
19941994
len_anno = len(annotations)

Diff for: docs/annotations.md

+22-1
Original file line numberDiff line numberDiff line change
@@ -27,7 +27,10 @@ Download paths:
2727
- [PrimateAI](https://basespace.illumina.com/s/yYGFdGih1rXL) PrimateAI supplementary data/"PrimateAI_scores_v0.2_GRCh38_sorted.tsv.bgz"
2828
- [AlphaMissense](https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz)
2929

30-
Also a reference GTF file containing transcript annotations is required, this can be downloaded from [here](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz)
30+
Further requirements:
31+
- A reference GTF file containing transcript annotations is required, this can be downloaded from [here](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz).
32+
- A file containing all genes, which deeprvat should consider together with a unique integer id for each gene. This file may be created manually by the user or automatically using the gtf file as input to create a gene id file for all protein coding genes. See [here](#geneid) for more details.
33+
3134

3235

3336
## Configure the annotation pipeline
@@ -38,6 +41,7 @@ The config above would use the following directory structure:
3841
|--reference
3942
| |-- fasta file
4043
| |-- GTF file
44+
| |-- gene id file
4145

4246
|-- preprocessing_workdir
4347
| |-- norm
@@ -80,6 +84,7 @@ A GTF file as described in [requirements](#requirements) and the FASTA file used
8084
The output is stored in the `output_dir/annotations` folder and any temporary files in the `tmp` subfolder. All repositories used including VEP with its corresponding cache as well as plugins are stored in `repo_dir`.
8185
Data for VEP plugins and the CADD cache are stored in `annotation_data`.
8286

87+
(running)=
8388
## Running the annotation pipeline on example data
8489

8590

@@ -140,6 +145,22 @@ af_mode : 'af_gnomadg'
140145
```
141146
to the config file.
142147

148+
(geneid)=
149+
## Gene id file
150+
as mentioned in the [requirements](#requirements) section, the pipeline expects a parquet file contiaining all genes that deeprvat should consider, together with a unique integer id for each gene.
151+
This file can be created automatically using a GTF file as input. The output is then a parquet file in the expected format containing all protein coding genes of the provided GTF file.
152+
To automatically create the gene id file, make sure the annotation environment (mentioned [here](#running) ) is active and run
153+
```
154+
deeprvat_annotations create-gene-id-file deeprvat/example/annotations/reference/gencode.v44.annotation.gtf.gz deeprvat/example/annotations/reference/protein_coding_genes.parquet
155+
```
156+
with `deeprvat/example/annotations/reference/gencode.v44.annotation.gtf.gz` pointing to any downloaded GTF file and `deeprvat/example/annotations/reference/protein_coding_genes.parquet` pointing to the desired output path, which has to be specified in the config file.
157+
158+
Alternatively, when the user want to select a specific set of genes to consider, the gene id file may be created by the user. The file is expected to have two columns:
159+
- column`gene`:`str` name for each gene
160+
- column `id`:`int` unique id for each gene
161+
Each row represents a gene the user want to include in the analysis.
162+
163+
143164
## References
144165

145166
(reference-1-target)=

Diff for: example/config/deeprvat_annotation_config.yaml

+1
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,7 @@ kipoiveff_repo_dir : repo_dir/kipoi-veff2
2525
faatpipe_repo_dir : repo_dir/faatpipe
2626
vep_repo_dir : repo_dir/ensembl-vep
2727
preprocessing_workdir : ../preprocess/workdir
28+
gene_id_parquet: reference/protein_coding_genes.parquet
2829
additional_vep_plugin_cmds:
2930
cadd : CADD,annotation_data/cadd/whole_genome_SNVs.tsv.gz,annotation_data/cadd/gnomad.genomes.r3.0.indel.tsv.gz
3031
spliceAI : SpliceAI,snv=annotation_data/spliceAI/spliceai_scores.raw.snv.hg38.vcf.gz,indel=annotation_data/spliceAI/spliceai_scores.raw.indel.hg38.vcf.gz

Diff for: example/config/deeprvat_annotation_config_minimal.yaml

+2
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@
66
fasta_dir : reference
77
fasta_file_name : GRCh38.primary_assembly.genome.fa
88
gtf_file_name : gencode.v44.annotation.gtf.gz
9+
gene_id_parquet: reference/protein_coding_genes.parquet
910

1011
source_variant_file_pattern : chr{chr}test
1112
source_variant_file_type: 'bcf'
@@ -24,6 +25,7 @@ kipoiveff_repo_dir : repo_dir/kipoi-veff2
2425
faatpipe_repo_dir : repo_dir/faatpipe
2526
vep_repo_dir : repo_dir/ensembl-vep
2627
preprocessing_workdir : preprocessing_workdir
28+
2729
include_absplice : False
2830
include_deepSEA : False
2931
vep_online: True

Diff for: pipelines/annotations.snakefile

+1-15
Original file line numberDiff line numberDiff line change
@@ -44,7 +44,7 @@ genome_assembly = config.get("genome_assembly") or "GRCh38"
4444
fasta_dir = Path(config["fasta_dir"])
4545
fasta_file_name = config["fasta_file_name"]
4646
gtf_file = fasta_dir / config["gtf_file_name"]
47-
gene_id_file = config.get("gene_id_parquet")
47+
gene_id_file = config["gene_id_parquet"]
4848

4949
deeprvat_parent_path = Path(config["deeprvat_repo_dir"])
5050
annotation_python_file = (
@@ -191,20 +191,6 @@ rule all:
191191
chckpt = anno_dir / 'chckpts' / 'select_rename_fill_columns.chckpt',
192192
annotations = anno_dir / 'annotations.parquet'
193193

194-
if not gene_id_file:
195-
gene_id_file = fasta_dir / "protein_coding_genes.parquet"
196-
197-
rule create_gene_id_file:
198-
input:
199-
gtf_file,
200-
output:
201-
gene_id_file,
202-
resources:
203-
mem_mb=lambda wildcards, attempt: 15_000 * (attempt + 1),
204-
shell:
205-
" ".join(
206-
[f"deeprvat_annotations", "create-gene-id-file", "{input}", "{output}"]
207-
)
208194

209195
rule extract_with_header:
210196
input:

0 commit comments

Comments
 (0)