I wanted to open an issue to mention a small bug I've come across. I've been running the annotations pipeline on a few sets of samples and I noticed that the GRCh37 ones fail at the add_gene_ids step, with the error:
rule add_gene_ids:
input: output_TCGA/annotations/tmp/protein_coding_genes.parquet, output_TCGA/annotations/chckpts/calculate_MAF.chckpt
output: output_TCGA/annotations/chckpts/add_gene_ids.chckpt
jobid: 5
reason: Missing output files: output_TCGA/annotations/chckpts/add_gene_ids.chckpt; Input files updated by another job: output_TCGA/annotations/chckpts/calculate_MAF.chckpt, output_TCGA/annotations/tmp/protein_coding_genes.parquet
resources: tmpdir=/tmp, mem_mb=38000
/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
_numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
_numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.UInt64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
_numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
Traceback (most recent call last):
File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/bin/deeprvat_annotations", line 33, in <module>
sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_annotations')())
File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/g/strcombio/fsupek_data/users/mmunteanu/test/deeprvat/deeprvat/annotations/annotations.py", line 1997, in add_gene_ids
assert len(merged) == len_anno
This seems to be caused by one gene, SPRY3, which is present twice in the gene id file (pseudoautosomal gene):
When the gene name is split and the feature column is removed, the gene_base name is present twice, so it interferes with the merge into the annotations file. This happens with the v47 Gencode release, specifically for GRCh37; GRCh38 only has one copy of the gene, on the X chromosome, so it is not affected. I'm not sure whether other Gencode releases are affected. As a workaround, I've removed the ENSG00000168939.6_PAR_Y gene from the parquet file for now, but maybe this could be handled more elegantly in the gene_id_file function somehow.
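To illustrate the failure mode, here is a minimal pandas sketch. It is not the pipeline's actual code. The column names `gene`, `id` and `variant`, the toy variants, the id of the X copy, and the helper `drop_par_y_duplicates` are all assumptions made for the example. Only `gene_base`, `len_anno`, `merged` and the `_PAR_Y` id come from the report above.

```python
import pandas as pd

# Hypothetical gene id table. Column names and the X-copy id are assumptions;
# only the _PAR_Y id is taken from the report.
gene_ids = pd.DataFrame(
    {
        "gene": [
            "ENSG00000168939.6",        # assumed id of the chrX copy of SPRY3
            "ENSG00000168939.6_PAR_Y",  # pseudoautosomal copy on chrY
            "ENSG00000000003.10",       # unrelated gene, for contrast
        ],
        "id": [0, 1, 2],
    }
)
# Strip the version suffix: both SPRY3 rows collapse to the same gene_base.
gene_ids["gene_base"] = gene_ids["gene"].str.split(".").str[0]

# Toy annotations table with one variant per gene.
annotations = pd.DataFrame(
    {
        "variant": ["v1", "v2"],
        "gene_base": ["ENSG00000168939", "ENSG00000000003"],
    }
)
len_anno = len(annotations)

# The duplicated gene_base makes the left merge emit an extra row, which is
# what trips `assert len(merged) == len_anno`.
merged = annotations.merge(gene_ids, on="gene_base", how="left")


def drop_par_y_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Drop _PAR_Y rows whose gene_base also appears on another row."""
    is_par_y = df["gene"].str.endswith("_PAR_Y")
    is_duplicated = df["gene_base"].duplicated(keep=False)
    return df[~(is_par_y & is_duplicated)]


deduped = drop_par_y_duplicates(gene_ids)
# validate="many_to_one" makes pandas raise if gene_base is still not unique.
merged_ok = annotations.merge(
    deduped, on="gene_base", how="left", validate="many_to_one"
)
```

In this toy case `merged` has one row more than `annotations`, while `merged_ok` keeps the original length. Passing `validate="many_to_one"` to the merge is one way to surface a non-unique `gene_base` with a more descriptive error than a bare length assertion.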
Thanks,
Maia
#155)
* make sure that every gene base exists once in gene id file (addressing #154)
* fixup! Format Python code with psf/black pull_request
---------
Co-authored-by: “Marcel-Mueck” <“[email protected]”>
Co-authored-by: PMBio <[email protected]>