rule add_gene_ids error in GRCh37 (GTF v47) #154

Closed
maia-munteanu opened this issue Jan 28, 2025 · 1 comment · Fixed by #155

Comments

@maia-munteanu

Hi there!

I wanted to open an issue to mention a small bug I've come across. I've been running the annotations pipeline on a few sets of samples and I noticed that the GRCh37 ones fail at the add_gene_ids step, with the error:

rule add_gene_ids:
    input: output_TCGA/annotations/tmp/protein_coding_genes.parquet, output_TCGA/annotations/chckpts/calculate_MAF.chckpt
    output: output_TCGA/annotations/chckpts/add_gene_ids.chckpt
    jobid: 5
    reason: Missing output files: output_TCGA/annotations/chckpts/add_gene_ids.chckpt; Input files updated by another job: output_TCGA/annotations/chckpts/calculate_MAF.chckpt, output_TCGA/annotations/tmp/protein_coding_genes.parquet
    resources: tmpdir=/tmp, mem_mb=38000

/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.UInt64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
Traceback (most recent call last):
  File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/bin/deeprvat_annotations", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_annotations')())
  File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/g/strcombio/fsupek_data/users/mmunteanu/test/deeprvat/deeprvat/annotations/annotations.py", line 1997, in add_gene_ids
    assert len(merged) == len_anno

This seems to be caused by one gene, SPRY3, which is present twice in the gene id file (pseudoautosomal gene):

19624 ENSG00000168939.13_8    protein_coding SPRY3    
20076 ENSG00000168939.6_PAR_Y protein_coding SPRY3 
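A quick way to check a gene id table for this kind of collision is sketched below. This is only an illustration: the column names (`gene`, `gene_type`, `gene_name`, `gene_base`) and the third, made-up gene row are assumptions for the example, not the pipeline's actual schema.

```python
import pandas as pd

# Toy gene id table. Column names and the GENE_A row are hypothetical;
# the two SPRY3 rows are the ones reported above.
genes = pd.DataFrame(
    {
        "gene": [
            "ENSG00000168939.13_8",
            "ENSG00000168939.6_PAR_Y",
            "ENSG00000000001.1_1",  # hypothetical placeholder gene
        ],
        "gene_type": ["protein_coding"] * 3,
        "gene_name": ["SPRY3", "SPRY3", "GENE_A"],
    }
)

# Strip the version suffix: keep everything before the first ".".
genes["gene_base"] = genes["gene"].str.split(".", n=1).str[0]

# keep=False flags every row whose gene_base occurs more than once.
duplicated = genes[genes["gene_base"].duplicated(keep=False)]
print(duplicated[["gene", "gene_base", "gene_name"]])
```

With the toy table above, only the two SPRY3 rows are reported, since both reduce to the same base id once the version suffix is dropped.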

When the gene name is split and the feature column is removed, the gene_base name is present twice, which interferes with the merge into the annotations file. This happens with the v47 Gencode release, specifically for GRCh37; GRCh38 has only one copy of the gene, on the X chromosome, so it is not affected. I'm not sure whether other Gencode releases are affected. As a workaround, I've removed the ENSG00000168939.6_PAR_Y gene from the parquet file for now, but maybe this could be handled more elegantly in the gene_id_file function.
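To illustrate why a duplicated gene_base would trip a length assertion like the one in the traceback, here is a minimal sketch with toy data. The actual merge in `annotations.py` is not shown in this issue, so the frame layouts, column names, and the `_PAR_Y` filter below are assumptions that mirror the workaround described above, not the fix that was eventually merged.

```python
import pandas as pd

# Toy annotations: three variants, two of them in SPRY3 (hypothetical layout).
annotations = pd.DataFrame(
    {
        "variant_id": [1, 2, 3],
        "gene_base": ["ENSG00000168939", "ENSG00000168939", "ENSG00000000001"],
    }
)

# Toy gene id table in which the SPRY3 base id appears twice.
genes = pd.DataFrame(
    {
        "gene": [
            "ENSG00000168939.13_8",
            "ENSG00000168939.6_PAR_Y",
            "ENSG00000000001.1_1",  # hypothetical placeholder gene
        ],
        "gene_base": ["ENSG00000168939", "ENSG00000168939", "ENSG00000000001"],
    }
)

len_anno = len(annotations)

# Each SPRY3 variant matches two gene rows, so the merge grows from 3 to 5 rows
# and a check like `len(merged) == len_anno` no longer holds.
merged = annotations.merge(genes, on="gene_base", how="left")
print(len(merged), len_anno)

# Workaround from this issue: drop the PAR_Y copy so each gene_base is unique.
genes_unique = genes[~genes["gene"].str.endswith("_PAR_Y")]

# validate="many_to_one" makes pandas raise a MergeError if the right-hand
# keys are not unique, which fails earlier and with a clearer message.
merged_fixed = annotations.merge(
    genes_unique, on="gene_base", how="left", validate="many_to_one"
)
print(len(merged_fixed), len_anno)
```

Filtering on the `_PAR_Y` suffix only covers this particular case; a general fix would need to guarantee one row per gene_base regardless of why the duplicate arises.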

Thanks,
Maia

@bfclarke
Contributor

bfclarke commented Feb 5, 2025

Thank you for raising this issue, Maia! I'm glad you found a workaround, and we'll work on a general fix.

meyerkm pushed a commit that referenced this issue Feb 12, 2025
#155)

* make sure that every gene base exists once in gene id file (addressing #154)

* fixup! Format Python code with psf/black pull_request

---------

Co-authored-by: “Marcel-Mueck” <“[email protected]”>
Co-authored-by: PMBio <[email protected]>