rule add_gene_ids error in GRCh37 (GTF v47) #154

maia-munteanu · 2025-01-28T15:40:09Z

Hi there!

I wanted to open an issue to mention a small bug I've come across. I've been running the annotations pipeline on a few sets of samples and I noticed that the GRCh37 ones fail at the add_gene_ids step, with the error:

rule add_gene_ids:
    input: output_TCGA/annotations/tmp/protein_coding_genes.parquet, output_TCGA/annotations/chckpts/calculate_MAF.chckpt
    output: output_TCGA/annotations/chckpts/add_gene_ids.chckpt
    jobid: 5
    reason: Missing output files: output_TCGA/annotations/chckpts/add_gene_ids.chckpt; Input files updated by another job: output_TCGA/annotations/chckpts/calculate_MAF.chckpt, output_TCGA/annotations/tmp/protein_coding_genes.parquet
    resources: tmpdir=/tmp, mem_mb=38000

/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.Float64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/dask/dataframe/utils.py:369: FutureWarning: pandas.UInt64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  _numeric_index_types = (pd.Int64Index, pd.Float64Index, pd.UInt64Index)
Traceback (most recent call last):
  File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/bin/deeprvat_annotations", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'deeprvat_annotations')())
  File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/g/strcombio/fsupek_home/mmunteanu/.local/share/mamba/envs/deeprvat_annotations/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/g/strcombio/fsupek_data/users/mmunteanu/test/deeprvat/deeprvat/annotations/annotations.py", line 1997, in add_gene_ids
    assert len(merged) == len_anno

This seems to be caused by one gene, SPRY3, which is present twice in the gene id file (pseudoautosomal gene):

19624 ENSG00000168939.13_8    protein_coding SPRY3    
20076 ENSG00000168939.6_PAR_Y protein_coding SPRY3

When the gene ID is split and the feature column is removed, the gene_base name is present twice, so it interferes with the merge into the annotations file. This happens with the v47 GENCODE release, specifically for GRCh37; GRCh38 only has one copy of the gene, on the X chromosome, so this is not an issue there. I'm not sure whether other GENCODE releases are affected. As a workaround, I've removed the ENSG00000168939.6_PAR_Y gene from the parquet file for now, but maybe this could be handled more elegantly in the gene_id_file function.
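A minimal sketch of the failure mode described above, in plain Python rather than the pipeline's own dataframe code. It assumes the annotations are left-joined to the gene id file on the gene base (the part of the Ensembl id before the first dot). The helper names, the control gene, and the variant rows are hypothetical, and this is not necessarily how the fix in #155 works.

```python
from collections import Counter

# The first two rows are copied from the gene id file excerpt above;
# the third is a made-up control gene (hypothetical).
gene_rows = [
    ("ENSG00000168939.13_8", "protein_coding", "SPRY3"),
    ("ENSG00000168939.6_PAR_Y", "protein_coding", "SPRY3"),
    ("ENSG00000000001.1_1", "protein_coding", "HYPOTHETICAL1"),
]

# Hypothetical annotation rows: (variant, gene base).
annotations = [
    ("variant_a", "ENSG00000168939"),
    ("variant_b", "ENSG00000000001"),
]


def gene_base(gene_id):
    """Return the part of the gene id before the first '.'."""
    return gene_id.split(".", 1)[0]


def duplicated_bases(rows):
    """List gene bases that occur more than once in the gene id rows."""
    counts = Counter(gene_base(row[0]) for row in rows)
    return sorted(base for base, n in counts.items() if n > 1)


def keep_one_row_per_base(rows):
    """Keep one row per gene base, preferring ids without the _PAR_Y suffix."""
    kept = {}
    for row in rows:
        base = gene_base(row[0])
        current = kept.get(base)
        replace_par_y = (
            current is not None
            and current[0].endswith("_PAR_Y")
            and not row[0].endswith("_PAR_Y")
        )
        if current is None or replace_par_y:
            kept[base] = row
    return list(kept.values())


def left_join(annos, rows):
    """Left-join annotation rows to gene rows on the gene base."""
    merged = []
    for variant, base in annos:
        matches = [row for row in rows if gene_base(row[0]) == base]
        if not matches:
            merged.append((variant, base, None))
        for row in matches:
            merged.append((variant, base, row[0]))
    return merged


# With the duplicate present, the join returns more rows than annotations,
# which is the condition the length assertion in add_gene_ids guards against.
inflated = left_join(annotations, gene_rows)

# After keeping one row per gene base, the row count is preserved.
deduplicated = keep_one_row_per_base(gene_rows)
clean = left_join(annotations, deduplicated)

print(duplicated_bases(gene_rows))
print(len(annotations), len(inflated), len(clean))
```

Running `duplicated_bases` on the gene id file before the merge is a cheap way to check whether another GENCODE release has the same problem.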

Thanks,
Maia

bfclarke · 2025-02-05T09:25:32Z

Thank you for raising this issue, Maia! I'm glad you found a workaround, and we'll work on a general fix.

…#154)

#155)

* make sure that every gene base exists once in gene id file (addressing #154)
* fixup! Format Python code with psf/black pull_request

---------

Co-authored-by: Marcel-Mueck
Co-authored-by: PMBio

Marcel-Mueck pushed a commit that referenced this issue Feb 11, 2025

make sure that every gene base exists once in gene id file (addressing #154)

4a9ee9e

Marcel-Mueck mentioned this issue Feb 11, 2025

make sure that every gene base exists once in gene id file (addressin… #155

Merged

meyerkm linked a pull request Feb 12, 2025 that will close this issue

make sure that every gene base exists once in gene id file (addressin… #155

Merged

meyerkm closed this as completed in #155 Feb 12, 2025
