[Bug]: Metadata validator crashes when gwas_id is not inferred from filename #46

@teague-23andme

System information

  • Ubuntu 22.04
  • 1.0.5
  • Cloud

Description of the Issue

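The format command with --generate-metadata crashes when the input filename doesn't contain a GCST accession, even if the supplied metadata is otherwise valid. The crash comes from concatenating a string and None: no accession can be parsed from the filename, and metadata.py line 186 still appends the result to GWAS_CAT_API_STUDIES_URL.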

Running the command on a symlink to the file with a GCST name processes the metadata (more or less) as expected, except that it adds gwas_id and gwas_catalog_api values derived from the symlink name.

Calling the format command with a GCST filename that doesn't exist still processes the metadata and writes the metadata file.
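One way to avoid writing metadata for a missing data file would be an explicit existence check before any metadata is generated. The following is only a sketch: require_existing_file is a hypothetical helper name, not part of gwas_sumstats_tools, and where such a check would belong in the tool is an assumption.

```python
from pathlib import Path


def require_existing_file(filename: str) -> Path:
    """Return the path if it points to a regular file, else raise.

    Hypothetical helper for illustration; not part of gwas_sumstats_tools.
    Path.is_file() follows symlinks, so a symlink to a real file passes.
    """
    path = Path(filename)
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return path
```

Called before metadata generation, a check like this would stop the third command in the terminal output below before generated.yaml is written.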

Ideally, gwas_id and gwas_catalog_api shouldn't be inferred for files that don't require them.
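A minimal sketch of the behaviour suggested above: infer the two fields only when an accession is actually present in the filename. The base URL is taken from the generated metadata shown later in this report. infer_accession and inferred_catalog_fields are hypothetical names (the real parser's name is truncated in the traceback), and the GCST\d+ pattern is an assumption about what counts as an accession.

```python
import re
from typing import Optional

# Base URL as it appears in the generated metadata in this report.
GWAS_CAT_API_STUDIES_URL = "https://www.ebi.ac.uk/gwas/rest/api/studies/"


def infer_accession(filename: str) -> Optional[str]:
    """Return the first GCST accession in the filename, or None.

    Hypothetical stand-in for the library's own parser; the pattern
    is an assumption.
    """
    match = re.search(r"GCST\d+", filename)
    return match.group(0) if match else None


def inferred_catalog_fields(filename: str) -> dict:
    """Build gwas_id and gwas_catalog_api only when they can be inferred."""
    fields = {}
    accession = infer_accession(filename)
    if accession is not None:
        # Concatenation is safe here because accession is a str.
        fields["gwas_id"] = accession
        fields["gwas_catalog_api"] = GWAS_CAT_API_STUDIES_URL + accession
    return fields


print(inferred_catalog_fields("output.tsv.gz"))
print(inferred_catalog_fields("GCST1.tsv.gz"))
```

With this guard, a filename without an accession yields no catalog fields at all instead of raising a TypeError, so any values already present in the input metadata would be left untouched.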

Error Message

---------- METADATA ----------

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/jovyan/env/lib/python3.12/site-packages/gwas_sumstats_tools/cli.py:188 in ss_format        │
│                                                                                                  │
│   185 │   │   if custom_header_map else {}                                                       │
│   186 │   meta_dict = metadata_dict_from_args(args=extra_args.args) \                            │
│   187 │   │   if metadata_edit_mode else {}                                                      │
│ ❱ 188 │   format(filename=filename,                                                              │
│   189 │   │      data_outfile=data_outfile,                                                      │
│   190 │   │      minimal_to_standard=minimal_to_standard,                                        │
│   191 │   │      generate_metadata=generate_metadata,                                            │
│                                                                                                  │
│ ╭──────────────────────────────── locals ────────────────────────────────╮                       │
│ │      custom_header_map = False                                         │                       │
│ │           data_outfile = None                                          │                       │
│ │             extra_args = <click.core.Context object at 0x7f1629176b70> │                       │
│ │               filename = PosixPath('output.tsv.gz')                    │                       │
│ │      generate_metadata = True                                          │                       │
│ │             header_map = {}                                            │                       │
│ │              meta_dict = {}                                            │                       │
│ │     metadata_edit_mode = False                                         │                       │
│ │ metadata_from_gwas_cat = False                                         │                       │
│ │        metadata_infile = PosixPath('minimal.yaml')                     │                       │
│ │       metadata_outfile = PosixPath('generated.yaml')                   │                       │
│ │    minimal_to_standard = False                                         │                       │
│ ╰────────────────────────────────────────────────────────────────────────╯                       │
│                                                                                                  │
│ /home/jovyan/env/lib/python3.12/site-packages/gwas_sumstats_tools/format.py:144 in format        │
│                                                                                                  │
│   141 │   # Get metadata                                                                         │
│   142 │   if generate_metadata:                                                                  │
│   143 │   │   print("[bold]\n---------- METADATA ----------\n[/bold]")                           │
│ ❱ 144 │   │   metadata = formatter.set_metadata(                                                 │
│   145 │   │   │   from_gwas_cat=metadata_from_gwas_cat, custom_metadata=metadata_dict            │
│   146 │   │   )                                                                                  │
│   147 │   │   print(metadata)                                                                    │
│                                                                                                  │
│ ╭───────────────────────────────────────── locals ─────────────────────────────────────────╮     │
│ │           data_outfile = None                                                            │     │
│ │               filename = PosixPath('output.tsv.gz')                                      │     │
│ │              formatter = <gwas_sumstats_tools.format.Formatter object at 0x7f16289d3e60> │     │
│ │      generate_metadata = True                                                            │     │
│ │             header_map = {}                                                              │     │
│ │          metadata_dict = {}                                                              │     │
│ │ metadata_from_gwas_cat = False                                                           │     │
│ │        metadata_infile = PosixPath('minimal.yaml')                                       │     │
│ │       metadata_outfile = PosixPath('generated.yaml')                                     │     │
│ │    minimal_to_standard = False                                                           │     │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────╯     │
│                                                                                                  │
│ /home/jovyan/env/lib/python3.12/site-packages/gwas_sumstats_tools/format.py:88 in set_metadata   │
│                                                                                                  │
│    85 │   │   │   metadata object                                                                │
│    86 │   │   """                                                                                │
│    87 │   │   self.meta.from_file()                                                              │
│ ❱  88 │   │   meta_dict = get_file_metadata(                                                     │
│    89 │   │   │   in_file=self.data_infile,                                                      │
│    90 │   │   │   out_file=self.data_outfile,                                                    │
│    91 │   │   │   meta_dict=self.meta.as_dict(),                                                 │
│                                                                                                  │
│ ╭───────────────────────────────────── locals ──────────────────────────────────────╮            │
│ │ custom_metadata = {}                                                              │            │
│ │   from_gwas_cat = False                                                           │            │
│ │            self = <gwas_sumstats_tools.format.Formatter object at 0x7f16289d3e60> │            │
│ ╰───────────────────────────────────────────────────────────────────────────────────╯            │
│                                                                                                  │
│ /home/jovyan/env/lib/python3.12/site-packages/gwas_sumstats_tools/interfaces/metadata.py:186 in  │
│ get_file_metadata                                                                                │
│                                                                                                  │
│   183 │   inferred_meta_dict['genome_assembly'] = GENOME_ASSEMBLY_MAPPINGS.get(parse_genome_as   │
│   184 │   inferred_meta_dict['data_file_md5sum'] = get_md5sum(out_file) if Path(out_file).exis   │
│   185 │   inferred_meta_dict['date_last_modified'] = date.today()                                │
│ ❱ 186 │   inferred_meta_dict['gwas_catalog_api'] = GWAS_CAT_API_STUDIES_URL + parse_accession_   │
│   187 │   for field, value in inferred_meta_dict.items():                                        │
│   188 │   │   update_dict_if_not_set(meta_dict, field, value)                                    │
│   189 │   return meta_dict                                                                       │
│                                                                                                  │
│ ╭────────────────────────────────────── locals ───────────────────────────────────────╮          │
│ │            in_file = PosixPath('output.tsv.gz')                                     │          │
│ │ inferred_meta_dict = {                                                              │          │
│ │                      │   'gwas_id': None,                                           │          │
│ │                      │   'data_file_name': 'output.tsv.gz',                         │          │
│ │                      │   'file_type': 'GWAS-SFF v1.0',                              │          │
│ │                      │   'genome_assembly': 'unknown',                              │          │
│ │                      │   'data_file_md5sum': '7e29306421cfb296a5e1099f2e461390',    │          │
│ │                      │   'date_last_modified': datetime.date(2024, 11, 6)           │          │
│ │                      }                                                              │          │
│ │          meta_dict = {                                                              │          │
│ │                      │   'genotyping_technology': [                                 │          │
│ │                      │   │   'Genome-wide genotyping array'                         │          │
│ │                      │   ],                                                         │          │
│ │                      │   'gwas_id': None,                                           │          │
│ │                      │   'trait_description': None,                                 │          │
│ │                      │   'minor_allele_freq_lower_limit': None,                     │          │
│ │                      │   'data_file_name': 'output.tsv.gz',                         │          │
│ │                      │   'file_type': 'GWAS-SSF v1.0',                              │          │
│ │                      │   'data_file_md5sum': None,                                  │          │
│ │                      │   'is_harmonised': False,                                    │          │
│ │                      │   'is_sorted': False,                                        │          │
│ │                      │   'date_last_modified': datetime.date(2024, 11, 6),          │          │
│ │                      │   ... +12                                                    │          │
│ │                      }                                                              │          │
│ │           out_file = PosixPath('output.tsv.gz')                                     │          │
│ ╰─────────────────────────────────────────────────────────────────────────────────────╯          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: can only concatenate str (not "NoneType") to str

### Command used and terminal output

```console
$ gwas-ssf format empty.tsv.gz --meta-in minimal.yaml --meta-out generated.yaml --generate-metadata
...
# Crashes with the message above
TypeError: can only concatenate str (not "NoneType") to str

# However, simply calling the validator with a symlink to the same file works
$ ln -s empty.tsv GCST1.tsv
$ gwas-ssf format GCST1.tsv.gz --meta-in minimal.yaml --meta-out generated.yaml --generate-metadata

---------- METADATA ----------

adjusted_covariates:
- age
- sex
analysis_software: PLINK 1.9
author_notes: Example
coordinate_system: 1-based
data_file_md5sum: 05eea3e7b985d4f552fcec50c102bed8
data_file_name: output.tsv.gz
date_last_modified: 2024-11-06
file_type: GWAS-SSF v1.0
genome_assembly: GRCh37
genotyping_technology:
- Genome-wide genotyping array
gwas_catalog_api: https://www.ebi.ac.uk/gwas/rest/api/studies/GCST1
gwas_id: GCST1
harmonisation_reference: null
imputation_panel: 1000 Genomes Phase 3 (placeholder)
imputation_software: GENOTYPE
is_harmonised: false
is_sorted: false
minor_allele_freq_lower_limit: null
ontology_mapping: null
samples:
- ancestry_method:
  - self-reported
  - gentically determined
  case_control_study: false
  case_count: null
  control_count: null
  sample_ancestry: null
  sample_size: 1000
sex: combined
trait_description: null

Writing metadata --> generated.yaml

# Surprisingly, it also works even if the file doesn't actually exist
$ gwas-ssf format GCST999999999999999.tsv --meta-in minimal.yaml --meta-out generated.yaml --generate-metadata
[Errno 2] No such file or directory: 'GCST999999999999999.tsv'

---------- METADATA ----------

adjusted_covariates:
- age
- sex
analysis_software: PLINK 1.9
author_notes: Example
coordinate_system: 1-based
data_file_md5sum: null
data_file_name: output.tsv.gz
date_last_modified: 2024-11-06
file_type: GWAS-SSF v1.0
genome_assembly: GRCh37
genotyping_technology:
- Genome-wide genotyping array
gwas_catalog_api: https://www.ebi.ac.uk/gwas/rest/api/studies/GCST999999999999999
gwas_id: GCST999999999999999
harmonisation_reference: null
imputation_panel: 1000 Genomes Phase 3 (placeholder)
imputation_software: GENOTYPE
is_harmonised: false
is_sorted: false
minor_allele_freq_lower_limit: null
ontology_mapping: null
samples:
- ancestry_method:
  - self-reported
  - gentically determined
  case_control_study: false
  case_count: null
  control_count: null
  sample_ancestry: null
  sample_size: 1000
sex: combined
trait_description: null

Writing metadata --> generated.yaml
$
```

First 10 Rows of the Input File

empty.tsv:

chromosome      base_pair_location      effect_allele   other_allele    beta    standard_error  p_value variant_id      ref_allele

minimal.yaml:

```yaml
adjusted_covariates:
- age
- sex
analysis_software: PLINK 1.9
author_notes: Example
coordinate_system: 1-based
data_file_name: output.tsv.gz
date_last_modified: 2024-11-06
file_type: GWAS-SSF v1.0
genome_assembly: GRCh37
genotyping_technology:
- Genome-wide genotyping array
imputation_panel: 1000 Genomes Phase 3 (placeholder)
imputation_software: GENOTYPE
is_harmonised: false
is_sorted: false
samples:
- ancestry_method:
  - self-reported
  - gentically determined
  case_control_study: false
  sample_size: 1000
sex: combined
```

### Relevant files

_No response_

Labels: bug (Something isn't working)
