unsure if constructed models in GRCh38 are correct #50

vifehe · 2024-06-26T13:32:42Z

We intend to use Gnomix to infer local ancestry for our data. However, since our data are on hg38 (GRCh38), our first step is to build a pretrained model for that assembly, following the demo script.

As we use phased data, we downloaded 1000G data from https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/.

The sample smap file is constructed as,

$ tail -n +2 1000g.smap | cut -f 1 > sample.smap

and then we run (this is a test for chr1),

$ bcftools view -S sample.smap -o reference_1000g.vcf 1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.vcf.gz

in order to get our reference.vcf file.

Now, we run,

gnomix.sh 01-1000G-PHASED/1kGP_high_coverage_Illumina.chr1.filtered.SNV_INDEL_SV_phased_panel.vcf.gz /testing0/phased_chr1 chr1 True Genetic-map-b38.gmap2 reference_1000g.vcf 1000g.smap

and we get this output,

#######################################################################

/usr/local/lib/python3.8/dist-packages/allel/io/vcf_read.py:1732: UserWarning: invalid INFO header: '##INFO=<ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">\n'
warnings.warn('invalid INFO header: %r' % header)
...

----------------------------------- Gnomix -----------------------------------

When using this software, please cite:
Helgi Hilmarsson, Arvind S Kumar, Richa Rastogi, Carlos D Bustamante,
Daniel Mas Montserrat, Alexander G Ioannidis:
"High Resolution Ancestry Deconvolution for Next Generation Genomic Data"
https://www.biorxiv.org/content/10.1101/2021.09.19.460980v1

Launching in training mode...
Reading vcf file...
Getting genetic map info...
Getting sample map info...
Building founders...
Splitting sample map...
Running Simulation...
Training...
Reading data...
Building model...
Training base models...
100%|████████████████████████████████████████| 1431/1431 [05:11<00:00, 4.60it/s]Training smoother...

[12:03:06] WARNING: /workspace/src/learner.cc:480:
Parameters: { use_label_encoder } might not be used.

This may not be accurate due to some parameters are only used in language bindings but
passed down to XGBoost core. Or some parameters are not used but slip through this
verification. Please open an issue if you find above cases.

Evaluating model...
Re-training base models...
100%|████████████████████████████████████████| 1431/1431 [09:33<00:00, 2.49it/s]
/usr/local/lib/python3.8/dist-packages/allel/io/vcf_read.py:1732: UserWarning: invalid INFO header: '##INFO=<ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">\n'
warnings.warn('invalid INFO header: %r' % header)
Analyzing model performance...
Estimated train accuracy: 99.34%
Estimated val accuracy: 98.48%
Model, info and analysis saved at /nas/osotolongo/images/testing0/phased_chr1/models/model_chm_chr1

Launching inference...
Loading and processing query file...

Number of SNPs from model: 5759060
Number of SNPs from file: 5759060
Number of intersecting SNPs: 5439307
Percentage of model SNPs covered by query file: 94.45%
Traceback (most recent call last):
File "gnomix.py", line 409, in <module>
run_inference(base_args, model,
File "gnomix.py", line 49, in run_inference
X_query, vcf_idx, fmt_idx = vcf_to_npy(query_vcf_data, model.snp_pos, model.snp_ref, return_idx=True, verbose=verbose)
File "/home/gnomix/src/utils.py", line 132, in vcf_to_npy
fill = np.full((n_ind*2, len(snp_pos_fmt)), miss_fill)
File "/usr/local/lib/python3.8/dist-packages/numpy/core/numeric.py", line 342, in full
a = empty(shape, dtype, order)
numpy.core._exceptions.MemoryError: Unable to allocate 275. GiB for an array with shape (6404, 5759060) and data type int64
########################################################################
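For reference, the size in the MemoryError follows directly from the array shape in the traceback. The short sketch below only redoes that arithmetic; it assumes the 6404 rows are `n_ind*2` haplotypes of the 3202 samples in the query VCF, and the 1-byte variant is a hypothetical comparison, not something Gnomix is known to support.

```python
# Reproduce the allocation size reported by NumPy in the traceback.
# Shape and dtype are copied from the error message; the helper name
# array_gib is ours and purely illustrative.

BYTES_PER_ELEMENT = {"int64": 8, "int8": 1}  # int8 is a hypothetical comparison

def array_gib(rows, cols, dtype):
    """Return the size in GiB of a dense rows x cols array of the given dtype."""
    return rows * cols * BYTES_PER_ELEMENT[dtype] / 2**30

n_haplotypes = 6404      # n_ind * 2, i.e. 3202 diploid samples (assumption)
n_snps = 5_759_060       # "Number of SNPs from model" in the log

size_int64 = array_gib(n_haplotypes, n_snps, "int64")
size_int8 = array_gib(n_haplotypes, n_snps, "int8")

print(f"int64: {size_int64:.1f} GiB")
print(f"int8:  {size_int8:.1f} GiB")
```

The int64 figure matches the roughly 275 GiB in the error, so the failure occurs while allocating the query matrix for all samples in the query file, after the model had already been saved.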

Now, despite the final error, the pretrained model seems to be in place,

$ ls phased_chr1/models/model_chm_chr1/
analysis config.txt model_chm_chr1.pkl

as stated in the line "Model, info and analysis saved at ....". However, since we are not sure what happens afterwards, we don't know whether this pretrained model is usable. Can you tell us what the program is doing when it fails? Is the model usable as it stands?
