How can I find out which genotype values are missing and drop them before the computation? #303

Closed
garyzhubc opened this issue Jan 27, 2022 · 6 comments

garyzhubc commented Jan 27, 2022

How can I find out which values are missing and drop them before the computation?

> obj.svd <- big_SVD(G, fun.scaling = as_scaling_fun(control_means, case_std))
Error in do.call(.Call, args = dot_call_args) : 
  Evaluation error: You can't have missing values in 'X'..
> obj.svd <- big_randomSVD(G, fun.scaling = as_scaling_fun(control_means, case_std))
Error in svds_real_gen(A, k, nu, nv, opts, mattype = "function", extra_args = list(Atrans = Atrans,  : 
  TridiagEigen: eigen decomposition failed
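
(Side note, not from the original thread: a minimal sketch of how one might locate missing calls, assuming G is the FBM.code256 genotype matrix from snp_attach() with the usual 0/1/2/NA coding.)

# Sketch: count genotype calls per variant; the last row of the result counts
# missing calls, so variants that still contain NAs can be identified.
library(bigstatsr)
counts <- big_counts(G)                 # counts of 0, 1, 2 and NA per column
n_missing <- counts[nrow(counts), ]     # missing calls per variant (NA row)
cols_with_na <- which(n_missing > 0)    # variants with at least one missing call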

I tried removing missing genotype data with the --mind and --geno filters in plink2, but the error above still appears.

[zhupy@gra-login1 zhupy]$ cat chr_geno_mind.log
PLINK v2.00a3LM 64-bit Intel (11 Oct 2021)
Options in effect:
  --export bgen-1.2 bits=8
  --geno
  --maf
  --mind
  --out chr_geno_mind
  --pfile chr

Hostname: gra798
Working directory: /project/6046455/zhupy
Start time: Wed Jan 26 19:59:10 2022

Random number seed: 1643245150
128539 MiB RAM detected; reserving 64269 MiB for main workspace.
Using up to 32 threads (change this with --threads).
632 samples (0 females, 0 males, 632 ambiguous; 632 founders) loaded from
chr.psam.
40359612 variants loaded from chr.pvar.
Note: No phenotype data present.
Calculating sample missingness rates... done.
0 samples removed due to missing genotype data (--mind).
632 samples (0 females, 0 males, 632 ambiguous; 632 founders) remaining after
main filters.
Calculating allele frequencies... done.
--geno: 0 variants removed due to missing genotype data.
32137613 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
8221999 variants remaining after main filters.
Warning: Unphased heterozygous hardcalls in partially-phased variants are
poorly represented with bits=8.
It is necessary to use e.g. --dosage-erase-threshold 0.006 to re-import them
cleanly.
Writing chr_geno_mind.bgen ... done.
Writing chr_geno_mind.sample ... done.

End time: Wed Jan 26 19:59:36 2022

The function as_scaling_fun is defined as

# Build a scaling function (for use as fun.scaling in big_SVD/big_randomSVD)
# from precomputed per-column centers and scales.
as_scaling_fun <- function(center.col, scale.col, ind.col = seq_along(center.col)) {
  
  bigassertr::assert_lengths(center.col, scale.col, ind.col)
  
  # Store the centers/scales and their column indices in a clean environment
  e <- new.env(parent = baseenv())
  assign("_DF_", data.frame(center = center.col, scale = scale.col), envir = e)
  assign("_IND_COL_", ind.col, envir = e)
  
  # The returned function looks up the rows corresponding to the requested
  # columns (ind.col) and returns their center/scale values.
  f <- function(X, ind.row, ind.col) {
    ind <- match(ind.col, `_IND_COL_`)
    `_DF_`[ind, ]
  }
  environment(f) <- e
  
  f
}

as mentioned in another issue #301. I've decided to open a new issue here because it looks like a different question.

privefl commented Jan 27, 2022

Hi, you still have three other issues open here.
Please take care of them (i.e. follow up or close them) before opening new ones.

garyzhubc commented Jan 27, 2022

Sounds good. I closed two of them and will get back to the remaining one later on (I'm taking a different path now, which I think is related to this question).

privefl commented Jan 27, 2022

To handle missing values, have a look at this pinned issue.
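
(For reference, a hedged sketch of one common way to deal with remaining missing values in this package, assuming that is what the pinned issue refers to and that G is the FBM.code256 genotype matrix; not part of the original reply.)

# Sketch: fill any remaining missing calls before running the SVD.
# snp_fastImputeSimple() returns a new FBM.code256 without missing values.
library(bigsnpr)
G_imp <- snp_fastImputeSimple(G, method = "mean2")  # impute with the rounded column mean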

But isn't it imputed data that you have here? Why do you still have missing values in there?

garyzhubc commented Jan 27, 2022

Yes, it's imputed, so imputation shouldn't be the problem. It looks like as_scaling_fun changes things, because big_SVD works fine when I don't apply it first. Perhaps it is because some variants have zero standard deviation: there are 132 columns whose standard deviation is zero. Is it possible to drop these columns after normalizing?
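
(A minimal sketch of that idea, assuming case_std is the per-variant scale vector passed to as_scaling_fun(); a scale of 0 makes the scaled column non-finite, which would explain the missing-value error. Names are taken from the commands above.)

# Sketch: indices of the variants whose precomputed standard deviation is zero,
# and the complementary set that is safe to scale.
zero_sd <- which(case_std == 0)   # should have length 132 here
keep    <- which(case_std > 0)    # variants to retain for the SVD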

garyzhubc commented Jan 27, 2022

I think I can get the indices and use plink2 to filter out these SNPs, but I'm wondering whether there is also a way to handle this within this package. It might also be helpful to check for this in the as_scaling_fun function.

privefl commented Jan 27, 2022

You can use the ind.col parameter in the PCA functions to only use a subset of variants.
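
(A minimal sketch of this suggestion, reusing the hypothetical keep vector from the comment above: restrict the SVD to the variants with a non-zero scale via ind.col, while as_scaling_fun() still receives the full-length center/scale vectors.)

# Sketch: run the randomized PCA only on the columns with a non-zero scale.
keep <- which(case_std > 0)
obj.svd <- big_randomSVD(G,
                         fun.scaling = as_scaling_fun(control_means, case_std),
                         ind.col = keep)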
