How can I find out which genotype values are missing and drop them before the computation? #303

Closed
garyzhubc opened this issue Jan 27, 2022 · 6 comments

garyzhubc commented Jan 27, 2022

How can I find out which values are missing and drop them before the computation?

> obj.svd <- big_SVD(G, fun.scaling = as_scaling_fun(control_means, case_std))
Error in do.call(.Call, args = dot_call_args) : 
  Evaluation error: You can't have missing values in 'X'..
> obj.svd <- big_randomSVD(G, fun.scaling = as_scaling_fun(control_means, case_std))
Error in svds_real_gen(A, k, nu, nv, opts, mattype = "function", extra_args = list(Atrans = Atrans,  : 
  TridiagEigen: eigen decomposition failed
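
(Side note, not from the original thread: a minimal sketch of how one might locate missing calls, assuming G is the FBM.code256 genotype matrix from snp_attach() with the usual 0/1/2/NA coding.)

# Sketch: count genotype calls per variant; the last row of the result counts
# missing calls, so variants that still contain NAs can be identified.
library(bigstatsr)
counts <- big_counts(G)                 # counts of 0, 1, 2 and NA per column
n_missing <- counts[nrow(counts), ]     # missing calls per variant (NA row)
cols_with_na <- which(n_missing > 0)    # variants with at least one missing call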

I tried removing missing genotype data with the --mind and --geno filters in plink2, but the error above still appears.

[zhupy@gra-login1 zhupy]$ cat chr_geno_mind.log
PLINK v2.00a3LM 64-bit Intel (11 Oct 2021)
Options in effect:
  --export bgen-1.2 bits=8
  --geno
  --maf
  --mind
  --out chr_geno_mind
  --pfile chr

Hostname: gra798
Working directory: /project/6046455/zhupy
Start time: Wed Jan 26 19:59:10 2022

Random number seed: 1643245150
128539 MiB RAM detected; reserving 64269 MiB for main workspace.
Using up to 32 threads (change this with --threads).
632 samples (0 females, 0 males, 632 ambiguous; 632 founders) loaded from
chr.psam.
40359612 variants loaded from chr.pvar.
Note: No phenotype data present.
Calculating sample missingness rates... done.
0 samples removed due to missing genotype data (--mind).
632 samples (0 females, 0 males, 632 ambiguous; 632 founders) remaining after
main filters.
Calculating allele frequencies... done.
--geno: 0 variants removed due to missing genotype data.
32137613 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
8221999 variants remaining after main filters.
Warning: Unphased heterozygous hardcalls in partially-phased variants are
poorly represented with bits=8.
It is necessary to use e.g. --dosage-erase-threshold 0.006 to re-import them
cleanly.
Writing chr_geno_mind.bgen ... done.
Writing chr_geno_mind.sample ... done.

End time: Wed Jan 26 19:59:36 2022

The function as_scaling_fun is defined as

# Build a scaling function (for use as fun.scaling in big_SVD/big_randomSVD)
# from precomputed per-column centers and scales.
as_scaling_fun <- function(center.col, scale.col, ind.col = seq_along(center.col)) {
  
  bigassertr::assert_lengths(center.col, scale.col, ind.col)
  
  # Store the centers/scales and their column indices in a clean environment
  e <- new.env(parent = baseenv())
  assign("_DF_", data.frame(center = center.col, scale = scale.col), envir = e)
  assign("_IND_COL_", ind.col, envir = e)
  
  # The returned function looks up the rows corresponding to the requested
  # columns (ind.col) and returns their center/scale values.
  f <- function(X, ind.row, ind.col) {
    ind <- match(ind.col, `_IND_COL_`)
    `_DF_`[ind, ]
  }
  environment(f) <- e
  
  f
}

as mentioned in another issue #301. I've decided to open a new issue here because it looks like a different question.

privefl commented Jan 27, 2022

Hi, you still have three other issues open here.
Please take care of them (i.e. follow up or close them) before opening new ones.

garyzhubc commented Jan 27, 2022

Sounds good. I closed two of them and will get back to the remaining one later on (I'm taking a different path now, which I think is related to this question).

privefl commented Jan 27, 2022

To handle missing values, have a look at this pinned issue.
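
(For reference, a hedged sketch of one common way to deal with remaining missing values in this package, assuming that is what the pinned issue refers to and that G is the FBM.code256 genotype matrix; not part of the original reply.)

# Sketch: fill any remaining missing calls before running the SVD.
# snp_fastImputeSimple() returns a new FBM.code256 without missing values.
library(bigsnpr)
G_imp <- snp_fastImputeSimple(G, method = "mean2")  # impute with the rounded column mean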

But isn't it imputed data that you have here? Why do you still have missing values in there?

garyzhubc commented Jan 27, 2022

Yes, it's imputed, so imputation shouldn't be the problem. It looks like as_scaling_fun changes things, because big_SVD works fine when I don't apply it first. Perhaps it is because some variants have zero standard deviation: there are 132 columns whose standard deviation is zero. Is it possible to drop these columns after normalizing?
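
(A minimal sketch of that idea, assuming case_std is the per-variant scale vector passed to as_scaling_fun(); a scale of 0 makes the scaled column non-finite, which would explain the missing-value error. Names are taken from the commands above.)

# Sketch: indices of the variants whose precomputed standard deviation is zero,
# and the complementary set that is safe to scale.
zero_sd <- which(case_std == 0)   # should have length 132 here
keep    <- which(case_std > 0)    # variants to retain for the SVD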

garyzhubc commented Jan 27, 2022

I think I can get the indices and use plink2 to filter out these SNPs, but I'm wondering whether there is also a way to handle this within this package. It might also be helpful to check for this in the as_scaling_fun function.

privefl commented Jan 27, 2022

You can use the ind.col parameter in the PCA functions to only use a subset of variants.
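
(A minimal sketch of this suggestion, reusing the hypothetical keep vector from the comment above: restrict the SVD to the variants with a non-zero scale via ind.col, while as_scaling_fun() still receives the full-length center/scale vectors.)

# Sketch: run the randomized PCA only on the columns with a non-zero scale.
keep <- which(case_std > 0)
obj.svd <- big_randomSVD(G,
                         fun.scaling = as_scaling_fun(control_means, case_std),
                         ind.col = keep)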
