Handling missing values in SNP matrix #49

Closed
naglemi opened this issue Aug 30, 2019 · 13 comments

naglemi commented Aug 30, 2019

Thanks for making this tool available. I wasn't sure whether to mark this as a question or issue because it's unclear to me if there is existing support for PCA and GWAS with SNP matrices that contain NAs. I am accustomed to setting thresholds in PLINK and GEMMA for which all SNPs missing in more than X genotypes are excluded, but haven't found a similar feature for bigsnpr in documentation or the demo.

I'm trying to run my data, which I've formatted the same way as the demo data, but there are NAs in my SNP set (from QC filtering, indels, etc.), and this leads to the error below.

Do I need to format the NAs a particular way, or do anything else to build models that include SNPs not found in all genotypes?

Error: You can't have missing values in 'X'.
Traceback:

1. big_univLogReg(G, y01.train = y01, covar.train = covariates, ncores = NCORES)
2. check_args()
3. with(args, eval(parse(text = check[i])))
4. with.default(args, eval(parse(text = check[i])))
5. eval(substitute(expr), data, enclos = parent.frame())
6. eval(substitute(expr), data, enclos = parent.frame())
7. eval(parse(text = check[i]))
8. eval(parse(text = check[i]))
9. assert_noNA(X)
10. stop2("You can't have missing values in '%s'.", deparse(substitute(x)))
11. stop(sprintf(...), call. = FALSE)

Owner

privefl commented Aug 30, 2019

Yes, most of the functions in the package don't handle missing values.
Note that I will soon add a PCA algorithm that handles missing values (in the @bedpca branch).
This is not planned for GWAS, since I make extensive use of linear algebra, which does not mix well with missing values.

If you want to use {bigsnpr}, you should therefore either filter out variants with missing values or impute them.
To impute, you can use snp_fastImpute(), which should take a few hours for, say, 10K samples and 500K variants.
Otherwise, you could impute with the mean (rounded), the mode, randomly according to genotype frequencies, or with 0. Although these are not very hard to code, I should probably make them available as functions in {bigsnpr}.
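
For illustration, here is a minimal base R sketch of the filtering and simple imputation options mentioned above, applied to a small in-memory matrix rather than to a {bigsnpr} object. The toy matrix, the `max_missing` threshold, and the helper `impute_column()` are hypothetical names introduced for this example; they are not part of {bigsnpr}.

```r
# Toy genotype matrix: rows = samples, columns = SNPs, coded 0/1/2 with NAs.
G <- matrix(c(0, 1, NA, 2, 1,
              2, 2, 2, NA, 2,
              NA, NA, NA, 0, 1,
              0, 0, 1, 1, NA), nrow = 5)

# 1) Filter: drop SNPs whose missing rate exceeds a (hypothetical) threshold.
max_missing <- 0.5
miss_rate <- colMeans(is.na(G))
G_filt <- G[, miss_rate <= max_missing, drop = FALSE]

# 2) Impute the remaining NAs, one SNP (column) at a time.
#    Assumes each column has at least one observed genotype.
impute_column <- function(x, method = c("mean_rounded", "mode", "random", "zero")) {
  method <- match.arg(method)
  miss <- is.na(x)
  if (!any(miss)) return(x)
  obs <- x[!miss]
  x[miss] <- switch(method,
    mean_rounded = round(mean(obs)),
    # Most frequent observed genotype (ties go to the smallest value).
    mode = as.numeric(names(which.max(table(obs)))),
    # Draw from the observed genotypes, i.e. according to their frequencies.
    random = obs[sample.int(length(obs), sum(miss), replace = TRUE)],
    zero = 0)
  x
}

G_mode <- apply(G_filt, 2, impute_column, method = "mode")
G_zero <- apply(G_filt, 2, impute_column, method = "zero")
```

This only works for data that fit in memory; for file-backed genotype matrices, the package's own imputation functions are the better choice.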

Owner

privefl commented Sep 1, 2019

I've just added snp_fastImputeSimple() to do a very fast imputation.
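
A short usage sketch, assuming a recent version of {bigsnpr} in which `snp_fastImputeSimple()` accepts a `method` argument and the package ships the example file `example-missing.bed`. The argument values available may differ between versions, so check `?snp_fastImputeSimple` for the one you have installed.

```r
# Runs only if {bigsnpr} is installed.
if (requireNamespace("bigsnpr", quietly = TRUE)) {
  library(bigsnpr)

  # Example dataset with missing genotypes, copied to a temporary directory.
  bigsnp <- snp_attachExtdata("example-missing.bed")
  G <- bigsnp$genotypes

  # Replace each missing genotype with the most frequent genotype of its SNP.
  G_imp <- snp_fastImputeSimple(G, method = "mode")

  # The example is small enough to read fully into memory for a check.
  any_missing_after <- anyNA(G_imp[])
}
```

The returned object `G_imp` can then be passed to functions such as big_univLogReg() in place of `G`.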

Owner

privefl commented Sep 7, 2019

Does this answer your questions?

privefl pinned this issue Sep 7, 2019

Author

naglemi commented Sep 9, 2019

Thanks for clarifying. Is imputation mathematically equivalent to excluding genotypes missing a given SNP when building the model for that SNP? Should it produce the same test statistics? I think most of the other GWAS methods we use (particularly GEMMA) exclude those genotypes instead of imputing.

Owner

privefl commented Sep 9, 2019

No, I don't think that this is equivalent.
If the imputation is good, then I think you get more power.
Also, the best thing to do might be multiple imputation, but it requires more computation.
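
To make the non-equivalence concrete, here is a small simulated sketch in base R that tests a single SNP twice: once excluding samples with a missing genotype, and once after imputing with the rounded mean. It uses a linear model on simulated data for simplicity (the original question used logistic regression), and all names and parameter values are assumptions made for this example. It shows only that the two test statistics differ, not which approach is preferable.

```r
set.seed(1)
n <- 200
x <- rbinom(n, size = 2, prob = 0.3)   # simulated genotypes at one SNP (0/1/2)
y <- 0.3 * x + rnorm(n)                # simulated quantitative phenotype

x_na <- x
x_na[sample.int(n, 40)] <- NA          # make 20% of the genotypes missing

# (a) Exclusion: lm() drops samples with a missing genotype by default.
fit_cc <- lm(y ~ x_na)

# (b) Imputation: fill missing genotypes with the rounded observed mean.
x_imp <- x_na
x_imp[is.na(x_imp)] <- round(mean(x_na, na.rm = TRUE))
fit_imp <- lm(y ~ x_imp)

t_cc  <- coef(summary(fit_cc))["x_na", "t value"]
t_imp <- coef(summary(fit_imp))["x_imp", "t value"]
c(exclusion = t_cc, imputation = t_imp)
```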


ghost commented Jan 28, 2020

Hi Florian,
Thanks for making the very fast imputation function. Would it be possible to add an "impute by 0" option as well?

Owner

privefl commented Jan 28, 2020

Yes, I guess it would be easy.
I just feel it would be better to impute with any of the other methods.
What do you think?


ghost commented Jan 29, 2020

I just think this is more conservative.

Owner

privefl commented Jan 29, 2020

This should now be implemented in the latest version, using method = "zero".


ghost commented Jan 30, 2020 via email

Owner

privefl commented Feb 22, 2020

@wongck-kevin I just remembered why I didn't implement this before.

In fact, you can do G2 <- G$copy(code = c(0, 1, 2, 0, rep(NA, 252))), and you will access all NAs as 0s when using G2 instead of G.
So, by using this method, you could impute by 1 or 2 as well.
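
The idea behind the `code` argument can be sketched in base R as a 256-entry lookup table: each stored byte is decoded by indexing into the table. This is a simplified model for illustration, assuming genotypes are stored as bytes 0, 1, 2 with byte 3 marking a missing value; `stored` and `decode()` are hypothetical names, not {bigsnpr} internals.

```r
# Raw bytes as stored (assumed: 0, 1, 2 = genotype calls, 3 = missing).
stored <- as.raw(c(0, 1, 3, 2, 3))

# Decode bytes through a 256-entry lookup table (byte b maps to code[b + 1]).
decode <- function(bytes, code) code[as.integer(bytes) + 1L]

code_default <- c(0, 1, 2, rep(NA, 253))     # missing byte decodes to NA
code_zero    <- c(0, 1, 2, 0, rep(NA, 252))  # missing byte decodes to 0
code_one     <- c(0, 1, 2, 1, rep(NA, 252))  # missing byte decodes to 1

decode(stored, code_default)
decode(stored, code_zero)
decode(stored, code_one)
```

Because only the lookup table changes, the stored data are left untouched and the "imputation" happens when values are read.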


ghost commented Mar 4, 2020

Thanks for the update Florian.

Owner

privefl commented Mar 9, 2020

Note that I will soon deprecate 'method = "zero"' in favor of using "$copy()".
This will give a warning only.
