Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in step #3 of projecting PC scores on new data #496

Open
ozandikilitas opened this issue Apr 30, 2024 · 16 comments
Open

Error in step #3 of projecting PC scores on new data #496

ozandikilitas opened this issue Apr 30, 2024 · 16 comments

Comments

@ozandikilitas
Copy link

ozandikilitas commented Apr 30, 2024

Hello,
I receive the following error upon trying to project a new dataset onto 1KG reference panel.

> obj.bed.new
A 'bed' object with 12096 samples and 128122270 variants.
> test_pca <- bed_projectPCA(
+   obj.bed,
+   obj.bed.new,
+   k = 20,
+   ind.row.new = rows_along(obj.bed.new),
+   ind.row.ref = ind.row,
+   ind.col.ref = ind.col,
+   strand_flip = TRUE,
+   join_by_pos = TRUE,
+   match.min.prop = 0.5,
+   verbose = TRUE,
+   ncores = nb_cores()
+ ) 

[Step 1/3] Matching variants of reference with target data..
1,233,111 variants to be matched.
6 ambiguous SNPs have been removed.
1,228,268 variants have been matched; 0 were flipped and 317,288 were reversed.

[Step 2/3] Computing (auto) SVD of reference..

Phase of clumping (on MAC) at r^2 > 0.2.. keep 302930 variants.
Discarding 0 variant with MAC < 10.

Iteration 1:
Computing SVD..
0 outlier variant detected..

Converged!

[Step 3/3] Projecting PC scores on new data..
Error in { : task 1 failed - "'...' used in an incorrect context"

Have you encountered this issue before (or the specific error I am receiving at the end)? Thank you for your help!

@ozandikilitas
Copy link
Author

Here's the line by line debugging, problem seems to stem from big_parallelize(). What would you like me to provide next?

 aric_test_pca <- bed_projectPCA(
+   obj.bed,
+   obj.bed.aric,
+   k = 20,
+   ind.row.new = rows_along(obj.bed.aric),
+ #  ind.row.ref = ind.row,
+ #  ind.col.ref = ind.col,
+   strand_flip = TRUE,
+   join_by_pos = TRUE,
+ #  match.min.prop = 0.5,
+   verbose = TRUE,
+   ncores = nb_cores()
+ ) 
debugging in: bed_projectPCA(obj.bed, obj.bed.aric, k = 20, ind.row.new = rows_along(obj.bed.aric), 
    strand_flip = TRUE, join_by_pos = TRUE, verbose = TRUE, ncores = nb_cores())
debug: {
    printf2 <- function(...) if (verbose) 
        printf(...)
    printf2("\n[Step 1/3] Matching variants of reference with target data..\n")
    map.ref <- setNames(obj.bed.ref$map[-3], c("chr", "rsid", 
        "pos", "a1", "a0"))
    map.new <- setNames(obj.bed.new$map[-3], c("chr", "rsid", 
        "pos", "a1", "a0"))
    if (join_by_pos && (build.new != build.ref)) {
        if (is.null(liftOver)) {
            stop2("Please provide path to liftOver executable.")
        }
        else if (substr(liftOver, 1, 1) != ".") {
            liftOver <- file.path(".", liftOver)
        }
        map.new <- snp_modifyBuild(map.new, liftOver = liftOver, 
            from = build.new, to = build.ref)
    }
    info_snp <- snp_match(cbind(map.ref, beta = 1), map.new, 
        strand_flip = strand_flip, join_by_pos = join_by_pos, 
        match.min.prop = match.min.prop)
    printf2("\n[Step 2/3] Computing (auto) SVD of reference..\n")
    obj.svd <- bed_autoSVD(obj.bed.ref, ind.row = ind.row.ref, 
        ind.col = intersect(ind.col.ref, info_snp$`_NUM_ID_.ss`), 
        k = k, ncores = ncores, verbose = verbose, ...)
    printf2("\n[Step 3/3] Projecting PC scores on new data..\n")
    keep <- match(attr(obj.svd, "subset"), info_snp$`_NUM_ID_.ss`)
    X_norm <- FBM(length(ind.row.new), 1, init = 0)
    XV <- FBM(length(ind.row.new), k, init = 0)
    big_parallelize(obj.bed.new$light, p.FUN = part_prod, ind = seq_along(keep), 
        ncores = ncores, ind.row = ind.row.new, ind.col = info_snp$`_NUM_ID_`[keep], 
        center = (obj.svd$center - 1) * info_snp$beta[keep] + 
            1, scale = obj.svd$scale * info_snp$beta[keep], V = obj.svd$v, 
        XV = XV, X_norm = X_norm)
    list(obj.svd.ref = obj.svd, simple_proj = XV[, , drop = FALSE], 
        OADP_proj = OADP_proj(XV, X_norm, obj.svd$d, ncores = ncores))
}
Browse[2]> n
debug: printf2 <- function(...) if (verbose) printf(...)
Browse[2]> n
debug: printf2("\n[Step 1/3] Matching variants of reference with target data..\n")
Browse[2]> n

[Step 1/3] Matching variants of reference with target data..
debug: map.ref <- setNames(obj.bed.ref$map[-3], c("chr", "rsid", "pos", 
    "a1", "a0"))
Browse[2]> n
debug: map.new <- setNames(obj.bed.new$map[-3], c("chr", "rsid", "pos", 
    "a1", "a0"))
Browse[2]> n
debug: if (join_by_pos && (build.new != build.ref)) {
    if (is.null(liftOver)) {
        stop2("Please provide path to liftOver executable.")
    }
    else if (substr(liftOver, 1, 1) != ".") {
        liftOver <- file.path(".", liftOver)
    }
    map.new <- snp_modifyBuild(map.new, liftOver = liftOver, 
        from = build.new, to = build.ref)
}
Browse[2]> n
debug: info_snp <- snp_match(cbind(map.ref, beta = 1), map.new, strand_flip = strand_flip, 
    join_by_pos = join_by_pos, match.min.prop = match.min.prop)
Browse[2]> n
1,233,111 variants to be matched.
6 ambiguous SNPs have been removed.
1,228,268 variants have been matched; 0 were flipped and 317,288 were reversed.
debug: printf2("\n[Step 2/3] Computing (auto) SVD of reference..\n")
Browse[2]> n

[Step 2/3] Computing (auto) SVD of reference..
debug: obj.svd <- bed_autoSVD(obj.bed.ref, ind.row = ind.row.ref, ind.col = intersect(ind.col.ref, 
    info_snp$`_NUM_ID_.ss`), k = k, ncores = ncores, verbose = verbose, 
    ...)
Browse[2]> n

Phase of clumping (on MAC) at r^2 > 0.2.. keep 304498 variants.
Discarding 0 variant with MAC < 10.

Iteration 1:
Computing SVD..
364 outlier variants detected..
5 long-range LD regions detected..

Iteration 2:
Computing SVD..
0 outlier variant detected..

Converged!
debug: printf2("\n[Step 3/3] Projecting PC scores on new data..\n")
Browse[2]> n

[Step 3/3] Projecting PC scores on new data..
debug: keep <- match(attr(obj.svd, "subset"), info_snp$`_NUM_ID_.ss`)
Browse[2]> n
debug: X_norm <- FBM(length(ind.row.new), 1, init = 0)
Browse[2]> n
debug: XV <- FBM(length(ind.row.new), k, init = 0)
Browse[2]> n
debug: big_parallelize(obj.bed.new$light, p.FUN = part_prod, ind = seq_along(keep), 
    ncores = ncores, ind.row = ind.row.new, ind.col = info_snp$`_NUM_ID_`[keep], 
    center = (obj.svd$center - 1) * info_snp$beta[keep] + 1, 
    scale = obj.svd$scale * info_snp$beta[keep], V = obj.svd$v, 
    XV = XV, X_norm = X_norm)
Browse[2]> n
Error in { : task 1 failed - "'...' used in an incorrect context"
> traceback()
6: stop(simpleError(msg, call = expr))
5: e$fun(obj, substitute(ex), parent.frame(), e$data)
4: foreach(ic = rows_along(intervals)) %dopar% {
       ind.part <- ind[seq(intervals[ic, "lower"], intervals[ic, 
           "upper"])]
       FUN(ind = ind.part, ...)
   }
3: bigparallelr::split_parapply(p.FUN, ind, X, ..., .combine = p.combine, 
       ncores = ncores, opts_cluster = list(type = getOption("bigstatsr.cluster.type")))
2: big_parallelize(obj.bed.new$light, p.FUN = part_prod, ind = seq_along(keep), 
       ncores = ncores, ind.row = ind.row.new, ind.col = info_snp$`_NUM_ID_`[keep], 
       center = (obj.svd$center - 1) * info_snp$beta[keep] + 1, 
       scale = obj.svd$scale * info_snp$beta[keep], V = obj.svd$v, 
       XV = XV, X_norm = X_norm)
1: bed_projectPCA(obj.bed, obj.bed.aric, k = 20, ind.row.new = rows_along(obj.bed.aric), 
       strand_flip = TRUE, join_by_pos = TRUE, verbose = TRUE, ncores = nb_cores())
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.8 (Ootpa)

Matrix products: default
BLAS:   /usr/lib64/libblas.so.3.8.0
LAPACK: /usr/lib64/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C           
 [4] LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C       
 [7] LC_PAPER=C           LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bigsnpr_1.12.2    bigstatsr_1.5.12  patchwork_1.2.0   ggplot2_3.5.0    
 [5] flashpcaR_2.1     Hmisc_5.1-2       tidyr_1.3.1       tibble_3.2.1     
 [9] data.table_1.15.4 dplyr_1.1.4      

loaded via a namespace (and not attached):
 [1] runonce_0.2.3      foreach_1.5.2      carData_3.0-5      Formula_1.2-5     
 [5] doRNG_1.8.6        robustbase_0.99-2  pillar_1.9.0       backports_1.4.1   
 [9] lattice_0.20-45    glue_1.7.0         digest_0.6.35      ggsignif_0.6.4    
[13] checkmate_2.3.1    colorspace_2.1-1   cowplot_1.1.3      htmltools_0.5.8.1 
[17] Matrix_1.6-5       pkgconfig_2.0.3    broom_1.0.5        bigparallelr_0.3.2
[21] purrr_1.0.2        rmio_0.4.0         scales_1.3.0       RSpectra_0.16-1   
[25] ff_4.0.12          htmlTable_2.4.2    generics_0.1.3     car_3.1-2         
[29] ggpubr_0.6.0       withr_3.0.0        nnet_7.3-18        cli_3.6.2         
[33] magrittr_2.0.3     evaluate_0.23      ps_1.7.6           bigassertr_0.1.6  
[37] fansi_1.0.6        parallelly_1.37.1  doParallel_1.0.17  rstatix_0.7.2     
[41] foreign_0.8-86     tools_4.2.2        bigutilsr_0.3.4    lifecycle_1.0.4   
[45] stringr_1.5.1      munsell_0.5.1      bigsparser_0.6.1   cluster_2.1.4     
[49] rngtools_1.5.2     compiler_4.2.2     rlang_1.1.3        grid_4.2.2        
[53] iterators_1.0.14   rstudioapi_0.16.0  htmlwidgets_1.6.4  base64enc_0.1-3   
[57] rmarkdown_2.26     gtable_0.3.4       codetools_0.2-18   flock_0.7         
[61] abind_1.4-5        R6_2.5.1           gridExtra_2.3      knitr_1.46        
[65] bit_4.0.5          fastmap_1.1.1      utf8_1.2.4         bigreadr_0.2.5    
[69] stringi_1.8.3      parallel_4.2.2     Rcpp_1.0.12        vctrs_0.6.5       
[73] rpart_4.1.19       DEoptimR_1.1-3     tidyselect_1.2.1   xfun_0.43 

@privefl
Copy link
Owner

privefl commented May 2, 2024

Thanks for reporting.

Does this work for you?

library(bigsnpr)

bedfile <- download_1000G("tmp-data")
obj.bed.aric <- obj.bed <- bed(bedfile)

bed_projectPCA(obj.bed, obj.bed.aric, k = 20, ind.row.new = rows_along(obj.bed.aric), 
               strand_flip = TRUE, join_by_pos = TRUE, verbose = TRUE, ncores = nb_cores())

@ozandikilitas
Copy link
Author

ozandikilitas commented May 2, 2024

It does! With the following object as a result:

str(test)

List of 3
 $ obj.svd.ref:List of 7
  ..$ d     : num [1:20] 8238 5427 2966 2589 1288 ...
  ..$ u     : num [1:2490, 1:20] -0.013 -0.0129 -0.013 -0.0128 -0.0129 ...
  ..$ v     : num [1:378519, 1:20] 2.57e-03 -2.84e-05 2.64e-03 -1.46e-03 -1.19e-03 ...
  ..$ niter : num 7
  ..$ nops  : num 234
  ..$ center: num [1:378519] 0.0691 0.0538 0.0703 0.3827 0.1522 ...
  ..$ scale : num [1:378519] 0.258 0.229 0.26 0.556 0.375 ...
  ..- attr(*, "class")= chr "big_SVD"
  ..- attr(*, "subset")= int [1:378519] 1 3 8 9 10 12 14 16 18 19 ...
  ..- attr(*, "lrldr")='data.frame':	5 obs. of  3 variables:
  .. ..$ Chr  : num [1:5] 6 6 6 15 17
  .. ..$ Start: num [1:5] 30468791 33260215 33684313 25677831 1863804
  .. ..$ Stop : num [1:5] 32800427 33512731 36288421 25823561 2359952
 $ simple_proj: num [1:2490, 1:20] -107 -106 -107 -105 -106 ...
 $ OADP_proj  : num [1:2490, 1:20] -107 -107 -108 -106 -107 ...

Any suggestions on next steps?

@privefl
Copy link
Owner

privefl commented May 3, 2024

What do you get for class(obj.bed) and class(obj.bed.aric) with your data?

@ozandikilitas
Copy link
Author

Here it is
> class(obj.bed) [1] "bed" attr(,"package") [1] "bigsnpr"
> class(obj.bed.aric) [1] "bed" attr(,"package") [1] "bigsnpr"

@ozandikilitas
Copy link
Author

FYI, when I also do the following with my obj.bed.aric (after following QC, removing related, restrict to HapMap3 etc.), it works with no issues. Is the issue something to do with subsetting datasets to the common SNPs between the reference and target?

bed_projectPCA(obj.bed.aric, obj.bed.aric, k = 20, ind.row.new = rows_along(obj.bed.aric), strand_flip = TRUE, join_by_pos = TRUE, verbose = TRUE, ncores = nb_cores())

@privefl
Copy link
Owner

privefl commented May 6, 2024

You're using obj.bed.aric twice in the latest code you presented; is a test or just a typo?
Is it working when you do so?

I'm not sure I understood everything you tried to do (QC, etc). Could you try only one of those at once to see which one solves the problem? We could then have a better guess on finding what's going on (I currently have no idea).

@ozandikilitas
Copy link
Author

No, it was just a test. I still receive the original error if I use 1000G as the reference and ARIC as the target dataset. QC stuff I mentioned was just going through what you have done on 1000G on your PCA vignette, nothing more than that. In any case, I still receive the error when the reference and target datasets are different.

@privefl
Copy link
Owner

privefl commented May 6, 2024

But you don't when using only ARIC?
And you don't when using some QC?

@ozandikilitas
Copy link
Author

Yes, if I use either 1000G OR ARIC for both reference and target dataset within bed_projectPCA() (i.e., bed_projectPCA(obj.bed.aric, obj.bed.aric, ...) OR bed_projectPCA(obj.bed.1KG, obj.bed.1KG, ...), the function executes without any errors. However, if I put 1000G as the reference and ARIC as the target dataset (i.e., bed_projectPCA(obj.bed.1KG, obj.bed.ARIC, ...)), I get the error I posted in my first message. Forget about the QC stuff I mentioned, that has not impact on the error message whatsoever, I just unnecessarily confused you with that.

@privefl
Copy link
Owner

privefl commented May 6, 2024

Would you be able to share the bim file (infos on the variants) from the ARIC dataset with me?

@ozandikilitas
Copy link
Author

After you requested the bim file, I figured out the problem. Reference dataset bim file had variant names in the format chr.pos whereas target dataset (i.e., ARIC) had variant names in the format chr:pos:ref:alt. Even though I had join_by_pos=TRUE in bed_projectPCA, I guess having different variant names created an issue. Once I made the variant names in the exact same format across both datasets, bed_projectPCA runs without any errors. Thank you for all your help! Happy to provide more information if you need it!

@privefl
Copy link
Owner

privefl commented May 8, 2024

Variant names should not matter.
For example, this works:

library(bigsnpr)

bedfile <- download_1000G("tmp-data")
obj.bed <- bed(bedfile)

bigsnp <- snp_attach(snp_readBed2(bedfile, tempfile(), ind.col = 1:1000))
bigsnp$map$marker.ID <- 
  with(bigsnp$map, paste(chromosome, physical.pos, allele1, allele2, sep = ":"))
obj.bed.aric <- bed(snp_writeBed(bigsnp, tempfile(fileext = ".bed")))

bed_projectPCA(obj.bed, obj.bed.aric, k = 20, ind.row.new = rows_along(obj.bed.aric), 
               strand_flip = TRUE, join_by_pos = TRUE, verbose = TRUE, ncores = nb_cores())

@ozandikilitas
Copy link
Author

Interesting, that is the only thing I changed. I will look into it a bit more to see whether I am missing something.

@privefl
Copy link
Owner

privefl commented May 10, 2024

If we can understand what was going on, that would be great.

@privefl
Copy link
Owner

privefl commented Aug 2, 2024

Any update on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants