Merge pull request #18 from lcpilling/v0.2.8
v0.2.8
Showing 13 changed files with 355 additions and 130 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -48,3 +48,4 @@ po/*~ | |
# RStudio Connect folder | ||
rsconnect/ | ||
docs | ||
inst/doc |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
*.html | ||
*.R |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,139 @@ | ||
--- | ||
title: "Ascertain diagnoses" | ||
description: > | ||
Ascertain UK Biobank participant diagnoses from all sources (medical records and self-report data). | ||
output: rmarkdown::html_vignette | ||
vignette: > | ||
%\VignetteIndexEntry{Ascertain diagnoses} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
```{r, include = FALSE} | ||
knitr::opts_chunk$set( | ||
collapse = TRUE, | ||
comment = "#>" | ||
) | ||
library(ukbrapR) | ||
``` | ||
|
||
Diagnoses of conditions in UK Biobank participants come from multiple data sources: | ||
|
||
* Self-report during assessment | ||
|
||
* Hospital inpatient records (HES) | ||
|
||
* Primary care (GP) | ||
|
||
* Cancer registry | ||
|
||
* Cause of death | ||
|
||
The {ukbrapR} package makes it fast and easy to ascertain diagnoses from multiple UK Biobank data sources in the DNAnexus Research Analysis Platform (RAP). | ||
|
||
|
||
## Requires exported files | ||
|
||
The export only needs to happen once per project. Running `export_tables()` will submit the necessary `table-exporter` jobs to save the raw medical records files to the RAP persistent storage for the project. ~10 GB of text files are created. This will cost ~£0.15 per month to store in the RAP standard storage. | ||
|
||
Once the files are exported (~15 mins) they can be used by the functions below to extract diagnoses based on code lists. | ||
|
||
|
||
## Input | ||
|
||
Depending on the data source different coding vocabularies are required: | ||
|
||
* `ICD10` (for searching HES diagnoses, cause of death, and cancer registry) | ||
|
||
* `ICD9` (for searching older HES diagnosis data) | ||
|
||
* `Read2` and `CTV3` (for GP clinical events) | ||
|
||
* `OPCS3` and `OPCS4` (for HES operations) | ||
|
||
* `ukb_cancer` and `ukb_noncancer` (for self-reported illness at UK Biobank assessments - all instances will be searched) | ||
|
||
Ascertaining diagnoses typically takes two steps: | ||
|
||
|
||
## 1. Get medical records and self-reported illness data for provided codes | ||
|
||
For a given set of diagnostic codes get the participant medical events and self-reported data. Returns a list of 6 data frames: the subset of the long clinical files with matched codes. | ||
|
||
Codes need to be provided as a data frame with two fields: `vocab_id` and `code`. Valid code vocabularies are listed above. Other columns (such as condition and description) are ignored. | ||
|
||
```{r} | ||
# example diagnostic codes for Chronic Kidney Disease | ||
codes_df_ckd <- ukbrapR:::codes_df_ckd | ||
head(codes_df_ckd) | ||
# get diagnosis data - returns list of data frames (one per source) | ||
diagnosis_list <- get_diagnoses(codes_df_ckd) | ||
# N records for each source | ||
nrow(diagnosis_list$gp_clinical) | ||
nrow(diagnosis_list$hesin_diag) | ||
nrow(diagnosis_list$death_cause) | ||
``` | ||
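
To use your own codes rather than the bundled example, the input only needs the `vocab_id` and `code` columns described above. The following is a minimal sketch in base R: the object names (`codes_df_example`, `valid_vocabs`) are hypothetical, and the code values are illustrative placeholders that should be checked against your own code lists before use.

```r
# Illustrative codes data frame; the code values are placeholders, not a validated code list
codes_df_example <- data.frame(
  condition   = "ckd",
  vocab_id    = c("ICD10", "ICD10", "Read2"),
  code        = c("N18", "N19", "1Z12."),
  description = c("Illustrative ICD10 code", "Illustrative ICD10 code", "Illustrative Read2 code"),
  stringsAsFactors = FALSE
)

# Vocabularies listed in the Input section of this vignette
valid_vocabs <- c("ICD10", "ICD9", "Read2", "CTV3", "OPCS3", "OPCS4",
                  "ukb_cancer", "ukb_noncancer")

# Basic checks before passing the data frame on
required_cols_present <- all(c("vocab_id", "code") %in% names(codes_df_example))
unknown_vocabs <- setdiff(unique(codes_df_example$vocab_id), valid_vocabs)

required_cols_present
unknown_vocabs
```

If `unknown_vocabs` is not empty, a vocabulary name has probably been misspelled.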
|
||
If you provide primary care codes for measures (BMI etc.), these are also returned (the `gp_clinical` object in the returned list contains all columns for matched codes). | ||
|
||
|
||
## 2. Get date first diagnosed | ||
|
||
Usually the user is interested in combining the separate data sources into a combined phenotype: the date first diagnosed for each participant from the data/codes in step 1 (cause of death, HES diagnoses, GP clinical, cancer registry, HES operations, and self-reported illness fields). | ||
|
||
In addition to the "date first" `df` field, the output contains: | ||
|
||
- a `src` field indicating the source of the date of first diagnosis. | ||
- a `bin` field indicating the cases [1] and controls [0]. This relies on a small number of baseline fields also exported. The `df` field for the controls is the date of censoring (currently 30 October 2022). | ||
- a `bin_prev` field indicating whether the case was diagnosed before the UK Biobank baseline assessment (i.e., a prevalent case). | ||
|
||
```{r} | ||
# for each participant, get Date First diagnosed with the condition | ||
diagnosis_df <- get_df(diagnosis_list) | ||
names(diagnosis_df) | ||
summary(diagnosis_df) | ||
``` | ||
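
To make the meaning of the `df`, `src`, `bin` and `bin_prev` fields concrete, here is a toy sketch in base R. It is not the package's implementation: the data are invented, the object names (`events`, `baseline`, `toy_df`) are hypothetical, and coding `bin_prev` as `NA` for controls is an assumption made for this illustration only.

```r
# Invented long-format events: one row per matched record, from any source
events <- data.frame(
  eid  = c(1, 1, 2, 3),
  src  = c("gp_clinical", "hesin_diag", "death_cause", "hesin_diag"),
  date = as.Date(c("2005-03-01", "2003-06-15", "2019-11-30", "2012-01-10")),
  stringsAsFactors = FALSE
)

# Invented baseline data; participant 4 has no matched record, so is a control
baseline <- data.frame(
  eid = 1:4,
  assessment_date = as.Date(c("2008-01-01", "2009-05-05", "2010-02-02", "2007-07-07"))
)
censor_date <- as.Date("2022-10-30")

# Keep the earliest record per participant, whatever its source
events <- events[order(events$eid, events$date), ]
first_events <- events[!duplicated(events$eid), ]

# Cases get their earliest date; controls get the censoring date
toy_df <- merge(baseline, first_events, by = "eid", all.x = TRUE)
toy_df$bin <- as.integer(!is.na(toy_df$date))
toy_df$df <- toy_df$date
toy_df$df[toy_df$bin == 0] <- censor_date

# Prevalent = first diagnosis before baseline (NA for controls is an assumption here)
toy_df$bin_prev <- ifelse(toy_df$bin == 1,
                          as.integer(toy_df$df < toy_df$assessment_date),
                          NA_integer_)

toy_df[, c("eid", "src", "df", "bin", "bin_prev")]
```

Participant 1 has records in two sources, so the earlier hospital record supplies both the date and the `src` value.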
|
||
You can add a prefix to all the variable names by specifying the `prefix` argument: | ||
|
||
```{r} | ||
diagnosis_df <- get_df(diagnosis_list, prefix="ckd") | ||
# how many cases ascertained? | ||
table(diagnosis_df$ckd_bin) | ||
# source of earliest diagnosis date | ||
table(diagnosis_df$ckd_src) | ||
# date of diagnosis for prevalent cases (i.e., before UK Biobank baseline assessment) | ||
summary(diagnosis_df$ckd_df[ diagnosis_df$ckd_bin_prev == 1 ]) | ||
``` | ||
|
||
## Ascertaining multiple conditions at once | ||
|
||
The default `get_df()` behaviour is to use all available codes. However, the most time-efficient way to get multiple conditions is to run `get_diagnoses()` once for all codes for the conditions you wish to ascertain, then get the "date first diagnosed" for each condition separately. In the codes data frame you just need a field indicating the condition name, which will become the variable prefix. | ||
|
||
```{r} | ||
# combine haemochromatosis and CKD codes together | ||
# each contains three columns: condition, vocab_id, and code | ||
# where `condition` is either "hh" or "ckd" and will become the variable prefix | ||
codes_df_combined <- rbind(ukbrapR:::codes_df_hh, ukbrapR:::codes_df_ckd) | ||
# get diagnosis data - returns list of data frames (one per source) | ||
diagnosis_list <- get_diagnoses(codes_df_combined) | ||
# for each participant, get Date First diagnosed with the condition | ||
diagnosis_df <- get_df(diagnosis_list, group_by="condition") | ||
# each condition has full set of output | ||
table(diagnosis_df$hh_bin) | ||
table(diagnosis_df$ckd_bin) | ||
``` | ||
|
||
In the above example we also included a UK Biobank self-reported illness code for haemochromatosis, which was also ascertained. The Date First is run on each condition separately, so the conditions do not all need to have the same data sources. | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
--- | ||
title: "Extract fields" | ||
description: > | ||
Get participant data for specific list of fields from the cohort database. | ||
output: rmarkdown::html_vignette | ||
vignette: > | ||
%\VignetteIndexEntry{Extract fields} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
```{r, include = FALSE} | ||
knitr::opts_chunk$set( | ||
collapse = TRUE, | ||
comment = "#>" | ||
) | ||
library(ukbrapR) | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
--- | ||
title: "Extract variants" | ||
description: > | ||
Pull specific variants from whole genome sequence DRAGEN variant call files (pVCFs) into R. | ||
output: rmarkdown::html_vignette | ||
vignette: > | ||
%\VignetteIndexEntry{Extract variants} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
```{r, include = FALSE} | ||
knitr::opts_chunk$set( | ||
collapse = TRUE, | ||
comment = "#>" | ||
) | ||
library(ukbrapR) | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
--- | ||
title: "Label fields" | ||
description: > | ||
Assign categorical UK Biobank fields the labels from the showcase schema. | ||
output: rmarkdown::html_vignette | ||
vignette: > | ||
%\VignetteIndexEntry{Label fields} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
```{r, include = FALSE} | ||
knitr::opts_chunk$set( | ||
collapse = TRUE, | ||
comment = "#>" | ||
) | ||
library(ukbrapR) | ||
``` | ||
|
||
Categorical fields are exported as integers but are encoded with labels. | ||
|
||
For example [20116 "Smoking status"](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=20116): | ||
|
||
| Coding | Meaning | | ||
|--------|----------------------| | ||
| -3 | Prefer not to answer | | ||
| 0 | Never | | ||
| 1 | Previous | | ||
| 2 | Current | | ||
|
||
This package includes two functions to label a single UK Biobank field or a data frame of them using the [UK Biobank encoding schema](https://biobank.ctsu.ox.ac.uk/crystal/schema.cgi). Examples: | ||
|
||
```{r, eval=FALSE, echo=TRUE} | ||
# update the Smoking status field | ||
ukb <- label_ukb_field(ukb, field="p20116_i0") | ||
table(ukb$p20116_i0) # tabulates the values | ||
#> -3 0 1 2 | ||
#> 2057 273405 172966 52949 | ||
table(haven::as_factor(ukb$p20116_i0)) # tabulates the labels | ||
#> Prefer not to answer Never Previous Current | ||
#> 2057 273405 172966 52949 | ||
haven::print_labels(ukb$p20116_i0) # show the value:label mapping for this variable | ||
#> Labels: | ||
#> value label | ||
#> -3 Prefer not to answer | ||
#> 0 Never | ||
#> 1 Previous | ||
#> 2 Current | ||
# | ||
# if you have a whole data frame of exported fields, you can use the wrapper function label_ukb_fields() | ||
# say the `ukb` data frame contains 4 variables: `eid`, `p54_i0`, `p31` and `age_at_assessment` | ||
# update the variables that look like UK Biobank fields with titles and, where categorical, labels | ||
# i.e., `p54_i0` and `p31` only -- `eid` and `age_at_assessment` are ignored | ||
ukb <- label_ukb_fields(ukb) | ||
table(ukb$p31) # tabulates the values | ||
#> 0 1 | ||
#> 273238 229031 | ||
table(haven::as_factor(ukb$p31)) # tabulates the labels | ||
#> Female Male | ||
#> 273238 229031 | ||
``` | ||
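
The idea of mapping integer codings to their meanings can also be sketched without the package. The example below uses a base R `factor` rather than the labelled vectors shown above, so it is a different technique with a similar effect; the values in `smoking_raw` are invented, and the `coding` table is copied from the Smoking status table earlier in this vignette.

```r
# Invented raw values for field 20116 (Smoking status)
smoking_raw <- c(0, 1, 2, -3, 0, 1)

# Coding table as shown earlier in this vignette
coding <- data.frame(
  value   = c(-3, 0, 1, 2),
  meaning = c("Prefer not to answer", "Never", "Previous", "Current"),
  stringsAsFactors = FALSE
)

# Map each integer coding to its label
smoking_labelled <- factor(smoking_raw, levels = coding$value, labels = coding$meaning)

table(smoking_raw)       # tabulates the values
table(smoking_labelled)  # tabulates the labels
```

Unlike a labelled vector, a factor discards the original integer values, so keep the raw column if you need the codings later.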
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
--- | ||
title: "Spark functions" | ||
description: > | ||
Pull phenotype data from Spark environment. | ||
output: rmarkdown::html_vignette | ||
vignette: > | ||
%\VignetteIndexEntry{Spark functions} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
```{r, include = FALSE} | ||
knitr::opts_chunk$set( | ||
collapse = TRUE, | ||
comment = "#>" | ||
) | ||
library(ukbrapR) | ||
``` | ||
|
||
|
||
## Pull phenotype data from Spark environment to an R data frame | ||
|
||
**Needs to be run in an Apache Spark environment on the UK Biobank DNAnexus RAP.** | ||
|
||
It is recommended to launch a Spark cluster with at least the `mem1_hdd1_v2_x16` instance type and **2 nodes**; otherwise this can fail with the error "...ensure that workers...have sufficient resources". | ||
|
||
The underlying code is mostly from the [UK Biobank GitHub](https://github.com/UK-Biobank/UKB-RAP-Notebooks/blob/main/NBs_Prelim/105_export_participant_data_to_r.ipynb). | ||
|
||
```{r, eval=FALSE, echo=TRUE} | ||
# get phenotype data (participant ID, sex, baseline age, and baseline assessment date) | ||
ukb <- get_rap_phenos(c("eid", "p31", "p21003_i0", "p53_i0")) | ||
#> 48.02 sec elapsed | ||
# summary of data | ||
table(ukb$p31) | ||
#> Female Male | ||
#> 273297 229067 | ||
summary(ukb$p21003_i0) | ||
#> Min. 1st Qu. Median Mean 3rd Qu. Max. | ||
#> 37.00 50.00 58.00 56.53 63.00 73.00 | ||
``` | ||
|
||
### No more updates... | ||
|
||
I am moving away from using Spark as the default environment, mostly due to the cost implications: it is significantly cheaper (and quicker!) to store and search exported raw text files in the RAP persistent storage than to do everything in a Spark environment (with the added benefit that the RStudio interface is available in "normal" instances). | ||
|
||
The Spark functions are available as before but all updates are to improve functionality in "normal" instances using RStudio, as we move to the new era of RAP-only UK Biobank analysis. | ||
|
||
If you need to see the previous release documentation follow the tags to the version required: https://github.com/lcpilling/ukbrapR/tree/v0.1.7 | ||
|