---
title: "Data Extraction Checking"
output:
  html_document: default
csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa.csl
bibliography: R-Pckgs.bib
nocite: |
  @*
---
```{r setup, include=FALSE}
## Load required packages
library(googlesheets) # Google Sheets import (gs_key, gs_read)
library(dplyr)        # data manipulation verbs
library(knitr)        # report rendering and write_bib()
library(DT)           # interactive datatable() widgets
library(readr)        # col_types specification for import
```
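The chunk above assumes all five packages are already installed. A one-time setup sketch is below; note that googlesheets has since been archived from CRAN, so one option (an assumption, not the only route) is installing it from its GitHub repository.

```{r install, eval=FALSE}
## One-time setup (sketch): install the packages loaded above
install.packages(c("dplyr", "knitr", "DT", "readr"))
## googlesheets is no longer on CRAN; one option is installing from GitHub
# install.packages("remotes")
remotes::install_github("jennybc/googlesheets")
```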
This script downloads the data extracted by coding teams during a data extraction activity and performs minimal cleaning to prepare the data for analysis.
### First, let's get the most recent version of the data from Google Sheets
```{r import, warning=FALSE}
## Import data from Google Sheets
key <- gs_key("1HOnqlfCHHo8NAQFeh8EzIIUr56fYaMsb4DHigZafNig")
data_import <- gs_read(key,
  col_types = cols(
    `Study ID` = col_character(),
    `Intervention Mean` = col_double(),
    `Intervention N` = col_integer(),
    `Intervention SD` = col_double(),
    `Control Mean` = col_double(),
    `Control N` = col_integer(),
    `Control SD` = col_double()
  )
)
```
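The googlesheets package has been superseded by googlesheets4. A minimal sketch of an equivalent import with that package is below; it assumes the sheet is readable by anyone with the link and that the columns appear in the order specified above (the col_types string is positional, and integer counts are read as doubles here).

```{r import-gs4, eval=FALSE}
## Sketch only: the same import using the successor package googlesheets4
library(googlesheets4)
gs4_deauth() # read without authenticating; assumes the sheet is link-readable
data_import <- read_sheet(
  "1HOnqlfCHHo8NAQFeh8EzIIUr56fYaMsb4DHigZafNig",
  col_types = "cdddddd" # positional: assumes Study ID first, then six numeric columns
)
```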
### Imported Data
*This shows the data as entered into Google Sheets*
`r datatable(data_import, options = list(pageLength = 5), rownames=FALSE)`
### Data for manual inspection
```{r cleaning, warning=FALSE}
## Clean data and create standard column names
data_cleaned <- data_import
names(data_cleaned) <- make.names(names(data_cleaned))
data_cleaned$Study.ID <- toupper(data_cleaned$Study.ID)
data_cleaned <- data_cleaned %>%
  arrange(Study.ID)
## Drop unnecessary attributes from the working data file
data_cleaned <- data_cleaned[, c("Study.ID", "Intervention.Mean", "Intervention.SD", "Intervention.N",
                                 "Control.Mean", "Control.SD", "Control.N")]
## Identify duplicate cases by removing exact matches and keeping duplicate entries with inconsistent data
data_exact <- data_cleaned %>%
  distinct(Study.ID, Intervention.Mean, Intervention.SD, Intervention.N,
           Control.Mean, Control.SD, Control.N, .keep_all = TRUE)
data_exact <- data_exact %>%
  mutate(row.ID = row_number())
data_exact <- data_exact[, c("row.ID", setdiff(names(data_exact), "row.ID"))]
data_fuzzy <- data_cleaned %>%
  distinct(Study.ID, .keep_all = TRUE) ## Needed for the inconsistent-duplicates check below
```
*This data has had exact duplicates removed.* **`r ifelse(nrow(data_fuzzy) == nrow(data_exact), "There don't appear to be duplicate entries after cleaning - looking good", "However, there are duplicate study entries with inconsistent data; manual data checking is required")`**
`r datatable(data_exact, options = list(pageLength = 5), rownames=FALSE)`
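If inconsistent duplicates are flagged above, a quick way to isolate just the offending rows is to keep the Study IDs that still appear more than once after exact de-duplication. A minimal sketch using the dplyr verbs already loaded (`data_inconsistent` is a hypothetical name, not part of the script above):

```{r inconsistent-duplicates, eval=FALSE}
## Sketch: show only Study IDs that survive exact de-duplication more than once,
## i.e. duplicate entries whose extracted values disagree
data_inconsistent <- data_exact %>%
  group_by(Study.ID) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  arrange(Study.ID)
data_inconsistent
```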
### Data for analysis
```{r manual-clean}
## You can manually specify which row(s) to drop below by listing the row.IDs
## of the rows you want to discard. If you don't wish to discard any rows,
## keep rows_drop <- NULL.
rows_drop <- NULL
# rows_drop <- c(9, 10)
data_manually <- data_exact[!(data_exact$row.ID %in% rows_drop), ]
```
*Specified rows (`r rows_drop`) have been discarded. This data should be ready to go.*
`r datatable(data_manually, options = list(pageLength = 5), rownames=FALSE)`
```{r save}
## Save the cleaned data to disk for analysis
data <- data_manually
write.csv(data, "data.csv", row.names = FALSE)
```
## Packages used in this document
```{r include=FALSE}
## Generate bibliography entries for all attached packages
citPkgs <- names(sessionInfo()$otherPkgs)
write_bib(citPkgs, file = "R-Pckgs.bib")
```