---
output:
  md_document:
    variant: markdown_github
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
<!-- badges: start -->
[![R-CMD-check](https://github.com/IMPALA-Consortium/ctas/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/IMPALA-Consortium/ctas/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
# Clinical Timeseries Anomaly Spotter (CTAS)
Pekka Tiikkainen, Bayer Pharmaceuticals
## Introduction
CTAS is an R package for identifying outliers and anomalies in clinical trial time series.
Its main focus is on flagging sites with one or more study parameters whose profiles differ
from those of the other sites. The results can also be used to identify anomalies
in individual subjects.
This document focuses on the data format requirements of the input and output data. For details on
methodology, please refer to the paper and presentation published at the PHUSE EU Connect 22
meeting. The files are attached to this repository.
## How to use the package
First, the clinical data has to be wrangled into a format compatible with the main function's parameters
(please see below for definitions). Please note that the data preparation step is intentionally out of scope
for the package because each user is likely to have their data in a different format.
Then the main function (process_a_study) is called separately for each study you wish to process.
The function returns the results as a list of data frames. For details on the output, please see below.
## Example
```{r example}
library(ctas)

# Load the example data set shipped with the package
data("ctas_data", package = "ctas")

ctas_data

# Time series features to calculate, as one semicolon-delimited string.
# Base paste() is used so that no pipe operator has to be attached.
feats <- paste(
  c(
    "autocorr",
    "average",
    "own_site_simil_score",
    "sd",
    "unique_value_count_relative",
    "lof",
    "range"
  ),
  collapse = ";"
)

# Process one study; the result is a list of data frames
ls_ctas <- process_a_study(
  data = ctas_data$data,
  subjects = ctas_data$subjects,
  parameters = ctas_data$parameters,
  custom_timeseries = ctas_data$custom_timeseries,
  custom_reference_groups = ctas_data$custom_reference_groups,
  default_timeseries_features_to_calculate = feats,
  default_minimum_timepoints_per_series = 3,
  default_minimum_subjects_per_series = 3,
  default_max_share_missing_timepoints_per_series = 0.5,
  default_generate_change_from_baseline = FALSE,
  autogenerate_timeseries = TRUE
)

ls_ctas
```
## Input parameters
The function process_a_study has a number of parameters that need to be provided. Some of these are data frames
while others are simple variables.
The following gives more details on each parameter. Examples are also provided where needed.
### default_timeseries_features_to_calculate
Semicolon-delimited string of time series features to calculate.
The list must contain at least one of the following feature codes:
| Feature code | Description |
| --- | --- |
| autocorr | Auto-correlation |
| average | Average value |
| own_site_simil_score | Measure of co-clustering of time series from the same study site |
| sd | Standard deviation |
| unique_value_count_relative | Number of unique values divided by number of values available |
| range | Maximum difference between two time points in a series |
| lof | Local Outlier Factor. Values around one indicate inliers; the larger the value, the more of an outlier the time series is |
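The feature string can be assembled from a character vector with base R. This is a minimal sketch; the variable names `feature_codes` and `feats` are illustrative only.

```r
# Feature codes taken from the table above; any non-empty subset works.
feature_codes <- c("autocorr", "average", "sd", "range")

# Collapse the codes into one semicolon-delimited string.
feats <- paste(feature_codes, collapse = ";")
feats
#> [1] "autocorr;average;sd;range"
```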
### default_minimum_timepoints_per_series
Minimum number of time points for auto-generated time series. This value is used for
any parameter that does not have time_point_count_min defined in the parameters df.
### default_minimum_subjects_per_series
Minimum number of eligible subjects for auto-generated time series. This value is used for
any parameter that does not have subject_count_min defined in the parameters df.
### default_max_share_missing_timepoints_per_series
Maximum share of missing time points a subject can have in order to be eligible for a time series.
This value is used for any parameter that does not have max_share_missing defined in the parameters df.
### default_generate_change_from_baseline
If TRUE, a change-from-baseline adjusted version of each time series is generated (in addition to the time series
with actual measurements).
### autogenerate_timeseries
If TRUE, time series are autogenerated in a data-driven manner for all parameters. If set to FALSE, the
custom_timeseries data frame must have at least one entry.
### subjects
Data frame with one row per study subject.
Data frame columns:
| Column name | Data type | Data required | Description |
| --- | --- | --- | --- |
| subject_id | chr | Y | Unique identifier for each subject |
| country | chr | Y | Country where the subject is enrolled. |
| site | chr | Y | Name of the study site. |
| region | chr | (Y) | Region of the world for the country. Required if the custom reference table has "region" defined for any of the rows. |
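As a rough illustration of the layout above, a subjects data frame could be built as follows. All values (subject, site, country and region names) are made up for this sketch.

```r
# Hypothetical subjects table: one row per study subject.
subjects <- data.frame(
  subject_id = c("subj_1", "subj_2", "subj_3"),
  country    = c("Finland", "Finland", "Japan"),
  site       = c("site_A", "site_A", "site_B"),
  # region is only needed if a custom reference group uses "region"
  region     = c("Europe", "Europe", "East Asia"),
  stringsAsFactors = FALSE
)

# subject_id is described as a unique identifier, so check for duplicates.
stopifnot(!anyDuplicated(subjects$subject_id))
```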
### parameters
Data frame with one row for each parameter that should be processed.
Parameters are, for example, laboratory assays and vital sign measurements done
during the course of a study.
Data frame columns:
| Column name | Data type | Data required | Description |
| --- | --- | --- | --- |
| parameter_id | chr | Y | Parameter identifier |
| parameter_category_1 | chr | | Top level category of the parameter (e.g. LB for a laboratory assay). |
| parameter_category_2 | chr | | Second level category of the parameter (e.g. General chemistry for a laboratory assay). |
| parameter_category_3 | chr | | Third level category of the parameter (e.g. Local lab for a laboratory assay). |
| parameter_name | chr | Y | Parameter name |
| time_point_count_min | int | | Minimum number of time points required for auto-generated time series. If not provided, global default value is used. |
| subject_count_min | int | | Minimum number of eligible subjects required for auto-generated time series. If not provided, global default value is used. |
| max_share_missing | float | | Maximum share of missing time points an eligible subject can have for auto-generated time series. If not provided, global default value is used. |
| generate_change_from_baseline | boolean | | Whether to also generate baseline-adjusted time series for the parameter. If not provided, global default value is used. |
| timeseries_features_to_calculate | chr | | A semicolon-delimited list of time series features to calculate for the parameter. If not provided, global default value is used. |
| use_only_custom_timeseries | boolean | | Whether to only use the custom timeseries for the parameter (i.e. ignore any autogenerated time series). |
Example for two parameters, one vital sign and one local lab. For any time series generated for
the vital sign, at least 50 eligible subjects are required. For the local lab, time series with one
time point only are acceptable. These parameter-specific properties override the corresponding global
parameters (see above).
| parameter_id | parameter_category_1 | parameter_category_2 | parameter_category_3 | parameter_name | time_point_count_min | subject_count_min | max_share_missing | generate_change_from_baseline | timeseries_features_to_calculate | use_only_custom_timeseries |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dummy_param_1 | VS | NA | NA | DIABP | | 50 | | | | |
| dummy_param_2 | LB | General chemistry | Local lab | Glucose | 1 | | | | | |
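The same two parameters could be written as an R data frame along these lines. This sketch assumes that a "not provided" value is represented by `NA`, and it sets `use_only_custom_timeseries` to `FALSE` explicitly; neither choice is confirmed by the text above.

```r
# Sketch of the parameters table; NA marks values left to the global defaults
# (assumption: the package treats NA as "not provided").
parameters <- data.frame(
  parameter_id         = c("dummy_param_1", "dummy_param_2"),
  parameter_category_1 = c("VS", "LB"),
  parameter_category_2 = c(NA, "General chemistry"),
  parameter_category_3 = c(NA, "Local lab"),
  parameter_name       = c("DIABP", "Glucose"),
  time_point_count_min = c(NA_integer_, 1L),   # local lab: one time point is enough
  subject_count_min    = c(50L, NA_integer_),  # vital sign: at least 50 subjects
  max_share_missing    = c(NA_real_, NA_real_),
  generate_change_from_baseline    = c(NA, NA),
  timeseries_features_to_calculate = c(NA_character_, NA_character_),
  use_only_custom_timeseries       = c(FALSE, FALSE),  # illustrative choice
  stringsAsFactors = FALSE
)

str(parameters)
```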
### data
Data frame for the data collected during the study with one row per data point.
Data frame columns:
| Column name | Data type | Data required | Description |
| --- | --- | --- | --- |
| subject_id | chr | Y | Pointer to the subject identifier in the subjects table. |
| parameter_id | chr | Y | Pointer to the parameter identifier in the parameters table. |
| timepoint_1_name | chr | Y | Name of the top level time point (e.g. Visit 1). |
| timepoint_2_name | chr | | Name of the second level time point (e.g. "30 mins after dosing"). |
| timepoint_rank | int | Y | Integer for ranking time points into chronological order. Ranking is separate for each parameter. |
| result | float | Y | Numerical result of the parameter. |
| baseline | float | | Possible baseline value for the parameter and the subject. This is used to calculate change-from-baseline time series. |
Example of data contents:
| subject_id | parameter_id | timepoint_1_name | timepoint_2_name | timepoint_rank | result | baseline |
| --- | --- | --- | --- | --- | --- | --- |
| subj_1 | dummy_param_1 | Visit 1 | pre-dose | 1 | 72 | 65 |
| subj_1 | dummy_param_1 | Visit 1 | post-dose | 2 | 81 | 65 |
| subj_1 | dummy_param_1 | Visit 2 | NA | 3 | 69 | 65 |
| subj_1 | dummy_param_1 | Visit 3 | NA | 4 | 70 | 65 |
| subj_1 | dummy_param_2 | Visit 3 | NA | 1 | 85 | 84 |
| subj_2 | dummy_param_2 | Visit 3 | NA | 1 | 102 | 96 |
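The example rows above translate to a data frame like the one below. The consistency checks at the end are only a suggestion for catching wrangling mistakes early; they are not part of the package, and the names `study_data`, `known_subjects` and `known_parameters` are hypothetical.

```r
# The example data rows as a data frame (one row per data point).
study_data <- data.frame(
  subject_id       = c("subj_1", "subj_1", "subj_1", "subj_1", "subj_1", "subj_2"),
  parameter_id     = c(rep("dummy_param_1", 4), rep("dummy_param_2", 2)),
  timepoint_1_name = c("Visit 1", "Visit 1", "Visit 2", "Visit 3", "Visit 3", "Visit 3"),
  timepoint_2_name = c("pre-dose", "post-dose", NA, NA, NA, NA),
  timepoint_rank   = c(1L, 2L, 3L, 4L, 1L, 1L),
  result           = c(72, 81, 69, 70, 85, 102),
  baseline         = c(65, 65, 65, 65, 84, 96),
  stringsAsFactors = FALSE
)

# Identifiers that would normally come from the subjects and parameters tables.
known_subjects   <- c("subj_1", "subj_2")
known_parameters <- c("dummy_param_1", "dummy_param_2")

# Every row should point to a known subject and a known parameter.
stopifnot(all(study_data$subject_id %in% known_subjects))
stopifnot(all(study_data$parameter_id %in% known_parameters))

# Assumed sanity check: one result per subject, parameter and time point rank.
key <- study_data[, c("subject_id", "parameter_id", "timepoint_rank")]
stopifnot(!anyDuplicated(key))
```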
### custom_timeseries
The study team might be interested in specific time series - for instance in an oncology trial
one might be interested in data collected in the first two treatment cycles.
When time series are auto-generated in a data-driven manner, it cannot be guaranteed that a particular
time series is generated. To make sure that a specific time series is analyzed, this can be
defined with the _custom_timeseries_ data frame.
Please note: if you do not have any custom time series to process, you still have to provide the main
function with an empty data frame!
Data frame columns:
| Column name | Data type | Data required | Description |
| --- | --- | --- | --- |
| timeseries_id | chr | Y | Identifier for the custom time series. |
| parameter_id | chr | Y | Pointer to the parameter identifier in the parameters table. |
| timepoint_combo | chr | Y | Semicolon-delimited string of timepoint ranks that the time series consists of. These must match values in the timepoint_rank column of the data df. |
In the example below, a custom time series has been defined for dummy_param_1.
Timepoints are defined with their respective ranks in the _data_ df.
| timeseries_id | parameter_id | timepoint_combo |
| --- | --- | --- |
| custom_ts_1 | dummy_param_1 | 1;2;3;4 |
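A sketch of both cases in R follows: an empty table for when no custom time series are needed, and the single-row example above. It assumes that the "blank" data frame means zero rows with the three columns listed; that interpretation is not confirmed by the text.

```r
# No custom time series: zero rows, but the expected columns are present
# (assumed interpretation of a "blank" data frame).
custom_timeseries_empty <- data.frame(
  timeseries_id   = character(),
  parameter_id    = character(),
  timepoint_combo = character(),
  stringsAsFactors = FALSE
)

# One custom time series covering ranks 1 to 4 of dummy_param_1.
custom_timeseries <- data.frame(
  timeseries_id   = "custom_ts_1",
  parameter_id    = "dummy_param_1",
  timepoint_combo = paste(1:4, collapse = ";"),
  stringsAsFactors = FALSE
)

# Split the combo back into integer ranks, e.g. to compare them against
# the timepoint_rank values present in the data df.
ranks <- as.integer(strsplit(custom_timeseries$timepoint_combo, ";", fixed = TRUE)[[1]])
ranks
#> [1] 1 2 3 4
```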
### custom_reference_groups
For some parameters, it might make more sense to compare a site to other sites
in the same country or region than to all sites globally. This can be the case for subject weight
or height, for example, where it is better to compare the average height or weight of
Japanese sites to that of other East Asian sites. If compared globally, most Japanese sites would
get flagged, since Japanese people are in general lighter and shorter than people elsewhere.
Data frame columns:
| Column name | Data type | Data required | Description |
| --- | --- | --- | --- |
| parameter_id | chr | Y | Pointer to the parameter identifier in the parameters table. |
| feature | chr | Y | Time series feature for which to use the custom reference group. |
| ref_group | chr | Y | "country" if the parameter features should only be compared to sites from the same country, "region" if the comparison should be made to other sites from the same region of the world. |
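A custom reference group table might look like this in R. The sketch assumes one feature code per row, which the column description suggests but does not state outright; the parameter and feature choices are illustrative.

```r
# Hypothetical reference groups: compare two features of dummy_param_1
# within the same region instead of across all sites.
custom_reference_groups <- data.frame(
  parameter_id = c("dummy_param_1", "dummy_param_1"),
  feature      = c("average", "sd"),   # assumption: one feature code per row
  ref_group    = c("region", "region"),
  stringsAsFactors = FALSE
)

# Only "country" and "region" are described as valid reference groups.
stopifnot(all(custom_reference_groups$ref_group %in% c("country", "region")))
```

Because "region" is used here, the subjects table would also need its region column filled in.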
## Output data
Output is a list of four data frames. Their details are explained in the following.
### timeseries
Time series processed. Contains the auto-generated and/or custom time series.
Data frame columns:
| Column name | Data type | Description |
| --- | --- | --- |
| timeseries_id | chr | Unique identifier for the time series. |
| parameter_id | chr | The parameter the time series is for. |
| baseline | chr | Whether the time series values are original measurements or baseline-adjusted. |
| timepoint_combo | chr | Ranks of the time points that constitute the time series. Semicolon-delimited string. |
| timepoint_combo_readable | chr | Names of the time points that constitute the time series. Semicolon-delimited string. |
| timepoint_count | int | Number of timepoints in the time series. |
### timeseries_features
Data frame with features calculated for time series.
Data frame columns:
| Column name | Data type | Description |
| --- | --- | --- |
| timeseries_id | chr | Unique identifier of the time series. Points to the timeseries table. |
| subject_id | chr | Subject ID pointing to the input df subjects. |
| feature | chr | Feature code. |
| feature_value | float | Feature value. |
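Since the features are reported per subject, they can be joined to the subjects table for site-level summaries. The sketch below uses made-up values in place of real package output; only the column names come from the tables in this document.

```r
# Toy stand-in for the timeseries_features output (values are invented).
timeseries_features <- data.frame(
  timeseries_id = "ts_1",
  subject_id    = c("subj_1", "subj_2", "subj_3"),
  feature       = "average",
  feature_value = c(70, 74, 90),
  stringsAsFactors = FALSE
)

# Toy subjects table with the columns needed for the join.
subjects <- data.frame(
  subject_id = c("subj_1", "subj_2", "subj_3"),
  site       = c("site_A", "site_A", "site_B"),
  stringsAsFactors = FALSE
)

# Attach the site to each feature value, then average per site.
with_site  <- merge(timeseries_features, subjects, by = "subject_id")
site_means <- aggregate(
  feature_value ~ timeseries_id + feature + site,
  data = with_site,
  FUN = mean
)
site_means
```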
### PCA_coordinates
Data frame with the top two principal components for a time series. These are useful
for the visualization of the similarity of subjects.
Data frame columns:
| Column name | Data type | Description |
| --- | --- | --- |
| timeseries_id | chr | Unique identifier of the time series. Points to the timeseries table. |
| subject_id | chr | Subject ID pointing to the input df subjects. |
| pc1 | float | The first principal component. |
| pc2 | float | The second principal component. |
### site_scores
Bias scores for sites. The data frame contains a row for each site/time series/feature combination.
Data frame columns:
| Column name | Data type | Description |
| --- | --- | --- |
| timeseries_id | chr | Unique identifier of the time series. Points to the timeseries table. |
| site | chr | Study site for which the score has been calculated. |
| country | chr | Site's country. |
| feature | chr | Time series feature the score is for. |
| pvalue_kstest_logp | float | Negative logarithm of the raw p-value. |
| kstest_statistic | float | Test statistic of the Kolmogorov-Smirnov test. |
| fdr_corrected_pvalue_logp | float | Site score: negative logarithm of the multiple testing corrected p-value. FDR (False Discovery Rate) is used for correction. |
| reference_group | chr | Reference group for the site. "All" means that the site has been compared to all sites in the study. |
| subject_count | int | Number of site subjects who are eligible for the time series. |
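One way to review the site scores is to keep rows above a chosen cut-off and sort them by score. The data and the threshold below are invented for illustration; the threshold in particular is arbitrary and not a recommendation from the package.

```r
# Toy stand-in for the site_scores output (values are invented).
site_scores <- data.frame(
  timeseries_id = c("ts_1", "ts_1", "ts_1"),
  site          = c("site_A", "site_B", "site_C"),
  feature       = c("sd", "sd", "sd"),
  fdr_corrected_pvalue_logp = c(0.2, 3.1, 1.4),
  stringsAsFactors = FALSE
)

# Arbitrary illustrative cut-off on the site score.
threshold <- 2

# Keep sites above the cut-off, highest score first.
flagged <- site_scores[site_scores$fdr_corrected_pvalue_logp > threshold, ]
flagged <- flagged[order(-flagged$fdr_corrected_pvalue_logp), ]
flagged$site
#> [1] "site_B"
```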