-
Notifications
You must be signed in to change notification settings - Fork 0
/
OvipositionDataAnalysis.Rmd
262 lines (171 loc) · 10.6 KB
/
OvipositionDataAnalysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
---
title: "Oviposition Prefernce by Genotype Data Analaysis"
output: github_document
always_allow_html: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r message = FALSE, include=FALSE}
library(tigerstats)
library(tidyverse)
library(kableExtra)
library(visreg)
```
Preference is the number of eggs laid on rice to the number laid on Leersia, and
Genotype refers to the parent race on rice (“rice”), the parent race on
Leersia (“leer”), their F1 and F2 hybrids (“f1”, “f2”), and the backcrosses between the F1 hybrid and each parent race (“br” for rice and “bl” for Leersia).
```{r}
ovipos <- read_csv("egglayingpreferencedata.csv")
attach(ovipos)
head(ovipos)
```
AIC = - 2 * ln L(Model | Data) + 2k
BIC = - 2 * ln L(Model | Data) + ln(n) * k
Where L(Model | Data) is the log-likelihood of the model given the dataset, k is the number of parameters in the model, and n is the number of observations in the dataset.
The equations for both AIC and BIC are based on the same criteria except for the penalty imposed by the number of variables included in the model. This penalty is important because, without it, the more complex model with the highest number of parameters would always be calculated as the most likely. Therefore both AIC and BIC impose this penalty as a way to account for the trade-off between variance and bias. BIC imposes a larger penalty than AIC in most cases. If we compare the penalty portions of the two equations we find that if the sample size is greater than 7, BIC imposes a higher penalty than AIC as shown below.
BIC > AIC
ln(n)k > 2k
ln(n) > 2
n > exp(2)
Visualize the oviposition preference of the different genotypes.
```{r fig.cap="**Figure 1:** Boxplots showing preference ratios for egg-laying in Rice and Leersia plants by genotype. The 'rice' genotype refers to the parent race that oviposits on rice, 'Leer' is the parent race that oviposits on Leerisa plants. F1 and F2 refer to hybrid genotypes. The 'br' genotype is the backcross of F1 hybrids with rice parent genotype. The 'bl' genotype is the backcross of the F1 hybrid with the Leerisa parent genotype. (n rice = 55, n bl = 53, n br = 191, n f1 = 91, n f2 = 113, n leer = 21))"}
plot(ovipos$preference ~ as.factor(ovipos$genotype), xlab = "Genotype", ylab = "Oviposition Prefernce Ratio (Rice/Leersia)")
```
The data seems to indicate that prefence may depend on the genotype of the insects. The rice genotype shows the strongest preference for oviposition on rice plants, while the leer phenotype shows the lowest preference for oviposition on rice plants. The hybrids have intermediate preference. Since there is overlap of all treatments further analysis is needed to confirm the significance of these trends.
Table of means and standard deviations of the genotypes.
```{r}
favstats(preference ~ genotype) -> ovipos.stats
ovipos.stats <- ovipos.stats[,c("genotype", "mean", "sd")]
names(ovipos.stats) <- c("Genotype", "Mean", "Standard Deviation")
ovipos.stats$Mean<- round(ovipos.stats$Mean, 2)
ovipos.stats$`Standard Deviation` <- round(ovipos.stats$`Standard Deviation`, 3)
kable(ovipos.stats,
caption = "**Table 1:** Descriptive statistics for the Oviposition data",
digits = c(0, 2, 3), align = "ccrr") %>%
kable_styling(full_width = FALSE, position = "left") %>%
add_header_above(c("Preference ratio of of eggs laid on rice plants to the number laid on Leersia plants" = 3))
```
Adding a numeric variable in the data set to represent the proportion of the genome inherited from the rice parent:
1 for the rice parent genotype
0 for the Leersia parent genotype (leer)
0.5 for the F1 and F2 hybrids (f1, f2)
0.25 for the backcross to the Leersia population (bl)
0.75 for the backcross to the rice backcross (br)
```{r}
ovipos$prop.genome <- ovipos$genotype
ovipos$prop.genome<-gsub("rice", 1, ovipos$prop.genome)
ovipos$prop.genome<-gsub("leer", 0, ovipos$prop.genome)
ovipos$prop.genome<-gsub("f1", 0.5, ovipos$prop.genome)
ovipos$prop.genome<-gsub("f2", 0.5, ovipos$prop.genome)
ovipos$prop.genome<-gsub("bl", 0.25, ovipos$prop.genome)
ovipos$prop.genome<-gsub("br", 0.75, ovipos$prop.genome)
#unique(ovipos$genotype)
#unique(ovipos$prop.genome)
ovipos$prop.genome <- as.numeric(ovipos$prop.genome)
head(ovipos)
```
Fit the numeric variable you to the preference data using a linear model.
This is called the additive model (model 1), whereby mean preference for rice increases linearly with the proportion of the genome inherited from the rice parent. Include your code.
```{r}
attach(ovipos)
ovipos.lm.additive <- lm(preference ~ prop.genome)
summary(ovipos.lm.additive)
```
Checking the assumptions of the linear model and evaluate the fit of the model.
```{r fig.cap="**Figure 2:** Resiual and normal quantile plots for the additive model."}
par(mfrow=c(2,2))
plot(ovipos.lm.additive)
```
The data look reasonably normally distributed. However the distribution of the residuals do not look equally distributed. I believe this is because there are different numbers of individuals in each group.
```{r fig.cap="**Figure 3:** Preference of oviposition of different genotypes of Planthopper insects on Rice vs Leersia plants. The blue solid line is the least square regression line, 95% confidence bands are shown in gray shading."}
visreg(ovipos.lm.additive, points.par = list(pch = 16, cex = 1.2, col = "purple"), ylab = "Oviposition Preference", xlab="Proportion of genome from rice oviposition preference genotype")
```
```{r}
anova(ovipos.lm.additive)
```
Comparing the additive model to the null intercept model. The high F-value indicates that more variation is explained by the additive model than the null intercept model.
Evaluate the model fit by calculating and comparing the AIC weight of the previous model with the one of the null model where preference is assumed to be independent from the proportion inherited from rice parents.
```{r}
#make a null model
null <- lm(ovipos$preference ~ 1)
#AIC comparing the null to the additive model
AICdelta2 <- c(AIC(ovipos.lm.additive), AIC(null)) - min(AIC(ovipos.lm.additive), AIC(null))
AICdelta2
```
```{r}
Like.null.add <- exp(-0.5 * AICdelta2) # relative likelihoods of models
akweight.null.add <- Like.null.add/sum(Like.null.add) # Akaike weights
akweight.null.add
cbind(round(akweight.null.add, 10), c("Additive Model", "Null Model")) %>% data.frame() -> weights_0
names(weights_0) <- c("AIC Weights","Model")
weights_0
```
Weights represent relative liklihood of the model, in the weighted values, a higher values indicates the better model.
Add another numeric variable to the data set to represent dominance effects that might be present in the hybrids:
- 0 for both parental genotypes
- 1 for the F1 hybrid genotype
- 0.5 for the remaining three hybrid genotypes
```{r}
ovipos$dom.effect <- ovipos$genotype
ovipos$dom.effect<-gsub("rice", 0, ovipos$dom.effect)
ovipos$dom.effect<-gsub("leer", 0, ovipos$dom.effect)
ovipos$dom.effect<-gsub("f1", 1, ovipos$dom.effect)
ovipos$dom.effect<-gsub("f2", 0.5, ovipos$dom.effect)
ovipos$dom.effect<-gsub("bl", 0.5, ovipos$dom.effect)
ovipos$dom.effect<-gsub("br", 0.5, ovipos$dom.effect)
ovipos$dom.effect <- as.numeric(ovipos$dom.effect)
str(ovipos)
```
Fit a second model to the same preference data that includes both of the numeric variables.
```{r}
ovipos.lm.additive.dom <- lm(ovipos$preference ~ ovipos$prop.genome + ovipos$dom.effect)
summary(ovipos.lm.additive.dom)
```
Evaluate the model fit by calculating and comparing the AIC weights of the model set composed of (1) the null, (2) the additive and the additive plus dominance models.
```{r}
#comparing null, additive, and dominance
AICdelta3 <- c(AIC(ovipos.lm.additive), AIC(ovipos.lm.additive.dom), AIC(null)) -
min(AIC(ovipos.lm.additive), AIC(ovipos.lm.additive.dom), AIC(null))
AICdelta3
Like.null.add.dom <- exp(-0.5 * AICdelta3)
akweight.null.add.dom <- Like.null.add.dom/sum(Like.null.add.dom)
akweight.null.add.dom
weight_table <- cbind(round(akweight.null.add.dom,4), c("Additive","Additive_Dom","Null")) %>% data.frame()
names(weight_table) <- c("AIC Weight", "Model")
weight_table
```
The additive model has the highest weighted value, indicating it is the best fit relative to the additive + dominance model and the null model. However the additive + dominance model is a better fit than null model based on the weighted values.
fit a third model that has the original genotype variable as the only explanatory variable.
```{r}
ovipos.og.lm <- lm(ovipos$preference ~ ovipos$genotype)
summary(ovipos.lm.additive)
```
Compare model fits by calculating the AIC weight of the four models.
```{r}
#comparing all 4 models
AICdelta <- c(AIC(ovipos.og.lm), AIC(ovipos.lm.additive), AIC(ovipos.lm.additive.dom), AIC(null)) -
min(AIC(ovipos.og.lm), AIC(ovipos.lm.additive), AIC(ovipos.lm.additive.dom), AIC(null))
#AICdelta
Likihood <- exp(-0.5 * AICdelta) # relative likelihoods of models
akweight <- Likihood/sum(Likihood) # Akaike weights
akweight
weight_table2 <- cbind(round(akweight, 4), c("Original","Additive","Additive_Dom","Null")) %>% data.frame()
names(weight_table2) <- c("AIC Weight", "Model")
weight_table2
```
The original model has the largest weight and therefore represents the most likely hypothesis. This is the model that only included the genotype as an explanatory varible. Biologically, this follows the trends observed in figure 1 that prefrence was directly related to genotype.
redo model comparison but using BIC
```{r}
BICdelta <- c(BIC(ovipos.og.lm), BIC(ovipos.lm.additive), BIC(ovipos.lm.additive.dom), BIC(null)) - min(BIC(ovipos.og.lm), BIC(ovipos.lm.additive), BIC(ovipos.lm.additive.dom), BIC(null))
BICdelta
```
```{r}
BICLikihood <- exp(-0.5 * BICdelta)
BICakweight <- BICLikihood/sum(BICLikihood)
BICakweight
weight_table3 <- cbind(round(BICakweight, 4), c("Original","Additive","Additive_Dom","Null")) %>% data.frame()
names(weight_table3) <- c("BIC Weight", "Model")
weight_table3
```
Doing BIC analysis results in the additive model being the best fit model to the data, opposed to the original model when AIC was calculated. The difference in results could be due to the large sample size because BIC penalizes more than AIC when the sample size is higher than 7. In this case, the genotype preference was penalized more than the proportion of the genome inherited by the rice parent, making the additive model a better fit when using BIC.