OvipositionDataAnalysis.Rmd

---
title: "Oviposition Prefernce by Genotype Data Analaysis"
output: github_document
always_allow_html: true
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r message = FALSE, include=FALSE}
library(tigerstats)
library(tidyverse)
library(kableExtra)
library(visreg)
```


Preference is the number of eggs laid on rice to the number laid on Leersia, and
Genotype refers to the parent race on rice (“rice”), the parent race on 
Leersia (“leer”), their F1 and F2 hybrids (“f1”, “f2”), and the backcrosses between the F1 hybrid and each parent race (“br” for rice and “bl” for Leersia).
```{r}

ovipos <- read_csv("egglayingpreferencedata.csv")
attach(ovipos)
head(ovipos)
```



AIC = -  2 * ln L(Model | Data) + 2k

BIC = - 2 * ln L(Model | Data) + ln(n) * k


Where L(Model | Data) is the log-likelihood of the model given the dataset, k is the number of parameters in the model, and n is the number of observations in the dataset.


The equations for both AIC and BIC are based on the same criteria except for the penalty imposed by the number of variables included in the model. This penalty is important because, without it, the more complex model with the highest number of parameters would always be calculated as the most likely. Therefore both AIC and BIC impose this penalty as a way to account for the trade-off between variance and bias.  BIC imposes a larger penalty than AIC in most cases. If we compare the penalty portions of the two equations we find that if the sample size is greater than 7, BIC imposes a higher penalty than AIC as shown below. 

BIC > AIC

ln(n)k > 2k 

ln(n) > 2

n > exp(2)



Visualize the oviposition preference of the different genotypes. 
```{r fig.cap="**Figure 1:** Boxplots showing preference ratios for egg-laying in Rice and Leersia plants by genotype. The 'rice' genotype refers to the parent race that oviposits on rice, 'Leer' is the parent race that oviposits on Leerisa plants. F1 and F2 refer to hybrid genotypes. The 'br' genotype is the backcross of F1 hybrids with rice parent genotype. The 'bl' genotype is the backcross of the F1 hybrid with the Leerisa parent genotype. (n rice = 55, n bl = 53, n br = 191, n f1 = 91, n f2 = 113, n leer = 21))"}
plot(ovipos$preference ~ as.factor(ovipos$genotype), xlab = "Genotype", ylab = "Oviposition Prefernce Ratio (Rice/Leersia)")
```
The data seems to indicate that prefence may depend on the genotype of the insects. The rice genotype shows the strongest preference for oviposition on rice plants, while the leer phenotype shows the lowest preference for oviposition on rice plants. The hybrids have intermediate preference. Since there is overlap of all treatments further analysis is needed to confirm the significance of these trends. 

Table of means and standard deviations of the genotypes.
```{r}
favstats(preference ~ genotype) -> ovipos.stats
ovipos.stats <- ovipos.stats[,c("genotype", "mean", "sd")]
names(ovipos.stats) <- c("Genotype", "Mean", "Standard Deviation")

ovipos.stats$Mean<- round(ovipos.stats$Mean, 2)
ovipos.stats$`Standard Deviation` <- round(ovipos.stats$`Standard Deviation`, 3)

kable(ovipos.stats,  
      caption = "**Table 1:** Descriptive statistics for the Oviposition data",
      digits = c(0, 2, 3), align = "ccrr") %>%
  kable_styling(full_width = FALSE, position = "left") %>%
  add_header_above(c("Preference ratio of of eggs laid on rice plants to the number laid on Leersia plants" = 3))
```



Adding a numeric variable in the data set to represent the proportion of the genome inherited from the rice parent:

1 for the rice parent genotype

0 for the Leersia parent genotype (leer)

0.5 for the F1 and F2 hybrids (f1, f2)


0.25 for the backcross to the Leersia population  (bl)


0.75 for the backcross to the rice backcross (br)


```{r}
ovipos$prop.genome <- ovipos$genotype
ovipos$prop.genome<-gsub("rice", 1, ovipos$prop.genome)
ovipos$prop.genome<-gsub("leer", 0, ovipos$prop.genome)
ovipos$prop.genome<-gsub("f1", 0.5, ovipos$prop.genome)
ovipos$prop.genome<-gsub("f2", 0.5, ovipos$prop.genome)
ovipos$prop.genome<-gsub("bl", 0.25, ovipos$prop.genome)
ovipos$prop.genome<-gsub("br", 0.75, ovipos$prop.genome)
#unique(ovipos$genotype)
#unique(ovipos$prop.genome)
ovipos$prop.genome <- as.numeric(ovipos$prop.genome)

head(ovipos)
```


Fit the numeric variable you to the preference data using a linear model. 

This is called the additive model (model 1), whereby mean preference for rice increases linearly with the proportion of the genome inherited from the rice parent. Include your code.

```{r}
attach(ovipos)
ovipos.lm.additive <- lm(preference ~ prop.genome)
summary(ovipos.lm.additive)
```

Checking the assumptions of the linear model and evaluate the fit of the model. 
```{r fig.cap="**Figure 2:** Resiual and normal quantile plots for the additive model."}
par(mfrow=c(2,2)) 
plot(ovipos.lm.additive)
```

The data look reasonably normally distributed. However the distribution of the residuals do not look equally distributed. I believe this is because there are different numbers of individuals in each group.  

```{r fig.cap="**Figure 3:** Preference of oviposition of different genotypes of Planthopper insects on Rice vs Leersia plants. The blue solid line is the least square regression line, 95% confidence bands are shown in gray shading."}
visreg(ovipos.lm.additive, points.par = list(pch = 16, cex = 1.2, col = "purple"), ylab = "Oviposition Preference", xlab="Proportion of genome from rice oviposition preference genotype")
```


```{r}
anova(ovipos.lm.additive)
```
Comparing the additive model to the null intercept model. The high F-value indicates that more variation is explained by the additive model than the null intercept model. 


Evaluate the model fit by calculating and comparing the AIC weight of the previous model with the one of the null model where preference is assumed to be independent from the proportion inherited from rice parents.
```{r}
#make a null model 

null <- lm(ovipos$preference ~ 1)

#AIC comparing the null to the additive model 

AICdelta2 <-  c(AIC(ovipos.lm.additive), AIC(null)) - min(AIC(ovipos.lm.additive), AIC(null))
AICdelta2
```

```{r}
Like.null.add <- exp(-0.5 * AICdelta2) # relative likelihoods of models
akweight.null.add <- Like.null.add/sum(Like.null.add) # Akaike weights

akweight.null.add

cbind(round(akweight.null.add, 10), c("Additive Model", "Null Model")) %>% data.frame() -> weights_0
names(weights_0) <- c("AIC Weights","Model")
weights_0
```

Weights represent relative liklihood of the model, in the weighted values, a higher values indicates the better model. 

Add another numeric variable to the data set to represent dominance effects that might be present in the hybrids:
- 0 for both parental genotypes
- 1 for the F1 hybrid genotype
- 0.5 for the remaining three hybrid genotypes

```{r}
ovipos$dom.effect <- ovipos$genotype
ovipos$dom.effect<-gsub("rice", 0, ovipos$dom.effect)
ovipos$dom.effect<-gsub("leer", 0, ovipos$dom.effect)
ovipos$dom.effect<-gsub("f1", 1, ovipos$dom.effect)
ovipos$dom.effect<-gsub("f2", 0.5, ovipos$dom.effect)
ovipos$dom.effect<-gsub("bl", 0.5, ovipos$dom.effect)
ovipos$dom.effect<-gsub("br", 0.5, ovipos$dom.effect)

ovipos$dom.effect <- as.numeric(ovipos$dom.effect)
str(ovipos)

```

Fit a second model to the same preference data that includes both of the numeric variables.


```{r}
ovipos.lm.additive.dom <- lm(ovipos$preference ~ ovipos$prop.genome + ovipos$dom.effect)
summary(ovipos.lm.additive.dom)
```

Evaluate the model fit by calculating and comparing the AIC weights of the model set composed of (1) the null, (2) the additive and the additive plus dominance models. 

```{r}
#comparing null, additive, and dominance 

AICdelta3 <- c(AIC(ovipos.lm.additive), AIC(ovipos.lm.additive.dom), AIC(null)) -
min(AIC(ovipos.lm.additive), AIC(ovipos.lm.additive.dom), AIC(null))
AICdelta3

Like.null.add.dom <- exp(-0.5 * AICdelta3)           
akweight.null.add.dom <- Like.null.add.dom/sum(Like.null.add.dom) 
akweight.null.add.dom

weight_table <- cbind(round(akweight.null.add.dom,4), c("Additive","Additive_Dom","Null")) %>% data.frame()
names(weight_table) <- c("AIC Weight", "Model")
weight_table
```

 The additive model has the highest weighted value, indicating it is the best fit relative to the additive + dominance model and the null model. However the additive + dominance model is a better fit than null model based on the weighted values. 

fit a third model that has the original genotype variable as the only explanatory variable. 

```{r}
ovipos.og.lm <- lm(ovipos$preference ~ ovipos$genotype)
summary(ovipos.lm.additive)
```

Compare model fits by calculating the AIC weight of the four models.

```{r}
#comparing all 4 models 

AICdelta <- c(AIC(ovipos.og.lm), AIC(ovipos.lm.additive), AIC(ovipos.lm.additive.dom), AIC(null)) -
  min(AIC(ovipos.og.lm), AIC(ovipos.lm.additive), AIC(ovipos.lm.additive.dom), AIC(null))

#AICdelta

Likihood <- exp(-0.5 * AICdelta)            # relative likelihoods of models
akweight <- Likihood/sum(Likihood)                     # Akaike weights

akweight

weight_table2 <- cbind(round(akweight, 4), c("Original","Additive","Additive_Dom","Null")) %>% data.frame()

names(weight_table2) <- c("AIC Weight", "Model")

weight_table2
```

The original model has the largest weight and therefore represents the most likely hypothesis. This is the model that only included the genotype as an explanatory varible. Biologically, this follows the trends observed in figure 1 that prefrence was directly related to genotype. 


redo model comparison but using BIC
```{r}
BICdelta <- c(BIC(ovipos.og.lm), BIC(ovipos.lm.additive), BIC(ovipos.lm.additive.dom), BIC(null)) - min(BIC(ovipos.og.lm), BIC(ovipos.lm.additive), BIC(ovipos.lm.additive.dom), BIC(null))
BICdelta
```


```{r}
BICLikihood <- exp(-0.5 * BICdelta)           
BICakweight <- BICLikihood/sum(BICLikihood) 
BICakweight


weight_table3 <- cbind(round(BICakweight, 4), c("Original","Additive","Additive_Dom","Null")) %>% data.frame()

names(weight_table3) <- c("BIC Weight", "Model")

weight_table3
```

Doing BIC analysis results in the additive model being the best fit model to the data, opposed to the original model when AIC was calculated. The difference in results could be due to the large sample size because BIC penalizes more than AIC when the sample size is higher than 7. In this case, the genotype preference was penalized more than the proportion of the genome inherited by the rice parent, making the additive model a better fit when using BIC.