-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathIntro to ENM and SDM.Rmd
228 lines (143 loc) · 6.42 KB
/
Intro to ENM and SDM.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
---
title: "Intro to ENM and SDM"
output: html_document
---
This is going to be just a quick walk-through. For a full tutorial, check out Jane Elith and Robert Hijmans' vignette at:
https://cran.r-project.org/web/packages/dismo/vignettes/sdm.pdf
```{r}
library(dismo)
library(rgeos)
library(rgbif)
ibm <- occ_search(scientificName = "Iberolacerta monticola", limit = 1500)
ibm <- subset(ibm$data, select = c("species", "decimalLatitude",
"decimalLongitude"))
ibm <- as.data.frame(unique(ibm))
ibm <- ibm[complete.cases(ibm),]
colnames(ibm) <- c("species", "lat", "lon")
all.worldclim <- raster::getData("worldclim", res = 10, var = "bio")
spain.worldclim <- crop(all.worldclim, extent(-10, 4, 35, 45))
plot(spain.worldclim[["bio1"]])
points(ibm[,c("lon", "lat")], pch = 16)
```
Okay, so we've got environmental data and occurrence data. How do we build a model?
Some methods (e.g., bioclim or domain) can work with just the data we've got.
```{r}
ibm.bc <- bioclim(spain.worldclim, ibm[,c("lon", "lat")])
plot(ibm.bc)
# Notice that the plot is only two variables, but our model uses 19!
response(ibm.bc)
attributes(ibm.bc)
# To make a map, we'll need to use dismo's predict function
ibm.bc.pred <- predict(object = ibm.bc, spain.worldclim)
plot(ibm.bc.pred)
points(ibm[,c("lon", "lat")], pch = 16, cex = 0.5)
# Let's stop and think about what a Bioclim model assumes about our species' response to the environment
```
Class exercise: do the same but using the domain function. Don't worry that plot(ibm.dm) doesn't work - there's just not a plot function associated with domain models in dismo. We'll do a much better job of visualizing these models later.
```{r}
ibm.dm <- dismo::domain(spain.worldclim, ibm[,c("lon", "lat")])
attributes(ibm.dm)
response(ibm.dm)
# To make a map, we'll need to use dismo's predict function
ibm.dm.pred <- predict(object = ibm.dm, spain.worldclim)
plot(ibm.dm.pred)
points(ibm[,c("lon", "lat")], pch = 16, cex = 0.5)
```
We can't really evaluate the predictive capacity of our models without some absence or background data, so let's make some. Here's one way, using circular buffers around our occurrence points.
```{r}
buffer <- circles(ibm[,c("lon", "lat")], d=100000, lonlat=TRUE)
plot(buffer)
pol <- gUnaryUnion(buffer@polygons)
buffer.raster <- mask(spain.worldclim[[1]], pol)
plot(buffer.raster)
points(ibm[,c("lon", "lat")], pch = 16, cex = 0.5)
background <- sampleRandom(buffer.raster, size=1000, xy=TRUE)
head(background)
# EXERCISE: let's create a data frame that just has lats and lons for presence
# and another one that just has lats and longs for background data. We'll
# give them the same column names and order, just to keep it tidy
ibm.pres <- ibm[,c("lon", "lat")]
ibm.bg <- background[,c("x", "y")]
colnames(ibm.bg) <- c("lon", "lat")
head(ibm.pres)
head(ibm.bg)
plot(ibm.dm.pred)
points(ibm.bg, pch = 4, cex = 0.5, col = "red")
points(ibm.pres, pch = 16, cex = 0.5)
# Evaluate the model using the evaluate function from dismo
ibm.dm.eval <- evaluate(p = ibm.pres, a = ibm.bg, model = ibm.dm, x = spain.worldclim)
ibm.dm.eval
attributes(ibm.dm.eval)
plot(ibm.dm.eval, "ROC")
plot(ibm.dm.eval, "TPR")
plot(ibm.dm.eval, "kappa")
# Compare to bioclim model
ibm.bc.eval <- evaluate(p = ibm.pres, a = ibm.bg, model = ibm.bc, x = spain.worldclim)
ibm.bc.eval
attributes(ibm.bc.eval)
plot(ibm.bc.eval, "ROC")
plot(ibm.bc.eval, "TPR")
plot(ibm.bc.eval, "kappa")
```
That's only training data, though! We usually prefer to set aside some data for testing. We'll do a quick k-fold partitioning, with k = 5
```{r}
group <- kfold(ibm.pres, 5)
group
# All this does is randomly assign a number from 1 to 5 for each presence point.
# To use these partitions for training and testing a model, we can do this:
ibm.bc <- bioclim(spain.worldclim, ibm.pres[group != 1,])
ibm.bc.pred <- predict(object = ibm.bc, spain.worldclim)
plot(ibm.bc.pred)
points(ibm.pres[group != 1,], pch = 16, cex = 0.5, col = "green")
points(ibm.pres[group == 1,], pch = 16, cex = 0.5, col = "blue")
# So now the green points are the ones used to train the model and the blue
# ones are the ones we'll use to test it
ibm.bc.eval.train <- evaluate(p = ibm.pres[group != 1,], a = ibm.bg, model = ibm.bc, x = spain.worldclim)
ibm.bc.eval.test <- evaluate(p = ibm.pres[group == 1,], a = ibm.bg, model = ibm.bc, x = spain.worldclim)
ibm.bc.eval.train
ibm.bc.eval.test
plot(ibm.bc.eval.train, "ROC")
plot(ibm.bc.eval.test, "ROC")
```
We're going to fit a model manually using one more method: GLM. This requires our data to be in a different format.
See if you can format your data so that presence and background data are in the same table, with a new column "pres" that contains a value of 1 where the species is present and 0 where it isn't.
```{r}
pres <- c(rep(1, nrow(ibm.pres)), rep(0, nrow(ibm.bg)))
latlon <- rbind(ibm.pres, ibm.bg)
ibm.df <- cbind(latlon, pres)
head(ibm.df)
tail(ibm.df)
# We also need to get our environmental data into our data frame
pa.env <- extract(spain.worldclim, ibm.df[,1:2])
pa.env <- as.data.frame(pa.env)
pa.env$pres <- ibm.df$pres
head(pa.env)
# Here's one way to do it
ibm.glm <- glm(pres ~ ., data = pa.env)
plot(predict(spain.worldclim, ibm.glm))
points(ibm.pres, pch =16, cex = 0.5, col = "black")
# It's complaining about a rank-deficient fit, which means we have too many
# predictor variables for the number of observations we have. Let's
# fit a simpler model.
ibm.glm <- glm(pres ~ bio1 + bio8, data = pa.env)
plot(predict(spain.worldclim, ibm.glm))
points(ibm.pres, pch =16, cex = 0.5, col = "black")
# What does our model look like?
response(ibm.glm)
# Let's fit a quadratic model instead
ibm.glm <- glm(pres ~ poly(bio1, 2) + poly(bio8, 2), data = pa.env)
ibm.glm
plot(predict(spain.worldclim, ibm.glm))
points(ibm.pres, pch =16, cex = 0.5, col = "black")
# response(ibm.glm)
# dismo does NOT like plotting responses for polynomial glms!
# We'll come back to that when we get into ENMTools
```
We can also use these models to predict to other places or time periods.
```{r}
# You can now use that model to predict habitat suitability in other places
# but note that this is SUPER slow for domain models
plot(predict(all.worldclim, ibm.glm))
```
Exercise 1: Can you figure out how to evaluate the model ibm.glm?
Exercise 2: Can you figure out how to increase the radius from which you draw your background data? What happens to your model evaluation metrics when you do?