# Applications of Spatial Weights
## Introduction {-}
This notebook covers the functionality of the [Applications of Spatial Weights](https://geodacenter.github.io/workbook/4d_weights_applications/lab4d.html) section of the GeoDa workbook. We refer to that document for details on the methodology, references, etc. The goal of these notes is to approximate as closely as possible the operations carried out using GeoDa by means of a range of R packages.
The notes are written with R beginners in mind; more seasoned R users can probably skip most of the comments on data structures and other R particulars. Also, as always in R, there are typically several ways to achieve a specific objective, so what is shown here is just one way that works, but there often are others (that may even be more elegant, work faster, or scale better).
For this notebook, we use Cleveland house price data. Our goal in this lab is to show how to create spatially lagged variables and spatially smoothed rates based on different spatial weights.
### Objectives {-}
After completing the notebook, you should know how to carry out the following tasks:
- Create a spatially lagged variable as an average or sum of the neighbors
- Create a spatially lagged variable as a window sum or average
- Create a spatially lagged variable based on inverse distance weights
- Create a spatially lagged variable based on kernel weights
- Rescale coordinates to obtain inverse distance weights
- Compute and map spatially smoothed rates
- Compute and map spatial Empirical Bayes smoothed rates
#### R Packages used {-}
- **sf**: To read in the shapefile.
- **spdep**: To create k-nearest neighbors and distance-band neighbors, calculate distances between neighbors, convert to a weights structure, and coercion methods to sparse matrices.
- **knitr**: To make nicer looking tables
- **ggplot2**: To make a scatterplot comparing two different rate variables
- **tmap**: To make maps of different spatial rate variables
- **broom**: To get regression statistics in a tidy format
- **GGally**: To make a parallel coordinates plot
- **purrr**: To map a centroid function of the geometry column of an **sf** data structure
- **geodaData**: To access the data for this notebook
- **tidyverse**: Basic data frame manipulation
#### R Commands used {-}
Below follows a list of the commands used in this notebook. For further details
and a comprehensive list of options, please consult the
[R documentation](https://www.rdocumentation.org).
- **Base R**: `install.packages`, `library`, `setwd`, `class`, `str`, `lapply`, `attributes`, `summary`, `head`, `seq`, `as`, `cbind`, `max`, `unlist`, `length`, `sqrt`, `exp`, `diag`, `sort`, `append`, `sum`
- **sf**: `st_read`, `st_relate`, `st_centroid`, `plot`
- **spdep**: `knn2nb`, `knearneigh`, `nbdists`, `nblag`, `nblag_cumul`, `nb2listw`, `EBlocal`, `lag.listw`, `include.self`
- **knitr**: `kable`
- **ggplot2**: `ggplot`, `geom_smooth`, `geom_point`, `ggtitle`
- **tmap**: `tm_shape`, `tm_fill`, `tm_borders`, `tm_layout`
- **broom**: `tidy`
- **GGally**: `ggparcoord`
- **purrr**: `map_dbl`
- **tidyverse**: `mutate`
## Preliminaries {-}
Before starting, make sure to have the latest version of R and of packages that are compiled for the matching version of R (this document was created using R 3.5.1 of 2018-07-02). Also, optionally, set a working directory, even though we will not
actually be saving any files.^[Use `setwd(directorypath)` to specify the working directory.]
### Load packages {-}
First, we load all the required packages using the `library` command. If you don't have some of these in your system, make sure to install them first as well as
their dependencies.^[Use
`install.packages(packagename)`.] You will get an error message if something is missing. If needed, just install the missing piece and everything will work after that.
```{r, message=FALSE,warning=FALSE}
library(sf)
library(spdep)
library(tmap)
library(ggplot2)
library(GGally)
library(broom)
library(knitr)
library(tidyverse)
library(purrr)
```
### geodaData {-}
All of the data for the R notebooks is available in the **geodaData**
package. We do not need to load the package itself; to access the individual
data sets, we use the double colon notation. This works similarly to
accessing a variable with `$`, in that a drop-down menu will
appear with a list of the data sets included in the package. For this
notebook, we use `clev_pts`.
If you choose to download the data into your working directory instead, go to [Cleveland Home Sales](https://geodacenter.github.io/data-and-lab//clev_sls_154_core/). The download format is a
zipfile, so you will need to unzip it by double clicking on the file in your file
finder. From there, move the resulting folder titled **clev_sls_154_core** into your working directory
to continue. Once that is done, you can use the **sf** function `st_read()` to read
the shapefile into your R environment.
```{r}
clev.points <- geodaData::clev_pts
```
## Spatially lagged variables {-}
With a neighbor structure defined by the non-zero elements of the spatial weights matrix W, a spatially lagged
variable is a weighted sum or a weighted average of the neighboring values for that variable. In most commonly
used notation, the spatial lag of y is then expressed as Wy.
Formally, for observation i, the spatial lag of $y_i$, referred to as $[Wy]_i$ (the variable Wy observed for
location i) is:
$$[Wy]_i = w_{i1}y_1 + w_{i2}y_2 + ... + w_{in}y_n$$
or,
$$[Wy]_i = \sum_{j=1}^nw_{ij}y_j$$
where the weights $w_{ij}$ consist of the elements of the i-th row of the matrix W, matched up with the
corresponding elements of the vector y.
In other words, the spatial lag is a weighted sum of the values observed at neighboring locations, since the
non-neighbors are not included (those i for which $w_{ij}=0$). Typically, the weights matrix is very sparse, so
that only a small number of neighbors contribute to the weighted sum. For row-standardized weights,
with $\sum_j w_{ij} = 1$, the spatially lagged variable becomes a weighted average of the values at
neighboring observations.
In matrix notation, the spatial lag expression corresponds to the matrix product of the **n × n** spatial weights
matrix W with the **n × 1** vector of observations y, or **W × y**. The matrix W can therefore be considered to be
the spatial lag operator on the vector **y**.
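As a quick illustration of this matrix view, consider a toy example (three observations, not part of the workbook data) with row-standardized weights:
```{r}
# toy 3 x 3 row-standardized weights matrix: each observation has the
# other two as neighbors, each with weight 1/2
W <- matrix(c( 0, .5, .5,
              .5,  0, .5,
              .5, .5,  0), nrow = 3, byrow = TRUE)
y <- c(10, 20, 30)
W %*% y  # each element is the average of the other two values
```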
In a number of applied contexts, it may be useful to include the observation at location i itself in the weights
computation. This implies that the diagonal elements of the weights matrix must be non-zero, i.e., $w_{ii} \neq 0$.
Depending on the context, the diagonal elements may take on the value of one or equal a specific value (e.g., for
kernel weights where the kernel function is applied to the diagonal). We will highlight this issue in the specific
illustrations that follow.
### Creating a spatially lagged variable {-}
In this section, we will go through the four different spatial lag options covered in the GeoDa workbook. These
are **Spatial lag with row-standardized weights**, **Spatial lag as a sum of neighboring values**,
**Spatial window average**, and **Spatial window sum**.
#### Creating the weights {-}
To start, we will need to make our weights in order to follow along with the GeoDa workbook example. This will require
making k-nearest neighbor weights for k = 6 with row-standardized weights.
The process for making these weights is the same as covered in earlier notebooks. However, to summarize, we
start by getting the coordinates in a separate matrix with the base R function `cbind`, then use that with
the **spdep** functions `knn2nb` and `knearneigh` to compute a neighbors list. From there we convert the
neighbors list to class **listw** with `nb2listw`. This will result in row standardized weights. It is
important to note that we will have to work with neighbors structure in some cases to get the desired
functionality, which will require creating more than one weights object for the k-nearest neighbors weights
used in this section of the notebook.
```{r, message=FALSE,warning=FALSE}
# getting coordinates
coords <- cbind(clev.points$x,clev.points$y)
# creating neighbors list
k6 <- knn2nb(knearneigh(coords, k = 6))
# converting to weights structure from neighbors list
k6.weights1 <- nb2listw(k6)
k6.weights1$weights[[1]]
```
#### Spatial lag with row-standardized weights {-}
The default case in GeoDa is to use row-standardized weights and to leave out the diagonal element. Implementing this
will be relatively simple, as we will only need one line of code with the weights structure:
**k6.weights1**. To create the lag variable, the **spdep** function, `lag.listw` is used. The inputs require
a weights structure and a variable with the same length as the weights structure. The length should not
be a problem if the variable comes from the same dataset used to create your weights.
We create a new data frame of the **sale_price** variable and the lag variable to get a side by side
look. To get a brief look at the first few observations of both variables, we use `head`. Then to get
a nicer looking table, we use `kable` from the **knitr** package.
```{r}
lag1 <- lag.listw(k6.weights1, clev.points$sale_price)
df <- data.frame(sale_price = clev.points$sale_price, lag1)
kable(head(df))
```
The values match the example from the GeoDa workbook; the only difference is that the values here are rounded to fewer
significant figures than in GeoDa.
In this instance, the lag variable is calculated by taking an average of the neighboring values. We will
illustrate this by finding the neighboring sale price values for the first observation. To get these values,
we get the neighbor ids for the first observation by double bracket notation. We then select the sale
price values by indexing with the resulting numeric vector.
```{r}
nb1 <- k6[[1]]
nb1 <- clev.points$sale_price[nb1]
nb1
```
We now verify the value for the spatial lag listed in the table. It is obtained as the average of the sales price
for the six neighbors, or (131650 + 65000 + 81500 + 76000 + 120000 + 5000)/6 = 79858.33.
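We can verify this directly in R, using the neighbor values extracted above:
```{r}
mean(nb1)  # should match lag1[1]
```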
We can quickly assess the effect of spatial averaging by comparing the descriptive statistics for the lag variable
with those of the original sale price variable.
```{r}
summary(clev.points$sale_price)
```
```{r}
summary(lag1)
```
As seen above, the range for the original sale price variable is 1,049 to 527,409. The range for the
lag variable is much more compressed, being 6,583 to 229,583. The typical effect of spatial lag is a compression
of the range and variance of a variable. We can see this further when examining the standard deviation.
```{r}
sd(clev.points$sale_price)
```
```{r}
sd(lag1)
```
The standard deviation for home sales gets significantly condensed from 60654.34 to 36463.76 in the creation
of the lag variable.
#### PCP {-}
To get a more dramatic view of the influence of high-valued or low-valued neighbors on the spatial lag, we use
a parallel coordinates plot. This will be done using **GGally**, which is an R package that builds off of the
**ggplot2** package for some more complicated plots, i.e., the parallel coordinate plot and the scatterplot
matrix. We won't go in depth about the different options this package offers in this notebook, but if interested,
check out [GGally documentation](https://www.rdocumentation.org/packages/GGally/versions/1.4.0).
To make the pcp with just the two variables, we will need to put them in a separate matrix, or data
frame. We use `cbind` to do this, as with the coordinates. From there we use `ggparcoord` from the **GGally**
package. The necessary inputs are the data (**pcp.vars**) and the scale parameter. We use `scale = "globalminmax"`
to keep the same price values as our lag variable and sale price variable. The default option scales the values down,
which is not what we are looking for in this case.
```{r}
sale.price <- clev.points$sale_price
pcp.vars <- cbind(sale.price,lag1)
ggparcoord(data = pcp.vars,scale ="globalminmax") +
ggtitle("Parallel Coordinate Plot")
```
#### Spatial lag as a sum of neighboring values {-}
We can calculate spatial lag as a sum of neighboring values by assigning
binary weights. This requires us to go back to our neighbors list, then
apply a function that will assign binary weights, then we use `glist =` in
the `nb2listw` function to explicitly assign these weights.
We start by applying a function that will assign a value of 1 for each neighbor. This is done with `lapply`, which we have been using to manipulate
the neighbors structure throughout the past notebooks. Basically it applies
a function across each value in the neighbors structure.
```{r}
binary.weights <- lapply(k6, function(x) 0*x + 1)
k6.weights2 <- nb2listw(k6, glist = binary.weights, style = "B")
```
With the proper weights assigned, we can use `lag.listw` to compute a lag
variable from our weight and sale price data.
```{r}
lag2 <- lag.listw(k6.weights2, clev.points$sale_price)
df <- df %>% mutate(lag2 = lag2)
kable(head(df))
```
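Since every observation has exactly six neighbors here, the binary (sum) lag is simply six times the row-standardized (average) lag, which we can verify:
```{r}
all.equal(lag2, 6 * lag1)
```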
#### Spatial window average {-}
The spatial window average uses row-standardized weights and includes the
diagonal element. To do this in R, we need to go back to the neighbors
structure and add the diagonal element before assigning weights. To begin
we assign **k6** to a new variable because we will directly alter its
structure to add the diagonal elements.
```{r}
k6a <- k6
```
To add the diagonal element to the neighbors list, we use `include.self`
from **spdep**. Note that it returns a modified copy, so we assign the result back.
```{r}
k6a <- include.self(k6a)
```
Now we obtain weights with `nb2listw`
```{r, message=FALSE,warning=FALSE}
k6.weights3 <- nb2listw(k6a)
```
Lastly, we just need to create the lag variable from our weight structure and
**sale_price** variable.
```{r}
lag3 <- lag.listw(k6.weights3, clev.points$sale_price)
df <- df %>% mutate(lag3 = lag3)
kable(head(df))
```
#### Spatial window sum {-}
The spatial window sum is the counterpart of the window average, but
without using row-standardized weights. To do this we assign binary
weights to the neighbor structure that includes the diagonal element.
```{r}
binary.weights2 <- lapply(k6a, function(x) 0*x + 1)
binary.weights2[1]
```
Again we use `nb2listw` and `glist =` to explicitly assign weight values.
```{r}
k6.weights4 <- nb2listw(k6a, glist = binary.weights2, style = "B")
```
With our new weight structure, we can compute the lag variable with `lag.listw`.
```{r}
lag4 <- lag.listw(k6.weights4, clev.points$sale_price)
df <- df %>% mutate(lag4 = lag4)
kable(head(df))
```
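Because the diagonal element enters with a binary weight of one, the window sum equals the original value plus the sum of the neighboring values, which we can check against the earlier lag variables:
```{r}
all.equal(lag4, clev.points$sale_price + lag2)
```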
## Spatially lagged variables from inverse distance weights {-}
### Principle {-}
The spatial lag operation can also be applied using spatial weights calculated from the inverse
distance between observations. As mentioned in our earlier discussion, the magnitude of these
weights is highly scale dependent (depends on the scale of the coordinates). An uncritical
application of a spatial lag operation with these weights can easily result in non-sensical
values. More specifically, since the resulting weights can take on very small values, the
spatial lag could end up being essentially zero.
Formally, the spatial lag operation amounts to a weighted average of the neighboring values,
with the inverse distance function as the weights:
$$[Wy]_i = \sum_jy_j/d_{ij}^\alpha$$
where in our implementation, $\alpha$ is either 1 or 2. In the latter case (a so-called gravity model
weight), the spatial lag is sometimes referred to as a potential in geo-marketing analyses. It
is a measure of how accessible location i is to opportunities located in the neighboring
locations (as defined by the weights).
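To see this scale dependence concretely, here is a small sketch with hypothetical distances and values: expressing the same distances in different units changes the magnitude of the lag by the same factor as the unit conversion.
```{r}
d <- c(2000, 5000)       # hypothetical distances in feet
y <- c(100000, 150000)   # hypothetical neighboring values
sum(y / d)               # lag with alpha = 1, distances in feet
sum(y / (d / 5280))      # same lag with the distances expressed in miles
```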
### Default approach {-}
#### Inverse distance for k = 6 nearest neighbors {-}
To calculate the inverse distances, we must first compute the distances between each neighbor.
This is done with `nbdists`.
```{r}
k6.distances <- nbdists(k6,coords)
```
Now we apply the function 1/x to each element of the distance structure, which gives us
the inverse distances.
```{r}
invd1a <- lapply(k6.distances, function(x) (1/x))
invd1a[1]
```
Now we explicitly assign the inverse distance weight with `glist =` in the `nb2listw` function.
```{r}
invd.weights <- nb2listw(k6,glist = invd1a,style = "B")
```
The default case, with row-standardized weights off and no diagonal elements,
is put into the variable **idrsnd**. The lag values here are very different
from the sale price because the weight values are relatively small, the largest
being about 0.0026.
```{r}
idrsnd <- lag.listw(invd.weights, clev.points$sale_price)
kable(head(idrsnd))
```
```{r}
idnrwd <- clev.points$sale_price + idrsnd
kable(head(idnrwd))
```
The above would be the result when the diagonal is included. This amounts
to the original sale price being added to the lag value, which is the
equivalent of:
$$[Wy]_i = y_i + \sum_j\frac{y_j}{d_{ij}^\alpha}$$
### Spatial lags with row-standardized inverse distance weights {-}
To row-standardize the inverse distance weights, we rescale each row of weights so that it sums to one. We first define a small helper function that does this rescaling.
```{r}
row_stand <- function(x){
  # rescale a vector of weights so that the row sums to one
  row_sum <- sum(x)
  scalar <- 1/row_sum
  x * scalar
}
```
Next, we apply the helper to each row of inverse distance weights with `lapply`, and compare the raw and row-standardized weights for the first observation side by side.
```{r}
row.standw <- lapply(invd1a, row_stand)
df <- data.frame(inverse_distance = invd1a[[1]], row_stand_invd = row.standw[[1]])
kable(df)
```
As a check, the row-standardized weights for the first observation sum to one.
```{r}
sum(row.standw[[1]])
```
Now we assign the row-standardized inverse distance weights explicitly with `glist =` in `nb2listw`, keeping `style = "B"` so the supplied values are used as-is.
```{r}
invd.weights2 <- nb2listw(k6,glist = row.standw,style = "B")
invd.weights2$weights[[1]]
```
Finally, we compute the spatial lag with `lag.listw`; with row-standardized weights, this is a weighted average of the neighboring values.
```{r}
lag6 <- lag.listw(invd.weights2, clev.points$sale_price)
head(lag6)
```
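Note that the same weights can be obtained by passing the raw inverse distances and letting `nb2listw` do the row-standardization with `style = "W"`. A quick check of this equivalence (assuming, as documented for `nb2listw`, that style `"W"` row-standardizes the supplied general weights):
```{r}
invd.weights2b <- nb2listw(k6, glist = invd1a, style = "W")
all.equal(invd.weights2$weights[[1]], invd.weights2b$weights[[1]])
```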
## Spatially lagged variables from kernel weights {-}
Spatially lagged variables can also be computed from kernel weights. However,
in this instance, only one of the options with respect to row-standardization
and diagonal weights makes sense. Since the kernel weights are the result of a
specific kernel function, they should not be altered. Also, each kernel
function results in a specific value for the diagonal element, which should not
be changed either. As a result, the only viable option to create spatially
lagged variables based on kernel weights is to have no row-standardization and
have the diagonal elements included.
```{r}
k6a.distances <- nbdists(k6a,coords)
```
Here we apply the Epanechnikov kernel weight function to the distance structure, but
compute the kernel argument with a variable bandwidth rather than a fixed critical threshold. For the variable
bandwidth, we take the maximum distance between a location and its set of 6 neighbors. This is
then the distance used to calculate the weights for the neighbors of that location.
```{r}
for (i in 1:length(k6a.distances)){
  # variable bandwidth: the maximum distance within the neighbor set of i
  maxk6 <- max(k6a.distances[[i]])
  bandwidth <- maxk6
  # Epanechnikov kernel applied to d/bandwidth; the self-distance of zero
  # yields the diagonal weight of 0.75
  new_row <- .75*(1-(k6a.distances[[i]] / bandwidth)^2)
  k6a.distances[[i]] <- new_row
}
k6a.distances[[1]]
```
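As a side note, the same weights can be computed more compactly with `lapply`. This sketch recomputes the raw distances with `nbdists` first, since `k6a.distances` was overwritten in place by the loop above:
```{r}
epan.wts <- lapply(nbdists(k6a, coords), function(d) {
  0.75 * (1 - (d / max(d))^2)  # Epanechnikov kernel with variable bandwidth
})
all.equal(epan.wts[[1]], k6a.distances[[1]])
```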
Now we create the weights structure with `nb2listw`, and directly assign the weights values
to the neighbors structure with `glist =`.
```{r}
epan.weights <- nb2listw(k6a,glist = k6a.distances,style = "B")
epan.weights$weights[[1]]
```
Lastly, we use `lag.listw` to create the lag variable from our Epanechnikov kernel weights, then take a
brief look at the resulting lag variable, **epalag**.
```{r}
epalag <- lag.listw(epan.weights, clev.points$sale_price)
df <- data.frame(clev.points$sale_price,epalag)
kable(head(df))
```
This lag corresponds to a kernel-weighted sum, with the diagonal term included:
$$[Wy]_i = \sum_jK_{ij}y_j$$
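As a spot check for the first observation (the weights and the neighbor ids line up because both come from **k6a**):
```{r}
w1 <- epan.weights$weights[[1]]
all.equal(epalag[1], sum(w1 * clev.points$sale_price[k6a[[1]]]))
```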
## Spatial rate smoothing {-}
### Principle {-}
A spatial rate smoother is a special case of a nonparametric rate estimator, based on
the principle of locally weighted estimation. Rather than applying a local average to
the rate itself, as in an application of a spatial window average, the weighted
average is applied separately to the numerator and denominator.
The spatially smoothed rate for a given location i
is then given as:
$$\pi_i = \frac{\sum_{j=1}^nw_{ij}O_j} {\sum_{j=1}^nw_{ij}P_j}$$
where $O_j$ is the event count in location j, $P_j$ is the population at risk,
and $w_{ij}$ are spatial weights (typically with $w_{ii} \neq 0$, i.e., including the
diagonal).
Different smoothers are obtained for different spatial definitions of neighbors and/or
different weights applied to those neighbors (e.g., contiguity weights, inverse
distance weights, or kernel weights).
The window average is not applied to the rate itself, but it is computed separately for the
numerator and denominator. The simplest case boils down to applying the idea of a spatial
window sum to the numerator and denominator (i.e., with binary spatial weights in both, and
including the diagonal term):
$$\pi_i = \frac{O_i + \sum_{j=1}^{J_i}O_j} {P_i +\sum_{j=1}^{J_i}P_j}$$
where $J_i$ is a reference set (neighbors) for observation i. In practice, this is achieved by
using binary spatial weights for both numerator and denominator, and including the diagonal in
both terms, as in the expression above.
A map of spatially smoothed rates tends to emphasize broad spatial trends and is
useful for identifying general features of the data. However, it is not useful for the
analysis of spatial autocorrelation, since the smoothed rates are autocorrelated by
construction. It is also not very useful for identifying outlying observations, since
the values portrayed are really regional averages and not specific to an individual
location. By construction, the values shown for individual locations are determined by
both the events and the population sizes of adjoining spatial units, which can lead to
misleading impressions. Often, inverse distance weights are applied to both the
numerator and denominator.
### Preliminaries {-}
We return to the rate smoothing examples using the Ohio county lung cancer data.
Therefore, we now load the **ohio_lung** data set.
Next, we need to create the spatial weights we will use. In order to make sure that some smoothing will
occur, we take a fairly wide definition of neighbors. Specifically, we will create a
second order queen contiguity, inclusive of first order neighbors, inverse distance
weights based on knn = 10 nearest neighbor weights, and Epanechnikov kernel weights,
using the same 10 nearest neighbors and with the kernel applied to the diagonal (its
value will be 0.75).
To proceed, we will be using procedures outlined in earlier notebooks to create the
weights. The treatment will be less in-depth than in the distance and
contiguity weights notebooks, but will be enough to review the process
of making the weights.
#### geodaData {-}
All of the data for the R notebooks is available in the **geodaData**
package. As before, we use the double colon notation to access the
individual data sets. For this section, we use `ohio_lung`.
Otherwise, to get the data for this notebook, go to [Ohio Lung Cancer](https://geodacenter.github.io/data-and-lab/ohiolung/). The download format is a
zipfile, so you will need to unzip it by double clicking on the file in your file
finder. From there, move the resulting folder titled **ohiolung** into your working directory
to continue. Once that is done, you can use the **sf** function `st_read()` to read
the shapefile into your R environment.
```{r}
ohio_lung <- geodaData::ohio_lung
```
#### Creating a Crude Rate {-}
In addition to the spatial weights, we also need an example for crude rates. If not
already saved in the data set, we compute the crude rate for lung cancer among white
females in 1968.
To get the rate, we make a new variable, **LRATE**, and calculate the rate through a
bivariate operation. We multiply the ratio by 10,000 to obtain the crude rate of lung cancer
per 10,000 white women in 1968.
```{r}
ohio_lung$LRATE <- ohio_lung$LFW68 / ohio_lung$POPFW68 * 10000
```
```{r}
tm_shape(ohio_lung) +
tm_fill("LRATE",palette = "-RdBu",style ="sd",title = "Standard Deviation: LRATE") +
tm_borders() +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
```
#### Creating the Weights {-}
##### Queen Contiguity {-}
To create the queen contiguity weights, we will use the **spdep** package and **sf**
package to make a neighbors list with both first and second order neighbors
included.
To begin, we create a queen contiguity function with the corresponding DE-9IM pattern. We
use the `st_relate` function to accomplish this. For a more in-depth explanation
of the specified pattern, check the contiguity-based weights notebook.
```{r}
# queen contiguity: polygons sharing at least one boundary point are neighbors
st_queen <- function(a, b = a) st_relate(a, b, pattern = "F***T****")
```
Here we define a function needed to convert the resulting data structure from the
`st_relate` function. Specifically from **sgbp** to **nb**. It's fairly
straightforward, we just change the class name, transfer the attributes,
and check for observations without a neighbor.
```{r}
as.nb.sgbp <- function(x, ...) {
attrs <- attributes(x)
x <- lapply(x, function(i) { if(length(i) == 0L) 0L else i } )
attributes(x) <- attrs
class(x) <- "nb"
x
}
```
Now we create a comprehensive neighbors list with both first and second order
neighbors. To start, we use the two functions we created to get a first order
neighbors list; then we use `nblag` to get the first and second order neighbors. The
next step is to reorganize the data structure so that first and second order neighbors
are not separated, which is done with `nblag_cumul`. With these steps, we have a
neighbors list of first and second order neighbors.
```{r}
queen.sgbp <- st_queen(ohio_lung)
queen.nb <- as.nb.sgbp(queen.sgbp)
second.order.queen <- nblag(queen.nb, 2)
second.order.queen.cumul <- nblag_cumul(second.order.queen)
```
Here we add the diagonal elements to the neighbors list because we will need them
later on in computations. As before, `include.self` returns a modified copy, so we assign the result back.
```{r}
second.order.queen.cumul <- include.self(second.order.queen.cumul)
```
Now, we just use the `nb2listw` function to convert the neighbors list to a
weights structure, which is row-standardized by default.
```{r, message=FALSE,warning=FALSE}
queen.weights <- nb2listw(second.order.queen.cumul)
queen.weights$weights[[1]]
```
##### K-nearest neighbors {-}
Here we will make the inverse distance weights for the 10 nearest neighbors. To
start, we need the coordinates in a separate data structure before we can proceed
with the neighbor and distance calculations. We use the same method to extract these
values as in the distance-band weights notebook.
```{r}
longitude <- map_dbl(ohio_lung$geometry, ~st_centroid(.x)[[1]])
latitude <- map_dbl(ohio_lung$geometry, ~st_centroid(.x)[[2]])
coords <- cbind(longitude, latitude)
```
From here we just use `knearneigh` and `knn2nb` to get our neighbors structure.
```{r}
k10 <- knn2nb(knearneigh(coords, k = 10))
```
Now we need the distances between each observation and its neighbors. Having this will
allow us to calculate the inverse later and assign these values as weights when
converting the neighbors structure to a weights structure.
```{r}
k10.dists <- nbdists(k10,coords)
```
Here we apply a function to the distances of the 10 nearest neighbors and change
the scale by an order of 10,000. The purpose of changing the scale is to get
weight values that are not approximately zero.
```{r}
invdk10 <- lapply(k10.dists, function(x) (1/(x/10000)))
invdk10[1]
```
With the `nb2listw` function, we can assign non-default weights to the new weights
structure by means of the `glist` argument. Here we just assign the scaled inverse
weights we created above.
```{r}
k10.weights <- nb2listw(k10,glist = invdk10, style = "B")
k10.weights$weights[[1]]
```
##### Epanechnikov kernel {-}
For the Epanechnikov kernel weights, things are a bit more complicated, as we need
to account for the diagonal elements. Here we keep the 10 nearest neighbors list
without the diagonal, and instead add the constant diagonal kernel weight of 0.75
manually when we compute the smoothed rate later on.
We start from the 10-nearest neighbors and compute the distances between neighbors.
This gives us the distances in the same structure as the neighbors list.
```{r}
k10.distances <- nbdists(k10,coords)
```
This for loop gives us the Epanechnikov weights in the distance data structure
that we computed above.
```{r}
for (i in 1:length(k10.distances)){
  # variable bandwidth: the maximum distance within the neighbor set of i
  maxk10 <- max(k10.distances[[i]])
  bandwidth <- maxk10
  # Epanechnikov kernel applied to d/bandwidth
  new_row <- .75*(1-(k10.distances[[i]] / bandwidth)^2)
  k10.distances[[i]] <- new_row
}
k10.distances[[1]]
```
Now we can assign our Epanechnikov weight values as weights with the `nb2listw` function through the `glist =` parameter.
```{r}
epan.weights <- nb2listw(k10,glist = k10.distances,style = "B")
epan.weights$weights[[1]]
```
### Simple window average of rates {-}
This is an illustration of how not to proceed with the spatial rate calculation.
Here we use our crude rate and create a spatial lag variable directly from the second order queen contiguity weights.
```{r}
ohio_lung$W_LRATE <- lag.listw(queen.weights,ohio_lung$LRATE)
```
We then plot **W_LRATE** with a standard deviation map.
```{r}
tm_shape(ohio_lung) +
tm_fill("W_LRATE",palette = "-RdBu",style ="sd",title = "Standard Deviation: W_LRATE") +
tm_borders() +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
```
Characteristic of the spatial averaging, several larger groupings of similarly
classified observations appear. The pattern is quite different from that displayed for
the crude rate. For example, the upper outliers have disappeared, and there is now one
new lower outlier.
### Spatially Smoothed Rates {-}
As mentioned, applying the spatial averaging directly to the crude rates is not the proper way to operate. This approach ignores the differences between the populations of
the different counties and the associated variance instability of the rates.
To create the spatially smoothed rate, we need to make lag variables of the
event variable and the base variable, then divide them to get our rate. In GeoDa,
we just select our variables and the software does it for us; in R, we need
some extra steps.
To get this rate, we create lag variables for both the event count and the population with
`lag.listw`. To get the rate, we then divide the event count lag variable by the
population lag variable. Since the same row-standardized queen weights appear in both the
numerator and the denominator, the row-standardization factor cancels, and the result is
identical to the binary window-sum formulation from the principle section (see the check below).
```{r}
lag_lfw68 <- lag.listw(queen.weights,ohio_lung$LFW68)
lag_popfw68 <- lag.listw(queen.weights,ohio_lung$POPFW68)
ohio_lung$R_SPAT_R <- lag_lfw68 / lag_popfw68
```
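As a sanity check on the cancellation argument above, we can recompute the rate with binary weights and compare:
```{r}
queen.b <- nb2listw(second.order.queen.cumul, style = "B")
num <- lag.listw(queen.b, ohio_lung$LFW68)
den <- lag.listw(queen.b, ohio_lung$POPFW68)
all.equal(ohio_lung$R_SPAT_R, num / den)
```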
We map the results with **tmap** functions.
```{r}
tm_shape(ohio_lung) +
tm_fill("R_SPAT_R",palette = "-RdBu",style ="sd",title = "SRS-Smoothed LFW68 over POPFW68") +
tm_borders() +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
```
We rescale the spatial rate variable by 10,000 to make it comparable to **W_LRATE**. Then we add all
of the rates to a data frame with `data.frame` and take a brief side-by-side look with `head` and `kable`.
```{r}
ohio_lung$R_SPAT_RT <- ohio_lung$R_SPAT_R * 10000
df <- data.frame(LRATE = ohio_lung$LRATE,W_LRATE = ohio_lung$W_LRATE,R_SPAT_RT= ohio_lung$R_SPAT_RT)
kable(head(df))
```
To get a more quantifiable comparison than the table above, we can make a scatterplot with **ggplot2** functions.
We will not go into depth on the different options here, but will create a basic scatterplot of **W_LRATE**
and **R_SPAT_RT**. `geom_point` adds the points to the plot, and `geom_smooth(method=lm)` adds a linear
regression line.
```{r}
ggplot(data=ohio_lung,aes(x=W_LRATE,y=R_SPAT_RT)) +
geom_point() +
geom_smooth(method=lm)
```
`lm` fits the linear regression associated with the scatterplot above. We then use `tidy` and
`kable` to display the regression statistics in a neat table.
```{r}
lmfit <- lm(R_SPAT_RT ~ W_LRATE,data = ohio_lung)
kable(tidy(lmfit))
```
#### Inverse Distance Smoothed Rate {-}
For the inverse distance weights, we proceed with a different approach. The **k10.weights** do not
include diagonal elements, so we will need to address this in our computation of the rate. We begin
with the lag variables for population and event count with our inverse distance weights. To add the diagonal
element in our rate calculation, we add the event count variable to the lag event count variable and do the
same with population before dividing to compute the rate. This works because the diagonal element would be assigned a weight of 1
in the weights structure, and as a result would just add the original value at the location to the lag value,
had it been included in the weights structure.
```{r}
lag_lfw68 <- lag.listw(k10.weights,ohio_lung$LFW68)
lag_popfw68 <- lag.listw(k10.weights,ohio_lung$POPFW68)
ohio_lung$ID_RATE <- (lag_lfw68 + ohio_lung$LFW68)/ (lag_popfw68 + ohio_lung$POPFW68) * 10000
```
```{r}
tm_shape(ohio_lung) +
tm_fill("ID_RATE",palette = "-RdBu",style ="sd",title = "Standard Deviation IDRATE") +
tm_borders() +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
```
#### Kernel Smoothed Rate {-}
We follow the same process for the kernel weights calculation as for the inverse distance weights, except that
we rescale the diagonal element by 0.75, the value the Epanechnikov kernel function assigns to the diagonal.
```{r}
lag_lfw68 <- lag.listw(epan.weights,ohio_lung$LFW68)
lag_popfw68 <- lag.listw(epan.weights,ohio_lung$POPFW68)
ohio_lung$KERN_RATE <- (lag_lfw68 + .75 * ohio_lung$LFW68)/ (lag_popfw68 + .75 * ohio_lung$POPFW68) * 10000
```
```{r}
tm_shape(ohio_lung) +
tm_fill("KERN_RATE",palette = "-RdBu",style ="sd",title = "Standard Deviation: KERN_RATE") +
tm_borders() +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
```
```{r}
df <- df %>% mutate(ID_RATE = ohio_lung$ID_RATE, KERN_RATE = ohio_lung$KERN_RATE)
kable(head(df))
```
It is important to keep in mind that both the inverse distance and kernel weights spatially smoothed rates are based on a
particular trade-off between the value at the location and its neighbors. This trade-off depends critically on the distance
metric used in the calculations (or, on the scale in which the coordinates are expressed). There is no right answer, and a
thorough sensitivity analysis is advised.
For example, we can observe that the three spatially smoothed maps point to some elevated rates in
the south of the state, but the extent of the respective regions and the counties on which they are centered differ slightly.
Also, the general regional patterns are roughly the same, but there are important differences in terms of the specific
counties affected.
## Spatial Empirical Bayes smoothing {-}
### Principle {-}
The second option for spatial rate smoothing is Spatial Empirical Bayes. This
operates in the same way as the standard Empirical Bayes smoother (covered in the
rate mapping Chapter), except that the reference rate is computed for a spatial
window for each individual observation, rather than taking the same overall
reference rate for all. This only works well for larger data sets, when the window
(as defined by the spatial weights) is large enough to allow for effective
smoothing.
Similar to the standard EB principle, a reference rate (or prior) is computed.
However, here, this rate is estimated from the spatial window surrounding a given
observation, consisting of the observation and its neighbors. The neighbors are
defined by the non-zero elements in the row of the spatial weight matrix (i.e.,
the spatial weights are treated as binary).
Formally, the reference mean for location i is then:
$$\mu_i = \frac{\sum_jw_{ij}O_j}{\sum_jw_{ij}P_j}$$
with $w_{ij}$ as binary spatial weights, and $w_{ii}=1$.
The local estimate of the prior variance follows the same logic as for EB, but replacing the population and rates by their local counterparts:
$$\sigma_i^2 = \frac{\sum_jw_{ij}[P_j(r_j - \mu_i)^2]}{\sum_jw_{ij}P_j} - \frac{\mu_i}{\sum_jw_{ij}P_j/(k_i +1)}$$
Note that the average population in the second term pertains to all locations within
the window; therefore, this is divided by $k_i + 1$ (with $k_i$ as the number of
neighbors of i). As in the case of the standard EB rate, it is quite possible (and
quite common) to obtain a negative estimate for the local variance, in which case it is
set to zero.
The spatial EB smoothed rate is computed as a weighted average of the crude rate and
the prior, in the same manner as for the standard EB rate. For reference, the rate in
this case is as denoted below, with the local reference rate $\mu_i$ as the prior:
$$w_i = \frac{\sigma_i^2}{\sigma_i^2 + \mu_i / P_i}$$
$$\pi_i^{EB}= w_ir_i + (1-w_i)\mu_i$$
### Spatial EB rate smoother {-}
The spatial Empirical Bayes smoother is included in the **spdep** package. It even has a `geoda =` argument that calculates the rate in the same manner as GeoDa. To use it, we call the `EBlocal` function from **spdep** with the two variables and a neighbors list, and set `geoda = TRUE`.
```{r, message=FALSE,warning=FALSE}
eb_risk <- EBlocal(ohio_lung$LFW68, ohio_lung$POPFW68, second.order.queen.cumul, geoda = TRUE)
```
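A quick look at the result shows that `EBlocal` returns a data frame with two columns, **raw** (the crude rate) and **est** (the smoothed estimate):
```{r}
kable(head(eb_risk))
```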
With our Empirical Bayes calculation, we can now examine it with a choropleth map.
Before we proceed, we add **eb_risk** to the data frame. We need to select **est**,
not **raw**, to get the smoothed figures.
```{r}
ohio_lung <- ohio_lung %>% mutate(eb_risk = eb_risk$est)
tm_shape(ohio_lung) +
tm_fill("eb_risk",palette = "-RdBu",style ="sd",title = "SEBS-Smoothed LFW68 over POPFW68") +
tm_borders() +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
```
We do the same for the 10-nearest neighbors:
```{r, message=FALSE,warning=FALSE}
eb_risk_k10 <- EBlocal(ohio_lung$LFW68, ohio_lung$POPFW68, k10, geoda = TRUE)
ohio_lung <- ohio_lung %>% mutate(eb_risk_k10 = eb_risk_k10$est)
tm_shape(ohio_lung) +
tm_fill("eb_risk_k10",palette = "-RdBu",style ="sd",title = "SEBS-Smoothed LFW68 over POPFW68") +
tm_borders() +
tm_layout(legend.outside = TRUE, legend.outside.position = "right")
```
As with the spatial rate smoothing approaches, a careful sensitivity analysis is in
order. The results depend considerably on the range used for the reference rate. The
larger the range, the more the smoothed rates will be similar to the global EB rate.
For a narrow range, the estimate of the local variance will often result in a zero
imputed value, which makes the EB approach less meaningful.