Adding References #18

Open · wants to merge 3 commits into base: master
2 changes: 1 addition & 1 deletion 01-spatial-data-handling.Rmd
@@ -7,7 +7,7 @@ This R notebook covers the functionality of the [Spatial Data Handling](http://g
The notes are written with R beginners in mind; more seasoned R users can probably skip most of the comments
on data structures and other R particulars. Also, as always in R, there are typically several ways to achieve a specific objective, so what is shown here is just one way that works, but there often are others (that may even be more elegant, work faster, or scale better).

In this lab, we will use the City of Chicago open data portal to download data on abandoned vehicles. Our end goal is to create a choropleth map with abandoned vehicles per capita for Chicago community areas. Before we can create the maps, we will need to download the information, select observations, aggregate data, join different files and carry out variable transformations in order to obtain a so-called “spatially intensive” variable for mapping (i.e., not just a count of abandoned vehicles, but a per capita ratio).
In this lab, we will use the City of Chicago open data portal to download data on abandoned vehicles. Our end goal is to create a choropleth map with abandoned vehicles per capita for Chicago community areas. Before we can create the maps, we will need to download the information, select observations, aggregate data, join different files and carry out variable transformations in order to obtain a so-called “spatially intensive” variable for mapping (i.e., not just a count of abandoned vehicles, but a per capita ratio). These manipulations (also called data munging or wrangling) are typically required to get your data set ready for analysis. It is commonly argued that they take around 80% of the effort in a data science project (@data_mining).
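
As a minimal illustration of the kind of per capita transformation involved (the column names below are made-up stand-ins, not the actual portal fields):

```{r}
# hypothetical sketch: counts of abandoned vehicles per community area joined
# to population, then turned into a per capita ("spatially intensive") variable
library(dplyr)

veh.counts <- data.frame(community = c("1", "2", "3"),
                         vehicles = c(120, 45, 300))
pop <- data.frame(community = c("1", "2", "3"),
                  population = c(54000, 71000, 23000))

veh.capita <- veh.counts %>%
  inner_join(pop, by = "community") %>%
  mutate(veh.per.1000 = 1000 * vehicles / population)
veh.capita
```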

### Objectives {-}

8 changes: 4 additions & 4 deletions 02-eda-1.Rmd
@@ -118,8 +118,8 @@ We follow the discussion in the GeoDa workbook and start with the common univariate
descriptive graphs, the histogram and box plot. Before covering the specifics, we
provide a brief overview of the principles behind the **ggplot** operations.

Note that linking and brushing between a plot and a map is not (yet) readily
implemented in R, so that our discussion will focus primarily on static graphs.
Note that linking^[Linking refers to how a selection in any of the views results in the same observations being immediately selected in all other views.] and brushing^[Brushing is a dynamic extension of the linking process. For some early exposition and discussion of these ideas pertaining to so-called dynamic graphics, see, e.g., the classic references of @s87, @bc87, @bcw87, and @m89, as well as the outline of legacy `GeoDa` functionality in @ask06.] between a plot and a map is not (yet) readily
implemented in R, so that our discussion will focus primarily on static graphs.

### A quick introduction to **ggplot** {-}
We will be using the commands in the **ggplot2** package for the descriptive statistics plots. There are many options to create nice looking graphs in R, including the functionality in base R, but we chose **ggplot2** for its clean logic and its
@@ -654,7 +654,7 @@ results on the graph. We don't pursue this any further.

#### Loess smoother {-}
The default nonlinear smoother in **ggplot** uses the **loess** algorithm as a locally
weighted regression model. This is similar in spirit to the **LOWESS** method used in GeoDa, but not the same.^[See the GeoDa workbook for further discussion] The implementation is along the same lines as the linear smoother, using
weighted regression model. This is similar in spirit to the **LOWESS** method used in GeoDa, but not the same.^[See the GeoDa workbook, as well as @c79 and @l99, for further discussion.] The implementation is along the same lines as the linear smoother, using
`geom_smooth`, with the only difference that the `method` is now `loess`, as shown below.

```{r}
@@ -810,7 +810,7 @@ ggplot(nyc.data,aes(x=kids2000,y=pubast00)) +


### Chow test {-}
In GeoDa, a Chow test on the equality of the regression coefficients between the selected and unselected observations is calculated on the fly and shown at the
In GeoDa, a Chow test (@c60) on the equality of the regression coefficients between the selected and unselected observations is calculated on the fly and shown at the
bottom of the scatter plot. This is not supported by **ggplot**, but we can
run separate regressions for each subset using `lm`. We can also run the Chow test itself, using the `chow.test` command from the **gap** package.
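
A rough sketch of how the two pieces fit together is shown below; the data and the selection indicator are synthetic stand-ins for the NYC variables and the map/plot selection used in the lab.

```{r}
# synthetic illustration of separate subset regressions plus gap::chow.test()
library(gap)

set.seed(123)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
selected <- x > median(x)                   # stand-in for a brushed selection

lm.sel <- lm(y[selected] ~ x[selected])     # regression on the selected subset
lm.uns <- lm(y[!selected] ~ x[!selected])   # regression on the unselected subset
coef(lm.sel)
coef(lm.uns)

chow.test(y[selected], as.matrix(x[selected]),
          y[!selected], as.matrix(x[!selected]))
```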

2 changes: 1 addition & 1 deletion 03-eda-2.Rmd
@@ -360,7 +360,7 @@ plot_ly(nyc.data, x = ~kids2000, y = ~pubast00, z = ~rent2002) %>%
## True Multivariate EDA: Parallel Coordinate Plot and Conditional Plots {-}
True multivariate EDA deals with situations where more than three variables
are considered. We follow the GeoDa Workbook and illustrate the Parallel Coordinate
Plot, or PCP, and conditional plots. For the former, we again need to resort to
Plot, or PCP^[The parallel coordinate plot or PCP is designed to visually identify clusters and patterns in a multi-dimensional variable space. Originally suggested by @i85 (see also @i90), it has become a main feature in many visual data mining frameworks, e.g., @w90 and @wd03.], and conditional plots^[Conditional plots are also known as facet graphs or Trellis graphs (@bcs96).]. For the former, we again need to resort to
**GGally**, but for the latter, we can exploit the `facet_wrap` and `facet_grid` functions of **ggplot**. In addition, we can turn these plots into interactive graphs by means of the **plotly** functionality.


10 changes: 5 additions & 5 deletions 04-mapping.Rmd
@@ -555,7 +555,7 @@ tm_shape(nyc.bound) +

### Natural breaks map {-}

A natural breaks map is obtained by specifying the **style = "jenks"** in `tm_fill`. All
A natural breaks map^[A natural breaks map uses a nonlinear algorithm to group observations such that the within-group homogeneity is maximized, following the pathbreaking work of @f58 and @j77.] is obtained by specifying **style = "jenks"** in `tm_fill`. All
the other options are as before. Again, we illustrate this for four categories,
with **n=4**.
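
A minimal sketch of the call is shown below, assuming `nyc.bound` has been loaded as above; `"kids2000"` is only a stand-in for whichever variable is being mapped.

```{r}
# natural breaks (Jenks) map with four classes; "kids2000" is a placeholder name
library(tmap)

tm_shape(nyc.bound) +
  tm_fill("kids2000", style = "jenks", n = 4) +
  tm_borders()
```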

@@ -616,7 +616,7 @@ tm_shape(nyc.bound) +
## Extreme Value Maps {-}

In addition to the common map classifications, GeoDa also supports three types of extreme
value maps: a percentile map, box map, and standard deviation map. For details on the
value maps^[Extreme value maps are variations of common choropleth maps where the classification is designed to highlight extreme values at the lower and upper end of the scale, with the goal of identifying outliers. These maps were developed in the spirit of spatializing EDA, i.e., adding spatial features to commonly used approaches in non-spatial EDA (@a94).]: a percentile map, box map, and standard deviation map. For details on the
rationale and methodology behind these maps, we refer to the GeoDa Workbook.

Of the three extreme value maps, only
@@ -978,7 +978,7 @@ tm_shape(nyc.bound) +


### Co-location map {-}
A special case of a map for categorical variables is a so-called co-location map,
A special case of a map for categorical variables is a so-called co-location map^[The idea behind a co-location map is the extension of the unique value map concept to a multivariate context. In essence, it is the implementation of ideas related to the principles of map overlay or map algebra applied to categorical maps. Map algebra tends to be geared to applications for raster data, i.e., regular grids. However, since the polygons for the different variables are identical, the same principles can be applied in the context of the categorical maps. The classic reference on the principles of map algebra is @t90.],
implemented in GeoDa. This map shows the values for those locations where two
categorical variables take on the same value (it is up to the user to make sure
the values make sense). Further details are given in the GeoDa Workbook.
@@ -1103,7 +1103,7 @@ tm_shape(nyc.bound) +

## Conditional Map {-}

A conditional map, or facet map, or small multiples, is created by the `tm_facets` command.
A conditional map^[Discussed at length in @cp10.], or facet map, or small multiples, is created by the `tm_facets` command.
This largely follows the logic of the `facet_grid` command in **ggplot** that we covered in the
EDA notes. An extensive set of options is available to customize the facet maps. An in-depth
coverage of all the subtleties is beyond our scope
@@ -1154,7 +1154,7 @@ tm_shape(nyc.bound) +

## Cartogram {-}

A final map functionality that we replicate from the GeoDa Workbook is the cartogram. GeoDa
A final map functionality that we replicate from the GeoDa Workbook is the cartogram^[A cartogram is a map type where the original layout of the areal unit is replaced by a geometric form (usually a circle, rectangle, or hexagon) that is proportional to the value of the variable for the location. This is in contrast to a standard choropleth map, where the size of the polygon corresponds to the area of the location in question. The cartogram has a long history and many variants have been suggested, some quite creative. In essence, the construction of a cartogram is an example of a nonlinear optimization problem, where the geometric forms have to be located such that they reflect the topology (spatial arrangement) of the locations as closely as possible (see @t04 for an extensive discussion of various aspects of the cartogram).]. GeoDa
implements a so-called circular cartogram, where circles represent spatial units and their
size is proportional to a specified variable.

6 changes: 3 additions & 3 deletions 05-rate-mapping.Rmd
@@ -356,7 +356,7 @@ other observations. This idea goes back to the fundamental contributions of James
and Stein (the so-called James-Stein paradox), who showed that in some instances
biased estimators may have better precision in a mean squared error sense.

GeoDa includes three methods to smooth the rates: an Empirical Bayes approach, a
GeoDa includes three methods to smooth the rates: an Empirical Bayes approach^[There are several excellent books and articles on Bayesian statistics, with @gcsdvr14 as a classic reference.], a
spatial averaging approach, and a combination between the two. We will consider
the spatial approaches after we discuss distance-based spatial weights. Here, we
focus on the Empirical Bayes (EB) method. First, we provide some formal
@@ -396,7 +396,7 @@ prior and the likelihood in such a way that a proper posterior distribution
results. In the context of rate estimation, the standard approach is to specify a
Poisson distribution for the observed count of events (conditional upon the risk
parameter), and a Gamma distribution for the prior of the risk parameter $\pi$.
This is referred to as the Poisson-Gamma model.
This is referred to as the Poisson-Gamma model^[For an extensive discussion, see, for example, the classic papers by @ck87 and @m91.].

In this model, the prior distribution for the (unknown) risk parameter $\pi$ is
$Gamma(\alpha,\beta)$, where $\alpha$ and $\beta$ are the shape and scale
@@ -439,7 +439,7 @@ In essence, the EB technique consists of computing a weighted average between the
raw rate for each county and the state average, with weights proportional to the
underlying population at risk. Simply put, small counties (i.e., with a small
population at risk) will tend to have their rates adjusted considerably, whereas
for larger counties the rates will barely change.
for larger counties the rates will barely change^[For an extensive technical discussion, see also @alk06.].

More formally, the EB estimate for the risk in location i is:
$$\pi_i^{EB}=w_ir_i + (1-w_i)\theta$$
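
In R, a comparable Empirical Bayes smoother is available as `EBest` in **spdep**. The small sketch below uses made-up counts and populations rather than the lab data.

```{r}
# Empirical Bayes smoothing of raw rates with spdep::EBest(); synthetic inputs
library(spdep)

events <- c(2, 0, 5, 1, 12, 3)
population <- c(1000, 150, 4000, 300, 20000, 2500)

eb <- EBest(events, population)
eb$raw     # crude rates: events / population
eb$estmm   # EB estimates, shrunk toward the overall rate
```
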
8 changes: 4 additions & 4 deletions 06-contiguity-spatial-weights.Rmd
@@ -7,7 +7,7 @@ This notebook covers the functionality of the [Contiguity-Based Spatial Weights]
The notes are written with R beginners in mind; more seasoned R users can probably skip most of the comments
on data structures and other R particulars. Also, as always in R, there are typically several ways to achieve a specific objective, so what is shown here is just one way that works, but there often are others (that may even be more elegant, work faster, or scale better).

For this notebook, we use U.S. Homicide data. Our goal in this lab is show how to implement contiguity based spatial weights
For this notebook, we use U.S. Homicide data. Our goal in this lab is to show how to implement contiguity-based spatial weights.


```{r}
@@ -105,7 +105,7 @@ In practice, the construction of the spatial weights from the geometry of the da
cannot be done by visual inspection or manual calculation, except in the most
trivial of situations. To assess whether two polygons are contiguous requires the
use of explicit spatial data structures to deal with the location and arrangement of
the polygons. This is implemented through the spatial weights functionality in
the polygons^[Further technical details on spatial weights are contained in Chapters 3 and 4 of @ass02.]. This is implemented through the spatial weights functionality in
GeoDa. We will do this with the **sf** and **spdep** libraries.

We will create our neighbors using **sf** first, as the **spdep** library doesn't
@@ -235,7 +235,7 @@ sf.nb.queen <- as.nb.sgbp(sf.sgbp.queen)

## Higher Order Contiguity {-}

Now we move on to higher order contiguity weights. To make these we will need the
Now we move on to higher order contiguity weights^[Importantly, pure higher order contiguity differs from cumulative neighbor definitions in that it does not include any lower order neighbors. This is the notion appropriate for use in a statistical analysis of spatial autocorrelation at different spatial lag orders. In order to achieve this, all redundant and circular paths need to be removed (see @as96 for a technical discussion).]. To make these we will need the
**spdep** package. We will use the `nblag` and `nblag_cumul` functions to compute
the higher order weights.
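
As a rough sketch, using the queen neighbor list `sf.nb.queen` created above:

```{r}
# second order contiguity: nblag() gives the pure lag orders, nblag_cumul()
# the cumulative (first plus second order) neighbor list
library(spdep)

nb.lags <- nblag(sf.nb.queen, maxlag = 2)
second.order <- nb.lags[[2]]          # second order only, no first order neighbors
cumul.second <- nblag_cumul(nb.lags)  # first and second order combined
summary(second.order)
```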

@@ -431,7 +431,7 @@ summary(rook.card)

## Saving Neighbors {-}
To save our neighbors list, we use the `write.nb.gal` function from
the **spdep** package. The file format is a GAL Lattice file. We
the **spdep** package. The file format is a GAL Lattice file^[The GAL weights file is a simple text file that contains, for each observation, the number of neighbors and their identifiers. The format was suggested in the 1980s by the Geometric Algorithms Lab at Nottingham University and achieved widespread use after its inclusion in `SpaceStat` (@a92), and subsequent adoption by the R `spdep` package and others.]. We
input the neighbors list first and the filename second. We have two
options from this point: we can save the file with the old style
or the new GeoDa header style.
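
For example (the file name is arbitrary, and `sf.nb.queen` is the queen neighbor list built earlier):

```{r}
# save the queen neighbors list as a GAL file with the old style header
library(spdep)

write.nb.gal(sf.nb.queen, "us_homicide_queen.gal", oldstyle = TRUE)
```
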
4 changes: 2 additions & 2 deletions 07-distance-based-spatial-weights.Rmd
@@ -7,7 +7,7 @@ This notebook covers the functionality of the [Distance-Based Spatial Weights](ht
The notes are written with R beginners in mind; more seasoned R users can probably skip most of the comments
on data structures and other R particulars. Also, as always in R, there are typically several ways to achieve a specific objective, so what is shown here is just one way that works, but there often are others (that may even be more elegant, work faster, or scale better).

For this notebook, we use Cleveland homesale point data. Our goal in this lab is show how to implement distance-band spatial weights
For this notebook, we use Cleveland homesale point data. Our goal in this lab is to show how to implement distance-band spatial weights^[Further technical details on distance-based spatial weights are contained in Chapters 3 and 4 of @ar14, although the software illustrations are for an earlier `GeoDa` interface design.].



@@ -473,7 +473,7 @@ plot(k6, coords, lwd=.2, col="blue", cex = .5)
## Generalizing the Concept of Contiguity {-}

In GeoDa, the concept of contiguity can be generalized to point layers by converting
the latter to a tessellation, specifically Thiessen polygons. Queen or rook contiguity
the latter to a tessellation, specifically Thiessen polygons^[For a more extensive technical discussion and historical background, see @y16.]. Queen or rook contiguity
weights can then be created for the polygons, in the usual way.
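
A minimal sketch of the idea with **sf** is given below; the points are synthetic, but in the lab the Cleveland home sale points would take their place.

```{r}
# Thiessen (Voronoi) polygons from a point layer; queen or rook weights can
# then be built for these polygons just as for any other polygon layer
library(sf)

set.seed(3)
pts <- st_as_sf(data.frame(x = runif(10), y = runif(10)),
                coords = c("x", "y"))

thiessen <- st_collection_extract(st_voronoi(st_union(pts)), "POLYGON")
plot(thiessen)
```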

Similarly, the concepts of distance-band weights and k-nearest neighbor weights can be
6 changes: 3 additions & 3 deletions 08-spatial-weights-as-distance-functions.Rmd
@@ -96,7 +96,7 @@ $$w_{ij}=f(d_{ij},\theta)$$
with f as a functional form and $\theta$ a vector of parameters.

In order to conform to Tobler’s first law of geography, a distance decay effect must be
respected. In other words, the value of the function of distance needs to decrease with a
respected^[Tobler's so-called first law of geography postulates that everything is related to everything else, but closer things more so (@t70).]. In other words, the value of the function of distance needs to decrease with a
growing distance. More formally, the partial derivative of the distance function with respect
to distance should be negative, $\partial w_{ij}/\partial d_{ij} < 0$.
@@ -294,7 +294,7 @@ invd.weights.knn$weights[1]

Kernel weights are used in non-parametric approaches to model spatial covariance, such
as in the HAC method for heteroskedastic and spatial autocorrelation consistent
variance estimates.
variance estimates^[This method is currently not implemented in GeoDa, but is available in GeoDaSpace and PySAL (see @hp94 and @kp07, among others, for technical aspects, and @ass02 for implementation details).].

The kernel weights are defined as a function $K(z)$ of the ratio between the distance $d_{ij}$
from $i$ to $j$ and the bandwidth $h_i$, with $z=d_{ij}/h_i$. This ensures that $z$ is
@@ -334,7 +334,7 @@ farthest apart.

In creating kernel weights, we will cover two important options: the fixed bandwidth
and the variable bandwidth. For the fixed bandwidth, we will be using distance-band
neighbors. For the variable bandwidth we will need kth-nearest neighbors.
neighbors. For the variable bandwidth we will need kth-nearest neighbors^[In GeoDa the default value for k equals the cube root of the number of observations (following the recommendation in @kp07). In general, a wider bandwidth gives smoother and more robust results, so the bandwidth should always be set at least as large as the recommended default.].

To start, we will compute a new distance-band neighbors list with the critical threshold,
calculated earlier in the notebook.
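
To make the mechanics concrete, here is a small synthetic sketch of fixed-bandwidth kernel weights built by hand; in the notebook itself the critical threshold computed earlier plays the role of the bandwidth.

```{r}
# fixed-bandwidth kernel weights from scratch on synthetic points; the bandwidth
# is the largest nearest-neighbor distance, so every point has at least one neighbor
library(spdep)

set.seed(42)
coords <- cbind(runif(20), runif(20))
bandwidth <- max(unlist(nbdists(knn2nb(knearneigh(coords, k = 1)), coords)))

nb.band <- dnearneigh(coords, 0, bandwidth)          # distance-band neighbors
dists <- nbdists(nb.band, coords)                    # distances to those neighbors
kw <- lapply(dists, function(d) 1 - d / bandwidth)   # triangular kernel K(z) = 1 - z
kw[1]
```
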
9 changes: 4 additions & 5 deletions 09-applications-of-spatial-weights.Rmd
Expand Up @@ -503,7 +503,7 @@ kable(head(df))

$$[Wy]_i = \sum_j K_{ij}y_j$$


Kernel-based spatially lagged variables correspond to a form of local smoothing. They can be used in specialized regression specifications, such as geographically weighted regression (GWR)^[GWR is not implemented in GeoDa. For further details on the use of kernel-based spatially lagged variables in GWR, see, e.g., @fbc02.].
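
A small synthetic sketch of such a lagged variable with `lag.listw` follows; simple row-standardized k-nearest neighbor weights stand in for the kernel weights.

```{r}
# spatially lagged variable [Wy]_i = sum_j w_ij * y_j via spdep::lag.listw()
library(spdep)

set.seed(7)
coords <- cbind(runif(10), runif(10))
y <- rnorm(10)

nb <- knn2nb(knearneigh(coords, k = 3))
lw <- nb2listw(nb, style = "W")   # row-standardized weights as a stand-in
lag.listw(lw, y)
```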



@@ -513,7 +513,7 @@
### Principle {-}

A spatial rate smoother is a special case of a nonparametric rate estimator, based on
the principle of locally weighted estimation. Rather than applying a local average to
the principle of locally weighted estimation (see, e.g., @wg04). Rather than applying a local average to
the rate itself, as in an application of a spatial window average, the weighted
average is applied separately to the numerator and denominator.

@@ -530,7 +530,7 @@ diagonal)

Different smoothers are obtained for different spatial definitions of neighbors and/or
different weights applied to those neighbors (e.g., contiguity weights, inverse
distance weights, or kernel weights).
distance weights, or kernel weights)^[An early example was the spatial rate smoother outlined in @k96, based on the notion of a spatial moving average or window average (see also @97).].
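
The sketch below illustrates the principle on synthetic data: the same window average (here binary k-nearest neighbor weights that include the location itself) is applied to the numerator and the denominator, and only then is the ratio taken.

```{r}
# spatial rate smoother on synthetic counts and populations at risk
library(spdep)

set.seed(11)
coords <- cbind(runif(15), runif(15))
events <- rpois(15, lambda = 3)
population <- round(runif(15, 500, 5000))

nb <- include.self(knn2nb(knearneigh(coords, k = 4)))  # window includes i itself
lw <- nb2listw(nb, style = "B")                        # binary weights: window sums

smoothed <- lag.listw(lw, events) / lag.listw(lw, population)
cbind(raw = events / population, smoothed = round(smoothed, 4))
```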


The window average is not applied to the rate itself, but it is computed separately for the
@@ -942,8 +942,7 @@ set to zero.


The spatial EB smoothed rate is computed as a weighted average of the crude rate and
the prior, in the same manner as for the standard EB rate

the prior, in the same manner as for the standard EB rate (see the discussion in the Chapter on mapping rates, as well as @alk06, for technical details).
For reference, the EB weight in this case is:

$$w_i = \frac{\sigma_i^2}{\sigma_i^2 + \mu_i / P_i}$$
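
**spdep** provides a local (spatially structured) EB smoother along these lines in `EBlocal`, which takes the event counts, the populations at risk, and a neighbor list; the sketch below uses synthetic inputs.

```{r}
# local Empirical Bayes smoothing with spdep::EBlocal(); synthetic inputs
library(spdep)

set.seed(5)
coords <- cbind(runif(15), runif(15))
events <- rpois(15, lambda = 4)
population <- round(runif(15, 500, 5000))

nb <- knn2nb(knearneigh(coords, k = 4))
eb.local <- EBlocal(events, population, nb)
head(eb.local)   # raw = crude rate, est = locally smoothed EB rate
```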