<style>@import url(style.css);</style>
[Introduction to Data Analysis](index.html "Course index")
# 9.1. Autocorrelation
[Jay Ulfelder][ju], a political scientist who has been credited with publishing [rather accurate lists][fp] of the countries most likely to experience military coups in recent years, wrote on the topic: "Unsurprisingly, coups also turn out to be a recurrent problem; the risk is higher in countries that have experienced other coup attempts in the past several years, a factor common to the top eight countries on this list."
[ju]: https://dartthrowingchimp.wordpress.com/
[fp]: http://blog.foreignpolicy.com/posts/2013/03/25/stats_junkie_successfully_predicts_african_coup_again
This observation applies in many contexts involving relatively slow-moving quantities: the unemployment rate, for instance, is highly redundant from one month to the next, modulo a small percentage change; GDP closely approximates its own previous value from one year to the next; and so on. In brief, the best way to predict an event at time $t$ is often to look at that same event at $t-1, t-2, \dots, t-k$, with $k$ lags.
This form of temporal dependence is called autocorrelation, or serial correlation, in the context of time series. It can be shown in dedicated plots, like correlograms, or expressed through notions like marginal change (which relies mathematically on derivatives), lagged values, or detrended series. We will sample a few of these methods below.
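To make the idea concrete before turning to real data, here is a minimal simulated sketch (a hypothetical AR(1) process, not course data): each value is built from the previous one, so the series correlates strongly with its own first lag.

```{r ar1-sketch}
# Simulate an AR(1) process, where each value depends on the previous one.
set.seed(3)
n <- 200
x <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
# The correlation between the series and its first lag is close to the
# autoregressive coefficient (0.8).
cor(x[-1], x[-n])
```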
```{r packages, message=FALSE, warning=FALSE}
# Load packages, installing any that are missing.
packages <- c("downloader", "ggplot2", "MASS", "reshape", "splines")
packages <- lapply(packages, FUN = function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x)
    library(x, character.only = TRUE)
  }
})
```
## A time series of air pollution in Beijing
We use a Twitter source that [logs pollution data][bc-tap-1] in Beijing, where there was a [pollution peak][bc-tap-2] in January 2013 (the city [looked like][bj] *Blade Runner* with [more fog][ss], and staying in Beijing at the time amounted to [smoking 1.5 to 3 cigarettes a day][ss-smok]).
[bc-tap-1]: http://brainchronicle.blogspot.co.uk/2012/07/twitter-analysis-of-air-pollution-in.html
[bc-tap-2]: http://brainchronicle.blogspot.fr/2013/01/air-quality-analysis-from-beijing.html
[bj]: http://kateoplis.tumblr.com/post/40555052298/nope-this-is-not-a-still-from-blade-runner-its
[ss]: http://simplystatistics.org/2013/01/14/welcome-to-the-smog-ocalypse/
[ss-smok]: http://simplystatistics.org/2011/12/14/smoking-is-a-choice-breathing-is-not/
Unfortunately, the Twitter API is now less open than it used to be, so we cannot access the data publicly. I have saved the results of a [recent scrape][beijing-scraper] for your viewing pleasure, as the U.S. Embassy has kept its logging machine running even under pressure to shut it down. Start by reading about the [Air Quality Index][aqi] (AQI) used to report pollution levels.
[beijing-scraper]: 9_twitter.R
[aqi]: http://airnow.gov/index.cfm?action=aqibasics.aqi
```{r twitter-pollution-auto, fig.width = 12, fig.height = 9, message = FALSE, tidy = FALSE}
# Target data source.
link = "https://raw.github.com/briatte/ida/master/data/beijing.aqi.2013.csv"
file = "data/beijing.aqi.2013.csv"
if(!file.exists(file)) download(link, file)
# Read CSV file.
bp <- read.csv(file, stringsAsFactors = FALSE)
# Check result.
head(bp)
# Convert the date strings to date-time values.
bp$time <- as.POSIXct(strptime(bp$time, format = "%Y-%m-%d %T"))
# Plot air pollution.
ggplot(data = bp, aes(x = time, y = PM)) +
  geom_line(color = "gray80") +
  geom_point(color = "blue", alpha = .5) +
  geom_smooth(fill = "lightblue") +
  labs(x = NULL, y = "Fine particles (PM2.5) 24hr avg")
```
In this example, the main interest is the time series formed by the repeated measures of a single variable. Teetor, ch. 14, offers a quick introduction to the topic, using the dedicated `zoo` and `xts` packages. Our introduction will just show how to detrend and lag the series with base functions.
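If you want a glimpse of the dedicated classes that Teetor covers, here is a minimal sketch using `zoo` (an assumption: the package is installed; nothing else on this page depends on it).

```{r zoo-sketch, eval = FALSE}
# Wrap the series in a zoo object indexed by time, then lag it; k = -1
# shifts the series back by one observation. (Requires the zoo package.)
library(zoo)
z <- zoo(bp$PM, order.by = as.POSIXct(bp$time))
head(lag(z, k = -1))
```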
There are plenty of other ways to look at the data. Here is one using cubic splines.
```{r twitter-spline-auto, fig.width = 12, fig.height = 9, message = FALSE, tidy = FALSE}
# Robust linear fit (MASS::rlm) over a natural cubic spline basis with
# 12 degrees of freedom.
ggplot(data = bp, aes(x = time, y = PM)) +
  geom_line(color = "gray80") +
  geom_point(color = "blue", alpha = .5) +
  geom_smooth(method = "rlm", formula = y ~ ns(x, 12), alpha = .25, fill = "lightblue") +
  labs(x = NULL, y = "Fine particles (PM2.5) 24hr avg")
```
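In case you wonder what `ns(x, 12)` feeds to the model above: it builds a natural cubic spline basis, one column per degree of freedom. A quick check, using the `bp` data loaded earlier:

```{r ns-basis}
# The spline basis behind geom_smooth() above: one row per observation,
# twelve basis columns.
basis <- ns(as.numeric(as.POSIXct(bp$time)), df = 12)
dim(basis)
```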
## Detrending a time series
_Differencing_ a time series means subtracting $x_{t-1}$ from each value $x_t$, that is, obtaining the net difference between two values separated by one time period. This is equivalent to asking for the marginal change in $x$ at every point $t$ of the curve $x(t)$. The `diff` function offers a very convenient way to obtain lagged differences:
```{r ts-diff, fig.width = 12, fig.height = 9, warning = FALSE, tidy = FALSE}
# Plot a differenced time series.
qplot(x = bp$time[-1],
      y = diff(bp$PM),
      geom = "line") +
  labs(x = "t")
```
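As a quick sanity check on what `diff` returns (using the `bp` data from above), it is just the vectorized subtraction of each value from the next:

```{r diff-check}
# diff(x) is x[t] - x[t-1] for t = 2, ..., n.
n <- nrow(bp)
all.equal(diff(bp$PM), bp$PM[-1] - bp$PM[-n])
```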
Similarly, if you shift the series by $k = 1, 2, \dots, 9$ days, as below, and then divide the shifted values by the original ones, you get an indication of how frequent the pollution spikes are within the overall trend. This is particularly useful for detecting seasonal patterns, like weekly spikes and drops.
```{r ts-lags, fig.width = 12, fig.height = 9, warning = FALSE, tidy = FALSE}
# Set vector of lag orders, in days.
d = 1:9
# Shift the series by one to nine days, padding the end with NAs.
lags = sapply(d, FUN = function(i) { c(bp$PM[-1:-i], rep(NA, i)) } )
# Divide shifted values by the original series.
lags = lags / bp$PM
# Create lags dataset.
lags = data.frame(bp$time, lags)
# Fix variable names.
names(lags) = c("t", d)
# Melt data over days.
lags = melt(lags, id = "t", variable_name = "lag")
# Plot lagged dataset.
qplot(data = lags,
      x = t,
      y = value,
      colour = lag,
      geom = "line") +
  labs(x = "t") +
  scale_colour_brewer() +
  theme(legend.position = "none")
```
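A rough numeric complement to that plot, reusing the `lags` data frame from the chunk above: the median ratio at each lag order (values near 1 mean the series stays close to its own past at that lag).

```{r lag-medians}
# Median ratio of shifted to current values, per lag order; NA rows are
# dropped by aggregate() before the medians are computed.
aggregate(value ~ lag, data = lags, FUN = median)
```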
Lagging a series is also useful to control for autocorrelation, i.e. the series of correlation coefficients $\rho_k$ at lags $k = 1, 2, \dots, K$. One way to handle it is to detrend the series by regressing the data onto its time index, for which Teetor, ch. 14.13-14.21, provides a fuller treatment.
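Here is a minimal sketch of that idea, assuming `bp` is loaded as above: fit a linear trend on the numeric time index and keep the residuals.

```{r ts-detrend}
# Regress the series on its numeric time index and keep the residuals,
# i.e. the series with its linear trend removed.
trend <- lm(PM ~ as.numeric(as.POSIXct(time)), data = bp)
detrended <- residuals(trend)
```

We will content ourselves with autocorrelation function plots, drawn with `ggplot2`: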
```{r ts-gglag, tidy = FALSE}
# Correlogram function.
gglag <- function(x, method = "acf") {
  data = do.call(method, list(x, plot = FALSE))
  qplot(y = 0,
        yend = data$acf,
        size = data$acf,
        color = data$acf,
        x = data$lag,
        xend = data$lag,
        geom = "segment") +
    scale_y_continuous("", lim = c(ifelse(method == "acf", 0, -1), 1)) +
    scale_size(toupper(method)) +
    scale_color_gradient2("",
                          low = "blue",
                          mid = "white",
                          high = "red",
                          midpoint = 0) +
    labs(x = "Number of lags")
}
```
This function draws correlograms (plots of correlation coefficients) for the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the time series. These indicate that only a small number of lags (probably just the first two) are relevant for detrending the series before analysis.
```{r acf-plots, fig.width = 12, fig.height = 9, warning = FALSE}
# Plot autocorrelation function.
gglag(bp$PM, "acf")
# Plot partial autocorrelation function.
gglag(bp$PM, "pacf")
```
> __Next__: [Smoothing](092_smoothing.html).