## New to Org? Use TAB to expand the headings (denoted by stars)
Familiar with Emacs, Org, and R? Here’s your TL;DR:
- Install the following packages:
install.packages(c("zoo", "timeSeries", "xts", "abind", "foreach", "doMC"))
- If you haven’t already, clone this project from github (https://github.com/leoalekseyev/statarb-al)
- Download data files one (20 MB) and two (2 MB) and put them in the ./workspace directory of the project
Now feel free to skip on to the next section, evaluating code blocks as you please.
New to Emacs and/or to Org? Here’s the gist of what you need to know:
- TAB key folds and unfolds headlines (i.e. lines starting with *’s). Folding/unfolding only works if the point (cursor) is on the headline.
- SHIFT-TAB cycles through folding levels of all headlines in the buffer. It works independently of the current location of the point.
- Emacs key combo shorthands look something like C-x C-s, C-x 1, or M-s o. Here, C stands for ctrl and M for alt. C-x C-s means first press ctrl+x, then press ctrl+s. M-x is a special shortcut that prompts you for a command name. Everything you can do in the editor corresponds to a certain command that you can execute with M-x. Many of those commands have additional keybindings.
- More Emacs terminology: editing, interpreter sessions, and documentation viewing inside Emacs take place in buffers. When several buffers appear side by side in a single Emacs session, they are called windows. An Emacs window (in the sense of an application window) is called a frame.
- Org-mode code snippets can be executed by hitting C-c C-c with the point inside a source code block. The code will then be sent to the interpreter (in this project, the interpreter will be R). If something doesn’t work quite right, it’s often useful to switch to the interpreter buffer and see if the interpreter complained about the code.
- Emacs documentation is very thorough and accessible. If you are completely new to Emacs, I recommend hitting C-h t and browsing through the tutorial for a few minutes. A great online resource is http://masteringemacs.org; a Google search for Emacs tutorials will also produce lots of useful information.
## What is org-mode?
Org-mode is an editing mode inside Emacs; it is automatically invoked when you open a file with a .org extension. Org provides a unified framework for taking notes, recording ideas, managing agendas and calendars, managing projects, and more. If you are new to org, spend a few minutes browsing through http://orgmode.org to get a feel for what it is about. Org files are just text files with simple markup; they are 100% human-readable and portable. Org makes it easy to export these files into formats suitable for publication (with LaTeX export), as well as presentation or blogging (with HTML export). Other export formats (e.g. ODT) are also supported. For example, the web version of this tutorial at http://dnquark.com/org/statarb/statarb.php is generated directly from the current file.
People use org-mode in many different ways. For me, org is (among other things) a unified environment for project management, literate programming, report generation, and reproducible research. In the ideal scenario, given my original data, my analysis code, and the org file containing links to the code, stand-alone code snippets, and workflow description, anybody should be able to reproduce, critique, and extend my results. This combination of flexible and interactive information delivery also makes Org a great teaching tool, and it is my hope that this project illustrates this facet of Org.
## How Org is used here (or: enough already, let’s play with the data!)
You can think of this file as a project report that covers every step in the analysis (so as to be perfectly reproducible) and is intended to teach, not just present results. (It’s really a cross between a project report and a tutorial.) Every original source file and every code snippet is available for you to play with. Source blocks can be executed with C-c C-c.
Before you begin, make sure that you have R and ESS running, as well as the relevant packages. It is also highly recommended that you use the latest version of Org mode (which you can get from http://orgmode.org). (In a few weeks/months this step will become easier, since up-to-date Org will ship with Emacs 24).
To install the packages used in this project, launch R and issue the following command:
install.packages(c("zoo", "timeSeries", "xts", "abind", "foreach", "doMC"))
Back when I was still in grad school on the East Coast, I couldn’t help noticing a steady stream of freshly-minted PhDs (or even ABDs) in physics, math, CS, and engineering taking residence 1.5 hours up north on the train line, and converting their knowledge of applied probability theory into cold, hard cash. Before getting acquainted with the field of financial engineering first-hand, I had a rather distorted view of the whos and the whats of capital markets (shaped, in part, by growing up in Soviet Russia and doing undergrad in California). To be sure, the finance industry can seem complex and alienating. A significant part of the field, however, essentially consists of our fellow geeks crunching numbers on beefy servers using the open source tools we know and love.
One fun side project I worked on a couple of years ago involved reproducing a trading strategy described in a paper by Marco Avellaneda and Jeong-Hyun Lee. This paper is cool for several reasons:
- the math is interesting without being overly arcane
- it’s fairly easy to implement in R
- there are many interesting things to explore both in terms of optimizing the trading strategy and in studying some fundamental aspects of financial time series
- if you had a time machine, you could make a killing with this strategy 5-10 years ago! (Yeah, Gray’s Sports Almanac is for those amateur time travelers.)
I recently stumbled upon the project source tree when cleaning up some git repositories. After looking at the code, I had a couple of realizations:
- my coding style in both R and C++ has evolved quite a bit over the last couple of years
- this would be an awesome project to share with the world using org-mode’s reproducible research/literate programming features
As a result, I’ve started putting together a public version of this project. My goal is to enable interested readers to follow the implementation details step by step, and to experiment with the code freely. The code itself is currently not far from its starting point as a one-off scrappy personal project (this means I am not using package.skeleton or Roxygen). The current focus is on getting everything to run smoothly under org-babel. So far, the signal generation routines work well; the trading simulation should be up shortly. So if you want to get the flavor of the sorts of things Wall Street quants do, grab a fresh version of org-mode, clone this project from Github, grab the data files (one and two), load statarb-al.org (from the project root) into Emacs, and continue following along in an org-mode buffer!
A quick note on where to get the tools used here: if you are on Windows or OS X, and don’t already use Emacs and/or Org, Vincent Goulet maintains a version of Emacs with Org, AUCTeX, and ESS integrated. R is available from http://r-project.org. Under Ubuntu, of course, you can get everything from Synaptic (search for r-base to get R; also, if the org-mode version in the repository is < 7.8, you might want to get a more up-to-date version from http://orgmode.org).
Winning in the stock market is easy. Buy low, sell high. Of course, to do this reliably we need to be reasonably sure that when the stock is low, it will actually rise. Making accurate predictions of this sort is difficult and notoriously error-prone. What we can do instead, however, is make some predictions about the behavior of pairs of stocks (or, in general, stock portfolios). The key idea here is that certain groups of stocks will usually be very strongly correlated, and deviations from the long-term correlation are temporary and mean-reverting. For instance, consider a pair of stocks from the same industry, e.g. Intel and Microsoft, or Ford and GM. Both pairs will generally trend with the overall market, and with the market sector. Within the pairs, the correlation is not perfect: on a particular day, stock A might, relatively speaking, outperform stock B and vice versa. We assume that on average, though, the returns on these stocks are going to be linearly related.
How do we use this to trade? Suppose that we somehow knew that over a 30-day period, the two stocks will generate exactly the same returns, regardless of what the overall market is doing. Let’s say that they start at the same price (and, by assumption, they will end at the same price) – but, by day 15, A is trading way higher than B. If we believe the assumption that the returns will equilibrate by day 30, we are going to buy stock B and short stock A. If the assumption is correct, we will have netted a profit of (F-B)+(A-F) = A-B, where F is the (identical) final price of the two stocks, and A and B are the prices when we open our position. Note that it doesn’t matter which direction the market has moved, nor whether our stocks rose or fell. We just had to be correct about the returns converging.
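As a quick sanity check on this arithmetic, here is a toy calculation (the prices are made up purely for illustration):

A <- 105      ## price of stock A when we open the (short) position
B <- 95       ## price of stock B when we open the (long) position
final <- 100  ## the common final price F at convergence
(final - B) + (A - final)  ## = A - B = 10, whatever `final` turns out to be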
This strategy is known as pairs trading, and the general approach of using statistics for placing (almost) sure bets on the market goes by the name statistical arbitrage.
So this is all fine and well, but do stocks really behave like that? Let’s find out! The simplest way to do so is to pick a correlated pair and use it to construct a “market neutral” portfolio, i.e. one with expected zero return. We then hold this portfolio over n days, and look at our actual returns. This brings us to our first code block, which we will evaluate in an R session.
First, we have to prepare the workspace, load the data, and make sure we have the necessary packages installed in R. Here, we use some time series libraries that give us convenient rolling-window filtering, as well as pretty plots. For the data, we provide daily returns on a small group of stocks (contained in workspace/sample_returns.csv.gz).
need.packages <- c("zoo", "timeSeries", "xts")
for (p in need.packages)
if (!is.element(p, installed.packages()[, 1])) install.packages(p)
require("xts", quiet=T)
require("timeSeries", quiet=T)
rets <- read.csv(file="workspace/sample_returns.csv.gz", row.names=1)
RollingBetaFit <- function(data, win=60) {
WindowFit <- function(data.win) {
beta.fit <- lm.fit(cbind(rep(1, win), data.win[, 2]), data.win[, 1])
beta.fit$coefficients[2]
}
betas <- rollapply(data, win, WindowFit, by.column=F)
length(betas) <- nrow(data)
cbind(data, beta=betas)
}
MarketNeutralReturns <- function(data, holding.period=1, timespec="/") {
data <- data[complete.cases(data), ]
data <- as.xts(as.timeSeries(data))[timespec] # automatically sorts
dates.seq <- holding.period:nrow(data)
CompoundReturns <- function(x) exp(sum(log(1 + x))) - 1
comp.rets <- rollapply(data[, 1:2], holding.period, CompoundReturns, by.column=T, align="left")
mn.rets <- as.xts(timeSeries(rep(NA, nrow(comp.rets)), index(comp.rets)))
for (i in 1:length(mn.rets))
mn.rets[i] <- sum(comp.rets[i] * c(1, -data[i, "beta"]))
names(mn.rets) <- paste(holding.period,"-day ret",sep="")
mn.rets
}
PlotMNReturns <- function(rets, pair=c("JPM", "XLF"),
periods=c(1, 5, 15, 30), timeframe="2006/2007") {
rets.betas <- RollingBetaFit(rets[pair])
oldpar <- par(no.readonly=T)
par(mfrow=c(length(periods), 1))
for (p in periods)
plot(MarketNeutralReturns(rets.betas, holding.period=p, timespec=timeframe),
main=paste("Market-neutral returns:", paste(pair, collapse="/"), "held for", p, "day(s)"))
par(oldpar)
}
PlotMNReturns(rets)
What’s happening here is the following: we pick daily returns on a pair of stocks, and for every day look back over a 60-day window and use lm.fit to estimate the regression coefficient β. We then construct a market-neutral portfolio (long the stock, short β units of the ETF), hold it for p days, and see what returns we generate. At the end, we examine a plot of these returns for the pair – in this case, we pick JPM and its corresponding sector ETF (XLF).
Indeed, we see a random signal that seems to oscillate around 0, and the
characteristic oscillation period increases as we increase the holding period
of the portfolio. You might be curious to know whether this mean-reverting
behavior persists if we pick a pair of stocks that we don’t expect to be very
strongly correlated, e.g. JPM and MSFT, or JPM and INTC, or JPM and AA. If
you are running this code interactively, it is worth re-running PlotMNReturns with these stocks as the pair.
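For instance (assuming these tickers are present in the sample returns file):

PlotMNReturns(rets, pair=c("JPM", "MSFT"))
PlotMNReturns(rets, pair=c("JPM", "AA"))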
What you might find is that the empirical behavior that we glean from the plots is not very consistent. Mean reversion seems to be much better defined for some stocks than for others, but as with all stochastic signals, eyeballing their behavior does not get us very far. Instead, we need a more thorough mathematical framework in which to treat the mean reversion.
Analyzing financial time series can quickly degenerate into impenetrable stochastic calculus and an alphabet soup denoting the various flavors of autoregressive models. Fortunately, what we are doing here is quite simple, and the basic model can be treated as a black-box abstraction.
Here are a few pages of my notes summarizing the basic model we use for pairs trading. The key ideas are the following:
- log-returns for the two instruments are linearly related, with an additional term given by the stationary and mean-reverting process X_t
- X_t is modeled as an Ornstein–Uhlenbeck process
- the solution to the O-U stochastic differential equation is exactly the AR(1) time series model (which we can easily fit with R)
- the O-U process is characterized by a few parameters: the speed of mean reversion, its mean, and the stationary (long-term) variance. We can extract these parameters from the AR(1) model fit.
Finally, the trading signal s is just the normalized deviation of the estimated O-U process X_t from its estimated mean, where we use the long-term variance for normalization. Generating the signal appears straightforward and, indeed, it is!
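To make the last two points concrete, here is a minimal sketch (the FitOUFromAR1 helper is illustrative and not part of the project sources) of extracting the O-U parameters and the s-score from an AR(1) fit, assuming daily data, the standard discretization X_{t+1} = a + b X_t + ζ_t, and 252 trading days per year:

## Illustrative helper only: O-U parameters from an AR(1) fit on a series x.
FitOUFromAR1 <- function(x) {
  n <- length(x)
  ar1 <- lm(x[-1] ~ x[-n])                       ## X_{t+1} = a + b X_t + zeta_t
  a <- coef(ar1)[1]; b <- coef(ar1)[2]
  kappa <- -log(b) * 252                         ## speed of mean reversion (annualized)
  m <- a / (1 - b)                               ## long-term mean
  sigma.eq <- sqrt(var(resid(ar1)) / (1 - b^2))  ## stationary (long-term) st. deviation
  s <- (tail(x, 1) - m) / sigma.eq               ## s-score: normalized deviation from mean
  list(kappa=kappa, m=m, sigma.eq=sigma.eq, s=s)
}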
To manage project paths, we rely on the global variable statarb.al.proj, which must be a list containing the elements src.path and workspace.path. Suppose your project root is "~/finance/research/statarb-al/". We then define statarb.al.proj as follows:

projectroot <- "~/finance/research/statarb-al/"  ## adjust to your checkout location
statarb.al.proj <-
  list(src.path=paste(projectroot, "src/", sep=""),
       workspace.path=paste(projectroot, "workspace/", sep=""),
       project.path=projectroot)
This bit of code must be sourced into every R session for the project. The boilerplate code for sourcing the necessary function definitions and setting the working directory then becomes something like
if (!exists("statarb.al.proj")) stop("Need project metadata file to proceed")
source.files <- c("functions.R")
for (f in paste(statarb.al.proj$src.path, source.files, sep="")) source(f)
setwd(statarb.al.proj$workspace.path)
We provide the following data files:
- univ1_ret_mtx.gz (link, 20 MB)
- etf_ret_mtx.gz (link, 2 MB)
- ticker_to_sec_etf.csv (in the git repository)

The returns data files should be placed in the workspace directory of the project tree. The univ1 file contains daily returns for several hundred stocks; the etf file contains daily returns for sector ETFs; and ticker_to_sec_etf.csv is used to associate stocks with sector ETFs using the GICS industry classification.
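For reference, here is one plausible way to load these files. This is a hedged sketch, not the project’s own loading code: the .gz files are assumed to be compressed CSV like sample_returns.csv.gz above, and the names ret.s and ret.e match how these objects are used later in this document.

ret.s <- read.csv(paste(statarb.al.proj$workspace.path, "univ1_ret_mtx.gz", sep=""),
                  row.names=1)  ## daily stock returns
ret.e <- read.csv(paste(statarb.al.proj$workspace.path, "etf_ret_mtx.gz", sep=""),
                  row.names=1)  ## daily sector ETF returns
## ticker_to_sec_etf.csv lives in the repository; adjust the path to your checkout
tkr.map <- read.csv(paste(statarb.al.proj$project.path, "ticker_to_sec_etf.csv", sep=""))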
Finally, now is a good time to install all the packages that this project depends on. They include:
timeSeries, xts, abind, foreach, doMC, stinepack
These packages aren’t crucial for the analysis itself, but the time series libraries make the presentation/handling of price data somewhat more convenient, while foreach and doMC are used to parallelize computations on a multicore workstation. All of the above packages can be installed automatically using
need.packages <- c("timeSeries", "xts", "abind", "foreach", "doMC", "stinepack")
for (p in need.packages)
if (!is.element(p, installed.packages()[, 1])) install.packages(p)
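Since foreach and doMC were just mentioned, the sketch below shows how a doMC backend is registered (it is assumed here that the project’s signal-generation code uses foreach’s %dopar% internally; adjust cores to your machine):

library(foreach)
library(doMC)
registerDoMC(cores=4)  ## register 4 worker processes for %dopar% loops
getDoParWorkers()      ## should report 4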
To illustrate the general analysis workflow, we first compute the s-score for a simple stock/ETF pair. We pick JPM and XLF as the stock and ETF.
This code also illustrates the boilerplate environment setup and data loading. To run it, make sure the project metadata variable statarb.al.proj exists in the workspace and source jpm_xlf_s_score.R. To run this from within org-mode, do:
source(paste(statarb.al.proj$src.path, "jpm_xlf_s_score.R", sep=""))
We can now plot the s-score as follows:
setwd(statarb.al.proj$project.path)
plot.xts(as.xts(s.jpm.inv.ts)["2006/2007"], main="JPM vs XLF s-signal")
We can also plot signal lines using this handy bit of code (visualizing the signal lines will come in handy when setting up the trading simulation):
setwd(statarb.al.proj$project.path)
sig.thresholds <- c(sbo=-1.25, sso=1.25, sbc=0.75, ssc=-0.5)
## buy to open, sell/short to open, buy to close [close short pos.], sell to close [close long pos.]
signal.lines <- t(apply(sig.jpm[,,"JPM",drop=T], 1,
                        function(...) decode.signals(..., names=c("model.valid", names(sig.thresholds)))))
signal.lines <- as.xts(as.timeSeries(signal.lines[, -1]))
sig.colors <- c(2, 3, 4, 5)
timespec <- "2007"
plot.xts(as.xts(s.jpm.inv.ts)[timespec], main="JPM vs XLF s-signal")
for (s in seq_along(sig.thresholds)) {
  abline(h=sig.thresholds[s], lty=2)
  lines(signal.lines[, s][timespec] * sig.thresholds[s], col=sig.colors[s])
}
The code that we wrote to test the JPM vs XLF signal was very general; the only difference is that we now want to subset ret.s by all of the financial tickers (which are given by tc.xlf$TIC). Also, now that we are running the signal generation over multiple stocks, it’s a good idea to set subtract.average to T (since this is reported to produce better results). (In the future, it might be worth exploring whether or not that claim is true, and to what extent.)
ret.s.fin <- ret.s[, tc.xlf$TIC, drop=F]
system.time(sig.fin <-
              stock.etf.signals(ret.s.fin, ret.e, tc.xlf, subtract.average=T))
This took just under a minute to run on a quad-core machine.
So this is it! If you so desire, you can generate the trading signals for all the stocks in the dataset. In parts 2 and 3 of this write-up, we will be simulating the trades using the beautiful Rcpp framework, and exploring how to use PCA in constructing mean-reverting portfolios. Stay tuned!
We will run the signals on historical data and tweak model parameters to get higher returns. This is the kind of output we will produce. See that major dip in the right half of the plot? This was a 2007 market anomaly that wiped out a few hedge funds that relied extensively on mean-reverting strategies such as ours.
Can we capture the maximum amount of information about market behavior using the minimum number of stocks? Let’s use the standard dimensionality reduction approach to find out! For example, here we can rather faithfully reproduce the behavior of the S&P 500 using just a handful of instruments.
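As a hedged preview (this is not the project’s code), the mechanics boil down to a principal component decomposition of the returns matrix; assuming ret.s is the stock returns matrix loaded earlier:

ret.mtx <- ret.s[complete.cases(ret.s), ]        ## drop days with missing returns
pc <- prcomp(ret.mtx, center=TRUE, scale.=TRUE)  ## PCA on standardized returns
cumsum(pc$sdev^2) / sum(pc$sdev^2)               ## cumulative variance explained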
This approach will also allow us to construct mean-reverting portfolios that use stocks instead of ETFs.
Our strategy used a simple AR(1) model to fit portfolio returns. In practice, more sophisticated models that incorporate variations in volatility are known to produce better results. We can use R’s prowess in dealing with all flavors of ARMA models to investigate the strategy’s sensitivity to different kinds of price processes.
I hope that this write-up removes some of the mystery behind the sort of work that goes on in quant finance. If you thought that this was helpful, or have any other feedback, feel free to drop me a line!