r for newbies
How to get started in R. Best to have an objective since then you'll know that you've achieved something. Let's plot a graph. You'll be done in 20 minutes[^1].
Here's what we're going to cover:
- Install R and RStudio
- Find your way around RStudio
- Three keys to understanding R
- Write your R script
- Download and install R from here
- Download and install RStudio. This is a nice shiny interface for R, and the easiest way to use it. Download it here. There should be an 'installer' for your operating system.
The screen should be divided in quadrants or panes. The two most important are labelled Source (top left), and Console (bottom left).
The console is R! Type anything here, and it will be interpreted by R.
Try typing 2+2
> 2+2
[1] 4
There are 4 things to explain in the little code snippet above.
- The command prompt `>` (or greater than sign to you and me) is simply R prompting you to enter some text
- The expression `2+2` is the sum that we asked R to perform.
- We'll come back to the `[1]` at the beginning of the next line in a moment.
- R prints the answer `4`
The number in square brackets is actually R 'numbering' your answer for you. There's only one answer so it is 'numbered' `1`.[^3]
Reassuring as it is that R knows that `2+2=4`, you were probably hoping for a little more. Typing directly into R is a start, but we want to teach you reproducible research. The scientific method requires that we document our work, but we can't reproduce your typing unless we record it somewhere.
The solution is to create a file, write your commands in that file, and then tell R to work through the commands in that file. Switch to the pane labelled Source, and this time type `2-2`. When you get to the end of the line, hit Cmd+Enter (on Windows Ctrl+Enter). This sends the last line you wrote from the 'source' document to the console. You should now see that R can add and subtract.
> 2+2
[1] 4
> 2-2
[1] 0
Now save the file you have written as `labbook_YYMMDD.R` (replace YYMMDD with today's date e.g. `labbook_160103.R`). You must use the `.R` extension to indicate that this is an R script, but you can, of course, choose any name you wish.[^2]
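Once the commands live in a file, R can also work through the whole file in one go. Here is a minimal sketch, assuming a script saved in a temporary folder (the path is only for illustration; you would point at your own labbook file). The `source()` function runs the file, and `echo = TRUE` prints each command with its result.

```r
# Write a tiny two-line script to a temporary location (illustrative path only)
script_path <- file.path(tempdir(), "labbook_160103.R")
writeLines(c("2+2", "2-2"), script_path)

# Run every command in the file, echoing each one with its result
source(script_path, echo = TRUE)
```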
This is an opinionated list, and others might suggest other elements, but I think this is a reasonable place to start.
- Comments: Notes to your future self (and others) to explain your decisions
- Functions: The 'verbs' of R. Functions do things
- Data frames: The Excel sheet of R
We're good to go! Well almost. There is another crucial but trivial skill you need, and that is 'commenting' or annotating your scripts. Writing `2+2` is fine for R, but you need to remind yourself and others of why you are doing things.
If you saw these two lines of code, it would be surprising if you knew what R was doing.
7/5
7 %% 5
It is obviously much better to write this:
# This is division
7/5
# This is the 'modulus' function (divides and returns the remainder)
7 %% 5
The `#` sign tells R to ignore the rest of that line (it's for a future 'you' who might read this, and wonder what the 'past' you was up to!). You can even put a `#` after some code to add a specific comment for that line. For example,
# This is division
7/5 # Answer should be 1.4
# This is the 'modulus' function (divides and returns the remainder)
7 %% 5 # Answer should be 2
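The two operators are easier to tell apart side by side. Here is a small sketch using the same numbers, plus `%/%` (integer division). That operator is not used elsewhere in this guide, but it shows where the remainder comes from.

```r
whole <- 7 %/% 5       # Integer division: how many whole times 5 goes into 7
remainder <- 7 %% 5    # Modulus: what is left over
whole                  # 1
remainder              # 2
5 * whole + remainder  # Puts the pieces back together to give 7
```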
Functions are the workhorses of R. You use R because it can 'do' things such as plot graphs, average numbers, print tables, or build statistical models. These are all functions. You want to know the square root of 9?
> sqrt(9)
[1] 3
That was (surprise, surprise) the square root function. It has a name (`sqrt`), and we pass it an 'argument' (in this case `9`). An argument is something that the function needs to do its job. The function then 'returns' a result. Rewriting the above using this vocabulary:
> function_name(argument)
[1] return value
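Many functions take more than one argument, and arguments can be named so that it is clear what each one does. Here is a short sketch using `round()` from base R, whose `digits` argument sets the number of decimal places.

```r
root <- sqrt(16)                       # One argument
rounded <- round(3.14159, digits = 2)  # The second argument is given by name
root     # 4
rounded  # 3.14
```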
Functions such as `mean()` will return the mean of a set of numbers. Don't try this yet because you'll be annoyed that simply typing `mean(1,2,3,4,5,6)` doesn't work.
R has a bunch of functions that are its core (known as base R) that allow you to do maths, stats, load, manipulate, and save data, and make graphs. In addition, there is a wide community of academics and enthusiasts who contribute functions. These are typically wrapped together in something called a package. When you want to use these additional functions, you just tell R to load the package into its memory by typing `library(package_name_here)`. For example, ggplot2 is a package of great plotting and graphing functions, so you would type `library(ggplot2)` to make these functions available.
- Libraries and packages are bundles of pre-written functions
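A package has to be installed once on your computer before `library()` can load it. Here is a minimal sketch, assuming you have an internet connection and are happy to use the CRAN cloud mirror (any CRAN mirror would do).

```r
# Install ggplot2 only if it is not already on this computer
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}

# Load it for this session (you need to do this each time you restart R)
library(ggplot2)
```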
I am guessing that you would be pretty comfortable with a table of data that looked like this.
initials | height | weight |
---|---|---|
dj | 190 | 88 |
hs | 183 | 80 |
gj | 182 | 110 |
sm | 175 | 95 |
A data frame is just the R version of that table. It has 3 columns that are labelled 'initials', 'height', and 'weight'.
Now, please don't be confused when I tell you that it has 4 not 5 rows. We don't count the first row because these are the labels. We can specify any cell in the table by giving its row and column number in that order. We always write these references in square brackets so `[1,2]` refers to the first row and the second column of our table: that is the height of person 'dj' which we can see is `190`.
It is preferable to think of any table of data as just 'columns' of data that are aligned and bound together. We have three columns here.
- Column 1 contains the initials of the people measured: dj,hs,gj,sm
- Column 2 contains their heights (in cm): 190,183,182,175
- Column 3 contains their weights (in kg): 88,80,110,95
The formal term for each column is a vector. To create a column (or vector) of data we use the `c()` function.
height <- c(190,183,182,175)
This simply tells R that these 4 numbers are the same 'type of thing', and that the order we have provided is important too. The funny backwards arrow `<-` (typed as a less than sign, followed by a hyphen) tells R to give these 4 numbers a name. If we type that name, we will see that R has 'remembered' these numbers.
> height <- c(190,183,182,175)
> height
[1] 190 183 182 175
If you think of this vector as a column, then you shouldn't be surprised to learn that typing `height[2]` will return the second 'row' of the column.
> height[2]
[1] 183
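Square brackets can pick out more than one item at a time. Here is a brief sketch, using the same `height` vector, that passes a vector of positions inside the brackets.

```r
height <- c(190, 183, 182, 175)
n_people <- length(height)         # How many items are in the vector
first_and_last <- height[c(1, 4)]  # Items 1 and 4
n_people        # 4
first_and_last  # 190 175
```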
Now if we were to create 'columns' of initials, and weights we could bind the columns together into our table.
weight <- c(88,80,110,95)
initials <- c("dj", "hs", "gj", "sm")
Note R needs a way of distinguishing names of things (e.g. `height`) from bits of data (here a list of initials). It is possible that `dj` (the first set of initials) is the 'name' for something. We don't mean that so we put quotes around the items in the initials column to indicate that these are strings not names. If you forget, and type `c(dj, hs, gj, sm)` then R would go off looking for things named `dj`, `hs`, etc. which is not what we want.
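You can ask R what kind of data a vector holds with `class()`. Here is a quick sketch to check that the quoted initials are stored as text while the heights are stored as numbers.

```r
height <- c(190, 183, 182, 175)
initials <- c("dj", "hs", "gj", "sm")
class(initials)  # "character" (i.e. text)
class(height)    # "numeric"
```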
To bind these data together we use the `data.frame` function, and assign a name `ddata`[^4] so that we can access these data again.
> ddata <- data.frame(initials, height, weight)
Typing `ddata` will print our table
> ddata
  initials height weight
1       dj    190     88
2       hs    183     80
3       gj    182    110
4       sm    175     95
And typing `ddata[1,2]` will print the contents of the cell in row 1 and column 2.
> ddata[1,2]
[1] 190
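Row and column numbers are not the only way to reach into a data frame. Here is a short sketch of some common alternatives, rebuilding the same table so the example runs on its own. `$` picks a column by name, and leaving one side of the comma blank returns a whole row or column.

```r
ddata <- data.frame(
  initials = c("dj", "hs", "gj", "sm"),
  height   = c(190, 183, 182, 175),
  weight   = c(88, 80, 110, 95)
)
ddata$height        # The height column as a vector
ddata[1, ]          # Everything in row 1
ddata[, "weight"]   # The weight column, chosen by name
ddata[1, "height"]  # The same cell as ddata[1,2]
```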
Do you remember my saying not to be surprised that typing `mean(1,2,3,4,5,6)` didn't work? Now I can explain why. The `mean()` function takes a vector of numbers as an argument. So you can either make the vector (aka column) on the fly using the `c()` function inside the `mean()` function:
> mean(c(190,183,182,175))
[1] 182.5
Or since we have already named this vector 'height', then it is easier (and clearer) to write:
> mean(height)
[1] 182.5
If you do try without first bunching your numbers together into a single argument ...
> mean(190,183,182,175)
[1] 190
... then R takes just the first number in the list provided (remember it is only expecting one argument of data), and gives you the mean of that. The remaining three numbers are left out of the calculation: R treats them as other arguments to `mean()` rather than as data.
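As a check, the mean is just the total divided by the number of items, so it can be worked out by hand with `sum()` and `length()`. Here is a small sketch comparing the two.

```r
height <- c(190, 183, 182, 175)
by_hand <- sum(height) / length(height)  # 730 / 4
by_hand       # 182.5
mean(height)  # 182.5
```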
Here is the script. To save typing, you can just copy and paste this into your own file. I have used lots of comments to explain what each line does, but there is also a more detailed explanation of each part of the script below.
# ========================
# = Clear your workspace =
# ========================
# This ugly little line clears your workspace. Just copy and paste if
# needed but be warned that it will *delete everything*.
rm(list=ls(all=TRUE))
# =================
# = Load the data =
# =================
# The data we need is stored on the web at ...
# https://raw.githubusercontent.com/datascibc/course/master/hrate_rrate200.csv
# read.csv() is a function that will accept a file name or a URL and
# load 'Comma Separated Value' data
ddata <- read.csv("https://raw.githubusercontent.com/datascibc/course/master/hrate_rrate200.csv")
# ====================
# = Inspect the data =
# ====================
# Don't trust anyone! Always inspect everything you work with.
head(ddata)
# The function head() shows you the first 6 rows of data (by default).
# If you want to see more then you pass the number of rows you want as
# a second argument (here 10 rows)
head(ddata, 10)
# You should be able to guess what the next line does
tail(ddata, 12)
# So it looks as if we have about 200 rows of data (note the numbering on the left in the console), and that there are just two variables hrate (for heart rate) and rrate (for respiratory rate)
# A quick way of summarising the 'structure' of a data object like this is ...
str(ddata)
# This should report that you have a ...
# 'data.frame': 200 obs. of 2 variables:
# And then it shows you the names of the columns, their type (here
# 'int' means integer data), and a quick look at the first items in
# each column
# hrate: int 62 130 100 90 80 111 102 56 73 98 ...
# rrate: int 16 40 26 15 30 20 17 18 16 36 ...
# ================
# = Make a graph =
# ================
# The very best way to inspect *all* your data is to look at it
# First load a library of plotting functions (with thanks to Hadley Wickham who wrote them)
library(ggplot2)
# Now we plot using the qplot() function
# We tell qplot that
# - we want the y variable to be hrate
# - we want the x variable to be rrate
# - and that the data where these x and y variables will be found is called ddata
qplot(y=hrate, x=rrate, data=ddata)
Hopefully, it should look something like this
Now that wasn't difficult was it?
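Newer versions of ggplot2 (3.4.0 onwards) mark `qplot()` as deprecated, so you may see a warning when you run the script. Here is a minimal sketch of the same scatter plot written with `ggplot()`. It uses only the first ten values of each column, as shown in the `str()` output in the script, so that it runs without downloading anything.

```r
library(ggplot2)

# First ten values of each column, as reported by str(ddata) in the script
ddata <- data.frame(
  hrate = c(62, 130, 100, 90, 80, 111, 102, 56, 73, 98),
  rrate = c(16, 40, 26, 15, 30, 20, 17, 18, 16, 36)
)

# aes() maps columns to the x and y axes; geom_point() draws one dot per row
p <- ggplot(ddata, aes(x = rrate, y = hrate)) + geom_point()
print(p)
```

To plot the full data set, keep the `read.csv()` line from the script and leave out the small hand-typed data frame.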
[^1]: This assumes you don't run into problems downloading and installing R and RStudio, which shouldn't be a problem, but there are computers that like to say 'no'.
[^2]: I recommend this naming scheme because it's not a bad idea to start each day's work in a clean 'labbook' so that you record your progress. Good stuff can then be extracted from the labbook, named more specifically, and saved with other related files in a specific project folder.
[^3]: I could have just said that R always works with vectors, and that the numbering simply refers to the position in the vector. This probably wouldn't have confused someone who reads the footnotes, but it might have confused others. If you want, try typing `1:100` in the console at the command prompt. This is shorthand for 1,2,3 ... 98,99,100 (i.e. all the numbers from 1 to 100). R prints this, and to help you read the output prints where it is up to (as a number in square brackets) every time it starts a new line.
[^4]: `ddata` is not a typo. Naming things is famously hard, and good names like 'data' are often used by the R programming language itself. I have the habit therefore of doubling the first letter of things that I create to help remind me when something is 'mine'. For example, R has a function `table()`, but I might name my table `ttable`. Easy!
Please contact Steve Harris if you have any questions.