[
["index.html", "R User Notebook 1 R User Notebook", " R User Notebook Kieran Driscoll 2019-02-11 1 R User Notebook "],
["basic-concepts.html", "2 Basic Concepts 2.1 Vectors 2.2 Matrices 2.3 Dataframes 2.4 Tibbles 2.5 Extracting data from Dataframes/Tibbles 2.6 Lists 2.7 Factors 2.8 Dates 2.9 Style Guide", " 2 Basic Concepts R is an object oriented language. The main data structures (which are objects) include : Vectors Matrices DataFrame & tibbles Lists Factors These data objects can be manipulated using functions (which are also objects). There are built in functions, but you can also create customised functions. R functions are stored in packages in your R library. Packages can be installed that give you additional functionalities (such as machine learning or graphing capabilities). 2.1 Vectors These can be a single values or multiple values like an array. year <- 2019 ##OR## colours <- c("red", "orange", "yellow", "green") Each element in the vector is indexed from 1 to n. You can use these index numbers to extract an element, for example to get the 2nd element in colours vector: colours[2] ## [1] "orange" 2.2 Matrices These are 2-dimensional vectors, but can only be numeric. 2.3 Dataframes A dataframe is a 2-dimensional table. They can be manually created by combining vectors together: staff <- data.frame("Name" = c("John Doe", "Alice Liddel", "Peter Piper", "Jolie Hope"), "Age" = c(25, 29, 34, 38), "Gender" = c("M", "F", "M", "F") ) 2.4 Tibbles Tibbles are dataframes that can be easier to use, but you will need the tibble package and other tidyverse packages to work with them. library(tibble) staff <- tibble::tibble( "Name" = c("John Doe", "Alice Liddel", "Peter Piper", "Jolie Hope"), "Age" = c(25, 29, 34, 38), "Gender" = c("M", "F", "M", "F") ) 2.5 Extracting data from Dataframes/Tibbles Rows and columns can be extracted from dataframes using index number or column names. 
To extract the FIRST COLUMN of values as a DataFrame : staff[1] # Or staff["Name"] ## # A tibble: 4 x 1 ## Name ## <chr> ## 1 John Doe ## 2 Alice Liddel ## 3 Peter Piper ## 4 Jolie Hope To extract the FIRST ROW of values as a DataFrame ## To extract the FIRST ROW of values *as a DataFrame* staff[1,] ## # A tibble: 1 x 3 ## Name Age Gender ## <chr> <dbl> <chr> ## 1 John Doe 25.0 M To extract the FIRST ROW of the FIRST COLUMN of values as a DataFrame staff[1,1] ## # A tibble: 1 x 1 ## Name ## <chr> ## 1 John Doe If you need more than 1 row/column you can use a colon (:). For example to get the first 3 rows: staff[1:3,] ## # A tibble: 3 x 3 ## Name Age Gender ## <chr> <dbl> <chr> ## 1 John Doe 25.0 M ## 2 Alice Liddel 29.0 F ## 3 Peter Piper 34.0 M If you are using a tibble the results will be the same, but the output will always be another tibble. You can convert a tibble column into a vector/list by using double brackets [[ ]]. This can also be done using the pull() function from dplyr. ## Extract the first COLUMN of values *as a character vector* staff[[1]] ## [1] "John Doe" "Alice Liddel" "Peter Piper" "Jolie Hope" ## Extract the first ROW of first column of values *as a vector* staff[[1,1]] ## [1] "John Doe" 2.6 Lists Lists can be used to group objects together. They can contain different types of object. They are a bit like dictionaries in Python. Each item in a List can be given a name. mylist <- list("Year" = year, "Colours" = colours, "Staff" = staff) You can access Lists in the same way as vectors (by index number or name), but the results will also be in a list structure. To prevent this you need to use double brackets: mylist[[1]] ## Will return the first item in the list ('Year') ## ## [1] 2019 2.7 Factors These store categorical values as coded levels - you will normally want to convert them to regular vectors/lists. 
If you have a Factor you can convert it to a vector: # If your factor contains strings myfactor <- factor(c("North", "South", "West", "West", "South")) as.vector(myfactor) ## [1] "North" "South" "West" "West" "South" # If your factor only contains numbers myfactor <- factor(c(100, 200, 300, 300, 200)) as.numeric(as.vector(myfactor)) # Nb. numbers are initially held as strings, so you need to convert to numeric ## [1] 100 200 300 300 200 2.8 Dates Dates are stored as the number of days since 1st Jan 1970. To store a date you can use the as.Date() function, which accepts dates written in the format ‘YYYY-MM-DD’. Other formats can be used if specified. datevar <- as.Date("2017-04-06") 2.9 Style Guide Make sure you use correct upper/lower case spelling The three main data types are numeric, character and factor Use the setwd(‘C:/…’) function to indicate the working directory for your files Filepath references must use forward slashes …/…/…/ Use == when evaluating equivalence, eg. if (a==b) … To time code, add the following before and after: ptm <- proc.time() proc.time() - ptm To add comments: ## Comment ## "],
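The structures described above (vectors, dataframes, lists, factors, dates) can be combined in one short base-R-only sketch; the object names here are illustrative, not from the notebook:

```r
# Illustrative sketch of the core structures: vector, data frame, list, factor, date.
colours <- c("red", "orange", "yellow", "green")        # a character vector
staff <- data.frame(Name = c("John Doe", "Alice Liddel"),
                    Age  = c(25, 29),
                    stringsAsFactors = FALSE)           # a small data frame
mylist <- list(Colours = colours, Staff = staff)        # a list grouping both objects

colours[2]              # "orange"  - single-bracket indexing, 1-based
staff[["Age"]]          # c(25, 29) - double brackets return a plain vector
mylist[["Staff"]][1, ]  # first row of the data frame stored inside the list

myfactor <- factor(c("North", "South", "South"))
as.vector(myfactor)     # back to a character vector

as.Date("2017-04-06") + 1  # dates support day arithmetic: 2017-04-07
```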
["importing-exporting.html", "3 Importing & Exporting 3.1 Importing with Base R 3.2 Importing with other packages 3.3 Exporting", " 3 Importing & Exporting There are various ways to import and export files, but the tidyverse includes the readr package. This imports data as tibbles. readxl can be used to import Excel files, and haven can import SAS/SPSS files. You can check whether a object is a tibble dataframe using the function is.tibble() Nb. Columns containing characters are automatically given the character data type. 3.1 Importing with Base R Base R import functions can import delimited files as dataframes. datasetname <- read.table("data/testdata.csv", header=TRUE, sep=",") # sep="\\t" will import tab delimited files You can check whether a object is a dataframe using the function is.data.frame() Nb. Columns containing characters are automatically given the factor data type. 3.2 Importing with other packages 3.2.1 Importing a CSV library(readr) datasetname <- readr::read_csv("data/testdata.csv") 3.2.2 Importing a TSV datasetname <- readr::read_tsv("filepath") 3.2.3 Importing a R data file readr::read_rds("challenge.rds") 3.2.4 Importing a Excel file library(readxl) readxl::read_excel(xlsx_example) 3.2.5 Importing a SAS file library(haven) datasetname <- haven::read_sas(system.file("examples", "iris.sas7bdat", package = "haven")) 3.2.6 Importing JSON (Newline Delimited JSON) datasetname <- jsonlite::stream_in("data/testdata.json") 3.3 Exporting 3.3.1 Write to an R dataset readr::write_rds(datasetname, "datasetname.rds") 3.3.2 Write to an CSV file readr::write_csv(datasetname, "datasetname.csv") ###Write to a SAS dataset haven::write_sas(datasetname, "datasetname.sas7bdat") "],
["manipulating-data.html", "4 Manipulating data 4.1 Basic data wrangling 4.2 Summary statistics 4.3 Conditional statements 4.4 Recode 4.5 Appending columns and rows 4.6 Joins 4.7 Tidy Data 4.8 tibbles", " 4 Manipulating data Most common data manpulation can be done using the tidyverse package, which includes dplyr and tidyr. They include functions to subset and merge and pivot dataframes/tibbles. 4.1 Basic data wrangling # Keep/Drop specific columns with select() dplyr::select(mtcars, mpg) # Only keeps mpg dplyr::select(mtcars,-mpg) # Drop mpg by prefixing with a '-' # Keep columns that meet specific conditions select_if() dplyr::select_if(mtcars, is.numeric) # Select rows that meet specific conditions using filter() dplyr::filter(mtcars, hp > 100 & hp < 200) # Rename columns using rename() dplyr::rename(mtcars, HorsePower = hp) #Nb. the new name is specifed first New=Old # Create new columns using mutate() or transmute() dplyr::mutate(mtcars, kmpg = mpg*1.609) # OR to also drop all other columns dplyr::transmute(mtcars, kmpg = mpg*1.609) # Extract a column as a vector/list dplyr::pull(peopletdf, Age) # Sort data (ascending by default) dplyr::arrange(mtcars, wt) # add desc(wt) for descending # Extract unique values dplyr::distinct(mtcars, cyl) 4.2 Summary statistics Basic statistics about a column (eg. sum, mean, median, max, min) can be calculated using the summarise function dplyr::summarise(mtcars, "mean" = mean(mpg), "median" = median(mpg), "max" = max(mpg), "n" = n() ) ## Warning: package 'bindrcpp' was built under R version 3.4.4 ## mean median max n ## 1 20.09062 19.2 33.9 32 Ranking can be done within the mutate function. 
This example adds a rank based on descending horsepower: dplyr::mutate(mtcars, rank_hp = rank(-hp, ties.method = "first")) # Other methods are "last", "random" ## mpg cyl disp hp drat wt qsec vs am gear carb rank_hp ## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 19 ## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 20 ## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 26 ## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 21 ## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 11 ## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 23 ## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 3 ## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 31 ## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 25 ## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 16 ## 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 17 ## 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 8 ## 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 9 ## 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 10 ## 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 7 ## 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 6 ## 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 5 ## 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 28 ## 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 32 ## 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 30 ## 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 24 ## 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 14 ## 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 15 ## 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 4 ## 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 12 ## 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 29 ## 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 27 ## 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 18 ## 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 2 ## 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 13 ## 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 1 ## 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 22 The group_by function can be used to produce grouped statistics: dplyr::group_by(mtcars, am) %>% dplyr::summarise("mean" = mean(wt)) %>% dplyr::ungroup() # Stop grouping ## # A 
tibble: 2 x 2 ## am mean ## <dbl> <dbl> ## 1 0 3.77 ## 2 1.00 2.41 The count function is a quick way of producing frequency counts by one or more variables: # Frequency count of variable (+ optional sorting of results) dplyr::count(mtcars, cyl, sort = TRUE) 4.3 Conditional statements New variables can be created with conditional rules by using the case_when function: dplyr::mutate(mtcars, weight_group = dplyr::case_when(wt < 2.5 ~ "low", wt < 3.5 ~ "medium", TRUE ~ "high")) 4.4 Recode Recoding data may be useful if you have coded values and want to display a description. This is a convenience function based on case_when(): dplyr::recode(mtcars$am, `0`="Automatic", # If coded values are numeric then you need to use `` `1`="Manual", .default = "Other", .missing = "missing") 4.5 Appending columns and rows Additional columns or rows can be added to a dataframe, but they must have the same number/types of elements, and be in the correct order. # Add a column cbind(mtcars, mtcars$mpg) dplyr::bind_cols(mtcars, mpg2=mtcars$mpg) # Add a row rbind(mtcars, mtcars) dplyr::bind_rows(mtcars, mtcars) 4.6 Joins To join dataframes together by a common id, standard joins (left, inner, outer..) can be used: dplyr::left_join(mydf1, mydf2, by = 'id') # Anti-Join : Only keep rows from mydf1 if they DON'T match with mydf2, eg. to remove certain types of observation dplyr::anti_join(mydf1, mydf2, by = 'id') 4.7 Tidy Data You can convert data to/from a tidy format with the tidyr package. Nb. both tidyr and dplyr are loaded by the tidyverse package. library(tidyr) tidydf <- tibble::tibble( 'Year'=c(2016,2016,2017,2017,2018,2018), 'Type'=c('A','B','A','C','B','C'), 'Amount'=c(111,222,333,444,555,666) ) # Pivot tidy data so that values become columns nottidy <- tidyr::spread(tidydf, key=Year, value=Amount) # Pivot data into a tidy format (ie. reverse spread) tidyr::gather(nottidy, key=Year, value=Amount, -Type) ## # A tibble: 9 x 3 ## Type Year Amount ## <chr> <chr> <dbl> ## 1 A 2016 111 ## 2 B 2016 222 ## 3 C 2016 NA ## 4 A 2017 333 ## 5 B 2017 NA ## 6 C 2017 444 ## 7 A 2018 NA ## 8 B 2018 555 ## 9 C 2018 666 4.8 tibbles The tibble package includes some extra functions for altering tibbles. library(tibble) # Add new column to a tibble (with value) tibble::add_column(mydf, newcol='2014-15') "],
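For environments without dplyr installed, the same wrangling verbs can be sketched with base R equivalents; the kmpl conversion factor below is illustrative only:

```r
# Base R equivalents of the dplyr verbs above, using the built-in mtcars data.
keep     <- mtcars[, "mpg", drop = FALSE]               # select(): keep one column
filtered <- mtcars[mtcars$hp > 100 & mtcars$hp < 200, ] # filter(): rows by condition
mtcars$kmpl <- mtcars$mpg * 0.425                       # mutate(): approx. km per litre
grouped  <- aggregate(wt ~ am, data = mtcars, FUN = mean) # group_by() + summarise()
grouped  # mean weight per transmission type (am = 0/1)
```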
["functions.html", "5 Functions 5.1 Function Basics - an example using str() 5.2 Common (base) functions 5.3 Stringr - more string functions 5.4 Lubridate - more date functions 5.5 Converting data to other formats 5.6 User Defined Functions 5.7 Chaining/Piping", " 5 Functions 5.1 Function Basics - an example using str() Functions are the main way that you will handle or manipulate data. Something is a function if it is in the form functionname(arg1, arg2, …) str(mtcars) #Displays basic info about the data (like Proc Contents) To see what a function does, and what arguments it requires, you can enter the function name into the help function, eg help(str) 5.2 Common (base) functions 5.2.1 Strings x <- "The cat sat on the mat" # Substring - Use with any vector or list. You should include the start and end position. substr(x, 3, 4) # Search for a string within a string - will return either TRUE or FALSE. grepl("cat", x) # Find and replace a string within a string. gsub("cat","dog", x) # Paste - Concatenates vectors or lists (including an optional separator). All values will be converted into strings. paste("value1", "value2", sep=" ") paste0("value1", "value2") # This function doesnt have any separator # Remove leading/trailing whitespace from character strings with trimws() trimws(" Hello World ") # Apply a special format to a number/string and return it as a string sprintf('%03d', 1) #This simple example converts 1 into a string and pads it with 0's so that it 3 characters long. The first argument defines what and how text willl be displayed. The % symbol is a placeholders that can have different types. All subsequent arguments represent the data variables that are represented by the placeholders (in the order they occur).For more info see https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/sprintf. 
5.2.2 Mathematical Round/Floor/Ceiling round(5.15, digits=1) # Specify number of decimal places to round to round(515, digits=-1) # A negative digit rounds to the nearest 10^n (ie. 10, 100, 1000) floor(5.15) # Rounds down to nearest whole number ceiling(5.15) # Rounds up to nearest whole number Sum sum(c(1,2,3)) sum(1,2,4) sum(mtcars$wt) 5.2.3 Properties & Lookups x <- c(14, NA, 35, 64) Length - The number of elements in an object length("ABCDE") # A single vector has length 1 ## [1] 1 length(x) # This concatenated vector has length 4 ## [1] 4 length(mtcars) # The length of a Dataframe is equal to the number of columns ## [1] 11 Logic functions for different types of value is.na(x) # Produces a vector - TRUE where a value is missing (ie. NA) ## [1] FALSE TRUE FALSE FALSE is.numeric(x) # Will be TRUE if all values are numeric (ignores NA's) ## [1] TRUE Find the POSITION of the element with highest/lowest value which.min(x) which.max(x) 5.2.4 Transformation x <- c(1000, 1500, 20000) Reverse elements in a vector rev(x) Format numbers to include comma separators - this converts numbers to characters format(x, big.mark=",", big.interval=3L, trim=TRUE) ## [1] "1,000" "1,500" "20,000" When a DataFrame/tibble has been created, various functions can be used to describe it, such as: dim() # Shows number of rows and columns length() # Number of columns colnames() # The column names head() # Displays first 6 rows str() # variable name, type and example values There are also some mathematical functions, including: * summary() # Summary statistics for numeric columns (mean, min, max, Q1, Q3) * colSums() # Sum of each column * colMeans() # Mean of each column 5.2.5 Other useful functions Generate a number sequence seq(1, 5) ## [1] 1 2 3 4 5 seq(1, 5, by = 2) ## [1] 1 3 5 5.3 Stringr - more string functions Loading packages will give you access to more complex functions: library(stringr) x <- "string to test" # Count of words in a string stringr::str_count(x, " ") # Different 
delimiters can be used ## [1] 2 # Substring stringr::str_sub(x, 1, 6) ## [1] "string" # Split a string into N pieces based on your chosen delimiter - stored in a matrix stringr::str_split_fixed(x, "to", 2) ## [,1] [,2] ## [1,] "string " " test" # Find and replace a string within a string stringr::str_replace_all(x, "test", "replace") ## [1] "string to replace" # Locate the start and end position of a string - stored in a matrix stringr::str_locate(x, "test")[[1]] # [[1]] is needed to return start, [[2]] to get end ## [1] 11 # Detect if a sub-string appears within a string stringr::str_detect(x, "test") # Returns either TRUE or FALSE ## [1] TRUE You can add regular expressions to these functions. https://stringr.tidyverse.org/articles/regular-expressions.html 5.4 Lubridate - more date functions Lubridate helps parse, convert, and extract information from dates. date <- "2017-04-06" lubridate::day(date) ## [1] 6 lubridate::week(date) ## [1] 14 lubridate::month(date, label = TRUE, # Switch between numeric (FALSE) and character (TRUE) month. abbr = FALSE) # Switch between full (FALSE) and abbreviated (TRUE) names. ## [1] April ## 12 Levels: January < February < March < April < May < June < ... < December lubridate::quarter(date) ## [1] 2 lubridate::year(date) ## [1] 2017 5.5 Converting data to other formats 5.5.1 JSON The RJSONIO package can convert R objects into JSON. A vector/list will become a JSON array ["", "", ""] x <- c(1,2,4,8,16) RJSONIO::toJSON(x) ## [1] "[ 1, 2, 4, 8, 16 ]" A dataframe or List will become a JSON object eg. {"": [], "": []} RJSONIO::toJSON(mtcars) 5.6 User Defined Functions The basic syntax to create a UDF is: myfunction <- function(arg1, arg2, ... ){ statements return(object) } #For example this function checks whether a number is equal to 1 and returns a Yes/No vector: checkifone <- function(x){ if (x==1) { return("Yes") } else return("No") } #You can return any object, including vectors, lists, other objects or functions. 
To use a function : checkifone(23) # Or save the result of the function in a new variable functionresult <- checkifone(23) #functionresult will contain either Yes or No ## Using dplyr in a function - Passing column names to a function containing dplyr will not work: udf <- function(d, x, y) { # Column names referenced in a function call must be converted with the enquo() function. x <- enquo(x) y <- enquo(y) # Column names can now be used by adding the !! prefix (or use the UQ() function) return(d %>% group_by(!!x) %>% summarise(n = n(), avg=mean(!!y))) } udf(mtcars, cyl, mpg) 5.7 Chaining/Piping There is an alternative way of writing R code that may be easier to read/understand. It is called chaining/piping, and it reduces the need for nested functions by using a special operator %>%. #Normal R code xvar <- c(1,1,2,3,5,8,13,21) #initialises a vector xresult <- round(log(xvar),1) # Calculates the log, and rounds to 1 dp #Chaining/Piping xvar <- c(1,1,2,3,5,8,13,21) xresult <- xvar %>% log() %>% round(1) # The %>% effectively passes xvar to the log() function and the result is passed to round(). This means that complex code with multiple functions should be easier to read "],
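The UDF and piping ideas above can be combined in a short base-R sketch; the native |> pipe assumes R >= 4.1 (earlier versions need magrittr's %>% instead):

```r
# A user-defined function, called directly and inside a native pipe.
checkifone <- function(x) {
  if (x == 1) "Yes" else "No"
}
checkifone(1)    # "Yes"
checkifone(23)   # "No"

xvar <- c(1, 1, 2, 3, 5, 8, 13, 21)
xresult <- xvar |> log() |> round(1)  # log each value, then round to 1 dp
xresult[1]       # 0, since log(1) = 0
```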
["loops.html", "6 Loops 6.1 Basic for Loop structure 6.2 Looping through a vector 6.3 Creating a List in a Loop 6.4 Looping with apply functions", " 6 Loops If you want to repeat a process many times you will need to create a loop: 6.1 Basic for Loop structure for (x in 1:4) { print(x) } 6.2 Looping through a vector loopvector <- c("a","b","c") for (x in 1:length(loopvector)) { print(loopvector[x]) } ## [1] "a" ## [1] "b" ## [1] "c" 6.3 Creating a List in a Loop newList <- list() # An empty List for (x in 1:5) { newList[[x]] = x^2 # Inserts an element into the List } newList[[3]] ## [1] 9 6.4 Looping with apply functions An alternative to using ‘for’ loops is to use the apply functions. This means turning process into a function and then applying the function to a vector/matrix/list: loopvector <- c("a","b","c") print_list <- function(x) { print(x) } test <- sapply(loopvector, print_list) ## [1] "a" ## [1] "b" ## [1] "c" There are various types of apply function. sapply will usually return output as a vector. lapply will usually return output in a list. "],
["tables.html", "7 Tables 7.1 Displaying tables - htmlTable 7.2 Displaying Interactive tables - DT 7.3 Creating Tabulations", " 7 Tables 7.1 Displaying tables - htmlTable htmlTable is a package that will produce the HTML code needed to display a table in a browser. library(htmlTable) htmlTable::htmlTable(esoph[1:5,]) # You can only see what the table looks like when Knitr generates a document agegp alcgp tobgp ncases ncontrols 1 25-34 0-39g/day 0-9g/day 0 40 2 25-34 0-39g/day 10-19 0 10 3 25-34 0-39g/day 20-29 0 6 4 25-34 0-39g/day 30+ 0 5 5 25-34 40-79 0-9g/day 0 27 There are various options to change how a htmlTable will look: htmlTable::htmlTable(esoph[1:5,], rnames = FALSE, header = c("Age Group","Alcohol Group","Tobacco Group","ncases","ncontrol"), # Change headings - specify all columns caption = "Oesophageal Cancer", #To add a Title align = 'lllrr', #To align column values 'r'=right, 'l'=left, c='centre(default) align.header='lllcc', #To align column headers css.table = "width: 70%; table-layout: fixed; word-wrap: break-word;") # Add CSS formats to the table Oesophageal Cancer Age Group Alcohol Group Tobacco Group ncases ncontrol 25-34 0-39g/day 0-9g/day 0 40 25-34 0-39g/day 10-19 0 10 25-34 0-39g/day 20-29 0 6 25-34 0-39g/day 30+ 0 5 25-34 40-79 0-9g/day 0 27 More css styling is avilable : https://www.w3schools.com/css/css_table.asp 7.2 Displaying Interactive tables - DT DT is a package that will produce an interactive table in a browser. library(DT) ## Warning: package 'DT' was built under R version 3.4.4 DT::datatable(mtcars[1:5,1:5]) There are various options and additional format functions to change how a DT table looks: mtcars[,1:5] %>% mutate(cyl = as.factor(cyl)) %>% DT::datatable(colnames = c("Miles per gallon","Cylinders","disp","hp","drat") # Rename columns ,rownames = FALSE #To remove row numbers ,width = '80%' ,filter = 'top' #Add filters above each column ,options = list(pageLength = 4, # Set no. 
rows on each page dom = 'rtBp')) %>% # Specify whether search box etc. appears DT::formatStyle(0, target = 'row', lineHeight='8px') %>% #Changes row height DT::formatRound(columns=c('mpg', 'drat'), digits=1) #Rounds values to 1dp You might need to convert some columns to factors in order for the filters to work. For more detail : https://rstudio.github.io/DT/ 7.3 Creating Tabulations Before displaying a table you may need to coerce your data into the right shape. A basic frequency table can be created by using dplyr functions and then displayed with htmlTable: mtcars %>% count(Cylinders=cyl, Gears=gear) %>% htmlTable::htmlTable(rnames=FALSE) Cylinders Gears n 4 3 1 4 4 8 4 5 2 6 3 2 6 4 4 6 5 1 8 3 12 8 5 2 When grouping by multiple categories the results will be in a tidy format. To produce the same results as a 2-way table, you can use the spread function mtcars %>% count(Cylinders=cyl, Gears=gear) %>% spread(Gears, n) %>% # The 'Gears' category will be along columns, 'n' values will fill the cells htmlTable::htmlTable(rnames=FALSE) Cylinders 3 4 5 4 1 8 2 6 2 4 1 8 12 2 To easily add row totals use the adorn functions from the janitor package: mtcars %>% count(Cylinders=cyl, Gears=gear) %>% spread(Gears, n) %>% janitor::adorn_totals("row") %>% htmlTable::htmlTable(rnames=FALSE, total=TRUE) Cylinders 3 4 5 4 1 8 2 6 2 4 1 8 12 2 Total 15 12 5 The only way to nest multiple categories along the column dimension is to create a combined field using the unite function. mtcars %>% count(Cylinders=cyl, Gears=gear, am) %>% unite(gears_am, Gears, am, sep = " <br> ") %>% spread(gears_am, n) %>% htmlTable::htmlTable(rnames=FALSE) Cylinders 3 0 4 0 4 1 5 1 4 1 2 6 2 6 2 2 2 1 8 12 2 An alternative to the above is to reshape data with a cast function. The best version of this function is from the data.table package (it originally comes from the reshape package). 
data.table::data.table(mtcars) %>% data.table::dcast(cyl + am ~ gear, sep = "<br>", fun.aggregate = c(length, median), value.var = c("mpg","wt")) %>% htmlTable::htmlTable(rnames=FALSE) cyl am mpglength3 mpglength4 mpglength5 wtlength3 wtlength4 wtlength5 mpgmedian3 mpgmedian4 mpgmedian5 wtmedian3 wtmedian4 wtmedian5 4 0 1 2 0 1 2 0 21.5 23.6 2.465 3.17 4 1 0 6 2 0 6 2 28.85 28.2 2.0675 1.8265 6 0 2 2 0 2 2 0 19.75 18.5 3.3375 3.44 6 1 0 2 1 0 2 1 21 19.7 2.7475 2.77 8 0 12 0 0 12 0 0 15.2 3.81 8 1 0 0 2 0 0 2 15.4 3.37 # Your data must be converted to a data.table format. # The shape of the table is defined by a formula (eg. x1 + x2 ~ y) where variables on the left of the ~ represent row dimensions and those on the right represent column dimensions. # You can calculate one or more statistics (mean, max, median..) to appear in the cells, for one or more values. The function automatically aggregates (ie. no group-by is needed). "],
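If htmlTable and janitor are unavailable, the same 2-way frequency table (with a totals row) can be sketched in base R: xtabs() pivots the counts directly and addmargins() appends the totals:

```r
# 2-way count of cylinders by gears, with a totals row, using base R only.
tab <- xtabs(~ cyl + gear, data = mtcars)
tab                            # counts: rows = cyl (4/6/8), columns = gear (3/4/5)
addmargins(tab, margin = 1)    # adds a 'Sum' row of column totals
```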
["charts.html", "8 Charts", " 8 Charts There are various ways to create charts, static or interactive, in R. 8.0.1 Static charts with ggplot2 Static charts can be created with ggplot2 (in PNG format). By default they have a width=7 and height=5. You can alter this in the code chunk options: {r fig.width=8, fig.height=4} For full reference see: http://ggplot2.tidyverse.org/reference/index.html library(ggplot2) ggplot2::ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point(colour="red", size=2) # The 1st argument in ggplot() should be the name of your dataframe; # The 2nd argument should be an aesthetic attribute - this defines which variables will be on the x and y axes. # You can now add 'Layers' to the ggplot object using the + symbol. You must have at least one 'geom' layer that specifies what type of chart it is (eg. geom_bar, geom_point, geom_line etc..) If you only want to chart frequencies, then you only need to specify the x axis variable. In the geom layer add stat=“count” to automatically generate a count.f ggplot2::ggplot(mtcars, aes(x = hp)) + geom_point(stat="count", colour="red") 8.0.2 Changing Axes By default the axes labels are based on variable names. To change these or add a title use a ‘labs’ layer: ggplot2::ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point(stat="identity", colour="red") + labs(title = "Cars", x = 'Horsepower', y = "Miles/Gallon") The width of the x and y axes are set automatically, but can be changed: ggplot2::ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point(colour="red") + labs(title = "Cars", x = 'Horsepower', y = "Miles/Gallon") + expand_limits(y = c(0,40), x = c(0,350)) + # Customise where the x and y axes start/end scale_y_continuous(expand=c(0,0)) + # Removes The gap at the bottom of the y axis scale_x_continuous(expand=c(0,0)) # Removes The gap at left of the x axis 8.0.3 Chart Themes & Styles You can change the appearance of most elements of the chart use a ‘theme’ layer. 
There are various templates available such as theme_classic (see https://ggplot2.tidyverse.org/reference/ggtheme.html): ggplot2::ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point(stat="identity", colour="red") + labs(title = "Car Efficiency", x = 'Horsepower', y = "Miles/Gallon") + expand_limits(y = 0) + scale_y_continuous(expand=c(0,0)) + scale_x_continuous(expand=c(0,0)) + theme_classic() You can create a customised theme to: * Change the appearance of axes (axis.line) * Change the plot area (panel.background) * Add/Remove/Change the gridlines (panel.grid.major.x & panel.grid.major.y) * Change the position and size of the title (plot.title) ggplot2::ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point(stat="identity", colour="red") + theme(axis.line = element_line(colour = "black", size = 0.25, linetype = "solid") ,axis.ticks = element_blank() ,panel.background = element_blank() ,panel.grid.major.y = element_line(colour = "blue", size = 0.1, linetype = "dashed") ,panel.grid.major.x = element_line(colour = "green", size = 0.1, linetype = "dashed") ,plot.title = element_text(hjust = 0.5) ) You can create reusable styles and themes, which can be applied to all charts: custom_axis <- list(expand_limits(y = 0, x = 0), scale_y_continuous(expand=c(0,0)), scale_x_continuous(expand=c(0,0)) ) custom_theme <- theme(axis.line = element_line(colour = "grey", size = 0.25, linetype = "solid") ,axis.ticks = element_blank() ,panel.background = element_blank() ,panel.grid.major.y = element_line(colour = "grey", size = 0.1, linetype = "dashed") ,panel.grid.major.x = element_line(colour = "grey", size = 0.1, linetype = "dashed") ,plot.title = element_text(hjust = 0.5) ) ggplot2::ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point(stat="identity", colour="red") + custom_axis + custom_theme 8.0.4 Line Charts (geom_line) ggplot2::ggplot(mtcars, aes(x = cyl)) + geom_line(stat="count", colour="blue", size=2) 8.0.5 Bar Charts (geom_bar & geom_histogram) geom_bar produces basic bar charts and 
is best used with categorical variables. geom_histogram allows you to choose bin widths to automatically re-group your data, so it is better for continuous variables. ggplot2::ggplot(mtcars, aes(x = hp)) + geom_bar(stat="count", fill="orange", width=1) # Use colour= to set the border colour ggplot2::ggplot(mtcars, aes(x = hp)) + geom_histogram(binwidth = 25, fill="orange") To flip a bar chart from vertical to horizontal use ‘coord_flip’ ggplot2::ggplot(mtcars, aes(x = hp)) + geom_histogram(binwidth = 25, fill="orange") + coord_flip() 8.0.6 Adding features Extra layers can be added to the chart to show additional information. Vertical or horizontal lines can be used to show statistics. ggplot2::ggplot(mtcars, aes(x = hp)) + geom_histogram(binwidth = 25, alpha=0.5, fill="orange") + geom_vline(aes(xintercept = median(hp)),col='red',size=1) + geom_vline(aes(xintercept = quantile(hp, 0.25)),col='red',size=0.5) + geom_vline(aes(xintercept = quantile(hp, 0.75)),col='red',size=0.5) 8.0.7 Heatmaps esoph %>% group_by(agegp, alcgp) %>% summarise(n = sum(ncontrols)) %>% ggplot(aes(x = agegp, y = alcgp, fill = n)) + geom_tile() 8.0.8 Boxplots ggplot2::ggplot(data = mtcars, aes(y = mpg)) + geom_boxplot() + facet_wrap(vars(cyl)) 8.0.9 Combination Charts You can have multiple layers, for example bars and points ggplot2::ggplot(mtcars, aes(x = cyl, y = mpg)) + geom_bar(stat="identity") + geom_point() 8.0.10 Scales scales package 8.0.11 ggplot wizard The esquisse package has a shiny wizard that can generate code for different chart types: esquisse::esquisser() 8.0.12 Interactive charts with ggiraph The ggiraph package extends ggplot and allows interactivity. library(ggiraph) 8.0.13 Interactive charts with dygraph For interactive time series charts you can use the dygraph package, which is related to the dygraphs JavaScript library. 
library(dygraphs) lungDeaths <- cbind(mdeaths, fdeaths) #Creates the data set for the graph dygraphs::dygraph(lungDeaths) # Displays the chart with default settings #It is possible to add a range selector under the chart dygraphs::dygraph(lungDeaths) %>% dygraphs::dyRangeSelector() There are lots of options that can be changed such as axis labels, legends, and colours. #Adding a title dygraphs::dygraph(lungDeaths,main='This is the chart title') #Adding a legend dygraphs::dygraph(lungDeaths,main='This is the chart title') %>% dygraphs::dyLegend(show='always',width = 500) #Adding other options dygraphs::dygraph(lungDeaths,main='This is the chart title') %>% dygraphs::dyOptions(includeZero = TRUE) "],
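The basic ggplot2 recipe above (dataframe, aesthetic mapping, at least one geom layer, labels) can be condensed into one sketch; it assumes ggplot2 is installed:

```r
library(ggplot2)

# Data + aes() mapping + one geom layer + labels = a complete chart object.
p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(colour = "red", size = 2) +
  labs(title = "Cars", x = "Horsepower", y = "Miles/Gallon")

# The chart is only drawn when the object is printed; until then it can be
# inspected or extended with further + layers (themes, scales, extra geoms).
inherits(p, "ggplot")  # TRUE
```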
["maps.html", "9 Maps 9.1 Leaflet 9.2 Inserting HTML & Javascript directly into R", " 9 Maps 9.1 Leaflet To create interactive maps you need the leaflet library. This provides various functions that will be transalated into Javascript. 9.1.1 Display a basic map library(leaflet) ## Warning: package 'leaflet' was built under R version 3.4.4 mymap <- leaflet::leaflet() # This initiates a map; it is possible to change the width and height here mymap <- leaflet::setView(mymap, lng = -0.12481, # The initial longitude + latitude lat = 51.50811, zoom = 10) # The initial zoom level mymap <- leaflet::addTiles(mymap) # This add tiles to the map; by default it uses OpenStreetMap mymap # this will display the map 9.1.2 Add markers, shapes and popups to a map mymap <- leaflet::addMarkers(mymap, lng=-0.126965, lat=51.501555, popup="HMRC") #Use longitude and latitude to place a marker. The default blue marker is used. mymap <- leaflet::addCircles(mymap, lng=-0.021002, lat=51.504475,radius=10, popup="HMRC") #Use longitude and latitude to place a circle, and radius (in metres) for its size. mymap 9.1.3 Add Boundaries To add boundaries you need geojson or topojson. This needs to be converted into a format the R understands using functions in the geojsonio library. library(geojsonio) la <- geojsonio::topojson_read('~/R/Local_Auths_Dec16_Gen_Clip_GB.json') R converts this geojson into a Spatial Polygon Dataframe (SPD). This should consist of 2 main attributes: data which has descriptive information about the polygon such as a name, and polygons and contains all the coordinates for each polygon. To see what the contain you can us the following: la@data la@polygons # The @ symbol is used to access different 'slots' in the SPD. The boundary data can now be attached to the map when it is first initialized. 
mymap <- leaflet::leaflet(la) mymap <- leaflet::setView(mymap,lng=-3.332291,lat=54.978353,zoom=5) mymap <- leaflet::addPolygons(mymap, color='black',weight=1, fillColor='white',fillOpacity = 0.8) #This draws the boundary lines mymap 9.1.4 Add Interactivity Highlights can be added when you hover over an area. mymap <- leaflet::addPolygons(mymap, color='black', weight=1, fillColor='white', fillOpacity = 0.8 , highlight = highlightOptions( weight = 2, color = "red", bringToFront = TRUE) ) Popup info can also be added when you hover over an area. This requires using the sprintf function, which includes HTML. mylabel <- sprintf("<strong>%s</strong><br/>%s", la$lad16nm, la$lad16cd ) %>% lapply(htmltools::HTML) # These can now be assigned to the label attribute mymap <- leaflet::addPolygons(mymap, color='black',weight=1, fillColor='white',fillOpacity = 0.8, highlight = highlightOptions( weight = 2, color = "red", bringToFront = TRUE), label = mylabel) The code that is written in the sprintf function requires a special format. The first argument defines what text will be displayed and how. It includes placeholders (eg. %s) that can have different types. All subsequent arguments represent the data variables that are represented by the placeholders (in the order they occur). For more info see https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/sprintf. 9.1.5 Attach extra information to your boundary dataset You will usually want to add additional data/statistics to the boundary file, so that you can improve visualisation/interactivity. You can merge a regular dataframe with an SPD provided they have a common variable (eg. ONS IDs). 
# First create a dataframe with test data mydf <- data.frame('lad16cd'=c('E06000001','E06000002','E06000003'), 'extraStat'=c(25000,34000,38000) ) # Merge extra data into the SPD require(sp) # Needed for the following merge to work correctly la <- sp::merge(la, mydf, by.x = "lad16cd", by.y = "lad16cd") #Check that merge has worked la@data # A new column should appear with the extra data 9.1.6 Add colour based on area data First you will need to define a colour palette, and the range of values (bins) they will apply to. This is done using the colorBin function. pal <- leaflet::colorBin("YlOrRd", bins = c(0, 24999, 34000, 38000, Inf)) # 'YlOrRd' is a particular color scale. See http://colorbrewer2.org for alternatives. You could also create a user defined list of colours using Hexadecimal eg. #fcbba1 instead. Now the colour palette can be assigned to the fillColor attribute, along with the variable name that will be used. mymap <- leaflet::addPolygons(mymap, color='black', weight=1, fillColor=~pal(extraStat), #Nb. the ~ indicates that the variable is contained within the la SPD fillOpacity = 0.8, highlight = highlightOptions( weight = 4, color = "red", bringToFront = TRUE)) 9.2 Inserting HTML & Javascript directly into R Instead of using packages to create javascript, you can insert javascript directly into Rmarkdown, and it should be displayed correctly when Knitr is used. <link rel="stylesheet" href="https://unpkg.com/[email protected]/dist/leaflet.css" /> <script src="https://unpkg.com/[email protected]/dist/leaflet.js"></script> <script type="text/javascript" src="https://stamen-maps.a.ssl.fastly.net/js/tile.stamen.js"></script> <div id="mapid"></div> // Define Map area/position and any background tiles var map = new L.Map('mapid', { center: new L.LatLng(53.10, -1.26),zoom: 7 }); var layer = new L.StamenTileLayer("terrain"); map.addLayer(layer); "],
["rmarkdown-knitr.html", "10 RMarkdown & Knitr 10.1 Knitr 10.2 YAML 10.3 Modular coding", " 10 RMarkdown & Knitr 10.1 Knitr Knitr is a feature in RStudio that will turn your R programme into a HTML (or PDF/Word) document. You need to use RMarkdown, and write your R code in ‘chunks’. You can include regular markdown text, and also html, javascript or python. Knitr will evaulate all the code and embed any tables/charts/maps into the output. 10.2 YAML At the top of all RMarkdown documents there is some YAML code: --- title: "Untitled" output: html_document --- YAML includes information that tells Knitr what type of document to create. In the example above it wil be a HTML document. However it cann also be used to add a table of contents, or to allow users to input cutom paraemters. Adding a table of contents --- title: "Untitled" output: html_document toc: TRUE --- Nb. That the contents is automatically generated based on any RMarkdown header tags ‘#’ that you have used. Adding parameters Parameters can be added after the params: option. They must have a label and a value. By defaut the parameter will be a textbox. --- title: "Untitled" output: html_document params: parametername1: label: "Display label" value: "Default value" --- Different types of parameter input are allowed, such as check boxes and dropdown lists. --- title: "Untitled" output: html_document params: mycheckboxparam: label: "Display label" value: TRUE mydropdownparam: label: "Display label" value: apples input: select choices: [apples,oranges,bananas] --- You can refer to the parameter values in the rest of the your code using the form params%parametername1 To use these parameters the Rmarkdown document must be Knitted using ‘Knit with parameters…’ 10.3 Modular coding If your R code is written in modules, then you can call them using the source() function, eg: source('module1.r') source('module2.r') "],
["machine-learning-predictive-analytics.html", "11 Machine Learning & Predictive Analytics", " 11 Machine Learning & Predictive Analytics 11.0.1 Decision Trees To prodcue decision tress you need to use variuos packages, such as rpart for recursive partioning and caret. ## Create a decision tree library(rpart) TreeResults <- rpart::rpart(force~mass + acceleration, data=dataset1, method="class", control = rpart.control(minsplit = 30, cp = 0)) ## Visualise the decision tree rattle::fancyRpartPlot(TreeResults) ## Use model to predict value predictions <- predict(TreeResults, dataset1, type = "class") ## Evaluate the results confusionMatrix(predictions, dataset1$force) "],
["text-mining-nlp.html", "12 Text Mining & NLP 12.1 Create a Term Document Matrix 12.2 Topic Modelling & Latent Dirichlet Allocation (LDA) 12.3 Clustering - Similarity between topics 12.4 Calculating distance between objects 12.5 Non-Euclidian Distances - Jensen Shannon 12.6 Scaling - Principal Components 12.7 Eigen vectors 12.8 K-Means 12.9 Naive Bayes Classifiers 12.10 TF-IDF Classifiers (Supervised) 12.11 Sentiment Analysis 12.12 Word Bubble", " 12 Text Mining & NLP There are various types of analysis you can do with text data, such as n-grams, sentiemnt analysis, and topic modelling. Various packages are available, the main ones are tm and NLP. 12.0.1 Text data The source data for your text may come in various formats, for example a single string or a dataframe. Ideally you will put these into a tidy format # Text data as a list of strings textdoc <- c("Once upon a time", "in a galaxy far far away", "On a dark and stormy night") # Text data as a tibble/dataframe textdoc <- tibble('line'=c(1,2,3), 'text'=c("Once upon a time", "in a galaxy far far away.", "on a dark and stormy night")) 12.0.2 Tidy text Ideally you will convert your text data into a tidy format using the tidytext package. library(tidytext) ## Warning: package 'tidytext' was built under R version 3.4.4 textdoc %>% unnest_tokens(input=text, output=word, token="words", to_lower = TRUE) -> tidytext # This splits up your text into 'tokens'. # By default a token is a word, but other options include "characters", "sentences","ngrams". # By default all text will be converted to lowercase, and punctuation (eg .,!?£$&) will be removed. # Numbers are not removed. tidytext includes a stop_words tibble, which contain stopwords sourced from SMART, snowball, and onix. You can remove these stopwords from you tidy data by doing an anti join, or create your own custom stopword tibble. 
library(dplyr) stop_words ## # A tibble: 1,149 x 2 ## word lexicon ## <chr> <chr> ## 1 a SMART ## 2 a's SMART ## 3 able SMART ## 4 about SMART ## 5 above SMART ## 6 according SMART ## 7 accordingly SMART ## 8 across SMART ## 9 actually SMART ## 10 after SMART ## # ... with 1,139 more rows # Remove all common stopwords from your data tidytext %>% anti_join(stop_words) -> tidytext2 ## Joining, by = "word" # Remove stopwords from a particular source (eg. snowball) from your data tidytext %>% anti_join(filter(stop_words, lexicon=="snowball")) -> tidytext2 ## Joining, by = "word" # Custom stopwords custom_sw <- tibble(word=c("a","in","on","and")) tidytext %>% anti_join(custom_sw) -> tidytext2 ## Joining, by = "word" # Nb. the tm package also includes a stopword function with the same list of words 12.0.3 Stemming library(SnowballC) tidytext2 <- tidytext2 %>% mutate(word_stem = wordStem(word, language="english")) 12.0.4 Basic text stats If the text data is in a tidy format you can easily use dplyr to manipulate data, such as producing word frequencies: tidytext %>% count(word, sort = TRUE) # Produces token frequency using dplyr count() function ## # A tibble: 13 x 2 ## word n ## <chr> <int> ## 1 a 3 ## 2 far 2 ## 3 and 1 ## 4 away 1 ## 5 dark 1 ## 6 galaxy 1 ## 7 in 1 ## 8 night 1 ## 9 on 1 ## 10 once 1 ## 11 stormy 1 ## 12 time 1 ## 13 upon 1 12.0.5 N-Grams analysis To look at neighbouring words you need to use the unnest_token function again: textdoc %>% unnest_tokens(input=text, output=ngram, token="ngrams", n=2) -> ngramdoc # All text will be converted to lowercase, and punctuation (eg .,!?£$&) will be removed, but numbers are not removed. 
#N-gram frequency ngramdoc %>% count(ngram, sort = TRUE) ## # A tibble: 13 x 2 ## ngram n ## <chr> <int> ## 1 a dark 1 ## 2 a galaxy 1 ## 3 a time 1 ## 4 and stormy 1 ## 5 dark and 1 ## 6 far away 1 ## 7 far far 1 ## 8 galaxy far 1 ## 9 in a 1 ## 10 on a 1 ## 11 once upon 1 ## 12 stormy night 1 ## 13 upon a 1 12.0.6 Text Preparation with tm An alternative way of working with text data is to use the tm package. This includes text cleaning functions. This package uses a data structure called a Corpus. library(tm) # Also loads NLP ## Warning: package 'tm' was built under R version 3.4.4 ## Loading required package: NLP ## ## Attaching package: 'NLP' ## The following object is masked from 'package:ggplot2': ## ## annotate #Convert your text data into a Corpus myCorpus <- VCorpus(VectorSource(textdoc)) # You can clean your corpus using the tm_map() function. This has various options: tm_map(myCorpus, content_transformer(tolower)) # Changes text to lowercase ## <<VCorpus>> ## Metadata: corpus specific: 0, document level (indexed): 0 ## Content: documents: 2 tm_map(myCorpus, removePunctuation) # Removes all punctuation [.,':;] from your Corpus ## <<VCorpus>> ## Metadata: corpus specific: 0, document level (indexed): 0 ## Content: documents: 2 tm_map(myCorpus, removeNumbers) # Removes any numbers from your Corpus ## <<VCorpus>> ## Metadata: corpus specific: 0, document level (indexed): 0 ## Content: documents: 2 tm_map(myCorpus, stripWhitespace) # Removes multiple whitespace ## <<VCorpus>> ## Metadata: corpus specific: 0, document level (indexed): 0 ## Content: documents: 2 tm_map(myCorpus, removeWords, c("i", "a", "is", "the", "and", "but") ) # Removes custom stopwords ## <<VCorpus>> ## Metadata: corpus specific: 0, document level (indexed): 0 ## Content: documents: 2 # A stopwords() function is available, and can be added to the list above, for example tm_map(myCorpus, removeWords, c(stopwords("en")) ) # Removes stopwords in the snowball list ## <<VCorpus>> ## Metadata: corpus 
specific: 0, document level (indexed): 0 ## Content: documents: 2 12.1 Create a Term Document Matrix Transforming text data into a matrix allows you to do further modelling such as LDA, Naive Bayes, regression. ### From tidy data # First, the tidy data needs to be summarised so that it contains the count of each token per document. tidytext %>% count(line, word, sort = TRUE) -> tidytext_count # Now use the cast_dtm function to convert this into a Document Term Matrix. tidytext_count %>% cast_dtm(line, word, n) -> myDTM ## Warning: Trying to compute distinct() for variables not found in the data: ## - `row_col`, `column_col` ## This is an error, but only a warning is raised for compatibility reasons. ## The operation will return the input unchanged. inspect(myDTM) ## <<DocumentTermMatrix (documents: 3, terms: 13)>> ## Non-/sparse entries: 15/24 ## Sparsity : 62% ## Maximal term length: 6 ## Weighting : term frequency (tf) ## Sample : ## Terms ## Docs a and away dark far galaxy in once time upon ## 1 1 0 0 0 0 0 0 1 1 1 ## 2 1 0 1 0 2 1 1 0 0 0 ## 3 1 1 0 1 0 0 0 0 0 0 12.1.1 From a Corpus (by default words are converted to lowercase and words of less than 3 characters are excluded) myDTM <- tm::DocumentTermMatrix(myCorpus) ## By default words will be converted to lowercase, and words with <3 characters are removed. To alter defaults, you can specify a control list. myDTM <- tm::DocumentTermMatrix(myCorpus, control = list(tolower=FALSE, wordLengths=c(1,Inf), removePunctuation=FALSE )) 12.1.2 Matrix Management To perform calculations on very large matrices you may need to use the slam package. ## Sum rows of a matrix slam::row_sums(myDTM) ## 1 2 ## 3 16 ## Sum columns of a matrix slam::col_sums(myDTM) ## 1 2 3 a and away. dark far galaxy in ## 1 1 1 3 1 1 1 2 1 1 ## night on Once stormy time upon ## 1 1 1 1 1 1 12.2 Topic Modelling & Latent Dirichlet Allocation (LDA) Topic Modelling is a common method for discovering topics from text, such as comments. 
Most topic modelling techniques, such as LDA (Latent Dirichlet Allocation), require you to choose the number of topics, and will then use an algorithm to create the topics. You must interpret what these topics mean. Some words are equally likely to appear across topics; so a word like “account” could appear in both topic lists. You can build an unsupervised LDA model using a DocumentTermMatrix library(topicmodels) # Not available # my_lda <- LDA(myDTM, k=4, control = list(seed=87533)) # k is the number of topics you want. # Option 1 : using tidytext to examine topic probabilities topics_beta <- tidy(my_lda, matrix="beta") # beta represents word/topic probabilities topicstats <- topics_beta %>% group_by(topic) %>% top_n(10,beta) %>% ungroup() %>% arrange(topic, -beta) #Top terms for each topic topics_gamma <- tidy(my_lda, matrix="gamma") # gamma represents document/topic probabilities classification <- topics_gamma %>% group_by(document) %>% top_n(1,gamma) %>% ungroup() # Most likely topic for each document # Merge back to original document classification <- mutate(classification, id=as.numeric(document)) final <- left_join(textdoc, classification, by="id") Initially, each word from each document is randomly assigned to a topic. Gibbs sampling is used to re-assign topics, which involves taking each document in turn, and calculating the % of words that are currently assigned to each topic (eg. 12%/24%/36%/28% for topic A/B/C/D). It also looks at each word in the document, and calculates how often that word appears in each topic (eg. 2.5%/2.1%/1.7%/0.5% of topic A/B/C/D). These two sets of %’s are multiplied together, and are used as weights to randomly re-assign a word to a new topic. This process is repeated for every word, at least 2000 times. [NB. During this process the word being assessed is temporarily removed from all calculations.] 
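The weighting arithmetic described above can be sketched in R. This is purely illustrative, using the made-up percentages from the example rather than output from a real model: # Re-assignment of one word across 4 topics (A/B/C/D) doc_prop <- c(0.12, 0.24, 0.36, 0.28) # % of this document's words currently in each topic word_prop <- c(0.025, 0.021, 0.017, 0.005) # how often this word appears in each topic weights <- doc_prop * word_prop probs <- weights / sum(weights) # normalise into sampling probabilities new_topic <- sample(c('A','B','C','D'), size = 1, prob = probs) # randomly re-assign the word. Here topic C gets the largest weight, so it is the most likely (but not guaranteed) re-assignment.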
LDA Implications A word is more likely to be re-assigned to another topic if lots of neighbouring words already belong to it, or if another topic has a higher concentration of that word. A word is more likely to keep its existing topic if it is part of the majority topic within its document, or because the word is spread evenly across topics. Words that only appear once in the corpus shouldn't have a significant effect on the creation of topics. Relatively uncommon words (that appear 2-3 times in the corpus) should be assigned the same topic quite quickly. LDA terms phi : the likelihood that a word appears in a topic (ie. the frequency of ‘w’ in a topic, divided by frequency of ‘w’ across the corpus). theta : the proportion of words in a document that were assigned to each topic (nb. alpha has been added, so 0% is not possible). alpha : The LDA function stores results in the following attributes : @n : The total number of words in the corpus @terms : A simple list of all the distinct words in the corpus @beta : A table (topics x words) containing the log of phi @gamma : A table (documents x topics) containing theta More complex analysis of LDA including graph topicstats %>% mutate(term = reorder(term, beta)) %>% ggplot(aes(term, beta, fill = factor(topic))) + # plot beta by theme geom_col(show.legend = FALSE) + # as a bar plot facet_wrap(~ topic, scales = "free") + # with each topic in a separate plot labs(x = NULL, y = "Beta") + # no x label, change y label coord_flip() # turn bars sideways Since topic models are usually unsupervised, it can be difficult to assess how effective/reliable they are. The topics generated by LDA may not make much sense if most comments: * contain few words * contain too many words covering multiple themes * are too general/generic/non specific LDA can be sensitive, so could produce very different results depending on number of topics or when re-running with additional data. 
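Putting the attribute list above into practice — a short sketch that assumes my_lda is a fitted topicmodels::LDA model (as in the earlier chunk); since @beta stores logs, exp() recovers the probabilities: # Assumes my_lda is a fitted topicmodels::LDA object phi <- exp(my_lda@beta) # @beta holds log(phi); exp() recovers word/topic probabilities theta <- my_lda@gamma # document/topic proportions rowSums(phi) # each topic's word probabilities should sum to roughly 1 This phi matrix is the input used for the distance calculations in the next section.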
12.3 Clustering - Similarity between topics Objects with multiple features/dimensions (such as comments) can be grouped together based on how similar they are. First we need to measure the distances between all the objects. 12.4 Calculating distance between objects The example below shows 3 simple objects (A,B,C) with x and y coordinates. Obs x y A 1 1 B 4 1 C 4 5 twoDimensions <- (data.frame(x=c(1,4,4),y=c(1,1,5))) plot(twoDimensions) For simple 2D objects you can calculate the standard (Euclidean) distance using Pythagoras' theorem. This can be done with the dist() function. proxy::dist(x = twoDimensions) # Euclidean distance method by default ## 1 2 ## 2 3 ## 3 5 4 This produces a matrix showing the distance between each object: Between 1(A) and 2(B) the distance is 3 Between 1(A) and 3(C) the distance is 5 Between 2(B) and 3(C) the distance is 4 12.5 Non-Euclidean Distances - Jensen Shannon The topics generated by a topic model consist of thousands of words (dimensions) as shown in the phi matrix. We can also use dist() to calculate distance between multi-dimensional objects, but you may want to use a different method to calculate distance. Since phi contains probability distributions, a divergence measure such as Kullback-Leibler or Jensen-Shannon can be used. https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence These compare the relative values of phi for each element of a topic (eg. the relative likelihood of ‘tax’ appearing in Topic 1 versus Topic 2), and compute an overall difference. distanceMatrix <- proxy::dist(x = phi, method="Kullback") distanceMatrix This produces a matrix showing the distance between each Topic: Between Topic 1 and 2 the distance is 2.038 Between Topic 1 and 3 the distance is 2.297 Between Topic 2 and 3 the distance is 1.817 Nb. The units of distance may not be meaningful This means that Topics that share a similar word (phi) distribution will be closer to one another. 
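As a minimal illustration of the divergence calculation itself (two made-up 3-word probability distributions, not real topics): p <- c(0.5, 0.3, 0.2) q <- c(0.4, 0.4, 0.2) kl <- sum(p * log(p / q)) # Kullback-Leibler divergence of p from q kl # a small positive number; it is 0 only when p and q are identical Note that KL divergence is not symmetric (p vs q differs from q vs p), which is one motivation for the Jensen-Shannon variant described next.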
The Jensen-Shannon is similar to Kullback-Leibler, however it compares each distribution to the average rather than directly (ie. P vs average(P+Q) rather than P vs Q). This is meant to mitigate the effects of noise in the data. jsPCA <- function(phi) { # first, we compute a pairwise distance between topic distributions # using a symmetric version of KL-divergence # http://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence jensenShannon <- function(x, y) { m <- 0.5 * (x + y) lhs <- ifelse(x == 0, 0, x * (log(x) - log(m))) rhs <- ifelse(y == 0, 0, y * (log(y) - log(m))) 0.5 * sum(lhs) + 0.5 * sum(rhs) } # then compute the pairwise distance matrix and scale it down to 2 dimensions dist.mat <- proxy::dist(x = phi, method = jensenShannon) pca.fit <- stats::cmdscale(dist.mat, k = 2) data.frame(x = pca.fit[, 1], y = pca.fit[, 2]) } 12.6 Scaling - Principal Components If you have used a method to calculate the distance between various objects, it will still have multiple-dimensions. You can use Principal Components Analysis (multi-dimensional scaling) to reduce these, eg. to 2 or 3 dimensions. This will make the data more manageable and suitable for visualisation. # Multidimensional Scaling - reduces the K by K proximity matrix down to K by 2 components pca <- stats::cmdscale(distanceMatrix, k = 2, eig=TRUE) pca$points plot(pca$points) The coordinates produced by the scaling are not very meaningful, however the distance between objects (as calculated previously) will be preserved as much as possible. 12.7 Eigen vectors 12.8 K-Means K-Means Clustering requires a DTM. 
kmeans(DTM, # The Document Term Matrix 10, # The number of clusters iter.max = 10, # nstart = 3, # trace = TRUE) # The kmeans function stores results in the following attributes : $cluster : The cluster assigned to each document $centers : The position of the cluster centre 12.9 Naive Bayes Classifiers library(e1071) 12.10 TF-IDF Classifiers (Supervised) 12.11 Sentiment Analysis 12.12 Word Bubble To create a word bubble visualisation we can use ggplot2 and the packcircles package, which decides the size and position of bubbles mytext <- tibble('Comment'=c('very good, easy to use once setup, couldnt do what i wanted, very quick way of cahnging details. it took only a few minutes to do what i needed. not all my information was correct. lots of confusing information, really good. quick and easy') ) # Use tidytext to extract all the words and remove stopwords library(tidytext) mytext %>% unnest_tokens(input=Comment, output=word, token="words", to_lower = TRUE) -> tidytext tidytext %>% anti_join(filter(stop_words, lexicon=="snowball")) -> tidytext2 ## Joining, by = "word" # Calculate word frequency tidytext3 <- group_by(tidytext2, word) %>% summarise(freq=n()) # The packcircles package decides how to arrange a group of circles, automatically calculating their size and coordinates library(packcircles) ## Warning: package 'packcircles' was built under R version 3.4.4 circles <- circleProgressiveLayout(tidytext3$freq, sizetype='area') # Circle size is proportional to frequency # Add coordinates back to list of words tidytext3 = cbind(tidytext3, circles) # Produce vertices so that the circles can be constructed circles2 <- circleLayoutVertices(circles, npoints=40) # Option to choose how many vertices - more means better drawn circle # Plot circles using ggplot2 & ggiraph library(ggiraph) library(ggplot2) mybc <- ggplot() + geom_polygon_interactive(data = circles2, aes(x, y, group = id, data_id = id), alpha = 0.6) + scale_fill_manual(values="steelblue") + 
geom_text(data = tidytext3, aes(x, y, size = freq, label = word)) + scale_size_continuous(range = c(1,13)) + theme_void() + theme(legend.position="none") + coord_equal() ggiraph(ggobj = mybc, width_svg = 12, height_svg = 12) ## Warning: package 'gdtools' was built under R version 3.4.4 "],
["shiny.html", "13 Shiny 13.1 Basic Shiny Structure 13.2 User Interface 13.3 Server 13.4 Shiny Dashboard", " 13 Shiny Lets you present tables and charts that will react to inputs. 13.1 Basic Shiny Structure library(shiny) ui <- fluidPage( titlePanel("Hello Shiny!") ) server <- function(input, output) { } shinyApp(ui, server) 13.2 User Interface 13.2.1 Layout Various layouts can be used however the most common is a fluid layout. The page can be divided into rows and columns. Fixed text and styling can also be added. library(shiny) ui <- fluidPage( fluidRow( column(width=4, "row1 column1", style = "background-color:red;"), column(width=4, "row1 column2", style = "background-color:orange;"), column(width=4, "row1 column3", style = "background-color:yellow;") ), fluidRow( column(width=4, "row2 column1", style = "background-color:green;"), column(width=4, "row2 column2", style = "background-color:blue;"), column(width=4, "row2 column3", style = "background-color:purple;") ) ) server <- function(input, output) { } shinyApp(ui, server) 13.2.2 Inputs There are various Inputs that can be added to the UI. Behind the scenes these generate HTML code: * sliderInput() # Adds an input slider * fileInput() # Adds a file selector * textInput() # Adds a text input box * selectInput() # Adds a dropdown list * actionButton() # Adds an action button * dateRangeInput() # Adds a date selector The first argument of all the inputs/ouputs is a unique id. 
ui <- fluidPage( fluidRow( column(width=4, sliderInput("slider", "Generate n randoms", min = 0, max = 1000, value = 500, step = 100) ), column(width=4, fileInput("loadfile", "Choose file")), column(width=4, selectInput("dropdown", "Select an option", c("Rare" = "1", "Medium" = "2", "Well Done" = "3"))) ) ) server <- function(input, output) { } shinyApp(ui, server) 13.2.3 Outputs There are various Outputs that can be added to the UI: * htmlOutput() # Adds space for text * tableOutput() # Adds space for a table * plotOutput() # Adds space for a chart Nb. these will be empty until a function is added to the server that adds content. ui <- fluidPage( fluidRow( htmlOutput("text1", width="75%", height="50px"), tableOutput("table1"), plotOutput("chart1", width="75%", height="100px") ) ) server <- function(input, output) { } shinyApp(ui, server) Other packages extend shiny ui outputs. * plotlyOutput # Adds space for a plotly chart [plotly] * ggiraphOutput # Adds space for a ggiraph chart [ggiraph] * DTOutput # Adds space for a data table [DT] 13.3 Server UI inputs and outputs won't work unless they link to a server process/function. 13.3.1 Linking to UI outputs renderPlot() # creates reactive ggplot output renderText() # creates reactive text output renderTable() # creates reactive table output renderDataTable() # creates reactive DT output The UI output id and the server output name must match. ui <- fluidPage( fluidRow( plotOutput("chart1") ) ) server <- function(input, output) { # This output links to the ui plotOutput via the id "chart1" output$chart1 <- renderPlot({ plot(mtcars$wt, mtcars$mpg) }) } shinyApp(ui, server) Other packages extend shiny server outputs. * renderPlotly # creates reactive plotly chart [plotly] * renderggiraph # creates reactive ggiraph chart [ggiraph] 13.3.2 Linking to UI inputs The UI input id and the server input name must match. 
ui <- fluidPage( fluidRow( selectInput("dropdown", "Select an option", c("Weight" = "wt", "Horsepower" = "hp", "Cylinders" = "cyl")) ), fluidRow( plotOutput("chart1") ) ) server <- function(input, output) { # This output links to the ui plotOutput via the id "chart1" output$chart1 <- renderPlot({ plot(mtcars[[input$dropdown]], mtcars$mpg) }) } shinyApp(ui, server) 13.4 Shiny Dashboard Provides a dashboard template for Shiny Apps. library(shiny) library(shinydashboard) "],
["bookdown.html", "14 Bookdown 14.1 Set-up 14.2 Create the book", " 14 Bookdown The bookdown package allows you to creates a html book from various Rmarkdown files. 14.1 Set-up Create a new folder that will contain your book. Create an R Project file (.Rproj) for this folder (File > New Project…). Create an index.Rmd with a YAML header like the example below. This is the first page of the book. --- title: "Bookdown Example" author: "Joe Bloggs" date: "`r Sys.Date()`" site: bookdown::bookdown_site output: bookdown::gitbook documentclass: book description: "This is a simple bookdown tutorial" --- # Bookdown Example For each chapter/section you need a separate .Rmd file. By convention these should start with the section nummber, eg. 01-Chapter-A, 02-Chapter-B, and use ‘-’ instead of a space. 14.2 Create the book When all the Rmd files have ready, click on the Build tab (in the top right window) and click Build Book. This will create a subfolder (_book) that contains the html (and other) files for your book. The html will mirror the .Rmd file names. To view you book open the index.html file in a browser. "],
["creating-packages.html", "15 Creating Packages 15.1 Setup Folders 15.2 R folder 15.3 man folder 15.4 Package details 15.5 Package Dependencies 15.6 Build the package 15.7 Making changes", " 15 Creating Packages You can create your own packages, which are basically a collection of functions. You wil need to have installed the devtools package. 15.1 Setup Folders Create an R Package Project file (File > New Project…> New Directory > R Package). Give a name to your package and the location where all the relevant files will be saved. Your package folder will contain subfolders ‘R’ and ‘man’ which will contain details of functions. By default these should contain ‘hello.R’ and ‘hello.Rd’ files. 15.2 R folder Delete the contents of ‘hello.R’ and replace with your own package function(s). For example: packagefunction <- function(x,y) { xyz <- x*y return(xyz) } NB. DO NOT use library() or require() in any function files. See the Dependencies section below. Rename the .R file so that it matches the function name. 15.3 man folder Delete the contents of ‘hello.Rd’ and replace with documentation of your own functions. A .Rd file is needed for each function that you create. It should have the following format: \\name{ packagefunction } \\alias{ packagefunction } \\title{ A basic function } \\usage{ packagefunction() } \\description{ Multiplies 2 numbers. } \\examples{ packagefunction() } Rename the .Rd file so that it matches the function name. 15.4 Package details By default a ‘DESCRIPTION’ file should have been created such as the example below. You can update these as neccessary. Package: kdpackage0 Type: Package Title: What the Package Does (Title Case) Version: 0.1.0 Author: Who wrote it Maintainer: The package maintainer <[email protected]> Description: More about what it does (maybe more than one line) Use four spaces when indenting paragraphs within the Description. License: What license is it under? 
Encoding: UTF-8 LazyData: true 15.5 Package Dependencies If your package uses functions from other packages then you must add these to the ‘DESCRIPTION’ file so that they are imported. For example this package will use functions from dplyr and tm (separated by commas): Package: kdpackage0 ... Imports: dplyr, tm 15.6 Build the package In the menu, click Build > Install & Restart (or Build & Reload). The package should now be installed and available in your list of packages. 15.7 Making changes If you make changes to your package functions or documentation, you will need to re-build it. http://web.mit.edu/insong/www/pdf/rpackage_instructions.pdf http://tinyheero.github.io/jekyll/update/2015/07/26/making-your-first-R-package.html "],
["git-version-control.html", "16 Git - Version Control 16.1 Set-up 16.2 Link Git to a GitHub account 16.3 Starting a project with Github 16.4 Hosting a site on Github", " 16 Git - Version Control 16.1 Set-up Download and install Git Go to Tools > Global Options > Git/SVN Make sure “Enable version control interface for RStudio projects” is selected Change the paths so they refer to the folder where Git.exe is stored (Probably C:Files). Restart RSTudio 16.2 Link Git to a GitHub account You can access Git shell in R using Tools > Shell (or you can open Git-bash) and enter your GitHub details: git config --global user.name 'yourGitHubUsername' git config --global user.email '[email protected]' 16.3 Starting a project with Github In GitHub create a repository, and copy its URL (eg. https://github.com/username/test.git) In RStudio Create a New Project , choosing Version Control > Git Paste in the GitHub Repository URL (from above). The project name should auto-complete. Also choose the folder where a local copy will be stored (Nb. dont use a Google Drive folder as tis casuses synch conflicts) You can now commit-push changes to the code and these will also be updated on GitHub. The first time you you do this you might need to provie GitHub login details. 16.4 Hosting a site on Github In the GitHub repository, click on Settings and scroll down to the GitHub pages section. Select the source were your html files are (eg. master branch) and click save. Your site will be hosted at on your account in the form : https://kierandriscoll.github.io/repositoryname/ "],
["other-languages.html", "17 Other Languages 17.1 SQL 17.2 Python", " 17 Other Languages You can use other languages in R. 17.1 SQL Use the sqldf package to write code. The SQL must all be contained within " ". library(sqldf) sqldf("select * from mtcars where wt > 4") # is equivalent to: mtcars %>% filter(wt > 4) 17.2 Python If you have Python installed on you computer, you can use the reticulate package to run Python code within R. library(reticulate) "]
]