Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Renamed RStudio Cloud to Posit Cloud and added datasets
1 parent 401b677, commit 32381f5
Showing 12 changed files with 5,085 additions and 21 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
{ | ||
"hash": "88e82b34e9956af0ff61c20356828567", | ||
"result": { | ||
"markdown": "---\ntitle: \"Chapter 6 - Case Study\"\nauthor: \"Government Analysis Function and ONS Data Science Campus\"\nengine: knitr\nexecute:\n echo: true\n eval: false\n freeze: auto # re-render only when source changes\n---\n\n\n> To switch between light and dark modes, use the toggle in the top left\n\n# Introduction\n\nBy the end of this case study, you should have more confidence with manipulating data and using techniques from the first five chapters of Intro to R. Those chapters are therefore a **prerequisite** for it.\n\nThese datasets and questions are designed to be an initial springboard for you to continue with your data journey. \n\nAnswers are provided, but these may show only one or two ways of solving the problem. \n\n>**Your answers may differ slightly from ours. This is fine if the output is consistent, but consider whether you could achieve your answer with less, or better-written, code.** \n\n\n## Structure:\n\nQuestions will be presented in tabs.\n\n* Tab 1 will contain the question \n* Tab 2 will contain the solution in R.\n\nPlease choose the tab with the language you wish to use.\nAn example is below.\n\n## Example \n::: {.panel-tabset}\n\n### **Question**{-}\n\nThis is an example question.\n\n### **Solution**{-} \n\n::: {.cell}\n\n```{.r .cell-code}\n# Solution cell\n\n\"Insert code here\"\n```\n:::\n\n:::\n\n\n# Question 1: Packages\n::: {.panel-tabset}\n\n## **Question**{-}\n\nLoad the following packages:\n\n* tidyverse\n* janitor\n\n## **Show Answer**{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# load packages\n\nlibrary(tidyverse)\nlibrary(janitor)\n```\n:::\n\n:::\n\n# Question 2: Data \n::: {.panel-tabset}\n\n## **Question**{-}\n\nRead in the two files from the **data** folder below, assigning them to the variables suggested:\n\nnetflix - netflix_data.csv\nimdb_scores - imdb_scores.csv\n\nNote - The data is sourced from [Tidy Tuesday](https://github.com/rfordatascience/tidytuesday) and directly from IMDB.\n\nSome data has been altered to suit 
the difficulty level of this course. This is a training dataset, and so shouldn't be relied upon for 100% accuracy.\n\n## **Show Answer**{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Read in imdb and netflix data\n\nnetflix <- readr::read_csv(\"Data/netflix_data.csv\")\n\nimdb_scores <- readr::read_csv(\"Data/imdb_scores.csv\")\n```\n:::\n\n:::\n\n# Question 3 \n::: {.panel-tabset}\n\n## **Question**{-}\n\nClean up the column names of imdb_scores.\n\n## **Show Answer**{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Use janitor to clean names of imdb data\n\nimdb_scores <- clean_names(imdb_scores)\n\nnames(imdb_scores)\n```\n:::\n\n:::\n\n# Question 4 \n::: {.panel-tabset}\n\n## **Question**{-}\n\nWhat are the dimensions of the Netflix data?\n\nSee if you can output them in a sentence.\n\n## **Show Answer**{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Find the dimensions with dim()\n\ndim(netflix) # Rows and columns\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Output a sentence with the dimensions\n\ncat(\"There are\", nrow(netflix), \"rows and\", ncol(netflix), \"columns in the netflix dataset.\")\n```\n:::\n\n:::\n\n# Question 5 \n::: {.panel-tabset}\n\n## **Question**{-}\n\nUse an inspection function to determine the data types of the columns in the Netflix data.\n\n## **Show Answer**{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Have a glimpse of netflix\n\nglimpse(netflix)\n```\n:::\n\n:::\n \n# Question 6 \n::: {.panel-tabset}\n\n## **Question**{-}\n\nHow many missing values do we have in each dataset?\n\n## **Show Answer**{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Number of missing values in the netflix data\n\n\ncolSums(is.na(netflix))\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Number of missing values in the imdb data\n\ncolSums(is.na(imdb_scores))\n```\n:::\n\n:::\n\n# Question 7 \n::: {.panel-tabset}\n\n## **Question**{-}\n\nHow many times does each unique country occur in the Netflix dataset? 
\n\n## **Show Answer** {-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Number of occurrences of each primary_country\n\nnetflix |> \n count(primary_country)\n```\n:::\n\n:::\n\n# Question 8\n::: {.panel-tabset}\n\n## **Question**{-}\n\nCreate a new tibble \"netflix_movies\" by filtering the netflix tibble to contain only rows where type is \"Movie\". \n\n## **Show Answer**{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Create a tibble with \"Movie\"s only\n\nnetflix_movies <- netflix |> \n filter(type == \"Movie\")\n\nglimpse(netflix_movies)\n```\n:::\n\n:::\n\n# Question 9 \n::: {.panel-tabset}\n\n## **Question**{-}\n\nUsing your netflix_movies tibble, clean the duration column by:\n\n* Removing the suffix \"min\".\n* Converting the resulting column to an integer.\n\nFollowing this, rename the column to \"duration_mins\".\n\n> Note, you can do this in one pipeline!\n\nMake sure that you overwrite the dataset by reassigning it!\n\n## **Show Answer**{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Use mutate to clean the duration column\n\nnetflix_movies <- netflix_movies |> \n mutate(duration = as.integer(str_replace(duration, \n pattern = \"min\",\n replacement = \"\"))) |> \n rename(duration_mins = duration)\n\nglimpse(netflix_movies)\n```\n:::\n\n:::\n\n# Question 10\n::: {.panel-tabset}\n\n## **Question**{-}\n\nUsing your netflix_movies tibble, compute:\n\n* The mean and median duration of the movies.\n* The mean and standard deviation of the cast numbers.\n\n## **Show Answer**{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Compute summary statistics of duration and cast number\n\nnetflix_movies |> \n summarise(mean_duration = mean(duration_mins, na.rm = TRUE),\n median_duration = median(duration_mins, na.rm = TRUE),\n mean_cast = mean(num_cast, na.rm = TRUE),\n std_cast = sd(num_cast, na.rm = TRUE))\n```\n:::\n\n:::\n\n# Question 11 \n::: {.panel-tabset}\n\n## **Question**{-}\n\nUsing your netflix_movies tibble:\n\n* Select the title, duration, director and cast numbers\n* Sort in descending 
order of duration\n\nWhich movie was the longest, and who directed it?\n \n## **Show Answer**{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Pipeline for longest movie\n\nnetflix_movies |> \n select(title, duration_mins, director, num_cast) |> \n arrange(desc(duration_mins)) |> \n glimpse()\n```\n:::\n\nThe longest movie in the dataset is Black Mirror: Bandersnatch, at 312 minutes, with no recorded director.\n\n:::\n\n# Question 12 \n::: {.panel-tabset}\n\n## **Question**{-}\n\nUsing your netflix_movies tibble:\n\nGroup by primary_country and obtain the median cast number.\n \n## **Show Answer**{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Median cast number by country\n\nnetflix_movies |> \n group_by(primary_country) |> \n summarise(median_cast = median(num_cast, na.rm = TRUE))\n```\n:::\n\n:::\n\n# Question 13 \n::: {.panel-tabset}\n\n## **Question**{-}\n\nUsing your netflix_movies tibble:\n\nGroup by type and rating of the movie, producing the mean duration.\n \n## **Show Answer**{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Group by type and rating\n\nnetflix_movies |> \n group_by(type, rating) |> \n summarise(mean_duration = mean(duration_mins, na.rm = TRUE))\n```\n:::\n\n:::\n\n# Question 14\n::: {.panel-tabset}\n\n## **Question**{-}\n\nLeft join the imdb_scores data to the **original** netflix data.\n\nCreate a new variable netflix_imdb to contain this.\n\n\n## **Show Answer**{-}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Join imdb and netflix\n\nnetflix_imdb <- netflix |> \n left_join(y = imdb_scores,\n by = \"title\")\n\nglimpse(netflix_imdb)\n```\n:::\n\n:::\n\n# Summary \n\nIn this case study you have had the opportunity to apply data analysis techniques with the tidyverse to some additional datasets. 
\n\nThis is not exhaustive; have a look at the data and experiment with other techniques you can use.\n\nThis data has been provided for you to experiment with; however, there is nothing better than learning with data that is meaningful to you.\n\nFor additional datasets we recommend exploring:\n\n* [Kaggle](https://www.kaggle.com/)\n* [Tidy Tuesday](https://github.com/rfordatascience/tidytuesday)\n* [Data.gov.uk](https://data.gov.uk/)\n\n", | ||
"supporting": [], | ||
"filters": [ | ||
"rmarkdown/pagebreak.lua" | ||
], | ||
"includes": {}, | ||
"engineDependencies": {}, | ||
"preserve": {}, | ||
"postProcess": true | ||
} | ||
} |
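The frozen case-study markdown above is stored with `eval: false`, so none of the solutions record any output. As a minimal sketch, the Question 9 cleaning step can be tried on a small made-up tibble. The `toy_movies` data and the `toy_clean` name are hypothetical and are not taken from `netflix_data.csv`. The sketch assumes the dplyr and stringr packages (both part of the tidyverse) are installed, and it uses `str_remove()` plus `str_trim()` in place of the `str_replace()` call in the course answer.

```r
# Toy illustration of the Question 9 pipeline.
# `toy_movies` is hypothetical data, not the course dataset.
library(dplyr)
library(stringr)

toy_movies <- tibble(
  title    = c("Film A", "Film B", "Film C"),
  duration = c("90 min", "125 min", NA)
)

toy_clean <- toy_movies |>
  # Drop the "min" suffix, trim leftover whitespace, then convert to integer
  mutate(duration = as.integer(str_trim(str_remove(duration, "min")))) |>
  # Rename the cleaned column so the unit is explicit
  rename(duration_mins = duration)

toy_clean
```

Missing durations stay `NA` after the conversion, which is why the later summary questions pass `na.rm = TRUE`.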
18 changes: 18 additions & 0 deletions
_freeze/CH7_control_flow_loops_and_functions/execute-results/html.json
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
{ | ||
"hash": "3422a9bcf98c3b27689f9e2ab455783e", | ||
"result": { | ||
"markdown": "---\ntitle: \"Course Information\"\nauthor: \"Government Analysis Function and ONS Data Science Campus\"\nengine: knitr\nexecute:\n echo: true\n eval: false\n freeze: auto # re-render only when source changes\nformat:\n html: \n highlight: null\n theme: \n light: flatly\n dark: darkly\n toc: true\n toc-title: Contents\n toc-location: right\n toc-depth: 3\n number-sections: true\n link-external-newwindow: true\n embed-resources: true\n \n---\n\n{fig-alt=\"Data Science Campus and Analysis Function logos.\"}\n\n> To switch between light and dark modes, use the toggle in the top left\n\n# Introduction\n\nThis course will cover basic concepts and give you the confidence to work independently in the R programming language. No prior coding or statistical knowledge is assumed; however, you should be confident using basic computer software.\n\nThe course is split into chapters; each chapter will build upon the previous one. It will emphasise the application of skills, building confidence and resilience in programming.\n\nIt is essential that you have frequent opportunities to practise what you have learnt from the course.\n\n# Course Materials\n\nThe course materials come in several formats:\n\n* HTML pages such as the one you are reading now\n\n* [Data](datasets.qmd) we will use during the course. **It's highly recommended you create a project with a 'data' folder and download all the required datasets before starting the course**\n\nYou can also navigate to the course GitHub repository and clone or fork the website structure for yourself. If you are new to programming and version control, we recommend you remain on the website to gain the best experience.\n\n\n# Software Requirements\n\n* R programming language \n* RStudio (recommended but not essential)\n* Web browser (Internet connection not necessary)\n*\tExcel or other spreadsheet software for viewing csv and xlsx documents\n \n\n# Packages\n\nPackages are the fundamental units of reproducible R code. 
They include reusable R functions, the documentation that describes how to use them, and sample data. The following will be used in this course:\n\n* tidyverse\n* readxl\n* janitor\n\n# Pre-Course Checklist:\n\n* Install R and RStudio on your laptop as per your department's guidance.\n\n* Check your department's guidelines for installing packages.\n\n* Save the data from the ZIP file to your hard drive in your working directory.\n\n\n# Course Overview\n\nThe course is divided into 6 chapters. Over the 2 days we will cover:\n\n1. **Chapter One - Getting Started with R**\n\n * Be familiar with RStudio.\n \n * RStudio environment, layout, and customisation.\n \n * Understand the key benefits of using R.\n \n * How to run code in R.\n \n * Know where to get help.\n \n * Discover R’s data types.\n \n * Be able to create variables.\n\n\n \n<br> \n \n2. **Chapter Two - Data Structures**\n\n * Be familiar with data structures in R.\n \n * Understand how vectors operate.\n \n * Be familiar with lists.\n \n * Be familiar with data frames and tibbles.\n\n\n\n<br> \n \n \n3. **Chapter Three - Importing and Exporting Data**\n\n * Organise our work.\n \n * Have an understanding of what packages are.\n \n * Be able to load and install a package.\n \n * Be able to check package versions and R version.\n \n * Be able to import data from multiple formats.\n \n * Be able to inspect loaded data and select elements within the data frame.\n \n * Be able to export data.\n \n * Be able to explore data.\n\n\n\n<br>\n\n4. 
**Chapter Four - Tibbles and dplyr**\n\n\n* Understand the importance of clean variable names.\n\n* Be able to clean column names using the janitor package.\n\n* Understand the use of the pipe operator.\n\n* Be able to sort data with dplyr’s arrange verb.\n\n* Be able to select data with dplyr’s select verb.\n\n* Be able to filter data with dplyr’s filter verb.\n\n* Be able to transform data with dplyr’s mutate verb.\n\n* Be able to join datasets together.\n\n\n\n<br>\n \n5. **Chapter Five - Summary Statistics and Aggregation**\n\n * Describe numeric and categorical data.\n\n * Aggregate data.\n \n\n6. **Chapter Six - Case Study**\n", | ||
"supporting": [], | ||
"filters": [ | ||
"rmarkdown/pagebreak.lua" | ||
], | ||
"includes": {}, | ||
"engineDependencies": {}, | ||
"preserve": {}, | ||
"postProcess": true | ||
} | ||
} |
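The course information above lists three packages and asks learners to check their department's guidance before installing anything. A minimal base-R sketch of a pre-course check is shown here. It only reports what is already installed and installs nothing. The variable names `required` and `installed` are illustrative and do not come from the course materials.

```r
# Report the R version and whether each course package is installed.
# Nothing is installed by this script; follow your department's guidance for that.
required <- c("tidyverse", "readxl", "janitor")

# requireNamespace() returns FALSE (rather than raising an error) if a package is missing
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

cat("Running", R.version.string, "\n")

for (pkg in required) {
  if (installed[[pkg]]) {
    cat(pkg, "version", as.character(packageVersion(pkg)), "\n")
  } else {
    cat(pkg, "is not installed\n")
  }
}
```

Running this before the course starts gives a quick list of anything that still needs to be requested or installed.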