MY472 Data for Data Scientists

Michaelmas Term 2020

Prerequisites

All students are required to complete the preparatory course 'R Advanced for Methodology' early in Michaelmas Term, ideally in weeks 0 and 1. You will be auto-enrolled into the R course when enrolling into MY472 on Moodle.

Instructors

Office hour slots to be booked via LSE's StudentHub

Friedrich Geiecke, Department of Methodology. Office hours: Tuesdays 16:00-18:00 via Zoom
Martin Lukac, Department of Methodology. Office hours: Mondays 14:00-17:00 via Zoom

Course information

Lectures are prerecorded and available via Moodle
Lecture discussions: Tuesdays 09:00–11:00 and 15:00-16:00 via Zoom (you can choose which one to attend)
Classes on:
- Thursdays 10:00-11:00, CLM.3.02 and via Zoom
- Fridays 16:00-17:00, LRB.R.21 and via Zoom

No lectures or classes will take place during (Reading) Week 6.

Quick links to topics

Week	Date	Topic
1	28 Sep	Introduction to data
2	5 Oct	The shape of data
3	12 Oct	HTML and CSS
4	19 Oct	Using data from the Internet
5	26 Oct	Working with APIs
6	2 Nov	Reading week
7	9 Nov	Textual data
8	16 Nov	Data visualization
9	23 Nov	Creating and managing databases
10	30 Nov	Interacting with online databases
11	7 Dec	Cloud computing

Course description

This course will cover the principles of digital methods for collecting, processing, and storing data. The course will also cover workflow management for typical data transformation and cleaning projects, frequently the starting point and most time-consuming part of any data science project. We take a project-based approach to learning computation, combined with some group-based collaboration; both are essential ingredients of modern data science work. We will also make frequent use of version control and group collaboration tools such as git and GitHub.

We begin by discussing fundamental data types and how data is stored and recorded electronically. We continue with an introduction to R Markdown and the reshaping of data in R. This is followed by a discussion of common data formats on the internet, such as markup languages (e.g. HTML and XML) and JSON. Students also study the fundamentals of acquiring and managing data from the internet through both scraping websites and accessing the APIs of online databases and social network services.

After the reading week, we will learn how to work with unstructured data in the form of text. We then continue with an overview of the principles of exploratory data analysis through data visualization, e.g. using R's ggplot2. Next, we will cover database design, especially relational databases, using examples across a variety of fields. Students are introduced to SQL through MySQL, and programming assignments in this unit of the course will be designed to ensure that students learn to create, populate and query an SQL database. We will then introduce NoSQL using MongoDB and the JSON data format for comparison. For both types of database, students will be encouraged to work with data relevant to their own interests as they learn to create, populate and query data. The course concludes with a discussion of cloud computing. Students will first learn the basics of cloud computing, which can serve various purposes such as data analysis, and then how to set up a cloud computing environment through Amazon Web Services, a popular cloud platform.

Assessment

Formative coursework

Students will be expected to produce five weekly, structured problem sets. Each begins with a component started in the staff-led lab sessions and is completed by the student outside of class. Answers should be formatted and submitted for assessment. One or more of these problem sets will be completed in collaboration with other students.

Summative assignments

Take-home exam (50%) and in-class assessment (50%).

Student problem sets will be marked and will provide 50% of the mark.

Assessment criteria

Assignments will be marked using the following criteria:

70–100: Very Good to Excellent (Distinction). Perceptive, focused use of a good depth of material with a critical edge. Original ideas or structure of argument.
60–69: Good (Merit). Perceptive understanding of the issues plus a coherent, well-read and stylish treatment, though lacking originality.
50–59: Satisfactory (Pass). A “correct” answer based largely on lecture material. Little detail or originality but presented in an adequate framework. Small factual errors allowed.
30–49: Unsatisfactory (Fail) and 0–29: Unsatisfactory (Bad fail). Based entirely on lecture material but unstructured and with increasing error component. Concepts are disordered or flawed. Poor presentation. Errors of concept and scope or poor in knowledge, structure and expression.

Some of the assignments will involve shorter questions, to which the answers can be relatively unambiguously coded as (fully or partially) correct or incorrect. In the marking, these questions may be further broken down into smaller steps and marked step by step. The final mark is then a function of the proportion of parts of the questions which have been answered correctly. In such marking, the principle of partial credit is observed as far as feasible. This means that an answer to a part of a question will be treated as correct when it is correct conditional on answers to other parts of the question, even if those other parts have been answered incorrectly.

Detailed course schedule

Schedule

1. Introduction to data

In the first week, we will introduce the basic concepts of the course, including how data is recorded, stored, and shared. Because the course relies fundamentally on GitHub, a collaborative code and data sharing platform, we will introduce the use of git and GitHub, using the lab session to guide students through setting up an account and subscribing to the course organisation and assignments.

This week will also introduce basic data types, in a language-agnostic manner, from the perspective of machine implementations through to high-level programming languages. We will then focus on how basic data types are implemented in R.

Resources

Lecture slides
R example: Introduction to RMarkdown and as rmd source
R example: vectors, lists, data frames

Required reading

Wickham, Hadley. n.d. Advanced R, 2nd ed. Ch. 3, Names and values, Ch. 4, Vectors, and Ch. 5, Subsetting (Ch. 2-3 of the print edition).
GitHub Guides, especially: "Understanding the GitHub Flow", "Hello World", and "Getting Started with GitHub Pages".
GitHub. "Markdown Syntax" (a cheatsheet).

Lab: Working with git and GitHub

Installing git and setting up an account on GitHub
How to complete and submit assignments using GitHub Classroom
Forking and correcting a broken RMarkdown file
Cloning a website repository, modifying it, and publishing a personal webpage

2. The shape of data

This week moves beyond the rectangular format common in statistical datasets, modeled on a spreadsheet, to cover relational structures and the concept of database normalization. We will also cover ways to restructure data from "wide" to "long" format, within strictly rectangular data structures. Additional topics concerning text encoding, date formats, and sparse matrix formats are also covered.
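As a rough illustration of the "wide" to "long" restructuring mentioned above, here is a minimal sketch in plain Python. The course materials themselves work in R; the table, the column names, and the helper `wide_to_long` are all hypothetical and used only to show the idea.

```python
# Hypothetical "wide" table: one row per country, one column per year.
wide = [
    {"country": "A", "2019": 1.0, "2020": 2.0},
    {"country": "B", "2019": 3.0, "2020": 4.0},
]


def wide_to_long(rows, id_col, var_name, value_name):
    """Turn every non-id column into its own (id, variable, value) row."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue  # the identifier is repeated on each output row
            long_rows.append(
                {id_col: row[id_col], var_name: col, value_name: value}
            )
    return long_rows


# Two countries times two years gives four rows in "long" format.
long_rows = wide_to_long(wide, "country", "year", "value")
for r in long_rows:
    print(r)
```

Both layouts stay strictly rectangular; the long format simply stores the former column names (here the years) as values of a new variable.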

Resources

Lecture slides
R examples: conditionals, loops, and functions, introduction to key tidyverse functions, industrial production dataset, and industrial production and unemployment dataset

Required reading

Wickham, Hadley and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O'Reilly. Part II Wrangle, Tibbles, Data Import, Tidy Data (Ch. 7-9 of the print edition).
The Tidyverse collection of packages for R.

Lab: Reshaping data in R

Assignment 1: Data cleaning in R

Link to GitHub Classroom available via Moodle on Monday, October 5, 2pm
Deadline on Friday, October 16, 2pm

3. HTML and CSS

From week 3 to week 5, we will learn how to get data from the Internet. This week introduces the basics, including markup languages (HTML, XML, and Markdown) and other common data formats such as JSON (JavaScript Object Notation). We also cover basic web scraping, to turn web data into text or numbers. We will also cover the client-server model, and how machines and humans transmit data over networks and to and from databases.

Resources

Lecture slides
Example 1: scraping tables
Example 2: scraping unstructured data

Required reading

Lazer, David, and Jason Radford. 2017. “Data Ex Machina: Introduction to Big Data.” Annual Review of Sociology 43(1): 19–39.
Howe, Shay. 2015. Learn to Code HTML and CSS: Develop and Style Websites. New Riders. Chs 1-8.
Kingl, Arvid. 2018. Web Scraping in R: rvest Tutorial.

Lab: Web scraping

Scraping tables
Scraping unstructured data

4. Using data from the Internet

Continuing from the material covered in Week 3, we will cover advanced topics in web scraping. These include scraping documents in XML (such as RSS), scraping websites behind authentication, and scraping websites with non-static components.

Resources

Lecture slides
Example of vectors and list operations in R

Required reading

Gollapudi, Sai Swapna. 2018. Learn Web Scraping and Browser Automation Using RSelenium in R.
Wickham, Hadley. 2015. Parse and process XML (and HTML) with xml2
Mozilla Developer Web Docs. What is JavaScript.

Lab: Group work on first five weeks

Assignment 2: Web scraping

Link to GitHub Classroom available via Moodle on Monday, October 19, 2pm
Deadline on Friday, October 30, 2pm

5. Working with APIs

Resources

Lecture slides
Examples

This week covers how to work with Application Programming Interfaces (APIs), which offer developers and researchers access to data in a structured format. Our running examples will be the New York Times API and the Twitter API.

Required reading

Steinert-Threlkeld. 2018. Twitter as Data. Cambridge University Press.

Lab: APIs

Interacting with the New York Times API
Interacting with Twitter's REST and Streaming APIs

Assignment 3: APIs

Link to GitHub Classroom available via Moodle on Monday, October 26, 2pm
Deadline on Friday, November 13, 2pm

6. Reading week

7. Textual data

We will learn how to work with unstructured data in the form of text, and how to deal with format conversion, encoding problems, and serialization. We will also cover search and replace operations using regular expressions, as well as the most common textual data types in R and Python.
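To make the search and replace operations with regular expressions concrete, here is a minimal sketch in Python. The input string is invented for illustration, and the two patterns are only one possible way to normalise whitespace and rewrite a date.

```python
import re

# Hypothetical messy input: irregular spacing and a DD/MM/YYYY date.
raw = "Deadline:  16/10/2020   at 2pm"

# Collapse every run of whitespace into a single space.
clean = re.sub(r"\s+", " ", raw)

# Capture day, month and year, then reorder them as YYYY-MM-DD
# using backreferences to the three groups.
iso = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\2-\1", clean)

print(iso)
```

The capture groups in parentheses are what make the reordering possible: the replacement string refers back to them by position.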

Resources

Lecture slides and as HTML
Regular expressions cheat sheet

Required reading

Kenneth Benoit. July 16, 2019. "Text as Data: An Overview." Forthcoming in Curini, Luigi and Robert Franzese, eds. Handbook of Research Methods in Political Science and International Relations. Thousand Oaks: Sage.

Lab

Group working with textual data

8. Data visualisation

The lecture this week will offer an overview of the principles of exploratory data analysis through (good) data visualization. In the seminars, we will practice producing our own graphs using ggplot2.

Resources

Lecture slides
Anscombe dataset plots
ggplot2 basics
Scales, axes, and legends in ggplot2

Required reading

Wickham, Hadley and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O'Reilly. Data visualization, Graphics for communication (Ch. 1 and 22 of the print edition).
Hughes, A. (2015) "Visualizing inequality: How graphical emphasis shapes public opinion" Research and Politics.

Lab

Data visualization with ggplot2

Assignment 4: Data visualization

Link to GitHub Classroom available via Moodle on Monday, November 16, 2pm
Deadline on Friday, November 27, 2pm

9. Creating and managing databases

This session will offer an introduction to relational databases: structure, logic, and main types. We will learn how to write SQL, a language designed to query this type of database that is currently employed by most tech companies, and how to use it from R using the DBI package.
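As a small illustration of creating, populating, and querying a relational table, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. It is not the course's R and DBI workflow; the table `posts` and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Create a hypothetical table of posts with a like count per post.
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, page TEXT, likes INTEGER)"
)

# Populate it; the ? placeholders keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO posts (page, likes) VALUES (?, ?)",
    [("a", 10), ("a", 5), ("b", 7)],
)

# Query: total likes per page, in alphabetical order of page.
rows = conn.execute(
    "SELECT page, SUM(likes) AS total FROM posts GROUP BY page ORDER BY page"
).fetchall()
conn.close()

print(rows)
```

The same three SQL statements could be sent from any client library; only the connection code around them would change.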

Resources

Lecture slides
Examples

Required reading

Beaulieu. 2009. Learning SQL. O'Reilly. (Chapters 1, 3, 4, 5, 8)

Lab: SQL

Analyzing public Facebook data in a SQLite database

10. Interacting with online databases

This week, we will dive deeper into databases. In particular, we cover the following topics: how to set up and use relational databases in the cloud, how to run big data analytics through data warehousing services (e.g. Google BigQuery), and the fundamentals of NoSQL databases.

Resources

Lecture slides
Examples

Required reading

Beaulieu. 2009. Learning SQL. O'Reilly. (Chapter 2)
Hows, Membrey, and Plugge. 2014. MongoDB Basics. Apress. (Chapter 1)
Tigani and Naidu. 2017. Google BigQuery Analytics. Wiley. (Chapters 1-3)

Lab

SQL JOINs, subqueries, and BigQuery

Assignment 5: Databases

Link to GitHub Classroom available via Moodle on Monday, November 30, 2pm
Deadline on Friday, December 11, 2pm

11. Cloud computing

This week, we focus on setting up computing environments on the Internet. We will introduce the main concepts of cloud computing and learn why industry is shifting to the cloud and how this is relevant to us as data scientists. The lab gives an introduction to setting up a cloud environment using Amazon Web Services. We will sign up for an account, launch a cloud computing environment, create a webpage, and set up a statistical computing environment.

Resources

Lecture slides
Class slides

Required reading

Rajaraman, V. 2014. "Cloud Computing." Resonance 19(3): 242–58.
AWS: What is cloud computing.
Azure: Developer guide.

Lab: Working with AWS

Set up an AWS account (link from Moodle for AWS Educate free account)
Secure the account
Configure EC2 instance
Work with EC2 instance
- Log in to the EC2-Linux console
- Set up a web server
- Install R and some packages
- Stop the instance

Take-home exam

Link to GitHub Classroom available via Moodle on ...
Deadline on Friday, January 15, 2pm
