Setting up: Retrieving Data

Sarah Anoke edited this page May 2, 2020 · 8 revisions

The datasets combined for this project are hosted on several different websites. The goal is to save them all directly to AWS S3. However, S3 cannot initiate downloads from a website; the website (for example, the federal government site www.usaspending.gov) would have to act as a client and initiate the upload.

Since this is not possible, the next best option is to download via EC2 and upload into S3.

Note that S3 buckets exist outside of the VPC, so concepts such as security groups don't apply to them.

Federal Spending Data

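  1. Create an S3 bucket to hold the data.

    • Check the source website for the size of the complete PostgreSQL Dump Archive file. S3 buckets have no fixed capacity, but the file size determines how large the volume attached to the EC2 instance must be.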
  2. Create an IAM role to give an EC2 instance permission to read from and write to S3 buckets. Use the 'AmazonS3FullAccess' policy.

  3. Create an EC2 instance to initiate the download.

    • Remember to attach a volume that's large enough to hold the archive file.
    • Make sure that the instance has an outbound security rule that allows external downloads.
    • Remember to attach the IAM role that will allow the instance read/write access to S3 buckets.
  4. SSH into the EC2 instance.

    • If you get an error like WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!, which can happen when you change the instance associated with an Elastic IP, delete the offending line in the known_hosts file named in the error message and try again.
  5. Install AWS CLI.

# update the package list
sudo apt update

# install the AWS CLI
sudo apt install awscli
  6. Use wget to fetch the file:
wget https://files.usaspending.gov/database_download/usaspending-db_20200110.zip
  7. Move a copy of the zipped file to S3 using an AWS CLI command (complete instructions).
# general syntax is 
# aws s3 cp <source> s3://my-bucket/
aws s3 cp usaspending-db_20200110.zip s3://sanoke-insight-fedspend-projectdata/
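
The size check, download, and upload steps above could be wrapped in a small script, sketched below. The function names (`content_length`, `bytes_to_gib`, `fetch_and_upload`) are hypothetical helpers, not part of the original setup. The sketch assumes that the server reports a Content-Length header and that curl, wget, and the AWS CLI are installed on the instance.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (assumptions, not part of the original setup).

# Read HTTP response headers on stdin and print the last Content-Length value.
content_length() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "content-length" { print $2 }' | tail -n 1
}

# Convert a byte count to GiB, rounded up, to help size the EC2 volume.
bytes_to_gib() {
  local bytes="$1"
  echo $(( (bytes + 1073741823) / 1073741824 ))
}

# Download a file with wget, copy it to the bucket, then list it to confirm.
fetch_and_upload() {
  local url="$1" bucket="$2"
  local file
  file="$(basename "$url")"
  wget --continue "$url"                 # resumes a partial download if present
  aws s3 cp "$file" "s3://${bucket}/"
  aws s3 ls "s3://${bucket}/${file}"
}

# Example usage (needs network access and the IAM role attached to the instance):
#   url=https://files.usaspending.gov/database_download/usaspending-db_20200110.zip
#   bytes_to_gib "$(curl -sIL "$url" | content_length)"
#   fetch_and_upload "$url" sanoke-insight-fedspend-projectdata
```

The volume should be larger than the reported archive size if the file will also be unzipped on the instance.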

Historical Legislator Data

Datasets containing information about members of the United States Congress (1789-Present), congressional committees (1973-Present), committee membership (current only), and presidents and vice presidents of the United States are provided as a community resource in the unitedstates/congress-legislators GitHub repo.
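
One possible way to get these files into S3 from the same EC2 instance is to clone the repo and sync it to a bucket, as sketched below. This is an assumption about the workflow, not a record of what was actually done; `repo_dir_for_url` and `clone_and_sync` are hypothetical helper names, and the bucket name is a placeholder supplied by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (assumptions, not part of the original setup).

# Derive the local directory name that git clone will create for a repo URL.
repo_dir_for_url() {
  basename "$1" .git
}

# Shallow-clone the repo, then copy its files to S3, skipping git metadata.
clone_and_sync() {
  local url="$1" bucket="$2"
  local dir
  dir="$(repo_dir_for_url "$url")"
  git clone --depth 1 "$url"
  aws s3 sync "$dir" "s3://${bucket}/${dir}/" --exclude ".git/*"
}

# Example usage (needs git, network access, and S3 write permission):
#   clone_and_sync https://github.com/unitedstates/congress-legislators.git my-bucket
```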

Congressional District Population Sizes

The US Census publishes data disaggregated on the most common social, economic, housing, and demographic characteristics; each of the four published ACS tables covers one of these axes.

I selected the file (DP05.csv, README) associated with demographic characteristics (sex, age) and aggregated up to get the 2018 total population by congressional district. State abbreviations were downloaded from a simple dataset found online.
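
The "aggregate up" step can be illustrated with a short awk-based sketch that sums a count column per district. The column names used here (`district`, `population`) are hypothetical; the real DP05.csv uses Census-specific headers, and its district names are quoted and contain commas, so a proper CSV parser would be needed for the actual file. `sum_by_key` is an illustrative helper, not the code used for the project.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption): sum a numeric column per key column.
# Works only on a simple headered CSV with no quoted or embedded commas.
sum_by_key() {
  local file="$1" key_col="$2" val_col="$3"
  awk -F',' -v k="$key_col" -v v="$val_col" '
    NR == 1 {
      # Locate the key and value columns by header name.
      for (i = 1; i <= NF; i++) {
        if ($i == k) ki = i
        if ($i == v) vi = i
      }
      if (!ki || !vi) { print "column not found" > "/dev/stderr"; exit 1 }
      next
    }
    { total[$ki] += $vi }
    END { for (d in total) print d "," total[d] }
  ' "$file" | sort
}

# Example usage with a hypothetical, pre-flattened file:
#   sum_by_key population_by_group.csv district population
```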

Congressional District Income and Tax

Individual income and tax data disaggregated by congressional district are made available by the IRS. The data used for this app are from income earned in 2017 (and reported in 2018). The particular quantities used are defined below.

  • [A02650] Total income amount Form 1040:22 / 1040A:15 / 1040EZ:4
  • [A04800] Taxable income amount Form 1040:43 / 1040A:27 / 1040EZ:6
  • [A06500] Income tax amount Form 1040:56, also includes data from Form 1040A and 1040EZ filers. This is the tax due for income earned in 2017 (excluding some special cases).
  • [A11900] Total overpayments amount Form 1040:75 / 1040A:47 / 1040EZ:13a. This amount is commonly referred to as a 'tax refund'.
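
To pull just these quantities out of the IRS file, something like the sketch below could be used. It selects columns by header name, so it does not depend on column order. `select_columns` is a hypothetical helper, and the identifier columns in the usage example (`STATE`, `CONG_DISTRICT`) and the file name are assumptions about the file layout that should be checked against the IRS documentation. The sketch also assumes a simple CSV with no quoted commas.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption): print only the named columns of a CSV.
select_columns() {
  local file="$1"; shift
  local cols
  cols="$(IFS=,; echo "$*")"   # join the requested names with commas
  awk -F',' -v OFS=',' -v want="$cols" '
    { sub(/\r$/, "") }         # tolerate Windows line endings
    NR == 1 {
      n = split(want, names, ",")
      for (i = 1; i <= NF; i++) pos[$i] = i
      for (j = 1; j <= n; j++) {
        if (!(names[j] in pos)) {
          print "missing column: " names[j] > "/dev/stderr"
          exit 1
        }
        idx[j] = pos[names[j]]
      }
    }
    {
      # Runs for the header row too, so the output keeps a header.
      line = $(idx[1])
      for (j = 2; j <= n; j++) line = line OFS $(idx[j])
      print line
    }
  ' "$file"
}

# Example usage (file and identifier column names are assumptions):
#   select_columns irs_districts.csv STATE CONG_DISTRICT A02650 A04800 A06500 A11900
```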