
Setting up: Retrieving Data

Sarah Anoke edited this page Feb 11, 2020 · 8 revisions

The datasets combined for this project are hosted on several different websites. The goal is to save them all directly to AWS S3. However, S3 cannot initiate a download from a website; the website (for example, the federal government site www.usaspending.gov) would have to act as a client and initiate the upload.

Since that is not possible, the next best option is to download each file to an EC2 instance and then upload it from there to S3.

Note that S3 buckets exist outside of the VPC, so VPC-level concepts such as security groups don't apply to them.

Federal Spending Data

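  1. Create an S3 bucket to hold the data. S3 buckets are not provisioned with a fixed capacity, so there is no size to choose up front.

    • Check the source website for the size of the complete PostgreSQL Dump Archive file; you will need it to size the EC2 volume in step 3.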
  2. Create an IAM role that gives an EC2 instance permission to read from and write to S3 buckets. Use the 'AmazonS3FullAccess' policy.

  3. Create an EC2 instance to initiate the download.

    • Remember to attach a volume that's large enough to hold the archive file.
    • Make sure that the instance's security group has an outbound rule that allows traffic to external hosts, so the download can reach the source website.
    • Remember to attach the IAM role that will allow the instance read/write access to S3 buckets.
  4. SSH into the EC2 instance.

    • If you get an error like WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!, which can happen when you change the instance associated with an Elastic IP, delete the offending line from the known_hosts file named in the warning (or run ssh-keygen -R followed by the host's address) and try again.
  5. Install the AWS CLI.

# update the package list
sudo apt update

# install the AWS CLI
sudo apt install awscli
  6. Use wget to fetch the file:
wget https://files.usaspending.gov/database_download/usaspending-db_20200110.zip
  7. Copy the zipped file to S3 using an AWS CLI command (see the AWS CLI documentation for complete instructions).
# general syntax is 
# aws s3 cp <source> s3://my-bucket/
aws s3 cp usaspending-db_20200110.zip s3://sanoke-insight-fedspend-projectdata/
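
The download and upload steps above could be wrapped in a small script along the following lines. This is only a sketch: the run helper, the fetch_and_upload function, and the DRY_RUN switch are hypothetical names introduced here, and the final aws s3 ls call is an added check that is not part of the original steps. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: download the archive on the EC2 instance, copy it to S3, list the bucket.
# DRY_RUN is a hypothetical switch: 1 (default) only prints, 0 executes.
set -eu

ARCHIVE_URL="https://files.usaspending.gov/database_download/usaspending-db_20200110.zip"
BUCKET="s3://sanoke-insight-fedspend-projectdata/"
DRY_RUN="${DRY_RUN:-1}"

# Print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# fetch_and_upload <url> <s3-destination>
fetch_and_upload() {
  url="$1"
  bucket="$2"
  file="$(basename "$url")"        # local file name, taken from the URL
  run wget --continue "$url"       # --continue resumes a partial download
  run aws s3 cp "$file" "$bucket"  # same general syntax as shown above
  run aws s3 ls "$bucket"          # added check: confirm the object is listed
}

fetch_and_upload "$ARCHIVE_URL" "$BUCKET"
```

Running the script with DRY_RUN=0 set in the environment executes the commands for real; that assumes the instance already has the IAM role attached and enough disk space for the archive.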

Historical Legislator Data

Datasets containing information about members of the United States Congress (1789-Present), congressional committees (1973-Present), committee membership (current only), and presidents and vice presidents of the United States are provided as a community resource in the unitedstates/congress-legislators GitHub repo.
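
One possible way to get this repo into S3, following the same EC2-as-intermediary pattern, is sketched below. The page does not say how these files were actually retrieved, so treat everything here as an assumption: the congress-legislators/ key prefix, the mirror_repo function, the run helper, and the DRY_RUN switch are all hypothetical. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: shallow-clone the GitHub repo on the EC2 instance and sync it to S3.
# DRY_RUN is a hypothetical switch: 1 (default) only prints, 0 executes.
set -eu

REPO_URL="https://github.com/unitedstates/congress-legislators.git"
# Assumption: reuse the project bucket under a hypothetical prefix.
DEST="s3://sanoke-insight-fedspend-projectdata/congress-legislators/"
DRY_RUN="${DRY_RUN:-1}"

# Print a command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# mirror_repo <git-url> <s3-destination>
mirror_repo() {
  url="$1"
  dest="$2"
  dir="$(basename "$url" .git)"      # clone directory, e.g. congress-legislators
  run git clone --depth 1 "$url" "$dir"              # latest snapshot only
  run aws s3 sync "$dir" "$dest" --exclude ".git/*"  # skip git metadata
}

mirror_repo "$REPO_URL" "$DEST"
```

A shallow clone keeps the transfer small because only the current files are needed, not the commit history; aws s3 sync then uploads just the files that are new or changed on later runs.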
