Skip to content

Data pipeline: Ingesting data

Sarah Anoke edited this page Apr 28, 2020 · 8 revisions

Federal Spending Data

These data are in an S3 bucket as a PostgreSQL archive dump within a zip file. Below we'll unzip the file and restore the original PostgreSQL database.

1 - Get the archive file from S3 using root

  1. Log into the root Ubuntu account using sudo -u root -i
  2. Install AWS CLI
sudo apt install awscli
  1. Configure the instance credentials to allow access to S3 bucket (credential or config issues?)
aws configure
# a prompt will appear;
# enter the keys that were generated when IAM profile was first created 

This is a good time to start tmux (more info).

  1. Download the archive dump from S3 
and unzip it
aws s3 cp \ 
   s3://sanoke-insight-fedspend-projectdata/usaspending-db_20200110.zip \
   usaspending-db_20200110.zip
sudo apt install unzip
unzip usaspending-db_20200110.zip -d usaspending-db_20200110

2 - Restore and inspect the database

After we configured the database location in postgresql.conf, the record of user root was lost, so we'll recreate it (same process as above).

sudo -u postgres createuser root
sudo -u postgres createdb root
sudo -u postgres psql
-- add SUPERUSER privileges (in SQL) 
ALTER USER root WITH ENCRYPTED PASSWORD '<enter-password>';
GRANT ALL PRIVILEGES ON DATABASE root TO root;
ALTER USER root WITH SUPERUSER CREATEDB CREATEROLE;

Refer to guidance to understand the structure of these commands, including the flags.

pg_restore --list usaspending-db_20200110 | \ 
   sed '/MATERIALIZED VIEW DATA/D' > restore.list

# the command below takes 5.3 hours to complete
pg_restore --jobs 32 --dbname postgresql://root:'password123'@localhost:5432/root \ 
  --verbose --exit-on-error --use-list restore.list usaspending-db_20200110

psql --dbname postgresql://root:'password123'@localhost:5432/root --command \ 
   'ANALYZE VERBOSE;' --echo-all --set ON_ERROR_STOP=on --set \ 
   VERBOSITY=verbose --set SHOW_CONTEXT=always


pg_restore --list usaspending-db_20200110 | grep "MATERIALIZED VIEW DATA" > refresh.list􏰑

We can log into psql to view a list of the loaded tables using \dt, or download a GUI like pgAdmin or postico. If using pgAdmin,

  • [General] Choose a name for the server; the choice is inconsequential.
  • [Connection] Use the same login information as in the pg_restore commands above
  • [SSH Tunnel] Set up SSH tunneling using the same information used to SSH via command line
  • Once connected, view tables using the Object Explorer on the left, selecting root > Schemas > Public > Tables under the name of the server you just created (FedSpendData).

Historical Legislator Data

These data were read in as a JSON file and parsed into lists in PySpark; these lists were later assembled into a DataFame for use in joins.

Unfortunately I used for loops here, although with more time I could have figured out an implementation with list comprehensions.

# read in the JSON file, parse into lists
# NOTE manually placed JSON on each worker
lc = spark.read.option('multiline','true').json('legislators-current.json').collect()
firstName = []
lastName  = []
typeOfRep = []
party     = []
state     = []
district  = []
year      = []
# for every legislator...
for l in lc:
    # ...and every term...
    for t in l['terms']:
        y = 0
        term_length = int(t['end'][0:4]) - int(t['start'][0:4])
        # ...and for each year in the term
        while y <= term_length:
            firstName.append(l['name']['first'])
            lastName.append(l['name']['last'])
            typeOfRep.append(t['type'])
            party.append(t['party'])
            state.append(t['state'])
            district.append(t['district'])
            year.append(int(t['start'][0:4]) + y)
            y += 1

Congressional District Population Sizes

The US Census publishes data disaggregated on the most common social, economic, housing, and demographic characteristics; each of the four ACS tables published are on one of these axes.

I selected the file (DP05.csv, README) associated with demographic characteristics (sex, age) and aggregated up to get the 2018 total population by congressional district. State abbreviations from here.

Clone this wiki locally