Skip to content

Data pipeline: Ingesting data

Sarah Anoke edited this page Feb 4, 2020 · 8 revisions

Federal Spending Data

These data are in an S3 bucket as a PostgreSQL archive dump within a zip file. Below we'll unzip the file and restore the original PostgreSQL database.

1 - Get the archive file from S3 using root

  1. Log into the root Ubuntu account using sudo -u root -i
  2. Install AWS CLI
sudo apt install awscli
  1. Configure the instance credentials to allow access to S3 bucket (credential or config issues?)
aws configure
# a prompt will appear;
# enter the keys that were generated when IAM profile was first created 

This is a good time to start tmux (more info).

  1. Download the archive dump from S3 
and unzip it
aws s3 cp \ 
   s3://sanoke-insight-fedspend-projectdata/usaspending-db_20200110.zip \
   usaspending-db_20200110.zip
sudo apt install unzip
unzip usaspending-db_20200110.zip -d usaspending-db_20200110

2 - Restore and inspect the database

After we configured the database location in postgresql.conf, the record of user root was lost, so we'll recreate it (same process as above).

sudo -u postgres createuser root
sudo -u postgres createdb root
sudo -u postgres psql
-- add SUPERUSER privileges (in SQL) 
ALTER USER root WITH ENCRYPTED PASSWORD '<enter-password>';
GRANT ALL PRIVILEGES ON DATABASE root TO root;
ALTER USER root WITH SUPERUSER CREATEDB CREATEROLE;

Refer to guidance to understand the structure of these commands, including the flags.

pg_restore --list usaspending-db_20200110 | \ 
   sed '/MATERIALIZED VIEW DATA/D' > restore.list

# the command below takes 5.3 hours to complete
pg_restore --jobs 32 --dbname postgresql://root:'password123'@localhost:5432/root \ 
  --verbose --exit-on-error --use-list restore.list usaspending-db_20200110

psql --dbname postgresql://root:'password123'@localhost:5432/root --command \ 
   'ANALYZE VERBOSE;' --echo-all --set ON_ERROR_STOP=on --set \ 
   VERBOSITY=verbose --set SHOW_CONTEXT=always


pg_restore --list usaspending-db_20200110 | grep "MATERIALIZED VIEW DATA" > refresh.list􏰑

We can log into psql to view a list of the loaded tables using \dt, or download a GUI like pgAdmin or postico. If using pgAdmin,

  • [General] Choose a name for the server; the choice is inconsequential.
  • [Connection] Use the same login information as in the pg_restore commands above
  • [SSH Tunnel] Set up SSH tunneling using the same information used to SSH via command line
  • Once connected, view tables using the Object Explorer on the left, selecting root > Schemas > Public > Tables under the name of the server you just created (FedSpendData).
Clone this wiki locally