Skip to content

Setting up: PostgreSQL

Sarah Anoke edited this page Jan 31, 2020 · 6 revisions

This project is centered around federal spending data from 2001-present, provided directly by the US Government. The data themselves are provided within a PostgreSQL dump archive, which we will restore to the full 1.5TB+ database.

These instructions assume the dump archive is (located in S3).

The following steps were completed on an m5.4xlarge EC2 instance, with

  • 16 vCPUs and 64 GB memory (needed to speed up pg_restore process from days to hours)
  • 8GB storage and an additional 1TB volume attached and manually mounted.

1 - Install PostgreSQL

sudo apt update
sudo apt install postgresql postgresql-contrib

According to Sriram,

On installation, a superuser postgres was created, with a database postgres associated with that account. Since PostgreSQL uses the client's OS user profile to authenticate the user, to access the database we need to log into to that user. By default, the installation created a new OS user postgres inside the Ubuntu OS.

According to guidance on restoring this particular database,

The archive was dumped from a generic database where objects were owned by a user named root, and it will simplify the restore if that same user exists on the server this database will be restored to.

So next, we create a user root...

sudo -u postgres createuser root
sudo -u postgres createdb root

# start psql to grant the privleges
sudo -u postgres psql

  • [Note] If you get Command not found when running sudo -u postgres psql, it's likely because sudo doesn't know where psql is to open it. Use this Stack Exchange post to resolve the issue. In my case, I used which psql to find the correct path and specified it directly.
sudo -u postgres /usr/bin/psql

Once psql is running, and add superuser privileges to root.

-- run the following commands using psql
ALTER USER root WITH ENCRYPTED PASSWORD 'password123';

GRANT ALL PRIVILEGES ON DATABASE root TO root ;

ALTER USER root WITH SUPERUSER CREATEDB CREATEROLE;
-- check the list of users and permissions
\du

\q
  • [Note] When creating the account, we have to specify a password in plaintext 😟. The more secure option is using some sort of encoding or hash function.

2 - Configure PostgreSQL

By default, PostgreSQL stores the database in the root volume, which will run out of space quickly. We'll point PostgreSQL to the correct location, as well as configure some memory settings to make pg_restore complete faster.

# change owner of mount point
sudo chown -R postgres:postgres /database
su postgres   # b/c owns the server process
/usr/lib/postgresql/10/bin/initdb -D /database
exit

# turn off postgresql so we can configure it
service postgresql stop

nano /etc/postgresql/10/main/postgresql.conf
# MODIFIED CONFIGURATIONS
# data_directory = ‘/database’
# shared_buffers = 16GB
# max_worker_processes = 16             # 16 vCPUs on this machine
# max_parallel_workers_per_gather = 16  # max
# max_parallel_workers = 16
# effective_cache_size = 48GB           # 75% of 64GB mem available

# turn postgresql back on
service postgresql start

3 - Get the archive file from S3 using root

  1. Log into the root Ubuntu account using sudo -u root -i
  2. Install AWS CLI
sudo apt install awscli
  1. Configure the instance credentials to allow access to S3 bucket (credential or config issues?)
aws configure
# a prompt will appear;
# enter the keys that were generated when IAM profile was first created 

This is a good time to start tmux (more info).

  1. Download the archive dump from S3 
and unzip it
aws s3 cp \ 
   s3://sanoke-insight-fedspend-projectdata/usaspending-db_20200110.zip \
   usaspending-db_20200110.zip
sudo apt install unzip
unzip usaspending-db_20200110.zip -d usaspending-db_20200110

4 - Restore and inspect the database

After we configured the database location in postgresql.conf, the record of user root was lost, so we'll recreate it (same process as above).

sudo -u postgres createuser root
sudo -u postgres createdb root
sudo -u postgres psql
-- add SUPERUSER privileges (in SQL) 
ALTER USER root WITH ENCRYPTED PASSWORD 'password123';
GRANT ALL PRIVILEGES ON DATABASE root TO root;
ALTER USER root WITH SUPERUSER CREATEDB CREATEROLE;

Refer to guidance to understand the structure of these commands, including the flags.

pg_restore --list usaspending-db_20200110 | \ 
   sed '/MATERIALIZED VIEW DATA/D' > restore.list

# the command below takes several hours to complete
pg_restore --jobs 32 --dbname postgresql://root:'password123'@localhost:5432/root \ 
  --verbose --exit-on-error --use-list restore.list usaspending-db_20200110

psql --dbname postgresql://root:'password123'@localhost:5432/root --command \ 
   'ANALYZE VERBOSE;' --echo-all --set ON_ERROR_STOP=on --set \ 
   VERBOSITY=verbose --set SHOW_CONTEXT=always


pg_restore --list usaspending-db_20200110 | grep "MATERIALIZED VIEW DATA" > refresh.list􏰑

We can log into psql to view a list of the loaded tables using \dt, or download a GUI like pgAdmin or postico. If using pgAdmin,

  • [General] Choose a name for the server; the choice is inconsequential.
  • [Connection] Use the same login information as in the pg_restore commands above
  • [SSH Tunnel] Set up SSH tunneling using the same information used to SSH via command line
  • Once connected, view tables using the Object Explorer on the left, selecting root > Schemas > Public > Tables under the name of the server you just created (FedSpendData).
Clone this wiki locally