-
Notifications
You must be signed in to change notification settings - Fork 2
Setting up: PostgreSQL
This project is centered around federal spending data from 2001-present, provided directly by the US Government. The data themselves are provided within a PostgreSQL dump archive, which we will restore to the full 1.5TB+ database.
These instructions assume the dump archive is (located in S3).
The following steps were completed on an m5.4xlarge EC2 instance, with
- 16 vCPUs and 64 GB memory (needed to speed up
pg_restore
process from days to hours) - 8GB storage and an additional 1TB volume attached and manually mounted.
sudo apt update
sudo apt install postgresql postgresql-contrib
According to Sriram,
On installation, a superuser
postgres
was created, with a databasepostgres
associated with that account. Since PostgreSQL uses the client's OS user profile to authenticate the user, to access the database we need to log into to that user. By default, the installation created a new OS userpostgres
inside the Ubuntu OS.
According to guidance on restoring this particular database,
The archive was dumped from a generic database where objects were owned by a user named
root
, and it will simplify the restore if that same user exists on the server this database will be restored to.
So next, we create a user root
...
sudo -u postgres createuser root
sudo -u postgres createdb root
# start psql to grant the privleges
sudo -u postgres psql
- [Note] If you get
Command not found
when runningsudo -u postgres psql
, it's likely becausesudo
doesn't know wherepsql
is to open it. Use this Stack Exchange post to resolve the issue. In my case, I usedwhich psql
to find the correct path and specified it directly.
sudo -u postgres /usr/bin/psql
Once psql
is running, and add superuser privileges to root
.
-- run the following commands using psql
ALTER USER root WITH ENCRYPTED PASSWORD 'password123';
GRANT ALL PRIVILEGES ON DATABASE root TO root ;
ALTER USER root WITH SUPERUSER CREATEDB CREATEROLE;
-- check the list of users and permissions
\du
\q
- [Note] When creating the account, we have to specify a password in plaintext 😟. The more secure option is using some sort of encoding or hash function.
By default, PostgreSQL stores the database in the root volume, which will run out of space quickly. We'll point PostgreSQL to the correct location, as well as configure some memory settings to make pg_restore
complete faster.
# change owner of mount point
sudo chown -R postgres:postgres /database
su postgres # b/c owns the server process
/usr/lib/postgresql/10/bin/initdb -D /database
exit
# turn off postgresql so we can configure it
service postgresql stop
nano /etc/postgresql/10/main/postgresql.conf
# MODIFIED CONFIGURATIONS
# data_directory = ‘/database’
# shared_buffers = 16GB
# max_worker_processes = 16 # 16 vCPUs on this machine
# max_parallel_workers_per_gather = 16 # max
# max_parallel_workers = 16
# effective_cache_size = 48GB # 75% of 64GB mem available
# turn postgresql back on
service postgresql start
- Log into the
root
Ubuntu account usingsudo -u root -i
- Install AWS CLI
sudo apt install awscli
- Configure the instance credentials to allow access to S3 bucket (credential or config issues?)
aws configure
# a prompt will appear;
# enter the keys that were generated when IAM profile was first created
This is a good time to start tmux
(more info).
- Download the archive dump from S3 and unzip it
aws s3 cp \
s3://sanoke-insight-fedspend-projectdata/usaspending-db_20200110.zip \
usaspending-db_20200110.zip
sudo apt install unzip
unzip usaspending-db_20200110.zip -d usaspending-db_20200110
After we configured the database location in postgresql.conf
, the record of user root
was lost, so we'll recreate it (same process as above).
sudo -u postgres createuser root
sudo -u postgres createdb root
sudo -u postgres psql
-- add SUPERUSER privileges (in SQL)
ALTER USER root WITH ENCRYPTED PASSWORD 'password123';
GRANT ALL PRIVILEGES ON DATABASE root TO root;
ALTER USER root WITH SUPERUSER CREATEDB CREATEROLE;
Refer to guidance to understand the structure of these commands, including the flags.
pg_restore --list usaspending-db_20200110 | \
sed '/MATERIALIZED VIEW DATA/D' > restore.list
# the command below takes several hours to complete
pg_restore --jobs 32 --dbname postgresql://root:'password123'@localhost:5432/root \
--verbose --exit-on-error --use-list restore.list usaspending-db_20200110
psql --dbname postgresql://root:'password123'@localhost:5432/root --command \
'ANALYZE VERBOSE;' --echo-all --set ON_ERROR_STOP=on --set \
VERBOSITY=verbose --set SHOW_CONTEXT=always
pg_restore --list usaspending-db_20200110 | grep "MATERIALIZED VIEW DATA" > refresh.list
We can log into psql
to view a list of the loaded tables using \dt
, or download a GUI like pgAdmin or postico. If using pgAdmin,
- [General] Choose a name for the server; the choice is inconsequential.
- [Connection] Use the same login information as in the
pg_restore
commands above - [SSH Tunnel] Set up SSH tunneling using the same information used to SSH via command line
- Once connected, view tables using the Object Explorer on the left, selecting
root > Schemas > Public > Tables under the name of the server you just created (
FedSpendData
).
📬 Create an issue if you have one.
(📝 instructions accurate as of Feb 2020)