Skip to content

Setting up: PostgreSQL

Sarah Anoke edited this page Feb 7, 2020 · 6 revisions

This project is centered around federal spending data from 2001-present, provided directly by the US Government. The data themselves are provided within a PostgreSQL dump archive, which we will restore to the full 1.5TB+ database.

These instructions assume the dump archive is (located in S3).

The following steps were completed on an m5.4xlarge EC2 instance, with

  • 16 vCPUs and 64 GB memory (needed to speed up pg_restore process from days to hours)
  • 8GB storage and an additional 1TB volume attached and manually mounted.

[Note] My Spark master setup occurred on this same instance.

1 - Install PostgreSQL

sudo apt update
sudo apt install postgresql postgresql-contrib

According to Sriram,

On installation, a superuser postgres was created, with a database postgres associated with that account. Since PostgreSQL uses the client's OS user profile to authenticate the user, to access the database we need to log into to that user. By default, the installation created a new OS user postgres inside the Ubuntu OS.

According to guidance on restoring this particular database,

The archive was dumped from a generic database where objects were owned by a user named root, and it will simplify the restore if that same user exists on the server this database will be restored to.

So next, we create a user root...

sudo -u postgres createuser root
sudo -u postgres createdb root

# start psql to grant the privleges
sudo -u postgres psql

  • [Note] If you get Command not found when running sudo -u postgres psql, it's likely because sudo doesn't know where psql is to open it. Use this Stack Exchange post to resolve the issue. In my case, I used which psql to find the correct path and specified it directly.
sudo -u postgres /usr/bin/psql

Once psql is running, and add superuser privileges to root.

-- run the following commands using psql
ALTER USER root WITH ENCRYPTED PASSWORD '<enter-password>';

GRANT ALL PRIVILEGES ON DATABASE root TO root ;

ALTER USER root WITH SUPERUSER CREATEDB CREATEROLE;
-- check the list of users and permissions
\du

\q
  • [Note] When creating the account, we have to specify a password in plaintext 😟. The more secure option is using some sort of encoding or hash function.

2 - Configure PostgreSQL

By default, PostgreSQL stores the database in the root volume, which will run out of space quickly. We'll point PostgreSQL to the correct location, as well as configure some memory settings to make pg_restore complete faster.

# change owner of mount point
sudo chown -R postgres:postgres /database
su postgres   # b/c owns the server process
/usr/lib/postgresql/10/bin/initdb -D /database
exit

# turn off postgresql so we can configure it
service postgresql stop

nano /etc/postgresql/10/main/postgresql.conf
# MODIFIED CONFIGURATIONS
# data_directory = ‘/database’
# shared_buffers = 16GB
# max_worker_processes = 16             # 16 vCPUs on this machine
# max_parallel_workers_per_gather = 16  # max
# max_parallel_workers = 16
# effective_cache_size = 48GB           # 75% of 64GB mem available

# turn postgresql back on
service postgresql start

_[Note] If we've stopped the instance running PostgreSQL, on restart we must remount the extra volume and restart the postgresql service.