
How to programmatically re-import changing datasets in production? #57

Open
derhuerst opened this issue Feb 16, 2024 · 1 comment
Labels
question Further information is requested


@derhuerst

@dancesWithCycles asked in the berlin-gtfs-rt-server project (which transitively depends on gtfs-via-postgres) how to programmatically import GTFS and switch some running service (e.g. an API) to the newly imported data.

I'll describe the experience I have had with different approaches here. Everyone is very welcome to share theirs and discuss the trade-offs!

why the import needs to be (more or less) atomic

From derhuerst/berlin-gtfs-rt-server#9 (comment):

> An alternative approach would be a script that cleans up an existing database without dropping it so that the update happens on a clean database.

With this design, if your script crashes after it has cleaned the DB, you'll leave your service in a non-functional state. Also, even if it runs through, you'll have an unpredictable period of downtime.

separate DBs

> I am using a Managed Server where I do not want to drop and create a database every time I update the GTFS feed. I rather drop and create the respective schema.
>
> At the end of the day I need to make sure to prepare a fresh environment for the GTFS feed import into PostgreSQL without dropping the database. How would you do it?

Recently, in postgis-gtfs-importer, I tackled the problem differently by using more than one DB (sketched below):

  • For each GTFS import, a new DB named gtfs_$unix_timestamp gets created, and data is imported into it.
  • After a successful import, the newly created DB is marked as the latest in a special "bookkeeping" DB.
  • Before each import, all import DBs other than the latest two are deleted.
  • This whole process is done as atomically as PostgreSQL allows, by combining a transaction and an exclusive lock.
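
For illustration, here is a minimal TypeScript sketch of that process using the `pg` client library. The `importGtfs()` step, the `bookkeeping` DB name and the `latest_import` table are illustrative assumptions, not the actual postgis-gtfs-importer internals.

```ts
import {Client} from 'pg'

const runImport = async () => {
	const dbName = `gtfs_${Math.floor(Date.now() / 1000)}`

	// 1. Create a fresh DB for this import. Note that CREATE DATABASE
	// cannot run inside a transaction, so this step alone is not atomic.
	const admin = new Client({database: 'postgres'})
	await admin.connect()
	await admin.query(`CREATE DATABASE "${dbName}"`)
	await admin.end()

	// 2. Import the GTFS feed into the new DB, e.g. by spawning
	// gtfs-via-postgres & piping its output into psql. (Omitted here.)
	// await importGtfs(dbName)

	// 3. Atomically mark the new DB as the latest one, serialising
	// concurrent imports via an exclusive lock on the bookkeeping table.
	const bookkeeping = new Client({database: 'bookkeeping'})
	await bookkeeping.connect()
	try {
		await bookkeeping.query('BEGIN')
		await bookkeeping.query('LOCK TABLE latest_import IN ACCESS EXCLUSIVE MODE')
		await bookkeeping.query(
			'UPDATE latest_import SET db_name = $1, imported_at = now()',
			[dbName],
		)
		await bookkeeping.query('COMMIT')
	} catch (err) {
		await bookkeeping.query('ROLLBACK')
		throw err
	} finally {
		await bookkeeping.end()
	}

	// Cleanup of import DBs other than the latest two is omitted here.
}

runImport().catch((err) => {
	console.error(err)
	process.exit(1)
})
```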

One problem remains: the consuming program then needs to connect to a DB with a dynamic name. Because at MobiData BW IPL we have PgBouncer in place anyway, we use it to "alias" this dynamic DB under a stable name (e.g. gtfs). There are a lot of gotchas involved here though.
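
As a rough sketch of the aliasing (the actual MobiData BW IPL setup may differ), the `[databases]` section of `pgbouncer.ini` maps the stable name onto the current import DB; the importer rewrites this line after each successful import:

```ini
[databases]
; "gtfs" is the stable name clients connect to; dbname gets rewritten
; after each successful import (the value here is just an example).
gtfs = host=127.0.0.1 port=5432 dbname=gtfs_1708099200
```

PgBouncer then has to be reloaded (e.g. via its RELOAD admin command) for the new mapping to take effect, and existing client connections keep pointing at the old DB until they reconnect – one of the gotchas mentioned above.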

TLDR: If you do have the option to programmatically create PostgreSQL DBs, for now I recommend using this tool or replicating its process. Otherwise, consider the options below.

separate schemas

Now that gtfs-via-postgres, as of version 4.9.0, can import more than one GTFS dataset into a single DB, one could also adapt the aforementioned import process to use separate schemas instead of separate DBs.
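
A minimal sketch of the swap step, assuming the new dataset has been imported into a separately named schema and that consumers read from a stable schema name `gtfs` (the names here are illustrative):

```ts
import {Client} from 'pg'

// Swap a freshly imported schema into place under the stable name `gtfs`.
// Schema DDL is transactional in PostgreSQL, so consumers see either the
// old dataset or the new one, never an in-between state.
const swapSchema = async (newSchema: string) => {
	const client = new Client() // connection details via PG* env vars
	await client.connect()
	try {
		await client.query('BEGIN')
		await client.query('DROP SCHEMA IF EXISTS gtfs CASCADE')
		await client.query(`ALTER SCHEMA "${newSchema}" RENAME TO gtfs`)
		await client.query('COMMIT')
	} catch (err) {
		await client.query('ROLLBACK')
		throw err
	} finally {
		await client.end()
	}
}

// usage: node swap-schema.js gtfs_1708099200
swapSchema(process.argv[2]).catch((err) => {
	console.error(err)
	process.exit(1)
})
```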

I see the following advantages:

  • Hosting environments where creating an arbitrary number of DBs is not allowed are supported.
  • We can get rid of the whole PgBouncer hassle (see above).
  • A schema can be created or deleted within a transaction, so the process is truly atomic. It seems that the "bookkeeping" DB and the exclusive lock wouldn't be necessary anymore.

However, there are disadvantages:

@derhuerst

related: mobidata-bw/ipl-orchestration#8
