
How to programmatically re-import changing datasets in production? #57

Open
derhuerst opened this issue Feb 16, 2024 · 1 comment
Labels
question Further information is requested


@derhuerst

@dancesWithCycles asked in the berlin-gtfs-rt-server project (which transitively depends on gtfs-via-postgres) how to programmatically import GTFS and switch some running service (e.g. an API) to the newly imported data.

I'll describe the experience I have had with different approaches here. Everyone is very welcome to share theirs and discuss the trade-offs!

why the import needs to be (more or less) atomic

From derhuerst/berlin-gtfs-rt-server#9 (comment):

> An alternative approach would be a script that cleans up an existing database without dropping it so that the update happens on a clean database.

With this design, if your script crashes after it has cleaned the DB, you'll leave your service in a non-functional state. Also, even if it runs through, you'll have an unpredictable period of downtime.

separate DBs

> I am using a Managed Server where I do not want to drop and create a database every time I update the GTFS feed. I rather drop and create the respective schema.
>
> At the end of the day I need to make sure to prepare a fresh environment for the GTFS feed import into PostgreSQL without dropping the database. How would you do it?

Recently, in postgis-gtfs-importer, I tackled the problem differently by using more than one DB (sketched below):

  • For each GTFS import, a new DB named gtfs_$unix_timestamp gets created, and data is imported into it.
  • After a successful import, the newly created DB is marked as the latest in a special "bookkeeping" DB.
  • Before each import, all import DBs other than the latest two are deleted.
  • This whole process is done as atomically as PostgreSQL allows, by combining a transaction and an exclusive lock.
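
For illustration, here is a minimal TypeScript sketch of that process using the `pg` client library. The `importGtfs()` step, the `bookkeeping` DB name and the `latest_import` table are illustrative assumptions, not the actual postgis-gtfs-importer internals.

```ts
import {Client} from 'pg'

const runImport = async () => {
	const dbName = `gtfs_${Math.floor(Date.now() / 1000)}`

	// 1. Create a fresh DB for this import. Note that CREATE DATABASE
	// cannot run inside a transaction, so this step alone is not atomic.
	const admin = new Client({database: 'postgres'})
	await admin.connect()
	await admin.query(`CREATE DATABASE "${dbName}"`)
	await admin.end()

	// 2. Import the GTFS feed into the new DB, e.g. by spawning
	// gtfs-via-postgres & piping its output into psql. (Omitted here.)
	// await importGtfs(dbName)

	// 3. Atomically mark the new DB as the latest one, serialising
	// concurrent imports via an exclusive lock on the bookkeeping table.
	const bookkeeping = new Client({database: 'bookkeeping'})
	await bookkeeping.connect()
	try {
		await bookkeeping.query('BEGIN')
		await bookkeeping.query('LOCK TABLE latest_import IN ACCESS EXCLUSIVE MODE')
		await bookkeeping.query(
			'UPDATE latest_import SET db_name = $1, imported_at = now()',
			[dbName],
		)
		await bookkeeping.query('COMMIT')
	} catch (err) {
		await bookkeeping.query('ROLLBACK')
		throw err
	} finally {
		await bookkeeping.end()
	}

	// Cleanup of import DBs other than the latest two is omitted here.
}

runImport().catch((err) => {
	console.error(err)
	process.exit(1)
})
```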

One problem remains: the consuming program then needs to connect to a DB with a dynamic name. Because at MobiData BW IPL we have PgBouncer in place anyway, we use it to "alias" this dynamic DB under a stable name (e.g. gtfs). There are a lot of gotchas involved here though.
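
As a rough sketch of the aliasing (the actual MobiData BW IPL setup may differ), the `[databases]` section of `pgbouncer.ini` maps the stable name onto the current import DB; the importer rewrites this line after each successful import:

```ini
[databases]
; "gtfs" is the stable name clients connect to; dbname gets rewritten
; after each successful import (the value here is just an example).
gtfs = host=127.0.0.1 port=5432 dbname=gtfs_1708099200
```

PgBouncer then has to be reloaded (e.g. via its RELOAD admin command) for the new mapping to take effect, and existing client connections keep pointing at the old DB until they reconnect – one of the gotchas mentioned above.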

TLDR: If you do have the option to programmatically create PostgreSQL DBs, for now I recommend using this tool or replicating its process. Otherwise, consider the options below.

separate schemas

Now that gtfs-via-postgres, as of version 4.9.0, can import more than one GTFS dataset into a single DB, one could also adapt the aforementioned import process to use separate schemas instead of separate DBs.
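
A minimal sketch of the swap step, assuming the new dataset has been imported into a separately named schema and that consumers read from a stable schema name `gtfs` (the names here are illustrative):

```ts
import {Client} from 'pg'

// Swap a freshly imported schema into place under the stable name `gtfs`.
// Schema DDL is transactional in PostgreSQL, so consumers see either the
// old dataset or the new one, never an in-between state.
const swapSchema = async (newSchema: string) => {
	const client = new Client() // connection details via PG* env vars
	await client.connect()
	try {
		await client.query('BEGIN')
		await client.query('DROP SCHEMA IF EXISTS gtfs CASCADE')
		await client.query(`ALTER SCHEMA "${newSchema}" RENAME TO gtfs`)
		await client.query('COMMIT')
	} catch (err) {
		await client.query('ROLLBACK')
		throw err
	} finally {
		await client.end()
	}
}

// usage: node swap-schema.js gtfs_1708099200
swapSchema(process.argv[2]).catch((err) => {
	console.error(err)
	process.exit(1)
})
```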

I see the following advantages:

  • Hosting environments where creating an arbitrary number of DBs is not allowed are supported.
  • We can get rid of the whole PgBouncer hassle (see above).
  • A schema can be created or deleted within a transaction, so the process is truly atomic. It seems that the "bookkeeping" DB and the exclusive lock wouldn't be necessary anymore.

However, there are disadvantages:

@derhuerst

related: mobidata-bw/ipl-orchestration#8
