Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for DuckDB or ClickHouse? #31

Open
patricksheehan opened this issue Dec 16, 2022 · 10 comments
Open

support for DuckDB or ClickHouse? #31

patricksheehan opened this issue Dec 16, 2022 · 10 comments
Assignees
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@patricksheehan
Copy link

I think the arrivals_departures view is what I'm most interested in, but for the simple iOS app I'm building I'd like to avoid hosting a database. Would it be possible/reasonable to provide a SQLite port? FWIW I'm happy to work on this, I just haven't done such a port before.

@derhuerst derhuerst added enhancement New feature or request help wanted Extra attention is needed labels Dec 16, 2022
@derhuerst
Copy link
Member

Thanks for asking; If SQLite can be supported in a reasonable way, I would like to support it! I assume that 99% can be ported easily, but there might be a few hard parts; I'm not sure though, because I don't know SQLite well enough to judge without looking into it. Ideally, I would like to avoid a class port (a fork that implements SQLite, while this repo only implements PostgreSQL), because keeping the two synchronised is labor-intensive and error-prone.

I would propose that you try porting it in a draft PR and leave comments on all sections that are unclear or unfinished. We can then think about how to implement a switch that

In the worst case, if the SQLite code path ends up being too different from the PostgreSQL one, we'll have to resort to a true fork.

@patricksheehan
Copy link
Author

Makes sense! For now I'm going to pursue some other routes (Transitland API, working directly with a SQLite implementation) and will circle back if this seems like the quickest path :)

@atvaccaro
Copy link

duckdb could also be an interesting option and it looks like there's node bindings. Maybe an ORM or something like dbt to help with the cross-database compatibility?

@derhuerst
Copy link
Member

duckdb could also be an interesting option [...].

Indeed, DuckDB claims to cover the GTFS analysis use case very well. One could analyze how it performs with the queries used as benchmarks, which reflect (my personal, happy to receive some feedback) use cases.

Maybe an ORM or something like dbt to help with the cross-database compatibility?

AFAIK usually ORMs assume that the process connects to the DB and sends queries "interactively", which is unnecessarily slow for our use case (inserting millions of rows known up front). Is it possible to generate SQL (dialects) using dbt?

@atvaccaro
Copy link

atvaccaro commented Feb 8, 2023

AFAIK usually ORMs assume that the process connects to the DB and sends queries "interactively", which is unnecessarily slow for our use case (inserting millions of rows known up front).

I think the main use case would be to generate different dialects for the views/tables on top of the raw data, for example using SQLAlchemy to make stuff like service_days cross-db. I think the data loading is trickier and more database-specific. duckdb will gladly create a table for you based on an input CSV so you don't need to actually define the table up front like Postgres.

Is it possible to generate SQL (dialects) using dbt?

Yes it's essentially just a framework for templating SQL via Jinja (and providing a dependency graph it can execute) and developers use macros to provide cross-database support. dbt will compile to the appropriate dialect depending on your chosen database target; I could imagine either SQLAlchemy or dbt being useful for creating the views/tables on top of the initial tables.

I think you've looked at our project some already but we use dbt as the "source of truth" for our BigQuery data warehouse models.

@derhuerst
Copy link
Member

I'm hesitant to use an ORM, even if it "just" generates SQL dialects instead of connecting to the DBs, for the following reasons:

  1. complexity – Mapping the GTFS semantic model, which is quite dynamic (e.g. location_type in stops.txt, calendar_dates.txt & calendar.txt, or the GTFS time semanstics), to a set of ORM models seems to be tricky. How we make sure to support all GTFS use cases while keeping the code maintainable?
  2. performance
    • Currently, importing a large GTFS feed already takes a few minutes, which is long enough to throw you out of your flow, and long enough to be annoying in CI runs. If an ORM would increase this time e.g. 3x (which is not too unreasonable, given that they are usually quite complex), that would make gtfs-via-postgres a signifcantly less useful.
    • The arguably more important aspect is "runtime performance", i.e. the DB's query planner's ability to optimise queries using views like arrivals_departures/connections.
  3. usability without ORMs – I would like gtfs-via-postgres to stay a tool that is useful to people analysing GTFS using plain SQL queries, and to programs written in any programming lanauge. We would have to make sure that the SQL(s) generated by the ORM is idiomatic even from a non-ORM perspective.

While these are strengths of gtfs-via-postgres that I have strong opinions about, it is an unproven assumption that it is impossible to retain them using ORMs, so I think it's best to make an experiment to try one of them and see how they behave. I'd be very happy to be proven wrong! I can't promise to merge (or otherwise make use of) that work though.

@derhuerst
Copy link
Member

I think you've looked at our project some already but we use dbt as the "source of truth" for our BigQuery data warehouse models.

Didn't have a look at the DB-related code yet. Will have a look.

@derhuerst
Copy link
Member

derhuerst commented Apr 10, 2023

FYI: I'm currently looking into how well DuckDB fits as a local in-process database.

also related: http://duckdb.org/2023/08/23/even-friendlier-sql.html

@tobwen
Copy link

tobwen commented Jul 9, 2023

I'd also recommend to have a look at https://clickhouse.com/ - I'm running a 80M x 78 database (rows x columns) on another topic and the query time went down from 1 seconds to less than 0.01 ms - even without indexes.

@derhuerst
Copy link
Member

derhuerst commented Jul 11, 2023

I think adopting any other database as a second option is only worth it if it provides significant operational (e.g. SQLite/DuckDB with local in-process execution and great portability) or performance (e.g. parallel execution beyond what PostgreSQL can do) benefits.

I briefly looked into both DuckDB as well as ClickHouse, and both look promising with regards to gtfs-via-postgres's goals.

But I think I won't have enough time/motivation to investigate the necessary changes to support any one. You're very welcome to do this and submit a WIP Pull Request!

Please note though that I won't necessarily merge it, depending on the increase in code complexity it brings, so consider this an experiment. This is also why, if you decide to investigate one of the DB systems, I encourage you to give frequent update on "roadblocks" (where you notice that significant changes in the SQL generated gtfs-via-postgres are necessary), so that I can let you know if I'm not feeling comfortable anymore maintaining the resulting complexity long-term.

@derhuerst derhuerst changed the title Provide SQLite port? port to SQLite/DuckDB or ClickHouse? Jul 11, 2023
@derhuerst derhuerst changed the title port to SQLite/DuckDB or ClickHouse? support for SQLite/DuckDB or ClickHouse? Jul 11, 2023
@derhuerst derhuerst changed the title support for SQLite/DuckDB or ClickHouse? support for DuckDB or ClickHouse? Oct 6, 2023
@derhuerst derhuerst self-assigned this Feb 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Development

No branches or pull requests

4 participants