Suppose I have a shared mutation table that another SQRL job writes to:
/*+engine(iceberg) */
CREATE TABLE SinkTable (
userid INT NOT NULL,
total_tokens BIGINT NOT NULL,
window_end TIMESTAMP_LTZ(3),
window_time TIMESTAMP_LTZ(3)
);
and now I want to build an application that queries that table directly in iceberg with Duckdb (those are my only enabled database engines):
TokensByUser(userid INT):= SELECT * FROM SinkTable where userid = :userid;
Currently, that still requires a flink engine which essentially copies the table over into a new intermediate iceberg table that DuckDB then queries.
That is a bit wasteful. We could query the underlying table directly and cut Flink out entirely (and leave out as an engine).
This would require:
- supporting the planner without a configured "stream" engine.
- allowing "shortcuts" in the dag when planning database tables. If the table to be materialized is a source table and the source is in the same engine as the sink and the engine supports such sharing (only table formats support this), then we can use the source directly. The key is figuring out how to convert that to an
EngineCreateTable.
- support producing deployment assets without stream engine.
On the other hand, I'm wondering if this is generally a useful feature. If this is for one-time querying, you can do that much easier and cheaper outside of DataSQRL (e.g. Athena if tables are registered in Glue). If this is for building an application, you want to make sure that iceberg table is optimally partitioned and sorted for that query workload which often means that you need to copy the table and lay it out differently. Hence, this feature may not provide actual value.
Just putting this hear as food for thought and to collect feedback.
Suppose I have a shared mutation table that another SQRL job writes to:
and now I want to build an application that queries that table directly in iceberg with Duckdb (those are my only enabled database engines):
Currently, that still requires a flink engine which essentially copies the table over into a new intermediate iceberg table that DuckDB then queries.
That is a bit wasteful. We could query the underlying table directly and cut Flink out entirely (and leave out as an engine).
This would require:
EngineCreateTable.On the other hand, I'm wondering if this is generally a useful feature. If this is for one-time querying, you can do that much easier and cheaper outside of DataSQRL (e.g. Athena if tables are registered in Glue). If this is for building an application, you want to make sure that iceberg table is optimally partitioned and sorted for that query workload which often means that you need to copy the table and lay it out differently. Hence, this feature may not provide actual value.
Just putting this hear as food for thought and to collect feedback.