-
Notifications
You must be signed in to change notification settings - Fork 29
Description
Motivation
We (evidence.dev) use Clickhouse as our query engine. Users write SQL in the browser (or compose components that write SQL), and we send queries to clickhouse. We provide errors to users if their queries are invalid.
At the moment, this requires a round trip to Clickhouse for every validation call. This puts a lot of load on Clickhouse. We'd like to intercept "definitely invalid" queries before they get sent. We can tolerate false negatives as we can fall back to clickhouse for validation.
This library would already be quite useful, but would be about 100% better if it was able to cover more error classes
Proposal: Add support for semantic validation in the core rust library
This is more a discussion than a specific issue. I've been looking into various of the open source SQL dialects and how they think about validation.
Broadly speaking, SQL validation can be divided into two types:
- Syntax validation: does the query parse
- Semantic validation: does the query make sense
- Do the tables exist?
- Do the columns exist on those tables?
- Are the functions valid (name + arity)?
- Do the types work? (e.g., SUM('hello') parses fine but is semantically wrong)
- Are column references unambiguous? (e.g., SELECT id FROM a JOIN b — which id?)
- Do aggregations make sense? (e.g., SELECT x, SUM(y) FROM t without GROUP BY x)
From the readme it seems like both are a goal of this project.
Validation with syntax, semantic, and schema-aware checks
Currently, this library's validation focusses on syntax validation.
It does do some semantic validation, eg in validateWithSchema() in the typescript bindings, but as far as i can tell all the validation happens in the ts package rather than in the main rust crate.
To do semantic validation, information about the schema is essential.
Implementation
This could be added incrementally per dialect, and also need not cover all the above types of validation. eg easier to validate table and column existance, harder to validate that two expressions have types that can be compared
This would require:
- Ability to pass in schema: eg tables and columns, with types
- A set of functions signatures, per dialect.
- These could be retrieved fairly easily for open source dialects using eg
select * from system.functionsorfrom duckdb_functions();etc. - For eg BigQuery this would be more manual
- These could be retrieved fairly easily for open source dialects using eg
- Some hard things about which types can be compared to which other types.
My sense is that it would be very hard / impossible to get to 100%, but fairly easy to get to a high percent coverage.