Skip to content

Discussion: Semantic Validation #20

@archiewood

Description

@archiewood

related: #6 #19

Motivation

We (evidence.dev) use Clickhouse as our query engine. Users write SQL in the browser (or compose components that write SQL), and we send queries to clickhouse. We provide errors to users if their queries are invalid.

At the moment, this requires a round trip to Clickhouse for every validation call. This puts a lot of load on Clickhouse. We'd like to intercept "definitely invalid" queries before they get sent. We can tolerate false negatives as we can fall back to clickhouse for validation.

This library would already be quite useful, but would be about 100% better if it was able to cover more error classes

Proposal: Add support for semantic validation in the core rust library

This is more a discussion than a specific issue. I've been looking into various of the open source SQL dialects and how they think about validation.

Broadly speaking, SQL validation can be divided into two types:

  • Syntax validation: does the query parse
  • Semantic validation: does the query make sense
    • Do the tables exist?
    • Do the columns exist on those tables?
    • Are the functions valid (name + arity)?
    • Do the types work? (e.g., SUM('hello') parses fine but is semantically wrong)
    • Are column references unambiguous? (e.g., SELECT id FROM a JOIN b — which id?)
    • Do aggregations make sense? (e.g., SELECT x, SUM(y) FROM t without GROUP BY x)

From the readme it seems like both are a goal of this project.

Validation with syntax, semantic, and schema-aware checks

Currently, this library's validation focusses on syntax validation.

It does do some semantic validation, eg in validateWithSchema() in the typescript bindings, but as far as i can tell all the validation happens in the ts package rather than in the main rust crate.

To do semantic validation, information about the schema is essential.

Implementation

This could be added incrementally per dialect, and also need not cover all the above types of validation. eg easier to validate table and column existance, harder to validate that two expressions have types that can be compared

This would require:

  • Ability to pass in schema: eg tables and columns, with types
  • A set of functions signatures, per dialect.
    • These could be retrieved fairly easily for open source dialects using eg select * from system.functions or from duckdb_functions(); etc.
    • For eg BigQuery this would be more manual
  • Some hard things about which types can be compared to which other types.

My sense is that it would be very hard / impossible to get to 100%, but fairly easy to get to a high percent coverage.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions