Discussion: Semantic Validation

related: #6 #19 

## Motivation
We (evidence.dev) use Clickhouse as our query engine. Users write SQL in the browser (or compose components that write SQL), and we send queries to clickhouse. We provide errors to users if their queries are invalid.

At the moment, this requires a round trip to Clickhouse for every validation call. This puts a lot of load on Clickhouse. We'd like to intercept "definitely invalid" queries before they get sent. We can tolerate false negatives as we can fall back to clickhouse for validation.

This library would already be quite useful, but would be about 100% better if it was able to cover more error classes

## Proposal: Add support for semantic validation in the core rust library

This is more a discussion than a specific issue. I've been looking into various of the open source SQL dialects and how they think about validation.

Broadly speaking, SQL validation can be divided into two types:
- **Syntax validation:** does the query parse
- **Semantic validation:** does the query make sense
   - Do the tables exist?
   - Do the columns exist on those tables?
   - Are the functions valid (name + arity)?
   - Do the types work? (e.g., SUM('hello') parses fine but is semantically wrong)
   - Are column references unambiguous? (e.g., SELECT id FROM a JOIN b — which id?)
   - Do aggregations make sense? (e.g., SELECT x, SUM(y) FROM t without GROUP BY x)

From the readme it seems like both are a goal of this project.

> Validation with syntax, semantic, and schema-aware checks

Currently, this library's validation focusses on _syntax_ validation.

It does do _some_ semantic validation, eg in validateWithSchema() in the typescript bindings, but as far as i can tell all the validation happens in the ts package rather than in the main rust crate.

To do semantic validation, information about the schema is essential.

## Implementation

This could be added incrementally per dialect, and also need not cover all the above types of validation. eg easier to validate table and column existance, harder to validate that two expressions have types that can be compared

This would require:
- Ability to pass in schema: eg tables and columns, with types
- A set of functions signatures, per dialect. 
   - These could be retrieved fairly easily for open source dialects using eg `select * from system.functions` or `from duckdb_functions();` etc. 
   - For eg BigQuery this would be more manual 
- Some hard things about which types can be compared to which other types.

My sense is that it would be very hard / impossible to get to 100%, but fairly easy to get to a high percent coverage.




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discussion: Semantic Validation #20

Motivation

Proposal: Add support for semantic validation in the core rust library

Implementation

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Discussion: Semantic Validation #20

Description

Motivation

Proposal: Add support for semantic validation in the core rust library

Implementation

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions