Developer Documentation

The DataSQRL project consists of two parts:

  • The DataSQRL build tool: Compiles a SQRL script and an (optional) API schema into an integrated data pipeline or event-driven microservice by producing the deployment assets for all components of the pipeline or microservice.
  • The DataSQRL runtime components: Libraries and components that are executed as part of a compiled data pipeline when its deployment assets are deployed.

This repository contains the code for both parts. Code that is shared between the build tool and the runtime is in the sqrl-base module.

DataSQRL Build Tool

The goal of the DataSQRL build tool is to simplify the development of data pipelines and event-driven microservices. Developers can implement the logic of their data product in SQL and define how they want to serve the data product through an API specification. The build tool compiles those two artifacts into an integrated data pipeline that ingests, processes, stores, and serves the data as defined.

DataSQRL has its own variant of SQL called SQRL. SQRL extends SQL to allow explicit imports and nested tables to represent hierarchical data natively. It also provides some syntactic sugar to make it more convenient to develop with SQL.
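As a short, illustrative sketch only (the package path, table names, and columns below are hypothetical, and the exact syntax may vary between DataSQRL versions):

```sql
-- Explicitly import an external data package resolved by the package manager
IMPORT datasqrl.example.Orders;

-- Define a derived table with plain SQL against the imported table
Users := SELECT DISTINCT customerid AS id FROM Orders;

-- Define a nested table under Orders that aggregates its child rows,
-- using SQRL's native support for hierarchical data ('@' refers to the parent row)
Orders.totals := SELECT sum(quantity * unit_price) AS price FROM @.items;
```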

DataSQRL supports a pluggable engine architecture. A data pipeline or microservice consists of multiple stages and each stage is executed by an engine. For example, a data pipeline may consist of a stream processing, storage, and serving stage which are executed by Apache Flink, PostgreSQL, and Vert.x, respectively.

DataSQRL supports the following types of stages:

  • Stream Processing: For processing data as it is ingested
  • Log: For moving data between stages reliably
  • Database: For storing and querying data
  • Server: For returning data through an API upon request

A data pipeline topology is a sequence of stages. A pipeline topology may contain multiple stages of the same type (e.g. two different database stages). An engine is what executes the deployment assets for a given stage. For example, the FlinkSQL generated by the compiler as part of the deployment assets for the "stream" stage is executed by the Flink engine.

The pipeline topology, as well as other compiler configuration options, is specified in a JSON configuration file typically called package.json.

In addition to the compiler, the DataSQRL build tool contains a package manager for resolving external data and function dependencies, pre-processors for preparing dependencies, and post-processors for refining the deployment artifacts produced by the compiler to the specified engine configuration.

The DataSQRL build tool consists of the following components (i.e., Java modules), listed in the order in which they are invoked when compiling a SQRL script:

Packager

The packager prepares the build directory for the compiler. The job of the packager is to populate the build directory with all the files needed by the compiler, so that the compiler can operate exclusively on the build directory and have all the files it needs prepared in a standard way.

The packager resolves external data and function dependencies that are imported in a script. It downloads such dependencies from a central repository if necessary and places them in the build directory.

The packager also runs all registered pre-processors on the local directory to pre-process input files and place them into the build directory for the compiler. DataSQRL has a generic pre-processor framework.

Link to packager module

Transpiler

The transpiler is the first stage of the compiler. The transpiler parses the SQRL script into a logical plan by "translating" the SQRL SQL variant into vanilla SQL with hints to annotate context.

The transpiler resolves imports against the build directory using module loaders that retrieve dependencies. It maintains a schema of all defined tables in a SQRL script. It also parses the API specification and maps the API schema to the table schemas.

The transpiler is built on top of Apache Calcite for all SQL handling and graphql-java for all GraphQL handling.

Link to transpiler module

Logical Plan Rewriter

The logical plan rewriter is the second stage of the compiler. It takes the logical plan (i.e., RelNode) produced by the transpiler for each table defined in the SQRL script and rewrites the logical plan into a normalized representation (called the AnnotatedLP) which serves these purposes:

  1. It keeps track of important metadata such as timestamps, primary keys, and sort orders.
  2. It translates SQRL-specific SQL constructs into standard SQL depending on the context in which they are used. The goal is to take the SQL that developers are likely to write and translate it into SQL that is optimized for that context; for example, rewriting a plain JOIN into a temporal join when a stream is joined against a state table on the state table's primary key (see the sketch below).
  3. It extracts certain "expensive" logical operators with the goal of moving them to another stage in the pipeline where they are cheaper to execute while guaranteeing the same semantics.

In other words, the LP Rewriter attempts to rewrite the SQL query for a table definition to make it more efficient while preserving user intent, and it extracts the metadata needed for contextual processing.
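To make the temporal-join example above concrete, here is a rough sketch in Flink-style SQL of what such a rewrite could produce; the table and column names are hypothetical, and the rewriter's actual output may differ:

```sql
-- What a developer might write: a plain join between the Orders stream
-- and the Products state table on the state table's primary key
--   SELECT o.id, o.quantity, p.price
--   FROM Orders o JOIN Products p ON o.productid = p.id;

-- One possible rewritten form: a temporal join that reads the version of
-- Products that was valid at the order's event time (assumes Products is a
-- versioned table with a primary key and an event-time attribute)
SELECT o.id, o.quantity, p.price
FROM Orders o
JOIN Products FOR SYSTEM_TIME AS OF o.order_time AS p
  ON o.productid = p.id;
```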

The LP Rewriter is part of the planner.

DAG Planner

The DAG planner takes all the individual table definitions and assembles them into a processing DAG. It prunes the DAG and rewrites the DAG before optimizing the DAG to assign each node (i.e. table) to a stage in the pipeline.

The optimizer uses a cost model and is constrained to produce only viable pipelines.

At the end of the DAG planning process, each table defined in the SQRL script is assigned to a stage in the pipeline.

The DAG Planner is part of the planner.

Physical Planner

All the tables in a given stage are then passed to the stage engine's physical planner which produces the physical plan for the engine that has been configured to execute that stage.

For example, the tables assigned to the "stream" stage are passed to Flink's physical planner to produce FlinkSQL.

The physical plans are then written out as deployment assets to the build/deploy directory.
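For intuition, a fragment of generated FlinkSQL for the stream stage could look roughly like the sketch below; the connector options, table names, and columns are hypothetical and are not the compiler's actual output:

```sql
-- Hypothetical source table backed by a Kafka topic (log stage)
CREATE TEMPORARY TABLE `orders` (
  `id` BIGINT,
  `customerid` BIGINT,
  `order_time` TIMESTAMP_LTZ(3),
  WATERMARK FOR `order_time` AS `order_time` - INTERVAL '1' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);

-- Hypothetical sink table backed by the database stage
CREATE TEMPORARY TABLE `users` (
  `id` BIGINT,
  `num_orders` BIGINT,
  PRIMARY KEY (`id`) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:postgresql://database:5432/datasqrl',
  'table-name' = 'users'
);

-- The computed table is written out to the database stage
INSERT INTO `users`
SELECT `customerid` AS `id`, COUNT(*) AS `num_orders`
FROM `orders`
GROUP BY `customerid`;
```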

Discovery

The build tool contains a separate piece of functionality for data discovery. The goal for data discovery is to make it easy for new users to get started with DataSQRL by automatically generating package definitions for users' data.

The user can run discovery against local data files, object storage, or a Kafka cluster. Data discovery ingests and analyzes the data to determine its schema automatically. It then produces a package for the data that the user can import into their scripts.

Link to discovery module

Command Line Interface

The DataSQRL build tool is accessed through a command line interface. It defines all of the commands that DataSQRL supports and provides usability features, such as useful error messages, to help the user.

Link to cli module

Testing

Integration tests for the DataSQRL build tool can be found in sqrl-testing/sqrl-integration-tests.

DataSQRL Runtime

The DataSQRL project also contains some runtime components that get executed as part of the data pipeline compiled by the DataSQRL build tool:

Server

The server module contains the default server implementation based on Vert.x. The server takes a configuration file as input that maps each API entry point to a single SQL query or a set of SQL queries that are executed against the database on request.

The server processes mutations by creating events and persisting them to a log. The server handles subscriptions by listening to new log events and forwarding them to clients.

The configuration file that defines the behavior of the server is a deployment asset produced by the DataSQRL build tool for the "server" stage.

Flink Libraries

DataSQRL extends the standard SQL function library with many additional function modules. Those are executed at runtime in Flink or on the server. The functions are implemented in the sqrl-flink-lib module.

It also contains custom formats and connectors for Flink that extend existing formats and connectors to support functionality needed by DataSQRL. It may be reasonable to contribute those improvements and extensions back to the Apache Flink project.
