
Commit b147cc9

Merge pull request #98 from tenzir/topic/language-documentation
Add comprehensive TQL language documentation
2 parents 55dab75 + 223c1d3

30 files changed: +2,916 -959 lines

.github/workflows/docs.yaml

Lines changed: 0 additions & 2 deletions
```diff
@@ -44,8 +44,6 @@ jobs:
         uses: actions/configure-pages@v5

       - name: Build with Astro
-        env:
-          RUN_LINK_CHECK: true
         run: |
           if [[ "${{ github.event_name }}" == "pull_request" ]]; then
             pnpm astro build \
```

.github/workflows/lint.yaml

Lines changed: 26 additions & 0 deletions
```diff
@@ -72,3 +72,29 @@ jobs:

       - name: Run Prettier
         run: pnpm lint:prettier
+
+  linkcheck:
+    name: Link Check
+    runs-on: ubuntu-latest
+    if: github.event_name == 'pull_request'
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+          submodules: recursive
+
+      - name: Install pnpm
+        run: npm install -g pnpm
+
+      - name: Setup Node.js and pnpm
+        uses: actions/setup-node@v4
+        with:
+          node-version: "20"
+          cache: "pnpm"
+
+      - name: Install dependencies
+        run: pnpm install --frozen-lockfile
+
+      - name: Run link check
+        run: pnpm run lint:linkcheck
```

AGENTS.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -19,7 +19,7 @@ editorial guidelines for writing clear and consistent technical documentation.

 Most notably:

-- In general, use _active voice_.
+- ULTRA IMPORTANT: Always use _active voice_.
 - Avoid anthropomorphic language: Don't attribute human qualities to software or
   hardware.
 - Include definite and indefinite articles (a, an, and the) in your writing.
```

scripts/generate-reference.js

Lines changed: 3 additions & 14 deletions
```diff
@@ -228,8 +228,9 @@ TODO: the following functions still need to be documented:

 */}

-Functions appear in [expressions](/reference/language/expressions) and take positional
-and/or named arguments, producing a value as a result of their computation.
+Functions appear in [expressions](/explanations/language/expressions) and take
+positional and/or named arguments, producing a value as a result of their
+computation.

 Function signatures have the following notation:

@@ -241,18 +242,6 @@ f(arg1:<type>, arg2=<type>, [arg3=type]) -> <type>
 - \`arg=<type>\`: named argument
 - \`[arg=type]\`: optional (named) argument
 - \`-> <type>\`: function return type
-
-TQL features the [uniform function call syntax
-(UFCS)](https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax), which
-allows you to interchangeably call a function with at least one argument either
-as _free function_ or _method_. For example, \`length(str)\` and \`str.length()\`
-resolve to the identical function call. The latter syntax is particularly
-suitable for function chaining, e.g., \`x.f().g().h()\` reads left-to-right as
-"start with \`x\`, apply \`f()\`, then \`g()\` and then \`h()\`," compared to
-\`h(g(f(x)))\`, which reads "inside out."
-
-Throughout our documentation, we use the free function style in the synopsis
-but often resort to the method style when it is more idiomatic.
 \`;
 } else if (type === "Operators") {
   markdown += \`Tenzir comes with a wide range of built-in pipeline operators.
```
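To make the signature notation above concrete, here is a minimal sketch of how such a call reads inside a TQL expression. The function `f` and the fields `x`, `y`, and `result` are hypothetical placeholders; only the positional/named-argument syntax and the free-function/method equivalence (UFCS) come from the documentation text in this diff.

```tql
// Hypothetical call matching f(arg1:<type>, arg2=<type>, [arg3=type]) -> <type>.
result = f(x, arg2=y)  // free-function style; optional arg3 omitted
result = x.f(arg2=y)   // equivalent method style via UFCS
```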

src/content/docs/explanations/architecture/pipeline/document-vs-structured.svg

Lines changed: 4 additions & 0 deletions

src/content/docs/explanations/architecture/pipeline/export-networking.excalidraw.svg

Lines changed: 0 additions & 17 deletions
This file was deleted.

src/content/docs/explanations/architecture/pipeline/implicit-vs-explicit-networking.svg

Lines changed: 0 additions & 10 deletions
This file was deleted.

src/content/docs/explanations/architecture/pipeline/import-networking.excalidraw.svg

Lines changed: 0 additions & 17 deletions
This file was deleted.

src/content/docs/explanations/architecture/pipeline/index.mdx

Lines changed: 96 additions & 169 deletions
````diff
@@ -14,97 +14,73 @@ Our pipelines have 3 types of operators: **inputs** that produce data,

 ![Pipeline Structure](./operator-types.svg)

-## Structured and Unstructured Dataflow
+You write pipelines in the [Tenzir Query Language
+(TQL)](/explanations/language), a language that we developed from the ground up
+to concisely describe such dataflows.

-Tenzir pipelines make one more distinction: the elements that the operators push
-through the pipeline are _typed_. An operator has an **upstream** and
-**downstream** type:
+:::tip[Learn TQL]
+Head over to our [language documentation](/explanations/language) for an
+in-depth explanation of how TQL works. We're continuing here with high-level
+architectural aspects of the pipeline execution model.
+:::
+
+## Typed Operators
+
+Tenzir pipelines operate on both unstructured streams of bytes and typed event
+streams. The execution model ensures type safety while maintaining high
+performance through batching and parallel processing.
+
+An operator has an **upstream** and **downstream** type:

 ![Upstream and Downstream Types](./operator-table.svg)

-When composing pipelines out of operators, upstream/downstream types of adjacent
-operators have to match. Otherwise the pipeline is malformed. We call any
-void-to-void operator sequence a **closed pipeline**. Only closed pipelines
-can execute.
-
-If a pipeline is not closed, Tenzir attempts to auto-complete missing input and
-output operators. When you [run a pipeline](/guides/basic-usage/run-pipelines) on
-the command line, we implicitly read JSON from stdin and write JSON to stdout.
-When you run a pipeline in the app and do not provide a sink, we only append
-[`serve`](/reference/operators/serve) to make the pipeline a REST API
-and extract data piecemeal through your browser.
-
-Operators can be _polymorphic_ in that they can have more than a single upstream
-and downstream type. For example,
-[`buffer`](/reference/operators/buffer) accepts both bytes and events.
-
-Many Tenzir pipelines use the [`from`](/reference/operators/from) and
-[`to`](/reference/operators/to) operators to get data in and out,
-respectively. For example, to load data from a local JSON file system, filter
-events where a certain field matches a predicate, and store the result to an S3
-bucket in Parquet format, you can write the following pipeline:
-
-```tql
-from "/path/to/file.json"
-where src_ip in 10.0.0.0/8
-to "s3://bucket/dir/file.parquet"
-```
-
-This pipelines consists of three operators:
-
-![Operator Composition Example 1](./operator-composition-example-1.svg)
-
-The operator [`from`](/reference/operators/from) is a void-to-events
-input operator, [`where`](/reference/operators/where) an
-events-to-events transformation operator, and
-[`to`](/reference/operators/to) an events-to-void output operator.
-
-Other inputs provide bytes first and you need to interpret them in order to
-transform them as events.
-
-```tql
-load_kafka "topic"
-read_ndjson
-select host, message
-write_yaml
-save_zmq "tcp://1.2.3.4"
-```
-
-![Operator Composition Example 2](./operator-composition-example-2.svg)
-
-With these building blocks in place, you can create all kinds of pipelines, as
-long as they follow the two principal rules of (1) sequencing inputs,
+This typing ensures pipelines are well-formed. Adjacent operators must have
+matching types: the downstream type of one operator must match the upstream
+type of the next. Otherwise the pipeline is malformed.
+
+With these operators as building blocks, you can create all kinds of pipelines,
+as long as they follow the two principal rules of (1) sequencing inputs,
 transformations, and outputs, and (2) ensuring that operator upstream/downstream
-types match. Here is an example of other valid pipeline instances:
+types match. Here are examples of other valid pipeline variations:

 ![Operator Composition Examples](./operator-composition-variations.svg)

 ## Multi-Schema Dataflows

-Every event that flows through a pipeline is part of a _data frame_ with a
-schema. Internally, these data frames are represented as Apache Arrow record
-batch, encoding potentially of tens of thousands of events at once. This
-innate batching is the reason why the pipelines can achieve a high throughput.
-
-Unique about Tenzir is that a single pipeline can run with _multiple schemas_,
-even though the events are data frames internally. Tenzir parsers
-(bytes-to-event operators) are capable of emitting events with changing schemas.
-This behavior is different from other engines that work with data frames where
-operators can only execute on a single schema. In this light, Tenzir combines
-the performance of structured query engines with the flexibility of
-document-oriented engines.
-
-If an operator detects a schema changes, it creates a new batch of events. In
-terms of performance, the worst case for Tenzir is a ordered stream of
-schema-switching events, with every event having a new schema than the previous
-one. But even for those scenarios operators can efficiently build homogeneous
-batches when the inter-event order does not matter. Similar to predicate
-pushdown, Tenzir operators support "ordering pushdown" to signal to upstream
-operators that the event order only matters intra-schema but not inter-schema.
-In this case the operator transparently "demultiplex" a heterogeneous event
-stream into N homogeneous streams. The
-[`sort`](/reference/operators/sort) operator is an example of such an
-operator; it pushes its ordering requirements upstream, allowing parsers to
+As mentioned above, pipelines can transport both _bytes_ and _events_. Let's
+dig deeper into how Tenzir represents events. Every event that flows through a
+pipeline is part of a _data frame_ with a schema. Internally, these data frames
+are represented as Apache Arrow record batches, potentially encoding tens of
+thousands of events in a single block of data. This innate batching is the
+reason why pipelines achieve high throughput.
+
+What makes Tenzir's pipeline executor unique is that a single pipeline can
+process events with _multiple schemas_. Typically, when you work with data
+frames, your workload runs on input with a fixed schema, e.g., when you query a
+database table. In Tenzir, schemas can change dynamically during the execution
+of a pipeline, much like in document-oriented engines that work on JSON or have
+one-event-at-a-time processing semantics. Tenzir gives the user the feeling of
+operating on a single event at a time while hiding the structured data frame
+batching behind the scenes. Thus, Tenzir combines the performance of structured
+query engines with the flexibility of document-oriented engines, making it a
+perfect fit for processing _semi-structured data_ at scale:
+
+![Structured vs document-oriented engines](./document-vs-structured.svg)
+
+The schema variance begins early in the data flow, where parsers emit events
+with changing schemas as they encounter changing fields. If an operator detects
+a schema change, it creates a new batch of events. In terms of performance, the
+worst case for Tenzir is an ordered stream of schema-switching events, where
+every event has a different schema than the previous one. But even in those
+scenarios operators can efficiently build homogeneous batches when the
+inter-event order does not matter. Similar to predicate pushdown, Tenzir
+operators support _ordering pushdown_ to signal to upstream operators that the
+event order only matters intra-schema but not inter-schema. In this case the
+operator transparently "demultiplexes" a heterogeneous event stream into N
+homogeneous streams. The [`sort`](/reference/operators/sort) operator is an
+example of such an operator; it pushes its ordering requirements upstream,
+allowing parsers to
 efficiently create multiple streams of events in parallel.

 ![Multi-schema Example](./multi-schema-example.svg)
````
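The `tql` snippets deleted above still illustrate the typing rules well. As a reminder of how a closed, well-typed pipeline reads, here is a sketch assembled from operators that appear in the removed text, annotated with the upstream/downstream types the surrounding prose implies; the Kafka topic and ZeroMQ endpoint are placeholders:

```tql
load_kafka "topic"           // input: void -> bytes
read_ndjson                  // parse: bytes -> events
where src_ip in 10.0.0.0/8   // transform: events -> events
write_yaml                   // print: events -> bytes
save_zmq "tcp://1.2.3.4"     // output: bytes -> void
```

Each adjacent pair matches on its downstream/upstream type, so the sequence forms a valid void-to-void pipeline.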
````diff
@@ -123,100 +99,51 @@ that you define explicitly.

 ## Unified Live Stream Processing and Historical Queries

-Engines for event stream processing and batch processing of historical data have
-vastly different requirements. We believe that we found a sweetspot with our
-language and accompanying execution engine that makes working with both types of
-workloads incredibly easy: just pick a input operator at the beginning the a
-pipeline that points to your data source, be it infinitely streaming or stored
-dataset. Tenzir will figure out the rest.
+Tenzir's execution engine transparently processes both historical data and
+real-time event streams within a single, unified pipeline model.
+[TQL](/explanations/language) empowers you to switch between these workloads by
+simply changing the data source at the start of your pipeline.

 ![Unified Processing](./unified-processing.svg)

-Our desired user experience for interacting with historical data looks like
-this:
-
-1. **Ingest**: to store data at a node, create a pipeline that ends with
-   [`import`](/reference/operators/import).
-2. **Query**: to run a historical query over data at the node, create a pipeline
-   that begins with [`export`](/reference/operators/export).
+This design lets you reuse the same logic for exploring existing data and for
+deploying it on live streams, which streamlines the entire analytics workflow.

-For example, to ingest JSON from a Kafka, you write `from "kafka://topic |
-import`. To query the stored data, you write `export | where file == 42`.
+Each Tenzir Node includes a lightweight **edge storage** engine for efficient
+local data persistence. You interact with this storage engine through just two
+dedicated operators to store and retrieve data. The retrieval goes far beyond
+mere replay.

-The example with `export` suggests that the pipeline _first_ exports everything,
-and only _then_ starts filtering with `where`, performing a full scan over the
-stored data. But this is not what's happening. Pipelines support **predicate
-pushdown** for every operator. This means that `export` receives the filter
-expression before it starts executing, enabling index lookups or other
-optimizations to efficiently execute queries with high selectivity where scans
-would be sub-optimal.
+A naive interpretation would be that [`export`](/reference/operators/export)
+first retrieves all its data, which subsequent operators then filter. However,
+Tenzir actively optimizes this process using **predicate pushdown**. Before a
+pipeline runs, Tenzir pushes filter conditions from later stages down to the
+initial storage source. This allows the source to intelligently fetch only the
+necessary data, often using fast index lookups and avoiding costly full scans.

-The key insight here is to realize that optimizations like predicate pushdown
-extend to the storage engine and do not only apply to the streaming executor.
-
-The Tenzir native storage engine is not a full-fledged database, but rather a
-catalog with a thin indexing layer over a set of Parquet/Feather files. These
-sparse indexes (sketch data structures, such as min-max synopses, Bloom filters,
-etc.) avoid full scans for every query. The catalog tracks evolving schemas,
-performs expression binding, and provides a transactional interface to add and
-replace partitions during compaction.
-
-The diagram below shows the main components of the storage engine:
+Tenzir's unique edge storage engine enables this powerful optimization. The
+diagram below illustrates how the engine works:

 ![Database Architecture](./storage-engine-architecture.svg)

-Because of this transparent optimization, you can just exchange the input
-operator of a pipeline and switch between historical and streaming execution
-and everything works as expected. A typical use case begins some exploratory
-data analysis involving a few `export` pipelines, but then would deploy the
-pipeline on streaming data by exchanging the input with a Kafka stream.
-
-## Built-in Networking to Create Data Fabrics
-
-Tenzir pipelines have built-in network communication, allowing you to create a
-distributed fabric of dataflows to express intricate use cases that go beyond
-single-machine processing. There are two types of network connections:
-_implicit_ and _explicit_ ones:
-
-![Implicit vs. Explicit](./implicit-vs-explicit-networking.svg)
-
-An implicit network connection exists, for example, when you use the `tenzir`
-binary on the command line to run a pipeline that ends in
-[`import`](/reference/operators/import):
-
-```tql
-from "/file/eve.json"
-where tag != "foo"
-import
-```
-
-Or one that begins with [`export`](/reference/operators/export):
-
-```tql
-export
-where src_ip in 10.0.0.0/8
-to "/tmp/result.json"
-```
-
-Tenzir pipelines are eschewing networking to minimize latency and maximize
-throughput, which results in the following operator placement for the above examples:
-
-![Implicit Networking](./implicit-networking.svg)
-
-The executor generally transfers ownership of operators between
-processes as late as possible to prefer local, high-bandwidth communication. For
-maximum control over placement of computation, you can override the automatic
-operator location with the [`local`](/reference/operators/local) and
-[`remote`](/reference/operators/remote) operators.
-
-The above examples are implicit network connections because they're not visible
-in the pipeline definition. An explicit network connection terminates a pipeline
-as with an input or output operator:
-
-![Pipeline Fabric](./pipeline-fabric.excalidraw.svg)
-
-This fictive data fabric above consists of a heterogeneous set of technologies,
-interconnected by pipelines. Because you have full control over the location
-where you run the pipeline, you can push it all the way to the "last mile." This
-helps especially when there are compliance and data residency concerns that must
-be properly addressed.
+The edge storage engine is not a traditional database but a lightweight
+**catalog** that maintains a thin indexing layer over immutable Apache Parquet
+and Feather files. It keeps **sparse indexes**, such as min-max synopses and
+Bloom filters, that act as a table of contents. These indexes allow the engine
+to quickly rule out data partitions that do not match a query's filter, avoiding
+unnecessary scans. The catalog also tracks evolving schemas and provides a
+transactional interface for partition operations.
+
+Because the engine handles these optimizations automatically, the same pipeline
+logic can be seamlessly repurposed. A pipeline developed for historical analysis
+can be deployed on a live data stream by simply exchanging the historical data
+source for a streaming one. This unified model streamlines the path from
+interactive exploration to production deployment.
+
+:::tip[Federated Search]
+The Tenzir pipeline execution engine leverages powerful optimizations, such as
+predicate, limit, and ordering pushdowns. These optimizations are propagated to
+any pipeline source, including operators that fetch data from remote storage
+layers, databases, or SIEMs. This enables efficient **federated search** across
+distributed systems as a transparent, fundamental capability of the engine.
+:::
````
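To see the source-swapping idea in code, the removed examples can be recombined into a pair of pipelines that share their processing logic. This is a sketch using only operators that appear elsewhere in this diff; the paths, topic, and filter predicate are placeholders:

```tql
// Historical: the storage engine receives the `where` predicate via
// predicate pushdown and serves only matching partitions.
export
where src_ip in 10.0.0.0/8
to "/tmp/result.json"
```

```tql
// Streaming: the same logic on a live feed; only the source changed.
load_kafka "topic"
read_ndjson
where src_ip in 10.0.0.0/8
to "/tmp/result.json"
```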
