@@ -14,97 +14,73 @@ Our pipelines have 3 types of operators: **inputs** that produce data,
![Pipeline Structure](./operator-types.svg)
- ## Structured and Unstructured Dataflow

+ You write pipelines in the [Tenzir Query Language
+ (TQL)](/explanations/language), a language that we developed from the ground up
+ to concisely describe such dataflows.
- Tenzir pipelines make one more distinction: the elements that the operators push
- through the pipeline are _typed_. An operator has an **upstream** and
- **downstream** type:

+ :::tip[Learn TQL]
+ Head over to our [language documentation](/explanations/language) for an
+ in-depth explanation of how TQL works. We're continuing here with high-level
+ architectural aspects of the pipeline execution model.
+ :::
+
+ ## Typed Operators
+
+ Tenzir pipelines operate on both unstructured streams of bytes and typed event
+ streams. The execution model ensures type safety while maintaining high
+ performance through batching and parallel processing.
+
+ An operator has an **upstream** and **downstream** type:
![Upstream and Downstream Types](./operator-table.svg)
- When composing pipelines out of operators, upstream/downstream types of adjacent
- operators have to match. Otherwise the pipeline is malformed. We call any
- void-to-void operator sequence a **closed pipeline**. Only closed pipelines
- can execute.
-
- If a pipeline is not closed, Tenzir attempts to auto-complete missing input and
- output operators. When you [run a pipeline](/guides/basic-usage/run-pipelines) on
- the command line, we implicitly read JSON from stdin and write JSON to stdout.
- When you run a pipeline in the app and do not provide a sink, we append
- [`serve`](/reference/operators/serve) to make the pipeline a REST API
- and extract data piecemeal through your browser.
-
- Operators can be _polymorphic_ in that they can have more than a single upstream
- and downstream type. For example,
- [`buffer`](/reference/operators/buffer) accepts both bytes and events.
-
- Many Tenzir pipelines use the [`from`](/reference/operators/from) and
- [`to`](/reference/operators/to) operators to get data in and out,
- respectively. For example, to load data from a local JSON file, filter
- events where a certain field matches a predicate, and store the result in an S3
- bucket in Parquet format, you can write the following pipeline:
-
- ```tql
- from "/path/to/file.json"
- where src_ip in 10.0.0.0/8
- to "s3://bucket/dir/file.parquet"
- ```
-
- This pipeline consists of three operators:
-
- ![Operator Composition Example 1](./operator-composition-example-1.svg)
-
- The operator [`from`](/reference/operators/from) is a void-to-events
- input operator, [`where`](/reference/operators/where) an
- events-to-events transformation operator, and
- [`to`](/reference/operators/to) an events-to-void output operator.
-
- Other inputs provide bytes first, and you need to parse them into events before
- you can transform them:
-
- ```tql
- load_kafka "topic"
- read_ndjson
- select host, message
- write_yaml
- save_zmq "tcp://1.2.3.4"
- ```
-
- ![Operator Composition Example 2](./operator-composition-example-2.svg)
-
- With these building blocks in place, you can create all kinds of pipelines, as
- long as they follow the two principal rules of (1) sequencing inputs,
+ This typing ensures pipelines are well-formed: the downstream type of one
+ operator must match the upstream type of the next. Otherwise, the pipeline is
+ malformed.
+
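+ For illustration, here is a minimal sketch of a closed, well-typed pipeline.
+ The comments annotate each operator's upstream-to-downstream type; the file
+ paths and the `severity` field are placeholders:
+
+ ```tql
+ from "/tmp/events.json"   // void → events
+ where severity == "high"  // events → events
+ to "/tmp/filtered.json"   // events → void
+ ```
+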
+ With these operators as building blocks, you can create all kinds of pipelines,
+ as long as they follow the two principal rules of (1) sequencing inputs,
transformations, and outputs, and (2) ensuring that operator upstream/downstream
- types match. Here is an example of other valid pipeline instances:
+ types match. Here are examples of other valid pipeline variations:
![Operator Composition Examples](./operator-composition-variations.svg)

## Multi-Schema Dataflows
- Every event that flows through a pipeline is part of a _data frame_ with a
- schema. Internally, these data frames are represented as Apache Arrow record
- batches, potentially encoding tens of thousands of events at once. This
- innate batching is the reason why the pipelines can achieve a high throughput.
-
- Unique about Tenzir is that a single pipeline can run with _multiple schemas_,
- even though the events are data frames internally. Tenzir parsers
- (bytes-to-events operators) are capable of emitting events with changing
- schemas. This behavior is different from other engines that work with data
- frames, where operators can only execute on a single schema. In this light,
- Tenzir combines the performance of structured query engines with the
- flexibility of document-oriented engines.
-
- If an operator detects a schema change, it creates a new batch of events. In
- terms of performance, the worst case for Tenzir is an ordered stream of
- schema-switching events, with every event having a different schema than the
- previous one. But even for those scenarios operators can efficiently build
- homogeneous batches when the inter-event order does not matter. Similar to
- predicate pushdown, Tenzir operators support "ordering pushdown" to signal to
- upstream operators that the event order only matters intra-schema but not
- inter-schema. In this case the operator transparently "demultiplexes" a
- heterogeneous event stream into N homogeneous streams. The
- [`sort`](/reference/operators/sort) operator is an example of such an
- operator; it pushes its ordering requirements upstream, allowing parsers to
+ As mentioned above, pipelines can transport both _bytes_ and _events_. Let's go
+ deeper into the details of how Tenzir represents events. Every event that flows
+ through a pipeline is part of a _data frame_ with a schema. Internally, these
+ data frames are represented as Apache Arrow record batches, potentially
+ encoding tens of thousands of events in a single block of data. This innate
+ batching is the reason why pipelines can achieve high throughput.
+
+ What makes Tenzir's pipeline executor unique is that a single pipeline can
+ process events with _multiple schemas_. Typically, when you work with data
+ frames, your workload runs on input with a fixed schema, e.g., when you query a
+ database table. In Tenzir, schemas can change dynamically during the execution
+ of a pipeline, much like document-oriented engines that work on JSON or have
+ one-event-at-a-time processing semantics. Tenzir gives the user the feeling of
+ operating on a single event at a time while hiding the structured data frame
+ batching behind the scenes. Thus, Tenzir combines the performance of structured
+ query engines with the flexibility of document-oriented engines, making it a
+ perfect fit for processing _semi-structured data_ at scale:
+
+ ![Structured vs document-oriented engines](./document-vs-structured.svg)
+
+ The schema variance begins early in the dataflow, where parsers emit events
+ with changing schemas as they encounter changing fields. If an operator detects
+ a schema change, it creates a new batch of events. In terms of performance, the
+ worst case for Tenzir is an ordered stream of schema-switching events, with
+ every event having a different schema than the previous one. But even for those
+ scenarios, operators can efficiently build homogeneous batches when the
+ inter-event order does not matter. Similar to predicate pushdown, Tenzir
+ operators support _ordering pushdown_ to signal to upstream operators that the
+ event order only matters intra-schema but not inter-schema. In this case, the
+ operator transparently "demultiplexes" a heterogeneous event stream into N
+ homogeneous streams. The [`sort`](/reference/operators/sort) operator is an
+ example of such an operator; it pushes its ordering requirements upstream,
+ allowing parsers to
efficiently create multiple streams of events in parallel.

![Multi-schema Example](./multi-schema-example.svg)
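
+ For illustration, a pipeline along these lines (the topic and field name are
+ placeholders) exercises this machinery: `read_ndjson` may emit events with
+ changing schemas, and `sort` pushes its ordering requirement upstream so the
+ parser can build homogeneous per-schema batches in parallel:
+
+ ```tql
+ load_kafka "topic"
+ read_ndjson
+ sort timestamp
+ to "/tmp/sorted.json"
+ ```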
@@ -123,100 +99,51 @@ that you define explicitly.
## Unified Live Stream Processing and Historical Queries
- Engines for event stream processing and batch processing of historical data have
- vastly different requirements. We believe that we found a sweet spot with our
- language and accompanying execution engine that makes working with both types of
- workloads incredibly easy: just pick an input operator at the beginning of a
- pipeline that points to your data source, be it an infinite stream or a stored
- dataset. Tenzir will figure out the rest.

+ Tenzir's execution engine transparently processes both historical data and
+ real-time event streams within a single, unified pipeline model.
+ [TQL](/explanations/language) empowers you to switch between these workloads by
+ simply changing the data source at the start of your pipeline.
![Unified Processing](./unified-processing.svg)
- Our desired user experience for interacting with historical data looks like
- this:
-
- 1. **Ingest**: to store data at a node, create a pipeline that ends with
-    [`import`](/reference/operators/import).
- 2. **Query**: to run a historical query over data at the node, create a pipeline
-    that begins with [`export`](/reference/operators/export).

+ This design lets you reuse the same logic for exploring existing data and for
+ deploying it on live streams, which streamlines the entire analytics workflow.
- For example, to ingest JSON from Kafka, you write `from "kafka://topic" |
- import`. To query the stored data, you write `export | where file == 42`.

+ Each Tenzir Node includes a lightweight **edge storage** engine for efficient
+ local data persistence. You interact with this storage engine using just two
+ dedicated operators, [`import`](/reference/operators/import) and
+ [`export`](/reference/operators/export), to store and retrieve data. Retrieval
+ goes far beyond simple replay.
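+ For example, to ingest NDJSON from Kafka and later query the stored events, you
+ can write the following pair of sketches (the topic name and filter field are
+ placeholders):
+
+ ```tql
+ // Ingest: parse a Kafka stream and persist the events at the node.
+ load_kafka "topic"
+ read_ndjson
+ import
+ ```
+
+ ```tql
+ // Query: retrieve stored events that match a predicate.
+ export
+ where file == 42
+ ```
+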
- The example with `export` suggests that the pipeline _first_ exports everything,
- and only _then_ starts filtering with `where`, performing a full scan over the
- stored data. But this is not what's happening. Pipelines support **predicate
- pushdown** for every operator. This means that `export` receives the filter
- expression before it starts executing, enabling index lookups or other
- optimizations to efficiently execute queries with high selectivity where scans
- would be sub-optimal.
+ A naive interpretation would be that [`export`](/reference/operators/export)
+ first retrieves all its data, which subsequent operators then filter. However,
+ Tenzir actively optimizes this process using **predicate pushdown**. Before a
+ pipeline runs, Tenzir pushes filter conditions from later stages down to the
+ initial storage source. This allows the source to intelligently fetch only the
+ necessary data, often using fast index lookups and avoiding costly full scans.

- The key insight here is to realize that optimizations like predicate pushdown
- extend to the storage engine and do not only apply to the streaming executor.
-
- The Tenzir native storage engine is not a full-fledged database, but rather a
- catalog with a thin indexing layer over a set of Parquet/Feather files. These
- sparse indexes (sketch data structures, such as min-max synopses, Bloom filters,
- etc.) avoid full scans for every query. The catalog tracks evolving schemas,
- performs expression binding, and provides a transactional interface to add and
- replace partitions during compaction.
-
- The diagram below shows the main components of the storage engine:
+ Tenzir's unique edge storage engine enables this powerful optimization. The
+ diagram below illustrates how the engine works:
![Database Architecture](./storage-engine-architecture.svg)
- Because of this transparent optimization, you can just exchange the input
- operator of a pipeline and switch between historical and streaming execution,
- and everything works as expected. A typical use case begins with some
- exploratory data analysis involving a few `export` pipelines, but then deploys
- the pipeline on streaming data by exchanging the input with a Kafka stream.
-
- ## Built-in Networking to Create Data Fabrics
-
- Tenzir pipelines have built-in network communication, allowing you to create a
- distributed fabric of dataflows to express intricate use cases that go beyond
- single-machine processing. There are two types of network connections:
- _implicit_ and _explicit_ ones:
-
- ![Implicit vs. Explicit](./implicit-vs-explicit-networking.svg)
-
- An implicit network connection exists, for example, when you use the `tenzir`
- binary on the command line to run a pipeline that ends in
- [`import`](/reference/operators/import):
-
- ```tql
- from "/file/eve.json"
- where tag != "foo"
- import
- ```
-
- Or one that begins with [`export`](/reference/operators/export):
-
- ```tql
- export
- where src_ip in 10.0.0.0/8
- to "/tmp/result.json"
- ```
-
- Tenzir pipelines eschew networking to minimize latency and maximize
- throughput, which results in the following operator placement for the above
- examples:
-
- ![Implicit Networking](./implicit-networking.svg)
-
- The executor generally transfers ownership of operators between
- processes as late as possible to prefer local, high-bandwidth communication. For
- maximum control over the placement of computation, you can override the
- automatic operator location with the [`local`](/reference/operators/local) and
- [`remote`](/reference/operators/remote) operators.
-
- The above examples are implicit network connections because they're not visible
- in the pipeline definition. An explicit network connection terminates a pipeline
- as with an input or output operator:
-
- ![Pipeline Fabric](./pipeline-fabric.excalidraw.svg)
-
- The fictive data fabric above consists of a heterogeneous set of technologies,
- interconnected by pipelines. Because you have full control over the location
- where you run the pipeline, you can push it all the way to the "last mile." This
- helps especially when there are compliance and data residency concerns that must
- be properly addressed.
+ The edge storage engine is not a traditional database but a lightweight
+ **catalog** that maintains a thin indexing layer over immutable Apache Parquet
+ and Feather files. It keeps **sparse indexes**, such as min-max synopses and
+ Bloom filters, that act as a table of contents. These indexes allow the engine
+ to quickly rule out data partitions that do not match a query's filter,
+ avoiding unnecessary scans. The catalog also tracks evolving schemas and
+ provides a transactional interface for partition operations.
+
+ Because the engine handles these optimizations automatically, the same pipeline
+ logic can be seamlessly repurposed. A pipeline developed for historical analysis
+ can be deployed on a live data stream by simply exchanging the historical data
+ source for a streaming one. This unified model streamlines the path from
+ interactive exploration to production deployment.
+
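+ As a sketch of this swap (the topic and field names are placeholders), only the
+ input operators change between the two deployments:
+
+ ```tql
+ // Historical: run the analysis over data stored at the node.
+ export
+ where src_ip in 10.0.0.0/8
+ ```
+
+ ```tql
+ // Streaming: the same logic over a live Kafka feed.
+ load_kafka "topic"
+ read_ndjson
+ where src_ip in 10.0.0.0/8
+ ```
+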
+ :::tip[Federated Search]
+ The Tenzir pipeline execution engine leverages powerful optimizations, such as
+ predicate, limit, and ordering pushdowns. These optimizations are propagated to
+ any pipeline source, including operators that fetch data from remote storage
+ layers, databases, or SIEMs. This process enables efficient **federated search**
+ across distributed systems and is a transparent, fundamental capability of the
+ engine.
+ :::