This document describes the internal architecture of dsq, including its core modules and how they work together.
dsq is organized into several specialized crates, each handling a distinct aspect of data processing:
```
dsq-core - Main library coordinating all components
├── dsq-parser - Parse filter expressions into AST
├── dsq-filter - Compile and execute filters
├── dsq-functions - Built-in function registry
├── dsq-formats - Format detection and I/O
├── dsq-io - Low-level I/O operations
└── dsq-shared - Shared types and utilities
```
The core library provides high-level APIs and coordinates all components.
Core value type that bridges JSON and Polars DataFrames, enabling seamless data interchange between different representations.
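To make the bridging idea concrete, here is a minimal, dependency-free sketch of what such a value type could look like. The `Value` enum and its `to_table` method are hypothetical illustrations, not the crate's actual definitions; in the real crate the tabular variant would wrap a Polars `DataFrame` rather than plain column vectors.

```rust
use std::collections::BTreeMap;

// Hypothetical sketch of a bridging value type: scalar and nested data map
// onto JSON-like variants, while tabular data stays columnar.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
    // Stand-in for a DataFrame: named columns of numbers, to keep the
    // sketch dependency-free.
    Table(BTreeMap<String, Vec<f64>>),
}

impl Value {
    // Convert an array of flat numeric objects into the columnar Table
    // variant - the direction a JSON-to-DataFrame bridge must support.
    fn to_table(&self) -> Option<Value> {
        let Value::Array(rows) = self else { return None };
        let mut cols: BTreeMap<String, Vec<f64>> = BTreeMap::new();
        for row in rows {
            let Value::Object(fields) = row else { return None };
            for (k, v) in fields {
                let Value::Number(n) = v else { return None };
                cols.entry(k.clone()).or_default().push(*n);
            }
        }
        Some(Value::Table(cols))
    }
}
```

The key design point is that row-oriented (JSON) and column-oriented (DataFrame) shapes live in one enum, so operations can pick whichever representation is cheaper.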
Comprehensive data operations library organized by category:
- basic - Fundamental operations: select, filter, sort, head, tail, unique
- aggregate - Grouping and aggregation: group_by, sum, mean, count, pivot
- join - Join operations: inner, left, right, outer joins
- transform - Data transformations: transpose, cast, reshape
Input/output support for multiple file formats:
- CSV/TSV with automatic delimiter detection
- Parquet for columnar storage
- JSON and JSON Lines
- Arrow IPC format
- Avro (planned)
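One way to picture the "automatic delimiter detection" mentioned above: count candidate delimiters on the first line of a sample and pick the most frequent. This is a hedged sketch of the general technique, not dsq-formats' actual algorithm.

```rust
// Illustrative delimiter sniffing for CSV/TSV-like input: score each
// candidate delimiter by how often it appears on the header line.
fn detect_delimiter(sample: &str) -> u8 {
    let first_line = sample.lines().next().unwrap_or("");
    let candidates = [b',', b'\t', b';', b'|'];
    candidates
        .iter()
        .copied()
        .max_by_key(|&d| first_line.bytes().filter(|&b| b == d).count())
        .unwrap_or(b',')
}
```

Real implementations typically also check that the chosen delimiter yields a consistent field count across several lines before committing to it.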
jq-compatible filter compilation and execution engine with DataFrame optimizations.
Comprehensive error handling with specific error types:
- TypeError - Type conversion and compatibility issues
- FormatError - File format parsing problems
- FilterError - jq filter syntax and execution errors
High-level fluent API (Dsq struct) for ergonomic data processing workflows with operation chaining and automatic optimization.
- Polars - High-performance DataFrame operations
- Arrow - Columnar memory format
- Serde - Serialization/deserialization
- Tokio - Async runtime for streaming operations
- Nom - Parser combinators for filter syntax
Foundational parsing component that converts filter language strings into Abstract Syntax Tree (AST) representations.
- Complete DSQ Syntax Support - Parses all jq-compatible filter language constructs
- Fast Parsing - Uses nom for high-performance parsing with zero-copy operations
- Comprehensive Error Reporting - Detailed error messages with position information
- AST Generation - Produces structured AST for processing by dsq-filter
- Type Safety - Compile-time guarantees about AST structure
- Identity and field access: `.`, `.field`, `.field.subfield`
- Array operations: `.[0]`, `.[1:5]`, `.[]`
- Function calls: `length`, `map(select(.age > 30))`
- Arithmetic: `+`, `-`, `*`, `/`
- Comparisons: `>`, `<`, `==`, `!=`, `>=`, `<=`
- Logical operations: `and`, `or`, `not`
- Object/array construction: `{name, age}`, `[1, 2, 3]`
- Pipelines: `expr1 | expr2 | expr3`
- Assignments: `. += value`, `. |= value`
- Control flow: `if condition then expr else expr end`
- Sequences: `expr1, expr2, expr3`
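As a mental model for how these constructs become tree nodes, here is an illustrative AST sketch covering a few of them. The variant names and shapes are invented for this example; the crate's actual definitions in ast.rs will differ.

```rust
// Hypothetical AST sketch for a subset of the filter language above.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Identity,                // .
    Field(String),           // .name
    Index(i64),              // .[0]
    Call(String, Vec<Expr>), // length, map(f)
    Pipeline(Vec<Expr>),     // e1 | e2 | e3
}

// The filter `.name | length` would parse into a two-stage pipeline:
fn example_ast() -> Expr {
    Expr::Pipeline(vec![
        Expr::Field("name".into()),
        Expr::Call("length".into(), vec![]),
    ])
}
```

Representing pipelines as an explicit node (rather than nested binary applications) makes downstream optimizations like stage fusion easier to express.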
- FilterParser - Main parser interface with public API
- ast.rs - AST node definitions with Display implementations
- parser.rs - nom-based parser combinators implementing the grammar
- error.rs - Comprehensive error types with position tracking
- tests.rs - Extensive test suite covering all language features
```rust
use dsq_parser::{FilterParser, Filter};

// Parse a DSQ filter string into an AST
let parser = FilterParser::new();
let filter: Filter = parser.parse(".name | length")?;

// Access the parsed AST
match &filter.expr {
    dsq_parser::Expr::Pipeline(exprs) => {
        println!("Pipeline with {} expressions", exprs.len());
    }
    _ => {}
}
```
- nom - Parser combinator library
- serde - Serialization support for AST
- num-bigint - Arbitrary-precision integers
- dsq-shared - Shared types and utilities
Core filter compilation and execution engine that powers dsq's jq-compatible syntax. Operates at the AST level to transform filter expressions into optimized DataFrame operations.
Transforms jaq AST nodes into dsq operations with support for:
- Variable scoping and function definitions
- Type checking and error reporting
- Multiple optimization levels (None, Basic, Advanced)
- DataFrame-specific optimizations
High-performance execution engine with:
- Configurable execution modes (Standard, Lazy, Streaming)
- Execution statistics and performance monitoring
- Error handling with strict/lenient modes
- Built-in caching for compiled filters
Execution context management for:
- Variable bindings during filter execution
- User-defined and built-in function registry
- Recursion depth control and stack management
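The scoping and depth-control behavior described above can be sketched with a simple stack of binding frames. This is an assumption-laden miniature, not dsq-filter's actual context type: the frame stack, `i64` values, and method names are all invented for illustration.

```rust
use std::collections::HashMap;

// Minimal sketch of lexically scoped variable bindings with a
// recursion-depth cap.
struct Scope {
    frames: Vec<HashMap<String, i64>>, // innermost frame last
    max_depth: usize,
}

impl Scope {
    fn new(max_depth: usize) -> Self {
        Self { frames: vec![HashMap::new()], max_depth }
    }
    // Entering a function body or binding expression pushes a frame;
    // exceeding the cap is reported instead of overflowing the stack.
    fn push(&mut self) -> Result<(), String> {
        if self.frames.len() >= self.max_depth {
            return Err("recursion limit exceeded".into());
        }
        self.frames.push(HashMap::new());
        Ok(())
    }
    fn pop(&mut self) {
        self.frames.pop();
    }
    fn bind(&mut self, name: &str, value: i64) {
        self.frames.last_mut().unwrap().insert(name.to_string(), value);
    }
    // Lookup walks outward from the innermost frame, so inner bindings
    // shadow outer ones.
    fn lookup(&self, name: &str) -> Option<i64> {
        self.frames.iter().rev().find_map(|f| f.get(name).copied())
    }
}
```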
```rust
use dsq_filter::{FilterCompiler, ExecutorConfig, FilterExecutor};

// Compile a jq filter expression
let compiler = FilterCompiler::new();
let compiled = compiler.compile_str(r#"map(select(.age > 30)) | sort_by(.name)"#)?;

// Execute on data
let mut executor = FilterExecutor::with_config(ExecutorConfig {
    collect_stats: true,
    ..Default::default()
});
let result = executor.execute_compiled(&compiled, data)?;
println!("Execution stats: {:?}", result.stats);
```
Provides the comprehensive built-in function registry, implementing over 150 functions across multiple categories.
- BuiltinRegistry - Central registry managing all built-in functions
- BuiltinFunction - Type alias for function implementations
- Function categories - Organized by operation type (math, string, array, etc.)
- Type polymorphism - Functions work across different data types
- Type Safety - Compile-time function validation and type checking
- Performance - Optimized implementations using Polars operations where possible
- Extensibility - Easy to add new functions via the registry pattern
- Cross-Type Support - Functions automatically handle different input types
- Error Handling - Comprehensive error reporting with context
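The registry pattern mentioned above can be sketched in a few lines: built-ins stored by name and arity, dispatched dynamically, with lookup failures surfaced as errors. The signatures here (`fn(&[f64]) -> Result<f64, String>`) are invented for illustration; the real `BuiltinFunction` type operates on dsq values, not bare floats.

```rust
use std::collections::HashMap;

// Hypothetical built-in signature: a plain function pointer over numbers.
type Builtin = fn(&[f64]) -> Result<f64, String>;

struct Registry {
    // Keyed by (name, arity), so add/1 and add/2 could coexist.
    funcs: HashMap<(String, usize), Builtin>,
}

impl Registry {
    fn new() -> Self {
        let mut funcs: HashMap<(String, usize), Builtin> = HashMap::new();
        funcs.insert(("add".into(), 2), |args| Ok(args[0] + args[1]));
        funcs.insert(("abs".into(), 1), |args| Ok(args[0].abs()));
        Self { funcs }
    }
    fn call(&self, name: &str, args: &[f64]) -> Result<f64, String> {
        match self.funcs.get(&(name.to_string(), args.len())) {
            Some(f) => f(args),
            None => Err(format!("{}/{} is not defined", name, args.len())),
        }
    }
}
```

Extending the registry is then a one-line insert, which is what makes the pattern attractive for a function library of this size.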
- Polars - High-performance DataFrame operations
- Serde - Serialization for complex data structures
- Chrono - Date/time manipulation
- URL - URL parsing and manipulation
- Base64/Base58/Base32 - Encoding/decoding operations
- SHA2/SHA1/MD5 - Cryptographic hashing
- UUID - Universally unique identifier generation
- Heck - String case conversion
- Rand - Random number generation
See FUNCTIONS.md for the complete function reference.
Provides comprehensive support for reading and writing various structured data formats with automatic format detection.
- DataFormat enum - Represents all supported formats with format-specific capabilities
- DataReader/DataWriter traits - Unified interface for reading/writing across all formats
- Format-specific implementations - Optimized readers/writers for each format
- Automatic format detection - Extension-based and content-based detection
- Automatic format detection - Extension-based and content-based detection
- Format Detection - Automatic detection from file extensions, magic bytes, and content analysis
- Unified Interface - Consistent read_file() and write_file() functions across formats
- Performance Optimizations - Lazy reading, streaming, and parallel processing where supported
- Extensibility - Easy to add new formats with macro-based boilerplate reduction
- Type Safety - Compile-time format validation and option checking
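To illustrate the layered detection described above (extension first, content sniffing as a fallback), here is a minimal sketch. The magic numbers are real (Parquet files begin with `PAR1`, Arrow IPC files with `ARROW1`), but the function itself is an invented stand-in for dsq-formats' detection logic.

```rust
// Illustrative two-stage format detection: try the file extension, then
// fall back to inspecting the first bytes of the file.
fn detect_format(path: &str, header: &[u8]) -> &'static str {
    match path.rsplit('.').next() {
        Some("csv") => return "csv",
        Some("tsv") => return "tsv",
        Some("json") => return "json",
        Some("parquet") => return "parquet",
        _ => {}
    }
    if header.starts_with(b"PAR1") {
        return "parquet"; // Parquet magic bytes
    }
    if header.starts_with(b"ARROW1") {
        return "arrow-ipc"; // Arrow IPC file magic bytes
    }
    // JSON has no magic number; sniff the first non-whitespace byte.
    match header.iter().copied().find(|b| !b.is_ascii_whitespace()) {
        Some(b'{') | Some(b'[') => "json",
        _ => "unknown",
    }
}
```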
```rust
use dsq_formats::{DataFormat, from_path, to_path};

// Generic reading - format auto-detected
let data = from_path("data.csv")?;

// Explicit format specification
let data = from_path_with_format("data.txt", DataFormat::Csv)?;

// Writing with format options
let writer = to_path_with_format("output.parquet", DataFormat::Parquet)?;
writer.write_dataframe(&dataframe, &write_options)?;
```
- csv - CSV/TSV reading and writing
- json - JSON and JSON Lines support
- json5 - JSON5 format support
- parquet - Apache Parquet format
- avro - Apache Avro format
See FORMATS.md for detailed format documentation.
Provides low-level I/O utilities and high-level data reading/writing interfaces.
Built around two main traits:
- DataReader - Reading data from various sources into DataFrames or LazyFrames
- DataWriter - Writing DataFrames and LazyFrames to various destinations
- Synchronous and asynchronous file reading/writing
- STDIN/STDOUT handling with proper buffering
- Memory-mapped I/O for large files
- Error handling with specific error types
- Format-agnostic reading/writing APIs
- Lazy evaluation support for performance
- Streaming capabilities for large datasets
- Batch processing utilities
- File inspection - Extract metadata without full loading
- Format conversion - Convert between any supported formats
- Batch operations - Process multiple files with consistent operations
- Streaming processing - Constant memory usage for large datasets
```rust
pub struct ReadOptions {
    pub max_rows: Option<usize>,      // Limit rows read
    pub infer_schema: bool,           // Auto-detect column types
    pub lazy: bool,                   // Use lazy evaluation
    pub skip_rows: usize,             // Skip initial rows
    pub columns: Option<Vec<String>>, // Select specific columns
}

pub struct WriteOptions {
    pub include_header: bool,                  // Include headers in output
    pub overwrite: bool,                       // Overwrite existing files
    pub compression: Option<CompressionLevel>, // Compression settings
}
```

```rust
use dsq_io::{from_path, ReadOptions};

let mut reader = from_path("data.csv")?;
let options = ReadOptions::default();
let dataframe = reader.read(&options)?;
```

```rust
use dsq_io::{convert_file, ReadOptions, WriteOptions};

let read_opts = ReadOptions::default();
let write_opts = WriteOptions::default();
convert_file("data.csv", "data.parquet", &read_opts, &write_opts)?;
```

```rust
use dsq_io::stream::StreamProcessor;

let processor = StreamProcessor::new(10000) // 10k row chunks
    .with_read_options(read_opts)
    .with_write_options(write_opts);
processor.process_file("large.csv", "processed.parquet", |chunk| {
    let filtered = chunk.filter(/* some condition */)?;
    Ok(Some(filtered))
})?;
```
- Polars - DataFrame operations and I/O
- Tokio - Async runtime for I/O operations
- Apache Avro - Avro format support
- dsq-shared - Shared types and utilities
- dsq-formats - Format-specific I/O implementations
The fluent API in dsq-core provides an ergonomic interface for common workflows.
```rust
use dsq_core::api::Dsq;

// Chain operations easily
let result = Dsq::from_file("data.csv")?
    .select(&["name", "age", "department"])
    .filter_expr("age > 25")
    .sort_by(&["department", "age"])
    .group_by(&["department"])
    .aggregate(&["department"], vec![
        dsq_core::ops::aggregate::AggregationFunction::Count,
        dsq_core::ops::aggregate::AggregationFunction::Mean("salary".to_string()),
    ])
    .to_json()?;
```
dsq-core supports optional features for different use cases:
- default - Includes all-formats, io, and filter for full functionality
- all-formats - Enables all supported data formats
- io - File I/O operations and format conversion
- filter - jq-compatible filter compilation and execution
- repl - Interactive REPL support
- cli - Command-line interface components

- csv - CSV/TSV reading and writing
- json - JSON and JSON Lines support
- parquet - Apache Parquet format support
- avro - Apache Avro format support
- io-arrow - Apache Arrow IPC format support
dsq leverages Polars' high-performance engine:
- Lazy Evaluation - Operations are optimized before execution
- Columnar Processing - Efficient memory layout and SIMD operations
- Parallel Execution - Multi-threaded processing where beneficial
- Memory Efficiency - Streaming and chunked processing for large datasets
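Lazy evaluation can be seen in miniature with plain Rust iterators: like a Polars LazyFrame, an iterator chain records operations without running them, and adjacent stages are fused into a single pass when the result is finally collected.

```rust
// Lazy pipeline sketch: `filter` and `map` do no work until `collect`
// drives the chain, and both run fused in one pass over the data.
fn lazy_pipeline(data: &[i64]) -> Vec<i64> {
    data.iter()
        .copied()
        .filter(|&x| x > 2) // evaluated per element, on demand
        .map(|x| x * 10)    // fused with the filter - no intermediate Vec
        .collect()          // execution is triggered here
}
```

Polars applies the same principle at the query level: a LazyFrame accumulates a logical plan, and the optimizer can reorder or fuse steps (e.g. pushing filters down before joins) before any data is touched.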