datasetq is a high-performance data processing tool that extends jq-like syntax to work with structured data formats including Parquet, Avro, CSV, JSON Lines, Arrow, and more. Built on Polars, dsq provides fast data manipulation across multiple file formats with familiar filter syntax.
- Format Flexibility - Process Parquet, Avro, CSV, TSV, JSON Lines, Arrow, and more with automatic format detection
- Performance - Built on Polars DataFrames with lazy evaluation, columnar operations, and efficient memory usage
- Familiar Syntax - jq-inspired filter syntax extended to tabular data operations
- Correctness - Proper type handling and clear error messages
Download binaries for Linux, macOS, and Windows from the releases page.
On Linux:
curl -fsSL https://github.com/datasetq/datasetq/releases/latest/download/dsq-$(uname -m)-unknown-linux-musl -o dsq && chmod +x dsq

Install with the Rust toolchain (see https://rustup.rs/):
cargo install --locked dsq-cli
cargo install --locked --git https://github.com/datasetq/datasetq # development version

Or build from the repository:
cargo build --release # creates target/release/dsq
cargo install --locked --path dsq-cli # installs binary

Process CSV data:
dsq 'map(select(.age > 30))' people.csv

Convert between formats:

dsq '.' data.csv --output data.parquet

Aggregate data:

dsq 'group_by(.department) | map({dept: .[0].department, count: length})' employees.parquet

Filter and transform:

dsq 'map(select(.status == "active") | {name, email})' users.json

Process multiple files:

dsq 'flatten | group_by(.category)' sales_*.csv

Use lazy evaluation for large datasets:

dsq --lazy 'filter(.amount > 1000)' transactions.parquet

Start an interactive REPL to experiment with filters:

dsq --interactive

Available REPL commands:
- `load <file>` - Load data from a file
- `show` - Display current data
- `explain <filter>` - Explain what a filter does
- `history` - Show command history
- `help` - Show help message
- `quit` - Exit
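For readers new to jq-style filters, the aggregation example shown earlier (`group_by(.department) | map({dept: .[0].department, count: length})`) groups rows by a key and then builds one summary object per group. A minimal Python sketch of that logic, using invented sample rows in place of `employees.parquet`:

```python
from itertools import groupby

# Hypothetical sample rows, standing in for employees.parquet
rows = [
    {"name": "Ana", "department": "eng"},
    {"name": "Bo", "department": "sales"},
    {"name": "Cy", "department": "eng"},
]

# group_by(.department): jq's group_by sorts by the key, then groups
rows.sort(key=lambda r: r["department"])
groups = [list(g) for _, g in groupby(rows, key=lambda r: r["department"])]

# map({dept: .[0].department, count: length})
summary = [{"dept": g[0]["department"], "count": len(g)} for g in groups]
print(summary)  # [{'dept': 'eng', 'count': 2}, {'dept': 'sales', 'count': 1}]
```

This is only a model of the filter's semantics; dsq itself evaluates the same expression over DataFrame columns rather than Python dicts.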
dsq convert input.csv output.parquet
dsq inspect data.parquet --schema --sample 10 --stats
dsq merge data1.csv data2.csv --output combined.csv
dsq completions bash >> ~/.bashrc

Input/Output:
- CSV/TSV - Delimited text with customizable options
- Parquet - Columnar storage with compression
- JSON/JSON Lines - Standard and newline-delimited JSON
- Arrow - Columnar in-memory format
- Avro - Row-based serialization
- ADT - ASCII delimited text (control characters)
Output Only:
- Excel (.xlsx)
- ORC - Optimized row columnar
Format detection is automatic based on file extensions. Override it with `--input-format` and `--output-format`.
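Among the supported formats, ADT may be the least familiar: Ronald Duncan's ASCII delimited text conventionally uses the unit separator control character (0x1F) between fields and the record separator (0x1E) between records, so the delimiters can never collide with printable data. A minimal Python sketch of parsing such content (the field values are invented):

```python
US, RS = "\x1f", "\x1e"  # ASCII unit separator and record separator

def parse_adt(text: str) -> list[list[str]]:
    """Split ADT text into records, then each record into fields."""
    records = text.rstrip(RS).split(RS)
    return [rec.split(US) for rec in records]

# Two records of two fields each
data = f"alice{US}30{RS}bob{US}25{RS}"
print(parse_adt(data))  # [['alice', '30'], ['bob', '25']]
```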
- Architecture - Core library structure and modules
- Functions - Built-in function reference
- Formats - Format support and options
- API - Library usage examples
- Configuration - Configuration file reference
- Language - Filter language syntax
- `-i, --input-format <FORMAT>` - Specify input format
- `-o, --output <FILE>` - Output file (stdout by default)
- `--output-format <FORMAT>` - Specify output format
- `-f, --filter-file <FILE>` - Read filter from file
- `--lazy` - Enable lazy evaluation
- `--dataframe-optimizations` - Enable DataFrame optimizations
- `--threads <N>` - Number of threads
- `--memory-limit <LIMIT>` - Memory limit (e.g., 1GB)
- `-c, --compact-output` - Compact output
- `-r, --raw-output` - Raw strings without quotes
- `-S, --sort-keys` - Sort object keys
- `-v, --verbose` - Increase verbosity
- `--explain` - Show execution plan
- `--stats` - Show execution statistics
- `-I, --interactive` - Start REPL mode
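The benefit of `--lazy` is that filtering happens while the input streams through the pipeline, instead of after the whole file is loaded. A generator-based Python sketch of the same idea (the rows are invented; dsq's actual lazy mode is backed by Polars query planning):

```python
def transactions():
    """Simulate streaming rows from a large file, one at a time."""
    for amount in [250, 4800, 90, 1500]:
        yield {"amount": amount}

# Lazy pipeline: no row is read or filtered until the result is consumed
pipeline = (row for row in transactions() if row["amount"] > 1000)

result = list(pipeline)  # materialization happens here
print(result)  # [{'amount': 4800}, {'amount': 1500}]
```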
Configuration files are searched in:
- Current directory (`.dsq.toml`, `dsq.yaml`)
- Home directory (`~/.config/dsq/`)
- System directory (`/etc/dsq/`)
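As a sketch, a project-local `.dsq.toml` enabling the `filter.lazy_evaluation` option used in the `dsq config set` example below might look like:

```toml
# .dsq.toml — project-local configuration
[filter]
lazy_evaluation = true
```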
Manage configuration:
dsq config show # Show current configuration
dsq config set filter.lazy_evaluation true
dsq config init # Create default config

See Configuration for details.
Contributions are welcome! Please ensure:
- Compatibility with jq syntax where possible
- Tests pass with `cargo test`
- Documentation updated for new features
- Performance implications considered
See CONTRIBUTING.md for details.
dsq builds on excellent foundations from:
- jq - The original and inimitable jq
- jaq - jq clone inspiring our syntax compatibility
- Polars - High-performance DataFrame library
- Arrow - Columnar memory format
Special thanks to Ronald Duncan for defining the ASCII Delimited Text (ADT) format.
Our GitHub Actions disk space cleanup script was inspired by the Apache Flink project.
See LICENSE file for details.