06 Nov 03:23

jqnatividad

6dd67c1

0.138.0

Highlights:

⭐ New template command for rendering templates with CSV data.
Generate complex documents from CSVs (Form letters, HTML, JSON, XML files, etc.) with the powerful MiniJinja template engine (Example template).
⭐ New lookup module for fetching reference data from remote and local files.
In addition to the typical http/https schemes for remote files, qsv adds two additional schemes - CKAN:// and datHere://, fetching lookup data from a CKAN site or datHere maintained reference data respectively. The lookup module has simple file-based caching as well to minimize repeated fetching of typically static reference data (default cache age: 600 seconds).
The lookup module is now being used by the luau (for its qsv_register_lookup helper) and validate (for its dynamicEnum custom JSON Schema keyword) commands. More commands will take advantage of this module over time (e.g. apply, geocode, template, sqlp, etc.) to do extended lookups (e.g. lookup Census information given spatiotemporal data - like demographic info of a Census tract).
✨ Enhanced fetchpost with MiniJinja templating for payload construction.
Previously, fetchpost was limited to posting url-encoded HTML Form data with content type application/x-www-form-urlencoded. Now with the new --payload-tpl and --content-type options, users can post request bodies rendered with MiniJinja and specify other content types (typically application/json, text/plain, multipart/form-data) as well.
✨ Improved Polars integration with automatic schema detection
The joinp and sqlp commands now use qsv's stats cache to automatically determine column data types, rather than having Polars scan a sample of rows. This provides two key benefits:
1. Faster execution by skipping Polars' schema inference step
2. GUARANTEED data type inferencing since the stats cache analyzes the entire dataset, not just a sample
🏃 fast-float2 crate for faster float parsing
Casting string/bytes to float is now much faster (2 to 8x faster than Rust's standard library) with fast-float2.
💪 Major dependency updates including Polars 0.44.2, Luau 0.650, mlua 0.10.0 and jsonschema 0.26.1
These core crates underpin qsv's advanced commands. Using the latest versions of these crates allows qsv to stay true to its goal of being the fastest and most comprehensive data-wrangling toolkit.
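
As an illustration of the file-based caching described for the lookup module above, here is a minimal Python sketch of age-based cache reuse. It is not qsv's actual implementation (qsv is written in Rust): the names is_cache_fresh and cached_lookup and the fetch callback are hypothetical, and only the 600-second default age comes from the notes above.

```python
import time
from pathlib import Path
from typing import Callable, Optional


def is_cache_fresh(path: Path, max_age_secs: float = 600.0,
                   now: Optional[float] = None) -> bool:
    """Return True if `path` exists and was modified within `max_age_secs`."""
    if not path.exists():
        return False
    current = time.time() if now is None else now
    return (current - path.stat().st_mtime) <= max_age_secs


def cached_lookup(cache_path: Path, fetch: Callable[[], str],
                  max_age_secs: float = 600.0) -> str:
    """Reuse the cached copy if it is fresh; otherwise fetch and re-cache."""
    if is_cache_fresh(cache_path, max_age_secs):
        return cache_path.read_text(encoding="utf-8")
    # `fetch` is a hypothetical callback, e.g. a download of the reference CSV
    data = fetch()
    cache_path.write_text(data, encoding="utf-8")
    return data
```

Because reference data is typically static, a second call within the cache age never touches the network; only a stale or missing cache file triggers a fetch.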

Added

added lookup module - enabling fetching and caching of reference data from remote and local files #2262
fetchpost: add --payload-tpl <file> and --content-type options to construct payload using MiniJinja with the appropriate content-type #2268 5921498
joinp: derive polars schema from stats cache 86fe22e
sqlp: derive polars schema from stats cache #2256
template: new command to render MiniJinja templates with CSV data #2267
validate: add dynamicEnum lookup support #2265
contrib(completions): add template command and update fetchpost by @rzmk in #2269
add fast-float2 dependency for faster bytes to float conversion 7590e4e 3ca30aa
added more benchmarks for new/updated commands f8a1d4f cd7e480
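
The idea behind the fetchpost --payload-tpl entry above (rendering one request body per CSV row from a template) can be sketched in Python as follows. This is only an approximation: it uses the standard library's string.Template as a stand-in for MiniJinja, and the template text, column names and render_payloads function are hypothetical.

```python
import csv
import io
import json
from string import Template
from typing import Iterator

# Hypothetical payload template. A MiniJinja template would use {{ ... }}
# placeholders; string.Template's $name syntax stands in for it here.
PAYLOAD_TPL = Template('{"city": $city, "population": $population}')


def render_payloads(csv_text: str, template: Template) -> Iterator[str]:
    """Yield one rendered JSON payload per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # json.dumps quotes and escapes each value so the result is valid JSON
        safe = {name: json.dumps(value) for name, value in row.items()}
        yield template.substitute(safe)
```

Each rendered string would then be posted with a matching content type such as application/json. Note that every CSV value is rendered as a JSON string in this sketch.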

Changed

luau: adapt to mlua 0.10 API changes 268cb45
luau: refactored stage management 31ef58a
luau: now uses the lookup module 2f4be34
stats: minor perf refactoring 6cdd6ea
build(deps): bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #2243
build(deps): bump azure/trusted-signing-action from 0.4.0 to 0.5.0 by @dependabot in #2239
build(deps): bump bytes from 1.7.2 to 1.8.0 by @dependabot in #2231
build(deps): bump cached from 0.53.1 to 0.54.0 by @dependabot in #2272
build(deps): bump flexi_logger from 0.29.3 to 0.29.4 by @dependabot in #2229
build(deps): bump flexi_logger from 0.29.4 to 0.29.5 by @dependabot in #2261
build(deps): bump flexi_logger from 0.29.5 to 0.29.6 by @dependabot in #2266
build(deps): bump hashbrown from 0.15.0 to 0.15.1 by @dependabot in #2270
build(deps): bump jsonschema from 0.24.0 to 0.24.1 by @dependabot in #2234
build(deps): bump jsonschema from 0.24.1 to 0.24.2 by @dependabot in #2238
build(deps): bump jsonschema from 0.24.2 to 0.24.3 by @dependabot in #2240
build(deps): bump jsonschema from 0.25.0 to 0.25.1 by @dependabot in #2244
build(deps): bump jsonschema from 0.26.0 to 0.26.1 by @dependabot in #2260
build(deps): bump regex from 1.11.0 to 1.11.1 by @dependabot in #2242
build(deps): bump reqwest from 0.12.8 to 0.12.9 by @dependabot in #2258
build(deps): bump serde from 1.0.210 to 1.0.211 by @dependabot in #2232
build(deps): bump serde from 1.0.211 to 1.0.213 by @dependabot in #2236
build(deps): bump serde from 1.0.213 to 1.0.214 by @dependabot in #2259
build(deps): bump simd-json from 0.14.1 to 0.14.2 by @dependabot in #2235
build(deps): bump tokio from 1.40.0 to 1.41.0 by @dependabot in #2237
deps: updated our fork of the csv crate with more perf optimizations eae7d76
deps: use calamine upstream with unreleased fixes 4cc7f37
deps: use our csvlens fork until PR removing unneeded arboard features is merged bb32322
deps: bump jsonschema from 0.25 to 0.26 #2251
deps: bump embedded Luau from 0.640 to 0.650 8c54b87 aca30b0
deps: bump mlua from 0.9 to 0.10 #2249
deps: bump Polars from 0.43.1 at py-1.11.0 tag to latest 0.44.2 upstream #2255 0e40a44
apply select clippy lint suggestions
updated indirect dependencies
aligned Rust nightly to Polars nightly - 2024-10-28 - 245bcb5

Fixed

fix documentation typo: it's → its by @tmtmtmtm in #2254

Removed

removed need to set RAYON_NUM_THREADS env var and just call the Rayon API directly aa6ef89
removed unneeded create_dir_all_threadsafe helper now that std::fs::create_dir_all is threadsafe d0af83b

Full Changelog: 0.137.0...0.138.0

Contributors

tmtmtmtm, dependabot, and rzmk


21 Oct 03:57

jqnatividad

0.137.0

75dbaba

0.137.0

Highlights:

extdedup & extsort now support two modes - LINE mode and CSV mode. Previously, both commands only sorted on a line-by-line basis (LINE mode).
With the addition of CSV mode, you can now deduplicate or sort CSV files on a column-by-column basis, with the powerful --select option to specify which columns to deduplicate or sort on.
This is especially useful for large CSV files with many columns, where you only want to deduplicate or sort on a subset of columns. And since both commands use disk-backed algorithms (an on-disk hash table for extdedup, and an external merge sort for extsort) - they can handle files larger than memory.
sqlp now has a --cache-schema option that caches the inferred schema of the input CSV file, which can significantly speed up subsequent queries on the same file, as the initial schema inferencing step is skipped.
fetch and fetchpost have been updated to use the jaq crate instead of the jql crate. This change was made to improve performance and to make the commands consistent with the json command which also uses jaq. Furthermore, jaq is a clone of jq - a widely used JSON parsing tool, so it should be more familiar to users.
stats is a tad faster as we keep squeezing more performance from this central command.
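
The difference between LINE mode and CSV mode described above can be sketched in Python. This in-memory version only illustrates deduplicating on a subset of columns; it does not reproduce the disk-backed algorithms the actual commands use, and dedup_on_columns is a hypothetical name.

```python
import csv
import io
from typing import List


def dedup_on_columns(csv_text: str, key_columns: List[str]) -> str:
    """Keep the first row seen for each distinct combination of key columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    seen = set()
    for row in reader:
        # LINE mode would compare the whole line; CSV mode compares only
        # the selected columns
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            writer.writerow(row)
    return out.getvalue()
```

With name and city as the key columns, two rows that differ only in their id column count as duplicates, which whole-line comparison would not detect.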

Added

extdedup: now supports two modes - LINE mode and CSV mode #2208
extsort: now also has two modes - CSV mode and LINE mode #2210
sqlp: add --cache-schema option #2224
added sqlp --cache-schema benchmarks

Changed

apply & applydp: use smallvec for operations vector & other minor performance optimizations #2219 & bc837ae
apply & applydp: specify min_length for parallel iterators 7d6ce5e
fetch & fetchpost: replace jql with jaq #2222
stats: performance optimizations f205809 e26c27f 4579c1b
validate: specify min_length for parallel iterators a5b8185
deps: updated polars to 0.43.1 at the py-1.10.0 tag.
build(deps): bump calamine from 0.26.0 to 0.26.1 by @dependabot in #2204
build(deps): bump csvs_convert from 0.8.14 to 0.9.0 by @dependabot in #2215
build(deps): bump flexi_logger from 0.29.2 to 0.29.3 by @dependabot in #2209
build(deps): bump jsonschema from 0.23.0 to 0.24.0 by @dependabot in #2223
build(deps): bump pyo3 from 0.22.3 to 0.22.4 by @dependabot in #2207
build(deps): bump pyo3 from 0.22.4 to 0.22.5 by @dependabot in #2212
build(deps): bump redis from 0.27.3 to 0.27.4 by @dependabot in #2202
build(deps): bump redis from 0.27.4 to 0.27.5 by @dependabot in #2217
build(deps): bump serde_json from 1.0.129 to 1.0.130 by @dependabot in #2218
build(deps): bump serde_json from 1.0.131 to 1.0.132 by @dependabot in #2220
build(deps): bump uuid from 1.10.0 to 1.11.0 by @dependabot in #2213
apply select clippy lints
bumped indirect dependencies
bumped MSRV to 1.82

Fixed

fix performance regression in batched commands by refactoring optimal_batch_size to require indexed CSV files #2206

Removed

fetch & fetchpost: removed jql options; replaced with jaq #2222

Full Changelog: 0.136.0...0.137.0

Contributors

dependabot


08 Oct 19:41

jqnatividad

0.136.0

82b7611

0.136.0

🎉 qsv pro is now available in the Microsoft Store! 🎉

It's Data Wrangling Democratized on the Desktop, featuring:

📊 Familiar Spreadsheet Interface
tap the power of qsv to query, analyze, enrich, scrub and transform huge Excel files and multi-gigabyte CSV files in seconds, without having to deal with the command-line.
CKAN desktop client
designed to make data publishing easier for portal operators and data stewards using the CKAN platform.
📥 Flow
allows you to build custom node-based flows and data pipelines using a visual interface.
🔧 Toolbox
features an ever-expanding library of reusable scripts for common data-wrangling use cases.
⭐ and more!
Natural Language Interface (RAG), Polars SQL query support, an API, Python/Luau support, automatic Data Dictionaries, DCAT 3 metadata profile inferencing, along with a retinue of other cloud-based services (e.g. customizable street-level geocoding, data feeds, reference data lookups, geo-ip lookups, cloud storage support, .qsv file format, etc.) that will be unveiled in future versions.

Like qsv, we're iterating rapidly with qsv pro, so your feedback is essential. Give it a try!

Get it from https://qsvpro.dathere.com or the Microsoft Store.

Other highlights:

excel: new --table option for XLSX files; new --header-row option; expanded --range option, adding support for Named Ranges and absolute ranges (e.g. Sheet2!$A$1:$J$10); and expanded metadata export now including Named Ranges and Tables (for XLSX files)
Improved performance for several commands (apply, datefmt, tojsonl and validate) through automatic batch size optimization
validate: dynamicEnum custom JSON Schema keyword (renamed from dynenum) and enhanced email validation
schema: automatic JSON Schema const inferencing for columns with just one value
Significant dependency updates, including latest upstream versions of Polars, jsonschema, and serde_json with unreleased performance upgrades, new features and fixes

NOTE: You can see qsv & qsv pro in action in our "The Problem with Data Portals" webinar Wed, Oct 23, 2024. 1-2pm EDT

Added

🎉 qsv pro is now in the Microsoft Store!!! 🎉
apply, datefmt, tojsonl, validate: added logic to automatically determine optimal batch size for better parallelization #2178
enum: added --new-column support for all enum modes, not just --increment #2173
excel: new --table option for XLSX files #2194
excel: new --header-row option 458f79a
excel: expanded range and metadata options #2195
schema: added JSON Schema automatic const inferencing #2180
Add signing step to qsv MSI installer GitHub Action by @rzmk in #2182
contrib(completions): add --table option to qsv excel by @rzmk in #2197
completions: add --header-row option to qsv excel e8794d5
added new apply operations sentiment benchmark b745e64
docs: added indexing section to PERFORMANCE.md 804145a

Changed

stats: various minor micro-optimizations 62d95fc 2c2862a
validate: renamed custom keyword dynenum to dynamicEnum to be more consistent with JSON schema naming conventions 0.135.0...master#diff-9783631cdad9e1f47f60266303dc2d56a6e7a486784b61c40961601e8192f7cf
validate: optimizations for increased performance; replace serde_json with simd_json 0.135.0...master#diff-9783631cdad9e1f47f60266303dc2d56a6e7a486784b61c40961601e8192f7cf
apply new clippy::ref_option lint to Config::new API #2192
Update debian package readme by @tino097 in #2187
deps: bump calamine from 0.25 to 0.26 b42279a
deps: jsonschema use latest 0.22.3 upstream with unreleased features/fixes
deps: polars use latest 0.43.1 upstream with unreleased features/fixes
deps: created our own fork of unmaintained vader_sentiment crate b426761
deps: use serde_json upstream with unreleased perf improvement/fixes https://github.com/jqnatividad/qsv/blob/1c1174b3b8b65d9dfd9c841597366fb09d0a047c/Cargo.toml#L221
build(deps): bump flate2 from 1.0.33 to 1.0.34 by @dependabot in #2171
build(deps): bump flexi_logger from 0.29.0 to 0.29.1 by @dependabot in #2189
build(deps): bump flexi_logger from 0.29.1 to 0.29.2 by @dependabot in #2196
build(deps): bump hashbrown from 0.14.5 to 0.15.0 by @dependabot in #2186
build(deps): bump jsonschema from 0.20.0 to 0.21.0 by @dependabot in #2177
build(deps): bump jsonschema from 0.22.1 to 0.22.2 by @dependabot in #2191
build(deps): bump regex from 1.10.6 to 1.11.0 by @dependabot in #2176
build(deps): bump reqwest from 0.12.7 to 0.12.8 by @dependabot in #2183
build(deps): bump simd-json from 0.14.0 to 0.14.1 #2199
build(deps): bump simple-expand-tilde from 0.4.2 to 0.4.3 by @dependabot in #2190
build(deps): bump sysinfo from 0.31.4 to 0.32.0 by @dependabot in #2193
build(deps): bump tempfile from 3.12.0 to 3.13.0 by @dependabot in #2175
apply select clippy lints
bumped indirect dependencies
aligned Rust nightly to Polars nightly - 2024-09-29 7cd2de1

Fixed

schema: fix enum so it only adds a list when the number of unique values > --enum-threshold #2180
Upload artifact fix for Debian package publishing by @tino097 in #2168
fixed typos configuration 627de89
fixed various GitHub Actions publishing workflow issues

Full Changelog: 0.135.0...0.136.0

Contributors

tino097, dependabot, and rzmk


24 Sep 12:46

jqnatividad

0.135.0

7b9edaf

0.135.0

Highlights

JSON Schema validation just got a whole lot more powerful with the introduction of qsv's custom dynenum keyword!
With dynenum, you can now dynamically lookup valid enum values from a CSV (on the filesystem or on a URL), allowing for more flexible and responsive data validation.

Unlike the standard enum keyword, dynenum does not require hardcoding valid values at schema definition time, and can be used to validate data against a changing set of valid values.

For an example, see #1872 (reply in thread).
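
The contrast with a hardcoded enum can be sketched in Python. This is a conceptual illustration only, not how the validate command is implemented: load_enum_values and invalid_values are hypothetical names, and reading the valid values from the first column of the lookup CSV is an assumption made for the example.

```python
import csv
import io
from typing import Iterable, List, Set


def load_enum_values(csv_text: str, column: int = 0) -> Set[str]:
    """Build the set of valid values from one column of a lookup CSV.

    Assumption: the CSV has a header row and the values live in `column`.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return {row[column] for row in reader if row}


def invalid_values(values: Iterable[str], allowed: Set[str]) -> List[str]:
    """Return the values that are not in the dynamically loaded enum."""
    return [v for v in values if v not in allowed]
```

Because the allowed set is loaded at validation time, updating the lookup CSV changes what counts as valid without editing the schema.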

In an upcoming qsv pro release, we're planning on making dynenum even more powerful by allowing you to easily specify high-value reference data (e.g. US Census data, World Bank data, data.gov, etc.) that is maintained at data.dathere.com and other CKAN instances.

This release also adds the custom currency JSON Schema format, which enables currency validation according to the ISO 4217 standard.

The Polars engine was also upgraded to 0.43.1 at the py-1.81.1 tag - making for various under-the-hood improvements for the sqlp, joinp and count commands, as we set the stage for more Polars-powered features in future releases.

Added

foreach: enabled foreach command on Windows prebuilt binaries def9c8f
lens: added support for QSV_SNIFF_DELIMITER env var and snappy auto-decompression 8340e89
sample: add --max-size option e845a3c
validate: added dynenum custom JSON Schema keyword for dynamic validation lookups #2166
tests: add tests for https://100.dathere.com/lessons/2 by @rzmk in #2141
added stats_sorted and frequency_sorted benchmarks
added validate_dynenum benchmarks

Changed

json: add error for empty key and update usage text by @rzmk in #2167
prompt: gate prompt command behind prompt feature #2163
validate: expanded currency JSON Schema custom format to support ISO 4217 currency codes and alternate formats 5202508
validate: migrate to new jsonschema crate api 5d65054
Update ubuntu version for deb package by @tino097 in #2126
contrib(completions): update completions for qsv v0.134.0 and fix subcommand options by @rzmk in #2135
contrib(completions): add --max-size completion for sample by @rzmk in #2142
deps: bump to polars 0.43.1 at py-1.81.1 #2130
deps: switch back to calamine upstream instead of our fork 677458f
build(deps): bump actix-governor from 0.5.0 to 0.6.0 by @dependabot in #2146
build(deps): bump anyhow from 1.0.87 to 1.0.88 by @dependabot in #2132
build(deps): bump arboard from 3.4.0 to 3.4.1 by @dependabot in #2137
build(deps): bump bytes from 1.7.1 to 1.7.2 by @dependabot in #2148
build(deps): bump geosuggest-core from 0.6.3 to 0.6.4 by @dependabot in #2153
build(deps): bump geosuggest-utils from 0.6.3 to 0.6.4 by @dependabot in #2154
build(deps): bump jql-runner from 7.1.13 to 7.2.0 by @dependabot in #2165
build(deps): bump jsonschema from 0.18.1 to 0.18.2 by @dependabot in #2127
build(deps): bump jsonschema from 0.18.2 to 0.18.3 by @dependabot in #2134
build(deps): bump jsonschema from 0.18.3 to 0.19.1 by @dependabot in #2144
build(deps): bump jsonschema from 0.19.1 to 0.20.0 by @dependabot in #2152
build(deps): bump pyo3 from 0.22.2 to 0.22.3 by @dependabot in #2143
build(deps): bump rfd from 0.14.1 to 0.15.0 by @dependabot in #2151
build(deps): bump simple-expand-tilde from 0.4.0 to 0.4.2 by @dependabot in #2129
build(deps): bump qsv_currency from 0.6.0 to 0.7.0 by @dependabot in #2159
build(deps): bump qsv_docopt from 1.7.0 to 1.8.0 by @dependabot in #2136
build(deps): bump redis from 0.26.1 to 0.27.0 by @dependabot in #2133
build(deps): bump simdutf8 from 0.1.4 to 0.1.5 by @dependabot in #2164
bump indirect dependencies
apply select clippy lint suggestions
several usage text/documentation improvements
bump MSRV to 1.81.0

Fixed

validate: correct fail_validation_error! macro; reformat error messages to use hyphens as the JSONschema error message already starts with "error:" 9a25524
moved --help output from stderr to stdout as per GNU CLI guidelines #2138
lens: fixed parsing of lens options 1cdd1bc
searchset: fixed usage text for <regexset-file> 9a60fb0
used patched forks of arrow, csvlens and xlsxwriter crates that replace a dependency on an old version of lexical-core with known soundness issues - https://rustsec.org/advisories/RUSTSEC-2023-0086. Once those crates have updated their lexical-core dependency, we will revert to the original crates.

Removed

removed prompt command from qsvlite #2163
publish: remove lens feature from i686 targets as it does not compile 959ca76
deps: remove anyhow dependency #2150

Full Changelog: 0.134.0...0.135.0

Contributors

tino097, dependabot, and rzmk


10 Sep 12:11

jqnatividad

0.134.0

88d72c6

0.134.0

qsv pro v1 is here! 🎉

If you've been using qsv for a while, even if you're a command-line ninja, you'll find a lot of new capabilities in qsv pro that can make your data wrangling experience even better!

Apart from making qsv easier to use, qsv pro has a multitude of features including: view interactive data tables; browse stats/frequency/metadata; run recipes and tools (scripts); run Polars SQL queries; use Natural Language queries (using Retrieval Augmented Generation (RAG) techniques); regular expression search; export to multiple file formats; download/upload from/to compatible CKAN instances; design custom node-based flows and data pipelines; interact with a local API from external programs including the qsv pro command; run various qsv commands in a graphical user interface; and the list goes on!

And that's just the beginning, there's more to come! You just have to try it!

Download qsv pro v1 now at qsvpro.dathere.com.

Other highlights include:

pro: new command to allow qsv to interact with the qsv pro API to tap into qsv pro exclusive features.
lens: new command to interactively view CSVs using the csvlens crate.
The ludicrously fast diff command is now easier to use with its --drop-equal-fields option. @janriemer continues to work on his csv-diff crate, and there are more diff UX improvements coming soon!
stats adds sum_length and avg_length "streaming" statistics in addition to the existing min_length and max_length metrics. These are especially useful for datasets with a lot of "free text" columns.
stats also got "smarter" and "faster" by dog-fooding its own statistics to make it run faster!
It's a little complicated, but the way stats works is that it first compiles the "streaming" statistics on the fly as it loads the data, multiplexed across several threads, and the more expensive advanced statistics are "lazily" computed at the end.
Since we now compile "sort order" in a streaming manner, we use this info when deriving cardinality at the end to see if we can skip sorting - an otherwise necessary step to get cardinality, which is done by "scanning" all the sorted values of a column. Every time two neighboring values differ in a sorted column, the cardinality count is incremented.
Apart from this "sort order" optimization, we also improved the "cardinality scan" algorithm - halving its memory footprint and making it faster still for larger datasets by parallelizing the computation. This, in turn, makes the frequency command faster and more memory efficient.
It's thanks to performance tweaks like these that, despite adding six metrics (is_ascii, sort_order, sum_length, avg_length, sem - standard error of the mean & cv - coefficient of variation) in recent releases, stats is still able to compile 35 statistics and do GUARANTEED data type inferences of a million row, 41 column, 520 MB sample of NYC's 311 data in 1.327 seconds (753,580 records per second)!¹
We now also use our own fork of the csv crate, featuring SIMD-accelerated UTF-8 validation and other minor perf tweaks, making the entire qsv suite faster still!
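
The cardinality scan and the "sort order" shortcut described above can be sketched in Python. This is a simplified, sequential illustration under our own naming (is_sorted and cardinality are hypothetical); the parallel algorithm and the memory optimizations mentioned above are not reproduced.

```python
from typing import List, Sequence


def is_sorted(values: Sequence[str]) -> bool:
    """Streaming-style check: True if values are in ascending order."""
    return all(a <= b for a, b in zip(values, values[1:]))


def cardinality(values: List[str], already_sorted: bool = False) -> int:
    """Count distinct values by scanning neighbors in sorted order."""
    if not values:
        return 0
    # Skip the sort when the column is already known to be sorted
    ordered = values if already_sorted else sorted(values)
    count = 1  # the first value is always a new distinct value
    for prev, cur in zip(ordered, ordered[1:]):
        if prev != cur:  # neighbors differ, so a new distinct value starts
            count += 1
    return count
```

A caller would pass already_sorted=is_sorted(column) once, which avoids an O(n log n) sort for columns that arrive in order.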

Added

pro: add qsv pro command to interact with qsv pro API by @rzmk in #2039
lens: new command to interactively view CSVs using the csvlens crate #2117
apply: add crc32 operation #2121
count: add --delimiter option #2120
diff: add flag --drop-equal-fields by @janriemer in #2114
stats: add sum_length and avg_length columns #2113
stats: smarter cardinality computation - added new parallel algorithm for large datasets (10,000+ rows) and updated sequential algorithm for smaller datasets 4e63fec

Changed

count: added comment to justify magic number 5241e39
stats: use simdjson for faster JSONL parsing; micro-optimize compute hot loop 0e8b734
stats: standardized OVERFLOW and UNDERFLOW messages 38c6128
sort: renamed symbol to eliminate devskim lint false positive warning 12db739
enable lens feature in GH workflows #2122
deps: bump polars 0.42.0 to latest upstream at time of release 3c17ed1
deps: use our own optimized fork of csv crate, with simdutf8 validation and other minor perf tweaks e4bcd71
build(deps): bump serde from 1.0.209 to 1.0.210 by @dependabot in #2111
build(deps): bump serde_json from 1.0.127 to 1.0.128 by @dependabot in #2106
build(deps): bump qsv-stats from 0.19.0 to 0.22.0 #2107 #2112 cb1eb60
apply select clippy lint suggestions
updated several indirect dependencies
made various doc and usage text improvements

Fixed

schema: Print an error if the qsv stats invocation fails by @abrauchli in #2110

New Contributors

@abrauchli made their first contribution in #2110

Full Changelog: 0.133.1...0.134.0

¹ see stats_everything_index benchmark

Contributors

abrauchli, janriemer, and 2 other contributors


03 Sep 19:04

jqnatividad

0.133.1

e42f499

0.133.1

Highlights

¹ This release doubles down on Polars' capabilities, as we now track the latest Polars upstream as a matter of policy. If you think qsv has a torrid release schedule, you should see Polars. They're constantly fixing bugs and adding new features and optimizations!
To keep up, we've added Polars revision info to the --version output, and the --envlist option now includes Polars-relevant env vars. We've also added support for the POLARS_BACKTRACE_IN_ERR env var to control whether Polars backtraces are included in error messages.
We also removed the to parquet subcommand as it's redundant with the Polars-powered sqlp's ability to create parquet files. This removes the HUGE duckdb dependency, which should make compile times markedly shorter and binaries smaller.



Other highlights include:

New edit command that allows you to edit CSV files.
The count command's --width option now includes record width stats beyond max length (avg, median, min, variance, stddev & MAD).
The fixlengths command now has --quote and --escape options.
The stats command adds a sort_order streaming statistic.
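
For readers unfamiliar with the record width statistics listed for count --width above, the following Python sketch computes them for a list of record widths. The details are assumptions made for illustration: widths are taken as UTF-8 byte lengths, variance and stddev use the population formulas, and MAD is read as median absolute deviation. The function names are hypothetical.

```python
import statistics
from typing import Dict, List


def record_widths(lines: List[str]) -> List[int]:
    """Width of each record in bytes (assumption: UTF-8 encoded length)."""
    return [len(line.encode("utf-8")) for line in lines]


def width_stats(widths: List[int]) -> Dict[str, float]:
    """Summary statistics over record widths."""
    med = statistics.median(widths)
    return {
        "max": max(widths),
        "min": min(widths),
        "avg": statistics.fmean(widths),
        "median": med,
        "variance": statistics.pvariance(widths),  # population variance
        "stddev": statistics.pstdev(widths),       # population std deviation
        # median of the absolute deviations from the median
        "mad": statistics.median(abs(w - med) for w in widths),
    }
```

The median and MAD are less sensitive than avg and stddev to a handful of unusually wide records, which is why reporting both kinds of measure is useful.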

NOTE: 0.133.0 was skipped because of a dev dependency conflict with the csvs_convert crate, preventing us from publishing 0.133.0 to crates.io. This has been resolved in 0.133.1.

Added

count: expanded --width options, adding record width stats beyond max length (avg, median, min, variance, stddev & MAD). Also added --json output when using --width #2099
edit: add qsv edit command by @rzmk in #2074
fixlengths: added --quote and --escape options #2104
stats: add sort_order streaming statistic #2101
polars: add polars revision info to --version output e60e44f
polars: added Polars relevant env vars to --envlist option 0ad68fe
polars: add & document POLARS_BACKTRACE_IN_ERR env var f9cc559

Changed

Optimize polars optflags #2089
deps: bump polars 0.42.0 to latest upstream at time of release 3b7af51
bump polars to latest upstream, removing smartstring #2091
build(deps): bump actions/setup-python from 5.1.1 to 5.2.0 by @dependabot in #2094
build(deps): bump flate2 from 1.0.32 to 1.0.33 by @dependabot in #2085
build(deps): bump flexi_logger from 0.28.5 to 0.29.0 by @dependabot in #2086
build(deps): bump indexmap from 2.4.0 to 2.5.0 by @dependabot in #2096
build(deps): bump jsonschema from 0.18.0 to 0.18.1 by @dependabot in #2084
build(deps): bump serde from 1.0.208 to 1.0.209 by @dependabot in #2082
build(deps): bump serde_json from 1.0.125 to 1.0.127 by @dependabot in #2079
build(deps): bump sysinfo from 0.31.2 to 0.31.3 by @dependabot in #2077
build(deps): bump qsv-stats from 0.18.0 to 0.19.0 by @dependabot in #2100
build(deps): bump tokio from 1.39.3 to 1.40.0 by @dependabot in #2095
apply select clippy lint suggestions
updated several indirect dependencies
made various doc and usage text improvements
pin Rust nightly to 2024-08-26 from 2024-07-26, aligning with Polars pinned nightly

Fixed

Ensure portable binaries are "added" to the publish zip archive, instead of replacing all the binaries with just the portable version. Fixes #2083. 34ad206

Removed

removed to parquet subcommand as it's redundant with sqlp's ability to create parquet files. This also removes the HUGE duckdb dependency, which should markedly make compile times shorter and binaries much smaller #2088
removed smartstring dependency now that Polars has its own compact inlined string type 47f047e
removed to parquet benchmark

Full Changelog: 0.132.0...0.133.1

¹ ChatGPT prompt: Using the logos for the Polars project and the qsv project as a baseline, can you create a version with the cowboy riding a polar bear instead?

Contributors

dependabot and rzmk


21 Aug 10:34

jqnatividad

0.132.0

d644e83

0.132.0

Highlights

With this release, we finally finish the stats caching refactor started in 0.131.0, replacing the binary encoded stats cache with a simpler JSONL cache. The stats cache stores the necessary statistical metadata to make several key commands smarter & faster. Per the benchmarks:

frequency is 6x faster (frequency_index_stats_mode_auto).
Not only is it faster, it no longer needs to compile a hashmap for columns with ALL unique values (e.g. ID columns), making it practically able to handle "real-world" datasets of any size (unless every column has ALL unique values, in which case the entire CSV has to fit into memory).
tojsonl is 2.67x faster (tojsonl_index)
schema is two orders of magnitude (100x) faster!!! (schema_index)

The stats cache also provides the foundation for even more "smart" features and commands in the future. It also has the side-benefit of adding a way to produce stats in JSONL format that can be used for other purposes beyond qsv.

The search, searchset, and replace commands now also have a --literal option that allows you to search for and replace strings with regex special/reserved characters. This makes it easier to search for and replace strings that contain otherwise reserved regex characters without having to escape them (especially useful with URL columns that often contain characters like ?,:,-,., etc.)
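
What a literal search changes can be shown with Python's re module. This sketch only demonstrates the general technique of escaping regex metacharacters; it is not the commands' actual implementation, and build_pattern is a hypothetical helper.

```python
import re


def build_pattern(pattern: str, literal: bool = False) -> "re.Pattern[str]":
    """Compile `pattern` as a regex, or as a plain string when literal=True."""
    # re.escape backslash-escapes characters such as ? . : - so they match
    # themselves instead of acting as regex operators
    return re.compile(re.escape(pattern) if literal else pattern)
```

As a regex, example.com also matches exampleXcom because the dot matches any character; with literal=True, only the exact text example.com matches.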

Added

search, searchset & replace: add --literal option #2060 & 7196053
slice: added usage text examples 04afaa3
publish: added workflow to build "portable" binaries with CPU features disabled
contrib(completions): add --literal for search and searchset by @rzmk in #2061
contrib(completions): add --literal completion to replace by @rzmk in #2062
add more polars metadata in --version info #2073
docs: added more info to SECURITY.md 609d4df
docs: expanded Goals/Non-Goals 54998e3
docs: added Installation "Option 0" quick start bf5bf82
added search --literal benchmark

Changed

stats, schema, frequency & tojsonl: stats caching refactor, replacing binary encoded stats cache with a simpler JSONL cache #2055
rename stats --stats-json option to stats --stats-jsonl #2063
changed "broken pipe" error to a warning 7353275
docs: update multithreading and caching sections of PERFORMANCE.md 5e6bc45
deps: switch to our qsv-optimized fork of csv crate 3fc1e82
deps: bump polars from 0.41.3 to 0.42.0 #2051
build(deps): bump actix-web from 4.8.0 to 4.9.0 by @dependabot in #2041
build(deps): bump flate2 from 1.0.31 to 1.0.32 by @dependabot in #2071
build(deps): bump indexmap from 2.3.0 to 2.4.0 by @dependabot in #2049
build(deps): bump reqwest from 0.12.6 to 0.12.7 by @dependabot in #2070
build(deps): bump rust_decimal from 1.35.0 to 1.36.0 by @dependabot in #2068
build(deps): bump serde from 1.0.205 to 1.0.206 by @dependabot in #2043
build(deps): bump serde from 1.0.206 to 1.0.207 by @dependabot in #2047
build(deps): bump serde from 1.0.207 to 1.0.208 by @dependabot in #2054
build(deps): bump serde_json from 1.0.122 to 1.0.124 by @dependabot in #2045
build(deps): bump serde_json from 1.0.124 to 1.0.125 by @dependabot in #2052
apply select clippy lint suggestions
updated several indirect dependencies
made various usage text improvements

Fixed

stats: fix --output delimiter inferencing based on file extension #2065
make process_input helper handle stdin better #2058
docs: fix completions for --stats-jsonl and qsv pro installation text update by @rzmk in #2072
docs: added Note about why luau feature is disabled in musl binaries - ffa2bc5 & 27d0f8e

Removed

Removed bincode dependency now that we're using JSONL stats cache #2055 babd92b

Full Changelog: 0.131.1...0.132.0

Contributors

dependabot and rzmk


09 Aug 14:44

jqnatividad

0.131.1

c60b99d

0.131.1

Changed

deps: bump polars to latest upstream post py-1.41.1 release at the time of this release
build(deps): bump filetime from 0.2.23 to 0.2.24 by @dependabot in #2038

Fixed

frequency: change --stats-mode default to none from auto.
This is because of a big performance regression when using --stats-mode auto on datasets with columns with ALL unique values.
See #2040 for more info.

Full Changelog: 0.131.0...0.131.1

Contributors

dependabot

09 Aug 01:03

jqnatividad

0.131.0

c26cd0b

Highlights

Refactored frequency to make it smarter and faster.
frequency's core algorithm essentially compiles an in-memory hashmap to determine the frequency of each unique value for each column. It does this using multi-threaded, multi-I/O techniques to make it blazing fast.
However, for columns with ALL unique values (e.g. ID columns), this takes a comparatively long time and consumes a lot of memory as it essentially compiles a hashmap of the ENTIRE column, with a hashmap entry for each column value with a count of 1.
Now, with the new --stats-mode option (enabled by default), frequency can compile the dataset in a more intelligent way by looking up a column's cardinality in the stats cache.
If the cardinality of a column is equal to the CSV's rowcount (indicating a column with ALL unique values), it short-circuits frequency calculations for that column - dramatically reducing the time and memory requirements for the ID column as it eliminates the need to maintain a hashmap for it.
Practically speaking, this makes frequency able to handle "real-world" datasets of any size.
To ensure frequency is as fast as possible, be sure to index and compute stats for your datasets beforehand.
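
The short-circuit described above can be sketched as follows. This is an illustrative Python model, not qsv's Rust implementation: the `cardinality` dict is a hypothetical stand-in for what the stats cache would supply, and the `<ALL_UNIQUE>` label is an assumed placeholder for the summarized result.

```python
from collections import Counter


def column_frequencies(rows, header, cardinality=None):
    """Count value frequencies per column, skipping all-unique columns.

    `cardinality` maps column name -> number of distinct values
    (hypothetical stand-in for a stats cache lookup).
    """
    rowcount = len(rows)
    cardinality = cardinality or {}
    result = {}
    for idx, name in enumerate(header):
        if rowcount > 0 and cardinality.get(name) == rowcount:
            # Cardinality equals the row count, so every value occurs once.
            # Emit a single summary entry instead of building a hashmap.
            result[name] = {"<ALL_UNIQUE>": rowcount}
            continue
        # Otherwise fall back to the usual in-memory tally.
        result[name] = dict(Counter(row[idx] for row in rows))
    return result


header = ["id", "city"]
rows = [["1", "NYC"], ["2", "LA"], ["3", "NYC"]]
freq = column_frequencies(rows, header, {"id": 3, "city": 2})
```

In this sketch the `id` column never gets a per-value hashmap, which is where the time and memory savings for ID columns would come from. Without cardinality information, every column takes the full tally path.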
Setting the stage for Datapusher+ v1 and...
The "itches we've been scratching" the past few months have been informed by our work at several clients towards the release of Datapusher+ 1.0 and qsv pro 1.0 (more info below) - both targeted for release this month.
DP+ is our third-gen, high-speed data ingestion/registration tool for CKAN that uses qsv as its data wrangling/analysis engine. It will enable us to reinvent the way data is ingested into CKAN - with exponentially faster data ingestion, metadata inferencing, data validation, computed metadata fields, and more!
We're particularly excited about how qsv will allow us to compute and infer high-quality metadata for datasets (with a focus on inferring optional recommended DCAT-US v3 metadata fields) in "near real-time", while dataset publishers are still entering metadata. This will be a game-changer for CKAN administrators and data publishers!
...qsv pro 1.0
qsv pro is datHere's enterprise-grade data wrangling/curation workbench that’s planned for v1.0 release this month.
Building the core functionality of qsv pro's Workflow feature is one of the primary reasons for a v1.0 release.
We feel qsv pro may be a game-changer for data wranglers and data curators who need to work with spreadsheets and large datasets to view statistical data and metadata while also performing complex data wrangling operations in a user-friendly way without having to write code.

Added

docs: added Shell Completion section 556a2ff
docs: add 🪄 emoji in legend to indicate "automagical" commands 2753c90
Add building deb package (WIP) by @tino097 in #2029
Added GitHub workflow to test debian package (WIP) by @tino097 in #2032
tests: added false positive to _typos.toml configuration d576af2
added more benchmarks
added more tests

Changed

fetch & fetchpost: remove expired diskcache entries on startup 9b6ab5d
frequency: smarter frequency compilation with new --stats-mode option #2030
json: refactored for maintainability & performance 62e9216 and 4e44b18
improved self-update messages 5c874e0 and 0aa0b13
contrib(completions): frequency updates & remove bashly/fish by @rzmk in #2031
Debian package update by @tino097 in #2017
publish: optimized enabled CPU features when building release binaries in all GitHub Actions "publishing" workflows
publish: ensure latest Python patch release is used when building qsvpy binary variants 2ab03a0 and ec6f486
tests: also enabled CPU features in CI tests
docs: wordsmith qsv "elevator pitch" cc47fe6
docs: point to https://100.dathere.com in Whirlwind tour fc49aef
deps: bump polars to latest upstream post py-1.41.1 release at the time of this release
build(deps): bump bytes from 1.6.1 to 1.7.0 by @dependabot in #2018
build(deps): bump bytes from 1.7.0 to 1.7.1 by @dependabot in #2021
build(deps): bump flate2 from 1.0.30 to 1.0.31 by @dependabot in #2027
build(deps): bump indexmap from 2.2.6 to 2.3.0 by @dependabot in #2020
build(deps): bump jaq-parse from 1.0.2 to 1.0.3 by @dependabot in #2016
build(deps): bump redis from 0.26.0 to 0.26.1 by @dependabot in #2023
build(deps): bump regex from 1.10.5 to 1.10.6 by @dependabot in #2025
build(deps): bump serde_json from 1.0.121 to 1.0.122 by @dependabot in #2022
build(deps): bump sysinfo from 0.30.13 to 0.31.0 by @dependabot in #2019
build(deps): bump sysinfo from 0.31.0 to 0.31.2 by @dependabot in #2024
build(deps): bump tempfile from 3.11.0 to 3.12.0 by @dependabot in #2033
build(deps): bump serde from 1.0.204 to 1.0.205 by @dependabot in #2036
apply select clippy suggestions
updated several indirect dependencies
made various usage text improvements
bumped MSRV to 1.80.1

Fixed

sqlp & joinp: fixed .ssv.sz output auto-compression support 5397f6c & d86ba63
docs: fix link by @uncenter in #2026
tests: correct misnamed test 8ae6000
tests: fix flaky reverse property tests d86ba63

Removed

docs: "Quicksilver" is the name of the logo horse, not how you pronounce "qsv" e4551ae

New Contributors

@uncenter made their first contribution in #2026

Full Changelog: 0.130.0...0.131.0

Contributors

tino097, dependabot, and 2 other contributors

29 Jul 19:07

jqnatividad

0.130.0

1d4b2bd

Following 0.129.0, the largest release to date, 0.130.0 continues to polish qsv as a data-wrangling engine with new features, fixes, and improvements, and previews upcoming features in qsv pro 1.0. Here are a few highlights:

Highlights

Added automatic .ssv (semicolon separated values) support. Though not as common as CSV, SSV is used in some regions and industries, so qsv now detects and handles it automatically.
Added cargo deb compatibility. In preparation for the release of DataPusher+ 1.0, we're making qsvdp easier to install and upgrade: CKAN administrators can use apt-get install qsvdp or apt-get upgrade qsvdp.
DP+ is our next-gen, high-speed data ingestion tool for CKAN that uses qsv as its analysis engine. It's not only a robust, fast, validating data pump that guarantees high quality data, it also does extended analysis to infer and automatically derive high-quality metadata - what we call "automagical metadata".
Upgraded to the latest Polars upstream at the py-polars-1.3.0 tag. Polars tops the TPC-H Benchmark and is several orders of magnitude faster than traditional dataframe libraries (cough - 🐼 pandas). qsv proudly rides the 🐻‍❄️ Polars bear to get subsecond response times even with very large datasets!
qsv v0.130.0 shell completions files are available for download here. With shell completions, pressing tab in a compatible shell provides suggestions for various qsv commands, subcommands, and options that you can choose from. Supported shells include bash, zsh, powershell, fish, nushell, fig, and elvish. View tips on how to install completions for the bash shell here.
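
To make the .ssv support in the highlights above concrete, here is a minimal sketch of choosing a delimiter from a file extension. It assumes detection is driven by the extension (as the TSV/TAB fix in this release suggests) and that a trailing .sz suffix marks a compressed file to look past. The mapping table and the `delimiter_for` helper are hypothetical, not qsv's code.

```python
from pathlib import Path

# Assumed extension -> delimiter mapping, based on the formats named in these notes.
DELIMITERS = {".csv": ",", ".tsv": "\t", ".tab": "\t", ".ssv": ";"}


def delimiter_for(path, default=","):
    """Pick a delimiter from the file extension, case-insensitively."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Look past a trailing compression suffix such as ".sz".
    if suffixes and suffixes[-1] == ".sz":
        suffixes = suffixes[:-1]
    if not suffixes:
        return default
    return DELIMITERS.get(suffixes[-1], default)
```

For example, `delimiter_for("export.ssv")` and `delimiter_for("export.ssv.sz")` both yield a semicolon under these assumptions, while unknown extensions fall back to a comma.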

Added

apply: add base62 encode/decode operations #2013
headers: add --just-count option #2004
json: add --select option #1990
searchset: add --not-one flag by @rzmk in #1994
Added .ssv (semicolon separated values) automatic support #1987
Added cargo deb compatibility by @tino097 in #1991
contrib(completions): add --just-count for headers by @rzmk in #2006
contrib(completions): add --select for json by @rzmk in #1992
added several benchmarks
added more tests
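
As background for the new base62 operations in apply (#2013), the sketch below shows one common way to base62 encode and decode non-negative integers. It is an illustration only: the alphabet ordering (digits, then uppercase, then lowercase) is an assumption, and qsv's own operations may use a different ordering or work on different input types.

```python
# Assumed alphabet ordering; other base62 variants put lowercase first.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62_encode(n):
    """Encode a non-negative integer as a base62 string."""
    if n < 0:
        raise ValueError("base62_encode expects a non-negative integer")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    # Remainders come out least-significant first, so reverse them.
    return "".join(reversed(digits))


def base62_decode(s):
    """Decode a base62 string back to an integer (ValueError on bad characters)."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Base62 uses only alphanumeric characters, which makes the encoded values convenient for compact, URL-safe identifiers.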

Changed

diff: allow selection of --key and --sort-columns by name, not just by index #2010
fetch & fetchpost: replace deprecated Redis execute command 75cbe2b
stats: more intelligent --infer-len option c6a0e64
validate: return delimiter detected upon successful CSV validation #1977
bump polars to latest upstream at py-polars-1.3.0 tag #2009
deps: bump csvs_convert from 0.8.12 to 0.8.13 d1d0800
build(deps): bump cached from 0.52.0 to 0.53.0 by @dependabot in #1983
build(deps): bump cached from 0.53.0 to 0.53.1 by @dependabot in #1986
build(deps): bump postgres from 0.19.7 to 0.19.8 by @dependabot in #1985
build(deps): bump pyo3 from 0.22.1 to 0.22.2 by @dependabot in #1979
build(deps): bump redis from 0.25.4 to 0.26.0 by @dependabot in #1995
build(deps): bump serde_json from 1.0.120 to 1.0.121 by @dependabot in #2011
build(deps): bump simple-expand-tilde from 0.1.7 to 0.4.0 by @dependabot in #1984
build(deps): bump tokio from 1.38.0 to 1.38.1 by @dependabot in #1973
build(deps): bump tokio from 1.38.1 to 1.39.1 by @dependabot in #1988
build(deps): bump xxhash-rust from 0.8.11 to 0.8.12 by @dependabot in #1997
apply select clippy suggestions
updated several indirect dependencies
made various usage text improvements
pin Rust nightly to 2024-07-26

Fixed

diff: clarify --key usage examples, resolves #1998 by @rzmk in #2001
json: refactored so it no longer spawns qsv select in a thread to order the columns, as intermediate output was sometimes sent to stdout before the final output was ready 0f25def
py: replace row with col in usage text by @allen-chin in #2008
reverse: fix indexed bug #2007
validate: properly auto-detect tab delimiter when file extension is TSV or TAB #1975
fix panic when process_input helper fn receives unexpected input from stdin 152fec4

Removed

docs: remove *nix only message for foreach by @rzmk in #1972

New Contributors

@tino097 made their first contribution in #1991
@allen-chin made their first contribution in #2008

Full Changelog: 0.129.1...0.130.0

To stay updated with datHere's latest news and updates (including qsv pro, datHere's CKAN DMS, and analyze.dathere.com), subscribe to the newsletter here: dathere.com/newsletter

Contributors

tino097, allen-chin, and 2 other contributors
