Releases: jqnatividad/qsv
0.120.0
Happy New Year!
Here's the first release of 2024, the biggest ever with 280+ commits! qsv 0.120.0 continues to focus on performance, stability and reliability as we continue setting the stage for qsv's big brother - qsv pro.
Apart from wrapping qsv with a User Interface, qsv pro also comes with a retinue of related cloud-based data cleaning, enrichment and enhancement services along with expanded metadata inferencing to make your Data Useful, Usable and Used!
qsv pro draws inspiration from OpenRefine, but reimagined without its file size and speed limitations, with qsv pro having the ability to process multi-gigabyte files in seconds.
It incorporates hard lessons we learned in the past 12 years deploying Data Portals and Data Pipelines to create a new Data/Metadata Wrangling and AI-assisted Data Publishing service that's easy to use for casual Excel users and Data Publishers, yet powerful enough for data scientists and data engineers.
But it's not quite ready for release yet, so stay tuned!
We're now taking signups for a preview release however, so if you're interested, please sign up here!
Excitingly, qsv was also mentioned on Hacker News in this thread on Dec 23, 2023! As a result, we're now at almost 2,000 stars on GitHub, up from 900 stars on Dec 22!
Stay tuned for more advancements in 2024 - it's set to be a landmark year for qsv!
Added
- cat: add rowskey --group options; increased perf of rowskey #1508
- validate: add --trim and --quiet options #1452
- apply & applydp: operations regex_replace now supports empty --replacement with the "<NULL>" special value #1470 and #1471
- exclude: also consider rows with empty fields #1498
- extsort: add --tmp-dir option ca1f461
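To illustrate what the "<NULL>" special value is for, here is a minimal Python sketch. It is not qsv's Rust implementation: the function name `regex_replace_value` and the `NULL_SENTINEL` constant are hypothetical, and the sketch assumes that "<NULL>" simply stands in for an empty replacement string, which is awkward to pass on a command line.

```python
import re

# Hypothetical sentinel mirroring the "<NULL>" special value described above.
NULL_SENTINEL = "<NULL>"


def regex_replace_value(value: str, pattern: str, replacement: str) -> str:
    """Replace every match of `pattern` in `value`.

    Assumption: a replacement of "<NULL>" means "replace with nothing".
    """
    if replacement == NULL_SENTINEL:
        replacement = ""
    return re.sub(pattern, replacement, value)


if __name__ == "__main__":
    # Strip every non-digit character from a phone number.
    print(regex_replace_value("(555) 123-4567", r"\D", "<NULL>"))
```

With a sentinel like this, deleting matches and substituting text go through the same code path, so no separate "delete" operation is needed.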
Changed
- validate: Faster RFC4180 validation with byterecords and SIMD-accelerated utf8 validation #1440
- excel: minor performance tweaks #1446
- apply, applydp, explode, geocode, pseudo: consolidate redundant code and use one replace_column_value helper fn in util.rs #1456
- excel: bump calamine from 0.22 to 0.23 #1473
- excel & joinp: use atoi_simd for faster &[u8] to int conversion 9521f3e
- cat, describegpt, headers, sqlp, to, tojsonl: refactor commands that accept multiple input files to use improved process_input helper #1496
- fetch & fetchpost: get_response refactor for maintainability and performance #1507
- luau: replaced --no-colindex option with --colindex option. --colindex slows down processing and is not often used, so make it an option, not the default. a0c8568
- make thousands crate optional with apply feature in #1453
- build(deps): bump uuid from 1.6.0 to 1.6.1 by @dependabot in #1430
- build(deps): bump serde from 1.0.192 to 1.0.193 by @dependabot in #1432
- build(deps): bump data-encoding from 2.4.0 to 2.5.0 by @dependabot in #1435
- build(deps): bump mlua from 0.9.1 to 0.9.2 by @dependabot in #1436
- build(deps): bump url from 2.4.1 to 2.5.0 by @dependabot in #1437
- build(deps): bump jql-runner from 7.0.6 to 7.0.7 by @dependabot in #1439
- build(deps): bump jql-runner from 7.0.7 to 7.1.0 by @dependabot in #1447
- build(deps): bump jql-runner from 7.1.0 to 7.1.1 by @dependabot in #1457
- build(deps): bump jql-runner from 7.1.1 to 7.1.2 by @dependabot in #1486
- build(deps): bump hashbrown from 0.14.2 to 0.14.3 by @dependabot in #1441
- build(deps): bump redis from 0.23.3 to 0.23.4 by @dependabot in #1442
- build(deps): bump redis from 0.23.3 to 0.24.0 by @dependabot in #1455
- build(deps): bump atoi_simd from 0.15.3 to 0.15.4 by @dependabot in #1444
- build(deps): bump atoi_simd from 0.15.4 to 0.15.5 by @dependabot in #1445
- build(deps): bump atoi_simd from 0.15.5 to 0.15.6 by @dependabot in #1512
- build(deps): bump actions/setup-python from 4.7.1 to 4.8.0 by @dependabot in #1454
- build(deps): bump actions/setup-python from 4.8.0 to 5.0.0 by @dependabot in #1459
- build(deps): bump actions/stale from 8 to 9 by @dependabot in #1463
- build(deps): bump itoa from 1.0.9 to 1.0.10 by @dependabot in #1464
- build(deps): bump tokio from 1.34.0 to 1.35.0 by @dependabot in #1465
- build(deps): bump tokio from 1.35.0 to 1.35.1 by @dependabot in #1483
- build(deps): bump ryu from 1.0.15 to 1.0.16 by @dependabot in #1466
- build(deps): bump file-format from 0.22.0 to 0.23.0 by @dependabot in #1468
- build(deps): bump github/codeql-action from 2 to 3 by @dependabot in #1476
- build(deps): bump geosuggest-utils from 0.5.1 to 0.5.2 by @dependabot in #1479
- build(deps): bump geosuggest-core from 0.5.1 to 0.5.2 by @dependabot in #1478
- build(deps): bump reqwest from 0.11.22 to 0.11.23 by @dependabot in #1480
- build(deps): bump calamine from 0.23.0 to 0.23.1 by @dependabot in #1481
- build(deps): bump qsv-sniffer from 0.10.0 to 0.10.1 by @dependabot in #1484
- build(deps): bump anyhow from 1.0.75 to 1.0.76 by @dependabot in #1485
- build(deps): bump futures from 0.3.29 to 0.3.30 by @dependabot in #1492
- build(deps): bump futures-util from 0.3.29 to 0.3.30 by @dependabot in #1491
- build(deps): bump crossbeam-channel from 0.5.9 to 0.5.10 by @dependabot in #1490
- build(deps): bump sysinfo from 0.29.10 to 0.29.11 by @dependabot in #1443
- Bump sysinfo from 0.29.11 to 0.30 #1489
- build(deps): bump sysinfo from 0.30.0 to 0.30.1 by @dependabot in #1495
- build(deps): bump sysinfo from 0.30.1 to 0.30.2 by @dependabot in #1504
- build(deps): bump sysinfo from 0.30.2 to 0.30.3 by @dependabot in #1509
- build(deps): bump tabwriter from 1.3.0 to 1.4.0 by @dependabot in #1500
- build(deps): bump tempfile from 3.8.1 to 3.9.0 by @dependabot in #1502
- build(deps): bump qsv_docopt from 1.4.0 to 1.5.0 by @dependabot in #1503
- build(deps): bump ahash from 0.8.6 to 0.8.7 by @dependabot in #1510
- build(deps): bump serde_json from 1.0.108 to 1.0.109 by @dependabot in #1511
- apply select clippy suggestions
- update several indirect dependencies
- pin Rust nightly to 2023-12-23
Fixed
- apply: Fix for dynfmt and calcconv subcommands not working in release mode #1467
- luau: fix check for excess mapped columns earlier. Otherwise, we'll get a CSV different field count error db15811
Removed
- luau: remove unneeded --jit option as we precompile luau scripts to bytecode #1438
Full Changelog: 0.119.0...0.120.0
0.119.0
Highlights:
As we prepare for version 1.0, we're focusing on performance, stability and reliability as we set the stage for qsv pro - a cloud-backed UI version of qsv powered by Tauri, set to be released in 2024. Stay tuned!
- diff is now out of beta and blazingly fast! Give "the fastest CSV-diff in the world" a try!
- joinp now supports snappy automatic compression/decompression!
- sqlp & joinp now recognize the QSV_COMMENT_CHAR environment variable, allowing you to skip comment lines in your input CSV files. They're also faster with the upgrade to Polars 0.35.4.
- sqlp now supports subqueries, table aliases, and more!
- luau: upgraded embedded Luau from 0.599 to 0.604; refactored code to reduce unneeded allocations and increase performance (more than doubling it!) as we prepare for extended recipe support.
- cat is now even faster with the --flexible option. If you know your CSV files are valid, you can use this option to skip CSV validation and make cat run twice as fast!
- qsv can now add a Byte Order Mark (BOM) header sequence to produce Excel-friendly CSVs with the QSV_OUTPUT_BOM environment variable.
- stats, sort, schema & validate are now faster with the use of atoi_simd to directly convert &[u8] to integer, skipping unnecessary utf8 validation, while also using SIMD CPU instructions for noticeably faster performance.
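The idea behind converting bytes straight to an integer can be sketched in a few lines of Python. This is a scalar illustration only: it does not use SIMD, it is not the atoi_simd API, and the function name `parse_uint_ascii` is hypothetical. It shows why no UTF-8 decoding step is needed: a decimal integer is made of ASCII digit bytes, so each byte can be checked and accumulated directly.

```python
from typing import Optional


def parse_uint_ascii(field: bytes) -> Optional[int]:
    """Parse an unsigned decimal integer directly from raw bytes.

    Returns None when the field is empty or contains a non-digit byte,
    so the caller never has to decode the field to a string first.
    """
    if not field:
        return None
    total = 0
    for byte in field:
        digit = byte - 0x30  # 0x30 is the ASCII code for "0"
        if digit < 0 or digit > 9:
            return None
        total = total * 10 + digit
    return total


if __name__ == "__main__":
    print(parse_uint_ascii(b"12345"))
    print(parse_uint_ascii(b"12a45"))
```

A SIMD implementation applies the same digit check and accumulation to many bytes per instruction instead of one byte per loop iteration.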
Added
- diff: added option/flag for headers in output by @janriemer in #1395
- diff: added option/flag --delimiter-output by @janriemer in #1402
- cat: added --flexible option to make cat rows faster still #1408
- sqlp & joinp: both commands now recognize QSV_COMMENT_CHAR env var #1412
- joinp: added snappy compression/decompression support #1413
- geocode: now automatically decompresses snappy-compressed index files #1429
- Add Byte Order Mark (BOM) output support #1424
- Added Codacy code quality badge 9959129
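For readers unfamiliar with the Byte Order Mark support listed above, the following Python sketch shows what a BOM-prefixed CSV looks like at the byte level. It assumes the BOM in question is the three-byte UTF-8 BOM (EF BB BF), which Excel uses as a hint that a file is UTF-8 encoded. The helper name `csv_bytes_with_bom` is hypothetical and the sketch says nothing about how qsv writes the BOM internally.

```python
import csv
import io

# The UTF-8 encoding of U+FEFF, commonly called the UTF-8 BOM.
UTF8_BOM = b"\xef\xbb\xbf"


def csv_bytes_with_bom(rows):
    """Serialize rows to UTF-8 CSV bytes, prefixed with a Byte Order Mark."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    return UTF8_BOM + buf.getvalue().encode("utf-8")


if __name__ == "__main__":
    data = csv_bytes_with_bom([["city", "country"], ["São Paulo", "Brasil"]])
    # The first three bytes are the BOM; the CSV text follows unchanged.
    print(data[:3] == UTF8_BOM)
```

Tools that do not expect a BOM may treat those three bytes as part of the first header name, which is a common reason to keep BOM output opt-in.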
Changed
- stats, sort, schema & validate: use atoi_simd to directly convert &[u8] to integer, skipping unnecessary utf8 validation, while also using SIMD instructions for noticeably faster performance
- cat: faster cat rows #1407
- count: optimize --width option #1411
- luau: upgrade embedded Luau from 0.603 to 0.604 #1426
- use atoi_simd for fast &[u8] to int conversion #1423
- luau: performance refactor 4cebd7c
- build(deps): bump csv-diff from 0.1.0-beta.4 to 0.1.0 by @dependabot in #1394
- build(deps): bump serde_json from 1.0.107 to 1.0.108 by @dependabot in #1393
- build(deps): bump indexmap from 2.0.2 to 2.1.0 by @dependabot in #1397
- build(deps): bump jql-runner from 7.0.4 to 7.0.5 by @dependabot in #1399
- build(deps): bump jql-runner from 7.0.5 to 7.0.6 by @dependabot in #1400
- build(deps): bump file-format from 0.21.0 to 0.22.0 by @dependabot in #1401
- build(deps): bump cached from 0.46.0 to 0.46.1 by @dependabot in #1403
- build(deps): bump serde from 1.0.190 to 1.0.192 by @dependabot in #1404
- build(deps): bump tokio from 1.33.0 to 1.34.0 by @dependabot in #1409
- build(deps): bump flexi_logger from 0.27.2 to 0.27.3 by @dependabot in #1410
- build(deps): bump qsv-stats from 0.11.0 to 0.12.0 by @dependabot in #1415
- build(deps): bump itertools from 0.11.0 to 0.12.0 by @dependabot in #1418
- build(deps): bump rust_decimal from 1.33.0 to 1.33.1 by @dependabot in #1420
- build(deps): bump polars from 0.35.2 to 0.35.4 by @dependabot in #1425
- build(deps): bump uuid from 1.5.0 to 1.6.0 by @dependabot in #1428
- bump MSRV to 1.74.0
- apply select clippy suggestions
- update several indirect dependencies
- pin Rust nightly to 2023-11-18
Fixed
- pseudo: detect when more than one column is selected for pseudonymization 0b09372
- dotenv (.env) tweaks/fixes #1427
- fix several typos 723443e
- fix several markdown lints
Removed
- remove fast-float as std float parse is now also using Eisel-Lemire algorithm #1414
Full Changelog: 0.118.0...0.119.0
NOTE:
To verify prebuilt binary zip archives - click here.
0.118.0
Highlights:
- With the Polars upgrade to 0.34.2, the sqlp and joinp commands enjoy expanded capabilities and a noticeable performance boost.
- We now publish the 500, 1000, 5000 and 15000 Geonames cities indices for the geocode command, with users able to easily switch indices with the index-load subcommand. As the name implies, the 500 index contains cities with populations of 500 or more, the 1000 index contains cities with populations of 1000 or more, and so on.
  The 15000 index (default) is the smallest (13 MB) and fastest with ~26k cities. The 500 index is the largest (56 MB) and slowest, with ~200k cities. The 5000 index is 21 MB with ~53k cities. The 1000 index is 44 MB with ~140k cities.
- The geocode command now returns US Census FIPS codes for US places with the %json and %pretty-json formats, returning both US State and US County FIPS codes, with upcoming support for Cities and other US Census geographies (School Districts, Voting Districts, Congressional Districts, etc.)
- Improved performance for stats, schema and tojsonl commands with the stats cache bincode refactor. This is especially noticeable for large CSV files as stats previously created large bincode cache files by default.
  The bincode cache allows other commands (currently, only schema and tojsonl) to skip recomputing statistics and deserialize the saved stats data structures directly into memory. Now, it will only create a bincode file if the --stats-binout option is specified (typically, before using the schema and tojsonl commands). stats will still continue to create a stats CSV cache file by default, but it will be much smaller than the bincode file, and is universally applicable, unlike the bincode cache.
- self-update will now verify updates. This is done by verifying the zipsign signature of the release zip archive before applying it. This should make it harder for malicious actors to compromise the self-update process. Version 0.118.0 has the verification code, and future releases will use this new verification process.
  Regardless, we will zipsign all zip archives starting with this release.
  Users can manually verify the signatures by downloading the zipsign public key and running the zipsign command line tool. See Verifying the Integrity of the Prebuilt Binaries Zip Archive for more info.
- The frequency command now supports the --ignore-case option for case-insensitive frequency counts.
- The schema command can now compile case-insensitive enum constraints.
- Improved performance for apply and applydp commands with faster compile-time perfect hash functions for operations lookups.
- Several minor performance improvements and bug fixes with snappy, sniff & cat commands.
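As a rough illustration of what case-insensitive frequency counts mean, here is a small Python sketch. The `frequency` function and its `ignore_case` parameter are hypothetical names chosen to echo the option above, and lowercasing is only one possible way to fold case; qsv's actual normalization may differ.

```python
from collections import Counter


def frequency(values, ignore_case=False):
    """Count occurrences of each value, optionally folding case first."""
    if ignore_case:
        # Assumption: folding case by lowercasing each value.
        values = (value.lower() for value in values)
    return Counter(values)


if __name__ == "__main__":
    states = ["NY", "ny", "Ny", "CA"]
    print(frequency(states))                    # "NY", "ny" and "Ny" counted separately
    print(frequency(states, ignore_case=True))  # all three merged into one bucket
```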
Added
- frequency: added --ignore-case option #1386
- geocode: added 500, 1000, 5000, 15000 Geonames cities convenience shortcuts to index subcommands bd9f4c3
- schema: added --ignore-case option when compiling enum constraints; replaced Hashset with faster AHashset a16a1ca
- snappy: added buf_size parm to compress helper fn e0c0d1f
- sniff: added --just-mime option #1372
- added zipsign signature verification to self-update #1389
Changed
- apply & applydp: replaced binary_search with faster compile-time perfect hash functions for operations lookups #1371
- stats, schema and tojsonl: stats cache bincode refactor #1377
- luau: replaced sanitise-file-name with more popular sanitize-filename crate 8927cb7
- cat: minor optimization by preallocating with capacity c13c341
- sqlp & joinp: expanded speed/functionality with upgrade to Polars 0.34.2 #1385
- tojsonl: improved boolean inferencing. Now correctly infers booleans, even if the enum domain range is more than 2, but has cardinality 2 case-insensitive 6345f2d
- build(deps): bump strum_macros from 0.25.2 to 0.25.3 by @dependabot in #1368
- build(deps): bump regex from 1.10.1 to 1.10.2 by @dependabot in #1369
- build(deps): bump uuid from 1.4.1 to 1.5.0 by @dependabot in #1373
- build(deps): bump hashbrown from 0.14.1 to 0.14.2 by @dependabot in #1376
- build(deps): bump self_update from 0.38.0 to 0.39.0 by @dependabot in #1378
- build(deps): bump ahash from 0.8.5 to 0.8.6 by @dependabot in #1383
- build(deps): bump serde from 1.0.189 to 1.0.190 by @dependabot in #1388
- build(deps): bump futures from 0.3.28 to 0.3.29 by @dependabot in #1390
- build(deps): bump futures-util from 0.3.28 to 0.3.29 by @dependabot in #1391
- build(deps): bump tempfile from 3.8.0 to 3.8.1 by @dependabot in 4f6200c
- apply select clippy suggestions
- update several indirect dependencies
- pin Rust nightly to 2023-10-26
Fixed
- dedup: fixed --ignore-case option not being honored during internal sort #1387
- applydp: fixed wrong usage text using apply and not applydp c47ba86
- geocode: fixed index-update not honoring --timeout parameter 3272a9e
- geocode: fixed index-load to work properly with convenience shortcuts 5097326
Full Changelog: 0.117.0...0.118.0
0.117.0
Highlights:
- geocode: added Federal Information Processing Standards (FIPS) codes to results for US places, so we can derive GEOIDs. This paves the way to doing data enrichment lookups (starting with the US Census) in an upcoming release.
- Added Goals/Non-goals, explicitly codifying what qsv is and isn't, and what we're trying to achieve with the toolkit.
- excel: CSV output processing is now multi-threaded, making it a bit faster. The bottleneck is still the Excel/ODS library we're using (calamine), which is single-threaded. But there are active discussions underway to make it much faster in the future.
- Upgrading the MSRV to 1.73.0 has allowed us to use LLVM 17, which has resulted in a small performance boost.
Added:
- geocode: added Federal Information Processing Standards (FIPS) codes to results for US places.
- Added Goals/Non-goals to README.md
Changed
- cat: minor optimization 343bb66
- excel: CSV output processing is now multi-threaded #1360
- geocode: more efficient dynfmt processing #1367
- frequency: optimize allocations before hot loop 655bebc
- luau: upgraded embedded Luau from 0.596 to 0.599
- deps: bump calamine from 0.22.0 to 0.22.1 4c4ed7e
- docs: reorganized README, moving FEATURES and INTERPRETERS to their own markdown files.
- build(deps): bump byteorder from 1.4.3 to 1.5.0 by @dependabot in #1347
- build(deps): bump tokio from 1.32.0 to 1.33.0 by @dependabot in #1354
- build(deps): bump regex from 1.9.6 to 1.10.0 by @dependabot in #1356
- build(deps): bump semver from 1.0.19 to 1.0.20 by @dependabot in #1358
- build(deps): bump pyo3 from 0.19.2 to 0.20.0 by @dependabot in #1359
- build(deps): bump serde from 1.0.188 to 1.0.189 by @dependabot in #1361
- build(deps): bump flate2 from 1.0.27 to 1.0.28 by @dependabot in #1363
- build(deps): bump regex from 1.10.0 to 1.10.1 by @dependabot in #1366
- deps: update several indirect dependencies
- pin Rust nightly to 2023-10-14
- bump MSRV to 1.73.0
Removed
- excel: removed --progressbar option as Excel/ODS maximum sheet size is just too small (1,048,576 rows) to make it useful.
Fixed
Full Changelog: 0.116.0...0.117.0
0.116.0
Highlights:
- Benchmark refinements galore with more benchmarks and more comprehensive benchmarking instructions.
- geocode: The Geonames index's configuration metadata is now available with the geocode index-check subcommand. No need to maintain a separate metadata JSON file. This should make it even easier to maintain multiple Geonames index files with different configurations without having to worry if you're looking at the right metadata JSON file.
- cat: rowskey subcommand is now 27% faster
- tojsonl: parallelized with rayon, making it 33% faster!
- smaller qsv binary size and faster compile times if the to_parquet feature is disabled. If sqlp's ability to create a parquet file from a SQL query is good enough for you, qsv's binary size and compile time will be markedly smaller/faster.
- minor perf tweaks & optimizations - count and luau commands
Added
- geocode: added Geonames index file metadata to index-check subcommand
- tojsonl: parallelized with rayon #1338
- to: added to_parquet feature. #1341
- benchmarks: upgraded from 3.0.0 to 3.3.1
  - you can now specify a separate benchmarking binary as we dogfood qsv for the benchmarks and some features are required that may not be in the qsv binary variant being benchmarked
  - added additional count benchmarks with --width option
  - added additional luau benchmarks with single/multi filter options
  - added additional search benchmark with --unicode option
  - show absolute path of qsv binaries used (both the one we're dogfooding and the one being benchmarked) and their version info before running the benchmarks proper
  - ensured schema benchmark was not using the stats cache with the --force option
Changed
- cat: use an empty byte_record var instead of repeatedly allocating a new one in a hot loop eddafd1
- count: minor optimization bb113c0
- luau: minor perf tweaks c71cd16 and f9c1e3c
- (deps): bump Geosuggest from 0.4.5 to 0.5.1 #1333
- (deps): use patched version of calamine which has unreleased fixes since 0.22.0
- build(deps): bump flexi_logger from 0.27.0 to 0.27.2 by @dependabot in #1328
- build(deps): bump indexmap from 2.0.0 to 2.0.1 by @dependabot in #1329
- build(deps): bump hashbrown from 0.14.0 to 0.14.1 by @dependabot in #1334
- build(deps): bump file-format from 0.20.0 to 0.21.0 by @dependabot in #1335
- build(deps): bump indexmap from 2.0.1 to 2.0.2 by @dependabot in #1336
- build(deps): bump regex from 1.9.5 to 1.9.6 by @dependabot in #1337
- build(deps): bump jql-runner from 7.0.3 to 7.0.4 by @dependabot in #1340
- build(deps): bump csvs_convert from 0.8.7 to 0.8.8 by @dependabot in #1339
- build(deps): bump actions/setup-python from 4.7.0 to 4.7.1 by @dependabot in #1342
- build(deps): bump reqwest from 0.11.21 to 0.11.22 by @dependabot in #1343
- build(deps): bump csv from 1.2.2 to 1.3.0 by @dependabot in #1344
- build(deps): bump actix-governor from 0.4.1 to 0.5.0 by @dependabot in #1346
- applied select clippy suggestions
- update several indirect dependencies
- pin Rust nightly to 2023-10-04
Removed
- geocode: removed separate metadata JSON file for Geonames index files. The metadata is now embedded in the index file itself and can be viewed with the index-check command.
- removed redundant setting from profile.release-samply in Cargo.toml 2a35be5
Fixed
- geocode: when producing JSON output with the now subcommands (suggestnow, reversenow, countryinfonow), we now produce valid JSON. We previously generated JSON with escaped/extra quotes as it was formatted to be included in CSV files, which is required for the suggest, reverse and countryinfo subcommands as they are designed to process CSVs with multiple rows, thus requiring escaped JSON. The now commands are only meant for one result, so there's no need to escape the quotes in the JSON. #1345
- schema: fixed --force flag not being honored
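The difference between standalone JSON and JSON embedded in a CSV cell, which is at the heart of the geocode fix above, can be seen with Python's standard library. This is a generic sketch of CSV quoting rules, not qsv code; the helper name `json_as_csv_field` and the sample record are made up for illustration.

```python
import csv
import io
import json


def json_as_csv_field(obj) -> str:
    """Return the text of a one-column CSV row holding `obj` as JSON."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerow([json.dumps(obj)])
    # Drop the row terminator so only the quoted field remains.
    return buf.getvalue().rstrip("\r\n")


if __name__ == "__main__":
    place = {"name": "Brooklyn", "admin1": "New York"}
    print(json.dumps(place))         # standalone JSON, parseable as-is
    print(json_as_csv_field(place))  # same JSON with quotes doubled for CSV
```

The second form is correct inside a CSV file, because a CSV reader undoes the doubled quotes, but it is not valid JSON when printed on its own.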
Full Changelog: 0.115.0...0.116.0
0.115.0
We continue to refine the benchmark suite, and have added a new setup argument to set up and install the required tools for the benchmark suite. We've also added more comprehensive checks to ensure that the required tools are installed before running the benchmarks.
For geocode, we've added a JSON file describing the Geonames index file configuration. This should help users maintain several Geonames index files with different configurations.
geocode should also be a tad faster now, thanks to the cached crate making ahash its default hashing algorithm and upgrading hashbrown - microbenchmarks show a 33% performance improvement.
We also added a release-samply profile so we can make it easier to squeeze more performance out of the toolkit with samply.
Added
- geocode: added a JSON file describing the Geonames index file configuration in #1324
- benchmarks: v3.0.0 release
  - added setup argument to set up and install required tools for the benchmark suite
  - added more comprehensive required tools check
  - added more realistic luau benchmarks, using helper luau scripts (dt_format.luau and turnaround_time.luau)
  - added stats with_cache and create_cache benchmarks
  - added benchmark_aggregations.luau script for benchmark analysis
  - added binary, total_mean and qsv_env columns to benchmark results
    - binary is the qsv binary variant used
    - total_mean is the sum of all the mean run times of the benchmarks
    - qsv_env are the qsv-relevant environment variables active while running the benchmarks
  - expanded README.md and benchmark suite usage instructions
- added release-samply profile to Cargo.toml to facilitate continued performance optimization with samply
Changed
- readme: move tab completion instructions/script to scripts/misc
- geocode: updated bundled Geonames index to 2021-09-25
- bump embedded luau from 0.594 to 0.596
- build(deps): bump flexi_logger from 0.26.1 to 0.27.0 by @dependabot in #1317
- build(deps): bump indicatif from 0.17.6 to 0.17.7 by @dependabot in #1318
- build(deps): bump semver from 1.0.18 to 1.0.19 by @dependabot in #1320
- build(deps): bump cached from 0.45.1 to 0.46.0 by @dependabot in #1322
- build(deps): bump geosuggest-core from 0.4.3 to 0.4.5 by @dependabot in #1323
- build(deps): bump geosuggest-utils from 0.4.3 to 0.4.5 by @dependabot in #1321
- build(deps): bump fastrand from 2.0.0 to 2.0.1 by @dependabot in #1325
- bump MSRV from Rust 1.72.0 to 1.72.1
- cargo update bump several indirect dependencies
- pin Rust nightly to 2023-09-25
Fixed
- benchmarks: fixed invalid luau benchmark that had invalid luau command
Full Changelog: 0.114.0...0.115.0
0.114.0
The long-overdue Benchmarks revamp is finally here! - https://qsv.dathere.com/benchmarks
The benchmarks have been completely rewritten to be more reproducible, and now use hyperfine instead of time. The new benchmarks are now run as part of the release process, and the results are compiled into a single page that is published on the new Quicksilver website.
The new benchmarks are also more comprehensive, and designed to be run on a variety of hardware and operating systems. This allows users to adapt the benchmarks to their own workloads and environments.
Other release highlights include:
- geocode is now fully-featured and ready for production use! Though it only currently features Geonames city-level lookup support, it provides a solid foundation on top of which we'll add more geocoding providers in the future (next up - OpenCage support with street-level geocoding).
- Polars has been bumped from 0.32.1 to 0.33.2, which includes a number of performance improvements for the sqlp and joinp commands.
- major performance increase on several regex/aho-corasick powered commands on Apple Silicon thanks to various under-the-hood improvements in the aho-corasick crate.
Big thanks to @rzmk, @a5dur, @minhajuddin2510 and @samibaig for helping me finally push out the revamped Benchmarks!
Added
- Added autoindex size threshold, replacing QSV_AUTOINDEX env var with QSV_AUTOINDEX_SIZE. Resolves #1300. in #1301 69e25ac
- diff: Added test for different delimiters by @janriemer in #1297
- benchmarks: Added qsv benchmark notebook. by @a5dur in #1309
- geocode: Added countryinfo/now subcommand made available in geosuggest 0.4.3 #1311
- geocode: Added --language option so users can specify the language of the geocoding results. This requires running the index-update subcommand with the --languages option to rebuild the index with the desired languages.
- sqlp: add example of using columns with embedded spaces in SQL queries f7bf4f6
Changed
- benchmarks: Benchmarks revamped #1298, #1310 d8eeb94
- build(deps): bump serde_json from 1.0.106 to 1.0.107 by @dependabot in #1302
- build(deps): bump mimalloc from 0.1.38 to 0.1.39 by @dependabot in #1303
- build(deps): bump simple-home-dir from 0.1.4 to 0.2.0 by @dependabot in #1304
- build(deps): bump chrono from 0.4.30 to 0.4.31 by @dependabot in #1305
- (deps): bump Polars from 0.32.1 to Polars 0.33.2 #1308
- build(deps): bump cpc from 1.9.2 to 1.9.3 by @dependabot in #1313
- build(deps): bump rayon from 1.7.0 to 1.8.0 by @dependabot in #1315
- (deps): update several indirect dependencies
- pin Rust nightly to 2023-09-21
Full Changelog: 0.113.0...0.114.0
0.113.0
This is the first "Unicorn" release, adding MAJOR new features to the toolkit!
- geocode: adds high-speed, cache-backed, multi-threaded geocoding using a local, updateable copy of the GeoNames database. This is a major improvement over the previous geocode subcommand in the apply command thanks to the wonderful geosuggest crate.
- guaranteed non-UTF8 input detection with the validate and input commands. Quicksilver REQUIRES UTF-8 encoded input. You can now use these commands to ensure you have valid UTF-8 input before using the rest of the toolkit.
- New/expanded whirlwind tour & quick-start notebooks by @a5dur and @rzmk
- Various performance improvements all-around:
  - overall increase of ~5% now that mimalloc, the default allocator for qsv, is built without secure mode unnecessarily enabled.
  - flatten command is now ~10% faster
  - faster regex performance thanks to various under-the-hood improvements in the regex crate
  - and the benchmark scripts have been updated by @minhajuddin2510 to use hyperfine instead of time, and to use the same input file for all benchmarks to make them more reproducible. In upcoming releases, we'll start compiling the benchmark results into a single page as part of the release process, so we can track our progress over time.
and last but not least - Quicksilver now has a website! - https://qsv.dathere.com/
And it's not just a static site with a few links - it's a full-blown web app that lets you try out qsv commands in your browser! It's not just a demo site - you can use it as a configurator and save your commands to a gist and share them with others!
It's the first Beta release of the Quicksilver website, so there's still a lot of work to do, but we're excited to share it with you and get your feedback!
We have more exciting features planned for Quicksilver and the website, but we require your help to make it happen! For qsv, use GitHub issues. For the website, use the feedback form. And if you want to help out, please check out the contributing guide.
Big thanks to @rzmk for all the work on the website! To @a5dur for all the QA work on this release! And to @minhajuddin2510 for revamping the benchmark script!
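Since the highlights above stress that valid UTF-8 input is required, here is a minimal Python sketch of what detecting non-UTF-8 input involves. It is a generic illustration rather than what the validate or input commands do internally, and the function name `first_invalid_utf8_offset` is hypothetical.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


if __name__ == "__main__":
    good = "café,1\n".encode("utf-8")
    bad = "café,1\n".encode("latin-1")  # a typical legacy-encoded export
    print(first_invalid_utf8_offset(good))
    print(first_invalid_utf8_offset(bad))
```

Reporting the offset, rather than a bare yes/no, makes it easier to find the offending record in a large file.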
Added
- geocode: new high-speed geocoding command #1231
  - major improvements using geosuggest upstream #1269
  - add suggest --country filter #1275
  - add --admin1 filter #1276
  - automatic --country inferencing from --admin1 code #1277
  - add --suggestnow and --reversenow subcommands #1280
  - add "%dyncols:" special formatter to dynamically add geocoded columns to the output CSV #1286
- excel: add SheetType (Worksheet, DialogSheet, MacroSheet, ChartSheet, VBA) in metadata mode; log.info! headers; wordsmith comments #1225
- excel: moar metadata! moar examples! #1271
- add support for ALL_PROXY env var #1233
- input: add --encoding-errors handling option #1235
- fixlengths: add --insert option #1247
- joinp: add --sql-filter option #1287
- luau: we now embed Luau 0.594 from 0.592
- notebooks: add qsv-colab-quickstart by @rzmk in #1253
- notebooks: Added Whirlwindtour.ipynb by @a5dur in #1223
Changed
- flatten: refactor for performance #1227
- validate: improved utf8 error messages #1256
- apply & applydp: improve usage text in relation to multi-column capabilities #1257
- qsv-cache now set to ~/.qsv-cache by default #1265
- Download file helper refactor #1267
- Benchmark Update by @minhajuddin2510 in #1237
- Improved error handling #1238
- Improved error handling - incorrect usage errors are now differentiated from other errors as well #1239
- build(deps): bump whatlang from 0.16.2 to 0.16.3 by @dependabot in #1221
- build(deps): bump serde_json from 1.0.104 to 1.0.105 by @dependabot in #1220
- build(deps): bump tokio from 1.31.0 to 1.32.0 by @dependabot in #1222
- build(deps): bump mlua from 0.9.0-rc.3 to 0.9.0 by @dependabot in #1224
- build(deps): bump tempfile from 3.7.1 to 3.8.0 by @dependabot in #1226
- build(deps): bump postgres from 0.19.5 to 0.19.6 by @dependabot in #1229
- build(deps): bump file-format from 0.18.0 to 0.19.0 by @dependabot in #1228
- build(deps): bump reqwest from 0.11.18 to 0.11.19 by @dependabot in #1232
- build(deps): bump rustls-webpki from 0.101.3 to 0.101.4 by @dependabot in #1236
- build(deps): bump reqwest from 0.11.19 to 0.11.20 by @dependabot in #1241
- build(deps): bump rust_decimal from 1.31.0 to 1.32.0 by @dependabot in #1242
- build(deps): bump serde from 1.0.185 to 1.0.186 by @dependabot in #1243
- build(deps): bump jql-runner from 7.0.2 to 7.0.3 by @dependabot in #1246
- build(deps): bump grex from 1.4.2 to 1.4.4 by @dependabot in #1245
- build(deps): bump mlua from 0.9.0 to 0.9.1 by @dependabot in #1244
- build(deps): bump mimalloc from 0.1.37 to 0.1.38 by @dependabot in #1249
- build(deps): bump postgres from 0.19.6 to 0.19.7 by @dependabot in #1251
- build(deps): bump serde from 1.0.186 to 1.0.187 by @dependabot in #1250
- build(deps): bump serde from 1.0.187 to 1.0.188 by @dependabot in #1252
- build(deps): bump regex from 1.9.3 to 1.9.4 by @dependabot in #1254
- build(deps): bump url from 2.4.0 to 2.4.1 by @dependabot in #1261
- build(deps): bump tabwriter from 1.2.1 to 1.3.0 by @dependabot in #1259
- build(deps): bump sysinfo from 0.29.8 to 0.29.9 by @dependabot in #1260
- build(deps): bump actix-web from 4.3.1 to 4.4.0 by @dependabot in #1262
- build(deps): bump chrono from 0.4.26 to 0.4.27 by @dependabot in #1264
- build(deps): bump chrono from 0.4.27 to 0.4.28 by @dependabot in #1266
- build(deps): bump redis from 0.23.2 to 0.23.3 by @dependabot in #1268
- build(deps): bump regex from 1.9.4 to 1.9.5 by @dependabot in #1272
- build(deps): bump flexi_logger from 0.25.6 to 0.26.0 by @dependabot in #1273
- build(deps): bump geosuggest-core from 0.4.0 to 0.4.2 by @dependabot in #1279
- build(deps): bump geosuggest-utils from 0.4.0 to 0.4.2 by @dependabot in #1278
- build(deps): bump cached from 0.44.0 to 0.45.0 by @dependabot in #1282
- build(deps): bump self_update from 0.37.0 to 0.38.0 by @dependabot in #1281
- build(deps): bump actions/checkout from 3 to 4 by @dependabot in #1283
- build(deps): bump chrono from 0.4.28 to 0.4.29 by @dependabot in #1284
- build(deps): bump cached from 0.45.0 to 0.45.1 by @dependabot in #1285
- build(deps): bump sysinfo from 0.29.9 to 0.29.10 by @dependabot in #1288
- build(deps): bump chrono from 0.4.29 to 0.4.30 by @dependabot in #1290
- build(deps): bump bytes from 1.4.0 to 1.5.0 by @dependabot in #1289
- build(deps): bump file-format from 0.19.0 to 0.20.0 by @dependabot in #1291
- cargo update bump several indirect dependencies
- apply select clippy suggestions
- pin Rust nightly to 2023-09-06
Removed
- apply: remove geocode subcmd now that we have a dedicated geocode command https://github.co...
0.112.0
This is the second in a series of "Giddy-up" 🏇🏽 releases, improving the performance of the following commands:
- stats: by refactoring the code to detect empty cells more efficiently, and by removing unnecessary bounds checks in the main compute loop. (~10% performance improvement)
- sample: by refactoring the code to use an index more effectively when available - not only making it faster, but also eliminating the need to load the entire dataset into memory. Also added a --faster option to use a faster random number generator. (~15% performance improvement)
- frequency, schema, search & validate: by amortizing/reducing allocations in hot loops
- excel: by refactoring the main hot loop to convert Excel cells more efficiently
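To illustrate what "amortizing allocations in hot loops" means in general, here is a minimal sketch using only the Rust standard library. It is not qsv's actual code: the function names are hypothetical, and the naive comma splitting stands in for a real CSV parser. The first function allocates a new String for every line, while the second reuses a single buffer whose capacity survives each `clear()`.

```rust
use std::io::{BufRead, Cursor};

/// Hypothetical helper: counts empty fields, allocating a fresh String per line.
fn count_empty_fields_naive<R: BufRead>(reader: R) -> usize {
    let mut empty = 0;
    for line in reader.lines() {
        // `lines()` hands back a newly allocated String on every iteration.
        let line = line.expect("input should be valid UTF-8");
        empty += line.split(',').filter(|f| f.is_empty()).count();
    }
    empty
}

/// Hypothetical helper: same result, but one buffer is reused for every line.
fn count_empty_fields_amortized<R: BufRead>(mut reader: R) -> usize {
    let mut empty = 0;
    let mut buf = String::new();
    loop {
        // `clear()` resets the length but keeps the allocated capacity,
        // so the buffer only grows when a longer line shows up.
        buf.clear();
        let bytes_read = reader
            .read_line(&mut buf)
            .expect("input should be valid UTF-8");
        if bytes_read == 0 {
            break; // end of input
        }
        // `read_line` keeps the line terminator, so strip it before splitting.
        let line = buf.trim_end_matches(|c: char| c == '\n' || c == '\r');
        empty += line.split(',').filter(|f| f.is_empty()).count();
    }
    empty
}

fn main() {
    let data = "a,,c\n,,\nx,y,z\n";
    let naive = count_empty_fields_naive(Cursor::new(data));
    let amortized = count_empty_fields_amortized(Cursor::new(data));
    // Both strategies must agree; only the allocation pattern differs.
    assert_eq!(naive, amortized);
    println!("empty fields: {amortized}");
}
```

Both functions return the same count; the difference is only in how often the allocator is hit, which tends to matter once a loop runs over millions of rows.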
The prebuilt binaries are also built with CPU optimizations enabled for x86_64 and Apple Silicon (arm64) architectures.
0.112.0 is also a "Carousel" (i.e. increased usability) 🎠 release featuring new Jupyter notebooks in the contrib/notebooks directory to help users get started with qsv.
Added
- sqlp: added CASE expression support with Polars 0.32 9d508e6
- sample: added --faster option to use a faster random number generator #1210
- jsonl: added --delimiter option #1205
- excel: added --delimiter option ab73067
- notebook/describegpt: added describegpt QA Jupyter notebook by @a5dur in #1215
- notebook/count: added intro-to-count.ipynb by @rzmk in #1207
Changed
- stats: refactor hot compute function - 35999c5
- stats: faster detection of empty samples b054815 and a7f0836
- sample: major refactor making it faster, but also eliminating need to load the entire dataset into memory when an index is available. #1210
- frequency: refactor primary ftables function 57d660d
- excel: refactor main loop for more performance - 61f227b
- rustfmt: match_block_trailing_comma #1206
- bump MSRV to 1.71.1 1c99364
- apply clippy suggestions #1209
- build(deps): bump tokio from 1.29.1 to 1.30.0 by @dependabot in #1204
- build(deps): bump log from 0.4.19 to 0.4.20 by @dependabot in #1211
- build(deps): bump redis from 0.23.1 to 0.23.2 by @dependabot in #1213
- build(deps): bump tokio from 1.30.0 to 1.31.0 by @dependabot in #1212
- build(deps): bump sysinfo from 0.29.7 to 0.29.8 by @dependabot in #1214
- upgrade to Polars 0.32.0 #1217
- build(deps): bump flate2 from 1.0.26 to 1.0.27 by @dependabot in #1218
- build(deps): bump polars from 0.32.0 to 0.32.1 by @dependabot in #1219
- cargo update bump several indirect dependencies
- pin Rust nightly to 2023-08-13
Removed
- stats: removed Debug derives from structs - 2def136
Fixed
Full Changelog: 0.111.0...0.112.0
0.111.0
This is the first in a series of "Giddy-up" 🏇🏽 releases.
As Quicksilver matures, we will continue to tweak it in our goal to be the 🚀 fastest general purpose CSV data-wrangling CLI toolkit available.
"Giddy-up" 🏇🏽 releases increase performance by:
- taking advantage of new Rust features as they become available
- using new libraries that are faster than the ones we currently use
- optimizing our code to take advantage of new features in the libraries we use
- using new algorithms that are faster than the ones we currently use
- taking advantage of more hardware features (SIMD, multi-core, etc.)
- adding reproducible benchmarks that are automatically updated on release to track our progress
As it is, Quicksilver has an aggressive release tempo - with more than 160 releases since its initial release in December 2020. This was made possible by the solid foundation of Rust and the xsv project from which qsv was forked. We will continue to build on this foundation by adding more CI tests and starting to track code coverage so we can continue to iterate aggressively with confidence.
Apart from "giddy-up" releases, Quicksilver will also have "carousel" 🎠 releases that will focus on making the toolkit more accessible to non-technical users.
"Carousel" 🎠 releases will include:
- more documentation
- more examples
- more tutorials
- more recipes in the Cookbook
- multiple GUI wrappers around the CLI
- integrations with common desktop tools like Excel, Google Sheets, Open Office, etc.
- tighter integration with the CKAN ecosystem, with a focus on helping data publishers & data coordinators maintain a high quality data/metadata catalog
Hopefully, this will make qsv more accessible to non-technical users, and help them get more value out of their data. Special attention will be given to "open data" use cases - enabling non-profits, governments and regular citizens to tap raw open data and convert it to actionable insight - making open data useful, usable and used.
Every now and then, we'll also have "Unicorn" 🦄 releases that will add MAJOR new features to the toolkit (e.g. 10x type features like the integration of Pola.rs into qsv).
We will also add a new Technical Documentation section to the wiki to document qsv's architecture and how each command works. The hope is doing so will lower the barrier to contributions and help us grow the community of qsv contributors.
Added
Changed
- stats: refactor init_date_inference #1187
- join: cache has_headers result in hot loop e53edaf
- search & searchset: amortize allocs #1188
- stats: use fast-float to convert string to float #1191
- sqlp: more examples, apply clippy::needless_borrow lint ff37a04 and b8e1f77
- use fast-float project-wide (apply, applydp, schema, sort, validate) #1192
- fine tune publishing workflows to enable universally available CPU features a1dccc7
- build(deps): bump serde from 1.0.179 to 1.0.180 by @dependabot in #1176
- build(deps): bump pyo3 from 0.19.1 to 0.19.2 by @dependabot in #1177
- build(deps): bump qsv-dateparser from 0.9.0 to 0.10.0 by @dependabot in #1178
- build(deps): bump qsv-sniffer from 0.9.4 to 0.10.0 by @dependabot in #1180
- build(deps): bump indicatif from 0.17.5 to 0.17.6 by @dependabot in #1182
- Bump to qsv stats 0.11 #1184
- build(deps): bump serde from 1.0.180 to 1.0.181 by @dependabot in #1185
- build(deps): bump qsv_docopt from 1.3.0 to 1.4.0 by @dependabot in #1186
- build(deps): bump filetime from 0.2.21 to 0.2.22 by @dependabot in #1193
- build(deps): bump regex from 1.9.1 to 1.9.2 by @dependabot in #1194
- build(deps): bump regex from 1.9.2 to 1.9.3 by @dependabot in #1195
- build(deps): bump serde from 1.0.181 to 1.0.182 by @dependabot in #1196
- build(deps): bump tempfile from 3.7.0 to 3.7.1 by @dependabot in #1199
- build(deps): bump strum_macros from 0.25.1 to 0.25.2 by @dependabot in #1200
- build(deps): bump serde from 1.0.182 to 1.0.183 by @dependabot in #1201
- cargo update bump several indirect dependencies
- apply select clippy lint suggestions
- pin Rust nightly to 2023-08-07
Removed
- temporarily remove rand/simd_support feature when building nightly as it's causing the nightly build to fail 0a66fdb
Fixed
New Contributors
Full Changelog: 0.110.0...0.111.0