Commit 8f090d9

typo: multithreaded not multi-threaded

debatable, but use non-hyphenated form consistently

[skip ci]

jqnatividad committed Mar 11, 2024
1 parent fcfc75b commit 8f090d9

Showing 6 changed files with 17 additions and 17 deletions.
16 changes: 8 additions & 8 deletions CHANGELOG.md
@@ -30,7 +30,7 @@ VendorID,total_amount
(3, 2)
6.09 real 6.82 user 0.16 sys

-# with fast path optimization, fully exploiting Polars' multi-threaded, mem-mapped CSV reader!
+# with fast path optimization, fully exploiting Polars' multithreaded, mem-mapped CSV reader!
/usr/bin/time qsv sqlp taxi.csv "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
VendorID,total_amount
1,52377417.52985942
@@ -484,7 +484,7 @@ Users can manually verify the signatures by downloading the zipsign public key a
## Highlights:
* `geocode`: added Federal Information Processing Standards (FIPS) codes to results for US places, so we can derive [GEOIDs](https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html#:~:text=FIPS%20codes%20are%20assigned%20alphabetically,Native%20Hawaiian%20(AIANNH)%20areas.). This paves the way to doing data enrichment lookups (starting with the US Census) in an upcoming release.
* Added [Goal/Non-goals](https://github.com/jqnatividad/qsv#goals--non-goals), explicitly codifying what qsv is and isn't, and what we're trying to achieve with the toolkit.
-* `excel`: CSV output processing is now multi-threaded, making it a bit faster. The bottleneck is still the Excel/ODS library we're using ([calamine](https://github.com/tafia/calamine)), which is single-threaded. But there are [active](https://github.com/tafia/calamine/issues/346) [discussions](https://github.com/tafia/calamine/issues/362) underway to make it much faster in the future.
+* `excel`: CSV output processing is now multithreaded, making it a bit faster. The bottleneck is still the Excel/ODS library we're using ([calamine](https://github.com/tafia/calamine)), which is single-threaded. But there are [active](https://github.com/tafia/calamine/issues/346) [discussions](https://github.com/tafia/calamine/issues/362) underway to make it much faster in the future.
* Upgrading the MSRV to 1.73.0 has allowed us to use LLVM 17, which has resulted in an overall performance boost.

---
@@ -495,7 +495,7 @@ Users can manually verify the signatures by downloading the zipsign public key a

### Changed
* `cat` : minor optimization https://github.com/jqnatividad/qsv/commit/343bb668ae84fcf862883245382e7d8015da88c2
-* `excel`: CSV output processing is now multi-threaded https://github.com/jqnatividad/qsv/pull/1360
+* `excel`: CSV output processing is now multithreaded https://github.com/jqnatividad/qsv/pull/1360
* `geocode`: more efficient dynfmt processing https://github.com/jqnatividad/qsv/pull/1367
* `frequency`: optimize allocations before hot loop https://github.com/jqnatividad/qsv/commit/655bebcdec6d89f0ffa33d794069ee5eee0df3e5
* `luau`: upgraded embedded Luau from 0.596 to 0.599
@@ -663,7 +663,7 @@ Other release highlights include:
## [0.113.0] - 2023-09-08 🦄🏇🏽🎠
This is the first "[Unicorn](https://7esl.com/unicorn/)" 🦄 release, adding MAJOR new features to the toolkit!

-* `geocode`: adds high-speed, cache-backed, multi-threaded geocoding using a local, updateable copy of the [GeoNames](https://www.geonames.org/) database. This is a major improvement over the previous `geocode` subcommand in the `apply` command thanks to the wonderful [geosuggest](https://github.com/estin/geosuggest) crate.
+* `geocode`: adds high-speed, cache-backed, multithreaded geocoding using a local, updateable copy of the [GeoNames](https://www.geonames.org/) database. This is a major improvement over the previous `geocode` subcommand in the `apply` command thanks to the wonderful [geosuggest](https://github.com/estin/geosuggest) crate.
* guaranteed non-UTF8 input detection with the `validate` and `input` commands. Quicksilver [_REQUIRES_ UTF-8 encoded input](https://github.com/jqnatividad/qsv/tree/master#utf-8-encoding). You can now use these commands to ensure you have valid UTF-8 input before using the rest of the toolkit.
* New/expanded whirlwind tour & quick-start notebooks by @a5dur and @rzmk 🎠
* Various performance improvements all-around: 🏇🏽
@@ -1108,7 +1108,7 @@ This release features the new [Polars](https://www.pola.rs/)-powered `sqlp` comm

Initial tests show that it's competitive with [DuckDB](https://duckdb.org/) and faster than [DataFusion](https://arrow.apache.org/datafusion/) on identical SQL queries, and it just runs rings around [pandasql](https://github.com/yhat/pandasql/#pandasql).

-It converts Polars SQL (a subset of ANSI SQL) queries to multi-threaded LazyFrames expressions and then executes them. This is a very powerful feature and allows you to do things like joins, aggregations, group bys, etc. on larger than memory CSVs. The `sqlp` command is still experimental and we are looking for feedback on it. Please try it out and let us know what you think.
+It converts Polars SQL (a subset of ANSI SQL) queries to multithreaded LazyFrames expressions and then executes them. This is a very powerful feature and allows you to do things like joins, aggregations, group bys, etc. on larger than memory CSVs. The `sqlp` command is still experimental and we are looking for feedback on it. Please try it out and let us know what you think.

### Added
* `sqlp`: new command to allow Polars SQL queries against CSVs https://github.com/jqnatividad/qsv/pull/1015
@@ -1442,7 +1442,7 @@ So "0.100.0" is less than "0.99.0", and self-update won't work.
### Added
* added [Snappy](https://google.github.io/snappy/) auto-compression/decompression support. The Snappy format was chosen primarily
because it supported streaming compression/decompression and is designed for performance. https://github.com/jqnatividad/qsv/pull/911
-* added `snappy` command. Although files ending with the ".sz" extension are automatically compressed/decompressed, the `snappy` command offers 4-5x faster multi-threaded compression. It can also be used to check if a file is Snappy-compressed or not, and can be used to compress/decompress any file. https://github.com/jqnatividad/qsv/pull/911 and https://github.com/jqnatividad/qsv/pull/916
+* added `snappy` command. Although files ending with the ".sz" extension are automatically compressed/decompressed, the `snappy` command offers 4-5x faster multithreaded compression. It can also be used to check if a file is Snappy-compressed or not, and can be used to compress/decompress any file. https://github.com/jqnatividad/qsv/pull/911 and https://github.com/jqnatividad/qsv/pull/916
* `diff` command added to `qsvlite` and `qsvdp` binary variants https://github.com/jqnatividad/qsv/pull/910
* `to`: added stdin support https://github.com/jqnatividad/qsv/pull/913

@@ -2746,7 +2746,7 @@ This means clippy, even in pedantic/nursery/perf mode will have no warnings. htt
* pin Rust Nightly to 2022-06-29

### Fixed
-* `fetch`: is single-threaded again. It turns out it was more complicated than I hoped. Will revisit making it multi-threaded once I sort out the sync issues.
+* `fetch`: is single-threaded again. It turns out it was more complicated than I hoped. Will revisit making it multithreaded once I sort out the sync issues.

## [0.56.0] - 2022-06-20
### Added
@@ -3072,7 +3072,7 @@ which prevented us from building qsv's nightly build. (see https://github.com/ap

## [0.45.0] - 2022-04-30
### Added
-* Added `extsort` command - sort arbitrarily large text files\CSVs using a multi-threaded external sort algorithm.
+* Added `extsort` command - sort arbitrarily large text files\CSVs using a multithreaded external sort algorithm.

### Changed
* Updated whirlwind tour with simple `stats` step
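To illustrate the `geocode` command highlighted in the CHANGELOG above, here is a hypothetical invocation; the column and file names are made up and the exact flags may differ, so consult `qsv geocode --help` for the actual options:

```
# suggest: geocode a column of city names against the local GeoNames index
qsv geocode suggest city addresses.csv > geocoded.csv
```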
6 changes: 3 additions & 3 deletions README.md
@@ -55,7 +55,7 @@
| [index](/src/cmd/index.rs#L2) | Create an index (📇) for a CSV. This is very quick (even the 15gb, 28m row NYC 311 dataset takes all of 14 seconds to index) & provides constant time indexing/random access into the CSV. With an index, `count`, `sample` & `slice` work instantaneously; random access mode is enabled in `luau`; and multithreading (🏎️) is enabled for the `frequency`, `split`, `stats`, `schema` & `tojsonl` commands. |
| [input](/src/cmd/input.rs#L2) | Read CSV data with special commenting, quoting, trimming, line-skipping & non-UTF8 encoding handling rules. Typically used to "normalize" a CSV for further processing with other qsv commands. |
| [join](/src/cmd/join.rs#L2) | Inner, outer, right, cross, anti & semi joins. Automatically creates a simple, in-memory hash index to make it fast. |
-| [joinp](/src/cmd/joinp.rs#L2)<br>✨🚀🐻‍❄️ | Inner, outer, cross, anti, semi & asof joins using the [Pola.rs](https://www.pola.rs) engine. Unlike the `join` command, `joinp` can process files larger than RAM, is multi-threaded, has join key validation, pre-join filtering, supports [asof joins](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.join_asof.html) (which is [particularly useful for time series data](https://github.com/jqnatividad/qsv/blob/30cc920d0812a854fcbfedc5db81788a0600c92b/tests/test_joinp.rs#L509-L983)) & its output doesn't have duplicate columns. However, `joinp` doesn't have an --ignore-case option & it doesn't support right outer joins. |
+| [joinp](/src/cmd/joinp.rs#L2)<br>✨🚀🐻‍❄️ | Inner, outer, cross, anti, semi & asof joins using the [Pola.rs](https://www.pola.rs) engine. Unlike the `join` command, `joinp` can process files larger than RAM, is multithreaded, has join key validation, pre-join filtering, supports [asof joins](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.join_asof.html) (which is [particularly useful for time series data](https://github.com/jqnatividad/qsv/blob/30cc920d0812a854fcbfedc5db81788a0600c92b/tests/test_joinp.rs#L509-L983)) & its output doesn't have duplicate columns. However, `joinp` doesn't have an --ignore-case option & it doesn't support right outer joins. |
| [jsonl](/src/cmd/jsonl.rs#L2)<br>🚀🔣 | Convert newline-delimited JSON ([JSONL](https://jsonlines.org/)/[NDJSON](http://ndjson.org/)) to CSV. See `tojsonl` command to convert CSV to JSONL.
| <a name="luau_deeplink"></a><br>[luau](/src/cmd/luau.rs#L2) 👑<br>✨📇🌐🔣 ![CKAN](docs/images/ckan.png) | Create multiple new computed columns, filter rows, compute aggregations and build complex data pipelines by executing a [Luau](https://luau-lang.org) [0.616](https://github.com/Roblox/luau/releases/tag/0.616) expression/script for every row of a CSV file ([sequential mode](https://github.com/jqnatividad/qsv/blob/bb72c4ef369d192d85d8b7cc6e972c1b7df77635/tests/test_luau.rs#L254-L298)), or using [random access](https://www.webopedia.com/definitions/random-access/) with an index ([random access mode](https://github.com/jqnatividad/qsv/blob/bb72c4ef369d192d85d8b7cc6e972c1b7df77635/tests/test_luau.rs#L367-L415)).<br>Can process a single Luau expression or [full-fledged data-wrangling scripts using lookup tables](https://github.com/dathere/qsv-lookup-tables#example) with discrete BEGIN, MAIN and END sections.<br> It is not just another qsv command, it is qsv's [Domain-specific Language](https://en.wikipedia.org/wiki/Domain-specific_language) (DSL) with [numerous qsv-specific helper functions](https://github.com/jqnatividad/qsv/blob/113eee17b97882dc368b2e65fec52b86df09f78b/src/cmd/luau.rs#L1356-L2290) to build production data pipelines. |
| [partition](/src/cmd/partition.rs#L2) | Partition a CSV based on a column value. |
@@ -274,7 +274,7 @@ For all commands except the `index`, `extdedup` & `extsort` commands, if the inp
Similarly, if the `--output` file has an ".sz" extension, qsv will _automatically_ do streaming compression as it writes it.
If the output file has an extended CSV/TSV ".sz" extension, qsv will also use the file extension to determine the delimiter to use.

-Note however that compressed files cannot be indexed, so index-accelerated commands (`frequency`, `schema`, `split`, `stats`, `tojsonl`) will not be multi-threaded. Random access is also disabled without an index, so `slice` will not be instantaneous and `luau`'s random-access mode will not be available.
+Note however that compressed files cannot be indexed, so index-accelerated commands (`frequency`, `schema`, `split`, `stats`, `tojsonl`) will not be multithreaded. Random access is also disabled without an index, so `slice` will not be instantaneous and `luau`'s random-access mode will not be available.

There is also a dedicated [`snappy`](/src/cmd/snappy.rs#L2) command with four subcommands for direct snappy file operations — a multithreaded `compress` subcommand (4-5x faster than the built-in, single-threaded auto-compression); a `decompress` subcommand with detailed compression metadata; a `check` subcommand to quickly inspect if a file has a Snappy header; and a `validate` subcommand to confirm if a Snappy file is valid.

@@ -359,7 +359,7 @@ However, if the latest Rust stable has been released for more than a week and Ho
## Goals / Non-Goals

QuickSilver's goals, in priority order, are to be:
-* **As Fast as Possible** - To do so, it has frequent releases, an aggressive MSRV policy, takes advantage of CPU features, employs [various caching strategies](docs/PERFORMANCE.md#caching), uses [HTTP/2](https://www.cloudflare.com/learning/performance/http2-vs-http1.1/#:~:text=Multiplexing%3A%20HTTP%2F1.1%20loads%20resources,resource%20blocks%20any%20other%20resource.), and is multi-threaded when possible and it makes sense. See [Performance](docs/PERFORMANCE.md) for more info.
+* **As Fast as Possible** - To do so, it has frequent releases, an aggressive MSRV policy, takes advantage of CPU features, employs [various caching strategies](docs/PERFORMANCE.md#caching), uses [HTTP/2](https://www.cloudflare.com/learning/performance/http2-vs-http1.1/#:~:text=Multiplexing%3A%20HTTP%2F1.1%20loads%20resources,resource%20blocks%20any%20other%20resource.), and is multithreaded when possible and it makes sense. See [Performance](docs/PERFORMANCE.md) for more info.
* **Able to Process Very Large Files** - Most qsv commands are streaming, using constant memory, and can process arbitrarily large CSV files. For those commands that require loading the entire CSV into memory (denoted by 🤯), qsv has Out-of-Memory prevention, batch processing strategies and "ext"ernal commands that use the disk to process larger than memory files. See [Memory Management](docs/PERFORMANCE.md#memory-management) for more info.
* **A Complete Data-Wrangling Toolkit** - qsv aims to be a comprehensive data-wrangling toolkit that you can use for quick analysis and investigations, but is also robust enough for production data pipelines. Its many commands are targeted towards common data-wrangling tasks and can be combined/composed into complex data-wrangling scripts with its Luau-based DSL.
Luau will also serve as the backbone of a whole library of **qsv recipes** - reusable scripts for common tasks (e.g. street-level geocoding, removing PII, data enrichment, etc.) that prompt for easily modifiable parameters.
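As a concrete sketch of the indexing interplay described in the README hunk above (file names are illustrative):

```
# plain CSV: create an index so frequency, schema, split, stats & tojsonl
# can run multithreaded, and slice/luau get random access
qsv index data.csv
qsv stats data.csv

# Snappy-compressed input is auto-decompressed on the fly, but cannot be
# indexed, so the same command falls back to single-threaded streaming
qsv stats data.csv.sz
```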
2 changes: 1 addition & 1 deletion src/cmd/count.rs
@@ -17,7 +17,7 @@ count options:
WHEN THE POLARS FEATURE IS ENABLED:
--no-polars Use the regular single-threaded, streaming CSV reader instead of
-the much faster Polars multi-threaded, mem-mapped CSV reader.
+the much faster Polars multithreaded, mem-mapped CSV reader.
Use this when you encounter issues when counting with the
Polars CSV reader. The regular reader is slower but can read any
valid CSV file of any size.
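A usage sketch of the `--no-polars` option documented above (the file name is illustrative):

```
# default (with the polars feature): the multithreaded, mem-mapped reader
qsv count large.csv

# fall back to the regular single-threaded, streaming reader
qsv count --no-polars large.csv
```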
2 changes: 1 addition & 1 deletion src/cmd/joinp.rs
@@ -4,7 +4,7 @@ Joins two sets of CSV data on the specified columns using the Pola.rs engine.
The default join operation is an 'inner' join. This corresponds to the
intersection of rows on the keys specified.
-Unlike the join command, joinp can process files larger than RAM, is multi-threaded,
+Unlike the join command, joinp can process files larger than RAM, is multithreaded,
has join key validation, pre-join filtering, supports asof joins & its output doesn't
have duplicate columns.
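An illustrative `joinp` invocation, assuming it shares `join`'s positional `<columns1> <input1> <columns2> <input2>` syntax (column and file names are hypothetical):

```
# inner join (the default) of two CSVs on their id columns,
# run on the multithreaded Pola.rs engine
qsv joinp id left.csv id right.csv --output joined.csv
```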
6 changes: 3 additions & 3 deletions src/cmd/snappy.rs
@@ -4,7 +4,7 @@ Does streaming compression/decompression of the input using the Snappy framing f
https://github.com/google/snappy/blob/main/framing_format.txt
It has four subcommands:
-compress: Compress the input (multi-threaded).
+compress: Compress the input (multithreaded).
decompress: Decompress the input (single-threaded).
check: Quickly check if the input is a Snappy file by inspecting whether the
first 50 bytes of the input are valid Snappy data.
@@ -17,7 +17,7 @@ Note that most qsv commands already automatically decompresses Snappy files if t
input file has an ".sz" extension. It will also automatically compress the output
file (though only single-threaded) if the --output file has an ".sz" extension.
-This command's multi-threaded compression is 5-6x faster than qsv's automatic
+This command's multithreaded compression is 5-6x faster than qsv's automatic
single-threaded compression.
Also, this command is not specific to CSV data; it can compress/decompress ANY file.
@@ -232,7 +232,7 @@ pub fn run(argv: &[&str]) -> CliResult<()> {
Ok(())
}

-// multi-threaded streaming snappy compression
+// multithreaded streaming snappy compression
pub fn compress<R: Read, W: Write + Send + 'static>(
mut src: R,
dst: W,
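For context, the automatic single-threaded compression that `compress` outperforms is essentially a streaming frame encode. A minimal sketch using the [snap](https://crates.io/crates/snap) crate (an illustration under that assumption, not qsv's actual `compress()` above, which parallelizes the work):

```rust
use std::fs::File;
use std::io::{self, Write};

use snap::write::FrameEncoder;

fn main() -> io::Result<()> {
    let mut src = File::open("data.csv")?;
    let dst = File::create("data.csv.sz")?;

    // wrap the writer in a Snappy framing-format encoder and
    // stream the input through it, single-threaded
    let mut enc = FrameEncoder::new(dst);
    io::copy(&mut src, &mut enc)?;
    enc.flush()?;
    Ok(())
}
```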
2 changes: 1 addition & 1 deletion src/cmd/sqlp.rs
@@ -156,7 +156,7 @@ sqlp options:
Use this when you get query errors or to force CSV parsing when there
is only one input file, no CSV parsing options are used and it's not
a SQL script. Otherwise, the CSV will be read directly into a LazyFrame
-using the fast path with the multi-threaded, mem-mapped read_csv()
+using the fast path with the multithreaded, mem-mapped read_csv()
Polars SQL function which is much faster though not as configurable as
the regular CSV parser.
--truncate-ragged-lines Truncate ragged lines when parsing CSVs. If set, rows with more
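Per the help text above, a query takes the fast path when there is a single input file, no CSV parsing options are used and it's not a SQL script, as in the CHANGELOG example near the top:

```
# read directly into a LazyFrame via the multithreaded,
# mem-mapped read_csv() fast path
qsv sqlp taxi.csv "select VendorID, sum(total_amount) from taxi group by VendorID order by VendorID"
```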
