GitHub - vincev/dply-rs: A data manipulation tool for parquet, csv, and json data.

dply is a command line tool for viewing, querying, and writing CSV, NdJSON, and Parquet files, inspired by dplyr.

Usage overview

A dply pipeline consists of a sequence of functions that read, transform, or write CSV, NdJSON, or Parquet files.

Conversions between CSV, NdJSON, and Parquet files

The functions csv, json, and parquet read and write data in their respective formats. The following two-step pipeline converts a Parquet file to NdJSON:

$ dply -c 'parquet("nyctaxi.parquet") | json("nyctaxi.json")'

We can use a select step if we want to convert a subset of the columns:

$ dply -c 'parquet("nyctaxi.parquet") |
    select(ends_with("time"), payment_type) |
    json("nyctaxi.json")'
$ head -2 nyctaxi.json | jq
{
  "payment_type": "Credit card",
  "tpep_dropoff_datetime": "2022-11-22T19:45:53",
  "tpep_pickup_datetime": "2022-11-22T19:27:01"
}
{
  "payment_type": "Cash",
  "tpep_dropoff_datetime": "2022-11-27T16:50:06",
  "tpep_pickup_datetime": "2022-11-27T16:43:26"
}
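
For readers new to NdJSON, the following plain-Python sketch illustrates what the select step above amounts to: every line of the file is one JSON object, and only the keys ending in "time" plus payment_type are kept. This is not dply's implementation; the sample records and the helper name select_columns are hypothetical and exist only for illustration.

```python
import json

# Two hypothetical NdJSON lines: one JSON object per line.
ndjson = "\n".join([
    '{"VendorID": 1, "tpep_pickup_datetime": "2022-11-22T19:27:01", '
    '"tpep_dropoff_datetime": "2022-11-22T19:45:53", "payment_type": "Credit card"}',
    '{"VendorID": 2, "tpep_pickup_datetime": "2022-11-27T16:43:26", '
    '"tpep_dropoff_datetime": "2022-11-27T16:50:06", "payment_type": "Cash"}',
])


def select_columns(record, suffix, extra):
    """Keep the keys that end with `suffix` or are listed in `extra`."""
    return {k: v for k, v in record.items() if k.endswith(suffix) or k in extra}


# Parse each line independently, then project the wanted columns.
selected = [
    select_columns(json.loads(line), "time", {"payment_type"})
    for line in ndjson.splitlines()
]

for row in selected:
    print(json.dumps(row))
```

Because each line is parsed on its own, the same loop works on a file of any length without loading it all at once.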

Extracting nested fields from nested NdJSON

To extract a nested field from an NdJSON file we can use the field function in a mutate step. The following example extracts the sha from the list of commits in the payload object:

$ dply -c 'json("./tests/data/github.json") |
    mutate(commits = field(payload, commits)) |
    unnest(commits) |
    mutate(sha = field(commits, sha)) |
    select(sha) |
    show()'
shape: (4, 1)
┌──────────────────────────────────────────┐
│ sha                                      │
│ ---                                      │
│ str                                      │
╞══════════════════════════════════════════╡
│ a02be18dc2a0faa0faec14f50c8b190ca0b50034 │
│ ac97a4ab3a4d86f61a6ba167c06cd8813b470867 │
│ null                                     │
│ e4b233f1323a4b4e4461ed1aad31d20a7fbf0db4 │
└──────────────────────────────────────────┘
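
As a rough illustration of what the field and unnest steps above do, here is a plain-Python sketch on made-up records. It is not how dply implements these functions: the Python helpers field and unnest only borrow the names, the sha values are hypothetical, and keeping a missing list as a single None row is an assumption chosen to mirror the null in the output above.

```python
def field(obj, name):
    """Return obj[name], or None when obj is not a dict or lacks the key."""
    return obj.get(name) if isinstance(obj, dict) else None


def unnest(rows, column):
    """Yield one row per list element; a missing or empty list yields one None row."""
    for row in rows:
        items = row[column]
        if not items:
            yield {**row, column: None}
        else:
            for item in items:
                yield {**row, column: item}


# Hypothetical events; the second one has no commits in its payload.
events = [
    {"payload": {"commits": [{"sha": "aaa111"}, {"sha": "bbb222"}]}},
    {"payload": {}},
    {"payload": {"commits": [{"sha": "ccc333"}]}},
]

# mutate(commits = field(payload, commits))
rows = [{"commits": field(event["payload"], "commits")} for event in events]

# unnest(commits) followed by mutate(sha = field(commits, sha))
shas = [field(row["commits"], "sha") for row in unnest(rows, "commits")]

print(shas)  # ['aaa111', 'bbb222', None, 'ccc333']
```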

Complex NdJSON files can be converted to Parquet for faster query processing:

$ dply -c 'json("github.json") | parquet("github.parquet")'

Grouping, sorting columns, and saving results to a file

The following pipeline reads a Parquet file¹, groups rows by payment_type, computes the minimum, mean, and maximum fare for each payment type, saves the result to the CSV file fares.csv, and shows the result:

$ dply -c 'parquet("nyctaxi.parquet") |
    group_by(payment_type) |
    summarize(
        min_price = min(total_amount),
        mean_price = mean(total_amount),
        max_price = max(total_amount)
    ) |
    arrange(payment_type) |
    csv("fares.csv") |
    show()'
shape: (5, 4)
┌──────────────┬───────────┬────────────┬───────────┐
│ payment_type ┆ min_price ┆ mean_price ┆ max_price │
│ ---          ┆ ---       ┆ ---        ┆ ---       │
│ str          ┆ f64       ┆ f64        ┆ f64       │
╞══════════════╪═══════════╪════════════╪═══════════╡
│ Cash         ┆ -61.85    ┆ 18.07      ┆ 86.55     │
│ Credit card  ┆ 4.56      ┆ 22.969491  ┆ 324.72    │
│ Dispute      ┆ -55.6     ┆ -0.145161  ┆ 54.05     │
│ No charge    ┆ -16.3     ┆ 0.086667   ┆ 19.8      │
│ Unknown      ┆ 9.96      ┆ 28.893333  ┆ 85.02     │
└──────────────┴───────────┴────────────┴───────────┘
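
The grouping logic of the pipeline above can be sketched in plain Python as follows. This is only an illustration of the group_by, summarize, and arrange semantics under the assumption of a small in-memory list of rows; the trip amounts are made up and the code says nothing about how dply computes the result.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows standing in for the Parquet data.
trips = [
    {"payment_type": "Credit card", "total_amount": 20.0},
    {"payment_type": "Cash", "total_amount": 10.0},
    {"payment_type": "Credit card", "total_amount": 30.0},
    {"payment_type": "Cash", "total_amount": 14.0},
]

# group_by(payment_type): collect the amounts of each payment type.
groups = defaultdict(list)
for trip in trips:
    groups[trip["payment_type"]].append(trip["total_amount"])

# summarize(...) then arrange(payment_type): one row per group, sorted by key.
summary = [
    {
        "payment_type": payment_type,
        "min_price": min(amounts),
        "mean_price": mean(amounts),
        "max_price": max(amounts),
    }
    for payment_type, amounts in sorted(groups.items())
]

for row in summary:
    print(row)
```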

Running dply without any parameters starts the interactive client.

Supported functions

dply supports the following functions:

arrange: Sorts rows by column values
count: Counts unique values of columns
config: Configures display format options
csv: Reads or writes a dataframe in CSV format
distinct: Retains unique rows
filter: Filters rows that satisfy given predicates
glimpse: Shows a dataframe overview
group_by and summarize: Perform grouped aggregations
head: Shows the first few dataframe rows in table format
joins: Left, inner, outer, and cross joins
json: Reads or writes a dataframe in JSON format
mutate: Creates or mutates columns
parquet: Reads or writes a dataframe in Parquet format
relocate: Moves column positions
rename: Renames columns
select: Selects columns
show: Shows all dataframe rows
unnest: Expands list columns into rows

More examples can be found in the tests folder.

Installation

Binaries generated by the release GitHub action for Linux, macOS (x86), and Windows are available on the releases page.

You can also install dply using Cargo:

cargo install dply

or by building it from this repository:

git clone https://github.com/vincev/dply-rs
cd dply-rs
cargo install --path .

¹ The file nyctaxi.parquet in the tests/data folder is a 250-row Parquet file sampled from the NYC trip record data.
