Commit

Merge pull request #1263 from jqnatividad/apply-remove-geocode-subcmd
`apply`: remove geocode subcmd now that we have a dedicated `geocode` command
jqnatividad authored Aug 29, 2023
2 parents 13e6cd7 + ddd315b commit 7d268dc
Showing 5 changed files with 4 additions and 276 deletions.
24 changes: 1 addition & 23 deletions Cargo.lock

3 changes: 0 additions & 3 deletions Cargo.toml
@@ -167,7 +167,6 @@ reqwest = { version = "0.11", features = [
"rustls-tls",
"stream",
], default-features = false }
-reverse_geocoder = { version = "3", optional = true }
rust_decimal = "1.32"
ryu = "1"
sanitise-file-name = { version = "1.0", optional = true }
@@ -235,15 +234,13 @@ all_features = [
"to",
]
apply = [
-"cached",
"censor",
"cpc",
"data-encoding",
"dynfmt",
"eudex",
"hashbrown",
"qsv_currency",
-"reverse_geocoder",
"strsim",
"titlecase",
"vader_sentiment",
4 changes: 2 additions & 2 deletions README.md
@@ -20,14 +20,14 @@

</div>

-> ℹ️ **NOTE:** Quicksilver (qsv) is a fork of the popular [xsv](https://github.com/BurntSushi/xsv) utility, merging several pending PRs [since xsv 0.13.0's May 2018 release](https://github.com/BurntSushi/xsv/issues/267). On top of xsv's 20 commands, it adds numerous new features; 37 additional commands; 6 `apply` subcommands & 35 operations; 5 `to` subcommands; 3 `cat` subcommands; 3 `geocode` subcommands & 3 operations; and 4 `snappy` subcommands.
+> ℹ️ **NOTE:** Quicksilver (qsv) is a fork of the popular [xsv](https://github.com/BurntSushi/xsv) utility, merging several pending PRs [since xsv 0.13.0's May 2018 release](https://github.com/BurntSushi/xsv/issues/267). On top of xsv's 20 commands, it adds numerous new features; 37 additional commands; 5 `apply` subcommands & 35 operations; 5 `to` subcommands; 3 `cat` subcommands; 3 `geocode` subcommands & 3 operations; and 4 `snappy` subcommands.
See [FAQ](https://github.com/jqnatividad/qsv/discussions/categories/faq) for more details.

## Available commands

| Command | Description |
| --- | --- |
-| [apply](/src/cmd/apply.rs#L2)<br>✨🚀🧠🤖 | Apply series of string, date, math, currency & geocoding transformations to given CSV column/s. It also has some basic [NLP](https://en.wikipedia.org/wiki/Natural_language_processing) functions ([similarity](https://crates.io/crates/strsim), [sentiment analysis](https://crates.io/crates/vader_sentiment), [profanity](https://docs.rs/censor/latest/censor/), [eudex](https://github.com/ticki/eudex#eudex-a-blazingly-fast-phonetic-reductionhashing-algorithm) & [language detection](https://crates.io/crates/whatlang)). |
+| [apply](/src/cmd/apply.rs#L2)<br>✨🚀🧠🤖 | Apply series of string, date, math & currency to given CSV column/s. It also has some basic [NLP](https://en.wikipedia.org/wiki/Natural_language_processing) functions ([similarity](https://crates.io/crates/strsim), [sentiment analysis](https://crates.io/crates/vader_sentiment), [profanity](https://docs.rs/censor/latest/censor/), [eudex](https://github.com/ticki/eudex#eudex-a-blazingly-fast-phonetic-reductionhashing-algorithm) & [language detection](https://crates.io/crates/whatlang)). |
| <a name="applydp_deeplink"></a>[applydp](/src/cmd/applydp.rs#L2)<br>🚀 ![CKAN](docs/images/ckan.png)| applydp is a slimmed-down version of `apply` with only [Datapusher+](https://github.com/dathere/datapusher-plus) relevant subcommands/operations (`qsvdp` binary variant only). |
| [behead](/src/cmd/behead.rs#L2) | Drop headers from a CSV. |
| [cat](/src/cmd/cat.rs#L2) | Concatenate CSV files by row or by column. |
133 changes: 1 addition & 132 deletions src/cmd/apply.rs
@@ -2,12 +2,11 @@ static USAGE: &str = r#"
Apply a series of transformation functions to given CSV column/s. This can be used to
perform typical data-wrangling tasks and/or to harmonize some values, etc.
-It has six subcommands:
+It has five subcommands:
* operations - 36 string, format, currency, regex & NLP operators.
* emptyreplace - replace empty cells with <--replacement> string.
* datefmt - Formats a recognized date column to a specified format using <--formatstr>.
* dynfmt - Dynamically constructs a new column from other columns using the <--formatstr> template.
-* geocode - geocodes a WGS84 location against a static copy of the Geonames cities database.
* calcconv - parse and evaluate math expressions, with support for units and conversions.
OPERATIONS (multi-column capable)
@@ -211,27 +210,6 @@ Create a new column 'FullName' from 'FirstName', 'MI', and 'LastName' columns:
$ qsv apply dynfmt --formatstr 'Sir/Madam {FirstName} {MI}. {LastName}' -c FullName file.csv
-GEOCODE
-Geocodes to the nearest city center point given a location column
-[i.e. a column which contains a latitude, longitude WGS84 coordinate] against
-an embedded copy of the Geonames city database.
-The geocoded information is formatted based on --formatstr, returning
-it in 'city-state' format if not specified.
-Use the --new-column option if you want to keep the location column:
-Examples:
-Geocode file.csv Location column and set the geocoded value to a
-new column named City.
-$ qsv apply geocode Location --new-column City file.csv
-Geocode file.csv Location column with --formatstr=city-state and
-set the geocoded value a new column named City.
-$ qsv apply geocode Location --formatstr city-state --new-column City file.csv
CALCCONV
Parse and evaluate math expressions into a new column, with support for units and conversions.
The math expression is built dynamically using the <--formatstr> template, similar to the DYNFMT
@@ -276,7 +254,6 @@ qsv apply operations <operations> [options] <column> [<input>]
qsv apply emptyreplace --replacement=<string> [options] <column> [<input>]
qsv apply datefmt [--formatstr=<string>] [options] <column> [<input>]
qsv apply dynfmt --formatstr=<string> [options] --new-column=<name> [<input>]
-qsv apply geocode [--formatstr=<string>] [options] <column> [<input>]
qsv apply calcconv --formatstr=<string> [options] --new-column=<name> [<input>]
qsv apply --help
@@ -305,13 +282,6 @@ apply arguments:
See DYNFMT example above for more details.
--new-column=<name> Put the generated values in a new column.
-GEOCODE subcommand:
---formatstr=<string> The place format to use for the geocode operation.
-See GEOCODE section in the --formatstr option below
-for more details.
-<column> The Location column (a column which contains BOTH latitude and
-longitude WGS84 coordinates) to geocode.
CALCONV subcommand:
--formatstr=<string> The calculation/conversion expression to use.
--new-column=<name> Put the calculated/converted values in a new column.
@@ -352,17 +322,6 @@ apply options:
DYNFMT: the template to use to construct a new column.
-GEOCODE: the place format to use with the geocode subcommand.
-The available formats are:
-- 'city-state' (default) - e.g. Brooklyn, New York
-- 'city-country' - Brooklyn, US
-- 'city-state-country' | 'city-admin1-country' - Brooklyn, New York US
-- 'city' - Brooklyn
-- 'county' | 'admin2' - Kings County
-- 'state' | 'admin1' - New York
-- 'county-country' | 'admin2-country' - Kings County, US
-- 'county-state-country' | 'admin2-admin1-country' - Kings County, New York US
-- 'country' - US
-j, --jobs <arg> The number of jobs to run in parallel.
When not set, the number of jobs is set to the number of CPUs detected.
-b, --batch <size> The number of rows per batch to load into memory, before running in parallel.
@@ -380,7 +339,6 @@ Common options:

use std::{str::FromStr, sync::OnceLock};

-use cached::proc_macro::cached;
use censor::{Censor, Sex, Zealous};
use cpc::{eval, units::Unit};
use data_encoding::BASE64;
@@ -395,7 +353,6 @@ use rayon::{
prelude::IntoParallelRefIterator,
};
use regex::Regex;
-use reverse_geocoder::{Locations, ReverseGeocoder};
use serde::Deserialize;
use strsim::{
damerau_levenshtein, hamming, jaro_winkler, normalized_damerau_levenshtein, osa_distance,
@@ -465,7 +422,6 @@ struct Args {
cmd_datefmt: bool,
cmd_dynfmt: bool,
cmd_emptyreplace: bool,
-cmd_geocode: bool,
cmd_calcconv: bool,
arg_input: Option<String>,
flag_rename: Option<String>,
@@ -484,8 +440,6 @@ struct Args {
}

static CENSOR: OnceLock<Censor> = OnceLock::new();
-static LOCS: OnceLock<Locations> = OnceLock::new();
-static GEOCODER: OnceLock<ReverseGeocoder> = OnceLock::new();
static EUDEX_COMPARAND_HASH: OnceLock<eudex::Hash> = OnceLock::new();
static REGEX_REPLACE: OnceLock<Regex> = OnceLock::new();
static SENTIMENT_ANALYZER: OnceLock<SentimentIntensityAnalyzer> = OnceLock::new();
@@ -511,7 +465,6 @@ enum ApplySubCmd {
Operations,
DateFmt,
DynFmt,
-Geocode,
EmptyReplace,
CalcConv,
}
@@ -608,8 +561,6 @@ pub fn run(argv: &[&str]) -> CliResult<()> {
Err(e) => return Err(e),
}
ApplySubCmd::Operations
-} else if args.cmd_geocode {
-ApplySubCmd::Geocode
} else if args.cmd_datefmt {
ApplySubCmd::DateFmt
} else if args.cmd_dynfmt {
@@ -678,20 +629,6 @@ pub fn run(argv: &[&str]) -> CliResult<()> {
.map(|record_item| {
let mut record = record_item.clone();
match apply_cmd {
-ApplySubCmd::Geocode => {
-let mut cell = record[column_index].to_owned();
-if !cell.is_empty() {
-let search_result = search_cached(&cell, &args.flag_formatstr);
-if let Some(geocoded_result) = search_result {
-cell = geocoded_result;
-}
-}
-if args.flag_new_column.is_some() {
-record.push_field(&cell);
-} else {
-record = replace_column_value(&record, column_index, &cell);
-}
-},
ApplySubCmd::Operations => {
let mut cell = String::new();
for col_index in &*sel {
@@ -829,9 +766,6 @@ pub fn run(argv: &[&str]) -> CliResult<()> {
} // end batch loop

if show_progress {
-if args.cmd_geocode {
-util::update_cache_info!(progress, SEARCH_CACHED);
-}
util::finish_progress(&progress);
}
Ok(wtr.flush()?)
@@ -1291,68 +1225,3 @@ fn apply_operations(
}
}
}

-#[cached(
-key = "String",
-convert = r#"{ format!("{}", cell) }"#,
-option = true,
-sync_writes = false
-)]
-fn search_cached(cell: &str, formatstr: &str) -> Option<String> {
-let geocoder =
-GEOCODER.get_or_init(|| ReverseGeocoder::new(LOCS.get_or_init(Locations::from_memory)));
-
-// regex for Location field. Accepts (lat, long) & lat, long
-let locregex: &'static Regex =
-regex_oncelock!(r"(?-u)([+-]?[0-9]+\.?[0-9]*|\.[0-9]+),\s*([+-]?[0-9]+\.?[0-9]*|\.[0-9]+)");
-
-let loccaps = locregex.captures(cell);
-loccaps.and_then(|loccaps| {
-let lat = fast_float::parse(&loccaps[1]).unwrap_or_default();
-let long = fast_float::parse(&loccaps[2]).unwrap_or_default();
-if (-90.0..=90.0).contains(&lat) && (-180.0..=180.0).contains(&long) {
-let search_result = geocoder.search((lat, long));
-search_result.map(|locdetails| {
-#[allow(clippy::match_same_arms)]
-// match arms are evaluated in order,
-// so we're optimizing for the most common cases first
-match formatstr {
-"%+" | "city-state" => format!(
-"{name}, {admin1}",
-name = locdetails.record.name,
-admin1 = locdetails.record.admin1,
-),
-"city-country" => format!(
-"{name}, {cc}",
-name = locdetails.record.name,
-cc = locdetails.record.cc
-),
-"city-state-country" | "city-admin1-country" => format!(
-"{name}, {admin1} {cc}",
-name = locdetails.record.name,
-admin1 = locdetails.record.admin1,
-cc = locdetails.record.cc
-),
-"city" => locdetails.record.name.to_string(),
-"county" | "admin2" => locdetails.record.admin2.to_string(),
-"state" | "admin1" => locdetails.record.admin1.to_string(),
-"county-country" | "admin2-country" => format!(
-"{admin2}, {cc}",
-admin2 = locdetails.record.admin2,
-cc = locdetails.record.cc
-),
-"county-state-country" | "admin2-admin1-country" => format!(
-"{admin2}, {admin1} {cc}",
-admin2 = locdetails.record.admin2,
-admin1 = locdetails.record.admin1,
-cc = locdetails.record.cc
-),
-"country" => locdetails.record.cc.to_string(),
-_ => locdetails.record.name.to_string(),
-}
-})
-} else {
-None
-}
-})
-}
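
Note on the removed `search_cached` helper: it parses a "lat, long" cell, checks the WGS84 bounds, looks up the nearest city, and formats the result according to `--formatstr`. The parsing and formatting steps can be sketched with the standard library alone. This is only an illustrative sketch, not the qsv implementation: `Place`, `parse_location`, and `format_place` are hypothetical names, the regex is swapped for a stricter `split_once` parse (the whole cell must be the coordinate pair), and the geocoder lookup and the `cached` memoization are left out.

```rust
/// Hypothetical stand-in for the record a reverse geocoder would return.
struct Place {
    name: &'static str,
    admin1: &'static str,
    admin2: &'static str,
    cc: &'static str,
}

/// Parse "lat, long" or "(lat, long)" into a WGS84 pair.
/// Returns None if the cell is malformed or out of range.
fn parse_location(cell: &str) -> Option<(f64, f64)> {
    // Assumption: the cell holds only the coordinate pair, optionally parenthesized.
    let trimmed = cell.trim().trim_start_matches('(').trim_end_matches(')');
    let (lat_s, long_s) = trimmed.split_once(',')?;
    let lat: f64 = lat_s.trim().parse().ok()?;
    let long: f64 = long_s.trim().parse().ok()?;
    // Same bounds check as the removed code.
    if (-90.0..=90.0).contains(&lat) && (-180.0..=180.0).contains(&long) {
        Some((lat, long))
    } else {
        None
    }
}

/// Format a place using the format names listed in the removed usage text.
/// "city" and unrecognized formats fall back to the city name, as in the removed match.
fn format_place(place: &Place, formatstr: &str) -> String {
    match formatstr {
        "%+" | "city-state" => format!("{}, {}", place.name, place.admin1),
        "city-country" => format!("{}, {}", place.name, place.cc),
        "city-state-country" | "city-admin1-country" => {
            format!("{}, {} {}", place.name, place.admin1, place.cc)
        }
        "county" | "admin2" => place.admin2.to_string(),
        "state" | "admin1" => place.admin1.to_string(),
        "county-country" | "admin2-country" => format!("{}, {}", place.admin2, place.cc),
        "county-state-country" | "admin2-admin1-country" => {
            format!("{}, {} {}", place.admin2, place.admin1, place.cc)
        }
        "country" => place.cc.to_string(),
        _ => place.name.to_string(),
    }
}
```

With a `Place` holding the Brooklyn values from the usage text, `format_place(&place, "city-state")` gives `Brooklyn, New York`, and `parse_location("91.0, 0.0")` gives `None` because the latitude is out of range.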

0 comments on commit 7d268dc
