Why use JSON Schema for validating CSV files? Why not JSON Table Schema? #322
jqnatividad started this conversation in FAQ
When the `schema` and `validate` command specifications were first written, the intent was to use JSON Table Schema, which is specifically designed for tabular data. Of course, being active in the CKAN community, I also considered using Frictionless Data; however, it was limited to Python.

After surveying the available crates that could be leveraged to build these commands, it became clear that we had to use JSON Schema instead, using the jsonschema crate.
And `validate` is quite performant: it validates a million rows in less than 3 seconds.[^1] That said, as the jsonschema crate is still evolving, qsv will also support JSON Table Schema if and when it becomes doable/available in Rust.
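To make the approach concrete, here is a toy sketch of what validating CSV rows against a JSON Schema amounts to: each row is treated as a JSON-like object and each cell is checked against its column's rules. This is written in Python for brevity and is not qsv's implementation (qsv uses the Rust jsonschema crate and supports the full spec); the schema, column names, and helper functions below are hypothetical, and only the `type` and `enum` keywords are handled.

```python
# Toy CSV-against-JSON-Schema validator. Handles only a tiny subset of
# the spec ("type" and "enum" per column) for illustration purposes.
import csv
import io

def check_value(value, rules):
    """Return a list of violation messages for one CSV cell."""
    errors = []
    coerced = value
    expected = rules.get("type")
    if expected == "integer":
        try:
            coerced = int(value)
        except ValueError:
            errors.append(f"{value!r} is not an integer")
    elif expected == "number":
        try:
            coerced = float(value)
        except ValueError:
            errors.append(f"{value!r} is not a number")
    # enum check runs against the coerced value, mirroring how JSON Schema
    # validates typed instances rather than raw strings
    if "enum" in rules and coerced not in rules["enum"]:
        errors.append(f"{coerced!r} not in enum {rules['enum']}")
    return errors

def validate_csv(csv_text, schema):
    """Yield (row_number, column, message) for every violation found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_num, row in enumerate(reader, start=1):
        for column, rules in schema["properties"].items():
            for msg in check_value(row.get(column, ""), rules):
                yield (row_num, column, msg)

# Hypothetical mini-schema, loosely modeled on the 311 example in the post.
schema = {
    "properties": {
        "borough": {
            "type": "string",
            "enum": ["MANHATTAN", "BROOKLYN", "QUEENS",
                     "BRONX", "STATEN ISLAND"],
        },
        "incident_zip": {"type": "integer"},
    }
}

data = "borough,incident_zip\nBROOKLYN,11201\nGOTHAM,abc\n"
violations = list(validate_csv(data, schema))
# Row 2 fails both checks: "GOTHAM" is not in the enum, and "abc" is
# not an integer.
```

The real `validate` command does the same kind of per-record check, but compiles the schema once up front and fans records out across threads, which is where the sub-3-second throughput comes from.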
Footnotes

[^1]: Using this 1M row sample of NYC 311 Data and this JSON Schema file, generated from the first 50,000 rows of the same 1M row 311 sample. There were 2,995 "invalid" rows, as the 50K sample didn't contain certain enum values that appear later in the full 1M row sample. The benchmark was run on a Ryzen 4800H laptop with 8 physical/16 logical cores and 32 GB of memory; `validate` is multi-threaded and used all 16 cores.