`diff`: only show headers when there are differences #1326

jqnatividad · 2023-09-27T11:11:54Z

@janriemer , I think it'd be great if qsv diff behaves like the diff CLI command, which produces no output when files are identical.


janriemer · 2023-10-01T13:16:57Z

Hey @jqnatividad 👋

this definitely sounds like a useful feature to have! 👍 Thank you for bringing this up! I can imagine a use case where a script checks whether there is a diff simply by testing for output: no output would mean the files are identical.
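The script use case above could be sketched as follows. This assumes `qsv` is on the `PATH` and that `qsv diff` behaves as requested in this issue (no output for identical files); the helper names `is_empty_output` and `files_differ` are made up for illustration.

```rust
use std::io;
use std::process::Command;

/// Returns true when the captured output contains nothing but whitespace.
fn is_empty_output(stdout: &[u8]) -> bool {
    stdout.iter().all(|b| b.is_ascii_whitespace())
}

/// Runs `qsv diff <left> <right>` and reports whether any output was produced.
/// Assumption: identical files produce no output, as proposed in this issue.
fn files_differ(left: &str, right: &str) -> io::Result<bool> {
    let output = Command::new("qsv")
        .arg("diff")
        .arg(left)
        .arg(right)
        .output()?;
    Ok(!is_empty_output(&output.stdout))
}
```

A caller would then branch on `files_differ("a.csv", "b.csv")?` instead of parsing the diff result itself.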

At first, I was a bit worried that this would make composing diff results together more difficult:

Scenario: comparing files A and B and then comparing that result (let's call it AB) with C, where all files have headers - all in one command

=> It would be impossible to configure diff's argument options for AB, because we don't know in advance, whether A and B are equal - or, in other words: we don't know, whether AB will have headers applied to it or not.

Remaining open questions (and possible solutions)

Note: the names that are used for cli options here probably won't be the exact names in the final implementation.

⚠️ What should happen, when we specify header-in-result=true, but none of the compared files have headers? Where should we draw the header names from?

Possible solutions:

✅ use default column names, like column (this value should also be configurable) and number them: column1, column2, etc. => this is my preferred solution (for the first iteration)
✅ Allow to specify the column names directly, like header-names-in-result='Foo,Bar,Baz,...'
- ⚠️ this is more error-prone: e.g. what happens when too few or too many header names are specified (when the number of specified header names doesn't match the actual number of columns in the CSV)?
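To make the two options above concrete, here is a minimal sketch. The function names are hypothetical and not part of qsv or csv-diff, and the comma splitting is deliberately naive (it does not handle quoted names). It shows numbered default names and one possible answer to the mismatch question: reject the input with an error.

```rust
/// Builds generic header names such as `column1`, `column2`, ...
/// The prefix is a parameter because the default name should be configurable.
fn generic_headers(prefix: &str, num_columns: usize) -> Vec<String> {
    (1..=num_columns).map(|i| format!("{prefix}{i}")).collect()
}

/// Parses user-supplied header names (e.g. "Foo,Bar,Baz") and rejects the
/// input when the number of names does not match the number of columns.
fn custom_headers(spec: &str, num_columns: usize) -> Result<Vec<String>, String> {
    let names: Vec<String> = spec.split(',').map(|s| s.trim().to_string()).collect();
    if names.len() == num_columns {
        Ok(names)
    } else {
        Err(format!(
            "expected {} header names, but got {}",
            num_columns,
            names.len()
        ))
    }
}
```

Failing early on a mismatch is only one possible policy; padding with generic names or truncating would be alternatives.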

What do you think about this? I'm glad, we can improve diff together and make it the best it can be! 😊 🚀

jqnatividad · 2023-10-01T15:59:38Z

Hi @janriemer! 👋

Thanks for giving this feature request a lot of thought.

Let's go with your preferred solution. The one thing I'd bring to your attention is how rename handles a similar situation.

For consistency, perhaps, diff should follow what rename does with the _all_generic magic value?

https://github.com/jqnatividad/qsv/blob/master/src/cmd/rename.rs

WDYT?

janriemer · 2023-10-08T20:55:39Z

Hi @jqnatividad,

just want to let you know that I'm actively working on this. First tests seem promising! 🙂

Thank you also for the hint regarding rename and _all_generic. 👍 I'm using this now when neither input file has headers. The only thing I'm struggling with here is: how do we know the number of columns that need to be generated?

I think we need to expose the number of columns somehow in csv-diff crate - more precisely in DiffByteRecords (which is used by diff command).

Nothing we can't solve, though. 😉

I'll probably give another update on Wednesday or next weekend. 🤞

janriemer · 2023-10-22T22:08:45Z

Hi @jqnatividad 👋

another update: an MR for the crate csv-diff is now in review (by me). Once merged, it will allow csv-diff to:

- get the number of columns that a DiffByteRecord will have
  - this pretty much includes all cases, like
    - when there is no diff, but at least one CSV has headers
    - when there is a diff, but no headers
    - etc.
- get at the headers in the diff result (if originally provided via has_headers flag in csv)
  - this will allow us to more easily output the headers in the result
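The cases listed above can be illustrated with a small model. `DiffSummary` is a hypothetical stand-in and not the csv-diff API. It only shows the fallback order one might use: take the column count from the headers when they exist, otherwise from the first differing row.

```rust
/// Hypothetical stand-in for a diff result; NOT the actual csv-diff type.
struct DiffSummary {
    /// Headers, if at least one of the compared CSVs had them.
    headers: Option<Vec<String>>,
    /// Rows that differ between the two CSVs.
    changed_rows: Vec<Vec<String>>,
}

impl DiffSummary {
    /// Column count from the headers if present, otherwise from the first
    /// changed row. Returns `None` when there are neither headers nor a diff.
    fn num_columns(&self) -> Option<usize> {
        self.headers
            .as_ref()
            .map(Vec::len)
            .or_else(|| self.changed_rows.first().map(Vec::len))
    }
}
```

In this model, the case with no headers and no differences yields `None`, so a caller has nothing to generate generic column names from.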

After the acceptance of the MR, I'll prepare a new release of csv-diff (it will be 0.1.0, the first non-alpha/-beta release 🎉) and we can use the new functionality for this enhancement in diff. Very exciting! ✨ ☺️

Next steps

- Merge MR in csv-diff => see MR 26 and MR 27
- Release new version 0.1.0 of csv-diff => see tag v0.1.0
- Implement this enhancement in diff and create a PR => see diff: add option/flag for headers in output #1395

Some anecdote (not important, only if you're interested)

Providing the above functionality in csv-diff wasn't easy, but it has shown me a very good path for providing similar functionality in the future (I had tried similar things in the past, but failed). It now works almost like "sniffing" the first rows of the CSVs, which is difficult when the goal is to do everything in a single pass, because parsing and diffing happen in different threads. (I know that there is the qsv-sniffer crate, but because of what I've just mentioned, I'm not sure it is suitable for csv_diff.)

jqnatividad · 2023-10-23T09:33:01Z

Thanks for the update @janriemer !

I look forward to merging the first diff powered by a non-beta csv-diff!

And I can empathize with your journey with it. I've gone down many dead ends with qsv myself, and have even ended up reverting some code several times and removing features (like auto-transcoding to UTF-8).

It's interesting you mention qsv-sniffer as that was something I adapted from csv-sniffer and the Viterbi algorithm at its heart is still somewhat opaque to me, which prevents me from really making it a bulletproof sniffer, as there are still some valid CSVs that it sometimes fails to sniff.

BTW, speaking of sniff, have you considered adding an additional mode to diff to return JSON metadata about the diff rather than the diff result?

I do this in several commands in addition to sniff - excel, safenames, sortcheck & validate.

qsv is meant to be used in pipelines as well, and having machine-readable JSON would be great!
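As a rough idea of what such machine-readable metadata could look like, here is a sketch. The `DiffStats` struct and its JSON field names are assumptions for illustration only; nothing in this thread specifies the actual shape. The JSON is formatted by hand to keep the example dependency-free.

```rust
/// Hypothetical summary of a diff; the field names are illustrative only.
struct DiffStats {
    added: usize,
    deleted: usize,
    modified: usize,
}

impl DiffStats {
    /// Serializes the summary as a compact JSON object.
    /// All values are numbers or booleans, so no string escaping is needed.
    fn to_json(&self) -> String {
        let identical = self.added == 0 && self.deleted == 0 && self.modified == 0;
        format!(
            "{{\"identical\":{},\"added\":{},\"deleted\":{},\"modified\":{}}}",
            identical, self.added, self.deleted, self.modified
        )
    }
}
```

A pipeline step could then read the `identical` field instead of inspecting the diff rows.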

janriemer · 2023-10-31T17:19:56Z

Hey @jqnatividad

sorry for the late reply. 😳 The PR for this issue is ready at #1395! 🎉

Thank you for sharing your experience of your journey with qsv and the difficulties you've encountered along the way. qsv is a very complex project and I'm very impressed by what you've achieved so far! 🎩

Yeah, I can imagine sniff being difficult. 🥴

Yes, thank you for the idea regarding JSON output. 👍 It is on my list of TODOs, but "unfortunately" this list gets longer and longer. 😄 For now, I'd like to prioritize some other things that are more about stabilization, like:

- deal with CSVs that have headers, but have different ordering
- deal with CSVs that are flexible
- provide a cryptographically secure version for diffing (currently xxhash, a 128-bit non-crypto-hash, is used) and make it configurable with cargo features which hash to use
- refactor module organization and cargo features, so that cargo features are all compatible with each other
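The idea of a swappable hash from the list above could be sketched like this. It is not how csv-diff is implemented: it uses the standard library's `DefaultHasher` (64-bit output) as a stand-in for xxhash, and `row_fingerprint` is a hypothetical name. The point is only that a generic parameter keeps the hashing algorithm replaceable.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Computes a fingerprint of one CSV row with a caller-chosen hasher.
/// Each field is hashed separately, so ["ab", "c"] and ["a", "bc"] feed
/// different byte streams into the hasher.
fn row_fingerprint<H: Hasher + Default>(fields: &[&str]) -> u64 {
    let mut hasher = H::default();
    for field in fields {
        field.hash(&mut hasher);
    }
    hasher.finish()
}
```

For example, `row_fingerprint::<DefaultHasher>(&["1", "foo"])` fingerprints a row with the default hasher; another type implementing `Hasher + Default` could be substituted, for instance behind a cargo feature.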

But I'll see what I can do. 😉 With serde this might be relatively easy, but I can't really tell right now.

jqnatividad added the enhancement New feature or request. Once marked with this label, its in the backlog. label Sep 27, 2023

janriemer mentioned this issue Oct 31, 2023

diff: add option/flag for headers in output #1395

Merged

janriemer mentioned this issue Oct 31, 2023

diff: Provide an option to set the delimiter for the output #1396

Closed

2 tasks

jqnatividad closed this as completed in #1395 Nov 1, 2023
