Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

diff: only show headers when there are differences #1326

Closed
jqnatividad opened this issue Sep 27, 2023 · 6 comments · Fixed by #1395
Closed

diff: only show headers when there are differences #1326

jqnatividad opened this issue Sep 27, 2023 · 6 comments · Fixed by #1395
Labels
enhancement New feature or request. Once marked with this label, its in the backlog.

Comments

@jqnatividad
Copy link
Collaborator

@janriemer , I think it'd be great if qsv diff behaves like the diff CLI command, which produces no output when files are identical.

@jqnatividad jqnatividad added the enhancement New feature or request. Once marked with this label, its in the backlog. label Sep 27, 2023
@janriemer
Copy link

Hey @jqnatividad 👋

this definitely sounds like a useful feature to have! 👍 Thank you for bringing this up! I can imagine a use-case, where one can more easily check via a script if there is a diff or not, when there is no output, when files are identical.

At first, I was a bit worried that this would make composing diff results together more difficult:

Scenario: comparing files A and B and then comparing that result (let's call it AB) with C, where all files have headers - all in one command

=> It would be impossible to configure diff's argument options for AB, because we don't know in advance, whether A and B are equal - or, in other words: we don't know, whether AB will have headers applied to it or not.

Suggested solution

After thinking about this a little more, though, I think we can generalize this feature to the following:

=> Let's provide an option that configures, whether the diff result should output headers or not (independent of whether both files are equal).

The current behavior (without this option) is: use headers, when at least one file has headers.

Remaining open questions (and possible solutions)

Note: the names that are used for cli options here probably won't be the exact names in the final implementation.

  • ⚠️ What should happen, when we specify header-in-result=true, but none of the compared files have headers? Where should we draw the header names from?

Possible solutions:

  • ✅ use default column names, like column (this value should also be configurable) and number them: column1, column2, etc. => this is my preferred solution (for the first iteration)
  • ✅ Allow to specify the column names directly, like header-names-in-result='Foo,Bar,Baz,...'
    • ⚠️ this is more error-prone: e.g. what happens, when too few or too many header names are specified (when the number of specified header names don't match the actual number of columns in the CSV)?

What do you think about this? I'm glad, we can improve diff together and make it the best it can be! 😊 🚀

@jqnatividad
Copy link
Collaborator Author

Hi @janriemer! 👋

Thanks for giving this feature request a lot of thought.

Let's go with your preferred solution. The one thing I'd bring to your attention is how to rename handles a similar situation.

For consistency, perhaps, diff should follow what rename does with the _all_generic magic value?

https://github.com/jqnatividad/qsv/blob/master/src/cmd/rename.rs

WDYT?

@janriemer
Copy link

Hi @jqnatividad,

just want to let you know that I'm actively working on this. First tests seem promising! 🙂

Thank you also for the hint regarding rename and _all_generic. 👍 I'm using this now, when neither input files have headers. The only thing I'm struggling with here is: how to know the number of columns that need to be generated?

I think we need to expose the number of columns somehow in csv-diff crate - more precisely in DiffByteRecords (which is used by diff command).

Nothing, we can't solve, though. 😉

I'll probably give another update on Wednesday or next weekend. 🤞

@janriemer
Copy link

janriemer commented Oct 22, 2023

Hi @jqnatividad 👋

another update: an MR for the crate csv-diff is now in review (by me). Once merged, it will allow csv-diff to:

  • get the number of columns that a DiffByteRecord will have
    • this pretty much includes all cases, like
      • when there is no diff, but at least one CSV has headers
      • when there is a diff, but no headers
      • etc.
  • get at the headers in the diff result (if originally provided via has_headers flag in csv)
    • this will allow us to more easily output the headers in the result

After the acceptance of the MR, I'll prepare a new release of csv-diff (it will be 0.1.0 - first non-alpha/-beta release (:tada:)) and we can use the new functionality for this enhancement in diff. Very exciting! ✨ ☺️

Next steps


Some anecdote (not important, only if you're interested)

Providing the above functionality in csv-diff hasn't been that easy, but it has shown me a very good path on how to provide similar functionality in the future (in the past, I've tried similar things, but failed). What I mean by that is, that it is almost like "sniffing" the first rows of the CSVs now, which is difficult when the goal is to do everything in a single pass, due to parsing and diffing happening in different threads (I know that there is the qsv-sniffer crate, but because of what I've just mentioned, I'm not sure it is suitable for csv_diff).

@jqnatividad
Copy link
Collaborator Author

Thanks for the update @janriemer !

I look forward to merging the first non-beta powered diff powered by csv-diff!

And I can emphatize with your journey with it - I've gone down many dead-ends with qsv myself, and even ended up reverting some code several times and removing features (like auto-transcoding to utf-8).

It's interesting you mention qsv-sniffer as that was something I adapted from csv-sniffer and the Viterbi algorithm at its heart is still somewhat opaque to me, which prevents me from really making it a bulletproof sniffer, as there are still some valid CSVs that it sometimes fails to sniff.

BTW, speaking of sniff, have you considered adding an additional mode to diff to return JSON metadata about the diff rather than the diff result?

I do this in several commands in addition to sniff - excel, safenames, sortcheck & validate.

qsv is meant to be used in pipelines as well, and having machine-readable JSON would be great!

@janriemer
Copy link

Hey @jqnatividad

sorry for the late reply. 😳 The PR for this issue is ready at #1395! 🎉

Thank you for sharing your experience of your journey with qsv and the difficulties you've encountered along the way. qsv is a very complex project and I'm very impressed by what you've achieved so far! 🎩

Yeah, I can imagine sniff being difficult. 🥴

Yes, thank you for the idea regarding JSON output. 👍 It is on my list of TODOs, but "unfortunately" this list gets longer and longer. 😄 I'd like to have some other things take priority right now, that are more kind of stabilizations, like:

  • deal with CSVs that have headers, but have different ordering
  • deal with CSVs that are flexible
  • provide a cryptographically secure version for diffing (currently xxhash, a 128-bit non-crypto-hash, is used) and make it configurable with cargo features which hash to use
  • refactor module organization and cargo features, so that cargo features are all compatible with each other

But I'll see what I can do. 😉 With serde this might be relatively easy, but I can't really tell right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request. Once marked with this label, its in the backlog.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants