Inconsistent file ordering when generating checksums #393

Open
superbonaci opened this issue Apr 3, 2020 · 3 comments

@superbonaci

I ran md5deep -r folder/ several times in a row and was surprised that the file order can differ from run to run, even though the contents are exactly the same. For example:

First checksum generation:

md5deep -r folder/ > 1.md5
xxxx  clonezilla.iso
xxxx  debian.iso
xxxx  gparted.iso
xxxx  Mac.iso
xxxx  archlinux.iso
xxxx  memtest86-usb.zip

Second checksum generation:

md5deep -r folder/ > 2.md5
xxxx  clonezilla.iso
xxxx  gparted.iso
xxxx  debian.iso
xxxx  Mac.iso
xxxx  memtest86-usb.zip
xxxx  Microsoft.exe

I can't run diff 1.md5 2.md5 directly because the order differs; I have to sort both files first. Can this be fixed, or is the file list generated in random order before the checksums are computed?

@superbonaci
Author

Maybe it's because of this: #394

@paulhargreaves

Do this: diff <(sort 1.md5) <(sort 2.md5)
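
For shells without process substitution (plain sh, for example), the same comparison can be done with temporary files. This is only a rough sketch: compare_checksums is a made-up helper name, and the hashes are placeholders rather than real md5deep output.

```shell
# Hypothetical helper: compare two md5deep-style lists regardless of line order.
# Sorts both files on the filename field, then diffs the sorted copies.
compare_checksums() {
    a_sorted=$(mktemp) || return 2
    b_sorted=$(mktemp) || return 2
    sort -k2 "$1" > "$a_sorted"
    sort -k2 "$2" > "$b_sorted"
    diff "$a_sorted" "$b_sorted"
    status=$?
    rm -f "$a_sorted" "$b_sorted"
    return "$status"
}

# Sample data standing in for real md5deep output (hashes are made up).
workdir=$(mktemp -d) || exit 1
printf '%s\n' 'aaaa  clonezilla.iso' 'bbbb  debian.iso' 'cccc  gparted.iso' > "$workdir/1.md5"
printf '%s\n' 'aaaa  clonezilla.iso' 'cccc  gparted.iso' 'bbbb  debian.iso' > "$workdir/2.md5"

# Same entries in a different order, so the sorted comparison finds no difference.
if compare_checksums "$workdir/1.md5" "$workdir/2.md5"; then
    result="identical"
else
    result="different"
fi
echo "$result"
rm -rf "$workdir"
```

The helper returns diff's exit status, so it can be used directly in an if statement or a script that should fail when the two lists disagree.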

@ashugg

ashugg commented Feb 12, 2025

I can't run diff 1.md5 2.md5 directly because the order differs; I have to sort both files first. Can this be fixed, or is the file list generated in random order before the checksums are computed?

md5deep/hashdeep writes its checksum output in the order in which the hashing threads finish their work. From the man page:

By default the program will create one producer thread to scan the file system and one hashing thread per CPU core. Multi-threading causes output filenames to be in non-deterministic order, as files that take longer to hash will be delayed while they are hashed. If a deterministic order is required, specify -j0 to disable multi-threading.
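
To make the effect concrete, here is a toy simulation. It is not how md5deep is implemented; hash_file is a hypothetical stand-in where the sleep time plays the role of file size, and the "xxxx" hashes are placeholders.

```shell
# Hypothetical stand-in for a hashing worker: $1 = filename, $2 = seconds of "work".
hash_file() {
    sleep "$2"
    echo "xxxx  $1"
}

# Start three workers in parallel and collect their output as each one finishes.
completion_order=$(
    hash_file big.iso 3 &
    hash_file small.zip 1 &
    hash_file medium.iso 2 &
    wait
)

# Lines appear in completion order (small, medium, big), not in start order.
echo "$completion_order"
```

With real files the finishing order depends on file sizes, disk caching and scheduling, which is why it can change between runs.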

So, as well as the suggestion about sorting the checksums when you diff them, you have a few options for producing sorted output.

(1) If you don't care about how long the job takes to run, use the -j0 flag, for example:

md5deep -rj0 folder/ > 1.md5

(2) Keep multi-threading enabled and pipe the output through sort, for example:

md5deep -r folder/ | sort -k2 > 1.md5
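
One thing to watch when sorting: the order sort produces depends on the locale, so two machines can sort the same list differently (mixed-case names like Mac.iso are the usual culprit). A small sketch, assuming you want plain byte order everywhere; the file names and "xxxx" hashes are placeholders:

```shell
# Sample list standing in for md5deep output (hashes are made up).
workdir=$(mktemp -d) || exit 1
printf '%s\n' 'xxxx  debian.iso' 'xxxx  Mac.iso' 'xxxx  archlinux.iso' > "$workdir/unsorted.md5"

# LC_ALL=C forces byte-order comparison, so uppercase names sort before lowercase ones.
LC_ALL=C sort -k2 "$workdir/unsorted.md5" > "$workdir/sorted.md5"
first=$(head -n 1 "$workdir/sorted.md5")

# sort -c only checks the order and exits non-zero if the file is not sorted.
if LC_ALL=C sort -c -k2 "$workdir/sorted.md5"; then
    is_sorted=yes
else
    is_sorted=no
fi
echo "$first / sorted: $is_sorted"
rm -rf "$workdir"
```

Using the same LC_ALL setting for every run keeps the sorted files directly comparable with diff.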

I prefer to use hashdeep, which has more complex output. I'll just note my method here for sorting (using Bash shell):

for d in folder1 folder2 folder3 ; do
  printf "\n\nChecksumming '%s':\n" "$d"
  TMPFILE=$(mktemp -t HASHDEEP.XXXXXX) || exit 1
  hashdeep -relof "$d" | tee "$TMPFILE"
  ( grep "^[%#]" "$TMPFILE" ; grep -v "^[%#]" "$TMPFILE" | sort -t, -k4 ) > "$d.HASHDEEP.txt"
  rm -f "$TMPFILE"
done

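This shows the hashing progress for each file (on stderr) while tee writes the checksums to a temporary file. The header lines (those starting with % or #) are then copied unchanged, and the remaining lines are sorted by filename (the fourth comma-separated field) into $d.HASHDEEP.txt.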
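
If you want to try the header-preserving sort without running hashdeep, here is a sketch on a made-up listing. sort_hashdeep_listing is a hypothetical helper name, and the sample lines only imitate the size,md5,sha256,filename layout; they are not real hashdeep output.

```shell
# Hypothetical helper: print header lines first, then entries sorted by filename.
sort_hashdeep_listing() {
    grep '^[%#]' "$1"
    grep -v '^[%#]' "$1" | sort -t, -k4
}

# Made-up sample in the hashdeep layout, with entries deliberately out of order.
workdir=$(mktemp -d) || exit 1
cat > "$workdir/raw.txt" <<'EOF'
%%%% HASHDEEP-1.0
%%%% size,md5,sha256,filename
## made-up sample, not real hashdeep output
30,cccc,ffff,folder1/gparted.iso
10,aaaa,dddd,folder1/clonezilla.iso
20,bbbb,eeee,folder1/debian.iso
EOF

sorted=$(sort_hashdeep_listing "$workdir/raw.txt")
printf '%s\n' "$sorted"
rm -rf "$workdir"
```

Because the sort key runs from the fourth field to the end of the line, filenames that themselves contain commas are still compared as a whole.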
