Releases: Bulat-Ziganshin/DataSmoke

2-pass 32/64-bit coverage smokers and -bBUFSIZE option

12 Feb 15:28

Changes:

  • Added "2-pass DWord/QWord coverage" smokers that do the same as the 1-pass coverage smoker, but automatically select the most populated sector
  • Added the "DWord hash entropy" smoker, which is, of course, absolutely useless
  • The -bBUFSIZE option selects the size of analyzed blocks: -b64k, -b4m/-b4, -b1g
  • print the numbers of the first 10 incompressible blocks
  • hashing uses the SSE 4.2 crc32c instruction to make the distribution more even
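
The -bBUFSIZE examples above suggest a number with an optional k/m/g suffix, where a bare number means megabytes (-b4 is listed as equivalent to -b4m). A minimal sketch of that reading follows; `parse_bufsize` is a hypothetical name, and the exact parsing rules of the tool are an assumption:

```python
import re

# Hypothetical helper: the name and the exact rules are assumptions
# inferred from the examples -b64k, -b4m/-b4 and -b1g.
_UNITS = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}


def parse_bufsize(option: str) -> int:
    """Convert an option such as '-b64k' into a block size in bytes."""
    match = re.fullmatch(r"-b(\d+)([kmg]?)", option)
    if match is None:
        raise ValueError(f"unrecognized block size option: {option!r}")
    number, suffix = match.groups()
    # A bare number is read as megabytes, because -b4 is listed next to -b4m.
    return int(number) * _UNITS[suffix or "m"]
```

Under these assumptions, `parse_bufsize("-b4")` and `parse_bufsize("-b4m")` return the same value.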

Tuned DWord coverage algorithm plus printing amount of incompressible blocks

09 Feb 18:55

Changes:

  • DWord coverage modified to use STEP=1 (required to detect duplicated compressed data) and HASHSIZE = 2*mb (required for more precise coverage computation; previously it reported 88% on random data)
  • print the number of incompressible blocks (entropy or coverage >95%)
  • display results as Markdown-friendly tables
  • calculate min/max entropy only on complete 4MB blocks, so the average entropy may now be lower than the minimum or higher than the maximum 😆
  • print everything to stdout instead of stderr
  • substantially updated README
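
The idea behind the DWord coverage change can be sketched as follows: hash the 32-bit value at every byte offset (STEP=1) into a table of HASHSIZE slots and report how many slots were hit. This is only an illustration under stated assumptions: `dword_coverage` is a hypothetical name, `zlib.crc32` is a stand-in hash, and the tool may normalize its result differently from the raw slot fraction returned here.

```python
import zlib


def dword_coverage(block: bytes, table_size: int = 2 * 1024 * 1024,
                   step: int = 1) -> float:
    """Fraction of hash-table slots hit by the 32-bit values in `block`."""
    seen = bytearray(table_size)  # one flag per slot, initially 0
    for pos in range(0, len(block) - 3, step):
        dword = block[pos:pos + 4]
        # Stand-in hash (assumption): plain CRC-32 from zlib.
        seen[zlib.crc32(dword) % table_size] = 1
    return sum(seen) / table_size
```

With step=1, a chunk that is repeated at any byte offset produces the same 32-bit values again and therefore adds no new slots, which is one way to read the remark about duplicated compressed data.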

Initial implementations of byte/word/dword/order1 smokers

08 Feb 19:37

The full list of smells (speeds measured on a single core of an i7-4770):

  • ByteSmoker: computes entropy of individual bytes (2 GB/s).
  • WordSmoker: computes entropy of 16-bit words (0.7-1.5 GB/s).
  • DWordSmoker: computes entropy of 32-bit dwords (3 GB/s).
  • Order1Smoker: computes order-1 entropy of 8-bit bytes (0.7-1.5 GB/s).
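
As an illustration of what the byte and order-1 measures mean, here is a minimal sketch that computes both as a percentage of the 8-bit maximum. The function names are hypothetical, and the normalization to 8 bits is an assumption; the tool's own implementation may differ.

```python
import math
from collections import Counter


def byte_entropy(block: bytes) -> float:
    """Shannon entropy of individual bytes, as a percentage of 8 bits."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)
    bits = -sum(c / total * math.log2(c / total) for c in counts.values())
    return 100.0 * bits / 8


def order1_entropy(block: bytes) -> float:
    """Entropy of a byte given the previous byte, as a percentage of 8 bits."""
    total = len(block) - 1  # number of (previous, current) pairs
    if total <= 0:
        return 0.0
    pairs = Counter(zip(block, block[1:]))
    prev = Counter(block[:-1])
    # H(X_i | X_{i-1}) = -sum over pairs of p(a, b) * log2 p(b | a)
    bits = -sum(c / total * math.log2(c / prev[a])
                for (a, _), c in pairs.items())
    return 100.0 * bits / 8
```

Data in which every byte is fully determined by its predecessor, such as a strictly alternating pattern, has an order-1 entropy of zero even though its byte entropy is not zero.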

And examples of their work:

Text file (enwik9):

  • ByteSmoker entropy: minimum 62.68%, average 64.20%, maximum 66.97%
  • WordSmoker entropy: minimum 53.14%, average 55.97%, maximum 57.93%
  • Order1Smoker entropy: minimum 42.43%, average 47.75%, maximum 48.88%
  • DWordSmoker entropy: minimum 4.14%, average 10.37%, maximum 16.01%

Binary file:

  • ByteSmoker entropy: minimum 48.49%, average 77.67%, maximum 93.62%
  • WordSmoker entropy: minimum 33.09%, average 68.74%, maximum 92.00%
  • Order1Smoker entropy: minimum 17.69%, average 59.81%, maximum 90.39%
  • DWordSmoker entropy: minimum 1.78%, average 31.92%, maximum 92.00%

Compressed file:

  • ByteSmoker entropy: minimum 100.00%, average 100.00%, maximum 100.00%
  • WordSmoker entropy: minimum 99.75%, average 99.93%, maximum 99.93%
  • Order1Smoker entropy: minimum 99.49%, average 99.86%, maximum 99.86%
  • DWordSmoker entropy: minimum 96.20%, average 96.95%, maximum 98.04%
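
The minimum/average/maximum triples above can be read as per-block statistics. The sketch below assumes a size-weighted average over all blocks, with minimum and maximum taken only over complete 4MB blocks as described in the 09 Feb release notes; `summarize` and its parameters are hypothetical names.

```python
def summarize(block_sizes, block_scores, full_block=4 * 1024 * 1024):
    """Return (minimum, average, maximum) of per-block scores in percent."""
    total = sum(block_sizes)
    # Size-weighted average over every block, including a short final one
    # (assumption about how the average is formed).
    average = sum(s * v for s, v in zip(block_sizes, block_scores)) / total
    # Minimum and maximum look at complete blocks only.
    complete = [v for s, v in zip(block_sizes, block_scores) if s == full_block]
    candidates = complete or list(block_scores)
    return min(candidates), average, max(candidates)
```

Because a short final block counts toward the average but not toward the extremes, the average can fall outside the reported minimum/maximum range.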