Fix fwrite length for gzip output #6393

Merged
merged 59 commits into from
Jan 15, 2025
Conversation

philippechataignon
Contributor

@philippechataignon philippechataignon commented Aug 23, 2024

Closes #6356. Closes #5506.

This PR is an attempt to make fwrite produce a better gzip file. It's a substantial rewrite because it includes some refactoring of the existing code.

zlib

  • Use Z_SYNC_FLUSH instead of Z_FINISH.
  • Write the gzip header manually and write the CRC and length in the trailer.
  • Compute the length and CRC of each chunk in its thread, then combine them in the main thread.
  • The length field in the gzip specification is 32 bits, so for an uncompressed size > 4 GiB it holds the size modulo 2^32.
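
As a rough illustration of the "manual header, CRC and length in the trailer" bullets, here is a minimal sketch of the fixed gzip framing described in RFC 1952. The helper names (`write_gzip_header`, `write_gzip_trailer`, `put_le32`) are hypothetical and are not taken from this PR's code, and the OS byte of 255 ("unknown") is simply a choice made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes of the fixed gzip member header and trailer (RFC 1952). */
enum { GZIP_HEADER_LEN = 10, GZIP_TRAILER_LEN = 8 };

/* Store a 32-bit value least-significant byte first, as gzip requires. */
static void put_le32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v & 0xFFu);
    p[1] = (uint8_t)((v >> 8) & 0xFFu);
    p[2] = (uint8_t)((v >> 16) & 0xFFu);
    p[3] = (uint8_t)((v >> 24) & 0xFFu);
}

/* Hypothetical helper: minimal header with no file name, no mtime and no
 * extra fields. Returns the number of bytes written. */
size_t write_gzip_header(uint8_t *buf) {
    assert(buf != NULL);
    static const uint8_t hdr[GZIP_HEADER_LEN] = {
        0x1f, 0x8b,  /* magic number */
        8,           /* CM: deflate */
        0,           /* FLG: no optional fields */
        0, 0, 0, 0,  /* MTIME: not set */
        0,           /* XFL */
        255          /* OS: unknown (assumption of this sketch) */
    };
    for (size_t i = 0; i < GZIP_HEADER_LEN; i++) buf[i] = hdr[i];
    return GZIP_HEADER_LEN;
}

/* Hypothetical helper: trailer = CRC-32 of the uncompressed data followed by
 * ISIZE, the uncompressed length modulo 2^32. Returns the bytes written. */
size_t write_gzip_trailer(uint8_t *buf, uint32_t crc, uint64_t total_len) {
    assert(buf != NULL);
    put_le32(buf, crc);
    put_le32(buf + 4, (uint32_t)(total_len & 0xFFFFFFFFu));
    return GZIP_TRAILER_LEN;
}
```

The raw deflate data produced by the worker threads would sit between these two pieces; only the low 32 bits of the 64-bit running length survive in the trailer.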

C code

  • Simplify the implementation to a single #pragma omp parallel for over the chunk loop, with #pragma omp ordered for the writing and summarizing part.
  • Matt Dowle introduced the use of a pool of buffers; this PR generalizes the idea. The pools are created at the beginning and then used for writing both headers and rows. All the malloc calls occur early, and a separate header buffer is no longer needed.
  • Deobfuscate some parts: in particular, if ( ) is now followed by a new line, and there is no more =- or =*. A lot of work remains. Could the indent command be used?
  • Remove some old debug code (msg).
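
The chunk loop described above can be sketched as follows. This is not the PR's code: `checksum_in_chunks`, `crc32_bytes` and `crc32_join` are hypothetical names, the CRC is a slow bitwise version so the example compiles without zlib, and `crc32_join` follows the same GF(2) matrix idea as zlib's `crc32_combine()`. Each iteration computes its chunk's CRC in parallel, and the `ordered` region folds the results in chunk order, which is also where the real code would write the compressed chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 with the gzip polynomial: slow, but needs no library. */
static uint32_t crc32_bytes(const uint8_t *buf, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Multiply a 32x32 GF(2) matrix by a bit vector. */
static uint32_t gf2_times(const uint32_t *mat, uint32_t vec) {
    uint32_t sum = 0;
    while (vec) {
        if (vec & 1u) sum ^= *mat;
        vec >>= 1;
        mat++;
    }
    return sum;
}

/* Square a 32x32 GF(2) matrix. */
static void gf2_square(uint32_t *square, const uint32_t *mat) {
    for (int n = 0; n < 32; n++) square[n] = gf2_times(mat, mat[n]);
}

/* CRC of A||B from crc(A), crc(B) and len(B). */
static uint32_t crc32_join(uint32_t crc1, uint32_t crc2, uint64_t len2) {
    uint32_t even[32], odd[32];
    if (len2 == 0) return crc1;
    odd[0] = 0xEDB88320u;                 /* operator for one zero bit */
    for (int n = 1; n < 32; n++) odd[n] = 1u << (n - 1);
    gf2_square(even, odd);                /* two zero bits */
    gf2_square(odd, even);                /* four zero bits */
    do {                                  /* first pass: one zero byte */
        gf2_square(even, odd);
        if (len2 & 1) crc1 = gf2_times(even, crc1);
        len2 >>= 1;
        if (len2 == 0) break;
        gf2_square(odd, even);
        if (len2 & 1) crc1 = gf2_times(odd, crc1);
        len2 >>= 1;
    } while (len2 != 0);
    return crc1 ^ crc2;
}

typedef struct { uint32_t crc; uint64_t len; } gz_totals;

/* Hypothetical chunk loop: per-chunk work runs in parallel, totals are
 * combined strictly in chunk order inside the ordered region. */
gz_totals checksum_in_chunks(const uint8_t *data, size_t n, size_t chunk) {
    assert(chunk > 0);
    gz_totals tot = {0, 0};
    long nchunks = (long)((n + chunk - 1) / chunk);
    #pragma omp parallel for ordered schedule(dynamic)
    for (long i = 0; i < nchunks; i++) {
        size_t start = (size_t)i * chunk;
        size_t len = (n - start < chunk) ? n - start : chunk;
        uint32_t crc = crc32_bytes(data + start, len);  /* parallel part */
        #pragma omp ordered
        {
            tot.crc = crc32_join(tot.crc, crc, len);    /* serial, in order */
            tot.len += len;
        }
    }
    return tot;
}
```

Compiled without OpenMP support the pragmas are ignored and the loop simply runs serially, giving the same totals.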


github-actions bot commented Aug 23, 2024

Comparison Plot

Generated via commit a11f8b5

Download link for the artifact containing the test results: ↓ atime-results.zip

Task Duration
R setup and installing dependencies 4 minutes and 36 seconds
Installing different package versions 8 minutes and 11 seconds
Running and plotting the test cases 2 minutes and 19 seconds

@oliverfoster

oliverfoster commented Aug 23, 2024

len in gzip specification is 32 bits and then is false for uncompressed size > 4GiB

Isn't it meant to be the length modulo 4 GiB rather than false? That is, the remainder after dividing the length by 4 GiB?


@philippechataignon
Contributor Author

philippechataignon commented Aug 24, 2024

You're right: this PR version stores the size modulo 2**32, as the specification requires, but for large files that is not the real size.
Note that in recent versions of gzip (1.12 and later), the length stored in the file is no longer used. See the note on gzip -l at https://savannah.gnu.org/news/?id=10121

7z l # (use header)
    Date      Time    Attr         Size   Compressed  Name
                     .....   1298954309   2212092276  mtcars.csv

% gzip --version
gzip 1.12
% gzip -lv mtcars.csv.gz  # (decompress and takes much more time)
method  crc     date  time           compressed        uncompressed  ratio uncompressed_name
defla 5cd79282 Aug 24 10:19          2212092276          9888888901  77.6% mtcars.csv
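
The two sizes in the listings above are consistent with each other: 7z reports the 32-bit length stored in the trailer, while gzip 1.12 reports the real size found by decompressing. A small sketch of the arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Length as stored in the gzip trailer: true length modulo 2^32. */
uint32_t gzip_isize(uint64_t true_len) {
    return (uint32_t)(true_len % 4294967296ULL);
}

/* Number of whole 4 GiB wraps that the stored length cannot express. */
uint64_t gzip_wraps(uint64_t true_len) {
    return true_len >> 32;
}

/* Rebuild the true length when the number of wraps is known some other way. */
uint64_t gzip_true_len(uint32_t isize, uint64_t wraps) {
    assert(wraps <= UINT32_MAX);  /* keep the result inside 64 bits */
    return (wraps << 32) + isize;
}
```

For the file above, 9888888901 = 2 × 4294967296 + 1298954309, so `gzip_isize(9888888901ULL)` gives the 1298954309 shown by 7z. The stored value alone cannot distinguish that file from one of 1298954309 bytes, which is why a reader needing the real size has to decompress.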

@philippechataignon
Contributor Author

Merged PR #5513 into this PR, adding the new parameter compressLevel.

@philippechataignon philippechataignon force-pushed the fix_fwrite_length branch 2 times, most recently from b6091be to 615161a Compare January 10, 2025 16:26
* add // # nocov for STOP, like previous version
* add a test when naLen > width
* remove test of buffMB done in fwrite.R
@philippechataignon
Contributor Author

Hi, all checks are now OK for coverage. The review can continue. Thanks @MichaelChirico for the advice.

@MichaelChirico
Member

Thanks so much for the extra effort!!

Member

@MichaelChirico MichaelChirico left a comment

Thanks again! I found a few of the new # nocov extraneous and managed to come up with some simple tests to get coverage to 100% -- PTAL at my tail commits if you're interested.

@MichaelChirico MichaelChirico merged commit e0abdfc into master Jan 15, 2025
10 of 11 checks passed
@MichaelChirico MichaelChirico deleted the fix_fwrite_length branch January 15, 2025 07:05
rffontenelle pushed a commit that referenced this pull request Jan 17, 2025
* fwrite with correct file length

* gzip length and crc are manually computed in each thread and then
  added/combined

* gzip header is minimal

* remove some old debug code

* Escape with NOZLIB so compilation succeeds without zlib

* Move zlib check to the start to avoid outfile deletion

* Indent and add comments

* Buffers unification

* Restore schedule(dynamic) more efficient and progress

* Use alloc_size to see allocation when verbose

* Test if stream init succeeded

* Add cast to avoid warnings on Windows

* More explicit timing messages

* Free stream structs

* Add option to control compression level for fwrite with gzip

* Rework namings and default value

* Rename gzipLevel to compressLevel

* compressLevel param documentation

* Put zlib initialization together

* Refact buffSize, numBatchs and numBatches

* Add missing NOZLIB

* Increase outputs in last message when verbose

* No real init for stream_thread when is_gzip false

* Minor corrections

* Uses %zu format for size_t

* Last verbose msg was not printed when not is_gzip

* minor operator ws change

* Add test for compressLevel=1

* Add url link in compressLevel documentation

* Add 2 lines in NEWS for fwrite fix and compressLevel

* tidy-up, expand NEWS for compressLevel

* Use match.arg() for arg validation

* add a test for the other extreme compressLevel=9

* partial test fix

* fix updated test errors

* confirmed NEWS wording, fix typo

* fix order

* weak ordering

* place in 1.17.0 NEWS

* Add parenthesis to be more explicit

* Add comment for DeflateInit2

* typo

* Add parenthesis to be more explicit (2)

Co-authored-by: Michael Chirico <[email protected]>

* Try to emphasize that '-' is "command flag hyphen", not "negative"

* Convert Toby's comment to atime_test()

* Remove INTERNAL_STOP

* Increase coverage

* add // # nocov for STOP, like previous version
* add a test when naLen > width
* remove test of buffMB done in fwrite.R

* Try to fix nocov error

* Another attempt to increase coverage

* Add more nocov

* More judicious #nocov, keep INTERNAL_STOP

* eol='' coverage

* buffMB<line width

* Similar for buffMB vs. header width

* 0-row table verbose output

---------

Co-authored-by: Benjamin Schwendinger <[email protected]>
Co-authored-by: Michael Chirico <[email protected]>
Co-authored-by: Michael Chirico <[email protected]>
@tdhock tdhock restored the fix_fwrite_length branch January 17, 2025 21:54
@tdhock
Member

tdhock commented Jan 17, 2025

restoring branch so that atime tests can find the commit 3630413

@aitap aitap mentioned this pull request Jan 18, 2025
@philippechataignon philippechataignon deleted the fix_fwrite_length branch January 25, 2025 10:18
@MichaelChirico MichaelChirico restored the fix_fwrite_length branch February 4, 2025 06:42
@MichaelChirico
Member

FYI @philippechataignon, we need to retain the branch for any PR involved in an {atime} performance test:

PR = "3630413ae493a5a61b06c50e80d166924d2ef89a"), # Close-to-last merge commit in the PR.

I'm not super jazzed about this setup, but nothing better has emerged as yet.

6 participants