WAT: Duplicated payload metadata values for "Actual-Content-Length" and "Trailing-Slop-Length" #103


sebastian-nagel
Contributor

This is a stupid regression of the multi-valued metadata (#98): in the payload metadata of WAT records, the values of "Actual-Content-Length" and "Trailing-Slop-Length" are now duplicated. Here is one example:

  "Envelope": {
    "Format": "WARC/1.0",
    "Payload-Metadata": {
      "Actual-Content-Length": [
        "418",
        "418"
      ],
      "Actual-Content-Type": "application/warc-fields",
      "Trailing-Slop-Length": [
        "4",
        "0"
      ],
    },
    "WARC-Header-Metadata": {
      "WARC-Type": "warcinfo"
    }

The reason is that these values are set (or appended) both in the class WARCResource and in one of the factory classes, WARCMetaDataResourceFactory or HTTPHeadersResourceFactory. Because no other factory classes set these payload metadata headers, only WARC metadata and WARC response records are affected:

  • the value of "Actual-Content-Length" is simply duplicated
  • "Trailing-Slop-Length" has two values: 4 is set in WARCResource while 0 is set in the factory classes. Before #98 (Make MetaData multi-valued to preserve values of repeating WARC and HTTP headers), only the last value, 0, made it into the WAT record.
    • this is also inconsistent, because other WARC types, e.g. WARC request, have the first value (4), as the value is set only once and is never overwritten.
    • unfortunately, the documentation of "Trailing-Slop-Length" in the WAT spec ("Number of trailing slop bytes") does not really help to decide which of the two values is the correct one. Both make sense: 4 bytes (\r\n\r\n) used as the WARC record separator, or zero superfluous bytes.

This PR changes the affected classes, so that additional values are only appended if they differ from already set values and (for "Trailing-Slop-Length") are not 0.
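
The append rule can be sketched as follows. This is a minimal illustration, not the code changed in this PR: the class `PayloadMetadataSketch` and its methods `appendIfNew` and `get` are hypothetical names, and the sketch assumes that the first value set for a field is always kept, even if it is 0.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for multi-valued payload metadata.
// It is not the MetaData class from the repository.
final class PayloadMetadataSketch {
    static final String TRAILING_SLOP_LENGTH = "Trailing-Slop-Length";

    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    // Append a value only if it adds information:
    // - the first value of a field is always stored (assumption),
    // - a value equal to one already set is skipped,
    // - an additional "0" for Trailing-Slop-Length is skipped.
    void appendIfNew(String name, String value) {
        List<String> existing = fields.computeIfAbsent(name, k -> new ArrayList<>());
        if (existing.isEmpty()) {
            existing.add(value);
            return;
        }
        if (existing.contains(value)) {
            return;
        }
        if (TRAILING_SLOP_LENGTH.equals(name) && "0".equals(value)) {
            return;
        }
        existing.add(value);
    }

    // Values of a field in insertion order; empty if the field was never set.
    List<String> get(String name) {
        return fields.getOrDefault(name, Collections.emptyList());
    }
}
```

With this rule, setting "Actual-Content-Length" to 418 twice leaves a single value, and "Trailing-Slop-Length" keeps only the 4 when a 0 is set afterwards. Genuinely different repeated values are still appended, so the multi-valued behavior from #98 is preserved.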

The solution looks overly complex, but I decided to keep possible changes at a minimum. The matrix required for exhaustive testing is large because it combines WARC and ARC, compressed and uncompressed, and all WARC record types. The unit tests added cover only uncompressed ARC and WARC files with the record types contained in the already provided test files.

@ato ato merged commit 90d28d7 into iipc:master Dec 13, 2024
5 checks passed
@sebastian-nagel sebastian-nagel deleted the duplicated-payload-metadata-actual-content-length branch December 13, 2024 14:28