- MetaData is now multivalued to support repeated WARC and HTTP headers. #98
- commons-io 2.18.0
- commons-lang 2.6
- guava 33.3.1-jre
- hadoop 3.4.1
- htmlparser 2.1
- httpcore 4.4.16
- json 20240303
- junit 4.13.2
- Fixed URLParser and WaybackURLKeyMaker failing on URLs with IPv6 address hostnames #100
- WAT extractor: do not fail on missing WARC-Filename in warcinfo record
- ExtractingParseObserver: extract rel, hreflang and type attributes
- ExtractingParseObserver: extract links from onClick attributes
- commons-collections 3.2.2
- commons-io 2.7
- dsiutils 2.2.8
- guava 33.3.0-jre
- hadoop 3.4.0 (now optional)
- pig 0.17.0
- org.json 20231013
- joda-time (was unused)
- Use commons-collections v3.2.2 to avoid v3.2.1 vulnerability
- Extract `property` attributes of HTML meta elements
- Do not add value of preceding HTTP header field if there is no value
- Fix WAT records corresponding to response records of Wget generated WARCs
- Improve HTML link extraction
- Move unit tests over from heritrix3 to webarchive-commons
- Strip empty port via URLParser
- Use CharsetDetector to guess encoding of HTML documents
- Fix loss of the last header when the header block is terminated by LF LF
- Make regular expression to extract URLs from CSS more restrictive
- Remove invalid constant `PROFILE_REVISIT_URI_AGNOSTIC_IDENTICAL_DIGEST`
- Allow the canonicalizer to strip session id params even when they are the first params in the query string
- Store origin-code of ARC file header
- Flush output before tallying stats to fix sizeOnDisk calculation
- Remove the broken, seemingly unnecessary escapeWhitespace() step of URI fixup
- Handle empty String argument in CharsetDetector.trimAttrValue
- WAT extractor: adding information in WAT's warcinfo
- WAT extractor: missing WARC format version
- WAT extractor: envelope structure does not conform to the WAT specification
- WAT extractor: WARC-Date in all records should be the WAT record generation date
- WAT extractor: WARC-Filename in the WAT warcinfo record should be the WAT filename itself
- WAT extractor: Entity-Trailing-Slop-Bytes should be called Entity-Trailing-Slop-Length
- Escape redirect URLs in RealCDXExtractorOutput
- Tests fail on Windows
- Test fails on Java 8
- RecordingOutputStream can affect tcp packets sent in an undesirable way