Releases: internetarchive/heritrix3

3.14.1

06 Apr 08:50
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

Bug fixes

  • FetchHTTP2: Fixed an IllegalArgumentException when the host contains underscores. #718
  • FetchHTTP2: Removed HttpClient default authentication protocol handlers to avoid failures on 401/407 responses with large response bodies or missing *-Authenticate headers. #719
  • Crawl job XML parsing: Enabled XXE protection when parsing crawl job XML. #713
  • URI precedence warnings: Heritrix now logs a warning when URI precedence exceeds the maximum supported value of 127 and is clipped. #721
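
As an illustration of the precedence clipping described above, here is a minimal Python sketch. It is not Heritrix code: the function name `clip_precedence` and the logger name are hypothetical, and only the limit of 127 comes from the release note.

```python
import logging

logger = logging.getLogger("precedence-sketch")  # hypothetical logger name

MAX_PRECEDENCE = 127  # maximum supported value stated in the release note


def clip_precedence(value: int) -> int:
    """Return value limited to MAX_PRECEDENCE, warning when clipping occurs."""
    if value > MAX_PRECEDENCE:
        logger.warning(
            "URI precedence %d exceeds maximum %d; clipping", value, MAX_PRECEDENCE
        )
        return MAX_PRECEDENCE
    return value
```

Values at or below the limit pass through unchanged, so a warning is logged only when a value was actually altered.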

Dependency upgrades

  • amqp-client: 5.28.0 → 5.29.0
  • codemirror__autocomplete: 6.20.0 → 6.20.1
  • codemirror__commands: 6.10.1 → 6.10.2
  • codemirror__language: 6.12.1 → 6.12.2
  • codemirror__lint: 6.9.2 → 6.9.5
  • codemirror__view: 6.39.11 → 6.39.16
  • commons-net: 3.12.0 → 3.13.0
  • groovy-bom: 5.0.4 → 5.0.5
  • jackson-bom: 2.21.0 → 2.21.2
  • jakarta.xml.bind-api: 4.0.4 → 4.0.5
  • jaxb-runtime: 4.0.6 → 4.0.7
  • jetty (jetty-bom, jetty-ee10-bom): 12.0.32 → 12.0.34
  • jsch: 2.27.7 → 2.28.0
  • junit-jupiter: 6.0.2 → 6.0.3
  • kafka-clients: 4.1.1 → 4.2.0
  • lezer__common: 1.5.0 → 1.5.1
  • lezer__lr: 1.4.7 → 1.4.8
  • lz4-java: 1.10.3 → 1.10.4
  • pdfbox: 3.0.6 → 3.0.7
  • spring (spring-beans, spring-context, spring-core, spring-expression): 7.0.3 → 7.0.6

3.14.0

06 Feb 14:46
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New features

  • JSON API responses: You can now request JSON responses from the REST API with an Accept: application/json request header. #703
  • Job deletion: You can now delete jobs via both the UI and the API. #707
  • ExtractorYoutubeDL: Added a new skipSuperResolution option to skip upscaled YouTube videos. #709
  • FetchHTTP2: Added SOCKS5 proxy support. #710
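
A minimal sketch of a client request that asks for JSON from the REST API, using only the Python standard library. The engine URL below is an assumption about a local installation, and the helper name `build_json_request` is hypothetical. The request is built but not sent.

```python
import urllib.request

# Assumption: address of a locally running engine; adjust host, port and path.
ENGINE_URL = "https://localhost:8443/engine"


def build_json_request(url: str = ENGINE_URL) -> urllib.request.Request:
    """Build (but do not send) a GET request that asks for a JSON response."""
    return urllib.request.Request(
        url,
        headers={"Accept": "application/json"},
        method="GET",
    )
```

Actually sending the request would also require credentials for the web interface and, most likely, handling of the server's TLS certificate. Both are left out here because they depend on the deployment.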

Bug fixes

  • Improved usability of the engine page when many jobs are active by making the "Exit Java" job list scrollable. #704

Dependency upgrades

  • amqp-client: 5.27.1 → 5.28.0
  • codemirror__autocomplete: 6.18.6 → 6.20.0
  • codemirror__commands: 6.8.1 → 6.10.1
  • codemirror__language: 6.11.3 → 6.12.1
  • codemirror__lint: 6.8.5 → 6.9.2
  • codemirror__state: 6.5.2 → 6.5.4
  • codemirror__view: 6.38.1 → 6.39.11
  • commons-codec: 1.20.0 → 1.21.0
  • crawler-commons: 1.5 → 1.6
  • dnsjava: 3.6.3 → 3.6.4
  • groovy-bom: 5.0.2 → 5.0.4
  • jackson-bom: 2.20.1 → 2.21.0
  • jetty (jetty-bom, jetty-ee10-bom): 12.0.30 → 12.0.32
  • junit-jupiter: 6.0.1 → 6.0.2
  • lezer__common: 1.2.3 → 1.5.0
  • lezer__highlight: 1.2.1 → 1.2.3
  • lezer__lr: 1.4.2 → 1.4.7
  • lz4-java: 1.10.1 → 1.10.3
  • spring (spring-beans, spring-context, spring-core, spring-expression): 7.0.1 → 7.0.3
  • style-mod: 4.1.2 → 4.1.3
  • webarchive-commons: 3.0.2 → 3.0.3
  • webjars-locator-lite: 1.1.2 → 1.1.3

3.13.0

11 Dec 05:08
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New features

  • Config editor: IDE-style completions for bean names and Spring XML (powered by the new bean docs generator). #684
  • Job status API: The sizeTotalsReport now includes a sizeOnDisk value totaling the size of the files in latest/warcs. #700
  • ExtractorJson: New extractor that extracts URI strings from JSON documents. #701
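
To illustrate the general idea behind extracting URI strings from a JSON document, here is a rough Python sketch. It is not the ExtractorJson implementation: the rule that only http and https strings with a host count as URIs is an assumption made for this example, and both function names are hypothetical.

```python
import json
from urllib.parse import urlsplit


def looks_like_uri(text: str) -> bool:
    """Assumed rule for this sketch: an http(s) URL with a non-empty host."""
    try:
        parts = urlsplit(text)
    except ValueError:  # e.g. malformed IPv6 literal
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def extract_uris(document: str) -> list:
    """Walk every value in a JSON document and collect URI-like strings."""
    found = []
    stack = [json.loads(document)]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
        elif isinstance(node, str) and looks_like_uri(node):
            found.append(node)
    return found
```

The traversal uses an explicit stack rather than recursion, so deeply nested documents do not hit Python's recursion limit. The order of the results is not meaningful.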

Bug fixes

  • AbstractCookieStore: Fixed cookies with leading dot (.example.com) being ignored. #691
  • ExtractorHTML: Fixed attribute values longer than 2048 characters causing extraction of truncated strings. #697
  • ClientFTP: Fixed MalformedServerReplyException when the FTP server sends a response with only an error code and no message. #694
  • BdbMultipleWorkQueues: Added null checks, type validation, and warning logs in BdbMultipleWorkQueues.delete() to improve frontier stability in the case of corrupted or partially persisted CrawlURIs. #693
  • BeanDocProcessor: Fixed compiler IllegalArgumentException when IntelliJ runs the annotation processor with a ProcessingEnvironment wrapper.
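
For context on the AbstractCookieStore fix above: RFC 6265 says a leading dot in a cookie's Domain attribute is ignored before domain matching. The Python sketch below shows that rule in isolation. It is not Heritrix code, and the function names are hypothetical.

```python
def normalize_cookie_domain(domain: str) -> str:
    """Drop one leading dot and lower-case the domain, as RFC 6265 describes."""
    if domain.startswith("."):
        domain = domain[1:]
    return domain.lower()


def domain_matches(cookie_domain: str, host: str) -> bool:
    """True if host equals the cookie domain or is a subdomain of it."""
    cookie_domain = normalize_cookie_domain(cookie_domain)
    host = host.lower()
    if not cookie_domain:
        return False
    return host == cookie_domain or host.endswith("." + cookie_domain)
```

With this normalization, `.example.com` and `example.com` behave identically, while a host such as `badexample.com` still does not match.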

Dependency upgrades

  • amqp-client: 5.27.0 → 5.27.1
  • commons-cli: 1.10.0 → 1.11.0
  • commons-codec: 1.19.0 → 1.20.0
  • commons-io: 2.20.0 → 2.21.0
  • jackson: 2.20.0 → 2.20.1
  • jetty: 12.0.29 → 12.0.30
  • jsch: 2.27.4 → 2.27.7
  • junit-jupiter: 6.0.0 → 6.0.1
  • kafka-clients: 4.1.0 → 4.1.1
  • lz4-java: 1.8.0 → 1.10.1
  • spring-framework: 6.2.12 → 7.0.1
  • webarchive-commons: 3.0.1 → 3.0.2

3.12.0

30 Oct 01:52
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New features

  • ConfigurableExtractorJS: Regex rules to skip extracting <script> tags when their attributes match. #672

Bug fixes

  • Docs: Switch bean docs generation to an annotation processor, fixing the bean reference broken by Java language changes. #683
  • StatisticsTracker: Don’t restore crawlEndTime when resuming from a checkpoint. #669
  • ExtractorJS: Fix overriding the strict setting in sheets. #670
  • Berkeley DB: Handle more shutdown interrupts gracefully. #671

Dependency upgrades

  • amqp-client: 5.26.0 → 5.27.0
  • groovy: 4.0.28 → 5.0.2
  • jaxb-runtime: 4.0.5 → 4.0.6
  • jetty: 12.0.27 → 12.0.29
  • jsch: 2.27.3 → 2.27.4
  • junit-jupiter: 5.13.4 → 6.0.0
  • kafka-clients: 3.9.1 → 4.1.0
  • pdfbox: 3.0.5 → 3.0.6
  • rethinkdb-driver: 2.3.3 → 2.4.4
  • spring: 6.2.11 → 6.2.12
  • webarchive-commons: 3.0.0 → 3.0.1
  • webjars-locator-lite: 1.1.0 → 1.1.2

3.11.0

22 Sep 05:04
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New features

  • KnowledgableExtractorJS now extends ConfigurableExtractorJS for its additional options. #668

Bug fixes

  • Invalid characters are now stripped from the XML REST API output. Log file truncation after an unclean shutdown can sometimes introduce such characters. #667
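
One common way to strip characters that XML 1.0 does not allow is a single regular expression over the permitted character ranges. The Python sketch below shows that approach. Whether Heritrix uses the same ranges or technique is an assumption, and the function name is hypothetical.

```python
import re

# Complement of the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that may not appear in an XML 1.0 document."""
    return _INVALID_XML_CHARS.sub("", text)
```

NUL bytes, which can appear in a log file truncated by an unclean shutdown, fall outside the permitted ranges and are removed. Tabs and newlines are kept.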

Dependency upgrades

  • codemirror@language: 6.11.2 → 6.11.3
  • jakarta.xml.bind-api: 4.0.2 → 4.0.4
  • jetty: 12.0.25 → 12.0.27
  • jsch: 2.27.2 → 2.27.3
  • gson: 2.13.1 → 2.13.2
  • spring: 6.2.10 → 6.2.11

3.10.2

29 Aug 08:32
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

Bug fixes

  • AMQPPublishProcessor: The User-Agent string is now included in the metadata so Umbra can use it in its own requests. #663
  • FetchDNS: DNS lookups returning 0.0.0.0 are now treated as resolution failure. #665
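
The FetchDNS rule above can be pictured as a small check on the resolved address. This Python sketch is illustrative only and the function name is hypothetical. It covers just the 0.0.0.0 case named in the release note.

```python
import ipaddress

UNSPECIFIED_V4 = ipaddress.IPv4Address("0.0.0.0")


def is_usable_dns_answer(address: str) -> bool:
    """Treat an answer of 0.0.0.0 as a failed resolution."""
    return ipaddress.ip_address(address) != UNSPECIFIED_V4
```

Parsing with `ipaddress` rather than comparing strings means the input is validated as an IP address first, so malformed input raises `ValueError` instead of passing silently.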

Dependency upgrades

  • amqp-client: 5.25.0 → 5.26.0
  • codemirror@language: 6.11.1 → 6.11.2
  • codemirror@legacy-modes: 6.5.0 → 6.5.1
  • codemirror@view: 6.37.2 → 6.38.1
  • commons-cli: 1.9.0 → 1.10.0
  • commons-codec: 1.18.0 → 1.19.0
  • commons-net: 3.11.1 → 3.12.0
  • jetty: 12.0.22 → 12.0.25
  • junit-jupiter: 5.13.3 → 5.13.4
  • groovy: 4.0.27 → 4.0.28
  • spring-framework: 6.2.9 → 6.2.10

3.10.1

21 Jul 08:20
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

Bug fixes

  • FetchHTTP2

    • HTTP/1.1 is now used on servers that don't support ALPN. Fixes IOException: frame_size_error/invalid_frame_length
    • Fixed NullPointerException when the server's IP address isn't available.
  • Seeds report: Redirect URIs are now recorded from the Location header for HTTP status codes 303 See Other,
    307 Temporary Redirect and 308 Permanent Redirect.
    Previously this was only done for 301 Moved Permanently and 302 Found.

  • Public suffixes list: A resource naming conflict between webarchive-commons and crawler-commons for
    effective_tld_names.dat was resolved and the list was updated to the latest version.
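
The seeds report change above amounts to widening the set of status codes for which the Location header is consulted. The Python sketch below shows that logic in general terms. It is not the Heritrix code path, and the names are hypothetical.

```python
# Status codes listed in the release note: the original two plus the three added.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_target(status: int, headers: dict):
    """Return the Location header value for redirect responses, else None."""
    if status not in REDIRECT_STATUSES:
        return None
    for name, value in headers.items():
        if name.lower() == "location":  # header names are case-insensitive
            return value
    return None
```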

Dependency upgrades

  • codemirror@state: 6.4.0 → 6.5.11
  • codemirror@view: 6.37.1 → 6.37.2
  • commons-lang: 2.6 → 3.18.0
  • commons-io: 2.19.0 → 2.20.0
  • crawler-commons: 1.4 → 1.5
  • jetty: 12.0.17 → 12.0.22
  • jsch: 2.27.0 → 2.27.2
  • junit-jupiter: 5.13.2 → 5.13.3
  • restlet: 2.6.0-rc1 → 2.6.0
  • spring: 6.2.7 → 6.2.9
  • webarchive-commons: 2.0.1 → 3.0.0

3.10.0

12 Jun 13:22
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New features

  • BrowserProcessor: Loads fetched pages in a local browser (Firefox/ChromeDriver), records all browser requests,
    and runs pluggable behaviors (e.g. scrolling, link extraction). #653

    • Uses the WebDriver BiDi protocol for browser automation.
    • The recording proxy is built on Jetty's ProxyHandler and the FetchHTTP2 module.
    • Status: Working for small crawls but needs more robust error handling (browser crashes, resource limits).
  • Basic web auth: You can now switch the web interface from Digest authentication to Basic authentication with the --web-auth basic command-line option. This is useful when running Heritrix behind a reverse proxy that adds external authentication. #654

  • Robots.txt wildcards: The * and $ wildcard rules from RFC 9309 are now supported. #656

  • FetchHTTP2: Added HTTP proxy support. #657
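
For readers unfamiliar with the robots.txt wildcards mentioned above: in RFC 9309, * matches any sequence of characters and $ marks the end of the match pattern. The Python sketch below translates such a rule path into a regular expression. It illustrates the matching rule only, not Heritrix's parser, and it leaves out percent-encoding normalization and longest-match rule selection. The function names are hypothetical.

```python
import re


def robots_pattern_to_regex(pattern: str):
    """Compile a robots.txt rule path with * and $ into a regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal parts and join them with ".*" where the rule had "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """True if the rule pattern matches the start of the URL path."""
    return robots_pattern_to_regex(pattern).match(path) is not None
```

For example, the rule `/*.php$` matches `/index.php` but not `/index.php?x=1`, because the `$` requires the path to end right after `.php`.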

Fixes

  • Code editor: The configuration editor and script console were upgraded to CodeMirror 6. This resolves some browser incompatibilities, allowing CodeMirror’s own find function to be re-enabled for reliable text search of content far outside the viewport. #651

  • BDB shutdown interrupt handling: The thread’s interrupted flag is now cleared before some BDB interactions to reduce the likelihood of environment invalidation when requestCrawlStop() is called repeatedly. #659

  • FetchHTTP2: Fixed gzip alert log messages by configuring HttpClient not to decode gzip-encoded responses.

Removals

  • Removed Apache HttpClient 3: If you have custom Heritrix modules you may need to update the following
    class references in your code:

    Removed                                      Replacement
    org.apache.commons.httpclient.URIException   org.archive.url.URIException
    org.apache.commons.httpclient.Header         org.archive.format.http.HttpHeader

    Note that Apache HttpClient 4 (org.apache.http) was not removed. #652

Dependency Upgrades

  • codemirror: 2.23 → 6
  • easymock: 5.5.0 → removed
  • groovy: 4.0.26 → 4.0.27
  • junit: 5.12.2 → 5.13.1
  • kafka-clients: 3.9.0 → 3.9.1
  • spring: 6.2.6 → 6.2.7
  • webarchive-commons: 1.3.0 → 2.0.1

3.9.0

13 May 04:49
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New features

  • FetchHTTP2: Added a new fetch module supporting HTTP/2 and HTTP/3. #649

Fixes

  • Fixed HighestUriPrecedenceProvider: Added Histotable serializer and Kryo autoregistration. #647

Changes

  • JUnit 5: Upgraded all JUnit 3 and 4 style tests to JUnit 5. #650

Dependency Upgrades

  • commons-io: 2.18.0 → 2.19.0
  • gson: 2.12.1 → 2.13.1
  • jetty: 9.4.19.v20190610 → 12.0.17
  • jsch: 0.2.24 → 2.27.0
  • junit: 4.13.2 → 5.12.2
  • pdfbox: 3.0.4 → 3.0.5
  • restlet: 2.5.0 → 2.6.0-RC1
  • spring: 6.2.5 → 6.2.6

3.8.0

01 Apr 12:18
@ato ato

Download distribution zip (or tar.gz)

Full Changelog | Javadoc | Maven Central

New Features

  • ExtractorYoutubeDL processArguments: New option for overriding the default yt-dlp process arguments. #644

Fixes

  • Slow tests: Fixed ObjectIdentityBdbManualCacheTest so it no longer fails when running tests with -DrunSlowTests=true. #643
  • Test stability: Disabled FetchHTTPTest.testHostHeaderDefaultPort due to sporadic test failures.
  • Code cleanup: Fixed some compiler and IDE warnings. Removed unused utility classes (JavaLiterals, LogUtils). #645

Dependency Upgrades

  • amqp-client: 5.24.0 → 5.25.0
  • beanshell: 2.0b5 → 2.0b6
  • commons-codec: 1.17.2 → 1.18.0
  • dnsjava: 3.6.2 → 3.6.3
  • groovy: 4.0.24 → 4.0.26
  • gson: 2.11.0 → 2.12.1
  • jsch: 0.2.22 → 0.2.24
  • pdfbox: 3.0.3 → 3.0.4
  • slf4j: 2.0.16 → 2.0.17
  • spring: 6.1.16 → 6.2.5