Releases: internetarchive/heritrix3
3.6.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
Java Compatibility Notice
This release of Heritrix requires Java 17 or later.
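A wrapper or launcher could verify this requirement before starting a crawl. The sketch below is only an illustration: the class `JavaVersionCheck` and its `featureVersion` helper are hypothetical names, not part of Heritrix. It relies solely on the standard `java.specification.version` system property, which reports `1.8` on Java 8 and a plain number such as `17` on Java 9 and later.

```java
// Hypothetical helper, not part of Heritrix.
class JavaVersionCheck {
    // Converts a java.specification.version value to a feature number:
    // "1.8" becomes 8, "17" becomes 17.
    static int featureVersion(String specVersion) {
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        int dot = v.indexOf('.');
        if (dot >= 0) {
            v = v.substring(0, dot);
        }
        return Integer.parseInt(v);
    }

    public static void main(String[] args) {
        int feature = featureVersion(System.getProperty("java.specification.version"));
        if (feature < 17) {
            System.err.println("Java 17 or later is required, found Java " + feature);
            System.exit(1);
        }
        System.out.println("Java " + feature + " meets the Java 17 requirement");
    }
}
```

Parsing the property rather than calling `Runtime.version()` keeps the check usable on Java 8, where that method does not exist.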
New Features
- Automatic Checkpoints on Shutdown: Added `checkpointOnShutdown` option to `CheckpointService` to enable automatic checkpoints if Heritrix is gracefully terminated. #626
- Command-Line Checkpoint Selection: The `--checkpoint` command-line option restarts from a named checkpoint when using the `--run-job` option. #626
- ConfigurableExtractorJS forceStrictIfUrlMatchingRegexList: URLs matching the regular expressions on this list will be processed in strict mode, with only absolute URLs extracted, not relative ones. #624
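The regex-list behaviour described in the last item can be pictured with a small sketch. This is not Heritrix's implementation: the class `StrictModeSketch` and the method `forceStrict` are hypothetical, and full-string matching is an assumption (the actual extractor may match differently).

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; not Heritrix's actual implementation.
class StrictModeSketch {
    // Returns true when the URL fully matches at least one pattern,
    // meaning the page would be processed in strict mode.
    static boolean forceStrict(String url, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = List.of(
            Pattern.compile("^https?://example\\.com/static/.*\\.js$"));
        // Matches the list, so strict mode would apply: prints true
        System.out.println(forceStrict("https://example.com/static/app.js", patterns));
        // Does not match, so the default mode would apply: prints false
        System.out.println(forceStrict("https://example.org/other.js", patterns));
    }
}
```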
Changes
- Upgraded to Spring Framework 6.1: The Spring `@Required` annotation has been removed, so it was replaced with a custom implementation to maintain backward compatibility with existing crawl configurations. Spring 6 requires Java 17, so Heritrix now does too. #625
Fixes
- Manifest Hop Priority: Links from sitemaps are now given the same priority as normal navigation links. They were incorrectly being prioritized as transitive hops (embeds). #623
- SLF4J Logging: Heritrix now includes `slf4j-jdk14` to eliminate a startup warning message and fix logging for dependencies (such as crawler-commons) that use SLF4J. Heritrix doesn't use SLF4J itself. #628
Dependency Upgrades
- amqp-client 5.23.0
- commons-cli 1.9.0
- commons-codec 1.17.1
- commons-io 2.18.0
- commons-net 3.11.1
- crawler-commons 1.4
- dnsjava 3.6.2
- easymock 5.5.0
- freemarker 2.3.33
- groovy 4.0.24
- gson 2.11.0
- httpcomponents 4.5.14
- java-socks-proxy-server 4.1.2
- java-websocket removed
- jaxb-runtime 4.0.5
- jsch switched to mwiede fork 0.2.21
- junit 4.13.2
- kafka-clients 3.9.0
- kryo 5.6.2
- pdfbox 3.0.3
- slf4j 2.0.16
- spring-framework 6.1.15
- webarchive-commons 1.2.0
3.5.0
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
End of interim releases
This release drops the term "interim release" which distinguished releases made temporarily by the community in the absence of releases made by Internet Archive. The community releases have effectively become the official releases.
In conjunction with this, the version numbers, which were paused at 3.4.0 for the interim releases, have resumed incrementing following the scheme `major.minor.patch`, with the minor release number incremented when features are added or removed.
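To make the ordering under this scheme concrete, here is a minimal sketch that parses and compares `major.minor.patch` strings numerically. The class `VersionSketch` and its methods are hypothetical and exist only for illustration.

```java
// Hypothetical illustration of ordering major.minor.patch version strings.
class VersionSketch {
    // Splits "3.6.0" into {3, 6, 0}; rejects anything without three parts.
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected major.minor.patch: " + version);
        }
        int[] out = new int[3];
        for (int i = 0; i < 3; i++) {
            out[i] = Integer.parseInt(parts[i]);
        }
        return out;
    }

    // Negative if a is older than b, zero if equal, positive if newer.
    static int compare(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < 3; i++) {
            int c = Integer.compare(x[i], y[i]);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // 3.4.0 precedes 3.5.0, so this prints true
        System.out.println(compare("3.4.0", "3.5.0") < 0);
    }
}
```

Comparing components as integers rather than as strings matters once a component reaches two digits, for example `3.10.0` versus `3.9.0`.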
Java compatibility notice
This will likely be the last release of Heritrix compatible with Java 8. The next release is expected to require Java 17 or later.
Changes in this release
Removals
- Removed HBase modules from contrib. #621
Fixes
- ConfigurableExtractorJS: Set default value (false) for strict property. #612
- ExtractorHTML: Treat `cite` attribute as a navlink instead of embed. #608
- Building no longer requires the builds.archive.org or Cloudera repositories. #614
- Updated to new URL of the restlet repository.
Dependency Upgrades
- Removed hbase, joda-time, log4j
- commons-io 2.14.0
- kafka-clients 3.8.0
- ftpserver-core 1.2.0
- jetty 9.4.56.v20240826
- webarchive-commons 1.1.10
2024-09-09 Interim Release
Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
Compatibility Note
Checkpoints and crawl state created with older versions of Heritrix will not be loadable as kryo has been significantly updated. Replaying the recovery log may be an alternative in some cases.
New Features
- JDK 22 support
- Added `ConfigurableExtractorJS` for more flexible JavaScript extraction. (#602)
- Added `HostnameQueueAssignmentPolicyWithLimits` with optional name length limits. (#598)
- `ExtractorHTML` can now extract more variants of alternative resolution image URLs. (#605)
  - Attributes are now matched case-insensitively (previously `src` and `SRC` worked but not `Src`)
  - New `<img>` attributes: `data-full-src`, `data-lazy-srcset`, `data-src-small`, `data-src-medium`
  - New `<link>` attribute: `imagesrcset`
- `ExtractorHTTP` can now be configured with extra inferred paths (#597)
- `ExtractorYoutubeDL` metadata records can now be optionally logged to crawl.log (#593)
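The case-insensitive attribute matching mentioned for `ExtractorHTML` can be illustrated with a simplified pattern. This sketch is not the extractor's actual regular expression: the class `AttributeCaseSketch`, the pattern, and the `extractSrc` helper are hypothetical, and the pattern only handles double-quoted values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the pattern ExtractorHTML uses.
class AttributeCaseSketch {
    // Simplified pattern for a double-quoted src attribute. The lookbehind
    // keeps it from matching inside names such as data-src, and
    // CASE_INSENSITIVE lets src, SRC and Src all match.
    static final Pattern SRC_ATTR = Pattern.compile(
        "(?<![\\w-])src\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the attribute value, or null when no src attribute is found.
    static String extractSrc(String tag) {
        Matcher m = SRC_ATTR.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractSrc("<img src=\"a.png\">")); // a.png
        System.out.println(extractSrc("<img SRC=\"a.png\">")); // a.png
        System.out.println(extractSrc("<img Src=\"a.png\">")); // a.png
    }
}
```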
Removals
- Removed `ExtractorChrome` from contrib (#601)
Fixes
- Reduced false positive speculative URLs from meta tags (#595)
- Fixed BdbModule resource leak on job teardown (f428001)
- Corrected function name in `ScriptedProcessor` Javadoc. (#599)
- Updated Maven builds to use HTTPS for resolving dependencies.
- Reset CrawlURI status for hasPrerequisite() so that it isn't preserved between attempts (#600)
- Fixed older junit3 tests not being run (#592)
- Increased DiskSpaceMonitor default pause threshold to 8 GiB to avoid BDB issue (#499)
- Stopped logging authentication failures when auth header is missing (#539)
- Fixed console still showing job running after crash (#549)
Dependency Upgrades
- Transitioned `PDFParser` and `ExtractorPDF` to pdfbox (#575)
- Transitioned `ExtractorYoutubeDL` to yt-dlp
- commons-net 3.9.0
- com.rabbitmq:amqp-client 5.18.0
- dnsjava 3.6.0
- groovy 4.0.21
- kryo 5.6.0
- spring-expression 5.3.39
2022-07-27 Interim Release
This is the 2022-07-27 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here.
2021-09-23 Interim Release
This is the 2021-09-23 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here.
2021-08-03 Interim Release
This is the 2021-08-03 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.
This release includes:
- Upgrades http-client to version 4.5, including improved cookie handling and expiration.
- A new browser-based extraction module, `ExtractorChrome`.
- JDK16 compatibility improvements.
- Many more smaller fixes and improvements (see changelog).
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here.
2021-06-17 Interim Release
This is the 2021-06-17 release. Despite being an interim release, it should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.
IMPORTANT: This release was accidentally built with Java 15 and, due to changes in the run-time libraries, is not compatible with Java 8 (Java 9 or later should work fine).
This release improves sitemap extraction, and fixes a bug that can interfere with checkpoint creation.
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here.
2021-05-27 Interim Release
This is the 2021-05-27 release. Despite being an interim release, it includes a number of important fixes for bugs in Heritrix 3.4 and should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.
Notably, this release includes new modules for finding and using sitemaps. See: Support for extracting URLs in sitemaps #262
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here.
2020-05-18 Interim Release
This is the 2020-05-18 release. Despite being an interim release, it includes a number of important fixes for bugs in Heritrix 3.4 and should be considered the latest stable version of Heritrix. Some basic release notes are available here. You can find more detailed information in the changelog.
This release features new modules to support archiving over SFTP, but the content is stored as a `response` record rather than the `resource` record that has been more widely used in the past. The next release will resolve this as per this pull request.
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here.
2020-03-04 Interim Release
This is the fourth dated, periodic release. It includes a number of significant changes, most importantly updating of the Berkeley Database from a very old version 4.1.6 to version 7.5.11. This resolves a long-standing bug when recovering from checkpoints multiple times, but also means that the Heritrix state files from previous versions are not compatible with this version. In other words:
Any crawl state folders from previous versions of Heritrix are not compatible with this version! You can only use this new release with new crawls!
Some basic release notes are available here. You can find more detailed information in the changelog.
The binaries are available via Maven Central here.
If you want the distribution package, you can download it from here.