Webarchiv.[cz] | Configuration history of crawls

We use this repository to track changes in the Heritrix configuration. We also track the seeds used for each harvest.

Typical files, directories, and commits look like this:

Files

The file naming convention is derived from the rules for naming records below.
Each file name is a combination of data types and their values from the metadata specification.

File with list of seeds

[fileType.prefix]-[dateType.month]-[harvestType.tag]-[harvestFreq].[fileType.fileformat]

seeds-2019-06-S-1M_2M_OneShot_ArchiveIt.txt
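The pattern above can be sketched in Python as follows. This is only an illustration: the function name `seed_file_name` is hypothetical, and joining several harvestFreq values with an underscore is an assumption inferred from the example file name, not a documented rule.

```python
from datetime import date


def seed_file_name(month, harvest_tag, freqs):
    """Build a seed file name following the pattern
    [fileType.prefix]-[dateType.month]-[harvestType.tag]-[harvestFreq].[fileType.fileformat]
    """
    prefix = "seeds"      # fileType.prefix
    fileformat = "txt"    # fileType.fileformat
    month_part = month.strftime("%Y-%m")  # dateType.month (yyyy-MM)
    # Assumption: several harvestFreq values are joined with "_",
    # as in the example seeds-2019-06-S-1M_2M_OneShot_ArchiveIt.txt.
    freq_part = "_".join(freqs)
    return f"{prefix}-{month_part}-{harvest_tag}-{freq_part}.{fileformat}"


print(seed_file_name(date(2019, 6, 1), "S", ["1M", "2M", "OneShot", "ArchiveIt"]))
# prints: seeds-2019-06-S-1M_2M_OneShot_ArchiveIt.txt
```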

File with configuration of crawler

[fileType.prefix].[fileType.fileformat]
[fileType.prefix]-[harvestType.tag]-[dateType.year].[fileType.fileformat]

crawler-beans.cxml
crawler-beans-S-2020.cxml
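A name can be checked against the two configuration file patterns with a regular expression. The sketch below is an assumption-laden illustration: `parse_config_name` is a hypothetical helper, and it assumes the harvestType tag is a single uppercase letter (only S and T are defined in this document).

```python
import re

# Matches both documented forms:
#   crawler-beans.cxml
#   crawler-beans-S-2020.cxml
# Assumption: the harvestType tag is a single uppercase letter.
CONFIG_NAME = re.compile(
    r"^crawler-beans(?:-(?P<tag>[A-Z])-(?P<year>\d{4}))?\.cxml$"
)


def parse_config_name(name):
    """Return the tag and year of a crawler config file name,
    or None if the name does not follow the convention."""
    match = CONFIG_NAME.match(name)
    if match is None:
        return None
    # Both values are None for the plain crawler-beans.cxml form.
    return {"tag": match.group("tag"), "year": match.group("year")}


print(parse_config_name("crawler-beans.cxml"))
print(parse_config_name("crawler-beans-S-2020.cxml"))
print(parse_config_name("beans.xml"))
```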

Directories

The directory naming convention is derived from the rules for naming records below.
Each directory name is a combination of the harvestType and the directoryType.

[harvestType]-[directoryType.suffix]

Monthly-crawls/
Topic-crawls/
Shared-config/

Rules for naming records

fileType

prefix	mimetype	fileformat	description
seeds	text/plain	txt	file with list of seeds
crawler-beans	text/xml	cxml	file with crawler configuration

directoryType

directoryType	suffix	description
config	-conf	directory with configuration shared by all harvests, such as the blacklist
crawls	-crawls	directory with the crawler configuration and seeds for specific harvests
reports	-reports	directory with logs and reports about harvests

dateType

Formats used for recording dates and times.

dateType	format
year	yyyy
month	yyyy-MM
day	yyyy-MM-DD
time	yyyy-MM-DD@hhmmss
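The formats in the table can be mapped onto Python `strftime` patterns. This is a minimal sketch under two assumptions that the table does not state explicitly: `DD` is read as day of month, and `hhmmss` is read as 24-hour time. The names `DATE_FORMATS` and `format_date` are hypothetical.

```python
from datetime import datetime

# dateType -> strftime pattern.
# Assumptions: "DD" means day of month, "hhmmss" means 24-hour time.
DATE_FORMATS = {
    "year": "%Y",               # yyyy
    "month": "%Y-%m",           # yyyy-MM
    "day": "%Y-%m-%d",          # yyyy-MM-DD
    "time": "%Y-%m-%d@%H%M%S",  # yyyy-MM-DD@hhmmss
}


def format_date(moment, date_type):
    """Format a datetime according to one of the dateType entries."""
    return moment.strftime(DATE_FORMATS[date_type])


stamp = datetime(2019, 6, 3, 14, 5, 9)
for date_type in DATE_FORMATS:
    print(date_type, format_date(stamp, date_type))
# year 2019
# month 2019-06
# day 2019-06-03
# time 2019-06-03@140509
```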

harvestName

See the metadata specification for more information about harvestName #v04

harvestType

This is the curatorial definition of the harvest from which the list of seeds for harvest is derived.
See the metadata specification for more information about harvestType #v04

harvestType	tag	description
Serials	S	Selective repeating harvest (combination of selected seeds with different annual frequency)
Topics	T	Thematic selective harvest. These harvests usually repeat a few times.
Totals		Comprehensive harvest of the national domain .cz. We do not store the domain crawl configuration here.¹
Tests		Harvest feasibility tests
Requests		Requested harvest in cooperation with another institution
Continuous		Continuous selective harvest

harvestFreq

This is a curated selection of seeds with a defined frequency of repeated crawls.
See the metadata specification for more information about harvestFreq #v04

harvestFreq	description
1M	seeds to be crawled every month
2M	seeds to be crawled every other month
6M	seeds to be crawled twice a year
12M	seeds to be crawled once a year
Archive_IT	seeds acquired in the last month with a low frequency (once or twice a year), to be harvested as soon as possible
OneShot	seeds without contracts that we would like to have in the archive but that are not publicly available
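The periodic tags in the table encode the interval between crawls in months. The sketch below shows one way to read them; the helper names are hypothetical, and treating every `<number>M` tag as an interval in months is an assumption generalised from the four tags listed above.

```python
import re


def crawl_interval_months(freq):
    """Return the number of months between crawls for tags such as
    1M, 2M, 6M, 12M, or None for non-periodic tags (Archive_IT, OneShot)."""
    match = re.fullmatch(r"(\d+)M", freq)
    return int(match.group(1)) if match else None


def crawls_per_year(freq):
    """Return how many crawls per year a periodic tag implies, or None."""
    interval = crawl_interval_months(freq)
    return 12 // interval if interval else None


for tag in ["1M", "2M", "6M", "12M", "Archive_IT", "OneShot"]:
    print(tag, crawl_interval_months(tag), crawls_per_year(tag))
# 1M 1 12
# 2M 2 6
# 6M 6 2
# 12M 12 1
# Archive_IT None None
# OneShot None None
```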

Reference

About Webarchiv Harvests
Comprehensive Harvests

Software and software libraries

Software	Version	Language	Official source of code	Utilization
Heritrix	3.4.0	Java	https://github.com/internetarchive/heritrix3	crawler
Seeder		Python	https://github.com/WebarchivCZ/Seeder.git	web curator tool

Not yet implemented

Define and configure the repository license
Update crawler config files for all harvest types
Create a directory for harvest reports
Create an abuse report for crawlers

License

Footnotes

¹ We are not able to provide the seeds.txt file, as that would violate our agreement with the seed provider NIC.CZ.
