Separate out downloaded pages into different (warc) files #148

Open
DanAbbz92 opened this issue Dec 7, 2017 · 3 comments

Comments

@DanAbbz92

Is there a config option to write each downloaded page to its own WARC file instead of putting them all into the same one?

This would make it easier to extract data for individual items.

@aecio
Member

aecio commented Dec 7, 2017

Do you mean storing the WARC record of each URL in its own file? No, there is no option for that.
But you could try setting the maximum size (in bytes) of each file using:

target_storage.data_format.warc.max_file_size: 262144000

Setting a small enough size would force 1 page per WARC file. That being said, I wouldn't recommend this since you may run into file system problems on large crawls.
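
As a rough illustration of that workaround, here is a minimal sketch of the relevant part of an `ache.yml` file. The `max_file_size` key is the one quoted above; the `target_storage.data_format.type: WARC` line and the value of 1024 bytes are assumptions added for the example, and the exact point at which the writer rolls over to a new file may differ from what the comments suggest.

```yaml
# ache.yml (sketch, not a tested configuration)

# Assumption: the WARC data format is selected with this key.
target_storage.data_format.type: WARC

# Illustrative value only: 1024 bytes is smaller than most pages, so each
# record would be expected to exceed the limit and trigger a new file.
target_storage.data_format.warc.max_file_size: 1024
```

With a limit this low, a large crawl would produce roughly one file per page, which is exactly the situation the file system warning above refers to.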

Another option is to use one of the FILESYSTEM data formats. They create one file per URL, but they don't support the WARC format yet.

@DanAbbz92
Author

Ah, thanks for the update @aecio, and yes, I mean a WARC file per relevant URL.

Does that mean the FILESYSTEM data format is planned to support WARC files in the future?

aecio added the question label Dec 8, 2017
@aecio
Member

aecio commented Dec 8, 2017

No, it is not planned, but it could be included.
