Crawl source information per item/page #31
…ilable when parsing
@wvengen The idea here is that the user has access to the WARC record and can choose to extract any data from it. For example, you can extract the record ID (see the example below). This way we leave the choice to the user rather than exposing specific values. It does, however, require some additional steps for the user to figure out which attributes are available in the WARC record.

extra = {}
warc_record: Optional[WARCRecord] = response.meta.get("warc_record")
if warc_record:
    extra['warc_record_id'] = warc_record.header.record_id
yield {
    ...,
    **extra,
}

Do you think this is sufficient, or do we require additional information? WACZ source data is currently not possible (
Great idea, I think it can be useful to have access to the original WARC record when crawling from an archive. But I think the use case is different: what happens when you crawl from the live internet while storing the result in a WACZ? That is the moment I would like to have the information, so we can store it with the scraped data and locate where it came from. Is this what you meant by "WACZ source data is currently not possible"?
Ah, I misunderstood! I will look into this too. Is there any preference in the data we would like to store? WARC record ID, WACZ file URI (or WACZ file name or so), or something else?
No, this statement was to say that the WACZ source information is not implemented in this PR. Wondering if this was still desired in any way.
Ah, of course. Feel free to add this, but not if it becomes too complex :)
I think WARC record ID would suffice, WACZ file URI would be nice.
There is one complication when adding the entire WARC record to the meta data. The WARC record object that is used in the downloader middleware ( It can be simplified by extracting specific parts of the data (e.g. specific headers), but that would remove the flexibility.
Hmmm, that limits the usefulness. Ideally, as a developer, you don't need to think about whether the crawl goes from the live internet or from a WACZ archive. But there are cases where this matters, and in those cases we need to have handles. Here, the most pressing need is to know how to locate the source in the WACZ, so I think we'd best provide just this: the WACZ URL (as you already do) and some kind of identifier to locate the record. Then even if the logic of the middleware changes in the future, these things can still be provided.
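As a rough illustration of the suggestion above, here is a minimal sketch of how parsing code could attach the source location to an item without caring whether the crawl is live or from an archive. The meta key names `wacz_uri` and `warc_record_id` and both helper functions are hypothetical, since the thread does not settle on final names; the sketch uses plain dictionaries in place of `response.meta`.

```python
from typing import Any, Dict, Mapping

# Hypothetical meta keys; the thread does not fix the final names.
WACZ_URI_KEY = "wacz_uri"
WARC_RECORD_ID_KEY = "warc_record_id"


def source_info(meta: Mapping[str, Any]) -> Dict[str, str]:
    """Return the archive-location fields that are present in ``meta``.

    Returns an empty dict when neither key is set, e.g. on a live crawl
    that does not expose archive information.
    """
    info: Dict[str, str] = {}
    for key in (WACZ_URI_KEY, WARC_RECORD_ID_KEY):
        value = meta.get(key)
        if value is not None:
            info[key] = str(value)
    return info


def build_item(scraped: Dict[str, Any], meta: Mapping[str, Any]) -> Dict[str, Any]:
    """Merge scraped fields with whatever source information is available."""
    return {**scraped, **source_info(meta)}
```

In a spider callback this would be called as `build_item({...}, response.meta)`. Because only two plain strings are copied, the item stays serializable regardless of how the middleware represents the record internally.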
# Record found, try to re-create a response from it.
response = record_transformer.response_for_record(warc_record, request)
response.meta["response_origin"] = warc_record
I think that actually all meta keys need to start with webarchive_, to avoid naming conflicts.
Origin sounds like HTTP Origin, maybe it's better to use something with warc here?
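A minimal sketch of the prefixing idea from this review comment, using a plain dictionary in place of the response meta. The helper `set_webarchive_meta` and the resulting key name `webarchive_warc_record` are assumptions made for illustration, not names taken from the PR.

```python
from typing import Any, MutableMapping

META_PREFIX = "webarchive_"  # prefix proposed in the review comment


def set_webarchive_meta(meta: MutableMapping[str, Any], **values: Any) -> None:
    """Store each value under a prefixed key.

    For example, ``warc_record`` is stored as ``webarchive_warc_record``,
    so a user-defined ``warc_record`` key in the same meta is left untouched.
    """
    for name, value in values.items():
        meta[f"{META_PREFIX}{name}"] = value
```

With this approach, a spider that already uses its own `warc_record` meta key keeps its value, and the middleware's data lives alongside it under the prefixed name.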
Resolves #30
Attach the WARC record to the request/response meta. This way the user can choose to attach source information when parsing an item.
warc_record to the request/response meta data
Todo's