
Crawl source information per item/page #31

Open · wants to merge 2 commits into main
Conversation

leewesleyv
Collaborator

@leewesleyv leewesleyv commented Jan 24, 2025

Resolves #30

Attach the WARC record to the request/response meta. This way the user can choose to attach source information when parsing an item.

  • Adds the warc_record to the request/response metadata

To-dos

  • Provide a recipe for extracting this information during scraping to the documentation
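For context, the attachment described in the first bullet could look roughly like this. This is a minimal sketch with stand-in classes (FakeResponse and FakeRecord are placeholders for illustration; the real middleware operates on Scrapy's Request/Response objects and the WARC library's record class):

```python
# Sketch only: attach a WARC record to the response meta so the spider
# can decide later what, if anything, to extract from it.
# FakeRecord and FakeResponse are stand-ins, not real project classes.

class FakeRecord:
    def __init__(self, record_id: str):
        self.record_id = record_id

class FakeResponse:
    def __init__(self):
        self.meta: dict = {}

def attach_warc_record(response: FakeResponse, record: FakeRecord) -> FakeResponse:
    # The middleware would do the equivalent of this one line, leaving
    # the choice of what to extract entirely to the user.
    response.meta["warc_record"] = record
    return response

response = attach_warc_record(FakeResponse(), FakeRecord("<urn:uuid:example>"))
print(response.meta["warc_record"].record_id)
```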

@leewesleyv
Collaborator Author

leewesleyv commented Jan 24, 2025

@wvengen The idea here is that the user has access to the WARC record, and can choose to extract any data from it. For example, you can extract the record ID (see example below). This way we leave the choice to the user rather than exposing specific values. This does however require some additional steps for the user to figure out what attributes are available in the WARCRecord class. That is also the motivation behind adding a recipes page to the documentation to guide people in the right direction.

from typing import Optional

from warc.warc import WARCRecord

extra = {}
warc_record: Optional[WARCRecord] = response.meta.get("warc_record")

if warc_record:
    extra["warc_record_id"] = warc_record.header.record_id

yield {
    ...,
    **extra,
}

Do you think this is sufficient, or do we require additional information? WACZ source data is currently not possible (WARCRecord does not provide this information).

@leewesleyv leewesleyv requested a review from wvengen January 24, 2025 12:51
@leewesleyv leewesleyv self-assigned this Jan 24, 2025
@wvengen
Member

wvengen commented Jan 24, 2025

Great idea, I think it can be useful to have access to the original WARC record during crawling from an archive.

But I think the use-case is different: what happens when you crawl from the live internet while storing the result in a WACZ? That is the moment I would like to have the information, so we can store the result with the scraped data and locate where it came from.

Is this what you meant with "WACZ source data is currently not possible"?

@leewesleyv
Collaborator Author

But I think the use-case is different: What happens when you crawl from the live internet, while storing the result in a WACZ? That is the moment I would like to have the information, so we can store the result with the scraped data, and locate where it came from.

Ah, I misunderstood! I will look into this too. Is there any preference in the data we would like to store? WARC record ID, WACZ file URI (or WACZ file name or so), or something else?

Is this what you meant with "WACZ source data is currently not possible"?

No, this statement was to say that the WACZ source information is not implemented in this PR. Wondering if this was still desired in any way.

@wvengen
Member

wvengen commented Jan 24, 2025

No, this statement was to say that the WACZ source information is not implemented in this PR. Wondering if this was still desired in any way.

Ah, of course. Feel free to add this, but not if it becomes too complex :)

@wvengen
Member

wvengen commented Jan 24, 2025

Is there any preference in the data we would like to store?

I think WARC record ID would suffice, WACZ file URI would be nice.

@leewesleyv
Collaborator Author

There is one complication with adding the entire WARC record to the metadata. The WARC record object used in the downloader middleware (warc.warc.WARCRecord) is different from the WARC record object used in the extension (warcio.recordloader.ArcWarcRecord). This means that if we add the WARC record to the metadata, we need different ways to parse the response depending on whether we are crawling an archive or writing one.

It can be simplified by extracting specific parts of the data (e.g. specific headers), but that would remove the flexibility.
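One way to keep that flexibility while hiding the split would be a small helper that resolves the record ID for either class. This is a sketch, not part of this PR; it assumes warc.warc.WARCRecord exposes header.record_id and warcio's ArcWarcRecord exposes rec_headers.get_header("WARC-Record-ID"), per those libraries' interfaces:

```python
def warc_record_id(record):
    """Return the WARC-Record-ID for either record flavour.

    Assumes warc.warc.WARCRecord exposes record.header.record_id and
    warcio.recordloader.ArcWarcRecord exposes
    record.rec_headers.get_header("WARC-Record-ID").
    """
    # warc.warc.WARCRecord style: record.header.record_id
    header = getattr(record, "header", None)
    if header is not None and hasattr(header, "record_id"):
        return header.record_id
    # warcio ArcWarcRecord style: record.rec_headers.get_header(...)
    rec_headers = getattr(record, "rec_headers", None)
    if rec_headers is not None:
        return rec_headers.get_header("WARC-Record-ID")
    raise TypeError(f"unsupported record type: {type(record).__name__}")
```

A spider could then call this on whatever lands in the meta without caring which middleware produced it.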

@wvengen
Member

wvengen commented Jan 24, 2025

Hmmm, that limits the usefulness. Ideally, as a developer, you don't need to think about whether the crawl goes from the live internet or from a WACZ archive. But there are cases where this matters, and in those cases we need to have handles. Here, the most pressing need is to know how to locate the source in the WACZ, so I think we'd best provide just that: the WACZ URL (as you already do) and some kind of identifier to locate the record. Then even if the logic of the middleware changes in the future, these things can still be provided.

# Record found, try to re-create a response from it.
response = record_transformer.response_for_record(warc_record, request)
response.meta["response_origin"] = warc_record
Member

I think that actually all meta keys need to start with webarchive_, to avoid naming conflicts.
Origin sounds like HTTP Origin, maybe it's better to use something with warc here?
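Combining the naming suggestion with the earlier idea of exposing the WACZ URL plus an identifier, the meta could end up looking something like this. Both the key names and the values below are illustrative, not decided:

```python
# Illustrative only: namespaced meta keys to avoid clashes with other
# middlewares. Neither the key names nor the example values are final.
meta = {
    "webarchive_warc_record_id": "<urn:uuid:0a1b2c3d>",  # hypothetical key
    "webarchive_wacz_uri": "s3://bucket/crawl.wacz",     # hypothetical key
}

# A spider reads them defensively, since they would be absent on live
# crawls that do not pass through the archive middleware:
record_id = meta.get("webarchive_warc_record_id")
wacz_uri = meta.get("webarchive_wacz_uri")
print(record_id, wacz_uri)
```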

Development

Successfully merging this pull request may close these issues.

Adding source information to a scraped item
2 participants