Adding source information to a scraped item #30

leewesleyv · 2025-01-21T09:48:34Z

As a user of this middleware, I'd like to be able to locate the specific response the item originated from, when looking at the item.

Scraped item should/can include information such as:

WACZ archive
CDXJ entry
WARC location

Think about which information would be useful to attach to the crawled item so that it refers back to the item's source.
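To make the idea concrete, here is a minimal sketch of what source information on an item might look like. All names (`ItemSource`, `attach_source`, the `_source` key, and every field name) are hypothetical and chosen for illustration only; none of them come from the middleware.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ItemSource:
    """Hypothetical pointer from a scraped item back to its archived response."""

    wacz_uri: str  # location of the WACZ archive
    warc_filename: str  # WARC file inside the archive
    warc_offset: int  # byte offset of the record within the WARC file
    warc_length: int  # byte length of the record
    cdxj_key: Optional[str] = None  # key of the matching CDXJ entry, if known


def attach_source(item: dict, source: ItemSource, key: str = "_source") -> dict:
    """Return a copy of the item with the source information stored under `key`."""
    return {**item, key: asdict(source)}


# Made-up values, for illustration only.
source = ItemSource(
    wacz_uri="s3://example-bucket/archive.wacz",
    warc_filename="archive.warc.gz",
    warc_offset=1024,
    warc_length=512,
)
item = attach_source({"title": "Example"}, source)
```

Keeping the source under a single key keeps the item's own fields untouched and makes it easy to drop the extra data when exporting.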

wvengen · 2025-01-21T09:59:14Z

The WACZ archive could also be stored in the job's properties (the same place where statistics end up), if it is the same for all items.
Scrapy tracks unique requests by fingerprint. This may or may not be appropriate here.

I think that the index (CDXJ) would be the sensible place to locate the WARC entry. If there is a unique id, that could be something to expose, that a user can choose to emit in the scraped item. If there is no such unique id, I think that CDXJ supports custom properties, so e.g. the fingerprint could be added there.
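A rough sketch of the fingerprint idea, assuming a CDXJ line has the shape `<key> <timestamp> <json block>` and that the JSON block tolerates an extra field. The `fingerprint` field name, both helper functions, and all sample values are hypothetical; this is not how the middleware currently writes its index.

```python
import json
from typing import Iterable, Optional


def make_cdxj_line(key: str, timestamp: str, fields: dict, fingerprint: str) -> str:
    """Build a CDXJ line whose JSON block carries an extra `fingerprint` field."""
    block = {**fields, "fingerprint": fingerprint}
    return f"{key} {timestamp} {json.dumps(block)}"


def find_by_fingerprint(lines: Iterable[str], fingerprint: str) -> Optional[dict]:
    """Return the JSON block of the first entry with a matching fingerprint."""
    for line in lines:
        # Split only twice so that spaces inside the JSON block are preserved.
        _key, _timestamp, raw = line.split(" ", 2)
        block = json.loads(raw)
        if block.get("fingerprint") == fingerprint:
            return block
    return None


index = [
    make_cdxj_line(
        "com,example)/",
        "20250121094834",
        {
            "url": "https://example.com/",
            "filename": "archive.warc.gz",
            "offset": "1024",
            "length": "512",
        },
        fingerprint="0123abcd",  # made-up value for illustration
    )
]
entry = find_by_fingerprint(index, "0123abcd")
```

With something like this, an item would only need to carry the fingerprint; the WARC filename, offset and length can then be looked up in the index.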

leewesleyv added the documentation and enhancement labels · Jan 21, 2025

leewesleyv linked a pull request Jan 24, 2025 that will close this issue

Crawl source information per item/page #31

Open

1 task
