The WACZ archive could also be stored in the job's properties (the same place where the statistics end up), if it is the same for all items.
Scrapy tracks unique requests by fingerprint. This may or may not be appropriate here.
I think that the index (CDXJ) would be the sensible place to locate the WARC entry. If there is a unique ID, it could be exposed so that a user can choose to emit it in the scraped item. If there is no such unique ID, I think that CDXJ supports custom properties, so e.g. the fingerprint could be added there.
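To make the idea concrete, here is a minimal sketch of storing a fingerprint as a custom property in a CDXJ-style index line and using it to get from an item back to its WARC entry. Everything in it is an assumption for illustration: `simple_fingerprint` is a simplified stand-in and not Scrapy's actual fingerprint algorithm, the `fingerprint` and `source_fingerprint` field names are hypothetical, and the `filename`/`offset`/`length` values are made up.

```python
import hashlib
import json


def simple_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Simplified stand-in for a request fingerprint (NOT Scrapy's exact algorithm)."""
    digest = hashlib.sha1()
    digest.update(method.upper().encode())
    digest.update(b"\n")
    digest.update(url.encode())
    digest.update(b"\n")
    digest.update(body)
    return digest.hexdigest()


def cdxj_line(surt_key: str, timestamp: str, fields: dict) -> str:
    """Build one CDXJ-style line: '<key> <timestamp> <JSON block>'."""
    return f"{surt_key} {timestamp} {json.dumps(fields, sort_keys=True)}"


def find_by_fingerprint(index_lines, fingerprint):
    """Return the JSON block of the first index line whose fingerprint matches."""
    for line in index_lines:
        # Key and timestamp contain no spaces, so the rest is the JSON block.
        _key, _timestamp, json_block = line.split(" ", 2)
        fields = json.loads(json_block)
        if fields.get("fingerprint") == fingerprint:
            return fields
    return None


fp = simple_fingerprint("GET", "https://example.com/products/1")

# Index entry with the (hypothetical) custom "fingerprint" property.
index = [
    cdxj_line(
        "com,example)/products/1",
        "20240101120000",
        {
            "url": "https://example.com/products/1",
            "filename": "example-00000.warc.gz",
            "offset": "1024",
            "length": "2048",
            "fingerprint": fp,
        },
    )
]

# The scraped item only needs to carry the fingerprint to be traceable.
item = {"name": "Example product", "source_fingerprint": fp}
entry = find_by_fingerprint(index, item["source_fingerprint"])
```

With this layout the item stays small: it carries one opaque key, and the archive file name and byte offset are resolved through the index only when someone actually needs the original response.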
As a user of this middleware, I'd like to be able to locate the specific response the item originated from, when looking at the item.
Scraped item should/can include information such as:
Think about what information would be useful to attach to the crawled item so that it refers back to the item's source.