Adding source information to a scraped item #30

Open
leewesleyv opened this issue Jan 21, 2025 · 1 comment · May be fixed by #31
Labels
documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@leewesleyv
Collaborator

leewesleyv commented Jan 21, 2025

As a user of this middleware, I'd like to be able to locate the specific response the item originated from, when looking at the item.

The scraped item should (or at least could) include information such as:

  • WACZ archive
  • CDXJ entry
  • WARC location

Think about which information would be useful to attach to the crawled item so that it refers back to the item's source.
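To make the request concrete, here is a minimal sketch of what attaching source information to an item could look like. It uses only the standard library; `SourceInfo`, `attach_source`, the `_source` field name, and all example values are hypothetical and are not part of the middleware's API.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class SourceInfo:
    """Hypothetical container describing where a response lives in an archive."""

    wacz_uri: str       # location of the WACZ archive
    cdxj_key: str       # key of the matching CDXJ index entry
    warc_filename: str  # WARC file inside the WACZ archive
    warc_offset: int    # byte offset of the record within the WARC file
    warc_length: int    # byte length of the record


def attach_source(
    item: Dict[str, Any], source: SourceInfo, field: str = "_source"
) -> Dict[str, Any]:
    """Return a copy of `item` with the source information stored under `field`."""
    if field in item:
        raise KeyError(f"item already has a {field!r} field")
    enriched = dict(item)  # leave the original item untouched
    enriched[field] = asdict(source)
    return enriched


# Example values are made up for illustration.
source = SourceInfo(
    wacz_uri="s3://bucket/archive.wacz",
    cdxj_key="com,example)/page 20250121000000",
    warc_filename="data.warc.gz",
    warc_offset=1024,
    warc_length=2048,
)
item = attach_source({"title": "Example"}, source)
```

Keeping the source data under a single nested field avoids name clashes with the item's own fields and makes it easy to drop before export if it is not wanted.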

@leewesleyv leewesleyv added the documentation and enhancement labels Jan 21, 2025
@wvengen
Member

wvengen commented Jan 21, 2025

The WACZ archive could also be stored in the job's properties (the same place where statistics end up), if it is the same for all items.
Scrapy tracks unique requests by fingerprint. This may or may not be appropriate here.

I think that the index (CDXJ) would be the sensible place to locate the WARC entry. If there is a unique ID, it could be exposed so that a user can choose to emit it in the scraped item. If there is no such unique ID, I think that CDXJ supports custom properties, so e.g. the fingerprint could be added there.
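As a rough sketch of the last idea, assuming a CDXJ line has the shape `<key> <timestamp> <JSON object>` and that an extra `fingerprint` property in the JSON object is acceptable, an entry could be written and looked up as below. All function names and example values are hypothetical, and `toy_fingerprint` is only a stand-in, not Scrapy's actual fingerprint algorithm.

```python
import hashlib
import json
from typing import Dict, Iterable, Optional, Tuple


def toy_fingerprint(method: str, url: str) -> str:
    """Stand-in fingerprint (SHA-1 of method and URL); NOT Scrapy's algorithm."""
    return hashlib.sha1(f"{method} {url}".encode("utf-8")).hexdigest()


def build_cdxj_line(key: str, timestamp: str, fields: Dict[str, str]) -> str:
    """Serialise one index entry as '<key> <timestamp> <json>'."""
    return f"{key} {timestamp} {json.dumps(fields, sort_keys=True)}"


def parse_cdxj_line(line: str) -> Tuple[str, str, Dict[str, str]]:
    """Split a line into key, timestamp and the decoded JSON object."""
    key, timestamp, payload = line.split(" ", 2)
    return key, timestamp, json.loads(payload)


def find_by_fingerprint(
    lines: Iterable[str], fingerprint: str
) -> Optional[Dict[str, str]]:
    """Return the fields of the first entry whose custom property matches."""
    for line in lines:
        _, _, fields = parse_cdxj_line(line)
        if fields.get("fingerprint") == fingerprint:
            return fields
    return None


# Example values are made up for illustration.
url = "https://example.com/page"
index = [
    build_cdxj_line(
        "com,example)/page",
        "20250121000000",
        {
            "url": url,
            "filename": "data.warc.gz",
            "offset": "1024",
            "length": "2048",
            "fingerprint": toy_fingerprint("GET", url),  # custom property
        },
    )
]
entry = find_by_fingerprint(index, toy_fingerprint("GET", url))
```

With such a property in place, an item would only need to carry the fingerprint; the WARC filename, offset and length can then be resolved from the index on demand.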

@leewesleyv leewesleyv linked a pull request Jan 24, 2025 that will close this issue