
Crawl source information per item/page #31

Open · wants to merge 2 commits into main
Conversation

leewesleyv
Collaborator

@leewesleyv leewesleyv commented Jan 24, 2025

Resolves #30

Attach the WARC record to the request/response meta. This way the user can choose to attach source information when parsing an item.

  • Adds the warc_record to the request/response metadata

To-dos

  • Provide a recipe for extracting this information during scraping to the documentation
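For context, the attachment described in the first bullet could look roughly like this. This is a minimal sketch with stand-in classes (FakeResponse and FakeRecord are placeholders for illustration; the real middleware operates on Scrapy's Request/Response objects and the WARC library's record class):

```python
# Sketch only: attach a WARC record to the response meta so the spider
# can decide later what, if anything, to extract from it.
# FakeRecord and FakeResponse are stand-ins, not real project classes.

class FakeRecord:
    def __init__(self, record_id: str):
        self.record_id = record_id

class FakeResponse:
    def __init__(self):
        self.meta: dict = {}

def attach_warc_record(response: FakeResponse, record: FakeRecord) -> FakeResponse:
    # The middleware would do the equivalent of this one line, leaving
    # the choice of what to extract entirely to the user.
    response.meta["warc_record"] = record
    return response

response = attach_warc_record(FakeResponse(), FakeRecord("<urn:uuid:example>"))
print(response.meta["warc_record"].record_id)
```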

@leewesleyv
Collaborator Author

leewesleyv commented Jan 24, 2025

@wvengen The idea here is that the user has access to the WARC record, and can choose to extract any data from it. For example, you can extract the record ID (see example below). This way we leave the choice to the user rather than exposing specific values. This does however require some additional steps for the user to figure out what attributes are available in the WARCRecord class. That is also the motivation behind adding a recipes page to the documentation to guide people in the right direction.

from typing import Optional

from warc.warc import WARCRecord

extra = {}
warc_record: Optional[WARCRecord] = response.meta.get("warc_record")

if warc_record:
    extra["warc_record_id"] = warc_record.header.record_id

yield {
    ...,
    **extra,
}

Do you think this is sufficient, or do we require additional information? WACZ source data is currently not possible (WARCRecord does not provide this information).

@leewesleyv leewesleyv requested a review from wvengen January 24, 2025 12:51
@leewesleyv leewesleyv self-assigned this Jan 24, 2025
@wvengen
Member

wvengen commented Jan 24, 2025

Great idea, I think it can be useful to have access to the original WARC record during crawling from an archive.

But I think the use-case is different: what happens when you crawl from the live internet while storing the result in a WACZ? That is the moment I would like to have the information, so we can store the result with the scraped data and locate where it came from.

Is this what you meant with "WACZ source data is currently not possible"?

@leewesleyv
Collaborator Author

But I think the use-case is different: What happens when you crawl from the live internet, while storing the result in a WACZ? That is the moment I would like to have the information, so we can store the result with the scraped data, and locate where it came from.

Ah, I misunderstood! I will look into this too. Is there any preference in the data we would like to store? WARC record ID, WACZ file URI (or WACZ file name or so), or something else?

Is this what you meant with "WACZ source data is currently not possible"?

No, this statement was to say that the WACZ source information is not implemented in this PR. Wondering if this was still desired in any way.

@wvengen
Member

wvengen commented Jan 24, 2025

No, this statement was to say that the WACZ source information is not implemented in this PR. Wondering if this was still desired in any way.

Ah, of course. Feel free to add this, but not if it becomes too complex :)

@wvengen
Member

wvengen commented Jan 24, 2025

Is there any preference in the data we would like to store?

I think WARC record ID would suffice, WACZ file URI would be nice.

@leewesleyv
Collaborator Author

There is one complication with adding the entire WARC record to the metadata. The WARC record object used in the downloader middleware (warc.warc.WARCRecord) is different from the WARC record object used in the extension (warcio.recordloader.ArcWarcRecord). This means that if we add the WARC record to the metadata, we need different ways to parse the response depending on whether we are crawling an archive or writing one.

It can be simplified by extracting specific parts of the data (e.g. specific headers), but that would remove the flexibility.
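One way to keep that flexibility while hiding the split would be a small helper that resolves the record ID for either class. This is a sketch, not part of this PR; it assumes warc.warc.WARCRecord exposes header.record_id and warcio's ArcWarcRecord exposes rec_headers.get_header("WARC-Record-ID"), per those libraries' interfaces:

```python
def warc_record_id(record):
    """Return the WARC-Record-ID for either record flavour.

    Assumes warc.warc.WARCRecord exposes record.header.record_id and
    warcio.recordloader.ArcWarcRecord exposes
    record.rec_headers.get_header("WARC-Record-ID").
    """
    # warc.warc.WARCRecord style: record.header.record_id
    header = getattr(record, "header", None)
    if header is not None and hasattr(header, "record_id"):
        return header.record_id
    # warcio ArcWarcRecord style: record.rec_headers.get_header(...)
    rec_headers = getattr(record, "rec_headers", None)
    if rec_headers is not None:
        return rec_headers.get_header("WARC-Record-ID")
    raise TypeError(f"unsupported record type: {type(record).__name__}")
```

A spider could then call this on whatever lands in the meta without caring which middleware produced it.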

@wvengen
Member

wvengen commented Jan 24, 2025

Hmmm, that limits the usefulness. Ideally, as a developer, you don't need to think about whether the crawl goes from the live internet or from a WACZ archive. But there are cases where this matters, and in those cases we need to have handles. Here, the most pressing need is to know how to locate the source in the WACZ, so I think we'd best provide just that: the WACZ URL (as you already do) and some kind of identifier to locate the record. Then even if the logic of the middleware changes in the future, these things can still be provided.

# Record found, try to re-create a response from it.
response = record_transformer.response_for_record(warc_record, request)
response.meta["response_origin"] = warc_record
Member

I think that actually all meta keys need to start with webarchive_, to avoid naming conflicts.
Origin sounds like HTTP Origin, maybe it's better to use something with warc here?
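Combining the naming suggestion with the earlier idea of exposing the WACZ URL plus an identifier, the meta could end up looking something like this. Both the key names and the values below are illustrative, not decided:

```python
# Illustrative only: namespaced meta keys to avoid clashes with other
# middlewares. Neither the key names nor the example values are final.
meta = {
    "webarchive_warc_record_id": "<urn:uuid:0a1b2c3d>",  # hypothetical key
    "webarchive_wacz_uri": "s3://bucket/crawl.wacz",     # hypothetical key
}

# A spider reads them defensively, since they would be absent on live
# crawls that do not pass through the archive middleware:
record_id = meta.get("webarchive_warc_record_id")
wacz_uri = meta.get("webarchive_wacz_uri")
print(record_id, wacz_uri)
```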

Development

Successfully merging this pull request may close these issues.

Adding source information to a scraped item
2 participants