Crawl source information per item/page #31

Merged · 9 commits · Jan 28, 2025
42 changes: 38 additions & 4 deletions docs/advanced_usage.md
@@ -6,7 +6,7 @@

The `wacz_crawl_skip` flag is applied to requests that should be ignored by the crawler. When this flag is present, the middleware intercepts the request and prevents it from being processed further, skipping both download and parsing. This is useful in scenarios where the request should not be collected during a scraping session. Usage:

```python
``` py
yield Request(url, callback=cb_func, flags=["wacz_crawl_skip"])
```
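
For a bit more context, a hypothetical spider might flag asset requests this way so they are skipped when crawling from a WACZ; the spider name, URLs, and CSS selectors below are illustrative assumptions, not part of this change:

``` py
from scrapy import Request, Spider


class SkipAssetsSpider(Spider):
    # Hypothetical spider for illustration only.
    name = "skip_assets"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Follow content pages as usual.
        for href in response.css("a::attr(href)").getall():
            yield Request(response.urljoin(href), callback=self.parse)

        # Flag stylesheet requests so the WACZ crawl middleware skips them.
        for href in response.css("link[rel=stylesheet]::attr(href)").getall():
            yield Request(response.urljoin(href), flags=["wacz_crawl_skip"])
```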

@@ -28,7 +28,7 @@ Going around the default behaviour of the spider, the `WaczCrawlMiddleware` spid

To use this strategy, enable both the spider middleware and the downloader middleware in the spider settings like so:

```python
``` py title="settings.py"
DOWNLOADER_MIDDLEWARES = {
"scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}
@@ -40,7 +40,7 @@ SPIDER_MIDDLEWARES = {

Then define the location of the WACZ archive with the `SW_WACZ_SOURCE_URI` setting:

```python
``` py title="settings.py"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
```
@@ -49,7 +49,10 @@ SW_WACZ_CRAWL = True

Not all URLs will be interesting for the crawl, since your WACZ will most likely contain static files such as fonts, JavaScript (both from the site itself and external), and stylesheets. To improve the spider's performance by not reading all these irrelevant request/response entries, you can configure the `archive_regex` attribute in your spider:

```python
``` py title="my_wacz_spider.py" hl_lines="6"
from scrapy.spiders import Spider


class MyWaczSpider(Spider):
name = "myspider"
archive_regex = r"^/tag/[\w-]+/$"
@@ -74,3 +77,34 @@ com,toscrape,quotes)/static/main.css 20241007081525074 {...}
> com,toscrape,quotes)/tag/simile/ 20241007081524944 {...}
> com,toscrape,quotes)/tag/truth/ 20241007081523804 {...}
```

## Requests and Responses

### Special Keys in Request.meta

The `Request.meta` attribute in Scrapy allows you to store arbitrary data for use during the crawling process. While you can store any custom data in this attribute, Scrapy and its built-in extensions recognize certain special keys. Additionally, the `scrapy-webarchive` extension introduces its own special key for managing metadata. Below is a description of the key used by `scrapy-webarchive`:

* `webarchive_warc`

#### `webarchive_warc`
This key stores the result of a WACZ crawl or export. The data associated with this key is read-only and is not used to control Scrapy's behavior. The value of this key can be accessed using the constant `WEBARCHIVE_META_KEY`, but direct usage of this constant is discouraged. Instead, you should use the provided class method to instantiate a metadata object, as shown in the example below:

``` py title="my_wacz_spider.py" hl_lines="10"
from scrapy.spiders import Spider
from scrapy_webarchive.models import WarcMetadata


class MyWaczSpider(Spider):
name = "myspider"

def parse_function(self, response):
# Instantiate a WarcMetadata object from the response
warc_meta = WarcMetadata.from_response(response)

# Extract the attributes to attach while parsing a page/item
if warc_meta:
yield {
'warc_record_id': warc_meta.record_id,
'wacz_uri': warc_meta.wacz_uri,
}
```
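
If this provenance should be attached to every item rather than repeated in each callback, a small custom spider middleware could centralize the lookup. The sketch below is an illustrative assumption (this middleware class is not part of this change) and only touches dict items:

``` py
from scrapy_webarchive.models import WarcMetadata


class WarcProvenanceMiddleware:
    """Hypothetical spider middleware that adds WARC provenance to dict items."""

    def process_spider_output(self, response, result, spider):
        warc_meta = WarcMetadata.from_response(response)
        for element in result:
            if warc_meta and isinstance(element, dict):
                element.setdefault("warc_record_id", warc_meta.record_id)
                element.setdefault("wacz_uri", warc_meta.wacz_uri)
            yield element
```

Enable it via `SPIDER_MIDDLEWARES` in `settings.py` like any other spider middleware.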
10 changes: 8 additions & 2 deletions docs/installation.md
@@ -2,14 +2,20 @@

To install `scrapy-webarchive`, run:

```bash
``` bash
pip install scrapy-webarchive
```

If you want to use a cloud provider for storage or scraping, opt in to the corresponding extra dependencies:

```bash
``` bash
pip install scrapy-webarchive[aws]
```

``` bash
pip install scrapy-webarchive[gcs]
```

``` bash
pip install scrapy-webarchive[all]
```
8 changes: 4 additions & 4 deletions docs/settings.md
@@ -6,7 +6,7 @@

### `SW_EXPORT_URI`

```python
``` py title="settings.py"
# Either configure the directory where the output should be uploaded to
SW_EXPORT_URI = "s3://scrapy-webarchive/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/"
@@ -45,7 +45,7 @@ This setting defines the description of the WACZ used in the `datapackage.json`,

⚠️ Scraping against a remote source currently only supports AWS S3.

```python
``` py title="settings.py"
# "file://" must be explicitly added, unlike SW_EXPORT_URI where it makes an assumption if no scheme is added.
SW_WACZ_SOURCE_URI = "file:///Users/username/Documents/archive.wacz"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
@@ -58,8 +58,8 @@ This setting defines the location of the WACZ file that should be used as a sour

### `SW_WACZ_CRAWL`

```python
``` py title="settings.py"
SW_WACZ_CRAWL = True
```

Setting to ignore original `start_requests`, just yield all responses found in WACZ.
Setting to ignore original `start_requests`, just yield all responses found in WACZ. For more information see [Iterating a WACZ archive index](advanced_usage.md#iterating-a-wacz-archive-index).
9 changes: 4 additions & 5 deletions docs/usage.md
@@ -11,15 +11,15 @@ The general use for this plugin is separated in two parts, exporting and crawlin

To archive the requests/responses during a crawl job you need to enable the `WaczExporter` extension.

```python
``` py title="settings.py"
EXTENSIONS = {
"scrapy_webarchive.extensions.WaczExporter": 543,
}
```

This extension also requires you to set the export location using the `SW_EXPORT_URI` setting (check the settings page for the different export options).

```python
``` py title="settings.py"
SW_EXPORT_URI = "s3://scrapy-webarchive/"
```

@@ -33,15 +33,14 @@ To crawl against a WACZ archive you need to use the `WaczMiddleware` downloader

To use the downloader middleware, enable it in the settings like so:

```python
``` py title="settings.py"
DOWNLOADER_MIDDLEWARES = {
"scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}
```

Then define the location of the WACZ archive with the `SW_WACZ_SOURCE_URI` setting:

```python
``` py title="settings.py"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
```
15 changes: 14 additions & 1 deletion mkdocs.yml
@@ -4,11 +4,24 @@ site_url: https://github.com/q-m/scrapy-webarchive

theme:
name: material
features:
- content.code.copy
- content.code.select
- content.code.annotate

nav:
- Introduction: index.md
- Installation: installation.md
- Usage:
- Usage: usage.md
- Advanced Usage: advanced_usage.md
- Settings: settings.md
- Settings: settings.md

markdown_extensions:
- pymdownx.highlight:
anchor_linenums: true
line_spans: __span
pygments_lang_class: true
- pymdownx.inlinehilite
- pymdownx.snippets
- pymdownx.superfences
18 changes: 14 additions & 4 deletions scrapy_webarchive/downloadermiddlewares.py
@@ -6,7 +6,9 @@
from scrapy.spiders import Spider

from scrapy_webarchive.exceptions import WaczMiddlewareException
from scrapy_webarchive.models import WarcMetadata
from scrapy_webarchive.spidermiddlewares import BaseWaczMiddleware
from scrapy_webarchive.utils import WEBARCHIVE_META_KEY
from scrapy_webarchive.warc import record_transformer


@@ -48,17 +50,25 @@ def process_request(self, request: Request, spider: Spider):

# Get record from existing index entry, or else lookup by URL.
if request.meta.get("cdxj_record"):
warc_record = self.wacz.get_warc_from_cdxj_record(cdxj_record=request.meta["cdxj_record"])
cdxj_record = request.meta["cdxj_record"]
warc_record = self.wacz.get_warc_from_cdxj_record(cdxj_record=cdxj_record)
else:
warc_record = self.wacz.get_warc_from_url(url=request.url)
warc_record, cdxj_record = self.wacz.get_warc_from_url(url=request.url)

# When page not found in archive, return status 404 and record it in a statistic.
if not warc_record:
self.stats.inc_value("webarchive/response_not_found", spider=spider)
return Response(url=request.url, status=404)

# Record found, try to re-create a response from it.
response = record_transformer.response_for_record(warc_record)
response = record_transformer.response_for_record(warc_record, request)

warc_metadata = WarcMetadata(
action="read",
record_id=warc_record.header.record_id,
wacz_uri=cdxj_record.wacz_file.storage_handler.uri,
)
response.meta[WEBARCHIVE_META_KEY] = warc_metadata.to_dict()

if not response:
self.stats.inc_value("webarchive/response_not_recognized", spider=spider)
44 changes: 28 additions & 16 deletions scrapy_webarchive/extensions.py
@@ -16,7 +16,8 @@
from twisted.internet.defer import Deferred
from typing_extensions import Any, Dict, Protocol, Self, Type, Union, cast

from scrapy_webarchive.utils import WARC_DT_FORMAT, get_formatted_dt_string, get_scheme_from_uri
from scrapy_webarchive.models import WarcMetadata
from scrapy_webarchive.utils import WARC_DT_FORMAT, WEBARCHIVE_META_KEY, get_formatted_dt_string, get_scheme_from_uri
from scrapy_webarchive.wacz.creator import WaczFileCreator
from scrapy_webarchive.warc import WarcFileWriter

@@ -61,11 +62,19 @@ def __init__(self, settings: Settings, crawler: Crawler) -> None:
self.spider_name = crawler.spidercls.name if hasattr(crawler.spidercls, "name") else crawler.spider.name

# Get the store URI and configure the WACZ filename
store_uri, self.wacz_fname = self._retrieve_store_uri_and_wacz_fname()
self.store_uri, self.wacz_fname = self._retrieve_store_uri_and_wacz_fname()

# Initialize store and writer
self.store: FilesStoreProtocol = self._get_store(store_uri)
# Initialize store, writer and creator
self.store: FilesStoreProtocol = self._get_store(self.store_uri)
self.writer = WarcFileWriter(collection_name=self.spider_name)
self.wacz_creator = WaczFileCreator(
store=self.store,
warc_fname=self.writer.warc_fname,
wacz_fname=self.wacz_fname,
collection_name=crawler.spider.name,
title=self.settings["SW_WACZ_TITLE"],
description=self.settings["SW_WACZ_DESCRIPTION"],
)

def _check_configuration_prerequisites(self) -> None:
"""raises NotConfigured if essential settings or middleware configurations are incorrect."""
@@ -143,27 +152,30 @@ def response_downloaded(self, response: Response, request: Request, spid
request.meta["WARC-Date"] = get_formatted_dt_string(format=WARC_DT_FORMAT)

# Write response WARC record
record = self.writer.write_response(response, request)
response_record = self.writer.write_response(response, request)
self.stats.inc_value("webarchive/exporter/response_written", spider=spider)
self.stats.inc_value(
f"webarchive/exporter/writer_status_count/{record.http_headers.get_statuscode()}",
f"webarchive/exporter/writer_status_count/{response_record.http_headers.get_statuscode()}",
spider=spider,
)

# Write request WARC record
self.writer.write_request(request, concurrent_to=record)
self.writer.write_request(request, concurrent_to=response_record)
self.stats.inc_value("webarchive/exporter/request_written", spider=spider)

def spider_closed(self, spider: Spider) -> None:
wacz_creator = WaczFileCreator(
store=self.store,
warc_fname=self.writer.warc_fname,
wacz_fname=self.wacz_fname,
collection_name=spider.name,
title=self.settings["SW_WACZ_TITLE"],
description=self.settings["SW_WACZ_DESCRIPTION"],
warc_metadata = WarcMetadata(
action="write",
record_id=response_record.rec_headers.get_header('WARC-Record-ID'),
wacz_uri=self.export_uri,
)
wacz_creator.create()
request.meta[WEBARCHIVE_META_KEY] = warc_metadata.to_dict()

def spider_closed(self, spider: Spider) -> None:
self.wacz_creator.create()

@property
def export_uri(self) -> str:
return os.path.join(self.store_uri, self.wacz_creator.wacz_fname)


def get_archive_uri_template_dt_variables() -> dict:
44 changes: 44 additions & 0 deletions scrapy_webarchive/models.py
@@ -0,0 +1,44 @@
from dataclasses import dataclass
from typing import Union

from scrapy.http.response import Response

from scrapy_webarchive.utils import WEBARCHIVE_META_KEY


@dataclass
class WarcMetadata:
"""
Encapsulates metadata about the WARC record.

Attributes:
action (str): The action performed ("read" or "write").
record_id (str): The unique ID of the WARC record.
wacz_uri (str): The URI of the WACZ file.
"""
action: str
record_id: str
wacz_uri: str

def to_dict(self) -> dict:
"""Convert the object to a dictionary for compatibility with Scrapy's meta."""

return {
"action": self.action,
"record_id": self.record_id,
"wacz_uri": self.wacz_uri,
}

@classmethod
def from_response(cls, response: Response) -> Union["WarcMetadata", None]:
"""Create a WarcMetadata instance from a Scrapy response object."""

if not hasattr(response, "meta"):
return None

warc_meta = response.meta.get(WEBARCHIVE_META_KEY)

if not warc_meta:
return None

return cls(**warc_meta)
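
As a quick illustration of how this dataclass round-trips through `Request.meta` (the URL and record ID below are made-up values, not taken from the diff):

``` py
from scrapy.http import Request, TextResponse

from scrapy_webarchive.models import WarcMetadata
from scrapy_webarchive.utils import WEBARCHIVE_META_KEY

# Hypothetical metadata values for demonstration only.
meta = WarcMetadata(
    action="read",
    record_id="<urn:uuid:12345678-1234-1234-1234-123456789abc>",
    wacz_uri="s3://scrapy-webarchive/archive.wacz",
)

# Attach it to a request, the way the downloader middleware does on read.
request = Request("https://example.com", meta={WEBARCHIVE_META_KEY: meta.to_dict()})
response = TextResponse(url="https://example.com", body=b"", request=request)

# Recover it in a callback through the class method.
assert WarcMetadata.from_response(response) == meta
```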
1 change: 1 addition & 0 deletions scrapy_webarchive/utils.py
@@ -10,6 +10,7 @@
WARC_DT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
TIMESTAMP_DT_FORMAT = "%Y%m%d%H%M%S"
BUFF_SIZE = 1024 * 64
WEBARCHIVE_META_KEY = "webarchive_warc"

logger = logging.getLogger(__name__)
