Crawl source information per item/page #31

Merged · 9 commits · Jan 28, 2025
42 changes: 38 additions & 4 deletions docs/advanced_usage.md
@@ -6,7 +6,7 @@

The `wacz_crawl_skip` flag is applied to requests that should be ignored by the crawler. When this flag is present, the middleware intercepts the request and prevents it from being processed further, skipping both download and parsing. This is useful in scenarios where the request should not be collected during a scraping session. Usage:

```python
``` py
yield Request(url, callback=cb_func, flags=["wacz_crawl_skip"])
```
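
For a bit more context, a hypothetical spider might flag asset requests this way so they are skipped when crawling from a WACZ; the spider name, URLs, and CSS selectors below are illustrative assumptions, not part of this change:

``` py
from scrapy import Request, Spider


class SkipAssetsSpider(Spider):
    # Hypothetical spider for illustration only.
    name = "skip_assets"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Follow content pages as usual.
        for href in response.css("a::attr(href)").getall():
            yield Request(response.urljoin(href), callback=self.parse)

        # Flag stylesheet requests so the WACZ crawl middleware skips them.
        for href in response.css("link[rel=stylesheet]::attr(href)").getall():
            yield Request(response.urljoin(href), flags=["wacz_crawl_skip"])
```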

@@ -28,7 +28,7 @@ Going around the default behaviour of the spider, the `WaczCrawlMiddleware` spid

To use this strategy, enable both the spider middleware and the downloader middleware in the spider settings like so:

```python
``` py title="settings.py"
DOWNLOADER_MIDDLEWARES = {
"scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}
@@ -40,7 +40,7 @@ SPIDER_MIDDLEWARES = {

Then define the location of the WACZ archive with the `SW_WACZ_SOURCE_URI` setting:

```python
``` py title="settings.py"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
```
@@ -49,7 +49,10 @@ SW_WACZ_CRAWL = True

Not all URLs will be interesting for the crawl, since your WACZ will most likely contain static files such as fonts, JavaScript (both from the site itself and external), and stylesheets. To improve the spider's performance by not reading all these irrelevant request/response entries, you can configure the `archive_regex` attribute in your spider:

```python
``` py title="my_wacz_spider.py" hl_lines="6"
from scrapy.spiders import Spider


class MyWaczSpider(Spider):
name = "myspider"
archive_regex = r"^/tag/[\w-]+/$"
@@ -74,3 +77,34 @@ com,toscrape,quotes)/static/main.css 20241007081525074 {...}
> com,toscrape,quotes)/tag/simile/ 20241007081524944 {...}
> com,toscrape,quotes)/tag/truth/ 20241007081523804 {...}
```

## Requests and Responses

### Special Keys in Request.meta

The `Request.meta` attribute in Scrapy allows you to store arbitrary data for use during the crawling process. While you can store any custom data in this attribute, Scrapy and its built-in extensions recognize certain special keys. Additionally, the `scrapy-webarchive` extension introduces its own special key for managing metadata. Below is a description of the key used by `scrapy-webarchive`:

* `webarchive_warc`

#### `webarchive_warc`
This key stores the result of a WACZ crawl or export. The data associated with this key is read-only and is not used to control Scrapy's behavior. The value of this key can be accessed using the constant `WEBARCHIVE_META_KEY`, but direct usage of this constant is discouraged. Instead, you should use the provided class method to instantiate a metadata object, as shown in the example below:

``` py title="my_wacz_spider.py" hl_lines="10"
from scrapy.spiders import Spider
from scrapy_webarchive.models import WarcMetadata


class MyWaczSpider(Spider):
name = "myspider"

def parse_function(self, response):
# Instantiate a WarcMetadata object from the response
warc_meta = WarcMetadata.from_response(response)

# Extract the attributes to attach while parsing a page/item
if warc_meta:
yield {
'warc_record_id': warc_meta.record_id,
'wacz_uri': warc_meta.wacz_uri,
}
```
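
If this provenance should be attached to every item rather than repeated in each callback, a small custom spider middleware could centralize the lookup. The sketch below is an illustrative assumption (this middleware class is not part of this change) and only touches dict items:

``` py
from scrapy_webarchive.models import WarcMetadata


class WarcProvenanceMiddleware:
    """Hypothetical spider middleware that adds WARC provenance to dict items."""

    def process_spider_output(self, response, result, spider):
        warc_meta = WarcMetadata.from_response(response)
        for element in result:
            if warc_meta and isinstance(element, dict):
                element.setdefault("warc_record_id", warc_meta.record_id)
                element.setdefault("wacz_uri", warc_meta.wacz_uri)
            yield element
```

Enable it via `SPIDER_MIDDLEWARES` in `settings.py` like any other spider middleware.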
10 changes: 8 additions & 2 deletions docs/installation.md
@@ -2,14 +2,20 @@

To install `scrapy-webarchive`, run:

```bash
``` bash
pip install scrapy-webarchive
```

If you want to use a cloud provider for storage or scraping, opt in to the corresponding extra dependencies:

```bash
``` bash
pip install scrapy-webarchive[aws]
```

``` bash
pip install scrapy-webarchive[gcs]
```

``` bash
pip install scrapy-webarchive[all]
```
8 changes: 4 additions & 4 deletions docs/settings.md
@@ -6,7 +6,7 @@

### `SW_EXPORT_URI`

```python
``` py title="settings.py"
# Either configure the directory where the output should be uploaded to
SW_EXPORT_URI = "s3://scrapy-webarchive/"
SW_EXPORT_URI = "s3://scrapy-webarchive/{spider}/"
@@ -45,7 +45,7 @@ This setting defines the description of the WACZ used in the `datapackage.json`,

⚠️ Scraping against a remote source currently only supports AWS S3.

```python
``` py title="settings.py"
# "file://" must be explicitly added, unlike SW_EXPORT_URI where it makes an assumption if no scheme is added.
SW_WACZ_SOURCE_URI = "file:///Users/username/Documents/archive.wacz"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
@@ -58,8 +58,8 @@ This setting defines the location of the WACZ file that should be used as a sour

### `SW_WACZ_CRAWL`

```python
``` py title="settings.py"
SW_WACZ_CRAWL = True
```

Setting to ignore original `start_requests`, just yield all responses found in WACZ.
Setting to ignore original `start_requests`, just yield all responses found in WACZ. For more information see [Iterating a WACZ archive index](advanced_usage.md#iterating-a-wacz-archive-index).
9 changes: 4 additions & 5 deletions docs/usage.md
@@ -11,15 +11,15 @@ The general use for this plugin is separated in two parts, exporting and crawlin

To archive the requests/responses during a crawl job you need to enable the `WaczExporter` extension.

```python
``` py title="settings.py"
EXTENSIONS = {
"scrapy_webarchive.extensions.WaczExporter": 543,
}
```

This extension also requires you to set the export location using the `SW_EXPORT_URI` setting (check the settings page for the different export options).

```python
``` py title="settings.py"
SW_EXPORT_URI = "s3://scrapy-webarchive/"
```

@@ -33,15 +33,14 @@ To crawl against a WACZ archive you need to use the `WaczMiddleware` downloader

To use the downloader middleware, enable it in the settings like so:

```python
``` py title="settings.py"
DOWNLOADER_MIDDLEWARES = {
"scrapy_webarchive.downloadermiddlewares.WaczMiddleware": 543,
}
```

Then define the location of the WACZ archive with the `SW_WACZ_SOURCE_URI` setting:

```python
``` py title="settings.py"
SW_WACZ_SOURCE_URI = "s3://scrapy-webarchive/archive.wacz"
SW_WACZ_CRAWL = True
```
15 changes: 14 additions & 1 deletion mkdocs.yml
@@ -4,11 +4,24 @@ site_url: https://github.com/q-m/scrapy-webarchive

theme:
name: material
features:
- content.code.copy
- content.code.select
- content.code.annotate

nav:
- Introduction: index.md
- Installation: installation.md
- Usage:
- Usage: usage.md
- Advanced Usage: advanced_usage.md
- Settings: settings.md
- Settings: settings.md

markdown_extensions:
- pymdownx.highlight:
anchor_linenums: true
line_spans: __span
pygments_lang_class: true
- pymdownx.inlinehilite
- pymdownx.snippets
- pymdownx.superfences
18 changes: 14 additions & 4 deletions scrapy_webarchive/downloadermiddlewares.py
@@ -6,7 +6,9 @@
from scrapy.spiders import Spider

from scrapy_webarchive.exceptions import WaczMiddlewareException
from scrapy_webarchive.models import WarcMetadata
from scrapy_webarchive.spidermiddlewares import BaseWaczMiddleware
from scrapy_webarchive.utils import WEBARCHIVE_META_KEY
from scrapy_webarchive.warc import record_transformer


@@ -48,17 +50,25 @@ def process_request(self, request: Request, spider: Spider):

# Get record from existing index entry, or else lookup by URL.
if request.meta.get("cdxj_record"):
warc_record = self.wacz.get_warc_from_cdxj_record(cdxj_record=request.meta["cdxj_record"])
cdxj_record = request.meta["cdxj_record"]
warc_record = self.wacz.get_warc_from_cdxj_record(cdxj_record=cdxj_record)
else:
warc_record = self.wacz.get_warc_from_url(url=request.url)
warc_record, cdxj_record = self.wacz.get_warc_from_url(url=request.url)

# When page not found in archive, return status 404 and record it in a statistic.
if not warc_record:
self.stats.inc_value("webarchive/response_not_found", spider=spider)
return Response(url=request.url, status=404)

# Record found, try to re-create a response from it.
response = record_transformer.response_for_record(warc_record)
response = record_transformer.response_for_record(warc_record, request)

warc_metadata = WarcMetadata(
action="read",
record_id=warc_record.header.record_id,
wacz_uri=cdxj_record.wacz_file.storage_handler.uri,
)
response.meta[WEBARCHIVE_META_KEY] = warc_metadata.to_dict()

if not response:
self.stats.inc_value("webarchive/response_not_recognized", spider=spider)
44 changes: 28 additions & 16 deletions scrapy_webarchive/extensions.py
@@ -16,7 +16,8 @@
from twisted.internet.defer import Deferred
from typing_extensions import Any, Dict, Protocol, Self, Type, Union, cast

from scrapy_webarchive.utils import WARC_DT_FORMAT, get_formatted_dt_string, get_scheme_from_uri
from scrapy_webarchive.models import WarcMetadata
from scrapy_webarchive.utils import WARC_DT_FORMAT, WEBARCHIVE_META_KEY, get_formatted_dt_string, get_scheme_from_uri
from scrapy_webarchive.wacz.creator import WaczFileCreator
from scrapy_webarchive.warc import WarcFileWriter

@@ -61,11 +62,19 @@ def __init__(self, settings: Settings, crawler: Crawler) -> None:
self.spider_name = crawler.spidercls.name if hasattr(crawler.spidercls, "name") else crawler.spider.name

# Get the store URI and configure the WACZ filename
store_uri, self.wacz_fname = self._retrieve_store_uri_and_wacz_fname()
self.store_uri, self.wacz_fname = self._retrieve_store_uri_and_wacz_fname()

# Initialize store and writer
self.store: FilesStoreProtocol = self._get_store(store_uri)
# Initialize store, writer and creator
self.store: FilesStoreProtocol = self._get_store(self.store_uri)
self.writer = WarcFileWriter(collection_name=self.spider_name)
self.wacz_creator = WaczFileCreator(
store=self.store,
warc_fname=self.writer.warc_fname,
wacz_fname=self.wacz_fname,
collection_name=crawler.spider.name,
title=self.settings["SW_WACZ_TITLE"],
description=self.settings["SW_WACZ_DESCRIPTION"],
)

def _check_configuration_prerequisites(self) -> None:
"""raises NotConfigured if essential settings or middleware configurations are incorrect."""
@@ -143,27 +152,30 @@ def response_downloaded(self, response: Response, request: Request, spid
request.meta["WARC-Date"] = get_formatted_dt_string(format=WARC_DT_FORMAT)

# Write response WARC record
record = self.writer.write_response(response, request)
response_record = self.writer.write_response(response, request)
self.stats.inc_value("webarchive/exporter/response_written", spider=spider)
self.stats.inc_value(
f"webarchive/exporter/writer_status_count/{record.http_headers.get_statuscode()}",
f"webarchive/exporter/writer_status_count/{response_record.http_headers.get_statuscode()}",
spider=spider,
)

# Write request WARC record
self.writer.write_request(request, concurrent_to=record)
self.writer.write_request(request, concurrent_to=response_record)
self.stats.inc_value("webarchive/exporter/request_written", spider=spider)

def spider_closed(self, spider: Spider) -> None:
wacz_creator = WaczFileCreator(
store=self.store,
warc_fname=self.writer.warc_fname,
wacz_fname=self.wacz_fname,
collection_name=spider.name,
title=self.settings["SW_WACZ_TITLE"],
description=self.settings["SW_WACZ_DESCRIPTION"],
warc_metadata = WarcMetadata(
action="write",
record_id=response_record.rec_headers.get_header('WARC-Record-ID'),
wacz_uri=self.export_uri,
)
wacz_creator.create()
request.meta[WEBARCHIVE_META_KEY] = warc_metadata.to_dict()

def spider_closed(self, spider: Spider) -> None:
self.wacz_creator.create()

@property
def export_uri(self) -> str:
return os.path.join(self.store_uri, self.wacz_creator.wacz_fname)


def get_archive_uri_template_dt_variables() -> dict:
44 changes: 44 additions & 0 deletions scrapy_webarchive/models.py
@@ -0,0 +1,44 @@
from dataclasses import dataclass
from typing import Union

from scrapy.http.response import Response

from scrapy_webarchive.utils import WEBARCHIVE_META_KEY


@dataclass
class WarcMetadata:
"""
Encapsulates metadata about the WARC record.

Attributes:
action (str): The action performed ("read" or "write").
record_id (str): The unique ID of the WARC record.
wacz_uri (str): The URI of the WACZ file.
"""
action: str
record_id: str
wacz_uri: str

def to_dict(self) -> dict:
"""Convert the object to a dictionary for compatibility with Scrapy's meta."""

return {
"action": self.action,
"record_id": self.record_id,
"wacz_uri": self.wacz_uri,
}

@classmethod
def from_response(cls, response: Response) -> Union["WarcMetadata", None]:
"""Create a WarcMetadata instance from a Scrapy response object."""

if not hasattr(response, "meta"):
return None

warc_meta = response.meta.get(WEBARCHIVE_META_KEY)

if not warc_meta:
return None

return cls(**warc_meta)
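
As a quick illustration of how this dataclass round-trips through `Request.meta` (the URL and record ID below are made-up values, not taken from the diff):

``` py
from scrapy.http import Request, TextResponse

from scrapy_webarchive.models import WarcMetadata
from scrapy_webarchive.utils import WEBARCHIVE_META_KEY

# Hypothetical metadata values for demonstration only.
meta = WarcMetadata(
    action="read",
    record_id="<urn:uuid:12345678-1234-1234-1234-123456789abc>",
    wacz_uri="s3://scrapy-webarchive/archive.wacz",
)

# Attach it to a request, the way the downloader middleware does on read.
request = Request("https://example.com", meta={WEBARCHIVE_META_KEY: meta.to_dict()})
response = TextResponse(url="https://example.com", body=b"", request=request)

# Recover it in a callback through the class method.
assert WarcMetadata.from_response(response) == meta
```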
1 change: 1 addition & 0 deletions scrapy_webarchive/utils.py
@@ -10,6 +10,7 @@
WARC_DT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
TIMESTAMP_DT_FORMAT = "%Y%m%d%H%M%S"
BUFF_SIZE = 1024 * 64
WEBARCHIVE_META_KEY = "webarchive_warc"

logger = logging.getLogger(__name__)
