
ability to handle AutoExtractError #33

Open
ilias-ant opened this issue Dec 13, 2021 · 1 comment · May be fixed by #34
Labels
enhancement New feature or request

Comments

ilias-ant (Contributor) commented Dec 13, 2021

Problem statement

A typical scenario when using the Scrapy middleware to auto-extract e.g. product page URLs is that some of those URLs respond with a 404 status.

However, the library does not provide a way to handle the associated AutoExtractError exceptions. It seems that only successful requests (w.r.t. the domain crawled, not the AutoExtract API) are returned from the middleware, while the non-successful ones are simply logged:

...
if result.get('error'):
    self.inc_metric('autoextract/errors/result_error')
    self._log_debug_error(response, body)
    raise AutoExtractError('Received error from AutoExtract for {}: {}'.format(url, result["error"]))
...
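
In practice, the only handle on this today seems to be the request errback, since the exception raised in process_response propagates there. A minimal sketch of that workaround follows; it assumes the middleware keeps raising AutoExtractError as in the snippet above, the spider and callback names are purely illustrative, and enabling AutoExtract for the request (meta/settings) is omitted:

import scrapy
from scrapy_autoextract.middlewares import AutoExtractError


class ProductSpider(scrapy.Spider):
    name = "products"  # illustrative name, not part of the library

    def start_requests(self):
        # AutoExtract is assumed to be enabled for this request via the
        # middleware's usual settings/meta configuration (omitted for brevity).
        yield scrapy.Request(
            "https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html",
            callback=self.parse_product,
            errback=self.handle_failure,
        )

    def parse_product(self, response):
        self.logger.info("Successful AutoExtract response for %s", response.url)

    def handle_failure(self, failure):
        # The AutoExtractError raised by the middleware surfaces here, but only
        # its message (e.g. "Downloader error: http404") is available; the full
        # error payload returned by the API is not accessible at this point.
        if failure.check(AutoExtractError):
            self.logger.warning(
                "AutoExtract error for %s: %s", failure.request.url, failure.value
            )

Note that this still does not give access to the rest of the result object, which is exactly the gap described below.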

Example

This is the output I get when I try to crawl the 404 URL: https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html

2021-12-13 12:54:43 [scrapy_autoextract.middlewares] DEBUG: Process AutoExtract request for product URL <GET https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html>
2021-12-13 12:54:55 [scrapy_autoextract.middlewares] DEBUG: AutoExtract response status=200  headers={'date': 'Mon, 13 Dec 2021 10:54:44 GMT', 'content-type': 'application/json', 'strict-transport-security': 'max-age=0; includeSubDomains; preload'}  content=[{"query":{"id":"1639392884013-e7d673376b493f68","domain":"dosfarma.com","userAgent":"scrapy-autoextract/0.5.2 scrapy/2.4.1","userQuery":{"url":"https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html","pageType":"product"}},"error":"Downloader error: http404"}]
2021-12-13 12:54:55 [scrapy.core.scraper] ERROR: Error downloading <POST https://autoextract.scrapinghub.com/v1/extract>
Traceback (most recent call last):
  File "/home/iantonopoulos/.cache/pypoetry/virtualenvs/src-7UZI2jN5-py3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
StopIteration: <200 https://autoextract.scrapinghub.com/v1/extract>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/iantonopoulos/.cache/pypoetry/virtualenvs/src-7UZI2jN5-py3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "/home/iantonopoulos/.cache/pypoetry/virtualenvs/src-7UZI2jN5-py3.7/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 55, in process_response
    response = yield deferred_from_coro(method(request=request, response=response, spider=spider))
  File "/home/iantonopoulos/.cache/pypoetry/virtualenvs/src-7UZI2jN5-py3.7/lib/python3.7/site-packages/scrapy_autoextract/middlewares.py", line 190, in process_response
    '{}: {}'.format(url, result["error"]))
scrapy_autoextract.middlewares.AutoExtractError: Received error from AutoExtract for https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html: Downloader error: http404
2021-12-13 13:00:21 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-13 13:00:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'autoextract/errors/result_error': 1,
 'autoextract/request_count': 1,
 'downloader/request_bytes': 460,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 445,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 31.248149,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 12, 13, 11, 0, 21, 791393),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 9,
 'memusage/max': 70422528,
 'memusage/startup': 70422528,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2021, 12, 13, 10, 59, 50, 543244)}

With this information (a DEBUG-level log entry plus an increment of the autoextract/errors/result_error metric), the user does not have access to the information contained in the unsuccessful responses, which may very well be important for many applications. Parsing the DEBUG logs seems like a subpar practice, since deployed applications typically log statements at WARNING level and above.

Proposal

A refactoring of (at least) the process_response method of the AutoExtractMiddleware, in order to return a more unified response that covers all cases. For example, unsuccessful (w.r.t. the domain crawled, not the auto-extract API) responses should expose the Downloader error: http404 message.
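
For illustration only, here is a minimal sketch of the idea as a middleware subclass. This is not the library's code: the subclass, the autoextract_error meta key, and the assumption that process_response is synchronous are all mine; the import paths follow the traceback above.

from scrapy_autoextract.middlewares import AutoExtractError, AutoExtractMiddleware


class LenientAutoExtractMiddleware(AutoExtractMiddleware):
    """Hypothetical variant: expose AutoExtract result errors to the spider
    instead of aborting the request with an exception."""

    def process_response(self, request, response, spider):
        try:
            return super().process_response(request, response, spider)
        except AutoExtractError as exc:
            # e.g. "Received error from AutoExtract for <url>: Downloader error: http404"
            request.meta["autoextract_error"] = str(exc)
            return response

The spider callback could then check response.meta.get("autoextract_error") and decide whether to retry, drop the item, or record the failure.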

@BurnzZ BurnzZ added the enhancement New feature or request label Feb 4, 2022
@BurnzZ BurnzZ linked a pull request Feb 4, 2022 that will close this issue
BurnzZ (Member) commented Feb 4, 2022

Hi @ilias-ant, thanks for raising this!

Would something like #34 ease the issue? It introduces AUTOEXTRACT_RESPONSE_ERROR_LOG_LEVEL and AUTOEXTRACT_ALLOWED_RESPONSE_ERRORS, giving users finer control over these errors and their logging.

The logging default is still set to logging.DEBUG, though, to avoid unexpectedly flooding existing users' logs if we raised it to a higher level. Nonetheless, it can easily be overridden.
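
If #34 lands, configuration in a project's settings.py would presumably look roughly like this (the setting names come from the comment above; the accepted value formats are assumptions on my part, so check the pull request for the actual contract):

# settings.py (sketch; value formats are assumed, see #34 for the actual contract)
import logging

# Log AutoExtract result errors at WARNING instead of the DEBUG default, so
# deployed crawls that filter out DEBUG-level records still surface them.
AUTOEXTRACT_RESPONSE_ERROR_LOG_LEVEL = logging.WARNING

# Errors listed here would be tolerated instead of raising AutoExtractError,
# e.g. the "Downloader error: http404" case from this issue.
AUTOEXTRACT_ALLOWED_RESPONSE_ERRORS = ["Downloader error: http404"]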
