This repository has been archived by the owner on Dec 28, 2023. It is now read-only.

Add retry mechanism into middleware #12

Open
PyExplorer opened this issue Feb 15, 2021 · 7 comments · May be fixed by #18
@PyExplorer

Hi guys,
I recently ran into a case where retrying requests to fetch.crawlera.com a few times helped the spider work reliably. As I understood from the discussion here https://zytegroup.slack.com/archives/C014HA686ES/p1612975265044000, uncork does 3 retries, but not for all failures.
As a temporary fix I've implemented the retries right in the spider. We could of course do this with a custom retry middleware, but then we would need to add it to every spider/project.
To make things simpler: is it possible to add this right into CrawleraFetchMiddleware and expose meta parameters for retry reasons/retry times alongside the existing "skip" parameter? (A rough sketch is below.)

The failure reasons I've seen in responses:
"crawlera_status":"fail"
"crawlera_status":"ban"
"crawlera_error":"timeout"

Thanks.

@elacuesta
Member

Just opened scrapy/scrapy#4993 to make this easier.

@PyExplorer
Author

Hi @elacuesta, thanks for taking care of the request.
I've looked at your PR #4993. It can certainly be useful, but it seems this approach won't solve the issue for this particular CrawleraFetchMiddleware. Two points I have in mind:

  1. The existing RetryMiddleware is activated only under this condition:
    if response.status in self.retry_http_codes:
    but in some cases the response code of a "bad" response from CrawleraFetchMiddleware is 200, so it won't be retried even when we need it. A "bad" response here is one that carries a server_error or a JSON decode error (see the small example after this list).
  2. Also, in some cases the request URL is 'http://cm-54.scrapinghub.com/fetch/v2' and the original URL that we need to retry is only available in meta. In this case the request would be retried incorrectly.
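To make point 1 concrete, here is a small self-contained example (the response body is made up) showing that a failed Fetch API result delivered with HTTP 200 never matches the status-based check the built-in RetryMiddleware relies on:

from scrapy.http import TextResponse

# a "bad" upstream result can still arrive as HTTP 200
response = TextResponse(
    url="http://cm-54.scrapinghub.com/fetch/v2",
    status=200,
    body=b'{"crawlera_status": "fail", "crawlera_error": "timeout"}',
)

# Scrapy's default RETRY_HTTP_CODES (recent Scrapy versions)
retry_http_codes = {500, 502, 503, 504, 522, 524, 408, 429}

# essentially the check RetryMiddleware performs
print(response.status in retry_http_codes)  # False -> the response is not retried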

What do you think about adding something along the lines of the existing approach:

if self.raise_on_error:
    raise CrawleraFetchException(log_msg) from exc
elif self.retry_on_error:
    raise CrawleraRetryException(log_msg) from exc
else:
    logger.warning(log_msg)

return response

and implement this CrawleraRetryException separately.
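If we went that route, the separate exception could be as small as the following sketch (it does not exist in the package today):

class CrawleraRetryException(Exception):
    """Raised when a Crawlera Fetch response failed in a way that should be retried."""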

cc @Gallaecio @noviluni @ogiaquino

@elacuesta
Member

Hey @PyExplorer! My idea behind subclassing the built-in middleware is actually to take advantage of the private _retry method, which handles several things like stats and maximum retries. One alternative we considered in the OSS meeting is closing scrapy/scrapy#4993 in favour of Adrian's scrapy/scrapy#4902; I will suggest some changes to the latter so we can also have customizable stats and the like.
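For reference, a minimal sketch of that subclassing idea might look like this. It assumes, purely hypothetically, that the decoded Fetch API payload ends up under a "crawlera_fetch" key in response.meta, and it leans on the private RetryMiddleware._retry helper, whose signature may change between Scrapy versions:

from scrapy.downloadermiddlewares.retry import RetryMiddleware


class CrawleraFetchRetryMiddleware(RetryMiddleware):
    """Retry Fetch API responses that report an error even when the HTTP status is 200."""

    def process_response(self, request, response, spider):
        # hypothetical meta key: assumes the fetch middleware stashed the
        # decoded Fetch API payload here
        payload = response.meta.get("crawlera_fetch") or {}
        if payload.get("crawlera_status") in ("fail", "ban") or payload.get("crawlera_error"):
            reason = payload.get("crawlera_error") or payload.get("crawlera_status")
            # _retry handles retry stats and the maximum number of retries
            return self._retry(request, reason, spider) or response
        # fall back to the usual status-code based retrying
        return super().process_response(request, response, spider)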

@akshayphilar

Hey @elacuesta, I see that scrapy/scrapy#4902 has been merged 🎉 It would be good to leverage that and implement the retry functionality in the middleware, since there are growing use cases that warrant it.

@elacuesta
Member

elacuesta commented Apr 10, 2021

Indeed @akshayphilar, my plan is to work on this next week 👍

@elacuesta linked a pull request on Apr 10, 2021 that will close this issue
@pawelmhm

Also for some cases the request has URL 'http://cm-54.scrapinghub.com/fetch/v2' and the original URL that we need to retry is only in meta. In this case, this will be retried incorrectly

@PyExplorer why do you need to retry the original URL? I would think that if a request to the uncork API fails, we should retry the request to the uncork API, not the request to the target website.

@pawelmhm

pawelmhm commented Apr 12, 2021

re:

The existing RetryMiddleware is activated only with this condition:
if response.status in self.retry_http_codes: but in some cases, the response code for "bad" response from CrawleraFetchMiddleware is 200 and this won't be retried even if we need it. "Bad" response means that the response has server_error or json decode error

I see this as a bug in the Uncork API. It should not return 200 if something fails. If a response failed and uncork did not bring good results, it should return 503 Service Unavailable; this is what some other Zyte products do, e.g. Smart Proxy Manager.

This way we could handle Fetch API 5xx codes with the default built-in Scrapy middleware, and there would be no need for a separate PR by @elacuesta. Retrying 5xx codes should also be the middleware's default behaviour. Raising errors is not a good default: most users will want a retried response in case of an Uncork failure, not an error in the logs.
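For what it's worth, if the Fetch API returned 5xx on failures, the stock retry settings would already cover it without any custom code. A sketch of the relevant defaults (values as documented for recent Scrapy versions):

# settings.py
RETRY_ENABLED = True
RETRY_TIMES = 2  # retries on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]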
