Add retry mechanism to middleware #12
Just opened scrapy/scrapy#4993 to make this easier.
Hi @elacuesta, thanks for taking care of the request.
What do you think about adding something like the existing approach
and implementing this CrawleraRetryException separately?
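For illustration, one way to read that suggestion is to define the exception and have a subclass of Scrapy's stock RetryMiddleware treat it as retryable. A minimal sketch, assuming the EXCEPTIONS_TO_RETRY extension point; the CrawleraRetryException and CrawleraRetryMiddleware names are hypothetical, not part of the project:

```python
from scrapy.downloadermiddlewares.retry import RetryMiddleware


class CrawleraRetryException(Exception):
    """Hypothetical: raised when an Uncork response indicates a retryable failure."""


class CrawleraRetryMiddleware(RetryMiddleware):
    # Extend the built-in exception tuple so CrawleraRetryException
    # is retried the same way as e.g. a connection timeout.
    EXCEPTIONS_TO_RETRY = RetryMiddleware.EXCEPTIONS_TO_RETRY + (
        CrawleraRetryException,
    )
```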
Hey @PyExplorer! My idea for subclassing the built-in middleware is actually to take advantage of the private …
Hey @elacuesta, I see that scrapy/scrapy#4902 is merged 🎉 It would be good to leverage that and implement the retry functionality in the middleware, since there are growing use cases that warrant it.
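For context, scrapy/scrapy#4902 added `scrapy.downloadermiddlewares.retry.get_retry_request`, which a middleware can call to build a retry request that honours the standard retry settings. A minimal sketch of how the Fetch middleware might leverage it; the `process_response` body and the `_uncork_failed` helper are assumptions for illustration, not the actual implementation:

```python
from scrapy.downloadermiddlewares.retry import get_retry_request


class CrawleraFetchMiddleware:
    def process_response(self, request, response, spider):
        # Hypothetical check: suppose the decoded Uncork JSON payload
        # carries a retryable failure marker.
        if self._uncork_failed(response):  # placeholder helper, not real
            new_request = get_retry_request(
                request,
                spider=spider,
                reason="crawlera_status: fail",
            )
            if new_request is not None:
                return new_request  # re-schedule through the downloader
        return response
```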
Indeed @akshayphilar, my plan is to work on this next week 👍
@PyExplorer why do you need to retry the original URL? I would think that if a request to the Uncork API fails, we should retry the request to the Uncork API, not the request to the target website.
I see this as a bug in the Uncork API: it should not return 200 if something fails. If a response failed and Uncork did not produce good results, it should return 503 Service Unavailable, which is what some other Zyte products do, e.g. Smart Proxy Manager. That way we could handle Fetch API 5xx codes with the default built-in Scrapy retry middleware, and there would be no need for a separate PR by @elacuesta. Retrying 5xx codes should be the default behaviour of the middleware; raising errors is not a good default. Most users will want a retried response in case of an Uncork failure, not an error in the logs.
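To illustrate the point: if Uncork returned 503 on failure, no new middleware code would be needed, because Scrapy's stock RetryMiddleware already retries server errors based on settings. A sketch for reference; the values shown approximate recent Scrapy defaults:

```python
# settings.py — Scrapy's built-in RetryMiddleware handles these out of the box
RETRY_ENABLED = True
RETRY_TIMES = 2  # retries on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```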
Hi guys,
I recently faced a case where several retries of requests to fetch.crawlera.com helped the spider work well. As I understood from the discussion here https://zytegroup.slack.com/archives/C014HA686ES/p1612975265044000, Uncork does 3 retries, but not for all failures.
I've implemented a temporary fix by retrying requests right in the spider (a minimal sketch follows below). We could put this in a custom retry middleware, sure, but then we would need to add it to every spider/project.
To make things simpler: is it possible to add this right into the CrawleraFetchMiddleware, with meta parameters for the retry reasons/retry times along with the existing "skip" parameter?
The failed-response reasons I've faced:
"crawlera_status": "fail"
"crawlera_status": "ban"
"crawlera_error": "timeout"
Thanks.
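A minimal sketch of the in-spider workaround described above, assuming the spider sees the raw Uncork JSON payload. The field names come from the reasons listed above; the `uncork_retries` meta key, the `MAX_UNCORK_RETRIES` limit, and the spider itself are assumptions for illustration:

```python
import json

from scrapy import Spider

MAX_UNCORK_RETRIES = 3  # assumed limit, not taken from the middleware


class ExampleSpider(Spider):
    name = "example"

    def parse(self, response):
        data = json.loads(response.text)
        # Failure reasons observed in this issue: fail/ban statuses, timeouts
        failed = (
            data.get("crawlera_status") in ("fail", "ban")
            or data.get("crawlera_error") == "timeout"
        )
        if failed:
            retries = response.meta.get("uncork_retries", 0)
            if retries < MAX_UNCORK_RETRIES:
                yield response.request.replace(
                    dont_filter=True,  # bypass the duplicate filter on retry
                    meta={**response.meta, "uncork_retries": retries + 1},
                )
            return
        # ... normal parsing of a successful response goes here ...
```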