Retry #18
base: master
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master      #18      +/-   ##
===========================================
- Coverage   100.00%   96.89%    -3.11%
===========================================
  Files            3        4       +1
  Lines          206      290      +84
===========================================
+ Hits           206      281      +75
- Misses           0        9       +9
```
Continue to review full report at Codecov.
Hi @elacuesta, thank you for starting the implementation.
```diff
@@ -76,6 +87,21 @@ Crawlera middleware won't be able to handle them.
 Default values to be sent to the Crawlera Fetch API. For instance, set to `{"device": "mobile"}`
 to render all requests with a mobile profile.

+* `CRAWLERA_FETCH_SHOULD_RETRY` (type `Optional[Union[Callable, str]]`, default `None`)
```
I would use True as default because in most cases you want to retry. Every API can fail, uncork can fail, the spider should retry.

Requirement of 2.5 might be limiting for some users; we don't support this stack in Scrapy Cloud at Zyte at the moment, so this would have to wait for a release of the stack and would force all uncork users to migrate to 2.5. Is there some way to make it compatible with all Scrapy versions, not just 2.5?
Hey @pawelmhm:
> I would use True as default because in most cases you want to retry. Every API can fail, uncork can fail, the spider should retry.
`CRAWLERA_FETCH_SHOULD_RETRY` receives a callable (or the name of a callable within the spider) to be used to determine if a request should be retried. Perhaps it could be named differently, I'm open to suggestions. `CRAWLERA_FETCH_ON_ERROR` is the setting to determine what to do with errors. I made `OnError.Warn` the default, just to keep backward-compatibility, but perhaps `OnError.Retry` could be a better default.
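For illustration, a minimal sketch of how a spider-level predicate could be wired up (the predicate name and signature here are assumptions, not a confirmed API):

```python
from scrapy import Spider


class MySpider(Spider):
    name = "my_spider"
    custom_settings = {
        # either the callable itself or its name within the spider
        "CRAWLERA_FETCH_SHOULD_RETRY": "should_retry",
    }

    def should_retry(self, response) -> bool:
        # hypothetical predicate: retry on server-side Fetch API errors
        return response.status in (500, 503)
```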
> Requirement of 2.5 might be limiting for some users; we don't support this stack in Scrapy Cloud at Zyte at the moment, so this would have to wait for a release of the stack and would force all uncork users to migrate to 2.5. Is there some way to make it compatible with all Scrapy versions, not just 2.5?
AFAIK, you should be able to use 2.5 with a previous stack by updating the requirements file. The 2.5 requirement is because of scrapy/scrapy#4902. I wanted to avoid code duplication, but I guess I can just use the upstream function if available and fall back to copying the implementation.
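Something along these lines, as a sketch (the fallback module path is hypothetical):

```python
try:
    # get_retry_request was added upstream in Scrapy 2.5 (scrapy/scrapy#4902)
    from scrapy.downloadermiddlewares.retry import get_retry_request
except ImportError:
    # on older Scrapy versions, fall back to a vendored copy of the
    # implementation (hypothetical module path)
    from crawlera_fetch.retry import get_retry_request
```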
Thanks for the explanation. Is there any scenario where you don't need retries? In my experience it is very rare not to want to retry internal server errors, timeouts, or bans.
> AFAIK, you should be able to use 2.5 with a previous stack by updating the requirements file. The 2.5 requirement is because of scrapy/scrapy#4902.
I think in some projects people are using old Scrapy versions, and having to update to the most recent version will be extra work and effort for them. If they are stuck on an old version like 1.6 or 1.7, updating to 2.5 might not be straightforward.
But my main point, after thinking about this, is: why do we actually need custom retry logic? Why can't we handle this in the retry middleware by default, like all other HTTP error codes? There are 2 use cases mentioned by Taras in the issue on GH, but I'm not convinced by them, and after talking with a developer of the Fetch API I hear they plan to change the behavior to return 500 and 503 HTTP status codes instead of a 200 HTTP status code with an error code in the response body.
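For reference, if the API did return plain 500/503 statuses, Scrapy's stock RetryMiddleware would already cover them with its default settings:

```python
# settings.py: Scrapy's built-in retry defaults; 500 and 503 are already
# retried out of the box, so no custom retry logic would be needed
RETRY_ENABLED = True
RETRY_TIMES = 2
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```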
@akshayphilar do you have any insights on this? Mostly about "change behavior to return 500 and 503 HTTP status codes instead of 200 HTTP status code with error code in response body" from #18 (comment). Bottom line is: should we provide a client-side retrying mechanism, or should we rely on the server to return codes that will be retried by default by Scrapy?
Closes #12
Tasks:

- Deprecate `CRAWLERA_FETCH_RAISE_ON_ERROR` in favour of a more general `CRAWLERA_FETCH_ON_ERROR` setting (with possible values `warn(ing)`, `raise`, `retry`)
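A rough sketch of how these values could map to the `OnError` enum mentioned in the discussion above (`OnError.Warn` and `OnError.Retry` appear there; the `Raise` member and the string values are assumptions):

```python
from enum import Enum


class OnError(Enum):
    Warn = "warn"
    Raise = "raise"  # assumed member name
    Retry = "retry"


# settings.py: keeping the backward-compatible default discussed above
CRAWLERA_FETCH_ON_ERROR = OnError.Warn
```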