[WIP] [Discussion] Use response url for original request #21

Verz1Lka · 2021-10-05T15:34:35Z

Make request.url consistent with the final URL.

codecov · 2021-10-05T16:02:28Z

Codecov Report

Merging #21 (7f5928a) into master (1492753) will not change coverage.
The diff coverage is 100.00%.

❗ Current head 7f5928a differs from pull request most recent head e651190. Consider uploading reports for the commit e651190 to get more accurate results

@@            Coverage Diff            @@
##            master       #21   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            3         3           
  Lines          206       207    +1     
=========================================
+ Hits           206       207    +1

Impacted Files	Coverage Δ
crawlera_fetch/middleware.py	`100.00% <100.00%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1492753...e651190.

elacuesta · 2021-10-06T14:34:54Z

In which case(s) would this URL be different from the original one? Wouldn't we want to keep the actual URL of the original request?

Gallaecio · 2021-10-06T17:31:48Z

> In which case(s) would this URL be different from the original one?

Redirects? (I really have no idea)

> Wouldn't we want to keep the actual URL of the original request?

I think this is a valid point. Currently I imagine the response.url has the final URL, so you already have a way to find that URL. I imagine this change is to fix the perceived inconsistency that request.url may not match response.url; or is there more to it?

Verz1Lka · 2021-10-06T19:08:48Z

Yes, it is required for cases where the request has been redirected. In this case the Fetch API returns the resolved URL after all redirects, but Scrapy uses request.url for logging (source).
This fix should make them consistent.

elacuesta · 2021-10-06T22:07:07Z

I was under the impression that redirects would not be followed directly, leaving them to be handled by the user agent instead (in this case, to be processed by the redirect middleware). Even if that's not the case here, this still looks a little bit counterintuitive to me, since vanilla Scrapy gives you the actual last request in the Response.request attribute (which might differ from the original request in other aspects, not just the URL).
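
For comparison, here is a minimal sketch of how the original URL can be recovered in vanilla Scrapy, assuming the built-in RedirectMiddleware handled the redirect. That middleware records the URLs it passed through in `meta["redirect_urls"]`. The helper name `original_url` and the stand-in objects are hypothetical and only there so the snippet runs without Scrapy installed.

```python
from types import SimpleNamespace


def original_url(response):
    """Return the first URL requested, before any redirects.

    Scrapy's RedirectMiddleware appends each redirected-from URL to
    meta["redirect_urls"]; when the key is absent, no redirect happened
    and response.url is already the original URL.
    """
    redirect_urls = response.meta.get("redirect_urls") or [response.url]
    return redirect_urls[0]


# Stand-ins for scrapy.http.Response objects (hypothetical data).
redirected = SimpleNamespace(
    url="https://www.example.com",
    meta={"redirect_urls": ["https://www.example.com/redirect-to-home-page"]},
)
direct = SimpleNamespace(url="https://www.example.com", meta={})
```

With a real response, the same lookup works on `response.meta`, which is a shortcut to `response.request.meta`.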

Verz1Lka · 2021-10-07T05:50:41Z

The Crawlera Fetch API follows redirects automatically.
This causes some inconsistencies in the logs:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/redirect-to-home-page> (referer: None latency: 0.00)
But this page redirects to https://www.example.com, so response.url should be https://www.example.com

Later in the log:
Scraped from <200 https://www.example.com>

So the page is crawled as https://www.example.com/redirect-to-home-page
but scraped from https://www.example.com

I'm still thinking about this fix, because it also requires some changes on the Smart Proxy side.

Gallaecio · 2021-10-07T05:57:53Z

Maybe it would be better to have some middleware log a message about a redirect happening within Crawlera Fetch by detecting that inconsistency, but otherwise let things be, i.e. do not change the request URL.
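
A rough sketch of what such a middleware could look like, assuming response.url already holds the final URL by the time this code runs. The class name `FetchRedirectLoggerMiddleware` and the log message are hypothetical. The `process_response(request, response, spider)` signature is the standard Scrapy downloader middleware hook, and the stand-in objects only let the snippet run without Scrapy.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


class FetchRedirectLoggerMiddleware:
    """Hypothetical downloader middleware: report URL mismatches, change nothing."""

    def process_response(self, request, response, spider):
        if response.url != request.url:
            logger.info(
                "Redirected within Crawlera Fetch: %s -> %s",
                request.url,
                response.url,
            )
        # Return the response untouched so request.url keeps the original URL.
        return response


# Stand-ins for a Scrapy Request and Response (hypothetical data).
req = SimpleNamespace(url="https://www.example.com/redirect-to-home-page")
resp = SimpleNamespace(url="https://www.example.com")
result = FetchRedirectLoggerMiddleware().process_response(req, resp, spider=None)
```

The request is never modified here, so the "Crawled" log line still shows the URL that was originally requested, and the extra log line explains why the "Scraped from" URL differs.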

akshayphilar · 2021-10-07T07:18:20Z

I agree with @Gallaecio. It makes sense to log the final URL that the page is redirected to rather than replace request.url, which holds the original URL. Users who need the final URL should be able to retrieve it through the meta, which contains the original response, right?
Together, these should address the original concerns.

Verz1Lka · 2021-10-07T08:05:35Z

OK, thank you for the feedback, guys.
I agree to leave the original request as is.
At this moment we need support from the Crawlera Fetch side: it currently provides only the source request URL.
Let's hold this PR until they fix that, then come back to it and think about how to improve the logging.

Use response url for original request

e651190

Verz1Lka changed the title ~~Use response url for original request~~ [WIP] [Discussion] Use response url for original request Oct 7, 2021
