Skip to content
This repository has been archived by the owner on Dec 28, 2023. It is now read-only.

[WIP] [Discussion] Use response url for original request #21

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Verz1Lka
Copy link
Contributor

@Verz1Lka Verz1Lka commented Oct 5, 2021

Make request.url consist with final url

@codecov
Copy link

codecov bot commented Oct 5, 2021

Codecov Report

Merging #21 (7f5928a) into master (1492753) will not change coverage.
The diff coverage is 100.00%.

❗ Current head 7f5928a differs from pull request most recent head e651190. Consider uploading reports for the commit e651190 to get more accurate results
Impacted file tree graph

@@            Coverage Diff            @@
##            master       #21   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            3         3           
  Lines          206       207    +1     
=========================================
+ Hits           206       207    +1     
Impacted Files Coverage Δ
crawlera_fetch/middleware.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1492753...e651190. Read the comment docs.

@elacuesta
Copy link
Member

In which case(s) would this URL be different from the original one? Wouldn't we want to keep the actual URL of the original request?

@Gallaecio
Copy link

In which case(s) would this URL be different from the original one?

Redirects? (I really have no idea)

Wouldn't we want to keep the actual URL of the original request?

I think this is a valid point. Currently I imagine the response.url has the final URL, so you already have a way to find that URL. I imagine this change is to fix the perceived inconsistency that request.url may not match response.url; or is there more to it?

@Verz1Lka
Copy link
Contributor Author

Verz1Lka commented Oct 6, 2021

Yes, It require for cases when request has been redirected. In this case fetch API will return resolved url after all redirects. But scrapy is using request.url for logging (source).
This fix should make it consist.

@elacuesta
Copy link
Member

I was under the impression that redirects would not be followed directly, leaving them to be handled by the user agent instead (in this case, to be processed by the redirect middleware). Even if that's not the case here, this still looks a little bit counterintuitive to me, since vanilla Scrapy gives you the actual last request in the Response.attribute (which might be different in other aspects from the original request, not just the URL).

@Verz1Lka
Copy link
Contributor Author

Verz1Lka commented Oct 7, 2021

Crawlera fetch API has auto-redirects.
There are some inconsistencies in the logs.
[scrapy.core.engine] DEBUG: Crawled (200) <GET <https://www.example.com/redirect-to-home-page>] (referer: None latency: 0.00)
But this page has redirect to the https://www.example.com, and response.url should be https://www.example.com

Later in log:
Scraped from <200 https://www.example.com

So, https://www.example.com/redirect-to-home-page
But Scraped from https://www.example.com

So I'm still thinking about this fix, because it requre also some changes on smart proxy side

@Gallaecio
Copy link

Maybe it would be better to have some middleware log a message about a redirect happening within Crawlera Fetch by detecting that inconsistency, but otherwise let things be, i.e. do not change the request URL.

@akshayphilar
Copy link

I agree with @Gallaecio. It makes sense to log the final URL to which the page is redirected rather than replace request.url which hold the original URL. Users who need to retrieve this final URL should be able to do so through the meta which contains the original response right?
Both of these should address the original concerns.

@Verz1Lka
Copy link
Contributor Author

Verz1Lka commented Oct 7, 2021

Ok, thank you for feedback guys.
I agree to leave original request as is.
At this moment we need support from crawlera fetch side. It provides only source request url.
Let's hold this PR until they didn't fix it, and than back to this and think how to improve logging better

@Verz1Lka Verz1Lka changed the title Use response url for original request [WIP] [Discussion] Use response url for original request Oct 7, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants