Set 'Referer' HTTP request header #2062

Merged
audiodude merged 2 commits into main from referer on Jul 22, 2024

Conversation

@audiodude
Member

@audiodude audiodude commented Jul 15, 2024

Fixes #2061

Tested with an Italian Wikipedia download, using the article listed in the bug.

@audiodude audiodude requested a review from kelson42 July 15, 2024 16:35
@codecov

codecov Bot commented Jul 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.44%. Comparing base (c09bc92) to head (157c2b9).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2062      +/-   ##
==========================================
+ Coverage   74.38%   74.44%   +0.06%     
==========================================
  Files          41       41              
  Lines        3146     3146              
  Branches      689      689              
==========================================
+ Hits         2340     2342       +2     
+ Misses        686      684       -2     
  Partials      120      120              

@kelson42
Contributor

kelson42 commented Jul 15, 2024

@audiodude Thank you very much for this simple but very important fix. One thing: why not just take the MediaWiki URL given as an argument to MWoffliner instead of this fake "http://localhost/"? To me this would be more correct and probably more robust.

@audiodude
Member Author

Because the WMF software looks for specific values of the Referer header. Presumably 'localhost' works because it is left in for local development. It seems better to put that, as a workaround, than en.wikipedia.org.

@kelson42
Contributor

kelson42 commented Jul 16, 2024

It looks like all WMF domains are part of the regex, so why would a hack like "localhost" be better?

@kelson42 kelson42 changed the title Set 'Referer' header Set 'Referer' HTTP request header Jul 16, 2024
@audiodude
Member Author

It just seems better to put a "dummy" value that is obviously a hack, than a WMF domain which is potentially misleading.

@kelson42
Contributor

kelson42 commented Jul 17, 2024

Not really convinced, but I guess you can argue that way. In any case, I believe we should have a small test validating this (proper download of a map), so that next time we detect this kind of issue early and can fix it.

@kelson42
Contributor

@audiodude Maybe we could just extend saveArticles.test.ts around the London test, since https://en.wikipedia.org/wiki/London has a map?

@audiodude
Member Author

Added a test in downloader.test.ts. PTAL, thanks!
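
The kind of regression test discussed here could be sketched roughly as below. This is not the test that was added to downloader.test.ts. `fakeMapServer` and `downloadMap` are hypothetical stand-ins, and the rule that a missing Referer yields HTTP 403 is an assumption modeled only on the symptom named in the linked issue's title.

```typescript
type HeaderMap = Record<string, string>

// Hypothetical stand-in for a map server: rejects requests that carry no Referer.
// The real server's rules are not known here; this only mimics the 403 symptom.
function fakeMapServer(headers: HeaderMap): number {
  return headers['Referer'] ? 200 : 403
}

// Hypothetical downloader: sends the request through `send`, optionally adding a Referer.
function downloadMap(send: (headers: HeaderMap) => number, referer?: string): number {
  const headers: HeaderMap = referer ? { Referer: referer } : {}
  return send(headers)
}

// One call without the header and one with the value used in this PR.
const statusWithout = downloadMap(fakeMapServer)
const statusWith = downloadMap(fakeMapServer, 'https://localhost/')
```

Injecting the transport as a function keeps the check off the network. A test against the live service would catch real policy changes but would also fail whenever the service is unreachable.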

Contributor

@kelson42 kelson42 left a comment

LGTM

@audiodude audiodude merged commit 6c919b0 into main Jul 22, 2024
@audiodude audiodude deleted the referer branch July 22, 2024 15:04
Comment thread src/Downloader.ts
- const mwResp = await axios(url, this.arrayBufferRequestOptions)
+ // The 'Referer' header is set to get around WMF domain origin restrictions.
+ // See: https://github.com/openzim/mwoffliner/issues/2061
+ const mwResp = await axios(url, { ...this.arrayBufferRequestOptions, headers: { Referer: 'https://localhost/' } })

Why not use the actual site/page you are scraping as the referrer instead of this fiction? That would make your scraper function in the same way as any normal user-agent consuming the wiki content and allow you to avoid being seen as deliberately violating hot linking protections used on the Wikimedia content farm.

Contributor

@bd808 That was pretty much my proposal; see the comments and responses above.
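
The alternative raised in this thread, sending the scraped wiki's own URL as the Referer, might look like the sketch below. It is not what was merged. `RequestOptions`, `withReferer` and `refererFromWikiUrl` are hypothetical names, the Italian Wikipedia URL and the User-Agent string are only example values, and whether the real request options already carry headers is an assumption. The inner spread is there because an object spread followed by a new `headers` key replaces any existing `headers` object instead of merging into it.

```typescript
// Minimal shape of the request options used here; the real axios config type has many more fields.
interface RequestOptions {
  responseType?: string
  headers?: Record<string, string>
}

// Returns a copy of `base` with a Referer header added, keeping any headers already present.
function withReferer(base: RequestOptions, referer: string): RequestOptions {
  return { ...base, headers: { ...(base.headers ?? {}), Referer: referer } }
}

// Derives a Referer from the wiki URL passed to the scraper (hypothetical helper).
function refererFromWikiUrl(mwUrl: string): string {
  return new URL(mwUrl).origin + '/'
}

// Example values only: the base options and the wiki URL are illustrative.
const options = withReferer(
  { responseType: 'arraybuffer', headers: { 'User-Agent': 'example-scraper' } },
  refererFromWikiUrl('https://it.wikipedia.org/wiki/Pagina_principale'),
)
```

The resulting object could then be passed as the second argument to `axios(url, options)`, in the same position as the options object in the diff above.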

Development

Successfully merging this pull request may close these issues.

Wikimedia Maps HTTP 403 (acting like a bad bot)

3 participants