Re-work image resizing #2101
With this change, we see the following:
Scraping of BM with these settings results in a ZIM that is about 2 MB, or 15% bigger.
With image max size set to 280px, we save another half megabyte. Compared to the 1.13 ZIM, the size increases by about 11%.
By the way, I spent way too long looking through the repo here and even posted a Phabricator ticket, but my intuition is that they were just setting max sizes and we should do the same.
Just checking, the current max appears to be 264px (looking at full English Wikipedia). I quite like the larger 320px, and it is something users have been requesting... I suppose this needs weighing up carefully. 15% on ~100GB doesn't seem like too high a price to pay to me...
How did you find that value? Just by finding some 264px-wide images and assuming there's nothing bigger?
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff            @@
##             main    #2101   +/-   ##
=======================================
  Coverage        ?   74.71%
=======================================
  Files           ?       41
  Lines           ?     3188
  Branches        ?      703
=======================================
  Hits            ?     2382
  Misses          ?      686
  Partials        ?      120
Oh sorry, I did that a bit fast, and having checked, I've found some at 280px down the right-hand side, so no, no guarantee that that's the max, more an average! There are some much wider images centred in some pages and we'll have to be careful not to clobber those. For example here in the "Paris" article:
Okay, when investigating the Paris panorama, I found a bug, which I fixed and added a test for above. Now the sizes look like this:
That's a 1.8% increase for 280px width images, and a 7.65% increase for 320px width images. These numbers have gotten smaller because we're catching more images and squishing them to our max size.
And of course, as you predicted @Jaifroid, we get this: Looks like we need to add logic to only scale in the dominant dimension.
Okay, added logic for scaling the smallest dimension to 320px, while keeping the aspect ratio. This, unsurprisingly, gives us the biggest ZIM yet (
And
@audiodude To me, your approach is going to fail: this is a heuristic to approximate the result of MediaWiki. If we don't have the data with the proper value (which seems to be the case here), we should reuse the logic from MediaWiki; otherwise we will have a high discrepancy.
Thanks for the response @kelson42. I think "fail" is a strong word. Fail in what sense? Fail to produce an identical ZIM as compared to 1.13? Clearly. But I think it succeeds in creating reasonably sized images for scraped MediaWikis. I did spend time trying to find the code in Page Content Service for the "mobile section" endpoint image resizing, but it was hard to track down. I don't see why it's important to do exactly the same thing in the new version of mwoffliner.
As long as we don't reuse the original logic, we will have a hard time handling all the edge cases. This is what I would like to avoid, and why I believe this approach will probably fail. That being said, we have to move forward, and I will open a ticket to follow the "other path", so we can, at least for the moment, benefit from this fix and move forward. Thx for the PR.
See #2107 for the follow-up issue.
… of 320px, assuming it can use URL hacking to generate the proper URL to download
… image in dominant dimension (width or height)
3d237e7 to b56044d
Fixes #1925
Fixes #2071
Re-work image sizing algorithm. It now enforces a maximum image width of 320px.
This value is chosen because it is a reasonable size on both mobile and desktop, but saves bandwidth compared to "full size" images (which was what the code might have been grabbing from data-data-original-file-src before). This should give us a nice tradeoff between efficiency and quality. We can always adjust this value later, of course, but the logic for retrieving and resizing the image URLs remains the same.

Of note, it takes the smallest of:
- src attribute set on the <span> itself that will become the image
- data-data-original-file-src attribute, of the original size the image was in the article

Tests have been added/adjusted.
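
To illustrate the "smallest of" selection described above, here is a minimal sketch, not the code from this PR. It assumes each attribute yields a candidate width in pixels, and that the "URL hacking" mentioned in the commit message means rewriting the `<N>px-` prefix that MediaWiki-style thumbnail URLs carry. The names `pickTargetWidth` and `withThumbWidth`, and the example URL, are hypothetical.

```typescript
// Hypothetical helpers for illustration only; not taken from mwoffliner.
const MAX_WIDTH = 320

// Return the smallest of the cap and whichever candidate widths are known.
// Unknown (undefined) or non-positive candidates are ignored.
function pickTargetWidth(candidates: Array<number | undefined>, maxWidth: number = MAX_WIDTH): number {
  const known = candidates.filter((w): w is number => typeof w === 'number' && w > 0)
  return Math.min(maxWidth, ...known)
}

// Rewrite the "<N>px-" prefix in the last path segment of a thumbnail URL.
// Assumption: the URL follows the ".../thumb/.../<N>px-<name>" convention.
// URLs that do not match the pattern are returned unchanged.
function withThumbWidth(url: string, width: number): string {
  return url.replace(/\/(\d+)px-([^/]+)$/, `/${width}px-$2`)
}

// Usage: an image shown at 500px in the article, with a 1200px original.
const target = pickTargetWidth([500, 1200])
const resized = withThumbWidth('https://example.org/images/thumb/a/ab/Example.jpg/500px-Example.jpg', target)
```

Taking the minimum means an image that is already narrower than the cap is never requested at a larger size than the article used.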
EDIT:
Additionally, based on feedback in this thread, I have modified the algorithm so that it never scales a dimension below 320px. Large panorama images, such as the one in the enwiki Paris article, are therefore no longer destroyed: scaling such a wide image to a width of 320px would leave a height of only 50px.
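
The difference between the two rules can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the function names are hypothetical, the 6400x1000 panorama dimensions are made up to match the 320px by 50px figure above, and the sketch assumes images are never upscaled.

```typescript
interface Size {
  width: number
  height: number
}

// Width-only rule: cap the width at `max` and scale the height to match.
function scaleToWidth(size: Size, max: number = 320): Size {
  if (size.width <= max) return { ...size } // assumption: never upscale
  return { width: max, height: Math.round((size.height * max) / size.width) }
}

// Smallest-dimension rule: scale so the smaller side becomes `max`,
// keeping the aspect ratio, so neither side drops below `max`.
function scaleToSmallestDimension(size: Size, max: number = 320): Size {
  const smallest = Math.min(size.width, size.height)
  if (smallest <= max) return { ...size } // assumption: never upscale
  return {
    width: Math.round((size.width * max) / smallest),
    height: Math.round((size.height * max) / smallest),
  }
}

// Hypothetical panorama with a 6.4:1 aspect ratio.
const panorama: Size = { width: 6400, height: 1000 }
const byWidth = scaleToWidth(panorama) // 320 x 50
const bySmallest = scaleToSmallestDimension(panorama) // 2048 x 320
```

The second rule keeps panoramas legible at the cost of larger files, which is consistent with the ZIM size increase reported earlier in the thread.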