Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-work image resizing #2101

Merged
merged 6 commits into from
Dec 2, 2024
Merged

Re-work image resizing #2101

merged 6 commits into from
Dec 2, 2024

Conversation

audiodude
Copy link
Member

@audiodude audiodude commented Nov 14, 2024

Fixes #1925
Fixes #2071

Re-work image sizing algorithm. It now enforces a maximum image width of 320px.

This value is chosen because it is a reasonable size on both mobile and desktop, but saves bandwidth compared to "full size" images (which was what the code might have been grabbing from data-data-original-file-src before). This should give us a nice tradeoff between efficiency and quality. We can always adjust this value later, of course, but the logic for retrieving and resizing the image URLs remains the same.

Of note, it takes the smallest of:

  • The src attribute set on the <span> itself that will become the image
  • The data-data-original-file-src attribute, of the original size the image was in the article
  • 320px, provided it can look at one of the previous two URLs and find a URL that is hackable to set the image size.

Tests have been added/adjusted.

EDIT:

Additionally, based on feedback in this thread, I have modified the algorithm to refrain from scaling any dimension smaller than 320px. So for large panorama images such as in the enwiki Paris article, the image does not get destroyed because it is so wide (because scaling to a width of 320px leaves a height of 50px)

@audiodude
Copy link
Member Author

Fixes #1925. Fixes #2071.

@audiodude
Copy link
Member Author

With this change, we see the following:

14272	output/wikipedia_bm_all_maxi_2024-11.113.zim
16536	output/wikipedia_bm_all_maxi_2024-11.114.zim

Scraping of BM with these settings results in a ZIM that is about 2 MB, or 15% bigger.

@audiodude
Copy link
Member Author

With image max size set to 280px, we save another half megabyte. Compared to the 1.13 ZIM, the size increases by about 11%.

14272	wikipedia_bm_all_maxi_2024-11.113.zim
15880	wikipedia_bm_all_maxi_2024-11.114-280.zim
16536	wikipedia_bm_all_maxi_2024-11.114.zim

@audiodude
Copy link
Member Author

audiodude commented Nov 14, 2024

By the way, I spent way too long looking through the repo here and even posted a phabricator ticket, but my intuition is that they were just setting max sizes and we should do the same.

@audiodude audiodude requested a review from kelson42 November 14, 2024 18:43
@Jaifroid
Copy link
Collaborator

Just checking, the current max appears to be 264px (looking at full English Wikipedia). I quite like the larger 320px, and it is something users have been requesting... I suppose this needs weighing up carefully. 15% on ~100GB doesn't seem like too high a price to pay to me...

@audiodude
Copy link
Member Author

Just checking, the current max appears to be 264px (looking at full English Wikipedia). I quite like the larger 320px, and it is something users have been requesting... I suppose this needs weighing up carefully. 15% on ~100GB doesn't seem like too high a price to pay to me...

How did you find that value? Just by finding some 264 width images and assuming there's nothing bigger?

Copy link

codecov bot commented Nov 14, 2024

Codecov Report

Attention: Patch coverage is 93.02326% with 3 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@7d95c2b). Learn more about missing BASE report.
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/renderers/wikimedia-mobile.renderer.ts 93.02% 3 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2101   +/-   ##
=======================================
  Coverage        ?   74.71%           
=======================================
  Files           ?       41           
  Lines           ?     3188           
  Branches        ?      703           
=======================================
  Hits            ?     2382           
  Misses          ?      686           
  Partials        ?      120           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@Jaifroid
Copy link
Collaborator

How did you find that value? Just by finding some 264 width images and assuming there's nothing bigger?

Oh sorry, I did that a bit fast, and having checked, I've found some at 280px down the right-hand side, so no, no guarantee that that's the max, more an average! There are some much wider images centred in some pages and we'll have to be careful not to clobber those. For example here in the "Paris" article:

image

@audiodude
Copy link
Member Author

Okay, when investigating the Paris panorama, I found a bug, which I fixed and added a test for above. Now the sizes look like this:

14272	output/wikipedia_bm_all_maxi_2024-11.113.zim
14532	output/wikipedia_bm_all_maxi_2024-11.114-280.zim
15364	output/wikipedia_bm_all_maxi_2024-11.114-320.zim

That's a 1.8% increase for 280px width images, and a 7.65% increase for 320px width images. These numbers have gotten smaller because we're catching more images and squishing them to our max size.

@audiodude
Copy link
Member Author

And of course, as you predicted @Jaifroid, we get this:

image

Looks like we need to add logic to only scale in the dominant dimension.

@audiodude
Copy link
Member Author

Okay, added logic for scaling the smallest dimension to 320px, while keeping the aspect ratio.

This, unsurprisingly, gives us the biggest ZIM yet (.114A-320 is the current build)

14272	wikipedia_bm_all_maxi_2024-11.113.zim
14532	wikipedia_bm_all_maxi_2024-11.114-280.zim
15364	wikipedia_bm_all_maxi_2024-11.114-320.zim
15640	wikipedia_bm_all_maxi_2024-11.114A-320.zim

And

image

@kelson42
Copy link
Collaborator

kelson42 commented Dec 1, 2024

@audiodude To me your approach is going to fail, this is an heuristic to approach the result of Mediawiki. If we don´t have the data with the proper value (which seems the case) here, we should retake the logic from Mediawiki, otherwise we will have a hight discrepancy.

@audiodude
Copy link
Member Author

@audiodude To me your approach is going to fail, this is an heuristic to approach the result of Mediawiki. If we don´t have the data with the proper value (which seems the case) here, we should retake the logic from Mediawiki, otherwise we will have a hight discrepancy.

Thanks for the response @kelson42. I think "fail" is a strong word. Fail in what sense? Fail to produce an identical ZIM as compared to 1.13? Clearly. But I think it succeeds in creating reasonably sized images for scraped MediaWikis.

I did spend time trying to find the code in Page Content Service for the "mobile section" endpoint image resizing, but it was hard to track down. I don't see why it's important to do exactly the same thing in the new version of mwoffliner.

@kelson42
Copy link
Collaborator

kelson42 commented Dec 2, 2024

@audiodude To me your approach is going to fail, this is an heuristic to approach the result of Mediawiki. If we don´t have the data with the proper value (which seems the case) here, we should retake the logic from Mediawiki, otherwise we will have a hight discrepancy.

Thanks for the response @kelson42. I think "fail" is a strong word. Fail in what sense? Fail to produce an identical ZIM as compared to 1.13? Clearly. But I think it succeeds in creating reasonably sized images for scraped MediaWikis.

I did spend time trying to find the code in Page Content Service for the "mobile section" endpoint image resizing, but it was hard to track down. I don't see why it's important to do exactly the same thing in the new version of mwoffliner.

As long as we don't original, we will have a hard time to handle all the edge cases. This is what I would like to avoid and why I believe this approach will probably fail.

Thad been said, we hae to move forward and I will open a ticket to follow the "other path", so we can - at least for the moment - benefit of this fix and move forward. Thx for the PR.

@kelson42
Copy link
Collaborator

kelson42 commented Dec 2, 2024

See #2107 for the follow-up issue.

@kelson42 kelson42 force-pushed the better-img-resizing branch from 3d237e7 to b56044d Compare December 2, 2024 18:39
@kelson42 kelson42 merged commit d7818ff into main Dec 2, 2024
2 of 3 checks passed
@kelson42 kelson42 deleted the better-img-resizing branch December 2, 2024 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants