Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't write image descriptions twice #1816

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pavel-karatsiuba
Copy link
Contributor

@pavel-karatsiuba pavel-karatsiuba commented Mar 23, 2023

If the image contains the same description in the nearest DOM element then do not add an image description because it will be displayed twice.

Fix: #1536

@pavel-karatsiuba pavel-karatsiuba force-pushed the treat_image_description branch from 185b102 to 51c6946 Compare March 23, 2023 17:48
@codecov
Copy link

codecov bot commented Mar 23, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.06 🎉

Comparison is base (e1e0798) 70.90% compared to head (51c6946) 70.97%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1816      +/-   ##
==========================================
+ Coverage   70.90%   70.97%   +0.06%     
==========================================
  Files          23       23              
  Lines        2609     2615       +6     
  Branches      593      594       +1     
==========================================
+ Hits         1850     1856       +6     
  Misses        653      653              
  Partials      106      106              
Impacted Files Coverage Δ
src/util/saveArticles.ts 81.54% <100.00%> (+0.19%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@kelson42
Copy link
Collaborator

kelson42 commented Mar 23, 2023

@pavel-karatsiuba in a first place, before codin, understanding of the problem. why a few images suffer of this problem and other don't? I guess the bug ia not in the original dom?

Pretyy sure there is a better way to solve it, once we fully undetstand what is going on.

@pavel-karatsiuba
Copy link
Contributor Author

On the original Wikipedia page, I see other tags and it does not contains the description for the image, but only for the block.

The easiest way is not to use the figcaption tag as a description. But maybe it is used for other pages.
My fix is testing if we have the same text from the figcaption tag and from td tag.

This is the HTML from API:

<td colspan="2" style="margin:5px 0;text-align:center">
    <figure class="mw-halign-center">
        <a>...</a>
        <figcaption>Das Gebiet des Nationalparks</figcaption>
    </figure>
    Das Gebiet des Nationalparks
</td>

@kelson42
Copy link
Collaborator

@pavel-karatsiuba Does our parser is adapted at all to parse the "figure" DOM element? Have you read https://www.mediawiki.org/wiki/Parsoid/Parser_Unification/Media_structure/FAQ/en ?

@pavel-karatsiuba
Copy link
Contributor Author

I have read this document and see that our code is fit to described format.
But the problem is the same: we are getting two equal texts.
The current fix is not acceptable because we need to use text from figcaption tag.

So I propose to remove the text which is placed after the figure tag.

@kelson42
Copy link
Collaborator

@pavel-karatsiuba How do you explain the text is not displayed in duplicate on Wikipedia itself? The problem behind this is big.

@pavel-karatsiuba
Copy link
Contributor Author

@kelson42 I discovered more about this issue and found the ticked with almost the same problem: #1544. In this ticket, I see the link to the upstream ticket, which is opened.

@kelson42
Copy link
Collaborator

kelson42 commented Mar 26, 2023

I don't understand, please explain.

@pavel-karatsiuba
Copy link
Contributor Author

https://de.wikipedia.org/api/rest_v1/page/html/Nationalpark_Eifel
On the REST desktop page source code I see also two descriptions. But in Browser I don't see a second description because it is hidden by a simple CSS rule:

figure[typeof~='mw:File'] > figcaption {
    display: none;
}

mwoffliner also has the same CSS rule. But mwoffliner uses the mobile version and API for the mobile version does not provide the typeof attributes.

To see the difference, you can compare the next pages in the browser:
https://de.wikipedia.org/api/rest_v1/page/html/Nationalpark_Eifel
https://de.wikipedia.org/api/rest_v1/page/mobile-html/Nationalpark_Eifel

@kelson42
Copy link
Collaborator

@pavel-karatsiuba Why you refer to https://de.wikipedia.org/api/rest_v1/page/mobile-sections/Nationalpark_Eifel ?! AFAIK MWoffliner does not use this API (this is the topic of #1664)!

@pavel-karatsiuba
Copy link
Contributor Author

Sorry, I didn't know about the old and new versions of API. I thought that mobile-html and mobile-sections return the same data but with a different view.

This is the ticket that described the same situation: https://phabricator.wikimedia.org/T291779
I think this is a duplicate ticket. The problem described in this ticket is the same as in this one: #1544. Both tickets have problem with description which should be hidden. But we can't hide it simply because for mobile-section we did not get typeof attribute.

My PR is fixing only the current ticket.

@stale
Copy link

stale bot commented May 26, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label May 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Double image description
2 participants