Bug in extracted images sources returning a base64 #92
Comments
Hi @cayolblake. Welcome 👋 Are you using ReadabiliPy from the command line or as a Python package? If the Python package, what function are you calling? If the command line, what parameters are you passing? When called from the command line, ReadabiliPy defaults to Mozilla's Readability.js package for parsing the source HTML to a simple representation, so this will help us figure out whether the issue is in our code or the Mozilla code. Thanks.
I'm using the Python library. Code is as simple as follows.
|
Hi @cayolblake. The We moved away from using I've been trying to replicate your issue using this webpage: https://edition.cnn.com/2021/02/09/europe/ursula-von-der-leyen-profile-intl/index.html. Is that the one you are having the issue with? Unfortunately when I call
I could also look at extending our pure python simplifier to retain image tags (and potentially embedded video). We currently strip these out as our use case was very focussed on extracting the text while retaining the paragraph structure in simple HTML to allow easy annotation. What's your use case? |
Appreciate your extensive involvement. I don't have the Yes, the link you stated is the one I am testing on and experiencing such a problem. I just tried the I just tried on I believe if the |
So, I just tried replacing the This also fixed some other mistakenly extracted content. The only problem now is for I'll wait for your input on that.
That's brilliant! Thanks very much for trying this out. We're very happy to accept a pull request from you if you'd like to make one. Otherwise, I'll make this change once I solve my node/javascript issues and tag you on the change so you get credit. |
Sure, I can make a pull request once I check whether the image size can be fixed as well.
In terms of the behaviour of Our original use case was focussed on extracting the text content of pages while retaining enough of the structure around paragraphs/lists etc to render sensibly for human annotation of the extracted text. We ended up writing our own HTML simplification functions in python as it was much easier for us to iterate on as we found issues with the |
Solution
|
I guess the best course of action now is to link to |
Option 2 (referencing the Mozilla node package for |
Totally agree that your suggestion is the best course of action.
Solution
Added by @martintoreilly
Original issue
Raised by @cayolblake
Hello,
Any idea why the original tag on the site looks like the following:
<img alt="Ursula von der Leyen, who trained as a physician before going into politics, pictured with her husband and seven children in 2005. " data-src-mini="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-small-169.jpg" data-src-xsmall="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-medium-plus-169.jpg" data-src-small="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-large-169.jpg" data-src-medium="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-exlarge-169.jpg" data-src-large="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-super-169.jpg" data-src-full16x9="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-full-169.jpg" data-src-mini1x1="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-small-11.jpg" data-demand-load="loaded" data-eq-pts="mini: 0, xsmall: 221, small: 308, medium: 461, large: 781" src="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-exlarge-169.jpg" data-eq-state="mini xsmall small medium" data-src="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-exlarge-169.jpg" class="media__image media__image--responsive">
And the extracted one looks like the following with the "src" attribute messed up?
<img data-src-mini="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-small-169.jpg" data-src-xsmall="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-medium-plus-169.jpg" data-src-small="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-large-169.jpg" data-src-medium="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-exlarge-169.jpg" data-src-large="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-super-169.jpg" data-src-full16x9="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-full-169.jpg" data-src-mini1x1="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-small-11.jpg" data-demand-load="not-loaded" data-eq-pts="mini: 0, xsmall: 221, small: 308, medium: 461, large: 781" src="image/gif;base64,R0lGODlhEAAJAJEAAAAAAP///////wAAACH5BAEAAAIALAAAAAAQAAkAAAIKlI+py+0Po5yUFQA7">
Basically,
src="//cdn.cnn.com/cnnnext/dam/assets/190714161615-ursula-von-der-leyen-children-exlarge-169.jpg"
ends up being
src="image/gif;base64,R0lGODlhEAAJAJEAAAAAAP///////wAAACH5BAEAAAIALAAAAAAQAAkAAAIKlI+py+0Po5yUFQA7"
What can I do to avoid this behavior?
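For anyone hitting the same symptom, here is a minimal post-processing sketch. It is not part of ReadabiliPy and is not the fix discussed in this thread. It assumes the real image URL is still available in one of the `data-src*` attributes, as in the extracted tag above. The helper names (`resolve_img_src`, `collect_img_sources`) and the attribute preference order are hypothetical choices for illustration. Only the Python standard library is used.

```python
from html.parser import HTMLParser

# Attributes to try when "src" is unusable. The order is an assumption based
# on the attribute names seen in the CNN tag above, not a general rule.
FALLBACK_ATTRS = ("data-src", "data-src-medium", "data-src-large", "data-src-small")


def needs_fallback(src):
    """Return True if src is missing or looks like an inline base64 image."""
    if not src:
        return True
    value = src.strip().lower()
    # The extracted tag above has lost the "data:" prefix, so accept both forms.
    if value.startswith("data:"):
        value = value[len("data:"):]
    return value.startswith("image/") and ";base64," in value


def resolve_img_src(attrs):
    """Pick a usable image URL from a dict of <img> attributes."""
    src = attrs.get("src")
    if not needs_fallback(src):
        return src
    for name in FALLBACK_ATTRS:
        candidate = attrs.get(name)
        if candidate and not needs_fallback(candidate):
            return candidate
    return src  # nothing better found, so leave the value unchanged


class ImgSrcCollector(HTMLParser):
    """Collect one resolved source per <img> tag in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.append(resolve_img_src(dict(attrs)))


def collect_img_sources(html):
    """Return the resolved image sources found in an HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources
```

Calling `collect_img_sources` on the extracted HTML would return the `data-src-medium` URL for the tag above instead of the base64 placeholder. Images that only have an inline `src` are left as they are.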
The text was updated successfully, but these errors were encountered:
