A part of the webpage is missing #47
Comments
1717848194081 | DEBUG | onValidate called, no changes detected
1717848195613 | DEBUG | onValidate called, no changes detected
1717848261971 | DEBUG | attempting to parse prop metadata
1717848261971 | DEBUG | found prop elements
1717848261971 | DEBUG | found prop elements
1717848261971 | DEBUG | found prop elements
1717848261971 | DEBUG | found prop elements
1717848261971 | DEBUG | attempting to parse prop metadata
1717848261971 | DEBUG | found prop elements
1717848261972 | DEBUG | found prop elements
1717848261972 | DEBUG | found prop elements
1717848261972 | DEBUG | found prop elements
1717848261972 | DEBUG | attempting to parse prop metadata
1717848261972 | DEBUG | found prop elements
1717848261972 | DEBUG | found prop elements
1717848261972 | DEBUG | found prop elements
1717848261972 | DEBUG | found prop elements
1717848261972 | DEBUG | found prop elements
1717848261972 | DEBUG | attempting to parse prop metadata
1717848261972 | DEBUG | found prop elements
1717848261972 | DEBUG | found prop elements
1717848261972 | DEBUG | found prop elements
1717848261972 | DEBUG | found prop elements
1717848261972 | DEBUG | attempting to parse prop metadata
1717848261972 | DEBUG | found prop elements
thanks for the report! often this is due to Readability thinking the block is a nav header or similar. I'll have a look and see whether that can be parsed out safely. are there any more logs to go with this? this one seems to just sort of end mid-parse.
Here's another log: I tried slurping the same url as above twice.
yeah so this is a readability thing. it works by scoring nodes individually. the scores are based on things like link density, classes which are commonly associated with content or non-content, content length, etc. any node with a score > 0 becomes a candidate, and the node with the highest score becomes the "top candidate" for the node which contains the page's actual content. once it has its top candidate, it moves up the tree to check for an ancestor which contains at least three other candidates and has a score that's no more than 25% lower than the top candidate's score. your site produces these candidates:
note that the main container's score is 29.86 and the work content's score is 42.23, so the main container's score is about 30% lower and it gets disqualified. it might be possible to give extra weight to an ancestor with multiple high scoring children but it might degrade the experience when parsing more complex sites.

i'll leave this issue open for now as a reminder for the next time i dive into readability work and tinker with the scoring mechanism. if that tinkering seems promising, i'll open an issue upstream and link it here.

i have to point out though: readability and slurp are geared toward news sites, blogs, and long-form writeups as that is by far the most common use case. reliably extracting page content while excluding irrelevancies like ads and nav bars requires a lot of fuzzy logic which won't work as intended on every possible page structure.
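for anyone following along, here is a rough sketch of the score comparison described above, using the two scores from this page. it is not Readability's actual code: the `Candidate` type and the `dropPercent` and `withinThreshold` helpers are hypothetical names made up for illustration, and the "at least three other candidates" condition is left out.

```typescript
// Hypothetical shape for a scored node; Readability's internals differ.
interface Candidate {
  name: string;
  score: number;
}

// How much lower (in percent) the ancestor's score is than the top candidate's.
function dropPercent(ancestor: Candidate, top: Candidate): number {
  return (1 - ancestor.score / top.score) * 100;
}

// An ancestor passes if its score is no more than `maxDrop` (25% by default,
// per the description above) below the top candidate's score.
function withinThreshold(ancestor: Candidate, top: Candidate, maxDrop = 0.25): boolean {
  return ancestor.score >= top.score * (1 - maxDrop);
}

// Scores quoted in this thread.
const workContent: Candidate = { name: "work content", score: 42.23 };
const mainContainer: Candidate = { name: "main container", score: 29.86 };

console.log(dropPercent(mainContainer, workContent).toFixed(1)); // "29.3"
console.log(withinThreshold(mainContainer, workContent)); // false
```

the cutoff for a top score of 42.23 works out to about 31.67, so the main container at 29.86 misses it by a small margin.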
Thank you for the explanation and for keeping this issue open for further work.
That makes sense. I have tried multiple "url to .MD" tools and found that most of them ignore the first part of my website, which is why I opened this issue. After your explanation, I have more insight into why this happens. Thanks again!
What was slurped:
the website: https://tbp.land