Adds header support for Substack articles #7

Open
mhstoller wants to merge 2 commits into Xatpy:main from mhstoller:5009-fix-html-parsing

@mhstoller

This PR ensures that headers in the HTML are preserved in the extracted content.

Previously, headers (`h2` elements) were stripped when parsing a Substack article. I've attached an HTML example that can be used as a test input for validation.

I tested the unpacked extension in Chrome after these changes, and the article now works as expected. Unfortunately, this is a workaround for a Readability issue that hasn't seen much traction: mozilla/readability#928

Before:
image

After:
image

Input file: How Codex is built - by Gergely Orosz.html
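
To validate the fix against the attached input file, a rough check like the following could be used. This is a minimal sketch and not code from this PR: the helper names `h2Texts` and `missingHeadings` are hypothetical, and the regex-based matching is an assumption chosen to keep the example dependency-free. A real test would more likely parse both documents with a DOM parser.

```typescript
// Hypothetical helpers, not part of this PR.
// Collects the visible text of every <h2> element found in an HTML string.
// Regex matching is a simplification; it does not handle malformed markup.
function h2Texts(html: string): string[] {
  const headings: string[] = [];
  const pattern = /<h2\b[^>]*>([\s\S]*?)<\/h2>/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    // Drop nested tags and collapse whitespace so only the text is compared.
    const text = (match[1] ?? "")
      .replace(/<[^>]+>/g, "")
      .replace(/\s+/g, " ")
      .trim();
    if (text.length > 0) {
      headings.push(text);
    }
  }
  return headings;
}

// Returns the <h2> headings present in the source HTML but absent from the
// extracted content, in source order.
function missingHeadings(sourceHtml: string, extractedHtml: string): string[] {
  const extracted = new Set(h2Texts(extractedHtml));
  return h2Texts(sourceHtml).filter((heading) => !extracted.has(heading));
}
```

An empty result from `missingHeadings` for the attached article and the extension's extracted output would suggest that the headers are preserved.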
