Don't always insert spaces around inline tags? #16

Open
lopuhin opened this issue Dec 12, 2018 · 4 comments
Comments

@lopuhin
Contributor

lopuhin commented Dec 12, 2018

Example input: <div><strong>fo</strong>o</div><div>bar</div>. Current output: 'fo o\nbar', while desired output is 'foo\nbar'.

At the same time, in the changelog I find this:

Fix unwanted joins of words with inline tags: spaces are added for inline tags too, but a heuristic is used to preserve punctuation without extra spaces.

So it's not entirely clear if we should always avoid adding spaces around all inline tags. Maybe we could start with not adding them around tags such as strong, em, etc.

Also note that the more common usage, such as <div><strong>foo</strong>, next</div>, is handled correctly regardless: foo, next.
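The suggestion of not adding spaces around tags such as strong and em could be sketched roughly as follows. This is only an illustration built on the standard library's html.parser, not html-text's actual implementation: the names NO_SPACE_TAGS, NEWLINE_TAGS, SketchExtractor and extract_text_sketch are hypothetical, the tag sets are assumptions, and the punctuation heuristic mentioned in the changelog is deliberately left out.

```python
from html.parser import HTMLParser

# Hypothetical tag sets, chosen for illustration only;
# they are not html-text's real configuration.
NO_SPACE_TAGS = {"strong", "em", "b", "i"}
NEWLINE_TAGS = {"div", "p"}


class SketchExtractor(HTMLParser):
    """Collects text, emitting a separator at each tag boundary."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def _boundary(self, tag):
        if tag in NEWLINE_TAGS:
            # Block-level tags start a new line.
            self.parts.append("\n")
        elif tag not in NO_SPACE_TAGS:
            # Any other tag (e.g. span) still gets a space.
            self.parts.append(" ")
        # Tags in NO_SPACE_TAGS contribute no separator at all.

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.parts.append(data)


def extract_text_sketch(html):
    parser = SketchExtractor()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.parts).split("\n")
    # Collapse runs of whitespace within each line and drop empty lines.
    cleaned = [" ".join(line.split()) for line in lines]
    return "\n".join(line for line in cleaned if line)


print(repr(extract_text_sketch("<div><strong>fo</strong>o</div><div>bar</div>")))
```

With these assumed tag sets, the example input from this issue comes out as 'foo\nbar', while a span would still be surrounded by spaces, so the cases from #1 and #2 would keep their current behaviour.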

@ivanprado

Another example:

>>> import html_text
>>> html_text.extract_text("<p><span>N</span>o one is responsible</p>", True)
'N o one is responsible'

I did a quick test in Chrome, and it does not add spaces between inline elements. Is there any case in which it is clear that spaces should be added?

@lopuhin
Contributor Author

lopuhin commented Sep 6, 2019

@ivanprado please see #2 and #1, there are some examples from the wild where adding spaces makes sense.

@lopuhin
Contributor Author

lopuhin commented Sep 6, 2019

If we had information about the actual CSS properties of the elements, we could do a better job, but that would probably be out of scope for html-text.

@ivanprado

I see. My experience with article bodies is that it is a better policy not to add any spacing when removing span tags; it seems to give much better results. But I can understand that in other parts of a page, or on different pages, the situation could be different (as in the examples in #1 (comment)).
