
Improvement regarding content decoding/encoding #32

Open
vbanos opened this issue Sep 25, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@vbanos (Collaborator)

vbanos commented Sep 25, 2022

webpages.fetch uses requests to fetch the web content and returns response.text as a Python str.
requests automatically decodes the binary data it fetches from the target site, converting bytes to a unicode str.
https://stackoverflow.com/questions/17011357/what-is-the-difference-between-content-and-text
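A minimal sketch of the distinction described above (the byte literal and charset here are illustrative, not taken from the library):

```python
# What requests does, simplified: response.content is the raw bytes from
# the wire; response.text is those bytes decoded to str using the charset
# from the Content-Type header (or a guess when it is absent).
raw = "café".encode("utf-8")   # bytes as received over the wire (response.content)
charset = "utf-8"              # e.g. from "Content-Type: text/html; charset=utf-8"
text = raw.decode(charset)     # what response.text would return

assert isinstance(raw, bytes)
assert isinstance(text, str)
```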

Then you pass that str as input to all the other functions.

But later, many of those functions need to convert the str back to bytes.
For instance, languages._from_text does exactly that:
https://github.com/mediacloud/metadata-lib/blob/main/mcmetadata/languages.py#L58
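The redundant round trip this causes can be illustrated with a toy stand-in (the `_from_text` helper below is a simplified hypothetical, not the real metadata-lib code):

```python
# Hypothetical illustration of the round trip the issue describes:
# the caller decodes bytes to str, then a downstream function
# immediately re-encodes that str back to bytes.
html_bytes = b"<html><body>Hello</body></html>"  # what requests received
html_text = html_bytes.decode("utf-8")           # the response.text step

def _from_text(text: str) -> bytes:
    # downstream library wants bytes again, so we re-encode
    return text.encode("utf-8")

# bytes -> str -> bytes: two conversions with no net gain
assert _from_text(html_text) == html_bytes
```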

My suggestion is:
- Use requests' response.content, which is the fetched content as bytes.
- The extract() method should take html_bytes as input.
- Make the bytes-to-str conversion a separate function and run it inside extract().
- Pass bytes to the extractors that work better with bytes, and str to the extractors that work better with str.

This way, you'll get:
a) correct operation (I wonder whether language detection currently works 100% correctly, since the content has already been auto decoded by requests before trafilatura.utils.decode_file processes it);
b) slightly faster operation, as you'll avoid redundant bytes -> str -> bytes conversions.
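The proposed restructuring could look roughly like this (all names below — decode_html, detect_language, extract_content — are hypothetical stand-ins, not the actual metadata-lib API; real code would also detect the charset rather than assume UTF-8):

```python
# Sketch of the suggestion: extract() takes bytes, decodes exactly once,
# and hands each extractor the representation it works best with.
def decode_html(html_bytes: bytes, encoding: str = "utf-8") -> str:
    # single, explicit decode step, done once instead of per-extractor
    return html_bytes.decode(encoding, errors="replace")

def detect_language(data: bytes) -> str:
    # stand-in for an extractor that prefers raw bytes
    return "en"

def extract_content(text: str) -> str:
    # stand-in for an extractor that prefers decoded text
    return text

def extract(html_bytes: bytes) -> dict:
    html_text = decode_html(html_bytes)  # the one and only conversion
    return {
        "language": detect_language(html_bytes),  # gets bytes
        "content": extract_content(html_text),    # gets str
    }
```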

Thanks

BTW, many of the external libraries you use support both str and bytes as input. E.g., readability-lxml does this:
https://github.com/buriy/python-readability/blob/master/readability/htmls.py
But it's not efficient to let each extractor perform this conversion internally, because it would then be done multiple times (once per extractor). It's better to do it just once.

@rahulbot rahulbot added the enhancement New feature or request label Jan 17, 2024
@philbudne (Contributor)

philbudne commented Sep 16, 2024

Hi, I'm a Media Cloud developer. Today, before seeing this issue, I was wondering whether the character-set decoding code currently used in our production story-indexer pipeline (https://github.com/mediacloud/story-indexer/tree/main) could, or should, migrate to this library (metadata-lib).

I took a quick look at the code:
https://github.com/mediacloud/story-indexer/blob/main/indexer/workers/parser.py#L36
(which, by the way, handles data that might NOT have been fetched by requests), again before seeing this issue, and thought that, given the complexity, maybe not...

For full context, we fetch HTML in a separate pipeline stage in story-indexer, and the code that handles the requests response is here: https://github.com/mediacloud/story-indexer/blob/main/indexer/workers/fetcher/tqfetcher.py#L564 (which passes undecoded resp.content on to the parser stage)
