`webpages.fetch` uses `requests` to fetch the web content and returns `response.text` as a Python `str`. `requests` auto-decodes the binary data it fetches from the target site from `bytes` to unicode `str`: https://stackoverflow.com/questions/17011357/what-is-the-difference-between-content-and-text

Then, in all the other functions, you pass `str` as input. But later, in many functions, you need to convert that `str` back to `bytes`. For instance, you do that in `languages._from_text`: https://github.com/mediacloud/metadata-lib/blob/main/mcmetadata/languages.py#L58
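To make the roundtrip concrete, here is a minimal sketch of the current flow (the URL is just a placeholder):

```python
import requests

# requests fetches raw bytes and decodes them to str, guessing the
# charset from the response headers (with fallbacks of its own).
response = requests.get("https://example.com/article")
html = response.text  # bytes -> str happens here

# Later, a function like languages._from_text needs bytes again,
# so the already-decoded str gets re-encoded.
html_bytes = html.encode("utf-8")  # str -> bytes: the redundant roundtrip

# The original bytes were available all along:
raw = response.content  # the fetched content, still bytes
```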
My suggestion is:

1. Use `requests` -> `response.content`, which is the fetched content in `bytes`.
2. The `extract()` method should take `html_bytes` as input.
3. Make the `bytes` -> `str` conversion a separate function and run it inside `extract()`.
4. Pass `bytes` to the extractors that work better with `bytes`, and `str` to the extractors that work better with `str` (a minimal sketch follows this list).
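A sketch of the proposed shape. The extractor names (`detect_language`, `extract_text`) are hypothetical stand-ins, not the library's real API; `trafilatura.utils.decode_file` is the `bytes` -> `str` helper mentioned above:

```python
import requests
from trafilatura.utils import decode_file  # bytes -> str with charset detection


def detect_language(html_bytes: bytes) -> str:
    """Stand-in for a detector that works better on raw bytes."""
    return "und"


def extract_text(html_str: str) -> str:
    """Stand-in for an extractor that works better on str."""
    return html_str


def extract(html_bytes: bytes) -> dict:
    # Decode exactly once, in one place, instead of inside each extractor.
    html_str = decode_file(html_bytes)
    return {
        "language": detect_language(html_bytes),  # gets the raw bytes
        "text": extract_text(html_str),           # gets the decoded str
    }


response = requests.get("https://example.com/article")
metadata = extract(response.content)  # pass bytes, not response.text
```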
This way, you'll have:

a) a correct operation (I wonder whether language detection is working 100% correctly, since the content has already been auto-decoded and re-encoded by `requests` before `trafilatura.utils.decode_file` ever sees it; see the sketch below);
b) slightly faster code, since you avoid redundant `bytes` -> `str` -> `bytes` conversions.
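One hedged way to check the correctness worry in (a): compare what the detector sees after the roundtrip with what it would see from the original bytes.

```python
import requests
from trafilatura.utils import decode_file

response = requests.get("https://example.com/article")

# What downstream code sees today: bytes reconstructed from a str
# that requests already decoded, possibly with a wrong charset guess.
roundtripped = response.text.encode("utf-8")

# What it could see instead: the original bytes from the wire.
original = response.content

# If requests guessed the charset wrong, these can disagree, and the
# language detector then runs on subtly corrupted input.
print(decode_file(roundtripped) == decode_file(original))
```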
Thanks
BTW, many of the external libraries you use support both `str` and `bytes` as input; e.g. `readability-lxml` does this: https://github.com/buriy/python-readability/blob/master/readability/htmls.py

But it's not efficient to leave this conversion to the extractors, because each of them would then repeat it internally. It's better to do it just once.
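For instance (a sketch based on the linked `htmls.py`), `readability.Document` accepts either type and decodes internally when given bytes:

```python
from readability import Document  # readability-lxml

html_bytes = b"<html><body><p>Some article text.</p></body></html>"

# Document accepts both bytes and str; given bytes, it detects the
# encoding and decodes internally (see htmls.py in the link above).
doc_from_bytes = Document(html_bytes)
doc_from_str = Document(html_bytes.decode("utf-8"))

# Both work, but if every extractor decodes for itself, the same
# bytes -> str work is repeated once per extractor per page.
print(doc_from_bytes.short_title())
print(doc_from_str.short_title())
```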
Hi, I'm a mediacloud developer, and today, before seeing this issue, I was wondering whether the character-set decoding code currently used in our production story-indexer pipeline (https://github.com/mediacloud/story-indexer/tree/main) could, or should, migrate to this (metadata-lib) library.