If called from multiple threads or processes, multiple inputs collide on the same "full.html" / "article.json" files on the filesystem, and erroneous responses are returned.
The cause is this code in 'simple_json_from_html_string_safe()':
    if use_readability:
        temp_dir = tempfile.gettempdir()
        # Write input HTML to temporary file so it is available to the node.js script
        html_path = os.path.join(temp_dir, "full.html")
        with open(html_path, 'w') as f:
            f.write(html)
        # Call Mozilla's Readability.js Readability.parse() function via node, writing output to a temporary file
        article_json_path = os.path.join(temp_dir, "article.json")
        jsdir = os.path.join(os.path.dirname(__file__), 'javascript')
        with chdir(jsdir):
            subprocess.check_call(["node", "ExtractArticle.js", "-i", html_path, "-o", article_json_path])
        # Read output of call to Readability.parse() from JSON file and return as Python dictionary
        with open(article_json_path) as f:
            input_json = json.loads(f.read())
Here tempfile.gettempdir() returns the global system temp directory, for example "/tmp", so every caller gets the same directory.
Concurrent calls therefore collide on "/tmp/full.html" and "/tmp/article.json".
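To make the collision concrete, here is a minimal sketch that only reproduces the path construction from the excerpt. The helper name `shared_html_path` is made up for illustration and is not part of the library; nothing is written to disk and node is not invoked.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def shared_html_path():
    # Hypothetical helper: same path construction as the excerpt,
    # a fixed file name inside the global system temp dir.
    return os.path.join(tempfile.gettempdir(), "full.html")


# Each worker thread computes its input path independently.
with ThreadPoolExecutor(max_workers=4) as pool:
    paths = list(pool.map(lambda _: shared_html_path(), range(4)))

# All workers end up with one and the same path, so concurrent
# writers would overwrite each other's "full.html".
print(len(set(paths)))
```

Since all four workers resolve to a single path, whichever call writes last determines what the node script reads for every other call still in flight.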
A possible fix is to add a 'use_readability_temp_dir' parameter, like so:
    def simple_json_from_html_string_safe(html, content_digests=False, node_indexes=False, use_readability=False, use_readability_temp_dir=None):
        if use_readability:
            # If no specific temp_dir is provided, use the global system temp dir; this will cause collisions in multiprocess situations
            temp_dir = use_readability_temp_dir or tempfile.gettempdir()
            ...
@erpic: This project is no longer actively developed. If you are happy to make a PR with this fix, I'm happy to review and merge it when I have some free time.
Any preference between the approach above (minimal change, but you have to explicitly set 'use_readability_temp_dir' to be thread safe) and touching a couple more lines so that we always create a unique temp dir within that function, which makes it always thread safe? The latter seems better to me.
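For reference, a rough sketch of what the second option (a unique temp dir per call) could look like. This is not the code from the pull request: `round_trip_in_private_dir` is a hypothetical stand-in that only mimics the file handling, and the node subprocess is replaced by a placeholder that copies the input into the JSON output, so the example runs without node installed.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def round_trip_in_private_dir(html):
    # Hypothetical stand-in for the readability branch.
    # TemporaryDirectory creates a uniquely named directory per call
    # and removes it (with its contents) when the block exits.
    with tempfile.TemporaryDirectory() as temp_dir:
        html_path = os.path.join(temp_dir, "full.html")
        article_json_path = os.path.join(temp_dir, "article.json")
        with open(html_path, "w") as f:
            f.write(html)
        # Placeholder for the node call (assumption): echo the input
        # HTML into the JSON output file.
        with open(html_path) as f_in, open(article_json_path, "w") as f_out:
            json.dump({"content": f_in.read()}, f_out)
        with open(article_json_path) as f:
            return json.load(f)


# Eight concurrent calls, each with a distinct input.
inputs = ["<p>doc %d</p>" % i for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(round_trip_in_private_dir, inputs))
```

Because every call works in its own directory, the fixed file names "full.html" and "article.json" can stay as they are, and callers do not need to pass anything extra to be thread safe.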
@jemrobinson here is a pull request
I believe this can be merged as is. Note that the tests fail both before and after these changes, although after the merge the remaining failures are really just extra linebreaks.