Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simple_json_from_html_string(...) with use_readability=True is not multi-thread/process safe #109

Open
erpic opened this issue Sep 4, 2024 · 4 comments

Comments

@erpic
Copy link

erpic commented Sep 4, 2024

If called from multiple threads or processes, multiple inputs will collide into the same "full.html" / "article.json" on the filesystem and erroneous responses will be returned.

This is due to this in 'simple_json_from_html_string_safe()':

    if use_readability:
        temp_dir = tempfile.gettempdir()
        # Write input HTML to temporary file so it is available to the node.js script
        html_path = os.path.join(temp_dir, "full.html")
        with open(html_path, 'w') as f:
            f.write(html)

        # Call Mozilla's Readability.js Readability.parse() function via node, writing output to a temporary file
        article_json_path = os.path.join(temp_dir, "article.json")
        jsdir = os.path.join(os.path.dirname(__file__), 'javascript')
        with chdir(jsdir):
            subprocess.check_call(["node", "ExtractArticle.js", "-i", html_path, "-o", article_json_path])

        # Read output of call to Readability.parse() from JSON file and return as Python dictionary
        with open(article_json_path) as f:
            input_json = json.loads(f.read())

here tempfile.gettempdir() is global, for example "/tmp"
there will be collisions in "/tmp/full.html" and "/tmp/article.json"

possible fix, add 'use_readability_temp_dir' like so:

def simple_json_from_html_string_safe(html, content_digests=False, node_indexes=False, use_readability=False, use_readability_temp_dir=None):

    if use_readability:
        temp_dir = use_readability_temp_dir or tempfile.gettempdir()  # if no specific temp_dir is provided use global system temp dir, this will cause collisions in multiprocess situations
        ...

then use like so:

with tempfile.TemporaryDirectory(prefix="readabilipy", delete=True) as temp_dir:  # create a unique temp dir
    res = simple_json_from_html_string_safe(html, use_readability=True, use_readability_temp_dir=temp_dir
@jemrobinson
Copy link
Member

@erpic: This project is no longer actively developed. If you are happy to make a PR with this fix, I'm happy to review and merge it when I have some free time.

@erpic
Copy link
Author

erpic commented Sep 4, 2024

@jemrobinson will do, many thanks.

Any preference between the approach above (minimal change but you have to explicitely set 'use_readability_temp_dir' to be thread safe) or touching a couple more lines so and we always create a unique temp dir within that function and then it's always thread safe (that seems better to me)?

@jemrobinson
Copy link
Member

Yes - the second one sounds better to me too. Thanks!

@erpic
Copy link
Author

erpic commented Sep 5, 2024

@jemrobinson here is a pull request
I believe this can be merged as is. But note that tests fail (before and after these changes, but after the merge this is really extra linebreaks)

#110

many thanks for your support on this great and useful package!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants