Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 18 additions & 4 deletions pyopencl/cache.py
Original file line number Diff line number Diff line change
Expand Up @@ -88,18 +88,32 @@ def __init__(self, cleanup_m, cache_dir):
except OSError:
pass

# This value was chosen based on the py-filelock package:
# https://github.com/tox-dev/py-filelock/blob/a6c8fabc4192fa7a4ae19b1875ee842ec5eb4f61/src/filelock/_api.py#L113
# When running pyopencl in an application with multiple ranks
# that share a cache_dir, higher timeouts can lead to
# application stalls even with low numbers of ranks.
# cf. https://github.com/inducer/pyopencl/pull/504
wait_time_seconds = 0.05
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for this change? It seems really short. If you were running this (by mistake, say) on 1000 nodes sharing a cache directory, I think the resulting thrashing would not be a good thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specific value was taken from https://github.com/tox-dev/py-filelock/blob/a6c8fabc4192fa7a4ae19b1875ee842ec5eb4f61/src/filelock/_api.py#L113. I can't really predict what the result of this with thousands of nodes would be, but we already keep on hitting the 1 second timeout when running with 2 ranks on a single node, which I guess indicates that the existing polling time would cause minutes-long waiting times when running with thousands of ranks.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that argument. Could you add a comment justifying this choice of time-out value? (making reference to the single-node interactive "test" use case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 387011e


# Warn every 10 seconds if not able to acquire lock
warn_attempts = int(10/wait_time_seconds)

# Exit after 60 seconds if not able to acquire lock
exit_attempts = int(60/wait_time_seconds)

from time import sleep
sleep(1)
sleep(wait_time_seconds)

attempts += 1

if attempts > 10:
if attempts % warn_attempts == 0:
from warnings import warn
warn("could not obtain cache lock--delete '%s' if necessary"
% self.lock_file)

if attempts > 3 * 60:
raise RuntimeError("waited more than three minutes "
if attempts > exit_attempts:
raise RuntimeError("waited more than one minute "
"on the lock file '%s'"
"--something is wrong" % self.lock_file)

Expand Down