Throttle workers on failure
If workers fail to post a task_update, or fail for other reasons,
they start communicating with the server more. This includes up to 5 attempts
to update the task, a failure message, a PGN upload for the task, a request for a new task,
etc. If the actual cause of the failure is load on the server, that
load then increases significantly, leading to an unstable,
runaway situation in which most workers fail and from which the server cannot recover.

The attached patch tries to improve on this by progressively increasing
the retry time for task updates after each failure.
If a worker really fails, it now starts with a 2 min sleep before retrying.

This patch was tested successfully over the past couple of days
and allowed the server to recover on its own under fairly large load.
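
As a rough illustration of what "progressively" means here, the sketch below compares the old fixed sleep with the new `UPDATE_RETRY_TIME * (attempt + 2)` rule from the games.py change. The helper name `retry_delays` and the unit value of `base` are assumptions made for the example; the real value of `UPDATE_RETRY_TIME` is not shown in this commit, and the sketch assumes the sleep runs after every failed attempt.

```python
def retry_delays(base, attempts=5, progressive=True):
    """Return the sleep (in seconds) taken after each failed update attempt.

    `base` stands in for UPDATE_RETRY_TIME (value not shown in the diff).
    """
    if progressive:
        # New rule: base * (attempt + 2) for attempt = 0, 1, 2, ...
        return [base * (attempt + 2) for attempt in range(attempts)]
    # Old rule: the same fixed sleep after every failed attempt
    return [base] * attempts


if __name__ == "__main__":
    base = 1.0  # hypothetical unit; scale by the real UPDATE_RETRY_TIME
    old = retry_delays(base, progressive=False)
    new = retry_delays(base)
    print("old:", old, "total:", sum(old))
    print("new:", new, "total:", sum(new))
```

Under these assumptions, five failed attempts spread over 20 times the base delay instead of 5 times, so a struggling server sees the same number of update requests over a window four times as long.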
vondele committed May 5, 2024
1 parent c3d0478 commit 242e68f
Showing 4 changed files with 6 additions and 6 deletions.
2 changes: 1 addition & 1 deletion server/fishtest/api.py
@@ -32,7 +32,7 @@
This depends on how frequently the main instance flushes its `run_cache`.
"""

-WORKER_VERSION = 236
+WORKER_VERSION = 237


def validate_request(request):
4 changes: 2 additions & 2 deletions worker/games.py
@@ -1021,7 +1021,7 @@ def shorten_hash(match):
):
# Attempt to send game results to the server. Retry a few times upon error.
update_succeeded = False
-for _ in range(5):
+for attempt in range(5):
try:
response = send_api_post_request(
remote + "/api/update_task", result
@@ -1048,7 +1048,7 @@ def shorten_hash(match):
update_succeeded = True
num_games_updated = num_games_finished
break
-time.sleep(UPDATE_RETRY_TIME)
+time.sleep(UPDATE_RETRY_TIME * (attempt + 2))
if not update_succeeded:
raise WorkerException("Too many failed update attempts")
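
The hunk above only shows fragments of the retry loop, so here is a minimal self-contained sketch of the same pattern. It is not the worker's actual code: `update_with_backoff`, `send`, `base_delay` and the injectable `sleep` are hypothetical names for this example, and the `WorkerException` class is a stand-in for the worker's own. The real loop calls `send_api_post_request` and uses `UPDATE_RETRY_TIME`.

```python
import time


class WorkerException(Exception):
    """Stand-in for the worker's own exception class."""


def update_with_backoff(send, base_delay, attempts=5, sleep=time.sleep):
    """Call `send()` until it returns True, sleeping longer after each failure.

    Returns the number of attempts used, or raises WorkerException if
    every attempt fails.
    """
    for attempt in range(attempts):
        try:
            if send():
                return attempt + 1
        except Exception as exc:  # any error counts as a failed attempt
            print(f"update attempt {attempt + 1} failed: {exc}")
        # Linear backoff: 2x, 3x, 4x, ... the base delay
        sleep(base_delay * (attempt + 2))
    raise WorkerException("Too many failed update attempts")


if __name__ == "__main__":
    # Simulate two failures followed by a success, recording the sleeps
    # instead of actually waiting.
    outcomes = iter([False, False, True])
    slept = []
    used = update_with_backoff(lambda: next(outcomes), 1.0, sleep=slept.append)
    print("attempts used:", used, "sleeps:", slept)
```

Passing `sleep` as a parameter is a choice made for this sketch so the schedule can be checked without waiting; the diff itself calls `time.sleep` directly.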

2 changes: 1 addition & 1 deletion worker/sri.txt
@@ -1 +1 @@
-{"__version": 236, "updater.py": "Mg+pWOgGA0gSo2TuXuuLCWLzwGwH91rsW1W3ixg3jYauHQpRMtNdGnCfuD1GqOhV", "worker.py": "6+uEiLrveb452zembFH3erS4psr6m57/DXLVa7nXiO1zogUAy9AH5b9qFpsidmaJ", "games.py": "U9tidRvT37Rq3e0FByhHWLTV9p+4nfj5+c0W1wtHCseP7b58rxrlTrz6W6LQvsu1"}
+{"__version": 237, "updater.py": "Mg+pWOgGA0gSo2TuXuuLCWLzwGwH91rsW1W3ixg3jYauHQpRMtNdGnCfuD1GqOhV", "worker.py": "xIyJWylOkJngYAxSZ6wvJeBBtc0djb4njQqoz2CbeJPljOeHUOmgmxvIGQz8Jlp9", "games.py": "o8RR3ICNHS5v82eqXz+Jq8tHJsU3MTMUa+IegCjWu2UrVPveXQukqh5DEQgVhdLa"}
4 changes: 2 additions & 2 deletions worker/worker.py
@@ -55,10 +55,10 @@
# Several packages are called "expression".
# So we make sure to use the locally installed one.

-WORKER_VERSION = 236
+WORKER_VERSION = 237
FILE_LIST = ["updater.py", "worker.py", "games.py"]
HTTP_TIMEOUT = 30.0
-INITIAL_RETRY_TIME = 15.0
+INITIAL_RETRY_TIME = 120.0
THREAD_JOIN_TIMEOUT = 15.0
MAX_RETRY_TIME = 900.0 # 15 minutes
IS_COLAB = False
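
This hunk changes only the constants, not the code that consumes them, so the following is a sketch of how a delay starting at `INITIAL_RETRY_TIME` and bounded by `MAX_RETRY_TIME` might evolve. The doubling rule (`factor=2.0`) and the helper name `outer_delays` are assumptions for illustration; the commit does not show how the worker grows the delay between the two values.

```python
INITIAL_RETRY_TIME = 120.0  # value from the diff (previously 15.0)
MAX_RETRY_TIME = 900.0      # value from the diff, 15 minutes


def outer_delays(n, initial=INITIAL_RETRY_TIME, cap=MAX_RETRY_TIME, factor=2.0):
    """Return the first `n` delays, each `factor` times the last, capped at `cap`.

    The growth rule is an assumption of this sketch, not taken from the diff.
    """
    delays = []
    delay = initial
    for _ in range(n):
        delays.append(delay)
        delay = min(cap, delay * factor)
    return delays


if __name__ == "__main__":
    print("old start:", outer_delays(5, initial=15.0))
    print("new start:", outer_delays(5))
```

With doubling assumed, a start of 120 s reaches the 900 s cap on the fourth delay, whereas a start of 15 s would still be at 240 s after five, which matches the commit's intent of backing off failed workers much sooner.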
