Split flush_buffers() #2048

Conversation
4abbdad to ab51f90
flush_buffers: syncs oldest cache entry (period: 1s)
clean_cache: evicts old cache entries (period: 60s)
scavenge_dead_tasks: (period: 60s)

This PR should also fix the deadlock official-stockfish#2041.
Rebased on master after merging #2047 to fix the conflict. |
Looks good on DEV, testing on PROD now. |
It seems dead tasks are no longer scavenged. Unfortunately I will be afk till tonight. |
The last dead tasks in the log are from 3 hours ago:
fishtest@tests:~$ sudo journalctl -u fishtest@6543 --since "3 hours ago" | grep dead
Jun 01 08:36:23 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a8e00062b2c3cf814fe8d task_id: 789 worker: AndreasKrug-1cores-ea30eef7-1aaf
Jun 01 08:36:35 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a7eea062b2c3cf814fe40 task_id: 1629 worker: technologov-28cores-r314-7a2e
Jun 01 08:37:02 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9821062b2c3cf814fed8 task_id: 1415 worker: technologov-28cores-r306-e3ea
Jun 01 08:37:04 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9f87062b2c3cf814ff0e task_id: 1313 worker: technologov-28cores-r309-5bed
Jun 01 08:37:27 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665ada260223e235f05b7ce3 task_id: 18 worker: technologov-28cores-r308-72c0
Jun 01 08:37:36 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa743062b2c3cf814ff47 task_id: 2126 worker: technologov-28cores-r307-02fe
Jun 01 08:37:37 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9f87062b2c3cf814ff0e task_id: 1091 worker: technologov-28cores-r302-0374
Jun 01 08:37:37 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9f87062b2c3cf814ff0e task_id: 1553 worker: technologov-28cores-r303-41a1
Jun 01 08:37:38 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa4e1062b2c3cf814ff2c task_id: 1147 worker: technologov-28cores-r301-8b0f
Jun 01 08:37:54 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa4e1062b2c3cf814ff2c task_id: 1502 worker: technologov-28cores-r304-5a01
Jun 01 08:37:55 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665accd50223e235f05b7ca3 task_id: 487 worker: technologov-28cores-r300-1cab
Jun 01 08:38:29 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a7eea062b2c3cf814fe40 task_id: 901 worker: technologov-28cores-r305-855d
Jun 01 08:38:29 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a7eea062b2c3cf814fe40 task_id: 2027 worker: technologov-28cores-r310-cc32
Jun 01 08:39:01 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9ec2062b2c3cf814ff02 task_id: 600 worker: tolkki963-2cores-8b0a6e2c-0a2a
Jun 01 08:39:06 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a8e00062b2c3cf814fe8d task_id: 921 worker: tolkki963-2cores-5697b53c-fe0d
Jun 01 08:40:44 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a8e00062b2c3cf814fe8d task_id: 824 worker: tolkki963-2cores-dd850aff-7edf
fishtest@tests:~$ sudo journalctl -u fishtest@6543 --since "2 hours ago" | grep dead
|
I restarted fishtest and this triggered a dead tasks purge. |
Are there any exceptions in the server log? |
No |
I'm reloading the DB on DEV to have many dead tasks to purge :) |
Dead tasks purge on DEV looks fine with master and this PR after reloading the DB:
Jun 01 13:48:55 dfts-0 systemd[1]: Stopping Fishtest Server port 6543...
Jun 01 13:48:55 dfts-0 pserve[5465]: flush
Jun 01 13:48:55 dfts-0 pserve[5465]: done
Jun 01 13:49:00 dfts-0 systemd[1]: Stopped Fishtest Server port 6543.
Jun 01 13:49:34 dfts-0 systemd[1]: Started Fishtest Server port 6543.
Jun 01 13:49:37 dfts-0 pserve[6303]: Starting server in PID 6303.
Jun 01 13:50:07 dfts-0 pserve[6303]: Request_task: no useful cached runs left
Jun 01 13:50:07 dfts-0 pserve[6303]: Request_task: refresh queue
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 264 worker: technologov-28cores-r301-8b0f
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 334 worker: vdv-7cores-473c6083-3c48*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 335 worker: vdv-7cores-80cd7727-9c82*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 336 worker: vdv-7cores-fa3ade53-c640*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 337 worker: vdv-7cores-09497247-a19f*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 338 worker: vdv-7cores-99126dee-bd34*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 339 worker: vdv-7cores-3c7a73a0-b78b*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 340 worker: vdv-7cores-1c7e40a2-852e*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 341 worker: vdv-7cores-d8fc0606-d410*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 342 worker: vdv-7cores-c3bfdbb7-45fb* |
Dead tasks purge seems fine on PROD as well; there are new entries:
fishtest@tests:~$ sudo journalctl -u fishtest@6543 --since "2 hours ago" | grep dead
Jun 01 11:21:43 tests.stockfishchess.org pserve[17770]: dead task: run: https://tests.stockfishchess.org/tests/view/665af82c69d88d67b62fd9a7 task_id: 159 worker: technologov-28cores-r310-b4cc
Jun 01 11:23:49 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665afdbf45962393aa9c1c6b task_id: 6 worker: vdbergh-1cores-22f29ecf-9f53
Jun 01 11:36:49 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ad97f0223e235f05b7ccc task_id: 1113 worker: technologov-28cores-r306-7901
Jun 01 11:51:51 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ae6320223e235f05b7d14 task_id: 469 worker: technologov-28cores-r303-3b39
Jun 01 11:51:51 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ae6320223e235f05b7d14 task_id: 1004 worker: technologov-28cores-r300-fba5
Jun 01 11:51:51 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ae6320223e235f05b7d14 task_id: 1024 worker: technologov-28cores-r307-9032
Jun 01 11:51:51 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ae6320223e235f05b7d14 task_id: 1034 worker: technologov-28cores-r309-aaac |
Are the validated random run messages still in the logs? Just checking that the scheduler had not stopped. |
fishtest@tests:~$ sudo journalctl -u fishtest@6543 --since "2 hours ago" | grep validate
Jun 01 09:58:05 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665a8e00062b2c3cf814fe8d...
Jun 01 10:01:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665aa766062b2c3cf814ff49...
Jun 01 10:04:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665aeee00223e235f05b7d21...
Jun 01 10:07:05 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665a9ec2062b2c3cf814ff02...
Jun 01 10:10:10 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:13:11 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665a7eea062b2c3cf814fe40...
Jun 01 10:16:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665aa766062b2c3cf814ff49...
Jun 01 10:19:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ae4010223e235f05b7d0b...
Jun 01 10:21:55 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:25:01 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:28:10 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ae6320223e235f05b7d14...
Jun 01 10:31:06 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ada260223e235f05b7ce3...
Jun 01 10:34:09 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ae4010223e235f05b7d0b...
Jun 01 10:37:01 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:40:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:43:11 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:46:20 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ae7660223e235f05b7d17...
Jun 01 10:49:15 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ab2b2062b2c3cf814ff8a...
Jun 01 10:53:58 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665a9ec2062b2c3cf814ff02...
Jun 01 10:57:03 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665af82c69d88d67b62fd9a7...
Jun 01 11:00:11 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665ae2d90223e235f05b7d04...
Jun 01 11:03:11 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665ae6320223e235f05b7d14...
Jun 01 11:06:11 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665afdbf45962393aa9c1c6b...
Jun 01 11:09:15 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665ab2b2062b2c3cf814ff8a...
Jun 01 11:12:13 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665a9821062b2c3cf814fed8...
Jun 01 11:15:10 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665a64b0062b2c3cf814fdab...
Jun 01 11:18:17 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665af82c69d88d67b62fd9a7...
Jun 01 11:21:26 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665a9ec2062b2c3cf814ff02...
Jun 01 11:24:51 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665b04d38da109e362924e2a...
Jun 01 11:27:44 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ab2b2062b2c3cf814ff8a...
Jun 01 11:30:36 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ae2d90223e235f05b7d04...
Jun 01 11:33:41 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665b00cd45962393aa9c1c75...
Jun 01 11:36:43 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ae2d90223e235f05b7d04...
Jun 01 11:39:44 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ad97f0223e235f05b7ccc...
Jun 01 11:42:42 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ae1a30223e235f05b7cfd...
Jun 01 11:45:39 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665a8e00062b2c3cf814fe8d...
Jun 01 11:48:44 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665a8e00062b2c3cf814fe8d...
Jun 01 11:51:41 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665af82c69d88d67b62fd9a7...
Jun 01 11:54:39 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665b04f78da109e362924e2d...
Jun 01 11:57:31 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665a8e00062b2c3cf814fe8d... |
Hmm... seems normal... |
Dead task collection seems to be working now... |
I'm waiting for the DB backup to finish, then I'll stop fishtest for 10 minutes and tidy up the runs collection; that should create many dead tasks. |
I've done a sampling under the current load, and I have this profile:
One of the questions I have is how long the 'flush_buffers' call actually takes right now, to understand whether it might be a significant part of the server load and whether we should pick a schedule different from 1/s. Would it be possible to instrument the server to print the time taken by exactly that function? |
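A generic way to time exactly one function is a wrapper that prints the wall-clock duration of each call. A minimal sketch (the instrumentation actually applied in the thread patches the function body directly instead):

```python
import functools
import time

def timed(func):
    """Wrap func so each call prints its wall-clock duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            # flush=True so the line shows up promptly in journald
            print(f"{func.__name__}: {time.time() - start}", flush=True)
    return wrapper

# Hypothetical usage:
# RunDb.flush_buffers = timed(RunDb.flush_buffers)
```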
Done:
diff --git a/server/fishtest/rundb.py b/server/fishtest/rundb.py
index 2e9814b2..007f4311 100644
--- a/server/fishtest/rundb.py
+++ b/server/fishtest/rundb.py
@@ -481,6 +481,7 @@ class RunDb:
# For documentation of the cache format see "cache_schema" in schemas.py.
def flush_buffers(self):
+ now = time.time()
oldest_entry = None
old = float("inf")
with self.run_cache_lock:
@@ -495,6 +496,7 @@ class RunDb:
oldest_entry["last_sync_time"] = time.time()
with self.run_cache_write_lock:
self.runs.replace_one({"_id": oldest_run["_id"]}, oldest_run)
+ print(f"flush_buffers: {time.time() - now}", flush=True)
def clean_cache(self):
now = time.time() Jun 01 19:22:56 tests.stockfishchess.org systemd[1]: Started Fishtest Server port 6543.
Jun 01 19:22:59 tests.stockfishchess.org pserve[25750]: Starting server in PID 25750.
Jun 01 19:22:59 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.1885673999786377
Jun 01 19:23:00 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.01155710220336914
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.02557539939880371
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: no useful cached runs left
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: refresh queue
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:02 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.031136274337768555
Jun 01 19:23:03 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.020483016967773438
Jun 01 19:23:03 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:04 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.17199039459228516
Jun 01 19:23:05 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.016495466232299805
Jun 01 19:23:05 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:05 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:06 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.013139009475708008
Jun 01 19:23:07 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.007906436920166016
Jun 01 19:23:08 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.020139694213867188
Jun 01 19:23:09 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:09 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.0214385986328125
Jun 01 19:23:10 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.06426382064819336 |
OK, so probably 0.02s regular, and 0.17s if waiting on a lock from request_task. So every second is kind of reasonable. |
I added the missing s, flush_buffers |
Little histogram; it actually has a bit of a tail. Actual tail:
I would be a bit worried if the tail exceeded 1s; in that case we probably have a new flush_buffers scheduled before the old one finishes. This is actually where I think it would make sense to schedule the next call 1s after the previous one finishes, rather than every 1s. Alternatively, we extend the window a little, to e.g. 2 or 5s. |
The new flush_buffers would actually be executed right after the previous one, not in parallel with it. It would be simple though to optionally not count the execution time of the task during scheduling. I was looking for a good name for that option. |
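The two policies being discussed (count the task's execution time during scheduling, or not) can be sketched as follows; `count_execution_time` is a hypothetical option name, not the one eventually chosen:

```python
import threading
import time

def run_periodic(task, period, stop_event, count_execution_time=True):
    """Fixed-rate (count_execution_time=True): aim for one run every
    `period` seconds regardless of how long task() takes; a slow task
    simply delays the next run, it never overlaps it.
    Fixed-delay (False): wait `period` seconds *after* task() returns,
    so the gap between runs is always at least `period`."""
    next_time = time.time() + period
    while not stop_event.is_set():
        if count_execution_time:
            delay = max(0.0, next_time - time.time())
            if stop_event.wait(delay):
                break
            next_time += period
        else:
            if stop_event.wait(period):
                break
        task()
```

With a task that takes close to the period, fixed-rate scheduling runs back-to-back while fixed-delay halves the call rate; neither policy runs two instances concurrently.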
I have been afk for the better part of the day. Are you saying that in fact there is no problem with the collection of dead tasks in the PR? |
From logs and instrumented code the scavenge of dead tasks seems fine. |
I created this dead task with a worker https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646?show_task=902 (at the time of posting, last update 10min ago). The dead task appears not to be collected. |
I think the PR has a race condition (which you also pointed out): the entry is updated after the lock has been released. I don't see how to trigger this, but who knows... The correct approach is to reacquire the lock before updating it. |
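The general shape of the fix being described, with illustrative names (the exact field involved is not visible on this page): do the slow db write outside the lock, but reacquire the lock before clearing the entry's dirty flag, so a concurrent modification is not silently lost:

```python
import threading

cache_lock = threading.Lock()
cache = {"r1": {"is_changed": True}}  # toy stand-in for the run cache

def sync_entry(run_id, write_to_db):
    """Flush one cache entry, then clear its dirty flag under the lock.
    Clearing the flag after the lock was released (as the racy version
    did) could overwrite a concurrent change by another thread."""
    with cache_lock:
        entry = cache[run_id]
    write_to_db(entry)      # slow I/O deliberately outside the lock
    with cache_lock:        # reacquire before touching shared state
        entry["is_changed"] = False
```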
I analyzed a snippet of log and the number of dead tasks deleted was equal to the internal count (before the release of the lock). But it seems that sometimes the code can be racy. |
I will update the PR, removing the race condition. |
My dead task was finally collected https://tests.stockfishchess.org/actions?max_actions=1&action=dead_task&user=&text=vdbergh&before=1717274816.783141&run_id= It took much longer than 6min though. Perhaps this is an artifact of the high server load? |
It was scavenged in this block: Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: scavenge_dead_tasks: start
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: scavenge_dead_tasks: get lock
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: run_cache count: 46
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: scavenge_dead_tasks: release lock
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead_tasks_count=524
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 252 worker: vdv-7cores-e6adc6b1-5163*
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 254 worker: vdv-7cores-5e50707f-c4d2*
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 257 worker: vdv-7cores-4b019af9-c187*
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 259 worker: vdv-7cores-1456a905-8a5a*
...
Jun 01 20:46:56 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 902 worker: vdbergh-1cores-22f29ecf-acb7 |
- Remove a race condition
- Do not require a cache entry to be clean before scavenging. Under heavy load clean cache entries will be a rarity.
I think I understand what the problem is. I was only scavenging clean cache entries. However under heavy load a cache entry will almost never be clean. So it will not be seen by the scavenging task. Fixed in the last push (I also removed the race condition). |
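A sketch of the behavioral change, with simplified types (fishtest stores timestamps differently; `DEAD_AFTER` and the exact field handling here are illustrative): scan every cache entry, dirty or not, and flag active tasks whose last update is too old:

```python
import time

DEAD_AFTER = 360  # hypothetical timeout in seconds

def find_dead_tasks(run_cache, now=None):
    """Collect (run_id, task_id) for active tasks not updated recently.
    Entries are scanned whether or not they are dirty: under heavy load
    an entry is almost never clean, so a scavenger that skips dirty
    entries (the old behavior) effectively never sees their tasks."""
    now = time.time() if now is None else now
    dead = []
    for run_id, cache_entry in run_cache.items():
        run = cache_entry["run"]
        for task_id, task in enumerate(run["tasks"]):
            if task.get("active") and now - task["last_updated"] > DEAD_AFTER:
                dead.append((run_id, task_id))
    return dead
```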
PROD updated with 6487461 |
I think I will call it |
If it is a bool, 'guaranteeWait' or 'enforceGap' or similar. |
@vondele Could you perhaps also calculate the average duration of flush_buffers()? |
|
Some data with 10k cores, based on 1969 measurements. It appears that the tail is not dependent on waiting for the lock.
diff --git a/server/fishtest/rundb.py b/server/fishtest/rundb.py
index 3097575b..e22dff5f 100644
--- a/server/fishtest/rundb.py
+++ b/server/fishtest/rundb.py
@@ -481,9 +481,11 @@ class RunDb:
# For documentation of the cache format see "cache_schema" in schemas.py.
def flush_buffers(self):
+ t_start = time.time()
oldest_entry = None
old = float("inf")
with self.run_cache_lock:
+ t_lock = time.time()
for cache_entry in self.run_cache.values():
run = cache_entry["run"]
if cache_entry["is_changed"] and cache_entry["last_sync_time"] < old:
@@ -495,6 +497,10 @@ class RunDb:
oldest_entry["last_sync_time"] = time.time()
with self.run_cache_write_lock:
self.runs.replace_one({"_id": oldest_run["_id"]}, oldest_run)
+ t_stop = time.time()
+ print(f"flush_buffers: total {t_stop - t_start}", flush=True)
+ print(f"flush_buffers: wait {t_lock - t_start}", flush=True)
+ print(f"flush_buffers: with {t_stop - t_lock}", flush=True)
|
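The requested average (and the tail) can be computed directly from the printed lines; a sketch assuming the `flush_buffers: total <seconds>` format from the diff above:

```python
import re
import statistics

def flush_stats(log_text):
    """Summarize 'flush_buffers: total <seconds>' durations in a log."""
    durations = [
        float(m.group(1))
        for m in re.finditer(r"flush_buffers: total ([0-9.]+)", log_text)
    ]
    if not durations:
        return None
    durations.sort()
    return {
        "n": len(durations),
        "mean": statistics.mean(durations),
        "p95": durations[int(0.95 * (len(durations) - 1))],
        "max": durations[-1],
    }

# e.g. flush_stats(open("journal.txt").read()) on the journalctl output
```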
Looks good on PROD
Thank you @vdbergh :)
Thanks for the stats. From the average duration, it appears that … Also, I guess that almost all the time is spent writing to the db, during which the GIL is released. |
Trying with some heavier load on the system right now. I think this has already improved the server, as the loops over the tasks in the scavenge function are now executed only every 60s. |
Code instrumented to view the waiting time for the lock under high load. |
current stats for local workers:
|
This is a PR on top of #2047.

Split the convoluted flush_buffers() into three scheduled tasks:
- flush_buffers(): syncs oldest cache entry (period: 1s)
- clean_cache(): evicts old cache entries (period: 60s)
- scavenge_dead_tasks(): (period: 60s)

This PR should also fix the deadlock #2041.