
Split flush_buffers() #2048

Merged 2 commits into master on Jun 2, 2024

Conversation

vdbergh
Contributor

@vdbergh vdbergh commented Jun 1, 2024

This is a PR on top of #2047

Split the convoluted flush_buffers() into three scheduled tasks:

  • flush_buffers(): syncs the oldest cache entry (period: 1s)
  • clean_cache(): evicts old cache entries (period: 60s)
  • scavenge_dead_tasks(): collects dead tasks (period: 60s)

This PR should also fix the deadlock #2041.
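The split can be pictured as three independent periodic tasks running on the server's scheduler. Below is a minimal illustrative sketch (hypothetical `PeriodicTask` class, not the actual fishtest scheduler in util.py):

```python
import threading
import time

class PeriodicTask:
    """Minimal sketch of a periodically scheduled task.
    The real fishtest scheduler in util.py is more elaborate."""

    def __init__(self, period, worker):
        self.period = period
        self.worker = worker
        self.__stop = threading.Event()
        self.__thread = threading.Thread(target=self.__loop, daemon=True)

    def start(self):
        self.__thread.start()

    def stop(self):
        self.__stop.set()
        self.__thread.join()

    def __loop(self):
        # wait() returns False on timeout, True once stop() is called
        while not self.__stop.wait(self.period):
            self.worker()

# The three tasks replacing the old monolithic flush_buffers() would
# then be scheduled independently, e.g.:
#   PeriodicTask(1, flush_buffers)          # sync oldest dirty cache entry
#   PeriodicTask(60, clean_cache)           # evict old cache entries
#   PeriodicTask(60, scavenge_dead_tasks)   # collect dead tasks

# tiny demo of the sketch itself
calls = []
demo = PeriodicTask(0.02, lambda: calls.append(1))
demo.start()
time.sleep(0.1)
demo.stop()
```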

@vdbergh vdbergh force-pushed the split_tasks branch 2 times, most recently from 4abbdad to ab51f90 Compare June 1, 2024 09:36
@ppigazzini ppigazzini added enhancement server server side changes bug and removed bug labels Jun 1, 2024
@ppigazzini
Collaborator

ppigazzini commented Jun 1, 2024

Rebased on master after merging #2047 to fix the conflict.

@ppigazzini
Collaborator

Looks good on DEV, testing on PROD now.

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

It seems dead tasks are no longer scavenged.

Unfortunately I will be afk till tonight.

@ppigazzini
Collaborator

ppigazzini commented Jun 1, 2024

The last dead tasks in the log 3 hours ago

fishtest@tests:~$ sudo journalctl -u fishtest@6543 --since "3 hours ago" | grep dead
Jun 01 08:36:23 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a8e00062b2c3cf814fe8d task_id: 789 worker: AndreasKrug-1cores-ea30eef7-1aaf
Jun 01 08:36:35 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a7eea062b2c3cf814fe40 task_id: 1629 worker: technologov-28cores-r314-7a2e
Jun 01 08:37:02 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9821062b2c3cf814fed8 task_id: 1415 worker: technologov-28cores-r306-e3ea
Jun 01 08:37:04 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9f87062b2c3cf814ff0e task_id: 1313 worker: technologov-28cores-r309-5bed
Jun 01 08:37:27 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665ada260223e235f05b7ce3 task_id: 18 worker: technologov-28cores-r308-72c0
Jun 01 08:37:36 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa743062b2c3cf814ff47 task_id: 2126 worker: technologov-28cores-r307-02fe
Jun 01 08:37:37 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9f87062b2c3cf814ff0e task_id: 1091 worker: technologov-28cores-r302-0374
Jun 01 08:37:37 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9f87062b2c3cf814ff0e task_id: 1553 worker: technologov-28cores-r303-41a1
Jun 01 08:37:38 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa4e1062b2c3cf814ff2c task_id: 1147 worker: technologov-28cores-r301-8b0f
Jun 01 08:37:54 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa4e1062b2c3cf814ff2c task_id: 1502 worker: technologov-28cores-r304-5a01
Jun 01 08:37:55 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665accd50223e235f05b7ca3 task_id: 487 worker: technologov-28cores-r300-1cab
Jun 01 08:38:29 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a7eea062b2c3cf814fe40 task_id: 901 worker: technologov-28cores-r305-855d
Jun 01 08:38:29 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a7eea062b2c3cf814fe40 task_id: 2027 worker: technologov-28cores-r310-cc32
Jun 01 08:39:01 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a9ec2062b2c3cf814ff02 task_id: 600 worker: tolkki963-2cores-8b0a6e2c-0a2a
Jun 01 08:39:06 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a8e00062b2c3cf814fe8d task_id: 921 worker: tolkki963-2cores-5697b53c-fe0d
Jun 01 08:40:44 tests.stockfishchess.org pserve[15218]: dead task: run: https://tests.stockfishchess.org/tests/view/665a8e00062b2c3cf814fe8d task_id: 824 worker: tolkki963-2cores-dd850aff-7edf
fishtest@tests:~$ sudo journalctl -u fishtest@6543 --since "2 hours ago" | grep dead

@ppigazzini
Collaborator

It seems that the scavenge of dead tasks was working with #2045 and #2046 but stopped with #2047

@ppigazzini
Collaborator

ppigazzini commented Jun 1, 2024

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

Are there any exceptions in the server log?

@ppigazzini
Collaborator

No

@ppigazzini
Collaborator

I'm reloading the DB on DEV to have many dead tasks to purge :)

@ppigazzini
Collaborator

Dead tasks purge on DEV looks fine with master and this PR after reloading the DB

Jun 01 13:48:55 dfts-0 systemd[1]: Stopping Fishtest Server port 6543...
Jun 01 13:48:55 dfts-0 pserve[5465]: flush
Jun 01 13:48:55 dfts-0 pserve[5465]: done
Jun 01 13:49:00 dfts-0 systemd[1]: Stopped Fishtest Server port 6543.
Jun 01 13:49:34 dfts-0 systemd[1]: Started Fishtest Server port 6543.
Jun 01 13:49:37 dfts-0 pserve[6303]: Starting server in PID 6303.
Jun 01 13:50:07 dfts-0 pserve[6303]: Request_task: no useful cached runs left
Jun 01 13:50:07 dfts-0 pserve[6303]: Request_task: refresh queue
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 264 worker: technologov-28cores-r301-8b0f
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 334 worker: vdv-7cores-473c6083-3c48*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 335 worker: vdv-7cores-80cd7727-9c82*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 336 worker: vdv-7cores-fa3ade53-c640*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 337 worker: vdv-7cores-09497247-a19f*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 338 worker: vdv-7cores-99126dee-bd34*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 339 worker: vdv-7cores-3c7a73a0-b78b*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 340 worker: vdv-7cores-1c7e40a2-852e*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 341 worker: vdv-7cores-d8fc0606-d410*
Jun 01 13:50:34 dfts-0 pserve[6303]: dead task: run: https://tests.stockfishchess.org/tests/view/665aa925062b2c3cf814ff4d task_id: 342 worker: vdv-7cores-c3bfdbb7-45fb*

@ppigazzini
Collaborator

ppigazzini commented Jun 1, 2024

Dead tasks purge seems fine on PROD as well, there are new entries:

fishtest@tests:~$ sudo journalctl -u fishtest@6543 --since "2 hours ago" | grep dead
Jun 01 11:21:43 tests.stockfishchess.org pserve[17770]: dead task: run: https://tests.stockfishchess.org/tests/view/665af82c69d88d67b62fd9a7 task_id: 159 worker: technologov-28cores-r310-b4cc
Jun 01 11:23:49 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665afdbf45962393aa9c1c6b task_id: 6 worker: vdbergh-1cores-22f29ecf-9f53
Jun 01 11:36:49 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ad97f0223e235f05b7ccc task_id: 1113 worker: technologov-28cores-r306-7901
Jun 01 11:51:51 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ae6320223e235f05b7d14 task_id: 469 worker: technologov-28cores-r303-3b39
Jun 01 11:51:51 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ae6320223e235f05b7d14 task_id: 1004 worker: technologov-28cores-r300-fba5
Jun 01 11:51:51 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ae6320223e235f05b7d14 task_id: 1024 worker: technologov-28cores-r307-9032
Jun 01 11:51:51 tests.stockfishchess.org pserve[19045]: dead task: run: https://tests.stockfishchess.org/tests/view/665ae6320223e235f05b7d14 task_id: 1034 worker: technologov-28cores-r309-aaac

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

Are the validated random run messages still in the logs? Just checking that the scheduler had not stopped.

@ppigazzini
Collaborator

ppigazzini commented Jun 1, 2024

fishtest@tests:~$ sudo journalctl -u fishtest@6543 --since "2 hours ago" | grep validate
Jun 01 09:58:05 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665a8e00062b2c3cf814fe8d...
Jun 01 10:01:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665aa766062b2c3cf814ff49...
Jun 01 10:04:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665aeee00223e235f05b7d21...
Jun 01 10:07:05 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665a9ec2062b2c3cf814ff02...
Jun 01 10:10:10 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:13:11 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665a7eea062b2c3cf814fe40...
Jun 01 10:16:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665aa766062b2c3cf814ff49...
Jun 01 10:19:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ae4010223e235f05b7d0b...
Jun 01 10:21:55 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:25:01 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:28:10 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ae6320223e235f05b7d14...
Jun 01 10:31:06 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ada260223e235f05b7ce3...
Jun 01 10:34:09 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ae4010223e235f05b7d0b...
Jun 01 10:37:01 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:40:02 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:43:11 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665af1db69d88d67b62fd9a1...
Jun 01 10:46:20 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ae7660223e235f05b7d17...
Jun 01 10:49:15 tests.stockfishchess.org pserve[17121]: Validate_random_run: validated aggregated data in cache run 665ab2b2062b2c3cf814ff8a...
Jun 01 10:53:58 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665a9ec2062b2c3cf814ff02...
Jun 01 10:57:03 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665af82c69d88d67b62fd9a7...
Jun 01 11:00:11 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665ae2d90223e235f05b7d04...
Jun 01 11:03:11 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665ae6320223e235f05b7d14...
Jun 01 11:06:11 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665afdbf45962393aa9c1c6b...
Jun 01 11:09:15 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665ab2b2062b2c3cf814ff8a...
Jun 01 11:12:13 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665a9821062b2c3cf814fed8...
Jun 01 11:15:10 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665a64b0062b2c3cf814fdab...
Jun 01 11:18:17 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665af82c69d88d67b62fd9a7...
Jun 01 11:21:26 tests.stockfishchess.org pserve[17770]: Validate_random_run: validated aggregated data in cache run 665a9ec2062b2c3cf814ff02...
Jun 01 11:24:51 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665b04d38da109e362924e2a...
Jun 01 11:27:44 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ab2b2062b2c3cf814ff8a...
Jun 01 11:30:36 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ae2d90223e235f05b7d04...
Jun 01 11:33:41 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665b00cd45962393aa9c1c75...
Jun 01 11:36:43 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ae2d90223e235f05b7d04...
Jun 01 11:39:44 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ad97f0223e235f05b7ccc...
Jun 01 11:42:42 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665ae1a30223e235f05b7cfd...
Jun 01 11:45:39 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665a8e00062b2c3cf814fe8d...
Jun 01 11:48:44 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665a8e00062b2c3cf814fe8d...
Jun 01 11:51:41 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665af82c69d88d67b62fd9a7...
Jun 01 11:54:39 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665b04f78da109e362924e2d...
Jun 01 11:57:31 tests.stockfishchess.org pserve[19045]: Validate_random_run: validated aggregated data in cache run 665a8e00062b2c3cf814fe8d...

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

Hmm... seems normal...

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

Dead task collection seems to be working now...

@ppigazzini
Collaborator

I'm waiting for the DB backup to finish before stopping fishtest for 10 minutes to tidy up the runs collection; that should create many dead tasks.

(Outdated review comment on server/fishtest/rundb.py — resolved)
@vondele
Member

vondele commented Jun 1, 2024

I've done a sampling under the current load, and I have this profile:

===============================================================
Collecting 600 samples for pserve 6543 spaced 0.5 seconds apart
Total time needed 300 seconds 
Found pid 22791 for pserve 6543
Showing 10 most frequent samples traces: 
217 
     File "util.py", line 749, in __next_schedule self.__event.wait(delay) 
16 
     File "api.py", line 237, in update_task self.validate_request() 
          File "api.py", line 129, in validate_request run = self.request.rundb.get_run(run_id) 
               File "rundb.py", line 417, in get_run with self.run_cache_lock: 
13 
     File "util.py", line 751, in __next_schedule next_task.do_work() 
          File "util.py", line 634, in do_work self.worker(*self.args, *self.kwargs) 
               File "rundb.py", line 497, in flush_buffers self.runs.replace_one({"_id": oldest_run["_id"]}, oldest_run) 
12 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 802, in request_task with self.request_task_lock: 
11 
     File "api.py", line 308, in beat self.validate_request() 
          File "api.py", line 129, in validate_request run = self.request.rundb.get_run(run_id) 
               File "rundb.py", line 417, in get_run with self.run_cache_lock: 
7 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 803, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 916, in sync_request_task task_name = worker_name(task["worker_info"], short=True) 
                    File "util.py", line 47, in worker_name name = "{}-{}cores".format(username, cores) 
6 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 803, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 905, in sync_request_task (run, task_id, task) 
5 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 803, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 916, in sync_request_task task_name = worker_name(task["worker_info"], short=True) 
                    File "util.py", line 41, in worker_name def worker_name(worker_info, short=False): 
4 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 803, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 947, in sync_request_task if connections >= connections_limit: 
4 
     File "api.py", line 221, in request_task result = self.request.rundb.request_task(worker_info) 
          File "rundb.py", line 803, in request_task return self.sync_request_task(worker_info) 
               File "rundb.py", line 916, in sync_request_task task_name = worker_name(task["worker_info"], short=True) 
                    File "util.py", line 44, in worker_name cores = str(worker_info["concurrency"])

One of the questions I have is how long the 'flush_buffers' call actually takes right now, to understand whether this might be a significant part of the server load and whether we should pick a schedule different from 1/s. Would it be possible to instrument the server to print the time taken by exactly that function?

@ppigazzini
Collaborator

ppigazzini commented Jun 1, 2024

done

diff --git a/server/fishtest/rundb.py b/server/fishtest/rundb.py
index 2e9814b2..007f4311 100644
--- a/server/fishtest/rundb.py
+++ b/server/fishtest/rundb.py
@@ -481,6 +481,7 @@ class RunDb:
     # For documentation of the cache format see "cache_schema" in schemas.py.

     def flush_buffers(self):
+        now = time.time()
         oldest_entry = None
         old = float("inf")
         with self.run_cache_lock:
@@ -495,6 +496,7 @@ class RunDb:
                 oldest_entry["last_sync_time"] = time.time()
                 with self.run_cache_write_lock:
                     self.runs.replace_one({"_id": oldest_run["_id"]}, oldest_run)
+        print(f"flush_buffers: {time.time() - now}", flush=True)

     def clean_cache(self):
         now = time.time()
Jun 01 19:22:56 tests.stockfishchess.org systemd[1]: Started Fishtest Server port 6543.
Jun 01 19:22:59 tests.stockfishchess.org pserve[25750]: Starting server in PID 25750.
Jun 01 19:22:59 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.1885673999786377
Jun 01 19:23:00 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.01155710220336914
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.02557539939880371
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: no useful cached runs left
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: refresh queue
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:01 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:02 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.031136274337768555
Jun 01 19:23:03 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.020483016967773438
Jun 01 19:23:03 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:04 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.17199039459228516
Jun 01 19:23:05 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.016495466232299805
Jun 01 19:23:05 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:05 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:06 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.013139009475708008
Jun 01 19:23:07 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.007906436920166016
Jun 01 19:23:08 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.020139694213867188
Jun 01 19:23:09 tests.stockfishchess.org pserve[25750]: Request_task: the server is currently too busy...
Jun 01 19:23:09 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.0214385986328125
Jun 01 19:23:10 tests.stockfishchess.org pserve[25750]: flush_buffers: 0.06426382064819336

@vondele
Member

vondele commented Jun 1, 2024

OK, so probably 0.02s regular, and 0.17s if waiting on a lock from request_task. So every second is kind of reasonable.

@ppigazzini
Collaborator

I added the missing s, flush_buffers

@vondele
Member

vondele commented Jun 1, 2024

Little histogram

(two histogram images of the flush_buffers timings)

it actually has a bit of a tail.

Actual tail:

$ sudo journalctl -u fishtest@6543 --since "30 minutes ago" | grep flush_b | awk '{print $NF}'  | sort -n | tail
0.3540463447570801
0.3559889793395996
0.3574235439300537
0.376375675201416
0.38812804222106934
0.3957796096801758
0.4064204692840576
0.44736409187316895
0.5016348361968994
0.6342136859893799

I would be a bit worried if the tail were to exceed 1s; in that case we probably have a new flush_buffers scheduled before the old one finishes. This is actually where I think it would make sense to schedule the next call 1s after the previous one finishes, rather than every 1s. Alternatively, we extend the window a little, e.g. to 2 or 5s.

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

I would be a bit worried if the tail would exceed 1s, in that case we have probably a new flush_buffers scheduled before the old one finishes.

The new flush_buffers would actually be executed right after the previous one, not in parallel with it.

It would be simple though to optionally not count the execution time of the task during scheduling. I was looking for a good name for that option.

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

@ppigazzini

I have been afk for the better part of the day.

Are you saying that in fact there is no problem with the collection of dead tasks in the PR?

@ppigazzini
Collaborator

From logs and instrumented code the scavenge of dead tasks seems fine.

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

I created this dead task with a worker https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646?show_task=902 (at the time of posting, last update 10min ago).

The dead task appears not to be collected.

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

I think the PR has a race condition (which you also pointed out): setting is_changed=True and then making the change outside the lock.

I don't see how to trigger this, but who knows...

The correct approach is to reacquire the lock, make the change, and then set is_changed to True.
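In code terms, the difference between the racy and the corrected pattern looks roughly like this (illustrative names only — `buffer_racy`, `buffer_fixed`, and the entry layout are assumptions for the sketch, not the exact rundb.py code):

```python
import threading

run_cache_lock = threading.Lock()

def buffer_racy(entry, mutate):
    # Racy: the entry is flagged dirty inside the lock, but the actual
    # mutation happens after the lock is released. A concurrent sync may
    # run in between, writing data that does not yet contain the change.
    with run_cache_lock:
        entry["is_changed"] = True
    mutate(entry["run"])

def buffer_fixed(entry, mutate):
    # Fixed: reacquire the lock, apply the change, then set the flag,
    # so is_changed == True always implies the new data is in the cache.
    with run_cache_lock:
        mutate(entry["run"])
        entry["is_changed"] = True

# tiny demo of the fixed pattern
entry = {"run": {"results": 0}, "is_changed": False}
buffer_fixed(entry, lambda run: run.update(results=1))
```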

@ppigazzini
Collaborator

ppigazzini commented Jun 1, 2024

I analyzed a snippet of log and the number of dead tasks deleted was equal to the internal count (before the release of the lock). But it seems that sometimes the code can be racy.

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

I will update the PR, removing the race condition.

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

My dead task was finally collected https://tests.stockfishchess.org/actions?max_actions=1&action=dead_task&user=&text=vdbergh&before=1717274816.783141&run_id=

It took much longer than 6min though. Perhaps this is an artifact of the high server load?

@ppigazzini
Collaborator

It was scavenged in this block:

Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: scavenge_dead_tasks: start
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: scavenge_dead_tasks: get lock
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: run_cache count: 46
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: scavenge_dead_tasks: release lock
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead_tasks_count=524
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 252 worker: vdv-7cores-e6adc6b1-5163*
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 254 worker: vdv-7cores-5e50707f-c4d2*
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 257 worker: vdv-7cores-4b019af9-c187*
Jun 01 20:46:55 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 259 worker: vdv-7cores-1456a905-8a5a*
...
Jun 01 20:46:56 tests.stockfishchess.org pserve[25854]: dead task: run: https://tests.stockfishchess.org/tests/view/665b77666658f2adece78646 task_id: 902 worker: vdbergh-1cores-22f29ecf-acb7

- Remove a race condition

- Do not require a cache entry to be clean before scavenging.

Under heavy load clean cache entries will be a rarity.
@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

I think I understand what the problem is. I was only scavenging clean cache entries. However under heavy load a cache entry will almost never be clean. So it will not be seen by the scavenging task.

Fixed in the last push (I also removed the race condition).

@ppigazzini
Collaborator

PROD updated with 6487461

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

Problem fixed it seems :)

https://tests.stockfishchess.org/actions?max_actions=1&action=dead_task&user=&text=&before=1717277250.57244&run_id=

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

It would be simple though to optionally not count the execution time of the task during scheduling. I was looking for a good name for that option.

I think I will call it stretch (i.e. stretch the period by the duration of the task). It would be a useful option for tasks that are scheduled with a short period.
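A minimal sketch of what such a stretch option could look like (hypothetical `run_periodic` helper; the real scheduler lives in util.py):

```python
import time

def run_periodic(worker, period, stretch=False, iterations=3):
    """With stretch=False the next run is due `period` seconds after the
    previous run *started*; with stretch=True it is due `period` seconds
    after the previous run *finished*, so a task slower than its period
    can never immediately re-trigger itself."""
    next_time = time.time() + period
    for _ in range(iterations):
        time.sleep(max(0.0, next_time - time.time()))
        start = time.time()
        worker()
        end = time.time()
        # stretch: do not count the task's own duration against the period
        next_time = (end if stretch else start) + period

# tiny demo with a short period
calls = []
run_periodic(lambda: calls.append(time.time()), 0.01, stretch=True)
```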

@vondele
Member

vondele commented Jun 1, 2024

It would be simple though to optionally not count the execution time of the task during scheduling. I was looking for a good name for that option.

I think I will call it stretch (i.e. stretch the period with the duration of the task). It would be a useful option for tasks that are scheduled with a short period.

if it is a bool, 'guaranteeWait' or 'enforceGap' or similar.

@vdbergh
Contributor Author

vdbergh commented Jun 1, 2024

@vondele Could you perhaps also calculate the average duration of flush_buffers()?
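For reference, the average can be reproduced from the instrumented log lines with something like the following (the "flush_buffers: <seconds>" field layout is assumed from the journalctl output earlier in the thread):

```python
import statistics

def average_duration(lines):
    # Pick the instrumented "flush_buffers: <seconds>" lines and average
    # the trailing float field.
    durations = [
        float(line.rsplit(None, 1)[-1])
        for line in lines
        if "flush_buffers:" in line
    ]
    return statistics.mean(durations)

sample = [
    "pserve[25750]: flush_buffers: 0.25",
    "pserve[25750]: flush_buffers: 0.75",
]
average_duration(sample)  # → 0.5
```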

@vondele
Member

vondele commented Jun 1, 2024

# total number of data points = 1799
# minimum of the data = 0.003052
# maximum of the data = 0.634214
# average of the data = 0.063830
# Binned median  = 0.045000
# Binned average = 0.063883
# Binned mode    = 0.035000
# binned variance   = 0.004938
# binned sigma      = 0.070271
# binned MAD        = 0.042839
# binned skew   = 2.963915
# binned kurt   = 10.158041
# general moments = 0.063830 0.009021 0.002240 0.000723 0.000271 0.000115 0.000054 0.000027 0.000015 0.000008 
# central moments = 0.000000 0.004947 0.001032 0.000322 0.000113 0.000045 0.000020 0.000010 0.000005 0.000003 
# invariant moments = 0.063830 0.004947 0.001032 0.000248 0.000062 0.000014 0.000002 -0.000001 -0.000001 -0.000001 

@ppigazzini
Collaborator

ppigazzini commented Jun 2, 2024

Some data with 10k cores, based on 1969 measurements. It appears that the tail is not dependent on waiting for the lock.

diff --git a/server/fishtest/rundb.py b/server/fishtest/rundb.py
index 3097575b..e22dff5f 100644
--- a/server/fishtest/rundb.py
+++ b/server/fishtest/rundb.py
@@ -481,9 +481,11 @@ class RunDb:
     # For documentation of the cache format see "cache_schema" in schemas.py.

     def flush_buffers(self):
+        t_start = time.time()
         oldest_entry = None
         old = float("inf")
         with self.run_cache_lock:
+            t_lock = time.time()
             for cache_entry in self.run_cache.values():
                 run = cache_entry["run"]
                 if cache_entry["is_changed"] and cache_entry["last_sync_time"] < old:
@@ -495,6 +497,10 @@ class RunDb:
                 oldest_entry["last_sync_time"] = time.time()
                 with self.run_cache_write_lock:
                     self.runs.replace_one({"_id": oldest_run["_id"]}, oldest_run)
+        t_stop = time.time()
+        print(f"flush_buffers: total {t_stop - t_start}", flush=True)
+        print(f"flush_buffers: wait {t_lock - t_start}", flush=True)
+        print(f"flush_buffers: with {t_stop - t_lock}", flush=True)
          total        wait         with
min       0.001279116  2.38419E-06  0.001270294
max       0.282125473  0.011865139  0.282114506
avg       0.018602283  1.91109E-05  0.018583172
stdev.p   0.034925572  0.000275271  0.034926888
median    0.007624388  1.04904E-05  0.00760293

(three histogram images: total, wait, with)

@ppigazzini ppigazzini (Collaborator) left a comment

Looks good on PROD

Thank you @vdbergh :)

@ppigazzini ppigazzini merged commit 0e0f1c9 into official-stockfish:master Jun 2, 2024
19 checks passed
@vdbergh
Contributor Author

vdbergh commented Jun 2, 2024

Thanks for the stats. From the average duration, it appears that flush_buffers() does not load the system very much.

Also I guess that almost all time is spent writing to the db, during which the GIL is released.

@vondele
Member

vondele commented Jun 2, 2024

Trying with some heavier load on the system right now. I think this already improved the server, as the loops over the tasks in the scavenge function are now executed only every 60s.

@ppigazzini
Collaborator

Code instrumented to view the waiting time for the lock under high load.

@vondele
Member

vondele commented Jun 2, 2024

current stats for local workers:

request_spsa 1.25719
request_task 136.976
request_version 0.208234
update_task 3.55331
upload_pgn 32.1143

@ppigazzini
Collaborator

ppigazzini commented Jun 2, 2024

last 1800 timings taken under high load (75k cores)

          total        wait
min       0.00143981   2.14577E-06
max       0.707242489  0.144560814
avg       0.06804863   0.000234107
stdev.p   0.066657078  0.004019203
median    0.048844218  7.15256E-06

(two histogram images: total, wait)

@ppigazzini
Collaborator

ppigazzini commented Jun 2, 2024

last 3600 timings with high load (75k cores); the max total time passed the 1 second threshold

          total        wait
min       0.00143981   2.14577E-06
max       1.01291585   0.325762033
avg       0.07342107   0.00028337
stdev.p   0.079427364  0.006401669
median    0.049941778  7.15256E-06

(two histogram images: total, wait)
