Proof of concept implementation of get_run via api (RFC). #1993

vdbergh · 2024-05-08T05:15:11Z

With this PR /api/serialize_run hosted by the main instance can be invoked by a secondary instance to receive a copy of the run from the cache.

In this way there is never an issue that the run might not be up to date.

Sadly I don't know how feasible this is. A typical serialized run is 5.5k. Is this too much?

Note: /api/serialize_run restricts access to localhost over http. Some form of access control is necessary but it can also be done externally.

I have tested that the PR works. The deserialized runs pass validation (for this I needed to supply the tz_aware codec option to bson.decode()).

vdbergh · 2024-05-08T14:11:07Z

Serializing of a run with 500 tasks only takes a couple of ms, but I am worried the middleware will wreak havoc with the timings.

It would be nice to have a more direct mechanism for communicating between instances than going through nginx + middleware...

server/fishtest/rundb.py

ppigazzini · 2024-05-08T15:36:58Z

It would be nice to have a more direct mechanism for communicating between instances than going through nginx + middleware...

redis or memcached, but it's another tool to setup and maintain.

vdbergh · 2024-05-08T15:45:16Z

I was more thinking of something basic like shared memory.

vdbergh · 2024-05-08T15:57:17Z

Instead of returning the serialized run in the body, the primary instance could write it to shared memory where the secondary instance would find it. But it is not so obvious what is the cleanest way to make this thread safe.

What you would actually want is an in-memory mongodb collection (i.e. a collection which is never written to disk).

vdbergh · 2024-05-08T16:13:20Z

Another option is a named pipe. The primary instance creates it, writes the serialized run to it and communicates the name to the secondary instance. The secondary instance reads it, and then removes it. Not sure this is all possible.

vdbergh · 2024-05-08T16:14:23Z

Of course we may just as well use a real temporary file for this then.

ppigazzini · 2024-05-08T16:40:59Z

Right now PROD is able to withstand 60k cores or 100k cores with #1983 (the usual contribution is 3-5k cores).
Im fine with a clean/standard solution able to scale, not with complex code useful only in very rarely occasions.

ppigazzini · 2024-05-08T16:48:04Z

snippet of nginx config

server {
    listen 80;
    listen [::]:80;

    server_name dfts-0.pigazzini.it;

    location / {
        return 301 https://$host/tests;
    }

    location /api/ {
        proxy_set_header Connection "";
        proxy_set_header X-Forwarded-Proto  $scheme;
        proxy_set_header X-Forwarded-For    $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Host   $host:$server_port;
        proxy_set_header X-Forwarded-Port   $server_port;
        proxy_set_header X-Country-Code     $region;

        client_max_body_size        120m;
        client_body_buffer_size     128k;
        proxy_connect_timeout       5s;
        proxy_send_timeout          30s;
        proxy_read_timeout          30s;
        proxy_max_temp_file_size    0;
        proxy_redirect              off;
        proxy_http_version          1.1;

        proxy_pass http://$backends;

        location /api/serialize_run/ {
            allow 127.0.0.1;
            deny all;
            proxy_pass http://$backends;
        }
    }
}


server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2 ipv6only=on;

... 

    location / {
        rewrite ^/$ /tests permanent;

...

        proxy_pass http://$backends;

        location /api/serialize_run/ {
            deny all;
        }
    }
}

vdbergh · 2024-05-08T17:32:40Z

Well #1983 is something else. This PR is because in #1985 and #1992 the run read from the database may be missing some tasks compared to what's in the cache. This could lead to confusing behavior. This PR seems conceptually to be the most elegant way to fix that, but I am not sure if it scales...

EDIT: Note that this PR may be optimized a little bit by skipping the base64 encoding and just returning the gzipped data as application/octet-stream (or perhaps application/gzip) in the body with a Response object.

ppigazzini · 2024-05-08T17:45:00Z

I was talking about named pipe, shared memory, not this PR.

I'm not a fan to add another tool, hitting the rough corner, but redis is the standard tool to cache in memory the db.
Setup:

curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt update
sudo apt -y install redis

sudo systemctl status redis-server

On python side something like this:

# pip install redis

import redis
import json

r = redis.Redis(host='localhost', port=6379, db=0)

def get_run(self, r_id):
    r_id = str(r_id)
    run_cache_item = r.get(r_id)
    if run_cache_item is not None:
        run_cache_item = json.loads(run_cache_item)
        run_cache_item["rtime"] = time.time()
        r.set(r_id, json.dumps(run_cache_item))
        return run_cache_item["run"]
    run = self.runs.find_one({"_id": ObjectId(r_id)})
    if run is not None:
        run_cache_item = {
            "rtime": time.time(),
            "ftime": time.time(),
            "run": run,
            "dirty": False,
        }
        r.set(r_id, json.dumps(run_cache_item))
    return run

        
def buffer(self, run, flush):
    r_id = str(run["_id"])
    if flush:
        run_cache_item = {
            "dirty": False,
            "rtime": time.time(),
            "ftime": time.time(),
            "run": run,
        }
        r.set(r_id, json.dumps(run_cache_item))
        self.runs.replace_one({"_id": ObjectId(r_id)}, run)
    else:
        run_cache_item = r.get(r_id)
        if run_cache_item is not None:
            run_cache_item = json.loads(run_cache_item)
            ftime = run_cache_item["ftime"]
        else:
            ftime = time.time()
        run_cache_item = {
            "dirty": True,
            "rtime": time.time(),
            "ftime": ftime,
            "run": run,
        }
        r.set(r_id, json.dumps(run_cache_item))

But datetime and id must be converted before the dump.

run["start_time"] = run["start_time"].isoformat()
run["_id"] = str(run["_id"])

vdbergh · 2024-05-08T17:48:43Z

return self.runs.find_one({"_id": ObjectId(r_id)})

I don't understand. This still seems to return the run from the db, which is not what you want...

ppigazzini · 2024-05-08T17:50:07Z

Written on the fly, code fixed.

vdbergh · 2024-05-08T17:58:01Z

But datetime and id must be converted before the dump.

One can use bson instead of json. bson doesn't have this issue.

vdbergh · 2024-05-09T03:45:16Z

@ppigazzini Your code seems to be much more ambitious than what I am trying to do in this PR.

In this PR I want to make it possible for a secondary Fishtest instance ("the client") to ask for a copy of a run in the cache of the primary instance (e.g to have a bulletproof get_run()). I was musing about how to get this copy to the client as quickly as possible (perhaps this could be done via redis). I am not sure however if it is necessary to do anything. Perhaps this PR is fast enough as it is (I am going to optimize it a bit more by skipping the base64 encoding).
IUC, in your code you want to replace the Fishtest cache by a redis cache. This requires a lot of thought. For example it seems to me that a naive implementation would lead to lots of unnecessary serializations/deserializations.

vdbergh · 2024-05-09T04:58:40Z

In the latest push I skip the base64 encoding. This makes the payload smaller.

EDIT: I should perhaps point out that runs compress extremely well (obviously because there is so much redundancy). I did an experiment with a typical run (500 tasks, 50 bad tasks and 101 SPSA history entries) and the compression ratio was 77:1. I.e. 425k vs 5.5k.

ppigazzini · 2024-05-09T17:17:21Z

In nginx config we set at http level few rate limits:

per host
per URI
per IP (here we have a whitelist setup)

Also with the cookie module with a js challenge

Those rate limits should apply to this API as well. I must replicate the filtering on DEV to verify the hypothesis and test a new configuration to whitelist localhost.

BTW: DEV running with the PR.

vdbergh · 2024-05-09T17:22:52Z

From api.log

2024-05-09 17:12:01.645786+00:00 :   5.52 ms (s)  1190.12 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-05-09 17:14:10.054970+00:00 :   6.18 ms (s)  1191.21 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-05-09 17:16:19.091182+00:00 :   5.40 ms (s)  2105.73 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-05-09 17:18:28.921216+00:00 :   6.91 ms (s)  1098.78 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-05-09 17:20:32.112779+00:00 :   7.16 ms (s)  1096.62 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn

vdbergh · 2024-05-09T19:36:39Z

On PROD

2024-05-09 18:07:24.438469+00:00 :  21.81 ms (s)  1698.87 ms (w)  https://tests.stockfishchess.org:443/api/upload_pgn
2024-05-09 18:29:16.939441+00:00 :   6.34 ms (s)  1341.83 ms (w)  https://tests.stockfishchess.org:443/api/upload_pgn
2024-05-09 18:50:43.001719+00:00 :  56.01 ms (s)  1253.63 ms (w)  https://tests.stockfishchess.org:443/api/upload_pgn
2024-05-09 19:11:44.861195+00:00 :  21.35 ms (s)  1322.71 ms (w)  https://tests.stockfishchess.org:443/api/upload_pgn
2024-05-09 19:24:29.873764+00:00 :  37.82 ms (s)   911.89 ms (w)  https://tests.stockfishchess.org:443/api/upload_pgn

ppigazzini · 2024-06-25T12:38:34Z

PROD nginx is running without limits in place right now.

vdbergh · 2024-06-25T12:45:09Z

You mean this PR could work if I rebase it?

ppigazzini · 2024-06-25T12:47:59Z

It worked on DEV, so I expect that it should work on PROD as well now with a simpler nginx configuration.

vdbergh · 2024-06-25T12:48:47Z

Ok tonight I will have a look...

vdbergh · 2024-06-25T16:51:04Z

Rebased. I hope I did not mess up too much.

ppigazzini · 2024-06-28T16:42:19Z

DEV running with the PR, logs clean right now.

vdbergh · 2024-06-28T20:04:52Z

It seems more expensive. This an extract of api.log

2024-06-12 11:16:49.812611+00:00 :   5.78 ms (s)   228.43 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:18:20.175159+00:00 :   7.17 ms (s)   231.59 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:19:46.106092+00:00 :   5.94 ms (s)   224.95 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:21:16.060719+00:00 :   6.54 ms (s)   227.58 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:22:40.993474+00:00 :   6.10 ms (s)   231.74 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:24:10.936604+00:00 :  13.73 ms (s)   243.28 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:25:32.842110+00:00 :   5.51 ms (s)   231.13 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:26:54.913908+00:00 :   6.24 ms (s)   307.63 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:28:21.854786+00:00 :   6.09 ms (s)   228.63 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:29:08.940692+00:00 :   6.01 ms (s)   199.73 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 09:25:29.849932+00:00 : 223.98 ms (s)   383.87 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 10:00:26.724927+00:00 :   2.41 ms (s)   165.89 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 16:40:18.892302+00:00 :   3.21 ms (s)   202.10 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 16:52:32.129110+00:00 :   2.22 ms (s)   208.15 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 16:59:56.952025+00:00 :   1.75 ms (s)   193.12 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 19:29:59.789806+00:00 :   2.35 ms (s)   219.05 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 19:42:08.685836+00:00 :   1.86 ms (s)   198.28 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:44:24.270762+00:00 :  25.50 ms (s)   334.55 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:46:52.579195+00:00 :  16.50 ms (s)   310.05 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:49:06.613025+00:00 :  15.87 ms (s)   337.55 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:51:35.929257+00:00 :  16.62 ms (s)   316.70 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:53:47.925970+00:00 :  13.78 ms (s)   310.23 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:55:49.681393+00:00 :  16.99 ms (s)   322.44 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:57:42.453471+00:00 :  15.22 ms (s)   312.66 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:59:38.246520+00:00 :  12.62 ms (s)   306.52 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn

ppigazzini · 2024-06-28T20:09:10Z

PROD has been running with the PR for some time now.

Timings before the change:

# logging from [28/Jun/2024:17:36:27 +0000] to [28/Jun/2024:18:51:30 +0000] ########
# duration (seconds)            : 4503
# calls in total                : 10000
# calls per second              : 2
# calls not reaching the backend: 42
# calls handled by the backend  : 9958
#                         route    calls      total    average    minimum    maximum
                         /tests      251     84.421      0.336      0.008      0.469
                   /tests/user/       99     22.129      0.224      0.035      0.275
               /api/update_task     4759     17.804      0.004      0.001      0.042
                   /api/actions     1350     12.130      0.009      0.002      0.115
                     /tests/run        8     10.708      1.339      0.490      2.408
                      /api/beat     2447      7.666      0.003      0.001      0.105
                /tests/finished       61      5.852      0.096      0.027      0.143
                /api/upload_pgn      153      3.825      0.025      0.008      0.109
               /api/active_runs      121      3.714      0.031      0.004      2.615
                       /actions        3      3.468      1.156      0.829      1.332
                  /api/get_elo/      122      3.414      0.028      0.008      0.048
                /tests/machines        9      1.931      0.215      0.178      0.250
               /tests/live_elo/       70      1.389      0.020      0.003      0.034
                   /tests/view/       50      1.307      0.026      0.013      0.054
                  /contributors        4      1.037      0.259      0.130      0.377
              /api/request_task      173      0.775      0.004      0.002      0.029
                  /tests/tasks/       11      0.593      0.054      0.022      0.190
                           /nns       19      0.430      0.023      0.009      0.121
           /api/request_version      181      0.290      0.002      0.001      0.006
                       /api/nn/       20      0.108      0.005      0.003      0.006
              /api/request_spsa       35      0.101      0.003      0.002      0.004
                 /tests/approve        2      0.040      0.020      0.013      0.027
          /contributors/monthly        1      0.034      0.034      0.034      0.034
                  /tests/delete        1      0.025      0.025      0.025      0.025
                         /login        3      0.023      0.008      0.004      0.010
               /api/failed_task        2      0.018      0.009      0.008      0.010
                              /        1      0.009      0.009      0.009      0.009
                     /sprt_calc        1      0.008      0.008      0.008      0.008
                      /workers/        1      0.003      0.003      0.003      0.003

Timings after the change:

# logging from [28/Jun/2024:18:59:01 +0000] to [28/Jun/2024:20:08:07 +0000] ########
# duration (seconds)            : 4146
# calls in total                : 10000
# calls per second              : 2
# calls not reaching the backend: 37
# calls handled by the backend  : 9963
#                         route    calls      total    average    minimum    maximum
                         /tests      215     63.553      0.296      0.001      0.516
               /api/update_task     5047     19.137      0.004      0.001      0.041
                   /tests/user/       52     11.593      0.223      0.026      0.279
                   /api/actions     1135      9.883      0.009      0.002      0.050
                /api/upload_pgn      172      8.999      0.052      0.007      0.171
                      /api/beat     2369      7.177      0.003      0.001      0.031
                     /tests/run        4      5.402      1.351      0.627      2.269
                /tests/finished       45      4.634      0.103      0.026      0.133
                       /actions        7      4.419      0.631      0.080      0.947
            /api/serialize_run/      172      4.306      0.025      0.001      0.108
               /api/active_runs      106      3.344      0.032      0.004      2.456
                  /contributors        5      1.842      0.368      0.220      0.516
                /tests/machines        6      1.327      0.221      0.182      0.251
                  /api/get_elo/       54      1.321      0.024      0.008      0.045
                   /tests/view/       43      1.197      0.028      0.002      0.087
              /api/request_task      200      0.855      0.004      0.002      0.028
               /tests/live_elo/       33      0.462      0.014      0.002      0.027
           /api/request_version      228      0.366      0.002      0.001      0.004
                  /tests/tasks/        6      0.324      0.054      0.030      0.147
                      /api/pgn/        3      0.231      0.077      0.041      0.124
                           /nns       11      0.195      0.018      0.014      0.024
                              /        4      0.131      0.033      0.001      0.113
                       /api/nn/       22      0.116      0.005      0.004      0.006
              /api/request_spsa       20      0.055      0.003      0.002      0.004
                    /tests/stop        1      0.047      0.047      0.047      0.047
                 /tests/approve        2      0.027      0.013      0.013      0.014
               /api/failed_task        1      0.008      0.008      0.008      0.008

vdbergh · 2024-06-28T20:09:49Z

Probably for validation of the task id it is overkill to ask the primary instance for the whole run.

ppigazzini · 2024-06-28T20:10:22Z

By the way, logs are clean.

ppigazzini · 2024-06-28T20:17:10Z

Monitoring PROD I saw many nginx warnings for cache on disk during the pgn upload, so I raised the nginx buffer to lower the number of warnings:

-       client_body_buffer_size     128k;
+       client_body_buffer_size     1m;

2024/06/28 19:23:44 [warn] 94515#94515: *88 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000001, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 19:26:19 [warn] 94516#94516: *429 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000002, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 19:43:38 [warn] 94516#94516: *2606 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000003, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 19:50:55 [warn] 94515#94515: *3541 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000004, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 19:56:54 [warn] 94516#94516: *4245 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000005, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 20:00:08 [warn] 94516#94516: *4679 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000006, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 20:02:35 [warn] 94515#94515: *5023 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000007, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 20:09:42 [warn] 94515#94515: *5898 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000008, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"

vdbergh · 2024-06-28T20:19:27Z

Before & after

/api/upload_pgn 153 3.825 0.025 0.008 0.109
/api/upload_pgn 172 8.999 0.052 0.007 0.171

This confirms what I thought.

ppigazzini · 2024-06-29T08:58:23Z

8 hours timings with PR using the new nginx config (no cache on disk during the pgn upload).
I'm reverting to master to use the same nginx config.

# logging from [29/Jun/2024:00:00:26 +0000] to [29/Jun/2024:08:53:53 +0000] ########
# duration (seconds)            : 32007
# calls in total                : 100000
# calls per second              : 3
# calls not reaching the backend: 330
# calls handled by the backend  : 99670
#                         route    calls      total    average    minimum    maximum
                         /tests     1467    504.141      0.344      0.001      0.576
                   /tests/user/      656    149.462      0.228      0.013      0.991
               /api/update_task    35768    135.750      0.004      0.001      0.173
                   /api/actions     9269     95.834      0.010      0.001      0.142
                  /api/get_elo/     2563     73.265      0.029      0.007      0.134
                       /actions      102     62.721      0.615      0.038      3.352
                /api/upload_pgn     1160     60.732      0.052      0.008      0.238
                      /api/beat    16149     49.470      0.003      0.001      0.109
                 /api/run_pgns/        6     41.123      6.854      2.346     15.032
                  /contributors       46     34.405      0.748      0.000     23.490
                     /tests/run       44     29.028      0.660      0.001      2.626
                /tests/finished      280     28.911      0.103      0.018      0.181
            /api/serialize_run/     1160     27.212      0.023      0.001      0.111
                   /tests/view/      779     18.880      0.024      0.001      0.549
               /tests/live_elo/      889     17.377      0.020      0.001      0.294
                /tests/machines       37      7.921      0.214      0.000      0.276
               /api/active_runs      803      7.429      0.009      0.004      0.092
              /api/request_task     1353      6.145      0.005      0.002      0.131
                  /tests/tasks/      207      4.577      0.022      0.002      0.138
           /api/request_version     1428      2.366      0.002      0.000      0.032
                           /nns       81      1.695      0.021      0.007      0.100
               /user_management       16      1.577      0.099      0.000      0.222
              /api/request_spsa      410      1.213      0.003      0.001      0.038
                              /      137      0.787      0.006      0.001      0.014
                       /api/nn/      147      0.773      0.005      0.002      0.016
                  /tests/stats/       13      0.711      0.055      0.020      0.073
          /contributors/monthly       23      0.500      0.022      0.011      0.033
                         /login       43      0.329      0.008      0.001      0.013
                 /tests/approve       10      0.250      0.025      0.016      0.028
                      /workers/       10      0.158      0.016      0.003      0.060
                          /user        8      0.141      0.018      0.008      0.026
                        /signup        8      0.127      0.016      0.007      0.061
                         /user/        6      0.099      0.017      0.002      0.028
                  /tests/modify        4      0.087      0.022      0.010      0.035
               /api/failed_task        5      0.043      0.009      0.002      0.016
                    /tests/stop        2      0.041      0.021      0.014      0.027
                     /sprt_calc        2      0.022      0.011      0.008      0.014
             /api/finished_runs        2      0.018      0.009      0.006      0.012
                  /api/get_run/        1      0.014      0.014      0.014      0.014
                  /tests/delete        1      0.003      0.003      0.003      0.003

vdbergh · 2024-06-29T09:15:15Z

It seems the timings for upload_pgn remain the same (50ms).

I think the serialize_run api is good and may be useful. But for validation of run_id/task_id we don't need to pull in the full run. A much more restricted internal api should be used (we can pull in the run later if we actually need it).

I will try to cook something up.

server/fishtest/views.py

With this PR /api/serialize_run hosted by the main instance can be invoked by a secondary instance to receive a copy of the run from the cache. In this way there is never an issue that the run might not be up to date. Note: /api/serialize_run restricts access to localhost over http. Some form of access control is necessary.

ppigazzini · 2024-06-29T22:30:27Z

12 hours timings with master and updated nginx configuration (it solves the warnings in the log, but there is not an improvement on the average timing of the pgn upload)

# logging from [29/Jun/2024:10:08:36 +0000] to [29/Jun/2024:22:28:46 +0000] ########
# duration (seconds)            : 44410
# calls in total                : 100000
# calls per second              : 2
# calls not reaching the backend: 456
# calls handled by the backend  : 99544
#                         route    calls      total    average    minimum    maximum
                         /tests     3442   1156.082      0.336      0.000      0.898
               /api/update_task    47904    181.824      0.004      0.000      0.921
                   /tests/user/      694    163.920      0.236      0.023      0.645
                 /api/run_pgns/       12    152.795     12.733      0.768     30.307
                   /api/actions    12483    132.893      0.011      0.002      0.856
                      /api/beat    22386     71.249      0.003      0.000      1.799
                     /tests/run       55     61.134      1.112      0.002      2.685
                       /actions       90     58.371      0.649      0.031      2.754
                   /tests/view/      875     57.171      0.065      0.002      2.370
                /tests/finished      471     52.820      0.112      0.018      3.009
                  /api/get_elo/     1629     45.579      0.028      0.007      0.138
                /api/upload_pgn     1683     43.686      0.026      0.005      0.175
                /tests/machines      102     23.422      0.230      0.148      0.281
               /api/active_runs     1177     20.413      0.017      0.001      2.652
                  /contributors       59     16.381      0.278      0.000      0.725
               /tests/live_elo/      789     14.208      0.018      0.002      0.125
                  /tests/tasks/      169     10.483      0.062      0.001      0.216
              /api/request_task     1925      8.604      0.004      0.002      0.047
                       /api/nn/      646      3.432      0.005      0.000      0.066
           /api/request_version     2004      3.385      0.002      0.001      0.109
                           /nns      142      3.047      0.021      0.002      0.178
          /contributors/monthly       61      1.515      0.025      0.008      0.157
              /api/request_spsa      453      1.230      0.003      0.001      0.018
                              /      122      0.925      0.008      0.001      0.080
                  /tests/stats/       11      0.619      0.056      0.006      0.106
                 /tests/approve       20      0.503      0.025      0.010      0.031
                         /login       44      0.389      0.009      0.002      0.015
                        /signup       22      0.330      0.015      0.002      0.098
                  /tests/modify       13      0.313      0.024      0.015      0.030
               /user_management        9      0.311      0.035      0.001      0.168
                    /tests/stop        7      0.250      0.036      0.025      0.047
               /api/failed_task       16      0.157      0.010      0.004      0.022
                      /workers/       16      0.139      0.009      0.002      0.029
                         /user/        6      0.103      0.017      0.003      0.036
                     /sprt_calc        3      0.033      0.011      0.009      0.014
             /api/finished_runs        2      0.022      0.011      0.005      0.017
                          /user        1      0.007      0.007      0.007      0.007
                  /api/get_run/        1      0.006      0.006      0.006      0.006

The convention throughout the code base is that run_id is a string. Since a run_id and the corresponding ObjectId print in exactly the same way, violating this convention may lead to hard to understand bugs such as this one https://tests.stockfishchess.org/actions?action=log_message&user=&text=internal which was introduced in official-stockfish#1993.

The convention throughout the code base is that run_id is a string. Since a run_id and the corresponding ObjectId print in exactly the same way, violating this convention may lead to hard to understand bugs such as this one https://tests.stockfishchess.org/actions?action=log_message&user=&text=internal which was introduced in #1993.

vdbergh force-pushed the serialize branch from 2dd8e37 to 062d336 Compare May 8, 2024 13:58

vdbergh changed the title ~~Proof of concept inplementation of get_run via api.~~ Proof of concept inplementation of get_run via api (RFC). May 8, 2024

ppigazzini reviewed May 8, 2024

View reviewed changes

server/fishtest/rundb.py Outdated Show resolved Hide resolved

vdbergh force-pushed the serialize branch from 062d336 to fa16af5 Compare May 8, 2024 15:26

vdbergh force-pushed the serialize branch from fa16af5 to 6a3d8e0 Compare May 9, 2024 04:53

vdbergh force-pushed the serialize branch 5 times, most recently from ae56deb to f7e42b0 Compare May 9, 2024 13:15

vdbergh changed the title ~~Proof of concept inplementation of get_run via api (RFC).~~ Proof of concept implementation of get_run via api (RFC). May 9, 2024

vdbergh force-pushed the serialize branch from f7e42b0 to 66ea57e Compare May 9, 2024 13:33

ppigazzini marked this pull request as draft May 31, 2024 16:22

vdbergh force-pushed the serialize branch from ee7a6d9 to b619163 Compare June 25, 2024 16:48

vdbergh marked this pull request as ready for review June 25, 2024 16:50

vdbergh force-pushed the serialize branch from b619163 to 5610a08 Compare June 28, 2024 16:17

ppigazzini mentioned this pull request Jun 29, 2024

Validation of Internal Data Struture Failed [24‑06‑29 04:58:02] #2086

Closed

ppigazzini reviewed Jun 29, 2024

View reviewed changes

server/fishtest/views.py Outdated Show resolved Hide resolved

vdbergh force-pushed the serialize branch from 5610a08 to 3ffca9b Compare June 29, 2024 19:58

vdbergh mentioned this pull request Jul 1, 2024

More debug messages when validating internal data structures. #2088

Closed

vdbergh mentioned this pull request Jul 1, 2024

Make new_run return a string instead of an ObjectId. #2090

Merged

vdbergh marked this pull request as draft July 1, 2024 18:01

vdbergh mentioned this pull request Jul 4, 2024

get_machines is too slow and does not use the book keeping data #2101

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proof of concept implementation of get_run via api (RFC). #1993

Proof of concept implementation of get_run via api (RFC). #1993

vdbergh commented May 8, 2024 •

edited

Loading

vdbergh commented May 8, 2024

ppigazzini commented May 8, 2024

vdbergh commented May 8, 2024

vdbergh commented May 8, 2024

vdbergh commented May 8, 2024

vdbergh commented May 8, 2024

ppigazzini commented May 8, 2024 •

edited

Loading

ppigazzini commented May 8, 2024 •

edited

Loading

vdbergh commented May 8, 2024 •

edited

Loading

ppigazzini commented May 8, 2024 •

edited

Loading

vdbergh commented May 8, 2024

ppigazzini commented May 8, 2024 •

edited

Loading

vdbergh commented May 8, 2024

vdbergh commented May 9, 2024 •

edited

Loading

vdbergh commented May 9, 2024 •

edited

Loading

ppigazzini commented May 9, 2024 •

edited

Loading

vdbergh commented May 9, 2024

vdbergh commented May 9, 2024

ppigazzini commented Jun 25, 2024

vdbergh commented Jun 25, 2024

ppigazzini commented Jun 25, 2024

vdbergh commented Jun 25, 2024

vdbergh commented Jun 25, 2024

ppigazzini commented Jun 28, 2024

vdbergh commented Jun 28, 2024

ppigazzini commented Jun 28, 2024 •

edited

Loading

vdbergh commented Jun 28, 2024

ppigazzini commented Jun 28, 2024

ppigazzini commented Jun 28, 2024 •

edited

Loading

vdbergh commented Jun 28, 2024

ppigazzini commented Jun 29, 2024

vdbergh commented Jun 29, 2024

ppigazzini commented Jun 29, 2024 •

edited

Loading

Proof of concept implementation of get_run via api (RFC). #1993

Are you sure you want to change the base?

Proof of concept implementation of get_run via api (RFC). #1993

Conversation

vdbergh commented May 8, 2024 • edited Loading

vdbergh commented May 8, 2024

ppigazzini commented May 8, 2024

vdbergh commented May 8, 2024

vdbergh commented May 8, 2024

vdbergh commented May 8, 2024

vdbergh commented May 8, 2024

ppigazzini commented May 8, 2024 • edited Loading

ppigazzini commented May 8, 2024 • edited Loading

vdbergh commented May 8, 2024 • edited Loading

ppigazzini commented May 8, 2024 • edited Loading

vdbergh commented May 8, 2024

ppigazzini commented May 8, 2024 • edited Loading

vdbergh commented May 8, 2024

vdbergh commented May 9, 2024 • edited Loading

vdbergh commented May 9, 2024 • edited Loading

ppigazzini commented May 9, 2024 • edited Loading

vdbergh commented May 9, 2024

vdbergh commented May 9, 2024

ppigazzini commented Jun 25, 2024

vdbergh commented Jun 25, 2024

ppigazzini commented Jun 25, 2024

vdbergh commented Jun 25, 2024

vdbergh commented Jun 25, 2024

ppigazzini commented Jun 28, 2024

vdbergh commented Jun 28, 2024

ppigazzini commented Jun 28, 2024 • edited Loading

vdbergh commented Jun 28, 2024

ppigazzini commented Jun 28, 2024

ppigazzini commented Jun 28, 2024 • edited Loading

vdbergh commented Jun 28, 2024

ppigazzini commented Jun 29, 2024

vdbergh commented Jun 29, 2024

ppigazzini commented Jun 29, 2024 • edited Loading

vdbergh commented May 8, 2024 •

edited

Loading

ppigazzini commented May 8, 2024 •

edited

Loading

ppigazzini commented May 8, 2024 •

edited

Loading

vdbergh commented May 8, 2024 •

edited

Loading

ppigazzini commented May 8, 2024 •

edited

Loading

ppigazzini commented May 8, 2024 •

edited

Loading

vdbergh commented May 9, 2024 •

edited

Loading

vdbergh commented May 9, 2024 •

edited

Loading

ppigazzini commented May 9, 2024 •

edited

Loading

ppigazzini commented Jun 28, 2024 •

edited

Loading

ppigazzini commented Jun 28, 2024 •

edited

Loading

ppigazzini commented Jun 29, 2024 •

edited

Loading