Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proof of concept implementation of get_run via api (RFC). #1993

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

vdbergh
Copy link
Contributor

@vdbergh vdbergh commented May 8, 2024

With this PR /api/serialize_run hosted by the main instance can be invoked by a secondary instance to receive a copy of the run from the cache.

In this way there is never an issue that the run might not be up to date.

Sadly I don't know how feasible this is. A typical serialized run is 5.5k. Is this too much?

Note: /api/serialize_run restricts access to localhost over http. Some form of access control is necessary but it can also be done externally.

I have tested that the PR works. The deserialized runs pass validation (for this I needed to supply the tz_aware codec option to bson.decode()).

@vdbergh vdbergh changed the title Proof of concept inplementation of get_run via api. Proof of concept inplementation of get_run via api (RFC). May 8, 2024
@vdbergh
Copy link
Contributor Author

vdbergh commented May 8, 2024

Serializing of a run with 500 tasks only takes a couple of ms, but I am worried the middleware will wreak havoc with the timings.

It would be nice to have a more direct mechanism for communicating between instances than going through nginx + middleware...

server/fishtest/rundb.py Outdated Show resolved Hide resolved
@ppigazzini
Copy link
Collaborator

It would be nice to have a more direct mechanism for communicating between instances than going through nginx + middleware...

redis or memcached, but it's another tool to setup and maintain.

@vdbergh
Copy link
Contributor Author

vdbergh commented May 8, 2024

I was more thinking of something basic like shared memory.

@vdbergh
Copy link
Contributor Author

vdbergh commented May 8, 2024

Instead of returning the serialized run in the body, the primary instance could write it to shared memory where the secondary instance would find it. But it is not so obvious what is the cleanest way to make this thread safe.

What you would actually want is an in-memory mongodb collection (i.e. a collection which is never written to disk).

@vdbergh
Copy link
Contributor Author

vdbergh commented May 8, 2024

Another option is a named pipe. The primary instance creates it, writes the serialized run to it and communicates the name to the secondary instance. The secondary instance reads it, and then removes it. Not sure this is all possible.

@vdbergh
Copy link
Contributor Author

vdbergh commented May 8, 2024

Of course we may just as well use a real temporary file for this then.

@ppigazzini
Copy link
Collaborator

ppigazzini commented May 8, 2024

Right now PROD is able to withstand 60k cores or 100k cores with #1983 (the usual contribution is 3-5k cores).
Im fine with a clean/standard solution able to scale, not with complex code useful only in very rarely occasions.

@ppigazzini
Copy link
Collaborator

ppigazzini commented May 8, 2024

snippet of nginx config

server {
    listen 80;
    listen [::]:80;

    server_name dfts-0.pigazzini.it;

    location / {
        return 301 https://$host/tests;
    }

    location /api/ {
        proxy_set_header Connection "";
        proxy_set_header X-Forwarded-Proto  $scheme;
        proxy_set_header X-Forwarded-For    $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Host   $host:$server_port;
        proxy_set_header X-Forwarded-Port   $server_port;
        proxy_set_header X-Country-Code     $region;

        client_max_body_size        120m;
        client_body_buffer_size     128k;
        proxy_connect_timeout       5s;
        proxy_send_timeout          30s;
        proxy_read_timeout          30s;
        proxy_max_temp_file_size    0;
        proxy_redirect              off;
        proxy_http_version          1.1;

        proxy_pass http://$backends;

        location /api/serialize_run/ {
            allow 127.0.0.1;
            deny all;
            proxy_pass http://$backends;
        }
    }
}


server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2 ipv6only=on;

... 

    location / {
        rewrite ^/$ /tests permanent;

...

        proxy_pass http://$backends;

        location /api/serialize_run/ {
            deny all;
        }
    }
}

@vdbergh
Copy link
Contributor Author

vdbergh commented May 8, 2024

Well #1983 is something else. This PR is because in #1985 and #1992 the run read from the database may be missing some tasks compared to what's in the cache. This could lead to confusing behavior. This PR seems conceptually to be the most elegant way to fix that, but I am not sure if it scales...

EDIT: Note that this PR may be optimized a little bit by skipping the base64 encoding and just returning the gzipped data as application/octet-stream (or perhaps application/gzip) in the body with a Response object.

@ppigazzini
Copy link
Collaborator

ppigazzini commented May 8, 2024

I was talking about named pipe, shared memory, not this PR.

I'm not a fan to add another tool, hitting the rough corner, but redis is the standard tool to cache in memory the db.
Setup:

curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt update
sudo apt -y install redis

sudo systemctl status redis-server

On python side something like this:

# pip install redis

import redis
import json

r = redis.Redis(host='localhost', port=6379, db=0)

def get_run(self, r_id):
    r_id = str(r_id)
    run_cache_item = r.get(r_id)
    if run_cache_item is not None:
        run_cache_item = json.loads(run_cache_item)
        run_cache_item["rtime"] = time.time()
        r.set(r_id, json.dumps(run_cache_item))
        return run_cache_item["run"]
    run = self.runs.find_one({"_id": ObjectId(r_id)})
    if run is not None:
        run_cache_item = {
            "rtime": time.time(),
            "ftime": time.time(),
            "run": run,
            "dirty": False,
        }
        r.set(r_id, json.dumps(run_cache_item))
    return run

        
def buffer(self, run, flush):
    r_id = str(run["_id"])
    if flush:
        run_cache_item = {
            "dirty": False,
            "rtime": time.time(),
            "ftime": time.time(),
            "run": run,
        }
        r.set(r_id, json.dumps(run_cache_item))
        self.runs.replace_one({"_id": ObjectId(r_id)}, run)
    else:
        run_cache_item = r.get(r_id)
        if run_cache_item is not None:
            run_cache_item = json.loads(run_cache_item)
            ftime = run_cache_item["ftime"]
        else:
            ftime = time.time()
        run_cache_item = {
            "dirty": True,
            "rtime": time.time(),
            "ftime": ftime,
            "run": run,
        }
        r.set(r_id, json.dumps(run_cache_item))

But datetime and id must be converted before the dump.

run["start_time"] = run["start_time"].isoformat()
run["_id"] = str(run["_id"])

@vdbergh
Copy link
Contributor Author

vdbergh commented May 8, 2024

return self.runs.find_one({"_id": ObjectId(r_id)})

I don't understand. This still seems to return the run from the db, which is not what you want...

@ppigazzini
Copy link
Collaborator

ppigazzini commented May 8, 2024

Written on the fly, code fixed.

@vdbergh
Copy link
Contributor Author

vdbergh commented May 8, 2024

But datetime and id must be converted before the dump.

One can use bson instead of json. bson doesn't have this issue.

@vdbergh
Copy link
Contributor Author

vdbergh commented May 9, 2024

@ppigazzini Your code seems to be much more ambitious than what I am trying to do in this PR.

  • In this PR I want to make it possible for a secondary Fishtest instance ("the client") to ask for a copy of a run in the cache of the primary instance (e.g to have a bulletproof get_run()). I was musing about how to get this copy to the client as quickly as possible (perhaps this could be done via redis). I am not sure however if it is necessary to do anything. Perhaps this PR is fast enough as it is (I am going to optimize it a bit more by skipping the base64 encoding).

  • IUC, in your code you want to replace the Fishtest cache by a redis cache. This requires a lot of thought. For example it seems to me that a naive implementation would lead to lots of unnecessary serializations/deserializations.

@vdbergh
Copy link
Contributor Author

vdbergh commented May 9, 2024

In the latest push I skip the base64 encoding. This makes the payload smaller.

EDIT: I should perhaps point out that runs compress extremely well (obviously because there is so much redundancy). I did an experiment with a typical run (500 tasks, 50 bad tasks and 101 SPSA history entries) and the compression ratio was 77:1. I.e. 425k vs 5.5k.

@vdbergh vdbergh force-pushed the serialize branch 5 times, most recently from ae56deb to f7e42b0 Compare May 9, 2024 13:15
@vdbergh vdbergh changed the title Proof of concept inplementation of get_run via api (RFC). Proof of concept implementation of get_run via api (RFC). May 9, 2024
@ppigazzini
Copy link
Collaborator

ppigazzini commented May 9, 2024

In nginx config we set at http level few rate limits:

  • per host
  • per URI
  • per IP (here we have a whitelist setup)

Also with the cookie module with a js challenge

Those rate limits should apply to this API as well. I must replicate the filtering on DEV to verify the hypothesis and test a new configuration to whitelist localhost.

BTW: DEV running with the PR.

@vdbergh
Copy link
Contributor Author

vdbergh commented May 9, 2024

From api.log

2024-05-09 17:12:01.645786+00:00 :   5.52 ms (s)  1190.12 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-05-09 17:14:10.054970+00:00 :   6.18 ms (s)  1191.21 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-05-09 17:16:19.091182+00:00 :   5.40 ms (s)  2105.73 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-05-09 17:18:28.921216+00:00 :   6.91 ms (s)  1098.78 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-05-09 17:20:32.112779+00:00 :   7.16 ms (s)  1096.62 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn

@vdbergh
Copy link
Contributor Author

vdbergh commented May 9, 2024

On PROD

2024-05-09 18:07:24.438469+00:00 :  21.81 ms (s)  1698.87 ms (w)  https://tests.stockfishchess.org:443/api/upload_pgn
2024-05-09 18:29:16.939441+00:00 :   6.34 ms (s)  1341.83 ms (w)  https://tests.stockfishchess.org:443/api/upload_pgn
2024-05-09 18:50:43.001719+00:00 :  56.01 ms (s)  1253.63 ms (w)  https://tests.stockfishchess.org:443/api/upload_pgn
2024-05-09 19:11:44.861195+00:00 :  21.35 ms (s)  1322.71 ms (w)  https://tests.stockfishchess.org:443/api/upload_pgn
2024-05-09 19:24:29.873764+00:00 :  37.82 ms (s)   911.89 ms (w)  https://tests.stockfishchess.org:443/api/upload_pgn

@ppigazzini ppigazzini marked this pull request as draft May 31, 2024 16:22
@ppigazzini
Copy link
Collaborator

PROD nginx is running without limits in place right now.

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 25, 2024

You mean this PR could work if I rebase it?

@ppigazzini
Copy link
Collaborator

It worked on DEV, so I expect that it should work on PROD as well now with a simpler nginx configuration.

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 25, 2024

Ok tonight I will have a look...

@vdbergh vdbergh marked this pull request as ready for review June 25, 2024 16:50
@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 25, 2024

Rebased. I hope I did not mess up too much.

@ppigazzini
Copy link
Collaborator

DEV running with the PR, logs clean right now.

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 28, 2024

It seems more expensive. This an extract of api.log

2024-06-12 11:16:49.812611+00:00 :   5.78 ms (s)   228.43 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:18:20.175159+00:00 :   7.17 ms (s)   231.59 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:19:46.106092+00:00 :   5.94 ms (s)   224.95 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:21:16.060719+00:00 :   6.54 ms (s)   227.58 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:22:40.993474+00:00 :   6.10 ms (s)   231.74 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:24:10.936604+00:00 :  13.73 ms (s)   243.28 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:25:32.842110+00:00 :   5.51 ms (s)   231.13 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:26:54.913908+00:00 :   6.24 ms (s)   307.63 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:28:21.854786+00:00 :   6.09 ms (s)   228.63 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-12 11:29:08.940692+00:00 :   6.01 ms (s)   199.73 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 09:25:29.849932+00:00 : 223.98 ms (s)   383.87 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 10:00:26.724927+00:00 :   2.41 ms (s)   165.89 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 16:40:18.892302+00:00 :   3.21 ms (s)   202.10 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 16:52:32.129110+00:00 :   2.22 ms (s)   208.15 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 16:59:56.952025+00:00 :   1.75 ms (s)   193.12 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 19:29:59.789806+00:00 :   2.35 ms (s)   219.05 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-20 19:42:08.685836+00:00 :   1.86 ms (s)   198.28 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:44:24.270762+00:00 :  25.50 ms (s)   334.55 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:46:52.579195+00:00 :  16.50 ms (s)   310.05 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:49:06.613025+00:00 :  15.87 ms (s)   337.55 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:51:35.929257+00:00 :  16.62 ms (s)   316.70 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:53:47.925970+00:00 :  13.78 ms (s)   310.23 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:55:49.681393+00:00 :  16.99 ms (s)   322.44 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:57:42.453471+00:00 :  15.22 ms (s)   312.66 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn
2024-06-28 19:59:38.246520+00:00 :  12.62 ms (s)   306.52 ms (w)  https://dfts-0.pigazzini.it:443/api/upload_pgn

@ppigazzini
Copy link
Collaborator

ppigazzini commented Jun 28, 2024

PROD has been running with the PR for some time now.

Timings before the change:

# logging from [28/Jun/2024:17:36:27 +0000] to [28/Jun/2024:18:51:30 +0000] ########
# duration (seconds)            : 4503
# calls in total                : 10000
# calls per second              : 2
# calls not reaching the backend: 42
# calls handled by the backend  : 9958
#                         route    calls      total    average    minimum    maximum
                         /tests      251     84.421      0.336      0.008      0.469
                   /tests/user/       99     22.129      0.224      0.035      0.275
               /api/update_task     4759     17.804      0.004      0.001      0.042
                   /api/actions     1350     12.130      0.009      0.002      0.115
                     /tests/run        8     10.708      1.339      0.490      2.408
                      /api/beat     2447      7.666      0.003      0.001      0.105
                /tests/finished       61      5.852      0.096      0.027      0.143
                /api/upload_pgn      153      3.825      0.025      0.008      0.109
               /api/active_runs      121      3.714      0.031      0.004      2.615
                       /actions        3      3.468      1.156      0.829      1.332
                  /api/get_elo/      122      3.414      0.028      0.008      0.048
                /tests/machines        9      1.931      0.215      0.178      0.250
               /tests/live_elo/       70      1.389      0.020      0.003      0.034
                   /tests/view/       50      1.307      0.026      0.013      0.054
                  /contributors        4      1.037      0.259      0.130      0.377
              /api/request_task      173      0.775      0.004      0.002      0.029
                  /tests/tasks/       11      0.593      0.054      0.022      0.190
                           /nns       19      0.430      0.023      0.009      0.121
           /api/request_version      181      0.290      0.002      0.001      0.006
                       /api/nn/       20      0.108      0.005      0.003      0.006
              /api/request_spsa       35      0.101      0.003      0.002      0.004
                 /tests/approve        2      0.040      0.020      0.013      0.027
          /contributors/monthly        1      0.034      0.034      0.034      0.034
                  /tests/delete        1      0.025      0.025      0.025      0.025
                         /login        3      0.023      0.008      0.004      0.010
               /api/failed_task        2      0.018      0.009      0.008      0.010
                              /        1      0.009      0.009      0.009      0.009
                     /sprt_calc        1      0.008      0.008      0.008      0.008
                      /workers/        1      0.003      0.003      0.003      0.003

Timings after the change:

# logging from [28/Jun/2024:18:59:01 +0000] to [28/Jun/2024:20:08:07 +0000] ########
# duration (seconds)            : 4146
# calls in total                : 10000
# calls per second              : 2
# calls not reaching the backend: 37
# calls handled by the backend  : 9963
#                         route    calls      total    average    minimum    maximum
                         /tests      215     63.553      0.296      0.001      0.516
               /api/update_task     5047     19.137      0.004      0.001      0.041
                   /tests/user/       52     11.593      0.223      0.026      0.279
                   /api/actions     1135      9.883      0.009      0.002      0.050
                /api/upload_pgn      172      8.999      0.052      0.007      0.171
                      /api/beat     2369      7.177      0.003      0.001      0.031
                     /tests/run        4      5.402      1.351      0.627      2.269
                /tests/finished       45      4.634      0.103      0.026      0.133
                       /actions        7      4.419      0.631      0.080      0.947
            /api/serialize_run/      172      4.306      0.025      0.001      0.108
               /api/active_runs      106      3.344      0.032      0.004      2.456
                  /contributors        5      1.842      0.368      0.220      0.516
                /tests/machines        6      1.327      0.221      0.182      0.251
                  /api/get_elo/       54      1.321      0.024      0.008      0.045
                   /tests/view/       43      1.197      0.028      0.002      0.087
              /api/request_task      200      0.855      0.004      0.002      0.028
               /tests/live_elo/       33      0.462      0.014      0.002      0.027
           /api/request_version      228      0.366      0.002      0.001      0.004
                  /tests/tasks/        6      0.324      0.054      0.030      0.147
                      /api/pgn/        3      0.231      0.077      0.041      0.124
                           /nns       11      0.195      0.018      0.014      0.024
                              /        4      0.131      0.033      0.001      0.113
                       /api/nn/       22      0.116      0.005      0.004      0.006
              /api/request_spsa       20      0.055      0.003      0.002      0.004
                    /tests/stop        1      0.047      0.047      0.047      0.047
                 /tests/approve        2      0.027      0.013      0.013      0.014
               /api/failed_task        1      0.008      0.008      0.008      0.008

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 28, 2024

Probably for validation of the task id it is overkill to ask the primary instance for the whole run.

@ppigazzini
Copy link
Collaborator

By the way, logs are clean.

@ppigazzini
Copy link
Collaborator

ppigazzini commented Jun 28, 2024

Monitoring PROD I saw many nginx warnings for cache on disk during the pgn upload, so I raised the nginx buffer to lower the number of warnings:

-       client_body_buffer_size     128k;
+       client_body_buffer_size     1m;
2024/06/28 19:23:44 [warn] 94515#94515: *88 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000001, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 19:26:19 [warn] 94516#94516: *429 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000002, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 19:43:38 [warn] 94516#94516: *2606 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000003, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 19:50:55 [warn] 94515#94515: *3541 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000004, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 19:56:54 [warn] 94516#94516: *4245 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000005, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 20:00:08 [warn] 94516#94516: *4679 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000006, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 20:02:35 [warn] 94515#94515: *5023 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000007, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"
2024/06/28 20:09:42 [warn] 94515#94515: *5898 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000008, client: xxx.xxx.xxx.xxx, server: tests.stockfishchess.org, request: "POST /api/upload_pgn HTTP/1.1", host: "tests.stockfishchess.org"

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 28, 2024

Before & after

/api/upload_pgn 153 3.825 0.025 0.008 0.109
/api/upload_pgn 172 8.999 0.052 0.007 0.171

This confirms what I thought.

@ppigazzini
Copy link
Collaborator

8 hours timings with PR using the new nginx config (no cache on disk during the pgn upload).
I'm reverting to master to use the same nginx config.

# logging from [29/Jun/2024:00:00:26 +0000] to [29/Jun/2024:08:53:53 +0000] ########
# duration (seconds)            : 32007
# calls in total                : 100000
# calls per second              : 3
# calls not reaching the backend: 330
# calls handled by the backend  : 99670
#                         route    calls      total    average    minimum    maximum
                         /tests     1467    504.141      0.344      0.001      0.576
                   /tests/user/      656    149.462      0.228      0.013      0.991
               /api/update_task    35768    135.750      0.004      0.001      0.173
                   /api/actions     9269     95.834      0.010      0.001      0.142
                  /api/get_elo/     2563     73.265      0.029      0.007      0.134
                       /actions      102     62.721      0.615      0.038      3.352
                /api/upload_pgn     1160     60.732      0.052      0.008      0.238
                      /api/beat    16149     49.470      0.003      0.001      0.109
                 /api/run_pgns/        6     41.123      6.854      2.346     15.032
                  /contributors       46     34.405      0.748      0.000     23.490
                     /tests/run       44     29.028      0.660      0.001      2.626
                /tests/finished      280     28.911      0.103      0.018      0.181
            /api/serialize_run/     1160     27.212      0.023      0.001      0.111
                   /tests/view/      779     18.880      0.024      0.001      0.549
               /tests/live_elo/      889     17.377      0.020      0.001      0.294
                /tests/machines       37      7.921      0.214      0.000      0.276
               /api/active_runs      803      7.429      0.009      0.004      0.092
              /api/request_task     1353      6.145      0.005      0.002      0.131
                  /tests/tasks/      207      4.577      0.022      0.002      0.138
           /api/request_version     1428      2.366      0.002      0.000      0.032
                           /nns       81      1.695      0.021      0.007      0.100
               /user_management       16      1.577      0.099      0.000      0.222
              /api/request_spsa      410      1.213      0.003      0.001      0.038
                              /      137      0.787      0.006      0.001      0.014
                       /api/nn/      147      0.773      0.005      0.002      0.016
                  /tests/stats/       13      0.711      0.055      0.020      0.073
          /contributors/monthly       23      0.500      0.022      0.011      0.033
                         /login       43      0.329      0.008      0.001      0.013
                 /tests/approve       10      0.250      0.025      0.016      0.028
                      /workers/       10      0.158      0.016      0.003      0.060
                          /user        8      0.141      0.018      0.008      0.026
                        /signup        8      0.127      0.016      0.007      0.061
                         /user/        6      0.099      0.017      0.002      0.028
                  /tests/modify        4      0.087      0.022      0.010      0.035
               /api/failed_task        5      0.043      0.009      0.002      0.016
                    /tests/stop        2      0.041      0.021      0.014      0.027
                     /sprt_calc        2      0.022      0.011      0.008      0.014
             /api/finished_runs        2      0.018      0.009      0.006      0.012
                  /api/get_run/        1      0.014      0.014      0.014      0.014
                  /tests/delete        1      0.003      0.003      0.003      0.003

@vdbergh
Copy link
Contributor Author

vdbergh commented Jun 29, 2024

It seems the timings for upload_pgn remain the same (50ms).

I think the serialize_run api is good and may be useful. But for validation of run_id/task_id we don't need to pull in the full run. A much more restricted internal api should be used (we can pull in the run later if we actually need it).

I will try to cook something up.

With this PR /api/serialize_run hosted by the main instance can
be invoked by a secondary instance to receive a copy of the run
from the cache.

In this way there is never an issue that the run might not
be up to date.

Note: /api/serialize_run restricts access to localhost over http.
Some form of access control is necessary.
@ppigazzini
Copy link
Collaborator

ppigazzini commented Jun 29, 2024

12 hours timings with master and updated nginx configuration (it solves the warnings in the log, but there is not an improvement on the average timing of the pgn upload)

# logging from [29/Jun/2024:10:08:36 +0000] to [29/Jun/2024:22:28:46 +0000] ########
# duration (seconds)            : 44410
# calls in total                : 100000
# calls per second              : 2
# calls not reaching the backend: 456
# calls handled by the backend  : 99544
#                         route    calls      total    average    minimum    maximum
                         /tests     3442   1156.082      0.336      0.000      0.898
               /api/update_task    47904    181.824      0.004      0.000      0.921
                   /tests/user/      694    163.920      0.236      0.023      0.645
                 /api/run_pgns/       12    152.795     12.733      0.768     30.307
                   /api/actions    12483    132.893      0.011      0.002      0.856
                      /api/beat    22386     71.249      0.003      0.000      1.799
                     /tests/run       55     61.134      1.112      0.002      2.685
                       /actions       90     58.371      0.649      0.031      2.754
                   /tests/view/      875     57.171      0.065      0.002      2.370
                /tests/finished      471     52.820      0.112      0.018      3.009
                  /api/get_elo/     1629     45.579      0.028      0.007      0.138
                /api/upload_pgn     1683     43.686      0.026      0.005      0.175
                /tests/machines      102     23.422      0.230      0.148      0.281
               /api/active_runs     1177     20.413      0.017      0.001      2.652
                  /contributors       59     16.381      0.278      0.000      0.725
               /tests/live_elo/      789     14.208      0.018      0.002      0.125
                  /tests/tasks/      169     10.483      0.062      0.001      0.216
              /api/request_task     1925      8.604      0.004      0.002      0.047
                       /api/nn/      646      3.432      0.005      0.000      0.066
           /api/request_version     2004      3.385      0.002      0.001      0.109
                           /nns      142      3.047      0.021      0.002      0.178
          /contributors/monthly       61      1.515      0.025      0.008      0.157
              /api/request_spsa      453      1.230      0.003      0.001      0.018
                              /      122      0.925      0.008      0.001      0.080
                  /tests/stats/       11      0.619      0.056      0.006      0.106
                 /tests/approve       20      0.503      0.025      0.010      0.031
                         /login       44      0.389      0.009      0.002      0.015
                        /signup       22      0.330      0.015      0.002      0.098
                  /tests/modify       13      0.313      0.024      0.015      0.030
               /user_management        9      0.311      0.035      0.001      0.168
                    /tests/stop        7      0.250      0.036      0.025      0.047
               /api/failed_task       16      0.157      0.010      0.004      0.022
                      /workers/       16      0.139      0.009      0.002      0.029
                         /user/        6      0.103      0.017      0.003      0.036
                     /sprt_calc        3      0.033      0.011      0.009      0.014
             /api/finished_runs        2      0.022      0.011      0.005      0.017
                          /user        1      0.007      0.007      0.007      0.007
                  /api/get_run/        1      0.006      0.006      0.006      0.006

vdbergh added a commit to vdbergh/fishtest that referenced this pull request Jul 1, 2024
The convention throughout the code base is that run_id is a
string.

Since a run_id and the corresponding ObjectId print in
exactly the same way, violating this convention may lead to
hard to understand bugs such as this one

https://tests.stockfishchess.org/actions?action=log_message&user=&text=internal

which was introduced in official-stockfish#1993.
vdbergh added a commit to vdbergh/fishtest that referenced this pull request Jul 1, 2024
The convention throughout the code base is that run_id is a
string.

Since a run_id and the corresponding ObjectId print in
exactly the same way, violating this convention may lead to
hard to understand bugs such as this one

https://tests.stockfishchess.org/actions?action=log_message&user=&text=internal

which was introduced in official-stockfish#1993.
ppigazzini pushed a commit that referenced this pull request Jul 1, 2024
The convention throughout the code base is that run_id is a
string.

Since a run_id and the corresponding ObjectId print in
exactly the same way, violating this convention may lead to
hard to understand bugs such as this one

https://tests.stockfishchess.org/actions?action=log_message&user=&text=internal

which was introduced in #1993.
@vdbergh vdbergh marked this pull request as draft July 1, 2024 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants