get_machines is too slow and does not use the book keeping data #2101

Open
peregrineshahin opened this issue Jul 3, 2024 · 30 comments

@peregrineshahin
Contributor

https://tests.stockfishchess.org/tests/machines
This route should use the latest worker book-keeping optimizations, if they exist.
It's also too slow under fleet load, ignoring the 10s cached HTTP response.

    def get_machines(self):
        # Fetch only the fields we need from all unfinished runs.
        active_runs = self.runs.find({"finished": False}, {"tasks": 1, "args": 1})
        # Lazily yield one record per active task: its worker_info merged with
        # last_updated, the full run document, and the task index.
        machines = (
            task["worker_info"]
            | {
                "last_updated": (
                    task["last_updated"] if task.get("last_updated") else None
                ),
                "run": run,
                "task_id": task_id,
            }
            for run in active_runs
            if any(task["active"] for task in reversed(run["tasks"]))
            for task_id, task in enumerate(run["tasks"])
            if task["active"]
        )
        return machines
@vdbergh
Contributor

vdbergh commented Jul 3, 2024

Well there is an intrinsic complexity here: a list of all active tasks needs to be made.

On the primary instance the db access could be eliminated by using the cache and the set of unfinished runs (which is dynamically kept up to date), but I believe get_machines runs on a secondary instance.

What we could do (in addition perhaps) is to create a scheduled task to regenerate the machines list every minute and then serve this list.
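
For illustration, a minimal sketch of such a scheduled regeneration, assuming a rundb object exposing get_machines() and using a plain threading.Timer in place of Fishtest's own scheduler (whose API is not shown here):

```python
import threading


class MachinesCache:
    """Illustrative sketch: rebuild the machines list every minute in the
    background and serve the cached copy to /tests/machines on demand."""

    def __init__(self, rundb, interval=60.0):
        self.rundb = rundb        # assumed to expose get_machines()
        self.interval = interval
        self.machines = []
        self.lock = threading.Lock()
        self._schedule()

    def _schedule(self):
        timer = threading.Timer(self.interval, self._refresh)
        timer.daemon = True
        timer.start()

    def _refresh(self):
        machines = list(self.rundb.get_machines())  # materialize the generator
        with self.lock:
            self.machines = machines
        self._schedule()

    def get(self):
        with self.lock:
            return self.machines
```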

@peregrineshahin
Contributor Author

peregrineshahin commented Jul 4, 2024

there is an intrinsic complexity here: a list of all active tasks needs to be made.

How is that? The goal is to get the active machines; the "added" complexity of having to loop through the tasks is an implementation detail.

@vdbergh
Contributor

vdbergh commented Jul 4, 2024

Actually you are right. The main instance has a data structure, wtt_map, which is a dictionary mapping short worker names to tasks.

wtt_map_schema = {
    short_worker_name: (runs_schema, task_id),
}

So that could be used to get the machines list quickly on the main instance.
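
As a rough sketch (not existing Fishtest code), the machines list could then be built from the wtt_map without touching the db, skipping the inactive tasks it may still contain:

```python
def get_machines_from_wtt_map(wtt_map):
    """Hypothetical helper: derive the machines list from the primary
    instance's wtt_map, which maps short worker names to (run, task_id)."""
    machines = []
    for short_worker_name, (run, task_id) in wtt_map.items():
        task = run["tasks"][task_id]
        if not task["active"]:
            continue  # the wtt_map may still contain some inactive tasks
        machines.append(
            task["worker_info"]
            | {
                "last_updated": task.get("last_updated"),
                "run": run,
                "task_id": task_id,
            }
        )
    return machines
```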

@vdbergh
Contributor

vdbergh commented Jul 4, 2024

Either the machine list api should be moved to a primary instance, or else an internal api should be created where a secondary instance asks the primary instance for the list (this may involve serialization/deserialization of json). An example of such an api is in #1993.
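
A hedged sketch of what such an internal api could look like (the route name, port, and payload shape below are made up for illustration; #1993 shows the real pattern):

```python
import json

import requests  # assumed available on the secondary instance

PRIMARY_URL = "http://localhost:6543"  # hypothetical internal address of the primary instance


# Primary side (illustrative): serialize the in-memory machines list to json.
def serve_machines_internal(machines):
    # The embedded "run" documents are large; sending only the run _id keeps the payload small.
    payload = [
        {k: v for k, v in m.items() if k != "run"} | {"run_id": str(m["run"]["_id"])}
        for m in machines
    ]
    return json.dumps(payload, default=str)


# Secondary side: fetch the list from the primary and deserialize it.
def fetch_machines_from_primary():
    response = requests.get(f"{PRIMARY_URL}/api/internal/machines", timeout=5)
    response.raise_for_status()
    return response.json()
```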

@vdbergh
Contributor

vdbergh commented Jul 4, 2024

And as I said, an additional enhancement could be to recreate the list every minute in a scheduled task (rather than creating it on demand).

@vdbergh
Contributor

vdbergh commented Jul 4, 2024

Note that the wtt_map itself may contain some inactive tasks (avoiding this clutters the code too much IMHO). These inactive tasks are evicted in a scheduled task that runs every minute.

@peregrineshahin
Contributor Author

I think we moved this API to the secondary instance because of performance issues. While I expected we would have such a map in the flow to provide this information, I'm not quite sure about the 1 min rebuild schedule. Do you mean that the numbers shown in the home page cards currently only refresh every minute? That sounds like too much, TBH. I'm generally not a fan of rescheduled APIs, and the nature of book-keeping approaches doesn't require them, so this caught me by surprise.

@vdbergh
Contributor

vdbergh commented Jul 4, 2024

Well I do not know where the bottleneck is. It could also be in Mako.

Currently the table is computed on demand but there is a delay (of the order of 30-60s) before the db is fully updated.

I like to schedule expensive computing tasks at regular intervals rather than on demand (unless accuracy is critical). In that way one gets predictable performance independently of load.

Note that the new scheduler in Fishtest makes creating such tasks very easy.

@peregrineshahin
Contributor Author

Well, my point is that if we look at book-keeping as a technique, it is neither on demand nor scheduled; it is none of the two. So my question was: suppose that each time a worker becomes active you add it to the dict, and each time a worker becomes inactive you remove it. I'm not sure I understand why such a map would require scheduling?

@vdbergh
Contributor

vdbergh commented Jul 4, 2024

I do not know how much time it takes to generate the html from the data structure. If it takes 1s and 60 people want to view the machines list then there is no time to do anything else...

EDIT: Simplified of course...

@peregrineshahin
Contributor Author

Sorry, but I'm not following. My question is related to the wtt_map: why do we rebuild it every minute? I don't understand conceptually why such a map isn't simply kept up to date at all times, just like how run results get incrementally updated. Is it because we don't notice workers when they leave?

@vdbergh
Contributor

vdbergh commented Jul 4, 2024

The wtt_map is always up to date (except that it contains some inactive tasks which are periodically cleaned). But you still need to generate the html from it. That's what I am talking about (during a visit of Vondele's fleet this will typically be a table with 10000 rows). Of course you could fiddle with paging but then sorting by username becomes unpleasant (I think).

PS.

The reason why there are some inactive tasks is that I did not want to clutter the set_inactive_task() function with code to update the wtt_map. Besides such code being messy one would also need to be careful to avoid deadlock.

@vdbergh
Contributor

vdbergh commented Jul 4, 2024

Perhaps I should clarify that the wtt_map was not created to support the machine list, but to very efficiently detect duplicate workers (even when Vondele's fleet is visiting).

For the machine list perhaps another data structure would be more appropriate: a dictionary with keys the run_id's of unfinished runs and with values the sets of corresponding active tasks. I.e. with schema

{
    run_id: {task_id}
}

This would be very easy to keep up to date.
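
For illustration, keeping it up to date would amount to two small hooks (the function names below are placeholders, not existing Fishtest code):

```python
from collections import defaultdict

# {run_id: {task_id}} -- the active tasks of each unfinished run.
active_tasks = defaultdict(set)


def on_task_activated(run_id, task_id):
    active_tasks[run_id].add(task_id)


def on_task_deactivated(run_id, task_id):
    active_tasks[run_id].discard(task_id)
    if not active_tasks[run_id]:
        del active_tasks[run_id]  # drop runs that have no active tasks left
```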

@peregrineshahin
Contributor Author

I'm not sure where to look; perhaps it would be easier for you to code the new function, please.

For the machine list perhaps another data structure would be more appropriate: a dictionary with keys the run_id's of unfinished runs and with values the sets of corresponding active tasks. I.e. with schema

{
    run_id: {task_id}
}
This would be very easy to keep up to date.

@vdbergh
Contributor

vdbergh commented Jul 6, 2024

The thing is that I am not 100% sure that generating the data is the bottleneck. I wonder if it is not the generation of a table with > 15000 rows from that data (a table that also needs to be transferred to the client).

If the latter is the case then moving the machines api to the primary instance would be bad, since there would be less time to handle the worker apis.

@vdbergh
Contributor

vdbergh commented Jul 6, 2024

I guess we could try. But it would require a fleet visit to see the effect.

@dubslow
Contributor

dubslow commented Jul 6, 2024

Good thing the fleet is visiting right this very moment :) (the server is even holding fairly steady at 100+k cores for two hours now!)

@vdbergh
Contributor

vdbergh commented Jul 6, 2024

Well it takes more than 15min to implement this...

@vdbergh
Contributor

vdbergh commented Jul 6, 2024

Good thing the fleet is visiting right this very moment :) (the server is even holding fairly steady at 100+k cores for two hours now!)

Yes the server seems to be doing fine. Local api.log times remain stable (a few ms/api call except pgn_upload which takes ~ 20ms).

@vondele
Member

vondele commented Jul 6, 2024

Server is indeed doing fine under this load now. Current api locally:

# logging from [06/Jul/2024:09:52:14 +0000] to [06/Jul/2024:09:53:10 +0000] ########
# duration (seconds)            : 56
# calls in total                : 10000
# calls per second              : 178
# calls not reaching the backend: 1
# calls handled by the backend  : 9999
#                         route    calls      total    average    minimum    maximum maxcount
                      /api/beat     6810    744.814      0.109      0.001     30.001       17
               /api/update_task     3047    201.978      0.066      0.001     30.001        4
                         /tests       12      7.307      0.609      0.015      0.997        0
                  /tests/tasks/       12      6.094      0.508      0.292      1.007        0
                     /tests/run        2      3.938      1.969      1.180      2.758        0
                   /tests/view/       14      0.846      0.060      0.010      0.159        0
              /api/request_spsa       50      0.820      0.016      0.003      0.114        0
                  /api/get_elo/        7      0.478      0.068      0.017      0.184        0
                   /api/actions       11      0.199      0.018      0.003      0.037        0
                       /api/nn/       28      0.158      0.006      0.004      0.007        0
                /tests/finished        1      0.092      0.092      0.092      0.092        0
                 /tests/approve        1      0.054      0.054      0.054      0.054        0
               /api/active_runs        2      0.025      0.013      0.006      0.019        0
           /api/request_version        1      0.005      0.005      0.005      0.005        0
              /api/request_task        1      0.003      0.003      0.003      0.003        0

a bit earlier the 3 pgn_upload pserves were 100% loaded, but not hanging (eventually we should just reduce the amount of STC pgns we upload, IMO, e.g. only for the first N tasks or so).
We still have these small bursts of api calls that take 30s or so, even though the average timing is very low. The primary pserve is around 50% (30-70%) loaded.

@vdbergh
Contributor

vdbergh commented Jul 6, 2024

Hmm the worker seems to log only successful api calls (which take a few ms), not those that time out (~30s I guess).

@vondele
Member

vondele commented Jul 6, 2024

Note that these are rare (17 out of 6810) for the table above.

@vdbergh
Contributor

vdbergh commented Jul 6, 2024

It would be nice to understand this though.

I wonder if the dead-task scavenging has something to do with it. If this scheduled task (which runs every minute) suddenly has to handle 1000s of dead tasks (which involves acquiring active_run_locks), then perhaps this may temporarily clog things up.

As a safety measure I am thinking of only scavenging a limited number of dead tasks per turn, e.g. 500. Then clearing 17000 dead tasks would take 34 minutes. This is not entirely free though, because it means runs may take longer to finish. But it seems like a worthwhile trade-off.
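
A minimal sketch of such a cap, with the deactivation helper passed in rather than assuming the real set_inactive_task signature:

```python
MAX_SCAVENGE_PER_PASS = 500  # e.g. 500 dead tasks per scheduled pass


def scavenge_dead_tasks(dead_tasks, deactivate):
    """Illustrative sketch: deactivate at most MAX_SCAVENGE_PER_PASS dead tasks
    per pass of the scheduled task; the remainder waits for the next pass."""
    scavenged = 0
    for run, task_id in dead_tasks:
        if scavenged >= MAX_SCAVENGE_PER_PASS:
            break  # leave the rest for the next minute
        deactivate(run, task_id)
        scavenged += 1
    return scavenged
```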

@vdbergh
Contributor

vdbergh commented Jul 6, 2024

See #2109

@peregrineshahin
Contributor Author

pagination needed in both cases

The thing is that I am not 100% sure that generating the data is the bottleneck. I wonder if it is not the generation of a table with > 15000 rows from that data (a table that also needs to be transferred to the client).

If the latter is the case then moving the machines api to the primary instance would be bad, since there would be less time to handle the worker apis.

@vdbergh
Contributor

vdbergh commented Jul 6, 2024

pagination needed in both cases

Yes but one should still be able to sort the whole table (not only the current page).

@peregrineshahin
Contributor Author

Actually vondele suggested this one which I forgot about:

or a smarter way to show the table. E.g. summary first per user, only after a click expand that user?
(effectively with 10k workers, just scrolling or clicking through pages is not very practical)
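
A sketch of the data side of that per-user summary (the username and concurrency fields of worker_info are assumptions here; the expand-on-click part would live in the template):

```python
from collections import defaultdict


def summarize_machines_by_user(machines):
    """Group machine records by username so the table can show one summary
    row per user (worker and core counts) before expanding the details."""
    summary = defaultdict(lambda: {"workers": 0, "cores": 0})
    for machine in machines:
        user = machine["username"]
        summary[user]["workers"] += 1
        summary[user]["cores"] += machine.get("concurrency", 0)
    return dict(sorted(summary.items()))
```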

@vdbergh
Contributor

vdbergh commented Jul 6, 2024

Sounds good. Sorting by username is indeed the only thing I do.

@vondele
Member

vondele commented Jul 6, 2024

I actually do sort per core count, compiler version, python version etc...
So maybe the grouping should be 'per sorted item' and on a click expand that? In this case we have all functionality.

@peregrineshahin
Contributor Author

I actually do sort per core count, compiler version, python version etc... So maybe the grouping should be 'per sorted item' and on a click expand that? In this case we have all functionality.

Yeah, doable. Will work on it.

@peregrineshahin changed the title from "get_machines does not use the book keeping data" to "get_machines is too slow and does not use the book keeping data" on Aug 19, 2024