Issue
Coiled cluster: https://cloud.coiled.io/clusters/674429/account/dask-benchmarks-gcp/information?customGroup=100&tab=Code (not accessible to everyone, so I will provide the most relevant information below)
When running test_climatology::test_highlevel_api at medium scale on a Coiled cluster, the scheduler stalls periodically for up to a minute. No incoming scheduler metrics appear in Prometheus, the scheduler logs that its event loop was blocked, and the scheduler CPU is as busy as ever.
Diagnosis
This issue appears to be caused by a combination of large task graphs and garbage collection. When tracking the garbage collections of generation 2 (the most long-lived one), we see that they correspond to the periods of scheduler unresponsiveness observable in the Prometheus data we collected:
Log excerpt:
...
(scheduler) 2024-12-04 10:56:46.372000 distributed.gc - INFO - generation: 2, duration: 43 s
(scheduler) 2024-12-04 10:56:46.373000 distributed.gc - INFO - done, 562758 collected, 0 uncollectable
(scheduler) 2024-12-04 10:56:46.373000 distributed.gc - INFO - objects in each generation: 1720 0 44431094
...
(scheduler) 2024-12-04 11:04:50.870000 distributed.gc - INFO - generation: 2, duration: 41 s
(scheduler) 2024-12-04 11:04:50.870000 distributed.gc - INFO - done, 927074 collected, 0 uncollectable
(scheduler) 2024-12-04 11:04:50.871000 distributed.gc - INFO - objects in each generation: 5489 0 44929016
(scheduler) 2024-12-04 11:04:50.915000 distributed.core - INFO - Event loop was unresponsive in Scheduler for 66.16s. This is often caused by
long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
...
(scheduler) 2024-12-04 11:10:03.228000 distributed.gc - INFO - generation: 2, duration: 42 s
(scheduler) 2024-12-04 11:10:03.229000 distributed.gc - INFO - done, 969091 collected, 0 uncollectable
(scheduler) 2024-12-04 11:10:03.231000 distributed.gc - INFO - objects in each generation: 4865 0 45020404
(scheduler) 2024-12-04 11:10:03.255000 distributed.core - INFO - Event loop was unresponsive in Scheduler for 67.42s. This is often caused by
long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
...
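For reference, per-collection timings like the distributed.gc lines above can be gathered with a gc.callbacks hook. The following is a generic sketch of that technique, not distributed's actual implementation:

import gc
import time

# Record the start time of each collection and report long generation-2 passes.
_starts = {}

def _gc_timer(phase, info):
    if phase == "start":
        _starts[info["generation"]] = time.monotonic()
    elif phase == "stop" and info["generation"] in _starts:
        duration = time.monotonic() - _starts.pop(info["generation"])
        if info["generation"] == 2:
            print(f"generation: 2, duration: {duration:.0f} s, collected: {info['collected']}")

gc.callbacks.append(_gc_timer)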
We also notice that the number of objects tracked in generation 2 is huge: there are ~45 MM objects that the garbage collector has to scan periodically! These are the most common types by count:
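A breakdown like this can be produced directly from the collector; here is a minimal sketch, to be run inside the scheduler process (e.g. via Client.run_on_scheduler):

import gc
from collections import Counter

# Count all objects currently tracked by the cyclic GC, grouped by type.
def tracked_type_counts(top=10):
    counts = Counter(type(obj).__qualname__ for obj in gc.get_objects())
    return counts.most_common(top)

for typename, count in tracked_type_counts():
    print(f"{count:>12,}  {typename}")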
We see that we have the expected ~1.2 MM TaskState objects, but also several _task_spec-related classes that contribute ~15 MM objects, i.e., 1/3 of all tracked objects. We take a closer look at the individual classes when discussing solutions.
Possible solutions
Broadly, there are two ways of reducing the cost of GC:
1. Reduce the cost of tracking individual objects
2. Track fewer objects
Reduce the cost of tracking individual objects
To assess the potential impact of this, let's calculate a lower bound for collecting ~45 MM objects without any attributes (let alone cyclic references):
import gc

class Foo:
    __slots__ = ()

objs = [Foo() for i in range(45_000_000)]

%timeit gc.collect()
944 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Adding a single slotted int attribute roughly doubles this to 1.8 s:
import gc

class Foo:
    a: int
    __slots__ = ("a",)

    def __init__(self, a):
        self.a = a

objs = [Foo(i) for i in range(45_000_000)]

%timeit gc.collect()
1.8 s ± 57.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Removing the __slots__ altogether, we end up with a 10x increase over the trivial case:
import gc

class Foo:
    a: int

    def __init__(self, a):
        self.a = a

objs = [Foo(i) for i in range(45_000_000)]

%timeit gc.collect()
13.1 s ± 231 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We can see that while there is some room for improvement, the GC cost increases rapidly once objects become more complex. Even if we managed to get down to the cost of the second example, a simple extrapolation (10 × 1.8 s = 18 s) shows that this does not scale to 10x the amount of data/tasks without GC pauses going well beyond single-digit seconds. As a result, this approach might serve as a short-term patch but would not allow Dask to scale further.
Track fewer objects
A closer look at the tracked types shows that the _task_spec classes contribute ~15 MM objects, or 1/3 of the total. These classes also contain a number of collections, e.g., to track dependencies, so I estimate that the majority of all tracked objects are tied to the task graph representation. If we were able to untrack these objects, we would significantly reduce the cost of GC. For this, Python offers gc.freeze. This call essentially moves all currently tracked objects to the permanent generation and ignores them in subsequent collections.
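As a rough sketch of the mechanics (not a concrete proposal for where to hook this into the scheduler):

import gc

# Clean up once, then move every object the collector currently tracks into the
# permanent generation so that later collections skip them. Frozen objects are
# still freed by reference counting when they die; they are just no longer
# traversed by the cyclic collector.
gc.collect()
gc.freeze()
print(gc.get_freeze_count())  # number of objects now in the permanent generation

# If we ever need the collector to see these objects again:
gc.unfreeze()
gc.collect()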
When to gc.freeze?
I see two possible points:
1. freeze after materialization
2. freeze after update_graph
With option 2, the TaskState objects also get frozen; they contain cyclic references, so they would never get cleaned up once frozen. With option 1, we only freeze the GraphNode and related objects (at least 1/3 of the total), and we expect GraphNode objects not to have any cycles, so they should eventually get cleaned up through reference counting, unlike the TaskState objects. Given the cyclic references of TaskState objects, freezing after materialization seems preferable.
When to gc.unfreeze?
This is an open question. When, if ever, do we want to unfreeze the permanent generation?
How to deal with repeated calls to update_graph?
When freezing after materialization, we avoid freezing TaskState objects for the first invocation. However, if update_graph gets called repeatedly and the TaskState objects still exist, we would freeze them with the subsequent calls to gc.freeze.
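To make the concern concrete, here is a small illustration; TaskStateStub is a hypothetical stand-in for the scheduler's TaskState, not the real class:

import gc

class TaskStateStub:
    """Hypothetical stand-in for the scheduler's TaskState objects."""

gc.freeze()                                       # e.g. the freeze after the first materialization
baseline = gc.get_freeze_count()

tasks = [TaskStateStub() for _ in range(10_000)]  # objects created by a later update_graph
gc.freeze()                                       # a later freeze also pins these live objects
print(gc.get_freeze_count() - baseline)           # roughly 10_000 additional permanent objects

gc.unfreeze()                                     # undo the demo's freezes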
Separately, we may also want to reduce the number of tracked objects caused by objects related to the TaskState class. For example, these classes contain various set[TaskState]. These sets get tracked by the GC. However, we are only interested in having a set-like collection of the keys. By converting these sets to dict[Key, None], we can untrack these collections. (dicts of untracked objects are not tracked, but sets are).
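A quick illustration of that tracking behaviour; the strings here are just stand-ins for Dask task keys:

import gc

keys = ("task-0", "task-1")        # stand-ins for task keys; strings are not GC-tracked

as_set = set(keys)
as_dict = dict.fromkeys(keys)      # the proposed dict[Key, None] replacement

print(gc.is_tracked(as_set))       # True: sets are always tracked by the cyclic GC
print(gc.is_tracked(as_dict))      # False: a dict holding only untracked keys/values stays untracked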
We may also want to get rid of the unnecessary weakrefs.