Issue
Coiled cluster: https://cloud.coiled.io/clusters/674429/account/dask-benchmarks-gcp/information?customGroup=100&tab=Code (not accessible to everyone, so I will provide the most relevant information below)
When running test_climatology::test_highlevel_api at medium scale on a Coiled cluster, the scheduler stalls periodically for up to a minute. No incoming scheduler metrics appear in Prometheus, the scheduler logs that its event loop was blocked, and the scheduler CPU is as busy as ever.
Diagnosis
This issue appears to be caused by a combination of large task graphs and garbage collection. When tracking the garbage collections of generation 2 (the most long-lived one), we see that they correspond to the periods of scheduler unresponsiveness observable in the Prometheus data we collected:
Log excerpt:
...
(scheduler) 2024-12-04 10:56:46.372000 distributed.gc - INFO - generation: 2, duration: 43 s
(scheduler) 2024-12-04 10:56:46.373000 distributed.gc - INFO - done, 562758 collected, 0 uncollectable
(scheduler) 2024-12-04 10:56:46.373000 distributed.gc - INFO - objects in each generation: 1720 0 44431094
...
(scheduler) 2024-12-04 11:04:50.870000 distributed.gc - INFO - generation: 2, duration: 41 s
(scheduler) 2024-12-04 11:04:50.870000 distributed.gc - INFO - done, 927074 collected, 0 uncollectable
(scheduler) 2024-12-04 11:04:50.871000 distributed.gc - INFO - objects in each generation: 5489 0 44929016
(scheduler) 2024-12-04 11:04:50.915000 distributed.core - INFO - Event loop was unresponsive in Scheduler for 66.16s. This is often caused by
long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
...
(scheduler) 2024-12-04 11:10:03.228000 distributed.gc - INFO - generation: 2, duration: 42 s
(scheduler) 2024-12-04 11:10:03.229000 distributed.gc - INFO - done, 969091 collected, 0 uncollectable
(scheduler) 2024-12-04 11:10:03.231000 distributed.gc - INFO - objects in each generation: 4865 0 45020404
(scheduler) 2024-12-04 11:10:03.255000 distributed.core - INFO - Event loop was unresponsive in Scheduler for 67.42s. This is often caused by
long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
...
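For reference, per-collection timings like the distributed.gc lines above can be gathered with a gc.callbacks hook. The following is a generic sketch of that technique, not distributed's actual implementation:

import gc
import time

# Record the start time of each collection and report long generation-2 passes.
_starts = {}

def _gc_timer(phase, info):
    if phase == "start":
        _starts[info["generation"]] = time.monotonic()
    elif phase == "stop" and info["generation"] in _starts:
        duration = time.monotonic() - _starts.pop(info["generation"])
        if info["generation"] == 2:
            print(f"generation: 2, duration: {duration:.0f} s, collected: {info['collected']}")

gc.callbacks.append(_gc_timer)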
We also notice that the number of objects tracked in generation 2 is huge: there are ~45 MM objects that the garbage collector has to scan periodically! These are the most common types by count:
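A breakdown like this can be produced directly from the collector; here is a minimal sketch, to be run inside the scheduler process (e.g. via Client.run_on_scheduler):

import gc
from collections import Counter

# Count all objects currently tracked by the cyclic GC, grouped by type.
def tracked_type_counts(top=10):
    counts = Counter(type(obj).__qualname__ for obj in gc.get_objects())
    return counts.most_common(top)

for typename, count in tracked_type_counts():
    print(f"{count:>12,}  {typename}")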
We see that we have the expected ~1.2 MM TaskState objects, but also several _task_spec-related classes that contribute ~15 MM objects, i.e., 1/3 of all tracked objects. We take a closer look at the individual classes when discussing solutions.
Possible solutions
Broadly, there are two ways of reducing the cost of GC:
1. Reduce the cost of tracking individual objects
2. Track fewer objects
Reduce the cost of tracking individual objects
To assess the potential impact of this, let's calculate a lower bound for collecting ~45 MM objects without any attributes (let alone cyclic references):
import gc

class Foo:
    __slots__ = ()

objs = [Foo() for i in range(45_000_000)]

%timeit gc.collect()
944 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Adding a single slotted int attribute roughly doubles this to 1.8 s:
import gc

class Foo:
    a: int
    __slots__ = ("a",)

    def __init__(self, a):
        self.a = a

objs = [Foo(i) for i in range(45_000_000)]

%timeit gc.collect()
1.8 s ± 57.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Removing the __slots__ altogether, we end up with a 10x increase over the trivial case:
import gc

class Foo:
    a: int

    def __init__(self, a):
        self.a = a

objs = [Foo(i) for i in range(45_000_000)]

%timeit gc.collect()
13.1 s ± 231 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We can see that while there is some room for improvement, the GC cost increases rapidly once objects become more complex. Even if we managed to get down to the cost of the second example, a simple extrapolation (10 × 1.8 s = 18 s) shows that this does not scale to 10x the amount of data/tasks without GC pauses going well beyond single-digit seconds. As a result, this approach might serve as a short-term patch but would not allow Dask to scale further.
Track fewer objects
A closer look at the tracked types shows that the _task_spec classes contribute ~15 MM objects, or 1/3 of the total. These classes also contain a number of collections, e.g., to track dependencies, so I estimate that the majority of all tracked objects are tied to the task graph representation. If we were able to untrack these objects, we would significantly reduce the cost of GC. For this, Python offers gc.freeze. This call essentially moves all currently tracked objects to the permanent generation and ignores them in subsequent collections.
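As a rough sketch of the mechanics (not a concrete proposal for where to hook this into the scheduler):

import gc

# Clean up once, then move every object the collector currently tracks into the
# permanent generation so that later collections skip them. Frozen objects are
# still freed by reference counting when they die; they are just no longer
# traversed by the cyclic collector.
gc.collect()
gc.freeze()
print(gc.get_freeze_count())  # number of objects now in the permanent generation

# If we ever need the collector to see these objects again:
gc.unfreeze()
gc.collect()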
When to gc.freeze?
I see two possible points:
1. freeze after materialization
2. freeze after update_graph
With option 2, the TaskState objects also get frozen; they contain cyclic references, so they would never get cleaned up once frozen. With option 1, we only freeze the GraphNode and related objects (at least 1/3 of the total), and we expect GraphNode objects not to have any cycles, so they should eventually get cleaned up through reference counting, unlike the TaskState objects. Given the cyclic references of TaskState objects, freezing after materialization seems preferable.
When to gc.unfreeze?
This is an open question. When, if ever, do we want to unfreeze the permanent generation?
How to deal with repeated calls to update_graph?
When freezing after materialization, we avoid freezing TaskState objects for the first invocation. However, if update_graph gets called repeatedly and the TaskState objects still exist, we would freeze them with the subsequent calls to gc.freeze.
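To make the concern concrete, here is a small illustration; TaskStateStub is a hypothetical stand-in for the scheduler's TaskState, not the real class:

import gc

class TaskStateStub:
    """Hypothetical stand-in for the scheduler's TaskState objects."""

gc.freeze()                                       # e.g. the freeze after the first materialization
baseline = gc.get_freeze_count()

tasks = [TaskStateStub() for _ in range(10_000)]  # objects created by a later update_graph
gc.freeze()                                       # a later freeze also pins these live objects
print(gc.get_freeze_count() - baseline)           # roughly 10_000 additional permanent objects

gc.unfreeze()                                     # undo the demo's freezes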
Separately, we may also want to reduce the number of tracked objects caused by objects related to the TaskState class. For example, these classes contain various set[TaskState]. These sets get tracked by the GC. However, we are only interested in having a set-like collection of the keys. By converting these sets to dict[Key, None], we can untrack these collections. (dicts of untracked objects are not tracked, but sets are).
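A quick illustration of that tracking behaviour; the strings here are just stand-ins for Dask task keys:

import gc

keys = ("task-0", "task-1")        # stand-ins for task keys; strings are not GC-tracked

as_set = set(keys)
as_dict = dict.fromkeys(keys)      # the proposed dict[Key, None] replacement

print(gc.is_tracked(as_set))       # True: sets are always tracked by the cyclic GC
print(gc.is_tracked(as_dict))      # False: a dict holding only untracked keys/values stays untracked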
We may also want to get rid of the unnecessary weakrefs.