Remove idle and saturated sets from scheduler #8889

Draft: wants to merge 1 commit into main

fjetter (Member) commented Oct 11, 2024

This is an attempt at removing the idle/saturated/idle_task_count containers on the scheduler.

  • They are expensive to maintain. They inherently rely on occupancy, a quantity that cannot be maintained accurately in constant time. A while ago we therefore fell back to decomposing occupancy into several quantities that can be maintained online, but the occupancy itself has to be computed on demand (scaling linearly with the number of task prefixes). This can be expensive considering how often check_idle_saturated has to be evaluated.
  • Their definition is arguably fragile and hard to grasp, particularly since they rely on occupancy, itself a fragile quantity that has historically put network load on an equal footing with compute load, which caused plenty of confusion in the past. It also makes reliable testing notoriously difficult.
  • Ultimately, these sets are maintained only for performance reasons. Beyond that, their only functional purpose is to control the color of the worker bars: saturated workers (a state that is incredibly difficult to attain) are colored yellow. That's it.
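
To illustrate the cost argument in the first bullet, here is a minimal sketch of occupancy decomposed into per-prefix task counts. The class `WorkerOccupancy`, its methods, and the `avg_duration` mapping are hypothetical names for illustration and are not the scheduler's actual data structures. The counts are cheap to update online, while the occupancy has to be summed over all prefixes each time it is requested.

```python
from collections import defaultdict


class WorkerOccupancy:
    """Hypothetical per-worker bookkeeping, not distributed's real class."""

    def __init__(self):
        # Number of assigned tasks per task prefix; O(1) to update.
        self.tasks_per_prefix = defaultdict(int)

    def add_task(self, prefix):
        self.tasks_per_prefix[prefix] += 1

    def remove_task(self, prefix):
        self.tasks_per_prefix[prefix] -= 1
        if self.tasks_per_prefix[prefix] == 0:
            del self.tasks_per_prefix[prefix]

    def occupancy(self, avg_duration):
        # Computed on demand: one term per prefix present on this worker,
        # so the cost grows linearly with the number of prefixes.
        return sum(
            count * avg_duration[prefix]
            for prefix, count in self.tasks_per_prefix.items()
        )


w = WorkerOccupancy()
w.add_task("inc")
w.add_task("inc")
w.add_task("sum")

# Assumed average runtimes per prefix, in seconds.
durations = {"inc": 0.5, "sum": 2.0}
total = w.occupancy(durations)  # 2 * 0.5 + 1 * 2.0
```

In this sketch a cached total would go stale whenever a duration estimate changes, which is one plausible reason the sum is recomputed on every call.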

While ripping this out I simplified the stealing code and eradicated a couple of sources of non-determinism.

The most important change, which is also causing some of the tests to fail, is that in the current iteration I chose to define possible thieves on the basis of the number of threads that are available. Previously, all workers classified as idle, i.e. all workers with less than half of the average occupancy, were considered possible thieves. This stealing code is therefore much, much more conservative: it is more reliable and predictable, but also much less aggressive, and it cannot enforce absolute homogeneity.
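
The difference between the two thief definitions can be sketched as follows. This is an illustration under assumptions, not the code in this PR: the function names are made up, and "threads that are available" is interpreted here as "fewer processing tasks than threads".

```python
def idle_by_occupancy(occupancy):
    """Previous rule as described: below half of the average occupancy."""
    avg = sum(occupancy.values()) / len(occupancy)
    return sorted(w for w, occ in occupancy.items() if occ < avg / 2)


def thieves_by_threads(nthreads, processing):
    """New rule (assumed reading): workers with at least one free thread."""
    return sorted(w for w in nthreads if processing[w] < nthreads[w])


# Three hypothetical workers with two threads each.
nthreads = {"a": 2, "b": 2, "c": 2}
processing = {"a": 6, "b": 2, "c": 1}       # tasks currently assigned
occupancy = {"a": 12.0, "b": 2.0, "c": 1.0}  # estimated seconds of work

old_thieves = idle_by_occupancy(occupancy)          # "b" and "c"
new_thieves = thieves_by_threads(nthreads, processing)  # only "c"
```

Worker "b" is fully busy but far below the average occupancy, so only the occupancy-based rule lets it steal. That is the sense in which the thread-based rule is more conservative.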

The lack of determinism often originates from the use of sets or from missing tie-breakers when sorting. I believe that even without removing the idle/saturated sets it makes sense to remove those sources of non-determinism, particularly since they can also subtly affect global task ordering.
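
A small sketch of the tie-breaker idea, using a hypothetical `pick_thief` helper and made-up worker names rather than anything from the scheduler: when candidates come from a set and several of them tie on the primary criterion, adding a stable secondary key makes the choice independent of iteration order.

```python
def pick_thief(candidates):
    """Pick the candidate with the most free threads.

    candidates: iterable of (worker_name, free_threads) pairs.
    Ties on free_threads are broken by worker name, so the result
    does not depend on the iteration order of the input.
    """
    return min(candidates, key=lambda c: (-c[1], c[0]))[0]


# A set of tuples containing strings has no guaranteed iteration order
# across interpreter runs (string hashing is randomized by default).
candidates = {("w2", 1), ("w1", 1), ("w3", 0)}
chosen = pick_thief(candidates)
```

Without the second key component, `min` would return whichever of "w1" and "w2" the set happens to yield first.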

I still have to run some actual tests on this. So far this is all rather theoretical.


Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    25 files  ±0      25 suites  ±0   10h 21m 40s ⏱️ - 3m 1s
 4 130 tests ±0   3 988 ✅  -  28    110 💤 ±0   32 ❌ + 28 
47 708 runs  ±0  45 327 ✅  - 282  2 095 💤 ±0  286 ❌ +282 

For more details on these failures, see this check.

Results for commit 393ee21. ± Comparison against base commit ecee9e8.
