Batched job processing (opt-in) #474

benjie · 2024-06-11T15:09:58Z

Description

Replaces #99 and #470.

If you're at very high scale (e.g. you're running multiple Worker instances, and each instance has high concurrency) then the act of looking for and releasing jobs can start to dominate the load on the database. This PR makes it possible to configure Graphile Worker so that getJob (via localQueueSize), completeJob (via completeJobBatchDelay) and failJob (via failJobBatchDelay) are batched, reducing this database load and improving job throughput. This is an opt-in feature, enabled via the following settings:

const preset = {
  worker: {
    localQueueSize: jobConcurrency,
    completeJobBatchDelay: 10, // milliseconds
    failJobBatchDelay: 10, // milliseconds
  }
};

If localQueueSize >= 1, pools become responsible for getting jobs: each pool grabs the number of jobs you specify up front and distributes them to its workers on demand. This is done via a "Local Queue".
If completeJobBatchDelay >= 0 or failJobBatchDelay >= 0, pools also become responsible for completing or failing jobs (respectively): they wait the specified number of milliseconds after a completeJob or failJob call and batch any other calls made in the interim. All of these results are then sent to the database at the same time, reducing the total number of transactions.
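
To make the delay-based batching concrete, here is a minimal sketch of the general technique: the first call in a window schedules a flush, later calls join the pending batch, and one flush sends everything together. This is not the code in this PR; `makeBatcher`, `Scheduler` and the injectable `schedule` parameter are hypothetical names introduced for illustration only.

```typescript
// Hypothetical sketch of delay-based batching; not Graphile Worker's internals.
type Scheduler = (fn: () => void, delayMs: number) => void;

function makeBatcher<T>(
  delayMs: number,
  // Stand-in for "send all of these results to the database at once".
  flush: (items: T[]) => Promise<void>,
  // Injectable so the timing can be controlled; defaults to a real timer.
  schedule: Scheduler = (fn, ms) => {
    setTimeout(fn, ms);
  },
): (item: T) => Promise<void> {
  let pending: T[] = [];
  let waiters: Array<{ resolve: () => void; reject: (e: unknown) => void }> = [];
  let scheduled = false;

  function send(): void {
    // Take ownership of the current batch so new calls start a fresh one.
    const items = pending;
    const callers = waiters;
    pending = [];
    waiters = [];
    scheduled = false;
    // Every caller in the batch settles with the outcome of the single flush.
    flush(items).then(
      () => callers.forEach((c) => c.resolve()),
      (e) => callers.forEach((c) => c.reject(e)),
    );
  }

  return (item) =>
    new Promise<void>((resolve, reject) => {
      pending.push(item);
      waiters.push({ resolve, reject });
      // Only the first call in a window schedules the flush.
      if (!scheduled) {
        scheduled = true;
        schedule(send, delayMs);
      }
    });
}
```

With a delay of 0 and the default scheduler, calls made before the timer callback runs still share one flush, which matches the idea that a delay of 0 still enables batching. In this sketch a failed flush rejects every call in the batch; whether that is the right trade-off depends on how failures should be attributed to individual jobs.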

Note that enabling these features changes the behavior of Worker in a few ways:

Since pools grab a batch of jobs up front, the batch represents a snapshot at that time; newer, higher-priority jobs will not be evaluated until the batch has been processed.
Since pools grab a batch of jobs up front, jobs may not be as evenly distributed across workers.
Since pools grab a batch of jobs up front, jobs may start later than they previously did, potentially increasing latency (but also increasing throughput).
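
The snapshot behaviour can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the LocalQueue implemented in this PR: `makeJobTable` and `makeLocalQueue` are hypothetical, the "database" is an array, fetching is synchronous to keep the example short, and a numerically lower priority is assumed to run first.

```typescript
// Hypothetical in-memory model of prefetching; not the PR's LocalQueue.
interface Job {
  id: number;
  priority: number; // assumption: numerically lower runs first
}

function makeJobTable() {
  const jobs: Job[] = [];
  let fetches = 0;
  return {
    add(job: Job): void {
      jobs.push(job);
    },
    // Stand-in for one database round trip returning up to `limit` jobs.
    fetchBatch(limit: number): Job[] {
      fetches++;
      jobs.sort((a, b) => a.priority - b.priority || a.id - b.id);
      return jobs.splice(0, limit);
    },
    fetchCount(): number {
      return fetches;
    },
  };
}

function makeLocalQueue(table: ReturnType<typeof makeJobTable>, size: number) {
  let buffer: Job[] = [];
  // Called by a worker when it is ready for its next job.
  return function next(): Job | undefined {
    if (buffer.length === 0) {
      // Refill only when the local batch is exhausted.
      buffer = table.fetchBatch(size);
    }
    return buffer.shift();
  };
}

const table = makeJobTable();
table.add({ id: 1, priority: 5 });
table.add({ id: 2, priority: 5 });
table.add({ id: 3, priority: 5 });

const next = makeLocalQueue(table, 3);
const order: number[] = [];
order.push(next()!.id); // triggers the first fetch: jobs 1, 2 and 3 are now local

// A more urgent job arrives after the batch was fetched.
table.add({ id: 4, priority: 0 });

order.push(next()!.id, next()!.id, next()!.id);
```

In this model the urgent job 4 is handed out last, after the prefetched jobs 2 and 3, and the four jobs cost two fetches rather than four. That is the trade described above: fewer round trips, at the price of working from a snapshot.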

Performance impact

If not enabled, impact is minimal.

If enabled, throughput improves at the cost of potential latency increases.

The results below were produced with this setup:

CPU: i9 14900K
OS: Ubuntu
OS tweaks: efficiency cores disabled, and CPU configured to use performance governor
Job count: 200,000
Worker instance count: 4 (4 Node.js processes, each running one Worker instance, via the CLI)
Worker concurrency: 24 (each of the 4 instances can process 24 jobs concurrently, for a total of 96 concurrent jobs)

Base performance:

Jobs per second: 16093.94

With localQueueSize: 500:

Jobs per second: 35177.47

Performance with localQueueSize: 500, completeJobBatchDelay: 0, failJobBatchDelay: 0 (note: even though the numbers are 0 this still enables batching, it is just limited to (roughly) a single JS event loop tick):

Jobs per second: 180684.70

Note that the workload benchmarked here is designed to put maximal stress on the database (i.e. the tasks are basically no-ops); YMMV with real-world loads.
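
For a sense of scale, the reported rates can be turned into relative speedups. The snippet below only does arithmetic on the three figures quoted above; the implied run time assumes the rate is simply job count divided by elapsed time, which the PR does not state explicitly.

```typescript
// Arithmetic on the figures reported above; no new measurements.
const jobCount = 200_000;
const baseJobsPerSecond = 16093.94;

const results = [
  { label: "base", jobsPerSecond: baseJobsPerSecond },
  { label: "localQueueSize: 500", jobsPerSecond: 35177.47 },
  { label: "localQueueSize: 500, batch delays 0", jobsPerSecond: 180684.7 },
];

const summary = results.map((r) => ({
  label: r.label,
  // Throughput relative to the unbatched baseline.
  speedup: r.jobsPerSecond / baseJobsPerSecond,
  // Assumption: rate = jobCount / elapsed seconds.
  impliedSeconds: jobCount / r.jobsPerSecond,
}));

for (const s of summary) {
  console.log(
    `${s.label}: ${s.speedup.toFixed(2)}x, ~${s.impliedSeconds.toFixed(2)}s for ${jobCount} jobs`,
  );
}
```

This works out to roughly 2.19x the base throughput with the local queue alone, and roughly 11.23x with the local queue plus batched completeJob and failJob.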

The CPUs were configured with this script:

#!/usr/bin/env bash

# Enable performance governor
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Disable the efficiency cores
for i in {16..31}; do
    echo 0 | sudo tee /sys/devices/system/cpu/cpu$i/online
done

Security impact

Not known.

Checklist

My code matches the project's code style and yarn lint:fix passes.
I've added tests for the new feature, and yarn test passes.
I have detailed the new feature in the relevant documentation.
I have added this feature to 'Pending' in the RELEASE_NOTES.md file (if one exists).
~~If this is a breaking change I've explained why.~~

benjie · 2024-07-01T09:04:06Z

benjie mentioned this pull request Jun 11, 2024

Batch job processing #470

Closed

5 tasks

benjie force-pushed the pool-centric branch from 7868f57 to 007ca06 Compare June 11, 2024 15:31

benjie mentioned this pull request Jun 14, 2024

Add JS method to get the outstanding job count + queue depth #380

Open

6 tasks

benjie added 26 commits July 11, 2024 15:59

Start work on batch fetching jobs

1237c23

Batched job fetching with watermark

c356e0a

Fix getJob call to reflect changes in #469

9781bdc

Hoist completeJob and failJob

892733f

Refactor failJob/completeJob in preparation for batching

e4782d1

Stub batch function

2dacae8

Implement batching function

080b674

Lint fix

79eee2d

Tweak graceful/forceful shutdown handover

3586553

Evaluate envvar just once up top

a4acb7e

Release batches before releasing worker pool

aa58a94

Tweak perfTest for more logging

c1c4c8f

Fairer startup time calc

f201f60

Refactor

ea6ae05

Warn about bad batch settings

070de02

Refactor in preparation for new queue

772962e

Refactor again and explain the purpose of LocalQueue and how it works

2f0a54f

Implement localQueue

c0592c9

Return to POLLING from WAITING; RELEASE mode.

c2077d9

Once only once.

6d26ee3

Fix releasing from POLLING mode

b520cde

Concurrency 24 works better on my new machine

5f96eac

Wait for background tasks to complete before releasing

42ed1b1

Enable ts-check on graphile.config.js

f6890ee

Add types to latencyTest file

96e8f36

Default settings

963f5bc

benjie added 30 commits November 18, 2024 15:25

Fix the syntax makeSelectionOfJobs

89813a5

Upgrade jest time helpers so sleep is unref'd

64fd62c

Test that graceful shutdown works in runOnce

8a2bb75

Test gracefulShutdown of run

5e64200

Fix runOnce test

2055211

Upgrade graphile-config

003ef81

Implement middleware system

9f52007

Fix types for makeJobHelpers

b453de2

gracefulShutdown middleware

5ebd5d3

Shutting down non-continuous worker should go via gracefulShutdown path

e5810bf

f

a5e690f

Add middleware for forcefulShutdown, guarantee event emitter, rework init middleware

e23a214

Clarify relationship between CompiledSharedOptions and WorkerPluginContext and share relevant types

0d525b9

Fix lint issues

5190bb6

Only shutdown if we're not already doing so

b6b1bc0

Fix onTerminate handling

407e169

Forceful shutdown should error if something went wrong

6ccbbfb

Forceful shutdown should result in promise rejection

d3edcb4

Refactoring to ensure all cases are handled

ef1f7cf

We don't release job releasers here any more.

03d1dee

Ensure that LocalQueue exits with the correct status (e.g. rejects if returning jobs fails)

bfb7c36

Fix types in a test

e76b3af

I threw it myself, I know what it is

17bea7c

Ensure deactivate happens at most once, and yields same errors if called a second time (e.g. from forcefulShutdown)

ae310f5

Ensure forcefulShutdown and gracefulShutdown yield the same promises when called a second time

13740a4

Graceful shutdown should not complete if forceful shutdown has begun

128f52e

Cleanup gracefulShutdown handover to forcefulShutdown

7a8f872

More consistently handle errors in forcefulShutdown

d14f3bf

Move promise handling outside of the gracefulShutdown/forcefulShutdown hooks

48ac9ea

Only finish once

6f392f3
