
Conversation

@tbadalov

The problem

When the cost of creating a connection is high enough (e.g. longer than connection-timeout, say a few seconds or more), the pool fails to reach a sufficient number of connections to satisfy demand (the request rate). Even when com.zaxxer.hikari.blockUntilFilled is set to true so that the app waits at startup until the pool is filled with minimum-idle connections, once connections reach max-lifetime they are closed faster than the refill rate can keep up. This leads to a high number of waiting threads and a small number of active and idle connections (fewer than minimum-idle), either periodically or, in the worst case, as the new norm.

In addition, since PoolEntryCreator creates and adds new connections sequentially, initialization-fail-timeout must be large enough to guarantee that the minimum idle connections are ready, which can quickly grow to several minutes, unnecessarily so, since it would take a fraction of that time had the connections been created concurrently. Moreover, when we prefer the smallest max-lifetime (which is 30s), connections can start closing in the middle of blockUntilFilled at startup, so by the end of the prefill we have fewer than the desired number of idle connections.

Another point is that, while a creator task is already running, the refillPool() method simply submits another creator task, which ends up in the queue because the executor's corePoolSize = maximumPoolSize = 1. Since the creator task adds connections sequentially in a loop, by the time it finishes, little changes when the next creator task starts its own loop: shouldContinueCreating() has just returned false, and even if the condition becomes true again, the new task will barely help by adding a few connections. At this point the queue is nothing more than an indicator of demand, based on its size, rather than something that truly preserves refill requests and helps meet the demand.
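The single-creator behavior described above can be reproduced in isolation. The sketch below is hypothetical (the class name, TARGET, and the simulated connect are stand-ins, not HikariCP code): a second task submitted to a core = max = 1 executor waits in the queue, and by the time it runs, the first task's loop has already satisfied the target, so the queued "refill request" accomplishes nothing.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class SequentialRefillDemo {
    static final int TARGET = 5;                       // stand-in for minimum-idle
    static final AtomicInteger pooled = new AtomicInteger();

    // Creator task: loops until the pool reaches TARGET, like PoolEntryCreator's loop.
    static final Runnable creator = () -> {
        while (pooled.get() < TARGET) {
            pooled.incrementAndGet();                  // stands in for a slow physical connect
        }
    };

    public static void main(String[] args) throws Exception {
        // core = max = 1: the second submitted task just sits in the queue.
        ExecutorService exec = new ThreadPoolExecutor(1, 1,
                0, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        exec.submit(creator);
        exec.submit(creator);                          // queued "refill request"
        exec.shutdown();
        exec.awaitTermination(5, TimeUnit.SECONDS);
        // The second task found TARGET already met and added nothing.
        System.out.println("pooled = " + pooled.get());
    }
}
```

Because the executor runs tasks one at a time, the second task deterministically finds the loop condition false and exits immediately, which is exactly why the queue only signals demand.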

Discussion

Some might ask why establishing a connection would take longer than a few seconds, and argue that this problem should be fixed at its source in the first place. In most cases that would be fair enough, but it is not always possible, and in fact I believe the whole point of a connection pool is to take on the burden/cost of creating connections and keep warm connections available regardless of how long a single connection takes to create.

Another question some might ask: why would connection-timeout be less than the time it takes to establish a connection? That is a legitimate argument if requests can afford to run that long, which may be the case for apps with relatively low traffic. However, in comparatively low-latency systems handling a large number of requests, we have to make sure that if the DB is unavailable we do not wait longer than a few hundred milliseconds, so as not to saturate threads and to avoid cascading failures. In that case we rely on the connection pool to maintain connections in the background and have them ready whenever needed.

I think for this type of scenario, the strategy for maintaining connections could be discussed further and perhaps taken to the next level. In the short term, though, creating new connections concurrently should be a good enough solution to meet the aforementioned requirements.
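As a sketch of what "creating new connections concurrently" could look like (hypothetical class names and sizing, not HikariCP's actual code): size the creator executor from the CPU count capped by minimum-idle, use core == max so tasks start immediately, and discard refill requests that arrive while every creator thread is busy rather than queueing them.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentCreatorDemo {
    static final int MINIMUM_IDLE = 8;                 // hypothetical minimum-idle setting
    static final int CREATORS =
            Math.min(Runtime.getRuntime().availableProcessors() * 2, MINIMUM_IDLE);
    static final AtomicInteger created = new AtomicInteger();

    public static void main(String[] args) throws Exception {
        // core == max so every submitted task gets its own thread immediately;
        // DiscardPolicy silently drops refill requests that arrive while all
        // creator threads are busy, instead of piling them up in a queue.
        ThreadPoolExecutor exec = new ThreadPoolExecutor(CREATORS, CREATORS,
                10, TimeUnit.SECONDS, new SynchronousQueue<>(),
                new ThreadPoolExecutor.DiscardPolicy());
        CountDownLatch done = new CountDownLatch(CREATORS);
        for (int i = 0; i < CREATORS; i++) {
            exec.execute(() -> {
                created.incrementAndGet();             // stands in for one slow physical connect
                done.countDown();
            });
        }
        done.await(5, TimeUnit.SECONDS);
        exec.shutdown();
        System.out.println(created.get() + " connections created in parallel");
    }
}
```

With a SynchronousQueue and core == max, each execute() hands its task straight to a fresh worker thread, so all CREATORS connect attempts run in parallel instead of back to back.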

Overall, if there is a better way of tackling the problem, feel free to comment on how the code change should look. I would appreciate it if you review this PR and let me know what you think. Cheers!

  @SuppressWarnings("NullableProblems")
  public Thread newThread(Runnable r) {
-     var thread = new Thread(r, threadName);
+     var thread = new Thread(r, threadName + "_" + counter.incrementAndGet());

P.S. If the overall idea is good, I still have to change this part so that the thread naming doesn't interfere with other existing executors.

-     var backoffMs = 10L;
+     long jitterMs = ThreadLocalRandom.current().nextLong(Math.max(10, config.getConnectionTimeout() / 10));
+     if (connectionsInProgress.get() >= Runtime.getRuntime().availableProcessors()) {
+         quietlySleep(jitterMs);

Added sleeping for a jittered interval to avoid potential CPU throttling in CPU-limited environments such as Kubernetes when multiple connections are created at once.
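The jitter computation can be sketched in isolation. The class name and the connection-timeout value below are hypothetical; the bound matches the expression in the diff above: a random delay in [0, max(10, connectionTimeout / 10)) milliseconds.

```java
import java.util.concurrent.ThreadLocalRandom;

public class JitterDemo {
    static final long CONNECTION_TIMEOUT_MS = 3000;    // hypothetical connection-timeout

    // Random jitter in [0, max(10, timeout/10)) ms, so that many concurrent
    // connect attempts (TLS handshakes etc.) don't all spike the CPU at once.
    static long nextJitterMs() {
        return ThreadLocalRandom.current()
                .nextLong(Math.max(10, CONNECTION_TIMEOUT_MS / 10));
    }

    public static void main(String[] args) {
        System.out.println("sleeping " + nextJitterMs() + " ms before the next connect attempt");
    }
}
```

With a 3000 ms timeout the delay is always under 300 ms, so the jitter spreads work out without meaningfully delaying the refill.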

  final int maxPoolSize = config.getMaximumPoolSize();
- this.addConnectionExecutor = createThreadPoolExecutor(maxPoolSize, poolName + ":connection-adder", threadFactory, new CustomDiscardPolicy());
+ this.addConnectionExecutor = createCreatorThreadPoolExecutor(
+         Math.min(Runtime.getRuntime().availableProcessors() * 2, config.getMinimumIdle()),
@tbadalov May 28, 2025

Also, the core pool size will have to be the same as the max pool size (the executor's, not Hikari's), so that the executor doesn't hold tasks in the queue and executes them immediately. Something to fix.
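The queueing behavior this comment refers to can be demonstrated on its own. The sketch below is hypothetical (not HikariCP code): it submits four blocking tasks to a ThreadPoolExecutor backed by an unbounded queue, once with core < max and once with core == max. With an unbounded queue the pool never grows past core, so tasks queue instead of starting.

```java
import java.util.concurrent.*;

public class CoreEqualsMaxDemo {
    // Submit 4 blocking tasks, then report {active threads, queued tasks}.
    static int[] probe(int core, int max) throws InterruptedException {
        ThreadPoolExecutor exec = new ThreadPoolExecutor(core, max,
                10, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < 4; i++) {
            exec.execute(() -> {
                try { release.await(); } catch (InterruptedException ignored) { }
            });
        }
        Thread.sleep(200);                             // let workers start
        int[] result = { exec.getActiveCount(), exec.getQueue().size() };
        release.countDown();
        exec.shutdown();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        // core < max + unbounded queue: the pool never grows past core,
        // so 3 of the 4 tasks sit in the queue instead of running.
        int[] lazy = probe(1, 4);
        System.out.println("core<max : active=" + lazy[0] + " queued=" + lazy[1]);
        // core == max: every task gets a thread immediately, nothing queues.
        int[] eager = probe(4, 4);
        System.out.println("core==max: active=" + eager[0] + " queued=" + eager[1]);
    }
}
```

This is why the comment says the creator executor's core size must equal its max size: ThreadPoolExecutor only spawns threads beyond core when the queue rejects the task, and an unbounded queue never does.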

@brettwooldridge brettwooldridge force-pushed the dev branch 2 times, most recently from d7a9b2f to e56fa46 Compare July 20, 2025 16:24