Skip to content

Fix NPE in ChannelPool metrics methods when pool is uninitialized#928

Merged
dkropachev merged 2 commits into
scylla-4.xfrom
fix/customer-413-channel-pool-npe-metrics
Jun 23, 2026
Merged

Fix NPE in ChannelPool metrics methods when pool is uninitialized#928
dkropachev merged 2 commits into
scylla-4.xfrom
fix/customer-413-channel-pool-npe-metrics

Conversation

@nikagra

@nikagra nikagra commented Jun 17, 2026

Copy link
Copy Markdown

Summary

Fixes a NullPointerException when Dropwizard Metrics reporters scrape per-node gauge values while the driver is reconnecting to a node.

Affects: com.scylladb:java-driver-core (confirmed in 4.18.0.1, same code present in scylla-4.x).


Root Cause

ChannelPool has a ChannelSet[] channels field that is not initialized at construction — it starts as null:

// ChannelPool.java
@VisibleForTesting ChannelSet[] channels;  // null until initialize() runs

channels is only assigned inside SingleThreaded.initialize(), which is called only when the first channel connection succeeds:

private void initialize(DriverChannel c) {
    ...
    channels = new ChannelSet[shardsCount];   // assigned here
    ...
    initialized = true;                        // set AFTER channels
}

When the initial connection fails, makeInitialConnection() completes connectFuture immediately — with channels still null — and starts a reconnect in the background:

// on the failure branch of makeInitialConnection():
reconnection.start();
connectFuture.complete(ChannelPool.this);  // pool enters PoolManager.pools with channels == null

PoolManager.onPoolsInit() unconditionally puts every completed pool into the pools map (PoolManager.java:252). The Javadoc comment there even notes: "pool init always succeeds". This is intentional design — but it means a pool with channels == null can be in the map.

Why the existing null-check doesn't help

AbstractMetricUpdater guards against a missing pool entry, but not against an uninitialized pool:

// AbstractMetricUpdater.java
protected int inFlightRequests(Node node) {
    ChannelPool pool = context.getPoolManager().getPools().get(node);
    return (pool == null) ? 0 : pool.getInFlight();  // pool is non-null, but channels is null
}

The NPE

The three gauge methods — and size() — call Arrays.stream(channels) with no null guard:

public int getInFlight() {
    return Arrays.stream(channels).mapToInt(ChannelSet::getInFlight).sum();
    //                   ^^^^^^^^ NullPointerException when channels == null
}

Arrays.stream(null) throws NullPointerException. This is triggered by any Dropwizard reporter (JMX, Graphite, Prometheus bridge, etc.) that calls Gauge.getValue() during the window between a failed initial connection and a successful reconnect.

Full call chain

Metrics reporter  →  Gauge.getValue()
  DropwizardMetricUpdater.java:112   registry.gauge(..., () -> supplier::get)
  DropwizardNodeMetricUpdater.java:50  () -> inFlightRequests(node)
  AbstractMetricUpdater.java:141     pool = getPools().get(node)  [non-null, reconnecting]
  AbstractMetricUpdater.java:142     pool.getInFlight()
  ChannelPool.java:226               Arrays.stream(channels)  →  NPE  (channels == null)

Same path applies to getAvailableIds() and getOrphanedIds().


Fix

Mirror the guard already used by next(), which correctly handles this case via the initialized flag:

public DriverChannel next(...) {
    if (!singleThreaded.initialized) {  // ← existing safe pattern
        return null;
    }
    ...
}

Apply the same guard to all four affected methods:

public int size() {
    if (!singleThreaded.initialized) return 0;
    return Arrays.stream(channels).mapToInt(ChannelSet::size).sum();
}
// same for getAvailableIds(), getInFlight(), getOrphanedIds()

Why initialized and not channels != null

initialized is declared volatile (meaning the JMM guarantees that a thread reading initialized == true will also observe all writes that preceded the volatile write, including channels = new ChannelSet[...]). The channels field itself is not volatile, so checking it directly would be a weaker guarantee. Using initialized is both JMM-correct and consistent with the existing pattern in next().

Why size() is included

size() has the identical unguarded Arrays.stream(channels) pattern. While it is not currently called from any metrics path, it is a public method and would throw the same NPE if called before initialization.


Changed files

  • core/src/main/java/com/datastax/oss/driver/internal/core/pool/ChannelPool.java
    • size() — add initialized guard
    • getAvailableIds() — add initialized guard
    • getInFlight() — add initialized guard
    • getOrphanedIds() — add initialized guard

@nikagra nikagra requested review from dkropachev and mykaul June 17, 2026 14:14
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@dkropachev, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 33 minutes and 53 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses rolling per-developer review limits. Reviews become available again as older review attempts age out of the rolling limit window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: f1778e74-8d4f-46da-bbc6-b8823957cf0d

📥 Commits

Reviewing files that changed from the base of the PR and between 92221eb and fdcbf39.

📒 Files selected for processing (2)
  • core/src/main/java/com/datastax/oss/driver/internal/core/pool/ChannelPool.java
  • core/src/test/java/com/datastax/oss/driver/internal/core/pool/ChannelPoolShutdownTest.java
📝 Walkthrough

Walkthrough

Two defensive fixes are applied to ChannelPool. First, the four pool statistic methods (size, getAvailableIds, getInFlight, getOrphanedIds) each gain an early-return guard that returns 0 when singleThreaded has not finished initializing (i.e., channels is null), preventing NullPointerException from Dropwizard Metrics gauge lambdas. Second, in the reconnection completion handler, a new branch detects when a successfully connected channel arrives after the pool has already begun closing; in that case the channel is immediately force-closed and the reconnection future is completed as successful. Both fixes are covered by new unit tests in ChannelPoolInitTest and ChannelPoolShutdownTest.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the main change: fixing a NullPointerException in ChannelPool metrics methods when the pool is uninitialized, referencing the customer issue.
Description check ✅ Passed The description is highly detailed and directly related to the changeset, explaining the root cause, the full call chain, the fix, and why specific implementation choices were made.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

When the initial TCP connection to a node fails, ChannelPool.connectFuture
completes (so the pool is added to PoolManager.pools) while `channels` is
still null — initialize() is only called on the success path.

The four methods size(), getAvailableIds(), getInFlight(), and
getOrphanedIds() called Arrays.stream(channels) without guarding against
this, throwing NullPointerException whenever a Dropwizard Metrics reporter
(JMX, Graphite, etc.) scraped the gauge values during the reconnection
window.

Fix: mirror the pattern already used by next(), which checks
singleThreaded.initialized before touching channels. Using the volatile
initialized flag is correct by the JMM: the volatile write at line 365
(initialized = true) happens-after all channels writes, so any thread
that reads initialized == true is guaranteed to see channels as
fully populated.

Fixes: CUSTOMER-413
@nikagra nikagra force-pushed the fix/customer-413-channel-pool-npe-metrics branch from d6dc64c to 250cb9d Compare June 17, 2026 15:56
@nikagra nikagra marked this pull request as ready for review June 17, 2026 15:58
@nikagra nikagra requested a review from Copilot June 17, 2026 15:59

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race where ChannelPool metric accessor methods could throw NullPointerException when a pool instance is published before its channels array is initialized (e.g., after an initial connection failure while reconnecting). This aligns the metric accessors with the existing safety pattern already used by next().

Changes:

  • Guard size(), getAvailableIds(), getInFlight(), and getOrphanedIds() with singleThreaded.initialized to safely return 0 when uninitialized.
  • Add a regression test ensuring those accessors return 0 (and do not throw) when channels is still null after a failed initial connection.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
core/src/main/java/com/datastax/oss/driver/internal/core/pool/ChannelPool.java Adds initialized guards to prevent NPEs when channels is still null.
core/src/test/java/com/datastax/oss/driver/internal/core/pool/ChannelPoolInitTest.java Adds a regression test covering metric accessor behavior for an uninitialized pool.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@dkropachev

Copy link
Copy Markdown

@nikagra , it looks good, but it is too narrow, there are other problems due to postponed initialization, please fix them too update pr and commit message

If the initial pool connection attempt fails, ChannelPool completes while still uninitialized and starts a reconnect loop. A session close during that state can complete immediately because there are no active channels yet.

When the pending reconnect later succeeds, the uninitialized reconnect path previously initialized the pool and kept the new channel open even though the pool had already been closed. This could leave a leaked channel and incorrectly publish channelOpened after shutdown.

Guard that reconnect-success path with isClosing: force-close the new channel and complete the reconnect attempt without initializing the pool. Add coverage for both closeAsync and forceCloseAsync when the first successful channel arrives after shutdown.
@dkropachev dkropachev force-pushed the fix/customer-413-channel-pool-npe-metrics branch from 92221eb to fdcbf39 Compare June 23, 2026 17:59
@dkropachev dkropachev merged commit 710f931 into scylla-4.x Jun 23, 2026
28 checks passed
@dkropachev dkropachev deleted the fix/customer-413-channel-pool-npe-metrics branch June 23, 2026 18:01
@dkropachev dkropachev changed the title Fix NPE in ChannelPool metrics methods when pool is uninitialized (CUSTOMER-413) Fix NPE in ChannelPool metrics methods when pool is uninitialized Jun 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants