
Fix Shepherd/JDO connection leaks causing stuck dbconnections entries#1553

Draft
JasonWildMe wants to merge 7 commits into main from more-shepherd-leak-fixes

@JasonWildMe
Collaborator

Summary

Fixes a set of PersistenceManager / connection leaks observed on production Sharkbook and Flukebook via dbconnections.jsp (stuck entries at IBEISIA.processCallback:rollback, RestServlet:new, plus in-flight Encounter.opensearchDocumentSerializer:begin and api.Bulk.doGet:begin).

The single commit (993a81e) touches five files with +184 / -150 lines (functional changes only — ignoring CRLF→LF noise in the rest of the tree):

  • Shepherd.closeDBTransaction / rollbackDBTransaction — move the ShepherdState writes into finally blocks so state always advances past "begin". When pm.close() or rollback() throws, the entry now stays as "close-failed" / "rollback-failed" so the dashboard preserves diagnostic evidence instead of either silently losing it or getting stuck at "begin".
  • IBEISIA.processCallback — wrap both Shepherds in try { ... } finally { rollbackAndClose(); }. The bare return rtn; on the no-log path (the source of the IBEISIA.processCallback:rollback stuck entries) now runs the finally. Setup moved inside the try for consistency.
  • RestServlet.doPost / doDelete / doHead — outer try/finally around each method body removes the RestServlet.class_<id> state entry on every return path, including doPost's empty-body 400 early return. Fixes the accumulating RestServlet:new entries.
  • BulkImport.doGet / doPost — Shepherd construction / setAction / beginDBTransaction moved inside the try with a null-safe finally. Background-thread inner class captures bgContext as final.
  • OpenSearch.setPermissionsNeeded / setActive / unsetActive / updatePermissionsIndex / updateEncounterIndexes — same try-with-null-safe-finally hardening.
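
The two ideas in the bullets above (state written in a finally block, and a null-safe finally around construction) can be sketched in isolation. This is a minimal illustration, not the code from this PR: `StateRegistry`, `FakeShepherd`, and `LeakSafeCaller` are hypothetical stand-ins, and removing the entry on a clean close is an assumption about how the dashboard registry behaves.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the state map that a page like dbconnections.jsp would display.
final class StateRegistry {
    static final Map<String, String> STATES = new ConcurrentHashMap<>();
}

// Hypothetical stand-in for Shepherd; only enough behavior to show the pattern.
final class FakeShepherd {
    private final String action;
    private final boolean failOnClose;

    FakeShepherd(String action, boolean failOnClose) {
        this.action = action;
        this.failOnClose = failOnClose;
        StateRegistry.STATES.put(action, "new");
    }

    void begin() {
        StateRegistry.STATES.put(action, "begin");
    }

    // The state write lives in finally, so the entry always advances past "begin".
    void close() {
        boolean closed = false;
        try {
            if (failOnClose) {
                throw new IllegalStateException("simulated pm.close() failure");
            }
            closed = true;
        } catch (RuntimeException e) {
            // Swallowed in this sketch; real code would at least log it.
        } finally {
            if (closed) {
                StateRegistry.STATES.remove(action); // assumption: clean close drops the entry
            } else {
                StateRegistry.STATES.put(action, "close-failed"); // keep diagnostic evidence
            }
        }
    }
}

final class LeakSafeCaller {
    // Construction and begin() happen inside the try; the finally is null-safe.
    static String handle(String action, boolean throwInBody, boolean failOnClose) {
        FakeShepherd shepherd = null;
        try {
            shepherd = new FakeShepherd(action, failOnClose);
            shepherd.begin();
            if (throwInBody) {
                throw new RuntimeException("simulated failure in method body");
            }
            return "ok";
        } catch (RuntimeException e) {
            return "error";
        } finally {
            if (shepherd != null) {
                shepherd.close();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(handle("demo.ok", false, false));
        System.out.println(handle("demo.closeFails", false, true));
        System.out.println(StateRegistry.STATES);
    }
}
```

In this sketch every return path, including the early return and the thrown exception, passes through the finally, so no entry can remain at "begin".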

Flagged as follow-up (not in this PR)

  • OpenSearch indexing fan-out — per-object Executors.newFixedThreadPool(4) in Encounter/MarkedIndividual/Occurrence opensearchIndexDeep, plus a nested Shepherd created per Jackson serialization in Base.opensearchDocumentSerializer(jgen) — this multiplies concurrent DB transactions during indexing storms. It is what drives the 70+ in-flight Encounter.opensearchDocumentSerializer:begin entries. Needs a design pass (shared bounded executor + pass the outer Shepherd through serialization).
  • commitDBTransaction swallows exceptions — separate correctness bug: IBEISIA.processCallback can still claim success=true after a silently-failed commit. Not a leak, so out of scope here.
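
As a rough illustration of the "shared bounded executor" idea in the first follow-up item, the sketch below routes all indexing work through one fixed-size pool instead of creating a pool per object. It is not a proposed implementation: `SharedIndexExecutor`, `indexAll`, and the sleep standing in for serialization plus an OpenSearch call are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedIndexExecutor {
    static final int MAX_THREADS = 4;

    // One process-wide pool; daemon threads so the sketch never blocks JVM exit.
    private static final ExecutorService POOL =
        Executors.newFixedThreadPool(MAX_THREADS, r -> {
            Thread t = new Thread(r, "index-worker");
            t.setDaemon(true);
            return t;
        });

    private static final AtomicInteger inFlight = new AtomicInteger();
    private static final AtomicInteger peak = new AtomicInteger();

    // Submits one task per object and returns the peak number of tasks running at once.
    static int indexAll(int objectCount) {
        peak.set(0);
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < objectCount; i++) {
            futures.add(POOL.submit(() -> {
                int now = inFlight.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(5); // stands in for serialization + remote index call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    inFlight.decrementAndGet();
                }
            }));
        }
        for (Future<?> f : futures) {
            try {
                f.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + indexAll(50));
    }
}
```

With a single shared pool, the number of concurrent tasks (and therefore of concurrent DB transactions opened by those tasks) is capped at `MAX_THREADS` regardless of how many objects are indexed.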

Test plan

  • mvn compile succeeds cleanly (verified locally)
  • IBEISIA.processCallback: exercise the no-log early-return path, the successful detect path, and the "no annotations suitable for identification" path; confirm no stuck IBEISIA.processCallback:* entries accumulate on dbconnections.jsp
  • RestServlet: POST (empty body + normal), DELETE, HEAD requests while watching dbconnections.jsp — confirm RestServlet:new no longer accumulates
  • BulkImport: exercise the list and detail doGet flows and the doPost/doDelete flows
  • Any reproducible pm.close() failure: confirm the dashboard shows the entry as close-failed rather than the entry vanishing
  • Monitor dbconnections.jsp on staging for 24h post-deploy; confirm entry count stabilizes

Credits

  • Root cause investigation and implementation: Claude Opus 4.7 (Claude Code, 1M context)
  • Independent plan review and code review: Codex CLI (OpenAI)
  • Human: JasonWildMe

🤖 Generated with Claude Code and reviewed with Codex CLI

@codecov-commenter

codecov-commenter commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 51.54%. Comparing base (606982f) to head (3070e41).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1553   +/-   ##
=======================================
  Coverage   51.54%   51.54%           
=======================================
  Files         308      308           
  Lines       11952    11952           
  Branches     3842     3833    -9     
=======================================
  Hits         6161     6161           
- Misses       5503     5510    +7     
+ Partials      288      281    -7     

| Flag | Coverage Δ |
|---|---|
| backend | 51.54% <ø> (ø) |
| frontend | 51.54% <ø> (ø) |

@JasonWildMe JasonWildMe requested a review from naknomum April 20, 2026 14:15
@JasonWildMe JasonWildMe self-assigned this Apr 20, 2026
Both Shepherds in scanEndApplet.jsp previously had their close calls
outside any try/finally. The `indShepherd` block (lines 603-786) was
the source of accumulating `scanEndApplet.jsp_displayNames:begin`
entries on Sharkbook dbconnections.jsp: any exception during
`xmlReader.read(file)` on a partially-written scan XML, or any
`getMarkedIndividual` failure, skipped the close. Because the page
auto-refreshes every 15s during active scans, each failing refresh
could leave another entry behind.

Wraps both Shepherds with try { ... } finally { rollbackAndClose(); }
matching the pattern used elsewhere in this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JasonWildMe JasonWildMe marked this pull request as draft April 28, 2026 20:28
JasonWildMe and others added 3 commits April 28, 2026 14:16
Most of the stuck "begin" entries on Sharkbook's dbconnections.jsp are
not individual Shepherd leaks but symptoms of one upstream cause:
threads waiting for a Postgres connection that is pinned by a
long-running operation. Two fixes:

1. Encounter.opensearchIndexPermissions() — the periodic permissions
   sweep was holding a Postgres tx open for the entire duration of
   per-encounter OpenSearch HTTP updates. On installs with hundreds
   of thousands of encounters this pinned a connection for tens of
   minutes per run, starving every concurrent request behind it.

   Refactored into two phases: phase 1 loads users/collab maps and
   all eligible encounter rows into in-memory structures under a
   short-lived Shepherd, which is then closed; phase 2 iterates the
   in-memory rows and issues OpenSearch updates with no DB connection
   held. Same OpenSearch updates, same filter logic.

2. jdoconfig.properties — datanucleus.connectionPool.maxWait was -1
   (wait forever). Threads blocked on the pool sat indefinitely and
   showed up as stuck "begin" entries. Set to 30000 ms so a request
   under contention fails fast with a clear error and pool slots
   free up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
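
The two-phase refactor described in the commit message above can be sketched as follows. This is only an illustration of the shape of the change: `TwoPhaseSweep`, `FakeConnection`, `loadEligibleIds`, and `updateRemoteIndex` are hypothetical names, and the counter exists solely to show that no connection is held during phase 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TwoPhaseSweep {
    // Hypothetical stand-in for a pooled DB connection (or a short-lived Shepherd).
    static final class FakeConnection implements AutoCloseable {
        static int open = 0;

        FakeConnection() {
            open++;
        }

        List<String> loadEligibleIds() {
            return Arrays.asList("enc-1", "enc-2", "enc-3");
        }

        @Override
        public void close() {
            open--;
        }
    }

    // Number of open connections observed while each remote update ran.
    static final List<Integer> openDuringUpdates = new ArrayList<>();

    // Stands in for a per-encounter HTTP update to the search index.
    static void updateRemoteIndex(String id) {
        openDuringUpdates.add(FakeConnection.open);
    }

    static int sweep() {
        List<String> ids;
        // Phase 1: copy everything needed into memory, then release the connection.
        try (FakeConnection conn = new FakeConnection()) {
            ids = new ArrayList<>(conn.loadEligibleIds());
        }
        // Phase 2: slow remote updates run with no DB connection held.
        for (String id : ids) {
            updateRemoteIndex(id);
        }
        return ids.size();
    }

    public static void main(String[] args) {
        System.out.println("updated: " + sweep());
        System.out.println("connections open during updates: " + openDuringUpdates);
    }
}
```

The trade-off in this shape is memory for connection time: the rows loaded in phase 1 must fit in memory, in exchange for the connection being held only for the duration of the load.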
Earlier commits on this branch saved Encounter.java, scanEndApplet.jsp
and jdoconfig.properties with CRLF line endings, which made the GitHub
PR diff display each file as entirely changed. main has these files as
LF, so this commit normalizes back so reviewers see only the actual
code changes from this branch.

No functional change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The method opened a Shepherd at line 241 and only closed it on three
specific success/early-return paths (lines 290, 339, 375). Anything
that threw between begin and one of those closes leaked the PM:

- IBEISIAIdentificationMatchingState.allAsJSONArray() — a JDO query
- WildbookIAM.getIASpecies() and other lazy-loading getters in the
  qanns/tanns iteration loops
- qanns.get(0).getMatchingSet() — runs an OpenSearch HTTP call while
  the Postgres tx is still open; fails on slow/hung OpenSearch
- annotGetIndiv() — a JDO query
- Any unchecked Throwable

Wraps the body in try { ... } finally { rollbackAndClose() }, removes
the three explicit close pairs, and hoists the HashMap declaration
out of the try so the post-try RestClient.post() (which intentionally
runs after the DB connection is released) can still see it.

Same JSONObject results on every path; same "close DB before HTTP"
ordering preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
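
The "hoist the declaration, close the DB, then do HTTP" ordering from the commit message above might look roughly like this. The sketch is hedged: `CloseBeforeHttp`, `FakeShepherd`, the `events` log, the payload key, and the catch block that converts a failure into a return value are all assumptions made for illustration, and the real method returns a JSONObject rather than a String.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CloseBeforeHttp {
    // Records the order of operations so the sequencing can be inspected.
    static final List<String> events = new ArrayList<>();

    // Hypothetical stand-in for Shepherd.
    static final class FakeShepherd {
        void beginDBTransaction() {
            events.add("begin");
        }

        void rollbackAndClose() {
            events.add("close");
        }
    }

    static String send(boolean failWhileBuilding) {
        // Declared outside the try so the post-try HTTP step can still see it.
        Map<String, Object> payload = new HashMap<>();
        FakeShepherd shepherd = new FakeShepherd();
        shepherd.beginDBTransaction();
        try {
            if (failWhileBuilding) {
                throw new IllegalStateException("simulated JDO query failure");
            }
            payload.put("hypothetical_key", "value loaded from the DB");
        } catch (RuntimeException e) {
            events.add("error");
            return "failed"; // the finally below still runs before this return completes
        } finally {
            shepherd.rollbackAndClose(); // single close point for every path
        }
        // The DB connection is already released here; only now do the slow HTTP call.
        events.add("post");
        return "posted " + payload.size();
    }

    public static void main(String[] args) {
        System.out.println(send(false) + " " + events);
    }
}
```

On the success path the recorded order is begin, close, post; on the failure path it is begin, error, close, with no HTTP call attempted.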
@JasonWildMe JasonWildMe force-pushed the more-shepherd-leak-fixes branch from eae68f9 to a6d9942 Compare April 29, 2026 16:30
@naknomum naknomum added this to the 10.10.5 milestone May 5, 2026