perf(mongodb): batch finding writes (bulk_write) instead of one DB call per item by ocervell · Pull Request #1176 · freelabz/secator

ocervell · 2026-06-15T13:10:13Z

Motivation

Findings are written one-by-one today — one update_one/insert_one round-trip per item in hooks/mongodb.py:update_finding (called on on_item). A URL crawl with ~1M findings = ~1M DB round-trips.

Approach

Mint _id client-side. ObjectId() generates locally with no round-trip, so each finding gets a stable id immediately (downstream — the Celery chain, get_results, dedup — all key off it). This decouples identity from persistence.
Buffer + bulk flush. update_finding appends a pymongo.UpdateOne(..., upsert=True) to a per-runner buffer and flushes via a single db.findings.bulk_write on:
- a size cap (MONGODB_FLUSH_SIZE, bounds worker memory),
- on_interval, throttled by CONFIG.runners.backend_update_frequency (mongodb is a backend — -1 ⇒ flush only on size cap + on_end = maximal batching; 5 ⇒ ~every 5s for live UI),
- on_end — mandatory: the next runner re-hydrates these findings from the DB, so they must be persisted before the runner finishes.

What is persisted is unchanged (same OUTPUT_TYPES guard) — only how.

Win

~1M update_one/insert_one → ~1M / MONGODB_FLUSH_SIZE bulk_writes (e.g. ~1k). Same precedent already used by tag_duplicates.

⚠️ Needs validation (hot path)

I can't run the suite locally (deps). Please validate against your local MongoDB repro: run a crawl/domain scan and confirm (a) finding counts match, (b) cross-workflow chaining still works (subdomains → host_recon — the bug we just fixed), (c) dedup/tag_duplicates still tags correctly. CI runs unit/integration; any test that called update_finding expecting an immediate write will need updating.

Follow-up (not here): the per-runner update_runner writes could be throttled/batched similarly, but findings are the dominant cost.

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Enhanced finding persistence performance with optimized batch database operations, reducing write overhead and improving overall synchronization efficiency.

Findings were persisted one-by-one (one update_one/insert_one round-trip per item) — millions of DB calls on a large crawl. Now: - mint the Mongo _id client-side (ObjectId() needs no round-trip) so each item has a stable id immediately; - buffer an upsert per finding and flush in a single bulk_write on a size cap (MONGODB_FLUSH_SIZE), on_interval (throttled by runners.backend_update_frequency — mongodb is a backend), and on_end; - on_end flush is mandatory for correctness: the next runner re-hydrates findings from the DB (get_results), so they must be persisted before the runner finishes. What is persisted is unchanged (same OUTPUT_TYPES guard) — only how (batched). With backend_update_frequency=-1 the buffer flushes only on size cap + on_end (maximal batching). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-15T13:10:27Z

Walkthrough

secator/hooks/mongodb.py replaces immediate per-item MongoDB update_one/insert_one calls in update_finding with an in-memory UpdateOne buffer that performs bulk_write at a configurable size cap (MONGODB_FLUSH_SIZE). Three new flush helpers (flush_findings_buffer, flush_findings, flush_findings_final) handle size-triggered, time-throttled, and final end-of-run flushes. The HOOKS dict is updated to wire these helpers into runner lifecycle events.

Changes

MongoDB Buffered Bulk Upsert for Findings

Layer / File(s)	Summary
Buffered upsert logic and flush helpers `secator/hooks/mongodb.py`	Adds `should_update` import and `MONGODB_FLUSH_SIZE` constant; rewrites `update_finding` to append `UpdateOne(upsert=True)` operations to an in-memory buffer and flush at size cap; introduces `flush_findings_buffer` (bulk_write + reset), `flush_findings` (throttled via `should_update`), and `flush_findings_final` (unconditional end-of-run flush).
HOOKS lifecycle wiring `secator/hooks/mongodb.py`	Updates `HOOKS` to connect `flush_findings` to `on_interval` for `Scan`, `Workflow`, and `Task`; `flush_findings_final` to `on_end` for `Scan` and `Workflow`; `flush_findings` to `on_end` for `Task`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 No more one-by-one hops to the store,
I batch my carrots, then write them all at once!
A buffer of finds, a flush at the door,
bulk_write leaps where single writes were dunce.
The interval rings, and final flush seals the run —
Efficient persistence, my warren's well done! 🥕

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: replacing one-per-item MongoDB writes with batched bulk_write operations for improved performance.
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/mongodb-batch-finding-writes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@secator/hooks/mongodb.py`:
- Around line 253-255: The hook ordering in the MongoDB hooks file needs to be
adjusted so that flush operations execute before update_runner, because
update_runner sets self.last_updated_db which the runner uses to determine
whether to skip interval-based flushes. In the on_interval list (lines 253-255),
move flush_findings before update_runner in the list. In the on_end list (lines
269-270), move flush_findings_final before update_runner so that findings are
persisted before the runner is marked as complete. For the on_duplicate hooks at
lines 260-262, apply the same principle if flush operations are present there.
- Around line 122-127: The _mongodb_findings_buffer currently uses a list to
append every UpdateOne operation, but when flushed with unordered bulk writes,
operations for the same finding (same item._uuid) can execute out of order,
leaving the document in a non-deterministic state. Replace the list-based buffer
with a dict keyed by item._uuid so that repeated updates for the same finding
are coalesced and only the latest UpdateOne operation per UUID is retained. This
ensures that when the buffer is flushed at the bulk_write call, each finding is
updated only once with its most recent state, regardless of the unordered
execution.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 98966e40-ca21-4234-b404-8e9b35116df8

📥 Commits

Reviewing files that changed from the base of the PR and between fdd388f and ad9e680.

📒 Files selected for processing (1)

secator/hooks/mongodb.py

coderabbitai · 2026-06-15T13:17:55Z

+	buffer = getattr(self, '_mongodb_findings_buffer', None)
+	if buffer is None:
+		buffer = self._mongodb_findings_buffer = []
+	buffer.append(
+		pymongo.UpdateOne({'_id': ObjectId(item._uuid)}, {'$set': item.toDict()}, upsert=True)
+	)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In PyMongo and MongoDB bulk_write with ordered=False, are operations guaranteed to execute in the order provided, especially multiple UpdateOne upserts targeting the same _id?

💡 Result:

When using PyMongo or any MongoDB driver with bulk_write and ordered=False, the operations are not guaranteed to execute in the order provided [1][2][3]. In unordered mode, MongoDB may execute operations in parallel and reorder them for performance optimization [1][2][3]. Consequently, applications should not rely on the execution order of these operations [2][3]. Regarding multiple UpdateOne upserts targeting the same _id in an unordered bulk write: because there is no guarantee of serial execution, the final state of the document is non-deterministic if multiple operations in the same batch target the same _id [1][2][3]. You cannot rely on the operations "winning" based on their index in the input array. If your application logic requires a specific final state for a document, you must either use ordered=True (the default) or ensure that your bulk operations do not have overlapping targets [1][2][3].

Citations:

1: https://www.mongodb.com/docs/manual/reference/method/db.collection.bulkwrite/

2: https://www.mongodb.com/docs/manual/reference/command/bulkwrite/

3: https://www.mongodb.com/docs/v8.2/reference/command/bulkwrite/

🏁 Script executed:

cd secator && wc -l hooks/mongodb.py

Repository: freelabz/secator

Length of output: 81

🏁 Script executed:

cd secator && sed -n '115,150p' hooks/mongodb.py

Repository: freelabz/secator

Length of output: 1379

Coalesce repeated finding updates before unordered bulk writes.

The buffer appends every UpdateOne, then Line 142 flushes with ordered=False. With unordered bulk writes, operations are not guaranteed to execute in order, making the final document state non-deterministic when the same finding is buffered multiple times before a flush—an older snapshot can be applied after the newer one. Keep only the latest operation per item._uuid by using a dict keyed by UUID, or switch to ordered writes.

Proposed coalescing fix

- buffer = getattr(self, '_mongodb_findings_buffer', None) - if buffer is None: - buffer = self._mongodb_findings_buffer = [] - buffer.append( - pymongo.UpdateOne({'_id': ObjectId(item._uuid)}, {'$set': item.toDict()}, upsert=True) - ) + buffer = getattr(self, '_mongodb_findings_buffer', None) + if buffer is None: + buffer = self._mongodb_findings_buffer = {} + buffer[item._uuid] = pymongo.UpdateOne( + {'_id': ObjectId(item._uuid)}, + {'$set': item.toDict()}, + upsert=True, + ) if len(buffer) >= MONGODB_FLUSH_SIZE: flush_findings_buffer(self) return item @@ - count = len(buffer) - db.findings.bulk_write(buffer, ordered=False) - self._mongodb_findings_buffer = [] + operations = list(buffer.values()) + count = len(operations) + db.findings.bulk_write(operations, ordered=False) + self._mongodb_findings_buffer = {}

Also applies to: 141-143

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@secator/hooks/mongodb.py` around lines 122 - 127, The _mongodb_findings_buffer currently uses a list to append every UpdateOne operation, but when flushed with unordered bulk writes, operations for the same finding (same item._uuid) can execute out of order, leaving the document in a non-deterministic state. Replace the list-based buffer with a dict keyed by item._uuid so that repeated updates for the same finding are coalesced and only the latest UpdateOne operation per UUID is retained. This ensures that when the buffer is flushed at the bulk_write call, each finding is updated only once with its most recent state, regardless of the unordered execution.

coderabbitai · 2026-06-15T13:17:55Z

+		'on_interval': [update_runner, flush_findings],
 		'on_duplicate': [update_finding],
-		'on_end': [update_runner],
+		'on_end': [update_runner, flush_findings_final],


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Run findings flush hooks before update_runner.

For on_interval, secator/runners/_base.py:948-983 re-checks self.last_updated_db before each hook; update_runner sets it on Line 96, so placing flush_findings after update_runner causes the interval flush to be skipped. For on_end, persisting the runner first can mark completion before the mandatory final findings flush. Move flush hooks first.

Proposed hook ordering fix

Scan: { 'on_init': [update_runner], 'on_start': [update_runner], - 'on_interval': [update_runner, flush_findings], + 'on_interval': [flush_findings, update_runner], 'on_duplicate': [update_finding], - 'on_end': [update_runner, flush_findings_final], + 'on_end': [flush_findings_final, update_runner], }, Workflow: { 'on_init': [update_runner], 'on_start': [update_runner], - 'on_interval': [update_runner, flush_findings], + 'on_interval': [flush_findings, update_runner], 'on_duplicate': [update_finding], - 'on_end': [update_runner, flush_findings_final], + 'on_end': [flush_findings_final, update_runner], }, Task: { 'on_init': [update_runner], 'on_start': [update_runner], 'on_item': [update_finding], 'on_duplicate': [update_finding], - 'on_interval': [update_runner, flush_findings], - 'on_end': [update_runner, flush_findings_final] + 'on_interval': [flush_findings, update_runner], + 'on_end': [flush_findings_final, update_runner] }

Also applies to: 260-262, 269-270

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@secator/hooks/mongodb.py` around lines 253 - 255, The hook ordering in the MongoDB hooks file needs to be adjusted so that flush operations execute before update_runner, because update_runner sets self.last_updated_db which the runner uses to determine whether to skip interval-based flushes. In the on_interval list (lines 253-255), move flush_findings before update_runner in the list. In the on_end list (lines 269-270), move flush_findings_final before update_runner so that findings are persisted before the runner is marked as complete. For the on_duplicate hooks at lines 260-262, apply the same principle if flush operations are present there.

coderabbitai Bot reviewed Jun 15, 2026

View reviewed changes

ocervell mentioned this pull request Jun 15, 2026

feat(runner): fire on_interval on a real timer + thread-safe finding flush #1179

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(mongodb): batch finding writes (bulk_write) instead of one DB call per item#1176

perf(mongodb): batch finding writes (bulk_write) instead of one DB call per item#1176
ocervell wants to merge 1 commit into
mainfrom
feat/mongodb-batch-finding-writes

ocervell commented Jun 15, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Uh oh!

coderabbitai Bot Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

ocervell commented Jun 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Approach

Win

⚠️ Needs validation (hot path)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ocervell commented Jun 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading