Skip to content

perf(dedup): run duplicate check only on the outermost runner#1218

Open
ocervell wants to merge 2 commits into
mainfrom
fix/dedup-outermost
Open

perf(dedup): run duplicate check only on the outermost runner#1218
ocervell wants to merge 2 commits into
mainfrom
fix/dedup-outermost

Conversation

@ocervell

@ocervell ocervell commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Problem

A task that's part of a workflow, and a workflow that's part of a scan, each ran their own duplicate check. So in a scan with 10,000s of findings the dedup ran redundantly at every nesting level — a real toll on scan runtime for no benefit.

Where dedup was triggered per-level

The active dedup is the in-memory Runner.mark_duplicates() (secator/runners/_base.py), called from mark_completed() (on on_end). It is gated by self.enable_duplicate_check. The existing build code already disabled it for some nested runners but not all:

  • task.py (Task → child command task): enable_duplicate_check = False
  • workflow.py (Workflow → child tasks): enable_duplicate_check = False
  • celery.py chunking (task → chunk sigs): enable_duplicate_check = False
  • scan.py (Scan → child workflows): sets has_parent = True but never disables the check → each workflow inside a scan still deduped, then the scan deduped again. That was the redundant path.

The gate added

secator/runners/_base.py (right after enable_duplicate_check is resolved in Runner.__init__):

self.enable_duplicate_check = self.run_opts.get('enable_duplicate_check', self.enable_duplicate_check)
if self.has_parent and 'enable_duplicate_check' not in self.run_opts:
    self.enable_duplicate_check = False

Any nested runner (has_parent=True) disables its duplicate check by default; the outermost runner (scan, or a standalone workflow/task with no parent) runs it once. An explicit enable_duplicate_check override (e.g. via the Python API) is still respected — enable_duplicate_check is not a CLI/config opt, so it is never present in a top-level user's run_opts by accident.

Correctness: the outermost runner still covers all descendant findings

mark_duplicates() operates on self.results. In Runner.__iter__, the runner loops self.yielder() and calls _process_item()add_result() for every item; in async mode the yielder streams all descendant results up via CeleryData.iter_results. So a scan's self.results contains every finding produced by all its workflows and their tasks. Its single mark_duplicates() pass therefore groups and dedups the full aggregated set — nothing a child would have caught is lost by skipping the child's pass.

has_parent is reliable for this: workflow.py sets has_parent=True for workflows under a scan/parent, scan.py sets has_parent=True for workflows it builds, and celery.py sets it for chunked sub-tasks; it defaults to False (run_opts.get('has_parent', False)) for the outermost runner.

Tests

Added TestDuplicateCheckGating in tests/unit/test_runners.py:

  • outermost runner (has_parent=False) → enable_duplicate_check True
  • nested runner (has_parent=True) → enable_duplicate_check False
  • nested + explicit enable_duplicate_check=True → respected (True)
  • outermost + explicit enable_duplicate_check=False → respected (False)

tests/unit/test_runners.py is green (43 passed); flake8 clean for the changed code.

🤖 Generated with Claude Code

https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm

Summary by CodeRabbit

  • Bug Fixes
    • Nested runner operations now avoid duplicate checks by default, reducing unnecessary repeated work.
    • Explicit duplicate-check settings are now respected consistently, including for nested runs.

A task inside a workflow, and a workflow inside a scan, each ran their own
duplicate check. On scans with 10,000s of findings the dedup ran redundantly at
every nesting level — a real runtime toll for no benefit.

mark_duplicates() operates on self.results, and descendant results aggregate up
into the outermost runner's self.results via the yielder, so its single dedup
pass already covers every descendant's findings. Gate the check so nested runners
(has_parent=True) skip it by default, while still respecting an explicit
enable_duplicate_check override if the caller set one. Standalone tasks/workflows
(no parent) still dedup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 46981a6d-033f-4c9d-80f4-3ee696d2b668

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

Runner.__init__ now disables duplicate checking by default for nested runners while still honoring explicit enable_duplicate_check overrides. New tests cover outermost defaults, nested defaults, and explicit true/false overrides.

Changes

Duplicate-check gating

Layer / File(s) Summary
Nested runner default
secator/runners/_base.py, tests/unit/test_runners.py
Runner.__init__ now flips enable_duplicate_check off when has_parent=True, while tests cover the default and override combinations for nested and outermost runners.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 I hopped through the runner's nested den,
Duplicate checks now pause, then wake again.
Default stays true outside, inside it may rest,
Overrides still whisper, and tests do the rest.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: duplicate checks are now limited to the outermost runner by default.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/dedup-outermost

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant