perf(dedup): run duplicate check only on the outermost runner #1218
ocervell wants to merge 2 commits into
A task inside a workflow, and a workflow inside a scan, each ran their own duplicate check. On scans with tens of thousands of findings the dedup ran redundantly at every nesting level, a real runtime toll for no benefit.

mark_duplicates() operates on self.results, and descendant results aggregate up into the outermost runner's self.results via the yielder, so its single dedup pass already covers every descendant's findings.

Gate the check so nested runners (has_parent=True) skip it by default, while still respecting an explicit enable_duplicate_check override if the caller set one. Standalone tasks/workflows (no parent) still dedup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm
Problem
A task that's part of a workflow, and a workflow that's part of a scan, each ran their own duplicate check. So in a scan with tens of thousands of findings, the dedup ran redundantly at every nesting level, adding to scan runtime for no benefit.
Where dedup was triggered per-level
The active dedup is the in-memory Runner.mark_duplicates() (secator/runners/_base.py), called from mark_completed() (on on_end). It is gated by self.enable_duplicate_check. The existing build code already disabled it for some nested runners but not all:
- task.py (Task → child command task): enable_duplicate_check = False ✅
- workflow.py (Workflow → child tasks): enable_duplicate_check = False ✅
- celery.py chunking (task → chunk sigs): enable_duplicate_check = False ✅
- scan.py (Scan → child workflows): sets has_parent = True but never disables the check → each workflow inside a scan still deduped, then the scan deduped again. That was the redundant path.
The gate added
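As a rough illustration of the gating rule this section describes, the sketch below models the decision as a standalone function. The name resolve_duplicate_check and the "key present in run_opts" test for an explicit override are assumptions made for illustration; this is not the code from secator/runners/_base.py.

```python
from typing import Any, Dict


def resolve_duplicate_check(run_opts: Dict[str, Any]) -> bool:
    """Hypothetical helper: decide whether a runner runs its own dedup pass."""
    # Assumption: an explicit override is detected by the key being present.
    if 'enable_duplicate_check' in run_opts:
        return bool(run_opts['enable_duplicate_check'])
    # No override: only the outermost runner (no parent) dedups.
    has_parent = run_opts.get('has_parent', False)
    return not has_parent


# The four situations the PR description distinguishes.
standalone = resolve_duplicate_check({})
nested = resolve_duplicate_check({'has_parent': True})
forced_on = resolve_duplicate_check(
    {'has_parent': True, 'enable_duplicate_check': True})
forced_off = resolve_duplicate_check(
    {'has_parent': False, 'enable_duplicate_check': False})
```

Checking for the key rather than its truthiness is what lets an explicit False on a standalone runner, or an explicit True on a nested one, win over the default.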
The gate lives in secator/runners/_base.py, right after enable_duplicate_check is resolved in Runner.__init__.
Any nested runner (has_parent=True) disables its duplicate check by default; the outermost runner (scan, or a standalone workflow/task with no parent) runs it once. An explicit enable_duplicate_check override (e.g. via the Python API) is still respected. enable_duplicate_check is not a CLI/config opt, so it is never present in a top-level user's run_opts by accident.
Correctness: the outermost runner still covers all descendant findings
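The aggregation argument in this section can be illustrated with a toy model. ToyRunner is a hypothetical stand-in, not secator's Runner: it bubbles child findings up into the parent's results and dedups only at the outermost level. It also drops duplicates outright as a simplification, whereas the real mark_duplicates() is described as grouping and marking them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyRunner:
    """Hypothetical stand-in for a runner; not secator's Runner class."""
    name: str
    own_findings: List[str] = field(default_factory=list)
    children: List["ToyRunner"] = field(default_factory=list)
    results: List[str] = field(default_factory=list)
    dedup_passes: int = 0

    def run(self, has_parent: bool = False) -> List[str]:
        # Collect this runner's findings plus everything its children yield.
        self.results = list(self.own_findings)
        for child in self.children:
            self.results.extend(child.run(has_parent=True))
        # Only the outermost runner dedups, over the full aggregated set.
        if not has_parent:
            self.dedup_passes += 1
            self.results = list(dict.fromkeys(self.results))
        return self.results


task_a = ToyRunner("task_a", ["10.0.0.1:80", "10.0.0.1:443"])
task_b = ToyRunner("task_b", ["10.0.0.1:80", "10.0.0.2:22"])
workflow = ToyRunner("workflow", children=[task_a, task_b])
scan = ToyRunner("scan", children=[workflow])
results = scan.run()
```

In this model the duplicate produced by two different tasks is still caught, even though only the scan ran a dedup pass.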
mark_duplicates() operates on self.results. In Runner.__iter__, the runner loops self.yielder() and calls _process_item() → add_result() for every item; in async mode the yielder streams all descendant results up via CeleryData.iter_results. So a scan's self.results contains every finding produced by all its workflows and their tasks. Its single mark_duplicates() pass therefore groups and dedups the full aggregated set, so nothing a child would have caught is lost by skipping the child's pass.
has_parent is reliable for this: workflow.py sets has_parent=True for workflows under a scan/parent, scan.py sets has_parent=True for workflows it builds, and celery.py sets it for chunked sub-tasks; it defaults to False (run_opts.get('has_parent', False)) for the outermost runner.
Tests
Added TestDuplicateCheckGating in tests/unit/test_runners.py:
- has_parent=False → enable_duplicate_check True
- has_parent=True → enable_duplicate_check False
- enable_duplicate_check=True → respected (True)
- enable_duplicate_check=False → respected (False)
tests/unit/test_runners.py is green (43 passed); flake8 clean for the changed code.
🤖 Generated with Claude Code
https://claude.ai/code/session_01P5vSjfkBuGAAHdKxHS3ySm