Feature/fase 2 prompt optimization #183
base: main
Conversation
- Add comprehensive unit tests for strix/llm module:
  - test_llm_utils.py: 35 tests for tool parsing, HTML entity decoding
  - test_config.py: 18 tests for LLMConfig initialization
  - test_memory_compressor.py: 27 tests for token counting, compression
  - test_request_queue.py: 19 tests for rate limiting, retries
- Create test fixtures (see the sketch below):
  - Sample LLM responses (valid, truncated, multiple functions)
  - Vulnerability test cases (SQL injection, XSS, IDOR)
- Add conftest.py with shared fixtures for testing

Results:
- 97 tests passing, 2 skipped
- LLM module coverage: 53%
- utils.py: 100%, config.py: 100%, request_queue.py: 98%

This completes Phase 1 of the optimization plan.
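A shared fixture of the kind described above might look like this minimal sketch; the fixture names and sample payloads are invented, not taken from the actual conftest.py:

```python
import pytest


@pytest.fixture
def truncated_llm_response() -> str:
    # Hypothetical sample of a truncated tool-call response, mirroring
    # the "truncated" category in the fixture list above.
    return '<function_calls><invoke name="browser_actions.naviga'


@pytest.fixture
def sqli_test_case() -> dict[str, str]:
    # Invented vulnerability test case in the spirit of the fixtures above.
    return {
        "vuln_type": "sql_injection",
        "payload": "' OR 1=1--",
        "expected_indicator": "sql_error",
    }
```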
Phase 2 - Prompt Optimization & False Positive Reduction:

New Features:
- Add confidence scoring system (strix/llm/confidence.py):
  - ConfidenceLevel enum (HIGH, MEDIUM, LOW, FALSE_POSITIVE)
  - VulnerabilityFinding dataclass with serialization
  - Automatic analysis of responses for FP and exploitation indicators
  - Pattern dictionaries for common vulnerability types
- Add tool invocation validation (strix/llm/utils.py), sketched below:
  - validate_tool_invocation() with URL, path, command validation
  - KNOWN_TOOLS dictionary with required parameters
  - validate_all_invocations() for batch validation

Prompt Improvements:
- Add Chain-of-Thought (CoT) vulnerability validation protocol to system_prompt.jinja with 6-step mandatory analysis
- Add detailed <false_positive_indicators> sections to:
  - sql_injection.jinja (WAF, rate limit, encoding checks)
  - xss.jinja (CSP, sanitization, execution verification)
  - idor.jinja (public vs private, authorization checks)
  - ssrf.jinja (client vs server, egress verification)

Testing:
- 79 new tests for Phase 2 features
- Total: 176 tests passing (2 skipped)
- Coverage improved: 62% (up from 53%)
- confidence.py: 100% coverage
- utils.py: 95% coverage
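A rough sketch of how the validation pieces named above could fit together; the tool names, required-parameter sets, and error strings here are assumptions, not the PR's actual values:

```python
from typing import Any

# Assumed shape: tool name -> set of required argument names.
KNOWN_TOOLS: dict[str, set[str]] = {
    "browser_actions.navigate": {"url"},
    "terminal.execute": {"command"},
}


def validate_tool_invocation(invocation: dict[str, Any]) -> tuple[bool, list[str]]:
    """Return (is_valid, errors) for a single tool invocation."""
    errors: list[str] = []
    tool_name = invocation.get("toolName")
    if not tool_name:
        errors.append("missing toolName")
        return False, errors
    args = invocation.get("args")
    if not isinstance(args, dict):
        errors.append("args must be a dict")
        return False, errors
    # Flag any required parameter missing for a tool we know about.
    for param in sorted(KNOWN_TOOLS.get(tool_name, set()) - args.keys()):
        errors.append(f"missing required parameter: {param}")
    return not errors, errors


def validate_all_invocations(invocations: list[dict[str, Any]]) -> list[tuple[bool, list[str]]]:
    # Batch variant mentioned in the commit message.
    return [validate_tool_invocation(inv) for inv in invocations]
```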
Pull request overview
This pull request implements Phase 2 of a 3-phase LLM optimization plan for Strix, an open-source AI security testing tool. The primary goal is to reduce false positives in vulnerability detection by at least 25% through prompt optimization, confidence scoring, and structured validation.
Key Changes:
- Implements a confidence scoring system to classify vulnerability findings (HIGH/MEDIUM/LOW/FALSE_POSITIVE); see the sketch after this list
- Adds comprehensive tool invocation validation with URL, file path, and command checks
- Enhances vulnerability detection prompts with detailed false positive indicators for SQL injection, XSS, SSRF, and IDOR
- Adds a mandatory six-step Chain-of-Thought analysis protocol to the system prompt
- Introduces 176 unit tests (79 new for Phase 2), bringing LLM module coverage to 62%
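To make the first item concrete, here is a minimal sketch of the confidence API, assuming only what the PR description names (ConfidenceLevel, VulnerabilityFinding, and serialization via to_dict); the specific fields and enum values are invented:

```python
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any


class ConfidenceLevel(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    FALSE_POSITIVE = "false_positive"


@dataclass
class VulnerabilityFinding:
    title: str
    vuln_type: str
    confidence: ConfidenceLevel
    indicators: list[str] = field(default_factory=list)
    fp_indicators: list[str] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        # Serialize the finding, flattening the enum to its string value.
        data = asdict(self)
        data["confidence"] = self.confidence.value
        return data
```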
Reviewed changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| todo.md | Comprehensive 3-phase optimization plan document (1122 lines) detailing Phase 1 completion and Phase 2 implementation status |
| strix/llm/confidence.py | New confidence scoring module with pattern-based false positive and exploitation detection |
| strix/llm/utils.py | Extended tool validation functions for URLs, file paths, and commands with structured error reporting |
| strix/agents/StrixAgent/system_prompt.jinja | Added <vulnerability_validation_protocol> with 6-step mandatory CoT analysis and false positive avoidance guidance |
| strix/prompts/vulnerabilities/*.jinja | Enhanced SQL injection, XSS, SSRF, and IDOR prompts with detailed <false_positive_indicators> sections |
| tests/unit/test_confidence.py | 46 comprehensive tests for confidence scoring system |
| tests/unit/test_llm_utils.py | 33 new validation tests added to existing tool parsing tests |
| tests/unit/test_*.py | Additional unit tests for request queue, memory compressor, and config modules |
| tests/fixtures/* | Sample responses and vulnerability test cases for validation |
| tests/conftest.py | Shared pytest fixtures for testing LLM responses and vulnerability findings |
| pyproject.toml | Updated pytest configuration with test markers (unit/integration/slow) |
```python
def to_dict(self) -> dict[str, Any]:
    """Convierte el hallazgo a diccionario para serialización."""
```
Copilot AI · Dec 7, 2025
[nitpick] The docstring comment uses Spanish ("Convierte el hallazgo a diccionario para serialización") while the codebase appears to use English for documentation. Consider maintaining consistency by using English for all docstrings and comments.
| """Valida que una invocación de herramienta sea correcta. | ||
| Realiza validaciones de: | ||
| - Presencia de toolName | ||
| - Formato correcto de args | ||
| - Parámetros requeridos según la herramienta | ||
| - Validación de URLs para herramientas de browser | ||
| Args: | ||
| invocation: Diccionario con la invocación de herramienta | ||
| Returns: | ||
| Tuple de (es_válido, lista_de_errores) | ||
| Example: | ||
| >>> invocation = {"toolName": "browser_actions.navigate", "args": {"url": "https://example.com"}} | ||
| >>> is_valid, errors = validate_tool_invocation(invocation) | ||
| >>> is_valid | ||
| True | ||
| """ |
Copilot AI · Dec 7, 2025
[nitpick] The docstrings in this file mix Spanish and English. For example, "Valida que una invocación de herramienta sea correcta" is in Spanish while the example is in English. Maintain consistency by using English throughout for better international collaboration and code maintainability.
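To complement the docstring's passing example, a failing call might look like the following; the exact error string is an assumption:

```python
invocation = {"toolName": "browser_actions.navigate", "args": {}}
is_valid, errors = validate_tool_invocation(invocation)
assert not is_valid
print(errors)  # e.g. ["missing required parameter: url"]
```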
```toml
addopts = [
    "-v",
    "--strict-markers",
    "--strict-config",
    "--cov=strix",
    "--cov-report=term-missing",
    "--cov-report=html",
    "--cov-report=xml",
    "--cov-fail-under=80",
    "--tb=short",
]
```
Copilot AI · Dec 7, 2025
The coverage-related options (--cov=strix, --cov-report=*, --cov-fail-under=80) have been removed from the default pytest addopts. This means coverage will no longer be measured by default when running tests. If this is intentional, it should be documented. Otherwise, these options should be retained to maintain test quality standards.
```
4. CHAIN-OF-THOUGHT ANALYSIS (MANDATORY):
   Before concluding any finding, analyze step by step:

   Step 1 - Initial Observation:
   "I observed [specific behavior] when sending [specific payload]"

   Step 2 - Hypothesis:
   "This could indicate [vulnerability type] because [reasoning]"

   Step 3 - Verification:
   "To verify, I will [additional tests to perform]"

   Step 4 - Evidence Evaluation:
   "The evidence [supports/contradicts] my hypothesis because [specific reasons]"

   Step 5 - False Positive Check:
   "I checked for false positive indicators: [list what you checked]"

   Step 6 - Conclusion:
   "My confidence level is [HIGH/MEDIUM/LOW/FALSE_POSITIVE] because [justification]"

5. AVOID COMMON FALSE POSITIVE PATTERNS:
   - Generic error pages mistaken for injection success
   - Rate limiting responses confused with vulnerability indicators
   - Cached responses giving inconsistent results
   - WAF blocks interpreted as application errors
   - Input validation errors vs actual vulnerabilities
   - Timing variations due to network latency vs actual time-based injection
</vulnerability_validation_protocol>
```
Copilot AI · Dec 7, 2025
The new <vulnerability_validation_protocol> section requires a mandatory 6-step Chain-of-Thought analysis, but the protocol document only shows 6 steps in the example. Ensure this matches the documented requirement of "CoT obligatorio de 6 pasos" mentioned in line 415 of todo.md.
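For illustration, the six steps might read as follows for a hypothetical boolean-based SQL injection finding; the payloads, observations, and conclusion are invented:

```
Step 1 - Initial Observation:
"I observed different page content when sending id=1' AND '1'='1 versus id=1' AND '1'='2"

Step 2 - Hypothesis:
"This could indicate boolean-based SQL injection because the predicate alters the response"

Step 3 - Verification:
"To verify, I will repeat both payloads several times and test an equivalent numeric predicate"

Step 4 - Evidence Evaluation:
"The evidence supports my hypothesis because the difference is stable across repetitions"

Step 5 - False Positive Check:
"I checked for false positive indicators: no WAF block page, no rate limiting, caching disabled"

Step 6 - Conclusion:
"My confidence level is MEDIUM because behavior is consistent but data extraction was not yet confirmed"
```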
```python
def calculate_confidence(
    indicators: list[str],
    fp_indicators: list[str],
    exploitation_confirmed: bool = False,
    vuln_type: str = "generic",
) -> ConfidenceLevel:
    """Calcula el nivel de confianza basado en evidencia.

    Args:
        indicators: Lista de indicadores positivos de vulnerabilidad
        fp_indicators: Lista de indicadores de falso positivo encontrados
        exploitation_confirmed: Si la explotación fue confirmada
        vuln_type: Tipo de vulnerabilidad para aplicar reglas específicas

    Returns:
        ConfidenceLevel apropiado basado en la evidencia

    Example:
        >>> calculate_confidence(
        ...     indicators=["sql_error", "data_leak", "timing_diff"],
        ...     fp_indicators=[],
        ...     exploitation_confirmed=True
        ... )
        ConfidenceLevel.HIGH
    """
    # Si hay explotación confirmada con suficiente evidencia, es HIGH
    if exploitation_confirmed and len(indicators) >= 2:
        return ConfidenceLevel.HIGH

    # Si los indicadores de FP superan a los positivos, es FALSE_POSITIVE
    if len(fp_indicators) > len(indicators) and not exploitation_confirmed:
        return ConfidenceLevel.FALSE_POSITIVE

    # Si hay múltiples indicadores sin FP significativos
    if len(indicators) >= 3 and len(fp_indicators) <= 1:
        return ConfidenceLevel.HIGH if exploitation_confirmed else ConfidenceLevel.MEDIUM

    # Si hay algunos indicadores
    if len(indicators) >= 2:
        return ConfidenceLevel.MEDIUM

    # Pocos indicadores = baja confianza
    return ConfidenceLevel.LOW
```
Copilot AI · Dec 7, 2025
The function calculate_confidence has an unused parameter vuln_type that is documented but never used in the function body. Either utilize this parameter for vulnerability-specific confidence calculations or remove it from the signature.
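For reference, one way the parameter could be put to use, purely as a sketch (the type names and downgrade rule are invented, and removing the parameter would be equally valid):

```python
# Hypothetical per-type adjustment addressing the unused-parameter comment.
HIGH_FP_RATE_TYPES = {"ssrf", "idor"}  # invented: types prone to false positives


def adjust_for_type(level: ConfidenceLevel, vuln_type: str) -> ConfidenceLevel:
    # Demand stronger evidence for vulnerability classes with many false positives.
    if vuln_type in HIGH_FP_RATE_TYPES and level is ConfidenceLevel.MEDIUM:
        return ConfidenceLevel.LOW
    return level
```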
```python
import os
import pytest
from unittest.mock import MagicMock, AsyncMock, patch
```
Copilot AI · Dec 7, 2025
Import of 'AsyncMock' is not used.
Suggested change:
```diff
- from unittest.mock import MagicMock, AsyncMock, patch
+ from unittest.mock import MagicMock, patch
```
```python
import os
import pytest
from typing import Any
```
Copilot AI · Dec 7, 2025
Import of 'Any' is not used.
Suggested change:
```diff
- from typing import Any
```
```python
- Detección de indicadores de explotación
- Creación de VulnerabilityFinding
"""
import pytest
```
Copilot AI · Dec 7, 2025
Import of 'pytest' is not used.
Suggested change:
```diff
- import pytest
```
```python
- Content cleaning
"""

import pytest
```
Copilot AI · Dec 7, 2025
Import of 'pytest' is not used.
Suggested change:
```diff
- import pytest
```
```python
import os
import pytest
import asyncio
from unittest.mock import patch, MagicMock, AsyncMock
```
Copilot AI · Dec 7, 2025
Import of 'AsyncMock' is not used.
Suggested change:
```diff
- from unittest.mock import patch, MagicMock, AsyncMock
+ from unittest.mock import patch, MagicMock
```