
LFX - Phase 6: Prompt Refinement & V2 Results #1793

Open
ishaan-arora-1 wants to merge 6 commits into riscv:main from ishaan-arora-1:lfx-phase6-prompt-refinement

Conversation

@ishaan-arora-1
Contributor

Summary

  • Refine LLM prompts based on Phase 5 gap analysis, targeting 49 recoverable UDB recall misses
  • Add v2 system prompt with classification disambiguation rules and 7 commonly missed parameter pattern categories (counter/HPM, VM modes, tval reporting, alignment, implementation values, conditional SC failure, stateen control)
  • Add 4 new positive few-shot examples targeting previously missed parameter types
  • Implement prompt versioning support (PROMPT_VERSION env var) in run_prompt.py and extract.py for side-by-side v1/v2 comparison

V1 vs V2 Comparison

| Metric | V1 | V2 | Change |
| --- | ---: | ---: | ---: |
| Adjusted recall | 58.8% | 71.8% | +13.0 pp |
| Classification accuracy | 77.8% | 85.9% | +8.1 pp |
| Total params (deduped) | 202 | 330 | +63% |
| UDB recall misses | 73 | 50 | -31% |
| New params discovered | 115 | 220 | +91% |

What changed in v2 prompts

  • System prompt additions: Classification disambiguation section clarifying NORM_CSR_WARL vs NORM_CSR_RW vs NORM_DIRECT boundaries; "Commonly Missed Parameter Patterns" section with 7 specific categories and indicators
  • New examples: COUNTINHIBIT_EN (counter inhibit), GSTAGE_MODE_BARE (VM mode support), REPORT_ENCODING_IN_MTVAL_ON_ILLEGAL_INSTRUCTION (tval reporting), LRSC_FAIL_ON_NON_EXACT_LRSC (LR/SC conditional failure)
  • Versioning: Results stored in results/v2/ directory, prompts in prompts/v2/ (path resolution sketched below)
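
A minimal sketch of how the PROMPT_VERSION switch might resolve per-version paths. The env var name and the prompts/v2 and results/v2 layout come from this PR; the helper names and the v1 default are illustrative assumptions, not the actual run_prompt.py/extract.py code:

```python
import os
from pathlib import Path

# PROMPT_VERSION selects which prompt set and results directory to use.
# Defaulting to "v1" is an assumption for this sketch.
PROMPT_VERSION = os.environ.get("PROMPT_VERSION", "v1")


def prompt_dir(base: str = "prompts") -> Path:
    """Resolve prompts/<version>/, e.g. prompts/v2/ when PROMPT_VERSION=v2."""
    return Path(base) / PROMPT_VERSION


def results_dir(base: str = "results") -> Path:
    """Keep v1 and v2 outputs isolated so runs can be compared side by side."""
    out = Path(base) / PROMPT_VERSION
    out.mkdir(parents=True, exist_ok=True)
    return out
```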

Test plan

  • V2 extraction completes all 59 chunks without errors
  • Adjusted recall exceeds 70% target (achieved 71.8%)
  • Classification accuracy improves over v1 (85.9% vs 77.8%)
  • Prompt versioning correctly isolates v1 and v2 results
  • Pre-commit hooks pass (ruff, SPDX headers, formatting)

…tion

Add scripts and data for cataloging all 185 UDB architectural parameters
with schema analysis, CSR cross-references, heuristic classifications,
and candidate spec text locations. This forms the foundation for
LLM-based parameter extraction from the RISC-V specification.

Scripts:
- export_udb_params.py: extracts parameters from YAML, derives value
  types, cross-references CSR IDL, classifies each parameter
- map_params_to_spec.py: searches 74 spec .adoc files for text related
  to each parameter using multi-strategy keyword matching (illustrated
  just after this list)
- generate_report.py: produces CSV catalog, text report, and param
  name list
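
To make the multi-strategy matching concrete, here is a hedged sketch. The three strategies shown (exact parameter name, the name with underscores turned into spaces, and individual longer component words) are plausible stand-ins, not necessarily the ones map_params_to_spec.py implements:

```python
import re
from pathlib import Path


def candidate_patterns(param_name: str) -> list[re.Pattern]:
    """Three illustrative matching strategies for one parameter name."""
    words = param_name.lower().split("_")
    keywords = [param_name, " ".join(words)] + [w for w in words if len(w) > 3]
    return [re.compile(re.escape(k), re.IGNORECASE) for k in keywords]


def find_spec_candidates(param_name: str, spec_dir: Path) -> list[tuple[str, int]]:
    """Return (file, line_number) pairs where any strategy matches."""
    hits = []
    patterns = candidate_patterns(param_name)
    for adoc in sorted(spec_dir.glob("**/*.adoc")):
        for lineno, line in enumerate(adoc.read_text(encoding="utf-8").splitlines(), 1):
            if any(p.search(line) for p in patterns):
                hits.append((adoc.name, lineno))
                break  # one candidate location per file is enough for triage
    return hits
```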

Key results:
- 185 parameters cataloged (102 NORM_DIRECT, 55 NORM_CSR_RW,
  26 NORM_CSR_WARL, 2 SW_RULE)
- 81% high-confidence classifications
- 98% of parameters mapped to spec text candidates

Closes riscv#1747

Design and implement the formal parameter classification taxonomy and
prompt architecture for LLM-based extraction from RISC-V specifications.

Deliverables:
- taxonomy.md: formal definitions for 8 parameter classes (NORM_DIRECT,
  NORM_CSR_WARL, NORM_CSR_RW, SW_RULE, NON_ISA, NON_NORM, DOC_RULE,
  UNKNOWN) with disambiguation rules and a decision tree
- system_prompt.txt: ~940 token system prompt defining role, task,
  taxonomy, critical rules, and JSON output schema
- examples.json: 6 positive + 4 negative few-shot examples from real
  spec text covering all normative classes and key false-positive
  patterns (NOTE blocks, CSR behavior, fixed requirements, permission
  vs optionality "may")
- run_prompt.py: prompt assembler with 3 CLI modes (assemble, chunk,
  estimate) supporting context window management across models; a CLI
  skeleton is sketched after this list
- validate_prompt.py: 175-check validation suite for all deliverables
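
The three modes suggest a subcommand-style CLI. This skeleton is only a guess at the shape; the subcommand names come from the PR, while the flags and defaults are invented for illustration:

```python
import argparse


def build_cli() -> argparse.ArgumentParser:
    """Hypothetical CLI skeleton for run_prompt.py (flags are illustrative)."""
    ap = argparse.ArgumentParser(prog="run_prompt.py")
    sub = ap.add_subparsers(dest="mode", required=True)

    assemble = sub.add_parser("assemble", help="build a full prompt for one chunk")
    assemble.add_argument("--chunk", required=True)

    chunk = sub.add_parser("chunk", help="split spec text to fit a context window")
    chunk.add_argument("--max-tokens", type=int, default=45_000)

    estimate = sub.add_parser("estimate", help="report token counts per component")
    estimate.add_argument("--model", default="claude")
    return ap


if __name__ == "__main__":
    args = build_cli().parse_args()
    print(args)
```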

Key design decisions:
- Single-pass extraction + classification to preserve context
- Mandatory reasoning field in LLM output to reduce hallucinations
- Section-boundary-aware chunking with configurable overlap
- Three-layer prompt: system + examples + param names + spec chunk (assembly sketched below)
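
A sketch of the layered assembly, using the deliverable file names above. The few-shot rendering and the 'input'/'output' keys assumed for examples.json are guesses; the returned dict is shaped like an Anthropic Messages API call (system string plus one user message):

```python
import json
from pathlib import Path


def assemble_prompt(param_names: list[str], chunk_text: str,
                    prompt_dir: Path = Path("prompts")) -> dict:
    """Illustrative layered-prompt assembly (not the PR's exact code)."""
    system = (prompt_dir / "system_prompt.txt").read_text()
    examples = json.loads((prompt_dir / "examples.json").read_text())

    # Few-shot layer: each example rendered as spec text plus expected JSON.
    shots = "\n\n".join(
        f"Spec text:\n{ex['input']}\nExpected output:\n{json.dumps(ex['output'])}"
        for ex in examples
    )
    user = (
        f"Examples:\n{shots}\n\n"
        f"Known UDB parameter names:\n{', '.join(param_names)}\n\n"
        f"Spec chunk:\n{chunk_text}"
    )
    return {"system": system, "messages": [{"role": "user", "content": user}]}
```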

Closes riscv#1748

Add a chunker that splits the 52,602-line RISC-V specification into
78 semantically coherent chunks across 74 .adoc files, preserving
CSR section integrity for LLM parameter extraction.

Key features:
- Never splits within a ==== section (CSR descriptions stay atomic)
- Splits at === or ==== AsciiDoc heading boundaries (sketched after this list)
- Target chunk size: 2,500-3,500 lines (~35K-45K tokens)
- Overlap context (30 lines) at chunk boundaries
- Files under 2,000 lines stay as single chunks
- Built-in verify command checks CSR integrity, coverage, and metadata
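
The splitting rules above reduce to a small loop. This sketch keeps the stated constraints (split only at === / ==== headings, ~30-line overlap, small files untouched) but omits the CSR-integrity verification and manifest output the real chunker also does:

```python
def chunk_lines(lines: list[str], max_size: int = 3500,
                overlap: int = 30, single_file_max: int = 2000) -> list[list[str]]:
    """Illustrative section-boundary-aware chunker (not the PR's chunker)."""
    if len(lines) <= single_file_max:
        return [lines]  # small files stay as single chunks

    # Candidate split points: lines starting a === or ==== AsciiDoc section.
    headings = [i for i, ln in enumerate(lines)
                if ln.startswith("=== ") or ln.startswith("==== ")]

    chunks: list[list[str]] = []
    start = 0
    while len(lines) - start > max_size:
        in_budget = [h for h in headings if start < h <= start + max_size]
        # Prefer the last heading within the size budget; if a section is
        # oversized, take the next heading anyway rather than split inside it.
        cut = (max(in_budget) if in_budget
               else next((h for h in headings if h > start), len(lines)))
        if cut >= len(lines):
            break
        chunks.append(lines[max(0, start - overlap):cut])
        start = cut
    chunks.append(lines[max(0, start - overlap):])  # final chunk
    return chunks
```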

Results:
- 78 chunks across 74 files (4 files split into 2 chunks each)
- 100% line coverage on all multi-chunk files
- Zero CSR section splits
- Full manifest.json with per-chunk metadata

Closes riscv#1749

Add extract.py for automated parameter extraction using Anthropic Claude.
Features include token-aware rate limiting, exponential backoff for API
errors, source file skipping for non-parameter content, and pilot/run/merge
CLI modes. Includes v1 extraction results across 59 spec chunks (208
unique parameters found).
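
A minimal sketch of the retry loop around the Anthropic SDK. The model name, token limit, and retry schedule here are illustrative, and the real extract.py additionally does token-aware rate limiting before each call:

```python
import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def extract_chunk(system: str, user: str, model: str = "claude-3-5-sonnet-latest",
                  max_retries: int = 5) -> str:
    """Call the Messages API with exponential backoff on rate-limit errors."""
    delay = 2.0
    for attempt in range(max_retries):
        try:
            resp = client.messages.create(
                model=model,
                max_tokens=4096,
                system=system,
                messages=[{"role": "user", "content": user}],
            )
            return resp.content[0].text
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # back off: 2s, 4s, 8s, ...
    raise RuntimeError("unreachable")
```
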
Add analyze.py for deduplication, UDB alignment, metrics computation,
and discrepancy reporting. V1 results: 58.8% adjusted recall, 77.8%
classification accuracy, 202 unique parameters after deduplication.
Identifies 73 UDB recall misses categorized as debug-spec (32) and
recoverable (49) for prompt refinement targeting.
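
The dedup and adjusted-recall computations are simple set operations in principle. The normalization rule and the exclusion of out-of-scope (e.g. debug-spec) parameters below are assumptions about analyze.py's details, not its actual code:

```python
def dedupe(extracted: list[dict]) -> dict[str, dict]:
    """Collapse duplicate extractions by normalized name (illustrative rule:
    first occurrence wins, names compared case-insensitively)."""
    unique: dict[str, dict] = {}
    for param in extracted:
        unique.setdefault(param["name"].strip().upper(), param)
    return unique


def adjusted_recall(found: set[str], udb: set[str], out_of_scope: set[str]) -> float:
    """Recall against the UDB catalog after removing parameters judged out
    of scope for this extraction (e.g. debug-spec items)."""
    in_scope = udb - out_of_scope
    return len(found & in_scope) / len(in_scope)
```
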
Refine prompts based on Phase 5 gap analysis:
- Add v2 system prompt with classification disambiguation and 7 commonly
  missed parameter pattern categories
- Add 4 new positive examples targeting counter/HPM, VM modes, tval
  reporting, and LR/SC conditional failure patterns
- Add prompt versioning support (PROMPT_VERSION env var) to run_prompt.py
  and extract.py for side-by-side v1/v2 comparison

V2 results show significant improvement over v1:
- Adjusted recall: 71.8% (up from 58.8%)
- Classification accuracy: 85.9% (up from 77.8%)
- Total parameters found: 330 (up from 202)
@codecov

codecov Bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.16%. Comparing base (ba151af) to head (d58f726).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #1793      +/-   ##
==========================================
+ Coverage   71.95%   72.16%   +0.20%
==========================================
  Files          55       55
  Lines       28085    27799     -286
  Branches     6172     6009     -163
==========================================
- Hits        20209    20060     -149
+ Misses       7876     7739     -137
```
| Flag | Coverage Δ |
| --- | --- |
| idlc | 76.18% <ø> (+0.21%) ⬆️ |
| udb | 66.11% <ø> (+0.31%) ⬆️ |

Flags with carried forward coverage won't be shown.
