
LFX - Phase 6: Prompt Refinement & V2 Results #1793

Open
ishaan-arora-1 wants to merge 6 commits into riscv:main from ishaan-arora-1:lfx-phase6-prompt-refinement

Conversation

@ishaan-arora-1
Contributor

Summary

  • Refine LLM prompts based on Phase 5 gap analysis, targeting 49 recoverable UDB recall misses
  • Add v2 system prompt with classification disambiguation rules and 7 commonly missed parameter pattern categories (counter/HPM, VM modes, tval reporting, alignment, implementation values, conditional SC failure, stateen control)
  • Add 4 new positive few-shot examples targeting previously missed parameter types
  • Implement prompt versioning support (PROMPT_VERSION env var) in run_prompt.py and extract.py for side-by-side v1/v2 comparison

V1 vs V2 Comparison

| Metric | V1 | V2 | Change |
| --- | ---: | ---: | ---: |
| Adjusted recall | 58.8% | 71.8% | +13.0 pp |
| Classification accuracy | 77.8% | 85.9% | +8.1 pp |
| Total params (deduped) | 202 | 330 | +63% |
| UDB recall misses | 73 | 50 | -31% |
| New params discovered | 115 | 220 | +91% |

What changed in v2 prompts

  • System prompt additions: Classification disambiguation section clarifying NORM_CSR_WARL vs NORM_CSR_RW vs NORM_DIRECT boundaries; "Commonly Missed Parameter Patterns" section with 7 specific categories and indicators
  • New examples: COUNTINHIBIT_EN (counter inhibit), GSTAGE_MODE_BARE (VM mode support), REPORT_ENCODING_IN_MTVAL_ON_ILLEGAL_INSTRUCTION (tval reporting), LRSC_FAIL_ON_NON_EXACT_LRSC (LR/SC conditional failure)
  • Versioning: Results stored in results/v2/ directory, prompts in prompts/v2/ (path resolution sketched below)
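
A minimal sketch of how the PROMPT_VERSION switch might resolve per-version paths. The env var name and the prompts/v2 and results/v2 layout come from this PR; the helper names and the v1 default are illustrative assumptions, not the actual run_prompt.py/extract.py code:

```python
import os
from pathlib import Path

# PROMPT_VERSION selects which prompt set and results directory to use.
# Defaulting to "v1" is an assumption for this sketch.
PROMPT_VERSION = os.environ.get("PROMPT_VERSION", "v1")


def prompt_dir(base: str = "prompts") -> Path:
    """Resolve prompts/<version>/, e.g. prompts/v2/ when PROMPT_VERSION=v2."""
    return Path(base) / PROMPT_VERSION


def results_dir(base: str = "results") -> Path:
    """Keep v1 and v2 outputs isolated so runs can be compared side by side."""
    out = Path(base) / PROMPT_VERSION
    out.mkdir(parents=True, exist_ok=True)
    return out
```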

Test plan

  • V2 extraction completes all 59 chunks without errors
  • Adjusted recall exceeds 70% target (achieved 71.8%)
  • Classification accuracy improves over v1 (85.9% vs 77.8%)
  • Prompt versioning correctly isolates v1 and v2 results
  • Pre-commit hooks pass (ruff, SPDX headers, formatting)

…tion

Add scripts and data for cataloging all 185 UDB architectural parameters
with schema analysis, CSR cross-references, heuristic classifications,
and candidate spec text locations. This forms the foundation for
LLM-based parameter extraction from the RISC-V specification.

Scripts:
- export_udb_params.py: extracts parameters from YAML, derives value
  types, cross-references CSR IDL, classifies each parameter
- map_params_to_spec.py: searches 74 spec .adoc files for text related
  to each parameter using multi-strategy keyword matching (illustrated
  just after this list)
- generate_report.py: produces CSV catalog, text report, and param
  name list
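
To make the multi-strategy matching concrete, here is a hedged sketch. The three strategies shown (exact parameter name, the name with underscores turned into spaces, and individual longer component words) are plausible stand-ins, not necessarily the ones map_params_to_spec.py implements:

```python
import re
from pathlib import Path


def candidate_patterns(param_name: str) -> list[re.Pattern]:
    """Three illustrative matching strategies for one parameter name."""
    words = param_name.lower().split("_")
    keywords = [param_name, " ".join(words)] + [w for w in words if len(w) > 3]
    return [re.compile(re.escape(k), re.IGNORECASE) for k in keywords]


def find_spec_candidates(param_name: str, spec_dir: Path) -> list[tuple[str, int]]:
    """Return (file, line_number) pairs where any strategy matches."""
    hits = []
    patterns = candidate_patterns(param_name)
    for adoc in sorted(spec_dir.glob("**/*.adoc")):
        for lineno, line in enumerate(adoc.read_text(encoding="utf-8").splitlines(), 1):
            if any(p.search(line) for p in patterns):
                hits.append((adoc.name, lineno))
                break  # one candidate location per file is enough for triage
    return hits
```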

Key results:
- 185 parameters cataloged (102 NORM_DIRECT, 55 NORM_CSR_RW,
  26 NORM_CSR_WARL, 2 SW_RULE)
- 81% high-confidence classifications
- 98% of parameters mapped to spec text candidates

Closes riscv#1747

Design and implement the formal parameter classification taxonomy and
prompt architecture for LLM-based extraction from RISC-V specifications.

Deliverables:
- taxonomy.md: formal definitions for 8 parameter classes (NORM_DIRECT,
  NORM_CSR_WARL, NORM_CSR_RW, SW_RULE, NON_ISA, NON_NORM, DOC_RULE,
  UNKNOWN) with disambiguation rules and a decision tree
- system_prompt.txt: ~940 token system prompt defining role, task,
  taxonomy, critical rules, and JSON output schema
- examples.json: 6 positive + 4 negative few-shot examples from real
  spec text covering all normative classes and key false-positive
  patterns (NOTE blocks, CSR behavior, fixed requirements, permission
  vs optionality "may")
- run_prompt.py: prompt assembler with 3 CLI modes (assemble, chunk,
  estimate) supporting context window management across models; a CLI
  skeleton is sketched after this list
- validate_prompt.py: 175-check validation suite for all deliverables
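
The three modes suggest a subcommand-style CLI. This skeleton is only a guess at the shape; the subcommand names come from the PR, while the flags and defaults are invented for illustration:

```python
import argparse


def build_cli() -> argparse.ArgumentParser:
    """Hypothetical CLI skeleton for run_prompt.py (flags are illustrative)."""
    ap = argparse.ArgumentParser(prog="run_prompt.py")
    sub = ap.add_subparsers(dest="mode", required=True)

    assemble = sub.add_parser("assemble", help="build a full prompt for one chunk")
    assemble.add_argument("--chunk", required=True)

    chunk = sub.add_parser("chunk", help="split spec text to fit a context window")
    chunk.add_argument("--max-tokens", type=int, default=45_000)

    estimate = sub.add_parser("estimate", help="report token counts per component")
    estimate.add_argument("--model", default="claude")
    return ap


if __name__ == "__main__":
    args = build_cli().parse_args()
    print(args)
```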

Key design decisions:
- Single-pass extraction + classification to preserve context
- Mandatory reasoning field in LLM output to reduce hallucinations
- Section-boundary-aware chunking with configurable overlap
- Three-layer prompt: system + examples + param names + spec chunk (assembly sketched below)
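
A sketch of the layered assembly, using the deliverable file names above. The few-shot rendering and the 'input'/'output' keys assumed for examples.json are guesses; the returned dict is shaped like an Anthropic Messages API call (system string plus one user message):

```python
import json
from pathlib import Path


def assemble_prompt(param_names: list[str], chunk_text: str,
                    prompt_dir: Path = Path("prompts")) -> dict:
    """Illustrative layered-prompt assembly (not the PR's exact code)."""
    system = (prompt_dir / "system_prompt.txt").read_text()
    examples = json.loads((prompt_dir / "examples.json").read_text())

    # Few-shot layer: each example rendered as spec text plus expected JSON.
    shots = "\n\n".join(
        f"Spec text:\n{ex['input']}\nExpected output:\n{json.dumps(ex['output'])}"
        for ex in examples
    )
    user = (
        f"Examples:\n{shots}\n\n"
        f"Known UDB parameter names:\n{', '.join(param_names)}\n\n"
        f"Spec chunk:\n{chunk_text}"
    )
    return {"system": system, "messages": [{"role": "user", "content": user}]}
```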

Closes riscv#1748

Add a chunker that splits the 52,602-line RISC-V specification into
78 semantically coherent chunks across 74 .adoc files, preserving
CSR section integrity for LLM parameter extraction.

Key features:
- Never splits within a ==== section (CSR descriptions stay atomic)
- Splits at === or ==== AsciiDoc heading boundaries (sketched after this list)
- Target chunk size: 2,500-3,500 lines (~35K-45K tokens)
- Overlap context (30 lines) at chunk boundaries
- Files under 2,000 lines stay as single chunks
- Built-in verify command checks CSR integrity, coverage, and metadata
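
The splitting rules above reduce to a small loop. This sketch keeps the stated constraints (split only at === / ==== headings, ~30-line overlap, small files untouched) but omits the CSR-integrity verification and manifest output the real chunker also does:

```python
def chunk_lines(lines: list[str], max_size: int = 3500,
                overlap: int = 30, single_file_max: int = 2000) -> list[list[str]]:
    """Illustrative section-boundary-aware chunker (not the PR's chunker)."""
    if len(lines) <= single_file_max:
        return [lines]  # small files stay as single chunks

    # Candidate split points: lines starting a === or ==== AsciiDoc section.
    headings = [i for i, ln in enumerate(lines)
                if ln.startswith("=== ") or ln.startswith("==== ")]

    chunks: list[list[str]] = []
    start = 0
    while len(lines) - start > max_size:
        in_budget = [h for h in headings if start < h <= start + max_size]
        # Prefer the last heading within the size budget; if a section is
        # oversized, take the next heading anyway rather than split inside it.
        cut = (max(in_budget) if in_budget
               else next((h for h in headings if h > start), len(lines)))
        if cut >= len(lines):
            break
        chunks.append(lines[max(0, start - overlap):cut])
        start = cut
    chunks.append(lines[max(0, start - overlap):])  # final chunk
    return chunks
```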

Results:
- 78 chunks across 74 files (4 files split into 2 chunks each)
- 100% line coverage on all multi-chunk files
- Zero CSR section splits
- Full manifest.json with per-chunk metadata

Closes riscv#1749

Add extract.py for automated parameter extraction using Anthropic Claude.
Features include token-aware rate limiting, exponential backoff for API
errors, source file skipping for non-parameter content, and pilot/run/merge
CLI modes. Includes v1 extraction results across 59 spec chunks (208
unique parameters found).
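
A minimal sketch of the retry loop around the Anthropic SDK. The model name, token limit, and retry schedule here are illustrative, and the real extract.py additionally does token-aware rate limiting before each call:

```python
import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def extract_chunk(system: str, user: str, model: str = "claude-3-5-sonnet-latest",
                  max_retries: int = 5) -> str:
    """Call the Messages API with exponential backoff on rate-limit errors."""
    delay = 2.0
    for attempt in range(max_retries):
        try:
            resp = client.messages.create(
                model=model,
                max_tokens=4096,
                system=system,
                messages=[{"role": "user", "content": user}],
            )
            return resp.content[0].text
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # back off: 2s, 4s, 8s, ...
    raise RuntimeError("unreachable")
```
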
Add analyze.py for deduplication, UDB alignment, metrics computation,
and discrepancy reporting. V1 results: 58.8% adjusted recall, 77.8%
classification accuracy, 202 unique parameters after deduplication.
Identifies 73 UDB recall misses categorized as debug-spec (32) and
recoverable (49) for prompt refinement targeting.
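
The dedup and adjusted-recall computations are simple set operations in principle. The normalization rule and the exclusion of out-of-scope (e.g. debug-spec) parameters below are assumptions about analyze.py's details, not its actual code:

```python
def dedupe(extracted: list[dict]) -> dict[str, dict]:
    """Collapse duplicate extractions by normalized name (illustrative rule:
    first occurrence wins, names compared case-insensitively)."""
    unique: dict[str, dict] = {}
    for param in extracted:
        unique.setdefault(param["name"].strip().upper(), param)
    return unique


def adjusted_recall(found: set[str], udb: set[str], out_of_scope: set[str]) -> float:
    """Recall against the UDB catalog after removing parameters judged out
    of scope for this extraction (e.g. debug-spec items)."""
    in_scope = udb - out_of_scope
    return len(found & in_scope) / len(in_scope)
```
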
Refine prompts based on Phase 5 gap analysis:
- Add v2 system prompt with classification disambiguation and 7 commonly
  missed parameter pattern categories
- Add 4 new positive examples targeting counter/HPM, VM modes, tval
  reporting, and LR/SC conditional failure patterns
- Add prompt versioning support (PROMPT_VERSION env var) to run_prompt.py
  and extract.py for side-by-side v1/v2 comparison

V2 results show significant improvement over v1:
- Adjusted recall: 71.8% (up from 58.8%)
- Classification accuracy: 85.9% (up from 77.8%)
- Total parameters found: 330 (up from 202)
@codecov

codecov Bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.16%. Comparing base (ba151af) to head (d58f726).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #1793      +/-   ##
==========================================
+ Coverage   71.95%   72.16%   +0.20%
==========================================
  Files          55       55
  Lines       28085    27799     -286
  Branches     6172     6009     -163
==========================================
- Hits        20209    20060     -149
+ Misses       7876     7739     -137
```
| Flag | Coverage Δ |
| --- | --- |
| idlc | 76.18% <ø> (+0.21%) ⬆️ |
| udb | 66.11% <ø> (+0.31%) ⬆️ |

Flags with carried forward coverage won't be shown.
