LFX - Phase 4: LLM Extraction Pipeline#1791
Open
ishaan-arora-1 wants to merge 4 commits into riscv:main from
…tion

Add scripts and data for cataloging all 185 UDB architectural parameters with schema analysis, CSR cross-references, heuristic classifications, and candidate spec text locations. This forms the foundation for LLM-based parameter extraction from the RISC-V specification.

Scripts:
- export_udb_params.py: extracts parameters from YAML, derives value types, cross-references CSR IDL, classifies each parameter
- map_params_to_spec.py: searches 74 spec .adoc files for text related to each parameter using multi-strategy keyword matching
- generate_report.py: produces CSV catalog, text report, and param name list

Key results:
- 185 parameters cataloged (102 NORM_DIRECT, 55 NORM_CSR_RW, 26 NORM_CSR_WARL, 2 SW_RULE)
- 81% high-confidence classifications
- 98% of parameters mapped to spec text candidates

Closes riscv#1747
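As a quick sanity check on the figures above, the per-class counts can be summed and turned into shares. This is a minimal sketch: only the four counts come from the commit message, and the helper name `class_shares` is hypothetical.

```python
# Class counts exactly as reported in the commit message above.
reported = {
    "NORM_DIRECT": 102,
    "NORM_CSR_RW": 55,
    "NORM_CSR_WARL": 26,
    "SW_RULE": 2,
}


def class_shares(counts):
    """Return (total, {class: percentage rounded to one decimal})."""
    total = sum(counts.values())
    shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
    return total, shares


total, shares = class_shares(reported)
print(total)   # should match the 185 parameters cataloged
print(shares)
```

The four counts add up to the stated 185, so the catalog and the class breakdown are consistent with each other.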
Design and implement the formal parameter classification taxonomy and prompt architecture for LLM-based extraction from RISC-V specifications.

Deliverables:
- taxonomy.md: formal definitions for 8 parameter classes (NORM_DIRECT, NORM_CSR_WARL, NORM_CSR_RW, SW_RULE, NON_ISA, NON_NORM, DOC_RULE, UNKNOWN) with disambiguation rules and a decision tree
- system_prompt.txt: ~940-token system prompt defining role, task, taxonomy, critical rules, and JSON output schema
- examples.json: 6 positive + 4 negative few-shot examples from real spec text covering all normative classes and key false-positive patterns (NOTE blocks, CSR behavior, fixed requirements, permission vs. optionality "may")
- run_prompt.py: prompt assembler with 3 CLI modes (assemble, chunk, estimate) supporting context window management across models
- validate_prompt.py: 175-check validation suite for all deliverables

Key design decisions:
- Single-pass extraction + classification to preserve context
- Mandatory reasoning field in LLM output to reduce hallucinations
- Section-boundary-aware chunking with configurable overlap
- Three-layer prompt: system + examples + param names + spec chunk

Closes riscv#1748
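The "mandatory reasoning field" decision can be enforced with a small check on each record the model returns. The sketch below is an assumption about how such a check might look: the eight class names come from the commit message, but the field names `name`, `class`, and `reasoning` and the function `validate_record` are hypothetical, since the actual JSON output schema is not shown here.

```python
# The eight classes listed for taxonomy.md in the commit message.
TAXONOMY = {
    "NORM_DIRECT", "NORM_CSR_WARL", "NORM_CSR_RW", "SW_RULE",
    "NON_ISA", "NON_NORM", "DOC_RULE", "UNKNOWN",
}


def validate_record(record):
    """Return a list of problems found in one extracted-parameter record.

    Field names ("name", "class", "reasoning") are assumed for illustration.
    """
    problems = []
    name = record.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("missing or empty name")
    if record.get("class") not in TAXONOMY:
        problems.append("class is not one of the 8 taxonomy classes")
    reasoning = record.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        problems.append("missing or empty reasoning")
    return problems


good = {"name": "EXAMPLE_PARAM", "class": "NORM_DIRECT",
        "reasoning": "The spec text states the value is implementation-defined."}
bad = {"name": "EXAMPLE_PARAM", "class": "NORMATIVE"}
print(validate_record(good))  # no problems expected
print(validate_record(bad))   # flags the unknown class and missing reasoning
```

Rejecting records without a reasoning string keeps the model from emitting bare classifications that cannot be audited against the spec text.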
Add a chunker that splits the 52,602-line RISC-V specification into 78 semantically coherent chunks across 74 .adoc files, preserving CSR section integrity for LLM parameter extraction.

Key features:
- Never splits within a ==== section (CSR descriptions stay atomic)
- Splits at === or ==== AsciiDoc heading boundaries
- Target chunk size: 2,500-3,500 lines (~35K-45K tokens)
- Overlap context (30 lines) at chunk boundaries
- Files under 2,000 lines stay as single chunks
- Built-in verify command checks CSR integrity, coverage, and metadata

Results:
- 78 chunks across 74 files (4 files split into 2 chunks each)
- 100% line coverage on all multi-chunk files
- Zero CSR section splits
- Full manifest.json with per-chunk metadata

Closes riscv#1749
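The splitting rules above can be illustrated with a short greedy sketch. It is not the chunker from this PR: the function names `chunk_ranges` and `with_overlap` are hypothetical, and the only inputs taken from the commit message are the heading levels, the 2,500-line lower target, the 2,000-line single-chunk threshold, and the 30-line overlap.

```python
import re

# Matches AsciiDoc level-3 and level-4 headings ("=== Title" / "==== Title").
HEADING = re.compile(r"^={3,4} \S")


def chunk_ranges(lines, target=2500, single_file_limit=2000):
    """Return half-open (start, end) line ranges that only cut at headings."""
    if len(lines) < single_file_limit:
        return [(0, len(lines))]          # small files stay as one chunk
    ranges, start = [], 0
    for i, line in enumerate(lines):
        # Cut only once the chunk is large enough AND we sit on a heading,
        # so the body of a ==== section is never divided.
        if i - start >= target and HEADING.match(line):
            ranges.append((start, i))
            start = i
    ranges.append((start, len(lines)))
    return ranges


def with_overlap(ranges, overlap=30):
    """Extend each chunk backwards by `overlap` lines of context."""
    return [(max(0, s - overlap), e) for s, e in ranges]


# Synthetic 4,000-line file with a heading every 100 lines.
demo = []
for n in range(40):
    demo.append(f"=== Section {n}")
    demo.extend(["body text"] * 99)
print(chunk_ranges(demo))
print(with_overlap(chunk_ranges(demo)))
```

Because a cut is taken only when the current line is itself a heading, every chunk after the first begins on a section boundary, and the overlap is added afterwards as read-only context.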
Add extract.py for automated parameter extraction using Anthropic Claude. Features include token-aware rate limiting, exponential backoff for API errors, source file skipping for non-parameter content, and pilot/run/merge CLI modes. Includes v1 extraction results across 59 spec chunks (208 unique parameters found).
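Exponential backoff for API errors, as mentioned above, generally follows the pattern below. This is a generic sketch rather than the code in extract.py: `call_with_backoff` and its parameters are hypothetical, it does not call the Anthropic SDK, and it does not cover the token-aware rate limiting.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, retry_on=(Exception,)):
    """Call fn(), retrying on failure with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


# Usage: a stand-in for an API call that fails twice, then succeeds.
state = {"calls": 0}


def flaky_request():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient API error")
    return "ok"


recorded = []
print(call_with_backoff(flaky_request, sleep=recorded.append))
print(recorded)  # two delays, roughly 1s and 2s plus jitter
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in practice `retry_on` would be narrowed to the client library's rate-limit and server-error exceptions.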
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #1791      +/-   ##
==========================================
+ Coverage   71.95%   72.16%   +0.20%
==========================================
  Files          55       55
  Lines       28085    27799     -286
  Branches     6172     6009     -163
==========================================
- Hits        20209    20060     -149
+ Misses       7876     7739     -137
Summary
- extract.py: automated LLM extraction pipeline for identifying architectural parameters in the RISC-V specification

Key capabilities
- … machine.adoc chunks only (prompt validation)
- … .adoc files (bibliography, index, rationale, etc.) automatically excluded

Results structure
- results/claude-sonnet-4/chunk_NNN.json
- all_results_claude-sonnet-4.json with all extracted parameters

Test plan
- … machine.adoc chunks validates prompt quality
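The results layout above (one JSON file per chunk, plus a combined file) suggests a merge step along the following lines. This is a sketch under stated assumptions: it presumes each chunk file holds a JSON list of records with a `name` key, which is not confirmed by the PR text, and `merge_chunk_results` is a hypothetical name.

```python
import json
from pathlib import Path


def merge_chunk_results(results_dir, out_path):
    """Combine chunk_*.json files into one list, de-duplicated by name.

    Assumes each chunk file is a JSON list of {"name": ..., ...} records.
    """
    merged = {}
    for path in sorted(Path(results_dir).glob("chunk_*.json")):
        for record in json.loads(path.read_text(encoding="utf-8")):
            # Keep the first occurrence of each parameter name.
            merged.setdefault(record["name"], record)
    combined = list(merged.values())
    Path(out_path).write_text(json.dumps(combined, indent=2), encoding="utf-8")
    return combined
```

A call such as `merge_chunk_results("results/claude-sonnet-4", "all_results_claude-sonnet-4.json")` would then produce the combined file. Keeping the first occurrence is one possible policy; a real merge might instead reconcile conflicting classifications for the same parameter.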