LFX Phase 3: AsciiDoc-aware spec chunking by ishaan-arora-1 · Pull Request #1783 · riscv/riscv-unified-db

ishaan-arora-1 · 2026-04-09T15:30:01Z

Summary

Split the 52,602-line RISC-V specification into 78 semantically coherent chunks that preserve CSR section integrity for LLM-based parameter extraction. Builds on Phase 1 (#1765) and Phase 2 (#1766).

- chunker.py: AsciiDoc-aware chunking script with run, info, and verify CLI commands
- chunks/: 78 numbered chunk files with metadata headers, plus manifest.json

Chunking Rules

- CSR section atomicity: Never splits within a ==== section, so each CSR description (heading, bytefield, behavioral paragraphs) stays together
- Section boundaries: Splits at === or ==== AsciiDoc heading boundaries
- Target size: 2,500–3,500 lines (~35K–45K tokens), leaving room for prompt layers within 128K context
- Overlap: 30 lines of overlap context at chunk boundaries
- Small files: Files under 2,000 lines are processed as single chunks
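
The rules above can be sketched as a greedy splitter. This is a minimal illustration and not the actual chunker.py: the function names, the tuple layout, and the "latest heading that still fits under the cap" strategy are assumptions, and the 2,500-line lower target is not modeled. Only the heading levels, the 3,500-line cap, the 30-line overlap, and the 2,000-line single-chunk threshold come from the description.

```python
import re

# Matches "=== Title" and "==== Title" only. A bare "====" line (the AsciiDoc
# example-block delimiter) and deeper "=====" headings do not match.
HEADING = re.compile(r"^={3,4} \S")


def heading_indices(lines):
    """Return 0-based indices of lines that start a === or ==== section."""
    return [i for i, line in enumerate(lines) if HEADING.match(line)]


def chunk_lines(lines, single_chunk_below=2000, max_size=3500, overlap=30):
    """Split into (start, content_start, end) half-open, 0-based ranges.

    Lines start..content_start are overlap repeated from the previous chunk;
    content_start..end is new content. Cuts happen only at heading lines,
    so a ==== section is never divided (hypothetical strategy).
    """
    n = len(lines)
    if n < single_chunk_below:
        return [(0, 0, n)]
    boundaries = heading_indices(lines)
    chunks = []
    content_start = 0
    while True:
        start = max(0, content_start - overlap)
        if n - start <= max_size:
            chunks.append((start, content_start, n))
            return chunks
        limit = start + max_size
        in_range = [b for b in boundaries if content_start < b <= limit]
        beyond = [b for b in boundaries if b > limit]
        if in_range:
            cut = in_range[-1]   # latest heading that keeps the chunk under the cap
        elif beyond:
            cut = beyond[0]      # oversized chunk rather than a mid-section split
        else:
            chunks.append((start, content_start, n))
            return chunks
        chunks.append((start, content_start, cut))
        content_start = cut


# Synthetic 5,000-line file with a heading every 500 lines.
demo = ["=== Section" if i % 500 == 0 else "text" for i in range(5000)]
demo_chunks = chunk_lines(demo)  # [(0, 0, 3500), (3470, 3500, 5000)]
```

When no heading falls inside the size cap, this sketch emits an oversized chunk instead of cutting inside a section, which is one way to honor the atomicity rule.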

Results

Metric	Value
Total chunks	78
Total files	74
Multi-chunk files	4 (machine.adoc, scalar-crypto.adoc, v-st-ext.adoc, vector-crypto.adoc)
CSR section splits	0
Line coverage	100% on all multi-chunk files
Chunk size range	2–3,448 lines

Multi-Chunk File Details

File	Chunks	Chunk sizes (lines)
machine.adoc (3,629 lines)	2	3,334 + 325
scalar-crypto.adoc (5,590 lines)	2	3,448 + 2,172
v-st-ext.adoc (5,396 lines)	2	3,393 + 2,011
vector-crypto.adoc (4,966 lines)	2	3,340 + 1,656

How to Run

# Chunk all spec files
python3 param_extraction/scripts/chunker.py run

# Show chunking for a specific file
python3 param_extraction/scripts/chunker.py info ext/riscv-isa-manual/src/machine.adoc

# Verify chunking output
python3 param_extraction/scripts/chunker.py verify

Test Plan

- chunker.py verify passes: 74/74 files, 0 CSR splits, 0 gaps
- All 78 chunk files exist with correct metadata headers
- manifest.json is consistent with chunk files
- 100% line coverage on all 4 multi-chunk files
- Overlap regions present in all non-first chunks of multi-chunk files
- content_start_line correctly distinguishes overlap from new content
- No debug artifacts or unused imports
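
The coverage and overlap checks in this plan could look roughly like the sketch below. It is an illustration under assumptions, not the verify command itself: content_start_line is named in the plan, but start_line, end_line, the dict layout, and the function name are hypothetical, as is the 100-line example file.

```python
def check_coverage(chunks, total_lines):
    """Return a list of problems; an empty list means full coverage.

    Each chunk is a dict with 1-based inclusive 'start_line' and 'end_line',
    plus 'content_start_line' marking where new (non-overlap) content begins.
    The field names other than content_start_line are assumptions.
    """
    problems = []
    expected = 1  # next line that must begin a chunk's new content
    ordered = sorted(chunks, key=lambda c: c["content_start_line"])
    for i, c in enumerate(ordered):
        if c["content_start_line"] != expected:
            problems.append(
                f"chunk {i}: new content starts at line "
                f"{c['content_start_line']}, expected {expected}"
            )
        if i > 0 and c["start_line"] >= c["content_start_line"]:
            problems.append(f"chunk {i}: no overlap region before new content")
        expected = c["end_line"] + 1
    if expected != total_lines + 1:
        problems.append(f"coverage ends at line {expected - 1} of {total_lines}")
    return problems


# Hypothetical 100-line file split into two chunks with a 10-line overlap.
good = [
    {"start_line": 1, "content_start_line": 1, "end_line": 60},
    {"start_line": 51, "content_start_line": 61, "end_line": 100},
]
# Same split, but the second chunk's new content starts one line late.
gappy = [
    {"start_line": 1, "content_start_line": 1, "end_line": 60},
    {"start_line": 52, "content_start_line": 62, "end_line": 100},
]
ok_report = check_coverage(good, 100)
gap_report = check_coverage(gappy, 100)
```

Tracking new content separately from overlap is what lets a check like this report gaps without double-counting the repeated lines.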

Closes #1749

…tion

Add scripts and data for cataloging all 185 UDB architectural parameters with schema analysis, CSR cross-references, heuristic classifications, and candidate spec text locations. This forms the foundation for LLM-based parameter extraction from the RISC-V specification.

Scripts:
- export_udb_params.py: extracts parameters from YAML, derives value types, cross-references CSR IDL, classifies each parameter
- map_params_to_spec.py: searches 74 spec .adoc files for text related to each parameter using multi-strategy keyword matching
- generate_report.py: produces CSV catalog, text report, and param name list

Key results:
- 185 parameters cataloged (102 NORM_DIRECT, 55 NORM_CSR_RW, 26 NORM_CSR_WARL, 2 SW_RULE)
- 81% high-confidence classifications
- 98% of parameters mapped to spec text candidates

Closes riscv#1747

Design and implement the formal parameter classification taxonomy and prompt architecture for LLM-based extraction from RISC-V specifications.

Deliverables:
- taxonomy.md: formal definitions for 8 parameter classes (NORM_DIRECT, NORM_CSR_WARL, NORM_CSR_RW, SW_RULE, NON_ISA, NON_NORM, DOC_RULE, UNKNOWN) with disambiguation rules and a decision tree
- system_prompt.txt: ~940 token system prompt defining role, task, taxonomy, critical rules, and JSON output schema
- examples.json: 6 positive + 4 negative few-shot examples from real spec text covering all normative classes and key false-positive patterns (NOTE blocks, CSR behavior, fixed requirements, permission vs optionality "may")
- run_prompt.py: prompt assembler with 3 CLI modes (assemble, chunk, estimate) supporting context window management across models
- validate_prompt.py: 175-check validation suite for all deliverables

Key design decisions:
- Single-pass extraction + classification to preserve context
- Mandatory reasoning field in LLM output to reduce hallucinations
- Section-boundary-aware chunking with configurable overlap
- Three-layer prompt: system + examples + param names + spec chunk

Closes riscv#1748
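
The context-window arithmetic behind the layered prompt can be sketched as follows. This is not run_prompt.py's estimate mode. The ~940-token system prompt comes from this commit message, and the 45K-token chunk ceiling and 128K context figure come from the PR description. The token counts for the examples and parameter-name layers, and the output reservation, are hypothetical placeholders.

```python
def remaining_budget(context_window, layer_tokens, reserved_output=0):
    """Tokens left after all prompt layers and a reserved output allowance."""
    return context_window - sum(layer_tokens.values()) - reserved_output


layers = {
    "system_prompt": 940,   # from the commit message (~940 tokens)
    "examples": 3000,       # hypothetical placeholder
    "param_names": 2000,    # hypothetical placeholder
    "spec_chunk": 45000,    # upper end of the ~35K-45K estimate
}

headroom = remaining_budget(128_000, layers)                      # 77060
headroom_with_output = remaining_budget(128_000, layers, 8_000)   # 69060
```

A positive result means the largest chunk still fits alongside the other layers. A model with a smaller context window would need either smaller chunks or fewer examples.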

Add a chunker that splits the 52,602-line RISC-V specification into 78 semantically coherent chunks across 74 .adoc files, preserving CSR section integrity for LLM parameter extraction.

Key features:
- Never splits within a ==== section (CSR descriptions stay atomic)
- Splits at === or ==== AsciiDoc heading boundaries
- Target chunk size: 2,500-3,500 lines (~35K-45K tokens)
- Overlap context (30 lines) at chunk boundaries
- Files under 2,000 lines stay as single chunks
- Built-in verify command checks CSR integrity, coverage, and metadata

Results:
- 78 chunks across 74 files (4 files split into 2 chunks each)
- 100% line coverage on all multi-chunk files
- Zero CSR section splits
- Full manifest.json with per-chunk metadata

Closes riscv#1749

codecov · 2026-04-09T15:37:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.94%. Comparing base (ba151af) to head (ab8e241).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1783      +/-   ##
==========================================
- Coverage   71.95%   71.94%   -0.01%     
==========================================
  Files          55       55              
  Lines       28085    28085              
  Branches     6172     6172              
==========================================
- Hits        20209    20207       -2     
- Misses       7876     7878       +2

Flag	Coverage Δ
idlc	`75.96% <ø> (ø)`
udb	`65.78% <ø> (-0.02%)`	⬇️


ishaan-arora-1 added 3 commits April 9, 2026 15:19

ishaan-arora-1 requested review from ThinkOpenly and dhower-qc as code owners April 9, 2026 15:30

ishaan-arora-1 wants to merge 3 commits into riscv:main from ishaan-arora-1:lfx-phase3-spec-chunking
