LFX Phase 3: AsciiDoc-aware spec chunking #1783

Open
ishaan-arora-1 wants to merge 3 commits into riscv:main from ishaan-arora-1:lfx-phase3-spec-chunking

Summary

Split the 52,602-line RISC-V specification into 78 semantically coherent chunks that preserve CSR section integrity for LLM-based parameter extraction. Builds on Phase 1 (#1765) and Phase 2 (#1766).

  • chunker.py: AsciiDoc-aware chunking script with run, info, and verify CLI commands
  • chunks/: 78 numbered chunk files with metadata headers + manifest.json
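
For readers who have not opened the script, the three CLI commands could be wired up roughly as follows. This is a minimal sketch and not the actual contents of chunker.py: the `build_parser` and `main` names, the help strings, and the returned summary strings are assumptions made for illustration. Only the command names (`run`, `info`, `verify`) and the file-path argument to `info` come from this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three subcommands named in the PR (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        prog="chunker.py",
        description="AsciiDoc-aware spec chunking (illustrative sketch)",
    )
    sub = parser.add_subparsers(dest="command", required=True)

    # `run` takes no arguments: chunk every spec file.
    sub.add_parser("run", help="chunk all spec files")

    # `info` takes the path of one .adoc file to inspect.
    info = sub.add_parser("info", help="show chunking for a specific file")
    info.add_argument("path", help="path to a .adoc file")

    # `verify` takes no arguments: check the generated output.
    sub.add_parser("verify", help="verify chunking output")
    return parser


def main(argv=None) -> str:
    """Parse argv and return a short description of the requested action."""
    args = build_parser().parse_args(argv)
    if args.command == "info":
        return f"info {args.path}"
    return args.command
```

Calling `main(["info", "ext/riscv-isa-manual/src/machine.adoc"])` in this sketch only echoes the request back; a real implementation would dispatch to the chunking logic.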

Chunking Rules

  1. CSR section atomicity: Never splits within a ==== section — each CSR description (heading, bytefield, behavioral paragraphs) stays together
  2. Section boundaries: Splits at === or ==== AsciiDoc heading boundaries
  3. Target size: 2,500–3,500 lines (~35K–45K tokens), leaving room for prompt layers within 128K context
  4. Overlap: 30 lines of overlap context at chunk boundaries
  5. Small files: Files under 2,000 lines are processed as single chunks
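
The rules above can be sketched as a greedy splitter. This is an illustration under stated assumptions, not the code in chunker.py: the `chunk_lines` name, the tuple layout, the regular expression, and the choice of the last heading inside the target window are all hypothetical. The thresholds (2,500, 3,500, 30, 2,000) are the ones listed in this PR.

```python
import re

# Assumption: a cut point is any line that starts a === or ==== section title.
# A bare "====" block delimiter (no title text) does not match.
HEADING = re.compile(r"^={3,4} \S")


def chunk_lines(lines, min_size=2500, max_size=3500, overlap=30, small_file=2000):
    """Return (overlap_start, content_start, end) tuples, 0-based and half-open.

    Cuts are made only at heading lines, so the body of a ==== section
    is never divided between two chunks.
    """
    n = len(lines)
    if n < small_file:
        return [(0, 0, n)]  # rule 5: small files stay whole

    boundaries = [i for i, line in enumerate(lines) if HEADING.match(line)]
    chunks = []
    start = 0
    while n - start > max_size:
        # Prefer the last heading that keeps the chunk inside the target window.
        window = [b for b in boundaries if start + min_size <= b <= start + max_size]
        if window:
            cut = window[-1]
        else:
            # No heading in the window: accept an oversized chunk rather than
            # split inside a section.
            later = [b for b in boundaries if b > start + max_size]
            if not later:
                break
            cut = later[0]
        chunks.append((max(0, start - overlap), start, cut))
        start = cut
    chunks.append((max(0, start - overlap), start, n))
    return chunks
```

Each tuple separates the overlap region (`overlap_start` up to `content_start`) from the new content, which mirrors the role that `content_start_line` plays in the chunk metadata.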

Results

| Metric | Value |
| --- | --- |
| Total chunks | 78 |
| Total files | 74 |
| Multi-chunk files | 4 (machine.adoc, scalar-crypto.adoc, v-st-ext.adoc, vector-crypto.adoc) |
| CSR section splits | 0 |
| Line coverage | 100% on all multi-chunk files |
| Chunk size range | 2–3,448 lines |

Multi-Chunk File Details

| File | Chunks | Chunk sizes (lines) |
| --- | --- | --- |
| machine.adoc (3,629 lines) | 2 | 3,334 + 325 |
| scalar-crypto.adoc (5,590 lines) | 2 | 3,448 + 2,172 |
| v-st-ext.adoc (5,396 lines) | 2 | 3,393 + 2,011 |
| vector-crypto.adoc (4,966 lines) | 2 | 3,340 + 1,656 |

How to Run

# Chunk all spec files
python3 param_extraction/scripts/chunker.py run

# Show chunking for a specific file
python3 param_extraction/scripts/chunker.py info ext/riscv-isa-manual/src/machine.adoc

# Verify chunking output
python3 param_extraction/scripts/chunker.py verify

Test Plan

  • chunker.py verify passes: 74/74 files, 0 CSR splits, 0 gaps
  • All 78 chunk files exist with correct metadata headers
  • manifest.json is consistent with chunk files
  • 100% line coverage on all 4 multi-chunk files
  • Overlap regions present in all non-first chunks of multi-chunk files
  • content_start_line correctly distinguishes overlap from new content
  • No debug artifacts or unused imports
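
The coverage, overlap, and `content_start_line` checks in this plan could look roughly like the sketch below. The manifest schema here is an assumption: `start_line` and `end_line` are hypothetical field names (1-based, inclusive), and only `content_start_line` is named in this PR. The example values are one layout consistent with the reported machine.adoc sizes (3,334 + 325 lines with 30 lines of overlap), not values read from the real manifest.json.

```python
def verify_file(chunks, total_lines, overlap=30):
    """Return a list of problems found for one file's chunks (empty if none).

    Each chunk is a dict with 1-based inclusive `start_line`,
    `content_start_line`, and `end_line` (hypothetical schema).
    """
    problems = []
    expected_next = 1
    for i, chunk in enumerate(chunks):
        # New content must begin exactly where the previous chunk ended.
        if chunk["content_start_line"] != expected_next:
            problems.append(f"chunk {i}: expected new content at line {expected_next}")
        # The first chunk has no overlap; later chunks carry `overlap` lines.
        have = chunk["content_start_line"] - chunk["start_line"]
        want = 0 if i == 0 else overlap
        if have != want:
            problems.append(f"chunk {i}: {have} overlap lines, expected {want}")
        expected_next = chunk["end_line"] + 1
    if expected_next != total_lines + 1:
        problems.append(f"coverage stops at line {expected_next - 1} of {total_lines}")
    return problems


# Assumed layout for machine.adoc: chunk sizes 3,334 and 325 lines.
machine = [
    {"start_line": 1, "content_start_line": 1, "end_line": 3334},
    {"start_line": 3305, "content_start_line": 3335, "end_line": 3629},
]
issues = verify_file(machine, total_lines=3629)
```

An empty `issues` list corresponds to the "0 gaps" result reported by `verify`; a gap, a missing overlap region, or a truncated final chunk would each add one entry.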

Closes #1749

…tion

Add scripts and data for cataloging all 185 UDB architectural parameters
with schema analysis, CSR cross-references, heuristic classifications,
and candidate spec text locations. This forms the foundation for
LLM-based parameter extraction from the RISC-V specification.

Scripts:
- export_udb_params.py: extracts parameters from YAML, derives value
  types, cross-references CSR IDL, classifies each parameter
- map_params_to_spec.py: searches 74 spec .adoc files for text related
  to each parameter using multi-strategy keyword matching
- generate_report.py: produces CSV catalog, text report, and param
  name list

Key results:
- 185 parameters cataloged (102 NORM_DIRECT, 55 NORM_CSR_RW,
  26 NORM_CSR_WARL, 2 SW_RULE)
- 81% high-confidence classifications
- 98% of parameters mapped to spec text candidates

Closes riscv#1747

Design and implement the formal parameter classification taxonomy and
prompt architecture for LLM-based extraction from RISC-V specifications.

Deliverables:
- taxonomy.md: formal definitions for 8 parameter classes (NORM_DIRECT,
  NORM_CSR_WARL, NORM_CSR_RW, SW_RULE, NON_ISA, NON_NORM, DOC_RULE,
  UNKNOWN) with disambiguation rules and a decision tree
- system_prompt.txt: ~940 token system prompt defining role, task,
  taxonomy, critical rules, and JSON output schema
- examples.json: 6 positive + 4 negative few-shot examples from real
  spec text covering all normative classes and key false-positive
  patterns (NOTE blocks, CSR behavior, fixed requirements, permission
  vs optionality "may")
- run_prompt.py: prompt assembler with 3 CLI modes (assemble, chunk,
  estimate) supporting context window management across models
- validate_prompt.py: 175-check validation suite for all deliverables

Key design decisions:
- Single-pass extraction + classification to preserve context
- Mandatory reasoning field in LLM output to reduce hallucinations
- Section-boundary-aware chunking with configurable overlap
- Three-layer prompt: system + examples + param names + spec chunk

Closes riscv#1748

Add a chunker that splits the 52,602-line RISC-V specification into
78 semantically coherent chunks across 74 .adoc files, preserving
CSR section integrity for LLM parameter extraction.

Key features:
- Never splits within a ==== section (CSR descriptions stay atomic)
- Splits at === or ==== AsciiDoc heading boundaries
- Target chunk size: 2,500-3,500 lines (~35K-45K tokens)
- Overlap context (30 lines) at chunk boundaries
- Files under 2,000 lines stay as single chunks
- Built-in verify command checks CSR integrity, coverage, and metadata

Results:
- 78 chunks across 74 files (4 files split into 2 chunks each)
- 100% line coverage on all multi-chunk files
- Zero CSR section splits
- Full manifest.json with per-chunk metadata

Closes riscv#1749
codecov Bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.94%. Comparing base (ba151af) to head (ab8e241).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1783      +/-   ##
==========================================
- Coverage   71.95%   71.94%   -0.01%     
==========================================
  Files          55       55              
  Lines       28085    28085              
  Branches     6172     6172              
==========================================
- Hits        20209    20207       -2     
- Misses       7876     7878       +2     
| Flag | Coverage Δ | |
| --- | --- | --- |
| idlc | 75.96% <ø> (ø) | |
| udb | 65.78% <ø> (-0.02%) | ⬇️ |
