LFX Phase 1: Ground truth map for architectural parameter extraction #1765

Open
ishaan-arora-1 wants to merge 3 commits into riscv:main from ishaan-arora-1:lfx-phase1-ground-truth

Conversation

@ishaan-arora-1
Contributor

Summary

  • Adds scripts and data that catalog all 185 UDB architectural parameters with schema analysis, CSR cross-references, heuristic classifications, and candidate spec text locations
  • This is the foundation for LLM-based parameter extraction from the RISC-V privileged and unprivileged specifications (see LFX - Phase 1: Build the Ground Truth Map from UDB Parameters #1747)
  • Part of the LFX project to systematically identify and tag architectural parameters in the spec

What's included

Scripts (param_extraction/scripts/)

| Script | Purpose |
| --- | --- |
| `export_udb_params.py` | Reads all 185 `spec/std/isa/param/*.yaml` files (excluding 22 MOCK fixtures), analyzes JSON Schema structure, cross-references CSR IDL code for `sw_write()`/`type()`/`reset_value()` references, and classifies each parameter |
| `map_params_to_spec.py` | Searches all 74 spec `.adoc` files (52,602 lines) for text related to each parameter using multi-strategy keyword matching (exact name, CSR backtick refs, description keywords, WARL proximity patterns) |
| `generate_report.py` | Produces the CSV catalog, text report, and flat parameter name list |
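
As a rough illustration of the schema-analysis step, the sketch below derives a coarse value type from a JSON Schema fragment. It is not the code in `export_udb_params.py`: the function names, the `"other"` fallback label, the rule that MOCK fixtures are recognized by a `MOCK` filename prefix, and the schema shapes it checks are all assumptions made for this example. Only the `binary`, `enum`, and `range` labels come from the results reported in this PR.

```python
from pathlib import PurePosixPath
from typing import Any


def is_mock(path: str) -> bool:
    """Assumption: MOCK fixtures are identified by a filename starting with 'MOCK'."""
    return PurePosixPath(path).stem.startswith("MOCK")


def derive_value_type(schema: dict[str, Any]) -> str:
    """Return a coarse value type for one parameter's JSON Schema fragment."""
    if schema.get("type") == "boolean":
        return "binary"
    if "enum" in schema:
        return "enum"
    if schema.get("type") == "integer" and (
        "minimum" in schema or "maximum" in schema
    ):
        return "range"
    # Hypothetical catch-all for shapes this sketch does not model
    # (arrays, oneOf, conditional schemas, ...).
    return "other"
```

A real implementation would also need to handle nested and conditional schemas, which this sketch deliberately ignores.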

Data outputs (param_extraction/data/)

| File | Description |
| --- | --- |
| `ground_truth.json` | Full structured data for all 185 parameters: name, description, value type, definedBy, CSR cross-references, classification with confidence and reasoning |
| `spec_mappings.json` | Top candidate spec text locations per parameter with relevance scores, line numbers, and context |
| `parameters_catalog.csv` | 19-column spreadsheet-ready catalog |
| `phase1_report.txt` | Human-readable report with statistics and per-parameter breakdown |
| `udb_param_names.txt` | Flat list of 185 parameter names (for inclusion in LLM prompts in later phases) |
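
To make the "relevance score" in `spec_mappings.json` more concrete, here is a minimal sketch of multi-strategy keyword scoring over spec lines. It only mirrors the four strategies named in this PR (exact name, CSR backtick refs, description keywords, WARL proximity). The weights, the function names, and the sample parameter and CSR names are hypothetical and are not taken from `map_params_to_spec.py`.

```python
import re


def score_line(line: str, param_name: str, csr_names: list[str], keywords: list[str]) -> int:
    """Score one spec line against one parameter. All weights are illustrative."""
    lowered = line.lower()
    score = 0
    if param_name.lower() in lowered:  # strategy 1: exact parameter name
        score += 5
    for csr in csr_names:  # strategy 2: CSR name in backticks
        if f"`{csr.lower()}`" in lowered:
            score += 3
    for kw in keywords:  # strategy 3: description keywords (whole words)
        if re.search(rf"\b{re.escape(kw.lower())}\b", lowered):
            score += 1
    # strategy 4: "WARL" on the same line as a related CSR name
    if "warl" in lowered and any(csr.lower() in lowered for csr in csr_names):
        score += 2
    return score


def top_candidates(lines, param_name, csr_names, keywords, limit=3):
    """Return up to `limit` (score, line_number, text) tuples, best score first."""
    scored = [
        (score_line(text, param_name, csr_names, keywords), lineno, text)
        for lineno, text in enumerate(lines, start=1)
    ]
    hits = [hit for hit in scored if hit[0] > 0]
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits[:limit]
```

Sorting by descending score and then by line number keeps the output deterministic, which helps when the JSON file is checked in and diffed between runs.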

Key results

| Metric | Value |
| --- | --- |
| Parameters cataloged | 185 (22 MOCK fixtures excluded) |
| Classification: NORM_DIRECT | 102 (55%); directly configurable, not CSR-controlled |
| Classification: NORM_CSR_RW | 55 (30%); controls RO/RW behavior of CSR fields |
| Classification: NORM_CSR_WARL | 26 (14%); legal values of WARL CSR fields |
| Classification: SW_RULE | 2 (1%); software-deterministic with correct fencing |
| High-confidence classifications | 150 (81%) |
| Value types | binary: 111 (60%), enum: 36 (19%), range: 12 (6%) |
| Parameters with CSR cross-references | 94 (51%) |
| Parameters mapped to spec text | 183/185 (98%) |
| Strong spec matches (score >= 5) | 161 (87%) |
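
One way the CSR cross-references could feed the classification is sketched below. This is an assumed rule, not the heuristic in `export_udb_params.py`: it treats a reference inside `sw_write()` as a sign of `NORM_CSR_WARL`, a reference inside `type()` or `reset_value()` as `NORM_CSR_RW`, and no reference as `NORM_DIRECT`. `SW_RULE` is left out because nothing in this PR says how it is detected. The input layout and all names in the usage are hypothetical.

```python
import re

# IDL functions that the PR description says are scanned for parameter references.
IDL_FUNCTIONS = ("sw_write", "type", "reset_value")


def find_references(param_name: str, csr_fields: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each IDL function to the CSR fields whose body mentions the parameter.

    `csr_fields` maps a field label to {function name: IDL source text};
    this layout is an assumption made for the sketch.
    """
    pattern = re.compile(rf"\b{re.escape(param_name)}\b")
    refs: dict[str, list[str]] = {fn: [] for fn in IDL_FUNCTIONS}
    for field, bodies in csr_fields.items():
        for fn in IDL_FUNCTIONS:
            if pattern.search(bodies.get(fn, "")):
                refs[fn].append(field)
    return refs


def classify(refs: dict[str, list[str]]) -> str:
    """Assumed mapping from reference locations to a classification label."""
    if refs["sw_write"]:
        return "NORM_CSR_WARL"
    if refs["type"] or refs["reset_value"]:
        return "NORM_CSR_RW"
    return "NORM_DIRECT"
```

The whole-word pattern avoids counting a parameter whose name is a prefix of another parameter's name, which matters when names share long common stems.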

How to run

```shell
# Requires PyYAML (pip install pyyaml)
# Requires ext/riscv-isa-manual submodule to be initialized

python3 param_extraction/scripts/export_udb_params.py
python3 param_extraction/scripts/map_params_to_spec.py
python3 param_extraction/scripts/generate_report.py
```

Test plan

  • All 185 non-MOCK parameters exported with complete metadata
  • Value types verified against actual YAML schemas (100% match)
  • CSR cross-references verified against actual CSR YAML files
  • 98% of parameters have at least one spec text candidate match
  • 81% high-confidence classifications (target was >= 75%)
  • CSV catalog and JSON outputs are consistent (all 185 rows match)
  • No duplicate parameter names
  • All source YAML files exist on disk
  • Scripts run cleanly end-to-end with no errors
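
The consistency and duplicate-name checks in the list above could be automated along the following lines. This is a sketch under stated assumptions: it supposes that `ground_truth.json` is a JSON array of objects with a `name` key and that `parameters_catalog.csv` has a `name` column. Neither layout is confirmed by this PR, and `check_consistency` is a hypothetical helper.

```python
import csv
import io
import json
from collections import Counter


def check_consistency(ground_truth_json: str, catalog_csv: str, name_key: str = "name") -> list[str]:
    """Return a list of human-readable problems; an empty list means the outputs agree."""
    params = json.loads(ground_truth_json)
    rows = list(csv.DictReader(io.StringIO(catalog_csv)))

    json_names = [param[name_key] for param in params]
    csv_names = [row[name_key] for row in rows]

    problems = []
    # Duplicate parameter names within either output.
    for label, names in (("JSON", json_names), ("CSV", csv_names)):
        dupes = sorted(name for name, count in Counter(names).items() if count > 1)
        if dupes:
            problems.append(f"duplicate names in {label}: {dupes}")
    # Names that appear in only one of the two outputs.
    if set(json_names) != set(csv_names):
        only_one = sorted(set(json_names) ^ set(csv_names))
        problems.append(f"names present in only one output: {only_one}")
    return problems
```

Taking file contents as strings rather than paths keeps the helper easy to exercise in a unit test; a caller would read the two files from `param_extraction/data/` and pass the text in.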

Closes #1747

…tion

Add scripts and data for cataloging all 185 UDB architectural parameters
with schema analysis, CSR cross-references, heuristic classifications,
and candidate spec text locations. This forms the foundation for
LLM-based parameter extraction from the RISC-V specification.

Scripts:
- export_udb_params.py: extracts parameters from YAML, derives value
  types, cross-references CSR IDL, classifies each parameter
- map_params_to_spec.py: searches 74 spec .adoc files for text related
  to each parameter using multi-strategy keyword matching
- generate_report.py: produces CSV catalog, text report, and param
  name list

Key results:
- 185 parameters cataloged (102 NORM_DIRECT, 55 NORM_CSR_RW,
  26 NORM_CSR_WARL, 2 SW_RULE)
- 81% high-confidence classifications
- 98% of parameters mapped to spec text candidates

Closes riscv#1747
@codecov

codecov Bot commented Mar 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.95%. Comparing base (de41e7b) to head (9381f19).

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #1765      +/-   ##
==========================================
- Coverage   71.96%   71.95%   -0.01%
==========================================
  Files          54       54
  Lines       27976    27976
  Branches     6183     6183
==========================================
- Hits        20132    20131       -1
- Misses       7844     7845       +1
```
| Flag | Coverage Δ | |
| --- | --- | --- |
| idlc | 75.90% <ø> (ø) | |
| udb | 65.84% <ø> (-0.01%) | ⬇️ |


- Add REUSE annotation for param_extraction/** in REUSE.toml
- Fix ruff lint errors: remove unused variables, prefix unused loop
  vars with underscore, remove extraneous f-string prefixes, sort
  import blocks
- Apply ruff formatting to all Python scripts
- Make Python scripts executable to satisfy EXE001 shebang check
- Fix prettier formatting for ground_truth.json and spec_mappings.json
- Strip trailing whitespace from parameters_catalog.csv
- Add missing end-of-file newline to phase1_report.txt