Implement AsciiDoc classifier#1803

Open
ankit-cybertron wants to merge 14 commits into riscv:main from ankit-cybertron:rag_pipeline_classifier

Conversation

@ankit-cybertron

Overview

This PR overhauls the rule matching and classification logic in pipeline/process/classifier.py, as required by the downstream pipeline. It introduces a context-aware rules engine that extracts ISA sentences and maps them to parameters, classes, and types according to a strict priority hierarchy.
Related: #1800

Impact on Classification Accuracy

Comparing the evaluation reports before and after the new classifier rules shows a substantial increase in classification accuracy and structural certainty.

1. Halving the "Unknown" Parameters

Previously, the classifier struggled to understand implicit architectural rules that were heavily formatted or lacked explicit CSR names, leaving them tagged as unknown.

  • Unknown Classes Before: 1,637 (38.0%)
  • Unknown Classes After: 853 (20.6%)
    Nearly 800 previously "unknown" fragments are now classified into proper buckets.

2. Identifying True Normative Rules (non_CSR_parameter)

Because we added implicit outcome matching (e.g., detecting phrases like "raises an exception" or "is sign-extended" even without strong modals like "must"), the pipeline is now markedly better at identifying general architectural constraints.

  • non_CSR_parameters Before: 1,038 (24.1%)
  • non_CSR_parameters After: 1,899 (46.0%)
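The implicit-outcome idea above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the name _NORMATIVE_OUTCOME comes from the PR description, but the phrase lists and the helper function are assumptions.

```python
import re

# Sketch of implicit outcome matching. The phrase lists below are
# illustrative, not the classifier's real keyword sets.
_NORMATIVE_OUTCOME = re.compile(
    r"\b(raises? an exception|is sign-extended|is zero-extended|traps?)\b",
    re.IGNORECASE,
)
_STRONG_MODALS = re.compile(r"\b(must|shall|required to)\b", re.IGNORECASE)

def is_normative(sentence: str) -> bool:
    """Treat a sentence as normative if it contains a strong modal
    OR an implicit architectural-outcome phrase."""
    return bool(
        _STRONG_MODALS.search(sentence)
        or _NORMATIVE_OUTCOME.search(sentence)
    )
```

With this, a sentence like "The result is sign-extended to XLEN bits." is flagged as normative even though it contains no modal verb.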

3. Cleaning up False-Positives in Complex Documents

In highly technical files like machine.adoc and hypervisor.adoc, extraction precision improved noticeably: the number of raw candidates correctly categorized as low_signal noise increased from 3,092 to 3,461. The newly anchored regex boundaries (\bword\b) ensure that short abbreviations like "ro" (read-only) no longer match inside unrelated narrative words like "zero".
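As a minimal standalone illustration of the boundary problem (not the PR's actual code): a naive substring check accepts "ro" inside "zero", while an anchored pattern only matches it as a whole token.

```python
import re

# Naive substring check: "ro" is found inside the word "zero".
assert "ro" in "hardwired to zero"

# Anchored pattern: \bro\b only matches "ro" as a standalone token.
pattern = re.compile(r"\bro\b", re.IGNORECASE)
assert pattern.search("hardwired to zero") is None
assert pattern.search("the field is ro (read-only)") is not None
```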


Features Added

Robust Signal Matching & Sandboxing

  • Word-Boundary Regex Anchoring: Replaced naive substring (in) checks with dynamically cached regex patterns (\bword\b), eliminating a major source of false-positive promotions.
  • Markup Stripping: Implemented a pre-processor (strip_markup) that forcibly removes residual AsciiDoc formatting (e.g. `backticks` and *asterisks*) so the classifier doesn’t "miss" keywords hiding behind punctuation markers.
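The two bullets above can be combined into a small sketch. Only the names strip_markup and the \bword\b anchoring come from the PR; the caching strategy and helper signatures here are assumptions for illustration.

```python
import re
from functools import lru_cache

def strip_markup(text: str) -> str:
    """Remove residual AsciiDoc inline formatting (backticks,
    asterisks) so keywords are not hidden behind punctuation.
    Sketch only; the real pre-processor may handle more markup."""
    return re.sub(r"[`*]", "", text)

@lru_cache(maxsize=None)
def word_pattern(word: str) -> re.Pattern:
    """Cache one compiled \\bword\\b pattern per keyword."""
    return re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)

def has_keyword(sentence: str, word: str) -> bool:
    """Anchored keyword test on markup-stripped text."""
    return bool(word_pattern(word).search(strip_markup(sentence)))
```

For example, has_keyword("The `mstatus` CSR controls this.", "mstatus") is True even though the keyword is wrapped in backticks, while has_keyword("hardwired to zero", "ro") is False.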

Deep Contextual Promotion

  • Section-Context Awareness: Added section_has_csr(). Even if an extracted sentence lacks an explicit parameter keyword, if its originating section title belongs to a known CSR, the logic smartly promotes it into a CSR_controlled constraint instead of dumping it as an unknown or SW rule.
  • Numeric States Matched: We added regex bounds for mathematical definitions (16-bit, at most 4) and numeric-state logic (has_binary_state()) to formally classify phrases like "hardwired to 1" and "read-only 0".
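The promotion and numeric-state logic above might look roughly like this hedged sketch. The names section_has_csr and has_binary_state come from the PR; KNOWN_CSRS, the exact patterns, and the toy classify() wrapper are assumptions.

```python
import re

# Illustrative CSR set; the real classifier presumably loads these
# from its taxonomy rather than hard-coding them.
KNOWN_CSRS = {"mstatus", "mtvec", "satp", "hstatus"}

_RE_BINARY_STATE = re.compile(
    r"\b(hardwired to [01]|read-only [01])\b", re.IGNORECASE
)

def section_has_csr(section_title: str) -> bool:
    """True if the originating section title names a known CSR."""
    return any(
        re.search(rf"\b{re.escape(csr)}\b", section_title, re.IGNORECASE)
        for csr in KNOWN_CSRS
    )

def has_binary_state(sentence: str) -> bool:
    """Matches numeric-state phrases like 'hardwired to 1'."""
    return bool(_RE_BINARY_STATE.search(sentence))

def classify(sentence: str, section_title: str) -> str:
    """Toy classifier: section context outranks local phrasing."""
    if section_has_csr(section_title):
        return "CSR_controlled"  # promoted via section context
    if has_binary_state(sentence):
        return "binary"
    return "unknown"
```

A sentence extracted from a section titled "3.1.6 mstatus Register" is promoted to CSR_controlled even if the sentence itself never names the CSR.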

Strict Classification Ladders

  • Deterministic Classifications (classify_parameter_class / type): Implements an ordered prioritization ladder that defensively maps fragments to canonical taxonomy.yaml groupings (CSR_controlled, SW_rule, non_CSR_parameter, binary, range, enum), separating definitive hardware parameters from software guidelines.
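The ladder idea can be sketched as a first-match-wins rule table. This is a simplified model under assumed inputs: the real classify_parameter_class presumably computes its predicates from the sentence rather than taking booleans, and the bucket names follow the taxonomy listed above.

```python
def classify_parameter_class(
    has_csr: bool, is_sw_guideline: bool, is_normative: bool
) -> str:
    """Ordered prioritization ladder: the first matching rule wins,
    so list order encodes the priority hierarchy."""
    ladder = [
        (has_csr, "CSR_controlled"),
        (is_sw_guideline, "SW_rule"),
        (is_normative, "non_CSR_parameter"),
    ]
    for condition, label in ladder:
        if condition:
            return label
    return "unknown"
```

Because the ladder is ordered, a fragment that is both CSR-linked and normative still lands in CSR_controlled, which keeps the mapping deterministic.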

Completed Tasks

  • Fix dangerous substring false positives using word-boundary regex (\b).
  • Integrate AsciiDoc inline markup stripper (strip_markup) for keyword safety.
  • Implement section breadcrumb context analysis (section_has_csr).
  • Embed implicit mathematical and behavioral constraints (has_binary_state / _NORMATIVE_OUTCOME).
  • Organize taxonomy parameters behind a strict prioritization hierarchy.

To-Do

  • Connect classified taxonomy limits directly into the UDB YAML schema compiler mapping.
  • Implement an LLM evaluation phase specifically targeting the remaining 20% unknown parameter classes to auto-classify edge cases.
  • Develop a systematic workflow to flag and analyze remaining false positives (e.g., narrative prose mimicking architectural rules).
  • Design a feedback loop to automatically reject chunks when a downstream LLM validator determines they are purely descriptive despite passing keyword checks.
  • Expand boundary matching regexes (e.g. _RE_NUMERIC_CONSTRAINT) to cover more complex formulaic definitions.
  • Add Pytest integration tests locking down the classifier logic for critical access schemas (like RW-H).

@ankit-cybertron ankit-cybertron force-pushed the rag_pipeline_classifier branch 2 times, most recently from 2772516 to 81b9938 Compare May 2, 2026 16:57
@ankit-cybertron ankit-cybertron force-pushed the rag_pipeline_classifier branch from 81b9938 to f2cd88f Compare May 2, 2026 18:09