Implement AsciiDoc classifier #1803
Open
ankit-cybertron wants to merge 14 commits into riscv:main
Overview
This PR overhauls the rule-matching and classification logic in `pipeline/process/classifier.py`, as defined in the downstream pipeline requirements. It establishes a context-aware rules engine that extracts ISA sentences and maps them to parameters, classes, and types according to a strict priority hierarchy.

Related: #1800
Impact on Classification Accuracy
Comparing the evaluation reports from before and after the new classifier rules shows a substantial increase in classification accuracy and structural certainty.
1. Halving the "Unknown" Parameters
Previously, the classifier struggled with implicit architectural rules that were heavily formatted or lacked explicit CSR names, leaving them tagged as `unknown`. This PR classifies over 800 previously `unknown` fragments into proper buckets.
2. Identifying True Normative Rules (non_CSR_parameter)
Because we added implicit outcome matching (e.g., detecting phrases like "raises an exception" or "is sign-extended" even without strong modals like "must"), the pipeline is considerably better at identifying general architectural constraints.
3. Cleaning up False-Positives in Complex Documents
In highly technical files like `machine.adoc` and `hypervisor.adoc`, extraction became noticeably more precise. The number of raw candidates correctly categorized as `low_signal` noise increased from 3,092 to 3,461. This is because the newly anchored regex boundaries (`\bword\b`) ensure that short abbreviations like "ro" (read-only) no longer match inside unrelated narrative words like "zero".
Features Added
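The effect of the word-boundary anchoring described under "Cleaning up False-Positives" can be illustrated with a minimal sketch. The helper names `keyword_pattern` and `has_keyword` are hypothetical and do not come from `classifier.py`, and the caching via `functools.lru_cache` is an assumption about how "dynamically cached" patterns might be built.

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def keyword_pattern(keyword: str) -> re.Pattern:
    """Compile a case-insensitive, word-boundary-anchored pattern once per keyword."""
    return re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)


def has_keyword(text: str, keyword: str) -> bool:
    """True only if `keyword` occurs as a whole word in `text`."""
    return keyword_pattern(keyword).search(text) is not None


narrative = "the field reads as zero"
print("ro" in narrative)                       # True: plain substring check
print(has_keyword(narrative, "ro"))            # False: "ro" is inside "zero"
print(has_keyword("The field is RO.", "ro"))   # True: standalone abbreviation
```

Because the compiled pattern is cached per keyword, repeated checks against the same keyword list avoid recompiling the regex for every sentence.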
Robust Signal Matching & Sandboxing
- Substring `in` checks are replaced with dynamically cached regex patterns (`\bword\b`). This fixes major false-positive promotions.
- `strip_markup` forcibly removes residual AsciiDoc formatting (e.g. `backticks` and *asterisks*) so the classifier doesn't miss keywords hiding behind punctuation markers.
Deep Contextual Promotion
- `section_has_csr()`: Even if an extracted sentence lacks an explicit parameter keyword, if its originating section title belongs to a known CSR, the sentence is promoted to a `CSR_controlled` constraint instead of being left as an unknown or SW rule.
- Numeric-constraint matching (`16-bit`, `at most 4`) and numeric-state logic (`has_binary_state()`) formally classify phrases like "hardwired to 1" and "read-only 0".
Strict Classification Ladders
- `classify_parameter_class / type`: Implements an ordered priority ladder that maps fragments to the canonical `taxonomy.yaml` groupings (`CSR_controlled`, `SW_rule`, `non_CSR_parameter`, `binary`, `range`, `enum`), separating definitive hardware parameters from software guidelines.
Completed Tasks
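- Word-boundary anchoring (`\b`).
- Markup stripping (`strip_markup`) for keyword safety.
- Section-based CSR promotion (`section_has_csr`).
- Binary-state and implicit-outcome matching (`has_binary_state` / `_NORMATIVE_OUTCOME`).
To-Do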
- Remaining `unknown` parameter classes: auto-classify edge cases.
- `_RE_NUMERIC_CONSTRAINT`: cover more complex formulaic definitions.
- […] (`RW-H`).
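
For reviewers, the ordered ladder described under "Strict Classification Ladders" can be sketched roughly as follows. This is not the code in `classifier.py`: the function and constant names are reused from the description only as labels, and the CSR list, the regexes, `_SW_HINT`, `_mentions_csr`, the rung order after section promotion, and the mapping of numeric constraints to `range` are all assumptions made for illustration.

```python
import re

# Every keyword list, regex, and rule ordering here is an illustrative
# assumption; only the identifiers and class labels come from the PR text.
KNOWN_CSRS = {"mstatus", "mtvec", "hstatus"}  # hypothetical subset
_NORMATIVE_OUTCOME = re.compile(
    r"\b(?:raises an exception|is sign-extended)\b", re.IGNORECASE)
_RE_NUMERIC_CONSTRAINT = re.compile(
    r"\b(?:\d+-bit|at (?:most|least) \d+)\b", re.IGNORECASE)
_SW_HINT = re.compile(r"\bsoftware (?:should|must)\b", re.IGNORECASE)


def strip_markup(text):
    """Remove backticks and asterisks so markup cannot hide a keyword."""
    return re.sub(r"[`*]", "", text)


def _mentions_csr(text):
    """True if any whole word in `text` is a known CSR name."""
    return any(w in KNOWN_CSRS for w in re.findall(r"\w+", text.lower()))


def section_has_csr(section_title):
    """True if the section title names a known CSR."""
    return _mentions_csr(section_title)


def has_binary_state(text):
    """True for phrases such as 'hardwired to 1' or 'read-only 0'."""
    return re.search(r"\b(?:hardwired to|read-only) [01]\b", text,
                     re.IGNORECASE) is not None


def classify_parameter_class(sentence, section_title=""):
    """Walk the ladder top-down; the first matching rung wins."""
    text = strip_markup(sentence)
    if _mentions_csr(text):              # rung 1: CSR named in the sentence
        return "CSR_controlled"
    if section_has_csr(section_title):   # rung 2: promoted via section title
        return "CSR_controlled"
    if _SW_HINT.search(text):            # rung 3: software guidance
        return "SW_rule"
    if _NORMATIVE_OUTCOME.search(text) or has_binary_state(text):
        return "non_CSR_parameter"       # rung 4: implicit normative outcome
    return "unknown"


def classify_parameter_type(sentence):
    """Assign a value type; numeric limits mapping to `range` is an assumption."""
    text = strip_markup(sentence)
    if has_binary_state(text):
        return "binary"
    if _RE_NUMERIC_CONSTRAINT.search(text):
        return "range"
    return "unknown"
```

In this sketch, a sentence such as "`MPP` is hardwired to 0." under a section titled "Machine Status (mstatus) Register" is promoted to `CSR_controlled` by the section rung, while "An illegal access raises an exception." under an unrelated section falls through to `non_CSR_parameter`. Placing section promotion above the software-guidance rung follows the description's statement that such sentences should not end up as an unknown or SW rule.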