Add Spec chunking (AsciiDoc + UDB) #1790

Open
ankit-cybertron wants to merge 11 commits into riscv:main from ankit-cybertron:rag_pipeline_chunker

@ankit-cybertron ankit-cybertron commented Apr 15, 2026

Summary

This PR introduces the initial ingest layer for the RISC-V unified database RAG pipeline, focusing on extracting structured information from both specification text and parameter definitions. It implements an AsciiDoc-based chunker for parsing and processing ISA manual files, along with a YAML-based chunker for UDB parameter data. Both sources are normalized into a unified chunk schema to support downstream tasks such as classification and retrieval. The pipeline is designed to be config-driven and modular, allowing iterative refinement of filtering and chunking logic.

Note: This is an early-stage implementation and is expected to evolve based on feedback and validation.
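
For reviewers, here is a minimal sketch of what a unified chunk schema shared by both sources could look like. The class name `Chunk`, its field names, and the `to_json` helper are assumptions made for illustration; they are not taken from the PR's code.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Chunk:
    """Hypothetical unified chunk record (field names are illustrative)."""

    chunk_id: str          # stable identifier for the chunk
    source: str            # "adoc" for ISA manual text, "udb" for YAML data
    source_file: str       # path of the file the chunk came from
    text: str              # cleaned chunk text
    section_path: List[str] = field(default_factory=list)  # breadcrumb


def to_json(chunks):
    """Serialize a list of Chunk records to a JSON string."""
    return json.dumps([asdict(c) for c in chunks], indent=2)
```

Keeping both chunkers on one record type means downstream classification and retrieval stages only have to handle a single shape, regardless of where a chunk originated.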


Ingestion — Input Coverage

| Source | Files | Chunker |
|---|---|---|
| YAML (param + csr + ext) | 781 | chunker_udb.py |
| ISA Manual (.adoc) | 136 | chunker_adoc.py |
| Total | 917 | |

Notes

  • YAML files are local (spec/std/isa/)
  • .adoc files are cloned at runtime and removed after processing
  • CSR files are discovered recursively across nested directories
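
The recursive discovery described in the notes above could be sketched as follows. The function name `discover_yaml` and the accepted suffixes are assumptions for illustration, not the PR's actual implementation.

```python
from pathlib import Path


def discover_yaml(root):
    """Return every YAML file under root, including nested directories.

    Hypothetical helper: the name and the suffix set are illustrative.
    """
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in {".yaml", ".yml"}
    )
```

Calling `discover_yaml("spec/std/isa")` would then walk the local tree mentioned above. Sorting the result keeps the output order stable between runs.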

Status — Current Extraction

| Metric | Value |
|---|---|
| Output dataset | chunks_repo.json |
| Parameter dataset | parameter_dataset.csv |
| Raw chunk files | 100+ |
| Largest chunk file | src__v-st-ext.json |
| Coverage | unpriv + priv + profiles + extensions |

Chunkers

chunker_adoc.py

  • Parses .adoc specification files
  • Cleans formatting artifacts (anchors, tables, directives)
  • Tracks section hierarchy (breadcrumb metadata)
  • Splits content into sentence-level and logical rule-level chunks
  • Outputs structured chunks via classifier integration
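
The steps listed above can be illustrated with a small sketch. It covers only anchor removal, skipping of table rows and a few common directives, breadcrumb tracking from `=` headings, and a naive sentence split. The function name `chunk_adoc`, the regular expressions, and the output keys are assumptions, not the code in chunker_adoc.py.

```python
import re

HEADING = re.compile(r"^(={1,6})\s+(.*\S)\s*$")   # "== Title" style headings
ANCHOR = re.compile(r"\[\[[^\]]*\]\]")            # [[anchor-id]]
DIRECTIVE = re.compile(r"^(include|ifdef|ifndef|endif|image)::")
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")       # naive sentence boundary


def chunk_adoc(text):
    """Split AsciiDoc text into sentence chunks tagged with a breadcrumb."""
    breadcrumb, buffer, chunks = [], [], []

    def flush():
        # Emit one chunk per sentence of the buffered paragraph.
        paragraph = " ".join(buffer).strip()
        buffer.clear()
        for sentence in SENTENCE_END.split(paragraph):
            if sentence:
                chunks.append(
                    {"section_path": list(breadcrumb), "text": sentence}
                )

    for raw in text.splitlines():
        line = ANCHOR.sub("", raw).strip()
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del breadcrumb[level - 1:]       # drop deeper or sibling sections
            breadcrumb.append(match.group(2))
        elif not line:
            flush()                          # blank line ends a paragraph
        elif DIRECTIVE.match(line) or line.startswith("|"):
            continue                         # skip directives and table rows
        else:
            buffer.append(line)
    flush()
    return chunks
```

The sentence split is deliberately simple and will break on abbreviations such as "e.g."; a production chunker would likely need a more careful boundary rule.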

chunker_udb.py

  • Parses YAML parameter definitions
  • Extracts relevant fields into text form
  • Normalizes output to match AsciiDoc chunk schema
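
As a rough illustration of the normalization step above, the sketch below turns one already-parsed YAML mapping into a chunk with the same keys a text chunk would carry. The function name `udb_to_chunk`, the selected fields (`kind`, `name`, `long_name`, `description`), and the output keys are assumptions; YAML loading itself is left to whichever parser the pipeline uses.

```python
def udb_to_chunk(record, source_file):
    """Flatten a parsed UDB YAML mapping into a text chunk.

    Hypothetical helper: the field selection and output keys are illustrative.
    """
    fields = ("name", "long_name", "description")
    parts = []
    for key in fields:
        value = record.get(key)
        if value:
            # Collapse YAML block-scalar whitespace into single spaces.
            parts.append(f"{key}: {' '.join(str(value).split())}")
    return {
        "source": "udb",
        "source_file": source_file,
        "section_path": [record.get("kind", "unknown"), record.get("name", "")],
        "text": "\n".join(parts),
    }
```

Reusing `section_path` for the record kind and name gives UDB chunks a breadcrumb comparable to the section hierarchy of the manual text.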

Tasks

Completed

  • AsciiDoc parsing, cleaning, and chunking
  • UDB YAML ingestion
  • Section hierarchy tracking
  • Shared chunk schema across sources
  • Pipeline execution flow
  • Initial dataset generation (json, csv)

Pending

  • Normative vs descriptive filtering improvements
  • Chunk scoring/confidence tuning
  • Classifier integration (external)
  • Performance optimization
  • Manual review and schema_rules validation

Observations

  • Normative rules are extracted but mixed with descriptive text
  • Code blocks and examples are still present in chunks
  • Some files produce low or zero output
  • Chunking is currently sentence-level and may need refinement
  • Outputs require manual review and requirement-based filtering
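
Two of the observations above (code blocks left in chunks, and normative text mixed with descriptive text) could be approached with a simple heuristic pass. The sketch below is one possible starting point, not the filtering logic in this PR: it drops AsciiDoc listing blocks delimited by `----` and flags sentences that contain RFC 2119-style keywords. The function names and the keyword list are assumptions.

```python
import re

# Assumed keyword list; the real normative vocabulary may need tuning.
NORMATIVE = re.compile(r"\b(must|shall|should|may)\b", re.IGNORECASE)


def strip_listing_blocks(text):
    """Drop AsciiDoc listing blocks delimited by lines of 4+ hyphens."""
    kept, inside = [], False
    for line in text.splitlines():
        if re.fullmatch(r"-{4,}", line.strip()):
            inside = not inside      # toggle on opening/closing delimiter
            continue
        if not inside:
            kept.append(line)
    return "\n".join(kept)


def is_normative(sentence):
    """Heuristic: a sentence is normative if it uses a normative keyword."""
    return NORMATIVE.search(sentence) is not None
```

A keyword match is only a coarse signal (for example, "may" also appears in purely descriptive sentences), so results from a filter like this would still need the manual review noted above.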

Concerns

  • This PR includes a large number of generated output files, committed for review purposes only
  • These output files can be regenerated and will be removed in a follow-up PR
  • Filtering logic is still evolving and requires tuning

Testing

  • Pipeline executed on full ISA specification
  • Outputs manually inspected (sample-based)
  • No LLM-based testing or spec verification has been performed yet

Expected Outcome

  • Unified ingest layer for specification + UDB sources
  • Structured chunk dataset ready for downstream RAG pipeline stages

Status

In progress (#1779)

@dhower-qc
Collaborator

This work looks very promising. Would you be able to attend a PR review meeting we have at 2pm Eastern Time (UTC-4)? If not, let's try to schedule a time that works for the reviewers.

@ankit-cybertron
Author

Hey @dhower-qc,
Yes, it would be great to review this PR together, but could we schedule it for tomorrow?
My email is ankit.cybertron@proton.me; we can coordinate there.
