Add Spec chunking (AsciiDoc + UDB) by ankit-cybertron · Pull Request #1790 · riscv/riscv-unified-db

ankit-cybertron · 2026-04-15T05:48:46Z

Summary

This PR introduces the initial ingest layer for the RISC-V unified database RAG pipeline, focusing on extracting structured information from both specification text and parameter definitions. It implements an AsciiDoc-based chunker for parsing and processing ISA manual files, along with a YAML-based chunker for UDB parameter data. Both sources are normalized into a unified chunk schema to support downstream tasks such as classification and retrieval. The pipeline is designed to be config-driven and modular, allowing iterative refinement of filtering and chunking logic.

Note: This is an early-stage implementation and is expected to evolve based on feedback and validation.

Ingestion — Input Coverage

| Source | Files | Chunker |
|---|---|---|
| YAML (param + csr + ext) | 781 | `chunker_udb.py` |
| ISA Manual (`.adoc`) | 136 | `chunker_adoc.py` |
| Total | 917 | — |

Notes

YAML files are local (spec/std/isa/)
.adoc files are cloned at runtime and removed after processing
CSR files are discovered recursively across nested directories
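
The recursive discovery mentioned in the notes above could look roughly like the following. This is a minimal sketch, not the PR's code: the function name `discover_yaml_files` and the accepted suffixes are assumptions made for illustration.

```python
from pathlib import Path


def discover_yaml_files(root: Path) -> list[Path]:
    """Return every .yaml/.yml file under root, including nested directories.

    Hypothetical helper: the name and the suffix set are illustrative only.
    """
    if not root.is_dir():
        return []
    # rglob("*") walks the whole tree, so files in nested CSR
    # subdirectories are found without listing each level explicitly.
    found = [
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in {".yaml", ".yml"}
    ]
    return sorted(found)
```

Calling `discover_yaml_files(Path("spec/std/isa"))` would return a sorted list of paths, which keeps the chunk output order stable between runs.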

Status — Current Extraction

| Metric | Value |
|---|---|
| Output dataset | `chunks_repo.json` |
| Parameter dataset | `parameter_dataset.csv` |
| Raw chunk files | 100+ |
| Largest chunk file | `src__v-st-ext.json` |
| Coverage | unpriv + priv + profiles + extensions |

Chunkers

`chunker_adoc.py`

Parses .adoc specification files
Cleans formatting artifacts (anchors, tables, directives)
Tracks section hierarchy (breadcrumb metadata)
Splits content into sentence-level and logical rule-level chunks
Outputs structured chunks via classifier integration
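
To make the steps above concrete, here is a minimal sketch of cleaning, breadcrumb tracking, and sentence splitting. It is not the code in `chunker_adoc.py`: the `Chunk` fields, the regular expressions, and the function names are all assumptions, and the sentence splitter is deliberately naive (it would mis-split after abbreviations such as "e.g.").

```python
import re
from dataclasses import dataclass

# All patterns below are illustrative assumptions, not the PR's rules.
HEADING = re.compile(r"^(={1,6})\s+(.*\S)\s*$")          # "== Title"
ANCHOR = re.compile(r"\[\[[^\]]*\]\]|\[#[^\]]*\]")        # [[id]] or [#id]
XREF = re.compile(r"<<[^,>]*(?:,\s*([^>]*))?>>")          # <<id, label>>
SKIP = re.compile(
    r"^(?::[\w-]+:|(?:include|ifdef|ifndef|endif|image)::|\[.*\]$|[-=.*_+]{4,}$)"
)
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


@dataclass
class Chunk:
    text: str
    breadcrumb: list[str]   # section titles, outermost first
    source: str


def clean_inline(text: str) -> str:
    """Drop anchors, keep cross-reference labels, collapse whitespace."""
    text = ANCHOR.sub("", text)
    text = XREF.sub(lambda m: m.group(1) or "", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_adoc(source: str, text: str) -> list[Chunk]:
    chunks: list[Chunk] = []
    trail: list[str] = []
    paragraph: list[str] = []
    in_table = False

    def flush() -> None:
        # Emit one chunk per sentence of the buffered paragraph.
        body = clean_inline(" ".join(paragraph))
        paragraph.clear()
        for sentence in SENTENCE_END.split(body):
            if sentence:
                chunks.append(Chunk(sentence, list(trail), source))

    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("|==="):          # table delimiter toggles skipping
            flush()
            in_table = not in_table
            continue
        if in_table:
            continue
        heading = HEADING.match(line)
        if heading:
            flush()
            level = len(heading.group(1))
            del trail[level - 1:]            # leave deeper or equal sections
            trail.append(clean_inline(heading.group(2)))
            continue
        if not line or SKIP.match(line):     # blank line, directive, delimiter
            flush()
            continue
        paragraph.append(line)
    flush()
    return chunks
```

Each returned `Chunk` carries a copy of the heading trail that was active when its sentence was read, so a sentence under `=== Fields` inside `== CSRs` keeps both titles as context. Block delimiters are skipped but block contents are not, so code listings still end up in the chunks.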

`chunker_udb.py`

Parses YAML parameter definitions
Extracts relevant fields into text form
Normalizes output to match AsciiDoc chunk schema
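
The normalization step could be sketched as below. This is an illustration under assumptions, not the code in `chunker_udb.py`: the `Chunk` fields, the list of extracted keys in `TEXT_FIELDS`, and the function name are hypothetical. The entry is taken as an already-parsed mapping (for example, the result of loading one YAML file), so the sketch needs only the standard library.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Same shape as the (assumed) AsciiDoc chunk record.
    text: str
    breadcrumb: list[str]
    source: str


# Assumed set of fields worth turning into text; adjust to the real schema.
TEXT_FIELDS = ("name", "long_name", "description")


def chunk_udb_entry(source: str, kind: str, entry: dict) -> Chunk:
    """Flatten selected string fields of one parsed YAML entry into a chunk."""
    parts = []
    for key in TEXT_FIELDS:
        value = entry.get(key)
        if isinstance(value, str) and value.strip():
            # Collapse newlines from YAML block scalars into single spaces.
            parts.append(f"{key}: {' '.join(value.split())}")
    name = str(entry.get("name", "<unnamed>"))
    return Chunk(text="; ".join(parts), breadcrumb=[kind, name], source=source)
```

Using the entry kind and name as the breadcrumb gives UDB chunks the same positional metadata that section titles give AsciiDoc chunks, so downstream stages can treat both sources uniformly.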

Tasks

Completed

AsciiDoc parsing, cleaning, and chunking
UDB YAML ingestion
Section hierarchy tracking
Shared chunk schema across sources
Pipeline execution flow
Initial dataset generation (JSON, CSV)

Pending

Normative vs descriptive filtering improvements
Chunk scoring/confidence tuning
Classifier integration (external)
Performance optimization
Manual review and `schema_rules` validation

Observations

Normative rules are extracted but mixed with descriptive text
Code blocks and examples are still present in chunks
Some files produce low or zero output
Chunking is currently sentence-level and may need refinement
Outputs require manual review and requirement-based filtering
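
One possible starting point for separating normative from descriptive chunks is a keyword heuristic. The sketch below is not part of this PR: the keyword lists are borrowed from RFC 2119-style requirement words, the scores and threshold are arbitrary, and the function names are hypothetical. The ISA manual does not necessarily use these words in a formal sense, so any result would still need manual review.

```python
import re

# Assumed keyword lists; tune against real specification text.
STRONG = re.compile(r"\b(must|shall|required)\b", re.IGNORECASE)
WEAK = re.compile(r"\b(should|recommended|may)\b", re.IGNORECASE)


def normative_score(text: str) -> float:
    """Return 1.0 for strong requirement words, 0.5 for weak ones, else 0.0."""
    if STRONG.search(text):
        return 1.0
    if WEAK.search(text):
        return 0.5
    return 0.0


def split_by_normativity(
    texts: list[str], threshold: float = 0.5
) -> tuple[list[str], list[str]]:
    """Partition chunk texts into (normative, descriptive) by score."""
    normative = [t for t in texts if normative_score(t) >= threshold]
    descriptive = [t for t in texts if normative_score(t) < threshold]
    return normative, descriptive
```

Raising `threshold` above 0.5 keeps only sentences with strong requirement words, which trades recall for fewer descriptive sentences in the normative set.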

Concerns

This PR includes a large number of generated output files, committed for review purposes only
Output files are regenerable and will be removed in a follow-up PR
Filtering logic is still evolving and requires tuning

Testing

Pipeline executed on full ISA specification
Outputs manually inspected (sample-based)
No LLM-based testing or spec verification has been performed yet

Expected Outcome

Unified ingest layer for specification + UDB sources
Structured chunk dataset ready for downstream RAG pipeline stages

Status

In progress (#1779)

dhower-qc · 2026-04-15T14:00:29Z

This work looks very promising. Would you be able to attend a PR review meeting we have at 2pm Eastern Time (UTC-4)? If not, let's try to schedule a time with the reviewers that works.

ankit-cybertron · 2026-04-15T17:20:14Z

Hey @dhower-qc,
Yes, it would be great to review this PR, but can we schedule it for tomorrow?
Here is my email: ankit.cybertron@proton.me. We can coordinate there.

ankit-cybertron added 9 commits March 28, 2026 23:00

Add optimized parameter discovery pipeline for RAG

c0a2e5d

trigger CI

86c7d5e

restructured architecture for next phase

d5129b6

restructured architecture for next phase

7f0564d

config refactor and project scaffolding

7a6f2c9

Add Chunking scripts and ouput

464185f

Make gitignore clean

3967fd9

remove build_vector_db.py

aa7439b

Remove PErvious version files

6eb006d

ankit-cybertron requested review from ThinkOpenly and dhower-qc as code owners April 15, 2026 05:48

ankit-cybertron added 2 commits April 17, 2026 21:10

Add Utility Funtions

adb4f11

Add reporter for ingestion layer

3699c61

ankit-cybertron mentioned this pull request Apr 22, 2026

[RAG Pipeline] Add downstream pipeline modules: classifier, vector store, and exporters #1800

Open

ankit-cybertron wants to merge 11 commits into riscv:main from ankit-cybertron:rag_pipeline_chunker
