Add hierarchical and semantic-embedding chunking strategies #5

Copilot · 2025-12-24T23:22:09Z

Implements two advanced chunking strategies for improved document segmentation in RAG pipelines: hierarchical chunking for multi-level document structure and semantic similarity-based splitting for topic boundary detection.

Hierarchical Chunking

Multi-level document hierarchy with parent-child relationships:

```python
from src.chunker import hierarchical_chunk

chunks = hierarchical_chunk(
    text,
    levels=['section', 'paragraph', 'sentence'],
    use_tiktoken=True
)

# Each chunk includes: id, parent_id, level, start_char, end_char, token_count
# Enables multi-resolution retrieval and context reconstruction
```
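As a rough illustration of how the parent_id field could support context reconstruction, the sketch below walks from a sentence-level chunk up to its section. It assumes each chunk is a plain dict with the fields listed above; the actual return type of hierarchical_chunk() is not shown in this PR, and `ancestors` and the sample records are hypothetical, not part of rag-chunk.

```python
from typing import Dict, List

# Hypothetical chunk records; the real hierarchical_chunk() output may differ.
Chunk = Dict[str, object]


def ancestors(chunk_id: str, by_id: Dict[str, Chunk]) -> List[Chunk]:
    """Return the chain of parents for a chunk, nearest parent first."""
    chain: List[Chunk] = []
    parent_id = by_id[chunk_id].get("parent_id")
    while parent_id is not None:
        parent = by_id[parent_id]
        chain.append(parent)
        parent_id = parent.get("parent_id")
    return chain


# Made-up sample data for illustration only.
sample = [
    {"id": "s0", "parent_id": None, "level": "section", "start_char": 0, "end_char": 40},
    {"id": "p0", "parent_id": "s0", "level": "paragraph", "start_char": 8, "end_char": 40},
    {"id": "t0", "parent_id": "p0", "level": "sentence", "start_char": 8, "end_char": 20},
]
by_id = {c["id"]: c for c in sample}
levels = [c["level"] for c in ancestors("t0", by_id)]
```

A retriever could match on the sentence chunk and then return the enclosing paragraph or section text by slicing the source with the parent's start_char and end_char.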

CLI:

```shell
rag-chunk analyze docs/ --strategy hierarchical \
  --hierarchical-levels "section,paragraph,sentence"
```

Extracts markdown headers (h1-h6) as sections
Falls back to paragraph splitting for unstructured documents
Metadata enables parent-child navigation and multi-resolution indexing
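The header extraction and paragraph fallback described above might look roughly like the following. This is a sketch under assumptions, not the code in src/chunker.py: `split_sections` is a hypothetical name, only ATX-style headers (`#` through `######`) are recognised, and `#` lines inside fenced code would be misread as headers.

```python
import re
from typing import List, Tuple

# ATX headers: one to six '#' characters, whitespace, then a title.
HEADER_RE = re.compile(r"^#{1,6}[ \t]+\S.*$", re.MULTILINE)
# A paragraph is a run of consecutive non-empty lines.
PARAGRAPH_RE = re.compile(r"[^\n]+(?:\n[^\n]+)*")

Span = Tuple[int, int, str]  # (start_char, end_char, level)


def split_sections(text: str) -> List[Span]:
    """Split on markdown headers; fall back to paragraphs if there are none."""
    starts = [m.start() for m in HEADER_RE.finditer(text)]
    if starts:
        spans: List[Span] = []
        # Text before the first header, if any, becomes its own section.
        if text[: starts[0]].strip():
            spans.append((0, starts[0], "section"))
        ends = starts[1:] + [len(text)]
        spans.extend((s, e, "section") for s, e in zip(starts, ends))
        return spans
    # Fallback for unstructured documents: blank-line separated paragraphs.
    return [(m.start(), m.end(), "paragraph") for m in PARAGRAPH_RE.finditer(text)]
```

Each section span runs from its header to the start of the next header, so nested levels (paragraphs, sentences) can be computed inside a span and offset by its start_char.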

Semantic Similarity-Based Splitting

Topic boundary detection using sentence embeddings:

```python
from src.chunker import semantic_split

chunks = semantic_split(
    text,
    model='all-MiniLM-L6-v2',
    threshold=0.7
)
# Splits where cosine similarity between consecutive sentences drops below threshold
```

CLI:

```shell
rag-chunk analyze docs/ --strategy semantic-embedding \
  --semantic-threshold 0.7 \
  --semantic-model "all-MiniLM-L6-v2"
```

Uses sentence-transformers for embeddings
Changepoint detection creates variable-length, semantically coherent chunks
Requires pip install rag-chunk[embeddings]
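The boundary rule itself (start a new chunk when the cosine similarity between consecutive sentences falls below the threshold) can be sketched without any model dependency. Everything below is illustrative: `split_on_similarity_drop` and `toy_embed` are hypothetical names, and the embedding function is passed in so the logic can be shown with toy vectors. With sentence-transformers installed, the `encode` method of a `SentenceTransformer` loaded with the model name could presumably be passed as `embed`.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_on_similarity_drop(
    sentences: List[str],
    embed: Callable[[List[str]], Sequence[Vector]],
    threshold: float = 0.7,
) -> List[List[str]]:
    """Group sentences, opening a new group when adjacent similarity < threshold."""
    if not sentences:
        return []
    vectors = embed(sentences)
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([])  # similarity dropped: treat as a topic boundary
        chunks[-1].append(sentences[i])
    return chunks


def toy_embed(sentences: List[str]) -> List[List[float]]:
    """Stand-in embedding for the example: one axis per toy topic."""
    return [[1.0, 0.0] if "cat" in s else [0.0, 1.0] for s in sentences]


groups = split_on_similarity_drop(
    ["The cat sat.", "The cat slept.", "Stocks fell.", "Bonds rose."],
    toy_embed,
)
```

A higher threshold produces more, smaller chunks because smaller dips in similarity count as boundaries; a lower threshold merges more sentences together.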

Changes

src/chunker.py:

Added hierarchical_chunk() with section/paragraph/sentence splitting
Added semantic_split() with embedding-based boundary detection
Helper functions: _split_into_sentences(), _extract_markdown_sections()
Registered both strategies in STRATEGIES dict
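For readers unfamiliar with the registry pattern, a dict-based strategy table could be wired up along these lines. The strategy keys match the CLI names used above, but the function bodies are stand-ins and `run_strategy` is a hypothetical dispatcher; the real STRATEGIES dict and _run_strategy() in this repository may be shaped differently.

```python
from typing import Callable, Dict, List


def hierarchical_chunk(text: str, levels=("section", "paragraph")) -> List[dict]:
    # Stand-in body for illustration; not the real implementation.
    return [{"level": levels[0], "text": text}]


def semantic_split(
    text: str, model: str = "all-MiniLM-L6-v2", threshold: float = 0.7
) -> List[dict]:
    # Stand-in body for illustration; not the real implementation.
    return [{"model": model, "threshold": threshold, "text": text}]


# Strategy name -> callable. New strategies only need a new entry here.
STRATEGIES: Dict[str, Callable[..., List[dict]]] = {
    "hierarchical": hierarchical_chunk,
    "semantic-embedding": semantic_split,
}


def run_strategy(name: str, text: str, **params) -> List[dict]:
    """Look up a strategy by name and call it with strategy-specific params."""
    try:
        func = STRATEGIES[name]
    except KeyError:
        raise ValueError(f"Unknown strategy: {name!r}") from None
    return func(text, **params)
```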

src/cli.py:

Added --hierarchical-levels argument (default: "section,paragraph")
Added --semantic-model argument (default: "all-MiniLM-L6-v2")
Added --semantic-threshold argument (default: 0.7)
Updated _run_strategy() to pass strategy-specific parameters
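The three arguments and their defaults could be declared with argparse roughly as follows. This is a sketch, not the contents of src/cli.py: `build_parser` and `strategy_params` are hypothetical helpers, the subcommand layout is inferred from the CLI examples above, and `--strategy` is marked required here only to avoid guessing its real default.

```python
import argparse
from typing import Any, Dict


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="rag-chunk")
    sub = parser.add_subparsers(dest="command", required=True)
    analyze = sub.add_parser("analyze")
    analyze.add_argument("path")
    analyze.add_argument("--strategy", required=True)
    # Defaults below are the ones listed in this PR description.
    analyze.add_argument("--hierarchical-levels", default="section,paragraph")
    analyze.add_argument("--semantic-model", default="all-MiniLM-L6-v2")
    analyze.add_argument("--semantic-threshold", type=float, default=0.7)
    return parser


def strategy_params(args: argparse.Namespace) -> Dict[str, Any]:
    """Pick out only the options relevant to the selected strategy."""
    if args.strategy == "hierarchical":
        levels = [s.strip() for s in args.hierarchical_levels.split(",") if s.strip()]
        return {"levels": levels}
    if args.strategy == "semantic-embedding":
        return {"model": args.semantic_model, "threshold": args.semantic_threshold}
    return {}
```

Keeping the per-strategy mapping in one function means options for one strategy are never passed to another, which is one plausible reading of "strategy-specific parameters".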

pyproject.toml:

Added embeddings optional dependency for sentence-transformers
Updated all group to include embeddings

tests/test_basic.py:

Added 6 tests covering hierarchical levels and semantic splitting

README.md & examples/advanced_strategies.md:

Documented both strategies with usage examples and best practices

Original prompt

This section details the original issue you should resolve

<issue_title>Advanced Strategies: Hierarchical chunking, semantic similarity-based splitting</issue_title>
<issue_description>## Summary
Implement advanced chunking strategies to improve document segmentation and retrieval for RAG pipelines. These extend current strategies (fixed-size, sliding-window, paragraph, recursive-character) by leveraging hierarchical document structure and deep semantic cues.

Hierarchical Chunking

Build multi-level chunk hierarchies: e.g., split on Markdown headlines (h1/h2/h3) → sections → paragraphs → sentences.

API: hierarchical_chunk(text, levels=['section', 'paragraph', 'sentence']) in src/chunker.py.

Each chunk should include metadata: id, parent_id, level, start_char, end_char, token_count, source_path.

CLI: select via --strategy hierarchical, with tunable levels and fallback to paragraph splitting.

Semantic Similarity-based Splitting

Detect topic boundaries via semantic embeddings (using sentence-transformers or LangChain embeddings).

API: semantic_split(text, model='all-MiniLM-L6-v2', threshold=0.7) in src/chunker.py.

Split at points where similarity with neighboring sentences drops below threshold (changepoint detection).

CLI: select via --strategy semantic-embedding, with configurable model and threshold options.

Compare results with current recursive-character strategy (LangChain integration).

Implementation Plan

Add new strategy registrations to STRATEGIES in src/chunker.py.

Write modular functions for hierarchical and semantic chunking.

Integrate into CLI and allow strategy selection.

Add optional dependencies for sentence-transformers (rag-chunk[embeddings]).

Add output and chunk metadata to improve traceability for hierarchical chunks.

Extend existing tests in tests/ and add example files under examples/ for notebook demonstrations.

Motivation

Enable multi-resolution document retrieval and indexing.

Improve semantic relevance and reduce context fragmentation for RAG applications.

Acceptance Criteria

Hierarchical and semantic strategies available via CLI (--strategy flag)

New functions in src/chunker.py with clear API signatures

Tests and example notebooks for both strategies

Benchmark/evaluation harness extended to new strategies

Optional dependencies well-documented in README

References

See roadmap in README.md

Related code: src/chunker.py, tests, examples

Labels: enhancement, strategy, roadmap, semantic, hierarchical</issue_description>

<agent_instructions>go go</agent_instructions>

Comments on the Issue (you are @copilot in this section)

Fixes Advanced Strategies: Hierarchical chunking, semantic similarity-based splitting #4


Co-authored-by: messkan <[email protected]>

Initial plan

d3dc783

Copilot AI assigned Copilot and messkan Dec 24, 2025

Copilot started work on behalf of messkan December 24, 2025 23:22

Copilot AI and others added 3 commits December 24, 2025 23:28

Implement hierarchical and semantic chunking strategies

d3a2199

Co-authored-by: messkan <[email protected]>

Update README and add advanced strategies documentation

6e991e7

Co-authored-by: messkan <[email protected]>

Fix code review feedback and finalize implementation

1f6f656

Co-authored-by: messkan <[email protected]>

Copilot AI changed the title ~~[WIP] Implement advanced chunking strategies for document segmentation~~ Add hierarchical and semantic-embedding chunking strategies Dec 24, 2025

Copilot AI requested a review from messkan December 24, 2025 23:38

Copilot finished work on behalf of messkan December 24, 2025 23:38
