Copilot AI commented Dec 24, 2025

Implements two advanced chunking strategies for improved document segmentation in RAG pipelines: hierarchical chunking for multi-level document structure and semantic similarity-based splitting for topic boundary detection.

Hierarchical Chunking

Multi-level document hierarchy with parent-child relationships:

from src.chunker import hierarchical_chunk

chunks = hierarchical_chunk(
    text,
    levels=['section', 'paragraph', 'sentence'],
    use_tiktoken=True
)

# Each chunk includes: id, parent_id, level, start_char, end_char, token_count
# Enables multi-resolution retrieval and context reconstruction

CLI:

rag-chunk analyze docs/ --strategy hierarchical \
  --hierarchical-levels "section,paragraph,sentence"
  • Extracts markdown headers (h1-h6) as sections
  • Falls back to paragraph splitting for unstructured documents
  • Metadata enables parent-child navigation and multi-resolution indexing
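
To make the parent-child metadata concrete, here is a minimal two-level sketch (sections, then paragraphs). It is not the code from this PR: `Chunk`, `HEADER_RE`, `PARAGRAPH_RE`, and `sketch_hierarchical_chunk` are hypothetical names, token counting is left out, and the header regex is an assumption about how h1-h6 headers might be detected.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: a markdown header is a line starting with 1-6 '#' followed by a space or tab.
HEADER_RE = re.compile(r"^#{1,6}[ \t]+.*$", re.MULTILINE)
# A paragraph is a run of consecutive non-empty lines.
PARAGRAPH_RE = re.compile(r"[^\n]+(?:\n[^\n]+)*")


@dataclass
class Chunk:
    id: int
    parent_id: Optional[int]
    level: str
    start_char: int
    end_char: int
    text: str


def sketch_hierarchical_chunk(text: str) -> List[Chunk]:
    """Split text into sections (by header), then paragraphs inside each section."""
    chunks: List[Chunk] = []
    starts = [m.start() for m in HEADER_RE.finditer(text)]
    # Text before the first header, or a document with no headers at all,
    # becomes a single root section, so paragraph splitting still applies.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    for sec_start, sec_end in zip(starts, starts[1:] + [len(text)]):
        section_text = text[sec_start:sec_end]
        if not section_text.strip():
            continue
        section = Chunk(len(chunks), None, "section", sec_start, sec_end, section_text)
        chunks.append(section)
        # Skip the header line itself so it is not reported as a paragraph.
        header = HEADER_RE.match(section_text)
        body_offset = header.end() if header else 0
        for m in PARAGRAPH_RE.finditer(section_text, body_offset):
            if not m.group().strip():
                continue
            chunks.append(
                Chunk(
                    id=len(chunks),
                    parent_id=section.id,
                    level="paragraph",
                    start_char=sec_start + m.start(),
                    end_char=sec_start + m.end(),
                    text=m.group(),
                )
            )
    return chunks
```

Because every chunk stores absolute character offsets, `text[c.start_char:c.end_char]` recovers the chunk text, and following `parent_id` from a paragraph leads back to its enclosing section for context reconstruction.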

Semantic Similarity-Based Splitting

Topic boundary detection using sentence embeddings:

from src.chunker import semantic_split

chunks = semantic_split(
    text,
    model='all-MiniLM-L6-v2',
    threshold=0.7
)
# Splits where cosine similarity between consecutive sentences drops below threshold

CLI:

rag-chunk analyze docs/ --strategy semantic-embedding \
  --semantic-threshold 0.7 \
  --semantic-model "all-MiniLM-L6-v2"
  • Uses sentence-transformers for embeddings
  • Changepoint detection creates variable-length, semantically coherent chunks
  • Requires pip install rag-chunk[embeddings]
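
The splitting rule itself can be illustrated without the embedding model. The sketch below assumes sentence embeddings have already been computed and only shows the thresholding step. `cosine` and `split_on_similarity_drop` are hypothetical helpers, not functions from this PR, and the actual `semantic_split()` may detect boundaries differently.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_on_similarity_drop(
    sentences: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.7,
) -> List[str]:
    """Start a new chunk wherever consecutive sentences fall below the threshold."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if not sentences:
        return []
    groups: List[List[str]] = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            groups.append([])  # similarity dropped: topic boundary
        groups[-1].append(sentences[i])
    return [" ".join(group) for group in groups]
```

With sentence-transformers installed, the embeddings could be produced by `SentenceTransformer("all-MiniLM-L6-v2").encode(sentences)`. A higher threshold yields more, smaller chunks; a lower one merges more sentences together.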

Changes

src/chunker.py:

  • Added hierarchical_chunk() with section/paragraph/sentence splitting
  • Added semantic_split() with embedding-based boundary detection
  • Helper functions: _split_into_sentences(), _extract_markdown_sections()
  • Registered both strategies in STRATEGIES dict
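
For readers unfamiliar with the registry pattern, a name-to-function mapping along these lines is one way strategies can be looked up by their CLI name. This is an illustrative sketch only: the real `STRATEGIES` dict and its function signatures are not shown in this PR description, so `STRATEGIES_SKETCH`, `register`, `paragraph_demo`, and `run_strategy` are all hypothetical.

```python
from typing import Callable, Dict, List

ChunkFn = Callable[..., List[dict]]

# Hypothetical stand-in for the STRATEGIES dict in src/chunker.py.
STRATEGIES_SKETCH: Dict[str, ChunkFn] = {}


def register(name: str) -> Callable[[ChunkFn], ChunkFn]:
    """Decorator that records a chunking function under a strategy name."""
    def decorator(fn: ChunkFn) -> ChunkFn:
        if name in STRATEGIES_SKETCH:
            raise ValueError(f"strategy {name!r} is already registered")
        STRATEGIES_SKETCH[name] = fn
        return fn
    return decorator


@register("paragraph-demo")
def paragraph_demo(text: str, **_ignored) -> List[dict]:
    """Toy strategy: one chunk per blank-line-separated paragraph."""
    parts = [p for p in text.split("\n\n") if p.strip()]
    return [{"id": i, "text": p} for i, p in enumerate(parts)]


def run_strategy(name: str, text: str, **kwargs) -> List[dict]:
    """Look up a strategy by name and call it with strategy-specific kwargs."""
    try:
        fn = STRATEGIES_SKETCH[name]
    except KeyError:
        known = ", ".join(sorted(STRATEGIES_SKETCH))
        raise ValueError(f"unknown strategy {name!r}; known: {known}") from None
    return fn(text, **kwargs)
```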

src/cli.py:

  • Added --hierarchical-levels argument (default: "section,paragraph")
  • Added --semantic-model argument (default: "all-MiniLM-L6-v2")
  • Added --semantic-threshold argument (default: 0.7)
  • Updated _run_strategy() to pass strategy-specific parameters
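
A rough picture of how these arguments might be wired with `argparse` follows. The flag names and defaults are taken from the list above. Everything else is assumed for illustration: the subcommand layout, making `--strategy` required, and the hypothetical helpers `build_parser` and `strategy_kwargs`, which may not match src/cli.py.

```python
import argparse
from typing import Any, Dict


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the three new flags on an 'analyze' subcommand."""
    parser = argparse.ArgumentParser(prog="rag-chunk")
    sub = parser.add_subparsers(dest="command", required=True)
    analyze = sub.add_parser("analyze")
    analyze.add_argument("path")
    # Assumption: required here only because the real default is not stated.
    analyze.add_argument("--strategy", required=True)
    analyze.add_argument("--hierarchical-levels", default="section,paragraph")
    analyze.add_argument("--semantic-model", default="all-MiniLM-L6-v2")
    analyze.add_argument("--semantic-threshold", type=float, default=0.7)
    return parser


def strategy_kwargs(args: argparse.Namespace) -> Dict[str, Any]:
    """Pick only the keyword arguments relevant to the selected strategy."""
    if args.strategy == "hierarchical":
        levels = [s.strip() for s in args.hierarchical_levels.split(",") if s.strip()]
        return {"levels": levels}
    if args.strategy == "semantic-embedding":
        return {"model": args.semantic_model, "threshold": args.semantic_threshold}
    return {}
```

Keeping the per-strategy parameters in one helper means other strategies never receive flags they do not understand.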

pyproject.toml:

  • Added embeddings optional dependency for sentence-transformers
  • Updated all group to include embeddings

tests/test_basic.py:

  • Added 6 tests covering hierarchical levels and semantic splitting

README.md & examples/advanced_strategies.md:

  • Complete documentation with usage examples and best practices
Original prompt

This section details the original issue you should resolve

<issue_title>Advanced Strategies: Hierarchical chunking, semantic similarity-based splitting</issue_title>
<issue_description>## Summary
Implement advanced chunking strategies to improve document segmentation and retrieval for RAG pipelines. These extend current strategies (fixed-size, sliding-window, paragraph, recursive-character) by leveraging hierarchical document structure and deep semantic cues.

Hierarchical Chunking

  • Build multi-level chunk hierarchies: e.g., split on Markdown headlines (h1/h2/h3) → sections → paragraphs → sentences.
  • API: hierarchical_chunk(text, levels=['section', 'paragraph', 'sentence']) in src/chunker.py.
  • Each chunk should include metadata: id, parent_id, level, start_char, end_char, token_count, source_path.
  • CLI: select via --strategy hierarchical, with tunable levels and fallback to paragraph splitting.

Semantic Similarity-based Splitting

  • Detect topic boundaries via semantic embeddings (using sentence-transformers or LangChain embeddings).
  • API: semantic_split(text, model='all-MiniLM-L6-v2', threshold=0.7) in src/chunker.py.
  • Split at points where similarity with neighboring sentences drops below threshold (changepoint detection).
  • CLI: select via --strategy semantic-embedding, with options for model and threshold, configurable in CLI.
  • Compare results with current recursive-character strategy (LangChain integration).

Implementation Plan

  • Add new strategy registrations to STRATEGIES in src/chunker.py.
  • Write modular functions for hierarchical and semantic chunking.
  • Integrate into CLI and allow strategy selection.
  • Add optional dependencies for sentence-transformers (rag-chunk[embeddings]).
  • Add output and chunk metadata to improve traceability for hierarchical chunks.
  • Extend existing tests in tests/ and add example files under examples/ for notebook demonstrations.

Motivation

  • Enable multi-resolution document retrieval and indexing.
  • Improve semantic relevance and reduce context fragmentation for RAG applications.

Acceptance Criteria

  • Hierarchical and semantic strategies available via CLI (--strategy flag)
  • New functions in src/chunker.py with clear API signatures
  • Tests and example notebooks for both strategies
  • Benchmark/evaluation harness extended to new strategies
  • Optional dependencies well-documented in README

References


Labels: enhancement, strategy, roadmap, semantic, hierarchical</issue_description>

<agent_instructions>go go</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Copilot AI changed the title [WIP] Implement advanced chunking strategies for document segmentation Add hierarchical and semantic-embedding chunking strategies Dec 24, 2025
Copilot AI requested a review from messkan December 24, 2025 23:38
Development

Successfully merging this pull request may close these issues.

Advanced Strategies: Hierarchical chunking, semantic similarity-based splitting
