# Bunsetsu (文節)

PyPI version · License: MIT · Python 3.9+

Japanese-optimized semantic text chunking for RAG applications.

Unlike general-purpose text splitters, Bunsetsu understands Japanese text structure—no spaces between words, particles that bind phrases, and sentence patterns that differ from English. This results in more coherent chunks and better retrieval accuracy for Japanese RAG systems.

## Why Bunsetsu?

| Feature | Generic Splitters | Bunsetsu |
|---|---|---|
| Japanese word boundaries | ❌ Breaks mid-word | ✅ Respects morphology |
| Particle handling | ❌ Splits は/が from nouns | ✅ Keeps phrases intact |
| Sentence detection | ⚠️ Basic (。only) | ✅ Full (。!?、etc.) |
| Topic boundaries | ❌ Ignores | ✅ Detects は/が patterns |
| Dependencies | Heavy | Zero by default |

## Installation

```bash
# Basic installation (zero dependencies)
pip install bunsetsu

# With MeCab tokenizer (higher accuracy)
pip install bunsetsu[mecab]

# With Sudachi tokenizer (multiple granularity modes)
pip install bunsetsu[sudachi]

# All tokenizers
pip install bunsetsu[all]
```
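To verify the install, a quick import check is enough; the version lookup below is hedged with `getattr` because a `__version__` attribute is an assumption, not something documented here:

```python
import bunsetsu

# Fall back gracefully if the package does not expose a version attribute
print(getattr(bunsetsu, "__version__", "bunsetsu imported successfully"))
```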

## Quick Start

```python
from bunsetsu import chunk_text

text = """
人工知能の発展は目覚ましいものがあります。
特に大規模言語モデルの登場により、自然言語処理の分野は大きく変わりました。
"""

# Simple semantic chunking
chunks = chunk_text(text, strategy="semantic", chunk_size=200)

for chunk in chunks:
    print(f"[{chunk.char_count} chars] {chunk.text[:50]}...")
```

## Chunking Strategies

### 1. Semantic Chunking (Recommended for RAG)

Splits text based on meaning and topic boundaries:

```python
from bunsetsu import SemanticChunker

chunker = SemanticChunker(
    min_chunk_size=100,
    max_chunk_size=500,
)

chunks = chunker.chunk(text)
```
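The documented `char_count` field makes it easy to sanity-check the configured bounds. A minimal sketch; note that the final chunk of a document may legitimately come in under `min_chunk_size`:

```python
for i, chunk in enumerate(chunks):
    # Most chunks should land between min_chunk_size and max_chunk_size
    print(f"chunk {i}: {chunk.char_count} chars")
```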

### 2. Fixed-Size with Sentence Awareness

Character-based splitting that respects sentence boundaries:

```python
from bunsetsu import FixedSizeChunker

chunker = FixedSizeChunker(
    chunk_size=500,
    chunk_overlap=50,
    respect_sentences=True,  # Don't break mid-sentence
)

chunks = chunker.chunk(text)
```
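Because every chunk records `start_char` and `end_char` offsets into the original text (see the Chunk object reference below), you can check how much adjacent chunks actually overlap. A minimal sketch, assuming the offsets index into the input string:

```python
for prev, curr in zip(chunks, chunks[1:]):
    # How far the previous chunk extends past the start of the next one
    overlap = max(0, prev.end_char - curr.start_char)
    print(f"overlap: {overlap} chars")
```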

### 3. Recursive (Document Structure)

Splits hierarchically by headings, paragraphs, sentences, then clauses:

```python
from bunsetsu import RecursiveChunker

# Sample Markdown input; headings and paragraphs guide the splits
markdown_text = """
# 人工知能の概要
大規模言語モデルは自然言語処理の分野を大きく変えました。
"""

chunker = RecursiveChunker(
    chunk_size=500,
    chunk_overlap=50,
)

chunks = chunker.chunk(markdown_text)
```

## Tokenizer Backends

### SimpleTokenizer (Default)

Regex-based, zero dependencies. Good for most use cases:

```python
from bunsetsu import SimpleTokenizer

tokenizer = SimpleTokenizer()
tokens = tokenizer.tokenize("日本語のテキスト")
```

### MeCabTokenizer (High Accuracy)

Uses MeCab via fugashi for proper morphological analysis:

```python
from bunsetsu import MeCabTokenizer, SemanticChunker

tokenizer = MeCabTokenizer()
chunker = SemanticChunker(tokenizer=tokenizer)
```
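With a morphological backend, readings and dictionary forms become available on the documented `Token` fields. A minimal sketch, assuming `MeCabTokenizer` exposes the same `tokenize()` method as `SimpleTokenizer`:

```python
from bunsetsu import MeCabTokenizer

tokenizer = MeCabTokenizer()
for token in tokenizer.tokenize("自然言語処理の分野は大きく変わりました。"):
    # reading and base_form are populated when the backend provides them
    print(token.surface, token.reading, token.base_form)
```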

### SudachiTokenizer (Flexible Granularity)

Supports three tokenization modes (A/B/C):

```python
from bunsetsu import SudachiTokenizer

# Mode C: Longest unit (compound words kept together)
tokenizer = SudachiTokenizer(mode="C")

# Mode B: Middle unit (between A and C)
tokenizer = SudachiTokenizer(mode="B")

# Mode A: Shortest unit (fine-grained)
tokenizer = SudachiTokenizer(mode="A")
```

## Framework Integrations

### LangChain

```python
from bunsetsu.integrations import LangChainTextSplitter
from langchain.schema import Document

splitter = LangChainTextSplitter(
    strategy="semantic",
    chunk_size=500,
)

# Split plain text
chunks = splitter.split_text(text)

# Split Documents
docs = [Document(page_content=text, metadata={"source": "file.txt"})]
split_docs = splitter.split_documents(docs)
```
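If this splitter behaves like LangChain's built-in splitters, source metadata is carried through to every resulting Document, which is useful for citing sources in RAG answers (an assumption about the integration, not something stated above):

```python
for doc in split_docs:
    # Each split keeps the metadata of its parent document
    print(doc.metadata.get("source"), len(doc.page_content))
```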

### LlamaIndex

```python
from bunsetsu.integrations import LlamaIndexNodeParser

parser = LlamaIndexNodeParser(
    strategy="semantic",
    chunk_size=500,
)

nodes = parser.get_nodes_from_documents(documents)
```

## API Reference

### `chunk_text()`

Convenience function for quick chunking:

```python
chunks = chunk_text(
    text,
    strategy="semantic",         # "fixed", "semantic", or "recursive"
    chunk_size=500,              # Target chunk size
    chunk_overlap=50,            # Overlap between chunks
    tokenizer_backend="simple",  # "simple", "mecab", or "sudachi"
)
```

### Chunk Object

```python
chunk.text        # The chunk content
chunk.start_char  # Start position in original text
chunk.end_char    # End position in original text
chunk.char_count  # Number of characters
chunk.metadata    # Additional metadata dict
```
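Since `start_char`/`end_char` are positions in the original text, each chunk should slice straight back out of the source string. A minimal round-trip check, assuming end offsets are exclusive (the usual Python slice convention):

```python
for chunk in chunks:
    # The recorded offsets should reproduce the chunk text exactly
    assert text[chunk.start_char:chunk.end_char] == chunk.text
```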

### Token Object

```python
token.surface          # Surface form (as written)
token.token_type       # TokenType enum (NOUN, VERB, PARTICLE, etc.)
token.reading          # Reading (if available)
token.base_form        # Dictionary form (if available)
token.is_content_word  # True for nouns, verbs, adjectives
```
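The `is_content_word` flag gives a quick way to pull keyword candidates out of a passage. A minimal sketch using the zero-dependency `SimpleTokenizer`:

```python
from bunsetsu import SimpleTokenizer

tokenizer = SimpleTokenizer()
tokens = tokenizer.tokenize("大規模言語モデルの登場により、自然言語処理の分野は大きく変わりました。")

# Keep only nouns, verbs, and adjectives as rough keyword candidates
keywords = [t.surface for t in tokens if t.is_content_word]
print(keywords)
```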

## Performance

Benchmarked on a 100KB Japanese document:

| Chunker | Time | Chunks | Avg Size |
|---|---|---|---|
| FixedSizeChunker | 12ms | 203 | 492 chars |
| SemanticChunker (simple) | 45ms | 187 | 534 chars |
| SemanticChunker (mecab) | 89ms | 192 | 521 chars |
| RecursiveChunker | 23ms | 198 | 505 chars |
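The benchmark script itself isn't included in this README; the sketch below times two of the chunkers on a comparable volume of text, with a repeated sample sentence standing in for a real document:

```python
import time

from bunsetsu import FixedSizeChunker, SemanticChunker

# Repeated sample sentence, roughly 100 KB of UTF-8 text
document = "人工知能の発展は目覚ましいものがあります。" * 1500

chunkers = [
    FixedSizeChunker(chunk_size=500),
    SemanticChunker(min_chunk_size=100, max_chunk_size=500),
]

for chunker in chunkers:
    start = time.perf_counter()
    chunks = chunker.chunk(document)
    elapsed_ms = (time.perf_counter() - start) * 1000
    avg = sum(c.char_count for c in chunks) / len(chunks)
    print(f"{type(chunker).__name__}: {elapsed_ms:.0f}ms, {len(chunks)} chunks, {avg:.0f} chars avg")
```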

## Design Philosophy

  1. Japanese-first: Built specifically for Japanese text, not adapted from English
  2. Zero dependencies by default: Works out of the box, optional backends for accuracy
  3. RAG-optimized: Chunks designed for embedding and retrieval, not just display
  4. Framework-agnostic: Core library works standalone, integrations provided separately

## Contributing

Contributions are welcome! Please check CONTRIBUTING.md for guidelines.

```bash
# Development setup
git clone https://github.com/YUALAB/bunsetsu.git
cd bunsetsu
pip install -e ".[dev]"

# Run tests
pytest

# Run linter
ruff check src/
```

## License

MIT License - see LICENSE for details.

## About

Developed by YUA LAB (AQUA LLC), Tokyo.

We build AI agents and RAG systems for enterprise. This library powers our production RAG deployments.
