feat(indexer): add AST-aware chunking via cAST algorithm #190

Open
Don-Yin wants to merge 1 commit into yoanbernabeu:main from Don-Yin:feat/ast-chunking-cast
@Don-Yin Don-Yin commented Mar 16, 2026

summary

this PR adds an optional AST-aware chunking strategy based on the cAST algorithm (Zhang, Zhao, Wang et al., EMNLP 2025, arXiv:2506.15655). instead of splitting files at fixed character windows, the chunker uses tree-sitter to parse source files and produces chunks aligned with function, class, and declaration boundaries.

the algorithm works as follows:

  1. if the entire file fits within the non-whitespace character budget, emit it as a single chunk
  2. otherwise, iterate over root-level AST children, greedily grouping adjacent nodes whose combined non-whitespace characters fit within the budget
  3. if a single node exceeds the budget, recursively descend into its children
  4. apply a second greedy merge pass on adjacent ranges
  5. fill byte gaps between ranges to guarantee verbatim reconstruction
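
the five steps above can be sketched without tree-sitter. this is a minimal illustration, not the code in this PR: `Node`, `Range`, `nonWS`, `split`, `mergeAndFill` and `Chunk` are hypothetical names, `Node` stands in for a parsed tree-sitter node, and an oversized leaf is simply kept whole, which is an assumption about an edge case the description does not cover.

```go
package main

import "unicode"

// Node is a hypothetical stand-in for a tree-sitter node (byte range plus children).
type Node struct {
	Start, End int
	Children   []*Node
}

// Range is a half-open byte span [Start, End) of the source.
type Range struct{ Start, End int }

// nonWS counts non-whitespace characters, the size metric used for the budget.
func nonWS(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.IsSpace(r) {
			n++
		}
	}
	return n
}

// split covers steps 2 and 3: group siblings greedily, descend into oversized nodes.
func split(src string, nodes []*Node, budget int) []Range {
	var out []Range
	var cur Range
	size, open := 0, false
	flush := func() {
		if open {
			out = append(out, cur)
			size, open = 0, false
		}
	}
	for _, n := range nodes {
		sz := nonWS(src[n.Start:n.End])
		switch {
		case sz > budget && len(n.Children) > 0:
			flush()
			out = append(out, split(src, n.Children, budget)...)
		case sz > budget:
			flush()
			out = append(out, Range{n.Start, n.End}) // oversized leaf: kept whole (assumption)
		default:
			if open && size+sz > budget {
				flush()
			}
			if !open {
				cur, open = Range{n.Start, n.End}, true
			}
			cur.End = n.End
			size += sz
		}
	}
	flush()
	return out
}

// mergeAndFill covers steps 4 and 5.
func mergeAndFill(src string, rs []Range, budget int) []Range {
	var m []Range
	for _, r := range rs {
		if k := len(m) - 1; k >= 0 && nonWS(src[m[k].Start:r.End]) <= budget {
			m[k].End = r.End // step 4: absorb the neighbour
		} else {
			m = append(m, r)
		}
	}
	for i := range m { // step 5: stretch ranges so they tile [0, len(src))
		if i+1 < len(m) {
			m[i].End = m[i+1].Start
		} else {
			m[i].End = len(src)
		}
	}
	if len(m) > 0 {
		m[0].Start = 0
	}
	return m
}

// Chunk runs steps 1 to 5 and returns chunks that concatenate back to src.
func Chunk(src string, root *Node, budget int) []string {
	if nonWS(src) <= budget || len(root.Children) == 0 {
		return []string{src} // step 1, or nothing to split on
	}
	var chunks []string
	for _, r := range mergeAndFill(src, split(src, root.Children, budget), budget) {
		chunks = append(chunks, src[r.Start:r.End])
	}
	return chunks
}
```

because step 5 assigns every gap byte to the preceding range (and any leading bytes to the first one), joining the returned chunks in order gives back the input byte for byte.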

supported languages: Go, Python, JavaScript, TypeScript. files in any other language fall back to the existing fixed-size chunker.

configuration

chunking:
  size: 512
  overlap: 50
  strategy: ast   # "fixed" (default) or "ast"
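
a rough sketch of how the new key might be read on the Go side. `ChunkingConfig` and its `Strategy` field are named in this PR; the yaml tags, the other field names, and the `ResolveStrategy` helper are assumptions for illustration, including the choice to reject unknown values rather than silently defaulting.

```go
package main

import "fmt"

// ChunkingConfig mirrors the YAML keys above. Only the Strategy field is
// confirmed by the PR; tags and the other field names are assumed.
type ChunkingConfig struct {
	Size     int    `yaml:"size"`
	Overlap  int    `yaml:"overlap"`
	Strategy string `yaml:"strategy"`
}

// ResolveStrategy is a hypothetical helper: an empty value means "fixed"
// (the documented default), and anything other than "fixed" or "ast" is an error.
func (c ChunkingConfig) ResolveStrategy() (string, error) {
	switch c.Strategy {
	case "", "fixed":
		return "fixed", nil
	case "ast":
		return "ast", nil
	default:
		return "", fmt.Errorf("unknown chunking strategy %q (want \"fixed\" or \"ast\")", c.Strategy)
	}
}
```

with this shape, a config file that omits `strategy` keeps the current behaviour, so existing configs would not need to change.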

results

tested on a mixed workspace (~189 files). compared to fixed-size chunking, cAST improved file diversity by 56% (25 vs 16 unique files in top-5 across five queries) while maintaining the same source-code ranking quality. small files that previously produced diluted embeddings (e.g., a 15-line config module) now rank correctly as single coherent chunks.
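
for reference, the diversity figure is the relative change in the number of distinct files across all top-5 lists. the sketch below shows that calculation; `uniqueFiles` and `relativeGainPct` are hypothetical helper names, not part of this PR.

```go
package main

// uniqueFiles counts distinct file paths across the top-k results of several queries.
func uniqueFiles(topK [][]string) int {
	seen := make(map[string]struct{})
	for _, results := range topK {
		for _, path := range results {
			seen[path] = struct{}{}
		}
	}
	return len(seen)
}

// relativeGainPct returns the percentage change from baseline to candidate.
func relativeGainPct(candidate, baseline int) float64 {
	return float64(candidate-baseline) / float64(baseline) * 100
}
```

`relativeGainPct(25, 16)` evaluates to 56.25, which is the 56% quoted above. with five queries and five results each, 25 is also the maximum possible count.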

see results.md for the full experiment write-up.

changes

  • indexer/chunker_iface.go: new FileChunker interface
  • indexer/chunker_ast.go: ASTChunker implementation (build tag: treesitter)
  • indexer/chunker_ast_stub.go: stub factory for builds without tree-sitter
  • indexer/chunker_ast_test.go: unit tests (chunking, verbatim reconstruction, merge, recursive descent, fallback)
  • config/config.go: Strategy field added to ChunkingConfig
  • indexer/indexer.go: Indexer.chunker changed from *Chunker to FileChunker
  • cli/watch.go: uses NewFileChunker(strategy, size, overlap)
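
a minimal sketch of how the interface and factory could fit together. `FileChunker` and `NewFileChunker(strategy, size, overlap)` are the names used in this PR; the `ChunkFile` method, the two concrete types, the extension list, and the use of `size` as the AST budget are assumptions, and the AST branch is a placeholder rather than a real tree-sitter parse.

```go
package main

import (
	"path/filepath"
	"strings"
)

// FileChunker is the interface named in the change list; its method set here is assumed.
type FileChunker interface {
	ChunkFile(path, content string) []string
}

// fixedChunker cuts overlapping fixed-size byte windows (a stand-in for the
// existing chunker; byte slicing may split multi-byte characters).
type fixedChunker struct{ size, overlap int }

func (c fixedChunker) ChunkFile(_, content string) []string {
	step := c.size - c.overlap
	if c.size <= 0 || step <= 0 {
		return []string{content}
	}
	var out []string
	for start := 0; start < len(content); start += step {
		end := start + c.size
		if end > len(content) {
			end = len(content)
		}
		out = append(out, content[start:end])
		if end == len(content) {
			break
		}
	}
	return out
}

// astExts is an assumed extension list for the four supported languages.
var astExts = map[string]bool{".go": true, ".py": true, ".js": true, ".ts": true}

// astChunker dispatches by extension and falls back for unsupported files.
type astChunker struct {
	budget   int
	fallback FileChunker
}

func (c astChunker) ChunkFile(path, content string) []string {
	if !astExts[strings.ToLower(filepath.Ext(path))] {
		return c.fallback.ChunkFile(path, content)
	}
	// placeholder: a real implementation would parse with tree-sitter here
	// and split on AST boundaries within c.budget.
	return []string{content}
}

// NewFileChunker picks the implementation from the configured strategy.
func NewFileChunker(strategy string, size, overlap int) FileChunker {
	fixed := fixedChunker{size: size, overlap: overlap}
	if strategy == "ast" {
		return astChunker{budget: size, fallback: fixed}
	}
	return fixed
}
```

keeping the fallback inside the AST chunker means callers such as the indexer only ever hold one `FileChunker`, regardless of file type or build tag.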

all existing tests pass under both treesitter and default build tags.

reference

Zhang, Zhao, Wang et al. (2025). "cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree." EMNLP 2025. arXiv:2506.15655.

Made with Cursor

Implement structure-aware code chunking based on the cAST algorithm
(Zhang et al., EMNLP 2025, arXiv:2506.15655). The chunker uses
tree-sitter to parse supported languages and recursively splits
oversized AST nodes while greedily merging small siblings, producing
chunks aligned with function and class boundaries.

Key properties:
- non-whitespace character count as size metric
- recursive descent for nodes exceeding the budget
- greedy sibling merge to minimize chunk count
- verbatim reconstruction guarantee (chunks concatenate to original)
- fallback to fixed-size chunker for unsupported file types

Configured via chunking.strategy: "ast" or "fixed" (default).
Supported languages: Go, Python, JavaScript, TypeScript.

Includes unit tests for chunking, reconstruction, merge logic,
recursive descent, and fallback behaviour.
@Don-Yin Don-Yin force-pushed the feat/ast-chunking-cast branch from fd860ba to 7e559b5 Compare March 16, 2026 17:38