feat(indexer): add AST-aware chunking via cAST algorithm #190
Open
Don-Yin wants to merge 1 commit into yoanbernabeu:main from
Implement structure-aware code chunking based on the cAST algorithm (Zhang et al., EMNLP 2025, arXiv:2506.15655). The chunker uses tree-sitter to parse supported languages and recursively splits oversized AST nodes while greedily merging small siblings, producing chunks aligned with function and class boundaries.

Key properties:
- non-whitespace character count as size metric
- recursive descent for nodes exceeding the budget
- greedy sibling merge to minimize chunk count
- verbatim reconstruction guarantee (chunks concatenate to original)
- fallback to fixed-size chunker for unsupported file types

Configured via chunking.strategy: "ast" or "fixed" (default).
Supported languages: Go, Python, JavaScript, TypeScript.

Includes unit tests for chunking, reconstruction, merge logic, recursive descent, and fallback behaviour.
summary
this PR adds an optional AST-aware chunking strategy based on the cAST algorithm (Zhang, Zhao, Wang et al., EMNLP 2025, arXiv:2506.15655). instead of splitting files at fixed character windows, the chunker uses tree-sitter to parse source files and produces chunks aligned with function, class, and declaration boundaries.
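the algorithm works as follows:
- parse the file with tree-sitter to obtain its AST.
- measure each node by its non-whitespace character count.
- recursively descend into any node that exceeds the size budget.
- greedily merge adjacent small siblings to minimize the chunk count.
- emit chunks that concatenate back to the original file verbatim.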
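a minimal sketch of the split-then-merge idea, written against a toy tree rather than tree-sitter. the node and span types, the chunk function, and the rule that text between children attaches to the following child are all assumptions made for illustration; this is not the code in indexer/chunker_ast.go and it may differ from the paper's exact procedure.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// node is a hypothetical stand-in for a tree-sitter node: a byte range of the
// source plus ordered, non-overlapping children.
type node struct {
	start, end int
	children   []node
}

// span is a half-open byte range [start, end) of the source.
type span struct{ start, end int }

// size counts non-whitespace characters, the size metric named in this PR.
func size(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.IsSpace(r) {
			n++
		}
	}
	return n
}

// chunk covers src[lo:hi] (the range owned by n) with contiguous spans.
// Text between children is attached to the child that follows it, and the
// tail after the last child is attached to that child, so nothing is dropped.
func chunk(src string, n node, lo, hi, budget int) []span {
	if size(src[lo:hi]) <= budget || len(n.children) == 0 {
		// Fits the budget, or is a leaf that cannot be split structurally.
		return []span{{lo, hi}}
	}
	var out []span
	pos := lo           // start of the text not yet assigned to a child
	cur := span{lo, lo} // open chunk; empty while cur.start == cur.end
	flush := func() {
		if cur.end > cur.start {
			out = append(out, cur)
		}
	}
	for i, c := range n.children {
		segEnd := c.end
		if i == len(n.children)-1 {
			segEnd = hi
		}
		seg := span{pos, segEnd}
		pos = segEnd
		switch {
		case size(src[seg.start:seg.end]) > budget:
			// Recursive descent: close the open chunk, then split the child.
			flush()
			out = append(out, chunk(src, c, seg.start, seg.end, budget)...)
			cur = span{pos, pos}
		case size(src[cur.start:seg.end]) > budget:
			// Merging would overflow: close the open chunk and start a new one.
			flush()
			cur = seg
		default:
			// Greedy merge: extend the open chunk over this sibling.
			cur.end = seg.end
		}
	}
	flush()
	return out
}

const toySrc = "import x\n\nfunc a() { p }\n\nfunc b() { q; r; s; z }\n"

// at locates text in toySrc; every snippet used below occurs exactly once.
func at(text string, children ...node) node {
	i := strings.Index(toySrc, text)
	return node{start: i, end: i + len(text), children: children}
}

// toyTree is a hand-built tree for toySrc, not real parser output.
func toyTree() node {
	return node{start: 0, end: len(toySrc), children: []node{
		at("import x"),
		at("func a() { p }"),
		at("func b() { q; r; s; z }", at("q;"), at("r;"), at("s;"), at("z")),
	}}
}

func main() {
	spans := chunk(toySrc, toyTree(), 0, len(toySrc), 12)
	var rebuilt strings.Builder
	for _, s := range spans {
		text := toySrc[s.start:s.end]
		rebuilt.WriteString(text)
		fmt.Printf("%q\n", text)
	}
	fmt.Println("verbatim:", rebuilt.String() == toySrc)
}
```

with a budget of 12 non-whitespace characters, the toy input yields four chunks: the import, func a, and two pieces of func b, which is too large to keep whole. the spans are contiguous, so joining them gives back the input unchanged.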
supported languages: Go, Python, JavaScript, TypeScript. unsupported files fall back to the existing fixed-size chunker.
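the fallback rule can be illustrated with a small dispatch function. this is a sketch under assumptions: chunkerFor and astLanguages are hypothetical names, and the extension set is a guess at how the four listed languages might be detected; the PR does not say how files are matched to languages.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// astLanguages maps file extensions to the four languages this PR lists as
// supported. The extension set itself is an assumption.
var astLanguages = map[string]string{
	".go": "Go",
	".py": "Python",
	".js": "JavaScript",
	".ts": "TypeScript",
}

// chunkerFor reports which chunker would handle a path: "ast" when the
// strategy is "ast" and the extension is recognized, "fixed" otherwise.
func chunkerFor(strategy, path string) string {
	if strategy != "ast" {
		return "fixed"
	}
	if _, ok := astLanguages[strings.ToLower(filepath.Ext(path))]; ok {
		return "ast"
	}
	return "fixed"
}

func main() {
	for _, p := range []string{"indexer/indexer.go", "README.md", "app/main.PY"} {
		fmt.Printf("%s -> %s\n", p, chunkerFor("ast", p))
	}
}
```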
configuration
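the commit message gives the setting as chunking.strategy: "ast" or "fixed" (default). below is a hedged Go sketch of how such a field might be defaulted and validated. only the Strategy field on ChunkingConfig is confirmed by this PR; Size and Overlap are inferred from the NewFileChunker(strategy, size, overlap) call, and the Normalize method is hypothetical.

```go
package main

import "fmt"

// ChunkingConfig mirrors the struct named in this PR. Only Strategy is
// confirmed by the PR text; Size and Overlap are assumed field names.
type ChunkingConfig struct {
	Strategy string
	Size     int
	Overlap  int
}

// Normalize (hypothetical) applies the documented default, "fixed", and
// rejects any value other than "ast" or "fixed".
func (c *ChunkingConfig) Normalize() error {
	switch c.Strategy {
	case "":
		c.Strategy = "fixed"
	case "ast", "fixed":
		// Already a valid value.
	default:
		return fmt.Errorf("chunking.strategy: unknown value %q (want \"ast\" or \"fixed\")", c.Strategy)
	}
	return nil
}

func main() {
	cfg := ChunkingConfig{} // strategy left unset
	if err := cfg.Normalize(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("strategy:", cfg.Strategy)
}
```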
results
tested on a mixed workspace (~189 files). compared to fixed-size chunking, cAST improved file diversity by 56% (25 vs 16 unique files in top-5 across five queries) while maintaining the same source-code ranking quality. small files that previously produced diluted embeddings (e.g., a 15-line config module) now rank correctly as single coherent chunks.
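the 56% figure follows from the two counts: (25 - 16) / 16 = 56.25%. the sketch below shows one plausible way to compute such a count, assuming it sums the distinct files in each query's top-k list; the PR does not spell out the exact definition, and the function names and toy data are hypothetical, not taken from the experiment.

```go
package main

import "fmt"

// uniqueFiles sums, over all queries, the number of distinct file paths in
// that query's result list. This reading of "file diversity" is an assumption.
func uniqueFiles(topK [][]string) int {
	total := 0
	for _, results := range topK {
		seen := map[string]struct{}{}
		for _, path := range results {
			seen[path] = struct{}{}
		}
		total += len(seen)
	}
	return total
}

// relativeGain returns the percentage improvement of after over before.
func relativeGain(before, after int) float64 {
	return 100 * float64(after-before) / float64(before)
}

func main() {
	// Hypothetical two-query example, not data from the experiment.
	toy := [][]string{{"a.go", "b.go", "a.go"}, {"b.go", "c.py"}}
	fmt.Println(uniqueFiles(toy)) // 4

	// The counts reported in this PR: 16 (fixed) vs 25 (cAST).
	fmt.Printf("%.2f%%\n", relativeGain(16, 25)) // 56.25%
}
```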
see results.md for the full experiment write-up.
changes
- indexer/chunker_iface.go: new FileChunker interface
- indexer/chunker_ast.go: ASTChunker implementation (build tag: treesitter)
- indexer/chunker_ast_stub.go: stub factory for builds without tree-sitter
- indexer/chunker_ast_test.go: unit tests (chunking, verbatim reconstruction, merge, recursive descent, fallback)
- config/config.go: Strategy field added to ChunkingConfig
- indexer/indexer.go: Indexer.chunker changed from *Chunker to FileChunker
- cli/watch.go: uses NewFileChunker(strategy, size, overlap)
all existing tests pass under both treesitter and default build tags.
reference
Zhang, Zhao, Wang et al. (2025). "cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree." EMNLP 2025. arXiv:2506.15655.