
Feature Request: Automatic Chunking (a.k.a. “YOLO Divide”) for Oversized Codebases #424

@rockmandash

Description


It would be immensely helpful if Repomix could automatically split large codebases into multiple output files once a certain token threshold (e.g. 128k) is reached. This “YOLO divide” approach would allow users to simply specify a cutoff and let Repomix handle the heavy lifting. The resulting “chunked” files could then be fed sequentially into AI tools that have strict context size limits.


Use Case

I often want to provide my entire codebase to a large language model for analysis or troubleshooting, but the codebase exceeds the 128k token limit by a wide margin—sometimes up to ten times more. Manually chopping the output into smaller pieces is tedious. An automated chunking solution would streamline this process significantly.

Desired Behavior

  1. Config Option: A single setting, for example:

    {
      "output": {
        "yoloDivideIntoChunksIfExceedToken": 128000
      }
    }
  2. Automatic Chunk Generation:

    • If the total token count surpasses the specified threshold, Repomix splits the output into separate files (e.g., repomix-output-chunk-1.xml, repomix-output-chunk-2.xml, etc.).
    • The exact method of splitting doesn’t matter much to me. Any reasonable approach (by file boundaries, lines, or token count) would be sufficient, as long as each chunk stays under the limit.
  3. Outcome:

    • Users can quickly copy and paste each chunk into their AI tool of choice (e.g., O1 Pro, ChatGPT, Claude, etc.) without worrying about hitting token limits.
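To make the splitting idea concrete, here is a minimal sketch of greedy chunking along file boundaries. Everything in it is illustrative rather than Repomix API: the names `splitIntoChunks` and `estimateTokens` are hypothetical, and the token count uses the rough "~4 characters per token" heuristic rather than a real tokenizer.

```typescript
// Hypothetical sketch: pack whole files into chunks so each chunk stays
// under a token budget. Not actual Repomix code; names are illustrative.

interface PackedFile {
  path: string;
  content: string;
}

// Rough token estimate using the common ~4-characters-per-token heuristic.
// A real implementation would use a proper tokenizer (e.g. tiktoken).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedy packing: start a new chunk whenever adding the next file would
// push the current chunk past the budget. A single file larger than the
// budget still becomes its own (oversized) chunk rather than being split.
function splitIntoChunks(files: PackedFile[], maxTokens: number): PackedFile[][] {
  const chunks: PackedFile[][] = [];
  let current: PackedFile[] = [];
  let currentTokens = 0;

  for (const file of files) {
    const tokens = estimateTokens(file.content);
    if (current.length > 0 && currentTokens + tokens > maxTokens) {
      chunks.push(current);
      current = [];
      currentTokens = 0;
    }
    current.push(file);
    currentTokens += tokens;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each resulting chunk would then be rendered to its own output file (`repomix-output-chunk-1.xml`, `repomix-output-chunk-2.xml`, ...). Splitting only at file boundaries keeps each file intact for the model, at the cost of some slack below the threshold in each chunk.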

Why This Matters

  • Efficiency: Eliminates manual slicing of the codebase output.
  • Ease of Use: Allows large repositories to be handled in one go, rather than requiring multiple runs or external scripts.
  • Flexibility: Users who just want “the whole codebase in the AI” can get a straightforward multi-file output to paste into their model in chunks.

Thank you for considering this request! An automatic chunking feature would be a game-changer for those of us dealing with large codebases and strict LLM context limits.

Metadata

Labels: enhancement (New feature or request)