
feat: add graph min-loc repair pass #29

Merged
vitali87 merged 1 commit into main from feat/graph-loc-bounds-repair
Mar 27, 2026

feat: add graph min-loc repair pass#29
vitali87 merged 1 commit intomainfrom
feat/graph-loc-bounds-repair

Conversation

@vitali87 (Owner)

Summary

  • add a deterministic graph-backend repair pass for undersized groups when min_loc is set
  • keep the graph backend's max bound hard by only merging groups when the combined load still fits within max_loc
  • avoid merging same-file groups when doing so would interleave hunk order, and add focused graph repair tests
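The merge rules in these bullets can be sketched as a toy greedy loop. This is illustrative only: the real pass also weighs graph affinity, anchor positions, and hunk order, and the function and variable names here are invented for the sketch.

```python
# Toy sketch of a min_loc repair pass: greedily fold the smallest undersized
# group into a target whose combined load still fits under max_loc. This is
# NOT the real _repair_graph_min_loc; names and tie-breaks are simplified.
def repair_min_loc(groups: list[list[int]], min_loc: int, max_loc: int) -> list[list[int]]:
    groups = [list(g) for g in groups]  # each inner list holds per-unit LOC
    while True:
        undersized = sorted(
            (i for i, g in enumerate(groups) if sum(g) < min_loc),
            key=lambda i: (sum(groups[i]), i),  # smallest load first, index as tie-break
        )
        merged = False
        for src in undersized:
            fits = [
                t for t in range(len(groups))
                if t != src and sum(groups[t]) + sum(groups[src]) <= max_loc
            ]
            if not fits:
                continue  # every merge would bust the hard max_loc bound
            tgt = min(fits, key=lambda t: (-sum(groups[t]), t))  # prefer the largest target
            groups[tgt] += groups[src]
            del groups[src]
            merged = True
            break
        if not merged:
            return groups  # no legal merge left; residual undersized groups stay
```

With groups of load 3, 3, and 8 under min_loc=5 and max_loc=10 this folds the two small groups together and leaves the 8-LOC group alone; with two 6-LOC groups under min_loc=7 and max_loc=10 it returns them unchanged, since the only merge would overflow.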

Testing

  • uv run ruff check pr_split tests
  • uv run pytest -q

Part of #6

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request enhances the graph-based partitioning strategy by introducing a new repair pass. This pass intelligently merges undersized groups to meet a specified minimum size ("min_loc"), while strictly adhering to the maximum group size ("max_loc") and preserving the contiguous order of changes within shared files. This ensures more robust and well-formed code partitions.

Highlights

  • Graph-backend repair pass: Implemented a deterministic repair pass for the graph-based partitioning strategy to address undersized groups when a "min_loc" (minimum lines of code) is specified.
  • Group merging logic: Ensured that group merging respects the "max_loc" (maximum lines of code) constraint, only combining groups if their total load remains within this limit.
  • Hunk order preservation: Prevented the merging of groups from the same file if doing so would disrupt the original hunk order, and added specific tests for this graph repair logic.
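The hunk-order highlight can be illustrated with a toy contiguity predicate. This is an assumption-laden sketch: hunk positions are modelled as plain integers, and the name mirrors but is not the real _shared_file_merge_is_contiguous.

```python
# Toy contiguity guard: merging two groups that touch the same file is only
# safe if no other group owns a hunk strictly inside the merged position span,
# which would interleave hunk order. Positions are simplified to integers.
def merge_keeps_hunk_order(a: list[int], b: list[int], others: list[list[int]]) -> bool:
    span = sorted(a + b)
    lo, hi = span[0], span[-1]
    return not any(lo < pos < hi for group in others for pos in group)
```

Merging hunks [1, 2] with [3, 4] is fine when a third group holds [5, 6], but merging [1] with [4] while another group holds [2, 3] would interleave, so the guard rejects it.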

@gemini-code-assist bot left a comment
Code Review

This pull request introduces a new feature to the graph-based partitioning strategy, allowing for the repair of undersized groups by merging them based on a min_loc setting. This involves several new helper functions for calculating group load, affinity, and finding the best merge targets, as well as a new _repair_graph_min_loc function. Corresponding tests have been added to verify this new functionality, including its determinism. The review comment points out an inefficiency in the _merge_group_units function due to deep copying and suggests a more performant approach.

Comment on pr_split/planner/partitioning.py, lines +271 to +278
merged_group = sorted(
    grouped_units[target_idx] + grouped_units[source_idx],
    key=lambda unit: unit.position,
)
repaired_groups = [list(group_units) for group_units in grouped_units]
repaired_groups[target_idx] = merged_group
del repaired_groups[source_idx]
return repaired_groups


medium

The current implementation of _merge_group_units is inefficient as it creates a deep copy of all groups ([list(group_units) for group_units in grouped_units]) on every merge. This can be costly if there are many groups. Additionally, using del with indices can be subtle to reason about.

A clearer and more performant approach is to build a new list from scratch, which avoids the expensive copy and the del operation.

Suggested change:

merged_group = sorted(
    grouped_units[target_idx] + grouped_units[source_idx],
    key=lambda unit: unit.position,
)
repaired_groups = []
for i, group in enumerate(grouped_units):
    if i == source_idx:
        continue
    if i == target_idx:
        repaired_groups.append(merged_group)
    else:
        repaired_groups.append(group)
return repaired_groups

@greptile-apps

greptile-apps bot commented Mar 24, 2026

Greptile Summary

This PR adds a deterministic post-processing repair pass (_repair_graph_min_loc) to the graph partitioning backend that iteratively merges undersized groups (those below min_loc) into their best neighbours, while respecting the hard max_loc ceiling and preserving hunk-order contiguity for shared files. The implementation fits cleanly into partition_diff as a single line after _group_units_graph.

Key changes:

  • Six new helpers: _group_load, _group_anchor_position, _group_affinity, _shared_file_merge_is_contiguous, _best_graph_merge_target, and _merge_group_units, plus the orchestrating _repair_graph_min_loc.
  • The repair loop is deterministic: sources are processed smallest-load-first, and targets are ranked by a stable 6-element key tuple.
  • partition_diff calls _repair_graph_min_loc only for PartitionStrategy.GRAPH; the CP-SAT backend is unaffected.
  • Two new tests (test_min_loc_merges_undersized_groups_when_possible, test_min_loc_repair_is_deterministic) and extended helper signatures for min_loc in the test file.
  • Both new tests exercise only the same simple two-group UNRELATED_DIFF scenario; the tiebreaking logic in _best_graph_merge_target and the blocked-merge graceful-degradation path are not covered by any test.

Confidence Score: 4/5

  • Safe to merge; implementation is logically correct and terminates, but test coverage for edge cases could be strengthened.
  • The core algorithm is correct: the repair loop terminates (bounded by the finite number of groups), all index manipulations are safe (indices are recomputed after each merge), and the Settings validator already rejects min_loc >= max_loc. The contiguity guard prevents hunk-order interleaving. The main gap is test coverage — both new tests use the same trivial two-group scenario, leaving the tiebreaking code path and the permanently-blocked-merge path unexercised. The greedy source-selection order also lacks a rationale comment, which could confuse future maintainers.
  • tests/test_partitioning_extensive.py — new tests duplicate the same scenario and miss important edge cases.

Important Files Changed

  • pr_split/planner/partitioning.py: adds the deterministic graph-backend repair pass (_repair_graph_min_loc) and its new helper functions; the logic is correct and terminates, but the greedy source-selection order lacks a comment explaining its rationale.
  • tests/test_partitioning_extensive.py: adds two graph repair tests and extends helper signatures for min_loc; both tests use the identical two-group UNRELATED_DIFF scenario, leaving the tiebreaking logic and the blocked-merge graceful-degradation path untested.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[partition_diff] --> B[_group_units_graph]
    B --> C[_repair_graph_min_loc]
    C --> D{min_loc set AND\ngroups >= 2?}
    D -- No --> E[Return groups as-is]
    D -- Yes --> F[Sort undersized groups\nby load, anchor, idx]
    F --> G{Any undersized group\nhas valid merge target?}
    G -- No --> H[Return repaired groups]
    G -- Yes --> I[_best_graph_merge_target\nfor first source]
    I --> J{merged_load\n<= max_loc?}
    J -- No --> K[Skip target]
    J -- Yes --> L{_shared_file_merge\n_is_contiguous?}
    L -- No --> K
    L -- Yes --> M{merged_underflow\n< current_underflow?}
    M -- No --> K
    M -- Yes --> N[Score merge via\n6-element key tuple]
    N --> O[_merge_group_units\nsource into target]
    O --> F

Comment on tests/test_partitioning_extensive.py, lines +187 to +197
def test_min_loc_repair_is_deterministic(self, monkeypatch: pytest.MonkeyPatch) -> None:
    settings = _settings(
        monkeypatch,
        max_loc=10,
        min_loc=5,
        partition_strategy=PartitionStrategy.GRAPH,
        priority=Priority.ORTHOGONAL,
    )
    parsed = parse_diff(UNRELATED_DIFF)
    signatures = {_group_signature(partition_diff(parsed, settings)) for _ in range(3)}
    assert len(signatures) == 1

P2 Duplicate test scenario reduces determinism-test value

test_min_loc_repair_is_deterministic uses the exact same UNRELATED_DIFF, max_loc=10, min_loc=5, and ORTHOGONAL priority as test_min_loc_merges_undersized_groups_when_possible. Because UNRELATED_DIFF produces exactly two groups with a single forced merge path, the determinism test doesn't exercise any branching in _best_graph_merge_target. A truly meaningful determinism test would use a diff with three or more groups so there are multiple valid merge candidates and the tiebreaking logic is actually exercised.

Consider replacing UNRELATED_DIFF here with a multi-group fixture, for example a diff with three files each adding 3 LOC (max_loc=10, min_loc=5), which would produce three undersized groups and have at least two possible merge paths to verify against.
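A hedged sketch of such a multi-group fixture follows. The file names and contents are invented, and whether parse_diff splits this into exactly three groups depends on the parser, so treat it as a starting point rather than a drop-in replacement.

```python
# Hypothetical fixture: three unrelated new files, each adding 3 LOC, giving
# three undersized groups under min_loc=5 with more than one legal merge order
# under max_loc=10.
THREE_GROUP_DIFF = "".join(
    f"""diff --git a/mod_{name}.py b/mod_{name}.py
new file mode 100644
--- /dev/null
+++ b/mod_{name}.py
@@ -0,0 +1,3 @@
+def {name}_one(): ...
+def {name}_two(): ...
+def {name}_three(): ...
"""
    for name in ("alpha", "beta", "gamma")
)
```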


Comment on tests/test_partitioning_extensive.py, lines +121 to +133
def test_min_loc_merges_undersized_groups_when_possible(
    self, monkeypatch: pytest.MonkeyPatch
) -> None:
    settings = _settings(
        monkeypatch,
        max_loc=10,
        min_loc=5,
        partition_strategy=PartitionStrategy.GRAPH,
        priority=Priority.ORTHOGONAL,
    )
    groups = partition_diff(parse_diff(UNRELATED_DIFF), settings)
    assert len(groups) == 1
    _assert_valid_plan(groups, UNRELATED_DIFF, 10, min_loc=5)

P2 No test coverage for permanently-blocked repair pass

There is no test for the case where every possible merge for an undersized group is blocked — either because every pair would exceed max_loc, or because contiguity constraints rule out all candidates. In that scenario _repair_graph_min_loc returns groups that still violate min_loc, which is valid and expected behaviour, but it is currently untested.

A suggested fixture: two files each with 6 LOC (max_loc=10, min_loc=7). Each file forms a single group of 6 LOC (below min_loc=7), but merging them would yield 12 LOC which exceeds max_loc=10. The repair pass must leave both groups as-is. Adding this as a test (and asserting len(groups) == 2) would confirm the graceful degradation path.
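The arithmetic behind this fixture can be checked directly. This is a standalone sanity sketch, not the real test, which would go through _settings, parse_diff, and partition_diff.

```python
# Two 6-LOC groups under min_loc=7, max_loc=10: both are undersized, yet the
# only possible merge (6 + 6 = 12) overflows max_loc, so a correct repair pass
# must leave both groups as-is.
loads = [6, 6]
min_loc, max_loc = 7, 10

undersized = [load for load in loads if load < min_loc]
blocked = loads[0] + loads[1] > max_loc

assert undersized == [6, 6]  # every group violates min_loc
assert blocked               # the single merge candidate exceeds max_loc
```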


Comment on pr_split/planner/partitioning.py, lines +289 to +301
while True:
    undersized_group_indices = sorted(
        (
            group_idx
            for group_idx, group_units in enumerate(repaired_groups)
            if _group_load(group_units) < settings.min_loc
        ),
        key=lambda group_idx: (
            _group_load(repaired_groups[group_idx]),
            _group_anchor_position(repaired_groups[group_idx]),
            group_idx,
        ),
    )

P2 Greedy source-selection order can miss globally better merge sequences

The undersized_group_indices list is sorted by (load, anchor, group_idx) — smallest-load group is always attempted first as the merge source. When there are three or more undersized groups with different affinities, this greedy priority can leave a higher-affinity merge unreachable.

For example, suppose groups A (load 2), B (load 2), C (load 3) all exist with min_loc=5, max_loc=6. A+C=5 (fits, resolves A's underflow) and B+C=5 (fits, resolves B's underflow), but A+B=4 (still undersized). The algorithm picks A (smallest load) first; if it finds C as the best target it merges A+C, leaving B alone and unable to merge (B+merged = 2+5 = 7 > max_loc). Had B+C been merged first, A would be equally stuck (2+5 = 7, over max), so here the outcome is the same either way; but in more complex scenarios the greedy source order can leave more residual undersized groups than an alternative ordering would.

This is a known limitation of greedy repair and may be acceptable for the current goals, but it is worth documenting with a comment above the sort so future maintainers understand why this ordering was chosen.
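The numbers in the reviewer's example can be checked in a standalone sketch of the arithmetic (toy loads only, not the real partitioner):

```python
# Reviewer's example: A (2), B (2), C (3) with min_loc=5, max_loc=6. Greedy
# smallest-first merges A+C; B is then stuck either way.
loads = {"A": 2, "B": 2, "C": 3}
min_loc, max_loc = 5, 6

assert loads["A"] + loads["C"] == 5            # fits max_loc, resolves A
assert loads["B"] + loads["C"] == 5            # fits max_loc, resolves B
assert loads["A"] + loads["B"] == 4 < min_loc  # A+B stays undersized
# After A+C merges (load 5), B cannot join the merged group:
assert loads["B"] + (loads["A"] + loads["C"]) == 7 > max_loc
```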


@vitali87 vitali87 merged commit df2302b into main Mar 27, 2026
4 checks passed