@bluetech (Member) commented Sep 7, 2025

Consider a pytest invocation like `pytest tests/ tests/test_it.py`. What should happen?

Currently what happens is that only `tests/test_it.py` is run, which is obviously wrong. This regressed in the big package collection rework (PR #11646).

The reason it regressed is the way pytest collection works. See #12083 for (some) details.

I have made an attempt to fix the problem directly in the collection loop, but failed. The main challenge is the node caching, i.e. when should a collector node be reused when it is needed for several collection arguments. I believe it is possible to make it work, but it's hard.

In order not to leave this embarrassing bug lingering any longer, this instead takes an easier approach, which is to massage the collection argument list itself such that issues with overlapping nodes don't come up during collection at all. This *adds* complexity instead of simplifying things, but I hope it should be good enough in practice for now, and maybe we can revisit it in the future.

This change introduces behavioral changes, mainly:

  • `pytest a/b a/` is equivalent to `pytest a`; if there is an `a/a` then `a/b` will *not* be ordered before `a/a`. So the ability to order a subset before a superset is lost.

  • `pytest x.py x.py` does *not* run the file twice; previously we took an explicit request like this to mean that it should.

The `--keep-duplicates` option remains as a sort of "expert mode" that retains its current behavior; though it is still subtly broken in that *collector nodes* are also duplicated (not just the items). A fix for that requires the harder change.

Fix #12083.
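To illustrate the normalization described above, here is a minimal hypothetical sketch (not the PR's code), modeling each collection argument as a tuple of path/node parts; the real implementation works on `CollectionArgument` objects:

```python
# Hypothetical sketch: argument A subsumes argument B when A's parts are
# a strict prefix of B's parts, e.g. ("a",) subsumes ("a", "b").
def normalize(args: list[tuple[str, ...]]) -> list[tuple[str, ...]]:
    """Drop subsumed and duplicate arguments, preserving first-seen order.

    Simple O(n^2) version: compare every argument against every other.
    """
    result = []
    for i, arg in enumerate(args):
        keep = True
        for j, other in enumerate(args):
            if j == i:
                continue
            if len(other) < len(arg) and arg[: len(other)] == other:
                keep = False  # a strictly shorter prefix exists, e.g. a/ vs a/b
                break
            if other == arg and j < i:
                keep = False  # exact duplicate; only the earliest survives
                break
        if keep:
            result.append(arg)
    return result


# `pytest a/b a/` collapses to `pytest a/`; `pytest x.py x.py` runs x.py once.
assert normalize([("a", "b"), ("a",)]) == [("a",)]
assert normalize([("x.py",), ("x.py",)]) == [("x.py",)]
```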

@bluetech added the topic: collection and type: backward compatibility labels Sep 7, 2025
@psf-chronographer bot added the bot:chronographer:provided label Sep 7, 2025
A review thread is attached to this snippet (apparently the tail of the docstring of the new `normalize_collection_arguments` function):

```python
    keeps the shorter prefix, or the earliest argument if duplicate, preserving
    order. The result is prefix-free.
    """
    # TODO: Fix quadratic runtime.
```
@bluetech (Member, Author) commented:

Needs to be fixed before merging, but I wanted to post the simple, understandable version before complicating it with a trie or whatever.

@bluetech (Member, Author) commented Sep 7, 2025

This is a breaking change, but I think we're about due for a pytest 9, so seems OK.

@bluetech changed the title from "WIP: main: add explicit handling of overlapping collection arguments" to "main: add explicit handling of overlapping collection arguments" Sep 7, 2025
@nicoddemus (Member) left a comment:

Thanks for tackling this @bluetech!

Everything looks good to me, especially the comprehensive tests.

The quadratic runtime is important, as you mention; I just wanted to highlight a use case that is common for us at work:

When we get a CI failure, we output a GH actions summary showing a command line that can be copy/pasted to reproduce the failed tests directly.

For example:

`pytest test1.py::test_foo test1.py::TestBar::test_bar ...`

This list can be quite long, because it contains all the node ids that failed in the job, so it is important that this use case does not cause the collection time to balloon out of control.

@bluetech force-pushed the overlapping-collection-args branch from 9554f1d to 3fef8d6 on September 8, 2025 at 21:23
@bluetech (Member, Author) commented Sep 8, 2025

I tried a bit to get it non-quadratic; I got something, but it's quite a bit more complex and only gets faster at very high counts due to the higher fixed costs.

Before this I optimized the simple quadratic solution a bit (removed an unneeded `is_dir` check, which did a `stat`, and optimized the relative-to check). I had the AI dummy write me a benchmark, which seems fine to me.

Benchmark:

```python
#!/usr/bin/env python
"""Benchmark for normalize_collection_arguments function to verify sub-quadratic performance."""

from __future__ import annotations

from pathlib import Path
import random
import time

from _pytest.main import CollectionArgument
from _pytest.main import normalize_collection_arguments


def generate_test_data(n: int) -> list[CollectionArgument]:
    """Generate test data with various overlapping patterns."""
    args = []

    # Create a mix of different argument types
    for i in range(n):
        choice = random.randint(0, 4)

        if choice == 0:
            # Simple file paths
            path = Path(f"/test/path/file_{i % 100}.py")
            parts = ()
        elif choice == 1:
            # Paths with class specifications
            path = Path(f"/test/path/file_{i % 50}.py")
            parts = (f"TestClass{i % 10}",)
        elif choice == 2:
            # Paths with method specifications
            path = Path(f"/test/path/file_{i % 50}.py")
            parts = (f"TestClass{i % 10}", f"test_method_{i % 5}")
        elif choice == 3:
            # Directory paths
            path = Path(f"/test/path/dir_{i % 20}")
            parts = ()
        else:
            # Nested directory paths (creates parent-child relationships)
            depth = random.randint(1, 5)
            path = Path("/test/path")
            for d in range(depth):
                path = path / f"subdir_{i % 10}_{d}"
            path = path / f"file_{i}.py"
            parts = ()

        args.append(CollectionArgument(path=path, parts=parts, module_name=None))

    # Add some duplicates and overlapping arguments
    for i in range(n // 10):
        if args:
            # Add exact duplicates
            args.append(args[random.randint(0, len(args) - 1)])

            # Add parent directory that subsumes children
            if random.random() < 0.3:
                child = args[random.randint(0, len(args) - 1)]
                if child.path.parent != child.path:
                    args.append(
                        CollectionArgument(
                            path=child.path.parent, parts=(), module_name=None
                        )
                    )

    random.shuffle(args)
    return args[:n]


def benchmark_function(sizes: list[int], num_runs: int = 3) -> dict:
    """Run the benchmark for different input sizes."""
    random.seed(0)
    results = {}

    for n in sizes:
        print(f"Benchmarking n={n}...", end=" ")
        times = []

        for run in range(num_runs):
            # Generate test data
            test_data = generate_test_data(n)

            # Measure runtime
            start = time.perf_counter()
            result = normalize_collection_arguments(test_data)
            end = time.perf_counter()

            times.append(end - start)

        avg_time = sum(times) / len(times)
        results[n] = avg_time
        print(f"avg time: {avg_time:.6f}s")

    return results


def plot_ascii(results: dict, width: int = 70, height: int = 20):
    """Create an ASCII plot of the results."""
    if not results:
        return

    sizes = sorted(results.keys())
    times = [results[s] for s in sizes]

    # Normalize times for plotting
    max_time = max(times)
    min_time = min(times) if min(times) > 0 else 0.000001
    max_size = max(sizes)
    min_size = min(sizes)

    print("\n" + "=" * width)
    print("Runtime vs Input Size (n)")
    print("=" * width)

    # Create the plot grid
    grid = [[" " for _ in range(width)] for _ in range(height)]

    # Add Y-axis labels
    for i in range(height):
        y_val = max_time - (i * max_time / (height - 1))
        label = f"{y_val:.4f}s"
        for j, char in enumerate(label):
            if j < width:
                grid[i][j] = char

    # Plot the data points
    label_offset = 10  # Space for Y-axis labels
    plot_width = width - label_offset - 2

    for size, elapsed in results.items():
        if max_size > min_size:
            x = label_offset + int(
                (size - min_size) / (max_size - min_size) * plot_width
            )
        else:
            x = label_offset + plot_width // 2

        if max_time > 0:
            y = int((1 - elapsed / max_time) * (height - 1))
        else:
            y = height // 2

        if 0 <= x < width and 0 <= y < height:
            grid[y][x] = "*"

    # Add axes
    for i in range(height):
        if label_offset < width:
            grid[i][label_offset - 1] = "|"

    for j in range(label_offset, width):
        grid[height - 1][j] = "-"

    # Print the grid
    for row in grid:
        print("".join(row))

    # Add X-axis labels
    print(" " * label_offset + "└" + "─" * (width - label_offset - 1))

    # Print size labels
    x_labels = []
    for i, size in enumerate(sizes):
        if i % max(1, len(sizes) // 5) == 0 or i == len(sizes) - 1:
            x_labels.append((size, sizes.index(size)))

    label_line = " " * (label_offset + 1)
    for size, idx in x_labels:
        if max_size > min_size:
            x_pos = int((size - min_size) / (max_size - min_size) * plot_width)
        else:
            x_pos = plot_width // 2
        label = str(size)
        if x_pos + len(label) < plot_width:
            padding = x_pos - len(label_line) + label_offset + 1
            if padding > 0:
                label_line += " " * padding + label

    print(label_line)
    print(" " * (width // 2 - 10) + "Input size (n)")

    # Analyze complexity
    print("\n" + "=" * width)
    print("Complexity Analysis:")
    print("=" * width)

    if len(sizes) >= 3:
        # Calculate growth rate between different size pairs
        ratios = []
        for i in range(1, len(sizes)):
            if results[sizes[i - 1]] > 0:
                size_ratio = sizes[i] / sizes[i - 1]
                time_ratio = results[sizes[i]] / results[sizes[i - 1]]
                ratios.append((size_ratio, time_ratio))

        # Estimate complexity
        avg_size_ratio = sum(r[0] for r in ratios) / len(ratios)
        avg_time_ratio = sum(r[1] for r in ratios) / len(ratios)

        # For O(n), time ratio ≈ size ratio
        # For O(n log n), time ratio ≈ size ratio * log(size ratio)
        # For O(n²), time ratio ≈ size ratio²

        import math

        expected_linear = avg_size_ratio
        expected_nlogn = avg_size_ratio * (1 + math.log2(avg_size_ratio))
        expected_quadratic = avg_size_ratio * avg_size_ratio

        # Find closest match
        diff_linear = abs(avg_time_ratio - expected_linear)
        diff_nlogn = abs(avg_time_ratio - expected_nlogn)
        diff_quadratic = abs(avg_time_ratio - expected_quadratic)

        min_diff = min(diff_linear, diff_nlogn, diff_quadratic)

        if min_diff == diff_linear:
            complexity = "O(n) - Linear"
        elif min_diff == diff_nlogn:
            complexity = "O(n log n) - Linearithmic"
        else:
            complexity = "O(n²) - Quadratic"

        print(f"Estimated complexity: {complexity}")
        print(f"Average size increase: {avg_size_ratio:.2f}x")
        print(f"Average time increase: {avg_time_ratio:.2f}x")
        print(f"Expected for O(n): {expected_linear:.2f}x")
        print(f"Expected for O(n log n): {expected_nlogn:.2f}x")
        print(f"Expected for O(n²): {expected_quadratic:.2f}x")

        if avg_time_ratio >= expected_quadratic * 0.8:
            print("\nWARNING: Appears to be quadratic or worse")
        else:
            print("\nSUCCESS: Seems sub-quadratic")


def main():
    """Run the benchmark."""
    print("Benchmarking normalize_collection_arguments")
    print("-" * 50)

    sizes = [25, 50, 100, 200, 400, 800, 1_600, 3_200, 6_400, 12_800, 25_600, 51_200, 102_400]

    print(f"Testing with sizes: {sizes}")
    print("Each size will be run 3 times and averaged")
    print("-" * 50)

    results = benchmark_function(sizes, num_runs=3)

    print("\nResults Summary:")
    print("-" * 50)
    for size, elapsed in sorted(results.items()):
        print(f"n={size:6d}: {elapsed:.6f}s")

    plot_ascii(results)


if __name__ == "__main__":
    main()
```

These are my results (AMD HX 370, Python 3.13, Linux):

```text
Benchmarking normalize_collection_arguments
--------------------------------------------------
Testing with sizes: [25, 50, 100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600, 51200, 102400]
Each size will be run 3 times and averaged
--------------------------------------------------

Results Summary:
--------------------------------------------------
n=    25: 0.000202s
n=    50: 0.000551s
n=   100: 0.001038s
n=   200: 0.002315s
n=   400: 0.005072s
n=   800: 0.011990s
n=  1600: 0.028096s
n=  3200: 0.054844s
n=  6400: 0.146995s
n= 12800: 0.301674s
n= 25600: 0.611664s
n= 51200: 1.646973s
n=102400: 4.372818s

======================================================================
Runtime vs Input Size (n)
======================================================================
4.3728s  |                                                          * 
4.1427s  |                                                            
3.9125s  |                                                            
3.6824s  |                                                            
3.4522s  |                                                            
3.2221s  |                                                            
2.9919s  |                                                            
2.7618s  |                                                            
2.5316s  |                                                            
2.3015s  |                                                            
2.0713s  |                                                            
1.8412s  |                            *                               
1.6110s  |                                                            
1.3809s  |                                                            
1.1507s  |                                                            
0.9206s  |                                                            
0.6904s  |              *                                             
0.4603s  |       *                                                    
0.2301s  |** *                                                        
0.0000s  |------------------------------------------------------------
          └───────────────────────────────────────────────────────────
              6400       25600
                         Input size (n)

======================================================================
Complexity Analysis:
======================================================================
Estimated complexity: O(n) - Linear
Average size increase: 2.00x
Average time increase: 2.32x
Expected for O(n): 2.00x
Expected for O(n log n): 4.00x
Expected for O(n²): 4.00x

SUCCESS: Seems sub-quadratic
```

Not so good -- about 4.5s for 100k args, and getting worse beyond that -- but maybe OK? Possibly other parts of the pipeline break at this scale anyway? We do have the `--keep-duplicates` workaround, which avoids the normalization entirely, provided the given input is well formed.

> This list can be quite long, because it contains all the node ids that failed in the job, so it is important that this use case does not cause the collection time to balloon out of control.

Do you have an estimate of how many arguments it can get to, at most?

@nicoddemus (Member) replied:

> Do you have an estimate of how many arguments it can get to, at most?

Oh, not that many, usually fewer than 100. I just mentioned it because I'm not sure how common that use case is.

@bluetech force-pushed the overlapping-collection-args branch from 3fef8d6 to 67fca28 on September 9, 2025 at 18:09
@bluetech marked this pull request as a draft on September 9, 2025 at 18:10
@bluetech (Member, Author) commented Sep 9, 2025

I realize the benchmark is no good -- it generates way too many overlaps. A benchmark on the common case of no duplicates or overlaps shows the quadratic behavior, and it becomes slow a lot sooner. So this needs another approach.
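For instance, a common-case generator along these lines (a hypothetical sketch reusing the `CollectionArgument` construction from the benchmark above) would exercise the no-overlap path:

```python
from pathlib import Path

from _pytest.main import CollectionArgument


def generate_unique(n: int) -> list[CollectionArgument]:
    # n distinct files in n distinct directories: no duplicates, and no
    # argument is a path-prefix of another, so nothing is normalized away.
    return [
        CollectionArgument(
            path=Path(f"/test/dir_{i}/file_{i}.py"), parts=(), module_name=None
        )
        for i in range(n)
    ]
```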

Commit message of the preparatory commit:

This is for the benefit of the next commit. That commit wants to check
whether a `CollectionArgument` is subsumed by another. According to pytest
semantics, `test_it.py::TestIt::test_it[a]` is subsumed by
`test_it.py::TestIt::test_it`. However, the `parts` are
`["TestIt", "test_it[a]"]` and `["TestIt", "test_it"]` respectively,
which means a simple list-prefix check cannot be used. By splitting the
parametrization `"[a]"` part into its own attribute, it can be handled
cleanly.

I also think this is a reasonable change regardless. We'd probably want
something like this when the "collection structure contains
parametrization" TODO is tackled.
@bluetech force-pushed the overlapping-collection-args branch from 67fca28 to 6764439 on September 11, 2025 at 19:46
@bluetech marked this pull request as ready for review on September 11, 2025 at 19:47
@bluetech (Member, Author) commented:

I updated the PR:

  1. Added a preparatory commit which moves the parametrization `[...]` part to a separate field in `CollectionArgument`. See the commit message for details.
  2. Changed to an O(n log n) algorithm. It now scales to 100K arguments (~1s for me), which is the reasonable limit I put on it. Performance is dominated by the sort, which in turn is dominated by `pathlib.Path.__lt__`.
  3. Cleaned up some `keepduplicates` checks. Now the flag solely affects whether the normalization is performed, which I think is more coherent.

Unless there is something I missed, I think it's ready.
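For reference, a minimal sketch of how a sort-based O(n log n) normalization can work (hypothetical, on the same tuple-of-parts model as the sketch near the top; the actual code sorts `CollectionArgument`s, hence the `pathlib.Path.__lt__` cost):

```python
def normalize_sorted(args: list[tuple[str, ...]]) -> list[tuple[str, ...]]:
    # After sorting, every argument sorts after any prefix of it, so
    # comparing against the most recently kept argument suffices.
    order = sorted(range(len(args)), key=lambda i: args[i])  # the dominant cost
    keep = [True] * len(args)
    prev: tuple[str, ...] | None = None
    for i in order:
        arg = args[i]
        if prev is not None and arg[: len(prev)] == prev:
            keep[i] = False  # duplicate of, or subsumed by, `prev`
        else:
            prev = arg
    # Emit the survivors in their original order.
    return [arg for i, arg in enumerate(args) if keep[i]]


assert normalize_sorted([("a", "b"), ("a",), ("a", "b")]) == [("a",)]
```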

Linked issue: Issue with duplicates handling in Pytest 8 (#12083).