@bluetech (Member) commented Sep 7, 2025

Consider a pytest invocation like `pytest tests/ tests/test_it.py`. What should happen?

Currently what happens is that only `tests/test_it.py` is run, which is obviously wrong. This regressed in the big package collection rework (PR #11646).

The reason it regressed is the way pytest collection works. See #12083 for (some) details.

I have made an attempt to fix the problem directly in the collection loop, but failed. The main challenge is the node caching, i.e. when should a collector node be reused when it is needed for several collection arguments. I believe it is possible to make it work, but it's hard.

In order not to leave this embarrassing bug lingering any longer, this instead takes an easier approach, which is to massage the collection argument list itself such that issues with overlapping nodes don't come up during collection at all. This *adds* complexity instead of simplifying things, but I hope it should be good enough in practice for now, and maybe we can revisit it in the future.

This change introduces behavioral changes, mainly:

  • `pytest a/b a/` is equivalent to `pytest a`; if there is an `a/a` then `a/b` will *not* be ordered before `a/a`. So the ability to order a subset before a superset is lost.

  • `pytest x.py x.py` does *not* run the file twice; previously we took an explicit request like this to mean that it should.

The `--keep-duplicates` option remains as a sort of "expert mode" that retains its current behavior; though it is still subtly broken in that *collector nodes* are also duplicated (not just the items). A fix for that requires the harder change.

Fix #12083.
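To illustrate the normalization described above, here is a minimal hypothetical sketch (not the PR's code), modeling each collection argument as a tuple of path/node parts; the real implementation works on `CollectionArgument` objects:

```python
# Hypothetical sketch: argument A subsumes argument B when A's parts are
# a strict prefix of B's parts, e.g. ("a",) subsumes ("a", "b").
def normalize(args: list[tuple[str, ...]]) -> list[tuple[str, ...]]:
    """Drop subsumed and duplicate arguments, preserving first-seen order.

    Simple O(n^2) version: compare every argument against every other.
    """
    result = []
    for i, arg in enumerate(args):
        keep = True
        for j, other in enumerate(args):
            if j == i:
                continue
            if len(other) < len(arg) and arg[: len(other)] == other:
                keep = False  # a strictly shorter prefix exists, e.g. a/ vs a/b
                break
            if other == arg and j < i:
                keep = False  # exact duplicate; only the earliest survives
                break
        if keep:
            result.append(arg)
    return result


# `pytest a/b a/` collapses to `pytest a/`; `pytest x.py x.py` runs x.py once.
assert normalize([("a", "b"), ("a",)]) == [("a",)]
assert normalize([("x.py",), ("x.py",)]) == [("x.py",)]
```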

@bluetech added the topic: collection and type: backward compatibility labels Sep 7, 2025
@psf-chronographer bot added the bot:chronographer:provided label Sep 7, 2025
A review thread is attached to this snippet (apparently the tail of the docstring of the new `normalize_collection_arguments` function):

```python
    keeps the shorter prefix, or the earliest argument if duplicate, preserving
    order. The result is prefix-free.
    """
    # TODO: Fix quadratic runtime.
```
@bluetech (Member, Author) commented:

Needs to be fixed before merging, but I wanted to post the simple, understandable version before complicating it with a trie or whatever.

@bluetech (Member, Author) commented Sep 7, 2025

This is a breaking change, but I think we're about due for a pytest 9, so seems OK.

@bluetech changed the title from "WIP: main: add explicit handling of overlapping collection arguments" to "main: add explicit handling of overlapping collection arguments" Sep 7, 2025
@nicoddemus (Member) left a comment:

Thanks for tackling this @bluetech!

Everything looks good to me, especially the comprehensive tests.

The quadratic runtime is important, as you mention; I just wanted to highlight a use case that is common for us at work:

When we get a CI failure, we output a GH actions summary showing a command line that can be copy/pasted to reproduce the failed tests directly.

For example:

`pytest test1.py::test_foo test1.py::TestBar::test_bar ...`

This list can be quite long, because it contains all the node ids that failed in the job, so it is important that this use case does not cause the collection time to balloon out of control.

@bluetech force-pushed the overlapping-collection-args branch from 9554f1d to 3fef8d6 on September 8, 2025 at 21:23
@bluetech (Member, Author) commented Sep 8, 2025

I tried a bit to get it non-quadratic; I got something, but it's quite a bit more complex and only gets faster at very high counts due to the higher fixed costs.

Before this I optimized the simple quadratic solution a bit (removed an unneeded `is_dir` check, which did a `stat`, and optimized the relative-to check). I had the AI dummy write me a benchmark, which seems fine to me.

Benchmark:

```python
#!/usr/bin/env python
"""Benchmark for normalize_collection_arguments function to verify sub-quadratic performance."""

from __future__ import annotations

from pathlib import Path
import random
import time

from _pytest.main import CollectionArgument
from _pytest.main import normalize_collection_arguments


def generate_test_data(n: int) -> list[CollectionArgument]:
    """Generate test data with various overlapping patterns."""
    args = []

    # Create a mix of different argument types
    for i in range(n):
        choice = random.randint(0, 4)

        if choice == 0:
            # Simple file paths
            path = Path(f"/test/path/file_{i % 100}.py")
            parts = ()
        elif choice == 1:
            # Paths with class specifications
            path = Path(f"/test/path/file_{i % 50}.py")
            parts = (f"TestClass{i % 10}",)
        elif choice == 2:
            # Paths with method specifications
            path = Path(f"/test/path/file_{i % 50}.py")
            parts = (f"TestClass{i % 10}", f"test_method_{i % 5}")
        elif choice == 3:
            # Directory paths
            path = Path(f"/test/path/dir_{i % 20}")
            parts = ()
        else:
            # Nested directory paths (creates parent-child relationships)
            depth = random.randint(1, 5)
            path = Path("/test/path")
            for d in range(depth):
                path = path / f"subdir_{i % 10}_{d}"
            path = path / f"file_{i}.py"
            parts = ()

        args.append(CollectionArgument(path=path, parts=parts, module_name=None))

    # Add some duplicates and overlapping arguments
    for i in range(n // 10):
        if args:
            # Add exact duplicates
            args.append(args[random.randint(0, len(args) - 1)])

            # Add parent directory that subsumes children
            if random.random() < 0.3:
                child = args[random.randint(0, len(args) - 1)]
                if child.path.parent != child.path:
                    args.append(
                        CollectionArgument(
                            path=child.path.parent, parts=(), module_name=None
                        )
                    )

    random.shuffle(args)
    return args[:n]


def benchmark_function(sizes: list[int], num_runs: int = 3) -> dict:
    """Run the benchmark for different input sizes."""
    random.seed(0)
    results = {}

    for n in sizes:
        print(f"Benchmarking n={n}...", end=" ")
        times = []

        for run in range(num_runs):
            # Generate test data
            test_data = generate_test_data(n)

            # Measure runtime
            start = time.perf_counter()
            result = normalize_collection_arguments(test_data)
            end = time.perf_counter()

            times.append(end - start)

        avg_time = sum(times) / len(times)
        results[n] = avg_time
        print(f"avg time: {avg_time:.6f}s")

    return results


def plot_ascii(results: dict, width: int = 70, height: int = 20):
    """Create an ASCII plot of the results."""
    if not results:
        return

    sizes = sorted(results.keys())
    times = [results[s] for s in sizes]

    # Normalize times for plotting
    max_time = max(times)
    min_time = min(times) if min(times) > 0 else 0.000001
    max_size = max(sizes)
    min_size = min(sizes)

    print("\n" + "=" * width)
    print("Runtime vs Input Size (n)")
    print("=" * width)

    # Create the plot grid
    grid = [[" " for _ in range(width)] for _ in range(height)]

    # Add Y-axis labels
    for i in range(height):
        y_val = max_time - (i * max_time / (height - 1))
        label = f"{y_val:.4f}s"
        for j, char in enumerate(label):
            if j < width:
                grid[i][j] = char

    # Plot the data points
    label_offset = 10  # Space for Y-axis labels
    plot_width = width - label_offset - 2

    for size, elapsed in results.items():
        if max_size > min_size:
            x = label_offset + int(
                (size - min_size) / (max_size - min_size) * plot_width
            )
        else:
            x = label_offset + plot_width // 2

        if max_time > 0:
            y = int((1 - elapsed / max_time) * (height - 1))
        else:
            y = height // 2

        if 0 <= x < width and 0 <= y < height:
            grid[y][x] = "*"

    # Add axes
    for i in range(height):
        if label_offset < width:
            grid[i][label_offset - 1] = "|"

    for j in range(label_offset, width):
        grid[height - 1][j] = "-"

    # Print the grid
    for row in grid:
        print("".join(row))

    # Add X-axis labels
    print(" " * label_offset + "└" + "─" * (width - label_offset - 1))

    # Print size labels
    x_labels = []
    for i, size in enumerate(sizes):
        if i % max(1, len(sizes) // 5) == 0 or i == len(sizes) - 1:
            x_labels.append((size, sizes.index(size)))

    label_line = " " * (label_offset + 1)
    for size, idx in x_labels:
        if max_size > min_size:
            x_pos = int((size - min_size) / (max_size - min_size) * plot_width)
        else:
            x_pos = plot_width // 2
        label = str(size)
        if x_pos + len(label) < plot_width:
            padding = x_pos - len(label_line) + label_offset + 1
            if padding > 0:
                label_line += " " * padding + label

    print(label_line)
    print(" " * (width // 2 - 10) + "Input size (n)")

    # Analyze complexity
    print("\n" + "=" * width)
    print("Complexity Analysis:")
    print("=" * width)

    if len(sizes) >= 3:
        # Calculate growth rate between different size pairs
        ratios = []
        for i in range(1, len(sizes)):
            if results[sizes[i - 1]] > 0:
                size_ratio = sizes[i] / sizes[i - 1]
                time_ratio = results[sizes[i]] / results[sizes[i - 1]]
                ratios.append((size_ratio, time_ratio))

        # Estimate complexity
        avg_size_ratio = sum(r[0] for r in ratios) / len(ratios)
        avg_time_ratio = sum(r[1] for r in ratios) / len(ratios)

        # For O(n), time ratio ≈ size ratio
        # For O(n log n), time ratio ≈ size ratio * log(size ratio)
        # For O(n²), time ratio ≈ size ratio²

        import math

        expected_linear = avg_size_ratio
        expected_nlogn = avg_size_ratio * (1 + math.log2(avg_size_ratio))
        expected_quadratic = avg_size_ratio * avg_size_ratio

        # Find closest match
        diff_linear = abs(avg_time_ratio - expected_linear)
        diff_nlogn = abs(avg_time_ratio - expected_nlogn)
        diff_quadratic = abs(avg_time_ratio - expected_quadratic)

        min_diff = min(diff_linear, diff_nlogn, diff_quadratic)

        if min_diff == diff_linear:
            complexity = "O(n) - Linear"
        elif min_diff == diff_nlogn:
            complexity = "O(n log n) - Linearithmic"
        else:
            complexity = "O(n²) - Quadratic"

        print(f"Estimated complexity: {complexity}")
        print(f"Average size increase: {avg_size_ratio:.2f}x")
        print(f"Average time increase: {avg_time_ratio:.2f}x")
        print(f"Expected for O(n): {expected_linear:.2f}x")
        print(f"Expected for O(n log n): {expected_nlogn:.2f}x")
        print(f"Expected for O(n²): {expected_quadratic:.2f}x")

        if avg_time_ratio >= expected_quadratic * 0.8:
            print("\nWARNING: Appears to be quadratic or worse")
        else:
            print("\nSUCCESS: Seems sub-quadratic")


def main():
    """Run the benchmark."""
    print("Benchmarking normalize_collection_arguments")
    print("-" * 50)

    sizes = [25, 50, 100, 200, 400, 800, 1_600, 3_200, 6_400, 12_800, 25_600, 51_200, 102_400]

    print(f"Testing with sizes: {sizes}")
    print("Each size will be run 3 times and averaged")
    print("-" * 50)

    results = benchmark_function(sizes, num_runs=3)

    print("\nResults Summary:")
    print("-" * 50)
    for size, elapsed in sorted(results.items()):
        print(f"n={size:6d}: {elapsed:.6f}s")

    plot_ascii(results)


if __name__ == "__main__":
    main()
```

These are my results (AMD HX 370, Python 3.13, Linux):

```text
Benchmarking normalize_collection_arguments
--------------------------------------------------
Testing with sizes: [25, 50, 100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600, 51200, 102400]
Each size will be run 3 times and averaged
--------------------------------------------------

Results Summary:
--------------------------------------------------
n=    25: 0.000202s
n=    50: 0.000551s
n=   100: 0.001038s
n=   200: 0.002315s
n=   400: 0.005072s
n=   800: 0.011990s
n=  1600: 0.028096s
n=  3200: 0.054844s
n=  6400: 0.146995s
n= 12800: 0.301674s
n= 25600: 0.611664s
n= 51200: 1.646973s
n=102400: 4.372818s

======================================================================
Runtime vs Input Size (n)
======================================================================
4.3728s  |                                                          * 
4.1427s  |                                                            
3.9125s  |                                                            
3.6824s  |                                                            
3.4522s  |                                                            
3.2221s  |                                                            
2.9919s  |                                                            
2.7618s  |                                                            
2.5316s  |                                                            
2.3015s  |                                                            
2.0713s  |                                                            
1.8412s  |                            *                               
1.6110s  |                                                            
1.3809s  |                                                            
1.1507s  |                                                            
0.9206s  |                                                            
0.6904s  |              *                                             
0.4603s  |       *                                                    
0.2301s  |** *                                                        
0.0000s  |------------------------------------------------------------
          └───────────────────────────────────────────────────────────
              6400       25600
                         Input size (n)

======================================================================
Complexity Analysis:
======================================================================
Estimated complexity: O(n) - Linear
Average size increase: 2.00x
Average time increase: 2.32x
Expected for O(n): 2.00x
Expected for O(n log n): 4.00x
Expected for O(n²): 4.00x

SUCCESS: Seems sub-quadratic
```

Not so good -- about 4.5s for 100k args, and getting worse beyond that -- but maybe OK? Possibly other parts of the pipeline break at this scale anyway? We do have the `--keep-duplicates` workaround, which avoids the normalization entirely, provided the given input is well formed.

> This list can be quite long, because it contains all the node ids that failed in the job, so it is important that this use case does not cause the collection time to balloon out of control.

Do you have an estimate of how many arguments it can get to, at most?

@nicoddemus (Member) replied:

> Do you have an estimate of how many arguments it can get to, at most?

Oh, not that many, usually fewer than 100. I just mentioned it because I'm not sure how common that use case is.

@bluetech force-pushed the overlapping-collection-args branch from 3fef8d6 to 67fca28 on September 9, 2025 at 18:09
@bluetech marked this pull request as a draft on September 9, 2025 at 18:10
@bluetech (Member, Author) commented Sep 9, 2025

I realize the benchmark is no good -- it generates way too many overlaps. A benchmark on the common case of no duplicates or overlaps shows the quadratic behavior, and it becomes slow a lot sooner. So this needs another approach.
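For instance, a common-case generator along these lines (a hypothetical sketch reusing the `CollectionArgument` construction from the benchmark above) would exercise the no-overlap path:

```python
from pathlib import Path

from _pytest.main import CollectionArgument


def generate_unique(n: int) -> list[CollectionArgument]:
    # n distinct files in n distinct directories: no duplicates, and no
    # argument is a path-prefix of another, so nothing is normalized away.
    return [
        CollectionArgument(
            path=Path(f"/test/dir_{i}/file_{i}.py"), parts=(), module_name=None
        )
        for i in range(n)
    ]
```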

Commit message of the preparatory commit:

This is for the benefit of the next commit. That commit wants to check
whether a `CollectionArgument` is subsumed by another. According to pytest
semantics, `test_it.py::TestIt::test_it[a]` is subsumed by
`test_it.py::TestIt::test_it`. However, the `parts` are
`["TestIt", "test_it[a]"]` and `["TestIt", "test_it"]` respectively,
which means a simple list-prefix check cannot be used. By splitting the
parametrization `"[a]"` part into its own attribute, it can be handled
cleanly.

I also think this is a reasonable change regardless. We'd probably want
something like this when the "collection structure contains
parametrization" TODO is tackled.
@bluetech force-pushed the overlapping-collection-args branch from 67fca28 to 6764439 on September 11, 2025 at 19:46
@bluetech marked this pull request as ready for review on September 11, 2025 at 19:47
@bluetech (Member, Author) commented:

I updated the PR:

  1. Added a preparatory commit which moves the parametrization `[...]` part to a separate field in `CollectionArgument`. See the commit message for details.
  2. Changed to an O(n log n) algorithm. It now scales to 100K arguments (~1s for me), which is the reasonable limit I put on it. Performance is dominated by the sort, which in turn is dominated by `pathlib.Path.__lt__`.
  3. Cleaned up some `keepduplicates` checks. Now the flag solely affects whether the normalization is performed, which I think is more coherent.

Unless there is something I missed, I think it's ready.
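For reference, a minimal sketch of how a sort-based O(n log n) normalization can work (hypothetical, on the same tuple-of-parts model as the sketch near the top; the actual code sorts `CollectionArgument`s, hence the `pathlib.Path.__lt__` cost):

```python
def normalize_sorted(args: list[tuple[str, ...]]) -> list[tuple[str, ...]]:
    # After sorting, every argument sorts after any prefix of it, so
    # comparing against the most recently kept argument suffices.
    order = sorted(range(len(args)), key=lambda i: args[i])  # the dominant cost
    keep = [True] * len(args)
    prev: tuple[str, ...] | None = None
    for i in order:
        arg = args[i]
        if prev is not None and arg[: len(prev)] == prev:
            keep[i] = False  # duplicate of, or subsumed by, `prev`
        else:
            prev = arg
    # Emit the survivors in their original order.
    return [arg for i, arg in enumerate(args) if keep[i]]


assert normalize_sorted([("a", "b"), ("a",), ("a", "b")]) == [("a",)]
```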

Linked issue: Issue with duplicates handling in Pytest 8 (#12083).