@codeflash-ai codeflash-ai bot commented Nov 13, 2025

📄 5% (0.05x) speedup for AdvancedPdfLoader._format_image_element in cognee/infrastructure/loaders/external/advanced_pdf_loader.py

⏱️ Runtime : 545 microseconds → 518 microseconds (best of 250 runs)

📝 Explanation and details

The optimized code achieves a 5% speedup through several targeted micro-optimizations that reduce object allocations and dictionary operations:

Key Optimizations:

  1. Eliminated unnecessary dict allocation: Changed metadata.get("coordinates", {}) to metadata.get("coordinates", None) - avoids creating an empty dictionary when coordinates are missing, which is beneficial since many test cases show missing coordinates.

  2. Walrus operator for early evaluation: Combined the dictionary lookup and assignment using (points := coordinates.get("points")) directly in the conditional chain. This eliminates the separate points = coordinates.get("points") line and reduces the number of variable assignments.

  3. Tuple unpacking optimization: Replaced individual indexing (leftup = points[0], rightdown = points[3]) with direct unpacking (leftup, _, _, rightdown = points). This is more efficient as it avoids multiple tuple index lookups.

  4. Improved f-string formatting: Streamlined the layout info concatenation by using a single f-string instead of string concatenation with +, which is more efficient for string building.
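The four changes listed above can be sketched side by side. This is a hedged reconstruction, not the code from `advanced_pdf_loader.py`: the function names `format_before` and `format_after`, the `PLACEHOLDER` constant, and the exact output format are assumptions made for illustration, loosely modeled on the `expected` strings in the generated tests.

```python
from typing import Any, Dict

PLACEHOLDER = "[Image omitted]"  # assumed placeholder text, taken from the tests below


def format_before(metadata: Dict[str, Any]) -> str:
    """Hypothetical pre-optimization shape of the function."""
    coordinates = metadata.get("coordinates", {})  # a new {} is built on every call
    if isinstance(coordinates, dict):
        points = coordinates.get("points")  # separate lookup and assignment
        if isinstance(points, tuple) and len(points) == 4:
            leftup = points[0]  # two explicit index lookups
            rightdown = points[3]
            if (
                isinstance(leftup, tuple) and isinstance(rightdown, tuple)
                and len(leftup) == 2 and len(rightdown) == 2
            ):
                text = f"bbox=({leftup[0]}, {leftup[1]}, {rightdown[0]}, {rightdown[1]})"
                width = coordinates.get("layout_width")
                height = coordinates.get("layout_height")
                system = coordinates.get("system")
                if width and height and system:
                    # piecewise concatenation with +
                    text = (text + ", system=" + str(system)
                            + ", layout_width=" + str(width)
                            + ", layout_height=" + str(height))
                return PLACEHOLDER + " (" + text + ")"
    return PLACEHOLDER


def format_after(metadata: Dict[str, Any]) -> str:
    """Hypothetical post-optimization shape applying the four changes."""
    coordinates = metadata.get("coordinates", None)  # 1. no throwaway dict
    if (
        isinstance(coordinates, dict)
        and isinstance((points := coordinates.get("points")), tuple)  # 2. walrus
        and len(points) == 4
    ):
        leftup, _, _, rightdown = points  # 3. unpacking (length already checked)
        if (
            isinstance(leftup, tuple) and isinstance(rightdown, tuple)
            and len(leftup) == 2 and len(rightdown) == 2
        ):
            bbox = f"bbox=({leftup[0]}, {leftup[1]}, {rightdown[0]}, {rightdown[1]})"
            width = coordinates.get("layout_width")
            height = coordinates.get("layout_height")
            system = coordinates.get("system")
            if width and height and system:
                # 4. one f-string instead of chained +
                return (f"{PLACEHOLDER} ({bbox}, system={system}, "
                        f"layout_width={width}, layout_height={height})")
            return f"{PLACEHOLDER} ({bbox})"
    return PLACEHOLDER
```

Note that the unpacking in step 3 is only safe because the `len(points) == 4` check runs first; without it, a tuple of any other length would raise `ValueError`.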

Performance Impact Analysis:
The test results show improvements in most scenarios, although a handful of individual cases measured 1-10% slower:

  • Best gains (roughly 10-19% faster, with one case at 36.7%) occur with edge cases like missing coordinates or invalid data structures
  • Moderate gains (5-10% faster) for normal cases with complete metadata
  • Large-scale tests show a 2-7% improvement, so the gain holds up over repeated calls

The optimizations suit this function because most of its work is dictionary lookups and conditional checks. Since a PDF loader likely calls it once per image, the per-call saving (tens to a few hundred nanoseconds in the tests below) adds up across large documents or batch processing workflows.
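A per-call difference this small can be checked with a micro-benchmark. The sketch below is only an illustration of the "best of N runs" approach using the standard `timeit` module; the helper names `lookup_with_dict_default`, `lookup_with_none_default`, and `best_ns_per_call` are hypothetical, and this is not how Codeflash itself collects its numbers.

```python
import timeit

metadata = {}  # simulates an element with no "coordinates" key


def lookup_with_dict_default():
    # The {} default is constructed before .get runs, even though it is unused
    # whenever the key exists.
    return metadata.get("coordinates", {})


def lookup_with_none_default():
    # None is a shared singleton, so no new object is allocated.
    return metadata.get("coordinates", None)


def best_ns_per_call(fn, repeat=5, number=200_000):
    """Return the best (minimum) time per call in nanoseconds over `repeat` runs."""
    best_total = min(timeit.repeat(fn, repeat=repeat, number=number))
    return best_total / number * 1e9


if __name__ == "__main__":
    print(f"dict default: {best_ns_per_call(lookup_with_dict_default):.1f} ns/call")
    print(f"None default: {best_ns_per_call(lookup_with_none_default):.1f} ns/call")
```

Taking the minimum over several runs reduces the influence of scheduler noise, which matters when the quantity being measured is a few nanoseconds. Results vary by machine and Python version.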

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 739 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
from typing import Any, Dict

# imports
import pytest
from cognee.infrastructure.loaders.external.advanced_pdf_loader import \
    AdvancedPdfLoader

# unit tests

@pytest.fixture
def loader():
    # Fixture to provide a fresh instance of AdvancedPdfLoader for each test
    return AdvancedPdfLoader()

# ----------------------
# Basic Test Cases
# ----------------------

def test_basic_no_coordinates(loader):
    # No coordinates: should return just the placeholder
    metadata = {}
    codeflash_output = loader._format_image_element(metadata) # 927ns -> 678ns (36.7% faster)

def test_basic_empty_coordinates(loader):
    # Empty coordinates dict: should return just the placeholder
    metadata = {"coordinates": {}}
    codeflash_output = loader._format_image_element(metadata) # 836ns -> 826ns (1.21% faster)

def test_basic_points_tuple_correct(loader):
    # Correct points tuple with 4 points, each a tuple of 2 numbers
    metadata = {
        "coordinates": {
            "points": ((1, 2), (3, 4), (5, 6), (7, 8))
        }
    }
    expected = "[Image omitted] (bbox=(1, 2, 7, 8))"
    codeflash_output = loader._format_image_element(metadata) # 2.12μs -> 2.07μs (2.27% faster)

def test_basic_points_tuple_with_layout(loader):
    # Points tuple plus layout_width, layout_height, system
    metadata = {
        "coordinates": {
            "points": ((10, 20), (30, 40), (50, 60), (70, 80)),
            "layout_width": 200,
            "layout_height": 100,
            "system": "XYZ"
        }
    }
    expected = "[Image omitted] (bbox=(10, 20, 70, 80)), system=XYZ, layout_width=200, layout_height=100))"
    codeflash_output = loader._format_image_element(metadata) # 2.97μs -> 2.69μs (10.6% faster)

# ----------------------
# Edge Test Cases
# ----------------------

def test_edge_points_not_tuple(loader):
    # Points is a list, not a tuple: should not format bbox
    metadata = {
        "coordinates": {
            "points": [(1, 2), (3, 4), (5, 6), (7, 8)]
        }
    }
    codeflash_output = loader._format_image_element(metadata) # 929ns -> 909ns (2.20% faster)

def test_edge_points_tuple_wrong_length(loader):
    # Points tuple has less than 4 elements: should not format bbox
    metadata = {
        "coordinates": {
            "points": ((1, 2), (3, 4), (5, 6))
        }
    }
    codeflash_output = loader._format_image_element(metadata) # 1.12μs -> 974ns (15.1% faster)

def test_edge_points_tuple_elements_not_tuples(loader):
    # Points tuple contains elements not tuples
    metadata = {
        "coordinates": {
            "points": ((1, 2), (3, 4), (5, 6), 123)
        }
    }
    codeflash_output = loader._format_image_element(metadata) # 1.54μs -> 1.42μs (8.80% faster)

def test_edge_points_tuple_elements_wrong_length(loader):
    # Points tuple contains tuples of wrong length
    metadata = {
        "coordinates": {
            "points": ((1, 2), (3, 4), (5, 6), (7, 8, 9))
        }
    }
    codeflash_output = loader._format_image_element(metadata) # 1.57μs -> 1.37μs (14.8% faster)

def test_edge_coordinates_not_dict(loader):
    # Coordinates is not a dict: should not fail, just return placeholder
    metadata = {"coordinates": "notadict"}
    codeflash_output = loader._format_image_element(metadata) # 779ns -> 667ns (16.8% faster)

def test_edge_points_is_none(loader):
    # Points is None: should return placeholder
    metadata = {"coordinates": {"points": None}}
    codeflash_output = loader._format_image_element(metadata) # 838ns -> 827ns (1.33% faster)

def test_edge_missing_some_layout_fields(loader):
    # Missing one or more layout fields: should not append layout info
    metadata = {
        "coordinates": {
            "points": ((1, 2), (3, 4), (5, 6), (7, 8)),
            "layout_width": 200,
            # layout_height missing
            "system": "XYZ"
        }
    }
    expected = "[Image omitted] (bbox=(1, 2, 7, 8))"
    codeflash_output = loader._format_image_element(metadata) # 2.38μs -> 2.37μs (0.169% faster)

def test_edge_layout_fields_falsey(loader):
    # layout_width, layout_height, system are present but falsey
    metadata = {
        "coordinates": {
            "points": ((1, 2), (3, 4), (5, 6), (7, 8)),
            "layout_width": 0,
            "layout_height": 0,
            "system": ""
        }
    }
    # Should not append layout info
    expected = "[Image omitted] (bbox=(1, 2, 7, 8))"
    codeflash_output = loader._format_image_element(metadata) # 2.26μs -> 2.19μs (3.15% faster)

def test_edge_points_tuple_with_non_numeric(loader):
    # Points tuple with non-numeric values
    metadata = {
        "coordinates": {
            "points": (("a", "b"), (3, 4), (5, 6), (7, 8))
        }
    }
    # The function does not check that the values are numeric, so it still formats the string
    expected = "[Image omitted] (bbox=(a, b, 7, 8))"
    codeflash_output = loader._format_image_element(metadata) # 2.09μs -> 2.12μs (1.32% slower)

def test_edge_extra_fields_in_coordinates(loader):
    # Extra fields in coordinates dict should not affect output
    metadata = {
        "coordinates": {
            "points": ((11, 22), (33, 44), (55, 66), (77, 88)),
            "layout_width": 123,
            "layout_height": 456,
            "system": "ABC",
            "extra_field": "should be ignored"
        }
    }
    expected = "[Image omitted] (bbox=(11, 22, 77, 88)), system=ABC, layout_width=123, layout_height=456))"
    codeflash_output = loader._format_image_element(metadata) # 2.98μs -> 2.80μs (6.47% faster)

def test_edge_coordinates_is_none(loader):
    # coordinates is None
    metadata = {"coordinates": None}
    codeflash_output = loader._format_image_element(metadata) # 805ns -> 675ns (19.3% faster)

def test_edge_metadata_is_none(loader):
    # metadata is None: not a supported input, so an AttributeError is expected
    with pytest.raises(AttributeError):
        loader._format_image_element(None) # 1.40μs -> 1.55μs (9.56% slower)

# ----------------------
# Large Scale Test Cases
# ----------------------

def test_large_scale_many_metadata(loader):
    # Test the function's performance and correctness on a large number of metadata dicts
    metadatas = []
    expected = []
    for i in range(100):
        meta = {
            "coordinates": {
                "points": ((i, i+1), (i+2, i+3), (i+4, i+5), (i+6, i+7)),
                "layout_width": i*10,
                "layout_height": i*20,
                "system": f"SYS{i}"
            }
        }
        metadatas.append(meta)
        expected.append(f"[Image omitted] (bbox=({i}, {i+1}, {i+6}, {i+7})), system=SYS{i}, layout_width={i*10}, layout_height={i*20}))")
    # Check all outputs match expected
    for i in range(100):
        codeflash_output = loader._format_image_element(metadatas[i]) # 101μs -> 94.8μs (6.96% faster)

def test_large_scale_some_missing_points(loader):
    # Mix of valid and invalid metadata in a large batch
    metadatas = []
    expected = []
    for i in range(100):
        if i % 2 == 0:
            # Valid
            meta = {
                "coordinates": {
                    "points": ((i, i+1), (i+2, i+3), (i+4, i+5), (i+6, i+7))
                }
            }
            exp = f"[Image omitted] (bbox=({i}, {i+1}, {i+6}, {i+7}))"
        else:
            # Invalid: points as list
            meta = {
                "coordinates": {
                    "points": [(i, i+1), (i+2, i+3), (i+4, i+5), (i+6, i+7)]
                }
            }
            exp = "[Image omitted]"
        metadatas.append(meta)
        expected.append(exp)
    for i in range(100):
        codeflash_output = loader._format_image_element(metadatas[i]) # 56.7μs -> 54.6μs (3.82% faster)

def test_large_scale_all_edge_cases(loader):
    # All metadata are edge cases (should all return placeholder)
    metadatas = []
    for i in range(100):
        if i % 3 == 0:
            meta = {"coordinates": None}
        elif i % 3 == 1:
            meta = {"coordinates": {"points": None}}
        else:
            meta = {"coordinates": {"points": ((1, 2), (3, 4))}}  # too short
        metadatas.append(meta)
    for meta in metadatas:
        codeflash_output = loader._format_image_element(meta) # 28.9μs -> 27.5μs (5.17% faster)

def test_large_scale_non_numeric_points(loader):
    # Large number of metadata with non-numeric points
    metadatas = []
    expected = []
    for i in range(100):
        meta = {
            "coordinates": {
                "points": ((str(i), str(i+1)), (3, 4), (5, 6), (7, 8))
            }
        }
        exp = f"[Image omitted] (bbox=({i}, {i+1}, 7, 8))"
        metadatas.append(meta)
        expected.append(exp)
    for i in range(100):
        codeflash_output = loader._format_image_element(metadatas[i]) # 71.5μs -> 69.0μs (3.73% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from typing import Any, Dict

# imports
import pytest  # used for our unit tests
from cognee.infrastructure.loaders.external.advanced_pdf_loader import \
    AdvancedPdfLoader

# unit tests

# Helper to instantiate loader
@pytest.fixture
def loader():
    return AdvancedPdfLoader()

# 1. Basic Test Cases

def test_no_coordinates(loader):
    # No coordinates key present
    metadata = {}
    codeflash_output = loader._format_image_element(metadata) # 919ns -> 778ns (18.1% faster)

def test_coordinates_not_dict(loader):
    # Coordinates is not a dict
    metadata = {"coordinates": "not_a_dict"}
    codeflash_output = loader._format_image_element(metadata) # 761ns -> 730ns (4.25% faster)

def test_coordinates_dict_no_points(loader):
    # Coordinates is dict but no points key
    metadata = {"coordinates": {}}
    codeflash_output = loader._format_image_element(metadata) # 876ns -> 860ns (1.86% faster)

def test_points_not_tuple(loader):
    # Points is present but not a tuple
    metadata = {"coordinates": {"points": [1, 2, 3, 4]}}
    codeflash_output = loader._format_image_element(metadata) # 949ns -> 925ns (2.59% faster)

def test_points_tuple_wrong_length(loader):
    # Points is a tuple but wrong length
    metadata = {"coordinates": {"points": ((0, 0), (1, 1), (2, 2))}}
    codeflash_output = loader._format_image_element(metadata) # 1.12μs -> 1.14μs (1.93% slower)

def test_points_tuple_elements_not_tuples(loader):
    # Points tuple has elements that are not tuples
    metadata = {"coordinates": {"points": (1, 2, 3, 4)}}
    codeflash_output = loader._format_image_element(metadata) # 1.54μs -> 1.33μs (15.2% faster)

def test_points_tuple_elements_wrong_length(loader):
    # Points tuple elements are tuples but wrong length
    metadata = {"coordinates": {"points": ((0,), (1,), (2,), (3,))}}
    codeflash_output = loader._format_image_element(metadata) # 1.59μs -> 1.34μs (18.4% faster)

def test_bbox_formatting_basic(loader):
    # Points is valid: 4 tuples of length 2
    metadata = {
        "coordinates": {
            "points": ((10, 20), (30, 40), (50, 60), (70, 80))
        }
    }
    codeflash_output = loader._format_image_element(metadata) # 2.25μs -> 2.31μs (2.77% slower)

def test_bbox_formatting_with_extra_keys(loader):
    # Points valid, extra unrelated keys in coordinates
    metadata = {
        "coordinates": {
            "points": ((1, 2), (3, 4), (5, 6), (7, 8)),
            "foo": "bar"
        }
    }
    codeflash_output = loader._format_image_element(metadata) # 2.24μs -> 2.13μs (5.01% faster)

def test_bbox_formatting_with_layout_and_system(loader):
    # Points valid, layout_width, layout_height, system present
    metadata = {
        "coordinates": {
            "points": ((1, 2), (3, 4), (5, 6), (7, 8)),
            "layout_width": 100,
            "layout_height": 200,
            "system": "pdf"
        }
    }
    expected = "[Image omitted] (bbox=(1, 2, 7, 8)), system=pdf, layout_width=100, layout_height=200))"
    codeflash_output = loader._format_image_element(metadata) # 2.93μs -> 2.76μs (6.42% faster)

# 2. Edge Test Cases

def test_points_tuple_with_none(loader):
    # Points tuple contains None elements
    metadata = {"coordinates": {"points": (None, None, None, None)}}
    codeflash_output = loader._format_image_element(metadata) # 1.40μs -> 1.22μs (14.9% faster)

def test_points_tuple_with_mixed_types(loader):
    # Points tuple contains mixed types
    metadata = {"coordinates": {"points": ((1, 2), "bad", (3, 4), (5, 6))}}
    codeflash_output = loader._format_image_element(metadata) # 2.16μs -> 2.27μs (4.63% slower)

def test_points_tuple_with_nested_tuples(loader):
    # Points tuple contains 1-tuples that each wrap a pair, so no element has length 2
    metadata = {"coordinates": {"points": (((1, 2),), ((3, 4),), ((5, 6),), ((7, 8),))}}
    codeflash_output = loader._format_image_element(metadata) # 1.48μs -> 1.25μs (18.3% faster)

def test_layout_width_height_system_missing(loader):
    # Points valid, but only some of layout_width, layout_height, system present
    metadata = {
        "coordinates": {
            "points": ((1, 2), (3, 4), (5, 6), (7, 8)),
            "layout_width": 100,
            # "layout_height" missing
            "system": "pdf"
        }
    }
    # Should not append extra info since not all three are present
    codeflash_output = loader._format_image_element(metadata) # 2.33μs -> 2.29μs (1.71% faster)

def test_layout_width_height_system_zero_values(loader):
    # Points valid, layout_width, layout_height, system present but width/height are zero
    metadata = {
        "coordinates": {
            "points": ((1, 2), (3, 4), (5, 6), (7, 8)),
            "layout_width": 0,
            "layout_height": 0,
            "system": "pdf"
        }
    }
    # Zero is falsy, so should not append extra info
    codeflash_output = loader._format_image_element(metadata) # 2.27μs -> 2.14μs (5.94% faster)

def test_layout_width_height_system_empty_string(loader):
    # Points valid, layout_width, layout_height, system present but system is empty string
    metadata = {
        "coordinates": {
            "points": ((1, 2), (3, 4), (5, 6), (7, 8)),
            "layout_width": 100,
            "layout_height": 200,
            "system": ""
        }
    }
    # Empty string is falsy, so should not append extra info
    codeflash_output = loader._format_image_element(metadata) # 2.29μs -> 2.22μs (3.16% faster)

def test_coordinates_is_none(loader):
    # Coordinates is None
    metadata = {"coordinates": None}
    codeflash_output = loader._format_image_element(metadata) # 787ns -> 702ns (12.1% faster)

def test_coordinates_is_empty_string(loader):
    # Coordinates is empty string
    metadata = {"coordinates": ""}
    codeflash_output = loader._format_image_element(metadata) # 779ns -> 655ns (18.9% faster)

def test_points_tuple_large_numbers(loader):
    # Points tuple contains large numbers
    metadata = {
        "coordinates": {
            "points": ((999999, 888888), (777777, 666666), (555555, 444444), (333333, 222222))
        }
    }
    expected = "[Image omitted] (bbox=(999999, 888888, 333333, 222222))"
    codeflash_output = loader._format_image_element(metadata) # 2.40μs -> 2.47μs (2.84% slower)

def test_points_tuple_negative_numbers(loader):
    # Points tuple contains negative numbers
    metadata = {
        "coordinates": {
            "points": ((-1, -2), (-3, -4), (-5, -6), (-7, -8))
        }
    }
    expected = "[Image omitted] (bbox=(-1, -2, -7, -8))"
    codeflash_output = loader._format_image_element(metadata) # 2.31μs -> 2.33μs (0.858% slower)

def test_points_tuple_float_values(loader):
    # Points tuple contains float values
    metadata = {
        "coordinates": {
            "points": ((1.1, 2.2), (3.3, 4.4), (5.5, 6.6), (7.7, 8.8))
        }
    }
    expected = "[Image omitted] (bbox=(1.1, 2.2, 7.7, 8.8))"
    codeflash_output = loader._format_image_element(metadata) # 4.85μs -> 4.79μs (1.36% faster)

def test_points_tuple_with_bool_values(loader):
    # Points tuple contains boolean values
    metadata = {
        "coordinates": {
            "points": ((True, False), (1, 0), (0, 1), (False, True))
        }
    }
    # True/False are ints, but length checks pass
    expected = "[Image omitted] (bbox=(True, False, False, True))"
    codeflash_output = loader._format_image_element(metadata) # 2.68μs -> 2.73μs (1.58% slower)

# 3. Large Scale Test Cases

def test_large_metadata_dict(loader):
    # Large metadata dict, but only the relevant keys matter
    metadata = {
        "coordinates": {
            "points": ((10, 20), (30, 40), (50, 60), (70, 80)),
            "layout_width": 100,
            "layout_height": 200,
            "system": "pdf"
        }
    }
    # Add many irrelevant keys
    for i in range(500):
        metadata[f"irrelevant_{i}"] = "value"
    expected = "[Image omitted] (bbox=(10, 20, 70, 80)), system=pdf, layout_width=100, layout_height=200))"
    codeflash_output = loader._format_image_element(metadata) # 2.99μs -> 2.80μs (6.60% faster)

def test_large_coordinates_dict(loader):
    # Large coordinates dict, but only the relevant keys matter
    coordinates = {
        "points": ((10, 20), (30, 40), (50, 60), (70, 80)),
        "layout_width": 100,
        "layout_height": 200,
        "system": "pdf"
    }
    for i in range(500):
        coordinates[f"extra_{i}"] = i
    metadata = {"coordinates": coordinates}
    expected = "[Image omitted] (bbox=(10, 20, 70, 80)), system=pdf, layout_width=100, layout_height=200))"
    codeflash_output = loader._format_image_element(metadata) # 3.00μs -> 2.84μs (5.77% faster)

def test_many_images(loader):
    # Test with many different metadata dicts in a loop
    for i in range(100):
        metadata = {
            "coordinates": {
                "points": ((i, i+1), (i+2, i+3), (i+4, i+5), (i+6, i+7)),
                "layout_width": i*10+1,
                "layout_height": i*10+2,
                "system": f"sys{i}"
            }
        }
        expected = f"[Image omitted] (bbox=({i}, {i+1}, {i+6}, {i+7})), system=sys{i}, layout_width={i*10+1}, layout_height={i*10+2}))"
        codeflash_output = loader._format_image_element(metadata) # 99.2μs -> 93.9μs (5.64% faster)

def test_many_images_no_extra_info(loader):
    # Test with many images, but no layout/system info
    for i in range(100):
        metadata = {
            "coordinates": {
                "points": ((i, i+1), (i+2, i+3), (i+4, i+5), (i+6, i+7))
            }
        }
        expected = f"[Image omitted] (bbox=({i}, {i+1}, {i+6}, {i+7}))"
        codeflash_output = loader._format_image_element(metadata) # 78.6μs -> 77.0μs (2.10% faster)

def test_large_points_values(loader):
    # Points with large numbers, but within tuple length 2
    metadata = {
        "coordinates": {
            "points": ((999999999, 888888888), (777777777, 666666666), (555555555, 444444444), (333333333, 222222222))
        }
    }
    expected = "[Image omitted] (bbox=(999999999, 888888888, 333333333, 222222222))"
    codeflash_output = loader._format_image_element(metadata) # 2.31μs -> 2.26μs (2.21% faster)

def test_large_scale_with_edge_cases(loader):
    # Mix of valid and invalid metadata in a batch
    valid_count = 0
    for i in range(50):
        if i % 2 == 0:
            metadata = {
                "coordinates": {
                    "points": ((i, i+1), (i+2, i+3), (i+4, i+5), (i+6, i+7))
                }
            }
            expected = f"[Image omitted] (bbox=({i}, {i+1}, {i+6}, {i+7}))"
            codeflash_output = loader._format_image_element(metadata)
            valid_count += 1
        else:
            metadata = {
                "coordinates": {
                    "points": ((i,), (i+1,), (i+2,), (i+3,))
                }
            }
            codeflash_output = loader._format_image_element(metadata)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
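
The comparison described in the comment above can be sketched as a small differential check. This is an assumption-laden illustration, not Codeflash's actual mechanism: `capture`, `find_mismatches`, and the two toy functions are hypothetical names, and the toy functions only stand in for the original and optimized implementations.

```python
from typing import Any, Callable, Iterable, List, Tuple


def capture(fn: Callable[[Any], Any], arg: Any) -> Tuple[str, Any]:
    """Run fn(arg) and record either its return value or the exception type."""
    try:
        return ("ok", fn(arg))
    except Exception as exc:
        return ("raised", type(exc))


def find_mismatches(
    original: Callable[[Any], Any],
    optimized: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> List[Tuple[Any, Tuple[str, Any], Tuple[str, Any]]]:
    """Return every input on which the two functions behave differently."""
    mismatches = []
    for arg in inputs:
        before = capture(original, arg)
        after = capture(optimized, arg)
        if before != after:
            mismatches.append((arg, before, after))
    return mismatches


# Toy stand-ins for the original and optimized code paths.
def has_coordinates_original(metadata):
    return bool(metadata.get("coordinates", {}))


def has_coordinates_optimized(metadata):
    return bool(metadata.get("coordinates", None))


sample_inputs = [{}, {"coordinates": None}, {"coordinates": {"points": None}}, None]
print(find_mismatches(has_coordinates_original, has_coordinates_optimized, sample_inputs))
```

Recording the exception type alongside return values means an input such as `None`, which makes both versions raise, still counts as equivalent behavior.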

To edit these changes, git checkout codeflash/optimize-AdvancedPdfLoader._format_image_element-mhwrucsz and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 13, 2025 01:50
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: Medium" (Optimization Quality according to Codeflash) labels Nov 13, 2025