@codeflash-ai codeflash-ai bot commented Nov 13, 2025

📄 34% (0.34x) speedup for AdvancedPdfLoader._format_table_element in cognee/infrastructure/loaders/external/advanced_pdf_loader.py

⏱️ Runtime : 205 microseconds → 153 microseconds (best of 95 runs)
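The headline percentage can be reproduced from the two runtimes. This is a minimal sketch that assumes speedup is defined as `original / optimized - 1`, which matches the "0.34x" figure; the function name `speedup_percent` is hypothetical and not part of Codeflash.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Relative speedup as a percentage: (original / optimized - 1) * 100."""
    return (original / optimized - 1.0) * 100.0


# Runtimes reported above, in microseconds.
original_us = 205.0
optimized_us = 153.0

print(f"{speedup_percent(original_us, optimized_us):.0f}% speedup")  # prints "34% speedup"
```

The per-test figures in the generated tests appear to follow the same formula, although the displayed runtimes are rounded, so recomputing from them can differ by a fraction of a percent.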

📝 Explanation and details

The optimized code achieves a 34% speedup by eliminating unnecessary function calls and optimizing string handling in the hot path.

Key optimizations:

  1. Function call elimination: The original code always called self._clean_text() even when table_html was available and would be returned immediately. The optimized version moves the text cleaning logic inline and only executes it when needed (when table_html is falsy), avoiding 214 unnecessary function calls based on the profiler data.

  2. Lazy text processing: Instead of always calling element.get("text", "") and cleaning it upfront, the optimized code first checks if table_html exists. Only when falling back to text does it retrieve and process the text value, saving work in 62% of cases (214 out of 345 calls returned HTML).

  3. Conditional string conversion: The optimized code uses isinstance(text, str) to avoid redundant str() calls when the text is already a string, which is the common case. This eliminates double string conversions.
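The three changes above can be sketched as a before/after pair. This is an illustration reconstructed from the description only, not the actual code in `advanced_pdf_loader.py`: the function names `clean_text`, `format_table_element_eager`, and `format_table_element_lazy` are hypothetical, and the cleaning rule (replace non-breaking spaces, then strip) is assumed from the test comments.

```python
from typing import Any, Dict


def clean_text(value: Any) -> str:
    """Assumed stand-in for _clean_text: None -> "", else str, NBSP -> space, strip."""
    if value is None:
        return ""
    return str(value).replace("\xa0", " ").strip()


def format_table_element_eager(element: Dict[str, Any]) -> str:
    """Original shape: the text is cleaned up front, even if HTML is returned."""
    metadata = element.get("metadata", {})
    text = clean_text(element.get("text", ""))  # always runs
    table_html = metadata.get("text_as_html")
    if table_html:
        return table_html.strip()
    return text


def format_table_element_lazy(element: Dict[str, Any]) -> str:
    """Optimized shape: check HTML first, clean the text inline only on fallback."""
    metadata = element.get("metadata", {})
    table_html = metadata.get("text_as_html")
    if table_html:
        return table_html.strip()  # early return, no text work at all
    text = element.get("text")
    if text is None:
        return ""
    if not isinstance(text, str):
        text = str(text)  # convert only when the value is not already a string
    return text.replace("\xa0", " ").strip()


samples = [
    {"text": "Some text", "metadata": {"text_as_html": "   <table><tr><td>1</td></tr></table>   "}},
    {"text": "  Some\xa0text  ", "metadata": {}},
    {"text": 12345, "metadata": {}},
    {"text": None, "metadata": {}},
    {},
]
for sample in samples:
    # Both variants should agree on every input.
    assert format_table_element_eager(sample) == format_table_element_lazy(sample)
```

Under these assumptions the two variants return the same value for every sample; only the amount of work done before the HTML early return differs.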

Performance impact by test case type:

  • HTML-preferred cases (54-104% faster): Maximum benefit since text processing is completely skipped
  • Missing text cases (67-88% faster): Substantial gains from avoiding unnecessary _clean_text calls
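  • Text-fallback cases (about 3% slower to 27% faster): Modest gains from inline processing and conditional string conversion; a few non-string inputs are marginally slower
  • Large-scale tests: 42-75% faster when HTML is present, but only 2-14% faster for text-only input, so the gain depends on which path is taken rather than on data size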

The optimization is most effective when an HTML table representation is available, because the early return then skips text processing entirely; in the profiled run this was the dominant case (214 of 345 calls). Inlining the text cleanup also removes one Python function call per element, which adds up in data processing pipelines that format many elements.
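To check the early-return effect locally, a rough micro-benchmark along these lines may help. It is a sketch under assumptions: `clean_text`, `eager`, `lazy`, and `best_of` are hypothetical names, the functions only approximate the loader's behavior, and absolute timings will vary by machine and Python version.

```python
import timeit
from typing import Any, Callable, Dict


def clean_text(value: Any) -> str:
    """Assumed stand-in for the loader's _clean_text helper."""
    if value is None:
        return ""
    return str(value).replace("\xa0", " ").strip()


def eager(element: Dict[str, Any]) -> str:
    # Cleans the text before checking whether HTML is available.
    text = clean_text(element.get("text", ""))
    html = element.get("metadata", {}).get("text_as_html")
    return html.strip() if html else text


def lazy(element: Dict[str, Any]) -> str:
    # Checks for HTML first and touches the text only on fallback.
    html = element.get("metadata", {}).get("text_as_html")
    if html:
        return html.strip()
    return clean_text(element.get("text"))


def best_of(func: Callable[[Dict[str, Any]], str], element: Dict[str, Any],
            repeat: int = 5, number: int = 20_000) -> float:
    """Best observed time per call, in nanoseconds."""
    timer = timeit.Timer(lambda: func(element))
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1e9


html_element = {"text": "Some text", "metadata": {"text_as_html": "  <table></table>  "}}
text_element = {"text": "  Some\xa0text  ", "metadata": {}}

for label, element in (("html", html_element), ("text", text_element)):
    before = best_of(eager, element)
    after = best_of(lazy, element)
    print(f"{label}: {before:.0f} ns -> {after:.0f} ns per call")
```

The `lazy` variant here keeps the helper call on the fallback path, so it isolates the benefit of the early return; the inlined cleanup and the `isinstance` check described above would need a separate variant to measure.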

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 387 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest
from cognee.infrastructure.loaders.external.advanced_pdf_loader import \
    AdvancedPdfLoader

# unit tests

# Basic Test Cases

def test_html_preferred_over_text():
    # If metadata contains 'text_as_html', it should be returned, stripped
    loader = AdvancedPdfLoader()
    element = {
        "text": "Some text",
        "metadata": {"text_as_html": "   <table><tr><td>1</td></tr></table>   "}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 1.61μs -> 1.04μs (54.6% faster)

def test_text_returned_when_no_html():
    # If metadata does not contain 'text_as_html', returns cleaned text
    loader = AdvancedPdfLoader()
    element = {
        "text": "Some text",
        "metadata": {}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 1.30μs -> 1.31μs (0.688% slower)

def test_text_returned_when_html_is_none():
    # If 'text_as_html' is None, returns cleaned text
    loader = AdvancedPdfLoader()
    element = {
        "text": "Some text",
        "metadata": {"text_as_html": None}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 1.34μs -> 1.13μs (18.6% faster)

def test_text_returned_when_html_is_empty_string():
    # If 'text_as_html' is empty string, returns cleaned text
    loader = AdvancedPdfLoader()
    element = {
        "text": "Some text",
        "metadata": {"text_as_html": ""}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 1.33μs -> 1.08μs (23.5% faster)

def test_text_cleaning_removes_nbsp_and_strips():
    # The _clean_text method should replace non-breaking spaces and strip whitespace
    loader = AdvancedPdfLoader()
    element = {
        "text": "  Some\xa0text  ",
        "metadata": {}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 2.11μs -> 1.84μs (14.4% faster)

# Edge Test Cases

def test_element_missing_metadata_key():
    # If 'metadata' key is missing, should not raise and return cleaned text
    loader = AdvancedPdfLoader()
    element = {
        "text": "Edge case text"
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 1.37μs -> 1.19μs (15.2% faster)

def test_element_missing_text_key():
    # If 'text' key is missing, should return empty string
    loader = AdvancedPdfLoader()
    element = {
        "metadata": {}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 1.33μs -> 755ns (76.6% faster)

def test_text_is_none():
    # If 'text' is None, should return empty string
    loader = AdvancedPdfLoader()
    element = {
        "text": None,
        "metadata": {}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 902ns -> 730ns (23.6% faster)


def test_text_is_integer():
    # If 'text' is an integer, should convert to string and clean
    loader = AdvancedPdfLoader()
    element = {
        "text": 12345,
        "metadata": {}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 1.92μs -> 1.93μs (0.467% slower)

def test_text_is_list():
    # If 'text' is a list, should convert to string and clean
    loader = AdvancedPdfLoader()
    element = {
        "text": ["a", "b", "c"],
        "metadata": {}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 2.81μs -> 2.89μs (2.90% slower)

def test_html_is_whitespace_only():
    # If 'text_as_html' is whitespace only, should return cleaned text
    loader = AdvancedPdfLoader()
    element = {
        "text": "Fallback text",
        "metadata": {"text_as_html": "    "}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 1.49μs -> 879ns (70.0% faster)

def test_text_as_html_is_false():
    # If 'text_as_html' is False, should fallback to text
    loader = AdvancedPdfLoader()
    element = {
        "text": "Text value",
        "metadata": {"text_as_html": False}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 1.34μs -> 1.15μs (16.9% faster)

def test_text_as_html_is_zero():
    # If 'text_as_html' is 0, should fallback to text
    loader = AdvancedPdfLoader()
    element = {
        "text": "Zero text",
        "metadata": {"text_as_html": 0}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 1.32μs -> 1.10μs (20.3% faster)

def test_text_as_html_is_list():
    # If 'text_as_html' is a list, should fallback to text
    loader = AdvancedPdfLoader()
    element = {
        "text": "List text",
        "metadata": {"text_as_html": [1,2,3]}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output

def test_text_as_html_is_dict():
    # If 'text_as_html' is a dict, should fallback to text
    loader = AdvancedPdfLoader()
    element = {
        "text": "Dict text",
        "metadata": {"text_as_html": {"a": 1}}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output

# Large Scale Test Cases

def test_large_html_table():
    # Large HTML table should be handled and stripped correctly
    loader = AdvancedPdfLoader()
    rows = "".join(f"<tr><td>{i}</td></tr>" for i in range(1000))
    html = f"\n\n<table>{rows}</table>\n\n"
    element = {
        "text": "Large table text",
        "metadata": {"text_as_html": html}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 2.88μs -> 2.02μs (42.0% faster)

def test_large_text_with_nbsp():
    # Large text with many non-breaking spaces should be cleaned
    loader = AdvancedPdfLoader()
    text = " ".join([f"word{i}\xa0" for i in range(1000)])
    element = {
        "text": text,
        "metadata": {}
    }
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 8.66μs -> 8.48μs (2.22% faster)

def test_large_element_with_missing_keys():
    # Large element dict missing 'metadata' and 'text', should return empty string
    loader = AdvancedPdfLoader()
    element = {str(i): i for i in range(1000)}
    codeflash_output = loader._format_table_element(element); result = codeflash_output # 1.49μs -> 889ns (67.3% faster)

def test_many_elements_html_and_text():
    # Test many elements with both HTML and text, ensure HTML is always preferred
    loader = AdvancedPdfLoader()
    for i in range(100):
        html = f"<table><tr><td>{i}</td></tr></table>"
        text = f"Table {i}"
        element = {
            "text": text,
            "metadata": {"text_as_html": f"   {html}   "}
        }
        codeflash_output = loader._format_table_element(element); result = codeflash_output # 42.4μs -> 26.0μs (62.9% faster)

def test_many_elements_text_only():
    # Test many elements with only text, ensure text is returned and cleaned
    loader = AdvancedPdfLoader()
    for i in range(100):
        text = f"   Table {i}\xa0"
        element = {
            "text": text,
            "metadata": {}
        }
        codeflash_output = loader._format_table_element(element); result = codeflash_output # 47.6μs -> 41.9μs (13.4% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from typing import Any, Dict

# imports
import pytest  # used for our unit tests
from cognee.infrastructure.loaders.external.advanced_pdf_loader import \
    AdvancedPdfLoader

# unit tests

@pytest.fixture
def loader():
    # Fixture to create an instance of AdvancedPdfLoader
    return AdvancedPdfLoader()

# ------------------------
# 1. Basic Test Cases
# ------------------------

def test_returns_html_if_present(loader):
    # Should return the HTML if metadata['text_as_html'] is present and not empty
    element = {
        "metadata": {"text_as_html": "<table><tr><td>foo</td></tr></table>"},
        "text": "Should not be used"
    }
    codeflash_output = loader._format_table_element(element) # 1.31μs -> 790ns (66.2% faster)

def test_returns_html_stripped(loader):
    # Should strip leading/trailing whitespace from HTML
    element = {
        "metadata": {"text_as_html": "   <table><tr><td>bar</td></tr></table>   "},
        "text": "Should not be used"
    }
    codeflash_output = loader._format_table_element(element) # 1.40μs -> 687ns (104% faster)

def test_returns_cleaned_text_if_no_html(loader):
    # Should return cleaned text if no HTML is present
    element = {
        "metadata": {},
        "text": "  some text  "
    }
    codeflash_output = loader._format_table_element(element) # 1.40μs -> 1.25μs (11.5% faster)

def test_returns_cleaned_text_if_html_is_none(loader):
    # Should fall back to cleaned text if metadata['text_as_html'] is None
    element = {
        "metadata": {"text_as_html": None},
        "text": "  hello world  "
    }
    codeflash_output = loader._format_table_element(element) # 1.40μs -> 1.10μs (26.5% faster)

def test_returns_cleaned_text_if_html_is_empty_string(loader):
    # Should fall back to cleaned text if metadata['text_as_html'] is empty string
    element = {
        "metadata": {"text_as_html": ""},
        "text": "  hello world  "
    }
    codeflash_output = loader._format_table_element(element) # 1.31μs -> 1.14μs (14.4% faster)

def test_returns_empty_string_if_text_and_html_missing(loader):
    # Should return empty string if both text and HTML are missing
    element = {
        "metadata": {},
    }
    codeflash_output = loader._format_table_element(element) # 1.34μs -> 775ns (72.9% faster)

def test_returns_empty_string_if_text_is_none_and_no_html(loader):
    # Should return empty string if text is None and no HTML
    element = {
        "metadata": {},
        "text": None
    }
    codeflash_output = loader._format_table_element(element) # 949ns -> 766ns (23.9% faster)

def test_returns_empty_string_if_text_is_empty_and_no_html(loader):
    # Should return empty string if text is empty and no HTML
    element = {
        "metadata": {},
        "text": ""
    }
    codeflash_output = loader._format_table_element(element) # 1.31μs -> 1.14μs (15.0% faster)

def test_returns_text_with_non_breaking_space_cleaned(loader):
    # Should replace non-breaking spaces with regular spaces and strip
    element = {
        "metadata": {},
        "text": "\xa0foo\xa0bar\xa0"
    }
    codeflash_output = loader._format_table_element(element) # 2.09μs -> 1.79μs (16.9% faster)

def test_returns_text_even_if_metadata_is_missing(loader):
    # Should return cleaned text if metadata key is missing
    element = {
        "text": "  some text  "
    }
    codeflash_output = loader._format_table_element(element) # 1.45μs -> 1.23μs (17.7% faster)

# ------------------------
# 2. Edge Test Cases
# ------------------------


def test_text_is_integer(loader):
    # Should convert non-string text to string and clean
    element = {
        "metadata": {},
        "text": 12345
    }
    codeflash_output = loader._format_table_element(element) # 1.95μs -> 1.86μs (4.95% faster)

def test_text_is_list(loader):
    # Should convert list to string and clean
    element = {
        "metadata": {},
        "text": ["foo", "bar"]
    }
    # str(["foo", "bar"]) == "['foo', 'bar']"
    codeflash_output = loader._format_table_element(element) # 2.85μs -> 2.85μs (0.070% faster)

def test_text_is_dict(loader):
    # Should convert dict to string and clean
    element = {
        "metadata": {},
        "text": {"a": 1, "b": 2}
    }
    # str({"a": 1, "b": 2}) == "{'a': 1, 'b': 2}"
    codeflash_output = loader._format_table_element(element) # 2.67μs -> 2.68μs (0.187% slower)


def test_element_is_empty_dict(loader):
    # Should return empty string if element is empty
    element = {}
    codeflash_output = loader._format_table_element(element) # 1.86μs -> 987ns (88.4% faster)

def test_element_is_none(loader):
    # Should raise error if element is None
    with pytest.raises(AttributeError):
        loader._format_table_element(None) # 1.58μs -> 1.46μs (8.52% faster)

def test_html_is_whitespace_only(loader):
    # Should treat whitespace-only HTML as absent and fallback to text
    element = {
        "metadata": {"text_as_html": "   "},
        "text": "actual text"
    }
    codeflash_output = loader._format_table_element(element) # 1.69μs -> 968ns (74.6% faster)

def test_text_is_whitespace_only(loader):
    # Should return empty string if text is whitespace only and no HTML
    element = {
        "metadata": {},
        "text": "   "
    }
    codeflash_output = loader._format_table_element(element) # 1.42μs -> 1.33μs (6.83% faster)

def test_html_and_text_are_whitespace(loader):
    # Should return empty string if both are whitespace only
    element = {
        "metadata": {"text_as_html": "   "},
        "text": "   "
    }
    codeflash_output = loader._format_table_element(element) # 1.46μs -> 866ns (68.9% faster)



def test_large_text(loader):
    # Should handle large text efficiently
    large_text = "foo " * 500  # 2000 characters
    element = {
        "metadata": {},
        "text": large_text
    }
    codeflash_output = loader._format_table_element(element) # 2.22μs -> 1.96μs (13.5% faster)

def test_large_html(loader):
    # Should handle large HTML efficiently
    large_html = "<table>" + ("<tr><td>foo</td></tr>" * 500) + "</table>"
    element = {
        "metadata": {"text_as_html": large_html},
        "text": "should not be used"
    }
    codeflash_output = loader._format_table_element(element) # 1.50μs -> 857ns (74.8% faster)

def test_many_non_breaking_spaces(loader):
    # Should replace many non-breaking spaces with regular spaces
    text = "\xa0".join(["foo"] * 500)
    element = {
        "metadata": {},
        "text": text
    }
    expected = " ".join(["foo"] * 500)
    codeflash_output = loader._format_table_element(element) # 3.94μs -> 3.77μs (4.53% faster)
    assert codeflash_output == expected

def test_large_number_of_elements(loader):
    # Should process multiple large table elements correctly
    for i in range(100):
        html = f"<table><tr><td>{i}</td></tr></table>"
        element = {
            "metadata": {"text_as_html": html},
            "text": f"text {i}"
        }
        codeflash_output = loader._format_table_element(element) # 41.9μs -> 25.0μs (67.8% faster)

def test_large_metadata_dict(loader):
    # Should ignore unrelated metadata keys and just use text_as_html
    metadata = {f"key_{i}": i for i in range(500)}
    metadata["text_as_html"] = "<table><tr><td>big</td></tr></table>"
    element = {
        "metadata": metadata,
        "text": "should not be used"
    }
    codeflash_output = loader._format_table_element(element) # 1.26μs -> 770ns (63.8% faster)

def test_large_element_dict(loader):
    # Should ignore unrelated element keys and just use metadata/text
    element = {f"key_{i}": i for i in range(500)}
    element["metadata"] = {"text_as_html": "<table><tr><td>huge</td></tr></table>"}
    element["text"] = "should not be used"
    codeflash_output = loader._format_table_element(element) # 1.35μs -> 807ns (67.4% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-AdvancedPdfLoader._format_table_element-mhwrjz00 and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 13, 2025 01:41
@codeflash-ai codeflash-ai bot added the "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) labels Nov 13, 2025