Fix entity extraction for large episode inputs with adaptive chunking #1129

danielchalef · 2025-12-30T23:46:55Z

Summary

Implement density-based adaptive chunking for entity extraction to handle large entity-dense inputs that cause LLM timeouts and truncation issues.

Type of Change

Performance improvement

Objective

Large entity-dense content triggers LLM timeouts and invalid JSON responses during extraction. This PR adds intelligent chunking that only activates for content >= 1000 tokens with high entity density (P95+), while preserving context for prose/narrative content. Short or low-density content processes as-is.

Testing

Unit tests added (39 tests for chunking, 16 for entity extraction)
All existing tests pass (55 tests)
Linting and type checking pass

Checklist

Code follows project style guidelines
Tests added for all new functionality
No secrets or sensitive information committed

Implement density-based chunking for entity extraction to handle large, entity-dense inputs (e.g., AWS cost data, bulk imports) that cause LLM timeouts and truncation. Small content processes as-is; chunking only triggers for content >= 1000 tokens with high entity density (P95+). 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <[email protected]>

claude · 2025-12-30T23:48:51Z

graphiti_core/utils/content_chunking.py

+        # Skip if it's likely a sentence starter (after . ! ? or first word)
+        if i == 0:
+            continue
+        if i > 0 and words[i - 1].rstrip()[-1:] in '.!?':


Potential IndexError: If words[i-1] is an empty string (from multiple spaces), words[i - 1].rstrip()[-1:] could fail. Consider using:

prev_word = words[i - 1].rstrip() if prev_word and prev_word[-1] in '.!?': continue

claude · 2025-12-30T23:48:52Z

graphiti_core/utils/content_chunking.py

+        # Move start forward, ensuring progress even if overlap >= chunk_size
+        # Always advance by at least (chunk_size - overlap) or 1 char minimum
+        min_progress = max(1, chunk_size_chars - overlap_chars)
+        start = max(start + min_progress, end - overlap_chars)


Logic issue with overlap calculation: When overlap_chars >= chunk_size_chars, the formula max(start + min_progress, end - overlap_chars) may cause start to move backwards (when end - overlap_chars < start + min_progress), leading to infinite loops or duplicated content.

Consider simplifying to always move forward:

start = end - overlap_chars if end < len(text) else len(text) if start <= current_start: # Ensure forward progress start = current_start + 1

claude · 2025-12-30T23:48:53Z

graphiti_core/utils/content_chunking.py

+def _chunk_json_array(
+    data: list,
+    chunk_size_chars: int,
+    overlap_chars: int,


Entity deduplication uses case-insensitive name matching only. This is too aggressive and will incorrectly merge distinct entities with different names that happen to normalize to the same string (e.g., "AWS S3" vs "aws s3" could be different contexts, or "Apple Inc" the company vs "apple inc" in casual text).

Consider using:

More sophisticated similarity matching (Levenshtein distance, embeddings)

Additional context from the summary field

Type matching - only deduplicate entities of the same type

claude · 2025-12-30T23:48:54Z

graphiti_core/utils/maintenance/node_operations.py

+def _merge_extracted_entities(
+    chunk_results: list[list[ExtractedEntity]],
+) -> list[ExtractedEntity]:
+    """Merge entities from multiple chunks, deduplicating by normalized name.
+
+    When duplicates occur, prefer the first occurrence (maintains ordering).
+    """
+    seen_names: set[str] = set()
+    merged: list[ExtractedEntity] = []
+
+    for entities in chunk_results:
+        for entity in entities:
+            normalized = entity.name.strip().lower()


Entity deduplication uses case-insensitive name matching only. This is too aggressive and will incorrectly merge distinct entities with different names that happen to normalize to the same string (e.g., "AWS S3" vs "aws s3" could be different contexts, or "Apple Inc" the company vs "apple inc" in casual text).

Consider using:

More sophisticated similarity matching (Levenshtein distance, embeddings)

Additional context from the summary field

Type matching - only deduplicate entities of the same type

claude · 2025-12-30T23:48:55Z

graphiti_core/utils/content_chunking.py

+    Returns:
+        Estimated token count
+    """
+    return len(text) // CHARS_PER_TOKEN


The 4 chars/token heuristic is reasonable but varies significantly by language and content type:

Code/JSON: ~5-6 chars/token

English prose: ~4-5 chars/token

Technical text: ~3-4 chars/token

Non-English: varies widely

This could cause chunking to trigger incorrectly or miss cases where it's needed. Consider:

Adding a configuration parameter for chars_per_token

Using actual tokenization for more accurate counting (tiktoken library for OpenAI models)

Documenting the assumption and potential inaccuracy

claude · 2025-12-30T23:48:56Z

graphiti_core/helpers.py

+# Content chunking configuration for entity extraction
+# Density-based chunking: only chunk high-density content (many entities per token)
+# This targets the failure case (large entity-dense inputs) while preserving
+# context for prose/narrative content
+CHUNK_TOKEN_SIZE = int(os.getenv('CHUNK_TOKEN_SIZE', 3000))
+CHUNK_OVERLAP_TOKENS = int(os.getenv('CHUNK_OVERLAP_TOKENS', 200))
+# Minimum tokens before considering chunking - short content processes fine regardless of density
+CHUNK_MIN_TOKENS = int(os.getenv('CHUNK_MIN_TOKENS', 1000))
+# Entity density threshold: chunk if estimated density > this value
+# For JSON: elements per 1000 tokens > threshold * 1000 (e.g., 0.15 = 150 elements/1000 tokens)


Missing validation for environment variables. Invalid values could cause runtime errors or unexpected behavior:

CHUNK_TOKEN_SIZE and CHUNK_OVERLAP_TOKENS: No min/max validation. Negative or zero values would break chunking logic.

CHUNK_OVERLAP_TOKENS >= CHUNK_TOKEN_SIZE: This creates edge cases in chunking logic (see line 527 in content_chunking.py)

CHUNK_DENSITY_THRESHOLD: No range validation (should be 0.0-1.0 or similar)

Consider adding validation or using Pydantic settings:

CHUNK_TOKEN_SIZE = max(100, int(os.getenv('CHUNK_TOKEN_SIZE', 3000))) CHUNK_OVERLAP_TOKENS = max(0, min(CHUNK_TOKEN_SIZE // 2, int(os.getenv('CHUNK_OVERLAP_TOKENS', 200))))

claude · 2025-12-30T23:48:57Z

graphiti_core/utils/maintenance/node_operations.py

+    episode: EpisodicNode,
+) -> list[ExtractedEntity]:
+    """Extract entities from a single chunk."""
+    chunk_context = {**base_context, 'episode_content': chunk}


Context spreading issue: When chunking, each chunk gets the full previous_episodes context, but loses awareness of other chunks being processed in parallel. This could lead to:

Duplicate entity extraction across chunks (partially addressed by merge, but summaries may differ)

Lost relationships between entities that span chunk boundaries

Inconsistent entity naming across chunks

Consider:

Adding cross-chunk entity resolution

Including context about entities from adjacent chunks

A two-pass approach: extract entities, then extract relationships with full context

claude · 2025-12-30T23:48:57Z

graphiti_core/utils/content_chunking.py

+    try:
+        data = json.loads(content)
+    except json.JSONDecodeError:
+        # Invalid JSON, fall back to text heuristics
+        return _text_likely_dense(content, tokens)
+
+    if isinstance(data, list):
+        # For arrays, each element likely contains entities
+        element_count = len(data)


The JSON density calculation could fail or give misleading results for deeply nested structures. The threshold CHUNK_DENSITY_THRESHOLD * 1000 means with default 0.15, you need 150 elements per 1000 tokens to trigger chunking. This may be too high for some use cases.

Issues:

Nested objects/arrays aren't fully accounted for (max_depth=2 in key counting)

Array-of-primitives vs array-of-objects treated differently despite similar entity density

No consideration for value sizes - [1,2,3,...] has same density as [{large object}, {large object}, ...]

Consider: Weighing by both structure complexity AND content size.

claude · 2025-12-30T23:48:58Z

tests/utils/test_content_chunking.py

+        # Each chunk should end at a sentence boundary where possible
+        for chunk in chunks[:-1]:  # All except last
+            # Should end with sentence punctuation or continue to next chunk
+            assert chunk[-1] in '.!? ' or True  # Allow flexibility


This test assertion is too permissive (or True always passes). Either implement proper validation or remove the test:

# Should end with sentence punctuation or be mid-sentence assert chunk.rstrip()[-1] in '.!?' or not chunk.endswith('...')

claude · 2025-12-30T23:49:23Z

graphiti_core/utils/content_chunking.py

+            if space_idx > start:
+                end = space_idx
+
+        chunks.append(text[start:end].strip())


Using .strip() on chunks can remove significant whitespace that may be semantically important (e.g., in code, formatted text, or indented content). This could corrupt the content structure.

Consider only stripping if the content type is known to be prose/narrative.

claude · 2025-12-30T23:49:24Z

graphiti_core/utils/maintenance/node_operations.py

+    else:
+        chunks = chunk_text_content(episode.content)
+
+    logger.debug(f'Chunked content into {len(chunks)} chunks for entity extraction')


Missing error handling for chunking failures. If any chunk fails to extract entities (LLM timeout, rate limit, invalid JSON response), the entire operation could fail or return incomplete results.

Consider:

Adding try-catch around individual chunk extraction

Logging failed chunks

Optionally retrying failed chunks

Returning partial results with warnings

claude · 2025-12-30T23:49:25Z

graphiti_core/utils/content_chunking.py

+        # Check if capitalized (first char upper, not all caps)
+        cleaned = word.strip('.,!?;:\'"()[]{}')
+        if cleaned and cleaned[0].isupper() and not cleaned.isupper():


The capitalized word heuristic for entity density is language-specific (English) and will fail for:

Languages without capitalization (Chinese, Japanese, Arabic, Hebrew)

Languages with different capitalization rules (German capitalizes all nouns)

All-lowercase text (chat messages, tweets)

ALL CAPS TEXT

This could cause incorrect chunking decisions for non-English content or informal text. Consider adding language detection or alternative heuristics.

claude · 2025-12-30T23:49:26Z

Summary of Review Findings

This PR implements adaptive chunking for entity extraction, which addresses an important problem (LLM timeouts on large entity-dense inputs). However, there are several issues that should be addressed:

Critical Issues

Potential infinite loop in _chunk_by_size when overlap >= chunk_size (line 527)
Overly aggressive entity deduplication using only case-insensitive name matching (line 265 in node_operations.py)
Missing error handling for individual chunk failures in parallel processing
No environment variable validation - invalid config could cause runtime errors

Significant Issues

Token estimation inaccuracy - 4 chars/token varies by content type and language (3-6 chars)
IndexError risk in text density calculation when processing empty words (line 195)
Language-specific heuristics - capitalized word detection only works for English
Context fragmentation - entities spanning chunk boundaries may be lost or duplicated

Minor Issues

JSON density calculation may not accurately reflect complexity for nested structures
Test assertion that always passes (or True) at line 186 in test file
Whitespace stripping could corrupt structured content like code or formatted text

Documentation Gap

The new chunking feature and its 4 environment variables are not documented in README.md or other user-facing docs. Users need to know:

When chunking triggers
How to tune the parameters
Performance implications
Limitations with non-English content

Recommendations

Add validation/bounds checking for environment variables
Fix the overlap calculation logic to prevent infinite loops
Improve entity deduplication to use type + similarity, not just normalized names
Add error handling with partial result recovery
Document the feature in README with configuration examples
Consider using actual tokenization (tiktoken) for better accuracy across content types

Shows how Graphiti handles different content types: - Normal content (prose/narrative) - single LLM call - Dense content (structured data) - automatically chunked - Message content (conversations) - preserves speaker boundaries 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

claude · 2025-12-31T00:09:51Z

graphiti_core/utils/content_chunking.py

+    for i, word in enumerate(words):
+        # Skip if it's likely a sentence starter (after . ! ? or first word)
+        if i == 0:
+            continue


The check if i > 0 and words[i - 1].rstrip()[-1:] in '.!?' is potentially unsafe. If words[i-1].rstrip() returns an empty string, [-1:] will return an empty string rather than raising an IndexError, causing the check to fail silently. Consider:

prev_word = words[i - 1].rstrip() if i > 0 and prev_word and prev_word[-1] in '.!?': continue

claude · 2025-12-31T00:09:53Z

graphiti_core/utils/content_chunking.py

+        # Move start forward, ensuring progress even if overlap >= chunk_size
+        # Always advance by at least (chunk_size - overlap) or 1 char minimum
+        min_progress = max(1, chunk_size_chars - overlap_chars)
+        start = max(start + min_progress, end - overlap_chars)


Potential infinite loop risk: If overlap_chars >= chunk_size_chars and chunk_size_chars is 1, the calculation max(1, chunk_size_chars - overlap_chars) returns 1, but then max(start + min_progress, end - overlap_chars) could still result in no progress if end - overlap_chars <= start + 1. Consider adding an assertion that overlap_chars < chunk_size_chars at function entry or document this constraint in the docstring.

claude · 2025-12-31T00:09:54Z

graphiti_core/utils/content_chunking.py

+logger = logging.getLogger(__name__)
+
+# Approximate characters per token (conservative estimate)
+CHARS_PER_TOKEN = 4


The constant CHARS_PER_TOKEN = 4 is overly simplistic and varies significantly across languages, LLM tokenizers, and content types. For example:

Code typically has ~3 chars/token

Chinese text has ~1.5-2 chars/token

English prose is ~4-5 chars/token

This could lead to under-chunking (and thus LLM timeouts) or over-chunking (loss of context) depending on content. Consider:

Adding a configuration parameter for different content types

Using actual tokenizer libraries (tiktoken for OpenAI, etc.) when available

At minimum, document this limitation prominently in the docstring and add a warning in logs when chunking is triggered

claude · 2025-12-31T00:09:56Z

graphiti_core/utils/content_chunking.py

+        element_count = len(data)
+        # Estimate density: elements per 1000 tokens
+        density = (element_count / tokens) * 1000 if tokens > 0 else 0
+        return density > CHUNK_DENSITY_THRESHOLD * 1000  # Scale threshold


Scaling CHUNK_DENSITY_THRESHOLD by 1000 for comparison against density (which is also per 1000 tokens) is confusing. The density calculation already normalizes per 1000 tokens, so multiplying the threshold by 1000 makes the comparison less intuitive. Consider either:

Keeping density as a simple ratio (0-1) and comparing directly: density = element_count / tokens if tokens > 0 else 0; return density > CHUNK_DENSITY_THRESHOLD

Or making the threshold values more intuitive (e.g., 150 instead of 0.15) and documenting units clearly

claude · 2025-12-31T00:09:58Z

graphiti_core/utils/content_chunking.py

+
+    for element in data:
+        element_json = json.dumps(element)
+        element_size = len(element_json) + 2  # Account for comma and space


The size calculation element_size = len(element_json) + 2 adds 2 characters for comma and space, but this doesn't account for:

The first element (no leading comma)

Array brackets [] accounting

Variable JSON spacing when re-serializing

This could cause chunks to be slightly larger than chunk_size_chars. While not critical, it may cause edge cases where chunks exceed LLM token limits. Consider calculating the actual serialized size: len(json.dumps(current_elements)) before the append decision.

claude · 2025-12-31T00:09:59Z

graphiti_core/utils/maintenance/node_operations.py

+
+    for entities in chunk_results:
+        for entity in entities:
+            normalized = entity.name.strip().lower()


Entity deduplication by normalized = entity.name.strip().lower() is too simplistic and may cause false positives. Consider:

"Apple" (company) vs "apple" (fruit)

"Paris" (city) vs "Paris" (person's name)

"US" vs "us" (pronoun)

This lossy deduplication discards potentially distinct entities that happen to share the same normalized name. Since chunks process overlapping content, this will lead to missing entities. Consider:

Preserving all entities and letting the existing LLM-based deduplication handle it (in resolve_extracted_nodes)

Using more sophisticated normalization (stemming, entity type awareness)

At minimum, logging when entities are dropped during merge

claude · 2025-12-31T00:10:00Z

graphiti_core/helpers.py

+CHUNK_OVERLAP_TOKENS = int(os.getenv('CHUNK_OVERLAP_TOKENS', 200))
+# Minimum tokens before considering chunking - short content processes fine regardless of density
+CHUNK_MIN_TOKENS = int(os.getenv('CHUNK_MIN_TOKENS', 1000))
+# Entity density threshold: chunk if estimated density > this value


The comment claims "Examples that trigger chunking at 0.15: AWS cost data (12mo), bulk data imports, entity-dense JSON" but this is not validated by tests. The test suite uses much lower thresholds (0.01, 0.05) in monkeypatched tests, which don't validate the actual production default of 0.15. Consider adding integration tests that verify the default 0.15 threshold with realistic data like the AWS cost example from the quickstart.

claude · 2025-12-31T00:10:02Z

graphiti_core/utils/content_chunking.py

+
+    # Text density threshold is typically lower than JSON
+    # A well-written article might have 5-10% named entities
+    return density > CHUNK_DENSITY_THRESHOLD * 500  # Half the JSON threshold


Using CHUNK_DENSITY_THRESHOLD * 500 for text is an arbitrary magic number with no clear justification. The comment says "Half the JSON threshold" but:

Why half? Different content types need different thresholds, but this relationship is not explained

The scaling factor 500 vs 1000 is inconsistent with the JSON logic

This makes tuning the threshold counterintuitive (changing CHUNK_DENSITY_THRESHOLD has different effects on JSON vs text)

Consider separate configuration variables: CHUNK_DENSITY_THRESHOLD_JSON and CHUNK_DENSITY_THRESHOLD_TEXT with documented defaults.

Tests cover: - Small input single LLM call (no chunking) - Entity type classification and exclusion - Empty name filtering - Large input chunking triggers - JSON/message-aware chunking - Cross-chunk deduplication (case-insensitive) - Prompt selection by episode type - Entity type context building - Merge extracted entities behavior 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>

claude · 2026-01-06T00:55:25Z

graphiti_core/utils/content_chunking.py

+        # Skip if it's likely a sentence starter (after . ! ? or first word)
+        if i == 0:
+            continue
+        if i > 0 and words[i - 1].rstrip()[-1:] in '.!?':


Potential issue: Line 195 checks if i > 0 but line 193 already continues if i == 0, making the inner condition redundant. The logic for detecting sentence starters has a flaw:

if i > 0 and words[i - 1].rstrip()[-1:] in '.!?':

This will fail with an IndexError if words[i - 1].rstrip() is an empty string (accessing [-1:] on empty string returns empty string, but the logic seems fragile). Consider:

prev_word = words[i - 1].rstrip() if prev_word and prev_word[-1] in '.!?': continue

claude · 2026-01-06T00:55:26Z

graphiti_core/utils/content_chunking.py

+        # Move start forward, ensuring progress even if overlap >= chunk_size
+        # Always advance by at least (chunk_size - overlap) or 1 char minimum
+        min_progress = max(1, chunk_size_chars - overlap_chars)
+        start = max(start + min_progress, end - overlap_chars)


Critical bug: This calculation can result in zero or negative progress when overlap_chars >= chunk_size_chars, causing an infinite loop. While the max(1, ...) prevents infinite loops, it's still problematic if overlap is configured incorrectly.

Consider adding validation at function entry:

if overlap_chars >= chunk_size_chars: raise ValueError(f"overlap_chars ({overlap_chars}) must be less than chunk_size_chars ({chunk_size_chars})")

Or at minimum, log a warning when the configuration is problematic.

claude · 2026-01-06T00:55:27Z

graphiti_core/helpers.py

+# Minimum tokens before considering chunking - short content processes fine regardless of density
+CHUNK_MIN_TOKENS = int(os.getenv('CHUNK_MIN_TOKENS', 1000))
+# Entity density threshold: chunk if estimated density > this value
+# For JSON: elements per 1000 tokens > threshold * 1000 (e.g., 0.15 = 150 elements/1000 tokens)


Documentation inconsistency: The comment says "For JSON: elements per 1000 tokens > threshold * 1000" but with default CHUNK_DENSITY_THRESHOLD=0.15, this means 150 elements per 1000 tokens, not "0.15 = 150 elements/1000 tokens" as stated in line 49. The math is correct but the phrasing is confusing.

Consider clarifying:

# For JSON: elements per 1000 tokens > threshold * 1000 # (e.g., 0.15 threshold = 150 elements per 1000 tokens triggers chunking)

claude · 2026-01-06T00:55:28Z

graphiti_core/utils/content_chunking.py

+        # For arrays, each element likely contains entities
+        element_count = len(data)
+        # Estimate density: elements per 1000 tokens
+        density = (element_count / tokens) * 1000 if tokens > 0 else 0


Edge case: Division by zero is checked with if tokens > 0 else 0, but this means content with exactly 0 tokens (empty string) will have density 0 and not chunk. While this is probably correct behavior, the function is also called with very small token counts where the density calculation becomes unreliable.

For example, with 10 tokens and 5 elements, density = 500, which would trigger chunking even though the content is tiny. The CHUNK_MIN_TOKENS check in should_chunk() prevents this, but consider adding a comment here explaining the relationship.

claude · 2026-01-06T00:55:30Z

graphiti_core/utils/content_chunking.py

+        pass
+
+    # Try speaker pattern (e.g., "Alice: Hello")
+    speaker_pattern = r'^([A-Za-z_][A-Za-z0-9_\s]*):(.+?)(?=^[A-Za-z_][A-Za-z0-9_\s]*:|$)'


Regex complexity: The speaker pattern r'^([A-Za-z_][A-Za-z0-9_\s]*):(.+?)(?=^[A-Za-z_][A-Za-z0-9_\s]*:|$)' with re.MULTILINE | re.DOTALL is complex and may have performance implications on large message content. Consider:

Adding a comment explaining what this pattern matches

Testing performance with large message arrays (10k+ messages)

Consider caching compiled regex patterns at module level for better performance

claude · 2026-01-06T00:55:31Z

graphiti_core/utils/maintenance/node_operations.py

+    logger.debug(f'Chunked content into {len(chunks)} chunks for entity extraction')
+
+    # Extract entities from each chunk in parallel
+    chunk_results = await semaphore_gather(


Concurrency consideration: Using semaphore_gather here is good, but with potentially hundreds of chunks from a large document, this could create a spike in concurrent LLM API calls. Consider documenting the expected behavior or adding a note about the SEMAPHORE_LIMIT from helpers.

Also, if one chunk extraction fails, does it fail the entire operation? The error handling strategy should be documented.

claude · 2026-01-06T00:55:32Z

Code Review Summary

This PR implements adaptive density-based chunking for entity extraction, which is a solid approach to handling large entity-dense inputs. The implementation is well-tested with 55 new tests. However, several issues need attention:

Critical Issues

Infinite loop risk in _chunk_by_size() (line 527 of content_chunking.py): When overlap_chars >= chunk_size_chars, the progress calculation could cause problems. While max(1, ...) prevents infinite loops, validation should be added to reject invalid configurations.
Regex performance concern (line 584): The speaker pattern matching for messages uses complex regex with DOTALL that may have performance issues on large message arrays.

Moderate Issues

Fragile sentence detection (line 195): The logic for detecting sentence starters could be more robust. Edge case with empty strings after rstrip() needs handling.
Edge case in density calculation (line 129): Small token counts (e.g., 10 tokens) can produce unreliable density estimates, though CHUNK_MIN_TOKENS mitigates this.
Concurrency implications (node_operations.py:200): With potentially hundreds of chunks, semaphore_gather could create large API call spikes. Error handling strategy for chunk failures should be documented.
Missing test coverage: No tests for invalid overlap configuration (overlap >= chunk_size).

Documentation Issues

Confusing density threshold docs (helpers.py:48): The explanation of the threshold calculation could be clearer.
Missing README/docs updates: This is a significant new feature with 4 environment variables (CHUNK_TOKEN_SIZE, CHUNK_OVERLAP_TOKENS, CHUNK_MIN_TOKENS, CHUNK_DENSITY_THRESHOLD) but README.md has no documentation about these configuration options or the adaptive chunking feature.

Recommendations

Add validation for chunk/overlap size configuration
Add README section documenting the new environment variables and adaptive chunking behavior
Consider adding performance benchmarks or documentation for large-scale usage
Add test coverage for edge cases (invalid configs, empty content edge cases)
Document error handling strategy for partial chunk extraction failures

The example file dense_vs_normal_ingestion.py is helpful but shouldn't be the only documentation for this feature.

danielchalef · 2026-01-06T00:55:47Z

Code Review Response

Claude (on behalf of @danielchalef)

Thank you for the thorough review. Here's our assessment of each finding:

Already Fixed

Finding	Status
Line 527 infinite loop risk	Fixed via `min_progress = max(1, chunk_size_chars - overlap_chars)`
Overlap logic causing backwards movement	Same fix ensures forward progress

Intentional Design

Finding	Rationale
Entity deduplication too aggressive (case-insensitive)	This is chunk-level merge only. Full semantic deduplication happens downstream in `resolve_extracted_nodes` via LLM. The simple merge here just prevents obvious duplicates from overlapping chunks.
Context spreading across chunks	`previous_episodes` provides shared context. Cross-chunk entity resolution is handled by the existing deduplication pipeline after extraction.
Separate thresholds for JSON/text	Single threshold with scaling factor (`* 1000` for JSON, `* 500` for text) is simpler than two config variables. The 2x ratio reflects that capitalized-word detection is noisier than JSON element counting.

Known Limitations (Documented)

Finding	Notes
`CHARS_PER_TOKEN = 4` varies by language	Known approximation. Adding tiktoken dependency is overkill for a chunking heuristic. Works well for English prose/JSON.
Capitalized word heuristic is English-specific	Non-English content still processes correctly, it just won't benefit from density-triggered chunking.
JSON density depth=2	Design choice to prevent over-counting deeply nested structures.

Low Priority / Not Actionable

Finding	Reason
Missing env var validation	Invalid values cause obvious errors. Adds complexity for marginal benefit.
Missing error handling for chunk failures	Current behavior: task fails on any chunk failure. Retry logic could be added later.
`.strip()` removes significant whitespace	For entity extraction, whitespace is not semantically significant.
Element size calculation off by ~2 chars	Marginal impact (~0.1% larger chunks). Not meaningful.
Integration tests with 0.15 threshold	Unit tests cover logic; integration tests use monkeypatch for speed.

Will Address

Finding	Action
`or True` in test assertion (line 186)	Will fix - test is currently always passing
Empty string handling in `words[i-1].rstrip()[-1:]`	Will add defensive check

danielchalef temporarily deployed to development December 30, 2025 23:46 — with GitHub Actions Inactive

claude bot reviewed Dec 30, 2025

View reviewed changes

danielchalef temporarily deployed to development December 31, 2025 00:08 — with GitHub Actions Inactive

claude bot reviewed Dec 31, 2025

View reviewed changes

danielchalef requested a review from prasmussen15 January 6, 2026 00:53

danielchalef temporarily deployed to development January 6, 2026 00:53 — with GitHub Actions Inactive

claude bot reviewed Jan 6, 2026

View reviewed changes

prasmussen15 approved these changes Jan 6, 2026

View reviewed changes

danielchalef merged commit a4af90b into main Jan 7, 2026
29 checks passed

getzep locked and limited conversation to collaborators Jan 7, 2026

Fix entity extraction for large episode inputs with adaptive chunking #1129

Fix entity extraction for large episode inputs with adaptive chunking #1129

Conversation

danielchalef commented Dec 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Type of Change

Objective

Testing

Checklist

Uh oh!

claude bot Dec 30, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 30, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 30, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 30, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 30, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 30, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 30, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 30, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 30, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 30, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 30, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 30, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot commented Dec 30, 2025

Summary of Review Findings

Critical Issues

Significant Issues

Minor Issues

Documentation Gap

Recommendations

Uh oh!

claude bot Dec 31, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 31, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 31, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 31, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 31, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 31, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 31, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Dec 31, 2025

Choose a reason for hiding this comment

Uh oh!

claude bot Jan 6, 2026

danielchalef commented Dec 30, 2025 •

edited

Loading