Fix entity extraction for large episode inputs with adaptive chunking #1129
Conversation
Implement density-based chunking for entity extraction to handle large, entity-dense inputs (e.g., AWS cost data, bulk imports) that cause LLM timeouts and truncation. Small content processes as-is; chunking only triggers for content >= 1000 tokens with high entity density (P95+). 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <[email protected]>
# Skip if it's likely a sentence starter (after . ! ? or first word)
if i == 0:
    continue
if i > 0 and words[i - 1].rstrip()[-1:] in '.!?':
Potential silent miss: If words[i-1] is an empty string (e.g., from multiple spaces), words[i - 1].rstrip()[-1:] returns an empty string and the check never matches. Consider using:
prev_word = words[i - 1].rstrip()
if prev_word and prev_word[-1] in '.!?':
    continue

# Move start forward, ensuring progress even if overlap >= chunk_size
# Always advance by at least (chunk_size - overlap) or 1 char minimum
min_progress = max(1, chunk_size_chars - overlap_chars)
start = max(start + min_progress, end - overlap_chars)
Logic issue with overlap calculation: When overlap_chars >= chunk_size_chars, the formula max(start + min_progress, end - overlap_chars) may cause start to move backwards (when end - overlap_chars < start + min_progress), leading to infinite loops or duplicated content.
Consider simplifying to always move forward:
start = end - overlap_chars if end < len(text) else len(text)
if start <= current_start:  # Ensure forward progress
    start = current_start + 1

def _chunk_json_array(
    data: list,
    chunk_size_chars: int,
    overlap_chars: int,
Entity deduplication uses case-insensitive name matching only. This is too aggressive and will incorrectly merge distinct entities with different names that happen to normalize to the same string (e.g., "AWS S3" vs "aws s3" could be different contexts, or "Apple Inc" the company vs "apple inc" in casual text).
Consider using:
- More sophisticated similarity matching (Levenshtein distance, embeddings)
- Additional context from the summary field
- Type matching - only deduplicate entities of the same type
def _merge_extracted_entities(
    chunk_results: list[list[ExtractedEntity]],
) -> list[ExtractedEntity]:
    """Merge entities from multiple chunks, deduplicating by normalized name.

    When duplicates occur, prefer the first occurrence (maintains ordering).
    """
    seen_names: set[str] = set()
    merged: list[ExtractedEntity] = []

    for entities in chunk_results:
        for entity in entities:
            normalized = entity.name.strip().lower()
Entity deduplication uses case-insensitive name matching only. This is too aggressive and will incorrectly merge distinct entities with different names that happen to normalize to the same string (e.g., "AWS S3" vs "aws s3" could be different contexts, or "Apple Inc" the company vs "apple inc" in casual text).
Consider using:
- More sophisticated similarity matching (Levenshtein distance, embeddings)
- Additional context from the summary field
- Type matching - only deduplicate entities of the same type
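A minimal sketch of the type-matching suggestion, assuming each ExtractedEntity carries a type label (called entity_type_id here for illustration; the real field name may differ):

def merge_by_name_and_type(
    chunk_results: list[list[ExtractedEntity]],
) -> list[ExtractedEntity]:
    # Key on normalized name AND type so entities that share a name but were
    # classified differently are kept separate instead of being merged.
    seen: set[tuple[str, int]] = set()
    merged: list[ExtractedEntity] = []
    for entities in chunk_results:
        for entity in entities:
            key = (entity.name.strip().lower(), entity.entity_type_id)
            if key not in seen:
                seen.add(key)
                merged.append(entity)
    return merged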
    Returns:
        Estimated token count
    """
    return len(text) // CHARS_PER_TOKEN
The 4 chars/token heuristic is reasonable but varies significantly by language and content type:
- Code/JSON: ~5-6 chars/token
- English prose: ~4-5 chars/token
- Technical text: ~3-4 chars/token
- Non-English: varies widely
This could cause chunking to trigger incorrectly or miss cases where it's needed. Consider:
- Adding a configuration parameter for chars_per_token
- Using actual tokenization for more accurate counting (tiktoken library for OpenAI models; sketched below)
- Documenting the assumption and potential inaccuracy
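As a rough sketch of the second option, the estimator could use tiktoken when it is installed and fall back to the existing heuristic otherwise (the cl100k_base encoding is an assumption; the appropriate encoding depends on the target model):

def estimate_tokens(text: str) -> int:
    try:
        import tiktoken  # treated as an optional dependency in this sketch
        return len(tiktoken.get_encoding('cl100k_base').encode(text))
    except ImportError:
        # Fall back to the 4-chars-per-token approximation
        return len(text) // CHARS_PER_TOKEN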
# Content chunking configuration for entity extraction
# Density-based chunking: only chunk high-density content (many entities per token)
# This targets the failure case (large entity-dense inputs) while preserving
# context for prose/narrative content
CHUNK_TOKEN_SIZE = int(os.getenv('CHUNK_TOKEN_SIZE', 3000))
CHUNK_OVERLAP_TOKENS = int(os.getenv('CHUNK_OVERLAP_TOKENS', 200))
# Minimum tokens before considering chunking - short content processes fine regardless of density
CHUNK_MIN_TOKENS = int(os.getenv('CHUNK_MIN_TOKENS', 1000))
# Entity density threshold: chunk if estimated density > this value
# For JSON: elements per 1000 tokens > threshold * 1000 (e.g., 0.15 = 150 elements/1000 tokens)
Missing validation for environment variables. Invalid values could cause runtime errors or unexpected behavior:
- CHUNK_TOKEN_SIZE and CHUNK_OVERLAP_TOKENS: No min/max validation. Negative or zero values would break chunking logic.
- CHUNK_OVERLAP_TOKENS >= CHUNK_TOKEN_SIZE: This creates edge cases in chunking logic (see line 527 in content_chunking.py).
- CHUNK_DENSITY_THRESHOLD: No range validation (should be 0.0-1.0 or similar).
Consider adding validation or using Pydantic settings:
CHUNK_TOKEN_SIZE = max(100, int(os.getenv('CHUNK_TOKEN_SIZE', 3000)))
CHUNK_OVERLAP_TOKENS = max(0, min(CHUNK_TOKEN_SIZE // 2, int(os.getenv('CHUNK_OVERLAP_TOKENS', 200))))

    episode: EpisodicNode,
) -> list[ExtractedEntity]:
    """Extract entities from a single chunk."""
    chunk_context = {**base_context, 'episode_content': chunk}
Context spreading issue: When chunking, each chunk gets the full previous_episodes context, but loses awareness of other chunks being processed in parallel. This could lead to:
- Duplicate entity extraction across chunks (partially addressed by merge, but summaries may differ)
- Lost relationships between entities that span chunk boundaries
- Inconsistent entity naming across chunks
Consider:
- Adding cross-chunk entity resolution
- Including context about entities from adjacent chunks (see the sketch below)
- A two-pass approach: extract entities, then extract relationships with full context
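One hedged sketch of the adjacent-chunk idea: process chunks sequentially and feed the names extracted so far into each subsequent chunk's context. The known_entities key and the extract_chunk callable are illustrative only, the prompt template would need to consume the extra field, and this trades away the parallel semaphore_gather path:

async def extract_with_running_context(chunks, base_context, extract_chunk):
    all_entities: list[ExtractedEntity] = []
    for chunk in chunks:
        chunk_context = {
            **base_context,
            'episode_content': chunk,
            # Names seen so far, to encourage consistent naming across chunks
            'known_entities': [e.name for e in all_entities],
        }
        all_entities.extend(await extract_chunk(chunk_context))
    return all_entities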
try:
    data = json.loads(content)
except json.JSONDecodeError:
    # Invalid JSON, fall back to text heuristics
    return _text_likely_dense(content, tokens)

if isinstance(data, list):
    # For arrays, each element likely contains entities
    element_count = len(data)
The JSON density calculation could fail or give misleading results for deeply nested structures. The threshold CHUNK_DENSITY_THRESHOLD * 1000 means with default 0.15, you need 150 elements per 1000 tokens to trigger chunking. This may be too high for some use cases.
Issues:
- Nested objects/arrays aren't fully accounted for (max_depth=2 in key counting)
- Array-of-primitives vs array-of-objects treated differently despite similar entity density
- No consideration for value sizes - [1,2,3,...] has the same density as [{large object}, {large object}, ...]
Consider: Weighing by both structure complexity AND content size.
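A purely illustrative weighting that blends per-token element count with average element size (the 200-character pivot is an arbitrary assumption, not a tuned value):

import json

def weighted_json_density(data: list, tokens: int) -> float:
    if not data or tokens <= 0:
        return 0.0
    element_count = len(data)
    avg_element_chars = sum(len(json.dumps(e)) for e in data) / element_count
    # Damp the raw per-token element count as elements get larger, so a list
    # of tiny primitives is not scored the same as a list of large objects.
    size_factor = min(1.0, 200 / max(avg_element_chars, 1.0))
    return (element_count / tokens) * 1000 * size_factor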
# Each chunk should end at a sentence boundary where possible
for chunk in chunks[:-1]:  # All except last
    # Should end with sentence punctuation or continue to next chunk
    assert chunk[-1] in '.!? ' or True  # Allow flexibility
This test assertion is too permissive (the trailing or True makes it always pass). Either implement proper validation or remove the test:
# Should end with sentence punctuation or be mid-sentence
assert chunk.rstrip()[-1] in '.!?' or not chunk.endswith('...')

if space_idx > start:
    end = space_idx

chunks.append(text[start:end].strip())
Using .strip() on chunks can remove significant whitespace that may be semantically important (e.g., in code, formatted text, or indented content). This could corrupt the content structure.
Consider only stripping if the content type is known to be prose/narrative.
else:
    chunks = chunk_text_content(episode.content)

logger.debug(f'Chunked content into {len(chunks)} chunks for entity extraction')
Missing error handling for chunking failures. If any chunk fails to extract entities (LLM timeout, rate limit, invalid JSON response), the entire operation could fail or return incomplete results.
Consider:
- Adding try-catch around individual chunk extraction
- Logging failed chunks
- Optionally retrying failed chunks
- Returning partial results with warnings
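A sketch combining the try-catch, logging, and partial-results suggestions, where extract_chunk stands in for the PR's per-chunk extraction coroutine and logger / semaphore_gather are the module's existing helpers:

async def extract_chunks_with_fallback(chunks, extract_chunk):
    async def safe_extract(index: int, chunk: str):
        try:
            return await extract_chunk(chunk)
        except Exception as e:
            # Log and skip the failed chunk instead of failing the whole episode
            logger.warning(f'Entity extraction failed for chunk {index + 1}/{len(chunks)}: {e}')
            return []

    return await semaphore_gather(
        *(safe_extract(i, chunk) for i, chunk in enumerate(chunks))
    )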
# Check if capitalized (first char upper, not all caps)
cleaned = word.strip('.,!?;:\'"()[]{}')
if cleaned and cleaned[0].isupper() and not cleaned.isupper():
The capitalized word heuristic for entity density is language-specific (English) and will fail for:
- Languages without capitalization (Chinese, Japanese, Arabic, Hebrew)
- Languages with different capitalization rules (German capitalizes all nouns)
- All-lowercase text (chat messages, tweets)
- ALL CAPS TEXT
This could cause incorrect chunking decisions for non-English content or informal text. Consider adding language detection or alternative heuristics.
Summary of Review Findings

This PR implements adaptive chunking for entity extraction, which addresses an important problem (LLM timeouts on large entity-dense inputs). However, there are several issues that should be addressed:

Critical Issues
Significant Issues
Minor Issues
Documentation Gap

The new chunking feature and its 4 environment variables are not documented in README.md or other user-facing docs. Users need to know:
Recommendations
Shows how Graphiti handles different content types:
- Normal content (prose/narrative) - single LLM call
- Dense content (structured data) - automatically chunked
- Message content (conversations) - preserves speaker boundaries

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
for i, word in enumerate(words):
    # Skip if it's likely a sentence starter (after . ! ? or first word)
    if i == 0:
        continue
The check if i > 0 and words[i - 1].rstrip()[-1:] in '.!?' is potentially unsafe. If words[i-1].rstrip() returns an empty string, [-1:] will return an empty string rather than raising an IndexError, causing the check to fail silently. Consider:
prev_word = words[i - 1].rstrip()
if i > 0 and prev_word and prev_word[-1] in '.!?':
    continue

# Move start forward, ensuring progress even if overlap >= chunk_size
# Always advance by at least (chunk_size - overlap) or 1 char minimum
min_progress = max(1, chunk_size_chars - overlap_chars)
start = max(start + min_progress, end - overlap_chars)
Configuration hazard: If overlap_chars >= chunk_size_chars, min_progress collapses to 1, so the loop still advances but only by one character per iteration, producing pathologically slow chunking with almost entirely overlapping chunks. Consider adding an assertion that overlap_chars < chunk_size_chars at function entry or document this constraint in the docstring.
logger = logging.getLogger(__name__)

# Approximate characters per token (conservative estimate)
CHARS_PER_TOKEN = 4
The constant CHARS_PER_TOKEN = 4 is overly simplistic and varies significantly across languages, LLM tokenizers, and content types. For example:
- Code typically has ~3 chars/token
- Chinese text has ~1.5-2 chars/token
- English prose is ~4-5 chars/token
This could lead to under-chunking (and thus LLM timeouts) or over-chunking (loss of context) depending on content. Consider:
- Adding a configuration parameter for different content types
- Using actual tokenizer libraries (tiktoken for OpenAI, etc.) when available
- At minimum, document this limitation prominently in the docstring and add a warning in logs when chunking is triggered
element_count = len(data)
# Estimate density: elements per 1000 tokens
density = (element_count / tokens) * 1000 if tokens > 0 else 0
return density > CHUNK_DENSITY_THRESHOLD * 1000  # Scale threshold
Scaling CHUNK_DENSITY_THRESHOLD by 1000 for comparison against density (which is also per 1000 tokens) is confusing. The density calculation already normalizes per 1000 tokens, so multiplying the threshold by 1000 makes the comparison less intuitive. Consider either:
- Keeping density as a simple ratio (0-1) and comparing directly:
density = element_count / tokens if tokens > 0 else 0; return density > CHUNK_DENSITY_THRESHOLD
- Or making the threshold values more intuitive (e.g., 150 instead of 0.15) and documenting units clearly
for element in data:
    element_json = json.dumps(element)
    element_size = len(element_json) + 2  # Account for comma and space
The size calculation element_size = len(element_json) + 2 adds 2 characters for comma and space, but this doesn't account for:
- The first element (no leading comma)
- Array brackets [] accounting
- Variable JSON spacing when re-serializing
This could cause chunks to be slightly larger than chunk_size_chars. While not critical, it may cause edge cases where chunks exceed LLM token limits. Consider calculating the actual serialized size: len(json.dumps(current_elements)) before the append decision.
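A hedged sketch of sizing chunks by the actual serialized output rather than a per-element estimate; re-serializing the candidate on every append is simpler but O(n²) in the worst case, so it is a starting point rather than a drop-in replacement:

import json

def chunk_json_array_by_serialized_size(data: list, chunk_size_chars: int) -> list[str]:
    chunks: list[str] = []
    current: list = []
    for element in data:
        candidate = current + [element]
        # Measure the chunk as it would actually be serialized, brackets and commas included
        if current and len(json.dumps(candidate)) > chunk_size_chars:
            chunks.append(json.dumps(current))
            current = [element]
        else:
            current = candidate
    if current:
        chunks.append(json.dumps(current))
    return chunks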
for entities in chunk_results:
    for entity in entities:
        normalized = entity.name.strip().lower()
Entity deduplication by normalized = entity.name.strip().lower() is too simplistic and may cause false positives. Consider:
- "Apple" (company) vs "apple" (fruit)
- "Paris" (city) vs "Paris" (person's name)
- "US" vs "us" (pronoun)
This lossy deduplication discards potentially distinct entities that happen to share the same normalized name. Since chunks process overlapping content, this will lead to missing entities. Consider:
- Preserving all entities and letting the existing LLM-based deduplication handle it (in resolve_extracted_nodes)
- At minimum, logging when entities are dropped during merge
CHUNK_OVERLAP_TOKENS = int(os.getenv('CHUNK_OVERLAP_TOKENS', 200))
# Minimum tokens before considering chunking - short content processes fine regardless of density
CHUNK_MIN_TOKENS = int(os.getenv('CHUNK_MIN_TOKENS', 1000))
# Entity density threshold: chunk if estimated density > this value
The comment claims "Examples that trigger chunking at 0.15: AWS cost data (12mo), bulk data imports, entity-dense JSON" but this is not validated by tests. The test suite uses much lower thresholds (0.01, 0.05) in monkeypatched tests, which don't validate the actual production default of 0.15. Consider adding integration tests that verify the default 0.15 threshold with realistic data like the AWS cost example from the quickstart.
# Text density threshold is typically lower than JSON
# A well-written article might have 5-10% named entities
return density > CHUNK_DENSITY_THRESHOLD * 500  # Half the JSON threshold
Using CHUNK_DENSITY_THRESHOLD * 500 for text is an arbitrary magic number with no clear justification. The comment says "Half the JSON threshold" but:
- Why half? Different content types need different thresholds, but this relationship is not explained
- The scaling factor 500 vs 1000 is inconsistent with the JSON logic
- This makes tuning the threshold counterintuitive (changing CHUNK_DENSITY_THRESHOLD has different effects on JSON vs text)
Consider separate configuration variables: CHUNK_DENSITY_THRESHOLD_JSON and CHUNK_DENSITY_THRESHOLD_TEXT with documented defaults.
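For illustration, the split could look like the following; the variable names and the text default are suggestions, not part of the PR:

import os

CHUNK_DENSITY_THRESHOLD_JSON = float(os.getenv('CHUNK_DENSITY_THRESHOLD_JSON', 0.15))
CHUNK_DENSITY_THRESHOLD_TEXT = float(os.getenv('CHUNK_DENSITY_THRESHOLD_TEXT', 0.075))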
Tests cover:
- Small input single LLM call (no chunking)
- Entity type classification and exclusion
- Empty name filtering
- Large input chunking triggers
- JSON/message-aware chunking
- Cross-chunk deduplication (case-insensitive)
- Prompt selection by episode type
- Entity type context building
- Merge extracted entities behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
# Skip if it's likely a sentence starter (after . ! ? or first word)
if i == 0:
    continue
if i > 0 and words[i - 1].rstrip()[-1:] in '.!?':
Potential issue: Line 195 checks if i > 0 but line 193 already continues if i == 0, making the inner condition redundant. The logic for detecting sentence starters has a flaw:
if i > 0 and words[i - 1].rstrip()[-1:] in '.!?':

This won't raise an IndexError (slicing [-1:] on an empty string returns an empty string), but the check silently never matches when words[i - 1].rstrip() is empty, which is fragile. Consider:
prev_word = words[i - 1].rstrip()
if prev_word and prev_word[-1] in '.!?':
    continue

# Move start forward, ensuring progress even if overlap >= chunk_size
# Always advance by at least (chunk_size - overlap) or 1 char minimum
min_progress = max(1, chunk_size_chars - overlap_chars)
start = max(start + min_progress, end - overlap_chars)
Configuration hazard: When overlap_chars >= chunk_size_chars, progress collapses to one character per iteration. The max(1, ...) guard prevents a true infinite loop, but a misconfigured overlap still produces pathologically slow chunking with heavily duplicated chunks.
Consider adding validation at function entry:
if overlap_chars >= chunk_size_chars:
raise ValueError(f"overlap_chars ({overlap_chars}) must be less than chunk_size_chars ({chunk_size_chars})")Or at minimum, log a warning when the configuration is problematic.
# Minimum tokens before considering chunking - short content processes fine regardless of density
CHUNK_MIN_TOKENS = int(os.getenv('CHUNK_MIN_TOKENS', 1000))
# Entity density threshold: chunk if estimated density > this value
# For JSON: elements per 1000 tokens > threshold * 1000 (e.g., 0.15 = 150 elements/1000 tokens)
Documentation inconsistency: The comment says "For JSON: elements per 1000 tokens > threshold * 1000" but with default CHUNK_DENSITY_THRESHOLD=0.15, this means 150 elements per 1000 tokens, not "0.15 = 150 elements/1000 tokens" as stated in line 49. The math is correct but the phrasing is confusing.
Consider clarifying:
# For JSON: elements per 1000 tokens > threshold * 1000
# (e.g., 0.15 threshold = 150 elements per 1000 tokens triggers chunking)

# For arrays, each element likely contains entities
element_count = len(data)
# Estimate density: elements per 1000 tokens
density = (element_count / tokens) * 1000 if tokens > 0 else 0
Edge case: Division by zero is checked with if tokens > 0 else 0, but this means content with exactly 0 tokens (empty string) will have density 0 and not chunk. While this is probably correct behavior, the function is also called with very small token counts where the density calculation becomes unreliable.
For example, with 10 tokens and 5 elements, density = 500, which would trigger chunking even though the content is tiny. The CHUNK_MIN_TOKENS check in should_chunk() prevents this, but consider adding a comment here explaining the relationship.
pass

# Try speaker pattern (e.g., "Alice: Hello")
speaker_pattern = r'^([A-Za-z_][A-Za-z0-9_\s]*):(.+?)(?=^[A-Za-z_][A-Za-z0-9_\s]*:|$)'
Regex complexity: The speaker pattern r'^([A-Za-z_][A-Za-z0-9_\s]*):(.+?)(?=^[A-Za-z_][A-Za-z0-9_\s]*:|$)' with re.MULTILINE | re.DOTALL is complex and may have performance implications on large message content. Consider:
- Adding a comment explaining what this pattern matches
- Testing performance with large message arrays (10k+ messages)
- Consider caching compiled regex patterns at module level for better performance
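Caching the compiled pattern at module level might look like this; the pattern itself is copied from the PR unchanged:

import re

# Matches "Speaker: message" blocks lazily, up to the next speaker line or end of text
SPEAKER_PATTERN = re.compile(
    r'^([A-Za-z_][A-Za-z0-9_\s]*):(.+?)(?=^[A-Za-z_][A-Za-z0-9_\s]*:|$)',
    re.MULTILINE | re.DOTALL,
)

# Inside the chunking function, reuse the precompiled pattern:
# for match in SPEAKER_PATTERN.finditer(content):
#     speaker, message = match.group(1), match.group(2)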
logger.debug(f'Chunked content into {len(chunks)} chunks for entity extraction')

# Extract entities from each chunk in parallel
chunk_results = await semaphore_gather(
Concurrency consideration: Using semaphore_gather here is good, but with potentially hundreds of chunks from a large document, this could create a spike in concurrent LLM API calls. Consider documenting the expected behavior or adding a note about the SEMAPHORE_LIMIT from helpers.
Also, if one chunk extraction fails, does it fail the entire operation? The error handling strategy should be documented.
Code Review Summary

This PR implements adaptive density-based chunking for entity extraction, which is a solid approach to handling large entity-dense inputs. The implementation is well-tested with 55 new tests. However, several issues need attention:

Critical Issues
Moderate Issues
Documentation Issues
Recommendations
The example file
Code Review Response

Claude (on behalf of @danielchalef)

Thank you for the thorough review. Here's our assessment of each finding:

Already Fixed
Intentional Design
Known Limitations (Documented)
Low Priority / Not Actionable
Will Address
Summary
Implement density-based adaptive chunking for entity extraction to handle large entity-dense inputs that cause LLM timeouts and truncation issues.
Type of Change
Objective
Large entity-dense content triggers LLM timeouts and invalid JSON responses during extraction. This PR adds intelligent chunking that only activates for content >= 1000 tokens with high entity density (P95+), while preserving context for prose/narrative content. Short or low-density content processes as-is.
Testing
Checklist