Fix entity extraction for large episode inputs with adaptive chunking #1129
```diff
@@ -36,6 +36,22 @@
 SEMAPHORE_LIMIT = int(os.getenv('SEMAPHORE_LIMIT', 20))
 DEFAULT_PAGE_LIMIT = 20
 
+# Content chunking configuration for entity extraction
+# Density-based chunking: only chunk high-density content (many entities per token)
+# This targets the failure case (large entity-dense inputs) while preserving
+# context for prose/narrative content
+CHUNK_TOKEN_SIZE = int(os.getenv('CHUNK_TOKEN_SIZE', 3000))
+CHUNK_OVERLAP_TOKENS = int(os.getenv('CHUNK_OVERLAP_TOKENS', 200))
+# Minimum tokens before considering chunking - short content processes fine regardless of density
+CHUNK_MIN_TOKENS = int(os.getenv('CHUNK_MIN_TOKENS', 1000))
+# Entity density threshold: chunk if estimated density > this value
+# For JSON: elements per 1000 tokens > threshold * 1000 (e.g., 0.15 = 150 elements/1000 tokens)
```
Comment on lines +39 to +48

Contributor

Missing validation for environment variables. Invalid values could cause runtime errors or unexpected behavior. Consider adding validation or using Pydantic settings:

```python
CHUNK_TOKEN_SIZE = max(100, int(os.getenv('CHUNK_TOKEN_SIZE', 3000)))
CHUNK_OVERLAP_TOKENS = max(0, min(CHUNK_TOKEN_SIZE // 2, int(os.getenv('CHUNK_OVERLAP_TOKENS', 200))))
```
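If the pydantic-settings route is taken, a minimal sketch could look like the following. The class and field names are illustrative, not part of this PR, and it assumes pydantic v2 plus pydantic-settings are available as dependencies:

```python
# Illustrative pydantic-settings alternative -- class/field names are not part
# of this PR; assumes pydantic v2 + pydantic-settings are installed.
from pydantic import Field, model_validator
from pydantic_settings import BaseSettings


class ChunkingSettings(BaseSettings):
    # Env vars (CHUNK_TOKEN_SIZE, etc.) are matched case-insensitively by default.
    chunk_token_size: int = Field(default=3000, ge=100)
    chunk_overlap_tokens: int = Field(default=200, ge=0)
    chunk_min_tokens: int = Field(default=1000, ge=0)
    chunk_density_threshold: float = Field(default=0.15, gt=0.0)

    @model_validator(mode='after')
    def check_overlap(self) -> 'ChunkingSettings':
        # Overlap must leave room for consecutive chunks to advance.
        if self.chunk_overlap_tokens >= self.chunk_token_size:
            raise ValueError('CHUNK_OVERLAP_TOKENS must be smaller than CHUNK_TOKEN_SIZE')
        return self


chunking_settings = ChunkingSettings()
```

This fails fast at startup on bad values instead of surfacing them later as runtime errors during extraction.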
Contributor

Documentation inconsistency: the comment says "For JSON: elements per 1000 tokens > threshold * 1000", which is easy to misread with the default `CHUNK_DENSITY_THRESHOLD` of 0.15. Consider clarifying:

```python
# For JSON: elements per 1000 tokens > threshold * 1000
# (e.g., 0.15 threshold = 150 elements per 1000 tokens triggers chunking)
```
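To spell out the arithmetic behind both rules with the default value (a worked example only, not code from this PR; the text rule appears a few lines further down in the diff):

```python
# Worked example of the documented thresholds (illustrative only).
CHUNK_DENSITY_THRESHOLD = 0.15

json_trigger = CHUNK_DENSITY_THRESHOLD * 1000  # 150 JSON elements per 1000 tokens
text_trigger = CHUNK_DENSITY_THRESHOLD * 500   # 75 capitalized words per 1000 tokens
```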
```diff
+# For Text: capitalized words per 1000 tokens > threshold * 500 (e.g., 0.15 = 75 caps/1000 tokens)
+# Higher values = more conservative (less chunking), targets P95+ density cases
+# Examples that trigger chunking at 0.15: AWS cost data (12mo), bulk data imports, entity-dense JSON
+# Examples that DON'T chunk at 0.15: meeting transcripts, news articles, documentation
+CHUNK_DENSITY_THRESHOLD = float(os.getenv('CHUNK_DENSITY_THRESHOLD', 0.15))
+
 
 def parse_db_date(input_date: neo4j_time.DateTime | str | None) -> datetime | None:
     if isinstance(input_date, neo4j_time.DateTime):
```
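For readers skimming the diff, here is a rough sketch of how these settings are presumably meant to interact. The helper names (`estimate_tokens`, `estimate_entity_density`, `should_chunk`, `chunk_content`) and the 4-characters-per-token heuristic are assumptions for illustration, not the functions this PR actually adds:

```python
# Illustrative sketch only -- helper names and the token heuristic are
# assumptions, not the PR's actual implementation.
import json
import os

CHUNK_TOKEN_SIZE = int(os.getenv('CHUNK_TOKEN_SIZE', 3000))
CHUNK_OVERLAP_TOKENS = int(os.getenv('CHUNK_OVERLAP_TOKENS', 200))
CHUNK_MIN_TOKENS = int(os.getenv('CHUNK_MIN_TOKENS', 1000))
CHUNK_DENSITY_THRESHOLD = float(os.getenv('CHUNK_DENSITY_THRESHOLD', 0.15))


def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token.
    return max(1, len(text) // 4)


def estimate_entity_density(content: str, is_json: bool) -> float:
    tokens = estimate_tokens(content)
    if is_json:
        def count_leaves(node) -> int:
            if isinstance(node, dict):
                return sum(count_leaves(v) for v in node.values())
            if isinstance(node, list):
                return sum(count_leaves(v) for v in node)
            return 1

        # JSON: elements per token, so 0.15 == 150 elements per 1000 tokens.
        return count_leaves(json.loads(content)) / tokens
    # Text: capitalized words per token, doubled so the same 0.15 threshold
    # corresponds to 75 capitalized words per 1000 tokens.
    caps = sum(1 for word in content.split() if word[:1].isupper())
    return (caps / tokens) * 2


def should_chunk(content: str, is_json: bool) -> bool:
    # Short content is always processed whole, regardless of density.
    if estimate_tokens(content) < CHUNK_MIN_TOKENS:
        return False
    return estimate_entity_density(content, is_json) > CHUNK_DENSITY_THRESHOLD


def chunk_content(content: str) -> list[str]:
    # Overlapping windows of roughly CHUNK_TOKEN_SIZE tokens each.
    size = CHUNK_TOKEN_SIZE * 4
    step = (CHUNK_TOKEN_SIZE - CHUNK_OVERLAP_TOKENS) * 4
    return [content[i:i + size] for i in range(0, len(content), step)]
```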
The comment claims "Examples that trigger chunking at 0.15: AWS cost data (12mo), bulk data imports, entity-dense JSON" but this is not validated by tests. The test suite uses much lower thresholds (0.01, 0.05) in monkeypatched tests, which don't validate the actual production default of 0.15. Consider adding integration tests that verify the default 0.15 threshold with realistic data like the AWS cost example from the quickstart.
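A sketch of what such an integration test might look like, assuming a `should_chunk(content, is_json)` entry point like the sketch above; the module path, helper name, and synthetic payload are all hypothetical, and the real PR may expose a different API:

```python
# Sketch of integration-style tests for the shipped default of 0.15.
# `should_chunk` and the module name are hypothetical; adjust the import to the
# PR's actual entry point.
import json

from chunking_sketch import should_chunk  # hypothetical module


def make_aws_cost_payload(months: int = 12, services: int = 40) -> str:
    # Synthetic stand-in for the AWS cost example referenced in the comment:
    # many small records -> high element density per token.
    return json.dumps([
        {'month': f'2024-{m:02d}', 'service': f'service-{s}', 'cost_usd': round(m * s * 1.37, 2)}
        for m in range(1, months + 1)
        for s in range(services)
    ])


def test_default_threshold_chunks_entity_dense_json():
    # No monkeypatching: exercises the default CHUNK_DENSITY_THRESHOLD of 0.15.
    assert should_chunk(make_aws_cost_payload(), is_json=True)


def test_default_threshold_leaves_prose_alone():
    transcript = (
        'In the meeting, the team reviewed the quarterly roadmap and '
        'discussed next steps for the migration effort. '
    ) * 200
    assert not should_chunk(transcript, is_json=False)
```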