16 changes: 16 additions & 0 deletions graphiti_core/helpers.py
@@ -36,6 +36,22 @@
SEMAPHORE_LIMIT = int(os.getenv('SEMAPHORE_LIMIT', 20))
DEFAULT_PAGE_LIMIT = 20

# Content chunking configuration for entity extraction
# Density-based chunking: only chunk high-density content (many entities per token)
# This targets the failure case (large entity-dense inputs) while preserving
# context for prose/narrative content
CHUNK_TOKEN_SIZE = int(os.getenv('CHUNK_TOKEN_SIZE', 3000))
CHUNK_OVERLAP_TOKENS = int(os.getenv('CHUNK_OVERLAP_TOKENS', 200))
# Minimum tokens before considering chunking - short content processes fine regardless of density
CHUNK_MIN_TOKENS = int(os.getenv('CHUNK_MIN_TOKENS', 1000))
# Entity density threshold: chunk if estimated density > this value
Contributor:
The comment claims "Examples that trigger chunking at 0.15: AWS cost data (12mo), bulk data imports, entity-dense JSON" but this is not validated by tests. The test suite uses much lower thresholds (0.01, 0.05) in monkeypatched tests, which don't validate the actual production default of 0.15. Consider adding integration tests that verify the default 0.15 threshold with realistic data like the AWS cost example from the quickstart.
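
A sketch of what such an integration test could look like. It assumes a should_chunk_content(content, source) helper in graphiti_core.utils.content_chunking (hypothetical name and signature; substitute whatever this PR actually exports), and the dense JSON payload is synthetic, standing in for the AWS cost quickstart data:

# Sketch: exercise the shipped default threshold rather than a monkeypatched one.
# NOTE: should_chunk_content and its module path are assumed names for illustration.
import json

import pytest

from graphiti_core.helpers import CHUNK_DENSITY_THRESHOLD
from graphiti_core.utils.content_chunking import should_chunk_content  # hypothetical import


def make_dense_json(months: int = 12, services: int = 30) -> str:
    """Synthetic stand-in for the AWS cost quickstart data: many small entities."""
    rows = [
        {'month': f'2024-{m:02d}', 'service': f'service-{s}', 'cost_usd': round(m * s * 1.7, 2)}
        for m in range(1, months + 1)
        for s in range(services)
    ]
    return json.dumps(rows)


def test_default_threshold_chunks_entity_dense_json():
    assert CHUNK_DENSITY_THRESHOLD == pytest.approx(0.15)
    assert should_chunk_content(make_dense_json(), source='json')


def test_default_threshold_skips_prose():
    prose = 'The meeting covered the quarterly roadmap in some detail. ' * 200
    assert not should_chunk_content(prose, source='text')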

# For JSON: elements per 1000 tokens > threshold * 1000 (e.g., 0.15 = 150 elements/1000 tokens)
Comment on lines +39 to +48
Contributor:

Missing validation for environment variables. Invalid values could cause runtime errors or unexpected behavior:

  1. CHUNK_TOKEN_SIZE and CHUNK_OVERLAP_TOKENS: No min/max validation. Negative or zero values would break chunking logic.
  2. CHUNK_OVERLAP_TOKENS >= CHUNK_TOKEN_SIZE: This creates edge cases in chunking logic (see line 527 in content_chunking.py)
  3. CHUNK_DENSITY_THRESHOLD: No range validation (should be 0.0-1.0 or similar)

Consider adding validation or using Pydantic settings:

CHUNK_TOKEN_SIZE = max(100, int(os.getenv('CHUNK_TOKEN_SIZE', 3000)))
CHUNK_OVERLAP_TOKENS = max(0, min(CHUNK_TOKEN_SIZE // 2, int(os.getenv('CHUNK_OVERLAP_TOKENS', 200))))
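
As an illustration of the Pydantic route, a minimal sketch using pydantic-settings (not part of this PR; the env var names match the constants above, and the field names are their lowercased equivalents):

# Sketch only: validated chunking settings via pydantic-settings, not code from this PR.
from pydantic import Field, model_validator
from pydantic_settings import BaseSettings


class ChunkingSettings(BaseSettings):
    # pydantic-settings matches env vars to field names case-insensitively,
    # so CHUNK_TOKEN_SIZE populates chunk_token_size, and so on.
    chunk_token_size: int = Field(default=3000, ge=100, le=100_000)
    chunk_overlap_tokens: int = Field(default=200, ge=0)
    chunk_min_tokens: int = Field(default=1000, ge=0)
    chunk_density_threshold: float = Field(default=0.15, ge=0.0, le=1.0)

    @model_validator(mode='after')
    def overlap_smaller_than_chunk(self) -> 'ChunkingSettings':
        # An overlap at or above the chunk size would stall or loop the chunker.
        if self.chunk_overlap_tokens >= self.chunk_token_size:
            raise ValueError('CHUNK_OVERLAP_TOKENS must be smaller than CHUNK_TOKEN_SIZE')
        return self


chunking_settings = ChunkingSettings()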

Contributor:

Documentation inconsistency: The comment says "For JSON: elements per 1000 tokens > threshold * 1000", which with the default CHUNK_DENSITY_THRESHOLD=0.15 works out to 150 elements per 1000 tokens, but the shorthand "0.15 = 150 elements/1000 tokens" in line 49 reads as if the threshold itself equals the element count. The math is correct but the phrasing is confusing.

Consider clarifying:

# For JSON: elements per 1000 tokens > threshold * 1000 
# (e.g., 0.15 threshold = 150 elements per 1000 tokens triggers chunking)

# For Text: capitalized words per 1000 tokens > threshold * 500 (e.g., 0.15 = 75 caps/1000 tokens)
# Higher values = more conservative (less chunking), targets P95+ density cases
# Examples that trigger chunking at 0.15: AWS cost data (12mo), bulk data imports, entity-dense JSON
# Examples that DON'T chunk at 0.15: meeting transcripts, news articles, documentation
CHUNK_DENSITY_THRESHOLD = float(os.getenv('CHUNK_DENSITY_THRESHOLD', 0.15))


def parse_db_date(input_date: neo4j_time.DateTime | str | None) -> datetime | None:
if isinstance(input_date, neo4j_time.DateTime):
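
For readers following the thread, here is a rough sketch of how these constants plug into the density gate that the comments above describe. The real logic lives in content_chunking.py and will differ in detail; the token and entity counts below are crude proxies used only to make the arithmetic concrete:

# Rough sketch of the density gate described by the configuration comments.
# NOT the implementation from content_chunking.py; estimate_tokens is a crude
# ~4-characters-per-token heuristic, not the tokenizer the library actually uses.
import json

from graphiti_core.helpers import CHUNK_DENSITY_THRESHOLD, CHUNK_MIN_TOKENS


def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)


def _iter_leaves(node):
    if isinstance(node, dict):
        for value in node.values():
            yield from _iter_leaves(value)
    elif isinstance(node, list):
        for value in node:
            yield from _iter_leaves(value)
    else:
        yield node


def should_chunk(content: str, is_json: bool) -> bool:
    tokens = estimate_tokens(content)
    if tokens < CHUNK_MIN_TOKENS:
        return False  # short content processes fine regardless of density

    if is_json:
        # JSON: elements per 1000 tokens > threshold * 1000
        # (0.15 -> 150 elements per 1000 tokens triggers chunking).
        elements = sum(1 for _ in _iter_leaves(json.loads(content)))
        return (elements / tokens) * 1000 > CHUNK_DENSITY_THRESHOLD * 1000

    # Text: capitalized words per 1000 tokens > threshold * 500
    # (0.15 -> 75 capitalized words per 1000 tokens triggers chunking).
    caps = sum(1 for word in content.split() if word[:1].isupper())
    return (caps / tokens) * 1000 > CHUNK_DENSITY_THRESHOLD * 500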