feat(extraction): inline attribute extraction during entity extraction #1131
base: main
Currently, extracting entity attributes requires defining entity_types with
Pydantic models, triggering O(n) additional LLM calls. Without schemas, no
attributes are extracted.
The entity extraction pass already processes full episode context. This change
extends the extraction prompt to request attributes inline, achieving attribute
discovery with zero marginal LLM calls and no schema requirement.
Changes:
- Add optional attributes field to ExtractedEntity (defaults to {})
- Update extraction prompts to request attributes inline
- Pass extracted attributes to EntityNode on creation
- Merge new attributes into existing nodes during deduplication
Fully backwards compatible.
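A minimal sketch of the changes listed above, using stdlib dataclasses as stand-ins for the project's Pydantic models. The field layout, the helper names `node_from_extracted` and `merge_into_existing`, and the overwrite-on-collision policy are assumptions made for illustration, not the PR's actual code.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ExtractedEntity:
    # Hypothetical stand-in for the extraction result model.
    name: str
    entity_type_id: int
    # Defaults to {} so callers that never set attributes keep working.
    attributes: dict[str, Any] = field(default_factory=dict)


@dataclass
class EntityNode:
    # Hypothetical stand-in for the graph node.
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)


def node_from_extracted(entity: ExtractedEntity) -> EntityNode:
    # Copy the dict so the node does not share state with the extraction result.
    return EntityNode(name=entity.name, attributes=dict(entity.attributes))


def merge_into_existing(existing: EntityNode, duplicate: EntityNode) -> EntityNode:
    # Assumed policy: newly extracted values overwrite older ones on key collision.
    existing.attributes.update(duplicate.attributes)
    return existing


old = node_from_extracted(ExtractedEntity("Acme Corp", 0))
new = node_from_extracted(ExtractedEntity("Acme Corp", 0, {"employee_count": 150}))
merged = merge_into_existing(old, new)
```

Because `attributes` has a default, an entity constructed without it behaves exactly as before, which is the backwards-compatibility property the description claims.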
OpenAI's structured outputs require all properties to be in the
required array. Changed attributes from optional with default_factory
to required field. LLM returns {} when no attributes found.
OpenAI's structured outputs don't allow additionalProperties in objects. Changed attributes from dict[str, str] to list[EntityAttribute] with explicit key/value fields. Convert to dict when creating EntityNode.
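The list-to-dict conversion described in this commit could look roughly like the sketch below. `EntityAttribute` is modeled here as a stdlib dataclass rather than the Pydantic model the PR presumably uses, and `attributes_to_dict` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class EntityAttribute:
    # Explicit key/value fields avoid an open-ended object in the JSON schema.
    key: str
    value: str


def attributes_to_dict(attrs: list[EntityAttribute]) -> dict[str, str]:
    # If the same key appears twice, the later entry wins.
    return {attr.key: attr.value for attr in attrs}


attrs = [
    EntityAttribute("employee_count", "150"),
    EntityAttribute("funding", "50M USD"),
]
as_dict = attributes_to_dict(attrs)
empty = attributes_to_dict([])  # the no-attributes case
```

The list form keeps every property name fixed in the schema, while the dict form is restored only after the structured response has been parsed.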
Update: Encountered schema validation errors with OpenAI's default structured outputs configuration.
The issue:
Fix: Introduced
Reference: OpenAI Structured Outputs docs
When extracting monetary values, include the currency if stated (e.g., '50M USD'). If the currency is not explicitly mentioned, preserve the original format without assuming one.
Known limitation: Attribute conflict handling
During testing, I've observed that conflicting attributes across episodes accumulate rather than resolve: key collisions are last-write-wins, but different phrasing creates different keys.
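The behavior can be reproduced with plain dicts. This is an illustrative sketch only: `merge_attributes` is a hypothetical helper, and the keys and values are made up to show the effect.

```python
def merge_attributes(existing: dict[str, str], incoming: dict[str, str]) -> dict[str, str]:
    # Last-write-wins: incoming values replace existing ones under the same key.
    merged = dict(existing)
    merged.update(incoming)
    return merged


episode_1 = {"employee_count": "150"}
episode_2 = {"employee_count": "200"}       # same key: overwrites episode_1
episode_3 = {"number_of_employees": "180"}  # different phrasing: new key

after_2 = merge_attributes(episode_1, episode_2)
after_3 = merge_attributes(after_2, episode_3)
```

After the third episode the node carries both `employee_count` and `number_of_employees` with conflicting values, since nothing maps the two phrasings to one key.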
Summary
Extends ExtractedEntity with an optional attributes field, populated during entity extraction.
Type of Change
Objective
Entity attribute extraction currently requires defining entity_types with Pydantic models. This triggers O(n) additional LLM calls via _extract_entity_attributes for n entities. Without predefined schemas, no attributes are extracted.
The entity extraction pass already processes full episode context. Attribute identification is a natural byproduct of entity recognition. This change extends the extraction prompt to request attributes inline, achieving attribute discovery with zero marginal LLM calls and no schema requirement.
{"name": "Acme Corp", "entity_type_id": 0, "attributes": {"employee_count": 150}}
New attributes are also merged into existing nodes during deduplication.
Default empty dict ensures backward compatibility.
Testing
Breaking Changes
Checklist
make lint passes)
Related Issues
Closes #