# DOMAIN_AGNOSTIC_IMPROVEMENT_PLAN.md

*749 additions; large diff not rendered.*

# PR_DESCRIPTION.md
## Summary

This PR adds experimental support for **Apache TinkerPop Gremlin** as an alternative query language for AWS Neptune Database, alongside the existing openCypher support. This enables users to choose their preferred query language and opens the door for future support of other Gremlin-compatible databases (Azure Cosmos DB, JanusGraph, DataStax Graph, etc.).

## Motivation

While Graphiti currently supports AWS Neptune Database using openCypher, Neptune also natively supports Gremlin, which:

- Is Neptune's native query language with potentially better performance for certain traversal patterns
- Provides an alternative query paradigm for users who prefer imperative traversal syntax
- Opens the door for broader database compatibility with the TinkerPop ecosystem

## Key Features

- βœ… `QueryLanguage` enum (CYPHER, GREMLIN) for explicit language selection
- βœ… Dual-mode `NeptuneDriver` supporting both Cypher and Gremlin
- βœ… Gremlin query generation functions for common graph operations
- βœ… Graceful degradation when `gremlinpython` is not installed
- βœ… 100% backward compatible (defaults to CYPHER)

## Implementation Details

### Core Infrastructure
- **graphiti_core/driver/driver.py**: Added `QueryLanguage` enum and `query_language` field to the base driver (see the sketch after this list)
- **graphiti_core/driver/neptune_driver.py**:
- Dual client initialization (Cypher via langchain-aws, Gremlin via gremlinpython)
- Query routing based on language selection
- Separate `_run_cypher_query()` and `_run_gremlin_query()` methods
- **graphiti_core/graph_queries.py**: 9 new Gremlin query generation functions:
- `gremlin_match_node_by_property()`
- `gremlin_match_nodes_by_uuids()`
- `gremlin_match_edge_by_property()`
- `gremlin_get_outgoing_edges()`
- `gremlin_bfs_traversal()`
- `gremlin_delete_all_nodes()`
- `gremlin_delete_nodes_by_group_id()`
- `gremlin_retrieve_episodes()`
- `gremlin_cosine_similarity_filter()` (placeholder)
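
As a rough illustration, the enum and one of the new query builders might look like the following minimal sketch (the function body and signature here are illustrative, not the exact diff):

```python
from enum import Enum


class QueryLanguage(Enum):
    """Query language selector added to the base driver (driver.py)."""

    CYPHER = 'cypher'
    GREMLIN = 'gremlin'


def gremlin_match_node_by_property(label: str, prop: str, value: str) -> str:
    """Build a Gremlin traversal that matches nodes by a single property.

    Values are inlined here for brevity; a real implementation should use
    parameter bindings to avoid injection issues.
    """
    return f"g.V().hasLabel('{label}').has('{prop}', '{value}').valueMap(true)"
```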

### Maintenance Operations
- **graphiti_core/utils/maintenance/graph_data_operations.py**: Updated `clear_data()` to support both query languages
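
A minimal sketch of the dual-language dispatch (builder signatures are assumed from the list above; not the exact diff):

```python
from graphiti_core.driver.driver import QueryLanguage
from graphiti_core.graph_queries import (
    gremlin_delete_all_nodes,
    gremlin_delete_nodes_by_group_id,
)


async def clear_data(driver, group_ids: list[str] | None = None) -> None:
    """Delete all nodes, optionally scoped to group_ids, in either language."""
    if driver.query_language == QueryLanguage.GREMLIN:
        query = (
            gremlin_delete_nodes_by_group_id(group_ids)
            if group_ids
            else gremlin_delete_all_nodes()
        )
        await driver.execute_query(query)
    elif group_ids:
        await driver.execute_query(
            'MATCH (n) WHERE n.group_id IN $group_ids DETACH DELETE n',
            group_ids=group_ids,
        )
    else:
        await driver.execute_query('MATCH (n) DETACH DELETE n')
```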

### Testing & Documentation
- **tests/test_neptune_gremlin_int.py**: Comprehensive integration tests
- **examples/quickstart/quickstart_neptune_gremlin.py**: Working usage example
- **examples/quickstart/README.md**: Updated with Gremlin instructions
- **GREMLIN_FEATURE.md**: Complete feature documentation

### Dependencies
- **pyproject.toml**: Added `gremlinpython>=3.7.0` to neptune and dev extras

## Usage Example

```python
from graphiti_core import Graphiti
from graphiti_core.driver.driver import QueryLanguage
from graphiti_core.driver.neptune_driver import NeptuneDriver
from graphiti_core.llm_client import OpenAIClient

# Create Neptune driver with Gremlin query language
driver = NeptuneDriver(
    host='neptune-db://your-cluster.amazonaws.com',
    aoss_host='your-aoss-cluster.amazonaws.com',
    query_language=QueryLanguage.GREMLIN,  # Use Gremlin instead of Cypher
)

llm_client = OpenAIClient()
graphiti = Graphiti(driver, llm_client)

# The high-level Graphiti API remains unchanged
await graphiti.build_indices_and_constraints()
await graphiti.add_episode(...)
results = await graphiti.search(...)
```

## Installation

```bash
# Install with Neptune and Gremlin support
pip install graphiti-core[neptune]
```

## Current Limitations

### Supported βœ…
- Basic graph operations (CRUD on nodes/edges)
- Graph traversal and BFS
- Maintenance operations (clear_data, delete by group_id)
- Neptune Database clusters

### Not Yet Supported ❌
- Neptune Analytics (supports only openCypher, so Gremlin mode does not apply)
- Direct Gremlin-based fulltext search (still uses OpenSearch)
- Direct Gremlin-based vector similarity (still uses OpenSearch)
- Complete `search_utils.py` Gremlin implementation (marked for future work)

### Why OpenSearch is Still Used

Neptune's Gremlin implementation doesn't include native fulltext search or vector similarity functions. These operations continue to use the existing OpenSearch (AOSS) integration, which provides:

- BM25 fulltext search across node/edge properties
- Vector similarity search via k-NN
- Hybrid search capabilities

This hybrid approach (Gremlin for graph traversal + OpenSearch for search) is a standard pattern for production Neptune applications.
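
Concretely, the retrieval path looks roughly like this (a sketch: the index name and helper wiring are hypothetical, while `gremlin_match_nodes_by_uuids` is one of the builders added in this PR):

```python
from graphiti_core.graph_queries import gremlin_match_nodes_by_uuids


async def hybrid_search(driver, aoss_client, query: str, limit: int = 10):
    """OpenSearch finds candidates; Gremlin hydrates them from the graph."""
    # 1. BM25 candidate retrieval from OpenSearch (AOSS); k-NN works the same way
    response = aoss_client.search(
        index='node_name_and_summary',  # hypothetical index name
        body={'size': limit, 'query': {'match': {'summary': query}}},
    )
    uuids = [hit['_source']['uuid'] for hit in response['hits']['hits']]

    # 2. Graph hydration and traversal via Gremlin
    return await driver.execute_query(gremlin_match_nodes_by_uuids(uuids))
```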

## Testing

- βœ… All existing unit tests pass (103/103)
- βœ… New integration tests for Gremlin operations
- βœ… Type checking passes with pyright
- βœ… Linting passes with ruff

```bash
# Run unit tests
uv run pytest tests/ -k "not _int"

# Run Gremlin integration tests (requires Neptune Database)
uv run pytest tests/test_neptune_gremlin_int.py
```

## Breaking Changes

**None.** This is fully backward compatible:
- Default query language is `CYPHER` (existing behavior unchanged)
- `gremlinpython` is an optional dependency
- All existing code continues to work without modifications

## Future Work

The following enhancements are planned for future iterations:

1. **Complete search_utils.py Gremlin Support**
- Implement Gremlin-specific versions of hybrid search functions
- May require custom Gremlin steps or continued OpenSearch integration

2. **Broader Database Support**
- Azure Cosmos DB (Gremlin API)
- JanusGraph
- DataStax Graph
- Any Apache TinkerPop 3.x compatible database

3. **Performance Benchmarking**
- Compare Cypher vs Gremlin performance on Neptune
- Identify optimal use cases for each language

## Checklist

- [x] Code follows project style guidelines (ruff formatting)
- [x] Type checking passes (pyright)
- [x] All tests pass
- [x] Documentation updated (README, examples, GREMLIN_FEATURE.md)
- [x] Backward compatibility maintained
- [x] No breaking changes

## Related Issues

This addresses feature requests for:
- Broader database compatibility
- Neptune Gremlin support
- Alternative query language options

## Additional Notes

See `GREMLIN_FEATURE.md` in the repository for complete technical documentation, including detailed implementation notes and architecture decisions.

---

πŸ€– Generated with [Claude Code](https://claude.com/claude-code)
# TODOs.md
### 7. Suggestions for Improving Graphiti's RAG Techniques

Graphiti already employs a sophisticated RAG pipeline that surpasses standard vector search by using a dynamic knowledge graph. The following suggestions aim to build upon this strong foundation by incorporating more advanced RAG strategies to enhance retrieval accuracy, contextual understanding, and overall performance.

---

#### 1. Advanced Query Pre-processing

The current approach of combining recent messages into a single search query is effective for context. However, it can be enhanced by adding a more structured query analysis step before retrieval.

**Current State:** Recent conversation history is concatenated into a single string for searching.
* **Source Snippet (`agent.ipynb`):**
```python name=examples/langgraph-agent/agent.ipynb url=https://github.com/getzep/graphiti/blob/main/examples/langgraph-agent/agent.ipynb
graphiti_query = f'{"SalesBot" if isinstance(last_message, AIMessage) else state["user_name"]}: {last_message.content}'
```

**Suggested Improvements:**

* **Query Decomposition:** For multi-faceted questions, an LLM call could break the query down into multiple sub-queries that are executed against the graph, with the results then synthesized (see the sketch after this list).
* **Example:** A query like *"What non-wool shoes would John like that are available in his size (10)?"* could be decomposed into:
1. `search("John's preferences")` -> Retrieves facts like `John LIKES "Basin Blue"`.
2. `search("John's shoe size")` -> Retrieves `John HAS_SHOE_SIZE "10"`.
3. `search("shoes NOT made of wool")` -> Retrieves products made of cotton, tree fiber, etc.
* **Benefit:** This provides more targeted retrieval than a single, complex query, reducing noise and improving the relevance of retrieved facts.

* **Hypothetical Document Embeddings (HyDE):** Before performing a semantic search, the LLM can generate a hypothetical "perfect" answer to the user's query. The embedding of this *hypothetical answer* is then used for the vector search, which often yields more relevant results than embedding the query itself.
* **Example:** For the query *"What makes the Men's Couriers special?"*, the LLM might generate a hypothetical fact: *"The Men's Couriers are special because they feature a retro silhouette and are made from natural materials like cotton."* The embedding of this sentence is then used to find real facts in the graph.
* **Benefit:** This bridges the gap between the "question space" and the "answer space" in vector embeddings, leading to better semantic matches.
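
Both ideas are small wrappers around the existing `graphiti.search()` API. A minimal sketch, where `llm` stands for any async prompt-to-string callable (the prompts and helper names are illustrative):

```python
async def decomposed_search(graphiti, llm, query: str) -> list:
    """Query decomposition: split a multi-faceted question into sub-queries."""
    prompt = f'Rewrite this question as independent search queries, one per line:\n{query}'
    sub_queries = [q.strip() for q in (await llm(prompt)).splitlines() if q.strip()]
    results = []
    for sub_query in sub_queries:
        results.extend(await graphiti.search(sub_query))
    return results


async def hyde_search(graphiti, llm, query: str) -> list:
    """HyDE: search with a hypothetical answer instead of the raw question."""
    hypothetical = await llm(f'Write a short, plausible answer to: {query}')
    # graphiti.search embeds whatever string it receives, so passing the
    # hypothetical answer text is enough to search in "answer space".
    return await graphiti.search(hypothetical)
```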

---

#### 2. Enhanced Graph-Native Retrieval and Summarization

Graphiti's core strength is the graph itself. Retrieval can be made even more powerful by leveraging graph topology more deeply.

**Current State:** The system uses graph traversal for re-ranking via `center_node_uuid` and detects communities (`build_communities` in the Neptune example).

**Suggested Improvements:**

* **Recursive Retrieval & Graph-Based Summarization:** Retrieval can be a multi-step process (see the sketch after this list).
1. **Initial Retrieval:** Retrieve a set of initial facts/nodes as is currently done.
2. **Neighbor Exploration:** For the top N initial nodes, automatically retrieve their direct neighbors. This can uncover critical context that wasn't captured by the initial query.
3. **LLM-Powered Summarization:** Pass the retrieved sub-graph (initial nodes + neighbors) to an LLM with a prompt like: *"Summarize the key information and relationships in the following set of facts: [facts list]"*.
* **Benefit:** This moves beyond retrieving a simple list of facts to retrieving a *synthesized insight* from a relevant portion of the graph, which is a much more compressed and potent form of context for the final generation step.

* **Leverage Pre-computed Communities:** The `build_communities` method is a powerful feature. These communities can be used to generate summary nodes *ahead of time*.
* **Implementation:** After running community detection, for each community, create a new `Summary` node. Use an LLM to generate a description for this node that summarizes the entities within that community.
* **Example:** A community of nodes related to "SuperLight Wool Runners" could have a summary node: *"This product family is characterized by its lightweight SuperLight Foam technology and wool-based materials."*
* **Benefit:** During retrieval, if the query matches this summary node, the system can retrieve a single, dense summary instead of dozens of individual product facts, leading to massive prompt compression and faster, more coherent context.
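
A minimal sketch of the recursive-retrieval idea, assuming the `gremlin_get_outgoing_edges` builder from the Gremlin PR and an async prompt-to-string `llm` callable (the same pattern extends to pre-computed community summaries):

```python
from graphiti_core.graph_queries import gremlin_get_outgoing_edges


async def retrieve_and_summarize(graphiti, driver, llm, query: str, top_n: int = 5) -> str:
    """Recursive retrieval: expand top hits by one hop, then compress with an LLM."""
    hits = await graphiti.search(query)
    facts = [edge.fact for edge in hits[:top_n]]

    # Neighbor exploration: pull the direct neighborhood of each top result
    for edge in hits[:top_n]:
        neighbors = await driver.execute_query(
            gremlin_get_outgoing_edges(edge.source_node_uuid)
        )
        facts.extend(str(record) for record in neighbors)

    # Compress the retrieved sub-graph into a single synthesized insight
    return await llm(
        'Summarize the key information and relationships in the following set of facts:\n'
        + '\n'.join(facts)
    )
```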

---

#### 3. Optimized Ranking and Post-processing

The final step of filtering and ranking retrieved facts is crucial for prompt quality.

**Current State:** Graphiti uses a hybrid search and likely some form of score fusion (like RRF, as hinted in `search_config_recipes`) to rank results.

**Suggested Improvements:**

* **LLM-based Re-ranking:** After retrieving an initial set of candidate facts (e.g., the top 20-30), use a smaller, faster LLM to perform a final re-ranking pass (see the sketch after this list).
* **Implementation:** The LLM would be prompted with the original query and each fact, and asked to output a relevance score (e.g., 1-10) or a simple "relevant/not relevant" judgment.
* **Benefit:** This can catch nuanced relevance that semantic or keyword scores might miss, providing a final layer of polish to the retrieved context. It is more expensive but can significantly improve the quality of the top-k results.

* **Structured Fact Output:** Instead of returning a flat list of fact strings, the retrieval endpoint could return a structured JSON object that preserves the `Subject-Predicate-Object` nature of the facts.
* **Example:**
```json
[
{ "subject": "John", "predicate": "IS_ALLERGIC_TO", "object": "wool" },
{ "subject": "John", "predicate": "HAS_SHOE_SIZE", "object": "10" }
]
```
* **Benefit:** This structured format is more easily parsed by the final generation LLM, which can be explicitly prompted to "pay attention to the relationships between subjects and objects." This is a more direct way of "prompting a graph" and can lead to more logical and accurate responses. This also helps the LLM differentiate between entities and their attributes more clearly.
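
A minimal sketch of the re-ranking pass, again with `llm` as an async prompt-to-string callable (per-fact calls are shown for clarity; batching all candidates into one prompt is cheaper):

```python
async def llm_rerank(llm, query: str, facts: list[str], top_k: int = 10) -> list[str]:
    """Score each candidate fact for relevance and keep the best top_k."""
    scored = []
    for fact in facts:
        raw = await llm(
            'On a scale of 1-10, how relevant is this fact to the query? '
            f'Answer with a single number.\nQuery: {query}\nFact: {fact}'
        )
        try:
            score = float(raw.strip())
        except ValueError:
            score = 0.0  # treat unparseable output as irrelevant
        scored.append((score, fact))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [fact for _, fact in scored[:top_k]]
```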