Incremental indexing (adding new content) #741

natoverse · 2024-07-26T19:59:14Z

A number of users have asked how to add new content to an existing index without needing to re-run the entire process. This is a feature we are planning; we are in the design stage now to ensure we have an efficient approach.

As it stands, new content can be added to a GraphRAG index without requiring a complete re-index. This is because we rely heavily on a cache to avoid repeating the same calls to the model API. There are several stages within the pipeline for which this is very efficient - namely those stages that are atomic and do not have upstream dependencies. For example, if you add new documents to a folder, we will not re-chunk existing documents or perform entity and relationship extraction on the existing content; we will simply fetch the processed content from the cache and pass it on. The new documents will be processed and new entities and relationships extracted. Downstream of this, the graph construction process will need to recreate the graph to include the new nodes and edges, and communities will be recomputed - resulting in re-summarization, etc. You can get a sense of this process and what downstream steps may be re-processed by looking at the data flow diagram.
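
To make the caching behavior described above concrete, here is a minimal sketch of a content-keyed cache sitting in front of an expensive extraction call. It is not GraphRAG's actual cache implementation: `ExtractionCache`, `fake_extract`, and the SHA-256 key are hypothetical names and choices made only for illustration.

```python
import hashlib
from typing import Callable, Dict, List


class ExtractionCache:
    """Hypothetical cache keyed by a hash of the chunk text (not GraphRAG's real cache)."""

    def __init__(self, extract: Callable[[str], List[str]]) -> None:
        self._extract = extract
        self._store: Dict[str, List[str]] = {}
        self.model_calls = 0  # counts how often the expensive call actually runs

    def get(self, chunk: str) -> List[str]:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Cache miss: only new content reaches the (simulated) model API.
            self.model_calls += 1
            self._store[key] = self._extract(chunk)
        return self._store[key]


def fake_extract(chunk: str) -> List[str]:
    """Stand-in for the model API: treats capitalized words as entities."""
    return sorted({w.strip(".,") for w in chunk.split() if w[:1].isupper()})


cache = ExtractionCache(fake_extract)
existing = ["Alice met Bob in Paris.", "Bob works at Contoso."]

# First indexing run: every chunk is new, so every chunk costs a call.
for chunk in existing:
    cache.get(chunk)
calls_after_first_run = cache.model_calls

# Second run with one added document: only the new chunk costs a call.
for chunk in existing + ["Carol joined Contoso."]:
    cache.get(chunk)
calls_after_second_run = cache.model_calls
```

In this sketch the second run reuses the cached results for the two existing chunks, which mirrors why extraction is cheap on re-index while the downstream graph and community steps still have to run again.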

Describe the solution you'd like

An ideal solution would be to add a new command to GraphRAG such as update that can be run against new data and augment an existing index. Considerations here include things such as evaluating the new entities to determine if they can be added to an existing community, and when those communities have been altered enough to constitute a "drift" that needs recomputing. We could also perform analysis to determine which communities have been edited, such that we ignore summarization on those that haven't changed.

Additional context

We also need to consider the types of analysis incremental ingest can enable beyond just "updates". For example, daily ingest of news with thoughtful graph construction/annotation could allow for delta analysis, enabling questions like "what happened with person x in the last 24 hours" or "catch me up on the news themes this week".

Some desired types of editing users have described in other issues:

Adding new documents
Removing old documents
Editing the graph itself

Scope

For now we are going to limit the scope of this feature to just an incremental index update to append content, and not worry about removal, manual graph editing, or the metadata tagging that would be required to do delta-style queries.

Approach

Putting here a little more detail on the approach we've discussed. It largely echoes what I put above as ideas, but I'll repeat for clarity:

We will create a new graphrag.append command to run updates that add content. The reason for a new command is so that the original graphrag.index is predictable in its behavior, i.e., that users know that communities will always be recomputed so they don't have to worry about model drift.
The append command will try to minimize community recomputes so that summarization is not performed again. If certain thresholds are met, a recompute may be required, so the worst case degrades to the same performance as a normal indexing run.
The first efficiency optimization will be to attempt to place all new entities into an existing community rather than re-running Leiden and triggering updates for everything.
We will only run summarization on those communities whose membership has changed, i.e., their new entity inputs should trigger resummarization in order to account for the new content.
We will establish user-configurable thresholds to determine when Leiden must be re-run, such as the number of new entities that don't find an existing community, or possibly a measure of the modularity change of the graph (TBD).
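
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the planned implementation: the neighbor-majority placement rule, the function names `place_new_entities` and `needs_full_recompute`, and the `max_unplaced_ratio` threshold are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def place_new_entities(
    communities: Dict[str, int],
    new_edges: Iterable[Tuple[str, str]],
    new_entities: Iterable[str],
) -> Tuple[Dict[str, int], Set[int], List[str]]:
    """Place each new entity in the community most common among its placed neighbors.

    Returns the updated assignment, the ids of communities whose membership
    changed (candidates for re-summarization), and the entities left unplaced.
    """
    assignment = dict(communities)
    neighbors: Dict[str, List[str]] = {}
    for a, b in new_edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    changed: Set[int] = set()
    unplaced: List[str] = []
    for entity in new_entities:
        # Only neighbors that already have a community get a vote.
        votes = Counter(
            assignment[n] for n in neighbors.get(entity, []) if n in assignment
        )
        if votes:
            community = votes.most_common(1)[0][0]
            assignment[entity] = community
            changed.add(community)
        else:
            unplaced.append(entity)
    return assignment, changed, unplaced


def needs_full_recompute(
    unplaced: List[str], new_count: int, max_unplaced_ratio: float
) -> bool:
    """Hypothetical threshold: re-run clustering if too many new entities are unplaced."""
    return new_count > 0 and len(unplaced) / new_count > max_unplaced_ratio


existing = {"A": 0, "B": 0, "C": 1, "D": 2}
edges = [("E", "A"), ("E", "B"), ("E", "C"), ("F", "G")]
assignment, changed, unplaced = place_new_entities(existing, edges, ["E", "F", "G"])
rerun = needs_full_recompute(unplaced, new_count=3, max_unplaced_ratio=0.5)
```

Here only community 0 would be re-summarized, while `F` and `G` (connected only to each other) find no home and push the unplaced ratio over the example threshold. A modularity-based trigger, which is still TBD above, is not modeled.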


natoverse · 2024-07-26T19:59:57Z

This is a popular request, so I'm going to pin it and route other issues here.

natoverse · 2024-07-26T20:02:09Z

Related: removing existing content, e.g., #585

KylinMountain · 2024-07-26T23:41:08Z

@natoverse
Can we split indexing into graph building and community summarization? Lots of people ask me whether the knowledge graph can be modified manually, since an extracted entity or relationship is sometimes wrong.

KylinMountain · 2024-07-26T23:44:01Z

If you modify the llm params in settings.yaml, the entire cache will be invalidated.
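
One plausible reason for this, sketched under the assumption that cache keys are derived from both the prompt and the model parameters (the thread does not show how GraphRAG actually builds its keys; `cache_key` and the parameter names are hypothetical):

```python
import hashlib
import json
from typing import Any, Dict


def cache_key(prompt: str, llm_params: Dict[str, Any]) -> str:
    """Hypothetical key: a hash over the prompt plus all model parameters."""
    payload = json.dumps({"prompt": prompt, "params": llm_params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


prompt = "Extract entities from: Alice met Bob."
base = {"model": "example-model", "temperature": 0.0}

key_a = cache_key(prompt, base)
# Changing any parameter yields a different key, so old entries are never hit.
key_b = cache_key(prompt, {**base, "temperature": 0.2})
# Reordering the same parameters does not change the key (sort_keys=True).
key_c = cache_key(prompt, {"temperature": 0.0, "model": "example-model"})
```

With keys built this way, the old entries are still on disk after a parameter change, but no lookup matches them, which looks like a fully invalidated cache.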

natoverse · 2024-08-01T17:53:25Z

Additional use case: adding files of a different type: #784

ljhskyso · 2024-08-07T00:35:55Z

Any ETA on this feature? Need this to assess whether I need to implement my own solution. @natoverse

shaoqing404 · 2024-08-12T09:40:48Z

I think it is more urgent to move the cache from files to Milvus (or another vector DB). The bottleneck in graphrag's overall query latency is largely IO on relatively large files.

vishyarjun · 2024-08-27T17:57:11Z

There are two distinct scenarios for adding documents to the index, each requiring a different approach to community management and querying:

Scenario 1: Siloed Document Communities

Concept: Each new document creates its own independent community within the index. This is like having separate folders, each with its own set of files.
Community Management: Communities are created and maintained separately for each document. We'll need a mechanism to track which documents belong to which community.
Querying: Queries are directed at a specific document community. The query pipeline only searches within that community, ensuring results are exclusively from the selected document.

Scenario 2: Unified Document Collection

Concept: New documents are integrated into existing communities, enriching the existing knowledge base. This is like adding new files to a shared folder.
Community Management: Communities are managed at a higher level than individual documents. New content is tagged or categorized to associate it with the appropriate community.
Querying: Queries search across all communities (or a specified subset). Results may come from a mix of documents.
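
As a rough sketch of how the two query scopes could differ, assuming each community records which documents contributed to it (the `Community` type and both helper functions are hypothetical, not part of GraphRAG):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Community:
    community_id: int
    summary: str
    document_ids: Set[str] = field(default_factory=set)


def siloed_scope(communities: List[Community], document_id: str) -> List[Community]:
    """Scenario 1: only communities built exclusively from the selected document."""
    return [c for c in communities if c.document_ids == {document_id}]


def unified_scope(
    communities: List[Community], subset: Optional[Set[str]] = None
) -> List[Community]:
    """Scenario 2: all communities, or those touching any document in `subset`."""
    if subset is None:
        return list(communities)
    return [c for c in communities if c.document_ids & subset]


communities = [
    Community(0, "Summary built from doc1 only", {"doc1"}),
    Community(1, "Summary built from doc2 only", {"doc2"}),
    Community(2, "Summary mixing both documents", {"doc1", "doc2"}),
]

siloed_ids = [c.community_id for c in siloed_scope(communities, "doc1")]
unified_ids = [c.community_id for c in unified_scope(communities)]
subset_ids = [c.community_id for c in unified_scope(communities, {"doc2"})]
```

The siloed scope excludes the mixed community 2, which is what keeps results exclusive to one document; the unified scope includes it.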

gusye1234 · 2024-09-07T03:02:37Z

Hi, maybe check out this repo; it supports incremental insertion of entities and relationships. It also computes a hash of the docs, so only new docs are inserted each time you insert.

shandianshan · 2024-09-11T14:02:37Z

@gusye1234 Hi, thank you for sharing your work. Could you please explain how you addressed the issue of integrating new entities into the community?

gusye1234 · 2024-09-13T09:30:33Z

Sure. nano-graphrag uses the md5 hash of docs and chunks as their key. When insertion begins, docs and chunks that already exist are ignored and only the new chunks continue to be inserted.
nano-graphrag automatically loads the previous graph from the working dir, and the new entities and relationships are added to the current graph.

However, every time you insert, nano-graphrag will still re-compute the communities and generate new community reports. The incremental update of communities is not yet implemented.
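
A minimal sketch of the hash-keyed deduplication described above. This is not nano-graphrag's actual code: `content_key`, `insert_new`, the `doc-` prefix, and the whitespace stripping are assumptions for illustration.

```python
import hashlib
from typing import Dict, Iterable, List


def content_key(text: str, prefix: str = "doc-") -> str:
    """Derive a stable key from the content itself (md5, as mentioned above)."""
    return prefix + hashlib.md5(text.strip().encode("utf-8")).hexdigest()


def insert_new(store: Dict[str, str], docs: Iterable[str]) -> List[str]:
    """Insert only docs whose key is not already in the store; return the new keys."""
    added: List[str] = []
    for doc in docs:
        key = content_key(doc)
        if key in store:
            continue  # identical content was inserted before, so skip it
        store[key] = doc
        added.append(key)
    return added


store: Dict[str, str] = {}
first = insert_new(store, ["Alice met Bob.", "Bob works at Contoso."])
second = insert_new(store, ["Bob works at Contoso.", "Carol joined Contoso."])
```

Only the keys in `second` would need extraction; as noted above, community detection and reports would still be recomputed over the whole graph.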

Tipik1n · 2024-09-13T11:45:41Z

The lack of this feature is also holding me back (and probably many more) from fully committing to using this repo as a solution in the gen AI space. Would love to see how you implement this.

yaroslavyaroslav · 2024-09-13T13:01:45Z

Hi, maybe checkout this repo,

@gusye1234 If you changed this repo by deleting the branch specification, it would be accessible as the GitHub repo from within the GitHub mobile app, which I suppose would increase the number of stars by making it easier for mobile users. Because right now, it just opens the working directory instead.

JViggiani · 2024-09-24T07:49:05Z

+1 to feature request. I'd need this feature to use graphrag as a solution to my problem space

SZabolotnii · 2024-09-24T09:34:23Z

+1 to feature request, it's a critical missing component.

mellanon · 2024-09-27T01:55:48Z

+1 to feature request. It was quite expensive to index my graph. I have one more document that needs to be added to the graph. A bit annoying but I'll leave it until this feature is available so that I don't end up triggering an expensive re-indexing exercise.

Andrew-00 · 2024-09-28T16:31:55Z

I agree with everyone here. Incremental indexing is critical and important.

KennyDizi · 2024-10-03T06:13:42Z

+1

wangiii · 2024-10-08T12:08:32Z

+1

zh-nj · 2024-10-10T07:39:26Z

+1

owquresh · 2024-10-21T05:11:29Z

How can we handle cases where our data keeps changing over time? Not just added documents but changes being made to documents?

DanAIDev · 2024-10-21T12:24:21Z

+1

aerwin-ds · 2024-10-28T18:56:32Z

Are there any updates for when a solution for incremental indexing will be available in the official repo?

AlonsoGuevara · 2024-11-01T00:25:48Z

Folks,
In #1318 we merged our first version for Incremental indexing. I'm working on adding documentation, but those running from source are more than welcome to try it and report any issue.

antoniocirclemind · 2024-11-02T03:30:43Z

Hi, it might be worth mentioning this repo too. We just open-sourced it and it supports incremental updates.

arastogi1111 · 2024-12-03T06:06:57Z

@AlonsoGuevara The update command seems to be failing at creating the final documents. It takes its sweet time to index the new document(s) and add them to the existing graph, and it populates the delta folder, but it finally hits a KeyError: "['title'] not in index" at: /usr/local/lib/python3.10/dist-packages/graphrag/index/update/incremental_index.py:281 in _update_entities

MarkusGutjahr · 2024-12-04T09:33:53Z

I'm also getting a similar error after updating to the newest version of graphrag.

MarkusGutjahr · 2024-12-04T09:59:27Z

When I just run the indexing process with no existing files, I get an error about a missing file ("ValueError: Could not find create_final_documents.parquet in storage!").

When I use an existing indexing result as the storage (as basedir in the settings.yaml), it starts the indexer process but fails with: KeyError: "['title'] not in index"

Also, there seems to be no difference between using "graphrag index" and "graphrag update"; both just start indexing with the update process.

So right now, with the newest version, I can't update the existing files and also can't create new files (the indexer is generally not working).

UPDATED:
To make the normal indexer work properly for creating new files, the "update_index_storage:" entry in the settings.yaml needs to be removed.
Then, to update existing graph data, it needs to be added back, and the basedir of "storage:" changed to the already created indexer results.
With this I got it to work.

LuWei6896 · 2024-12-05T07:34:39Z

I solved the problem. You need to generate a new settings.yaml; this error is caused by format and syntax errors in the configuration file.

MarkusGutjahr · 2024-12-05T10:32:13Z

Yes, for testing I just added and removed the part related to "update_index_storage:".
Later I also created 2 separate files for normal indexing and updating.

radiant-tangent · 2024-12-17T00:00:42Z

Since the scope for this issue is limited to adding documents, are there plans in the future to support updating the graph for documents that are removed? I see a lot of issues related to removal got directed to this issue.

natoverse added the enhancement New feature or request label Jul 26, 2024

natoverse pinned this issue Jul 26, 2024

This was referenced Jul 26, 2024

Time-Based GraphRag #713

Open

[Issue]: What is the process for removing and revising outdated documents? #585

Closed

AlonsoGuevara self-assigned this Jul 30, 2024

natoverse mentioned this issue Aug 1, 2024

[Issue]: If I have already index a .txt file, can I still index a .csv file and merge it to the existing graph? #784

Closed

natoverse mentioned this issue Aug 5, 2024

[Issue]: Does cache still work if extracting one-file graphRAG from a multiple files graphRAG? #819

Closed

natoverse mentioned this issue Aug 12, 2024

[Issue]: Problem generating .parquet files #893

Closed

This was referenced Aug 19, 2024

[Feature Request]: Merging artefacts parquet files. #958

Closed

[Feature Request]: the current version does not support incremental insertion, but it is very necessary! #1008

Closed

wanchichang mentioned this issue Sep 3, 2024

How to add a new document on the indexed graphrag? #1081

Closed

3 tasks

This was referenced Sep 9, 2024

[Feature Request]: <title> is_update_run Incremental updates are a useful feature #1098

Closed

[Feature Request]: Incremental construction of knowledge graph #1100

Closed

natoverse mentioned this issue Sep 23, 2024

[Issue]: how to add new file via Increamental Indexing now ? #1192

Closed

AlonsoGuevara mentioned this issue Oct 24, 2024

Add Incremental Indexing v1 #1318

Merged

4 tasks

MB-Finski mentioned this issue Jan 9, 2025

Feature: Add support for knowledge graphs nextcloud/context_chat_backend#129

Open
