What is the difference between nodes and entities in GraphRAG? #719

natoverse · 2024-07-25T20:23:08Z

natoverse
Jul 25, 2024
Maintainer

Copied from #567

Part 1 of this question is "what is the conceptual difference between nodes and entities?"
Part 2 of this question is "why are entities sometimes repeated twice in the nodes file?"

The output files in question are create_final_entities.parquet and create_final_nodes.parquet (create_final_relationships.parquet is also related).

GraphRAG extracts entities and relationships from text content and generates a graph. This graph is then used as the entry point for algorithms to summarize and answer questions about your dataset. When we extract entities, we create a canonical list of entities including the text units they were found within. This entity data is saved to create_final_entities.parquet.

We then combine the entities table and the relationships table to create a graph (network). Once we put each entity into the graph, it becomes a node in that graph (and the relationships become edges), and thus takes on new semantic meaning and analytic properties. You'll notice, for example, that in the nodes table each entity has a degree, x, y, and size. Degree is the node degree (connectedness), and x and y can be populated with a position in 2D coordinate space for visualizing the graph (see the configs for Node2Vec embeddings and UMAP). We use the degree to represent the size by default, so those two columns are equivalent (but you could use any measure you deem important to set the size of a node in a graph visualization).
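As a rough illustration of how an entity picks up graph properties once it becomes a node, here is a minimal sketch in plain Python. The rows are toy data, and the column names (`name`, `source`, `target`, `title`) are assumptions made for the example rather than a guaranteed schema; `build_nodes` is a hypothetical helper, not part of GraphRAG.

```python
from collections import Counter

# Toy stand-ins for the entities and relationships tables.
# Rows and column names are assumptions for illustration only.
entities = [
    {"id": "e1", "name": "ALICE"},
    {"id": "e2", "name": "BOB"},
    {"id": "e3", "name": "CAROL"},
]
relationships = [
    {"source": "ALICE", "target": "BOB"},
    {"source": "ALICE", "target": "CAROL"},
]


def build_nodes(entities, relationships):
    """Hypothetical helper: turn each entity into a node with graph properties."""
    # Node degree = number of edges touching the node.
    degree = Counter()
    for rel in relationships:
        degree[rel["source"]] += 1
        degree[rel["target"]] += 1

    nodes = []
    for ent in entities:
        d = degree[ent["name"]]
        nodes.append({
            "id": ent["id"],        # same id as the entity, usable as a join key
            "title": ent["name"],
            "degree": d,
            "size": d,              # size defaults to degree in this sketch
            "x": 0.0,               # placeholder layout coordinates
            "y": 0.0,
        })
    return nodes


nodes = build_nodes(entities, relationships)
for node in nodes:
    print(node["title"], node["degree"], node["size"])
```

The entity rows themselves are unchanged; the node rows simply carry extra columns that only make sense once the entity sits inside a graph.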

As for the duplication: one of the graph analysis steps we run is hierarchical community detection with Leiden. A community is assigned to every node at every level in the hierarchy (unless that node becomes too distinct and is "orphaned" at some depth). This results in one entry in the nodes table per node for each computed community level. So create_final_nodes.parquet has a one-to-many relationship with create_final_entities.parquet (one entity, many node rows), using the id field as the join key.
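To make the one-to-many shape concrete, here is a small sketch with toy node rows. The `level` and `community` column names, the row values, and both helper functions are assumptions for illustration; check the columns in your own create_final_nodes.parquet before relying on them.

```python
from collections import defaultdict

# Toy nodes table: one row per (entity, community level).
# Column names and values are assumptions for illustration only.
nodes = [
    {"id": "e1", "title": "ALICE", "level": 0, "community": "0"},
    {"id": "e1", "title": "ALICE", "level": 1, "community": "3"},
    {"id": "e2", "title": "BOB", "level": 0, "community": "0"},
    {"id": "e2", "title": "BOB", "level": 1, "community": "4"},
    # CAROL is "orphaned" below level 0, so she has a single row.
    {"id": "e3", "title": "CAROL", "level": 0, "community": "1"},
]


def rows_per_entity(nodes):
    """Group node rows by entity id to expose the one-to-many relationship."""
    grouped = defaultdict(list)
    for row in nodes:
        grouped[row["id"]].append(row)
    return dict(grouped)


def nodes_at_level(nodes, level):
    """Keep a single community level, which gives at most one row per entity."""
    return [row for row in nodes if row["level"] == level]


by_entity = rows_per_entity(nodes)
print(len(nodes), "node rows for", len(by_entity), "entities")
print([row["title"] for row in nodes_at_level(nodes, 1)])
```

Filtering to one level is usually the simplest way to get back to one row per entity when the repeated rows get in the way.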

To summarize: entities are canonical, nodes are a representation of those entities in graph space, and the duplication arises because we compute hierarchical communities and add an entry for each level to the nodes table.

kouskouss · 2024-07-29T14:59:56Z

kouskouss
Jul 29, 2024

Maybe I didn't fully understand your explanation, but how do you explain the number of nodes compared with the number of entities? Shouldn't all the entities be represented in graph space? In my case, the dataframe contains Entity count: 19670 and entity_embedding_df count: 2810, which is a big difference.
