Why do we keep the highest community level? #716
natoverse asked this question in Algorithm + Paper
Copied from #600
Question text
(partial, see original issue for code snippets and screenshots)
I think we should keep the community id list, not the level.
Since the community id is required later when calculating the report weight, keeping the level does not seem to make sense.
Answer
The GraphRAG community hierarchies are generated with Leiden, and deeper levels of the hierarchy contain fewer, more closely connected entities. As the clusters get more tightly focused, more and more entities may get left out of any cluster and therefore be assigned no community. So you could have an entity with community 12 at the root level, community 324 at level 1, and then no community at level 2 because it isn't close enough to any of the level 2 clusters.
The GraphRAG query methods include a `community_level` parameter that lets users specify which level of the hierarchy should be targeted for summarization. However, because not all entities have a community assignment at every level, we treat this value as the preferred maximum depth. If an entity does not have a community at that level, we move up one level at a time until we find an assignment. In the code, this is achieved with two steps:
1. `entity_df = _filter_under_community_level(entity_df, community_level)`
2. `entity_df = entity_df.groupby(["name", "rank"]).agg({"community": "max"}).reset_index()`
Step 2 works because the hierarchical clustering always assigns increasing id numbers at deeper depths, so the result is all entities being filtered to the maximum available depth for each, up to your requested depth.
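To make the two steps concrete, here is a minimal sketch on toy data. It assumes the entity table has one row per (entity, level) pair with integer `level` and `community` columns. `filter_under_community_level` is a hypothetical stand-in for the internal `_filter_under_community_level` helper, and the entity names, ranks, and community id 901 are invented for illustration. Treat it as an approximation of the idea, not the actual GraphRAG implementation.

```python
import pandas as pd


def filter_under_community_level(df: pd.DataFrame, community_level: int) -> pd.DataFrame:
    """Hypothetical stand-in for the internal helper (an assumption):
    keep only rows at or above the requested depth in the hierarchy."""
    return df[df["level"] <= community_level]


def resolve_communities(entity_df: pd.DataFrame, community_level: int) -> pd.DataFrame:
    """Pick, for each entity, its deepest available community up to community_level."""
    # Step 1: drop assignments deeper than the requested level.
    filtered = filter_under_community_level(entity_df, community_level)
    # Step 2: ids increase with depth, so the max id is the deepest assignment left.
    return filtered.groupby(["name", "rank"]).agg({"community": "max"}).reset_index()


# Toy data: "alice" has no community at level 2, "bob" does.
entity_df = pd.DataFrame(
    {
        "name": ["alice", "alice", "bob", "bob", "bob"],
        "rank": [5, 5, 3, 3, 3],
        "level": [0, 1, 0, 1, 2],
        "community": [12, 324, 12, 324, 901],
    }
)

resolved = resolve_communities(entity_df, community_level=2)
print(resolved)
# alice falls back to community 324 (level 1); bob resolves to 901 (level 2).
```

Requesting `community_level=1` in this sketch would resolve both entities to community 324, and `community_level=0` would resolve both to the root community 12.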