Use LLM to deduplicate extracted similar entities during the insertion phase #2102

FloretKu · 2025-09-15T16:59:37Z

Description

During the insertion phase, use an LLM to deduplicate similar extracted entities.

Related Issues

Changes Made

1. Create a new deduplicate.py to handle the entire functionality.

2. Add a feature toggle and related configuration in lightrag.py.

    enable_deduplication: bool = field(default=False)
    deduplication_config: dict[str, Any] = field(
        default_factory=lambda: {
            "strategy": "llm_based",  # Strategy name: currently only "llm_based" is implemented
            "llm_based": {
                "batch_size": get_env_value("DEDUP_BATCH_SIZE", 30, int),
                "similarity_threshold": get_env_value(
                    "DEDUP_SIMILARITY_THRESHOLD", 0.85, float
                ),
                "system_prompt": None,  # Use default if None
                "strictness_level": get_env_value(
                    "DEDUP_STRICTNESS_LEVEL", "strict", str
                ),  # "strict", "medium", "loose"
                # strict: merge nodes ONLY if they represent the exact same real-world concept (e.g., spelling variations, synonyms, or explicit duplicates). Never merge nodes that are merely topically related.
                # medium: merge nodes if they represent the same core concept, including near-synonyms or semantically equivalent phrasing.
                # loose: merge nodes if they represent the same thematic concept, including near-synonyms or semantically equivalent phrasing.
            },
            # Future strategies can be added here by extending the architecture
            # Example: "new_strategy": { ... }
        }
    )

3. Add relevant functions in operate.py.

    if deduplication_service and all_entities_for_dedup:
        logger.info(f"Starting comprehensive entity deduplication with {len(all_entities_for_dedup)} entities")
        # Extract deduplication configuration from global_config
        dedup_config_data = global_config.get("deduplication_config", {})
        strategy_name = dedup_config_data.get("strategy", "llm_based")
        strategy_config = dedup_config_data.get(strategy_name, {})

        # Create strategy-specific configuration using ConfigFactory
        try:
            from .duplicate import ConfigFactory

            dedup_config = ConfigFactory.create_config(
                strategy_name,
                {
                    "target_batch_size": strategy_config.get("batch_size", 30),
                    "similarity_threshold": strategy_config.get(
                        "similarity_threshold", 0.85
                    ),
                    "system_prompt": strategy_config.get("system_prompt"),
                    "strictness_level": strategy_config.get(
                        "strictness_level", "strict"
                    ),
                },
            )
    .........

4. Add relevant prompts and examples in prompt.py.

PROMPTS["goal_clean_strict"]
PROMPTS["goal_clean_medium"]
PROMPTS["goal_clean_loose"]
PROMPTS["goal_clean_examples"]
PROMPTS["name_only_analysis_instruction"]
PROMPTS["secondary_merge_verification"]
PROMPTS["secondary_verification_examples"]
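One way the strictness_level setting could map onto the goal_clean_* keys is sketched below. The helper name `select_goal_prompt` and the placeholder prompt texts are hypothetical and not taken from the PR; only the key names and the three strictness levels come from the description above.

```python
# Hypothetical sketch: pick the goal prompt that matches a strictness level.
# The prompt texts are placeholders, not the real prompts from prompt.py.
PROMPTS = {
    "goal_clean_strict": "<strict merge instructions>",
    "goal_clean_medium": "<medium merge instructions>",
    "goal_clean_loose": "<loose merge instructions>",
}

VALID_LEVELS = ("strict", "medium", "loose")


def select_goal_prompt(strictness_level: str = "strict") -> str:
    """Return the goal prompt for a strictness level, rejecting unknown values."""
    level = strictness_level.strip().lower()
    if level not in VALID_LEVELS:
        raise ValueError(
            f"strictness_level must be one of {VALID_LEVELS}, got {strictness_level!r}"
        )
    return PROMPTS[f"goal_clean_{level}"]
```

Failing loudly on an unknown level, rather than silently falling back to "strict", makes a mistyped DEDUP_STRICTNESS_LEVEL value visible at startup.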

Checklist

Changes tested locally
Code reviewed
Documentation updated (if necessary)
Unit tests added (if applicable)

Additional Notes

1. First, retrieve all entities to be inserted and cluster them (batch_size = 30) so that the entities passed to the LLM in each batch are sufficiently similar, which improves merge accuracy.

['汽车', '洋车', '車', '车', '新车', '车口', '洋车夫', '车份儿', '西安门大街人和车厂', '洋车厂子', '人和车厂', '车厂', '洋车界', '北平的洋车夫', '洋车夫派别', '年轻力壮的洋车夫', '年轻人力车夫', '车份儿和嚼谷', '买上车再说']
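The clustering in step 1 could be sketched roughly as follows. The PR description does not say which clustering method is used, so this example swaps in greedy grouping by character-level similarity from the standard library; the function name `cluster_names` and the `min_ratio` parameter are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def cluster_names(names, batch_size=30, min_ratio=0.3):
    """Greedily group names so each batch holds at most batch_size similar names."""
    batches = []
    for name in names:
        for batch in batches:
            # The first name in a batch acts as its seed for comparison.
            similar = SequenceMatcher(None, batch[0], name).ratio() >= min_ratio
            if len(batch) < batch_size and similar:
                batch.append(name)
                break
        else:
            # No existing batch is both similar and below the size cap.
            batches.append([name])
    return batches
```

Comparing only against the seed keeps the sketch short; a real implementation might compare against every member or use embeddings instead.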

2. Pass only the entity_name values to the LLM for a preliminary merge; the LLM returns the initial merge results for the batch.

{
    "merge": [
        {"summary": "车","keywords": ["汽车","車","车","新车"]},
        {"summary": "西安门大街人和车厂","keywords": ["西安门大街人和车厂","人和车厂","洋车厂子","车厂"]},
        ..........
    ]
}
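A reply in this shape still needs to be parsed and sanity-checked before it is used. The sketch below is one possible way to do that, assuming the LLM returns plain JSON with a "merge" list as in the example; the function name `parse_merge_response` is hypothetical.

```python
import json


def parse_merge_response(raw: str, batch_names: list[str]) -> list[dict]:
    """Parse the LLM reply and keep only merge groups built from names in the batch."""
    known = set(batch_names)
    groups = []
    for item in json.loads(raw).get("merge", []):
        # Drop names the LLM invented and remove repeats while keeping order.
        keywords = [k for k in dict.fromkeys(item.get("keywords", [])) if k in known]
        if len(keywords) >= 2:  # a single name leaves nothing to merge
            groups.append(
                {"summary": item.get("summary", keywords[0]), "keywords": keywords}
            )
    return groups
```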

3. Attach the descriptions to the preliminary results and pass them to the LLM again to decide whether the entities should really be merged.

1. 汽车
   Description: 祥子买的汽车是他辛勤工作的结果，也是他生活的象征和希望的来源。
2. 車
   Description: 祥子希望通过卖骆驼买一辆车。
3. 车
   Description: <SEP>祥子租赁了一辆破旧的车来练习拉车的技术。<SEP>祥子的车是他生活的依靠，他相信这辆车能产生烙饼和其他食物，是万能的土地。<SEP>祥子的车被兵匪劫走，成为他不幸经历的一部分。<SEP>祥子拥有的交通工具，被抢走后成为了他心中难以忘怀的事情。
4. 新车
   Description: 新车是指祥子想要购买的车辆，具有弓子软、铜活地道等特性。
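The numbered name/description listing for the second pass could be built along these lines. This is a sketch under assumptions: the function name `format_candidates` is hypothetical, and unlike the example above, which shows the raw stored string, it joins the `<SEP>`-separated fragments with spaces.

```python
def format_candidates(
    group: list[str], descriptions: dict[str, str], sep: str = "<SEP>"
) -> str:
    """Render a numbered name/description list for the second LLM pass."""
    lines = []
    for index, name in enumerate(group, start=1):
        # A stored description may hold several fragments joined by sep.
        parts = [p.strip() for p in descriptions.get(name, "").split(sep) if p.strip()]
        lines.append(f"{index}. {name}")
        lines.append(f"   Description: {' '.join(parts) or '(none)'}")
    return "\n".join(lines)
```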

4. The LLM returns the final result.

{
    "merge": [
        {"summary": "汽车","keywords": ["汽车", "車", "车"]}
    ]
}

5. Throughout this process, similarity matching is performed on entity_name to ensure that every name in the results corresponds to an originally existing node.
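The name check in step 5 might look like the sketch below, which maps each name returned by the LLM back to an existing node and discards anything that is not close enough. The function name `resolve_to_existing` is hypothetical, and using difflib with the 0.85 default from similarity_threshold is an assumption; the PR does not state which similarity measure it applies.

```python
from difflib import get_close_matches


def resolve_to_existing(name, existing_names, threshold=0.85):
    """Return the existing node name closest to name, or None if none is close enough."""
    if name in existing_names:
        return name
    # cutoff filters out candidates whose similarity ratio is below the threshold.
    matches = get_close_matches(name, existing_names, n=1, cutoff=threshold)
    return matches[0] if matches else None
```

Returning None for unmatched names lets the caller skip a merge group instead of creating a node that never existed in the graph.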

danielaskdd · 2025-09-22T10:35:24Z

This is a highly anticipated feature, and I’ll be able to dedicate time to researching and testing it only after addressing my current tasks. Please resolve the conflicts with the main branch first. Thank you.

FloretKu · 2025-09-25T06:39:14Z

The conflict has been resolved. Thank you for your dedication and support to the project.

FloretKu · 2025-10-27T08:00:12Z

@danielaskdd
The new conflicts have been resolved. Do you have any further suggestions for this PR?

Matt23-star · 2025-11-05T05:21:56Z

Hi @FloretKu, I have been reviewing this PR recently. May I ask why there are two PROMPTS["naive_rag_response"] entries in prompt.py?

FloretKu · 2025-11-30T02:25:57Z

Hi @Matt23-star, thank you for pointing out the issue. The duplicate prompt has now been removed. Your thorough review has greatly improved my PR. Are there any other concerns? Will it merge smoothly into LightRAG?

@danielaskdd This PR has been submitted for quite a while. What are your thoughts on it? I think my PR can effectively merge similar entities with good accuracy, and I really hope it gets adopted by LightRAG.

c38deb0 During the insertion phase, use a large language model to deduplicate extracted similar entities.

a5d1ce5 Merge branch 'main' into duplicate_dev

GeeekyBoy approved these changes Oct 2, 2025

42f4eeb Merge branch 'main' into duplicate_dev

cf6bed7 Merge branch 'main' into duplicate_dev

02bd7fd Delete the duplicate naive_rag_response prompt (Removed the duplicate naive_rag_response portion of the prompt.)
