Use LLM to deduplicate extracted similar entities during the insertion phase #2102
… extracted similar entities.
This is a highly anticipated feature, and I’ll be able to dedicate time to researching and testing it only after addressing my current tasks. Please resolve the conflicts with the main branch first. Thank you.
The conflict has been resolved. Thank you for your dedication and support to the project.
@danielaskdd
Hi @FloretKu, I have been reviewing this PR recently. May I ask why there are 2
Removed the duplicate naive_rag_response portion of the prompt.
Hi @Matt23-star, thank you for pointing out the issue. The duplicate prompt has now been removed. Your thorough review has greatly improved my PR. Are there any other concerns? Will it merge smoothly into LightRAG? @danielaskdd This PR has been open for quite a while; what are your thoughts on it? I think my PR can effectively merge similar entities with good accuracy, and I really hope it gets adopted by LightRAG.
Description
During the insertion phase, use LLM to deduplicate extracted similar entities.
Related Issues
#1323
Changes Made
1. Create a new deduplicate.py to handle the entire functionality.
2. Add a feature toggle and related configurations in lightrag.py.
3. Add relevant functions in operate.py.
4. Add relevant prompts and examples in prompt.py.
Checklist
Additional Notes
1. First, retrieve all entities to be inserted and cluster them (batch_size = 30) to ensure that the entities passed to the LLM are sufficiently similar, improving the accuracy of merging.
   Example cluster: ['汽车', '洋车', '車', '车', '新车', '车口', '洋车夫', '车份儿', '西安门大街人和车厂', '洋车厂子', '人和车厂', '车厂', '洋车界', '北平的洋车夫', '洋车夫派别', '年轻力壮的洋车夫', '年轻人力车夫', '车份儿和嚼谷', '买上车再说']
2. Only pass the entity_name to the LLM for preliminary merging, and return the initial merging results for the batch.
3. Attach the entity descriptions to the preliminary results and pass them to the LLM again to determine whether the entities should be merged.
4. The LLM returns the final result.
5. Throughout this process, similarity matching is performed on entity_name to ensure that every name in the results is a node that originally existed.
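
The non-LLM parts of the steps above (batching similar names in step 1 and the name check in step 5) could be sketched roughly as follows. This is only an illustration, not the code in deduplicate.py: the function names `cluster_entities` and `snap_to_existing`, the use of `difflib` for similarity, and both thresholds are assumptions made for the example. The LLM calls in steps 2 to 4 are left out.

```python
from difflib import SequenceMatcher, get_close_matches

BATCH_SIZE = 30  # batch size mentioned in the notes above


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (assumed metric, not the PR's)."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_entities(names, batch_size=BATCH_SIZE, threshold=0.3):
    """Greedy clustering sketch: add each name to the first cluster that
    still has room and whose first member is similar enough; otherwise
    start a new cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            has_room = len(cluster) < batch_size
            if has_room and name_similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            # No suitable cluster found: this name seeds a new one.
            clusters.append([name])
    return clusters


def snap_to_existing(llm_names, existing, cutoff=0.6):
    """Map each name returned by the LLM to the closest original name and
    drop names with no close match, so results only contain existing nodes."""
    resolved = []
    for name in llm_names:
        match = get_close_matches(name, existing, n=1, cutoff=cutoff)
        if match:
            resolved.append(match[0])
    return resolved
```

In this sketch, each cluster returned by `cluster_entities` would be sent to the LLM as one batch, and `snap_to_existing` would be applied to whatever names the LLM returns before any merge is performed.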