
Conversation


@GGrassia GGrassia commented Oct 9, 2025

Description

Added custom metadata on chunks and nodes, with the ability to filter at query time so the knowledge base can be narrowed to the relevant context, improving both precision and speed.
Metadata are stored as a JSON string and indexed.
The metadata_filter class supports AND, OR, and NOT operators, nested metadata_filter instances for chained or hierarchical filters, and `[ ... ]` arrays for matching multiple possible values of a single metadata key.
I will gladly help with bug fixing and further development.
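The filter shape described above could be sketched roughly as follows. This is a minimal stdlib sketch, not the PR's actual code: the real class is Pydantic-based, and the field names `operator`, `conditions`, and `filters` are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MetadataFilter:
    """Hypothetical shape: `operator` is one of "and" / "or" / "not",
    `conditions` maps a metadata key to a value or a list of accepted
    values, and `filters` holds nested MetadataFilter instances."""
    operator: str = "and"
    conditions: dict[str, Any] = field(default_factory=dict)
    filters: list["MetadataFilter"] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Serialize recursively so the filter can be stored or sent as JSON.
        return {
            "operator": self.operator,
            "conditions": self.conditions,
            "filters": [f.to_dict() for f in self.filters],
        }


# A filter matching (department in {"hr", "legal"}) AND year == 2025,
# with a nested NOT filter excluding drafts.
flt = MetadataFilter(
    conditions={"department": ["hr", "legal"], "year": 2025},
    filters=[MetadataFilter(operator="not", conditions={"status": "draft"})],
)
as_json = json.dumps(flt.to_dict())
```

Nesting whole filter objects rather than flat condition lists is what makes chained or hierarchical filters expressible.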

Related Issues

As requested and discussed in issue #1985.

Changes Made

  • Added a Pydantic metadata_filter class
  • Added the metadata_filter class to all base query implementations for chunks
  • Added metadata management in chunk writing for Postgres
  • Added metadata as node properties for Neo4j
  • Added metadata filter building for postgres_impl and updated the chunk, entity, and relation queries to allow filtering
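For the Postgres side, the filter building mentioned above might look roughly like this: translating leaf conditions into a parameterized WHERE fragment over a JSONB `metadata` column. This is a sketch under assumptions; the column name, the `$n` placeholder style, and the function name are invented, not taken from the PR.

```python
from typing import Any


def build_where(conditions: dict[str, Any], params: list) -> str:
    """Translate leaf conditions into a WHERE fragment over a JSONB
    `metadata` column. Values go into `params` as placeholders; keys
    are assumed to come from trusted application code."""
    clauses = []
    for key, value in conditions.items():
        if isinstance(value, list):
            # A list means "any of these values" for one metadata key.
            ors = []
            for v in value:
                params.append(str(v))
                ors.append(f"metadata->>'{key}' = ${len(params)}")
            clauses.append("(" + " OR ".join(ors) + ")")
        else:
            params.append(str(value))
            clauses.append(f"metadata->>'{key}' = ${len(params)}")
    return " AND ".join(clauses)


params: list = []
sql = build_where({"department": ["hr", "legal"], "year": 2025}, params)
```

Keeping values out of the SQL string and in `params` is what lets the driver bind them safely; only keys, which the application controls, are interpolated.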

Checklist

  • Changes tested locally - (fully working in prod for our specific solution!)
  • Code reviewed
  • Documentation updated (if necessary)
  • Unit tests added (if applicable)


Giulio Grassia and others added 15 commits September 25, 2025 15:37
…querying

- Implement custom metadata insertion as node properties during file upload.
- Add basic metadata filtering functionality to query API

NOTE: While base.py has been modified, the base implementation is incomplete and untested. Only the Neo4j database has been properly implemented and tested.

WIP: Query API is temporarily mocked for debugging. Full implementation with complex AND/OR filtering capabilities is in development.

# Conflicts:
#	lightrag/base.py
#	lightrag/lightrag.py
#	lightrag/operate.py
Added a metadata filter dataclass for serializing and deserializing
complex filters to/from a JSON dict; added node filtering based on metadata
Added functioning (needs testing) metadata filtering on chunks for
queries. Fully implemented only on Postgres with pgvector and Neo4j
Added metadata management for chunks in querying for all vector DBs; ONLY
Postgres with pgvector has been fully implemented
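Since the Neo4j side stores metadata as node properties, filtering there can happen directly in the Cypher query. A minimal sketch of building such a query string is below; the `Chunk` label, the `md_` parameter prefix, and the function name are invented for illustration and are not the PR's actual code.

```python
def build_cypher_match(label: str, metadata: dict) -> str:
    """Build a MATCH clause filtering nodes by metadata properties.
    Values are bound via query parameters ($md_<key>) rather than
    inlined, so the driver handles escaping."""
    preds = " AND ".join(f"n.{k} = $md_{k}" for k in metadata)
    return f"MATCH (n:{label}) WHERE {preds} RETURN n"


q = build_cypher_match("Chunk", {"department": "hr", "year": 2025})
```

The accompanying parameter dict would then be passed to the Neo4j driver alongside the query (e.g. `{"md_department": "hr", "md_year": 2025}`).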

duynt88 commented Nov 10, 2025

Looking forward to this feature. Thank you so much for the effort.

@GGrassia (Author)

Looking forward to this feature. Thank you so much for the effort.

@duynt88 Thank you! It's being discussed because of a potential technical issue in data reliability, but if you pull the fork it's already working. With a large document base the issue becomes less and less prominent: we've reached >80% successful unstructured-data extraction (e.g. who's the executive manager for the xyz store, whether the X9000 certification is needed for a procedure, etc.) from a large corpus with the RAG alone, without any guardrails for the specific datum extracted, save for the metadata filtering that restricts the chunk pool to the documents we know might contain the datum. Give it a spin!
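The "restrict the chunk pool" idea described above can be illustrated with a toy in-memory filter; the chunk shape, field names, and example data are made up for the illustration and do not reflect LightRAG's internal structures.

```python
# Toy illustration: narrow the candidate chunks by metadata before any
# retrieval or ranking happens, so answers are drawn only from documents
# known to possibly contain the requested datum.
chunks = [
    {"text": "Store managers list ...", "metadata": {"doc_type": "org_chart"}},
    {"text": "X9000 certification steps ...", "metadata": {"doc_type": "procedure"}},
]


def restrict(chunks: list[dict], **wanted) -> list[dict]:
    """Keep only chunks whose metadata matches every requested key/value."""
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in wanted.items())
    ]


pool = restrict(chunks, doc_type="procedure")
```

In the real system this narrowing happens at the database level (SQL/Cypher), which is where the speed benefit comes from; the Python version above only shows the selection logic.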


mkwl commented Jan 22, 2026

I looked over your changes and noticed that you have implemented a token tracking feature, which doesn't look like part of the metadata filtering work. While I am not a maintainer of this project, I recommend you split these two features into separate PRs.

@GGrassia (Author)

@mkwl thank you for taking a look! This happened because I forked to make a single feature change, worked on the main branch (which is the one I opened the PR from), and then had to add something else.
While none of this excuses my mixed PR, it happened because of two things:

  1. I exchanged ideas with some of the maintainers, and the implementation seems to be error-prone in some cases (especially with a small document corpus). While I haven't experienced problems firsthand, I understand their desire to keep the library as accurate as possible, so I don't think my PR is ever going to be merged. This led to:
  2. The changes I made are deployed in a production environment for a custom solution we built! So when I was asked to add token tracking, I did it quick and dirty for our specific use case and pipeline, without generalizing at all. Those changes were never meant to be merged, and that is an oversight on my part: since the architecture was not approved (but we already had the system in place), I started using my repo for my own work, and I even thought this PR had been closed at some point.

So now I ask: since neither of my changes is, or seems to be, beneficial here, should I close this PR? Or should I just branch off the other features and leave this one clean, so that if the maintainers find a way to reuse my code tomorrow, they have quick access to it?

@chikenGhost

Issue #2555 discusses a similar idea, and I think the implementation would be quite similar. Perhaps you could move your current branch's implementation into a separate PR?

MilindAPOl added a commit to VitalVector/VV_school_LightRAG-fork that referenced this pull request Jan 28, 2026
…fork

This merge brings in the metadata filtering capabilities from PR HKUDS#2187
which enables database-level filtering for PostgreSQL (pgvector) and Neo4j.

Key changes:
- Added MetadataFilter support in postgres_impl.py
- Added MetadataFilter support in neo4j_impl.py
- Updated query parameters to support metadata_filter
- Added token tracking functionality

Conflicts resolved by accepting GGrassia's version to preserve
metadata filtering implementation which is the core feature we need.

Source: https://github.com/GGrassia/LightRAG
Original PR: HKUDS#2187
