Migrate to venomx metadata #81

Open
justaddcoffee opened this issue Sep 11, 2024 · 5 comments

@justaddcoffee
Member

Per convo with @caufieldjh @iQuxLE et al, change curategpt to use venomx for metadata

@justaddcoffee
Member Author

justaddcoffee commented Sep 12, 2024

@cmungall - what are your thoughts?

It seems like it'd make sense to align with venomx, since most (all?) of the metadata is going to be dataset and embedding-related

@cmungall
Member

venomx assumes each indexed object has a unique id

curategpt doesn't make any assumptions about indexed objects; each one can be any JSON object / Python dict.

some wrappers (e.g. ontology) have a primary key

but others, like the maxoa wrapper, return associations, which don't have a natural primary key

some options are

  1. relax the venomx model so that objects don't require a PK
  2. force everything in curgpt to have an ID, autogenerating if it doesn't exist

But I don't think either of these is ideal

I think it's best if we say the mapping to vx is only supported if the collection declares an identifier field

https://github.com/monarch-initiative/curate-gpt/blob/main/src/curate_gpt/store/db_adapter.py#L342-L353
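
A minimal sketch of the rule proposed above, assuming the collection's identifier field is available as an optional string. The function name and the `identifier_field` parameter are hypothetical and do not reproduce the linked `db_adapter.py` code:

```python
from typing import Any, Optional


def get_object_id(obj: dict[str, Any], identifier_field: Optional[str]) -> str:
    """Return the object's ID, or refuse if the collection declares no identifier field."""
    if identifier_field is None:
        # Hypothetical policy: no declared identifier field means no venomx mapping
        raise ValueError("collection declares no identifier field; venomx mapping unsupported")
    if identifier_field not in obj:
        raise KeyError(f"object is missing declared identifier field {identifier_field!r}")
    return str(obj[identifier_field])
```

With this shape, an ontology term like `{"id": "HP:0000118"}` maps cleanly, while an association dict without a declared identifier field is rejected rather than given an invented ID.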

@iQuxLE
Member

iQuxLE commented Sep 13, 2024

@cmungall
@justaddcoffee

> venomx assumes each indexed object has a unique id

Then it actually works well with DuckDB, as this also wants unique ids for each indexed object.
ChromaDB does not necessarily need this.

I kind of like idea 2:

> 2. force everything in curgpt to have an ID, autogenerating if it doesn't exist

Just a thought:
Can we use a UUID feature for this problem? For DuckDB this would mean a separate column; in ChromaDB I think it is already implemented.
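
A rough sketch of the UUID idea, assuming objects are plain dicts and the ID lives under a key named `id` (both assumptions; `ensure_uuid` is a hypothetical helper, not an existing curategpt function):

```python
import uuid
from typing import Any


def ensure_uuid(obj: dict[str, Any], id_field: str = "id") -> dict[str, Any]:
    """Return the object unchanged if it has an ID, else a copy with a random UUID."""
    if obj.get(id_field):
        return obj
    # uuid4 is random, so re-indexing the same object yields a different ID each time
    return {**obj, id_field: str(uuid.uuid4())}
```

Because `uuid4` values are random, the same association would get a new ID on every re-index, which matters if IDs need to be stable across runs.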


However, to begin with, we could also test this by just adding a field for venomx to the metadata rather than incorporating the whole venomx model/schema. This way we can try it out and roll back easily if needed.
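
One way the "just add a field" approach might look, as a sketch only. `CollectionMetadata` here is a stand-in dataclass with hypothetical fields, and the contents of the `venomx` payload are placeholders rather than the real venomx schema:

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class CollectionMetadata:
    """Stand-in for the existing metadata object (fields are assumptions)."""
    name: str
    description: Optional[str] = None
    # Hypothetical optional slot holding venomx-style metadata as an opaque dict
    venomx: Optional[dict[str, Any]] = None


meta = CollectionMetadata(name="ont_hp", venomx={"placeholder_key": "placeholder value"})

# Rolling back is just dropping the optional field; everything else is untouched
rolled_back = CollectionMetadata(**{**asdict(meta), "venomx": None})
```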

@justaddcoffee
Member Author

justaddcoffee commented Sep 13, 2024

I kind of like 2) also. For collections that have IDs it works fine, and for those that do not have IDs, it doesn't seem like it hurts anything. Maybe we can mint them using a hash function of all the fields so they are deterministic?

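import hashlib


def make_md5_id(data):
    # Concatenate data fields into a single string
    # (field1..field3 are placeholders for the collection's actual fields)
    concatenated_data = f"{data['field1']}|{data['field2']}|{data['field3']}"

    # Create an MD5 hash and return its hex digest as the ID
    id_hash = hashlib.md5(concatenated_data.encode()).hexdigest()

    return id_hash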

(or is that too slow)
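
A variant that hashes all fields without hard-coding their names might look like the sketch below. It assumes the object is JSON-serializable (falling back to `str()` for anything that isn't), and `make_deterministic_id` is a hypothetical name:

```python
import hashlib
import json
from typing import Any


def make_deterministic_id(data: dict[str, Any]) -> str:
    """Hash a canonical JSON form of the object so key order doesn't change the ID."""
    # sort_keys gives a stable ordering; default=str is a fallback for non-JSON values
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

Serializing to JSON instead of joining with `|` also avoids collisions when a value itself contains the separator, e.g. `"a|b"`, `"c"` versus `"a"`, `"b|c"`.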

@caufieldjh
Member

I'm hesitant to include autogenerated identifiers if the process is opaque to users. If the ID is just made by CurateGPT for purposes of fitting the metadata model, then it isn't clear whether it refers to some original source or to the newly created data (though in this case it will be the latter). It works in the KGs because most edges don't start with IDs, but in this setting there's likely to be a mishmash of different sources with and without IDs, plus the newly generated things.
Perhaps a user-defined toggle for ID generation would work.
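
A sketch of what such a toggle could look like, with generated IDs made visibly distinct from source IDs. The `generate_ids` flag, the `generated:` prefix, and the `id_source` key are all hypothetical choices for illustration:

```python
import hashlib
import json
from typing import Any


def assign_id(obj: dict[str, Any], id_field: str = "id", generate_ids: bool = False) -> dict[str, Any]:
    """Keep source IDs as-is; mint a marked ID only when the user opts in."""
    if obj.get(id_field) is not None:
        return {**obj, "id_source": "original"}
    if not generate_ids:
        # Toggle is off: leave the object without an ID
        return dict(obj)
    canonical = json.dumps(obj, sort_keys=True, default=str)
    generated = "generated:" + hashlib.md5(canonical.encode("utf-8")).hexdigest()
    return {**obj, id_field: generated, "id_source": "generated"}
```

Recording where the ID came from is one way to address the opacity concern: a user looking at a record can tell whether the ID exists in the original source or was minted during indexing.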

@iQuxLE mentioned this issue Oct 11, 2024
caufieldjh added a commit that referenced this issue Nov 4, 2024