Set node metadata schema #203
Comments
I just remembered tskit-dev/tsinfer#416, in which the metadata schema is set for the node table for tree sequences produced by tsinfer. I noted there that if, for efficiency, we have a tsinfer-specific struct schema for node metadata, we should probably check for it in tsdate and amend the schema to add the necessary fields that tsdate wants to set. One (very minor) issue with this is that currently tsdate does not require tsinfer to be installed. It seems a bit of an overkill (edit) - perhaps what we can do is hard-code |
Just to note here that we're supposed to be working with things that aren't tsinfer too, so we need to be very careful about assumptions we make about metadata. |
Thanks for flagging this up @hyanwong - can I clarify that the aim here is to add tsdate-specific attributes to the possibly-preexisting metadata? If so, this would seem to be a great use case to consider as we add the higher level metadata API. |
Yes, exactly.
Yes, absolutely. I have been thinking about this, and I think it's all compatible. The idea would be that we ask forgiveness rather than permission: if it is a struct type, we somehow extend the schema (see below); if it is a JSON type, we attempt to add the extra fields. If either fails, we issue a warning but go ahead with the dating anyway (since the metadata additions are an optional nicety).
For a struct type, I presume we could either (a) check if it matches against an existing type, and if so, create a derived struct type with extra fields, dump the existing data and save it back as the new type, or (b) take the existing schema and simply add some new fields onto the end. I don't know if (b) is a bit hairy, though? |
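The "ask forgiveness" idea for the JSON case could be sketched roughly as below. This is only an illustration using the standard library, not tsdate code: the helper name `add_time_summary` is hypothetical, the "mn" and "vr" keys and the "tsdate_time" wrapper are the ones suggested in the issue description, and treating empty metadata as an empty JSON object is an assumption.

```python
import json
import warnings


def add_time_summary(raw: bytes, mean: float, var: float) -> bytes:
    """Return `raw` JSON metadata with a "tsdate_time" entry added.

    Hypothetical helper: if `raw` cannot be parsed as a JSON object,
    warn and return it unchanged so that dating can carry on.
    """
    try:
        # Assumption: empty metadata is treated as an empty JSON object.
        existing = json.loads(raw.decode("utf-8")) if raw else {}
        if not isinstance(existing, dict):
            raise TypeError("node metadata is not a JSON object")
    except (UnicodeDecodeError, json.JSONDecodeError, TypeError) as err:
        warnings.warn(f"Could not add tsdate metadata, leaving it untouched: {err}")
        return raw
    # Existing keys are preserved; only the "tsdate_time" entry is (re)written.
    existing["tsdate_time"] = {"mn": mean, "vr": var}
    return json.dumps(existing).encode("utf-8")
```

Calling `add_time_summary(b'{"name": "n0"}', 1.5, 0.25)` keeps the existing "name" key and adds the nested entry, while bytes that are not valid JSON come back unmodified with a warning.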
The order of fields in struct is determined by the schema, either alphabetically (by default) or specified by the |
Thanks @benjeffery. I guess it might be safer to save the metadata out into a Python dict, then put it back in again? Something like this (which I presume will work for both JSON and struct metadata):
|
NB: if we are going with the idea of using struct metadata, as in tskit-dev/tsinfer#416 (comment), then we can do as above and add the |
Code above looks like the slow-but-correct way to do this, and as we will be replacing this by the eventual high-level API I think that is fine. |
Just revisiting this before re-dating the unified genealogy. I think that if the metadata schema on nodes is JSON or struct, we should use the code above to add means and variances. If it is binary, and if there is any metadata content on nodes, we shouldn't stomp on the metadata at all: we can simply warn the user that metadata on means and variances isn't being stored. If it is binary but there is no metadata, we can store struct data, I guess? For the (re-dated) unified genealogy, I think the only node metadata is from tsdate, so we can remove all the existing node metadata before re-dating, and therefore default to struct. |
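The rule proposed in this comment can be written down as a small decision function. This is a sketch only: the function name and the returned labels are hypothetical, and it assumes that "binary" means the node table has no schema, represented here as a codec of `None`.

```python
from typing import Optional


def node_metadata_action(codec: Optional[str], has_metadata: bool) -> str:
    """Decide what to do with node metadata, following the comment above.

    `codec` is the schema codec ("json" or "struct"), or None for raw binary
    metadata with no schema (an assumption of this sketch).
    `has_metadata` is True if any node carries non-empty metadata.
    """
    if codec in ("json", "struct"):
        # A schema exists, so means and variances can be added to it.
        return "add_fields"
    if has_metadata:
        # Unknown binary content: do not stomp on it, just warn the user.
        return "warn_and_skip"
    # Binary but empty: free to define a new (struct) schema.
    return "set_struct_schema"
```

For the re-dated unified genealogy described above, where existing node metadata is removed first, this would give `node_metadata_action(None, False)`, i.e. the struct default.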
Just a quick comment re #303 - if we have struct data and are adding fields, we probably need to specify default values for each new property, otherwise we won't be able to write non-null data back into the new metadata column. This is done in #303, which also takes the decision to overwrite any existing data in the |
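To illustrate the point about defaults, here is a rough sketch of extending a struct schema, held as a plain dict, with new properties that each carry a default. It is not the code from #303. The helper name, the example base schema with its `sample_id` field, and the choice of NaN as the default are all assumptions made for illustration; the "binaryFormat" and "default" keys follow the conventions of the tskit struct codec as I understand them.

```python
import copy


def extend_struct_schema(schema: dict, new_properties: dict) -> dict:
    """Return a copy of `schema` with `new_properties` added.

    Hypothetical helper: every new property must define a "default",
    so rows that lack the new keys can still be encoded.
    """
    extended = copy.deepcopy(schema)  # never mutate the caller's schema
    props = extended.setdefault("properties", {})
    for name, spec in new_properties.items():
        if "default" not in spec:
            raise ValueError(f"new property {name!r} needs a default value")
        # A property with the same name is overwritten by the new definition.
        props[name] = copy.deepcopy(spec)
    return extended


# Illustrative definitions for the two values tsdate wants to store.
TSDATE_FIELDS = {
    "mn": {"type": "number", "binaryFormat": "d", "default": float("nan")},
    "vr": {"type": "number", "binaryFormat": "d", "default": float("nan")},
}

# Hypothetical pre-existing struct schema with a single integer field.
base_schema = {
    "codec": "struct",
    "type": "object",
    "properties": {"sample_id": {"type": "integer", "binaryFormat": "i"}},
}
new_schema = extend_struct_schema(base_schema, TSDATE_FIELDS)
```

The resulting dict could then be wrapped in `tskit.MetadataSchema` and set on the node table, after which the existing rows would still need to be re-encoded under the new schema.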
Set to JSON everywhere, where possible. Closing |
At the moment we don't set a metadata schema for nodes returned by tsdate: we simply dump a binary JSON string in there, e.g.
We should probably attempt to set the "mn" and "vr" keys assuming the schema is valid, and simply omit them (perhaps with a warning) if it is not. Perhaps we should also store this as {"tsdate_time":{"mn":XXX,"vr":YYY}} so that it's clear what the metadata refers to?
If no schema exists, and the node metadata is entirely empty, we can probably set the nodes table schema to tskit.MetadataSchema.permissive_json()
I wonder if @benjeffery, the metadata king, has any thoughts on the best thing to do here.