Overhaul models so that "unfinished" metadata can be represented without cheating Pydantic #205
I thought to argue that it would add another combinatorial dimension, but we kind of already have that halfway in the separation of "Common" vs "Publishable"; the question then is really how to "shift" validation into something along those lines.
butting in here with uninvited 2 cents: to avoid needing to maintain a whole parallel set of models, you could use a field validator that indicates if validations fail by modifying a field like `validation_errors`, e.g.:

```python
@field_validator('*', mode='wrap')
@classmethod
def allow_optional[T](
    cls,
    v: T,
    handler: ValidatorFunctionWrapHandler,
    info: ValidationInfo,
) -> T:
    try:
        return handler(v)
    except ValidationError as e:
        # do something to indicate that we failed validation
        # but still allow class instantiation, e.g.
        info.data['validation_errors'].append(e)
        return v

@model_validator(mode='after')
def validation_state(self):
    # do something here to check if we had validation errors
    # that forbid us from being publishable
    ...
```

You could also dynamically control whether one should be validating publishable or draft standards using validation context, e.g.: https://docs.pydantic.dev/latest/concepts/validators/#using-validation-context-with-basemodel-initialization
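A minimal sketch of that validation-context idea, assuming Pydantic v2's `model_validate(..., context=...)` and `ValidationInfo.context`; the model and the `publishable` flag are illustrative stand-ins, not dandischema's actual classes:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, ValidationInfo, model_validator


class Dandiset(BaseModel):
    # hypothetical field that is only mandatory at publication time
    license: Optional[str] = None

    @model_validator(mode="after")
    def check_publishable(self, info: ValidationInfo) -> "Dandiset":
        # info.context is whatever dict was passed to model_validate(..., context=...)
        if info.context and info.context.get("publishable") and self.license is None:
            raise ValueError("license is required for publication")
        return self


# draft-time validation: a missing license is tolerated
draft = Dandiset.model_validate({}, context={"publishable": False})

# publication-time validation of the same data fails
try:
    Dandiset.model_validate({}, context={"publishable": True})
except ValidationError as e:
    print(e)
```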
@yarikoptic I don't think I'm the right person to assign this to (at least not at this time). The task requires someone (or discussions with several someones) who knows what the "minimal metadata" requirements are, what the "completed metadata" requirements are, and what's allowed for the in-between.
gotcha -- thanks. Let us dwell on this aspect -- maybe during the upcoming meetup with @satra et al.
I have been thinking about this but am still not sure we would ever be able to easily guarantee that even a published Dandiset's metadata conforms to "The Model". The reason is the same one that haunts NWB (and PyNWB in particular ATM): model versions, and the fact that we can and do break backward compatibility (e.g. #235 would make possibly legit prior models invalid now). So, unless we can come up with a way to have a model be "Valid" according to a specific version (which we cannot do at the pydantic level, since we have only one "current" model version), we cannot guarantee that past "Valid" models remain valid currently.
Seems like we would need model migration here, and that might be a good thing to have generally; no schema stays still ;). I interject again here because I have been wanting to do something similar and think it might be a nice little pydantic extension: a decorator or module-level constant gives a model a particular version, and each upgrade would need to provide a migration patch. I think that might be nicer than maintaining full copies of every version without clean ways to migrate between them. pydantic should allow model instantiation through validation errors in any case, and the migration methods would get us "best effort" upgrades (e.g. be able to perform renames and so on, but not magically fill in missing required values).

If the plan is to move to linkml eventually, this would be something I would be happy to work on with y'all in the pydantic generator; I have been doing big refactor work there and am in the process of finishing a patch to include all schema metadata, so if we want to make some helper code to do diffs between schema versions and use that to generate model migration code, I would be super into working on that.

edit: fwiw, I have this mostly done for NWB in linkml by being able to pull versions of the schema and generate models on the fly -- e.g. see the git provider and the schema provider.
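A rough sketch of the versioned-model-plus-migration-patch idea; the `migration`/`migrate` helpers, the version numbers, and the field names are all hypothetical, not an existing pydantic or dandischema API:

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

from pydantic import BaseModel

# hypothetical registry of migration patches keyed by (from_version, to_version)
MIGRATIONS: Dict[Tuple[str, str], Callable[[Dict[str, Any]], Dict[str, Any]]] = {}


def migration(from_version: str, to_version: str):
    """Register a patch that rewrites raw metadata from one schema version to the next."""
    def register(fn: Callable[[Dict[str, Any]], Dict[str, Any]]):
        MIGRATIONS[(from_version, to_version)] = fn
        return fn
    return register


@migration("0.6.4", "0.6.5")
def rename_contact(data: Dict[str, Any]) -> Dict[str, Any]:
    # purely illustrative: pretend a field was renamed between these versions
    if "contact" in data:
        data["contactPoint"] = data.pop("contact")
    return data


def migrate(data: Dict[str, Any], versions: List[str]) -> Dict[str, Any]:
    """Best-effort upgrade: apply registered patches along a chain of schema versions."""
    for old, new in zip(versions, versions[1:]):
        patch = MIGRATIONS.get((old, new))
        if patch is not None:
            data = patch(data)
    return data


class Dandiset(BaseModel):
    schemaVersion: str = "0.6.5"
    contactPoint: Optional[str] = None


# old metadata is patched up before being validated against the current model
old_record = {"schemaVersion": "0.6.4", "contact": "someone@example.com"}
current = Dandiset.model_validate(migrate(old_record, ["0.6.4", "0.6.5"]))
```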
thanks @sneakers-the-rat - i think this may align well with the transition to linkml. i know @candleindark is working on an updated reverse linkml generator from the current pydantic models in dandischema. i suspect this will be a priority for us to redo the models using linkml in the upcoming month. we will also want to separate out schema validation into two components: 1) are appropriate values being assigned and 2) are required values being assigned. requirement is a project-specific thing, and hence this will also allow us to reuse schemas. also, 2 allows us to further stratify requirements given the state of an asset (pre-upload, uploaded, modified, published) or dandiset. we need a specific project timeline/design for this transition.
Lmk how I can help - happy to hack on the pydantic generator to make it work to fit y'all's needs bc working with DANDI + NWB in linkml is exactly within my short term plans :)
@sneakers-the-rat My current plan is to improve the Pydantic-to-linkml generator as I participate in the transition to linkml in dandischema. One approach is to build something that mimics the behavior of the existing generators.
That would be great! There is already a sort of odd LinkMLGenerator, but generalizing that out to accept pydantic models / JSON schema from them would be a very useful thing for initial model import. LinkML definitely wants to work with the linkml schema being the primary source of truth, aka do the pydantic model -> linkml schema conversion and from then on use the pydantic models generated from the linkml schema, but I have been working on the generator to make it easier to customize for specific needs, e.g. if y'all want to separate some validation logic in a special way that the default generator doesn't do. I overhauled the templating system recently to that end, see: https://linkml.io/linkml/generators/pydantic.html#templates

And I'm also working on making pydantic models fully invertible to linkml schema, so you could go DANDI models -> linkml schema -> linkml pydantic models -> customized linkml DANDI models and then be able to generate the schema in reverse from those, but it might be more cumbersome to maintain than just customizing the templates and having the schema be the source of truth. See linkml/linkml#2036; that way you can also do stuff that can't be supported in pydantic like all the RDF stuff (but I'm working on that next) ;)
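For reference, a minimal sketch of the "schema as source of truth" workflow, assuming LinkML's `PydanticGenerator` takes a schema path and emits Python source via `serialize()`, as other LinkML generators do; the schema filename is a placeholder:

```python
from linkml.generators.pydanticgen import PydanticGenerator

# generate pydantic model source code from the linkml schema;
# the generated module would then be what dandischema imports
gen = PydanticGenerator("dandi_schema.yaml")
with open("models.py", "w") as f:
    f.write(gen.serialize())
```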
(Maybe we should make a separate issue for linkml discussions, sorry to derail this one)
@sneakers-the-rat Thanks for the input. I think this is a solution for making fields optional when receiving user input while keeping the fields required at publication. However, one can't generate, at least not directly, two schema variants (one with the fields optional and the other with the fields required). It would be nice if we could have the two different schema variants. Is that possible?
Do you mean at a field level, being able to label a given slot as "toggle required" so that at generation time you get two pydantic modules, one with those slots required and one with them optional? Or do you mean at a generator level, making two pydantic modules, one where all slots are required and one where all are optional? I'm assuming the former, where you want to make a subset of slots that are annotated as being part of a particular group.

Another approach, if ya still want to make multiple sets of models, might be to do something like have one base model that's the most lax, then have a stricter variant derived from it.
Sorry -- to answer your question: I am betting that we could rig something up to generate different models on a switch. That would probably be easiest to do by making a step before schemaview where you load the schema, introspect on it to flip requiredness depending on a flag, and then send that to schemaview. SV is sorta badly in need of a clarifying refactor, bc at the moment it's a bit of a "there be dragons" class (in my loving opinion, which is based on appreciating the winding road that led to SV).
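A very rough sketch of that pre-processing step, assuming `linkml_runtime`'s `yaml_loader`, `SchemaDefinition`, and `SchemaView`; a real implementation would presumably relax only a chosen set of slots rather than all of them, and the schema filename is a placeholder:

```python
from linkml_runtime import SchemaView
from linkml_runtime.linkml_model import SchemaDefinition
from linkml_runtime.loaders import yaml_loader


def load_variant(schema_path: str, publishable: bool) -> SchemaView:
    """Load the base schema; for the draft variant, relax every required slot to optional."""
    schema = yaml_loader.load(schema_path, target_class=SchemaDefinition)
    if not publishable:
        for slot in schema.slots.values():
            slot.required = False
    return SchemaView(schema)


# one schema on disk, two views: the draft view would feed the lax models,
# the publishable view the strict ones
draft_view = load_variant("dandi_schema.yaml", publishable=False)
publish_view = load_variant("dandi_schema.yaml", publishable=True)
```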
I am not clear about this, but that may be because I have never tried generating two modules from one schema.
Yes, this is close to what I had in mind: have one LinkML schema, the base schema, as the source of truth, and from the base schema generate variants by toggling the requiredness of the relevant slots.
for that, probably the easiest way would be to use subsets, like

```yaml
subsets:
  DraftSubset:
    rank: 0
    title: Draft Subset
    description: Set of slots that must be present for draft datasets
  PublishableSubset:
    rank: 1
    title: Publishable Subset
    description: Set of slots that must be present for publishable datasets

slots:
  optional_slot:
    required: false
    description: Only required for publishable datasets
    in_subset:
      - PublishableSubset
  required_slot:
    required: true
    description: Required on all datasets
    in_subset:
      - DraftSubset
      - PublishableSubset
```

and then you could either iterate through the slots and check the subsets they are in, or use the subsets directly.

I still think that you can do this with one set of models! I am just imagining it being difficult to parse and also to write code for "ok, now I import the right model variant"; we would just need to improve the pydantic generator to support equals_expression. So then you would do something like

```yaml
classes:
  Dataset:
    attributes:
      attr1: "..."
      attr2: "..."
      publishable:
        equals_expression: "{attr1} and {attr2}"
```

which would make a model like

```python
class Dataset(BaseModel):
    attr1: str
    attr2: str

    @computed_field
    def publishable(self) -> bool:
        return (self.attr1 is not None) and (self.attr2 is not None)
```

or we could combine the approaches and make an extension to the metamodel so we get the best of both - clear metadata on the slots and also a single model:

```yaml
# ...
equals_expression: "all(x, PublishableSubset, x)"
```

where the expression would mean "every slot in PublishableSubset is set". I think this is probably a common enough need that it would be worth making a way to express this neatly in linkml, so we can probably make the metamodel/generators come to you as well as y'all hacking around the metamodel/generators :).

So, to keep a running summary of possible implementations: subsets marking draft vs. publishable slots, generating separate draft/publishable modules, a single model with a computed publishable field, or a metamodel extension combining the two.
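A small sketch of the "iterate through slots and check the subsets they are in" option, assuming `linkml_runtime`'s `SchemaView` and the subset/slot names from the YAML above; the schema path and class name are placeholders:

```python
from typing import Any, Dict, List

from linkml_runtime import SchemaView


def missing_for_publication(sv: SchemaView, class_name: str, data: Dict[str, Any]) -> List[str]:
    """Names of publication-only slots of `class_name` that are unset in `data`."""
    missing = []
    for slot in sv.class_induced_slots(class_name):
        # in_subset lists the subsets a slot was tagged with in the schema
        if "PublishableSubset" in (slot.in_subset or []) and data.get(slot.name) is None:
            missing.append(slot.name)
    return missing


sv = SchemaView("dandi_schema.yaml")  # placeholder path
print(missing_for_publication(sv, "Dataset", {"required_slot": "x"}))
# e.g. ['optional_slot'] if that slot is tagged PublishableSubset and unset
```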
(This is an accumulation of various things discussed elsewhere which need to be written down.)
Currently, it is possible for users of the Dandi Archive API to submit asset & Dandiset metadata that does not fully conform to the relevant dandischema model, and the Archive will accept, store, and return such flawed metadata, largely via use of DandiBaseModel.unvalidated (being replaced by Pydantic's construct in #203). I believe part of the motivation for this is so that web users can fill in metadata over multiple sessions without having to fill in every field in a single sitting.

This results in API requests for asset & Dandiset metadata sometimes returning values that do not validate under dandischema's models; in particular, if a user of dandi-cli's Python API calls get_metadata() instead of get_raw_metadata(), the call may fail because our API returned metadata that doesn't conform to our own models (see dandi/dandi-cli#1205 and dandi/dandi-cli#1363).
The dandischema models should therefore be overhauled as follows:
There should exist models for representing Dandiset & asset metadata in a draft/unfinished state. These models should accept all inputs that we want to accept from users (both via the API and the web UI), store in the database, and return in API responses. (It is likely that such models will have all of their fields marked optional aside from the absolute bare minimum required.) The get_metadata() methods of dandi-cli's Python API should return instances of these models.

There should exist functionality for determining whether an instance of a draft/unfinished model meets all of the requirements for Dandiset publication.
CC @satra @dandi/dandiarchive @dandi/dandi-cli
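As a rough illustration of those last two points, a sketch with hypothetical stand-in models and fields (not dandischema's actual classes), assuming Pydantic v2:

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class DraftDandisetMeta(BaseModel):
    """Accepts anything we are willing to store for an in-progress Dandiset."""
    identifier: str                       # the absolute bare minimum
    name: Optional[str] = None
    license: Optional[List[str]] = None


class PublishableDandisetMeta(BaseModel):
    """Everything required for publication is mandatory here."""
    identifier: str
    name: str
    license: List[str]


def publication_errors(draft: DraftDandisetMeta) -> List[str]:
    """Which publication requirements does this draft metadata not yet satisfy?"""
    try:
        PublishableDandisetMeta.model_validate(draft.model_dump())
        return []
    except ValidationError as e:
        return [f"{err['loc']}: {err['msg']}" for err in e.errors()]


draft = DraftDandisetMeta(identifier="DANDI:000000")
print(publication_errors(draft))  # reports the missing 'name' and 'license'
```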