Processing annotation contributors for multiple GPAD lines with single annotation id #94

Open
vanaukenk opened this issue Jun 10, 2021 · 5 comments

Comments

@vanaukenk
Contributor

vanaukenk commented Jun 10, 2021

From 2021-06-08 MOD imports meeting:

We discussed how we want to handle processing of multiple GPAD lines for a single annotation id. This will be the situation with GPAD files coming from Protein2GO (and possibly other sources) where tools tracked and displayed annotation history.

The goals here are to ingest annotations in as uniform a way as possible so that the near-term display of 'annotations' in Noctua will be similar for each onboarding group and the long-term prospect of dealing with 'annotation' history is at least starting from standardized data with clear, documented semantics.

Here's the proposed workflow:

For each line in an incoming GPAD file, the import code will check the annotation id in the Annotation Properties field.

For annotation ids that appear only once, no additional processing will be needed.

For annotation ids that appear more than once, the import code will look at the contributor-id field of each annotation with that id. If there is more than one contributor id, it will add all contributors to the Annotation Properties field of the last-edited annotation (the one with the latest timestamp in the Date field), and then import only that last-edited annotation, with one or more contributor-ids.
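
A minimal sketch of this workflow, assuming each GPAD line has already been parsed into a dict. The keys `id`, `date`, and `contributors`, the function name `collapse_by_annotation_id`, and the contributor ids are all hypothetical and chosen for illustration; this is not the actual import code.

```python
from collections import defaultdict
from datetime import datetime


def collapse_by_annotation_id(lines):
    """Keep one record per annotation id: the last-edited one, with all contributors."""
    groups = defaultdict(list)
    for rec in lines:
        groups[rec["id"]].append(rec)

    collapsed = []
    for group in groups.values():
        if len(group) == 1:
            # Ids that appear only once need no additional processing.
            collapsed.append(dict(group[0]))
            continue
        # Order the group by the Date field; the last record is the last-edited one.
        ordered = sorted(group, key=lambda rec: datetime.fromisoformat(rec["date"]))
        contributors = []
        for rec in ordered:
            for contributor in rec["contributors"]:
                if contributor not in contributors:
                    contributors.append(contributor)
        merged = dict(ordered[-1])
        merged["contributors"] = contributors
        collapsed.append(merged)
    return collapsed


# Hypothetical input: two lines share the id "A1", one line has the id "B2".
example = [
    {"id": "A1", "date": "2020-01-01", "contributors": ["curator:A"]},
    {"id": "A1", "date": "2020-01-03", "contributors": ["curator:B"]},
    {"id": "B2", "date": "2020-02-01", "contributors": ["curator:C"]},
]
result = collapse_by_annotation_id(example)
```

Here `result` holds two records: the 2020-01-03 line for "A1", now listing both contributors, and the untouched "B2" line.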

@dustine32 @kltm @sierra-moxon - please feel free to add more or edit for clarity
I will add information about the semantics of the Date, creation-date, and modification-date to the imports SOP doc as well as to the GPAD/GPI 2.0 specs.

@vanaukenk vanaukenk changed the title Processing multiple GPAD lines with single annotation id Processing annotation contributors for multiple GPAD lines with single annotation id Jun 10, 2021
@kltm
Member

kltm commented Jun 10, 2021

@vanaukenk I might phrase it as:

For the purpose of comparison, dates without further time information are assumed to be midnight (should probably be in spec).
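
As an illustration of the midnight assumption, here is a small sketch using Python's standard `datetime.fromisoformat`, which already reads a bare date as 00:00. The helper name `parse_gpad_date` is hypothetical, and the sketch assumes the values carry no timezone offsets.

```python
from datetime import datetime


def parse_gpad_date(value):
    """Parse 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM'; a bare date becomes midnight."""
    return datetime.fromisoformat(value)


day_only = parse_gpad_date("2020-01-02")
with_time = parse_gpad_date("2020-01-02T09:30")

print(day_only == parse_gpad_date("2020-01-02T00:00"))  # True
print(day_only < with_time)  # True
```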

A "set" here is defined as a group of GPAD annotation lines that all share the same col 12 id.

Rules:

  • If no creation-date is given for a set, it is assumed to be the earliest modification-date. (Inferring dates.)

  • GPAD col 9 is modification-date; it must match modification-date if also found in col 12. (Consistent dates.)

  • It is an error for any creation-date to be different within a set. (Consistent dates.)

  • It is an error for any modification-date to be earlier than a given creation-date in a set. (Consistent dates.)

  • It is an error for the modification-date of a line without a creation-date in a set to be between the creation-date and modification-date of another line. (Strict ordering.) [Revised from: "It is an error for a modification-date of one line in a set to be between the creation-date and modification-date of another line."]

  • It is an error for two lines to have the same modification date. (Strict ordering.)

  • It is an error for the same modification-date to occur twice; sets must be orderable by modification-date. (Strict ordering.)
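
One possible literal reading of these rules as a validation pass over a single set, sketched under stated assumptions: each line is a dict whose `date` key holds col 9 and whose optional `creation-date` and `modification-date` keys come from col 12. The function name `check_set` and the error messages are hypothetical.

```python
from datetime import datetime


def _dt(value):
    # A bare date is read as midnight by fromisoformat.
    return datetime.fromisoformat(value)


def check_set(lines):
    """Return a list of rule violations for one set (lines sharing an id)."""
    errors = []

    # Consistent dates: col 9 must match modification-date if the latter is given.
    for i, rec in enumerate(lines):
        prop_mod = rec.get("modification-date")
        if prop_mod is not None and _dt(prop_mod) != _dt(rec["date"]):
            errors.append(f"line {i}: col 9 date differs from modification-date")

    # Consistent dates: every creation-date in the set must be identical.
    created = {_dt(rec["creation-date"]) for rec in lines if rec.get("creation-date")}
    if len(created) > 1:
        errors.append("creation-date differs within the set")

    # Consistent dates: no modification-date may precede a creation-date.
    mods = [_dt(rec["date"]) for rec in lines]
    for i, mod in enumerate(mods):
        if any(mod < c for c in created):
            errors.append(f"line {i}: modification-date precedes creation-date")

    # Strict ordering: the same modification-date may not occur twice.
    if len(set(mods)) != len(mods):
        errors.append("duplicate modification-date in the set")

    # Strict ordering: a line without a creation-date may not fall strictly
    # between the creation-date and modification-date of another line.
    for i, a in enumerate(lines):
        if a.get("creation-date"):
            continue
        for j, b in enumerate(lines):
            if i != j and b.get("creation-date") and _dt(b["creation-date"]) < mods[i] < mods[j]:
                errors.append(f"line {i}: modification-date falls inside the span of line {j}")
    return errors


clean = check_set([{"date": "2020-01-01"}, {"date": "2020-01-02"}])
tied = check_set([{"date": "2020-01-01"}, {"date": "2020-01-01"}])
```

Under this reading, `clean` is empty and `tied` reports the duplicate modification-date. Because bare dates are compared as midnight, two date-only lines from the same day count as a tie.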

Given these rules, the lines in a set can be ordered, earliest to latest, and the output model for a set should have:

  • The first date as the creation-date for all of the output.
  • The last date as the modification-date for all of the output.
  • A unique listing of all contributors in the set as the contributor list for all of the output.
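
A rough sketch of how that output could be derived from a set that has already passed validation. The dict keys (`date`, `creation-date`, `contributors`) and the function name `summarize_set` are assumptions for illustration, not part of any existing import code.

```python
from datetime import datetime


def summarize_set(lines):
    """Derive creation-date, modification-date and contributors for one set."""
    # Order the lines in the set, earliest to latest, by col 9 date.
    ordered = sorted(lines, key=lambda rec: datetime.fromisoformat(rec["date"]))

    # Use an explicit creation-date if any line gives one; otherwise infer it
    # from the earliest modification-date.
    explicit = next((rec["creation-date"] for rec in ordered if rec.get("creation-date")), None)

    # Unique contributors, in order of first appearance.
    contributors = []
    for rec in ordered:
        for contributor in rec.get("contributors", []):
            if contributor not in contributors:
                contributors.append(contributor)

    return {
        "creation-date": explicit or ordered[0]["date"],
        "modification-date": ordered[-1]["date"],
        "contributors": contributors,
    }


summary = summarize_set([
    {"date": "2020-01-03", "contributors": ["curator:B", "curator:A"]},
    {"date": "2020-01-01", "contributors": ["curator:A"]},
])
```

For this hypothetical set, `summary` has creation-date 2020-01-01, modification-date 2020-01-03, and the contributors `curator:A` and `curator:B`.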

@vanaukenk
Contributor Author

Thanks, @kltm. I think I understand everything except this one:

'It is an error for a modification-date of one line in a set to be between the creation-date and modification-date of another line. (Strict ordering.)'

Also, if we get a set of annotations with a given id that all have the same YYYY-MM-DD date without additional time information (i.e., not in the form YYYY-MM-DDTHH:MM), we can't do anything with that, as we'd have no way of knowing what the last modification was, right? We might still be able to concatenate contributors, but we couldn't know what the 'final' annotation was for that set.

@kltm
Member

kltm commented Jun 11, 2021

@vanaukenk Okay, yeah, let me explain that a little and make a clarification (with edits inline):

'It is an error for the modification-date of a line without a creation-date in a set to be between the creation-date and modification-date of another line. (Strict ordering.)'

While a hard case to hit, it is possible in the defined format. The purpose here is to require "sortability": every line in a set occurred before or after every other line in the set. We want this property because we want to be able to recreate a history of operations; if an action occurs at the same time as another action in a set, or is otherwise inconsistent, we have an ambiguity and cannot consistently figure out what happened.

So, if we have lines in a set that are:
... creation-date=2020-01-01 modification-date=2020-01-03
... modification-date=2020-01-02 (or maybe just col 9, noting no creation-date)
What then was the order of events?

Does this make sense? I definitely could have written the formulation more clearly.

For your second question (dates are not granular enough to distinguish between them), I think it's a pretty big problem that should be regarded as an error, at least initially.
I'm coming from a POV where we should be able to reconstruct a series of events that gets us from a first annotation to a last annotation for a set, one that can be decomposed into a unique set of operations. If the items cannot be ordered, we cannot recreate that history. Now, it could be that the resource does not care about this temporal history (all good), but I think it would be important to flag and clarify that with them, and maybe try to talk them into giving us a collapsed version (like MGI) to make that understanding concrete.

@vanaukenk
Contributor Author

@kltm Thanks, that makes sense now.

Let's just review on tomorrow's MOD imports call, as I think we're clear now on what to do wrt both annotation sources and GO.

@suzialeksander

For the SGD annotations, we decided to drop nearly all of the additional information and are just adding any valid contributor ID to the comments.
