Alphabet as inherent property of a sequence collection #46

Open
ahwagner opened this issue Apr 21, 2023 · 2 comments
Labels
schema-term Proposals for terms in the core schema

Comments

@ahwagner
Member

Related issues: #16, #8 (specifically #8 (comment)).

This came up during the Sequence Collections call at Connect. I think this is an important feature: it would help us resolve a general problem with the interpretation of sequences, in a way that is backwards compatible with refget.

Since @andrewyatz was kind enough to raise this on the call, I thought I would create an issue here to track discussion on this specific feature request.

@sveinugu @nsheff

@sveinugu
Collaborator

Thanks @ahwagner for opening this issue again! I just re-read my novel-length posts in #8 and I do not want anyone else to have to go through that pain 😃, so I have copied out the most relevant paragraph of #8 (comment) here:

I. alphabet

I would like to add 'alphabet' to the list of arrays. In my mind, this is not the same as a summary of the characters used in the sequence, but the set of characters that could have been used. Similar to the assembly gaps use case above, the fact that a character is not used at any sequence position is a separate piece of information, given that we know the character was allowed in the first place. Hence, the alphabet (defined in this way) adds information to what is inherent in the sequence itself, and one can easily imagine sequences that are identical but are defined in relation to different alphabets. Having the alphabet available at recursion level 1 (together with the lengths) would also help if one needs to pre-register some data structure before the sequences are downloaded. Also, even if this is not the main argument, having the alphabet available would be a simple way to differentiate between e.g. DNA and protein sequences, without having to add a sequence type array/metadata field (which is, by the way, also a possibility, of course, but I think I will stop here...).
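To make the distinction between "characters used" and "characters allowed" concrete, here is a minimal sketch. It assumes an `alphabet` array with one entry per sequence, stores raw sequence strings instead of refget digests for brevity, and uses `json.dumps` with sorted keys as a stand-in for a canonical JSON serialization. The collections and the helper names (`level1`, `used_characters`) are hypothetical and not part of any agreed schema.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """Truncated digest: first 24 bytes of SHA-512, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical(value) -> bytes:
    # Assumption: a simple stand-in for canonical JSON serialization.
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def level1(collection: dict) -> dict:
    """Digest each array separately, giving one digest per attribute."""
    return {name: sha512t24u(canonical(array)) for name, array in collection.items()}


def used_characters(sequence: str) -> str:
    """Characters that actually occur in the sequence (not the alphabet)."""
    return "".join(sorted(set(sequence)))


# Hypothetical collections: identical sequence, different allowed characters.
strict_dna = {
    "names": ["chr1"],
    "lengths": [8],
    "sequences": ["ACGTACGT"],  # raw string here; a real collection would hold digests
    "alphabet": ["ACGT"],
}
dna_with_n = dict(strict_dna, alphabet=["ACGNT"])

a, b = level1(strict_dna), level1(dna_with_n)
same_sequences = a["sequences"] == b["sequences"]
same_alphabet = a["alphabet"] == b["alphabet"]
print(same_sequences, same_alphabet)  # True False
```

Both collections use exactly the characters `ACGT`, so a summary of used characters cannot tell them apart, while the level 1 `alphabet` digest can, without fetching any sequence.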

@sveinugu
Collaborator

sveinugu commented Apr 21, 2023

Sidenote:

My takeaway from reading through my rants in #8 is that I was mostly working through a lot of examples, suggestions and ideas based on what I perceived as a lack of generality in the then-current state of the specification, but I did not have good concepts to describe these gaps clearly. Now that we have clearly defined the concepts of inherent and collated, it is much easier to pose and discuss questions about the definition of specific arrays, as @ahwagner has done here. Most of my rants can similarly be compressed down to these two suggestions:

  1. Should we support order-invariant sequence collection digests by requiring (or strongly recommending) a sorted-names-lengths array, which is neither inherent nor collated? (Issue now here: Discussion on undigested attributes and sorted-name-length-pairs #40)
  2. Should we support name-invariant per-sequence identifiers through an everything-but-names array, which would be collated but not inherent? (I will not open a separate issue on this now, as I believe the current consensus is to postpone the inclusion of non-inherent arrays other than the sorted-names-lengths one to the next version of seqcol. I will just leave this here as a half-issue, as even I had half forgotten about it.)
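As a rough illustration of the first suggestion, the sketch below shows one way an order-invariant digest over name-length pairs could work: digest each pair, sort the pair digests, then digest the sorted list. The function names, the example values, and the use of `json.dumps` as a stand-in for canonical JSON are assumptions for illustration, not the agreed definition.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """Truncated digest: first 24 bytes of SHA-512, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical(value) -> bytes:
    # Assumption: a simple stand-in for canonical JSON serialization.
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sorted_name_length_digest(names, lengths) -> str:
    """Digest each (name, length) pair, sort the digests, digest the sorted list."""
    pair_digests = sorted(
        sha512t24u(canonical({"name": n, "length": l})) for n, l in zip(names, lengths)
    )
    return sha512t24u(canonical(pair_digests))


def plain_digest(names, lengths) -> str:
    """Order-sensitive digest over the arrays as given, for comparison."""
    return sha512t24u(canonical({"names": names, "lengths": lengths}))


# Hypothetical collection and a reordered copy of it.
original = (["chr1", "chr2"], [1000, 800])
reordered = (["chr2", "chr1"], [800, 1000])

order_invariant = sorted_name_length_digest(*original) == sorted_name_length_digest(*reordered)
order_sensitive = plain_digest(*original) != plain_digest(*reordered)
print(order_invariant, order_sensitive)  # True True
```

Because sorting happens on the pair digests, the result depends only on which name-length pairs are present, not on their order in the collection.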

The main point of this post is just to convey my experience that a lot of progress has been made and that we have not simply been walking in circles. I believe we now have the concepts needed to revisit the issue of additional arrays and hopefully conclude!

@nsheff added the schema-term label (Proposals for terms in the core schema) on Feb 21, 2024