Consumption of seqcol into existing file formats #13

Open
andrewyatz opened this issue Jun 30, 2021 · 5 comments

Comments

@andrewyatz
Collaborator

Speaking in the VRC/VCF meeting, the consensus was that these flat-file consumers would work directly with the collection header format rather than with seqcol serialised into their native header format. The thinking was that there is no point spending time encoding into one format only to decode it again; it is much faster to consume the native seqcol header and use that. This will mean a breaking change in the formats.

@nsheff
Member

nsheff commented Jun 30, 2021

Can you specify exactly what you mean by the collection header format? Would this be some JSON blob like:

{"names": ["chrUn_KI270742v1", "chrUn_GL000216v2", "chrUn_GL000218v1"],
 "lengths": ["186739", "176608", "161147"],
 "sequences": ["2f31c013a4a8301deb8ab7ed1ca1cd99", "725009a7e3f5b78752b68afa922c090c", "1d708b54644c26c7e01c2dad5426d38c"]}

and you're suggesting this blob would appear verbatim in the VCR/VCF file?

I suppose an alternative is that the seqcol API could provide an endpoint that provided this same information in an alternative format that fit existing native header formats.
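To make that alternative concrete, here is a rough sketch of what such a conversion could look like, whether done by an endpoint or client-side. It assumes the `sequences` values are MD5 digests (they are 32 hex characters in the example, but the thread does not say so), and the function names `to_sam_sq_lines` and `to_vcf_contig_lines` are made up for illustration. The `@SQ` (`SN`/`LN`/`M5`) and `##contig` (`ID`/`length`/`md5`) layouts are the existing SAM and VCF header conventions.

```python
def to_sam_sq_lines(seqcol):
    """Render a seqcol-style dict as tab-separated SAM @SQ header lines."""
    rows = zip(seqcol["names"], seqcol["lengths"], seqcol["sequences"])
    # Assumption: each entry in "sequences" is an MD5 digest, so it maps to M5.
    return [f"@SQ\tSN:{name}\tLN:{length}\tM5:{md5}" for name, length, md5 in rows]


def to_vcf_contig_lines(seqcol):
    """Render the same dict as VCF ##contig header lines."""
    rows = zip(seqcol["names"], seqcol["lengths"], seqcol["sequences"])
    return [f"##contig=<ID={name},length={length},md5={md5}>" for name, length, md5 in rows]


seqcol = {
    "names": ["chrUn_KI270742v1", "chrUn_GL000216v2", "chrUn_GL000218v1"],
    "lengths": ["186739", "176608", "161147"],
    "sequences": [
        "2f31c013a4a8301deb8ab7ed1ca1cd99",
        "725009a7e3f5b78752b68afa922c090c",
        "1d708b54644c26c7e01c2dad5426d38c",
    ],
}

sam_lines = to_sam_sq_lines(seqcol)
vcf_lines = to_vcf_contig_lines(seqcol)
```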

@andrewyatz
Collaborator Author

Sorry, that's not what I meant. What I meant was that libraries would detect the header carrying the sequence collection identifier. The library would then request the sequence collection and parse the JSON directly into the appropriate data structures.
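As a minimal sketch of the "parse the JSON directly into the appropriate data structures" step: the payload shape below (parallel `names`, `lengths`, `sequences` arrays, with lengths as strings) is taken from the example earlier in this thread, not from any agreed spec, and `SequenceRecord` and `parse_seqcol` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SequenceRecord:
    name: str
    length: int
    digest: str


def parse_seqcol(payload: str) -> List[SequenceRecord]:
    """Turn a seqcol-style JSON string into one record per sequence."""
    doc = json.loads(payload)
    names, lengths, digests = doc["names"], doc["lengths"], doc["sequences"]
    # The three arrays are parallel, so they must line up one-to-one.
    if not (len(names) == len(lengths) == len(digests)):
        raise ValueError("names, lengths and sequences must have equal length")
    return [SequenceRecord(n, int(l), d) for n, l, d in zip(names, lengths, digests)]


payload = json.dumps({
    "names": ["chrUn_KI270742v1", "chrUn_GL000216v2"],
    "lengths": ["186739", "176608"],
    "sequences": [
        "2f31c013a4a8301deb8ab7ed1ca1cd99",
        "725009a7e3f5b78752b68afa922c090c",
    ],
})
records = parse_seqcol(payload)
```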

@jmarshall
Member

I have not caught up with whether the group considers digest hashes, unhashed concatenated strings for digesting, or JSON blobs 🤮 as the canonical unambiguous representation of a sequence collection… but what I have been envisaging as the item that might appear in the header of a SAM or VCF file is the digest hash. For example, along with an optional informal non-normative non-canonical description to give the SAM/VCF file's human reader a clue as to what's intended:

@SD   SH:2d967306d7b589e32aaf3ed6a63c9dde   VN:1   DS:GRCh38-plus-stuff

##collection=<ID=2d967306d7b589e32aaf3ed6a63c9dde,Version=1,Description="GRCh38-plus-stuff">

See also #1 (comment) about the possibility of having the unhashed string to be digested embedded in SAM/VCF/etc. But only this digest hash option would help reduce the size of the header in the millions of sequences case, and using the JSON blob would also cause havoc with delimiters when embedded in another non-JSON text format.
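For what it's worth, detecting such a header in a library would be cheap. The sketch below reads the digest out of the two example lines above. Note that `@SD`/`SH` and `##collection` are only the hypothetical record names proposed in this comment, not part of the current SAM or VCF specifications, and the helper names are invented for illustration. The SAM line is assumed to be tab-separated, as SAM headers are.

```python
import re

_VCF_COLLECTION = re.compile(r"^##collection=<(.*)>$")


def digest_from_sam_line(line):
    """Return the SH value of a hypothetical @SD header line, or None."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@SD":
        return None
    # Each remaining field is TAG:VALUE; split on the first colon only.
    tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
    return tags.get("SH")


def digest_from_vcf_line(line):
    """Return the ID value of a hypothetical ##collection header line, or None."""
    match = _VCF_COLLECTION.match(line.strip())
    if match is None:
        return None
    # Naive split: assumes no commas inside quoted values such as Description.
    pairs = dict(item.split("=", 1) for item in match.group(1).split(",") if "=" in item)
    return pairs.get("ID")


sam_digest = digest_from_sam_line(
    "@SD\tSH:2d967306d7b589e32aaf3ed6a63c9dde\tVN:1\tDS:GRCh38-plus-stuff"
)
vcf_digest = digest_from_vcf_line(
    '##collection=<ID=2d967306d7b589e32aaf3ed6a63c9dde,Version=1,Description="GRCh38-plus-stuff">'
)
```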

See also this related motivational proposal, in particular slide 6.

@andrewyatz
Collaborator Author

And the continuation of this is that 2d967306d7b589e32aaf3ed6a63c9dde would be passed to a seqcol endpoint, which would then return the appropriate payload to be consumed by the library.
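Roughly, that lookup might be sketched as below. The base URL and the `/collection/<digest>` path are placeholders (no endpoint layout has been agreed in this thread), `resolve_collection` is a hypothetical name, and the stub payload is made-up stand-in data so the example runs offline; it is not the real content behind that digest.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://seqcol.example.org/collection"  # placeholder, not a real service


def http_get(url):
    """Default fetcher: plain HTTP GET returning the response body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def resolve_collection(digest, fetch=http_get, base_url=BASE_URL):
    """Look up a sequence collection by digest and return the parsed JSON."""
    return json.loads(fetch(f"{base_url}/{digest}"))


# Offline stand-in for the server, so the sketch runs without network access.
requested_urls = []


def fake_fetch(url):
    requested_urls.append(url)
    return json.dumps({"names": ["chr1"], "lengths": ["1000"], "sequences": ["placeholder"]})


collection = resolve_collection("2d967306d7b589e32aaf3ed6a63c9dde", fetch=fake_fetch)
```

Passing the fetcher in as an argument also leaves room for a local cache or an on-disk copy of the collection instead of a live request.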

@sveinugu
Collaborator

sveinugu commented Jul 6, 2021

It would be nice to have the possibility of including the seqcol output into track files in the raw form. I see several situations where having self-contained files that include the sequence collection data would be useful, e.g. in secure settings where access to the internet is restricted. This would be mainly useful for including the coordinate system into the file itself, but there are probably usage scenarios also for the other recursion levels. JSON is not the format for that, I agree. Would it be a possibility to define an alternative but canonical output format (without whitespace) for use in tabular files?
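Purely as an illustration of what such a format could look like, and not a proposal anyone in this thread has agreed to: the sketch below sorts the keys, percent-encodes every value, and joins values with `,` and arrays with `;`, so the result contains no whitespace and no unescaped delimiters. The function names and the delimiter choices are assumptions.

```python
from urllib.parse import quote, unquote


def encode_flat(seqcol):
    """Serialise a dict of string arrays as key=v1,v2;key=v1,v2 with sorted keys."""
    parts = []
    for key in sorted(seqcol):
        # safe="" percent-encodes everything except letters, digits and "_.-~",
        # so commas, semicolons, equals signs and whitespace cannot leak through.
        values = ",".join(quote(str(v), safe="") for v in seqcol[key])
        parts.append(f"{quote(key, safe='')}={values}")
    return ";".join(parts)


def decode_flat(text):
    """Inverse of encode_flat; all values come back as strings."""
    result = {}
    for part in text.split(";"):
        key, _, values = part.partition("=")
        result[unquote(key)] = [unquote(v) for v in values.split(",")] if values else []
    return result


flat = encode_flat({"names": ["chr1", "chr 2"], "lengths": ["10", "20"]})
restored = decode_flat(flat)
```

One known gap in this sketch is that an empty array and an array holding a single empty string encode identically, so a real format would need to rule one of them out.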
