handle serialization of structs? #53
I'm so glad to not be the only one really interested in this feature, as discussed a few months ago with @zaeleus: brainstorm/s3-rust-htslib-bam@9e7a200#commitcomment-48795221

TL;DR: Seems unlikely to see support in Noodles itself, but most probably as an external (BioSerDe) crate? That being said, I'd also like to hear about how Michael would architect such a third party crate so that it integrates/performs best with Noodles.
I still think serialization tends to be an application-specific output format, particularly in the two examples given thus far. I'm not even sure if it's viable to generalize, so I'm trying to understand the use-case at the library-level.

Following For example, what's the expected (JSON) serialization for a

{ "name": "seq1 LN:8", "sequence": "ACGT" }
{ "name": "sq1", "description": "LN:8", "sequence": "ACGT" }
{ "definition": { "name": "sq1", "description": "LN:8" }, "sequence": "ACGT" }

How granular does the serialization go for each field and with what vocabulary? E.g., (JSON) serialization possibilities for a

{ "cigar": "36M4D8S" }
{ "cigar": ["36M", "4D", "8S"] }
{ "cigar": [{ "kind": "M", "len": 36 }, { "kind": "D", "len": 4 }, { "kind": "S", "len": 8 }] }
{ "cigar": [{ "kind": "Match", "len": 36 }, { "kind": "Deletion", "len": 4 }, { "kind": "SoftClip", "len": 8 }] }

Do field names use the spec names/values or noodles API names/values? E.g., (JSON) serialization possibilities for a partial

{ "refId": 0, "mapq": 255 }
{ "reference_sequence_id": 0, "mapping_quality": 255 }
{ "reference_sequence_id": 0, "mapping_quality": null }

This would cause the most problems with interoperability, as most external applications and libraries don't practice the same discipline. If there were a

I don't have a good solution to the generalization of this problem. There are a lot of open questions that would have to be discussed before moving forward on a decision. It would be helpful to see more concrete examples to better understand the context.
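To make the open questions above concrete, here is a minimal std-only Rust sketch that emits the same CIGAR at each granularity listed, plus the two mapping-quality conventions. Everything in it is an assumption for illustration: `Kind`, `Op`, and the `cigar_as_*` and `mapping_quality_json` functions are hypothetical names, not the noodles API, and the JSON is built by hand rather than through Serde. The only outside fact relied on is that the SAM specification reserves a MAPQ of 255 for "not available".

```rust
// Hypothetical types for illustration only; these are NOT the noodles API.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Match,
    Deletion,
    SoftClip,
}

impl Kind {
    // Single-letter code as written in the SAM text format.
    fn symbol(self) -> char {
        match self {
            Kind::Match => 'M',
            Kind::Deletion => 'D',
            Kind::SoftClip => 'S',
        }
    }

    // Longer, API-style name.
    fn name(self) -> &'static str {
        match self {
            Kind::Match => "Match",
            Kind::Deletion => "Deletion",
            Kind::SoftClip => "SoftClip",
        }
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
struct Op {
    kind: Kind,
    len: usize,
}

// Granularity 1: the whole CIGAR as one opaque string.
fn cigar_as_string(ops: &[Op]) -> String {
    let s: String = ops
        .iter()
        .map(|op| format!("{}{}", op.len, op.kind.symbol()))
        .collect();
    format!("{{ \"cigar\": \"{}\" }}", s)
}

// Granularity 2: an array with one string per operation.
fn cigar_as_op_strings(ops: &[Op]) -> String {
    let items: Vec<String> = ops
        .iter()
        .map(|op| format!("\"{}{}\"", op.len, op.kind.symbol()))
        .collect();
    format!("{{ \"cigar\": [{}] }}", items.join(", "))
}

// Granularity 3: an array of objects. `verbose` selects the vocabulary:
// API-style names ("Match") or spec symbols ("M").
fn cigar_as_objects(ops: &[Op], verbose: bool) -> String {
    let items: Vec<String> = ops
        .iter()
        .map(|op| {
            let kind = if verbose {
                op.kind.name().to_string()
            } else {
                op.kind.symbol().to_string()
            };
            format!("{{ \"kind\": \"{}\", \"len\": {} }}", kind, op.len)
        })
        .collect();
    format!("{{ \"cigar\": [{}] }}", items.join(", "))
}

// Spec-style output keeps the raw 255 sentinel under the spec field name;
// API-style output renames the field and maps 255 (missing) to null.
fn mapping_quality_json(raw: u8, api_style: bool) -> String {
    if api_style {
        match raw {
            255 => "{ \"mapping_quality\": null }".to_string(),
            n => format!("{{ \"mapping_quality\": {} }}", n),
        }
    } else {
        format!("{{ \"mapq\": {} }}", raw)
    }
}
```

Calling these with the operations 36M, 4D, 8S yields four different documents for the same in-memory value, and a consumer written against one of them cannot read the others. That is the interoperability concern in a nutshell: whichever form a library derives by default becomes a de facto schema.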
I understand the concerns about not having a consensus through committee standards, but for instance Google has interesting protobuf definitions for most of the bioinfo formats that they are ingesting into their systems and it's fairly straightforward to read and grok as-is: https://github.com/google/nucleus/tree/v0.6.0/nucleus/protos

If fields are defined on a relatively easy to read "internal representation" schema, the final representation on disk is a bit up to the specific application area (database, parquet, .ORA, etc...) and/or particular use case. In short, a flexible/general internal representation can help out in (de)serializing and match the intended needs.
Alright, let's re-kindle this issue and discussion, since BioSerDe needs it. Also, as @GabrielSimonetto pointed out in his draft PR:
But first, let's address your questions above, Michael:
Yes!... with some minor intermediate convenient conversions perhaps, but ideally: yes.
From a simplicity standpoint and picking from your alternatives, I think that
For simplicity's sake, I would:
I'd prefer
Ok, a fairly straightforward use case would be to serialize BED to Parquet so that it can be queried by Presto, a distributed SQL query engine that reads Parquet among other formats that are not hts-spec compliant. Then proceed to SerDe the rest of the bio (*AM/VCF) formats to allow this mode of scalable data exploration on cloud providers or other emerging Rust data science frameworks. Ultimately, a public interface for BioSerDe should be as usable as the

I hope the objective, overall idea, and direction are clearer now? /cc @multimeric @GabrielSimonetto @E-Allie @mmalenic @ohofmann.
I think this is more of a misunderstanding of the format. BED is deceptively simple and doesn't generalize without a tag (e.g., differentiating between BED3+1 and BED4, etc.). The BED implementation in noodles is perhaps unusual and up for a different discussion.
Again, this is why I think complex serialization is better suited to the application, not tied to the library, especially when there is no standard. Your example still requires mapping the Parquet schema to the record representation and vice-versa. I would really like to see a wrapper and its usage or an implementation of how BioSerDe makes use of Serde.
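The point about BED needing a tag can be illustrated with a small std-only Rust sketch. The names here (`BedLayout`, `Parsed`, `parse_line`) are hypothetical and are not how noodles models BED; the sketch only shows that the same four-column line maps to different records depending on a layout the caller has to supply, since the text itself does not say whether the fourth column is the standard name field (BED4) or a custom field (BED3+1).

```rust
// Hypothetical tag and record types for illustration; NOT the noodles API.
#[derive(Clone, Copy, Debug, PartialEq)]
enum BedLayout {
    Bed4,      // 4 standard fields: chrom, start, end, name
    Bed3Plus1, // 3 standard fields plus 1 custom field
}

#[derive(Debug, PartialEq)]
enum Parsed {
    Bed4 { chrom: String, start: u64, end: u64, name: String },
    Bed3Plus1 { chrom: String, start: u64, end: u64, extra: String },
}

// Parses a tab-separated line with at least four columns. The layout must
// come from the caller because the line carries no tag of its own.
fn parse_line(line: &str, layout: BedLayout) -> Option<Parsed> {
    let mut fields = line.split('\t');
    let chrom = fields.next()?.to_string();
    let start: u64 = fields.next()?.parse().ok()?;
    let end: u64 = fields.next()?.parse().ok()?;
    let fourth = fields.next()?.to_string();

    Some(match layout {
        BedLayout::Bed4 => Parsed::Bed4 { chrom, start, end, name: fourth },
        BedLayout::Bed3Plus1 => Parsed::Bed3Plus1 { chrom, start, end, extra: fourth },
    })
}
```

Parsing the line `sq0<TAB>7<TAB>13<TAB>ndls1` with `BedLayout::Bed4` and again with `BedLayout::Bed3Plus1` gives two unequal values, so any serialized schema (Parquet or otherwise) has to record which layout was assumed.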
Hi -- for my line of work I often have to (de-)serialize bioinformatic file formats into more common formats, and I'm curious if there are any recommendations for how to do that with noodles or if someone else has done it... like in an ideal world I could:
I know it's possible to add serialization to structs in external packages, but it's a non-trivial amount of work, so I thought I'd ask either a) if there was a good path to take; b) any thoughts/plans on supporting serialization a la rust-bio.
Thanks!