add plain and uniprot flavor #2
Hi Dirk. I went very quickly through the list of changes. I would like to discuss the FASTA reader API at the Rust level.
On top of that we define several implementations of a given trait to parse a FASTA header. Doing something similar here would allow lazy/deferred parsing of the header. In my apps, most of the time I only need the accession number, the full description and the sequence, so the generic parser is basically all I need. |
Hey David, thanks for the review! I think this is all covered by the proposed implementation. Of course it is a bit complex due to the generic approach, but it does exactly what you suggested:
I like the simplicity of this approach, although it looks kinda complex due to the generics. At the binding level, go ahead and just use what you need, as you usually would in Python. At the Rust level you can use what is there (Plain, UniProt, ...) or write your own custom Header by implementing one trait. Compile it and it's still fast, without duplicated code, multiple iterations or post-processing. What I don't like is the chaining of the getters. Maybe using public attributes might be an option, but I don't like them either. |
Hi Dirk, sorry, I didn't get it at first glance. However, I think some of my comments are still relevant; I'll comment directly on the relevant lines of code in your commit. |
Just to summarize:
@david-bouyssie's idea:
reader = FastaReader(fasta_path)
for header, sequence in reader:
    parsed_header_struct = rusteomics.parse_uniprot(header)  # Or some other method, just for illustration purposes
    # do something
@di-hardt's idea:
reader = FastaReaderUniProt(fasta_path)  # Or UniProt or other FastaReader
for parsed_headerstruct, sequence in reader:
    # do something
I would actually propose something different, more a combination of both approaches. I would go the way @di-hardt has implemented the FASTA reader in Rust, but would try to implement (or represent) the Python bindings differently. In theory we could add more FASTA header parsers, and this would give us the overhead of writing new classes in Python (and maybe in other languages, which could be annoying). I would propose to have another parameter like
# g is a directed graph
g.degree(mode="out")  # Number of outgoing edges (each node)
g.degree(mode="in")  # Number of incoming edges (each node)
and in R:
degree(g, mode = "out")  # as above
degree(g, mode = "in")  # as above
This makes it easier for us, since we only have to expose one function (or class, or ...), and developers using it can explicitly select what they want and can look into the documentation in one place (e.g. instead of having a page per FASTA reader in Python we could have it all in one. Also, we may be too different from the Rust documentation, which I would like to stay as close to as possible).
Summarized:
reader = FastaReader(fasta_path)  # Reads it plain (default)
for header, sequence in reader:
    # do something
reader = FastaReader(fasta_path, mode="uniprot")  # or more explicit, parse_method
for parsed_header, sequence in reader:
    # do something
or in R:
for(i in fastaReader(file_path, mode = "plain")) {
    # do something
}
for(i in fastaReader(file_path, mode = "uniprot")) {
    # do something
}
Maybe my approach could also be too limited, since we then need to ensure that all the programming languages support this. E.g. in Python it is not a problem to return different datatypes from the same function. In R I am not sure. (In C/C++? WebAssembly? ...) Let me know what you think! |
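For illustration, here is a minimal Rust sketch of how a single entry point with a mode parameter could cope with a language that cannot return different types from one function: the result is wrapped in an enum. `HeaderMode`, `ParsedHeader` and `parse_header` are hypothetical names, not part of the PR, and the UniProt branch only handles the `db|accession|entry_name` prefix. The header used in any test is example data.

```rust
/// Hypothetical selector, playing the role of `mode="plain"` / `mode="uniprot"`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum HeaderMode {
    Plain,
    UniProt,
}

/// Hypothetical result type: one enum instead of one reader class per flavor.
#[derive(Debug, PartialEq)]
pub enum ParsedHeader {
    Plain(String),
    UniProt {
        database: String,
        accession: String,
        entry_name: String,
    },
}

/// Parses a header line (without the leading '>') according to `mode`.
/// Returns `None` if the header does not fit the requested flavor.
pub fn parse_header(header: &str, mode: HeaderMode) -> Option<ParsedHeader> {
    match mode {
        HeaderMode::Plain => Some(ParsedHeader::Plain(header.to_string())),
        HeaderMode::UniProt => {
            // Simplified assumption: only the identifier before the first
            // whitespace is inspected, and it must have three '|' fields.
            let id = header.split_whitespace().next()?;
            let mut parts = id.splitn(3, '|');
            let database = parts.next()?.to_string();
            let accession = parts.next()?.to_string();
            let entry_name = parts.next()?.to_string();
            Some(ParsedHeader::UniProt {
                database,
                accession,
                entry_name,
            })
        }
    }
}
```

Callers then `match` on `ParsedHeader`, which keeps one documented entry point while staying statically typed.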
I was actually reasoning at the Rust level but not yet at the Python level. Here is what I would like to have on the Python side:
reader = FastaReader(fasta_path)
for entry in reader:
    aa_seq = entry.sequence()
    ### retrieve the header as plain text ###
    plain_header = entry.header()
    ### split the header in two strings (split after the first whitespace char) ###
    (entry_identifier, entry_description) = entry.split_header()
    ### parse the header using custom regexes (raw strings keep the backslashes intact) ###
    custom_parsed_header = entry.parse_header(r"(\S+)", r"\S+ (.+)")
    plain_header2 = custom_parsed_header.accession + " " + custom_parsed_header.description
    ### parse the header using the UniProt parser (fallible operation) ###
    try:
        uniprot_parsed_header = entry.parse_uniprot_header()
        # do something with uniprot_parsed_header
    except Exception as e:
        # do something with e
|
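The same deferred-parsing idea can be sketched at the Rust level with the standard library alone. The `Entry` type and its methods below are assumptions that mirror the Python wish list above, not the API of the PR; the regex-based `parse_header` is left out because it would need an external crate.

```rust
/// Hypothetical raw FASTA entry: the reader does I/O only and hands out strings.
pub struct Entry {
    header: String,
    sequence: String,
}

impl Entry {
    pub fn new(header: &str, sequence: &str) -> Self {
        Entry {
            header: header.to_string(),
            sequence: sequence.to_string(),
        }
    }

    /// Header as plain text, without any parsing.
    pub fn header(&self) -> &str {
        &self.header
    }

    /// Amino acid sequence.
    pub fn sequence(&self) -> &str {
        &self.sequence
    }

    /// Splits the header at the first whitespace character into
    /// (identifier, description); the description is empty if there is none.
    pub fn split_header(&self) -> (&str, &str) {
        match self.header.split_once(char::is_whitespace) {
            Some((id, description)) => (id, description.trim_start()),
            None => (self.header.as_str(), ""),
        }
    }
}
```

Nothing is parsed until a caller asks for it, and `split_header` borrows from the stored header instead of allocating.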
What's the rationale for having different variants of a fasta reader - seems to me like something that should be handled by the end user? |
Hey @lazear There are some libraries, e.g. Protgraph which create FASTA files with extensive headers. It would be nice to use Best, |
Hi Dirk, I understand the use case (I have run into non-standard FASTA files far too many times), I just question whether this approach is the best thing to do. It seems like a lot of ceremony (e.g. having to implement a basically empty trait). Consider that you could instead just do something like
struct Uniprot {
    database: String,
    ...
}
struct Entry {
    header: String,
    sequence: String
}
impl From<Entry> for Uniprot {
    ...
}
Mike
P.S. Protgraph looks really neat... I think I have a use case for it :) |
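As a hedged illustration of the `From`-based suggestion, the sketch below fills in the elided parts with assumptions: `Uniprot` is reduced to three fields, and missing header parts become empty strings so the conversion stays infallible. Neither choice is taken from the PR.

```rust
/// Raw entry as produced by a reader that does no header parsing.
pub struct Entry {
    pub header: String,
    pub sequence: String,
}

/// Trimmed-down UniProt view; a real UniProt header carries more fields.
#[derive(Debug, PartialEq)]
pub struct Uniprot {
    pub database: String,
    pub accession: String,
    pub sequence: String,
}

impl From<Entry> for Uniprot {
    /// Infallible by construction: missing header parts become empty strings
    /// (an assumption of this sketch).
    fn from(entry: Entry) -> Self {
        let mut parts = entry.header.split('|');
        let database = parts.next().unwrap_or("").to_string();
        let accession = parts.next().unwrap_or("").to_string();
        // The sequence is moved into the new struct, not copied.
        Uniprot {
            database,
            accession,
            sequence: entry.sequence,
        }
    }
}
```

With this in place a caller writes `let u: Uniprot = entry.into();`, and the reader itself never needs to know about header flavors.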
I agree with Michael. By separating these ops in dedicated components, this allows for more composition and more flexibility. |
Sorry @david-bouyssie, I'm not seeing how this would simplify the implementation of the IndexedFastaReader, as the parsing is already separated from the reader, and I would not use the FastaReader for indexing, as the IndexedFastaReader only needs to return the indexes instead of strings. OK, that's far more discussion than I like for something that simple, and I think the tendency is clear. I will change it to return just the plain header and move the parsing logic somewhere else using the From trait. Any suggestions where? Same crate, different module? It's still an IO task, like parsing the information from mzML-formatted spectra, so But we should keep in mind that we have a similar case for reading different spectra formats. In that case I would like to avoid using the From trait. While |
@di-hardt sorry Dirk, I didn't want this discussion to be annoying, and I hope it's not killing the motivation. Your approach is very correct, and I think the goal of the present discussion is just to see if there are other approaches that could simplify the API and/or the code. Sorry to see that it's making the process less smooth. Regarding your question about the IndexedFastaReader: in my implementation it actually returns both the values and the position/offset, so there too it's a matter of API. If FastaReader and IndexedFastaReader were doing IO only, this would only eliminate the creation/use of "crate::fasta::entry::Entry" in the readers. I admit it's not such a big difference. I'm more concerned about the double More generally, regarding the PR process, we will have to find a good tradeoff between best practices and pragmatism. Sorry for the bad balance this time. I'm sure this will improve in the future. |
@david-bouyssie actually, the idea with @di-hardt Regarding where to put the parsing. I would vote for However, the writing part could be a bit tricky with this. E.g. if we load plain FASTA and want to save it in UniProt format, we would need a write function which enables this: |
I think the Personally, I try not to introduce new traits unless it really makes sense to do so (e.g. not for trivial conversions between types). The conversion would look like this
let mut reader = Reader(some_file);
for entry in reader {
    let entry: Result<Uniprot, Entry> = entry.try_into(); // return the failing entry if it cannot be converted to Uniprot struct
}
or
let mut reader = Reader(some_file);
let entries: Result<Vec<Uniprot>,_> = reader.into_iter().map(TryInto::try_into).collect();
Then, if you need to write, you simply have a single Not trying to belabor these points - but it's because it's for something so simple that I wanted to butt in. |
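A self-contained sketch of that fallible conversion, under stated assumptions: `Entry` and `Uniprot` are reduced stand-ins, a header counts as UniProt-like when it has at least three `|`-separated fields with non-empty database and accession, and `convert_all` is a hypothetical helper. The error type is the entry itself, so a failing entry is handed back untouched.

```rust
use std::convert::TryFrom;

#[derive(Debug, PartialEq)]
pub struct Entry {
    pub header: String,
    pub sequence: String,
}

#[derive(Debug, PartialEq)]
pub struct Uniprot {
    pub database: String,
    pub accession: String,
    pub sequence: String,
}

impl TryFrom<Entry> for Uniprot {
    /// On failure the untouched entry is handed back to the caller.
    type Error = Entry;

    fn try_from(entry: Entry) -> Result<Self, Self::Error> {
        // Parse into owned strings first so the borrow of `entry.header`
        // ends before `entry` is moved into `Ok` or `Err`.
        let parsed = {
            let mut parts = entry.header.split('|');
            match (parts.next(), parts.next(), parts.next()) {
                (Some(db), Some(acc), Some(_)) if !db.is_empty() && !acc.is_empty() => {
                    Some((db.to_string(), acc.to_string()))
                }
                _ => None,
            }
        };
        match parsed {
            Some((database, accession)) => Ok(Uniprot {
                database,
                accession,
                sequence: entry.sequence,
            }),
            None => Err(entry),
        }
    }
}

/// Converts every entry, stopping at the first one that is not UniProt-like.
pub fn convert_all(entries: Vec<Entry>) -> Result<Vec<Uniprot>, Entry> {
    entries.into_iter().map(Uniprot::try_from).collect()
}
```

Collecting into `Result<Vec<_>, _>` short-circuits on the first failure, which matches the second variant quoted above.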
@david-bouyssie No no! Not annoying, just too much for the discussed subject. In the end it's just a FASTA reader. @lazear I can live with that, as it is also a Rust way. I also considered it when implementing my version. I was simply concerned about a little more overhead, as my version is not moving the data into PlainEntry and then, in an additional step, into the UniprotEntry. But maybe there is no overhead? |
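On the overhead question, one part can be checked directly: moving a `String` from one struct into another copies only its pointer, length and capacity, while the heap buffer stays where it is. The sketch below uses hypothetical `PlainEntry` and `UniprotEntry` stand-ins and only covers the sequence; fields parsed out of the header (here the accession) are still freshly allocated unless they borrow from it.

```rust
/// Hypothetical stand-in for the plain entry returned by the reader.
pub struct PlainEntry {
    pub header: String,
    pub sequence: String,
}

/// Hypothetical stand-in for the parsed UniProt entry.
pub struct UniprotEntry {
    pub accession: String,
    pub sequence: String,
}

/// Two-step conversion: the accession is copied out of the header,
/// the sequence is moved.
pub fn to_uniprot(plain: PlainEntry) -> UniprotEntry {
    let accession = plain.header.split('|').nth(1).unwrap_or("").to_string();
    UniprotEntry {
        accession,
        sequence: plain.sequence,
    }
}

/// Returns true if the sequence buffer kept its heap address across the
/// conversion, i.e. the sequence bytes were not copied.
pub fn sequence_buffer_is_reused(plain: PlainEntry) -> bool {
    let before = plain.sequence.as_ptr();
    let converted = to_uniprot(plain);
    before == converted.sequence.as_ptr()
}
```

So the extra step costs a few machine words per entry for the sequence, plus whatever the header parsing itself allocates.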
@di-hardt @lazear how shall we continue regarding this PR? I think we should maybe be more flexible regarding Rusteomics development. Blocking PRs is maybe not fostering development progress. We need to find a tradeoff between finding the "best" APIs and having something implemented in a given way. This may impact forward compatibility, but it should definitely help the project. |
I think we need a decision pipeline in the beginning of Rusteomics, like:
|
I like those ideas (although approving features once a month seems a bit slow, perhaps? ... not that there has been a ton of work on the project, though 😃). I would propose that some Rust application be developed so that the rusteomics API can grow in a somewhat organic and cohesive way. For example, we could leverage some of the MSP/MGF work already done and build a spectral library generator: take a peptide sequence, an MGF file and a scan number, and write out the matched peaks. This should be relatively straightforward and would help to develop some of the core parts of the API. I think this would solve some of the issues inherent to building a library that has no users - much easier (IMO) to build a library in tandem with an application that is actually using it. |
@di-hardt sorry, I missed the notification of your last post. @lazear I really love the idea of having a concrete tool built on top of Rusteomics libraries. |
Hey folks,
My changes introduce different implementations for header parsing when reading and writing FASTA files. The Rust implementation of the reader and writer is generic, while the Python bindings have separate reader and writer classes because the compiler needs to know which generic to use. The implementation should allow other developers to implement their own parsing when including mzio in their projects.
Best,
Dirk
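To make the generic design concrete, here is a rough sketch of a reader that is generic over a header trait. It is not the code of this PR: the `Header` trait, the `Plain` and `Accession` flavors and the `Reader` type are hypothetical, and I/O errors simply end the iteration to keep the example short.

```rust
use std::io::BufRead;
use std::marker::PhantomData;

/// Hypothetical header trait; each flavor decides how to parse the raw line.
pub trait Header: Sized {
    fn parse(raw: &str) -> Self;
}

/// Flavor that keeps the header as it is.
#[derive(Debug, PartialEq)]
pub struct Plain(pub String);

impl Header for Plain {
    fn parse(raw: &str) -> Self {
        Plain(raw.to_string())
    }
}

/// Flavor that keeps only the accession: the second '|' field of the
/// identifier, or the whole identifier if there is no such field.
#[derive(Debug, PartialEq)]
pub struct Accession(pub String);

impl Header for Accession {
    fn parse(raw: &str) -> Self {
        let id = raw.split_whitespace().next().unwrap_or("");
        Accession(id.split('|').nth(1).unwrap_or(id).to_string())
    }
}

/// Reader that parses each header exactly once, while reading.
pub struct Reader<R: BufRead, H: Header> {
    lines: std::io::Lines<R>,
    pending: Option<String>, // header line of the next entry, already read
    _flavor: PhantomData<H>,
}

impl<R: BufRead, H: Header> Reader<R, H> {
    pub fn new(input: R) -> Self {
        Reader {
            lines: input.lines(),
            pending: None,
            _flavor: PhantomData,
        }
    }
}

impl<R: BufRead, H: Header> Iterator for Reader<R, H> {
    type Item = (H, String);

    fn next(&mut self) -> Option<Self::Item> {
        // Use the header saved by the previous call, or skip to the next '>'.
        let header = match self.pending.take() {
            Some(h) => h,
            None => loop {
                let line = self.lines.next()?.ok()?;
                if let Some(h) = line.strip_prefix('>') {
                    break h.to_string();
                }
            },
        };
        // Concatenate sequence lines until the next header or end of input.
        let mut sequence = String::new();
        while let Some(Ok(line)) = self.lines.next() {
            if let Some(h) = line.strip_prefix('>') {
                self.pending = Some(h.to_string());
                break;
            }
            sequence.push_str(line.trim());
        }
        Some((H::parse(&header), sequence))
    }
}
```

A caller picks the flavor at compile time, e.g. `Reader::<_, Plain>::new(file)`. Because the type parameter has to be fixed when compiling, language bindings need one concrete class per flavor, which is the constraint described above.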