Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add plain and uniprot flavor #2

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

di-hardt
Copy link
Contributor

@di-hardt di-hardt commented Jan 23, 2023

Hey folks,

My changes introduce different implementation for header parsing when reading and writing FASTAs. The Rust implementation of the reader and writer is generic while the Python bindings have differentiated reader and writer classes because the compiler needs to know what generic to use. The implementation should allow other developer to implement their own parsing when including mzio in their projects.

Best,
Dirk

@david-bouyssie
Copy link
Contributor

david-bouyssie commented Jan 25, 2023

Hi Dirk. I went very quickly through the list of changes.
Before giving a detailed review I would like to switch the discussion a step behind.

I would like to discuss the FASTA reader API at the Rust level.
I think that for the sake of speed and simplicity it would be better to do this in two steps:

  1. I would create a simpler reader very similar to https://github.com/rusteomics/mzio/blob/main/mzio-rs/src/fasta/reader.rs#L11
    This first reader would only iterate the FASTA entries and split it in header and sequence strings.
    A dedicated iterator could just return a very simple like struct FastaEntry { header: String, sequence: String }
  2. Then we could defined other structures like
    struct FastaGenericHeader { accession: String, description: String } // accession & description could be used to define a common trait
    struct FastaUniprotHeader { database: String, accession: String, entry_name: String, protein_name: String, keyword_attributes: HashMap<String, String> } // a description fn could return protein_name

On top of that we define several implementations of a given trait to parse a Fasta Header.
The parsing function would accept a FASTA header as a plain String or &str and would return one of these structs.

By doing something similar to this, this would allow lazy/deferred parsing of the header. In my apps, most of the time I only need the accession number, the full description and the sequence, so the generic parser is basically all what I need.
As discussed during the hackathon it might speed up the benchmarks as well.

@di-hardt
Copy link
Contributor Author

Hey David,

thanks for the review!

I think this is all covered by the proposed implementation. Of course it is a bit complex due to the generic approach but it does exactly what your suggested:

  1. For no further header processing use the Plain-header

        let mut reader = Reader::<Plain>(some_file);
        for entry in reader {
            entry.get_header().get_header(); // Method names are redundent, first call returns the PlainHeader the second the header as string
        }

    Developers are free to process it further as they wish.

    For FASTA-header from Uniprot you can use the UniProt-header which parses the header further. If you got a UniProt FASTA you don't have to care about it.

        let mut reader = Reader::<UniProt>(some_file);
        for entry in reader {
            entry.get_header().get_database();
            entry.get_header().get_accession();
        }

    Of course I would only allow implementation for common header styles in mzio. For custom styles, Rust-developers can just implement the Heaader-trait for their custom header-structs in their crates and can use the same reader.

  2. The bindings for higher level languages like Python the reader is differentiated for each style (e.g. PlainRreader and FastaReader). Just use it like normal in python

    reader = PlainReader(fasta_path)
    for entry in reader:
        entry.get_header().get_header()

I like the simplicity of this approach, also it looks kinda complex due to generics. On binding level, go ahead and just us what you need. As you usal would do in Python. On Rust level you can use what is there (Plain, UniProt, ...) or just implement your own custom Header implementation by using one trait compile it and it's still fast without duplicated code, the need of multiple iteration or post-processing.

What I don't like is the chaining of the getter. Maybe using public attributes might be an option. But I don't like them either.

@david-bouyssie
Copy link
Contributor

Hi Dirk,

Sorry I didn't get it from the first glance. However I think some of my comments are still relevant, I'll spot directly the lines of code of your commit.

@Luxxii
Copy link
Member

Luxxii commented Jan 25, 2023

Just to summarize:

@david-bouyssie idea:

reader = FastaReader(fasta_path)
for header, sequence in reader:
    parsed_header_struct = rusteomics.parse_uniprot(header) # Or some other methods, just for illustration purpose
    # do something

@di-hardt idea:

reader = FastaReaderUniProt(fasta_path) # Or UniProt or other FastaReader
for parsed_headerstruct, sequence in reader:
    # do something

I would actually propose something different, more a combination of both approaches. I would go the way @di-hardt has implemented the FASTA-Reader in Rust, but would try to differently implement (or represent) the python-bindings. In theory we could add more FASTA-Header-Parsers and this would give es the overhead of writing new classes in Python (maybe also in other languages, which could be annoying).

I would propose to have another parameter like mode or to be more explicit: parse_method, which is similar to some implementations i encountered in python. E.G.: in igraph, we have the following in python:

# g is a directed graph
g.degree(mode="out") # Number of edges outgoing (each graph)
g.degree(mode="in") # Number of edges ingoing (each node)

and in R:

degree(g, mode = "out") # as above
degree(g, mode = "in") # as above

This makes it easier for us, since, we only have to expose one function (or class, or ...) and developers using it can explicitly select what they want to use and can look into the documentation at one place (e.g.: instead of having a page per FASTA-Reader in python we could have it all in one. Also we may be to different from the Rust-documentation, which i would like to be close as possible).

Summarized:

reader = FastaReader(fasta_path) # Reads it plain (default)
for header, sequence in reader:
    # do something
    
reader = FastaReader(fasta_path, mode="uniprot") # or more explicit, parse_method
for parsed_header, sequence in reader:
    # do something

or in R:

for(i in fastaReader(file_path, mode = "plain")) {
    # do something
}

for(i in fastaReader(file_path, mode = "uniprot")) {
    # do something
}

Maybe my approach could also be to limited, since we then need to ensure for such cases, that the programming languages all support this. E.g.: In Python it is not a problem to return different datatypes from the same function. In R i am not sure. (In C/C++?, WebAssembly?, ...) Let me know what you think!

@david-bouyssie
Copy link
Contributor

david-bouyssie commented Jan 25, 2023

@Luxxii

I was actually reasoning at the Rust level but not yet at the Python level.
I think your suggestions are totally relevant.
I would prefer to avoid string based configuration though.

Here is what I would like to have on the Python side:

reader = FastaReader(fasta_path)
for entry in reader:
    aa_seq = entry.sequence()
    ### retrieve the header as plain text ###
    plain_header = entry.header()

     ### split the header in two strings (split after the first whitespace char) ###
    (entry_identifier, entry_description) = entry.split_header()

    ### parse the header using custom regexes ###
    custom_parsed_header = entry.parse_header("(\S+)","\S+ (.+)")
    plain_header2 = custom_parsed_header.accession + " " + custom_parsed_header.description
 
    ### parse the header using the UniProt parser (faillible operation) ###
    try:
      uniprot_parsed_header = entry.parse_uniprot_header()
     # do something with uniprot_parsed_header 
    except Exception as e:
      # do something with e

@lazear
Copy link

lazear commented Jan 26, 2023

What's the rationale for having different variants of a fasta reader - seems to me like something that should be handled by the end user?

@di-hardt
Copy link
Contributor Author

Hey @lazear
the reason for different variants is that the FASTA format isn't standardized. I would like to give users of mzio the option to read it in plain and in the most common formats (e.g. UniProt) without the need of implement something else.

There are some libraries, e.g. Protgraph which create FASTA files with extensive headers. It would be nice to use mzio to read them too by just creating another struct and simply implementing the Header trait.

Best,
Dirk

@lazear
Copy link

lazear commented Jan 27, 2023

Hi Dirk,

I understand the use-case (I have run into non-standard FASTA files far too many times), I just question whether this approach is the best thing to do. It seems like a lot of ceremony (e.g. having to implement a basically empty trait). Consider that you could instead just do something like

struct Uniprot {
   database: String,
   ...
}

struct Entry {
   header: String,
   sequence: String
}

impl From<Entry> for Uniprot {
   ...
}

Mike

P.S. Protgraph looks really neat... I think I have a usecase for it :)

@david-bouyssie
Copy link
Contributor

david-bouyssie commented Jan 28, 2023

I agree with Michael.
It makes more sense to separate the IO logic from the parsing logic (which could be done with the idiomatic std::convert::From or with a custom trait, or even with good old static functions).

By separating these ops in dedicated components, this allows for more composition and more flexibility.
For instance, we discussed to implement an IndexedFastaReader (similar to our prototype, https://github.com/pinaraltiner/proteomics-rs/blob/main/src/io/fasta.rs#L15).
If the IO and parsing concerns are separated by design, it will also simplify the implementation of the IndexedFastaReader, and we can still reuse the parsing components at the caller level.

@di-hardt
Copy link
Contributor Author

Sorry @david-bouyssie I'm not seeing how this will simplify the implementation of the IndexedFastaReader, as the implementation for parsing is separated from the reader already, and I would not use the FastaReader for indexing as the IndexedFastaReader only needs to return the indexes instead of strings.

Ok, that's far more discussion than I like for something that simple, and I think the tendency is clear. I will change it to return just the plain header and move the logic for parsing somewhere else using the From-trait. Any suggestions where? Same crate different module? It's still an IO task, like parsing the information from an mzML-formatted spectra, so mzcore doesn't really fit.

But we should keep in mind that we have a similar case for reading different spectra formats. In that case I would like to avoid using the From-trais. While ms2 or mgf files contain only minimal information mzML files bring a lot more stuff with them and I would like to avoid saving the overhead XML is bringing along until From is called.

@david-bouyssie
Copy link
Contributor

@di-hardt sorry Dirk, I didn't want this discussion to be annoying, and I hope it's not killing the motivation. Your approach is very correct, and I think the goal of the present discussion is just to see if there are other approaches that could simplify the API and/or the code. Sorry to see that it's making the process less smooth.

Regarding your question about the IndexedFastaReader, in my implementation it returns actually both the values and the position/offset, so there also it's a matter of API. If FastaReader and IndexedFastaReader where doing IO only, this will only eliminate the creation/use of "crate::fasta::entry::Entry" in the readers. I admit, it's not such a big difference.

I more concerned about the double get_header.get_header syntax.
To make it totally clear, I would try to distinguish the String based header from the parsed header (several possible types) using different function names that could be get_plain_header and get_parsed_header.
Maybe we won't totally agree on this, and maybe it's only a matter of preference.

More generally regarding the PR process, we will have to find the good tradeoff between best practices and pragmatism. Sorry for the bad balance this time. I'm sure this will improve in the future.

@Luxxii
Copy link
Member

Luxxii commented Jan 30, 2023

@david-bouyssie actually, the idea with get_header_plain or get_header_<style> (like get_header_uniprot ) is also a nice solution. Here a user could simply choose what he needs and this certainly would work for multiple languages (FFI) since we have a method for each, therefore one struct to be returned per method. And @lazear, if i understand this correctly this would look in rust like: let uniprot_header = UniProt::From(<SomeHeader>); which we can probably easily map into other languages too! (Sorry my "rustfu" is still not very good, i am looking at the Python and R-perspective here)

@di-hardt Regarding where to put the parsing. I would vote for mzio, since it is tightly coupled with parsing files and how we represent them. We would then only need to extend entry.rs in our implementation (maybe also with structs, like UniProt, if we go with the get_header_plain|From idea).

However, the writing part could be a bit tricky with this. E.G.: We load plain-FASTA and want to save in UniProt-format, we would need to have a write-function which would enable this: FASTAWriter.write(<UniProt>, <Sequence>) (or with vectors for multiple entries at once).

@lazear
Copy link

lazear commented Jan 30, 2023

I think the get_header vs get_header_uniprot question is more one of style. I would lean towards just get_header, allowing the user to convert into Uniprot/etc via From/Into (which should really be the fallible TryFrom). One of the nice things about using the built-in TryFrom/From traits is that you get TryInfo/Into auto-generated.

Personally, I try not to introduce new traits unless it really makes sense to do so (e.g. not for trivial conversions between types).

The conversion would look like this

   let mut reader = Reader(some_file);
    for entry in reader {
        let entry: Result<Uniprot, Entry> = entry.try_into(); // return the failing entry if it cannot be converted to Uniprot struct
    }

or

let mut reader = Reader(some_file);
let entries: Result<Vec<Uniprot>,_> = reader.into_iter().map(TryInto::try_into).collect();

Then, if you need to write, you simply have a single FastaWriter struct that takes Entries - and then implement Into<Entry> for Uniprot.

Not trying to belabor these points - but it's because it's for something so simple that I wanted to butt in. impl From<BaseEntry<BasePlain>> for Plain and custom traits seems like a lot of complexity for something as simple as parsing a fasta file!

@di-hardt
Copy link
Contributor Author

@david-bouyssie No no! Not annoying, just too much for the discussed subject. In the end it's just a FASTA reader.
I see. Yeah since I'm relying on the built-in BufReader for IO-buffering, there would be no shared code between the indexed and normal FASTA-reader, but the BufReader initialization
I totally agree that get_header.get_header is not nice at all and need refactoring

@lazear I can live with that as it is also a Rust way. I also considered it when implementing my version. I was simply concerned about a little more overhead, as my version is not moving the data into PlainEntry and in a additional step into the UniprotEntry. But maybe there is no overhead?

@david-bouyssie
Copy link
Contributor

@di-hardt @lazear how shall we continue regarding this PR? I think we should maybe more flexible regarding Rusteomics development. Blocking PR is maybe not fostering development progress. We need to find a tradeoff between finding the "best" APIs and having something implemented in a given way. This may impact forward compat, but this should definitely help the project.

@di-hardt
Copy link
Contributor Author

I think we need a decision pipeline in the beginning of Rusteomics, like:

  1. Proposal of a new feature (e.g. an issue)
  2. Decision in a (monthly) meeting if a feature is added
  3. Short brainstorm what we're expecting from that feature and how it should be implemented
  4. Someone takes lead in devloping that feature
    With this we're making sure we have a working homogenous API in the end. If someone go ahead, like me in this case, it is difficult to agree on the work in retrospec.

@lazear
Copy link

lazear commented Oct 25, 2023

I like those ideas (although approving features once a month seems a bit slow, perhaps? ... not that there has been a ton of work on the project through 😃).

I would propose that some Rust application be developed so that the rusteomics API can grow in a somewhat organic and cohesive way. For example, could leverage some of the MSP/MGF work already done and build a spectral library generator - take a peptide sequence, MGF file, scan number and write out the matched peaks. This should be relatively straightforward and would help to develop some of the core parts of the API.

I think this would solve some of the issues inherent to building a library that has no users - much easier (IMO) to build a library in tandem with an application that is actually using it.

@david-bouyssie
Copy link
Contributor

@di-hardt sorry I missed the notification of your last post
Should be added to this discussion for a more general visibility?
https://github.com/orgs/rusteomics/discussions/6

@lazear I really love the idea of having a concrete tool built on top of Rusteomics libraries.
Maybe we could also discuss this point on GithHub Discussion/Ideas?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants