Calculate Tanimoto scores during training #127
Sorry, coming back to this with some delay... I think that computing the scores on the fly during training is not a good option. We often search for rare cases (say a Tanimoto score between 0.8 and 0.9), which could mean computing many thousands of scores before getting to the "right" pair. In my opinion there are more performant options, in particular because we will never train a model on all pairs anyway (100,000 x 100,000 pairs is a lot).
Pros: Fairly easy to implement, since it is done in the preprocessing; the rest of the training pipeline can remain untouched. Cons: We won't make use of the full variety of the large dataset. And, maybe even more critically, we have to be even more careful in how we select the pairs we keep (which biases do we take into account, and which not?)
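To make the preprocessing option concrete, here is a rough sketch (not the project's actual code) of precomputing all Tanimoto scores once and grouping the index pairs per score bin. It assumes binary fingerprints are already available as a boolean NumPy array; the function names are hypothetical.

```python
import numpy as np

def tanimoto_matrix(fps: np.ndarray) -> np.ndarray:
    """All-vs-all Tanimoto scores for boolean fingerprint rows."""
    fps = fps.astype(np.float32)
    intersect = fps @ fps.T                       # |A & B| for every pair
    counts = fps.sum(axis=1)                      # |A| per fingerprint
    union = counts[:, None] + counts[None, :] - intersect
    return np.where(union > 0, intersect / union, 0.0)

def bin_pairs(scores: np.ndarray, edges):
    """Group upper-triangle index pairs by the Tanimoto bin they fall into."""
    i, j = np.triu_indices(scores.shape[0], k=1)  # skip self-pairs
    bin_ids = np.digitize(scores[i, j], edges)
    return {int(b): list(zip(i[bin_ids == b].tolist(), j[bin_ids == b].tolist()))
            for b in np.unique(bin_ids)}
```

During training, a generator could then draw pairs from `bin_pairs(...)` directly, which is exactly why the rest of the pipeline could stay untouched, but also why the kept pairs fix the dataset's variety up front.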
Btw: graph options will likely also become interesting here.
Yes, I agree, generating a sparse matrix (or graph) would solve the scalability issue without repeatedly calculating the same scores. Before switching to new ways of pair generation I hope to get a more intuitive understanding of their effect, since to me the benefit of DataGeneratorsSpectrums over DataGeneratorInchikeys is still not fully intuitive.
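One way the sparse-matrix idea could look (an assumption, not the project's implementation): compute scores block-wise so the full dense matrix never exists in memory, and keep only pairs at or above a threshold of interest.

```python
import numpy as np
from scipy import sparse

def sparse_tanimoto(fps: np.ndarray, low: float = 0.5,
                    block: int = 1024) -> sparse.coo_matrix:
    """All-vs-all Tanimoto, computed block-wise; only scores >= low are stored."""
    fps = fps.astype(np.float32)
    counts = fps.sum(axis=1)
    n = fps.shape[0]
    rows, cols, vals = [], [], []
    for start in range(0, n, block):
        stop = min(start + block, n)
        inter = fps[start:stop] @ fps.T                        # (block, n)
        union = counts[start:stop, None] + counts[None, :] - inter
        scores = np.where(union > 0, inter / union, 0.0)
        r, c = np.nonzero(scores >= low)
        keep = (r + start) < c                                 # upper triangle only
        rows.extend((r[keep] + start).tolist())
        cols.extend(c[keep].tolist())
        vals.extend(scores[r[keep], c[keep]].tolist())
    return sparse.coo_matrix((vals, (rows, cols)), shape=(n, n))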
I think this will be fixed with #145.
For training we currently need a Tanimoto score matrix. This is calculated relatively fast for the current GNPS library (25,000 InChIKeys), but requires a file of >5 GB. Since this scales quadratically, it does not seem to be a suitable solution for ever larger libraries. I already got issues reported for MS2Query where the server did not have enough RAM to calculate this for 100,000 vs 100,000 spectra.
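The quadratic growth is easy to sanity-check: a dense n x n matrix of 8-byte scores takes n² x 8 bytes, which matches the >5 GB file at 25,000 InChIKeys and explains why 100,000 spectra exhaust typical server RAM.

```python
def dense_matrix_gb(n: int, bytes_per_score: int = 8) -> float:
    """Memory footprint of a dense n x n score matrix, in GB (float64 scores)."""
    return n ** 2 * bytes_per_score / 1e9

print(dense_matrix_gb(25_000))   # 5.0 GB  -- matches the >5 GB file reported
print(dense_matrix_gb(100_000))  # 80.0 GB -- far beyond typical server RAM
```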
It might be possible to include the calculation in the data generators. This would make training more straightforward for the user and would scale better.
The fingerprints should of course still be precalculated and stored, but this scales linearly and is therefore not such a big issue. My expectation is that this will not slow down training by much and will even improve training time for larger datasets, since not all Tanimoto score combinations have to be calculated.
I realize that one of the steps in the pair-selection process is to select a match from a Tanimoto bin. This would mean the generator needs to keep calculating Tanimoto scores until it finds a match in that bin, which might be time-consuming. Do you think this would be an issue?
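The concern can be sketched as rejection sampling (hypothetical helper, not the actual generator code): draw random pairs and score them until one lands in the target bin. The expected number of attempts is roughly 1/p, where p is the fraction of all pairs falling in that bin, which is exactly why rare bins (e.g. 0.8 to 0.9) could get expensive on the fly.

```python
import random
import numpy as np

def sample_pair_in_bin(fps: np.ndarray, low: float, high: float,
                       max_tries: int = 100_000, rng=random):
    """Draw random fingerprint pairs until one scores in [low, high)."""
    n = fps.shape[0]
    counts = fps.sum(axis=1)
    for tries in range(1, max_tries + 1):
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            continue
        inter = np.logical_and(fps[i], fps[j]).sum()
        union = counts[i] + counts[j] - inter
        score = inter / union if union else 0.0
        if low <= score < high:
            return i, j, tries          # tries = cost of hitting this bin
    raise RuntimeError("no pair found in bin within max_tries")
```

Counting `tries` per bin on a sample of the library would give a quick empirical answer to whether on-the-fly calculation is affordable for the rare bins.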
@florian-huber What do you think?