Refactor binsplitting #251
Merged
jakobnissen force-pushed the binsplit branch from b025dd8 to f357023 (November 10, 2023 08:57)
jakobnissen force-pushed the binsplit branch from f357023 to 76ff04a (November 10, 2023 09:10)
jakobnissen force-pushed the binsplit branch from 76ff04a to 991944b (November 10, 2023 09:25)
CC @sgalkina - could I have your review on this, if you have time? This is a larger PR. I tried to break it into commits, but it didn't work out. These are the conceptual large-scale points of change which you can review:
jakobnissen force-pushed the binsplit branch from 991944b to 4dbf5ac (November 10, 2023 11:27)
This is a larger change which overhauls how binsplitting is done and, as a consequence, reworks some of the overall workflow in `__main__.py`. The PR is intended to address the following problems:

* Before, we output either the binsplit clusters or the unsplit clusters, never both. This is problematic: we know the binsplit clusters are the best ones, so we would like to output these, but the unsplit ones contain important information about the source cluster, which power users need to be able to recover.
  - Now, we output both `_split.tsv` and `_unsplit.tsv` files if binsplitting takes place.
* Before, we defaulted to no binsplitting, even though we knew it was inferior.
  - Now, `-o C` is the default.
* Before, if a user passed in a wrong binsplit separator, Vamb would not error until the clustering step, and the error message was inscrutable.
  - Now, Vamb errors already when parsing the contigs, except if the binsplit separator has defaulted to 'C', in which case binsplitting is disabled and the user is warned.
  - The error message is significantly improved and more explanatory.
* Before, the logic of where binsplitting happened was ad hoc and scattered all over the place. For example, binsplitting took place during cluster writing, during bin writing, during benchmarking, during clustering itself, and immediately after clustering. It was also implemented in multiple places.
  - Now, a `BinSplitter` class is responsible for binsplitting. The writer functions and loader functions do not binsplit.
  - Now, binsplitting mostly takes place immediately before writing the split clusters, meaning the clusters are unambiguously unsplit for the majority of the program.
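To make the behaviour described above concrete, here is a minimal sketch of a splitter object that validates the separator when contig names are parsed and splits clusters only on request. It is not Vamb's actual `BinSplitter`: the class name `ToyBinSplitter`, its methods, the `<sample><sep><cluster>` naming of split bins, and the rule that a valid contig name has non-empty text on both sides of the first separator are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set


class ToyBinSplitter:
    """Hypothetical stand-in for the PR's BinSplitter; not Vamb's real code."""

    def __init__(self, sep: Optional[str] = None) -> None:
        # No separator passed: fall back to 'C' and remember it was a default.
        self.is_default = sep is None
        self.sep: Optional[str] = "C" if sep is None else sep

    def _sample_of(self, name: str) -> Optional[str]:
        # Assumed naming scheme: <sample><sep><rest>, both parts non-empty.
        sample, found, rest = name.partition(self.sep)
        return sample if (found and sample and rest) else None

    def validate(self, contig_names: Iterable[str]) -> List[str]:
        """Check the separator while parsing contigs; return any warnings."""
        warnings: List[str] = []
        if self.sep is None:
            return warnings
        for name in contig_names:
            if self._sample_of(name) is None:
                if self.is_default:
                    # Defaulted separator does not fit: disable and warn.
                    warnings.append(
                        f"Default separator {self.sep!r} does not split "
                        f"contig name {name!r}; binsplitting is disabled."
                    )
                    self.sep = None
                    return warnings
                # Explicit separator does not fit: fail early and clearly.
                raise ValueError(
                    f"Separator {self.sep!r} does not split contig name "
                    f"{name!r} into a sample part and a contig part."
                )
        return warnings

    def split(self, clusters: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
        """Return clusters split by sample; the input is left untouched."""
        if self.sep is None:
            return {name: set(contigs) for name, contigs in clusters.items()}
        result: Dict[str, Set[str]] = defaultdict(set)
        for cluster_name, contigs in clusters.items():
            for contig in contigs:
                sample = self._sample_of(contig)
                if sample is None:
                    raise ValueError(f"Cannot find sample in {contig!r}")
                result[f"{sample}{self.sep}{cluster_name}"].add(contig)
        return dict(result)
```

Because `split` returns a new mapping, the unsplit clusters remain available, so a caller could write both an unsplit and a split file from the same clustering result.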
jakobnissen force-pushed the binsplit branch from 4dbf5ac to 78da424 (November 11, 2023 14:35)
OK, gonna merge this
There are multiple problems with the way we do binsplitting now, which I would like to address in this larger PR:
`'C'`
`-o C`
into the command without realising what it implies about the naming scheme of the identifiers, then Vamb errors only at the clustering step after training. Ideally, this would error already when reading in the FASTA file.
Further,
`-i`
option. This is not generally useful. We should instead output total bin size and other bin statistics to users and let them decide (Meta-issue for improving user-friendliness #240).
To do
`__main__.py`
to deduplicate code
`-o X`
and
`-o`
`-c X`
Closes #237