Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor binsplitting #251

Merged
merged 1 commit into from
Nov 14, 2023
Merged

Refactor binsplitting #251

merged 1 commit into from
Nov 14, 2023

Conversation

jakobnissen
Copy link
Member

@jakobnissen jakobnissen commented Nov 7, 2023

There are multiple problems with the way we do binsplitting now, which I would like to address is this larger PR:

  • We either output split bins or unsplit bins, where it would be more useful to output both, if possible (Output source clusters for bins when binsplitting #237).
    • In this PR, both files are always written if binsplitting takes place.
  • We default to no binsplitting, even though we know this produces worse output. Ideally, we'd like for Vamb's defaults to be optimal.
    • In this PR, we default to binsplitting with 'C'.
  • If users pass in a wrong binsplit separator, e.g. by mindlessly copy-pasting -o C into the command without realising what it implies about the naming scheme of the identifiers, then Vamb errors only at the clustering step after training. Ideally, this would error already when reading in the FASTA file.
    • In this PR, it errors when parsing the FASTA file if the separator is not found, UNLESS the user did not specify any separator and it just defaulted to 'C'. In this case, it will warn the user, and then disable binsplitting
    • The error message is improved in this PR
  • The logic of binsplitting is too complicated, because binsplitting takes place at multiple different levels: Immediately after instantiating the clusterer, when writing the cluster file, or when writing the bins file.
    • In this PR, only binsplit immediately before writing to files.
    • Also, in this PR, the printing functions should not binsplit. Instead, binsplit the clusters, then print them

Further,

To do

  • Refactor the clustering and writing of clusters in __main__.py to deduplicate code
  • Address TODOs
  • Make sure the program runs normally
  • ....and respects -o X and -o
  • ...and respects -c X
  • Make tests pass

Closes #237

@jakobnissen
Copy link
Member Author

CC @sgalkina - can I have your review on this (if you have time?)

This is a larger PR. I tried to break it into commits but it didn't work out. These are the conceptual large-scale points of change which you can review:

  • Removed the -i option. I think we don't really need this as a CLI parameter. I also had some real reason for removing it, but I can't really remember now.
  • In parsecontigs.py, move the warnings to __main__.py, since it's really a user-facing warning meant for CLI use
  • In vambtools.py, Added a BinSplitter class responsible for binsplitting. This is useful now because the binsplitting logic is somewhat more complex to make it more user friendly.
  • In vambtools.py, simplify write_clusters and write_bins, such that they have no binsplitting or filtering obligations. Instead, binsplitting and/or filtering must be done before passing to these functions.
  • Refactoring __main__.py. Mostly just adapt the above changes, but also add cluster_and_write_files and write_clusters_and_bins to cut down on code duplication

@jakobnissen jakobnissen changed the title [WIP] Refactor binsplitting Refactor binsplitting Nov 10, 2023
This is a larger change which overhauls how binsplitting is done, and, as a
consequence, reworks some of the overall workflow in `__main__.py`.
The PR is intended to address the following problems:
* Before, we only output either the binsplit clusters, or the unsplit clusters.
  This is problematic, because we know the binsplit clusters are the best ones,
  so we would like to output these. However, the unsplit ones contain important
  information about the source cluster, which powerusers need to be able to
  recover.
  	- Now, we output both `_split.tsv` and `_unsplit.tsv` files, if
  	  binsplitting takes place.
* Before, we defaulted to no binsplitting, even as we know it was inferior
    - Now, `-o C` is default.
* Before, if a user passed in a wrong binsplit separator, Vamb would not error
  until the clustering step, and the error message would be inscrutable
    - Now, error already when parsing the contigs, EXCEPT if the binsplit sep
      has defaulted to 'C', in which case binsplitting is disabled, and the user
      is warned
    - The error message is significantly improved and more explanatory
* Before, the logic of where binsplitting happened was ad-hoc, and scattered
  all over the place. For example, binsplitting took place during cluster
  writing, during bin writing, during benchmarking, during clustering itself,
  and immediately after clustering. It was also implemented multiple places.
    - Now, create a `BinSplitter` class responsible for binsplitting.
      The writer functions and loader functions do not binsplit.
    - Now, binsplitting mostly takes place immediately before writing the
      split clusters meaning the clusters are unambiguously unsplit for the
      majority of the program
@jakobnissen jakobnissen merged commit 15638ff into master Nov 14, 2023
5 checks passed
@jakobnissen
Copy link
Member Author

OK, gonna merge this

@jakobnissen jakobnissen deleted the binsplit branch November 14, 2023 10:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Output source clusters for bins when binsplitting
1 participant