Rehaul and tweak clustering #198

Merged
merged 1 commit into from
Sep 6, 2023
jakobnissen
Member

This commit is a squashed version of several commits that can be found on the exp_cluster branch, each of which has been tested individually. It contains the following changes:

Seed ordering

Before, the order in which seed contigs were picked was randomized to prevent bias arising from the order of the sequences in the FASTA files. Now, we pick the seeds in a precomputed order, currently by contig length. The idea is that the contigs we are most confident belong to a good cluster get "first pick" during clustering to see if they can form proper clusters.
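
To illustrate the idea, here is a minimal sketch of a length-based seed order. It is not the code in this PR: the helper name `seed_order` is hypothetical, and sorting longest first with ties broken by input order is an assumption made for the example.

```python
import numpy as np

def seed_order(lengths: np.ndarray) -> np.ndarray:
    """Return contig indices sorted longest first (hypothetical helper).

    Assumption: longer contigs are the ones we are most confident in.
    """
    # Cast to a signed type so negation is safe for unsigned inputs.
    # A stable sort keeps the original (FASTA) order among equal lengths.
    return np.argsort(-lengths.astype(np.int64), kind="stable")

lengths = np.array([2_500, 120_000, 8_000, 120_000])
order = seed_order(lengths)
print(order.tolist())
```

Unlike a random shuffle, this order is deterministic for a given input, so repeated runs try the same seeds in the same sequence.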

Length weighting

When computing the distance histogram from which the threshold is found, contigs are now weighted by their length. This should make it more likely to correctly detect clusters composed of a small number of large contigs. In testing, this improves binning results, and the distance histograms become clearer and more well-shaped.
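
As a rough sketch of what length weighting does to the histogram (not the PR's implementation; the function name, bin count, and distance range are assumptions chosen for the example), each contig can contribute its length to its bin instead of a count of one:

```python
import numpy as np

def weighted_distance_histogram(distances, lengths, n_bins=10, max_dist=1.0):
    """Histogram of distances where each contig adds its length to its bin.

    Hypothetical helper; n_bins and max_dist are illustrative values.
    """
    hist, edges = np.histogram(
        distances, bins=n_bins, range=(0.0, max_dist), weights=lengths
    )
    return hist, edges

# Two very long contigs close to the seed, four short ones further away.
distances = np.array([0.05, 0.07, 0.52, 0.55, 0.56, 0.58])
lengths = np.array([500_000, 400_000, 2_000, 3_000, 2_500, 2_000])

unweighted, _ = np.histogram(distances, bins=10, range=(0.0, 1.0))
weighted, _ = weighted_distance_histogram(distances, lengths)
print(int(np.argmax(unweighted)), int(np.argmax(weighted)))
```

In the unweighted histogram the four short contigs dominate, while in the weighted one the peak sits at the two long contigs near the seed, which is the situation the change is meant to handle.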

Simplified medoid search

Before, the medoid was the contig with the lowest mean distance to the contigs within a small radius. Now, since we take length into account when finding the threshold, we instead choose the medoid that has the highest "local density". This is computed as sum(length * (R - distance)) over each contig where distance <= R, where R is a small constant.
Also, the random sampling of medoids is now simpler and should be more efficient.
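
The local density formula above can be sketched as follows. This is an illustration under assumptions rather than the code in this PR: the names `local_density` and `pick_medoid`, the dense distance matrix, and the inclusion of the candidate itself (distance 0) in the sum are all choices made for the example.

```python
import numpy as np

def local_density(distances, lengths, radius):
    """sum(length * (R - distance)) over contigs with distance <= R."""
    within = distances <= radius
    return float(np.sum(lengths[within] * (radius - distances[within])))

def pick_medoid(dist_matrix, lengths, candidates, radius):
    """Return the candidate with the highest local density (hypothetical)."""
    densities = [local_density(dist_matrix[c], lengths, radius) for c in candidates]
    return candidates[int(np.argmax(densities))]

# Toy symmetric distance matrix for three contigs.
dist_matrix = np.array([
    [0.00, 0.02, 0.50],
    [0.02, 0.00, 0.50],
    [0.50, 0.50, 0.00],
])
lengths = np.array([1_000, 50_000, 2_000])
R = 0.05  # illustrative radius, not the value used in Vamb

medoid = pick_medoid(dist_matrix, lengths, [0, 1, 2], R)
print(medoid)
```

Contigs 0 and 1 are neighbors within R, but contig 1 scores higher because its own length counts at full weight, so long contigs pull the medoid toward themselves.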

Various refactorings and tweaks

  • Update the hyperparameters windowsize and minsuccesses to make Vamb's cluster selection criteria stricter
  • Implement a pack function which deletes used points in the matrix and associated vectors. This enables better control of when the matrix is packed, which should make clustering faster
  • When the peak-valley ratio is raised, start over from the highest-ordered seed. This makes sure the highest-ordered seeds are attempted multiple times.
  • Misc refactoring
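
For readers unfamiliar with the packing step mentioned in the list above, here is a minimal sketch of the general idea. It is not the function added in this PR: the signature, the boolean `used` mask, and returning new arrays instead of modifying them in place are assumptions for illustration.

```python
import numpy as np

def pack(matrix, lengths, indices, used):
    """Drop rows marked as used from the matrix and its associated vectors.

    Hypothetical signature; returns new arrays holding only unused points.
    """
    keep = ~used
    return matrix[keep], lengths[keep], indices[keep]

matrix = np.arange(8, dtype=np.float32).reshape(4, 2)  # 4 points, 2 dimensions
lengths = np.array([3_000, 9_000, 4_000, 2_000])
indices = np.arange(4)                                  # original contig indices
used = np.array([False, True, False, True])             # points already clustered

matrix, lengths, indices = pack(matrix, lengths, indices, used)
print(matrix.shape, indices.tolist())
```

Keeping the `indices` vector aligned with the matrix lets later steps map packed rows back to the original contigs.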

@jakobnissen jakobnissen mentioned this pull request Sep 6, 2023
2 tasks
@jakobnissen jakobnissen merged commit 6554009 into master Sep 6, 2023
4 checks passed
@jakobnissen jakobnissen deleted the only_clustering branch September 6, 2023 08:24