Rehaul and tweak clustering #198

jakobnissen · 2023-09-06T08:05:41Z

This commit is a squashed version of several commits that can be found on the exp_cluster branch, each of which has been tested individually. It contains the following changes:

Seed ordering

Before, the order in which seed contigs were picked was randomized to prevent bias arising from the order of the sequences in the FASTA files. Now, we pick the seeds in a precomputed order, here by contig length. The idea is that the contigs we are most confident belong to a good cluster get "first pick" during clustering, to see if they can form proper clusters.
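As a rough illustration of the precomputed order, here is a minimal sketch. It assumes "by contig length" means longest first, which the description above implies but does not state outright; `seed_order` is a hypothetical helper name, not a function from the Vamb codebase.

```python
def seed_order(lengths):
    """Return contig indices sorted by length, longest first.

    Assumption: longer contigs are the ones we trust most, so they are
    tried as seeds first. Python's sort is stable, so contigs of equal
    length keep their original relative order.
    """
    return sorted(range(len(lengths)), key=lambda i: -lengths[i])


# Four contigs; indices 1 and 3 tie for the longest.
contig_lengths = [1500, 42000, 3000, 42000]
print(seed_order(contig_lengths))
```

Because the order depends only on the lengths, it is deterministic across runs, unlike the earlier randomized order.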

Length weighting

When computing the distance histogram from which the threshold is found, contigs are now weighted by their length. This should make it more likely to correctly detect clusters composed of a small number of large contigs. In testing, this improves binning results, and the distance histograms become clearer and better shaped.
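The effect of length weighting can be sketched as follows. This is an illustration only: the bin count, the distance range and the function name `weighted_histogram` are assumptions made for the example and are not taken from the Vamb code.

```python
def weighted_histogram(distances, lengths, n_bins=4, max_dist=1.0):
    """Histogram of distances where each contig adds its length, not 1.

    Assumptions for this sketch: distances lie in [0, max_dist] and the
    bins are equally wide. Distances outside the range are ignored.
    """
    hist = [0.0] * n_bins
    for dist, length in zip(distances, lengths):
        if dist < 0 or dist > max_dist:
            continue
        # Clamp so that dist == max_dist falls in the last bin.
        bin_index = min(int(dist / max_dist * n_bins), n_bins - 1)
        hist[bin_index] += length
    return hist


# Two large contigs close to the seed, two small ones further away.
distances = [0.05, 0.07, 0.55, 0.95]
lengths = [50000, 30000, 2000, 1000]
print(weighted_histogram(distances, lengths))
```

With unit weights the first bin would hold only two counts against one each in the later bins; with length weights the peak near the seed dominates, which is the situation described above for clusters made of a few large contigs.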

Simplified medoid search

Before, the medoid was chosen as the contig with the lowest mean distance to the contigs within a small radius. Now, since we take length into account when finding the threshold, we instead choose the medoid that has the highest "local density". This is computed as sum(length * (R - distance)) over each contig where distance <= R, where R is a small constant.
Also, the random sampling of medoids is now simpler and should be more efficient.
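The local density formula can be written out as a small sketch. The names `local_density` and `pick_medoid`, the layout of the distance rows and the value of R used below are assumptions for illustration; only the formula sum(length * (R - distance)) for distance <= R comes from the description above.

```python
def local_density(distances, lengths, radius):
    """Sum of length * (radius - distance) over contigs within radius."""
    return sum(
        length * (radius - dist)
        for dist, length in zip(distances, lengths)
        if dist <= radius
    )


def pick_medoid(candidates, distance_rows, lengths, radius):
    """Return the candidate whose neighbourhood has the highest density.

    distance_rows[c] holds the distances from candidate c to every contig
    (hypothetical layout chosen for this sketch).
    """
    return max(
        candidates,
        key=lambda c: local_density(distance_rows[c], lengths, radius),
    )


lengths = [10, 20, 30]
distance_rows = [
    [0.0, 0.1, 0.5],  # distances from contig 0
    [0.1, 0.0, 0.4],  # distances from contig 1
    [0.5, 0.4, 0.0],  # distances from contig 2
]
best = pick_medoid([0, 1, 2], distance_rows, lengths, radius=0.2)
print(best)
```

Note that a contig contributes to its own density through its length, so a long, isolated contig can outscore a short contig with several short neighbours.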

Various refactorings and tweaks

* Update the hyperparameters windowsize and minsuccesses to make Vamb's cluster selection criteria stricter
* Implement a pack function which deletes used points in the matrix and associated vectors. This enables better control of when the matrix is packed, which should make clustering faster
* When the peak valley ratio is raised, start over from the highest-ordered seed. This makes sure the highest-ordered seeds are attempted multiple times.
* Misc refactoring
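To make the pack step concrete, here is a minimal sketch of the idea of dropping used points from the matrix and from every vector that runs parallel to it. The signature, the dictionary of vectors and the boolean `used` mask are assumptions for this example and do not reflect the actual implementation in Vamb.

```python
def pack(rows, vectors, used):
    """Return copies of rows and vectors with the used points removed.

    rows:    list of matrix rows, one per point
    vectors: dict mapping a name to a list parallel to rows
    used:    list of booleans, True where the point is already clustered
    """
    keep = [i for i, is_used in enumerate(used) if not is_used]
    packed_rows = [rows[i] for i in keep]
    packed_vectors = {
        name: [values[i] for i in keep] for name, values in vectors.items()
    }
    return packed_rows, packed_vectors


rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
vectors = {"lengths": [1000, 2000, 3000], "indices": [0, 1, 2]}
used = [False, True, False]

rows, vectors = pack(rows, vectors, used)
print(rows)
print(vectors)
```

Keeping the original indices as one of the packed vectors lets the caller map remaining rows back to contigs after packing. Calling the function explicitly, rather than on every iteration, matches the point above about controlling when the matrix is packed.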


jakobnissen mentioned this pull request Sep 6, 2023

Implement marker gene detection #195

Merged

2 tasks

jakobnissen merged commit 6554009 into master Sep 6, 2023
4 checks passed

jakobnissen deleted the only_clustering branch September 6, 2023 08:24
