Rehaul and tweak clustering #198

Merged
merged 1 commit into from
Sep 6, 2023

Commits on Sep 6, 2023

  1. Rehaul and tweak clustering

    This commit is a squashed version of several commits that can be found on the
    exp_cluster branch, and which have been tested individually. It contains the
    following changes:
    
    Seed ordering
    
    Before, the order in which seed contigs were picked was randomized, in order to
    prevent bias arising from the order of the sequences in the FASTA files.
    Now, we pick the seeds in a precomputed order, currently by contig length.
    The idea is that the contigs which we are most confident should belong to a good
    cluster get "first pick" during clustering to see if they can form proper
    clusters.
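
A minimal sketch of what a precomputed, length-based seed order could look like. The function name `seed_order` is hypothetical and not taken from the Vamb code base, and sorting longest-first with ties broken by original index is an assumption made here for illustration.

```python
def seed_order(lengths):
    """Return contig indices sorted from longest to shortest.

    Assumption: longer contigs are tried first. Ties are broken by the
    original index so that the order is deterministic.
    """
    return sorted(range(len(lengths)), key=lambda i: (-lengths[i], i))


lengths = [1200, 50000, 3000, 50000]
order = seed_order(lengths)
print(order)  # [1, 3, 2, 0]
```

Unlike a random shuffle, this order depends only on the contig lengths, not on the order of sequences in the FASTA file.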
    
    Length weighting
    
    When computing the distance histogram from which the threshold is found, contigs
    are now weighted by their length. This should make it more likely to correctly
    detect clusters composed of a small number of large contigs.
    In testing, this improves binning results, and the distance histograms become
    clearer and better shaped.
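
The effect of length weighting can be illustrated with a small, pure-Python sketch. The function `weighted_histogram`, its parameters, and the bin layout are assumptions for illustration only; the actual implementation in Vamb may differ.

```python
def weighted_histogram(distances, lengths, n_bins, max_dist):
    """Histogram of distances where each contig adds its length, not 1.

    Distances outside [0, max_dist] are ignored.
    """
    hist = [0.0] * n_bins
    bin_width = max_dist / n_bins
    for dist, length in zip(distances, lengths):
        if dist < 0 or dist > max_dist:
            continue
        # Clamp so that dist == max_dist lands in the last bin
        idx = min(int(dist / bin_width), n_bins - 1)
        hist[idx] += length
    return hist


distances = [0.05, 0.15, 0.15, 0.45]
lengths = [1000, 2000, 500, 4000]
hist = weighted_histogram(distances, lengths, n_bins=5, max_dist=0.5)
print(hist)
```

With unit weights, the single contig at distance 0.45 would contribute a count of 1; weighted by length, it contributes 4000 and so stands out clearly in the histogram.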
    
    Simplified medoid search
    
    Before, the medoid should have the lowest mean distance to each contig within
    a small radius. Now, since we take length into account when finding the
    threshold, we instead choose the medoid that has the highest "local density".
    This is computed as sum(length * (R - distance)) where R is a small constant,
    for each contig where distance <= R.
    Also, the random sampling of medoids is now simpler and should be more efficient.
    Various refactorings and tweaks
    
    * Update the hyperparameters windowsize and minsuccesses to make Vamb's
      cluster selection criteria stricter
    * Implement a pack function which deletes used points in the matrix and
      associated vectors. This enables better control of when the matrix is packed,
      which should make clustering faster
    * When the peak valley ratio is raised, start over from the highest ordered
      seed. This makes sure the highest ordered seeds are attempted multiple times.
    * Misc refactoring
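
As a rough illustration of the packing step mentioned in the list above, the sketch below drops already-clustered points from a matrix and its parallel vectors in one pass. The function signature and the use of plain lists are assumptions made for this example; the function in the commit operates on Vamb's own data structures and may work differently, for instance in place.

```python
def pack(matrix, lengths, indices, used):
    """Return copies of matrix, lengths and indices without used points.

    `used[i]` is True if point i has already been assigned to a cluster.
    All inputs are parallel sequences of equal length.
    """
    keep = [i for i, is_used in enumerate(used) if not is_used]
    return (
        [matrix[i] for i in keep],
        [lengths[i] for i in keep],
        [indices[i] for i in keep],
    )


matrix = [[1.0], [2.0], [3.0]]
lengths = [10, 20, 30]
indices = [0, 1, 2]
used = [False, True, False]
matrix, lengths, indices = pack(matrix, lengths, indices, used)
print(indices)  # [0, 2]
```

Keeping the original indices alongside the packed matrix lets later steps map rows back to contigs. Calling such a function explicitly, rather than on every iteration, is one way to control when the cost of packing is paid.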
    jakobnissen committed Sep 6, 2023
    3e7739a