Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cudamapper] Filtering for self mappings, identical overlaps, and highly-similar overlaps #563

Open
wants to merge 5 commits into
base: dev
Choose a base branch
from

Conversation

edawson
Copy link
Contributor

@edawson edawson commented Sep 17, 2020

This PR brings in filtering strategies that mimic those of minimap2.

  • Self mappings: minimap2 does not output self mappings (where an overlap between a read and itself covers more than some very high fraction of that read's length). This can be disabled with a command line flag (-S).
  • Identical overlaps: remove identical overlaps within some number of indices.
  • Highly-similar overlaps: remove overlaps that have a percent similarity (defined as the reciprocal overlap) higher than some percentage threshold.

Adds a post-processing filter for full-length self mappings. This is
needed as self mappings cause miniasm to produce empty assemblies.
…erlaps in

postprocessing.

Adds a CLI option and functions for filtering overlaps that either have
identical query starts/ends or which have a percent reciprocal identity
greater than a user-defined threshold.
@mimaric mimaric added the cudamapper GPU-based overlapper label Sep 21, 2020
@edawson edawson requested a review from mimaric October 10, 2020 02:43
Comment on lines +54 to +55
{"preserve-self-mappings", no_argument, 0, 'S'},
{"max-reciprocal", required_argument, 0, 'Z'},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These parameters should also be added to ApplicationParameters::help()

if (application_parameters.all_to_all && application_parameters.filter_self_mappings)
{
GW_NVTX_RANGE(profiler, "main::postprocess_and_write_thread::remove_self_mappings");
::claraparabricks::genomeworks::cudamapper::details::overlapper::filter_self_mappings(overlaps, *application_parameters.query_parser, *application_parameters.target_parser, 0.9);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to qualify the function this day? Also, is function really a detail if it is used in a completely different (top level) file

@@ -98,7 +110,8 @@ class Overlapper
/// \brief Identified overlaps which can be combined into a larger overlap and add them to the input vector
/// \param overlaps reference to vector of Overlaps. New overlaps (result of fusing) are added to this vector
/// \param drop_fused_overlaps If true, remove overlaps that are fused into larger overlaps in output.
static void post_process_overlaps(std::vector<Overlap>& overlaps, bool drop_fused_overlaps = false);
/// \param max_reciprocal Maximum reciprocal identity between two overlaps before one is filtered. Default (< 0.0) means no filtering; 0.0 removes only identical adjacent overlaps.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find 0.0 a bit confusing. If I understand it correctly passing 0.75 would mean that any pair that has more than 75% overlap will get filtered out (i.e. one of those two overlaps). Wouldn't it than make sense to say that specifying 1.0 means only identical (= 100% overlapped) overlaps are filtered out and >1.0 that nothing gets filtered out as two overlaps cannot overlap more than 100%?

@@ -98,7 +110,8 @@ class Overlapper
/// \brief Identified overlaps which can be combined into a larger overlap and add them to the input vector
/// \param overlaps reference to vector of Overlaps. New overlaps (result of fusing) are added to this vector
/// \param drop_fused_overlaps If true, remove overlaps that are fused into larger overlaps in output.
static void post_process_overlaps(std::vector<Overlap>& overlaps, bool drop_fused_overlaps = false);
/// \param max_reciprocal Maximum reciprocal identity between two overlaps before one is filtered. Default (< 0.0) means no filtering; 0.0 removes only identical adjacent overlaps.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how complicated would it be to specify which overlap gets filtered out (e.g. the one with smaller starting position, longer/shorter one or something)

@@ -53,8 +53,10 @@ class ApplicationParameters
int32_t min_overlap_len = 250; // l, recommended range: 100 - 1000
int32_t min_bases_per_residue = 1000; // b
float min_overlap_fraction = 0.8; // z
float max_reciprocal = -1.0; // Z, < 0 : no filtering. 0.0: only identical filtering. > 1.0: filter by percent reciprocity
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For -F we use 0.0 - 1.0 range, not 1.0 - 100.0 (which should actually probably be 0.0 - 100.0)

{
return 0.0;
}
int32_t query_overlap = std::min(a.query_end_position_in_read_, b.query_end_position_in_read_) - std::max(a.query_start_position_in_read_, b.query_start_position_in_read_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and similar should be position_in_read_t or even better number_of_basepairs_t

Base automatically changed from dev-v0.6.0 to dev February 8, 2021 22:44
@ohadmo ohadmo closed this Feb 8, 2021
@ohadmo ohadmo deleted the branch NVIDIA-Genomics-Research:dev February 8, 2021 22:54
@ohadmo ohadmo reopened this Feb 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudamapper GPU-based overlapper
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants