Make identity() calculation ignore soft clips #4502

Merged
merged 2 commits into from
Jan 23, 2025

Conversation

faithokamoto
Contributor

Changelog Entry

To be copied to the draft changelog by merger:

  • Stop identity() from penalizing soft clips (insertions at start/end of path) as part of the total length

Description

Soft clips are internally represented as insertions at the start or end of paths. They're not really part of the mapping and thus shouldn't count against the identity score. Currently, the identity calculation is # match bases / (# match bases + # mismatch bases + # insertion bases). This PR changes that to # match bases / (# match bases + # mismatch bases + # insertion bases - # softclip bases). Note that deletions have never been counted against identity, and I do not propose to start counting them.
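To make the proposed formula concrete, here is a minimal standalone sketch. It does not use vg's real Path/Mapping/Edit types: `EditKind`, `SketchEdit`, and `identity_ignoring_softclips` are hypothetical names invented for illustration, and the path is flattened to a plain list of edits. It assumes a soft clip is an insertion that is the very first or very last edit, and it skips those bases rather than subtracting them afterward, which gives the same denominator. It is not the code merged in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for vg's edit types (illustration only).
enum class EditKind { Match, Mismatch, Insertion, Deletion };

struct SketchEdit {
    EditKind kind;
    std::size_t length;  // number of bases covered by this edit
};

// identity = matches / (matches + mismatches + insertions - softclips)
// Assumption: a soft clip is an insertion that is the first or last edit.
double identity_ignoring_softclips(const std::vector<SketchEdit>& edits) {
    std::size_t matched = 0;
    std::size_t total = 0;
    for (std::size_t i = 0; i < edits.size(); ++i) {
        const SketchEdit& e = edits[i];
        const bool at_path_end = (i == 0 || i + 1 == edits.size());
        switch (e.kind) {
        case EditKind::Match:
            matched += e.length;
            total += e.length;
            break;
        case EditKind::Mismatch:
            total += e.length;
            break;
        case EditKind::Insertion:
            // Internal insertions still count; soft clips do not.
            if (!at_path_end) {
                total += e.length;
            }
            break;
        case EditKind::Deletion:
            // Deletions are never counted against identity.
            break;
        }
    }
    assert(matched <= total);
    // Paths with no counted bases get identity 0.
    return total == 0 ? 0.0 : static_cast<double>(matched) / static_cast<double>(total);
}
```

For a path with a 5-base soft clip at each end, 15 matched bases, 2 mismatched bases, and a 3-base internal insertion, this gives 15 / 20 = 0.75, where the current formula would give 15 / 30 = 0.5.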

Here are some descriptions of what "identity" means, according to the code:

  • In the identity field of a GAM. This would imply that only aligned bases should be counted as part of the total length for the denominator. Soft clips are not aligned. Arguably, regular insertions aren't aligned either, since they have no corresponding bases in the reference.

    Portion of aligned bases that are perfect matches, or 0 if no bases are aligned

  • In the definition of identity(). The use of "total length" seems likely to mean "total length of the path", which would include regular insertions, but I don't think it should include soft clips.

    perfect matches over total length. For zero-length paths, returns 0

For what it's worth, if we want to stop counting all insertions, that's pretty easy. Here's code which would change the identity formula to # match bases / (# match bases + # mismatch bases).

double identity(const Path& path) {
    // Count only aligned bases: matches and substitutions. Insertions
    // (including soft clips) and deletions are skipped entirely.
    size_t total_length = 0;
    size_t matched_length = 0;
    for (size_t i = 0; i < path.mapping_size(); ++i) {
        const auto& mapping = path.mapping(i);
        for (size_t j = 0; j < mapping.edit_size(); ++j) {
            const auto& edit = mapping.edit(j);
            if (edit_is_match(edit)) {
                matched_length += edit.from_length();
                total_length += edit.from_length();
            } else if (edit_is_sub(edit)) {
                total_length += edit.from_length();
            }
        }
    }
    // Zero-length paths have identity 0 by definition.
    return total_length == 0 ? 0.0 : (double) matched_length / (double) total_length;
}

@adamnovak adamnovak merged commit faea702 into master Jan 23, 2025
2 checks passed
@faithokamoto faithokamoto deleted the identity-ignore-softclips branch January 23, 2025 16:12