
Add codon_prob.py with a model to adjust codon probs by hit class #50

Merged: 10 commits from 19-add-codon-prob-new into main, Sep 12, 2024

Conversation

@willdumm (Contributor) commented Aug 13, 2024

This PR adds multihit.py, containing a model with three free parameters: coefficients that adjust target codon probabilities according to their hit class.

An exploration of model fitting and resulting hit class-aggregated OE plots is pushed to the wd-neutral-codon-refactor branch of netam-experiments-1, at human/neutral_codon_exploration.ipynb.

The adjusted OE plots now look as expected after model training:

[image: adjusted OE plots after model training]

Tests:

  • There is a new file tests/test_multihit.py, with some basic tests for model training and serialization
  • There are additional tests added to tests/test_molevol.py for the new functions added to that file
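The hit-class adjustment the PR describes can be sketched as follows. This is a minimal illustration, not the PR's actual implementation; the function name, leaving hit class 0 unadjusted, and renormalizing afterward are all assumptions of this sketch:

```python
import torch

def adjust_codon_probs_by_hit_class(codon_probs, hit_classes, log_coeffs):
    """Multiply each target codon's probability by a per-hit-class factor,
    then renormalize so each parent codon's distribution sums to 1.

    codon_probs: (N, 4, 4, 4) probabilities of mutating to each target codon.
    hit_classes: (N, 4, 4, 4) integer hit class (0-3) of each target codon.
    log_coeffs:  (3,) free parameters, one per nonzero hit class.
    """
    # Hit class 0 (no change) is left unadjusted: prepend a fixed zero.
    all_log_coeffs = torch.cat([torch.zeros(1), log_coeffs])
    adjusted = codon_probs * torch.exp(all_log_coeffs[hit_classes])
    # Renormalize over the 4x4x4 target-codon axes.
    return adjusted / adjusted.sum(dim=(1, 2, 3), keepdim=True)
```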

Commits:

  • format
  • WIP
  • some cleanup, to be continued
  • WIP
  • first draft codon_prob.py
  • Cleanup, and finish loss
  • WIP
  • WIP
  • WIP
  • working? branch length opt

@willdumm marked this pull request as ready for review September 10, 2024 23:34
@matsen (Contributor) left a comment:

Great work! 🎉

It's really nice to see this come together, and it really shows how you "get" what's going on now. Welcome to the land of pytorch!

You can tell that I'm a horrible nitpicker. Feel free to push back.

We are "correcting" with an "adjustment". Can we "correct" with a "correction"? One word for a thing is better than two.

netam/molevol.py (resolved comment thread)
netam/molevol.py (outdated):


# Initialize the 4D tensor to store the hit class tensors
# The shape is [4, 4, 4, 4, 4, 4], corresponding to three nucleotide indices and the hit class tensor (4x4x4)

I'm happy to commit to 4 bases; it's hard-coded below in the comments.
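A sketch of how a full hit-class tensor with shape [4, 4, 4, 4, 4, 4] could be built (assuming 4 bases, per the comment above; the variable names are illustrative, not the PR's code):

```python
import torch

# Entry [p0, p1, p2, c0, c1, c2] is the hit class: the number of positions
# at which parent codon (p0, p1, p2) and child codon (c0, c1, c2) differ.
p0, p1, p2, c0, c1, c2 = torch.meshgrid(*([torch.arange(4)] * 6), indexing="ij")
hit_class_tensor_full = (p0 != c0).int() + (p1 != c1).int() + (p2 != c2).int()
```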

netam/molevol.py (resolved comment thread)
netam/molevol.py (outdated):
"""
hit_class_tensor_t = hit_class_tensor_full[
parent_codon_idxs[:, 0], parent_codon_idxs[:, 1], parent_codon_idxs[:, 2]
].int()

Is it not already an int?

netam/molevol.py (outdated):
torch.Tensor: A (N, 4, 4, 4) shaped tensor containing the log probabilities of mutating to each possible
target codon, for each of the N parent codons, after applying the hit class factors.
"""
hit_class_tensor_t = hit_class_tensor_full[

I gain no understanding from tensor_t. Could we do per_parent_hit_classes?

netam/multihit.py (four resolved comment threads)
]
]
)

ex_parent_codon_idxs = nt_idx_tensor_of_str("ACG")
parent_nt_seq = "CAGGTGCAGCTGGTGGAG" # QVQLVE
weights_path = "data/shmple_weights/my_shmoof"

We can discard this!

@willdumm (Author) replied:

weights_path

@willdumm (Author) replied:

done

@matsen (Contributor) left a comment:

Wunderbar! Great work here. Just a few more little nitpicks.

This uses the same machinery as we use for fitting the DNSM, but we stay on
the codon level rather than moving to syn/nonsyn changes.

Args:

Needs reformatting with indentation like so:

"""
Compute the probabilities of mutating to various codons for a parent sequence.

This uses the same machinery as we use for fitting the DNSM, but we stay on
the codon level rather than moving to syn/nonsyn changes.

Args:
    parent_idxs (torch.Tensor): The parent nucleotide sequence encoded as a
        tensor of length Cx3, where C is the number of codons, containing the nt indices of each site.
    scaled_rates (torch.Tensor): Poisson rates of mutation per site, scaled by branch length.
    sub_probs (torch.Tensor): Substitution probabilities per site: a 2D tensor with shape (site_count, 4).

Returns:
    torch.Tensor: A 4D tensor with shape (codon_count, 4, 4, 4) where the cijk-th entry is the probability
        of the c'th codon mutating to the codon ijk.
"""

@willdumm (Author) replied:

I imitated the docstring format used in molevol.py; now I see that the second half of that file uses the format you want, while the first half uses the one I imitated.

I've fixed the docstrings for my functions and opened a new issue (#56) to adopt docformatter and establish a consistent docstring format.
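For concreteness, the computation the suggested docstring describes could look like this sketch. It assumes per-site independence within a codon and that sub_probs assigns zero probability to the parent base; both are assumptions of this sketch, not statements about netam's actual code:

```python
import torch

def codon_probs_of_parent_sketch(parent_idxs, scaled_rates, sub_probs):
    """Return a (C, 4, 4, 4) tensor: the cijk-th entry is the probability of
    the c'th parent codon mutating to the codon ijk."""
    mut_prob = 1.0 - torch.exp(-scaled_rates)   # P(site mutates), (site_count,)
    per_site = mut_prob[:, None] * sub_probs    # P(site becomes each base)
    # The parent base keeps the unmutated mass (its sub_prob is assumed 0).
    per_site[torch.arange(len(parent_idxs)), parent_idxs] = 1.0 - mut_prob
    site = per_site.reshape(-1, 3, 4)           # (C, 3, 4)
    # Outer product of the three per-site distributions within each codon.
    return (site[:, 0, :, None, None]
            * site[:, 1, None, :, None]
            * site[:, 2, None, None, :])
```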

netam/molevol.py (outdated):
torch.Tensor: A 4D tensor with shape (codon_count, 4, 4, 4) where the cijk-th entry is the probability
of the c'th codon mutating to the codon ijk.
"""
# The following four lines are duplicated from

no longer, right?

def _trim_seqs_to_codon_boundary_and_max_len(
seqs: list, site_count: int = None
) -> list:
"""Assumes that all sequences have the same length, and trims to codon boundary.

Does this assume that all sequences have the same length?

Also, let's use max_len as an argument because it appears in the name of the function. The current max_len can become max_codon_len.

@willdumm (Author) replied:

You're right, it doesn't assume that. Adjusted docstring and made name changes

padded_mutations = num_mutations[:codon_count] # Truncate if necessary
padded_mutations += [-100] * (
codon_count - len(padded_mutations)
) # Pad with -1s

Suggested change:
- ) # Pad with -1s
+ ) # Pad with -100s
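The -100 sentinel is worth keeping because it matches PyTorch's default ignore_index for cross-entropy, so padded codon positions drop out of the loss automatically. A minimal illustration (the variable names are hypothetical):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5)                  # 4 codon sites, 5 classes
targets = torch.tensor([1, 3, -100, -100])  # last two sites are padding
loss = F.cross_entropy(logits, targets)     # ignore_index defaults to -100
# The padded sites contribute nothing, so this equals the loss on the
# first two sites alone:
unpadded_loss = F.cross_entropy(logits[:2], targets[:2])
```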

return [seq[: min(len(seq) - len(seq) % 3, max_len)] for seq in seqs]


def _observed_hit_classes(parents, children):

docstring please

netam/multihit.py (resolved comment thread)
)
self.all_rates = stack_heterogeneous(
pd.Series(
rates[: len(rates) - len(rates) % 3] for rates in all_rates

Can't we use your trimming function above? We'd want to rename it to remove "seq" from the title.

@matsen merged commit 7222493 into main on Sep 12, 2024 (1 check passed)
@matsen deleted the 19-add-codon-prob-new branch on September 12, 2024 at 20:56
@matsen mentioned this pull request on Sep 12, 2024 and Sep 27, 2024