-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add codon_prob.py
with a model to adjust codon probs by hit class
#50
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! 🎉
It's really nice to see this come together, and it really shows how you "get" what's going on now. Welcome to the land of pytorch!
You can tell that I'm a horrible nitpicker. Feel free to push back.
We are "correcting" with an "adjustment". Can we "correct" with a "correction"? One word for a thing is better than two.
netam/molevol.py
Outdated
|
||
|
||
# Initialize the 4D tensor to store the hit class tensors | ||
# The shape is [4, 4, 4, 4, 4, 4], corresponding to three nucleotide indices and the hit class tensor (4x4x4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to commit to 4 bases. It's hard-coded below in comments,
netam/molevol.py
Outdated
""" | ||
hit_class_tensor_t = hit_class_tensor_full[ | ||
parent_codon_idxs[:, 0], parent_codon_idxs[:, 1], parent_codon_idxs[:, 2] | ||
].int() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it not already an int?
netam/molevol.py
Outdated
torch.Tensor: A (N, 4, 4, 4) shaped tensor containing the log probabilities of mutating to each possible | ||
target codon, for each of the N parent codons, after applying the hit class factors. | ||
""" | ||
hit_class_tensor_t = hit_class_tensor_full[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gain no understanding from tensor_t
. Could we do per_parent_hit_classes
?
tests/test_molevol.py
Outdated
] | ||
] | ||
) | ||
|
||
ex_parent_codon_idxs = nt_idx_tensor_of_str("ACG") | ||
parent_nt_seq = "CAGGTGCAGCTGGTGGAG" # QVQLVE | ||
weights_path = "data/shmple_weights/my_shmoof" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can discard this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weights_path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wunderbar! Great work here. Just a few more little nitpicks.
This uses the same machinery as we use for fitting the DNSM, but we stay on | ||
the codon level rather than moving to syn/nonsyn changes. | ||
|
||
Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs reformatting with indentation like so:
"""
Compute the probabilities of mutating to various codons for a parent sequence.
This uses the same machinery as we use for fitting the DNSM, but we stay on
the codon level rather than moving to syn/nonsyn changes.
Args:
parent_idxs (torch.Tensor): The parent nucleotide sequence encoded as a
tensor of length Cx3, where C is the number of codons, containing the nt indices of each site.
scaled_rates (torch.Tensor): Poisson rates of mutation per site, scaled by branch length.
sub_probs (torch.Tensor): Substitution probabilities per site: a 2D tensor with shape (site_count, 4).
Returns:
torch.Tensor: A 4D tensor with shape (codon_count, 4, 4, 4) where the cijk-th entry is the probability
of the c'th codon mutating to the codon ijk.
"""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the format from the docstrings in molevol.py
. Now I see that the last half use the format you want, and the first half use the format I imitated.
I've fixed the docstrings for my functions and opened a new issue to use docformatter, and to establish a consistent docstring format (#56 )
netam/molevol.py
Outdated
torch.Tensor: A 4D tensor with shape (codon_count, 4, 4, 4) where the cijk-th entry is the probability | ||
of the c'th codon mutating to the codon ijk. | ||
""" | ||
# The following four lines are duplicated from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no longer, right?
netam/multihit.py
Outdated
def _trim_seqs_to_codon_boundary_and_max_len( | ||
seqs: list, site_count: int = None | ||
) -> list: | ||
"""Assumes that all sequences have the same length, and trims to codon boundary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this assume that all sequences have the same length?
Also, let's use max_len
as an argument because it appears in the name of the function. The current max_len
can become max_codon_len
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, it doesn't assume that. Adjusted docstring and made name changes
netam/multihit.py
Outdated
padded_mutations = num_mutations[:codon_count] # Truncate if necessary | ||
padded_mutations += [-100] * ( | ||
codon_count - len(padded_mutations) | ||
) # Pad with -1s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) # Pad with -1s | |
) # Pad with -100s |
netam/multihit.py
Outdated
return [seq[: min(len(seq) - len(seq) % 3, max_len)] for seq in seqs] | ||
|
||
|
||
def _observed_hit_classes(parents, children): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docstring please
netam/multihit.py
Outdated
) | ||
self.all_rates = stack_heterogeneous( | ||
pd.Series( | ||
rates[: len(rates) - len(rates) % 3] for rates in all_rates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we use your trimming function above? We'd want to rename it to remove "seq" from the title.
This PR adds a
multihit.py
containing a model with three free parameters, which are coefficients to adjust target codon probabilities according to their hit class.An exploration of model fitting and resulting hit class-aggregated OE plots is pushed to the
wd-neutral-codon-refactor
branch ofnetam-experiments-1
, athuman/neutral_codon_exploration.ipynb
.The adjusted OE plots now look as expected after model training:
Tests:
tests/test_multihit.py
, with some basic tests for model training and serializationtests/test_molevol.py
for the new functions added to that file