Add `codon_prob.py` with a model to adjust codon probs by hit class #50

willdumm · 2024-08-13T23:21:09Z

This PR adds `multihit.py`, which contains a model with three free parameters: coefficients that adjust target codon probabilities according to their hit class.

An exploration of model fitting and resulting hit class-aggregated OE plots is pushed to the wd-neutral-codon-refactor branch of netam-experiments-1, at human/neutral_codon_exploration.ipynb.

The adjusted OE plots now look as expected after model training:

Tests:

There is a new file, tests/test_multihit.py, with some basic tests for model training and serialization.
Additional tests in tests/test_molevol.py cover the new functions added to molevol.py.

matsen

Great work! 🎉

It's really nice to see this come together, and it really shows how you "get" what's going on now. Welcome to the land of pytorch!

You can tell that I'm a horrible nitpicker. Feel free to push back.

We are "correcting" with an "adjustment". Can we "correct" with a "correction"? One word for a thing is better than two.

matsen · 2024-09-11T11:19:44Z

netam/molevol.py

+
+
+# Initialize the 4D tensor to store the hit class tensors
+# The shape is [4, 4, 4, 4, 4, 4], corresponding to three nucleotide indices and the hit class tensor (4x4x4)


I'm happy to commit to 4 bases. It's hard-coded below in comments,
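For readers following along, the table under discussion can be sketched roughly as below. This is a minimal illustration and not the code in the PR: it assumes a hit class is the number of nucleotide positions at which a parent codon and a child codon differ, and `build_hit_class_tensor_full` and `NT_COUNT` are hypothetical names. Note that the result has six axes, three for the parent codon and three for the child codon.

```python
import torch

NT_COUNT = 4  # hard-coded to four bases, as agreed in the review comment


def build_hit_class_tensor_full() -> torch.Tensor:
    """Return an int tensor of shape (4, 4, 4, 4, 4, 4).

    Entry [i, j, k, a, b, c] is the number of positions (0 to 3) at which
    parent codon (i, j, k) differs from child codon (a, b, c).
    Hypothetical helper, written only for illustration.
    """
    nt = torch.arange(NT_COUNT)
    # Give each index vector its own axis so broadcasting lines them up.
    i = nt.view(4, 1, 1, 1, 1, 1)
    j = nt.view(1, 4, 1, 1, 1, 1)
    k = nt.view(1, 1, 4, 1, 1, 1)
    a = nt.view(1, 1, 1, 4, 1, 1)
    b = nt.view(1, 1, 1, 1, 4, 1)
    c = nt.view(1, 1, 1, 1, 1, 4)
    # Each comparison is 1 where that codon position differs, else 0.
    return (i != a).int() + (j != b).int() + (k != c).int()


hit_class_tensor_full = build_hit_class_tensor_full()
```

Keeping `NT_COUNT` as a single constant is one way to commit to four bases in code rather than only in comments.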

matsen · 2024-09-11T11:34:52Z

netam/molevol.py

+    """
+    hit_class_tensor_t = hit_class_tensor_full[
+        parent_codon_idxs[:, 0], parent_codon_idxs[:, 1], parent_codon_idxs[:, 2]
+    ].int()


Is it not already an int?

matsen · 2024-09-11T11:37:56Z

netam/molevol.py

+    torch.Tensor: A (N, 4, 4, 4) shaped tensor containing the log probabilities of mutating to each possible
+                target codon, for each of the N parent codons, after applying the hit class factors.
+    """
+    hit_class_tensor_t = hit_class_tensor_full[


I gain no understanding from tensor_t. Could we do per_parent_hit_classes?
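A small sketch of the lookup these two comments refer to, using the suggested name `per_parent_hit_classes`. The table built here is only a stand-in (assumed to count differing positions between parent and child codons), and the A=0, C=1, G=2, T=3 encoding is an assumption. PyTorch indexing preserves dtype, so if the full table is already an integer tensor the trailing `.int()` should be redundant.

```python
import torch

# Stand-in for the precomputed table: entry [i, j, k, a, b, c] counts the
# positions where parent codon (i, j, k) and child codon (a, b, c) differ.
nt = torch.arange(4)
grids = torch.meshgrid(*([nt] * 6), indexing="ij")
hit_class_tensor_full = sum((grids[p] != grids[p + 3]).int() for p in range(3))

# Two parent codons as nucleotide indices, shape (N, 3).
# Assumed encoding: A=0, C=1, G=2, T=3, so these are "ACG" and "TTT".
parent_codon_idxs = torch.tensor([[0, 1, 2], [3, 3, 3]])

# One (4, 4, 4) slice of hit classes per parent codon, giving shape (N, 4, 4, 4).
per_parent_hit_classes = hit_class_tensor_full[
    parent_codon_idxs[:, 0], parent_codon_idxs[:, 1], parent_codon_idxs[:, 2]
]
```

Here `per_parent_hit_classes[n, a, b, c]` is the hit class of child codon `(a, b, c)` relative to the `n`-th parent.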

matsen · 2024-09-11T12:28:40Z

tests/test_molevol.py

+        ]
+    ]
+)
+
 ex_parent_codon_idxs = nt_idx_tensor_of_str("ACG")
 parent_nt_seq = "CAGGTGCAGCTGGTGGAG"  # QVQLVE
 weights_path = "data/shmple_weights/my_shmoof"


We can discard `weights_path`!

matsen

Wunderbar! Great work here. Just a few more little nitpicks.

matsen · 2024-09-12T12:32:51Z

netam/molevol.py

+    This uses the same machinery as we use for fitting the DNSM, but we stay on
+    the codon level rather than moving to syn/nonsyn changes.
+
+    Args:


Needs reformatting with indentation like so:

    """
    Compute the probabilities of mutating to various codons for a parent sequence.

    This uses the same machinery as we use for fitting the DNSM, but we stay on
    the codon level rather than moving to syn/nonsyn changes.

    Args:
        parent_idxs (torch.Tensor): The parent nucleotide sequence encoded as a
            tensor of length Cx3, where C is the number of codons, containing
            the nt indices of each site.
        scaled_rates (torch.Tensor): Poisson rates of mutation per site, scaled
            by branch length.
        sub_probs (torch.Tensor): Substitution probabilities per site: a 2D
            tensor with shape (site_count, 4).

    Returns:
        torch.Tensor: A 4D tensor with shape (codon_count, 4, 4, 4) where the
            cijk-th entry is the probability of the c'th codon mutating to the
            codon ijk.
    """

I used the format from the docstrings in molevol.py. Now I see that the last half uses the format you want, and the first half uses the format I imitated.

I've fixed the docstrings for my functions and opened a new issue (#56) to use docformatter and to establish a consistent docstring format.

matsen · 2024-09-12T12:33:11Z

netam/molevol.py

+    torch.Tensor: A 4D tensor with shape (codon_count, 4, 4, 4) where the cijk-th entry is the probability
+        of the c'th codon mutating to the codon ijk.
+    """
+    # The following four lines are duplicated from


no longer, right?

matsen · 2024-09-12T12:36:56Z

netam/multihit.py

+def _trim_seqs_to_codon_boundary_and_max_len(
+    seqs: list, site_count: int = None
+) -> list:
+    """Assumes that all sequences have the same length, and trims to codon boundary.


Does this assume that all sequences have the same length?

Also, let's use max_len as an argument because it appears in the name of the function. The current max_len can become max_codon_len.

You're right, it doesn't assume that. I adjusted the docstring and made the name changes.

matsen · 2024-09-12T12:37:33Z

netam/multihit.py

+        padded_mutations = num_mutations[:codon_count]  # Truncate if necessary
+        padded_mutations += [-100] * (
+            codon_count - len(padded_mutations)
+        )  # Pad with -1s


Suggested change

-        )  # Pad with -1s
+        )  # Pad with -100s
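The truncate-then-pad step in this hunk can be written as a small standalone helper. This is only a sketch: `pad_or_truncate` and `PAD_VALUE` are hypothetical names, not identifiers from the PR. For context, -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`; whether the PR relies on that default is an assumption.

```python
PAD_VALUE = -100  # the value the diff pads with


def pad_or_truncate(values, target_len, pad_value=PAD_VALUE):
    """Return a new list of exactly target_len items.

    Longer inputs are truncated; shorter inputs are right-padded with
    pad_value. The input list is left unmodified.
    """
    padded = list(values[:target_len])  # slicing copies, so the input is safe
    padded += [pad_value] * (target_len - len(padded))
    return padded


short = pad_or_truncate([1, 0, 2], 5)
long = pad_or_truncate([1, 0, 2, 3], 2)
```

Naming the pad value once also keeps comments like the one corrected above from drifting out of sync with the code.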

matsen · 2024-09-12T12:37:50Z

netam/multihit.py

+        return [seq[: min(len(seq) - len(seq) % 3, max_len)] for seq in seqs]
+
+
+def _observed_hit_classes(parents, children):


docstring please
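As an illustration of the kind of docstring being requested, here is a hedged sketch of such a function. It is not the implementation in the PR: `observed_hit_classes_sketch` is a hypothetical name, it works on plain strings rather than tensors, and it assumes a hit class is the number of differing nucleotides within a codon.

```python
def observed_hit_classes_sketch(parents, children):
    """Count differing nucleotides per codon for each parent/child pair.

    Args:
        parents (list of str): Parent nucleotide sequences.
        children (list of str): Child nucleotide sequences, paired with
            parents by position.

    Returns:
        list of list of int: For each pair, one hit class (0 to 3) per
            complete codon shared by parent and child.
    """
    result = []
    for parent, child in zip(parents, children):
        # Only compare complete codons present in both sequences.
        usable = min(len(parent), len(child))
        usable -= usable % 3
        result.append(
            [
                sum(p != c for p, c in zip(parent[i : i + 3], child[i : i + 3]))
                for i in range(0, usable, 3)
            ]
        )
    return result


example = observed_hit_classes_sketch(["ACGTTT", "AAACCC"], ["ACGTAA", "TTTCCG"])
```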

matsen · 2024-09-12T12:40:01Z

netam/multihit.py

+        )
+        self.all_rates = stack_heterogeneous(
+            pd.Series(
+                rates[: len(rates) - len(rates) % 3] for rates in all_rates


Can't we use your trimming function above? We'd want to rename it to remove "seq" from the title.
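One way the shared helper could look once "seq" is dropped from the name, so that it applies to rate vectors as well as sequences. This is a sketch under assumptions: `trim_to_codon_boundary_and_max_len` is a hypothetical name, and unlike the one-liner in the diff it rounds down to a codon boundary after applying `max_len`, so a `max_len` that is not a multiple of 3 still yields whole codons.

```python
def trim_to_codon_boundary_and_max_len(items, max_len=None):
    """Trim each item to a whole number of codons, at most max_len sites.

    Works on anything that supports len() and slicing (str, list, tensor).
    Items may have different lengths.
    """
    trimmed = []
    for item in items:
        end = len(item) if max_len is None else min(len(item), max_len)
        end -= end % 3  # drop any trailing partial codon
        trimmed.append(item[:end])
    return trimmed


seqs = trim_to_codon_boundary_and_max_len(["ACGTA", "ACGTACGT"], max_len=6)
rates = trim_to_codon_boundary_and_max_len([[0.1, 0.2, 0.3, 0.4]])
```

Because it relies only on `len()` and slicing, the same call would cover the `rates[: len(rates) - len(rates) % 3]` expression in the hunk above.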

willdumm added 8 commits August 9, 2024 16:33

a start

307536b

format WIP some cleanup, to be continued WIP first draft codon_prob.py Cleanup, and finish loss WIP WIP WIP working? branch length opt

finish refactoring for rebase

9d2dd80

checkpoint before major refactor

4e0d042

give cceloss logits

d200bb4

some cleanup

bc522bf

some cleanup

a5c90a8

some cleanup

df9b3e6

add tests

77a647f

willdumm marked this pull request as ready for review September 10, 2024 23:34

willdumm requested a review from matsen September 10, 2024 23:34

matsen requested changes Sep 11, 2024

respond to Erick's comments

a7a6ee0

willdumm mentioned this pull request Sep 11, 2024

Consolidate branch length methods to new Dataset subclass #55

Open

matsen requested changes Sep 12, 2024

willdumm mentioned this pull request Sep 12, 2024

Establish consistent docstring style, and use docformatter #56

Closed

respond to Erick's comments

49b3594

matsen approved these changes Sep 12, 2024

matsen merged commit 7222493 into main Sep 12, 2024
1 check passed

matsen deleted the 19-add-codon-prob-new branch September 12, 2024 20:56

matsen mentioned this pull request Sep 12, 2024

Add codon_prob.py #24

Closed

matsen mentioned this pull request Sep 27, 2024

Add a codon_prob.py #19

Closed
