Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Average shared span for multiple matches #12

Merged
merged 25 commits into from
Jan 31, 2025

Conversation

hfr1tz3
Copy link
Collaborator

@hfr1tz3 hfr1tz3 commented Dec 22, 2024

Option 2 for issue #11

For each node $n_2\in T_2$ with multiple matches in $T_1$, Let $\beta(n_2)=[n_1\in T_1 \colon \alpha(n_1)=n_2]$. Then we compute the similarity between two trees $T_1$, and $T_2$ as
$$sim(T_1,T_2)=\sum_{n_2\in T_2}\frac{1}{\beta(n_2)}\sum_{n_1\in \alpha^{-1}(n_2)} m(n_1,n_2),$$
which is the average over shared spans between nodes in $T_2$ and their multiple matches $\beta(n_2)$.

@hfr1tz3 hfr1tz3 marked this pull request as ready for review December 23, 2024 19:00
@hfr1tz3
Copy link
Collaborator Author

hfr1tz3 commented Dec 23, 2024

I added extra tests that are for the ties='average' method. I am open to more suggestions if we need more.

@petrelharp
Copy link
Contributor

Gee, that's a quite clean implementation. Nice.

My first inclination is not to introduce the "ties" argument? I think we don't have a use case for ties=None, so why complicate things?

@hfr1tz3
Copy link
Collaborator Author

hfr1tz3 commented Jan 6, 2025

That's a fair point. I think it was sort of an idea to put in place, where in case we want to add a new matching scheme then we have the scaffolding for that already. However, if want to remove that I can.
@petrelharp just ping me if you want it removed. I'll try to get to it while at JMM.

@petrelharp
Copy link
Contributor

Good thought, but there's no need to put the argument in place, since if we add it in the future but with the default "average", then that'll be backwards-compatible. So, it's just unecessary complexity. I'd like it removed.

@hfr1tz3
Copy link
Collaborator Author

hfr1tz3 commented Jan 13, 2025

Alright, I removed the ties argument from the compare function. I additionally added another test case to check that the new average node span is working properly.

@petrelharp
Copy link
Contributor

Hm, okay - sorry I didn't pick up on this earlier - but, this is also changing dissimilarity (since both dissimilarity and tpr depend on total_match_span). Now I need to think about whether this is desired or not.

@petrelharp
Copy link
Contributor

Okay: I think that if T1 is inferred and T2 is truth, then our motivation has been:

  1. we want ARF to be "how much of T1 is wrong" (so, denominator is |T1|)
  2. and TPR to be "how much of T2 is correctly represented in T1" (denominator is |T2|)
  3. previously, interpretation (2) does not hold, because adding more nodes to T1 that mapped to the same node in T2 would change TPR when it wasn't actually changing how much of T2 that was represented
  4. with this change, interpretation (1) arguably doesn't hold, because suppose those multiple nodes in T1 that map to a single node in T2 all agree with the node in T2; they are arguably not wrong, so should not count against it
  5. the question is, I guess, what is "wrong"; one interpretation is that a node in T1 is 'wrong' if it implies relationships not in T2 (e.g., "on this chunk of genome A and B are more closely related than to C")
  6. having multiple nodes in T1 that match a given node in T2 feels a bit wrong, but counterpoint: adding completely-unary nodes to T1 should arguably not be penalised because we know that those nodes in fact exist - edges are but chains of inheritance through a bunch of individuals

So: currently we have:
Screenshot from 2025-01-16 05-53-34
... so I guess I'm arguing that the right-hand equality in equation (2) above should not hold, nor the right hand equality here:
Screenshot from 2025-01-16 05-56-17
so that, conceptually, sim( ) is a sum over nodes in T2 and dis( ) is a sum over nodes in T1. I think this argues that we need different names that sound less symmetric for them, unfortunately.

What do you think?

@hfr1tz3
Copy link
Collaborator Author

hfr1tz3 commented Jan 16, 2025

... conceptually, sim( ) is a sum over nodes in T2 and dis( ) is a sum over nodes in T1. I think this argues that we need different names that sound less symmetric for them, unfortunately.

I actually like this change. When I was attempting to redefine terms in the paper, the notation got bogged down.

ARF in my mind should be defined as it was with

$$\textnormal{ARF}(T_1, T_2) = 1-\frac{1}{|| T_1||} \sum_{n\in N_1} m(n_1, \alpha(n_1)).$$

To comment about points 3, 4, and 6:
I took some time to think about possibly doing a 'weighted avg' for TPR if we have multiple matches for a node in $T_2$, however those still can have a TPR of over 1. I think with the averaged node span for TPR, we can drastically decrease TPR when that doesn't reflect the actual property of "how much of $T_2$ is in $T_1$". For instance, I feel like a tree sequence $T_1$ with extra unary nodes should have TPR of 1 when compared to its simplified version $T_2$.

If we want to remove the 'averaged node span' from TPR we could alternatively define it as

$$ \textnormal{TPR}(T_1, T_2) = \frac{1}{||T_2||}\sum_{n_2\in N_2} \max_{n_1\in \alpha^{-1}(n)}m(n_1, n_2)$$

Below is an example between avg TPR and max TPR:

image

@petrelharp
Copy link
Contributor

Hey, this is a great point. max is totally better than mean here. Totally on board with this.

@hfr1tz3
Copy link
Collaborator Author

hfr1tz3 commented Jan 18, 2025

@petrelharp I made the change.

The work is pretty much that there are "best matches" with respect to $T_1$ and $T_2$ resp (marked by n1 or n2). If its an n1 match we use that to compute dissimilarity and ARF and if its an n2 match that we use that to compute the "inverse dissimilarity" := total span $T_2$ - best_n2_match_span and tpr.

I tried to make sure I added all the correct docs and tests, but let me know if I missed something.

@petrelharp
Copy link
Contributor

This looks right, although my brain's a bit slow right now. I'm not sure about the name "inverse dissimilarity" (but, good idea); for clarification: is compare(t1, t2).inverse_dissimilarity == compare(t2, t1).dissimilarity? (No, right? What's the difference?)

And, I've invited you as a collaborator so your tests run automatically (thought we'd done that already?). See the errors there, e.g.

FAILED tests/test_methods.py::TestDissimilarity::test_basic_comparison[pair0] - TypeError: ARFResult.__init__() got an unexpected keyword argument 'inverse_dissimilarity'

Then, :class:`.ARFResult` contains:

- (`dissimilarity`)
The total "matching span", which is the total span of
all nodes in `ts` over which each node is ancestral to the same set of
samples as its best match in `other`.

- (`inverse_dissimiliarity`)
The total "inverse matching span", which is the total
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to say "matched span" instead, but maybe that looks too much like "matching span"?

@@ -348,13 +352,23 @@ def compare(ts, other, transform=None):
If there are multiple matches with the same longest shared span
for a single node, the best match is the match that is closest in time.

For each node in `other` we compute the best matched span
as the average shared span amongst all nodes in `ts` which are its match.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops - this shouldn't say "average" any more, right?

@petrelharp
Copy link
Contributor

And, maybe we (sigh) need another term, since we use "best match of n1 is n2" to mean something different than "best match of n2 is n1" (since "best match": T1 -> T2 is many-to-one).

@hfr1tz3
Copy link
Collaborator Author

hfr1tz3 commented Jan 20, 2025

This looks right, although my brain's a bit slow right now. I'm not sure about the name "inverse dissimilarity" (but, good idea); for clarification: is compare(t1, t2).inverse_dissimilarity == compare(t2, t1).dissimilarity? (No, right? What's the difference?)

I think you're right, we should call it something else (no idea what yet).
compare(t1, t2).inverse_dissimilarity == compare(t2, t1).dissimilarity should be true (I checked experimentally with some large simulated tree sequences). It most likely could be proven because inverse_dissmilarity is simply looking for column max on our match matrix which is equivalent to the row max on the transpose of that matrix which is the match matrix for compare(t2,t1).

And, maybe we (sigh) need another term, since we use "best match of n1 is n2" to mean something different than "best match of n2 is n1" (since "best match": T1 -> T2 is many-to-one).

Maybe we just reword "best match of n2 is n1" to "best match for n2 from the matching $\gamma:T_2\to \beta^{-1}\circ\beta(T_1)$ from the fixed matching $\beta: T_1\to T_2$. Not very pretty, but it is exact.

@petrelharp
Copy link
Contributor

petrelharp commented Jan 20, 2025

Thinking out loud here: let $M(n1,n2)$ be the span on which n1 and n2 agree; and let $\beta_{12}(n1)$ be such that
$$M(n1, \beta(n1)) = \max_{n2} M(n1, n2),$$
in other words (why can't I do \argmax here?):
$$\beta_{12}(n1) = argmax_{n2}{M(n1, n2)}.$$
Furthermore, let $\delta(n2) \in \beta_{12}^{-1}(n2)$ that maximizes $M$, i.e.,
$$\delta(n2) = argmax_{n1 \in \beta_{12}^{-1}(n2)}{M(n1, n2)}.$$
Note that this is a priori not the same as
$$\beta_{21}(n2) = argmax_{n1}{M(n1, n2)},$$
since the maximum over all $n1$ might not be the same as the maximum over $\beta^{-1}(n2)$?

Indeed, suppose otherwise, i.e., $\beta_{21}(n2) = n1 \neq \delta(n2)$. Then (as long as there's no ties), we'd have that $M(n1,n2) > M(\delta(n2), n2)$. I think this is possible: suppose that we have $M(a1,b1) > M(a1, b2) > M(a2, b2)$, and that $\beta_{12}(a2) = b2$. Then $a1 \notin \beta_{12}^{-1}(b2)$, but $\beta_{21}(b2) = a1$.

@petrelharp
Copy link
Contributor

So - correct me if I'm wrong, but I think our original proposal was to use $\delta$; and now have you implemented it using $\beta_{21}$? It's not clear to me yet which is the better way to go.

@petrelharp
Copy link
Contributor

Okay - I think I agree with you, the method using $\delta$ is the way to go. This is because we're thinking about the case where T2 is the truth and T1 is inferred, and we want to say something like "okay this node in T1 is standing in for that node in T2". If we do the $\beta_{21}$ method, then we'd be allowing a single node in T1 to be standing in for more than one node in T2, which seems like not what we're going for.

best_match = np.argmax(dissimilarity_matrix, axis=1)
best_match_spans = np.zeros((ts.num_nodes,))
best_match_n1 = np.argmax(dissimilarity_matrix, axis=1)
n2_match_matrix = np.zeros((ts.num_nodes, other.num_nodes))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should use shared spans not dissimiliarty (since that's times)

Comment on lines 479 to 480
dissimilarity=total_span_ts - total_match_n1_span,
inverse_dissimilarity=total_span_other - total_match_n2_span,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

proposal to change these to to

Suggested change
dissimilarity=total_span_ts - total_match_n1_span,
inverse_dissimilarity=total_span_other - total_match_n2_span,
matched_span = (total_match_n1_span, total_match_n2_span)

@petrelharp
Copy link
Contributor

petrelharp commented Jan 23, 2025

I rewrote the naive method first (to be more naive), and turned up an issue in the implementation - I think - with rmse. Consider the case where a node has no best match (i.e., shared span is zero everywhere). Happily, there was a test for this! Should this node contributed to RMSE? I'd argue "no", it's only between nodes with a legit match. Previously, we were getting contributions to the RMSE from those nodes. So, now that test fails:

FAILED tests/test_methods.py::TestDissimilarity::test_with_no_match - AssertionError: 1.1702287212721978 != 0.46878005553759217

I guess we need to re-compute the naive expectation for that one.

edit: Hm, the code said that we were doing this right:

    # If a node x in `ts` has no match then we set time_difference to zero
    # This node then does not effect the rmse

but it's a bit hard to follow if that's actually what's going on? That test should be double-checked, anyhow.

@petrelharp
Copy link
Contributor

Okay, done with the re-write! And, I might not have broke anything?!? @hfr1tz3 I hope you look at it carefully though.

And, I turned up another minor issue - we're accounting properly for missing data in node_spans but not in shared_node_spans. (This only affects sample nodes.) This is caught by the new test_very_simple.f

@hfr1tz3
Copy link
Collaborator Author

hfr1tz3 commented Jan 23, 2025

And, I turned up another minor issue - we're accounting properly for missing data in node_spans but not in shared_node_spans. (This only affects sample nodes.) This is caught by the new test_very_simple.f

Can you explain more what you mean by missing data? If this is a problem somewhere with the shared_node_spans code that's more of a question @nspope.
I also believe that even in our paper we always begin with tree sequences with the same sample size of sample sets so this example wouldn't occur.

@nspope
Copy link

nspope commented Jan 23, 2025

'missing data' in that an edge connecting a sample node is removed (so the sample node is disconnected from the rest of the graph over part of the sequence). I don't think this'll happen for tsinfer'd tree sequences (or any other inference method I'm aware of), so maybe the immediate thing to do is detect it and error out?

@hfr1tz3
Copy link
Collaborator Author

hfr1tz3 commented Jan 23, 2025

Oh I see, I thought for a second the second Tree sequence was just 2 -> 0 with no sample 1.

I definitely don't understand why that example is reading ARF=-1/2...

@petrelharp
Copy link
Contributor

I've dealt with the missing value thing (just declared "we don't deal with missing data"), and uncovered another potential issue: we aren't requiring samples to match themselves. (Suppose that two tree sequences are the same, but in one the sample IDs are shuffled.)

@petrelharp
Copy link
Contributor

I'm dealing with that issue now...

@nspope
Copy link

nspope commented Jan 23, 2025

we aren't requiring samples to match themselves

That's just a documentation fix, I assume? As sample identity is the sample ID -- there's no way to ensure samples are identical between two distinct tree sequences in the absence of identifying metadata

@petrelharp
Copy link
Contributor

Okay - done. I ran into those edge cases with some very simple weird examples. We ought to have another test or two that exercises these things - for instance, samples being parents to other nodes, etc.

@petrelharp
Copy link
Contributor

That's just a documentation fix, I assume? As sample identity is the sample ID -- there's no way to ensure samples are identical between two distinct tree sequences in the absence of identifying metadata

Here's the example:

        # 1.00┊ 1 ┊         1.00┊ 0 ┊
        #     ┊ ┃ ┊   and       ┊ ┃ ┊
        # 0.00┊ 0 ┊         0.00┊ 1 ┊
        #     0   1             0   1
        # Note that if samples weren't required to match samples,
        # then 1 in the first matches 0 in the second (both are
        # ancestral to {0, 1})

The shared_span matrix here is

0 0
1 0

BUT we don't want to actually count that 1 as shared span. In other words, previously we would have said "oh yeah we're correctly representing the node 0 in the right-hand tree sequence, since we've got a node that's ancestral to {0, 1} in the left one"; however, we shouldn't have let that count, since that's maping one sample to another one. I deal with this by just zeroing out all entries shared_spans[i,j] where both i and j are samples but i != j.

Ah: and I added an error if the samples in the two tree sequences are not the same. So I need to document and test that. Then I'll be done.

@hfr1tz3
Copy link
Collaborator Author

hfr1tz3 commented Jan 26, 2025

Fixed some indexing errors in the algorithm and ran all the tests.
Probably should clean up the test file a little more, but the bulk of the work is done

@petrelharp
Copy link
Contributor

This looks good! There's one more thing, which is re-naming dissimilarity and inverse_dissimilarity (proposal was to a tuple, called shared_spans I think?). And, as you said, re-read the docs.

@hfr1tz3
Copy link
Collaborator Author

hfr1tz3 commented Jan 29, 2025

@petrelharp
Ready for final review
🎉

@petrelharp petrelharp merged commit c81c59f into tskit-dev:main Jan 31, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants