Average shared span for multiple matches #12

hfr1tz3 · 2024-12-22T19:33:14Z

Option 2 for issue #11

For each node $n_2\in T_2$ with multiple matches in $T_1$, Let $\beta(n_2)=[n_1\in T_1 \colon \alpha(n_1)=n_2]$. Then we compute the similarity between two trees $T_1$, and $T_2$ as
$$sim(T_1,T_2)=\sum_{n_2\in T_2}\frac{1}{\beta(n_2)}\sum_{n_1\in \alpha^{-1}(n_2)} m(n_1,n_2),$$
which is the average over shared spans between nodes in $T_2$ and their multiple matches $\beta(n_2)$.

hfr1tz3 · 2024-12-23T19:03:57Z

I added extra tests that are for the ties='average' method. I am open to more suggestions if we need more.

petrelharp · 2025-01-04T23:03:05Z

Gee, that's a quite clean implementation. Nice.

My first inclination is not to introduce the "ties" argument? I think we don't have a use case for ties=None, so why complicate things?

hfr1tz3 · 2025-01-06T18:54:09Z

That's a fair point. I think it was sort of an idea to put in place, where in case we want to add a new matching scheme then we have the scaffolding for that already. However, if want to remove that I can.
@petrelharp just ping me if you want it removed. I'll try to get to it while at JMM.

petrelharp · 2025-01-07T19:09:23Z

Good thought, but there's no need to put the argument in place, since if we add it in the future but with the default "average", then that'll be backwards-compatible. So, it's just unecessary complexity. I'd like it removed.

hfr1tz3 · 2025-01-13T22:03:22Z

Alright, I removed the ties argument from the compare function. I additionally added another test case to check that the new average node span is working properly.

petrelharp · 2025-01-16T04:25:26Z

Hm, okay - sorry I didn't pick up on this earlier - but, this is also changing dissimilarity (since both dissimilarity and tpr depend on total_match_span). Now I need to think about whether this is desired or not.

petrelharp · 2025-01-16T13:58:52Z

Okay: I think that if T1 is inferred and T2 is truth, then our motivation has been:

we want ARF to be "how much of T1 is wrong" (so, denominator is |T1|)
and TPR to be "how much of T2 is correctly represented in T1" (denominator is |T2|)
previously, interpretation (2) does not hold, because adding more nodes to T1 that mapped to the same node in T2 would change TPR when it wasn't actually changing how much of T2 that was represented
with this change, interpretation (1) arguably doesn't hold, because suppose those multiple nodes in T1 that map to a single node in T2 all agree with the node in T2; they are arguably not wrong, so should not count against it
the question is, I guess, what is "wrong"; one interpretation is that a node in T1 is 'wrong' if it implies relationships not in T2 (e.g., "on this chunk of genome A and B are more closely related than to C")
having multiple nodes in T1 that match a given node in T2 feels a bit wrong, but counterpoint: adding completely-unary nodes to T1 should arguably not be penalised because we know that those nodes in fact exist - edges are but chains of inheritance through a bunch of individuals

So: currently we have:

... so I guess I'm arguing that the right-hand equality in equation (2) above should not hold, nor the right hand equality here:

so that, conceptually, sim( ) is a sum over nodes in T2 and dis( ) is a sum over nodes in T1. I think this argues that we need different names that sound less symmetric for them, unfortunately.

What do you think?

hfr1tz3 · 2025-01-16T19:15:50Z

... conceptually, sim( ) is a sum over nodes in T2 and dis( ) is a sum over nodes in T1. I think this argues that we need different names that sound less symmetric for them, unfortunately.

I actually like this change. When I was attempting to redefine terms in the paper, the notation got bogged down.

ARF in my mind should be defined as it was with

$$\textnormal{ARF}(T_1, T_2) = 1-\frac{1}{|| T_1||} \sum_{n\in N_1} m(n_1, \alpha(n_1)).$$

To comment about points 3, 4, and 6:
I took some time to think about possibly doing a 'weighted avg' for TPR if we have multiple matches for a node in $T_2$, however those still can have a TPR of over 1. I think with the averaged node span for TPR, we can drastically decrease TPR when that doesn't reflect the actual property of "how much of $T_2$ is in $T_1$". For instance, I feel like a tree sequence $T_1$ with extra unary nodes should have TPR of 1 when compared to its simplified version $T_2$.

If we want to remove the 'averaged node span' from TPR we could alternatively define it as

$$ \textnormal{TPR}(T_1, T_2) = \frac{1}{||T_2||}\sum_{n_2\in N_2} \max_{n_1\in \alpha^{-1}(n)}m(n_1, n_2)$$

Below is an example between avg TPR and max TPR:

petrelharp · 2025-01-17T05:15:58Z

Hey, this is a great point. max is totally better than mean here. Totally on board with this.

hfr1tz3 · 2025-01-18T23:23:46Z

@petrelharp I made the change.

The work is pretty much that there are "best matches" with respect to $T_1$ and $T_2$ resp (marked by n1 or n2). If its an n1 match we use that to compute dissimilarity and ARF and if its an n2 match that we use that to compute the "inverse dissimilarity" := total span $T_2$ - best_n2_match_span and tpr.

I tried to make sure I added all the correct docs and tests, but let me know if I missed something.

petrelharp · 2025-01-20T17:42:01Z

This looks right, although my brain's a bit slow right now. I'm not sure about the name "inverse dissimilarity" (but, good idea); for clarification: is compare(t1, t2).inverse_dissimilarity == compare(t2, t1).dissimilarity? (No, right? What's the difference?)

And, I've invited you as a collaborator so your tests run automatically (thought we'd done that already?). See the errors there, e.g.

FAILED tests/test_methods.py::TestDissimilarity::test_basic_comparison[pair0] - TypeError: ARFResult.__init__() got an unexpected keyword argument 'inverse_dissimilarity'

petrelharp · 2025-01-20T17:43:25Z

tscompare/methods.py

    Then, :class:`.ARFResult` contains:

    - (`dissimilarity`)
        The total "matching span", which is the total span of
        all nodes in `ts` over which each node is ancestral to the same set of
        samples as its best match in `other`.

+    - (`inverse_dissimiliarity`)
+        The total "inverse matching span", which is the total


I want to say "matched span" instead, but maybe that looks too much like "matching span"?

petrelharp · 2025-01-20T17:43:56Z

tscompare/methods.py

@@ -348,13 +352,23 @@ def compare(ts, other, transform=None):
    If there are multiple matches with the same longest shared span
    for a single node, the best match is the match that is closest in time.

+    For each node in `other` we compute the best matched span
+    as the average shared span amongst all nodes in `ts` which are its match.


whoops - this shouldn't say "average" any more, right?

petrelharp · 2025-01-20T17:48:52Z

And, maybe we (sigh) need another term, since we use "best match of n1 is n2" to mean something different than "best match of n2 is n1" (since "best match": T1 -> T2 is many-to-one).

hfr1tz3 · 2025-01-20T19:04:15Z

This looks right, although my brain's a bit slow right now. I'm not sure about the name "inverse dissimilarity" (but, good idea); for clarification: is compare(t1, t2).inverse_dissimilarity == compare(t2, t1).dissimilarity? (No, right? What's the difference?)

I think you're right, we should call it something else (no idea what yet).
compare(t1, t2).inverse_dissimilarity == compare(t2, t1).dissimilarity should be true (I checked experimentally with some large simulated tree sequences). It most likely could be proven because inverse_dissmilarity is simply looking for column max on our match matrix which is equivalent to the row max on the transpose of that matrix which is the match matrix for compare(t2,t1).

And, maybe we (sigh) need another term, since we use "best match of n1 is n2" to mean something different than "best match of n2 is n1" (since "best match": T1 -> T2 is many-to-one).

Maybe we just reword "best match of n2 is n1" to "best match for n2 from the matching $\gamma:T_2\to \beta^{-1}\circ\beta(T_1)$ from the fixed matching $\beta: T_1\to T_2$. Not very pretty, but it is exact.

petrelharp · 2025-01-20T19:40:24Z

Thinking out loud here: let $M(n1,n2)$ be the span on which n1 and n2 agree; and let $\beta_{12}(n1)$ be such that
$$M(n1, \beta(n1)) = \max_{n2} M(n1, n2),$$
in other words (why can't I do \argmax here?):
$$\beta_{12}(n1) = argmax_{n2}{M(n1, n2)}.$$
Furthermore, let $\delta(n2) \in \beta_{12}^{-1}(n2)$ that maximizes $M$, i.e.,
$$\delta(n2) = argmax_{n1 \in \beta_{12}^{-1}(n2)}{M(n1, n2)}.$$
Note that this is a priori not the same as
$$\beta_{21}(n2) = argmax_{n1}{M(n1, n2)},$$
since the maximum over all $n1$ might not be the same as the maximum over $\beta^{-1}(n2)$?

Indeed, suppose otherwise, i.e., $\beta_{21}(n2) = n1 \neq \delta(n2)$. Then (as long as there's no ties), we'd have that $M(n1,n2) > M(\delta(n2), n2)$. I think this is possible: suppose that we have $M(a1,b1) > M(a1, b2) > M(a2, b2)$, and that $\beta_{12}(a2) = b2$. Then $a1 \notin \beta_{12}^{-1}(b2)$, but $\beta_{21}(b2) = a1$.

petrelharp · 2025-01-20T19:42:05Z

So - correct me if I'm wrong, but I think our original proposal was to use $\delta$; and now have you implemented it using $\beta_{21}$? It's not clear to me yet which is the better way to go.

tscompare/methods.py

petrelharp · 2025-01-20T21:00:55Z

Okay - I think I agree with you, the method using $\delta$ is the way to go. This is because we're thinking about the case where T2 is the truth and T1 is inferred, and we want to say something like "okay this node in T1 is standing in for that node in T2". If we do the $\beta_{21}$ method, then we'd be allowing a single node in T1 to be standing in for more than one node in T2, which seems like not what we're going for.

petrelharp · 2025-01-22T22:49:11Z

tests/test_methods.py

-    best_match = np.argmax(dissimilarity_matrix, axis=1)
-    best_match_spans = np.zeros((ts.num_nodes,))
+    best_match_n1 = np.argmax(dissimilarity_matrix, axis=1)
+    n2_match_matrix = np.zeros((ts.num_nodes, other.num_nodes))


this should use shared spans not dissimiliarty (since that's times)

petrelharp · 2025-01-22T23:18:24Z

tscompare/methods.py

+        dissimilarity=total_span_ts - total_match_n1_span,
+        inverse_dissimilarity=total_span_other - total_match_n2_span,


proposal to change these to to

Suggested change

dissimilarity=total_span_ts - total_match_n1_span,

inverse_dissimilarity=total_span_other - total_match_n2_span,

matched_span = (total_match_n1_span, total_match_n2_span)

petrelharp · 2025-01-23T01:46:48Z

I rewrote the naive method first (to be more naive), and turned up an issue in the implementation - I think - with rmse. Consider the case where a node has no best match (i.e., shared span is zero everywhere). Happily, there was a test for this! Should this node contributed to RMSE? I'd argue "no", it's only between nodes with a legit match. Previously, we were getting contributions to the RMSE from those nodes. So, now that test fails:

FAILED tests/test_methods.py::TestDissimilarity::test_with_no_match - AssertionError: 1.1702287212721978 != 0.46878005553759217

I guess we need to re-compute the naive expectation for that one.

edit: Hm, the code said that we were doing this right:

    # If a node x in `ts` has no match then we set time_difference to zero
    # This node then does not effect the rmse

but it's a bit hard to follow if that's actually what's going on? That test should be double-checked, anyhow.

petrelharp · 2025-01-23T04:36:38Z

Okay, done with the re-write! And, I might not have broke anything?!? @hfr1tz3 I hope you look at it carefully though.

And, I turned up another minor issue - we're accounting properly for missing data in node_spans but not in shared_node_spans. (This only affects sample nodes.) This is caught by the new test_very_simple.f

hfr1tz3 · 2025-01-23T18:21:38Z

And, I turned up another minor issue - we're accounting properly for missing data in node_spans but not in shared_node_spans. (This only affects sample nodes.) This is caught by the new test_very_simple.f

Can you explain more what you mean by missing data? If this is a problem somewhere with the shared_node_spans code that's more of a question @nspope.
I also believe that even in our paper we always begin with tree sequences with the same sample size of sample sets so this example wouldn't occur.

nspope · 2025-01-23T18:48:26Z

'missing data' in that an edge connecting a sample node is removed (so the sample node is disconnected from the rest of the graph over part of the sequence). I don't think this'll happen for tsinfer'd tree sequences (or any other inference method I'm aware of), so maybe the immediate thing to do is detect it and error out?

hfr1tz3 · 2025-01-23T18:49:57Z

Oh I see, I thought for a second the second Tree sequence was just 2 -> 0 with no sample 1.

I definitely don't understand why that example is reading ARF=-1/2...

petrelharp · 2025-01-23T19:37:01Z

I've dealt with the missing value thing (just declared "we don't deal with missing data"), and uncovered another potential issue: we aren't requiring samples to match themselves. (Suppose that two tree sequences are the same, but in one the sample IDs are shuffled.)

petrelharp · 2025-01-23T19:37:10Z

I'm dealing with that issue now...

nspope · 2025-01-23T19:45:57Z

we aren't requiring samples to match themselves

That's just a documentation fix, I assume? As sample identity is the sample ID -- there's no way to ensure samples are identical between two distinct tree sequences in the absence of identifying metadata

petrelharp · 2025-01-23T19:46:55Z

Okay - done. I ran into those edge cases with some very simple weird examples. We ought to have another test or two that exercises these things - for instance, samples being parents to other nodes, etc.

petrelharp · 2025-01-23T19:50:37Z

That's just a documentation fix, I assume? As sample identity is the sample ID -- there's no way to ensure samples are identical between two distinct tree sequences in the absence of identifying metadata

Here's the example:

        # 1.00┊ 1 ┊         1.00┊ 0 ┊
        #     ┊ ┃ ┊   and       ┊ ┃ ┊
        # 0.00┊ 0 ┊         0.00┊ 1 ┊
        #     0   1             0   1
        # Note that if samples weren't required to match samples,
        # then 1 in the first matches 0 in the second (both are
        # ancestral to {0, 1})

The shared_span matrix here is

0 0
1 0

BUT we don't want to actually count that 1 as shared span. In other words, previously we would have said "oh yeah we're correctly representing the node 0 in the right-hand tree sequence, since we've got a node that's ancestral to {0, 1} in the left one"; however, we shouldn't have let that count, since that's maping one sample to another one. I deal with this by just zeroing out all entries shared_spans[i,j] where both i and j are samples but i != j.

Ah: and I added an error if the samples in the two tree sequences are not the same. So I need to document and test that. Then I'll be done.

hfr1tz3 · 2025-01-26T06:02:22Z

Fixed some indexing errors in the algorithm and ran all the tests.
Probably should clean up the test file a little more, but the bulk of the work is done

tscompare/methods.py

petrelharp · 2025-01-29T05:19:10Z

This looks good! There's one more thing, which is re-naming dissimilarity and inverse_dissimilarity (proposal was to a tuple, called shared_spans I think?). And, as you said, re-read the docs.

hfr1tz3 · 2025-01-29T21:19:44Z

@petrelharp
Ready for final review
🎉

hfr1tz3 added 4 commits December 22, 2024 11:27

add average ties for compare method

22fb7b5

quick fix

4921a5d

added test cases for 'average' ties

a8be5b4

tests passed

f43a9ec

hfr1tz3 marked this pull request as ready for review December 23, 2024 19:00

removed 'ties' arg

ed589f1

**actually** removed ties arg

001957d

TPR constructed with max spans from inverse match

9fc936a

petrelharp reviewed Jan 20, 2025

View reviewed changes

hfr1tz3 added 2 commits January 20, 2025 11:17

minor updates

fd4f999

more minor changes

61a9acf

linting, pre-commit changes

74eb649

hfr1tz3 commented Jan 20, 2025

View reviewed changes

tscompare/methods.py Outdated Show resolved Hide resolved

slight change

04a347f

petrelharp reviewed Jan 22, 2025

View reviewed changes

rewrite of naive method

5b6dc82

rewrite of compare

697d08f

.

35bcb37

make samples always match themselves

edd5646

petrelharp and others added 3 commits January 23, 2025 11:56

test and docs for sample equality

07393af

alg fixes, new tests

7489263

lint

dd378fd

petrelharp reviewed Jan 26, 2025

View reviewed changes

tscompare/methods.py Show resolved Hide resolved

small tweak

a6bd7ab

petrelharp reviewed Jan 26, 2025

View reviewed changes

tscompare/methods.py Show resolved Hide resolved

hfr1tz3 added 2 commits January 26, 2025 11:05

test for isolated samples

db19b60

LINT

2144510

changed syntax, docs

d653b6d

petrelharp merged commit c81c59f into tskit-dev:main Jan 31, 2025
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Average shared span for multiple matches #12

Average shared span for multiple matches #12

hfr1tz3 commented Dec 22, 2024 •

edited by petrelharp

Loading

hfr1tz3 commented Dec 23, 2024

petrelharp commented Jan 4, 2025

hfr1tz3 commented Jan 6, 2025

petrelharp commented Jan 7, 2025

hfr1tz3 commented Jan 13, 2025

petrelharp commented Jan 16, 2025

petrelharp commented Jan 16, 2025

hfr1tz3 commented Jan 16, 2025

petrelharp commented Jan 17, 2025

hfr1tz3 commented Jan 18, 2025

petrelharp commented Jan 20, 2025

petrelharp Jan 20, 2025

petrelharp Jan 20, 2025

petrelharp commented Jan 20, 2025

hfr1tz3 commented Jan 20, 2025 •

edited

Loading

petrelharp commented Jan 20, 2025 •

edited

Loading

petrelharp commented Jan 20, 2025

petrelharp commented Jan 20, 2025

petrelharp Jan 22, 2025

petrelharp Jan 22, 2025

petrelharp commented Jan 23, 2025 •

edited

Loading

petrelharp commented Jan 23, 2025

hfr1tz3 commented Jan 23, 2025

nspope commented Jan 23, 2025

hfr1tz3 commented Jan 23, 2025 •

edited

Loading

petrelharp commented Jan 23, 2025

petrelharp commented Jan 23, 2025

nspope commented Jan 23, 2025

petrelharp commented Jan 23, 2025

petrelharp commented Jan 23, 2025

hfr1tz3 commented Jan 26, 2025

petrelharp commented Jan 29, 2025

hfr1tz3 commented Jan 29, 2025

		dissimilarity=total_span_ts - total_match_n1_span,
		inverse_dissimilarity=total_span_other - total_match_n2_span,

	dissimilarity=total_span_ts - total_match_n1_span,
	inverse_dissimilarity=total_span_other - total_match_n2_span,
	matched_span = (total_match_n1_span, total_match_n2_span)

Average shared span for multiple matches #12

Average shared span for multiple matches #12

Conversation

hfr1tz3 commented Dec 22, 2024 • edited by petrelharp Loading

hfr1tz3 commented Dec 23, 2024

petrelharp commented Jan 4, 2025

hfr1tz3 commented Jan 6, 2025

petrelharp commented Jan 7, 2025

hfr1tz3 commented Jan 13, 2025

petrelharp commented Jan 16, 2025

petrelharp commented Jan 16, 2025

hfr1tz3 commented Jan 16, 2025

petrelharp commented Jan 17, 2025

hfr1tz3 commented Jan 18, 2025

petrelharp commented Jan 20, 2025

petrelharp Jan 20, 2025

Choose a reason for hiding this comment

petrelharp Jan 20, 2025

Choose a reason for hiding this comment

petrelharp commented Jan 20, 2025

hfr1tz3 commented Jan 20, 2025 • edited Loading

petrelharp commented Jan 20, 2025 • edited Loading

petrelharp commented Jan 20, 2025

petrelharp commented Jan 20, 2025

petrelharp Jan 22, 2025

Choose a reason for hiding this comment

petrelharp Jan 22, 2025

Choose a reason for hiding this comment

petrelharp commented Jan 23, 2025 • edited Loading

petrelharp commented Jan 23, 2025

hfr1tz3 commented Jan 23, 2025

nspope commented Jan 23, 2025

hfr1tz3 commented Jan 23, 2025 • edited Loading

petrelharp commented Jan 23, 2025

petrelharp commented Jan 23, 2025

nspope commented Jan 23, 2025

petrelharp commented Jan 23, 2025

petrelharp commented Jan 23, 2025

hfr1tz3 commented Jan 26, 2025

petrelharp commented Jan 29, 2025

hfr1tz3 commented Jan 29, 2025

hfr1tz3 commented Dec 22, 2024 •

edited by petrelharp

Loading

hfr1tz3 commented Jan 20, 2025 •

edited

Loading

petrelharp commented Jan 20, 2025 •

edited

Loading

petrelharp commented Jan 23, 2025 •

edited

Loading

hfr1tz3 commented Jan 23, 2025 •

edited

Loading