TPR can be greater than 1.0 due to multiple hits #11

petrelharp · 2024-12-20T16:58:05Z

For instance, in our documentation:

The issue here is that "best match" can be many-to-one. Recall that α(n1) is the mapping from the nodes of T1 to the nodes of T2 that gives the "best match" in T2; there can be many n1 that all map to the same n2. Potential resolutions:

1. Leave it as-is. I don't think this is ideal.
2. Average over all the nodes that map to the same n2.
3. Redefine α so that it is one-to-one, by letting only the "best" n1 match a given n2.
4. Redefine sim(T1, T2) so that it only sums over a subset of nodes on which α(n1) is one-to-one (picking the "best" ones).

For reference, here are the definitions:

So, option (4) is proposing something like: for each node in N2, let β(n2) denote the node n1 such that α(n1)=n2 and n1 has the best agreement out of all such n1; then sim(T1, T2) sums over nodes in N2 for which β(n2) is defined. Option (3) is similar but redefines α(n1).
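To make the difference between the options concrete, here is a minimal sketch on made-up numbers. It assumes α is available as a dict from T1 nodes to their best match in T2 and that the agreement of each matched pair is a single number (e.g. a shared span); the names `alpha`, `span`, `preimage`, `beta` and the three totals are hypothetical and not part of any existing API.

```python
from collections import defaultdict

# Hypothetical toy data: alpha[n1] is the best match in T2 for node n1 of T1,
# and span[n1] is the agreement (e.g. shared span) between n1 and alpha[n1].
alpha = {"a": "x", "b": "x", "c": "y"}
span = {"a": 10.0, "b": 6.0, "c": 4.0}

# Group the T1 nodes by the T2 node they map to.
preimage = defaultdict(list)
for n1, n2 in alpha.items():
    preimage[n2].append(n1)

# Option (1): sum over every n1, so the T2 node "x" is counted twice.
total_as_is = sum(span.values())

# Option (2): average the agreement over all n1 that share the same n2.
total_averaged = sum(
    sum(span[n1] for n1 in nodes) / len(nodes) for nodes in preimage.values()
)

# Option (4): beta[n2] is the best-agreeing n1 with alpha[n1] == n2;
# sum over the n2 for which beta is defined.
beta = {n2: max(nodes, key=lambda n1: span[n1]) for n2, nodes in preimage.items()}
total_beta = sum(span[beta[n2]] for n2 in beta)

print(total_as_is, total_averaged, total_beta)  # 20.0 12.0 14.0
```

Options (2) and (4) each count a given T2 node at most once, so (provided each agreement is no larger than the span of the matched T2 node) their totals cannot exceed the total span of T2's nodes, whereas the as-is total can.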

At first I liked, conceptually, options (3) and (4), since a 1-to-1 map seems more like what we want to be measuring, but I think they open a can of worms: suppose that both n and m map to n2, but n maps better. So, we say (effectively) that n maps to n2 while m is unmatched. But then we should map m to its second-best match... but maybe there's a different node mapping there? Well, I guess we could use the Gale-Shapley algorithm to resolve this?
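As a rough illustration of what a Gale-Shapley resolution could look like, here is a sketch in which T1 nodes "propose" to T2 nodes in order of decreasing agreement, and a T2 node keeps whichever proposer agrees with it best. The function `stable_match` and the `score` dict (agreement for each candidate pair) are hypothetical, and using the same score for both sides' preferences is an assumption of this sketch, not something settled in this thread.

```python
def stable_match(score):
    """Gale-Shapley with T1 nodes proposing; score[(n1, n2)] is the agreement."""
    # Each n1 ranks its candidate n2 from best to worst agreement.
    prefs = {}
    for (n1, n2) in score:
        prefs.setdefault(n1, []).append(n2)
    for n1 in prefs:
        prefs[n1].sort(key=lambda n2: score[(n1, n2)], reverse=True)

    holder = {}  # n2 -> the n1 currently matched to it
    next_choice = {n1: 0 for n1 in prefs}
    free = list(prefs)
    while free:
        n1 = free.pop()
        if next_choice[n1] >= len(prefs[n1]):
            continue  # n1 has run out of candidates and stays unmatched
        n2 = prefs[n1][next_choice[n1]]
        next_choice[n1] += 1
        rival = holder.get(n2)
        if rival is None:
            holder[n2] = n1
        elif score[(n1, n2)] > score[(rival, n2)]:
            holder[n2] = n1      # n1 displaces the weaker rival
            free.append(rival)
        else:
            free.append(n1)      # rejected; n1 tries its next choice later
    return {n1: n2 for n2, n1 in holder.items()}


# The situation described above: n and m both prefer n2, but n maps better,
# so m falls back to its second-best match m2.
score = {("n", "n2"): 9.0, ("m", "n2"): 7.0, ("m", "m2"): 5.0}
matching = stable_match(score)
print(matching)
```

Nodes that exhaust their candidate list stay unmatched, which is the case where (3) and (4) would penalize extra unary nodes.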

Option (2) (thx to @hfr1tz3) would certainly be simpler.

One practical consideration is: under the current definition of TPR, one can increase TPR by adding a bunch of entirely-unary nodes above any correctly matching node. And, I suppose that those nodes would not be wrong. So, maybe (2) would be doing the right thing here, while (3&4) would be a bit wrong (since they'd penalize for those nodes).

On the other hand, I'm vaguely imagining a situation where the "right answer" should be that n1 -> n2 and m1 -> m2, but things are arranged so that both n1 and m1 map to n2; in this case, a version of (3&4) with Gale-Shapley would do the right thing. This could happen if m2 is unary above n2 for a long span, but then another branch comes in between the two; and then in the inferred ARG we get this almost right, except that the branch coming in has a wrong sample and the span of that new branch is a bit longer than it should be.

Thoughts?


nspope · 2024-12-20T18:21:06Z

Hmm ... I don't have a strong opinion and you've clearly thought this through more carefully, but I suspect we can always come up with hypothetical situations where one choice will do better than the other. I also don't think we have a straightforward way of measuring how frequent these situations will be in practice (e.g. the scenario in your last paragraph). So, I lean towards (2) which is the simplest to implement and explain.

Let's say we choose (2) and regret it in the future -- then, we could introduce a ties="average" argument that toggles the original behavior, and introduce another method (ties="best" or something) that becomes the default.

hfr1tz3 · 2024-12-20T18:42:12Z

A wandering thought: if we use the symmetrized definitions (where $\overline{\textnormal{sim}}(T_1,T_2) = \frac{1}{2}(\textnormal{sim}(T_1, T_2)+\textnormal{sim}(T_2,T_1))$), would the TPR be less than 1? I am assuming that if $2>\textnormal{TPR}(T_1,T_2)>1$ then most likely (hopefully guaranteed) $\textnormal{TPR}(T_2,T_1)<1$, so their average is less than 1. Could be an additional easy option.

Maybe we want to add an option for compare to give the symmetrized difference anyway?

I also like @nspope 's option to do a ties="average" argument.

petrelharp · 2024-12-20T18:53:06Z

AFAICT, TPR is not bounded above: take as T2 the two-node tree sequence in which node 0 inherits from node 1 on the entire chromosome, and for T1 insert an additional n unary nodes between them; this, I think, has TPR=n+1.
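The arithmetic behind this example can be sketched as follows. This assumes a toy definition in which TPR is the matched span summed over the compared nodes of T1, divided by the total span of the compared nodes of T2, with only the non-sample nodes compared (an assumption chosen because it reproduces the n+1 above). The function `toy_tpr` and its parameters are hypothetical.

```python
def toy_tpr(n, L=1.0, average_ties=False):
    """TPR for the unary-chain example under the assumed toy definition."""
    # T2: a single ancestor (node 1) above sample 0 over a sequence of length L.
    t2_total_span = L
    # T1: the same ancestor plus n extra unary nodes. Each of these n + 1 nodes
    # subtends sample 0 everywhere, so each matches T2's node 1 with span L.
    matched_spans = [L] * (n + 1)
    if average_ties:
        # Option (2): all n + 1 nodes map to the same n2, so average them.
        matched = sum(matched_spans) / len(matched_spans)
    else:
        # Current behaviour: every T1 node contributes its full matched span.
        matched = sum(matched_spans)
    return matched / t2_total_span


for n in (0, 1, 5):
    print(n, toy_tpr(n), toy_tpr(n, average_ties=True))
```

Under this toy definition the as-is value grows linearly in n, while averaging over the nodes that share a match keeps it at 1.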

I'm not seeing an intuitive reason for the symmetrized version?

hfr1tz3 · 2024-12-20T19:12:16Z

> I'm not seeing an intuitive reason for the symmetrized version?

The basis would be to actually have the 'metric version' of ARF and TPR being used, however you bring up a good counter-example. So appending option 2 would be best instead.

hfr1tz3 mentioned this issue Dec 22, 2024

Average shared span for multiple matches #12

Open
