-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TPR can be greater than 1.0 due to multiple hits #11
Comments
Hmm ... I don't have a strong opinion and you've clearly thought this through more carefully, but I suspect we can always come up with hypothetical situations where one choice will do better than the other. I also don't think we have a straightforward way of measuring how frequent these situations will be in practice (e.g. the scenario in your last paragraph). So, I lean towards (2) which is the simplest to implement and explain. Let's say we choose (2) and regret it in the future -- then, we could introduce a |
A wandering thought: If we use the symmetrized definitions (where Maybe we want to add an option for I also like @nspope 's option to do a |
AFACT, TPR is not bounded above: consider as T2 the two-node tree sequence in which node 0 inherits from node 1 on the entire chromosome; and for T1 insert an additional I'm not seeing an intuitive reason for the symmetrized version? |
The basis would be to actually have the 'metric version' of ARF and TPR being used, however you bring up a good counter-example. So appending option 2 would be best instead. |
For instance, in our documentation:
The issue here is that "best match" can be many-to-one: in other words; recall that
α(n1)
is the mapping from the nodes of T1 to T2 that gives us the "best match" in T2; there can be manyn1
that all map to the samen2
. Potential resolutions:n2
.α
so that it is one-to-one, by letting only the "best"n1
match a givenn2
.sim(T1, T2)
so that it only sums over a subset of nodes for whichα(n1)
is one-to-one (and picking the "best" ones).For reference, here's the definitions:
So, option (4) is proposing something like: for each node in N2, let
β(n2)
denote the noden1
such thatα(n1)=n2
andn1
has the best agreement out of all suchn1
; thensim(T1, T2)
sums over nodes in N2 for whichβ(n2)
is defined. Option (3) is similar but redefinesα(n1)
.At first I liked, conceptually, options (3) and (4) since a 1-to-1 map seems more like what we want to be measuring, but I think they open a can of worms because: suppose that both
n
andm
map ton2
, butn
maps better. So, we say (effectively) thatn
maps ton2
whilem
is unmatched. But, then we should mapm
to its second-best match... but, maybe there's a different node mapping there? Well, I guess we could use the Gale-Shapely algorithm to resolve this?Option (2) (thx to @hfr1tz3) would certainly be simpler.
One practical consideration is: under the current definition of TPR, one can increase TPR by adding a bunch of entirely-unary nodes above any correctly matching node. And, I suppose that those nodes would not be wrong. So, maybe (2) would be doing the right thing here, while (3&4) would be a bit wrong (since they'd penalize for those nodes).
On the other hand, I'm vaguely imagining a situation where the "right answer" should be that
n1 -> n2
andm1 -> m2
but things are arranged so that bothn1
andm1
map ton2
; in this case, a version of (3&4) with Gale-Shapely would do the right thing. This could happen ifm2
is unary aboven2
for a long span, but then another branch comes in between the two; and then in the inferred ARG we get this almost right except that the branch coming in has a wrong sample and the span of that new branch is a bit longer than it should be.Thoughts?
The text was updated successfully, but these errors were encountered: