Average shared span for multiple matches #12
Conversation
I added extra tests that are for the
Gee, that's quite a clean implementation. Nice. My first inclination is not to introduce the "ties" argument? I think we don't have a use case for
That's a fair point. I think it was sort of an idea to put in place, in case we want to add a new matching scheme later; then we'd already have the scaffolding for that. However, if you want to remove it I can.
Good thought, but there's no need to put the argument in place, since if we add it in the future but with the default
Alright, I removed the
Hm, okay - sorry I didn't pick up on this earlier - but this is also changing dissimilarity (since both dissimilarity and tpr depend on
I actually like this change. When I was attempting to redefine terms in the paper, the notation got bogged down. ARF in my mind should be defined as it was with
To comment about points 3, 4, and 6: If we want to remove the 'averaged node span' from TPR we could alternatively define it as
Below is an example between avg TPR and max TPR:
Hey, this is a great point.
@petrelharp I made the change. The work is pretty much that there are "best matches" with respect to
I tried to make sure I added all the correct docs and tests, but let me know if I missed something.
This looks right, although my brain's a bit slow right now. I'm not sure about the name "inverse dissimilarity" (but good idea); for clarification: is
And, I've invited you as a collaborator so your tests run automatically (thought we'd done that already?). See the errors there, e.g.
tscompare/methods.py
Outdated
Then, :class:`.ARFResult` contains:

- (`dissimilarity`)
  The total "matching span", which is the total span of
  all nodes in `ts` over which each node is ancestral to the same set of
  samples as its best match in `other`.

- (`inverse_dissimiliarity`)
  The total "inverse matching span", which is the total
I want to say "matched span" instead, but maybe that looks too much like "matching span"?
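To make the relationship between these totals concrete, here is a minimal sketch (not code from this PR) that follows the `total_span - matched_span` convention used later in this thread; the array names and toy numbers are assumptions:

```python
import numpy as np

# Hypothetical inputs: shared_spans[i, j] is the span over which node i of `ts`
# is ancestral to the same set of samples as node j of `other`;
# node_spans_ts[i] is the total span of node i in `ts`.
shared_spans = np.array([[3.0, 0.0],
                         [1.0, 2.0]])
node_spans_ts = np.array([3.0, 3.0])

# Each node's matching span is its shared span with its best match in `other`.
best_match = np.argmax(shared_spans, axis=1)
matching_span = shared_spans[np.arange(len(node_spans_ts)), best_match]

total_span_ts = node_spans_ts.sum()        # 6.0
total_matching_span = matching_span.sum()  # 3.0 + 2.0 = 5.0
dissimilarity = total_span_ts - total_matching_span  # 1.0
```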
tscompare/methods.py
Outdated
@@ -348,13 +352,23 @@ def compare(ts, other, transform=None):
If there are multiple matches with the same longest shared span
for a single node, the best match is the match that is closest in time.

For each node in `other` we compute the best matched span
as the average shared span amongst all nodes in `ts` which are its match.
whoops - this shouldn't say "average" any more, right?
And, maybe we (sigh) need another term, since we use "best match of n1 is n2" to mean something different from "best match of n2 is n1" (since "best match": T1 -> T2 is many-to-one).
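For illustration, a small sketch of the many-to-one structure being discussed, together with the (since-removed) averaging over multiple matches; the shared-span matrix and its values are made up:

```python
import numpy as np

# Hypothetical shared-span matrix: rows are nodes of `ts`, columns nodes of `other`.
shared_spans = np.array([
    [4.0, 1.0],
    [3.0, 0.5],
    [0.0, 2.0],
])

# "best match": ts -> other is many-to-one, chosen by longest shared span.
best_match = np.argmax(shared_spans, axis=1)  # [0, 0, 1]

# For each node n2 of `other`, the nodes of `ts` whose best match is n2,
# and the average shared span over those matches.
for n2 in range(shared_spans.shape[1]):
    matches = np.flatnonzero(best_match == n2)
    if len(matches) > 0:
        print(n2, matches, shared_spans[matches, n2].mean())
```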
I think you're right, we should call it something else (no idea what yet).
Maybe we just reword "best match of n2 is n1" to "best match for n2 from the matching
Thinking out loud here: let
Indeed, suppose otherwise, i.e.,
So - correct me if I'm wrong, but I think our original proposal was to use
Okay - I think I agree with you, the method using
tests/test_methods.py
Outdated
best_match = np.argmax(dissimilarity_matrix, axis=1)
best_match_spans = np.zeros((ts.num_nodes,))
best_match_n1 = np.argmax(dissimilarity_matrix, axis=1)
n2_match_matrix = np.zeros((ts.num_nodes, other.num_nodes))
this should use shared spans, not dissimilarity (since that's times)
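As a sketch of what the corrected selection could look like, using the tie-breaking rule mentioned earlier in the thread (longest shared span, then closest in time); the function and argument names here are hypothetical:

```python
import numpy as np

def naive_best_match(shared_spans, times_ts, times_other):
    # Pick, for each node of `ts`, the node of `other` with the longest
    # shared span; among equal spans, take the match closest in time.
    best = np.zeros(shared_spans.shape[0], dtype=int)
    for n1 in range(shared_spans.shape[0]):
        row = shared_spans[n1]
        ties = np.flatnonzero(row == row.max())
        best[n1] = ties[np.argmin(np.abs(times_other[ties] - times_ts[n1]))]
    return best
```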
tscompare/methods.py
Outdated
dissimilarity=total_span_ts - total_match_n1_span,
inverse_dissimilarity=total_span_other - total_match_n2_span,
proposal to change these to:

dissimilarity=total_span_ts - total_match_n1_span,
inverse_dissimilarity=total_span_other - total_match_n2_span,
matched_span = (total_match_n1_span, total_match_n2_span)
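If the result object carried that matched-span tuple instead, the two dissimilarities would still be easy to recover; a minimal sketch with made-up numbers (field names assumed from the suggestion above):

```python
# Hypothetical totals for the two tree sequences.
total_span_ts, total_span_other = 10.0, 12.0
# The proposed tuple: (total_match_n1_span, total_match_n2_span).
matched_span = (8.5, 9.0)

dissimilarity = total_span_ts - matched_span[0]             # 1.5
inverse_dissimilarity = total_span_other - matched_span[1]  # 3.0
```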
I rewrote the naive method first (to be more naive), and turned up an issue in the implementation - I think - with
I guess we need to re-compute the naive expectation for that one. Edit: Hm, the code said that we were doing this right:
but it's a bit hard to follow whether that's actually what's going on? That test should be double-checked, anyhow.
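For reference, here is one way a deliberately naive shared-span computation can be written, iterating over the union of breakpoints and comparing sample sets tree by tree. This is a sketch under the assumption that both tree sequences have the same sequence length and samples, not the implementation in this PR:

```python
import numpy as np


def naive_shared_spans(ts, other):
    # spans[i, j]: total span over which node i of `ts` and node j of `other`
    # subtend exactly the same (non-empty) set of samples.
    spans = np.zeros((ts.num_nodes, other.num_nodes))
    breakpoints = sorted(set(ts.breakpoints()) | set(other.breakpoints()))
    for left, right in zip(breakpoints[:-1], breakpoints[1:]):
        t1 = ts.at(left)
        t2 = other.at(left)
        for n1 in range(ts.num_nodes):
            s1 = frozenset(t1.samples(n1))
            if not s1:
                continue
            for n2 in range(other.num_nodes):
                if s1 == frozenset(t2.samples(n2)):
                    spans[n1, n2] += right - left
    return spans
```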
Okay, done with the re-write! And, I might not have broken anything?!? @hfr1tz3 I hope you look at it carefully, though. And, I turned up another minor issue - we're accounting properly for missing data in
Can you explain more what you mean by missing data? If this is a problem somewhere with the
'missing data' in that an edge connecting a sample node is removed (so the sample node is disconnected from the rest of the graph over part of the sequence). I don't think this'll happen for tsinfer'd tree sequences (or any other inference method I'm aware of), so maybe the immediate thing to do is detect it and error out?
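To make that case concrete, here is a minimal hand-built example (a sketch, not taken from this PR or its tests) in which a sample is attached to its parent over only part of the sequence, so it is isolated elsewhere:

```python
import tskit

# Two samples (0, 1) and one internal node (2) on a length-10 sequence.
tables = tskit.TableCollection(sequence_length=10)
tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0)  # node 0
tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0)  # node 1
tables.nodes.add_row(flags=0, time=1)                     # node 2
tables.edges.add_row(left=0, right=10, parent=2, child=0)
# Sample 1 is connected only over [0, 5); it is isolated on [5, 10).
tables.edges.add_row(left=0, right=5, parent=2, child=1)
tables.sort()
ts = tables.tree_sequence()
print(ts.at(7).num_samples(2))  # node 2 subtends only sample 0 here -> 1
```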
Oh I see, I thought for a second the second tree sequence was just 2 -> 0 with no sample 1. I definitely don't understand why that example is reading ARF=-1/2...
I've dealt with the missing value thing (just declared "we don't deal with missing data"), and uncovered another potential issue: we aren't requiring samples to match themselves. (Suppose that two tree sequences are the same, but in one the sample IDs are shuffled.)
I'm dealing with that issue now...
That's just a documentation fix, I assume? Since sample identity is the sample ID, there's no way to ensure samples are identical between two distinct tree sequences in the absence of identifying metadata.
Okay - done. I ran into those edge cases with some very simple weird examples. We ought to have another test or two that exercise these things - for instance, samples being parents to other nodes, etc.
Here's the example:
The shared_span matrix here is
BUT we don't want to actually count that
Ah: and I added an error if the samples in the two tree sequences are not the same. So I need to document and test that. Then I'll be done.
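A hedged sketch of the sort of check just described (raising when the sample sets differ); the function name, message, and exact placement in `compare` are assumptions:

```python
import numpy as np

def check_same_samples(ts, other):
    # Require the two tree sequences to have identical sets of sample node IDs
    # before attempting any matching between them.
    if not np.array_equal(np.sort(ts.samples()), np.sort(other.samples())):
        raise ValueError("Samples in `ts` and `other` are not the same.")
```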
Fixed some indexing errors in the algorithm and ran all the tests.
This looks good! There's one more thing, which is re-naming
@petrelharp
Option 2 for issue #11
For each node $n_2\in T_2$ with multiple matches in $T_1$, let $\beta(n_2)=\{n_1\in T_1 \colon \alpha(n_1)=n_2\}$. Then we compute the similarity between two trees $T_1$ and $T_2$ as

$$\mathrm{sim}(T_1,T_2)=\sum_{n_2\in T_2}\frac{1}{|\beta(n_2)|}\sum_{n_1\in \beta(n_2)} m(n_1,n_2),$$

which is the average over shared spans between nodes in $T_2$ and their multiple matches $\beta(n_2)$.
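A sketch of this "Option 2" average in numpy, assuming a precomputed shared-span matrix $m$ (rows: nodes of $T_1$, columns: nodes of $T_2$) and the argmax-based map $\alpha$; the names are illustrative, not the package's API:

```python
import numpy as np

def option2_similarity(shared_spans):
    # alpha(n1): best match in T2 for each node n1 of T1 (by longest shared span).
    alpha = np.argmax(shared_spans, axis=1)
    sim = 0.0
    for n2 in range(shared_spans.shape[1]):
        beta = np.flatnonzero(alpha == n2)  # nodes of T1 whose best match is n2
        if len(beta) > 0:
            # (1 / |beta(n2)|) * sum of m(n1, n2) over n1 in beta(n2)
            sim += shared_spans[beta, n2].mean()
    return sim
```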