Add Top-K Nearest Neighbors to Matrix Profile (normalize=True) #592
@seanlaw We can create a new function. The reason is that I think returning the right and left NN may not be necessary in it. What do you think? Or, do you prefer to just apply changes to the current function and worry about the design later?
@seanlaw
I think I would prefer trying to modify the existing functions by adding the relevant and needed data structures. To be clear, I don't know what the best solution is, so we will have to be flexible, as the right solution isn't clear to me either. I have a few points:
Any thoughts/questions? Criticism/feedback?
Right! In fact, I was thinking of creating new private functions as well. That may become a disaster :) I am flexible. So, I will modify the stump function as it seems to be a more rational solution at this moment.
I overlooked this point! I will do as suggested and we can then see if it sounds alright.
Thanks for the link! Very interesting!! I think I got it. Just to confirm:
For now, let's go with "yes" :) What might fail is 100% code coverage and then we'll have to make sure we handle that. I don't want to get too far ahead of ourselves and over-analyze everything. Given the possible complexities and newness, let's just "try something and see what happens". Maybe we'll break something and we need to either fix it or abandon this altogether. Either way, we learn something new. Sounds good? At the end of the day, I believe that only ONE user has asked about VALMOD and maybe only 2 more have asked about top-k, which isn't a lot. If it's not a big problem to add
k-NN for each distance-matrix row? The most useful method is k-NN for each motif.
Alternative Algorithm, entirely based on Black Magic:
@JaKasb We are talking about distance matrix rows in this case. Given that all pairwise subsequence distances are computed, I suspect that this is more about whether there is a clean way to handle the bookkeeping that also doesn't reduce the maintainability of the code base. Any pointers or suggestions are greatly appreciated! I have not come across this linear-time paper. Thanks for sharing it!
(@seanlaw You beat me to it!) @JaKasb
Are these two different? I mean...the distance matrix is symmetric. If there is a difference, could you please explain what that is?
I can see your point! As you noticed, the top-k smallest values in each distance profile may not be useful for getting the k-NN best matches for each subsequence, because they do not exclude trivial matches; after removing trivial matches, we may end up with fewer than k matches. The goal of these top-k smallest values in each distance profile is that we want to use them later in the VALMOD algorithm (see PR #586). And I think the VALMOD algorithm does not work properly if we do not return the top-k smallest values of each distance profile (NOT the top-k smallest with minimum temporal distance). I am trying to understand the thing you suggested.
Could you please elaborate/clarify this sentence?
Did you mean "matrix profile" here?
So, I assume you want to get the top-k best matches (considering a minimum temporal distance between these top-k matches) for each x in peak_indices. Is that correct? And then, at the end, we have M motifs (# of peaks detected from the negative of the matrix profile, i.e. the M smallest values considering the exclusion of trivial matches) and, for each, we find its top-k matches. Correct? And then there should be a way to sort these motifs considering their k-NN (maybe by the average distance to their k neighbors?)
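The kind of trivial-match exclusion discussed above can be sketched as follows (a minimal illustration with a made-up helper name and exclusion-zone size, not STUMPY's actual implementation):

```python
import numpy as np

def top_k_matches(distance_profile, k, excl_zone):
    """Return indices of the k best matches, skipping trivial matches.

    After each pick, neighbors within +/- excl_zone of the picked index
    are masked out, so overlapping (trivial) matches are excluded. Fewer
    than k indices may be returned if the profile is exhausted first.
    """
    D = distance_profile.astype(np.float64).copy()
    matches = []
    for _ in range(k):
        idx = int(np.argmin(D))
        if np.isinf(D[idx]):
            break  # fewer than k non-trivial matches exist
        matches.append(idx)
        lo = max(0, idx - excl_zone)
        hi = min(len(D), idx + excl_zone + 1)
        D[lo:hi] = np.inf  # mask the trivial-match neighborhood
    return matches

D = np.array([5.0, 1.0, 1.1, 4.0, 0.5, 3.0, 2.0])
print(top_k_matches(D, k=3, excl_zone=1))  # [4, 1, 6]
```

Note that index 2 (distance 1.1) is skipped because it falls inside the exclusion zone of index 1, which is exactly why a naive "k smallest values" selection differs from k non-trivial matches.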
Yes! That makes sense!
Yep! Let's go with MVP and see how things will go!
I am trying to write However, I can also see that SIDE NOTE:
@seanlaw
Actually, I don't think you need
At the end of the day, we'll call
Yes, I don't recall "why" (maybe due to laziness or haste) but, indeed, we should have called it
Thanks for the clarification. I will continue discussion on the PR #595.
Sure. I can create a new issue regarding this matter. (If you think it is better to keep it here, please let me know)
Yes, a separate issue would be great. Thank you!
Excuse me, but I want to confirm that
No, top-k nearest neighbor does not mean "top-k motifs". They are two different things. Currently, in STUMPY, each matrix profile only returns the top-1 nearest neighbor for each subsequence. This issue deals with allowing STUMPY to return the top-k nearest neighbors rather than only the top-1.
No, STUMPY can output more than just the top motif. Please try the `stumpy.motifs` function and feel free to post general usage questions in our Discussions section.
Hey guys, sorry I'm late to this, @NimaSarajpoor sick work with the top k stuff man, just tried it out locally, and it looks sweet! Love the way you've laid the architecture out, you should be proud of what you've done, the top k feature had been requested for a long time! Great job! 🎉 I also had a few questions about your changes I was hoping you could help me answer:
Good point! I think it can be removed.
Yes, we should update the docstrings.
Indeed. And we shouldn't forget about their equivalent non-normalized versions too. @alvii147 If you have time, would you mind submitting a PR for all of these findings? I'd appreciate the help!
So, I probably forgot to remove it. Previously, we had an array
Right! I think you and @seanlaw have some kind of eagle eyes :) Yes... I should have paid more attention to the comments.
This is not what I saw on my system. I mean... my results show that there is very little overhead in the top-k version. Have you tested it over 20 iterations? It might be nice to run it 20 times and then plot the results. Also, I would like to bring your attention to two things here: (2) We also use three arrays
That would be awesome! :)
If only I spent more time scrutinizing my own code 😫
Honestly it's so weird, somehow while
So, in my experience, it seems that the difference in performance can be platform dependent to some extent. I was trying to check the performance of top-k on the two PCs that I have access to, and I got different results. Also, @seanlaw tried it on his own PC and came to a different conclusion. To investigate, you may want to consider the following items: (1) You can replace the arrays (2) You may want to leave the (3) In some cases, I got a large standard deviation in the computing time! For instance, there was a case where I ran a function for run Also, note that this is a two-tailed p-value. Hence, you should double the area under the t-distribution and see if it is less than 5%. If it is less than 5%, you can reject the null hypothesis, which means there is a statistically significant difference. (I was lazy and didn't do this!)
Btw, regarding the hypothesis testing, the assumption is that the computing-time samples are normally distributed.
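The check being suggested can be sketched with Welch's t-test (which does not assume equal variances and already reports a two-tailed p-value); the timing samples below are simulated, not real measurements:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical timing samples (seconds) for the old and top-k versions
old_times = rng.normal(loc=1.00, scale=0.05, size=20)
topk_times = rng.normal(loc=1.02, scale=0.08, size=20)

# Welch's t-test; null hypothesis: the mean execution times are equal
t_stat, p_value = stats.ttest_ind(old_times, topk_times, equal_var=False)

if p_value < 0.05:
    print("statistically significant difference in mean run time")
else:
    print("no significant difference detected")
```

Since `scipy.stats.ttest_ind` returns the two-tailed p-value directly, no manual doubling of the tail area is needed here.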
@NimaSarajpoor if I recall correctly, I actually initially had a test script in my local environment that performed a t-test for unequal variances and showed the p-value, where the null hypothesis was that the mean execution times were equal. Somewhere along the way I lost that script and I was just too lazy to rewrite it 😩 (always back up your work, kids). Thanks for the suggestion, I'll find some time to do some hypothesis testing! By the way, just in case the list of reasons why your top-k contribution is incredibly helpful isn't exhaustive enough, here's another reason: your changes helped me find two crucial bugs in
I'll post performance updates to the
@alvii147 Can you provide a simple example of this? Or, if you start a
Do we need a unit test to catch this? Or are the tests that we currently have in place already showing that the top-2 is "different"?
Yea I can try to reproduce it. It was really weird, the kinda error that happens once, then doesn't happen when I run it again without changes. It was giving me
Nope, the top-k unit tests in
Currently, functions like `stump`, `stumped`, and `gpu_stump` only return the top-1 nearest neighbor. Depending on the added complexity, we may want to consider adding a top-k nearest neighbor matrix profile. This would add additional columns to the output; namely, it would add 2 additional columns for every nearest neighbor.
Also, we should use `np.searchsorted` in place of `heapq`.
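For context, the `np.searchsorted` approach to maintaining the k smallest distances might look something like this (a sketch, not STUMPY's actual implementation; the helper name and buffer layout are assumptions):

```python
import numpy as np

def insert_topk(P, I, d, i):
    """Insert distance d (with index i) into the sorted top-k buffers.

    P is a sorted array of the k smallest distances seen so far (padded
    with np.inf); I holds the corresponding subsequence indices.
    """
    if d < P[-1]:
        pos = np.searchsorted(P, d)  # where d belongs to keep P sorted
        P[pos + 1:] = P[pos:-1]      # shift larger entries right, drop last
        I[pos + 1:] = I[pos:-1]
        P[pos] = d
        I[pos] = i

k = 3
P = np.full(k, np.inf)
I = np.full(k, -1)
for i, d in enumerate([4.0, 2.0, 5.0, 1.0, 3.0]):
    insert_topk(P, I, d, i)
print(P)  # [1. 2. 3.]
print(I)  # [3 1 4]
```

Keeping the buffers sorted means each update is a binary search plus a small shift, which stays friendly to NumPy/Numba vectorization, whereas `heapq` operates on Python objects element by element.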
See: #639 and #640 as they are closely related to this issue