
Add Top-K Nearest Neighbors to Matrix Profile (normalize=True) #592

Closed
seanlaw opened this issue Apr 27, 2022 · 26 comments
Assignees
Labels
enhancement New feature or request

Comments

@seanlaw
Contributor

seanlaw commented Apr 27, 2022

Currently, functions like stump, stumped, and gpu_stump only return the top-1 nearest neighbor. Depending on the added complexity, we may want to consider adding a top-K nearest neighbor matrix profile.

This would add additional columns to the output. Namely, it would add 2 additional columns (a distance and an index) for every additional nearest neighbor.

Also, we should use np.searchsorted in place of heapq
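For illustration, here is one way the `np.searchsorted` bookkeeping could look (a minimal sketch; `insert_topk`, `P`, and `I` are hypothetical names, not STUMPY's actual internals): keep the top-k distances sorted in ascending order and use `np.searchsorted` to find each new distance's insertion point.

```python
import numpy as np

# Illustrative only: maintain sorted top-k distances P and their indices I,
# using np.searchsorted to locate the insertion point for each new distance.
def insert_topk(P, I, d, j):
    if d < P[-1]:  # only insert if d beats the current worst top-k distance
        pos = np.searchsorted(P, d)    # P is kept sorted ascending
        P = np.insert(P, pos, d)[:-1]  # insert and drop the worst entry
        I = np.insert(I, pos, j)[:-1]
    return P, I

k = 3
P = np.full(k, np.inf)
I = np.full(k, -1)
for j, d in enumerate([5.0, 2.0, 7.0, 1.0, 3.0]):
    P, I = insert_topk(P, I, d, j)
# P is now [1.0, 2.0, 3.0] and I is [3, 1, 4]
```

Compared to a heap, this keeps the top-k arrays sorted at all times, which makes merging partial results and emitting the final output trivial.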

See: #639 and #640 as they are closely related to this issue

@NimaSarajpoor
Collaborator

NimaSarajpoor commented Apr 27, 2022

@seanlaw
I am going through stump right now. I am proposing the following design:

We can create a new function `stump_topk`, and then:

```python
# in stump
def stump(..., k=1):
    if k > 1:
        return stump_topk(...)

    # stump code
```

And the output `mp` in `mp = stump_topk(...)` does not have four columns. Instead, it has `2k` columns:

```python
P_topk = mp[:, :k]
I_topk = mp[:, k:]
```

The reason is that I think returning the right and left NN may not be necessary in matrixprofile_topk, and this rerouting to another function can result in cleaner code.

What do you think? Or, do you prefer to just apply changes to the current function and worry about the design later?

@NimaSarajpoor
Collaborator

@seanlaw
I think I had better submit a PR first (in which I just modify the current module without creating a new function). Then, you can provide your comments there. I will update you.

@seanlaw
Contributor Author

seanlaw commented Apr 27, 2022

Or, do you prefer to just apply changes to the current function and worry about the design later?

I think I would prefer trying to modify the existing functions by adding the relevant and needed data structures. If you add a stump_topk function then it will cause too much code branching and much of the code will be highly redundant. This means that if we change _stump then we'll likely need to change stump_topk which is undesirable. Right now, when we change the normalized code, we also need to change the non-normalized code but this was a specific and necessary design choice that has given us more flexibility. In this case, I think stump_topk would lead to more branching and more work in the long run.

To be clear, I don't know what the best solution is so we will have to be flexible as the right solution isn't clear to me either. I have a few points:

  1. Before we do any coding, we need to decide on the output data structure. Whatever we decide should also minimize any change to the existing code logic where possible and affect users the least. Again, I don't know the right answer but I suspect that we should append the additional 2 * k - 2 columns (k-1 distances and k-1 indices) to the existing 4 columns (e.g., mp distances, mp indices, left indices, right indices) of the current matrix profile. This way, if somebody accidentally sets k>1 then the first 4 columns are still the same. The only difference would be the addition of two columns. Of course, it'll be a bit of bookkeeping to select for the top-k values from this array but maybe it's fine.
  2. After we've decided on the output, we should be smarter and actually follow "Test Driven Development", whereby we write our unit test first along with the expected output. This is often referred to as red, green, refactor. It is important that we DON'T change any other tests and we only write one new test. When we run this new test, the test should fail because the output will not match our test expectations. Then, we modify our code (not our tests) to make our test(s) pass. So, the first commit should be ONLY the new unit test. Sure, you can add a stump_topk function to naive.py if you like.
  3. Finally, as per above in 2., you modify the existing functions and we try to learn what will work and what will not work and really think through the design.
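For what it's worth, the layout described in 1. could be sketched like this (toy shapes; the `object` dtype mirrors the mixed float/int columns, but all names here are illustrative, not the final API):

```python
import numpy as np

n, k = 10, 3  # toy sizes: n subsequences, top-k nearest neighbors
# First 4 columns stay as before: P (top-1), I (top-1), left I, right I.
# Then 2 * (k - 1) extra columns: top-2..top-k distances and indices.
mp = np.empty((n, 4 + 2 * (k - 1)), dtype=object)

P = mp[:, 0]                     # top-1 matrix profile (unchanged)
I = mp[:, 1]                     # top-1 indices (unchanged)
I_left, I_right = mp[:, 2], mp[:, 3]
P_rest = mp[:, 4 : 4 + (k - 1)]  # top-2..top-k distances
I_rest = mp[:, 4 + (k - 1) :]    # top-2..top-k indices
```

With this layout, a user who sets `k > 1` by accident still sees the familiar first 4 columns.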

Any thoughts/questions? Criticism/feedback?

@NimaSarajpoor
Collaborator

I would prefer trying to modify the existing functions by adding the relevant and needed data structures. If you add a stump_topk function then it will cause too much code branching and much of the code will be highly redundant. [...] To be clear, I don't know what the best solution is so we will have to be flexible as the right solution isn't clear to me either.

Right! In fact, I was thinking of creating new private functions as well. That may become a disaster :) I am flexible. So, I will modify the stump function as it seems to be a more rational solution at this moment.

  1. This way, if somebody accidentally sets k>1 then the first 4 columns are still the same. The only difference would be the addition of two columns.

I overlooked this point! I will do as suggested and we can then see if it sounds alright.

  2. This is often referred to as red, green, refactor. It is important that we DON'T change any other tests and we only write one new test. When we run this new test, the test should fail because the output will not match our test expectations.

Thanks for the link! Very interesting!! I think I got it.

Just to confirm:
stump calls the private function _stump, which calls another private function, _compute_diagonal. And, if I understand correctly, the unit test is only written for stump. So, I assume the new unit test should be written for stump only (and not for those two private functions). Correct?

@seanlaw
Contributor Author

seanlaw commented Apr 27, 2022

stump calls the private function _stump, which calls another private function, _compute_diagonal. And, if I understand correctly, the unit test is only written for stump. So, I assume the new unit test should be written for stump only (and not for those two private functions). Correct?

For now, let's go with "yes" :) What might fail is 100% code coverage and then we'll have to make sure we handle that. I don't want to get too far ahead of ourselves and over-analyze everything. Given the possible complexities and newness, let's just "try something and see what happens". Maybe we'll break something and we need to either fix it or abandon this altogether. Either way, we learn something new. Sounds good?

At the end of the day, I believe that only ONE user has asked about VALMOD and maybe only 2 more have asked about top-k, which isn't a lot. If it's not a big problem to add top-k then cool because, computationally, it shouldn't be a huge additional expense! However, if it really makes our code unmaintainable then I don't know how much sense it would make to keep it given that there's not a ton of demand for it. When I develop, I want to make sure that I am focused on adding the things that are the most useful to the most users. Oftentimes, it is adding documentation and tutorials that has the largest impact and, rarely, is it adding a new feature. It doesn't make much sense to solve a problem for a party of one when we have limited time and resources. We need to be pragmatic and perfect the things that are actually worth perfecting and that affect the largest number of people.

@JaKasb

JaKasb commented Apr 27, 2022

k-NN for each distance-matrix row ?
k-NN for each distance-matrix column ?
k-NN for each motif ?

The most useful method is k-NN for each motif.
It is also the fastest method.
Furthermore it is also easy to implement.
But it won't compute a true k-NN neighborhood search, because there is no neighborhood information for time-points which were not classified as motifs.

  1. compute regular matrix profile
  2. perform peak detection on negative distance profile
    (scipy signal find_peaks, enforce minimum temporal distance between peaks -> exclusion-zone width)
    https://docs.scipy.org/doc/scipy-1.8.0/html-scipyorg/reference/generated/scipy.signal.find_peaks.html
  3. perform MASS(time_series[x:x+w], time_series) for each x in peak_indices
  4. for each MASS output: np.argsort -> computes the location and similarity-ordering of neighbors.
  5. for each np.argsort(mass_result) : select the first k-entries
  6. Despair, because now you have to deal with trivial-matches of order k instead of order 1.
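The steps above could be sketched roughly as follows, substituting a naive z-normalized distance profile for MASS and skipping step 6's trivial-match handling (all names and parameters are illustrative):

```python
import numpy as np
from scipy.signal import find_peaks

def dist_profile(Q, T):
    # Naive z-normalized Euclidean distance profile (stand-in for MASS)
    m = len(Q)
    Qz = (Q - Q.mean()) / Q.std()
    D = np.empty(len(T) - m + 1)
    for i in range(len(D)):
        S = T[i : i + m]
        D[i] = np.linalg.norm(Qz - (S - S.mean()) / S.std())
    return D

rng = np.random.default_rng(0)
T = rng.random(200)
m, k = 10, 3
excl = m // 2  # exclusion-zone half width

# Step 1: regular matrix profile (naive, brute force)
n = len(T) - m + 1
P = np.empty(n)
for i in range(n):
    D = dist_profile(T[i : i + m], T)
    D[max(0, i - excl) : i + excl + 1] = np.inf  # ignore trivial matches
    P[i] = D.min()

# Step 2: peak detection on the negative matrix profile,
# enforcing a minimum temporal distance between peaks
peaks, _ = find_peaks(-P, distance=excl)

# Steps 3-5: distance profile per peak, argsort, keep the first k entries
topk = {x: np.argsort(dist_profile(T[x : x + m], T))[:k] for x in peaks}
```

Note that each peak's own index shows up as its top-1 match (distance zero), which is exactly the order-k trivial-match problem that step 6 warns about.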

Alternative Algorithm, entirely based on Black Magic:
https://github.com/xiaoshengli/LL

@seanlaw
Contributor Author

seanlaw commented Apr 27, 2022

@JaKasb We are talking about distance matrix rows in this case. Given that all pairwise subsequence distances are computed, I suspect that this is more about whether there is a clean way to handle the bookkeeping that also doesn't reduce the maintainability of the code base. Any pointers or suggestions are greatly appreciated!

I have not come across this linear time paper. Thanks for sharing it!

@NimaSarajpoor
Collaborator

NimaSarajpoor commented Apr 27, 2022

(@seanlaw You beat me to it!)

@JaKasb
Thanks for sharing your knowledge and the paper!

k-NN for each distance-matrix row ? k-NN for each distance-matrix column ?

Are these two different? I mean...the distance matrix is symmetric. If there is a difference, could you please explain what that is?

The most useful method is k-NN for each motif.

I can see your point! As you noticed, the top-k smallest values in each distance profile may not be useful for getting the k best matches for each subsequence, because they do not exclude trivial matches. Therefore, after removing trivial matches, we may end up with fewer than k. The goal of these top-k smallest values in each distance profile is that we want to use them later in the VALMOD algorithm (see PR #586). And, I think the VALMOD algorithm does not work properly if we do not return the top-k smallest values of each distance profile (NOT the top-k smallest with a minimum temporal distance).


I am trying to understand what you suggested:

But it won't compute a true k-NN neighborhood search, because there is no neighborhood information for time-points which were not classified as motifs.

Could you please elaborate/clarify this sentence?

  2. perform peak detection on negative distance profile

Did you mean "matrix profile" here?

  5. for each np.argsort(mass_result) : select the first k-entries
  6. Despair, because now you have to deal with trivial-matches of order k instead of order 1.

So, I assume you want to get the top-k best matches (considering a minimum temporal distance between these top-k matches) for each x in peak_indices. Is that correct?

And then, at the end, we have M motifs (i.e., the number of peaks detected from the negative of the matrix profile, the M smallest values after excluding trivial matches) and, for each, we find its top-k matches. Correct? And then there should be a way to sort these motifs considering their k-NN (maybe the average distance to their k neighbors?).

@NimaSarajpoor
Collaborator

@seanlaw

Sounds good?

Yes! That makes sense!

We need to be pragmatic and perfect the things that are actually worth perfecting and that affect the largest number of people.

Yep! Let's go with an MVP and see how things go!

@NimaSarajpoor
Collaborator

I am trying to write test_stump_topk.py which is basically the same as test_stump.py. I just need to replace naive.stump with naive.stump_topk (a new function that I added to naive.py).

However, I can also see that test_stump.py calls the function naive.stamp in some of its tests, and that means I need to create naive.stamp_topk as well. Is that correct?


SIDE NOTE:
naive.stamp calls the naive.mass function, which is completely different from core.mass.
Why not use the name mass_PI, the name that is used in stumpy.stamp?

@NimaSarajpoor
Collaborator

@seanlaw
I decided to submit a PR so you can see if I got your point correctly or not. For now, I skipped the use of naive.stamp in the tests.

@seanlaw
Contributor Author

seanlaw commented Apr 28, 2022

I am trying to write test_stump_topk.py which is basically the same as test_stump.py. I just need to replace naive.stump with naive.stump_topk (a new function that I added to naive.py).

Actually, I don't think you need naive.stump_topk. I think you should modify naive.stump so that it can accept an additional parameter k and then update all of the functions that naive.stump depends on.

However, I can also see that test_stump.py calls the function naive.stamp in some of its tests, and that means I need to create naive.stamp_topk as well. Is that correct?

At the end of the day, we'll call stump.stump with k > 1 and so we'll want a naive (i.e., slow) way to return back the correct result.

Why not using the name mass_PI, the name that is used in stumpy.stamp?

Yes, I don't recall "why" (maybe due to laziness or haste) but, indeed, we should have called it mass_PI instead. In fact, we should've split the mass part out into naive.mass and then created naive.mass_PI that calls naive.mass. Would you mind doing this and then updating any tests that use naive.mass?
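A rough sketch of what that split could look like (hypothetical signatures; the real naive.py helpers differ in their details):

```python
import numpy as np

def mass(Q, T):
    # The distance-profile part: brute-force z-normalized Euclidean distances
    m = len(Q)
    Qz = (Q - np.mean(Q)) / np.std(Q)
    D = np.empty(len(T) - m + 1)
    for i in range(len(D)):
        S = T[i : i + m]
        D[i] = np.linalg.norm(Qz - (S - np.mean(S)) / np.std(S))
    return D

def mass_PI(Q, T, trivial_idx=None, excl_zone=0):
    # The "PI" part: matrix profile value and index from the distance profile
    D = mass(Q, T)
    if trivial_idx is not None:
        D[max(0, trivial_idx - excl_zone) : trivial_idx + excl_zone + 1] = np.inf
    idx = np.argmin(D)
    return D[idx], idx
```

This way the brute-force distance computation lives in one place, and mass_PI only adds the profile/index bookkeeping on top of it.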

@NimaSarajpoor
Collaborator

Thanks for the clarification. I will continue the discussion on PR #595.

Would you mind doing this and then updating any tests that use naive.mass?

Sure. I can create a new issue regarding this matter. (If you think it is better to keep it here, please let me know)

@seanlaw
Contributor Author

seanlaw commented Apr 28, 2022

Sure. I can create a new issue regarding this matter. (If you think it is better to keep it here, please let me know)

Yes, a separate issue would be great. Thank you!

@seanlaw seanlaw changed the title Add Top-K Nearest Neighbors to Matrix Profile Add Top-K Nearest Neighbors to Matrix Profile (normalize=True) Jul 7, 2022
@KennyTC

KennyTC commented Jul 14, 2022

Excuse me, but I want to confirm that

  1. "top K Nearest Neighbors" mentioned here means "top K motifs"?
  2. currently, Stumpy can output only 2 motif points which are identical values?

Thank you very much.

@seanlaw
Contributor Author

seanlaw commented Jul 14, 2022

@KennyTC

"top K Nearest Neighbors" mentioned here means "top K motifs"?

No, top-k nearest neighbors does not mean "top-k motifs". They are two different things. Currently, in STUMPY, each matrix profile only returns the top-1 nearest neighbor for each subsequence. This issue deals with allowing STUMPY to return the top-k nearest neighbors rather than only the top-1.

currently, Stumpy can output only 2 motif points which are identical values?

No, STUMPY can output more than just the top motif. Please try the stumpy.motifs function and feel free to post general usage questions in our Discussions section.

@alvii147
Contributor

alvii147 commented Nov 10, 2022

Hey guys, sorry I'm late to this, @NimaSarajpoor sick work with the top k stuff man, just tried it out locally, and it looks sweet! Love the way you've laid the architecture out, you should be proud of what you've done, the top k feature had been requested for a long time! Great job! 🎉

I also had a few questions about your changes I was hoping you could help me answer:

  • Is there any reason why k is being passed to _compute_diagonal? I don't see it being used anywhere.
  • Should we also update the docstrings here and here, and the comments in here to use g instead of k? k in this context is now completely different.
  • Now that we have k, as well as other new parameters like rho_L, rho_R, I_L, and I_R, it might be worthwhile updating the Numba function signatures here and here.
  • Not only is this an incredible feature you've implemented, I'm noticing that stump has gotten much faster for smaller time series! Any idea where this optimization is coming from? What made it faster?

@seanlaw
Contributor Author

seanlaw commented Nov 10, 2022

Is there any reason why k is being passed to _compute_diagonal? I don't see it being used anywhere.

Good point! I think it can be removed.

Should we also update the docstrings here and here, and the comments in here to use g instead of k? k in this context is now completely different.

Yes, we should update the docstrings.

Now that we have k, as well as other new parameters like rho_L, rho_R, I_L, and I_R, it might be worthwhile updating the Numba function signatures here and here.

Indeed. And we shouldn't forget about their equivalent non-normalized versions too.

@alvii147 If you have time, would you mind submitting a PR for all of these findings? I'd appreciate the help!

@NimaSarajpoor
Collaborator

NimaSarajpoor commented Nov 10, 2022

So, I think I probably forgot to remove it. Previously, we had an array $\rho$ that contained the top-k, left, and right values. And, I used k for indexing (:k, k, k+1) to get the top-k, left, and right profiles from $\rho$. Now, by splitting the arrays, we do not need k.
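In other words, the old combined layout was something like this (shapes and names are illustrative only):

```python
import numpy as np

n, k = 6, 3
# Old: one array whose columns [:k] are the top-k values,
# column [k] is the left value, and column [k + 1] is the right value
rho = np.random.rand(n, k + 2)

rho_topk = rho[:, :k].copy()
rho_L = rho[:, k].copy()
rho_R = rho[:, k + 1].copy()
# New: rho_topk, rho_L, rho_R are separate arrays,
# so no k-based indexing is needed anywhere downstream
```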

  • Should we also update the docstrings here and here, and the comments in here to use g instead of k? k in this context is now completely different.

Right! I think you and @seanlaw have some kind of eagle eyes :)

  • Now that we have k, as well as other new parameters like rho_L, rho_R, I_L, and I_R, it might be worthwhile updating the Numba function signatures here and here.

Yes... I should have paid more attention to the comments.

I'm noticing that stump has gotten much faster for smaller time series!

This is not what I observed on my system. I mean, my results show that there is very little overhead in the top-k version. Have you tested it over 20 iterations? It might be nice to run it 20 times and then plot it. Also, I would like to bring your attention to two things here:
(1) When T is short, the computing time of the merging process (i.e. merging results collected from different threads) is comparable to the computation of the matrix profile. Hence, you may want to see if the merging is faster or not. Currently, in the top-k version, we do not use parallelization (we do not use prange).

(2) We also use three arrays rho, rho_L, and rho_R instead of one single array. Again, on my system, I realized that splitting the arrays can improve performance for long T (try a size that you think is heavy for your PC. For instance, I tried 100k and noticed this. You may want to try 200k or 300k if you have a stronger PC). And, for shorter time series, top-k might be a little bit slower.


@alvii147 If you have time, would you mind submitting a PR for all of these findings? I'd appreciate the help!

That would be awesome! :)

@alvii147
Contributor

alvii147 commented Nov 11, 2022

Right! I think you and @seanlaw have some kind of eagle eyes :)

If only I spent more time scrutinizing my own code 😫

This is not what I observed on my system. I mean, my results show that there is very little overhead in the top-k version. Have you tested it over 20 iterations? It might be nice to run it 20 times and then plot it.

Honestly it's so weird: somehow, while stump has gotten faster, the stump_tiles algorithm has gotten super slow. Now the time series needs to be at least 250k for stump_tiles to perform better than stump, whereas before, stump_tiles used to perform better starting around 100k. Also, before we were seeing a 40% improvement from stump_tiles at 250k; now we're seeing only about a 20% improvement at 300k. I have no idea why stump_tiles became slower after implementing top-k; it might be worth investigating. Let me know if you have any insights on what's happening, because honestly I have no idea, it's so strange man. It may also be platform dependent 🤷🏽.

@alvii147
Contributor

@alvii147 If you have time, would you mind submitting a PR for all of these findings? I'd appreciate the help!

No worries, done! #712 #713

@NimaSarajpoor
Collaborator

NimaSarajpoor commented Nov 11, 2022

@alvii147

Honestly it's so weird: somehow, while stump has gotten faster, the stump_tiles algorithm has gotten super slow. Now the time series needs to be at least 250k for stump_tiles to perform better than stump, whereas before, stump_tiles used to perform better starting around 100k. Also, before we were seeing a 40% improvement from stump_tiles at 250k; now we're seeing only about a 20% improvement at 300k. I have no idea why stump_tiles became slower after implementing top-k; it might be worth investigating. Let me know if you have any insights on what's happening, because honestly I have no idea, it's so strange man. It may also be platform dependent 🤷🏽.

So, according to my experience, it seems that the difference in performance can be platform dependent to some extent. I was trying to check the performance of top-k on the two PCs that I have access to, and I got different results. Also, @seanlaw tried it on his own PC and reached a different conclusion.

To investigate, you may want to consider the following items:

(1) You can replace the arrays rho/rho_L/rho_R with a single array rho (and do the same for the indices I), and then check the performance.

(2) You may want to leave core._merge_topk as is (i.e. no parallelization). This is because its computing time should be negligible compared to the computation of the matrix profile when T is large. (And, I think this should not affect stump_tiles.)

(3) In some cases, I got a large standard deviation in computing time! For instance, there was a case where I ran a function N=20 times and the mean was 80 with a std of 20 (I am not sure about the exact values but I do remember the std was large). So, to make sure that there IS a difference, you MIGHT want to perform hypothesis testing, like this:

Run stump_tiles and stump_tiles_with_topk, each for N=20 runs. Then, calculate the mean and std of each case, and calculate the p-value that corresponds to the t-value (see below), using the appropriate degrees of freedom (for unequal variances, the Welch–Satterthwaite approximation):

$t = \frac{\mu_{1} - \mu_{2}}{\sqrt{\sigma_{1}^{2}/n_{1} + \sigma_{2}^{2}/n_{2}}}$

Also, note that this is a two-tailed p-value. Hence, you should double the one-tail area under the t-distribution and see if it is less than 5%. If it is less than 5%, you can reject the null hypothesis, which means there is a statistically significant difference.

(I was lazy and didn't do this!)
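For what it's worth, scipy.stats.ttest_ind with equal_var=False computes exactly this Welch t statistic and its two-tailed p-value (the timing numbers below are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
t1 = rng.normal(80, 20, size=20)  # e.g., stump_tiles timings (made-up numbers)
t2 = rng.normal(85, 20, size=20)  # e.g., stump_tiles_with_topk timings

# Welch's two-sample t-test: same t statistic as the formula above,
# with a two-tailed p-value
t_stat, p_value = stats.ttest_ind(t1, t2, equal_var=False)

significant = p_value < 0.05  # if True, reject the null hypothesis of equal means
```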

@NimaSarajpoor
Collaborator

NimaSarajpoor commented Nov 11, 2022

Btw, regarding the hypothesis testing, the assumption is that the computing time samples are normally distributed.

@alvii147
Contributor

alvii147 commented Nov 11, 2022

@NimaSarajpoor if I recall correctly, I actually initially had a test script in my local environment that performed a t-test for unequal variances and showed the p-value, where the null hypothesis was that the mean execution times were equal. Somewhere along the way I lost that script and I was just too lazy to rewrite it 😩 (always back up your work, kids). Thanks for the suggestion, I'll find some time to do some hypothesis testing!

By the way, just in case the list of reasons why your top-k contribution is incredibly helpful isn't exhaustive enough, here's another reason: your changes helped me find two crucial bugs in stump_tiles that I hadn't noticed before! One has to do with how numba sometimes erroneously converts float to int (maybe I'll start a numba forum discussion on that), and another has to do with stump_tiles accidentally computing some distances twice. Computing the distance twice isn't an issue when k = 1, but when k > 1, computing a very low distance twice makes it show up twice, once as a top-1 distance, and then again as the top-2 distance. Anyways, the point is, solid work man, you're helping out in more ways than you know!

I'll post performance updates to the stump_tiles PR soon, I think I'm crowding this closed issue unnecessarily 😓

@seanlaw
Contributor Author

seanlaw commented Nov 11, 2022

One has to do with how numba sometimes erroneously converts float to int (maybe I'll start a numba forum discussion on that),

@alvii147 Can you provide a simple example of this? Or, if you start a numba discourse discussion, can you please link it back here?

Computing the distance twice isn't an issue when k = 1, but when k > 1, computing a very low distance twice makes it show up twice, once as a top-1 distance, and then again as the top-2 distance.

Do we need a unit test to catch this? Or are the tests that we currently have in place already showing that the top-2 is "different"?

@alvii147
Contributor

@alvii147 Can you provide a simple example of this? Or, if you start a numba discourse discussion, can you please link it back here?

Yea I can try to reproduce it. It was really weird, the kinda error that happens once, then doesn't happen when I run it again without changes. It was giving me int(1.0)=0 in some specific cases. I'll try to look more into this.

Do we need a unit test to catch this? Or are the tests that we currently have in place already showing that the top-2 is "different"?

Nope, the top-k unit tests in tests/test_stump.py are the test cases that helped me catch this. If it happens again, the unit tests should catch it.

No branches or pull requests

5 participants