Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jaccard distance generalization #24

Closed
chunjiangzhu opened this issue Feb 27, 2019 · 9 comments
Closed

Jaccard distance generalization #24

chunjiangzhu opened this issue Feb 27, 2019 · 9 comments

Comments

@chunjiangzhu
Copy link

Hi, thanks again for the prompt responses to the previous question #23 .

We are trying to generalize your code to Jaccard distance and test it under the ann benchmarking. We assume that the input is the same as the input of hamming distance, but the distance function is changed to Jaccard. E.g., for two bit vectors "A=10111" and "B=10011", their hamming distance is 1 but jaccard distance is 1-popcount(A&B)|/popcount(A|B)=1-3/4=0.25.

What we did are that for every code including hamming, generating corresponding code for jaccard. For example in our repository,
https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/PrimitiveComparator.h#L287-L303

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/ObjectSpaceRepository.h#L97-L114

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/Index.h#L113

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/Command.cpp#L155-L157

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/lib/NGT/ObjectSpace.h#L162

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/python/src/ngtpy.cpp#L67-L68

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/python/ngt/base.py#L131

https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/python/ngt/base.py#L254-L256

Our input data is an N*1024 numpy array of data type int (or bool). We have tested the code using parameters epsilon=[0.0,0.1], edge_size=[100, 200, 300, 500, 1000], outdegree=[10, 30, 50, 70, 100], indegree=[10, 30, 50, 70, 120], query epsilon=[0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0] and object_type=Byte. But the resulting recall are consistently lower than 20%. We believe that there are something wrong. We noticed the 16 boundary you mentioned in #21 . Since our data dimension is always a power of 16, e.g. 1024, the error should not be here.

Could you please give suggestions on the generalization? We hope that a successful generalization may be helpful to support more distance metrics, before you make a custom distance function. If you need more information, e.g. a dataset, please let me know. Thank you so much!

@masajiro
Copy link
Member

masajiro commented Mar 1, 2019

Could you replace the search() function below with linearSearch() which does not use the index to search.
https://github.com/chunjiangzhu/ngt/blob/22a99c1eeb13590bae707afbe006247e80b32f5e/python/src/ngtpy.cpp#L149
Then, if the ann benchmarks' recalls always show 1.0, the implementation of the distance must be almost correct. However, I am not sure that the distance satisfies conditions of metric space especially triangle inequality. If not, you cannot get high recalls with the distance, because NGT and many of the other methods are based on metric space.

@chunjiangzhu
Copy link
Author

Thanks so much for the reply!
I have applied the change from search() to linearSearch() as in https://github.com/chunjiangzhu/ngt/blob/44fdeaaa794b9d2dc62ad69e60af898ae2d0b4ef/python/src/ngtpy.cpp#L149

But got the following error:

File "run_algorithm.py", line 3, in
run_from_cmdline()
File "/home/app/ann_benchmarks/runner.py", line 190, in run_from_cmdline
run(definition, args.dataset, args.count, args.runs, args.batch)
File "/home/app/ann_benchmarks/runner.py", line 134, in run
distance, count, run_count, batch)
File "/home/app/ann_benchmarks/runner.py", line 71, in run_individual_query
results = [single_query(x) for x in X_test]
File "/home/app/ann_benchmarks/runner.py", line 71, in
results = [single_query(x) for x in X_test]
File "/home/app/ann_benchmarks/runner.py", line 37, in single_query
candidates = algo.query(v, count)
File "/home/app/ann_benchmarks/algorithms/onng_ngt.py", line 90, in query
results = self.index.search(v, n, self._epsilon, self._edge_size_for_search, with_distance=False)
RuntimeError: /usr/local/include/NGT/Common.h:1815: Inner error: results is not set

I understand your concern, but Jaccard distance is a metric. Its triangle inequality property was proved in https://www.sciencedirect.com/science/article/pii/S0167865518309188 :)

@masajiro
Copy link
Member

masajiro commented Mar 5, 2019

Thank you for your information.
I completely forgot that the specification of linearSearch() is different from that of search(). Could you update ngtpy.cpp as below. I have tested this source code.

--- a/python/src/ngtpy.cpp
+++ b/python/src/ngtpy.cpp
@@ -143,30 +143,21 @@ public:
     sc.setSize(size);                          // the number of resultant objects.
     sc.setEpsilon(epsilon);                    // set exploration coefficient.
     sc.setEdgeSize(edgeSize);                  // if maxEdge is minus, the specified value in advance is used.
+    NGT::ObjectDistances objects;
+    sc.setResults(&objects);
 
-    NGT::Index::search(sc);
+    NGT::Index::linearSearch(sc);
 
     numOfDistanceComputations += sc.distanceComputationCount;
 
     NGT::Index::deleteObject(ngtquery);
     if (!withDistance) {
-      NGT::ResultPriorityQueue &r = sc.getWorkingResult();
-      py::array_t<int> ids(r.size());
+      py::array_t<int> ids(objects.size());
       py::buffer_info idsinfo = ids.request();
-      int *endptr = reinterpret_cast<int*>(idsinfo.ptr); 
-      int *ptr = endptr + (r.size() - 1);
-      if (zeroNumbering) {
-        while (ptr >= endptr) {
-         *ptr-- = r.top().id - 1;
-         r.pop();
-        }
-      } else {
-        while (ptr >= endptr) {
-         *ptr-- = r.top().id;
-         r.pop();
-        }
+      int *ptr = reinterpret_cast<int*>(idsinfo.ptr); 
+      for (size_t oidx = 0; oidx < objects.size(); ++oidx) {
+        ptr[oidx] = objects[oidx].id - 1;
       }
-
       return ids;
     }
     py::list results;

@chunjiangzhu
Copy link
Author

Yes, the distances using linearScan() are all correct. Thank you for the efforts!

486: ONNG-NGT(500, 30, 10, -2, 1.000) 1.000 334.322
487: ONNG-NGT(500, 30, 10, -2, 0.800) 1.000 343.043
488: ONNG-NGT(500, 30, 10, -2, 0.100) 1.000 331.162
489: ONNG-NGT(500, 30, 10, -2, 0.600) 1.000 339.045
490: ONNG-NGT(500, 30, 10, -2, 1.800) 1.000 344.334
491: ONNG-NGT(500, 30, 10, -2, 2.000) 1.000 337.646
492: ONNG-NGT(500, 30, 10, -2, 0.400) 1.000 335.073
493: ONNG-NGT(500, 30, 10, -2, 0.200) 1.000 336.272
494: ONNG-NGT(500, 30, 10, -2, 1.200) 1.000 348.053
495: ONNG-NGT(500, 30, 10, -2, 1.400) 1.000 333.685
496: ONNG-NGT(500, 30, 10, -2, 1.600) 1.000 333.825

But when I turned it back to scan(), the recall dropped down again... :(

486: ONNG-NGT(300, 30, 30, -2, 1.000) 0.027 7200.361
487: ONNG-NGT(300, 30, 30, -2, 1.200) 0.028 6220.843
488: ONNG-NGT(300, 30, 30, -2, 0.100) 0.026 43036.158

@masajiro
Copy link
Member

masajiro commented Mar 5, 2019

I assume that the dimensionality of your dataset is 1024 * 8. It means that your data space is so sparse that NGT might not work well for the space. Anyway, you have to increase the epsilon for NGT construction. For example, when you use 1.8 as the epsilon for search, you might want to use 0.8(=1.8-1.0) for construction instead of [0.0, 0.1]. The epsilon only for search is added 1.0, because the ann benchmarks cannot accept minus value.
Inverted index-based methods might be better for such sparse dataset.

@chunjiangzhu
Copy link
Author

chunjiangzhu commented Mar 9, 2019

Thank you for the reply and suggestion!
Well, in my opinion, the problem should not lie in the data sparsity. The first reason is that the SIFT data has dimension only 128. Either 128*32 or 128*64 are not greater than 1024*8 and ONNG performs not bad for it. The second reason is that for our own datasets, when I changed the distance metric to Euclidean, the recall performance looks "normal" ranging from 20%-95%. I believe that there are some bugs in my generalizations. Will let you know once I find any problems.
Have a nice weekend.

@chunjiangzhu
Copy link
Author

I have found bugs in Graph.h and Graph.cpp in below, where I should put corresponding codes for Jaccard. It seems that the comparator has not been invoked before. Now it works good.
It was a very nice talk with you. Thanks so much for the helpful suggestions!

https://github.com/chunjiangzhu/ngt/blob/846e55f72e7ce26472bbad11b7d04e92b8645a09/lib/NGT/Graph.h#L279

https://github.com/chunjiangzhu/ngt/blob/846e55f72e7ce26472bbad11b7d04e92b8645a09/lib/NGT/Graph.h#L293

@masajiro
Copy link
Member

The number of dimensions of your data with your jaccard distance is 1024 * 8, because the jaccard distance is based on 1 bit for each dimension. Although the sift's data length is 128 * 32, the number of dimensions of the sift with euclidean distance is just 128, because the euclidean distance is based on 1 single float variable (32 bits) for each dimension. Therefore your data with your jaccard distance is supposed to be sparser than the sift with the euclidean distance.

@masajiro
Copy link
Member

@chunjiangzhu
I added jaccard distance. Your code is quite helpful to implement it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants