Objective: Given an image, find the nearest neighbours of it from a database.
Approach: Using LSH (Local Sensitive Hashing) to find approximate near neighbours.
Dataset of Images: Patches.csv (Each row in this dataset is a 20 × 20 image patch represented as a 400-dimensional vector.)
Theory behind the approach:
Assume we have a dataset A of n points in a metric space with distance metric d(·, ·). Let c be a constant greater than 1. Then, the (c, λ)-Approximate Near Neighbor (ANN) problem is defined as follows: Given a query point z, assuming that there is a point x in the dataset with d(x,z) ≤ λ, return a point x′ from the dataset with d(x′,z) ≤ cλ (this point is called a (c,λ)-ANN). The parameter c therefore represents the maximum approximation factor allowed and is a user-defined parameter.
Let us consider an LSH family H of hash functions that is (λ, cλ, p1, p2)-sensitive for the distance measure d(·,·). Let G = Hk = {g = (h1,...,hk)|hi ∈ H, ∀ 1 ≤ i ≤ k}, where k = log1/p2 (n).
Let us consider the following procedure:
- Select L = nρ random members g1,...,gL of G, where ρ = log(1/p1) / log(1/p2)
- Hash all the data points as well as the query point using all gi (1 ≤ i ≤ L).
- Retrieve at most 3L data points (chosen uniformly at random) from the set of L buckets to which the query point hashes.
- Among the points selected in Step 3 (above), report the one that is the closest to the query point as a (c, λ)-ANN.
The code compares performance of LSH with that of linear search.