Principal Investigator Piotr Indyk
The goal of this project is to design efficient algorithms for high-dimensional geometric problems. A prototypical problem in this area is the nearest neighbor problem: given a database of points in a d-dimensional space, devise a data structure which, given any query point, quickly finds its nearest neighbor in the database. Other problem include: finding the closest pair of points, computing a tree spanning all the points with minimum cost (MST), various clustering problems, etc.
The nearest neighbor problem is of key interest in many areas. For example, in order to find a pair of similar images in a large data set, the researchers typically represent each image by a feature vector, and then find the closest pair of vectors. Since the number of features is typically very large, this approach requires solving high-dimensional geometric problems. In a similar fashion, high-dimensional geometric problems naturally occur in text and visual databases, machine learning, computational statistics, vector quantization, etc.
Unfortunately, the known algorithms for this problem suffer from "the curse of dimensionality": they either require storage that is exponential in the dimension, or have query time that quickly becomes linear as the dimension grows.It is widely conjectured that this phenomenon is inevitable, as long as one insists on exact answers.
The approach to this problem is to design algorithms that are allowed to report approximate answers. The approximation is defined with respect to the distance: a c-nearest neighbor is a point whose distance to the query point is at most c times the distance from the query point to its nearest neighbor. In our earlier work we show that it is possible to design a c-approximate nearest neighbor data structure (based on Locality-Sensitive Hashing) with query time (roughly) O(dn^(1/c) and space O(dn^(1+1/c)). The algorithm can be modified to report all approximate nearest neighbors. In this way, the algorithm can be used to find the exact nearest neighbor as well, with the query time depending on the number of approximate nearest neighbors (which is often small). The algorithm, and its variants, has found a substantial number of applications, e.g., to efficient comparison of genomic sequences and motif finding, example-based pose estimation, image and shape matching with distributions of local features, etc.
The Locality-Sensitive Hashing algorithm works only for data sets in which the distance is measured using L1 (Manhattan) or L2 (Euclidean) norm. In many applications the distance functions are more complex, and include Hausdorff distance, Earth-Mover Distance and Edit Distance. Extending the nearest neighbor algorithm to such metrics can be, in principle, achieved by embedding those metrics into L1 or L2 norms.
This leads to the following two questions:
(1) Is it possible to design more efficient algorithms for the approximate near neighbor ?
(2) How to to extend those algorithms to more complex metrics ?
We proposed an improved version of the Locality-Sensitive Hashing algorithm, together with an efficient implementation. Preliminary experiments show that on some benchmark data sets, such as MNIST handwritten digit database, the new algorithm offers an order of magnitude speed-up of the running time over known algorithms, even for the exact nearest neighbor problem. From the theoretical perspective, we were able to show that the exponent in the query time of the new algorithm is (slightly) smaller than 1/c (for the L2 norm). The new algorithm works directly in the L2 norm, and thus is easy to implement for this case.
We presented a method for embedding Earth-Mover Distance into the L1 norm. The theoretical bound on the distance distortion caused by the embedding is logarithmic. Nevertheless, we show that for natural data sets, the actual distortion in practice is much lower, and so the embedding could be still quite useful. The method has been used since then for fast countour matching and image and shape matching with distributions of local features.
For the approximate nearest neighbor under the edit distance (for strings of length up to d), we designed a c-approximation algorithm, with constant c, O(d) query time and "sub-exponential" space, i.e., exponential only in d^a for arbitrarily small a>0. Although not (yet) practical, the algorithm provides hope that a better algorithm (with polynomial space) can be found in the future. Interestingly, the algorithm does not use embeddings and instead relies on a product-metric approach. A recent paper shows that the edit distance cannot be embedded into L1 with constant distortion. This indicates that the product metric method provably outperforms the embedding method for this problem.
Finally, we investigated the complexity of the exact closest pair problem in very high dimensions, that is when the dimension is as large as the number of points. We show that, for a few Lp norms, the problems is reducible to matrix multiplication.