Quality And Efficiency In High Dimensional Nearest Neighbor Search
Abstract
Nearest neighbor (NN) search in high dimensional space is an important problem in many applications. Ideally, a practical solution should (i) be implementable in a relational database, and (ii) have a query cost that grows <i>sub-linearly</i> with the dataset size, regardless of the data and query distributions. Despite the bulk of NN literature, no solution fulfills both requirements except <i>locality sensitive hashing</i> (LSH). Existing LSH implementations are either rigorous or adhoc. <i>Rigorous-LSH</i> ensures good quality of query results, but incurs high space and query cost. <i>Adhoc-LSH</i> is more efficient, but it abandons quality control: the neighbor it outputs can be <i>arbitrarily</i> bad. As a result, no current method ensures both quality and efficiency in practice. Motivated by this, we propose a new access method called the <i>locality sensitive B-tree</i> (LSB-tree) that enables fast high-dimensional NN search with excellent quality. Combining several LSB-trees yields a structure called the <i>LSB-forest</i>, which ensures the same result quality as <i>rigorous-LSH</i> but dramatically reduces its space and query cost. The LSB-forest also outperforms <i>adhoc-LSH</i>, even though the latter has no quality guarantee. Besides its appealing theoretical properties, the LSB-tree itself serves as an effective index that consumes linear space and supports efficient updates. Our extensive experiments confirm that the LSB-tree is faster than (i) the state of the art of exact NN search by <i>two orders of magnitude</i>, and (ii) the best (linear-space) method of approximate retrieval by <i>an order of magnitude</i>, while at the same time returning neighbors of much better quality.