DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation
Abstract
DBSCAN is a popular method for clustering multi-dimensional objects. Just as notable as the method's vast success is the research community's quest for its efficient computation. The original KDD'96 paper claimed an algorithm with <i>O</i>(<i>n</i> log <i>n</i>) running time, where <i>n</i> is the number of objects. Unfortunately, this is a mis-claim: that algorithm actually requires <i>O</i>(<i>n</i><sup>2</sup>) time. There has been a fix in 2D space, where a genuine <i>O</i>(<i>n</i> log <i>n</i>)-time algorithm has been found. Finding a fix for dimensionality <i>d</i> ≥ 3 is currently an important open problem. In this paper, we prove that for <i>d</i> ≥ 3, the DBSCAN problem requires Ω(<i>n</i><sup>4/3</sup>) time to solve, unless very significant breakthroughs (ones widely believed to be impossible) could be made in theoretical computer science. This (i) explains why the community's search for a fix to the aforementioned mis-claim has been futile for <i>d</i> ≥ 3, and (ii) indicates (sadly) that <i>all</i> DBSCAN algorithms must be intolerably slow even on moderately large <i>n</i> in practice. Surprisingly, we show that the running time can be dramatically brought down to <i>O</i>(<i>n</i>) in expectation, <i>regardless of the dimensionality d</i>, as soon as slight inaccuracy in the clustering results is permitted. We formalize our findings into the new notion of ρ-<i>approximate</i> DBSCAN, which we believe should replace DBSCAN on big data due to the latter's computational intractability.