Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections
Abstract
The Scatter/Gather document browsing method uses fast document clustering to produce table-of-contents-like outlines of large document collections. Previous work [1] developed linear-time document clustering algorithms to establish the feasibility of this method over moderately large collections. However, even linear-time algorithms are too slow to support interactive browsing of very large collections such as Tipster, the DARPA standard text retrieval evaluation collection. We present a scheme that supports constant interaction-time Scatter/Gather of arbitrarily large collections after near-linear time preprocessing. This preprocessing involves the construction of a <i>cluster hierarchy</i>. We also present a modification of Scatter/Gather that employs this scheme, along with an example of its use over the Tipster collection.