Similarity Measures For Tracking Information Flow
Abstract
Text similarity spans a spectrum, with broad topical similarity near one extreme and document identity at the other. Intermediate levels of similarity -- resulting from summarization, paraphrasing, copying, and stronger forms of topical relevance -- are useful for applications such as information flow analysis and question-answering tasks. In this paper, we explore mechanisms for measuring such intermediate kinds of similarity, focusing on the task of identifying where a particular piece of information originated. We consider both sentence-to-sentence and document-to-document comparison, and have incorporated these algorithms into <small>RECAP</small>, a prototype information flow analysis tool. Our experimental results with <small>RECAP</small> indicate that new mechanisms such as those we propose are likely to be more appropriate than existing methods for identifying the intermediate forms of similarity.