PAPER DIGEST
Most Influential SIGIR 2015 Paper · 2026-03 edition

HSpam14: A Collection Of 14 Million Tweets For Hashtag-Oriented Spam Research

Surendra Sedhai; Aixin Sun

Venue: ACM SIGIR Conference (SIGIR) 2015
Recognition: Most Influential SIGIR 2015 Paper (Rank No. 15)
Edition: 2026-03
Impact factor: 4
Certificate ID: 38f5b0c78367e3e3

Abstract

Hashtags facilitate information diffusion on Twitter by creating dynamic, virtual communities that aggregate information from all Twitter users. Because hashtags serve as additional channels through which a user's tweets can reach users beyond her own followers, hashtags, particularly popular and trending ones, are targeted for spamming (e.g., hashtag hijacking). Although much effort has been devoted to fighting email and web spam, few studies address hashtag-oriented spam in tweets. In this paper, we collect 14 million tweets that matched some trending hashtags over a two-month period and then systematically annotate the tweets as spam or ham (i.e., non-spam). We name the annotated dataset HSpam14. Our annotation process includes four major steps: (i) heuristic-based selection to search for tweets that are more likely to be spam, (ii) near-duplicate-cluster-based annotation, which first groups similar tweets into clusters and then labels the clusters, (iii) reliable ham tweet detection to label tweets that are non-spam, and (iv) Expectation-Maximization (EM)-based label prediction to predict the labels of the remaining unlabeled tweets. One major contribution of this work is the creation of the HSpam14 dataset, which can be used for hashtag-oriented spam research in tweets. Another contribution is the set of observations made from a preliminary analysis of the HSpam14 dataset.
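
To make step (ii) concrete, the following is a minimal sketch of near-duplicate clustering followed by cluster-level labeling. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how similarity is measured, so Jaccard similarity over word bigrams, the 0.6 threshold, the greedy single-pass grouping, and all function names (`shingles`, `jaccard`, `cluster_near_duplicates`, `propagate_labels`) are hypothetical choices made here for clarity.

```python
import re


def shingles(text, k=2):
    """Return the set of k-word shingles of a normalized tweet (assumed normalization)."""
    # Lowercase and drop URLs and @mentions so trivially varied copies collide.
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    tokens = re.findall(r"[#\w]+", text)
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(tweets, threshold=0.6):
    """Greedy single pass: a tweet joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters = []  # list of (representative_shingles, member_indices)
    for idx, tweet in enumerate(tweets):
        sh = shingles(tweet)
        for rep, members in clusters:
            if jaccard(sh, rep) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((sh, [idx]))
    return [members for _, members in clusters]


def propagate_labels(clusters, cluster_labels):
    """Copy each cluster's label ("spam" or "ham") to every member tweet."""
    labels = {}
    for members, label in zip(clusters, cluster_labels):
        for idx in members:
            labels[idx] = label
    return labels


# Made-up example tweets, not taken from HSpam14.
tweets = [
    "Win a free iPhone now #WorldCup http://a.co/1",
    "Win a free iPhone now #WorldCup http://b.co/2",
    "Great goal by Messi tonight #WorldCup",
]
clusters = cluster_near_duplicates(tweets)
labels = propagate_labels(clusters, ["spam", "ham"])
```

The appeal of labeling at the cluster level is that one annotation decision covers every tweet in the cluster, which is what makes annotation feasible at the scale of millions of tweets. At that scale the pairwise comparison used in this sketch would be too slow, and an approximate scheme such as locality-sensitive hashing would be the usual substitute.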
