Attentive Collaborative Filtering: Multimedia Recommendation With Item- And Component-Level Attention
Abstract
Multimedia content is dominating today's Web information. User-item interactions with multimedia are, by nature, binary (1/0) implicit feedback (<i>e.g.</i>, photo likes, video views, song downloads), which can be collected at a larger scale and at a much lower cost than explicit feedback (<i>e.g.</i>, product ratings). However, the majority of existing collaborative filtering (CF) systems are not well designed for multimedia recommendation, since they ignore the implicitness in users' interactions with multimedia content. We argue that, in multimedia recommendation, there exists <i>item</i>- and <i>component-level</i> implicitness which blurs users' underlying preferences. Item-level implicitness means that users' preferences on items (<i>e.g.</i>, photos, videos, songs) are unknown, while component-level implicitness means that, inside each item, users' preferences on different components (<i>e.g.</i>, regions in an image, frames of a video) are unknown. For example, a 'view' on a video does not provide any specific information about how much the user likes the video (<i>i.e.</i>, item-level) or which parts of the video the user is interested in (<i>i.e.</i>, component-level). In this paper, we introduce a novel <i>attention</i> mechanism in CF, dubbed Attentive Collaborative Filtering (ACF), to address the challenging item- and component-level implicit feedback in multimedia recommendation. Specifically, our attention model is a neural network that consists of two attention modules: the component-level attention module, which starts from any content feature extraction network (<i>e.g.</i>, a CNN for images/videos) and learns to select informative components of multimedia items, and the item-level attention module, which learns to score the item preferences. ACF can be seamlessly incorporated into classic CF models with implicit feedback, such as BPR and SVD++, and efficiently trained using SGD.
Through extensive experiments on two real-world multimedia Web services, Vine and Pinterest, we show that ACF significantly outperforms state-of-the-art CF methods.
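
To make the two-level structure described above concrete, the following is a minimal sketch, not the model evaluated in the paper. It assumes that user, item, and component vectors share one dimensionality and it scores attention with plain dot products instead of a learned neural network. The function names (`component_attention`, `item_attention`) and the way the attended items are added to the user vector are illustrative assumptions.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def component_attention(user_vec, components):
    """Component level: pool one item's component features (e.g., frame or
    region features) into a single content vector, weighted per user.
    ASSUMPTION: dot-product scoring stands in for a learned attention network."""
    weights = softmax([dot(user_vec, c) for c in components])
    return weighted_sum(weights, components), weights


def item_attention(user_vec, item_vecs, content_vecs):
    """Item level: weight each item the user interacted with, then add the
    attended item vectors to the user vector.
    ASSUMPTION: the score simply sums two dot products."""
    scores = [dot(user_vec, p) + dot(user_vec, x)
              for p, x in zip(item_vecs, content_vecs)]
    weights = softmax(scores)
    neighborhood = weighted_sum(weights, item_vecs)
    return [u + n for u, n in zip(user_vec, neighborhood)], weights
```

In this sketch, each item in a user's history would first be passed through `component_attention`, and the resulting content vectors would then be handed to `item_attention` together with the item vectors. The returned weights show which components and which items the user representation leans on.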