PAPER DIGEST
Most Influential EMNLP 2021 Paper · 2026-03 edition

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer

Venue: Conference on Empirical Methods in Natural Language Processing (EMNLP) 2021
Recognition: Most Influential EMNLP 2021 Paper (Rank No. 6)
Edition: 2026-03
Impact factor: 7
Certificate ID: 2681221e86ee152d

Abstract

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.
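
The contrastive objective mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the loss, so the code below is a generic symmetric InfoNCE-style loss, not the authors' implementation. It assumes a batch has already been assembled in which row i of the video embeddings and row i of the text embeddings form a positive pair, and every other row serves as a negative. The function name `symmetric_info_nce`, the `temperature` parameter, and the use of NumPy are illustrative assumptions; the construction of temporally overlapping positives and retrieval-based hard negatives is not modeled here.

```python
import numpy as np


def symmetric_info_nce(video_emb, text_emb, temperature=1.0):
    """Generic symmetric InfoNCE loss over a batch of paired embeddings.

    Hypothetical helper for illustration only. Row i of `video_emb` and
    row i of `text_emb` are treated as the positive pair; all other rows
    in the batch act as negatives.
    """
    # Similarity matrix: entry (i, j) scores video i against text j.
    logits = video_emb @ text_emb.T / temperature

    def cross_entropy_diag(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct match for row i sits on the diagonal.
        return -np.mean(np.diag(log_probs))

    # Video-to-text term plus text-to-video term.
    return cross_entropy_diag(logits) + cross_entropy_diag(logits.T)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 16))
    text = rng.normal(size=(4, 16))
    print(symmetric_info_nce(video, text))
```

In this formulation, harder negatives in the batch raise the off-diagonal similarities and therefore the loss, which is the general reason a contrastive setup benefits from choosing negatives that resemble the positives.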
