Most Influential ACM Multimedia Papers (2025-09 Version)
The ACM International Conference on Multimedia is one of the top multimedia conferences in the world. The Paper Digest Team analyzes all papers published at MM over the years and presents the 15 most influential papers from each year. This ranking is constructed automatically from citations in both research papers and granted patents, and is updated frequently to reflect recent changes. To find the latest version of this list, or the most influential papers from other conferences and journals, please visit the Best Paper Digest page. Note: the most influential papers may or may not include the papers that won best paper awards. (Version: 2025-09)
To search or review MM papers on a specific topic, please use the search by venue (MM) and review by venue (MM) services. To browse the most productive MM authors, ranked by number of accepted papers, see the most productive MM authors grouped by year.
This list is created by the Paper Digest Team. Experience the cutting-edge capabilities of Paper Digest, an innovative AI-powered research platform that empowers you to read articles, write articles, get answers, conduct literature reviews and generate research reports.
Paper Digest Team
New York City, New York, 10017
team@paperdigest.org
TABLE 1: Most Influential MM Papers (2025-09 Version)
| Year | Rank | Paper | Author(s) |
|---|---|---|---|
| 2024 | 1 | Tango 2: Aligning Diffusion-based Text-to-Audio Generations Through Direct Preference Optimization IF:3 Highlight: As such, in this work, using an existing text-to-audio model Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. | Navonil Majumder et al. |
| 2024 | 2 | Making Large Language Models Perform Better in Knowledge Graph Completion IF:3 Highlight: In this paper, we explore methods to incorporate structural information into the LLMs, with the overarching goal of facilitating structure-aware reasoning. | Yichi Zhang et al. |
| 2024 | 3 | Consistent123: One Image to Highly Consistent 3D Asset Using Case-Aware Diffusion Priors IF:3 Highlight: In this work, we propose Consistent123, a case-aware two-stage method for highly consistent 3D asset reconstruction from one image with both 2D and 3D diffusion priors. | Yukang Lin et al. |
| 2024 | 4 | AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset IF:3 Highlight: While most of the research efforts in this domain are focused on detecting high-quality deepfake images and videos, only a few works address the problem of the localization of small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. | Zhixi Cai et al. |
| 2024 | 5 | FreqMamba: Viewing Mamba from A Frequency Perspective for Image Deraining IF:3 Highlight: In this paper, we propose FreqMamba, an effective and efficient paradigm that leverages the complementarity between Mamba and frequency analysis for image deraining. | Zhen Zou; Hu Yu; Jie Huang; Feng Zhao; |
| 2024 | 6 | AniTalker: Animate Vivid and Diverse Talking Faces Through Identity-Decoupled Facial Motion Encoding IF:3 Highlight: The paper introduces AniTalker, an innovative framework designed to generate lifelike talking faces from a single portrait. | Tao Liu et al. |
| 2024 | 7 | PastNet: Introducing Physical Inductive Biases for Spatio-temporal Video Prediction IF:3 Highlight: In this paper, we investigate the challenge of spatio-temporal video prediction, which involves generating future videos based on historical data streams. | Hao Wu et al. |
| 2024 | 8 | CompGS: Efficient 3D Scene Representation Via Compressed Gaussian Splatting IF:3 Highlight: Herein, we propose an efficient 3D scene representation, named Compressed Gaussian Splatting (CompGS), which harnesses compact Gaussian primitives for faithful 3D scene modeling with a remarkably reduced data size. | Xiangrui Liu et al. |
| 2024 | 9 | LiDAR-NeRF: Novel LiDAR View Synthesis Via Neural Radiance Fields IF:3 Highlight: We introduce a new task, novel view synthesis for LiDAR sensors. | Tang Tao et al. |
| 2024 | 10 | Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis Via State Space Model IF:3 Highlight: In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. | Xu Han; Yuan Tang; Zhaoxuan Wang; Xianzhi Li; |
| 2024 | 11 | WorldGPT: Empowering LLM As Multimodal World Model IF:3 Highlight: In this paper, we introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM). | Zhiqi Ge et al. |
| 2024 | 12 | RainMamba: Enhanced Locality Learning with State Space Models for Video Deraining IF:3 Highlight: Unexpectedly, its uni-dimensional sequential process on videos destroys the local correlations across the spatio-temporal dimension by distancing adjacent pixels. To address this, we present an improved SSMs-based video deraining network (RainMamba) with a novel Hilbert scanning mechanism to better capture sequence-level local information. | Hongtao Wu et al. |
| 2024 | 13 | HandRefiner: Refining Malformed Hands in Generated Images By Diffusion-based Conditional Inpainting IF:3 Highlight: For correct hand generation, our paper introduces a lightweight post-processing solution called HandRefiner. | Wenquan Lu; Yufei Xu; Jing Zhang; Chaoyue Wang; Dacheng Tao; |
| 2024 | 14 | GOI: Find 3D Gaussians of Interest with An Optimizable Open-vocabulary Semantic-space Hyperplane IF:3 Highlight: To this end, we introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS) and identifies 3D Gaussians of Interest using an Optimizable Semantic-space Hyperplane. | Yansong Qu et al. |
| 2024 | 15 | Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning IF:3 Highlight: To establish an audio dataset with high-quality captions, we propose an innovative, automatic approach leveraging multimodal inputs, such as video frames and audio streams. | Luoyi Sun; Xuenan Xu; Mengyue Wu; Weidi Xie; |
| 2023 | 1 | LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On IF:4 Highlight: This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. | Davide Morelli et al. |
| 2023 | 2 | A Tale of Two Graphs: Freezing and Denoising Graph Structures for Multimodal Recommendation IF:4 Highlight: Experimentally, we demonstrate that freezing its item-item structure before training can also achieve competitive performance. Based on this finding, we propose a simple yet effective model, dubbed as FREEDOM, that FREEzes the item-item graph and DenOises the user-item interaction graph simultaneously for Multimodal recommendation. | Xin Zhou; Zhiqi Shen; |
| 2023 | 3 | FourLLIE: Boosting Low-Light Image Enhancement By Fourier Frequency Information IF:4 Highlight: Some researchers noticed that, in the Fourier space, the lightness degradation mainly exists in the amplitude component and the rest exists in the phase component. By incorporating both the Fourier frequency and the spatial information, these researchers proposed remarkable solutions for LLIE. | Chenxi Wang; Hongjun Wu; Zhi Jin; |
| 2023 | 4 | Taming The Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow IF:4 Highlight: However, simply using clothes as a condition for guiding the diffusion model to inpaint is insufficient to maintain the details of the clothes. To overcome this challenge, we propose an exemplar-based inpainting approach that leverages a warping module to guide the diffusion model's generation effectively. | Junhong Gou et al. |
| 2023 | 5 | LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation IF:4 Highlight: In this work, we strive to synthesize high-fidelity images that are semantically aligned with a given textual prompt without any guidance. | Leigang Qu; Shengqiong Wu; Hao Fei; Liqiang Nie; Tat-Seng Chua; |
| 2023 | 6 | Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark IF:4 Highlight: In this paper, we introduce a large Multi-Attribute and Language Search dataset for text-based person retrieval, called MALS, and explore the feasibility of performing pre-training on both attribute recognition and image-text matching tasks in one stone. | Shuyu Yang et al. |
| 2023 | 7 | FedGH: Heterogeneous Federated Learning with Generalized Global Header IF:4 Highlight: Existing model-heterogeneous FL approaches often require publicly available datasets and incur high communication and/or computational costs, which limit their performances. To address these limitations, we propose a simple but effective Federated Global prediction Header (FedGH) approach. | Liping Yi; Gang Wang; Xiaoguang Liu; Zhuan Shi; Han Yu; |
| 2023 | 8 | MLIC: Multi-Reference Entropy Model for Learned Image Compression IF:3 Highlight: However, most entropy models only capture correlations in one dimension, while the latent representation contains channel-wise, local spatial, and global spatial correlations. To tackle this issue, we propose the Multi-Reference Entropy Model (MEM) and the advanced version, MEM+. | Wei Jiang et al. |
| 2023 | 9 | Multi-View Graph Convolutional Network for Multimedia Recommendation IF:3 Highlight: Users often exhibit distinct modality preferences when purchasing different items. Equally fusing each modality feature ignores the relative importance among different modalities, leading to suboptimal user preference modeling. To tackle the above issues, we propose a novel Multi-View Graph Convolutional Network (MGCN) for multimedia recommendation. | Penghang Yu; Zhiyi Tan; Guanming Lu; Bing-Kun Bao; |
| 2023 | 10 | CLIP-Count: Towards Text-Guided Zero-Shot Object Counting IF:3 Highlight: Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. | Ruixiang Jiang; Lingbo Liu; Changwen Chen; |
| 2023 | 11 | Text-to-Image Diffusion Models Can Be Easily Backdoored Through Multimodal Data Poisoning IF:3 Highlight: To gain a better understanding of the training process and potential risks of text-to-image synthesis, we perform a systematic investigation of backdoor attack on text-to-image diffusion models and propose BadT2I, a general multimodal backdoor attack framework that tampers with image synthesis in diverse semantic levels. | Shengfang Zhai et al. |
| 2023 | 12 | AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning IF:3 Highlight: This can greatly benefit various complex downstream tasks, including cross-modal image-text retrieval and image classification. Despite its promising prospect, the security issue of the cross-modal pre-trained encoder has not been fully explored yet, especially when the pre-trained encoder is publicly available for commercial use. In this work, we propose AdvCLIP, the first attack framework for generating downstream-agnostic adversarial examples based on cross-modal pre-trained encoders. | Ziqi Zhou et al. |
| 2023 | 13 | Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model IF:3 Highlight: Unfortunately, existing diffusion-based inpainting methods are limited to single-modal guidance and require task-specific training, hindering their cross-modal scalability. To address these limitations, we propose Uni-paint, a unified framework for multimodal inpainting that offers various modes of guidance, including unconditional, text-driven, stroke-driven, exemplar-driven inpainting, as well as a combination of these modes. | Shiyuan Yang; Xiaodong Chen; Jing Liao; |
| 2023 | 14 | DealMVC: Dual Contrastive Calibration for Multi-view Clustering IF:3 Highlight: The existing multi-view models mainly focus on the consistency of the same samples in different views while ignoring the circumstance of similar but different samples in cross-view scenarios. To solve this problem, we propose a novel Dual contrastive calibration network for Multi-View Clustering (DealMVC). | Xihong Yang et al. |
| 2023 | 15 | Pedestrian-specific Bipartite-aware Similarity Learning for Text-based Person Retrieval IF:3 Highlight: However, these methods underestimate the key cue role of mismatched region-word pairs and ignore the problem of low similarity between matched region-word pairs. To alleviate these issues, we propose a novel Pedestrian-specific Bipartite-aware Similarity Learning (PBSL) framework that efficiently reveals the plausible and credible levels of contribution of pedestrian-specific mismatched and matched region-word pairs towards overall similarity. | Fei Shen; Xiangbo Shu; Xiaoyu Du; Jinhui Tang; |
| 2022 | 1 | LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking IF:7 Highlight: In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. | Yupan Huang; Tengchao Lv; Lei Cui; Yutong Lu; Furu Wei; |
| 2022 | 2 | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval IF:6 Highlight: To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. | Yiwei Ma et al. |
| 2022 | 3 | ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech IF:5 Highlight: In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. | Rongjie Huang et al. |
| 2022 | 4 | A Deep Learning Based No-reference Quality Assessment Model for UGC Videos IF:4 Highlight: Previous UGC video quality assessment (VQA) studies either use the image recognition model or the image quality assessment (IQA) models to extract frame-level features of UGC videos for quality regression, which are regarded as sub-optimal solutions because of the domain shifts between these tasks and the UGC VQA task. In this paper, we propose a very simple but effective UGC VQA model, which tries to address this problem by training an end-to-end spatial feature extraction network to directly learn the quality-aware spatial feature representation from raw pixels of the video frames. | Wei Sun; Xiongkuo Min; Wei Lu; Guangtao Zhai; |
| 2022 | 5 | DiT: Self-supervised Pre-training for Document Image Transformer IF:4 Highlight: In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model using large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts ever exist due to the lack of human-labeled document images. | Junlong Li et al. |
| 2022 | 6 | Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression IF:4 Highlight: To this end, this paper proposes a powerful entropy model which efficiently captures both spatial and temporal dependencies. | Jiahao Li; Bin Li; Yan Lu; |
| 2022 | 7 | Disentangled Representation Learning for Multimodal Emotion Recognition IF:4 Highlight: However, the serious problem is that the distribution gap and information redundancy often exist across heterogeneous modalities, resulting in learned multimodal representations that may be unrefined. Motivated by these observations, we propose a Feature-Disentangled Multimodal Emotion Recognition (FDMER) method, which learns the common and private feature representations for each modality. | Dingkang Yang; Shuai Huang; Haopeng Kuang; Yangtao Du; Lihua Zhang; |
| 2022 | 8 | You Only Hypothesize Once: Point Cloud Registration with Rotation-equivariant Descriptors IF:4 Highlight: In this paper, we propose a novel local descriptor-based framework, called You Only Hypothesize Once (YOHO), for the registration of two unaligned point clouds. | Haiping Wang; Yuan Liu; Zhen Dong; Wenping Wang; |
| 2022 | 9 | Real-World Blind Super-Resolution Via Feature Matching with Implicit High-Resolution Priors IF:4 Highlight: In this work, we propose Feature Matching SR (FeMaSR), which restores realistic HR images in a much more compact feature space. | Chaofeng Chen et al. |
| 2022 | 10 | DetFusion: A Detection-driven Infrared and Visible Image Fusion Network IF:4 Highlight: For object detection tasks, object-related information in images is often more valuable than focusing on the pixel-level details of images alone. To fill this gap, we propose a detection-driven infrared and visible image fusion network, termed DetFusion, which utilizes object-related information learned in the object detection networks to guide multimodal image fusion. | Yiming Sun; Bing Cao; Pengfei Zhu; Qinghua Hu; |
| 2022 | 11 | NeRF-SR: High Quality Neural Radiance Fields Using Supersampling IF:4 Highlight: We present NeRF-SR, a solution for high-resolution (HR) novel view synthesis with mostly low-resolution (LR) inputs. | Chen Wang et al. |
| 2022 | 12 | MegaPortraits: One-shot Megapixel Neural Head Avatars IF:4 Highlight: In this work, we advance the neural head avatar technology to the megapixel resolution while focusing on the particularly challenging task of cross-driving synthesis, i.e., when the appearance of the driving image is substantially different from the animated source image. | Nikita Drobyshev et al. |
| 2022 | 13 | CubeMLP: An MLP-based Model for Multimodal Sentiment Analysis and Depression Estimation IF:4 Highlight: To this end, we introduce CubeMLP, a multimodal feature processing framework based entirely on MLP. | Hao Sun; Hongyi Wang; Jiaqing Liu; Yen-Wei Chen; Lanfen Lin; |
| 2022 | 14 | Towards Adversarial Attack on Vision-Language Pre-training Models IF:4 Highlight: This paper studied the adversarial attack on popular VLP models and V+L tasks. | Jiaming Zhang; Qi Yi; Jitao Sang; |
| 2022 | 15 | Learning Granularity-Unified Representations for Text-to-Image Person Re-identification IF:4 Highlight: In this paper, we propose an end-to-end framework based on transformers to learn granularity-unified representations for both modalities, denoted as LGUR. | Zhiyin Shao et al. |
| 2021 | 1 | Metaverse for Social Good: A University Campus Prototype IF:7 Highlight: In this paper, we highlight the representative applications for social good. | Haihan Duan et al. |
| 2021 | 2 | UACANet: Uncertainty Augmented Context Attention for Polyp Segmentation IF:5 Highlight: We propose Uncertainty Augmented Context Attention network (UACANet) for polyp segmentation which considers an uncertain area of the saliency map. | Taehun Kim; Hyemin Lee; Daijin Kim; |
| 2021 | 3 | Contrastive Learning for Cold-Start Recommendation IF:5 Highlight: Specifically, the representation learning is theoretically lower-bounded by the integration of two terms: mutual information between collaborative embeddings of users and items, and mutual information between collaborative embeddings and feature representations of items. To model such a learning process, we devise a new objective function founded upon contrastive learning and develop a simple yet efficient Contrastive Learning-based Cold-start Recommendation framework (CLCRec). | Yinwei Wei et al. |
| 2021 | 4 | Mining Latent Structures for Multimedia Recommendation IF:5 Highlight: To this end, we propose a LATent sTructure mining method for multImodal reCommEndation, which we term LATTICE for brevity. | Jinghao Zhang et al. |
| 2021 | 5 | Scalable Multi-view Subspace Clustering with Unified Anchors IF:5 Highlight: Moreover, the complementary multi-view information has not been well utilized since the graphs are constructed independently by the anchors from the corresponding views. To address these issues, we propose a Scalable Multi-view Subspace Clustering with Unified Anchors (SMVSC). | Mengjing Sun et al. |
| 2021 | 6 | Enhanced Invertible Encoding for Learned Image Compression IF:5 Highlight: However, few efforts are devoted to structuring a better transformation between the image space and the latent feature space. In this paper, instead of employing previous autoencoder style networks to build this transformation, we propose an enhanced Invertible Encoding Network with invertible neural networks (INNs) to largely mitigate the information loss problem for better compression. | Yueqi Xie; Ka Leong Cheng; Qifeng Chen; |
| 2021 | 7 | Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection IF:4 Highlight: Unlike the prior work where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. | Ruijie Tao et al. |
| 2021 | 8 | Edge-oriented Convolution Block for Real-time Super Resolution on Mobile Devices IF:4 Highlight: In this work, we propose a re-parameterizable building block, namely Edge-oriented Convolution Block (ECB), for efficient SR design. | Xindong Zhang; Hui Zeng; Lei Zhang; |
| 2021 | 9 | DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval IF:4 Highlight: To this end, we propose a novel Deep Surroundings-person Separation Learning (DSSL) model in this paper to effectively extract and match person information, and hence achieve a superior retrieval accuracy. | Aichun Zhu et al. |
| 2021 | 10 | MBRS: Enhancing Robustness of DNN-based Watermarking By Mini-Batch of Real and Simulated JPEG Compression IF:4 Highlight: However, we found that none of the existing frameworks can well ensure robustness against JPEG compression, which is non-differentiable but is an essential and important image processing operation. To address such limitations, we propose a novel end-to-end training architecture, which utilizes Mini-Batch of Real and Simulated JPEG compression (MBRS) to enhance the JPEG robustness. | Zhaoyang Jia; Han Fang; Weiming Zhang; |
| 2021 | 11 | Spatiotemporal Inconsistency Learning for DeepFake Video Detection IF:4 Highlight: Specifically, we present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along with both horizontal and vertical directions. | Zhihao Gu et al. |
| 2021 | 12 | Disentangle Your Dense Object Detector IF:4 Highlight: However, the current training pipeline for dense detectors is compromised by lots of conjunctions that may not hold. In this paper, we investigate three such important conjunctions: 1) only samples assigned as positive in the classification head are used to train the regression head; 2) classification and regression share the same input feature and computational fields defined by the parallel head architecture; and 3) samples distributed in different feature pyramid layers are treated equally when computing the loss. | Zehui Chen et al. |
| 2021 | 13 | CLIP4Caption: CLIP for Video Caption IF:4 Highlight: Existing video captioning models lack adequate visual representation due to the neglect of the existence of gaps between videos and texts. To bridge this gap, in this paper, we propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM). | Mingkang Tang et al. |
| 2021 | 14 | Diverse Image Inpainting with Bidirectional and Autoregressive Transformers IF:4 Highlight: We propose BAT-Fill, an innovative image inpainting framework that introduces a novel bidirectional autoregressive transformer (BAT) for image inpainting. | Yingchen Yu et al. |
| 2021 | 15 | TriTransNet: RGB-D Salient Object Detection with A Triplet Transformer Embedding Network IF:4 Highlight: Recently, the U-Net framework has been widely used, and continuous convolution and pooling operations generate multi-level features which are complementary with each other. In view of the greater contribution of high-level features to performance, we propose a triplet transformer embedding module to enhance them by learning long-range dependencies across layers. | Zhengyi Liu; Yuan Wang; Zhengzheng Tu; Yun Xiao; Bin Tang; |
| 2020 | 1 | A Lip Sync Expert Is All You Need For Speech To Lip Generation In The Wild IF:8 Highlight: In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. | K R Prajwal; Rudrabha Mukhopadhyay; Vinay P. Namboodiri; C.V. Jawahar; |
| 2020 | 2 | MISA: Modality-Invariant And -Specific Representations For Multimodal Sentiment Analysis IF:8 Highlight: In this paper, we aim to learn effective modality representations to aid the process of fusion. | Devamanyu Hazarika; Roger Zimmermann; Soujanya Poria; |
| 2020 | 3 | Action2Motion: Conditioned Generation Of 3D Human Motions IF:6 Highlight: This paper, on the other hand, considers a relatively new problem, which could be thought of as an inverse of action recognition: given a prescribed action type, we aim to generate plausible human motion sequences in 3D. | Chuan Guo et al. |
| 2020 | 4 | WildDeepfake: A Challenging Real-World Dataset For Deepfake Detection IF:6 Highlight: To better support detection against real-world deepfakes, in this paper, we introduce a new dataset WildDeepfake, which consists of 7,314 face sequences extracted from 707 deepfake videos collected completely from the internet. | Bojia Zi; Minghao Chang; Jingjing Chen; Xingjun Ma; Yu-Gang Jiang; |
| 2020 | 5 | SimSwap: An Efficient Framework For High Fidelity Face Swapping IF:6 Highlight: We propose an efficient framework, called Simple Swap (SimSwap), aiming for generalized and high fidelity face swapping. | Renwang Chen; Xuanhong Chen; Bingbing Ni; Yanhao Ge; |
| 2020 | 6 | Stronger, Faster And More Explainable: A Graph Convolutional Baseline For Skeleton-based Action Recognition IF:6 Highlight: In this work, we propose an efficient but strong baseline based on Graph Convolutional Network (GCN), where three main improvements are aggregated, i.e., early fused Multiple Input Branches (MIB), Residual GCN (ResGCN) with bottleneck structure and Part-wise Attention (PartAtt) block. | Yi-Fan Song; Zhang Zhang; Caifeng Shan; Liang Wang; |
| 2020 | 7 | Pop Music Transformer: Beat-based Modeling And Generation Of Expressive Pop Piano Compositions IF:6 Highlight: In contrast with this general approach, this paper shows that Transformers can do even better for music modeling, when we improve the way a musical score is converted into the data fed to a Transformer model. | Yu-Siang Huang; Yi-Hsuan Yang; |
| 2020 | 8 | Dynamic GCN: Context-enriched Topology Learning For Skeleton-based Action Recognition IF:5 Highlight: In this paper, we propose Dynamic GCN, in which a novel convolutional neural network named Context-encoding Network (CeN) is introduced to learn skeleton topology automatically. | Fanfan Ye et al. |
| 2020 | 9 | Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues IF:5 Highlight: We present a learning-based method for detecting real and fake deepfake multimedia content. | Trisha Mittal; Uttaran Bhattacharya; Rohan Chandra; Aniket Bera; Dinesh Manocha; |
| 2020 | 10 | University-1652: A Multi-view Multi-source Benchmark For Drone-based Geo-localization IF:5 Highlight: To verify the effectiveness of the drone platform, we introduce a new multi-view multi-source benchmark for drone-based geo-localization, named University-1652. | Zhedong Zheng; Yunchao Wei; Yi Yang; |
| 2020 | 11 | Graph-Refined Convolutional Network For Multimedia Recommendation With Implicit Feedback IF:5 Highlight: In this work, we focus on adaptively refining the structure of the interaction graph to discover and prune potential false-positive edges. | Yinwei Wei; Xiang Wang; Liqiang Nie; Xiangnan He; Tat-Seng Chua; |
| 2020 | 12 | MS2L: Multi-Task Self-Supervised Learning For Skeleton Based Action Recognition IF:5 Highlight: In this paper, we address self-supervised representation learning from human skeletons for action recognition. | Lilang Lin; Sijie Song; Wenhan Yang; Jiaying Liu; |
| 2020 | 13 | DeepRhythm: Exposing DeepFakes With Attentional Visual Heartbeat Rhythms IF:5 Highlight: In this work, we propose DeepRhythm, a DeepFake detection technique that exposes DeepFakes by monitoring the heartbeat rhythms. | Hua Qi et al. |
| 2020 | 14 | Cloze Test Helps: Effective Video Anomaly Detection Via Learning To Complete Video Events IF:5 Highlight: Inspired by the frequently-used cloze test in language study, we propose a brand-new VAD solution named Video Event Completion (VEC) to bridge the gaps above: First, we propose a novel pipeline to achieve both precise and comprehensive enclosure of video activities. | Guang Yu et al. |
| 2020 | 15 | Arbitrary Style Transfer Via Multi-Adaptation Network IF:4 Highlight: In this paper, we propose the multi-adaptation network which involves two self-adaptation (SA) modules and one co-adaptation (CA) module: the SA modules adaptively disentangle the content and style representations, i.e., the content SA module uses position-wise self-attention to enhance content representation and the style SA module uses channel-wise self-attention to enhance style representation; the CA module rearranges the distribution of style representation based on the content representation distribution by calculating the local similarity between the disentangled content and style features in a non-local fashion. | Yingying Deng et al. |