Most Influential MM Papers (2026-03 Version)
To search or review papers on a specific topic within MM, please use the search by venue (MM) and review by venue (MM) services. To browse the most productive MM authors, ranked by number of accepted papers per year, see the most productive MM authors grouped by year.
As a pioneer in the field since 2018, Paper Digest has curated thousands of such lists, drawing on years of accumulated data across decades of conferences and research topics. To ensure users never miss a breakthrough, our daily digest service sifts through tens of thousands of new papers, clinical trials, news articles, and community posts every day, delivering only what matters most to your specific interests. Beyond discovery, Paper Digest offers built-in research tools to help users read articles, write articles, get answers, conduct literature reviews, and generate research reports more efficiently.
Paper Digest Team
New York City, New York, 10017
TABLE 1: Most Influential MM Papers (2026-03 Version)
| Year | Rank | Paper | Author(s) |
|---|---|---|---|
| 2025 | 1 | ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use (IF:4). Highlight: In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. | Kaixin Li et al. |
| 2025 | 2 | Open-Set Image Tagging with Multi-Grained Text Supervision (IF:3). Highlight: This paper introduces the Recognize Anything Plus Model (RAM++), an open-set image tagging model effectively leveraging multi-grained text supervision. | Xinyu Huang et al. |
| 2025 | 3 | FantasyTalking: Realistic Talking Portrait Generation Via Coherent Motion Synthesis (IF:3). Highlight: Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion Transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. | Mengchao Wang et al. |
| 2025 | 4 | UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing (IF:3). Highlight: In this work, we present UniEdit, a tuning-free framework that supports both video motion and appearance editing by harnessing the power of a pre-trained text-to-video generator within an inversion-then-generation framework. | Jianhong Bai et al. |
| 2025 | 5 | Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts (IF:3). Highlight: Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. | Leyang Li; Shilin Lu; Yan Ren; Adams Wai-Kin Kong |
| 2025 | 6 | Breaking The Modality Barrier: Universal Embedding Learning with Multimodal LLMs (IF:3). Highlight: In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. | Tiancheng Gu et al. |
| 2025 | 7 | EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting (IF:3). Highlight: In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and modality-of-thought (CoM) techniques. | Guanrou Yang et al. |
| 2025 | 8 | EditWorld: Simulating World Dynamics for Instruction-Following Image Editing (IF:3). Highlight: Diffusion models have significantly improved the performance of image editing. | Bohan Zeng et al. |
| 2025 | 9 | MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning (IF:3). Highlight: To date, there is a notable lack of rigorous benchmarks that assess Multimodal Large Language Models (MLLMs) within the financial domain, a field characterized by specialized financial charts and complex domain-specific expertise. To address this gap, we introduce MME-Finance, the first comprehensive bilingual multimodal benchmark tailored for financial analysis. | Ziliang Gan et al. |
| 2025 | 10 | HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation (IF:3). Highlight: We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. | Pei Liu et al. |
| 2025 | 11 | Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models Via Reinforcement Learning (IF:3). Highlight: This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. | Baining Zhao et al. |
| 2025 | 12 | Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis (IF:3). Highlight: Recent advances in diffusion models have endowed talking head synthesis with subtle expressions and vivid head movements, but have also led to slow inference speed and insufficient control over generated results. To address these issues, we propose Ditto, a diffusion-based talking head framework that enables fine-grained controls and real-time inference. | Tianqi Li; Ruobing Zheng; Minghui Yang; Jingdong Chen; Ming Yang |
| 2025 | 13 | Visual Instance-aware Prompt Tuning (IF:3). Highlight: We observe that this strategy results in sub-optimal performance due to high variance in downstream datasets. To address this challenge, we propose Visual Instance-aware Prompt Tuning (ViaPT), which generates instance-aware prompts based on each individual input and fuses them with dataset-level prompts, leveraging Principal Component Analysis (PCA) to retain important prompting information. | Xi Xiao et al. |
| 2025 | 14 | Manipulating Multimodal Agents Via Cross-Modal Prompt Injection (IF:3). Highlight: However, in this paper, we identify a critical yet previously overlooked security vulnerability in multimodal agents: cross-modal prompt injection attacks. To exploit this vulnerability, we propose CrossInject, a novel attack framework in which an attacker embeds adversarial perturbations across multiple modalities to align with target malicious content, allowing external instructions to hijack the agents’ decision-making process and execute unauthorized tasks. | Le Wang et al. |
| 2025 | 15 | StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation (IF:3). Highlight: However, current methods often incorporate specialized feature fusion modules tailored to specific modalities, thereby restricting input flexibility and increasing the number of training parameters. To address these challenges, we propose StitchFusion, a straightforward yet effective modal fusion framework that integrates large-scale pre-trained models directly as encoders and feature fusers. | Bingyu Li; Da Zhang; Zhiyuan Zhao; Junyu Gao; Xuelong Li |
| 2024 | 1 | Tango 2: Aligning Diffusion-based Text-to-Audio Generations Through Direct Preference Optimization (IF:4). Highlight: As such, in this work, using an existing text-to-audio model Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. | Navonil Majumder et al. |
| 2024 | 2 | FreqMamba: Viewing Mamba from A Frequency Perspective for Image Deraining (IF:3). Highlight: In this paper, we propose FreqMamba, an effective and efficient paradigm that leverages the complementarity between Mamba and frequency analysis for image deraining. | Zhen Zou; Hu Yu; Jie Huang; Feng Zhao |
| 2024 | 3 | Making Large Language Models Perform Better in Knowledge Graph Completion (IF:3). Highlight: In this paper, we explore methods to incorporate structural information into the LLMs, with the overarching goal of facilitating structure-aware reasoning. | Yichi Zhang et al. |
| 2024 | 4 | AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset (IF:3). Highlight: While most of the research efforts in this domain are focused on detecting high-quality deepfake images and videos, only a few works address the problem of the localization of small segments of audio-visual manipulations embedded in real videos. In this research, we emulate the process of such content generation and propose the AV-Deepfake1M dataset. | Zhixi Cai et al. |
| 2024 | 5 | CompGS: Efficient 3D Scene Representation Via Compressed Gaussian Splatting (IF:3). Highlight: Herein, we propose an efficient 3D scene representation, named Compressed Gaussian Splatting (CompGS), which harnesses compact Gaussian primitives for faithful 3D scene modeling with a remarkably reduced data size. | Xiangrui Liu et al. |
| 2024 | 6 | Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis Via State Space Model (IF:3). Highlight: In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. | Xu Han; Yuan Tang; Zhaoxuan Wang; Xianzhi Li |
| 2024 | 7 | Multi-Scale and Detail-Enhanced Segment Anything Model for Salient Object Detection (IF:3). Highlight: Additionally, SAM lacks the utilization of multi-scale and multi-level information, as well as the incorporation of fine-grained details. To address these shortcomings, we propose a Multi-scale and Detail-enhanced SAM (MDSAM) for SOD. | Shixuan Gao; Pingping Zhang; Tianyu Yan; Huchuan Lu |
| 2024 | 8 | DiffMM: Multi-Modal Diffusion Model for Recommendation (IF:3). Highlight: To fill this research gap, we propose a novel multi-modal graph diffusion model for recommendation called DiffMM. | Yangqin Jiang et al. |
| 2024 | 9 | Consistent123: One Image to Highly Consistent 3D Asset Using Case-Aware Diffusion Priors (IF:3). Highlight: In this work, we propose Consistent123, a case-aware two-stage method for highly consistent 3D asset reconstruction from one image with both 2D and 3D diffusion priors. | Yukang Lin et al. |
| 2024 | 10 | Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier (IF:3). Highlight: To enhance the sensitivity of deepfake audio features, we propose a deepfake audio detection model that incorporates an SLS (Sensitive Layer Selection) module. | Qishan Zhang; Shuangbing Wen; Tao Hu |
| 2024 | 11 | FiLo: Zero-Shot Anomaly Detection By Fine-Grained Description and High-Quality Localization (IF:3). Highlight: Additionally, computing feature similarities for single patches struggles to pinpoint specific locations of anomalies with various sizes and scales. To address these issues, we propose a novel ZSAD method called FiLo, comprising two components: adaptively learned Fine-Grained Description (FG-Des) and position-enhanced High-Quality Localization (HQ-Loc). | Zhaopeng Gu et al. |
| 2024 | 12 | Wave-Mamba: Wavelet State Space Model for Ultra-High-Definition Low-Light Image Enhancement (IF:3). Highlight: It enables state space models (SSMs) to avoid being affected by noise when modeling long sequences, thus making full use of the long-sequence modeling capability of SSMs. On this basis, we propose Wave-Mamba, a novel approach based on two pivotal insights derived from the wavelet domain: 1) most of the content information of an image exists in the low-frequency component, less in the high-frequency component. | Wenbin Zou; Hongxia Gao; Weipeng Yang; Tongtong Liu |
| 2024 | 13 | RainMamba: Enhanced Locality Learning with State Space Models for Video Deraining (IF:3). Highlight: Unexpectedly, its uni-dimensional sequential process on videos destroys the local correlations across the spatio-temporal dimension by distancing adjacent pixels. To address this, we present an improved SSMs-based video deraining network (RainMamba) with a novel Hilbert scanning mechanism to better capture sequence-level local information. | Hongtao Wu et al. |
| 2024 | 14 | PastNet: Introducing Physical Inductive Biases for Spatio-temporal Video Prediction (IF:3). Highlight: In this paper, we investigate the challenge of spatio-temporal video prediction, which involves generating future videos based on historical data streams. | Hao Wu et al. |
| 2024 | 15 | WorldGPT: Empowering LLM As Multimodal World Model (IF:3). Highlight: In this paper, we introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM). | Zhiqi Ge et al. |
| 2023 | 1 | A Tale of Two Graphs: Freezing and Denoising Graph Structures for Multimodal Recommendation (IF:5). Highlight: Experimentally, we demonstrate that freezing its item-item structure before training can also achieve competitive performance. Based on this finding, we propose a simple yet effective model, dubbed FREEDOM, that FREEzes the item-item graph and DenOises the user-item interaction graph simultaneously for Multimodal recommendation. | Xin Zhou; Zhiqi Shen |
| 2023 | 2 | FourLLIE: Boosting Low-Light Image Enhancement By Fourier Frequency Information (IF:4). Highlight: Some researchers noticed that, in the Fourier space, the lightness degradation mainly exists in the amplitude component and the rest exists in the phase component. By incorporating both the Fourier frequency and the spatial information, these researchers proposed remarkable solutions for LLIE. | Chenxi Wang; Hongjun Wu; Zhi Jin |
| 2023 | 3 | LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On (IF:4). Highlight: This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. | Davide Morelli et al. |
| 2023 | 4 | Multi-View Graph Convolutional Network for Multimedia Recommendation (IF:4). Highlight: Users often exhibit distinct modality preferences when purchasing different items. Equally fusing each modality feature ignores the relative importance among different modalities, leading to suboptimal user preference modeling. To tackle the above issues, we propose a novel Multi-View Graph Convolutional Network (MGCN) for multimedia recommendation. | Penghang Yu; Zhiyi Tan; Guanming Lu; Bing-Kun Bao |
| 2023 | 5 | FedGH: Heterogeneous Federated Learning with Generalized Global Header (IF:4). Highlight: Existing model-heterogeneous FL approaches often require publicly available datasets and incur high communication and/or computational costs, which limit their performance. To address these limitations, we propose a simple but effective Federated Global prediction Header (FedGH) approach. | Liping Yi; Gang Wang; Xiaoguang Liu; Zhuan Shi; Han Yu |
| 2023 | 6 | Taming The Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow (IF:4). Highlight: However, simply using clothes as a condition for guiding the diffusion model to inpaint is insufficient to maintain the details of the clothes. To overcome this challenge, we propose an exemplar-based inpainting approach that leverages a warping module to guide the diffusion model’s generation effectively. | Junhong Gou et al. |
| 2023 | 7 | Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark (IF:4). Highlight: In this paper, we introduce a large Multi-Attribute and Language Search dataset for text-based person retrieval, called MALS, and explore the feasibility of performing pre-training on both attribute recognition and image-text matching tasks with one stone. | Shuyu Yang et al. |
| 2023 | 8 | LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation (IF:4). Highlight: In this work, we strive to synthesize high-fidelity images that are semantically aligned with a given textual prompt without any guidance. | Leigang Qu; Shengqiong Wu; Hao Fei; Liqiang Nie; Tat-Seng Chua |
| 2023 | 9 | MLIC: Multi-Reference Entropy Model for Learned Image Compression (IF:4). Highlight: However, most entropy models only capture correlations in one dimension, while the latent representation contains channel-wise, local spatial, and global spatial correlations. To tackle this issue, we propose the Multi-Reference Entropy Model (MEM) and the advanced version, MEM+. | Wei Jiang et al. |
| 2023 | 10 | DealMVC: Dual Contrastive Calibration for Multi-view Clustering (IF:4). Highlight: The existing multi-view models mainly focus on the consistency of the same samples in different views while ignoring the circumstance of similar but different samples in cross-view scenarios. To solve this problem, we propose a novel Dual contrastive calibration network for Multi-View Clustering (DealMVC). | Xihong Yang et al. |
| 2023 | 11 | CLIP-Count: Towards Text-Guided Zero-Shot Object Counting (IF:4). Highlight: Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. | Ruixiang Jiang; Lingbo Liu; Changwen Chen |
| 2023 | 12 | AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning (IF:4). Highlight: This can greatly benefit various complex downstream tasks, including cross-modal image-text retrieval and image classification. Despite its promising prospect, the security issue of cross-modal pre-trained encoders has not been fully explored yet, especially when the pre-trained encoder is publicly available for commercial use. In this work, we propose AdvCLIP, the first attack framework for generating downstream-agnostic adversarial examples based on cross-modal pre-trained encoders. | Ziqi Zhou et al. |
| 2023 | 13 | Underwater Image Enhancement By Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy (IF:4). Highlight: In this paper, we present an approach to image enhancement with a diffusion model in underwater scenes. | Yi Tang; Hiroshi Kawasaki; Takafumi Iwaguchi |
| 2023 | 14 | Text-to-Image Diffusion Models Can Be Easily Backdoored Through Multimodal Data Poisoning (IF:4). Highlight: To gain a better understanding of the training process and potential risks of text-to-image synthesis, we perform a systematic investigation of backdoor attacks on text-to-image diffusion models and propose BadT2I, a general multimodal backdoor attack framework that tampers with image synthesis at diverse semantic levels. | Shengfang Zhai et al. |
| 2023 | 15 | Text-to-Audio Generation Using Instruction Guided Latent Diffusion Model (IF:3). Highlight: The immense scale of recent large language models (LLMs) allows many interesting properties, such as instruction- and chain-of-thought-based fine-tuning, that have significantly improved zero- and few-shot performance in many natural language processing (NLP) tasks. Inspired by such successes, we adopt the instruction-tuned LLM Flan-T5 as the text encoder for text-to-audio (TTA) generation, a task where the goal is to generate audio from its textual description. | Deepanway Ghosal; Navonil Majumder; Ambuj Mehrish; Soujanya Poria |
| 2022 | 1 | LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking (IF:7). Highlight: In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. | Yupan Huang; Tengchao Lv; Lei Cui; Yutong Lu; Furu Wei |
| 2022 | 2 | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval (IF:6). Highlight: To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. | Yiwei Ma et al. |
| 2022 | 3 | A Deep Learning Based No-reference Quality Assessment Model for UGC Videos (IF:5). Highlight: Previous UGC video quality assessment (VQA) studies either use the image recognition model or the image quality assessment (IQA) models to extract frame-level features of UGC videos for quality regression, which are regarded as sub-optimal solutions because of the domain shifts between these tasks and the UGC VQA task. In this paper, we propose a very simple but effective UGC VQA model, which tries to address this problem by training an end-to-end spatial feature extraction network to directly learn the quality-aware spatial feature representation from raw pixels of the video frames. | Wei Sun; Xiongkuo Min; Wei Lu; Guangtao Zhai |
| 2022 | 4 | ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech (IF:5). Highlight: In this work, we propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech. | Rongjie Huang et al. |
| 2022 | 5 | Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression (IF:5). Highlight: To this end, this paper proposes a powerful entropy model which efficiently captures both spatial and temporal dependencies. | Jiahao Li; Bin Li; Yan Lu |
| 2022 | 6 | Disentangled Representation Learning for Multimodal Emotion Recognition (IF:5). Highlight: However, the serious problem is that the distribution gap and information redundancy often exist across heterogeneous modalities, resulting in learned multimodal representations that may be unrefined. Motivated by these observations, we propose a Feature-Disentangled Multimodal Emotion Recognition (FDMER) method, which learns the common and private feature representations for each modality. | Dingkang Yang; Shuai Huang; Haopeng Kuang; Yangtao Du; Lihua Zhang |
| 2022 | 7 | DiT: Self-supervised Pre-training for Document Image Transformer (IF:5). Highlight: In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model using large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts ever exist due to the lack of human-labeled document images. | Junlong Li et al. |
| 2022 | 8 | RKformer: Runge-Kutta Transformer with Random-Connection Attention for Infrared Small Target Detection (IF:4). Highlight: However, due to the small scale of targets as well as noise and clutter in the background, current deep neural network-based methods struggle in extracting features with discriminative semantics while preserving fine details. In this paper, we address this problem by proposing a novel RKformer model with an encoder-decoder structure, where four specifically designed Runge-Kutta transformer (RKT) blocks are stacked sequentially in the encoder. | Mingjin Zhang et al. |
| 2022 | 9 | DetFusion: A Detection-driven Infrared and Visible Image Fusion Network (IF:4). Highlight: For object detection tasks, object-related information in images is often more valuable than focusing on the pixel-level details of images alone. To fill this gap, we propose a detection-driven infrared and visible image fusion network, termed DetFusion, which utilizes object-related information learned in the object detection networks to guide multimodal image fusion. | Yiming Sun; Bing Cao; Pengfei Zhu; Qinghua Hu |
| 2022 | 10 | You Only Hypothesize Once: Point Cloud Registration with Rotation-equivariant Descriptors (IF:4). Highlight: In this paper, we propose a novel local descriptor-based framework, called You Only Hypothesize Once (YOHO), for the registration of two unaligned point clouds. | Haiping Wang; Yuan Liu; Zhen Dong; Wenping Wang |
| 2022 | 11 | Real-World Blind Super-Resolution Via Feature Matching with Implicit High-Resolution Priors (IF:4). Highlight: In this work, we propose Feature Matching SR (FeMaSR), which restores realistic HR images in a much more compact feature space. | Chaofeng Chen et al. |
| 2022 | 12 | Towards Adversarial Attack on Vision-Language Pre-training Models (IF:4). Highlight: This paper studies adversarial attacks on popular VLP models and V+L tasks. | Jiaming Zhang; Qi Yi; Jitao Sang |
| 2022 | 13 | CubeMLP: An MLP-based Model for Multimodal Sentiment Analysis and Depression Estimation (IF:4). Highlight: To this end, we introduce CubeMLP, a multimodal feature processing framework based entirely on MLP. | Hao Sun; Hongyi Wang; Jiaqing Liu; Yen-Wei Chen; Lanfen Lin |
| 2022 | 14 | NeRF-SR: High Quality Neural Radiance Fields Using Supersampling (IF:4). Highlight: We present NeRF-SR, a solution for high-resolution (HR) novel view synthesis with mostly low-resolution (LR) inputs. | Chen Wang et al. |
| 2022 | 15 | Learning Granularity-Unified Representations for Text-to-Image Person Re-identification (IF:4). Highlight: In this paper, we propose an end-to-end framework based on transformers to learn granularity-unified representations for both modalities, denoted as LGUR. | Zhiyin Shao et al. |
| 2021 | 1 | Metaverse for Social Good: A University Campus Prototype IF:7 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we highlight the representative applications for social good. |
HAIHAN DUAN et. al. |
| 2021 | 2 | UACANet: Uncertainty Augmented Context Attention for Polyp Segmentation IF:6 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Uncertainty Augmented Context Attention network (UACANet) for polyp segmentation which considers an uncertain area of the saliency map. |
Taehun Kim; Hyemin Lee; Daijin Kim; |
| 2021 | 3 | Mining Latent Structures for Multimedia Recommendation IF:6 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose a LATent sTructure mining method for multImodal reCommEndation, which we term LATTICE for brevity. |
JINGHAO ZHANG et. al. |
| 2021 | 4 | Contrastive Learning for Cold-Start Recommendation IF:6 Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, the representation learning is theoretically lower-bounded by the integration of two terms: mutual information between collaborative embeddings of users and items, and mutual information between collaborative embeddings and feature representations of items. To model such a learning process, we devise a new objective function founded upon contrastive learning and develop a simple yet efficient Contrastive Learning-based Cold-start Recommendation framework (CLCRec). |
YINWEI WEI et. al. |
| 2021 | 5 | Scalable Multi-view Subspace Clustering with Unified Anchors (IF:5). Highlight: Moreover, the complementary multi-view information has not been well utilized since the graphs are constructed independently by the anchors from the corresponding views. To address these issues, we propose a Scalable Multi-view Subspace Clustering with Unified Anchors (SMVSC). | MENGJING SUN et. al. |
| 2021 | 6 | MBRS: Enhancing Robustness of DNN-based Watermarking By Mini-Batch of Real and Simulated JPEG Compression (IF:5). Highlight: However, we find that none of the existing frameworks can ensure robustness against JPEG compression, which is non-differentiable but an essential image processing operation. To address this limitation, we propose a novel end-to-end training architecture that utilizes a Mini-Batch of Real and Simulated JPEG compression (MBRS) to enhance JPEG robustness. | Zhaoyang Jia; Han Fang; Weiming Zhang; |
| 2021 | 7 | DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval (IF:5). Highlight: To this end, we propose a novel Deep Surroundings-person Separation Learning (DSSL) model in this paper to effectively extract and match person information, and hence achieve a superior retrieval accuracy. | AICHUN ZHU et. al. |
| 2021 | 8 | Enhanced Invertible Encoding for Learned Image Compression (IF:5). Highlight: However, few efforts are devoted to structuring a better transformation between the image space and the latent feature space. In this paper, instead of employing previous autoencoder style networks to build this transformation, we propose an enhanced Invertible Encoding Network with invertible neural networks (INNs) to largely mitigate the information loss problem for better compression. | Yueqi Xie; Ka Leong Cheng; Qifeng Chen; |
| 2021 | 9 | Edge-oriented Convolution Block for Real-time Super Resolution on Mobile Devices (IF:5). Highlight: In this work, we propose a re-parameterizable building block, namely Edge-oriented Convolution Block (ECB), for efficient SR design. | Xindong Zhang; Hui Zeng; Lei Zhang; |
| 2021 | 10 | Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection (IF:5). Highlight: Unlike prior work, where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. | RUIJIE TAO et. al. |
| 2021 | 11 | DAWN: Dynamic Adversarial Watermarking of Neural Networks (IF:5). Highlight: In this paper, we introduce DAWN (Dynamic Adversarial Watermarking of Neural Networks), the first approach to use watermarking to deter model extraction theft. | Sebastian Szyller; Buse Gul Atli; Samuel Marchal; N. Asokan; |
| 2021 | 12 | Spatiotemporal Inconsistency Learning for DeepFake Video Detection (IF:4). Highlight: Specifically, we present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along both horizontal and vertical directions. | ZHIHAO GU et. al. |
| 2021 | 13 | Disentangle Your Dense Object Detector (IF:4). Highlight: However, the current training pipeline for dense detectors is compromised by many conjunctions that may not hold. In this paper, we investigate three such important conjunctions: 1) only samples assigned as positive in the classification head are used to train the regression head; 2) classification and regression share the same input feature and computational fields defined by the parallel head architecture; and 3) samples distributed in different feature pyramid layers are treated equally when computing the loss. | ZEHUI CHEN et. al. |
| 2021 | 14 | Former-DFER: Dynamic Facial Expression Recognition Transformer (IF:4). Highlight: This paper proposes a dynamic facial expression recognition transformer (Former-DFER) for the in-the-wild scenario. | Zengqun Zhao; Qingshan Liu; |
| 2021 | 15 | From Synthetic to Real: Image Dehazing Collaborating with Unlabeled Real Data (IF:4). Highlight: Single image dehazing is a challenging task, for which the domain shift between synthetic training data and real-world testing images usually leads to degradation of existing methods. To address this issue, we propose a novel image dehazing framework collaborating with unlabeled real data. | YE LIU et. al. |
| 2020 | 1 | A Lip Sync Expert Is All You Need For Speech To Lip Generation In The Wild (IF:8). Highlight: In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. | K R Prajwal; Rudrabha Mukhopadhyay; Vinay P. Namboodiri; C.V. Jawahar; |
| 2020 | 2 | MISA: Modality-Invariant And -Specific Representations For Multimodal Sentiment Analysis (IF:8). Highlight: In this paper, we aim to learn effective modality representations to aid the process of fusion. | Devamanyu Hazarika; Roger Zimmermann; Soujanya Poria; |
| 2020 | 3 | Action2Motion: Conditioned Generation Of 3D Human Motions (IF:7). Highlight: This paper, on the other hand, considers a relatively new problem, which could be thought of as an inverse of action recognition: given a prescribed action type, we aim to generate plausible human motion sequences in 3D. | CHUAN GUO et. al. |
| 2020 | 4 | WildDeepfake: A Challenging Real-World Dataset For Deepfake Detection (IF:6). Highlight: To better support detection against real-world deepfakes, in this paper, we introduce a new dataset WildDeepfake, which consists of 7,314 face sequences extracted from 707 deepfake videos collected completely from the internet. | Bojia Zi; Minghao Chang; Jingjing Chen; Xingjun Ma; Yu-Gang Jiang; |
| 2020 | 5 | SimSwap: An Efficient Framework For High Fidelity Face Swapping (IF:6). Highlight: We propose an efficient framework, called Simple Swap (SimSwap), aiming for generalized and high fidelity face swapping. | Renwang Chen; Xuanhong Chen; Bingbing Ni; Yanhao Ge; |
| 2020 | 6 | Pop Music Transformer: Beat-based Modeling And Generation Of Expressive Pop Piano Compositions (IF:6). Highlight: In contrast with this general approach, this paper shows that Transformers can do even better for music modeling, when we improve the way a musical score is converted into the data fed to a Transformer model. | Yu-Siang Huang; Yi-Hsuan Yang; |
| 2020 | 7 | University-1652: A Multi-view Multi-source Benchmark For Drone-based Geo-localization (IF:6). Highlight: To verify the effectiveness of the drone platform, we introduce a new multi-view multi-source benchmark for drone-based geo-localization, named University-1652. | Zhedong Zheng; Yunchao Wei; Yi Yang; |
| 2020 | 8 | Stronger, Faster And More Explainable: A Graph Convolutional Baseline For Skeleton-based Action Recognition (IF:6). Highlight: In this work, we propose an efficient but strong baseline based on Graph Convolutional Network (GCN), where three main improvements are aggregated, i.e., early fused Multiple Input Branches (MIB), Residual GCN (ResGCN) with bottleneck structure and Part-wise Attention (PartAtt) block. | Yi-Fan Song; Zhang Zhang; Caifeng Shan; Liang Wang; |
| 2020 | 9 | Dynamic GCN: Context-enriched Topology Learning For Skeleton-based Action Recognition (IF:6). Highlight: In this paper, we propose Dynamic GCN, in which a novel convolutional neural network named Context-encoding Network (CeN) is introduced to learn skeleton topology automatically. | FANFAN YE et. al. |
| 2020 | 10 | Graph-Refined Convolutional Network For Multimedia Recommendation With Implicit Feedback (IF:6). Highlight: In this work, we focus on adaptively refining the structure of the interaction graph to discover and prune potential false-positive edges. | Yinwei Wei; Xiang Wang; Liqiang Nie; Xiangnan He; Tat-Seng Chua; |
| 2020 | 11 | Emotions Don’t Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues (IF:6). Highlight: We present a learning-based method for detecting real and fake deepfake multimedia content. | Trisha Mittal; Uttaran Bhattacharya; Rohan Chandra; Aniket Bera; Dinesh Manocha; |
| 2020 | 12 | MS2L: Multi-Task Self-Supervised Learning For Skeleton Based Action Recognition (IF:5). Highlight: In this paper, we address self-supervised representation learning from human skeletons for action recognition. | Lilang Lin; Sijie Song; Wenhan Yang; Jiaying Liu; |
| 2020 | 13 | DeepRhythm: Exposing DeepFakes With Attentional Visual Heartbeat Rhythms (IF:5). Highlight: In this work, we propose DeepRhythm, a DeepFake detection technique that exposes DeepFakes by monitoring the heartbeat rhythms. | HUA QI et. al. |
| 2020 | 14 | DFEW: A Large-Scale Database For Recognizing Dynamic Facial Expressions In The Wild (IF:5). Highlight: In this paper, we focus on this challenging but interesting topic and make contributions from three aspects. | XINGXUN JIANG et. al. |
| 2020 | 15 | Cloze Test Helps: Effective Video Anomaly Detection Via Learning To Complete Video Events (IF:5). Highlight: Inspired by the cloze test frequently used in language study, we propose a brand-new VAD solution named Video Event Completion (VEC) to bridge the gaps above: First, we propose a novel pipeline to achieve both precise and comprehensive enclosure of video activities. | GUANG YU et. al. |