Paper Digest: ACM Multimedia 2025 Papers & Highlights
Note: ACM Multimedia 2025 accepted more than 1,000 papers; this page includes only 500 of them, selected by our daily paper digest algorithm. Interested users can read all ACM Multimedia-2025 papers on a separate page.
To search for papers presented at MM-2025 on a specific topic, please use the search by venue (MM-2025) service. To summarize the latest research published at ACM Multimedia 2025 on a specific topic, you can use the review by venue (MM-2025) service. To browse papers by author, see our comprehensive list of all authors (MM-2025).
This curated list is created by the Paper Digest Team. Experience the cutting-edge capabilities of Paper Digest, an innovative AI-powered research platform that delivers personalized, comprehensive daily updates on the latest research, discussions & news in your field. It also empowers you to read articles, write articles, get answers, conduct literature reviews and generate research reports.
Experience the full potential of our services today!
TABLE 1: Paper Digest: ACM Multimedia 2025 Papers & Highlights
| # | Paper | Author(s) |
|---|---|---|
| 1 | Farther Than Mirror: Explore Pattern-Compensated Depth of Mirror with Temporal Changes for Video Mirror Detection. Highlight: Meanwhile, the changes in the DOM across different video frames are also important for video mirror detection, yet this aspect has not been fully explored. To address these issues, we devise a novel framework called FTM-Net, which includes two main contributions: a Pattern-Compensated DOM estimation strategy and a Dual-Granularity Affinity module. | Zhaohu Xing; Lihao Liu; Tian Ye; Sixiang Chen; Yijun Yang; Guang Liu; Lei Zhu |
| 2 | Advancing Reliable Test-Time Adaptation of Vision-Language Models Under Visual Variations. Highlight: However, these methods face two critical reliability challenges: (1) entropy often becomes unreliable under distribution shifts, causing error accumulation in the cache and degradation in adaptation performance; (2) the final predictions may be unreliable due to inflexible decision boundaries that fail to accommodate large downstream shifts. To address these challenges, we propose a Reliable Test-time Adaptation (ReTA) method that integrates two complementary strategies to enhance reliability from two perspectives. | Yiwen Liang; Hui Chen; Yizhe Xiong; Zihan Zhou; Mengyao Lyu; Zijia Lin; Shuaicheng Niu; Sicheng Zhao; Jungong Han; Guiguang Ding |
| 3 | CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models. Highlight: In this paper, we propose CopyJudge, an automated copyright infringement identification framework that leverages large vision-language models (LVLMs) to simulate practical court processes for determining substantial similarity between copyrighted images and those generated by text-to-image diffusion models. | Shunchang Liu; Zhuan Shi; Lingjuan Lyu; Yaochu Jin; Boi Faltings |
| 4 | MPI-CD: Multi-Path Information Contrastive Decoding for Mitigating Hallucinations in Large Vision-Language Models. Highlight: Subsequently, it recalls pertinent memory details, ultimately generating a comprehensive cognitive outcome. Inspired by this process, we propose a novel, training-free decoding approach, dubbed as Multi-Path Information Contrastive Decoding (MPI-CD). | Jiacheng Ruan; Zongyun Zhang; Jingsheng Gao; Wenzhen Yuan; Ting Liu; Yuzhuo Fu |
| 5 | InteractMove: Text-Controlled Human-Object Interaction Generation in 3D Scenes with Movable Objects. Highlight: In this paper, we propose a novel task of text-controlled human-object interaction generation in 3D scenes with movable objects. | Xinhao Cai; Minghang Zheng; Xin Jin; Yang Liu |
| 6 | UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing. Highlight: In this work, we present UniEdit, a tuning-free framework that supports both video motion and appearance editing by harnessing the power of a pre-trained text-to-video generator within an inversion-then-generation framework. | Jianhong Bai; Tianyu He; Yuchi Wang; Junliang Guo; Haoji Hu; Zuozhu Liu; Jiang Bian |
| 7 | ST-SAM: SAM-Driven Self-Training Framework for Semi-Supervised Camouflaged Object Detection. Highlight: However, existing SSCOD methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, while their multi-network architectures incur high computational overhead and limited scalability. To overcome these limitations, we propose ST-SAM, a highly annotation-efficient yet concise framework that breaks away from conventional SSCOD constraints. | Xihang Hu; Fuming Sun; Jiazhe Liu; Feilong Xu; Xiaoli Zhang |
| 8 | Multi-Agent System for Comprehensive Soccer Understanding. Highlight: Recent advances in soccer understanding have demonstrated rapid progress, yet existing research predominantly focuses on isolated or narrow tasks. To bridge this gap, we propose a comprehensive framework for holistic soccer understanding. | Jiayuan Rao; Zifeng Li; Haoning Wu; Ya Zhang; Yanfeng Wang; Weidi Xie |
| 9 | FractalForensics: Proactive Deepfake Detection and Localization Via Fractal Watermarks. Highlight: In this study, we propose novel fractal watermarks for proactive Deepfake detection and localization, namely FractalForensics. | Tianyi Wang; Harry Cheng; Ming-Hui Liu; Mohan Kankanhalli |
| 10 | CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models. Highlight: In contrast, the potential to directly recover high-level semantic content – such as image labels or captions – via a cross-modality inversion attack remains largely unexplored. To address this gap, we propose CapRecover, a general cross-modality feature inversion framework that directly decodes semantic information from intermediate features without requiring image reconstruction. | Kedong Xiu; Sai Qian Zhang |
| 11 | Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models. Highlight: In this paper, we propose ConCue, a novel approach that integrates contextual cue generation with feature extraction to enhance HOI detection. | Yu-Wei Zhan; Fan Liu; Xin Luo; Xin-Shun Xu; Liqiang Nie; Mohan Kankanhalli |
| 12 | A New Dataset and Benchmark for Grounding Multimodal Misinformation. Highlight: In this paper, we introduce the task of Grounding Multimodal Misinformation (GroundMM), which verifies multimodal content and localizes misleading segments across modalities. | Bingjian Yang; Danni Xu; Kaipeng Niu; Wenxuan Liu; Zheng Wang; Mohan Kankanhalli |
| 13 | Activation Shape Matters: OOD Detection with Norm-Entropy Fusion. Highlight: We propose that activation distributional shape, not just magnitude, is essential for robust detection. | Jiawei Gu; Ziyue Qiao; Zechao Li |
| 14 | Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts. Highlight: However, their potential for leveraging multi-modal contextual information in Retrieval-Augmented Generation (RAG) remains largely underexplored. To address this gap, this paper introduces Multi-Modal Retrieval-Augmented Generation (M2RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models in leveraging knowledge from multi-modal retrieval documents. | Zhenghao Liu; Xingsheng Zhu; Tianshuo Zhou; Xinyi Zhang; Xiaoyuan Yi; Yukun Yan; Ge Yu; Maosong Sun |
| 15 | UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks Via Adapter Tuning. Highlight: However, due to the lack of pre-trained foundation models on the large-scale infrared image datasets, existing methods prefer to design task-specific frameworks and directly fine-tune them with pre-trained foundation models on their RGB-IR semantic relevance datasets, which results in poor scalability and limited generalization. To address these limitations, we propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks that introduces a novel adapter mechanism to effectively incorporate rich multi-modal features into pre-trained RGB-based foundation models. | Maoxun Yuan; Bo Cui; Tianyi Zhao; Jiayi Wang; Shan Fu; Xue Yang; Xingxing Wei |
| 16 | EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting. Highlight: In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and modality-of-thought (CoM) techniques. | Guanrou Yang; Chen Yang; Qian Chen; Ziyang Ma; Wenxi Chen; Wen Wang; Tianrui Wang; Yifan Yang; Zhikang Niu; Wenrui Liu; Fan Yu; Zhihao Du; Zhifu Gao; Shiliang Zhang; Xie Chen |
| 17 | MultiRef: Controllable Image Generation with Multiple Visual References. Highlight: In this paper, we focus on the task of controllable image generation using multiple visual references. | Ruoxi Chen; Dongping Chen; Siyuan Wu; Sinan Wang; Shiyun Lang; Peter Sushko; Gaoyang Jiang; Yao Wan; Ranjay Krishna |
| 18 | Degradation-Aware One-Step Diffusion Model for Content-Sensitive Super-Resolution in The Dark. Highlight: Diffusion-based super-resolution methods have achieved impressive results under normal lighting conditions. | Tengyu Ma; Jiafa Ruan; Yuetong Wang; Guangchao Han; Zhu Liu; Long Ma; Risheng Liu |
| 19 | Towards Universal Perception Through Language-Guided Open-World Object Detection. Highlight: While recent methods have made progress in detecting unseen categories, they typically require a set of predefined categories during the inference stage, hindering practical deployment in open-world scenarios. To overcome this crucial limitation, we propose UniPerception, a novel universal perception framework based on open-vocabulary object detection. | Zihan Wang; Yunhang Shen; Yuan Fang; Zuwei Long; Ke Li; Xing Sun; Jiao Xie; Shaohui Lin |
| 20 | Where Watermark Meets Beauty: Expert-Guided Aesthetic Visible Watermarking for Digital Artworks. Highlight: To address this challenge, we conducted an exploratory study with watermarking experts, identifying key principles, six common design patterns, and a systematic watermarking workflow. Based on these insights, we developed an end-to-end, perceptual-aware framework for aesthetic-preserving watermark embedding, modeled after expert workflows in 5 phases. | Changjuan Ran; Fang Liu; Runqi Fang; Xiangyu Meng; Shenglan Cui; Yunfan Ye |
| 21 | TimesBERT: A BERT-Style Foundation Model for Time Series Understanding. Highlight: In this paper, inspired by the shared multi-granularity structure between multivariate time series and multisentence documents, we design TimesBERT to learn generic representations of time series including temporal patterns and variate-centric characteristics. | Haoran Zhang; Yong Liu; Yunzhong Qiu; Haixuan Liu; Zhongyi Pei; Jianmin Wang; Mingsheng Long |
| 22 | Frequency Meets Semantics: Text-Visual Fusion with Directional Spectral Enhancement for Salient Object Detection in Optical Remote Sensing Images. Highlight: To address this limitation, we leverage large language models (LLMs) to expand existing ORSI-SOD datasets with detailed textual annotations, creating a more comprehensive benchmark for image-text ORSI-SOD. Building upon this foundation, we propose the Frequency Meets Semantics Network (FMS-Net), a novel framework that integrates text-visual fusion with directional spectral enhancement for ORSI-SOD. | Lamei Di; Bin Zhang; Yiming Wang; Wenxia Zhang |
| 23 | EditWorld: Simulating World Dynamics for Instruction-Following Image Editing. Highlight: Diffusion models have significantly improved the performance of image editing. | Bohan Zeng; Ling Yang; Jiaming Liu; Minghao Xu; Yuanxing Zhang; Pengfei Wan; Wentao Zhang; Shuicheng Yan |
| 24 | MINDEV: Multi-modal Integrated Diffusion Framework for Video Reconstruction from EEG Signals. Highlight: This paper proposes MINDEV (Multi-modal Integrated Neural DEcoding and Visualization), a framework that places EEG signal processing at the core of video reconstruction. | Shuai Huang; Yongxiong Wang; Huan Luo; Haodong Jing; Chendong Qin; Jingqun Tang |
| 25 | Seeing The Overlooked: Bio-Visual Inspired Weak Saliency Feedback Transformer for Person Re-identification. Highlight: While existing methods attempt to bridge this gap by incorporating additional modalities (e.g., text, 3D data) or visual cues (e.g., pose, body masks), these approaches introduce two key limitations: (1) they may distract the model with irrelevant factors like background clutter or clothing variations, and (2) they inevitably increase computational overhead during inference. To address these issues, we propose the Weak Saliency Feedback Transformer (WSFFormer), inspired by the feedback mechanisms in biological visual systems. | Changshuo Wang; Shuting He; Xiang Fang; Fangzhe Nan; Prayag Tiwari |
| 26 | Factorized Transformer Hashing with Adaptive Routing for Large-scale Image Retrieval. Highlight: This rigidity restricts the model’s capacity to learn both highly distinctive and generalizable discrete representations, posing challenges for retrieval in open-world scenarios. To overcome this challenge, we propose a novel Factorized Transformer Hashing (FTH) framework, which introduces a factorized transformer to enhance the generalization and discriminative power of hash codes. | Yadong Huo; Qibing Qin; Wenfeng Zhang; Lei Huang; Jie Nie |
| 27 | Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis. Highlight: In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. | Yifan Yang; Shujie Liu; Jinyu Li; Yuxuan Hu; Haibin Wu; Hui Wang; Jianwei Yu; Lingwei Meng; Haiyang Sun; Yanqing Liu; Yan Lu; Kai Yu; Xie Chen |
| 28 | ICAS: Detecting Training Data from Autoregressive Image Generative Models. Highlight: In response, training data detection has emerged as a critical task for identifying unauthorized data usage in model training. To better understand the vulnerability of autoregressive image generative models to such detection, we conduct the first study that applies membership inference to this domain. | Hongyao Yu; Yixiang Qiu; Yiheng Yang; Hao Fang; Tianqu Zhuang; Jiaxin Hong; Bin Chen; Hao Wu; Shu-Tao Xia |
| 29 | DREAM: Document Reconstruction Via End-to-end Autoregressive Model. Highlight: However, this approach is deficient in preserving the information related to element layouts, which are vital for document reconstruction. To surmount these aforementioned limitations, we in this paper present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM). | Xin Li; Mingming Gong; Yunfei Wu; Jianxin Dai; Antai Guo; Xinghua Jiang; Haoyu Cao; Yinsong Liu; Deqiang Jiang; Xing Sun |
| 30 | Can Audio Language Models Listen Between The Lines? A Study on Metaphorical Reasoning Via Unspoken. Highlight: However, their capacity for profound metaphorical reasoning, especially when derived from audio-specific cues, has yet to be thoroughly investigated. To address this gap, we introduce Unspoken, a bilingual (Chinese-English) question answering benchmark designed to assess ALMs’ comprehension of non-literal, metaphor-rich audio. | Hongru Xiao; Xiang Li; Duyi Pan; Longfei Zhang; ZhixueSong ZhixueSong; Jiale Han; Songning Lai; Wenshuo Chen; Jing Tang; Benyou Wang |
| 31 | Infrared and Visible Image Fusion with Language-Driven Loss in CLIP Embedding Space. Highlight: For this purpose, we present a comprehensive language-expressed fusion objective, and encode relevant texts into the multi-modal embedding space using CLIP. | Yuhao Wang; Lingjuan Miao; Zhiqiang Zhou; Lei Zhang; Qiao Yajun |
| 32 | Towards Temporal-Aware Multi-Modal Retrieval Augmented Generation in Finance. Highlight: In this work, we introduce FINTMMBench, the first comprehensive benchmark for evaluating temporal-aware multi-modal Retrieval-Augmented Generation (RAG) systems in finance. | Fengbin Zhu; Junfeng Li; Liangming Pan; Wenjie Wang; Fuli Feng; Chao Wang; Huanbo Luan; Tat-Seng Chua |
| 33 | LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models. Highlight: We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. | Shangqing Tu; Yucheng Wang; Daniel Zhang-Li; Yushi Bai; Jifan Yu; Yuhao Wu; Lei Hou; Huiqin Liu; Zhiyuan Liu; Bin Xu; Juanzi Li |
| 34 | From Continuous to Discrete: Cross-Domain Collaborative General Speech Enhancement Via Hierarchical Language Models. Highlight: This paper introduces OmniGSE, a novel general speech enhancement (GSE) framework designed to mitigate the diverse distortions that speech signals encounter in real-world scenarios. | Zhaoxi Mu; Rilin Chen; Andong Li; Meng Yu; Xinyu Yang; Dong Yu |
| 35 | ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use. Highlight: In this paper, we introduce ScreenSpot-Pro, a new benchmark designed to rigorously evaluate the grounding capabilities of MLLMs in high-resolution professional settings. | Kaixin Li; Ziyang Meng; Hongzhan Lin; Ziyang Luo; Yuchen Tian; Jing Ma; Zhiyong Huang; Tat-Seng Chua |
| 36 | EDPC: Accelerating Lossless Compression Via Lightweight Probability Models and Decoupled Parallel Dataflow. Highlight: While Autoregressive Compression Models (ACMs) have markedly improved compression efficiency through probabilistic prediction, current approaches remain constrained by two critical limitations: suboptimal compression ratios due to insufficient fine-grained feature extraction during probability modeling, and real-time processing bottlenecks caused by high resource consumption and low compression speeds. To address these challenges, we propose Efficient Dual-path Parallel Compression (EDPC), a hierarchically optimized compression framework that synergistically enhances modeling capability and execution efficiency via coordinated dual-path operations. | Zeyi Lu; Xiaoxiao Ma; Yujun Huang; Minxiao Chen; Bin Chen; Baoyi An; Shu-Tao Xia |
| 37 | Client-Server Co-design with Multi-modal Codebooks Makes Better and Faster Federated Knowledge Sharing. Highlight: In this paper, we propose a new framework, MuCo2, to kill two birds with one stone and facilitate client-server co-design through multi-modal codebooks (MuCo). | Yichi Zhang; Zhuo Chen; Lingbing Guo; Yajing Xu; Lei Liang; Wen Zhang; Huajun Chen |
| 38 | Abstractive Visual Understanding of Multi-modal Structured Knowledge: A New Perspective for MLLM Evaluation. Highlight: Despite the proliferation of evaluation benchmarks and leaderboards for MLLMs, they predominantly overlook the critical capacity of MLLMs to comprehend world knowledge with structured abstractions that appear in visual form. To address this gap, we propose a novel evaluation paradigm and devise M3STR, an innovative benchmark grounded in the Multi-Modal Map for STRuctured understanding. | Yichi Zhang; Zhuo Chen; Lingbing Guo; Yajing Xu; Min Zhang; Wen Zhang; Huajun Chen |
| 39 | GUI-Narrator: Detecting and Captioning Computer GUI Actions. Highlight: (iii) Frames in GUI videos with less information increase unnecessary computational costs for captioning. To address these challenges, we propose Act2Cap, a new video captioning benchmark specifically designed for GUI action videos, comprising 10,866 diverse video caption pairs containing not only temporal information of keyframes but also detailed narration on action types, elements, location, and purpose. | Qinchen Wu; Difei Gao; Qinghong Lin; Zhuoyu Wu; Mike Zheng Shou |
| 40 | Noise-Optimized Distribution Distillation for Dataset Condensation. Highlight: However, these methods exhibit a random sampling bias that impairs their performance in dataset condensation settings. We propose a novel dataset condensation method called Noise-Optimized Distribution Distillation (NODD) that mitigates this sampling bias to improve the training performance of synthetic datasets generated with diffusion models. | Tongfei Liu; Yufan Liu; Bing Li; Weiming Hu; Yuming Li; Chenguang Ma |
| 41 | DRC: Enhancing Personalized Image Generation Via Disentangled Representation Composition. Highlight: Despite notable progress, existing methods — whether based on diffusion models, large language models, or Large Multimodal Models (LMMs) — struggle to accurately capture and composite user style preferences and semantic intentions. In particular, the state-of-the-art LMM-based method suffers from the entanglement of visual features, leading to Guidance Collapse, where the generated images fail to preserve user-preferred styles or reflect the specified semantics. To address these limitations, we introduce DRC, a novel personalized image generation framework that enhances LMMs through Disentangled Representation Composition. | Yiyan Xu; Wuqiang Zheng; Wenjie Wang; Fengbin Zhu; Xinting Hu; Yang Zhang; Fuli Feng; Tat-Seng Chua |
| 42 | Breaking The Modality Barrier: Universal Embedding Learning with Multimodal LLMs. Highlight: In this work, we present UniME (Universal Multimodal Embedding), a novel two-stage framework that leverages MLLMs to learn discriminative representations for diverse downstream tasks. | Tiancheng Gu; Kaicheng Yang; Ziyong Feng; Xingjun Wang; Yanzhao Zhang; Dingkun Long; Yingda Chen; Weidong Cai; Jiankang Deng |
| 43 | RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm. Highlight: To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. | Tiancheng Gu; Kaicheng Yang; Chaoyi Zhang; Yin Xie; Xiang An; Ziyong Feng; Dongnan Liu; Weidong Cai; Jiankang Deng |
| 44 | Target-Guided Bayesian Flow Networks for Quantitatively Constrained CAD Generation. Highlight: In this work, we propose a novel framework for quantitatively constrained CAD generation, termed Target-Guided Bayesian Flow Network (TGBFN). | Wenhao Zheng; Chenwei Sun; Wenbo Zhang; Jiancheng Lv; Xianggen Liu |
| 45 | From Captions to Rewards (CaReVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models. Highlight: Existing methods relying on direct distillation often struggle with low-confidence data, leading to suboptimal performance. To address this, we propose CaReVL, a novel method for preference reward modeling by reliably using both high- and low-confidence data. | Muzhi Dai; Jiashuo Sun; Zhiyuan Zhao; Shixuan Liu; Rui Li; Junyu Gao; Xuelong Li |
| 46 | Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security. Highlight: Traditional supervised fine-tuning (SFT), on the other hand, often over-refuses harmless inputs, compromising general performance. Given these challenges, we propose Secure Tug-of-War (SecTOW), an innovative iterative defense-attack training method to enhance the security of MLLMs. | Muzhi Dai; Shixuan Liu; Zhiyuan Zhao; Junyu Gao; Hao Sun; Xuelong Li |
| 47 | FingER: Content Aware Fine-grained Evaluation with Reasoning for AI-Generated Videos. Highlight: In this paper, we emphasize the critical importance of integrating fine-grained reasoning into video evaluation. | Rui Chen; Lei Sun; Jing Tang; Geng Li; Xiangxiang Chu |
| 48 | E-4DGS: High-Fidelity Dynamic Reconstruction from The Multi-view Event Cameras. Highlight: To this end, we propose E-4DGS, the first event-driven dynamic Gaussian Splatting approach, for novel view synthesis from multi-view event streams with fast-moving cameras. | Chaoran Feng; Zhenyu Tang; Wangbo Yu; Yatian Pang; Yian Zhao; Jianbin Zhao; Li Yuan; Yonghong Tian |
| 49 | Can I Trust You? Advancing GUI Task Automation with Action Trust Score. Highlight: However, existing works mainly concentrate on improving AI agents’ capabilities to automate procedures and liberate humans from tedious complexities; they lack evaluation of potentially erroneous agent-generated actions that could lead to task failure and irreversible system damage, a risk that is particularly acute in fields where automated systems control critical operations. To address this issue, we introduce TrustScorer, which evaluates the trustworthiness of actions generated by AI agents and enables a new human-AI collaboration paradigm in GUI task automation: actions with low predicted trust scores are redirected for human intervention, thereby mingling human precision with AI efficiency. | Haiyang Mei; Difei Gao; Xiaopeng Wei; Xin Yang; Mike Zheng Shou |
| 50 | Multi-view Graph Clustering with Dual Structure Awareness for Remote Sensing Data. Highlight: Although recent works attempt to address this issue by refining the structure, they are designed for single-view data and struggle to extend to multi-view scenarios. To bridge this gap, we propose a dual structure awareness multi-view graph clustering method named DSMVGC, which generates two distinct structures for each view through explicit and implicit perspectives. | Xin Peng; Bowen Liu; Renxiang Guan; Wenxuan Tu |
| 51 | Contextual Gesture: Co-Speech Gesture Video Generation Through Context-aware Gesture Representation. Highlight: Despite recent advancements, existing methods struggle with accurately identifying the rhythmic or semantic triggers from audio for generating contextualized gesture patterns and achieving pixel-level realism. To address these challenges, we introduce Contextual Gesture, a framework that improves co-speech gesture video generation through three innovative components: (1) a chronological speech-gesture alignment that temporally connects the two modalities, (2) a contextualized gesture tokenization that incorporates speech context into motion pattern representation through distillation, and (3) a structure-aware refinement module that employs edge connections to link gesture keypoints to improve video generation. | Pinxin Liu; Pengfei Zhang; Hyeongwoo Kim; Pablo Garrido; Ari Shapiro; Kyle Olszewski |
| 52 | Towards Modality Generalization: A Benchmark and Prospective Analysis. Highlight: This paper introduces Modality Generalization (MG), which focuses on enabling models to generalize to unseen modalities. | Xiaohao Liu; Xiaobo Xia; Zhuo Huang; See-Kiong Ng; Tat-Seng Chua |
| 53 | FocusTrack: One-Stage Focus-and-Suppress Framework for 3D Point Cloud Object Tracking. Highlight: However, existing two-stage motion-based approaches suffer from fundamental limitations: (1) error accumulation due to decoupled optimization caused by explicit foreground segmentation prior to motion estimation, and (2) computational bottlenecks from sequential processing. To address these challenges, we propose FocusTrack, a novel one-stage tracking paradigm that unifies motion-semantics co-modeling through two core innovations: Inter-frame Motion Modeling (IMM) and Focus-and-Suppress Attention. | Sifan Zhou; Jiahao Nie; Ziyu Zhao; Yichao Cao; Xiaobo Lu |
| 54 | From Outline to Detail: A Hierarchical End-to-end Framework for Coherent and Consistent Visual Novel Generation and Assembly. Highlight: However, fully end-to-end VN creation (i.e., from user description to executable VN) remains underexplored and presents several key challenges: 1) the hallucination and limited capacity of LLMs hinder the generation of long and coherent plots; 2) current models lack effective mechanisms for ensuring cross-modal consistency between plot, visual, and audio elements. To address these issues, we propose a hierarchical end-to-end framework for automatic VN generation and assembly, which employs an outline-guided autoregressive generation mechanism that transforms high-level user prompts into coherent plots, while a vision LLM-based self-correction mechanism ensures consistency between multimedia assets and plot content. | Yilin Zhang; Yanyan Wei; Zhao Zhang; Jicong Fan; Haijun Zhang; Shuicheng Yan |
| 55 | EIR-SDG: Explore Invariant Representation for Single-source Domain Generalization in Medical Image Segmentation. Highlight: In this paper, we propose EIR-SDG, a novel SDG approach that explores domain-invariant representation for medical image segmentation. | Ziwei Niu; Shiao Xie; Ziyue Wang; Yen-wei Chen; Yueming Jin; Lanfen Lin |
| 56 | Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models Via Reinforcement Learning. Highlight: This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. | Baining Zhao; Ziyou Wang; Jianjie Fang; Chen Gao; Fanhang Man; Jinqiang Cui; Xin Wang; Xinlei Chen; Yong Li; Wenwu Zhu |
| 57 | EmbodiedOcc++: Boosting Embodied 3D Occupancy Prediction with Plane Regularization and Uncertainty Sampler. Highlight: This paper introduces EmbodiedOcc++, enhancing the original framework with two key innovations: a Geometry-guided Refinement Module (GRM) that constrains Gaussian updates through plane regularization, along with a Semantic-aware Uncertainty Sampler (SUS) that enables more effective updates in overlapping regions between consecutive frames. | Hao Wang; Xiaobao Wei; Xiaoan Zhang; Jianing Li; Chengyu Bai; Ying Li; Ming Lu; Wenzhao Zheng; Shanghang Zhang |
| 58 | ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models. Highlight: In contrast, humans naturally re-examine visual content while reasoning. Motivated by this, we introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT), which facilitates more intuitive and cognitively aligned reasoning. | Yongheng Zhang; Xu Liu; Ruihan Tao; Qiguang Chen; Hao Fei; Wanxiang Che; Libo Qin |
| 59 | Character-Centric Understanding of Animated Movies. Highlight: In this work, we propose an audio-visual pipeline to enable automatic and robust animated character recognition, and thereby enhance character-centric understanding of animated movies. | Zhongrui Gui; Junyu Xie; Tengda Han; Weidi Xie; Andrew Zisserman |
| 60 | Multi-view Graph Clustering with Dual Relation Optimization for Remote Sensing Data. Highlight: However, existing approaches often emphasize capturing rich node relations while overlooking the optimization of these relations, leading to noisy connections and weak inter-cluster discrimination. To address this issue, we propose a novel Multi-view Graph Clustering with dual Relation Optimization (MDRO) framework tailored for remote sensing data. | Renxiang Guan; Junhong Li; Siwei Wang; Wenxuan Tu; Miaomiao Li; En Zhu; Xinwang Liu; Ping Chen |
| 61 | DMF2Mel: A Dynamic Multiscale Fusion Network for EEG-Driven Mel Spectrogram Reconstruction. Highlight: Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and information retention in long-sequence decoding. To address this issue, this paper proposes the Dynamic Multiscale Fusion Network (DMF2Mel), which consists of four core components: the Dynamic Contrastive Feature Aggregation Module (DC-FAM), the Hierarchical Attention-Guided Multi-Scale Network (HAMS-Net), the SplineMap attention mechanism, and the bidirectional state space module (convMamba). | Cunhang Fan; Sheng Zhang; Jingjing Zhang; Enrui Liu; Xinhui Li; Gangming Zhao; Zhao Lv |
| 62 | MQuant: Unleashing The Inference Potential of Multimodal Large Language Models Via Static Quantization. Highlight: In this paper, we propose MQuant, a post-training quantization (PTQ) framework designed to tackle the unique challenges of multimodal large language models (MLLMs). | Jiangyong Yu; Sifan Zhou; Dawei Yang; Shuoyu Li; Shuo Wang; Xing Hu; Chen Xu; Zukang Xu; Changyong Shu; Zhihang Yuan |
| 63 | DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models. Highlight: DDL enables the detector to better capture the core semantics of the detection task, thereby enhancing both robustness and generalization. Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance across diverse LLMs. | Jiachen Fu; Chun-Le Guo; Chongyi Li |
| 64 | Spiking Neural Networks with Temporal Attention-Guided Adaptive Fusion for Imbalanced Multi-modal Learning. Highlight: Current approaches suffer from uncoordinated convergence speeds across modalities and static fusion mechanisms that ignore time-varying cross-modal interactions. We propose the temporal attention-guided adaptive fusion framework for multimodal SNNs with two synergistic innovations: 1) The Temporal Attention-guided Adaptive Fusion (TAAF) module that dynamically assigns importance scores to fused spiking features at each timestep, enabling hierarchical integration of temporally heterogeneous spike-based features; 2) The temporal adaptive balanced fusion loss that modulates learning rates per modality based on the above attention scores, preventing dominant modalities from monopolizing optimization. | Jiangrong Shen; Yulin Xie; Qi Xu; Gang Pan; Huajin Tang; Badong Chen |
| 65 | Unveiling The Impact of Multi-modal Content in Multi-modal Recommender Systems. Highlight: We propose ISOLATOR: utIlizing uSer-side cOntent simiLarity via a model-AgnosTic framewORk to leverage multi-modal content more properly. | Guipeng Xv; Xinyu Li; Yi Liu; Chen Lin; Xiaoli Wang |
| 66 | DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation. Highlight: In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a training-free paradigm that can make use of adaptive temporal compression in latent space. | Zhihang Yuan; Siyuan Wang; Yuzhang Shang; Hanling Zhang; Tongcheng Fang; Rui Xie; Shengen Yan; Guohao Dai; Yu Wang |
| 67 | Deep Graph Clustering with Disentangled Representation Learning. Highlight: In this paper, we propose a novel deep graph clustering framework named DisenCluster, which learns disentangled representations to simultaneously consider node separation results from diverse perspectives. | Yifan Wang; Yuntai Ding; Yiyang Gu; Ziyue Qiao; Chong Chen; Xian-Sheng Hua; Ming Zhang; Wei Ju |
| 68 | Text-to-Image Generation with Multi-modal Knowledge Graph Construction and Retrieval. Highlight: Current Text-to-Image (T2I) generation methods struggle to accurately create images with complex object relationships and scene compositions. To overcome these challenges, we propose KAIG, a novel text-to-image generative model that integrates a knowledge graph into the image generation process. | Jiawei Meng; Zhengmao Yang; Zhiqiang Liu; Shaokai Chen; Zhizhen Liu; Wen Zhang; Huajun Chen |
| 69 | SizeGS: Size-aware Compression of 3D Gaussian Splatting Via Mixed Integer Programming. Highlight: While many compression techniques have been proposed, they fail to efficiently adapt to fluctuating network bandwidth, leading to resource wastage. We address this issue from the perspective of size-aware compression, where we aim to compress 3DGS to a desired size by quickly searching for suitable hyperparameters. | Shuzhao Xie; Jiahang Liu; Weixiang Zhang; Shijia Ge; Sicheng Pan; Chen Tang; Yunpeng Bai; Cong Zhang; Xiaoyi Fan; Zhi Wang |
| 70 | TiP4GEN: Text to Immersive Panorama 4D Scene Generation. Highlight: In this paper, we introduce TiP4GEN, an advanced text-to-dynamic panorama scene generation framework that enables fine-grained content control and synthesizes motion-rich, geometry-consistent panoramic 4D scenes. | Ke Xing; Hanwen Liang; Dejia Xu; Yuyang Yin; Konstantinos N. Plataniotis; Yao Zhao; Yunchao Wei |
| 71 | Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval. Highlight: Inspired by systems biology, we propose Reaction-Diffusion Multimodal Fusion (RDMF), a novel framework that reimagines video-language alignment as a reaction-diffusion (RD) process, drawing on the principles of pattern formation introduced by Alan Turing. | Xiang Fang; Wanlong Fang; Wei Ji; Tat-Seng Chua |
| 72 | SynthVLM: Towards High-Quality and Efficient Synthesis of Image-Caption Datasets for Vision-Language Models. Highlight: In this paper, we introduce SynthVLM, a new data synthesis and curation method for generating image-caption pairs. | Zheng Liu; Hao Liang; Bozhou Li; Wentao Xiong; Chong Chen; Conghui He; Wentao Zhang; Bin Cui |
| 73 | TimeChat-Online: 80% Visual Tokens Are Naturally Redundant in Streaming Videos. Highlight: We introduce TimeChat-Online, a novel online VideoLLM that revolutionizes real-time video interaction. | Linli Yao; Yicheng Li; Yuancheng Wei; Lei Li; Shuhuai Ren; Yuanxin Liu; Kun Ouyang; Lean Wang; Shicheng Li; Sida Li; Lingpeng Kong; Qi Liu; Yuanxing Zhang; Xu Sun |
| 74 | NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving. Highlight: Autonomous driving systems have made significant advances in Q&A, perception, prediction, and planning based on local visual information, yet they struggle to incorporate broader navigational context that human drivers routinely utilize. We address this critical gap between local sensor data and global navigation information by proposing NavigScene, an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems. | Qucheng Peng; Chen Bai; Guoxiang Zhang; Bo Xu; Xiaotong Liu; Xiaoyin Zheng; Chen Chen; Cheng Lu |
| 75 | Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought. Highlight: Despite advancements in large-scale vision-language models (VLMs), these models often struggle to capture the nuanced, spatiotemporal details essential for thorough video analysis. To address this gap, we introduce Video-CoT, a groundbreaking dataset designed to enhance spatiotemporal understanding using Chain-of-Thought (CoT) methodologies. | Shuyi Zhang; Xiaoshuai Hao; Yingbo Tang; Lingfeng Zhang; Pengwei Wang; Zhongyuan Wang; Hongxuan Ma; Shanghang Zhang |
| 76 | AirScape: An Aerial Generative World Model with Motion Controllability. Highlight: To explore general spatial imagination capability, we present AirScape, the first world model designed for six-degree-of-freedom aerial agents. | Baining Zhao; Rongze Tang; Mingyuan Jia; Ziyou Wang; Fanhang Man; Xin Zhang; Yu Shang; Weichen Zhang; Wei Wu; Chen Gao; Xinlei Chen; Yong Li |
| 77 | EEmo-Bench: A Benchmark for Multi-modal Large Language Models on Image Evoked Emotion Assessment. Highlight: To this end, we introduce EEmo-Bench, a novel benchmark dedicated to the analysis of the evoked emotions in images across diverse content categories. | Lancheng Gao; Ziheng Jia; Yunhao Zeng; Wei Sun; Yiming Zhang; Wei Zhou; Guangtao Zhai; Xiongkuo Min |
| 78 | CausalCtrl: Causality-Aware Control Framework for Text-Guided Visual Editing. Highlight: However, existing methods ignore confounding effects introduced by the pretrained model, i.e., harmful biases learned from the pretraining datasets, leading to spurious correlations during the editing process. To address this issue, we introduce CausalCtrl, a novel training-free framework that reformulates text-guided visual editing from a causal inference perspective. | Haoxiang Cao; Chaoqun Wang; Yongwen Lai; Shaobo Min; Xuejin Chen |
| 79 | EgoMusic: An Egocentric Augmented Reality Glasses Dataset for Music. Highlight: This paper introduces EgoMusic, a multimodal dataset featuring synchronised egocentric audio-visual data captured with AR glasses during live performances, alongside studio-quality audio references. | Alessandro Ragano; Carl Timothy Tolentino; Kata Szita; Dan Barry; Davoud Shariat Panah; Niall Murray; Andrew Hines |
| 80 | Towards Explainable Partial-AIGC Image Quality Assessment. Highlight: Motivated by this gap, we construct the first large-scale PAI dataset towards explainable partial-AIGC image quality assessment (EPAIQA), the EPAIQA-15K, which includes 15K images with localized AI manipulation in different regions and over 300K multi-dimensional human ratings. Based on this, we leverage large multi-modal models (LMMs) and propose a three-stage model training paradigm. | Jiaying Qian; Ziheng Jia; Zicheng Zhang; Zeyu Zhang; Guangtao Zhai; Xiongkuo Min |
| 81 | Efficient Multi-Slide Visual-Language Feature Fusion for Placental Disease Classification. Highlight: Existing WSI classification methods encounter critical limitations: (1) inadequate patch selection strategies that either compromise performance or fail to sufficiently reduce computational demands, and (2) the loss of global histological context resulting from patch-level processing approaches. To address these challenges, we propose an Efficient multimodal framework for Patient-level placental disease Diagnosis, named EmmPD. | Hang Guo; Qing Zhang; Zixuan Gao; Siyuan Yang; Shulin Peng; Xiang Tao; Ting Yu; Yan Wang; Qingli Li |
| 82 | Benchmarking and Bridging Emotion Conflicts for Multimodal Emotion Reasoning. Highlight: However, evaluations on our CA-MER reveal that current state-of-the-art emotion MLLMs systematically over-rely on the audio signal during emotion conflicts, neglecting critical cues from the visual modality. To mitigate this bias, we propose MoSEAR, a parameter-efficient framework that promotes balanced modality integration. | Zhiyuan Han; Beier Zhu; Yanlong Xu; Peipei Song; Xun Yang |
| 83 | EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation. Highlight: Our work provides essential data and benchmarks for emotion-driven image synthesis and aims to advance fields such as affective computing, multimodal learning, and computational art, enabling applications in art therapy and creative design. | Cheng Zhang; Hongxia Xie; Bin Wen; Songhan Zuo; Ruoxuan Zhang; Wen-Huang Cheng |
| 84 | Look Before You Decide: Prompting Active Deduction of MLLMs for Assumptive Reasoning. Highlight: To unveil their reasoning behaviors, we first curate a Multimodal Assumptive Reasoning Benchmark (MARS-Bench) in this paper. | Yian Li; Wentao Tian; Yang Jiao; Tianwen Qian; Na Zhao; Bin Zhu; Jingjing Chen; Yu-Gang Jiang |
| 85 | T2UE: Generating Unlearnable Examples from Text Descriptions. Highlight: Such a contradiction has severely hindered the development of practical, scalable data protection solutions. To resolve this paradox, we introduce Text-to-Unlearnable Example (T2UE), a novel framework that enables users to generate UEs using only text descriptions. | Xingjun Ma; Hanxun Huang; Tianwei Song; Ye Sun; Yifeng Gao; Yu-Gang Jiang |
| 86 | JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering. Highlight: This oversight frequently leads to low-quality outputs that, while successful in bypassing safety filters, lack substantial harmful content. To address this gap, we propose JPS, Jailbreak MLLMs with collaborative visual Perturbation and textual Steering, which achieves jailbreaks via the cooperation of visual image perturbations and textual steering prompts. | Renmiao Chen; Shiyao Cui; Xuancheng Huang; Chengwei Pan; Victor Shea-Jay Huang; QingLin Zhang; Xuan Ouyang; Zhexin Zhang; Hongning Wang; Minlie Huang |
| 87 | VQA²: Visual Question Answering for Video Quality Assessment. Highlight: Nevertheless, related work has not been explored in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA² Instruction Dataset, the first visual question answering instruction dataset that focuses on video quality assessment. | Ziheng Jia; Zicheng Zhang; Jiaying Qian; Haoning Wu; Wei Sun; Chunyi Li; Xiaohong Liu; Weisi Lin; Guangtao Zhai; Xiongkuo Min |
| 88 | Learning The Anchors with Similar Distributions to Original Data for Multi-view Clustering. Highlight: To generate anchors whose distributions are similar to those of the original data, in this paper we carefully devise a LASD algorithm from the perspective of optimal transport (OT). | Junpu Zhang; Shengju Yu; Suyuan Liu; Siwei Wang; Miaomiao Li; Xinwang Liu; En Zhu; Kunlun He |
| 89 | Query-Based Audio-Visual Temporal Forgery Localization with Register-Enhanced Representation Learning. Highlight: While video-level detection has advanced, Temporal Forgery Localization (TFL) remains underexplored, often limited by weak audio-visual modeling and reliance on non-learnable post-processing. To address these challenges, we propose RegQAV, a Register-enhanced Query-based Audio-Visual framework for TFL. | Xiaodong Zhu; Suting Wang; Junqi Yang; Yuhong Yang; Weiping Tu; Zhongyuan Wang |
| 90 | MuCodec: Ultra Low-Bitrate Music Codec for Music Generation. Highlight: This highlights the need for high-compression, high-fidelity music codecs that can reconstruct both vocals and accompaniment with high quality at low frame rates and bitrates, thereby better assisting music generation. To address this, we introduce MuCodec, designed for high-quality music reconstruction at ultra-low bitrates, facilitating more efficient music generation. | Yaoxun Xu; Hangting Chen; Jianwei Yu; Wei Tan; Shun Lei; Zhiwei Lin; Rongzhi Gu; Zhiyong Wu |
| 91 | Collaborative Cloud-edge Generalized Category Discovery. Highlight: Data from different environments or clients cannot be shared; only model parameters can be transferred. To tackle this problem, we propose a novel GCD framework based on energy-guided known class discrimination and multi-level contrastive learning. | Yingbing Liu; Fei Ma; Yanan Wu; Xinxin Zuo; Fan Zhang; Yang Wang |
| 92 | MRED-14: A Benchmark for Low-Energy Residential Floor Plan Generation with 14 Flexible Inputs. Highlight: To address these challenges, we propose MRED-14, the first large-scale Multimodal Residential Energy Dataset, comprising 14 input types, including energy consumption values, vector drawings, and textual descriptions, paired with 41,280 high-quality residential floor plans that have been scored and annotated by human experts. Based on this dataset, we introduce the LER-net model, which can flexibly adapt to various input types and generate low-energy residential floor plans. | Pengyu Zeng; Jun Yin; Haoyuan Sun; Yuqin Dai; Maowei Jiang; Miao Zhang; Shuai Lu |
| 93 | ShieldVLM: Safeguarding The Multimodal Implicit Toxicity Via Deliberative Reasoning with LVLMs. Highlight: Despite the success in unimodal text or image moderation, toxicity detection for multimodal content, particularly the multimodal implicit toxicity, remains underexplored. To fill this gap, we comprehensively build a taxonomy for multimodal implicit toxicity (MMIT) and introduce an MMIT-dataset, comprising 2,100 multimodal statements and prompts across 7 risk categories (31 sub-categories) and 5 typical cross-modal correlation modes. | Shiyao Cui; QingLin Zhang; Xuan Ouyang; Renmiao Chen; Zhexin Zhang; Yida Lu; Hongning Wang; Han Qiu; Minlie Huang |
| 94 | ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems. Highlight: However, their direct application to ADS is hindered by challenges such as misunderstanding of traffic knowledge, complex road conditions, and diverse vehicle states. To address these challenges, we propose the use of Knowledge Editing, which enables targeted modifications to a model’s behavior without the need for full retraining. | Chenxi Wang; Jizhan Fang; Xiang Chen; Bozhong Tian; Ziwen Xu; Huajun Chen; Ningyu Zhang |
| 95 | Building Embodied EvoAgent: A Brain-inspired Paradigm for Bridging Multimodal Large Models and World Models. Highlight: Inspired by the functional specialization of the left and right hemispheres of the human brain, this paper proposes a brain-inspired learning and evolution paradigm for embodied agents. | Junyu Gao; Xuan Yao; Yong Rui; Changsheng Xu |
| 96 | UIS-Mamba: Exploring Mamba for Underwater Instance Segmentation Via Dynamic Tree Scan and Hidden State Weaken. Highlight: In this work, we propose the first Mamba-based underwater instance segmentation model UIS-Mamba, and design two innovative modules, Dynamic Tree Scan (DTS) and Hidden State Weaken (HSW), to migrate Mamba to the underwater task. | Runmin Cong; Zongji Yu; Hao Fang; Haoyan Sun; Sam Kwong |
| 97 | Proactive Deepfake Detection Via Self-Verifiable Semantic Watermarking. Highlight: To proactively defend against Deepfakes, we propose SVS-WM, a Self-Verifiable Semantic Watermarking strategy. | Peiqi Jiang; Bohan Lei; Yuhao Sun; Lingyun Yu; Zhineng Chen; Hongtao Xie; Yongdong Zhang |
| 98 | 3DAffordSplat: Efficient Affordance Reasoning with 3D Gaussians. Highlight: This dataset includes 23,672 Gaussian instances, 8,231 point cloud instances, and 6,631 manually annotated affordance labels, encompassing 21 object categories and 18 affordance types. Building upon this dataset, we introduce AffordSplatNet, a novel model specifically designed for affordance reasoning using 3DGS representations. | Zeming Wei; Junyi Lin; Yang Liu; Weixing Chen; Jingzhou Luo; Guanbin Li; Liang Lin |
| 99 | Less Is More: High-value Data Selection for Visual Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Based on the findings, we propose a high-value data selection approach, TIVE, to eliminate redundancy within the visual instruction data and reduce the training cost. |
Zikang Liu; Kun Zhou; Wayne Xin Zhao; Dawei Gao; Yaliang Li; Ji-Rong Wen; |
| 100 | DiffusionMat: Alpha Matting As Deterministic Sequential Refinement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce DiffusionMat, a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes. |
Yangyang Xu; Shengfeng He; Wenqi Shao; Yong Du; Kwan-Yee K. Wong; Yu Qiao; Jun Yu; Ping Luo; |
| 101 | HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present HiScene, a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation and delivers high-fidelity scenes with compositional identities and aesthetic scene content. |
Wenqi Dong; Bangbang Yang; Zesong Yang; Yuan Li; Tao Hu; Hujun Bao; Yuewen Ma; Zhaopeng Cui; |
| 102 | GraphVideoAgent: Enhancing Long-form Video Understanding with Entity Relation Graphs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: As a result, they struggle to capture the evolving relations among entities and fail to maintain identity consistency when entities temporarily leave and later reappear in the video. These limitations prevent accurate keyframe localization and coherent reasoning. In this paper, we propose GraphVideoAgent, a novel agent-based LVU framework that integrates a dynamic entity relation graph with large language model (LLM)-based multi-round reasoning. |
Meng Chu; Yicong Li; Tat-Seng Chua; |
| 103 | TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. |
Zhiming Ma; Peidong Wang; Minhua Huang; Jinpeng Wang; Kai Wu; Xiangzhao Lv; Yachun Pang; Yin Yang; Wenjie Tang; Yuchen Kang; |
| 104 | ColorDiffuser: Video Colorization with Pretrained Text-to-Image Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present ColorDiffuser, an adaptation of a pre-trained text-to-image latent diffusion model for video colorization. |
Hanyuan Liu; Minshan Xie; Jinbo Xing; Chengze Li; Chi-Sing Leung; Tien-Tsin Wong; |
| 105 | FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching Based Voice Enhancing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a novel dubbing architecture based on Large Language Model (LLM) and Conditional Flow Matching (CFM), named FlowDubber, which achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model with dual contrastive alignment while improving acoustic quality via Flow-based Voice Enhancing (FVE). |
Gaoxiang Cong; Liang Li; Jiadong Pan; Zhedong Zhang; Amin Beheshti; Anton van den Hengel; Yuankai Qi; Qingming Huang; |
| 106 | LargeMvC-Net: Anchor-based Deep Unfolding Network for Large-scale Multi-view Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Such designs overlook the core structural demands of anchor-based clustering, neglecting key optimization principles. To bridge this gap, we revisit the underlying optimization problem of large-scale anchor-based multi-view clustering and unfold its iterative solution into a novel deep network architecture, termed LargeMvC-Net. |
Shide Du; Chunming Wu; Zihan Fang; Wendi Zhao; Yilin Wu; Changwei Wang; Shiping Wang; |
| 107 | OmniGen: Unified Multimodal Sensor Generation for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing approaches primarily focus on single-modality generation, leading to inefficiencies and misalignment in multimodal sensor data. To address these challenges, we propose OmniGen, which generates aligned multimodal sensor data in a unified framework. |
Tao Tang; Enhui Ma; Xia Zhou; Letian Wang; Tianyi Yan; Xueyang Zhang; Kun Zhan; Peng Jia; Xianpeng Lang; Jia-Wang Bian; Kaicheng Yu; Xiaodan Liang; |
| 108 | Learning Arbitrary-Scale RAW Image Downscaling with Wavelet-based Recurrent Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a wavelet-based recurrent reconstruction framework that leverages the information-lossless attribute of wavelet transformation to fulfill the arbitrary-scale RAW image downscaling in a coarse-to-fine manner, in which the Low-Frequency Arbitrary-Scale Downscaling Module (LASDM) and the High-Frequency Prediction Module (HFPM) are proposed to preserve structural and textural integrity of the reconstructed low-resolution (LR) RAW images, alongside an energy-maximization loss to align high-frequency energy between the HR and LR domains. |
Yang Ren; Hai Jiang; Wei Li; Menglong Yang; Heng Zhang; Zehua Sheng; Qingsheng Ye; Shuaicheng Liu; |
| 109 | LMME3DHF: Benchmarking and Evaluating Multimodal 3D Human Face Generation with LMMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Based on Gen3DHF, we propose LMME3DHF, a Large Multimodal Model (LMM)-based metric for Evaluating 3DHF capable of quality and authenticity score prediction, distortion-aware visual question answering, and distortion-aware saliency prediction. |
Woo Yi Yang; Jiarui Wang; Sijing Wu; Huiyu Duan; Yuxin Zhu; Liu Yang; Kang Fu; Guangtao Zhai; Xiongkuo Min; |
| 110 | DFBench: Benchmarking Deepfake Image Detection Capability of Large Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Based on DFBench, we propose MoA-DF, Mixture of Agents for DeepFake detection, leveraging a combined probability strategy from multiple LMMs. |
Jiarui Wang; Huiyu Duan; Juntong Wang; Ziheng Jia; Woo Yi Yang; Xiaorong Zhu; Yu Zhao; Jiaying Qian; Yuke Xing; Guangtao Zhai; Xiongkuo Min; |
| 111 | AtlantisGS: Underwater Sparse-View Scene Reconstruction Via Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present AtlantisGS, a novel underwater scene reconstruction method, which only requires sparse-view inputs. |
Jingjun Yi; Qi Bi; Hao Zheng; Huimin Huang; Haolan Zhan; Yixian Shen; Wei Ji; Yawen Huang; Yuexiang Li; Xian Wu; Yefeng Zheng; |
| 112 | Wavelet-GS: 3D Gaussian Splatting with Wavelet Decomposition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing 3DGS approaches usually generate plausible outputs and face significant challenges in complex scene reconstruction, manifesting as incomplete holistic structural outlines and unclear local lighting effects. To address these issues simultaneously, we propose a novel decoupled optimization framework, which integrates wavelet decomposition into 3D Gaussian Splatting and 2D sampling. |
Beizhen Zhao; Yifan Zhou; Sicheng Yu; Zijian Wang; Hao Wang; |
| 113 | EditMaster: Bridging Text Instruction and Visual Example for Multimodal Guided Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose EditMaster, a unified framework that integrates text and visual controls through multimodal instruction learning, enabling precise image manipulation with bidirectional consistency. |
Jiahui Zhang; Mengtian Li; Jiewei Tang; Junyu Deng; Siyu Tian; Xiang Liu; Meng Zhang; Guangnan Ye; Yu-Gang Jiang; |
| 114 | RoboAfford: A Dataset and Benchmark for Enhancing Object and Spatial Affordance Learning in Robot Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This limitation stems from the lack of annotations for object and spatial affordance data in their training datasets. To address this gap, we introduce RoboAfford, a novel large-scale dataset designed to enhance object and spatial affordance learning in robot manipulation. |
Yingbo Tang; Lingfeng Zhang; Shuyi Zhang; Yinuo Zhao; Xiaoshuai Hao; |
| 115 | MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To date, there is a notable lack of rigorous benchmarks that assess Multimodal Large Language Models (MLLMs) within the financial domain, a field characterized by specialized financial charts and complex domain-specific expertise. To address this gap, we introduce MME-Finance, the first comprehensive bilingual multimodal benchmark tailored for financial analysis. |
Ziliang Gan; Dong Zhang; Haohan Li; Yang Wu; Xueyuan Lin; Ji Liu; Haipang Wu; Chaoyou Fu; Zenglin Xu; Rongjunchen Zhang; Yong Dai; |
| 116 | Inter-Task Weaving in Image Enhancement: From A New Unified Architecture to A Better Meta-Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, we investigate the inter-task weaving from both structural and parametric perspectives. |
Nan An; Siqi Xu; Long Ma; Zhu Liu; Guangchao Han; Tengyu Ma; Risheng Liu; |
| 117 | StePO-Rec: Towards Personalized Outfit Styling Assistant Via Knowledge-Guided Multi-Step Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Advancements in Generative AI offer new opportunities for FashionAI, surpassing traditional recommendation systems that often lack transparency and struggle to integrate expert knowledge, leaving the potential of personalized fashion styling untapped. To address these challenges, we present PAFA (Principle-Aware Fashion), a multi-granular knowledge base that organizes professional styling expertise into three levels of metadata, domain principles, and semantic relationships. |
Yuxi Bi; Yunfan Gao; Haofen Wang; |
| 118 | CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. |
Ruoxuan Zhang; Bin Wen; Hongxia Xie; Yi Yao; Songhan Zuo; Jian-Yu Jiang-Lin; Hong-Han Shuai; Wen-Huang Cheng; |
| 119 | RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present RecipeGen, the first large-scale, real-world benchmark for recipe-based Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation. |
Ruoxuan Zhang; Jidong Gao; Bin Wen; Hongxia Xie; Chenming Zhang; Hong-Han Shuai; Wen-Huang Cheng; |
| 120 | WhiADD: Semantic-Acoustic Fusion for Robust Audio Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper addresses the critical challenge of detecting codec-based audio deepfakes in multilingual and dynamically evolving adversarial scenarios. |
Jianqiao Cui; Bingyao Yu; Qihao Wang; Fei Meng; Jiwen Lu; |
| 121 | Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora, which introduces several design improvements to expand its capabilities in both appearance and motion customization. |
Zhenghao Zhang; Junchao Liao; Xiangyu Meng; Long Qin; Weizhi Wang; |
| 122 | FSCDiff: Frequency-Spatial Entangled Conditional Diffusion Model for Underwater Salient Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To improve the accuracy and robustness of underwater salient object detection, different from the existing spatial domain aware RGB-D methods that rely on pixel-level probabilities, we propose a novel Fourier-Spatial Entangled Conditional Diffusion model (FSCDiff) for underwater salient object detection. |
Hua Li; Gaowei Lin; Zhiyuan Li; Sam Kwong; Runmin Cong; |
| 123 | Enhanced Motion-aware Latent Diffusion Models for Video Frame Interpolation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose an Enhanced Motion-Aware latent Diffusion model (EMADiff) for video frame interpolation. |
Zhilin Huang; Chujun Qin; Yifei Xing; Wenming Yang; |
| 124 | Multimodal Emotion Recognition with Missing Modality Via A Unified Multi-task Pre-training Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In contrast, we propose a Unified Multi-task Pre-training (UMAP) framework based on the mixture of experts structure. |
Ziyi Li; Wei-Long Zheng; Bao-Liang Lu; |
| 125 | MelodyEdit: Zero-shot Music Editing with Disentangled Inversion Control Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, existing editing methods struggle with conducting complex non-rigid music edits while maintaining content integrity and high fidelity. To address these challenges, we propose MelodyEdit, a novel zero-shot music editing system based on an innovative Disentangled Inversion Control (DIC) technique, which comprises Harmonized Attention Control and Disentangled Inversion. |
Huadai Liu; Jialei Wang; Xiangtai Li; Wen Wang; Qian Chen; Rongjie Huang; Yang Liu; Jiayang Xu; Zhou Zhao; Wei Xue; |
| 126 | Flexible Multi-view Clustering with Dynamic Views Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose the Flexible Multi-view Clustering with Dynamic Views Generation (FMCDVG). |
Yalan Qin; Nan Pu; Hanzhou Wu; Zhaoxin Fan; |
| 127 | The Best Is Yet to Come: Graph Convolution in The Testing Phase for Multimodal Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose FastMMRec, a highly efficient multimodal recommendation framework that deploys graph convolutions exclusively during the testing phase, bypassing their use in training. |
Jinfeng Xu; Zheyu Chen; Shuo Yang; Jinze Li; Edith C. H. Ngai; |
| 128 | TextSplat: Text-Guided Semantic Fusion for Generalizable Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, while many methods focus on geometric consistency, they often neglect the potential of text-driven guidance to enhance semantic understanding, which is crucial for accurately reconstructing fine-grained details in complex scenes. To address this limitation, we propose TextSplat, the first text-driven Generalizable Gaussian Splatting framework. |
Zhicong Wu; Hongbin Xu; Gang Xu; Ping Nie; Zhixin Yan; Jinkai Zheng; Liangqiong Qu; Ming Li; Liqiang Nie; |
| 129 | Financial Models Meets Generative Art: Black-Scholes-Inspired Concept Blending in Text-to-Image Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a novel approach for concept blending in pretrained text-to-image diffusion models, aiming to generate images at the intersection of multiple text prompts. |
Divya Kothandaraman; Ming Lin; Dinesh Manocha; |
| 130 | ALLM4ADD: Unlocking The Capabilities of Audio Large Language Models for Audio Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose ALLM4ADD, an ALLM-driven framework for ADD. |
Hao Gu; Jiangyan Yi; Chenglong Wang; Jianhua Tao; Zheng Lian; Jiayi He; Yong Ren; Yujie Chen; Zhengqi Wen; |
| 131 | Adaptive Graph Attention-Guided Parallel Sampling and Embedded Selection for Multi-Model Fitting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods rely on inefficient sequential hypothesize-and-verify frameworks that require a predefined number of models and inlier thresholds, parameters that are difficult to determine in practical scenes. To overcome these limitations, we propose a novel Adaptive Graph Attention-guided parallel multi-model fitting method (AGASAC) that jointly learns local and global features, performs parallel hypothesis sampling, and executes confidence-embedded model selection. |
Wenyu Yin; Shuyuan Lin; David Suter; Hanzi Wang; |
| 132 | ISDrama: Immersive Spatial Drama Generation Through Multimodal Prompting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We provide the dataset and the evaluation code at https://huggingface.co/datasets/AaronZ345/MRSDrama and https://github.com/AaronZ345/ISDrama. |
Yu Zhang; Wenxiang Guo; Changhao Pan; Zhiyuan Zhu; Tao Jin; Zhou Zhao; |
| 133 | LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. |
Wenhui Song; Hanhui Li; Jiehui Huang; Panwen Hu; Yuhao Cheng; Long Chen; Yiqiang Yan; Xiaodan Liang; |
| 134 | DSF-Net: Dynamic Sparse Fusion of Event-RGB Via Spike-Triggered Attention for High-Speed Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Conventional RGB cameras struggle in high-speed vision due to motion blur (above 60Hz sampling) and limited dynamic range (<60dB). To address these limitations, we propose a multimodal framework integrating event cameras, leveraging their microsecond temporal resolution (1μs) and 140dB dynamic range. |
Dongyang Ma; Zhengyu Ma; Wei Zhang; Yonghong Tian; |
| 135 | Deep-Plant-Disease Dataset Is All You Need for Plant Disease Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Additionally, the lack of standardized evaluation protocols and benchmark datasets hinders the fair evaluation of models against these challenges. To bridge this gap, we introduce Deep-Plant-Disease, the largest and most diverse dataset with novel text data designed to enhance model generalization in multi-crop disease identification. |
Abel Yu Hao Chai; Kelly Li Zhen Jee; Sue Han Lee; Fei Siang Tay; Jules Vandeputte; Hervé Goeau; Pierre Bonnet; Alexis Joly; |
| 136 | HateClipSeg: A Segment-Level Annotated Dataset for Fine-Grained Hate Video Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose three tasks to benchmark performance: (1) Trimmed Hateful Video Classification, (2) Temporal Hateful Video Localization, and (3) Online Hateful Video Classification. |
Han Wang; Zhuoran Wang; Roy Ka-Wei Lee; |
| 137 | Phys4DGen: Physics-Compliant 4D Generation with Multi-Material Composition Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current approaches typically incorporate physical priors to animate 3D representations, but these methods suffer from significant limitations: they not only require users lacking physics expertise to manually specify material properties but also struggle to effectively handle the generation of multi-material composite objects. To address these challenges, we propose Phys4DGen, a novel 4D generation framework that integrates multi-material composition perception with physical simulation. |
Jiajing Lin; Zhenzhong Wang; Dejun Xu; Shu Jiang; Yunpeng Gong; Min Jiang; |
| 138 | GM-DF: Generalized Multi-Scenario Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose a novel Mixture-of-Experts framework, termed GM-DF, that decouples domain-specific and domain-invariant features to tackle cross-domain face forgery detection. |
Yingxin Lai; Hongyang Wang; Jing Yang; Xiangui Kang; Bin Li; Linlin Shen; Zitong Yu; |
| 139 | Visual Grounding with Attention-Driven Constraint Balancing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To tackle this problem, in this paper, we first analyze the attention mechanisms of transformer-based models. Building upon this, we propose a novel framework named Attention-Driven Constraint Balancing (AttBalance) to optimize the behavior of visual features within language-relevant regions. |
Weitai Kang; Luowei Zhou; Junyi Wu; Changchang Sun; Yan Yan; |
| 140 | UAV-ON: A Benchmark for Open-World Object Goal Navigation with Aerial Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, most existing research follows the Vision-and-Language Navigation (VLN) paradigm, which heavily depends on sequential linguistic instructions, limiting its scalability and autonomy. To address this gap, we introduce UAV-ON, a benchmark for large-scale Object Goal Navigation (ObjectNav) by aerial agents in open-world environments, where agents operate based on high-level semantic goals without relying on detailed instructional guidance as in VLN. |
Jianqiang Xiao; Yuexuan Sun; Yixin Shao; Boxi Gan; Rongqiang Liu; Yanjin Wu; Weili Guan; Xiang Deng; |
| 141 | Generative AI for Multimedia Communication: Recent Advances, An Information-Theoretic Framework, and Future Opportunities Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose an innovative semantic information-theoretic framework, introducing semantic entropy, mutual information, channel capacity, and rate-distortion concepts specifically adapted to multimedia applications. |
Yili Jin; Xue Liu; Jiangchuan Liu; |
| 142 | MIHBench: Benchmarking and Mitigating Multi-Image Hallucinations in Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Through extensive evaluation, we identify key factors associated with the occurrence of multi-image hallucinations, including: (1) a progressive relationship between the number of image inputs and the likelihood of hallucination occurrences; (2) a strong correlation between single-image hallucination tendencies and those observed in multi-image contexts; and (3) the influence of same object image ratios and the positional placement of negative samples within image sequences on the occurrence of object identity consistency hallucination. To address these challenges, we propose a Dynamic Attention Balancing (DAB) mechanism that adjusts inter-image attention distributions while preserving the overall visual attention proportion. |
Jiale Li; Mingrui Wu; Zixiang Jin; Hao Chen; Jiayi Ji; Xiaoshuai Sun; Liujuan Cao; Rongrong Ji; |
| 143 | SAM-TTT: Segment Anything Model Via Reverse Parameter Configuration and Test-Time Training for Camouflaged Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a new Segment Anything Model (SAM) that leverages reverse parameter configuration and test-time training to enhance its performance on Camouflaged Object Detection (COD), named SAM-TTT. |
Zhenni Yu; Li Zhao; Guobao Xiao; Xiaoqin Zhang; |
| 144 | DAPT: Domain-Aware Prompt-Tuning for Multimodal Fake News Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods often struggle with: (1) Insufficient intrinsic domain adaptation during representation learning; (2) Amplified negative transfer from entangled domain style and content representations; and (3) Neglecting domain-varying modality uncertainty. To address these issues, we propose Domain-Aware Prompt Tuning (DAPT), an innovative framework for multimodal multi-domain fake news detection. |
Yu Tong; Weihai Lu; Xiaoxi Cui; Yifan Mao; Zhejun Zhao; |
| 145 | Mitigating Long-tail Distribution in Oracle Bone Inscriptions: Dataset, Model, and Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Oracle bone inscription (OBI) recognition plays a significant role in understanding the history and culture of ancient China. However, the existing OBI datasets suffer from a … |
Jinhao Li; Zijian Chen; Runze Jiang; Tingzhu Chen; Changbo Wang; Guangtao Zhai; |
| 146 | MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Previous efforts to fine-tune MLLMs using supervised fine-tuning (SFT) have struggled, as SFT relies on accurate target annotations, which are often flawed in user-generated alt-text. To address this, we propose Multi-faceted Cross-modal Direct Preference Optimization (MCM-DPO), which improves alt-text generation by learning to identify better options in preference pairs without requiring precise annotations. |
Jinlan Fu; Shenzhen Huangfu; Hao Fei; Yichong Huang; Xiaoyu Shen; Xipeng Qiu; See-Kiong Ng; |
| 147 | ACMamba: Fast Unsupervised Anomaly Detection Via An Asymmetrical Consensus State Space Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our key observation is that, during training, not all samples within the same homogeneous area are indispensable, whereas ingenious sampling can provide a powerful substitute for reducing costs. Motivated by this, we propose an Asymmetrical Consensus State Space Model (ACMamba) to significantly reduce computational costs without compromising accuracy. |
Guanchun Wang; Xiangrong Zhang; Yifei Zhang; Zelin Peng; Tianyang Zhang; Xu Tang; Licheng Jiao; |
| 148 | InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. |
Kun-Hsiang Lin; Yu-Wen Tseng; Kang-Yang Huang; Jhih-Ciang Wu; Wen-Huang Cheng; |
| 149 | Flowing Crowd to Count Flows: A Self-Supervised Framework for Video Individual Counting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To reduce the high costs of manual annotation, we introduce VIC-SSL, a novel self-supervised learning approach that utilizes unlabeled data along with the innovative feature-level augmentation technique called Foreground-driven ShiftMix (F-ShiftMix). |
Feng-Kai Huang; Bo-Lun Huang; Li-Wu Tsao; Jhih-Ciang Wu; Hong-Han Shuai; Wen-Huang Cheng; |
| 150 | OinkTrack: An Ultra-Long-Term Dataset for Multi-Object Tracking and Re-Identification of Group-Housed Pigs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing multi-object tracking datasets rarely capture the combined difficulty of these real-world conditions. To address this, we introduce OinkTrack, a large-scale benchmark for continuous multi-pig tracking in commercial farm environments. |
Feng-Kai Huang; Hong-Wei Xu; Chu-Chuan Lee; Hong-Yi Tu; Hong-Han Shuai; Wen-Huang Cheng; |
| 151 | A Motion Is Worth A Hybrid Sentence: Taming Language Model for Unified Motion Generation By Fine-grained Planning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To obtain a large corpus of hybrid motion sentences, we introduced a novel motion-to-text generation method that combines atomic motion operators with GPT-4o, resulting in 68.2 million fine-grained textual descriptions across diverse modalities. |
Ronghui Li; Lingxiao Han; Shi Shu; Yueyao Liu; Yukang Lin; Yue Ma; Jie Guo; Ziwei Liu; Xiu Li; |
| 152 | PFDepth: Heterogeneous Pinhole-Fisheye Joint Depth Estimation Via Distortion-aware Gaussian-Splatted Volumetric Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present the first pinhole-fisheye framework for heterogeneous multi-view depth estimation, PFDepth. |
Zhiwei Zhang; Ruikai Xu; Weijian Zhang; Zhizhong Zhang; Xin Tan; Jingyu Gong; Yuan Xie; Lizhuang Ma; |
| 153 | DreamFrame: Enhancing Video Understanding Via Automatically Generated QA and Style-Consistent Keyframes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While current LVLMs are primarily trained on existing datasets in broad, general-purpose settings, adapting them to specific downstream scenarios remains challenging, as collecting and annotating task-specific videos is highly labor-intensive and time-consuming. To address this issue, we propose a three-stage framework named DreamFrame for automatically generating style-consistent keyframes and corresponding question-answer (QA) pairs to support LVLM instruction tuning. |
Zhende Song; Chenchen Wang; Jiamu Sheng; Chi Zhang; Shengji Tang; Jiayuan Fan; Tao Chen; |
| 154 | PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the large number of parameters leads to high training resource demands and low inference efficiency. To address this issue, we propose PUMA (Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning), an efficient approach to enhancing unified retrieval capabilities from both structure and learning perspectives: 1) from the perspective of model structure, we analyze and propose a Layer-Pruned Self-Distillation approach. |
Yibo Lyu; Rui Shao; Gongwei Chen; Yijie Zhu; Weili Guan; Liqiang Nie; |
| 155 | Inversion-DPO: Precise and Efficient Post-Training for Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these approaches typically require computation-intensive training of a base model and a reward model, which not only incurs substantial computational overhead but may also compromise model accuracy and training efficiency. To address these limitations, we propose Inversion-DPO, a novel alignment framework that circumvents reward modeling by reformulating Direct Preference Optimization (DPO) with DDIM inversion for DMs. |
Zejian Li; Yize Li; Chenye Meng; Zhongni Liu; Ling Yang; Shengyuan Zhang; Guang Yang; Changyuan Yang; Zhiyuan Yang; Lingyun Sun; |
| 156 | Tensor-based Opposing Yet Complementary Learning for Multi-view Multi-label Feature Selection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a novel tensor-based method for view-specific label learning that integrates adaptive weight mechanisms into both global non-linear and local linear mappings. |
Pingting Hao; Huijie Zhang; Yongshan Zhang; |
| 157 | FedRog: Robust Federated Graph Classification for Strong Heterogeneity and High-Noise Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, real-world federated scenarios often suffer from severe data heterogeneity and label noise, which significantly degrade model performance. To address these challenges, we propose FedRog, a robust and personalized federated graph neural network framework that improves generalization under non-IID and noisy label settings. |
De Li; Zhou Tan; Qiyu Li; Zeming Gan; Tiange Xia; Jinyan Wang; Xianxian Li; |
| 158 | FedAPT: Federated Adversarial Prompt Tuning for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce Federated Adversarial Prompt Tuning (FedAPT), a novel method designed to enhance the adversarial robustness of FPT. |
Kun Zhai; Siheng Chen; Xingjun Ma; Yu-Gang Jiang; |
| 159 | Do Existing Testing Tools Really Uncover Gender Bias in Text-to-Image Models? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Based on our findings, we propose an enhanced detector called CLIP-Enhance, which most accurately measures the gender bias in T2I models, with a difference of only 0.47%-1.23%, and most effectively filters out 82.91% of low-quality images. We have made our dataset and code publicly available. |
Yunbo Lyu; Zhou Yang; Yuqing Niu; Jing Jiang; David Lo; |
| 160 | GPT-ReID: Learning Fine-grained Representation with GPT for Text-based Person Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Motivated by the recent progress of large language models (LLMs), we propose a novel method named GPT-ReID for TBPR, which aims to leverage the strong comprehension of LLMs to alleviate the overfitting risk. |
Xudong Wang; Lei Tan; Pingyang Dai; Liujuan Cao; Rongrong Ji; |
| 161 | Cross Paradigm Representation and Alignment Transformer for Image Deraining Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, irregular rain patterns and complex geometric overlaps challenge single-paradigm architectures, necessitating a unified framework to integrate complementary global-local and spatial-channel representations. To address this, we propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer). |
Shun Zou; Yi Zou; Juncheng Li; Guangwei Gao; Guo-Jun Qi; |
| 162 | Towards Harmless Multimodal Assistants with Blind Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We also identify two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. |
Yongqi Li; Lu Yang; Jian Wang; Runyang You; Wenjie Li; Liqiang Nie; |
| 163 | Context-aware Image-to-Music Generation Via Bridging Modalities Through Musical Captions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Therefore, if musical captions effectively convey the intended message of the image, they serve as an excellent intermediary between the images and music. The proposed method connects these two different modalities through the medium of musical captions that describe the specialized content of the music. |
Shilin Liu; Kyohei Kamikawa; Keisuke Maeda; Takahiro Ogawa; Miki Haseyama; |
| 164 | CITR: Efficient Long Video Understanding Needs Causal Importance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we emphasize the need for causal importance estimation, where a token’s relevance is determined only from prior context, to enable efficient, real-time long video understanding. |
Ziqi Yuan; Jun Li; Yanghao Li; Yuxiang Huang; Chi Chen; Shuo Wang; Zhinan Gou; |
| 165 | Towards Multi-Scenario Forecasting of Building Electricity Loads with Multimodal Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address these, we propose MMLoad, a novel diffusion-based multimodal framework for multi-scenario building load forecasting with three innovations: (i) a Multimodal Data Enhancement Pipeline generating rich building descriptions using LLMs and integrating temporal factors to analyze multimodal impacts; (ii) a Cross-modal Relation Encoder discovering latent interdependencies through hierarchical fusion, projecting buildings into a unified spatio-temporal (ST) embedding space; and (iii) a Scenario-Conditioned Diffusion Generator employing transformer-based denoising with Scenario-Adaptive Normalization (SAN) for diverse trajectory generation with uncertainty quantification. |
Yongzheng Liu; Siru Zhong; Gefeng Luo; Weilin Ruan; Yuxuan Liang; |
| 166 | VISA: Group-wise Visual Token Selection and Aggregation Via Graph Summarization for Efficient MLLMs Inference Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we introduce a novel method called group-wise Visual token Selection and Aggregation (VISA) to address the issue of inefficient inference stemming from excessive visual tokens in multimodal large language models (MLLMs). |
Pengfei Jiang; Hanjun Li; Linglan Zhao; Fei Chao; Ke Yan; Shouhong Ding; Rongrong Ji; |
| 167 | FVQ: A Large-Scale Dataset and An LMM-based Method for Face Video Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA dataset, FVQ-20K, which contains 20,000 in-the-wild face videos together with corresponding mean opinion score (MOS) annotations. |
Sijing Wu; Yunhao Li; Ziwen Xu; Yixuan Gao; Huiyu Duan; Wei Sun; Guangtao Zhai; |
| 168 | HVEval: Towards Unified Evaluation of Human-Centric Video Generation and Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Based on the HVEval dataset, this paper aims to answer three questions: (1) can today’s T2V models effectively generate human-centric videos following the given prompts? |
Sijing Wu; Yunhao Li; Huiyu Duan; Yanwei Jiang; Yucheng Zhu; Guangtao Zhai; |
| 169 | Event Consistency-aware Robust Fake News Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: With the rapid development of short video platforms (such as Kuaishou and TikTok), these platforms have increasingly become important channels for the spread of fake news. Therefore, multi-modal fake news detection has attracted extensive attention. Existing studies mainly focus on directly integrating multi-modal information or discovering implicit clues in posts to improve detection performance. However, due to the abuse of video editing techniques, event-irrelevant segments (e.g., advertisements) are frequently mixed into videos, introducing noise information, thereby weakening models’ ability to learn crucial information. Moreover, video creators often inject personal tampered information into original news content through audio modality manipulation, potentially distorting the facts. To address these challenges, we propose a novel Event Consistency-aware Robust Fake News Detection (ECR-FND) framework, comprising two key components: an Event-aware Video Denoising Learning (EVDL) module and an Audio Tampering-information Capturing Module (ATCM). |
Liyuan Cao; Zihang Guo; Huaiwen Zhang; |
| 170 | City-VLM: Towards Multidomain Perception Scene Understanding Via Multimodal Incomplete Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To effectively fuse multimodal data in the absence of one modality, we introduce incomplete multimodal learning to model outdoor scene understanding and design the LVLM named City-VLM. |
Penglei Sun; Yaoxian Song; Xiangru Zhu; Xiang Liu; Qiang Wang; Yue Liu; Changqun Xia; Tiefeng Li; Yang Yang; Xiaowen Chu; |
| 171 | Joint Test-time Adaptation with Refined Pseudo-labels and Latent Score Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This limitation stems from the fact that existing methods usually rely heavily on training data and fail to establish a good connection between the model and the marginal distribution of test data, resulting in reduced generalization ability. To mitigate this issue, we introduce a novel self-supervised framework that integrates latent score matching and pseudo-label refinement into the TTA paradigm to enhance the model’s perception of the test data distribution. |
Yijie Yang; Lianyong Qi; Weiming Liu; Fan Wang; Jing Du; Yuwen Liu; Xiaolong Xu; Qiang Ni; Wanchun Dou; Xiaokang Zhou; |
| 172 | Consistency of Local and Global Flatness for Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: By rethinking the SAM in FL and theoretically analyzing the flatness distance, we propose a novel FedNSAM algorithm that accelerates the SAM algorithm by introducing global Nesterov momentum into the local update to harmonize the consistency of global and local flatness. |
Junkang Liu; Fanhua Shang; Yuxuan Tian; Hongying Liu; Yuanyuan Liu; |
| 173 | Dual-Granularity Cross-Modal Identity Association for Weakly-Supervised Text-to-Person Image Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods struggle to predict complex one-to-many identity relationships, severely limiting performance improvements. To address this challenge, we propose a local-and-global dual-granularity identity association mechanism. |
Yafei Zhang; Yongle Shang; Huafeng Li; |
| 174 | Generating 3D Hair Strands from Images with Diverse Styles and Viewpoints Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a novel approach for 3D hair strand generation that accommodates diverse image inputs across styles, viewpoints, and quantities of input views. |
Pengyu Long; Zijun Zhao; Min Ouyang; Qingcheng Zhao; Wei Yang; Lan Xu; Jingyi Yu; |
| 175 | EmotionalCanines: A Dataset for Analysis of Arousal and Valence in Dog Vocalization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In the current literature on animal communication and its intersection with machine learning, there is a limited amount of open-sourced data available to facilitate research, mainly due to constraints in animal subjects and recording conditions. To address this gap, we propose a framework that enables the collection of reliable arousal and valence labels in animal emotional state at scale. |
Tuan M. Dang; Theron S. Wang; Hridayesh Lekhak; Kenny Q. Zhu; |
| 176 | Sera: Separated Coarse-to-fine Representation Alignment for Cross-subject EEG-based Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by neurophysiological principles, we propose a novel framework, named Sera, for EEG-based emotion recognition that explicitly separates source activities and aligns representations across subjects. |
Zhihao Jia; Meiyan Xu; Jingyuan Wang; Ziyu Jia; Yong Li; Xinliang Zhou; Chenyu Liu; Junfeng Yao; Yi Ding; |
| 177 | SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce SVGenius, a comprehensive benchmark comprising 2,377 queries across three progressive dimensions: understanding, editing, and generation. |
Siqi Chen; Xinyu Dong; Haolei Xu; Xingyu Wu; Fei Tang; Hang Zhang; Yuchen Yan; Linjuan Wu; Wenqi Zhang; Guiyang Hou; Yongliang Shen; Weiming Lu; Yueting Zhuang; |
| 178 | InterMind: Doctor-Patient-Family Interactive Depression Assessment Empowered By Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, current automatic depression detection (ADD) methods usually model depression detection as a classification or regression task, lacking interpretability for the decision-making process. To address these issues, we developed InterMind, a doctor-patient-family interactive depression assessment system empowered by large language models (LLMs). |
Zhiyuan Zhou; Jilong Liu; Sanwang Wang; Shijie Hao; Yanrong Guo; Richang Hong; |
| 179 | Learning Long-Range Action Representation By Two-Stream Mamba Pyramid Network for Figure Skating Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Thirdly, lengthy competition videos make it difficult and inefficient to handle long-range contexts. To address these challenges, we propose a two-stream Mamba pyramid network that aligns with actual judging criteria to predict TES and PCS by separating the visual-feature-based TES evaluation stream from the audio-visual-feature-based PCS evaluation stream. |
Fengshun Wang; Qiurui Wang; Peilin Zhao; |
| 180 | BrokenVideos: A Benchmark Dataset for Fine-Grained Artifact Localization in AI-Generated Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing datasets either focus solely on detection at the video or frame level, or lack fine-grained spatial annotations necessary for developing and benchmarking localization methods. To fill this gap, we present BrokenVideos, a benchmark dataset comprising ~3,254 AI-generated videos with carefully annotated, pixel-level masks indicating regions of visual corruption. |
Jiahao Lin; Weixuan Peng; Bojia Zi; Yifeng Gao; Xianbiao Qi; Xingjun Ma; Yu-Gang Jiang; |
| 181 | CRISP-SAM2: SAM2 with Cross-Modal Interaction and Semantic Prompting for Multi-Organ Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite significant progress in this field, current multi-organ segmentation models often suffer from inaccurate details, dependence on geometric prompts and loss of spatial information. Addressing these challenges, we introduce a novel model named CRISP-SAM2 with Cross-modal Interaction and Semantic Prompting based on SAM2. |
Xinlei Yu; Changmiao Wang; Hui Jin; Ahmed Elazab; Gangyong Jia; Xiang Wan; Changqing Zou; Ruiquan Ge; |
| 182 | Bridging The Unseen Gap: Label-Enhanced Information Bottleneck Distillation for Multimodal Named Entity Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Multimodal Named Entity Recognition (MNER) integrates visual information to resolve textual ambiguities but struggles with generalizing to unseen entities (out-of-vocabulary, OOV), particularly in social media. To bridge this gap, we leverage internal label knowledge and visual information and propose a Label-Enhanced Information Bottleneck Distillation (LIBD) framework, which transfers label-aware generalization capabilities via a teacher-student architecture. |
Bo Xu; Jie Wei; Hongya Wang; Ming Du; Hui Song; Yanghua Xiao; |
| 183 | DiffuQKT: A Diffusion-Based Approach for Improved Question Representation in Knowledge Tracing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the sparsity and complexity of question data pose significant challenges for existing methods to capture the underlying features of questions, thereby affecting the accuracy of knowledge state predictions. To address this issue, this paper attempts to introduce the diffusion model to the KT field, proposing a novel knowledge tracing model, DiffuQKT. |
Fenghua Yu; Jianwen Sun; Qian Wan; Meicheng Chen; Xiaoxuan Shen; Qing Li; |
| 184 | GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. |
Bo Liu; Xiangyu Zhao; Along He; Yidi Chen; Huazhu Fu; Xiao-Ming Wu; |
| 185 | SafeCFG: Controlling Harmful Features with Dynamic Safe Guidance for Safe Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing safe alignment methods aim to mitigate the risk of generating harmful images but often reduce the quality of clean image generation. To address this issue, we propose SafeCFG to adaptively control harmful features with dynamic safe guidance by modulating the CFG generation process. |
Jiadong Pan; Liang Li; Hongcheng Gao; Zheng-Jun Zha; Qingming Huang; Jiebo Luo; |
| 186 | Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Anchor-free methods risk disrupting sampling trajectories, leading to visual artifacts, while anchor-based methods rely on the heuristic selection of anchor concepts. To overcome these shortcomings, we introduce a finetuning framework, dubbed ANT, which Automatically guides deNoising Trajectories to avoid unwanted concepts. |
Leyang Li; Shilin Lu; Yan Ren; Adams Wai-Kin Kong; |
| 187 | OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Notwithstanding the considerable advances made by prevailing methodologies, CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation, and 2) the priority of textual data in the image modification process is overlooked, which leads to a visual focus bias. To address these two limitations, this work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping. |
Zhiwei Chen; Yupeng Hu; Zixu Li; Zhiheng Fu; Xuemeng Song; Liqiang Nie; |
| 188 | FineQuest: Adaptive Knowledge-Assisted Sports Video Understanding Via Agent-of-Thoughts Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose FineQuest, the first training-free framework that leverages dual-mode reasoning inspired by cognitive science: i) Reactive Reasoning for straightforward sports queries and ii) Deliberative Reasoning for more complex ones. |
Haodong Chen; Haojian Huang; Xinxiang Yin; Dian Shao; |
| 189 | IM-POI: Bridging ID and Multi-modal Gaps in Next POI Recommendation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent multi-modal approaches offer alternatives but struggle with two key issues: inadequate handling of heterogeneity between ID and multi-modal features, and difficulties in unified framework integration, limiting their potential benefits. To address these limitations, we propose IM-POI, a novel framework that leverages the complementary strengths of both ID embeddings and multi-modal representations for next POI recommendation. |
Siyuan Huang; Jiahui Jin; Xin Lin; Xigang Sun; Yukun Ban; |
| 190 | Detecting Synthetic Image By Cross-Modal Commonality Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This observation highlights the discriminative power of high-frequency information for this task and provides a strong rationale for learning generalized artifact representations based on multi-modal fusion strategies. Building on this insight, we introduce a multi-modal high-frequency interactive detection framework for general synthetic image detection. |
Kai Li; Wenqi Ren; Wei Wang; Linchao Zhang; Xiaochun Cao; |
| 191 | Prompt-Softbox-Prompt: A Free-Text Embedding Control for Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we provide a comprehensive analysis of text embeddings in Stable Diffusion XL, offering three key insights: (1) aug embedding … |
Yitong Yang; Yinglin Wang; Tian Zhang; Jing Wang; Shuting He; |
| 192 | CorrNeXt: Making The ConvNet-Style Correspondence Pruner Stronger for Two-View Geometry Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The surge of interest in two-view correspondence pruning stems from the advent of the ConvNet-style paradigm, which showcases intrinsic proficiency in local context aggregation, tackling the context-agnostic deficiency of MLP-based methods fundamentally and delivering impressive pruning capability. To further unlock the potential of such a paradigm, this perspective study revisits its design decisions and introduces CorrNeXt, a cutting-edge ConvNet-style pruner that incorporates multiple simple but effective improvements. |
Zizhuo Li; Chunbao Su; Fan Fan; Jun Huang; Jiayi Ma; |
| 193 | HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Nonetheless, existing diffusion models predominantly concentrate on modeling static 3D scenes or object-level dynamics, constraining their capacity to provide truly immersive experiences. To address this issue, we propose HoloTime, a framework that integrates video diffusion models to generate panoramic videos from a single prompt or reference image, along with a 360-degree 4D scene reconstruction method that seamlessly transforms the generated panoramic video into 4D assets, enabling a fully immersive 4D experience for users. |
Haiyang Zhou; Wangbo Yu; Jiawen Guan; Xinhua Cheng; Yonghong Tian; Li Yuan; |
| 194 | SAMVSR: Leveraging Semantic Priors to Zone-Focused Mamba for Video Snow Removal Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, to address temporal SAM label misalignment, we introduce an Entropy-wise Zone Propagation technique, which selects a reliable reference mask and semantically aligns instances across different frames via an entropy-guided label matching mechanism. |
Hongtao Wu; Yifeng Wu; Jiaxuan Jiang; Chengyu Wu; Hong Wang; Yefeng Zheng; |
| 195 | Towards Consumer-Grade Cybersickness Prediction: Multi-Model Alignment for Real-Time Vision-Only Inference Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a scalable, deployable framework for personalized cybersickness prediction leveraging only non-invasive signals readily available from commercial VR headsets, including head motion, eye tracking, and physiological responses. |
Yitong Zhu; Zhuowen Liang; Yiming Wu; Tangyao Li; Yuyang Wang; |
| 196 | DARL: Mitigating Gradient Conflicts in Long-Tailed Out-of-Distribution Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To avoid the ID-OOD dilemma, we propose Dynamic Ambiguity-aware Recalibration for Logits (DARL), an ambiguity-guided long-tailed OOD learning approach, grounded on two theoretical insights. |
Xuan Zhang; Sinchee Chin; Jing-Hao Xue; Xiaochen Yang; Wenming Yang; |
| 197 | SPHERE: Semantic-PHysical Engaged REpresentation for 3D Semantic Scene Completion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: On the other hand, neural reconstruction methods like NeRF and 3DGS demonstrate superior physical awareness, but suffer from high computational cost and slow convergence when handling large-scale, complex autonomous driving scenes, leading to inferior semantic accuracy. To address these issues, we propose the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based SSC, which integrates voxel and Gaussian representations for joint exploitation of semantic and physical information. |
Zhiwen Yang; Yuxin Peng; |
| 198 | Beyond Equal Views: Strength-Adaptive Evidential Multi-View Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a Strength-Adaptive Evidential Multi-View Learning (SAEML) method that performs reliability-aware fusion by explicitly modeling the contribution of each view’s evidence. |
Cai Xu; Ziqi Wen; Jie Zhao; Wanqing Zhao; Jinlong Yu; Haishun Chen; Ziyu Guan; Wei Zhao; |
| 199 | Multi-Agent Amodal Completion: Direct Synthesis with Fine-Grained Semantic Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Prior methods face challenges with data needs, generalization, or error accumulation in progressive pipelines. We propose a Collaborative Multi-Agent Reasoning Framework based on upfront collaborative reasoning to overcome these issues. |
Hongxing Fan; Lipeng Wang; Haohua Chen; Zehuan Huang; Jiangtao Wu; Lu Sheng; |
| 200 | Multimodal LLMs Can Reason About Aesthetics in Zero-Shot Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This sensibility involves a multifaceted cognitive process extending beyond mere visual appeal, which is often overlooked by current computational methods. This paper pioneers an approach to capture this complex process by investigating how the reasoning capabilities of Multimodal LLMs (MLLMs) can be effectively elicited to perform aesthetic judgment. |
Ruixiang Jiang; Chang Wen Chen; |
| 201 | DiffArtist: Towards Structure and Appearance Controllable Image Stylization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Existing neural stylization techniques primarily focus on transferring appearance-level features such as color and texture, often neglecting the equally crucial aspect of structural stylization. To address this gap, we introduce DiffArtist, the first 2D stylization method to offer fine-grained, disentangled control over both structure and appearance style strength. |
Ruixiang Jiang; Chang Wen Chen; |
| 202 | Online Cross-Modal Hashing with Multi-Level Memory Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, a novel Online Cross-modal Hashing method with Multi-level Memory (OCH-MM) is proposed. |
Wentao Fan; Chao Zhang; Chunlin Chen; Huaxiong Li; |
| 203 | ESOD: Event-Based Small Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, the lack of benchmark datasets has significantly hindered progress in this field. To tackle these issues, we propose the Fully Deformable Detection Network (FDDNet), a lightweight framework that dynamically adapts to extract key features. |
Quanmin Liang; Jinyi Lu; Qiang Li; Shuai Liu; Zhihao Zhao; Yinzheng Zhao; Wei Zhang; Kai Huang; Yonghong Tian; |
| 204 | MMESGBench: Pioneering Multimodal Understanding and Complex Reasoning Benchmark for ESG Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To fill the gap, we introduce MMESGBench, a first-of-its-kind benchmark dataset designed to evaluate multimodal understanding and reasoning across multi-source ESG documents. |
Lei Zhang; Xin Zhou; Chaoyue He; Di Wang; Yi Wu; Hong Xu; Wei Liu; Chunyan Miao; |
| 205 | EventVAD: Training-Free Event-Aware Video Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs to perform fine-grained temporal-event reasoning. |
Yihua Shao; Haojin He; Sijie Li; Siyu Chen; Xinwei Long; Fanhu Zeng; Yuxuan Fan; Muyang Zhang; Ziyang Yan; Ao Ma; Xiaochen Wang; Hao Tang; Yan Wang; Shuyan Li; |
| 206 | Twin Co-Adaptive Dialogue for Progressive Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present Twin-Co, a framework that leverages synchronized, co-adaptive dialogue to progressively refine image generation. |
Jianhui Wang; Yangfan He; Yan Zhong; Xinyuan Song; Jiayi Su; Yuheng Feng; Ruoyu Wang; Hongyang He; Wenyu Zhu; Xinhang Yuan; Miao Zhang; Keqin Li; Jiaqi Chen; Tianyu Shi; Xueqian Wang; |
| 207 | LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we introduce EBench-18K, the first large-scale image Editing Benchmark including 18K edited images with fine-grained human preference annotations for evaluating TIE. |
Zitong Xu; Huiyu Duan; Bingnan Liu; Guangji Ma; Jiarui Wang; Liu Yang; Shiqi Gao; Xiaoyu Wang; Jia Wang; Xiongkuo Min; Guangtao Zhai; Weisi Lin; |
| 208 | Implicit Retinex Decomposition with Chromaticity Disentanglement for Low-Light Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, most Retinex-based methods adopt explicit, multi-stage pipelines prone to decomposition bias, error accumulation, and chromatic entanglement between illumination and reflectance. To tackle these issues, we propose IDAR (Implicit Decomposition, illumination Adjustment, and reflectance Restoration), a unified Retinex-inspired framework with two key innovations. |
Mufan Liu; Wu Ran; Zhiquan He; Zuojie Xie; Hong Lu; Peirong Ma; |
| 209 | Towards Culturally Fair Multimodal Generation: Quantifying and Mitigating Orientalist Biases in Text-to-Visual Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: A mitigation framework employing a large language model (LLM) is proposed and experimentally validated. |
Yifan Zeng; Fangzhou Dong; Jian Zhao; Peijia Zheng; Jian Li; Huiyu Zhou; |
| 210 | SLAM-X: Generalizable Dynamic Removal for NeRF and Gaussian Splatting SLAM Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, real-world environments are often filled with dynamic objects, leading to tracking errors and mapping failures. Several dynamic SLAM approaches have been proposed, but they remain difficult to adopt due to challenges in deployment, framework compatibility, and generalization. |
Mingrui Li; Dong Li; Sijia Hu; Kangxu Wang; Zhenjun Zhao; Hongyu Wang; |
| 211 | Wild3A: Novel View Synthesis from Any Dynamic Images in Seconds Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: These additional pieces of information limit the method’s generalization ability, real-time performance, and application in AR, VR, and multimedia. To address these issues, we propose Wild3A, a comprehensive end-to-end integrated framework: it directly regresses 3D point positions and initial point confidence via MASt3R’s Transformer and integrates a Bayesian estimation module based on multimodal information fusion. |
Mingrui Li; Shuhao Zhai; Zibing Zhao; Luyue Sun; Xinxiao Wang; Dong Li; Shuhong Liu; Hongyu Wang; |
| 212 | Open-Vocabulary 3D Affordance Understanding Via Functional Text Enhancement and Multilevel Representation Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, affordance understanding requires constructing a coherent semantic landscape from fragmented linguistic expressions, one that preserves intra-class diversity while minimizing inter-class overlap. To address these challenges, we introduce Aff3DFunc, a framework designed to enhance the alignment between affordance and 3D geometry. |
Lin Wu; Wei Wei; Peizhuo Yu; Jianglin Lan; |
| 213 | SAT: Supervisor Regularization and Animation Augmentation for Two-process Monocular Texture 3D Human Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose a two-process 3D human reconstruction framework, SAT, which seamlessly learns various prior geometries in a unified manner and reconstructs high-quality textured 3D avatars as the final output. |
Gangjian Zhang; Jian Shu; Nanjie Yao; Hao Wang; |
| 214 | DT-UFC: Universal Large Model Feature Coding Via Peaky-to-Balanced Distribution Transformation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present the first systematic study on universal feature coding for large models. |
Changsheng Gao; Zijie Liu; Li Li; Dong Liu; Xiaoyan Sun; Weisi Lin; |
| 215 | Compressed Feature Quality Assessment: Dataset and Baselines Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To advance CFQA research, we propose the first benchmark dataset, comprising 300 original features and 12,000 compressed features derived from three vision tasks and four feature codecs. |
Changsheng Gao; Wei Zhou; Guosheng Lin; Weisi Lin; |
| 216 | Towards A New Paradigm of Visual Signal Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, CMC has certain defects in consistency with the original image and perceptual quality. To shed light on this problem, we introduce CMC-Bench, a benchmark of the cooperative performance of Image-to-Text (I2T) and Text-to-Image (T2I) models for image compression. |
Chunyi Li; Xiele Wu; Haoning Wu; Donghui Feng; Zicheng Zhang; Guo Lu; Xiongkuo Min; Xiaohong Liu; Guangtao Zhai; Weisi Lin; |
| 217 | Amplitude-aware Domain Style Replay for Lifelong Person Re-identification Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The major challenge of LReID is catastrophic forgetting, typically caused by large domain shifts during training. To address this, we propose a novel Amplitude-aware Domain Style Replay (ADSR) framework, which introduces a Fourier-based Style Transfer (FST) mechanism to generate synthetic data that reflects the style of previously encountered domains. |
Long Chen; De Cheng; Shizhou Zhang; Yinghui Xing; Di Xu; Yanning Zhang; |
| 218 | RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing benchmarks fail to comprehensively evaluate LLMs and Multimodal Large Language Models (MLLMs) in realistic misinformation scenarios. To bridge this gap, we introduce RealFactBench, a comprehensive benchmark designed to assess the fact-checking capabilities of LLMs and MLLMs across diverse real-world tasks, including Knowledge Validation, Rumor Detection, and Event Verification. |
Shuo Yang; Yuqin Dai; Guoqing Wang; Xinran Zheng; Jinfeng Xu; Jinze Li; Zhenzhe Ying; Weiqiang Wang; Edith C. H. Ngai; |
| 219 | Generate Aligned Anomaly: Region-Guided Few-Shot Anomaly Image-Mask Pair Synthesis for Industrial Inspection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While several anomaly synthesis approaches have been introduced for data augmentation, they often struggle with low realism, inaccurate mask alignment, and poor generalization. To overcome these limitations, we propose Generate Aligned Anomaly (GAA), a region-guided, few-shot anomaly image-mask pair generation framework. |
Yilin Lu; Jianghang Lin; Linhuang Xie; Kai Zhao; Yansong Qu; Shengchuan Zhang; Liujuan Cao; Rongrong Ji; |
| 220 | Robust Modality-Incomplete Anomaly Detection: A Modality-Instructive Framework with Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address these, we conduct the first comprehensive study on Modality-Incomplete Industrial Anomaly Detection (MIIAD) and establish MIIAD Bench, a benchmark covering diverse missing settings. |
Bingchen Miao; Wenqiao Zhang; Juncheng Li; Wangyu Wu; Siliang Tang; Zhaocheng Li; Haochen Shi; Jun Xiao; Yueting Zhuang; |
| 221 | FEALLM: Advancing Facial Emotion Analysis in Multimodal Large Language Models with Emotional Synergy and Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Recently, Multimodal Large Language Models (MLLMs) have shown exceptional performance in various visual tasks, while they still face significant challenges in FEA due to the lack of specialized datasets and their inability to capture the intricate relationships between FEs and AUs. To address these issues, we introduce a novel FEA Instruction Dataset that provides accurate and aligned FE and AU descriptions and establishes causal reasoning relationships between them, followed by constructing a new benchmark, FEABench. |
Zhuozhao Hu; Kaishen Yuan; Xin Liu; Zitong Yu; Yuan Zong; Jingang Shi; Huanjing Yue; Jingyu Yang; |
| 222 | Single Trajectory Distillation for Accelerating Image and Video Style Transfer Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This partial alignment strategy inevitably fails to guarantee full trajectory consistency, thereby compromising the overall generation quality. To address this issue, we propose Single Trajectory Distillation (STD), a training framework initiated from partial noise states. |
Sijie Xu; Runqi Wang; Wei Zhu; Dejia Song; Nemo Chen; Xu Tang; Yao Hu; |
| 223 | Overfitted Point Cloud Attribute Codec Using Sparse Hierarchical Implicit Neural Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Compressing attributes of 3D point clouds remains challenging due to their inherent sparsity and irregular distribution. To address this, we propose an efficient framework based on sparse hierarchical Implicit Neural Representations (INRs). |
Zhe Sun; Qiang Xu; Qi Zhang; Shan Liu; Ge Li; |
| 224 | FAMRD: Frequency-Aware Multimodal Reverse Distillation for Industrial Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Second, when one modality indicates normal while another shows anomalies, anomaly detection may be misled by such modality ambiguity. To address these challenges, we propose a Frequency-Aware Multimodal Reverse Distillation (FAMRD) framework from the frequency domain perspective. |
Qiyin Zhong; Xianglin Qiu; Xiaolei Wang; Zhen Zhang; Gang Liu; Jimin Xiao; |
| 225 | SDP: Spectral-Decomposed Prompting for Continual Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose the Spectral-Decomposed Prompting (SDP) method, a novel prompt-based approach that dynamically generates prompts based on the current input using a spectral decomposition strategy. |
Siqi Song; Limin Yu; Jimin Xiao; |
| 226 | MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection By Joint Motion-Semantic Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose the Motion-Semantics DETR (MS-DETR), a framework that captures rich motion-semantics features through unified learning for MR/HD tasks. |
Hongxu Ma; Guanshuo Wang; Fufu Yu; Qiong Jia; Shouhong Ding; |
| 227 | GeoUni: A Unified Model for Generating Geometry Diagrams, Problems and Problem Solutions Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose GeoUni, the first unified geometry expert model capable of generating problem solutions and diagrams within a single framework in a way that enables the creation of unique and individualized geometry problems. |
Jo-Ku Cheng; Zeren Zhang; Ran Chen; Jingyang Deng; Ziran Qin; Jinwen Ma; |
| 228 | HandCraft: Tactile-Informed Hand-Object Dynamics Capture and Realistic Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present HandCraft, a framework designed to capture and render hand-object interactions with exceptional precision and realism. |
Hongyang Lin; Kuixiang Shao; Peijun Xu; Zhuoyang Bu; Yuyang Jiao; Ziyuan Tang; Chenxi Xiao; Jingyi Yu; |
| 229 | Focus Where It Matters: LLM-Guided Regional Identification for Instruction-based Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: As a result, unintended changes may occur in non-target areas, where the original image should remain unchanged. To address this issue, we propose FoRE, an MLLM-guided framework that identifies the target region based on the given edit instruction and performs image editing using region-aware embeddings. |
Minho Park; Young Joo Jo; Jae-Hyeok Lee; Ji Yong Lee; Dong-oh Kang; Yong Man Ro; |
| 230 | Zero in on The Target: A Composite Robust Model for Retrieving Information in Traffic Data to Discover Network Attacks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Commonly adopted preset thresholds struggle to cope with changes in traffic data, limiting the applicability of existing methods. Therefore, this paper proposes a Multi-Module-Based Composite Robust Model for Network Attack Detection (MCNAD). |
Ziang Li; Chengxiang Si; Zhenyu Cheng; |
| 231 | Breaking The Spatial-Temporal Consistency Constraint: Towards Reference-Based Hyperspectral Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In view of this spatial-temporal constraint, a feasible solution is to regard HR-MSI, which shares similar spatial structure and semantics with LR-HSI, as a reference to assist reconstruction. Therefore, this paper proposes a Cross-Correlation & Self-Similarity Guided Texture Transfer Network (C2S2TNet), which utilizes the texture details of HR-MSI and the self-similarity information of LR-HSI to achieve reference-based hyperspectral image super-resolution. |
Xuyao Liu; Jiahui Qu; Wenqian Dong; |
| 232 | TF-ATM: Training-Free Adaptive Token Merging Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a novel Training-Free Adaptive Token Merging (TF-ATM) method by exploring the intrinsic properties of images themselves. |
Xin Zhang; Weiying Xie; Yunsong Li; Xiaoyu Chen; Tianlin Hui; Jitao Ma; Leyuan Fang; |
| 233 | Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we reveal two critical insights: (1) Significant redundancy exists in object-level 3D token representations, analogous to patch-level redundancy in 2D systems; (2) Global attention patterns exhibit strong predictive power for identifying non-essential tokens in 3D contexts. Building on these observations, we propose Fast3D, a plug-and-play visual token pruning framework for 3D MLLMs featuring two technical innovations: (1) Global Attention Prediction (GAP), where a lightweight neural network learns to predict the global attention distributions of the target model, enabling efficient token importance estimation for precise pruning guidance; (2) Sample-Adaptive visual token Pruning (SAP), which introduces dynamic token budgets through attention-based complexity assessment, automatically adjusting layer-wise pruning ratios based on input characteristics. |
Wencan Huang; Daizong Liu; Wei Hu; |
| 234 | FaceInsight: A Multimodal Large Language Model for Face Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these general-domain MLLMs perform poorly in face perception tasks, often producing inaccurate or misleading responses to face-specific queries. To address this gap, we propose FaceInsight, a versatile face perception MLLM that provides fine-grained information. |
Jingzhi Li; Changjiang Luo; Ruoyu Chen; Hua Zhang; Wenqi Ren; Jianhou Gan; Xiaochun Cao; |
| 235 | Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective for previous feature caching. |
Zhixin Zheng; Xinyu Wang; Chang Zou; Shaobo Wang; Linfeng Zhang; |
| 236 | Human-Activity AGV Quality Assessment: A Benchmark Dataset and An Objective Evaluation Metric Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We conduct a subjective study to evaluate the human appearance quality, action continuity quality, and overall video quality of AGVs, and identify semantic issues of human body parts. |
Zhichao Zhang; Wei Sun; Xinyue Li; Yunhao Li; Qihang Ge; Jun Jia; Zicheng Zhang; Zhongpeng Ji; Fengyu Sun; Shangling Jui; Xiongkuo Min; Guangtao Zhai; |
| 237 | Multi-State Tracker: Enhancing Efficient Object Tracking Via Multi-State Specialization and Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, this efficiency often comes at the expense of weakened feature representation capacity, thus limiting their ability to accurately capture target states using single-layer features. To overcome this limitation, we propose Multi-State Tracker (MST), which utilizes highly lightweight state-specific enhancement (SSE) to perform specialized enhancement on multi-state features produced by multi-state generation (MSG) and aggregates them in an interactive and adaptive manner using cross-state interaction (CSI). |
Shilei Wang; Gong Cheng; Pujian Lai; Dong Gao; Junwei Han; |
| 238 | DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods tend to exploit modality-invariant visual features but largely overlook gait features, which are not only modality-invariant but also rich in temporal dynamics, thus limiting their ability to model the spatiotemporal consistency essential for cross-modal video matching. To address these challenges, we propose a DINOv2-Driven Gait Representation Learning (DinoGRL) framework that leverages the rich visual priors of DINOv2 to learn gait features complementary to appearance cues, facilitating robust sequence-level representations for cross-modal retrieval. |
Yujie Yang; Shuang Li; Jun Ye; Neng Dong; Fan Li; Huafeng Li; |
| 239 | StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) Static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) The absence of explicit reasoning process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. |
Yuhang Hu; Zhenyu Yang; Shihan Wang; Shengsheng Qian; Bin Wen; Fan Yang; Tingting Gao; Changsheng Xu; |
| 240 | DITL2: Dual-Stage Invariance Transfer Learning for Generalizable Document Image Tampering Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, domain-specific variations, including differences in color distribution and texture, compromise the performance of joint training. To address this issue, we propose DITL2, a Dual-stage Invariance Transfer Learning framework for Document Image Tampering Localization that consists of Cross-Domain Invariance Pre-training (CDIP) and Frequency Decoupling Parameter Adaptation (FDPA). |
Songze Li; Yunfei Guo; Shen Chen; Bin Li; Kaiqing Lin; Changsheng Chen; Haodong Li; Taiping Yao; Shouhong Ding; |
| 241 | Beyond Emotion Recognition: A Multi-Turn Multimodal Emotion Understanding and Reasoning Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, recent research primarily focuses on enhancing their emotion recognition abilities, leaving the substantial potential in emotion reasoning, which is crucial for improving the naturalness and effectiveness of human-machine interactions. Therefore, in this paper, we introduce a multi-turn multimodal emotion understanding and reasoning (MTMEUR) benchmark, which encompasses 1,451 video data from real-life scenarios, along with 5,101 progressive questions. |
Jinpeng Hu; Hongchang Shi; Chongyuan Dai; Zhuo Li; Peipei Song; Meng Wang; |
| 242 | AV-DiT: Taming Image Diffusion Transformers for Efficient Joint Audio and Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the potential of DiTs to enable superb multimodal content creation remains underexplored. To bridge this gap, we introduce AV-DiT, a novel and efficient audio-visual diffusion transformer designed to generate high-quality, realistic videos with synchronized audio tracks. |
Kai Wang; Shijian Deng; Jing Shi; Dimitrios Hatzinakos; Yapeng Tian; |
| 243 | Scalable Unpaired Multi-View Clustering Via Anchor-Driven High-Throughput Encoding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, few approaches attempt to enhance the internal structure of the anchor matrix to further improve clustering performance. To address these challenges, we propose a novel Anchor-Driven High-Throughput Encoding (ADHTE) framework that optimizes anchors by maximizing their throughput encoding capacity. |
Junyu Chen; Jiawei Peng; Yuan Sun; Jian Dai; Xingfeng Li; Zhenwen Ren; |
| 244 | Positional Prompt Tuning for Efficient 3D Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We argue that using positional encoding in point Transformer-based methods serves to aggregate multi-scale features of point clouds. |
Shaochen Zhang; Zekun Qi; Runpei Dong; Xiuxiu Bai; Xing Wei; |
| 245 | Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. |
Zecheng Zhao; Selena Song; Tong Chen; Zhi Chen; Shazia Sadiq; Yadan Luo; |
| 246 | A Multimodal Evaluation Framework for Spatial Audio Playback Systems: From Localization to Listener Preference Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, objective evaluation methods for perceptual dimensions like sound field and sound image remain underdeveloped, hindered by the lack of fine-grained spatial audio datasets and the neglect of echoes and reverberation in diverse playback conditions. To address these challenges, we propose MESA, a multi-modal evaluation framework for spatial audio systems, and introduce PSA-MOS, a high-quality multi-scene spatial audio dataset. |
Changhao Pan; Wenxiang Guo; Yu Zhang; Zhiyuan Zhu; Zhetao Chen; Han Wang; Zhou Zhao; |
| 247 | Beyond Snapshots: A Multimodal User-Level Dataset for Depression Detection in Dynamic Social Media Streams Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This limitation overlooks the comprehensive mental state of users, which can only be understood through their extended video histories. To address this, we introduce the Multimodal User-level Depression Detection Dataset (MUD3). |
Bichen Wang; Yixin Sun; Yanyan Zhao; Bing Qin; |
| 248 | InterAnimate: Taming Region-Aware Diffusion Model for Realistic Human Interaction Animation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce a novel paradigm for animating realistic hand-face interactions. |
Yukang Lin; Yan Hong; Zunnan Xu; Xindi Li; Chao Xu; Chuanbiao Song; Ronghui Li; Haoxing Chen; Jun Lan; Huijia Zhu; Weiqiang Wang; Jianfu Zhang; Xiu Li; |
| 249 | EmoSym: A Symbiotic Framework for Unified Emotional Understanding and Generation Via Latent Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we aim to bridge the gap by developing a unified framework. |
Yijie Zhu; Yibo Lyu; Zitong Yu; Rui Shao; Kaiyang Zhou; Liqiang Nie; |
| 250 | Toward Robust Deepfake Detection: A Proactive Method Based on Watermarking and Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Consequently, the growing variety of forgery techniques, combined with the degradation of visual quality in forged images, makes reliable detection even more difficult. To address these challenges, we propose WKD, a proactive deepfake detection framework based on Watermarking and Knowledge Distillation. |
Chunpeng Wang; Wenlong Ma; Li Zou; Zhiqiu Xia; Qi Li; Bin Ma; Yunan Liu; |
| 251 | Video-based Transparent Object Segmentation Via Temporal Feature Aggregation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, detecting transparent areas from video has not been well explored, especially for transparent categories beyond glass, due to the scarcity of such datasets. Therefore, in this paper, we propose the video-based transparent object segmentation task and introduce the first-of-its-kind corresponding dataset named TransVid, which contains nearly 400 videos with a total of 18,523 frames. |
Zhen Wang; Dongyuan Li; Yaozu Wu; Peide Zhu; Shiyin Tan; Renhe Jiang; |
| 252 | CHORD: Customizing Hybrid-precision On-device Model for Sequential Recommendation with Device-cloud Collaboration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While on-device finetuning captures personalized user preference, it imposes additional computational burden through local retraining. To address these challenges, we propose a framework for Customizing Hybrid-precision On-device model for sequential Recommendation with Device-cloud collaboration (CHORD), leveraging channel-wise mixed-precision quantization to simultaneously achieve personalization and resource-adaptive deployment. |
Tianqi Liu; Kairui Fu; Shengyu Zhang; Wenyan Fan; Zhaocheng Du; Jieming Zhu; Fan Wu; Fei Wu; |
| 253 | Debiasing Multimodal Large Language Models Via Penalization of Language Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Empirical experiments underscore the persistence of this bias, as MLLMs often provide confident answers even in the absence of relevant images or given incongruent visual inputs. To rectify these biases and redirect the model’s focus toward visual information, we propose two simple, training-free strategies. |
YiFan Zhang; Yang Shi; Weichen Yu; Qingsong Wen; Xue Wang; Wenjing Yang; Zhang Zhang; Liang Wang; Rong Jin; |
| 254 | Dual Uncertainty-Guided Feature Alignment Learning for Text-Based Person Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Most existing methods focus on addressing heterogeneity while neglecting the issue of uncertainty. To tackle the uncertainty arising from the diverse textual expressions, including both structural and semantic content variations, we propose a novel Dual Uncertainty-Guided Feature Alignment Learning (DUAL) approach, utilizing instance-level and identity-level uncertainty estimations to mitigate these impacts. |
Yufei Zheng; Jiawei Liu; Bingyu Hu; Zikun Wei; Yong Wu; Zheng-Jun Zha; |
| 255 | SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Drawing inspiration from speculative decoding in large language models, we present SpeCa, a novel “Forecast-then-verify” acceleration framework that effectively addresses both limitations. |
Jiacheng Liu; Chang Zou; Yuanhuiyi Lyu; Fei Ren; Shaobo Wang; Kaixin Li; Linfeng Zhang; |
| 256 | Where Views Meet Curves: Virtual Anchors for Hyperbolic Multi-View Graph Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose a hyperbolic multi-view heat diffusion method. |
Jielong Lu; Zhihao Wu; Jiajun Yu; Qianqian Shen; Jiajun Bu; Haishuai Wang; |
| 257 | CitySculpt: 3D City Generation from Satellite Imagery with UV Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the limited information from satellite views presents significant challenges, hindering existing methods from generating high-quality cities that meet application standards. To address these challenges, we propose CitySculpt, a UV diffusion-based framework for generating 3D cities with high-fidelity geometry and photorealistic textures. |
Xingbo Yao; Xuanmin Wang; Hui Xiong; |
| 258 | Towards Hazardous Activity Recognition for A Novel Real-World Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing datasets often lack coverage of the nuanced and diverse hazards present in indoor environments, which hinders the development of a specialized model. To address this, we introduce the Real-World Hazardous Activities Dataset (RHAD), a novel and diverse video dataset specifically curated for recognizing hazardous activities in real-world indoor settings. |
Shehzad Ali; Md Tanvir Islam; Ik Hyun Lee; Mingfu Xiong; Minh-Son Dao; Saeed Anwar; Sambit Bakshi; Khan Muhammad; |
| 259 | Generative Ghost: Investigating Ranking Bias Hidden in AI-Generated Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To mitigate this bias, we fine-tune the retrieval models using a contrastive learning approach. The results of this study highlight the potential implications of AI-generated videos on retrieval systems and offer valuable insights for future research in this area. |
Haowen Gao; Liang Pang; Shicheng Xu; Leigang Qu; Tat-Seng Chua; Huawei Shen; Xueqi Cheng; |
| 260 | Low-light Image Enhancement Quality Assessment: A Real-World Dataset and An Objective Method Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: First, we introduce a Real-world Low-light Image Enhancement quality assessment dataset (RLIE), which contains 1,540 images from 154 scenarios, each with a subjective score given by human subjects. Based on this dataset, we propose a low-light enhanced image quality assessment method built on Multi-level Illumination Injection and Hierarchical Discrepancy Perception (MIIHDP). |
Chunyi Li; Bo Hu; Taiyang Chen; Leida Li; Lihuo He; Xinbo Gao; |
| 261 | MiraGe: Multimodal Discriminative Representation Learning for Generalizable AI-Generated Image Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Multimodal Discriminative Representation Learning for Generalizable AI-generated Image Detection (MiraGe), a method designed to learn generator-invariant features. |
Kuo Shi; Jie Lu; Shanshan Ye; Guangquan Zhang; Zhen Fang; |
| 262 | MLLMs Meet Person Re-identification Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Person re-identification (Re-ID) models have achieved remarkable advancements with the advent of deep learning. |
Mengying Duan; He Li; Mang Ye; |
| 263 | Serial Over Parallel: Learning Continual Unification for Multi-Modal Visual Object Tracking and Benchmarking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Unifying multiple multi-modal visual object tracking (MMVOT) tasks draws increasing attention due to the complementary nature of different modalities in building robust tracking … |
Zhangyong Tang; Tianyang Xu; Xue-Feng Zhu; Chunyang Cheng; Tao Zhou; Xiaojun Wu; Josef Kittler; |
| 264 | Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, most existing methods require iterative optimization over per-view 2D semantic feature maps, which not only results in inefficiencies but also leads to inconsistent 3D semantics across views. To address these limitations, we introduce a training-free framework that constructs a superpoint graph directly from Gaussian primitives. |
Shaohui Dai; Yansong Qu; Zheyan Li; Xinyang Li; Shengchuan Zhang; Liujuan Cao; |
| 265 | Rethinking Individual Fairness in Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we identify for the first time that the original principle of individual fairness fundamentally fails in the context of deepfake detection, revealing a critical gap previously unexplored in the literature. |
Aryana Hou; Li Lin; Justin Li; Shu Hu; |
| 266 | Prior-oriented Anchor Learning with Coalesced Semantics for Multi-View Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In addition, most strategies use adaptive anchor learning without considering the veracity of anchor selection, and lack sufficient semantic support in modeling semantic consistency, which leads anchors to deviate from the clustering center. To solve the above problems, we propose a novel method called Prior-oriented Anchor Learning with Coalesced Semantics for Multi-View Clustering (PALCS). |
Jinjia Peng; Tianhang Cheng; Guangqi Jiang; Huibing Wang; |
| 267 | KAID: Knowledge-Aware Interactive Distillation for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Additionally, they do not fully leverage the multimodal interaction knowledge from the teacher model, restricting cross-modal semantic alignment. To address these challenges, we propose KAID, a Knowledge-Aware Interactive Distillation method for VLMs. |
Da Zhang; Feiyu Wang; Bingyu Li; Zhiyuan Zhao; Junyu Gao; Xuelong Li; |
| 268 | DDFD: Diffusion-Based Denoising Fusion for Object Detection in Infrared-Visible Images Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the limitations of previous studies, this paper proposes a diffusion-based denoising fusion for object detection in infrared-visible images, termed DDFD. |
Min Dang; Gang Liu; Jingqi Zhao; Adams Wai-Kin Kong; Nan Luo; Di Wang; |
| 269 | From Guesswork to Guarantee: Towards Faithful Multimedia Web Forecasting with TimeSieve Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While models like TimeSieve have demonstrated strong capabilities in predicting web visitation metrics, they suffer from critical unfaithfulness issues, including sensitivity to random seeds, input noise, layer noise, and parametric perturbations. To address these limitations, we propose Faithful TimeSieve (FTS), an enhanced framework designed to improve prediction reliability and robustness. |
Songning Lai; Ninghui Feng; Jiechao Gao; Hao Wang; Haochen Sui; Xin Zou; Jiayu Yang; Wenshuo Chen; Lijie Hu; Hang Zhao; Xuming Hu; Yutao Yue; |
| 270 | Learning New Concepts, Remembering The Old: Continual Learning for Multimodal Concept Bottleneck Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This task requires models to continuously acquire new concepts (often representing cross-modal attributes) and classes while robustly preserving previously learned knowledge. To tackle this challenging problem, we propose CONceptual Continual Incremental Learning (CONCIL), a novel framework that fundamentally re-imagines concept and decision layer updates as linear regression problems. |
Songning Lai; Mingqian Liao; Zhangyi Hu; Jiayu Yang; Wenshuo Chen; Hongru Xiao; Jianheng Tang; Haicheng Liao; Yutao Yue; |
| 271 | A Data-driven Approach to The Longitudinal Study of Canine Vocal Pattern Development Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Longitudinal studies of animal vocalizations provide crucial insights into developmental patterns and communicative evolution. To aid such investigations in canines, this paper introduces the Canine Age Transition Vocalization Dataset, a large-scale collection of dog vocalizations featuring meticulously verified metadata (including precise birthdate, breed, and individual dog ID) for 125 dogs across 6 common breeds. |
Hridayesh Lekhak; Tuan M. Dang; Theron S. Wang; Kenny Q. Zhu; |
| 272 | DogSpeak: A Canine Vocalization Classification Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce DogSpeak, a large-scale public dataset of 77,202 Barkseqs (33.162 hours) from 156 dogs (5 breeds), uniquely sourced from online social media with accurate dog ID, sex, and breed labels. |
Hridayesh Lekhak; Theron S. Wang; Tuan M. Dang; Kenny Q. Zhu; |
| 273 | Enhancing Non-Core Language Instruction-Following in Speech LLMs Via Semi-Implicit Cross-Lingual CoT Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing SLLMs demonstrate strong performance in speech instruction-following for core languages (e.g., English), they often struggle with non-core languages due to the scarcity of paired speech-text data and limited multilingual semantic reasoning capabilities. To address this, we propose the semi-implicit Cross-lingual Speech Chain-of-Thought (XS-CoT) framework, which integrates speech-to-text translation into the reasoning process of SLLMs. |
Hongfei Xue; Yufeng Tang; Hexin Liu; Jun Zhang; Xuelong Geng; Lei Xie; |
| 274 | DHCP: Detecting Hallucinations By Cross-modal Attention Pattern in Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Large vision-language models (LVLMs) have demonstrated exceptional performance on complex multimodal tasks. |
Yudong Zhang; Ruobing Xie; Xingwu Sun; Yiqing Huang; Jiansheng Chen; Zhanhui Kang; Di Wang; Yu Wang; |
| 275 | Fighting Fire with Fire (F3): A Training-free and Efficient Visual Adversarial Example Purification Method in LVLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce F3, a novel adversarial purification framework that employs a counterintuitive “fighting fire with fire” strategy: intentionally introducing simple perturbations to adversarial examples to mitigate their harmful effects. |
Yudong Zhang; Ruobing Xie; Yiqing Huang; Jiansheng Chen; Xingwu Sun; Zhanhui Kang; Di Wang; Yu Wang; |
| 276 | Disentangling Homophily and Heterophily in Multimodal Graph Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Multimodal graphs, which integrate unstructured heterogeneous data with structured interconnections, offer substantial real-world utility but remain insufficiently explored in unsupervised learning. In this work, we initiate the study of multimodal graph clustering, aiming to bridge this critical gap. |
Zhaochen Guo; Zhixiang Shen; Xuanting Xie; Liangjian Wen; Zhao Kang; |
| 277 | EMIFS: Efficient Multi-scale Information Fusion Self-supervision for Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods rely too heavily on manually labeled images to assist training. To meet these challenges, we propose a lightweight segmentation network dedicated to extracting local and global information and fusing multi-level, multi-source features to maximize segmentation accuracy for lesions of different shapes, especially in cases of fuzzy boundaries and small segmentation targets. |
Luyao Ren; Wenxin Yu; Zhiqiang Zhang; Chang Liu; |
| 278 | From Prediction to Explanation: Multimodal, Explainable, and Interactive Deepfake Detection Framework for Non-Expert Users Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present DF-P2E (Deepfake: Prediction to Explanation), a novel multimodal framework that integrates visual, semantic, and narrative layers of explanation to make deepfake detection interpretable and accessible. |
Shahroz Tariq; Simon S. Woo; Priyanka Singh; Irena Irmalasari; Saakshi Gupta; Dev Gupta; |
| 279 | ComplexBench-Edit: Benchmarking Complex Instruction-Driven Image Editing Via Compositional Dependencies Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, they often overlook multi-instruction and chain-instruction complexities, and common consistency metrics are flawed. To address this, we introduce ComplexBench-Edit, a novel benchmark designed to systematically assess model performance on complex, multi-instruction, and chain-dependent image editing tasks. |
Chenglin Wang; Yucheng Zhou; Qianning Wang; Zhe Wang; Kai Zhang; |
| 280 | Stereo-GS: Multi-View Stereo Vision Model for Generalizable 3D Gaussian Splatting Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current methods usually entangle the prediction of 3D Gaussian geometry and appearance, which rely heavily on data-driven priors and result in slow regression speeds. To address this, we propose Stereo-GS, a disentangled framework for efficient 3D Gaussian prediction. |
Xiufeng Huang; Ka Chun Cheung; Runmin Cong; Simon See; Renjie Wan; |
| 281 | MarkSplatter: Generalizable Watermarking for 3D Gaussian Splatting Model Via Splatter Image Structure Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose the first generalizable watermarking framework that enables efficient protection of Splatter Image-based 3DGS models through a single forward pass. |
Xiufeng Huang; Ziyuan Luo; Qi Song; Ruofei Wang; Renjie Wan; |
| 282 | Enhancing Multi-view Open-set Learning Via Ambiguity Uncertainty Calibration and View-wise Debiasing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a multi-view open-set learning framework via ambiguity uncertainty calibration and view-wise debiasing. |
Zihan Fang; Zhiyong Xu; Lan Du; Shide Du; Zhiling Cai; Shiping Wang; |
| 283 | Boosting Chart-to-Code Generation in MLLM Via Dual Preference-Guided Refinement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This makes it difficult to learn accurate and generalizable mappings through standard supervised fine-tuning. To address these challenges, we propose a dual preference-guided refinement framework that combines a feedback-driven, dual-modality reward mechanism with iterative preference learning. |
Zhihan Zhang; Yixin Cao; Lizi Liao; |
| 284 | Pathology-Aware Reconstruction with Discriminative Knowledge Boosting Alignment for Chest X-ray Vision-Language Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Meanwhile, alignment-based methods suffer from suboptimal representations due to the presence of false negatives. To address these challenges, we propose a novel pre-training framework that integrates two key components: Pathology-Aware Reconstruction (PAR) and Discriminative Knowledge-Boosted Alignment (DKBA). |
Lihong Qiao; Shiyi Gao; Yucheng Shu; Bin Xiao; Weisheng Li; Xinbo Gao; |
| 285 | Text-Visual Semantic Constrained AI-Generated Image Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, when applied to AGIs, these methods encounter two primary challenges: semantic misalignment and missing perception of fine details. To address these limitations, we propose Text-Visual Semantic Constrained AI-Generated Image Quality Assessment (SC-AGIQA), a unified framework that leverages text-visual semantic constraints to significantly enhance the comprehensive evaluation of both text-image consistency and perceptual distortion in AI-generated images. |
Qiang Li; Qingsen Yan; Haojian Huang; Peng Wu; Haokui Zhang; Yanning Zhang; |
| 286 | Seg-Wild: Interactive Segmentation Based on 3D Gaussian Splatting for Unconstrained Image Collections Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Previous segmentation methods cannot address transient occlusions or accurately restore the scene’s lighting conditions. Therefore, we propose Seg-Wild, an interactive segmentation method based on 3D Gaussian Splatting for unconstrained image collections, suitable for in-the-wild scenes. |
Yongtang Bao; Chengjie Tang; Yuze Wang; Haojie Li; |
| 287 | UniAD: Integrating Geometric and Semantic Cues for Unified Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose UniAD, a novel dual-branch teacher-student framework that achieves unified anomaly detection through synergistic integration of complementary expertise from heterogeneous vision models, without requiring extra manual annotations. |
Xiaodong Wang; Hongmin Hu; Fei Yan; Junwen Lu; Zhiqiang Zeng; Weidong Hong; Zhedong Zheng; |
| 288 | Why Is A Bird’s Caption A Good Demonstration? Towards Effective Multimodal In-Context Learning Without Dedicated Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Multimodal Large Language Models (MLLMs) have achieved impressive performance across a range of tasks by leveraging Multimodal In-Context Learning (MICL), which uses a few task-specific examples as demonstrations. |
Junlin Fang; Wenya Wang; Lingli Zhang; Fengmao Lv; |
| 289 | WMamba: Wavelet-based Mamba for Face Forgery Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current wavelet-based approaches fail to fully exploit the distinctive properties of wavelet data, resulting in sub-optimal feature extraction and limited performance gains. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. |
Siran Peng; Tianshuo Zhang; Li Gao; Xiangyu Zhu; Haoyuan Zhang; Kai Pang; Zhen Lei; |
| 290 | TrustCLIP: Learning from Noisy Labels Via Semantic Label Verification and Trust-aligned Gradient Projection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose TrustCLIP, a noise-robust prompt tuning framework that leverages the inherent semantic structure of CLIP through two key components: Semantic Label Verification (SLV) and Trust-aligned Gradient Projection (TGP). |
Xueyi Zhang; Peiyin Zhu; Yuan Liao; Xiyu Wang; Mingrui Lao; Siqi Cai; Yanming Guo; Haizhou Li; |
| 291 | EventLip: Enhancing Event-Based Lip Reading Via Frequency-Aware Spatiotemporal Hypergraph Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose FAST-HG, a Frequency-Aware SpatioTemporal HyperGraph framework specifically designed for event-based lip reading. |
Xueyi Zhang; Jialu Sun; Chengwei Zhang; Xianghu Yue; Tianfang Xiao; Siqi Cai; Mingrui Lao; Haizhou Li; |
| 292 | Semantic-Aware Hard Negative Mining for Medical Vision-Language Contrastive Pretraining Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing medical vision-language contrastive pretraining methods aim to bring the paired image-report embeddings close together while pushing the unpaired ones apart. |
Yongxin Li; Ying Cheng; Yaning Pan; Wen He; Qing Wang; Rui Feng; Xiaobo Zhang; |
| 293 | Generative Semantic Probing for Vision-Language Models Via Hierarchical Feature Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current methods for decoding representations of VLMs often produce suboptimal outputs, making it difficult to probe clear visual patterns. To address this, we introduce Generative Semantic Probing (GSP), a novel training-free framework that synthesizes images to probe the implicit semantic preferences of VLMs. |
He Wang; Longquan Dai; Shihao Pu; Shaomeng Wang; Jinhui Tang; |
| 294 | From Individuals to Crowds: Dual-Level Public Response Prediction in Social Media Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, they overlook macro-level sentiment distribution and only deal with individual-level sentiment, constraining them from analyzing broader societal trends and group sentiment dynamics. To address these challenges, we propose SocialAlign, a unified framework that predicts real-world responses at both micro and macro levels in social contexts. |
Jinghui Zhang; Kaiyang Wan; Longwei Xu; Ao Li; Zongfang Liu; Xiuying Chen; |
| 295 | FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s. |
Yuxuan Jiang; Zehua Chen; Zeqian Ju; Chang Li; Weibei Dou; Jun Zhu; |
| 296 | Detecting Violations of Physical Common Sense in Images: A Challenge Dataset and Effective Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In response, we propose PhyDetector, a two-stage fine-tuning framework to enhance the VLMs’ capability to detect violations of physical common sense. |
Weibin Wu; Zitong Wang; Zhengjie Luo; Wenqing Chen; Zibin Zheng; |
| 297 | Graph-Guided Dual-Level Augmentation for 3D Scene Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, most augmentation strategies only focus on local transformations or semantic recomposition, lacking the consideration of global structural dependencies within scenes. To address this limitation, we propose a graph-guided data augmentation framework with dual-level constraints for realistic 3D scene synthesis. |
Hongbin Lin; Yifan Jiang; Juangui Xu; Jesse J. Xu; Yi Lu; Zhengyu Hu; Ying-Cong Chen; Hao Wang; |
| 298 | Mitigating Information Loss Under High Pruning Rates for Efficient Large Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose an Adaptive Content Compensation Method (ACCM), which can effectively mitigate the visual information loss via an image caption. |
Mingyu Fu; Wei Suo; Ji Ma; Lin Yuanbo Wu; Peng Wang; Yanning Zhang; |
| 299 | LL-Gaussian: Low-Light Scene Reconstruction and Enhancement Via Gaussian Splatting for Novel View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In contrast, 3D Gaussian Splatting (3DGS) enables real-time rendering with competitive visual fidelity; however, existing 3DGS-based methods struggle with low-light sRGB inputs, resulting in unstable Gaussian initialization and ineffective noise suppression. To address these challenges, we propose LL-Gaussian, a novel framework for 3D reconstruction and enhancement from low-light sRGB images, enabling pseudo normal-light novel view synthesis. |
Hao Sun; Fenggen Yu; Huiyao Xu; Tao Zhang; Changqing Zou; |
| 300 | Grounding Emotion Recognition with Visual Prototypes: VEGA – Revisiting CLIP in MERC Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we revisit the use of CLIP and propose a novel Visual Emotion Guided Anchoring (VEGA) mechanism that introduces class-level visual semantics into the fusion and classification process. |
Guanyu Hu; Dimitrios Kollias; Xinyu Yang; |
| 301 | Uni-Sight: An E2E Vision-Language-Action System Unifying Multi-View Alignment and Multi-Modal Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While current systems have advanced the instruction-following capabilities, their limited spatial perception often leads to suboptimal performance for mobile manipulation tasks in unstructured environments. To address this challenge, we propose Uni-Sight, an end-to-end VLA system for robust mobile manipulation. |
Daixun Li; Sibo He; Jiayun Tian; Yusi Zhang; Weiying Xie; Mingxiang Cao; Donglai Liu; Zirui Li; Tianlin Hui; Rui Huang; Yunsong Li; |
| 302 | Re-Activating Frozen Primitives for 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While recent approaches attribute these issues to insufficient splitting of large-scale Gaussians, we identify two fundamental limitations: gradient magnitude dilution during densification and the frozen-primitive phenomenon, where essential Gaussian densification is inhibited in complex regions while suboptimally scaled Gaussians become trapped in local optima. To address these challenges, we introduce ReAct-GS, a method founded on the principle of re-activation. |
Yuxin Cheng; Binxiao Huang; Wenyong Zhou; Taiqiang Wu; Zhengwu Liu; Graziano Chesi; Ngai Wong; |
| 303 | Selective Shift: Towards Personalized Domain Adaptation in Multi-Agent Collaborative Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing adaptation methods focus on domain-generalized feature extraction while neglecting multi-agent shift uncertainty and relational semantic loss. To address this issue, we propose a Selective Shift Domain Adaptation method in multi-agent collaborative perception, called SSDA. |
Hui Zhang; Yiteng Xu; Yonglin Tian; Yidong Li; Tiago H. Falk; Fei-Yue Wang; |
| 304 | I-C Attack: In-place and Cross-pixel Augmentations for Highly Transferable Transformation-based Attacks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Integrating them enables higher transferability. Therefore, we propose an attack design paradigm to fully leverage both augmentations. |
Jiaming Liang; Chi-Man Pun; |
| 305 | Improving Compositional Generalization in Cross-Embodiment Learning Via Mixture of Disentangled Prototypes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the inherent conflict between the unbounded space of agent-environment combinations and a single unified policy model hinders generalization to unseen combinations. To address this challenge, we propose a novel Mixture of Disentangled Prototypes (MoDP) method to improve the compositional generalization in CEL. |
Ren Wang; Xin Wang; Tongtong Feng; Xinyue Gong; Guangyao Li; Yu-Wei Zhan; Qing Li; Wenwu Zhu; |
| 306 | RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces RAIDX (Retrieval-Augmented Image Deepfake Detection and Explainability), a novel deepfake detection framework integrating Retrieval-Augmented Generation (RAG) and Group Relative Policy Optimization (GRPO) to enhance detection accuracy and decision explainability. |
Tianxiao Li; Zhenglin Huang; Haiquan Wen; Yiwei He; Shuchang Lyu; Baoyuan Wu; Guangliang Cheng; |
| 307 | What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing paradigms typically perform class-agnostic region segmentation followed by category matching, which deviates from the human visual system’s process of recognizing objects based on semantic concepts, leading to poor alignment between region segmentation and object concepts. To bridge this gap, we propose a novel Cognition-Inspired Framework for open vocabulary image segmentation that emulates the human visual recognition process: first forming a conceptual understanding of an object, then perceiving its spatial extent. |
Jianghang Lin; Yue Hu; Jiangtao Shen; Yunhang Shen; Liujuan Cao; Shengchuan Zhang; Rongrong Ji; |
| 308 | HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Furthermore, simply interpolating the positional embeddings of visual encoders to enhance perceptual resolution yields only marginal performance improvements while incurring substantial computational costs. To address this, we propose HRSeg, an efficient model with high-resolution fine-grained perception. |
Weihuang Lin; Yiwei Ma; Xiaoshuai Sun; Shuting He; Jiayi Ji; Liujuan Cao; Rongrong Ji; |
| 309 | Energy-based Deep Incomplete Multi-View Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Most existing IMVC approaches face a trade-off: imputation-free methods suffer from information bias and imbalance, while full-imputation methods risk introducing and propagating noise. To overcome these limitations, we propose Energy-Based Deep Incomplete Multi-View Clustering (Energy-DIMC), a novel selective-imputation framework that leverages energy-based models (EBMs) to guide reliable imputations and robust clustering. |
Ziyu Wang; Yiming Du; Rui Ning; Lusi Li; |
| 310 | CMM-Math: A Chinese Multimodal Math Dataset To Evaluate and Enhance The Mathematics Reasoning of Large Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we release a Chinese multimodal math (CMM-Math) dataset, including benchmark and training parts, to evaluate and enhance the mathematical reasoning of LMMs. |
Wentao Liu; Qianjun Pan; Yi Zhang; Zhuo Liu; Ji Wu; Jie Zhou; Aimin Zhou; Qin Chen; Bo Jiang; Liang He; |
| 311 | DriVerse: Navigation World Model for Driving Simulation Via Multimodal Trajectory Prompting and Motion Alignment Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper presents DriVerse, a generative model for simulating navigation-driven driving scenes from a single image and a future trajectory. |
Xiaofan Li; Chenming Wu; Zhao Yang; Zhihao Xu; Yumeng Zhang; Dingkang Liang; Ji Wan; Jun Wang; |
| 312 | Evaluating The Robustness of Multimodal Agents Against Active Environmental Injection Attacks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Focusing on the interaction mechanisms of the Android OS, we conduct a risk assessment of AEIA and identify two critical security vulnerabilities: (1) Adversarial content injection in multimodal interaction interfaces, where attackers embed adversarial instructions within environmental elements to mislead agent decision-making; and (2) Reasoning gap vulnerabilities in the agent’s task execution process, which increase susceptibility to AEIA attacks during reasoning. To evaluate the impact of these vulnerabilities, we propose AEIA-MN, an attack scheme that exploits interaction vulnerabilities in mobile operating systems to assess the robustness of MLLM-based agents. |
Yurun Chen; Xueyu Hu; Keting Yin; Juncheng Li; Shengyu Zhang; |
| 313 | A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning that constructs a visual properties-enhanced hierarchical feature field for open-vocabulary 3D visual grounding. |
Zhenyang Liu; Sixiao Zheng; Siyu Chen; Cairong Zhao; Longfei Liang; Xiangyang Xue; Yanwei Fu; |
| 314 | Entity-Level Alignment with Prompt-Guided Adapter for Remote Sensing Image-Text Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods often overlook explicit attention to semantic entities in RS scenes, limiting their capabilities in fine-grained semantic modeling and cross-modal matching, thereby hindering retrieval performance. To address these limitations, we propose a novel framework, Entity-level Alignment with Prompt-guided Adapter (EAPA), which enhances retrieval performance by explicitly perceiving, embedding, and aligning semantic entities in RS images and texts. |
Shuoshuo Li; Shuli Cheng; Liejun Wang; |
| 315 | Trusted Open-World Multi-View Classification with Dynamic Opinion Aggregation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these approaches are designed for closed-set scenarios and fail when novel or unknown categories appear in open-world contexts. To address this limitation, we introduce the concept of Open Multi-View Learning, with the objective of detecting unknown categories with low confidence scores. |
Zhicheng Dong; Xiaodong Yue; Yufei Chen; Yuxian Zhou; |
| 316 | CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose CogDDN, a VLM-based framework that emulates the human cognitive and learning mechanisms by integrating fast and slow thinking systems and selectively identifying key objects essential to fulfilling user demands. |
Yuehao Huang; Liang Liu; Shuangming Lei; Yukai Ma; Hao Su; Jianbiao Mei; Pengxiang Zhao; Yaqing Gu; Yong Liu; Jiajun Lv; |
| 317 | ReactDiff: Fundamental Multiple Appropriate Facial Reaction Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods fail to model the stochasticity and dynamics inherent in real human reactions. To address this, we propose ReactDiff, a novel temporal diffusion framework for generating diverse facial reactions that are appropriate for responding to any given dialogue context. |
Cheng Luo; Siyang Song; Siyuan Yan; Zhen Yu; Zongyuan Ge; |
| 318 | From Model Diagram to Code: A Benchmark Dataset and Multi-Agent Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The complex structural elements and implicit relationships in model diagrams present greater challenges for MLLMs, particularly in terms of visual reasoning and semantic interpretation. To support this task, we introduce MDCDataset, a dataset designed to evaluate the ability of MLLMs to generate code from model diagrams. |
Mengzhen Wang; Xunbin Huang; Jiayuan Xie; Shukai Ma; Jiale Men; Dayong Liang; Yi Cai; |
| 319 | Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We identify three critical limitations in current emotional talking head generation: insufficient utilization of audio’s inherent emotional cues, identity leakage in emotion representations, and isolated learning of emotion correlations. To address these challenges, we propose a novel framework dubbed DICE-Talk, following the idea of disentangling identity from emotion, and then cooperating emotions with similar characteristics. |
Weipeng Tan; Chuming Lin; Chengming Xu; FeiFan Xu; Xiaobin Hu; Xiaozhong Ji; Junwei Zhu; Chengjie Wang; Yanwei Fu; |
| 320 | Boosting Multi-Modal Alignment: Geometric Feature Separation for Class Incremental Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, due to differences between the pre-training data and downstream tasks, these textual features can become too similar for certain classes, leading to prediction errors. To address this issue, we propose a method that optimizes the geometric structure of both visual and textual features across different classes. |
Guoqiang Liang; Chuan Qin; De Cheng; Shizhou Zhang; Yanning Zhang; |
| 321 | Regularizing Subspace Redundancy of Low-Rank Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Hence, we propose ReSoRA, a method that explicitly models redundancy between mapping subspaces and adaptively Regularizes Subspace redundancy of Low-Rank Adaptation. |
Yue Zhu; Haiwen Diao; Shang Gao; Jiazuo Yu; Jiawen Zhu; Yunzhi Zhuge; Shuai Hao; Xu Jia; Lu Zhang; Ying Zhang; Huchuan Lu; |
| 322 | RealText: Realistic Text Image Generation Based on Glyph and Scene Aware Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we introduce RealText, a method that excels in generating precise and realistic scene text images in any language. |
Zihou Liu; Dongming Zhang; Jing Zhang; Jun Li; Yongdong Zhang; |
| 323 | MFFI: Multi-Dimensional Face Forgery Image Dataset for Real-World Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, these datasets fall short in four key areas: coverage of unknown advanced forgery techniques, variability of facial scenes, richness of real data, and degradation from real-world propagation. To address these challenges, we propose the Multi-dimensional Face Forgery Image (MFFI) dataset, tailored for real-world scenarios. |
Changtao Miao; Yi Zhang; Man Luo; Weiwei Feng; Kaiyuan Zheng; Qi Chu; Tao Gong; Jianshu Li; Yunfeng Diao; Wei Zhou; Joey Tianyi Zhou; Xiaoshuai Hao; |
| 324 | Separate to Collaborate: Dual-Stream Diffusion Model for Coordinated Piano Hand Motion Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a dual-stream neural framework designed to generate synchronized hand gestures for piano playing from audio input, addressing the critical challenge of modeling both hand independence and coordination. |
Zihao Liu; Mingwen Ou; Zunnan Xu; Jiaqi Huang; Haonan Han; Ronghui Li; Xiu Li; |
| 325 | Clustering-Oriented Generative Attribute Graph Imputation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To remedy the problems, we establish the Clustering-oriented Generative Imputation with reliable Refinement (CGIR) model. |
Mulin Chen; Bocheng Wang; Jiaxin Zhong; Zongcheng Miao; Xuelong Li; |
| 326 | EDMG: Towards Efficient Long Dance Motion Generation with Fundamental Movements from Dance Genres Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a novel dance choreography framework, EDMG, designed to efficiently generate creative and long-lasting dance sequences conditioned on music and dance descriptions. |
Jinming Zhang; Yunlian Sun; Hongwen Zhang; Jinhui Tang; |
| 327 | Phys4DRT: Physics-based 4D Generation for Real-Time Interaction with Time-Frequency Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes a novel physics-based 4D generation method, Phys4DRT, for arbitrary realistic real-time interaction on 3D Gaussian Splatting (3DGS) objects with direct motion supervision in the time-frequency domain. |
Yuntian Xiao; Shoulong Zhang; Zihang Zhang; Jiahao Cui; Yan Wang; Shuai Li; |
| 328 | Toward Robust Signed Graph Learning Through Joint Input-Target Denoising Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by the success of graph information bottleneck (GIB) in information extraction, we propose RIDGE, a novel framework for Robust sIgned graph learning through joint Denoising of Graph inputs and supervision targEts. |
Junran Wu; Beng Chin Ooi; Ke Xu; |
| 329 | Multi-Width Neural Network-Assisted Hierarchical Federated Learning in Heterogeneous Cloud-Edge-Device Computing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Meanwhile, frequent local NN training and transmission impose high energy consumption pressure on users. To tackle these issues, this paper proposes a premium multi-width NN-assisted hierarchical FL (HFL) framework in heterogeneous cloud-edge-device computing to achieve remarkable training speedup and energy conservation. |
Haizhou Wang; Guobing Zou; Fei Xu; Yangguang Cui; Tongquan Wei; |
| 330 | Learning Hierarchical Cross-modal Association with Intra-modal Context for Text-Image Person Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, to model hierarchical cross-modal semantic relationships, we propose a Hierarchical Relevance Matching (HRM) module. |
Yifei Deng; Chenglong Li; Futian Wang; Jin Tang; |
| 331 | HiProbe-VAD: Video Anomaly Detection Via Hidden States Probing in Tuning-Free Multimodal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Traditional methods often struggle with substantial computational demands and a reliance on extensive labeled datasets, thereby restricting their practical applicability. To address these constraints, we propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning. |
Zhaolin Cai; Fan Li; Ziwei Zheng; Yanjun Qin; |
| 332 | Investigating Domain Gaps for Indoor 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we consider the task of adapting indoor 3D object detectors from one dataset to another, presenting a comprehensive benchmark with ScanNet, SUN RGB-D and 3D Front datasets, as well as our newly proposed large-scale datasets ProcTHOR-OD and ProcFront generated by a 3D simulator. |
Zijing Zhao; Zhu Xu; Qingchao Chen; Yuxin Peng; Yang Liu; |
| 333 | Dynamic Analysis and Adaptive Discriminator for Fake News Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these methods rely heavily on human expertise and feedback, lacking flexibility. To address this challenge, we propose a Dynamic Analysis and Adaptive Discriminator (DAAD) approach for fake news detection. |
Xinqi Su; Zitong Yu; Yawen Cui; Ajian Liu; Xun Lin; Yuhao Wang; Haochen Liang; Wenhui Li; Li Shen; Xiaochun Cao; |
| 334 | MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs’ ability to handle multimodal constraints in planning. |
Yiyan Ji; Haoran Chen; Qiguang Chen; Chengyue Wu; Libo Qin; Wanxiang Che; |
| 335 | Scene123: One Prompt to 3D Scene Generation Via Video-Assisted and Consistency-Enhanced MAE Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Benefiting from recent video generation models and implicit neural representations, we propose Scene123, a 3D scene generation model, which combines a video generation framework to ensure realism and diversity with implicit neural fields integrated with Masked Autoencoders (MAE) to effectively ensure the consistency of unseen areas across views. |
Yiying Yang; Fukun Yin; Jiayuan Fan; Wanzhang Li; Xin Chen; Gang Yu; |
| 336 | WetCat: Enabling Automated Skill Assessment in Wet-Lab Cataract Surgery Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite notable progress in ophthalmic surgical datasets, existing resources predominantly focus on real surgeries or isolated tasks, falling short of supporting comprehensive skill evaluation in controlled wet-lab settings. To address these limitations, we introduce WetCat, the first dataset of wet-lab cataract surgery videos specifically curated for automated skill assessment. |
Negin Ghamsarian; Raphael Sznitman; Klaus Schoeffmann; Jens Kowal; |
| 337 | TongGu-VL: Advancing Visual-Language Understanding in Chinese Classical Studies Through Parameter Sensitivity-Guided Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While Large Language Models (LLMs) have been explored to facilitate CCS, current methods primarily focus on textual analysis, overlooking the rich visual information intrinsic to classical materials. To bridge this gap, we propose TongGu-VL, a pioneering specialized MLLM designed for CCS applications. |
Jiahuan Cao; Yang Liu; Peirong Zhang; Yongxin Shi; Kai Ding; Lianwen Jin; |
| 338 | DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While Vision-Language Pre-training (VLP) models offer a strong open-vocabulary foundation, they often struggle with fine-grained localization under weak supervision and typically fail to explicitly leverage structured relational knowledge beyond basic semantics, limiting performance especially for unseen classes. To overcome these limitations, we propose the Dual Adaptive Refinement Transfer (DART) framework. |
Haijing Liu; Tao Pu; Hefeng Wu; Keze Wang; Liang Lin; |
| 339 | OCR-Critic: Aligning Multimodal Large Language Models’ Perception Through Critical Feedback Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: These strategies systematically mitigate LMMs’ weaknesses in OCR tasks by providing coarse-to-fine error feedback. To comprehensively evaluate these capabilities, we introduce OCR-ERROR, a benchmark designed to assess LMMs’ ability to detect and categorize OCR errors, covering two task types, diverse error categories, and 2,400 rigorously validated samples. |
Qiuna Tan; Runqi Qiao; Guanting Dong; YiFan Zhang; Minhui Wu; Jiapeng Wang; Miaoxuan Zhang; Yida Xu; Chong Sun; Chen Li; Honggang Zhang; |
| 340 | Camouflaged Object Tracking: A Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Our evaluation of 20 existing tracking algorithms reveals significant deficiencies in their performance with camouflaged objects. To address these issues, we propose a novel tracking framework, HIPTrack-MLS, which demonstrates promising results in improving tracking performance for camouflaged objects. |
Xiaoyu Guo; Pengzhi Zhong; Hao Zhang; DeFeng Huang; Huikai Shao; Qijun Zhao; Shuiwang Li; |
| 341 | SHALE: A Scalable Benchmark for Fine-grained Hallucination Evaluation in LVLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Additionally, existing benchmarks often rely on costly manual curation or reused public datasets, raising concerns about scalability and data leakage. To address these limitations, we propose an automated data construction pipeline that produces scalable, controllable, and diverse evaluation data. |
Bei Yan; Zhiyuan Chen; Yuecong Min; Jie Zhang; Jiahao Wang; Xiaozhen Wang; Shiguang Shan; |
| 342 | From Subtle Hints to Grand Expressions – Mastering Fine-grained Emotions with Dynamic Multimodal Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods may overlook the key issue that multimodal components are temporally asynchronous, and they obtain insufficient representations of fine-grained emotional expressions. In light of this, we propose a unified emotion reasoning model, EmoChat, which enhances multimodal emotion analysis by dynamically generating emotion-related tokens and fine-grained expression information through facial action modeling. |
Qinfu Xu; Liyuan Pan; Shaozu Yuan; Yiwei Wei; Chunlei Wu; |
| 343 | Anchors Bring Stability and Efficiency: Fast Tensorial Multi-view Clustering on Shuffled Datasets Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: 2) Extremely high computational complexity arising from tensor-related operations. To address these limitations, we propose a novel framework termed Anchors Bring Stability and Efficiency: Fast Tensorial Multi-view Clustering on Shuffled Case (SE-FTMC). |
Jintian Ji; Songhe Feng; |
| 344 | DATE: Dual Prompt Learning with Information Bottleneck for Graph Out-of-Distribution Generalization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a novel approach named Dual Prompt Learning with Information Bottleneck (DATE) for graph out-of-distribution generalization. |
Jiayi Zeng; Tao Ren; Changhu Wang; Yifan Wang; Wei Ju; Zhipeng Sun; Xiao Luo; |
| 345 | Multi-round Mutual Emotion-Cause Pair Extraction for Emotion-Attributed Video Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, a multi-round mutual emotion-cause pair extraction network (MM-ECPE) is proposed in this paper for the joint extraction of emotional cues and visual causes through iterative mutual refinement. |
Cheng Ye; Weidong Chen; Peipei Song; Xinyan Liu; Lei Zhang; Zhendong Mao; |
| 346 | Casual3DHDR: High Dynamic Range 3D Gaussian Splatting from Casually Captured Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To make data acquisition more flexible, we propose Casual3DHDR, a robust one-stage method that reconstructs 3D HDR scenes from casually-captured auto-exposure (AE) videos, even under severe motion blur and unknown, varying exposure times. |
Shucheng Gong; Lingzhe Zhao; Wenpu Li; Hong Xie; Yin Zhang; Shiyu Zhao; Peidong Liu; |
| 347 | DichotomyIR: Universal Image Reconstruction Via Dichotomy Classification and Uncertainty Elimination Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Integrating the popular state space model with the IUE processing, we propose the Uncertainty Elimination Mamba (UEM) to eliminate the reconstructed uncertainty iteratively. |
Yan Zhang; Shiwen He; Lin Yuan; Jiaxu Leng; Xinbo Gao; |
| 348 | StrandDesigner: Towards Practical Strand Generation with Sketch Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Instead, we propose the first sketch-based strand generation model, which offers finer control while remaining user-friendly. |
Na Zhang; Moran Li; Chengming Xu; Han Feng; Xiaobin Hu; Jiangning Zhang; Weijian Cao; Chengjie Wang; Yanwei Fu; |
| 349 | FantasyTalking: Realistic Talking Portrait Generation Via Coherent Motion Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing approaches often struggle to capture subtle facial expressions, the associated global body movements, and the dynamic background. To address these limitations, we propose a novel framework that leverages a pretrained video diffusion Transformer model to generate high-fidelity, coherent talking portraits with controllable motion dynamics. |
Mengchao Wang; Qiang Wang; Fan Jiang; Yaqi Fan; Yunpeng Zhang; Yonggang Qi; Kun Zhao; Mu Xu; |
| 350 | Tackling Device Data Distribution Real-time Shift Via Prototype-based Parameter Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This critical issue is often overlooked in current research, which predominantly relies on data-intensive and computationally expensive fine-tuning approaches. To tackle this, we introduce Persona, a novel personalized method using a prototype-based, backpropagation-free parameter editing framework to enhance model generalization without post-deployment retraining. |
Zheqi Lv; Wenqiao Zhang; Kairui Fu; Qi Tian; Shengyu Zhang; Jiajie Su; Jingyuan Chen; Kun Kuang; Fei Wu; |
| 351 | Rethinking Diffusion Bridge Model with Dual Alignments for Medical Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While recent diffusion-based models show promise in medical image synthesis, they face two key limitations: progressive distribution drift from coarse intermediate samples and structural granularity loss due to missing high-frequency constraints. To address these challenges, we propose Dual Diffusion Bridge (DualDB), a framework integrating implicit distribution alignment and explicit structural constraints within a unified diffusion bridge paradigm. |
Jinbao Wei; Yuhang Chen; Zhijie Wang; Gang Yang; Shimin Tao; Jian Gao; Aiping Liu; Xun Chen; |
| 352 | StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, current methods often incorporate specialized feature fusion modules tailored to specific modalities, thereby restricting input flexibility and increasing the number of training parameters. To address these challenges, we propose StitchFusion, a straightforward yet effective modal fusion framework that integrates large-scale pre-trained models directly as encoders and feature fusers. |
Bingyu Li; Da Zhang; Zhiyuan Zhao; Junyu Gao; Xuelong Li; |
| 353 | Rule Meets Learning: Confidence-Aware Multi-View Fusion for Self-Supervised 3D Hand Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we decompose multi-view fusion into two components: a learnable confidence estimation stage and a fixed confidence fusion stage. |
Pengfei Ren; Jingyu Wang; Haifeng Sun; Qi Qi; Jing Wang; Jianxin Liao; |
| 354 | A Spatial Relationship Aware Dataset for Robotics Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a spatial-relationship-aware dataset of nearly 1,000 robot-acquired indoor images, annotated with object attributes, positions, and detailed spatial relationships. |
Peng Wang; Minh Huy Pham; Zhihao Guo; Wei Zhou; |
| 355 | HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). |
Zhiwei Chen; Yupeng Hu; Zixu Li; Zhiheng Fu; Haokun Wen; Weili Guan; |
| 356 | Bright to Dark: Stage-wise Bilevel Knowledge Transfer for Seeing Text in The Dark Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we address the challenge by adopting a novel approach: tailoring the detector for low light conditions through knowledge distillation from normal light conditions, without relying on any enhancement module. |
Chengpei Xu; Wenhao Zhou; Long Ma; Weimin Wang; Feng Xia; Binghao Li; Wenjie Zhang; |
| 357 | Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This significantly curtails the development of models capable of deep, multi-view 3D scene understanding over distant objects. To address these challenges, we introduce MV-ScanQA, a novel 3D question answering dataset where 68% of questions explicitly require integrating information from multiple views (compared to less than 7% in existing datasets), thereby rigorously testing multi-view compositional reasoning. |
Wentao Mo; Qingchao Chen; Yuxin Peng; Siyuan Huang; Yang Liu; |
| 358 | Multi-Dimensional Text-to-Face Image Quality Assessment Using LLM: Database and Method Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, evaluating the quality of these generated face images, particularly with respect to fine-grained facial attributes, remains a significant challenge. To address this, we introduce the Fine-grained Text-to-Face Image Quality Assessment (FineTFIQA) database, which is designed to evaluate the ability of T2I models to generate fine-grained face images. |
Yixuan Gao; Xiongkuo Min; Jinliang Han; Yuqin Cao; Sijing Wu; Yunze Dou; Guangtao Zhai; |
| 359 | Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they overlook the inherent output biases of MLLMs, where models rely on their parametric memory to answer questions rather than genuinely understanding the chart content. To address this limitation, we introduce a novel Chart Hypothetical Question Answering (HQA) task, which imposes assumptions on the same question to compel models to engage in counterfactual reasoning based on the chart content. |
Xiangnan Chen; Yuancheng Fang; Juncheng Li; Qian Xiao; Jun Lin; Siliang Tang; Yueting Zhuang; |
| 360 | DCount: Decoupled Spatial Perception and Attribute Discrimination for Referring Expression Counting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This architectural constraint entangles the semantic and localization perception processes, hindering fine-grained understanding of attribute-aware visual features. To address these challenges, we propose DCount, a decoupled counting framework comprising two innovative components: a Decoupled Dual-Decoder (DDD) module and an Attribute Semantic Discriminator (ASD) module. |
Ming Li; Yupeng Hu; Yinwei Wei; Hao Liu; Haocong Wang; Weili Guan; |
| 361 | RATopo: Improving Lane Topology Reasoning Via Redundancy Assignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose RATopo, a Redundancy Assignment strategy for lane Topology reasoning that enables quantity-rich and geometry-diverse topology supervision. |
Han Li; Shaofei Huang; Longfei Xu; Yulu Gao; Beipeng Mu; Si Liu; |
| 362 | Transfer Attack for Bad and Good: Explain and Boost Adversarial Transferability Across Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In particular, the transferability of adversarial examples remains an ongoing challenge. In this paper, we specifically analyze the manifestation of adversarial transferability among MLLMs and identify the key factors that influence this characteristic. |
Hao Cheng; Erjia Xiao; Jiayan Yang; Jinhao Duan; Yichi Wang; Jiahang Cao; Qiang Zhang; Le Yang; Kaidi Xu; Jindong Gu; Renjing Xu; |
| 363 | T-GRAG: A Dynamic GraphRAG Framework for Resolving Temporal Conflicts and Redundancy in Knowledge Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing GraphRAG methods largely ignore the temporal dynamics of knowledge, leading to issues such as temporal ambiguity, time-insensitive retrieval, and semantic redundancy. To overcome these limitations, we propose Temporal GraphRAG (T-GRAG), a dynamic, temporally-aware RAG framework that models the evolution of knowledge over time. |
Dong Li; Yichen Niu; Ying Ai; Xiang Zou; Biqing Qi; Jianxing Liu; |
| 364 | PatchWiper: Leveraging Dynamic Patch-Wise Parameters for Real-World Visible Watermark Removal Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Visible watermark removal is crucial for evaluating watermark robustness and advancing more resilient protection techniques. |
Zihao Mo; Junye Chen; Chaowei Fang; Guanbin Li; |
| 365 | U-MERE: Unconstrained Multimodal Entity and Relation Extraction with Collaborative Modeling and Order-Sensitive Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the limitations, we propose a new task, Unconstrained Multimodal Entity and Relation Extraction (U-MERE), which jointly extracts arbitrary visual and textual entities, and their relations from image-text pairs. |
Wei Jia; Li Jin; Kaiwen Wei; Yuying Shang; Nayu Liu; Zhicong Lu; Qing Liu; Linhao Zhang; Jiang Zhong; Yanfeng Hu; |
| 366 | SVGen: Interpretable Vector Graphics Generation with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This provides rich semantic supervision signals for model learning. Based on this dataset, we propose SVGen, an end-to-end generative model capable of directly converting natural language descriptions into SVG code. |
Feiyu Wang; Zhiyuan Zhao; Yuandong Liu; Da Zhang; Junyu Gao; Hao Sun; Xuelong Li; |
| 367 | Multimodal Decomposed Distillation with Instance Alignment and Uncertainty Compensation for Thermal Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a multimodal decomposed distillation framework to develop robust thermal-only detectors by transferring knowledge from multimodal teachers. |
Yanfeng Liu; Lefei Zhang; |
| 368 | Multi-Task Dense Prediction Fine-Tuning with Mixture of Fine-Grained Experts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce a novel Fine-Grained Mixture of Experts (FGMoE) architecture that explores MoE-based MTL models through a combination of three key innovations and fine-tuning. |
Yangyang Xu; Xi Ye; Duo Su; |
| 369 | Camera-Specific Imaging Simulation for Raw Domain Image Super Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: RAW domain image super-resolution faces two critical challenges: the physical impossibility of capturing native high-quality RAW references with a resolution-limited camera and the limitations of neural networks, including inefficient residual layer utilization and spectral bias in feature learning. This paper proposes a strategy combining physics-based imaging simulation and neural networks to jointly address these challenges. |
Xiaobo Liu; Henglu Wei; Chuxi Yang; Wei Yu; Xudong Zhao; Xiangyang Ji; |
| 370 | SLGaussian: Fast Language Gaussian Splatting in Sparse Views Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods struggle under sparse view conditions, relying on inefficient per-scene multi-view optimizations, which are impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints, allowing direct inference of 3DGS-based scenes. |
Kangjie Chen; BingQuan Dai; Minghan Qin; Dongbin Zhang; Peihao Li; Yingshuang Zou; Haoqian Wang; |
| 371 | AudioAtlas: A Comprehensive and Balanced Benchmark Towards Movie-Oriented Text-to-Audio Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces AudioAtlas, a comprehensive and balanced evaluation benchmark specifically designed for evaluating T2A models aimed at movie production. |
Chenxi Wang; Yusheng Dai; Lei Sun; Jun Du; Jianqing Gao; |
| 372 | SaP-Bot: A Multimodal Large-Language Model for End-to-End Same-Product Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Conventional approaches typically depend on manual feature engineering and extensive rule tuning, which limits their adaptability to varying identification criteria across different product categories and inconsistent business scenarios. To overcome these challenges, we propose an end-to-end same-product identification model powered by multimodal large language models (MLLMs) that inherently support multimodal alignment and exhibit strong generalization across diverse real-world settings. |
Yixuan Zhou; Yulu Tian; Wenliang Zhong; Xingbin Yu; Heng Tao Shen; Xing Xu; |
| 373 | SpecXNet: A Dual-Domain Convolutional Network for Robust Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose the Spectral Cross-Attentional Network (SpecXNet), a dual-domain architecture for robust deepfake detection. |
Inzamamul Alam; Md Tanvir Islam; Simon S. Woo; |
| 374 | MSITrack: A Challenging Benchmark for Multispectral Single Object Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the availability of multispectral tracking datasets remains limited. To bridge this gap, we introduce MSITrack, the largest and most diverse multispectral single object tracking dataset to date. |
Tao Feng; Tingfa Xu; Haolin Qin; Tianhao Li; Shuaihao Han; Xuyang Zou; Zhan Lv; Jianan Li; |
| 375 | Analytic Synaptic Dynamic Scaling Balancer for Multimodal Deepfake Continual Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, MDCD remains underexplored, facing two major challenges: (1) modality-specific feature disparities limit the effectiveness of simple feature fusion, exacerbating the forgetting of previous forgery-relevant knowledge; and (2) newly introduced deepfake videos initially exhibit limited scale that gradually expands, causing a class imbalance dominated by forged samples that undermines authentic content understanding in upcoming tasks. To address these issues, we propose the Analytic Synaptic Dynamic Scaling Balancer (ADanser) that adapts to modality-specific biases and class imbalance while employing a closed-form update to preserve prior multimodal deepfake knowledge in an evolving data stream. |
Man Xiao; Jianbin Ye; Bo Liu; Zijian Gao; Kele Xu; Xiaodong Wang; |
| 376 | MAGNeT: Multimodal Adaptive Gaussian Networks for Intent Inference in Moving Target Selection Across Complex Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these methods require substantial training data for each new context and lack transferability across scenarios, limiting their practical deployment in diverse multimedia environments where rich multimodal contextual information is readily available. This paper introduces MAGNeT (Multimodal Adaptive Gaussian Networks), which addresses these problems by combining classical statistical modeling with context-aware multimodal method. |
Xiangxian Li; Yawen Zheng; Baiqiao Zhang; Yijia Ma; Xianhui Cao; Juan Liu; Yulong Bian; Jin Huang; Chenglei Yang; |
| 377 | Align 3D Representation and Text Embedding for 3D Content Personalization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current 3D personalization approaches predominantly rely on knowledge distillation-based methods, which require computationally expensive retraining procedures. To address this challenge, we propose Invert3D, a novel framework for convenient 3D content personalization. |
Qi Song; Ziyuan Luo; Ka Chun Cheung; Simon See; Renjie Wan; |
| 378 | Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods often rely on separately trained detection modules that require extensive expert annotations, introducing high labeling costs and limiting generalizability due to pathology distribution bias across datasets. To address these challenges, we propose Self-Supervised Anatomical Consistency Learning (SS-ACL), a novel and annotation-free framework that aligns generated reports with corresponding anatomical regions using simple textual prompts. |
Longzhen Yang; Zhangkai Ni; Ying Wen; Yihang Liu; Lianghua He; Heng Tao Shen; |
| 379 | Agent-to-Agent (A2A) Protocol Integrated Digital Twin System with AgentIQ for Multimodal AI Fitness Coaching and Personalized Well-Being Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a protocol-integrated Digital Twin (DT) architecture that reimagines fitness coaching as a distributed, explainable, and emotionally adaptive ecosystem. |
Kamran Gholizadeh HamlAbadi; Monica Vahdati; Fedwa Laamarti; Abdulmotaleb El Saddik; |
| 380 | Novel Category Discovery with X-Agent Attention for Open-Vocabulary Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we initiate a probing experiment to explore distribution patterns and dynamics of latent semantics in VLMs under inductive learning paradigms. Building on these insights, we propose X-Agent, an innovative OVSS framework employing a latent semantic-aware “agent” to orchestrate cross-modal attention mechanisms, simultaneously optimizing latent semantic dynamics and amplifying their perceptibility. |
Jiahao Li; Yang Lu; Yachao Zhang; Fangyong Wang; Yuan Xie; Yanyun Qu; |
| 381 | Gen4Track: A Tuning-free Data Augmentation Framework Via Self-correcting Diffusion Model for Vision-Language Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To bridge the gap, we propose Gen4Track, a tuning-free data augmentation framework that leverages the self-correcting mechanism to dynamically generate high-quality video data with annotations. |
Jiawei Ge; Xinyu Zhang; Jiuxin Cao; Xuelin Zhu; Weijia Liu; Qingqing Gao; Biwei Cao; Kun Wang; Chang Liu; Bo Liu; Chen Feng; Ioannis Patras; |
| 382 | Gamma: Toward Generic Image Assessment with Mixture of Assessment Experts Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present Gamma, a Generic imAge assessMent model using Mixture of Assessment Experts, which can effectively assess images from diverse scenes through mixed-dataset training. |
Hantao Zhou; Rui Yang; Longxiang Tang; Guanyi Qin; Runze Hu; Xiu Li; |
| 383 | Unlocking Joint Image Deraining and Low-Light Enhancement: Benchmark and Baseline Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Rainy weather typically leads to significantly reduced ambient illumination due to overcast skies. However, most existing image deraining datasets overlook this critical physical condition. They are usually constructed by linearly superimposing rain layers onto clean background images, without accounting for illumination degradation. This simplification introduces a clear domain gap between synthetic and real-world rainy images, thus limiting the generalization capability of current deraining algorithms. Moreover, existing methods predominantly focus on removing rain streaks while ignoring the simultaneous degradation caused by low-light conditions. To address these limitations, we introduce a new joint task: image deraining and low-light enhancement. Specifically, we construct a physically plausible dataset that simulates rainy scenes under low-light conditions, incorporating both rain streaks and raindrops with illumination-aware degradation modeling. In addition, we propose a baseline deraining network based on a multi-scale Mamba architecture, which jointly restores rain-free and well-lit images by effectively modeling both global illumination and local rain interference. Extensive experiments demonstrate that our method outperforms existing deraining approaches. The proposed dataset is released at https://drive.google.com/file/d/1QXxHqpYL7Q1TR5tdvAHm8tV2BdZgpGOc/view?usp=sharing. |
Liang Cheng; Hao Wang; Chenwei Wu; Haochen You; Xianhao Wu; |
| 384 | Toward A Training-Free Plug-and-Play Refinement Framework for Infrared and Visible Image Registration and Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing approaches often demonstrate promising results on specific benchmarks, they tend to exhibit performance drops in unseen scenarios and incur high computational overhead when retrained on new datasets. To address these challenges, we propose TRACE, a Training-free Reinforcement-based Alignment method for Cross-modality Enhancement, which incorporates Evaluator, a rewarding network, into an evaluation-driven Reinforcement Learning (RL) framework, enabling efficient and plug-and-play refinement of any existing registration approach. |
Yating Liu; Yang Zou; Xingyuan Li; Xingyue Zhu; Kaiqi Han; Zhiying Jiang; Long Ma; Jinyuan Liu; |
| 385 | RGC-VQA: An Exploration Database for Robotic-Generated Video Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: As camera-equipped robotic platforms become increasingly integrated into daily life, robotic-generated videos have begun to appear on streaming media platforms, enabling us to envision a future where humans and robots coexist. We propose the concept of Robotic-Generated Content (RGC) to term these videos generated from the egocentric perspective of robots. |
Jianing Jin; Jiangyong Ying; Huiyu Duan; Liu Yang; Sijing Wu; Yunhao Li; Yushuo Zheng; Xiongkuo Min; Guangtao Zhai; |
| 386 | Learning Evidential Delta Denoising Scores for Video Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Additionally, the inherent uncertainty introduced by the noise injection process in diffusion models further hinders the improvement of editing performance. To address these limitations, we propose an Evidential Video Editing (EVE) framework, which normalizes noise vectors into probability distributions, enhancing the comparability of element relationships. |
Yufan Hu; Kunlin Yang; Junyu Gao; Bin Fan; Hongmin Liu; |
| 387 | Visual-informed Silent Video Identity Conversion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we focus on the task of Silent Face-based Voice Conversion (SFVC), which performs voice conversion entirely from visual inputs. |
Yifan Liu; Yu Fang; Zhouhan Lin; |
| 388 | Mavors: Multi-granularity Video Representation for Multimodal Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing approaches (e.g., sparse sampling, dense sampling with low resolution, and token compression) suffer from significant information loss in temporal dynamics, spatial details, or subtle interactions, particularly in videos with complex motion or varying resolutions. To address this, we propose Mavors, a novel framework that introduces Multi-granularity video representation for holistic long-video modeling. |
Yang Shi; Jiaheng Liu; Yushuo Guan; Zhenhua Wu; Yuanxing Zhang; Zihao Wang; Weihong Lin; Jingyun Hua; Zekun Wang; Xinlong Chen; Bohan Zeng; Wentao Zhang; Fuzheng Zhang; Wenjing Yang; Di Zhang; |
| 389 | Test-Time Adaptation for Text-Based Person Search Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, severe domain shift remains a key challenge in this field, causing source-domain-trained models to degrade significantly when applied to an unseen target domain. To address this, we propose the Identity-preserving Cross-modal Alignment and Adaptation (ICAA) model, a novel test-time adaptation framework for TBPS that enables seamless domain adaptation using only unlabeled target samples. |
Kai Niu; Liucun Shi; Ke Han; Qinzi Zhao; Yue Wu; Yanning Zhang; |
| 390 | DACA-Net: A Degradation-Aware Conditional Diffusion Network for Underwater Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a degradation-aware conditional diffusion model to enhance underwater images adaptively and robustly. |
Chang Huang; Jiahang Cao; Jun Ma; Kieren Yu; Cong Li; Huayong Yang; Kaishun Wu; |
| 391 | Unveiling Open-set Noise: Theoretical Insights Into Label Noise Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We evaluate entropy-based detection, finding it effective only for easy open-set noise, and propose solutions leveraging vision-language models and self-supervised learning to address hard noise challenges. |
Chen Feng; Nicu Sebe; Georgios Tzimiropoulos; Miguel R. D. Rodrigues; Ioannis Patras; |
| 392 | EasyAnimate: High-Performance Video Generation Framework with Hybrid Windows Attention and Reward Backpropagation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces EasyAnimate, an efficient, high-quality video generation framework that leverages diffusion transformers and encompasses data processing, model training, and end-to-end inference. |
Jiaqi Xu; Kunzhe Huang; Xinyi Zou; Yunkuo Chen; Bo Liu; Mengli Cheng; Jun Huang; Xing Shi; |
| 393 | SmartFreeEdit: Mask-Free Spatial-Aware Image Editing with Complex Instruction Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent advancements in image editing have utilized large-scale multimodal models to enable intuitive, natural instruction-driven interactions. |
Qianqian Sun; Jixiang Luo; Dell Zhang; Xuelong Li; |
| 394 | PgM: Partitioner Guided Modal Learning Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Building on this perspective, we propose a partitioner-guided modal learning framework, PgM, which consists of the modal partitioner, uni-modal learner, paired-modal learner, and uni-paired modal decoder. |
Guimin Hu; Yi Xin; Lijie Hu; Zhihong Zhu; Hasti Seifi; |
| 395 | SynergyAmodal: Deocclude Anything with Text Control Highlight: To address this challenge, we identify three critical elements: leveraging in-the-wild image data for diversity, incorporating human expertise for plausibility, and utilizing generative priors for fidelity. We propose SynergyAmodal, a novel framework for co-synthesizing in-the-wild amodal datasets with comprehensive shape and appearance annotations, which integrates these elements through a tripartite data-human-model collaboration. |
Xinyang Li; Chengjie Yi; JiaWei Lai; Mingbao Lin; Yansong Qu; Shengchuan Zhang; Liujuan Cao; |
| 396 | SpecSolver: Solving Spatial-Spectral Fusion Via Semantic Transformer Highlight: To address the issues, we propose a semantic transformer-based solver, namely SpecSolver, which is inspired by the benefits of superpixel-based approaches, yet with a completely improved inner mechanism. |
Wei Li; Junwei Zhu; Honghui Xu; Jiawei Jiang; Jianwei Zheng; |
| 397 | Regist3R: Incremental Registration with Stereo Foundation Model Highlight: While DUSt3R and its successors have achieved breakthroughs in 3D reconstruction from unposed images, these methods exhibit significant limitations when scaling to multi-view scenarios, including high computational cost and cumulative error induced by global alignment. To address these challenges, we propose Regist3R, a novel stereo foundation model tailored for efficient and scalable incremental reconstruction. |
Sidun Liu; Wenyu Li; Peng Qiao; Yong Dou; |
| 398 | Gloss Matters: Unlocking The Potential of Non-Autoregressive Sign Language Translation Highlight: In particular, to alleviate the inconsistency between training and inference of GLevT, which is introduced by glosses, we propose a dual-centric learning policy and a keyframe-based gloss replacement method for training, further improving the translation quality of GLevT. |
Zhihao Wang; Shiyu Liu; Zhiwei He; Kangjie Zheng; Liangying Shao; Junfeng Yao; Jinsong Su; |
| 399 | LIDAR: Lightweight Adaptive Cue-Aware Fusion Vision Mamba for Multimodal Segmentation of Structural Cracks Highlight: Existing methods lack the capability for adaptive perception and efficient interactive fusion of cross-modal features. To address these challenges, we propose a Lightweight Adaptive Cue-Aware Vision Mamba network (LIDAR), which efficiently perceives and integrates morphological and textural cues from different modalities under multimodal crack scenarios, generating clear pixel-level crack segmentation maps. |
Hui Liu; Chen Jia; Fan Shi; Xu Cheng; Mengfei Shi; Xia Xie; Shengyong Chen; |
| 400 | Enhancing Democratic Mediation Through Norm-Awareness in Generative Agent Societies Highlight: Democratic mediation serves as a vital mechanism for resolving social conflicts; however, current practices encounter three critical limitations: (1) inefficient operations, wherein traditional labor-intensive mediation processes are both time-consuming and inefficient; (2) theoretical gaps, as prevailing mediation theories fail to explore the underlying causes of conflicts; and (3) inadequate analysis, with existing digital tools lacking comprehensive conflict mediation capabilities and primarily focusing on singular data types. To address these limitations, we introduce the Normative Social Simulator for Democratic Mediation, referred to as Norm Mediat. |
Tianjiao Xu; Hao Fu; Suiyang Zhang; Jianhua Yin; Tian Gan; Liqiang Nie; |
| 401 | DA3D: Domain-Aware Dynamic Adaptation for All-Weather Multimodal 3D Detection Highlight: In this work, we advocate a new perspective: all-weather 3D detection should be formulated as a lightweight capacity allocation problem, rather than simply enlarging or duplicating models for each weather domain. |
Haochen Yang; Lei Li; Jiacheng Guo; Baolu Li; Minghai Qin; Hongkai Yu; Tianyun Zhang; |
| 402 | T23D-QA: An Open Dataset and Benchmark for Text-driven 3D Generation Quality Assessment Highlight: We introduce T23D-QA, a novel open benchmark that couples diverse prompts, multiple generation paradigms, and fine-grained human judgements for text-conditioned 3D synthesis. |
Haohui Li; Bowen Qu; Wei Gao; |
| 403 | ThermVision: Exploring FLUX for Synthesizing Hyper-Realistic Thermal Face Data and Animations Via Image to Video Translation Highlight: In this work, we optimize the FLUX text-to-image diffusion model on diverse real-world thermal facial datasets to generate hyper-realistic 2D thermal facial images for both males and females, and propose a new dataset, ThermVision. |
Muhammad Ali Farooq; Waseem Shariff; Peter Corcoran; |
| 404 | D-Judge: How Far Are We? Assessing The Discrepancies Between AI-synthesized and Natural Images Through Multimodal Guidance Highlight: Despite the impressive capabilities of advanced AI generative models in producing visually compelling content, significant discrepancies remain when compared to natural images. To systematically investigate and quantify these differences, we construct a large-scale multimodal dataset named DANI, comprising 5,000 natural images and over 440,000 AI-generated image (AIGI) samples produced by nine representative models using both unimodal and multimodal prompts, including Text-to-Image (T2I), Text-and-Image-to-Image (I2I), and Text and Image-to-Image (TI2I). |
Renyang Liu; Ziyu Lyu; Wei Zhou; See-Kiong Ng; |
| 405 | Tractography-Guided Dual-Label Collaborative Learning for Multi-Modal Cranial Nerves Parcellation Highlight: In this work, we propose a tractography-guided Dual-label Collaborative Learning Network (DCLNet) for multi-modal CNs parcellation. |
Lei Xie; Junxiong Huang; Yuanjing Feng; Qingrun Zeng; |
| 406 | Differential Contrastive Training for Gaze Estimation Highlight: In this paper, we propose a novel Differential Contrastive Training strategy, which boosts gaze estimation performance with the help of CLIP. |
Lin Zhang; Yi Tian; Xiyun Wang; Wanru Xu; Yi Jin; Yaping Huang; |
| 407 | HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones Highlight: Second, existing hierarchical semantic modeling methods rely on precise entity partitioning and strict containment relationship constraints, which limits their effectiveness in complex drone environments. To address these challenges, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework, comprising two core components: 1) Region-Global Image-Text Contrastive Learning (RG-ITC). |
Hao Ruan; Jinliang Lin; Yingxin Lai; Zhiming Luo; Shaozi Li; |
| 408 | ICS-MR: Interactive Conversation Scenarios for Assessment of Mixed Reality Communication Highlight: We present ICS-MR, a dataset containing three conversational scenarios designed for the evaluation of communication quality in Mixed Reality (MR) systems. |
Felix Immohr; Gareth Rendle; Annika Neidhardt; Anton Benjamin Lammert; Bernd Froehlich; Alexander Raake; |
| 409 | Art4Math: Handwritten Mathematical Expression Recognition Via Multimodal Sketch Grounding Highlight: We introduce Art for Math (Art4Math), a novel framework that leverages the structural richness of human sketches to enhance HMER through fine-grained, modality-aware learning. |
Yang Zhou; Jin Wang; Yuxiao Zhang; Kaixiang Huang; Guodong Lu; Jingru Yang; Shengfeng He; |
| 410 | Taming Anomalies with Down-Up Sampling Networks: Group Center Preserving Reconstruction for 3D Anomaly Detection Highlight: In this study, a Down-Up Sampling Network (DUS-Net) is proposed to reconstruct high-precision point clouds for 3D anomaly detection by preserving the group center geometric structure. |
Hanzhe Liang; Jie Zhang; Tao Dai; Linlin Shen; Jinbao Wang; Can Gao; |
| 411 | Interact-Custom: Customized Human Object Interaction Image Generation Highlight: Though a great success, existing approaches mainly concentrate on preserving the target entity’s appearance, while neglecting fine-grained interaction control among target entities. To endow models with such interaction control capability, we focus on the human-object interaction scenario and propose the task of Customized Human Object Interaction Image Generation (CHOI), which simultaneously requires identity preservation for the target human and object, and semantic control of the interaction between them. |
Zhu Xu; Zhaowen Wang; Yuxin Peng; Yang Liu; |
| 412 | CP3: Customizable 3D Pop-Out Effect Creation for Immersive Content Using Multimodal Models Highlight: In this paper, a multimodal-model-based 3D pop-out video generation framework (CP3) is proposed to address the shortcomings of existing video generation technology in accurately controlling 3D pop-out effects. |
Zezhou Chen; Ping Chen; Huan Hu; Xiang Liu; Zipeng Wang; Zhaoxiang Liu; Kai Wang; Shiguo Lian; |
| 413 | Robust Tensor Learning with Graph Diffusion for Scalable Multi-view Graph Clustering Highlight: While Multi-view Bipartite Graph Clustering (MVBGC) has shown promising results, existing approaches often overlook that the generated bipartite graph is susceptible to disturbances from complex structures and noise. To address these challenges, we propose RTGD-MVC, a novel framework for Robust Tensor Learning with Graph Diffusion tailored for efficient and scalable multi-view graph clustering. |
Jiale Zou; Yan Chen; Bingbing Jiang; Peng Zhou; Liang Du; Lei Duan; Yuhua Qian; |
| 414 | Quantum Interference-Inspired Who-What-Where Composite-Semantics Instance Search for Story Videos Highlight: Inspired by quantum interference theory, we propose a Quantum Interference Partial Decomposition (QIPD) method to model the diverse influences of semantic overlap from 2W to 3W INS. |
Zijun Xu; Jiahao Guo; Chunjie Zhang; Zhongyuan Wang; Chunxia Xiao; Chao Liang; |
| 415 | Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings Highlight: We propose Drawing2CAD, a framework with three key technical components: a network-friendly vector primitive representation that preserves precise geometric information, a dual-decoder transformer architecture that decouples command type and parameter generation while maintaining precise correspondence, and a soft target distribution loss function accommodating inherent flexibility in CAD parameters. |
Feiwei Qin; Shichao Lu; Junhao Hou; Changmiao Wang; Meie Fang; Ligang Liu; |
| 416 | Exploring Adapter Design Tradeoffs for Low Resource Music Generation Highlight: However, the design choices for adapters, including their architecture, placement, and size, are numerous, and it is unclear which of these combinations would produce optimal adapters and why, for a given low-resource music genre. In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music. Our findings reveal distinct trade-offs: convolution-based adapters excel in capturing fine-grained local musical details such as ornamentations and short melodic phrases, while transformer-based adapters better preserve long-range dependencies crucial for structured improvisation. |
Atharva Mehta; Shivam Chauhan; Monojit Choudhury; |
| 417 | MARA: A Multimodal Adaptive Retrieval-Augmented Framework for Document Question Answering Highlight: Specifically, current approaches rely on query-agnostic document representations that overlook salient content and use static top-k evidence selection, which fails to adapt to the uncertain distribution of relevant information. To address these limitations, we propose the Multimodal Adaptive Retrieval-Augmented (MARA) framework, which introduces query-adaptive mechanisms to both retrieval and generation. |
Hui Wu; Haoquan Zhai; Yuchen Li; Hengyi Cai; Peirong Zhang; Yidan Zhang; Lei Wang; Chunle Wang; Yingyan Hou; Shuaiqiang Wang; Dawei Yin; |
| 418 | UniTalker: Conversational Speech-Visual Synthesis Highlight: To ensure that the generated speech-visual content remains consistent in terms of emotion, content, and duration, we introduce three key optimizations: 1) Designing a specialized neural landmark codec to tokenize and reconstruct facial expression sequences. |
Yifan Hu; Rui Liu; Yi Ren; Xiang Yin; Haizhou Li; |
| 419 | AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation Highlight: In this paper, we address the task of multimodal-to-speech generation, which aims to synthesize high-quality speech from multiple input modalities: text, video, and reference audio. |
Jeongsoo Choi; Ji-Hoon Kim; Kim Sung-Bin; Tae-Hyun Oh; Joon Son Chung; |
| 420 | CoheDancers: Enhancing Interactive Group Dance Generation Through Music-Driven Coherence Decomposition Highlight: To tackle the issue, we introduce CoheDancers, a novel framework for Music-Driven Interactive Group Dance Generation. |
Kaixing Yang; Xulong Tang; Haoyu Wu; Biao Qin; Hongyan Liu; Jun He; Zhaoxin Fan; |
| 421 | CalibWorkflow: A General MLLM-Guided Workflow for Centimeter-Level Cross-Sensor Calibration Highlight: However, existing methods often lack generalization capabilities when facing diverse hardware configurations, sensor poses, and environmental conditions, hindering their large-scale deployment. To address this limitation, we propose a general extrinsic calibration method, CalibWorkflow. |
Xingchen Li; Wuyang Zhang; Guoliang You; Xiaomeng Chu; Wenhao Yu; Yifan Duan; Yuxuan Xiao; Yanyong Zhang; |
| 422 | Mono3R: Exploiting Monocular Cues for Geometric 3D Reconstruction Highlight: However, as we observed, constrained by their matching-based principles, the reconstruction quality of existing models suffers significant degradation in challenging regions with limited matching cues, particularly in weakly textured areas and low-light conditions. To mitigate these limitations, we propose to harness the inherent robustness of monocular geometry estimation to compensate for the shortcomings. |
Wenyu Li; Sidun Liu; Peng Qiao; Yong Dou; |
| 423 | Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs Via Self-Evolving Distillation Highlight: Furthermore, we identified that traditional distillation methods are prone to inducing void spaces in the output space of LVLMs. To address this issue, we propose a Mode-Seeking Evolving approach, which performs distillation to capture the dominant modes of the purified knowledge distribution, thereby avoiding the chaotic results that could emerge from void spaces. |
Wenhao Li; Xiu Su; Jingyi Wu; Feng Yang; Yang Liu; Yi Chen; Shan You; Chang Xu; |
| 424 | HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation Highlight: We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. |
Pei Liu; Xin Liu; Ruoyu Yao; Junming Liu; Siyuan Meng; Ding Wang; Jun Ma; |
| 425 | Dome-DETR: DETR with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection Highlight: However, existing methods suffer from inefficient feature leverage and high computational costs due to redundant feature processing and rigid query allocation. To address these challenges, we propose Dome-DETR, a novel framework with Density-Oriented Feature-Query Manipulation for Efficient Tiny Object Detection. |
Zhangchi Hu; Peixi Wu; Jie Chen; Huyue Zhu; Yijun Wang; Yansong Peng; Hebei Li; Xiaoyan Sun; |
| 426 | DSPF: Dual-Stage Preservation and Fusion for Source-Free Domain Adaptive Point Cloud Completion Highlight: To address these limitations, this paper explores a practical and challenging setting: “source-free domain adaptive point cloud completion”, where a well-trained source model must adapt to the target data distribution without access to source data, aiming to improve completion performance. To tackle this problem, we propose a novel method called “Dual-Stage Preservation and Fusion” (DSPF), which comprises two key training stages tailored to this new setting. |
Zhiqian Xia; Haifeng Xia; Shichao Jin; Wei Wang; Zhengming Ding; Xiaochun Cao; |
| 427 | A Theoretical Proof of Dynamic Multimodal Fusion Exacerbates Modality Greedy Highlight: This results in a model that does not fully take advantage of the lower-quality modalities. In this paper, we provide a theoretical analysis showing that dynamic fusion intensifies Modality Greedy, and we present experimental results that support this observation. |
Xiaorui Ding; Huan Ma; Changqing Zhang; |
| 428 | VLMPlanner: Integrating Visual Language Models with Motion Planning Highlight: However, existing methods often rely on abstracted perception or map-based inputs, missing crucial visual context, such as fine-grained road cues, accident aftermath, or unexpected obstacles, which are essential for robust decision-making in complex driving environments. To bridge this gap, we propose VLMPlanner, a hybrid framework that combines a learning-based real-time planner with a vision-language model (VLM) capable of reasoning over raw images. |
Zhipeng Tang; Sha Zhang; Jiajun Deng; Chenjie Wang; Guoliang You; Yuting Huang; Xinrui Lin; Yanyong Zhang; |
| 429 | A Satellite-Ground Synergistic Large Vision-Language Model System for Earth Observation Highlight: To enable near real-time Earth observation applications (e.g., disaster and extreme weather monitoring), we explore how to deploy LVLMs in LEO satellite networks and design SpaceVerse, an efficient satellite-ground synergistic LVLM inference system. |
Yuxin Zhang; Jiahao Yang; Zhe Chen; Wenjun Zhu; Jin Zhao; Yue Gao; |
| 430 | CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation Highlight: In this paper, we address the aforementioned issues by introducing a new audio-visual binaural generation model with an audio-visual conditional normalisation layer that dynamically aligns the target difference audio features using visual context. |
Yuanhong Chen; Kazuki Shimada; Christian Simon; Yukara Ikemiya; Takashi Shibuya; Yuki Mitsufuji; |
| 431 | Formula Spotting Based on Synergy Perception and Representation Mining Highlight: Although existing methods that first detect and then recognize have achieved prominent results, they still suffer from semantic confusion from similar character structures, semantic loss from bounding box perturbation, and visual interference from non-formula regions. To address these issues, we propose a Synergy Perception and Representation Mining Network. |
Gang Pan; Hongen Liu; Di Sun; |
| 432 | AFFIR: Dual-Modal Attention Feature Fusion for Scene Text Image Retargeting Highlight: Image retargeting aims to adjust and reorganize the content of original images to fit different display sizes and visual requirements. |
Gang Pan; Liming Pan; Hongze Mi; Rongyu Xiong; Jiahao Wang; Di Sun; |
| 433 | Image Retargeting Based on Text Region Awareness Highlight: This task introduces three primary challenges, which can be summarized as follows: (1) the distinct probability distributions between text and non-text regions in images; (2) the lack of dedicated mechanisms in existing IR methods for handling text regions, often leading to text distortion or blurring; (3) the absence of paired datasets specifically designed for IR tasks involving text regions. To tackle these challenges, we propose SSIR, a unified framework that reformulates IR as a joint Semantic Segmentation and Image Retargeting (SS-IR) task, leveraging an attention mechanism to bridge these components. |
Gang Pan; Meihua Liu; Lei Zhou; Jiahao Wang; Di Sun; |
| 434 | Efficient Video Anomaly Detection Via Scene-Dependent Memory Assisted Inter-Frame RGB Difference Reconstruction Highlight: In this work, we propose a novel inter-frame RGB difference reconstruction network for efficient video anomaly detection. |
Han Hu; Wenli Du; Bing Wang; |
| 435 | Manipulating Multimodal Agents Via Cross-Modal Prompt Injection Highlight: However, in this paper, we identify a critical yet previously overlooked security vulnerability in multimodal agents: cross-modal prompt injection attacks. To exploit this vulnerability, we propose CrossInject, a novel attack framework in which an attacker embeds adversarial perturbations across multiple modalities to align with target malicious content, allowing external instructions to hijack the agents’ decision-making process and execute unauthorized tasks. |
Le Wang; Zonghao Ying; Tianyuan Zhang; Siyuan Liang; Shengshan Hu; Mingchuan Zhang; Aishan Liu; Xianglong Liu; |
| 436 | Leader Is Guided: Interactive Motion Generation Via Lead-Follow Paradigm and Trajectory Guidance Highlight: To address the questions mentioned, we introduce two key concepts: (1) Lead-Follow Paradigm: Inspired by role allocation in partner dancing, we decompose complex interactive motion tasks into a Lead-Follow paradigm. |
Runqi Wang; Caoyuan Ma; Jian Zhao; Hanrui Xu; Dongfang Sun; Haoyang Chen; Lin Xiong; Zheng Wang; Xuelong Li; |
| 437 | A Multimodal Deviation Perceiving Framework for Weakly-Supervised Temporal Forgery Localization Highlight: Current research on Deepfake forensics often treats detection as a classification task or a temporal forgery localization problem; both formulations are usually restrictive, time-consuming, and challenging to scale to large datasets. To resolve these issues, we present a multimodal deviation perceiving framework for weakly-supervised temporal forgery localization (MDP), which aims to identify temporal partial forged segments using only video-level annotations. |
Wenbo Xu; Junyan Wu; Wei Lu; Xiangyang Luo; Qian Wang; |
| 438 | DilateQuant: Accurate and Efficient Quantization-Aware Training for Diffusion Models Via Weight Dilation Highlight: In this paper, we propose a novel QAT framework for diffusion models, called DilateQuant. |
Xuewen Liu; Zhikai Li; Minghao Jiang; Mengjuan Chen; Jianquan Li; Qingyi Gu; |
| 439 | Prior-Constrained Relevant Feature Driven Image Fusion with Hybrid Feature Via Mode Decomposition Highlight: Most existing methods directly extract relevant and complementary features from each modality using neural networks, often overlooking the guidance process and the distinct frequency-domain characteristics of these features. To address this, we propose HRFusion, a novel frequency-domain framework that extracts complementary features from hybrid features using prior-constrained relevant features, effectively enhancing complementary information and reducing redundancy. |
Bingfeng Liu; Songwei Pei; Shuhuai Wang; Wenzheng Yang; Qian Li; Shangguang Wang; |
| 440 | ChartM3: Benchmarking Chart Editing with Multimodal Instructions Highlight: In this work, we introduce a novel paradigm for multimodal chart editing, where user intent is expressed through a combination of natural language and visual indicators that explicitly highlight the elements to be modified. |
Donglu Yang; Liang Zhang; Zihao Yue; Liangyu Chen; Yichen Xu; Wenxuan Wang; Qin Jin; |
| 441 | KEN: Knowledge Augmentation and Emotion Guidance Network for Multimodal Fake News Detection Highlight: Meanwhile, treating all emotional types of news uniformly without tailored approaches further leads to performance degradation. Therefore, we propose a novel Knowledge Augmentation and Emotion Guidance Network (KEN). |
Peican Zhu; Yubo Jing; Le Cheng; Keke Tang; Yangming Guo; |
| 442 | Quantization Meets OOD: Generalizable Quantization-aware Training from A Flatness Perspective Highlight: Further, we find that the contradiction between the perspective that flatness of the loss landscape gives rise to superior OOD generalization and the phenomenon that QAT leads to a sharp loss landscape can cause the above problem. Therefore, we propose a flatness-oriented QAT method, FQAT, to achieve generalizable QAT. |
Jiacheng Jiang; Yuan Meng; Chen Tang; Han Yu; Qun Li; Zhi Wang; Wenwu Zhu; |
| 443 | CWCP: Generalizing Virtual Reality to Real World with Contextual-Weather Correlation Pairing for Deraining and Desnowing Highlight: In this paper, we highlight the importance of contextual influence information and utilize virtual reality to effectively simulate real scenes. |
Yuwu Lu; Chunzhi Liu; Yihan Yang; |
| 444 | CDIB: Consistency Discovery-guided Information Bottleneck for Multi-modal Knowledge Graph Reasoning Highlight: In this paper, we propose a novel Consistency Discovery-guided Information Bottleneck (CDIB) framework to address the aforementioned challenges. |
Haichuan Fang; Haoran Zhang; Yulin Du; Qiang Guo; Zhen Tian; Youwei Wang; Yangdong Ye; |
| 445 | Open-Set Image Tagging with Multi-Grained Text Supervision Highlight: This paper introduces the Recognize Anything Plus Model (RAM++), an open-set image tagging model effectively leveraging multi-grained text supervision. |
Xinyu Huang; Yi-Jie Huang; Youcai Zhang; Weiwei Tian; Rui Feng; Yuejie Zhang; Yanchun Xie; Yaqian Li; Lei Zhang; |
| 446 | FORGET ME: Federated Unlearning for Face Generation Models Highlight: Existing generation model unlearning methods are primarily designed for centralized environments and are inadequate for addressing the constraints of data privacy storage and limited client computational resources in federated settings. To address this gap, we propose F2GU, the first federated unlearning framework specifically tailored for face generation models, enabling the effective removal of contributions associated with specific clients (identities) while ensuring privacy. |
Fan Qi; Ao Liu; Zixin Zhang; Changsheng Xu; |
| 447 | TAMER: Interest Tree Augmented Modality Graph Recommender for Multimodal Recommendation Highlight: However, existing methods often introduce noise when enhancing modality graphs, making it challenging to effectively balance performance and accuracy. To address this issue, we propose an Interest Tree Augmented Modality Graph RecommendER for Multimodal Recommendation (TAMER). |
Fanshen Meng; Zhenhua Meng; Ru Jin; Yuli Chen; Rongheng Lin; Budan Wu; |
| 448 | Compositional Zero-shot Learning Via Progressive Language-based Observations Highlight: For instance, the state “old” can signify vintage design for a car or advanced age for a cat. In this paper, we argue that these variances can be mitigated by predicting composition categories based on salient observation cues. |
Lin Li; Guikun Chen; Zhen Wang; Jun Xiao; Long Chen; |
| 449 | CIA: Class- and Instance-aware Adaptation for Vision-Language Models Highlight: This limitation leads to suboptimal performance on challenging tasks and restricted generalization capability to unseen data. To address these issues, we propose Class- and Instance-aware Adaptation (CIA), a novel framework that simultaneously optimizes both class-level and instance-level alignments. |
Lin Peng; Cong Wan; Shaokun Wang; Xiang Song; Yuhang He; Yihong Gong; |
| 450 | Controllable Video-to-Music Generation with Multiple Time-Varying Conditions Highlight: However, existing V2M methods relying solely on visual features or supplementary textual inputs generate music in a black-box manner, often failing to meet user expectations. To address this challenge, we propose a novel multi-condition guided V2M generation framework that incorporates multiple time-varying conditions for enhanced control over music generation. |
Junxian Wu; Weitao You; Heda Zuo; Dengming Zhang; Pei Chen; Lingyun Sun; |
| 451 | Draw with Thought: Unleashing Multimodal Reasoning for Scientific Diagram Generation Highlight: We propose Draw with Thought (DwT), a training-free framework that guides MLLMs to reconstruct diagrams into editable mxGraph XML code through cognitively inspired Chain-of-Thought reasoning. |
Zhiqing Cui; Jiahao Yuan; Hanqing Wang; Yanshu Li; Chenxu Du; Zhenglong Ding; |
| 452 | RealVG: Unleashing MLLMs for Training-Free Spatio-Temporal Video Grounding in The Wild Highlight: In this paper, we introduce RealVG, a robust and training-free pipeline that leverages powerful Multimodal Large Language Models (MLLMs) through question-answering to tackle STVG in the wild. |
Hongchen Wei; Zhenzhong Chen; |
| 453 | Visual Context Window Extension: A New Perspective for Long Video Understanding Highlight: In this paper, we tackle the challenge of long video understanding from the perspective of context windows, aiming to apply LMMs to long video tasks without retraining on long video datasets. |
Hongchen Wei; Zhenzhong Chen; |
| 454 | Multimodal Markup Document Models for Graphic Design Completion Highlight: We introduce MarkupDM, a multimodal markup document model that represents graphic design as an interleaved multimodal document consisting of both markup language and images. |
Kotaro Kikuchi; Ukyo Honda; Naoto Inoue; Mayu Otani; Edgar Simo-Serra; Kota Yamaguchi; |
| 455 | Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis Highlight: Recent advances in diffusion models have endowed talking head synthesis with subtle expressions and vivid head movements, but have also led to slow inference speed and insufficient control over generated results. To address these issues, we propose Ditto, a diffusion-based talking head framework that enables fine-grained controls and real-time inference. |
Tianqi Li; Ruobing Zheng; Minghui Yang; Jingdong Chen; Ming Yang; |
| 456 | GOES: 3D Gaussian-based One-shot Head Animation with Any Emotion and Any Style Highlight: In this paper, we introduce GOES, a 3D Gaussian-based One-shot head animation framework for any Emotion and any Style. |
Chuhang Ma; Shuai Tan; Junjie Wei; Ye Pan; |
| 457 | Joint Holistic and Lesion Controllable Mammogram Synthesis Via Gated Conditional Diffusion Model Highlight: In this paper, we propose Gated Conditional Diffusion Model (GCDM), a novel framework designed to jointly synthesize holistic mammogram images and localized lesions. |
Xin Li; Kaixiang Yang; Qiang Li; Zhiwei Wang; |
| 458 | Queryable 3D Scene Representation: A Multi-Modal Framework for Semantic Reasoning and Robotic Task Planning Highlight: This requires a smart map that fuses accurate geometric structure with rich, human-understandable semantics. To address this, we introduce the 3D Queryable Scene Representation (3D QSR), a novel framework built on multimedia data that unifies three complementary 3D representations: (1) 3D-consistent novel view rendering and segmentation from panoptic reconstruction, (2) precise geometry from 3D point clouds, and (3) structured, scalable organization via 3D scene graphs. |
Xun Li; Rodrigo Santa Cruz; Mingze Xi; Hu Zhang; Madhawa Perera; Ziwei Wang; Ahalya Ravendran; Brandon J. Matthews; Feng Xu; Matt Adcock; Dadong Wang; Jiajun Liu; |
| 459 | EBaR: Efficient Buffer and Resetting for Single-Sample Continual Test-Time Adaptation Highlight: To this end, we propose a novel Efficient Buffer and Resetting (EBaR) method for S-CoTTA. |
Tianyi Ma; Maoying Qiao; |
| 460 | Rethinking The Reliability of Evidence in End-to-End Fact-Checking from The Causal Perspective Highlight: In this paper, we account for the diverse reliability levels of retrieved evidence and eliminate the negative impact from the causal perspective. |
Xubo Liu; Wenya Guo; Ruxue Yan; Xumeng Liu; Ying Zhang; Ru Zhou; |
| 461 | Zero-shot Compositional Action Recognition with Neural Logic Constraints Highlight: Despite compositional learning’s progress in ZS-CAR, two critical challenges persist: 1) Missing compositional structure constraint, leading to spurious correlations between primitives; 2) Neglecting semantic hierarchy constraint, leading to semantic ambiguity and impairing the training process. In this paper, we argue that human-like symbolic reasoning offers a principled solution to these challenges by explicitly modeling compositional and hierarchical structured abstraction. |
Gefan Ye; Lin Li; Kexin Li; Jun Xiao; Long Chen; |
| 462 | SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation Highlight: This useful information can be elucidated by considering specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these spatial auditory cues from audio to generate videos with high semantic and spatial correspondence. |
Kien T. Pham; Yingqing He; Yazhou Xing; Qifeng Chen; Long Chen; |
| 463 | Quantifying Samples with Invariance for Source-Free Class Incremental Domain Adaptation Highlight: However, existing Class-Incremental (CI) methods fail to alleviate domain shifts, while traditional Unsupervised Domain Adaptation (UDA) techniques suffer from catastrophic forgetting and privacy concerns. To address these limitations, we explore Source-Free Class Incremental Domain Adaptation (SFCIDA) and propose a novel approach, Quantifying Samples with Invariance (QSI), for this scenario. |
Zhiyu Ye; Guowen Li; Haoyuan Liang; Zixi Wang; Shilei Cao; Yushan Lai; Juepeng Zheng; |
| 464 | GaussianCross: Cross-modal Self-supervised 3D Representation Learning Via Gaussian Splatting Highlight: In this paper, we present GaussianCross, a novel cross-modal self-supervised 3D representation learning architecture integrating feed-forward 3D Gaussian Splatting (3DGS) techniques to address current challenges. |
Lei Yao; Yi Wang; Yi Zhang; Moyun Liu; Lap-Pui Chau; |
| 465 | Querying Autonomous Vehicle Point Clouds: Enhanced By 3D Object Counting with CounterNet Highlight: In this work, we formalize point cloud querying by defining three core query types: RETRIEVAL, COUNT, and AGGREGATION, each aligned with distinct analytical scenarios. |
Xiaoyu Zhang; Zhifeng Bao; Hai Dong; Ziwei Wang; Jiajun Liu; |
| 466 | TASR: Timestep-Aware Diffusion Model for Image Super-Resolution Highlight: In this paper, we first explore the temporal dynamics of information infusion through ControlNet, revealing that the input from LR images predominantly influences the initial stages of the denoising process. Leveraging this insight, we introduce a novel timestep-aware diffusion model that adaptively integrates features from both ControlNet and the pre-trained Stable Diffusion (SD). |
Qinwei Lin; Xiaopeng Sun; Yu Gao; Yujie Zhong; Zheng Zhao; Dengjie Li; Haoqian Wang; |
| 467 | MGHFT: Multi-Granularity Hierarchical Fusion Transformer for Cross-Modal Sticker Emotion Recognition Highlight: Although pre-trained visual models with text have demonstrated strong capabilities in visual feature extraction, sticker emotion understanding remains challenging due to its reliance on multi-view information, such as background knowledge and stylistic cues. To address this, we propose a novel multi-granularity hierarchical fusion transformer (MGHFT), with a multi-view sticker interpreter based on Multimodal Large Language Models. |
Jian Chen; Yuxuan Hu; Haifeng Lu; Wei Wang; Min Yang; Chengming Li; Xiping Hu; |
| 468 | Ingredients-Guided and Nutrients-Prompted Network for Food Nutrition Estimation Highlight: Besides, existing methods lack explicit mechanisms for modeling nutrient-specific information and guiding attention toward nutrition-relevant semantics. To solve the above two issues, we propose a novel ingredients-guided and nutrients-prompted nutrition estimation method. |
Donglin Zhang; Boyuan Ma; Xiaojun Wu; Josef Kittler; |
| 469 | Choose Your Expert: Uncertainty-Guided Expert Selection for Continual Deepfake Detection Highlight: To this end, we propose a novel analytically driven, replay-free continual detection framework that eliminates the need for iterative gradient updates. |
Xueyi Zhang; Peiyin Zhu; Jinping Sui; Xiaoda Yang; Jiahe Tian; Mingrui Lao; Siqi Cai; Yanming Guo; Jun Tang; |
| 470 | S2-Edit3DV: Diffusion-Guided Style Meets Structure for Consistent Multi-View 3D Video Generation Highlight: Nevertheless, existing methods frequently face significant challenges, including inconsistent textures, pronounced drifting artifacts, and compromised geometric integrity when rendered from various perspectives. To effectively address these limitations, we introduce S2-Edit3DV, a novel diffusion-guided framework that reframes multi-view 3D object editing as a temporally coherent video editing problem. |
Yuqi Chen; Xiubo Liang; Yu Zhao; Hongzhi Wang; Weidong Geng; |
| 471 | Ensuring Responses Contain Appropriate Images: Timing Judgment for Multimodal Responses Highlight: The integration of multimodal information, particularly visual content, into dialogue systems has primarily focused on interpreting user-provided inputs, while comparatively little attention has been given to the proactive use of such content to enhance responses. In this paper, we explore a new research direction that addresses this gap by enabling dialogue systems to autonomously determine when and how to supplement textual responses with relevant images, based on conversational context and user intent. |
Hao Yang; Tian Zheng; Yanyan Zhao; Bing Qin; |
| 472 | GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs Highlight: Nevertheless, a comprehensive assessment of their capabilities, concerning the fine-grained physical principles especially in geometric optics, remains underexplored. To address this gap, we introduce GOBench, the first benchmark to systematically evaluate MLLMs’ ability across two tasks: 1) Generating Optically Authentic Imagery and 2) Understanding Underlying Optical Phenomena. |
Xiaorong Zhu; Ziheng Jia; Jiarui Wang; Xiangyu Zhao; Haodong Duan; Xiongkuo Min; Jia Wang; Zicheng Zhang; Guangtao Zhai; |
| 473 | Pretraining Large Brain Language Model for Active BCI: Silent Speech Highlight: Following the recent success of pretraining large models with self-supervised paradigms to enhance EEG classification performance, we propose Large Brain Language Model (LBLM) pretrained to decode silent speech for active BCI. |
Jinzhao Zhou; Zehong Cao; Yiqun Duan; Connor Barkley; Daniel Leong; Xiaowei Jiang; Quoc-Toan Nguyen; Ziyi Zhao; Thomas Do; Yu-Cheng Chang; Sheng-Fu Liang; Chin-Teng Lin; |
| 474 | BTUAP: Boosting The Transferability of Universal Adversarial Perturbations in The Black-box Setting Under Various Data Dependencies Highlight: However, both strategies exhibit poor transferability in black-box settings. To address this limitation, we propose BTUAP, a novel UAP generation method designed to enhance the transferability of UAP in the black-box setting. |
Jie Wan; Jianhao Fu; Ziqi Yang; Kui Ren; |
| 475 | MoTAS: MoE-Guided Feature Selection from TTS-Augmented Speech for Enhanced Multimodal Alzheimer’s Early Screening Highlight: However, challenges such as limited data and the lack of fine-grained, adaptive feature selection often hinder performance. To address these issues, we propose MoTAS, a robust framework designed to enhance AD screening efficiency. |
Yongqi Shao; Bingxin Mei; Cong Tan; Hong Huo; Tao Fang; |
| 476 | DualDub: Video-to-Soundtrack Generation Via Joint Speech and Background Audio Synthesis Highlight: To tackle V2ST, we introduce DualDub, a unified framework built on a multimodal language model that integrates a multimodal encoder, a cross-modal aligner, and dual decoding heads for simultaneous background audio and speech generation. |
Wenjie Tian; Xinfa Zhu; Haohe Liu; Zhixian Zhao; Zihao Chen; Chaofan Ding; Xinhan Di; Junjie Zheng; Lei Xie; |
| 477 | SeqVLM: Proposal-Guided Multi-View Sequences Reasoning Via VLM for Zero-Shot 3D Visual Grounding Highlight: However, existing zero-shot methods face challenges of spatially limited reasoning due to reliance on single-view localization, as well as contextual omissions or detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. |
Jiawen Lin; Shiran Bian; Yihang Zhu; Wenbin Tan; Yachao Zhang; Yuan Xie; Yanyun Qu; |
| 478 | Spatial-Frequency Mamba Collaborative Learning Network for Infrared Small Target Detection Highlight: Transformers with quadratic computational complexity struggle with local feature refinement. To tackle this issue, we introduce a Mamba-driven approach dubbed Spatial-Frequency Mamba Collaborative Learning Network (SMCLNet). |
Yongji Li; Luping Wang; |
| 479 | Device-Cloud Collaborative Learning Framework for Efficient Unknown Object Detection Highlight: However, limited device resources restrict existing methods from achieving accurate detection on the device side. Addressing this gap, this paper introduces a device-cloud collaborative framework named DCCUOD that enhances device model performance through efficient cloud collaboration. |
Kewei Zhao; Xiaowei Hu; Qinya Li; |
| 480 | Versatile Multimodal Controls for Expressive Talking Human Animation Highlight: AI-generated content faces similar requirements, where users not only need automatic generation of lip synchronization and basic gestures from audio input but also desire semantically accurate and expressive body movement that can be “directly guided” through text descriptions. Therefore, we present VersaAnimator, a versatile framework that synthesizes expressive talking human videos from arbitrary portrait images. |
Zheng Qin; Ruobing Zheng; Yabing Wang; Tianqi Li; Zixin Zhu; Sanping Zhou; Ming Yang; Le Wang; |
| 481 | EgoPrompt: Prompt Learning for Egocentric Action Recognition Highlight: However, most existing approaches treat these two components as independent classification tasks, focusing on extracting component-specific knowledge while overlooking their inherent semantic and contextual relationships, leading to fragmented representations and sub-optimal generalization capability. To address these challenges, we propose a prompt learning-based framework, EgoPrompt, to conduct the egocentric action recognition task. |
Huaihai Lyu; Chaofan Chen; Yuheng Ji; Changsheng Xu; |
| 482 | Speech Token Prediction Via Compressed-to-fine Language Modeling for Speech Generation Highlight: We observe that speech token sequences exhibit short-range dependency: due to the monotonic alignment between text and speech in text-to-speech (TTS) tasks, the prediction of the current token primarily relies on its local context, while long-range tokens contribute less to the current token prediction and often contain redundant information. Inspired by this observation, we propose a compressed-to-fine language modeling approach to address the challenge of long sequence speech tokens within neural codec language models: (1) Fine-grained Initial and Short-range Information: Our approach retains the prompt and local tokens during prediction to ensure text alignment and the integrity of paralinguistic information; (2) Compressed Long-range Context: Our approach compresses long-range token spans into compact representations to reduce redundant information while preserving essential semantics. |
Wenrui Liu; Qian Chen; Wen Wang; Guanrou Yang; Weiqin Li; Minghui Fang; Jialong Zuo; Xiaoda Yang; Tao Jin; Jin Xu; Zemin Liu; Yafeng Chen; Jionghao Bai; Zhifang Guo; |
| 483 | Short-LVLM: Compressing and Accelerating Large Vision-Language Models By Pruning Redundant Layers Highlight: However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs. In this paper, we empirically prove that directly applying these layer pruning methods to LVLMs is ineffective. |
Ji Ma; Wei Suo; Peng Wang; Yanning Zhang; |
| 484 | Deciphering Functions of Neurons in Vision-Language Models Highlight: In this study, our objective is to delve into the internals of VLMs to interpret the functions of individual neurons. |
Jiaqi Xu; Cuiling Lan; Yan Lu; |
| 485 | Open3D-VQA: A Benchmark for Embodied Spatial Concept Reasoning with Multimodal Large Language Model in Open Space Highlight: In this work, we present Open3D-VQA, a novel benchmark for evaluating MLLMs’ ability to reason about complex spatial relationships from an aerial perspective. |
Weichen Zhang; Zile Zhou; Xin Zeng; Liu Xuchen; Jianjie Fang; Chen Gao; Jinqiang Cui; Yong Li; Xinlei Chen; Xiao-Ping Zhang; |
| 486 | Analytic Continual Test-Time Adaptation for Multi-Modality Corruption Highlight: In this paper, we propose a novel approach, Multi-modality Dynamic Analytic Adapter (MDAA), to tackle MM-CTTA tasks. |
Yufei Zhang; Yicheng Xu; Hongxin Wei; Zhiping Lin; Xiaofeng Zou; Cen Chen; Huiping Zhuang; |
| 487 | EEG-Face: A Facial-Image Stimulated EEG Data-Set for Analysis of Brain Perceived Multimedia Highlight: In this paper, we establish a facial-image stimulated EEG dataset, named EEG-Face, to address the challenge and provide crucial support for relevant research, such as brain-computer interface (BCI), face recognition via brain-perceived EEGs, and multimedia content analysis via brain perception activities. |
Wuxia Zhang; Yang Xin; Shibo Lv; Xin Zhang; Xiang Zhong; Jianmin Jiang; |
| 488 | Graph-Perceptron with Semantic Fidelity for No-Reference Super-Resolution Image Quality Assessment Highlight: Full-reference methods become inapplicable, while reduced-reference methods relying on a single low-resolution (LR) image offer limited reliability. To address these issues, I propose SQer, a no-reference SR-IQA method based on a graph perceptron with semantic fidelity. |
Lei Chen; |
| 489 | Tree-of-Reasoning: Towards Complex Medical Diagnosis Via Multi-Agent Reasoning with Evidence Tree Highlight: This is mainly because they lack sufficient reasoning depth, which leads to information loss or logical jumps when processing large amounts of specialized medical data, resulting in diagnostic errors. To address these challenges, we propose Tree-of-Reasoning (ToR), a novel multi-agent framework designed to handle complex scenarios. |
Qi Peng; Jialin Cui; Jiayuan Xie; Yi Cai; Qing Li; |
| 490 | MultiMind: Enhancing Werewolf Agents with Multimodal Reasoning and Theory of Mind Highlight: Our work presents a significant advancement toward LLM agents capable of human-like social reasoning across multimodal domains. |
Zheng Zhang; Nuoqian Xiao; Qi Chai; Deheng Ye; Hao Wang; |
| 491 | Balancing Cross-Modal Attention for Generalized Zero-Shot Learning Highlight: In this paper, we propose a novel coarse-to-fine framework termed Hierarchical Progressive Attention Network (HPAN), which leverages latent attention consistency across attributes to rectify the focus deviation of minor attributes. |
Zhijie Rao; Jingcai Guo; |
| 492 | A Comprehensive Model for Visual Fatigue Assessment in 3D Light Field Displays Based on Eye Movement Data Analysis Highlight: In this paper, we propose a comprehensive methodology that integrates subjective and objective data to establish a robust dataset and employs eye movement data for systematically investigating visual fatigue in 3D LFDs. |
Yu Chen; Binbin Yan; Shuo Chen; Xinzhu Sang; |
| 493 | Boosting Temporal Sentence Grounding Via Causal Inference Highlight: Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework based on causal intervention and counterfactual reasoning, which utilizes causal inference to eliminate spurious correlations and enhance the model’s robustness. |
Kefan Tang; Lihuo He; Jisheng Dang; Xinbo Gao; |
| 494 | BrainFLORA: Uncovering Brain Concept Representation Via Multimodal Neural Embeddings Highlight: In this study, we introduce BrainFLORA, a unified framework for integrating cross-modal neuroimaging data to construct a shared neural representation. |
Dongyang Li; Haoyang Qin; Mingyang Wu; Chen Wei; Quanying Liu; |
| 495 | Perspective from A Higher Dimension: Can 3D Geometric Priors Help Visual Floorplan Localization? Highlight: Sufficient comparative studies demonstrate that our method significantly outperforms state-of-the-art methods and substantially boosts the FLoc accuracy. |
Bolei Chen; Jiaxu Kang; Haonan Yang; Ping Zhong; Jianxin Wang; |
| 496 | Deep Multi-Level Contrastive Clustering for Multi-Modal Remote Sensing Images Highlight: In this paper, we propose an end-to-end deep multi-level contrastive clustering (DMLCC) model for multi-modal remote sensing images. |
Weiqi Liu; Yongshan Zhang; Xinxin Wang; Lefei Zhang; |
| 497 | Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Retrieval Highlight: In this work, we propose, for the first time, a dynamic self-adaptive multiscale distillation (DSMD) method that distills from a pre-trained multi-modal large model for efficient cross-modal retrieval, considering multiple scales from the perspectives of fine granularity, global structure, and hard negative sample mining. |
Zhengyang Liang; Meiyu Liang; Wei Huang; Yawen Li; Wu Liu; Yingxia Shao; Kangkang Lu; |
| 498 | Evaluating and Mitigating Sycophancy in Large Vision-Language Models Highlight: In this paper, we introduce SyEval-VL, a benchmark specifically designed to evaluate sycophancy in LVLMs. |
Jiayi Gao; Huaiwen Zhang; |
| 499 | NDM: A Noise-driven Detection and Mitigation Framework Against Implicit Sexual Intentions in Text-to-Image Generation Highlight: Fine-tuning approaches, while effective to some extent, risk degrading the model’s generative quality, creating an undesirable trade-off. To address this, we propose NDM, the first noise-driven detection and mitigation framework, which could detect and mitigate implicit malicious intention in T2I generation while preserving the model’s original generative capabilities. |
Yitong Sun; Yao Huang; Ruochen Zhang; Huanran Chen; Shouwei Ruan; Ranjie Duan; Xingxing Wei; |
| 500 | Harnessing Multimodal Large Language Models for Personalized Product Search with Query-aware Refinement Highlight: Despite this progress, LLM-based PPS solutions merely take textual content into consideration, neglecting multimodal content, which plays a critical role in product search. Motivated by this, we propose a novel framework, HMPPS, for Harnessing Multimodal large language models (MLLM) to deal with Personalized Product Search based on multimodal content. |
Beibei Zhang; Yanan Lu; Ruobing Xie; Zongyi Li; Siyuan Xing; Tongwei Ren; Fen Lin; |
This table only includes 500 papers selected by our daily digest algorithm. To browse the full list, please visit Paper Digest: ACM Multimedia-2025 (Full List).