CVPR 2026 Papers with Code & Data
To facilitate rapid community engagement with the presented research, we have compiled an extensive index of accepted papers that have associated public code or data repositories. We list all of them in the following table. This index was generated using an automated extraction process. While we strive for completeness, some papers with public resources may have been missed. Please inform us if you discover any additional papers that should be included. Readers should be aware that some code repositories may not be made fully public until the conference officially begins.
In addition to this index, we encourage readers to explore two related resources: the CVPR-2026 papers & highlights page, which offers curated summaries and key takeaways from this year’s conference, and the “Best Paper” Digest (CVPR), a historical overview of the most influential CVPR papers published since 1988.
Since 2018, Paper Digest has built a foundation of data spanning decades of conferences, journals, and research topics. The platform features a daily digest service that sifts through tens of thousands of new papers, clinical trials, news articles, and community posts, filtering the noise to highlight what matters most to specific interests. Beyond daily updates, dozens of built-in research tools streamline the academic workflow, supporting efficient reading and writing, comprehensive literature reviews, and automated research report generation.
Paper Digest Team
New York City, New York, 10017
team@paperdigest.org
TABLE 1: CVPR 2026 Papers with Code & Data
| # | Paper | Author(s) | Code |
|---|---|---|---|
| 1 | **Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images** Highlight: We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured at ground-level AND by satellite. | Matias Turkulainen; Akshay Krishnan; Filippo Aleotti; Mohamed Sayed; Guillermo Garcia-Hernando; Juho Kannala; Arno Solin; Gabriel Brostow; Daniyar Turmukhambetov; | code |
| 2 | **Global Structure-from-Motion Meets Feedforward Reconstruction** Highlight: However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, and robustness, and typically fall short of classical methods in standard reconstruction settings. In this work, we systematically analyze these limitations and propose a new state-of-the-art Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods. | Linfei Pan; Johannes Schönberger; Marc Pollefeys; | code |
| 3 | **LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning** Highlight: In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. | Zebin You; Shen Nie; Xiaolu Zhang; JUN ZHOU; Zhiwu Lu; Ji-Rong Wen; Chongxuan Li; | code |
| 4 | **Demo2Tutorial: From Human Experience to Multimodal Software Tutorials** Highlight: Human experience in digital environments offers a vast, underexplored resource of authentic, untrimmed interactions that contain rich procedural knowledge. We introduce Demo2Tutorial, a framework that transforms this experience captured via screen recordings and interaction logs into structured, multimodal software tutorials for teaching both humans and agents. | Zechen Bai; Zhiheng Chen; Yiqi Lin; Kevin Qinghong Lin; Difei Gao; Xiangwu Guo; WANG XIN; Mike Zheng Shou; | code |
| 5 | **Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation Across 3-D Constrained Terrains** Highlight: This paper presents **Gallant**, a voxel-grid–based framework for humanoid locomotion and local navigation in 3D constrained terrains. | Qingwei Ben; Botian Xu; Kailin Li; Feiyu Jia; Wentao Zhang; Jingping Wang; Jingbo Wang; Dahua Lin; Jiangmiao Pang; | code |
| 6 | **Retrieving Counterfactuals Improves Visual In-Context Learning** Highlight: We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual examples through targeted, attribute-guided composed image retrieval. | Guangzhi Xiong; Sanchit Sinha; Zhenghao He; Aidong Zhang; | code |
| 7 | **DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning** Highlight: In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework for autonomous driving, performing spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through joint optimization. | Zhe Liu; Runhui Huang; Rui Yang; Siming Yan; Zining Wang; Lu Hou; Di Lin; Xiang Bai; Hengshuang Zhao; | code |
| 8 | **LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving** Highlight: We show that this asymmetry leads to a significant drop in the performance of the learner. To combat this, we present LEAD, a new high-quality synthetic dataset collected in the CARLA simulator with three key improvements. | Long Nguyen; Micha Fauth; Bernhard Jaeger; Daniel Dauner; Maximilian Igl; Andreas Geiger; Kashyap Chitta; | code |
| 9 | **Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning** Highlight: While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. | Haoji Zhang; Xin Gu; Jiawen Li; Chixiang Ma; Sule Bai; Chubin Zhang; bowen zhang; zhichao zhou; Dongliang He; Yansong Tang; | code |
| 10 | **Pointer-CAD: Unifying B-Rep and Command Sequences Via Pointer-based Edges & Faces Selection** Highlight: Further, the discretization of a continuous variable during sketch and extrude operations may result in topological errors. To address these limitations, we present Pointer-CAD, a novel LLM-based CAD generation framework that leverages a pointer-based command sequence representation to explicitly incorporate the geometric information of B-rep models into sequential modeling. | Dacheng Qi; Chenyu Wang; Jingwei Xu; Tianzhe Chu; Zibo Zhao; Wen Liu; Wenrui Ding; Yi Ma; Shenghua Gao; | code |
| 11 | **Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching** Highlight: Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. | Bowen Wen; Shaurya Dewan; Stan Birchfield; | code |
| 12 | **Stabilizing Streaming Video Geometry Via Dynamic Feature Normalization** Highlight: Through targeted empirical analysis, we trace this instability to its root cause: fluctuations in latent feature statistics, whose mean and variance directly determine the predicted depth’s scale and shift. Building on this insight, we introduce Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that dynamically and robustly modulates feature statistics to maintain stable geometry over time. | Xiaoyang Lyu; Muxin Liu; Xiaoshan Wu; Ruicheng Wang; Yihua Huang; Yangtian Sun; Shaoshuai Shi; Xiaojuan Qi; | code |
| 13 | **Unified Customized Generation By Disentangled Reward Modeling** Highlight: To this end, we introduce **USO**, a **U**nified **S**imultaneous **O**ptimization framework to simultaneously unify different customized tasks (i.e., subject and style). | Shaojin Wu; Mengqi Huang; Yufeng Cheng; wenxu wu; Jiahe Tian; Yiming Luo; Fei Ding; Qian HE; | code |
| 14 | **When Token Pruning Is Worse Than Random: Understanding Visual Token Information in VLLMs** Highlight: While token pruning offers a promising solution for accelerating inference, this paper, however, identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by **“vanishing token information”**, where visual tokens progressively lose their salience with increasing network depth. | Yahong Wang; Juncheng Wu; Zhangkai Ni; Longzhen Yang; Yihang Liu; Chengmei Yang; Ying Wen; Lianghua He; Xianfeng Tang; Hui Liu; Yuyin Zhou; | code |
| 15 | **OmniGen2: Towards Instruction-Aligned Multimodal Generation** Highlight: We introduce **OmniGen2**, a unified multimodal generator designed to follow complex, fine-grained instructions. | Chenyuan Wu; Jiahao Wang; PengFei Zheng; Ruiran Yan; Shitao Xiao; Xin Luo; Yueze Wang; Wanli Li; Xiyan Jiang; Yexin Liu; Junjie Zhou; Ziyi Xia; Ze Liu; Chaofan Li; Haoge Deng; Kun Luo; Bo Zhang; Jiajun Zhang; Dong Liu; Defu Lian; Xinlong Wang; Zhongyuan Wang; Tiejun Huang; Zheng Liu; | code |
| 16 | **Frequency-Aware Flow Matching for High-Quality Image Generation** Highlight: As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process. Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. | Sucheng Ren; Qihang Yu; Ju He; Xiaohui Shen; Liang-Chieh Chen; | code |
| 17 | **Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis** Highlight: We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. | hongyuan chen; Xingyu Chen; Zexiang Xu; Anpei Chen; | code |
| 18 | **ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos** Highlight: In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. | Yuantao Chen; Jiahao Chang; Chongjie Ye; Chaoran Zhang; Zhaojie Fang; Chenghong Li; Xiaoguang Han; | code |
| 19 | **HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from A Single Image** Highlight: In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. | Hezhen Hu; Wangbo Zhao; Lanqing Guo; Hanwen Jiang; Jonathan Liu; Zhiwen Fan; Kai Wang; Zhangyang Wang; Georgios Pavlakos; | code |
| 20 | **Self-Consistency for LLM-based Motion Trajectory Generation and Verification** Highlight: In this work, we study how to adapt self-consistency to visual domains; specifically, we consider the generation and verification of LLM-produced motion graphics trajectories. | Jiaju Ma; R. Kenny Jones; Jiajun Wu; Maneesh Agrawala; | code |
| 21 | **TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast** Highlight: This work presents a generalizable framework to transfer relative depth to metric depth. | Beilei Cui; Yiming Huang; Long Bai; Hongliang Ren; | code |
| 22 | **A Frame Is Worth One Token: Efficient Generative World Modeling with Delta Tokens** Highlight: To explicitly and efficiently model diverse plausible futures, we introduce DeltaWorld, the first VFM-based world model which shifts from deterministic prediction to the ability to generate multiple plausible futures in a single forward pass. | Tommie Kerssies; Gabriele Berton; Ju He; Qihang Yu; Wufei Ma; Daan de Geus; Gijs Dubbelman; Liang-Chieh Chen; | code |
| 23 | **LongVT: Incentivizing Thinking with Long Videos Via Native Tool Calling** Highlight: Inspired by how humans comprehend long videos, by first skimming globally and then examining relevant clips for details, we introduce LongVT, an end-to-end agentic framework that sparks Thinking with Long Videos via interleaved Multimodal Chain-of-Tool-Thought. | Zuhao Yang; Sudong Wang; Kaichen Zhang; Keming Wu; Sicong Leng; Yifan Zhang; Bo Li; Chengwei Qin; Shijian Lu; Xingxuan Li; Lidong Bing; | code |
| 24 | **VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations** Highlight: We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. | Maitreya Patel; Jingtao Li; Weiming Zhuang; Yezhou Yang; Lingjuan Lyu; | code |
| 25 | **SAGE: Scalable Agentic 3D Scene Generation for Embodied AI** Highlight: We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., “pick up a bowl and place it on the table”), understands the intent and automatically generates simulation-ready environments at scale. | Hongchi Xia; Xuan Li; Max Li; Qianli Ma; Jiashu Xu; Ming-Yu Liu; Yin Cui; Tsung-Yi Lin; Wei-Chiu Ma; Shenlong Wang; Shuran Song; Fangyin Wei; | code |
| 26 | **OmniDocLayout: Towards Diverse Document Layout Generation Via Coarse-to-Fine LLM Learning** Highlight: Extensive experiments demonstrate that our approach achieves strong performance on multiple domains in M$^6$Doc dataset, substantially surpassing both existing layout generation experts and several latest general-purpose LLMs. | Hengrui Kang; Zhuangcheng Gu; Zhiyuan Zhao; Zichen Wen; Bin Wang; Weijia Li; Conghui He; | code |
| 27 | **Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy Via Spherical Harmonics for Robot Manipulation** Highlight: In this work, we propose E3Flow, a novel framework that addresses the critical limitations of equivariant diffusion policies. | Qinglun Zhang; Shen Cheng; Tian Dan; Haoqiang Fan; Guanghui Liu; Shuaicheng Liu; | code |
| 28 | **CountGD++: Generalized Prompting for Open-World Counting** Highlight: While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. | Niki Amini-Naieni; Andrew Zisserman; | code |
| 29 | **Attend Before Attention: Efficient and Scalable Video Understanding Via Autoregressive Gazing** Highlight: We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. | Baifeng Shi; Stephanie Fu; Long Lian; Hanrong Ye; David Eigen; Aaron Reite; Jan Kautz; Boyi Li; David Chan; Trevor Darrell; Pavlo Molchanov; Danny Yin; | code |
| 30 | **Taming Generative Diffusion Model for Task-Oriented Infrared Imaging** Highlight: We present a unified diffusion framework that re-formulates IR restoration as a single-step generative process. | Tengyu Ma; Zhilong Dai; Yubo Diao; Guanming An; Long Ma; Jinyuan Liu; Risheng Liu; | code |
| 31 | **CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning** Highlight: We propose CoMo, which aims to learn more precise continuous latent motion from internet-scale videos. | Jiange Yang; tom tomlinson; Haoyi Zhu; Mingyu Liu; Kaijing Ma; Yating Wang; Gangshan Wu; Tong He; Limin Wang; | code |
| 32 | **Motion-Aware Animatable Gaussian Avatars Deblurring** Highlight: This paper introduces a novel method for directly reconstructing sharp 3D human Gaussian avatars from blurry videos. | Muyao Niu; Yifan Zhan; Qingtian Zhu; Zhuoxiao Li; Wei Wang; Zhihang Zhong; Xiao Sun; Yinqiang Zheng; | code |
| 33 | **Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization** Highlight: To this end, this paper proposes a novel Basis-driven framework for domain generalization, namely **Bridge**, that incorporates causal inference into object detection. | Mingbo Hong; Feng Liu; Caroline Gevaert; George Vosselman; Hao Cheng; | code |
| 34 | **Toward Low-Cost Yet Effective Temporal Learning for UAV Tracking** Highlight: In this study, we advocate designing temporal learning components from a more balanced perspective that jointly considers performance gains and computational costs. | chaocan xue; Qihua Liang; Bineng Zhong; Yanting Zu; Yuanliang Xue; Haiying Xia; Shuxiang Song; | code |
| 35 | **The Image As Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation** Highlight: However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce **Adv-GRPO**, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. | Weijia Mao; Hao Chen; Zhenheng Yang; Mike Zheng Shou; | code |
| 36 | **VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models** Highlight: Drawing inspiration from human cognitive memory theory, which distinguishes short-term visually-dominant memory and long-term semantically-dominant memory, we propose VisMem, a cognitively-aligned framework that equips VLMs with dynamic latent vision memories, a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. | Xinlei Yu; Chengming Xu; Guibin Zhang; Zhangquan Chen; Yudong Zhang; Yongbo He; Peng-Tao Jiang; Jiangning Zhang; Xiaobin Hu; Shuicheng Yan; | code |
| 37 | **Unlocking Token Rewards Via Training-Free Reward Attribution** Highlight: In this paper, we propose an extremely efficient, training-free method to extract token-level reward signals directly from an existing deep reward model. | WU Sitong; Haoru Tan; Bin Xia; Xichen Zhang; Jingyao Li; Shaofeng Zhang; Xiaojuan Qi; Bei Yu; Jiaya Jia; | code |
| 38 | **Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding** Highlight: Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. | Christopher Clark; Jieyu Zhang; Zixian Ma; Jae Sung Park; Rohun Tripathi; Sangho Lee; Reza Salehi; Jason Ren; Chris Dongjoo Kim; Yinuo Yang; Vincent Shao; Yue Yang; Weikai Huang; Ziqi Gao; Taira Anderson; Jianrui Zhang; Jitesh Jain; George Stoica; Ali Farhadi; Ranjay Krishna; | code |
| 39 | **Spatiotemporal Pyramid Flow Matching for Climate Emulation** Highlight: Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. | Jeremy Irvin; Jiaqi Han; Zikui Wang; Abdulaziz Alharbi; Yufei Zhao; Nomin-Erdene Bayarsaikhan; Daniele Visioni; Andrew Y. Ng; Duncan Watson-Parris; | code |
| 40 | **SO-Bench: A Structural Output Evaluation of Multimodal LLM** Highlight: In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-BENCH benchmark. | Di Feng; Kaixin Ma; Feng Nan; Haofeng Chen; Bohan Zhai; David Griffiths; Mingfei Gao; Zhe Gan; Eshan Verma; Yinfei Yang; Zhifeng Chen; Afshin Dehghan; | code |
| 41 | **CompBench: Benchmarking Complex Instruction-guided Image Editing** Highlight: To construct CompBench, we propose an MLLM-human collaborative framework with tailored task pipelines. | Bohan Jia; Wenxuan Huang; Yuntian Tang; Junbo Qiao; Jincheng Liao; Shaosheng Cao; Fei Zhao; Zhaopeng Feng; Zhouhong Gu; Zhenfei Yin; Lei Bai; Wanli Ouyang; Lin Chen; Fei Zhao; Zihan Wang; Yuan Xie; Shaohui Lin; | code |
| 42 | **P-Flow: Prompting Visual Effects Generation** Highlight: However, it is hard and time-consuming for humans to craft a single prompt that accurately specifies these effects, as they require complex temporal reasoning and iterative refinement over time. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. | Rui Zhao; Mike Zheng Shou; | code |
| 43 | **3D-LATTE: Latent Space 3D Editing from Textual Instructions** Highlight: Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a training-free editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. | Maria Parelli; Michael Oechsle; Michael Niemeyer; Federico Tombari; Andreas Geiger; | code |
| 44 | **R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII** Highlight: We present R2G (RTL-to-GDSII), a standardized benchmark and framework that converts DEF files into typed, heterogeneous, information-preserving circuit graphs and supports node- and edge-level tasks in placement and routing. | ZEWEI ZHOU; Jiajun Zou; Jiajia Zhang; Ao Yang; Ruichao He; Haozheng Zhou; Ao Liu; Jiawei Liu; Leilei Jin; Shan Shen; Daying Sun; | code |
| 45 | **ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models** Highlight: In this paper, we propose ACoT-VLA, a novel architecture that materializes the ACoT paradigm. | Linqing Zhong; Yi Liu; Yifei Wei; Ziyu Xiong; Si Liu; Guanghui Ren; | code |
| 46 | **BinaryAttention: One-Bit Attention for Vision and Diffusion Transformers** Highlight: In this paper, with theoretical justification, we indicate that binarization of attention preserves the essential similarity relationships, and propose BinaryAttention, an effective method for fast and accurate 1-bit attention. | Chaodong XIAO; Zhengqiang ZHANG; Lei Zhang; | code |
| 47 | **StableMaterials: Enhancing Diversity in Material Generation Via Semi-Supervised Learning** Highlight: We introduce **StableMaterials**, a novel approach for generating photorealistic physical-based rendering (PBR) materials that integrate semi-supervised learning with Latent Diffusion Models (LDMs). | Giuseppe Vecchio; | code |
| 48 | **VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion** Abstract: Compared to images, videos better align with real-world acquisition scenarios and possess valuable temporal cues. However, existing multi-sensor fusion research predominantly … | Linfeng Tang; Yeda Wang; Meiqi Gong; Zizhuo Li; Yuxin Deng; Xunpeng Yi; Chunyu Li; Han Xu; HAO ZHANG; Jiayi Ma; | code |
| 49 | **CTCal: Rethinking Text-to-Image Diffusion Models Via Cross-Timestep Self-Calibration** Highlight: In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), founded on the supporting observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. | Xiefan Guo; Xinzhu Ma; Haiyu Zhang; Di Huang; | code |
| 50 | **Virtual Full-stack Scanning of Brain MRI Via Imputing Any Quantised Code** Highlight: Existing imputation methods often depend on global conditioning or modality-specific designs, which limit their generalisability across patient cohorts and imaging protocols. To address these limitations, we propose CodeBrain, a unified framework that reformulates various “any-to-any” imputation tasks as a region-level full-stack code prediction problem. | Yicheng Wu; Tao Song; Zhonghua Wu; Jin Ye; Zongyuan Ge; Wenjia Bai; Zhaolin Chen; Jianfei Cai; | code |
| 51 | **Wavelet-based Frame Selection By Detecting Semantic Boundary for Long Video Understanding** Highlight: In this paper, we introduce **W**avelet-based **F**rame **S**election by Detecting **S**emantic **B**oundary (**WFS-SB**), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts, the pivotal moments of narrative change that are essential to comprehending the holistic storyline of video. | Wang Chen; Yuhui zeng; Yongdong Luo; Tianyu Xie; Luojun Lin; Jiayi Ji; Yan Zhang; Xiawu Zheng; | code |
| 52 | **MeshSplatting: Differentiable Rendering with Opaque Meshes** Highlight: We present MeshSplatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering. | Jan Held; Sanghyun Son; Renaud Vandeghen; Daniel Rebain; Matheus Gadelha; Yi Zhou; Anthony Cioppa; Ming Lin; Marc Van Droogenbroeck; Andrea Tagliasacchi; | code |
| 53 | **SDUIE: Semi-Supervised Diffusion for Underwater Image Enhancement with Quant-Text Dual Control** Highlight: While existing enhancement methods have achieved promising performance, they typically overlook the subjective nature of visual preferences. To address this gap, we propose SDUIE, a level-aware Semi-supervised Diffusion framework for Underwater Image Enhancement that enables dual control through both quantitative and textual inputs. | Xiaofeng Cong; Yu-Xin Zhang; Hao Shen; Yeying Jin; Junming Hou; Jie Gui; | code |
| 54 | **BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement** Highlight: To this end, we propose BiEvLight, a hierarchical and task-aware framework that collaboratively optimizes enhancement and denoising by exploiting their intrinsic interdependence. | Zishu Yao; Xiang-Xiang Su; Shengning Zhou; Guang-Yong Chen; Guodong Fan; Xing Chen; | code |
| 55 | **From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding** Highlight: In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. | Yuyuan Liu; Yiping Ji; Anjie Le; Jiayuan Zhu; Jiazhen Pan; Can Peng; Jiajun Deng; Fengbei Liu; Junde Wu; | code |
| 56 | **Adversarial Style Optimization: Enhancing VLM Jailbreaks By GRPO-based Stylistic Triggers Optimization** Highlight: However, from the perspective of safety ability, their defense mechanisms can be easily bypassed by these specific stylistic triggers, leading to harmful responses. Based on this finding, we propose Adversarial Style Optimization (ASO), a plug-and-play enhancement module to amplify existing visual jailbreaks. | Bingjun Luo; Jialin Guo; Yue Yao; Xinpeng Ding; | code |
| 57 | **EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation** Highlight: In this work, we present EffectMaker, a unified reasoning–generation framework that enables reference-based VFX customization. | Shiyuan Yang; Ruihuang Li; Jiale Tao; Shuai Shao; qinglin lu; Jing Liao; | code |
| 58 | **Order Matters: 3D Shape Generation from Sequential VR Sketches** Highlight: We introduce VRSketch2Shape, the first framework and multi-category dataset for 3D shape generation from sequential VR sketches. | Yizi Chen; Sidi Wu; Tianyi Xiao; Nina Wiedemann; Loic Landrieu; | code |
| 59 | CoD: A Diffusion Foundation Model for Image Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address it, we introduce CoD, the first Compression-oriented Diffusion foundation model, trained from scratch to enable end-to-end optimization of both compression and generation. |
Zhaoyang Jia; Zihan Zheng; Naifu Xue; Jiahao Li; Bin Li; Zongyu Guo; Xiaoyi Zhang; Houqiang Li; Yan Lu; | code |
| 60 | Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. |
Yuqing Wang; Chuofan Ma; Zhijie Lin; Yao Teng; Lijun Yu; Shuai Wang; Jiaming Han; Jiashi Feng; Yi Jiang; Xihui Liu; | code |
| 61 | MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paradigm leads to computational inefficiency when scaling, as it requires a separate forward pass for each pair and overlooks potential contextual relationships between multiple queries that can relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. |
Geonmo Gu; Byeongho Heo; Jaemyung Yu; Jaehui Hwang; Taekyung Kim; Sangmin Lee; HeeJae Jun; Yoohoon Kang; Sangdoo Yun; Dongyoon Han; | code |
| 62 | Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, GT can still contain images with inconsistent perceptual fidelity, causing models to converge to the average quality level of the training data rather than achieving the highest perceptual quality attainable. To address these problems, we propose a novel framework, termed IQPIR, that introduces an Image Quality Prior (IQP), extracted from pre-trained No-Reference Image Quality Assessment (NR-IQA) models, to explicitly guide the restoration process toward perceptually optimal outputs. |
Fengyang Xiao; Peng Hu; Lei Xu; XingE Guo; Guanyi Qin; Yuqi Shen; Chengyu Fang; Rihan Zhang; Chunming He; Sina Farsiu; | code |
| 63 | Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, since SD exhibits different generative priors at different timesteps, methods that use a fixed timestep struggle to fully leverage the generative priors in SD, leading to suboptimal performance. To address this, we propose a Time-Aware one-step Diffusion Network for Real-ISR (TADSR). |
Tianyi Zhang; Zheng-Peng Duan; Chun-Le Guo; Peng-Tao Jiang; Bo Li; Ming-Ming Cheng; Chongyi Li; | code |
| 64 | Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or specified numbers of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. |
Hanchao Liu; Fang-Lue Zhang; Shining Zhang; Tai-Jiang Mu; Shi-Min Hu; | code |
| 65 | Rethinking MLLM Itself As A Segmenter with A Single Segmentation Token Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, which eliminates the need for external decoders. |
Anqi Zhang; Xiaokang Ji; Guangyu Gao; Jianbo Jiao; Chi Harold Liu; Yunchao Wei; | code |
| 66 | Align Images Before You Generate Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Addressing this issue remains challenging, especially without any external geometric or semantic priors during the pure generative inference. In this paper, we introduce CorrAdapter, a plug-and-play adapter that discovers and exploits an innate property of the multi-image diffusion itself, aligning all output images before they are in fact generated. |
Shihua Zhang; Qiuhong Shen; Xinchao Wang; | code |
| 67 | Semantic Context Matters: Improving Conditioning for Autoregressive Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, extending AR models to controllable image editing remains challenging due to weak and inefficient conditioning strategies, which often lead to suboptimal semantic alignment and visual quality. To address this limitation, we present SCAR, a Semantic-Context-driven method for AutoregRessive models. |
Dongyang Jin; Ryan Xu; Jianhao Zeng; Rui Lan; Yancheng Bai; Lei Sun; Xiangxiang Chu; | code |
| 68 | Event-Illumination Collaborative Low-light Image Enhancement with A High-resolution Real-world Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Event-based low-light image enhancement (LIE) methods mainly focus on incorporating high dynamic range (HDR) information from events while overlooking the essential global illumination in images and the inherent noise sensitivity of event signals in real-world scenarios. To address these issues, we propose EIC-LIE, an event-illumination collaborative LIE framework. |
Senyan Xu; Zhijing Sun; Kean Liu; Xin Lu; Ruixuan Jiang; Xueyang Fu; Zheng-Jun Zha; | code |
| 69 | Dynamic Important Example Mining for Reinforcement Finetuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Dynamic Important Example Mining (DIEM), a principled and fully automated framework that makes data utilization adaptive throughout RFT. |
Haoru Tan; Wu Sitong; Yanfeng Chen; Shizhen Zhao; Yangtian Sun; Tianjia Liu; Chirui Chang; Shaofeng Zhang; Xingwu Sun; Xiuzhe Wu; Ruobing Xie; Xiaojuan Qi; | code |
| 70 | Dataset Distillation Via Influence Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Concretely, we introduce a fully differentiable, sample-level influence estimator that quantifies parameter shifts from adding or removing data, without time-consuming inverse-Hessian products or convexity assumptions. |
Haoru Tan; Wang Wang; Wu Sitong; Xiuzhe Wu; Yangtian Sun; Chirui Chang; Shaofeng Zhang; Xiaojuan Qi; | code |
| 71 | ARGUS: Defending Against Multimodal Indirect Prompt Injection Via Steering Instruction-Following Behavior Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, we also found that a naive defense direction could be coupled with a utility-degrading direction, and excessive intervention strength harms model performance. To address this, we propose ARGUS, which searches for an optimal defense direction within the safety subspace that decouples from the utility degradation direction, further combining adaptive strength steering to achieve a better safety-utility trade-off. |
Weikai Lu; Ziqian Zeng; Kehua Zhang; Haoran Li; Huiping Zhuang; Ruidong Wang; Cen Chen; Hao Peng; | code |
| 72 | AnchorFlow: Training-Free 3D Editing Via Latent Anchor-Aligned Flows Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing approaches often struggle to produce strong or geometrically stable edits, largely due to inconsistent latent anchors introduced by timestep-dependent noise during diffusion sampling. To address these limitations, we introduce AnchorFlow, which is built upon the principle of latent anchor consistency. |
Zhenglin Zhou; Fan Ma; Chengzhuo Gui; Xiaobo Xia; Hehe Fan; Yi Yang; Tat-seng Chua; | code |
| 73 | Rethinking Token Reduction for Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. |
Yi Wang; Haofei Zhang; Qihan Huang; Anda Cao; Gongfan Fang; Wei Wang; Xuan Jin; Jie Song; Mingli Song; Xinchao Wang; | code |
| 74 | BrickNet: Graph-Backed Generative Brick Assembly Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We train a language model to generate LEGO®-brick build sequences. |
Peter Kulits; Cordelia Schmid; | code |
| 75 | See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. |
Bo-Yuan Sun; Bo-Wen Yin; Yuan-Ming Li; Xihan Wei; Qibin Hou; | code |
| 76 | Revisiting The Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We study how different Chain-of-Thought (CoT) designs affect the acquisition of the generalizable visual reasoning ability in vision-language models (VLMs). |
Yifan Du; Kun Zhou; Yingqian Min; Yue Ling; Xin Zhao; Youbin Wu; Ji-Rong Wen; | code |
| 77 | Unleashing Vision-Language Semantics for Video Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength: the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics in enhancing the model’s discriminability in deepfake detection. |
Jiawen Zhu; Yunqi Miao; Xueyi Zhang; Jiankang Deng; Guansong Pang; | code |
| 78 | Multi-view Consistent 3D Gaussian Head Avatars ‘without’ Multi-view Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: For high scalability, avatars must be generated from minimal resources, without costly MV studio captures or any 3D data. In this work, we target this challenging minimal-resource setting for 3D head generation. |
Aviral Chharia; Fernando De la Torre; | code |
| 79 | Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although existing methods can be directly adapted to the VLM, since they only consider global-aware learning, the fine-grained attribute knowledge is underleveraged, leading to limited acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). |
Kunlun Xu; Haotong Cheng; Jiangmeng Li; Xu Zou; Jiahuan Zhou; | code |
| 80 | When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. |
Zhengyang Sun; Yu Chen; Xin Zhou; Xiaofan Li; Xiwu Chen; Dingkang Liang; Xiang Bai; | code |
| 81 | CrackSSM: Reviving SSMs for Crack Segmentation Via Dynamic Scanning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This fixed flattening order disrupts spatial continuity and weakens the SSM’s ability to model irregular crack patterns effectively. To address this limitation, we propose CrackSSM, a novel crack-aware segmentation framework featuring a dynamic scanning strategy that adapts the token sequence to the underlying structure of each image. |
Yubin Gu; Boyang Hou; Yuan Meng; Wenting Luo; Jiayi Ji; Xiaoshuai Sun; | code |
| 82 | Cross-Domain Few-Shot Segmentation Via Multi-view Progressive Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. |
Jiahao Nie; Guanqiao Fu; Wenbin An; Yap-Peng Tan; Alex C. Kot; Shijian Lu; | code |
| 83 | MM-OVSeg: Multimodal Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the cross-modal domain gap and the limited dense prediction capability of current vision–language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. |
Yimin Wei; Aoran Xiao; Hongruixuan Chen; Junshi Xia; Naoto Yokoya; | code |
| 84 | The Power of Prior: Training-Free Open-Vocabulary Semantic Segmentation with LLaVA Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Thus, in this paper, we explore the utilization of LLaVA for training-free open-vocabulary semantic segmentation. |
Bingfeng Zhang; Siyue Yu; Hui Li; Jiahua Lin; Wenwu Wang; Jimin Xiao; | code |
| 85 | NIL: No-data Imitation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce No-data Imitation Learning (NIL): an imitation learning framework that replaces curated expert demonstrations with videos generated by a pretrained video diffusion model. |
Mert Albaba; Chenhao Li; Markos Diomataris; Omid Taheri; Andreas Krause; Michael J. Black; | code |
| 86 | YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present YOSE — You Only Select Essential Tokens, an efficient fine-tuning framework. |
Wu Chenyang; Lina Lei; Fan Li; Chun-Le Guo; Dehong Kong; Xinran Qin; Zhixin Wang; Ming-Ming Cheng; Chongyi Li; | code |
| 87 | PointTPA: Test-Time Parameter Adaptation for 3D Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Test-time Parameter Adaptation for Point Cloud Scene Perception (PointTPA), a test-time dynamic adaptation framework that constructs input-aware parameters for scene-level point clouds. |
Siyuan Liu; Chaoqun Zheng; Xin Zhou; Tianrui Feng; Dingkang Liang; Xiang Bai; | code |
| 88 | MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose MD2E, a method that models depth-to-edge cues by deriving edge targets from depth annotations, calibrating metric scale using the spectral score, and using edge predictions to regularize depth boundaries while producing metric depth. |
Chao Ning; Minghe Shen; Naoto Yokoya; | code |
| 89 | Taming The Long Tail: Rebalancing Adversarial Training Via Adaptive Perturbation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Furthermore, we show that perturbations can simultaneously address both adversarial vulnerability and class imbalance. Based on these insights, we propose Rebalanced Adversarial Intensity for Long-Tailed Data (RAIL), a plug-and-play framework that adaptively adjusts perturbations during adversarial training. |
Lilin Zhang; Yimo Guo; Li Yue; Jiancheng Shi; Xianggen Liu; | code |
| 90 | AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent’s state and task progress in a fully end-to-end and data-driven manner. |
Wenxuan Guo; Xiuwei Xu; Yichen Liu; Xiangyu Li; Hang Yin; Huangxing Chen; Wenzhao Zheng; Jianjiang Feng; Jie Zhou; Jiwen Lu; | code |
| 91 | Dynamic-eDiTor: Training-Free Text-Driven 4D Scene Editing with Multimodal Diffusion Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Dynamic-eDiTor, a training-free text-driven 4D editing framework leveraging Multimodal Diffusion Transformer (MM-DiT) and 4DGS. |
Dong In Lee; Hyungjun Doh; Seunggeun Chi; Runlin Duan; Sangpil Kim; Karthik Ramani; | code |
| 92 | AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In response, we present AnyID, an ultra-fidelity identity-preservation video generation framework. |
Jiahao Wang; Hualian Sheng; Sijia Cai; Yuxiao Yang; Weizhan Zhang; Caixia Yan; Bing Deng; Jieping Ye; | code |
| 93 | EventGait: Towards Robust Gait Recognition with Event Streams Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the absence of large-scale event-based gait datasets, we introduce a synthesis pipeline and release two new benchmarks: SUSTech1K-E and CCGR-Mini-E. |
Senyan Xu; Shuai Chen; Chuanfu Shen; Kean Liu; Zhijing Sun; Chengzhi Cao; Xueyang Fu; | code |
| 94 | Unified Video Editing As Temporal Reasoner Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing video editing methods face a critical trade-off: expert models offer precision but rely on task-specific priors like masks, hindering unification; conversely, unified temporal in-context learning models are mask-free but lack explicit spatial cues, leading to weak instruction-to-region mapping and imprecise localization. To resolve this conflict, we propose VideoCoF, a novel Chain-of-Frames approach inspired by Chain-of-Thought reasoning. |
Xiangpeng Yang; Ji Xie; Yiyuan Yang; Yue Ma; Yan Huang; Min Xu; Qiang Wu; | code |
| 95 | MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Synthetic datasets suffer from a realism gap, whereas real-world captures lack the detailed annotations and paired data required for virtual try-on (VTON) and size estimation tasks. To bridge this gap, we introduce MV-Fashion, a large-scale, multi-view video dataset engineered for domain-specific fashion analysis. |
Hunor Laczko; Libang Jia; Phat Truong; Diego Hernández; Sergio Escalera; Jordi Gonzàlez; Meysam Madadi; | code |
| 96 | DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images Via Diffusion Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we delve into a novel task: Layer-Wise Decomposition of Alpha-Composited Images, aiming to recover constituent layers from single overlapped images under the condition of semi-transparent/transparent alpha layer non-linear occlusion. |
Hang Zhao; Hang Zhao; Qianyu Zhou; Xuequan Lu; Xiangtai Li; Hao Yang; Bo Yang; Yiren Song; | code |
| 97 | Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. |
Haojie Zheng; Shuchen Weng; Jingqi Liu; Siqi Yang; Boxin Shi; Xinlong Wang; | code |
| 98 | Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. |
Zengyi Yang; Yu Liu; Juan Cheng; Zhiqin Zhu; Yafei Zhang; Huafeng Li; | code |
| 99 | VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This issue becomes particularly acute in reinforcement learning (RL) scenarios, leading to unstable training and suboptimal alignment. To resolve this, we propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts. |
Shikun Sun; Liao Qu; Huichao Zhang; Yiheng Liu; Yangyang Song; Xian Li; Yi Jiang; Xu Wang; Jia Jia; Daniel Kang Du; Xinglong Wu; | code |
| 100 | OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. |
Haosong Peng; Hao Li; Yalun Dai; Yushi Lan; Yihang Luo; Tianyu Qi; Zhengshen Zhang; Yufeng Zhan; Junfei Zhang; Wenchao Xu; Ziwei Liu; | code |
| 101 | Illuminating Visual Identity in Universal Multimodal Embeddings Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, a crucial capability, visual identity discrimination, remains underexplored in existing UME methods, despite its critical role in a wide range of tasks, including instance retrieval, re-identification, and identity preservation in AI-generated content (AIGC). To bridge this gap, we propose a unified formulation for visual identity discrimination and introduce MIEB (Multimodal Visual Identity Embedding Benchmark), a large-scale benchmark curated from both real-world and synthetic datasets to support evaluation and training. |
Jiawei Cao; Junyi Feng; Jiashen Hua; Ziheng Huang; Bing Deng; Kaijie Wu; Chaochen Gu; Jieping Ye; | code |
| 102 | See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI By Identifying Toggles Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the challenge, we propose State-aware Reasoning (StaR), a multimodal reasoning method that enables agents to perceive the current toggle state, infer the desired state from the instruction, and act accordingly. |
Zongru Wu; Rui Mao; Zhiyuan Tian; Pengzhou Cheng; Tianjie Ju; Zheng Wu; Lingzhong Dong; Haiyue Sheng; Zhuosheng Zhang; Gongshen Liu; | code |
| 103 | Identity-Preserving Image-to-Video Generation Via Reward-Guided Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. |
Liao Shen; Wentao Jiang; Yiran Zhu; Jiahe Li; Tiezheng Ge; Zhiguo Cao; Bo Zheng; | code |
| 104 | M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image–question pairs for evaluating retrieval-augmented VQA across languages and modalities. |
David Anugraha; Patrick Irawan; Anshul Singh; En-Shiun Annie Lee; Genta Indra Winata; | code |
| 105 | Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This overhead stems from the large number of visual tokens generated during encoding, which significantly increases inference latency and memory consumption when passed to large language models, owing to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework specifically designed for multi-view VLMs in autonomous driving. |
Minhao Xiong; Zichen Wen; Zhuangcheng Gu; Xuyang Liu; Rui Zhang; Hengrui Kang; Jiabing Yang; Junyuan Zhang; Weijia Li; Conghui He; Linfeng Zhang; | code |
| 106 | A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. |
Yichang Xu; Gaowen Liu; Ramana Kompella; Tiansheng Huang; Sihao Hu; Fatih Ilhan; Selim Tekin; Zachary Yahn; Ling Liu; | code |
| 107 | Learning from Itself: Mining Internal Knowledge from Vision Language Models for Continual Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Learning from Itself (LfI), which mines CLIP’s internal knowledge to address both challenges. |
Yizheng Gong; Siyue Yu; Waleed Al-Nuaimy; Jimin Xiao; | code |
| 108 | Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We attribute this inefficiency to substantial redundancy in the visual regions of document images, such as background. To tackle this, we propose PaddleOCR-VL, a novel coarse-to-fine architecture that focuses on semantically relevant regions while suppressing redundant ones, thereby improving both efficiency and performance. |
Cheng Cui; Ting Sun; Suyin Liang; Tingquan Gao; Zelun Zhang; Jiaxuan Liu; Xueqing Wang; Changda Zhou; Hongen Liu; Lin Manhui; Yue Zhang; Yubo Zhang; Jing Zhang; Jun Zhang; Xing Wei; Yi Liu; Dianhai Yu; Yanjun Ma; | code |
| 109 | Dual-level Adaptation for Multi-Object Tracking: Building Test-Time Calibration from Experience and Intuition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by the human decision-making process, this paper proposes a Test-time Calibration from Experience and Intuition (TCEI) framework. |
Wen Guo; Pengfei Zhao; Zongmeng Wang; Yufan Hu; Junyu Gao; | code |
| 110 | Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking from Sparse Inertial Sensors and Ranging-based Between-sensor Distances Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these distances can also be used to reconstruct the underlying 3D sensor layout, which in turn provides more informative input for pose reconstruction. We propose Ultra Diffusion Poser, a diffusion model that explicitly models these geometric constraints. |
Dominik Hollidt; Tommaso Bendinelli; Christian Holz; | code |
| 111 | Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose to leverage off-the-shelf visual specialists, which were trained from annotated images initially not for image captioning, for enhancing the image caption. |
Yanpeng Sun; Jing Hao; Ke Zhu; Jiang-Jiang Liu; Xiaofan Li; Na Zhao; Zechao Li; Jingdong Wang; | code |
| 112 | MM-ACT: Learn from Multimodal Parallel Generation to Act Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in a shared token space and performs generation across all three modalities. |
Haotian Liang; Xinyi Chen; Bin Wang; MingKang Chen; Yitian Liu; Yuhao Zhang; Zanxin Chen; Tianshuo Yang; Yilun Chen; Jiangmiao Pang; Dong Liu; Xiaokang Yang; Yao Mu; Wenqi Shao; Ping Luo; | code |
| 113 | ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, we identify that this adaptation strategy overlooks a fundamental issue: adapting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation, the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL (Recalibrating Capability Degradation), a model-agnostic framework that follows a diagnose–generate–refine pipeline: first, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. |
Tianyu Yang; ChenWei He; Xiangzhao Hao; Tianyue Wang; Jiarui Guo; Haiyun Guo; Leigang Qu; Jinqiao Wang; Tat-seng Chua; | code |
| 114 | Best Segmentation Buddies for Image-Shape Correspondence Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we examine the underexplored problem of estimating segmentation-to-segmentation correspondence between images in the wild and untextured 3D shapes. |
Itai Lang; Dongwei Lyu; Dale Decatur; Rana Hanocka; | code |
| 115 | DMAligner: Enhancing Image Alignment Via Diffusion Model Based View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. |
Xinglong Luo; Ao Luo; Zhengning Wang; Yueqi Yang; Chaoyu Feng; Lei Lei; Bing Zeng; Shuaicheng Liu; | code |
| 116 | VidEoMT: Your ViT Is Secretly Also A Video Segmentation Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. |
Narges Norouzi; Idil Esen Zulfikar; Niccolò Cavagnero; Tommie Kerssies; Bastian Leibe; Gijs Dubbelman; Daan de Geus; | code |
| 117 | FVGen: Scaling 3D Scene Datasets with Certainty-Aware Free-View Generation from Scene Geometry Reconstruction Highlight: We introduce FVGen, a novel framework that leverages the power of scene reconstruction to transform limited real-world image sequences into a scalable source of high-quality training data. |
Chenhan Jiang; Yu Chen; Qingwen Zhang; Jifei Song; Songcen Xu; Dit-Yan Yeung; Jiankang Deng; | code |
| 118 | VITAL: Vision-Encoder-centered Pretraining for LMMs in Visual Quality Assessment Highlight: However, existing VQualA LMMs typically focus on a single task and rely on full-parameter fine-tuning, which makes them prone to overfitting on specific modalities or task types, thereby limiting their generalization capacity and transferability. To address this, we propose a **vision-encoder-centered generative pre-training** pipeline and develop the **VITAL-Series** LMMs. |
Ziheng Jia; Linhan Cao; Jinliang Han; Zicheng Zhang; Jiaying Qian; Wang Jiarui; Zijian Chen; Guangtao Zhai; Xiongkuo Min; | code |
| 119 | Scal3R: Scalable Test-Time Training for Feed-forward Large-Scale 3D Reconstruction Highlight: In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. |
Tao Xie; Peishan Yang; Yudong Jin; Yingfeng Cai; Wei Yin; Weiqiang Ren; Qian Zhang; Wei Hua; Sida Peng; Xiaoyang Guo; Xiaowei Zhou; | code |
| 120 | PhaseWin Search Framework Enable Efficient Object-Level Interpretation Highlight: Recent methods based on submodular subset selection have achieved high faithfulness, but their efficiency limitations hinder practical deployment in real-world scenarios. To address this, we propose PhaseWin, a novel phase-window search algorithm that enables faithful region attribution with near-linear complexity. |
Zihan Gu; Ruoyu Chen; Junchi Zhang; Yue Hu; Hua Zhang; Xiaochun Cao; | code |
| 121 | Thinking with Programming Vision: Towards A Unified View for Thinking with Images Highlight: In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes, underscoring the need for more robust tool-based reasoning. To address this, we propose **CodeVision**, a flexible and scalable “code-as-tool” framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. |
Zirun Guo; Minjie Hong; Feng Zhang; Kai Jia; Tao Jin; | code |
| 122 | DuetGen: Towards General Purpose Interleaved Multimodal Generation Highlight: We introduce DuetGen, a general-purpose interleaved multimodal generation model and investigate data curation, architecture design, and evaluation. |
Min Shi; Xiaohui Zeng; Jiannan Huang; Yin Cui; Francesco Ferroni; Jialuo Li; Max Li; Yogesh Balaji; Haoxiang Wang; Tsung-Yi Lin; Xiao Fu; Yue Zhao; Chieh-Yun Chen; Ming-Yu Liu; Humphrey Shi; | code |
| 123 | GThinker: Towards General Multimodal Reasoning Via Cue-Guided Rethinking Highlight: GThinker leverages Cue-Rethinking, a flexible reasoning pattern that not only grounds reasoning in visual cues but also strategically triggers a re-examination of these cues to resolve inconsistencies. To instill this capability, we introduce a novel two-stage training framework. |
Yufei Zhan; Ziheng Wu; Yousong Zhu; Rongkun Xue; Guanghao Zhou; Ruipu Luo; Zhenghao Chen; Can Zhang; Yifan Li; Zhentao He; Zheming Yang; Ming Tang; Minghui Qiu; Jinqiao Wang; | code |
| 124 | From Few-way to Many-way: Rethinking Few-shot Fine-grained Image Classification Highlight: In this paper, we pioneer a theoretical analysis of novel class behavior in FSFG and derive a class discriminative index bound. |
Li-Jun Zhao; Zhen-Duo Chen; Xin Luo; Xin-Shun Xu; | code |
| 125 | Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality Highlight: This work presents LivingSwap, the first video reference guided face swapping model. |
Zekai Luo; Zongze Du; Zhouhang Zhu; Hao Zhong; Muzhi Zhu; Wen Wang; Yuling Xi; Chenchen Jing; Hao Chen; Chunhua Shen; | code |
| 126 | IDperturb: Enhancing Variation in Synthetic Face Generation Via Angular Perturbations Highlight: In this work, we propose IDperturb, a simple yet effective geometric-driven sampling strategy to enhance diversity in synthetic face generation. |
Fadi Boutros; Eduarda Caldeira; Tahar Chettaoui; Naser Damer; | code |
| 127 | PhysIR-Splat: Physically Consistent Thermal Infrared Radiative Transfer in 3D Gaussian Splatting Highlight: We present PhysIR-Splat, a 3DGS framework that follows infrared radiative transfer: we explicitly model temperature, emissivity, and environmental irradiance on Gaussian primitives and, during rendering, jointly account for thermal emission, the reflected component, and atmospheric transmittance to produce physically consistent thermal synthesis. |
Jingyuan Gao; Yumeng Hu; Fei Gao; Mingjin Zhang; | code |
| 128 | Multinex: Lightweight Low-light Image Enhancement Via Multi-prior Retinex Highlight: To achieve low-cost, effective LLIE, we present Multinex, an ultra-lightweight structured framework that integrates multiple fine-grained representations within a principled Retinex formulation. |
Alexandru Brateanu; Tingting Mu; Codruta Ancuti; Cosmin Ancuti; | code |
| 129 | Decoupling Bias, Aligning Distributions: Synergistic Fairness Optimization for Deepfake Detection Highlight: However, current fairness-enhanced detectors often improve fairness at the cost of detection accuracy. To address this challenge, we propose a dual-mechanism collaborative optimization framework. |
Feng Ding; Wenhui Yi; Yunpeng Zhou; Xinan He; Hong Rao; Shu Hu; | code |
| 130 | ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control Highlight: We present ChArtist, a domain-specific method for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject-driven control that respects the visual characteristics of a reference image. |
Shishi Xiao; Tongyu Zhou; David H. Laidlaw; Gromit Yeuk-Yin Chan; | code |
| 131 | FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution Highlight: This insensitivity allows ISR models to produce perceptually undesirable artifacts that yield spurious high scores, misaligning optimization objectives with perceptual quality and resulting in reward hacking. To address this, we propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on an Encoder-Decoder architecture. |
Yidi Liu; Zihao Fan; Jie Huang; Jie Xiao; Dong Li; Wenlong Zhang; Lei Bai; Xueyang Fu; Zheng-Jun Zha; | code |
| 132 | SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models Highlight: This work is motivated by two key observations: (1) *Vision-Language Models* (VLMs), such as CLIP, often over-rely on class-level textual priors and struggle to capture fine-grained visual cues, whereas *Vision-only Foundation Models* (VFMs), such as DINO, provide rich and discriminative visual features but lack semantic alignment; (2) the performance of different VLMs varies considerably across datasets owing to differences in pre-training. To address these challenges, we propose **SOTA** (*Self-adaptive Optimal TrAnsport*), a *training-free* ensemble framework that integrates the outputs of multiple foundation models (VFMs or VLMs) by learning a self-adaptive transport plan. |
Zhanxuan Hu; Xu Qiyu; Yu Duan; Yonghang Tai; Huafeng Li; | code |
| 133 | The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation Highlight: We prove its existence in a single-step adaptation setting and show that it coincides with the row space of the pretrained classifier. To enable online maintenance of this subspace, we introduce the sample-wise Average Gradient Outer Product (AGOP) as an efficient proxy for estimating the classifier weights without retraining. |
Guannan Lai; Da-Wei Zhou; Zhenguo Li; Han-Jia Ye; | code |
| 134 | OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis Highlight: We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. |
Junuk Cha; Jihyeon Kim; Han-Mu Park; | code |
| 135 | HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models Highlight: Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. |
Yangguang Lin; Quan Fang; Yufei Li; Jiachen Sun; Junyu Gao; Jitao Sang; | code |
| 136 | From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection Highlight: In this paper, we reframe keypoint detection as a sequential decision-making problem. |
Yepeng Liu; Hao Li; Liwen Yang; Fangzhen Li; Xudi Ge; Yuliang Gu; Kuang Gao; Bing Wang; Guang Chen; Hangjun Ye; Yongchao Xu; | code |
| 137 | RevINN: An End-to-End Invertible Neural Network for Reversible Adversarial Examples Generation Highlight: However, this two-stage process often results in RAEs with inferior attack effectiveness and visual quality compared to the original versions. To solve these challenges, we propose a novel end-to-end Invertible Neural Network for Reversible Adversarial Examples Generation (RevINN), which directly generates RAEs in one stage by scrambling the intrinsic frequency information of images. |
Jielun Huang; Chi-Man Pun; Guoheng Huang; | code |
| 138 | UETrack: A Unified and Efficient Framework for Single Object Tracking Highlight: Moreover, current multi-modal tracking approaches typically use complex designs, making them too heavy and slow for resource-constrained deployment. To tackle these limitations, we propose UETrack, a unified and efficient framework for single object tracking. |
Ben Kang; Jie Zhao; Xin Chen; Wanting Geng; Bin Zhang; Lu Zhang; Dong Wang; Huchuan Lu; | code |
| 139 | MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents Highlight: Current vision–language embodied agents lack ToM-based decision-making, and existing benchmarks focus solely on human mental states while ignoring the agent’s own perspective, hindering coherent decision and action generation. To address this, we propose MindPower, a Robot-Centric framework integrating Perception, Mental Reasoning, Decision Making and Action. |
Ruoxuan Zhang; Qiyun Zheng; Zhiyu Zhou; Ziqi Liao; Siyu Wu; Jian-Yu Jiang-Lin; Bin Wen; Hongxia Xie; Jianlong Fu; Wen-Huang Cheng; | code |
| 140 | HiDRA: Hierarchical Degradation Representation and Adaptation with Generative Priors for Enhancing Infrared Vision Highlight: Pre-trained generative models showcase powerful capabilities for alleviating degradations but lack effective tools to adapt visible generative priors to TIR-specific characteristics. To overcome these challenges, we propose a Hierarchical Degradation Representation and Adaptation (HiDRA) framework to decompose the enhancement procedure into degradation representation estimation and generative model fine-tuning. |
Zihang Chen; Zhu Liu; Changbo Yan; Jinyuan Liu; Risheng Liu; | code |
| 141 | MergeVLA: Cross-Skill Model Merging Toward A Generalist Vision-Language-Action Agent Highlight: (2) Action experts develop inter-block dependencies through self-attention feedback, causing task information to spread across layers and preventing modular recombination. To address these challenges, we present MergeVLA, a merging-oriented VLA architecture that preserves mergeability by design. |
Yuxia Fu; Zhizhen Zhang; Yuqi Zhang; Zijian Wang; Zi Huang; Yadan Luo; | code |
| 142 | Residual Diffusion Bridge Model for Image Restoration Highlight: Besides, they indiscriminately reconstruct images through global noise injection and removal, inevitably distorting undegraded regions due to imperfect reconstruction. To address these challenges, we propose the Residual Diffusion Bridge Model (RDBM). |
Hebaixu Wang; Jing Zhang; Haoyang Chen; Haonan Guo; Di Wang; Jiayi Ma; Bo Du; | code |
| 143 | D$^3$FER: Dual Channel and Dual Branch Network for Robust Facial Expression Recognition Under Dual Noise Highlight: To this end, we propose D$^3$FER (**D**ual Channel and **D**ual Branch Network for Robust Facial Expression Recognition under **D**ual Noise), a unified framework that simultaneously tackles data and label noise in a single architecture. |
Hui Tang; Yifan He; Zhong Jin; | code |
| 144 | Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models Highlight: Through empirical and theoretical analyses, we show that the commonly used $\beta$-VAE-based tokenizers in latent diffusion models tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. |
Qifan Li; Xingyu Zhou; Jinhua Zhang; Weiyi You; Shuhang Gu; | code |
| 145 | Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models Highlight: However, they remain limited in hierarchical visual recognition (HVR) that aims at predicting consistent label paths from coarse to fine categories, especially for novel categories. To tackle these challenges, we propose Taxonomy-Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. |
Hulingxiao He; Zhi Tan; Yuxin Peng; | code |
| 146 | Thinking-while-Generating: Interleaving Textual Reasoning Throughout Visual Generation Highlight: They incorporate textual reasoning, i.e., *think*, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce **Thinking-while-Generating** (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. |
Ziyu Guo; Renrui Zhang; Hongyu Li; Manyuan Zhang; Xinyan Chen; Sifan Wang; Yan Feng; Peng Pei; Pheng-Ann Heng; | code |
| 147 | Scaling The Long Video Understanding of Multimodal Large Language Models Via Visual Memory Mechanism Highlight: Long video understanding is a key challenge that plagues the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of the visual memory mechanism and propose a novel, training-free approach termed Flexible Memory (FlexMem). |
Tao Chen; Kun Zhang; Qiong Wu; Xiao Chen; Chao Chang; Xiaoshuai Sun; Yiyi Zhou; Rongrong Ji; | code |
| 148 | ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation Highlight: Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. |
Mingyang Wu; Ashirbad Mishra; Soumik Dey; Shuo Xing; Naveen Ravipati; Hansi Wu; Binbin Li; Zhengzhong Tu; | code |
| 149 | RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation Highlight: While state space models can preserve common semantic information during state space compression, the fixed-size state space inevitably forgets specific information, which limits the models’ capability for pixel-level segmentation. To tackle this issue, we propose a Refining Specifics State Space Model approach (RS-SSM) for video semantic segmentation, which performs complementary refining of forgotten spatiotemporal specifics. |
Kai Zhu; Zhenyu Cui; Zehua Zang; Jiahuan Zhou; | code |
| 150 | Heuristic-inspired Reasoning Priors Facilitate Data-Efficient Referring Object Detection Highlight: Most referring object detection (ROD) models, especially modern grounding detectors, are designed for data-rich conditions, yet many practical deployments, such as robotics, augmented reality, and other specialized domains, face severe label scarcity. |
Xu Zhang; Zhe Chen; Jing Zhang; Dacheng Tao; | code |
| 151 | GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching Via Regulated Clipping Highlight: As a result, the policy model inevitably enters an **implicit over-optimization stage**: while the proxy reward continues to increase, essential metrics such as image quality and text–prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce **GRPO-Guard**, a simple yet effective enhancement to existing GRPO frameworks. |
Jing Wang; Jiajun Liang; Jie Liu; Henglin Liu; Gongye Liu; Jun Zheng; Wanyuan Pang; Ao Ma; Zhenyu Xie; Xintao Wang; Meng Wang; Pengfei Wan; Xiaodan Liang; | code |
| 152 | Yume1.5: A Text-Controlled Interactive World Generation Model Highlight: However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance and lack text-controlled generation capabilities. To address these challenges, we propose Yume1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. |
Xiaofeng Mao; Zhen Li; Chuanhao Li; Xiaojie Xu; Kaining Ying; Kaipeng Zhang; | code |
| 153 | SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens Highlight: We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. |
Xiaoyan Zhang; Zechen Bai; Haofan Wang; Yiren Song; | code |
| 154 | MAMMA: Markerless Accurate Multi-person Motion Acquisition Highlight: We present MAMMA, a markerless motion-capture pipeline that accurately recovers SMPL-X parameters from multi-view video. |
Hanz Cuevas Velasquez; Anastasios Yiannakidis; Soyong Shin; Giorgio Becherini; Markus Höschle; Joachim Tesch; Taylor Obersat; Tsvetelina Alexiadis; Eni Halilaj; Michael J. Black; | code |
| 155 | DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models Highlight: We propose DP-FedAdamW, the first AdamW-based optimizer for DPFL. |
Jin Liu; Ning Xi; Yinbin Miao; Junkang Liu; | code |
| 156 | Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight Highlight: Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. |
Yi Yang; Xueqi Li; Yiyang Chen; Jin Song; Yihan Wang; Zipeng Xiao; Jiadi Su; You Qiaoben; Pengfei Liu; Zhijie Deng; | code |
| 157 | Test-Time Instance-Specific Parameter Composition: A New Paradigm for Adaptive Generative Modeling Highlight: In contrast, humans flexibly adapt their internal generative representations to each perceptual or imaginative context. Inspired by this capability, we introduce Composer, a new paradigm for adaptive generative modeling based on test-time instance-specific parameter composition. |
Minh-Tuan Tran; Xuan-May Le; Quan Hung Tran; Mehrtash Harandi; Dinh Phung; Trung Le; | code |
| 158 | Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress Highlight: Furthermore, processing long video trajectories with VLMs is computationally prohibitive for real-world deployment. To address these challenges, we propose the Recurrent Reasoning Vision-Language Model (R$^2$VLM). |
Yuelin Zhang; Sijie Cheng; Chen Li; Zongzhao Li; Yuxin Huang; Yang Liu; Wenbing Huang; | code |
| 159 | RunawayEvil: Jailbreaking The Image-to-Video Generative Models Highlight: Existing attack methods remain confined to single-modal settings, relying solely on isolated text or image perturbations, which severely limits their effectiveness. To bridge this gap, we propose RunawayEvil, the first multimodal jailbreaking framework for I2V models with dynamic evolutionary capability. |
Yueming Lyu; Rufan Qian; Yueming Lyu; Qinglong Liu; Linzhuang Zou; Jie Qin; Songhua Liu; Caifeng Shan; | code |
| 160 | VLM-Pruner: Buffering for Spatial Sparsity in An Efficient VLM Centrifugal Token Pruning Paradigm Highlight: This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. |
Zhenkai Wu; Xiaowen Ma; Zhenliang Ni; Dengming Zhang; Han Shu; Xin Jiang; Xinghao Chen; | code |
| 161 | One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer Highlight: Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. |
Shijun Shi; Jing Xu; Zhihang Li; Chunli Peng; Xiaoda Yang; Lijing Lu; Kai Hu; Jiangning Zhang; | code |
| 162 | GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving Highlight: In this paper, we propose GuideFlow, a novel planning framework that leverages Constrained Flow Matching. |
Lin Liu; Caiyan Jia; Guanyi Yu; Ziying Song; Junqiao Li; Feiyang Jia; Peiliang Wu; Xiaoshuai Hao; Yadan Luo; | code |
| 163 | Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation Highlight: Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. |
Yuqing Huang; Guotian Zeng; Zhenqiao Yuan; Zhenyu He; Xin Li; Yaowei Wang; Ming-Hsuan Yang; | code |
| 164 | Expanding MmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets Highlight: We propose EMDUL, a novel approach to expand the volume and diversity of an existing mmWave dataset using unlabeled mmWave data and a LiDAR dataset. |
Zhuoxuan Peng; Boan Zhu; Xingjian Zhang; Wenying Li; S.-H. Gary Chan; | code |
| 165 | FedSDR: Federated Graph Learning with Structural Noise Detection and Reconstruction Highlight: 2) Locally, the global GNN performs poorly on these clients due to structural noise, limiting their ability to benefit from federated collaboration. To address these challenges, we propose **FedSDR**, a spectra-based FGL framework against high-structural-noise scenarios. |
Jiaqi Liu; Zihan Tan; Guancheng Wan; Wenke Huang; He Li; Mang Ye; | code |
| 166 | ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks Highlight: In this work, we explore a new active perception paradigm that enables models to revisit information-rich regions. |
Ruixun Liu; Bowen Fu; Jiayi Song; Kaiyu Li; Wanchen Li; Lanxuan Xue; Hui Qiao; Weizhan Zhang; Deyu Meng; Xiangyong Cao; | code |
| 167 | Label-Free Cross-Task LoRA Merging with Null-Space Compression Highlight: We introduce Null-Space Compression (NSC) Merging, a label-free, output-agnostic method that sets merge weights from adapter geometry. |
Wonyoung Lee; Wooseong Jeong; Kuk-Jin Yoon; | code |
| 168 | CodePercept: Code-Grounded Visual STEM Perception for MLLM Highlight: This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium: executable code provides precise semantics that naturally align with the structured nature of STEM visuals. |
Tongkun Guan; Zhibo Yang; Jianqiang Wan; Mingkun Yang; Zhentao Guo; Zijian Hu; Ruilin Luo; Ruizhe Chen; Sontao Jiang; Peng Wang; Wei Shen; Junyang Lin; Xiaokang Yang; | code |
| 169 | Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception Highlight: In this paper, we propose MVIG attack, a novel adaptive adversarial CP framework learning to capture vulnerability knowledge disclosed by different defensive CP systems from a unified mutual view information graph (MVIG) representation. |
Yihang Tao; Senkang Hu; Haonan An; Zhengru Fang; Hangcheng Cao; Yuguang Fang; | code |
| 170 | Physics-Guided Multistep Deformation Reversal for Ancient Bamboo Slip Restoration Highlight: We propose a novel framework for inverse restoration of deformed bamboo slips that provides a progressive physical deformation modeling with stepwise inverse displacement prediction. |
Qianqian Tang; Jinchi Zhu; Xiaolu Zhou; Yongchao Xu; | code |
| 171 | Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos Highlight: We present an approach for high-quality dynamic Gaussian Splatting from monocular videos. |
Xuankai Zhang; Junjin Xiao; Shangwei Huang; Wei-Shi Zheng; Qing Zhang; | code |
| 172 | VideoRealBench: A Chain-of-Thought Realism Evaluation Benchmark for Generated Human-Centric Videos Highlight: They also lack interpretability due to the absence of chain-of-thought reasoning. To address these issues, we propose **VideoRealBench**, a comprehensive benchmark for evaluating the realism of generated human-centric videos. |
Min Yang; Xinwen Zhang; Jialei Tang; Xin Zhou; Kehan Li; Zeyi Huang; Limin Wang; | code |
| 173 | Seeing Through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events Highlight: Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. |
Yunshan Qi; Lin Zhu; Nan Bao; Yifan Zhao; Jia Li; | code |
| 174 | TreeTeaming: Autonomous Red-Teaming of Vision-Language Models Via Hierarchical Strategy Exploration Highlight: However, existing red teaming methods are fundamentally constrained by an inherent linear exploration paradigm, confining them to optimizing within a predefined strategy set and preventing the discovery of novel, diverse exploits. To transcend this limitation, we introduce TreeTeaming, an automated red teaming framework that reframes strategy exploration from static testing to a dynamic, evolutionary discovery process. |
Chunxiao Li; Lijun Li; Jing Shao; | code |
| 175 | Breaking Smooth-Motion Assumptions: A UAV Benchmark for Multi-Object Tracking in Complex and Adverse Conditions Highlight: However, existing UAV-perspective MOT benchmarks often lack these complexities, featuring predominantly predictable camera dynamics and linear motion patterns. To address this gap, we introduce DynUAV, a new benchmark for dynamic UAV-perspective MOT, characterized by intense ego-motion and the resulting complex apparent trajectories. |
Jingtao Ye; Zhang Kexin; Xunchi Ma; Johann Li; Guangming Zhu; Peiyi Shen; Linhua Jiang; Xiangdong Zhang; Liang Zhang; | code |
| 176 | You Only Erase Once: Erasing Anything Without Bringing Unexpected Content Highlight: We present YOEO, an approach for object erasure. |
Yixing Zhu; Qing Zhang; Wenju Xu; Wei-Shi Zheng; | code |
| 177 | REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint Via Planar Gaussian Splatting Highlight: In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with temporal geometry constraint and planar Gaussian splatting. |
Di Wu; Liu Liu; Anran Huang; 玉研 刘; Qiaojun Yu; Liu Shaofan; Liangtu Song; Cewu Lu; | code |
| 178 | SAG-GNN: Semantic-Aware Guided GNN for Descriptor-Free 2D-3D Matching Highlight: Leveraging the benefits of semantics in providing context, resolving ambiguities, and enhancing robustness in challenging scenes, we propose the Semantic-Aware Guided Graph Neural Network (SAG-GNN), integrating high-level semantics into descriptor-free 2D-3D matching. |
Shihua Zhang; Tianhao Xu; Zizhuo Li; Qing Ma; Jiayi Ma; | code |
| 179 | CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models Highlight: We introduce CADFS, a data-centric framework that enables large vision-language models to generate complex CAD design histories. |
Vladislav Pyatov; Gleb Bobrovskikh; Saveliy Galochkin; Nikita Boldyrev; Oleg Voynov; Alexander Filippov; Gonzalo Ferrer; Peter Wonka; Evgeny Burnaev; | code |
| 180 | LAMP: Language-Assisted Motion Planning for Controllable Video Generation Highlight: Among these, motion control (i.e., specifying both object dynamics and camera trajectories) is particularly critical for directing complex, cinematic scenes, yet existing interfaces remain limited. To address this gap, we introduce LAMP that leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for both dynamic objects and (relatively defined) cameras. |
Muhammed Burak Kızıl; Enes Şanlı; Niloy J. Mitra; Erkut Erdem; Aykut Erdem; Duygu Ceylan; | code |
| 181 | OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data Highlight: Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technique tricks as “frills”. |
Bin Cao; Sipeng Zheng; Hao Luo; Boyuan Li; Jing Liu; Zongqing Lu; | code |
| 182 | Convexity-Aware Noise Calibration: A Self-Supervised Framework for Noise-Level-Unknown Image Denoising Highlight: Existing unsupervised techniques, such as blind-spot networks or methods based on statistical estimation, either compromise performance due to information loss or suffer from inaccuracies in noise level estimation. To address these challenges, we propose a novel two-stage self-supervised denoising framework that first accurately estimates the noise level directly from noisy images, without requiring clean references or prior noise knowledge. |
Zhan Wang; Wang Leiquan; Chunlei Wu; Yu Meng; | code |
| 183 | MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement Highlight: This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. |
Hao Zhang; Yanping Zha; Zizhuo Li; Meiqi Gong; Jiayi Ma; | code |
| 184 | ReCoFuse: Ultra-Robust Image Fusion Via Restorative Multi-Modal Diffusion Reciprocal Coupling Highlight: Existing methods following the integrated hard-regression or decoupling optimization paradigms exhibit limited fusion performance under complex degradations. To address these paradigm-level shortcomings, we propose ReCoFuse, an ultra-robust image fusion framework based on restorative multi-modal diffusion reciprocal coupling. |
Hao Zhang; Shuhan Yang; Linfeng Tang; Xunpeng Yi; Jiayi Ma; | code |
| 185 | TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis Highlight: Satellite Earth-observation (EO) time series in the optical and microwave ranges are often irregular due to orbital patterns and cloud obstruction, and while compositing addresses these issues, it loses critical phenological information. To overcome this, we present TESSERA, a pixel-wise foundation model for multi-modal (Sentinel-1/2) EO time series that learns robust, label-efficient embeddings. |
Zhengpeng Feng; Clement Atzberger; Sadiq Jaffer; Jovana Knezevic; Silja Sormunen; Robin Young; Madeline Lisaius; Markus Immitzer; Toby Jackson; James Ball; David Coomes; Anil Madhavapeddy; Andrew Blake; Srinivasan Keshav; | code |
| 186 | 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models Highlight: Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. |
Yiting Lu; Wei Luo; Peiyan Tu; Haoran Li; Hanxin Zhu; Zihao Yu; Xingrui Wang; Xinyi Chen; Xinge Peng; Xin Li; Zhibo Chen; | code |
| 187 | Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views Highlight: However, their limited representational capacity hinders performance in specific tasks that require 3D spatial imagination. To address this limitation, we propose 3DThinker, a framework that can effectively exploit the rich geometric information embedded within images while reasoning, like humans do. |
Zhangquan Chen; Manyuan Zhang; Xinlei Yu; Xufang Luo; Mingze Sun; Zihao Pan; Xiang An; Yan Feng; Peng Pei; Xunliang Cai; Ruqi Huang; | code |
| 188 | Random Wins All: Rethinking Grouping Strategies for Vision Tokens Highlight: To this end, various carefully designed grouping strategies have been proposed to enhance the performance of Vision Transformers. Here, we pose the following question: Are these carefully designed grouping methods truly necessary? |
Qihang Fan; Yuang Ai; Huaibo Huang; Ran He; | code |
| 189 | CaTok: Taming Mean Flows for One-Dimensional Causal Image Tokenization Highlight: Recent diffusion autoencoders similarly fall short: conditioning the decoder on all tokens lacks causality, while applying a nested dropout mechanism introduces imbalance. To address these challenges, we present CaTok, a 1D causal image tokenizer with a MeanFlow decoder. |
Yitong Chen; Zuxuan Wu; Xipeng Qiu; Yu-Gang Jiang; | code |
| 190 | Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration Highlight: In this paper, we propose a unified framework, termed Conformal Predictive Self-Calibration (CPSC), which leverages conformal prediction to equip the model with the ability to perform self-guided calibration on-the-fly. |
Xun Jiang; Yufan Gu; Disen Hu; Yuqing Hou; Yazhou Yao; Fumin Shen; Heng Tao Shen; Xing Xu; | code |
| 191 | 3D Space As A Scratchpad for Editable Text-to-Image Generation Highlight: We introduce the concept of a spatial scratchpad: a 3D reasoning substrate that bridges linguistic intent and image synthesis. |
Oindrila Saha; Vojtech Krs; Radomir Mech; Subhransu Maji; Matheus Gadelha; Kevin Blackburn-Matzen; | code |
| 192 | Test-Time 3D Occupancy Prediction Highlight: However, training dense occupancy decoders to capture fine-grained geometry and semantics can demand hundreds of GPU hours, and once trained, such models struggle to adapt to varying voxel resolutions or novel object categories without extensive retraining. To overcome these limitations, we propose a practical and flexible test-time occupancy prediction framework termed TT-Occ. |
Fengyi Zhang; Xiangyu Sun; Huitong Yang; Zheng Zhang; Zi Huang; Yadan Luo; | code |
| 193 | TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction Highlight: In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. |
Fengyi Zhang; Tianjun Zhang; Kasra Khosoussi; Zheng Zhang; Zi Huang; Yadan Luo; | code |
| 194 | CUE: Concept-Aware Multi-Label Expansion to Mitigate Concept Confusion in Long-Tailed Learning Highlight: However, most existing methods focus solely on mitigating long-tailed distribution bias while overlooking concept confusion caused by the long-tailed distribution. In this paper, we study this problem and attribute it to the mutual exclusivity of single-label supervision under long-tailed distributions, which suppresses feature sharing among related classes and amplifies the dominance of head classes, leading to disrupted inter-class discriminability. |
Ruichi Zhang; Chikai Shang; Jiacheng Yang; Mengke Li; Yang Zhou; Junlong Gao; Yang Lu; | code |
| 195 | It Takes Two: A Duet of Periodicity and Directionality for Burst Flicker Removal Highlight: In this work, we reveal that flicker artifacts exhibit two intrinsic characteristics, periodicity and directionality, and propose Flickerformer, a transformer-based architecture that effectively removes flicker without introducing ghosting. |
Lishen Qu; Shihao Zhou; Jie Liang; Hui Zeng; Lei Zhang; Jufeng Yang; | code |
| 196 | CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing Highlight: To investigate this, we propose a new task of reconstructing continuous video stimuli from multimodal brain signals recorded during audiovisual stimulation. |
Jianxiong Gao; Yichang Liu; Baofeng Yang; Jianfeng Feng; Yanwei Fu; | code |
| 197 | Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression Highlight: We propose RDVQ, a vector-quantization (VQ) based generative image compression method designed for extremely low bitrates. |
Shiyin Jiang; Wei Long; Minghao Han; Zhenghao Chen; Ce Zhu; Shuhang Gu; | code |
| 198 | RegFormer: Transferable Relational Grounding for Efficient Weakly-Supervised Human-Object Interaction Detection Highlight: To this end, we introduce Relational Grounding Transformer (RegFormer), a versatile interaction recognition module that enables efficient and accurate HOI reasoning. |
Jihwan Park; Chanhyeong Yang; Jinyoung Park; Taehoon Song; Hyunwoo J. Kim; | code |
| 199 | Single-step Diffusion-based Video Coding with Semantic-Temporal Guidance Highlight: Some NVCs incorporate perceptual or adversarial objectives but still suffer from artifacts due to limited generation capacity, whereas others leverage pretrained diffusion models to improve quality at the cost of heavy sampling complexity. To overcome these challenges, we propose S$^2$VC, a Single-Step diffusion-based Video Codec that integrates a conditional coding framework with an efficient single-step diffusion generator, enabling realistic reconstruction at low bitrates with reduced sampling cost. |
Naifu Xue; Zhaoyang Jia; Jiahao Li; Bin Li; Zihan Zheng; Yuan Zhang; Yan Lu; | code |
| 200 | See and Fix The Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts Via Agentic Data Synthesis Highlight: In this paper, we propose ArtiAgent, which efficiently creates pairs of real and artifact-injected images. |
Jaehyun Park; Minyoung Ahn; Minkyu Kim; Jonghyun Lee; Jae-Gil Lee; Dongmin Park; | code |
| 201 | Adaptive Action Chunking at Inference-time for Vision-Language-Action Models Highlight: Unfortunately, a dominant trend in current VLA models is an empirical fixed chunk length at inference-time, hindering their superiority and scalability across diverse manipulation tasks. To address this issue, we propose a novel Adaptive Action Chunking (AAC) strategy, which exploits action entropy as the cue to adaptively determine the chunk size based on current predictions. |
Yuanchang Liang; Xiaobo Wang; Kai Wang; Shuo Wang; Xiaojiang Peng; Haoyu Chen; David Chua; Prahlad Vadakkepat; | code |
| 202 | Complementary Prototype Mapping for Efficient Multimodal Anomaly Detection Highlight: Moreover, existing methods suffer from slow inference speed and high memory overhead, hindering their deployment in real-world production lines. To address these issues, we propose an efficient and effective Complementary Prototype Mapping (CPMAD) framework, which dynamically extracts consensus and supplementary prototypes to serve as complementary priors, thereby guiding and disambiguating cross-modal mappings. |
Yuan Zhao; Xiaoqin Zhang; Huchuan Lu; Lihe Zhang; | code |
| 203 | Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation Highlight: Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. |
Haonan Cai; Yuxuan Luo; Zhouhui Lian; | code |
| 204 | MPL: Match-guided Prototype Learning for Few-shot Action Recognition Highlight: However, these approaches typically face two critical limitations: i) prototypes learned through implicit sample interactions lack clear semantic correspondence between query-support pairs, limiting their class representativeness; ii) the independent design of prototype learning and matching mechanisms creates a potential incompatibility between prototype representations and matching strategies. To address these limitations, we propose a Match-guided Prototype Learning (MPL) method comprising two key components: enhanced match (E-Match) and key-frame extraction match (K-Match). |
Feng Yang; Jie Zhao; Fulin Luo; Anyong Qin; Tiecheng Song; Yue Zhao; Chenqiang Gao; Junwei Han; | code |
| 205 | EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories Highlight: However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 12 metrics. |
Lu Wei; Yuta Nakashima; Noa Garcia; | code |
| 206 | Physical Simulator In-the-Loop Video Generation Highlight: To further improve texture consistency during object movement, we propose a Test-Time Texture Consistency Optimization (TTCO) technique that adapts text and feature embeddings based on pixel correspondences from the simulator. |
Lin Geng Foo; Mark He Huang; Alexandros Lattas; Stylianos Moschoglou; Thabo Beeler; Christian Theobalt; | code |
| 207 | GeoMotion: Rethinking Motion Segmentation Via Latent 4D Geometry Highlight: In contrast, we propose a fully learning-based approach that directly infers moving objects from latent feature representations via attention mechanisms, thus enabling end-to-end feed-forward motion segmentation. |
Xiankang He; Peile Lin; Ying Cui; Dongyan Guo; Chunhua Shen; Xiaoqin Zhang; | code |
| 208 | RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos Highlight: We develop RHINO (Reconstructing Human Interactions with Novel Objects), a novel three-step framework that recovers in 3D a human, novel (unseen) manipulated object, and static scene in a common world frame from a monocular RGB video. |
Lixin Xue; Chengwei Zheng; Georgios Paschalidis; Chen Guo; Manuel Kaufmann; Juan Jose Zarate; Dimitrios Tzionas; | code |
| 209 | RAGTrack: Language-aware RGBT Tracking with Retrieval-Augmented Generation Highlight: To this end, we introduce a Multi-modal Transformer Encoder (MTE) for unified visual-language modeling. |
Hao Li; Yuhao Wang; Wenning Hao; Pingping Zhang; Dong Wang; Huchuan Lu; | code |
| 210 | ZeroIDIR: Zero-Reference Illumination Degradation Image Restoration with Perturbed Consistency Diffusion Models Highlight: In this paper, we propose a zero-reference diffusion-based framework, named ZeroIDIR, for illumination degradation image restoration, which decouples the restoration process into adaptive illumination correction and diffusion-based reconstruction while being trained solely on low-quality degraded images. |
Hai Jiang; Zhen Liu; Yinjie Lei; Songchen Han; Bing Zeng; Shuaicheng Liu; | code |
| 211 | UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes Highlight: We present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. |
Yixun Liang; Kunming Luo; Xiao Chen; Rui Chen; Jiawei Zhou; Weiyu Li; Jiarui Liu; Fei-Peng Tian; Ping Tan; | code |
| 212 | MotionScale: Reconstructing Appearance, Geometry, and Motion of Dynamic Scenes with Scalable 4D Gaussian Splatting Highlight: Despite recent progress in monocular view synthesis, existing methods still struggle to recover accurate 3D geometry and temporally consistent motion in complex environments. To address these challenges, we propose MotionScale, a 4D Gaussian Splatting framework that scales efficiently to large scenes and extended sequences, enabling faithful reconstruction of high-fidelity scene structures and coherent motion representation under complex dynamics. |
Haoran Zhou; Gim Hee Lee; | code |
| 213 | Revisiting F-measure Optimization in Multi-Label Classification: A Sampling-based Approach Highlight: To avoid multinomial label transformation, we propose an indirect sampling-then-estimation approach to estimate the required probabilities. |
Zixun Wang; | code |
| 214 | Faster-GS: Analyzing and Improving Gaussian Splatting Optimization Highlight: However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison. In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. |
Florian Hahlbohm; Linus Franke; Martin Eisemann; Marcus Magnor; | code |
| 215 | RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting Highlight: However, existing methods struggle with semi-transparent specular surfaces that exhibit both complex reflections and clear transmission, often producing blurry reflections or overly occluded transmission. To address this, we present RT-Splatting, a framework that disentangles each Gaussian’s geometric occupancy from its optical opacity. |
Ji Shi; Xianghua Ying; Bowei Xing; Ruohao Guo; Wenzhen Yue; | code |
| 216 | All in One: Unifying Deepfake Detection, Tampering Localization, and Source Tracing with A Robust Landmark-Identity Watermark Highlight: However, existing proactive forensics methods typically treat deepfake detection, tampering localization, and source tracing as independent tasks, lacking a unified framework to address them jointly. To bridge this gap, we propose a unified proactive forensics framework that jointly addresses these three core tasks. |
Junjiang Wu; Liejun Wang; Zhiqing Guo; | code |
| 217 | SplitFlux: Learning to Decouple Content and Style from A Single Image Highlight: To address these challenges, we conduct a systematic analysis of Flux and make two key observations: (1) single stream blocks are essential for image generation; and (2) early single stream blocks mainly control content, whereas later blocks govern style. Based on these insights, we propose SplitFlux, which disentangles content and style by fine-tuning the single stream blocks via LoRA, enabling the disentangled content to be re-embedded into new contexts. |
Yitong Yang; Yinglin Wang; Changshuo Wang; Yongjun Zhang; Ziyang Chen; Shuting He; | code |
| 218 | Accelerating Diffusion Model Training Under Minimal Budgets: A Condensation-Based Perspective Highlight: To operationalize this perspective, we introduce Diffusion Dataset Condensation ($D^2C$), a two-phase framework comprising Select and Attach. |
Rui Huang; Shitong Shao; Zikai Zhou; Pukun Zhao; Hangyu Guo; Tian Ye; Lichen Bai; Shuo Yang; Zeke Xie; | code |
| 219 | Routing on Demand: DSNet for Efficient Progressive Point Cloud Denoising Highlight: Most existing progressive denoising methods rely on fixed iterative pipelines that process all regions uniformly, resulting in redundant computation and over-smoothing of geometric details when handling point clouds with non-uniform noise distributions. To overcome these limitations, we introduce Dynamic Skip Net (DSNet), a novel progressive denoising framework that adaptively determines the optimal denoising path for each local patch based on its noise characteristics. |
Xiaoqian Cheng; Dong Xiao; Husen Li; Zheng Liu; Renjie Chen; | code |
| 220 | Advancing Image Classification with Discrete Diffusion Classification Modeling Highlight: Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. |
Omer Belhasin; Shelly Golan; Ran El-Yaniv; Michael Elad; | code |
| 221 | Synergistic Bleeding Region and Point Detection in Laparoscopic Surgical Videos Highlight: Accordingly, we develop a dual-task synergistic online detector called BlooDet, enabling simultaneous detection of bleeding regions and points in laparoscopic surgery. |
Jialun Pei; Zhangjun Zhou; Diandian Guo; Zhixi Li; Jing Qin; Bo Du; Pheng-Ann Heng; | code |
| 222 | Benchmarking Endoscopic Surgical Image Restoration and Beyond Highlight: To systematically investigate and address various forms of surgical scene degradation, we introduce a real-world open-source surgical image restoration dataset covering laparoscopic environments, called SurgClean, which involves multi-type image restoration tasks from two medical sites, i.e., desmoking, defogging, and desplashing. |
Jialun Pei; Diandian Guo; Donghui Yang; Zhixi Li; Yuxin Feng; Long Ma; Bo Du; Pheng-Ann Heng; | code |
| 223 | Mario: Multimodal Graph Reasoning with Large Language Models Highlight: Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two above challenges and enables effective LLM-based reasoning over MMGs. |
Yuanfu Sun; Kang Li; Pengkang Guo; Jiajin Liu; Qiaoyu Tan; | code |
| 224 | InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity Highlight: Existing benchmarks offer limited customizability over the scene complexity and are incapable of isolating and analyzing specific VLM failure modes under distinct spatial conditions. To address this gap, instead of individually presenting benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control on scene complexity. |
Haoming Wang; Qiyao Xue; Wei Gao; | code |
| 225 | From 2D Alignment to 3D Plausibility: Unifying Heterogeneous 2D Priors and Penetration-Free Diffusion for Occlusion-Robust Two-Hand Reconstruction Highlight: For 3D spatial alignment, we propose a two-hand diffusion model that learns a generative mapping from interpenetrated poses to realistic, collision-free configurations. |
Gaoge Han; Yongkang Cheng; Zhe Chen; Shaoli Huang; Tongliang Liu; | code |
| 226 | Multi-Paradigm Collaborative Adversarial Attack Against Multimodal Large Language Models Highlight: This straightforward setting naturally restricts the richness of feature representations, limiting the search space and thus impeding the diversity of adversarial perturbations. To address this, we propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. |
Yuanbo Li; Tianyang Xu; Cong Hu; Tao Zhou; Xiaojun Wu; Josef Kittler; | code |
| 227 | Towards Highly Transferable Vision-Language Attack Via Semantic-Augmented Dynamic Contrastive Interaction Highlight: However, existing attacks on language-vision models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. |
Yuanbo Li; Tianyang Xu; Cong Hu; Tao Zhou; Xiaojun Wu; Josef Kittler; | code |
| 228 | Instance-level Visual Active Tracking with Occlusion-Aware Planning Highlight: To address these, we propose OA-VAT, a unified pipeline with three complementary modules. |
Haowei Sun; Kai Zhou; Hao Gao; Shiteng Zhang; Jinwu Hu; Xutao Wen; Qixiang Ye; Mingkui Tan; | code |
| 229 | From Indoor to Open World: Revealing The Spatial Reasoning Gap in MLLMs Highlight: Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. |
Mingrui Wu; Zhaozhi Wang; Fangjinhua Wang; Jiaolong Yang; Marc Pollefeys; Tong Zhang; | code |
| 230 | URScenes: A Multi-scenario Dataset for Unstructured Road Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing public datasets predominantly focus on clear-weather and urban-road scenarios, leaving a significant gap in the coverage of unstructured road environments. To bridge this gap, we construct URScenes, the first multi-scenario, open-source perception dataset for unstructured road environments. |
Runsen Liu; Aizemaitijiang Baoerhan; Zhangyu Wang; Jie Wang; Jinghao Cui; Guizhen Yu; Songyue Yang; WanCheng Sun; Mingjun Tang; Zhanbo Hua; Wenwen Luo; | code |
| 231 | POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Thus, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. |
Haicheng Wang; Yuan Liu; Yikun Liu; Zhemeng Yu; Zhongyin Zhao; Yangxiu You; Zilin Yu; Le Tian; Zhou Xiao; Jie Zhou; Weidi Xie; Yanfeng Wang; | code |
| 232 | BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing continual AQA methods overlook this issue and assume that all modalities remain complete and stable throughout training, which restricts their practicality. To address this challenge, we introduce **Bri**dged **M**odality **A**daptation (BriMA), an innovative approach to multi-modal continual AQA under modality-missing conditions. |
Kanglei Zhou; Chang Li; Qingyi Pan; Liyuan Wang; | code |
| 233 | RaUF: Learning The Spatial Uncertainty Field of Radar Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose RaUF, a spatial uncertainty field learning framework that models radar measurements through their physically grounded anisotropic properties. |
Shengpeng Wang; Kuangyu Wang; Wei Wang; | code |
| 234 | VQ-VA World: Towards High-Quality Visual Question-Visual Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper studies *Visual Question–Visual Answering (VQ-VA)*: generating an image, rather than text, in response to a visual question, an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. |
Chenhui Gou; Zilong Chen; Zeyu Wang; Feng Li; Deyao Zhu; Zicheng Duan; Kunchang Li; Chaorui Deng; Hongyi Yuan; Haoqi Fan; Cihang Xie; Jianfei Cai; Hamid Rezatofighi; | code |
| 235 | HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We hypothesize that the accuracy degradation stems from the heterogeneity in head-wise sparsification sensitivity, as the existing methods apply a uniform sparsity pattern across all heads. Motivated by this hypothesis, we present a two-stage sparsification pipeline that effectively quantifies and exploits head-wise sparsification sensitivity. |
Yongsung Kim; Wooseok Song; Jaihyun Lew; Hun Hwangbo; Jaehoon Lee; Sungroh Yoon; | code |
| 236 | Cut to The Chase: Training-free Multimodal Summarization Via Chain-of-Events Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce **CoE**, a training-free MMS framework that performs structured reasoning through a **Chain-of-Events** guided by a Hierarchical Event Graph (HEG). |
Xiaoxing You; Qiang Huang; Lingyu Li; Xiaojun Chang; Jun Yu; | code |
| 237 | Action–Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: At the same time, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner. We leverage this opportunity and propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. |
Chongyang Xu; Li Haipeng; Shen Cheng; Haoqiang Fan; Ziliang Feng; Shuaicheng Liu; | code |
| 238 | Progressive Multi-cue Alignment for Unaligned RGBT Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: They usually require complex models to handle challenging scenarios, resulting in a large computational burden. To overcome these limitations, we propose a novel Progressive Multi-cue Alignment framework called PMATrack, which disentangles the calculation of cross-modal alignment parameters in a progressive manner and dynamically selects appropriate cues to handle different challenges, thereby enabling robust and efficient unaligned RGBT tracking. |
Jiandong Jin; Chenglong Li; Hao Feng; Andong Lu; Lili Huang; Jin Tang; | code |
| 239 | Skyreels-Text: Fine-grained Font-Controllable Text Editing for Poster Design Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although modern image editing models have grown increasingly powerful, they still fall short in fine-grained, font-aware text manipulation, limiting their utility in professional design workflows such as poster editing. To address this issue, we present Skyreels-Text, a novel font-controllable framework for precise poster text editing. |
Yunjie Yu; Jingchen Wu; Junchen Zhu; Chunze Lin; Guibin Chen; | code |
| 240 | Self-guided Semantic Inspection for Zero-Shot Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paradigm enforces cross-modal agreement but overlooks the semantic discrepancies between modalities that naturally arise during inference. To address this issue, we propose DiffComp (Differentiate-then-Compose), a difference-driven self-supervised framework that actively induces and exploits cross-modal discrepancies during training. |
Jingjing Zhang; Lei Zhang; Zheren Fu; Bo Hu; Zhendong Mao; | code |
| 241 | GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce GUIDE (GUI Understanding, Intent, and Help Decision Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. |
Saelyne Yang; Jaesang Yu; Yi-Hao Peng; Kevin Qinghong Lin; Jae Won Cho; Yale Song; Juho Kim; | code |
| 242 | OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: These challenges stem from the limited scale and semantic diversity of existing datasets, which lead to a performance gap between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). |
Minghang Zheng; Zihao Yin; Yi Yang; Yuxin Peng; Yang Liu; | code |
| 243 | Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present **Upsample Anything**, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. |
Minseok Seo; Mark Hamilton; Changick Kim; | code |
| 244 | BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the diversity of clothing increases intra-class variance and poses one of the biggest challenges to learning cloth-invariant features under varying clothing conditions. Therefore, we propose GaitCLIF (Gait-oriented CLoth-Invariant Feature) as a robust baseline model for cross-clothing gait recognition. |
Qingyuan Cai; Saihui Hou; Xuecai Hu; Yongzhen Huang; | code |
| 245 | Extending One-Step Image Generation from Class Labels to Text Via Discriminative Text Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This also explains why discrete and easily distinguishable class features perform well within the MeanFlow framework. Based on this insight, we propose a novel auxiliary loss design to learn the discriminative text representation space, achieving an effective adaptation of MeanFlow to text-to-image generation for the first time. |
Chenxi Zhao; Chen Zhu; Xiaokun Feng; Aiming Hao; Jiashu Zhu; Jiachen Lei; Jiahong Wu; Xiangxiang Chu; Jufeng Yang; | code |
| 246 | SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning Into Efficient Event-Based Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper reexamines the key SMVRL design stages for EAR and proposes: (i) a principled spatiotemporal multi-view representation through translation-invariant dense conversion of sparse events, (ii) a dual-branch, dynamic fusion architecture that models sample-wise complementarity between motion features from different views, and (iii) a bio-inspired temporal warping augmentation that mimics speed variability of real-world human actions. |
Rui Fan; Weidong Hao; Juntao Guan; Lai Rui; Tong Wu; Fanhong Zeng; Lin Gu; | code |
| 247 | Refining Few-Step Text-to-Multiview Diffusion Via Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Reinforcement learning (RL) finetuning offers a potential solution, yet existing approaches designed for single-image diffusion do not readily extend to the few-step T2MV setting, as they neglect cross-view coordination and suffer from weak learning signals in few-step regimes. To address this, we propose MVC-ZigAL, a tailored RL finetuning framework for few-step T2MV diffusion models. |
Ziyi Zhang; Li Shen; Deheng Ye; Yong Luo; Huangxuan Zhao; Meng Liu; Wei Yu; Lefei Zhang; | code |
| 248 | IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. |
Simone Magistri; Dipam Goswami; Marco Mistretta; Bartłomiej Twardowski; Joost van de Weijer; Andrew Bagdanov; | code |
| 249 | Few-shot Acoustic Synthesis with Multimodal Flow Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce FLow-matching ACoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. |
Amandine Brunetto; | code |
| 250 | GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Group Direct Preference Optimization (GDPO), a novel approach to integrate RL into one-step generative ISR model training. |
Qiaosi Yi; Shuai Li; Rongyuan Wu; Lingchen Sun; Zhengqiang Zhang; Lei Zhang; | code |
| 251 | Making Training-Free Diffusion Segmentors Scale with The Generative Power Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: (ii) Even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques: auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage model capability. |
Benyuan Meng; Qianqian Xu; Zitai Wang; Xiaochun Cao; Longtao Huang; Qingming Huang; | code |
| 252 | Omni-Attack: Adversarial Attacks on Open-Ended VQA in Black-Box Multimodal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce AdvRobustBench, a comprehensive adversarial robustness benchmark for MLLMs comprising 1,000 examples across visual question answering (VQA) and optical character recognition (OCR) tasks, drawn from widely-used MLLM benchmarks (MMBench, MMStar, OCRBench-v2). |
Kai Hu; Weichen Yu; Li Zhang; Alexander Robey; Andy Zou; Haoqi Hu; Chengming Xu; Matt Fredrikson; | code |
| 253 | STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the effectiveness of these methods is often constrained by static and imprecise regional boundaries, resulting in poor generalization ability. To address this issue, we propose Spatial-Temporal Weather Forecasting (STCast), a novel AI-driven framework for adaptive regional boundary optimization and dynamic monthly forecast allocation. |
Hao Chen; Tao Han; Jie Zhang; Song Guo; Lei Bai; | code |
| 254 | Unified Primitive Proxies for Structured Shape Completion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Instead of following the prevailing cascade, we rethink how primitives and points should interact, and find it more effective to decode primitives in a dedicated pathway that attends to shared shape features. Following this principle, we present UniCo, which in a single feed-forward pass predicts a set of primitives with complete geometry, semantics, and inlier membership. |
Zhaiyu Chen; Yuqing Wang; Xiao Xiang Zhu; | code |
| 255 | SpaceDrive: Infusing Spatial Awareness Into VLM-based Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. |
Peizheng Li; Zhenghao Zhang; David Holtz; Hang Yu; Yutong Yang; Yuzhi Lai; Rui Song; Andreas Geiger; Andreas Zell; | code |
| 256 | Representation-Steered Incremental Adapter-Tuning for Class-Incremental Learning with Pre-Trained Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, they lack explicit mechanisms to structure a coherent and discriminative representation space across tasks. To address these limitations, we propose **Representation-Steered Incremental Adapter Tuning** (**RSIAT**). |
Jiarui Zhao; Libo Huang; Xiangqi Li; Zhulin An; Chuanguang Yang; Yu Wang; Boyu Diao; Yongjun Xu; | code |
| 257 | Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We address this gap and introduce Conversational Image Segmentation (CIS) and ConvSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. |
Aadarsh Sahoo; Georgia Gkioxari; | code |
| 258 | CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, the reliance on text or image-based inputs often requires significant manual annotation, limiting their scalability and applicability in industrial settings. To overcome these challenges, we propose the Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD) paradigm, a novel training paradigm for CAD code generation. |
Ke Niu; Haiyang Yu; Zhuofan Chen; Zhengtao Yao; Weitao Jia; Xiaodong Ge; Jingqun Tang; Benlei Cui; Bin Li; Xiangyang Xue; | code |
| 259 | Aligning Text, Images and 3D Structure Token-by-Token Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes, and provide a detailed cookbook outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. |
Aadarsh Sahoo; Vansh Tibrewal; Georgia Gkioxari; | code |
| 260 | PS-SR: Pseudo-Single-Step Video Super-Resolution Via Speculative Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Video Super-Resolution (VSR) fundamentally struggles with a critical trade-off: single-step models offer unmatched efficiency but often lack the high-frequency detail, creativity, and visual quality of their multi-step diffusion counterparts, which are computationally prohibitive for practical use. In this paper, we propose PS-SR, a novel pseudo single-step VSR framework that transcends this trade-off through a computationally asymmetric sampling pipeline. |
Aiqiu Wu; Zhaofan Qiu; Ting Yao; Tao Mei; | code |
| 261 | Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration Via Heterogeneous Graphs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In addition, modeling the consistency among vertices and edges within the graph enables pruning of unreliable correspondences. Building on these insights, we propose a heterogeneous graph embedded I2P registration method, termed Hg-I2P. |
Pei An; Junfeng Ding; Jiaqi Yang; Yulong Wang; Jie Ma; Liangliang Nan; | code |
| 262 | Phased DMD: Few-step Distribution Matching Distillation Via Score Matching Within Subintervals Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generative diversity in text-to-image generation and slows motion dynamics in video generation, reducing performance to the level of one-step models. To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges the idea of phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity. |
Xiangyu Fan; Zesong Qiu; Zhuguanyu Wu; Fanzhou Wang; Zhiqian Lin; Tianxiang Ren; Dahua Lin; Ruihao Gong; Lei Yang; | code |
| 263 | GenBreak: Red Teaming Text-to-Image Generation Using Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Consequently, there remains a lack of reliable tools for evaluating the safety of defended T2I models. To address this gap, we propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities in T2I generators. |
Zilong Wang; Xiang Zheng; Xiaosen Wang; Bo Wang; Xingjun Ma; | code |
| 264 | A³: Towards Advertising Aesthetic Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Advertising images significantly impact commercial conversion rates and brand equity, yet current evaluation methods rely on subjective judgments, lacking scalability, standardized criteria, and interpretability. To address these challenges, we present **A³ (Advertising Aesthetic Assessment)**, a comprehensive framework encompassing four components: a paradigm (**A³-Law**), a dataset (**A³-Dataset**), a multimodal large language model (**A³-Align**), and a benchmark (**A³-Bench**). |
Kaiyuan Ji; Yixuan Gao; Lu Sun; Yushuo Zheng; Zijian Chen; Jianbo Zhang; Xiangyang Zhu; Yuan Tian; Zicheng Zhang; Guangtao Zhai; | code |
| 265 | Chain of World: World Model Thinking in Latent Motion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Latent-action VLAs encode frame-to-frame transitions compactly, but lack temporally continuous dynamic modeling and world knowledge. To overcome these limitations, we introduce CoWVLA (Chain-of-World VLA), a new Chain of World paradigm that unifies world-model temporal reasoning with a disentangled latent motion representation. |
Fuxiang Yang; Donglin Di; Lulu Tang; Xuancheng Zhang; Lei Fan; Hao Li; Chen Wei; Tonghua Su; Baorui Ma; | code |
| 266 | Multi-Scale Gradient-Guided Unrolling Architecture with Adaptive Mamba for Compressive Sensing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Additionally, the feature extraction module struggles to balance the global receptive field and computational efficiency, which limits improvements in image reconstruction details. To address these challenges, we propose a multi-scale gradient-guided unrolling architecture with adaptive Mamba for CS, named MambaCS. |
Le Yang; Hongping Gan; | code |
| 267 | Few-for-Many Personalized Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We prove that this framework achieves near-optimal personalization: the approximation error diminishes as $K$ increases and converges to each client’s optimum as data grows. Building on this reformulation, we propose FedFew, a practical algorithm that jointly optimizes the $K$ server models through efficient gradient-based updates. |
Ping Guo; Zhang Tiantian; Xi Lin; Xiang Li; Zhi-Ri Tang; Qingfu Zhang; | code |
| 268 | Compositional Text-to-Image Generation Via Region-aware Bimodal Direct Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite the rapid progress of text-to-image (T2I) models, generating images that accurately reflect complex compositional prompts (covering attribute bindings, object relationships, counting) remains challenging. To address this, we propose BIDPO, a framework to enhance a T2I model’s capability for compositional text-to-image generation. |
Zhuohan Liu; Wujian Peng; Yitong Chen; Zuxuan Wu; | code |
| 269 | Physically Inspired Gaussian Splatting for HDR Novel View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we introduce PhysHDR-GS, a physically inspired HDR-NVS framework that models scene appearance via intrinsic reflectance and adjustable ambient illumination. |
Huimin Zeng; Yue Bai; Hailing Wang; Yun Fu; | code |
| 270 | Hierarchical Codec Diffusion for Video-to-Speech Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, leveraging the distinctive hierarchical structure of Residual Vector Quantization (RVQ)-based codecs, we propose **HiCoDiT**, a novel **Hi**erarchical **Co**dec **Di**ffusion **T**ransformer that exploits the inherent hierarchy of discrete speech tokens to achieve efficient alignment. |
Jiaxin Ye; Gaoxiang Cong; Chenhui Wang; Xin-Cheng Wen; Zhaoyang Li; Boyuan Cao; Hongming Shan; | code |
| 271 | Learning to Drive Via Real-World Simulation at Scale Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, such cases are underrepresented in real-world corpora collected by human experts. To compensate for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive numbers of these crucial unseen states from existing driving logs. |
Haochen Tian; Tianyu Li; Haochen Liu; Jiazhi Yang; Yihang Qiu; Guang Li; Junli Wang; Yinfeng Gao; Zhang Zhang; Liang Wang; Hangjun Ye; Long Chen; Hongyang Li; | code |
| 272 | SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Furthermore, RM is susceptible to reward hacking in post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. |
Jiesong Lian; Ruizhe Zhong; Zixiang Zhou; Xiaoyue Mi; Long Hu; Yuan Zhou; Qinglin Lu; Yixue Hao; Junchi Yan; | code |
| 273 | OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). |
Weiguo Pian; Saksham Singh Kushwaha; Zhimin Chen; Shijian Deng; Kai Wang; Yunhui Guo; Yapeng Tian; | code |
| 274 | SelfHVD: Self-Supervised Handheld Video Deblurring Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the issue, we propose a self-supervised method for handheld video deblurring, which is driven by sharp clues in the video. |
Honglei Xu; Zhilu Zhang; Junjie Fan; Xiaohe Wu; Wangmeng Zuo; | code |
| 275 | Image-Guided Geometric Stylization of 3D Meshes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a geometric stylization framework that deforms a 3D mesh, allowing it to express the style of an image. |
Changwoon Choi; Hyunsoo Lee; Clément Jambon; Yael Vinker; Young Min Kim; | code |
| 276 | FM-Steer: Enhance Generalist Policies with Value-Guided Cascaded Denoising Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose **FM-Steer**: a test-time computing framework that augments flow-based Vision-Language-Action (VLA) generalist policies with value-guided sampling and cascaded action denoising, enabling higher control performance and real-time action rates for dexterous robot manipulation. |
Haoming Song; Delin Qu; Yuanqi Yao; Qizhi Chen; Jiarui Li; Qi Lv; Yiwen Tang; Li Kang; Heng Zhou; Xianqiang Gao; Yuhang Tang; Xiaofan Li; Modi Shi; Guanghui Ren; Maoqing Yao; Bin Zhao; Dong Wang; Xuelong Li; | code |
| 277 | RefTON: Person-to-Person Virtual Try-On with Unpaired Visual References Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce RefTON, a flux-based person-to-person virtual try-on framework that enhances garment realism through unpaired visual references. |
Liuzhuozheng Li; Yue Gong; Shanyuan Liu; Zanyi Wang; Dengyang Jiang; Liebucha Wu; Bo Cheng; Yuhang Ma; Dawei Leng; Yuhui Yin; | code |
| 278 | SimLBR: Learning to Detect Fake Images By Learning to Detect Real Images Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we argue that it is more principled to learn a tight decision boundary around the real image distribution and treat the fake category as a sink class. |
Aayush Dhakal; Subash Khanal; Srikumar Sastry; Jacob Arndt; Philipe Ambrozio Dias; Dalton Lunga; Nathan Jacobs; | code |
| 279 | Asking Like Socrates: Socrates Helps VLMs Understand Remote Sensing Images Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. |
Run Shao; Ziyu Li; Zhaoyang Zhang; Linrui Xu; Xinran He; Hongyuan Yuan; Bolei He; Yongxing Dai; Yan Yiming; Chen Yijun; Wang Guo; Haifeng Li; | code |
| 280 | Omni-MMSI: Toward Identity-attributed Social Interaction Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce **Omni-MMSI**, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech. |
Xinpeng Li; Bolin Lai; Hardy Chen; Shijian Deng; Cihang Xie; Yuyin Zhou; James Rehg; Yapeng Tian; | code |
| 281 | Rethinking SNN Online Training and Deployment: Gradient-Coherent Learning Via Hybrid-Driven LIF Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the aforementioned challenges, we propose the Hybrid-Driven Leaky Integrate-and-Fire (HD-LIF) model family for efficient online learning, which adopts different spiking calculation mechanisms in the regions above and below the firing threshold. |
Zecheng Hao; Yifan Huang; Zijie Xu; Wenxuan Liu; Yuanhong Tang; Zhaofei Yu; Tiejun Huang; | code |
| 282 | The Devil Is in Gradient Entanglement: Energy-Aware Gradient Coordinator for Robust Generalized Category Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Through quantitative analysis, we identify a key issue, *i.e.*, **gradient entanglement**, which 1) distorts supervised gradients and weakens discrimination among known classes, and 2) induces representation-subspace overlap between known and novel classes, reducing the separability of novel categories. To address this issue, we propose the Energy-Aware Gradient Coordinator (EAGC), a plug-and-play gradient-level module that explicitly regulates the optimization process. |
Haiyang Zheng; Nan Pu; Yaqi Cai; Teng Long; Wenjing Li; Nicu Sebe; Zhun Zhong; | code |
| 283 | Learning Transferable Temporal Primitives for Video Reasoning Via Synthetic Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce **SynRL**, a post-training framework that teaches models *temporal primitives*, the fundamental building blocks of temporal understanding including direction, speed, and state tracking. |
Sontao Jiang; Sibo Song; Chenyi Zhou; Yuan Wang; Ruizhe Chen; Tongkun Guan; Ruilin Luo; Yan Zhang; Zhihang Tang; Yuchong Sun; Hang Zhang; Zhibo Yang; Shuai Bai; Junyang Lin; Zuozhu Liu; | code |
| 284 | PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data introduces fundamentally different challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these UAV-specific conditions, we formally define the UAV Reasoning Segmentation task and organize its semantic demands into three dimensions: Spatial, Attribute, and Scene-level reasoning. |
Shuyan Ke; Yifan Mei; Changli Wu; Yonghan Zheng; Jiayi Ji; Liujuan Cao; Rongrong Ji; | code |
| 285 | NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity Highlight: This separation is inefficient and fails to model the consistency between encoding and decoding processes. To address this limitation, we propose NeuroFlow, the first unified framework that jointly models visual encoding and decoding from neural activity within a single flow model. |
Weijian Mai; Mu Nan; Yu Zhu; Jiahang Cao; Rui Zhang; Yuqin Dai; Chunfeng Song; Andrew Luo; Jiamin Wu; | code |
| 286 | Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation Highlight: To advance the task towards more practical scenarios, we introduce **Long-RVOS**, a large-scale benchmark for long-term referring video object segmentation. |
Tianming Liang; Haichao Jiang; Yuting Yang; Chaolei Tan; Shuai Li; Wei-Shi Zheng; Jian-Fang Hu; | code |
| 287 | Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models Highlight: In this paper, we first show that existing forecasting-based caching methods can be unified in a common linear form, and then analyze DiT feature trajectories, finding that for most denoising steps the current feature can be reconstructed from past features with projection fidelity above 0.95, indicating that accurate linear prediction is feasible. Motivated by this, we propose $L^2P$ (Learnable Linear Predictor), a simple data-driven caching framework that replaces hand-designed coefficients with learnable per-timestep weights trained on a small set of cached trajectories using a mean-squared error loss, converging in about 20 seconds on a single GPU. |
Zhirong Shen; Rui Huang; Jiacheng Liu; Chang Zou; Peiliang Cai; Shikang Zheng; Zhengyi Shi; Liang Feng; Linfeng Zhang; | code |
| 288 | Bridging The 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation Highlight: Despite recent progress with vision-language models (VLMs), a critical semantic–geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic–Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. |
Kailing Li; Tianwen Qian; Lijin Yang; Yuqian Fu; Jingyu Gong; Xiaoling Wang; Liang He; | code |
| 289 | Occluded Human Body Capture with Frequency Domain Denoising Prior Highlight: However, occluded human motion typically exhibits periodic patterns and consistent momentum. Inspired by this observation, we exploit reliable image observations in the frequency domain and formulate the motion capture task as a wavelet coefficient selection process. |
Buzhen Huang; Chongyang Xu; Wentao Tang; Yuan Shu; Jingyi Ju; Binghui Zuo; Yangang Wang; | code |
| 290 | Forecast The Principal, Stabilize The Residual: Subspace-Aware Feature Caching for Diffusion Transformers Highlight: We reveal that DiT feature spaces contain distinct principal and residual subspaces with divergent temporal behavior: the principal subspace evolves smoothly and predictably, while the residual subspace exhibits volatile, low-energy oscillations that resist accurate prediction. Building on this insight, we propose SVD-Cache, a subspace-aware caching framework that decomposes diffusion features via Singular Value Decomposition (SVD), applies exponential moving average (EMA) prediction to the dominant low-rank components, and directly reuses the residual subspace. |
Guantao Chen; Shikang Zheng; Yuqi Lin; Linfeng Zhang; | code |
| 291 | EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation Highlight: We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. |
Abhishek Saroha; Huajian Zeng; Xingxing Zuo; Daniel Cremers; Xi Wang; | code |
| 292 | Beyond Pixel Simulation: Pathology Image Generation Via Diagnostic Semantic Tokens and Prototype Control Highlight: We introduce UniPath, a semantics-driven pathology image generation framework that leverages mature diagnostic understanding to enable controllable generation. |
Minghao Han; Yichen Liu; Yizhou Liu; Zizhi Chen; Jingqun Tang; Xuecheng Wu; Dingkang Yang; Lihua Zhang; | code |
| 293 | ResCa: Residual Caching for Diffusion Transformers Acceleration Highlight: Existing token reduction-based acceleration techniques, such as caching and merging, attempt to reduce this cost from both temporal and spatial perspectives, but often compromise generation quality by introducing non-updated or non-self denoising directions. In this paper, we propose Residual Caching (ResCa), a novel, training-free framework that introduces a proxy denoising perspective to overcome these limitations. |
Haipeng Fang; Yu Li; Fan Tang; Yixing Lu; Juan Cao; Sheng Tang; | code |
| 294 | MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation Highlight: We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. |
Changli Wu; Haodong Wang; Jiayi Ji; Yutian Yao; Chunsai Du; Jihua Kang; Yanwei Fu; Liujuan Cao; | code |
| 295 | MMGait: Towards Multi-Modal Gait Recognition Highlight: Most existing methods focus primarily on RGB-derived modalities, which fall short in real-world scenarios requiring multi-modal collaboration and cross-modal retrieval. To overcome these challenges, we present MMGait, a comprehensive multi-modal gait benchmark integrating data from five heterogeneous sensors, including an RGB camera, a depth camera, an infrared camera, a LiDAR scanner, and a 4D Radar system. |
Chenye Wang; Qingyuan Cai; Saihui Hou; Aoqi Li; Yongzhen Huang; | code |
| 296 | FVBench: Benchmarking Deepfake Video Detection Capability of Large Multimodal Models Highlight: To this end, we propose **FVBench**, a comprehensive deepfake video benchmark designed to advance video deepfake detection. |
Wang Jiarui; Huiyu Duan; Juntong Wang; Xiongkuo Min; | code |
| 297 | UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection Via MoE-Driven Feature Decompression Highlight: Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates. To address these limitations, we propose UniMMAD, a unified framework for multi-modal and multi-class anomaly detection. |
Yuan Zhao; Youwei Pang; Lihe Zhang; Hanqi Liu; Jiaming Zuo; Huchuan Lu; Xiaoqi Zhao; | code |
| 298 | Image Guides Images: Consistent Video Amodal Completion with Rectified In-Context Exemplar Guidance Highlight: Existing VAC methods finetune video generation models on custom datasets, yet these datasets often have unrealistic distributions and small scales due to the challenges of collecting real amodal data, which limits their performance and generalization. To address this, we utilize pre-trained image inpainting models for VAC and introduce in-context (IC) learning to enhance inter-frame consistency. |
Xiaoyu Kong; Ketong Ren; Dongyu She; Weiming Dong; Miao Wang; | code |
| 299 | Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset Highlight: In practice, however, real infrared images are affected by coupled optical and sensing degradations that jointly deteriorate both structural sharpness and thermal fidelity. To address these challenges, we propose Real-IISR, a unified autoregressive framework for real-world IISR that progressively reconstructs fine-grained thermal structures and clear backgrounds in a scale-by-scale manner via thermal-structural guided visual autoregression. |
Yang Zou; Jun Ma; Zhidong Jiao; Xingyuan Li; Zhiying Jiang; Jinyuan Liu; | code |
| 300 | CLiViS: Unleashing Cognitive Map Through Linguistic-Visual Synergy for Embodied Visual Reasoning Highlight: Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS. |
Kailing Li; Qi'ao Xu; Tianwen Qian; Yuqian Fu; Yang Jiao; Xiaoling Wang; | code |
| 301 | SVHalluc: Benchmarking Speech–Vision Hallucination in Audio-Visual Large Language Models Highlight: In this work, we show that speech content can induce hallucinations in audio-visual LLMs, where models generate inaccurate or misleading outputs. |
Chenshuang Zhang; Kyeongseon Kim; Chengxin Liu; Tae-Hyun Oh; | code |
| 302 | What Is The Optimal Ranking Score Between Precision and Recall? We Can Always Find It and It Is Rarely $F_1$ Highlight: We frame the problem of finding a tradeoff between two scores as an optimization problem expressed with Kendall rank correlations. |
Sébastien Piérard; Adrien Deliege; Marc Van Droogenbroeck; | code |
| 303 | WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval Highlight: To effectively leverage their complementary strengths under diverse query intents, we propose **WISER**, a training-free framework that unifies T2I and I2I via a “retrieve–verify–refine” pipeline, explicitly modeling *intent awareness* and *uncertainty awareness*. |
Tianyue Wang; Leigang Qu; Tianyu Yang; Xiangzhao Hao; Yifan Xu; Haiyun Guo; Jinqiao Wang; | code |
| 304 | PhysHead: Simulation-Ready Gaussian Head Avatars Highlight: These methods often fail to disentangle hair from the head, representing it as a simple outer shell and failing to capture its natural volumetric behavior. In this paper, we address these limitations by introducing PhysHead, a hybrid representation for animatable head avatars with realistic hair dynamics learned from multi-view video. |
Berna Kabadayi; Vanessa Sklyarova; Wojciech Zielonka; Justus Thies; Gerard Pons-Moll; | code |
| 305 | Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping Highlight: Building upon it, we propose Venus, a two-stage framework that first empowers MLLMs with AG capability through progressively complex aesthetic questions and then activates their aesthetic cropping power via CoT-based rationales. |
Tianxiang Du; Hulingxiao He; Yuxin Peng; | code |
| 306 | GOR-IS: 3D Gaussian Object Removal In The Intrinsic Space Highlight: In this paper, we present 3D **G**aussian **O**bject **R**emoval in the **I**ntrinsic **S**pace (GOR-IS), a novel framework for physically consistent and visually coherent 3D object removal. |
Yonghao Zhao; Yupeng Gao; Jian Yang; Jin Xie; Beibei Wang; | code |
| 307 | Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision Highlight: In this work, we activate the intrinsic reasoning potential of MLLMs to perform anomaly detection, pixel-level localization, and interpretable reasoning solely from image-level supervision, without any auxiliary components or pixel-wise labels. |
Yizhou Jin; Yuezhu Feng; Jinjin Zhang; Peng Wang; Qingjie Liu; Yunhong Wang; | code |
| 308 | Feed-forward Gaussian Registration for Head Avatar Creation and Editing Highlight: We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. |
Malte Prinzler; Paulo Gotardo; Siyu Tang; Timo Bolkart; | code |
| 309 | End-to-End Hyper-Relational Information Extraction for Engineering Diagrams Via Dynamically Tokenized Relation Transformer Highlight: Moreover, parsing frameworks solely based on object detection can merely localize component positions, yet fail to capture the topological connection semantics and structured knowledge among components, thus offering limited convenience for industrial applications. To address these issues, we propose an end-to-end information extraction framework based on the Dynamically Tokenized Relation Transformer (DTRT), which can dynamically reduce received image tokens, filter redundant information, and efficiently extract structural knowledge to construct hyper-relational knowledge graphs. |
Tianyou Bai; Yan-Ming Zhang; Zixiang Zhang; Jibin Zhou; Fei Yin; Cheng-Lin Liu; | code |
| 310 | DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers Highlight: Through analyzing the evolution and influence of internal representations under various settings, we reveal that representation diversity across blocks is a crucial factor for effective learning. Based on this key insight, we propose DiverseDiT, a novel framework that explicitly promotes representation diversity. |
Mengping Yang; Stewart Tan; Binglei Li; Xiaomeng Yang; Hesen Chen; Hao Li; | code |
| 311 | SGAD-SLAM: Splatting Gaussians at Adjusted Depth for Better Radiance Fields in RGBD SLAM Highlight: However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified for improving scalability. |
Pengchong Hu; Zhizhong Han; | code |
| 312 | Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands Via Direct Preference Optimization Highlight: Achieving adaptive fusion that aligns with various preferences from both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, a direct preference optimization (DPO) framework integrating the property-aligned latent diffusion model (PALDM) and the preference-controllable latent diffusion model (PCLDM), enabling task-guided, preference-adaptive IVIF for both human and machine vision. |
Weijian Su; Songqian Zhang; Yuqi Han; Jian Zhuang; Yongdong Huang; Qiang Zhang; | code |
| 313 | OrienPose: Orientation-Guided Novel View Synthesis for Single-Image Unseen Object Pose Estimation Highlight: To this end, we propose OrienPose, a novel object pose estimation framework via orientation-aware NVS from a single image. |
Yating Liu; Zhaoshuai Qi; Yang Zou; Yongnan Yang; Shizhou Zhang; Yanning Zhang; | code |
| 314 | TF-SSD: A Strong Pipeline Via Synergic Mask Filter for Training-free Co-salient Object Detection Highlight: To this end, we adopt an intra-image saliency filter that employs DINO’s attention maps to identify visually salient masks within individual images. |
Zhijin He; Shuo Jin; Siyue Yu; Shuwei Wu; Bingfeng Zhang; Li Yu; Jimin Xiao; | code |
| 315 | DPGF-Net: Dual-Prior Guided Fusion Network for Joint Assessment of Perceptual Quality and Semantic Consistency in AI-Generated Images Highlight: As such, disentangling image content and rendering distortions is vital. To address this issue, we propose the dual-prior guided fusion network (DPGF-Net), which leverages image-side priors to disentangle distortions from content and combines them with text-side prompt templates to simulate their interactions. |
Tao Li; Xingran Liao; Mingliang Zhou; | code |
| 316 | RehearseVLA: Simulated Post-Training for VLAs with Physically-Consistent World Model Highlight: Furthermore, existing VLA approaches lack a reliable mechanism for detecting task completion, leading to redundant actions that reduce overall task success rates. To address these challenges, we propose RehearseVLA, an RL-based post-training framework that replaces physical interaction with a low-cost world model-based virtual simulator. |
Junjin Xiao; Yandan Yang; Xinyuan Chang; Ronghan Chen; Feng Xiong; Mu Xu; Wei-Shi Zheng; Qing Zhang; | code |
| 317 | GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Highlight: To this end, we propose GenColorBench, the first comprehensive benchmark for T2I color generation, grounded in color systems like ISCC-NBS and CSS3/X11, including numerical colors which are absent elsewhere. |
Muhammad Atif Butt; Alexandra Gomez-Villa; Tao Wu; Javier Vazquez-Corral; Joost van de Weijer; Kai Wang; | code |
| 318 | OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control Highlight: We introduce OLATverse, a large-scale dataset comprising around 9M images of 765 real-world objects, captured from multiple viewpoints under a diverse set of precisely controlled lighting conditions. |
Xilong Zhou; Jianchun Chen; Pramod Rao; Timo Teufel; Linjie Lyu; Tigran Minasian; Oleksandr Sotnychenko; Xiao-Xiao Long; Marc Habermann; Christian Theobalt; | code |
| 319 | UAST: Unified Active Search and Tracking for Arbitrary Targets with UAVs Highlight: In this paper, we present UAST, a simple yet effective mapping-free framework that unifies active search and persistent tracking using only RGB-D observations. |
Liang Qin; Min Wang; Xingyu Lu; Aowen Qiu; Wengang Zhou; Houqiang Li; | code |
| 320 | Energy-GS: Image Energy-guided Pose Alignment Gaussian Splatting with Redesigned Pose Gradient Flow Highlight: However, unlike NeRF, joint optimization in 3D Gaussian Splatting (3DGS) often requires additional regularization or prior spatial knowledge to reach comparable performance. To eliminate these dependencies, we introduce Energy-GS, a pose-aware Gaussian splatting framework that jointly optimizes scene representation and camera poses using only RGB images. |
Yu Gao; Su Lutong; Ruixiang Huang; Tianji Jiang; Jiadong Tang; Yufeng Yue; Yi Yang; | code |
| 321 | Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation Highlight: In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. |
Xin Lin; Meixi Song; Dizhe Zhang; Wenxuan Lu; Haodong Li; Bo Du; Ming-Hsuan Yang; Truong Nguyen; Lu Qi; | code |
| 322 | ORIC: Benchmarking Object Recognition Under Contextual Incongruity in Large Vision-Language Models Highlight: We examine these failures through the lens of uncertainty, focusing on contextual incongruity, where objects appear unexpectedly or fail to appear in expected contexts, and show that such cases increase recognition difficulty for state-of-the-art LVLMs. To study this regime, we introduce the Object Recognition in Incongruous Context (ORIC) framework, which constructs incongruous object-context pairs through two complementary strategies: (1) LLM-guided sampling to identify hard-to-recognize objects present in the image and (2) CLIP-guided sampling to mine plausible but absent ones. |
Zhaoyang Li; Zhan Ling; Yuchen Zhou; Litian Gong; Erdem Biyik; Hao Su; | code |
| 323 | An Efficient Token Compression Framework for Visual Object Tracking Highlight: This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker’s overall performance. To bridge this gap, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. |
Weijing Wu; Qihua Liang; Bineng Zhong; Haiying Xia; Zhiyi Mo; Shuxiang Song; | code |
| 324 | Energy Waveify and Redistribution for Test-Time Adaptation: A Control System Perspective Highlight: This work tackles a key challenge in test-time energy adaptation: prohibitive time overhead arising from recent state-of-the-art test-time adaptation (TTA) methods, which are built on energy models relying on iterative Monte Carlo or Langevin dynamics sampling with multiple stochastic updates per test instance to approximate energy gradients. We tackle the problem from an innovative control system perspective by i) describing the energy as a complex-valued wave, where the amplitude encodes energy uncertainty and the phase characterizes its evolution, and ii) maintaining a time-dependent wave equation that interprets TTA as a control system evolution process. |
Zhenbin Wang; Lei Zhang; Lituan Wang; Zhenwei Zhang; Guangwu Qian; Yan Wang; Wei Huang; | code |
| 325 | ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing Highlight: In this work, we propose ChangeBridge, a conditional spatiotemporal image generation model for remote sensing. |
Zhenghui Zhao; Chen Wu; Xiangyong Cao; Di Wang; Hongruixuan Chen; Datao Tang; Liangpei Zhang; Zhuo Zheng; | code |
| 326 | Correspondence-Attention Alignment for Multi-view Diffusion Models Highlight: However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. |
Minkyung Kwon; Jinhyeok Choi; Jiho Park; Seonghu Jeon; Jinhyuk Jang; Junyoung Seo; Min-Seop Kwak; Jin-Hwa Kim; Seungryong Kim; | code |
| 327 | VVS: Accelerating Speculative Decoding for Visual Autoregressive Model Via Partial Verification Skipping Highlight: Based on an analysis of the drafting stage’s characteristics, we observe that **verification redundancy** and **stale feature reusability** are key factors in retaining generation quality and speedup for verification-free steps. Inspired by these two observations, we propose a novel SD framework, **VVS**, to accelerate visual AR models via partial verification skipping, which integrates three complementary modules: (1) a verification-free token selector with dynamic truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped-step scheduling. |
Haotian Dong; Ye Li; Rongwei Lu; Chen Tang; Shu-Tao Xia; Zhi Wang; | code |
| 328 | SemiGDA: Generative Dual-distribution Alignment for Semi-Supervised Medical Image Segmentation Highlight: This limits robust semantic representation learning and adaptive modeling of unlabeled data in scenarios with few labels. To address these limitations, we propose SemiGDA, a novel Generative Dual-distribution Alignment framework for semi-supervised medical image segmentation. |
Kaiwen Huang; Yi Zhou; Yizhe Zhang; Jingxiong Li; Tao Zhou; | code |
| 329 | Uika: Universal Head Avatar from Pose-Free Images Highlight: We present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of unposed inputs, including a single image, multi-view captures, and smartphone-captured videos. |
Zijian Wu; Boyao Zhou; Liangxiao Hu; Hongyu Liu; Yuan Sun; Xuan Wang; Xun Cao; Yujun Shen; Hao Zhu; | code |
| 330 | InfinityHuman: Towards Long-Term Audio-Driven Human Animation Highlight: Additionally, hand movements are poorly modeled, resulting in noticeable distortions and misalignment with the audio. In this work, we propose InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations, then progressively refines them into high-resolution, long-duration videos using a pose-guided refiner. |
Xiaodi Li; Pan Xie; Yi Ren; Qijun Gan; Chen Zhang; Fangyuan Kong; Xiang Yin; Zehuan Yuan; Bingyue Peng; | code |
| 331 | Scone: Bridging Composition and Distinction in Subject-Driven Image Generation Via Unified Understanding-Generation Modeling Highlight: We propose Scone, a unified understanding-generation framework that integrates composition and distinction. |
Yuran Wang; Bohan Zeng; Chengzhuo Tong; Wenxuan Liu; Yang Shi; Xiaochen Ma; Hao Liang; Yuanxing Zhang; Wentao Zhang; | code |
| 332 | Missing No More: Dictionary-Guided Cross-Modal Image Fusion Under Missing Infrared Highlight: We address missing-IR fusion by proposing a dictionary-guided, coefficient-domain framework built upon a shared convolutional dictionary. |
Yafei Zhang; Meng Ma; Huafeng Li; Yu Liu; | code |
| 333 | GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization Highlight: It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. |
Zixuan Song; Jing Zhang; Di Wang; Zidie Zhou; Wenbin Liu; Haonan Guo; En Wang; Bo Du; | code |
| 334 | PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards Highlight: We attribute these limitations to the absence of high-quality multi-subject datasets and refined post-training strategies. To address these challenges, we propose a scalable multi-subject data generation pipeline that leverages powerful single-subject generation models to construct diverse and high-quality multi-subject training data. |
Shulei Wang; Longhui Wei; Xin He; Jianbo Ouyang; Hui Lu; Zhou Zhao; Qi Tian; | code |
| 335 | EthoCLIP: Ontology-Enhanced Video-Language Pretraining for Animal Behavior Understanding Highlight: We present EthoCLIP, an ontology-enhanced vision–language contrastive learning framework that explicitly embeds ontology semantics through an ontology-aware graph module to capture hierarchical relationships among behaviors and learn structured semantic dependencies. |
Yinuo Jing; Jinyan Wu; Zixi Yang; Kongming Liang; Xiatian Zhu; Zhanyu Ma; | code |
| 336 | Dance Across Shifts: Forward-Facilitation Continual Test-Time Adaptation Through Dynamic Style Bridging Highlight: While existing methods predominantly follow a backward-alignment paradigm, constructing weak supervisory surrogates derived from prior knowledge, they struggle with unreliable supervision and evolving distribution shifts. To overcome this, we propose a novel forward-facilitation paradigm through a dynamic style bridging framework. |
Zhilin Zhu; Yabin Wang; Zhiheng Ma; Yaguang Song; Yaowei Wang; Xiaopeng Hong; | code |
| 337 | Query2Uncertainty: Robust Uncertainty Quantification and Calibration for 3D Object Detection Under Distribution Shift Highlight: Although post-hoc calibration methods address this issue and provide improved calibration for in-distribution tests, they fail to adapt in distribution-shifted scenarios. In this work, we address this issue and introduce a density-aware calibration method that couples post-hoc calibrators with the feature density of latent object queries from DETR-style 3D object detectors. |
Till Beemelmanns; Alexey Nekrasov; Stefan Vilceanu; Jonas Steinhaus; Timo Woopen; Bastian Leibe; Lutz Eckstein; | code |
| 338 | FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning Highlight: While most existing FL methods assume homogeneous model architectures, client heterogeneity in data and resources renders this assumption impractical, motivating model-heterogeneous FL. To address this problem, we propose Federated Representation Entanglement (FedRE), a framework built upon a novel form of client knowledge termed entangled representation. |
Yuan Yao; Lixu Wang; Jiaqi Wu; Jin Song; Simin Chen; Zehua Wang; Zijian Tian; Wei Chen; Huixia Li; Xiaoxiao Li; | code |
| 339 | StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References Highlight: To address these issues, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization. |
Boyu He; Yunfan Ye; Chang Liu; Weishang Wu; Fang Liu; Zhiping Cai; | code |
| 340 | M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction Highlight: To advance the HMR research community, we introduce M4Human, the largest-scale multimodal benchmark to date (661K frames, 9× the prior largest), featuring high-resolution mmWave radar, RGB, and depth data. |
Fan Junqiao; Yunjiao Zhou; Yizhuo Yang; Xinyuan Cui; Jiarui Zhang; Lihua Xie; Jianfei Yang; Chris Xiaoxuan Lu; Fangqiang Ding; | code |
| 341 | Enhancing Continual Learning of Vision-Language Models Via Dynamic Prefix Weighting Highlight: However, previous approaches often normalize the weights of these vectors, disregarding the fact that different input tokens require different degrees of adjustment. To overcome this issue, we propose Dynamic Prefix Weighting (DPW), a framework that dynamically assigns weights to prefixes, complemented by adapters. |
Hyeonseo Jang; Hyuk Kwon; Kibok Lee; | code |
| 342 | Improving Calibration in Test-Time Prompt Tuning for Vision-Language Models Via Data-Free Flatness-Aware Prompt Pretraining Highlight: Recent works address this issue by incorporating additional regularization terms that constrain model outputs, which improve calibration but often degrade performance. In this work, we reveal that these regularization strategies implicitly encourage optimization toward flatter minima, and that the sharpness of the loss landscape around adapted prompts is a key factor governing calibration quality. |
Hyeonseo Jang; Jaebyeong Jeon; Joong-won Hwang; Kibok Lee; | code |
| 343 | SVBench: Evaluation of Video Generation Models on Social Reasoning Highlight: Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. |
Wenshuo Peng; Gongxuan Wang; Tianmeng Yang; Chuanhao Li; Xiaojie Xu; Hui He; Kaipeng Zhang; | code |
| 344 | EgoSound: Benchmarking Sound Understanding in Egocentric Videos Highlight: To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. |
Bingwen Zhu; Yuqian Fu; Qiaole Dong; Guolei Sun; Tianwen Qian; Yuzheng Wu; Danda Paudel; Yanwei Fu; Xiangyang Xue; | code |
| 345 | LoST: Level of Semantics Tokenization for 3D Shapes Highlight: These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. |
Niladri Shekhar Dutt; Zifan Shi; Paul Guerrero; Chun-Hao P. Huang; Duygu Ceylan; Niloy J. Mitra; Xuelin Chen; | code |
| 346 | BiPA: Bilevel Prompt Adaptation for Underwater Instance Segmentation Highlight: In this work, we propose BiPA, which effectively adapts SAM to the underwater domain. |
Long Ma; Haoze Zheng; Yuhang Mao; Jinyuan Liu; Chengpei Xu; Xinwei Xue; Yi Wang; Xiangjian He; Weimin Wang; | code |
| 347 | ExpoCM: Exposure-Aware One-Step Generative Single-Image HDR Reconstruction Highlight: While recent diffusion-based approaches offer powerful generative priors, they often overlook the exposure-dependent nature of the degradation and incur substantial computational costs from iterative sampling. To address these challenges, we propose ExpoCM, a novel one-step generative HDR reconstruction framework that reformulates HDR reconstruction as a Probability Flow ODE (PF-ODE) and constructs exposure-aware consistency trajectories via exposure-dependent perturbations. |
Aoyu Liu; Zhen Liu; Ziyi Wang; Dian Chen; Bing Zeng; Shuaicheng Liu; | code |
| 348 | VesMamba: 3D Pulmonary Vessel Segmentation from CT Images Via Mamba with Structural Perception and Scale-aware Filtering Highlight: Existing segmentation models either cannot sufficiently capture long-range structural dependencies, which are of great importance in vessel segmentation, or are constrained by insufficient computational resources in clinical settings. In this paper, we propose VesMamba, a novel model for 3D pulmonary vessel segmentation that comprehensively addresses these challenges. |
Zhipeng Liu; Guilian Chen; Zheng Jiang; Huisi Wu; Jing Qin; | code |
| 349 | Voxify3D: Pixel Art Meets Volumetric Rendering Highlight: We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. |
Yichuan Huang; Jiewen Chan; Hao-Jen Chien; Yu-Lun Liu; | code |
| 350 | PAMotion: Physics-Aware Motion Generation for Full-Body Interaction with Multiple Objects Highlight: We present PAMotion, a physics-aware diffusion framework for generating realistic full-body human interactions with multiple objects. |
Yan Di; Yuheng Li; Yaoxing Wang; Mengge Liu; Shan Gao; Xiangyang Ji; | code |
| 351 | Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging Highlight: In this work, we propose a statistical characteristic-guided denoising network, which utilizes statistical characteristics to guide the denoising process in both spatial and frequency domains. |
Hesong Li; Ziqi Wu; Ruiwen Shao; Ying Fu; | code |
| 352 | Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation Highlight: We propose Decoupled Residual Denoising Diffusion models (DRDD) for unified and data-efficient image-to-image (I2I) translation. |
Ziyue Lin; Jiahe Hou; Xia Hongyu; Xinrui Xie; Feifei Wang; Yuyin Zhou; Wei Wang; Jiawei Liu; Liangqiong Qu; | code |
| 353 | Enhancing Unregistered Hyperspectral Image Super-Resolution Via Unmixing-based Abundance Fusion Learning Highlight: In this paper, we propose an unmixing-based fusion framework that decouples spatial-spectral information to simultaneously mitigate the impact of unregistered fusion and enhance the learnability of SR models. |
Yingkai Zhang; Tao Zhang; Jing Nie; Ying Fu; | code |
| 354 | SeeU: Seeing The Unseen World Via 4D Dynamics-aware Generation Highlight: We propose SeeU, a novel approach that learns continuous 4D dynamics and generates unseen visual content. |
Yu Yuan; Tharindu Wickremasinghe; Zeeshan Nadir; Xijun Wang; Yiheng Chi; Stanley H. Chan; | code |
| 355 | MMVIP: A Visible-infrared Paired Dataset for Multi-weather Marine Vision Highlight: However, the absence of paired visible–infrared datasets that realistically capture diverse maritime scenarios has severely hindered progress in this field. To overcome this limitation, we present MMVIP, the first large-scale visible–infrared maritime vision dataset covering a wide spectrum of weather conditions and sea states. |
Yunpeng Yin; Lihan Wang; Zhaoshen He; Xinqiang He; Xingming Liao; Zhuowei Wang; Lianglun Cheng; | code |
| 356 | GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding Highlight: However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. |
Rong Fan; Kaiyan Xiao; Minghao Zhu; Liuyi Wang; Kai Dai; Zhao Yang; | code |
| 357 | WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation Highlight: Image-level weather editing methods tend to introduce scene artifacts and offer poor controllability over weather effects. To address these limitations, we propose WeatherCity, a novel framework for 4D urban scene reconstruction and weather editing. |
Wenhua Wu; Huai Guan; Zhe Liu; Hesheng Wang; | code |
| 358 | FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition Highlight: This not only increases unnecessary computation but can degrade performance by incorporating noisy or unreliable modalities. To overcome these limitations, we propose FusionAgent, a novel agentic framework that leverages a Multimodal Large Language Model (MLLM) to perform dynamic, sample-specific model selection. |
Jie Zhu; Xiao Guo; Yiyang Su; Anil Kumar Jain; Xiaoming Liu; | code |
| 359 | GardenDesigner: Encoding Aesthetic Principles Into Jiangnan Garden Construction Via A Chain of Agents Highlight: However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural modeling. |
Mengtian Li; Fan Yang; Ruixue Xiong; Yiyan Fan; Zhifeng Xie; Zeyu Wang; | code |
| 360 | GeniNav: Generative Model Driven Image-Goal Navigation Via Imagination-Guided Consistency Flow Matching Highlight: The absence of interactive, closed-loop benchmarks further limits fair and reproducible comparison. To address these issues, we propose GeniNav, a generative image-goal navigation framework that couples a VLM-driven latent subgoal imagination module for high-level semantic guidance with Multi-Segment Consistency Flow Matching (MS-CFM) for temporally smooth and dynamically coherent motion generation. |
Yuqi Chen; Gao Junjie; Pan Yongzhou; Siyuan Song; Zixuan Zhang; Jiaping Xiao; Mir Feroskhan; | code |
| 361 | CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing Highlight: We propose CogniEdit, a unified framework that combines multi-modal reasoning with dense reward optimization that propagates gradients across consecutive denoising steps, enabling trajectory-level gradient flow through the sampling process. |
Yan Li; Lin Liu; Xiaopeng Zhang; Wei Xue; Wenhan Luo; Yike Guo; Qi Tian; | code |
| 362 | PositionIC: Unified Position and Identity Consistency for Image Customization Highlight: To this end, we introduce PositionIC, a unified framework for high-fidelity, spatially controllable multi-subject customization. |
Junjie Hu; Tianyang Han; Kai Ma; Jialin Gao; Yang Song; Xianhua He; Junfeng Luo; Xiaoming Wei; Wenqiang Zhang; | code |
| 363 | R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment Highlight: We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM’s understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. |
Zhuangzi Li; Jian Jin; Shilv Cai; Weisi Lin; | code |
| 364 | Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction Highlight: We introduce a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, directly animatable 3D head avatars from a single image. |
Yisheng He; | code |
| 365 | Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion Highlight: Based on this analysis, we propose a Neurodynamics-Driven CNP Fusion model (ND-CNPFuse) tailored for the challenging MFIF task. |
Bo Li; Yunkuo Lei; Tingting Bao; Hang Yan; Yaxian Wang; Weiping Fu; Lingling Zhang; Jun Liu; | code |
| 366 | MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts Highlight: We introduce MajutsuCity, a natural language–driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. |
Zilong Huang; Jun He; Xiaobin Huang; Ziyi Xiong; Yang Luo; Junyan Ye; Weijia Li; Yiping Chen; Ting Han; | code |
| 367 | BEV-CAR: Enhancing Monocular Bird’s Eye View Segmentation with Context-Aware Rasterization Highlight: Bird’s Eye View (BEV) semantic segmentation is essential for autonomous driving and mobile robotics, yet it still faces significant challenges in accurately segmenting foreground objects and efficiently estimating layout categories obscured by objects. To address these issues, we propose BEV-CAR, a Context-Aware Rasterization method that rasterizes the BEV representation without any coordinate transformations. |
Yixin Xiong; Ke Wang; Tongtong Cheng; Chunhui Liu; Kai Liu; | code |
| 368 | Memory-Efficient Transfer Learning with Fading Side Networks Via Masked Dual Path Distillation Highlight: However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address the above issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. |
Yutong Zhang; Jiaxin Chen; Honglin Chen; Kaiqi Zheng; Shengcai Liao; Hanwen Zhong; Weixin Li; Yunhong Wang; | code |
| 369 | VLM4RSDet: Collaborative Optimization with Vision-Language Model for Enhancing Remote Sensing Object Detection Highlight: However, most existing VLMs are designed for open-vocabulary tasks and exhibit inherent limitations when directly applied to closed-set scenarios, such as notable accuracy degradation and high deployment costs. To address these issues, we propose VLM4RSDet, a novel collaborative training framework that leverages a vision-language model to enhance the performance of conventional closed-set remote sensing object detectors. |
Shuohao Shi; Qiang Fang; Xin Xu; | code |
| 370 | Evolutionary Multimodal Reasoning Via Hierarchical Semantic Representation for Intent Recognition Highlight: However, current methods struggle to model hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrates HIerarchical semantic representation with Evolutionary Reasoning based on a Multimodal Large Language Model (MLLM). |
Qianrui Zhou; Hua Xu; Yunjin Gu; Yifan Wang; Songze Li; Hanlei Zhang; | code |
| 371 | Do VLMs Perceive or Recall? Probing Visual Perception Vs. Memory with Classic Visual Illusions Highlight: To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without the illusion inducer) that disentangles visually grounded perception from language-driven recall. |
Xiaoxiao Sun; Mingyang Li; Kun Yuan; Min Woo Sun; Mark Endo; Shengguang Wu; Changlin Li; Yuhui Zhang; Zeyu Wang; Serena Yeung; | code |
| 372 | Boosting Vision-Language Models Towards Cross-Domain Incremental Object Detection Highlight: CDIOD reveals that existing methods struggle to balance adaptivity and stability under substantial domain shifts. To tackle this challenge, we propose Dynamic Group Subspace (DGS), a novel framework that dynamically groups tasks by distribution to promote knowledge sharing and prevent task collisions; progressively consolidates adapters to build shared subspaces and control parameter growth; and implements a dynamic training pipeline to maintain a proper stability-adaptivity balance. |
Xu Wang; Zihan Lin; Yixin Zhang; Zilei Wang; | code |
| 373 | Test-Time Training for LiDAR Semantic Segmentation Under Corruption Via Geometric Inlier Discrimination Highlight: Existing test-time adaptation methods, including approaches based on pseudo-labels and normalization statistics, have shown promising results but can still struggle under severe distribution shifts. To complement these approaches, we propose a geometry-aware test-time training framework that leverages an auxiliary self-supervised objective. |
Hyeonseong Kim; Hyun-Kurl Jang; Kuk-Jin Yoon; | code |
| 374 | Evidential Neural Radiance Fields Highlight: Among those that do quantify one or the other, many of them either compromise rendering quality or incur significant computational overhead to obtain uncertainty estimates. To address these issues, we introduce Evidential Neural Radiance Fields, a probabilistic approach that seamlessly integrates with the NeRF rendering process and enables direct quantification of both aleatoric and epistemic uncertainty from a single forward pass. |
Ruxiao Duan; Alex Wong; | code |
| 375 | A More Word-like Image Tokenization for MLLMs Highlight: We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. |
Hyun Lee; Hyemin Jeong; Yejin Kim; Hyungwook Choi; Hyunsoo Cho; Soo Kyung Kim; Joonseok Lee; | code |
| 376 | OSA: Echocardiography Video Segmentation Via Orthogonalized State Update and Anatomical Prior-aware Feature Enhancement Highlight: We introduce OSA, a lightweight linear sequence architecture designed for stable and efficient cardiac video segmentation. |
Rui Wang; Huisi Wu; Jing Qin; | code |
| 377 | Neural Field-Based 3D Surface Reconstruction of Microstructures from Multi-Detector Signals in Scanning Electron Microscopy Highlight: Existing SEM 3D reconstruction methods struggle with textureless regions, shadowing artifacts, and calibration dependencies, whereas advanced learning-based approaches fail to generalize to microscopic SEM domains due to the lack of physical priors and domain-specific data. To address these challenges, we introduce NFH-SEM, a neural field-based hybrid reconstruction framework that recovers high-fidelity 3D surfaces from multi-view, multi-detector SEM images. |
Shuo Chen; Yijin Li; Xi Zheng; Guofeng Zhang; | code |
| 378 | Streaming Diffusion Model for Fast Infrared and Visible Video Fusion Highlight: While diffusion models possess strong generative priors to remedy this, their iterative nature is prohibitively slow for video. To resolve this fundamental dilemma, we propose a streaming diffusion model for efficient infrared and visible video fusion, termed SDMFusion. |
Jinyuan Liu; Ludan Sun; Tengyu Ma; Chunyan Yang; Zhiying Jiang; Long Ma; Risheng Liu; Xin Fan; | code |
| 379 | Bridging Human Evaluation to Infrared and Visible Image Fusion Highlight: This challenge is further exacerbated by the ill-posed nature of IVIF, which severely limits its effectiveness in human perceptual environments such as security surveillance and driver assistance systems. To address these limitations, we propose a feedback reinforcement framework that bridges human evaluation to infrared and visible image fusion. |
Jinyuan Liu; Xingyuan Li; Qingyun Mei; HaoYuan Xu; Zhiying Jiang; Long Ma; Risheng Liu; Xin Fan; | code |
| 380 | Cross-Slice Knowledge Transfer Via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference Highlight: To address this challenge, we propose SpaHGC, a multi-modal heterogeneous graph-based model that captures both intra-slice and inter-slice spot-spot relationships from histology images. |
Zhiceng Shi; Changmiao Wang; Jun Wan; Wenwen Min; | code |
| 381 | UniCorn: Unified Correspondence Transformer Across 2D and 3D Highlight: We present UniCorn, the first correspondence model with shared weights that unifies geometric matching across all three tasks. |
Prajnan Goswami; Tianye Ding; Feng Liu; Huaizu Jiang; | code |
| 382 | Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models Highlight: To answer this question, we systematically investigate the hallucination patterns of MCoT models and find that fabricated texts are primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that localizes divergent thinking steps and intervenes in the decoding process to mitigate hallucinations. |
Ji Ma; Wei Suo; Peng Wang; Yanning Zhang; | code |
| 383 | PG-VTON: Single-Pass Training-Free Virtual Try-On Via Patch-Guided Reference Alignment Highlight: We propose PG-VTON, a single-pass, training-free framework based on Patch-Guided Reference Alignment. |
Guohao Zhao; Yuxin Peng; | code |
| 384 | SPE-MVS: Spatial Position Encoding Enhanced Multi-View Stereo with Monocular Depth Priors Highlight: However, existing methods depend heavily on photometric consistency across views, leading to poor performance in challenging regions, such as weakly textured or non-Lambertian surfaces. To overcome this limitation, we propose SPE-MVS, a novel MVS framework enhanced with Spatial Position Encoding (SPE). |
Shaoqian Wang; Jiadai Sun; Bosen Hou; Qiang Wang; Bin Fan; Bo Li; Bin Lu; Yuchao Dai; | code |
| 385 | DepthFocus: Controllable Depth Estimation for See-Through Scenes Highlight: We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as intent-driven control. |
Junhong Min; Jimin Kim; Minwook Kim; Cheol-Hui Min; Youngpil Jeon; Minyong Choi; | code |
| 386 | When Lines Meet Textures: Spatial-Frequency Aligned Diffusion Features for Cross-Sparsity Correspondence Highlight: Based on this analysis, we propose SFA-DIFT, a novel approach that learns spatial-frequency aligned diffusion features for robust cross-modal correspondence. |
Mingrui Zhu; Fengzhi Wang; Xin Wei; Jun Wang; Nannan Wang; Xinbo Gao; | code |
| 387 | Mixture of States: Routing Token-Level Dynamics for Multimodal Generation Highlight: We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. |
Haozhe Liu; Ding Liu; Mingchen Zhuge; Zijian Zhou; Tian Xie; Sen He; Yukang Yang; Shuming Liu; Yuren Cong; Jiadong Guo; Hongyu Xu; Ke Xu; Kam Woh Ng; Juan Camilo Perez; Juan-Manuel Pérez-Rúa; Tao Xiang; Wei Liu; Shikun Liu; Jürgen Schmidhuber; | code |
| 388 | Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios Highlight: However, they are trained to learn a single-domain target distribution, whereas fusion lacks natural fused data and relies on modeling complementary information from multiple sources, making diffusion hard to apply directly in practice. To address these challenges, this paper proposes an efficient degradation-aware diffusion framework for image fusion under arbitrary degradation scenarios. |
Yu Shi; Yu Liu; Zhong-Cheng Wu; Juan Cheng; Huafeng Li; Xun Chen; | code |
| 389 | TextOVSR: Text-Guided Real-World Opera Video Super-Resolution Highlight: Second, current RWVSR methods, which rely solely on degraded image features, struggle to reconstruct realistic and detailed textures due to a lack of high-level semantic guidance. To address these issues, we propose a Text-guided Dual-Branch Opera Video Super-Resolution (TextOVSR) network, which introduces two types of textual prompts to guide the super-resolution process. |
Hua Chang; Xin Xu; Wei Liu; Jiayi Wu; Kui Jiang; Fei Ma; Qi Tian; | code |
| 390 | From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning Highlight: To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust representation space with a temporal cycle consistency objective and a semantic separability constraint. |
Yang Liu; Qianqian Xu; Peisong Wen; Siran Dai; Xilin Zhao; Qingming Huang; | code |
| 391 | CoT-Edit: Let CoT Guide Instruction Video Editing Highlight: Text-driven instruction-based video editing in complex scenes remains challenging: purely textual prompts often fail to capture precise spatial relationships and physical constraints, resulting in target ambiguity and physically implausible outcomes. To address this, we propose a plan–guide–edit framework that explicitly bridges semantic intent and spatial execution. |
Sen Liang; Fengbin Guan; Youliang Zhang; Xin Li; Zhibo Chen; | code |
| 392 | Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation Highlight: Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-aware Virtual View Exploration (TVVE), a framework designed to overcome these challenges by integrating virtual view exploration with task-specific representation learning. |
Yongjie Bai; Zhouxia Wang; Yang Liu; Kaijun Luo; Yifan Wen; Mingtong Dai; Weixing Chen; Ziliang Chen; Lingbo Liu; Guanbin Li; Liang Lin; | code |
| 393 | SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer Highlight: Furthermore, such manually designed strategies exhibit poor generalization. To address these issues, we propose SODA, a Sensitivity-Oriented Dynamic Acceleration method that adaptively performs caching and pruning based on fine-grained sensitivity. |
Tong Shao; Yusen Fu; Guoying Sun; Jingde Kong; Zhuotao Tian; Jingyong Su; | code |
| 394 | Zero-Shot Feature Upsampling Via Neighborhood Attention Filtering Highlight: We introduce Neighborhood Attention Filtering (NAF), bridging classical filtering with modern upsamplers. |
Loick Chambon; Paul Couairon; Éloi Zablocki; Alexandre Boulch; Nicolas Thome; Matthieu Cord; | code |
| 395 | Streamlined Open-Vocabulary Human-Object Interaction Detection Highlight: However, feature fusion in this paradigm is challenging due to significant gaps in cross-model representations. To address this issue, we introduce SL-HOI, a StreamLined open-vocabulary HOI detection framework based solely on the powerful DINOv3 model. |
Chang Sun; Liao Dongliang; Changxing Ding; | code |
| 396 | GMT: Effective Global Framework for Multi-Target Multi-Camera Tracking Highlight: However, in this paradigm, the use of multiple views is confined to recovering missed matches in the first stage, providing a limited contribution to overall tracking. To address this issue, we propose a novel global MCMT tracking framework termed GMT, which effectively leverages the advantages of multiple views by performing global-level trajectory-target matching. |
Yihao Zhen; Mingyue Xu; Qiang Wang; Baojie Fan; Jiahua Dong; Tinghui Zhao; Huijie Fan; | code |
| 397 | Parameter-Efficient Adaptation for MLLMs Via Implicit Modality Decomposition Highlight: However, most existing PEFT methods neglect the issue of modality-imbalanced learning, which is characterized by the excessive dominance of text modality in updating parameters, thus incurring insufficient learning of non-text modalities and leading to performance degradation. To address this issue, we propose a novel parameter-efficient adaptation method for MLLMs, namely Implicit Modality Decomposition (IMoD), based on LoRA. |
Mingfang Zhang; Yunhong Wang; Lu Wang; Jiaxin Chen; | code |
| 398 | ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions Highlight: Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical unreality of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize the object’s metric scale and pose for grounding its normalized mesh in world space. |
Zikai Wang; Zhilu Zhang; Yiqing Wang; Hui Li; Wangmeng Zuo; | code |
| 399 | Scaling Multi-Identity Consistency for Image Customization Via Multi-to-Multi Matching Paradigm Highlight: However, since we humans are more sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. |
Yufeng Cheng; wenxu wu; Shaojin Wu; Mengqi Huang; Fei Ding; Qian HE; | code |
| 400 | Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation Highlight: To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. |
Boyu Han; Qianqian Xu; Shilong Bao; Zhiyong Yang; Ruochen Cui; Xilin Zhao; Qingming Huang; | code |
| 401 | CrossAgent: Bridging Cross-level Actions Into One Agentic Model Via Reinforcement Learning Highlight: The effective granularity of the control inevitably varies depending on the situation. To address this challenge, we propose CrossAgent, which introduces a novel adaptive action-space selection framework. |
Kaichen He; Zihao Wang; Muyao Li; Anji Liu; Yitao Liang; | code |
| 402 | Video Generation with Stable Transparency Via Shiftable RGB-A Distribution Learner Highlight: However, current methods often suffer from low quality due to confusion between RGB and alpha. In this paper, we address this problem by learning shiftable RGB-A distributions. |
Haotian Dong; Wenjing Wang; Chen Li; Jing LYU; Di Lin; | code |
| 403 | AGiLe: Learning Robust Long-Horizon Manipulation Via Affordance-Grounded Bidirectional Latent Planning Highlight: Existing approaches often encounter two critical limitations: the accumulation of prediction errors in subgoal planning, leading to compounding deviations over time; and the planning-execution gap, where high-level abstract plans fail to be effectively grounded in the continuous perception-action space. To address these challenges, we propose a novel unified framework, Affordance-Grounded Bidirectional Latent Planning (AGiLe). |
Zixuan Chen; Xiangrong Feng; Jieqi Shi; Lin Shao; Jing Huo; Yang Gao; | code |
| 404 | Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training Highlight: Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. |
Aojun Lu; Tao Feng; Hangjie Yuan; Wei Li; Yanan Sun; | code |
| 405 | Unified Generation and Self-Verification for Vision-Language Models Via Advantage Decoupled Preference Optimization Highlight: We propose Advantage Decoupled Preference Optimization (**ADPO**), a unified reinforcement learning framework that jointly learns answer generation and self-verification within a single policy. |
Xinyu Qiu; Heng Jia; Zhengwen Zeng; Shuheng Shen; Changhua Meng; Yi Yang; Linchao Zhu; | code |
| 406 | FlowPalm: Optical Flow Driven Non-Rigid Deformation for Geometrically Diverse Palmprint Generation Highlight: In this work, we propose FlowPalm, an optical-flow-driven palmprint generation framework capable of simulating the complex non-rigid deformations observed in real palms. |
yuchen zou; Huikai Shao; Lihuang Fang; Zhipeng Xiong; Dexing Zhong; | code |
| 407 | Scalable Multi-View Subspace Clustering with Tensorized Anchor Guidance Highlight: In this paper, we propose a novel scalable tensorized anchor guidance for multi-view subspace clustering, which directly couples anchors across views to improve clustering performance. |
Miao Jia; Xingchen Hu; Jiyuan Liu; Siwei Wang; Min Wang; Zijian Chen; | code |
| 408 | TGTrack: Temporal Generative Learning for Unified Single Object Tracking Highlight: Existing single object trackers typically treat temporal modeling superficially by passing limited inter-frame information, such as propagated tokens or template updates, without intrinsic temporal supervision learning. To address this limitation, we propose TGTrack, a new unified tracking framework that incorporates a temporally generative supervision task to guide the model in learning temporal dynamics. |
Wanting Geng; Xin Chen; Chuanyu Sun; Jie Zhao; Ben Kang; Dong Wang; Huchuan Lu; | code |
| 409 | LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding Highlight: Recently, masked diffusion models such as LLaDA have emerged as promising alternatives, yet their application in the biomedical domain remains largely underexplored. To bridge this gap, we introduce **LLaDA-MedV**, the first large language diffusion model tailored for biomedical image understanding through vision instruction tuning. |
XUANZHAO DONG; Wenhui Zhu; Xiwen Chen; Zhipeng Wang; Peijie Qiu; Shao Tang; Xin Li; Yalin Wang; | code |
| 410 | GeCo: Geometry-Consistent Regularization for Domain Generalized Semantic Segmentation Highlight: We propose Geometry-Consistent Regularization (GeCo), which extrapolates the pretrained representation space toward the target task under structure-respected constraints, thereby preserving the inherent generalization of VFMs while enhancing their task-specific adaptation. |
Qi Zang; Dong Zhao; Nan Pu; Wenjing Li; Zhun Zhong; Meng Wang; | code |
| 411 | EfficientVPR: Toward Efficient Visual Place Recognition Via Scene-Aware Prompt Tuning and Adaptive Feature Enhancement Highlight: Current methods predominantly address these challenges by either scaling up model capacity or employing computationally intensive reranking stages, creating a significant efficiency bottleneck. To overcome this limitation, we propose EfficientVPR, a lightweight one-stage framework that achieves unprecedented speed-accuracy trade-offs through two key innovations: i) a scene-aware visual prompt tuning method which adapts pretrained features with fewer parameters while dynamically adjusting to sample-specific characteristics, and ii) an instance-dependent key local feature enhancement module that further reinforces discriminative regions. |
Wenjing Tang; Chuanguang Yang; Zhulin An; Libo Huang; boyu diao; Yongjun Xu; | code |
| 412 | MatLat: Material Latent Space for PBR Texture Generation Highlight: We propose a generative framework for producing high-quality PBR textures on a given 3D mesh. |
Kyeongmin Yeo; Yunhong Min; Jaihoon Kim; Minhyuk Sung; | code |
| 413 | ReLaGS: Relational Language Gaussian Splatting Highlight: Achieving unified 3D perception and reasoning across tasks such as segmentation, retrieval, and relation understanding remains challenging, as existing methods are either object-centric or rely on costly training for inter-object reasoning. We present a novel framework that constructs a hierarchical language-distilled Gaussian scene and its 3D semantic scene graph without scene-specific training. |
Yaxu Xie; Abdalla Arafa; Alireza Javanmardi; Christen Millerdurai; Jia Hu; Shaoxiang Wang; Alain Pagani; Didier Stricker; | code |
| 414 | SEASON: Mitigating Temporal Hallucination in Video Large Language Models Via Self-Diagnostic Contrastive Decoding Highlight: While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. |
Chang-Hsun Wu; Kai-Po Chang; Yu-Yang Sheng; Hung-Kai Chung; Kuei-Chun Wang; Yu-Chiang Frank Wang; | code |
| 415 | OmniFood8K: Single-Image Nutrition Estimation Via Hierarchical Frequency-Aligned Fusion Highlight: Moreover, many state-of-the-art nutrition prediction methods rely on depth sensors, restricting their applicability in daily scenarios. To address these limitations, we introduce OmniFood8K, a comprehensive multimodal dataset comprising 8,036 food scenes with detailed nutritional annotations and multi-view images for each scene. |
Dongjian Yu; Weiqing Min; Qian Jiang; Xing Lin; Xin Jin; Shuqiang Jiang; | code |
| 416 | LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens Highlight: Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. |
Zekun Li; Sizhe An; Chengcheng Tang; Chuan Guo; Ivan Shugurov; Linguang Zhang; Amy Zhao; Srinath Sridhar; Lingling Tao; Abhay Mittal; | code |
| 417 | InterAgent: Physics-based Multi-agent Command Execution Via Diffusion on Interaction Graphs Highlight: However, existing methods are largely confined to single-agent scenarios, overlooking the physically plausible interplay essential for multi-agent interactions. To bridge this gap, we propose InterAgent, the first end-to-end framework for text-driven physics-based multi-agent humanoid control. |
Bin Li; Ruichi Zhang; Han Liang; Jingyan Zhang; Juze Zhang; Xin Chen; Lan Xu; Jingyi Yu; Jingya Wang; | code |
| 418 | Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment in CLIP Highlight: In this work, we propose a multi-granular text-conditioned contrastive learning framework, β-CLIP, to achieve hierarchical alignment across multiple textual granularities, from full captions to sentences and phrases, and their corresponding visual regions. |
Fatimah Zohra; Chen Zhao; Hani Itani; Bernard Ghanem; | code |
| 419 | Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack Highlight: In this work, we focus on indoor lighting, an intrinsic yet largely overlooked scene attribute that strongly influences navigation. |
Chenyang LI; Wenbing Tang; Yihao Huang; Simon Sinong Zhan; Ming Hu; Xiaojun Jia; Yang Liu; | code |
| 420 | MorphAny3D: Unleashing The Power of Structured Latent in 3D Morphing Highlight: We present MorphAny3D, a training-free framework that leverages Structured Latent (SLAT) representations for high-quality 3D morphing. |
Xiaokun Sun; Zeyu Cai; Hao Tang; Ying Tai; Jian Yang; Zhenyu Zhang; | code |
| 421 | PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization Highlight: We present PiLoT, a unified framework that tackles UAV-based ego and target geo-localization. |
Xiaoya Cheng; Long Wang; Yan Liu; Xinyi Liu; Hanlin Tan; Yu Liu; Maojun Zhang; Shen Yan; | code |
| 422 | TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution Highlight: Thus, we present **TUDSR**, a **T**wice **U**psampling-**D**iffusion framework for higher SR. |
Zhiqiang Wu; Yitong Dong; Xian Wei; | code |
| 423 | Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration Highlight: We ask: must we process every pixel to understand the image? This paper introduces C²SSM, a visual state space model that breaks this taboo by shifting from pixel-serial to cluster-serial scanning. |
Chen Wu; Ling Wang; Zhuoran Zheng; Yuning Cui; Zhixiong Yang; Xiangyu Chen; Yue Zhang; Weidong Jiang; Jingyuan Xia; | code |
| 424 | MetroGS: Efficient and Stable Reconstruction of Geometrically Accurate High-Fidelity Large-Scale Scenes Highlight: However, how to efficiently and stably achieve high-quality geometric fidelity remains a core challenge. To address this issue, we introduce MetroGS, a novel Gaussian Splatting framework for efficient and robust reconstruction in complex urban environments. |
Kehua Chen; Tianlu Mao; Xinzhu Ma; Hao Jiang; Zehao Li; Zihan Liu; Shuqin Gao; Honglong Zhao; Feng Dai; Yucheng Zhang; Zhaoqi Wang; | code |
| 425 | Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection Highlight: However, they suffer from frequently producing artifacts, especially near layout boundaries of tiny objects, thus substantially limiting their performance. To address these issues, we propose UAVGen, a novel layout-to-image generation framework tailored for UAV-based object detection. |
Wenhao Li; Zimeng Wu; Yu Wu; Zehua Fu; Jiaxin Chen; | code |
| 426 | MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving Highlight: To address these, we innovatively propose MindDriver, a progressive multimodal reasoning framework that enables VLM to imitate human-like progressive thinking for autonomous driving. |
Lingjun Zhang; Yujian Yuan; Changjie Wu; Xinyuan Chang; Xin Cai; Shuang Zeng; Linzhe Shi; Sijin Wang; Hang Zhang; Mu Xu; | code |
| 427 | Sampling-Aware Quantization for Diffusion Models Highlight: In this study, we uncover that quantization-induced noise disrupts directional estimation at each sampling step, further distorting the precise directional estimations of higher-order samplers when solving the sampling equations through discretized numerical methods, thereby altering the optimal sampling trajectory. |
Qian Zeng; Jie Song; Yuanyu Wan; Huiqiong Wang; Mingli Song; | code |
| 428 | Affordance Field Intervention: Enabling VLAs to Escape Memory Traps in Robotic Manipulation Highlight: To compensate for this missing spatial understanding, 3D Spatial Affordance Fields (SAFs) can provide a geometric representation that highlights where interactions are physically feasible, offering explicit cues about regions the robot should approach or avoid. We therefore introduce Affordance Field Intervention (AFI), a lightweight hybrid framework that uses SAFs as an on-demand plug-in to guide VLA behavior. |
Siyu Xu; Zijian Wang; Yunke Wang; Chenghao Xia; Tao Huang; Chang Xu; | code |
| 429 | Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction Highlight: This one-shot mapping, however, struggles to learn accurate representations for short trajectories due to significant information gaps. To address this issue, we propose a **P**rogressive **R**etrospective **F**ramework (PRF), which gradually aligns features from incomplete observations with those from complete ones via a cascade of retrospective units. |
Hao Zhou; Lu Qi; Xiangtai Li; Jie Zhang; Yi Liu; Xu Yang; Mingyu Fan; Fei Luo; | code |
| 430 | LangRef3DGS: Natural Language-Guided 3D Referential Segmentation from Partial Observations Via 3D Gaussian Splatting Highlight: Language-Guided 3D segmentation is crucial for linking 3D perception with semantic understanding, yet it remains vulnerable to the incomplete and occluded views common in real-world RGB-D data. To overcome this, we present a real-time framework that leverages 3D Gaussian Splatting (3DGS) to build a semantically continuous and differentiable embedding field from partial observations. |
xulun ye; Qin Zhang; Kun Zhou; | code |
| 431 | Nonparametric Deep Fine-grained Clustering with Low-Rank Guided Vision-Language Model Highlight: Vision-Language Models (VLMs) on existing coarse-grained datasets (characterized by high inter-class and low intra-class variance) struggle to capture the subtle distinctions essential for fine-grained categorization, leading to suboptimal clustering performance. To address this, we propose a novel framework that adapts VLMs for fine-grained clustering without requiring fine-grained labels. |
xulun ye; Benyu Wu; Jie Hong; Kun Zhou; | code |
| 432 | Clone Deterministic 3D Worlds Highlight: In this work, we take a step toward building a truly accurate world model by addressing a fundamental yet open problem: constructing a model that can fully clone a deterministic 3D world. |
Zaishuo Xia; Yukuan Lu; Xinyi Li; Yifan Xu; Yubei Chen; | code |
| 433 | Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation Highlight: (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. |
Yiwen Tang; Ziyu Guo; Kaixin Zhu; Renrui Zhang; Qizhi Chen; Dongzhi Jiang; Junli Liu; Bohan Zeng; Haoming Song; Delin Qu; Tianyi Bai; Dan Xu; Wentao Zhang; Bin Zhao; | code |
| 434 | ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation Highlight: In this paper, we propose a framework ExPose that directly addresses 3D inconsistency when applying video generation to pose estimation in extreme-view settings. |
Youngho Yoon; Wonjune Cho; Hyunho Ha; Sujung Kim; Kuk-Jin Yoon; | code |
| 435 | Event6D: Event-based Novel Object 6D Pose Tracking Highlight: We introduce EventTrack6D, an event-depth tracking framework that generalizes to novel objects without object-specific training by reconstructing both intensity and depth at arbitrary timestamps between depth frames. |
Jae-Young Kang; Hoonhee Cho; Taeyeop Lee; Minjun Kang; Bowen Wen; Youngho Kim; Kuk-Jin Yoon; | code |
| 436 | VisiLock: Authorizing Instruction-based Image Editing with Dual Score Distillation Highlight: We introduce **VisiLock**, where access control is baked into model weights, rendering the model unusable without a visual trigger in the input. |
Thanh Van Le; Yun Fu; | code |
| 437 | ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Video Understanding Highlight: We revisit video hallucination in multimodal large language models (Video-MLLMs) from a semantic aggregation perspective. |
Hao Lu; Jiahao Wang; Yaolun Zhang; Ruohui Wang; Xuanyu Zheng; Yepeng Tang; Dahua Lin; Lewei Lu; | code |
| 438 | More Than Meets The Eye: A Unified Image Fusion Framework Via Semantic-Pixel Entropy Trade-off for Zero-Shot Generalization Highlight: This limitation can be attributed to three key challenges: (1) the lack of a unified, task-agnostic optimization objective; (2) the inherent difficulty in balancing semantic fidelity and pixel-level richness; and (3) an over-reliance on supervised learning, which limits transferability across tasks. To overcome these issues, this work proposes a unified fusion framework that generalizes to diverse fusion tasks even when trained solely on infrared–visible image pairs. |
Xiaowen Liu; Jing Li; Hongtao Huo; Haozhe Cao; Renhua Wang; Xu Dong; | code |
| 439 | ProSoftArena: Evaluating Hierarchical Capabilities of Multimodal Agents in Professional Software Environments Highlight: Multimodal agents are making rapid progress on general computer-use tasks, yet existing benchmarks remain largely confined to browsers and basic desktop applications, falling short in professional software workflows that dominate real-world scientific and industrial practice. To close this gap, we introduce ProSoftArena, a benchmark and platform specifically for evaluating multimodal agents in professional software environments. |
Jiaxin Ai; Yukang Feng; Fanrui Zhang; Jianwen Sun; Zizhen Li; Chuanhao Li; Yifan Chang; Wenxiao Wu; Ruoxi Wang; Mingliang Zhai; Kaipeng Zhang; | code |
| 440 | iSplat: Iterative Learning for Fine-Grained Gaussian Splatting Highlight: In this paper, we leverage the strengths of both paradigms and introduce iSplat, a novel framework that reformulates reconstruction as an iterative feed-forward process involving multiple (typically three) passes. |
Haifeng Wu; Wei Long; Shuhang Gu; Lixin Duan; Wen Li; | code |
| 441 | Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models Highlight: In this work, we tackle the scaling challenge in multimodal-to-audio generation, examining whether models trained on short instances can generalize to longer ones during testing. |
Christian Simon; Masato Ishii; Wei-Yao Wang; Koichi Saito; Akio Hayakawa; Dongseok Shim; Zhi Zhong; Shuyang Cui; Takashi Shibuya; Shusuke Takahashi; Yuki Mitsufuji; | code |
| 442 | ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence On Mobile Devices Highlight: However, its development is critically bottlenecked by the lack of benchmarks that can address real-world complexity and enable objective, executable evaluation. To overcome these challenges, we introduce **ProactiveMobile**, a comprehensive benchmark designed to systematically advance research in this domain. |
Dezhi Kong; Zhengzhao Feng; Qiliang Liang; Wang Hao; haofei Sun; Changpeng Yang; Yang Li; Peng Zhou; Shuai Nie; Hongzhen Wang; Linfeng Zhou; Hao Jia; Jiaming Xu; Runyu Shi; Ying Huang; | code |
| 443 | RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cue for 3D Object Detection Highlight: Existing fusion approaches either rely on instance proposals lacking global context or dense BEV grids constrained by rigid structures, lacking a flexible and adaptive representation for diverse scenes. To address this, we propose RaGS, the first framework that leverages 3D Gaussian Splatting (GS) to fuse 4D radar and monocular cues for 3D object detection. |
Xiaokai Bai; Chenxu Zhou; Lianqing Zheng; Jianan Liu; Siyuan Cao; Xiaohan Zhang; Yiming Li; Zhengzhuang Zhang; Hui-Liang Shen; | code |
| 444 | Will Multimodal Models Be Dazzled By Multi-Image Visual Puzzles? Highlight: The rapid advancement of Multimodal Large Language Models (MLLMs) has revealed the limitations of existing benchmarks in evaluating complex reasoning over multiple images. To address this gap, we introduce **MIRACLE**, a novel benchmark for Multi-Image complex Reasoning And Comprehension Logic Evaluation, featuring 4,000 questions across diverse reasoning types such as visual comparison, temporal sequencing, and spatial relations, with each question involving an average of seven tightly correlated images. |
zhi zhu; YaoQi Fan; Zhe Chen; Yue Cao; Yangzhou Liu; Tong Lu; | code |
| 445 | Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings Highlight: Yet existing proxy metrics, such as model confidence or accuracy-on-the-line, are often unreliable as they only assess model outputs while ignoring the internal mechanisms that produce them. We address this limitation by introducing a new perspective: using a model’s inner workings, i.e., circuits, as a predictive metric of generalization performance. |
Yunxiang Peng; Mengmeng Ma; Ziyu Yao; Xi Peng; | code |
| 446 | VLM-Loc: Localization in Point Cloud Maps Via Vision-Language Models Highlight: However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. |
Shuhao Kang; Youqi Liao; Peijie Wang; Wenlong Liao; Qilin Zhang; Benjamin Busam; Xieyuanli Chen; Yun Liu; | code |
| 447 | Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs Highlight: Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. |
Ziqi Wang; Chang Che; Qi Wang; Hui Ma; Zenglin Shi; Cees G. M. Snoek; Meng Wang; | code |
| 448 | Learning 3D Reconstruction with Priors in Test Time Highlight: We introduce a test-time framework for multiview Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks, without retraining or modifying the pre-trained image-only networks. |
Lei Zhou; Haoyu Wu; Akshat Dave; Dimitris Samaras; | code |
| 449 | Chain of Event-Centric Causal Thought for Physically Plausible Video Generation Highlight: In this paper, we view PPVG as generating a sequence of causally connected and dynamically evolving events. |
Zixuan Wang; Yixin Hu; Haolan Wang; Feng Chen; Yan Liu; Wen Li; Yinjie Lei; | code |
| 450 | Training-free Motion Factorization for Compositional Video Generation Highlight: In this paper, we propose a **motion factorization** framework that decomposes complex motion into three primary categories: motionlessness, rigid motion, and non-rigid motion. |
Zixuan Wang; Ziqin Zhou; Feng Chen; DUO PENG; Yixin Hu; Changsheng Li; Yinjie Lei; | code |
| 451 | Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes Highlight: In this paper, we propose a Transformer-based multi-view crowd tracking model, MVTrackTrans, which adopts interactions between camera views and the ground plane for enhanced multi-view tracking performance. |
Qi Zhang; Jixuan Chen; Zhang Kaiyi; Xinquan Yu; Antoni B. Chan; Hui Huang; | code |
| 452 | 3D-IDE: 3D Implicit Depth Emergent Highlight: To this end, we propose 3D-Implicit Depth Emergence, a method that reframes 3D perception as an emergent property derived from geometric self-supervision rather than explicit encoding. |
Chushan Zhang; Ruihan Lu; Jinguang Tong; Yikai Wang; Hongdong Li; | code |
| 453 | UniSH: Unifying Scene and Human Reconstruction in A Feed-Forward Pass Highlight: We present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. |
Mengfei Li; Peng Li; Zheng Zhang; Jiahao Lu; Chengfeng Zhao; Wei Xue; Qifeng Liu; Sida Peng; Wenxiao ZHANG; Wenhan Luo; Yuan Liu; Yike Guo; | code |
| 454 | Learning Cross-View Object Correspondence Via Cycle-Consistent Mask Prediction Highlight: We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. |
Shannan Yan; Leqi Zheng; Keyu Lv; Jingchen Ni; Hongyang Wei; Jiajun Zhang; Guangting Wang; Jing LYU; Chun Yuan; Fengyun Rao; | code |
| 455 | Towards Visual Query Localization in The 3D World Highlight: In this paper, we make the first attempt at visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. |
liang peng; Bohan Tan; Zhipeng Zhang; Haobo Li; Yifan Jiao; Xingping Dong; Libo Zhang; | code |
| 456 | Depth Any Endoscopy: Towards Self-Supervised Generalizable Depth Estimation in Monocular Endoscopy Highlight: In this work, we propose Depth Any Endoscopy (DAE), a novel self-supervised framework for generalizable depth estimation in monocular endoscopy. |
Shuwei Shao; Kejin Zhu; Shixing Ma; Xinzhe Du; Baochang Zhang; Zhe Min; | code |
| 457 | Improving Sparse Autoencoder with Dynamic Attention Highlight: While existing activation functions such as ReLU and TopK provide certain sparsity guarantees, they typically require additional sparsity regularization or cherry-picked hyperparameters. We show in this paper that dynamically sparse attention mechanisms using sparsemax can bridge this trade-off, due to their ability to determine the activation numbers in a data-dependent manner. |
Dongsheng Wang; Jinsen Zhang; Dawei Su; Hui Huang; | code |
| 458 | LRDUN: A Low-Rank Deep Unfolding Network for Efficient Spectral Compressive Imaging Highlight: In this paper, we propose two novel imaging models corresponding to the spectral basis and subspace image by explicitly integrating low-rank (LR) decomposition with the sensing model. |
HE HUANG; Yujun Guo; Wei He; | code |
| 459 | CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models Highlight: Our analysis suggests that this instability arises from visual preference conflicts, where the context-agnostic nature of vision encoders induces divergent parameter updates under diverse multimodal context. To address this issue, we propose the Context-aware Visual Fine-tuning (CoVFT) framework, which explicitly incorporates multimodal context into visual adaptation. |
Nan Zhou; Huiqun Wang; Yaoyan Zheng; Di Huang; | code |
| 460 | Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset Highlight: MIMIC-ILS contains 1.1M instruction-answer pairs derived from 192K images and 91K unique segmentation masks, covering seven major lesion types. To empirically demonstrate its utility, we introduce ROSALIA, a vision-language model fine-tuned on MIMIC-ILS. |
Geon Choi; Hangyul Yoon; Hyunju Shin; Hyunki Park; Sang Seo; Eunho Yang; Edward Choi; | code |
| 461 | Explore with Long-term Memory: A Benchmark and Multimodal LLM-based Reinforcement Learning Framework for Embodied Exploration Highlight: To enhance the agent’s memory recall and proactive exploration capabilities, we propose MemoryExplorer, a novel method that fine-tunes a multimodal large language model through reinforcement learning to encourage active memory querying. |
sen wang; Bangwei Liu; Zhenkun Gao; Lizhuang Ma; Xuhong Wang; Yuan Xie; Xin Tan; | code |
| 462 | Learning to Generate Via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models Highlight: To address this challenge, we explore UMMs’ internal understanding capability to enhance generation quality. We propose a token-level intrinsic text-image alignment reward mechanism, GvU, enabling the UMM to act simultaneously as teacher and student: it evaluates its own outputs using the understanding branch to guide the generations accordingly. |
Jiadong Pan; Liang Li; Yuxin Peng; Yu-Ming Tang; Shuohuan Wang; Yu Sun; Hua Wu; Qingming Huang; Haifeng Wang; | code |
| 463 | PET-DINO: Unifying Visual Cues Into Grounding DINO with Prompt-Enriched Training Highlight: Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal object detector supporting both text and visual prompts. |
Weifu Fu; Jinyang Li; Bin-Bin Gao; Jialin Li; Yuhuan Lin; Hanqiu Deng; Wenbing Tao; Yong Liu; Chengjie Wang; | code |
| 464 | GeoRelight: Learning Joint Geometrical Reconstruction and Relighting with Flexible Multi-Modal Diffusion Transformers Highlight: Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: **GeoRelight**. |
Yuxuan Xue; Ruofan Liang; Egor Zakharov; Timur Bagautdinov; Chen Cao; Giljoo Nam; Shunsuke Saito; Gerard Pons-Moll; Javier Romero; | code |
| 465 | Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution Can Improve Robust Scene Graph Generation Highlight: In this paper, we propose Robo-SGG, a plug-and-play module for robust scene graph generation (SGG). |
Changsheng Lv; Zijian Fu; Mengshi Qi; | code |
| 466 | Dynamic Token Reweighting for Robust Vision-Language Models Highlight: In this paper, we present DTR, a novel inference-time defense that mitigates multimodal jailbreak attacks through optimizing the model’s key-value (KV) caches. |
Tanqiu Jiang; Jiacheng Liang; Rongyi Zhu; Jiawei Zhou; Fenglong Ma; Ting Wang; | code |
| 467 | MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction Highlight: In this work, we introduce the Masked Edge Prediction MOdel (MEMO), which produces both accurate and crisp edges using only cross-entropy loss. |
Jiaxin Cheng; Yue Wu; Yicong Zhou; | code |
| 468 | Spectral Mixture-of-Experts for Continual Learning Highlight: While Parameter-Efficient Fine-Tuning using Mixture-of-Experts (MoE) is a promising solution for continual learning (CL), it suffers from two critical failure modes: structural interference, where expert updates interfere, and compositional forgetting, where the model’s routing policy drifts. To address these issues, we introduce Spectral MoE, a novel framework built for CL from three core components. |
Chen Yin; Xingbo Dong; Xuelin Shen; Zhe Jin; | code |
| 469 | CROWn: A Unified Framework for Anti‑Aliased Downsampling and Phase‑Calibrated Fusion in 3D Medical Segmentation Highlight: We introduce the Coset-fibRated micrO-local co-attention Network (CROWn), a general segmentation framework that couples sampling theory with representation learning to jointly suppress aliasing and calibrate cross-scale fusion. |
Xingru Huang; Shuanghua Ye; Zhao Huang; Wenwen Tang; Huiyu Zhou; Zhiwen Zheng; Jin Liu; Xiaoshuai Zhang; | code |
| 470 | Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation Highlight: We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. |
Delin An; Chaoli Wang; | code |
| 471 | Is Your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation Highlight: To address this challenge, we developed the SpatialSky-Dataset, a comprehensive dataset containing 1M samples with diverse annotations across various scenarios. Leveraging this dataset, we introduce Sky-VLM, a specialized VLM designed for UAV spatial reasoning across multiple granularities and contexts. |
Lingfeng Zhang; Yuchen Zhang; Hongsheng Li; Haoxiang Fu; Yingbo Tang; Hangjun Ye; Long Chen; Xiaojun Liang; Xiaoshuai Hao; Wenbo Ding; | code |
| 472 | OptiMVMap: Offline Vectorized Map Construction Via Optimal Multi-vehicle Perspectives Highlight: Incorporating views from surrounding vehicles offers complementary perspectives, yet naive fusion introduces three key challenges: computational cost from large candidate pools, redundancy from near-collinear viewpoints, and noise from pose errors and occlusion artifacts. We present OptiMVMap, which reformulates multi-vehicle mapping as a select-then-fuse problem to address these challenges systematically. |
Zedong Dan; Zijie Wang; Wei Zhang; Xiangru Lin; Weiming Zhang; Xiao Tan; Jingdong Wang; Liang Lin; Guanbin Li; | code |
| 473 | Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data Highlight: Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex cases that fail to guide stable optimization. Based on this observation, we propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the “high-difficulty–high-influence” quadrant to balance complex clinical reasoning with substantial gradient influence, enabling efficient medical reasoning with minimal fine-tuning data. |
Xinlin Zhuang; feilong tang; Haolin Yang; Xiwei Liu; Ming Hu; Huifa Li; Haochen Xue; Junjun He; Zongyuan Ge; Yichen Li; Ying Qian; Imran Razzak; | code |
| 474 | YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection Highlight: This mismatch results in both computational redundancy and suboptimal detection performance. To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. |
Xu Lin; Jinlong Peng; Zhenye Gan; Jiawen Zhu; Jun Liu; | code |
| 475 | SpaceTools: Tool-Augmented Spatial Reasoning Via Double Interactive RL Highlight: We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools through interactive exploration and feedback. |
Siyi Chen; Mikaela Angelina Uy; Chan Hee Song; Faisal Ladhak; Adithya Murali; Qing Qu; Stan Birchfield; Valts Blukis; Jonathan Tremblay; | code |
| 476 | RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation Highlight: Existing VLG approaches focus on coarse-grained, object-level localization, while traditional robotic grasping methods rely predominantly on geometric cues and lack language guidance, which limits their applicability in language-driven manipulation scenarios. To address these limitations, we propose the RealVLG framework, which integrates the RealVLG-11B dataset and the RealVLG-R1 model to unify real-world visual-language grounding and grasping tasks. |
Linfei Li; Lin Zhang; Ying Shen; | code |
| 477 | 3D Instance Models Are Implicit Generalizable Spatial Learners Abstract: Generalization remains the central challenge for interactive 3D scene generation. Existing learning‑based approaches ground spatial understanding in limited scene dataset, … |
Lu Ling; Yunhao Ge; Yichen Sheng; Aniket Bera; | code |
| 478 | MA-Bench: Towards Fine-grained Micro-Action Understanding Highlight: With the rapid development of Multimodal Large Language Models (MLLMs), their potential in Micro-Action understanding, a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. |
Kun Li; Jihao Gu; Fei Wang; zhiliang wu; Hehe Fan; Dan Guo; | code |
| 479 | Batch Loss Score for Dynamic Data Pruning Highlight: This work proposes the Batch Loss Score (BLS), a computationally efficient alternative using an Exponential Moving Average (EMA) of readily available batch losses to assign scores to individual samples. |
Qing Zhou; Bingxuan Zhao; Tao Yang; Hongyuan Zhang; Junyu Gao; Qi Wang; | code |
| 480 | Learning from Oblivion: Predicting Knowledge-Overflowed Weights Via Retrodiction of Forgetting Highlight: However, a fundamental question remains: how can we obtain better pre-trained weights that encapsulate more knowledge beyond the given dataset? In this work, we introduce KNowledge-Overflowed Weights (KNOW) prediction, a novel strategy that leverages structured forgetting and its inversion to synthesize knowledge-enriched weights. |
Jinhyeok Jang; Jaehong Kim; Jung Uk Kim; | code |
| 481 | SinGeo: Unlock Single Model’s Potential for Robust Cross-View Geo-Localization Highlight: Although studies have explored dynamic FoV training by simply randomizing FoVs, they failed to achieve robustness across diverse conditions—implicitly assuming all FoVs are equally difficult. To address this gap, we present SinGeo, a simple yet powerful framework that enables a single model to realize robust cross-view geo-localization without additional modules or explicit transformations. |
CHEN Yang; Xieyuanli Chen; Junxiang Li; Jie Tang; Tao Wu; | code |
| 482 | Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks Highlight: We propose a more reliable framework for evaluating SNN adversarial robustness. |
Jihang Wang; Dongcheng Zhao; Ruolin Chen; Qian Zhang; Yi Zeng; | code |
| 483 | Beyond Static Frames: Temporal Aggregate-and-Restore Vision Transformer for Human Pose Estimation Highlight: In this paper, we propose TAR-ViTPose, a novel Temporal Aggregate-and-Restore Vision Transformer tailored for video-based 2D human pose estimation. |
Hongwei Fang; Jiahang Cai; Xun Wang; Wenwu Yang; | code |
| 484 | 3D-Object Perception Transformer (3PT) Abstract: Current approaches to zero-shot 3D-object perception typically rely on ensembles of frozen foundation models. This limits deep object understanding and cross-domain … |
Agastya Kalra; Tim Salzmann; Guy Stoppi; Dmitrii Marin; Rishav Agarwal; Vage Taamazyan; Martin Bokeloh; Stefan Hinterstoisser; Anton Boykov; Alberto Dall'Olio; Pravin Dangol; Kartik Venkataraman; Huaijin Chen; | code |
| 485 | Neural Collapse in Test-Time Adaptation Highlight: In this work, we extend NC to the sample-wise level and discover a novel phenomenon termed Sample-wise Alignment Collapse (NC3+), demonstrating that a sample’s feature embedding, obtained by a trained model, aligns closely with the corresponding classifier weight. |
Xiao Chen; Zhongjing Du; Jiazhen Huang; Jiang Xu; Li Lu; Jingyan Jiang; Zhi Wang; | code |
| 486 | Adaptive Anisotropic Gaussian Splatting for Multi-contrast MRI Arbitrary-Scale Super-Resolution with Anatomy Guidance Highlight: To handle inter-contrast discrepancies, we introduce an anatomy-guided pipeline comprising three core modules: a Structure Prior Modulation Fusion (SPMF) module for feature enhancement; an Anatomy-Guided Dual-Domain Cross Attention (AG-DDCA) module for joint spatial-frequency modeling; and an Anatomy-Guided Gaussian Parametrizer (AGGP) that leverages gradient-based sparse attention to concentrate Gaussian centers on critical anatomical structures. |
Qiuhai Yan; Kang Chen; Zhengjie Lu; Tingting Wang; Faming Fang; Guixu Zhang; | code |
| 487 | Layer-wise Instance Binding for Regional and Occlusion Control in Text-to-Image Diffusion Transformers Highlight: Region-instructed layout control in text-to-image generation is highly practical, yet existing methods suffer from limitations: (i) training-based approaches inherit data bias and often degrade image quality, and (ii) current techniques struggle with occlusion order, limiting real-world usability. To address these issues, we propose LayerBind. |
Ruidong Chen; Yancheng Bai; Xuanpu Zhang; Jianhao Zeng; Lanjun Wang; Dan Song; Lei Sun; Xiangxiang Chu; An-An Liu; | code |
| 488 | Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking Highlight: As a result, the robustness and generalizability of the watermarking model are limited. To address this issue, we propose a novel training strategy that enhances robustness and generalization via meta-learning with feature consistency (Meta-FC). |
Yuheng Li; Weitong Chen; chengcheng zhu; Jiale Zhang; Chunpeng Ge; Di Wu; Guodong Long; | code |
| 489 | Discriminative Perception Via Anchored Description for Reasoning Segmentation Highlight: This suggests a need to complement the RL objective with Discriminative Perception, an ability to actively distinguish a target from its context. To realize this, we propose DPAD to compel the model to generate a descriptive caption of the referred object, which is then used to explicitly discriminate by sharply contrasting the caption’s semantic relevance to the referred object against the wider context. |
Tao Yang; Qing Zhou; Yanliang Li; Qi Wang; | code |
| 490 | Structural Graph Probing of Vision–Language Models Highlight: We take a topology-first perspective and analyze VLMs through the interaction graphs induced by neuron–neuron correlations, treating each layer as a structured computational network rather than a sequence of token transformations. Operating solely on these graphs, we show that global connectivity patterns are strongly predictive of model behavior across grounded reasoning, counting, and hallucination tasks. |
Haoyu He; Yue Zhuo; Yu Zheng; Qi Wang; | code |
| 491 | Federated Active Learning Under Extreme Non-IID and Global Class Imbalance Highlight: Moreover, global-model querying is beneficial only when the global distribution is highly imbalanced and client data are relatively homogeneous; otherwise, the local model is preferable. Based on these findings, we propose FairFAL, an adaptive class-fair FAL framework. |
Chen-Chen Zong; Sheng-Jun Huang; | code |
| 492 | D-Prism: Differentiable Primitives for Structured Dynamic Modeling Highlight: We propose D-Prism, the first framework to achieve high-fidelity structured dynamic modeling by extending differentiable primitives to the dynamic domain. |
Xingyuan Yu; Yijin Li; Chong Zeng; Yuhang Ming; Hujun Bao; Guofeng Zhang; | code |
| 493 | SAMIX: Reinforcing SAM2 with Semantic Adapter and Reference Selecting Policy for Mix-Supervised Segmentation Highlight: In this paper, we propose SAMIX, a novel framework that adapts SAM2 into a semantic-aware pseudo-label generator SA-SAM2 by incorporating a lightweight semantic adapter. |
Qiang Hu; Jiajie Wei; Zhenyu Yi; Zhifen Yan; Yingjie Guo; Hongkuan Shi; Ge-Peng Ji; Qiang Li; Zhiwei Wang; | code |
| 494 | GaussianMatch: Semi-Supervised Regression with Pseudo-Label Filtering Via Multi-View Gaussian Consistency Highlight: In this work, we propose GaussianMatch, a novel SSR framework enabling high-quality pseudo-label filtering, which selects reliable pseudo-labels through multi-view prediction consistency under feature-space smoothness assumptions. |
Yin Wang; Hao Lu; Zixuan Wang; Zhen Qin; Li Kuang; Mengchu Zhou; Shuiguang Deng; | code |
| 495 | Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning Highlight: We introduce AFRO, a scalable self-supervised framework that learns dynamics-aware 3D representations directly from point clouds without action or label supervision. |
Qiwei Liang; Boyang Cai; Minghao Lai; Sitong Zhuang; Tao Lin; Yan Qin; Yixuan Ye; Jiaming Liang; Renjing Xu; | code |
| 496 | Focus-to-Perceive Representation Learning: A Cognition-Inspired Hierarchical Framework for Endoscopic Video Analysis Highlight: While self-supervised video pre-training shows promise, existing methods designed for natural videos tend to prioritize dense spatio-temporal modeling and exhibit motion bias, neglecting the static, structured semantics that are critical for clinical decision-making. To address this challenge, we propose **F**ocus-to-**P**erceive **R**epresentation **L**earning (***FPRL***), a cognition-inspired hierarchical framework that emulates the clinical examination process of endoscopic videos. |
Yuan Zhang; Sihao Dou; Kai Hu; Shuhua Deng; Chunhong Cao; Fen Xiao; Xieping Gao; | code |
| 497 | AGFT: Alignment-Guided Fine-Tuning for Zero-Shot Adversarial Robustness of Vision-Language Models Highlight: Conventional classification-guided adversarial fine-tuning often compromises the pre-trained cross-modal alignment, undermining the intricate visual-textual correspondence essential for zero-shot performance. To mitigate this, we introduce Alignment-Guided Fine-Tuning (AGFT), a novel framework that preserves semantic integrity while enhancing robustness. |
Yubo Cui; Xianchao Guan; Zijun Xiong; Zheng Zhang; | code |
| 498 | FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-and-Language Navigation Highlight: In this work, we propose *FantasyVLN*, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. |
Jing Zuo; Lingzhou Mu; Fan Jiang; Chengcheng Ma; Mu Xu; Yonggang Qi; | code |
| 499 | MmWaveFlow: Unified Enhancement and Generation of MmWave Human Point Clouds Highlight: We revisit generative modeling for mmWave point clouds and propose a unified flow-matching framework mmWaveFlow that unifies enhancement and generation by learning an invertible transport between dense and sparse point clouds. |
Chang Su; Beihong Jin; Qiwen Shi; Zhi Wang; | code |
| 500 | RAYNOVA: Geometry-Free Auto-Regressive 4D World Modeling with Unified Spatio-Temporal Representation Highlight: World foundation models aim to simulate the evolution of the real world with physically plausible behavior. |
Yichen Xie; Chensheng Peng; Mazen Abdelfattah; Yihan Hu; Jiezhi Yang; Eric Higgins; Ryan Brigden; Masayoshi Tomizuka; Wei Zhan; | code |
| 501 | MIBURI: Towards Expressive Interactive Gesture Synthesis Highlight: Alternatively, generative methods for co-speech gesture synthesis yield natural body gestures but depend on future speech context and require long run-times. To bridge this gap, we present MIBURI, an online, causal framework for generating expressive co-speech gestures and facial expressions synchronized with real-time spoken dialogue. |
M. Hamza Mughal; Rishabh Dabral; Vera Demberg; Christian Theobalt; | code |
| 502 | More Than The Sum: Panorama-Language Models for Adverse Omni-Scenes Highlight: Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we propose that panoramic vision-language understanding is more than the sum of its pinhole counterparts. |
Weijia Fan; Ruiping Liu; Jiale Wei; Yufan Chen; Junwei Zheng; Zichao Zeng; Jiaming Zhang; Qiufu Li; Linlin Shen; Rainer Stiefelhagen; | code |
| 503 | Intra-class Distribution-guided Generative Hashing with Neighbor Refinement for Cross-modal Retrieval Highlight: Recent cross-modal hashing methods have introduced sample generation strategies to enrich training signals. |
Hao Sun; Yadong Huo; Qibing Qin; Wenfeng Zhang; Lei Huang; | code |
| 504 | GEM: Generating LiDAR World Model Via Deformable Mamba Highlight: However, progress in LiDAR-based world models has lagged behind those built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose **GEM**: a **G**enerative LiDAR world model that leverages d**E**formable **M**amba architecture, significantly improving fidelity and imaginative capability. |
Yang Wu; Zhaojiang Liu; Qiang Meng; Youquan Liu; renliang Weng; Jianjun Qian; Jian Yang; Jin Xie; | code |
| 505 | Drift-Resilient Temporal Priors for Visual Tracking Highlight: In this paper, we introduce DTPTrack, a lightweight and generalizable module designed to be seamlessly integrated into existing trackers to suppress drift. |
Yuqing Huang; Liting Lin; Weijun Zhuang; Zhenyu He; Xin Li; | code |
| 506 | Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow Highlight: Nevertheless, recent works show that while VLMs often manage to capture the correct image region corresponding to the question, they do not necessarily produce the correct answers. In this work, we demonstrate that this misalignment could be attributed to suboptimal information flow within VLMs, where text tokens distribute too much attention to irrelevant visual tokens, leading to incorrect answers. |
Chengxin Liu; Wonseok Choi; Chenshuang Zhang; Tae-Hyun Oh; | code |
| 507 | EgoAVU: Egocentric Audio-Visual Understanding Highlight: However, due to the challenge of obtaining text labels with coherent joint modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine to automatically generate egocentric audio-visual narrations, questions and answers. |
Ashish Seth; Xinhao Mei; Changsheng Zhao; Varun Nagaraja; Ernie Chang; Gregory P. Meyer; Gael Le Lan; Yunyang Xiong; Vikas Chandra; Yangyang Shi; Dinesh Manocha; zhipeng cai; | code |
| 508 | UniChange: Unifying Change Detection with Multimodal Large Language Model Highlight: We introduce three special tokens: [T1], [T2], and [CHANGE], utilising their embeddings as the key to query variations. |
Xu Zhang; Danyang Li; Xiaohang Dong; Tianhao Wu; Hualong Yu; Jianye Wang; Qicheng Li; Xiang Li; | code |
| 509 | Beyond Global Similarity: Multi-Conditional Retrieval for Fine-Grained Cross-Modal Understanding Highlight: Yet, existing benchmarks largely focus on coarse-grained or single-condition alignment, overlooking real-world scenarios where user queries specify multiple interdependent constraints across modalities. To bridge this gap, we introduce MCMR (Multi-Conditional Multimodal Retrieval)—a large-scale benchmark designed to evaluate fine-grained, multi-condition retrieval under natural-language queries. |
Xuan Lu; Kangle Li; Haohang Huang; Rui Meng; Wenjun Zeng; Xiaoyu Shen; | code |
| 510 | Real-Time Neural Video Compression with Unified Intra and Inter Coding Highlight: We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. |
Hui Xiang; Yifan Bian; Li Li; Jingran Wu; Xianguo Zhang; Dong Liu; | code |
| 511 | Joint Spectral Image Reconstruction and Semantic Segmentation with Cooperative Unfolding Highlight: To make the two mutually reinforcing, we introduce the Cross-Aggregated Super-Token Attention (CASTA) mechanism to enhance the representation interactions between HSI reconstruction and semantic segmentation. |
Zijun He; Ping Wang; Xiaodong Wang; ChangChen ChangChen; Xin Yuan; | code |
| 512 | SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition Highlight: However, the absence of contextual cues, such as objects involved in the action, introduces an inherent gap between skeleton and semantic representations, making it difficult to distinguish visually similar actions. To address this, we propose SkeletonContext, a prompt-based framework that enriches skeletal motion representations with language-driven contextual semantics. |
Ning Wang; Tieyue Wu; Naeha Sharif; Farid Boussaid; Guangming Zhu; Lin Mei; Mohammed Bennamoun; Liang Zhang; | code |
| 513 | CORE: Compact Object-centric REpresentations As A New Paradigm for Token Merging in LVLMs Highlight: Existing token compression methods, while varied, often lack a high-level semantic understanding, leading to suboptimal merges, information redundancy, or context loss. To address these limitations, we introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. |
Jingyu Lei; Gaoang Wang; Der-Horng Lee; | code |
| 514 | EgoMind: Activating Spatial Cognition Through Linguistic Reasoning in MLLMs Highlight: Purely 2D approaches, however, struggle with multi-frame spatial reasoning due to missing viewpoint transitions and overlooked implicit objects that act as spatial bridges. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Captioning and Progressive Spatial Analysis, jointly constructing a coherent linguistic scene graph across frames. |
Zhenghao Chen; Huiqun Wang; Di Huang; | code |
| 515 | DeepfakeImpact: A Two-Stage Benchmark with Real-World Impact in Deepfake Detection Highlight: We argue that technical metrics may fail to capture models’ actual capacity to mitigate real-world harm, as they treat all errors as equally significant. To bridge this gap, we introduce DeepfakeImpact, a two-stage benchmark that moves beyond pure technical evaluation toward societally-aware assessment. |
Chaoyu Gong; Han Zhang; Siqiang Luo; | code |
| 516 | FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration Highlight: We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. |
Jingren Liu; Shuning Xu; Qirui Yang; WANG Yun; Xiangyu Chen; Zhong Ji; | code |
| 517 | MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation Via Decoupled Mamba Attention Highlight: Existing CNN-, Transformer-, and Mamba-based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder. |
Zilong Zhao; Zhengming Ding; Pei Niu; Wenhao Sun; Feng Guo; | code |
| 518 | DVAR: Dynamic Visual Autoregressive Modeling for Image Super-Resolution Highlight: This limitation stems from their reliance on memorizing fixed, absolute scaling schedules, which necessitates a distinct model for each target resolution. We introduce DVAR, a Dynamic Visual AutoRegressive framework that overcomes this fundamental bottleneck. |
Yu Zheng; Kai Zhang; Wei Zhu; Qingguo Liu; Xiantao Hu; Jun Li; Jian Yang; | code |
| 519 | Quantum-Gated Task-interaction Knowledge Distillation for Pre-trained Model-based Class-Incremental Learning Highlight: Although pretrained models (PTMs) have shown promising performance in CIL, they still struggle with the entanglement of multi-task subspaces, leading to catastrophic forgetting when task routing parameters are poorly calibrated or task-level representations are rigidly fixed. To address this issue, we propose a novel Quantum-Gated Task-interaction Knowledge Distillation (QKD) framework that leverages quantum gating to guide inter-task knowledge transfer. |
Linjie Li; Huiyu Xiao; Jiarui Cao; Zhenyu Wu; JI Yang; | code |
| 520 | MMCP-GEN: A Modality-Extensible Diffusion Language Model for Conditional Protein Sequence Generation Highlight: This isolation limits cross-modal interaction, reduces generation quality, and complicates the incorporation of new conditions without retraining or redesigning the backbone. To address these limitations, we introduce MMCP-GEN, a DLM for Multi-Modal, Multi-Condition Protein sequence GENeration. |
Zeyu An; Wanyu Lin; Feng Tan; Shujun Wang; | code |
| 521 | Factorized Context Aggregation for Robust Cancer Risk Estimation Via Soft Re-Ranked Retrieval and Hierarchical Anchors Highlight: In this study, we propose a novel framework that leverages histopathology as a basis for outcome prediction, while using other data modalities when training the models. |
Puria Azadi Moghadam; Ali Khajegili Mirabadi; Behnam Maneshgar; Hossein Farahani; Ali Bashashati; | code |
| 522 | RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization Highlight: In this work, instead of pinhole and satellite images, we study robust MCVGL using holistic panoramas and OpenStreetMap (OSM). |
Junwei Zheng; Ruize Dai; Ruiping Liu; Zichao Zeng; Yufan Chen; Fangjinhua Wang; Kunyu Peng; Kailun Yang; Jiaming Zhang; Rainer Stiefelhagen; | code |
| 523 | GazeOnce360: Fisheye-Based 360° Multi-Person Gaze Estimation with Global–Local Feature Fusion Highlight: We present GazeOnce360, a novel end-to-end model for multi-person gaze estimation from a single tabletop-mounted upward-facing fisheye camera. |
Zhuojiang Cai; Zhenghui Sun; Feng Lu; | code |
| 524 | ReGenHOI: Unifying Reconstruction and Generation for 3D Human–Object Interaction Understanding Highlight: However, most existing methods treat these abilities as separate tasks, limiting their capacity to capture the unified nature of human spatial reasoning. To address this, we propose a unified framework that bridges reconstruction and generation through a shared semantic–geometric reasoning space. |
Miao Xu; Xiangyu Zhu; Zidu Wang; Xusheng Liang; Bao Li; Jinlin Wu; Zelin Zang; Zhen Lei; | code |
| 525 | VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-view Indoor 3D Object Detection Highlight: The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. |
Yang Cao; Feize Wu; Dave Zhenyu Chen; Yingji Zhong; Lanqing Hong; Dan Xu; | code |
| 526 | Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval Highlight: Existing methods suffer from incomplete global contextual perception, struggling with query ambiguity and local noise induced by spurious responses. To address these challenges, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. |
Jun Li; Xuhang Lou; Jinpeng Wang; Yuting Wang; Yaowei Wang; Shu-Tao Xia; Bin Chen; | code |
| 527 | Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models Highlight: We propose Beta-weighted Knowledge Distillation (β-KD), an adaptive, uncertainty-aware knowledge distillation framework that supports arbitrary distillation objectives under a unified Bayesian formulation. |
Jingchen Sun; Shaobo Han; Deep Patel; Wataru Kohno; Can Jin; Changyou Chen; | code |
| 528 | UZ3DVG: Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions Highlight: Existing methods rely on extra 2D images during inference and/or require multi-turn interactions with large language models (LLMs) or vision-language models (VLMs), which increase latency, computational cost, and deployment complexity. To overcome these limitations, we propose Unaided Zero-Shot 3D Visual Grounding with Generated Language Conditions (UZ3DVG), which is fed with 3D point clouds and textual descriptions only during inference and does not depend on external models. |
Wenbin Tan; Jiawen Lin; Yuan Xie; Yachao Zhang; Yanyun Qu; | code |
| 529 | OrionEdit: Bridging Reference and Source Images for Generalized Cross-Image Editing Highlight: Recently, a new paradigm has emerged that focuses on editing one image from another, enabling more direct and interpretable manipulation through reference exemplars. In this work, we formalize this paradigm as cross-image editing, which modifies a source image under the guidance of one or more references, encompassing subject replacement, style transfer, image completion, and other reference-to-source tasks. |
Zeyu Jiang; Lai Man Po; Xuyuan Xu; Yexin Wang; Guoping Gong; Haoxuan Wu; Chenbo Yan; Kun Li; Yuyang Liu; | code |
| 530 | DDSF: Robust Few-Shot Learning Via Disentangled Subspaces with Determinantal Point Process Highlight: We present a novel “Filter-Repair-Expand” framework grounded in Determinantal Point Process (DPP) theory. |
Xulun Ye; Yifan Mei; Kun Zhou; Zelei Wu; Jieyu Zhao; | code |
| 531 | LoD-Loc V3: Generalized Aerial Localization in Dense Cities Using Instance Silhouette Alignment Highlight: We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. |
Shuaibang Peng; Juelin Zhu; Xia Li; Kun Yang; Yu Liu; Maojun Zhang; Shen Yan; | code |
| 532 | SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation Highlight: We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. |
Naomi Kombol; Ivan Martinović; Siniša Šegvić; Giorgos Tolias; | code |
| 533 | SemLayer: Semantic Generative Segmentation and Layer Reconstruction for Vector Icons Highlight: This absence of semantic decomposition hinders downstream tasks such as editing, restyling, and animation. We formalize this problem as semantic layer construction for flattened vector art and introduce SemLayer, a visual generation–based pipeline that restores editable layered structures. |
Haiyang Xu; Ronghuan Wu; Li-Yi Wei; Nanxuan Zhao; Chenxi Liu; Cuong Nguyen; Zhuowen Tu; Zhaowen Wang; | code |
| 534 | Beyond Prompt Degradation: Prototype-guided Dual-pool Prompting for Incremental Object Detection Highlight: However, due to prompt coupling and prompt drift, these methods often suffer from prompt degradation during continual adaptation. To address these issues, we propose a novel prompt-decoupled framework called PDP. |
Yaoteng Zhang; Qing Zhou; Junyu Gao; Qi Wang; | code |
| 535 | RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding Highlight: We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. |
Hanqing Liu; Mingjie Liu; Luoping Cui; Endian Lin; Donghong Jiang; Chuang Zhu; | code |
| 536 | CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection Highlight: In this work, we pinpoint two primary culprits: 1) In certain domains, such as nighttime or rainy conditions, one modality experiences significant degradation; 2) The LiDAR branch tends to dominate the detection process, resulting in systematic under-exploitation of visual cues and vulnerability when point clouds are compromised. |
Yuchen Wu; Kun Wang; Yining Pan; Na Zhao; | code |
| 537 | HiFi-Brep: High-Fidelity B-Rep Latent Representation and Robust Generation Highlight: Our core insight is that robust, high-validity generation requires: first, building upon a compact and high-fidelity latent representation; and second, reformulating validity constraints as differentiable inductive biases within a single-stage generation process, enabling mutual guidance between geometry and topology. |
Junhao Hou; Chenqi Luo; PuFan Wang; Jiaying Lu; Yusheng Liu; Feiwei Qin; Meie Fang; Kun Zhou; | code |
| 538 | MMDIR: Multimodal Instruction-Driven Framework for Mixed-Degradation Document Image Restoration Highlight: Current approaches typically necessitate training multiple specialized models for specific degradation types or rely on explicit prior knowledge of degradation patterns to guide the training process. To overcome these limitations, we propose MMDIR, a multimodal instruction-driven framework designed for document image restoration under mixed and uncertain degradation conditions. |
Heng Li; Xingyuan Wang; Yang Fan; Yunan Zhang; Xiangping Wu; Qingcai Chen; | code |
| 539 | D2Dewarp: Dual Dimensions Geometric Representation Learning Based Document Image Dewarping Highlight: In this paper, we propose a fine-grained deformation perception model, called D2Dewarp, that focuses on the Dual Dimensions of document horizontal and vertical lines to improve document Dewarping. |
Heng Li; Xiangping Wu; Qingcai Chen; | code |
| 540 | Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness Highlight: In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs’ reasoning over rare objects by refining visual tokens and enriching input text prompts, without finetuning the VLMs. |
Xin Hu; Haomiao Ni; Yunbei Zhang; Jihun Hamm; Zechen Li; Zhengming Ding; | code |
| 541 | Uni-Hema: Unified Model for Digital Hematopathology Highlight: Whether single-task, vision-language, WSI-optimized, or single-cell hematology models, these approaches share a key limitation: they cannot provide unified, multi-task, multi-modal reasoning across the complexities of digital hematopathology. To overcome these limitations, we propose Uni-Hema, a multi-task, unified model for digital hematopathology integrating detection, classification, segmentation, morphology prediction, and reasoning across multiple diseases. |
Abdul Rehman; Iqra Rasool; Ayisha Imran; Mohsen Ali; Waqas Sultani; | code |
| 542 | Universal-to-Specific: Dynamic Knowledge-Guided Multiple Instance Learning for Few-Shot Whole Slide Image Classification Highlight: To ensure fidelity and prevent semantic drift, we introduce a Structural Consistency loss that enforces alignment between knowledge-instantiated and visual features. |
Junjian Li; Hulin Kuang; Jin Liu; Hailin Yue; Mengshen He; Jianxin Wang; | code |
| 543 | LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization Highlight: In this paper, we propose LEADER, a robust LiDAR-based localization framework enhanced by a simple yet effective geometric encoder. |
Jianshi Wu; Minghang Zhu; dq Liu; Wen Li; Sheng Ao; Siqi Shen; Chenglu Wen; Cheng Wang; | code |
| 544 | VMD-FACT: A New Video Dataset and MLLM-based Method for Detecting Realistic AI-Generated Video Misinformation Highlight: Thus, we introduce an AI-generative framework for producing realistic AI-generated video misinformation. |
Yongkang Zhang; Dongyu She; Baiyu Ji; Qichuan Geng; Zhong Zhou; Yan Wang; | code |
| 545 | TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model Highlight: However, they inherently suffer from certain limitations, such as being task-agnostic and exhibiting positional bias. In this work, we explore a new perspective on token importance assignment based on token transitions in LVLMs, where token transitions are defined as the changes in token representations occurring as they propagate through the model’s modules. |
Ao Li; Yuxiang Duan; Jinghui Zhang; Congbo Ma; Yutong Xie; Gustavo Carneiro; Mohammad Yaqub; Hu Wang; | code |
| 546 | DiT360: High-Fidelity Panoramic Image Generation Via Hybrid Training Highlight: In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. |
Haoran Feng; Dizhe Zhang; Xiangtai Li; Bo Du; Lu Qi; | code |
| 547 | Ultra-Fast Neural Video Compression Highlight: This paper introduces a chunk-based coding framework designed to significantly improve the rate-distortion-complexity trade-off. |
Jiahao Li; Wenxuan Xie; Zhaoyang Jia; Bin Li; Zongyu Guo; Xiaoyi Zhang; Yan Lu; | code |
| 548 | OpenDPR: Open-Vocabulary Change Detection Via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery Highlight: We reveal that category identification errors are the primary bottleneck of OVCD, mainly due to the limited ability of VLMs based on image-text matching to represent fine-grained land-cover categories. To address this, we propose OpenDPR, a training-free vision-centric diffusion-guided prototype retrieval framework. |
Qi Guo; Jue Wang; Yinhe Liu; Yanfei Zhong; | code |
| 549 | SenseSearch: Empowering Vision-Language Models with High-Resolution Agentic Search-Reasoning Via Reinforcement Learning Highlight: In the subsequent RL stage, we introduce the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to enhance the tool invocation and reasoning ability. |
Yong Xien Chng; Tao Hu; Wenwen Tong; Xueheng Li; Jiandong Chen; Haojia Yu; Jiefan Lu; Hewei Guo; Hanming Deng; Chengjun Xie; Gao Huang; Lewei Lu; | code |
| 550 | Enhancing Out-of-Distribution Detection with Extended Logit Normalization Highlight: Extensive work has focused on devising various scoring functions for detecting OOD samples, while only a few studies focus on training neural networks using certain model calibration objectives, which often lead to a compromise in predictive accuracy and support only limited choices of scoring functions. |
Yifan Ding; Xixi Liu; Jonas Unger; Gabriel Eilertsen; | code |
| 551 | ANTS: Adaptive Negative Textual Space Shaping for OOD Detection Via Test-Time MLLM Understanding and Reasoning Highlight: Furthermore, the absence of negative labels semantically similar to ID labels constrains their capability in near-OOD detection. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). |
Wenjie Zhu; Yabin Zhang; Xin Jin; Wenjun Zeng; Lei Zhang; | code |
| 552 | Duala: Dual-level Alignment of Subjects and Stimuli for Cross-Subject fMRI Decoding Highlight: However, existing methods often suffer from degraded performance when adapting to new subjects with limited data, as they struggle to preserve both the semantic consistency of stimuli and the alignment of brain responses. To address these challenges, we propose Duala, a dual-level alignment framework designed to achieve stimulus-level consistency and subject-level alignment in fMRI-based cross-subject visual decoding. |
Shumeng Li; Jintao Guo; Jian Zhang; Yulin Zhou; Luyang Cao; Yinghuan Shi; | code |
| 553 | Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning Highlight: A significant challenge in ID-PLL is instance entanglement, where instances from similar classes share overlapping features and candidate labels, resulting in increased class confusion. To address this issue, we propose a novel Class-specific Augmentation based Disentanglement (CAD) framework, which tackles instance entanglement by both intra- and inter-class regulations. |
Rui Zhao; Bin Shi; Kai Sun; Bo Dong; | code |
| 554 | BioVITA: Biological Dataset, Model, and Benchmark for Visual-Textual-Acoustic Alignment Highlight: We propose BioVITA, a novel visual-textual-acoustic alignment framework for biological applications. |
Risa Shinoda; Kaede Shiohara; Nakamasa Inoue; Kuniaki Saito; Hiroaki Santo; Fumio Okura; | code |
| 555 | DarkAct: A RGB-Thermal Dataset and Fusion Framework for Multimodal Low-Light Action Recognition Highlight: We introduce DarkAct, a large-scale and high-quality RGB–thermal video dataset purpose-built for multimodal action recognition under low illumination. |
Yuanjun Tan; Aoran Xiao; Liqian Deng; Zhigang Tu; | code |
| 556 | YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction Highlight: In this work, we release YieldSAT, a large, high-quality, and multimodal dataset for high-resolution crop yield prediction. |
Miro Miranda; Deepak Pathak; Patrick Helber; Benjamin Bischke; Hiba Najjar; Francisco Mena; Cristhian Sanchez; Akshay Pai; Diego Arenas; Matias Valdenegro; Marcela Charfuelan; Marlon Nuske; Andreas Dengel; | code |
| 557 | Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning Highlight: To understand this phenomenon, through extensive experiments, we interpret it as the model’s shortcut learning for domain adaptation: to overcome the huge domain gap between the source and target domains, the model shows a high tendency to push tokens that are initially closer to target-domain classes (i.e., simple tokens) to be even closer to these classes, exacerbating the attention sink and wasting the capability of learning other discriminative but initially further tokens (i.e., hard tokens). To address this, we propose a novel approach to dynamically re-weight tokens according to their relevance with target-domain classes during the target-domain finetuning, which explicitly suppresses the model’s reliance on these simple tokens and enhances the learning of hard tokens, reducing sink tokens and enhancing discriminability. |
Shuai Yi; Yixiong Zou; Yuhua Li; Ruixuan Li; | code |
| 558 | BlackMirror: Black-Box Backdoor Detection for Text-to-Image Models Via Instruction-Response Deviation Highlight: This paper investigates the challenging task of detecting backdoored text-to-image generative models under black-box settings and introduces a novel detection framework, BlackMirror. |
Feiran Li; Qianqian Xu; Shilong Bao; Zhiyong Yang; Xilin Zhao; Xiaochun Cao; Qingming Huang; | code |
| 559 | EpiAgent: An Agent-Centric System for Ancient Inscription Restoration Highlight: Inspired by the skill-coordinated workflow of human epigraphers, we propose EpiAgent, an agent-centric system that formalizes inscription restoration as a hierarchical planning problem. |
Shipeng Zhu; Ang Chen; Na Nie; Pengfei Fang; Min-Ling Zhang; Hui Xue; | code |
| 560 | SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model Highlight: We propose a decoupled 3D scene generation framework called SceneMaker in this work. |
Yukai Shi; Weiyu Li; Zihao Wang; Hongyang Li; Xingyu Chen; Ping Tan; Lei Zhang; | code |
| 561 | AERGS-SLAM: Auto-Exposure-Robust Stereo 3D Gaussian Splatting SLAM Highlight: However, existing research on 3DGS-based SLAM fails to accurately address the appearance variations induced by camera auto-exposure in prevalent real-world scenarios, resulting in reduced localization and photorealistic mapping accuracy. To address this issue, we propose stereo auto-exposure-robust Gaussian splatting SLAM (AERGS-SLAM), a framework that is robust to such variations and enables both reliable localization and exposure-controlled photorealistic mapping. |
Zhiyu Zhou; Feng Hui; Yu Liu; | code |
| 562 | Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision Highlight: We present MASS (MAsk-guided Self-Supervised learning), which treats in-context segmentation as the pretext task for learning general-purpose medical imaging representations. |
Yunhe Gao; Yabin Zhang; Chong Wang; Jiaming Liu; Maya Varma; Jean-Benoit Delbrouck; Akshay Chaudhari; Curtis Langlotz; | code |
| 563 | SynCLIP: Synonym-Coherent Language-Image Pretraining for Robust Open-Vocabulary Dense Perception Highlight: This inconsistency undermines the robustness and performance of existing methods in real-world OVDP applications. To address this issue, we propose SynCLIP, a Synonym-Coherent Language-Image Pretraining framework that enhances synonym-robust grounding for OVDP tasks. |
Mingjie Xie; heguangjun heguangjun; Dongli Xu; Youtian Lin; Hongjue Li; Pengming Feng; Jian Guan; Yue Deng; | code |
| 564 | Goldilocks Test Sets for Face Verification Highlight: To challenge models on variation of facial attributes, we propose Hadrian and Eclipse to address facial hair differences and face exposure differences. |
Haiyu Wu; Sicong Tian; Aman Bhatta; Jacob Gutierrez; Grace Bezold; Genesis Argueta; Karl Ricanek; Michael King; Kevin W. Bowyer; | code |
| 565 | SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos Through A Chain-of-Thought Benchmark Highlight: Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. |
Gui Wang; YongSong Zhou; Kaijun Deng; Wooi Ping Cheah; Rong Qu; Jianfeng Ren; Linlin Shen; | code |
| 566 | X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis Highlight: Current benchmarks, predominantly limited to single-modality data, lack the capacity to evaluate progressive reasoning and cross-modal integration essential for clinical practice. To bridge this gap, we introduce the Cross-Modality Progressive Clinical Reasoning (X-PCR) benchmark, the first comprehensive evaluation framework for MLLMs spanning the complete ophthalmology diagnostic workflow. |
Gui Wang; Zehao Zhong; YongSong Zhou; Yudong Li; Ende Wu; Wooi Ping Cheah; Rong Qu; Jianfeng Ren; Linlin Shen; | code |
| 567 | SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers Highlight: In this paper, we conduct an in-depth analysis of the attention mechanism in MMDiT and find that unsafe semantics concentrate within interpretable, low-dimensional subspaces at head level, where a finite set of safety-critical heads is responsible for unsafe feature extraction. |
Xiang Yang; Feifei Li; Mi Zhang; Geng Hong; Xiaoyu You; Min Yang; | code |
| 568 | Towards Multimodal Domain Generalization with Few Labels Highlight: We observe that existing approaches fail to address this setting effectively: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning methods ignore domain shifts, and semi-supervised domain generalization methods are confined to single-modality inputs. To overcome these limitations, we propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, which obtains reliable pseudo-labels through confident fused-unimodal consensus; Disagreement-Aware Regularization, which effectively utilizes ambiguous non-consensus samples; and Cross-Modal Prototype Alignment, which enforces domain- and modality-invariant representations while promoting robustness under missing modalities via cross-modal translation. |
Hongzhao Li; Hao Dong; Hualei Wan; Shupan Li; Mingliang Xu; Muhammad Haris Khan; | code |
| 569 | Learning and Aligning Click-Aware Shape Prior for Interactive Amodal Instance Segmentation Highlight: In this paper, we explore the task of interactive amodal segmentation, where a few user clicks are available for better segmenting the complete masks of object instances. |
Junjie Chen; Junwei Lin; Ren Hong; Shengjie Liu; Yuming Fang; Feng Qian; Yifan Zuo; | code |
| 570 | FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation Highlight: In this work, we propose Forward-Only Zeroth-Order Optimization (FOZO), a novel and practical backpropagation-free paradigm for TTA. |
Xingyu Wang; Tao Wang; | code |
| 571 | RaPA: Enhancing Transferable Targeted Attacks Via Random Parameter Pruning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we find that adversarial examples generated by existing methods rely heavily on a small subset of surrogate model parameters, which in turn limits their transferability to unseen target models. |
Tongrui Su; Qingbin Li; Shengyu Zhu; Wei Chen; Xueqi Cheng; | code |
| 572 | HumanBA: Human-Aware Bundle Adjustment Via Global Human-Camera Decoupling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce HumanBA, a human-aware bundle adjustment framework that transforms dynamic humans into usable constraints via motion decoupling. |
Fengyuan Yang; Tanuj Sur; Tze Ho Elden Tse; Angela Yao; | code |
| 573 | NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This constraint poses a bottleneck for large-scale PDE pre-training based on neural operators. To address these challenges, we propose a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. |
Dengdi Sun; Xiaoya Zhou; Xiao Wang; Hao Si; Wanli Lyu; Jin Tang; Bin Luo; | code |
| 574 | AdaBet: Gradient-free Layer Selection for Efficient Training of Deep Neural Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current approaches reduce training cost by selecting a subset of layers for retraining; however, they rely on labeled data, at least one full-model backpropagation, or server-side meta-training, limiting their suitability for constrained devices. We introduce AdaBet, a gradient-free layer selection approach to rank important layers, followed by important channels of these layers, by analyzing topological features of their activation spaces through Betti Numbers and using forward passes alone. |
Irene Tenison; Soumyajit Chatterjee; Fahim Kawsar; Mohammad Malekzadeh; | code |
| 575 | HOPS: Hierarchical Open-vocabulary Part Segmentation with Attention-Aware Filtering and Affinity-Guided Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing VLM-based methods face two challenges: (1) object over-segmentation, caused by overly broad semantic activations, and (2) part under-segmentation, resulting from weak fine-grained perception. To address these issues, we propose HOPS, a two-stage framework for hierarchical open-vocabulary part segmentation. |
Xinlong Li; Di Lin; Shaoyiyi Gao; Yaxuan Liu; Jixian He; Jiaxin Li; Ruonan Liu; Qing Guo; Kairui Yang; Wei Feng; | code |
| 576 | DART: Dynamic ModAlity-balanced Multimodal RepresenTation Learning for E-commerce Product Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address these, we propose DART, a Dynamic modAlity-balanced multimodal RepresenTation learning framework for e-commerce product understanding. |
Zhanheng Nie; ChengHanFu ChengHanFu; Daoze Zhang; Junxian Wu; Wanxian Guan; Pengjie Wang; Jian Xu; Bo Zheng; | code |
| 577 | Cinematic Audio Source Separation Using Visual Cues Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream setup. |
Kang Zhang; Suyeon Lee; Arda Senocak; Joon Chung; | code |
| 578 | Logit-Margin Repulsion for Backdoor Defense Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, most existing approaches fail to generalize to Transformer architectures. To address these challenges, we propose Logit Margin Repulsion (LMR), a universal and architecture-agnostic defense method. |
Zhiguo Yang; Dongsheng Xu; Ruizhi Zhong; Jiacheng Pi; Xingxing Huang; Wenjie Ruan; | code |
| 579 | RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce RAISE (Requirement-Adaptive Self-Improving Evolution), a training-free, requirement-driven evolutionary framework for adaptive T2I generation. |
Liyao Jiang; Ruichen Chen; Chao Gao; Di Niu; | code |
| 580 | KASALv2: Fully Automatic 3D Rotational Symmetry Classification and Axis Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work introduces a fully automatic and reference-free framework that performs symmetry-type classification, rotational-order identification, and full-axis localization across all eight canonical 3D rotational symmetry types. |
Mengxin Zhang; Yulin Wang; Chen LUO; Yongzhe Li; Yijun Zhou; | code |
| 581 | SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To bridge this gap, we construct a benchmark named OVCOD-D by augmenting carefully selected camouflaged object images with fine-grained textual descriptions. Due to the limited scale of available camouflaged object datasets, we adopt detectors pre-trained on large-scale object detection datasets as our baseline methods, as they possess stronger zero-shot generalization ability. |
Jiaming Liang; Yifeng Zhan; Chunlin Liu; Weihua Zheng; Bingye Peng; Qiwei Liang; Boyang Cai; Xiaochun Mai; Qiang Nie; | code |
| 582 | Collaborative Multi-Mode Pruning for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While pruning has emerged as an effective technique for compressing VLMs, existing approaches predominantly focus on a single mode by pruning either parameters or tokens and fail to fully explore the inherent redundancy in each mode, which leads to substantial performance degradation at high pruning ratios. To address the above limitations, we propose Collaborative Multi-Mode Pruning (CoMP), a novel framework tailored for VLMs by performing joint parameter and token pruning. |
Zimeng Wu; Yunhong Wang; Donghao Wang; Jiaxin Chen; | code |
| 583 | TAMER: A Tri-Modal Contrastive Alignment and Multi-Scale Embedding Refinement Framework for Zero-Shot ECG Diagnosis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, most existing self-supervised uni-modal methods suffer from limited representational capacity, while multi-modal frameworks are hindered by coarse-grained semantic alignment across modalities, thus restricting their generalizability in clinical settings. To address these limitations, we propose TAMER, a Tri-modal contrastive Alignment and Multi-scale Embedding Refinement framework that jointly models ECG recordings, spectrograms, and diagnostic reports. |
Xuewei Zhou; Yajie Meng; Pan Zeng; Xianfang Tang; Feifei Cui; Qiangguo Jin; Jialiang Yang; Junlin Xu; | code |
| 584 | Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our systematic analysis, leveraging Linear Discriminant Analysis (LDA), diagnoses this failure by identifying two core issues: severe data-level heterogeneity, which causes the feature distributions of real and synthetic images to increasingly overlap, and a critical model-level bottleneck from fixed, pretrained encoders that cannot adapt to the rising complexity. To address these challenges, we propose Generator-Aware Prototype Learning (GAPL), a framework that replaces unconstrained aggregation with a structured learning paradigm. |
Ziheng Qin; Yuheng Ji; Renshuai Tao; Yuxuan Tian; Yuyang Liu; Yipu Wang; Xiaolong Zheng; | code |
| 585 | TAPE: Task-Adaptive Prototype Evolution in Audio-Language Models for Fully Few-shot Class-incremental Audio Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a Task-Adaptive Prototype Evolution (TAPE) framework to help ALMs tackle the challenges of FFCAC, which consists of two key components: (1) a Task-Adapter that isolates audio features in a metric space to mitigate catastrophic forgetting while preserving knowledge across sessions, and (2) a Prototype Evolution mechanism that dynamically refines class prototypes using query samples during inference, thereby enabling adaptive learning and reducing overfitting. |
Yunlong Gao; Wenxin Liang; Guanglu Wang; Senqi Guan; Linlin Zong; Dongyu Zhang; Xinyue Liu; | code |
| 586 | Cross-Scale Pansharpening Via ScaleFormer and The PanScale Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To realize scale generalization, we propose ScaleFormer, a novel architecture designed for multi-scale pansharpening. |
Ke Cao; Xuanhua He; Xueheng Li; Lingting Zhu; Yingying Wang; Ao Ma; Zhanjie Zhang; Man Zhou; Chengjun Xie; Jie Zhang; | code |
| 587 | Debiased Sample Selection for Learning with Noisy Labels Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Both biases accumulate over training and degrade performance. To mitigate these issues, we propose Marginal Distribution Adjustment (MDA) and Candidate Class Selection (CCS). |
Weiran Pan; Wei Wei; Wenfeng Xie; | code |
| 588 | CLIP-like Model As A Foundational Density Ratio Estimator Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Through qualitative examples and an N-gram analysis of captions, we find that these divergences capture semantic diversity and mode structure in multimodal data. Leveraging this property, we introduce a simple KL-guided data curation method that achieves performance competitive with LAION2B filtering. |
Fumiya Uchiyama; Rintaro Yanagi; Shohei Taniguchi; Shota Takashiro; Masahiro Suzuki; Hirokatsu Kataoka; Yusuke Iwasawa; Yutaka Matsuo; | code |
| 589 | SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce SPARROW, a pixel-grounded video MLLM that unifies spatial accuracy and temporal stability through two key components: (i) Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and (ii) a dual-prompt design that decodes box ([BOX]) and segmentation ([SEG]) tokens to fuse geometric priors with semantic grounding. |
Mohamad Alansari; Naufal Suryanto; Divya Velayudhan; Sajid Javed; Naoufel Werghi; Muzammal Naseer; | code |
| 590 | RHCNet: Residual-Guided Hierarchical Calibration Network for Robust Underwater Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Underwater images commonly suffer from foreground-background ambiguity, loss of structural details, and severely reduced contrast, which collectively make underwater object detection (UOD) an inherently challenging task. To handle this issue, we present a residual-guided hierarchical calibration network (RHCNet) designed to achieve more efficient and robust UOD, which comprises a residual-guided feature enhancement module (RGFE) and a hierarchical feature calibration pyramid module (HFCP). |
Yueying Wang; Yiteng Guo; Weidong Zhang; Jie Wen; Liquan Shen; Huaicheng Yan; Xin Xu; | code |
| 591 | PAVAS: Physics-Aware Video-to-Audio Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into a latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). |
Oh Hyun-Bin; Yuhta Takida; Toshimitsu Uesaka; Tae-Hyun Oh; Yuki Mitsufuji; | code |
| 592 | VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. |
Jiayi Yuan; Haobo Jiang; De Soh Soh; Na Zhao; | code |
| 593 | RegionFuse: Region-Adaptive Pixel Distribution Learning for Infrared and Visible Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing methods typically employ fixed or sample-adaptive fusion paradigms where fusion weights are static or derived from global pixel distributions, they often overlook spatial inconsistencies in pixel distribution within images, leading to suboptimal performance. To address this issue, we propose RegionFuse, a Region-Adaptive Pixel Distribution Learning Network for IVIF, which dynamically generates fusion weights based on local pixel distributions to construct a region-wise adaptive fusion paradigm. |
Jianghan Xia; Hong Song; Jinfu Li; Yucong Lin; Shihan Ma; Jingfan Fan; Danni Ai; Tianyu Fu; Deqiang Xiao; Jian Yang; | code |
| 594 | Imbalanced View Contribution Evaluation and Refinement for Deep Incomplete Multi-View Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, we employ Shapley values to quantify the marginal contribution of each view, and incorporate imbalanced optimal transport to characterize distributional deviations across views. |
Taichun Zhou; Zhibin Dong; Hao Tan; Siwei Wang; Xinwang Liu; En Zhu; Di Hu; Tianrui Liu; Chuankun Li; Kunlun He; | code |
| 595 | STRNet: Visual Navigation with Spatio-Temporal Representation Through Dynamic Graph Aggregation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a unified spatio-temporal representation framework that enhances visual encoding for robotic navigation. |
Hao Ren; Zetong Bi; Yiming Zeng; Zhaoliang Wan; Lu Qi; Hui Cheng; | code |
| 596 | Towards Motion Turing Test: Evaluating Human-Likeness in Humanoid Robots Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite recent progress in multimodal large language models, we find that they remain inadequate for assessing motion human-likeness. To address this gap, we propose a simple baseline model and demonstrate that it outperforms several contemporary LLM-based approaches. |
Mingzhe Li; Mengyin Liu; Zekai Wu; Xincheng Lin; Junsheng Zhang; Ming Yan; Zengye Xie; Changwang Zhang; Chenglu Wen; Lan Xu; Siqi Shen; Cheng Wang; | code |
| 597 | Verifying Neural Network Robustness with Dual Perturbations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This gap prevents assessing robustness under realistic conditions where both perturbation types occur simultaneously. To address these limitations, we propose VeriDou, a framework that introduces: (i) universal convolutional perturbations that enable verification across continuous spatial distortion spaces, and (ii) dual perturbations that capture both convolutional distortions and independent pixel-level variations. |
Hai Duong; Son Vu; Thanh Le; ThanhVu Nguyen; | code |
| 598 | Selfi: Self Improving Reconstruction Engine Via 3D Geometric Feature Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. |
Youming Deng; Songyou Peng; Junyi Zhang; Kathryn Heal; Tiancheng Sun; John Flynn; Steve Marschner; Lucy Chai; | code |
| 599 | An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing benchmarks typically provide incomplete and low-resolution geometry without instance-level annotations, limiting the development of models capable of achieving precise geometric reconstruction, reliable occlusion reasoning, and holistic 3D understanding. To address these challenges, this paper presents an instance-centric benchmark for the 3D panoptic occupancy prediction task. |
Yi Feng; Junwu E; Zizhan Guo; Yu Ma; Hanli Wang; Rui Fan; | code |
| 600 | Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In contrast to crowds, plants are complicated by nonrigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant counting benchmark taking plant taxonomy into account. |
Jinyu Xu; Tianqi Hu; Xiaonan Hu; Letian Zhou; Songliang Cao; Meng Zhang; Hao Lu; | code |
| 601 | Incentivizing Versatile Video Reasoning in MLLMs Via Data-Efficient Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these methods have two main problems: first, the RL framework they use suffers from unstable training and high training costs, making it difficult to train satisfactory video reasoning models; second, it is difficult for the linguistic reasoning process to guarantee the reliability of visual information. To alleviate these problems, we propose to use multimodal elements for reasoning, and we design a novel framework to build and enhance versatile video reasoning capabilities on MLLMs. |
Xiaodong Wang; Zhirong Wu; Langling Huang; Yuxi Zheng; Peixi Peng; | code |
| 602 | LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Accurate metric depth is critical for autonomous driving perception and simulation, yet current approaches struggle to achieve high metric accuracy, multi-view and temporal consistency, and cross-domain generalization. To address these challenges, we present MVS-Pro, a novel multi-view stereo framework that reconciles these competing objectives through two key insights: (1) Sparse but metrically accurate LiDAR observations can serve as geometric prompts to anchor depth estimation in absolute scale, and (2) deep fusion of diverse cues is essential for resolving ambiguities and enhancing robustness, while a spatio-temporal decoder ensures consistency across frames. |
Qihao Sun; Jiarun Liu; Ziqian Ni; Jianyun Xu; Sheng Yang; Tao Xie; Lijun Zhao; Ruifeng Li; | code |
| 603 | Geometric-Photometric Event-based 3D Gaussian Ray Tracing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work proposes a framework to address the trade-off between accuracy and temporal resolution in the event-based 3DGS. |
Kai Kohyama; Yoshimitsu Aoki; Guillermo Gallego; Shintaro Shiba; | code |
| 604 | Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, relying solely on MLLMs for camouflaged object discovery often leads to inaccurate localization, false positives, and missed detections. To address these issues, we propose the Discover-Segment-Select (DSS) mechanism, a three-stage framework that progressively refines the segmentation process. |
Yilong Yang; Jianxin Tian; Shengchuan Zhang; Liujuan Cao; | code |
| 605 | Deep Feature Deformation Weights Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce an improved feature distillation pipeline, barycentric feature distillation, which leverages the full visual signal from shape renders to make the compute cost robust to mesh resolution. |
Richard Liu; Itai Lang; Rana Hanocka; | code |
| 606 | RemedyGS: Defend 3D Gaussian Splatting Against Computation Cost Attacks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent studies have revealed critical vulnerabilities in this pipeline and introduced computation cost attacks that lead to malicious resource occupancies and even denial-of-service (DoS) conditions, thereby hindering the reliable deployment of 3DGS. In this paper, we propose the first effective and comprehensive black-box defense framework, named RemedyGS, against such computation cost attacks, safeguarding 3DGS reconstruction systems and services. |
Yanping LI; Zhening Liu; Zijian Li; Zehong Lin; Jun Zhang; | code |
| 607 | FailureAtlas: Mapping The Failure Landscape of T2I Models Via Active Exploration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We argue for a complementary paradigm: active exploration, and introduce FailureAtlas, the first framework designed to autonomously explore and map the vast failure landscapes of T2I models at scale. |
Muxi Chen; Zhaohua Zhang; Chenchen Zhao; Mingyang Chen; Wenyu Jiang; Tianwen Jiang; Jianhuan Zhuo; Yutang Yutang; Qiuyong Xiao; Jihong Zhang; Qiang Xu; | code |
| 608 | Personalized Image Descriptions from Attention Sequences Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing models for personalized image description focus on linguistic style alone, with no prior work leveraging individual viewing patterns. We address this gap by explicitly modeling personalized viewing behavior as a core factor in description generation. |
Ruoyu Xue; Hieu Le; Jingyi Xu; Sounak Mondal; Abe Leite; Gregory Zelinsky; Minh Nguyen Nguyen; Dimitris Samaras; | code |
| 609 | ELiC: Efficient LiDAR Geometry Compression Via Cross-Bit-depth Feature Propagation and Bag-of-Encoders Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present ELiC, a real-time framework that combines cross-bit-depth feature propagation, a Bag-of-Encoders (BoE) selection scheme, and a Morton-order-preserving hierarchy. |
Junsik Kim; Gun Bang; Soowoong Kim; | code |
| 610 | SECOS: Semantic Capture for Rigorous Classification in Open-World Semi-Supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing OWSSL methods fail to achieve this because novel samples are trained without explicit supervision, and these methods lack mechanisms to extract latent semantic information, resulting in predicted labels that have no semantic correspondence to candidate textual labels. To address this, we introduce SEmantic Capture for Open-world Semi-supervised learning (SECOS), which directly predicts textual labels from the candidate set without post-processing, meeting the requirements of practical OWSSL applications. |
Hezhao Liu; Jiacheng Yang; Junlong Gao; Mengke Li; Yiqun Zhang; Shreyank Gowda Gowda; Yang Lu; | code |
| 611 | ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Lacking an invariant geometric–topological coupling between lanes and the underlying road surface, 2D-to-3D lifting is ill-posed and brittle, often degenerating into concavities, bulges, and twists. To address this, we propose the Road-Manifold Assumption: the road is a smooth 2D manifold in $\mathbb{R}^3$, lanes are embedded 1D submanifolds, and sampled lane points are dense observations, coupling metric and topology across surfaces, curves, and samples. |
Chengzhi Hong; Bijun Li; | code |
| 612 | Elucidating The Design Space of Arbitrary-Noise-Based Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To interpret diverse methods for handling distinct noise patterns within a unified theoretical framework and to minimize the restoration distance, we propose EDA, which Elucidates the Design space of Arbitrary-noise diffusion models. |
Xingyu Qiu; Mengying Yang; Xinghua Ma; Dong Liang; Fanding Li; Gongning Luo; Wei Wang; Kuanquan Wang; Shuo Li; | code |
| 613 | TrackMAE: Video Representation Learning Via Track Mask and Predict Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: As a result, such models struggle on motion-centric tasks that require fine-grained motion awareness. To address this, we propose TrackMAE, a simple masked video modeling paradigm that explicitly uses motion information as a reconstruction signal. |
Renaud Vandeghen; Fida Mohammad Thoker; Marc Van Droogenbroeck; Bernard Ghanem; | code |
| 614 | FedRAC: Rolling Submodel Allocation for Collaborative Fairness in Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods often under-reward low-contributing clients in the early training stage and neglect critical issues, such as consistency across local models or unequal neuron training frequencies in the aggregated model, both of which lead to degraded performance. To address these issues, we propose FedRAC, a novel Federated learning framework employing Rolling submodel Allocation for Collaborative fairness, without compromising the global model performance. |
Zihui Wang; Yuhang Fu; Mengmeng Du; Zhimin Yuan; Yachen Liu; Weisheng Liao; Kaiyu Wang; Zheng Wang; | code |
| 615 | NVGS: Neural Visibility for Occlusion Culling in 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the semi-transparent nature of Gaussians prevents the application of another highly effective technique: occlusion culling. We address this limitation by proposing a novel method to learn the viewpoint-dependent visibility function of all Gaussians in a trained model using a small, shared MLP across instances of an asset in a scene. |
Brent Zoomers; Florian Hahlbohm; Joni Vanherck; Lode Jorissen; Marcus Magnor; Nick Michiels; | code |
| 616 | Love Me, Love My Label: Rethinking The Role of Labels in Prompt Retrieval for Visual In-Context Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Besides, to handle unavailable query labels at test time, we introduce a mixture-of-expert mechanism to the dual encoders with query-adaptive routing. |
Tianci Luo; Haohao Pan; Jinpeng Wang; Niu Lian; Xinrui Chen; Bin Chen; Shu-Tao Xia; Chun Yuan; | code |
| 617 | ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To realize ACPV, we propose ACPV-Net, a unified framework introducing a novel Semantically Supervised Conditioning (SSC) mechanism coupling semantic perception with geometric primitive generation, along with a topological reconstruction that guarantees shared-edge consistency by design. |
Weiqin Jiao; Hao Cheng; George Vosselman; Claudio Persello; | code |
| 618 | Catch Me If You Can: Active Mapping of Moving 3D Objects Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object’s motion. |
Davide Allegro; Shiyao Li; Stefano Ghidoni; Vincent Lepetit; | code |
| 619 | Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions based on speech by modeling emotion semantic vectors between speech and visual feature spaces. |
Chanhyuk Choi; Taesoo Kim; Donggyu Lee; Siyeol Jung; Taehwan Kim; | code |
| 620 | Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. |
Peng Sun; Jun XIE; Tao Lin; | code |
| 621 | SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step. To bridge this interpretability gap, we propose SegCompass, an end-to-end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway. |
Zhenyu Lu; Liupeng Li; Jinpeng Wang; Haoqian Kang; Yan Feng; Ke Chen; Yaowei Wang; | code |
| 622 | PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we focus on improving two key factors: lip-audio alignment control (LAC) and emotion control (EMC), to enhance the diversity and user-friendliness of talking videos. |
Baiqin Wang; Xiangyu Zhu; Fan Shen; HAO XU; Zhen Lei; | code |
| 623 | CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce CLIPoint3D, the first framework for few-shot unsupervised 3D point cloud domain adaptation built upon CLIP. |
Mainak Singha; Sarthak Mehrotra; Paolo Casari; Subhasis Chaudhuri; Elisa Ricci; Biplab Banerjee; | code |
| 624 | SMAP: Semantic Route Planning with Map-Grounded Multimodal Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by how humans use maps for route planning, we propose SMAP, the first multimodal framework combining user queries, POI metadata, and map tiles to produce spatially coherent, preference-aware routes. |
Wenjie Zhang; Chen Yang; Xin Lu; Zhen Wang; Yue Liu; Bobo Xi; Pengbo Zhang; | code |
| 625 | One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework Highlight: We present a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need for region-level supervision. | Lorenzo Bianchi; Giacomo Pacini; Fabio Carrara; Nicola Messina; Giuseppe Amato; Fabrizio Falchi; | code |
| 626 | ZOO-Prune: Training-Free Token Pruning Via Zeroth-Order Gradient Estimation in Vision-Language Models Highlight: We propose ZOO-Prune, a training-free framework built on the intuition that highly sensitive tokens have a stronger influence on the model’s output and capture complementary visual cues rather than redundant ones. | Youngeun Kim; Youjia Zhang; Huiling Liu; Aecheon Jung; Sunwoo Lee; Sungeun Hong; | code |
| 627 | Accelerating Diffusion Via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling Highlight: Nevertheless, current diffusion acceleration methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achieve substantial acceleration proportional to the number of GPUs. Therefore, we propose a hybrid parallelism framework that combines a novel data parallel strategy, condition-based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency and achieve high generation quality in conditional diffusion models. | Euisoo Jung; Byunghyun Kim; Hyunjin Kim; Seonghye Cho; Jae-Gil Lee; | code |
| 628 | RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation Via Reinforcement Learning Fine-Tuning Highlight: We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. | Ehsan Ahmadi; Hunter Schofield; Behzad Khamidehi; Fazel Arasteh; Jinjun Shan; Lili Mou; Dongfeng Bai; Kasra Rezaee; | code |
| 629 | Free-Lunch Long Video Generation Via Layer-Adaptive O.O.D Correction Highlight: Directly applying these models for long-video inference often leads to a notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (O.O.D) problems: frame-level relative position O.O.D and context-length O.O.D. To address these challenges, we propose a novel training-free, layer-adaptive framework. | Jiahao Tian; Chenxi Song; Wei Cheng; Chi Zhang; | code |
| 630 | FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement Highlight: However, normal features in images often occlude anomalous features, leading to coarse localization and limited discriminability. To address this challenge, we propose **FB-CLIP**, which enhances foreground features while suppressing irrelevant background interference to improve anomaly detection performance. | Ming Hu; Yongsheng Huo; Mingyu Dou; Jianfu Yin; Peng Zhao; Yao Wang; Cong Hu; Bingliang Hu; Quan Wang; | code |
| 631 | Training-free Detection of Generated Videos Via Spatial-Temporal Likelihoods Highlight: We introduce STALL, a simple, training-free, theoretically justified detector that provides likelihood-based scoring for videos, jointly modeling spatial and temporal evidence within a probabilistic framework. | Omer Ben Hayun; Roy Betser; Meir Yossef Levi; Levi Kassel; Guy Gilboa; | code |
| 632 | Generalizable Knowledge Distillation from Vision Foundation Models for Semantic Segmentation Highlight: We propose Generalizable Knowledge Distillation (GKD), a multi-stage framework that explicitly enhances generalization. | Chonghua Lv; Dong Zhao; Shuang Wang; Dou Quan; Ning Huyan; Nicu Sebe; Zhun Zhong; | code |
| 633 | PlanaReLoc: Camera Relocalization in 3D Planar Primitives Via Region-based Structure Matching Highlight: Through extensive experiments on the *ScanNet* and *12Scenes* datasets across hundreds of scenes, our method demonstrates the superiority of planar primitives in facilitating reliable cross-modality structural correspondences and achieving effective camera relocalization without requiring realistically textured/colored maps, pose priors, or per-scene training. | Hanqiao Ye; Yuzhou Liu; Yangdong Liu; Shuhan Shen; | code |
| 634 | What’s Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution Highlight: In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitations: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. | Xingsong Ye; Yongkun Du; Jiaxin Zhang; Chen Li; Jing LYU; Zhineng Chen; | code |
| 635 | PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting Highlight: PET plays a vital role in oncology, but automating report generation is difficult due to the complexity of whole-body 3D volumes, the wide range of potential clinical findings, and the limited availability of annotated datasets. To address these challenges, we introduce PETARSeg-11K, the first large-scale, publicly available dataset that provides lesion-level correspondence between 3D PET/CT volumes and free-text radiological findings. | Danyal Maqbool; Changhee Lee; Zachary Huemann; Samuel Church; Matthew Larson; Scott Perlman; Tomas Romero; Joshua Warner; Meghan Lubner; Xin Tie; Jameson Merkow; Junjie Hu; Steve Cho; Tyler Bradshaw; | code |
| 636 | Camouflage-aware Image-Text Retrieval Via Expert Collaboration Highlight: As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. | Yao Jiang; Zhongkuan Mao; xuan wu; Keren Fu; Qijun Zhao; | code |
| 637 | Act2See: Emergent Active Visual Perception for Video Reasoning Highlight: We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. | Martin Q. Ma; Yuxiao Qu; Aditya Agrawal; Willis Guo; Paul Pu Liang; Ruslan Salakhutdinov; Louis-Philippe Morency; | code |
| 638 | ViHOI: Human-Object Interaction Synthesis with Visual Priors Highlight: Specifically, we introduce ViHOI, a novel framework that enables diffusion-based generative models to leverage rich, task-specific priors from 2D images to enhance generation quality. | Songjin Cai; Linjie Zhong; Ling Guo; Changxing Ding; | code |
| 639 | Decision Boundary-aware Generation for Long-tailed Learning Abstract: Long-tailed data bias decision boundaries toward head classes and degrade tail class accuracy. Diffusion-based generative augmentation addresses this problem by generating … | jiacheng yang; Ruichi Zhang; Chikai Shang; Mengke Li; Xinyi Shang; Junlong Gao; Yonggang Zhang; Yang Lu; | code |
| 640 | From Spots to Pixels: Dense Spatial Gene Expression Prediction from Histology Images Highlight: However, these methods inherently lose the spatial resolution in gene expression: 1) each spot often contains multiple cells with distinct gene expression profiles; 2) spots are typically defined at fixed spatial resolutions, limiting the ability to predict gene expression at varying scales. To address these limitations, this paper presents PixNet, a dense prediction network capable of predicting spatially resolved gene expression across spots of varying sizes and scales directly from histopathology slide images. | Ruikun Zhang; Yan Yang; Liyuan Pan; | code |
| 641 | Reparameterized Tensor Ring Functional Decomposition for Multi-Dimensional Data Recovery Highlight: In this work, we propose a TR functional decomposition for both meshgrid and non-meshgrid data, where factors are parameterized by Implicit Neural Representations (INRs). | Yangyang Xu; Junbo Ke; You-Wei Wen; Chao Wang; | code |
| 642 | Photo3D: Advancing Photorealistic 3D Generation Through Structure‑Aligned Detail Enhancement Highlight: We introduce Photo3D, a framework for advancing photorealistic 3D generation, which is driven by the image data generated by the GPT‑4o‑Image model. | Xinyue Liang; Zhiyuan Ma; Lingchen Sun; Yanjun Guo; Lei Zhang; | code |
| 643 | Explaining Object Detectors Via Collective Contribution of Pixels Highlight: However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. | Toshinori Yamauchi; Hiroshi Kera; Kazuhiko Kawamoto; | code |
| 644 | COG: Confidence-aware Optimal Geometric Correspondence for Unsupervised Single-reference Novel Object Pose Estimation Highlight: We propose Confidence-aware Optimal Geometric Correspondence (COG), an unsupervised framework that formulates correspondence estimation as a confidence-aware optimal transport problem. | Yuchen Che; JINGTU WU; Hao ZHENG; Asako Kanezaki; | code |
| 645 | Prompt-Anchored Vision–Text Distillation for Lifelong Person Re-identification Highlight: To leverage the synergy between vision and text, we propose Prompt-Anchored vision–text Distillation (PAD), a unified framework that enhances semantic alignment and cross-domain generalization. | Wen Wen; Hao CHEN; Shiliang Zhang; | code |
| 646 | VDOT: Efficient Unified Video Creation Via Optimal Transport Distillation Highlight: However, existing video creation models either focus solely on a few specific conditions or suffer from excessively long generation times due to complex model inference, making them impractical for real-world applications. To mitigate these issues, we propose an efficient unified video creation model, named VDOT. | Yutong Wang; Haiyu Zhang; Tianfan Xue; Yu Qiao; Yaohui Wang; Chang Xu; Xinyuan Chen; | code |
| 647 | Act Like A Pathologist: Tissue-Aware Whole Slide Image Reasoning Highlight: In this work, we try to bring models closer to how humans actually examine slides. | Wentao Huang; Weimin Lyu; Peiliang Lou; Qingqiao Hu; Xiaoling Hu; Shahira Abousamra; Wenchao Han; Ruifeng Guo; Jiawei Zhou; Chao Chen; Chen Wang; | code |
| 648 | Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification Highlight: This increase results in a reduced occurrence of pseudo-bags containing few critical instances, thereby limiting model performance, particularly on test slides with small tumor areas. To address this, we introduce a bag-level and group-level contrastive learning framework to enhance the discrimination of features with distinct semantic meanings, thereby improving model performance. | Bo Zhang; Xu Xinan; Shuo Yan; Yu Bai; Zheng Zhang; Wufan Wang; Hui Gao; Wendong Wang; | code |
| 649 | Test-time Sparsity for Extreme Fast Action Diffusion Highlight: Despite current technologies showing promise in accelerating diffusion transformers by reusing the cached features, they struggle to adapt to policy dynamics arising from diverse perceptions and multi-round rollout iterations in open environments. We propose test-time sparsity to tackle this challenge, which aims to accelerate action diffusion by dynamically predicting prunable residual computations for each model forward at test time. | Kangye Ji; Yuan Meng; Jianbo Zhou; Ye Li; Chen Tang; Zhi Wang; | code |
| 650 | EVLF: Early Vision-Language Fusion for Generative Dataset Distillation Highlight: Although this strategy enforces label relevance, it diminishes the contribution of visual latents, resulting in over-corrected samples that mirror prompt patterns rather than reflecting intrinsic visual features. To solve this problem, we introduce an Early Vision–Language Fusion (EVLF) method that aligns textual and visual embeddings at the transition between the encoder and the generative backbone. | WENQI CAI; Yawen Zou; Guang Li; Chunzhi Gu; Chao Zhang; | code |
| 651 | SCE-Depth: A Spherical Compound Eye Framework for Wide FOV Depth Estimation Highlight: Herein, we propose SCE-Depth, a bio-inspired framework for spherical compound eye depth estimation, which processes spherical images natively on a HEALPix grid using a spherical neural network. | Yi Zhu; Hao Xiong; Lin Xiao; Ranfeng Shi; Qinying Gu; Leilei Gu; | code |
| 652 | From None to All: Self-Supervised 3D Reconstruction Via Novel View Synthesis Highlight: In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. | Ranran Huang; Weixun Luo; Ye Mao; Krystian Mikolajczyk; | code |
| 653 | Diversity Over Uniformity: Rethinking Representation in Generated Image Detection Highlight: We argue that reliable generated image detection should not depend on a single decision path but should preserve multiple judgment perspectives, enabling the model to understand the differences between real and generated images from diverse viewpoints. Based on this idea, we propose an anti-feature-collapse learning framework that filters task-irrelevant components and suppresses excessive overlap among different forgery cues in the representation space, preventing discriminative information from collapsing into a few dominant feature directions. | Qinghui He; Haifeng Zhang; Qiao Qin; Bo Liu; Xiuli Bi; Bin Xiao; | code |
| 654 | DBMSolver: A Training-free Diffusion Bridge Sampler for High-Quality Image-to-Image Translation Highlight: We introduce **DBMSolver**, a training-free sampler that exploits the semi-linear structure of DBM’s underlying SDE and ODE via exponential integrators, yielding exact 1st- and 2nd-order solutions. | SANKARSHANA VENUGOPAL; Mohammad Mostafavi; Jonghyun Choi; | code |
| 655 | Towards Stealthy and Effective Backdoor Attacks on Lane Detection: A Naturalistic Data Poisoning Approach Highlight: Existing backdoor attack methods on LD often exhibit limited practical utility due to the artificial and conspicuous nature of their triggers. To address this limitation and investigate the impact of more ecologically valid backdoor attacks on lane detection models, we examine the common data poisoning attack and introduce DBALD, a novel diffusion-based data poisoning framework for generating naturalistic backdoor triggers. | YIFAN LIAO; Yuxin Cao; Yedi Zhang; Wentao He; Yan XIAO; Xianglong Du; Zhiyong Huang; Jin Song Dong; | code |
| 656 | Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation Highlight: In this work, we formulate an open-set domain adaptation setting and propose the Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg) framework that trains on local perspective views and tests on full 360° panoramic images, explicitly tackling both geometric FoV shifts across domains and semantic uncertainty arising from previously unseen classes. | Yuanfan Zheng; Kunyu Peng; Xu Zheng; Kailun Yang; | code |
| 657 | LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation Via Spatial-Temporal Modulation Highlight: To address these, we propose the Lagrangian-Dynamic Informed Network (LaDy), a framework integrating principles of Lagrangian dynamics into the segmentation process. | Haoyu Ji; Xueting Liu; Yu Gao; Wenze Huang; Zhihao Yang; Weihong Ren; Zhiyong Wang; Honghai LIU; | code |
| 658 | Spectral Scalpel: Amplifying Adjacent Action Discrepancy Via Frequency-Selective Filtering for Skeleton-Based Action Segmentation Highlight: However, existing STAS methodologies face challenges of limited inter-class discriminability and blurred segmentation boundaries, primarily due to insufficient distinction of spatio-temporal patterns between adjacent actions. To address these limitations, we propose Spectral Scalpel, a frequency-selective filtering framework aimed at suppressing shared frequency components between adjacent distinct actions while amplifying their action-specific frequencies, thereby enhancing inter-action discrepancies and sharpening transition boundaries. | Haoyu Ji; Bowen Chen; Zhihao Yang; Wenze Huang; Yu Gao; Xueting Liu; Weihong Ren; Zhiyong Wang; Honghai LIU; | code |
| 659 | Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation Highlight: Second, the ground often has a monotonous appearance, making it difficult to extract informative features. To tackle these issues, we present a FEet COntact estimation (FECO) framework that learns dense foot contact with shoe style-invariant and ground-aware learning. | Daniel Jung; Kyoung Mu Lee; | code |
| 660 | PDD: Manifold-Prior Diverse Distillation for Medical Anomaly Detection Highlight: Through systematic Grad-CAM analysis, we reveal that discriminative activation maps fail on medical data, unlike their success on industrial datasets, motivating the need for manifold-level modeling. We propose **PDD** (Manifold-Prior Diverse Distillation), a framework that unifies dual-teacher priors into a shared high-dimensional manifold and distills this knowledge into dual students with complementary behaviors. | Xijun Lu; Hongying Liu; Fanhua Shang; Yanming hui; Liang Wan; | code |
| 661 | HDW-SR: High-Frequency Guided Diffusion Model Based on Wavelet Decomposition for Image Super-Resolution Highlight: Diffusion-based methods have shown great promise in single image super-resolution (SISR); however, existing approaches often produce blurred fine details due to insufficient guidance in the high-frequency domain. To address this issue, we propose a High-Frequency Guided Diffusion Network based on Wavelet Decomposition (HDW-SR), which replaces the conventional U-Net backbone in diffusion frameworks. | Chao Yang; Boqian Zhang; Jinghao Xu; Guang Jiang; | code |
| 662 | Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation Highlight: In this work, we introduce a novel balanced multi-modal learning method for 3D HPE, which harnesses the power of RGB, LiDAR, mmWave, and WiFi. | Mengshi Qi; Jiaxuan Peng; Xianlin Zhang; Huadong Ma; | code |
| 663 | Clay-to-Stone: Phase-wise 3D Gaussian Splatting for Monocular Articulated Hand-Object Manipulation Modeling Highlight: When applied to articulated manipulations, the continuous joint rotations and frequent component deformations introduce a strong coupling between shape and motion, leading to severe ambiguity and instability in articulation optimization under monocular observation. To address this challenge, we propose a Clay-to-Stone dual-phase framework, modeling the articulated manipulation at hierarchical granularities, enabling a progression from flexible semantic exploration to structured articulation recovery. | Xingyu Liu; Pengfei Ren; Qi Qi; Haifeng Sun; Zirui Zhuang; Jianxin Liao; Jingyu Wang; | code |
| 664 | RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding Highlight: However, existing benchmarks primarily target perception tasks such as detection or segmentation, overlooking the reasoning capabilities required to infer road topology and dynamic scene structure. To address this gap, we present RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate and advance visual reasoning in complex road environments. | Xiyan Liu; Han Wang; Yuhu Wang; JUNJIE CAI; Zhe Cao; Yangjianzhong Yangjianzhong; Zhen Lu; | code |
| 665 | RARE: Learn to RAnk and REtrieve for Monocular 3D Object Detection Highlight: This paper introduces RARE, a unified framework that addresses both challenges through learning to rank and retrieve. | Hyeonjeong Park; Peixi Xiong; Xiaoqian Ruan; Dian Jia; Pei Yu; Wei Tang; | code |
| 666 | Generate, Analyze, and Refine: Training-Free Sound Source Localization Via MLLM Meta-Reasoning Highlight: Inspired by human metacognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). | Subin Park; Jung Uk Kim; | code |
| 667 | ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation Highlight: This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-Merging, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. | Bo Xu; Haotian Wu; Hehai Lin; Weiquan Huang; Beier Zhu; Yao Shu; Chengwei Qin; | code |
| 668 | Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity Highlight: This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often *too vivid to be real* even when prompted for realistic-style images. To address this issue, we present **Color Fidelity Dataset (CFD)** and **Color Fidelity Metric (CFM)** for objective evaluation of color fidelity in realistic-style generations. | Zhengyao Fang; Zexi Jia; Yijia Zhong; Pengcheng Luo; Jinchao Zhang; Guangming Lu; Jun Yu; Wenjie Pei; | code |
| 669 | $A^2$GC: $A$symmetric $A$ggregation with Geometric Constraints for Locally Aggregated Descriptors Highlight: We propose an asymmetric aggregation VPR method with geometric constraints for locally aggregated descriptors, called $A^2$GC-VPR. | Zhenyu Li; Tianyi Shang; | code |
| 670 | Semi-supervised Echocardiography Video Segmentation Via Anchor Semantic Awareness and Continuous Pseudo-label Reforging Highlight: However, it is an extremely challenging task to obtain high-quality segmentation results throughout the cardiac cycle owing to (1) the inherent speckle noise in echocardiography videos, (2) the complex dynamic motions of cardiac structures, and (3) the scarcity of annotated data. To comprehensively address these challenges, we propose a novel semi-supervised model, which can achieve accurate and real-time echocardiography video segmentation with very limited annotations. | Yunpeng Fang; Yimu Sun; Jingxing Guo; Huisi Wu; Jing Qin; | code |
| 671 | Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition Highlight: The two-stage decoupling paradigm, which separates representation learning from classifier retraining, offers a promising solution. | Shuo Zhang; Chenqi Li; Tingting Zhu; | code |
| 672 | ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology Highlight: We introduce ProM3E, a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations for ecology. | Srikumar Sastry; Subash Khanal; Aayush Dhakal; Jiayu Lin; Daniel Cher; Phoenix Jarosz; Nathan Jacobs; | code |
| 673 | MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation Highlight: In this paper, we propose MambaLiteUNet, a compact yet robust segmentation framework that integrates Mamba state space modeling into a U-Net architecture, along with three key modules: Adaptive Multi-Branch Mamba Feature Fusion (AMF), Local Global Feature Mixing (LGFM), and Cross-Gated Attention (CGA). | Md Maklachur Rahman; Soon Ki Jung; Tracy Hammond; | code |
| 674 | BOP-ASK: Object-Interaction Reasoning for Vision-Language Models Highlight: In this work, we present BOP-ASK, a novel large-scale dataset for object-interaction reasoning for both training and benchmarking. | Vineet Bhat; Sungsu Kim; Valts Blukis; Greg Heinrich; Prashanth Krishnamurthy; Ramesh Karri; Stan Birchfield; Farshad Khorrami; Jonathan Tremblay; | code |
| 675 | Adaptive Confidence Regularization for Multimodal Failure Detection Highlight: In this work, we address the largely unexplored problem of failure detection in multimodal contexts. | Moru Liu; Hao Dong; Olga Fink; Mario Trapp; | code |
| 676 | MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos Highlight: We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware Inverse Kinematics (IK) Fitting. | Kehong Gong; Zhengyu Wen; Weixia He; Xu Mingxi; Qi WANG; ning Zhang; Zhengyu Li; Dongze Lian; Wei Zhao; He Xiaoyu; Mingyuan Zhang; | code |
| 677 | Long-Term Personalized Multimodal LLMs Highlight: In this paper, we introduce Pal-R3, an innovative personalized multimodal agent framework designed for long-term personalization. | Chang Nie; Chaoyou Fu; YiFan Zhang; HaiHuaYang HaiHuaYang; Caifeng Shan; | code |
| 678 | FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants Highlight: While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. | Mahesh Bhosale; Abdul Wasi Lone; Shantam Srivastava; Shifa Latif; Tianyu Luan; Mingchen Gao; David Doermann; Xuan Gong; | code |
| 679 | UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection Highlight: Departing from previous approaches, we propose **UniGenDet**: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. | Yanran Zhang; Wenzhao Zheng; Yifei Li; Bingyao Yu; Yu Zheng; Lei Chen; Jiwen Lu; Jie Zhou; | code |
| 680 | Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs Highlight: By adding random noise to the cropped images, we find that they still maintain most of the performance, especially for models using only reinforcement learning, indicating a heavy reliance on the global input and a weak dependence on details within the cropped region. To address this issue, we propose a novel two-stage reinforcement learning framework that does not require trajectory supervision. | Xuanpu Zhao; Zhentao Tan; Dianmo Sheng; Tianxiang Chen; Yao Liu; Yue Wu; Tao Gong; Qi Chu; Nenghai Yu; | code |
| 681 | Scalable Feature Matching Via State Space Modeling and Sparse Correlation Highlight: Current semi-dense feature matching approaches commonly suffer from quadratic complexity in spatial resolution due to transformer-based long-range context modeling or redundant full correlation computations. To overcome these limitations, we present a novel scalable feature matching method that delivers reliable correspondences with low memory footprint and latency, especially at high resolutions. | Choo Sin Wai; Bo Li; | code |
| 682 | $\alpha$Matte4K & $\mu$Matting: Dataset and Model for Ultra-Micro Precision Alpha Video Matting Highlight: From a model perspective, constrained by computational costs, current methods often up-sample alpha outputs to meet target resolutions, which unavoidably diminishes precision. To overcome this critical limitation, we introduce $\mu$Matting, an innovative resolution-agnostic two-stage matting framework for video matting: (1) coarse matte localization using a portrait-aware masked autoencoder; (2) refinement of critical regions via sparse 3D convolution, augmented by a temporal modulator that injects global spatio-temporal cues for enhanced consistency and contextual awareness. |
Xinyi Chen; Hang Dong; Baowei Jiang; Shenkun Xu; Youqi Guan; Kanle Shi; Kun Gai; Haichuan Song; | code |
| 683 | UAV-CB: A Complex-Background RGB–T Dataset and Local Frequency Bridge Network for UAV Detection Highlight: Existing UAV detection datasets, though diverse, are not specifically designed to capture these camouflage and complex-background challenges, which limits progress toward robust real-world perception. To fill this gap, we construct UAV-CB, a new RGB–T UAV detection dataset deliberately curated to emphasize complex low-altitude backgrounds and camouflage characteristics. |
Shenghui Huang; Menghao Hu; Longkun Zou; Hongyu Chi; Zekai Li; Feng Gao; Fan Yang; Qingyao Wu; Ke Chen; | code |
| 684 | VisualAD: Language-Free Zero-Shot Anomaly Detection Via Vision Transformer Highlight: This work revisits the necessity of the text branch in ZSAD and presents VisualAD, a purely visual framework built on Vision Transformers. |
Yanning Hou; Peiyuan Li; Zirui Liu; Yitong Wang; Yanran Ruan; Jianfeng Qiu; Ke Xu; | code |
| 685 | JUMP-Hand: Learning Joint-wise Uncertainty to Gate Mixture of View Experts for Multi-View 3D Hand Reconstruction Highlight: In this paper, JUMP-Hand is proposed as a novel method for multi-view 3D hand reconstruction, which is the first to introduce probabilistic joint-wise uncertainty as an explicit gating mechanism to fuse multi-view information. |
Haohong Kuang; Yang Xiao; Changlong Jiang; Jinghong Zheng; Hang Xu; Ran Wang; Zhiguo Cao; Joey Tianyi Zhou; | code |
| 686 | FINER: MLLMs Hallucinate Under Fine-grained Negative Queries Highlight: We introduce **FI**ne-grained **NE**gative que**R**ies (**FINER**), alongside two benchmarks: **FINER-CompreCap** and **FINER-DOCCI**. |
Rui Xiao; Sanghwan Kim; Yongqin Xian; Zeynep Akata; Stephan Alaniz; | code |
| 687 | ULF-Loc: Unbiased Landmark Feature for Robust Visual Localization with 3D Gaussian Splatting Highlight: This bias stems from the entanglement between individual Gaussians and their neighboring Gaussians, making the learned features unsuitable for precise matching tasks. Motivated by these findings, we propose ULF-Loc, an unbiased landmark feature framework that replaces biased feature optimization with geometry-weighted feature fusion. |
Yingdong Gu; Shaocheng Yan; Zhenjun Zhao; Yuan Kou; Jianxin Luo; Pengcheng Shi; Jiayuan Li; | code |
| 688 | VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation Highlight: We introduce *VoxTell*, a vision–language model for text-prompted volumetric medical image segmentation. |
Maximilian Rokuss; Moritz Langenberg; Yannick Kirchhoff; Fabian Isensee; Benjamin Hamm; Constantin Ulrich; Sebastian Regnery; Lukas Bauer; Efthimios Katsigiannopulos; Tobias Norajitra; Klaus Maier-Hein; | code |
| 689 | GeoSANE: Learning Geospatial Representations From Models, Not Data Highlight: By shifting from pre-training to weight generation, GeoSANE introduces a new framework for unifying and transferring geospatial knowledge across models and tasks. |
Joëlle Hanna; Damian Falk; Stella X. Yu; Damian Borth; | code |
| 690 | Adaptive Learned Image Compression with Graph Neural Networks Highlight: This rigidity limits the model’s ability to adaptively capture spatially varying redundancy across the image, particularly at the global level. To overcome these limitations, we propose a content-adaptive image compression framework based on Graph Neural Networks (GNNs). |
Yunuo Chen; Bing He; Zezheng Lyu; Hongwei Hu; Qunshan Gu; Yuan Tian; Guo Lu; | code |
| 691 | FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario Highlight: In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under a free-moving scenario, a new setting with a simpler setup and high scalability. |
Hang Dai; Hongwei Fan; Han Zhang; Duojin Wu; Jiyao Zhang; Hao Dong; | code |
| 692 | MaskDexGrasp: Generative Masked Modeling for Part-Aware Dexterous Grasp Synthesis Highlight: However, a dexterous hand always maintains a high-dimensional DoF and actuation space, making it difficult for existing approaches that rely on holistic latent representations to produce high-quality and semantically aligned grasps. In this paper, we propose MaskDexGrasp to address these challenges. |
Binghui Zuo; Lin Zhou; Haoxuan Xu; Jianan Yan; ZhiPeng Yu; Zekai Liu; Yangang Wang; | code |
| 693 | From Feature Learning to Spectral Basis Learning: A Unifying and Flexible Framework for Efficient and Robust Shape Matching Highlight: In this framework, the spectral basis is optimized by learning a set of inhibition functions. Building on this foundation, we propose the first unsupervised spectral basis learning method for efficient and robust non-rigid 3D shape matching, simultaneously optimizing feature extraction and basis functions in an end-to-end manner. |
Feifan Luo; Hongyang Chen; | code |
| 694 | Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning Highlight: Such a decoupled structure not only harms generation efficiency but also hinders the practical use of MAR for reinforcement learning (RL), an increasingly critical paradigm for generative model post-training. To address this fundamental issue, we introduce MARVAL (Masked Auto-regressive Variational Acceleration), a distillation-based framework that compresses the diffusion chain into a single AR generation step while preserving the flexible auto-regressive unmasking order. |
Yuxuan Gu; Weimin Bai; Yifei Wang; Weijian Luo; He Sun; | code |
| 695 | Structure-Aware Representation Distillation for Tiny-Dense Object Segmentation Highlight: We present Structure-Aware Representation Distillation (SARD), a teacher-compatible framework that transfers structural knowledge from a large teacher to a compact student via feature-space alignment rather than mask imitation. |
Xuesong Liu; Anke Xu; Wenbo Cao; Emmett Ientilucci; | code |
| 696 | BridgeEQA: Virtual Embodied Agents for Real Bridge Inspections Highlight: We propose infrastructure inspection as a compelling domain for open-vocabulary Embodied Question Answering (EQA): it naturally demands multi-scale reasoning, long-range spatial understanding, and complex semantic relationships, while offering unique evaluation advantages via standardized National Bridge Inventory (NBI) condition ratings (0-9), professional inspection reports, and egocentric imagery. |
Subin Varghese; Joshua Gao; Asad Ur Rahman; Vedhus Hoskere; | code |
| 697 | GroundingME: Exposing The Visual Grounding Gap in MLLMs Through Multi-Dimensional Evaluation Highlight: To rigorously assess MLLMs’ true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative: distinguishing highly similar objects, (2) Spatial: understanding complex relational descriptions, (3) Limited: handling occlusions or tiny objects, and (4) Rejection: recognizing ungroundable queries. |
Rang Li; Lei Li; Shuhuai Ren; Hao Tian; Shuhao Gu; Shicheng Li; Zihao Yue; Yudong Wang; Wenhan Ma; Zhe Yang; Jingyuan Ma; Zhifang Sui; Fuli Luo; | code |
| 698 | Seeing Both Sides: Towards Bidirectional Semantic Alignment for Open-Vocabulary Camouflaged Object Segmentation Highlight: Such a design neglects the bidirectional interaction between visual and language modalities, making the model vulnerable to the semantic gap between image-level textual semantics and pixel-level segmentation cues, which in turn leads to severe semantic confusion in complex camouflaged scenarios. To address this challenge, we propose BaCLIP, a novel bidirectional semantic alignment framework for OVCOS. |
Guohui Zhang; Fuming Sun; Yu Zhao; Yuqiu Kong; Jing Sun; Ganggang Huang; | code |
| 699 | SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution Highlight: Existing video attribution methods require additional operations or the training of source attribution models, which may degrade video quality or necessitate large amounts of training samples. To address these challenges, we define for the first time the few-shot training-free generated video attribution task and propose SWIFT, which is tightly integrated with the temporal characteristics of the video. |
Chao Wang; Zijin Yang; Yaofei Wang; Yuang Qi; Weiming Zhang; Nenghai Yu; Kejiang Chen; | code |
| 700 | Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding Highlight: Moreover, directly reducing the time context through embedding-based retrieval may lose key information of complex problems. In this paper, we propose Symphony, a multi-agent system, to alleviate these limitations. |
海洋 闫; Hongyun Zhou; Peng Xu; FengXiaoxue FengXiaoxue; Mengyi Liu; | code |
| 701 | Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness Via Contextual Consistency Learning Highlight: This lack of consistency leads to a performance drop because the model struggles to detect the same object in different scenes, which reveals a robustness gap. To address this issue, we introduce Contextual Consistency Learning (CCL), a novel framework that integrates two key strategies: Contextual Bootstrapped Data Generation (CBDG) and Contextual Consistency Loss (CCLoss). |
bozhao Li; Shaocong Wu; Tong Shao; Senqiao Yang; Qiben Shan; Zhuotao Tian; Jingyong Su; | code |
| 702 | DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration Highlight: Considering that feature-based matching can handle large transformation differences but suffers from limited accuracy, while local geometry-based matching can achieve fine-grained local alignment but relies heavily on a good initial transformation, we propose a novel dual-space paradigm to fully leverage the strengths of both approaches. |
Jiayi Li; Yuxin Yao; Qiuhang Lu; Juyong Zhang; | code |
| 703 | Delta Rectified Flow Sampling for Text-to-Image Editing Highlight: We propose Delta Rectified Flow Sampling (DRFS), a novel inversion-free, path-aware editing framework within rectified flow models for text-to-image editing. |
Gaspard Beaudouin; Minghan LI; Jaeyeon Kim; Sung-Hoon Yoon; Mengyu Wang; | code |
| 704 | SonoWorld: From One Image to A 3D Audio-Visual Scene Highlight: Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. |
Derong Jin; Xiyi Chen; Ming Lin; Ruohan Gao; | code |
| 705 | Low-Rank Residual Diffusion Models Highlight: We propose the Low-Rank Residual Diffusion Model (LRDM), which performs diffusion within a compact low-rank residual subspace for efficient and structure-preserving restoration. |
Junfu Tan; Jiang Yuan; | code |
| 706 | From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking Highlight: In contrast, tracking requires instance-level distinction across frames with spatial and temporal continuity, for which current end-to-end approaches insufficiently optimize object embeddings. To address this, we introduce FDTA (From Detection to Association), an explicit feature refinement framework that enhances object discriminativeness across three complementary perspectives. |
Yuqing Shao; Yuchen Yang; Rui Yu; Weilong Li; Xu Guo; Huaicheng Yan; Wei Wang; Xiao Sun; | code |
| 707 | Exploring 6D Object Pose Estimation with Deformation Highlight: We present DeSOPE, a large-scale dataset designed for Deformed Six-DoF Object Pose Estimation. |
Zhiqiang Liu; Rui Song; Chuanqi DuanMu; Jiaojiao Li; David Ferstl; Yinlin Hu; | code |
| 708 | Make It SING: Analyzing Semantic Invariants in Classifiers Highlight: The semantic content of these invariants remains vague, as existing approaches struggle to provide human-interpretable information. To address this gap, we present Semantic Interpretation of the Null-space Geometry (SING), a method that constructs equivalent images, with respect to the network, and assigns semantic interpretations to the available variations. |
Harel Yadid; Meir Yossef Levi; Roy Betser; Guy Gilboa; | code |
| 709 | Rethinking Glyph Spatial Information in Font Generation Highlight: At the model level, the implicit coupling of shape and position hinders fine-grained optimization and generalization. We address these challenges in the context of Chinese font generation, where glyph complexity demands superior model capability. |
Peng Su; Xi Yang; | code |
| 710 | WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition Highlight: In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. |
Shan Ning; Longtian Qiu; Jiaxuan Sun; Xuming He; | code |
| 711 | Your Classifier Can Do More: Towards Bridging The Gaps in Classification, Robustness, and Generation Highlight: This observation suggests a key insight: if the energy distributions of all three data types can be aligned, we might bridge their performance disparities. Building on this idea, we propose Energy-based Joint Distribution Adversarial Training (EB-JDAT), a unified generative-discriminative-robust framework that maximizes the joint probability of clean and adversarial distribution. |
kaichao jiang; He Wang; Xiaoshuai Hao; Xiulong Yang; Ajian Liu; Qi Chu; Yunfeng Diao; Richang Hong; | code |
| 712 | SEA-Flow3D: Simplified, Efficient, and Accurate Scene Flow Via Spatial Vector Sampling and Multi-scale Refinement Highlight: We present SEA-Flow3D, a simple, efficient, and accurate framework for dense scene flow estimation. |
Han Ling; Quansen Sun; Yinghua Yao; Ivor Tsang; Yinghui Sun; | code |
| 713 | CompetitorFormer: Mitigating Query Conflicts for 3D Instance Segmentation Via Competitive Strategy Highlight: We define this phenomenon as *inter-query competition*, which slows convergence and limits segmentation accuracy. To address this problem, we present **CompetitorFormer**, a novel framework designed for Transformer-based methods. |
wang duanchu; Junjie Yang; Haoran Gong; Jing Liu; Di Wang; | code |
| 714 | Seeing Is Improving: Visual Feedback for Iterative Text Layout Refinement Highlight: In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. |
Junrong Guo; Shancheng Fang; Yadong Qu; Hongtao Xie; | code |
| 715 | Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models Highlight: We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. |
Hayeon Kim; Ji Ha Jang; Junghun James Kim; Se Young Chun; | code |
| 716 | Adapting In-context Generation for Enhanced Composed Image Retrieval Highlight: After that, we present a two-stage framework applicable to any supervised CIR approach. |
Haiwen Li; Zining Chen; Delong Liu; Zhaohui Hou; Zhicheng Zhao; Fei Su; | code |
| 717 | OSMO: Open-vocabulary Self-eMOtion Tracking Highlight: We introduce the novel task of egocentric self-emotion tracking, which aims to infer an individual’s evolving emotions from egocentric multimodal streams such as voice, visual surroundings, semantic subtext, and eye-tracking signals. |
Mohamed Abdelfattah; Bugra Tekin; Fadime Sener; Necati Cihan Camgoz; Eric Sauser; Shugao Ma; Alex Alahi; Edoardo Remelli; | code |
| 718 | MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification Highlight: However, the substantial modality gap between optical and SAR images poses a major challenge for robust identification. To address this issue, we propose MOS, a novel framework designed to Mitigate the Optical–SAR modality gap and achieve modality-consistent feature learning for optical-SAR cross-modal ship ReID. |
Yujian Zhao; Hankun Liu; Guanglin Niu; | code |
| 719 | ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos Highlight: However, recent advances in video generation have driven the emergence of activity-level forgeries that modify human actions to distort event semantics, resulting in highly deceptive forgeries that critically undermine media authenticity and public trust. To address this issue, we introduce ActivityForensics, the first large-scale benchmark for localizing manipulated activity in untrimmed videos. |
Peijun Bao; Luo Anwei; Gang Pan; Alex C. Kot; Xudong Jiang; | code |
| 720 | StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues Highlight: We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them “structure-centric”. |
Zanxi Ruan; Songqun Gao; Qiuyu Kong; Yiming Wang; Marco Cristani; | code |
| 721 | FedMOP: Achieving Enhanced Privacy and Performance in Federated Learning Via Momentum Orthogonal Projection Highlight: We introduce Federated Learning with Momentum-Based Orthogonal Projection (FedMOP), a method that simultaneously achieves strong privacy guarantees and superior model performance. |
Yunlong Zhao; Xiaoheng Deng; Hongyan Xu; Zhuohua Qiu; Xiaowen Hu; Shan You; Yi Chen; Chang Xu; Xiu Su; | code |
| 722 | Localizing, Structuring, and Rendering: Bridging 3D and 2D Vision-Language-Action Models for Robotic Manipulation Highlight: We introduce DiffRender-VLA, a differentiable rendering–based framework that bridges 3D and 2D Vision-Language-Action models through gradient-consistent visual mediation. |
Yunlong Zhao; Xiaoheng Deng; Yichao Cao; Yi Chen; Xiangjian He; Shan You; Shuo Yang; Lei Fan; Fei Wang; Xiu Su; | code |
| 723 | RFDM: Residual Flow Diffusion Models for Video Editing Highlight: This paper introduces an efficient causal video editing model that edits a video frame-by-frame. |
Mohammadreza Salehi; Mehdi Noroozi; Luca Morreale; Ruchika Chavhan; Malcolm Chadwick; Alberto Gil Couto Pimentel Ramos; Abhinav Mehrotra; | code |
| 724 | Curvature-Aware Zeroth-Order Optimization for Memory-Efficient Test-Time Adaptation Highlight: However, ZO methods suffer from much higher variance than first-order methods in estimating the gradient. To address this, we propose an improved ZO method to substantially boost the performance of ZO-optimization-based TTA. |
Junming Zhang; Shuyu Yin; Peilin Liu; Rendong Ying; Fei Wen; | code |
| 725 | CG-Reasoner: Centroid-Guided Positional Reasoning Segmentation for Medical Imaging with A Robust Visual-Text Consistency Metric Highlight: We introduce CG-Reasoner, a novel centroid-guided cross-modal framework that jointly performs medical image segmentation and positional reasoning. |
Lakshmikar Reddy Polamreddy; Ming Ma; | code |
| 726 | TIM: Temporal Decoupling with Iterative Mutual-Refinement Model for Longitudinal Radiology Report Generation Highlight: In this work, we propose a Temporal Decoupling with Iterative Mutual-Refinement Model (TIM), a two-stage framework that explicitly decouples spatial pathology from temporal progression and iteratively refines reports through mutual feedback. |
Yiheng Dong; Yi Lin; Shilong Huang; Xiyan Yang; Xin Yang; | code |
| 727 | SEBA: Sample-Efficient Black-Box Attacks on Visual Reinforcement Learning Highlight: We propose SEBA, a sample-efficient framework for black-box adversarial attacks on visual RL agents. |
Tairan HUANG; Yulin Jin; Junxu Liu; Qingqing Ye; Haibo Hu; | code |
| 728 | Attribution As Retrieval: Model-Agnostic AI-Generated Image Attribution Highlight: We propose an efficient model-agnostic framework, called Low-bIt-plane-based Deepfake Attribution (LIDA). |
Hongsong Wang; Renxi Cheng; Chaolei Han; Jie Gui; | code |
| 729 | FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy Highlight: In this work, we formulate the task of **stain-aware FQA**, emphasizing that focus behavior in fluorescence microscopy must be modeled as a function of staining characteristics. |
Hyejin Park; Jiwon Yoon; Sumin Park; Suree Kim; Sinae Jang; Eunsoo Lee; Dongmin Kang; Dongbo Min; | code |
| 730 | Lens Component Deletion Based on Differentiable Ray Tracing Highlight: In this work, we propose a novel lens component deletion pipeline for miniature optical systems, which automatically deletes the suitable lens component, and then optimizes both the lens system and the post-processing network to achieve joint aberration correction. |
Wenguan Zhang; Qirun Zhang; Tuo Sun; Jiajian He; Jiahui Xu; Huajun Feng; Qi Li; | code |
| 731 | Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score Highlight: However, the labeled data is often limited in real-world scenarios, leading to unstable coverage performance in different runs. To address this issue, we extend CP to the semi-supervised setting and propose SemiCP, a new paradigm that leverages both labeled and unlabeled data for calibration. |
Xuanning Zhou; Zihao Shi; Hao Zeng; Xiaobo Xia; Bingyi Jing; Hongxin Wei; | code |
| 732 | SAR2Net: Learning Spatially Anchored Representations for Retrieval-Guided Cross-Stain Alignment Highlight: We present SAR2Net, a framework that learns spatially anchored representations and reformulates cross-stain alignment as a region-level feature retrieval problem. |
Tianle Shen; Fang Yan; Xiaofan Zhang; | code |
| 733 | MedKCO: Medical Vision-Language Pretraining Via Knowledge-Driven Cognitive Orchestration Highlight: This anti-cognitive process leads to suboptimal feature representations, especially under distribution shift. To address this limitation, we propose a Knowledge-driven Cognitive Orchestration for Medical VLP (MedKCO) that involves both the ordering of the pretraining data and the learning objective of vision-language contrast. |
Chenran Zhang; Ruiqi Wu; Tao Zhou; Yi Zhou; | code |
| 734 | Resolving The Identity Crisis in Text-to-Image Generation Highlight: We present DisCo, Reinforcement with DiverSity Constraints, a novel reinforcement learning framework that directly optimizes identity diversity both within images and across groups of generated samples. |
Shubhankar Borse; Farzad Farhadzadeh; Munawar Hayat; Fatih Porikli; | code |
| 735 | Ar2Can: An Architect and An Artist Leveraging A Canvas for Multi-Human Generation Highlight: We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. |
Shubhankar Borse; Phuc Pham; Farzad Farhadzadeh; Seokeon Choi; Phong Nguyen; Anh Tran; Sungrack Yun; Munawar Hayat; Fatih Porikli; | code |
| 736 | From Infusion to Assimilation Distillation for Medical Image Segmentation Highlight: Existing KD methods enhance student performance, but because the teacher and student have different feature advantages, they neglect to adaptively internalize and integrate the student’s semantic information after knowledge transfer, causing poor knowledge assimilation and limiting gains and generalization. To address this limitation, we propose a novel medical image segmentation framework, injection to assimilation distillation (IAD). |
Jiankang Hong; Ye Luo; Yinan Liu; Junsong Yuan; | code |
| 737 | Reframing Long-Tailed Learning Via Loss Landscape Geometry Highlight: Balancing the performance trade-off on long-tail data distributions remains a long-standing challenge. In this paper, we posit that this dilemma stems from a phenomenon called “catastrophic forgetting” in continual learning (the model tends to severely overfit on head classes while quickly forgetting tail classes) and pose a solution from a loss landscape perspective. |
shenghan chen; Yiming Liu; Yanzhen Wang; Yujia Wang; Xiankai Lu; | code |
| 738 | AE2VID: Event-based Video Reconstruction Via Aperture Modulation Highlight: In this work, we introduce aperture-modulation-triggered events as a complementary mechanism to enrich the captured scene information. |
Chenxu Bai; Boyu Li; Peiqi Duan; xinyu zhou; Hanyue Lou; Boxin Shi; | code |
| 739 | 240FPS Stereo Vision from Monocular Mixed Spikes Highlight: In this paper, we introduce a monocular solution for high-frame-rate stereo vision via temporal optical modulation. |
Yeliduosi Xiaokaiti; Yakun Chang; Yang Bai; Zhaojun Huang; Peiqi Duan; Boxin Shi; | code |
| 740 | Ref4D-VideoBench: Four-Dimensional Reference-Based Evaluation of Text-to-Video Generative Models Highlight: In scenarios with explicit expectations, such as controlled generation, reference videos naturally provide rich, unambiguous spatio-temporal evidence, enabling stricter and more trustworthy assessment. Motivated by this, we propose Ref4D, a reference-based, fine-grained, multi-dimensional benchmark for generated video evaluation. |
Jiajia Wei; YuJia He; Yuhan Hou; Hang Qi; Sihua Wang; Jincheng Shi; Kwok Li; Zibin Zheng; Weibin Wu; | code |
| 741 | MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding Highlight: In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose ***Multi-resolution Retrieval-Detection (MRD)***, a training-free framework for high-resolution image understanding. |
Fan Yang; Xingping Dong; Xin Yu; Wenhan Luo; Wei Liu; Kaihao Zhang; | code |
| 742 | UCMNet: Uncertainty-Aware Context Memory Network for Under-Display Camera Image Restoration Highlight: However, they still face challenges in recovering fine details when dealing with complex, spatially varying degradation. To solve this problem, we propose a lightweight Uncertainty-aware Context-Memory Network (UCMNet) for UDC image restoration. |
DAEHYUN KIM; Youngmin Kim; Yoon Ju Oh; Tae Hyun Kim; | code |
| 743 | WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present WorldReel, a 4D video generator that is natively spatio-temporally consistent. |
Shaoheng Fang; Hanwen Jiang; Yunpeng Bai; Niloy J. Mitra; Qixing Huang; | code |
| 744 | Spectral Super-Resolution Via Adversarial Unfolding and Data-Driven Spectrum Regularization: From Multispectral Satellite Data to NASA Hyperspectral Image Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study aims to achieve spectral super-resolution from 12-to-186 and unify the spatial resolution of Sentinel-2 data to 5 m. To enable a reliable and efficient reconstruction, we formulate a novel deep unfolding framework regularized by a data-driven spectrum prior from PriorNet, instead of relying on implicit deep priors as conventional deep unfolding does. |
Si-Sheng Yang; Chia-Hsiang Lin; | code |
| 745 | ReLaX: Reasoning with Latent Exploration for Large Reasoning Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We then introduce a new metric, Dynamic Spectral Dispersion (DSD), to quantify the diversity of the model’s reasoning dynamics, which also serves as a direct measure of the degree of exploration. Building upon these foundations, we introduce a latent-dynamics-aware training paradigm, Reasoning with Latent eXploration (ReLaX), to attain a better balance between exploration and exploitation during policy optimization. |
Shimin Zhang; Xianwei Chen; Yufan Shen; Ziyuan Ye; Jibin Wu; | code |
| 746 | UniPercept: A Unified Diffusion Model for Generalizable Visual Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing diffusion-based perception models are generally restricted to a single task or a fixed set of predefined tasks, lacking an efficient mechanism to generalize to novel tasks. To overcome this limitation, we propose a unified DiT-based perception framework called UniPercept, which introduces a novel foundation–adapter paradigm for general visual perception. |
Zuyan Zhao; Zhenliang He; Meina Kan; Shiguang Shan; Xilin Chen; | code |
| 747 | 4D Local Modeling Toward Dynamic Global Perception for Ambiguity-free Rotation-Invariant Point Cloud Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Most rotation-invariant (RI) representations are derived from local coordinate systems, which inherently suffer from point-pair ambiguities and fail to capture discriminative features in symmetric or repetitive structures, while discarding informative global pose cues. To overcome these limitations, we propose Ga4DPF, a novel framework that offers a robust, global-aware RI representation by converting rotation-equivariant geometric representations into invariant ones, while concurrently integrating global pose awareness. |
JIAXUN GUO; Wentao Fan; Manar Amayri; Nizar Bouguila; | code |
| 748 | Dynamic Stream Network for Combinatorial Explosion Problem in Deformable Medical Image Registration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose the Dynamic Stream Network (DySNet), which enables the receptive fields and weights to be dynamically adjusted. |
Shaochen Bi; Yuting He; Weiming Wang; Hao Chen; | code |
| 749 | Boosting Vision-Language-Action Finetuning with Feasible Action Neighborhood Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, prevalent VLA training methodologies are directly inherited from linguistic settings and do not exploit the FAN property, which leads to poor generalization and low sample efficiency. To address this limitation, we introduce a FAN-guided regularizer that shapes the model’s output distribution to align with the geometry of FAN. |
Haochen Niu; Kanyu Zhang; Shuyu Yin; Qinghai Guo; Peilin Liu; Fei Wen; | code |
| 750 | U$^{2}$Flow: Uncertainty-Aware Unsupervised Optical Flow Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose U$^{2}$Flow, the first recurrent unsupervised framework that jointly estimates optical flow and per-pixel uncertainty. |
Xunpei Sun; Wenwei Lin; Yi Chang; Gang Chen; | code |
| 751 | Gamba: Mamba-based Graph Convolutional Network with Dynamic Graph Topology Learning for Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To capture the underlying relation of the joints of different categories, the state space model is introduced to the proposed method to process enhanced temporal features, aiming at learning dynamic adjacency matrices for long-range dependencies of the joints across different categories. |
Rouyi Zhou; 漾之 吴; Jiajun Wen; Can Gao; Feng Liu; Zhihui Lai; Linlin Shen; | code |
| 752 | Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they still struggle with high information density (HID) charts characterized by multiple subplots, legends, and dense annotations due to three major challenges: (1) limited fine-grained perception results in the omission of critical visual cues; (2) redundant or noisy visual information undermines the performance of multimodal reasoning; (3) lack of adaptive deep reasoning relative to the amount of visual information. To tackle these challenges, we present a novel focus-driven fine-grained chart reasoning model, **Chart-FR1**, to improve perception, focusing efficiency, and adaptive deep reasoning on HID charts. |
Hongkun Pan; Yuwei Wu; Wanyi Hong; ShengHui Hu; Qitong Yan; Yi Yang; Rufei Han; Changju Zhou; Minfeng Zhu; Dongming Han; Wei Chen; | code |
| 753 | LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Lighting-driven Dynamic Active Sensing (LiDAS), a closed-loop active illumination system that combines off-the-shelf visual perception models with high‑definition headlights. |
Simon de Moreau; Andrei Bursuc; Hafid EL IDRISSI; Fabien Moutarde; | code |
| 754 | Mining Instance-Centric Vision–Language Contexts for Human–Object Interaction Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net)—a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. |
Soo Won Seo; Kyungchae Lee; Hyungchan Cho; Taein Son; Nam Ik Cho; Jun Won Choi; | code |
| 755 | Test-time Ego-Exo-centric Adaptation for Action Anticipation Via Multi-Label Prototype Growing and Dual-Clue Consistency Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we make the first exploration of a Test-time Ego-Exo Adaptation for Action Anticipation (TE$^{2}$A$^{3}$) task, which aims to adjust the source-view-trained model online during test time to anticipate target-view actions. |
Zhaofeng Shi; Heqian Qiu; Lanxiao Wang; Qingbo Wu; Fanman Meng; Lili Pan; Hongliang Li; | code |
| 756 | CanonCGT: Reference-Based Color Grading Via Canonical Pivot Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose CanonCGT, a two-stage framework built on a canonical pivot — a style-neutral intermediate representation for stable color mapping. |
JINWON KO; Keunsoo Ko; Chang-Su Kim; | code |
| 757 | OrthoFuse: Training-free Riemannian Fusion of Orthogonal Style-Concept Adapters for Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper we seek to show that in the case of orthogonal fine-tuning (OFT), we can use structured orthogonal parametrization and, utilizing manifold theory, get the formulas for training-free adapter merging. |
Ali Aliev; Kamil Garifullin; Nikolay Yudin; Vera Soboleva; Alexander Molozhavenko; Ivan Oseledets; Aibek Alanov; Maxim Rakhuba; | code |
| 758 | Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, models encoding absolute positions struggle to extract spatial relations from prematurely fused features, while methods explicitly encoding all spatial relations (which is quadratic in the number of objects) as input tokens suffer from poor scalability. To address these limitations, we propose QuatRoPE, a novel positional embedding method with an input length that is linear to the number of objects, and explicitly calculates pairwise spatial relations through the dot product in attention layers. |
Shengli Zhou; Minghang Zheng; Feng Zheng; Yang Liu; | code |
| 759 | SpiralDiff: Spiral Diffusion with LoRA for RGB-to-RAW Conversion Across Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they overlook two key challenges: (i) the reconstruction difficulty varies with pixel intensity, and (ii) multi-camera conversion requires camera-specific adaptation. To address these issues, we propose SpiralDiff, a diffusion-based framework tailored for RGB-to-RAW conversion with a signal-dependent noise weighting strategy that adapts reconstruction fidelity across intensity levels. |
Huanjing Yue; Shangbin Xie; Cong Cao; Qian.Wu Qian.Wu; Lei Zhang; Zhao Lei; Jingyu Yang; | code |
| 760 | $\text{F}^2\text{HDR}$: Two-Stage HDR Video Reconstruction Via Flow Adapter and Physical Motion Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods often suffer from inaccurate alignment, suboptimal feature aggregation, and degraded reconstruction quality in motion-dominated regions. To address these challenges, we propose $\text{F}^2\text{HDR}$, a two-stage HDR video reconstruction framework that robustly perceives inter-frame motion and restores fine details in complex dynamic scenarios. |
Huanjing Yue; Dawei Li; Shaoxiong Tu; Jingyu Yang; | code |
| 761 | Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose UniME (Uni-Encoder Meets Multi-Encoders), a two-stage heterogeneous method for brain tumor segmentation with missing modalities that reconciles the trade-offs among fine-grained structure capture, cross-modal complementarity modeling, and effective exploitation of available modalities. |
Peibo Song; Xiaotian Xue; Jinshuo Zhang; zihao wang; Jinhua liu; Shujun Fu; Fangxun Bao; Si Yong Yeo; | code |
| 762 | Content-Aware Dynamic Patchification for Efficient Video Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This content-agnostic tokenization results in substantial redundant computation, especially in visually simple or static areas. To address this inefficiency while preserving the video generation quality, we propose DynaPatch, a fine-grained dynamic patchification framework that adaptively selects patch sizes for each spatiotemporal region based on content complexity. |
Sheng Li; Connelly Barnes; Mamshad Nayeem Rizve; Hongwu Peng; Zhengang Li; Ohi Dibua; Alireza Ganjdanesh; Xulong Tang; Yan Kang; Yifan Gong; | code |
| 763 | Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Uni-DAD, a single-stage pipeline that unifies distillation and adaptation of DMs. |
Yara Bahram; Mélodie Desbos; Mohammadhadi Shateri; Eric Granger; | code |
| 764 | CAST: Context-Aware Dynamic Latent Space Transformation for Interactive Text-to-Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Such static formulation easily causes semantic vagueness, making it difficult to capture subtle embedding shifts in the user’s updated intention for fine-grained retrieval. To address this limitation, we propose Context-Aware Latent Space Transformation (CAST), a lightweight framework that dynamically transforms the common latent space of both textual and visual representations according to the specific evolving user’s search intention, enabling fine-grained and adaptive semantic alignment. |
Xuanzuo Lin; Min Zhang; Daizong Liu; Zhiwen Zuo; Xun Yang; Changting Lin; Xun Wang; Jianfeng Dong; | code |
| 765 | VDE: Training-Free Accelerating Rectified Flow Model Via Velocity Decomposition and Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work proposes Velocity Decomposition and Estimation (VDE), a training-free acceleration method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating. |
Tan Junwen; Jinglin Liang; Hongyuan Chen; Shuangping Huang; | code |
| 766 | Guiding Diffusion Models with Semantically Degraded Conditions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Condition-Degradation Guidance (CDG), a novel paradigm that replaces the null prompt with a strategically degraded condition, $c_{deg}$. |
shilong han; Yuming Zhang; Hongxia Wang; | code |
| 767 | $L^{2}DGS$: Low-Light Dynamic Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose $L^{2}DGS$ (Low-Light Dynamic Gaussian Splatting), a self-supervised 4D GS framework for directly reconstructing well-lit dynamic scenes from low-light inputs. |
Ashish Kumar; A. N. Rajagopalan; | code |
| 768 | Disco-GS: Gaussian Splatting in Dynamic Color Lighting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Disco-GS, a framework that leverages GS for reconstructing the 3D scene while simultaneously recovering the underlying canonical appearance from videos captured under dynamic lighting conditions. |
Ashish Kumar; A. N. Rajagopalan; | code |
| 769 | SPEGC: Continual Test-Time Adaptation Via Semantic-Prompt-Enhanced Graph Clustering for Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing CTTA methods often rely on unreliable supervisory signals, igniting a self-reinforcing cycle of error accumulation that culminates in catastrophic performance degradation. To overcome these challenges, we propose a CTTA via Semantic-Prompt-Enhanced Graph Clustering (SPEGC) for medical image segmentation. |
Xiaogang Du; Jiawei Zhang; Tongfei Liu; Tao Lei; Yingbo Wang; | code |
| 770 | High-Fidelity Generation of Lane Scenes Under Adverse Weather and Lighting Conditions Without Re-annotation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current datasets like CULane and TuSimple have relatively limited data under extreme weather conditions, such as rain, snow and fog, which makes detection models unreliable in extreme conditions, potentially leading to serious safety-critical failures on the road. In this direction, we propose HG-Lane, a High-fidelity Generation framework for Lane Scenes under adverse weather and lighting conditions, without the need for re-annotation and training. |
Daichao Zhao; Qiupu Chen; Feng He; Xin Ning; Qiankun Li; | code |
| 771 | Electromagnetic Inverse Scattering from A Single Transmitter Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The scarcity of transmitters leads to an insufficient amount of measured data, which fails to capture adequate physical information for stable inversion. Built on this insight, we propose a fully end-to-end and data-driven framework that predicts the relative permittivity of scatterers from measured fields, leveraging data distribution priors to compensate for the lack of physical information. |
Yizhe Cheng; Chunxun Tian; Haoru Wang; Wentao Zhu; Xiaoxuan Ma; Yizhou Wang; | code |
| 772 | Online3R: Online Learning for Consistent Sequential Reconstruction Based on Geometry Foundation Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present Online3R, a new sequential reconstruction framework that is capable of adapting to new scenes through online learning, effectively resolving inconsistency issues. |
Shunkai Zhou; Zike Yan; fei xue; Dong Wu; Yuchen Deng; Hongbin Zha; | code |
| 773 | Common Inpainted Objects In-N-Out of Context Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present Common Inpainted Objects In-N-Out of Context (COinCO), a novel dataset addressing the scarcity of out-of-context examples in existing vision datasets. |
Tianze Yang; Tyson Jordan; Ruitong Sun; Ninghao Liu; Jin Sun; | code |
| 774 | GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. |
tianyu chen; Wei Xiang; Kang Han; Lu Yu; Di Wu; Gaowen Liu; Ramana Kompella; | code |
| 775 | PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose PerformRecast, a versatile portrait video expression editing method dedicated to recasting the performance in existing film and animation. |
Jiadong Liang; Bojun Xiong; Jie Tian; Hua Li; Xiao Long; Yong Zheng; Huan Fu; | code |
| 776 | RMIR: A Benchmark Dataset for Reasoning-Intensive Multimodal Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: When retrieval requires complex reasoning to determine the target image, the task becomes significantly more challenging, yet standardized benchmarks for this setting do not exist. To fill this gap, we introduce RMIR, a benchmark dataset of 1,634 queries requiring reasoning across three categories: functional (object affordances), temporal (time-based relationships), and causal (cause-effect reasoning). |
Yijiang Li; Kunal Kotian; Ali Marjaninejad; Meir Friedenberg; Kaushik Pavani; Sunny Dasgupta; | code |
| 777 | GraspALL: Adaptive Structural Compensation from Luminance Variation for Robotic Garment Grasping in Any Low-Light Conditions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods typically enhance RGB features by exploiting the illumination-invariant properties of non-RGB modalities, yet they overlook the varying dependence on non-RGB features under varying lighting conditions, which can introduce misaligned non-RGB cues and thereby weaken the model’s adaptability to illumination changes. To address this problem, we propose GraspALL, an illumination-structure interactive compensation model. |
Haifeng Zhong; Wenshuo Han; Zhouyu Wang; Runyang Feng; Fan Tang; Tong-yee Lee; zipei fan; Ruihai Wu; Yuran Wang; Hao Dong; Hechang Chen; Hyung Jin Chang; Yixing Gao; | code |
| 778 | Forging A Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. |
Zizhi Chen; Yizhen Gao; Minghao Han; Yizhou Liu; Zhaoyu Chen; Dingkang Yang; Lihua Zhang; | code |
| 779 | Score2Instruct: Scaling Up Video Quality-Centric Instructions Via Automated Dimension Scoring Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Previous explorations mainly focus on the image domain, and their data generation processes heavily rely on human quality annotations and proprietary systems (e.g., GPT-4), limiting data scalability and effectiveness. To address these challenges, we propose the Score-based Instruction Generation (SIG) pipeline. |
Qizhi Xie; Kun Yuan; Yunpeng Qu; Jiachao Gong; Mingda Wu; Ming Sun; Chao Zhou; Jihong Zhu; | code |
| 780 | Focal–General Diffusion Model with Semantic Consistent Guidance for Sign Language Production Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing G2P methods typically treat each pose as an indivisible unit, limiting their ability to capture fine-grained joint-level dependencies and thus degrading pose quality. To address this, we propose the Focal–General Diffusion Model (FGDM), characterized by a pioneering two-stage denoising framework that harmonizes local joint-level dependencies and global coherence. |
Yiheng Yu; Sheng Liu; Yuan Feng; Zhelun Jin; Yining Jiang; Min Xu; | code |
| 781 | Pose-Free Omnidirectional Gaussian Splatting for 360-Degree Videos with Consistent Depth Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a pose-free omnidirectional 3DGS method, named PFGS360, that reconstructs 3D Gaussians from unposed omnidirectional videos. |
Chuanqing Zhuang; Xin Lu; Zehui Deng; Zhengda Lu; Yiqun Wang; Junqi Diao; Jun Xiao; | code |
| 782 | Fine-Tuning Impairs The Balancedness of Foundation Models in Long-tailed Personalized Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, real-world scenarios often face the co-occurrence of non-IID data and long-tailed class distributions, presenting unique challenges that remain underexplored in PFL. In this paper, we investigate this long-tailed personalized federated learning and observe that current methods suffer from two limitations: (i) Fine-tuning degrades performance below zero-shot baselines due to the erosion of inherent class balance in foundation models; (ii) Conventional personalization techniques further transfer this bias to local models through parameter or feature-level fusion. |
Shihao Hou; Chikai Shang; Zhiheng Yang; jiacheng yang; Xinyi Shang; Junlong Gao; Yiqun Zhang; Yang Lu; | code |
| 783 | Hist2Style: Histogram-Guided Stylization with Bilateral Grids Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Hist2Style, a bilateral-grid formulation for fast, edge-aware stylization that preserves visual fidelity by constraining operations to locally affine transforms in bilateral space. |
Dekel Galor; Adam Pikielny; Zhoutong Zhang; Ke Wang; Laura Waller; Jiawen Chen; Ilya Chugunov; | code |
| 784 | VisRes Bench: On Evaluating The Visual Reasoning Capabilities of VLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. |
Brigitta Malagurski Törtei; Yasser Dahou; Ngoc Huynh; Wamiq Reyaz Para; Phúc H. Lê Khắc; Ankit Singh; Sofian Chaybouti; Sanath Narayan; | code |
| 785 | TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In the paper, we present TeFlow, enabling multi-frame supervision for feed-forward models by mining temporally consistent supervision. |
Qingwen Zhang; Chenhan Jiang; Xiaomeng Zhu; Yunqi Miao; Yushan Zhang; Olov Andersson; Patric Jensfelt; | code |
| 786 | MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This leaves a gap in modeling cases where sarcasm is triggered by multi-image cues in real-world settings. To bridge this gap, we introduce MMSD3.0, a new benchmark composed entirely of multi-image samples curated from tweets and Amazon reviews. |
HAOCHEN ZHAO; Yuyao Kong; Yongxiu Xu; Gaopeng Gou; Hongbo Xu; Yubin Wang; Haoliang Zhang; | code |
| 787 | Hear What You See: Video-to-Audio Generation with Diffusion Transformer and Semantic-Temporal Alignment-Ranked Direct Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing approaches often fail to capture fine-grained temporal correspondence between visual events and audio dynamics, leading to unrealistic or desynchronized outputs. To address these limitations, we propose VisioSonic, a Video-Aligned Sound generation framework that unifies flow-matching diffusion and preference-guided alignment. |
Kai Wang; Tao Zhou; jiayi lei; Jing Wang; Jinman Zhao; Weiguo Pian; Yuan Cheng; Yapeng Tian; Peng Gao; Bin Fu; Yihao Liu; Dimitrios Hatzinakos; Yuewen Cao; | code |
| 788 | Dejavu: Towards Experience Feedback Learning for Embodied Intelligence Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a general post-deployment learning framework Dejavu, which employs an Experience Feedback Network (EFN) and augments the frozen Vision-Language-Action (VLA) policy with retrieved execution memories. |
Shaokai Wu; Yanbiao Ji; Qiuchang Li; Zhiyi Zhang; Qichen He; Wenyuan XIE; Guodong Zhang; Bayram Bayramli; Yue Ding; Hongtao Lu; | code |
| 789 | BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Bijective Universal Scene-Specific Anomalous Relationship Detection (BUSSARD), a normalizing flow-based model for detecting anomalous relations in scene graphs, generated from images. |
Melissa Schween; Mathis Kruse; Bodo Rosenhahn; | code |
| 790 | AutoCut: End-to-end Advertisement Video Editing Based on Multimodal Discretization and Controllable Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current workflows and AI tools remain disjoint and modality-specific, leading to high production costs and low overall efficiency. To address this issue, we propose AutoCut, an end-to-end video ad editing framework based on multimodal discretization and controllable generation. |
Milton Zhou; Sizhong Qin; Yongzhi Li; Quan Chen; Peng Jiang; | code |
| 791 | CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent’s specific physical and operational capabilities. |
Xia Su; Ruiqi Chen; Benlin Liu; Jingwei Ma; Zonglin Di; Ranjay Krishna; Jon Froehlich; | code |
| 792 | Vision-Oriented Lightweight Neural Architecture Search with Budget-Adaptive Evaluation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Nevertheless, NAS methods are trapped in a fundamental accuracy-efficiency dilemma: training-based approaches deliver reliable performance but incur prohibitive search costs, whereas training-free strategies are ultra-fast but often yield relatively unreliable rankings. To reconcile this conflict, we propose a vision-oriented lightweight training-based NAS framework. |
Yi Fan; Yu-Bin Yang; | code |
| 793 | ReMoRa: Multimodal Large Language Model Based on Refined Motion Representation for Long-Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we focus on video understanding by MLLMs. |
Daichi Yashima; Shuhei Kurita; Yusuke Oda; Komei Sugiura; | code |
| 794 | When AVSR Meets Video Conferencing: Dataset, Degradation, and The Hidden Mechanism Behind Performance Collapse Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. |
Yihuan Huang; Jun Xue; Liu Jiajun; Daixian Li; Tong Zhang; Zhuolin Yi; Yanzhen Ren; Kai Li; | code |
| 795 | Convolutional Neural Networks Driven By Content Similarity Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although convolutional neural networks (CNNs) have continued to evolve in recent years, Transformers have become increasingly popular in the field of computer vision. In this work, we open a new avenue for CNNs, enabling them to aggregate information based on content similarity—an ability analogous to the self-attention mechanism. |
Ligeng Zou; Guihu Zhao; | code |
| 796 | A Stitch in Time: Learning Procedural Workflow Via Self-Supervised Plackett–Luce Ranking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We expose the lack of procedural awareness in current SSL methods with a motivating experiment: models pretrained on forward and time-reversed sequences produce highly similar features, confirming that their representations are blind to the underlying procedural order. To address this shortcoming, we propose PL-Stitch, a self-supervised framework that harnesses the inherent temporal order of video frames as a powerful supervisory signal. |
chengan che; Chao Wang; Xinyue Chen; Sophia Tsoka; Luis Carlos Garcia Peraza Herrera; | code |
| 797 | GraPHFormer: A Multimodal Graph Persistent Homology Transformer for The Analysis of Neuroscience Morphologies Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Skeletonized reconstructions of neurons and glia enable systematic study of branching patterns, path lengths, tapering, and spatial organization, with implications for neurodevelopment, learning and memory, and neurodegenerative disease. |
Uzair Shah; Marco Agus; Mahmoud Gamal; Mahmood Alzubaidi; Corrado Cali; PIERRE MAGISTRETTI; Abdesselam Bouzerdoum; Mowafa Househ; | code |
| 798 | LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Featuring an extensive collection of over 4K surgical videos totaling 938 hours (85 million frames) of high-quality footage across multiple procedure types, LEMON offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel downstream tasks. To demonstrate the effectiveness of this diverse dataset, we introduce LemonFM, a foundation model pretrained on LEMON using a novel self-supervised augmented knowledge distillation approach. |
Chengan Che; Chao Wang; Tom Vercauteren; Sophia Tsoka; Luis Carlos Garcia Peraza Herrera; | code |
| 799 | TALON: Test-time Adaptive Learning for On-the-Fly Category Discovery Highlight: It often results in category explosion, where a single class is fragmented into multiple pseudo-classes. To overcome these limitations, we propose a test-time adaptation framework that enables learning through discovery. |
Yanan Wu; Yuhan Yan; Tailai Chen; Zhixiang Chi; ZiZhang Wu; Yi Jin; Yang Wang; Zhenbo Li; | code |
| 800 | OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery Highlight: However, most existing methods remain offline, relying on future frames or global optimization, which limits their applicability in interactive feedback and perception-action loop scenarios such as AR/VR and telepresence. To address this, we propose OnlineHMR, a fully online framework that jointly satisfies four essential criteria of online processing, including system-level causality, efficiency, faithfulness, and temporal consistency. |
Yiwen Zhao; Ce Zheng; Yufu Wang; Hsueh-Han Yang; Liting Wen; Laszlo Jeni; | code |
| 801 | PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning Highlight: Inspired by biological center–surround organization and frequency-selective signal processing, we propose PFGNet, a fully convolutional framework that dynamically modulates receptive fields through pixel-wise frequency-guided gating. |
Xinyong Cai; Changbin Sun; Yong Wang; Hongyu Yang; Yuankai Wu; | code |
| 802 | EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views Under Extreme Conditions Highlight: This discrepancy creates a significant gap between controlled lab data and chaotic real-world application. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. |
Taegyoon Yoon; Yegyu Han; Seojin Ji; Jaewoo Park; Sojeong Kim; Taein Kwon; Hyung-Sin Kim; | code |
| 803 | Compressed-Domain-Aware Online Video Super-Resolution Highlight: Although recent online video super-resolution (online VSR) approaches achieve promising results, they are still compute-intensive and fall short of real-time processing at higher resolutions, due to complex motion estimation for alignment and redundant processing of consecutive frames. To address these issues, we propose a compressed-domain-aware network (CDA-VSR) for online VSR, which utilizes compressed-domain information, including motion vectors, residual maps, and frame types to balance quality and efficiency. |
Yuhang Wang; Hai Li; Shujuan Hou; Zhetao Dong; yangxiaoyao yangxiaoyao; | code |
| 804 | HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps Highlight: In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. |
Xuchang Zhong; Xu Cao; Jinke Feng; Hao Fang; | code |
| 805 | Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions Under A Certain Decision Rule Highlight: We introduce the Perception Characteristics Distance (PCD), a novel metric incorporating model output uncertainty as represented by the farthest distance at which an object can be reliably detected. |
Boyu Jiang; Liang Shi; Zhengzhi Lin; Lanxin Xiang; Loren Stowe; Feng Guo; | code |
| 806 | RISE: Single Static Radar-based Indoor Scene Understanding Highlight: We introduce RISE, the first benchmark and system for single-static-radar indoor scene understanding, jointly targeting layout reconstruction and object detection. |
Kaichen Zhou; Laura Dodds; Sayed Afzal; Fadel Adib; | code |
| 807 | rPPG-VQA: A Video Quality Assessment Framework for Unsupervised rPPG Training Highlight: In this work, we propose rPPG-VQA, a novel framework for assessing video suitability for rPPG. |
Tianyang Dai; Ming Chang; Yan Chen; Yang Hu; | code |
| 808 | PhysGen: Physically Grounded 3D Shape Generation for Industrial Design Highlight: Since existing methods lack knowledge of such physics, they are unable to use this knowledge to enhance the realism of shape generation. Motivated by this, we propose a unified physics-based 3D shape generation pipeline, with a focus on industrial design applications. |
Yingxuan You; Chen Zhao; Hantao Zhang; Ming Xu; Pascal Fua; | code |
| 809 | Breaking The 3D Dataset Bottleneck: Fast Scalable Generation of Aligned 3D Assets from Scratch for Category 6D Pose Estimation and Robotic Grasping Highlight: We introduce the first scalable, automated framework that generates complete category-level 6D pose datasets directly from text prompts, bypassing the need for existing 3D assets. |
Guillaume Duret; Danylo Mazurak; Florence Zara; Jan Peters; Liming Chen; | code |
| 810 | CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection Highlight: Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high-fidelity AIGC videos. To address these issues, we introduce CoCoVideo-26K, a contrastive, commercial-model-based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real–fake video pairs. |
Huidong Feng; Wentao Chen; Jie Chen; Xinqi Cai; Ruolong Ma; Yinglin Zheng; Yuxin Lin; Ming Zeng; | code |
| 811 | STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction Highlight: Moreover, they struggle to reconstruct frequently occluded regions (e.g., mouth interiors, eyelids). To address these limitations, we propose STAvatar, which consists of two key components: (1) a UV-Adaptive Soft Binding framework that leverages both image- and FLAME-based priors to learn per-Gaussian feature offsets within the UV space. |
Jiankuo Zhao; Xiangyu Zhu; Zidu Wang; Zhen Lei; | code |
| 812 | SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval Highlight: We introduce SATTC (Structure-Aware Test-Time Calibration), a label-free test-time calibration head that operates directly on the similarity matrix of frozen EEG–image encoders. |
Qunjie Huang; Weina Zhu; | code |
| 813 | CRFT: Consistent–Recurrent Feature Flow Transformer for Cross-Modal Image Registration Highlight: We present Consistent–Recurrent Feature Flow Transformer (CRFT), a unified coarse-to-fine framework that learns feature flow for robust cross-modal registration. |
Xuecong Liu; Mengzhu Ding; Zixuan Sun; Zhang Li; Xichao Teng; | code |
| 814 | Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning Highlight: However, it poses several unique challenges: complex inter-hand dependencies, visual asymmetry between hands, representation conflicts where the dominant hand monopolizes gradients, and semantic ambiguity in fine-grained actions. We propose Polyphony, a three-stage method to address these challenges through: (1) an Alternating Dual-Hand Vision Transformer that alternates training between left- and right-hand mini-batches to ensure balanced gradient contributions from both hands while sharing a spatio-temporal encoder; (2) Semantic Feature Conditioning that aligns visual features with structured, compositional action descriptions to enhance discrimination of semantically similar actions; and (3) Diffusion-Based Segmentation with cross-hand feature fusion for inter-hand coordination and adaptive loss weighting for balancing performance. |
Hao Zheng; Hu Wang; Tiantian Zheng; Prajjwal Bhattarai; Tuka Alhanai; | code |
| 815 | Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs Highlight: To address this issue, we leverage causal analysis and experiment to reveal the underlying phenomenon of perceptual bias, demonstrating that RL-based fine-tuning compels lightweight models to preferentially adopt perceptual shortcuts induced by data biases, rather than developing genuine reasoning abilities. Motivated by this insight, we propose VideoThinker, a causal-inspired framework that cultivates robust reasoning in lightweight models through a two-stage debiasing process. |
Jingze Wu; Quan Zhang; Hongfei Suo; Zeqiang Cai; Hongbo Chen; | code |
| 816 | Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make It Strong Again Highlight: Rethinking existing two-stage RBT frameworks, we identify two fundamental limitations: the overly heuristic feature construction and fragile correspondence modeling. To address these issues, we propose FlexHook, a novel two-stage RBT framework. |
Weize Li; Yunhao Du; Qixiang Yin; Zhicheng Zhao; Fei Su; | code |
| 817 | Robust Promptable Video Object Segmentation Highlight: This paper offers the first comprehensive study on robust PVOS (RobustPVOS). |
Sohyun Lee; Yeho Gwon; Lukas Hoyer; Konrad Schindler; Christos Sakaridis; Suha Kwak; | code |
| 818 | CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment Highlight: We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model’s intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. |
Maoyuan Shao; Yutong Gao; Xinyang Huang; Lijuan Sun; Guoshun Nan; Chuang Zhu; | code |
| 819 | TM-BSN: Triangular-Masked Blind-Spot Network for Real-World Self-Supervised Image Denoising Highlight: In this paper, we propose the Triangular-Masked Blind-Spot Network (TM-BSN), a novel blind-spot architecture that accurately models the spatial correlation of real sRGB noise. |
Junyoung Park; Youngjin Oh; Nam Ik Cho; | code |
| 820 | Visual-RRT: Finding Paths Toward Visual-Goals Via Differentiable Rendering Highlight: In this paper, we propose visual-RRT (vRRT), a motion planner that enables visual-goal planning by unifying gradient-based exploitation from differentiable robot rendering with sampling-based exploration from RRTs. |
Sebin Lee; Jumin Lee; Taeyeon Kim; Youngju Na; Woobin Im; Sung-Eui Yoon; | code |
| 821 | FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching Highlight: Existing DIS approaches often fail to preserve fine-grained details or fully capture the semantic structure of the foreground. To address these challenges, we present FlowDIS, a novel dichotomous image segmentation method built upon the flow matching framework, which learns a time-dependent vector field to transport the image distribution into the corresponding mask distribution under optional textual guidance. |
Andranik Sargsyan; Shant Navasardyan; | code |
| 822 | EmoDiffTalk: Emotion-aware Diffusion for Editable 3D Gaussian Talking Head Highlight: This paper introduces a new editable 3D Gaussian talking head, i.e., EmoDiffTalk. |
Chang Liu; Tianjiao Jing; Chengcheng Ma; Xuanqi Zhou; Zhengxuan Lian; Qin Jin; Hongliang Yuan; Shi-Sheng Huang; | code |
| 823 | LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds Highlight: However, existing approaches remain impractically slow, memory-intensive, and overly complex due to iterative optimization and dense feature assignments for every Gaussian. To address these limitations, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. |
Jaehun Bang; Jinhyeok Kim; Minji Kim; Seungheon Jeong; Kyungdon Joo; | code |
| 824 | GuardTrace-VL: Detecting Unsafe Multimodel Reasoning Via Iterative Safety Supervision Highlight: We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question–Thinking–Answer (QTA) pipeline via joint image–text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. |
Yuxiao Xiang; Junchi Chen; Zhenchao Jin; Changtao Miao; Haojie Yuan; Qi Chu; Tao Gong; Nenghai Yu; | code |
| 825 | MLLM-HWSI: A Multimodal Large Language Model for Hierarchical Whole Slide Image Understanding Highlight: We introduce MLLM-HWSI, a Hierarchical WSI-level MLLM that aligns visual features with pathology language at four distinct scales (cell as word, patch as phrase, region as sentence, and WSI as paragraph) to support interpretable, evidence-grounded reasoning. |
Basit Alawode; Arif Mahmood; Muaz Radi; Shahad Albastaki; Asim Khan; Muhammad Bilal; Moshira Abdalla; Mohammed Bennamoun; Sajid Javed; | code |
| 826 | Perceptual Neural Video Compression with Color Separation and Rank Chain Highlight: Second, within this framework, we introduce the perceptual optimization scheme Rc-GAN, which leverages a bitrate-based rank chain loss to link variable-rate coding with perceptual quality ranking, enforcing consistent quality ordering and improving perceptual fidelity. Built upon these designs, we establish the PNVC-C framework with two variants: PNVC-C-Base, optimized for objective fidelity, and PNVC-CR, a perceptual variant that applies the Rc-GAN. |
Xiongzhuang Liang; Chuanbo Tang; Zhuoyuan Li; Li Li; Dong Liu; | code |
| 827 | Improving Diffusion Generalization with Weak-to-Strong Segmented Guidance Highlight: In this work, we first conduct synthetic comparisons to isolate and demonstrate the effective regime of guidance methods represented by CFG and AG from the perspective of the weak-to-strong principle. Based on this, we propose a hybrid instantiation of the principle, called SEG, that takes the benefits of both. |
Liangyu Yuan; Yufei Huang; Mingkun Lei; Tong Zhao; Ruoyu Wang; Chi Changxi; Yiwei Wang; Chi Zhang; | code |
| 828 | 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from A Single Image Highlight: We introduce 3D-Fixer, a novel generalizable and efficient scheme for single-image to compositional 3D scene generation. |
Ze-Xin Yin; Liu Liu; Xinjie Wang; Wei Sui; Zhizhong Su; Jian Yang; Jin Xie; | code |
| 829 | ReMoE: Region-Mixture Experts for Adversarially-Robust Vision Transformers Highlight: Adversarial perturbations are typically local and spatially structured, whereas the globally coupled self-attention and spatially uniform feed-forward networks in ViTs propagate local corruptions across the whole image without enforcing consistency within semantically coherent regions. To mitigate this mismatch, we propose Region-aware Mixture-of-Experts, namely ReMoE, a plug-and-play module that replaces the standard feed-forward network (FFN) with a region-aware expert layer. |
Qinghao Zhong; Bingzhi Chen; Yishu Liu; Minhua Lu; Guangming Lu; | code |
| 830 | CG-Floor: Centroid-Guided Diffusion for Large-Scale Floorplan Generation Highlight: Although existing methods have shown success in generating small-scale floorplans with simple room shapes, they struggle to handle the complex room connections and irregular room shapes that arise in large-scale floorplans. In this paper, we propose CG-Floor, a centroid-guided hierarchical framework that explicitly decouples topology and geometry to address these issues. |
Hongjin Lian; Jian Ma; Hongjie Chen; Jia Li; Ruizhen Hu; Yu-Kun Lai; Kun Li; | code |
| 831 | A Difference-in-Difference Approach to Detecting AI-Generated Images Highlight: However, these detectors become less effective as modern AI-generated images become increasingly similar to real ones. To address this challenge, we propose a novel difference-in-difference method. |
Xinyi Qi; Kai Ye; Chengchun Shi; Ying Yang; Jin Zhu; Hongyi Zhou; | code |
| 832 | MatchED: Crisp Edge Detection Using End-to-End, Matching-based Supervision Highlight: Moreover, all existing crisp edge detection methods still depend on such post-processing to achieve satisfactory results. To address this limitation, we propose MatchED, a lightweight (only ~21K additional parameters), plug-and-play matching-based supervision module that can be appended to any edge detection model for joint end-to-end learning of crisp edges. |
Bedrettin Cetinkaya; Sinan Kalkan; Emre Akbas; | code |
| 833 | The Invisible Gorilla Effect in Out-of-distribution Detection Highlight: To this end, we identify a previously unreported bias in OOD detection: for hard-to-detect artefacts (near-OOD), detection performance typically improves when the artefact shares visual similarity (e.g. colour) with the model’s ROI and drops when it does not – a phenomenon we term the Invisible Gorilla Effect. |
Harry Anthony; Ziyun Liang; Hermione Warr; Konstantinos Kamnitsas; | code |
| 834 | Explaining CLIP Zero-shot Predictions Through Concepts Highlight: In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and lack the ability to generalize to unseen classes. We introduce a framework that bridges these two paradigms by explaining CLIP’s zero-shot predictions through human-understandable concepts. |
Onat Ozdemir; Anders Christensen; Stephan Alaniz; Zeynep Akata; Emre Akbas; | code |
| 835 | MOGeo: Beyond One-to-One Cross-View Object Geo-localization Highlight: To bridge the gap between the realistic setting and the existing task, we propose a new task, called Cross-View Multi-Object Geo-Localization (CVMOGL). |
Lv Bo; Qingwang Zhang; Le Wu; Yuanyuan Li; Yingying Zhu; | code |
| 836 | OMoBlur: An Object Motion Blur Dataset and Benchmark for Real-World Local Motion Deblurring Highlight: Existing datasets either rely on costly beam-splitting capture with residual misalignment or employ synthetic blur that fails to model the continuous photon-integration process during exposure. To overcome these limitations, we introduce OMoBlur, a physically grounded dataset that emulates realistic exposure integration via programmable sensor control, ensuring close alignment between synthetic and real blur distributions. |
Dingchuan Yu; Jiatong Li; Jingwen Zhou; Zhengyue Zhuge; Yueting Chen; Qi Li; | code |
| 837 | TAR: Token-Aware Refinement for Fine-grained Generalized Category Discovery Highlight: In this paper, we argue that attention artifacts compel the model to overemphasize global semantics, consequently overlooking fine-grained local cues that are crucial for category discrimination. |
XingYu Yang; Yu Zhang; Siya Mi; Xiu-Shen Wei; | code |
| 838 | Continual Learning for fMRI-Based Brain Disorder Diagnosis Via Functional Connectivity Matrices Generative Replay Highlight: This paper presents the first continual learning framework specifically designed for fMRI-based diagnosis across heterogeneous clinical sites. |
Qianyu Chen; Shujian Yu; | code |
| 839 | Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation Highlight: Currently, most existing automatic trailer generation methods employ a selection-then-ranking paradigm (i.e., first selecting key shots and then ranking them), which suffers from inevitable error propagation and limits the quality of the generated trailers. Beyond this paradigm, we propose a new self-paced and self-corrective masked prediction method called SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. |
Sidan Zhu; Hongteng Xu; Dixin Luo; | code |
| 840 | Learning to Solve PDEs on Neural Shape Representations Highlight: We present a novel, mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. |
Lilian Welschinger; Yilin Liu; Zican Wang; Niloy J. Mitra; | code |
| 841 | NeighborMAE: Exploiting Spatial Dependencies Between Neighboring Earth Observation Images in Masked Autoencoders Pretraining Highlight: Since the Earth’s surface is continuous, neighboring images are highly related and offer rich contextual information for self-supervised learning. To close this gap, we propose NeighborMAE, which learns spatial dependencies by joint reconstruction of neighboring Earth Observation images. |
Liang Zeng; Valerio Marsocci; Wufan Zhao; Andrea Nascetti; Maarten Vergauwen; | code |
| 842 | SGDE: Self-supervised Geometry Degradation Estimation Framework for Coded Aperture Compressive Spectral Imaging Highlight: Existing methods either assume ideal imaging conditions, or rely on offline calibration, making them vulnerable to dynamic perturbations, such as thermal expansion and mechanical vibration that cause mask shifts. To address these limitations, we propose a Self-Supervised Geometry Degradation Estimation (SGDE) framework that explicitly models mask misalignments as an affine transformation and embeds it into the imaging model. |
Yuqiao He; Xiaoyan Liu; Jianxu Mao; Yaonan Wang; Hui Zhang; Lizhu Liu; Yurong Chen; Wenbin He; | code |
| 843 | SGS-Intrinsic: Semantic-Invariant Gaussian Splatting for Sparse-View Indoor Inverse Rendering Highlight: We present SGS-Intrinsic, an indoor inverse rendering framework that works well for sparse-view images. |
Jiahao Niu; Rongjia Zheng; Wenju Xu; Wei-Shi Zheng; Qing Zhang; | code |
| 844 | Regulating Rather Than Constraining: Adaptive Guidance for Complex Spectral Reconstruction in Pansharpening Highlight: Specifically, we introduce a simple data-level transformation, MixShuffle, which performs random convex combinations across spatial positions and spectral channels to generate training data with richer spatial structures and stronger spectral mixing. |
Zhuwei Wen; Zimin Xia; He Chen; Linwei Yue; Xianwei Zheng; | code |
| 845 | Spatial-SAM: Spatially Consistent 3D Electron Microscopy Segmentation with SDF Memory and Semi-Supervised Learning Highlight: To this end, we propose Spatial-SAM, a spatially consistent and annotation-efficient framework that achieves high precision on 3D-EM data. |
Yikai Huang; Renmin Han; Yuxuan Wang; Youcheng Cai; Ligang Liu; | code |
| 846 | VAST: Video Ability‑Stratified Taxonomy for Data‑Efficient Video Reasoning Highlight: On the data side, we propose VAST, an ability-stratified framework that reorganizes video understanding tasks into a three-layer cognitive taxonomy spanning Perception, Reasoning, and Cognition. |
Zhongan Wang; Xiaoyu Wen; Lingxiao Du; Kun Li; Zhiliang Wu; Xingcheng Xu; Qiaosheng Zhang; Chaochao Lu; Hehe Fan; | code |
| 847 | FusionRegister: Every Infrared and Visible Image Fusion Deserves Registration Abstract: Spatial registration across different visual modalities is a critical but formidable step in multi-modality image fusion for real-world perception. Although there are several … |
Congcong Bian; HaoLong Ma; Hui Li; Zhongwei Shen; Xiaoqing Luo; Xiaoning Song; Xiaojun Wu; | code |
| 848 | ReFTA: Breaking The Weight Reconstruction Bottleneck in Tensorized Parameter-Efficient Fine-Tuning Highlight: However, existing tensor-based methods typically require repeated reconstruction of model weights during training, leading to substantial computational and memory overhead. To overcome these limitations, we propose Reconstruction-Free Tensor-Based Adaptation (ReFTA), which offers four key advantages: (1) it eliminates repeated explicit tensor reconstruction by exploiting the algebraic properties of tensors; (2) it achieves lower quantization error by fine-tuning only the principal tensor components; (3) it is supported by a rigorous generalization guarantee rooted in the algebraic foundations of tensor product–based approaches; and (4) it adopts a unified design controlled by a single tensor rank configuration. |
Jingjing Zheng; Anda Tang; Qiangqiang Mao; Zhouchen Lin; Yankai Cao; | code |
| 849 | SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation Highlight: However, scaling to high-resolution images requires optimizing and storing millions of unstructured Gaussian primitives independently, leading to slow convergence and redundant parameters. To address this, we propose Structured Gaussian Image (SGI), a compact and efficient framework for representing high-resolution images. |
Zixuan Pan; Kaiyuan Tang; Jun Xia; Yifan Qin; Lin Gu; Chaoli Wang; Jianxu Chen; Yiyu Shi; | code |
| 850 | AD-GBC: Anisotropic Granular-Ball Skip-Connection Refiner for UNet-Based Medical Image Segmentation Highlight: Prototype or region-attention modules have recently improved medical image segmentation but still suffer from two fundamental limitations: 1) they represent each semantic concept as a point or isotropic region, failing to capture the inherently anisotropic geometry of real feature distributions; and 2) many rely on non-differentiable clustering or one-way kernel weighting, which restricts their ability to form coherent region-level representations. We address these issues with the Anisotropic Differentiable Granular-Ball (AD-GBC) module, which generalizes prototypes into learnable geometric regions parameterized by a center and an anisotropic vector scale. |
Xiya Shen; Qinglin Zhao; Li Feng; | code |
| 851 | Hyperbolic Busemann Neural Networks Highlight: In this work, we lift two core components of neural networks, Multinomial Logistic Regression (MLR) and Fully Connected (FC) layers, into hyperbolic space via Busemann functions, resulting in Busemann MLR (BMLR) and Busemann FC (BFC) layers with a unified mathematical interpretation. |
Ziheng Chen; Bernhard Schölkopf; Nicu Sebe; | code |
| 852 | DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance Highlight: In this paper, we introduce DriverGaze360, a large-scale 360° field of view driver attention dataset, containing ~1 million gaze-labeled frames collected from 19 human drivers, enabling comprehensive omnidirectional modeling of driver gaze behavior. |
Shreedhar Govil; Didier Stricker; Jason Rambach; | code |
| 853 | Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence Highlight: We introduce MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a fully differentiable and registration-free alternative. |
Panagiotis Filntisis; George Retsinas; Radek Danecek; Vanessa Sklyarova; Petros Maragos; Timo Bolkart; | code |
| 854 | Self-supervised Dynamic Heterogeneous Degradation Modeling for Unified Zero-Shot Image Restoration Highlight: Besides, we introduce a dynamic quality-refinement strategy that adaptively adjusts the diffusion trajectory for robust globally optimal convergence. |
Xiaowan Hu; Jing Yang; Henan Liu; Li Huaqiu; Mai Xu; | code |
| 855 | SARMAE: Masked Autoencoder for SAR Representation Learning Highlight: However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. |
Danxu Liu; Di Wang; Hebaixu Wang; Haoyang Chen; Wentao Jiang; Yilin Cheng; Haonan Guo; Wei Cui; Jing Zhang; | code |
| 856 | SAM2Text: Towards Prompt-Free and Multi-Resolution Video Scene Text Segmentation Highlight: To address these, we present a comprehensive framework based on SAM2. |
Jing-Yao Zhang; Heng Zhang; Mingsen Zhang; Binbin Yang; Fei Yin; | code |
| 857 | ContourVertex: Bridging Semantics and Geometry for Referring Remote Sensing Interpretation Highlight: In this paper, we propose RECS4R, a unified multi-task framework to upgrade RECS performance. |
Jinming Chai; Lingling Li; Licheng Jiao; Xiaoqiang Lu; Long Sun; Xu Liu; Wenping Ma; Weibin Li; | code |
| 858 | Semantic Audio-Visual Navigation in Continuous Environments Highlight: In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. |
Yichen Zeng; Hebaixu Wang; Meng Liu; Yu ZHOU; Chen Gao; Kehan Chen; Gongping Huang; | code |
| 859 | Real-Time Long Horizon Air Quality Forecasting Via Group-Relative Policy Optimization Highlight: We introduce Group-Relative Policy Optimization (GRPO) with class-wise rewards and curriculum rollout to align predictions with operational priorities. |
Inha Kang; Eunki Kim; Wonjeong Ryu; Jaeyo Shin; Seungjun Yu; Yoon-Hee Kang; Seongeun Jeong; Eunhye Kim; Soontae Kim; Hyunjung Shim; | code |
| 860 | HySeg: Learning Generative Priors for Structure-Aware Remote Sensing Segmentation Highlight: This imbalance leads to fragmented boundaries, texture overfitting, and poor cross-domain generalization. We address this challenge by reformulating RSISS as posterior inference grounded in generative structural priors, introducing HySeg, a hybrid generative–discriminative segmentation paradigm that learns structure-consistent priors through generative modeling and guides posterior inference for remote sensing segmentation. |
Jie Qiu; XIN LI; Fan Yang; Yan Wang; Dong Yu; Changying Wang; Linwei Dai; Yongxiang Chen; Youqin Chen; Jianzhang Chen; | code |
| 861 | Exact-GS: Mathematically Rigorous and Accurate 3D Gaussian Splatting for 3D X-ray Reconstruction Highlight: We propose Exact-GS, a novel mathematically rigorous and accurate 3D Gaussian Splatting model designed to perform 3D X-ray computed tomography (CT) reconstruction and novel view synthesis. |
Guangpu Yang; Steffen Kieß; Hanxiang Luo; Xingyu Liu; Sven Simon; | code |
| 862 | Balanced Dataset Distillation Via Modeling Multiple Visual Pattern Distribution Highlight: Specifically, they either overemphasize class-general patterns representing the majority of each class or focus on fewer marginal patterns critical for model generalization. To address this issue, we propose a novel framework, Balanced Patterns Selection (BPS). |
Guanghui Shi; Xuefeng Liang; Qixiang Wen; | code |
| 863 | Specificity-aware Reinforcement Learning for Fine-grained Open-world Classification Highlight: In this work, we investigate how to steer reasoning LMMs toward predictions that are both correct and specific. |
Samuele Angheben; Davide Berasi; Alessandro Conti; Elisa Ricci; Yiming Wang; | code |
| 864 | Beyond Euclidean Gossip: KL-Barycentric Consensus on Heterogeneous and Imbalanced Images Highlight: We propose a geometry-aware approach based on natural gradient variational inference. |
Lu Xu; Guosheng Yin; | code |
| 865 | Property-Informed Diffusion-Based Text-to-Microstructure Generation Highlight: Unlike traditional property conditioning methods, our approach leverages rich guidance in terms of semantics and physical properties in the text input to support diverse structure synthesis. |
Bingxuan Dai; Hongsong Wang; Jie Gui; | code |
| 866 | Sparse–View Localization Via Online Neural 3D Regression Highlight: We present ON3R, an online-trained neural regressor addressing sparse-view structureless localization, where database images have limited visual overlap and no prebuilt 3D map. |
Ludvig Dillén; Magnus Oskarsson; Viktor Larsson; | code |
| 867 | Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients Highlight: Inspired by axiomatic attribution in mechanistic interpretability, we introduce a fine-grained quantization strategy based on Quantization-aware Integrated Gradients (QIG), which leverages integrated gradients to quantitatively evaluate token sensitivity and push the granularity from modality level to token level, reflecting both inter-modality and intra-modality dynamics. |
Ziwei Xiang; Fanhu Zeng; Hongjian Fang; Rui-Qi Wang; Renxing Chen; Yanan Zhu; Yi Chen; Peipei Yang; Xu-Yao Zhang; | code |
| 868 | Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection Highlight: We propose FALCON-SFOD (Foundation-Aligned Learning with Clutter suppression and Noise robustness), a framework designed to enhance object-focused adaptation under domain shift. |
Sairam Rebbapragada; Rishabh Lalla; Aveen Dayal; Tejal Kulkarni; Anuj Lalla; Vineeth Balasubramanian; Muhammad Haris Khan; | code |
| 869 | Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation Highlight: In this paper, we propose a plug-and-play framework called **L**ocate-**T**hen-**S**parsify for **F**eature **S**teering (**LTS-FS**), which controls the steering intensity according to the hallucination relevance of each layer. |
Tiantian Dang; Chao Bi; Shufan Shen; Jinzhe Liu; Qingming Huang; Shuhui Wang; | code |
| 870 | A Closer Look at Cross-Domain Few-Shot Object Detection: Fine-Tuning Matters and Parallel Decoder Helps Highlight: Few-shot object detection (FSOD) is challenging due to unstable optimization and limited generalization arising from the scarcity of training samples. To address these issues, we propose a hybrid ensemble decoder that enhances generalization during fine-tuning. |
Xuanlong Yu; Youyang Sha; Longfei Liu; Xi Shen; Di Yang; | code |
| 871 | Spatio-Temporal Difference Guided Motion Deblurring with The Complementary Vision Sensor Highlight: As a recent breakthrough, the complementary vision sensor (CVS) captures synchronized RGB frames together with high-frame-rate, multi-bit spatial difference ($\mathcal{SD}$, encoding structural edges) and temporal difference ($\mathcal{TD}$, encoding motion cues) data within a single RGB exposure, offering a promising solution for RGB deblurring under extreme dynamic scenes. To fully leverage these complementary modalities, we propose Spatio-Temporal Difference Guided Deblur Net (STGDNet), which adopts a recurrent multi-branch architecture that iteratively encodes and fuses $\mathcal{SD}$ and $\mathcal{TD}$ sequences to restore structure and color details lost in blurry RGB inputs. |
Yapeng Meng; Lin Yang; Yuguo Chen; Xiangru Chen; Taoyi Wang; Lijian Wang; Zheyu Yang; Yihan Lin; Rong Zhao; | code |
| 872 | Resolving Endpoint Underfitting in Diffusion Bridges Via Noise Alignment Highlight: In this work, we find that this way leads to an anomalous underfitting phenomenon near the target endpoint, as the process approaches the target distribution ($t \to 0$). |
Yurong Gao; Zicheng Zhang; Congying Han; Tiande Guo; Xinmin Qiu; | code |
| 873 | Jailbreaking Vision-Language Models Via Dissonance-Guided Suffix Optimization and Image–Phrase Injection Highlight: In this work, we introduce **DGSIP**, a **D**issonance-**G**uided **S**uffix Optimization and **I**mage–**P**hrase Injection framework. |
Jiacheng Pi; Zhiguo Yang; Xingxing Huang; Dongsheng Xu; Ruizhi Zhong; Wenjie Ruan; | code |
| 874 | Differentiable Laplacian Matrix Guided Superpixel Segmentation Highlight: We propose a simple, fully differentiable graph-Laplacian loss that encourages spatial regularity and connectivity during training. |
Jeremy Juybari; Joshua Hamilton; Shuvra Das; Chaofan Chen; Andre Khalil; Yifeng Zhu; | code |
| 875 | Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes Highlight: Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. |
Changqing Zhou; Yueru Luo; Han Zhang; Zeyu Jiang; Changhao Chen; | code |
| 876 | Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction Highlight: We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. |
Changqing Zhou; Yueru Luo; Changhao Chen; | code |
| 877 | Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation Highlight: However, research in the 3D field remains limited, with most existing approaches applying post-processing techniques from 2D vision. To fill this gap, we propose a new efficient approach that directly operates in the feature space, modeling the feature distribution of inlier classes to constrain anomalous samples. |
Simone Mosco; Daniel Fusaro; Alberto Pretto; | code |
| 878 | PGR-Net: Prior-Guided ROI Reasoning Network for Brain Tumor MRI Segmentation Highlight: However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided Region Network), an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. |
Jiacheng Lu; Hui Ding; Shiyu Zhang; Guoping Huo; | code |
| 879 | Global-Aware Edge Prioritization for Pose Graph Initialization Highlight: Existing methods rely on image retrieval to connect each image to its $k$ nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. |
Tong Wei; Giorgos Tolias; Jiri Matas; Daniel Barath; | code |
| 880 | DC-Merge: Improving Model Merging with Directional Consistency Highlight: Model merging aims to integrate multiple task-adapted models into a unified model that preserves the knowledge of each task. In this paper, we identify that the key to this knowledge retention lies in maintaining the directional consistency of singular spaces between the merged multi-task vector and the individual task vectors. |
Han-Chen Zhang; Zi-Hao Zhou; Mao-Lin Luo; Shimin Di; Min-Ling Zhang; Tong Wei; | code |
| 881 | PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction Highlight: We propose the Physically-Plausible ISP (PPISP) correction module, which disentangles camera-intrinsic and capture-dependent effects through physically based and interpretable transformations. |
Isaac Deutsch; Nicolas Moënne-Loccoz; Gavriel State; Žan Gojčič; | code |
| 882 | No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection Highlight: Key contributing factors include limited dataset diversity and inadequate understanding of context-dependent anomalous semantics. To address these issues, we propose LAVIDA, an end-to-end zero-shot video anomaly detection framework. |
Zunkai Dai; Ke Li; JIAJIA LIU; Jie Yang; Yuanyuan Qiao; | code |
| 883 | No Hard Negatives Required: Concept Centric Learning Leads to Compositionality Without Degrading Zero-shot Capabilities of Contrastive Models Highlight: In this work we follow a different approach. |
Hai X. Pham; David T. Hoffmann; Ricardo Guerrero; Brais Martinez; | code |
| 884 | DUET-VLM: Dual Stage Unified Efficient Token Reduction for VLM Training and Inference Highlight: In this work, we propose *DUET-VLM*, a versatile plug-and-play dual compression framework that consists of (a) vision-only, redundancy-aware compression of the vision encoder’s output into information-preserving tokens, followed by (b) layer-wise, salient text-guided dropping of visual tokens within the language backbone to progressively prune less informative tokens. |
Aditya Kumar Singh; Hitesh Kandala; Pratik Brahma; Zicheng Liu; Emad Barsoum; | code |
| 885 | PriVi: Towards A General-Purpose Video Model for Primate Behavior in The Wild Highlight: Computer vision could greatly aid this research, but existing methods often rely on human-centric pretrained models and focus on single datasets, which limits generalization. We address this limitation by shifting from a model-centric to a data-centric approach and introduce PriVi, a large-scale primate-centric video pretraining dataset. |
Felix B Mueller; Jan Frederik Meier; Timo Lüddecke; Richard Vogg; Roger Freixanet; Valentin Hassler; Tiffany Bosshard; Elif Karakoc; William O'Hearn; Sofia Pereira; Sandro Sehner; Kaja Wierucka; Judith Burkart; Claudia Fichtel; Julia Fischer; Alexander Gail; Catherine Hobaiter; Julia Ostner; Liran Samuni; Oliver Schülke; Neda Shahidi; Erin G. Wessling; Alexander Ecker; | code |
| 886 | Do Vision-Language Models Leak What They Learn? Adaptive Token-Weighted Model Inversion Attacks Highlight: In this work, we present the first systematic study of MI attacks on VLMs to understand their susceptibility to leaking private visual training data. |
Ngoc-Bao Nguyen; Sy-Tuyen Ho; Koh Jun Hao; Ngai-Man Cheung; | code |
| 887 | Proof-of-Perception: Certified Tool-Using Multimodal Reasoning with Compositional Conformal Guarantees Highlight: We present Proof-of-Perception (PoP), a tool-using framework that casts multimodal reasoning as an executable graph with explicit reliability guarantees. |
Arya Fayyazi; Haleh Akrami; | code |
| 888 | Rethinking Knowledge Transfer in Image Quality Assessment: A Perceptual Preference Structure Alignment Perspective Highlight: We identify the root cause as inconsistent perceptual preference structures across datasets, where models trained on different sources rely on distinct perceptual cues, leading to mismatched conditional distributions $P(Y|X)$ that fundamentally limit transferability. To address this, we propose Perceptual Preference Representation (PPR), which quantifies dataset-specific perceptual preference structures by analyzing correlations between visual features and quality scores. |
Aobo Li; Jinjian Wu; Yongxu Liu; Jupo Ma; Weisheng Dong; | code |
| 889 | Revisiting The Necessity of Full Accuracy: Weakly Supervised Object-Level Offset Correction for Misaligned Building Labels Highlight: To address this challenge, we propose an Object-based Multi-stage Alignment Framework (OMAF) that generates high-quality corrected labels with minimal manual intervention. |
Junda Xu; Yanmeng Liu; Xiangqiang Zeng; Jinrong Wu; Ying Qu; Libao Zhang; | code |
| 890 | Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling? Highlight: Prior voxel-based models, although demonstrating excellent performance and interpretation capabilities, are constrained by prohibitive memory demands and thus can only capture limited temporal windows. To address this, we propose TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer), a novel approach that tokenizes fMRI volumes using a pre-trained 2D natural image autoencoder. |
Peter Yongho Kim; Juhyeon Park; Jungwoo Park; Jubin Choi; Jungwoo Seo; Jiook Cha; Taesup Moon; | code |
| 891 | Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder Highlight: In this work, we explore the feasibility of applying shallow encoders for ultra-low bitrate compression and propose a novel Asymmetric Extreme Image Compression (AEIC) framework that simultaneously pursues encoding simplicity and decoding quality. |
Tianyu Zhang; Dong Liu; Chang-Wen Chen; | code |
| 892 | Inter-Edit: First Benchmark for Interactive Instruction-Based Image Editing Highlight: To address the critical lack of suitable data, we propose an efficient pipeline to generate Inter-Edit, a new million-scale training dataset that simulates realistic user masks, which are not strictly segment-aligned. |
Delong Liu; Haotian Hou; Zhaohui Hou; Zhiyuan Huang; Shihao Han; Mingjie Zhan; Zhicheng Zhao; Fei Su; | code |
| 893 | OVSegDT: Segmenting Transformer for Open-Vocabulary Object Goal Navigation Highlight: We introduce the OVSegDT approach, which has three key features. |
Tatiana Zemskova; Aleksei Staroverov; Dmitry Yudin; Aleksandr Panov; | code |
| 894 | When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness Highlight: However, TTC remains fragile under strong attacks because its counterattack relies on a directly corrupted original view and employs a noise-driven hard-gating scheme that cannot adapt to varying corruption severity. To address these limitations, we introduce Multi-view guided Adaptive Counterattack (MAC), which performs counterattacks across multiple views with corruption-aware soft weighting. |
Sunoh Kim; Daeho Um; | code |
| 895 | Coded-E2LF: Coded Aperture Light Field Imaging from Events Highlight: We propose Coded-E2LF (coded event to light field), a computational imaging method for acquiring a 4-D light field using a coded aperture and a stationary event-only camera. |
Tomoya Tsuchida; Keita Takahashi; Chihiro Tsutake; Toshiaki Fujii; Hajime Nagahara; | code |
| 896 | SceMoS: Local Scene-Aware Human Motion Synthesis By Planning with Geometry-Grounded Tokens Highlight: We propose SceMoS, a scene-aware motion synthesis framework showing that structured 2D scene representations can serve as a powerful alternative to full 3D supervision in physically grounded motion synthesis. |
Anindita Ghosh; Vladislav Golyanik; Taku Komura; Philipp Slusallek; Christian Theobalt; Rishabh Dabral; | code |
| 897 | FMPose: 3D Pose Estimation Via Flow Matching Highlight: In contrast, Flow Matching (FM) learns an ODE-based velocity field, enabling efficient generation of 3D pose samples with only a few integration steps. Inspired by this capability, we propose a novel generative pose estimation framework, FMPose, that formulates 3D pose estimation as a conditional distribution transport problem. |
Ti Wang; Xiaohang Yu; Mackenzie Mathis; | code |
| 898 | Diagnose, Correct, and Learn from Manipulation Failures Via Visual Symbols Highlight: In light of these, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. |
Xianchao Zeng; Xinyu Zhou; Youcheng Li; Jiayou Shi; Tianle Li; Liangming Chen; Lei Ren; Yonglu Li; | code |
| 899 | Video Panels for Long Video Understanding Highlight: In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. |
Lars Doorenbos; Federico Spurio; Jürgen Gall; | code |
| 900 | OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from The Aerial Perspective Highlight: Crucially, we propose a LiDAR-free data generation framework based on the camera modality, which is ubiquitous on modern UAVs. |
Markus Gross; Sai Bharadhwaj Matha; Aya Fahmy; Rui Song; Daniel Cremers; Henri Meeß; | code |
| 901 | ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding Highlight: We introduce ChartNet, a high-quality, million-scale multimodal dataset designed to advance chart interpretation and reasoning. |
Jovana Kondic; Pengyuan Li; Dhiraj Joshi; Isaac Sanchez; Ben Wiesel; Shafiq Abedin; Amit Alfassy; Eli Schwartz; Daniel Caraballo; Yagmur Gizem Cinar; Florian Scheidegger; Steven I Ross; Daniel Weidele; Hang Hua; Ekaterina Arutyunova; Roei Herzig; Zihan Wang; Xinyue Yu; Yunfei Zhao; Sicong Jiang; Minghao Liu; Qunshu Lin; Aude Oliva; Rogerio Feris; | code |
| 902 | Point Cloud As A Foreign Language for Multi-modal Large Language Model Highlight: In this work, we present SAGE, the first end-to-end 3D MLLM that directly processes raw point clouds without relying on a pre-trained 3D encoder. |
Sneha Paul; Zachary Patterson; Nizar Bouguila; | code |
| 903 | Linguistic Priors for Visual Decoupling: Towards Symmetric Vision-Brain Alignment Highlight: Previous methods that directly align visual and brain representations often overlook this inherent asymmetry, resulting in suboptimal decoding performance. To address this, we propose a linguistic-prior-guided visual decoupling method, which introduces object-oriented textual descriptions as semantic guidance to explicitly decouple foreground objects from complex backgrounds in natural images, thereby establishing symmetric vision-brain alignment. |
Dongjun Liu; Weichen Dai; Jingsheng Qian; Honggang Liu; Hangjie Yi; Wanzeng Kong; | code |
| 904 | From Corners to Fiducial Tags: Revisiting Checkerboard Calibration for Event Cameras Highlight: This hinders reliable corner localization and makes calibration difficult. To address these issues, we present a novel calibration framework that directly detects checkerboard corners from a raw event stream. |
Taehun Ryu; Changwoo Kang; Kyungdon Joo; | code |
| 905 | Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos Highlight: We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. |
Yayuan Li; Aadit Jain; Filippos Bellos; Jason Corso; | code |
| 906 | TTL: Test-time Textual Learning for OOD Detection with Pretrained Vision-Language Models Highlight: Consequently, fixed labels fail to represent the diverse and evolving OOD semantics encountered in test streams. To address this limitation, we introduce **T**est-time **T**extual **L**earning (**TTL**), a framework that dynamically learns OOD textual semantics from unlabeled test streams, without relying on external OOD labels. |
Jinlun Ye; Jiang Liao; Runhe Lai; Xinhua Lu; Jia-Xin ZHUANG; Zhiyong Gan; Ruixuan Wang; | code |
| 907 | Reading Your Actions: Learning Generalizable Action Representations Via Pre-training AEMG Highlight: In this work, we introduce a novel perspective on EMG signals, treating muscle contractions as words and activation sequences as sentences. |
Zhenghao Huang; Kaikai Wang; HUILIN YAO; Lin Shu; | code |
| 908 | BiomedCCPL: Causal Conditional Prompt Learning for Biomedical Vision-Language Models Highlight: However, their generalization to unseen classes within the same dataset remains limited, as the image–text alignment semantics often rely on spurious cues present in seen classes that do not transfer. To tackle this, we propose **BiomedCCPL (Causal Conditional Prompt Learning)**, a framework that uses VGAP (Visual Grounder with Adaptive Prototype) to generate image-conditional prompts from multi-scale adaptive prototypes and employs SCD (Synergistic Causal Disentanglement) to regularize the generation of image-conditional prompts. |
Xueliang Cui; Juncai Zhang; Jiacheng Hou; Dan Lu; Hao Zhang; Ruxin Wang; | code |
| 909 | HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition Highlight: In this work, we introduce HypeVPR, a hierarchical embedding framework in hyperbolic space specifically designed to address the challenges of P2E matching. |
Suhan Woo; Seongwon Lee; Jinwoo Jang; Euntai Kim; | code |
| 910 | GH-NAF: Grid-Adaptive Hash-Level–Attended Neural Attenuation Fields for Discrepancy-Aware CBCT Highlight: Moreover, uniformly concatenating multi-resolution hash-grid features blends heterogeneous frequency components and noise into a single representation, causing artifacts: homogeneous regions acquire spurious high-frequency patterns, structural boundaries become blurred, and projection-induced bias propagates throughout the learned field. Given these limitations, we introduce the Grid-Adaptive Hash-Level–Attended Neural Attenuation Field (GH-NAF). |
Seong Je Oh; Ju Hwan Lee; Chae Lim; Donghwan Lee; Myung Chung; Kyungsu Kim; | code |
| 911 | Leveraging Multispectral Sensors for Color Correction in Mobile Cameras Highlight: We propose a unified, learning-based framework that (i) performs end-to-end color correction and (ii) jointly leverages data from a high-resolution RGB sensor and an auxiliary low-resolution MS sensor. |
Luca Cogo; Marco Buzzelli; Simone Bianco; Javier Vazquez-Corral; Raimondo Schettini; | code |
| 912 | FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics Highlight: While graph neural networks have been proposed to model interactions between tissue regions, their reliance on pre-defined sparse graphs prevents them from considering potentially interacting spot pairs, resulting in a structural limitation in capturing complex biological relationships. To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention-based framework that models the tissue as a fully connected graph, enabling the consideration of all pairwise interactions. |
Taejin Jeong; Joohyeok Kim; Jinyeong Kim; Chanyoung Kim; Seong Jae Hwang; | code |
| 913 | FluidGaussian: Propagating Simulation-Based Uncertainty Toward Functionally-Intelligent 3D Reconstruction Highlight: We propose FluidGaussian, a plug-and-play method that tightly couples geometry reconstruction with ubiquitous fluid-structure interactions to assess surface quality at high granularity. |
Yuqiu Liu; Jialin Song; Marissa Ramirez de Chanlatte; Rochishnu Chowdhury; Rushil Desai; Wuyang Chen; Daniel Martin; Michael Mahoney; | code |
| 914 | Continual Distillation of Teachers from Different Domains Highlight: To better trade off between UKT and UKF, we propose Self External Data Distillation (SE2D), a method that preserves logits on external data to stabilize learning across heterogeneous teachers. |
Nicolas Michel; Maorong Wang; Jiangpeng He; Toshihiko Yamasaki; | code |
| 915 | Hoi! – A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation Highlight: We present a dataset for force-grounded, cross-view articulated manipulation that couples what is seen with what is done and what is felt during real human interaction. |
Tim Engelbracht; René Zurbrügg; Matteo Wohlrapp; Martin Büchner; Abhinav Valada; Marc Pollefeys; Hermann Blum; Zuria Bauer; | code |
| 916 | Anatomica: Localized Control Over Geometric and Topological Properties for Anatomical Diffusion Models Highlight: We present an inference-time guidance framework for generating 3D multi-class anatomical voxel maps with localized geometric and topological control. |
Karim Kadry; Abdalla Abdelwahed; Ajay Manicka; Naravich Chutisilp; Farhad R. Nezami; Elazer R Edelman; | code |
| 917 | Seeing Beyond 8bits: Subjective and Objective Quality Assessment of HDR-UGC Videos Highlight: We propose (i) a novel HDR-aware vision encoder to produce HDR-sensitive embeddings, and (ii) HDR-Aware Policy Optimization (HAPO), an RL finetuning framework that anchors reasoning to HDR cues. |
Shreshth Saini; Bowen Chen; Yilin Wang; Neil Birkbeck; Balu Adsumilli; Alan Bovik; | code |
| 918 | FiDeSR: High-Fidelity and Detail-Preserving One-Step Diffusion Super-Resolution Highlight: In this paper, we propose FiDeSR, a high-fidelity and detail-preserving one-step diffusion super-resolution framework. |
Aro Kim; Myeongjin Jang; Chaewon Moon; Youngjin Shin; Jinwoo Jeong; Sang-hyo Park; | code |
| 919 | Virtual Immunohistochemistry Staining with Dual-Aligned Multi-Task Feature Guidance Highlight: This paper proposes a novel framework for virtual IHC staining guided by dual-aligned multi-task features, which fully explores semantic cues from auxiliary tasks. |
Shigeng Xie; Hongming Xu; Guiyang Jiang; Tuomo Rossi; Tommi Kärkkäinen; Fengyu Cong; | code |
| 920 | Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation Highlight: In this paper, we present Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts UAV absolute location and heading from neighboring features, enabling accurate, lightweight, and robust navigation in the wild. |
Liu Kejia; Haoyang Zhou; Ruoyu Xu; Peicheng Wang; Mingli Song; Haofei Zhang; | code |
| 921 | Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment Highlight: We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. |
Yu Fanqi; Matteo Tiezzi; Tommaso Apicella; Cigdem Beyan; Vittorio Murino; | code |
| 922 | G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval Highlight: This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR (G-MIXER). |
Jiyoung Lim; Heejae Yang; Jee-Hyong Lee; | code |
| 923 | MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations Highlight: However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, yet effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. |
Changlu Guo; Anders Nymark Christensen; Anders Dahl; Morten Hannemose; | code |
| 924 | Denoising As Path Planning: Training-Free Acceleration of Diffusion Models with DPCache Highlight: However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. |
Bowen Cui; Yuanbin Wang; Huajiang Xu; Biaolong Chen; Aixi Zhang; Hao Jiang; Zhengzheng Jin; Xu Liu; Pipei Huang; | code |
| 925 | Towards Photorealistic and Efficient Bokeh Rendering Via Diffusion Framework Highlight: A naive solution is to enhance image quality before applying bokeh rendering, yet this two-stage pipeline reduces efficiency and introduces unnecessary error accumulation. To overcome these limitations, we propose MagicBokeh, a unified diffusion-based framework designed for high-quality and efficient bokeh rendering. |
Linxiao Shi; Siming Zheng; Zerong Wang; Hao Zhang; Jinwei Chen; Bo Li; Shifeng Chen; Peng-Tao Jiang; | code |
| 926 | CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography Highlight: Moreover, recent driving datasets primarily provide sparse LiDAR ground truth for images, which is insufficient for assessing fine-grained geometry in depth estimation and completion. To address these gaps, we introduce CARD, a multi-modal driving dataset that delivers quasi-dense 3D ground truth across continuous sequences rich in speed bumps, potholes, irregular surfaces and off-road segments. |
Gasser Elazab; Frank Neuhaus; Tilman Koß; Malte Splietker; Aditya Date; Michael Unterreiner; Maximilian Jansen; Olaf Hellwich; | code |
| 927 | AnomalyVFM — Transforming Vision Foundation Models Into Zero-Shot Anomaly Detectors Highlight: To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. |
Matic Fučka; Vitjan Zavrtanik; Danijel Skočaj; | code |
| 928 | Cross-Instance Gaussian Splatting Registration Via Geometry-Aware Feature-Guided Alignment Highlight: We present Gaussian Splatting Alignment (GSA), a novel method for aligning two independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation; translation; scale), even when they are of different objects in the same category (e.g., different cars). |
Roy Amoyal; Oren Freifeld; Chaim Baskin; | code |
| 929 | Unleashing Stealthy Backdoor Pandemic By Infecting A Single Diffusion Model Highlight: We propose four necessary tests that a successful backdoor attack on the diffusion model should pass to cause a backdoor pandemic. |
Mohaiminul Al Nahian; Abeer Almalky; Sabbir Ahmed; Abdullah Al Arafat; Mamshad Nayeem Rizve; Adnan Rakin Rakin; | code |
| 930 | Breaking The Continuum: Discrete Distribution Learning for Structural MRI Reconstruction Highlight: These properties naturally induce clustered patterns in the latent space, which are difficult to capture using conventional continuous generative priors that assume smooth manifold distributions. To address this limitation, we propose DiCoS (Discrete–Continuous Synthesis), a generative reconstruction framework that integrates discrete structural reasoning with continuous refinement. |
Tianle Lyu; Mengjingcheng Mo; Ting Wen; Zhen Song; Zinan Xiong; Yanjie Zhu; | code |
| 931 | OccAny: Generalized Unconstrained Urban 3D Occupancy Highlight: Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii) Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. |
Anh Quan Cao; Tuan-Hung Vu; | code |
| 932 | Language-Free Generative Editing from One Visual Example Highlight: The door to cost-efficient visual editing remains open, and the key lies in a vision-centric paradigm that perceives and reasons about visual change as humans do, beyond words. Inspired by this, we introduce Visual Diffusion Conditioning (VDC), a training-free framework that learns conditioning signals directly from visual examples for precise, language-free image editing. |
Omar Elezabi; Eduard Zamfir; Zongwei Wu; Radu Timofte; | code |
| 933 | Unified Number-Free Text-to-Motion Generation Via Flow Matching Highlight: We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). |
Guanhe Huang; Oya Celiktutan; | code |
| 934 | Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising Highlight: Existing blind-spot network (BSN) methods rely on pixel-shuffle downsampling (PD) to decorrelate noise, but aggressive downsampling fragments fine structures, while milder downsampling fails to remove correlated noise. To address this, we introduce Next-Scale Prediction (NSP), a novel self-supervised paradigm that decouples noise decorrelation from detail preservation. |
Yiwen Shan; Haiyu Zhao; Peng Hu; Xi Peng; Yuanbiao Gou; | code |
| 935 | COPE: Consistent Occlusion and Prompt Enhancement Network for Occluded Person Re-identification Highlight: While existing efforts have explored occlusion-aware data augmentation and feature reconstruction to mitigate these issues, the former often fails to address erroneous matches caused by similar occlusion patterns and background distractions, whereas the latter typically introduces significant computational overhead. To overcome these limitations, we propose a Consistent Occlusion and Prompt Enhancement (COPE) network. |
Sun Siyi; Jinliang Lin; Juanjuan Weng; Zhihui Liu; Shaozi Li; Zhiming Luo; | code |
| 936 | GQIR: Generative Quanta Image Reconstruction Highlight: We present an approach that adapts large text-to-image latent diffusion models to the photon-limited domain of quanta burst imaging. |
Aryan Garg; Sizhuo Ma; Mohit Gupta; | code |
| 937 | MAD: Motion Appearance Decoupling for Efficient Driving World Models Highlight: We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. |
Ahmad Rahimi; Valentin Gerard; Éloi Zablocki; Matthieu Cord; Alex Alahi; | code |
| 938 | BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation Highlight: Existing methods tend to generate limited sub-actions or incoherent motion due to fixed-length temporal inputs and discrete frame-wise representations that fail to capture rich motion semantics. We address these limitations by representing motion with continuous differentiable B-spline curves, enabling more effective motion generation without modifying the capabilities of the underlying generative model. |
Miaowei Wang; Qingxuan Yan; Zhi Cao; Yayuan Li; Oisin Mac Aodha; Jason Corso; Amir Vaxman; | code |
| 939 | Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images Highlight: Despite strong results, existing approaches treat gene expression as a mere slide- or spot-level signal and do not incorporate the fact that the measured expression arises from the aggregation of underlying cell-level expression. To explicitly introduce this missing cell-resolved guidance, we propose a Cell-type Prototype-informed Neural Network (CPNN) that leverages publicly available single-cell RNA-sequencing datasets. |
Kazuya Nishimura; Ryoma Bise; Shinnosuke Matsuo; Haruka Hirose; Yasuhiro Kojima; | code |
| 940 | Hilbert Curve-Based Attention Enabling Topology-Preserving Image Tensor Representation for Semantic Segmentation Network Highlight: Drone-based building defect segmentation remains challenging due to complex surface textures and illumination variations. We propose TPSegformer, a topology-preserving segmentation framework that mitigates mis-segmentation in such scenarios. |
Linkang Xu; Gang Li; Yue Song; Xiangxin Ji; | code |
| 941 | Lipschitz Optimization for Formal Verification of Homographies Highlight: In this paper, we present a formal verification approach when the capturing camera undergoes 3D motion perturbations. |
Jean-Guillaume Durand; Panagiotis Kouvaros; Maxime Gariel; Alessio Lomuscio; | code |
| 942 | PAF: Perturbation-Aware Filtering for Open-Set Semi-Supervised Learning Highlight: Specifically, we propose a novel filtering strategy, Perturbation-Aware Filtering (PAF), which identifies OOD samples by measuring the representation instability under semantic-preserving perturbations. |
Yinan Han; Qing-Yuan Jiang; | code |
| 943 | FunFact: Building Probabilistic Functional 3D Scene Graphs Via Factor-Graph Reasoning Highlight: We introduce FunFact, a framework for constructing probabilistic open-vocabulary functional 3D scene graphs from posed RGB-D images. |
Zhengyu Fu; René Zurbrügg; Kaixian Qu; Marc Pollefeys; Marco Hutter; Hermann Blum; Zuria Bauer; | code |
| 944 | Role-SynthCLIP: A Role-Play Driven Diverse Synthetic Data Approach Highlight: However, while existing synthetic data generation methods primarily focus on increasing data volume, such emphasis often leads to limited semantic diversity and redundant or shallow captions. To address this limitation, we propose Role-SynthCLIP, a novel data synthesis framework that leverages multi-perspective role-playing prompts (e.g., a compositional analyst, an interpreter of image context) to guide Multimodal Large Language Models (MLLMs) in generating semantically diverse captions from distinct viewpoints. |
Yuanxiang Huangfu; Chaochao Wang; Weilei Wang; | code |
| 945 | Reflection Separation from A Single Image Via Joint Latent Diffusion Highlight: Existing methods often struggle to recover both layers in glare or weak-reflection scenarios, because of insufficient information. This paper presents the first diffusion model explicitly fine-tuned for this task, leveraging generative diffusion priors for robust separation. |
Zheng-Hui Huang; Zhixiang Wang; Yu-Lun Liu; Yung-Yu Chuang; | code |
| 946 | Generative Adversarial Perturbations with Cross-paradigm Transferability on Localized Crowd Counting Highlight: We introduce a novel adversarial framework that compromises both density map and point regression architectural paradigms through a comprehensive multi-task loss optimization. |
Alabi Mehzabin Anisha; Guangjing Wang; Sriram Chellappan; | code |
| 947 | GazeShift: Unsupervised Gaze Estimation and Dataset for VR Highlight: Gaze annotation is difficult since fixation on intended targets cannot be guaranteed. To address these challenges, we introduce VRGaze, the first large-scale off-axis gaze estimation dataset for VR, comprising 2.1 million near-eye infrared images collected from 68 participants. |
Gil Shapira; Ishay Goldin; Evgeny Artyomov; Donghoon Kim; Yosi Keller; Niv Zehngut; | code |
| 948 | Your Dissimilarities Define You: Complementary Learning Exploiting Class Diversities Highlight: In this work, we exploit class dissimilarities in a rather novel way, which provides complementary learning information beyond correct classification, that is not fully utilized in existing learning paradigms. |
Dimitrios Katsikas; Nikolaos Passalis; Anastasios Tefas; | code |
| 949 | Toward Early Quality Assessment of Text-to-Image Diffusion Models Highlight: However, this post-hoc quality assessment is highly resource-intensive since quality is assessed after dozens to hundreds of denoising steps per image, leading to substantial waste on low-quality samples. To address this issue, we propose Probe-Select, a plug-in framework for early quality assessment in T2I generation. |
Huanlei Guo; Hongxin Wei; Bingyi Jing; | code |
| 950 | NERFIFY: Multi Agent Framework for Turning NeRF Papers Into Code Highlight: We introduce NERFIFY, a multi-agent framework that reliably converts NeRF research papers into trainable Nerfstudio plugins, in contrast to generic paper-to-code methods and frontier models like GPT-5 that usually fail to produce runnable code. |
Seemandhar Jain; Keshav Gupta; Kunal Gupta; Manmohan Chandraker; | code |
| 951 | DROID-SLAM in The Wild Highlight: We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. |
Moyang Li; Zihan Zhu; Marc Pollefeys; Daniel Barath; | code |
| 952 | Linear Fundamental Matrix Estimation from 7 or 5 Points Highlight: In this paper, we consider a special case of the 7-point problem where 5 of the points are configured to lie on two lines, which has previously been shown to have a unique solution. |
Taci Ata Kucukpinar; Juan Mogollon; Joshua Fraser; Timothy Duff; Kannappan Palaniappan; | code |
| 953 | $\oslash$ Source Models Leak What They Shouldn’t $\nrightarrow$ : Unlearning Zero-Shot Transfer in Domain Adaptation Through Adversarial Optimization Highlight: We propose a new unlearning method, where an adversarially generated forget class sample is unlearned by the model during the domain adaptation process using a novel rescaled labeling strategy and adversarial optimization. |
Arnav Devalapally; Poornima Jain; Kartik Srinivas; Vineeth Balasubramanian; | code |
| 954 | Critical Patch-Aware Sparse Prompting with Decoupled Training for Continual Learning on The Edge Highlight: In this paper, we propose CPS-Prompt, a critical patch-aware sparse prompting framework that enhances training efficiency with minimal accuracy loss by combining Critical Patch Sampling (CPS) for task-aware token selection and Decoupled Prompt–Classifier Training (DPCT) for representation alignment. |
Wonseon Lim; Jaesung Lee; Dae-Won Kim; | code |
| 955 | Improving Controllable Generation: Faster Training and Better Performance Via X0-Supervision Highlight: In this work, we revisit the training objective of controllable diffusion models through a detailed analysis of their denoising dynamics. |
Amadou S. Sangare; Adrien Maglo; Mohamed Chaouch; Bertrand Luvison; | code |
| 956 | SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving Highlight: We introduce SearchAD, a large-scale rare image retrieval dataset for AD containing over 423k frames drawn from 11 established datasets. |
Felix Embacher; Jonas Uhrig; Marius Cordts; Markus Enzweiler; | code |
| 957 | Camera Control for Text-to-Image Generation Via Learning Viewpoint Tokens Highlight: In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. |
Xinxuan Lu; Charless Fowlkes; Alex Berg; | code |
| 958 | Is Bin Generation Indispensable? A Bin-Generation-Free Dataset Quantization Via Semantic Perspective Highlight: Moreover, a fixed drop ratio in its patch dropping step fails to adapt to the diverse redundancy levels across samples, which degrades the representational quality of the quantized coreset. To address these limitations, we present Bin-Generation-Free Dataset Quantization (BGFDQ), a fully restructured framework that incorporates a simple yet effective KNN-based neighbor identification and neighbor-aware coreset selection strategy. |
Deng Maijie; Yuhua Li; Yixiong Zou; Yao Wu; Chenru Ma; | code |
| 959 | MoECLIP: Patch-Specialized Experts for Zero-shot Anomaly Detection Highlight: Existing approaches attempting to solve this challenge share the fundamental limitation of a patch-agnostic design that processes all patches monolithically without regard for their unique characteristics. To address this limitation, we propose MoECLIP, a Mixture-of-Experts (MoE) architecture for the ZSAD task, which achieves patch-level adaptation by dynamically routing each image patch to a specialized Low-Rank Adaptation (LoRA) expert based on its unique characteristics. |
Jun Yeong Park; JunYoung Seo; Minji Kang; Yu Rang Park; | code |
| 960 | AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting Highlight: To remain tractable, they often spatially downsample inputs, losing fine-grained details crucial for precise localization. To address these limitations, we propose AdaSpot, a simple yet effective framework that processes low-resolution videos to extract global task-relevant features while adaptively selecting the most informative region-of-interest in each frame for high-resolution processing. |
Artur Xarles i Esparraguera; Sergio Escalera; Thomas B. Moeslund; Albert Clapés; | code |
| 961 | Towards High-Quality Image Segmentation: Improving Topology Accuracy By Penalizing Neighbor Pixels Highlight: We present SCNP, an efficient method that improves topology accuracy by penalizing the logits with their poorest-classified neighbor, forcing the model to improve the prediction at the pixels’ neighbors before allowing it to improve the pixels themselves. |
J. Miguel Valverde; Dim Papadopoulos; Rasmus Larsen; Anders Dahl; | code |
| 962 | 3D Sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds Highlight: In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. |
Ryousuke Yamada; Kohsuke Ide; Yoshihiro Fukuhara; Hirokatsu Kataoka; Gilles Puy; Andrei Bursuc; Yuki M Asano; | code |
| 963 | PMRNet: Physics-informed Multi-scale Refinement Network for Medical Image Segmentation Highlight: Medical image segmentation demands both high accuracy and computational efficiency, yet existing methods face a critical trade-off: CNNs lack global context while transformers incur prohibitive costs for deployment on resource-constrained devices. To address this challenge, we propose Physics-informed Multi-scale Refinement Network (PMRNet), integrating symplectic geometry, renormalization group theory, and entropy diffusion to guide feature learning. |
Boce Kang; | code |
| 964 | ELVIS: Enhance Low-light for Video Instance Segmentation in The Dark Highlight: In this paper, we introduce ELVIS (Enhance Low-light for Video Instance Segmentation), a novel framework that enables effective domain adaptation of state-of-the-art VIS models to low-light scenarios. |
Joanne Lin; Ruirui Lin; Yini Li; David Bull; Nantheera Anantrasirichai; | code |
| 965 | Spectral Conformal Risk Control: Distribution-Free Tail Guarantees Via Bayesian Quadrature Highlight: While conformal prediction gives distribution-free uncertainty guarantees, most existing methods only control mean error and are hard to tune toward rare but high-cost mistakes. We propose Bayesian-Quadrature Spectral Risk Control (BQ-SRC), a general framework for controlling tail-focused risks (such as conditional value at risk (CVaR)-style objectives) in a distribution-free way. |
Mohammad Esfeh Esfeh; Qi Yan; Yongxing Zhang; Zahra Gholami; Renjie Liao; Purang Abolmaesumi; | code |
| 966 | Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers Highlight: We present ALOE (ALign Once to Explain), a one-time, label-free feature alignment based approach that efficiently converts foundational vision models into inherently interpretable B-cos variants. |
Raphael Maser; Siddhartha Gairola; Sukrut Rao; Bernt Schiele; | code |
| 967 | Visual Grounding for Object Questions Highlight: We introduce the novel problem of Visual Grounding for Object Questions (VGOQ). |
Martin Nicolas Everaert; Xiruo Liu; Hiroyuki Takeda; Raja Bala; Vivek Yadav; Vidya Narayanan; | code |
| 968 | VoDaSuRe: A Large-Scale Dataset Revealing Domain Shift in Volumetric Super-Resolution Highlight: Such a training setup arises partly from the limited availability of paired high- and low-resolution volumetric datasets. To address this gap, we introduce VoDaSuRe, a large-scale volumetric dataset containing paired high- and low-resolution scans. |
August Leander Høeg; Sophia Bardenfleth; Hans Martin Kjer; Tim Dyrby; Vedrana Dahl; Anders Dahl; | code |
| 969 | Rewis3d: Reconstruction for Weakly-Supervised Semantic Segmentation Highlight: We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. |
Jonas Ernst; Wolfgang Boettcher; Lukas Hoyer; Jan Lenssen; Bernt Schiele; | code |
| 970 | Streamlined Knowledge Distillation Highlight: These methods are often inefficient due to redundant objectives, suboptimal transformations, and poorly designed loss functions. Motivated by these issues, we propose Streamlined Knowledge Distillation (SKD), a simple yet effective logit-based method that transfers only two essential forms of knowledge without requiring additional alignment or relational modeling. |
Hyeon-Jin Jung; Han-Jin Lee; Seok-Hwan Choi; | code |
| 971 | Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery Highlight: We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. |
Minh Kha Do; Wei Xiang; Kang Han; Di Wu; Khoa T. Phan; Yi-Ping Phoebe Chen; Gaowen Liu; Ramana Kompella; | code |
| 972 | Content-Aware Frequency Encoding for Implicit Neural Representations with Fourier-Chebyshev Features Highlight: This forces multi-layer perceptrons (MLPs) to inefficiently compose the required frequencies, thereby constraining their representational capacity. To address this limitation, we propose Content-Aware Frequency Encoding (CAFE), which builds upon Fourier features through multiple parallel linear layers combined via a Hadamard product. |
Junbo Ke; Yangyang Xu; Chao Wang; You-Wei Wen; | code |
| 973 | The More, The Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment Highlight: In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. |
Stefanos Koutoupis; Michaela Areti Zervou; Konstantinos Kontras; Maarten De Vos; Panagiotis Tsakalides; Grigorios Tsagkatakis; | code |
| 974 | Dedelayed: Deleting Remote Inference Delay Via On-device Correction Highlight: The alternative is to use fully local inference; but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy. To address these challenges, we propose Dedelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. |
Dan Jacobellis; Mateen Ulhaq; Fabien Racapé; Hyomin Choi; Neeraja Yadwadkar; | code |
| 975 | Question-guided Visual Compression with Memory Feedback for Long-Term Video Understanding Highlight: To this end, we propose Question-guided Visual Compression with Memory Feedback (QViC-MF), a framework for long-term video understanding. |
Sosuke Yamao; Natsuki Miyahara; Yuankai Qi; Shun Takeuchi; | code |
| 976 | Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection Highlight: In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. |
Zhanhe Lei; Zhongyuan Wang; Jikang Cheng; Baojin Huang; Yuhong Yang; Zhen Han; Chao Liang; Dengpan Ye; | code |
| 977 | Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework Highlight: This characteristic leads to highly imbalanced texture distributions, posing a significant hurdle to the model’s spatial perception. To address this, we propose TexADiff, a novel framework that begins by estimating a Relative Texture Density Map (RTDM) that reflects the underlying texture distribution. |
Enzhuo Zhang; Sijie Zhao; Dilxat Muhtar; Zhenshi Li; Xueliang Zhang; Pengfeng Xiao; | code |
| 978 | TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement Highlight: We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with DINOv2 pre-training in a self-supervised manner, and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. |
Arian Sabaghi; Jose Oramas; | code |