ICCV 2025 Papers with Code & Data
To facilitate rapid community engagement with the presented research, we have compiled an index of accepted papers that have associated public code or data repositories, listed in the table below. The index was generated using an automated extraction process; while we strive for completeness, some papers with public resources may have been missed. Please let us know if you discover additional papers that should be included. Note that some code repositories may not be made fully public until the conference officially begins.
In addition to this index, we encourage readers to explore our related resources:
- ICCV-2025 papers & highlights: curated summaries and key takeaways from this year’s conference.
- “Best Paper” Digest (ICCV): a historical overview of the most influential ICCV papers published since 1988.
This curated list is created by the Paper Digest Team. Experience the cutting-edge capabilities of Paper Digest, an innovative AI-powered research platform that delivers personalized, comprehensive daily paper digests on the latest research in your field. It also empowers you to read and write articles, get answers, conduct literature reviews, and generate research reports.
Experience the full potential of our services today!
TABLE 1: ICCV 2025 Papers with Code & Data
| # | Paper | Author(s) | Code |
|---|---|---|---|
| 1 | Scaling Language-Free Visual Representation Learning. Highlight: In this work, we ask the question: "Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?" | David Fan; Shengbang Tong; Jiachen Zhu; Koustuv Sinha; Zhuang Liu; Xinlei Chen; Michael Rabbat; Nicolas Ballas; Yann LeCun; Amir Bar; Saining Xie | code |
| 2 | MIEB: Massive Image Embedding Benchmark. Highlight: We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. | Chenghao Xiao; Isaac Chung; Imene Kerboua; Jamie Stirling; Xin Zhang; Márton Kardos; Roman Solomatin; Noura Al Moubayed; Kenneth Enevoldsen; Niklas Muennighoff | code |
| 3 | Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats. Highlight: We propose Long-LRM, a feed-forward 3D Gaussian reconstruction model for instant, high-resolution, 360° wide-coverage, scene-level reconstruction. | Chen Ziwen; Hao Tan; Kai Zhang; Sai Bi; Fujun Luan; Yicong Hong; Li Fuxin; Zexiang Xu | code |
| 4 | YOLOE: Real-Time Seeing Anything. Highlight: In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. | Ao Wang; Lihao Liu; Hui Chen; Zijia Lin; Jungong Han; Guiguang Ding | code |
| 5 | MRGen: Segmentation Data Engine For Underrepresented MRI Modalities. Highlight: Concretely, our contributions are threefold: (i) we introduce MRGen-DB, a large-scale radiology image-text dataset comprising extensive samples with rich metadata, including modality labels, attributes, regions, and organs information, with a subset featuring pixel-wise mask annotations; (ii) we present MRGen, a diffusion-based data engine for controllable medical image synthesis, conditioned on text prompts and segmentation masks. | Haoning Wu; Ziheng Zhao; Ya Zhang; Yanfeng Wang; Weidi Xie | code |
| 6 | From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models Via Reflection Tuning. Highlight: Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and refine their outputs. | Le Zhuo; Liangbing Zhao; Sayak Paul; Yue Liao; Renrui Zhang; Yi Xin; Peng Gao; Mohamed Elhoseiny; Hongsheng Li | code |
| 7 | Randomized Autoregressive Visual Generation. Highlight: This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. | Qihang Yu; Ju He; Xueqing Deng; Xiaohui Shen; Liang-Chieh Chen | code |
| 8 | LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion Via Distillation to Learnable Look-Up Tables. Highlight: In this paper, we propose a novel approach toward extremely fast fusion via distillation to learnable look-up tables specifically designed for image fusion, termed LUT-Fuse. | Xunpeng Yi; Yibing Zhang; Xinyu Xiang; Qinglong Yan; Han Xu; Jiayi Ma | code |
| 9 | VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models. Highlight: However, existing benchmarks for vision-language RMs (VLRMs) typically assess only a single aspect of their capabilities (e.g., distinguishing between two answers), thus limiting the all-round evaluation and restricting the development of RMs in the visual-language domain. To address this gap, we propose a comprehensive and challenging benchmark, dubbed as VLRMBench, encompassing 12,634 questions. | Jiacheng Ruan; Wenzhen Yuan; Xian Gao; Ye Guo; Daoxin Zhang; Zhe Xu; Yao Hu; Ting Liu; Yuzhuo Fu | code |
| 10 | FlowChef: Steering of Rectified Flow Models for Controlled Generations. Highlight: In this paper, we present FlowChef, a novel training-, inversion-, and gradient-free inference-time steering strategy for RFMs that deterministically guides the denoising process. | Maitreya Patel; Song Wen; Dimitris N. Metaxas; Yezhou Yang | code |
| 11 | GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding. Highlight: However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. | Rui Hu; Lianghui Zhu; Yuxuan Zhang; Tianheng Cheng; Lei Liu; Heng Liu; Longjin Ran; Xiaoxin Chen; Wenyu Liu; Xinggang Wang | code |
| 12 | Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation. Highlight: In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a k×k grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. | Sucheng Ren; Qihang Yu; Ju He; Xiaohui Shen; Alan Yuille; Liang-Chieh Chen | code |
| 13 | MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling. Highlight: In this work, we present MaTVLM, a method for distilling pre-trained vision-language models (VLMs) into an efficient Mamba-Transformer hybrid architecture. | Yingyue Li; Bencheng Liao; Wenyu Liu; Xinggang Wang | code |
| 14 | External Knowledge Injection for CLIP-Based Class-Incremental Learning. Highlight: To enhance knowledge transfer from outside the dataset, we propose a dual-branch injection tuning framework that encodes informative knowledge from both visual and textual modalities. | Da-Wei Zhou; Kai-Wen Li; Jingyi Ning; Han-Jia Ye; Lijun Zhang; De-Chuan Zhan | code |
| 15 | PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos. Highlight: In this paper, we present PhysTwin, a novel framework that uses sparse videos of dynamic objects in interaction to produce a photo- and physically realistic, real-time interactive virtual replica. Our approach centers on two key components: (1) a physics-informed representation that combines spring-mass models for realistic physical simulation, generative shape models for geometry, and Gaussian splats for rendering, and (2) a novel multi-stage optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. | Hanxiao Jiang; Hao-Yu Hsu; Kaifeng Zhang; Hsin-Ni Yu; Shenlong Wang; Yunzhu Li | code |
| 16 | Flash-VStream: Efficient Real-Time Understanding for Long Video Streams. Highlight: Most existing work treats long videos in the same way as short videos, which is inefficient for real-world applications and hard to generalize to even longer videos. To address these issues, we propose Flash-VStream, an efficient video language model capable of processing extremely long videos and responding to user queries in real time. | Haoji Zhang; Yiqin Wang; Yansong Tang; Yong Liu; Jiashi Feng; Xiaojie Jin | code |
| 17 | Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers. Highlight: We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. | Zhengyao Lv; Tianlin Pan; Chenyang Si; Zhaoxi Chen; Wangmeng Zuo; Ziwei Liu; Kwan-Yee K. Wong | code |
| 18 | MaskControl: Spatio-Temporal Control for Masked Motion Synthesis. Highlight: However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. | Ekkasit Pinyoanuntapong; Muhammad Saleem; Korrawe Karunratanakul; Pu Wang; Hongfei Xue; Chen Chen; Chuan Guo; Junli Cao; Jian Ren; Sergey Tulyakov | code |
| 19 | SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency. Highlight: However, this static nature renders them unable to dynamically track the data utility throughout pre-training, leading to subpar pre-trained models. To address this challenge, our paper introduces a novel dynamic bootstrapping dataset pruning method. | Yangyang Guo; Mohan Kankanhalli | code |
| 20 | FlowTok: Flowing Seamlessly Across Text and Image Tokens. Highlight: This requires projecting both modalities into a shared latent space, which poses a significant challenge due to their inherently different representations: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that seamlessly flows across text and images by encoding images into a compact 1D token representation. | Ju He; Qihang Yu; Qihao Liu; Liang-Chieh Chen | code |
| 21 | LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models. Highlight: However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. | Yu Cheng; Fajie Yuan | code |
| 22 | EVEv2: Improved Baselines for Encoder-Free Vision-Language Models. Highlight: We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. | Haiwen Diao; Xiaotong Li; Yufeng Cui; Yueze Wang; Haoge Deng; Ting Pan; Wenxuan Wang; Huchuan Lu; Xinlong Wang | code |
| 23 | REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers. Highlight: In this paper we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" | Xingjian Leng; Jaskirat Singh; Yunzhong Hou; Zhenchang Xing; Saining Xie; Liang Zheng | code |
| 24 | The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation. Highlight: In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. | Aoxiong Yin; Xu Tan; Kai Shen; Yichong Leng; Xinyu Zhou; Juncheng Li; Siliang Tang | code |
| 25 | Efficient Track Anything. Highlight: The high computational complexity of the image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight end-to-end track anything models that produce high-quality results with low latency and small model size. | Yunyang Xiong; Chong Zhou; Xiaoyu Xiang; Lemeng Wu; Chenchen Zhu; Zechun Liu; Saksham Suri; Balakrishnan Varadarajan; Ramya Akula; Forrest Iandola; Raghuraman Krishnamoorthi; Bilge Soran; Vikas Chandra | code |
| 26 | LEGION: Learning to Ground and Explain for Synthetic Image Detection. Highlight: In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. | Hengrui Kang; Siwei Wen; Zichen Wen; Junyan Ye; Weijia Li; Peilin Feng; Baichuan Zhou; Bin Wang; Dahua Lin; Linfeng Zhang; Conghui He | code |
| 27 | 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining. Highlight: In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. | Wenqi Zhang; Hang Zhang; Xin Li; Jiashuo Sun; Yongliang Shen; Weiming Lu; Deli Zhao; Yueting Zhuang; Lidong Bing | code |
| 28 | SpectralAR: Spectral Autoregressive Visual Generation. Highlight: However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. | Yuanhui Huang; Weiliang Chen; Wenzhao Zheng; Yueqi Duan; Jie Zhou; Jiwen Lu | code |
| 29 | SAM2Long: Enhancing SAM 2 for Long Video Segmentation with A Training-Free Memory Tree. Highlight: To this end, we introduce SAM2Long, an improved training-free video object segmentation strategy, which considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways in a constrained tree search manner. | Shuangrui Ding; Rui Qian; Xiaoyi Dong; Pan Zhang; Yuhang Zang; Yuhang Cao; Yuwei Guo; Dahua Lin; Jiaqi Wang | code |
| 30 | General Compression Framework for Efficient Transformer Object Tracking. Highlight: Thus, we propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce model size while preserving tracking accuracy. | Lingyi Hong; Jinglun Li; Xinyu Zhou; Shilin Yan; Pinxue Guo; Kaixun Jiang; Zhaoyu Chen; Shuyong Gao; Runze Li; Xingdong Sheng; Wei Zhang; Hong Lu; Wenqiang Zhang | code |
| 31 | ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models. Highlight: However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. | Zifu Wan; Ce Zhang; Silong Yong; Martin Q. Ma; Simon Stepputtis; Louis-Philippe Morency; Deva Ramanan; Katia Sycara; Yaqi Xie | code |
| 32 | Cycle Consistency As Reward: Learning Image-Text Alignment Without Human Preferences. Highlight: We propose an alternative approach that leverages cycle consistency as a supervisory signal. | Hyojin Bahng; Caroline Chan; Fredo Durand; Phillip Isola | code |
| 33 | EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds. Highlight: Inspired by how humans learn through the perception-action loop, we propose EgoAgent, a unified agent model that simultaneously learns to represent, predict, and act within a single transformer. | Lu Chen; Yizhou Wang; Shixiang Tang; Qianhong Ma; Tong He; Wanli Ouyang; Xiaowei Zhou; Hujun Bao; Sida Peng | code |
| 34 | GaussianReg: Rapid 2D/3D Registration for Emergency Surgery Via Explicit 3D Modeling with Gaussian Primitives. Highlight: We present GaussianReg, a novel registration framework that achieves clinically acceptable accuracy within minutes of preprocessing. | Weihao Yu; Xiaoqing Guo; Xinyu Liu; Yifan Liu; Hao Zheng; Yawen Huang; Yixuan Yuan | code |
| 35 | X2-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction. Highlight: In this paper, we propose X^2-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. | Weihao Yu; Yuanhao Cai; Ruyi Zha; Zhiwen Fan; Chenxin Li; Yixuan Yuan | code |
| 36 | Rethinking The Upsampling Process in Light Field Super-Resolution with Spatial-Epipolar Implicit Image Function. Highlight: Besides, given that the line structure in the epipolar plane image integrates the spatial-angular correlation of the light field, we present an oriented line sampling strategy to exactly aggregate inter-view information. | Ruixuan Cong; Yu Wang; Mingyuan Zhao; Da Yang; Rongshan Chen; Hao Sheng | code |
| 37 | InfoBridge: Balanced Multimodal Integration Through Conditional Dependency Modeling. Highlight: While existing methods attempt to enhance fusion through cross-modal alignment or interaction mechanisms, they often struggle to balance effective integration with preserving modality-specific information. We introduce InfoBridge, a novel framework grounded in conditional information maximization principles addressing these limitations. | Chenxin Li; Yifan Liu; Panwang Pan; Hengyu Liu; Xinyu Liu; Wuyang Li; Cheng Wang; Weihao Yu; Yiyang Lin; Yixuan Yuan | code |
| 38 | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step. Highlight: In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. | Guowei Xu; Peng Jin; Ziang Wu; Hao Li; Yibing Song; Lichao Sun; Li Yuan | code |
| 39 | Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis. Highlight: In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. | Bowen Zhang; Sicheng Xu; Chuxin Wang; Jiaolong Yang; Feng Zhao; Dong Chen; Baining Guo | code |
| 40 | Representation Shift: Unifying Token Compression with FlashAttention. Highlight: Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token’s representation. | Joonmyung Choi; Sanghyeok Lee; Byungoh Ko; Eunseo Kim; Jihyung Kil; Hyunwoo J. Kim | code |
| 41 | Online Dense Point Tracking with Streaming Memory. Highlight: To account for temporal consistency and enable efficient information propagation, we present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing. | Qiaole Dong; Yanwei Fu | code |
| 42 | Unleashing Vecset Diffusion Model for Fast Shape Generation. Highlight: Challenges exist because of not only difficulties in accelerating diffusion sampling but also VAE decoding in VDM, areas under-explored in previous works. To address these challenges, we present FlashVDM, a systematic framework for accelerating both VAE and DiT in VDM. | Zeqiang Lai; Yunfei Zhao; Zibo Zhao; Haolin Liu; Fuyun Wang; Huiwen Shi; Xianghui Yang; Qingxiang Lin; Jingwei Huang; Yuhong Liu; Jie Jiang; Chunchao Guo; Xiangyu Yue | code |
| 43 | HPSv3: Towards Wide-Spectrum Human Preference Score. Highlight: To address these challenges, we introduce Human Preference Score v3 (HPSv3). (1) We release HPDv3, the first wide-spectrum human preference dataset integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons from state-of-the-art generative models and low to high-quality real-world images. (2) We introduce a VLM-based preference model trained using an uncertainty-aware ranking loss for fine-grained ranking. | Yuhang Ma; Xiaoshi Wu; Keqiang Sun; Hongsheng Li | code |
| 44 | BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment. Highlight: We introduce OmniDepth, a unified framework that bridges both through iterative bidirectional alignment of their latent representations. | Tongfan Guan; Jiaxin Guo; Chen Wang; Yun-Hui Liu | code |
| 45 | Multi-Modal Few-Shot Temporal Action Segmentation. Highlight: We propose the first MMF-TAS framework, by designing a Prototype Graph Network (PGNet). | Zijia Lu; Ehsan Elhamifar | code |
| 46 | Harmonizing Visual Representations for Unified Multimodal Understanding and Generation. Highlight: Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM) that learns rich semantics via a mask-and-reconstruct pre-training and its successful extension to masked autoregressive (MAR) image generation. | Size Wu; Wenwei Zhang; Lumin Xu; Sheng Jin; Zhonghua Wu; Qingyi Tao; Wentao Liu; Wei Li; Chen Change Loy | code |
| 47 | EEdit: Rethinking The Spatial and Temporal Redundancy for Efficient Image Editing. Highlight: In this paper, we observe that the redundancy in inversion-based image editing exists in both the spatial and temporal dimensions, such as the unnecessary computation in unedited regions and the redundancy in the inversion process. To tackle these challenges, we propose an Efficient Editing framework, named EEdit, to achieve efficient image editing. | Zexuan Yan; Yue Ma; Chang Zou; Wenteng Chen; Qifeng Chen; Linfeng Zhang | code |
| 48 | Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion. Highlight: Our method, Marigold-DC, builds on a pretrained latent diffusion model (LDM) for depth estimation and injects the depth observations as test-time guidance, via an optimization scheme that runs in tandem with the iterative inference of denoising diffusion. | Massimiliano Viola; Kevin Qu; Nando Metzger; Bingxin Ke; Alexander Becker; Konrad Schindler; Anton Obukhov | code |
| 49 | GenDoP: Auto-regressive Camera Trajectory Generation As A Director of Photography. Highlight: In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories. | Mengchen Zhang; Tong Wu; Jing Tan; Ziwei Liu; Gordon Wetzstein; Dahua Lin | code |
| 50 | Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis. Highlight: In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from street view images, utilizing a diffusion model and the Bird’s-Eye View (BEV) paradigm. | Junyan Ye; Jun He; Weijia Li; Zhutao Lv; Yi Lin; Jinhua Yu; Haote Yang; Conghui He | code |
| 51 | Advancing Visual Large Language Model for Multi-granular Versatile Perception. Highlight: Notably, existing studies often focus solely on a limited subset of these potential combinations, which constrains their applicability and versatility across various contexts. In response to this challenge, we present MVL-LM, a Multi-granular and Versatile Perception framework incorporating a Visual Large Language Model. | Wentao Xiang; Haoxian Tan; Yujie Zhong; Cong Wei; Dengjie Li; Yujiu Yang | code |
| 52 | Less-to-More Generalization: Unlocking More Controllability By In-Context Generation. Highlight: For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle these challenges. | Shaojin Wu; Mengqi Huang; Wenxu Wu; Yufeng Cheng; Fei Ding; Qian He | code |
| 53 | IRASim: A Fine-Grained World Model for Robot Manipulation. Highlight: In this paper, we present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details, conditioned on historical observations and robot action trajectories. | Fangqi Zhu; Hongtao Wu; Song Guo; Yuxiao Liu; Chilam Cheang; Tao Kong | code |
| 54 | Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs. Highlight: Based on the analysis, we propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in LVLMs. | Qizhe Zhang; Aosong Cheng; Ming Lu; Renrui Zhang; Zhiyong Zhuo; Jiajun Cao; Shaobo Guo; Qi She; Shanghang Zhang | code |
| 55 | Visual Test-time Scaling for GUI Agent Grounding. Highlight: We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. | Tiange Luo; Lajanugen Logeswaran; Justin Johnson; Honglak Lee | code |
| 56 | T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation. Highlight: Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. | Chieh-Yun Chen; Min Shi; Gong Zhang; Humphrey Shi | code |
| 57 | ZeroStereo: Zero-shot Stereo Matching from Single Images. Highlight: In this paper, we propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching. | Xianqi Wang; Hao Yang; Gangwei Xu; Junda Cheng; Min Lin; Yong Deng; Jinliang Zang; Yurui Chen; Xin Yang | code |
| 58 | Rethinking The Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities. Highlight: Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. | Liuyi Wang; Xinyuan Xia; Hui Zhao; Hanqing Wang; Tai Wang; Yilun Chen; Chengju Liu; Qijun Chen; Jiangmiao Pang | code |
| 59 | Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation. Highlight: A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. | Yuqing Wang; Zhijie Lin; Yao Teng; Yuanzhi Zhu; Shuhuai Ren; Jiashi Feng; Xihui Liu | code |
| 60 | DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models. Highlight: Even state-of-the-art models like FLUX and 3DIS face challenges, such as attribute leakage between instances, which limits user control. To address these issues, we introduce DreamRenderer, a training-free approach built upon the FLUX model. | Dewei Zhou; Mingwei Li; Zongxin Yang; Yi Yang | code |
| 61 | Frequency-Dynamic Attention Modulation For Dense Prediction. Highlight: Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and dynamically combining the two. | Linwei Chen; Lin Gu; Ying Fu | code |
| 62 | 3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation. Highlight: Physical adversarial attack methods expose the vulnerabilities of deep neural networks and pose a significant threat to safety-critical scenarios such as autonomous driving. | Tianrui Lou; Xiaojun Jia; Siyuan Liang; Jiawei Liang; Ming Zhang; Yanjun Xiao; Xiaochun Cao | code |
| 63 | AllTracker: Efficient Dense Point Tracking at High Resolution. Highlight: We introduce AllTracker: a model that estimates long-range point tracks by way of estimating the flow field between a query frame and every other frame of a video. | Adam W. Harley; Yang You; Xinglong Sun; Yang Zheng; Nikhil Raghuraman; Yunqi Gu; Sheldon Liang; Wen-Hsuan Chu; Achal Dave; Suya You; Rares Ambrus; Katerina Fragkiadaki; Leonidas Guibas | code |
| 64 | WorldScore: A Unified Evaluation Benchmark for World Generation. Highlight: We introduce the WorldScore benchmark, the first unified benchmark for world generation. | Haoyi Duan; Hong-Xing Yu; Sirui Chen; Li Fei-Fei; Jiajun Wu | code |
| 65 | LD-RPS: Zero-Shot Unified Image Restoration Via Latent Diffusion Recurrent Posterior Sampling. Highlight: Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. | Huaqiu Li; Yong Wang; Tongwen Huang; Hailang Huang; Haoqian Wang; Xiangxiang Chu | code |
| 66 | Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation. Highlight: However, this task poses significant challenges, including the accurate modeling of complex style patterns (encompassing both intra- and inter-word relationships) and maintaining content accuracy across numerous characters. To address these challenges, we propose DiffBrush, a novel diffusion-based model for handwritten text-line generation. | Gang Dai; Yifan Zhang; Yutao Qin; Qiangya Guo; Shuangping Huang; Shuicheng Yan | code |
| 67 | MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control. Highlight: Although DiT with 3D VAE has become a standard framework for video generation, it introduces challenges in controllable driving video generation, especially for frame-wise geometric control, rendering existing methods ineffective. To address these issues, we propose MagicDrive-V2, a novel approach that integrates the MVDiT block and spatial-temporal conditional encoding to enable multi-view video generation and precise geometric control. | Ruiyuan Gao; Kai Chen; Bo Xiao; Lanqing Hong; Zhenguo Li; Qiang Xu | code |
| 68 | Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis. Highlight: In this work, we show that we only need a single parameter ω to effectively control granularity in diffusion-based synthesis. | Xinyu Hou; Zongsheng Yue; Xiaoming Li; Chen Change Loy | code |
| 69 | Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation for Semi-Supervised Lifelong Person Re-Identification. Highlight: In this paper, we pioneer the investigation of Semi-LReID, introducing a novel Self-Reinforcing PRototype Evolution with Dual-Knowledge Cooperation framework (SPRED). | Kunlun Xu; Fan Zhuo; Jiangmeng Li; Xu Zou; Jiahuan Zhou | code |
| 70 | Heavy Labels Out! Dataset Distillation with Label Space Lightening. Highlight: As a result, the required storage can be comparable even to original datasets, especially for large-scale ones. To solve this problem, instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO, aiming at effective image-to-label projectors, with which synthetic labels can be directly generated online from synthetic images. | Ruonan Yu; Songhua Liu; Zigeng Chen; Jingwen Ye; Xinchao Wang | code |
| 71 | I2VControl: Disentangled and Unified Video Motion Synthesis Control. Highlight: In this paper, we propose a disentangled and unified framework, namely I2VControl, to overcome the logical conflicts. | Wanquan Feng; Tianhao Qi; Jiawei Liu; Mingzhen Sun; Pengqi Tu; Tianxiang Ma; Fei Dai; Songtao Zhao; Siyu Zhou; Qian He | code |
| 72 | End-to-End Driving with Online Trajectory Evaluation Via BEV World Model. Highlight: This goal can be achieved by employing a world model to capture environmental dynamics and predict future states. Therefore, we propose an end-to-end driving framework **WoTE**, which leverages a BEV **Wo**rld model to predict future BEV states for **T**rajectory **E**valuation. | Yingyan Li; Yuqi Wang; Yang Liu; Jiawei He; Lue Fan; Zhaoxiang Zhang | code |
| 73 | Boosting MLLM Reasoning with Text-Debiased Hint-GRPO. Highlight: In this work, we reveal two problems that impede the performance of GRPO on the MLLM: Low data utilization and Text-bias. | Qihan Huang; Weilong Dai; Jinlong Liu; Wanggui He; Hao Jiang; Mingli Song; Jingyuan Chen; Chang Yao; Jie Song | code |
| 74 | WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation. Highlight: In this work, we introduce WikiAutoGen, a novel system for automated multimodal Wikipedia-style article generation. | Zhongyu Yang; Jun Chen; Dannong Xu; Junjie Fei; Xiaoqian Shen; Liangbing Zhao; Chun-Mei Feng; Mohamed Elhoseiny | code |
| 75 | Find Any Part in 3D. Highlight: By training on our annotated data with a simple contrastive objective, we obtain an open-world model that generalizes to any part in any object based on any text query. | Ziqi Ma; Yisong Yue; Georgia Gkioxari | code |
| 76 | 4D Visual Pre-training for Robot Learning. Highlight: Instead, we are seeking a general visual pre-training framework that could improve all 3D representations as an alternative. | Chengkai Hou; Yanjie Ze; Yankai Fu; Zeyu Gao; Songbo Hu; Yue Yu; Shanghang Zhang; Huazhe Xu | code |
| 77 | Contrastive Flow Matching. Highlight: We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation. | George Stoica; Vivek Ramanujan; Xiang Fan; Ali Farhadi; Ranjay Krishna; Judy Hoffman | code |
| 78 | World4Drive: End-to-End Autonomous Driving Via Intention-aware Physical Latent World Model. Highlight: In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. | Yupeng Zheng; Pengxuan Yang; Zebin Xing; Qichao Zhang; Yuhang Zheng; Yinfeng Gao; Pengfei Li; Teng Zhang; Zhongpu Xia; Peng Jia; XianPeng Lang; Dongbin Zhao | code |
| 79 | DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation. Highlight: A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. | Jiazhe Guo; Yikang Ding; Xiwu Chen; Shuo Chen; Bohan Li; Yingshuang Zou; Xiaoyang Lyu; Feiyang Tan; Xiaojuan Qi; Zhiheng Li; Hao Zhao | code |
| 80 | VMBench: A Benchmark for Perception-Aligned Video Motion Generation. Highlight: Specifically, there are two key issues: 1) current motion metrics do not fully align with human perceptions; 2) the existing motion prompts are limited. Based on these findings, we introduce VMBench, a comprehensive Video Motion Benchmark that has perception-aligned motion metrics and features the most diverse types of motion. | Xinran Ling; Chen Zhu; Meiqi Wu; Hangyu Li; Xiaokun Feng; Cundian Yang; Aiming Hao; Jiashu Zhu; Jiahong Wu; Xiangxiang Chu | code |
| 81 | Holistic Tokenizer for Autoregressive Image Generation. Highlight: Moreover, because most visual tokenizers map local image patches into latent tokens, global information is limited. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. | Anlin Zheng; Haochen Wang; Yucheng Zhao; Weipeng Deng; Tiancai Wang; Xiangyu Zhang; Xiaojuan Qi | code |
| 82 | Toward Fair and Accurate Cross-Domain Medical Image Segmentation: A VLM-Driven Active Domain Adaptation Paradigm. Highlight: Emerging Active Domain Adaptation (ADA) approaches offer more effective enhancements, but all ignore fairness issues. Therefore, in this work, we propose the first fairness-aware ADA paradigm that simultaneously achieves both enhanced fairness and superior overall performance. | Hongqiu Wang; Wu Chen; Xiangde Luo; Zhaohu Xing; Lihao Liu; Jing Qin; Shaozhi Wu; Lei Zhu | code |
| 83 | Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs By Learning Language-Agnostic Speech Representations. Highlight: Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. | Jeong Hun Yeo; Minsu Kim; Chae Won Kim; Stavros Petridis; Yong Man Ro | code |
| 84 | GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability. Highlight: However, when computed independently at different layers, CAVs often exhibit inconsistencies, making cross-layer comparisons unreliable. To address this issue, we propose the Global Concept Activation Vector (GCAV), a novel framework that unifies CAVs into a single, semantically consistent representation. | Zhenghao He; Sanchit Sinha; Guangzhi Xiong; Aidong Zhang | code |
| 85 | Mobile Video Diffusion. Highlight: This paper introduces the first mobile-optimized image-to-video diffusion model. | Haitam Ben Yahia; Denis Korzhenkov; Ioannis Lelekas; Amir Ghodrati; Amirhossein Habibian | code |
| 86 | Refer to Any Segmentation Mask Group With Vision-Language Prompts. Highlight: In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. | Shengcao Cao; Zijun Wei; Jason Kuen; Kangning Liu; Lingzhi Zhang; Jiuxiang Gu; HyunJoon Jung; Liang-Yan Gui; Yu-Xiong Wang | code |
| 87 | Reangle-A-Video: 4D Video Generation As Video-to-Video Translation. Highlight: We introduce Reangle-A-Video, a unified framework for generating synchronized multi-view videos from a single input video. | Hyeonho Jeong; Suhyeon Lee; Jong Chul Ye | code |
| 88 | SEGS-SLAM: Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding. Highlight: Additionally, they struggle with abrupt appearance variations, leading to inconsistent visual quality. To address these problems, we propose SEGS-SLAM, a structure-enhanced 3D Gaussian Splatting SLAM, which achieves high-quality photorealistic mapping. | Tianci Wen; Zhiang Liu; Yongchun Fang | code |
| 89 | HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation. Highlight: In this paper, we present a unified Driving World Model named HERMES. | Xin Zhou; Dingkang Liang; Sifan Tu; Xiwu Chen; Yikang Ding; Dingyuan Zhang; Feiyang Tan; Hengshuang Zhao; Xiang Bai | code |
| 90 | From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision. Highlight: Specifically, to avoid the early low-performance model leading to the wrong selection of hard samples, we propose a model pre-start concept, which focuses on automatically selecting a portion of easy samples and helping the model have basic task-specific learning capabilities. | Chuang Yu; Jinmiao Zhao; Yunpeng Liu; Sicheng Zhao; Yimian Dai; Xiangyu Yue | code |
| 91 | IMG: Calibrating Diffusion Models Via Implicit Multimodal Guidance. Highlight: In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. | Jiayi Guo; Chuanhao Yan; Xingqian Xu; Yulin Wang; Kai Wang; Gao Huang; Humphrey Shi | code |
| 92 | Deep Adaptive Unfolded Network Via Spatial Morphology Stripping and Spectral Filtration for Pan-sharpening. Highlight: Besides, validating pan-sharpening performance in high-level semantic tasks is intractable due to the absence of datasets. To tackle these issues, we propose a deep adaptive unfolded network via spatial morphology stripping and spectral filtration for pan-sharpening, which is conceptualized as a linear inverse problem regularized by spatial and spectral priors. | Hebaixu Wang; Jiayi Ma | code |
| 93 | EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding. Highlight: Most existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents that need to gradually perceive the scene through progressive embodied exploration. In this paper, we formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it. | Yuqi Wu; Wenzhao Zheng; Sicheng Zuo; Yuanhui Huang; Jie Zhou; Jiwen Lu | code |
| 94 | MINERVA: Evaluating Complex Video Reasoning. Highlight: We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. | Arsha Nagrani; Sachit Menon; Ahmet Iscen; Shyamal Buch; Ramin Mehran; Nilpa Jha; Anja Hauth; Yukun Zhu; Carl Vondrick; Mikhail Sirotenko; Cordelia Schmid; Tobias Weyand | code |
| 95 | LeGrad: An Explainability Method for Vision Transformers Via Feature Formation Sensitivity. Highlight: However, because of their modeling of long-range dependencies through self-attention mechanisms, the explainability of these models remains a challenge. To address this, we propose LeGrad, an explainability method specifically designed for ViTs. | Walid Bousselham; Angie Boggust; Sofian Chaybouti; Hendrik Strobelt; Hilde Kuehne | code |
| 96 | ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives. Highlight: In this paper, we focus on the emerging Ego-Exo object correspondence task, which aims to understand object relations across ego-exo perspectives through segmentation. | Yuqian Fu; Runze Wang; Bin Ren; Guolei Sun; Biao Gong; Yanwei Fu; Danda Pani Paudel; Xuanjing Huang; Luc Van Gool | code |
| 97 | VideoAds for Fast-Paced Video Understanding. Highlight: In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. | Zheyuan Zhang; Wanying Dou; Linkai Peng; Hongyi Pan; Ulas Bagci; Boqing Gong | code |
| 98 | IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation. Highlight: To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. | Wenxuan Guo; Xiuwei Xu; Hang Yin; Ziwei Wang; Jianjiang Feng; Jie Zhou; Jiwen Lu | code |
| 99 | CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving. Highlight: While LLM-based approaches offer generalized reasoning capabilities, their challenges in spatial planning and unstable inference latency hinder their direct application in cooperative driving. To address these limitations, we propose CoLMDriver, the first full-pipeline LLM-based cooperative driving system, enabling effective language-based negotiation and real-time driving control. | Changxing Liu; Genjia Liu; Zijun Wang; Jinchang Yang; Siheng Chen | code |
| 100 | DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding. Highlight: In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. | Wenwen Yu; Zhibo Yang; Yuliang Liu; Xiang Bai | code |
| 101 | OuroMamba: A Data-Free Quantization Framework for Vision Mamba. Highlight: We present OuroMamba, the first data-free post-training quantization (DFQ) method for vision Mamba-based models (VMMs). | Akshat Ramachandran; Mingyu Lee; Huan Xu; Souvik Kundu; Tushar Krishna | code |
| 102 | SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs. Highlight: In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. | Jiahui Wang; Zuyan Liu; Yongming Rao; Jiwen Lu | code |
| 103 | EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting. Highlight: However, these approaches do not effectively model the motions of dynamic objects (e.g., the motion speed of pedestrians is clearly different from that of vehicles), resulting in suboptimal scene decomposition. To address this, we propose Explicit Motion Decomposition (EMD), which models the motions of dynamic objects by introducing learnable motion embeddings to the Gaussians, enhancing the decomposition in street scenes. | Xiaobao Wei; Qingpo Wuwu; Zhongyu Zhao; Zhuangzhe Wu; Nan Huang; Ming Lu; Ningning Ma; Shanghang Zhang | code |
| 104 | PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency. Highlight: This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. | Haotian Wang; Aoran Xiao; Xiaoqin Zhang; Meng Yang; Shijian Lu | code |
| 105 | Self-Supervised Monocular 4D Scene Reconstruction for Egocentric Videos. Highlight: However, the lack of high-quality labeled datasets in this field has hindered the effectiveness of current supervised learning methods. In this work, we aim to address this issue by exploring a self-supervised dynamic scene reconstruction approach. | Chengbo Yuan; Geng Chen; Li Yi; Yang Gao | code |
| 106 | Integrating Task-Specific and Universal Adapters for Pre-Trained Model-based Class-Incremental Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the aforementioned challenges, we propose integrating Task-Specific and Universal Adapters (TUNA) in this paper. |
Yan Wang; Da-Wei Zhou; Han-Jia Ye; | code |
| 107 | VIPerson: Flexibly Generating Virtual Identity for Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel pedestrian generation pipeline, VIPerson, to generate camera-realistic pedestrian images with flexible Virtual Identities for the Person ReID task. |
Xiao-Wen Zhang; Delong Zhang; Yi-Xing Peng; Zhi Ouyang; Jingke Meng; Wei-Shi Zheng; | code |
| 108 | PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. |
Xiaoyang Hao; Han Li; | code |
| 109 | D3QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose to leverage Discrete Distribution Discrepancy-aware Quantization Error (D^3QE) for autoregressive-generated image detection, which exploits the distinctive patterns and the frequency distribution bias of the codebook that exist between real and fake images. |
Yanran Zhang; Bingyao Yu; Yu Zheng; Wenzhao Zheng; Yueqi Duan; Lei Chen; Jie Zhou; Jiwen Lu; | code |
| 110 | Make Your Training Flexible: Towards Deployment-Efficient Video Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We hence introduce a novel paradigm for lossless adaptation across scenarios, enabling models to maintain optimal performance under high-resource conditions while seamlessly transferring to low-resource environments. |
Chenting Wang; Kunchang Li; Tianxiang Jiang; Xiangyu Zeng; Yi Wang; Limin Wang; | code |
| 111 | Multi-identity Human Image Animation with Structural Video Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of human appearance and pose conditions and to model the distribution of 3D-aware dynamics. To address these limitations, we present Structural Video Diffusion, a novel framework designed for generating realistic multi-human videos. |
Zhenzhi Wang; Yixuan Li; Yanhong Zeng; Yuwei Guo; Dahua Lin; Tianfan Xue; Bo Dai; | code |
| 112 | Griffon V2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Such limitation further restricts the model’s potential to achieve nuanced visual and language referring in domains such as GUI Agents, counting, etc. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. |
Yufei Zhan; Shurong Zheng; Yousong Zhu; Hongyin Zhao; Fan Yang; Ming Tang; Jinqiao Wang; | code |
| 113 | Diffusion-Based Imaginative Coordination for Bimanual Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While video prediction has been recently studied for representation learning and control, leveraging its ability to capture rich dynamic and behavioral information, its potential for enhancing bimanual coordination remains underexplored. To bridge this gap, we propose a unified diffusion-based framework for the joint optimization of video and action prediction. |
Huilin Xu; Jian Ding; Jiakun Xu; Ruixiang Wang; Jun Chen; Jinjie Mai; Yanwei Fu; Bernard Ghanem; Feng Xu; Mohamed Elhoseiny; | code |
| 114 | SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they generally exhibit worse accuracy than encoder-decoder-based methods (EDTRs) due to difficulties with text irregularity and missing linguistic context. To address these challenges, we propose SVTRv2, a CTC model endowed with the ability to handle text irregularities and model linguistic context. |
Yongkun Du; Zhineng Chen; Hongtao Xie; Caiyan Jia; Yu-Gang Jiang; | code |
| 115 | LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation, which features (i) comprehensive tasks, encompassing 2,100 extensive prompts across 20 fine-grained task dimensions, and (ii) large-scale human-preference annotations, including 100K mean-opinion scores (MOSs) and 50K question-answering (QA) pairs annotated on 50,400 images generated from 24 T2I models. Based on EvalMi-50K, we propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions including perceptual quality, text-image correspondence, and task-specific accuracy. Extensive experimental results show that LMM4LMM achieves state-of-the-art performance on EvalMi-50K, and exhibits strong generalization ability on other AI-generated image evaluation benchmark datasets, manifesting the generality of both the EvalMi-50K dataset and the LMM4LMM metric. |
Jiarui Wang; Huiyu Duan; Yu Zhao; Juntong Wang; Guangtao Zhai; Xiongkuo Min; | code |
| 116 | VPO: Aligning Text-to-Video Generation Models with Prompt Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. |
Jiale Cheng; Ruiliang Lyu; Xiaotao Gu; Xiao Liu; Jiazheng Xu; Yida Lu; Jiayan Teng; Zhuoyi Yang; Yuxiao Dong; Jie Tang; Hongning Wang; Minlie Huang; | code |
| 117 | GFPack++: Attention-Driven Gradient Fields for Optimizing 2D Irregular Packing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose GFPack++, a deeply investigated framework that adopts attention-based geometry and relation encoding, enabling more comprehensive modeling of complex packing relationships. |
Tianyang Xue; Lin Lu; Yang Liu; Mingdong Wu; Hao Dong; Yanbin Zhang; Renmin Han; Baoquan Chen; | code |
| 118 | Proxy-Bridged Game Transformer for Interactive Extreme Motion Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these approaches fall short in modeling extreme motions like lindy-hop dances, as they require a more comprehensive understanding of cross-person dependencies. To bridge this gap, we introduce Proxy-bridged Game Transformer (PGformer), a Transformer-based foundation model that captures the interactions driving extreme multi-person motions. |
Yanwen Fang; Wenqi Jia; Xu Cao; Peng-Tao Jiang; Guodong Li; Jintai Chen; | code |
| 119 | Auto-Regressively Generating Multi-View Consistent Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the Multi-View AutoRegressive (MV-AR) method, which leverages an autoregressive model to progressively generate consistent multiview images from arbitrary prompts. |
JiaKui Hu; Yuxiao Yang; Jialun Liu; Jinbo Wu; Chen Zhao; Yanye Lu; | code |
| 120 | Principles of Visual Tokens for Efficient Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While most methods succeed in reducing the cost of the model and maintaining accuracy, an interesting pattern arises: most methods do not outperform the baseline of randomly discarding tokens. In this paper, we take a closer look at this phenomenon and observe 5 principles of the nature of visual tokens. |
Xinyue Hao; Gen Li; Shreyank N Gowda; Robert B. Fisher; Jonathan Huang; Anurag Arnab; Laura Sevilla-Lara; | code |
| 121 | Progressive Test Time Energy Adaptation for Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a model-agnostic, progressive test-time energy adaptation approach for medical image segmentation. |
Xiaoran Zhang; Byung-Woo Hong; Hyoungseob Park; Daniel H. Pak; Anne-Marie Rickmann; Lawrence H. Staib; James S. Duncan; Alex Wong; | code |
| 122 | D-Attn: Decomposed Attention for Large Vision-and-Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Decomposed Attention (D-Attn), a more flexible attention architecture for LVLMs, which enables modification of visual token operations without affecting textual-to-textual attention. |
Chia-Wen Kuo; Sijie Zhu; Fan Chen; Xiaohui Shen; Longyin Wen; | code |
| 123 | Semantic Discrepancy-aware Detector for Image Forgery Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the misalignment between the forgery and semantic concept spaces hinders the model’s forgery detection performance. To address this problem, we propose a novel Semantic Discrepancy-aware Detector (SDD) that leverages reconstruction learning to align the two spaces at a fine-grained visual level. |
Ziye Wang; Minghang Yu; Chunyan Xu; Zhen Cui; | code |
| 124 | NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Compositional approaches that integrate LLMs and foundation models show promising performance but still struggle with complex reasoning with language-based logical representations. To address these limitations, we propose NAVER, a compositional visual grounding method that integrates explicit probabilistic logic reasoning within a finite-state automaton, equipped with a self-correcting mechanism. |
Zhixi Cai; Fucai Ke; Simindokht Jahangard; Maria Garcia de la Banda; Reza Haffari; Peter J. Stuckey; Hamid Rezatofighi; | code |
| 125 | Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, previous state-of-the-art approaches often suffer from significant “foreground bias”, where models tend to wrongly identify background regions as foreground objects. To alleviate this issue, we propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. |
Yunheng Li; Yuxuan Li; Quan-Sheng Zeng; Wenhai Wang; Qibin Hou; Ming-Ming Cheng; | code |
| 126 | Mind The Gap: Preserving and Compensating for The Modality Gap in CLIP-Based Continual Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we analyze the variations in the modality gap during the fine-tuning of vision-language pre-trained models. |
Linlan Huang; Xusheng Cao; Haori Lu; Yifan Meng; Fei Yang; Xialei Liu; | code |
| 127 | HyPiDecoder: Hybrid Pixel Decoder for Efficient Segmentation and Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the inefficiency of MSDeformAttn has become a performance bottleneck for segmenters. To address this, we propose the Hyper Pixel Decoder (HyPiDecoder), an improved Pixel Decoder design that replaces parts of the MSDeformAttn layers with convolution-based FPN layers, introducing explicit locality information and significantly boosting inference speed. |
Fengzhe Zhou; Humphrey Shi; | code |
| 128 | Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel method that uses generative AI to synthesize high-quality aerial images and their labels, improving detector training through data augmentation. |
Xiao Fang; Minhyek Jeon; Zheyang Qin; Stanislav Panev; Celso De Melo; Shuowen Hu; Shayok Chakraborty; Fernando De La Torre; | code |
| 129 | POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present POMATO, a unified framework for dynamic 3D reconstruction by marrying POintmap MAtching with Temporal mOtion. |
Songyan Zhang; Yongtao Ge; Jinyuan Tian; Guangkai Xu; Hao Chen; Chen Lv; Chunhua Shen; | code |
| 130 | The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with A Single Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces SAIL, a single-transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a single architecture. |
Weixian Lei; Jiacong Wang; Haochen Wang; Xiangtai Li; Jun Hao Liew; Jiashi Feng; Zilong Huang; | code |
| 131 | TokenUnify: Scaling Up Autoregressive Pretraining for Neuron Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by autoregressive pretraining in language models, we propose TokenUnify, a hierarchical predictive coding framework that captures multi-scale dependencies through three complementary learning objectives. |
Yinda Chen; Haoyuan Shi; Xiaoyu Liu; Te Shi; Ruobing Zhang; Dong Liu; Zhiwei Xiong; Feng Wu; | code |
| 132 | Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. |
Zhichuan Wang; Yang Zhou; Zhe Liu; Rui Yu; Song Bai; Yulong Wang; Xinwei He; Xiang Bai; | code |
| 133 | Training-free Geometric Image Editing on Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Previous diffusion-based editing methods often attempt to handle all relevant subtasks in a single step, proving difficult when transformations become large or structurally complex. We address this by proposing a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. |
Hanshen Zhu; Zhen Zhu; Kaile Zhang; Yiming Gong; Yuliang Liu; Xiang Bai; | code |
| 134 | LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These challenges stem primarily from constraints in weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic-Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. |
Zhang Li; Biao Yang; Qiang Liu; Shuo Zhang; Zhiyin Ma; Liang Yin; Linger Deng; Yabo Sun; Yuliang Liu; Xiang Bai; | code |
| 135 | AIM: Adaptive Inference of Multi-Modal LLMs Via Token Merging and Pruning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a training-free adaptive inference method for multi-modal LLMs that can accommodate a broad range of efficiency requirements with minimal performance drop. |
Yiwu Zhong; Zhuoming Liu; Yin Li; Liwei Wang; | code |
| 136 | CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, prior studies overlook the tool execution costs, leading to the generation of expensive plans whose costs outweigh their benefits in terms of task performance. To fill this gap, we propose the Cost-Aware Tool Planning with LLMs (CATP-LLM) framework, which for the first time provides a coherent design to empower LLMs for cost-aware tool planning. |
Duo Wu; Jinghe Wang; Yuan Meng; Yanning Zhang; Le Sun; Zhi Wang; | code |
| 137 | REDUCIO! Generating 1K Video Within 16 Seconds Using Extremely Compressed Motion Latents Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Commercial video generation models have exhibited realistic, high-fidelity results but are still restricted to limited access. One crucial obstacle for large-scale applications is the expensive training and inference cost. In this paper, we argue that videos contain significantly more redundant information than images, allowing them to be encoded with very few motion latents. Towards this goal, we design an image-conditioned VAE that projects videos into an extremely compressed latent space and decodes them based on content images. This magic Reducio charm enables a 64x reduction of latents compared to a common 2D VAE, without sacrificing quality. Building upon Reducio-VAE, we can train diffusion models for high-resolution video generation efficiently. |
Rui Tian; Qi Dai; Jianmin Bao; Kai Qiu; Yifan Yang; Chong Luo; Zuxuan Wu; Yu-Gang Jiang; | code |
| 138 | From One to More: Contextual Part Latents for 3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Third, current methods rely on global conditions (e.g., text, image, point cloud) to control the generation process, lacking detailed controllability. Therefore, motivated by how 3D designers create a 3D object, we present a new part-based 3D generation framework, CoPart, which represents a 3D object with multiple contextual part latents and simultaneously generates coherent 3D parts. |
Shaocong Dong; Lihe Ding; Xiao Chen; Yaokun Li; Yuxin Wang; Yucheng Wang; Qi Wang; Jaehyeok Kim; Chenjian Gao; Zhanpeng Huang; Zibin Wang; Tianfan Xue; Dan Xu; | code |
| 139 | Accelerate 3D Object Detection Models Via Zero-Shot Attention Key Pruning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing pruning and distillation methods either need retraining or are designed for ViT models, which are hard to migrate to 3D detectors. To address this issue, we propose a zero-shot runtime pruning method for transformer decoders in 3D object detection models. |
Lizhen Xu; Xiuxiu Bai; Xiaojun Jia; Jianwu Fang; Shanmin Pang; | code |
| 140 | ZIM: Zero-Shot Image Matting for Anything Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Furthermore, we demonstrate the versatility of ZIM in various downstream tasks requiring precise masks, such as image inpainting and 3D segmentation. Our contributions provide a robust foundation for advancing zero-shot matting and its downstream applications across a wide range of computer vision tasks. |
Beomyoung Kim; Chanyong Shin; Joonhyun Jeong; Hyungsik Jung; Se-Yun Lee; Sewhan Chun; Dong-Hyun Hwang; Joonsang Yu; | code |
| 141 | Perspective-Aware Teaching: Adapting Knowledge for Heterogeneous Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a perspective-aware teaching (PAT) KD framework to enable feature distillation across diverse architectures. |
Jhe-Hao Lin; Yi Yao; Chan-Feng Hsu; Hong-Xia Xie; Hong-Han Shuai; Wen-Huang Cheng; | code |
| 142 | Partially Matching Submap Helps: Uncertainty Modeling and Propagation for Text to Point Cloud Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work redefines the task under a more realistic assumption, relaxing the one-to-one retrieval constraint by allowing partially matching query text and submap pairs. To address this challenge, we augment datasets with partially matching submaps and introduce an uncertainty-aware framework. |
Mingtao Feng; Longlong Mei; Zijie Wu; Jianqiao Luo; Fenghao Tian; Jie Feng; Weisheng Dong; Yaonan Wang; | code |
| 143 | Web Artifact Attacks Disrupt Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this attack has focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce "artifact-based" attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. |
Maan Qraitem; Piotr Teterwak; Kate Saenko; Bryan A. Plummer; | code |
| 144 | SPA: Efficient User-Preference Alignment Against Uncertainty in Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While prior uncertainty-aware and interactive methods offer adaptability, they are inefficient at test time: uncertainty-aware models require users to choose from numerous similar outputs, while interactive models demand significant user input through click or box prompts to refine segmentation. To address these challenges, we propose SPA, a new Segmentation Preference Alignment framework that efficiently adapts to diverse test-time preferences with minimal human interaction. |
Jiayuan Zhu; Junde Wu; Cheng Ouyang; Konstantinos Kamnitsas; J. Alison Noble; | code |
| 145 | Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We hypothesize that the robust capabilities of text-to-image diffusion models, particularly Stable Diffusion (SD), which is trained on large-scale datasets, can help overcome these limitations. To investigate this hypothesis, we propose PointSD, a framework that leverages the SD model for 3D self-supervised learning. |
Yiyang Chen; Shanshan Zhao; Lunhao Duan; Changxing Ding; Dacheng Tao; | code |
| 146 | Cooperative Pseudo Labeling for Unsupervised Federated Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we extend UFL to the classification problem with CLIP for the first time and propose a novel method, **Fed**erated **Co**operative **P**seudo **L**abeling (**FedCoPL**). |
Kuangpu Guo; Lijun Sheng; Yongcan Yu; Jian Liang; Zilei Wang; Ran He; | code |
| 147 | MotionStreamer: Streaming Motion Generation Via Diffusion-based Autoregressive Model in Causal Latent Space Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed responses and error accumulation due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. |
Lixing Xiao; Shunlin Lu; Huaijin Pi; Ke Fan; Liang Pan; Yueer Zhou; Ziyong Feng; Xiaowei Zhou; Sida Peng; Jingbo Wang; | code |
| 148 | Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, recent advances in sample selection either mostly rely on an oracle model to select a high-quality coreset offline, which is limited in cold-start scenarios, or focus on online selection based on real-time model predictions, which has not sufficiently or efficiently considered noisy correspondence. To address this dilemma, we propose a novel Differential-Informed Sample Selection (DISSect) method, which accurately and efficiently discriminates noisy correspondence for training acceleration. |
Zihua Zhao; Feng Hong; Mengxi Chen; Pengyi Chen; Benyuan Liu; Jiangchao Yao; Ya Zhang; Yanfeng Wang; | code |
| 149 | LookOut: Real-World Humanoid Egocentric Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. |
Boxiao Pan; Adam W. Harley; Francis Engelmann; C. Karen Liu; Leonidas J. Guibas; | code |
| 150 | HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce HOLa (Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation), a novel approach that both enhances generalization to unseen classes and improves action distinction. |
Qinqian Lei; Bo Wang; Robby T. Tan; | code |
| 151 | Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection Under Test-Time Shifts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While prior TTA approaches recognize the positive correlation between low uncertainty and high generalization ability, they fail to address the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). To bridge this gap, we propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. |
Zixuan Hu; Dongxiao Li; Xinzhu Ma; Shixiang Tang; Xiaotong Li; Wenhan Yang; Ling-Yu Duan; | code |
| 152 | Rethink Sparse Signals for Pose-guided Text-to-image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel Spatial-Pose ControlNet (SP-Ctrl), equipping sparse signals with robust controllability for pose-guided image generation. |
Wenjie Xuan; Jing Zhang; Juhua Liu; Bo Du; Dacheng Tao; | code |
| 153 | Scendi Score: Prompt-Aware Diversity Evaluation Via Schur Complement of CLIP Embeddings Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we extend the application of CLIP embeddings to quantify and interpret the intrinsic diversity of text-to-image models, i.e., their capacity to generate diverse images from similar text prompts, which we refer to as prompt-aware diversity. |
Azim Ospanov; Mohammad Jalali; Farzan Farnia; | code |
| 154 | Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs Within Single Inference Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Subsequently, we introduce MoRef-attention, which gathers clues from the multi-reference chunks in parallel to summarize unified query activations. |
Kuo Wang; Quanlong Zheng; Junlin Xie; Yanhao Zhang; Jinguo Luo; Haonan Lu; Liang Lin; Fan Zhou; Guanbin Li; | code |
| 155 | Learning Large Motion Estimation from Intermediate Representations with A High-Resolution Optical Flow Dataset Featuring Long-Range Dynamic Motion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a novel training framework that integrates matching cost distillation and incremental time-step learning to refine cost volume estimation and stabilize training. |
Hoonhee Cho; Yuhwan Jeong; Kuk-Jin Yoon; | code |
| 156 | CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the more challenging cooperative sequential perception tasks, such as cooperative 3D multi-object tracking, have not been thoroughly investigated. Therefore, we propose CoopTrack, a fully instance-level end-to-end framework for cooperative tracking, featuring learnable instance association, which fundamentally differs from existing approaches. |
Jiaru Zhong; Jiahao Wang; Jiahui Xu; Xiaofan Li; Zaiqing Nie; Haibao Yu; | code |
| 157 | Where Am I? Cross-View Geo-localization with Natural Language Descriptions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most existing studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications like pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve corresponding satellite images or OSM database entries based on scene text descriptions. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization details. |
Junyan Ye; Honglin Lin; Leyan Ou; Dairong Chen; Zihao Wang; Qi Zhu; Conghui He; Weijia Li; | code |
| 158 | CABLD: Contrast-Agnostic Brain Landmark Detection with Consistency-Based Regularization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce CABLD, a novel self-supervised DL framework for 3D brain landmark detection in unlabeled scans with varying contrasts by using only a single reference example. |
Soorena Salari; Arash Harirpoush; Hassan Rivaz; Yiming Xiao; | code |
| 159 | Open-set Cross Modal Generalization Via Multimodal Unified Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing multimodal unified representation work lacks consideration for open-set environments. To tackle this, we propose MICU, comprising two key components: Fine-Coarse Masked multimodal InfoNCE (FCMI) and Cross modal Unified Jigsaw Puzzles (CUJP). |
Hai Huang; Yan Xia; Shulei Wang; Hanting Wang; Minghui Fang; Shengpeng Ji; Sashuai Zhou; Tao Jin; Zhou Zhao; | code |
| 160 | Learning Normals of Noisy Points By Local Gradient-Aware Surface Filtering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel approach for learning normals from noisy point clouds through local gradient-aware surface filtering. |
Qing Li; Huifang Feng; Xun Gong; Yu-Shen Liu; | code |
| 161 | INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs’ Performance in Insurance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce INS-MMBench, the first hierarchical benchmark tailored for the insurance domain. |
Chenwei Lin; Hanjia Lyu; Xian Xu; Jiebo Luo; | code |
| 162 | Spatial Preference Rewarding for MLLMs Spatial Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs’ actual responses. We address this issue with Spatial Preference Rewarding (SPR), an approach that enhances MLLMs’ spatial capabilities by rewarding detailed responses with precise object localization over vague or inaccurate ones. |
Han Qiu; Peng Gao; Lewei Lu; Xiaoqin Zhang; Ling Shao; Shijian Lu; | code |
| 163 | DreamRelation: Relation-Centric Video Customization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The primary challenge arises from the intricate spatial arrangements, layout variations, and nuanced temporal dynamics inherent in relations; consequently, current models tend to overemphasize irrelevant visual details rather than capturing meaningful interactions. To address these challenges, we propose DreamRelation, a novel approach that personalizes relations through a small set of exemplar videos, leveraging two key components: Relational Decoupling Learning and Relational Dynamics Enhancement. |
Yujie Wei; Shiwei Zhang; Hangjie Yuan; Biao Gong; Longxiang Tang; Xiang Wang; Haonan Qiu; Hengjia Li; Shuai Tan; Yingya Zhang; Hongming Shan; | code |
| 164 | DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a human face video dataset, DH-FaceVid-1K. |
Donglin Di; He Feng; Wenzhang Sun; Yongjia Ma; Hao Li; Wei Chen; Lei Fan; Tonghua Su; Xun Yang; | code |
| 165 | Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. |
Yudong Jin; Sida Peng; Xuan Wang; Tao Xie; Zhen Xu; Yifan Yang; Yujun Shen; Hujun Bao; Xiaowei Zhou; | code |
| 166 | Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM’s encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field. |
Yuheng Shi; Minjing Dong; Chang Xu; | code |
| 167 | VSSD: Vision Mamba with Non-Causal State Space Duality Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the inherent causal nature of SSD/SSMs restricts their applications in non-causal vision tasks. To address this limitation, we introduce Visual State Space Duality (VSSD) model, which has a non-causal format of SSD. |
Yuheng Shi; Mingjia Li; Minjing Dong; Chang Xu; | code |
| 168 | VisualCloze: A Universal Image Generation Framework Via Visual In-Context Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While universal models attempt to address these limitations, they face critical challenges, including generalizable instruction design, appropriate task distributions, and unified architectural design. In this work, we propose VisualCloze, a universal image generation framework, to tackle these challenges. |
Zhong-Yu Li; Ruoyi Du; Juncheng Yan; Le Zhuo; Zhen Li; Peng Gao; Zhanyu Ma; Ming-Ming Cheng; | code |
| 169 | SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present SuperMat, a framework that achieves high-quality material decomposition with single-step inference. |
Yijia Hong; Yuan-Chen Guo; Ran Yi; Yulong Chen; Yan-Pei Cao; Lizhuang Ma; | code |
| 170 | EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer. |
Yufei Cai; Hu Han; Yuxiang Wei; Shiguang Shan; Xilin Chen; | code |
| 171 | SpecGuard: Spectral Projection-based Advanced Invisible Watermarking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce SpecGuard, a novel approach for robust and invisible image watermarking. |
Inzamamul Alam; Md Tanvir Islam; Simon S. Woo; Khan Muhammad; | code |
| 172 | D3: Training-Free AI-Generated Video Detection Using Second-Order Features Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Building on this theoretical foundation, we reveal a fundamental divergence in second-order feature distributions between real and AI-generated videos. Concretely, we propose Detection by Difference of Differences (D3), a novel training-free detection method that leverages the above second-order temporal discrepancies. |
Chende Zheng; Ruiqi Suo; Chenhao Lin; Zhengyu Zhao; Le Yang; Shuai Liu; Minghui Yang; Cong Wang; Chao Shen; | code |
| 173 | RARE: Refine Any Registration of Pairwise Point Clouds Via Zero-Shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by advancements in diffusion-based techniques, we propose a novel zero-shot method for refining point cloud registration algorithms. |
Chengyu Zheng; Jin Huang; Honghua Chen; Mingqiang Wei; | code |
| 174 | DisTime: Distribution-based Time Representation for Video Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. |
Yingsen Zeng; Zepeng Huang; Yujie Zhong; Chengjian Feng; Jie Hu; Lin Ma; Yang Liu; | code |
| 175 | Controllable 3D Outdoor Scene Generation Via Scene Graphs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a method that uses scene graphs as a user-friendly control format to generate outdoor 3D scenes. |
Yuheng Liu; Xinke Li; Yuning Zhang; Lu Qi; Xin Li; Wenping Wang; Chongshou Li; Xueting Li; Ming-Hsuan Yang; | code |
| 176 | VPR-Cloak: A First Look at Privacy Cloak Against Visual Place Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present the first look at privacy protection in VPR systems and introduce VPR-Cloak, an efficient privacy-preserving network. |
Shuting Dong; Mingzhi Chen; Feng Lu; Hao Yu; Guanghao Li; Zhe Wu; Ming Tang; Chun Yuan; | code |
| 177 | Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a novel approach that extends Vector-Quantized Variational Autoencoders (VQ-VAEs), pretrained on static high-quality portraits, into a video restoration framework through variational latent space modeling. |
Baoyou Chen; Ce Liu; Weihao Yuan; Zilong Dong; Siyu Zhu; | code |
| 178 | GAP: Gaussianize Any Point Clouds with Text Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose GAP, a novel approach that gaussianizes raw point clouds into high-fidelity 3D Gaussians with text guidance. |
Weiqi Zhang; Junsheng Zhou; Haotian Geng; Wenyuan Zhang; Yu-Shen Liu; | code |
| 179 | AGO: Adaptive Grounding for Open World 3D Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Direct alignment with pretrained image embeddings, on the other hand, often fails to achieve reliable performance because of inconsistent image and text representations in VLMs. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. |
Peizheng Li; Shuxiao Ding; You Zhou; Qingwen Zhang; Onat Inak; Larissa Triess; Niklas Hanselmann; Marius Cordts; Andreas Zell; | code |
| 180 | MC-Bench: A Benchmark for Multi-Context Visual Grounding in The Era of MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities, their abilities to solve instance-level visual-language problems beyond a single image warrant further exploration. To assess these unproven abilities of MLLMs, this paper proposes a new visual grounding task called multi-context visual grounding, which aims to localize instances of interest across multiple images based on open-ended text prompts. |
Yunqiu Xu; Linchao Zhu; Yi Yang; | code |
| 181 | LangBridge: Interpreting Image As A Combination of Language Embeddings Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these limitations, we first investigate the working principles of MLP adapters and discover that they learn to project visual embeddings into subspaces spanned by corresponding text embeddings progressively. Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. |
Jiaqi Liao; Yuwei Niu; Fanqing Meng; Hao Li; Changyao Tian; Yinuo Du; Yuwen Xiong; Dianqi Li; Xizhou Zhu; Li Yuan; Jifeng Dai; Yu Cheng; | code |
| 182 | SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel hybrid method that combines both strengths: SDF captures coarse geometry to enhance 3DGS-based rendering, while newly rendered images from 3DGS refine SDF details for accurate surface reconstruction. |
Zihui Gao; Jia-Wang Bian; Guosheng Lin; Hao Chen; Chunhua Shen; | code |
| 183 | BlinkTrack: Feature Tracking Over 80 FPS Via Events and Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, event cameras lack the fine-grained texture information that conventional cameras provide, leading to error accumulation in tracking. To address this, we propose a novel framework, BlinkTrack, which integrates event data with grayscale images for high-frequency feature tracking. |
Yichen Shen; Yijin Li; Shuo Chen; Guanglin Li; Zhaoyang Huang; Hujun Bao; Zhaopeng Cui; Guofeng Zhang; | code |
| 184 | MobileViCLIP: An Efficient Video-Text Model for Mobile Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Since many efficient contrastive language-image pre-training (CLIP) models have shown strong zero-shot classification and retrieval capability, we attempt to fill the gap in video-text understanding models and propose MobileViCLIP, a fast and efficient video-text model with strong zero-shot reasoning capability that can be deployed on mobile devices. |
Min Yang; Zihan Jia; Zhilin Dai; Sheng Guo; Limin Wang; | code |
| 185 | Low-Light Image Enhancement Using Event-Based Illumination Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Furthermore, the degradation model of temporal-mapping events under low-light conditions is investigated for realistic training data synthesis. To address the lack of datasets under this regime, we construct a beam-splitter setup and collect the EvLowLight dataset, which includes images, temporal-mapping events, and motion events. |
Lei Sun; Yuhan Bao; Jiajun Zhai; Jingyun Liang; Yulun Zhang; Kaiwei Wang; Danda Pani Paudel; Luc Van Gool; | code |
| 186 | Federated Continual Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. |
Haiyang Guo; Fanhu Zeng; Fei Zhu; Wenzhuo Liu; Da-Han Wang; Jian Xu; Xu-Yao Zhang; Cheng-Lin Liu; | code |
| 187 | Revisiting Adversarial Patch Defenses on Object Detectors: Unified Evaluation, Large-Scale Dataset, and New Insights Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, we identify that existing defense evaluations lack a unified and comprehensive framework, resulting in inconsistent and incomplete assessments of current methods. To address this issue, we revisit 11 representative defenses and present the first patch defense benchmark, involving 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics. |
Junhao Zheng; Jiahao Sun; Chenhao Lin; Zhengyu Zhao; Chen Ma; Chong Zhang; Cong Wang; Qian Wang; Chao Shen; | code |
| 188 | Rethinking Discrete Tokens: Treating Them As Conditions for Continuous Autoregressive Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike discrete tokens that reside in a structured and bounded space, continuous representations exist in an unbounded, high-dimensional space, making density estimation more challenging and increasing the risk of generating out-of-distribution artifacts. Based on the above findings, this work introduces DisCon (Discrete-Conditioned Continuous Autoregressive Model), a novel framework that reinterprets discrete tokens as conditional signals rather than generation targets. |
Peng Zheng; Junke Wang; Yi Chang; Yizhou Yu; Rui Ma; Zuxuan Wu; | code |
| 189 | OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose OMNI-DC, a highly robust DC model that generalizes zero-shot to various datasets. |
Yiming Zuo; Willow Yang; Zeyu Ma; Jia Deng; | code |
| 190 | Simultaneous Motion And Noise Estimation with Event Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, motion is an intrinsic part of event data, since scene edges cannot be sensed without motion. We propose, to the best of our knowledge, the first method that simultaneously estimates motion in its various forms (e.g., ego-motion, optical flow) and noise. |
Shintaro Shiba; Yoshimitsu Aoki; Guillermo Gallego; | code |
| 191 | Forensic-MoE: Exploring Comprehensive Synthetic Image Detection Traces with Mixture of Experts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, limited to a single forensic perspective, existing methods struggle to uncover sufficient traces when confronted with diverse synthetic techniques. In response, we argue that different synthetic images encompass a variety of forensic traces, and that utilizing multiple experts to explore traces from diverse perspectives is beneficial. |
Mingqi Fang; Ziguang Li; Lingyun Yu; Quanwei Yang; Hongtao Xie; Yongdong Zhang; | code |
| 192 | Few-Shot Image Quality Assessment Via Adaptation of Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing Blind IQA (BIQA) methods largely rely on extensive human annotations, which are labor-intensive and costly due to the demanding nature of creating IQA datasets. To reduce this dependency, we propose the Gradient-Regulated Meta-Prompt IQA Framework (GRMP-IQA), designed to efficiently adapt the visual-language pre-trained model, CLIP, to IQA tasks, achieving high accuracy even with limited data. |
Xudong Li; Zihao Huang; Yan Zhang; Yunhang Shen; Ke Li; Xiawu Zheng; Liujuan Cao; Rongrong Ji; | code |
| 193 | Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given text query within a video stream. |
Minghang Zheng; Yuxin Peng; Benyuan Sun; Yi Yang; Yang Liu; | code |
| 194 | SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, generalization-based methods do not generalize well to views unseen during training, while the reconstruction quality of overfitting-based methods is still constrained by limited geometry clues. To address this issue, we propose SparseRecon, a novel neural implicit reconstruction method for sparse views with volume rendering-based feature consistency and uncertainty-guided depth constraint. |
Liang Han; Xu Zhang; Haichuan Song; Kanle Shi; Yu-Shen Liu; Zhizhong Han; | code |
| 195 | Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. |
Suchisrit Gangopadhyay; Jung-Hee Kim; Xien Chen; Patrick Rim; Hyoungseob Park; Alex Wong; | code |
| 196 | FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: More importantly, current methodologies fail to fully leverage the inherent zero-shot potential of modern diffusion transformers (e.g., the Flux series) for authentic subject-driven synthesis. To bridge this gap, we propose FreeCus, a genuinely training-free framework that activates DiT’s capabilities through three key innovations: 1) We introduce a pivotal attention sharing mechanism that captures the subject’s layout integrity while preserving crucial editing flexibility. |
Yanbing Zhang; Zhe Wang; Qin Zhou; Mengping Yang; | code |
| 197 | Auto-Vocabulary Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding by eliminating the necessity to predefine object categories for segmentation. |
Osman Ülger; Maksymilian Kulicki; Yuki Asano; Martin R. Oswald; | code |
| 198 | Perspective-aware 3D Gaussian Inpainting with Multi-view Consistency Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present PAInpainter, a novel approach designed to advance 3D Gaussian inpainting by leveraging perspective-aware content propagation and consistency verification across multi-view inpainted images. |
Yuxin Cheng; Binxiao Huang; Taiqiang Wu; Wenyong Zhou; Chenchen Ding; Zhengwu Liu; Graziano Chesi; Ngai Wong; | code |
| 199 | Feature Coding in The Era of Large Models: Dataset, Test Conditions, and Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we draw attention to large model feature coding and make three fundamental contributions. First, we introduce a comprehensive dataset encompassing diverse features generated by three representative types of large models. |
Changsheng Gao; Yifan Ma; Qiaoxi Chen; Yenan Xu; Dong Liu; Weisi Lin; | code |
| 200 | Online Reasoning Video Segmentation with Just-in-Time Digital Twins Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. |
Yiqing Shen; Bohan Liu; Chenjia Li; Lalithkumar Seenivasan; Mathias Unberath; | code |
| 201 | Region-based Cluster Discrimination for Visual Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks, such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. |
Yin Xie; Kaicheng Yang; Xiang An; Kun Wu; Yongle Zhao; Weimo Deng; Zimin Ran; Yumeng Wang; Ziyong Feng; Roy Miles; Ismail Elezi; Jiankang Deng; | code |
| 202 | ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states through comprehensive Target-Context feature modeling, thereby achieving robust tracking. |
Xiaokun Feng; Shiyu Hu; Xuchen Li; Dailing Zhang; Meiqi Wu; Jing Zhang; Xiaotang Chen; Kaiqi Huang; | code |
| 203 | Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. |
Dohwan Ko; Ji Soo Lee; Minhyuk Choi; Zihang Meng; Hyunwoo J. Kim; | code |
| 204 | Generating Physically Stable and Buildable Brick Structures from Text Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce BrickGPT, the first approach for generating physically stable interconnecting brick assembly models from text prompts. |
Ava Pun; Kangle Deng; Ruixuan Liu; Deva Ramanan; Changliu Liu; Jun-Yan Zhu; | code |
| 205 | Sequential Gaussian Avatars with Hierarchical Motion Context Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose SeqAvatar, which excavates the explicit 3DGS representation to better model human avatars based on a hierarchical motion context. |
Wangze Xu; Yifan Zhan; Zhihang Zhong; Xiao Sun; | code |
| 206 | DISTIL: Data-Free Inversion of Suspicious Trojan Inputs Via Latent Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here, we propose a data-free, zero-shot trigger-inversion strategy that restricts the search space while avoiding strong assumptions on trigger appearance. |
Hossein Mirzaei; Zeinab Taghavi; Sepehr Rezaee; Masoud Hadi; Moein Madadi; Mackenzie W. Mathis; | code |
| 207 | DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce DexH2R, a comprehensive real-world dataset for human-to-robot handovers, built on a dexterous robotic hand. |
Youzhuo Wang; Jiayi Ye; Chuyang Xiao; Yiming Zhong; Heng Tao; Hang Yu; Yumeng Liu; Jingyi Yu; Yuexin Ma; | code |
| 208 | Who Is A Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, challenges persist regarding the quality of these talkers and the AGTHs they generate, and comprehensive studies addressing these issues remain limited. To address this gap, this paper presents THQA-10K, the largest AGTH quality assessment dataset to date, which selects 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts. |
Yingjie Zhou; Jiezhang Cao; Zicheng Zhang; Farong Wen; Yanwei Jiang; Jun Jia; Xiaohong Liu; Xiongkuo Min; Guangtao Zhai; | code |
| 209 | VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose VideoRFSplat, a direct text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. |
Hyojun Go; Byeongjun Park; Hyelin Nam; Byung-Hoon Kim; Hyungjin Chung; Changick Kim; | code |
| 210 | Breaking Rectangular Shackles: Cross-View Object Segmentation for Fine-Grained Object Geo-Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To tackle the challenges of cross-view object segmentation (CVOS), we propose Transformer Object Geo-localization (TROGeo), a two-stage framework. |
Qingwang Zhang; Yingying Zhu; | code |
| 211 | H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing approaches face a fundamental trade-off: explicit methods achieve geometric precision but struggle with ambiguous regions, while implicit methods provide robustness but suffer from slow convergence. We present H3R, a hybrid framework that addresses this limitation by integrating volumetric latent fusion with attention-based feature aggregation. |
Heng Jia; Linchao Zhu; Na Zhao; | code |
| 212 | Domain Generalizable Portrait Style Transfer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a portrait style transfer method that generalizes well to various different domains while enabling high-quality semantic-aligned stylization on regions including hair, eyes, eyelashes, skins, lips, and background. |
Xinbo Wang; Wenju Xu; Qing Zhang; Wei-Shi Zheng; | code |
| 213 | Towards A Unified Copernicus Foundation Model for Earth Vision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most existing efforts remain limited to fixed spectral sensors, focus solely on the Earth’s surface, and overlook valuable metadata beyond imagery. In this work, we take a step towards next-generation EO foundation models with three key components: 1) Copernicus-Pretrain, a massive-scale pretraining dataset that integrates 18.7M aligned images from all major Copernicus Sentinel missions, spanning from the Earth’s surface to its atmosphere; 2) Copernicus-FM, a unified foundation model capable of processing any spectral or non-spectral sensor modality using extended dynamic hypernetworks and flexible metadata encoding; and 3) Copernicus-Bench, a systematic evaluation benchmark with 15 hierarchical downstream tasks ranging from preprocessing to specialized applications for each Sentinel mission. |
Yi Wang; Zhitong Xiong; Chenying Liu; Adam J. Stewart; Thomas Dujardin; Nikolaos Ioannis Bountos; Angelos Zavras; Franziska Gerken; Ioannis Papoutsis; Laura Leal-Taixé; Xiao Xiang Zhu; | code |
| 214 | SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we offer a novel solution by constructing more effective editing instructions for given image pairs. |
Ming Li; Xin Gu; Fan Chen; Xiaoying Xing; Longyin Wen; Chen Chen; Sijie Zhu; | code |
| 215 | UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. |
Peiming Li; Ziyi Wang; Yulin Yuan; Hong Liu; Xiangming Meng; Junsong Yuan; Mengyuan Liu; | code |
| 216 | DADM: Dual Alignment of Domain and Modality for Face Anti-spoofing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To tackle (1), we propose an alignment module between modalities based on mutual information, which adaptively enhances favorable modalities while suppressing unfavorable ones. |
Jingyi Yang; Xun Lin; Zitong Yu; Liepiao Zhang; Xin Liu; Hui Li; Xiaochen Yuan; Xiaochun Cao; | code |
| 217 | RhythmGaussian: Repurposing Generalizable Gaussian Model For Remote Physiological Measurement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the challenge, we employ the Generalizable Gaussian Model (GGM) to disentangle geometry and chroma components with 4D Gaussian representations. |
Hao Lu; Yuting Zhang; Jiaqi Tang; Bowen Fu; Wenhang Ge; Wei Wei; Kaishun Wu; Yingcong Chen; | code |
| 218 | AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose to learn from large-scale human action video datasets in an explicit way (i.e., imitating human actions from hand keypoints), introducing Visual Robot Manipulation with Analogical Reasoning (AR-VRM). |
Dejie Yang; Zijing Zhao; Yang Liu; | code |
| 219 | Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Point-PQAE, a cross-reconstruction generative paradigm that first generates two decoupled point clouds/views and then reconstructs one from the other. |
Xiangdong Zhang; Shaofeng Zhang; Junchi Yan; | code |
| 220 | A Framework for Double-Blind Federated Adaptation of Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose BlindFed, a framework enabling collaborative FM adaptation while protecting both parties: data owners do not access the FM or each other’s data, and the LSP does not see sensitive task data. |
Nurbek Tastan; Karthik Nandakumar; | code |
| 221 | End-to-End Entity-Predicate Association Reasoning for Dynamic Scene Graph Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, current approaches do not support end-to-end training and instead rely on a two-stage pipeline, which incurs higher computational costs. To address these issues, we propose an end-to-end Association Reasoning Network (ARN) for DSGG. |
Liwei Wang; Yanduo Zhang; Tao Lu; Fang Liu; Huiqin Zhang; Jiayi Ma; Huabing Zhou; | code |
| 222 | Arti-PG: A Toolbox for Procedurally Synthesizing Large-Scale and Diverse Articulated Objects with Rich Annotations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to rapidly gather a large number of 3D articulated objects with comprehensive and detailed annotations for training, we propose the Articulated Object Procedural Generation toolbox, a.k.a. the Arti-PG toolbox. |
Jianhua Sun; Yuxuan Li; Jiude Wei; Longfei Xu; Nange Wang; Yining Zhang; Cewu Lu; | code |
| 223 | Baking Gaussian Splatting Into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object generation and scene reconstruction from a single view. |
Yuanhao Cai; He Zhang; Kai Zhang; Yixun Liang; Mengwei Ren; Fujun Luan; Qing Liu; Soo Ye Kim; Jianming Zhang; Zhifei Zhang; Yuqian Zhou; Yulun Zhang; Xiaokang Yang; Zhe Lin; Alan Yuille; | code |
| 224 | HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these, this paper introduces HoliTracer, the first framework designed to holistically extract vectorized geographic objects from large-size RSI. |
Yu Wang; Bo Dang; Wanchun Li; Wei Chen; Yansheng Li; | code |
| 225 | Context-Aware Academic Emotion Dataset and Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While significant progress has been made in facial expression recognition for basic emotions, academic emotion recognition remains underexplored, largely due to the scarcity of publicly available datasets. To bridge this gap, we introduce RAER, a novel dataset comprising approximately 2,700 video clips collected from around 140 students in diverse, natural learning contexts such as classrooms, libraries, laboratories, and dormitories, covering both classroom sessions and individual study. |
Luming Zhao; Jingwen Xuan; Jiamin Lou; Yonghui Yu; Wenwu Yang; | code |
| 226 | Cross-Granularity Online Optimization with Masked Compensated Information for Learned Image Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing practices that train models offline on an entire dataset face a limitation, as the estimated distribution only approximates the general image signal distribution and fails to capture image-specific characteristics. To address this issue, we propose a cross-granularity online optimization strategy to mitigate information loss from two key aspects: statistical distribution gaps and local structural gaps. |
Haowei Kuang; Wenhan Yang; Zongming Guo; Jiaying Liu; | code |
| 227 | PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose PRE-Mamba, a novel point-based event camera deraining framework that fully exploits the spatiotemporal characteristics of raw events and rain. |
Ciyu Ruan; Ruishan Guo; Zihang Gong; Jingao Xu; Wenhan Yang; Xinlei Chen; | code |
| 228 | RealGeneral: Unifying Visual Generation Via Temporal In-Context Learning with Video Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. |
Yijing Lin; Mengqi Huang; Shuhan Zhuang; Zhendong Mao; | code |
| 229 | Processing and Acquisition Traces in Visual Encoders: What Does CLIP Know About Your Camera? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We take a different perspective by analyzing parameters of the image acquisition process and transformations that may be subtle or even imperceptible to the human eye. |
Ryan Ramos; Vladan Stojnić; Giorgos Kordopatis-Zilos; Yuta Nakashima; Giorgos Tolias; Noa Garcia; | code |
| 230 | Learning to See in The Extremely Dark Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references to comprise a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. |
Hai Jiang; Binhao Guan; Zhen Liu; Xiaohong Liu; Jian Yu; Zheng Liu; Songchen Han; Shuaicheng Liu; | code |
| 231 | Robust Dataset Condensation Using Supervised Contrastive Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods show severe performance degradation when applied to noisy datasets. To address this, we present robust dataset condensation (RDC), an end-to-end method that mitigates noise to generate a clean and robust synthetic set, without requiring separate noise-reduction preprocessing steps. |
Nicole Hee-Yeon Kim; Hwanjun Song; | code |
| 232 | CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This prospective study proposes CoMatch, a novel semi-dense image matcher with dynamic covisibility awareness and bilateral subpixel accuracy. |
Zizhuo Li; Yifan Lu; Linfeng Tang; Shihua Zhang; Jiayi Ma; | code |
| 233 | ForCenNet: Foreground-Centric Network for Document Image Rectification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Foreground-Centric Network (ForCenNet) to eliminate geometric distortions in document images. |
Peng Cai; Qiang Li; Kaicheng Yang; Dong Guo; Jia Li; Nan Zhou; Xiang An; Ninghua Yang; Jiankang Deng; | code |
| 234 | LayerAnimate: Layer-level Control for Animation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing anime video generation methods typically treat animation as a data domain distinct from real-world videos, lacking fine-grained control at the layer level. To bridge this gap, we introduce LayerAnimate, a novel video diffusion framework with a layer-aware architecture that empowers the manipulation of layers through layer-level controls. |
Yuxue Yang; Lue Fan; Zuzeng Lin; Feng Wang; Zhaoxiang Zhang; | code |
| 235 | Learned Image Compression with Hierarchical Progressive Context Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a novel Hierarchical Progressive Context Model (HPCM) for more efficient context information acquisition. |
Yuqi Li; Haotian Zhang; Li Li; Dong Liu; | code |
| 236 | Leveraging Spatial Invariance to Boost Adversarial Transferability Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing input transformation-based adversarial attacks solely focus on behavioral patterns at a singular position, failing to fully exploit the spatial invariance exhibited by DNNs across multiple positions, thus constraining the transferability of adversarial examples. To address this, we propose a multi-scale, multi-position input transformation-based attack called Spatial Invariance Diversity (SID). |
Zihan Zhou; Li Li; Yanli Ren; Chuan Qin; Guorui Feng; | code |
| 237 | Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Such task formulation, however, underutilizes the inherent video output of the drone and is sensitive to occlusions and viewpoint disparity. To address these limitations, we formulate a new video-based drone geo-localization task and propose the Video2BEV paradigm. |
Hao Ju; Shaofei Huang; Si Liu; Zhedong Zheng; | code |
| 238 | Moderating The Generalization of Score-based Generative Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Further analysis of score functions reveals that the MU ‘gold standard’ does not alter the original score function, which explains its ineffectiveness. Building on this insight, we propose the first Moderated Score-based Generative Model (MSGM), which introduces a novel score adjustment strategy that redirects the score function away from undesirable data during the continuous-time stochastic differential equation process. |
Wan Jiang; He Wang; Xin Zhang; Dan Guo; Zhaoxin Fan; Yunfeng Diao; Richang Hong; | code |
| 239 | EEGMirror: Leveraging EEG Data in The Wild Via Montage-Agnostic Self-Supervision for EEG to Video Decoding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, EEG-to-video is challenging due to the complexity and nonstationarity of EEG signals and the scarcity of data annotations. Addressing these issues, we present EEGMirror. |
Xuan-Hao Liu; Bao-Liang Lu; Wei-Long Zheng; | code |
| 240 | RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances. To address this, we propose RobustSplat, a robust solution based on two critical designs. |
Chuanyu Fu; Yuqi Zhang; Kunbin Yao; Guanying Chen; Yuan Xiong; Chuan Huang; Shuguang Cui; Xiaochun Cao; | code |
| 241 | Optimal Transport for Brain-Image Alignment: Unveiling Redundancy and Synergy in Neural Information Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods primarily align brain signals with stimulus signals using Mean Squared Error (MSE), which focuses only on local point-wise alignment and ignores global matching, leading to coarse interpretations and inaccuracies in brain signal decoding. In this paper, we address these issues through optimal transport (OT) and theoretically demonstrate why OT provides a more effective alignment strategy than MSE. |
Yang Xiao; Wang Lu; Jie Ji; Ruimeng Ye; Gen Li; Xiaolong Ma; Bo Hui; | code |
| 242 | Unveiling The Invisible: Reasoning Complex Occlusions Amodally with AURA Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While methods like LISA integrate multi-modal large language models (LLMs) with segmentation for reasoning tasks, they are limited to predicting only visible object regions and face challenges in handling complex occlusion scenarios. To address these limitations, we propose a novel task named amodal reasoning segmentation, aiming to predict the complete amodal shape of occluded objects while providing answers with elaborations based on user text input. |
Zhixuan Li; Hyunse Yoon; Sanghoon Lee; Weisi Lin; | code |
| 243 | Rethinking Multi-modal Object Detection from The Perspective of Mono-Modality Feature Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This leads to a prevalent but unreasonable phenomenon, Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to the multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. |
Tianyi Zhao; Boyang Liu; Yanglei Gao; Yiming Sun; Maoxun Yuan; Xingxing Wei; | code |
| 244 | Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most knowledge distillation (KD) methods focus on teacher-student pairs with similar architectures, such as both being CNN models. |
Guopeng Li; Qiang Wang; Ke Yan; Shouhong Ding; Yuan Gao; Gui-Song Xia; | code |
| 245 | Understanding Co-speech Gestures In-the-wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a new framework for co-speech gesture understanding in the wild. |
Sindhu B Hegde; K R Prajwal; Taein Kwon; Andrew Zisserman; | code |
| 246 | A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, when processing unlabeled data, models tend to focus not only on key objects in the image but also on task-irrelevant background regions, leading to suboptimal feature extraction. To remove this stumbling block, we propose Attention Focusing (AF), an adaptive mechanism designed to sharpen the model’s focus by pruning non-informative tokens. |
Qiyu Xu; Zhanxuan Hu; Yu Duan; Ercheng Pei; Yonghang Tai; | code |
| 247 | Layer-wise Vision Injection with Disentangled Attention for Efficient LVLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Benefiting from recent advancements in large language models and modality alignment techniques, existing Large Vision-Language Models (LVLMs) have achieved prominent performance across a wide range of scenarios. |
Xuange Zhang; Dengjie Li; Bo Liu; Zenghao Bao; Yao Zhou; Baisong Yang; Zhongying Liu; Yujie Zhong; Tongtong Yuan; | code |
| 248 | TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Based on the analyses, we introduce a novel Laplace mixer to decouple the features in terms of frequency and input only the low-frequency components into the Mamba block. |
Xiaowen Ma; Zhenliang Ni; Xinghao Chen; | code |
| 249 | BadVideo: Stealthy Backdoor Attack Against Text-to-Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We observe that in T2V generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts, such as environmental elements, secondary objects, and additional details, providing opportunities for malicious attackers to embed hidden harmful content. Exploiting this inherent redundancy, we introduce BadVideo, the first backdoor attack framework tailored for T2V generation. |
Ruotong Wang; Mingli Zhu; Jiarong Ou; Rui Chen; Xin Tao; Pengfei Wan; Baoyuan Wu; | code |
| 250 | A Token-level Text Image Foundation Model for Document Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, without semantically fine-grained supervision, these models still encounter fundamental prediction errors in the context of downstream text-image-related tasks, i.e., perception, understanding and reasoning with images containing small and dense texts. To bridge this gap, we develop TokenFD, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. |
Tongkun Guan; Zining Wang; Pei Fu; Zhengtao Guo; Wei Shen; Kai Zhou; Tiezhu Yue; Chen Duan; Hao Sun; Qianyi Jiang; Junfeng Luo; Xiaokang Yang; | code |
| 251 | X-Dancer: Expressive Music to Human Dance Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present X-Dancer, a novel zero-shot music-driven image animation pipeline that creates diverse and long-range lifelike human dance videos from a single static image. |
Zeyuan Chen; Hongyi Xu; Guoxian Song; You Xie; Chenxu Zhang; Xin Chen; Chao Wang; Di Chang; Linjie Luo; | code |
| 252 | Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, parameter-dependent methods require retraining LVLMs to recover performance while token-dependent strategies struggle to consistently select the most relevant tokens. In this paper, we systematically analyze the above challenges and provide a series of valuable insights for inference acceleration. |
Wei Suo; Ji Ma; Mengyang Sun; Lin Yuanbo Wu; Peng Wang; Yanning Zhang; | code |
| 253 | CuRe: Cultural Gaps in The Long Tail of Text-to-Image Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Popular text-to-image (T2I) systems are trained on web-scraped data, which is heavily Amero- and Euro-centric, underrepresenting the cultures of the Global South. To analyze these biases, we introduce CuRe, a novel and scalable benchmarking and scoring suite for cultural representativeness that leverages the marginal utility of attribute specification to T2I systems as a proxy for human judgments. |
Aniket Rege; Zinnia Nie; Mahesh Ramesh; Unmesh Raskar; Zhuoran Yu; Aditya Kusupati; Yong Jae Lee; Ramya Korlakai Vinayak; | code |
| 254 | Dataset Distillation Via The Wasserstein Metric Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Drawing from optimal transport theory, we introduce WMDD (Wasserstein Metric-based Dataset Distillation), a straightforward yet powerful method that employs the Wasserstein metric to enhance distribution matching. |
Haoyang Liu; Yijiang Li; Tiancheng Xing; Peiran Wang; Vibhu Dalal; Luwei Li; Jingrui He; Haohan Wang; | code |
| 255 | Enhanced Pansharpening Via Quaternion Spatial-Spectral Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, many existing methods struggle to fully capture spatial and spectral interactions, limiting their effectiveness. To address this, we propose a novel quaternion-based spatial-spectral interaction network that enhances pansharpening by leveraging the compact representation capabilities of quaternions for high-dimensional data. |
Dong Li; Chunhui Luo; Yuanfei Bao; Gang Yang; Jie Xiao; Xueyang Fu; Zheng-Jun Zha; | code |
| 256 | SEGA: A Stepwise Evolution Paradigm for Content-Aware Layout Generation with Design Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we study the content-aware layout generation problem, which aims to automatically generate layouts that are harmonious with a given background image. |
Haoran Wang; Bo Zhao; Jinghui Wang; Hanzhang Wang; Huan Yang; Wei Ji; Hao Liu; Xinyan Xiao; | code |
| 257 | Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, a category-level geometry learning framework is proposed to explore the domain-invariant geometric features for domain generalized 3D semantic segmentation. |
Pei He; Lingling Li; Licheng Jiao; Ronghua Shang; Fang Liu; Shuang Wang; Xu Liu; Wenping Ma; | code |
| 258 | ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we study the problem of Text-to-Image In-Context Learning (T2I-ICL). |
Jiaqi Liao; Zhengyuan Yang; Linjie Li; Dianqi Li; Kevin Lin; Yu Cheng; Lijuan Wang; | code |
| 259 | Stable Score Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Stable Score Distillation (SSD), a streamlined framework that enhances stability and alignment in the editing process by anchoring a single classifier to the source prompt. |
Haiming Zhu; Yangyang Xu; Chenshu Xu; Tingrui Shen; Wenxi Liu; Yong Du; Jun Yu; Shengfeng He; | code |
| 260 | Bias-Resilient Weakly Supervised Semantic Segmentation Using Normalizing Flows Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose to use normalizing flow to model the class feature distribution of all pixels across the entire dataset and design a Bias-Resilient WSSS framework based on Normalizing Flow (BRNF). |
Xianglin Qiu; Xiaoyang Wang; Zhen Zhang; Jimin Xiao; | code |
| 261 | Is Less More? Exploring Token Condensation As Training-free Test-time Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing TC methods often fail to maintain in-distribution performance when reducing tokens, prompting us to ask: How can we transform TC into an effective "free-lunch" adaptation strategy for VLMs? To address this, we propose Token Condensation as Adaptation (TCA), a training-free adaptation method that takes a step beyond standard TC. |
Zixin Wang; Dong Gong; Sen Wang; Zi Huang; Yadan Luo; | code |
| 262 | Estimating 2D Camera Motion with Hybrid Motion Basis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce CamFlow, a novel framework that represents camera motion using hybrid motion bases: physical bases derived from camera geometry and stochastic bases for complex scenarios. |
Haipeng Li; Tianhao Zhou; Zhanglei Yang; Yi Wu; Yan Chen; Zijing Mao; Shen Cheng; Bing Zeng; Shuaicheng Liu; | code |
| 263 | LoD-Loc V2: Aerial Visual Localization Over Low Level-of-Detail City Models Using Explicit Silhouette Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel method for aerial visual localization over low Level-of-Detail (LoD) city models. |
Juelin Zhu; Shuaibang Peng; Long Wang; Hanlin Tan; Yu Liu; Maojun Zhang; Shen Yan; | code |
| 264 | AnimalClue: Recognizing Animals By Their Traces Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, accurately identifying species from indirect evidence like footprints and feces remains relatively underexplored, despite its importance in contributing to wildlife monitoring. To bridge this gap, we introduce AnimalClue, the first large-scale dataset for species identification from images of indirect evidence. |
Risa Shinoda; Nakamasa Inoue; Iro Laina; Christian Rupprecht; Hirokatsu Kataoka; | code |
| 265 | Ultra High-Resolution Image Inpainting with Patch-Based Content Consistency Adapter Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present Patch-Adapter, an effective framework for high-resolution text-guided image inpainting. |
Jianhui Zhang; Shen Cheng; Qirui Sun; Jia Liu; Wang Luyang; Chaoyu Feng; Chen Fang; Lei Lei; Jue Wang; Shuaicheng Liu; | code |
| 266 | Event-guided Unified Framework for Low-light Video Enhancement, Frame Interpolation, and Deblurring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Therefore, videos captured in low light exhibit low visibility and motion blur, as well as low frame rates. To overcome these limitations, we propose a novel problem aimed at transforming motion-blurred, low-frame-rate videos with poor visibility in low-light environments into high-frame-rate videos while simultaneously enhancing their visibility. |
Taewoo Kim; Kuk-Jin Yoon; | code |
| 267 | Generalizable Object Re-Identification Via Visual In-Context Prompting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes Visual In-Context Prompting (VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only in-context examples as prompts, without requiring parameter adaptation. |
Zhizhong Huang; Xiaoming Liu; | code |
| 268 | Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing Via Bidirectional Warping Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Accordingly, we present Inpaint4Drag, a novel framework that decomposes drag-based editing into pixel-space bidirectional warping and image inpainting. |
Jingyi Lu; Kai Han; | code |
| 269 | LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a result, these methods often lack spatial consistency, leading to editing artifacts and distortions. In this work, we address these limitations and introduce LOCATEdit, which enhances cross-attention maps through a graph-based approach utilizing self-attention-derived patch relationships to maintain smooth, coherent attention across image regions, ensuring that alterations are limited to the designated items while retaining the surrounding structure. |
Achint Soni; Meet Soni; Sirisha Rambhatla; | code |
| 270 | TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing multi-modal fusion methods typically apply static frame-based image fusion techniques directly to video fusion tasks, neglecting inherent temporal dependencies and leading to inconsistent results across frames. To address this limitation, we propose the first video fusion framework that explicitly incorporates temporal modeling with visual-semantic collaboration to simultaneously ensure visual fidelity, semantic accuracy, and temporal consistency. |
Meiqi Gong; Hao Zhang; Xunpeng Yi; Linfeng Tang; Jiayi Ma; | code |
| 271 | MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the robustness of blind PnP to noise and outliers in correspondences, we propose an approximate blind PnP-based correspondence learning approach. |
Pei An; Jiaqi Yang; Muyao Peng; You Yang; Qiong Liu; Xiaolin Wu; Liangliang Nan; | code |
| 272 | Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. |
Giuseppe Cartella; Vittorio Cuculo; Alessandro D’Amelio; Marcella Cornia; Giuseppe Boccignone; Rita Cucchiara; | code |
| 273 | What’s in A Latent? Leveraging Diffusion Latent Space for Domain Generalization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Domain Generalization aims to develop models that can generalize to novel and unseen data distributions. In this work, we study how model architectures and pre-training objectives impact feature richness and propose a method to effectively leverage them for domain generalization. |
Xavier Thomas; Deepti Ghadiyaram; | code |
| 274 | FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recently, with the development of Diffusion-based Dense Prediction (DDP) models, DA designs tailored to this framework have become worth exploring, since the diffusion model is effective in modeling the distribution transformation that comprises domain information. In this work, we propose a training-free mechanism for DDP frameworks, endowing them with DA capabilities. |
Hang Xu; Jie Huang; Linjiang Huang; Dong Li; Yidi Liu; Feng Zhao; | code |
| 275 | Moment Quantization for Video Temporal Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG), which quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. |
Xiaolong Sun; Le Wang; Sanping Zhou; Liushuai Shi; Kun Xia; Mengnan Liu; Yabing Wang; Gang Hua; | code |
| 276 | LONG3R: Long Sequence Streaming 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose LONG3R (LOng sequence streamiNG 3D Reconstruction), a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. |
Zhuoguang Chen; Minghui Qin; Tianyuan Yuan; Zhe Liu; Hang Zhao; | code |
| 277 | MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite recent advances in text-to-speech (TTS) models, audio-visual-to-audio-visual (AV2AV) translation still faces a critical challenge: maintaining speaker consistency between the original and translated vocal and facial features. To address this issue, we propose a conditional flow matching (CFM) zero-shot audio-visual renderer that utilizes strong dual guidance from both audio and visual modalities. |
Sungwoo Cho; Jeongsoo Choi; Sungnyun Kim; Se-Young Yun; | code |
| 278 | Rethinking Key-frame-based Micro-expression Recognition: A Robust and Accurate Framework Against Key-frame Errors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose CausalNet, a novel framework to achieve robust MER facing key-frame index errors while maintaining accurate recognition. |
Zheyuan Zhang; Weihao Tang; Hong Chen; | code |
| 279 | StrandHead: Text to Hair-Disentangled 3D Head Avatars Using Human-Centric Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose StrandHead, a novel text-driven method capable of generating 3D hair strands and disentangled head avatars with strand-level attributes. |
Xiaokun Sun; Zeyu Cai; Ying Tai; Jian Yang; Zhenyu Zhang; | code |
| 280 | Intermediate Connectors and Geometric Priors for Language-Guided Affordance Segmentation on Unseen Object Categories Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, current LASO solutions struggle to extend learned affordances to object categories that are not encountered during training. Scrutinizing these designs, we identify limited generalizability on unseen categories, stemming from (1) underutilized generalizable patterns in the intermediate layers of both 3D and text backbones, which impedes the formation of robust affordance knowledge, and (2) the inability to handle substantial variability in affordance regions across object categories due to a lack of structural knowledge of the target region. Towards this, we introduce a GeneraLized frAmework on uNseen CategoriEs (GLANCE), incorporating two key components: a cross-modal connector that links intermediate stages of the text and 3D backbones to enrich pointwise embeddings with affordance concepts, and a VLM-guided query generator that provides affordance priors by extracting a few 3D key points based on the intra-view reliability and cross-view consistency of their multi-view segmentation masks. |
Yicong Li; Yiyang Chen; Zhenyuan Ma; Junbin Xiao; Xiang Wang; Angela Yao; | code |
| 281 | ArgMatch: Adaptive Refinement Gathering for Efficient Dense Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although coarse-to-fine schemes mitigate computational costs, their efficiency remains limited by the substantial demands of heavy feature extractors and global matchers. In this paper, we propose Adaptive Refinement Gathering, a refinement pipeline that reduces reliance on these costly components without sacrificing accuracy. |
Yuxin Deng; Kaining Zhang; Linfeng Tang; Jiaqi Yang; Jiayi Ma; | code |
| 282 | A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This can be attributed to flawed motion clips in video-based motion capture results and the inherent complexity of modeling high-difficulty motions. Therefore, sensing the advantage of segmentation in localizing the human body, we introduce a mask-based motion correction module (MCM) that leverages motion context and video masks to repair flawed motions, and propose a physics-based motion transfer module (PTM), which employs a prior-injected pretrain-and-adapt approach for motion imitation, improving physical plausibility with the ability to handle in-the-wild and challenging motions. |
Youliang Zhang; Ronghui Li; Yachao Zhang; Liang Pan; Jingbo Wang; Yebin Liu; Xiu Li; | code |
| 283 | Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance) using a single network. |
Shuang Guo; Friedhelm Hamann; Guillermo Gallego; | code |
| 284 | Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing dermatological datasets are limited in both scale and depth, offering only single-label annotations across a narrow range of diseases instead of rich textual descriptions, and lacking the crucial clinical context needed for real-world applications. To address these limitations, we present Derm1M, the first large-scale vision-language dataset for dermatology, comprising 1,029,761 image-text pairs. |
Siyuan Yan; Ming Hu; Yiwen Jiang; Xieji Li; Hao Fei; Philipp Tschandl; Harald Kittler; Zongyuan Ge; | code |
| 285 | NeurOp-Diff: Continuous Remote Sensing Image Super-Resolution Via Neural Operator Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most publicly accessible remote sensing data suffer from low resolution, limiting their practical applications. To address this, we propose a diffusion model guided by neural operators (NO) for continuous remote sensing image super-resolution (NeurOp-Diff). |
Zihao Xu; Yuzhi Tang; Bowen Xu; Qingquan Li; | code |
| 286 | Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, effectively encoding textual descriptions of visual appearances remains difficult, limiting the model’s ability to capture detailed HOI relationships. To address these issues, we propose INteraction-aware Prompting with Concept Calibration (INP-CC), an end-to-end open-vocabulary HOI detector that integrates interaction-aware prompts and concept calibration. |
Ting Lei; Shaofeng Yin; Qingchao Chen; Yuxin Peng; Yang Liu; | code |
| 287 | Joint Asymmetric Loss for Learning with Noisy Labels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: By substituting the traditional symmetric passive loss in APL with our proposed AMSE, we introduce a novel robust loss framework termed Joint Asymmetric Loss (JAL). |
Jialiang Wang; Xianming Liu; Xiong Zhou; Gangfeng Hu; Deming Zhai; Junjun Jiang; Xiangyang Ji; | code |
| 288 | Large-scale Pre-training for Grounded Video Caption Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel approach for captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally dense bounding boxes. |
Evangelos Kazakos; Cordelia Schmid; Josef Sivic; | code |
| 289 | From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To mitigate this, feature caching has been proposed to accelerate diffusion models by caching features from previous timesteps and reusing them in subsequent timesteps. However, at timesteps with significant intervals, the feature similarity in diffusion models decreases substantially, leading to a pronounced increase in errors introduced by feature caching and significantly harming generation quality. To solve this problem, we propose TaylorSeer, which first shows that features of diffusion models at future timesteps can be predicted from their values at previous timesteps. Based on the fact that features change slowly and continuously across timesteps, TaylorSeer employs a differential method to approximate higher-order derivatives of features and predicts features at future timesteps via Taylor series expansion. |
Jiacheng Liu; Chang Zou; Yuanhuiyi Lyu; Junjie Chen; Linfeng Zhang; | code |
| 290 | VA-MoE: Variables-Adaptive Mixture of Experts for Incremental Weather Forecasting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents Variables-Adaptive Mixture of Experts (VA-MoE), a novel framework for incremental weather forecasting that dynamically adapts to evolving spatiotemporal patterns in real-time data. |
Hao Chen; Han Tao; Guo Song; Jie Zhang; Yonghan Dong; Yunlong Yu; Lei Bai; | code |
| 291 | FedXDS: Leveraging Model Attribution Methods to Counteract Data Heterogeneity in Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we show for the first time how XAI can be utilized in the context of federated learning. |
Maximilian Andreas Hoefler; Karsten Mueller; Wojciech Samek; | code |
| 292 | Learning Precise Affordances from Egocentric Videos for Robotic Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, there is a lack of large-scale affordance datasets with precise segmentation maps, existing models struggle to generalize across different domains or novel object and affordance classes, and little work demonstrates deployability in real-world scenarios. In this work, we address these issues by proposing a complete affordance learning system that (1) takes in egocentric videos and outputs precise affordance annotations without human labeling, (2) leverages geometric information and vision foundation models to improve generalization, and (3) introduces a framework that facilitates affordance-oriented robotic manipulation such as tool grasping and robot-to-human tool handover. |
Gen Li; Nikolaos Tsagkas; Jifei Song; Ruaridh Mon-Williams; Sethu Vijayakumar; Kun Shao; Laura Sevilla-Lara; | code |
| 293 | DuCos: Duality Constrained Depth Super-Resolution Via Foundation Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce DuCos, a novel depth super-resolution framework grounded in Lagrangian duality theory, offering a flexible integration of multiple constraints and reconstruction objectives to enhance accuracy and robustness. |
Zhiqiang Yan; Zhengxue Wang; Haoye Dong; Jun Li; Jian Yang; Gim Hee Lee; | code |
| 294 | PartField: Learning 3D Feature Fields for Part Segmentation and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose PartField, a feedforward approach for learning part-based 3D features, which captures the general concept of parts and their hierarchy without relying on predefined templates or text-based names, and can be applied to open-world 3D shapes across various modalities. |
Minghua Liu; Mikaela Angelina Uy; Donglai Xiang; Hao Su; Sanja Fidler; Nicholas Sharp; Jun Gao; | code |
| 295 | Open-World Skill Discovery from Unsegmented Demonstration Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. |
Jingwen Deng; Zihao Wang; Shaofei Cai; Anji Liu; Yitao Liang; | code |
| 296 | Democratizing High-Fidelity Co-Speech Gesture Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a lightweight framework that utilizes 2D full-body skeletons as an efficient auxiliary condition to bridge audio signals with visual outputs. |
Xu Yang; Shaoli Huang; Shenbo Xie; Xuelin Chen; Yifei Liu; Changxing Ding; | code |
| 297 | Vector Contrastive Learning For Pixel-Wise Pretraining In Medical Vision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our vector CL reformulates CL as a vector regression problem, enabling dispersion quantification in pixel-wise pretraining via modeling feature distances in regressing displacement vectors. To implement this novel paradigm, we propose the COntrast in VEctor Regression (COVER) framework. |
Yuting He; Shuo Li; | code |
| 298 | SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While current OV segmentation models perform well on natural image datasets, their direct application to remote sensing faces challenges such as diverse landscapes, seasonal variations, and the presence of small or ambiguous objects in aerial imagery. To overcome these challenges, we propose SCORE (Scene Context matters in Open-vocabulary REmote sensing instance segmentation), a framework that integrates multi-granularity scene context, i.e., regional context and global context, to enhance both visual and textual representations. |
Shiqi Huang; Shuting He; Huaiyuan Qin; Bihan Wen; | code |
| 299 | MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges. |
Fu Rong; Meng Lan; Qian Zhang; Lefei Zhang; | code |
| 300 | Retinex-MEF: Retinex-based Glare Effects Aware Unsupervised Multi-Exposure Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the conventional pixel-wise multiplication of illumination and reflectance inadequately models the glare effect induced by overexposure. To address this limitation, we introduce an unsupervised and controllable method termed Retinex-MEF. |
Haowen Bai; Jiangshe Zhang; Zixiang Zhao; Lilun Deng; Yukun Cui; Shuang Xu; | code |
| 301 | METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce prohibitive computational overhead to achieve superior performance using complementary visual representations from multiple vision encoders. To address this, we propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR), that eliminates redundant visual tokens across the encoding, fusion, and decoding stages for multi-encoder MLLMs. |
Yuchen Liu; Yaoming Wang; Bowen Shi; Xiaopeng Zhang; Wenrui Dai; Chenglin Li; Hongkai Xiong; Qi Tian; | code |
| 302 | Empowering Your Pansharpening Models with Generalizability: Unified Distribution Is All You Need Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To validate the idea and further achieve a "train once, deploy forever" capability, this paper introduces a novel and intuitive approach to empower any pansharpening model with generalizability by employing a unified distribution strategy (UniPAN). |
Yongchuan Cui; Peng Liu; Hui Zhang; | code |
| 303 | Generative Zoo Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose an alternative approach to synthetic-data generation: rendering with a conditional image-generation model. |
Tomasz Niewiadomski; Anastasios Yiannakidis; Hanz Cuevas-Velasquez; Soubhik Sanyal; Michael J. Black; Silvia Zuffi; Peter Kulits; | code |
| 304 | From Sharp to Blur: Unsupervised Domain Adaptation for 2D Human Pose Estimation Under Extreme Motion Blur Using Event Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most datasets assume stable conditions, making models trained on sharp images struggle in blurred environments. To address this, we introduce a novel domain adaptation approach that leverages event cameras, which capture high temporal resolution motion data and are inherently robust to motion blur. |
Youngho Kim; Hoonhee Cho; Kuk-Jin Yoon; | code |
| 305 | SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce SeaS, a unified industrial generative model for automatically creating diverse anomalies, authentic normal products, and precise anomaly masks. |
Zhewei Dai; Shilei Zeng; Haotian Liu; Xurui Li; Feng Xue; Yu Zhou; | code |
| 306 | R1-Onevision: Advancing Generalized Multimodal Reasoning Through Cross-Modal Formalization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. |
Yi Yang; Xiaoxuan He; Hongkun Pan; Xiyan Jiang; Yan Deng; Xingtao Yang; Haoyu Lu; Dacheng Yin; Fengyun Rao; Minfeng Zhu; Bo Zhang; Wei Chen; | code |
| 307 | Highlight What You Want: Weakly-Supervised Instance-Level Controllable Infrared-Visible Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most existing fusion models lack controllability, making it difficult to customize the fused output according to user preferences. To address this challenge, we propose a novel weakly-supervised, instance-level controllable fusion model that adaptively highlights user-specified instances based on input text. |
Zeyu Wang; Jizheng Zhang; Haiyu Song; Mingyu Ge; Jiayu Wang; Haoran Duan; | code |
| 308 | Adaptive Learning of High-Value Regions for Semi-Supervised Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods fail to explore which regions hold higher learning value and how to design adaptive learning strategies for these regions. To address these issues, we propose a novel adaptive learning of high-value regions (ALHVR) framework. |
Tao Lei; Ziyao Yang; Xingwu Wang; Yi Wang; Xuan Wang; Feiman Sun; Asoke K. Nandi; | code |
| 309 | A Real-world Display Inverse Rendering Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce the first real-world dataset for display-based inverse rendering. |
Seokjun Choi; Hoon-Gyu Chung; Yujin Jeon; Giljoo Nam; Seung-Hwan Baek; | code |
| 310 | Physical Degradation Model-Guided Interferometric Hyperspectral Reconstruction with Unfolding Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: 2) the difficulty in eliminating IHI-specific degradation components through learning-based methods. To address these challenges, we propose a novel IHI reconstruction pipeline. |
Yuansheng Li; Yunhao Zou; Linwei Chen; Ying Fu; | code |
| 311 | LMM-Det: Make Large Multimodal Models Excel in Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To bridge the gap, we depart from the conventional methods of integrating heavy detectors with LMMs and propose LMM-Det, a simple yet effective approach that leverages a large multimodal model for vanilla object detection without relying on specialized detection modules. |
Jincheng Li; Chunyu Xie; Ji Ao; Dawei Leng; Yuhui Yin; | code |
| 312 | Noise-Modeled Diffusion Models for Low-Light Spike Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel approach for restoring low-light spike images using noise-modeled diffusion models. |
Ruonan Liu; Lin Zhu; Xijie Xiang; Lizhi Wang; Hua Huang; | code |
| 313 | VoxelKP: A Voxel-based Network Architecture for Human Keypoint Estimation in LiDAR Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present VoxelKP, a novel fully sparse network architecture tailored for human keypoint estimation in LiDAR data. |
Jian Shi; Peter Wonka; | code |
| 314 | EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos Via Embodiment-Centric Flow Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present Embodiment-Centric Flow (EC-Flow), a framework that directly learns manipulation from action-unlabeled videos by predicting embodiment-centric flow. |
Yixiang Chen; Peiyan Li; Yan Huang; Jiabing Yang; Kehan Chen; Liang Wang; | code |
| 315 | LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a robust Pose Estimation Module leveraging learned 3D priors; and (3) an efficient Octree Anchor Formation mechanism that converts dense point clouds into anchors based on spatial density. |
Chin-Yang Lin; Cheng Sun; Fu-En Yang; Min-Hung Chen; Yen-Yu Lin; Yu-Lun Liu; | code |
| 316 | Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The shared topology of human skeletons motivated the recent investigation of graph convolutional network (GCN) solutions for action recognition. However, most of the existing GCNs rely on the binary connection of two neighboring vertices (joints) formed by an edge (bone), overlooking the potential of constructing multi-vertex convolution structures. Although some studies have attempted to utilize hyper-graphs to represent the topology, they rely on a fixed construction strategy, which limits their adaptivity in uncovering the intricate latent relationships within the action. In this paper, we address this oversight and explore the merits of an adaptive hyper-graph convolutional network (Hyper-GCN) to achieve the aggregation of rich semantic information conveyed by skeleton vertices. In particular, our Hyper-GCN adaptively optimises the hyper-graphs during training, revealing the action-driven multi-vertex relations. |
Youwei Zhou; Tianyang Xu; Cong Wu; Xiaojun Wu; Josef Kittler; | code |
| 317 | GLEAM: Enhanced Transferable Adversarial Attacks for Vision-Language Pre-training Models Via Global-Local Transformations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing black-box attack methods are limited by insufficient data augmentation mechanisms or the disruption of global semantic structures, leading to poor adversarial transferability. To address these challenges, we propose the Global-Local Enhanced Adversarial Multimodal attack (GLEAM), a unified framework for generating transferable adversarial examples in vision-language tasks. |
Yunqi Liu; Xue Ouyang; Xiaohui Cui; | code |
| 318 | DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation Through Loopback Synergy Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks remains underexplored. To bridge this gap, we propose DeRIS, a novel framework that decomposes RIS into two key components: perception and cognition. |
Ming Dai; Wenxuan Cheng; Jiang-jiang Liu; Sen Yang; Wenxiao Cai; Yanpeng Sun; Wankou Yang; | code |
| 319 | Kh: Symmetry Understanding of 3D Shapes Via Chirality Disentanglement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Based on the recent Diff3F framework, we propose an unsupervised chirality feature extraction pipeline to decorate shape vertices with chirality-aware information, extracted from 2D foundation models. |
Weikang Wang; Tobias Weißberg; Nafie El Amrani; Florian Bernard; | code |
| 320 | Deciphering Cross-Modal Alignment in Large Vision-Language Models Via Modality Integration Rate Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A series of experiments is conducted to explore the effectiveness of MIR and MoCa, demonstrating that MIR is highly indicative of training data selection, training strategy scheduling, and model architecture design for achieving better pre-training results. |
Qidong Huang; Xiaoyi Dong; Pan Zhang; Yuhang Zang; Yuhang Cao; Jiaqi Wang; Weiming Zhang; Nenghai Yu; | code |
| 321 | VISION-XL: High Definition Video Inverse Problem Solver Using Latent Image Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel framework for solving high-definition video inverse problems using latent image diffusion models. |
Taesung Kwon; Jong Chul Ye; | code |
| 322 | GS-Occ3D: Scaling Vision-only Occupancy Reconstruction with Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods predominantly rely on LiDAR-based occupancy annotations, which limits scalability and prevents leveraging vast amounts of potential crowdsourced data for auto-labeling. To address this, we propose GS-Occ3D, a scalable vision-only framework that directly reconstructs occupancy. |
Baijun Ye; Minghui Qin; Saining Zhang; Moonjun Gong; Shaoting Zhu; Hao Zhao; Hang Zhao; | code |
| 323 | SpiLiFormer: Enhancing Spiking Transformers with Lateral Inhibition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the spiking attention modules of most existing Transformer-based SNNs are adapted from those of analog Transformers, failing to fully address the issue of over-allocating attention to irrelevant contexts. To fix this fundamental yet overlooked issue, we propose a Lateral Inhibition-inspired Spiking Transformer (SpiLiFormer). |
Zeqi Zheng; Yanchen Huang; Yingchao Yu; Zizheng Zhu; Junfeng Tang; Zhaofei Yu; Yaochu Jin; | code |
| 324 | OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods often rely on labor-intensive manual annotations or closed-source models with optimal performance, making large-scale data collection costly. To overcome these limitations, we propose a self-bootstrapped training pipeline that leverages the model’s own multimodal capabilities to recursively refine its understanding. |
Tianrun Xu; Guanyu Chen; Ye Li; Yuxin Xi; Zeyu Mu; Ruichen Wang; Tianren Zhang; Haichuan Gao; Feng Chen; | code |
| 325 | A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present Quality-guided Mixture of score-fusion Experts (QME), a novel framework designed for improving whole-body biometric recognition performance through a learnable score-fusion strategy using a Mixture of Experts (MoE). |
Jie Zhu; Yiyang Su; Minchul Kim; Anil Jain; Xiaoming Liu; | code |
| 326 | Revelio: Interpreting and Leveraging Semantic Information in Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We study how rich visual semantic information is represented within various layers and denoising timesteps of different diffusion architectures. |
Dahye Kim; Xavier Thomas; Deepti Ghadiyaram; | code |
| 327 | DICE: Staleness-Centric Optimizations for Parallel Diffusion MoE Inference Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, we identify that these techniques induce severe staleness: the use of outdated activations from previous timesteps, which significantly degrades quality, especially in expert-parallel scenarios. We tackle this fundamental tension and propose DICE, a staleness-centric optimization framework with a three-fold approach: (1) Interweaved Parallelism introduces staggered pipelines, effectively halving step-level staleness for free; (2) Selective Synchronization operates at the layer level and protects layers vulnerable to stale activations; and (3) Conditional Communication, a token-level, training-free method that dynamically adjusts communication frequency based on token importance. |
Jiajun Luo; Lizhuo Luo; Jianru Xu; Jiajun Song; Rongwei Lu; Chen Tang; Zhi Wang; | code |
| 328 | GausSim: Foreseeing Reality By Gaussian Simulator for Elastic Objects Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce GausSim, a novel neural network-based simulator designed to capture the dynamic behaviors of real-world elastic objects represented through Gaussian kernels. |
Yidi Shao; Mu Huang; Chen Change Loy; Bo Dai; | code |
| 329 | GSOT3D: Towards Generic 3D Single Object Tracking in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel benchmark, GSOT3D, that aims at facilitating development of generic 3D single object tracking (SOT) in the wild. |
Yifan Jiao; Yunhao Li; Junhua Ding; Qing Yang; Song Fu; Heng Fan; Libo Zhang; | code |
| 330 | DISTA-Net: Dynamic Closely-Spaced Infrared Small Target Unmixing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose the Dynamic Iterative Shrinkage Thresholding Network (DISTA-Net), which reconceptualizes traditional sparse reconstruction within a dynamic framework. |
Shengdong Han; Shangdong Yang; Yuxuan Li; Xin Zhang; Xiang Li; Jian Yang; Ming-Ming Cheng; Yimian Dai; | code |
| 331 | Ponimator: Unfolding Interactive Pose for Versatile Human-human Interaction Animation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Given such poses, humans can intuitively infer the context and anticipate possible past and future dynamics, drawing on strong priors of human behavior. Inspired by this observation, we propose Ponimator, a simple framework anchored on proximal interactive poses for versatile interaction animation. |
Shaowei Liu; Chuan Guo; Bing Zhou; Jian Wang; | code |
| 332 | ShortV: Efficient Multimodal Large Language Models By Freezing Visual Tokens in Ineffective Layers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer’s transformations on visual and text tokens, respectively. |
Qianhao Yuan; Qingyu Zhang; Yanjiang Liu; Jiawei Chen; Yaojie Lu; Hongyu Lin; Jia Zheng; Xianpei Han; Le Sun; | code |
| 333 | Active Perception Meets Rule-Guided RL: A Two-Phase Approach for Precise Object Navigation in Complex Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by human search behavior, our rule-guided RL policy enables efficient and adaptive exploration by combining structured heuristics with learning-based decision-making. In the last-mile navigation phase, we introduce an RL-based policy enhanced with active target perception, allowing the robot to refine its position dynamically based on real-time detection feedback. |
Liang Qin; Min Wang; Peiwei Li; Wengang Zhou; Houqiang Li; | code |
| 334 | SMSTracker: Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite existing studies integrating supplementary modal information into pre-trained RGB trackers through visual prompt mechanisms, this approach exhibits a critical limitation: it inherently prioritizes RGB information as the dominant modality, thereby underutilizing the complementary information of alternative modalities. To address this fundamental limitation, we present SMSTracker, an innovative tri-path score mask sigma fusion framework for multi-modal tracking, including three key modules. |
Sixian Chan; Zedong Li; Wenhao Li; Shijian Lu; Chunhua Shen; Xiaoqin Zhang; | code |
| 335 | Hipandas: Hyperspectral Image Joint Denoising and Super-Resolution By Image Fusion with The Panchromatic Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces Hyperspectral Image Joint Pandenoising and Pansharpening (Hipandas), a novel learning paradigm that reconstructs high quality HSIs from noisy low-resolution HSIs (NLRHS) and high-resolution PAN images. |
Shuang Xu; Zixiang Zhao; Haowen Bai; Chang Yu; Jiangjun Peng; Xiangyong Cao; Deyu Meng; | code |
| 336 | Tune-Your-Style: Intensity-tunable 3D Style Transfer with Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a creative intensity-tunable 3D style transfer paradigm, dubbed Tune-Your-Style, which allows users to flexibly adjust the style intensity injected into the scene to match their desired content-style balance, thus enhancing the customizability of 3D style transfer. |
Yian Zhao; Rushi Ye; Ruochong Zheng; Zesen Cheng; Chaoran Feng; Jiashu Yang; Pengchong Qiao; Chang Liu; Jie Chen; | code |
| 337 | MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a system, MultiverSeg, that enables practitioners to rapidly segment an entire new dataset without requiring access to any existing labeled data from that task or domain. |
Hallee E. Wong; Jose Javier Gonzalez Ortiz; John Guttag; Adrian V. Dalca; | code |
| 338 | Towards Stabilized and Efficient Diffusion Transformers Through Long-Skip-Connections with Spectral Constraints Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose Skip-DiT, a novel DiT variant enhanced with Long-Skip-Connections (LSCs) – the key efficiency component in U-Nets. |
Guanjie Chen; Xinyu Zhao; Yucheng Zhou; Xiaoye Qu; Tianlong Chen; Yu Cheng; | code |
| 339 | Towards Effective Foundation Model Adaptation for Extreme Cross-Domain Few-Shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing fine-tuning approaches often encounter challenges in extreme cross-domain few-shot learning scenarios, primarily due to the significant domain shift between pre-training data and target tasks, as well as the scarcity of annotated target samples. To mitigate this issue, we propose a novel absorption adaptation learning framework which meticulously regularizes the fine-tuning procedure of foundation model using an expert model with the same architecture but trained from scratch on the targeted data in two aspects. |
Fei Zhou; Peng Wang; Lei Zhang; Wei Wei; Chen Ding; Guosheng Lin; Yanning Zhang; | code |
| 340 | Self-Ensembling Gaussian Splatting for Few-Shot Novel View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, 3DGS tends to overfit when trained with sparse views, limiting its generalization to novel viewpoints. In this paper, we address this overfitting issue by introducing Self-Ensembling Gaussian Splatting (SE-GS). |
Chen Zhao; Xuan Wang; Tong Zhang; Saqib Javed; Mathieu Salzmann; | code |
| 341 | TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce TIP-I2V, the first large-scale dataset of over 1.70 million unique user-provided Text and Image Prompts specifically for Image-to-Video generation. |
Wenhao Wang; Yi Yang; | code |
| 342 | Allowing Oscillation Quantization: Overcoming Solution Space Limitation in Low Bit-Width Quantization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods often converge to sub-optimal solutions due to inadequate exploration of quantization solution space. To address this, we propose a novel QAT method, Allowing Oscillation Quantization (AOQ), which expands the reachable solution space through weight oscillation. |
Weiying Xie; Zihan Meng; Jitao Ma; Wenjin Guo; Haowei Li; Haonan Qin; Leyuan Fang; Yunsong Li; | code |
| 343 | AgroBench: Vision-Language Model Benchmark in Agriculture Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here, we introduce AgroBench (Agronomist AI Benchmark), a benchmark for evaluating VLMs across seven agricultural topics, covering key areas in agricultural engineering and relevant to real-world farming. |
Risa Shinoda; Nakamasa Inoue; Hirokatsu Kataoka; Masaki Onishi; Yoshitaka Ushiku; | code |
| 344 | Personalized Federated Learning Under Local Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, many existing pFL studies rely on directly using the global model for local training without fully assessing its impact on the performance of the local model, resulting in a potential conflict between personalization and generalization. To address this issue, we propose a parallel structure of a local supervisor and an inter-learning model for the local model and introduce a novel pFL method called federated learning by considering data similarity across clients assisted by a local supervisor (FedSimSup). |
Qiqi Liu; Jiaqiang Li; Yuchen Liu; Yaochu Jin; Lingjuan Lyu; Xiaohu Wu; Han Yu; | code |
| 345 | MMGeo: Multimodal Compositional Geo-Localization for UAVs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a more practical problem of localizing drone-view images by jointly leveraging multimodal data with a satellite-view reference map, which integrates multimodal information while avoiding the need for an extensive multimodal database. |
Yuxiang Ji; Boyong He; Zhuoyue Tan; Liaoni Wu; | code |
| 346 | Tiling Artifacts and Trade-offs of Feature Normalization in The Segmentation of Large Biological Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose indicators to detect normalization issues and further explore the trade-offs between artifact-free and high-quality predictions, using three diverse microscopy datasets as examples. |
Elena Buglakova; Anwai Archit; Edoardo D’Imprima; Julia Mahamid; Constantin Pape; Anna Kreshuk; | code |
| 347 | Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a diffusion-based portrait shadow removal approach that can robustly produce high-fidelity results. |
Wanchang Yu; Qing Zhang; Rongjia Zheng; Wei-Shi Zheng; | code |
| 348 | DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite their remarkable progress, they struggle to robustly produce high-quality results as the noise-to-intrinsic paradigm essentially utilizes noisy images with deteriorated structure and appearance for intrinsic prediction, while it is common knowledge that structure and appearance information in an image are crucial for inverse rendering. To address this issue, we present DNF-Intrinsic, a robust yet efficient inverse rendering approach fine-tuned from a pre-trained diffusion model, where we propose to take the source image rather than Gaussian noise as input to directly predict deterministic intrinsic properties via flow matching. |
Rongjia Zheng; Qing Zhang; Chengjiang Long; Wei-Shi Zheng; | code |
| 349 | SAMPLE: Semantic Alignment Through Temporal-Adaptive Multimodal Prompt Learning for Event-Based Open-Vocabulary Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose SAMPLE, a lightweight adaptation of VLMs for event-based action recognition, balancing supervised and open-vocabulary performance. |
Jing Wang; Rui Zhao; Ruiqin Xiong; Xingtao Wang; Xiaopeng Fan; Tiejun Huang; | code |
| 350 | TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Parameter-Efficient Fine-Tuning (PEFT) methods mitigate this issue by updating only a subset of parameters; however, most existing approaches are task-agnostic, failing to fully exploit task-specific adaptations, which leads to suboptimal efficiency and performance. To address this limitation, we propose Task-Relevant Parameter and Token Selection (TR-PTS), a task-driven framework that enhances both computational efficiency and accuracy. |
Siqi Luo; Haoran Yang; Yi Xin; Mingyang Yi; Guangyang Wu; Guangtao Zhai; Xiaohong Liu; | code |
| 351 | GM-MoE: Low-Light Enhancement with Gated-Mechanism Mixture-of-Experts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most existing methods lack generalization and are limited to specific tasks such as image recovery. To address these issues, we propose Gated-Mechanism Mixture-of-Experts (GM-MoE), the first framework to introduce a mixture-of-experts network for low-light image enhancement. |
Minwen Liao; Haobo Dong; Xinyi Wang; Kurban Ubul; Yihua Shao; Ziyang Yan; | code |
| 352 | Weakly Supervised Visible-Infrared Person Re-Identification Via Heterogeneous Expert Collaborative Consistency Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To mitigate the impact of missing cross-modal labels on model performance, we propose a heterogeneous expert collaborative consistency learning framework, designed to establish robust cross-modal identity correspondences in a weakly supervised manner. |
Yafei Zhang; Lingqi Kong; Huafeng Li; Jie Wen; | code |
| 353 | DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the prior knowledge learned from mixed domains may not align well with the downstream 3D point cloud analysis tasks, leading to degraded performance. To address such an issue, we propose the Domain-Adaptive Point Cloud Masked Autoencoder (DAP-MAE), an MAE pre-training method, to adaptively integrate the knowledge of cross-domain datasets for general point cloud analysis. |
Ziqi Gao; Qiufu Li; Linlin Shen; | code |
| 354 | LazyMAR: Accelerating Masked Autoregressive Models Via Feature Caching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Condition Redundancy indicates that the difference between conditional and unconditional output in classifier-free guidance exhibits very similar values in adjacent steps. Based on these two redundancies, we propose LazyMAR, which introduces two caching mechanisms to handle them one by one. |
Feihong Yan; Qingyan Wei; Jiayi Tang; Jiajun Li; Yulin Wang; Xuming Hu; Huiqi Li; Linfeng Zhang; | code |
| 355 | OCR Hinders RAG: Evaluating The Cascading Impact of OCR on Retrieval-Augmented Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. |
Junyuan Zhang; Qintong Zhang; Bin Wang; Linke Ouyang; Zichen Wen; Ying Li; Ka-Ho Chow; Conghui He; Wentao Zhang; | code |
| 356 | CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations. |
Dengke Zhang; Fagui Liu; Quan Tang; | code |
| 357 | ISP2HRNet: Learning to Reconstruct High Resolution Image from Irregularly Sampled Pixels Via Hierarchical Gradient Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To handle the challenges brought by irregular sampling, we propose an architecture to extract gradient structure hierarchically and learn continuous image representation. |
Yuanlin Wang; Ruiqin Xiong; Rui Zhao; Jin Wang; Xiaopeng Fan; Tiejun Huang; | code |
| 358 | Trade-offs in Image Generation: How Do Different Dimensions Interact? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, models’ complex trade-offs among these dimensions have been rarely explored due to (1) the lack of datasets that allow fine-grained quantification of these trade-offs, and (2) using a single metric for multiple dimensions. To address this gap, we introduce TRIG-Bench (Trade-offs in Image Generation), which spans 10 dimensions (Realism, Originality, Aesthetics, Content, Relation, Style, Knowledge, Ambiguity, Toxicity and Bias), contains over 40,200 samples, and covers 132 Pairwise Dimensional Subsets. |
Sicheng Zhang; Binzhu Xie; Zhonghao Yan; Yuli Zhang; Donghao Zhou; Xiaofei Chen; Shi Qiu; Jiaqi Liu; Guoyang Xie; Zhichao Lu; | code |
| 359 | GDKVM: Echocardiography Video Segmentation Via Spatiotemporal Key-Value Memory with Gated Delta Rule Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce GDKVM, a novel architecture for echocardiography video segmentation. |
Rui Wang; Yimu Sun; Jingxing Guo; Huisi Wu; Jing Qin; | code |
| 360 | FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, even important tokens may exhibit high redundancy caused by similarity among adjacent video frames and repetitive visual elements. To address this limitation, we propose FrameFusion, a novel token reduction approach integrating similarity-based merging with importance-based pruning. |
Tianyu Fu; Tengxuan Liu; Qinghao Han; Guohao Dai; Shengen Yan; Huazhong Yang; Xuefei Ning; Yu Wang; | code |
| 361 | CAT: A Unified Click-and-Track Framework for Realistic Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a unified Click-and-Track (CAT) framework for full-process tracking, eliminating the need for auxiliary models or complex initialization pipelines. |
Yongsheng Yuan; Jie Zhao; Dong Wang; Huchuan Lu; | code |
| 362 | Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose SynOOD, a novel approach that harnesses foundation models to generate synthetic, challenging OOD data for fine-tuning CLIP models, thereby enhancing boundary-level discrimination between InD and OOD samples. |
Jinglun Li; Kaixun Jiang; Zhaoyu Chen; Bo Lin; Yao Tang; Weifeng Ge; Wenqiang Zhang; | code |
| 363 | UniGS: Modeling Unitary 3D Gaussians for Novel View Synthesis from Sparse-view Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce UniGS, a novel 3D Gaussian reconstruction and novel view synthesis model that predicts a high-fidelity representation of 3D Gaussians from an arbitrary number of posed sparse-view images. Previous methods often regress 3D Gaussians locally on a per-pixel basis for each view and then transfer them to world space and merge them through point concatenation. In contrast, our approach involves modeling unitary 3D Gaussians in world space and updating them layer by layer. To leverage information from multi-view inputs for updating the unitary 3D Gaussians, we develop a DETR (DEtection TRansformer)-like framework, which treats 3D Gaussians as queries and updates their parameters by performing multi-view cross-attention (MVDFA) across multiple input images, which are treated as keys and values. This approach effectively avoids the ‘ghosting’ issue and allocates more 3D Gaussians to complex regions. Moreover, since the number of 3D Gaussians used as decoder queries is independent of the number of input views, our method allows an arbitrary number of multi-view images as input without causing memory explosion or requiring retraining. Extensive experiments validate the advantages of our approach, showcasing superior performance over existing methods quantitatively (improving PSNR by 4.2 dB when trained on Objaverse and tested on the GSO benchmark) and qualitatively. |
Jiamin Wu; Kenkun Liu; Xiaoke Jiang; Yuan Yao; Lei Zhang; | code |
| 364 | SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce Semantically contextualized VIsual Patches (SVIP) for ZSL, a transformer-based framework designed to enhance visual-semantic alignment. |
Zhi Chen; Zecheng Zhao; Jingcai Guo; Jingjing Li; Zi Huang; | code |
| 365 | Neurons: Emulating The Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the hierarchical organization of the visual system, we propose NEURONS, a novel framework that decouples learning into four correlated sub-tasks: key object segmentation, concept recognition, scene description, and blurry video reconstruction. |
Haonan Wang; Qixiang Zhang; Lehan Wang; Xuanqi Huang; Xiaomeng Li; | code |
| 366 | MotionLab: Unified Human Motion Generation and Editing Via The Motion-Condition-Motion Paradigm Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. |
Ziyan Guo; Zeyu Hu; De Wen Soh; Na Zhao; | code |
| 367 | HUMOTO: A 4D Dataset of Mocap Human Object Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Human Motions with Objects (HUMOTO), a high-fidelity dataset of human-object interactions for motion generation, computer vision, and robotics applications. |
Jiaxin Lu; Chun-Hao Paul Huang; Uttaran Bhattacharya; Qixing Huang; Yi Zhou; | code |
| 368 | LVAgent: Long Video Understanding By Multi-Round Dynamical Collaboration of MLLM Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. |
Boyu Chen; Zhengrong Yue; Siran Chen; Zikang Wang; Yang Liu; Peng Li; Yali Wang; | code |
| 369 | Sculpting Memory: Multi-Concept Forgetting in Diffusion Models Via Dynamic Mask and Concept-Aware Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While existing unlearning methods can remove certain concepts, they struggle with multi-concept forgetting due to instability, residual knowledge persistence, and generation quality degradation. To address these challenges, we propose Dynamic Mask coupled with Concept-Aware Loss, a novel unlearning framework designed for multi-concept forgetting in diffusion models. |
Gen Li; Yang Xiao; Jie Ji; Kaiyuan Deng; Bo Hui; Linke Guo; Xiaolong Ma; | code |
| 370 | Dual-Rate Dynamic Teacher for Source-Free Domain Adaptive Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing approaches predominantly leverage the Mean Teacher framework for self-training in the target domain. |
Qi He; Xiao Wu; Jun-Yan He; Shuai Li; | code |
| 371 | AerialVG: A Challenging Benchmark for Aerial Visual Grounding By Exploring Positional Relations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. |
Junli Liu; Qizhi Chen; Zhigang Wang; Yiwen Tang; Yiting Zhang; Chi Yan; Dong Wang; Xuelong Li; Bin Zhao; | code |
| 372 | EA-ViT: Efficient Adaptation for Elastic Vision Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, deploying ViTs to support diverse resource constraints typically requires retraining multiple, size-specific ViTs, which is both time-consuming and energy-intensive. To address this issue, we propose an efficient ViT adaptation framework that enables a single adaptation process to generate multiple models of varying sizes for deployment on platforms with various resource constraints. |
Chen Zhu; Wangbo Zhao; Huiwen Zhang; Yuhao Zhou; Weidong Tang; Shuo Wang; Zhihang Yuan; Yuzhang Shang; Xiaojiang Peng; Kai Wang; Dawei Yang; | code |
| 373 | Music-Aligned Holistic 3D Dance Generation Via Hierarchical Motion Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these challenges, we introduce SoulDance, a high-precision music-dance paired dataset captured via professional motion capture systems, featuring meticulously annotated holistic dance movements. Building on this dataset, we propose SoulNet, a framework designed to generate music-aligned, kinematically coordinated holistic dance sequences. |
Xiaojie Li; Ronghui Li; Shukai Fang; Shuzhao Xie; Xiaoyang Guo; Jiaqing Zhou; Junkun Peng; Zhi Wang; | code |
| 374 | Can Generative Geospatial Diffusion Models Excel As Discriminative Geospatial Foundation Models? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion-based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. |
Yuru Jia; Valerio Marsocci; Ziyang Gong; Xue Yang; Maarten Vergauwen; Andrea Nascetti; | code |
| 375 | Leveraging Local Patch Alignment to Seam-cutting for Large Parallax Image Stitching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an alignment-compensation paradigm that dissociates seam quality from initial alignment accuracy by integrating a Local Patch Alignment Module (LPAM) into the seam-cutting pipeline. |
Tianli Liao; Chenyang Zhao; Lei Li; Heling Cao; | code |
| 376 | CountSE: Soft Exemplar Open-set Object Counting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although exemplar-guided few-shot approaches achieve better results, they rely heavily on manually annotated visual exemplars, resulting in low efficiency and high labor intensity. Therefore, we propose CountSE, which simultaneously achieves high efficiency and high performance. |
Shuai Liu; Peng Zhang; Shiwei Zhang; Wei Ke; | code |
| 377 | Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present Corvid, an MLLM with enhanced chain-of-thought (CoT) reasoning capabilities. |
Jingjing Jiang; Chao Ma; Xurui Song; Hanwang Zhang; Jun Luo; | code |
| 378 | Lumina-Image 2.0: A Unified and Efficient Image Generative Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Lumina-Image 2.0, an advanced text-to-image (T2I) model that surpasses previous state-of-the-art methods across multiple benchmarks. |
Qi Qin; Le Zhuo; Yi Xin; Ruoyi Du; Zhen Li; Bin Fu; Yiting Lu; Xinyue Li; Dongyang Liu; Xiangyang Zhu; Will Beddow; Erwann Millon; Victor Perez; Wenhai Wang; Yu Qiao; Bo Zhang; Xiaohong Liu; Hongsheng Li; Chang Xu; Peng Gao; | code |
| 379 | Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the Ensemble Parallel Direction solver (dubbed as EPD-Solver), a novel ODE solver that mitigates truncation errors by incorporating multiple parallel gradient evaluations in each ODE step. |
Beier Zhu; Ruoyu Wang; Tong Zhao; Hanwang Zhang; Chi Zhang; | code |
| 380 | Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a heuristic-induced multimodal risk distribution jailbreak attack method, called HIMRD, which is black-box and consists of two elements: multimodal risk distribution strategy and heuristic-induced search strategy. |
Teng Ma; Xiaojun Jia; Ranjie Duan; Xinfeng Li; Yihao Huang; Xiaoshuang Jia; Zhixuan Chu; Wenqi Ren; | code |
| 381 | Blind Video Super-Resolution Based on Implicit Kernels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These methods do not consider potential spatio-temporal varying degradations in videos, resulting in suboptimal BVSR performance. In this context, we propose a novel BVSR model based on Implicit Kernels, BVSR-IK, which constructs a multi-scale kernel dictionary parameterized by implicit neural representations. |
Qiang Zhu; Yuxuan Jiang; Shuyuan Zhu; Fan Zhang; David Bull; Bing Zeng; | code |
| 382 | VisionMath: Vision-Form Mathematical Problem-Solving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose VisionMath, the first exploration of a vision-form mathematical problem-solving model, which employs a three-stage progressive multimodal reasoning alignment strategy to systematically enhance task-specific capabilities. |
Zongyang Ma; Yuxin Chen; Ziqi Zhang; Zhongang Qi; Chunfeng Yuan; Shaojie Zhu; Chengxiang Zhuo; Bing Li; Ye Liu; Zang Li; Ying Shan; Weiming Hu; | code |
| 383 | Training-free Generation of Temporally Consistent Rewards from VLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, providing accurate rewards for robotic manipulation without fine-tuning VLMs remains challenging due to the absence of domain-specific robotic knowledge in pre-trained datasets and high computational costs that hinder real-time applicability. To address this, we propose T2-VLM, a novel training-free, temporally consistent framework that generates accurate rewards through tracking the status changes in VLM-derived subgoals. |
Yinuo Zhao; Jiale Yuan; Zhiyuan Xu; Xiaoshuai Hao; Xinyi Zhang; Kun Wu; Zhengping Che; Chi Harold Liu; Jian Tang; | code |
| 384 | Asynchronous Event Error-Minimizing Noise for Safeguarding Event Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose the first unlearnable event stream generation method to prevent unauthorized training from event datasets. |
Ruofei Wang; Peiqi Duan; Boxin Shi; Renjie Wan; | code |
| 385 | FiVE-Bench: A Fine-grained Video Editing Benchmark for Evaluating Emerging Diffusion and Rectified Flow Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To further enhance object-level evaluation, we introduce FiVE-Acc, a novel metric leveraging Vision-Language Models (VLMs) to assess the success of fine-grained video editing. |
Minghan Li; Chenxi Xie; Yichen Wu; Lei Zhang; Mengyu Wang; | code |
| 386 | Tree-NeRV: Efficient Non-Uniform Sampling for Neural Video Representation Via Tree-Structured Feature Grids Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Implicit Neural Representations for Videos (NeRV) have emerged as a powerful paradigm for video representation, enabling direct mappings from frame indices to video frames. … |
Jiancheng Zhao; Yifan Zhan; Qingtian Zhu; Mingze Ma; Muyao Niu; Zunian Wan; Xiang Ji; Yinqiang Zheng; | code |
| 387 | SplArt: Articulation Estimation and Part-Level Reconstruction with 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce SplArt, a self-supervised, category-agnostic framework that uses 3D Gaussian Splatting (3DGS) to reconstruct and infer the kinematics of articulated objects from two sets of posed RGB images captured at different articulation states, enabling real-time photorealistic rendering for novel viewpoints and articulations. |
Shengjie Lin; Jiading Fang; Muhammad Zubair Irshad; Vitor Campagnolo Guizilini; Rares Andrei Ambrus; Greg Shakhnarovich; Matthew R. Walter; | code |
| 388 | Fine-structure Preserved Real-world Image Super-resolution Via Transfer VAE Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: One solution is to employ a VAE with a lower downsampling rate for diffusion; however, adapting its latent features with the pre-trained UNet while mitigating the increased computational cost poses new challenges. To address these issues, we propose a Transfer VAE Training (TVT) strategy to transfer the 8x-downsampled VAE into a 4x one while adapting to the pre-trained UNet. |
Qiaosi Yi; Shuai Li; Rongyuan Wu; Lingchen Sun; Yuhui Wu; Lei Zhang; | code |
| 389 | Neural Compression for 3D Geometry Sets Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present NeCGS, the first neural compression paradigm, which can compress a geometry set encompassing thousands of detailed and diverse 3D mesh models by up to 900 times with high accuracy and preservation of detailed geometric structures. |
Siyu Ren; Junhui Hou; Weiyao Lin; Wenping Wang; | code |
| 390 | WildSeg3D: Segment Any 3D Objects in The Wild from 2D Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce WildSeg3D, an efficient approach that enables the segmentation of arbitrary 3D objects across diverse environments using a feed-forward mechanism. |
Yansong Guo; Jie Hu; Yansong Qu; Liujuan Cao; | code |
| 391 | UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce UniOcc, a comprehensive, unified benchmark and toolkit for occupancy forecasting (i.e., predicting future occupancies based on historical information) and occupancy prediction (i.e., predicting current-frame occupancy from camera images). |
Yuping Wang; Xiangyu Huang; Xiaokang Sun; Mingxuan Yan; Shuo Xing; Zhengzhong Tu; Jiachen Li; | code |
| 392 | Similarity Memory Prior Is All You Need for Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we design a Similarity Memory Prior Network (Sim-MPNet) for medical image segmentation. |
Hao Tang; Zhiqing Guo; Liejun Wang; Chao Liu; | code |
| 393 | Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose DyTo, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. |
Yiming Zhang; Zhuokai Zhao; Zhaorun Chen; Zenghui Ding; Xianjun Yang; Yining Sun; | code |
| 394 | Tracking Tiny Drones Against Clutter: Large-Scale Infrared Benchmark with Motion-Centric Adaptive Algorithm Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike traditional trackers that primarily rely on appearance matching, we introduce a novel method called Motion-Centric Adaptive Tracking (MCATrack), which initially employs a magnocell-inspired motion response to enhance the local signal-to-noise ratio of tiny target regions while suppressing complex clutter. |
Jiahao Zhang; Zongli Jiang; Jinli Zhang; Yixin Wei; Liang Li; Yizheng Wang; Gang Wang; | code |
| 395 | LVFace: Progressive Cluster Optimization for Large Vision Models in Face Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Vision Transformers (ViTs) have revolutionized large-scale visual modeling, yet remain underexplored in face recognition (FR) where CNNs still dominate. We identify a critical bottleneck: CNN-inspired training paradigms fail to unlock ViT’s potential, leading to suboptimal performance and convergence instability. To address this challenge, we propose LVFace, a ViT-based FR model that integrates Progressive Cluster Optimization (PCO) to achieve superior results. |
Jinghan You; Shanglin Li; Yuanrui Sun; Jiangchuan Wei; Mingyu Guo; Chao Feng; Jiao Ran; | code |
| 396 | Anomaly Detection of Integrated Circuits Package Substrates Using The Large Vision Model SAIC: Dataset Construction, Methodology, and Application Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the progress of IC anomaly detection is hampered by the scarcity of defective samples and the shortage of well-defined annotations. To address this challenge, this paper focuses on the research in the field of IC, especially on ceramic package substrates (CPS). |
Ruiyun Yu; Bingyang Guo; Haoyuan Li; | code |
| 397 | FreeDance: Towards Harmonic Free-Number Group Dance Generation Via A Unified Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods model only a fixed number of dancers, lacking the flexibility to handle arbitrary numbers of individuals. To provide this flexibility, we propose a novel unified framework, FreeDance. |
Yiwen Zhao; Yang Wang; Liting Wen; Hengyuan Zhang; Xingqun Qi; | code |
| 398 | Unleashing The Temporal Potential of Stereo Event Cameras for Continuous-Time 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel stereo 3D object detection framework that relies solely on event cameras, eliminating the need for conventional 3D sensors. |
Jae-Young Kang; Hoonhee Cho; Kuk-Jin Yoon; | code |
| 399 | DC-TTA: Divide-and-Conquer Framework for Test-Time Adaptation of Interactive Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While the Segment Anything Model (SAM) has garnered attention in the IS community for its promptable segmentation capabilities, it often struggles in specialized domains or when handling complex scenarios (e.g., camouflaged or multi-part objects). To overcome these challenges, we propose DC-TTA, a novel test-time adaptation (TTA) framework that adapts SAM on a per-sample basis by leveraging user interactions as supervision. |
Jihun Kim; Hoyong Kwon; Hyeokjun Kweon; Wooseong Jeong; Kuk-Jin Yoon; | code |
| 400 | UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. |
Jie Feng; Shengyuan Wang; Tianhui Liu; Yanxin Xi; Yong Li; | code |
| 401 | AU-Blendshape for Fine-grained Stylized 3D Facial Expression Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce the AUBlendSet, a 3D facial dataset based on AU-Blendshape representation for fine-grained facial expression manipulation across identities. |
Hao Li; Ju Dai; Feng Zhou; Kaida Ning; Lei Li; Junjun Pan; | code |
| 402 | Robust Multi-View Learning Via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, real-world multi-view datasets are often heterogeneous and imperfect, which usually causes MVL methods designed for specific combinations of views to lack application potential and limits their effectiveness. To address this issue, we propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. |
Jie Xu; Na Zhao; Gang Niu; Masashi Sugiyama; Xiaofeng Zhu; | code |
| 403 | CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To advance the field, we introduce CombatVLA, an efficient VLA model optimized for combat tasks in 3D action role-playing games (ARPGs). |
Peng Chen; Pi Bu; Yingyao Wang; Xinyi Wang; Ziming Wang; Jie Guo; Yingxiu Zhao; Qi Zhu; Jun Song; Siran Yang; Jiamang Wang; Bo Zheng; | code |
| 404 | PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite recent progress, existing methods often struggle to handle severe object appearance changes and cluttered backgrounds in the video due to a lack of sufficient target cues, leading to degraded performance. To address this, we introduce PRVQL, a novel Progressive knowledge-guided Refinement framework for EgoVQL. |
Bing Fan; Yunhe Feng; Yapeng Tian; James Chenhao Liang; Yuewei Lin; Yan Huang; Heng Fan; | code |
| 405 | AIRA: Activation-Informed Low-Rank Adaptation for Large Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This omission leads to suboptimal adaptation and slower convergence. To address this limitation, we present Activation-Informed Low-Rank Adaptation (AIRA), a novel approach that integrates activation information into initialization, training, and rank assignment to enhance model performance. |
Lujun Li; Dezhi Li; Cheng Lin; Wei Li; Wei Xue; Sirui Han; Yike Guo; | code |
| 406 | Efficient Fine-Tuning of Large Models Via Nested Low-Rank Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Then, we establish design guidelines that emphasize the use of serial structures, optimal placements, and nested LoRA. Based on these insights, we present NoRA, a nested parameter-efficient LoRA structure that revolutionizes the initialization and fine-tuning of projection matrices. |
Lujun Li; Cheng Lin; Dezhi Li; You-Liang Huang; Wei Li; Tianyu Wu; Jie Zou; Wei Xue; Sirui Han; Yike Guo; | code |
| 407 | TurboReg: TurboClique for Robust and Efficient Point Cloud Registration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods using maximal clique search in compatibility graphs achieve high recall but suffer from exponential time complexity, limiting their use in time-sensitive applications. To address this challenge, we propose a fast and robust estimator, TurboReg, built upon a novel lightweight clique, TurboClique, and a highly parallelizable Pivot-Guided Search (PGS) algorithm. |
Shaocheng Yan; Pengcheng Shi; Zhenjun Zhao; Kaixin Wang; Kuang Cao; Ji Wu; Jiayuan Li; | code |
| 408 | AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing approaches either activate LLMs too frequently, causing excessive computational overhead, or use fixed schedules, failing to adapt to dynamic driving conditions. To address these challenges, we propose AdaDrive, an adaptively collaborative slow-fast framework that optimally determines when and how LLMs contribute to decision-making. |
Ruifei Zhang; Junlin Xie; Wei Zhang; Weikai Chen; Xiao Tan; Xiang Wan; Guanbin Li; | code |
| 409 | VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The substantial parameter count of LLMs poses considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. |
Ruifei Zhang; Wei Zhang; Xiao Tan; Sibei Yang; Xiang Wan; Xiaonan Luo; Guanbin Li; | code |
| 410 | DASH: 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although 4D hash encoding provides an explicit representation without low-rank constraints, directly applying it to the entire dynamic scene leads to substantial hash collisions and redundancy. To address these challenges, we present DASH, a real-time dynamic scene rendering framework that employs 4D hash encoding coupled with self-supervised decomposition. |
Jie Chen; Zhangchi Hu; Peixi Wu; Huyue Zhu; Hebei Li; Xiaoyan Sun; | code |
| 411 | BlueNeg: A 35mm Negative Film Dataset for Restoring Channel-Heterogeneous Deterioration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While digitally acquired photographs have dominated since around 2000, there remains a huge number of legacy photographs that were acquired with optical cameras and are stored in the form of film negatives. In this paper, we address the unique challenge of channel-heterogeneous deterioration in film negatives and introduce BlueNeg, the first high-quality 35mm negative film dataset specifically designed for restoration in this context. |
Hanyuan Liu; Chengze Li; Minshan Xie; Zhenni Wang; Jiawen Liang; Chi-Sing Leung; Tien-Tsin Wong; | code |
| 412 | Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present ViTaM-D, a novel visual-tactile framework for reconstructing dynamic hand-object interaction with distributed tactile sensing to enhance contact modeling. |
Zhenjun Yu; Wenqiang Xu; Pengfei Xie; Yutong Li; Brian W. Anthony; Zhuorui Zhang; Cewu Lu; | code |
| 413 | LA-MOTR: End-to-End Multi-Object Tracking By Learnable Association Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes LA-MOTR, a novel Tracking-by-Learnable-Association framework that resolves the competing optimization objectives between detection and association in end-to-end Tracking-by-Attention (TbA) Multi-Object Tracking. |
Peng Wang; Yongcai Wang; Hualong Cao; Wang Chen; Deying Li; | code |
| 414 | Any-SSR: How Recursive Least Squares Works in Continual Learning of Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing techniques either leverage previous data for replay, leading to extra computational costs, or utilize a single parameter-efficient module to learn the downstream task, constraining new knowledge absorption due to interference between different tasks. To address these issues, this paper proposes Analytic Subspace Routing (Any-SSR). |
Kai Tong; Kang Pan; Xiao Zhang; Erli Meng; Run He; Yawen Cui; Nuoyan Guo; Huiping Zhuang; | code |
| 415 | SuperEvent: Cross-Modal Learning of Event-based Keypoint Detection for SLAM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. |
Yannick Burkhardt; Simon Schaefer; Stefan Leutenegger; | code |
| 416 | INTER: Mitigating Hallucination in Large Vision-Language Models By Interaction Guidance Sampling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans’ ability to effectively leverage multimodal interaction information in data samples. |
Xin Dong; Shichao Dong; Jin Wang; Jing Huang; Li Zhou; Zenghui Sun; Lihua Jing; Jinsong Lan; Xiaoyong Zhu; Bo Zheng; | code |
| 417 | GenFlow3D: Generative Scene Flow Estimation and Prediction on Point Cloud Sequences Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we study the joint sequential scene flow estimation and future scene flow prediction on point cloud sequences. |
Hanlin Li; Wenming Weng; Yueyi Zhang; Zhiwei Xiong; | code |
| 418 | Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing MMKGs that enhances LLMs reasoning through cross-modal information supplementation. |
Junming Liu; Siyuan Meng; Yanting Gao; Song Mao; Pinlong Cai; Guohang Yan; Yirong Chen; Zilin Bian; Ding Wang; Botian Shi; | code |
| 419 | GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a method that leverages 2D diffusion models’ implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. |
Ye Tao; Jiawei Zhang; Yahao Shi; Dongqing Zou; Bin Zhou; | code |
| 420 | Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods tend to generate inaccurate and unnatural reconstructions in complex and/or heavily degraded scenes, primarily due to their limited perception and understanding capability of the input low-quality image. To address these limitations, we propose, for the first time to our knowledge, to adapt a pre-trained autoregressive multimodal model, such as Lumina-mGPT, into a robust Real-ISR model, namely PURE, which Perceives and Understands the input low-quality image, then REstores its high-quality counterpart. |
Hongyang Wei; Shuaizheng Liu; Chun Yuan; Lei Zhang; | code |
| 421 | CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These inconsistencies, when propagated via projection supervision, deteriorate the quality of 3D Gaussian semantic fields and introduce artifacts in the rendered outputs. To mitigate this limitation, we propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues. |
Lei Tian; Xiaomin Li; Liqian Ma; Hao Yin; Zirui Zheng; Hefei Huang; Taiqing Li; Huchuan Lu; Xu Jia; | code |
| 422 | MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a memory-efficient framework for 4DGS. |
Xinjie Zhang; Zhening Liu; Yifan Zhang; Xingtong Ge; Dailan He; Tongda Xu; Yan Wang; Zehong Lin; Shuicheng Yan; Jun Zhang; | code |
| 423 | LLM-assisted Entropy-based Adaptive Distillation for Unsupervised Fine-grained Visual Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To alleviate it, this paper proposes LLM-assisted Entropy-based Adaptive Distillation (LEAD), a novel unsupervised FVRL framework that selectively distills fine-grained knowledge from a powerful teacher model built upon pre-trained models. |
Jianfeng Dong; Danfeng Luo; Daizong Liu; Jie Sun; Xiaoye Qu; Xun Yang; Dongsheng Liu; Xun Wang; | code |
| 424 | Pseudo-SD: Pseudo Controlled Stable Diffusion for Semi-Supervised and Cross-Domain Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study introduces Pseudo-SD, a novel framework that redefines the utilization of pseudo-label knowledge through Stable Diffusion (SD). |
Dong Zhao; Qi Zang; Shuang Wang; Nicu Sebe; Zhun Zhong; | code |
| 425 | DLF: Extreme Image Compression with Dual-generative Latent Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Consequently, this results in suboptimal reconstruction fidelity, especially at low bitrates. To address this issue, we introduce a Dual-generative Latent Fusion (DLF) paradigm. |
Naifu Xue; Zhaoyang Jia; Jiahao Li; Bin Li; Yuan Zhang; Yan Lu; | code |
| 426 | MGSR: 2D/3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction Under Various Light Conditions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This raises a central question: must rendering and reconstruction always involve a trade-off? To address this, we propose MGSR, a 2D/3D Mutual-boosted Gaussian Splatting for Surface Reconstruction that enhances both rendering quality and 3D reconstruction accuracy. |
Qingyuan Zhou; Yuehu Gong; Weidong Yang; Jiaze Li; Yeqi Luo; Baixin Xu; Shuhao Li; Ben Fei; Ying He; | code |
| 427 | AFUNet: Cross-Iterative Alignment-Fusion Synergy for HDR Reconstruction Via Deep Unfolding Paradigm Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing learning-based methods effectively reconstruct HDR images from multi-exposure LDR inputs with extended dynamic range and improved detail, but often rely on empirical design rather than a theoretical foundation, which can impact their reliability. To address these limitations, we propose the cross-iterative Alignment and Fusion deep Unfolding Network (AFUNet), where HDR reconstruction is systematically decoupled into two interleaved subtasks–alignment and fusion–optimized through alternating refinement, achieving synergy between the two subtasks to enhance the overall performance. |
Xinyue Li; Zhangkai Ni; Wenhan Yang; | code |
| 428 | Bridging The Gap Between Brain and Machine in Interpreting Visual Semantics: Towards Self-adaptive Brain-to-Text Decoding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Past studies have generally assumed that stimulus images and their evoked brain recordings are strictly semantically equivalent, potentially leading to semantic misalignment between supervision signals and neural recordings. In order to address this, we propose a novel self-adaptive semantic decoding method (Mind-SA), designed to dynamically detect the regions within stimulus images that the brain actually focuses on and use them as supervision to guide brain-to-text reconstruction. |
Jiaxuan Chen; Yu Qi; Yueming Wang; Gang Pan; | code |
| 429 | CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves spatial representation of CLIP. |
Lin Sun; Jiale Cao; Jin Xie; Xiaoheng Jiang; Yanwei Pang; | code |
| 430 | Bridging The Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization – RRDataset encompasses high-quality images from seven major scenarios (War & Conflict, Disasters & Accidents, Political & Social Events, Medical & Public Health, Culture & Religion, Labor & Production, and everyday life), addressing existing dataset gaps from a content perspective. |
Chunxiao Li; Xiaoxiao Wang; Meiling Li; Boming Miao; Peng Sun; Yunjian Zhang; Xiangyang Ji; Yao Zhu; | code |
| 431 | Dual-Temporal Exemplar Representation Network for Video Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel framework, the Dual-Temporal Exemplar Representation Network (DTERN), which utilizes the strong representational capability of cluster centers, i.e., exemplars, to effectively model both local and global temporal information. |
Xiaolong Xu; Lei Zhang; Jiayi Li; Lituan Wang; Yifan Guan; Yu Yan; Leyi Zhang; Hao Song; | code |
| 432 | Scale Your Instructions: Enhance The Instruction-Following Fidelity of Unified Image Generation Model By Self-Adaptive Attention Scaling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response, we propose Self-Adaptive Attention Scaling (SaaS), a method that leverages the consistency of cross-attention between adjacent timesteps to dynamically scale the attention activation for each sub-instruction. |
Chao Zhou; Tianyi Wei; Nenghai Yu; | code |
| 433 | FOLDER: Accelerating Multi-Modal Large Language Models with Enhanced Performance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, processing long sequences of visual tokens extracted from visual backbones poses challenges for deployment in real-time applications. To address this issue, we introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence, mitigating computational and memory demands during both training and inference. |
Haicheng Wang; Zhemeng Yu; Gabriele Spadaro; Chen Ju; Victor Quétu; Shuai Xiao; Enzo Tartaglione; | code |
| 434 | PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these approaches treat all regions of the image equally, overlooking the fact that regions with varying levels of reconstruction difficulty require different sampling steps. To address this limitation, we propose PatchScaler, an efficient patch-independent diffusion pipeline for single image super-resolution. |
Yong Liu; Hang Dong; Jinshan Pan; Qingji Dong; Kai Chen; Rongxiang Zhang; Lean Fu; Fei Wang; | code |
| 435 | SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose SCFlow, a flow-matching framework that learns bidirectional mappings between entangled and disentangled representations. |
Pingchuan Ma; Xiaopei Yang; Yusong Li; Ming Gui; Felix Krause; Johannes Schusterbauer; Björn Ommer; | code |
| 436 | Stochastic Interpolants for Revealing Stylistic Flows Across The History of Art Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most methods operate at the level of individual images, limiting their ability to reveal broader stylistic trends and temporal transitions. We address this by introducing a framework that models stylistic evolution as an optimal transport problem in a learned style space, using stochastic interpolants and dual diffusion implicit bridges to align artistic distributions across time without requiring paired data. |
Pingchuan Ma; Ming Gui; Johannes Schusterbauer; Xiaopei Yang; Olga Grebenkova; Vincent Tao Hu; Björn Ommer; | code |
| 437 | Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. |
Jeongseok Hyun; Sukjun Hwang; Su Ho Han; Taeoh Kim; Inwoong Lee; Dongyoon Wee; Joon-Young Lee; Seon Joo Kim; Minho Shim; | code |
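Editor's note: as a rough picture of what training-free token merging does, the sketch below averages away tokens that barely change between adjacent frames. The cosine threshold and pairing rule are illustrative assumptions, not STTM's actual multi-granular scheme.

```python
import torch
import torch.nn.functional as F

# Toy temporal token merging: tokens that barely change between frames are
# collapsed to their mean. STTM's real multi-granular scheme is more involved.
def merge_adjacent_frames(prev_tokens, curr_tokens, thresh=0.9):
    """prev_tokens, curr_tokens: (N, D) tokens at the same spatial positions."""
    sim = F.cosine_similarity(prev_tokens, curr_tokens, dim=-1)    # (N,)
    changed = sim < thresh                          # keep tokens with new content
    static = 0.5 * (prev_tokens[~changed] + curr_tokens[~changed])
    return torch.cat([curr_tokens[changed], static], dim=0)        # fewer tokens out
```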
| 438 | ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Vision-based machine learning (ML) methods offer a scalable and non-invasive solution to gait analysis, but face challenges in correctly detecting and analyzing prostheses due to their unique appearances and new movement patterns. In this paper, we aim to bridge this gap by introducing a multi-purpose dataset, namely ProGait, to support multiple vision tasks including Video Object Segmentation, 2D Human Pose Estimation, and Gait Analysis (GA). |
Xiangyu Yin; Boyuan Yang; Weichen Liu; Qiyao Xue; Abrar Alamri; Goeran Fiedler; Wei Gao; | code |
| 439 | Learnable Fractional Reaction-Diffusion Dynamics for Under-Display ToF Imaging and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, transparent OLED (TOLED) layers introduce severe degradations–such as signal attenuation, multi-path interference (MPI), and temporal noise–that significantly compromise depth quality. To alleviate this drawback, we propose Learnable Fractional Reaction-Diffusion Dynamics (LFRD^2), a hybrid framework that combines the expressive power of neural networks with the interpretability of physical modeling. |
Xin Qiao; Matteo Poggi; Xing Wei; Pengchao Deng; Yanhui Zhou; Stefano Mattoccia; | code |
| 440 | Latte: Collaborative Test-Time Adaptation of Vision-Language Models in Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In decentralized settings such as federated learning, applying these methods individually to each client suffers from limited test data, while directly sharing a single global memory via the server prevents proper personalization to each client’s unique distribution. To address this, we propose Latte, a novel framework where each client maintains a local memory to store embeddings from its own historical test data and an external memory to store class prototypes from other relevant clients. |
Wenxuan Bao; Ruxi Deng; Ruizhong Qiu; Tianxin Wei; Hanghang Tong; Jingrui He; | code |
| 441 | TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We tackle the problem of localizing traffic cameras within a 3D reference map and propose a novel image-to-point cloud registration (I2P) method, TrafficLoc, in a coarse-to-fine matching fashion. |
Yan Xia; Yunxiang Lu; Rui Song; Oussema Dhaouadi; João F. Henriques; Daniel Cremers; | code |
| 442 | Learning Robust Image Watermarking with Lossless Cover Recovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most existing watermarking methods embed watermarks by adding irremovable perturbations to the cover image, causing permanent distortion. To address this issue, we propose a novel watermarking approach termed Cover-Recoverable Watermark (CRMark). |
Jiale Chen; Wei Wang; Chongyang Shi; Li Dong; Xiping Hu; | code |
| 443 | Occupancy Learning with Spatiotemporal Memory Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, it remains challenging to efficiently aggregate 3D occupancy over time across multiple input frames due to the high processing cost and the uncertainty and dynamics of voxels. To address this issue, we propose ST-Occ, a scene-level occupancy representation learning framework that effectively learns the spatiotemporal feature with temporal consistency. |
Ziyang Leng; Jiawei Yang; Wenlong Yi; Bolei Zhou; | code |
| 444 | Recognizing Actions from Robotic View for Natural Human-Robot Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing benchmarks designed for traditional action recognition fail to address the unique complexities in N-HRI due to limited data, modalities, task categories, and diversity of subjects and environments. To address these challenges, we introduce ACTIVE (Action from Robotic View), a large-scale dataset tailored specifically for perception-centric robotic views prevalent in mobile service robots. |
Ziyi Wang; Peiming Li; Hong Liu; Zhichao Deng; Can Wang; Jun Liu; Junsong Yuan; Mengyuan Liu; | code |
| 445 | Vid-Group: Temporal Video Grounding Pretraining from Unlabeled Videos in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. |
Peijun Bao; Chenqi Kong; Siyuan Yang; Zihao Shao; Xinghao Jiang; Boon Poh Ng; Meng Hwa Er; Alex Kot; | code |
| 446 | DiGA3D: Coarse-to-Fine Diffusional Propagation of Geometry and Appearance for Versatile 3D Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, there are still challenges in performing multiple 3D inpainting tasks within a unified framework: 1) Single reference inpainting methods lack robustness when dealing with views that are far from the reference view; 2) Appearance inconsistency arises when independently inpainting multi-view images with 2D diffusion priors; 3) Geometry inconsistency limits performance when there are significant geometric changes in the inpainting regions. To tackle these challenges, we introduce DiGA3D, a novel and versatile 3D inpainting pipeline that leverages diffusion models to propagate consistent appearance and geometry in a coarse-to-fine manner. |
Jingyi Pan; Dan Xu; Qiong Luo; | code |
| 447 | KinMo: Kinematic-aware Human Motion Understanding and Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A single coarse description, such as "run", fails to capture essential details like variations in speed, limb positioning, and kinematic dynamics, leading to significant ambiguities between text and motion modalities. To address this challenge, we introduce KinMo, a unified framework built on a hierarchical describable motion representation that extends beyond global action by incorporating kinematic group movements and their interactions. We design an automated annotation pipeline to generate high-quality, fine-grained descriptions for this decomposition, resulting in the KinMo dataset. |
Pengfei Zhang; Pinxin Liu; Pablo Garrido; Hyeongwoo Kim; Bindita Chaudhuri; | code |
| 448 | Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a visual-oriented, fine-grained multimodal knowledge editing task that targets precise modifications in images containing multiple interacting entities. |
Zhen Zeng; Leijiang Gu; Xun Yang; Zhangling Duan; Zenglin Shi; Meng Wang; | code |
| 449 | FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these models are usually built upon text-to-image diffusion models, so they either rely on slow optimization-based inference, or necessitate (i) massive training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate this task as an image-to-video generation problem, thereby inheriting powerful video diffusion priors to reduce training costs and ensure temporal consistency. |
Yabo Zhang; Xinpeng Zhou; Yihan Zeng; Hang Xu; Hui Li; Wangmeng Zuo; | code |
| 450 | Robust Test-Time Adaptation for Single Image Denoising Using Deep Gaussian Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we reveal that deep Gaussian denoisers have an underlying ability to handle other noises with only ten iterations of self-supervised learning, which is referred to as deep denoiser prior. |
Qing Ma; Pengwei Liang; Xiong Zhou; Jiayi Ma; Junjun Jiang; Zhe Peng; | code |
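Editor's note: the "ten iterations of self-supervised learning" can be pictured as a short masked-prediction fine-tune on the single test image. The blind-spot-style loss below is a generic stand-in chosen for illustration, not necessarily the paper's objective.

```python
import torch

# Sketch of brief test-time adaptation of a pretrained Gaussian denoiser on one
# noisy image; the masked self-supervised loss is an assumed stand-in.
def adapt_and_denoise(denoiser, noisy, iters=10, lr=1e-4):
    opt = torch.optim.Adam(denoiser.parameters(), lr=lr)
    for _ in range(iters):
        mask = (torch.rand_like(noisy) > 0.9).float()   # hide ~10% of pixels
        pred = denoiser(noisy * (1.0 - mask))           # predict from context
        loss = ((pred - noisy) ** 2 * mask).mean()      # score only hidden pixels
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return denoiser(noisy)
```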
| 451 | When Pixel Difference Patterns Meet ViT: PiDiViT for Few-Shot Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Few-shot object detection aims to detect novel classes with limited samples. Recent methods have leveraged the rich semantic representations of pretrained vision transformer (ViT) … |
Hongliang Zhou; Yongxiang Liu; Canyu Mo; Weijie Li; Bowen Peng; Li Liu; | code |
| 452 | Open-Unfairness Adversarial Mitigation for Generalized Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose the Adversarial Open-Unfairness Discovery and Mitigation Network (AdvOU), a novel framework designed to mitigate unpredictable unfairness in deepfake detection. |
Zhaoyang Li; Zhu Teng; Baopeng Zhang; Jianping Fan; | code |
| 453 | SMP-Attack: Boosting The Transferability of Feature Importance-based Adversarial Attack with Semantics-aware Multi-granularity Patchout Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods primarily focus on single-granularity patch and single-stage training, leading to suboptimal solutions. To address these limitations, we propose a general multi-stage optimization framework based on Semantics-aware Multi-granularity Patchout, dubbed as SMP-Attack. |
Wen Yang; Guodong Liu; Di Ming; | code |
| 454 | DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this paradigm completely ignores the intrinsic mutual constraints and synergistic interdependencies between identity and motion, resulting in identity-motion conflicts throughout the generation process that systematically degrade generation quality. To address this, we introduce DualReal, a novel framework that employs adaptive joint training to construct interdependencies between dimensions collaboratively. |
Wenchuan Wang; Mengqi Huang; Yijing Tu; Zhendong Mao; | code |
| 455 | Advancing Textual Prompt Learning with Anchored Attributes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose utilizing universal attributes as a bridge to enhance the alignment between images and unknown categories. |
Zheng Li; Yibing Song; Ming-Ming Cheng; Xiang Li; Jian Yang; | code |
| 456 | SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce SC-Lane, a novel slope-aware and temporally consistent heightmap estimation framework for 3D lane detection. |
Chaesong Park; Eunbin Seo; Jihyeon Hwang; Jongwoo Lim; | code |
| 457 | Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel distillation method tailored for 3D LiDAR scene completion models, dubbed ScoreLiDAR, which achieves efficient yet high-quality scene completion. |
Shengyuan Zhang; An Zhao; Ling Yang; Zejian Li; Chenye Meng; Haoran Xu; Tianrun Chen; AnYang Wei; Perry Pengyun Gu; Lingyun Sun; | code |
| 458 | MMAD: Multi-label Micro-Action Detection in Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To facilitate the MMAD task, we introduce a new dataset named Multi-label Micro-Action-52 (MMA-52) and propose a baseline method equipped with a dual-path spatial-temporal adapter to address the challenges of subtle visual change in MMAD. |
Kun Li; Pengyu Liu; Dan Guo; Fei Wang; Zhiliang Wu; Hehe Fan; Meng Wang; | code |
| 459 | DistillDrive: End-to-End Multi-Mode Autonomous Driving Distillation By Isomorphic Hetero-Source Planning Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce DistillDrive, an end-to-end knowledge distillation-based autonomous driving model that leverages diversified instance imitation to enhance multi-mode motion feature learning. |
Rui Yu; Xianghang Zhang; Runkai Zhao; Huaicheng Yan; Meng Wang; | code |
| 460 | SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In the second step, we develop a spatial-aware relation graph transformer that captures both local and long-range contextual information, facilitating the generation of high-quality relation queries. |
Xin Hu; Ke Qin; Guiduo Duan; Ming Li; Yuan-Fang Li; Tao He; | code |
| 461 | SUB: Benchmarking CBM Generalization Via Synthetic Attribute Substitutions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a novel Tied Diffusion Guidance (TDG) method to precisely control generated images, where noise sharing for two parallel denoising processes ensures that both the correct bird class and the correct attribute are generated. |
Jessica Bader; Leander Girrbach; Stephan Alaniz; Zeynep Akata; | code |
| 462 | RA-BUSSeg: Relation-aware Semi-supervised Breast Ultrasound Image Segmentation Via Adjacent Propagation and Cross-layer Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel relation-aware semi-supervised model for BUS image segmentation, which is composed of two innovative components: an adjacent relation propagation (ARP) module and a cross-layer relation alignment (CRA) module, to comprehensively explore pixel relations and improve segmentation performance. |
Wanting Zhang; Zhenhui Ding; Guilian Chen; Huisi Wu; Jing Qin; | code |
| 463 | Free2Guide: Training-Free Text-to-Video Alignment Using Image LVLM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Free2Guide, a novel gradient-free and training-free framework for aligning generated videos with text prompts. |
Jaemin Kim; Bryan Sangwoo Kim; Jong Chul Ye; | code |
| 464 | From Abyssal Darkness to Blinding Glare: A Benchmark on Extreme Exposure Correction in Real World Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the effectiveness of iterative correction in improving color and texture, we introduce the CLIP-Guided Iterative Refinement Strategy. |
Bo Wang; Huiyuan Fu; Zhiye Huang; Siru Zhang; Xin Wang; Huadong Ma; | code |
| 465 | ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the reliability of these model-based judges remains uncertain. To address this, we introduce ProJudgeBench, the first comprehensive benchmark specifically designed for evaluating abilities of MLLM-based process judges. |
Jiaxin Ai; Pengfei Zhou; Zhaopan Xu; Ming Li; Fanrui Zhang; Zizhen Li; Jianwen Sun; Yukang Feng; Baojin Huang; Zhongyuan Wang; Kaipeng Zhang; | code |
| 466 | VTimeCoT: Thinking By Drawing for Video Temporal Grounding and Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by how humans use video players to interact with the progress bar for video comprehension, we introduce VTimeCoT, a simple yet effective training-free framework, designed for high-performance video grounding and reasoning. |
Jinglei Zhang; Yuanfan Guo; Rolandos Alexandros Potamias; Jiankang Deng; Hang Xu; Chao Ma; | code |
| 467 | TESPEC: Temporally-Enhanced Self-Supervised Pretraining for Event Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce TESPEC, a self-supervised pre-training framework tailored for learning spatio-temporal information. |
Mohammad Mohammadi; Ziyi Wu; Igor Gilitschenski; | code |
| 468 | Boosting Adversarial Transferability Via Residual Perturbation Attack Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel attack method, named Residual Perturbation Attack (ResPA), relying on the residual gradient as the perturbation direction to guide the adversarial examples toward the flat regions of the loss function. |
Jinjia Peng; Zeze Tao; Huibing Wang; Meng Wang; Yang Wang; | code |
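Editor's note: our reading of the "residual gradient" idea, shown as a PGD-style transfer attack that steers updates by the gap between the current gradient and its running average. Step sizes and the averaging coefficient are illustrative, not the authors' settings.

```python
import torch
import torch.nn.functional as F

# PGD-style loop using a residual-gradient direction (current gradient minus a
# running mean). This is an interpretive sketch of ResPA, not the authors' code.
def residual_perturbation_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    x_adv = x.clone().detach()
    g_avg = torch.zeros_like(x)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        g = torch.autograd.grad(loss, x_adv)[0]
        g_res = g - g_avg                      # residual w.r.t. gradient history
        g_avg = 0.9 * g_avg + 0.1 * g          # update the running average
        x_adv = x_adv.detach() + alpha * g_res.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv
```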
| 469 | Adapting In-Domain Few-Shot Segmentation to New Domains Without Source Domain Retraining Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most CD-FSS methods redesign and retrain in-domain FSS models using abundant base data from the source domain, which are effective but costly to train. To address these issues, we propose adapting informative model structures of the well-trained FSS model for target domains by learning domain characteristics from few-shot labeled support samples during inference, thereby eliminating the need for source domain retraining. |
Qi Fan; Kaiqi Liu; Nian Liu; Hisham Cholakkal; Rao Muhammad Anwer; Wenbin Li; Yang Gao; | code |
| 470 | Towards Open-World Generation of Stereo Images and Unsupervised Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing stereo image generation methods typically focus on either visual quality for viewing or geometric accuracy for matching, but not both. We introduce GenStereo, a diffusion-based approach, to bridge this gap. |
Feng Qiao; Zhexiao Xiong; Eric Xing; Nathan Jacobs; | code |
| 471 | Trust But Verify: Programmatic VLM Evaluation in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended queries. |
Viraj Prabhu; Senthil Purushwalkam; An Yan; Caiming Xiong; Ran Xu; | code |
| 472 | Forgetting Through Transforming: Enabling Federated Unlearning Via Class-Aware Representation Transformation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This inspired us to transform the distribution of unlearning data to fuse with similar categories in the remaining data for effective FU. Based on this insight, we propose a novel framework, named FUCRT, to achieve Federated Unlearning via Class-aware Representation Transformation. |
Qi Guo; Zhen Tian; Minghao Yao; Saiyu Qi; Yong Qi; Bingyi Liu; | code |
| 473 | Client2Vec: Improving Federated Learning By Distribution Shifts Aware Client Indexing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we introduce the Client2Vec mechanism, which, before FL training begins, generates a unique index for each client that encodes that client’s distribution-shift information. |
Yongxin Guo; Lin Wang; Xiaoying Tang; Tao Lin; | code |
| 474 | Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods suffer from geometric distortion in Euclidean space that sometimes misrepresents the intrinsic hierarchical structure of videos and overlooks certain hierarchical semantics, ultimately leading to suboptimal temporal modeling. To address this issue, we propose the first hyperbolic modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space learning to compensate for the suboptimal hierarchical modeling capabilities of Euclidean space. |
Jun Li; Jinpeng Wang; Chaolei Tan; Niu Lian; Long Chen; Yaowei Wang; Min Zhang; Shu-Tao Xia; Bin Chen; | code |
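Editor's note: for readers unfamiliar with hyperbolic embeddings, distances in the Poincaré ball grow roughly exponentially toward the boundary, which lets tree-like hierarchies embed with low distortion. HLFormer's exact formulation may differ; the classical Poincaré distance is:

```latex
d_{\mathbb{B}}(\mathbf{x}, \mathbf{y}) =
  \operatorname{arcosh}\!\left(
    1 + \frac{2\,\lVert \mathbf{x} - \mathbf{y} \rVert^{2}}
             {\left(1 - \lVert \mathbf{x} \rVert^{2}\right)\left(1 - \lVert \mathbf{y} \rVert^{2}\right)}
  \right)
```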
| 475 | Accelerating Diffusion Transformer Via Gradient-Optimized Cache Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: It is a challenging problem since (1) Progressive error accumulation from cached blocks significantly degrades generation quality, particularly when over 50% of blocks are cached; (2) Current error compensation approaches neglect dynamic perturbation patterns during the caching process, leading to suboptimal error correction. To solve these problems, we propose the Gradient-Optimized Cache (GOC) with two key innovations: (1) Cached Gradient Propagation: A gradient queue dynamically computes the gradient differences between cached and recomputed features. |
Junxiang Qiu; Lin Liu; Shuo Wang; Jinda Lu; Kezhou Chen; Yanbin Hao; | code |
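Editor's note: the caching setting in entry 475, sketched below: reuse transformer-block outputs across diffusion timesteps and correct the reuse with the difference between recomputed and cached features. The refresh schedule and first-order correction are our simplification of the paper's gradient queue.

```python
# Toy block-output cache for a diffusion transformer. GOC's gradient queue is
# approximated here by a single first-order difference; purely illustrative.
class BlockCache:
    def __init__(self, refresh_every=2):
        self.refresh_every = refresh_every
        self.features = {}   # block_id -> last recomputed output
        self.deltas = {}     # block_id -> last change between recomputes

    def run(self, block_id, step, block_fn, h):
        if step % self.refresh_every == 0 or block_id not in self.features:
            out = block_fn(h)                            # full recompute
            if block_id in self.features:
                self.deltas[block_id] = out - self.features[block_id]
            self.features[block_id] = out
            return out
        out = self.features[block_id]
        return out + self.deltas.get(block_id, 0)        # extrapolate cached value
```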
| 476 | OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an occlusion-aware scene division strategy that clusters training cameras based on their positions and co-visibilities to acquire multiple regions. |
Shiyong Liu; Xiao Tang; Zhihao Li; Yingfan He; Chongjie Ye; Jianzhuang Liu; Binxiao Huang; Shunbo Zhou; Xiaofei Wu; | code |
| 477 | GSRecon: Efficient Generalizable Gaussian Splatting for Surface Reconstruction from Sparse Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel surface reconstruction method with Gaussian splatting, named GSRecon, which leverages the advantages of rasterization-based rendering to achieve efficient reconstruction. |
Hang Yang; Le Hui; Jianjun Qian; Jin Xie; Jian Yang; | code |
| 478 | LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present LawDIS, a language-window-based controllable dichotomous image segmentation (DIS) framework that produces high-quality object masks. |
Xinyu Yan; Meijun Sun; Ge-Peng Ji; Fahad Shahbaz Khan; Salman Khan; Deng-Ping Fan; | code |
| 479 | MUSE: Multi-Subject Unified Synthesis Via Explicit Layout Semantic Expansion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we address the task of layout-controllable multi-subject synthesis (LMS), which requires both faithful reconstruction of reference subjects and their accurate placement in specified regions within a unified image. |
Fei Peng; Junqiang Wu; Yan Li; Tingting Gao; Di Zhang; Huiyuan Fu; | code |
| 480 | DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Considering the dance motion should be both musical rhythmic and enable iterative editing by user descriptions, our framework is built upon a prediction-then-editing paradigm unifying multi-modal conditions. |
Hengyuan Zhang; Zhe Li; Xingqun Qi; Mengze Li; Muyi Sun; Siye Wang; Man Zhang; Sirui Han; | code |
| 481 | CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, despite the dramatic improvement in generative vision-language models, fine-grained control over the properties of generated captions is not easy due to two reasons: (i) existing models are not given the properties as a condition during training and (ii) existing models cannot smoothly transition their language pattern from one state to the other. Given this challenge, we propose a new approach, CaptionSmiths, to acquire a single captioning model that can handle diverse language patterns. |
Kuniaki Saito; Donghyun Kim; Kwanyong Park; Atsushi Hashimoto; Yoshitaka Ushiku; | code |
| 482 | TerraMind: Large-Scale Generative Multimodality for Earth Observation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present TerraMind, the first any-to-any generative, multi-modal foundation model for Earth observation (EO). |
Johannes Jakubik; Felix Yang; Benedikt Blumenstiel; Erik Scheurer; Rocco Sedona; Stefano Maurogiovanni; Jente Bosmans; Nikolaos Dionelis; Valerio Marsocci; Niklas Kopp; Rahul Ramachandran; Paolo Fraccaro; Thomas Brunschwiler; Gabriele Cavallaro; Juan Bernabe-Moreno; Nicolas Longépé; | code |
| 483 | MotionShot: Adaptive Motion Transfer Across Arbitrary Objects for Text-to-Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing text-to-video methods struggle to transfer motion smoothly from a reference object to a target object with significant differences in appearance or structure between them. To address this challenge, we introduce MotionShot, a training-free framework capable of parsing reference-target correspondences in a fine-grained manner, thereby achieving high-fidelity motion transfer while preserving coherence in appearance. |
Yanchen Liu; Yanan Sun; Zhening Xing; Junyao Gao; Kai Chen; Wenjie Pei; | code |
| 484 | DiTaiListener: Controllable High Fidelity Listener Video Generation with Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: For long-form video generation, we introduce DiTaiListener-Edit, a transition refinement video-to-video diffusion model. |
Maksim Siniukov; Di Chang; Minh Tran; Hongkun Gong; Ashutosh Chaubey; Mohammad Soleymani; | code |
| 485 | Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Modern 3D semantic scene graph estimation methods utilize ground truth 3D annotations to accurately predict target objects, predicates, and relationships. |
Qi Xun Yeo; Yanyan Li; Gim Hee Lee; | code |
| 486 | Timestep-Aware Diffusion Model for Extreme Image Rescaling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel framework called Timestep-Aware Diffusion Model (TADM) for extreme image rescaling, which performs rescaling operations in the latent space of a pre-trained autoencoder and effectively leverages powerful natural image priors learned by a pre-trained text-to-image diffusion model. |
Ce Wang; Zhenyu Hu; Wanjie Sun; Zhenzhong Chen; | code |
| 487 | Hierarchical Cross-modal Prompt Learning for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although prompt learning methods have shown promise, they suffer from two fundamental bottlenecks that limit generalization: (a) modality isolation, and (b) hierarchical semantic decay. To address these limitations, we propose HiCroPL, a Hierarchical Cross-modal Prompt Learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling them to refine their semantics mutually. |
Hao Zheng; Shunzhi Yang; Zhuoxin He; Jinfeng Yang; Zhenhua Huang; | code |
| 488 | MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose an audio-visual Mamba network with pseudo labeling aUGmentation (MUG) for emphasising the uniqueness of each segment and excluding the noise interference from the alternate modalities. |
Langyu Wang; Bingke Zhu; Yingying Chen; Yiyuan Zhang; Ming Tang; Jinqiao Wang; | code |
| 489 | Faster and Better 3D Splatting Via Group Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the computational overhead induced by the massive number of primitives poses a significant bottleneck to training efficiency. To overcome this challenge, we propose Group Training, a simple yet effective strategy that organizes Gaussian primitives into manageable groups, optimizing training efficiency and improving rendering quality. |
Chengbo Wang; Guozheng Ma; Yifei Xue; Yizhen Lao; | code |
| 490 | ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images. |
Cihang Peng; Qiming Hou; Zhong Ren; Kun Zhou; | code |
| 491 | FPEM: Face Prior Enhanced Facial Attractiveness Prediction for Live Videos with Face Retouching Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Facial attractiveness prediction (FAP) has long been an important computer vision task, which could be widely applied in live videos with facial retouching. However, previous FAP … |
Hui Li; Xiaoyu Ren; Hongjiu Yu; Ying Chen; Kai Li; L Wang; Xiongkuo Min; Huiyu Duan; Guangtao Zhai; Xu Liu; | code |
| 492 | Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Thus motivated, in this paper we propose TRACT, an open-vocabulary tracker that leverages trajectory information to improve both object association and classification in OV-MOT. |
Yunhao Li; Yifan Jiao; Dan Meng; Heng Fan; Libo Zhang; | code |
| 493 | Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, directly applying GS to ASR is exceptionally challenging because the original GS is an optimization-based method through overfitting each single scene, while in ASR we aim to learn a single model that can generalize to different images and scaling factors. We overcome these challenges by developing two novel techniques. |
Du Chen; Liyi Chen; Zhengqiang Zhang; Lei Zhang; | code |
| 494 | Knowledge Distillation with Refined Logits Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods. |
Wujie Sun; Defang Chen; Siwei Lyu; Genlang Chen; Chun Chen; Can Wang; | code |
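Editor's note: for reference, this is the classical temperature-scaled logit-distillation loss that RLD refines; the refinement itself (how logits are modified before matching) is the paper's contribution and is not reproduced here.

```python
import torch
import torch.nn.functional as F

# Classical logit distillation (Hinton et al.); RLD modifies the logits being
# matched, which this sketch does not reproduce.
def kd_loss(student_logits, teacher_logits, T=4.0):
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```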
| 495 | Efficient Spiking Point Mamba for Point Cloud Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose Spiking Point Mamba (SPM), the first Mamba-based SNN in the 3D domain. |
Peixi Wu; Bosong Chai; Menghua Zheng; Wei Li; Zhangchi Hu; Jie Chen; Zheyu Zhang; Hebei Li; Xiaoyan Sun; | code |
| 496 | Automated Red Teaming for Text-to-Image Models Through Feedback-Guided Prompt Iteration with Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing safety mechanisms, such as filtering and fine-tuning, remain insufficient in preventing vulnerabilities exposed by adversarial prompts. To systematically evaluate these weaknesses, we propose an automated red-teaming framework, Feedback-Guided Prompt Iteration (FGPI), which utilizes a Vision-Language Model (VLM) as the red-teaming agent following a feedback-guide-rewrite paradigm for iterative prompt optimization. |
Wei Xu; Kangjie Chen; Jiawei Qiu; Yuyang Zhang; Run Wang; Jin Mao; Tianwei Zhang; Lina Wang; | code |
| 497 | STaR: Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel sequence-to-sequence model for seamless Spatial-Temporal aware motion Retargeting (STaR), with penetration and consistency constraints. |
Xiaohang Yang; Qing Wang; Jiahao Yang; Gregory Slabaugh; Shanxin Yuan; | code |
| 498 | LightBSR: Towards Lightweight Blind Super-Resolution Via Discriminative Implicit Degradation Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on the discriminability optimization of IDR and propose a new powerful and lightweight BSR model termed LightBSR. |
Jiang Yuan; Ji Ma; Bo Wang; Guanzhou Ke; Weiming Hu; | code |
| 499 | TeethGenerator: A Two-stage Framework for Paired Pre- and Post-orthodontic 3D Dental Data Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose TeethGenerator, a novel two-stage framework designed to synthesize paired 3D teeth models pre- and post-orthodontic, aiming to facilitate the training of downstream tooth arrangement networks. |
Changsong Lei; Yaqian Liang; Shaofeng Wang; Jiajia Dai; Yong-Jin Liu; | code |
| 500 | Effective Training Data Synthesis for Improving MLLM Chart Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. |
Yuwei Yang; Zeyu Zhang; Yunzhong Hou; Zhuowan Li; Gaowen Liu; Ali Payani; Yuan-Sen Ting; Liang Zheng; | code |
| 501 | CleanPose: Category-Level Object Pose Estimation Via Causal Learning and Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In the effort to achieve robust and generalizable category-level object pose estimation, recent methods primarily focus on learning fundamental representations from data. |
Xiao Lin; Yun Peng; Liuyi Wang; Xianyou Zhong; Minghao Zhu; Yi Feng; Jingwei Yang; Chengju Liu; Qijun Chen; | code |
| 502 | OV3D-CG: Open-vocabulary 3D Instance Segmentation with Contextual Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods rely on pre-trained 2D foundation models, focusing on instance-level features while overlooking contextual relationships, limiting their ability to generalize to rare or ambiguous objects. To address these limitations, we propose an OV-3DIS framework guided by contextual information. |
Mingquan Zhou; Chen He; Ruiping Wang; Xilin Chen; | code |
| 503 | Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Furthermore, existing methods struggle to effectively remove severe JPEG artifacts, especially in highly compressed images. To address these challenges, we propose CODiff, a compression-aware one-step diffusion model for JPEG artifact removal. |
Jinpei Guo; Zheng Chen; Wenbo Li; Yong Guo; Yulun Zhang; | code |
| 504 | NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Instead of merely augmenting existing methods with temporal components, we present NormalCrafter to leverage the inherent temporal priors of Video Diffusion Models (VDMs). |
Yanrui Bin; Wenbo Hu; Haoyuan Wang; Xinya Chen; Bing Wang; | code |
| 505 | Disentangled Clothed Avatar Generation with Layered Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose LayerAvatar, a novel feed-forward diffusion-based method capable of generating high-quality component-disentangled clothed avatars in seconds. |
Weitian Zhang; Yichao Yan; Sijing Wu; Manwen Liao; Xiaokang Yang; | code |
| 506 | Princeton365: A Diverse Dataset with Accurate Camera Pose Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Princeton365, a large-scale diverse dataset of 365 videos with accurate camera pose. |
Karhan Kayan; Stamatis Alexandropoulos; Rishabh Jain; Yiming Zuo; Erich Liang; Jia Deng; | code |
| 507 | Evidential Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing logit-based knowledge distillation methods typically employ singularly deterministic categorical distributions, which eliminates the inherent uncertainty in network predictions and thereby limits the effective transfer of knowledge. To address this limitation, we introduce distribution-based probabilistic modeling as a more comprehensive representation of network knowledge. |
Liangyu Xiang; Junyu Gao; Changsheng Xu; | code |
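Editor's note: the "distribution-based probabilistic modeling" here is typically realized with Dirichlet outputs, as in evidential deep learning. The sketch below shows that standard parameterization; how the paper distills between such distributions is not reproduced.

```python
import torch.nn.functional as F

# Standard evidential (Dirichlet) parameterization of a classifier head.
# Distillation would compare teacher and student Dirichlets, omitted here.
def evidential_outputs(logits):
    evidence = F.softplus(logits)                 # non-negative class evidence
    alpha = evidence + 1.0                        # Dirichlet concentrations
    strength = alpha.sum(dim=-1, keepdim=True)    # total evidence + num classes
    prob = alpha / strength                       # expected class probabilities
    vacuity = logits.shape[-1] / strength         # uncertainty: high w/ low evidence
    return prob, vacuity
```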
| 508 | Open-Vocabulary Octree-Graph for 3D Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite progress, a point cloud is a set of unordered coordinates that requires substantial storage space and cannot directly convey occupancy information or spatial relations, making existing methods inefficient for downstream tasks, e.g., path planning and complex text-based object retrieval. To address these issues, we propose Octree-Graph, a novel scene representation for open-vocabulary 3D scene understanding. |
Zhigang Wang; Yifei Su; Chenhui Li; Dong Wang; Yan Huang; Xuelong Li; Bin Zhao; | code |
| 509 | PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This constraint restricts the ability to fully exploit the benefits of combining both modalities for enhanced control over semantics and spatial details. To overcome this, we propose PathDiff, a diffusion framework that effectively learns from unpaired mask-text data by integrating both modalities into a unified conditioning space. |
Mahesh Bhosale; Abdul Wasi; Yuanhao Zhai; Yunjie Tian; Samuel Border; Nan Xi; Pinaki Sarder; Junsong Yuan; David Doermann; Xuan Gong; | code |
| 510 | COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, to achieve both high performance and low computational costs, we propose a novel architecture with the **co**mbination of **s**elective **m**em**o**rization (COSMO), which integrates state-space modules (SSMs) and transformer modules. |
Siqi Zhang; Yanyuan Qiao; Qunbo Wang; Zike Yan; Qi Wu; Zhihua Wei; Jing Liu; | code |
| 511 | UAVScenes: A Multi-Modal Dataset for UAVs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This limitation prevents them from being used for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. |
Sijie Wang; Siqi Li; Yawei Zhang; Shangshu Yu; Shenghai Yuan; Rui She; Quanjiang Guo; JinXuan Zheng; Ong Kang Howe; Leonrich Chandra; Shrivarshann Srijeyan; Aditya Sivadas; Toshan Aggarwal; Heyuan Liu; Hongming Zhang; Chujie Chen; Junyu Jiang; Lihua Xie; Wee Peng Tay; | code |
| 512 | GraspCoT: Integrating Physical Property Reasoning for 6-DoF Grasping Under Flexible Language Instructions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose GraspCoT, a 6-DoF grasp detection framework that integrates a Chain-of-Thought (CoT) reasoning mechanism oriented to physical properties, guided by auxiliary question-answering (QA) tasks. |
Xiaomeng Chu; Jiajun Deng; Guoliang You; Wei Liu; Xingchen Li; Jianmin Ji; Yanyong Zhang; | code |
| 513 | GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose GazeGaussian, the first high-fidelity gaze redirection method that uses a two-stream 3DGS model to represent the face and eye regions separately. |
Xiaobao Wei; Peng Chen; Guangyu Li; Ming Lu; Hui Chen; Feng Tian; | code |
| 514 | CIARD: Cyclic Iterative Adversarial Robustness Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We summarize the causes of this problem inherent in existing methods with a dual-teacher framework as: (1) the divergent optimization objectives of dual-teacher models, i.e., the clean and robust teachers, impede effective knowledge transfer to the student model, and (2) the iteratively generated adversarial examples during training lead to performance deterioration of the robust teacher model. To address these challenges, we propose a novel Cyclic Iterative ARD (CIARD) method with two key innovations: (1) a multi-teacher framework with contrastive push-loss alignment to resolve conflicts in dual-teacher optimization objectives, and (2) continuous adversarial retraining to maintain dynamic teacher robustness against performance degradation from the varying adversarial examples. |
Liming Lu; Shuchao Pang; Xu Zheng; Xiang Gu; Anan Du; Yunhuai Liu; Yongbin Zhou; | code |
| 515 | Lifting The Structural Morphing for Wide-Angle Images Rectification: Unified Content and Boundary Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we observe and verify that transformations based on motion representations (e.g., Thin-Plate Spline) exhibit structural continuity in both rectification and rectangling tasks. |
Wenting Luan; Siqi Lu; Yongbin Zheng; Wanying Xu; Lang Nie; Zongtan Zhou; Kang Liao; | code |
| 516 | DoppDrive: Doppler-Driven Temporal Aggregation for Improved Radar Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose DoppDrive, a novel Doppler-Driven temporal aggregation method that enhances radar point cloud density while minimizing scatter. |
Yuval Haitman; Oded Bialer; | code |
| 517 | Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a player-centric multimodal prompt generation network for identity-aware sports video captioning (LLM-IAVC), which focuses on recognizing player identities from a visual perspective. |
Zeyu Xi; Haoying Sun; Yaofei Wu; Junchi Yan; Haoran Zhang; Lifang Wu; Liang Wang; Changwen Chen; | code |
| 518 | CVPT: Cross Visual Prompt Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our analysis reveals that VPT’s shortcomings stem from its prompt deployment strategy, which can distort the model’s inherent self-attention mechanism. To address this, we propose Cross Visual Prompt Tuning (CVPT). |
Lingyun Huang; Jianxu Mao; Junfei Yi; Ziming Tao; Yaonan Wang; | code |
| 519 | Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, with the constant progress in generative Artificial Intelligence (AI), deepfake artifacts are becoming imperceptible at both the spatial and the temporal levels, making them extremely difficult to capture. To address these issues, we propose a fine-grained deepfake video detection approach called FakeSTormer that enforces the modeling of subtle spatio-temporal inconsistencies while avoiding overfitting. |
Dat Nguyen; Marcella Astrid; Anis Kacem; Enjie Ghorbel; Djamila Aouada; | code |
| 520 | StyleMotif: Multi-Modal Motion Stylization Using Style-Content Cross Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present StyleMotif, a novel Stylized Motion Latent Diffusion model, generating motion conditioned on both content and style from multiple modalities. |
Ziyu Guo; Young Yoon Lee; Joseph Liu; Yizhak Ben-Shabat; Victor Zordan; Mubbasir Kapadia; | code |
| 521 | Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While previous supervised approaches rely on costly manual annotations, LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose Temporal Overlapping Prediction (TOP), a self-supervised pre-training method designed to alleviate this annotation burden. |
Ziliang Miao; Runjian Chen; Yixi Cai; Buwei He; Wenquan Zhao; Wenqi Shao; Bo Zhang; Fu Zhang; | code |
| 522 | Integrating Visual Interpretation and Linguistic Reasoning for Geometric Problem Solving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce an outcome-rewarded joint-tuning strategy to optimize the cooperation between the visual interpretation and linguistic reasoning model. |
Zixian Guo; Ming Liu; Qilong Wang; Zhilong Ji; Jinfeng Bai; Lei Zhang; Wangmeng Zuo; | code |
| 523 | ReMP-AD: Retrieval-enhanced Multi-modal Prompt Fusion for Few-Shot Industrial Visual Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Few-shot methods, which use limited samples and prompts, offer a more efficient approach. |
Hongchi Ma; Guanglei Yang; Debin Zhao; Yanli Ji; Wangmeng Zuo; | code |
| 524 | MMOne: Representing Multiple Modalities in One Scene Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. |
Zhifeng Gu; Bing Wang; | code |
| 525 | Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless, current models are restricted to just one type of image, often leading to over-segmentation in areas with little interaction and struggling to maintain category consistency within specific regions. To tackle this issue, a HOT framework, termed P3HOT, is proposed, which blends Prompt guidance and human Proximal Perception. |
Yuxiao Wang; Yu Lei; Zhenao Wei; Weiying Xue; Xinyu Jiang; Nan Zhuang; Qi Liu; | code |
| 526 | DDB: Diffusion Driven Balancing to Address Spurious Correlations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a Diffusion Driven Balancing (DDB) technique to generate training samples with text-to-image diffusion models for addressing the spurious correlation problem. |
Aryan Yazdan Parast; Basim Azam; Naveed Akhtar; | code |
| 527 | Mitigating Catastrophic Overfitting in Fast Adversarial Training Via Label Information Elimination Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we discover that after CO occurs, the label information of certain samples can transfer across different samples, significantly increasing the likelihood of modified images being classified as the intended label. |
Chao Pan; Ke Tang; Qing Li; Xin Yao; | code |
| 528 | CO2-Net: A Physics-Informed Spatio-Temporal Model for Global Surface CO2 Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, CO2 reconstruction presents unique challenges, including complex spatio-temporal dynamics, periodic patterns, and sparse observations. We propose CO2-Net, a data-driven model that addresses these challenges without requiring extensive prior data. |
Hao Zheng; Yuting Zheng; Hanbo Huang; Chaofan Sun; Enhui Liao; Lin Liu; Yi Han; Hao Zhou; Shiyu Liang; | code |
| 529 | OmniVTON: Training-Free Universal Virtual Try-On Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose OmniVTON, the first training-free universal VTON framework that decouples garment and pose conditioning to achieve both texture fidelity and pose consistency across diverse settings. |
Zhaotong Yang; Yuhui Li; Shengfeng He; Xinzhe Li; Yangyang Xu; Junyu Dong; Yong Du; | code |
| 530 | HyTIP: Hybrid Temporal Information Propagation for Masked Conditional Residual Video Coding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, the hidden-to-hidden connection approaches, which propagate latent features within the RNN, offer greater flexibility but require large buffer sizes. To address these issues, we propose HyTIP, a learned video coding framework that combines both mechanisms. |
Yi-Hsin Chen; Yi-Chen Yao; Kuan-Wei Ho; Chun-Hung Wu; Huu-Tai Phung; Martin Benjak; Jörn Ostermann; Wen-Hsiao Peng; | code |
| 531 | Learning Robust Stereo Matching in The Wild with Selective Mixture-of-Experts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model’s robustness, but integrating such a model into stereo matching cost-effectively to fully realize its robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. |
Yun Wang; Longguang Wang; Chenghao Zhang; Yongjian Zhang; Zhanjie Zhang; Ao Ma; Chenyou Fan; Tin Lun Lam; Junjie Hu; | code |
| 532 | DMesh++: An Efficient Differentiable Mesh for Complex Shapes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent probabilistic methods for 3D triangular meshes capture diverse shapes by differentiable mesh connectivity, but face high computational costs with increased shape details. We introduce a new differentiable mesh processing method that addresses this challenge and efficiently handles meshes with intricate structures. |
Sanghyun Son; Matheus Gadelha; Yang Zhou; Matthew Fisher; Zexiang Xu; Yi-Ling Qiao; Ming C. Lin; Yi Zhou; | code |
| 533 | Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce VLADBench, a challenging and fine-grained benchmark featuring closed-form QAs that progress from static foundational knowledge and elements to advanced reasoning for dynamic on-road situations. |
Yue Li; Meng Tian; Zhenyu Lin; Jiangtong Zhu; Dechang Zhu; Haiqiang Liu; Yueyi Zhang; Zhiwei Xiong; Xinhai Zhao; | code |
| 534 | Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). |
Yihong Cao; Jiaming Zhang; Xu Zheng; Hao Shi; Kunyu Peng; Hang Liu; Kailun Yang; Hui Zhang; | code |
| 535 | OrderChain: Towards General Instruct-Tuning for Stimulating The Ordinal Understanding Ability of MLLM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite the remarkable progress of multimodal large language models (MLLMs), they continue to face challenges in achieving competitive performance on ordinal regression (OR; a.k.a. ordinal classification). To address this issue, this paper presents OrderChain, a novel and general prompting paradigm that improves the ordinal understanding ability of MLLMs by specificity and commonality modeling. |
Jinhong Wang; Shuo Tong; Jian Liu; Dongqi Tang; Weiqiang Wang; Wentong Li; Hongxia Xu; Danny Z. Chen; Jintai Chen; Jian Wu; | code |
| 536 | Backdooring Self-Supervised Contrastive Learning By Noisy Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Noisy Alignment (NA), a DPCL method that explicitly suppresses noise components in poisoned images. |
Tuo Chen; Jie Gui; Minjing Dong; Ju Jia; Lanting Fang; Jian Liu; | code |
| 537 | Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we identify that missing noise-augmented training causes critical security gaps: many VLMs are susceptible to even simple perturbations such as Gaussian noise. |
Jiawei Wang; Yushen Zuo; Yuanjun Chai; Zhendong Liu; Yicheng Fu; Yichun Feng; Kin-Man Lam; | code |
| 538 | Multi-turn Consistent Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a result, these methods frequently produce inconsistent outcomes or fail to meet user expectations. To address these challenges, we propose a multi-turn image editing framework that enables users to iteratively refine their edits, progressively achieving more satisfactory results. |
Zijun Zhou; Yingying Deng; Xiangyu He; Weiming Dong; Fan Tang; | code |
| 539 | MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods often capture only shallow causal relations, fail to address spurious correlations across modalities, and ignore ego-vehicle-level causality modeling. To overcome these limitations, we propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between visual and language modalities. |
Tongtong Cheng; Rongzhen Li; Yixin Xiong; Tao Zhang; Jing Wang; Kai Liu; | code |
| 540 | Global Regulation and Excitation Via Attention Tuning for Stereo Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To enable the existing iterative approaches to incorporate global context, we propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework which encompasses three attention modules. |
Jiahao Li; Xinhong Chen; Zhengmin Jiang; Qian Zhou; Yung-Hui Li; Jianping Wang; | code |
| 541 | ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K samples, designed to bridge this gap. |
Shaofeng Yin; Ting Lei; Yang Liu; | code |
| 542 | MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models. To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. |
Tianhong Gao; Yannian Fu; Weiqun Wu; Haixiao Yue; Shanshan Liu; Gang Zhang; | code |
| 543 | MobileIE: An Extremely Lightweight and Effective ConvNet for Real-Time Image Enhancement on Mobile Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, deploying deep learning models on resource-constrained platforms, such as mobile devices, remains challenging due to high computation and memory demands. To address these challenges and facilitate real-time IE on mobile, we introduce an extremely lightweight Convolutional Neural Network (CNN) framework with around 4K parameters. |
Hailong Yan; Ao Li; Xiangtao Zhang; Zhe Liu; Zenglin Shi; Ce Zhu; Le Zhang; | code |
| 544 | ViT-EnsembleAttack: Augmenting Ensemble Models for Stronger Adversarial Transferability in Vision Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing research primarily focuses on refining ensemble weights or optimizing the ensemble path, overlooking the exploration of ensemble models to enhance the transferability of adversarial attacks. To address this gap, we propose applying adversarial augmentation to the surrogate models, aiming to boost overall generalization of ensemble models and reduce the risk of adversarial overfitting. |
Hanwen Cao; Haobo Lu; Xiaosen Wang; Kun He; | code |
| 545 | FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, textual prompt tuning suffers from overfitting to known concepts, limiting its generalizability to unseen concepts. To address this limitation, we propose Federated Multimodal Visual Prompt Tuning (FedMVP), which conditions the prompts on multimodal contextual information derived from the input image and textual attribute features of a class. |
Mainak Singha; Subhankar Roy; Sarthak Mehrotra; Ankit Jha; Moloud Abdar; Biplab Banerjee; Elisa Ricci; | code |
| 546 | Superpowering Open-Vocabulary Object Detectors for X-ray Vision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. |
Pablo Garcia-Fernandez; Lorenzo Vaquero; Mingxuan Liu; Feng Xue; Daniel Cores; Nicu Sebe; Manuel Mucientes; Elisa Ricci; | code |
| 547 | Factorized Learning for Temporally Grounded Video-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. |
Wenzheng Zeng; Difei Gao; Mike Zheng Shou; Hwee Tou Ng; | code |
| 548 | ConceptSplit: Decoupled Multi-Concept Personalization of Diffusion Models Via Token-wise Adaptation and Attention Disentanglement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The main challenge of this task is "concept mixing", where multiple learned concepts interfere or blend undesirably in the output image. To address this issue, in this paper, we present ConceptSplit, a novel framework to split the individual concepts through training and inference. |
Habin Lim; Yeongseob Won; Juwon Seo; Gyeong-Moon Park; | code |
| 549 | When Confidence Fails: Revisiting Pseudo-Label Selection in Semi-supervised Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Meanwhile, the direct discarding of low-confidence predictions disrupts spatial-semantic continuity, causing critical context loss. We propose Confidence Separable Learning (CSL) to address these limitations. |
Pan Liu; Jinshi Liu; | code |
| 550 | Anti-Tamper Protection for Unauthorized Individual Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the protection algorithms would become ineffective when forgery attackers apply purification techniques to bypass the protection. To address this issue, we present a novel approach, Anti-Tamper Perturbation (ATP). |
Zelin Li; Ruohan Zong; Yifan Liu; Ruichen Yao; Yaokun Liu; Yang Zhang; Dong Wang; | code |
| 551 | TITAN-Guide: Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models, termed TITAN-Guide, which overcomes memory space issues and provides more effective control over the guidance process than its counterparts. |
Christian Simon; Masato Ishii; Akio Hayakawa; Zhi Zhong; Shusuke Takahashi; Takashi Shibuya; Yuki Mitsufuji; | code |
| 552 | RESCUE: Crowd Evacuation Simulation Via Controlling SDM-United Characters Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, aligned with the sensory-decision-motor (SDM) flow of the human brain, we propose a real-time 3D crowd evacuation simulation framework that integrates a 3D-adaptive SFM (Social Force Model) Decision Mechanism and a Personalized Gait Control Motor. |
Xiaolin Liu; Tianyi Zhou; Hongbo Kang; Jian Ma; Ziwen Wang; Jing Huang; Wenguo Weng; Yu-Kun Lai; Kun Li; | code |
| 553 | Adversarial Attention Perturbations for Large Object Detection Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents an Attention-Focused Offensive Gradient (AFOG) attack against object detection transformers. |
Zachary Yahn; Selim Furkan Tekin; Fatih Ilhan; Sihao Hu; Tiansheng Huang; Yichang Xu; Margaret Loper; Ling Liu; | code |
| 554 | Dissecting Generalized Category Discovery: Multiplex Consensus Under Self-Deconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present an orthogonal solution, inspired by the human cognitive process for novel object understanding: decomposing objects into visual primitives and establishing cross-knowledge comparisons. |
Luyao Tang; Kunze Huang; Chaoqi Chen; Yuxuan Yuan; Chenxin Li; Xiaotong Tu; Xinghao Ding; Yue Huang; | code |
| 555 | MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for a sequence of calibrated images to recover dense point clouds. |
Jianfei Jiang; Qiankun Liu; Haochen Yu; Hongyuan Liu; Liyong Wang; Jiansheng Chen; Huimin Ma; | code |
| 556 | Global and Local Entailment Learning for Natural World Imagery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. |
Srikumar Sastry; Aayush Dhakal; Eric Xing; Subash Khanal; Nathan Jacobs; | code |
| 557 | ViewSRD: 3D Visual Grounding Via Structured Multi-View Decomposition Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: 3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors … |
Ronggang Huang; Haoxin Yang; Yan Cai; Xuemiao Xu; Huaidong Zhang; Shengfeng He; | code |
| 558 | FB-Diff: Fourier Basis-guided Diffusion for Temporal Interpolation of 4D Medical Imaging Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, realistic respiratory motions should be nonlinear and quasi-periodic with specific frequencies. Motivated by this property, we address the temporal interpolation task from the frequency perspective, and propose a Fourier Basis-guided Diffusion model, termed FB-Diff. |
Xin You; Runze Yang; Chuyan Zhang; Zhongliang Jiang; Jie Yang; Nassir Navab; | code |
| 559 | Zeroth-Order Fine-Tuning of LLMs in Random Subspaces Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the random Subspace Zeroth-order (SubZero) optimization to address the challenges posed by LLMs’ high dimensionality. |
Ziming Yu; Pan Zhou; Sike Wang; Jia Li; Mi Tian; Hua Huang; | code |
| 560 | Attention to Neural Plagiarism: Diffusion Models Can Plagiarize Your Copyrighted Images! Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we highlight a critical threat posed by emerging neural models: data plagiarism. |
Zihang Zou; Boqing Gong; Liqiang Wang; | code |
| 561 | ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, no public training datasets exist to fine-tune open-source VLMs, only small benchmarks with diverse evaluation schemes. To address this, we introduce ADIEE, an automated dataset creation approach which is then used to train a scoring model for instruction-guided image editing evaluation. |
Sherry X. Chen; Yi Wei; Luowei Zhou; Suren Kumar; | code |
| 562 | A Tiny Change, A Giant Leap: Long-Tailed Class-Incremental Learning Via Geometric Prototype Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These factors jointly degrade tail-class performance and exacerbate catastrophic forgetting. To tackle these issues, we propose Geometric Prototype Alignment (GPA), a model-agnostic approach that calibrates classifier learning dynamics via geometric feature-space alignment. |
Xinyi Lai; Luojun Lin; Weijie Chen; Yuanlong Yu; | code |
| 563 | Self-Calibrating Gaussian Splatting for Large Field-of-View Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a new reconstruction pipeline based on Gaussian Splatting that uses a flexible lens model and supports fields of view approaching 180 degrees. |
Youming Deng; Wenqi Xian; Guandao Yang; Leonidas Guibas; Gordon Wetzstein; Steve Marschner; Paul Debevec; | code |
| 564 | HERMES: Temporal-coHERent Long-forM Understanding with Episodes and Semantics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Long-form video understanding presents unique challenges that extend beyond traditional short-video analysis approaches, particularly in capturing long-range dependencies, processing redundant information efficiently, and extracting high-level semantic concepts. To address these challenges, we propose a novel approach that more accurately reflects human cognition. |
Gueter Josmy Faure; Jia-Fong Yeh; Min-Hung Chen; Hung-Ting Su; Shang-Hong Lai; Winston H. Hsu; | code |
| 565 | Towards A 3D Transfer-based Black-box Attack Via Critical Feature Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Based on our observation that the critical features used for point cloud classification are consistent across different DNN architectures, we propose CFG, a novel transfer-based black-box attack method that improves the transferability of adversarial point clouds via the proposed **C**ritical **F**eature **G**uidance. |
Shuchao Pang; Zhenghan Chen; Shen Zhang; Liming Lu; Siyuan Liang; Anan Du; Yongbin Zhou; | code |
| 566 | UniFuse: A Unified All-in-One Framework for Multi-Modal Medical Image Fusion Under Diverse Degradations and Misalignments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Its effectiveness heavily relies on these conditions and often deteriorates when handling misaligned or degraded medical images. To address this, we propose UniFuse, a general fusion framework. |
Dayong Su; Yafei Zhang; Huafeng Li; Jinxing Li; Yu Liu; | code |
| 567 | Leveraging The Power of MLLMs for Gloss-Free Sign Language Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). |
Jungeun Kim; Hyeongwoo Jeon; Jongseong Bae; Ha Young Kim; | code |
| 568 | IMoRe: Implicit Program-Guided Reasoning for Human Motion Q&A Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing human motion Q&A methods rely on explicit program execution, where the requirement for manually defined functional modules may limit the scalability and adaptability. To overcome this, we propose an implicit program-guided motion reasoning (IMoRe) framework that unifies reasoning across multiple query types without manually designed modules. |
Chen Li; Chinthani Sugandhika; Yeo Keat Ee; Eric Peh; Hao Zhang; Hong Yang; Deepu Rajan; Basura Fernando; | code |
| 569 | Correspondence As Video: Test-Time Adaption on SAM2 for Reference Segmentation in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a novel approach by representing the inherent correspondence between reference-target image pairs as a pseudo video. |
Haoran Wang; Zekun Li; Jian Zhang; Lei Qi; Yinghuan Shi; | code |
| 570 | Revisiting Pool-based Prompt Learning for Few-shot Class-incremental Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Through comprehensive analysis, we identify that this phenomenon stems from token-dimension saturation: with limited data, excessive prompts compete for task-relevant information, leading to model overfitting. Based on this finding, we propose LGSP-Prompt (Local-Global Spatial Prompting), which innovatively shifts pool-based prompt learning from the token dimension to the spatial dimension. |
Yongwei Jiang; Yixiong Zou; Yuhua Li; Ruixuan Li; | code |
| 571 | Differentially Private Fine-Tuning of Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces DP-LoRA, a surprisingly simple yet effective framework for differentially private fine-tuning of latent diffusion models (LDMs) using Low-Rank Adaptation (LoRA). |
Yu-Lin Tsai; Yizhe Li; Chia-Mu Yu; Xuebin Ren; Po-Yu Chen; Zekai Chen; Francois Buet-Golfouse; | code |
| 572 | Is CLIP Ideal? No. Can We Fix It? Yes! Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. |
Raphi Kang; Yue Song; Georgia Gkioxari; Pietro Perona; | code |
| 573 | SPD: Shallow Backdoor Protecting Deep Backdoor Against Backdoor Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing backdoor attacks often fail to bypass backdoor detection and human visual inspection, resulting in the exposure of the backdoor implanted in DNNs, which can subsequently be significantly mitigated through pruning or fine-tuning on benign data. To address this issue, in this paper, we propose a novel backdoor attack called SPD (Shallow Protecting Deep), which consists of a deep backdoor in the frequency domain and a shallow backdoor in the pixel domain, where the shallow backdoor acts as a firewall to protect the deep backdoor from being detected. |
Shunjie Yuan; Xinghua Li; Xuelin Cao; Haiyan Zhang; Mengyao Zhu; Robert H. Deng; | code |
| 574 | CapeLLM: Support-Free Category-Agnostic Pose Estimation with Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing approaches often remain constrained by their reliance on support queries, their failure to fully utilize the rich priors embedded in pre-trained large language models, and the limitations imposed by their parametric distribution assumptions. To address these challenges, we introduce CapeLLM, the first multimodal large language model (MLLM) designed for CAPE. |
Junho Kim; Hyungjin Chung; Byung-Hoon Kim; | code |
| 575 | ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. |
Binbin Xiang; Maciej Wielgosz; Stefano Puliti; Kamil Král; Martin Krůček; Azim Missarov; Rasmus Astrup; | code |
| 576 | Environment-Agnostic Pose: Generating Environment-independent Object Representations for 6D Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces EA6D, a novel diffusion-based framework for 6D pose estimation that operates effectively in any environment. |
Shaobo Zhang; Yuhang Huang; Wanqing Zhao; Wei Zhao; Ziyu Guan; Jinye Peng; | code |
| 577 | Learning 3D Scene Analogies with Neural Contextual Scene Maps Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As it is intractable for data-driven learning to comprehensively encapsulate the diverse range of layouts and open spaces, we propose teaching machines to identify relational commonalities in 3D spaces. |
Junho Kim; Gwangtak Bae; Eun Sun Lee; Young Min Kim; | code |
| 578 | Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a framework that enables fast reconstruction and real-time rendering of urban-scale scenes while maintaining robustness against appearance variations across multi-view captures. |
Zhensheng Yuan; Haozhi Huang; Zhen Xiong; Di Wang; Guanghua Yang; | code |
| 579 | STDDNet: Harnessing Mamba for Video Polyp Segmentation Via Spatial-aligned Temporal Modeling and Discriminative Dynamic Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, to meet clinical requirements, the inference must operate in real-time to enable intraoperative tracking and guidance. In this paper, we propose a novel and efficient segmentation network, STDDNet, which integrates a spatial-aligned temporal modeling strategy and a discriminative dynamic representation learning mechanism, to comprehensively address these challenges by harnessing the advantages of Mamba. |
Guilian Chen; Huisi Wu; Jing Qin; | code |
| 580 | Cross-modal Ship Re-Identification Via Optical and SAR Imagery: A Novel Dataset and Method Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The former offer low resolution and are susceptible to weather conditions, while the latter have short filming durations and limited coverage areas, making them less suitable for the real-world requirements of ship tracking. To address these limitations, we present the Hybrid Optical and Synthetic Aperture Radar (SAR) Ship Re-Identification Dataset (HOSS ReID dataset), designed to evaluate the effectiveness of ship tracking using low-Earth orbit constellations of optical and SAR sensors. |
Han Wang; Shengyang Li; Jian Yang; Yuxuan Liu; Yixuan Lv; Zhuang Zhou; | code |
| 581 | Humans As Checkerboards: Calibrating Camera Motion Scale for World-Coordinate Human Mesh Recovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents an optimization-free scale calibration framework, Human as Checkerboard (HAC). |
Fengyuan Yang; Kerui Gu; Ha Linh Nguyen; Tze Ho Elden Tse; Angela Yao; | code |
| 582 | Consensus-Driven Active Model Selection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a method for active model selection, using predictions from candidate models to prioritize the labeling of test data points that efficiently differentiate the best candidate. |
Justin Kay; Grant Van Horn; Subhransu Maji; Daniel Sheldon; Sara Beery; | code |
| 583 | What We Need Is Explicit Controllability: Training 3D Gaze Estimator Using Only Facial Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To further enhance realism, we introduce eye-focused constraints, including a rotation symmetry protocol, as well as geometry and appearance losses for the eye regions, alongside conventional learning objectives. |
Tingwei Li; Jun Bao; Zhenzhong Kuang; Buyu Liu; | code |
| 584 | Learning Separable Fine-Grained Representation Via Dendrogram Construction from Coarse Labels for Fine-grained Visual Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here, we introduce a bottom-up learning paradigm that constructs a hierarchical dendrogram by iteratively merging similar instances/clusters, inferring higher-level semantics from lowest-level instances without predefining class numbers. Leveraging this, we propose BuCSFR, a novel method that integrates a Bottom-up Construction (BuC) module to build the dendrogram based on a minimal information loss criterion, and a Separable Fine-grained Representation (SFR) module that treats dendrogram nodes as pseudo-labels to ensure representation separability. |
Guanghui Shi; Xuefeng Liang; Wenjie Li; Xiaoyu Lin; | code |
| 585 | BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. |
Minkyun Seo; Hyungtae Lim; Kanghee Lee; Luca Carlone; Jaesik Park; | code |
| 586 | S3R-GS: Streamlining The Pipeline for Large-Scale Street Scene Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, when applied to large-scale street scenes, existing methods suffer from rapidly escalating per-viewpoint reconstruction costs as scene size increases, leading to significant computational overhead. After revisiting the conventional pipeline, we identify three key factors accounting for this issue: unnecessary local-to-global transformations, excessive 3D-to-2D projections, and inefficient rendering of distant content. To address these challenges, we propose S3R-GS, a 3DGS framework that Streamlines the pipeline for large-scale Street Scene Reconstruction, effectively mitigating these limitations. |
Guangting Zheng; Jiajun Deng; Xiaomeng Chu; Yu Yuan; Houqiang Li; Yanyong Zhang; | code |
| 587 | Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. |
Xiuyu Yang; Shuhan Tan; Philipp Krähenbühl; | code |
| 588 | Causality-guided Prompt Learning for Vision-language Models Via Visual Granulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most of the existing CLIP-based prompt learning methods only show a limited ability for handling fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, where the explored visual granulation technique constructs sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through causal inference. |
Mengyu Gao; Qiulei Dong; | code |
| 589 | VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, despite their capabilities, VDMs often fail to produce physically plausible videos due to an inherent lack of understanding of physics, resulting in incorrect dynamics and event sequences. To address this limitation, we propose a novel two-stage image-to-video generation framework that explicitly incorporates physics with vision and language informed physical prior. |
Xindi Yang; Baolu Li; Yiming Zhang; Zhenfei Yin; Lei Bai; Liqian Ma; Zhiyong Wang; Jianfei Cai; Tien-Tsin Wong; Huchuan Lu; Xu Jia; | code |
| 590 | SDFit: 3D Object Pose and Shape By Fitting A Morphable SDF to A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, they lack an explicit feedback loop for refining noisy estimates, and primarily focus on geometry without directly considering pixel alignment. To tackle these limitations, we develop a novel render-and-compare optimization framework, called SDFit. |
Dimitrije Antić; Georgios Paschalidis; Shashank Tripathi; Theo Gevers; Sai Kumar Dwivedi; Dimitrios Tzionas; | code |
| 591 | What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. |
Lorenzo Baraldi; Davide Bucciarelli; Federico Betti; Marcella Cornia; Lorenzo Baraldi; Nicu Sebe; Rita Cucchiara; | code |
| 592 | On The Generalization of Representation Uncertainty in Earth Observation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate the generalization of representation uncertainty in EO, considering the domain’s unique semantic characteristics. |
Spyros Kondylatos; Nikolaos Ioannis Bountos; Dimitrios Michail; Xiao Xiang Zhu; Gustau Camps-Valls; Ioannis Papoutsis; | code |
| 593 | NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. |
Sung-Yeon Park; Can Cui; Yunsheng Ma; Ahmadreza Moradipari; Rohit Gupta; Kyungtae Han; Ziran Wang; | code |
| 594 | COIN: Confidence Score-Guided Distillation for Annotation-Free Cell Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Cell instance segmentation (CIS) is crucial for identifying individual cell morphologies in histopathological images, providing valuable insights for biological and medical research. |
Sanghyun Jo; Seo Jin Lee; Seungwoo Lee; Seohyung Hong; Hyungseok Seo; Kyungsu Kim; | code |
| 595 | Failure Cases Are Better Learned But Boundary Says Sorry: Facilitating Smooth Perception Change for Accuracy-Robustness Trade-Off in Adversarial Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we reveal a counterintuitive fact for the first time: **From the perspective of perception consistency, hard adversarial samples that can still attack the robust model after AT are already learned better than those successfully defended**. |
Yanyun Wang; Li Liu; | code |
| 596 | Task Vector Quantization for Memory-Efficient Model Merging Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose quantizing task vectors (i.e., the difference between pre-trained and fine-tuned checkpoints) instead of quantizing fine-tuned checkpoints. |
Youngeun Kim; Seunghwan Lee; Aecheon Jung; Bogon Ryu; Sungeun Hong; | code |
| 597 | An Efficient Hybrid Vision Transformer for TinyML Applications Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work introduces TinyNeXt, a family of efficient hybrid ViTs for TinyML, featuring Lean Single-Head Self-Attention to minimize memory-bound operations, and a macro design tailored to feature characteristics at different stages. |
Fanhong Zeng; Huanan Li; Juntao Guan; Rui Fan; Tong Wu; Xilong Wang; Rui Lai; | code |
| 598 | Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This introduces a new challenge: how to effectively align local visual features with corresponding attributes based on pre-trained VLMs. To tackle this, we propose LaZSL, a locally-aligned vision-language model for interpretable ZSL. |
Shiming Chen; Bowen Duan; Salman Khan; Fahad Shahbaz Khan; | code |
| 599 | Balancing Conservatism and Aggressiveness: Prototype-Affinity Hybrid Network for Few-Shot Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This observation motivates us to balance the conservative and aggressive information captured by these two types of FSS frameworks so as to improve the segmentation performance. To achieve this, we propose a Prototype-Affinity Hybrid Network (PAHNet), which introduces a Prototype-guided Feature Enhancement (PFE) module and an Attention Score Calibration (ASC) module in each attention block of an affinity learning model (called affinity learner). |
Tianyu Zou; Shengwu Xiong; Ruilin Yao; Yi Rong; | code |
| 600 | DOGR: Towards Versatile Visual Document Grounding and Referring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these capabilities still remain underdeveloped in visual document understanding due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the **DO**cument **G**rounding and **R**eferring data engine (**DOGR-Engine**), which generates two types of high-quality fine-grained document data: (1) multi-granular parsing data to improve text localization and recognition, and (2) instruction-tuning data to activate MLLMs’ grounding and referring capabilities in dialogue and reasoning. |
Yinan Zhou; Yuxin Chen; Haokun Lin; Yichen Wu; Shuyu Yang; Zhongang Qi; Chen Ma; Li Zhu; | code |
| 601 | A Differentiable Wave Optics Model for End-to-End Computational Imaging System Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a differentiable optics simulator that accurately and efficiently models aberration and diffraction in compound optics and allows us to analyze the role and impact of diffraction in end-to-end optimization. |
Chi-Jui Ho; Yash Belhe; Steve Rotenberg; Ravi Ramamoorthi; Tzu-Mao Li; Nicholas Antipa; | code |
| 602 | Parameter-Efficient Adaptation of Geospatial Foundation Models Through Embedding Deflection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we argue that incorporating stronger inductive biases on both the data and the models can enhance the adaptation of Geospatial Foundation Models (GFMs), pretrained on RGB satellite images, to other types of optical satellite data. |
Romain Thoreau; Valerio Marsocci; Dawa Derksen; | code |
| 603 | G2SF: Geometry-Guided Score Fusion for Multimodal Industrial Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Dynamically evolving from Euclidean metrics, we propose a novel Geometry-Guided Score Fusion (G2SF) framework that progressively learns an anisotropic local distance metric as a unified score for the fusion task. |
Chengyu Tao; Xuanming Cao; Juan Du; | code |
| 604 | LoftUp: Learning A Coordinate-Based Feature Upsampler for Vision Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. |
Haiwen Huang; Anpei Chen; Volodymyr Havrylov; Andreas Geiger; Dan Zhang; | code |
| 605 | Unlearning The Noisy Correspondence Makes CLIP More Robust Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a brand new perspective that seeks to directly eliminate the harmful effects of NC in pre-trained VLMs. |
Haochen Han; Alex Jinpeng Wang; Peijun Ye; Fangming Liu; | code |
| 606 | Gradient Extrapolation for Debiased Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: When these unintended associations between non-target attributes and target labels are absent in the test data, they lead to poor generalization. This paper addresses this problem from a model optimization perspective and proposes a novel method, Gradient Extrapolation for Debiased Representation Learning (GERNE), designed to learn debiased representations in both known and unknown attribute training cases. |
Ihab Asaad; Maha Shadaydeh; Joachim Denzler; | code |
| 607 | Dataset Ownership Verification for Pre-trained Masked Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The verification of dataset ownership has evolved into a crucial approach in this domain; however, existing verification techniques are predominantly tailored to supervised models and contrastive pre-trained models, rendering them ill-suited for direct application to the increasingly prevalent masked models. In this work, we introduce the first methodology addressing this critical yet unresolved challenge, termed Dataset Ownership Verification for Masked Modeling (DOV4MM). |
Yuechen Xie; Jie Song; Yicheng Shan; Xiaoyan Zhang; Yuanyu Wan; Shengxuming Zhang; Jiarui Duan; Mingli Song; | code |
| 608 | Consistent Time-of-Flight Depth Denoising Via Graph-Informed Geometric Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel ToF depth denoising network leveraging motion-invariant graph fusion to simultaneously enhance temporal stability and spatial sharpness. |
Weida Wang; Changyong He; Jin Zeng; Di Qiu; | code |
| 609 | Boosting Class Representation Via Semantically Related Instances for Robust Long-Tailed Learning with Noisy Labels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a simple yet effective method, Instances Benefitting Classes (IBC). |
Yuhang Li; Zhuying Li; Yuheng Jia; | code |
| 610 | Boosting Multi-View Indoor 3D Object Detection Via Adaptive 3D Volume Construction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work presents SGCDet, a novel multi-view indoor 3D object detection framework based on adaptive 3D volume construction. |
Runmin Zhang; Zhu Yu; Si-Yuan Cao; Lingyu Zhu; Guangyi Zhang; Xiaokai Bai; Hui-Liang Shen; | code |
| 611 | Registration Beyond Points: General Affine Subspace Alignment Via Geodesic Distance on Grassmann Manifold Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we present a rigorous mathematical proof demonstrating that the bases of high-dimensional linear subspaces can serve as an explicit representation of the cost. |
Jaeho Shin; Hyeonjae Gil; Junwoo Jang; Maani Ghaffari; Ayoung Kim; | code |
| 612 | MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Multi-Baseline Gaussian Splatting (MuGS), a generalized feed-forward approach for novel view synthesis that effectively handles diverse baseline settings, including sparse input views with both small and large baselines. |
Yaopeng Lou; Liao Shen; Tianqi Liu; Jiaqi Li; Zihao Huang; Huiqiang Sun; Zhiguo Cao; | code |
| 613 | Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To explicitly model the relationship between salient and camouflaged objects, we propose a model called USCNet, which introduces two distinct prompt query mechanisms for modeling inter-sample and intra-sample aspect relationships. |
Zhangjun Zhou; Yiping Li; Chunlin Zhong; Jianuo Huang; Jialun Pei; Hua Li; He Tang; | code |
| 614 | LaCoOT: Layer Collapse Through Optimal Transport Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present an optimal transport-based method to reduce the depth of over-parametrized deep neural networks, alleviating their computational burden. |
Victor Quétu; Zhu Liao; Nour Hezbri; Fabio Pizzati; Enzo Tartaglione; | code |
| 615 | Test-Time Retrieval-Augmented Adaptation for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel test-time retrieval-augmented adaptation (TT-RAA) method that enables VLMs to maintain high performance across diverse visual recognition tasks without the need for task-specific training or large computational overhead. |
Xinqi Fan; Xueli Chen; Luoxiao Yang; Chuin Hong Yap; Rizwan Qureshi; Qi Dou; Moi Hoon Yap; Mubarak Shah; | code |
| 616 | SSVQ: Unleashing The Potential of Vector Quantization with Sign-Splitting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Consequently, many quantized weights are compelled to move in directions contrary to their local gradient information. To mitigate this issue, we introduce a novel VQ paradigm, Sign-Splitting VQ (SSVQ), which decouples the sign bit of weights from the codebook. |
Shuaiting Li; Juncan Deng; Chengxuan Wang; Kedong Xu; Rongtao Deng; Hong Gu; Haibin Shen; Kejie Huang; | code |
| 617 | Spatial-Temporal Forgery Trace Based Forgery Image Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While these features contain forgery traces, they also include a substantial amount of the image’s semantic information, which interferes with the precision and generalization of forgery detection models. To tackle these challenges, this paper introduces a novel forgery image identification method based on the Spatial-Temporal Forgery Trace (STFT). |
Yilin Wang; Zunlei Feng; Jiachi Wang; Hengrui Lou; Binjia Zhou; Jie Lei; Mingli Song; Yijun Bei; | code |
| 618 | MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel method, named MagShield, designed to address the issue of magnetic disturbances in sparse inertial motion capture (MoCap) systems. |
Yunzhe Shao; Xinyu Yi; Lu Yin; Shihui Guo; Junhai Yong; Feng Xu; | code |
| 619 | Not All Frame Features Are Equal: Video-to-4D Generation Via Decoupling Dynamic-Static Features Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Thus, we propose a dynamic-static feature decoupling module (DSFD). |
Liying Yang; Chen Liu; Zhenwei Zhu; Ajian Liu; Hui Ma; Jian Nong; Yanyan Liang; | code |
| 620 | Multi-View 3D Point Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. |
Frano Rajič; Haofei Xu; Marko Mihajlovic; Siyuan Li; Irem Demir; Emircan Gündoğdu; Lei Ke; Sergey Prokudin; Marc Pollefeys; Siyu Tang; | code |
| 621 | Incremental Few-Shot Semantic Segmentation Via Multi-Level Switchable Visual Prompts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel prompt-based IFSS method with a visual prompt pool to store and switch multi-granular knowledge across stages, enhancing the model’s ability to learn new classes. |
Maoxian Wan; Kaige Li; Qichuan Geng; Weimin Shi; Zhong Zhou; | code |
| 622 | Enhancing Adversarial Transferability By Balancing Exploration and Exploitation with Gradient-Guided Sampling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Conversely, recent methods with inner-iteration sampling over-prioritize Exploration, i.e., flatter loss surfaces for cross-model generalization but weakened attack potency (suboptimal local maxima). To resolve this dilemma, we propose a simple yet effective Gradient-Guided Sampling (GGS) method, which harmonizes both objectives by guiding sampling along the gradient ascent direction to improve both sampling efficiency and stability. |
Zenghao Niu; Weicheng Xie; Siyang Song; Zitong Yu; Feng Liu; Linlin Shen; | code |
| 623 | VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos Via Tree-based Group Relative Policy Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: 2) How can a model dynamically adapt to complex hierarchical structures while accurately identifying key frames? To address these issues, we propose VideoMiner, which iteratively segments, captions, and clusters long videos, forming a hierarchical tree structure. |
Xinye Cao; Hongcan Guo; Jiawen Qian; Guoshun Nan; Chao Wang; Yuqi Pan; Tianhao Hou; Xiaojuan Wang; Yutong Gao; | code |
| 624 | FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a method, FlashDepth, that satisfies all three requirements, performing depth estimation for a 2044×1148 streaming video at 24 FPS. |
Gene Chou; Wenqi Xian; Guandao Yang; Mohamed Abdelfattah; Bharath Hariharan; Noah Snavely; Ning Yu; Paul Debevec; | code |
| 625 | Ensemble Foreground Management for Unsupervised Object Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces UnionCut, a robust and well-grounded foreground prior based on min-cut and ensemble methods that detects the union of foreground areas of an image, allowing UOD algorithms to identify foreground objects and stop discovery once the majority of the foreground union in the image is segmented. |
Ziling Wu; Armaghan Moemeni; Praminda Caleb-Solly; | code |
| 626 | MissRAG: Addressing The Missing Modality Challenge in Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate the extent to which MLLMs can maintain performance when faced with missing modality inputs. |
Vittorio Pipoli; Alessia Saporita; Federico Bolelli; Marcella Cornia; Lorenzo Baraldi; Costantino Grana; Rita Cucchiara; Elisa Ficarra; | code |
| 627 | OpenSubstance: A High-quality Measured Dataset of Multi-View and -Lighting Images and Shapes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present OpenSubstance, a high-quality measured dataset with 2.4 million high-dynamic-range images of 187 objects with a wide variety in shape and appearance, captured under 270 camera views and 1,637 lighting conditions, including 1,620 one-light-at-a-time, 8 environment, 8 linear and 1 full-on illumination. |
Fan Pei; Jinchen Bai; Xiang Feng; Zoubin Bi; Kun Zhou; Hongzhi Wu; | code |
| 628 | MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present MVQA, a Mamba-based model designed for efficient VQA along with a novel Unified Semantic and Distortion Sampling (USDS) approach. |
Yachun Mi; Yu Li; Weicheng Meng; Chaofeng Chen; Chen Hui; Shaohui Liu; | code |
| 629 | How Can Objects Help Video-Language Understanding? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: At the other extreme, image captions by themselves provide strong empirical performance on understanding tasks, despite missing fine-grained spatiotemporal information. To answer this question, we introduce ObjectMLLM, a framework capable of leveraging arbitrary computer vision algorithms to extract and integrate structured visual representations. |
Zitian Tang; Shijie Wang; Junho Cho; Jaewook Yoo; Chen Sun; | code |
| 630 | Breaking Grid Constraints: Dynamic Graph Reconstruction Network for Multi-organ Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response, we propose a novel multi-organ segmentation network via dynamic graph reconstruction, called DGRNet. |
Junhao Xiao; Yang Wei; Jingyu Wang; Yongchao Wang; Xiuli Bi; Bin Xiao; | code |
| 631 | Beyond Pixel Uncertainty: Bounding The OoD Objects in Road Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose DetSeg, a novel paradigm that helps incorporate object-level understanding. |
Huachao Zhu; Zelong Liu; Zhichao Sun; Yuda Zou; Gui-Song Xia; Yongchao Xu; | code |
| 632 | Soft Separation and Distillation: Toward Global Uniformity in Federated Unsupervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing solutions perform well in achieving intra-client (local) uniformity for local models while failing to achieve inter-client (global) uniformity after aggregation due to non-IID data distributions and the decentralized nature of FUL. To address this issue, we propose Soft Separation and Distillation (SSD), a novel approach that preserves inter-client uniformity by encouraging client representations to spread toward different directions. |
Hung-Chieh Fang; Hsuan-Tien Lin; Irwin King; Yifei Zhang; | code |
| 633 | DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces and examines integral spatio-temporal consistency, considering the synergy between plot progression and camera techniques, and the long-term impact of prior content on subsequent generation. |
Runze Zhang; Guoguang Du; Xiaochuan Li; Qi Jia; Liang Jin; Lu Liu; Jingjing Wang; Cong Xu; Zhenhua Guo; Yaqian Zhao; Xiaoli Gong; Rengang Li; Baoyu Fan; | code |
| 634 | LIRA: Reasoning Reconstruction Via Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This task inputs an implicit instruction involving complex reasoning and an RGB-D sequence, and outputs incremental 3D reconstruction of instances that conform to the instruction. To handle this task, we propose LIRA: Language Instructed Reconstruction Assistant. |
Zhen Zhou; Tong Wang; Yunkai Ma; Xiao Tan; Fengshui Jing; | code |
| 635 | EDM: Efficient Deep Feature Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent feature matching methods have achieved remarkable performance but give little consideration to efficiency. In this paper, we revisit the mainstream detector-free matching pipeline and improve all its stages considering both accuracy and efficiency. |
Xi Li; Tong Rao; Cihui Pan; | code |
| 636 | RTMap: Real-Time Recursive Mapping with Change Detection and Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While recent online HD mapping methods relieve burdened offline pipelines and solve map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolutional memory. |
Yuheng Du; Sheng Yang; Lingxuan Wang; Zhenghua Hou; Chengying Cai; Zhitao Tan; Mingxia Chen; Shi-Sheng Huang; Qiang Li; | code |
| 637 | CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a cross-view two-stage fusion network called CVFusion. |
Hanzhi Zhong; Zhiyu Xiang; Ruoyu Xu; Jingyun Fu; Peng Xu; Shaohong Wang; Zhihao Yang; Tianyu Pu; Eryun Liu; | code |
| 638 | DynamicFace: High-Quality and Consistent Face Swapping for Image and Video Using Composable 3D Facial Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose DynamicFace, a novel method that leverages the power of diffusion models and plug-and-play adaptive attention layers for image and video face swapping. |
Runqi Wang; Yang Chen; Sijie Xu; Tianyao He; Wei Zhu; Dejia Song; Nemo Chen; Xu Tang; Yao Hu; | code |
| 639 | Dataset Distillation Via Vision-Language Category Prototype Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we integrate vision-language methods into DD by introducing text prototypes to distill language information and collaboratively synthesize data with image prototypes, thereby enhancing dataset distillation performance. |
Yawen Zou; Guang Li; Duo Su; Zi Wang; Jun Yu; Chao Zhang; | code |
| 640 | Generalizable Non-Line-of-Sight Imaging with Learnable Physical Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, these approaches still possess limited generalization ability, particularly when dealing with scenes at a low signal-to-noise ratio (SNR). To overcome the above problems, we introduce a novel learning-based approach, comprising two key designs: Learnable Path Compensation (LPC) and Adaptive Phasor Field (APF). |
Shida Sun; Yue Li; Yueyi Zhang; Zhiwei Xiong; | code |
| 641 | Text-guided Visual Prompt DINO for Generic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid-prompt open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose Prompt-DINO, a text-guided visual Prompt DINO framework featuring three key innovations. |
Yuchen Guan; Chong Sun; Canmiao Fu; Zhipeng Huang; Chun Yuan; Chen Li; | code |
| 642 | Secure On-Device Video OOD Detection Without Backpropagation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While deploying personalized OOD detection directly on edge devices is desirable, it remains challenging due to large model sizes and the computational infeasibility of on-device training. Federated learning partially addresses this but still requires gradient computation and backpropagation, exceeding the capabilities of many edge devices. To overcome these challenges, we propose SecDOOD, a secure cloud-device collaboration framework for efficient on-device OOD detection without requiring device-side backpropagation. SecDOOD utilizes cloud resources for model training while ensuring user data privacy by retaining sensitive information on-device. |
Shawn Li; Peilin Cai; Yuxiao Zhou; Zhiyu Ni; Renjie Liang; You Qin; Yi Nian; Zhengzhong Tu; Xiyang Hu; Yue Zhao; | code |
| 643 | Learning Normal Flow Directly From Events Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel supervised point-based method for normal flow estimation that overcomes the limitations of existing event learning-based approaches. |
Dehao Yuan; Levi Burner; Jiayi Wu; Minghui Liu; Jingxi Chen; Yiannis Aloimonos; Cornelia Fermüller; | code |
| 644 | DEPTHOR: Depth Enhancement from A Practical Light-Weight DToF Sensor and RGB Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although existing super-resolution-based methods show promising results on public datasets, they often rely on idealized assumptions like accurate region correspondences and reliable dToF inputs, overlooking calibration errors that cause misalignment and anomaly signals inherent to dToF imaging, limiting real-world applicability. To address these challenges, we propose a novel completion-based method, named DEPTHOR, featuring advances in both the training strategy and model architecture. |
Jijun Xiang; Xuan Zhu; Xianqi Wang; Yu Wang; Hong Zhang; Fei Guo; Xin Yang; | code |
| 645 | Learning An Implicit Physics Model for Image-based Fluid Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This ability is rooted in our accumulated observations of similar scenes and an intuitive understanding of physics. In this paper, we aim to replicate this capacity in neural networks, specifically focusing on natural fluid imagery. |
Emily Yue-Ting Jia; Jiageng Mao; Zhiyuan Gao; Yajie Zhao; Yue Wang; | code |
| 646 | MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In MA-CIR, we observe that current CIR models struggle with negation (or replacement) arithmetic types and semantic types that require complex reasoning, indicating a potential reliance on object or concept information. To tackle this, we propose leveraging strong text encoders, particularly those based on large language models (LLMs), and fine-tuning them using carefully constructed text triplets that include hard negatives, thereby enhancing their compositional understanding. |
Jaeseok Byun; Young Kyun Jang; Seokhyeon Jeong; Donghyun Kim; Taesup Moon; | code |
| 647 | An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Instead, we introduce Reducing Task Discrepancy of Text encoders (RTD), an efficient text-only post-hoc framework that complements projection-based CIR methods. |
Jaeseok Byun; Seokhyeon Jeong; Wonjae Kim; Sanghyuk Chun; Taesup Moon; | code |
| 648 | Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although burst super-resolution effectively reduces noise and enhances spatial resolution, applying it to polarization imaging poses challenges due to the lack of tailored datasets and reliable ground truth noise statistics. To address these issues, we introduce PolarNS and PolarBurstSR, two innovative datasets developed specifically for polarization imaging. |
Inseung Hwang; Kiseok Choi; Hyunho Ha; Min H. Kim; | code |
| 649 | Splat-based 3D Scene Reconstruction with Extreme Motion-blur Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a splat-based 3D scene reconstruction method from RGB-D input that effectively handles extreme motion blur, a frequent challenge in low-light environments. |
Hyeonjoong Jang; Dongyoung Choi; Donggun Kim; Woohyun Kang; Min H. Kim; | code |
| 650 | Granular Concept Circuits: Toward A Fine-Grained Circuit Discovery for Concept Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce an effective circuit discovery method, called Granular Concept Circuit (GCC)(Code is available at https://github.com/daheekwon/GCC). |
Dahee Kwon; Sehyun Lee; Jaesik Choi; | code |
| 651 | 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph that explicitly incorporates semantic relationships. |
Tatiana Zemskova; Dmitry Yudin; | code |
| 652 | Im2Haircut: Single-view Strand-based Hair Reconstruction for Human Avatars Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a novel approach for 3D hair reconstruction from single photographs based on a global hair prior combined with local optimization. |
Vanessa Sklyarova; Egor Zakharov; Malte Prinzler; Giorgio Becherini; Michael J. Black; Justus Thies; | code |
| 653 | Uncalibrated Structure from Motion on A Sphere Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Through analysis of the relationship between focal length and spherical relative pose, we devise a global structure-from-motion approach for uncalibrated reconstruction. |
Jonathan Ventura; Viktor Larsson; Fredrik Kahl; | code |
| 654 | PointGAC: Geometric-Aware Codebook for Masked Point Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they tend to over-constrain the model to learn the details of the masked region, resulting in failure to capture generalized features. To address this limitation, we propose PointGAC, a novel clustering-based MPM method that aims to align the feature distribution of masked regions. |
Abiao Li; Chenlei Lv; Yuming Fang; Yifan Zuo; Jian Zhang; Guofeng Mei; | code |
| 655 | HiGarment: Cross-modal Harmony Based Diffusion Model for Flat Sketch to Realistic Garment Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: FS2RG presents two key challenges: 1) fabric characteristics are solely guided by textual prompts, providing insufficient visual supervision for diffusion-based models, which limits their ability to capture fine-grained fabric details; 2) flat sketches and textual guidance may provide conflicting information, requiring the model to selectively preserve or modify garment attributes while maintaining structural coherence. To tackle this task, we propose HiGarment, a novel framework that comprises two core components: i) a multi-modal semantic enhancement mechanism that enhances fabric representation across textual and visual modalities, and ii) a harmonized cross-attention mechanism that dynamically balances information from flat sketches and text prompts, allowing controllable synthesis by generating either sketch-aligned (image-biased) or text-guided (text-biased) outputs. |
Junyi Guo; Jingxuan Zhang; Fangyu Wu; Huanda Lu; Qiufeng Wang; Wenmian Yang; Eng Gee Lim; Dongming Lu; | code |
| 656 | MP-HSIR: A Multi-Prompt Framework for Universal Hyperspectral Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose MP-HSIR, a novel multi-prompt framework that effectively integrates spectral, textual, and visual prompts to achieve universal HSI restoration across diverse degradation types and intensities. |
Zhehui Wu; Yong Chen; Naoto Yokoya; Wei He; | code |
| 657 | Deep Space Weather Model: Long-Range Solar Flare Prediction from Multi-Wavelength Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Deep SWM also features a sparse masked autoencoder, a novel pretraining strategy that employs a two-phase masking approach to preserve crucial regions such as sunspots while compressing spatial information. Furthermore, we built FlareBench, a new public benchmark for solar flare prediction covering a full 11-year solar activity cycle, to validate our method. Our method outperformed baseline methods and even human experts on standard metrics of performance and reliability. |
Shunya Nagashima; Komei Sugiura; | code |
| 658 | You Are Your Own Best Teacher: Achieving Centralized-level Performance in Federated Learning Under Heterogeneous and Long-tailed Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, we find that the neural-collapse-inspired methods are not strong enough to reach neural collapse and still have huge gaps to centralized training. In this paper, we rethink this issue from a self-distillation perspective and propose FedYoYo (You Are Your Own Best Teacher), introducing Augmented Self-bootstrap Distillation (ASD) to improve representation learning by distilling knowledge between weakly and strongly augmented local samples, without needing extra datasets or models. |
Shanshan Yan; Zexi Li; Chao Wu; Meng Pang; Yang Lu; Yan Yan; Hanzi Wang; | code |
| 659 | Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While many methods explain black-box reasoning, most lack effective interventions or only operate at the sample level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). |
Nuoye Xiong; Anqi Dong; Ning Wang; Cong Hua; Guangming Zhu; Lin Mei; Peiyi Shen; Liang Zhang; | code |
| 660 | Monocular Semantic Scene Completion Via Masked Recurrent Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel two-stage framework that decomposes MSSC into coarse MSSC followed by the Masked Recurrent Network. |
Xuzhi Wang; Xinran Wu; Song Wang; Lingdong Kong; Ziping Zhao; | code |
| 661 | Prototype-based Contrastive Learning with Stage-wise Progressive Augmentation for Self-Supervised Fine-Grained Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we address the problem of Self-Supervised Learning (SSL) for fine-grained representation learning, which aims at distinguishing subtle differences within highly similar subordinate categories. |
Baofeng Tan; Xiu-Shen Wei; Lin Zhao; | code |
| 662 | AJAHR: Amputated Joint Aware 3D Human Mesh Recovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This assumption introduces bias when applied to individuals with amputations, a limitation further exacerbated by the scarcity of suitable datasets. To address this gap, we propose Amputated Joint Aware 3D Human Mesh Recovery (AJAHR), an adaptive pose estimation framework that improves mesh reconstruction for individuals with limb loss. |
Hyunjin Cho; Giyun Choi; Jongwon Choi; | code |
| 663 | Stealthy Backdoor Attack in Federated Learning Via Adaptive Layer-wise Gradient Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an adaptive layer-wise gradient alignment strategy to effectively evade various robust defense mechanisms while preserving attack strength. |
Qingqian Yang; Peishen Yan; Xiaoyu Wu; Jiaru Zhang; Tao Song; Yang Hua; Hao Wang; Liangliang Wang; Haibing Guan; | code |
| 664 | MaskSAM: Auto-prompt SAM with Mask Classification for Volumetric Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, SAM is not directly applicable to medical image segmentation due to its inability to predict semantic labels, reliance on additional prompts, and suboptimal performance in this domain. To address these limitations, we propose MaskSAM, a novel prompt-free SAM adaptation framework for medical image segmentation based on mask classification. |
Bin Xie; Hao Tang; Bin Duan; Dawen Cai; Yan Yan; Gady Agam; | code |
| 665 | FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative CLIP-based framework based on Forced prompt leArning (FA), designed to make full use of the In-Distribution (ID) knowledge and ultimately boost the effectiveness of OOD detection. |
Xinhua Lu; Runhe Lai; Yanqi Wu; Kanghao Chen; Wei-Shi Zheng; Ruixuan Wang; | code |
| 666 | DIA: The Adversarial Exposure of Deterministic Inversion in Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present the DDIM Inversion Attack (DIA) that attacks the integrated DDIM trajectory path. |
Seunghoo Hong; Geonho Son; Juhun Lee; Simon S. Woo; | code |
| 667 | Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, keeping more denoising steps makes the differences smaller, preserving redundancy, and making post-training compression more feasible. To systematically examine this, we propose PostDiff, a training-free framework for accelerating pre-trained diffusion models by reducing redundancy at both the input level and module level in a post-training manner. |
Zhenbang Du; Yonggan Fu; Lifu Wang; Jiayi Qian; Xiao Luo; Yingyan Celine Lin; | code |
| 668 | Latent Swap Joint Diffusion for 2D Long-Form Latent Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces Swap Forward (SaFa), a modality-agnostic and efficient method to generate seamless and coherent long spectrum and panorama using a latent swap joint diffusion process across multi-views. |
Yusheng Dai; Chenxi Wang; Chang Li; Chen Wang; Kewei Li; Jun Du; Lei Sun; Jianqing Gao; Ruoyu Wang; Jiefeng Ma; | code |
| 669 | JPEG Processing Neural Operator for Backward-Compatible Coding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present the JPEG Processing Neural Operator (JPNeO), a next-generation JPEG algorithm that maintains full backward compatibility with the current JPEG format. |
Woo Kyoung Han; Yongjun Lee; Byeonghun Lee; Sang Hyun Park; Sunghoon Im; Kyong Hwan Jin; | code |
| 670 | Neuromanifold-Regularized KANs for Shape-fair Feature Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: On the other hand, shape-fair networks reside on a low-degree neuromanifold. Motivated by this, we investigate neuromanifold regularization of KANs to enable learning of shape-fair feature representations. |
Mazlum Ferhat Arslan; Weihong Guo; Shuo Li; | code |
| 671 | MaskHand: Generative Masked Modeling for Robust Hand Mesh Reconstruction in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Traditional discriminative methods, which learn a deterministic mapping from a 2D image to a single 3D mesh, often struggle with the inherent ambiguities in 2D-to-3D mapping. To address this challenge, we propose MaskHand, a novel generative masked model for hand mesh recovery that synthesizes plausible 3D hand meshes by learning and sampling from the probabilistic distribution of the ambiguous 2D-to-3D mapping process. |
Muhammad Usama Saleem; Ekkasit Pinyoanuntapong; Mayur Jagdishbhai Patel; Hongfei Xue; Ahmed Helmy; Srijan Das; Pu Wang; | code |
| 672 | Q-Norm: Robust Representation Learning Via Quality-Adaptive Normalization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Drawing biological inspiration from the human visual system (HVS), which dynamically adjusts perception strategies through contrast gain control and selective attention to salient regions, we propose Quality-Adaptive Normalization (Q-Norm) – a novel normalization method that learns adaptive parameters guided by image quality features. |
Lanning Zhang; Ying Zhou; Fei Gao; Ziyun Li; Maoying Qiao; Jinlan Xu; Nannan Wang; | code |
| 673 | Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present the first work demonstrating that a pure Mamba block can achieve efficient Dense Global Fusion, meanwhile guaranteeing top performance for camera-LiDAR multi-modal 3D object detection. |
Hanshi Wang; Jin Gao; Weiming Hu; Zhipeng Zhang; | code |
| 674 | Learn2Synth: Learning Optimal Data Synthesis Using Hypergradients for Brain Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although powerful, this approach relies on the accurate tuning of a large set of hyperparameters that govern the probabilistic distribution of the synthesized images. Instead of manually tuning these parameters, we introduce Learn2Synth, a novel procedure in which synthesis parameters are learned using a small set of real labeled data. |
Xiaoling Hu; Xiangrui Zeng; Oula Puonti; Juan Eugenio Iglesias; Bruce Fischl; Yaël Balbastre; | code |
| 675 | MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes MamV2XCalib, the first V2X-based infrastructure camera calibration method with the assistance of vehicle-side LiDAR. |
Yaoye Zhu; Zhe Wang; Yan Wang; | code |
| 676 | The Devil Is in The Spurious Correlations: Boosting Moment Retrieval with Dynamic Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Namely, the model makes predictions by overly associating queries with background frames rather than distinguishing target moments. To address this issue, we propose a dynamic learning approach for moment retrieval, where two strategies are designed to mitigate the spurious correlation. |
Xinyang Zhou; Fanyue Wei; Lixin Duan; Angela Yao; Wen Li; | code |
| 677 | PRO-VPT: Distribution-Adaptive Visual Prompt Tuning Via Prompt Relocation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most prior art indiscriminately uses a fixed prompt distribution across different tasks, neglecting the importance of each block varying depending on the task. In this paper, we introduce adaptive distribution optimization (ADO) by tackling two key questions: (1) How to appropriately and formally define ADO, and (2) How to design an adaptive distribution strategy guided by this definition? |
Chikai Shang; Mengke Li; Yiqun Zhang; Zhen Chen; Jinlin Wu; Fangqing Gu; Yang Lu; Yiu-Ming Cheung; | code |
| 678 | Towards Adversarial Robustness Via Debiased High-Confidence Logit Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This spurious correlation bias leads to overfitting irrelevant background features during adversarial training, thereby degrading the model’s robust performance and generalization capabilities. To address this issue, we propose Debiased High-Confidence Adversarial Training (DHAT), a novel approach that aligns adversarial logits with debiased high-confidence logits and restores proper attention by enhancing foreground logit orthogonality. |
Kejia Zhang; Juanjuan Weng; Shaozi Li; Zhiming Luo; | code |
| 679 | No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce SPFSplat, an efficient framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training or inference. |
Ranran Huang; Krystian Mikolajczyk; | code |
| 680 | You Think, You ACT: The New Task of Arbitrary Text to Motion Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper extends limited Action Texts to arbitrary ones. |
Runqi Wang; Caoyuan Ma; Guopeng Li; Hanrui Xu; Yuke Li; Zheng Wang; | code |
| 681 | Beyond Single Images: Retrieval Self-Augmented Unsupervised Camouflaged Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose RISE, a RetrIeval SElf-augmented paradigm that exploits the entire training dataset to generate pseudo-labels for single images, which could be used to train COD models. |
Ji Du; Xin Wang; Fangwei Hao; Mingyang Yu; Chunyuan Chen; Jiesheng Wu; Bin Wang; Jing Xu; Ping Li; | code |
| 682 | To Label or Not to Label: PALM – A Predictive Model for Evaluating Sample Efficiency in Active Learning Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, traditional evaluation methods, which focus solely on final accuracy, fail to capture the full dynamics of the learning process. To address this gap, we propose PALM (Performance Analysis of Active Learning Models), a unified and interpretable mathematical model that characterizes AL trajectories through four key parameters: achievable accuracy, coverage efficiency, early-stage performance, and scalability. |
Julia Machnio; Mads Nielsen; Mostafa Mehdipour Ghazi; | code |
| 683 | SAGI: Semantically Aligned and Uncertainty Guided AI Image Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, generating semantically correct photorealistic imagery typically requires carefully crafted prompts and iterative refinement by evaluating the realism of the generated content – tasks commonly performed by humans. To automate the generative process, we propose Semantically Aligned and Uncertainty Guided AI Image Inpainting (SAGI), a model-agnostic pipeline to sample prompts from a distribution that closely aligns with human perception and to evaluate the generated content and discard instances that deviate from such a distribution, which we approximate using pretrained large language models and vision-language models. |
Paschalis Giakoumoglou; Dimitrios Karageorgiou; Symeon Papadopoulos; Panagiotis C. Petrantonakis; | code |
| 684 | HiNeuS: High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present HiNeuS, a unified framework that holistically addresses three core limitations in existing approaches: multi-view radiance inconsistency, missing keypoints in textureless regions, and structural degradation from over-enforced Eikonal constraints during joint optimization. |
Yida Wang; Xueyang Zhang; Kun Zhan; Peng Jia; Xianpeng Lang; | code |
| 685 | AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present AnyCalib, a method for calibrating the intrinsic parameters of a camera from a single in-the-wild image, that is agnostic to the camera model. |
Javier Tirado-Garín; Javier Civera; | code |
| 686 | Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning Via Synthetic Captions for Better Understanding of Long Text Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, limited by the input length of the text encoder, CLIP struggles on downstream tasks with long-text inputs (>77 tokens). To remedy this issue, we propose FIX-CLIP, which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images, respectively, which boosts the long-text representation while preserving the short-text ability. |
Bingchao Wang; Zhiwei Ning; Jianyu Ding; Xuanang Gao; Yin Li; Dongsheng Jiang; Jie Yang; Wei Liu; | code |
| 687 | Structure-aware Semantic Discrepancy and Consistency for 3D Medical Image Self-supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel perspective on 3D medical images with the goal of learning structure-aware representations. |
Tan Pan; Zhaorui Tan; Kaiyu Guo; Dongli Xu; Weidi Xu; Chen Jiang; Xin Guo; Yuan Qi; Yuan Cheng; | code |
| 688 | Pretrained Reversible Generation As Unsupervised Visual Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Pretrained Reversible Generation (PRG), which extracts unsupervised representations by reversing the generative process of a pretrained continuous generation model. |
Rongkun Xue; Jinouwen Zhang; Yazhe Niu; Dazhong Shen; Bingqi Ma; Yu Liu; Jing Yang; | code |
| 689 | SemGes: Semantics-aware Co-Speech Gesture Generation Using Semantic Coherence and Relevance Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel approach for semantic grounding in co-speech gesture generation that integrates semantic information at both fine-grained and global levels. |
Lanmiao Liu; Esam Ghaleb; Asli Ozyurek; Zerrin Yumak; | code |
| 690 | Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. |
Siyu Chen; Ting Han; Changshe Zhang; Xin Luo; Meiliu Wu; Guorong Cai; Jinhe Su; | code |
| 691 | CWNet: Causal Wavelet Network for Low-Light Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Traditional Low-Light Image Enhancement (LLIE) methods primarily focus on uniform brightness adjustment, often neglecting instance-level semantic information and the inherent characteristics of different features. To address these limitations, we propose CWNet (Causal Wavelet Network), a novel architecture that leverages wavelet transforms for causal reasoning. |
Tongshun Zhang; Pingping Liu; Yubing Lu; Mengen Cai; Zijian Zhang; Zhe Zhang; Qiuzhan Zhou; | code |
| 692 | Unbiased Missing-modality Multimodal Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods often suffer from modality generation bias: while certain modalities are generated with high fidelity, others, such as video, remain challenging due to intrinsic modality gaps, leading to imbalanced training. To address this issue, we propose MD^2N (Multi-stage Duplex Diffusion Network), a novel framework for unbiased missing-modality recovery. |
Ruiting Dai; Chenxi Li; Yandong Yan; Lisi Mo; Ke Qin; Tao He; | code |
| 693 | Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Degradation-Modeled Multipath Diffusion for tunable metalens photography, leveraging powerful natural image priors from pretrained models instead of large datasets. |
Jianing Zhang; Jiayi Zhu; Feiyu Ji; Xiaokang Yang; Xiaoyun Yuan; | code |
| 694 | RareCLIP: Rarity-aware Online Zero-shot Industrial Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although some batch-based approaches exploit the rarity by processing multiple samples concurrently, they generally introduce unacceptable latency for real-time applications. To mitigate these limitations, we propose RareCLIP, a novel online zero-shot anomaly detection framework that enables sequential image processing in real-time without requiring prior knowledge of the target domain. |
Jianfang He; Min Cao; Silong Peng; Qiong Xie; | code |
| 695 | Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent theoretical studies show that multimodal contrastive learning methods, such as CLIP, can disentangle latent representations up to linear transformations. In light of this, we propose the Causal CLIP Adapter (CCA), a novel framework that explicitly disentangles visual features extracted from CLIP using unsupervised Independent Component Analysis (ICA). |
Tianjiao Jiang; Zhen Zhang; Yuhang Liu; Javen Qinfeng Shi; | code |
| 696 | Mitigating Object Hallucinations Via Sentence-Level Early Intervention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. |
Shangpin Peng; Senqiao Yang; Li Jiang; Zhuotao Tian; | code |
| 697 | INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Previous work proves that query-based instance-level interaction reduces bandwidth demands and manual priors; however, LiDAR-focused implementations in collaborative perception remain underdeveloped, with performance still trailing state-of-the-art approaches. To bridge this gap, we propose INSTINCT (instance-level interaction architecture), a novel collaborative perception framework featuring three core components: 1) a quality-aware filtering mechanism for high-quality instance feature selection; 2) a dual-branch detection routing scheme to decouple collaboration-irrelevant and collaboration-relevant instances; 3) a Cross Agent Local Instance Fusion module to aggregate local hybrid instance features. |
Yunjiang Xu; Lingzhi Li; Jin Wang; Yupeng Ouyang; Benyuan Yang; | code |
| 698 | ArgoTweak: Towards Self-Updating HD Maps Through Structured Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a result, existing methods must rely on synthetic priors, which create inconsistencies and lead to a significant sim2real gap. To address this, we introduce ArgoTweak, the first dataset to complete the triplet with realistic map priors. |
Lena Wild; Rafael Valencia; Patric Jensfelt; | code |
| 699 | The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the significance of modality gaps in contrastive dual-encoders, we introduce the Inter-Intra Modal Measure (IIMM), a predictive metric that quantifies the relationship between intra-modal image embedding similarity and inter-modal misalignment. |
Laura Niss; Kevin Vogt-Lowell; Theodoros Tsiligkaridis; | code |
| 700 | GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While video-based reinforcement learning improves policy robustness, it remains constrained by the uncertainty of video generation and the challenges of collecting large-scale robot datasets for training diffusion models. To address these limitations, we propose GENFLOWRL, which derives shaped rewards from generated flow trained from diverse cross-embodiment datasets. |
Kelin Yu; Sheng Zhang; Harshit Soora; Furong Huang; Heng Huang; Pratap Tokekar; Ruohan Gao; | code |
| 701 | SAC-GNC: SAmple Consensus for Adaptive Graduated Non-Convexity Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel approach to adaptively anneal the shape parameter within a GNC framework. |
Valter Piedade; Chitturi Sidhartha; José Gaspar; Venu Madhav Govindu; Pedro Miraldo; | code |
| 702 | PLADIS: Pushing The Limits of Attention in Diffusion Models at Inference Time By Leveraging Sparsity Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel and efficient method, termed PLADIS, which boosts pre-trained models (U-Net/Transformer) by leveraging sparse attention. |
Kwanyoung Kim; Byeongsu Sim; | code |
| 703 | V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing distillation methods primarily rely on supervised fine-tuning (SFT), which often leads to mode collapse as pruned models with reduced capacity fail to directly match the teacher’s outputs, ultimately resulting in degraded quality. To address this challenge, we propose an effective distillation method that integrates DPO and SFT. |
Jisoo Kim; Wooseok Seo; Junwan Kim; Seungho Park; Sooyeon Park; Youngjae Yu; | code |
| 704 | DMQ: Dissecting Outliers of Diffusion Models for Post-Training Quantization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these approaches often overlook outliers, leading to degraded performance at low bit-widths. In this paper, we propose DMQ, which combines Learned Equivalent Scaling (LES) and channel-wise Power-of-Two Scaling (PTS) to effectively address these challenges. |
Dongyeun Lee; Jiwan Hur; Hyounguk Shon; Jae Young Lee; Junmo Kim; | code |
| 705 | C2MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These issues can hinder both the interpretability and generalization of the analysis. To tackle this, we introduce a dual structural causal model as the theoretical foundation and propose a novel and interpretable dual causal graph-based MIL model, C2MIL. |
Min Cen; Zhenfeng Zhuang; Yuzhe Zhang; Min Zeng; Baptiste Magnier; Lequan Yu; Hong Zhang; Liansheng Wang; | code |
| 706 | CMAD: Correlation-Aware and Modalities-Aware Distillation for Multimodal Sentiment Analysis with Missing Modalities Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Correlation-Aware and Modalities-Aware Distillation (CMAD), a unified framework designed for MSA under varying missing-modality conditions. |
Yan Zhuang; Minhao Liu; Wei Bai; Yanru Zhang; Xiaoyue Zhang; Jiawen Deng; Fuji Ren; | code |
| 707 | SemiVisBooster: Boosting Semi-Supervised Learning for Fine-Grained Classification Through Pseudo-Label Semantic Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces SemiVisBooster, an SSL enhancement approach that utilizes semantic information from label names to guide visual feature learning, addressing the challenges of fine-grained classification. |
Wenjin Zhang; Xinyu Li; Chenyang Gao; Ivan Marsic; | code |
| 708 | Diffusion-Based Extreme High-speed Scenes Reconstruction with The Complementary Vision Sensor Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To overcome these challenges, we leverage a novel complementary vision sensor, Tianmouc, which outputs high-speed, multi-bit, sparse spatio-temporal difference information with RGB frames. Building on this unique sensing modality, we introduce a Cascaded Bi-directional Recurrent Diffusion Model (CBRDM) that achieves accurate, sharp, color-rich video frame reconstruction. |
Yapeng Meng; Yihan Lin; Taoyi Wang; Yuguo Chen; Lijian Wang; Rong Zhao; | code |
| 709 | MorphoGen: Efficient Unconditional Generation of Long-Range Projection Neuronal Morphology Via A Global-to-Local Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose MorphoGen, a hierarchical framework integrating global structure prediction through denoising diffusion probabilistic models (DDPMs) with local neurites optimization. |
Tianfang Zhu; Hongyang Zhou; Anan Li; | code |
| 710 | Lightweight and Fast Real-time Image Enhancement Via Decomposition of The Spatial-aware Lookup Tables Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although spatial-aware 3D LUT methods address this limitation, they introduce additional modules that require a substantial number of parameters, leading to increased runtime as image resolution increases. To address this issue, we propose a method for generating image-adaptive LUTs by focusing on the redundant parts of the tables. |
Wontae Kim; Keuntek Lee; Nam Ik Cho; | code |
| 711 | Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present dSVA, a generative dual self-supervised ViT feature attack that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. |
Shangbo Wu; Yu-an Tan; Ruinan Ma; Wencong Ma; Dehua Zhu; Yuanzhang Li; | code |
| 712 | 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction Via Physics Simulation for Scene Update Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Understanding the 3D geometry of transparent objects from RGB images is challenging due to their inherent physical properties, such as reflection and refraction. To address these difficulties, especially in scenarios with sparse views and dynamic environments, we introduce TRAN-D, a novel 2D Gaussian Splatting-based depth reconstruction method for transparent objects. |
Jeongyun Kim; Seunghoon Jeong; Giseop Kim; Myung-Hwan Jeon; Eunji Jun; Ayoung Kim; | code |
| 713 | All in One: Visual-Description-Guided Unified Point Cloud Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. |
Zongyan Han; Mohamed El Amine Boudjoghra; Jiahua Dong; Jinhong Wang; Rao Muhammad Anwer; | code |
| 714 | Automated Model Evaluation for Object Detection Via Prediction Consistency and Reliability Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Prediction Consistency and Reliability (PCR), which leverages the multiple candidate bounding boxes that conventional detectors generate before non-maximum suppression (NMS). |
Seungju Yoo; Hyuk Kwon; Joong-Won Hwang; Kibok Lee; | code |
| 715 | MixANT: Observation-dependent Memory Propagation for Stochastic Dense Action Anticipation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present MixANT, a novel architecture for stochastic long-term dense anticipation of human activities. |
Syed Talal Wasim; Hamid Suleman; Olga Zatsarynna; Muzammal Naseer; Juergen Gall; | code |
| 716 | Multi-modal Segment Anything Model for Camouflaged Scene Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose leveraging the Segment Anything Model (SAM) to tackle this challenging task effectively. |
Guangyu Ren; Hengyan Liu; Michalis Lazarou; Tania Stathaki; | code |
| 717 | HOMO-Feature: Cross-Arbitrary-Modal Image Matching with Homomorphism of Organized Major Orientation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Instead of using deep models to conduct data-driven black-box learning, we introduce a Major Orientation Map (MOM), effectively combating image modal differences. |
Chenzhong Gao; Wei Li; Desheng Weng; | code |
| 718 | Serialization Based Point Cloud Oversegmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel serialization based point cloud oversegmentation method, which leverages serialization to avoid complex spatial queries, directly accessing neighboring points through sequence locality for similarity matching and superpoint clustering. |
Chenghui Lu; Jianlong Kwan; Dilong Li; Ziyi Chen; Haiyan Guan; | code |
| 719 | LOTA: Bit-Planes Guided AI-Generated Image Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce an effective bit-planes-guided noisy image generation method and exploit various image normalization strategies, including scaling and thresholding. |
Hongsong Wang; Renxi Cheng; Yang Zhang; Chaolei Han; Jie Gui; | code |
| 720 | ADCD-Net: Robust Document Image Forgery Localization Via Adaptive DCT Feature and Hierarchical Content Disentanglement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents ADCD-Net, a robust document forgery localization model that adaptively leverages the RGB/DCT forensic traces and integrates key characteristics of document images. |
Kahim Wong; Jicheng Zhou; Haiwei Wu; Yain-Whar Si; Jiantao Zhou; | code |
| 721 | Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a self-supervised method to improve an agent’s abilities in describing arbitrary objects while actively exploring a generic environment. |
Tommaso Galliena; Tommaso Apicella; Stefano Rosa; Pietro Morerio; Alessio Del Bue; Lorenzo Natale; | code |
| 722 | Robust Low-light Scene Restoration Via Illumination Transition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Robust Low-light Scene Restoration framework (Rose), which enables effective synthesis of novel views in normal lighting conditions from low-light multiview image inputs, by formulating the task as an illuminance transition estimation problem in 3D space and conceptualizing it as a specialized rendering task. |
Ze Li; Feng Zhang; Xiatian Zhu; Meng Zhang; Yanghong Zhou; P. Y. Mok; | code |
| 723 | SmolDocling: An Ultra-compact Vision-language Model for End-to-end Multi-modal Document Conversion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. |
Ahmed Nassar; Matteo Omenetti; Maksym Lysak; Nikolaos Livathinos; Christoph Auer; Lucas Morin; Rafael Teixeira de Lima; Yusik Kim; A. Said Gurbuz; Michele Dolfi; Peter W. J. Staar; | code |
| 724 | GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose GeoDistill, a Geometry guided weakly supervised self Distillation framework that uses teacher-student learning with Field-of-View (FoV)-based masking to enhance local feature learning for robust cross-view localization. |
Shaowen Tong; Zimin Xia; Alexandre Alahi; Xuming He; Yujiao Shi; | code |
| 725 | AIComposer: Any Style and Content Image Composition Via Feature Integration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents the first cross-domain image composition method that does not require text prompts, allowing natural stylization and seamless compositions. |
Haowen Li; Zhenfeng Fan; Zhang Wen; Zhengzhou Zhu; Yunjin Li; | code |
| 726 | FROSS: Faster-Than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods for 3D SSG generation, however, face significant challenges, including high computational demands and non-incremental processing that hinder their suitability for real-time open-world applications. To address these issues, we propose FROSS (Faster-than-Real-Time Online 3D Semantic Scene Graph Generation), an innovative approach for online and faster-than-real-time 3D SSG generation that leverages the direct lifting of 2D scene graphs to 3D space and represents objects as 3D Gaussian distributions. |
Hao-Yu Hou; Chun-Yi Lee; Motoharu Sonogashira; Yasutomo Kawanishi; | code |
| 727 | Dual Recursive Feedback on Generation and Appearance Latents for Pose-Robust Text-to-Image Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these models often struggle to accurately preserve spatial structures and fail to capture fine-grained conditions related to object poses and scene layouts. To address these challenges, we propose a training-free Dual Recursive Feedback (DRF) system that properly reflects control conditions in controllable T2I models. |
Jiwon Kim; Pureum Kim; SeonHwa Kim; Soobin Park; Eunju Cha; Kyong Hwan Jin; | code |
| 728 | Reference-based Super-Resolution Via Image-based Retrieval-Augmented Generation Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by RAG, we propose an image-based RAG framework (iRAG) for realistic super-resolution, which employs a trainable hashing function to retrieve either real-world or generated references given an LR query. |
Byeonghun Lee; Hyunmin Cho; Hong Gyu Choi; Soo Min Kang; Iljun Ahn; Kyong Hwan Jin; | code |
| 729 | Image-Guided Shape-from-Template Using Mesh Inextensibility Constraints Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an unsupervised SfT which uses only image observations: color features, gradients and silhouettes, along with a mesh inextensibility constraint to reconstruct at a 400x faster pace than the (best-performing) unsupervised SfT. |
Thuy Tran; Ruochen Chen; Shaifali Parashar; | code |
| 730 | Discretized Gaussian Representation for Tomographic Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose Discretized Gaussian Representation (DGR), a novel framework that reconstructs the 3D volume directly using a set of discretized Gaussian functions in an end-to-end manner. |
Shaokai Wu; Yuxiang Lu; Yapan Guo; Wei Ji; Suizhi Huang; Fengyu Yang; Shalayiding Sirejiding; Qichen He; Jing Tong; Yanbiao Ji; Yue Ding; Hongtao Lu; | code |
| 731 | FusionPhys: A Flexible Framework for Fusing Complementary Sensing Modalities in Remote Physiological Measurement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our key insight is that while visible light, near-infrared, and radar operate on distinct physical principles, they all capture temporally dynamic physiological signatures that can be represented as time-varying signals reflecting underlying physiological processes. Based on this insight, we propose FusionPhys, a novel framework that implements an adaptive integration mechanism to refine physiological information across complementary modalities. |
Chenhang Ying; Huiyu Yang; Jieyi Ge; Zhaodong Sun; Xu Cheng; Kui Ren; Xiaobai Li; | code |
| 732 | Divide-and-Conquer for Enhancing Unlabeled Learning, Stability, and Plasticity in Semi-supervised Continual Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: SSCL introduces complex challenges, including ensuring effective unlabeled learning (UL), while balancing memory stability (MS) and learning plasticity (LP). Previous SSCL efforts have typically focused on isolated aspects of the three, while this work presents USP, a divide-and-conquer framework designed to synergistically enhance these three aspects: (1) Feature Space Reservation (FSR) strategy for LP, which constructs reserved feature locations for future classes by shaping old classes into an equiangular tight frame; (2) Divide-and-Conquer Pseudo-labeling (DCP) approach for UL, which assigns reliable pseudo-labels across both high- and low-confidence unlabeled data; and (3) Class-mean-anchored Unlabeled Distillation (CUD) for MS, which reuses DCP’s outputs to anchor unlabeled data to stable class means for distillation to prevent forgetting. |
Yue Duan; Taicai Chen; Lei Qi; Yinghuan Shi; | code |
| 733 | Text-IRSTD: Leveraging Semantic Text to Promote Infrared Small Target Detection in Complex Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The main reason is that infrared small targets carry limited image information on their own; thus, relying only on visual features fails to discriminate targets from interferences, leading to lower detection performance. To address this issue, we introduce a novel approach leveraging semantic text to guide infrared small target detection, called Text-IRSTD. |
Feng Huang; Shuyuan Zheng; Zhaobing Qiu; Huanxian Liu; Huanxin Bai; Liqiong Chen; | code |
| 734 | Accelerating Diffusion Sampling Via Exploiting Local Transition Coherence Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This relationship does not impose any requirements on the network structure. Based on this observation, we propose a novel training-free acceleration method called LTC-Accel, which uses the identified relationship to estimate the current transition operator based on adjacent steps. |
Shangwen Zhu; Han Zhang; Zhantao Yang; Qianyu Peng; Zhao Pu; Huangji Wang; Fan Cheng; | code |
| 735 | Robust 3D-Masked Part-level Editing in 3D Gaussian Splatting with Regularized Score Distillation Sampling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, achieving precise local 3D edits remains challenging, especially for Gaussian Splatting, due to inconsistent multi-view 2D part segmentations and the inherently ambiguous nature of the Score Distillation Sampling (SDS) loss. To address these limitations, we propose RoMaP, a novel local 3D Gaussian editing framework that enables precise and drastic part-level modifications. |
Hayeon Kim; Ji Ha Jang; Se Young Chun; | code |
| 736 | Voyaging Into Perpetual Dynamic Scenes from A Single View Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Prior work learns such consistency by training on multiple views, but the generated scene regions often interpolate between training views and fail to generate perpetual views. To address this issue, we propose DynamicVoyager, which reformulates dynamic scene generation as a scene outpainting problem with new dynamic content. |
Fengrui Tian; Tianjiao Ding; Jinqi Luo; Hancheng Min; Rene Vidal; | code |
| 737 | BATCLIP: Bimodal Online Test-Time Adaptation for CLIP Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, we found that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose BATCLIP, a bimodal online TTA method designed to improve CLIP’s robustness to common image corruptions. |
Sarthak Maharana; Baoming Zhang; Leonid Karlinsky; Rogerio Feris; Yunhui Guo; | code |
| 738 | Transformer-based Tooth Alignment Prediction with Occlusion and Collision Constraints Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose an automatic tooth alignment neural network based on the Swin Transformer. |
Zhenxing Dong; Jiazhou Chen; | code |
| 739 | LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representation, which in turn affects the detailed understanding ability of MLLMs. To solve this, we propose LLaVA-SP, which only adds six spatial visual tokens to the original visual tokens to enhance the visual representation. |
Haoran Lou; Chunxiao Fan; Ziyan Liu; Yuexin Wu; Xinliang Wang; | code |
| 740 | RePoseD: Efficient Relative Pose Estimation With Known Depth Information Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate how monocular depth estimates can be used for relative pose estimation. |
Yaqing Ding; Viktor Kocur; Václav Vávra; Zuzana Berger Haladová; Jian Yang; Torsten Sattler; Zuzana Kukelova; | code |
| 741 | Visual Modality Prompt for Adapting Vision-Language Object Detectors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt VLDs to new modalities without degrading zero-shot performance. |
Heitor R. Medeiros; Atif Belal; Srikanth Muralidharan; Eric Granger; Marco Pedersoli; | code |
| 742 | Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces FundusExpert, an ophthalmology-specific MLLM with integrated positioning-diagnosis reasoning capabilities, along with FundusGen, a dataset constructed through the intelligent Fundus-Engine system. |
Xinyao Liu; Diping Song; | code |
| 743 | G2D: Boosting Multimodal Learning with Gradient-Guided Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representation and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation (G²D), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. |
Mohammed Rakib; Arunkumar Bagavathi; | code |
| 744 | Sparse-Dense Side-Tuner for Efficient Video Temporal Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of … |
David Pujol-Perich; Sergio Escalera; Albert Clapés; | code |
| 745 | Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, we propose the first proactive defense against model merging. |
Wei Junhao; Yu Zhe; Jun Sakuma; | code |
| 746 | Temporal Rate Reduction Clustering for Human Motion Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel approach for HMS, named Temporal Rate Reduction Clustering (TR²C), which jointly learns structured representations and affinity to segment the sequences of frames in video. |
Xianghan Meng; Zhengyu Tong; Zhiyuan Huang; Chun-Guang Li; | code |
| 747 | SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: On the contrary, a single model able to modulate its feature representations to accept diverse sensors as input would pave the way to agile and flexible multi-sensor RS data processing. To address this, we introduce SMARTIES, a generic and versatile foundation model lifting sensor-specific/dependent efforts and enabling scalability and generalization to diverse RS sensors: SMARTIES projects data from heterogeneous sensors into a shared spectrum-aware space, enabling the use of arbitrary combinations of bands both for training and inference. |
Gencer Sumbul; Chang Xu; Emanuele Dalsasso; Devis Tuia; | code |
| 748 | FED-PsyAU: Privacy-Preserving Micro-Expression Recognition Via Psychological AU Coordination and Dynamic Facial Motion Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, real-world applications encounter ME data privacy issues, leaving the task of enhancing recognition across settings under privacy constraints largely unexplored. To address these issues, we propose a FED-PsyAU research framework. |
Jingting Li; Yu Qian; Lin Zhao; Su-Jing Wang; | code |
| 749 | PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Projection-based Reduction of Implicit Spurious bias in vision-language Models (PRISM), a new data-free and task-agnostic solution for bias mitigation in VLMs like CLIP. |
Mahdiyar Molahasani; Azadeh Motamedi; Michael Greenspan; Il-Min Kim; Ali Etemad; | code |
| 750 | HumorDB: Can AI Understand Graphical Humor? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces HumorDB, a novel, controlled, and carefully curated dataset designed to evaluate and advance visual humor understanding by AI systems. |
Vedaant V Jain; Gabriel Kreiman; Felipe dos Santos Alves Feitosa; | code |
| 751 | GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present GaSLight, a method that generates spatially-varying lighting from regular images. |
Christophe Bolduc; Yannick Hold-Geoffroy; Jean-François Lalonde; | code |
| 752 | TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Meanwhile, despite producing holistically appealing text images, diffusion-based visual text generation methods struggle to synthesize accurate and realistic instance-level text at scale. To tackle this, we introduce TextSSR: a novel pipeline for Synthesizing Scene Text Recognition training data. |
Xingsong Ye; Yongkun Du; Yunbo Tao; Zhineng Chen; | code |
| 753 | Spherical Epipolar Rectification for Deep Two-View Absolute Depth Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel and principled spherical epipolar rectification model, which handles all camera motions. |
Pierre-André Brousseau; Sébastien Roy; | code |
| 754 | DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. |
Emery Pierson; Lei Li; Angela Dai; Maks Ovsjanikov; | code |
| 755 | XTrack: Multimodal Training Boosts RGB-X Video Object Trackers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We venture with a simple assumption: similar samples across different modalities have more knowledge to share than otherwise. To implement this, we employ a classifier with a weak loss tasked with distinguishing between modalities. |
Yuedong Tan; Zongwei Wu; Yuqian Fu; Zhuyun Zhou; Guolei Sun; Eduard Zamfir; Chao Ma; Danda Paudel; Luc Van Gool; Radu Timofte; | code |
| 756 | EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from A Mobile Device Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce EmbodiedSplat, a novel approach that personalizes policy training by efficiently capturing the deployment environment and fine-tuning policies within the reconstructed scenes. |
Gunjan Chhablani; Xiaomeng Ye; Muhammad Zubair Irshad; Zsolt Kira; | code |
| 757 | Dynamic Dictionary Learning for Remote Sensing Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work introduces a dynamic dictionary learning framework that explicitly models class ID embeddings through iterative refinement. |
Xuechao Zou; Yue Li; Shun Zhang; Kai Li; Shiying Wang; Pin Tao; Junliang Xing; Congyan Lang; | code |
| 758 | What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by Large Language Models (LLMs) as supervision signals for video encoders. |
Chi-Hsi Kung; Frangil Ramirez; Juhyung Ha; Yi-Ting Chen; David Crandall; Yi-Hsuan Tsai; | code |
| 759 | DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Deep learning approaches, including image/video generation and feature-based correspondence, have improved accuracy but struggle with occlusions, pose variations, and viewpoint changes. To address these challenges, we propose DACoN, a framework that leverages foundation models to capture part-level semantics, even in line drawings. |
Kazuma Nagata; Naoshi Kaneko; | code |
| 760 | Kaputt: A Large-Scale Dataset for Visual Defect Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a novel large-scale dataset for defect detection in a logistics setting. |
Sebastian Höfer; Dorian F. Henning; Artemij Amiranashvili; Douglas Morrison; Mariliza Tzes; Ingmar Posner; Marc Matvienko; Alessandro Rennola; Anton Milan; | code |
| 761 | Dynamic Group Detection Using VLM-augmented Temporal Groupness Graph Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes dynamic human group detection in videos. |
Kaname Yokoyama; Chihiro Nakatani; Norimichi Ukita; | code |
| 762 | Acknowledging Focus Ambiguity in Visual Questions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: No published work on visual question answering (VQA) accounts for ambiguity regarding where the content described in the question is located in the image. To fill this gap, we introduce VQ-FocusAmbiguity, the first VQA dataset that visually grounds each plausible image region a question could refer to when arriving at valid answers. |
Chongyan Chen; Yu-Yun Tseng; Zhuoheng Li; Anush Venkatesh; Danna Gurari; | code |
| 763 | DAViD: Data-efficient and Accurate Vision Models from Synthetic Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we demonstrate that it is possible to train models on much smaller but high-fidelity synthetic datasets, with no loss in accuracy and higher efficiency. |
Fatemeh Saleh; Sadegh Aliakbarian; Charlie Hewitt; Lohit Petikam; Xian Xiao; Antonio Criminisi; Thomas J. Cashman; Tadas Baltrusaitis; | code |
| 764 | DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To close this gap, our work explores learning the correspondence between LiDAR point clouds, human skeleton poses, IMU data, and text in a joint embedding space. More specifically, we present DeSPITE, a Deep Skeleton-Pointcloud-IMU-Text Embedding model, which effectively learns a joint embedding space across these four modalities. |
Thomas Kreutz; Max Mühlhäuser; Alejandro Sanchez Guinea; | code |
| 765 | EA-KD: Entropy-based Adaptive Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Entropy-based Adaptive Knowledge Distillation (EA-KD), a simple yet effective plug-and-play KD method that prioritizes learning from valuable samples. |
Chi-Ping Su; Ching-Hsun Tseng; Bin Pu; Lei Zhao; Jiewen Yang; Zhuangzhuang Chen; Shin-Jye Lee; | code |
| 766 | MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce MEMFOF, a memory-efficient multi-frame optical flow method that identifies a favorable trade-off between multi-frame estimation and GPU memory usage. |
Vladislav Bargatin; Egor Chistov; Alexander Yakovenko; Dmitriy Vatolin; | code |
| 767 | How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the task of material-controlled acoustic profile generation, where, given an indoor scene with specific audio-visual characteristics, the goal is to generate a target acoustic profile based on a user-defined material configuration at inference time. |
Mahnoor Fatima Saad; Ziad Al-Halah; | code |
| 768 | LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce LangTraj, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. |
Wei-Jer Chang; Wei Zhan; Masayoshi Tomizuka; Manmohan Chandraker; Francesco Pittaluga; | code |
| 769 | NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose NegRefine, a novel negative label refinement framework for zero-shot OOD detection. |
Amirhossein Ansari; Ke Wang; Pulei Xiong; | code |
| 770 | PHD: Personalized 3D Human Body Fitting with Point Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce PHD, a novel approach for personalized 3D human mesh recovery (HMR) and body fitting that leverages user-specific shape information to improve pose estimation accuracy from videos. |
Hsuan-I Ho; Chen Guo; Po-Chen Wu; Ivan Shugurov; Chengcheng Tang; Abhay Mittal; Sizhe An; Manuel Kaufmann; Linguang Zhang; | code |
| 771 | Synchronization of Multiple Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Temporal Prototype Learning (TPL), a prototype-based framework that constructs a shared, compact 1D representation from high-dimensional embeddings extracted by any of various pretrained models. |
Avihai Naaman; Ron Shapira Weber; Oren Freifeld; | code |
| 772 | Coordinate-based Speed of Sound Recovery for Aberration-Corrected Photoacoustic Computed Tomography Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce an efficient, self-supervised joint reconstruction method that recovers SOS and high-quality images for ring array PACT systems. |
Tianao Li; Manxiu Cui; Cheng Ma; Emma Alexander; | code |
| 773 | DialNav: Multi-turn Dialog Navigation with A Remote Guide Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce DialNav, a novel collaborative embodied dialog task, where a navigation agent (Navigator) and a remote guide (Guide) engage in multi-turn dialog to reach a goal location. |
Leekyeung Han; Hyunji Min; Gyeom Hwangbo; Jonghyun Choi; Paul Hongsuck Seo; | code |
| 774 | Puzzle Similarity: A Perceptually-guided Cross-Reference Metric for Artifact Detection in 3D Scene Reconstructions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a new cross-reference metric, Puzzle Similarity, which is designed to localize artifacts in novel views. |
Nicolai Hermann; Jorge Condor; Piotr Didyk; | code |
| 775 | Explaining Human Preferences Via Metrics for Structured 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a detailed discussion of automated metrics for evaluating structured 3D reconstructions. |
Jack Langerman; Denys Rozumnyi; Yuzhong Huang; Dmytro Mishkin; | code |
| 776 | ClaraVid: A Holistic Scene Reconstruction Benchmark From Aerial Perspective With Delentropy-Based Complexity Profiling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While synthetic datasets offer an alternative, existing options exhibit task-specific limitations, unrealistic scene compositions, and rendering artifacts that compromise real-world applicability. We introduce ClaraVid, a synthetic aerial dataset specifically designed to overcome these limitations. |
Radu Beche; Sergiu Nedevschi; | code |
| 777 | PROL: Rehearsal-Free Continual Learning in Streaming Data Via Prompt Online Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a novel prompt-based method for online continual learning that includes 4 main components: (1) a single lightweight prompt generator as general knowledge, (2) a trainable scaler-and-shifter as specific knowledge, (3) pre-trained model (PTM) generalization preservation, and (4) a hard-soft update mechanism. |
M. Anwar Ma’sum; Mahardhika Pratama; Savitha Ramasamy; Lin Liu; Habibullah Habibullah; Ryszard Kowalczyk; | code |
| 778 | FLOSS: Free Lunch in Open-vocabulary Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we challenge the conventional practice in Open-Vocabulary Semantic Segmentation (OVSS) of using averaged class-wise text embeddings, which are typically obtained by encoding each class name with multiple templates (e.g., a photo of a <class>, a sketch of a <class>). |
Yasser Benigmim; Mohammad Fahes; Tuan-Hung Vu; Andrei Bursuc; Raoul de Charette; | code |
| 779 | Correspondence-Free Fast and Robust Spherical Point Pattern Registration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we introduce three novel algorithms: (1) SPMC (Spherical Pattern Matching by Correlation), (2) FRS (Fast Rotation Search), and (3) a hybrid approach (SPMC+FRS) that combines the advantages of the previous two methods. |
Anik Sarker; Alan T. Asbeck; | code |
| 780 | Punching Bag Vs. Punching Person: Motion Transferability in Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: For example, can a model recognize the broad action "punching" when presented with an unseen variation such as "punching person"? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. |
Raiyaan Abdullah; Jared Claypoole; Michael Cogswell; Ajay Divakaran; Yogesh Rawat; | code |
| 781 | TITAN: Query-Token Based Domain Adaptive Adversarial Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To obtain reliable pseudo-labels, we propose a Target-based Iterative Query-Token Adversarial Network (TITAN) which separates the target images into two subsets that are similar to the source (easy) and those that are dissimilar (hard). |
Tajamul Ashraf; Janibul Bashir; | code |
| 782 | CanFields: Consolidating Diffeomorphic Flows for Non-Rigid 4D Interpolation from Arbitrary-Length Sequences Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Canonical Consolidation Fields (CanFields). |
Miaowei Wang; Changjian Li; Amir Vaxman; | code |
| 783 | Progressive Homeostatic and Plastic Prompt Tuning for Audio-Visual Multi-Task Incremental Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In the shallow phase, we design the task-shared modality aggregating adapter to foster cross-task and cross-modal audio-visual representation learning to enhance shared understanding between tasks. In the middle phase, we propose the task-specific modality-shared dynamic generating adapter, which constructs prompts tailored to individual tasks while remaining general across modalities, balancing the model’s ability to retain knowledge against forgetting with its potential for versatile multi-task transferability. |
Jiong Yin; Liang Li; Jiehua Zhang; Yuhan Gao; Chenggang Yan; Xichun Sheng; | code |
| 784 | SALAD — Semantics-Aware Logical Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose SALAD, a semantics-aware discriminative logical anomaly detection method that incorporates a newly proposed composition branch to explicitly model the distribution of object composition maps, consequently learning important semantic relationships. |
Matic Fučka; Vitjan Zavrtanik; Danijel Skočaj; | code |
| 785 | Beyond Perspective: Neural 360-Degree Video Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel method for training data augmentation that exploits the spherical characteristics of 360-degree video, which proves crucial for achieving maximum compression performance. |
Andy Regensky; Marc Windsheimer; Fabian Brand; Andre Kaup; | code |
| 786 | TimeBooth: Disentangled Facial Invariant Representation for Diverse and Personalized Face Aging Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This strategy maximizes the disentanglement of ID and age representation through bidirectional adversarial learning, extracting their attribute-invariant representations. Based on this representation, we propose TimeBooth, a personalized face aging model capable of generating diverse and individualized aging results. |
Zepeng Su; Zhulin Liu; Zongyan Zhang; Tong Zhang; C.L.Philip Chen; | code |
| 787 | EquiCaps: Predictor-Free Pose-Aware Pre-Trained Capsule Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, many methods rely on predictor architectures to encode equivariance, despite evidence that architectural choices, such as capsule networks, inherently excel at learning interpretable pose-aware representations. To explore this, we introduce EquiCaps (Equivariant Capsule Network), a capsule-based approach to pose-aware self-supervision that eliminates the need for a specialised predictor for enforcing equivariance. |
Athinoulla Konstantinou; Georgios Leontidis; Mamatha Thota; Aiden Durrant; | code |
| 788 | GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles Based on Probabilistic Cue Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Notably, we achieve comparable accuracy to existing zero-shot methods, despite training on a single dataset with a single-camera setup. |
Karlo Koledić; Luka Petrović; Ivan Marković; Ivan Petrović; | code |