CVPR 2025 Papers with Code & Data
To facilitate rapid community engagement with the presented research, we have compiled an extensive index of accepted papers that have associated public code or data repositories. We list all of them in the following table. This index was generated using an automated extraction process. While we strive for completeness, some papers with public resources may have been missed. Please inform us if you discover any additional papers that should be included. Readers should be aware that some code repositories may not be made fully public until the conference officially begins.
In addition to this index, we encourage readers to explore our related resources:
- CVPR-2025 papers & highlights: curated summaries and key takeaways from this year’s conference.
- “Best Paper” Digest (CVPR): a historical overview of the most influential CVPR papers published since 1988.
This curated list is created by the Paper Digest Team. Experience the cutting-edge capabilities of Paper Digest, an innovative AI-powered research platform that delivers personalized, comprehensive daily digests of the latest research in your field. It also empowers you to read and write articles, get answers, conduct literature reviews, and generate research reports.
Experience the full potential of our services today!
TABLE 1: CVPR 2025 Papers with Code & Data
| # | Paper | Author(s) | Code |
|---|---|---|---|
| 1 | Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. Highlight: Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. | Matt Deitke; Christopher Clark; Sangho Lee; Rohun Tripathi; Yue Yang; Jae Sung Park; Mohammadreza Salehi; Niklas Muennighoff; Kyle Lo; Luca Soldaini; Jiasen Lu; Taira Anderson; Erin Bransom; Kiana Ehsani; Huong Ngo; YenSung Chen; Ajay Patel; Mark Yatskar; Chris Callison-Burch; Andrew Head; Rose Hendrix; Favyen Bastani; Eli VanderBilt; Nathan Lambert; Yvonne Chou; Arnavi Chheda; Jenna Sparks; Sam Skjonsberg; Michael Schmitz; Aaron Sarnat; Byron Bischoff; Pete Walsh; Chris Newell; Piper Wolters; Tanmay Gupta; Kuo-Hao Zeng; Jon Borchardt; Dirk Groeneveld; Crystal Nam; Sophie Lebrecht; Caitlin Wittlif; Carissa Schoenick; Oscar Michel; Ranjay Krishna; Luca Weihs; Noah A. Smith; Hannaneh Hajishirzi; Ross Girshick; Ali Farhadi; Aniruddha Kembhavi | code |
| 2 | OmniGen: Unified Image Generation. Highlight: In this work, we introduce OmniGen, a new diffusion model for unified image generation. | Shitao Xiao; Yueze Wang; Junjie Zhou; Huaying Yuan; Xingrun Xing; Ruiran Yan; Chaofan Li; Shuting Wang; Tiejun Huang; Zheng Liu | code |
| 3 | Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation. Highlight: While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage that is conditioned on this representation. | Guy Yariv; Yuval Kirstain; Amit Zohar; Shelly Sheynin; Yaniv Taigman; Yossi Adi; Sagie Benaim; Adam Polyak | code |
| 4 | DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention. Highlight: In this paper, we aim to incorporate the sub-quadratic modeling capability of Gated Linear Attention (GLA) into the 2D diffusion backbone. | Lianghui Zhu; Zilong Huang; Bencheng Liao; Jun Hao Liew; Hanshu Yan; Jiashi Feng; Xinggang Wang | code |
| 5 | MambaVision: A Hybrid Mamba-Transformer Vision Backbone. Highlight: We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. | Ali Hatamizadeh; Jan Kautz | code |
| 6 | Let’s Verify and Reinforce Image Generation Step By Step. Highlight: In this paper, we provide the first comprehensive investigation into the potential of CoT reasoning to enhance autoregressive image generation. | Renrui Zhang; Chengzhuo Tong; Zhizheng Zhao; Ziyu Guo; Haoquan Zhang; Manyuan Zhang; Jiaming Liu; Peng Gao; Hongsheng Li | code |
| 7 | Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation. Highlight: In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. | Yuying Ge; Yizhuo Li; Yixiao Ge; Ying Shan | code |
| 8 | WonderWorld: Interactive 3D Scene Generation from A Single Image. Highlight: We present WonderWorld, a novel framework for interactive 3D scene generation that enables users to interactively specify scene contents and layout and see the created scenes in low latency. | Hong-Xing Yu; Haoyi Duan; Charles Herrmann; William T. Freeman; Jiajun Wu | code |
| 9 | MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models. Highlight: In this paper, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. | Wenyi Hong; Yean Cheng; Zhuoyi Yang; Weihan Wang; Lefan Wang; Xiaotao Gu; Shiyu Huang; Yuxiao Dong; Jie Tang | code |
| 10 | Foveated Instance Segmentation. Highlight: In this paper, we present a foveated instance segmentation (FovealSeg) framework that leverages real-time user gaze data to perform instance segmentation exclusively on instances of interest, resulting in substantial computational savings. | Hongyi Zeng; Wenxuan Liu; Tianhua Xia; Jinhui Chen; Ziyun Li; Sai Qian Zhang | code |
| 11 | Detect Any Mirrors: Boosting Learning Reliability on Large-Scale Unlabeled Data with An Iterative Data Engine. Highlight: To address this issue, we first collect a large-scale dataset of approximately 0.4 million mirror-related images from the internet, significantly expanding the data scale for mirror detection. To effectively exploit this unlabeled dataset, we propose the first semi-supervised framework (namely an iterative data engine) consisting of four steps: (1) mirror detection model training, (2) pseudo label prediction, (3) dual guidance scoring, and (4) selection of highly reliable pseudo labels. | Zhaohu Xing; Lihao Liu; Yijun Yang; Hongqiu Wang; Tian Ye; Sixiang Chen; Wenxue Li; Guang Liu; Lei Zhu | code |
| 12 | MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection. Highlight: To facilitate effective distillation, we introduce Monocular Teaching Assistant Knowledge Distillation (MonoTAKD), which proposes a camera-based teaching assistant (TA) model to transfer robust 3D visual knowledge to the student model, leveraging the smaller feature representation gap. | Hou-I Liu; Christine Wu; Jen-Hao Cheng; Wenhao Chai; Shian-Yun Wang; Gaowen Liu; Hugo Latapie; Jhih-Ciang Wu; Jenq-Neng Hwang; Hong-Han Shuai; Wen-Huang Cheng | code |
| 13 | Masking Meets Supervision: A Strong Learning Alliance. Highlight: In this paper, we propose a novel way to incorporate masking augmentations, dubbed Masked Sub-branch (MaskSub). | Byeongho Heo; Taekyung Kim; Sangdoo Yun; Dongyoon Han | code |
| 14 | LSNet: See Large, Focus Small. Highlight: In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a "See Large, Focus Small" strategy for lightweight vision network design. | Ao Wang; Hui Chen; Zijia Lin; Jungong Han; Guiguang Ding | code |
| 15 | Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos. Highlight: We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. | Linyi Jin; Richard Tucker; Zhengqi Li; David Fouhey; Noah Snavely; Aleksander Holynski | code |
| 16 | LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale. Highlight: In this paper, we explore large-scale training for Video LLM with cheap automatic speech recognition (ASR) transcripts. | Joya Chen; Ziyun Zeng; Yiqi Lin; Wei Li; Zejun Ma; Mike Zheng Shou | code |
| 17 | Reconstruction Vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models. Highlight: We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. | Jingfeng Yao; Bin Yang; Xinggang Wang | code |
| 18 | GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding. Highlight: In this paper, we introduce GaussTR, a novel Gaussian-based Transformer framework that unifies sparse 3D modeling with foundation model alignment through Gaussian representations to advance 3D spatial understanding. | Haoyi Jiang; Liu Liu; Tianheng Cheng; Xinjie Wang; Tianwei Lin; Zhizhong Su; Wenyu Liu; Xinggang Wang | code |
| 19 | Model Poisoning Attacks to Federated Learning Via Multi-Round Consistency. Highlight: In this work, we make a key observation that their suboptimal effectiveness arises from only leveraging model-update consistency among malicious clients within individual training rounds, making the attack effect self-cancel across training rounds. | Yueqi Xie; Minghong Fang; Neil Zhenqiang Gong | code |
| 20 | VisionZip: Longer Is Better But Not Necessary in Vision Language Models. Highlight: Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. | Senqiao Yang; Yukang Chen; Zhuotao Tian; Chengyao Wang; Jingyao Li; Bei Yu; Jiaya Jia | code |
| 21 | TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation. Highlight: We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. | Liao Qu; Huichao Zhang; Yiheng Liu; Xu Wang; Yi Jiang; Yiming Gao; Hu Ye; Daniel K. Du; Zehuan Yuan; Xinglong Wu | code |
| 22 | VoCo-LLaMA: Towards Vision Compression with Large Language Models. Highlight: We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. | Xubing Ye; Yukang Gan; Xiaoke Huang; Yixiao Ge; Yansong Tang | code |
| 23 | DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval. Highlight: In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. | Leqi Shen; Guoqiang Gong; Tianxiang Hao; Tao He; Yifeng Zhang; Pengzhang Liu; Sicheng Zhao; Jungong Han; Guiguang Ding | code |
| 24 | Bayesian Prompt Flow Learning for Zero-Shot Anomaly Detection. Highlight: However, these methods face several challenges: 1) handcrafted prompts require extensive expert knowledge and trial-and-error; 2) single-form learnable prompts struggle to capture complex anomaly semantics; and 3) an unconstrained prompt space limits generalization to unseen categories. To address these issues, we propose Bayesian Prompt Flow Learning (Bayes-PFL), which models the prompt space as a learnable probability distribution from a Bayesian perspective. | Zhen Qu; Xian Tao; Xinyi Gong; ShiChen Qu; Qiyu Chen; Zhengtao Zhang; Xingang Wang; Guiguang Ding | code |
| 25 | One-Minute Video Generation with Test-Time Training. Abstract: Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle to … | Karan Dalal; Daniel Koceja; Jiarui Xu; Yue Zhao; Shihao Han; Ka Chun Cheung; Jan Kautz; Yejin Choi; Yu Sun; Xiaolong Wang | code |
| 26 | HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator. Highlight: Multimodal large language models (MLLMs) promise better comprehension and reasoning but face their own challenges: (1) difficulty in fine-grained defect localization due to the limitations in capturing tiny details, and (2) constraints in providing pixel-wise outputs necessary for precise heatmap generation. To address these challenges, we propose HEIE: a novel MLLM-Based Hierarchical Explainable Image Implausibility Evaluator. | Fan Yang; Ru Zhen; Jianing Wang; Yanhao Zhang; Haoxiang Chen; Haonan Lu; Sicheng Zhao; Guiguang Ding | code |
| 27 | Docopilot: Improving Multimodal Models for Document-Level Understanding. Highlight: In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. | Yuchen Duan; Zhe Chen; Yusong Hu; Weiyun Wang; Shenglong Ye; Botian Shi; Lewei Lu; Qibin Hou; Tong Lu; Hongsheng Li; Jifeng Dai; Wenhai Wang | code |
| 28 | StoryGPT-V: Large Language Models As Consistent Story Visualizers. Highlight: Yet, the emerging Large Language Model (LLM) showcases robust reasoning abilities to navigate through ambiguous references and process extensive sequences. Therefore, we introduce StoryGPT-V, which leverages the merits of the latent diffusion (LDM) and LLM to produce images with consistent and high-quality characters grounded on given story descriptions. | Xiaoqian Shen; Mohamed Elhoseiny | code |
| 29 | Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation. Highlight: However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask^2DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. | Tianhao Qi; Jianlong Yuan; Wanquan Feng; Shancheng Fang; Jiawei Liu; SiYu Zhou; Qian He; Hongtao Xie; Yongdong Zhang | code |
| 30 | Mask-Adapter: The Devil Is in The Masks for Open-Vocabulary Segmentation. Highlight: Recent open-vocabulary segmentation methods adopt mask generators to predict segmentation masks and leverage pre-trained vision-language models, *e.g.*, CLIP, to classify these masks via mask pooling. Although these approaches show promising results, it is counterintuitive that accurate masks often fail to yield accurate classification results through pooling CLIP image embeddings within the mask regions. In this paper, we reveal the performance limitations of mask pooling and introduce **Mask-Adapter**, a simple yet effective method to address these challenges in open-vocabulary segmentation. Compared to directly using proposal masks, our proposed Mask-Adapter extracts *semantic activation maps* from proposal masks, providing richer contextual information and ensuring alignment between masks and CLIP. Additionally, we propose a *mask consistency loss* that encourages proposal masks with similar IoUs to obtain similar CLIP embeddings to enhance models’ robustness to varying predicted masks. Mask-Adapter integrates seamlessly into open-vocabulary segmentation methods based on mask pooling in a plug-and-play manner, delivering more accurate classification results. | Yongkang Li; Tianheng Cheng; Bin Feng; Wenyu Liu; Xinggang Wang | code |
| 31 | Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass. Highlight: In this work, we propose Fast 3D Reconstruction (Fast3R), a novel multi-view generalization to DUSt3R that achieves efficient and scalable 3D reconstruction by processing many views in parallel. | Jianing Yang; Alexander Sax; Kevin J. Liang; Mikael Henaff; Hao Tang; Ang Cao; Joyce Chai; Franziska Meier; Matt Feiszli | code |
| 32 | Learning Temporally Consistent Video Depth from Video Diffusion Priors. Highlight: Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context. | Jiahao Shao; Yuanbo Yang; Hongyu Zhou; Youmin Zhang; Yujun Shen; Vitor Guizilini; Yue Wang; Matteo Poggi; Yiyi Liao | code |
| 33 | AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities. Highlight: We propose AnySat, a multimodal model based on joint embedding predictive architecture (JEPA) and scale-adaptive spatial encoders, allowing us to train a single model on highly heterogeneous data in a self-supervised manner. | Guillaume Astruc; Nicolas Gonthier; Clément Mallet; Loic Landrieu | code |
| 34 | Continuous 3D Perception Model with Persistent State. Highlight: We present a unified framework capable of solving a broad range of 3D tasks. | Qianqian Wang; Yifei Zhang; Aleksander Holynski; Alexei A. Efros; Angjoo Kanazawa | code |
| 35 | Dual Consolidation for Pre-Trained Model-Based Domain-Incremental Learning. Highlight: To this end, we propose DUal ConsolidaTion (Duct) to unify and consolidate historical knowledge at both the representation and classifier levels. | Da-Wei Zhou; Zi-Wen Cai; Han-Jia Ye; Lijun Zhang; De-Chuan Zhan | code |
| 36 | Realistic Test-Time Adaptation of Vision-Language Models. Highlight: However, previous works on transductive or test-time adaptation (TTA) often make strong assumptions about the data distribution, such as the presence of all classes. Our work challenges these favorable deployment scenarios and introduces a more realistic evaluation framework, including (i) a variable number of effective classes for adaptation within a single batch, and (ii) non-i.i.d. batches of test samples in online adaptation settings. | Maxime Zanella; Clément Fuchs; Christophe De Vleeschouwer; Ismail Ben Ayed | code |
| 37 | SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images. Highlight: In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions. | Zixuan Huang; Mark Boss; Aaryaman Vasishta; James M. Rehg; Varun Jampani | code |
| 38 | WF-VAE: Enhancing Video VAE By Wavelet-Driven Energy Flow for Latent Video Diffusion Model. Highlight: Wavelet transform can decompose videos into multiple frequency-domain components and improve efficiency significantly; we thus propose Wavelet Flow VAE (WF-VAE), an autoencoder that leverages multi-level wavelet transform to facilitate low-frequency energy flow into latent representation. | Zongjian Li; Bin Lin; Yang Ye; Liuhan Chen; Xinhua Cheng; Shenghai Yuan; Li Yuan | code |
| 39 | 3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting. Highlight: In this work, we propose 3D Gaussian Unscented Transform (3DGUT), replacing the EWA splatting formulation with the Unscented Transform that approximates the particles through sigma points, which can be projected exactly under any nonlinear projection function. | Qi Wu; Janick Martinez Esturo; Ashkan Mirzaei; Nicolas Moënne-Loccoz; Zan Gojcic | code |
| 40 | PEACE: Empowering Geologic Map Holistic Understanding with MLLMs. Highlight: To quantify this gap, we construct **GeoMap-Bench**, the first-ever benchmark for evaluating MLLMs in geologic map understanding, which assesses the full-scale abilities in extracting, referring, grounding, reasoning, and analyzing. To bridge this gap, we introduce **GeoMap-Agent**, the inaugural agent designed for geologic map understanding, which features three modules: Hierarchical Information Extraction (HIE), Domain Knowledge Injection (DKI), and Prompt-enhanced Question Answering (PEQA). | Yangyu Huang; Tianyi Gao; Haoran Xu; Qihao Zhao; Yang Song; Zhipeng Gui; Tengchao Lv; Hao Chen; Lei Cui; Scarlett Li; Furu Wei | code |
| 41 | OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations. Highlight: Despite recent progress, current document parsing methods have not been fairly and comprehensively evaluated due to the narrow coverage of document types and the simplified, unrealistic evaluation procedures in existing benchmarks. To address these gaps, we introduce OmniDocBench, a novel benchmark featuring high-quality annotations across nine document sources, including academic papers, textbooks, and more challenging cases such as handwritten notes and densely typeset newspapers. | Linke Ouyang; Yuan Qu; Hongbin Zhou; Jiawei Zhu; Rui Zhang; Qunshu Lin; Bin Wang; Zhiyuan Zhao; Man Jiang; Xiaomeng Zhao; Jin Shi; Fan Wu; Pei Chu; Minghao Liu; Zhenxiang Li; Chao Xu; Bo Zhang; Botian Shi; Zhongying Tu; Conghui He | code |
| 42 | ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding Using Captions with Grounded Segmentation. Highlight: Although both research directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. | Ali Athar; Xueqing Deng; Liang-Chieh Chen | code |
| 43 | Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces. Highlight: We introduce the task of predicting functional 3D scene graphs for real-world indoor environments from posed RGB-D images. | Chenyangguang Zhang; Alexandros Delitzas; Fangjinhua Wang; Ruida Zhang; Xiangyang Ji; Marc Pollefeys; Francis Engelmann | code |
| 44 | Mono-InternVL: Pushing The Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training. Highlight: In this paper, we focus on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. | Gen Luo; Xue Yang; Wenhan Dou; Zhaokai Wang; Jiawen Liu; Jifeng Dai; Yu Qiao; Xizhou Zhu | code |
| 45 | TinyFusion: Diffusion Transformers Learned Shallow. Highlight: In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. | Gongfan Fang; Kunjun Li; Xinyin Ma; Xinchao Wang | code |
| 46 | ECBench: Can Multi-modal Foundation Models Understand The Egocentric World? A Holistic Embodied Cognition Benchmark. Highlight: Critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed. To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. | Ronghao Dang; Yuqian Yuan; Wenqi Zhang; Yifei Xin; Boqiang Zhang; Long Li; Liuyi Wang; Qinyang Zeng; Xin Li; Lidong Bing | code |
| 47 | Magma: A Foundation Model for Multimodal AI Agents. Highlight: We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. | Jianwei Yang; Reuben Tan; Qianhui Wu; Ruijie Zheng; Baolin Peng; Yongyuan Liang; Yu Gu; Mu Cai; Seonghyeon Ye; Joel Jang; Yuquan Deng; Jianfeng Gao | code |
| 48 | Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models. Highlight: To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. | Jiacong Xu; Shao-Yuan Lo; Bardia Safaei; Vishal M. Patel; Isht Dwivedi | code |
| 49 | Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise. Highlight: In this work, we enhance video diffusion models by allowing motion control via structured latent noise sampling. | Ryan Burgert; Yuancheng Xu; Wenqi Xian; Oliver Pilarski; Pascal Clausen; Mingming He; Li Ma; Yitong Deng; Lingxiao Li; Mohsen Mousavi; Michael Ryoo; Paul Debevec; Ning Yu | code |
| 50 | VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow. Highlight: We design a discretized voting space that accommodates all possible translations and then identify the one shared by nearby points by differentiable voting. | Yancong Lin; Shiming Wang; Liangliang Nan; Julian Kooij; Holger Caesar | code |
| 51 | GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving. Highlight: We propose GoalFlow, an end-to-end autonomous driving method for generating high-quality multimodal trajectories. | Zebin Xing; Xingyu Zhang; Yang Hu; Bo Jiang; Tong He; Qian Zhang; Xiaoxiao Long; Wei Yin | code |
| 52 | Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation. Highlight: However, our analysis of the combination paradigm reveals that the current one-shot and independent selection mechanism induces an incompatibility issue between distilled and real images. To address this issue, we introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation. | Yanda Chen; Gongwei Chen; Miao Zhang; Weili Guan; Liqiang Nie | code |
| 53 | Rethinking Reconstruction and Denoising in The Dark: New Perspective, General Architecture and Beyond. Highlight: This work introduces a novel approach by rethinking denoising and reconstruction from a "backbone-head" perspective, leveraging the stronger shared parameter space offered by the backbone, compared to the encoder used in existing works. | Tengyu Ma; Long Ma; Ziye Li; Yuetong Wang; Jinyuan Liu; Chengpei Xu; Risheng Liu | code |
| 54 | Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval. Highlight: In this paper, we move a step forward and design an approach that allows for multimodal queries – composed of both an image and a text – and can search within collections of multimodal documents, where images and text are interleaved. | Davide Caffagni; Sara Sarto; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara | code |
| 55 | VideoDPO: Omni-Preference Alignment for Video Diffusion Generation. Highlight: Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term the OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore and discover that re-weighting these pairs based on the score significantly impacts overall preference alignment. | Runtao Liu; Haoyu Wu; Ziqiang Zheng; Chen Wei; Yingqing He; Renjie Pi; Qifeng Chen | code |
| 56 | FoundationStereo: Zero-Shot Stereo Matching. Highlight: We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. | Bowen Wen; Matthew Trepte; Joseph Aribido; Jan Kautz; Orazio Gallo; Stan Birchfield | code |
| 57 | RandAR: Decoder-only Autoregressive Visual Generation in Random Orders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. |
Ziqi Pang; Tianyuan Zhang; Fujun Luan; Yunze Man; Hao Tan; Kai Zhang; William T. Freeman; Yu-Xiong Wang; | code |
| 58 | AIpparel: A Multimodal Foundation Model for Digital Garments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Yet, the creation of garments remains a time-consuming process, largely due to the manual work involved in designing them. To simplify this process, we introduce AIpparel, a multimodal foundation model for generating and editing sewing patterns. |
Kiyohiro Nakayama; Jan Ackermann; Timur Levent Kesdogan; Yang Zheng; Maria Korosteleva; Olga Sorkine-Hornung; Leonidas J. Guibas; Guandao Yang; Gordon Wetzstein; | code |
| 59 | Continuous Locomotive Crowd Behavior Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a novel method for automatically generating continuous, realistic crowd trajectories with heterogeneous behaviors and interactions among individuals. |
Inhwan Bae; Junoh Lee; Hae-Gon Jeon; | code |
| 60 | Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Pixel-level and Semantic-level Adjustable SR (PiSA-SR), which learns two LoRA modules upon the pre-trained stable-diffusion (SD) model to achieve improved and adjustable SR results. |
Lingchen Sun; Rongyuan Wu; Zhiyuan Ma; Shuaizheng Liu; Qiaosi Yi; Lei Zhang; | code |
| 61 | GS-DiT: Advancing Video Generation with Dynamic 3D Gaussian Fields Through Efficient Dense 3D Point Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose a novel framework that constructs dynamic 3D Gaussian fields with dense 3D point tracking and renders the Gaussian field for all video frames. |
Weikang Bian; Zhaoyang Huang; Xiaoyu Shi; Yijin Li; Fu-Yun Wang; Hongsheng Li; | code |
| 62 | Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. |
Kaihang Pan; Wang Lin; Zhongqi Yue; Tenglong Ao; Liyu Jia; Wei Zhao; Juncheng Li; Siliang Tang; Hanwang Zhang; | code |
| 63 | MNE-SLAM: Multi-Agent Neural SLAM for Mobile Robots Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose the first distributed multi-agent collaborative SLAM framework with distributed mapping and camera tracking, joint scene representation, intra-to-inter loop closure, and multi-submap fusion. |
Tianchen Deng; Guole Shen; Chen Xun; Shenghai Yuan; Tongxin Jin; Hongming Shen; Yanbo Wang; Jingchuan Wang; Hesheng Wang; Danwei Wang; Weidong Chen; | code |
| 64 | Perception Tokens Enhance Visual Reasoning in Multimodal Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose AURORA, a training method that augments MLMs with perception tokens for improved reasoning over visual inputs. |
Mahtab Bigverdi; Zelun Luo; Cheng-Yu Hsieh; Ethan Shen; Dongping Chen; Linda G. Shapiro; Ranjay Krishna; | code |
| 65 | HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce HunyuanPortrait, a diffusion-based condition control method that employs implicit representations for highly controllable and lifelike portrait animation. |
Zunnan Xu; Zhentao Yu; Zixiang Zhou; Jun Zhou; Xiaoyu Jin; Fa-ting Hong; Xiaozhong Ji; Junwei Zhu; Chengfei Cai; Shiyu Tang; Qin Lin; Xiu Li; Qinglin Lu; | code |
| 66 | BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose BiomedCoOp, a novel prompt learning framework that enables efficient adaptation of BiomedCLIP for accurate and highly generalizable few-shot biomedical image classification. |
Taha Koleilat; Hojat Asgariandehkordi; Hassan Rivaz; Yiming Xiao; | code |
| 67 | VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning Via Core Frame Selection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. |
Songhao Han; Wei Huang; Hairong Shi; Le Zhuo; Xiu Su; Shifeng Zhang; Xu Zhou; Xiaojuan Qi; Yue Liao; Si Liu; | code |
| 68 | Omnia De EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only few frames or commonsense reasoning, without being necessarily grounded in the actual video. |
Chiara Plizzari; Alessio Tonioni; Yongqin Xian; Achin Kulshrestha; Federico Tombari; | code |
| 69 | Koala-36M: A Large-scale Video Dataset Improving Consistency Between Fine-grained Conditions and Video Content Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing datasets exhibit various limitations in these areas. To address these challenges, we introduce Koala-36M, a large-scale, high-quality video dataset featuring accurate temporal splitting, detailed captions, and superior video quality. |
Qiuheng Wang; Yukai Shi; Jiarong Ou; Rui Chen; Ke Lin; Jiahao Wang; Boyuan Jiang; Haotian Yang; Mingwu Zheng; Xin Tao; Fei Yang; Pengfei Wan; Di Zhang; | code |
| 70 | DoF-Gaussian: Controllable Depth-of-Field for 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce DoF-Gaussian, a controllable depth-of-field method for 3D-GS. |
Liao Shen; Tianqi Liu; Huiqiang Sun; Jiaqi Li; Zhiguo Cao; Wei Li; Chen Change Loy; | code |
| 71 | VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we systematically study music generation conditioned solely on the video. |
Zeyue Tian; Zhaoyang Liu; Ruibin Yuan; Jiahao Pan; Qifeng Liu; Xu Tan; Qifeng Chen; Wei Xue; Yike Guo; | code |
| 72 | DrVideo: Document Retrieval Based Long Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Thus, we propose DrVideo, a document-retrieval-based system designed for long video understanding. |
Ziyu Ma; Chenhui Gou; Hengcan Shi; Bin Sun; Shutao Li; Hamid Rezatofighi; Jianfei Cai; | code |
| 73 | UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce UniGraspTransformer, a universal Transformer-based network for dexterous robotic grasping that simplifies training while enhancing scalability and performance. |
Wenbo Wang; Fangyun Wei; Lei Zhou; Xi Chen; Lin Luo; Xiaohan Yi; Yizhong Zhang; Yaobo Liang; Chang Xu; Yan Lu; Jiaolong Yang; Baining Guo; | code |
| 74 | EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. |
Yang Yue; Yulin Wang; Haojun Jiang; Pan Liu; Shiji Song; Gao Huang; | code |
| 75 | CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present CheXWorld, the first effort towards a self-supervised world model for radiographic images. |
Yang Yue; Yulin Wang; Chenxin Tao; Pan Liu; Shiji Song; Gao Huang; | code |
| 76 | Multiple Object Tracking As ID Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Therefore, we introduce a new perspective that treats Multiple Object Tracking as an in-context ID Prediction task, transforming the aforementioned object association into an end-to-end trainable task. Based on this, we propose a simple yet effective method termed MOTIP. |
Ruopeng Gao; Ji Qi; Limin Wang; | code |
| 77 | Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Also, recent transformer-based 4D backbones commonly suffer from large computational costs due to their quadratic complexity, particularly for long video sequences. To address these challenges, we propose a novel point cloud video understanding backbone purely based on the State Space Models (SSMs). |
Jiuming Liu; Jinru Han; Lihao Liu; Angelica I. Aviles-Rivero; Chaokang Jiang; Zhe Liu; Hesheng Wang; | code |
| 78 | Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? |
Yiming Dou; Wonseok Oh; Yuqing Luo; Antonio Loquercio; Andrew Owens; | code |
| 79 | SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce SLAM3R, a novel and effective system for real-time, high-quality, dense 3D reconstruction using RGB videos. |
Yuzheng Liu; Siyan Dong; Shuzhe Wang; Yingda Yin; Yanchao Yang; Qingnan Fan; Baoquan Chen; | code |
| 80 | You See It, You Got It: Learning 3D Creation on Pose-Free Videos at Scale Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. |
Baorui Ma; Huachen Gao; Haoge Deng; Zhengxiong Luo; Tiejun Huang; Lulu Tang; Xinlong Wang; | code |
| 81 | Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We argue that this problem can be solved through explicit cooperation among tasks. To achieve this goal, we propose a unified learning method which achieves explicit inter-task cooperation thoroughly, from the perspectives of both data and model. |
Henghui Du; Guangyao Li; Chang Zhou; Chunjie Zhang; Alan Zhao; Di Hu; | code |
| 82 | PMA: Towards Parameter-Efficient Point Cloud Understanding Via Point Mamba Adapter Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: It neglects the rich complementary information in the intermediate layer, thereby failing to fully unlock the potential of pre-trained models. To overcome this limitation, we propose an orthogonal solution: Point Mamba Adapter (PMA), which constructs an ordered feature sequence from all layers of the pre-trained model and leverages Mamba to fuse all complementary semantics, thereby promoting comprehensive point cloud understanding. |
Yaohua Zha; Yanzi Wang; Hang Guo; Jinpeng Wang; Tao Dai; Bin Chen; Zhihao Ouyang; Xue Yuerong; Ke Chen; Shu-Tao Xia; | code |
| 83 | DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose DoraCycle, which integrates two multimodal cycles: text-to-image-to-text and image-to-text-to-image. |
Rui Zhao; Weijia Mao; Mike Zheng Shou; | code |
| 84 | Visual Agentic AI for Spatial Reasoning with A Dynamic API Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. |
Damiano Marsili; Rohun Agrawal; Yisong Yue; Georgia Gkioxari; | code |
| 85 | QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on The Edge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To mitigate the performance degradation, we introduce an activation polishing and compensation algorithm applied before and after activation quantization, as well as a weight reconstruction method for minimizing errors in weight quantization. |
Xuan Shen; Weize Ma; Jing Liu; Changdi Yang; Rui Ding; Quanyi Wang; Henghui Ding; Wei Niu; Yanzhi Wang; Pu Zhao; Jun Lin; Jiuxiang Gu; | code |
| 86 | F-LMM: Grounding Frozen Large Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we comprehensively evaluate state-of-the-art grounding LMMs across a suite of multimodal question-answering benchmarks, observing drastic performance drops that indicate vanishing general knowledge comprehension and weakened instruction following ability. |
Size Wu; Sheng Jin; Wenwei Zhang; Lumin Xu; Wentao Liu; Wei Li; Chen Change Loy; | code |
| 87 | DRAWER: Digital Reconstruction and Articulation With Environment Realism Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present DRAWER, a novel framework that converts a video of a static indoor scene into a photorealistic and interactive digital environment. |
Hongchi Xia; Entong Su; Marius Memmel; Arhan Jain; Raymond Yu; Numfor Mbiziwo-Tiapo; Ali Farhadi; Abhishek Gupta; Shenlong Wang; Wei-Chiu Ma; | code |
| 88 | FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, real-time rendering with 3DGS remains a challenging problem, particularly in large-scale, high-resolution scenes due to the presence of numerous anisotropic Gaussian representations, and it has not been extensively explored. To address this challenge, we introduce FlashGS, an open-source CUDA library with Python bindings, featuring comprehensive algorithm design and optimizations, including redundancy elimination, adaptive scheduling, and efficient pipelining. |
Guofeng Feng; Siyan Chen; Rong Fu; Zimu Liao; Yi Wang; Tao Liu; Boni Hu; Linning Xu; Zhilin Pei; Hengjie Li; Xiuhong Li; Ninghui Sun; Xingcheng Zhang; Bo Dai; | code |
| 89 | Embracing Collaboration Over Competition: Condensing Multiple Prompts for Visual In-Context Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Multiple suitable prompts may exist, but individually they often fall short, leading to difficulties in selection and the exclusion of useful context. To address this, we propose a new perspective: prompt condensation. |
Jinpeng Wang; Tianci Luo; Yaohua Zha; Yan Feng; Ruisheng Luo; Bin Chen; Tao Dai; Long Chen; Yaowei Wang; Shu-Tao Xia; | code |
| 90 | HVI: A New Color Space for Low-light Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While converting the images using Hue, Saturation and Value (HSV) color space helps resolve the brightness issue, it introduces significant red and black noise artifacts. To address this issue, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by polarized HS maps and learnable intensity. |
Qingsen Yan; Yixu Feng; Cheng Zhang; Guansong Pang; Kangbiao Shi; Peng Wu; Wei Dong; Jinqiu Sun; Yanning Zhang; | code |
| 91 | Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This limitation not only hinders comprehensive representations of the 3D scene, but also compromises training and inference efficiency. To address these challenges, we propose a unified Instance-aware 3D Large Multi-modal Model (Inst3D-LMM) to deal with multiple 3D scene understanding tasks simultaneously. |
Hanxun Yu; Wentong Li; Song Wang; Junbo Chen; Jianke Zhu; | code |
| 92 | Everything to The Synthetic: Diffusion-driven Test-time Adaptation Via Synthetic-Domain Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in this paper, we reveal that although the synthetic data in diffusion-driven TTA seems indistinguishable from the source data, it is unaligned with, or even markedly different from the latter for deep networks. To address this issue, we propose a Synthetic-Domain Alignment (SDA) framework. |
Jiayi Guo; Junhao Zhao; Chaoqun Du; Yulin Wang; Chunjiang Ge; Zanlin Ni; Shiji Song; Humphrey Shi; Gao Huang; | code |
| 93 | Weakly Supervised Semantic Segmentation Via Progressive Confidence Region Expansion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the prevalent methods relying on vision transformers (ViT) encounter an "over-expansion" issue, i.e., CAM incorrectly expands high activation values from the target object to the background regions, as it is difficult to learn pixel-level local intrinsic inductive bias in ViT from weak supervision. To solve this problem, we propose a Progressive Confidence Region Expansion (PCRE) framework for WSSS, which gradually learns a faithful mask over the target region and utilizes this mask to correct the confusion in CAM. |
Xiangfeng Xu; Pinyi Zhang; Wenxuan Huang; Yunhang Shen; Haosheng Chen; Jingzhong Lin; Wei Li; Gaoqi He; Jiao Xie; Shaohui Lin; | code |
| 94 | Gen3DEval: Using VLLMs for Automatic Evaluation of Generated 3D Objects Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Rapid advancements in text-to-3D generation require robust and scalable evaluation metrics that align closely with human judgment, a need unmet by current metrics such as PSNR and CLIP, which require ground-truth data or focus only on prompt fidelity. To address this, we introduce Gen3DEval, a novel evaluation framework that leverages vision large language models (vLLMs) specifically fine-tuned for 3D object quality assessment. |
Shalini Maiti; Lourdes Agapito; Filippos Kokkinos; | code |
| 95 | Is Your World Simulator A Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models’ abilities to handle event-level story presentation. To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models’ story-completion capabilities. |
Yiping Wang; Xuehai He; Kuan Wang; Luyao Ma; Jianwei Yang; Shuohang Wang; Simon Shaolei Du; Yelong Shen; | code |
| 96 | FOCUS: Knowledge-enhanced Adaptive Visual Compression for Few-shot Whole Slide Image Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce the knowledge-enhanced adaptive visual compression framework, dubbed FOCUS, which uniquely combines pathology FMs with language prior knowledge to enable a focused analysis of diagnostically relevant regions by prioritizing discriminative WSI patches. |
Zhengrui Guo; Conghao Xiong; Jiabo Ma; Qichen Sun; Lishuang Feng; Jinzhuo Wang; Hao Chen; | code |
| 97 | Stretching Each Dollar: Diffusion Training from Scratch on A Micro-Budget Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As scaling laws in generative AI push performance, they simultaneously concentrate the development of these models among actors with large computational resources. With a focus on text-to-image (T2I) generative models, we aim to unlock this bottleneck by demonstrating very low-cost training of large-scale T2I diffusion transformer models. |
Vikash Sehwag; Xianghao Kong; Jingtao Li; Michael Spranger; Lingjuan Lyu; | code |
| 98 | Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Track4Gen, a spatially aware video generator that combines video diffusion loss with point tracking across frames, providing enhanced spatial supervision on the diffusion features. |
Hyeonho Jeong; Chun-Hao P. Huang; Jong Chul Ye; Niloy J. Mitra; Duygu Ceylan; | code |
| 99 | Omni-Scene: Omni-Gaussian Representation for Ego-Centric Sparse-View Scene Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The limitations of pixel-based representation thus hinder the utility of prior works in this task. In light of this, this paper conducts an in-depth analysis of different representations, and introduces Omni-Gaussian representation with tailored network design to complement their strengths and mitigate their drawbacks. |
Dongxu Wei; Zhiqi Li; Peidong Liu; | code |
| 100 | Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To safeguard personal videos from unauthorized use, we propose two series of protective video watermarks with imperceptible adversarial perturbations, named Ramblings and Mutes. |
Haitong Liu; Kuofeng Gao; Yang Bai; Jinmin Li; Jinxiao Shan; Tao Dai; Shu-Tao Xia; | code |
| 101 | KVQ: Boosting Video Quality Assessment Via Saliency-guided Local Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the Human Visual System (HVS) that links global quality to the local texture of different regions and their visual saliency, we propose a Kaleidoscope Video Quality Assessment (KVQ) framework, which aims to effectively assess both saliency and local texture, thereby facilitating the assessment of global quality. |
Yunpeng Qu; Kun Yuan; Qizhi Xie; Ming Sun; Chao Zhou; Jian Wang; | code |
| 102 | DTOS: Dynamic Time Object Sensing with Large Multimodal Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these, we propose a novel framework, Dynamic Time Object Sensing (DTOS), specifically designed for RVOS. |
Jirui Tian; Jinrong Zhang; Shenglan Liu; Luhao Xu; Zhixiong Huang; Gao Huang; | code |
| 103 | SegMAN: Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce SegMAN, a novel linear-time model comprising a hybrid feature encoder dubbed SegMAN Encoder, and a decoder based on state space models. |
Yunxiang Fu; Meng Lou; Yizhou Yu; | code |
| 104 | Pow3R: Empowering Unconstrained 3D Reconstruction with Camera and Scene Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Pow3R, a novel large 3D vision regression model that is highly versatile in the input modalities it accepts. |
Wonbong Jang; Philippe Weinzaepfel; Vincent Leroy; Lourdes Agapito; Jerome Revaud; | code |
| 105 | Uni4D: Unifying Visual Foundation Models for 4D Modeling from A Single Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a unified approach to understanding dynamic scenes from casual videos. |
David Yifan Yao; Albert J. Zhai; Shenlong Wang; | code |
| 106 | CoMBO: Conflict Mitigation Via Branched Optimization for Class Incremental Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The inherent conflict above often leads to a back-and-forth, which turns the objective into finding the balance between the performance of previous (old) and incremental (new) classes. To address this conflict, we introduce a novel approach, Conflict Mitigation via Branched Optimization (CoMBO). |
Kai Fang; Anqi Zhang; Guangyu Gao; Jianbo Jiao; Chi Harold Liu; Yunchao Wei; | code |
| 107 | SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less Than 0.2% Training Cost Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce SAM-I2V, an effective image-to-video upgradation method for cultivating a promptable video segmentation (PVS) model. |
Haiyang Mei; Pengyu Zhang; Mike Zheng Shou; | code |
| 108 | TexGaussian: Generating High-quality PBR Material Via Octree-based 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents TexGaussian, a novel method that uses octant-aligned 3D Gaussian Splatting for rapid PBR material generation. |
Bojun Xiong; Jialun Liu; Jiakui Hu; Chenming Wu; Jinbo Wu; Xing Liu; Chen Zhao; Errui Ding; Zhouhui Lian; | code |
| 109 | Parallelized Autoregressive Visual Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a simple yet effective approach for parallelized autoregressive visual generation that improves generation efficiency while preserving the advantages of autoregressive modeling. |
Yuqing Wang; Shuhuai Ren; Zhijie Lin; Yujin Han; Haoyuan Guo; Zhenheng Yang; Difan Zou; Jiashi Feng; Xihui Liu; | code |
| 110 | Hash3D: Training-free Acceleration for 3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Hash3D, a universal acceleration for 3D score distillation sampling (SDS) without model training. Central to Hash3D is the observation that images rendered from similar camera positions and diffusion time-steps often have redundant feature maps. |
Xingyi Yang; Songhua Liu; Xinchao Wang; | code |
| 111 | Frequency Dynamic Convolution for Dense Image Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While Dynamic Convolution (DY-Conv) has shown promising performance by enabling adaptive weight selection through multiple parallel weights combined with an attention mechanism, the frequency response of these weights tends to exhibit high similarity, resulting in high parameter costs but limited adaptability. In this work, we introduce Frequency Dynamic Convolution (FDConv), a novel approach that mitigates these limitations by learning a fixed parameter budget in the Fourier domain. |
Linwei Chen; Lin Gu; Liang Li; Chenggang Yan; Ying Fu; | code |
| 112 | One-shot 3D Object Canonicalization Based on Geometric and Semantic Consistency Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a novel joint energy function to enforce geometric and semantic consistency, aligning object orientations precisely despite significant shape variations. |
Li Jin; Yujie Wang; Wenzheng Chen; Qiyu Dai; Qingzhe Gao; Xueying Qin; Baoquan Chen; | code |
| 113 | Paint By Inpaint: Learning to Add Image Objects By Removing Them First Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Capitalizing on this realization, by implementing an automated and extensive pipeline, we curate a filtered large-scale image dataset containing pairs of images and their corresponding object-removed versions. Using these pairs, we train a diffusion model to inverse the inpainting process, effectively adding objects into images. |
Navve Wasserman; Noam Rotstein; Roy Ganz; Ron Kimmel; | code |
| 114 | BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present BASKET, a large-scale basketball video dataset for fine-grained skill estimation. |
Yulu Pan; Ce Zhang; Gedas Bertasius; | code |
| 115 | Two By Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Leveraging the 2BY2 dataset, we propose a two-step SE(3) pose estimation method with equivariant features for assembly constraints. |
Yu Qi; Yuanchen Ju; Tianming Wei; Chi Chu; Lawson L.S. Wong; Huazhe Xu; | code |
| 116 | FineVQ: Fine-Grained User Generated Content Video Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the challenges and promote the development of UGC videos, we establish the first large-scale Fine-grained Video quality assessment Database, termed FineVD, which comprises 6104 UGC videos with fine-grained quality scores and descriptions across multiple dimensions. Based on this database, we propose a Fine-grained Video Quality assessment (FineVQ) model to learn the fine-grained quality of UGC videos, with the capabilities of quality rating, quality scoring, and quality attribution. |
Huiyu Duan; Qiang Hu; Jiarui Wang; Liu Yang; Zitong Xu; Lu Liu; Xiongkuo Min; Chunlei Cai; Tianxiao Ye; Xiaoyun Zhang; Guangtao Zhai; | code |
| 117 | MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called "mesh attention" to enable training at 1024×1024 resolution. |
Yuhan Wang; Fangzhou Hong; Shuai Yang; Liming Jiang; Wayne Wu; Chen Change Loy; | code |
| 118 | GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote the generation of diverse solutions for complex reasoning tasks. |
Haoqiang Kang; Enna Sachdeva; Piyush Gupta; Sangjae Bae; Kwonjoon Lee; | code |
| 119 | UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, even within the same domain, current VAD approaches often require large amounts of normal samples to train class-specific models, resulting in poor generalizability and hindering unified evaluation across domains. To address this issue, we propose a generalized few-shot VAD method, UniVAD, capable of detecting anomalies across various domains, with a training-free unified model. |
Zhaopeng Gu; Bingke Zhu; Guibo Zhu; Yingying Chen; Ming Tang; Jinqiao Wang; | code |
| 120 | DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Taking into account that depth can be regarded as a geometry supplement for RGB images, a straightforward question arises: Do we really need to explicitly encode depth information with neural networks as done for RGB images? Based on this insight, in this paper, we investigate a new way to learn RGBD feature representations and present DFormerv2, a strong RGBD encoder that explicitly uses depth maps as geometry priors rather than encoding depth information with neural networks. |
Bo-Wen Yin; Jiao-Long Cao; Ming-Ming Cheng; Qibin Hou; | code |
| 121 | FLAIR: VLM with Fine-grained Language-informed Image Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. |
Rui Xiao; Sanghwan Kim; Mariana-Iuliana Georgescu; Zeynep Akata; Stephan Alaniz; | code |
| 122 | Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To tackle the problem, we propose to jointly learn motion and appearance codebooks and perform multi-scale codebook compensation to effectively refine both the facial motion conditions and appearance features for talking face image decoding. |
Shuling Zhao; Fa-Ting Hong; Xiaoshui Huang; Dan Xu; | code |
| 123 | Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, directly distilling 4D avatars from video diffusion often leads to over-smooth results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that synthesizes a spatial and temporal consistency dataset for 4D avatar reconstruction using the video diffusion model. |
Zhenglin Zhou; Fan Ma; Hehe Fan; Tat-Seng Chua; | code |
| 124 | UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we advocate and introduce the unrolling paradigm, also referred to as "learning to optimize", in the context of few-shot learning, thereby learning efficiently and effectively a set of optimized hyperparameters. |
Long Zhou; Fereshteh Shakeri; Aymen Sadraoui; Mounir Kaaniche; Jean-Christophe Pesquet; Ismail Ben Ayed; | code |
| 125 | A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This investigation indicates that the ability to form object-centric representations from the non-object-centric robotics dataset is the key to success for PVMs. Motivated by this discovery, we designed SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck that reduces the number of prototypes, encouraging the emergence of objectness, together with cross-view consistency regularization that encourages multiview invariance. |
Xin Wen; Bingchen Zhao; Yilun Chen; Jiangmiao Pang; Xiaojuan Qi; | code |
| 126 | Teaching Large Language Models to Regress Accurate Image Quality Scores Using Score Distribution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we aim to leverage MLLMs to regress accurate quality scores. |
Zhiyuan You; Xin Cai; Jinjin Gu; Tianfan Xue; Chao Dong; | code |
| 127 | Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Assembly of Global and Local Attention (AGLA) , a training-free and plug-and-play approach that mitigates hallucinations by assembling global features for response generation and local features for visual discrimination simultaneously. |
Wenbin An; Feng Tian; Sicong Leng; Jiahao Nie; Haonan Lin; Qianying Wang; Ping Chen; Xiaoqin Zhang; Shijian Lu; | code |
| 128 | PatchDPO: Patch-level DPO for Finetuning-free Personalized Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current finetuning-free methods simply adopt a single training stage with a simple image reconstruction task, and they typically generate low-quality images inconsistent with the reference images at test time. To mitigate this problem, inspired by the recent DPO (i.e., direct preference optimization) technique, this work proposes an additional training stage to improve the pre-trained personalized generation models. |
Qihan Huang; Long Chan; Jinlong Liu; Wanggui He; Hao Jiang; Mingli Song; Jie Song; | code |
| 129 | Decompositional Neural Scene Reconstruction with Generative Diffusion Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose DP-Recon, which employs diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each individual object under novel views. |
Junfeng Ni; Yu Liu; Ruijie Lu; Zirui Zhou; Song-Chun Zhu; Yixin Chen; Siyuan Huang; | code |
| 130 | EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce EarthDial, a conversational assistant specifically designed for Earth Observation (EO) data, transforming complex, multi-sensory Earth observations into interactive, natural language dialogues. |
Sagar Soni; Akshay Dudhane; Hiyam Debary; Mustansar Fiaz; Muhammad Akhtar Munir; Muhammad Sohail Danish; Paolo Fraccaro; Campbell D Watson; Levente J Klein; Fahad Shahbaz Khan; Salman Khan; | code |
| 131 | MATCHA: Towards Matching Anything Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Traditional methods are often specialized for specific correspondence types (geometric, semantic, or temporal), whereas humans naturally identify alignments across these domains. Inspired by this flexibility, we propose MATCHA, a unified feature model designed to "rule them all", establishing robust correspondences across diverse matching tasks. |
Fei Xue; Sven Elflein; Laura Leal-Taixé; Qunjie Zhou; | code |
| 132 | Your ViT Is Secretly An Image Segmentation Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. |
Tommie Kerssies; Niccolò Cavagnero; Alexander Hermans; Narges Norouzi; Giuseppe Averta; Bastian Leibe; Gijs Dubbelman; Daan de Geus; | code |
| 133 | HyperSeg: Hybrid Segmentation Assistant with Fine-grained Visual Perceiver Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper aims to address universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs). |
Cong Wei; Yujie Zhong; Haoxian Tan; Yong Liu; Jie Hu; Dengjie Li; Zheng Zhao; Yujiu Yang; | code |
| 134 | Pathways on The Image Manifold: Image Editing Via Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Simultaneously, video generation has made remarkable strides, with models that effectively function as consistent and continuous world simulators. In this paper, we propose merging these two fields by utilizing image-to-video models for image editing. |
Noam Rotstein; Gal Yona; Daniel Silver; Roy Velich; David Bensaid; Ron Kimmel; | code |
| 135 | PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models. |
Song Wang; Xiaolu Liu; Lingdong Kong; Jianyun Xu; Chunyong Hu; Gongfan Fang; Wentong Li; Jianke Zhu; Xinchao Wang; | code |
| 136 | Few-shot Implicit Function Generation Via Equivariance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This is challenging because even for the same signal, the optimal INRs can vary significantly depending on their initializations. To tackle this, we propose EquiGen, a framework that can generate new INRs from limited data. |
Suizhi Huang; Xingyi Yang; Hongtao Lu; Xinchao Wang; | code |
| 137 | RAP: Retrieval-Augmented Personalization for Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce the **R**etrieval **A**ugmented **P**ersonalization (RAP) framework for MLLMs’ personalization. |
Haoran Hao; Jiaming Han; Changsheng Li; Yu-Feng Li; Xiangyu Yue; | code |
| 138 | DiC: Rethinking Conv3x3 Designs in Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Based on the architecture, we introduce conditioning improvements including stage-specific embeddings, mid-block condition injection, and conditional gating. |
Yuchuan Tian; Jing Han; Chengcheng Wang; Yuchen Liang; Chao Xu; Hanting Chen; | code |
| 139 | Complexity Experts Are Task-Discriminative Learners for Any Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This hinders leveraging MoEs’ computational benefits by bypassing irrelevant experts during inference. We attribute this undesired behavior to the uniform and rigid architecture of traditional MoEs. To address this, we introduce "complexity experts": flexible expert blocks with varying computational complexity and receptive fields. |
Eduard Zamfir; Zongwei Wu; Nancy Mehta; Yuedong Tan; Danda Pani Paudel; Yulun Zhang; Radu Timofte; | code |
| 140 | Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors. |
Zexin He; Tengfei Wang; Xin Huang; Xingang Pan; Ziwei Liu; | code |
| 141 | Towards Practical Real-Time Neural Video Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a practical real-time neural video codec (NVC) designed to deliver high compression ratio, low latency and broad versatility. |
Zhaoyang Jia; Bin Li; Jiahao Li; Wenxuan Xie; Linfeng Qi; Houqiang Li; Yan Lu; | code |
| 142 | MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: How to enhance and systematically evaluate the cross-view consistency of such models remains under-explored. To address this issue, we propose MOVIS to enhance the structural awareness of the view-conditioned diffusion model for multi-object NVS in terms of model inputs, auxiliary tasks, and training strategy. |
Ruijie Lu; Yixin Chen; Junfeng Ni; Baoxiong Jia; Yu Liu; Diwen Wan; Gang Zeng; Siyuan Huang; | code |
| 143 | Invisible Backdoor Attack Against Self-supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes an imperceptible and effective backdoor attack against self-supervised models. |
Hanrong Zhang; Zhenting Wang; Boheng Li; Fulin Lin; Tingxu Han; Mingyu Jin; Chenlu Zhan; Mengnan Du; Hongwei Wang; Shiqing Ma; | code |
| 144 | Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a Character Detection Matching (CDM) metric, ensuring the evaluation objectivity by designing an image-level rather than a LaTeX-level metric score. |
Bin Wang; Fan Wu; Linke Ouyang; Zhuangcheng Gu; Rui Zhang; Renqiu Xia; Botian Shi; Bo Zhang; Conghui He; | code |
| 145 | UniSTD: Towards Unified Spatio-Temporal Learning Across Diverse Disciplines Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce UniSTD, a unified Transformer-based framework for spatiotemporal modeling, which is inspired by advances in recent foundation models with the two-stage pretraining-then-adaptation paradigm. |
Chen Tang; Xinzhu Ma; Encheng Su; Xiufeng Song; Xiaohong Liu; Wei-Hong Li; Lei Bai; Wanli Ouyang; Xiangyu Yue; | code |
| 146 | HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose HaWoR, a high-fidelity method for hand motion reconstruction in world coordinates from egocentric videos. |
Jinglei Zhang; Jiankang Deng; Chao Ma; Rolandos Alexandros Potamias; | code |
| 147 | MLVU: Benchmarking Multi-task Long Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark called MLVU (Multi-task Long Video Understanding Benchmark) for the comprehensive and in-depth evaluation of LVU. |
Junjie Zhou; Yan Shu; Bo Zhao; Boya Wu; Zhengyang Liang; Shitao Xiao; Minghao Qin; Xi Yang; Yongping Xiong; Bo Zhang; Tiejun Huang; Zheng Liu; | code |
| 148 | Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent approaches attempt task-specific design but rarely achieve "The Best of Both Worlds" due to inconsistent optimization goals. To address these issues, we propose SAGE, a novel method that leverages the semantic knowledge from the Segment Anything Model (SAM) to improve the quality of fusion results and enable downstream task adaptability. |
Guanyao Wu; Haoyu Liu; Hongming Fu; Yichuan Peng; Jinyuan Liu; Xin Fan; Risheng Liu; | code |
| 149 | 3D-GSW: 3D Gaussian Splatting for Robust Watermarking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As 3D Gaussian Splatting (3D-GS) gains significant attention and its commercial usage increases, the need for watermarking technologies to prevent unauthorized use of the 3D-GS models and rendered images has become increasingly important. In this paper, we introduce a robust watermarking method for 3D-GS that secures copyright of both the model and its rendered images. |
Youngdong Jang; Hyunje Park; Feng Yang; Heeju Ko; Euijin Choo; Sangpil Kim; | code |
| 150 | Reasoning to Attend: Try to Understand How <SEG> Token Works Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we first visualize the similarity maps, which are obtained by computing the semantic similarity between the <SEG> token and the image token embeddings derived from the last hidden layer in both the LLaVA encoder and SAM decoder. |
Rui Qian; Xin Yin; Dejing Dou; | code |
| 151 | Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most existing VIT datasets rely heavily on human annotations or paid services like the GPT API, which limits users with constrained resources from creating VIT datasets for custom applications. To address this, we introduce Pre-Instruction Data Selection (PreSel), a more practical data selection paradigm that directly selects the most beneficial unlabeled images and generates instructions only for the selected images. |
Bardia Safaei; Faizan Siddiqui; Jiacong Xu; Vishal M. Patel; Shao-Yuan Lo; | code |
| 152 | Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. |
Wenxuan Guo; Xiuwei Xu; Ziwei Wang; Jianjiang Feng; Jie Zhou; Jiwen Lu; | code |
| 153 | SkillMimic: Learning Basketball Interaction Skills from Demonstrations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce SkillMimic, a unified data-driven framework that fundamentally changes how agents learn interaction skills by eliminating the need for skill-specific rewards. |
Yinhuai Wang; Qihan Zhao; Runyi Yu; Hok Wai Tsui; Ailing Zeng; Jing Lin; Zhengyi Luo; Jiwen Yu; Xiu Li; Qifeng Chen; Jian Zhang; Lei Zhang; Ping Tan; | code |
| 154 | DPFlow: Adaptive Optical Flow Estimation with A Dual-Pyramid Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose DPFlow, an adaptive optical flow architecture capable of generalizing up to 8K resolution inputs while trained with only low-resolution samples. |
Henrique Morimitsu; Xiaobin Zhu; Roberto M. Cesar; Xiangyang Ji; Xu-Cheng Yin; | code |
| 155 | 3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Without hand-crafted regularizers, they tend to disperse irregularly around the actual surface. To circumvent these issues, we introduce a novel method, named 3D Convex Splatting (3DCS), which leverages 3D smooth convexes as primitives for modeling geometrically-meaningful radiance fields from multi-view images. |
Jan Held; Renaud Vandeghen; Abdullah Hamdi; Adrien Deliege; Anthony Cioppa; Silvio Giancola; Andrea Vedaldi; Bernard Ghanem; Marc Van Droogenbroeck; | code |
| 156 | Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present Closest Among Top-K (CAT-K) rollouts, a simple yet effective closed-loop fine-tuning strategy to mitigate covariate shift. |
Zhejun Zhang; Peter Karkus; Maximilian Igl; Wenhao Ding; Yuxiao Chen; Boris Ivanovic; Marco Pavone; | code |
| 157 | Exploring Intrinsic Normal Prototypes Within A Single Image for Universal Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We argue that this information is useful and may be more aligned with the anomalies since both the anomalies and the normal information originate from the same image. Therefore, rather than relying on external normality from the training set, we propose INP-Former, a novel method that extracts Intrinsic Normal Prototypes (INPs) directly from the test image. |
Wei Luo; Yunkang Cao; Haiming Yao; Xiaotian Zhang; Jianan Lou; Yuqi Cheng; Weiming Shen; Wenyong Yu; | code |
| 158 | Audio-Visual Instance Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a new multi-modal task, termed audio-visual instance segmentation (AVIS), which aims to simultaneously identify, segment and track individual sounding object instances in audible videos. |
Ruohao Guo; Xianghua Ying; Yaru Chen; Dantong Niu; Guangyao Li; Liao Qu; Yanyu Qi; Jinxing Zhou; Bowei Xing; Wenzhen Yue; Ji Shi; Qixun Wang; Peiliang Zhang; Buwen Liang; | code |
| 159 | Distilled Prompt Learning for Incomplete Multimodal Survival Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current approaches tackling incomplete modalities often fall short, as they typically compensate for only a limited part of the knowledge of missing modalities. To address this issue, we propose a Distilled Prompt Learning framework (DisPro) that exploits the strong robustness of Large Language Models (LLMs) to missing modalities, employing two-stage prompting to provide comprehensive compensatory information for the missing modalities. |
Yingxue Xu; Fengtao Zhou; Chenyu Zhao; Yihui Wang; Can Yang; Hao Chen; | code |
| 160 | OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present OverLoCK, the first pure ConvNet backbone architecture that explicitly incorporates a top-down attention mechanism. |
Meng Lou; Yizhou Yu; | code |
| 161 | DEAL: Data-Efficient Adversarial Learning for High-Quality Infrared Imaging Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The scarcity of high-quality infrared data, coupled with the challenges of dynamic, intricate degradations, makes it difficult to recover details using existing methods. In this paper, we introduce thermal degradation simulation integrated into the training process via a mini-max optimization, by modeling these degraded factors as adversarial attacks on thermal images. |
Zhu Liu; Zijun Wang; Jinyuan Liu; Fanqi Meng; Long Ma; Risheng Liu; | code |
| 162 | Video Depth Without Video Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. |
Bingxin Ke; Dominik Narnhofer; Shengyu Huang; Lei Ke; Torben Peters; Katerina Fragkiadaki; Anton Obukhov; Konrad Schindler; | code |
| 163 | MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent advances in generalizable 3D Gaussian Splatting have demonstrated promising results in real-time high-fidelity rendering without per-scene optimization, yet existing approaches still struggle to handle unfamiliar visual content during inference on novel scenes due to limited generalizability. To address this challenge, we introduce MonoSplat, a novel framework that leverages rich visual priors from pre-trained monocular depth foundation models for robust Gaussian reconstruction. |
Yifan Liu; Keyu Fan; Weihao Yu; Chenxin Li; Hao Lu; Yixuan Yuan; | code |
| 164 | SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Thus, we propose a new Semi-supervised framework for End-to-end Text Spotting, namely SemiETS, which leverages the complementarity of text detection and recognition. |
Dongliang Luo; Hanshen Zhu; Ziyang Zhang; Dingkang Liang; Xudong Xie; Yuliang Liu; Xiang Bai; | code |
| 165 | UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture. |
Ziyi Wang; Yanran Zhang; Jie Zhou; Jiwen Lu; | code |
| 166 | VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. |
Yunlong Tang; Junjia Guo; Hang Hua; Susan Liang; Mingqian Feng; Xinyang Li; Rui Mao; Chao Huang; Jing Bi; Zeliang Zhang; Pooyan Fazli; Chenliang Xu; | code |
| 167 | Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. |
Federico Cocchi; Nicholas Moratelli; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara; | code |
| 168 | MP-SfM: Monocular Surface Priors for Robust Structure-from-Motion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Because capturing images that avoid these pitfalls is challenging, this severely limits the wider use of SfM, especially by non-expert users. We overcome these limitations by augmenting the classical SfM paradigm with monocular depth and normal priors inferred by deep neural networks. |
Zador Pataki; Paul-Edouard Sarlin; Johannes L. Schönberger; Marc Pollefeys; | code |
| 169 | ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. |
Shaofei Cai; Zihao Wang; Kewei Lian; Zhancun Mu; Xiaojian Ma; Anji Liu; Yitao Liang; | code |
| 170 | Learning Hazing to Dehazing: Towards Realistic Haze Generation for Real-World Image Dehazing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, restoring heavily distorted information under dense haze requires generative diffusion models, whose potential in dehazing remains underutilized partly due to their lengthy sampling processes. To address these limitations, we introduce a novel hazing-dehazing pipeline consisting of a Realistic Hazy Image Generation framework (HazeGen) and a Diffusion-based Dehazing framework (DiffDehaze). |
Ruiyi Wang; Yushuo Zheng; Zicheng Zhang; Chunyi Li; Shuaicheng Liu; Guangtao Zhai; Xiaohong Liu; | code |
| 171 | LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. |
Hongjie Wang; Chih-Yao Ma; Yen-Cheng Liu; Ji Hou; Tao Xu; Jialiang Wang; Felix Juefei-Xu; Yaqiao Luo; Peizhao Zhang; Tingbo Hou; Peter Vajda; Niraj K. Jha; Xiaoliang Dai; | code |
| 172 | Are Spatial-Temporal Graph Convolution Networks for Human Action Recognition Over-Parameterized? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, despite the development of numerous models, their recognition performance does not differ significantly after aligning the input settings. With this observation, we hypothesize that ST-GCNs are over-parameterized for HAR, a conjecture subsequently confirmed through experiments employing the lottery ticket hypothesis. |
Jianyang Xie; Yitian Zhao; Yanda Meng; He Zhao; Anh Nguyen; Yalin Zheng; | code |
| 173 | HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel hybrid representation, termed as HybridGS, using 2D Gaussians for transient objects per image and maintaining traditional 3D Gaussians for the whole static scenes. |
Jingyu Lin; Jiaqi Gu; Lubin Fan; Bojian Wu; Yujing Lou; Renjie Chen; Ligang Liu; Jieping Ye; | code |
| 174 | MC^2: Multi-concept Guidance for Customized Multi-concept Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose MC^2, a novel approach for multi-concept customization that enhances flexibility and fidelity through inference-time optimization. |
Jiaxiu Jiang; Yabo Zhang; Kailai Feng; Xiaohe Wu; Wenbo Li; Renjing Pei; Fan Li; Wangmeng Zuo; | code |
| 175 | Decoupling Fine Detail and Global Geometry for Compressed Depth Map Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Second, densely distributed random noise reduces the accuracy of estimating the global geometric structure of the scene. To address these challenges, we propose a novel framework, termed geometry-decoupled network (GDNet), for compressed depth map super-resolution that decouples the high-quality depth map reconstruction process by handling global and detailed geometric features separately. |
Huan Zheng; Wencheng Han; Jianbing Shen; | code |
| 176 | AIGV-Assessor: Benchmarking and Evaluating The Perceptual Quality of Text-to-Video Generation with LMM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this challenge, we first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 diverse prompts. With these AIGVs, a systematic annotation pipeline including scoring and ranking processes is devised, which collects 370k expert ratings to date. |
Jiarui Wang; Huiyu Duan; Guangtao Zhai; Juntong Wang; Xiongkuo Min; | code |
| 177 | Community Forensics: Using Thousands of Generators to Train Fake Image Detectors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: One of the key challenges of detecting AI-generated images is spotting images that have been created by previously unseen generative models. We argue that the limited diversity of the training data is a major obstacle to addressing this problem, and we propose a new dataset that is significantly larger and more diverse than prior works. |
Jeongsoo Park; Andrew Owens; | code |
| 178 | Rectified Diffusion Guidance for Conditional Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we revisit the theory behind CFG and rigorously confirm that the improper configuration of the combination coefficients (*i.e.*, the widely used summing-to-one version) brings about expectation shift of the generative distribution. |
Mengfei Xia; Nan Xue; Yujun Shen; Ran Yi; Tieliang Gong; Yong-Jin Liu; | code |
| 179 | Detection-Friendly Nonuniformity Correction: A Union Framework for Infrared UAV Target Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a detection-friendly union framework, termed UniCD, that simultaneously addresses both infrared NUC and UAV target detection tasks in an end-to-end manner. |
Houzhang Fang; Xiaolin Wang; Zengyang Li; Lu Wang; Qingshan Li; Yi Chang; Luxin Yan; | code |
| 180 | Lessons and Insights from A Unifying Study of Parameter-Efficient Fine-Tuning (PEFT) in Visual Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we conduct a unifying empirical study of representative PEFT methods with Vision Transformers. |
Zheda Mai; Ping Zhang; Cheng-Hao Tu; Hong-You Chen; Quang-Huy Nguyen; Li Zhang; Wei-Lun Chao; | code |
| 181 | Advancing Semantic Future Prediction Through Multimodal Visual Sequence Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. |
Efstathios Karypidis; Ioannis Kakogeorgiou; Spyros Gidaris; Nikos Komodakis; | code |
| 182 | InsTaG: Learning Personalized 3D Talking Head from Few-Second Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces InsTaG, a 3D talking head synthesis framework that allows a fast learning of realistic personalized 3D talking head from few training data. |
Jiahe Li; Jiawei Zhang; Xiao Bai; Jin Zheng; Jun Zhou; Lin Gu; | code |
| 183 | EffiDec3D: An Optimized Decoder for High-Performance and Efficient 3D Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose EffiDec3D, an optimized 3D decoder that employs a channel reduction strategy across all decoder stages, which sets the number of channels to the minimum needed for accurate feature representation. |
Md Mostafijur Rahman; Radu Marculescu; | code |
| 184 | MagicArticulate: Make Your 3D Models Articulation-Ready Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present MagicArticulate, an effective framework that automatically transforms static 3D models into articulation-ready assets. |
Chaoyue Song; Jianfeng Zhang; Xiu Li; Fan Yang; Yiwen Chen; Zhongcong Xu; Jun Hao Liew; Xiaoyang Guo; Fayao Liu; Jiashi Feng; Guosheng Lin; | code |
| 185 | MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. |
Jiajun Cao; Yuan Zhang; Tao Huang; Ming Lu; Qizhe Zhang; Ruichuan An; Ningning Ma; Shanghang Zhang; | code |
| 186 | USP-Gaussian: Unifying Spike-based Image Reconstruction, Pose Correction and Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the cascading framework suffers from substantial cumulative errors, i.e., the quality of the initially reconstructed images will impact pose estimation, ultimately limiting the fidelity of the 3D reconstruction. To address this limitation, we propose a synergistic optimization framework USP-Gaussian, which unifies spike-to-image reconstruction, pose correction, and gaussian splatting into an end-to-end pipeline. |
Kang Chen; Jiyuan Zhang; Zecheng Hao; Yajing Zheng; Tiejun Huang; Zhaofei Yu; | code |
| 187 | WiLoR: End-to-end 3D Hand Localization and Reconstruction In-the-wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present a data-driven pipeline for efficient multi-hand reconstruction in the wild. |
Rolandos Alexandros Potamias; Jinglei Zhang; Jiankang Deng; Stefanos Zafeiriou; | code |
| 188 | Bridging The Vision-Brain Gap with An Uncertainty-Aware Blur Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, due to limited paired data, these gaps are difficult for the model to learn, leading to overfitting and poor generalization to new data. To address these gaps, we propose a simple yet effective approach called the Uncertainty-aware Blur Prior (UBP). |
Haitao Wu; Qing Li; Changqing Zhang; Zhen He; Xiaomin Ying; | code |
| 189 | APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they fail when applied to ViTs, primarily due to the inaccurate estimation of output importance and the substantial accuracy degradation in quantizing post-GELU activations. To address these issues, we propose APHQ-ViT, a novel PTQ approach based on importance estimation with Average Perturbation Hessian (APH). |
Zhuguanyu Wu; Jiayi Zhang; Jiaxin Chen; Jinyang Guo; Di Huang; Yunhong Wang; | code |
| 190 | FIMA-Q: Post-Training Quantization for Vision Transformers By Fisher Information Matrix Approximation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, most existing PTQ methods for Vision Transformers (ViTs) exhibit a notable drop in accuracy, especially in low-bit cases. To tackle these challenges, we analyze the extensively utilized Hessian-guided quantization loss, and uncover certain limitations within the approximated pre-activation Hessian. |
Zhuguanyu Wu; Shihe Wang; Jiayi Zhang; Jiaxin Chen; Yunhong Wang; | code |
| 191 | No Pains, More Gains: Recycling Sub-Salient Patches for Efficient High-Resolution Image Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless, many HRIR tasks necessitate the exploration of wider regions to model objects and contexts, which limits the performance of existing methods in such scenarios. To address this issue, we present a DBPS strategy to enable training with more patches at low computational cost. |
Rong Qin; Xin Liu; Xingyu Liu; Jiaxuan Liu; Jinglei Shi; Liang Lin; Jufeng Yang; | code |
| 192 | Boosting The Dual-Stream Architecture in Ultra-High Resolution Segmentation with Resolution-Biased Uncertainty Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most of them overly concentrate on crafting complex pipelines to pursue one of the above objectives separately, limiting model performance in both accuracy and inference cost. In this paper, we propose to achieve these objectives simultaneously by estimating resolution-biased uncertainties in the low-resolution stream. |
Rong Qin; Xingyu Liu; Jinglei Shi; Liang Lin; Jufeng Yang; | code |
| 193 | Perturb-and-Revise: Flexible 3D Editing with Generative Trajectories Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose Perturb-and-Revise, which enables a variety of NeRF editing operations. |
Susung Hong; Johanna Karras; Ricardo Martin-Brualla; Ira Kemelmacher-Shlizerman; | code |
| 194 | R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a covisibility graph-based global encoding learning and data augmentation strategy, along with a depth-adjusted reprojection loss to facilitate implicit triangulation. |
Xudong Jiang; Fangjinhua Wang; Silvano Galliani; Christoph Vogel; Marc Pollefeys; | code |
| 195 | EntityErasure: Erasing Entity Cleanly Via Amodal Entity Segmentation and Completion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents EntityErasure, a novel diffusion-based inpainting method that can effectively erase entities without inducing unwanted sundries. |
Yixing Zhu; Qing Zhang; Yitong Wang; Yongwei Nie; Wei-Shi Zheng; | code |
| 196 | Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we discover that many layers within lightweight ViT-based trackers tend to learn relatively redundant and repetitive target representations. |
Chaocan Xue; Bineng Zhong; Qihua Liang; Yaozong Zheng; Ning Li; Yuanliang Xue; Shuxiang Song; | code |
| 197 | FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Distilled models exhibit reduced robustness with open-set input images and a decreased correlation between audio and video compared to teacher models, undermining the advantages of diffusion models. To address this, we propose FADA (Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation). |
Tianyun Zhong; Chao Liang; Jianwen Jiang; Gaojie Lin; Jiaqi Yang; Zhou Zhao; | code |
| 198 | Tora: Trajectory-oriented Diffusion Transformer for Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces Tora, the first trajectory-oriented DiT framework that concurrently integrates textual, visual, and trajectory conditions, thereby enabling scalable video generation with effective motion guidance. |
Zhenghao Zhang; Junchao Liao; Menghao Li; ZuoZhuo Dai; Bingxue Qiu; Siyu Zhu; Long Qin; Weizhi Wang; | code |
| 199 | Assessing and Learning Alignment of Unimodal Vision and Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a direct assessment method, inspired by linear probing, to evaluate vision-language alignment. |
Le Zhang; Qian Yang; Aishwarya Agrawal; | code |
| 200 | FedSPA: Generalizable Federated Graph Learning Under Homophily Heterogeneity Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose FedSPA, an effective framework that addresses homophily heterogeneity from the perspectives of homophily conflict and homophily bias. |
Zihan Tan; Guancheng Wan; Wenke Huang; He Li; Guibin Zhang; Carl Yang; Mang Ye; | code |
| 201 | SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present SpectroMotion, a novel approach that combines 3D Gaussian Splatting (3DGS) with physically-based rendering (PBR) and deformation fields to reconstruct dynamic specular scenes. |
Cheng-De Fan; Chen-Wei Chang; Yi-Ruei Liu; Jie-Ying Lee; Jiun-Long Huang; Yu-Chee Tseng; Yu-Lun Liu; | code |
| 202 | LaVin-DiT: Large Vision Diffusion Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework. |
Zhaoqing Wang; Xiaobo Xia; Runnan Chen; Dongdong Yu; Changhu Wang; Mingming Gong; Tongliang Liu; | code |
| 203 | VASparse: Towards Efficient Visual Hallucination Mitigation Via Visual-Aware Token Sparsification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose an efficient plug-and-play decoding algorithm via Visual-Aware Sparsification (VASparse) from the perspective of token sparsity for mitigating VH. |
Xianwei Zhuang; Zhihong Zhu; Yuxin Xie; Liming Liang; Yuexian Zou; | code |
| 204 | MINIMA: Modality Invariant Image Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present MINIMA, a unified image matching framework for multiple cross-modal cases. |
Jiangwei Ren; Xingyu Jiang; Zizhuo Li; Dingkang Liang; Xin Zhou; Xiang Bai; | code |
| 205 | A Unified Image-Dense Annotation Generation Model for Underwater Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. |
Hongkai Lin; Dingkang Liang; Zhenghao Qi; Xiang Bai; | code |
| 206 | DiSRT-In-Bed: Diffusion-Based Sim-to-Real Transfer Framework for In-Bed Human Mesh Recovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing in-bed human mesh estimation methods often rely heavily on real-world data, limiting their ability to generalize across different in-bed scenarios, such as varying coverings and environmental settings. To address this, we propose a Sim-to-Real Transfer Framework for in-bed human mesh recovery from overhead depth images, which leverages large-scale synthetic data alongside limited or no real-world samples. |
Jing Gao; Ce Zheng; Laszlo A. Jeni; Zackory Erickson; | code |
| 207 | A Bias-Free Training Paradigm for More General AI-generated Image Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose B-Free, a bias-free training paradigm, where fake images are generated from real ones using the conditioning procedure of stable diffusion models. |
Fabrizio Guillaro; Giada Zingarini; Ben Usman; Avneesh Sud; Davide Cozzolino; Luisa Verdoliva; | code |
| 208 | Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance can provide a straightforward cue to address these obstacles. |
Buzhen Huang; Chen Li; Chongyang Xu; Dongyue Lu; Jinnan Chen; Yangang Wang; Gim Hee Lee; | code |
| 209 | GarmentPile: Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike single-garment manipulation, cluttered scenarios require managing complex garment entanglements and interactions, while maintaining garment cleanliness and manipulation stability. To address these demands, we propose to learn point-level affordance, the dense representation modeling the complex space and multi-modal manipulation candidates, with novel designs for the awareness of garment geometry, structure, and inter-object relations. |
Ruihai Wu; Ziyu Zhu; Yuran Wang; Yue Chen; Jiarui Wang; Hao Dong; | code |
| 210 | Dual-Granularity Semantic Guided Sparse Routing Diffusion Model for General Pansharpening Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the domain gap produced by varying satellite sensors and distinct scenes, we propose a dual-granularity semantic guided sparse routing diffusion model for general pansharpening. |
Yinghui Xing; Litao Qu; Shizhou Zhang; Di Xu; Yingkun Yang; Yanning Zhang; | code |
| 211 | EfficientViM: Efficient Vision Mamba with Hidden State Mixer Based State Space Duality Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To harness the efficacy of SSM, we introduce Efficient Vision Mamba (EfficientViM), a novel architecture built on hidden state mixer-based state space duality (HSM-SSD) that efficiently captures global dependencies with further reduced computational cost. |
Sanghyeok Lee; Joonmyung Choi; Hyunwoo J. Kim; | code |
| 212 | SVFR: A Unified Framework for Generalized Video Face Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel approach for the Generalized Video Face Restoration (GVFR) task, which integrates video blind face restoration (BFR), inpainting, and colorization tasks that we empirically show to benefit each other. |
Zhiyao Wang; Xu Chen; Chengming Xu; Junwei Zhu; Xiaobin Hu; Jiangning Zhang; Chengjie Wang; Yuqi Liu; Yiyi Zhou; Rongrong Ji; | code |
| 213 | SOAP: Vision-Centric 3D Semantic Scene Completion with Scene-Adaptive Decoder and Occluded Region-Aware View Projection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Furthermore, semantic classes exhibit high variability in their appearance in real-world driving scenarios. To address these issues, we introduce a novel 3D SSC method, called SOAP, including two key components: an occluded region-aware view projection and a scene-adaptive decoder. |
Hyo-Jun Lee; Yeong Jun Koh; Hanul Kim; Hyunseop Kim; Yonguk Lee; Jinu Lee; | code |
| 214 | Towards Transformer-Based Aligned Generation with Self-Coherence Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). |
Shulei Wang; Wang Lin; Hai Huang; Hanting Wang; Sihang Cai; WenKang Han; Tao Jin; Jingyuan Chen; Jiacheng Sun; Jieming Zhu; Zhou Zhao; | code |
| 215 | EnvGS: Modeling View-Dependent Appearance with Environment Gaussian Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce EnvGS, a novel approach that employs a set of Gaussian primitives as an explicit 3D representation for capturing reflections of environments. |
Tao Xie; Xi Chen; Zhen Xu; Yiman Xie; Yudong Jin; Yujun Shen; Sida Peng; Hujun Bao; Xiaowei Zhou; | code |
| 216 | A Unified Model for Compressed Sensing MRI Across Undersampling Patterns Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a unified MRI reconstruction model robust to various measurement undersampling patterns and image resolutions. |
Armeet Singh Jatyani; Jiayun Wang; Aditi Chandrashekar; Zihui Wu; Miguel Liu-Schiaffini; Bahareh Tolooshams; Anima Anandkumar; | code |
| 217 | Rethinking Temporal Fusion with A Unified Gradient Descent View for 3D Semantic Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present GDFusion, a temporal fusion method for vision-based 3D semantic occupancy prediction (VisionOcc). |
Dubing Chen; Huan Zheng; Jin Fang; Xingping Dong; Xianfei Li; Wenlong Liao; Tao He; Pai Peng; Jianbing Shen; | code |
| 218 | Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this method often lacks flexibility and is typically limited to realistic human shapes. To address these issues, we present Make-It-Animatable, a novel data-driven method to make any 3D humanoid model ready for character animation in less than one second, regardless of its shapes and poses. |
Zhiyang Guo; Jinxu Xiang; Kai Ma; Wengang Zhou; Houqiang Li; Ran Zhang; | code |
| 219 | Learning Dynamic Collaborative Network for Semi-supervised 3D Vessel Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a new dynamic collaborative network for semi-supervised 3D vessel segmentation, termed DiCo. |
Jiao Xu; Xin Chen; Lihe Zhang; | code |
| 220 | DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most video diffusion models are constrained by limited resolution and aspect ratio, which restricts their applicability to scene-level dynamic content synthesis. In this work, we propose DynamicScaler, addressing these challenges by enabling spatially scalable and panoramic dynamic scene synthesis that preserves coherence across panoramic scenes of arbitrary size. |
Jinxiu Liu; Shaoheng Lin; Yinxiao Li; Ming-Hsuan Yang; | code |
| 221 | TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To enhance realism, we introduce a three-stage pose-refinement pipeline that improves hand posture accuracy in generated videos. |
Hongxiang Zhao; Xingchen Liu; Mutian Xu; Yiming Hao; Weikai Chen; Xiaoguang Han; | code |
| 222 | Ev-3DOD: Pushing The Temporal Boundaries of 3D Object Detection with Event Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing algorithms fail to meet these requirements due to the latency and bandwidth limitations of fixed frame rate sensors, e.g., LiDAR and camera. To address this limitation, we introduce asynchronous event cameras into 3D object detection for the first time. |
Hoonhee Cho; Jae-Young Kang; Youngho Kim; Kuk-Jin Yoon; | code |
| 223 | Neuron: Learning Context-Aware Evolving Representations for Zero-Shot Skeleton Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods typically build skeleton-semantics interactions through uncontrollable mappings and conspicuous representations, and can thus hardly capture the intricate, fine-grained relationships needed for effective cross-modal transferability. To address these issues, we propose a novel dyNamically Evolving dUal skeleton-semantic syneRgistic framework with the guidance of cOntext-aware side informatioN (dubbed Neuron), to explore more fine-grained cross-modal correspondence from micro to macro perspectives at both spatial and temporal levels, respectively. |
Yang Chen; Jingcai Guo; Song Guo; Dacheng Tao; | code |
| 224 | AI-Face: A Million-Scale Demographically Annotated AI-Generated Face Dataset and Fairness Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce the AI-Face dataset, the first million-scale demographically annotated AI-generated face image dataset, including real faces, faces from deepfake videos, and faces generated by Generative Adversarial Networks and Diffusion Models. |
Li Lin; Santosh Santosh; Mingyang Wu; Xin Wang; Shu Hu; | code |
| 225 | ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation and human-scene reconstruction in an online fashion. |
Zetong Zhang; Manuel Kaufmann; Lixin Xue; Jie Song; Martin R. Oswald; | code |
| 226 | ILIAS: Instance-Level Image Retrieval At Scale Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work introduces ILIAS, a new test dataset for Instance-Level Image retrieval At Scale. |
Giorgos Kordopatis-Zilos; Vladan Stojnić; Anna Manko; Pavel Suma; Nikolaos-Antonios Ypsilantis; Nikos Efthymiadis; Zakaria Laskar; Jiri Matas; Ondrej Chum; Giorgos Tolias; | code |
| 227 | Cross-modal Information Flow in Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities—language and vision—in MLLMs, focusing on visual question answering. |
Zhi Zhang; Srishti Yadav; Fengze Han; Ekaterina Shutova; | code |
| 228 | Uncertainty-Instructed Structure Injection for Generalizable HD Map Construction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While recent studies demonstrate improved performance, their generalization capability across unfamiliar driving scenes remains unexplored. To tackle this issue, we propose UIGenMap, an uncertainty-instructed structure injection approach for generalizable HD map vectorization, which concerns the uncertainty resampling in statistical distribution and employs explicit instance features to reduce the excessive reliance on training data. |
Xiaolu Liu; Ruizi Yang; Song Wang; Wentong Li; Junbo Chen; Jianke Zhu; | code |
| 229 | TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces TurboFill, a fast image inpainting model that enhances a few-step text-to-image diffusion model with an inpainting adapter for high-quality and efficient inpainting. |
Liangbin Xie; Daniil Pakhomov; Zhonghao Wang; Zongze Wu; Ziyan Chen; Yuqian Zhou; Haitian Zheng; Zhifei Zhang; Zhe Lin; Jiantao Zhou; Chao Dong; | code |
| 230 | FlashSloth: Lightning Multimodal Large Language Models Via Embedded Visual Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a powerful and fast tiny MLLM called FlashSloth. |
Bo Tong; Bokai Lai; Yiyi Zhou; Gen Luo; Yunhang Shen; Ke Li; Xiaoshuai Sun; Rongrong Ji; | code |
| 231 | UHD-processer: Unified UHD Image Restoration with Progressive Frequency Learning and Degradation-aware Prompts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce UHD-Processor, a unified and robust framework for all-in-one image restoration, which is particularly resource-efficient for Ultra-High-Definition (UHD) images. |
Yidi Liu; Dong Li; Xueyang Fu; Xin Lu; Jie Huang; Zheng-Jun Zha; | code |
| 232 | Decision SpikeFormer: Spike-Driven Transformer for Decision Making Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce DSFormer, the first spike-driven transformer model designed to tackle offline RL via sequence modeling. |
Wei Huang; Qinying Gu; Nanyang Ye; | code |
| 233 | MOS: Modeling Object-Scene Associations in Generalized Category Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To more effectively leverage scene information, we propose the Modeling Object-Scene Associations (MOS) framework, which utilizes a simple MLP-based scene-awareness module to enhance GCD performance. |
Zhengyuan Peng; Jinpeng Ma; Zhimin Sun; Ran Yi; Haichuan Song; Xin Tan; Lizhuang Ma; | code |
| 234 | MagicQuill: An Intelligent Interactive Image Editing System Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we unveil MagicQuill, an integrated image editing system designed to support users in swiftly actualizing their creativity. |
Zichen Liu; Yue Yu; Hao Ouyang; Qiuyu Wang; Ka Leong Cheng; Wen Wang; Zhiheng Liu; Qifeng Chen; Yujun Shen; | code |
| 235 | Real-IAD D3: A Real-World 2D/Pseudo-3D/3D Dataset for Industrial Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Pioneering datasets like MVTec 3D have laid essential groundwork in multimodal IAD by incorporating RGB+3D data, but still face challenges in bridging the gap with real industrial environments due to limitations in scale and resolution. To address these challenges, we introduce Real-IAD D3, a high-precision multimodal dataset that uniquely incorporates an additional pseudo-3D modality generated through photometric stereo, alongside high-resolution RGB images and micrometer-level 3D point clouds. |
Wenbing Zhu; Lidong Wang; Ziqing Zhou; Chengjie Wang; Yurui Pan; Ruoyi Zhang; Zhuhao Chen; Linjie Cheng; Bin-Bin Gao; Jiangning Zhang; Zhenye Gan; Yuxie Wang; Yulong Chen; Shuguang Qian; Mingmin Chi; Bo Peng; Lizhuang Ma; | code |
| 236 | POT: Prototypical Optimal Transport for Weakly Supervised Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel framework called Prototypical Optimal Transport (POT) for WSSS. |
Jian Wang; Tianhong Dai; Bingfeng Zhang; Siyue Yu; Eng Gee Lim; Jimin Xiao; | code |
| 237 | Number It: Temporal Grounding Videos Like Flipping Manga Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. |
Yongliang Wu; Xinting Hu; Yuyang Sun; Yizhou Zhou; Wenbo Zhu; Fengyun Rao; Bernt Schiele; Xu Yang; | code |
| 238 | Towards RAW Object Detection in Diverse Conditions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the AODRaw dataset, which offers 7,785 high-resolution real RAW images with 135,601 annotated instances spanning 62 categories, capturing a broad range of indoor and outdoor scenes under 9 distinct light and weather conditions. |
Zhong-Yu Li; Xin Jin; Bo-Yuan Sun; Chun-Le Guo; Ming-Ming Cheng; | code |
| 239 | MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we introduce a novel Multi-View State Space Modeling framework, named MV-SSM, for robustly estimating 3D human keypoints. |
Aviral Chharia; Wenbo Gou; Haoye Dong; | code |
| 240 | DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce DocLayLLM, an efficient multi-modal extension of LLMs specifically designed for TDU. |
Wenhui Liao; Jiapeng Wang; Hongliang Li; Chengyu Wang; Jun Huang; Lianwen Jin; | code |
| 241 | ShiftwiseConv: Small Convolutional Kernel with Large Kernel Effect Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we reveal that the key hidden factors of large kernels can be summarized as two separate components: extracting features at a certain granularity and fusing features by multiple pathways. |
Dachong Li; Li Li; Zhuangzhuang Chen; Jianqiang Li; | code |
| 242 | Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Large Model Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a method called Align-KD to guide the student model to learn the cross-modal matching that occurs at the shallow layer. |
Qianhan Feng; Wenshuo Li; Tong Lin; Xinghao Chen; | code |
| 243 | DiffLocks: Generating 3D Hair from A Single Image Using Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These approaches fail to reconstruct detailed hair, struggle with curly hair, or are limited to handling only a few hairstyles. To overcome these limitations, we propose DiffLocks, a novel framework that enables detailed reconstruction of a wide variety of hairstyles directly from a single image. |
Radu Alexandru Rosu; Keyu Wu; Yao Feng; Youyi Zheng; Michael J. Black; | code |
| 244 | EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The existing methods have two primary deficiencies: (1) They struggle to simultaneously hold audio-visual sync and achieve clear pronunciation; (2) They lack the capacity to express user-defined emotions. To address these problems, we propose EmoDubber, an emotion-controllable dubbing architecture that allows users to specify emotion type and emotional intensity while satisfying high-quality lip sync and pronunciation. |
Gaoxiang Cong; Jiadong Pan; Liang Li; Yuankai Qi; Yuxin Peng; Anton van den Hengel; Jian Yang; Qingming Huang; | code |
| 245 | ICE: Intrinsic Concept Extraction from A Single Image Via Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods lack a systematic way to reliably extract the interpretable underlying intrinsic concepts. To address this challenge, we present ICE, short for Intrinsic Concept Extraction, a novel framework that exclusively utilizes a T2I model to automatically and systematically extract intrinsic concepts from a single image. |
Fernando Julio Cendra; Kai Han; | code |
| 246 | MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a Transformer-based monocular 3D object detection method called MonoDGP, which adopts perspective-invariant geometry errors to modify the projection formula. |
Fanqi Pu; Yifan Wang; Jiru Deng; Wenming Yang; | code |
| 247 | MotionPro: A Precise Motion Controller for Image-to-Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To alleviate these, we present MotionPro, a precise motion controller that novelly leverages region-wise trajectory and motion mask to regulate fine-grained motion synthesis and identify target motion category (i.e., object or camera moving), respectively. |
Zhongwei Zhang; Fuchen Long; Zhaofan Qiu; Yingwei Pan; Wu Liu; Ting Yao; Tao Mei; | code |
| 248 | CRISP: Object Pose and Shape Estimation with Test-Time Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We consider the problem of estimating object pose and shape from an RGB-D image. |
Jingnan Shi; Rajat Talak; Harry Zhang; David Jin; Luca Carlone; | code |
| 249 | Continual SFT Matches Multimodal RLHF with Negative Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Conventional wisdom holds its superiority over continual SFT during this preference alignment stage. In this paper, we observe that the inherent value of multimodal RLHF lies in its negative supervision, the logit of the rejected responses. |
Ke Zhu; Yu Wang; Yanpeng Sun; Qiang Chen; Jiangjiang Liu; Gang Zhang; Jingdong Wang; | code |
| 250 | Contextual AD Narration with Interleaved Multimodal Sequence Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: With video feature, text, character bank and context information as inputs, the generated ADs are able to correspond to the characters by name and provide reasonable, contextual descriptions to help audience understand the storyline of movie. To achieve this goal, we propose to leverage pre-trained foundation models through a simple and unified framework to generate ADs with interleaved multimodal sequence as input, termed as Uni-AD. |
Hanlin Wang; Zhan Tong; Kecheng Zheng; Yujun Shen; Limin Wang; | code |
| 251 | LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth for each point on the trajectory. |
Hanlin Wang; Hao Ouyang; Qiuyu Wang; Wen Wang; Ka Leong Cheng; Qifeng Chen; Yujun Shen; Limin Wang; | code |
| 252 | VGGT: Visual Geometry Grounded Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. |
Jianyuan Wang; Minghao Chen; Nikita Karaev; Andrea Vedaldi; Christian Rupprecht; David Novotny; | code |
| 253 | Reconstructing Humans with A Biomechanically Accurate Skeleton Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a method for reconstructing 3D humans from a single image using a biomechanically accurate skeleton model. |
Yan Xia; Xiaowei Zhou; Etienne Vouga; Qixing Huang; Georgios Pavlakos; | code |
| 254 | Universal Domain Adaptation for Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We define the problem in the UniDA-SS scenario as low confidence scores of common classes in the target domain, which leads to confusion with private classes. To solve this problem, we propose UniMAP: UniDA-SS with Image Matching and Prototype-based Distinction, a novel framework composed of two key components. |
Seun-An Choe; Keon-Hee Park; Jinwoo Choi; Gyeong-Moon Park; | code |
| 255 | ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Much previous AI research has focused on developing monolithic models to maximize their intelligence, with the primary goal of enhancing performance on specific tasks. |
Xiangyuan Xue; Zeyu Lu; Di Huang; Zidong Wang; Wanli Ouyang; Lei Bai; | code |
| 256 | SCAP: Transductive Test-Time Adaptation Via Supportive Clique-based Attribute Prompting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While recent ViT-based TTA methods have introduced batch-level adaptation, they remain suboptimal for VLMs due to inadequate integration of the text modality. To address these limitations, we propose a novel transductive TTA framework, Supportive Clique-based Attribute Prompting (SCAP), which effectively combines visual and textual information to enhance adaptation by generating fine-grained attribute prompts across test batches. |
Chenyu Zhang; Kunlun Xu; Zichen Liu; Yuxin Peng; Jiahuan Zhou; | code |
| 257 | STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This limitation significantly hinders the model’s ability to capture essential temporal information for effective video understanding. To address this, we propose an integrated Spatial-TempOral dynamic Prompting (STOP) model which consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting. |
Zichen Liu; Kunlun Xu; Bing Su; Xu Zou; Yuxin Peng; Jiahuan Zhou; | code |
| 258 | Unity in Diversity: Video Editing Via Gradient-Latent Purification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, it is challenging to determine the optimal stopping point for the editing process, making it difficult to achieve an optimal solution for the latent representation. To address these issues, we propose a unified gradient-latent purification framework that collects gradient and latent information across different stages to identify effective and concordant update directions. |
Junyu Gao; Kunlin Yang; Xuan Yao; Yufan Hu; | code |
| 259 | Object-Shot Enhanced Grounding Network for Egocentric Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods primarily focus on the distributional differences between egocentric and exocentric videos but often neglect key characteristics of egocentric videos and the fine-grained information emphasized by question-type queries. To address these limitations, we propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video. |
Yisen Feng; Haoyu Zhang; Meng Liu; Weili Guan; Liqiang Nie; | code |
| 260 | Asynchronous Collaborative Graph Representation for Frames and Events Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most multimodal methods directly convert events into image-like formats synchronized with frames and process each stream through separate two-branch backbones, making it difficult to fully exploit the spatiotemporal events while limiting inference frequency to the frame rate. To address these problems, we propose a novel asynchronous collaborative graph representation, namely ACGR, which is the first trial to explore a unified graph framework for asynchronously processing frames and events with high performance and low latency. |
Dianze Li; Jianing Li; Xu Liu; Xiaopeng Fan; Yonghong Tian; | code |
| 261 | Re-HOLD: Video Hand Object Interaction Reenactment Via Adaptive Layout-instructed Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite human hand synthesis already being an intricate problem, generating objects in contact with hands and their interactions presents an even more challenging task, especially when the objects exhibit obvious variations in size and shape. To tackle these issues, we present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive Layout-instructed Diffusion model (Re-HOLD). |
Yingying Fan; Quanwei Yang; Kaisiyuan Wang; Hang Zhou; Yingying Li; Haocheng Feng; Errui Ding; Yu Wu; Jingdong Wang; | code |
| 262 | SMTPD: A New Benchmark for Temporal Prediction of Social Media Popularity Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, with exploring YouTube’s multilingual and multi-modal content, we construct a new social media temporal popularity prediction benchmark, namely SMTPD, and suggest a baseline framework for temporal popularity prediction. |
Yijie Xu; Bolun Zheng; Wei Zhu; Hangjia Pan; Yuchen Yao; Ning Xu; Anan Liu; Quan Zhang; Chenggang Yan; | code |
| 263 | Learning to Normalize on The SPD Manifold Under Bures-Wasserstein Geometry Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In addition, the recently introduced Generalized BWM (GBWM) parameterizes the vanilla BWM via an SPD matrix, allowing for a more nuanced representation of vibrant geometries of the SPD manifold. Therefore, we propose a novel RBN algorithm based on the GBW geometry, incorporating a learnable metric parameter. |
Rui Wang; Shaocheng Jin; Ziheng Chen; Xiaoqing Luo; Xiao-Jun Wu; | code |
| 264 | LoRA Recycle: Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models By Recycling Pre-Tuned LoRAs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Large Language Models (LLMs) such as ChatGPT demonstrate strong few-shot adaptability without requiring fine-tuning, positioning them ideal for data-limited and real-time applications. |
Zixuan Hu; Yongxian Wei; Li Shen; Chun Yuan; Dacheng Tao; | code |
| 265 | SwiftEdit: Lightning Fast Text-Guided Image Editing Via One-Step Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods often fall short of the speed demands required for real-world and on-device applications due to the costly multi-step inversion and sampling process involved. In response to this, we introduce SwiftEdit, a simple yet highly efficient editing tool that achieves instant text-guided image editing in 0.23s. |
Trong-Tung Nguyen; Quang Nguyen; Khoi Nguyen; Anh Tran; Cuong Pham; | code |
| 266 | Decoder Gradient Shield: Provable and High-Fidelity Prevention of Gradient-Based Box-Free Watermark Removal Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we reveal an overlooked vulnerability of the unprotected watermark decoder which is jointly trained with the encoder and can be exploited to train a watermark removal network. |
Haonan An; Guang Hua; Zhengru Fang; Guowen Xu; Susanto Rahardja; Yuguang Fang; | code |
| 267 | SceneTAP: Scene-Coherent Typographic Adversarial Planner Against Vision-Language Models in Real-World Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present the first approach to generate scene-coherent typographic adversarial attacks that mislead advanced LVLMs while maintaining visual naturalness through the capability of the LLM-based agent. |
Yue Cao; Yun Xing; Jie Zhang; Di Lin; Tianwei Zhang; Ivor Tsang; Yang Liu; Qing Guo; | code |
| 268 | FreePCA: Integrating Consistency Information Across Long-short Frames in Training-free Long Video Generation Via Principal Component Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we reveal that global and local information can be precisely decoupled into consistent appearance and motion intensity information by applying Principal Component Analysis (PCA), allowing for refined complementary integration of global consistency and local quality. |
Jiangtong Tan; Hu Yu; Jie Huang; Jie Xiao; Feng Zhao; | code |
| 269 | Revealing Key Details to See Differences: A Novel Prototypical Perspective for Skeleton-based Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce ProtoGCN, a Graph Convolutional Network (GCN)-based model that breaks down the dynamics of entire skeleton sequences into a combination of learnable prototypes representing core motion patterns of action units. |
Hongda Liu; Yunfan Liu; Min Ren; Hao Wang; Yunlong Wang; Zhenan Sun; | code |
| 270 | FFR: Frequency Feature Rectification for Weakly Supervised Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we identify that attenuated high-frequency features mislead the decoder of ViT-based WSSS models, resulting in over-smoothed false segmentation. To address this, we propose a Frequency Feature Rectification (FFR) framework to rectify the false segmentations caused by attenuated high-frequency features and enhance the learning of high-frequency features in the decoder. |
Ziqian Yang; Xinqiao Zhao; Xiaolei Wang; Quan Zhang; Jimin Xiao; | code |
| 271 | VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Building on the observation that MLLM visual encoders often fail to distinguish visually different yet semantically similar video pairs, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. |
Chaoyu Li; Eun Woo Im; Pooyan Fazli; | code |
| 272 | Empowering LLMs to Understand and Generate Complex Vector Graphics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, LLM training typically lacks modeling and understanding of the rendering sequence of vector paths, which can lead to occlusion between output vector primitives. In this paper, we present LLM4SVG, an initial yet substantial step toward bridging this gap by enabling LLMs to better understand and generate vector graphics. |
Ximing Xing; Juncheng Hu; Guotao Liang; Jing Zhang; Dong Xu; Qian Yu; | code |
| 273 | ID-Patch: Robust ID Association for Group Photo Personalization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods suffer from limitations such as the reliance on segmentation models, increased runtime, or a high probability of ID leakage. To address these challenges, we propose ID-Patch, a novel method that provides robust association between identities and 2D positions. |
Yimeng Zhang; Tiancheng Zhi; Jing Liu; Shen Sang; Liming Jiang; Qing Yan; Sijia Liu; Linjie Luo; | code |
| 274 | SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods restrict reasoning either to independent short clips, losing global context, or process the entire video offline, impairing their application in a streaming fashion. In this work, we aim to surpass these limitations and design an RVOS method capable of effectively operating in streaming-like scenarios while retaining contextual information from past frames. |
Claudia Cuttano; Gabriele Trivigno; Gabriele Rosi; Carlo Masone; Giuseppe Averta; | code |
| 275 | AniMer: Animal Pose and Shape Estimation Using Family Aware Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, this paper presents AniMer to estimate animal pose and shape using family-aware Transformer, enhancing the reconstruction accuracy of diverse quadrupedal families. |
Jin Lyu; Tianyi Zhu; Yi Gu; Li Lin; Pujin Cheng; Yebin Liu; Xiaoying Tang; Liang An; | code |
| 276 | Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of more than 30K annotated scenes with 5.6M mask-text pairs – significantly larger than existing datasets. Building on these data, we propose Mosaic3D, a 3D visual foundation model (3D-VFM) combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. |
Junha Lee; Chunghyun Park; Jaesung Choe; Yu-Chiang Frank Wang; Jan Kautz; Minsu Cho; Chris Choy; | code |
| 277 | Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Sparsely activated mixture-of-experts (MoE) has recently been demonstrated as a valid solution to training large-scale models with limited resources. Inspired by this, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router to adaptively select the most suitable expert for each feature point. |
Jiapeng Zhu; Ceyuan Yang; Kecheng Zheng; Yinghao Xu; Zifan Shi; Yifei Zhang; Qifeng Chen; Yujun Shen; | code |
| 278 | V-CLR: View-Consistent Learning for Open-World Instance Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we address the challenging problem of open-world instance segmentation. |
Chang-Bin Zhang; Jinhong Ni; Yujie Zhong; Kai Han; | code |
| 279 | Mr. DETR: Instructive Multi-Route Training for Detection Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we treat the model as a multi-task framework, simultaneously performing one-to-one and one-to-many predictions. |
Chang-Bin Zhang; Yujie Zhong; Kai Han; | code |
| 280 | Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel neural rendering framework to accomplish the unposed and extremely sparse-view 3D reconstruction in unbounded 360° scenes. |
Chong Bao; Xiyu Zhang; Zehao Yu; Jiale Shi; Guofeng Zhang; Songyou Peng; Zhaopeng Cui; | code |
| 281 | Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel prediction-based mapping network, named PrediCIR, to adaptively predict the missing target visual content in reference images in the latent space before mapping for accurate ZS-CIR. |
Yuanmin Tang; Jing Yu; Keke Gai; Jiamin Zhuang; Gang Xiong; Gaopeng Gou; Qi Wu; | code |
| 282 | Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods suffer from missing critical visual details and limited reasoning capabilities, leading to suboptimal retrieval performance. To address these challenges, we propose a novel, training-free one-stage method, One-Stage Reflective Chain-of-Thought Reasoning (OSrCIR) for ZS-CIR, which employs Multimodal Large Language Models to retain essential visual information in a single-stage reasoning process, eliminating the information loss in two-stage methods. |
Yuanmin Tang; Jue Zhang; Xiaoting Qin; Jing Yu; Gaopeng Gou; Gang Xiong; Qingwei Lin; Saravan Rajmohan; Dongmei Zhang; Qi Wu; | code |
| 283 | COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike existing approaches that remove ambiguous Gaussians and sacrifice visual quality, COB-GS, as a 3DGS refinement method, jointly optimizes semantic and visual information, allowing the two different levels to cooperate with each other effectively. Specifically, for the semantic guidance, we introduce a boundary-adaptive Gaussian splitting technique that leverages semantic gradient statistics to identify and split ambiguous Gaussians, aligning them closely with object boundaries. |
Jiaxin Zhang; Junjun Jiang; Youyu Chen; Kui Jiang; Xianming Liu; | code |
| 284 | Curriculum Direct Preference Optimization for Diffusion and Consistency Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel and enhanced version of DPO based on curriculum learning for text-to-image generation. |
Florinel-Alin Croitoru; Vlad Hondru; Radu Tudor Ionescu; Nicu Sebe; Mubarak Shah; | code |
| 285 | Minority-Focused Text-to-Image Generation Via Prompt Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, existing pretrained T2I diffusion models primarily focus on high-density regions, largely due to the influence of guided samplers (like CFG) that are essential for high-quality generation. To address this, we present a novel framework to counter the high-density-focus of T2I diffusion models. |
Soobin Um; Jong Chul Ye; | code |
| 286 | Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Two gadgets are proposed to encourage textual diversity. To match such (image, multi-texts) pairs, we modify the CLIP image encoder into multi-branch, and propose multi-to-multi contrastive optimization for image-text part-to-part matching. |
Haicheng Wang; Chen Ju; Weixiong Lin; Shuai Xiao; Mengting Chen; Yixuan Huang; Chang Liu; Mingshuai Yao; Jinsong Lan; Ying Chen; Qingwen Liu; Yanfeng Wang; | code |
| 287 | FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This issue is particularly pronounced in rare domains, such as medical imaging, which are underrepresented in LDM’s pretraining data. To address this challenge, we propose Federated Bi-Level Personalization (FedBiP), which personalizes the pretrained LDM at both instance-level and concept-level. |
Haokun Chen; Hang Li; Yao Zhang; Jinhe Bi; Gengyuan Zhang; Yueqi Zhang; Philip Torr; Jindong Gu; Denis Krompass; Volker Tresp; | code |
| 288 | Cross-modal Causal Relation Alignment for Video Question Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA), to eliminate spurious correlations and improve the causal consistency between question-answering and video temporal grounding. |
Weixing Chen; Yang Liu; Binglin Chen; Jiandong Su; Yongsen Zheng; Liang Lin; | code |
| 289 | Wavelet and Prototype Augmented Query-based Transformer for Pixel-level Surface Defect Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, inspired by query-based methods, we propose a Wavelet and Prototype Augmented Query-based Transformer (WPFormer) for surface defect detection. |
Feng Yan; Xiaoheng Jiang; Yang Lu; Jiale Cao; Dong Chen; Mingliang Xu; | code |
| 290 | FluidNexus: 3D Fluid Reconstruction and Prediction from A Single Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current methods require multi-view videos for fluid reconstruction. We present FluidNexus, a novel framework that bridges video generation and physics simulation to tackle this task. |
Yue Gao; Hong-Xing Yu; Bo Zhu; Jiajun Wu; | code |
| 291 | Adaptive Unimodal Regulation for Balanced Multimodal Information Acquisition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, based on our observation, this prime learning window in multimodal learning is often dominated by information-sufficient modalities, which in turn suppresses the information acquisition of information-insufficient modalities. To address this issue, we propose Information Acquisition Regulation (InfoReg), a method designed to balance information acquisition among modalities. |
Chengxiang Huang; Yake Wei; Zequn Yang; Di Hu; | code |
| 292 | SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Traditional 3DVG approaches rely on annotated 3D datasets and predefined object categories, limiting scalability and adaptability. To overcome these limitations, we introduce SeeGround, a zero-shot 3DVG framework leveraging 2D Vision-Language Models (VLMs) trained on large-scale 2D data. |
Rong Li; Shijie Li; Lingdong Kong; Xulei Yang; Junwei Liang; | code |
| 293 | Interpreting Object-level Foundation Models Via Visual Precision Search Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these, we propose a Visual Precision Search method that generates accurate attribution maps with fewer regions. |
Ruoyu Chen; Siyuan Liang; Jingzhi Li; Shiming Liu; Maosen Li; Zhen Huang; Hua Zhang; Xiaochun Cao; | code |
| 294 | Adaptive Rectangular Convolution for Remote Sensing Pansharpening Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Given the diverse object sizes in remote sensing images, these rigid parameters lead to suboptimal feature extraction. To overcome these limitations, we introduce an innovative convolutional module, Adaptive Rectangular Convolution (ARConv). |
Xueyang Wang; Zhixin Zheng; Jiandong Shao; Yule Duan; Liang-Jian Deng; | code |
| 295 | Towards More General Video-based Deepfake Detection Through Facial Component Guided Adaptation for Foundation Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The current deep generative models have enabled the creation of synthetic facial images with remarkable photorealism, raising significant societal concerns over their potential misuse. |
Yue-Hua Han; Tai-Ming Huang; Kai-Lung Hua; Jun-Cheng Chen; | code |
| 296 | Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose EB-Diff-Train, a new efficient DM training approach that is orthogonal to other methods of accelerating DM training, by investigating and leveraging Early-Bird (EB) tickets–sparse subnetworks that manifest early in the training process and maintain high generation quality. |
Lexington Whalen; Zhenbang Du; Haoran You; Chaojian Li; Sixu Li; Yingyan Lin; | code |
| 297 | TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, recent studies have shown that the inference performance of CLIP can be greatly degraded by small adversarial perturbations, especially its visual modality, posing significant safety threats. To mitigate this vulnerability, in this paper, we propose a novel defense method called Test-Time Adversarial Prompt Tuning (TAPT) to enhance the inference robustness of CLIP against visual adversarial attacks. |
Xin Wang; Kai Chen; Jiaming Zhang; Jingjing Chen; Xingjun Ma; | code |
| 298 | All-Day Multi-Camera Multi-Target Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose an All-Day Mamba Fusion (ADMF) model to adaptively fuse information from different modalities. |
Huijie Fan; Yu Qiao; Yihao Zhen; Tinghui Zhao; Baojie Fan; Qiang Wang; | code |
| 299 | DKC: Differentiated Knowledge Consolidation for Cloth-Hybrid Lifelong Person Re-identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, existing LReID methods typically fail to leverage such knowledge, resulting in the exacerbation of catastrophic forgetting issues. Therefore, in this paper, we focus on a challenging practical task called Cloth-Hybrid Lifelong Person Re-identification (CH-LReID), which requires matching the same person wearing different clothes using sequentially collected data. |
Zhenyu Cui; Jiahuan Zhou; Yuxin Peng; | code |
| 300 | Logits DeConfusion with CLIP for Few-Shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, we found in experiments that CLIP’s logits suffer from serious inter-class confusion problems in downstream tasks, and the ambiguity between categories seriously affects the accuracy. To address this challenge, we propose a novel method called Logits DeConfusion, which effectively learns and eliminates inter-class confusion in logits by combining our Multi-level Adapter Fusion (MAF) module with our Inter-Class Deconfusion (ICD) module. |
Shuo Li; Fang Liu; Zehua Hao; Xinyi Wang; Lingling Li; Xu Liu; Puhua Chen; Wenping Ma; | code |
| 301 | Luminance-GS: Adapting 3D Gaussian Splatting to Challenging Lighting Conditions with View-Adaptive Curve Adjustment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Such lighting degradations and view-dependent variations pose substantial challenges to novel view synthesis (NVS) frameworks based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). To address this, we introduce Luminance-GS, a novel approach to achieving high-quality novel view synthesis results under diverse challenging lighting conditions using 3DGS. |
Ziteng Cui; Xuangeng Chu; Tatsuya Harada; | code |
| 302 | DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on weakly supervised referring expression comprehension (REC), and identify that the lack of fine-grained visual capability greatly limits the upper performance bound of existing methods. |
Xiaofu Chen; Yaxin Luo; Gen Luo; Jiayi Ji; Henghui Ding; Yiyi Zhou; | code |
| 303 | Explaining Domain Shifts in Language: Concept Erasing for Interpretable Image Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Language-guided Concept-Erasing (LanCE) framework. |
Zequn Zeng; Yudi Su; Jianqiao Sun; Tiansheng Wen; Hao Zhang; Zhengjue Wang; Bo Chen; Hongwei Liu; Jiawei Ma; | code |
| 304 | ReNeg: Learning Negative Embedding with Reward Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce ReNeg, an end-to-end method designed to learn improved Negative embeddings guided by a Reward model. |
Xiaomin Li; Yixuan Liu; Takashi Isobe; Xu Jia; Qinpeng Cui; Dong Zhou; Dong Li; You He; Huchuan Lu; Zhongdao Wang; Emad Barsoum; | code |
| 305 | Open-Canopy: Towards Very High Resolution Forest Monitoring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Open-Canopy, the first open-access, country-scale benchmark for very high-resolution (1.5 m) canopy height estimation, covering over 87,000 km² across France with 1.5 m resolution satellite imagery and aerial LiDAR data. |
Fajwel Fogel; Yohann Perron; Nikola Besic; Laurent Saint-André; Agnès Pellissier-Tanon; Martin Schwartz; Thomas Boudras; Ibrahim Fayad; Alexandre d’Aspremont; Loic Landrieu; Philippe Ciais; | code |
| 306 | One Model for ALL: Low-Level Task Interaction Is A Key to Task-Agnostic Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, we propose to leverage low-level vision tasks from digital photography fusion, allowing for effective feature interaction through pixel-level supervision. |
Chunyang Cheng; Tianyang Xu; Zhenhua Feng; Xiaojun Wu; Zhangyong Tang; Hui Li; Zeyang Zhang; Sara Atito; Muhammad Awais; Josef Kittler; | code |
| 307 | Any-Resolution AI-Generated Image Detection By Spectral Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we build upon the key idea that the spectral distribution of real images constitutes both an invariant and highly discriminative pattern for AI-generated image detection. |
Dimitrios Karageorgiou; Symeon Papadopoulos; Ioannis Kompatsiaris; Efstratios Gavves; | code |
| 308 | Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Dora-VAE, a novel approach that enhances VAE reconstruction through our proposed sharp edge sampling strategy and a dual cross-attention mechanism. |
Rui Chen; Jianfeng Zhang; Yixun Liang; Guan Luo; Weiyu Li; Jiarui Liu; Xiu Li; Xiaoxiao Long; Jiashi Feng; Ping Tan; | code |
| 309 | FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a novel method for reconstructing personalized 3D human avatars with realistic animation from only a few images. |
Rong Wang; Fabian Prada; Ziyan Wang; Zhongshi Jiang; Chengxiang Yin; Junxuan Li; Shunsuke Saito; Igor Santesteban; Javier Romero; Rohan Joshi; Hongdong Li; Jason Saragih; Yaser Sheikh; | code |
| 310 | RENO: Real-Time Neural Compression for 3D LiDAR Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes RENO, the first real-time neural codec for 3D LiDAR point clouds, achieving superior performance with a lightweight model. |
Kang You; Tong Chen; Dandan Ding; M. Salman Asif; Zhan Ma; | code |
| 311 | Neuro-3D: Towards 3D Visual Decoding from EEG Signals Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Understanding how the brain perceives and processes 3D visual stimuli in the real world has been a longstanding endeavor in neuroscience. Towards this goal, we introduce a new neuroscience task: decoding 3D visual perception from EEG signals, a neuroimaging technique that enables real-time monitoring of neural dynamics enriched with complex visual cues. |
Zhanqiang Guo; Jiamin Wu; Yonghao Song; Jiahui Bu; Weijian Mai; Qihao Zheng; Wanli Ouyang; Chunfeng Song; | code |
| 312 | MMRL: Multi-Modal Representation Learning for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, adapting these models with limited few-shot data often leads to overfitting, diminishing their performance on new tasks. To tackle this issue, we propose a novel Multi-Modal Representation Learning (MMRL) framework that introduces a shared, learnable, and modality-agnostic representation space. |
Yuncheng Guo; Xiaodong Gu; | code |
| 313 | Image Quality Assessment: From Human to Machine Preference Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Considering the huge gap between human and machine vision systems, this paper proposes the topic of Image Quality Assessment for Machine Vision for the first time. |
Chunyi Li; Yuan Tian; Xiaoyue Ling; Zicheng Zhang; Haodong Duan; Haoning Wu; Ziheng Jia; Xiaohong Liu; Xiongkuo Min; Guo Lu; Weisi Lin; Guangtao Zhai; | code |
| 314 | Sparse Voxels Rasterization: Real-time High-fidelity Radiance Field Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose an efficient radiance field rendering algorithm that incorporates a rasterization process on adaptive sparse voxels without neural networks or 3D Gaussians. |
Cheng Sun; Jaesung Choe; Charles Loop; Wei-Chiu Ma; Yu-Chiang Frank Wang; | code |
| 315 | Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The limited scale of movie dubbing datasets, along with the background noise inherent in audio data, hinders the acoustic modeling performance of trained models. To address these issues, we propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. |
Zhedong Zhang; Liang Li; Chenggang Yan; Chunshan Liu; Anton van den Hengel; Yuankai Qi; | code |
| 316 | T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. |
Yifei Qian; Zhongliang Guo; Bowen Deng; Chun Tong Lei; Shuai Zhao; Chun Pong Lau; Xiaopeng Hong; Michael P. Pound; | code |
| 317 | The PanAf-FGBG Dataset: Understanding The Impact of Backgrounds in Wildlife Behaviour Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response, we present the PanAf-FGBG dataset, featuring 21 hours of wild chimpanzee behaviours, recorded at 389 individual camera locations. |
Otto Brookes; Maksim Kukushkin; Majid Mirmehdi; Colleen Stephens; Paula Dieguez; Thurston C. Hicks; Sorrel Jones; Kevin Lee; Maureen S. McCarthy; Amelia Meier; Emmanuelle Normand; Erin G. Wessling; Roman M. Wittig; Kevin Langergraber; Klaus Zuberbühler; Lukas Boesch; Thomas Schmid; Mimi Arandjelovic; Hjalmar Kühl; Tilo Burghardt; | code |
| 318 | GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Normally, humans address complex tasks through multi-step reasoning and respond to diverse situations by leveraging associative and analogical thinking. In light of this, we propose GREAT (Geometry-Intention Collaborative Inference) for Open-Vocabulary 3D Object Affordance Grounding, a novel framework that mines object-invariant geometry attributes and performs analogical reasoning in potential interaction scenarios to form affordance knowledge, fully combining this knowledge with both geometries and visual contents to ground 3D object affordance. |
Yawen Shao; Wei Zhai; Yuhang Yang; Hongchen Luo; Yang Cao; Zheng-Jun Zha; | code |
| 319 | RoGSplat: Learning Robust Generalizable Human Gaussian Splatting from Sparse Multi-View Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents RoGSplat, a novel approach for synthesizing high-fidelity novel views of unseen humans from sparse multi-view images, while requiring no cumbersome per-subject optimization. |
Junjin Xiao; Qing Zhang; Yongwei Nie; Lei Zhu; Wei-Shi Zheng; | code |
| 320 | ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized By Trainable Commuting Angle Matrices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose ComRoPE, which generalizes RoPE by defining it in terms of trainable commuting angle matrices. |
Hao Yu; Tangyu Jiang; Shuning Jia; Shannan Yan; Shunning Liu; Haolong Qian; Guanghao Li; Shuting Dong; Chun Yuan; | code |
| 321 | RGBAvatar: Reduced Gaussian Blendshapes for Online Modeling of Head Avatars Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Reduced Gaussian Blendshapes Avatar (RGBAvatar), a method for reconstructing photorealistic, animatable head avatars at speeds sufficient for on-the-fly reconstruction. |
Linzhou Li; Yumeng Li; Yanlin Weng; Youyi Zheng; Kun Zhou; | code |
| 322 | Augmented Deep Contexts for Spatially Embedded Video Coding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To relieve the limitations, we propose a Spatially Embedded Video Codec (SEVC), in which the low-resolution video is compressed for spatial references. |
Yifan Bian; Chuanbo Tang; Li Li; Dong Liu; | code |
| 323 | Degradation-Aware Feature Perturbation for All-in-One Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, the feature perturbations primarily include channel-wise perturbations and attention-wise perturbations. |
Xiangpeng Tian; Xiangyu Liao; Xiao Liu; Meng Li; Chao Ren; | code |
| 324 | Training-free Dense-Aligned Diffusion Guidance for Modular Conditional Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel approach that treats conditional image synthesis as the modular combination of diverse fundamental condition units. |
Zixuan Wang; Duo Peng; Feng Chen; Yuwei Yang; Yinjie Lei; | code |
| 325 | Octopus: Alleviating Hallucination Via Dynamic Contrastive Decoding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Related results show that hallucination causes are hybrid and each generative step faces a unique hallucination challenge. Leveraging these meaningful insights, we introduce a simple yet effective Octopus-like framework that enables the model to adaptively identify hallucination types and create a dynamic CD workflow. |
Wei Suo; Lijun Zhang; Mengyang Sun; Lin Yuanbo Wu; Peng Wang; Yanning Zhang; | code |
| 326 | Weakly Supervised Contrastive Adversarial Training for Learning Robust Features from Semi-supervised Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, achieving complete perturbation, i.e., perturbing as many non-robust features as possible, is challenging due to the difficulty of distinguishing robust from non-robust features and the sparsity of labeled data. To address these challenges, we propose a novel approach called Weakly Supervised Contrastive Adversarial Training (WSCAT). |
Lilin Zhang; Chengpei Wu; Ning Yang; | code |
| 327 | FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Effectively integrating this multimodal input — device pose and camera feeds — is challenging due to the differing characteristics of each data source. To address this, we propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction through geometrically sound multimodal integration and can run at 300 FPS on modern hardware. |
Andrea Boscolo Camiletto; Jian Wang; Eduardo Alvarado; Rishabh Dabral; Thabo Beeler; Marc Habermann; Christian Theobalt; | code |
| 328 | Retrieving Semantics from The Deep: An RAG Solution for Gesture Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Producing such semantic co-speech gestures has been a major challenge for existing neural systems, which can generate rhythmic beat gestures but struggle to produce semantically meaningful ones. Therefore, we present RAG-Gesture, a diffusion-based gesture generation approach that leverages Retrieval Augmented Generation (RAG) to produce natural-looking and semantically rich gestures. |
M. Hamza Mughal; Rishabh Dabral; Merel C.J. Scholman; Vera Demberg; Christian Theobalt; | code |
| 329 | ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce ARKit LabelMaker, a large-scale real-world 3D dataset with dense semantic annotation that is more than three times larger than the prior largest dataset. |
Guangda Ji; Silvan Weder; Francis Engelmann; Marc Pollefeys; Hermann Blum; | code |
| 330 | 5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To find a competitive alternative to full fine-tuning, we propose the Multi-cognitive Visual Adapter (Mona) tuning, a novel adapter-based tuning method. |
Dongshuo Yin; Leiyi Hu; Bin Li; Youqun Zhang; Xue Yang; | code |
| 331 | Playing The Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, despite the significance of safety alignment, research on its vulnerabilities remains largely underexplored. In this paper, we investigate an unexplored vulnerability of safety alignment, examining its ability to consistently provide safety guarantees for out-of-distribution (OOD)-ifying harmful inputs that may fall outside the aligned data distribution. |
Joonhyun Jeong; Seyun Bae; Yeonsung Jung; Jaeryong Hwang; Eunho Yang; | code |
| 332 | Adaptive Keyframe Sampling for Long Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). |
Xi Tang; Jihao Qiu; Lingxi Xie; Yunjie Tian; Jianbin Jiao; Qixiang Ye; | code |
| 333 | Beyond Single-Modal Boundary: Cross-Modal Anomaly Detection Through Visual Prototype and Harmonization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a cross-modal anomaly detection model that is trained using data from a variety of existing modalities and can be generalized well to unseen modalities. |
Kai Mao; Ping Wei; Yiyang Lian; Yangyang Wang; Nanning Zheng; | code |
| 334 | SACB-Net: Spatial-awareness Convolutions for Medical Image Registration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a 3D Spatial-Awareness Convolution Block (SACB) to enhance the spatial information within feature representations. |
Xinxing Cheng; Tianyang Zhang; Wenqi Lu; Qingjie Meng; Alejandro F. Frangi; Jinming Duan; | code |
| 335 | Mamba As A Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite their complementary strengths, effectively integrating VFMs and VLMs with attention mechanisms is challenging, as the increased patch tokens complicate long-sequence modeling. To address this, we propose MFuser, a novel Mamba-based fusion framework that efficiently combines the strengths of VFMs and VLMs while maintaining linear scalability in sequence length. |
Xin Zhang; Robby T. Tan; | code |
| 336 | Dataset Distillation with Neural Characteristic Function: A Minmax Perspective Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we reformulate dataset distillation as a minmax optimization problem and introduce Neural Characteristic Function Discrepancy (NCFD), a comprehensive and theoretically grounded metric for measuring distributional differences. |
Shaobo Wang; Yicun Yang; Zhiyuan Liu; Chenghao Sun; Xuming Hu; Conghui He; Linfeng Zhang; | code |
| 337 | Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch. |
Yifei Zhang; Chang Liu; Jin Wei; Xiaomeng Yang; Yu Zhou; Can Ma; Xiangyang Ji; | code |
| 338 | Skip Tuning: Pre-trained Vision-Language Models Are Effective and Efficient Adapters Themselves Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless, in this work, we reveal that freezing the parameters of VLMs during learning the context vectors neither facilitates the transferability of pre-trained knowledge nor improves the memory and time efficiency significantly. |
Shihan Wu; Ji Zhang; Pengpeng Zeng; Lianli Gao; Jingkuan Song; Heng Tao Shen; | code |
| 339 | SASep: Saliency-Aware Structured Separation of Geometry and Feature for Open Set Learning on Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent advancements in deep learning have greatly enhanced 3D object recognition, but most models are limited to closed-set scenarios, unable to handle unknown samples in real-world applications. Open-set recognition (OSR) addresses this limitation by enabling models to both classify known classes and identify novel classes. However, current OSR methods rely on global features to differentiate known and unknown classes, treating the entire object uniformly and overlooking the varying semantic importance of its different parts. To address this gap, we propose Saliency-Aware Structured Separation (SASep), which includes (i) a tunable semantic decomposition (TSD) module to semantically decompose objects into important and unimportant parts, (ii) a geometric synthesis strategy (GSS) to generate pseudo-unknown objects by combining these unimportant parts, and (iii) a synth-aided margin separation (SMS) module to enhance feature-level separation by expanding the feature distributions between classes. Together, these components improve both geometric and feature representations, enhancing the model's ability to effectively distinguish known and unknown classes. Experimental results show that SASep achieves superior performance in 3D OSR, outperforming existing state-of-the-art methods. |
Jinfeng Xu; Xianzhi Li; Yuan Tang; Xu Han; Qiao Yu; Yixue Hao; Long Hu; Min Chen; | code |
| 340 | Tightening Robustness Verification of MaxPool-based Neural Networks Via Minimizing The Over-Approximation Zone Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present Ti-Lin, a robustness verifier for MaxPool-based CNNs with Tight Linear Approximation. |
Yuan Xiao; Yuchen Chen; Shiqing Ma; Chunrong Fang; Tongtong Bai; Mingzheng Gu; Yuxin Cheng; Yanwei Chen; Zhenyu Chen; | code |
| 341 | ACAttack: Adaptive Cross Attacking RGB-T Tracker Via Multi-Modal Response Decoupling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work represents an innovative attempt to develop an adaptive cross attack framework via multi-modal response decoupling, generating multi-modal adversarial patches to evade RGB-T trackers. |
Xinyu Xiang; Qinglong Yan; Hao Zhang; Jiayi Ma; | code |
| 342 | Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, not all image tokens are equally important, and certain localized areas require more computation, such as objects. To address this, we propose DiffCR, a dynamic DiT inference framework with differentiable compression ratios, which automatically learns to dynamically route computation across layers and timesteps for each image token, resulting in efficient DiTs. |
Haoran You; Connelly Barnes; Yuqian Zhou; Yan Kang; Zhenbang Du; Wei Zhou; Lingzhi Zhang; Yotam Nitzan; Xiaoyang Liu; Zhe Lin; Eli Shechtman; Sohrab Amirghodsi; Yingyan Celine Lin; | code |
| 343 | PTDiffusion: Free Lunch for Generating Optical Illusion Hidden Pictures with Phase-Transferred Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Established on the off-the-shelf text-to-image (T2I) diffusion model, we propose a novel text-guided image-to-image (I2I) translation framework dubbed as Phase-Transferred Diffusion Model (PTDiffusion) for hidden art syntheses, which harmoniously embeds an input reference image into arbitrary scenes described by the text prompts. |
Xiang Gao; Shuai Yang; Jiaying Liu; | code |
| 344 | Flash-Split: 2D Reflection Removal with Flash Cues and Latent Diffusion Separation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Flash-Split, a robust framework for separating transmitted and reflected light using a single (potentially misaligned) pair of flash/no-flash images. |
Tianfu Wang; Mingyang Xie; Haoming Cai; Sachin Shah; Christopher A. Metzler; | code |
| 345 | Hyperbolic Safety-Aware Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel approach that shifts from unlearning to an awareness paradigm by leveraging the inherent hierarchical properties of the hyperbolic space. |
Tobia Poppi; Tejaswi Kasarla; Pascal Mettes; Lorenzo Baraldi; Rita Cucchiara; | code |
| 346 | Language Guided Concept Bottleneck Models for Interpretable Continual Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a novel framework that integrates language-guided Concept Bottleneck Models (CBMs) to address both challenges. |
Lu Yu; Haoyu Han; Zhe Tao; Hantao Yao; Changsheng Xu; | code |
| 347 | Neural Video Compression with Context Modulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we address the limitation by modulating the temporal context with the reference frame in two steps. |
Chuanbo Tang; Zhuoyuan Li; Yifan Bian; Li Li; Dong Liu; | code |
| 348 | ARM: Appearance Reconstruction Model for Relightable 3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent image-to-3D reconstruction models have greatly advanced geometry generation, but they still struggle to faithfully generate realistic appearance. To address this, we introduce ARM, a novel method that reconstructs high-quality 3D meshes and realistic appearance from sparse-view images. |
Xiang Feng; Chang Yu; Zoubin Bi; Yintong Shang; Feng Gao; Hongzhi Wu; Kun Zhou; Chenfanfu Jiang; Yin Yang; | code |
| 349 | AA-CLIP: Enhancing Zero-Shot Anomaly Detection Via Anomaly-Aware CLIP Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While CLIP shows promise for zero-shot AD tasks due to its strong generalization capabilities, its inherent Anomaly-Unawareness leads to limited discrimination between normal and abnormal features. To address this problem, we propose Anomaly-Aware CLIP (AA-CLIP), which enhances CLIP’s anomaly discrimination ability in both text and visual spaces while preserving its generalization capability. |
Wenxin Ma; Xu Zhang; Qingsong Yao; Fenghe Tang; Chenxu Wu; Yingtai Li; Rui Yan; Zihang Jiang; S. Kevin Zhou; | code |
| 350 | DEIM: DETR with Improved Matching for Fast Convergence Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce DEIM, an innovative and efficient training framework designed to accelerate convergence in real-time object detection with Transformer-based architectures (DETR). |
Shihua Huang; Zhichao Lu; Xiaodong Cun; Yongjun Yu; Xiao Zhou; Xi Shen; | code |
| 351 | SGFormer: Satellite-Ground Fusion for 3D Semantic Scene Completion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite their success in visible areas, existing methods struggle to capture complete scene semantics due to frequent visual occlusions. To address this limitation, this paper presents the first satellite-ground cooperative SSC framework, i.e., SGFormer, exploring the potential of satellite-ground image pairs in the SSC task. |
Xiyue Guo; Jiarui Hu; Junjie Hu; Hujun Bao; Guofeng Zhang; | code |
| 352 | R2C: Mapping Room to Chessboard to Unlock LLM As Low-Level Action Planner Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Room to Chessboard (R2C), a novel semantic representation that maps environmental states onto a grid-based chessboard, empowering LLMs to generate specific low-level coordinates and guide the robot in a manner akin to playing a game of chess. |
Ziyi Bai; Hanxuan Li; Bin Fu; Chuyan Xiong; Ruiping Wang; Xilin Chen; | code |
| 353 | Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce the IMed-361M benchmark dataset, a significant advancement in general IMIS research. |
Junlong Cheng; Bin Fu; Jin Ye; Guoan Wang; Tianbin Li; Haoyu Wang; Ruoyu Li; He Yao; Junren Cheng; Jingwen Li; Yanzhou Su; Min Zhu; Junjun He; | code |
| 354 | Revisiting Generative Replay for Class Incremental Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, motivated by the observation that the forgetting of prior knowledge is predominantly present in the classification sub-task as opposed to the localization sub-task, we revisit the generative replay method for class incremental object detection. |
Shizhou Zhang; Xueqiang Lv; Yinghui Xing; Qirui Wu; Di Xu; Yanning Zhang; | code |
| 355 | Diffusion-based Event Generation for High-Quality Image Deblurring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While event-based deblurring methods have demonstrated impressive results, they are impractical for consumer photos captured by cell phones and digital cameras that are not equipped with an event sensor. To address this problem, in this paper we propose a novel deblurring framework called Event Generation Deblurring (EGDeblurring), which effectively deblurs an image by generating event guidance that describes the motion information using a diffusion model. |
Xinan Xie; Qing Zhang; Wei-Shi Zheng; | code |
| 356 | Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods struggle to either generalize well to new scenes or provide accurate camera pose estimates. To address these issues, we present Reloc3r, a simple yet effective visual localization framework. |
Siyan Dong; Shuzhe Wang; Shaohui Liu; Lulu Cai; Qingnan Fan; Juho Kannala; Yanchao Yang; | code |
| 357 | Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a simple approach to make pre-trained Vision Transformers (ViTs) interpretable for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as bird species. |
Arpita Chowdhury; Dipanjyoti Paul; Zheda Mai; Jianyang Gu; Ziheng Zhang; Kazi Sajeed Mehrab; Elizabeth G. Campolongo; Daniel Rubenstein; Charles V. Stewart; Anuj Karpatne; Tanya Berger-Wolf; Yu Su; Wei-Lun Chao; | code |
| 358 | Enhancing Online Continual Learning with Plug-and-Play State Space Model and Class-Conditional Mixture of Discretization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they often overlook the adaptability of the model, limiting the ability to learn generalizable and discriminative features incrementally from online training data. To address this, we introduce a plug-and-play module, S6MOD, which can be integrated into most existing methods and directly improve adaptability. Specifically, S6MOD introduces an extra branch after the backbone, where a mixture of discretization selectively adjusts parameters in a selective state space model, enriching selective scan patterns such that the model can adaptively select the most sensitive discretization method for current dynamics. We further design a class-conditional routing algorithm for dynamic, uncertainty-based adjustment and implement a contrastive discretization loss to optimize it. |
Sihao Liu; Yibo Yang; Xiaojie Li; David A. Clifton; Bernard Ghanem; | code |
| 359 | SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This indicates that LMMs do not achieve effective alignment between multimodal demonstrations and model outputs. To address this problem, we propose Symbol Demonstration Direct Preference Optimization (SymDPO). |
Hongrui Jia; Chaoya Jiang; Haiyang Xu; Wei Ye; Mengfan Dong; Ming Yan; Ji Zhang; Fei Huang; Shikun Zhang; | code |
| 360 | CryptoFace: End-to-End Encrypted Face Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a mixture of shallow patch convolutional networks to support higher-dimensional tensors via patch-based processing while reducing the multiplicative depth and thus inference latency. |
Wei Ao; Vishnu Naresh Boddeti; | code |
| 361 | Reward Fine-Tuning Two-Step Diffusion Models Via Learning Differentiable Latent-Space Surrogate Reward Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Based on the insights, we propose fine-tuning DMs with learned differentiable surrogate rewards. |
Zhiwei Jia; Yuesong Nan; Huixi Zhao; Gengdai Liu; | code |
| 362 | Exploration-Driven Generative Interactive Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, training their model requires expensive demonstrations. Therefore, we propose a training framework merely using a random agent in virtual environments. |
Nedko Savov; Naser Kazemi; Mohammad Mahdi; Danda Pani Paudel; Xi Wang; Luc Van Gool; | code |
| 363 | Dissecting and Mitigating Diffusion Bias Via Mechanistic Interpretability Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate the internal processes of diffusion models, identifying specific decision-making mechanisms, termed bias features, embedded within the model architecture. |
Yingdong Shi; Changming Li; Yifan Wang; Yongxiang Zhao; Anqi Pang; Sibei Yang; Jingyi Yu; Kan Ren; | code |
| 364 | Golden Cudgel Network for Real-Time Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, their speed is limited by multi-path blocks, and some depend on high-performance teacher models for training. To overcome these issues, we propose Golden Cudgel Network (GCNet). |
Guoyu Yang; Yuan Wang; Daming Shi; Yanzhong Wang; | code |
| 365 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. |
Sanghwan Kim; Rui Xiao; Mariana-Iuliana Georgescu; Stephan Alaniz; Zeynep Akata; | code |
| 366 | SphereUFormer: A U-Shaped Transformer for Spherical 360 Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel method for omnidirectional 360° perception. |
Yaniv Benny; Lior Wolf; | code |
| 367 | Mind The Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We demonstrate that, by uploading imperceptible image-space adversarial examples (AEs), the adversary can hijack massive benign users to jailbreak an Image Generation Service (IGS) driven by T2I-IP-DMs and mislead the public to discredit the service provider. |
Junxi Chen; Junhao Dong; Xiaohua Xie; | code |
| 368 | Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a result, we propose two memory banks, the Soft-transition Action Bank (STAB) and Action Characteristic Bank (ACB), to tackle the problems above. |
Jianwei Tang; Hong Yang; Tengyue Chen; Jian-Fang Hu; | code |
| 369 | MambaIC: State Space Models for High-Performance Learned Image Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we systematically analyze the advantages of SSMs for better integration and propose an enhanced image compression approach through refined context modeling, which we term MambaIC. |
Fanhu Zeng; Hao Tang; Yihua Shao; Siyu Chen; Ling Shao; Yan Wang; | code |
| 370 | SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images, exhibiting excellent multimodal conversational capability and responding to complex instructions across diverse pathology scenarios. |
Ying Chen; Guoan Wang; Yuanfeng Ji; Yanjun Li; Jin Ye; Tianbin Li; Ming Hu; Rongshan Yu; Yu Qiao; Junjun He; | code |
| 371 | O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a new approach, called O-TPT that introduces orthogonality constraints on the textual features corresponding to the learnable prompts for calibrating test-time prompt tuning in VLMs. |
Ashshak Sharifdeen; Muhammad Akhtar Munir; Sanoojan Baliah; Salman Khan; Muhammad Haris Khan; | code |
| 372 | Task-driven Image Fusion with Learnable Fusion Loss Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, current fusion methods for downstream tasks still use predefined fusion objectives that potentially mismatch the downstream tasks, limiting adaptive guidance and reducing model flexibility. To address this, we propose Task-driven Image Fusion (TDFusion), a fusion framework incorporating a learnable fusion loss guided by task loss. |
Haowen Bai; Jiangshe Zhang; Zixiang Zhao; Yichen Wu; Lilun Deng; Yukun Cui; Tao Feng; Shuang Xu; | code |
| 373 | MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes MMTL-UniAD, a unified multi-modal multi-task learning framework that simultaneously recognizes driver behavior (e.g., looking around, talking), driver emotion (e.g., anxiety, happiness), vehicle behavior (e.g., parking, turning), and traffic context (e.g., traffic jam, traffic smooth). |
Wenzhuo Liu; Wenshuo Wang; Yicheng Qiao; Qiannan Guo; Jiayin Zhu; Pengfei Li; Zilong Chen; Huiming Yang; Zhiwei Li; Lening Wang; Tiao Tan; Huaping Liu; | code |
| 374 | SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite their effectiveness, these methods face two challenges: (1) feature granularity deficiency, due to reliance on last layer visual features for text alignment, leading to the neglect of crucial object-level details from intermediate layers; (2) semantic similarity confusion, resulting from CLIP’s inherent biases toward certain classes, while LLM-generated descriptions based solely on labels fail to adequately capture inter-class similarities. To address these challenges, we propose a stratified granular comparison network. |
Xin Lin; Chong Shi; Zuopeng Yang; Haojin Tang; Zhili Zhou; | code |
| 375 | Any6D: Model-free 6D Pose Estimation of Novel Objects Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Any6D, a model-free framework for 6D object pose estimation that requires only a single RGB-D anchor image to estimate both the 6D pose and size of unknown objects in novel scenes. |
Taeyeop Lee; Bowen Wen; Minjun Kang; Gyuree Kang; In So Kweon; Kuk-Jin Yoon; | code |
| 376 | Reconstructing Animals and The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a method to reconstruct natural scenes from single images. |
Peter Kulits; Michael J. Black; Silvia Zuffi; | code |
| 377 | CustAny: Customizing Anything from A Single Example Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel pipeline to construct a large dataset of general objects and build the Multi-Category ID-Consistent (MC-IDC) dataset, featuring 315k text-image samples across 10k categories. |
Lingjie Kong; Kai Wu; Chengming Xu; Xiaobin Hu; Wenhui Han; Jinlong Peng; Donghao Luo; Mengtian Li; Jiangning Zhang; Chengjie Wang; Yanwei Fu; | code |
| 378 | ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To overcome these, we introduce ConMo, a zero-shot framework that disentangles and recomposes the motions of subjects and camera movements. |
Jiayi Gao; Zijin Yin; Changcheng Hua; Yuxin Peng; Kongming Liang; Zhanyu Ma; Jun Guo; Yang Liu; | code |
| 379 | Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Even if aesthetic labels were provided (at substantial cost), it would be hard for the two-trajectory methods to capture nuanced visual differences at different steps. |
Zhanhao Liang; Yuhui Yuan; Shuyang Gu; Bohan Chen; Tiankai Hang; Mingxi Cheng; Ji Li; Liang Zheng; | code |
| 380 | UniHOPE: A Unified Approach for Hand-Only and Hand-Object Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose UniHOPE, a unified approach for general 3D hand-object pose estimation, flexibly adapting both scenarios. |
Yinqiao Wang; Hao Xu; Pheng-Ann Heng; Chi-Wing Fu; | code |
| 381 | Noise Calibration and Spatial-Frequency Interactive Network for STEM Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nonetheless, existing STEM image enhancement methods usually overlook unique features in the frequency domain, and existing datasets lack realism and generality. To resolve these issues, in this paper, we develop noise calibration, data synthesis, and enhancement methods for STEM images. |
Hesong Li; Ziqi Wu; Ruiwen Shao; Tao Zhang; Ying Fu; | code |
| 382 | ColabSfM: Collaborative Structure-from-Motion By Point Cloud Registration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, there is a lack of scalable methods and training datasets for registering SfM reconstructions. In this paper, we tackle this challenge by proposing the scalable task of point cloud registration for SfM reconstructions. |
Johan Edstedt; André Mateus; Alberto Jaenal; | code |
| 383 | SAM2Object: Consolidating View Consistency Via SAM2 for Zero-Shot 3D Instance Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present SAM2Object, a novel zero-shot 3D instance segmentation method that effectively utilizes the Segment Anything Model 2 to segment and track objects, consolidating view consistency across frames. |
Jihuai Zhao; Junbao Zhuo; Jiansheng Chen; Huimin Ma; | code |
| 384 | Certified Human Trajectory Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a certification approach tailored for trajectory prediction that provides guaranteed robustness. |
Mohammadhossein Bahari; Saeed Saadatnejad; Amirhossein Askari Farsangi; Seyed-Mohsen Moosavi-Dezfooli; Alexandre Alahi; | code |
| 385 | Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: They fail to capture the wide range of anomaly types, as single sensors are often constrained to either external appearance, geometric structure, or internal properties. To overcome these challenges, we introduce MulSen-AD, the first high-resolution, multi-sensor anomaly detection dataset tailored for industrial applications. |
Wenqiao Li; Bozhong Zheng; Xiaohao Xu; Jinye Gan; Fading Lu; Xiang Li; Na Ni; Zheng Tian; Xiaonan Huang; Shenghua Gao; Yingna Wu; | code |
| 386 | Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly … |
Wenqiao Li; Yao Gu; Xintao Chen; Xiaohao Xu; Ming Hu; Xiaonan Huang; Yingna Wu; | code |
| 387 | D^2iT: Dynamic Diffusion Transformer for Accurate Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, large compression leads to limited local realism, while small compression increases computational complexity and compromises global consistency, ultimately impacting the quality of generated images. To address these limitations, we propose dynamically compressing different image regions by recognizing the importance of different regions, and introduce a novel two-stage framework designed to enhance the effectiveness and efficiency of image generation: (1) Dynamic VAE (DVAE) at first stage employs a hierarchical encoder to encode different image regions at different downsampling rates, tailored to their specific information densities, thereby providing more accurate and natural latent codes for the diffusion process. |
Weinan Jia; Mengqi Huang; Nan Chen; Lei Zhang; Zhendong Mao; | code |
| 388 | Optimizing for The Shortest Path in Denoising Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this research, we propose a novel denoising diffusion model based on shortest-path modeling that optimizes residual propagation to enhance both denoising efficiency and quality. |
Ping Chen; Xingpeng Zhang; Zhaoxiang Liu; Huan Hu; Xiang Liu; Kai Wang; Min Wang; Yanlin Qian; Shiguo Lian; | code |
| 389 | Finer-CAM: Spotting The Difference Reveals Finer Details for Visual Explanation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Finer-CAM, a method that retains CAM’s efficiency while achieving precise localization of discriminative regions. |
Ziheng Zhang; Jianyang Gu; Arpita Chowdhury; Zheda Mai; David Carlyn; Tanya Berger-Wolf; Yu Su; Wei-Lun Chao; | code |
| 390 | CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce CPath-Omni, the first 15B parameter LMM that unifies patch and WSI analysis, consolidating a variety of tasks at both levels, including classification, visual question answering, captioning, and visual referring prompting. |
Yuxuan Sun; Yixuan Si; Chenglu Zhu; Xuan Gong; Kai Zhang; Pingyi Chen; Ye Zhang; Zhongyi Shui; Tao Lin; Lin Yang; | code |
| 391 | Apply Hierarchical-Chain-of-Generation to Complex Attributes Text-to-3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Though some works introduce manual efforts to alleviate the above issues, their quality is unstable and highly reliant on manual information. To tackle the above problems, we propose an automated method, Hierarchical-Chain-of-Generation (HCoG). |
Yiming Qin; Zhu Xu; Yang Liu; | code |
| 392 | Temporal Separation with Entropy Regularization for Knowledge Distillation in Spiking Neural Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While recent research has aimed to enhance SNN learning by employing knowledge distillation (KD) from ANN teacher networks, traditional distillation techniques often overlook the distinctive spatiotemporal properties of SNNs, thus failing to fully leverage their advantages. To overcome these challenges, we propose a novel logit distillation method characterized by temporal separation and entropy regularization. |
Kairong Yu; Chengting Yu; Tianqing Zhang; Xiaochen Zhao; Shu Yang; Hongwei Wang; Qiang Zhang; Qi Xu; | code |
| 393 | ConText-CIR: Learning from Concepts in Text for Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods in CIR struggle to accurately represent the image and the text modification, resulting in subpar performance. To address this limitation, we introduce a CIR framework, ConText-CIR, trained with a Text Concept-Consistency loss that encourages the representations of noun phrases in the text modification to better attend to the relevant parts of the query image. |
Eric Xing; Pranavi Kolouju; Robert Pless; Abby Stylianou; Nathan Jacobs; | code |
| 394 | On Denoising Walking Videos for Gait Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this, emerging end-to-end methods focus on directly denoising RGB videos using global optimization and human-defined priors. Building on this trend, we propose a novel gait denoising method, DenoisingGait. |
Dongyang Jin; Chao Fan; Jingzhe Ma; Jingkai Zhou; Weihua Chen; Shiqi Yu; | code |
| 395 | FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose FLAME (Frozen Large lAnguage Models Enable data-efficient language-image pre-training) that leverages frozen large language models as text encoders, naturally processing long text inputs and demonstrating impressive multilingual generalization. |
Anjia Cao; Xing Wei; Zhiheng Ma; | code |
| 396 | MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. |
James Burgess; Jeffrey J Nirschl; Laura Bravo-Sánchez; Alejandro Lozano; Sanket Rajan Gupte; Jesus G. Galaz-Montoya; Yuhui Zhang; Yuchang Su; Disha Bhowmik; Zachary Coman; Sarina M Hasan; Alexandra Johannesson; William D. Leineweber; Malvika G Nair; Ridhi Yarlagadda; Connor Zuraski; Wah Chiu; Sarah Cohen; Jan N. Hansen; Manuel D Leonetti; Chad Liu; Emma Lundberg; Serena Yeung-Levy; | code |
| 397 | FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce FactCheXcker, a modular framework that de-hallucinates radiology report measurements by leveraging an improved query-code-update paradigm. |
Alice Heiman; Xiaoman Zhang; Emma Chen; Sung Eun Kim; Pranav Rajpurkar; | code |
| 398 | Linear Attention Modeling for Learned Image Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose LALIC, a linear attention modeling for learned image compression. |
Donghui Feng; Zhengxue Cheng; Shen Wang; Ronghua Wu; Hongwei Hu; Guo Lu; Li Song; | code |
| 399 | On The Consistency of Video Large Language Models in Temporal Comprehension Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To that end, we propose event temporal verification tuning that explicitly accounts for consistency, and demonstrate significant improvements for both grounding and consistency. |
Minjoon Jung; Junbin Xiao; Byoung-Tak Zhang; Angela Yao; | code |
| 400 | FedMIA: An Effective Membership Inference Attack Exploiting "All for One" Principle in Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods fail to leverage updates from non-target clients, potentially underutilizing available information. In this paper, we first formulate a one-tailed likelihood-ratio hypothesis test based on the likelihood of updates from non-target clients. Building upon this formulation, we introduce a three-stage Membership Inference Attack (MIA) method, called FedMIA, which follows the "all for one" principle, leveraging updates from all clients across multiple communication rounds to enhance MIA effectiveness. |
Gongxi Zhu; Donghao Li; Hanlin Gu; Yuan Yao; Lixin Fan; Yuxing Han; | code |
| 401 | 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce 3D-LLaVA, a simple yet highly powerful 3D LMM designed to act as an intelligent assistant in comprehending, reasoning, and interacting with the 3D world. |
Jiajun Deng; Tianyu He; Li Jiang; Tianyu Wang; Feras Dayoub; Ian Reid; | code |
| 402 | Embodied Scene Understanding for Vision Language Models Via MetaVQA Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making capabilities is lacking. To address this, we present MetaVQA: a comprehensive benchmark designed to assess and enhance VLMs’ understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulations. |
Weizhen Wang; Chenda Duan; Zhenghao Peng; Yuxin Liu; Bolei Zhou; | code |
| 403 | Enhancing Dataset Distillation Via Non-Critical Region Refinement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the Non-Critical Region Refinement Dataset Distillation (NRR-DD) method, which preserves the instance-specific and fine-grained regions in synthetic data while enriching non-critical regions with more class-general information. |
Minh-Tuan Tran; Trung Le; Xuan-May Le; Thanh-Toan Do; Dinh Phung; | code |
| 404 | Effortless Active Labeling for Long-Term Test-Time Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate how to achieve effortless active labeling so that a maximum of one sample is selected for annotation in each batch. |
Guowei Wang; Changxing Ding; | code |
| 405 | U-Know-DiffPAN: An Uncertainty-aware Knowledge Distillation Diffusion Framework with Details Enhancement for PAN-Sharpening Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, diffusion-based approaches lack sufficient conditioning to fully utilize Panchromatic (PAN) images and low-resolution multispectral (LRMS) inputs effectively. To address these challenges, we propose an uncertainty-aware knowledge distillation diffusion framework with details enhancement for PAN-sharpening, called U-Know-DiffPAN. |
Sungpyo Kim; Jeonghyeok Do; Jaehyup Lee; Munchurl Kim; | code |
| 406 | Active Event-based Stereo Vision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel problem setting, namely active event-based stereo vision, which provides the first insight of integrating binocular event cameras and an infrared projector for high-speed depth sensing. |
Jianing Li; Yunjian Zhang; Haiqian Han; Xiangyang Ji; | code |
| 407 | Period-LLM: Extending The Periodic Capability of Multimodal Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces Period-LLM, a multimodal large language model designed to enhance the performance of periodic tasks across various modalities, and constructs a benchmark of varying difficulty for evaluating the cross-modal periodic capabilities of large models. |
Yuting Zhang; Hao Lu; Qingyong Hu; Yin Wang; Kaishen Yuan; Xin Liu; Kaishun Wu; | code |
| 408 | Sparse2DGS: Geometry-Prioritized Gaussian Splatting for Surface Reconstruction from Sparse Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a Gaussian Splatting method for surface reconstruction using sparse input views. |
Jiang Wu; Rui Li; Yu Zhu; Rong Guo; Jinqiu Sun; Yanning Zhang; | code |
| 409 | Bootstrap Your Own Views: Masked Ego-Exo Modeling for Fine-grained View-invariant Video Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel masked ego-exo modeling that promotes both causal temporal dynamics and cross-view alignment, called Bootstrap Your Own Views (BYOV), for fine-grained view-invariant video representation learning from unpaired ego-exo videos. |
Jungin Park; Jiyoung Lee; Kwanghoon Sohn; | code |
| 410 | GazeGene: Large-scale Synthetic Gaze Dataset with 3D Eyeball Annotations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present GazeGene, a new large-scale synthetic gaze dataset with photo-realistic samples. |
Yiwei Bao; Zhiming Wang; Feng Lu; | code |
| 411 | Mind The Gap: Confidence Discrepancy Can Guide Federated Semi-Supervised Learning Across Pseudo-Mismatch Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we study the problem of FSSL in-depth and show that (1) heterogeneity exacerbates pseudo-label mismatches, further degrading model performance and convergence, and (2) local and global models’ predictive tendencies diverge as heterogeneity increases. |
Yijie Liu; Xinyi Shang; Yiqun Zhang; Yang Lu; Chen Gong; Jing-Hao Xue; Hanzi Wang; | code |
| 412 | SuperLightNet: Lightweight Parameter Aggregation Network for Multimodal Brain Tumor Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Experimental results demonstrate that the proposed method achieves a leading 95.59% reduction in parameter count, a 96.78% improvement in computational efficiency, a 96.86% enhancement in memory access performance, and an average performance gain of 0.21% on the BraTS2019 and BraTS2021 datasets in comparison with the state-of-the-art methods. |
Feng Yu; Jiacheng Cao; Li Liu; Minghua Jiang; | code |
| 413 | PS-Diffusion: Photorealistic Subject-Driven Image Editing with Disentangled Control and Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, it is challenging to obtain photorealistic results, which simulate contextual interactions, such as reflections, illumination, and shadows, induced by merging the target object into the source image. To address this issue, we propose PS-Diffusion, which ensures realistic and consistent object-scene blending while maintaining the invariance of subject appearance during editing. |
Weicheng Wang; Guoli Jia; Zhongqi Zhang; Liang Lin; Jufeng Yang; | code |
| 414 | Matrix3D: Large Photogrammetry Model All-in-One Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis, all within a single model. |
Yuanxun Lu; Jingyang Zhang; Tian Fang; Jean-Daniel Nahmias; Yanghai Tsin; Long Quan; Xun Cao; Yao Yao; Shiwei Li; | code |
| 415 | Image Quality Assessment: Investigating Causal Perceptual Effects with Abductive Counterfactual Inference Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an FR-IQA method based on abductive counterfactual inference to investigate the causal relationships between deep network features and perceptual distortions. |
Wenhao Shen; Mingliang Zhou; Yu Chen; Xuekai Wei; Yong Feng; Huayan Pu; Weijia Jia; | code |
| 416 | ROS-SAM: High-Quality Interactive Segmentation for Remote Sensing Moving Object Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose ROS-SAM, a method designed to achieve high-quality interactive segmentation while preserving generalization across diverse remote sensing data. |
Zhe Shan; Yang Liu; Lei Zhou; Cheng Yan; Heng Wang; Xia Xie; | code |
| 417 | LP-Diff: Towards Improved Restoration of Real-World Degraded License Plate Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To better restore severely degraded LP, we propose a novel Diffusion-based network, called LP-Diff, to tackle real-world LPIR tasks. |
Haoyan Gong; Zhenrong Zhang; Yuzheng Feng; Anh Nguyen; Hongbin Liu; | code |
| 418 | Which Viewpoint Shows It Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). |
Sagnik Majumder; Tushar Nagarajan; Ziad Al-Halah; Reina Pradhan; Kristen Grauman; | code |
| 419 | Birth and Death of A Rose Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We study the problem of generating temporal object intrinsics–temporally evolving sequences of object geometry, reflectance, and texture, such as a blooming rose–from pre-trained 2D foundation models. Unlike conventional 3D modeling and animation techniques that require extensive manual effort and expertise, we introduce a method that generates such assets with signals distilled from pretrained 2D diffusion models. |
Chen Geng; Yunzhi Zhang; Shangzhe Wu; Jiajun Wu; | code |
| 420 | PO3AD: Predicting Point Offsets Toward Better 3D Point Cloud Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response to those predicaments, we introduce an innovative approach that emphasizes learning point offsets, targeting more informative pseudo-abnormal points, thus fostering more effective distillation of normal data representations. |
Jianan Ye; Weiguang Zhao; Xi Yang; Guangliang Cheng; Kaizhu Huang; | code |
| 421 | LMO: Linear Mamba Operator for MRI Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To counteract the dilemma, we propose an innovative Linear Mamba Operator (LMO) to ensure consistency and generalization, while still enjoying desirable interpretability. |
Wei Li; Jiawei Jiang; Jie Wu; Kaihao Yu; Jianwei Zheng; | code |
| 422 | Language-Guided Audio-Visual Learning for Long-Term Sports Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Previous works require a large number of model parameters to learn potential associations between actions and music. To address this issue, we propose a language-guided audio-visual learning (MLAVL) framework that models "audio-action-visual" correlations guided by low-cost language modality. |
Huangbiao Xu; Xiao Ke; Huanqi Wu; Rui Xu; Yuezhou Li; Wenzhong Guo; | code |
| 423 | Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This leads to two issues: (1) the model cannot effectively detect errors using static prototypes when the inference environment or action execution distribution differs from training; and (2) the model may also use the wrong prototypes to detect errors if the ongoing action label is not the same as the predicted one. To address this problem, we propose an Adaptive Multiple Normal Action Representation (AMNAR) framework. |
Wei-Jin Huang; Yuan-Ming Li; Zhi-Wei Xia; Yu-Ming Tang; Kun-Yu Lin; Jian-Fang Hu; Wei-Shi Zheng; | code |
| 424 | Visual Consensus Prompting for Co-Salient Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: 2) This paradigm involves globally updating all parameters of the model, which is parameter-inefficient and hinders the effective representation of knowledge within the foundation model for this task. Therefore, in this paper, we propose an interaction-effective and parameter-efficient concise architecture for the CoSOD task, addressing two key limitations. |
Jie Wang; Nana Yu; Zihao Zhang; Yahong Han; | code |
| 425 | DCEvo: Discriminative Cross-Dimensional Evolutionary Learning for Infrared and Visible Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing approaches typically treat image fusion and subsequent high-level tasks as separate processes, resulting in fused images that offer only marginal gains in task performance and fail to provide constructive feedback for optimizing the fusion process. To overcome these limitations, we propose a Discriminative Cross-Dimension Evolutionary Learning Framework, termed DCEvo, which simultaneously enhances visual quality and perception accuracy. |
Jinyuan Liu; Bowei Zhang; Qingyun Mei; Xingyuan Li; Yang Zou; Zhiying Jiang; Long Ma; Risheng Liu; Xin Fan; | code |
| 426 | Jailbreaking The Non-Transferable Barrier Via Test-Time Data Disguising Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we reveal the first loophole of black-box NTL models by proposing a novel attack method (dubbed JailNTL) to jailbreak the non-transferable barrier through test-time data disguising. The main idea of JailNTL is to disguise unauthorized data so it can be identified as authorized by the NTL model, thereby bypassing the non-transferable barrier without modifying the NTL model weights. |
Yongli Xiang; Ziming Hong; Lina Yao; Dadong Wang; Tongliang Liu; | code |
| 427 | ACE: Anti-Editing Concept Erasure in Text-to-Image Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing concept erasure methods achieve superior results in preventing the production of the erased concept from prompts, but typically perform poorly in preventing undesired editing. To address this issue, we propose an Anti-Editing Concept Erasure (ACE) method, which not only erases the target concept during generation but also filters it out during editing. |
Zihao Wang; Yuxiang Wei; Fan Li; Renjing Pei; Hang Xu; Wangmeng Zuo; | code |
| 428 | Generating Multimodal Driving Scenes Via Next-Scene Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including a novel addition of the map modality. |
Yanhao Wu; Haoyang Zhang; Tianwei Lin; Lichao Huang; Shujie Luo; Rui Wu; Congpei Qiu; Wei Ke; Tong Zhang; | code |
| 429 | Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we integrate point-based methods into a semi-supervised counting framework based on pseudo-labeling, enabling the training of a counter with only a few annotated samples supplemented by a large volume of pseudo-labeled data. |
Wei Lin; Chenyang Zhao; Antoni B. Chan; | code |
| 430 | PolarFree: Polarization-based Reflection-Free Imaging Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Besides, to fully exploit the potential of polarization cues for reflection removal, we introduce PolarFree, which leverages diffusion process to generate reflection-free cues for accurate reflection removal. |
Mingde Yao; Menglu Wang; King-Man Tam; Lingen Li; Tianfan Xue; Jinwei Gu; | code |
| 431 | OccMamba: Semantic Occupancy Prediction with State Space Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the global modeling and linear computation complexity of the Mamba architecture, we present the first Mamba-based network for semantic occupancy prediction, termed OccMamba. |
Heng Li; Yuenan Hou; Xiaohan Xing; Yuexin Ma; Xiao Sun; Yanyong Zhang; | code |
| 432 | SfM-Free 3D Gaussian Splatting Via Hierarchical Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel SfM-Free 3DGS (SFGS) method for video input, eliminating the need for known camera poses and SfM preprocessing. |
Bo Ji; Angela Yao; | code |
| 433 | From Sparse to Dense: Camera Relocalization with Scene-Specific Detector from Feature Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a novel camera relocalization method, STDLoc, which leverages Feature Gaussian as scene representation. |
Zhiwei Huang; Hailin Yu; Yichun Shentu; Jin Yuan; Guofeng Zhang; | code |
| 434 | Optical-Flow Guided Prompt Optimization for Coherent Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Within diffusion frameworks, guidance techniques have proven effective in enhancing output quality during inference; however, applying these methods to video diffusion models introduces additional complexity of handling computations across entire sequences. To address this, we propose a novel framework called MotionPrompt that guides the video generation process via optical flow. |
Hyelin Nam; Jaemin Kim; Dohun Lee; Jong Chul Ye; | code |
| 435 | Deterministic Image-to-Image Translation Via Denoising Brownian Bridge Models with Dual Approximators Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a denoising Brownian bridge model with dual approximators (Dual-approx Bridge), a novel generative model that exploits the Brownian bridge dynamics and two neural network-based approximators (one for the forward and one for the reverse process) to produce faithful output with negligible variance and high image quality in I2I translations. |
Bohan Xiao; Peiyong Wang; Qisheng He; Ming Dong; | code |
| 436 | WAVE: Weight Templates for Adaptive Initialization of Variable-sized Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, deployment constraints often necessitate models of varying sizes, exposing limitations in the conventional pre-training and fine-tuning paradigm, particularly when target model sizes are incompatible with pre-trained ones. To address this challenge, we propose WAVE, a novel approach that reformulates variable-sized model initialization from a multi-task perspective, where initializing each model size is treated as a distinct task. |
Fu Feng; Yucheng Xie; Jing Wang; Xin Geng; | code |
| 437 | Redefining <Creative> in Dictionary: Towards An Enhanced Semantic Understanding of Creative Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current methods rely heavily on synthesizing reference prompts or images to achieve a creative effect, typically requiring retraining for each unique creative output, a process that is computationally intensive and limits practical applications. To address this, we introduce CreTok, which brings meta-creativity to diffusion models by redefining "creative" as a new token, <CreTok>, thus enhancing models’ semantic understanding for combinatorial creativity. |
Fu Feng; Yucheng Xie; Xu Yang; Jing Wang; Xin Geng; | code |
| 438 | CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs) by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. |
Qingqing Zhao; Yao Lu; Moo Jin Kim; Zipeng Fu; Zhuoyang Zhang; Yecheng Wu; Zhaoshuo Li; Qianli Ma; Song Han; Chelsea Finn; Ankur Handa; Tsung-Yi Lin; Gordon Wetzstein; Ming-Yu Liu; Donglai Xiang; | code |
| 439 | Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose to use the powerful prior knowledge of a pre-trained diffusion model in DUNs to achieve high-quality reconstruction with fewer steps for image CS. |
Chen Liao; Yan Shen; Dan Li; Zhongli Wang; | code |
| 440 | Open Set Label Shift with Test Time Out-of-Distribution Reference Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Open set label shift (OSLS) occurs when label distributions change from a source to a target distribution, and the target distribution has an additional out-of-distribution (OOD) class. In this work, we build estimators for both source and target open set label distributions using a source domain in-distribution (ID) classifier and an ID/OOD classifier. |
Changkun Ye; Russell Tsuchida; Lars Petersson; Nick Barnes; | code |
| 441 | CoA: Towards Real Image Dehazing Via Compression-and-Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Therefore, there is an urgent need for an algorithm that excels in both efficiency and adaptability to address real image dehazing effectively. This work proposes a Compression-and-Adaptation (CoA) computational flow to tackle these challenges from a divide-and-conquer perspective. |
Long Ma; Yuxin Feng; Yan Zhang; Jinyuan Liu; Weimin Wang; Guang-Yong Chen; Chengpei Xu; Zhuo Su; | code |
| 442 | Boosting Adversarial Transferability Through Augmentation in Hypothesis Space Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we observe a mirroring relationship between model generalization and adversarial example transferability. |
Yu Guo; Weiquan Liu; Qingshan Xu; Shijun Zheng; Shujun Huang; Yu Zang; Siqi Shen; Chenglu Wen; Cheng Wang; | code |
| 443 | DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this issue, we propose a new task, multi-round dual-speaker interaction for 3D talking head generation, which requires models to handle and generate both speaking and listening behaviors in continuous conversation. To solve this task, we introduce DualTalk, a novel unified framework that integrates the dynamic behaviors of speakers and listeners to simulate realistic and coherent dialogue interactions. |
Ziqiao Peng; Yanbo Fan; Haoyu Wu; Xuan Wang; Hongyan Liu; Jun He; Zhaoxin Fan; | code |
| 444 | Reasoning in Visual Navigation of End-to-end Trained Agents: A Dynamical Systems Approach Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we focus on the fine-grained behavior of fast-moving real robots and present a large-scale experimental study involving navigation episodes in a real environment with a physical robot, where we analyze the type of reasoning emerging from end-to-end training. |
Steeven Janny; Hervé Poirier; Leonid Antsfeld; Guillaume Bono; Gianluca Monaci; Boris Chidlovskii; Francesco Giuliari; Alessio Del Bue; Christian Wolf; | code |
| 445 | DynRefer: Delving Into Region-level Multimodal Tasks Via Dynamic Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a DynRefer approach, to pursue high-accuracy region-level referring through mimicking the resolution adaptability of human visual cognition. |
Yuzhong Zhao; Feng Liu; Yue Liu; Mingxiang Liao; Chen Gong; Qixiang Ye; Fang Wan; | code |
| 446 | Exploring CLIP’s Dense Knowledge for Weakly Supervised Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose ExCEL to explore CLIP’s dense knowledge via a novel patch-text alignment paradigm for WSSS. |
Zhiwei Yang; Yucong Meng; Kexue Fu; Feilong Tang; Shuo Wang; Zhijian Song; | code |
| 447 | Derivative-Free Diffusion Manifold-Constrained Gradient for Unified XAI Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they (1) require white-box access to models, (2) are vulnerable to adversarial attacks, and (3) produce attributions that lie off the image manifold, leading to explanations that are not amenable to human perception. To overcome these challenges, we introduce Derivative-Free Diffusion Manifold-Constrained Gradients (FreeMCG): by leveraging ensemble Kalman filters and diffusion models, we derive a derivative-free approximation of the model’s gradient projected onto the data manifold, requiring access only to the model’s outputs (i.e., black-box setting). |
Won Jun Kim; Hyungjin Chung; Jaemin Kim; Sangmin Lee; Byeongsu Sim; Jong Chul Ye; | code |
| 448 | OVO-Bench: How Far Is Your Video-LLMs from Real-World Online Video Understanding? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for advanced online video understanding capability benchmarking. |
Junbo Niu; Yifei Li; Ziyang Miao; Chunjiang Ge; Yuanhang Zhou; Qihao He; Xiaoyi Dong; Haodong Duan; Shuangrui Ding; Rui Qian; Pan Zhang; Yuhang Zang; Yuhang Cao; Conghui He; Jiaqi Wang; | code |
| 449 | Fish-Vista: A Multi-Purpose Dataset for Understanding & Identification of Traits from Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Fish-Visual Trait Analysis (Fish-Vista), the first organismal image dataset designed for the analysis of visual traits of aquatic species directly from images using machine learning and computer vision methods. |
Kazi Sajeed Mehrab; M. Maruf; Arka Daw; Abhilash Neog; Harish Babu Manogaran; Mridul Khurana; Zhenyang Feng; Bahadir Altintas; Yasin Bakis; Elizabeth G Campolongo; Matthew J Thompson; Xiaojun Wang; Hilmar Lapp; Tanya Berger-Wolf; Paula Mabee; Henry Bart; Wei-Lun Chao; Wasila M Dahdul; Anuj Karpatne; | code |
| 450 | Learning Physics From Video: Unsupervised Physical Parameter Estimation for Continuous Dynamical Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose an unsupervised method to estimate the physical parameters of known, continuous governing equations from single videos suitable for different dynamical systems beyond motion and robust to initialization. |
Alejandro Castañeda Garcia; Jan Warchocki; Jan van Gemert; Daan Brinks; Nergis Tomen; | code |
| 451 | Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To enable continuous dataset expansion as mobile platforms evolve, we present an automated framework that leverages publicly available video content to create comprehensive task datasets without manual annotation. |
Yunseok Jang; Yeda Song; Sungryull Sohn; Lajanugen Logeswaran; Tiange Luo; Dong-Ki Kim; Kyunghoon Bae; Honglak Lee; | code |
| 452 | MotionPRO: Exploring The Role of Pressure in Human MoCap and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we revisit human MoCap from the perspective of interaction between human body and physical world by exploring the role of pressure. |
Shenghao Ren; Yi Lu; Jiayi Huang; Jiayi Zhao; He Zhang; Tao Yu; Qiu Shen; Xun Cao; | code |
| 453 | Open-World Amodal Appearance Completion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Open-World Amodal Appearance Completion, a training-free framework that expands amodal completion capabilities by accepting flexible text queries as input. |
Jiayang Ao; Yanbei Jiang; Qiuhong Ke; Krista A. Ehinger; | code |
| 454 | MuTri: Multi-view Tri-alignment for OCT to OCTA 3D Image Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose the multi-view Tri-alignment framework for OCT to OCTA 3D image translation in discrete and finite space, named MuTri. |
Zhuangzhuang Chen; Hualiang Wang; Chubin Ou; Xiaomeng Li; | code |
| 455 | Improving Adversarial Transferability on Vision Transformers Via Forward Propagation Refinement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. |
Yuchen Ren; Zhengyu Zhao; Chenhao Lin; Bo Yang; Lu Zhou; Zhe Liu; Chao Shen; | code |
| 456 | S2Gaussian: Sparse-View Super-Resolution 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we aim ambitiously for a realistic yet challenging problem, namely, how to reconstruct high-quality 3D scenes from sparse low-resolution views that simultaneously suffer from deficient perspectives and contents. |
Yecong Wan; Mingwen Shao; Yuanshuo Cheng; Wangmeng Zuo; | code |
| 457 | Cross-Rejective Open-Set SAR Image Registration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In general, registration is regarded as a typical closed-set classification, which forces each keypoint to be classified into the given classes while ignoring an essential issue: numerous redundant keypoints fall outside the given classes, which unavoidably results in capturing incorrect matched-point pairs. Based on this, we propose a Cross-Rejective Open-set SAR Image Registration (CroR-OSIR) method. |
Shasha Mao; Shiming Lu; Zhaolong Du; Licheng Jiao; Shuiping Gou; Luntian Mou; Xuequan Lu; Lin Xiong; Yimeng Zhang; | code |
| 458 | When The Future Becomes The Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: 2) Previous MVM methods primarily recover the masked patches in the pixel space, leading to insufficient information compression for downstream tasks. To address these challenges jointly, we propose a self-supervised framework that leverages Temporal Correspondence for video Representation learning (T-CoRe). |
Yang Liu; Qianqian Xu; Peisong Wen; Siran Dai; Qingming Huang; | code |
| 459 | Enhancing Diversity for Data-free Quantization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the generators in these methods have the mode collapse problem, making them unable to synthesize diverse data. To solve this problem, we leverage the information from the full-precision model and enhance both inter-class and intra-class diversity for generating better calibration data, by devising a multi-layer feature mixer and normalization-flow-based attention. |
Kai Zhao; Zhihao Zhuang; Miao Zhang; Chenjuan Guo; Yang Shu; Bin Yang; | code |
| 460 | Navigating Image Restoration with VAR’s Distribution Alignment Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Generative models trained on extensive high-quality datasets effectively capture the structural and statistical properties of clean images, rendering them powerful priors for transforming degraded features into clean ones in image restoration. |
Siyang Wang; Naishan Zheng; Jie Huang; Feng Zhao; | code |
| 461 | Parametric Point Cloud Completion for Polygonal Surface Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We argue that while current point cloud completion techniques may recover missing points, they are not optimized for polygonal surface reconstruction, where the parametric representation of underlying surfaces remains overlooked. To address this gap, we introduce parametric completion, a novel paradigm for point cloud completion, which recovers parametric primitives instead of individual points to convey high-level geometric structures. |
Zhaiyu Chen; Yuqing Wang; Liangliang Nan; Xiao Xiang Zhu; | code |
| 462 | Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Self-Expansion of pre-trained models with Modularized Adaptation (SEMA), a novel approach to enhance the control of stability-plasticity balance in PTM-based CL. |
Huiyi Wang; Haodong Lu; Lina Yao; Dong Gong; | code |
| 463 | UniScene: Unified Occupancy-centric Driving Scene Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce UniScene, the first unified framework for generating three key data forms – semantic occupancy, video, and LiDAR – in driving scenes. |
Bohan Li; Jiazhe Guo; Hongsi Liu; Yingshuang Zou; Yikang Ding; Xiwu Chen; Hu Zhu; Feiyang Tan; Chi Zhang; Tiancai Wang; Shuchang Zhou; Li Zhang; Xiaojuan Qi; Hao Zhao; Mu Yang; Wenjun Zeng; Xin Jin; | code |
| 464 | Libra-Merging: Importance-redundancy and Pruning-merging Trade-off for Acceleration Plug-in in Large Vision-Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Besides, token merging and pruning face a dilemma between disrupting target token information and losing non-target token information. To solve these problems, we propose a novel visual token compression scheme, named Libra-Merging. |
Longrong Yang; Dong Shen; Chaoxiang Cai; Kaibing Chen; Fan Yang; Tingting Gao; Di Zhang; Xi Li; | code |
| 465 | OpenSDI: Spotting Diffusion-Generated Images in The Open World Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the OpenSDI challenge, we propose a Synergizing Pretrained Models (SPM) scheme to build up a mixture of foundation models. |
Yabin Wang; Zhiwu Huang; Xiaopeng Hong; | code |
| 466 | PURA: Parameter Update-Recovery Test-Time Adaption for RGB-T Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: At the same time, the gradient computations involved in the optimization process impose a significant computational burden. To address these challenges, we propose a Parameter Update-Recovery Adaptation (PURA) framework based on parameter decomposition. |
Zekai Shao; Yufan Hu; Bin Fan; Hongmin Liu; | code |
| 467 | DeSplat: Decomposed Gaussian Splatting for Distractor-Free Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel method, DeSplat, that directly separates distractors and static scene elements purely based on volume rendering of Gaussian primitives. |
Yihao Wang; Marcus Klasson; Matias Turkulainen; Shuzhe Wang; Juho Kannala; Arno Solin; | code |
| 468 | Not All Parameters Matter: Masking Diffusion Models for Enhancing Generation Ability Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We initially investigate the key contributions of the U-Net parameters to the denoising process and identify that properly zeroing out certain parameters (including large parameters) contributes to denoising, substantially improving the generation quality on the fly. Capitalizing on this discovery, we propose a simple yet effective method, termed "MaskUNet", that enhances generation quality with a negligible number of parameters. |
Lei Wang; Senmao Li; Fei Yang; Jianye Wang; Ziheng Zhang; Yuhan Liu; Yaxing Wang; Jian Yang; | code |
| 469 | Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. |
Jinjin Zhang; Qiuyu Huang; Junjie Liu; Xiefan Guo; Di Huang; | code |
| 470 | Towards Training-free Anomaly Detection with Vision and Language Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. |
Jinjin Zhang; Guodong Wang; Yizhou Jin; Di Huang; | code |
| 471 | EmotiveTalk: Expressive Talking Head Generation Through Audio Information Decoupling and Emotional Video Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation. In this research, we propose an EmotiveTalk framework to address these issues. |
Haotian Wang; Yuzhe Weng; Yueyan Li; Zilu Guo; Jun Du; Shutong Niu; Jiefeng Ma; Shan He; Xiaoyan Wu; Qiming Hu; Bing Yin; Cong Liu; Qingfeng Liu; | code |
| 472 | PHGC: Procedural Heterogeneous Graph Completion for Natural Language Task Verification in Egocentric Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Describing rules with natural language provides generalizable applications, but also raises cross-modal heterogeneity and hierarchical misalignment challenges. In this paper, we propose a novel approach termed Procedural Heterogeneous Graph Completion (PHGC), which addresses these challenges with heterogeneous graphs representing the logic in rules and operation flows. |
Xun Jiang; Zhiyi Huang; Xing Xu; Jingkuan Song; Fumin Shen; Heng Tao Shen; | code |
| 473 | UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. |
Yiheng Li; Ruibing Hou; Hong Chang; Shiguang Shan; Xilin Chen; | code |
| 474 | DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we clone a learnable parallel prompt based on the backbone prompt, and introduce a variable Weighting-Decoupling framework to independently control the optimization directions of dual prompts specific to base or new tasks, thus avoiding the conflict in generalization. |
Haoyang Li; Liang Wang; Chao Wang; Jing Jiang; Yan Peng; Guodong Long; | code |
| 475 | BHViT: Binarized Hybrid Vision Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, due to the structural differences between CNN and Transformer architectures, simply applying binary CNN strategies to the ViT models will lead to a significant performance drop. To tackle this challenge, we propose BHViT, a binarization-friendly hybrid ViT architecture, and its full binarization model with the guidance of three important observations. |
Tian Gao; Yu Zhang; Zhiyuan Zhang; Huajun Liu; Kaijie Yin; Chengzhong Xu; Hui Kong; | code |
| 476 | Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we introduce a novel visual-language alignment method that casts the key issue as a Visual Question Answering with Mask generation (VQAMask) task, optimizing two tasks simultaneously: VQA-based text parsing and mask generation. |
Zining Wang; Tongkun Guan; Pei Fu; Chen Duan; Qianyi Jiang; Zhentao Guo; Shan Guo; Junfeng Luo; Wei Shen; Xiaokang Yang; | code |
| 477 | Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we uncover the potential of the internal representations within frozen medical foundation diffusion models as highly efficient zero-shot learners for tumor segmentation by introducing a novel framework named DiffuGTS. |
Yankai Jiang; Peng Zhang; Donglin Yang; Yuan Tian; Hai Lin; Xiaosong Wang; | code |
| 478 | HeMoRa: Unsupervised Heuristic Consensus Sampling for Robust Point Cloud Registration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose HeMoRa, a new unsupervised framework that trains a Heuristic information Generator (HeGen) to estimate sampling probabilities for correspondences using a Multi-order Reward Aggregator (MoRa) loss. |
Shaocheng Yan; Yiming Wang; Kaiyan Zhao; Pengcheng Shi; Zhenjun Zhao; Yongjun Zhang; Jiayuan Li; | code |
| 479 | AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As an alternative, we propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence in feed-forward fashion. |
Felix Wimbauer; Weirong Chen; Dominik Muhle; Christian Rupprecht; Daniel Cremers; | code |
| 480 | Segment Any Motion in Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features and leverages SAM2 for pixel-level mask densification through an iterative prompting strategy. |
Nan Huang; Wenzhao Zheng; Chenfeng Xu; Kurt Keutzer; Shanghang Zhang; Angjoo Kanazawa; Qianqian Wang; | code |
| 481 | CCIN: Compositional Conflict Identification and Neutralization for Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, this paper proposes the Compositional Conflict Identification and Neutralization (CCIN) framework, which sequentially identifies and neutralizes compositional conflicts for effective CIR. |
Likai Tian; Jian Zhao; Zechao Hu; Zhengwei Yang; Hao Li; Lei Jin; Zheng Wang; Xuelong Li; | code |
| 482 | Nullu: Mitigating Object Hallucinations in Large Vision-Language Models Via HalluSpace Projection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent studies have shown that large vision-language models (LVLMs) often suffer from the issue of object hallucinations (OH). To mitigate this issue, we introduce an efficient method that edits the model weights based on an unsafe subspace, which we call HalluSpace in this paper. |
Le Yang; Ziwei Zheng; Boxu Chen; Zhengyu Zhao; Chenhao Lin; Chao Shen; | code |
| 483 | DyCON: Dynamic Uncertainty-aware Consistency and Contrastive Learning for Semi-supervised Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, current methods struggle with class imbalance and high uncertainty from pathology variations, leading to inaccurate segmentation in 3D medical images. To address these challenges, we present DyCON, a Dynamic Uncertainty-aware Consistency and Contrastive Learning framework that enhances the generalization of consistency methods with two complementary losses: Uncertainty-aware Consistency Loss (UnCL) and Focal Entropy-aware Contrastive Loss (FeCL). |
Maregu Assefa; Muzammal Naseer; Iyyakutti Iyappan Ganapathi; Syed Sadaf Ali; Mohamed L Seghier; Naoufel Werghi; | code |
| 484 | AlphaPre: Amplitude-Phase Disentanglement Model for Precipitation Nowcasting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the fact that in the frequency domain, phase variations are shown to correspond to changes in the position of precipitation, while amplitude variations are linked to intensity changes, we propose an amplitude-phase disentanglement model called AlphaPre, which separately learns the position and intensity changes of precipitation. |
Kenghong Lin; Baoquan Zhang; Demin Yu; Wenzhi Feng; Shidong Chen; Feifan Gao; Xutao Li; Yunming Ye; | code |
| 485 | SVG-IR: Spatially-Varying Gaussian Splatting for Inverse Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel framework called Spatially-varying Gaussian Inverse Rendering (SVG-IR), aimed at enhancing both NVS and relighting quality. |
Hanxiao Sun; Yupeng Gao; Jin Xie; Jian Yang; Beibei Wang; | code |
| 486 | Functionality Understanding and Segmentation in 3D Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Fun3DU, the first approach designed for functionality understanding in 3D scenes. |
Jaime Corsetti; Francesco Giuliari; Alice Fasoli; Davide Boscaini; Fabio Poiesi; | code |
| 487 | Splatter-360: Generalizable 360 Gaussian Splatting for Wide-baseline Panoramic Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents Splatter-360, a novel end-to-end generalizable 3DGS framework designed to handle wide-baseline panoramic images. |
Zheng Chen; Chenming Wu; Zhelun Shen; Chen Zhao; Weicai Ye; Haocheng Feng; Errui Ding; Song-Hai Zhang; | code |
| 488 | Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking by enforcing an invariance of the feature representation of a target with respect to random masking operations modeled by a spatial Cox process. |
You Wu; Xucheng Wang; Xiangyang Yang; Mengyuan Liu; Dan Zeng; Hengzhou Ye; Shuiwang Li; | code |
| 489 | Plug-and-Play Versatile Compressed Video Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present a versatile codec-aware enhancement framework that reuses codec information to adaptively enhance videos under different compression settings, assisting various downstream vision tasks without introducing a computation bottleneck. |
Huimin Zeng; Jiacheng Li; Zhiwei Xiong; | code |
| 490 | Sea-ing in Low-light Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a Self-supervised Low-light Underwater Image and Depth recovery network (SelfLUID-Net) for joint estimation of depth and restored image in real-time from a single LLUW image. |
Nisha Varghese; A. N. Rajagopalan; | code |
| 491 | ECVC: Exploiting Non-Local Correlations in Multiple Frames for Contextual Video Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Based on the proposed techniques, we present a video compression scheme ECVC. |
Wei Jiang; Junru Li; Kai Zhang; Li Zhang; | code |
| 492 | Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, due to the prevalence of phonetically similar syllables with distinct lip shapes in language, these near-homophone syllables tend to exhibit significant coupling in self-supervised audio feature spaces, leading to the averaging effect in subsequent lip motion generation. To address this issue, this paper proposes a plug-and-play semantic decorrelation module, Wav2Sem. |
Hao Li; Ju Dai; Xin Zhao; Feng Zhou; Junjun Pan; Lei Li; | code |
| 493 | Samba: A Unified Mamba-based Framework for General Salient Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The emerging state space model, namely Mamba, has demonstrated its potential to balance global receptive fields and computational complexity. Therefore, we propose a novel unified framework based on the pure Mamba architecture, dubbed saliency Mamba (Samba), to flexibly handle general SOD tasks, including RGB/RGB-D/RGB-T SOD, video SOD (VSOD), and RGB-D VSOD. |
Jiahao He; Keren Fu; Xiaohong Liu; Qijun Zhao; | code |
| 494 | OmniSplat: Taming Feed-Forward 3D Gaussian Splatting for Omnidirectional Images with Editable Capabilities Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose OmniSplat, a training-free fast feed-forward 3DGS generation framework for omnidirectional images. |
Suyoung Lee; Jaeyoung Chung; Kihoon Kim; Jaeyoo Huh; Gunhee Lee; Minsoo Lee; Kyoung Mu Lee; | code |
| 495 | Adversarial Diffusion Compression for Real-World Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel Real-ISR method, AdcSR, by distilling the one-step diffusion network OSEDiff into a streamlined diffusion-GAN model under our Adversarial Diffusion Compression (ADC) framework. |
Bin Chen; Gehui Li; Rongyuan Wu; Xindong Zhang; Jie Chen; Jian Zhang; Lei Zhang; | code |
| 496 | Progressive Correspondence Regenerator for Robust 3D Registration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing correspondence refinement methods mostly follow the paradigm of outlier removal, which either fails to correctly identify the accurate correspondences under extreme outlier ratios, or selects too few correct correspondences to support robust registration. To address this challenge, we propose a novel approach named Regor, a progressive correspondence regenerator that generates higher-quality matches while remaining sufficiently robust to numerous outliers. |
Guiyu Zhao; Sheng Ao; Ye Zhang; Kai Xu; Yulan Guo; | code |
| 497 | Flow-NeRF: Joint Learning of Geometry, Poses, and Dense Flow Within Unified Neural Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present Flow-NeRF, a unified framework that simultaneously optimizes scene geometry, camera poses, and dense optical flow all on-the-fly. |
Xunzhi Zheng; Dan Xu; | code |
| 498 | GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose GO-N3RDet, a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields. |
Zechuan Li; Hongshan Yu; Yihao Ding; Jinhao Qiao; Basim Azam; Naveed Akhtar; | code |
| 499 | Theoretical Insights in Model Inversion Robustness and Conditional Entropy Maximization for Collaborative Inference Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: By locally encoding raw data into intermediate features, collaborative inference enables end users to leverage powerful deep learning models without exposure of sensitive raw data to cloud servers. |
Song Xia; Yi Yu; Wenhan Yang; Meiwen Ding; Zhuo Chen; Ling-Yu Duan; Alex C. Kot; Xudong Jiang; | code |
| 500 | PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel method, Prediction Shift Backdoor Detection (PSBD), leveraging an uncertainty-based approach requiring minimal unlabeled clean validation data. |
Wei Li; Pin-Yu Chen; Sijia Liu; Ren Wang; | code |
| 501 | Do Your Best and Get Enough Rest for Continual Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Retaining new knowledge for a long period of time without catastrophic forgetting is the critical problem of continual learning. Therefore, based on Ebbinghaus’ theory, we introduce the view-batch model that adjusts the learning schedules to optimize the recall interval between retraining the same samples. |
Hankyul Kang; Gregor Seifer; Donghyun Lee; Jongbin Ryu; | code |
| 502 | Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As computer vision seeks to replicate the human vision system, understanding infant visual development may offer valuable insights. In this paper, we present an interdisciplinary study exploring this question: can a computational model that imitates the infant learning process develop broader visual concepts that extend beyond the vocabulary it has heard, similar to how infants naturally learn? |
Xueyi Ke; Satoshi Tsutsui; Yayun Zhang; Bihan Wen; | code |
| 503 | DynPose: Largely Improving The Efficiency of Human Pose Estimation By A Simple Dynamic Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a straightforward yet effective dynamic framework called DynPose, designed to match diverse pose samples with the most appropriate models, thereby ensuring optimal performance and high efficiency. |
Yalong Xu; Lin Zhao; Chen Gong; Guangyu Li; Di Wang; Nannan Wang; | code |
| 504 | Let Samples Speak: Mitigating Spurious Correlation By Exploiting The Clusterness of Samples Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a data-oriented approach to mitigate the spurious correlation in deep learning models. |
Weiwei Li; Junzhuo Liu; Yuanyuan Ren; Yuchen Zheng; Yahao Liu; Wen Li; | code |
| 505 | Style Evolving Along Chain-of-Thought for Unknown-Domain Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The one-step prompt method may not effectively synthesize combined information involving various styles. To address this limitation, we propose a new method, i.e., Style Evolving along Chain-of-Thought, which aims to progressively integrate and expand style information along the chain of thought, enabling the continual evolution of styles. |
Zihao Zhang; Aming Wu; Yahong Han; | code |
| 506 | Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation Without 3D Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Aiming at overcoming the data shortage, we propose a novel training scheme, termed as Progressive Rendering Distillation (PRD), eliminating the need for 3D ground-truths by distilling multi-view diffusion models and adapting SD into a native 3D generator. |
Zhiyuan Ma; Xinyue Liang; Rongyuan Wu; Xiangyu Zhu; Zhen Lei; Lei Zhang; | code |
| 507 | Can’t Slow Me Down: Learning Robust and Hardware-Adaptive Object Detectors Against Latency Attacks for Edge Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: They exploit new attack surfaces in object detectors by creating a computational bottleneck in the post-processing module, which leads to cascading failure and puts the real-time downstream tasks at risk. In this work, we take an initial attempt to defend against this attack via background-attentive adversarial training that is also cognizant of the underlying hardware capabilities. |
Tianyi Wang; Zichen Wang; Cong Wang; Yuanchao Shu; Ruilong Deng; Peng Cheng; Jiming Chen; | code |
| 508 | Test-Time Visual In-Context Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample. |
Jiahao Xie; Alessio Tonioni; Nathalie Rauschmayr; Federico Tombari; Bernt Schiele; | code |
| 509 | VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. |
Lei Li; Yuancheng Wei; Zhihui Xie; Xuqing Yang; Yifan Song; Peiyi Wang; Chenxin An; Tianyu Liu; Sujian Li; Bill Yuchen Lin; Lingpeng Kong; Qi Liu; | code |
| 510 | DRiVE: Diffusion-based Rigging Empowers Generation of Versatile and Expressive Characters Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this gap, we curate AnimeRig, a large-scale dataset with detailed skeleton and skinning annotations. Building upon this, we propose DRiVE, a novel framework for generating and rigging 3D human characters with intricate structures. |
Mingze Sun; Junhao Chen; Junting Dong; Yurun Chen; Xinyu Jiang; Shiwei Mao; Puhua Jiang; Jingbo Wang; Bo Dai; Ruqi Huang; | code |
| 511 | Category-Agnostic Neural Object Rigging Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We evaluate the proposed method on a variety of object categories and demonstrate the effectiveness of the proposed framework. |
Guangzhao He; Chen Geng; Shangzhe Wu; Jiajun Wu; | code |
| 512 | Learning from Synchronization: Self-Supervised Uncalibrated Multi-View Person Association in Challenging Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose a self-supervised learning framework, consisting of an encoder-decoder model and a self-supervised pretext task, cross-view image synchronization, which aims to distinguish whether two images from different views are captured at the same time. |
Keqi Chen; Vinkle Srivastav; Didier Mutter; Nicolas Padoy; | code |
| 513 | Positive2Negative: Breaking The Information-Lossy Barrier in Self-Supervised Single Image Denoising Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing self-supervised image denoising paradigms (Noise2Noise and Noise2Void) rely heavily on information-lossy operations, such as downsampling and masking, culminating in low-quality denoising performance. In this paper, we propose a novel self-supervised single image denoising paradigm, Positive2Negative, to break the information-lossy barrier. Our paradigm involves two key steps: Renoised Data Construction (RDC) and Denoised Consistency Supervision (DCS). |
Tong Li; Lizhi Wang; Zhiyuan Xu; Lin Zhu; Wanxuan Lu; Hua Huang; | code |
| 514 | BOOTPLACE: Bootstrapped Object Placement with Detection Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. |
Hang Zhou; Xinxin Zuo; Rui Ma; Li Cheng; | code |
| 515 | ResCLIP: Residual Attention for Training-free Dense Vision-language Inference Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we reveal that the cross-correlation of self-attention in non-final layers of CLIP also exhibits localization properties. |
Yuhang Yang; Jinhong Deng; Wen Li; Lixin Duan; | code |
| 516 | PBR-NeRF: Inverse Rendering with Physics-Based Neural Fields Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our method addresses a key limitation in most NeRF and 3D Gaussian Splatting approaches: they estimate view-dependent appearance without modeling scene materials and illumination. To address this limitation, we present an inverse rendering (IR) model capable of jointly estimating scene geometry, materials, and illumination. |
Sean Wu; Shamik Basu; Tim Broedermann; Luc Van Gool; Christos Sakaridis; | code |
| 517 | 2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose 2DMamba, a novel 2D selective SSM framework that incorporates the 2D spatial structure of images into Mamba, with a highly optimized hardware-aware operator, achieving both spatial continuity and computational efficiency. |
Jingwei Zhang; Anh Tien Nguyen; Xi Han; Vincent Quoc-Huy Trinh; Hong Qin; Dimitris Samaras; Mahdi S. Hosseini; | code |
| 518 | Fitted Neural Lossless Image Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel approach named Fitted Neural Lossless Image Compression (FNLIC) that enhances efficiency through a two-phase fitting process. |
Zhe Zhang; Zhenzhong Chen; Shan Liu; | code |
| 519 | MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most existing methods struggle to address the trade-off in the shared latent space for generation quality vs. representation learning and efficiency. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. |
Siyuan Li; Luyuan Zhang; Zedong Wang; Juanxi Tian; Cheng Tan; Zicheng Liu; Chang Yu; Qingsong Xie; Haonan Lu; Haoqian Wang; Zhen Lei; | code |
| 520 | LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multi-view, multi-modal RGBS instruction-tuning dataset. |
Dominick Reilly; Rajatsubhra Chakraborty; Arkaprava Sinha; Manish Kumar Govind; Pu Wang; Francois Bremond; Le Xue; Srijan Das; | code |
| 521 | Unveiling Differences in Generative Models: A Scalable Differential Clustering Approach Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose solving a differential clustering problem to detect sample types generated differently by two generative models. |
Jingwei Zhang; Mohammad Jalali; Cheuk Ting Li; Farzan Farnia; | code |
| 522 | MUST: The First Dataset and Unified Framework for Multispectral UAV Single Object Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, progress in this field has been hindered by the lack of relevant datasets. To address this gap, we introduce the first large-scale Multispectral UAV Single Object Tracking dataset (MUST), which includes 250 video sequences spanning diverse environments and challenges, providing a comprehensive data foundation for multispectral UAV tracking. |
Haolin Qin; Tingfa Xu; Tianhao Li; Zhenxiang Chen; Tao Feng; Jianan Li; | code |
| 523 | FilmComposer: LLM-Driven Music Production for Silent Film Clips Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we implement music production for silent film clips using an LLM-driven method. |
Zhifeng Xie; Qile He; Youjia Zhu; Qiwei He; Mengtian Li; | code |
| 524 | M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in the vision community, segmenting dynamic objects with phase transitions is overlooked. In light of this, we introduce the concept of phase in segmentation, which categorizes real-world objects based on their visual characteristics and potential morphological and appearance changes. |
Zixuan Chen; Jiaxin Li; Junxuan Liang; Liming Tan; Yejie Guo; Cewu Lu; Yong-Lu Li; | code |
| 525 | HUNet: Homotopy Unfolding Network for Image Compressive Sensing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the Homotopy Unfolding Network (HUNet) for image CS, which enables phase-by-phase reconstruction of images along a homotopy path. |
Feiyang Shen; Hongping Gan; | code |
| 526 | Autoregressive Distillation of Diffusion Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless, existing methods have been limited by their reliance on the most recent denoised samples as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. |
Yeongmin Kim; Sotiris Anagnostidis; Yuming Du; Edgar Schönfeld; Jonas Kohler; Markos Georgopoulos; Albert Pumarola; Ali Thabet; Artsiom Sanakoyeu; | code |
| 527 | Recover and Match: Open-Vocabulary Multi-Label Recognition Through Knowledge-Constrained Optimal Transport Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: (2) The matching property between image regions and candidate labels has been neglected, relying instead on naive feature aggregation such as average pooling, which leads to spurious predictions from irrelevant regions. In this paper, we present RAM (Recover And Match), a novel framework that effectively addresses the above issues. |
Hao Tan; Zichang Tan; Jun Li; Ajian Liu; Jun Wan; Zhen Lei; | code |
| 528 | Generative Inbetweening Through Frame-wise Conditions-Driven Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a straightforward yet highly effective Frame-wise Conditions-driven Video Generation (FCVG) method that significantly enhances the temporal stability of interpolated video frames. |
Tianyi Zhu; Dongwei Ren; Qilong Wang; Xiaohe Wu; Wangmeng Zuo; | code |
| 529 | Spatiotemporal Decoupling for Efficient Vision-Based Occupancy Forecasting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel spatiotemporal decoupling vision-based paradigm to explicitly tackle the bias and achieve both effective and efficient 3D OCF. |
Jingyi Xu; Xieyuanli Chen; Junyi Ma; Jiawei Huang; Jintao Xu; Yue Wang; Ling Pei; | code |
| 530 | MP-GUI: Modality Perception with MLLMs for GUI Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, obtaining high-quality spatial structure data is challenging due to privacy issues and noisy environments. To address these challenges, we present MP-GUI, a specially designed MLLM for GUI understanding. |
Ziwei Wang; Weizhi Chen; Leyang Yang; Sheng Zhou; Shengchu Zhao; Hanbei Zhan; Jiongchao Jin; Liangcheng Li; Zirui Shao; Jiajun Bu; | code |
| 531 | PIAD: Pose and Illumination Agnostic Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the Pose and Illumination agnostic Anomaly Detection (PIAD) problem, a generalization of pose-agnostic anomaly detection (PAD). |
Kaichen Yang; Junjie Cao; Zeyu Bai; Zhixun Su; Andrea Tagliasacchi; | code |
| 532 | ReCon: Enhancing True Correspondence Discrimination Through Relation Consistency for Robust Noisy Correspondence Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Such an omission often runs the risk of misidentifying negatives as positives, thus leading to unanticipated performance degradation. To address this problem, we propose a general Relation Consistency learning framework, namely ReCon, to accurately discriminate the true correspondences among the multimodal data and thus effectively mitigate the adverse impact caused by mismatches. |
Quanxing Zha; Xin Liu; Shu-Juan Peng; Yiu-ming Cheung; Xing Xu; Nannan Wang; | code |
| 533 | X-Dyna: Expressive Dynamic Human Image Animation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a single human image using facial expressions and body movements derived from a driving video, generating realistic, context-aware dynamics for both the subject and the surrounding environment. |
Di Chang; Hongyi Xu; You Xie; Yipeng Gao; Zhengfei Kuang; Shengqu Cai; Chenxu Zhang; Guoxian Song; Chao Wang; Yichun Shi; Zeyuan Chen; Shijie Zhou; Linjie Luo; Gordon Wetzstein; Mohammad Soleymani; | code |
| 534 | LightLoc: Learning Outdoor LiDAR Localization at Light Speed Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose LightLoc, the first method capable of efficiently learning localization in a new scene at light speed. Beyond freezing the scene-agnostic feature backbone and training only the scene-specific prediction heads, we introduce two novel techniques to address these challenges. |
Wen Li; Chen Liu; Shangshu Yu; Dunqiang Liu; Yin Zhou; Siqi Shen; Chenglu Wen; Cheng Wang; | code |
| 535 | EmoEdit: Evoking Emotions Through Image Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Drawing on psychological insights, we introduce EmoEdit, which extends AIM by incorporating content modifications to enhance emotional impact. |
Jingyuan Yang; Jiawei Feng; Weibin Luo; Dani Lischinski; Daniel Cohen-Or; Hui Huang; | code |
| 536 | Effective Cloud Removal for Remote Sensing Images By An Improved Mean-Reverting Denoising Model with Elucidated Design Space Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although diffusion models (DM) exhibit strong generative capabilities, their direct applications to CR are suboptimal, as they generate cloudless images from random noise, ignoring inherent information in cloudy inputs. To overcome this drawback, we develop a new CR model EMRDM based on mean-reverting diffusion models (MRDMs) to establish a direct diffusion process between cloudy and cloudless images. |
Yi Liu; Wengen Li; Jihong Guan; Shuigeng Zhou; Yichao Zhang; | code |
| 537 | Multi-view Reconstruction Via SfM-guided Monocular Depth Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper aims to reconstruct the scene geometry from multi-view images with strong robustness and high quality. |
Haoyu Guo; He Zhu; Sida Peng; Haotong Lin; Yunzhi Yan; Tao Xie; Wenguan Wang; Xiaowei Zhou; Hujun Bao; | code |
| 538 | FATE: Full-head Gaussian Avatar with Textural Editing from Monocular Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although significant progress has been made in rendering performance and manipulation capabilities, notable challenges remain, including incomplete reconstruction and inefficient Gaussian representation. To address these challenges, we introduce FATE — a novel method for reconstructing an editable full-head avatar from a single monocular video. |
Jiawei Zhang; Zijian Wu; Zhiyang Liang; Yicheng Gong; Dongfang Hu; Yao Yao; Xun Cao; Hao Zhu; | code |
| 539 | Scale Efficient Training for Large Datasets Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement. To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. |
Qing Zhou; Junyu Gao; Qi Wang; | code |
| 540 | UltraFusion: Ultra High Dynamic Imaging Using Exposure Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose UltraFusion, the first exposure fusion technique that can merge inputs with 9-stop exposure differences. |
Zixuan Chen; Yujin Wang; Xin Cai; Zhiyuan You; Zheming Lu; Fan Zhang; Shi Guo; Tianfan Xue; | code |
| 541 | Unleashing The Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance the fine-grained perception ability of forgery for DGM^4. |
Yiheng Li; Yang Yang; Zichang Tan; Huan Liu; Weihua Chen; Xu Zhou; Zhen Lei; | code |
| 542 | Exploiting Temporal State Space Sharing for Video Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce a Temporal Video State Space Sharing (TV3S) architecture to leverage Mamba state space models for temporal feature sharing. |
Syed Ariff Syed Hesham; Yun Liu; Guolei Sun; Henghui Ding; Jing Yang; Ender Konukoglu; Xue Geng; Xudong Jiang; | code |
| 543 | Learning to Detect Objects from Multi-Agent LiDAR Scans Without Manual Labels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a novel unsupervised method that learns to Detect Objects from Multi-Agent LiDAR scans, termed DOtA, without using external labels. |
Qiming Xia; Wenkai Lin; Haoen Xiang; Xun Huang; Siheng Chen; Zhen Dong; Cheng Wang; Chenglu Wen; | code |
| 544 | Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, visual content does not contribute equally to user instructions, and existing strategies (e.g., average pooling) inevitably lead to the loss of potentially useful information. To tackle this, we propose the Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom), utilizing the instruction as a condition to guide the compression from both local and global levels. |
Zhihang Liu; Chen-Wei Xie; Pandeng Li; Liming Zhao; Longxiang Tang; Yun Zheng; Chuanbin Liu; Hongtao Xie; | code |
| 545 | Less Attention Is More: Prompt Transformer for Generalized Category Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This results in a model with more yet scattered attention, where neither excessive nor insufficient focus can grasp the subtle differences needed to classify fine-grained unknown and known categories. To address this issue, we propose AptGCD to deliver apt attention for GCD. |
Wei Zhang; Baopeng Zhang; Zhu Teng; Wenxin Luo; Junnan Zou; Jianping Fan; | code |
| 546 | Physical Plausibility-aware Trajectory Prediction Via Locomotion Embodiment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, previous Human Trajectory Prediction (HTP) methods leverage the pose cues implicitly, resulting in implausible predictions. To address this, we propose Locomotion Embodiment, a framework that explicitly evaluates the physical plausibility of the predicted trajectory by locomotion generation under the laws of physics. |
Hiromu Taketsugu; Takeru Oba; Takahiro Maeda; Shohei Nobuhara; Norimichi Ukita; | code |
| 547 | BioX-CPath: Biologically-driven Explainable Diagnostics for Multistain IHC Computational Pathology Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present BioX-CPath, an explainable graph neural network architecture for whole slide image (WSI) classification that leverages both spatial and semantic features across multiple stains. |
Amaya Gallagher-Syed; Henry Senior; Omnia Alwazzan; Elena Pontarini; Michele Bombardieri; Costantino Pitzalis; Myles J. Lewis; Michael R. Barnes; Luca Rossi; Gregory Slabaugh; | code |
| 548 | Is ‘Right’ Right? Enhancing Object Orientation Understanding in Multimodal Large Language Models Through Egocentric Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, current MLLMs face challenges in accurately interpreting object orientation in images due to inconsistent orientation annotations in training data, hindering the development of a coherent orientation understanding. To overcome this, we propose egocentric instruction tuning, which aligns MLLMs’ orientation understanding with the user’s perspective, based on a consistent annotation standard derived from the user’s egocentric viewpoint. |
Ji Hyeok Jung; Eun Tae Kim; Seoyeon Kim; Joo Ho Lee; Bumsoo Kim; Buru Chang; | code |
| 549 | SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation Via Scale-Adaptive Tokens Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a one-stage framework for real-time multi-person 3D human mesh estimation from a single RGB image. |
Chi Su; Xiaoxuan Ma; Jiajun Su; Yizhou Wang; | code |
| 550 | Reanimating Images Using Neural Representations of Dynamic Stimuli Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our findings demonstrate the potential of combining brain imaging with video diffusion models for developing more robust and biologically-inspired computer vision systems. We show additional decoding and encoding examples on this site: https://brain-nrds.github.io/. |
Jacob Yeung; Andrew F. Luo; Gabriel Sarch; Margaret M. Henderson; Deva Ramanan; Michael J. Tarr; | code |
| 551 | FADE: Frequency-Aware Diffusion Model Factorization for Video Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Rather than simply using these models, we first analyze the attention patterns within the video model to reveal how video priors are distributed across different components. Building on these insights, we propose a factorization strategy to optimize each component’s specialized role. |
Yixuan Zhu; Haolin Wang; Shilin Ma; Wenliang Zhao; Yansong Tang; Lei Chen; Jie Zhou; | code |
| 552 | Blood Flow Speed Estimation with Optical Coherence Tomography Angiography Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing techniques, such as Optical Doppler Tomography (ODT), generally require complex hardware control and signal processing, and still suffer from inherent system-level artifacts. To address these challenges, we propose a new learning-based approach named OCTA-Flow, which directly estimates vascular blood flow speed from Optical Coherence Tomography Angiography (OCTA) images that are commonly used for vascular structure analysis. |
Wensheng Cheng; Zhenghong Li; Jiaxiang Ren; Hyomin Jeong; Congwu Du; Yingtian Pan; Haibin Ling; | code |
| 553 | HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a tuning-free hyperspectral foundation model called HyperFree, by adapting the existing visual prompt engineering. |
Jingtao Li; Yingyi Liu; Xinyu Wang; Yunning Peng; Chen Sun; Shaoyu Wang; Zhendong Sun; Tian Ke; Xiao Jiang; Tangwei Lu; Anran Zhao; Yanfei Zhong; | code |
| 554 | Task-Aware Clustering for Prompting Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The resulting prompts often do not generalize well or exhibit limited task-awareness. To address this issue, we propose a novel Task-Aware Clustering (TAC) framework for prompting vision-language models, which increases the task-awareness of learnable prompts by introducing task-aware pre-context. |
Fusheng Hao; Fengxiang He; Fuxiang Wu; Tichao Wang; Chengqun Song; Jun Cheng; | code |
| 555 | Boost The Inference with Co-training: A Depth-guided Mutual Learning Framework for Semi-supervised Medical Polyp Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The existing RGB-D segmentation methods rely on depth data in the inference stage, limiting their clinical applications. To tackle this problem, we propose a semi-supervised polyp segmentation framework based on the mean teacher architecture. |
Yuxin Li; Zihao Zhu; Yuxiang Zhang; Yifan Chen; Zhibin Yu; | code |
| 556 | Omnidirectional Multi-Object Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Additionally, panoramic image distortions, such as resolution loss, geometric deformation, and uneven lighting, hinder direct adaptation of existing MOT methods, leading to significant performance degradation. To address these challenges, we propose OmniTrack, an omnidirectional MOT framework that incorporates Tracklet Management to introduce temporal cues, FlexiTrack Instances for object localization and association, and the CircularStatE Module to alleviate image and geometric distortions. |
Kai Luo; Hao Shi; Sheng Wu; Fei Teng; Mengfei Duan; Chang Huang; Yuhang Wang; Kaiwei Wang; Kailun Yang; | code |
| 557 | Toward Generalized Image Quality Assessment: Relaxing The Perfect Reference Quality Assumption Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Each image is annotated by human subjects as either worse, similar, or better quality compared to its reference. Building on this, we present a generalized FR-IQA model, namely Adaptive FIdelity-Naturalness Evaluator (A-FINE), to accurately assess and adaptively combine the fidelity and naturalness of a test image. |
Du Chen; Tianhe Wu; Kede Ma; Lei Zhang; | code |
| 558 | MIMO: A Medical Vision Language Model with Visual Referring Multimodal Input and Pixel Grounding Multimodal Output Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing models are confronted with two issues: for input, the model only relies on text instructions and lacks direct understanding of visual clues in the image; for output, the model only gives text answers and lacks connection with key areas in the image. To address these issues, we propose a unified medical vision language model MIMO, with visual referring Multimodal Input and pixel grounding Multimodal Output. |
Yanyuan Chen; Dexuan Xu; Yu Huang; Songkun Zhan; Hanpin Wang; Dongxue Chen; Xueping Wang; Meikang Qiu; Hang Li; | code |
| 559 | STCOcc: Sparse Spatial-Temporal Cascade Renovation for 3D Occupancy and Scene Flow Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these approaches struggle to capture local details and diminish the model’s spatial discriminative ability. To address these challenges, we propose a novel explicit state-based modeling method designed to leverage the occupied state to renovate the 3D features. |
Zhimin Liao; Ping Wei; Shuaijia Chen; Haoxuan Wang; Ziyang Ren; | code |
| 560 | Towards In-the-wild 3D Plane Reconstruction from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel framework dubbed ZeroPlane, a Transformer-based model targeting zero-shot 3D plane detection and reconstruction from a single image, over diverse domains and environments. |
Jiachen Liu; Rui Yu; Sili Chen; Sharon X. Huang; Hengkai Guo; | code |
| 561 | Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we claim that three criteria–Temporal Synchronization, Lip Readability, and Expressiveness–are crucial for achieving perceptually accurate lip movements. |
Lee Chae-Yeon; Oh Hyun-Bin; Han EunGi; Kim Sung-Bin; Suekyeong Nam; Tae-Hyun Oh; | code |
| 562 | Towards Universal AI-Generated Image Detection By Variational Information Bottleneck Network Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose VIB-Net, which uses Variational Information Bottlenecks to enforce authentication task-related feature learning. |
Haifeng Zhang; Qinghui He; Xiuli Bi; Weisheng Li; Bo Liu; Bin Xiao; | code |
| 563 | EnvPoser: Environment-aware Realistic Human Motion Estimation from Sparse Observations with Uncertainty Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This amplifies uncertainty and ambiguity in full-body motion estimation, especially for the lower-body joints. Therefore, we propose a new method, EnvPoser, that employs a two-stage framework to perform full-body motion estimation using sparse tracking signals and pre-scanned environment from VR devices. |
Songpengcheng Xia; Yu Zhang; Zhuo Su; Xiaozheng Zheng; Zheng Lv; Guidong Wang; Yongjie Zhang; Qi Wu; Lei Chu; Ling Pei; | code |
| 564 | Automatic Joint Structured Pruning and Quantization for Efficient Neural Network Training and Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing joint schemes are not widely used because of (1) engineering difficulties (complicated multi-stage processes), (2) black-box optimization (extensive hyperparameter tuning to control the overall compression), and (3) insufficient architecture generalization. To address these limitations, we present the framework GETA, which automatically and efficiently performs joint structured pruning and quantization-aware training on any DNN. |
Xiaoyi Qu; David Aponte; Colby Banbury; Daniel P. Robinson; Tianyu Ding; Kazuhito Koishida; Ilya Zharkov; Tianyi Chen; | code |
| 565 | Satellite to GroundScape – Large-scale Consistent Ground View Generation from Satellite Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Previous efforts mainly concentrated on single-view generation, often resulting in inconsistencies across neighboring ground views. In this work, we propose a novel cross-view synthesis approach designed to overcome these challenges by ensuring consistency across ground-view images generated from satellite views. |
Ningli Xu; Rongjun Qin; | code |
| 566 | MotiF: Making Text Count in Image Animation with Motion Focal Loss Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most existing methods struggle to generate videos that align well with the text prompts, particularly when motion is specified. To overcome this limitation, we introduce MotiF, a simple yet effective approach that directs the model’s learning to the regions with more motion, thereby improving the text alignment and motion generation. |
Shijie Wang; Samaneh Azadi; Rohit Girdhar; Saketh Rambhatla; Chen Sun; Xi Yin; | code |
| 567 | Sim-to-Real Causal Transfer: A Metric Learning Approach to Causally-Aware Interaction Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite notable progress, it remains unclear to what extent modern representations can capture the causal relationships behind agent interactions. In this work, we take an in-depth look at the causal awareness of these representations, from computational formalism to real-world practice. |
Ahmad Rahimi; Po-Chien Luan; Yuejiang Liu; Frano Rajič; Alexandre Alahi; | code |
| 568 | SynTab-LLaVA: Enhancing Multimodal Table Understanding with Decoupled Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Due to the limited scale of multimodal table understanding (MTU) data, model performance is constrained. A straightforward approach is to use multimodal large language models to obtain more samples, but this may cause hallucinations, generate incorrect sample pairs, and incur significant cost. To address these issues, we design a simple yet effective synthesis framework that consists of two independent steps: table image rendering and table question-and-answer (Q&A) pair generation. We use table codes (HTML, LaTeX, Markdown) to synthesize images and generate Q&A pairs with a large language model (LLM). |
Bangbang Zhou; Zuan Gao; Zixiao Wang; Boqiang Zhang; Yuxin Wang; Zhineng Chen; Hongtao Xie; | code |
| 569 | Volumetrically Consistent 3D Gaussian Rasterization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, its splatting-based rendering model makes several approximations to the rendering equation, reducing physical accuracy. We show that splatting and its approximations are unnecessary, even within a rasterizer; we instead volumetrically integrate 3D Gaussians directly to compute the transmittance across them analytically. |
Chinmay Talegaonkar; Yash Belhe; Ravi Ramamoorthi; Nicholas Antipa; | code |
| 570 | GaPT-DAR: Category-level Garments Pose Tracking Via Integrated 2D Deformation and 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, motivated by the 2D warping space and shape prior, we propose GaPT-DAR, a novel category-level Garments Pose Tracking framework with integrated 2D Deformation And 3D Reconstruction functions, which fully utilizes 3D-to-2D projection and 2D-to-3D reconstruction to transform 3D point-wise learning into 2D warping deformation learning. |
Li Zhang; Mingliang Xu; Jianan Wang; Qiaojun Yu; Lixin Yang; Yonglu Li; Cewu Lu; Rujing Wang; Liu Liu; | code |
| 571 | Focal Split: Untethered Snapshot Depth from Differential Defocus Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Focal Split, a handheld, snapshot depth camera with fully onboard power and computing based on depth-from-differential-defocus (DfDD). |
Junjie Luo; John Mamish; Alan Fu; Thomas Concannon; Josiah Hester; Emma Alexander; Qi Guo; | code |
| 572 | DV-Matcher: Deformation-based Non-rigid Point Cloud Matching Guided By Pre-trained Visual Features Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present DV-Matcher, a novel learning-based framework for estimating dense correspondences between non-rigidly deformable point clouds. |
Zhangquan Chen; Puhua Jiang; Ruqi Huang; | code |
| 573 | FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, effectively evaluating these MLLMs on face perception remains largely unexplored. To address this gap, we introduce FaceBench, a dataset featuring hierarchical multi-view and multi-level attributes specifically designed to assess the comprehensive face perception abilities of MLLMs. |
Xiaoqin Wang; Xusen Ma; Xianxu Hou; Meidan Ding; Yudong Li; Junliang Chen; Wenting Chen; Xiaoyang Peng; Linlin Shen; | code |
| 574 | Forming Auxiliary High-confident Instance-level Loss to Promote Learning from Label Proportions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose a dual entropy-based weight (DEW) method to adaptively measure the confidences of pseudo-labels. |
Tianhao Ma; Han Chen; Juncheng Hu; Yungang Zhu; Ximing Li; | code |
| 575 | TopoCellGen: Generating Histopathology Cell Topology with A Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel approach that integrates topological constraints into a diffusion model to improve the generation of realistic, contextually accurate cell topologies. |
Meilong Xu; Saumya Gupta; Xiaoling Hu; Chen Li; Shahira Abousamra; Dimitris Samaras; Prateek Prasanna; Chao Chen; | code |
| 576 | DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Dual-vision Scene Perception Network (DSPNet), to comprehensively integrate multi-view and point cloud features to improve robustness in 3D QA. |
Jingzhou Luo; Yang Liu; Weixing Chen; Zhen Li; Yaowei Wang; Guanbin Li; Liang Lin; | code |
| 577 | SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel tracker called SPMTrack based on mixture-of-experts tailored for visual tracking task (TMoE), combining the capability of multiple experts to handle diverse relation modeling more flexibly. |
Wenrui Cai; Qingjie Liu; Yunhong Wang; | code |
| 578 | RaCFormer: Towards High-Quality 3D Object Detection Via Query-based Radar-Camera Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The Radar-Camera fusion in outdoor 3D scene perception is capped by the image-to-BEV transformation: if the depth of pixels is not accurately estimated, the naive combination of BEV features actually integrates unaligned visual content. To avoid this problem, we propose a query-based framework that enables adaptive sampling of instance-relevant features from both the bird’s-eye view (BEV) and the original image view. |
Xiaomeng Chu; Jiajun Deng; Guoliang You; Yifan Duan; Houqiang Li; Yanyong Zhang; | code |
| 579 | EdgeTAM: On-Device Track Anything Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performance. |
Chong Zhou; Chenchen Zhu; Yunyang Xiong; Saksham Suri; Fanyi Xiao; Lemeng Wu; Raghuraman Krishnamoorthi; Bo Dai; Chen Change Loy; Vikas Chandra; Bilge Soran; | code |
| 580 | S2D-LFE: Sparse-to-Dense Light Field Event Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present S2D-LFE, an innovative approach for sparse-to-dense light field event generation. |
Yutong Liu; Wenming Weng; Yueyi Zhang; Zhiwei Xiong; | code |
| 581 | Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While current approaches report high diversity and realism, they often generate motions with undetected limb stretching and jitter. To address this, we introduce SkeletonDiffusion, a latent diffusion model that embeds an explicit inductive bias on the human body within its architecture and training. |
Cecilia Curreli; Dominik Muhle; Abhishek Saroha; Zhenzhang Ye; Riccardo Marin; Daniel Cremers; | code |
| 582 | GauSTAR: Gaussian Surface Tracking and Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, tracking dynamic surfaces with 3D Gaussians remains challenging due to complex topology changes, such as surfaces appearing, disappearing, or splitting. To address these challenges, we propose GauSTAR, a novel method that achieves photo-realistic rendering, accurate surface reconstruction, and reliable 3D tracking for general dynamic scenes with changing topology. |
Chengwei Zheng; Lixin Xue; Juan Zarate; Jie Song; | code |
| 583 | Unseen Visual Anomaly Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose Anomaly Anything (AnomalyAny), a novel framework that leverages Stable Diffusion (SD)’s image generation capabilities to generate diverse and realistic unseen anomalies. |
Han Sun; Yunkang Cao; Hao Dong; Olga Fink; | code |
| 584 | HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel multimodal learning method named HOP for co-speech gesture generation that captures the heterogeneous entanglement between gesture motion, audio rhythm, and text semantics, enabling the generation of coordinated gestures. |
Hongye Cheng; Tianyu Wang; Guangsi Shi; Zexing Zhao; Yanwei Fu; | code |
| 585 | GIVEPose: Gradual Intra-class Variation Elimination for RGB-based Category-Level Object Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: By leveraging the complementary strengths of the NOCS map and the IVFC map, we introduce GIVEPose, a framework that implements Gradual Intra-class Variation Elimination for category-level object pose estimation. |
Ziqin Huang; Gu Wang; Chenyangguang Zhang; Ruida Zhang; Xiu Li; Xiangyang Ji; | code |
| 586 | NoPain: No-box Point Cloud Attack Via Optimal Transport Singular Boundary Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite their promising attack performance, they often struggle to produce transferable adversarial samples due to overfitting the specific parameters of surrogate models. To overcome this issue, we shift our focus to the data distribution itself and introduce a novel approach named NoPain, which employs optimal transport (OT) to identify the inherent singular boundaries of the data manifold for cross-network point cloud attacks. |
Zezeng Li; Xiaoyu Du; Na Lei; Liming Chen; Weimin Wang; | code |
| 587 | Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these image pairs often fail to align with the specified edit instructions due to the limitations of T2I models, which negatively impacts models trained on such datasets. To address this, we present Instruct-CLIP (I-CLIP), a self-supervised method that learns the semantic changes between original and edited images to refine and better align the instructions in existing datasets. |
Sherry X. Chen; Misha Sra; Pradeep Sen; | code |
| 588 | SVDC: Consistent Direct Time-of-Flight Video Depth Completion with Frequency Selective Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel video depth completion method, called SVDC, by fusing the sparse dToF data with the corresponding RGB guidance. |
Xuan Zhu; Jijun Xiang; Xianqi Wang; Longliang Liu; Yu Wang; Hong Zhang; Fei Guo; Xin Yang; | code |
| 589 | Distilling Long-tailed Datasets Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing dataset distillation methods struggle with long-tailed datasets, which are prevalent in real-world scenarios. By investigating the reasons behind this unexpected result, we identified two main causes: 1) The distillation process on imbalanced datasets develops biased gradients, leading to the synthesis of similarly imbalanced distilled datasets. |
Zhenghao Zhao; Haoxuan Wang; Yuzhang Shang; Kai Wang; Yan Yan; | code |
| 590 | MaskGaussian: Adaptive 3D Gaussian Representation from Probabilistic Masks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this issue, we introduce MaskGaussian, which models Gaussians as probabilistic entities rather than permanently removing them, and utilize them according to their probability of existence. To achieve this, we propose a masked-rasterization technique that enables unused yet probabilistically existing Gaussians to receive gradients, allowing for dynamic assessment of their contribution to the evolving scene and adjustment of their probability of existence. |
Yifei Liu; Zhihang Zhong; Yifan Zhan; Sheng Xu; Xiao Sun; | code |
| 591 | AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose AnomalyNCD, a multi-class anomaly classification network compatible with different anomaly detection methods. |
Ziming Huang; Xurui Li; Haotian Liu; Feng Xue; Yuzhe Wang; Yu Zhou; | code |
| 592 | CaricatureBooth: Data-Free Interactive Caricature Generation in A Photo Booth Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present CaricatureBooth, a system that transforms caricature creation into a simple interactive experience — as easy as using a photo booth! |
Zhiyu Qu; Yunqi Miao; Zhensong Zhang; Jifei Song; Jiankang Deng; Yi-Zhe Song; | code |
| 593 | SDGOCC: Semantic and Depth-Guided Bird’s-Eye View Transformation for 3D Multimodal Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current lightweight methods primarily rely on the Lift-Splat-Shoot (LSS) pipeline, which suffers from inaccurate depth estimation and fails to fully exploit the geometric and semantic information of 3D LiDAR points. Therefore, we propose a novel multimodal occupancy prediction network called SDG-OCC, which incorporates a joint semantic and depth-guided view transformation coupled with a fusion-to-occupancy-driven active distillation. |
ZaiPeng Duan; ChenXu Dang; Xuzhong Hu; Pei An; Junfeng Ding; Jie Zhan; YunBiao Xu; Jie Ma; | code |
| 594 | CMMLoc: Advancing Text-to-PointCloud Localization with Cauchy-Mixture-Model Based Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in practical scenarios, e.g., vehicle pickup, passengers usually describe only the part of the most significant and nearby surroundings instead of the entire environment. In response to this partially relevant challenge, we propose CMMLoc, an uncertainty-aware Cauchy-Mixture-Model (CMM) based framework for text-to-point-cloud Localization. |
Yanlong Xu; Haoxuan Qu; Jun Liu; Wenxiao Zhang; Xun Yang; | code |
| 595 | Sketchy Bounding-box Supervision for 3D Instance Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Sketchy-3DIS, a novel weakly supervised 3D instance segmentation framework, which jointly learns a pseudo labeler and a segmentator to improve performance under sketchy bounding-box supervision. |
Qian Deng; Le Hui; Jin Xie; Jian Yang; | code |
| 596 | Pseudo Visible Feature Fine-Grained Fusion for Thermal Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this fusion approach does not fully exploit the complementary visible spectrum information beneficial for thermal detection. To address this issue, we propose a novel cross-modal fusion method called Pseudo Visible Feature Fine-Grained Fusion (PFGF). |
Ting Li; Mao Ye; Tianwen Wu; Nianxin Li; Shuaifeng Li; Song Tang; Luping Ji; | code |
| 597 | Learning Partonomic 3D Reconstruction from Image Collections Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To handle the expanded solution space and frequent part occlusions in single-view images, we introduce a novel approach that represents, parses, and learns the structural compositionality of 3D objects. |
Xiaoqian Ruan; Pei Yu; Dian Jia; Hyeonjeong Park; Peixi Xiong; Wei Tang; | code |
| 598 | Percept, Memory, and Imagine: World Feature Simulating for Open-Domain Unknown Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Through multi-level perception and rich memory about known objects, the characteristics of unknown objects can be imagined sufficiently, enhancing the ability to discriminate known objects from unknown ones. Inspired by this idea, we propose an approach called World Feature Simulation (WFS), mainly consisting of a multi-level perception module, a memory recorder, and an unknown-feature generator. |
Aming Wu; Cheng Deng; | code |
| 599 | Rethinking Token Reduction with Parameter-Efficient Fine-Tuning in ViT for Pixel-Level Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although token reduction (TR) techniques can help reduce computational demands, they often lead to homogeneous attention patterns that compromise performance in pixel-level scenarios. This study underscores the importance of maintaining attention diversity for these tasks and proposes to enhance attention diversity while ensuring the completeness of token sequences. |
Cheng Lei; Ao Li; Hu Yao; Ce Zhu; Le Zhang; | code |
| 600 | Image Referenced Sketch Colorization Based on Animation Creation Workflow Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods still face problems: text-guided methods fail to provide accurate color and style references, hint-guided methods still involve manual operation, and image-referenced methods are prone to artifacts. To address these limitations, we propose a diffusion-based framework inspired by real-world animation production workflows. |
Dingkun Yan; Xinrui Wang; Zhuoru Li; Suguru Saito; Yusuke Iwasawa; Yutaka Matsuo; Jiaxian Guo; | code |
| 601 | nnWNet: Rethinking The Use of Transformers in Biomedical Image Segmentation and Calling for A Unified Evaluation Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In addition, some models lack a unified and standardized evaluation benchmark, leading to significant discrepancies in experimental setups. In this study, we review and summarize these architectures and analyze their contradictions in design. |
Yanfeng Zhou; Lingrui Li; Le Lu; Minfeng Xu; | code |
| 602 | FreeGave: 3D Physics Learning from Dynamic Videos By Gaussian Velocity Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we aim to model 3D scene geometry, appearance, and the underlying physics purely from multi-view videos. |
Jinxi Li; Ziyang Song; Siyuan Zhou; Bo Yang; | code |
| 603 | GoLF-NRT: Integrating Global Context and Local Geometry for Few-Shot View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, with limited input views, these methods experience significant degradation in rendering quality. To address this limitation, we propose GoLF-NRT: a Global and Local feature Fusion-based Neural Rendering Transformer. |
You Wang; Li Fang; Hao Zhu; Fei Hu; Long Ye; Zhan Ma; | code |
| 604 | MoFlow: One-Step Flow Matching for Human Trajectory Forecasting Via Implicit Maximum Likelihood Estimation Based Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we address the problem of human trajectory forecasting, which aims to predict the inherently multi-modal future movements of humans based on their past trajectories and other contextual cues. |
Yuxiang Fu; Qi Yan; Lele Wang; Ke Li; Renjie Liao; | code |
| 605 | Gazing at Rewards: Eye Movements As A Lens Into Human and AI Decision-Making in Hybrid Visual Foraging Highlight: Their eye fixations are drawn to regions with higher average rewards, fixation durations are longer on more valuable targets, and their cumulative rewards exceed chance, approaching the upper bound of optimal foragers. To probe these decision-making processes in humans, we developed a transformer-based Visual Forager (VF) model trained via reinforcement learning. |
Bo Wang; Dingwei Tan; Yen-Ling Kuo; Zhaowei Sun; Jeremy M. Wolfe; Tat-Jen Cham; Mengmi Zhang; | code |
| 606 | CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders Via Fine-Grained Alignment Highlight: In this work, we propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. |
Edson Araujo; Andrew Rouditchenko; Yuan Gong; Saurabhchand Bhati; Samuel Thomas; Brian Kingsbury; Leonid Karlinsky; Rogerio Feris; James R. Glass; Hilde Kuehne; | code |
| 607 | FineLIP: Extending CLIP’s Reach Via Fine-Grained Alignment with Longer Text Inputs Highlight: Additionally, CLIP models often struggle to effectively capture detailed visual and textual information, which hampers their performance on tasks that require fine-grained analysis. To address these limitations, we present a novel approach, FineLIP, that extends the capabilities of CLIP. |
Mothilal Asokan; Kebin Wu; Fatima Albreiki; | code |
| 608 | PromptHash: Affinity-Prompted Collaborative Cross-Modal Learning for Adaptive Hashing Retrieval Highlight: We present PromptHash, an innovative framework leveraging affinity prompt-aware collaborative learning for adaptive cross-modal hashing. |
Qiang Zou; Shuli Cheng; Jiayi Chen; | code |
| 609 | Binarized Mamba-Transformer for Lightweight Quad Bayer HybridEVS Demosaicing Highlight: Although existing learning-based methods that leverage long-range dependency modeling have achieved promising results, their complexity severely limits deployment on mobile devices for real-world applications. To address these limitations, we propose a lightweight Mamba-based binary neural network designed for efficient and high-performing demosaicing of HybridEVS RAW images. |
Shiyang Zhou; Haijin Zeng; Yunfan Lu; Tong Shao; Ke Tang; Yongyong Chen; Jie Liu; Jingyong Su; | code |
| 610 | OSDFace: One-Step Diffusion Model for Face Restoration Highlight: In this work, we propose OSDFace, a novel one-step diffusion model for face restoration. |
Jingkai Wang; Jue Gong; Lin Zhang; Zheng Chen; Xing Liu; Hong Gu; Yutong Liu; Yulun Zhang; Xiaokang Yang; | code |
| 611 | ODA-GAN: Orthogonal Decoupling Alignment GAN Assisted By Weakly-supervised Learning for Virtual Immunohistochemistry Staining Highlight: In this paper, we propose the Orthogonal Decoupling Alignment Generative Adversarial Network (ODA-GAN) for unpaired virtual immunohistochemistry (IHC) staining. |
Tong Wang; Mingkang Wang; Zhongze Wang; Hongkai Wang; Qi Xu; Fengyu Cong; Hongming Xu; | code |
| 612 | Complementary Advantages: Exploiting Cross-Field Frequency Correlation for NIR-Assisted Image Denoising Highlight: However, due to the inconsistency between NIR and RGB images, the existing works still struggle to balance the contributions of the two fields in the process of image fusion. In response, we develop a cross-field Frequency Correlation Exploiting Network (FCENet) for NIR-assisted image denoising. |
Yuchen Wang; Hongyuan Wang; Lizhi Wang; Xin Wang; Lin Zhu; Wanxuan Lu; Hua Huang; | code |
| 613 | MESC-3D: Mining Effective Semantic Cues for 3D Reconstruction from A Single Image Highlight: In this paper, we propose a novel single-image 3D reconstruction method called Mining Effective Semantic Cues for 3D Reconstruction from a Single Image (MESC-3D), which can actively mine effective semantic cues from entangled features. |
Shaoming Li; Qing Cai; Songqi Kong; Runqing Tan; Heng Tong; Shiji Qiu; Yongguo Jiang; Zhi Liu; | code |
| 614 | Enhancing Few-Shot Class-Incremental Learning Via Training-Free Bi-Level Modality Calibration Highlight: To enhance prediction robustness, we introduce additional metrics and strategies that maximize the utilization of limited data. |
Yiyang Chen; Tianyu Ding; Lei Wang; Jing Huo; Yang Gao; Wenbin Li; | code |
| 615 | AVF-MAE++: Scaling Affective Video Facial Masked Autoencoders Via Efficient Audio-Visual Self-Supervised Learning Highlight: Additionally, capturing both intra- and inter-modal correlations through scalable representations is a crucial challenge in this field. To tackle these gaps, we introduce AVF-MAE++, a series of audio-visual MAEs designed to explore the impact of scaling on AVFA with a focus on advanced correlation modeling. |
Xuecheng Wu; Heli Sun; Yifan Wang; Jiayu Nie; Jie Zhang; Yabing Wang; Junxiao Xue; Liang He; | code |
| 616 | Order-Robust Class Incremental Learning: Graph-Driven Dynamic Similarity Grouping Highlight: To address this critical yet understudied challenge of class order sensitivity, we first extend existing CIL frameworks through theoretical analysis, proving that grouping classes with lower pairwise similarity during incremental phases significantly improves model robustness to order variations. Building on this insight, we propose Graph-Driven Dynamic Similarity Grouping (GDDSG), a novel method that employs graph coloring algorithms to dynamically partition classes into similarity-constrained groups. |
Guannan Lai; Yujie Li; Xiangkun Wang; Junbo Zhang; Tianrui Li; Xin Yang; | code |
| 617 | BFANet: Revisiting 3D Semantic Segmentation with Boundary Feature Analysis Highlight: In this paper, we revisit 3D semantic segmentation through a more granular lens, shedding light on subtle complexities that are typically overshadowed by broader performance metrics. |
Weiguang Zhao; Rui Zhang; Qiufeng Wang; Guangliang Cheng; Kaizhu Huang; | code |
| 618 | Scalable Autoregressive Monocular Depth Estimation Highlight: This paper proposes a new autoregressive model as an effective and scalable monocular depth estimator. |
Jinhong Wang; Jian Liu; Dongqi Tang; Weiqiang Wang; Wentong Li; Danny Chen; Jintai Chen; Jian Wu; | code |
| 619 | Text Embedding Is Not All You Need: Attention Control for Text-to-Image Semantic Alignment with Text Self-Attention Maps Highlight: As a result, such syntactic relationships can be overlooked in the cross-attention module, leading to inaccurate image generation. To address this, we propose a method that directly transfers syntactic relations from the text attention maps to the cross-attention module via test-time optimization. |
Jeeyung Kim; Erfan Esmaeili; Qiang Qiu; | code |
| 620 | H2ST: Hierarchical Two-Sample Tests for Continual Out-of-Distribution Detection Highlight: Continually detecting OOD samples presents several challenges for current OOD detection methods: reliance on model outputs leads to excessive dependence on model performance; selecting suitable thresholds is difficult, hindering real-world deployment; and binary ID/OOD classification fails to provide task-level identification. To address these issues, we propose a novel continual OOD detection method called the Hierarchical Two-sample Tests (H2ST). |
Yuhang Liu; Wenjie Zhao; Yunhui Guo; | code |
| 621 | D^3: Scaling Up Deepfake Detection By Learning from Discrepancy Highlight: Specifically, we reveal that the current methods tailored for training on one specific generator either struggle to learn comprehensive artifacts from multiple generators or sacrifice their fitting ability for seen generators (i.e., In-Domain (ID) performance) in exchange for generalization to unseen generators (i.e., Out-Of-Domain (OOD) performance). To tackle the above challenges, we propose our Discrepancy Deepfake Detector (D3) framework, whose core idea is to deconstruct the universal artifacts from multiple generators by introducing a parallel network branch that takes a distorted image feature as an extra discrepancy signal and supplements its original counterpart. |
Yongqi Yang; Zhihao Qian; Ye Zhu; Olga Russakovsky; Yu Wu; | code |
| 622 | LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation Highlight: We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). |
Vladan Stojnić; Yannis Kalantidis; Jiří Matas; Giorgos Tolias; | code |
| 623 | BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence Highlight: In this work, we introduce a novel image-centric 3D perception model, BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. |
Xuewu Lin; Tianwei Lin; Lichao Huang; Hongyu Xie; Zhizhong Su; | code |
| 624 | Making Old Film Great Again: Degradation-aware State Space Model for Old Film Restoration Highlight: In this work, we propose a new baseline to re-examine the challenges in old film restoration. |
Yudong Mao; Hao Luo; Zhiwei Zhong; Peilin Chen; Zhijiang Zhang; Shiqi Wang; | code |
| 625 | V2V3D: View-to-View Denoised 3D Reconstruction for Light Field Microscopy Highlight: However, existing LFM reconstruction algorithms are highly sensitive to sensor noise or require hard-to-get ground-truth annotated data for training. To address these challenges, this paper introduces V2V3D, an unsupervised view2view-based framework that establishes a new paradigm for joint optimization of image denoising and 3D reconstruction in a unified architecture. |
Jiayin Zhao; Zhenqi Fu; Tao Yu; Hui Qiao; | code |
| 626 | Design2GarmentCode: Turning Design Concepts to Tangible Garments Through Program Synthesis Highlight: In this work, we propose Design2GarmentCode, a novel sewing pattern generation approach based on Large Multimodal Models (LMMs), which generates parametric pattern-making programs from multi-modal design concepts. |
Feng Zhou; Ruiyang Liu; Chen Liu; Gaofeng He; Yong-Lu Li; Xiaogang Jin; Huamin Wang; | code |
| 627 | Balanced Rate-Distortion Optimization in Learned Image Compression Highlight: This imbalance can lead to suboptimal optimization, where one objective dominates, thereby reducing overall compression efficiency. To address this challenge, we reformulate R-D optimization as a multi-objective optimization (MOO) problem and introduce two balanced R-D optimization strategies that adaptively adjust gradient updates to achieve more equitable improvements in both rate and distortion. |
Yichi Zhang; Zhihao Duan; Yuning Huang; Fengqing Zhu; | code |
| 628 | Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability Highlight: However, current LBMs simply list all concepts together as the bottleneck layer, leading to the spurious cue inference problem and failing to generalize to unseen classes. To address these limitations, we propose the Attribute-formed Language Bottleneck Model (ALBM). |
Jianyang Zhang; Qianli Luo; Guowu Yang; Wenjing Yang; Weide Liu; Guosheng Lin; Fengmao Lv; | code |
| 629 | Adapting Dense Matching for Homography Estimation with Grid-based Acceleration Highlight: In this work, we revisit the traditional image matching paradigm for homography estimation and propose GFNet, a Grid Flow regression Network that adapts the high-accuracy dense matching framework for homography estimation while enhancing efficiency through a grid-based strategy: estimating flow only over a coarse grid by leveraging homography’s global smoothness. |
Kaining Zhang; Yuxin Deng; Jiayi Ma; Paolo Favaro; | code |
| 630 | Detecting Backdoor Attacks in Federated Learning Via Direction Alignment Inspection Highlight: Existing defense methods show limited efficacy as they overlook the inconsistency between benign and malicious model updates regarding both general and fine-grained directions. To fill this gap, we introduce AlignIns, a novel defense method designed to safeguard FL systems against backdoor attacks. |
Jiahao Xu; Zikai Zhang; Rui Hu; | code |
| 631 | ESC: Erasing Space Concept for Knowledge Deletion Highlight: To address these issues, we introduce a novel concept of Knowledge Deletion (KD), an advanced task that considers both concerns, and provide an appropriate metric, named the Knowledge Retention score (KR), for assessing knowledge retention in feature space. To achieve this, we propose a novel training-free erasing approach named Erasing Space Concept (ESC), which restricts the important subspace for the forgetting knowledge by eliminating the relevant activations in the feature. |
Tae-Young Lee; Sundong Park; Minwoo Jeon; Hyoseok Hwang; Gyeong-Moon Park; | code |
| 632 | Self-Supervised Learning for Color Spike Camera Reconstruction Highlight: In this paper, we propose a motion-guided reconstruction method for spike cameras with CFA, utilizing color layout and estimated motion information. |
Yanchen Dong; Ruiqin Xiong; Xiaopeng Fan; Zhaofei Yu; Yonghong Tian; Tiejun Huang; | code |
| 633 | Cross-Modal Interactive Perception Network with Mamba for Lung Tumor Segmentation in PET-CT Images Highlight: Hence, we introduce a large-scale PET-CT lung tumor segmentation dataset, termed PCLT20K, which comprises 21,930 pairs of PET-CT images from 605 patients. |
Jie Mei; Chenyu Lin; Yu Qiu; Yaonan Wang; Hui Zhang; Ziyang Wang; Dong Dai; | code |
| 634 | Fancy123: One Image to High-Quality 3D Mesh Generation Via Plug-and-Play Deformation Highlight: However, the multiview images exhibit local inconsistencies, and the meshes often lack fidelity to the input image or look blurry. We propose Fancy123, featuring two enhancement modules and an unprojection operation to address the above three issues, respectively. |
Qiao Yu; Xianzhi Li; Yuan Tang; Xu Han; Long Hu; Yixue Hao; Min Chen; | code |
| 635 | Global-Local Tree Search in VLMs for 3D Indoor Scene Generation Highlight: To solve the problem with a VLM, we propose a new global-local tree search algorithm. |
Wei Deng; Mengshi Qi; Huadong Ma; | code |
| 636 | Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models Highlight: Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they still encounter limitations in shape regularity, generalization ability, and expressiveness. To address these challenges, we introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. |
Ronghuan Wu; Wanchao Su; Jing Liao; | code |
| 637 | CGMatch: A Different Perspective of Semi-supervised Learning Highlight: In this paper, we propose a novel SSL model called CGMatch, which, for the first time, incorporates a new metric known as Count-Gap (CG). |
Bo Cheng; Jueqing Lu; Yuan Tian; Haifeng Zhao; Yi Chang; Lan Du; | code |
| 638 | Rethinking Query-based Transformer for Continual Image Segmentation Highlight: To address these, we conduct an in-depth investigation of the built-in objectness and find that highly aggregated image features provide a shortcut for queries to generate masks through simple feature alignment. Based on this, we propose SimCIS, a simple yet powerful baseline for CIS. |
Yuchen Zhu; Cheng Shi; Dingyou Wang; Jiajin Tang; Zhengxuan Wei; Yu Wu; Guanbin Li; Sibei Yang; | code |
| 639 | Rethinking Epistemic and Aleatoric Uncertainty for Active Open-Set Annotation: An Energy-Based Approach Highlight: In this paper, we propose an Energy-based Active Open-set Annotation (EAOA) framework, which effectively integrates EU and AU to achieve superior performance. |
Chen-Chen Zong; Sheng-Jun Huang; | code |
| 640 | VideoGuide: Improving Video Diffusion Models Without Training Through A Teacher’s Guide Highlight: Existing methods that aim to improve consistency often cause trade-offs such as reduced imaging quality and impractical computational time. To address these issues, we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without the need for additional training or fine-tuning. |
Dohun Lee; Bryan Sangwoo Kim; Geon Yeong Park; Jong Chul Ye; | code |
| 641 | Mimic In-Context Learning for Multimodal Tasks Highlight: Mathematically, in Transformer-based models, ICDs act as "shift vectors" added to the hidden states of query tokens. Inspired by this, we introduce Mimic In-Context Learning (MimIC) to learn stable and generalizable shift effects from ICDs. |
Yuchu Jiang; Jiale Fu; Chenduo Hao; Xinting Hu; Yingzhe Peng; Xin Geng; Xu Yang; | code |
| 642 | Finding Local Diffusion Schrodinger Bridge Using Kolmogorov-Arnold Network Highlight: To address the limitations of SB-based methods, this paper proposes for the first time to find local Diffusion Schrodinger Bridges (LDSB) in the diffusion path subspace, which strengthens the connection between the SB problem and diffusion models. |
Xingyu Qiu; Mengying Yang; Xinghua Ma; Fanding Li; Dong Liang; Gongning Luo; Wei Wang; Kuanquan Wang; Shuo Li; | code |
| 643 | CoSDH: Communication-Efficient Collaborative Perception Via Supply-Demand Awareness and Intermediate-Late Hybridization Highlight: However, existing collaborative perception methods face a dilemma between communication efficiency and perception accuracy. To address this issue, we propose a novel communication-efficient collaborative perception framework based on supply-demand awareness and intermediate-late hybridization, dubbed CoSDH. |
Junhao Xu; Yanan Zhang; Zhi Cai; Di Huang; | code |
| 644 | Proximal Algorithm Unrolling: Flexible and Efficient Reconstruction Networks for Single-Pixel Imaging Highlight: In this paper, we address the challenge of integrating the strengths of both classes of solvers. |
Ping Wang; Lishun Wang; Gang Qu; Xiaodong Wang; Yulun Zhang; Xin Yuan; | code |
| 645 | Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation Highlight: However, this approach might not be sufficient for tasks like cancer subtype classification, tissue phenotyping, and survival analysis due to the limited level of detail that a single-resolution image can provide. To address this, we propose a novel multi-resolution paradigm leveraging Whole Slide Images (WSIs) to extract histology patches at multiple resolutions and generate corresponding textual descriptions through an advanced CPath VLM. |
Shahad Albastaki; Anabia Sohail; Iyyakutti Iyappan Ganapathi; Basit Alawode; Asim Khan; Sajid Javed; Naoufel Werghi; Mohammed Bennamoun; Arif Mahmood; | code |
| 646 | Homogeneous Dynamics Space for Heterogeneous Humans Highlight: In this paper, we aim to push data-driven human dynamics understanding forward. |
Xinpeng Liu; Junxuan Liang; Chenshuo Zhang; Zixuan Cai; Cewu Lu; Yong-Lu Li; | code |
| 647 | Exploring Scene Affinity for Semi-Supervised LiDAR Semantic Segmentation Highlight: This paper explores scene affinity (AIScene), namely intra-scene consistency and inter-scene correlation, for semi-supervised LiDAR semantic segmentation in driving scenes. |
Chuandong Liu; Xingxing Weng; Shuguo Jiang; Pengcheng Li; Lei Yu; Gui-Song Xia; | code |
| 648 | BF-STVSR: B-Splines and Fourier—Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution Highlight: This issue becomes particularly pronounced when combined with pre-trained optical flow networks, which can limit the model’s flexibility. To address these issues, we propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent spatial and temporal characteristics of video: 1) B-spline Mapper for smooth temporal interpolation, and 2) Fourier Mapper for capturing dominant spatial frequencies. |
Eunjin Kim; Hyeonjin Kim; Kyong Hwan Jin; Jaejun Yoo; | code |
| 649 | Face Forgery Video Detection Via Temporal Forgery Cue Unraveling Highlight: Accordingly, we design a consecutive correlate module to capture momentary anomaly cues by correlating interactions among consecutive frames. |
Zonghui Guo; Yingjie Liu; Jie Zhang; Haiyong Zheng; Shiguang Shan; | code |
| 650 | Structure-from-Motion with A Non-Parametric Camera Model Highlight: In this paper, we present a new generic Structure-from-Motion pipeline, GenSfM, that uses a non-parametric camera projection model. |
Yihan Wang; Linfei Pan; Marc Pollefeys; Viktor Larsson; | code |
| 651 | Take The Bull By The Horns: Learning to Segment Hard Samples Highlight: We propose an effective image segmentation framework that includes mechanisms for identifying and segmenting hard samples. |
Yuan Guo; Jingyu Kong; Yu Wang; Yuping Duan; | code |
| 652 | MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Action Anticipation Highlight: However, the previous work struggles to achieve such a long-range understanding due to its limited and/or sparse receptive field. To alleviate this issue, we propose a novel MANTA (MAmba for ANTicipation) network. |
Olga Zatsarynna; Emad Bahrami; Yazan Abu Farha; Gianpiero Francesca; Juergen Gall; | code |
| 653 | PIDLoc: Cross-View Pose Optimization Network Inspired By PID Controllers Highlight: However, existing methods primarily rely on cross-view features at a given pose, neglecting fine-grained contexts for precision and global contexts for robustness against large initial pose errors. To overcome these limitations, we propose PIDLoc, a novel cross-view pose optimization approach inspired by the proportional-integral-derivative (PID) controller. |
Wooju Lee; Juhye Park; Dasol Hong; Changki Sung; Youngwoo Seo; DongWan Kang; Hyun Myung; | code |
| 654 | GaussHDR: High Dynamic Range Gaussian Splatting Via Learning Unified 3D and 2D Local Tone Mapping Highlight: Additionally, the global tone mapper used in existing methods can impede the learning of both HDR and LDR representations. To address these challenges, we present GaussHDR, which unifies 3D and 2D local tone mapping through 3D Gaussian splatting. |
Jinfeng Liu; Lingtong Kong; Bo Li; Dan Xu; | code |
| 655 | FiRe: Fixed-points of Restoration Priors for Solving Inverse Problems Highlight: In this work, we introduce Fixed-points of Restoration (FiRe) priors as a new framework for expanding the notion of priors in PnP to general restoration models beyond traditional denoising models. |
Matthieu Terris; Ulugbek S. Kamilov; Thomas Moreau; | code |
| 656 | EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching Highlight: We introduce the first learning-based dense matching algorithm, termed Equirectangular Projection-Oriented Dense Kernelized Feature Matching (EDM), specifically designed for omnidirectional images. |
Dongki Jung; Jaehoon Choi; Yonghan Lee; Somi Jeong; Taejae Lee; Dinesh Manocha; Suyong Yeon; | code |
| 657 | Digital Twin Catalog: A Large-Scale Photorealistic 3D Object Digital Twin Dataset Highlight: We introduce Digital Twin Catalog (DTC), a new large-scale photorealistic 3D object digital twin dataset. |
Zhao Dong; Ka Chen; Zhaoyang Lv; Hong-Xing Yu; Yunzhi Zhang; Cheng Zhang; Yufeng Zhu; Stephen Tian; Zhengqin Li; Geordie Moffatt; Sean Christofferson; James Fort; Xiaqing Pan; Mingfei Yan; Jiajun Wu; Carl Yuheng Ren; Richard Newcombe; | code |
| 658 | PQPP: A Joint Benchmark for Text-to-Image Prompt and Query Performance Prediction Highlight: To this end, we introduce the first dataset of prompts which are manually annotated in terms of image generation performance. |
Eduard Poesina; Adriana Valentina Costache; Adrian-Gabriel Chifu; Josiane Mothe; Radu Tudor Ionescu; | code |
| 659 | Deep Change Monitoring: A Hyperbolic Representative Learning Framework and A Dataset for Long-term Fine-grained Tree Change Detection Highlight: In this paper, we introduce UAVTC, a large-scale, long-term, high-resolution dataset collected using UAVs equipped with cameras, specifically designed to detect individual Tree Changes (TCs). |
Yante Li; Hanwen Qi; Haoyu Chen; Xinlian Liang; Guoying Zhao; | code |
| 660 | Prior Does Matter: Visual Navigation Via Denoising Diffusion Bridge Models Highlight: Additionally, the sparsity of effective action distributions makes it challenging for the policy to generate accurate actions without guidance. To address these issues, we propose NaviBridger, a novel, unified visual navigation framework leveraging denoising diffusion bridge models. |
Hao Ren; Yiming Zeng; Zetong Bi; Zhaoliang Wan; Junlong Huang; Hui Cheng; | code |
| 661 | MaSS13K: A Matting-level Semantic Segmentation Benchmark Highlight: In this work, we build a large-scale, matting-level semantic segmentation dataset, named MaSS13K, which consists of 13,348 real-world images, all at 4K resolution. |
Chenxi Xie; Minghan Li; Hui Zeng; Jun Luo; Lei Zhang; | code |
| 662 | Fuzzy Multimodal Learning for Trusted Cross-modal Retrieval Highlight: Although existing methods show promising performance, most are deterministic models and are unable to capture the uncertainty inherent in the retrieval outputs, leading to potentially unreliable results. To address this issue, we propose a novel framework called FUzzy Multimodal lEarning (FUME), which is able to self-estimate epistemic uncertainty, thereby embracing trusted cross-modal retrieval. |
Siyuan Duan; Yuan Sun; Dezhong Peng; Zheng Liu; Xiaomin Song; Peng Hu; | code |
| 663 | MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation Highlight: Generating high-fidelity 3D content from text prompts remains a significant challenge in computer vision due to the limited size, diversity, and annotation depth of the existing datasets. To address this, we introduce MARVEL-40M+, an extensive dataset with 40 million text annotations for over 8.9 million 3D assets aggregated from seven major 3D datasets. |
Sankalp Sinha; Mohammad Sadil Khan; Muhammad Usama; Shino Sam; Didier Stricker; Sk Aziz Ali; Muhammad Zeshan Afzal; | code |
| 664 | Domain Adaptive Diabetic Retinopathy Grading with Model Absence and Flowing Data Highlight: In this paper, we address a challenging Online Model-aGnostic Domain Adaptation (OMG-DA) setting, driven by the demands of clinical environments. |
Wenxin Su; Song Tang; Xiaofeng Liu; Xiaojing Yi; Mao Ye; Chunxiao Zu; Jiahao Li; Xiatian Zhu; | code |
| 665 | EasyHOI: Unleashing The Power of Large Models for Reconstructing Hand-Object Interactions in The Wild Highlight: Our work aims to reconstruct hand-object interactions from a single-view image, which is a fundamental but ill-posed task. Unlike methods that reconstruct from videos, multi-view images, or predefined 3D templates, single-view reconstruction faces significant challenges due to inherent ambiguities and occlusions. |
Yumeng Liu; Xiaoxiao Long; Zemin Yang; Yuan Liu; Marc Habermann; Christian Theobalt; Yuexin Ma; Wenping Wang; | code |
| 666 | EchoONE: Segmenting Multiple Echocardiography Planes in One Model Highlight: An effective solution to such a multi-plane segmentation (MPS) problem is in high demand for medical images, yet it has not been well investigated. In this paper, we propose a novel solution, EchoONE, for this problem with a SAM-based segmentation architecture, a prior-composable mask learning (PC-Mask) module for semantic-aware dense prompt generation, and a learnable CNN-branch with a simple yet effective local feature fusion and adaption (LFFA) module for SAM adapting. |
Jiongtong Hu; Wufeng Xue; Jun Cheng; Yingying Liu; Wei Zhuo; Dong Ni; | code |
| 667 | OW-OVD: Unified Open World and Open Vocabulary Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel detector, OW-OVD, which inherits the zero-shot generalization capability of OVD detectors while incorporating the ability to actively detect unknown objects and progressively optimize performance through incremental learning, as seen in OWOD detectors. |
Xing Xi; Yangyang Huang; Ronghua Luo; Yu Qiu; | code |
| 668 | Hand-held Object Reconstruction from RGB Video with Dynamic Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work aims to reconstruct the 3D geometry of a rigid object manipulated by one or both hands using monocular RGB video. |
Shijian Jiang; Qi Ye; Rengan Xie; Yuchi Huo; Jiming Chen; | code |
| 669 | Depth-Guided Bundle Sampling for Efficient Generalizable Neural Radiance Field Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recognizing that natural scenes are typically piecewise smooth and sampling all rays is often redundant, we propose a novel depth-guided bundle sampling strategy to accelerate rendering. |
Li Fang; Hao Zhu; Longlong Chen; Fei Hu; Long Ye; Zhan Ma; | code |
| 670 | T2SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we defined a novel Traffic Topology Scene Graph (T²SG), a unified scene graph explicitly modeling the lane, controlled and guided by different road signals (e.g., right turn), and topology relationships among them, which are always ignored by previous high-definition (HD) mapping methods. |
Changsheng Lv; Mengshi Qi; Liang Liu; Huadong Ma; | code |
| 671 | VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While reusing pre-trained 2D backbones in 3D enhances zero-shot potential, their performance on complex 3D structures still lags behind leading 3D models. To address these issues, we present VISTA3D, Versatile Imaging SegmenTation and Annotation model, that targets to solve all these challenges and requirements with one unified foundation model. |
Yufan He; Pengfei Guo; Yucheng Tang; Andriy Myronenko; Vishwesh Nath; Ziyue Xu; Dong Yang; Can Zhao; Benjamin Simon; Mason Belue; Stephanie Harmon; Baris Turkbey; Daguang Xu; Wenqi Li; | code |
| 672 | FRAMES-VQA: Benchmarking Fine-Tuning Robustness Across Multi-Modal Shifts in Visual Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a new benchmark FRAMES-VQA (Fine-Tuning Robustness Across Multi-Modal Shifts in VQA) for evaluating robust fine-tuning for VQA tasks. |
Chengyue Huang; Brisa Maneechotesuwan; Shivang Chopra; Zsolt Kira; | code |
| 673 | SLVR: Super-Light Visual Reconstruction Via Blueprint Controllable Convolutions and Exploring Feature Diversity Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In addition, although blueprint separable convolutions (BSConv) have proved the dominance of intra-kernel correlation, BSConv forces the blueprint to perform scale transformation on all channels, which may lead to incorrect intra-kernel correlation and introduce useless or disruptive features on some channels and hinder the effective propagation of features. Therefore, in this paper, we rethink the FAM and BSConv for super-light visual reconstruction framework design. |
Ning Ni; Libao Zhang; | code |
| 674 | 3D-HGS: 3D Half-Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Among these, 3D Gaussian Splatting (3D-GS) outperforms Neural Radiance Fields (NeRFs) in quality and speed but struggles with shape and color discontinuities. We propose 3D Half-Gaussian (3D-HGS) kernels as a plug-and-play solution to address these limitations. |
Haolin Li; Jinyang Liu; Mario Sznaier; Octavia Camps; | code |
| 675 | Do Computer Vision Foundation Models Learn The Low-level Characteristics of The Human Visual System? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The question we address in this paper is whether foundation models trained on natural images mimic some of the low-level characteristics of the human visual system, such as contrast detection, contrast masking, and contrast constancy. |
Yancheng Cai; Fei Yin; Dounia Hammou; Rafal Mantiuk; | code |
| 676 | StdGEN: Semantic-Decomposed 3D Character Generation from Single Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present StdGEN, an innovative pipeline for generating semantically decomposed high-quality 3D characters from single images, enabling broad applications in virtual reality, gaming, and filmmaking, etc. |
Yuze He; Yanning Zhou; Wang Zhao; Zhongkai Wu; Kaiwen Xiao; Wei Yang; Yong-Jin Liu; Xiao Han; | code |
| 677 | STAR-Edge: Structure-aware Local Spherical Curve Representation for Thin-walled Edge Extraction from Unstructured Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce STAR-Edge, a novel approach designed for detecting and refining edge points in thin-walled structures. |
Zikuan Li; Honghua Chen; Yuecheng Wang; Sibo Wu; Mingqiang Wei; Jun Wang; | code |
| 678 | Enhancing Facial Privacy Protection Via Weakening Diffusion Purification Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The rapid growth of social media has led to the widespread sharing of individual portrait images, which pose serious privacy risks due to the capabilities of automatic face … |
Ali Salar; Qing Liu; Yingli Tian; Guoying Zhao; | code |
| 679 | SATA: Spatial Autocorrelation Token Analysis for Enhancing The Robustness of Vision Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These approaches often involve extensive training and fine-tuning, which are time-consuming and resource-intensive. To tackle these obstacles, we introduce a novel approach named Spatial Autocorrelation Token Analysis (SATA). |
Nick Nikzad; Yi Liao; Yongsheng Gao; Jun Zhou; | code |
| 680 | CSC-PA: Cross-image Semantic Correlation Via Prototype Attentions for Single-network Semi-supervised Breast Tumor Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most existing semi-supervised methods employ the mean-teacher architecture, which merely learns semantic information within a single image and heavily relies on the performance of the teacher model. Therefore, we present a novel cross-image semantic correlation semi-supervised framework, named CSC-PA, to improve the performance of BUS image segmentation. |
Zhenhui Ding; Guilian Chen; Qin Zhang; Huisi Wu; Jing Qin; | code |
| 681 | DTGBrepGen: A Novel B-rep Generative Model Through Decoupling Topology and Geometry Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose DTGBrepGen, a novel topology-geometry decoupled framework for B-rep generation that explicitly addresses both aspects. |
Jing Li; Yihang Fu; Falai Chen; | code |
| 682 | Analyzing The Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To facilitate our analysis, we propose a data synthesis pipeline to synthesize high-quality data. |
Zhuoran Zhao; Linlin Yang; Pengzhan Sun; Pan Hui; Angela Yao; | code |
| 683 | Silence Is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The limitations are twofold: 1) they fail to prevent images from being manipulated by audio signals, and 2) diffusion-based purification techniques can effectively eliminate protective perturbations. To address these challenges, we propose Silencer, a two-stage method designed to proactively protect the privacy of portraits. |
Yuan Gan; Jiaxu Miao; Yunze Wang; Yi Yang; | code |
| 684 | EvEnhancer: Empowering Effectiveness, Efficiency and Generalizability for Continuous Space-Time Video Super-Resolution with Events Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents EvEnhancer, an innovative approach that marries the unique advantages of event streams to elevate effectiveness, efficiency, and generalizability for C-STVSR. |
Shuoyan Wei; Feng Li; Shengeng Tang; Yao Zhao; Huihui Bai; | code |
| 685 | Point Cloud Upsampling Using Conditional Diffusion Module with Adaptive Noise Suppression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods mostly focus on generating the geometric details of point clouds, neglecting noise suppression. To address this, we propose a novel network based on a conditional diffusion model, incorporating the Adaptive Noise Suppression (ANS) module, which we refer to as PDANS. |
Boqian Zhang; Shen Yang; Hao Chen; Chao Yang; Jing Jia; Guang Jiang; | code |
| 686 | VoxelSplat: Dynamic Gaussian Splatting As An Effective Loss for Occupancy and Flow Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent advancements in camera-based occupancy prediction have focused on the simultaneous prediction of 3D semantics and scene flow, a task that presents significant challenges due to specific difficulties, e.g., occlusions and unbalanced dynamic environments. In this paper, we analyze these challenges and their underlying causes. |
Ziyue Zhu; Shenlong Wang; Jin Xie; Jiang-jiang Liu; Jingdong Wang; Jian Yang; | code |
| 687 | LogoSP: Local-global Grouping of Superpoints for Unsupervised Semantic Segmentation of 3D Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce LogoSP to learn 3D semantics from both local and global point features. |
Zihui Zhang; Weisheng Dai; Hongtao Wen; Bo Yang; | code |
| 688 | Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we are the first to quantify and evaluate the uncertainty of T2I models with respect to the prompt. |
Gianni Franchi; Nacim Belkhir; Dat Nguyen Trong; Guoxuan Xia; Andrea Pilzer; | code |
| 689 | MoEdit: On Learning Quantity Perception for Multi-object Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods struggle to analyze abstract relationships between multiple objects, often yielding suboptimal performance. To address this, we propose MoEdit, an auxiliary-free method for multi-object image editing. |
Yanfeng Li; Kahou Chan; Yue Sun; Chantong Lam; Tong Tong; Zitong Yu; Keren Fu; Xiaohong Liu; Tao Tan; | code |
| 690 | VIRES: Video Instance Repainting Via Sketch and Text Guided Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce VIRES, a video instance repainting method with sketch and text guidance, enabling video instance repainting, replacement, generation, and removal. |
Shuchen Weng; Haojie Zheng; Peixuan Zhang; Yuchen Hong; Han Jiang; Si Li; Boxin Shi; | code |
| 691 | URWKV: Unified RWKV Model with Multi-state Perspective for Low-light Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing low-light image enhancement (LLIE) and joint LLIE and deblurring (LLIE-deblur) models have made strides in addressing predefined degradations, yet they are often constrained by dynamically coupled degradations. To address these challenges, we introduce a Unified Receptance Weighted Key Value (URWKV) model with multi-state perspective, enabling flexible and effective degradation restoration for low-light images. |
Rui Xu; Yuzhen Niu; Yuezhou Li; Huangbiao Xu; Wenxi Liu; Yuzhong Chen; | code |
| 692 | NexusGS: Sparse View Synthesis with Epipolar Depth Priors in 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present NexusGS, a 3DGS-based approach that enhances novel view synthesis from sparse-view images by directly embedding depth information into point clouds, without relying on complex manual regularizations. |
Yulong Zheng; Zicheng Jiang; Shengfeng He; Yandu Sun; Junyu Dong; Huaidong Zhang; Yong Du; | code |
| 693 | Image Generation Diversity Issues and How to Tame Them Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Evaluation reveals that current diffusion models converge to limited subsets of the real distribution, with no current state-of-the-art models surpassing 77% of the diversity of the training data. To address this limitation, we introduce Diversity-Aware Diffusion Models (DiADM), a novel approach that improves diversity of unconditional diffusion models without loss of image quality. |
Mischa Dombrowski; Weitong Zhang; Sarah Cechnicka; Hadrien Reynaud; Bernhard Kainz; | code |
| 694 | STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing multimodal SemiSL methods typically focus on unimodal or modality-shared features, ignoring valuable task-relevant modality-specific information, leading to a Modality Information Gap. In this paper, we propose STiL, a novel SemiSL tabular-image framework that addresses this gap by comprehensively exploring task-relevant information. |
Siyi Du; Xinzhe Luo; Declan P. O’Regan; Chen Qin; | code |
| 695 | Detecting Out-of-Distribution Through The Lens of Neural Collapse Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the phenomenon of Neural Collapse, we propose a versatile and efficient OOD detection method. |
Litian Liu; Yao Qin; | code |
| 696 | HalLoc: Token-level Localization of Hallucinations for Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Their definitive outcomes also fail to account for real-world scenarios where the line between hallucinated and truthful information is unclear. To address these issues, we propose HalLoc, a dataset designed for efficient, probabilistic hallucination detection. |
Eunkyu Park; Minyeong Kim; Gunhee Kim; | code |
| 697 | A Simple Data Augmentation for Feature Distribution Skewed Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on the feature distribution skewed FL scenario, a common non-IID situation in real-world applications where data from different clients exhibit varying underlying distributions. |
Yunlu Yan; Huazhu Fu; Yuexiang Li; Jinheng Xie; Jun Ma; Guang Yang; Lei Zhu; | code |
| 698 | PICO: Reconstructing 3D People In Contact with Objects Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact correspondences on both body and object meshes. |
Alpár Cseke; Shashank Tripathi; Sai Kumar Dwivedi; Arjun S. Lakshmipathy; Agniv Chatterjee; Michael J. Black; Dimitrios Tzionas; | code |
| 699 | LoTUS: Large-Scale Machine Unlearning with A Taste of Uncertainty Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoiding retraining from scratch. |
Christoforos N. Spartalis; Theodoros Semertzidis; Efstratios Gavves; Petros Daras; | code |
| 700 | Hyperspectral Pansharpening Via Diffusion Models with Iteratively Zero-Shot Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel guided diffusion scheme with zero-shot guidance and neural spatial-spectral decomposition (NSSD) to iteratively generate the RGB detail image and map the RGB detail image to target HR-HSI. |
Jin-Liang Xiao; Ting-Zhu Huang; Liang-Jian Deng; Guang Lin; Zihan Cao; Chao Li; Qibin Zhao; | code |
| 701 | Taste More, Taste Better: Diverse Data and Strong Model Boost Semi-Supervised Crowd Counting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel framework called Taste More Taste Better (TMTB), which emphasizes both data and model aspects. |
Maochen Yang; Zekun Li; Jian Zhang; Lei Qi; Yinghuan Shi; | code |
| 702 | Thin-Shell-SfT: Fine-Grained Monocular Non-rigid 3D Surface Tracking with Neural Deformation Fields Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Consequently, fine surface details such as cloth wrinkles are often not recovered with the desired accuracy. In response to these limitations, we propose Thin-Shell-SfT, a new method for non-rigid 3D tracking that represents a surface as an implicit and continuous spatiotemporal neural field. |
Navami Kairanda; Marc Habermann; Shanthika Naik; Christian Theobalt; Vladislav Golyanik; | code |
| 703 | Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. |
Jinhui Yi; Syed Talal Wasim; Yanan Luo; Muzammal Naseer; Juergen Gall; | code |
| 704 | Anatomical Consistency and Adaptive Prior-informed Transformation for Multi-contrast MR Image Synthesis Via Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose APT, a unified diffusion model designed to generate accurate and anatomically consistent multi-contrast MR images. |
Yejee Shin; Yeeun Lee; Hanbyol Jang; Geonhui Son; Hyeongyu Kim; Dosik Hwang; | code |
| 705 | AeSPa : Attention-guided Self-supervised Parallel Imaging for MRI Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study introduces a novel zero-shot scan-specific self-supervised reconstruction method for magnetic resonance imaging (MRI) to reduce scan times. |
Jinho Joo; Hyeseong Kim; Hyeyeon Won; Deukhee Lee; Taejoon Eo; Dosik Hwang; | code |
| 706 | FruitNinja: 3D Object Interior Texture Generation with Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, current 3D generation and inpainting methods often focus on visible appearance and overlook internal textures. To bridge this gap, we introduce FruitNinja, the first method to generate internal textures for 3D objects undergoing geometric and topological changes. |
Fangyu Wu; Yuhao Chen; | code |
| 707 | Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Generative Photography, a framework that allows controlling camera intrinsic settings during content generation. |
Yu Yuan; Xijun Wang; Yichen Sheng; Prateek Chennuri; Xingguang Zhang; Stanley Chan; | code |
| 708 | Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Protecting these generative models is crucial for the well-being of their owners. In this work, we propose the first method to this important yet unresolved issue, called Training data Provenance Verification (TrainProVe). |
Yuechen Xie; Jie Song; Huiqiong Wang; Mingli Song; | code |
| 709 | SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To further enhance the model’s capacity for detailed analysis of plot structures and character relationships within series, we propose a novel narrative reasoning framework, PC-DCoT. |
Chenkai Zhang; Yiming Lei; Zeming Liu; Haitao Leng; ShaoGuo Liu; Tingting Gao; Qingjie Liu; Yunhong Wang; | code |
| 710 | Neuro-Symbolic Evaluation of Text-to-Video Models Using Formal Verification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these metrics emphasize visual quality and smoothness, neglecting temporal fidelity and text-to-video alignment, which are crucial for safety-critical applications. To address this gap, we introduce NeuS-V, a novel synthetic video evaluation metric that rigorously assesses text-to-video alignment using neuro-symbolic formal verification techniques. |
S P Sharan; Minkyu Choi; Sahil Shah; Harsh Goel; Mohammad Omama; Sandeep Chinchali; | code |
| 711 | Latent Space Super-Resolution for Higher-Resolution Image Generation with Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose LSRNA, a novel framework for higher-resolution (exceeding 1K) image generation using diffusion models by leveraging super-resolution directly in the latent space. |
Jinho Jeong; Sangmin Han; Jinwoo Kim; Seon Joo Kim; | code |
| 712 | Sampling Innovation-Based Adaptive Compressive Sensing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a Sampling Innovation-Based ACS (SIB-ACS) method that can effectively identify and allocate sampling to challenging image reconstruction areas, culminating in high-fidelity image reconstruction. |
Zhifu Tian; Tao Hu; Chaoyang Niu; Di Wu; Shu Wang; | code |
| 713 | LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We identify gradient flow imbalances in Transformers that violate FullGrad-completeness, a critical property for attribution faithfulness that CNNs naturally possess. To address this issue, we introduce LibraGrad, a theoretically grounded post-hoc approach that corrects gradient imbalances through pruning and scaling of backward paths, without changing the forward pass or adding computational overhead. |
Faridoun Mehri; Mahdieh Soleymani Baghshah; Mohammad Taher Pilehvar; | code |
| 714 | Anchor-Aware Similarity Cohesion in Target Frames Enables Predicting Temporal Moment Boundaries in 2D Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, to reduce distraction from irrelevant frames, we designate as the anchor frame the one with the maximum query-frame relevance, as measured by an established Vision-Language Model. |
Jiawei Tan; Hongxing Wang; Junwu Weng; Jiaxin Li; Zhilong Ou; Kang Dang; | code |
| 715 | Higher-Order Ratio Cycles for Fast and Globally Optimal Shape Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we address various shape matching problems that can be cast as finding cyclic paths in a product graph. |
Paul Roetzer; Viktoria Ehm; Daniel Cremers; Zorah Lähner; Florian Bernard; | code |
| 716 | LotusFilter: Fast Diverse Nearest Neighbor Search Via A Learned Cutoff Table Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose LotusFilter, a post-processing module to diversify ANNS results. |
Yusuke Matsui; | code |
| 717 | Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce Spatiotemporal Skip Guidance (STG), a simple training-free sampling guidance method for enhancing transformer-based video diffusion models. |
Junha Hyung; Kinam Kim; Susung Hong; Min-Jung Kim; Jaegul Choo; | code |
| 718 | Sound Bridge: Associating Egocentric and Exocentric Videos Via Audio Cues Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing visual-to-visual and visual-to-textual Ego-Exo video alignment methods struggle with the problem that there could be non-visual overlap for the same activity. To address this, we propose using sound as a bridge, as audio is often consistent across Ego-Exo videos. |
Sihong Huang; Jiaxin Wu; Xiaoyong Wei; Yi Cai; Dongmei Jiang; Yaowei Wang; | code |
| 719 | Ges3ViG: Incorporating Pointing Gestures Into Language-Based 3D Visual Grounding for Embodied Reference Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although prior work has explored pure language-based 3D grounding, there has been limited exploration of 3D-ERU, which also incorporates human pointing gestures. To address this gap, we introduce a data augmentation framework, Imputer, and use it to curate a new benchmark dataset, ImputeRefer, for 3D-ERU, by incorporating human pointing gestures into existing 3D scene datasets that only contain language instructions. |
Atharv Mahesh Mane; Dulanga Weerakoon; Vigneshwaran Subbaraju; Sougata Sen; Sanjay E. Sarma; Archan Misra; | code |
| 720 | Split Adaptation for Pre-trained Vision Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A straightforward solution may be sending the pre-trained ViT to clients for local adaptation, which poses issues of model intellectual property and incurs heavy client computation overhead. To address these issues, we propose a novel split adaptation (SA) method that enables effective downstream adaptation while protecting data and models. |
Lixu Wang; Bingqi Shang; Yi Li; Payal Mohapatra; Wei Dong; Xiao Wang; Qi Zhu; | code |
| 721 | Online Task-Free Continual Learning Via Dynamic Expansionable Memory Distribution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we address online TFCL by introducing an innovative memory management approach that incorporates a dynamic memory system for storing selected data representatives from evolving distributions, while a dynamically expandable memory system enables the retention of essential long-term knowledge. |
Fei Ye; Adrian G. Bors; | code |
| 722 | Post-pre-training for Modality Alignment in Vision-Language Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents CLIP-Refine, a post-pre-training method for CLIP models at a phase between pre-training and fine-tuning. |
Shin’ya Yamaguchi; Dewei Feng; Sekitoshi Kanai; Kazuki Adachi; Daiki Chijiwa; | code |
| 723 | Data-free Universal Adversarial Perturbation with Pseudo-semantic Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, traditional data-free UAP methods often suffer from limited transferability due to the absence of semantic content in random noise. To address this issue, we propose a novel data-free universal attack method that recursively extracts pseudo-semantic priors directly from the UAPs during training to enrich the semantic content within the data-free UAP framework. |
Chanhui Lee; Yeonghwan Song; Jeany Son; | code |
| 724 | HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces HarmonySet, a comprehensive dataset designed to advance video-music understanding. |
Zitang Zhou; Ke Mei; Yu Lu; Tianyi Wang; Fengyun Rao; | code |
| 725 | Learning Person-Specific Animatable Face Models from In-the-Wild Images Via A Shared Base Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a method to train a generic base model and then transfer it to yield person-specific models by integrating lightweight adapters within the large-parameter ViT-MAE base model. |
Yuxiang Mao; Zhenfeng Fan; ZhiJie Zhang; Zhiheng Zhang; Shihong Xia; | code |
| 726 | MammAlps: A Multi-view Video Behavior Monitoring Dataset of Wild Mammals in The Swiss Alps Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To advance research in wild animal behavior monitoring, we present MammAlps, a multimodal and multi-view dataset of wildlife behavior monitoring from 9 camera-traps in the Swiss National Park. |
Valentin Gabeff; Haozhe Qi; Brendan Flaherty; Gencer Sumbul; Alexander Mathis; Devis Tuia; | code |
| 727 | AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce AG-VPReID, a new large-scale dataset for aerial-ground video-based person re-identification (ReID) that comprises 6,632 subjects, 32,321 tracklets and over 9.6 million frames captured by drones (altitudes ranging from 15-120m), CCTV, and wearable cameras. |
Huy Nguyen; Kien Nguyen; Akila Pemasiri; Feng Liu; Sridha Sridharan; Clinton Fookes; | code |
| 728 | Attend to Not Attended: Structure-then-Detail Token Merging for Post-training DiT Acceleration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Subsequently, we introduce SDTM, a structure-then-detail token merging approach that dynamically compresses feature redundancies. |
Haipeng Fang; Sheng Tang; Juan Cao; Enshuo Zhang; Fan Tang; Tong-Yee Lee; | code |
| 729 | RICCARDO: Radar Hit Prediction and Convolution for Camera-Radar 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we explicitly utilize a radar hit distribution model to assist fusion. |
Yunfei Long; Abhinav Kumar; Xiaoming Liu; Daniel Morris; | code |
| 730 | Overcoming Shortcut Problem in VLM for Robust Out-of-Distribution Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we analyze the OOD problem from the perspective of shortcuts in VLMs and propose OSPCoOp which includes background decoupling and mask-guided region regularization. |
Zhuo Xu; Xiang Xiang; Yifan Liang; | code |
| 731 | DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present our observation that CLIP’s image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. |
Junjie Wang; Bin Chen; Yulin Li; Bin Kang; Yichi Chen; Zhuotao Tian; | code |
| 732 | RobSense: A Robust Multi-modal Foundation Model for Remote Sensing with Static, Temporal, and Incomplete Data Adaptability Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose RobSense, a robust multi-modal foundation model for Multi-spectral and Synthetic Aperture Radar data. |
Minh Kha Do; Kang Han; Phu Lai; Khoa T. Phan; Wei Xiang; | code |
| 733 | Exploring Semantic Feature Discrimination for Perceptual Image Super-Resolution and Opinion-Unaware No-Reference Image Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most existing GAN-based SR methods typically perform coarse-grained discrimination directly on images and ignore the semantic information of images, making it challenging for the super resolution networks (SRN) to learn fine-grained and semantic-related texture details. To alleviate this issue, we propose a semantic feature discrimination method, SFD, for perceptual SR. |
Guanglu Dong; Xiangyu Liao; Mingyang Li; Guihuan Guo; Chao Ren; | code |
| 734 | Channel Consistency Prior and Self-Reconstruction Strategy Based Unsupervised Image Deraining Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Channel Consistency Prior and Self-Reconstruction Strategy Based Unsupervised Image Deraining framework, CSUD, to tackle the aforementioned challenges. |
Guanglu Dong; Tianheng Zheng; Yuanzhouhan Cao; Linbo Qing; Chao Ren; | code |
| 735 | HoGS: Unified Near and Far Object Reconstruction Via Homogeneous Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We found that, despite its ultimate simplicity, using homogeneous coordinates, a concept on the projective geometry, for the 3DGS pipeline remarkably improves the rendering accuracies of distant objects. We therefore propose Homogeneous Gaussian Splatting (HoGS) incorporating homogeneous coordinates into the 3DGS framework, providing a unified representation for enhancing near and distant objects. |
Xinpeng Liu; Zeyi Huang; Fumio Okura; Yasuyuki Matsushita; | code |
| 736 | Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose AugCLIP, a context-aware metric that adaptively coordinates preservation and modification aspects, depending on the specific context of a given source image and target text. |
Yoonjeon Kim; Soohyun Ryu; Yeonsung Jung; Hyunkoo Lee; Joowon Kim; June Yong Yang; Jaeryong Hwang; Eunho Yang; | code |
| 737 | DPU: Dynamic Prototype Updating for Multimodal Out-of-Distribution Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This assumption can lead to performance degradation, especially when prediction discrepancies are indiscriminately amplified across all samples. To address this issue, we propose Dynamic Prototype Updating (DPU), a novel plug-and-play framework for multimodal OOD detection that accounts for intra-class variations. |
Shawn Li; Huixian Gong; Hao Dong; Tiankai Yang; Zhengzhong Tu; Yue Zhao; | code |
| 738 | FASTer: Focal Token Acquiring-and-Scaling Transformer for Long-term 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Focal Token Acquiring-and-Scaling Transformer (FASTer), which dynamically selects focal tokens and condenses token sequences in an adaptive and lightweight manner. |
Chenxu Dang; ZaiPeng Duan; Pei An; Xinmin Zhang; Xuzhong Hu; Jie Ma; | code |
| 739 | TAET: Two-Stage Adversarial Equalization Training on Long-Tailed Distributions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we provide an in-depth analysis of adversarial training in the context of long-tailed distributions and identify the limitations of the current state-of-the-art method, AT-BSL, in achieving robust performance under such conditions. To address these challenges, we propose a novel training framework, TAET, which incorporates an initial stabilization phase followed by a stratified, equalization adversarial training phase. |
Wang Yu-Hang; Junkang Guo; Aolei Liu; Kaihao Wang; Zaitong Wu; Zhenyu Liu; Wenfei Yin; Jian Liu; | code |
| 740 | Subspace Constraint and Contribution Estimation for Heterogeneous Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing frameworks face the challenges of aggregation bias and local overfitting. To address these issues, we propose FedSCE. |
Xiangtao Zhang; Sheng Li; Ao Li; Yipeng Liu; Fan Zhang; Ce Zhu; Le Zhang; | code |
| 741 | EchoMatch: Partial-to-Partial Shape Matching Via Correspondence Reflection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Finding correspondences between partial shapes comes with an additional challenge: We not only want to identify correspondences between points on either shape but also have to determine which points of each shape actually have a partner. To tackle this challenging problem, we present EchoMatch, a novel framework for partial-to-partial shape matching that incorporates the concept of correspondence reflection to enable an overlap prediction within a functional map framework. |
Yizheng Xie; Viktoria Ehm; Paul Roetzer; Nafie El Amrani; Maolin Gao; Florian Bernard; Daniel Cremers; | code |
| 742 | SP3D: Boosting Sparsely-Supervised 3D Object Detection Via Accurate Cross-Modal Semantic Prompts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a boosting strategy, termed SP3D, explicitly utilizing the cross-modal semantic prompts generated from Large Multimodal Models (LMMs) to boost the 3D detector with robust feature discrimination capability under sparse annotation settings. |
Shijia Zhao; Qiming Xia; Xusheng Guo; Pufan Zou; Maoji Zheng; Hai Wu; Chenglu Wen; Cheng Wang; | code |
| 743 | AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This results in suboptimal hash codes, as it ignores frame-specific information density and reconstruction difficulty. To address this limitation, we propose a new framework, termed AutoSSVH, that employs adversarial frame sampling with hash-based contrastive learning. |
Niu Lian; Jun Li; Jinpeng Wang; Ruisheng Luo; Yaowei Wang; Shu-Tao Xia; Bin Chen; | code |
| 744 | Free Lunch Enhancements for Multi-modal Crowd Counting Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper addresses multi-modal crowd counting with a novel 'free lunch' training enhancement strategy that requires no additional data, parameters, or increased inference … |
Haoliang Meng; Xiaopeng Hong; Zhengqin Lai; Miao Shang; | code |
| 745 | Plug-and-Play PPO: An Adaptive Point Prompt Optimizer Making SAM Greater Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel plug-and-play dual-space Point Prompt Optimizer (PPO) designed to enhance prompt distribution through deep reinforcement learning (DRL)-based heterogeneous graph optimization. |
Xueyu Liu; Rui Wang; Yexin Lai; Guangze Shi; Feixue Shao; Fang Hao; Jianan Zhang; Jia Shen; Yongfei Wu; Wen Zheng; | code |
| 746 | Spherical Manifold Guided Diffusion Model for Panoramic Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Panoramic images play a pivotal role in emerging virtual reality and augmented reality scenarios; however, the generation of panoramic images is essentially challenging due to the intrinsic spherical geometry and the spherical distortions caused by equirectangular projection (ERP). To address this, we start from the very basics of the S^2 manifold inherent to panoramic images, and propose a novel spherical manifold convolution (SMConv) on the S^2 manifold. |
Xiancheng Sun; Mai Xu; Shengxi Li; Senmao Ma; Xin Deng; Lai Jiang; Gang Shen; | code |
| 747 | The Devil Is in The Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce RAPO, a novel Retrieval-Augmented Prompt Optimization framework. |
Bingjie Gao; Xinyu Gao; Xiaoxue Wu; Yujie Zhou; Yu Qiao; Li Niu; Xinyuan Chen; Yaohui Wang; | code |
| 748 | Test-Time Domain Generalization Via Universe Learning: A Multi-Graph Matching Approach for Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most existing TTA methods struggle to deliver strong performance in medical image segmentation, primarily because they overlook the crucial prior knowledge inherent to medical images. To address this challenge, we incorporate morphological information and propose a framework based on multi-graph matching. |
Xingguo Lv; Xingbo Dong; Liwen Wang; Jiewen Yang; Lei Zhao; Bin Pu; Zhe Jin; Xuejun Li; | code |
| 749 | Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose several solvers to estimate both rotational and translational velocities within a unified framework. |
Ji Zhao; Banglei Guan; Zibin Liu; Laurent Kneip; | code |
| 750 | PARC: A Quantitative Framework Uncovering The Symmetries Within Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce PARC (Prompt Analysis via Reliability and Calibration), a VLM prompt sensitivity analysis framework built on three pillars: (1) plausible prompt variations in both the language and vision domain, (2) a novel model reliability score with built-in guarantees, and (3) a calibration step that enables dataset- and prompt-spanning prompt variation analysis. |
Jenny Schmalfuss; Nadine Chang; Vibashan VS; Maying Shen; Andres Bruhn; Jose M. Alvarez; | code |
| 751 | Balanced Direction from Multifarious Choices: Arithmetic Meta-Learning for Domain Generalization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These methods overlook another critical factor: the balanced parameters should be close to the centroid of the optimal parameters of each source domain. To address this, we propose a simple yet effective arithmetic meta-learning method with arithmetic-weighted gradients. |
Xiran Wang; Jian Zhang; Lei Qi; Yinghuan Shi; | code |
| 752 | Recovering Dynamic 3D Sketches from Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Understanding 3D motion from videos presents inherent challenges due to the diverse types of movement, ranging from rigid and deformable objects to articulated structures. To overcome this, we propose Liv3Stroke, a novel approach for abstracting objects in motion with deformable 3D strokes. |
Jaeah Lee; Changwoon Choi; Young Min Kim; Jaesik Park; | code |
| 753 | BWFormer: Building Wireframe Reconstruction from Airborne LiDAR Point Cloud with Transformer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present BWFormer, a novel Transformer-based model for building wireframe reconstruction from airborne LiDAR point cloud. |
Yuzhou Liu; Lingjie Zhu; Hanqiao Ye; Shangfeng Huang; Xiang Gao; Xianwei Zheng; Shuhan Shen; | code |
| 754 | SmartCLIP: Modular Vision-language Alignment with Identification Guarantees Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we establish theoretical conditions that enable flexible alignment between textual and visual representations across varying levels of granularity. |
Shaoan Xie; Lingjing Kong; Yujia Zheng; Yu Yao; Zeyu Tang; Eric P. Xing; Guangyi Chen; Kun Zhang; | code |
| 755 | Balancing Two Classifiers Via A Simplex ETF Structure for Model Calibration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose BalCAL, a novel method that balances a learnable classifier with an ETF classifier to solve the overconfidence and underconfidence problems in model calibration. |
Jiani Ni; He Zhao; Jintong Gao; Dandan Guo; Hongyuan Zha; | code |
| 756 | DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While methods for temporal modeling do exist, they often require significant architectural changes and expensive retraining, making them impractical for off-the-shelf, fine-tuned large encoders. To overcome these limitations, we propose DejaVid, an encoder-agnostic method that enhances model performance without the need for retraining or altering the architecture. |
Darryl Ho; Samuel Madden; | code |
| 757 | Towards Lossless Implicit Neural Representation Via Bit Plane Decomposition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we present a bit-plane decomposition method that makes INR predict bit-planes, producing the same effect as reducing the upper bound of the model size. |
Woo Kyoung Han; Byeonghun Lee; Hyunmin Cho; Sunghoon Im; Kyong Hwan Jin; | code |
| 758 | What Makes A Good Dataset for Knowledge Distillation? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we explore multiple possible surrogate distillation datasets and demonstrate that many different datasets, even unnatural synthetic imagery, can serve as a suitable alternative in KD. |
Logan Frank; Jim Davis; | code |
| 759 | Gradient Inversion Attacks on Parameter-Efficient Fine-Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate how the privacy of the fine-tuning data of the users can be compromised via a malicious design of the pretrained model and trainable adapter modules. |
Hasin Us Sami; Swapneel Sen; Amit K. Roy-Chowdhury; Srikanth V. Krishnamurthy; Basak Guler; | code |
| 760 | SSHNet: Unsupervised Cross-modal Homography Estimation Via Problem Reformulation and Split Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel unsupervised cross-modal homography estimation learning framework, named Split Supervised Homography estimation Network (SSHNet). |
Junchen Yu; Si-Yuan Cao; Runmin Zhang; Chenghao Zhang; Zhu Yu; Shujie Chen; Bailin Yang; Hui-Liang Shen; | code |
| 761 | High-Fidelity Relightable Monocular Portrait Animation with Lighting-Controllable Video Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a Lighting Controllable Video Diffusion model (LCVD) for high-fidelity, relightable portrait animation. |
Mingtao Guo; Guanyu Xing; Yanli Liu; | code |
| 762 | FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current sports research is largely centered on ball games, with limited relevance to artistic sports like figure skating. To address this, we introduce FSAnno, a large-scale dataset advancing artistic sports understanding through figure skating. |
Rong Gao; Xin Liu; Zhuozhao Hu; Bohao Xing; Baiqiang Xia; Zitong Yu; Heikki Kälviäinen; | code |
| 763 | GroomLight: Hybrid Inverse Rendering for Relightable Human Hair Appearance Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present GroomLight, a novel method for relightable hair appearance modeling from multi-view images. |
Yang Zheng; Menglei Chai; Delio Vicini; Yuxiao Zhou; Yinghao Xu; Leonidas Guibas; Gordon Wetzstein; Thabo Beeler; | code |
| 764 | Learning to Filter Outlier Edges in Global SfM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To overcome the challenge of memory overflow caused by converting to a line graph, we introduce a clustering-based graph processing approach, enabling our method to be applied to arbitrarily large pose graphs. |
Nicole Damblon; Marc Pollefeys; Daniel Barath; | code |
| 765 | HybridMQA: Exploring Geometry-Texture Interactions for Colored Mesh Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce HybridMQA, a first-of-its-kind hybrid full-reference colored MQA framework that integrates model-based and projection-based approaches, capturing complex interactions between textural information and 3D structures for enriched quality representations. |
Armin Shafiee Sarvestani; Sheyang Tang; Zhou Wang; | code |
| 766 | Beyond Local Sharpness: Communication-Efficient Global Sharpness-aware Minimization for Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work introduces FedGloSS (Federated Global Server-side Sharpness), a novel FL approach that prioritizes the optimization of global sharpness on the server, using SAM. |
Debora Caldarola; Pietro Cagnasso; Barbara Caputo; Marco Ciccone; | code |
| 767 | Diffusion Bridge: Leveraging Diffusion Model to Reduce The Modality Gap Between Text and Vision for Zero-Shot Image Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Traditional approaches, such as noise injection and memory-based similarity matching, attempt to address this gap, yet these methods either rely on indirect alignment or relatively naive solutions with heavy computation. Diffusion Bridge introduces a novel approach to directly reduce this modality gap by leveraging Denoising Diffusion Probabilistic Models (DDPM), trained exclusively on text embeddings to model their distribution. |
Jeong Ryong Lee; Yejee Shin; Geonhui Son; Dosik Hwang; | code |
| 768 | OmniStereo: Real-time Omnidirectional Depth Estimation with Multiview Fisheye Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While many well-recognized methods have been developed to produce high-quality omnidirectional 3D information, they are too slow for real-time computation, limiting their feasibility in practical applications. Motivated by these shortcomings, we propose an efficient omnidirectional depth sensing framework, called OmniStereo, which generates high-quality 3D information in real-time. |
Jiaxi Deng; Yushen Wang; Haitao Meng; Zuoxun Hou; Yi Chang; Gang Chen; | code |
| 769 | Shift The Lens: Environment-Aware Unsupervised Camouflaged Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing work has essentially focused on isolating camouflaged objects from the environment, demonstrating ever-improving performance but at the cost of extensive annotations and complex optimizations. In this paper, we diverge from this paradigm and shift the lens to isolating the salient environment from the camouflaged object. |
Ji Du; Fangwei Hao; Mingyang Yu; Desheng Kong; Jiesheng Wu; Bin Wang; Jing Xu; Ping Li; | code |
| 770 | RCP-Bench: Benchmarking Robustness for Collaborative Perception Under Diverse Corruptions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To improve robustness, we propose two simple yet effective strategies, RCP-Drop and RCP-Mix, based on training regularization and feature augmentation. |
Shihang Du; Sanqing Qu; Tianhang Wang; Xudong Zhang; Yunwei Zhu; Jian Mao; Fan Lu; Qiao Lin; Guang Chen; | code |
| 771 | Closest Neighbors Are Harmful for Lightweight Masked Auto-encoders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our discovery shows that the lightweight model fails to distinguish different local information, leading to aliased understanding and poor accuracy. Motivated by this finding, we propose NoR-MAE, a novel MAE training algorithm for lightweight vision transformers. |
Jian Meng; Ahmed Hasssan; Li Yang; Deliang Fan; Jinwoo Shin; Jae-sun Seo; | code |
| 772 | Improving The Training of Data-Efficient GANs Via Quality Aware Dynamic Discriminator Rejection Sampling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Focusing on sample quality during training, in this paper, we are the first to incorporate discriminator rejection sampling (DRS) into the training process and introduce a novel method, called quality aware dynamic discriminator rejection sampling (QADDRS). |
Zhaoyu Zhang; Yang Hua; Guanxiong Sun; Hui Wang; Seán McLoone; | code |
| 773 | CLIP-driven Coarse-to-fine Semantic Guidance for Fine-grained Open-set Semi-supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To tackle the issues, in this paper, we propose a novel CLIP-driven coarse-to-fine semantic-guided framework, named CFSG-CLIP, to progressively focus on the distinctive fine-grained clues. |
Xiaokun Li; Yaping Huang; Qingji Guan; | code |
| 774 | STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. |
Divya Velayudhan; Abdelfatah Ahmed; Mohamad Alansari; Neha Gour; Abderaouf Behouch; Taimur Hassan; Syed Talal Wasim; Nabil Maalej; Muzammal Naseer; Juergen Gall; Mohammed Bennamoun; Ernesto Damiani; Naoufel Werghi; | code |
| 775 | Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we reveal that dynamic texture plays a key role in rendering high-fidelity talking avatars, and introduce a high-resolution 4D dataset TexTalk4D, consisting of 100 minutes of audio-synced scan-level meshes with detailed 8K dynamic textures from 100 subjects. |
Xuanchen Li; Jianyu Wang; Yuhao Cheng; Yikun Zeng; Xingyu Ren; Wenhan Zhu; Weiming Zhao; Yichao Yan; | code |
| 776 | Object-aware Sound Source Localization Via Audio-Visual Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This limitation arises primarily from their reliance on simple audio-visual correspondence, which does not capture fine-grained semantic differences between sound-making and silent objects. To address these challenges, we propose a novel sound source localization framework leveraging Multimodal Large Language Models (MLLMs) to generate detailed contextual information that explicitly distinguishes between sound-making foreground objects and silent background objects. |
Sung Jin Um; Dongjin Kim; Sangmin Lee; Jung Uk Kim; | code |
| 777 | Hierarchical Flow Diffusion for Efficient Frame Interpolation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose to model bilateral optical flow explicitly with hierarchical diffusion models, which have a much smaller search space in the denoising procedure. |
Yang Hai; Guo Wang; Tan Su; Wenjie Jiang; Yinlin Hu; | code |
| 778 | Charm: The Missing Piece in ViT Fine-Tuning for Image Aesthetic Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Charm, a novel tokenization approach that preserves Composition, High-resolution, Aspect Ratio, and Multi-scale information simultaneously. |
Fatemeh Behrad; Tinne Tuytelaars; Johan Wagemans; | code |
| 779 | SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. |
Kevin Miller; Aditya Gangrade; Samarth Mishra; Kate Saenko; Venkatesh Saligrama; | code |
| 780 | StageDesigner: Artistic Stage Generation for Scenography Via Theater Scripts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce StageDesigner, the first comprehensive framework for artistic stage generation using large language models combined with layout-controlled diffusion models. |
Zhaoxing Gan; Mengtian Li; Ruhua Chen; Zhongxia Ji; Sichen Guo; Huanling Hu; Guangnan Ye; Zuo Hu; | code |
| 781 | Repurposing Stable Diffusion Attention for Training-Free Unsupervised Interactive Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel unsupervised and training-free approach based solely on the self-attention of Stable Diffusion. |
Markus Karmann; Onay Urfalioglu; | code |
| 782 | Robust Message Embedding Via Attention Flow-Based Steganography Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel message embedding framework, called Robust Message Steganography (RMSteg), which is capable of hiding a message via a QR Code in a host image based on a normalizing flow-based model. |
Huayuan Ye; Shenzhuo Zhang; Shiqi Jiang; Jing Liao; Shuhang Gu; Dejun Zheng; Changbo Wang; Chenhui Li; | code |
| 783 | Gaze-LLE: Gaze Target Estimation Via Large-Scale Learned Encoders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. |
Fiona Ryan; Ajay Bati; Sangmin Lee; Daniel Bolya; Judy Hoffman; James M. Rehg; | code |
| 784 | Few-shot Personalized Scanpath Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose few-shot personalized scanpath prediction task (FS-PSP) and a novel method to address it, which aims to predict scanpaths for an unseen subject using minimal support data of that subject’s scanpath behavior. |
Ruoyu Xue; Jingyi Xu; Sounak Mondal; Hieu Le; Greg Zelinsky; Minh Hoai; Dimitris Samaras; | code |
| 785 | GenAssets: Generating In-the-wild 3D Assets in Latent Space Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a 3D latent diffusion model that learns on in-the-wild LiDAR and camera data captured by a sensor platform and generates high-quality 3D assets with complete geometry and appearance. |
Ze Yang; Jingkang Wang; Haowei Zhang; Sivabalan Manivasagam; Yun Chen; Raquel Urtasun; | code |
| 786 | Multi-focal Conditioned Latent Diffusion for Person Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the compression process of LDM often results in the deterioration of details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. |
Jiaqi Liu; Jichao Zhang; Paolo Rota; Nicu Sebe; | code |
| 787 | GIF: Generative Inspiration for Face Recognition at Scale Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by generative modeling, we present a simple yet effective method that substitutes scalar labels with structured identity code, i.e., a sequence of integers. |
Saeed Ebrahimi; Sahar Rahimi; Ali Dabouei; Srinjoy Das; Jeremy M. Dawson; Nasser M. Nasrabadi; | code |
| 788 | Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite CLIP’s proficiency, it suffers from view-dependent predictions and inherent bias, limiting its effectiveness. We propose a novel method that addresses these issues by leveraging multiple views near target objects, guided by Class Activation Mapping (CAM) of the classifier, and debiasing pseudo-labels derived from CLIP predictions. |
Dongseob Kim; Hyunjung Shim; | code |
| 789 | Prior-free 3D Object Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a novel, truly prior-free 3D object tracking method that operates without being given any model or training priors. |
Xiuqiang Song; Li Jin; Zhengxian Zhang; Jiachen Li; Fan Zhong; Guofeng Zhang; Xueying Qin; | code |
| 790 | MAGE: Single Image to Material-Aware 3D Via The Multi-View G-Buffer Estimation Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a novel approach (named MAGE) for generating 3D geometry with realistic decomposed material properties given a single image as input. |
Haoyuan Wang; Zhenwei Wang; Xiaoxiao Long; Cheng Lin; Gerhard Hancke; Rynson W.H. Lau; | code |
| 791 | AVQACL: A Novel Benchmark for Audio-Visual Question Answering Continual Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, a novel benchmark for audio-visual question answering continual learning (AVQACL) is introduced, aiming to study fine-grained scene understanding and spatial-temporal reasoning in videos under a continual learning setting. |
Kaixuan Wu; Xinde Li; Xinling Li; Chuanfei Hu; Guoliang Wu; | code |
| 792 | Exposure-slot: Exposure-centric Representations Learning with Slot-in-Slot Attention for Region-aware Exposure Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: By extending the Slot Attention algorithm with a hierarchical structure, our approach progressively clusters features, enabling precise and region-aware correction. |
Donggoo Jung; Daehyun Kim; Guanghui Wang; Tae Hyun Kim; | code |
| 793 | NN-Former: Rethinking Graph Structure in Neural Architecture Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Thus we propose a novel predictor leveraging the strengths of GNNs and transformers to learn the enhanced topology. |
Ruihan Xu; Haokui Zhang; Yaowei Wang; Wei Zeng; Shiliang Zhang; | code |
| 794 | VolFormer: Explore More Comprehensive Cube Interaction for Hyperspectral Image Restoration and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although uni-dimensional self-attention, like channel self-attention or spatial self-attention, builds long-range dependencies in spectral or spatial dimensions, it lacks more comprehensive interactions across dimensions. To tackle this drawback, we propose VolFormer, a volumetric self-attention embedded Transformer network for single hyperspectral image restoration. |
Dabing Yu; Zheng Gao; | code |
| 795 | PersonaHOI: Effortlessly Improving Face Personalization in Human-Object Interaction Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce PersonaHOI, a training- and tuning-free framework that fuses a general StableDiffusion model with a personalized face diffusion (PFD) model to generate identity-consistent human-object interaction (HOI) images. |
Xinting Hu; Haoran Wang; Jan Eric Lenssen; Bernt Schiele; | code |
| 796 | PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose PSA-SSL, a novel extension to point cloud SSL that learns object pose and size-aware (PSA) features. |
Barza Nisar; Steven L. Waslander; | code |
| 797 | DIFFER: Disentangling Identity Features Via Semantic Cues for Clothes-Changing Person Re-ID Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose DIFFER: Disentangle Identity Features From Entangled Representations, a novel adversarial learning method that leverages textual descriptions to disentangle identity features. |
Xin Liang; Yogesh S Rawat; | code |
| 798 | Directional Label Diffusion Model for Learning from Noisy Labels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To expand the diffusion model into a robust classifier that explicitly accommodates more noise knowledge, we propose a Directional Label Diffusion (DLD) model. |
Senyu Hou; Gaoxia Jiang; Jia Zhang; Shangrong Yang; Husheng Guo; Yaqing Guo; Wenjian Wang; | code |
| 799 | ProHOC: Probabilistic Hierarchical Out-of-Distribution Classification Via Multi-Depth Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a framework for detecting and classifying OOD samples in a given class hierarchy. |
Erik Wallin; Fredrik Kahl; Lars Hammarstrand; | code |
| 800 | Unsupervised Foundation Model-Agnostic Slide-Level Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: By integrating tile embeddings from multiple FMs, we propose a new single modality SSL method in feature space that generates useful slide representations. |
Tim Lenz; Peter Neidlinger; Marta Ligero; Georg Wölflein; Marko van Treeck; Jakob N. Kather; | code |
| 801 | ABC-Former: Auxiliary Bimodal Cross-domain Transformer with Interactive Channel Attention for White Balance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: They either focus solely on global color adjustments applied before the camera-specific image signal processing pipeline or rely on end-to-end models that generate WB outputs without accounting for global color trends, leading to suboptimal correction. To address these limitations, we propose an Auxiliary Bimodal Cross-domain Transformer (ABC-Former) that enhances WB correction by leveraging complementary global color information from CIELab and RGB histograms alongside sRGB inputs. |
Yu-Cheng Chiu; Guan-Rong Chen; Zihao Chen; Yan-Tsung Peng; | code |
| 802 | MirrorVerse: Pushing Diffusion Models to Realistically Reflect The World Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we address the challenge of generating photorealistic mirror reflections using diffusion-based generative models. |
Ankit Dhiman; Manan Shah; R Venkatesh Babu; | code |
| 803 | ShowMak3r: Compositional TV Show Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Many challenges make the reconstruction difficult, including (1) actors occluding each other and having diverse facial expressions, (2) cluttered stages, and (3) small baseline views or sudden shot changes. To address these issues, we present ShowMak3r, a comprehensive reconstruction pipeline that allows the editing of scenes like how video clips are made in a production control room. |
Sangmin Kim; Seunguk Do; Jaesik Park; | code |
| 804 | F^3OCUS – Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy Via Multi-objective Meta-Heuristics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We explore 5 different meta-heuristic algorithms and compare their effectiveness for selecting model layers and adapter layers towards PEFT-FL. |
Pramit Saha; Felix Wagner; Divyanshu Mishra; Can Peng; Anshul Thakur; David A. Clifton; Konstantinos Kamnitsas; J. Alison Noble; | code |
| 805 | Causal Composition Diffusion Model for Closed-loop Traffic Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing generative models suffer from the conflicting objective between user-defined controllability and realism constraints, which is amplified in safety-critical contexts. In this work, we introduce the Causal Compositional Diffusion Model (CCDiff), a structure-guided diffusion framework to address these challenges. |
Haohong Lin; Xin Huang; Tung Phan; David Hayden; Huan Zhang; Ding Zhao; Siddhartha Srinivasa; Eric Wolff; Hongge Chen; | code |
| 806 | ReSpec: Relevance and Specificity Grounded Online Filtering for Learning on Video-Text Data Streams Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a Relevance and Specificity-based online filtering framework (ReSpec) that selects data based on four criteria: (i) modality alignment for clean data, (ii) task relevance for target focused data, (iii) specificity for informative and detailed data, and (iv) efficiency for low-latency processing. |
Chris Dongjoo Kim; Jihwan Moon; Sangwoo Moon; Heeseung Yun; Sihaeng Lee; Aniruddha Kembhavi; Soonyoung Lee; Gunhee Kim; Sangho Lee; Christopher Clark; | code |
| 807 | TopNet: Transformer-Efficient Occupancy Prediction Network for Octree-Structured Point Cloud Geometry Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although octree-based entropy models can reduce BPP without introducing geometry distortion, existing CNN-based models struggle with limited receptive fields to capture long-range dependencies, while Transformer-built architectures always neglect fine-grained details due to their reliance on global self-attention. In this paper, we propose a Transformer-efficient occupancy prediction Network, termed TopNet, to overcome these challenges by developing several novel components: Locally-enhanced Context Encoding (LeCE) for enhancing the translation-invariance of the octree nodes, Adaptive-Length Sliding Window Attention (AL-SWA) for capturing both global and local dependencies while adaptively adjusting attention weights based on the input window length, Spatial-Gated-enhanced Channel Mixer (SG-CM) for efficient feature aggregation from ancestors and siblings, and Latent-guided Node Occupancy Predictor (LNOP) for improving prediction accuracy of spatially adjacent octree nodes. |
Xinjie Wang; Yifan Zhang; Ting Liu; Xinpu Liu; Ke Xu; Jianwei Wan; Yulan Guo; Hanyun Wang; | code |
| 808 | Towards Human-Understandable Multi-Dimensional Concept Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, MCD’s explanations can be difficult for humans to understand, raising concerns about their practical utility. To address this, we propose Human-Understandable Multi-dimensional Concept Discovery (HU-MCD). |
Arne Grobrügge; Niklas Kühl; Gerhard Satzger; Philipp Spitzer; | code |
| 809 | Correcting Deviations from Normality: A Reformulated Diffusion Model for Multi-Class Unsupervised Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a reformulation of the standard diffusion model geared toward selective region alteration, allowing the accurate identification of anomalies. |
Farzad Beizaee; Gregory A. Lodygensky; Christian Desrosiers; Jose Dolz; | code |
| 810 | Efficient Personalization of Quantized Diffusion Model Without Backpropagation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, memory-efficient fine-tuning is particularly desirable for applications such as personalization that often must be run on edge devices like mobile phones with private data. In this work, we address this challenge by quantizing a diffusion model with personalization via Textual Inversion and by leveraging a zeroth-order optimization on personalization tokens without dequantization so that it does not require gradient and activation storage for backpropagation that consumes considerable memory. |
Hoigi Seo; Wongi Jeong; Kyungryeol Lee; Se Young Chun; | code |
| 811 | ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Notably, our analysis shows the importance of incorporating global information in local adapters. Therefore, we subsequently propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) using CLIP as a base learner. |
Yassir Bendou; Amine Ouasfi; Vincent Gripon; Adnane Boukhayma; | code |
| 812 | Towards Effective and Sparse Adversarial Attack on Spiking Neural Networks Via Breaking Invisible Surrogate Gradients Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce an innovative potential-dependent surrogate gradient (PDSG) method to establish a robust connection between the SG and the model, thereby enhancing the adaptability of adversarial attacks across various models with invisible SGs. |
Li Lun; Kunyu Feng; Qinglong Ni; Ling Liang; Yuan Wang; Ying Li; Dunshan Yu; Xiaoxin Cui; | code |
| 813 | A Unified, Resilient, and Explainable Adversarial Patch Detector Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our proposed AdvPatchXAI approach introduces a generalized, robust, and explainable defense algorithm designed to defend DNNs against physical adversarial threats. |
Vishesh Kumar; Akshay Agarwal; | code |
| 814 | Benchmarking Object Detectors Under Real-World Distribution Shifts in Satellite Imagery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we examine the generalisability and robustness of state-of-the-art object detectors under real-world distribution shifts, focusing particularly on spatial domain shifts. |
Sara A. Al-Emadi; Yin Yang; Ferda Ofli; | code |
| 815 | Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Sketchtopia, a large-scale dataset and AI framework designed to explore goal-driven, multimodal communication through asynchronous interactions in a Pictionary-inspired setup. |
Mohd Hozaifa Khan; Ravi Kiran Sarvadevabhatla; | code |
| 816 | Efficient Depth Estimation for Unstable Stereo Camera Systems on AR Glasses Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Based on our approaches, we develop MultiHeadDepth (replacing cost volume) and HomoDepth (MultiHeadDepth + removing pre-processing) models. |
Yongfan Liu; Hyoukjun Kwon; | code |
| 817 | Three-view Focal Length Recovery From Homographies Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel approach for recovering focal lengths from three-view homographies. |
Yaqing Ding; Viktor Kocur; Zuzana Berger Haladova; Qianliang Wu; Shen Cai; Jian Yang; Zuzana Kukelova; | code |
| 818 | Rethinking Decoder Design: Improving Biomarker Segmentation Using Depth-to-Space Restoration and Residual Linear Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This is due to challenges in effectively transferring rich multiscale features from encoders to decoders, as well as limitations in decoder efficiency. To address these issues, we propose an architecture that captures multi-scale local and global contextual information and a novel decoder design, which effectively integrates features from the encoder, emphasizes important channels and regions, and reconstructs spatial dimensions to enhance segmentation accuracy. |
Saad Wazir; Daeyoung Kim; | code |
| 819 | DeepCompress-ViT: Rethinking Model Compression to Enhance Efficiency of Vision Transformers at The Edge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To overcome the limitations of existing works, we rethink model compression strategy for ViTs from a first-principle approach and develop an orthogonal strategy called DeepCompress-ViT. |
Sabbir Ahmed; Abdullah Al Arafat; Deniz Najafi; Akhlak Mahmood; Mamshad Nayeem Rizve; Mohaiminul Al Nahian; Ranyang Zhou; Shaahin Angizi; Adnan Siraj Rakin; | code |
| 820 | Random Conditioning for Diffusion Model Compression with Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose Random Conditioning, a novel approach that pairs noised images with randomly selected text conditions to enable efficient, image-free knowledge distillation. |
Dohyun Kim; Sehwan Park; Geonhee Han; Seung Wook Kim; Paul Hongsuck Seo; | code |
| 821 | Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate compositionality in the image domain, where the analysis of compositional properties is challenged by noise and sparsity of visual data. |
Davide Berasi; Matteo Farina; Massimiliano Mancini; Elisa Ricci; Nicola Strisciuglio; | code |
| 822 | Saliuitl: Ensemble Salience Guided Recovery of Adversarial Patches Against CNNs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Saliuitl, a recovery method independent of the number of patches and their shape, which, unlike prior works, explicitly detects patch attacks before attempting recovery. |
Mauricio Byrd Victorica; György Dán; Henrik Sandberg; | code |
| 823 | HyperNVD: Accelerating Neural Video Decomposition Via Hypernetworks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing video-layer decomposition models rely on implicit neural representations (INRs) trained independently for each video, making the process time-consuming when applied to new videos. Noticing this limitation, we propose a meta-learning strategy to learn a generic video decomposition model to speed up the training on new videos. |
Maria Pilligua; Danna Xue; Javier Vazquez-Corral; | code |
| 824 | OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts. |
Max Gutbrod; David Rauber; Danilo Weber Nunes; Christoph Palm; | code |
| 825 | DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present DecoupledGaussian, a novel system that decouples static objects from their contacted surfaces captured in in-the-wild videos, a key prerequisite for realistic Newtonian-based physical simulations. |
Miaowei Wang; Yibo Zhang; Weiwei Xu; Rui Ma; Changqing Zou; Daniel Morris; | code |
| 826 | Beyond Human Perception: Understanding Multi-Object World from Monocular View Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we construct a large-scale benchmark dataset, MonoMulti3D-ROPE, and propose a model, CyclopsNet that integrates a State-Prompt Visual Encoder (SPVE) module with a Denoising Alignment Fusion (DAF) module to achieve robust multi-modal semantic alignment and fusion. |
Keyu Guo; Yongle Huang; Shijie Sun; Xiangyu Song; Mingtao Feng; Zedong Liu; Huansheng Song; Tiantian Wang; Jianxin Li; Naveed Akhtar; Ajmal Saeed Mian; | code |
| 827 | Advancing Adversarial Robustness in GNeRFs: The IL2-NeRF Attack Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The pioneering work in this area, NeRFool, introduced a state-of-the-art attack that targets GNeRFs by manipulating source views before feature extraction, successfully disrupting the color and density results of the constructed views. Building on this foundation, we propose IL2-NeRF (Iterative L2 NeRF Attack), a novel adversarial attack method that explores a new threat model (in the L2 domain) for attacking GNeRFs. |
Nicole Meng; Caleb Manicke; Ronak Sahu; Caiwen Ding; Yingjie Lao; | code |
| 828 | Mono3DVLT: Monocular-Video-Based 3D Visual Language Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a comprehensive framework, introducing (i) the Monocular-Video-based 3D Visual Language Tracking (Mono3DVLT) task, (ii) a large-scale dataset for the task, called Mono3DVLT-V2X, and (iii) a customized neural model for the task. |
Hongkai Wei; Yang Yang; Shijie Sun; Mingtao Feng; Xiangyu Song; Qi Lei; Hongli Hu; Rong Wang; Huansheng Song; Naveed Akhtar; Ajmal Saeed Mian; | code |
| 829 | LC-Mamba: Local and Continuous Mamba with Shifted Windows for Frame Interpolation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose LC-Mamba, a Mamba-based model that captures fine-grained spatiotemporal information in video frames, addressing limitations in current interpolation methods and enhancing performance. |
Min Wu Jeong; Chae Eun Rhee; | code |
| 830 | Dense-SfM: Structure from Motion with Dense Consistent Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Dense-SfM, a novel Structure from Motion (SfM) framework designed for dense and accurate 3D reconstruction from multi-view images. |
JongMin Lee; Sungjoo Yoo; | code |
| 831 | EAP-GS: Efficient Augmentation of Pointcloud for 3D Gaussian Splatting in Few-shot Scene Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we introduce an Attentional Pointcloud Augmentation (APA) technique, which retains two-view tracks as an option for pointcloud generation. |
Dongrui Dai; Yuxiang Xing; | code |
| 832 | HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present HOTFormerLoc, a novel and versatile Hierarchical Octree-based TransFormer, for large-scale 3D place recognition in both ground-to-ground and ground-to-aerial scenarios across urban and forest environments. |
Ethan Griffiths; Maryam Haghighat; Simon Denman; Clinton Fookes; Milad Ramezani; | code |
| 833 | Illumination Spectrum Estimation for Multispectral Images Via Surface Reflectance Modeling and Spatial-Spectral Feature Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, achieving accurate estimation in MS images remains challenging, as previous studies have struggled with spectral diversity and the inherent entanglement between the illuminant and surface reflectance spectra. To tackle these challenges, in this paper, we propose a novel Illumination spectrum estimation technique for MS images via Surface reflectance modeling and Spatial-spectral feature generation (ISS). |
Hyejin Oh; Woo-Shik Kim; Sangyoon Lee; YungKyung Park; Je-Won Kang; | code |
| 834 | End-to-End Implicit Neural Representations for Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work presents an end-to-end strategy for initializing SIRENs together with a learned learning-rate scheme, to yield representations that improve classification accuracy. |
Alexander Gielisse; Jan van Gemert; | code |
| 835 | Attention IoU: Examining Biases in CelebA Using Attention Maps Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the Attention-IoU (Attention Intersection over Union) metric and related scores, which use attention maps to reveal biases within a model’s internal representations and identify image features potentially causing the biases. |
Aaron Serianni; Tyler Zhu; Olga Russakovsky; Vikram V. Ramaswamy; | code |
| 836 | UMotion: Uncertainty-driven Human Motion Estimation from Inertial and Ultra-wideband Units Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, challenges such as pose ambiguity, data drift, and limited adaptability to diverse bodies persist. To address these issues, we propose UMotion, an uncertainty-driven, online fusing-all state estimation framework for 3D human shape and pose estimation, supported by six integrated, body-worn ultra-wideband (UWB) distance sensors with IMUs. |
Huakun Liu; Hiroki Ota; Xin Wei; Yutaro Hirao; Monica Perusquia-Hernandez; Hideaki Uchiyama; Kiyoshi Kiyokawa; | code |
| 837 | Activating Sparse Part Concepts for 3D Class Incremental Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We evaluate our method on three 3D CIL benchmarks, achieving state-of-the-art performance. |
Zhenya Tian; Jun Xiao; Lupeng Liu; Haiyong Jiang; | code |
| 838 | Advancing Manga Analysis: Comprehensive Segmentation Annotations for The Manga109 Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: There exists a significant domain gap between manga and natural images, which causes most existing learning-based methods to fail. To address this gap, we introduce an augmented segmentation annotation for the Manga109 dataset, a collection of 109 manga volumes that offers intricate artworks in a rich variety of styles. |
Minshan Xie; Jian Lin; Hanyuan Liu; Chengze Li; Tien-Tsin Wong; | code |
| 839 | Chebyshev Attention Depth Permutation Texture Network with Latent Texture Attribute Loss Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, many rely on loss formulations that prioritize recognition accuracy while overlooking spatial coherence and statistical consistency in the feature space. To address these issues, we propose three key innovations: Stochastic Local Texture Masking (SLTM), a regularization strategy that randomly occludes small texture patches to promote the learning of broader spatial and contextual dependencies; the Chebyshev Attention Depth Permutation Texture Network (CAPTN), a novel architecture that learns expressive and persistent Latent Texture Attribute (LTA) representations. |
Ravishankar Evani; Deepu Rajan; Shangbo Mao; | code |
| 840 | HistoFS: Non-IID Histopathologic Whole Slide Image Classification Via Federated Style Transfer with RoI-Preserving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: (2) Performing style transfer may potentially shift the region of interests (RoIs) in the augmented WSIs. To address these challenges, we propose HistoFS, a federated learning framework for computational pathology on non-i.i.d. feature shifts in WSI classification. |
Farchan Hakim Raswa; Chun-Shien Lu; Jia-Ching Wang; | code |
| 841 | Automatic Spectral Calibration of Hyperspectral Images: Method, Dataset and Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The traditional way to calibrate HSI utilizes a physical reference, which involves manual operations, occlusions, and/or limits camera mobility. These limitations inspire this paper to automatically calibrate HSIs using a learning-based method. |
Zhuoran Du; Shaodi You; Cheng Cheng; Shikui Wei; | code |
| 842 | PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the diversity of skeleton types and limited cross-dataset supervision complicate integration in pose estimation. To address these challenges, we introduce PoseBH, a new MDT framework that tackles keypoint heterogeneity and limited supervision through two key techniques. |
Uyoung Jeong; Jonathan Freer; Seungryul Baek; Hyung Jin Chang; Kwang In Kim; | code |
| 843 | GroupMamba: Efficient Group-Based Visual State Space Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a parameter-efficient modulated group mamba layer that divides the input channels into four groups and applies our proposed SSM-based efficient Visual Single Selective Scanning (VSSS) block independently to each group, with each VSSS block scanning in one of the four spatial directions. |
Abdelrahman Shaker; Syed Talal Wasim; Salman Khan; Juergen Gall; Fahad Shahbaz Khan; | code |
| 844 | A Hubness Perspective on Representation Learning for Graph-Based Multi-View Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose a simple yet effective encoder that reduces hubness while preserving neighborhood topology within each view. |
Zheming Xu; He Liu; Congyan Lang; Tao Wang; Yidong Li; Michael C. Kampffmeyer; | code |