Paper Digest: ICCV 2025 Papers & Highlights
Note: ICCV-2025 accepted more than 2,700 papers; this page includes only 500 of them, selected by our daily paper digest algorithm. Interested users can read all 2,700 ICCV-2025 papers on a separate page.
To search for papers presented at ICCV-2025 on a specific topic, please use the search by venue (ICCV-2025) service. To summarize the latest research published at ICCV-2025 on a specific topic, you can use the review by venue (ICCV-2025) service. If you are interested in browsing papers by author, we have a comprehensive list of ~11,000 authors (ICCV-2025). Using data from 2023 and 2025, our system also generates a report on computer vision trends. Additionally, you may want to explore our “Best Paper” Digest (ICCV), which lists the most influential ICCV papers since 1988.
As a pioneer in the field since 2018, Paper Digest has curated thousands of such lists, drawing on years of accumulated data across decades of conferences and research topics. To ensure you never miss a breakthrough, our daily service sifts through tens of thousands of new papers, clinical trials, news articles, and community posts every day, delivering only what matters most to your specific interests. Beyond discovery, Paper Digest offers built-in research tools to help users read articles, write articles, get answers, conduct literature reviews, and generate research reports more efficiently.
Paper Digest Team
New York City, New York, 10017
TABLE 1: Paper Digest: ICCV 2025 Papers & Highlights
| # | Paper | Author(s) |
|---|---|---|
| 1 | MetaMorph: Multimodal Understanding and Generation Via Instruction Tuning. Highlight: In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. | Shengbang Tong; David Fan; Jiachen Li; Yunyang Xiong; Xinlei Chen; Koustuv Sinha; Michael Rabbat; Yann LeCun; Saining Xie; Zhuang Liu; |
| 2 | CoTracker3: Simpler and Better Point Tracking By Pseudo-Labelling Real Videos. Highlight: We introduce CoTracker3, a new state-of-the-art point tracker. | Nikita Karaev; Yuri Makarov; Jianyuan Wang; Natalia Neverova; Andrea Vedaldi; Christian Rupprecht; |
| 3 | VACE: All-in-One Video Creation and Editing. Highlight: We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. | Zeyinzi Jiang; Zhen Han; Chaojie Mao; Jingfeng Zhang; Yulin Pan; Yu Liu; |
| 4 | CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy. Highlight: To this end, we introduce CC-OCR, a comprehensive benchmark that possesses a diverse range of scenarios, tasks, and challenges. | Zhibo Yang; Jun Tang; Zhaohai Li; Pengfei Wang; Jianqiang Wan; Humen Zhong; Xuejing Liu; Mingkun Yang; Peng Wang; Shuai Bai; Lianwen Jin; Junyang Lin; |
| 5 | MIEB: Massive Image Embedding Benchmark. Highlight: We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. | Chenghao Xiao; Isaac Chung; Imene Kerboua; Jamie Stirling; Xin Zhang; Márton Kardos; Roman Solomatin; Noura Al Moubayed; Kenneth Enevoldsen; Niklas Muennighoff; |
| 6 | Learning 4D Embodied World Models. Highlight: This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent’s actions, providing both spatial and temporal consistency. | Haoyu Zhen; Qiao Sun; Hongxin Zhang; Junyan Li; Siyuan Zhou; Yilun Du; Chuang Gan; |
| 7 | Scaling Language-Free Visual Representation Learning. Highlight: In this work, we ask the question: "Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?" | David Fan; Shengbang Tong; Jiachen Zhu; Koustuv Sinha; Zhuang Liu; Xinlei Chen; Michael Rabbat; Nicolas Ballas; Yann LeCun; Amir Bar; Saining Xie; |
| 8 | Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding. Highlight: We introduce the Video Turing Test (Video-TT), a benchmark designed to assess whether video LLMs can interpret real-world videos as effectively as humans. Video-TT 1) differentiates between errors due to inadequate frame sampling and genuine gaps in understanding complex visual narratives, and 2) evaluates robustness against natural adversarial questions. | Yuanhan Zhang; Yunice Chew; Yuhao Dong; Aria Leo; Bo Hu; Ziwei Liu; |
| 9 | Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats. Highlight: We propose Long-LRM, a feed-forward 3D Gaussian reconstruction model for instant, high-resolution, 360° wide-coverage, scene-level reconstruction. | Chen Ziwen; Hao Tan; Kai Zhang; Sai Bi; Fujun Luan; Yicong Hong; Li Fuxin; Zexiang Xu; |
| 10 | Temporal-aware Query Routing for Real-time Video Instance Segmentation. Highlight: Further analysis of the similarities between the outputs from adjacent frames at each transformer decoder layer reveals significant redundant computations within the transformer decoder. To address this issue, we introduce the Temporal-Aware query Routing (TAR) mechanism. | Zesen Cheng; Kehan Li; Yian Zhao; Hang Zhang; Chang Liu; Jie Chen; |
| 11 | Scaling Laws for Native Multimodal Models. Highlight: In this work, we revisit the architectural design of native multimodal models (NMMs), those trained from the ground up on all modalities, and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. | Mustafa Shukor; Enrico Fini; Victor Guilherme Turrisi da Costa; Matthieu Cord; Joshua Susskind; Alaaeldin El-Nouby; |
| 12 | Magic Insert: Style-Aware Drag-and-Drop. Highlight: We present Magic Insert, a method to drag-and-drop subjects from a user-provided image into a target image of a different style in a plausible manner while matching the style of the target image. | Nataniel Ruiz; Yuanzhen Li; Neal Wadhwa; Yael Pritch; Michael Rubinstein; David E. Jacobs; Shlomi Fruchter; |
| 13 | RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints. Highlight: To this end, we propose the concept of compositional constraints for embodied multi-agent systems, addressing the challenges arising from collaboration among embodied agents. | Yiran Qin; Li Kang; Xiufeng Song; Zhenfei Yin; Xiaohong Liu; Xihui Liu; Ruimao Zhang; Lei Bai; |
| 14 | SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation. Highlight: This paper presents SANA-Sprint, an efficient diffusion model for ultra-fast text-to-image (T2I) generation. | Junsong Chen; Shuchen Xue; Yuyang Zhao; Jincheng Yu; Sayak Paul; Junyu Chen; Han Cai; Song Han; Enze Xie; |
| 15 | UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation. Highlight: In this work, we present UK Biobank Organs and Bones (UKBOB), the largest labeled dataset of body organs, comprising 51,761 3D MRI samples (17.9M 2D images) and a total of more than 1.37 billion 2D segmentation masks of 72 organs, based on the UK Biobank MRI dataset. | Emmanuelle Bourigault; Amir Jamaludin; Abdullah Hamdi; |
| 16 | YOLOE: Real-Time Seeing Anything. Highlight: In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. | Ao Wang; Lihao Liu; Hui Chen; Zijia Lin; Jungong Han; Guiguang Ding; |
| 17 | Cycle-Consistent Learning for Joint Layout-to-Image Generation and Object Detection. Highlight: In this paper, we propose a generation-detection cycle consistent (GDCC) learning framework that jointly optimizes both layout-to-image (L2I) generation and object detection (OD) tasks in an end-to-end manner. | Xinhao Cai; Qiuxia Lai; Gensheng Pei; Xiangbo Shu; Yazhou Yao; Wenguan Wang; |
| 18 | Stable Virtual Camera: Generative View Synthesis with Diffusion Models. Highlight: We present Stable Virtual Camera (Seva), a generalist diffusion model that creates novel views of a scene, given any number of input views and target cameras. Existing works struggle to generate either large viewpoint changes or temporally smooth samples, while relying on specific task configurations. Our approach overcomes these limitations through a simple model design, an optimized training recipe, and a flexible sampling strategy that generalizes across view synthesis tasks at test time. As a result, our samples maintain high consistency without requiring additional 3D representation-based distillation, thus streamlining view synthesis in the wild. Furthermore, we show that our method can generate high-quality videos lasting up to half a minute with seamless loop closure. Extensive benchmarking demonstrates that Seva outperforms existing methods across different datasets and settings. | Jensen Zhou; Hang Gao; Vikram Voleti; Aaryaman Vasishta; Chun-Han Yao; Mark Boss; Philip Torr; Christian Rupprecht; Varun Jampani; |
| 19 | REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers. Highlight: In this paper, we tackle a fundamental question: "Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?" | Xingjian Leng; Jaskirat Singh; Yunzhong Hou; Zhenchang Xing; Saining Xie; Liang Zheng; |
| 20 | Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers. Highlight: In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. | Weiming Ren; Wentao Ma; Huan Yang; Cong Wei; Ge Zhang; Wenhu Chen; |
| 21 | EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images. Highlight: However, its training is heavily reliant on high-quality images and precise camera poses. Meeting these criteria can be challenging in non-ideal real-world conditions, where motion-blurred images frequently occur due to high-speed camera movements. To address these challenges, we introduce Event Stream Assisted Gaussian Splatting (EvaGaussians), a novel approach that harnesses event streams captured by event cameras to facilitate the learning of high-quality 3D-GS from motion-blurred images. | Wangbo Yu; Chaoran Feng; Jianing Li; Jiye Tang; Jiashu Yang; Zhenyu Tang; Meng Cao; Xu Jia; Yuchao Yang; Li Yuan; Yonghong Tian; |
| 22 | Bolt3D: Generating 3D Scenes in Seconds. Highlight: We present a latent diffusion model for fast feed-forward 3D scene generation. | Stanislaw Szymanowicz; Jason Y. Zhang; Pratul Srinivasan; Ruiqi Gao; Arthur Brussee; Aleksander Holynski; Ricardo Martin-Brualla; Jonathan T. Barron; Philipp Henzler; |
| 23 | StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation. Highlight: We introduce StreamDiffusion, a real-time diffusion pipeline designed for streaming image generation. | Akio Kodaira; Chenfeng Xu; Toshiki Hazama; Takanori Yoshimoto; Kohei Ohno; Shogo Mitsuhori; Soichi Sugano; Hanying Cho; Zhijian Liu; Masayoshi Tomizuka; Kurt Keutzer; |
| 24 | Hi3DGen: High-fidelity 3D Geometry Generation from Images Via Normal Bridging. Highlight: With the growing demand for high-fidelity 3D models from 2D images, existing methods still face significant challenges in accurately reproducing fine-grained geometric details, due to domain gaps and inherent ambiguities in RGB images. To address these issues, we propose Hi3DGen, a novel framework for generating high-fidelity 3D geometry from images via normal bridging. | Chongjie Ye; Yushuang Wu; Ziteng Lu; Jiahao Chang; Xiaoyang Guo; Jiaqing Zhou; Hao Zhao; Xiaoguang Han; |
| 25 | MRGen: Segmentation Data Engine For Underrepresented MRI Modalities. Highlight: Concretely, our contributions are threefold: (i) we introduce MRGen-DB, a large-scale radiology image-text dataset comprising extensive samples with rich metadata, including modality labels, attributes, regions, and organs information, with a subset featuring pixel-wise mask annotations; (ii) we present MRGen, a diffusion-based data engine for controllable medical image synthesis, conditioned on text prompts and segmentation masks. | Haoning Wu; Ziheng Zhao; Ya Zhang; Yanfeng Wang; Weidi Xie; |
| 26 | Medical World Model. Highlight: Providing effective treatment and making informed decisions are essential goals of modern medicine and clinical care. We are interested in simulating disease dynamics for clinical decision-making, leveraging recent advances in large generative models. To this end, we introduce the Medical World Model (MeWM), the first world model in medicine that predicts future disease states based on clinical decisions. | Yijun Yang; Zhao-Yang Wang; Qiuping Liu; Shuwen Sun; Kang Wang; Rama Chellappa; Zongwei Zhou; Alan Yuille; Lei Zhu; Yu-Dong Zhang; Jieneng Chen; |
| 27 | ViLLa: Video Reasoning Segmentation with Large Language Model. Highlight: However, they struggle to discriminate and deduce the objects from user queries in more realistic scenes featuring long durations, multiple objects, rapid motion, and heavy occlusions. In this work, we analyze the underlying causes of these limitations, and present **ViLLa**: **Vi**deo reasoning segmentation with **L**arge **La**nguage Model. | Rongkun Zheng; Lu Qi; Xi Chen; Yi Wang; Kun Wang; Hengshuang Zhao; |
| 28 | DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer. Highlight: We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. | Yecheng Wu; Han Cai; Junyu Chen; Zhuoyang Zhang; Enze Xie; Jincheng Yu; Junsong Chen; Jinyi Hu; Yao Lu; Song Han; |
| 29 | DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation Via Dictionary Lookup. Highlight: In this paper, we propose a novel framework, namely DictAS, which enables a unified model to detect visual anomalies in unseen object categories without any retraining on the target data, only employing a few normal reference images as visual prompts. | Zhen Qu; Xian Tao; Xinyi Gong; ShiChen Qu; Xiaopei Zhang; Xingang Wang; Fei Shen; Zhengtao Zhang; Mukesh Prasad; Guiguang Ding; |
| 30 | VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation. Highlight: In this paper, we introduce VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. | Shoubin Yu; Difan Liu; Ziqiao Ma; Yicong Hong; Yang Zhou; Hao Tan; Joyce Chai; Mohit Bansal; |
| 31 | StochasticSplats: Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting. Highlight: For example, and counter-intuitively, rendering a lower-resolution image is not necessarily faster. In this work, we address the above limitations by combining 3D Gaussian splatting with stochastic rasterization. | Shakiba Kheradmand; Delio Vicini; George Kopanas; Dmitry Lagun; Kwang Moo Yi; Mark Matthews; Andrea Tagliasacchi; |
| 32 | MOVE: Motion-Guided Few-Shot Video Object Segmentation. Highlight: Existing FSVOS datasets and methods typically focus on object categories, which are static attributes that ignore the rich temporal dynamics in videos, limiting their application in scenarios requiring motion understanding. To fill this gap, we introduce MOVE, a large-scale dataset specifically designed for motion-guided FSVOS. | Kaining Ying; Hengrui Hu; Henghui Ding; |
| 33 | SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation. Highlight: We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. | Chun-Han Yao; Yiming Xie; Vikram Voleti; Huaizu Jiang; Varun Jampani; |
| 34 | CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization. Highlight: Moreover, CL-Splats supports storing and recovering previous scene states, facilitating temporal segmentation and new scene-analysis applications. Our extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art. This establishes a robust foundation for future real-time adaptation in 3D scene reconstruction tasks. We will release our source code and the synthetic and real-world datasets we created to support further research in this area. | Jan Ackermann; Jonas Kulhanek; Shengqu Cai; Haofei Xu; Marc Pollefeys; Gordon Wetzstein; Leonidas J. Guibas; Songyou Peng; |
| 35 | Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation. Highlight: Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information and deeply understanding and reasoning about audio-visual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose Omnimodal Referring Audio-Visual Segmentation (OmniAVS), a new dataset containing 2,104 videos and 61,095 multimodal referring expressions. | Kaining Ying; Henghui Ding; Guangquan Jie; Yu-Gang Jiang; |
| 36 | From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models Via Reflection Tuning. Highlight: Inspired by the self-reflection capabilities emergent in large language models, we propose ReflectionFlow, an inference-time framework enabling diffusion models to iteratively reflect upon and refine their outputs. | Le Zhuo; Liangbing Zhao; Sayak Paul; Yue Liao; Renrui Zhang; Yi Xin; Peng Gao; Mohamed Elhoseiny; Hongsheng Li; |
| 37 | StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth. Highlight: We introduce StableDepth, a scene-consistent and scale-invariant depth estimation method achieving scene-level 3D consistency. | Zheng Zhang; Lihe Yang; Tianyu Yang; Chaohui Yu; Xiaoyang Guo; Yixing Lao; Hengshuang Zhao; |
| 38 | ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling. Highlight: This approach introduces problematic dependencies between internal skeleton and outer soft tissue, limiting direct control over body height and bone lengths. To address these issues, we present ATLAS, a high-fidelity body model learned from 600k high-resolution scans captured using 240 synchronized cameras. | Jinhyung Park; Javier Romero; Shunsuke Saito; Fabian Prada; Takaaki Shiratori; Yichen Xu; Federica Bogo; Shoou-I Yu; Kris Kitani; Rawal Khirodkar; |
| 39 | DiffDoctor: Diagnosing Image Diffusion Models Before Treating. Highlight: In this work, we believe problem-solving starts with identification, requiring that the model be aware not just of the presence of defects in an image, but of their specific locations. | Yiyang Wang; Xi Chen; Xiaogang Xu; Sihui Ji; Yu Liu; Yujun Shen; Hengshuang Zhao; |
| 40 | LLaVA-3D: A Simple Yet Effective Pathway to Empowering LMMs with 3D Capabilities. Highlight: In this paper, we introduce a simple yet effective framework called LLaVA-3D. | Chenming Zhu; Tai Wang; Wenwei Zhang; Jiangmiao Pang; Xihui Liu; |
| 41 | OminiControl: Minimal and Universal Control for Diffusion Transformer. Highlight: We present OminiControl, a novel approach that rethinks how image conditions are integrated into Diffusion Transformer (DiT) architectures. | Zhenxiong Tan; Songhua Liu; Xingyi Yang; Qiaochu Xue; Xinchao Wang; |
| 42 | DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs. Highlight: Drawing inspiration from resampler structures, we introduce DisCo, a novel visual encapsulation method designed to yield semantically distinct and temporally coherent visual tokens for video MLLMs. | Jiahe Zhao; Rongkun Zheng; Yi Wang; Helin Wang; Hengshuang Zhao; |
| 43 | FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction. Highlight: We introduce FreeSplatter, a scalable feed-forward framework that generates high-quality 3D Gaussians from uncalibrated sparse-view images while estimating camera parameters within seconds. | Jiale Xu; Shenghua Gao; Ying Shan; |
| 44 | LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion Via Distillation to Learnable Look-Up Tables. Highlight: In this paper, we propose a novel approach towards extremely fast fusion via distillation to learnable look-up tables specifically designed for image fusion, termed LUT-Fuse. | Xunpeng Yi; Yibing Zhang; Xinyu Xiang; Qinglong Yan; Han Xu; Jiayi Ma; |
| 45 | ARGUS: Hallucination and Omission Evaluation in Video-LLMs. Highlight: Unfortunately, VideoLLMs hallucinate far more aggressively on freeform text generation tasks like video captioning than they do on multiple choice verification tasks. To address this weakness, we propose ARGUS, a VideoLLM benchmark that measures freeform video captioning performance. | Ruchit Rawal; Reza Shirkavand; Heng Huang; Gowthami Somepalli; Tom Goldstein; |
| 46 | VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models. Highlight: However, existing benchmarks for vision-language RMs (VLRMs) typically assess only a single aspect of their capabilities (e.g., distinguishing between two answers), thus limiting all-round evaluation and restricting the development of RMs in the vision-language domain. To address this gap, we propose a comprehensive and challenging benchmark, dubbed VLRMBench, encompassing 12,634 questions. | Jiacheng Ruan; Wenzhen Yuan; Xian Gao; Ye Guo; Daoxin Zhang; Zhe Xu; Yao Hu; Ting Liu; Yuzhuo Fu; |
| 47 | Pinco: Position-induced Consistent Adapter for Diffusion Transformer in Foreground-conditioned Inpainting. Highlight: While existing T2I-based image inpainting methods can be applied to this task, they suffer from issues of subject shape expansion, distortion, or impaired ability to align with the text description, resulting in inconsistencies between the visual elements and the text description. To address these challenges, we propose Pinco, a plug-and-play foreground-conditioned inpainting adapter that generates high-quality backgrounds with good text alignment while effectively preserving the shape of the foreground subject. | Guangben Lu; Yuzhen Du; Yizhe Tang; Zhimin Sun; Ran Yi; Yifan Qi; Tianyi Wang; Lizhuang Ma; Fangyuan Zou; |
| 48 | Randomized Autoregressive Visual Generation. Highlight: This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. | Qihang Yu; Ju He; Xueqing Deng; Xiaohui Shen; Liang-Chieh Chen; |
| 49 | MonoFusion: Sparse-View 4D Reconstruction Via Monocular Fusion. Highlight: In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). | Zihan Wang; Jeff Tan; Tarasha Khurana; Neehar Peri; Deva Ramanan; |
| 50 | AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction. Highlight: In this work, we propose AnimeGamer, which is built upon Multimodal Large Language Models (MLLMs) to generate each game state, including dynamic animation shots that depict character movements and updates to character states, as illustrated in Figure 1. | Junhao Cheng; Yuying Ge; Yixiao Ge; Jing Liao; Ying Shan; |
| 51 | PUMA: Empowering Unified MLLM with Multi-granular Visual Generation. Highlight: In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation, a novel paradigm that tackles the diversity-controllability trade-off. | Rongyao Fang; Chengqi Duan; Kun Wang; Hao Li; Linjiang Huang; Hao Tian; Xingyu Zeng; Rui Zhao; Jifeng Dai; Hongsheng Li; Xihui Liu; |
| 52 | VideoVAE+: Large Motion Video Autoencoding with Cross-modal Video VAE. Highlight: In this paper, we present a powerful video VAE named VideoVAE+ that effectively reconstructs videos with large motion. | Yazhou Xing; Yang Fei; Yingqing He; Jingye Chen; Jiaxin Xie; Xiaowei Chi; Qifeng Chen; |
| 53 | Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers. Highlight: We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. | Zhengyao Lv; Tianlin Pan; Chenyang Si; Zhaoxi Chen; Wangmeng Zuo; Ziwei Liu; Kwan-Yee K. Wong; |
| 54 | GenHancer: Imperfect Generative Models Are Secretly Strong Vision-Centric Enhancers. Highlight: In this work, we empirically found that visually perfect generations are not always optimal for representation enhancement. | Shijie Ma; Yuying Ge; Teng Wang; Yuxin Guo; Yixiao Ge; Ying Shan; |
| 55 | St4RTrack: Simultaneous 4D Reconstruction and Tracking in The World. Highlight: We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. | Haiwen Feng; Junyi Zhang; Qianqian Wang; Yufei Ye; Pengcheng Yu; Michael J. Black; Trevor Darrell; Angjoo Kanazawa; |
| 56 | Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation. Highlight: In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a k×k grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. | Sucheng Ren; Qihang Yu; Ju He; Xiaohui Shen; Alan Yuille; Liang-Chieh Chen; |
| 57 | SViM3D: Stable Video Material Diffusion for Single Image 3D Generation. Highlight: We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. | Andreas Engelhardt; Mark Boss; Vikram Voleti; Chun-Han Yao; Hendrik P. A. Lensch; Varun Jampani; |
| 58 | Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians. Highlight: In this paper, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding, which effectively captures both semantic and spatial information of the inputs. | Quankai Gao; Iliyan Georgiev; Tuanfeng Y. Wang; Krishna Kumar Singh; Ulrich Neumann; Jae Shin Yoon; |
| 59 | Shape of Motion: 4D Reconstruction from A Single Video Highlight: We introduce a method for reconstructing generic dynamic scenes, featuring explicit, persistent 3D motion trajectories in the world coordinate frame, from casually captured monocular videos. |
Qianqian Wang; Vickie Ye; Hang Gao; Weijia Zeng; Jake Austin; Zhengqi Li; Angjoo Kanazawa; |
| 60 | MaskControl: Spatio-Temporal Control for Masked Motion Synthesis Highlight: However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. |
Ekkasit Pinyoanuntapong; Muhammad Saleem; Korrawe Karunratanakul; Pu Wang; Hongfei Xue; Chen Chen; Chuan Guo; Junli Cao; Jian Ren; Sergey Tulyakov; |
| 61 | PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos Highlight: In this paper, we present PhysTwin, a novel framework that uses sparse videos of dynamic objects in interaction to produce a photo- and physically realistic, real-time interactive virtual replica. Our approach centers on two key components: (1) a physics-informed representation that combines spring-mass models for realistic physical simulation, generative shape models for geometry, and Gaussian splats for rendering, and (2) a novel multi-stage optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. |
Hanxiao Jiang; Hao-Yu Hsu; Kaifeng Zhang; Hsin-Ni Yu; Shenlong Wang; Yunzhu Li; |
| 62 | Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding Highlight: We introduce Predict-Optimize-Distill (POD), a self-improving framework that interleaves prediction and optimization in a mutually reinforcing cycle to achieve better 4D object understanding with increasing observation time. |
Mingxuan Wu; Huang Huang; Justin Kerr; Chung Min Kim; Anthony Zhang; Brent Yi; Angjoo Kanazawa; |
| 63 | MedSegFactory: Text-Guided Generation of Medical Image-Mask Pairs Highlight: This paper presents **MedSegFactory**, a versatile medical synthesis framework that generates high-quality paired medical images and segmentation masks across modalities and tasks. |
Jiawei Mao; Yuhan Wang; Yucheng Tang; Daguang Xu; Kang Wang; Yang Yang; Zongwei Zhou; Yuyin Zhou; |
| 64 | FlowChef: Steering of Rectified Flow Models for Controlled Generations Highlight: In this paper, we present FlowChef, a novel training-, inversion-, and gradient-free inference-time steering strategy for RFMs that deterministically guides the denoising process. |
Maitreya Patel; Song Wen; Dimitris N. Metaxas; Yezhou Yang; |
| 65 | Beyond Simple Edits: Composed Video Retrieval with Dense Modifications Highlight: Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text, around seven times more than its existing counterpart. |
Omkar Thawakar; Dmitry Demidov; Ritesh Thawkar; Rao Muhammad Anwer; Mubarak Shah; Fahad Shahbaz Khan; Salman Khan; |
| 66 | MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI Highlight: Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce **MMReason**, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). |
Huanjin Yao; Jiaxing Huang; Yawen Qiu; Michael K. Chen; Wenzheng Liu; Wei Zhang; Wenjie Zeng; Xikun Zhang; Jingyi Zhang; YuXin Song; Wenhao Wu; Dacheng Tao; |
| 67 | The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation Highlight: This gap between training and testing leads to subpar performance. To bridge this gap, we propose conditional optimal transport (C2OT), which adds a conditional weighting term to the cost matrix when computing the optimal transport assignment. |
Ho Kei Cheng; Alexander Schwing; |
| 68 | OmniHuman-1: Rethinking The Scaling-Up of One-Stage Conditioned Human Animation Models Highlight: In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. |
Gaojie Lin; Jianwen Jiang; Jiaqi Yang; Zerong Zheng; Chao Liang; Yuan Zhang; Jingtuo Liu; |
| 69 | FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases Highlight: We present FlowSeek, a novel framework for optical flow requiring minimal hardware resources for training. |
Matteo Poggi; Fabio Tosi; |
| 70 | GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors Highlight: We propose a novel point map Variational Autoencoder (VAE) for encoding and decoding unbounded point maps. |
Tian-Xing Xu; Xiangjun Gao; Wenbo Hu; Xiaoyu Li; Song-Hai Zhang; Ying Shan; |
| 71 | TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos Via Diffusion Models Highlight: We present TrajectoryCrafter, a novel approach to redirect camera trajectories for monocular videos. |
Mark Yu; Wenbo Hu; Jinbo Xing; Ying Shan; |
| 72 | GWM: Towards Scalable Gaussian World Models for Robotic Manipulation Highlight: To this end, we propose a novel branch of world models, the Gaussian World Model (GWM), for robotic manipulation, which reconstructs the future state by inferring the propagation of Gaussian primitives under the effect of robot actions. |
Guanxing Lu; Baoxiong Jia; Puhao Li; Yixin Chen; Ziwei Wang; Yansong Tang; Siyuan Huang; |
| 73 | UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing Highlight: In this paper, we introduce UniVG, a generalist diffusion model capable of supporting a diverse range of image generation tasks with a single set of weights. |
Tsu-Jui Fu; Yusu Qian; Chen Chen; Wenze Hu; Zhe Gan; Yinfei Yang; |
| 74 | Auto-Controlled Image Perception in MLLMs Via Visual Perception Tokens Highlight: For example, they cannot selectively re-encode specific regions of an image or focus on information related to specific object categories. In this work, we propose the concept of Visual Perception Token, aiming to empower MLLM with a mechanism to control its visual perception processes. |
Runpeng Yu; Xinyin Ma; Xinchao Wang; |
| 75 | ZipVL: Accelerating Vision-Language Models Through Dynamic Token Sparsity Highlight: In this paper, we present ZipVL, an efficient inference framework designed for LVLMs through a dynamic ratio allocation strategy of important tokens. |
Yefei He; Feng Chen; Jing Liu; Wenqi Shao; Hong Zhou; Kaipeng Zhang; Bohan Zhuang; |
| 76 | Neighboring Autoregressive Modeling for Efficient Visual Generation Highlight: In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far "next-neighbor prediction" mechanism. |
Yefei He; Yuanyu He; Shaoxuan He; Feng Chen; Hong Zhou; Kaipeng Zhang; Bohan Zhuang; |
| 77 | GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding Highlight: However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. |
Rui Hu; Lianghui Zhu; Yuxuan Zhang; Tianheng Cheng; Lei Liu; Heng Liu; Longjin Ran; Xiaoxin Chen; Wenyu Liu; Xinggang Wang; |
| 78 | Adaptive Caching for Faster Video Generation with Diffusion Transformers Highlight: In this paper, we introduce a method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that ‘not all videos are created equal’: meaning, some videos require fewer denoising steps to attain a reasonable quality than others. |
Kumara Kahatapitiya; Haozhe Liu; Sen He; Ding Liu; Menglin Jia; Chenyang Zhang; Michael S. Ryoo; Tian Xie; |
| 79 | ReCamMaster: Camera-Controlled Generative Rendering from A Single Video Highlight: It is non-trivial due to the extra constraints of maintaining multiple-frame appearance and dynamic synchronization. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. |
Jianhong Bai; Menghan Xia; Xiao Fu; Xintao Wang; Lianrui Mu; Jinwen Cao; Zuozhu Liu; Haoji Hu; Xiang Bai; Pengfei Wan; Di Zhang; |
| 80 | Flow4Agent: Long-form Video Understanding Via Motion Prior from Optical Flow Highlight: In this paper, we propose Flow4Agent, a novel framework that is the first to incorporate motion priors from optical flow to facilitate LLM-based long video understanding. |
Ruyang Liu; Shangkun Sun; Haoran Tang; Wei Gao; Ge Li; |
| 81 | Leveraging Panoptic Scene Graph for Evaluating Fine-Grained Text-to-Image Generation Highlight: Human assessments are costly, and existing automated metrics lack accurate compositional understanding. To address these limitations, we introduce PSG-Bench, a novel benchmark featuring 5K text prompts designed to evaluate the capabilities of advanced T2I models. |
Xueqing Deng; Linjie Yang; Qihang Yu; Chenglin Yang; Liang-Chieh Chen; |
| 82 | Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition Highlight: We introduce Lyra, an efficient MLLM that enhances multi-modal abilities, including advanced long speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. |
Zhisheng Zhong; Chengyao Wang; Yuqi Liu; Senqiao Yang; Longxiang Tang; Yuechen Zhang; Jingyao Li; Tianyuan Qu; Yanwei Li; Yukang Chen; Shaozuo Yu; Sitong Wu; Eric Lo; Shu Liu; Jiaya Jia; |
| 83 | Rethinking DPO-style Diffusion Aligning Frameworks Highlight: However, we identify two potential risks for existing DPO algorithms: First, current DPO methods for estimating the rewards of step-wise intermediate samples are biased, leading to inaccurate preference ordering for step-wise optimization. Second, existing DPO methods may inadvertently increase the sampling probabilities of dispreferred samples, potentially introducing application risks. To address these issues, we propose Revised Direct Preference Optimization (RDPO), a simple but effective step-wise DPO-based text-to-image diffusion model alignment method. |
Xun Wu; Shaohan Huang; Lingjie Jiang; Furu Wei; |
| 84 | FaceCraft4D: Animated 3D Facial Avatar Generation from A Single Image Highlight: We present a novel framework for generating a high-quality, animatable 4D avatar from a single image. |
Fei Yin; Mallikarjun B R; Chun-Han Yao; Rafal K. Mantiuk; Varun Jampani; |
| 85 | MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling Highlight: In this work, we present MaTVLM, a method for distilling pre-trained vision-language models (VLMs) into an efficient Mamba-Transformer hybrid architecture. |
Yingyue Li; Bencheng Liao; Wenyu Liu; Xinggang Wang; |
| 86 | Are VLMs Ready for Autonomous Driving? An Empirical Study from The Reliability, Data and Metric Perspectives Highlight: Given the challenges and inspired by the inherent corruption awareness, we propose Robust Agentic Utilization (RAU), leveraging VLMs’ corruption awareness and agentic planning with external tools to enhance perception reliability for downstream tasks. |
Shaoyuan Xie; Lingdong Kong; Yuhao Dong; Chonghao Sima; Wenwei Zhang; Qi Alfred Chen; Ziwei Liu; Liang Pan; |
| 87 | LHM: Large Animatable Human Reconstruction Model for Single Image to 3D in Seconds Highlight: Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. |
Lingteng Qiu; Xiaodong Gu; Peihao Li; Qi Zuo; Weichao Shen; Junfei Zhang; Kejie Qiu; Weihao Yuan; Guanying Chen; Zilong Dong; Liefeng Bo; |
| 88 | Learning Streaming Video Representation Via Multitask Training Highlight: Unlike offline video processing, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. |
Yibin Yan; Jilan Xu; Shangzhe Di; Yikun Liu; Yudi Shi; Qirui Chen; Zeqian Li; Yifei Huang; Weidi Xie; |
| 89 | FlowTok: Flowing Seamlessly Across Text and Image Tokens Highlight: This requires projecting both modalities into a shared latent space, which poses a significant challenge due to their inherently different representations: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that seamlessly flows across text and images by encoding images into a compact 1D token representation. |
Ju He; Qihang Yu; Qihao Liu; Liang-Chieh Chen; |
| 90 | External Knowledge Injection for CLIP-Based Class-Incremental Learning Highlight: To enhance knowledge transfer from outside the dataset, we propose a dual-branch injection tuning framework that encodes informative knowledge from both visual and textual modalities. |
Da-Wei Zhou; Kai-Wen Li; Jingyi Ning; Han-Jia Ye; Lijun Zhang; De-Chuan Zhan; |
| 91 | Training-Free Industrial Defect Generation with Diffusion Models Highlight: However, existing training-based methods fail to handle complex anomalies and multiple defects simultaneously, especially when only a single anomaly sample is available per defect type. To address this issue, we propose TF-IDG, a novel training-free defect generation framework capable of generating diverse anomaly samples in a one-shot setting. |
Ruyi Xu; Yen-Tzu Chiu; Tai-I Chen; Oscar Chew; Yung-Yu Chuang; Wen-Huang Cheng; |
| 92 | Controllable Weather Synthesis and Removal with Video Diffusion Models Highlight: Physics-based weather simulation requires precise reconstructions that are hard to scale to in-the-wild videos, while current video editing often lacks realism and control. In this work, we introduce WeatherWeaver, a video diffusion model that synthesizes diverse weather effects—including rain, snow, fog, and clouds—directly into any input video without the need for 3D modeling. Our model provides precise control over weather effect intensity and supports blending various weather types, ensuring both realism and adaptability. To overcome the scarcity of paired training data, we propose a novel data strategy combining synthetic videos, generative image editing, and auto-labeled real-world videos. |
Chih-Hao Lin; Zian Wang; Ruofan Liang; Yuxuan Zhang; Sanja Fidler; Shenlong Wang; Zan Gojcic; |
| 93 | Video-T1: Test-time Scaling for Video Generation Highlight: In this work, we reinterpret the test-time scaling of video generation as a search problem to sample better trajectories from Gaussian noise space to the target video distribution. |
Fangfu Liu; Hanyang Wang; Yimo Cai; Kaiyan Zhang; Xiaohang Zhan; Yueqi Duan; |
| 94 | LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion Highlight: In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. |
Fangfu Liu; Hao Li; Jiawei Chi; Hanyang Wang; Minghui Yang; Fudong Wang; Yueqi Duan; |
| 95 | GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting Highlight: We introduce GaussianOcc, a systematic method that investigates Gaussian splatting for fully self-supervised and efficient 3D occupancy estimation in surround views. |
Wanshui Gan; Fang Liu; Hongbin Xu; Ningkai Mo; Naoto Yokoya; |
| 96 | GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices Highlight: However, prior GUI agents are often trained with datasets comprising tasks that can be completed within a single app, leading to poor performance in cross-app navigation. To address this problem, we present GUIOdyssey, a comprehensive dataset for cross-app mobile GUI navigation. |
Quanfeng Lu; Wenqi Shao; Zitao Liu; Lingxiao Du; Fanqing Meng; Boxuan Li; Botong Chen; Siyuan Huang; Kaipeng Zhang; Ping Luo; |
| 97 | Less-to-More Generalization: Unlocking More Controllability By In-Context Generation Highlight: For the second, most recent methods center on single-subject generation, making it hard to apply when dealing with multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle these challenges. |
Shaojin Wu; Mengqi Huang; Wenxu Wu; Yufeng Cheng; Fei Ding; Qian He; |
| 98 | Token-Efficient VLM: High-Resolution Image Understanding Via Dynamic Region Proposal Highlight: However, they struggle with high-resolution images and zoomed-in regions due to the computational burden and token redundancy of uniform patch-based processing, often leading to the loss of critical details. To address these challenges, we propose Token-Efficient Vision Language Model (TEVA), a novel framework that detects key regions and applies dynamic patch sampling to efficiently capture fine-grained details while preserving global context. |
Yitong Jiang; Jinwei Gu; Tianfan Xue; Ka Chun Cheung; Pavlo Molchanov; Hongxu Yin; Sifei Liu; |
| 99 | Deeply Supervised Flow-Based Generative Models Highlight: However, we observe that training velocity solely from the final layer’s output under-utilizes the rich inter-layer representations, potentially impeding model convergence. To address this limitation, we introduce DeepFlow, a novel framework that enhances velocity representation through inter-layer communication. |
Inkyu Shin; Chenglin Yang; Liang-Chieh Chen; |
| 100 | The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation Highlight: In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. |
Aoxiong Yin; Xu Tan; Kai Shen; Yichong Leng; Xinyu Zhou; Juncheng Li; Siliang Tang; |
| 101 | Cycle Consistency As Reward: Learning Image-Text Alignment Without Human Preferences Highlight: We propose an alternative approach that leverages cycle consistency as a supervisory signal. |
Hyojin Bahng; Caroline Chan; Fredo Durand; Phillip Isola; |
| 102 | SAM2Long: Enhancing SAM 2 for Long Video Segmentation with A Training-Free Memory Tree Highlight: To this end, we introduce SAM2Long, an improved training-free video object segmentation strategy, which considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways in a constrained tree search manner. |
Shuangrui Ding; Rui Qian; Xiaoyi Dong; Pan Zhang; Yuhang Zang; Yuhang Cao; Yuwei Guo; Dahua Lin; Jiaqi Wang; |
| 103 | V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction Highlight: In this paper, we focus on the spatio-temporal fusion in V2X scenarios and design one-step and multi-step communication strategies (when to transmit) as well as examine their integration with three fusion strategies – early, late, and intermediate (what to transmit), providing comprehensive benchmarks with 11 fusion models (how to fuse). |
Zewei Zhou; Hao Xiang; Zhaoliang Zheng; Seth Z. Zhao; Mingyue Lei; Yun Zhang; Tianhui Cai; Xinyi Liu; Johnson Liu; Maheswari Bajji; Xin Xia; Zhiyu Huang; Bolei Zhou; Jiaqi Ma; |
| 104 | TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction Highlight: In this work, we introduce TurboTrain, a novel and efficient training framework for multi-agent perception and prediction. |
Zewei Zhou; Seth Z. Zhao; Tianhui Cai; Zhiyu Huang; Bolei Zhou; Jiaqi Ma; |
| 105 | SynCity: Training-Free Generation of 3D Worlds Highlight: We propose SynCity, a method for generating explorable 3D worlds from textual descriptions. |
Paul Engstler; Aleksandar Shtedritski; Iro Laina; Christian Rupprecht; Andrea Vedaldi; |
| 106 | ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models Highlight: However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. |
Zifu Wan; Ce Zhang; Silong Yong; Martin Q. Ma; Simon Stepputtis; Louis-Philippe Morency; Deva Ramanan; Katia Sycara; Yaqi Xie; |
| 107 | GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation Highlight: While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality–a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. |
Tianwei Xiong; Jun Hao Liew; Zilong Huang; Jiashi Feng; Xihui Liu; |
| 108 | Mixture-of-Scores: Robust Image-Text Data Valuation Via Three Lines of Code Highlight: This complicates the selection of scoring models. In this paper, we analyze these disparities and propose a method called Mixture-of-Scores (MoS). |
Sitong Wu; Haoru Tan; Yukang Chen; Shaofeng Zhang; Jingyao Li; Bei Yu; Xiaojuan Qi; Jiaya Jia; |
| 109 | OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning Highlight: However, to the best of our knowledge, all these solutions are still not fully open, e.g., their training data remains proprietary and/or their training frameworks are unreleased. In this paper, we address this challenge by introducing a family of fully open vision encoders that are as competitive as, or even surpass, OpenAI’s CLIP in building multimodal foundation models like LLaVA. |
Xianhang Li; Yanqing Liu; Haoqin Tu; Cihang Xie; |
| 110 | General Compression Framework for Efficient Transformer Object Tracking Highlight: Thus, we propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce model size while preserving tracking accuracy. |
Lingyi Hong; Jinglun Li; Xinyu Zhou; Shilin Yan; Pinxue Guo; Kaixun Jiang; Zhaoyu Chen; Shuyong Gao; Runze Li; Xingdong Sheng; Wei Zhang; Hong Lu; Wenqiang Zhang; |
| 111 | GaussianReg: Rapid 2D/3D Registration for Emergency Surgery Via Explicit 3D Modeling with Gaussian Primitives Highlight: We present GaussianReg, a novel registration framework that achieves clinically acceptable accuracy within minutes of preprocessing. |
Weihao Yu; Xiaoqing Guo; Xinyu Liu; Yifan Liu; Hao Zheng; Yawen Huang; Yixuan Yuan; |
| 112 | X2-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction Highlight: In this paper, we propose X^2-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. |
Weihao Yu; Yuanhao Cai; Ruyi Zha; Zhiwen Fan; Chenxin Li; Yixuan Yuan; |
| 113 | Flash-VStream: Efficient Real-Time Understanding for Long Video Streams Highlight: Most existing work treats long videos in the same way as short videos, which is inefficient for real-world applications and hard to generalize to even longer videos. To address these issues, we propose Flash-VStream, an efficient video language model capable of processing extremely long videos and responding to user queries in real time. |
Haoji Zhang; Yiqin Wang; Yansong Tang; Yong Liu; Jiashi Feng; Xiaojie Jin; |
| 114 | SpectralAR: Spectral Autoregressive Visual Generation Highlight: However, image patches are inherently parallel, contradicting the causal nature of autoregressive modeling. To address this, we propose a Spectral AutoRegressive (SpectralAR) visual generation framework, which realizes causality for visual sequences from the spectral perspective. |
Yuanhui Huang; Weiliang Chen; Wenzhao Zheng; Yueqi Duan; Jie Zhou; Jiwen Lu; |
| 115 | Gradient Short-Circuit: Efficient Out-of-Distribution Detection Via Feature Intervention Highlight: During inference on a model trained exclusively with In-Distribution (ID) data, we observe a salient gradient phenomenon: around an ID sample, the local gradient directions for "enhancing" that sample’s predicted class remain relatively consistent, whereas OOD samples–unseen in training–exhibit disorganized or conflicting gradient directions in the same neighborhood. Motivated by this observation, we propose an inference-stage technique to short-circuit those feature coordinates that spurious gradients exploit to inflate OOD confidence, while leaving ID classification largely intact. |
Jiawei Gu; Ziyue Qiao; Zechao Li; |
| 116 | DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper introduces DriveArena, the first high-fidelity closed-loop simulation system designed for driving agents navigating real-world scenarios. |
Xuemeng Yang; Licheng Wen; Tiantian Wei; Yukai Ma; Jianbiao Mei; Xin Li; Wenjie Lei; Daocheng Fu; Pinlong Cai; Min Dou; Liang He; Yong Liu; Botian Shi; Yu Qiao; |
| 117 | Dual-Expert Consistency Model for Efficient and High-Quality Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflict in the learning dynamics during the distillation process: there is a significant discrepancy in the optimization gradients and loss contributions across different timesteps. |
Zhengyao Lv; Chenyang Si; Tianlin Pan; Zhaoxi Chen; Kwan-Yee K. Wong; Yu Qiao; Ziwei Liu; |
| 118 | AnyBimanual: Transferring Unimanual Policy for General Bimanual Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a plug-and-play method named AnyBimanual, which transfers pretrained unimanual policy to general bimanual manipulation policy with few bimanual demonstrations. |
Guanxing Lu; Tengbo Yu; Haoyuan Deng; Season Si Chen; Yansong Tang; Ziwei Wang; |
| 119 | Improved Noise Schedule for Diffusion Training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Diffusion models have emerged as the de facto choice for generating high-quality visual signals across various domains. However, training a single model to predict noise across various levels poses significant challenges, necessitating numerous iterations and incurring significant computational costs. Various approaches, such as loss weighting strategy design and architectural refinements, have been introduced to expedite convergence and improve model performance. In this study, we propose a novel approach to design the noise schedule for enhancing the training of diffusion models. |
Tiankai Hang; Shuyang Gu; Jianmin Bao; Fangyun Wei; Dong Chen; Xin Geng; Baining Guo; |
| 120 | Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we introduce **T**ext-**A**ware **T**ransformer-based 1-D**i**mensional **Tok**enizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. |
Dongwon Kim; Ju He; Qihang Yu; Chenglin Yang; Xiaohui Shen; Suha Kwak; Liang-Chieh Chen; |
| 121 | A Conditional Probability Framework for Compositional Zero-shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Thus, capturing attribute-object interdependence remains a fundamental yet long-ignored challenge in CZSL. In this paper, we adopt a Conditional Probability Framework (CPF) to explicitly model attribute-object dependencies. |
Peng Wu; Qiuxia Lai; Hao Fang; Guo-Sen Xie; Yilong Yin; Xiankai Lu; Wenguan Wang; |
| 122 | SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, this static nature renders them unable to dynamically track the data utility throughout pre-training, leading to subpar pre-trained models. To address this challenge, our paper introduces a novel dynamic bootstrapping dataset pruning method. |
Yangyang Guo; Mohan Kankanhalli; |
| 123 | 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 3,000 annotated image question answering triplets from 12 question types. |
Wufei Ma; Haoyu Chen; Guofeng Zhang; Yu-Cheng Chou; Jieneng Chen; Celso de Melo; Alan Yuille; |
| 124 | CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Moreover, the simple binary object-existence identification across all referent scenarios fails to specify their inherent differences, incurring ambiguity in object understanding. To tackle the above issues, we propose a **Co**unting-Aware **H**ierarchical **D**ecoding framework (CoHD) for GRES. |
Zhuoyan Luo; Yinghao Wu; Tianheng Cheng; Yong Liu; Yicheng Xiao; Hongfa Wang; Xiao-Ping Zhang; Yujiu Yang; |
| 125 | V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Our findings reveal that directly applying the positional encoding mechanism used for textual tokens to visual tokens is suboptimal, and VLM performance degrades sharply when the position encoding exceeds the model’s context window. To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens, enabling more efficient management of long multimodal sequences. |
Junqi Ge; Ziyi Chen; Jintao Lin; Jinguo Zhu; Xihui Liu; Jifeng Dai; Xizhou Zhu; |
| 126 | AlignGuard: Scalable Safety Alignment for Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce AlignGuard, a method for safety alignment of T2I models. |
Runtao Liu; I Chieh Chen; Jindong Gu; Jipeng Zhang; Renjie Pi; Qifeng Chen; Philip Torr; Ashkan Khakzar; Fabio Pizzati; |
| 127 | Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Based on the analysis, we propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in LVLMs. |
Qizhe Zhang; Aosong Cheng; Ming Lu; Renrui Zhang; Zhiyong Zhuo; Jiajun Cao; Shaobo Guo; Qi She; Shanghang Zhang; |
| 128 | LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. |
Yu Cheng; Fajie Yuan; |
| 129 | TAPNext: Tracking Any Point (TAP) As Next Token Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. |
Artem Zholus; Carl Doersch; Yi Yang; Skanda Koppula; Viorica Patraucean; Xu Owen He; Ignacio Rocco; Mehdi S. M. Sajjadi; Sarath Chandar; Ross Goroshin; |
| 130 | EVEv2: Improved Baselines for Encoder-Free Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. |
Haiwen Diao; Xiaotong Li; Yufeng Cui; Yueze Wang; Haoge Deng; Ting Pan; Wenxuan Wang; Huchuan Lu; Xinlong Wang; |
| 131 | ReTracker: Exploring Image Matching for Robust Online Any Point Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper aims to establish correspondences for a set of 2D query points across a video sequence in an online manner. |
Dongli Tan; Xingyi He; Sida Peng; Yiqing Gong; Xing Zhu; Jiaming Sun; Ruizhen Hu; Yujun Shen; Hujun Bao; Xiaowei Zhou; |
| 132 | C4D: 4D Made from 3D Through Dual Correspondences Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce C4D, a framework that leverages temporal Correspondences to extend existing 3D reconstruction formulation to 4D. |
Shizun Wang; Zhenxiang Jiang; Xingyi Yang; Xinchao Wang; |
| 133 | ShortFT: Diffusion Model Alignment Via Shortcut-based Fine-Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce Shortcut-based Fine-Tuning (ShortFT), an efficient fine-tuning strategy that utilizes the shorter denoising chain. |
Xiefan Guo; Miaomiao Cui; Liefeng Bo; Di Huang; |
| 134 | Unified Adversarial Augmentation for Improving Palmprint Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing augmentation methods struggle to generate palmprint-specific variations while preserving identity consistency, leading to suboptimal performance. To address these problems, we propose a unified adversarial augmentation framework. |
Jianlong Jin; Chenglong Zhao; Ruixin Zhang; Sheng Shang; Yang Zhao; Jun Wang; Jingyun Zhang; Shouhong Ding; Wei Jia; Yunsheng Wu; |
| 135 | Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our method, Marigold-DC, builds on a pretrained latent diffusion model (LDM) for depth estimation and injects the depth observations as test-time guidance, via an optimization scheme that runs in tandem with the iterative inference of denoising diffusion. |
Massimiliano Viola; Kevin Qu; Nando Metzger; Bingxin Ke; Alexander Becker; Konrad Schindler; Anton Obukhov; |
| 136 | Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers Via In-Context Reflection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose Reflect-DiT, a method that enables Diffusion Transformers to refine their generations using in-context examples of previously generated images alongside textual feedback describing necessary improvements. |
Shufan Li; Konstantinos Kallidromitis; Akash Gokul; Arsh Koneru; Yusuke Kato; Kazuki Kozuka; Aditya Grover; |
| 137 | LightSwitch: Multi-view Relighting with Material-guided Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose LightSwitch, a novel finetuned material-relighting diffusion framework that efficiently relights an arbitrary number of input images to a target lighting condition while incorporating cues from inferred intrinsic properties. |
Yehonathan Litman; Fernando De la Torre; Shubham Tulsiani; |
| 138 | TAB: Transformer Attention Bottlenecks Enable User Intervention and Debugging in Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. |
Pooyan Rahmanzadehgervi; Hung Huy Nguyen; Rosanne Liu; Long Mai; Anh Totti Nguyen; |
| 139 | Zero-Shot Vision Encoder Grafting Via LLM Surrogates Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM), e.g., Llama-70B, making the decoder the primary computational burden during training. To reduce costs, a promising strategy is to first train the vision encoder using a small language model before transferring it to the large one. We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be directly transferred to the larger model, a process we call zero-shot grafting — when plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by 45% when using Llama-70B as the decoder. |
Kaiyu Yue; Vasu Singla; Menglin Jia; John Kirchenbauer; Rifaa Qadri; Zikui Cai; Abhinav Bhatele; Furong Huang; Tom Goldstein; |
| 140 | Online Dense Point Tracking with Streaming Memory Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To account for temporal consistency and enable efficient information propagation, we present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing. |
Qiaole Dong; Yanwei Fu; |
| 141 | MV-Adapter: Multi-View Consistent Image Generation Made Easy Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. |
Zehuan Huang; Yuan-Chen Guo; Haoran Wang; Ran Yi; Lizhuang Ma; Yan-Pei Cao; Lu Sheng; |
| 142 | EEdit: Rethinking The Spatial and Temporal Redundancy for Efficient Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we observe that redundancy in inversion-based image editing exists in both the spatial and temporal dimensions, such as the unnecessary computation in unedited regions and the redundancy in the inversion process. To tackle these challenges, we propose an Efficient Editing framework, named EEdit, to achieve efficient image editing. |
Zexuan Yan; Yue Ma; Chang Zou; Wenteng Chen; Qifeng Chen; Linfeng Zhang; |
| 143 | Efficient Track Anything Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: The high computation complexity of image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight end-to-end track anything models that produce high-quality results with low latency and small model size. |
Yunyang Xiong; Chong Zhou; Xiaoyu Xiang; Lemeng Wu; Chenchen Zhu; Zechun Liu; Saksham Suri; Balakrishnan Varadarajan; Ramya Akula; Forrest Iandola; Raghuraman Krishnamoorthi; Bilge Soran; Vikas Chandra; |
| 144 | Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. |
Zeren Jiang; Chuanxia Zheng; Iro Laina; Diane Larlus; Andrea Vedaldi; |
| 145 | ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a tuning-free method for both object insertion and subject-driven generation. |
Daniel Winter; Asaf Shul; Matan Cohen; Dana Berman; Yael Pritch; Alex Rav-Acha; Yedid Hoshen; |
| 146 | RAGD: Regional-Aware Diffusion Model for Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, previous methods either introduce additional trainable modules, thus only applicable to specific models, or manipulate score maps within attention layers using attention masks, resulting in limited control strength when the number of regions increases. To handle these limitations, we present RAGD, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. |
Zhennan Chen; Yajie Li; Haofan Wang; Zhibo Chen; Zhengkai Jiang; Jun Li; Qian Wang; Jian Yang; Ying Tai; |
| 147 | LEGION: Learning to Ground and Explain for Synthetic Image Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. |
Hengrui Kang; Siwei Wen; Zichen Wen; Junyan Ye; Weijia Li; Peilin Feng; Baichuan Zhou; Bin Wang; Dahua Lin; Linfeng Zhang; Conghui He; |
| 148 | Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Nevertheless, its reliance on the reverse Kullback-Leibler (KL) divergence minimization potentially induces mode collapse (or mode-seeking) in certain applications. To circumvent this inherent drawback, we propose Adversarial Distribution Matching (ADM), a novel framework that leverages diffusion-based discriminators to align the latent predictions between real and fake score estimators for score distillation in an adversarial manner. |
Yanzuo Lu; Yuxi Ren; Xin Xia; Shanchuan Lin; Xing Wang; Xuefeng Xiao; Andy J. Ma; Xiaohua Xie; Jian-Huang Lai; |
| 149 | RayZer: A Self-supervised Large View Synthesis Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present RayZer, a self-supervised multi-view 3D Vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. |
Hanwen Jiang; Hao Tan; Peng Wang; Haian Jin; Yue Zhao; Sai Bi; Kai Zhang; Fujun Luan; Kalyan Sunkavalli; Qixing Huang; Georgios Pavlakos; |
| 150 | Real3D: Towards Scaling Large Reconstruction Models with Real Images Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address these limitations, we introduce Real3D, the first LRM that uses single-view real images for training, benefiting from their scalability and capturing the real-world shape distribution. Real3D introduces a novel self-training framework, including unsupervised losses at the pixel- and semantic-level, enabling LRMs to learn from these single-view images without multi-view supervision. |
Hanwen Jiang; Qixing Huang; Georgios Pavlakos; |
| 151 | Easi3R: Estimating Disentangled Motion from DUSt3R Without Training Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. |
Xingyu Chen; Yue Chen; Yuliang Xiu; Andreas Geiger; Anpei Chen; |
| 152 | Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these agents face significant challenges in visual perception, particularly when handling high-resolution, visually complex digital environments. This paper introduces Iris, a foundational visual agent that addresses these challenges through two key innovations: Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL). |
Zhiqi Ge; Juncheng Li; Xinglei Pang; Minghe Gao; Kaihang Pan; Wang Lin; Hao Fei; Wenqiao Zhang; Siliang Tang; Yueting Zhuang; |
| 153 | Large Multi-modal Models Can Interpret Features in Large Multi-modal Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: One question that arises is how we, as humans, can understand their internal neural representations. This paper takes an initial step towards addressing this question by presenting a versatile framework to identify and interpret the semantics within LMMs. |
Kaichen Zhang; Yifei Shen; Bo Li; Ziwei Liu; |
| 154 | 3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing works equip agents with various scene representations to enhance spatial awareness, they often neglect the complex 3D geometry and rich semantics in VLN scenarios, limiting the ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. |
Jianzhe Gao; Rui Liu; Wenguan Wang; |
| 155 | Multi-Modal Few-Shot Temporal Action Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose the first MMF-TAS framework, by designing a Prototype Graph Network (PGNet). |
Zijia Lu; Ehsan Elhamifar; |
| 156 | Can Knowledge Be Transferred from Unimodal to Multimodal? Investigating The Transitivity of Multimodal Knowledge Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, in practical applications, it is desirable for knowledge to be transferable across different modalities, which can enhance the robustness of knowledge editing and potentially allow for cost-effective editing of multimodal knowledge using textual information. To address this, we introduce the concept of Transitivity of Multimodal Knowledge Editing (TMKE) and design corresponding evaluation criteria. |
Lingyong Fang; Xinzhong Wang; Depeng Wang; Zongru Wu; Ya Guo; Huijia Zhu; Zhuosheng Zhang; Gongshen Liu; |
| 157 | UniVerse: Unleashing The Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. |
Jin Cao; Hongrui Wu; Ziyong Feng; Hujun Bao; Xiaowei Zhou; Sida Peng; |
| 158 | Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Traditionally, creating photo-realistic 3D head avatars requires a studio-level multi-view capture setup and expensive optimization during test-time, limiting the use of digital human doubles to the VFX industry or offline renderings. To address this shortcoming, we present Avat3r, which regresses a high-quality and animatable 3D head avatar from just a few input images, vastly reducing compute requirements during inference. |
Tobias Kirschstein; Javier Romero; Artem Sevastopolsky; Matthias Nießner; Shunsuke Saito; |
| 159 | Does Your Vision-Language Model Get Lost in The Long Video Sampling Dilemma? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To tackle the challenges posed by high-NSD questions, we propose a novel Reasoning-Driven Hierarchical Sampling (RHS) framework, which combines global localization of question-relevant cues with local dense sampling for precise inference. |
Tianyuan Qu; Longxiang Tang; Bohao Peng; Senqiao Yang; Bei Yu; Jiaya Jia; |
| 160 | InfoBridge: Balanced Multimodal Integration Through Conditional Dependency Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing methods attempt to enhance fusion through cross-modal alignment or interaction mechanisms, they often struggle to balance effective integration with preserving modality-specific information. We introduce InfoBridge, a novel framework grounded in conditional information maximization principles, addressing these limitations. |
Chenxin Li; Yifan Liu; Panwang Pan; Hengyu Liu; Xinyu Liu; Wuyang Li; Cheng Wang; Weihao Yu; Yiyang Lin; Yixuan Yuan; |
| 161 | DiffSim: Taming Diffusion Models for Evaluating Visual Similarity Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper is the first to discover that pretrained diffusion models can be utilized for measuring visual similarity and introduces the DiffSim method, addressing the limitations of traditional metrics in capturing perceptual consistency in custom generation tasks. |
Yiren Song; Xiaokang Liu; Mike Zheng Shou; |
| 162 | LayerTracer: Cognitive-Aligned Layered SVG Synthesis Via Diffusion Transformer Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Generating cognitive-aligned layered SVGs remains challenging due to existing methods’ tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a DiT-based framework that bridges this gap by learning designers’ layered SVG creation processes from a novel dataset of sequential design operations. |
Yiren Song; Danze Chen; Mike Zheng Shou; |
| 163 | Authentic 4D Driving Simulation with A Video Generation Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite progress in generating driving scenes, challenges in transforming views and modeling the dynamics of space and time remain. To tackle these issues, we propose a fresh methodology that reconstructs real-world driving environments and utilizes a generative network to enable 4D simulation. |
Lening Wang; Wenzhao Zheng; Dalong Du; Yunpeng Zhang; Yilong Ren; Han Jiang; Zhiyong Cui; Haiyang Yu; Jie Zhou; Shanghang Zhang; |
| 164 | Stable Diffusion Models Are Secretly Good at Visual In-Context Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we show that off-the-shelf Stable Diffusion models can be repurposed for visual in-context learning (V-ICL). |
Trevine Oorloff; Vishwanath Sindagi; Wele Gedara Chaminda Bandara; Ali Shafahi; Amin Ghiasi; Charan Prakash; Reza Ardekani; |
| 165 | GameFactory: Creating New Games with Generative Interactive Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present GameFactory, a framework for action-controlled scene-generalizable game video generation. |
Jiwen Yu; Yiran Qin; Xintao Wang; Pengfei Wan; Di Zhang; Xihui Liu; |
| 166 | VCA: Video Curious Agent for Long Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce a curiosity-driven video agent with self-exploration capability, dubbed "VCA". |
Zeyuan Yang; Delin Chen; Xueyang Yu; Maohao Shen; Chuang Gan; |
| 167 | Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address XBT challenges, we propose an efficient solution: a projection module that maps the new model’s embeddings to those of the old model. |
Young Kyun Jang; Ser-nam Lim; |
| 168 | SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper consolidates diverse navigation tasks into a unified and generic framework — we investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation and propose a novel State-Adaptive Mixture of Experts (SAME) model that effectively enables an agent to infer decisions based on different-granularity language and dynamic observations. |
Gengze Zhou; Yicong Hong; Zun Wang; Chongyang Zhao; Mohit Bansal; Qi Wu; |
| 169 | Towards Fine-grained Interactive Segmentation in Images and Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present a SAM2Refiner framework built upon the SAM2 backbone. |
Yuan Yao; Qiushi Yang; Miaomiao Cui; Liefeng Bo; |
| 170 | FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present FreeMorph, the first tuning-free method for image morphing that accommodates inputs with varying semantics or layouts. |
Yukang Cao; Chenyang Si; Jinghao Wang; Ziwei Liu; |
| 171 | An Empirical Study of Autoregressive Pre-training from Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. |
Jathushan Rajasegaran; Ilija Radosavovic; Rahul Ravishankar; Yossi Gandelsman; Christoph Feichtenhofer; Jitendra Malik; |
| 172 | WorldScore: A Unified Evaluation Benchmark for World Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce the WorldScore benchmark, the first unified benchmark for world generation. |
Haoyi Duan; Hong-Xing Yu; Sirui Chen; Li Fei-Fei; Jiajun Wu; |
| 173 | VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing benchmarks do not adequately meet the needs of VLAs and related algorithms. To better define such general-purpose tasks in the context of LLMs and advance the research in VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. |
Shiduo Zhang; Zhe Xu; Peiju Liu; Xiaopeng Yu; Yuan Li; Qinghui Gao; Zhaoye Fei; Zhangyue Yin; Zuxuan Wu; Yu-Gang Jiang; Xipeng Qiu; |
| 174 | Are They The Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly benchmark over 30 different MLLMs. |
Yikang Zhou; Tao Zhang; Shilin Xu; Shihao Chen; Qianyu Zhou; Yunhai Tong; Shunping Ji; Jiangning Zhang; Lu Qi; Xiangtai Li; |
| 175 | Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While this formulation is elegant and powerful, it is limited to static scenes. To overcome this limitation, we introduce the concept of Dynamic Point Maps (DPM), which extends standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. |
Edgar Sucar; Zihang Lai; Eldar Insafutdinov; Andrea Vedaldi; |
| 176 | ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation Highlight: However, a significant gap remains between the generated data and real-world sensor observations, particularly in terms of fidelity for structured elements, such as the ground surface. To address these challenges, we propose ReconDreamer++, an enhanced framework that significantly improves the overall rendering quality by mitigating the domain gap and refining the representation of the ground surface. Specifically, ReconDreamer++ introduces the Novel Trajectory Deformable Network (NTDNet), which leverages learnable spatial deformation mechanisms to bridge the domain gap between synthesized novel views and original sensor observations. |
Guosheng Zhao; Xiaofeng Wang; Chaojun Ni; Zheng Zhu; Wenkang Qin; Guan Huang; Xingang Wang; |
| 177 | PERSONA: Personalized Whole-Body 3D Avatar with Pose-Driven Deformations from A Single Image Highlight: We present PERSONA, a framework that combines the strengths of both approaches to obtain a personalized 3D human avatar with pose-driven deformations from a single image. |
Geonhee Sim; Gyeongsik Moon; |
| 178 | SpatialTrackerV2: Advancing 3D Point Tracking with Explicit Camera Motion Highlight: We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. |
Yuxi Xiao; Jianyuan Wang; Nan Xue; Nikita Karaev; Yuri Makarov; Bingyi Kang; Xing Zhu; Hujun Bao; Yujun Shen; Xiaowei Zhou; |
| 179 | CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting Highlight: In this paper, we present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS. |
Siyu Jiao; Haoye Dong; Yuyang Yin; Zequn Jie; Yinlong Qian; Yao Zhao; Humphrey Shi; Yunchao Wei; |
| 180 | Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs Highlight: While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. |
Xinyu Fang; Zhijian Chen; Kai Lan; Lixin Ma; Shengyuan Ding; Yingji Liang; Xiangyu Zhao; Farong Wen; Zicheng Zhang; Guofeng Zhang; Haodong Duan; Kai Chen; Dahua Lin; |
| 181 | MUSE-VL: Modeling Unified VLM Through Semantic Discrete Encoding Highlight: We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. |
Rongchang Xie; Chen Du; Ping Song; Chang Liu; |
| 182 | BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment Highlight: We introduce OmniDepth, a unified framework that bridges both through iterative bidirectional alignment of their latent representations. |
Tongfan Guan; Jiaxin Guo; Chen Wang; Yun-Hui Liu; |
| 183 | ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis Highlight: We introduce ConsistentCity, a two-stage framework with a novel Semantic Flow-guided Diffusion Transformer (SF-DiT) that converts sequential BEV semantic maps into temporally consistent driving videos. |
Benjin Zhu; Xiaogang Wang; Hongsheng Li; |
| 184 | One Trajectory, One Token: Grounded Video Tokenization Via Panoptic Sub-object Trajectory Highlight: We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. |
Chenhao Zheng; Jieyu Zhang; Mohammadreza Salehi; Ziqi Gao; Vishnu Iyengar; Norimasa Kobori; Quan Kong; Ranjay Krishna; |
| 185 | MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers Highlight: We present MagicMirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. |
Yuechen Zhang; Yaoyang Liu; Bin Xia; Bohao Peng; Zexin Yan; Eric Lo; Jiaya Jia; |
| 186 | MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance Highlight: Recent advances in video generation have led to remarkable improvements in visual quality and temporal coherence. Building on this, trajectory-controllable video generation has emerged to enable precise object motion control through explicitly defined spatial paths. However, existing methods struggle with complex object movements and multi-object motion control, resulting in imprecise trajectory adherence, poor object consistency, and compromised visual quality. Furthermore, these methods support trajectory control in only a single format, limiting their applicability in diverse scenarios. Additionally, there is no publicly available dataset or benchmark specifically tailored for trajectory-controllable video generation, hindering robust training and systematic evaluation. To address these challenges, we introduce MagicMotion, a novel image-to-video generation framework that enables trajectory control through three levels of conditions, from dense to sparse: masks, bounding boxes, and sparse boxes. |
Quanhao Li; Zhen Xing; Rui Wang; Hui Zhang; Qi Dai; Zuxuan Wu; |
| 187 | GenieBlue: Integrating Both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices Highlight: To address these issues, we systematically analyze methods to maintain pure language capabilities during the training of MLLMs, focusing on both training data and model architecture aspects. Based on these analyses, we propose GenieBlue, an efficient MLLM structural design that integrates both linguistic and multimodal capabilities for LLMs on mobile devices. |
Xudong Lu; Yinghao Chen; Renshou Wu; Haohao Gao; Xi Chen; Xue Yang; Xiangyu Zhao; Aojun Zhou; Fangyuan Li; Yafei Wen; Xiaoxin Chen; Shuai Ren; Hongsheng Li; |
| 188 | CameraCtrl II: Dynamic Scene Exploration Via Camera-controlled Video Diffusion Models Highlight: This paper introduces CameraCtrl II, a framework that enables continuous and dynamic scene exploration through a camera-controlled video diffusion model. |
Hao He; Ceyuan Yang; Shanchuan Lin; Yinghao Xu; Meng Wei; Liangke Gui; Qi Zhao; Gordon Wetzstein; Lu Jiang; Hongsheng Li; |
| 189 | CAPTURE: Evaluating Spatial Reasoning in Vision Language Models Via Occluded Object Counting Highlight: To test models’ ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). |
Atin Pothiraj; Elias Stengel-Eskin; Jaemin Cho; Mohit Bansal; |
| 190 | STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution Highlight: To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. |
Rui Xie; Yinhong Liu; Penghao Zhou; Chen Zhao; Jun Zhou; Kai Zhang; Zhenyu Zhang; Jian Yang; Zhenheng Yang; Ying Tai; |
| 191 | Harmonizing Visual Representations for Unified Multimodal Understanding and Generation Highlight: Current approaches that utilize vector quantization (VQ) or variational autoencoders (VAE) for unified visual representation prioritize intrinsic imagery features over semantics, compromising understanding performance. In this work, we take inspiration from masked image modelling (MIM) that learns rich semantics via a mask-and-reconstruct pre-training and its successful extension to masked autoregressive (MAR) image generation. |
Size Wu; Wenwei Zhang; Lumin Xu; Sheng Jin; Zhonghua Wu; Qingyi Tao; Wentao Liu; Wei Li; Chen Change Loy; |
| 192 | Scalable Ranked Preference Optimization for Text-to-Image Generation Highlight: In this work, we investigate a scalable approach for collecting large-scale and fully synthetic datasets for DPO training. |
Shyamgopal Karthik; Huseyin Coskun; Zeynep Akata; Sergey Tulyakov; Jian Ren; Anil Kag; |
| 193 | Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction Highlight: However, when applied to parallelized block-wise training, two critical issues arise: reconstruction accuracy deteriorates due to reduced data diversity when each block is trained independently, and parallel training restricts the number of divided blocks to the number of available GPUs. To address these issues, we propose Momentum-GS, a novel approach that leverages momentum-based self-distillation to promote consistency and accuracy across the blocks while decoupling the number of blocks from the physical GPU count. |
Jixuan Fan; Wanhua Li; Yifei Han; Tianru Dai; Yansong Tang; |
| 194 | TruthPrInt: Mitigating Large Vision-Language Models Object Hallucination Via Latent Truthful-Guided Pre-Intervention Highlight: Moreover, (2) different LVLMs encode universal patterns of hallucinations in common latent subspaces, indicating that there exist "generic truthful directions" shared by various LVLMs. Based on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt) that first learns the truthful direction of LVLM decoding and then applies truthful-guided inference-time intervention during LVLM decoding. |
Jinhao Duan; Fei Kong; Hao Cheng; James Diffenderfer; Bhavya Kailkhura; Lichao Sun; Xiaofeng Zhu; Xiaoshuang Shi; Kaidi Xu; |
| 195 | MotionFollower: Editing Video Motion Via Score-Guided Diffusion Highlight: In this paper, we propose MotionFollower, a score-guided diffusion model for video motion editing. |
Shuyuan Tu; Qi Dai; Zihao Zhang; Sicheng Xie; Zhi-Qi Cheng; Chong Luo; Xintong Han; Zuxuan Wu; Yu-Gang Jiang; |
| 196 | MaterialMVP: Illumination-Invariant Material Generation Via Multi-view PBR Diffusion Highlight: In this paper, we present MaterialMVP, a novel end-to-end model for generating PBR textures from 3D meshes and image prompts, addressing key challenges in multi-view material synthesis. |
Zebin He; Mingxin Yang; Shuhui Yang; Yixuan Tang; Tao Wang; Kaihao Zhang; Guanying Chen; Yuhong Liu; Jie Jiang; Chunchao Guo; Wenhan Luo; |
| 197 | DIMO: Diverse 3D Motion Generation for Arbitrary Objects Highlight: We present DIMO, a generative approach capable of generating diverse 3D motions for arbitrary objects from a single image. |
Linzhan Mou; Jiahui Lei; Chen Wang; Lingjie Liu; Kostas Daniilidis; |
| 198 | DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness Highlight: We construct a dataset of 3D objects labeled with stability scores obtained from the physics simulator. This dataset enables fine-tuning of the 3D generator using the stability score as an alignment metric, via direct preference optimization (DPO) or direct reward optimization (DRO), a novel objective we introduce to align diffusion models without requiring pairwise preferences. |
Ruining Li; Chuanxia Zheng; Christian Rupprecht; Andrea Vedaldi; |
| 199 | Puppet-Master: Scaling Interactive Video Generation As A Motion Prior for Part-Level Dynamics Highlight: We introduce Puppet-Master, an interactive video generator that captures the internal, part-level motion of objects, serving as a proxy for modeling object dynamics universally. |
Ruining Li; Chuanxia Zheng; Christian Rupprecht; Andrea Vedaldi; |
| 200 | LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models Highlight: In response, we propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs. |
Yuzhang Shang; Mu Cai; Bingxin Xu; Yong Jae Lee; Yan Yan; |
| 201 | Controllable and Expressive One-Shot Video Head Swapping Highlight: In this paper, we propose a novel diffusion-based multi-condition controllable framework for video head swapping, which seamlessly transplants a human head from a static image into a dynamic video, while preserving the original body and background of the target video, and further allows tweaking head expressions and movements during swapping as needed. |
Chaonan Ji; Jinwei Qi; Peng Zhang; Bang Zhang; Liefeng Bo; |
| 202 | HPSv3: Towards Wide-Spectrum Human Preference Score Highlight: To address these challenges, we introduce Human Preference Score v3 (HPSv3). (1) We release HPDv3, the first wide-spectrum human preference dataset, integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons from state-of-the-art generative models and low- to high-quality real-world images. (2) We introduce a VLM-based preference model trained using an uncertainty-aware ranking loss for fine-grained ranking. |
Yuhang Ma; Xiaoshi Wu; Keqiang Sun; Hongsheng Li; |
| 203 | Rethinking The Upsampling Process in Light Field Super-Resolution with Spatial-Epipolar Implicit Image Function Highlight: Moreover, given that the line structure in the epipolar plane image integrates the spatial-angular correlation of the light field, we present an oriented line sampling strategy to precisely aggregate inter-view information. |
Ruixuan Cong; Yu Wang; Mingyuan Zhao; Da Yang; Rongshan Chen; Hao Sheng; |
| 204 | Efficient Autoregressive Shape Generation Via Octree-Based Adaptive Tokenization Highlight: This leads to inefficient latent representations that can compromise downstream generation. We address this challenge by introducing Octree-based Adaptive Tokenization, a novel framework that adjusts the dimension of latent representations according to shape complexity. |
Kangle Deng; Hsueh-Ti Derek Liu; Yiheng Zhu; Xiaoxia Sun; Chong Shang; Kiran S. Bhat; Deva Ramanan; Jun-Yan Zhu; Maneesh Agrawala; Tinghui Zhou; |
| 205 | Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension Highlight: In this paper, we present Vision Value Model (VisVM) that can guide VLM inference-time search to generate responses with better visual comprehension. |
Xiyao Wang; Zhengyuan Yang; Linjie Li; Hongjin Lu; Yuancheng Xu; Chung-Ching Lin; Kevin Lin; Furong Huang; Lijuan Wang; |
| 206 | Describe, Don’t Dictate: Semantic Image Editing with Natural Language Intent Highlight: Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models mainly suffer from limited dataset quality and scale. To address these problems, we propose a descriptive-prompt-based editing framework, named DescriptiveEdit. |
En Ci; Shanyan Guan; Yanhao Ge; Yilin Zhang; Wei Li; Zhenyu Zhang; Jian Yang; Ying Tai; |
| 207 | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step Highlight: In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. |
Guowei Xu; Peng Jin; Ziang Wu; Hao Li; Yibing Song; Lichao Sun; Li Yuan; |
| 208 | Decouple and Track: Benchmarking and Improving Video Diffusion Transformers For Motion Transfer Highlight: In this paper, we propose DeT, a method that adapts DiT models to improve motion transfer ability. |
Qingyu Shi; Jianzong Wu; Jinbin Bai; Jiangning Zhang; Lu Qi; Yunhai Tong; Xiangtai Li; |
| 209 | RapVerse: Coherent Vocals and Whole-Body Motion Generation from Text Highlight: In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. |
Jiaben Chen; Xin Yan; Yihang Chen; Siyuan Cen; Zixin Wang; Qinwei Ma; Haoyu Zhen; Kaizhi Qian; Lie Lu; Chuang Gan; |
| 210 | Unraveling The Smoothness Properties of Diffusion Models: A Gaussian Mixture Perspective Highlight: However, a theoretical understanding of the Lipschitz continuity and second momentum properties of the diffusion process is still lacking. In this paper, we bridge this gap by providing a detailed examination of these smoothness properties for the case where the target data distribution is a mixture of Gaussians, which serves as a universal approximator for smooth densities such as image data. |
Yingyu Liang; Zhizhou Sha; Zhenmei Shi; Zhao Song; Mingda Wan; Yufa Zhou; |
| 211 | Decoupled Diffusion Sparks Adaptive Scene Generation Highlight: To overcome these, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinal and challenging scenarios from fine-grained tokens with independent noise states. |
Yunsong Zhou; Naisheng Ye; William Ljungbergh; Tianyu Li; Jiazhi Yang; Zetong Yang; Hongzi Zhu; Christoffer Petersson; Hongyang Li; |
| 212 | PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology Highlight: We introduce PathFinder, a multi-modal, multi-agent framework that emulates the decision-making process of expert pathologists. |
Fatemeh Ghezloo; Mehmet Saygin Seyfioglu; Rustin Soraki; Wisdom O. Ikezogwo; Beibin Li; Tejoram Vivekanandan; Joann G. Elmore; Ranjay Krishna; Linda Shapiro; |
| 213 | Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance Highlight: However, these approaches fail to produce reasonable associations between characters and their environments. To address this limitation, we introduce Animate Anyone 2, aiming to animate characters with environment affordance. |
Li Hu; Guangyuan Wang; Zhen Shen; Xin Gao; Dechao Meng; Lian Zhuo; Peng Zhang; Bang Zhang; Liefeng Bo; |
| 214 | How Far Are AI-generated Videos from Simulating The 3D Visual World: A Learned 3D Evaluation Approach Highlight: In this paper, we introduce Learned 3D Evaluation (L3DE), an objective, quantifiable, and interpretable method for assessing AI-generated videos’ ability to simulate the real world in terms of 3D visual qualities and consistencies, without requiring manually labeled defects or quality annotations. |
Chirui Chang; Jiahui Liu; Zhengzhe Liu; Xiaoyang Lyu; Yi-Hua Huang; Xin Tao; Pengfei Wan; Di Zhang; Xiaojuan Qi; |
| 215 | DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation Highlight: A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. |
Jiazhe Guo; Yikang Ding; Xiwu Chen; Shuo Chen; Bohan Li; Yingshuang Zou; Xiaoyang Lyu; Feiyang Tan; Xiaojuan Qi; Zhiheng Li; Hao Zhao; |
| 216 | UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization Highlight: This paper presents UniPortrait, an innovative human image personalization framework that unifies single- and multi-ID customization with high face fidelity, extensive facial editability, free-form input description, and diverse layout generation. |
Junjie He; Yifeng Geng; Liefeng Bo; |
| 217 | RCTDistill: Cross-Modal Knowledge Distillation Framework for Radar-Camera 3D Object Detection with Temporal Fusion Highlight: In this work, we propose RCTDistill, a novel cross-modal KD method based on temporal fusion, comprising three key modules: Range-Azimuth Knowledge Distillation (RAKD), Temporal Knowledge Distillation (TKD), and Region-Decoupled Knowledge Distillation (RDKD). |
Geonho Bang; Minjae Seong; Jisong Kim; Geunju Baek; Daye Oh; Junhyung Kim; Junho Koh; Jun Won Choi; |
| 218 | Frequency-Dynamic Attention Modulation For Dense Prediction Highlight: Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and dynamically combining the two. |
Linwei Chen; Lin Gu; Ying Fu; |
| 219 | PoseSyn: Synthesizing Diverse 3D Pose Data from In-the-Wild 2D Data Highlight: We propose PoseSyn, a novel data synthesis framework that transforms abundant in-the-wild 2D pose datasets into diverse 3D pose-image pairs. |
ChangHee Yang; Hyeonseop Song; Seokhun Choi; Seungwoo Lee; Jaechul Kim; Hoseok Do; |
| 220 | Textured 3D Regenerative Morphing with 3D Diffusion Prior Highlight: This restriction leads to labor-intensive preprocessing and poor generalization. To overcome these challenges, we propose a method for 3D regenerative morphing using a 3D diffusion prior. |
Songlin Yang; Yushi Lan; Honghua Chen; Xingang Pan; |
| 221 | LVBench: An Extreme Long Video Understanding Benchmark Highlight: However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. |
Weihan Wang; Zehai He; Wenyi Hong; Yean Cheng; Xiaohan Zhang; Ji Qi; Ming Ding; Xiaotao Gu; Shiyu Huang; Bin Xu; Yuxiao Dong; Jie Tang; |
| 222 | SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling Highlight: This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to 1024^3 directly from rendering losses. |
Xianglong He; Zi-Xin Zou; Chia-Hao Chen; Yuan-Chen Guo; Ding Liang; Chun Yuan; Wanli Ouyang; Yan-Pei Cao; Yangguang Li; |
| 223 | Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation Highlight: To this end, we introduce 3D-aware motion representation and propose an image animation framework, called Perception-as-Control, to achieve fine-grained collaborative motion control. |
Yingjie Chen; Yifang Men; Yuan Yao; Miaomiao Cui; Liefeng Bo; |
| 224 | SuperDec: 3D Scene Decomposition with Superquadrics Primitives Highlight: We present SuperDec, an approach for compact 3D scene representations based on geometric primitives, namely superquadrics. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we propose to leverage them to obtain a compact yet expressive representation. |
Elisabetta Fedele; Boyang Sun; Leonidas Guibas; Marc Pollefeys; Francis Engelmann; |
| 225 | Multi-Schema Proximity Network for Composed Image Retrieval Highlight: Despite significant advances in CIR methods, two unresolved problems remain: 1) existing methods overlook multi-schema interaction due to the lack of fine-grained explicit visual supervision, which hinders the capture of complex correspondences, and 2) existing methods overlook noisy negative pairs formed by potential corresponding query-target pairs, which increases confusion. To address these problems, we propose a Multi-schemA Proximity Network (MAPNet) for CIR, consisting of two key components: Multi-Schema Interaction (MSI) and Relaxed Proximity Loss (RPLoss). |
Jiangming Shi; Xiangbo Yin; Yeyun Chen; Yachao Zhang; Zhizhong Zhang; Yuan Xie; Yanyun Qu; |
| 226 | Free-Form Motion Control: Controlling The 6D Poses of Camera and Objects in Video Generation Highlight: Due to the lack of datasets with comprehensive 6D pose annotations, existing text-to-video methods cannot simultaneously control the motions of both camera and objects in a 3D-aware manner, resulting in limited controllability over generated content. To address this issue and facilitate research in this field, we introduce a Synthetic Dataset for Free-Form Motion Control (SynFMC). |
Xincheng Shuai; Henghui Ding; Zhenyuan Qin; Hao Luo; Xingjun Ma; Dacheng Tao; |
| 227 | MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction Highlight: We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. |
Zijian Dong; Longteng Duan; Jie Song; Michael J. Black; Andreas Geiger; |
| 228 | Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis Highlight: In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from street view images, utilizing a diffusion model and the Bird’s-Eye View (BEV) paradigm. |
Junyan Ye; Jun He; Weijia Li; Zhutao Lv; Yi Lin; Jinhua Yu; Haote Yang; Conghui He; |
| 229 | DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion Highlight: We propose DepR, a depth-guided single-view scene reconstruction framework that integrates instance-level diffusion within a compositional paradigm. |
Qingcheng Zhao; Xiang Zhang; Haiyang Xu; Zeyuan Chen; Jianwen Xie; Yuan Gao; Zhuowen Tu; |
| 230 | Unleashing Vecset Diffusion Model for Fast Shape Generation Highlight: Challenges exist because of not only difficulties in accelerating diffusion sampling but also VAE decoding in VDM — areas under-explored in previous works. To address these challenges, we present FlashVDM, a systematic framework for accelerating both VAE and DiT in VDM. |
Zeqiang Lai; Yunfei Zhao; Zibo Zhao; Haolin Liu; Fuyun Wang; Huiwen Shi; Xianghui Yang; Qingxiang Lin; Jingwei Huang; Yuhong Liu; Jie Jiang; Chunchao Guo; Xiangyu Yue; |
| 231 | BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation Highlight: This paper presents a generalizable RGB-based approach for object pose estimation, specifically designed to address challenges in sparse-view settings. |
Yuanhong Yu; Xingyi He; Chen Zhao; Junhao Yu; Jiaqi Yang; Ruizhen Hu; Yujun Shen; Xing Zhu; Xiaowei Zhou; Sida Peng; |
| 232 | ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training Highlight: We propose to separate the coordinate regressor and the map representation into a generic transformer and a scene-specific map code. |
Leonard Bruns; Axel Barroso-Laguna; Tommaso Cavallari; Aron Monszpart; Sowmya Munukutla; Victor Adrian Prisacariu; Eric Brachmann; |
| 233 | Learning to Inference Adaptively for Multimodal Large Language Models Highlight: Despite recent effort on improving the efficiency of MLLMs, prior solutions fall short in responding to varying runtime conditions, in particular changing resource availability (e.g., contention due to the execution of other programs on the device). To bridge this gap, we introduce AdaLLaVA, an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. |
Zhuoyan Xu; Khoi Duc Nguyen; Preeti Mukherjee; Saurabh Bagchi; Somali Chaterji; Yingyu Liang; Yin Li; |
| 234 | Generate, Transduct, Adapt: Iterative Transduction with VLMs Highlight: We propose GTA-CLIP, a novel technique that incorporates supervision from language models for joint transduction in language and vision spaces. |
Oindrila Saha; Logan Lawrence; Grant Van Horn; Subhransu Maji; |
| 235 | GenDoP: Auto-regressive Camera Trajectory Generation As A Director of Photography. Highlight: In this work, we introduce an auto-regressive model inspired by the expertise of Directors of Photography to generate artistic and expressive camera trajectories.
Mengchen Zhang; Tong Wu; Jing Tan; Ziwei Liu; Gordon Wetzstein; Dahua Lin; |
| 236 | DreamCube: RGB-D Panorama Generation Via Multi-plane Synchronization. Highlight: In this work, we demonstrate that by applying multi-plane synchronization to the operators from 2D foundation models, their capabilities can be seamlessly extended to the omnidirectional domain.
Yukun Huang; Yanning Zhou; Jianan Wang; Kaiyi Huang; Xihui Liu; |
| 237 | PixTalk: Controlling Photorealistic Image Processing and Editing with Language. Highlight: In this work, we propose the first approach that introduces language and explicit control into the image processing and editing pipeline.
Marcos V. Conde; Zihao Lu; Radu Timofte; |
| 238 | 3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation. Highlight: Physical adversarial attack methods expose the vulnerabilities of deep neural networks and pose a significant threat to safety-critical scenarios such as autonomous driving.
Tianrui Lou; Xiaojun Jia; Siyuan Liang; Jiawei Liang; Ming Zhang; Yanjun Xiao; Xiaochun Cao; |
| 239 | CODA: Repurposing Continuous VAEs for Discrete Tokenization. Highlight: In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression and discretization.
Zeyu Liu; Zanlin Ni; Yeguo Hua; Xin Deng; Xiao Ma; Cheng Zhong; Gao Huang; |
| 240 | Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis. Highlight: In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs.
Bowen Zhang; Sicheng Xu; Chuxin Wang; Jiaolong Yang; Feng Zhao; Dong Chen; Baining Guo; |
| 241 | MOERL: When Mixture-of-Experts Meet Reinforcement Learning for Adverse Weather Image Restoration. Highlight: In this work, we propose MOERL, a Mixture-of-Experts (MoE) model optimized with reinforcement learning (RL) to enhance image restoration across diverse weather conditions.
Tao Wang; Peiwen Xia; Bo Li; Peng-Tao Jiang; Zhe Kong; Kaihao Zhang; Tong Lu; Wenhan Luo; |
| 242 | 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining. Highlight: In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining.
Wenqi Zhang; Hang Zhang; Xin Li; Jiashuo Sun; Yongliang Shen; Weiming Lu; Deli Zhao; Yueting Zhuang; Lidong Bing; |
| 243 | DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution. Highlight: To this end, we propose our DiT4SR, one of the pioneering works to tame the large-scale DiT model for Real-ISR.
Zheng-Peng Duan; Jiawei Zhang; Xin Jin; Ziheng Zhang; Zheng Xiong; Dongqing Zou; Jimmy S. Ren; Chunle Guo; Chongyi Li; |
| 244 | RogSplat: Robust Gaussian Splatting Via Generative Priors. Highlight: In real-world scenarios, violations of this assumption (such as occlusions, dynamic objects, or camera blur) often lead to reconstruction artifacts and rendering inaccuracies. To address these challenges, we introduce RogSplat, a robust framework that leverages generative models to enhance the reliability of 3DGS.
Hanyang Kong; Xingyi Yang; Xinchao Wang; |
| 245 | AdsQA: Towards Advertisement Video Understanding. Highlight: In this paper, we propose to use advertisement (ad) videos as a challenging test-bed to probe the ability of LLMs in perceiving beyond the objective physical content of the common visual domain.
Xinwei Long; Kai Tian; Peng Xu; Guoli Jia; Jingxuan Li; Sa Yang; Yihua Shao; Kaiyan Zhang; Che Jiang; Hao Xu; Yang Liu; Jiaheng Ma; Bowen Zhou; |
| 246 | CA2C: A Prior-Knowledge-Free Approach for Robust Label Noise Learning Via Asymmetric Co-learning and Co-training. Highlight: To this end, we propose a novel LNL approach, termed CA2C (Combined Asymmetric Co-learning and Co-training), which alleviates the reliance on prior knowledge through an integration of complementary learning paradigms.
Mengmeng Sheng; Zeren Sun; Tianfei Zhou; Xiangbo Shu; Jinshan Pan; Yazhou Yao; |
| 247 | ReCoT: Reflective Self-Correction Training for Mitigating Confirmation Bias in Large Vision-Language Models. Highlight: This problem is more common in smaller-scale LVLMs, as they are usually fine-tuned with training data that is mostly positive, focusing on generating coherent dialogue. To address this issue, we introduce ReCoT, a method designed to mitigate confirmation bias in smaller-scale LVLMs through Reflective Self-Correction Training. The method follows a two-stage SFT-DPO paradigm: the first SFT stage aims to cultivate the model’s reflective correction abilities, while the DPO stage focuses on enhancing the consistency between answers and reflections.
Mengxue Qu; Yibo Hu; Kunyang Han; Yunchao Wei; Yao Zhao; |
| 248 | Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation. Highlight: Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP.
Luca Barsellotti; Lorenzo Bianchi; Nicola Messina; Fabio Carrara; Marcella Cornia; Lorenzo Baraldi; Fabrizio Falchi; Rita Cucchiara; |
| 249 | Long-Context State-Space Video World Models. Highlight: However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency.
Ryan Po; Yotam Nitzan; Richard Zhang; Berlin Chen; Tri Dao; Eli Shechtman; Gordon Wetzstein; Xun Huang; |
| 250 | Describe Anything: Detailed Localized Image and Video Captioning. Highlight: We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC).
Long Lian; Yifan Ding; Yunhao Ge; Sifei Liu; Hanzi Mao; Boyi Li; Marco Pavone; Ming-Yu Liu; Trevor Darrell; Adam Yala; Yin Cui; |
| 251 | Contrastive Flow Matching. Highlight: We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation.
George Stoica; Vivek Ramanujan; Xiang Fan; Ali Farhadi; Ranjay Krishna; Judy Hoffman; |
| 252 | STIV: Scalable Text and Image Conditioned Video Generation. Highlight: We present a simple and scalable text and image conditioned video generation method.
Zongyu Lin; Wei Liu; Chen Chen; Jiasen Lu; Wenze Hu; Tsu-Jui Fu; Jesse Allardice; Zhengfeng Lai; Liangchen Song; Bowen Zhang; Cha Chen; Yiran Fei; Lezhi Li; Yinfei Yang; Yizhou Sun; Kai-Wei Chang; |
| 253 | EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds. Highlight: Inspired by how humans learn through the perception-action loop, we propose EgoAgent, a unified agent model that simultaneously learns to represent, predict, and act within a single transformer.
Lu Chen; Yizhou Wang; Shixiang Tang; Qianhong Ma; Tong He; Wanli Ouyang; Xiaowei Zhou; Hujun Bao; Sida Peng; |
| 254 | A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation. Highlight: Unlike recent point-based and flow-based affordance methods that focus on dense spatial representations or trajectory modeling, we propose A0, a hierarchical affordance-aware diffusion model that decomposes manipulation tasks into high-level spatial affordance understanding and low-level action execution.
Rongtao Xu; Jian Zhang; Minghao Guo; Youpeng Wen; Haoting Yang; Min Lin; Jianzheng Huang; Zhe Li; Kaidong Zhang; Liqiong Wang; Yuxuan Kuang; Meng Cao; Feng Zheng; Xiaodan Liang; |
| 255 | Social Debiasing for Fair Multi-modal LLMs. Highlight: This paper addresses the issue of social biases in MLLMs by i) introducing a comprehensive counterfactual dataset with multiple social concepts (CMSC), which complements existing datasets by providing 18 diverse and balanced social concepts; and ii) proposing a counter-stereotype debiasing (CSD) strategy that mitigates social biases in MLLMs by leveraging the opposites of prevalent stereotypes.
Harry Cheng; Yangyang Guo; Qingpei Guo; Ming Yang; Tian Gan; Weili Guan; Liqiang Nie; |
| 256 | WonderTurbo: Generating Interactive 3D World in 0.72 Seconds. Highlight: However, a critical challenge in current 3D generation technologies lies in achieving real-time interactivity. To address this issue, we introduce WonderTurbo, the first real-time interactive 3D scene generation framework capable of generating novel perspectives of 3D scenes within 0.72 seconds.
Chaojun Ni; Xiaofeng Wang; Zheng Zhu; Weijie Wang; Haoyun Li; Guosheng Zhao; Jie Li; Wenkang Qin; Guan Huang; Wenjun Mei; |
| 257 | DreamDance: Animating Human Images By Enriching 3D Geometry Cues from 2D Poses. Highlight: In this work, we present DreamDance, a novel method for animating human images using only skeleton pose sequences as conditional inputs.
Yatian Pang; Bin Zhu; Bin Lin; Mingzhe Zheng; Francis E. H. Tay; Ser-Nam Lim; Harry Yang; Li Yuan; |
| 258 | Scene Coordinate Reconstruction Priors. Highlight: We present a probabilistic reinterpretation of training SCR models, which allows us to infuse high-level reconstruction priors.
Wenjing Bian; Axel Barroso-Laguna; Tommaso Cavallari; Victor Adrian Prisacariu; Eric Brachmann; |
| 259 | End-to-End Driving with Online Trajectory Evaluation Via BEV World Model. Highlight: This goal can be achieved by employing a world model to capture environmental dynamics and predict future states. Therefore, we propose an end-to-end driving framework **WoTE**, which leverages a BEV **Wo**rld model to predict future BEV states for **T**rajectory **E**valuation.
Yingyan Li; Yuqi Wang; Yang Liu; Jiawei He; Lue Fan; Zhaoxiang Zhang; |
| 260 | Dual-Process Image Generation. Highlight: We propose a dual-process distillation scheme that allows feed-forward image generators to learn new tasks from deliberative VLMs.
Grace Luo; Jonathan Granskog; Aleksander Holynski; Trevor Darrell; |
| 261 | Baking Gaussian Splatting Into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction. Highlight: In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object generation and scene reconstruction from a single view.
Yuanhao Cai; He Zhang; Kai Zhang; Yixun Liang; Mengwei Ren; Fujun Luan; Qing Liu; Soo Ye Kim; Jianming Zhang; Zhifei Zhang; Yuqian Zhou; Yulun Zhang; Xiaokang Yang; Zhe Lin; Alan Yuille; |
| 262 | Rethinking The Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities. Highlight: Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots.
Liuyi Wang; Xinyuan Xia; Hui Zhao; Hanqing Wang; Tai Wang; Yilun Chen; Chengju Liu; Qijun Chen; Jiangmiao Pang; |
| 263 | Moto: Latent Motion Token As The Bridging Language for Learning Robot Manipulation from Videos. Highlight: The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions.
Yi Chen; Yuying Ge; Weiliang Tang; Yizhuo Li; Yixiao Ge; Mingyu Ding; Ying Shan; Xihui Liu; |
| 264 | I2VControl: Disentangled and Unified Video Motion Synthesis Control. Highlight: In this paper, we propose a disentangled and unified framework, namely I2VControl, to overcome the logical conflicts.
Wanquan Feng; Tianhao Qi; Jiawei Liu; Mingzhen Sun; Pengqi Tu; Tianxiang Ma; Fei Dai; Songtao Zhao; Siyu Zhou; Qian He; |
| 265 | UniversalBooth: Model-Agnostic Personalized Text-to-Image Generation. Highlight: In this paper, we instead propose a model-agnostic personalized method termed UniversalBooth.
Songhua Liu; Ruonan Yu; Xinchao Wang; |
| 266 | Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts. Highlight: This occurs partly because certain benign concepts (e.g., "skin") retained in DMs are related to the unlearned ones (e.g., "nudity"), facilitating their relearning via finetuning. To address this, we propose meta-unlearning on DMs.
Hongcheng Gao; Tianyu Pang; Chao Du; Taihang Hu; Zhijie Deng; Min Lin; |
| 267 | Hierarchy UGP: Hierarchy Unified Gaussian Primitive for Large-Scale Dynamic Scene Reconstruction. Highlight: Existing methods often struggle to scale to large scenes or accurately model arbitrary dynamics. To address these limitations, we propose Hierarchy UGP, which constructs a hierarchical structure consisting of a root level, a sub-scene level, and a primitive level, using Unified Gaussian Primitives (UGP) defined in 4D space as the representation.
Hongyang Sun; Qinglin Yang; Jiawei Wang; Zhen Xu; Chen Liu; Yida Wang; Kun Zhan; Hujun Bao; Xiaowei Zhou; Sida Peng; |
| 268 | MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs. Highlight: In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes.
Erik Daxberger; Nina Wenzel; David Griffiths; Haiming Gang; Justin Lazarow; Gefen Kohavi; Kai Kang; Marcin Eichner; Yinfei Yang; Afshin Dehghan; Peter Grasch; |
| 269 | EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis. Highlight: We present Exact Volumetric Ellipsoid Rendering (EVER), a method for real-time 3D reconstruction. EVER accurately blends an unlimited number of overlapping primitives together in 3D space, eliminating the popping artifacts that 3D Gaussian Splatting (3DGS) and other related methods exhibit. EVER represents a radiance field as a set of constant-density volumetric ellipsoids, which are raytraced by intersecting each primitive twice (once upon ray entrance and once upon ray exit) and accumulating the derivatives of the densities and colors along the ray. Because EVER is built around ray tracing, it also enables effects such as defocus blur and fish-eye camera distortion, while still achieving frame rates of 30 FPS at 720p on an NVIDIA RTX 4090.
Alexander Mai; Peter Hedman; George Kopanas; Dor Verbin; David Futschik; Qiangeng Xu; Falko Kuester; Jonathan T. Barron; Yinda Zhang; |
| 270 | Mobile Video Diffusion. Highlight: This paper introduces the first mobile-optimized image-to-video diffusion model.
Haitam Ben Yahia; Denis Korzhenkov; Ioannis Lelekas; Amir Ghodrati; Amirhossein Habibian; |
| 271 | Exploring The Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics. Highlight: In particular, we introduce two untargeted attack objectives that leverage spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory.
Taowen Wang; Cheng Han; James Liang; Wenhao Yang; Dongfang Liu; Luna Xinyu Zhang; Qifan Wang; Jiebo Luo; Ruixiang Tang; |
| 272 | PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening. Highlight: Conventional deep learning methods assume perfect pixel-wise alignment and rely on per-pixel reconstruction losses, leading to spectral distortion, double edges, and blurring when misalignment is present. To address this, we propose PAN-Crafter, a modality-consistent alignment framework that explicitly mitigates the misalignment gap between PAN and MS modalities.
Jeonghyeok Do; Sungpyo Kim; Geunhyuk Youk; Jaehyup Lee; Munchurl Kim; |
| 273 | Bridging The Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition. Highlight: Specifically, TDSM aligns skeleton features with text prompts by incorporating text features into the reverse diffusion process, where skeleton features are denoised under text guidance, forming a unified skeleton-text latent space for robust matching. To enhance discriminative power, we introduce a triplet diffusion (TD) loss that encourages our TDSM to align correct skeleton-text matches while pushing apart those of different action classes.
Jeonghyeok Do; Munchurl Kim; |
| 274 | Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics. Highlight: In contrast to most existing data-driven approaches that directly predict future trajectories, we rethink this task from a planning perspective, advocating a "First Reasoning, Then Forecasting" strategy that explicitly incorporates behavior intentions as spatial guidance for trajectory prediction. To achieve this, we introduce an interpretable, reward-driven intention reasoner grounded in a novel query-centric Inverse Reinforcement Learning (IRL) scheme.
Muleilan Pei; Shaoshuai Shi; Xuesong Chen; Xu Liu; Shaojie Shen; |
| 275 | FullDiT: Video Generative Foundation Models with Multimodal Control Via Full Attention. Highlight: Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including: branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms.
Xuan Ju; Weicai Ye; Quande Liu; Qiulin Wang; Xintao Wang; Pengfei Wan; Di Zhang; Kun Gai; Qiang Xu; |
| 276 | Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation. Highlight: In this paper, we share an interesting finding that training an MLLM with chain-of-thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes.
Jiaer Xia; Bingkui Tong; Yuhang Zang; Rui Shao; Kaiyang Zhou; |
| 277 | AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation. Highlight: We propose AV-Link, a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning.
Moayed Haji-Ali; Willi Menapace; Aliaksandr Siarohin; Ivan Skorokhodov; Alper Canberk; Kwot Sin Lee; Vicente Ordonez; Sergey Tulyakov; |
| 278 | StyleKeeper: Prevent Content Leakage Using Negative Visual Query Guidance. Highlight: However, existing methods often suffer from content leakage, where undesired elements of the visual style prompt are transferred along with the intended style. To address this issue, we 1) extend classifier-free guidance (CFG) to utilize swapping self-attention and propose 2) negative visual query guidance (NVQG) to reduce the transfer of unwanted contents.
Jaeseok Jeong; Junho Kim; Gayoung Lee; Yunjey Choi; Youngjung Uh; |
| 279 | Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation. Highlight: A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, complicating the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens.
Yuqing Wang; Zhijie Lin; Yao Teng; Yuanzhi Zhu; Shuhuai Ren; Jiashi Feng; Xihui Liu; |
| 280 | Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness. Highlight: We propose a collaborative framework, DataTailor, which leverages three key principles (informativeness, uniqueness, and representativeness) for effective data selection.
Qifan Yu; Zhebei Shen; Zhongqi Yue; Yang Wu; Bosheng Qin; Wenqiao Zhang; Yunfei Li; Juncheng Li; Siliang Tang; Yueting Zhuang; |
| 281 | VSP: Diagnosing The Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs. Highlight: To this end, we introduce VSP, a benchmark that 1) evaluates the spatial planning capability in MLLMs in general, and 2) diagnoses this capability via finer-grained sub-tasks, including perception and reasoning, and measures the capabilities of models through these sub-tasks.
Qiucheng Wu; Handong Zhao; Michael Saxon; Trung Bui; William Yang Wang; Yang Zhang; Shiyu Chang; |
| 282 | NeuralSVG: An Implicit Representation for Text-to-Vector Generation. Highlight: To encourage a layered structure in the generated SVG, we introduce a dropout-based regularization technique that strengthens the standalone meaning of each shape.
Sagi Polaczek; Yuval Alaluf; Elad Richardson; Yael Vinker; Daniel Cohen-Or; |
| 283 | Balanced Image Stylization with Style Matching Score. Highlight: We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models.
Yuxin Jiang; Liming Jiang; Shuai Yang; Jia-Wei Liu; Ivor W. Tsang; Mike Zheng Shou; |
| 284 | FreeScale: Unleashing The Resolution of Diffusion Models Via Tuning-Free Scale Fusion. Highlight: The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from the accumulated errors. To tackle this challenge, we propose FreeScale, a tuning-free inference paradigm to enable higher-resolution visual generation via scale fusion.
Haonan Qiu; Shiwei Zhang; Yujie Wei; Ruihang Chu; Hangjie Yuan; Xiang Wang; Yingya Zhang; Ziwei Liu; |
| 285 | AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance. Highlight: In this work, we explore a new task, Open-set Language-guided Dexterous Grasp, and find that the main challenge is the huge gap between high-level human language semantics and low-level robot action.
Yi-Lin Wei; Mu Lin; Yuhao Lin; Jian-Jian Jiang; Xiao-Ming Wu; Ling-An Zeng; Wei-Shi Zheng; |
| 286 | SA-LUT: Spatial Adaptive 4D Look-Up Table for Photorealistic Style Transfer. Highlight: Photorealistic style transfer (PST) enables real-world color grading by adapting reference image colors while preserving content structure. Existing methods mainly follow one of two approaches: generation-based methods that prioritize stylistic fidelity at the cost of content integrity and efficiency, or global color transformation methods such as LUTs, which preserve structure but lack local adaptability. To bridge this gap, we propose Spatial Adaptive 4D Look-Up Table (SA-LUT), combining LUT efficiency with neural network adaptability.
Zerui Gong; Zhonghua Wu; Qingyi Tao; Qinyue Li; Chen Change Loy; |
| 287 | MINERVA: Evaluating Complex Video Reasoning. Highlight: We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors.
Arsha Nagrani; Sachit Menon; Ahmet Iscen; Shyamal Buch; Ramin Mehran; Nilpa Jha; Anja Hauth; Yukun Zhu; Carl Vondrick; Mikhail Sirotenko; Cordelia Schmid; Tobias Weyand;
| 288 | After The Party: Navigating The Mapping From Color to Ambient Lighting. Highlight: However, existing methods often oversimplify this challenge by assuming a single light source or uniform, white-balanced lighting, leaving many of these complexities unaddressed. In this paper, we introduce CL3AN, the first large-scale, high-resolution dataset of its kind designed to facilitate the restoration of images captured under multiple Colored Light sources to their Ambient-Normalized counterparts.
Florin-Alexandru Vasluianu; Tim Seizinger; Zongwei Wu; Radu Timofte; |
| 289 | Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis. Highlight: In this work, we show that we only need a single parameter ω to effectively control granularity in diffusion-based synthesis.
Xinyu Hou; Zongsheng Yue; Xiaoming Li; Chen Change Loy; |
| 290 | VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models. Highlight: We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues.
Kim Sung-Bin; Jeongsoo Choi; Puyuan Peng; Joon Son Chung; Tae-Hyun Oh; David Harwath; |
| 291 | Growing A Twig to Accelerate Large Vision-Language Models. Highlight: To address the limitations above, we present TwigVLM—a simple and general architecture by "growing" a lightweight twig upon an early layer of the base VLM.
Zhenwei Shao; Mingyang Wang; Zhou Yu; Wenwen Pan; Yan Yang; Tao Wei; Hongyuan Zhang; Ning Mao; Wei Chen; Jun Yu; |
| 292 | ZeroStereo: Zero-shot Stereo Matching from Single Images. Highlight: In this paper, we propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching.
Xianqi Wang; Hao Yang; Gangwei Xu; Junda Cheng; Min Lin; Yong Deng; Jinliang Zang; Yurui Chen; Xin Yang; |
| 293 | MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control. Highlight: Although DiT with 3D VAE has become a standard framework for video generation, it introduces challenges in controllable driving video generation, especially for frame-wise geometric control, rendering existing methods ineffective. To address these issues, we propose MagicDrive-V2, a novel approach that integrates the MVDiT block and spatial-temporal conditional encoding to enable multi-view video generation and precise geometric control.
Ruiyuan Gao; Kai Chen; Bo Xiao; Lanqing Hong; Zhenguo Li; Qiang Xu; |
| 294 | A Unified Framework for Motion Reasoning and Generation in Human Interaction. Highlight: However, understanding and generating interactive human-like motion, especially involving coordinated interactive motion, remains a challenging problem due to its inherent complexity. To address this, we present MoLaM, the Interactive Motion-LAnguage Model, a unified architecture that jointly processes language and motion modalities for understanding, generating, and controlling interactive motions in multi-turn conversational settings.
Jeongeun Park; Sungjoon Choi; Sangdoo Yun; |
| 295 | AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. |
Zijie Wu; Chaohui Yu; Fan Wang; Xiang Bai; |
| 296 | Doppler-Aware LiDAR-RADAR Fusion for Weather-Robust 3D Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods often handle Doppler in ways that are not well-suited for multi-modal settings or lack tailored encoding strategies, hindering effective feature fusion and performance. To address these shortcomings, we propose a novel Doppler-aware LiDAR-4D RADAR fusion (DLR-Fusion) framework for robust 3D object detection. |
Yujeong Chae; Heejun Park; Hyeonseong Kim; Kuk-Jin Yoon; |
| 297 | KV-Edit: Training-Free Image Editing for Precise Background Preservation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Here, we propose KV-Edit, a training-free approach that uses KV cache in DiTs to maintain background consistency, where background tokens are preserved rather than regenerated, eliminating the need for complex mechanisms or expensive training, ultimately generating new content that seamlessly integrates with the background within user-provided regions. |
Tianrui Zhu; Shiyi Zhang; Jiawei Shao; Yansong Tang; |
| 298 | Aether: Geometric-Aware Unified World Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. |
Haoyi Zhu; Yifan Wang; Jianjun Zhou; Wenzheng Chang; Yang Zhou; Zizun Li; Junyi Chen; Chunhua Shen; Jiangmiao Pang; Tong He; |
| 299 | OmniPaint: Mastering Object-Oriented Editing Via Disentangled Insertion-Removal Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce OmniPaint, a unified framework that re-conceptualizes object removal and insertion as interdependent processes rather than isolated tasks. |
Yongsheng Yu; Ziyun Zeng; Haitian Zheng; Jiebo Luo; |
| 300 | TokensGen: Harnessing Condensed Tokens for Long Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Generating consistent long videos is a complex challenge: while diffusion-based generative models generate visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. |
Wenqi Ouyang; Zeqi Xiao; Danni Yang; Yifan Zhou; Shuai Yang; Lei Yang; Jianlou Si; Xingang Pan; |
| 301 | DiffVSR: Revealing An Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We identify that existing diffusion-based VSR methods struggle primarily because they face an overwhelming learning burden: simultaneously modeling complex degradation distributions, content representations, and temporal relationships with limited high-quality training data. To address this fundamental challenge, we present DiffVSR, featuring a Progressive Learning Strategy (PLS) that systematically decomposes this learning burden through staged training, enabling superior performance on complex degradations. |
Xiaohui Li; Yihao Liu; Shuo Cao; Ziyan Chen; Shaobin Zhuang; Xiangyu Chen; Yinan He; Yi Wang; Yu Qiao; |
| 302 | From Panels to Prose: Generating Literary Narratives from Comics Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the visual nature of comics presents a significant barrier for visually impaired readers, limiting their access to these engaging stories. In this work, we provide a pragmatic solution to this accessibility challenge by developing an automated system that generates text-based literary narratives from manga comics. |
Ragav Sachdeva; Andrew Zisserman; |
| 303 | Perspective-Aware Reasoning in Vision-Language Models Via Mental Imagery Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. |
Phillip Y. Lee; Jihyeon Je; Chanho Park; Mikaela Angelina Uy; Leonidas Guibas; Minhyuk Sung; |
| 304 | ERNet: Efficient Non-Rigid Registration Network for Point Sequences Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The key difficulties stem from two factors: (i) the presence of local minima due to the non-convexity of registration objectives, especially under noisy or partial inputs, which hinders accurate and robust deformation estimation, and (ii) error accumulation over long sequences, leading to tracking failures. To address these challenges, we adopt a scalable data-driven approach and propose ERNet, an efficient feed-forward model trained on large deformation datasets. It is designed to handle noisy and partial inputs while effectively leveraging temporal information for accurate and consistent sequential registration. |
Guangzhao He; Yuxi Xiao; Zhen Xu; Xiaowei Zhou; Sida Peng; |
| 305 | Heavy Labels Out! Dataset Distillation with Label Space Lightening Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: As a result, the required storage can be comparable even to original datasets, especially for large-scale ones. To solve this problem, instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO, aiming at effective image-to-label projectors, with which synthetic labels can be directly generated online from synthetic images. |
Ruonan Yu; Songhua Liu; Zigeng Chen; Jingwen Ye; Xinchao Wang; |
| 306 | GUAVA: Generalizable Upper Body 3D Gaussian Avatar Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address these challenges, we first introduce an expressive human model (EHM) to enhance facial expression capabilities and develop an accurate tracking method. Based on this template model, we propose GUAVA, the first framework for fast animatable upper-body 3D Gaussian avatar reconstruction. |
Dongbin Zhang; Yunfei Liu; Lijian Lin; Ye Zhu; Yang Li; Minghan Qin; Yu Li; Haoqian Wang; |
| 307 | Function-centric Bayesian Network for Zero-Shot Object Goal Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose the Function-centric Bayesian Network (FBN) for the zero-shot ObjectNav task. |
Sixian Zhang; Xinyao Yu; Xinhang Song; Yiyao Wang; Shuqiang Jiang; |
| 308 | Self-Reinforcing Prototype Evolution with Dual-Knowledge Cooperation for Semi-Supervised Lifelong Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we pioneer the investigation of Semi-LReID, introducing a novel Self-Reinforcing PRototype Evolution with Dual-Knowledge Cooperation framework (SPRED). |
Kunlun Xu; Fan Zhuo; Jiangmeng Li; Xu Zou; Jiahuan Zhou; |
| 309 | ORION: A Holistic End-to-End Autonomous Driving Framework By Vision-Language Instructed Action Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, it remains an open problem that few VLMs for E2E methods perform well in closed-loop evaluation, due to the gap between the semantic reasoning space and the purely numerical trajectory output in the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precision trajectory prediction. |
Haoyu Fu; Diankun Zhang; Zongchuang Zhao; Jianfeng Cui; Dingkang Liang; Chong Zhang; Dingyuan Zhang; Hongwei Xie; Bing Wang; Xiang Bai; |
| 310 | Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, this task poses significant challenges, including the accurate modeling of complex style patterns–encompassing both intra- and inter-word relationships–and maintaining content accuracy across numerous characters. To address these challenges, we propose DiffBrush, a novel diffusion-based model for handwritten text-line generation. |
Gang Dai; Yifan Zhang; Yutao Qin; Qiangya Guo; Shuangping Huang; Shuicheng Yan; |
| 311 | WonderPlay: Dynamic 3D Scene Generation from A Single Image and Actions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: WonderPlay is a novel framework integrating physics simulation with video generation for generating action-conditioned dynamic 3D scenes from a single image. |
Zizhang Li; Hong-Xing Yu; Wei Liu; Yin Yang; Charles Herrmann; Gordon Wetzstein; Jiajun Wu; |
| 312 | DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Even state-of-the-art models like FLUX and 3DIS face challenges, such as attribute leakage between instances, which limits user control. To address these issues, we introduce DreamRenderer, a training-free approach built upon the FLUX model. |
Dewei Zhou; Mingwei Li; Zongxin Yang; Yi Yang; |
| 313 | OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Two major concerns for this application include 1) the inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding that the original SAM2 cannot provide. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 to panoramic semantic segmentation. |
Ding Zhong; Xu Zheng; Chenfei Liao; Yuanhuiyi Lyu; Jialei Chen; Shengyang Wu; Linfeng Zhang; Xuming Hu; |
| 314 | Edicho: Consistent Image Editing in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Edicho steps in with a training-free solution based on diffusion models, featuring a fundamental design principle of using explicit image correspondence to direct editing. |
Qingyan Bai; Hao Ouyang; Yinghao Xu; Qiuyu Wang; Ceyuan Yang; Ka Leong Cheng; Yujun Shen; Qifeng Chen; |
| 315 | Trans-Adapter: A Plug-and-Play Framework for Transparent Image Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This pipeline, however, struggles to preserve transparency consistency in edited regions, and matting can introduce jagged edges along transparency boundaries. To address these challenges, we propose Trans-Adapter, a plug-and-play adapter that enables diffusion-based inpainting models to process transparent images directly. |
Yuekun Dai; Haitian Li; Shangchen Zhou; Chen Change Loy; |
| 316 | MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, due to the physical differences of metalenses, there is a large gap in data acquisition and algorithm research. In light of this, we aim to bridge this unexplored gap, advancing novel metalens endoscopy. |
Wuyang Li; Wentao Pan; Xiaoyuan Liu; Zhendong Luo; Chenxin Li; Hengyu Liu; Din Ping Tsai; Mu Ku Chen; Yixuan Yuan; |
| 317 | PS-Mamba: Spatial-Temporal Graph Mamba for Pose Sequence Refinement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose PS-Mamba, a novel framework that refines human pose sequences by integrating spatial-temporal graph learning with state space modeling. |
Haoye Dong; Gim Hee Lee; |
| 318 | From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In contrast, we extend the model to generate smooth and consistent attribute transitions by introducing frame-wise guidance for the video latent during the denoising process. |
Ling Lo; Kelvin C.K. Chan; Wen-Huang Cheng; Ming-Hsuan Yang; |
| 319 | Visual Test-time Scaling for GUI Agent Grounding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. |
Tiange Luo; Lajanugen Logeswaran; Justin Johnson; Honglak Lee; |
| 320 | 3D Mesh Editing Using Masked LRMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a novel approach to mesh shape editing, building on recent progress in 3D reconstruction from multi-view images. |
Will Gao; Dilin Wang; Yuchen Fan; Aljaz Bozic; Tuur Stuyck; Zhengqin Li; Zhao Dong; Rakesh Ranjan; Nikolaos Sarafianos; |
| 321 | Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we present a new benchmark named OpenBench that differs significantly from the training semantics. |
Yong Liu; Song-Li Wu; Sule Bai; Jiahao Wang; Yitong Wang; Yansong Tang; |
| 322 | AV-Flow: Transforming Text to Audio-Visual Human-like Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. |
Aggelina Chatziagapi; Louis-Philippe Morency; Hongyu Gong; Michael Zollhöfer; Dimitris Samaras; Alexander Richard; |
| 323 | VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. |
Sihan Yang; Runsen Xu; Chenhang Cui; Tai Wang; Dahua Lin; Jiangmiao Pang; |
| 324 | Prompt-A-Video: Prompt Your Video Diffusion Model Via Preference-Aligned LLM Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unawareness when applied to text-to-video diffusion models. To address these problems, we introduce an LLM-based prompt adaptation framework, termed Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to specific video diffusion models. |
Yatai Ji; Jiacheng Zhang; Jie Wu; Shilong Zhang; Shoufa Chen; Chongjian Ge; Peize Sun; Weifeng Chen; Wenqi Shao; Xuefeng Xiao; Weilin Huang; Ping Luo; |
| 325 | VLR-Driver: Large Vision-Language-Reasoning Models for Embodied Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose VLR-Driver, a novel multi-modal Vision-Language-Reasoning (VLR) framework based on Chain of Thought (CoT) for embodied autonomous driving. |
Fanjie Kong; Yitong Li; Weihuang Chen; Chen Min; Yizhe Li; Zhiqiang Gao; Haoyang Li; Zhongyu Guo; Hongbin Sun; |
| 326 | IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. To address this Instance Feature Generation (IFG) task, we introduce the Instance Feature Adapter (IFAdapter). |
Yinwei Wu; Xianpan Zhou; Bing Ma; Xuefeng Su; Kai Ma; Xinchao Wang; |
| 327 | HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, progress in this area has been significantly limited due to the lack of publicly available, high-quality datasets, especially for full-body human captures. To address this critical gap, we introduce the HumanOLAT dataset, the first publicly accessible large-scale dataset providing multi-view One-Light-at-a-Time (OLAT) captures of full-body humans. |
Timo Teufel; Pulkit Gera; Xilong Zhou; Umar Iqbal; Pramod Rao; Jan Kautz; Vladislav Golyanik; Christian Theobalt; |
| 328 | JailbreakDiffBench: A Comprehensive Benchmark for Jailbreaking Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the lack of standardized evaluation makes it difficult to assess the robustness of diffusion model systems. To address this, we introduce JailbreakDiffBench, a comprehensive benchmark for systematically evaluating the safety of diffusion models against various attacks and under different defenses. |
Xiaolong Jin; Zixuan Weng; Hanxi Guo; Chenlong Yin; Siyuan Cheng; Guangyu Shen; Xiangyu Zhang; |
| 329 | Anchor Token Matching: Implicit Structure Locking for Training-free AR Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we introduce Implicit Structure Locking (*ISLock*), the first training-free editing strategy for AR visual models. |
Taihang Hu; Linxuan Li; Kai Wang; Yaxing Wang; Jian Yang; Ming-Ming Cheng; |
| 330 | Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. |
Tianqi Liu; Zihao Huang; Zhaoxi Chen; Guangcong Wang; Shoukang Hu; Liao Shen; Huiqiang Sun; Zhiguo Cao; Wei Li; Ziwei Liu; |
| 331 | Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. |
Zhi Hou; Tianyi Zhang; Yuwen Xiong; Haonan Duan; Hengjun Pu; Ronglei Tong; Chengyang Zhao; Xizhou Zhu; Yu Qiao; Jifeng Dai; Yuntao Chen; |
| 332 | Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: They therefore have limited capability in recognizing diverse abnormality details that deviate from these general abnormal patterns in various ways. To address this limitation, we propose FAPrompt, a novel framework designed to learn Fine-grained Abnormality Prompts for accurate ZSAD. |
Jiawen Zhu; Yew-Soon Ong; Chunhua Shen; Guansong Pang; |
| 333 | Towards Performance Consistency in Multi-Level Model Collaboration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To verify whether our findings are practical, we introduce a validation framework termed Neural Ligand (NeuLig). |
Qi Li; Runpeng Yu; Xinchao Wang; |
| 334 | ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction Via Score-Guided Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce ScoreHOI, an effective diffusion-based optimizer that introduces diffusion priors for the precise recovery of human-object interactions. |
Ao Li; Jinpeng Liu; Yixuan Zhu; Yansong Tang; |
| 335 | Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: One key challenge in utilizing LMMs for these tasks is the extraction of useful features from generative LMMs. To overcome this, we propose an approach that leverages multimodal feature extraction from the LMM’s latent space. |
Chancharik Mitra; Brandon Huang; Tianning Chai; Zhiqiu Lin; Assaf Arbelle; Rogerio Feris; Leonid Karlinsky; Trevor Darrell; Deva Ramanan; Roei Herzig; |
| 336 | 4D Visual Pre-training for Robot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Instead, we are seeking a general visual pre-training framework that could improve all 3D representations as an alternative. |
Chengkai Hou; Yanjie Ze; Yankai Fu; Zeyu Gao; Songbo Hu; Yue Yu; Shanghang Zhang; Huazhe Xu; |
| 337 | FairGen: Enhancing Fairness in Text-to-Image Diffusion Models Via Self-Discovering Latent Directions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods for debiasing DMs usually require model re-training with a human-crafted reference dataset or additional classifiers, which suffer from two major limitations: (1) collecting reference datasets incurs expensive annotation costs; (2) the debiasing performance is heavily constrained by the quality of the reference dataset or the additional classifier. To address these limitations, we propose FairGen, a plug-and-play method that learns attribute latent directions in a self-discovering manner, thus eliminating the reliance on such reference datasets. |
Yilei Jiang; Wei-Hong Li; Yiyuan Zhang; Minghong Cai; Xiangyu Yue; |
| 338 | PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address these, we propose Primitive-Aware Semantic Grounding (PASG), a closed-loop framework that introduces: (1) Automatic primitive extraction through geometric feature aggregation, enabling cross-category detection of keypoints and axes; (2) VLM-driven semantic anchoring that dynamically couples geometric primitives with functional affordances and task-relevant description; (3) A spatial-semantic reasoning benchmark and a fine-tuned VLM (Qwen2.5VL-PA). |
Zhihao Zhu; Yifan Zheng; Siyu Pan; Yaohui Jin; Yao Mu; |
| 339 | FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models Via Visual Registers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, most existing high-resolution MLLMs rely on a cropping-based approach to process images, which leads to fragmented visual encoding and a sharp increase in redundant tokens. To tackle these issues, we propose the FALCON model. |
Renshan Zhang; Rui Shao; Gongwei Chen; Miao Zhang; Kaiwen Zhou; Weili Guan; Liqiang Nie; |
| 340 | Generating Physically Stable and Buildable Brick Structures from Text Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce BrickGPT, the first approach for generating physically stable interconnecting brick assembly models from text prompts. |
Ava Pun; Kangle Deng; Ruixuan Liu; Deva Ramanan; Changliu Liu; Jun-Yan Zhu; |
| 341 | Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. |
Jeongseok Hyun; Sukjun Hwang; Su Ho Han; Taeoh Kim; Inwoong Lee; Dongyoon Wee; Joon-Young Lee; Seon Joo Kim; Minho Shim; |
| 342 | RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In contrast, 3D-based texture synthesis methods aim to address these inconsistencies, but they often neglect 2D diffusion model priors, making them challenging to apply to real-world objects. To overcome these limitations, we propose RomanTex, a multiview-based texture generation framework that integrates a multi-attention network with an underlying 3D representation, facilitated by our novel 3D-aware Rotary Positional Embedding. |
Yifei Feng; Mingxin Yang; Shuhui Yang; Sheng Zhang; Jiaao Yu; Zibo Zhao; Yuhong Liu; Jie Jiang; Chunchao Guo; |
| 343 | Advancing Visual Large Language Model for Multi-granular Versatile Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Notably, existing research often focuses solely on a limited subset of these potential combinations, which constrains their applicability and versatility across various contexts. In response to this challenge, we present MVL-LM, a Multi-granular and Versatile Perception framework incorporating a Visual Large Language Model. |
Wentao Xiang; Haoxian Tan; Yujie Zhong; Cong Wei; Dengjie Li; Yujiu Yang; |
| 344 | Multi-scenario Overlapping Text Segmentation with Depth Awareness Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing research has primarily addressed the overlapping problem in documents, its applicability to other scenes remains limited. To bridge this gap, we propose a new task of multi-scenario overlapping text segmentation and introduce a corresponding real dataset in both English and Chinese, spanning various contexts such as printed text, bills, artistic designs, and house numbers. |
Yang Liu; Xudong Xie; Yuliang Liu; Xiang Bai; |
| 345 | MOBIUS: Big-to-Mobile Universal Instance Segmentation Via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. |
Mattia Segu; Marta Tintore Gazulla; Yongqin Xian; Luc Van Gool; Federico Tombari; |
| 346 | Exploring Probabilistic Modeling Beyond Domain Generalization for Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce PDAF, a Probabilistic Diffusion Alignment Framework that enhances the generalization of existing segmentation networks through probabilistic diffusion modeling. |
I-Hsiang Chen; Hua-En Chang; Wei-Ting Chen; Jenq-Neng Hwang; Sy-Yen Kuo; |
| 347 | Holistic Tokenizer for Autoregressive Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, because most visual tokenizers map local image patches into latent tokens, global information is limited. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. |
Anlin Zheng; Haochen Wang; Yucheng Zhao; Weipeng Deng; Tiancai Wang; Xiangyu Zhang; Xiaojuan Qi; |
| 348 | Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our objective is the automatic generation of Audio Descriptions (ADs) for edited video material, such as movies and TV series. To achieve this, we propose a two-stage framework that leverages "shots" as the fundamental units of video understanding. |
Junyu Xie; Tengda Han; Max Bain; Arsha Nagrani; Eshika Khandelwal; Gül Varol; Weidi Xie; Andrew Zisserman; |
| 349 | T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. |
Chieh-Yun Chen; Min Shi; Gong Zhang; Humphrey Shi; |
| 350 | Seeing The Unseen: A Semantic Alignment and Context-Aware Prompt Framework for Open-Vocabulary Camouflaged Object Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although existing open-vocabulary methods exhibit strong segmentation capabilities, they still have a major limitation in camouflaged scenarios: semantic confusion, which leads to incomplete segmentation and class shift in the model. To mitigate this limitation, we propose a framework for OVCOS, named SuCLIP. |
Peng Ren; Tian Bai; Jing Sun; Fuming Sun; |
| 351 | LLaVA-KD: A Framework of Distilling Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Specifically, we introduce Multimodal Distillation (MDist) to transfer the teacher model’s robust representations across both visual and linguistic modalities, and Relation Distillation (RDist) to transfer the teacher model’s ability to capture visual token relationships. Additionally, we propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy: 1) Distilled Pre-Training to strengthen the alignment between visual-linguistic representations in s-MLLMs, 2) Supervised Fine-Tuning to equip the s-MLLMs with multimodal understanding capacity, and 3) Distilled Fine-Tuning to refine the s-MLLM’s knowledge. Our approach significantly improves s-MLLM performance without altering the model architecture. |
Yuxuan Cai; Jiangning Zhang; Haoyang He; Xinwei He; Ao Tong; Zhenye Gan; Chengjie Wang; Zhucun Xue; Yong Liu; Xiang Bai; |
| 352 | Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce Part-Aware Point Grounded Description (PaPGD), a challenging task aimed at advancing 3D multimodal learning for fine-grained, part-aware segmentation grounding and detailed explanation of 3D objects. |
Mahmoud Ahmed; Junjie Fei; Jian Ding; Eslam Mohamed Bakr; Mohamed Elhoseiny; |
| 353 | Chimera: Improving Generalist Model with Domain-Specific Experts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Directly integrating expert models tailored for those tasks is also challenging due to representational gaps and imbalanced optimization. To address these challenges, we introduce Chimera, a scalable and low-cost multi-modal pipeline designed to boost the ability of existing LMMs with domain-specific experts. |
Tianshuo Peng; Mingsheng Li; Jiakang Yuan; Hongbin Zhou; Renqiu Xia; Renrui Zhang; Lei Bai; Song Mao; Bin Wang; Aojun Zhou; Botian Shi; Tao Chen; Bo Zhang; Xiangyu Yue; |
| 354 | Flow to The Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose FlowMo, a transformer-based diffusion autoencoder. |
Kyle Sargent; Kyle Hsu; Justin Johnson; Li Fei-Fei; Jiajun Wu; |
| 355 | ViSpeak: Visual Instruction Feedback in Streaming Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Instead, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-modal and interactive characteristics. In this work, we aim to extend the streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback in which models should be aware of visual contents and learn to extract instructions from them. |
Shenghao Fu; Qize Yang; Yuan-Ming Li; Yi-Xing Peng; Kun-Yu Lin; Xihan Wei; Jian-Fang Hu; Xiaohua Xie; Wei-Shi Zheng; |
| 356 | IRASim: A Fine-Grained World Model for Robot Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details, conditioned on historical observations and robot action trajectories. |
Fangqi Zhu; Hongtao Wu; Song Guo; Yuxiao Liu; Chilam Cheang; Tao Kong; |
| 357 | PrimHOI: Compositional Human-Object Interaction Via Reusable Primitives Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Here we show that PrimHOI generates complex HOI motions through spatial and temporal composition of generalizable interaction primitives defined by relative geometry. |
Kai Jia; Tengyu Liu; Mingtao Pei; Yixin Zhu; Siyuan Huang; |
| 358 | Enrich and Detect: Video Temporal Grounding with Multimodal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models. |
Shraman Pramanick; Effrosyni Mavroudi; Yale Song; Rama Chellappa; Lorenzo Torresani; Triantafyllos Afouras; |
| 359 | Multimodal LLM Guided Exploration and Active Mapping Using Fisher Information Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: By leveraging high-quality view synthesis from our 3DGS representation, our method employs a multimodal LLM as a zero-shot planner for long-horizon exploration goals from the semantic perspective. |
Wen Jiang; Boshu Lei; Katrina Ashton; Kostas Daniilidis; |
| 360 | Towards Higher Effective Rank in Parameter-Efficient Fine-tuning Using Khatri-Rao Product Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We further show that full-rank methods can reduce LoRA’s approximation error on these matrix types for an equal parameter count. Our evaluation then extends beyond synthetic tasks, where we observe that LoRA’s restricted work subspace can produce high-norm updates, leading to over-fitting and poor out-of-distribution generalization. We address these limits by introducing KRAdapter, a novel PEFT algorithm that uses properties of the Khatri-Rao matrix product to produce weight matrices of higher effective rank and lower norm than related PEFT algorithms. We show the performance improvements of KRAdapter on vision-language models up to 1B parameters and 8B for LLMs, where we report 20 to 25 points of accuracy improvement over LoRA when reasoning on commonsense tasks unseen during training. |
Paul Albert; Frederic Z. Zhang; Hemanth Saratchandran; Anton van den Hengel; Ehsan Abbasnejad; |
| 361 | Phantom: Subject-Consistent Video Generation Via Cross-Modal Alignment Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. |
Lijie Liu; Tianxiang Ma; Bingchuan Li; Zhuowei Chen; Jiawei Liu; Gen Li; Siyu Zhou; Qian He; Xinglong Wu; |
| 362 | EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these approaches do not effectively model the motions of dynamic objects (e.g., the motion speed of pedestrians is clearly different from that of vehicles), resulting in suboptimal scene decomposition. To address this, we propose Explicit Motion Decomposition (EMD), which models the motions of dynamic objects by introducing learnable motion embeddings to the Gaussians, enhancing the decomposition in street scenes. |
Xiaobao Wei; Qingpo Wuwu; Zhongyu Zhao; Zhuangzhe Wu; Nan Huang; Ming Lu; Ningning Ma; Shanghang Zhang; |
| 363 | Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. In this regard, we propose a novel diffusion model Causal-VidSyn for synthesizing egocentric traffic accident videos. |
Lei-Lei Li; Jianwu Fang; Junbin Xiao; Shanmin Pang; Hongkai Yu; Chen Lv; Jianru Xue; Tat-Seng Chua; |
| 364 | HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present a unified Driving World Model named HERMES. |
Xin Zhou; Dingkang Liang; Sifan Tu; Xiwu Chen; Yikang Ding; Dingyuan Zhang; Feiyang Tan; Hengshuang Zhao; Xiang Bai; |
| 365 | Feature Extraction and Representation of Pre-training Point Cloud Based on Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The pretrain-finetune paradigm of pre-training a model on large amounts of image and text data and then fine-tuning the model for a specific task has led to significant progress in many 2D image and natural language processing tasks. Similarly, the use of pre-training methods in point cloud data can also enhance the working performance and generalization ability of the model. Therefore, in this paper, we propose a pre-training framework based on a diffusion model called PreDifPoint. |
Chang Qiu; Feipeng Da; Zilei Zhang; |
| 366 | BANet: Bilateral Aggregation Network for Mobile Stereo Matching Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we present a novel bilateral aggregation network (BANet) for mobile stereo matching that produces high-quality results with sharp edges and fine details using only 2D convolutions. |
Gangwei Xu; Jiaxin Liu; Xianqi Wang; Junda Cheng; Yong Deng; Jinliang Zang; Yurui Chen; Xin Yang; |
| 367 | DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. |
Wenwen Yu; Zhibo Yang; Yuliang Liu; Xiang Bai; |
| 368 | TrackVerse: A Large-Scale Object-Centric Video Dataset for Image-Level Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To explore unsupervised object representation learning grounded in object dynamics–beyond static appearance–we introduce TrackVerse, a large-scale video dataset of 31.9 million object tracks spanning over 1,000 categories, each capturing the motion, appearance, and evolving states of an object over time. |
Yibing Wei; Samuel Church; Victor Suciu; Jinhong Lin; Cheng-En Wu; Pedro Morgado; |
| 369 | CharaConsist: Fine-Grained Consistent Character Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Furthermore, when the foreground character undergoes large motion variations, inconsistencies in identity and clothing details become evident. To address these problems, we propose CharaConsist, which employs point-tracking attention and adaptive token merge along with decoupled control of the foreground and background. CharaConsist enables fine-grained consistency for both foreground and background, supporting the generation of one character in continuous shots within a fixed scene or in discrete shots across different scenes. Moreover, CharaConsist is the first consistent generation method tailored for text-to-image DiT models. |
Mengyu Wang; Henghui Ding; Jianing Peng; Yao Zhao; Yunpeng Chen; Yunchao Wei; |
| 370 | Radiant Foam: Real-Time Differentiable Ray Tracing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This has yielded a significant improvement in rendering speeds due to the efficiency of rasterization algorithms and hardware, but has come at a cost: the approximations that make rasterization efficient also make implementation of light transport phenomena like reflection and refraction much more difficult. We propose a novel scene representation which avoids these approximations, but keeps the efficiency and reconstruction quality of splatting by leveraging a decades-old efficient volumetric mesh ray tracing algorithm which has been largely overlooked in recent computer vision research. |
Shrisudhan Govindarajan; Daniel Rebain; Kwang Moo Yi; Andrea Tagliasacchi; |
| 371 | Rethinking Layered Graphic Design Generation with A Top-Down Approach Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite this, non-layered designs still inspire human designers, influencing their choices in layouts and text styles, ultimately guiding the creation of layered designs. Motivated by this observation, we propose Accordion, a graphic design generation framework taking the first attempt to convert AI-generated designs into editable layered designs, meanwhile refining nonsensical AI-generated text with meaningful alternatives guided by user prompts. |
Jingye Chen; Zhaowen Wang; Nanxuan Zhao; Li Zhang; Difan Liu; Jimei Yang; Qifeng Chen; |
| 372 | GeoSplatting: Towards Geometry Guided Gaussian Splatting for Physically-based Inverse Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they usually suffer from inaccuracies in normal estimation that subsequently degrade light transport, resulting in noisy material decomposition and flawed relighting results. To address this, we propose GeoSplatting, a novel approach that augments 3DGS with explicit geometry guidance for precise light transport modeling. |
Kai Ye; Chong Gao; Guanbin Li; Wenzheng Chen; Baoquan Chen; |
| 373 | Augmented and Softened Matching for Unsupervised Visible-Infrared Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing UVI-ReID methods have made substantial efforts during the optimization phase to enhance the model’s robustness to color variations, they often overlook the impact of color variations on the acquisition of pseudo-labels. To address this, in this paper, we focus on improving the robustness of pseudo-labels to color variations through data augmentation and propose an augmented and softened matching (ASM) method. |
Zhiqi Pang; Chunyu Wang; Lingling Zhao; Junjie Wang; |
| 374 | Demeter: A Parametric Model of Crop Plant Morphology from The Real World Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present Demeter, a data-driven parametric model that encodes key factors of plant morphology, including topology, shape, articulation, and deformation, into a compact learned representation. |
Tianhang Cheng; Albert J. Zhai; Evan Z. Chen; Rui Zhou; Yawen Deng; Zitong Li; Kejie Zhao; Janice Shiu; Qianyu Zhao; Yide Xu; Xinlei Wang; Yuan Shen; Sheng Wang; Lisa Ainsworth; Kaiyu Guan; Shenlong Wang; |
| 375 | DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our work aims to unify both driving model simulation and trajectory planning into a single sequence modeling problem. |
Yuntao Chen; Yuqi Wang; Zhaoxiang Zhang; |
| 376 | LD-RPS: Zero-Shot Unified Image Restoration Via Latent Diffusion Recurrent Posterior Sampling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods either make tailored designs for specific tasks, limiting their generalizability across various types of degradation, or rely on training with paired datasets, thereby suffering from closed-set constraints. To address these issues, we propose a novel, dataset-free, and unified approach through recurrent posterior sampling utilizing a pretrained latent diffusion model. |
Huaqiu Li; Yong Wang; Tongwen Huang; Hailang Huang; Haoqian Wang; Xiangxiang Chu; |
| 377 | Deep Adaptive Unfolded Network Via Spatial Morphology Stripping and Spectral Filtration for Pan-sharpening Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Besides, validating pan-sharpening performance in high-level semantic tasks is intractable due to the absence of datasets. To tackle these issues, we propose a deep adaptive unfolded network via spatial morphology stripping and spectral filtration for pan-sharpening, which is conceptualized as a linear inverse problem regularized by spatial and spectral priors. |
Hebaixu Wang; Jiayi Ma; |
| 378 | CoA-VLA: Improving Vision-Language-Action Models Via Visual-Text Chain-of-Affordance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: OpenAI’s recent model, O1, showcased impressive capabilities in solving complex problems by utilizing extensive reasoning chains. This prompts an important question: can robot models achieve better performance in multi-task, complex environments by reviewing prior observations and then providing task-specific reasoning to guide action prediction? In this paper, we introduce Chain-of-Affordance (CoA-VLA), a novel approach to scaling robot models by incorporating reasoning in the format of sequential robot affordances to facilitate task completion. |
Jinming Li; Yichen Zhu; Zhibin Tang; Junjie Wen; Minjie Zhu; Xiaoyu Liu; Chengmeng Li; Ran Cheng; Yaxin Peng; Yan Peng; Feifei Feng; |
| 379 | MultiModal Action Conditioned Video Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: General-purpose household robots require real-time fine motor control to handle delicate tasks and urgent situations. In this work, we introduce fine-grained multimodal actions to capture such precise control. |
Yichen Li; Antonio Torralba; |
| 380 | EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Most existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents that must gradually perceive the scene through progressive embodied exploration. In this paper, we formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it. |
Yuqi Wu; Wenzhao Zheng; Sicheng Zuo; Yuanhui Huang; Jie Zhou; Jiwen Lu; |
| 381 | Continuous-Time Human Motion Field from Event Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we predict a continuous-time human motion field from events caused by human motion. |
Ziyun Wang; Ruijun Zhang; Zi-Yan Liu; Yufu Wang; Kostas Daniilidis; |
| 382 | Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce Amodal3R, a conditional image-to-3D model designed to reconstruct plausible 3D geometry and appearance from partial observations. |
Tianhao Wu; Chuanxia Zheng; Frank Guan; Andrea Vedaldi; Tat-Jen Cham; |
| 383 | Structured Policy Optimization: Enhance Large Vision-Language Model Via Self-referenced Dialogue Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: For large vision-language models (LVLMs), direct preference optimization (DPO) can over-emphasize linguistic nuances while overlooking visual context. To address this challenge, we introduce structured policy optimization (SPO) — a novel preference optimization method that simultaneously aligns preference instructions, responses, and dialogue interactions to improve multi-modal understanding and reasoning capabilities. |
Guohao Sun; Can Qin; Yihao Feng; Zeyuan Chen; Ran Xu; Sohail Dianat; Majid Rabbani; Raghuveer Rao; Zhiqiang Tao; |
| 384 | From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Multi-image Interleaved Reasoning aims to improve Multimodal Large Language Models’ (MLLMs) ability to jointly comprehend and reason across multiple images and their associated textual contexts, introducing unique challenges beyond single-image or non-interleaved multi-image tasks. While current multi-image benchmarks overlook interleaved textual contexts and neglect distinct relationships between individual images and their associated texts, enabling models to reason over multi-image interleaved data may significantly enhance their comprehension of complex scenes and better capture cross-modal correlations. To bridge this gap, we introduce a novel benchmark MIR, requiring joint reasoning over multiple images accompanied by interleaved textual contexts to accurately associate image regions with corresponding texts and logically connect information across images. To enhance MLLMs’ ability to comprehend multi-image interleaved data, we introduce reasoning steps for each instance within the benchmark and propose a stage-wise curriculum learning strategy. |
Hang Du; Jiayang Zhang; Guoshun Nan; Wendi Deng; Zhenyan Chen; Chenyang Zhang; Wang Xiao; Shan Huang; Yuqi Pan; Tao Qi; Sicong Leng; |
| 385 | DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While auto-regressive methods generate structured meshes by predicting discrete vertex tokens, they are often constrained by limited face counts and mesh incompleteness. To address these challenges, we propose DeepMesh, a framework that optimizes mesh generation through two key innovations: (1) an efficient pre-training strategy incorporating a novel tokenization algorithm, along with improvements in data curation and processing, and (2) the introduction of Reinforcement Learning (RL) into 3D mesh generation to achieve human preference alignment via Direct Preference Optimization (DPO). |
Ruowen Zhao; Junliang Ye; Zhengyi Wang; Guangce Liu; Yiwen Chen; Yikai Wang; Jun Zhu; |
| 386 | From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Specifically, to avoid the early low-performance model leading to the wrong selection of hard samples, we propose a model pre-start concept, which focuses on automatically selecting a portion of easy samples and helping the model have basic task-specific learning capabilities. |
Chuang Yu; Jinmiao Zhao; Yunpeng Liu; Sicheng Zhao; Yimian Dai; Xiangyu Yue; |
| 387 | CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While LLM-based approaches offer generalized reasoning capabilities, their challenges in spatial planning and unstable inference latency hinder their direct application in cooperative driving. To address these limitations, we propose CoLMDriver, the first full-pipeline LLM-based cooperative driving system, enabling effective language-based negotiation and real-time driving control. |
Changxing Liu; Genjia Liu; Zijun Wang; Jinchang Yang; Siheng Chen; |
| 388 | "Principal Components" Enable A New Language of Images Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. |
Xin Wen; Bingchen Zhao; Ismail Elezi; Jiankang Deng; Xiaojuan Qi; |
| 389 | Visual-RFT: Visual Reinforcement Fine-Tuning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is possibly one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. |
Ziyu Liu; Zeyi Sun; Yuhang Zang; Xiaoyi Dong; Yuhang Cao; Haodong Duan; Dahua Lin; Jiaqi Wang; |
| 390 | HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose Hyperbolic Diffusion Autoencoders (HypDAE), a novel approach that operates in hyperbolic space to capture hierarchical relationships among images from seen categories. |
Lingxiao Li; Kaixuan Fan; Boqing Gong; Xiangyu Yue; |
| 391 | Graph Domain Adaptation with Dual-branch Encoder and Two-level Alignment for Whole Slide Image-based Survival Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: These differences generally result in large gaps in distribution between different WSI domains and thus, the survival analysis models trained on one domain may fail to transfer to another. To address this issue, we propose a Dual-branch Encoder and Two-level Alignment (DETA) framework to explore both feature and category-level alignment between different WSI domains. |
Yuntao Shou; Xiangyong Cao; Peiqiang Yan; Qiao Hui; Qian Zhao; Deyu Meng; |
| 392 | Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned based on CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. |
Ying Ba; Tianyu Zhang; Yalong Bai; Wenyi Mo; Tao Liang; Bing Su; Ji-Rong Wen; |
| 393 | USP: Unified Self-Supervised Pretraining for Image Generation and Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, transferring pretrained weights from vision models to diffusion models is challenging due to input mismatches and the use of latent spaces. To address these challenges, we propose Unified Self-supervised Pretraining (USP), a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. |
Xiangxiang Chu; Renda Li; Yong Wang; |
| 394 | UnZipLoRA: Separating Content and Style from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces UnZipLoRA, a method for decomposing an image into its constituent subject and style, represented as two distinct LoRAs (Low-Rank Adaptations). |
Chang Liu; Viraj Shah; Aiyu Cui; Svetlana Lazebnik; |
| 395 | Representation Shift: Unifying Token Compression with FlashAttention Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token’s representation. |
Joonmyung Choi; Sanghyeok Lee; Byungoh Ko; Eunseo Kim; Jihyung Kil; Hyunwoo J. Kim; |
| 396 | Towards Comprehensive Lecture Slides Understanding: Large-scale Dataset and Effective Method Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Furthermore, complex and flexible text relations can hinder the understanding of the internal logic of slides. To address this challenge, we propose a novel method, named SlideParser, which includes an auxiliary branch to predict text relations within slides and enhance attention between related texts, thereby improving slides understanding. |
Enming Zhang; Yuzhe Li; Yuliang Liu; Yingying Zhu; Xiang Bai; |
| 397 | IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. |
Wenxuan Guo; Xiuwei Xu; Hang Yin; Ziwei Wang; Jianjiang Feng; Jie Zhou; Jiwen Lu; |
| 398 | AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distils structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. |
Sanjoy Chowdhury; Hanan Gani; Nishit Anand; Sayan Nag; Ruohan Gao; Mohamed Elhoseiny; Salman Khan; Dinesh Manocha; |
| 399 | AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we introduce Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning over 9 meticulously crafted tasks, evaluating the capabilities of AVLLMs across three distinct dimensions: Adversarial Attack, Compositional Reasoning, and Modality-specific Dependency. |
Sanjoy Chowdhury; Sayan Nag; Subhrajyoti Dasgupta; Yaoting Wang; Mohamed Elhoseiny; Ruohan Gao; Dinesh Manocha; |
| 400 | DLFR-Gen: Diffusion-based Video Generation with Dynamic Latent Frame Rate Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we exploit the inherent temporal non-uniformity of real-world videos, and observe that videos exhibit dynamic information density, with high-motion segments demanding greater detail preservation than static scenes. |
Zhihang Yuan; Rui Xie; Yuzhang Shang; Hanling Zhang; Siyuan Wang; Shengen Yan; Guohao Dai; Yu Wang; |
| 401 | Refer to Any Segmentation Mask Group With Vision-Language Prompts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. |
Shengcao Cao; Zijun Wei; Jason Kuen; Kangning Liu; Lingzhi Zhang; Jiuxiang Gu; HyunJoon Jung; Liang-Yan Gui; Yu-Xiong Wang; |
| 402 | Toward Fair and Accurate Cross-Domain Medical Image Segmentation: A VLM-Driven Active Domain Adaptation Paradigm Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Emerging Active Domain Adaptation (ADA) approaches offer more effective enhancements, but all ignore fairness issues. Therefore, in this work, we propose the first fairness-aware ADA paradigm that simultaneously achieves both enhanced fairness and superior overall performance. |
Hongqiu Wang; Wu Chen; Xiangde Luo; Zhaohu Xing; Lihao Liu; Jing Qin; Shaozhi Wu; Lei Zhu; |
| 403 | FastVAR: Linear Visual Autoregressive Modeling Via Cached Token Pruning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, existing VAR paradigms process the entire token map at each scale step, leading to the complexity and runtime scaling dramatically with image resolution. To address this challenge, we propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs. |
Hang Guo; Yawei Li; Taolin Zhang; Jiangshan Wang; Tao Dai; Shu-Tao Xia; Luca Benini; |
| 404 | InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we define the union of referring segmentation and reasoning segmentation at both the image and video levels as Instructed Visual Segmentation (IVS). |
Cong Wei; Yujie Zhong; Haoxian Tan; Yingsen Zeng; Yong Liu; Hongfa Wang; Yujiu Yang; |
| 405 | MemoryTalker: Personalized Speech-Driven 3D Facial Animation Via Audio-Guided Stylization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, previous works often require priors such as class labels of a speaker or additional 3D facial meshes at inference, which makes them fail to reflect the speaking style and limits their practical use. To address these issues, we propose MemoryTalker which enables realistic and accurate 3D facial motion synthesis by reflecting speaker style only with audio input to maximize usability in applications. |
Hyung Kyu Kim; Sangmin Lee; Hak Gu Kim; |
| 406 | Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Orchid, a unified latent diffusion model that learns a joint appearance-geometry prior to generate color, depth, and surface normal images in a single diffusion process. |
Akshay Krishnan; Xinchen Yan; Vincent Casser; Abhijit Kundu; |
| 407 | LBM: Latent Bridge Matching for Fast Image-to-Image Translation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation. |
Clément Chadebec; Onur Tasar; Sanjeev Sreetharan; Benjamin Aubin; |
| 408 | Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection Highlight: This is due to the increased complexity of decision boundaries as the number of categories grows, which complicates the optimization process. To mitigate this, we propose the Mixture of Feature Experts (MoFE) module, which partitions features into subspaces, effectively capturing complex data distributions and refining decision boundaries. |
Shizhen Zhao; Jiahui Liu; Xin Wen; Haoru Tan; Xiaojuan Qi; |
| 409 | CARP: Visuomotor Policy Learning Via Coarse-to-Fine Autoregressive Prediction Highlight: In this paper, we introduce **C**oarse-to-Fine **A**uto**R**egressive **P**olicy (**CARP**), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach. |
Zhefei Gong; Pengxiang Ding; Shangke Lyu; Siteng Huang; Mingyang Sun; Wei Zhao; Zhaoxin Fan; Donglin Wang; |
| 410 | Principles of Visual Tokens for Efficient Video Understanding Highlight: While most methods succeed in reducing the cost of the model and maintaining accuracy, an interesting pattern arises: most methods do not outperform the baseline of randomly discarding tokens. In this paper we take a closer look at this phenomenon and observe 5 principles of the nature of visual tokens. |
Xinyue Hao; Gen Li; Shreyank N Gowda; Robert B. Fisher; Jonathan Huang; Anurag Arnab; Laura Sevilla-Lara; |
| 411 | TACO: Taming Diffusion for In-the-wild Video Amodal Completion Highlight: This paper tackles the task of Video Amodal Completion (VAC), which aims to generate the complete object consistently throughout the video given a visual prompt specifying the object of interest. Leveraging the rich, consistent manifolds learned by pre-trained video diffusion models, we propose a conditional diffusion model, TACO, that repurposes these manifolds for VAC. |
Ruijie Lu; Yixin Chen; Yu Liu; Jiaxiang Tang; Junfeng Ni; Diwen Wan; Gang Zeng; Siyuan Huang; |
| 412 | Aligning Global Semantics and Local Textures in Generative Video Enhancement Highlight: Nevertheless, solely relying on the knowledge embedded in the pre-trained video diffusion models might limit the generalization ability of local details (e.g., texture). In this paper, we address this issue by exploring the visual cues from a high-quality (HQ) image reference to facilitate visual details generation in video enhancement. |
Zhikai Chen; Fuchen Long; Zhaofan Qiu; Ting Yao; Wengang Zhou; Jiebo Luo; Tao Mei; |
| 413 | UPRE: Zero-Shot Domain Adaptation for Object Detection Via Unified Prompt and Representation Enhancement Highlight: However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on manually crafted prompts. To overcome these limitations, we propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations. |
Xiao Zhang; Fei Wei; Yong Wang; Wenda Zhao; Feiyi Li; Xiangxiang Chu; |
| 414 | Adversarial Exploitation of Data Diversity Improves Visual Localization Highlight: To fully unleash the potential of the appearance-diverse data, we build a two-branch joint training pipeline with an adversarial discriminator to bridge the syn-to-real gap. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, reducing translation and rotation errors by 50% and 22% on indoor datasets, and 37% and 42% on outdoor datasets. |
Sihang Li; Siqi Tan; Bowen Chang; Jing Zhang; Chen Feng; Yiming Li; |
| 415 | Self-Supervised Monocular 4D Scene Reconstruction for Egocentric Videos Highlight: However, the lack of high-quality labeled datasets in this field has hindered the effectiveness of current supervised learning methods. In this work, we aim to address this issue by exploring a self-supervised dynamic scene reconstruction approach. |
Chengbo Yuan; Geng Chen; Li Yi; Yang Gao; |
| 416 | InvRGB+L: Inverse Rendering of Complex Scenes with Unified Color and LiDAR Reflectance Modeling Highlight: We present InvRGB+L, a novel inverse rendering model that reconstructs large, relightable, and dynamic scenes from a single RGB+LiDAR sequence. |
Xiaoxue Chen; Bhargav Chandaka; Chih-Hao Lin; Ya-Qin Zhang; David Forsyth; Hao Zhao; Shenlong Wang; |
| 417 | GestureLSM: Latent Shortcut Based Co-Speech Gesture Generation with Spatial-Temporal Modeling (code available) Highlight: Additionally, their autoregressive/diffusion-based pipelines show slow generation speed due to dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for Co-Speech Gesture Generation with spatial-temporal modeling. |
Pinxin Liu; Luchuan Song; Junhua Huang; Haiyang Liu; Chenliang Xu; |
| 418 | DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance Highlight: While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, which reduces their expressiveness and robustness. We propose a diffusion transformer (DiT) based framework, HERA, with hybrid guidance to overcome these limitations. |
Yuxuan Luo; Zhengkun Rong; Lizhen Wang; Longhao Zhang; Tianshu Hu; |
| 419 | World4Drive: End-to-End Autonomous Driving Via Intention-aware Physical Latent World Model Highlight: In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. |
Yupeng Zheng; Pengxuan Yang; Zebin Xing; Qichao Zhang; Yuhang Zheng; Yinfeng Gao; Pengfei Li; Teng Zhang; Zhongpu Xia; Peng Jia; XianPeng Lang; Dongbin Zhao; |
| 420 | Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing Highlight: In this paper, we systematically analyze MM-DiT’s attention mechanism by decomposing attention matrices into four distinct blocks, revealing their inherent characteristics. |
Joonghyuk Shin; Alchan Hwang; Yujin Kim; Daneul Kim; Jaesik Park; |
| 421 | LiT: Delving Into A Simple Linear Diffusion Transformer for Image Generation Highlight: In this paper, we investigate how to convert a pre-trained Diffusion Transformer (DiT) into a linear DiT, owing to its simplicity, parallelism, and efficiency for image generation. |
Jiahao Wang; Ning Kang; Lewei Yao; Mengzhao Chen; Chengyue Wu; Songyang Zhang; Shuchen Xue; Yong Liu; Taiqiang Wu; Xihui Liu; Kaipeng Zhang; Shifeng Zhang; Wenqi Shao; Zhenguo Li; Ping Luo; |
| 422 | Unsupervised Visual Chain-of-Thought Reasoning Via Preference Optimization Highlight: In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. |
Kesen Zhao; Beier Zhu; Qianru Sun; Hanwang Zhang; |
| 423 | FaceXFormer: A Unified Transformer for Facial Analysis (code available) Highlight: In this work, we introduce FaceXFormer, an end-to-end unified transformer model capable of performing ten facial analysis tasks within a single framework. |
Kartik Narayan; Vibashan VS; Rama Chellappa; Vishal M. Patel; |
| 424 | Learning Efficient and Generalizable Human Representation with Human Gaussian Model Highlight: While conventional methods require per-instance optimization, recent feed-forward methods have been proposed to generate 3D Gaussians with a learnable network. However, these methods predict independent Gaussians for each frame without fully capturing the relations among Gaussians from different frames, making them hard to animate with novel poses. To address this, we propose the Human Gaussian Graph (HGG) to generate generalizable and animatable Gaussian representations. |
Yifan Liu; Shengjun Zhang; Chensheng Dai; Yang Chen; Hao Liu; Chen Li; Yueqi Duan; |
| 425 | Error Recognition in Procedural Videos Using Generalized Task Graph Highlight: In this paper, we develop a unified framework for joint temporal action segmentation and error recognition (recognizing when and which type of error happens) in procedural task videos. |
Shih-Po Lee; Ehsan Elhamifar; |
| 426 | MOSCATO: Predicting Multiple Object State Change Through Actions Highlight: We introduce MOSCATO, a new benchmark for predicting the evolving states of multiple objects through long procedural videos with multiple actions. |
Parnian Zameni; Yuhan Shen; Ehsan Elhamifar; |
| 427 | ObjectGS: Object-aware Scene Reconstruction and Scene Understanding Via Gaussian Splatting Highlight: In this work, we propose ObjectGS, an object-aware framework that unifies 3D scene reconstruction with semantic understanding. |
Ruijie Zhu; Mulin Yu; Linning Xu; Lihan Jiang; Yixuan Li; Tianzhu Zhang; Jiangmiao Pang; Bo Dai; |
| 428 | PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency Highlight: This process greatly broadens available geometries by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a new data synthesis pipeline that uses multiple depth foundation models as scale manipulators. |
Haotian Wang; Aoran Xiao; Xiaoqin Zhang; Meng Yang; Shijian Lu; |
| 429 | UIP2P: Unsupervised Instruction-based Image Editing Via Edit Reversibility Constraint Highlight: We propose an unsupervised instruction-based image editing approach that removes the need for ground-truth edited images during training. |
Enis Simsar; Alessio Tonioni; Yongqin Xian; Thomas Hofmann; Federico Tombari; |
| 430 | Uncertainty-Driven Expert Control: Enhancing The Reliability of Medical Vision-Language Models Highlight: However, these training-dependent strategies are costly and still lack sufficient alignment with clinical expertise. To address these issues, we propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training. |
Xiao Liang; Di Wang; Zhicheng Jiao; Ronghan Li; Pengfei Yang; Quan Wang; Tat-Seng Chua; |
| 431 | REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment (code available) Highlight: The proposed method can significantly enhance object independence, detail accuracy, and overall scene coherence. |
Haonan Han; Rui Yang; Huan Liao; Jiankai Xing; Zunnan Xu; Xiaoming Yu; Junwei Zha; Xiu Li; Wanhua Li; |
| 432 | LV-MAE: Learning Long Video Representations Through Masked-Embedding Autoencoders Highlight: In this work, we introduce long-video masked-embedding autoencoders (LV-MAE), a self-supervised learning framework for long video representation. Our approach treats short- and long-span dependencies as two separate tasks. Such decoupling allows for a more intuitive video processing where short-span spatiotemporal primitives are first encoded and are then used to capture long-range dependencies across consecutive video segments. |
Ilan Naiman; Emanuel Ben-Baruch; Oron Anschel; Alon Shoshan; Igor Kviatkovsky; Manoj Aggarwal; Gerard Medioni; |
| 433 | VideoAds for Fast-Paced Video Understanding Highlight: In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. |
Zheyuan Zhang; Wanying Dou; Linkai Peng; Hongyi Pan; Ulas Bagci; Boqing Gong; |
| 434 | WikiAutoGen: Towards Multi-Modal Wikipedia-Style Article Generation (code available) Highlight: In this work, we introduce WikiAutoGen, a novel system for automated multimodal Wikipedia-style article generation. |
Zhongyu Yang; Jun Chen; Dannong Xu; Junjie Fei; Xiaoqian Shen; Liangbing Zhao; Chun-Mei Feng; Mohamed Elhoseiny; |
| 435 | Inverse Image-Based Rendering for Light Field Generation from Single Images Highlight: Despite the effectiveness of light flow computations, obtaining light fields requires either high computational costs or specialized devices such as a bulky camera setup with a dedicated microlens array. In an effort to broaden its benefit and applicability, in this paper we propose a novel view synthesis method for light field generation from only single images, named inverse image-based rendering. |
Hyunjun Jung; Hae-Gon Jeon; |
| 436 | SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs Highlight: In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. |
Jiahui Wang; Zuyan Liu; Yongming Rao; Jiwen Lu; |
| 437 | PersPose: 3D Human Pose Estimation with Perspective Encoding and Perspective Rotation Highlight: In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. |
Xiaoyang Hao; Han Li; |
| 438 | Generative Active Learning for Long-tail Trajectory Prediction Via Controllable Diffusion Model Highlight: We introduce Generative Active Learning for Trajectory prediction (GALTraj), the first method to successfully deploy generative active learning into trajectory prediction. |
Daehee Park; Monu Surana; Pranav Desai; Ashish Mehta; Reuben MV John; Kuk-Jin Yoon; |
| 439 | FonTS: Text Rendering With Typography and Style Controls Highlight: However, these methods still encounter challenges like inconsistent fonts, style variation, and limited fine-grained control, particularly at the word level. This paper proposes a two-stage DiT-based pipeline to address these problems by enhancing controllability over typography and style in text rendering. |
Wenda Shi; Yiren Song; Dengming Zhang; Jiaming Liu; Xingxing Zou; |
| 440 | CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception Highlight: However, the more challenging cooperative sequential perception tasks, such as cooperative 3D multi-object tracking, have not been thoroughly investigated. Therefore, we propose CoopTrack, a fully instance-level end-to-end framework for cooperative tracking, featuring learnable instance association, which fundamentally differs from existing approaches. |
Jiaru Zhong; Jiahao Wang; Jiahui Xu; Xiaofan Li; Zaiqing Nie; Haibao Yu; |
| 441 | Scaling 3D Compositional Models for Robust Classification and Pose Estimation Highlight: Deep learning algorithms for object classification and 3D object pose estimation lack robustness to out-of-distribution factors such as synthetic stimuli, changes in weather conditions, and partial occlusion. |
Xiaoding Yuan; Guofeng Zhang; Prakhar Kaushik; Artur Jesslen; Adam Kortylewski; Alan Yuille; |
| 442 | HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation Highlight: In this paper, we introduce HOLa (Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation), a novel approach that both enhances generalization to unseen classes and improves action distinction. |
Qinqian Lei; Bo Wang; Robby T. Tan; |
| 443 | RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions Highlight: Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but struggle significantly when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce **`RefEdit-Bench`**, a rigorous real-world benchmark rooted in RefCOCO, where even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce **`RefEdit`**, an instruction-based editing model trained on our scalable synthetic data generation pipeline. |
Bimsara Pathiraja; Maitreya Patel; Shivam Singh; Yezhou Yang; Chitta Baral; |
| 444 | SITE: Towards Spatial Intelligence Thorough Evaluation Highlight: We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models’ spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). |
Wenqi Wang; Reuben Tan; Pengyue Zhu; Jianwei Yang; Zhengyuan Yang; Lijuan Wang; Andrey Kolobov; Jianfeng Gao; Boqing Gong; |
| 445 | Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation Highlight: Our approach introduces a novel tokenization algorithm that preserves face proximity relationships and compresses sequence length through locally shared vertices and edges, enabling the generation of meshes with an unprecedented scale of up to 5,000 faces. |
Yuxuan Wang; Xuanyu Yi; Haohan Weng; Qingshan Xu; Xiaokang Wei; Xianghui Yang; Chunchao Guo; Long Chen; Hanwang Zhang; |
| 446 | ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery Highlight: Our analysis reveals that the 3D-GS densification operation lacks adaptiveness and faces a dilemma between geometry coverage and detail recovery. To address this, we introduce a novel densification operation, residual split, which adds a downscaled Gaussian as a residual. |
Yanzhe Lyu; Kai Cheng; Xin Kang; Xuejin Chen; |
| 447 | CAD-Assistant: Tool-Augmented VLLMs As Generic CAD Task Solvers Highlight: We propose CAD-Assistant, a general-purpose CAD agent for AI-assisted design. |
Dimitrios Mallis; Ahmet Serda Karadeniz; Sebastian Cavada; Danila Rukhovich; Niki Foteinopoulou; Kseniya Cherenkova; Anis Kacem; Djamila Aouada; |
| 448 | VPO: Aligning Text-to-Video Generation Models with Prompt Optimization (code available) Highlight: Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. |
Jiale Cheng; Ruiliang Lyu; Xiaotao Gu; Xiao Liu; Jiazheng Xu; Yida Lu; Jiayan Teng; Zhuoyi Yang; Yuxiao Dong; Jie Tang; Hongning Wang; Minlie Huang; |
| 449 | From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning Highlight: However, as task diversity and complexity increase, EVIT faces significant challenges in resolving data conflicts. To address this limitation, we propose the Dual Low-Rank Adaptation (Dual-LoRA), a holistic-to-local framework that enhances the adapter’s capacity to address data conflict through dual structural optimization. |
Pengkun Jiao; Bin Zhu; Jingjing Chen; Chong-Wah Ngo; Yu-Gang Jiang; |
| 450 | 4D Gaussian Splatting SLAM Highlight: Instead of removing dynamic objects as distractors and reconstructing only static environments, this paper proposes an efficient architecture that incrementally tracks camera poses and establishes the 4D Gaussian radiance fields in unknown scenarios by using a sequence of RGB-D images. |
Yanyan Li; Youxu Fang; Zunjie Zhu; Kunyi Li; Yong Ding; Federico Tombari; |
| 451 | Boosting MLLM Reasoning with Text-Debiased Hint-GRPO Highlight: In this work, we reveal two problems that impede the performance of GRPO on MLLMs: low data utilization and text bias. |
Qihan Huang; Weilong Dai; Jinlong Liu; Wanggui He; Hao Jiang; Mingli Song; Jingyuan Chen; Chang Yao; Jie Song; |
| 452 | RoMo: Robust Motion Segmentation Improves Structure from Motion Highlight: We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. |
Lily Goli; Sara Sabour; Mark Matthews; Marcus A. Brubaker; Dmitry Lagun; Alec Jacobson; David J. Fleet; Saurabh Saxena; Andrea Tagliasacchi; |
| 453 | Emulating Self-attention with Convolution for Efficient Image Super-Resolution (code available) Highlight: In this paper, we tackle the high computational overhead of Transformers for efficient image super-resolution (SR). |
Dongheon Lee; Seokju Yun; Youngmin Ro; |
| 454 | LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance Highlight: These challenges stem primarily from weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) the Semantic-Enhanced Feature Extractor (SEFE), which improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; and (2) Interleaved Local Visual Coupling (ILVC), which autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. |
Zhang Li; Biao Yang; Qiang Liu; Shuo Zhang; Zhiyin Ma; Liang Yin; Linger Deng; Yabo Sun; Yuliang Liu; Xiang Bai; |
| 455 | Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers Highlight: We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. |
Divyansh Srivastava; Xiang Zhang; He Wen; Chenru Wen; Zhuowen Tu; |
| 456 | Trace3D: Consistent Segmentation Lifting Via Gaussian Instance Tracing Highlight: Existing methods often suffer from inconsistent 2D masks across viewpoints and produce noisy segmentation boundaries as they neglect these semantic cues to refine the learned Gaussians. To overcome this, we introduce Gaussian Instance Tracing (GIT), which augments the standard Gaussian representation with an instance weight matrix across input views. |
Hongyu Shen; Junfeng Ni; Yixin Chen; Weishuo Li; Mingtao Pei; Siyuan Huang; |
| 457 | The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with A Single Transformer (code available) Highlight: This paper introduces SAIL, a single-transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. |
Weixian Lei; Jiacong Wang; Haochen Wang; Xiangtai Li; Jun Hao Liew; Jiashi Feng; Zilong Huang; |
| 458 | GAS: Generative Avatar Synthesis from A Single Image Highlight: We present a unified and generalizable framework for synthesizing view-consistent and temporally coherent avatars from a single image, addressing the challenging task of single-image avatar generation. |
Yixing Lu; Junting Dong; Youngjoong Kwon; Qin Zhao; Bo Dai; Fernando De la Torre; |
| 459 | The Source Image Is The Best Attention for Infrared and Visible Image Fusion Highlight: This paper reveals, unprecedentedly, the intrinsic "attention properties" of infrared images, which directly arise from their physical characteristics (i.e., heat distribution) and can be linked to attention mechanisms naturally, as observed in the gradient-weighted class activation mapping (Grad-CAM) visualization analysis of image classification models. To incorporate this property into IVF for better fusion, we propose the source infrared cross attention (I-SCA) and further extend it to the visible modality, subsequently introducing the source visible cross attention (V-SCA). |
Song Wang; Xie Han; Liqun Kuang; Boying Wang; Zhongyu Chen; Zherui Qiao; Fan Yang; Xiaoxia Liu; Bingyu Zhang; Zhixun Wang; |
| 460 | HADES: Human Avatar with Dynamic Explicit Hair Strands Highlight: We introduce HADES, the first framework to seamlessly integrate dynamic hair into human avatars. |
Zhanfeng Liao; Hanzhang Tu; Cheng Peng; Hongwen Zhang; Boyao Zhou; Yebin Liu; |
| 461 | Benchmarking Multimodal CoT Reward Model Stepwise By Visual Program (code available) Highlight: However, significant challenges exist when transferring reward signals to the multimodal domain, including labor-intensive annotations, over-reliance on one-step rewards, and inadequate evaluation. To address these issues, we propose SVIP, a novel approach to automatically train a step-level multi-dimensional Chain-of-Thought (CoT) reward model. |
Minghe Gao; Xuqi Liu; Zhongqi Yue; Yang Wu; Shuang Chen; Juncheng Li; Siliang Tang; Fei Wu; Tat-Seng Chua; Yueting Zhuang; |
| 462 | Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision Highlight: This paper proposes a novel method that uses generative AI to synthesize high-quality aerial images and their labels, improving detector training through data augmentation. |
Xiao Fang; Minhyek Jeon; Zheyang Qin; Stanislav Panev; Celso De Melo; Shuowen Hu; Shayok Chakraborty; Fernando De La Torre; |
| 463 | MotionStreamer: Streaming Motion Generation Via Diffusion-based Autoregressive Model in Causal Latent Space Highlight: Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. |
Lixing Xiao; Shunlin Lu; Huaijin Pi; Ke Fan; Liang Pan; Yueer Zhou; Ziyong Feng; Xiaowei Zhou; Sida Peng; Jingbo Wang; |
| 464 | CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers Highlight: This paper explores a general, training-free, and model-agnostic acceleration strategy via multi-core parallelism. |
Jiaqi Han; Haotian Ye; Puheng Li; Minkai Xu; James Zou; Stefano Ermon; |
| 465 | LookOut: Real-World Humanoid Egocentric Navigation Highlight: In this work, we introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. |
Boxiao Pan; Adam W. Harley; Francis Engelmann; C. Karen Liu; Leonidas J. Guibas; |
| 466 | What You Have Is What You Track: Adaptive and Robust Multimodal Tracking Highlight: In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. |
Yuedong Tan; Jiawei Shao; Eduard Zamfir; Ruanjun Li; Zhaochong An; Chao Ma; Danda Paudel; Luc Van Gool; Radu Timofte; Zongwei Wu; |
| 467 | FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases Highlight: Through an in-depth analysis of previous approaches, we identify two key insights: (1) IL arises from identity information embedded within motion features, and (2) this identity information can be leveraged to address RA. Building on these findings, this paper introduces FixTalk, a novel framework designed to simultaneously resolve both issues for high-quality talking head generation. |
Shuai Tan; Bill Gong; Bin Ji; Ye Pan; |
| 468 | GARF: Learning Generalizable 3D Reassembly for Real-World Fractures Highlight: Critically, it remains uncertain whether models trained on synthetic datasets can generalize to real-world fractures, where breakage patterns are more complex. To bridge this gap, we propose GARF, a generalizable 3D reassembly framework for real-world fractures. |
Sihang Li; Zeyu Jiang; Grace Chen; Chenyang Xu; Siqi Tan; Xue Wang; Irving Fang; Kristof Zyskowski; Shannon P. McPherron; Radu Iovita; Chen Feng; Jing Zhang; |
| 469 | SILO: Solving Inverse Problems with Latent Operators Highlight: In this work, we introduce a new plug-and-play paradigm that operates entirely in the latent space of diffusion models. |
Ron Raphaeli; Sean Man; Michael Elad; |
| 470 | Integrating Task-Specific and Universal Adapters for Pre-Trained Model-based Class-Incremental Learning Highlight: To address the aforementioned challenges, we propose integrating Task-Specific and Universal Adapters (TUNA) in this paper. |
Yan Wang; Da-Wei Zhou; Han-Jia Ye; |
| 471 | Teleportraits: Training-Free People Insertion Into Any Scene Highlight: In this work, we introduce a unified training-free pipeline that leverages pre-trained text-to-image diffusion models. |
Jialu Gao; K J Joseph; Fernando De La Torre; |
| 472 | Scene Graph Guided Generation: Enable Accurate Relations Generation in Text-to-Image Models Via Textural Rectification Highlight: In this paper, we introduce the Scene Graph Adapter (SG-Adapter), leveraging the structured representation of scene graphs to rectify inaccuracies in the original text embeddings. |
Guibao Shen; Luozhou Wang; Jiantao Lin; Wenhang Ge; Chaozhe Zhang; Xin Tao; Di Zhang; Pengfei Wan; Guangyong Chen; Yijun Li; Ying-cong Chen; |
| 473 | SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data Highlight: To tackle the overarching and complex challenge, we introduce SynFER (Synthesis of Facial Expressions with Refined Control), a novel framework for synthesizing facial expression image data based on high-level textual descriptions as well as more fine-grained and precise control through facial action units. |
Xilin He; Cheng Luo; Xiaole Xian; Bing Li; Muhammad Haris Khan; Zongyuan Ge; Weicheng Xie; Siyang Song; Linlin Shen; Bernard Ghanem; Xiangyu Yue; |
| 474 | Vivid4D: Improving 4D Reconstruction from Monocular Video By Video Inpainting Highlight: We introduce Vivid4D, a novel approach that enhances 4D monocular video synthesis by augmenting observation views, synthesizing multi-view videos from a monocular input. |
Jiaxin Huang; Sheng Miao; Bangbang Yang; Yuewen Ma; Yiyi Liao; |
| 475 | SFUOD: Source-Free Unknown Object Detection Highlight: To this end, we propose CollaPAUL (Collaborative tuning and Principal Axis-based Unknown Labeling), a novel framework for SFUOD. |
Keon-Hee Park; Seun-An Choe; Gyeong-Moon Park; |
| 476 | ImHead: A Large-scale Implicit Morphable Model for Localized Head Modeling Highlight: In contrast, we retain a single compact identity space and introduce an intermediate region-specific latent representation to enable local edits. |
Rolandos Alexandros Potamias; Stathis Galanakis; Jiankang Deng; Athanasios Papaioannou; Stefanos Zafeiriou; |
| 477 | RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors Highlight: In this paper, we propose RI3D, a novel 3DGS-based approach that harnesses the power of diffusion models to reconstruct high-quality novel views given a sparse set of input images. |
Avinash Paliwal; Xilong Zhou; Wei Ye; Jinhui Xiong; Rakesh Ranjan; Nima Khademi Kalantari; |
| 478 | DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation Highlight: In this paper, we introduce a human face video dataset, DH-FaceVid-1K. |
Donglin Di; He Feng; Wenzhang Sun; Yongjia Ma; Hao Li; Wei Chen; Lei Fan; Tonghua Su; Xun Yang; |
| 479 | Unraveling The Effects of Synthetic Data on End-to-End Autonomous Driving Highlight: Additionally, recent simulators designed for closed-loop evaluation provide limited interaction with other vehicles, failing to simulate complex real-world traffic dynamics. To address these issues, we introduce SceneCrafter, a realistic, interactive, and efficient AD simulator based on 3D Gaussian Splatting (3DGS). |
Junhao Ge; Zuhong Liu; Longteng Fan; Yifan Jiang; Jiaqi Su; Yiming Li; Zhejun Zhang; Siheng Chen; |
| 480 | D3QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection Highlight: In this paper, we propose to leverage Discrete Distribution Discrepancy-aware Quantization Error (D^3QE) for autoregressive-generated image detection that exploits the distinctive patterns and the frequency distribution bias of the codebook existing in real and fake images. |
Yanran Zhang; Bingyao Yu; Yu Zheng; Wenzhao Zheng; Yueqi Duan; Lei Chen; Jie Zhou; Jiwen Lu; |
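For readers unfamiliar with the codebook quantization error that D^3QE builds on, here is a minimal vector-quantization sketch (a generic illustration with a made-up two-entry codebook; the `quantize` helper is ours, not the paper's): each feature vector snaps to its nearest codebook entry, and the residual distance is the quantization error whose distribution differs between real and generated images.

```python
def quantize(vec, codebook):
    """Return the nearest codebook entry (L2 distance) and the residual error."""
    best = min(codebook, key=lambda c: sum((v - ci) ** 2 for v, ci in zip(vec, c)))
    err = sum((v - bi) ** 2 for v, bi in zip(vec, best)) ** 0.5
    return best, err

# Toy 2-entry codebook; real autoregressive image models learn codebooks
# with hundreds or thousands of entries.
codebook = [(0.0, 0.0), (1.0, 1.0)]
entry, err = quantize((0.9, 1.2), codebook)
```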
| 481 | WalkVLM: Aid Visually Impaired People Walking By Vision Language Model Highlight: In this paper, we introduce the first large-scale dataset dedicated to walking assistance, comprising 12,000 video-annotation pairs, to provide a unified benchmark for training and evaluating systems to help visually-impaired individuals walk. |
Zhiqiang Yuan; Ting Zhang; Yeshuang Zhu; Jiapei Zhang; Ying Deng; Zexi Jia; Peixiang Luo; Xiaoyue Duan; Jie Zhou; Jinchao Zhang; |
| 482 | TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models Highlight: In this paper, we present TAViS, a novel framework that couples the knowledge of multimodal foundation models (ImageBind) for cross-modal alignment and a segmentation foundation model (SAM2) for precise segmentation. |
Ziyang Luo; Nian Liu; Xuguang Yang; Salman Khan; Rao Muhammad Anwer; Hisham Cholakkal; Fahad Shahbaz Khan; Junwei Han; |
| 483 | Scaling Tumor Segmentation: Best Lessons from Real and Synthetic Data Highlight: Motivated by these lessons, we created AbdomenAtlas 2.0—a dataset of 10,134 CT scans with a total of 13,223 tumor instances per-voxel manually annotated in six organs (pancreas, liver, kidney, colon, esophagus, and uterus) and 6,511 control scans. |
Qi Chen; Xinze Zhou; Chen Liu; Hao Chen; Wenxuan Li; Zekun Jiang; Ziyan Huang; Yuxuan Zhao; Dexin Yu; Junjun He; Yefeng Zheng; Ling Shao; Alan Yuille; Zongwei Zhou; |
| 484 | A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision Highlight: We present a novel framework for training 3D image-conditioned diffusion models using only 2D supervision. |
Chensheng Peng; Ido Sobol; Masayoshi Tomizuka; Kurt Keutzer; Chenfeng Xu; Or Litany; |
| 485 | ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment Highlight: However, the generated view sequences suffer from a semantic drift issue arising from the accumulated deviation of the outpainting module. To tackle this challenge, we propose ScenePainter, a new framework for semantically consistent 3D scene generation, which aligns the outpainter’s scene-specific prior with the comprehension of the current scene. |
Chong Xia; Shengjun Zhang; Fangfu Liu; Chang Liu; Khodchaphun Hirunyaratsameewong; Yueqi Duan; |
| 486 | GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments Highlight: Existing methods are either labor-intensive, requiring extensive model retraining, or fail to capture detailed types of changes over time. In this paper, we present GaussianUpdate, a novel approach that combines 3D Gaussian representation with continual learning to address these challenges. |
Lin Zeng; Boming Zhao; Jiarui Hu; Xujie Shen; Ziqiang Dang; Hujun Bao; Zhaopeng Cui; |
| 487 | Scaling Omni-modal Pretraining with Multimodal Context: Advancing Universal Representation Learning Across Modalities Highlight: This work introduces Multimodal Context (MiCo), a scalable pretraining framework designed to advance omni-modal intelligence–an AI system capable of understanding and learning from multiple modalities to achieve universal representation learning. |
Yiyuan Zhang; Handong Li; Jing Liu; Xiangyu Yue; |
| 488 | Learning Beyond Still Frames: Scaling Vision-Language Models with Video Highlight: Additionally, while such datasets improve static image-text understanding, they fail to develop the temporal and motion comprehension needed for video understanding. To address these gaps, we propose incorporating video pretraining into VLMs to improve the model’s ability to capture temporal dynamics and general visual perception, which requires reconciling spatial redundancy with strict temporal causality. |
Yiyuan Zhang; Handong Li; Jing Liu; Xiangyu Yue; |
| 489 | GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training Highlight: To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent’s reasoning at each RL step. |
Tong Wei; Yijun Yang; Junliang Xing; Yuanchun Shi; Zongqing Lu; Deheng Ye; |
| 490 | SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition Highlight: However, they generally exhibit worse accuracy than encoder-decoder-based methods (EDTRs) because they struggle with text irregularity and missing linguistic context. To address these challenges, we propose SVTRv2, a CTC model endowed with the ability to handle text irregularities and model linguistic context. |
Yongkun Du; Zhineng Chen; Hongtao Xie; Caiyan Jia; Yu-Gang Jiang; |
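CTC-based recognizers such as SVTRv2 emit one label per visual frame and collapse the sequence at decode time. A minimal greedy CTC decoding sketch (this is the standard CTC collapsing rule, not the paper's code; the frame strings below are made up):

```python
BLANK = "-"  # CTC blank symbol; real models use a reserved index

def ctc_greedy_decode(frames):
    """Collapse repeated labels, then drop blanks (standard CTC rule)."""
    out = []
    prev = None
    for label in frames:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# "--cc-aa--tt" collapses to "cat": repeats merge, blanks separate
# genuinely repeated characters (e.g. "c-c" decodes to "cc").
```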
| 491 | Devil Is in The Uniformity: Exploring Diverse Learners Within Transformer for Image Restoration Highlight: In this paper, we propose to improve MHA by exploring diverse learners and introducing various interactions between heads, which results in a Hierarchical multI-head atteNtion driven Transformer model, termed HINT, for image restoration. |
Shihao Zhou; Dayu Li; Jinshan Pan; Juncheng Zhou; Jinglei Shi; Jufeng Yang; |
| 492 | P-AVAS: Can Physics-Integrated Audio-Visual Modeling Boost Neural Acoustic Synthesis? Highlight: We introduce Physics-Integrated Audio-Visual Acoustic Synthesis (PI-AVAS or π-AVAS), a novel framework designed with two key objectives. |
Susan Liang; Chao Huang; Yunlong Tang; Zeliang Zhang; Chenliang Xu; |
| 493 | Unified Open-World Segmentation with Multi-Modal Prompts Highlight: In this work, we present COSINE, a unified open-world segmentation model that Consolidates Open-vocabulary Segmentation and IN-context sEgmentation with multi-modal prompts (e.g., text and image). |
Yang Liu; Yufei Yin; Chenchen Jing; Muzhi Zhu; Hao Chen; Yuling Xi; Bo Feng; Hao Wang; Shiyu Li; Chunhua Shen; |
| 494 | CopyrightShield: Enhancing Diffusion Model Security Against Copyright Infringement Attacks Highlight: Specifically, we analyze the memorization mechanism of diffusion models and find that attacks exploit the model’s overfitting to specific spatial positions and prompts, causing it to reproduce poisoned samples under backdoor triggers. Based on this, we propose a poisoned sample detection method using spatial masking and data attribution to quantify poisoning risk and accurately identify hidden backdoor samples. |
Zhixiang Guo; Siyuan Liang; Aishan Liu; Dacheng Tao; |
| 495 | Find Any Part in 3D Highlight: By training on our annotated data with a simple contrastive objective, we obtain an open-world model that generalizes to any part in any object based on any text query. |
Ziqi Ma; Yisong Yue; Georgia Gkioxari; |
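The "simple contrastive objective" above is not spelled out in the highlight; a common choice for text-query matching is an InfoNCE-style loss, sketched generically here (our assumption, not necessarily the paper's exact loss; the similarity values are illustrative): the loss is the negative log-probability of the matching pair among all candidates, so it shrinks as the positive similarity grows relative to the negatives.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.07):
    """Generic InfoNCE: -log softmax of the positive among all candidates.

    sim_pos  -- similarity of the matching (e.g. part, text-query) pair
    sim_negs -- similarities of non-matching pairs in the batch
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))
```

A well-matched pair (high positive similarity, low negatives) yields a loss near zero; swapping the positive and a negative makes the loss large.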
| 496 | Efficient Input-level Backdoor Defense on Text-to-Image Synthesis Via Neuron Activation Variation Highlight: In this paper, we propose NaviT2I, an efficient input-level backdoor defense framework against diverse T2I backdoors. |
Shengfang Zhai; Jiajun Li; Yue Liu; Huanran Chen; Zhihua Tian; Wenjie Qu; Qingni Shen; Ruoxi Jia; Yinpeng Dong; Jiaheng Zhang; |
| 497 | Fair Generation Without Unfair Distortions: Debiasing Text-to-Image Generation with Entanglement-Free Attention Highlight: While existing bias mitigation methods demonstrate effectiveness, they often encounter attribute entanglement, where adjustments to attributes relevant to the bias (i.e., target attributes) unintentionally alter attributes unassociated with the bias (i.e., non-target attributes), causing undesirable distribution shifts. To address this challenge, we introduce Entanglement-Free Attention (EFA), a method that accurately incorporates target attributes (e.g., White, Black, and Asian) while preserving non-target attributes (e.g., background) during bias mitigation. |
Jeonghoon Park; Juyoung Lee; Chaeyeon Chung; Jaeseong Lee; Jaegul Choo; Jindong Gu; |
| 498 | DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability Highlight: We present DynamicID, a tuning-free framework that inherently facilitates both single-ID and multi-ID personalized generation with high fidelity and flexible facial editability. |
Xirui Hu; Jiahao Wang; Hao Chen; Weizhan Zhang; Benqi Wang; Yikun Li; Haishun Nan; |
| 499 | Unlocking The Potential of Diffusion Priors in Blind Face Restoration Highlight: The gap mainly stems from the discrepancy between 1) high-quality (HQ) and low-quality (LQ) images and 2) synthesized and real-world images. The vanilla diffusion model is trained on images with little or no degradation, while BFR handles moderately to severely degraded images. Additionally, LQ images used for training are synthesized by a naive degradation model with limited degradation patterns, which fails to simulate the complex and unknown degradations in real-world scenarios. In this work, we use a unified network FLIPNET that switches between two modes to address specific gaps. In restoration mode, the model gradually integrates BFR-oriented features and face embeddings from LQ images to achieve authentic and faithful face restoration. In degradation mode, the model synthesizes real-world-like degraded images based on the knowledge learned from real-world degradation datasets. Extensive evaluations on benchmark datasets show that our model 1) outperforms previous diffusion-prior-based BFR methods in terms of authenticity and fidelity, and 2) outperforms the naive degradation model in modeling real-world degradations. |
Yunqi Miao; Zhiyu Qu; Mingqi Gao; Changrui Chen; Jifei Song; Jungong Han; Jiankang Deng; |
| 500 | UINavBench: A Framework for Comprehensive Evaluation of Interactive Digital Agents Highlight: To build diverse, challenging tasks that reflect real-world use cases, we propose an exhaustive taxonomy that allows us to measure progress along multiple decision-making abilities including multi-step planning, visual perception, action grounding, and using memory or external knowledge. |
Harsh Agrawal; Eldon Schoop; Xinlei Pan; Anuj Mahajan; Ari Seff; Di Feng; Ruijia Cheng; Andres Romero Mier Y Teran; Esteban Gomez; Abhishek Sundararajan; Forrest Huang; Amanda Swearngin; Mohana Prasad Sathya Moorthy; Jeff Nichols; Alexander Toshev; |
This table only includes 500 papers selected by our daily digest algorithm. To continue with the full list (~2,700 papers), please visit Paper Digest: ICCV-2025 (Full List).