Paper Digest: CVPR 2025 Papers & Highlights
Note: CVPR-2025 accepted more than 2,800 papers; this page includes only 500 of them, selected by our daily paper digest algorithm. Interested users can choose to read All 2,800 CVPR-2025 papers on a separate page.
To search for papers presented at CVPR-2025 on a specific topic, please make use of the search by venue (CVPR-2025) service. To summarize the latest research published at CVPR-2025 on a specific topic, you can utilize the review by venue (CVPR-2025) service. If you are interested in browsing papers by author, we have a comprehensive list of ~ 12,000 authors (CVPR-2025). Additionally, you may want to explore our “Best Paper” Digest (CVPR), which lists the most influential CVPR papers since 1988.
We’ve developed a service, CVPR-2025 Research Report, that synthesizes the latest findings from CVPR 2025 into comprehensive reports. For instance, we’ve generated sample reports on Advances in 3D from Multi-View and Sensors: Insights from CVPR 2025 Papers and Advances in Image and Video Synthesis: Insights from CVPR 2025 Papers. We encourage interested users to utilize our service to create tailored reports on other emerging topics.
This curated list is created by the Paper Digest Team. Experience the cutting-edge capabilities of Paper Digest, an innovative AI-powered research platform that delivers personalized and comprehensive updates on the latest research in your field. It also empowers you to read and write articles, get answers, conduct literature reviews, and generate research reports.
Experience the full potential of our services today!
TABLE 1: Paper Digest: CVPR 2025 Papers & Highlights
No. | Paper | Author(s) |
---|---|---|
1 | DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose several key ingredients to improve performance on both global and dense tasks, such as concatenating the [CLS] token with the patch average to train the alignment and curating data using both text and image modalities. |
Cijo Jose; Théo Moutakanni; Dahyun Kang; Federico Baldassarre; Timothée Darcet; Hu Xu; Daniel Li; Marc Szafraniec; Michaël Ramamonjisoa; Maxime Oquab; Oriane Siméoni; Huy V. Vo; Patrick Labatut; Piotr Bojanowski; |
2 | SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. |
Jierun Chen; Dongting Hu; Xijie Huang; Huseyin Coskun; Arpit Sahni; Aarush Gupta; Anujraaj Goyal; Dishani Lahiri; Rajesh Singh; Yerlan Idelbayev; Junli Cao; Yanyu Li; Kwang-Ting Cheng; S.-H. Gary Chan; Mingming Gong; Sergey Tulyakov; Anil Kag; Yanwu Xu; Jian Ren; |
3 | Scaling Inference Time Compute for Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how the generation performance can further improve with increased computation. |
Nanye Ma; Shangyuan Tong; Haolin Jia; Hexiang Hu; Yu-Chuan Su; Mingda Zhang; Xuan Yang; Yandong Li; Tommi Jaakkola; Xuhui Jia; Saining Xie; |
4 | Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. |
Chaoyou Fu; Yuhan Dai; Yongdong Luo; Lei Li; Shuhuai Ren; Renrui Zhang; Zihan Wang; Chenyu Zhou; Yunhang Shen; Mengdan Zhang; Peixian Chen; Yanwei Li; Shaohui Lin; Sirui Zhao; Ke Li; Tong Xu; Xiawu Zheng; Enhong Chen; Caifeng Shan; Ran He; Xing Sun; |
5 | ChatHuman: Chatting About 3D Humans with Tools Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While widely applicable in vision and other areas, such methods require expert knowledge to select, use, and interpret the results. To address this, we introduce ChatHuman, a language-driven system that combines and integrates the skills of these specialized methods. |
Jing Lin; Yao Feng; Weiyang Liu; Michael J. Black; |
6 | Diffusion Model Is Effectively Its Own Teacher Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a novel self-distillation paradigm for improving the performance of diffusion models. |
Xinyin Ma; Runpeng Yu; Songhua Liu; Gongfan Fang; Xinchao Wang; |
7 | VisionArena: 230k Real World User-VLM Conversations with Preference Labels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce VisionArena, the largest existing dataset of crowdsourced real-world conversations between users and VLMs. |
Christopher Chou; Lisa Dunlap; Koki Mashita; Krishna Mandal; Trevor Darrell; Ion Stoica; Joseph E. Gonzalez; Wei-Lin Chiang; |
8 | DiffLocks: Generating 3D Hair from A Single Image Using Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These approaches fail to reconstruct detailed hair, struggle with curly hair, or are limited to handling only a few hairstyles. To overcome these limitations, we propose DiffLocks, a novel framework that enables detailed reconstruction of a wide variety of hairstyles directly from a single image. |
Radu Alexandru Rosu; Keyu Wu; Yao Feng; Youyi Zheng; Michael J. Black; |
9 | MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce MotionBench, a benchmark designed to assess and improve the fine-grained video motion understanding capabilities of vision language models. |
Wenyi Hong; Yean Cheng; Zhuoyi Yang; Weihan Wang; Lefan Wang; Xiaotao Gu; Shiyu Huang; Yuxiao Dong; Jie Tang; |
10 | Omni-ID: Holistic Identity Representation Designed for Generative Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Omni-ID, a novel facial representation designed specifically for generative tasks. |
Guocheng Qian; Kuan-Chieh Wang; Or Patashnik; Negin Heravi; Daniil Ostashev; Sergey Tulyakov; Daniel Cohen-Or; Kfir Aberman; |
11 | VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis Through User Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most evaluations rely on traditional methods like multiple-choice question answering in benchmarks such as VideoMME and LongVideoBench, which often lack the depth needed to capture the complex demands of real-world users. To address this limitation–and due to the prohibitive cost and slow pace of human annotation for video tasks–we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena’s framework, designed to automatically assess LMMs’ video analysis abilities. |
Ziyang Luo; Haoning Wu; Dongxu Li; Jing Ma; Mohan Kankanhalli; Junnan Li; |
12 | DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we aim to incorporate the sub-quadratic modeling capability of Gated Linear Attention (GLA) into the 2D diffusion backbone. |
Lianghui Zhu; Zilong Huang; Bencheng Liao; Jun Hao Liew; Hanshu Yan; Jiashi Feng; Xinggang Wang; |
13 | Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Our key contribution is a collection of new datasets called PixMo, including a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset, all collected without the use of external VLMs. |
Matt Deitke; Christopher Clark; Sangho Lee; Rohun Tripathi; Yue Yang; Jae Sung Park; Mohammadreza Salehi; Niklas Muennighoff; Kyle Lo; Luca Soldaini; Jiasen Lu; Taira Anderson; Erin Bransom; Kiana Ehsani; Huong Ngo; YenSung Chen; Ajay Patel; Mark Yatskar; Chris Callison-Burch; Andrew Head; Rose Hendrix; Favyen Bastani; Eli VanderBilt; Nathan Lambert; Yvonne Chou; Arnavi Chheda; Jenna Sparks; Sam Skjonsberg; Michael Schmitz; Aaron Sarnat; Byron Bischoff; Pete Walsh; Chris Newell; Piper Wolters; Tanmay Gupta; Kuo-Hao Zeng; Jon Borchardt; Dirk Groeneveld; Crystal Nam; Sophie Lebrecht; Caitlin Wittlif; Carissa Schoenick; Oscar Michel; Ranjay Krishna; Luca Weihs; Noah A. Smith; Hannaneh Hajishirzi; Ross Girshick; Ali Farhadi; Aniruddha Kembhavi; |
14 | Let’s Verify and Reinforce Image Generation Step By Step Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we provide the first comprehensive investigation in the potential of CoT reasoning to enhance autoregressive image generation. |
Renrui Zhang; Chengzhuo Tong; Zhizheng Zhao; Ziyu Guo; Haoquan Zhang; Manyuan Zhang; Jiaming Liu; Peng Gao; Hongsheng Li; |
15 | Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive–though subhuman–visual-spatial intelligence. |
Jihan Yang; Shusheng Yang; Anjali W. Gupta; Rilyn Han; Li Fei-Fei; Saining Xie; |
16 | FoundationStereo: Zero-Shot Stereo Matching Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero shot generalization. |
Bowen Wen; Matthew Trepte; Joseph Aribido; Jan Kautz; Orazio Gallo; Stan Birchfield; |
17 | Bias for Action: Video Implicit Neural Representations with Bias Modulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a new continuous video modeling framework based on implicit neural representations (INRs) called ActINR. |
Alper Kayabasi; Anil Kumar Vadathya; Guha Balakrishnan; Vishwanath Saragadam; |
18 | Simpler Diffusion: 1.5 FID on ImageNet512 with Pixel-space Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions. |
Emiel Hoogeboom; Thomas Mensink; Jonathan Heek; Kay Lamerigts; Ruiqi Gao; Tim Salimans; |
19 | Dynamic Camera Poses and Where to Find Them Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. |
Chris Rockwell; Joseph Tung; Tsung-Yi Lin; Ming-Yu Liu; David F. Fouhey; Chen-Hsuan Lin; |
20 | DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Good interpretation is important in scientific workflows, as it allows for better decision-making by providing insights into the data. This paper introduces an automatic way of obtaining such interpretable-by-design models, by learning programs that interleave neural networks. |
Utkarsh Mall; Cheng Perng Phoo; Mia Chiquier; Bharath Hariharan; Kavita Bala; Carl Vondrick; |
21 | OmniGen: Unified Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce OmniGen, a new diffusion model for unified image generation. |
Shitao Xiao; Yueze Wang; Junjie Zhou; Huaying Yuan; Xingrun Xing; Ruiran Yan; Chaofan Li; Shuting Wang; Tiejun Huang; Zheng Liu; |
22 | Exploring The Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis—specifically, the deep fusion of large language models (LLMs) with diffusion transformers (DiTs) for multimodal generation. |
Bingda Tang; Boyang Zheng; Sayak Paul; Saining Xie; |
23 | Science-T2I: Addressing Scientific Illusions in Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. |
Jialuo Li; Wenhao Chai; Xingyu Fu; Haiyang Xu; Saining Xie; |
24 | Estimating Body and Hand Motion in An Ego-sensed World Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present EgoAllo, a system for human motion estimation from a head-mounted device. |
Brent Yi; Vickie Ye; Maya Zheng; Yunqi Li; Lea Müller; Georgios Pavlakos; Yi Ma; Jitendra Malik; Angjoo Kanazawa; |
25 | MambaOut: Do We Really Need Mamba for Vision? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we delve into the essence of Mamba, and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. |
Weihao Yu; Xinchao Wang; |
26 | Universal Scene Graph Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce Universal SG (USG), a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing modality-invariant and modality-specific scenes. |
Shengqiong Wu; Hao Fei; Tat-seng Chua; |
27 | Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, effectively compensating for data scarcity in 4D-PSG. |
Shengqiong Wu; Hao Fei; Jingkang Yang; Xiangtai Li; Juncheng Li; Hanwang Zhang; Tat-seng Chua; |
28 | A Unified Model for Compressed Sensing MRI Across Undersampling Patterns Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a unified MRI reconstruction model robust to various measurement undersampling patterns and image resolutions. |
Armeet Singh Jatyani; Jiayun Wang; Aditi Chandrashekar; Zihui Wu; Miguel Liu-Schiaffini; Bahareh Tolooshams; Anima Anandkumar; |
29 | Continuous 3D Perception Model with Persistent State Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a unified framework capable of solving a broad range of 3D tasks. |
Qianqian Wang; Yifei Zhang; Aleksander Holynski; Alexei A. Efros; Angjoo Kanazawa; |
30 | What’s in The Image? A Deep-Dive Into The Vision of Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the mechanisms underlying how VLMs process visual information remain largely unexplored. In this paper, we conduct a thorough empirical analysis, focusing on the attention modules across layers, by which we reveal several key insights about how these models process visual data: (i) the internal representation of the query tokens (e.g., representations of "describe the image") is utilized by the model to store global image information; we demonstrate that the model generates surprisingly descriptive responses solely from these tokens, without direct access to image tokens. |
Omri Kaduri; Shai Bagon; Tali Dekel; |
31 | Decentralized Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Decentralized Diffusion Models, a scalable framework to distribute diffusion model training across independent clusters or datacenters by eliminating the dependence on a centralized, high-bandwidth networking fabric. |
David McAllister; Matthew Tancik; Jiaming Song; Angjoo Kanazawa; |
32 | Reconstructing People, Places, and Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present "Humans and Structure from Motion" (HSfM), a method for jointly reconstructing multiple human meshes, scene point clouds, and camera parameters in a metric world coordinate system from a sparse set of uncalibrated multi-view images featuring people. |
Lea Müller; Hongsuk Choi; Anthony Zhang; Brent Yi; Jitendra Malik; Angjoo Kanazawa; |
33 | StoryGPT-V: Large Language Models As Consistent Story Visualizers Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Yet, the emerging Large Language Model (LLM) showcases robust reasoning abilities to navigate through ambiguous references and process extensive sequences. Therefore, we introduce StoryGPT-V, which leverages the merits of the latent diffusion (LDM) and LLM to produce images with consistent and high-quality characters grounded on given story descriptions. |
Xiaoqian Shen; Mohamed Elhoseiny; |
34 | CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. |
Rundi Wu; Ruiqi Gao; Ben Poole; Alex Trevithick; Changxi Zheng; Jonathan T. Barron; Aleksander Holynski; |
35 | Multi-subject Open-set Personalization in Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Video Alchemist–a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. |
Tsai-Shien Chen; Aliaksandr Siarohin; Willi Menapace; Yuwei Fang; Kwot Sin Lee; Ivan Skorokhodov; Kfir Aberman; Jun-Yan Zhu; Ming-Hsuan Yang; Sergey Tulyakov; |
36 | CraftsMan3D: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a novel generative 3D modeling system, coined CraftsMan, which can generate high-fidelity 3D geometries with highly varied shapes, regular mesh topologies, and detailed surfaces, and, notably, allows for refining the geometry in an interactive manner. |
Weiyu Li; Jiarui Liu; Hongyu Yan; Rui Chen; Yixun Liang; Xuelin Chen; Ping Tan; Xiaoxiao Long; |
37 | EntitySAM: Segment Everything in Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we introduce an entity decoder to facilitate inter-object communication and an automatic prompt generator using learnable object queries. |
Mingqiao Ye; Seoung Wug Oh; Lei Ke; Joon-Young Lee; |
38 | UniK3D: Universal Camera Monocular 3D Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: These limitations severely restrict their general applicability, causing poor performance in real-world scenarios with fisheye or panoramic images and resulting in substantial context loss. To address this, we present UniK3D, the first generalizable method for monocular 3D estimation able to model any camera. |
Luigi Piccinelli; Christos Sakaridis; Mattia Segu; Yung-Hsu Yang; Siyuan Li; Wim Abbeloos; Luc Van Gool; |
39 | StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Current methods excel in generating short videos (up to 16s), but produce hard-cuts when naively extended to long video synthesis. To overcome these limitations, we present StreamingT2V, an autoregressive method that generates long videos of 2 minutes or longer with seamless transitions. |
Roberto Henschel; Levon Khachatryan; Hayk Poghosyan; Daniil Hayrapetyan; Vahram Tadevosyan; Zhangyang Wang; Shant Navasardyan; Humphrey Shi; |
40 | LLaVA-Critic: Learning to Evaluate Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. |
Tianyi Xiong; Xiyao Wang; Dong Guo; Qinghao Ye; Haoqi Fan; Quanquan Gu; Heng Huang; Chunyuan Li; |
41 | Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Existing benchmarks for multi-image question-answering are limited in scope: each question is paired with only up to 30 images, which does not fully capture the demands of large-scale retrieval tasks encountered in real-world usage. To reduce these gaps, we introduce two document haystack benchmarks, dubbed DocHaystack and InfoHaystack, designed to evaluate LMM performance on large-scale visual document retrieval and understanding. |
Jun Chen; Dannong Xu; Junjie Fei; Chun-Mei Feng; Mohamed Elhoseiny; |
42 | ReCapture: Generative Video Camera Controls for User-Provided Videos Using Masked Video Fine-Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present ReCapture, a method for generating new videos with novel camera trajectories from a single user-provided video. |
David Junhao Zhang; Roni Paiss; Shiran Zada; Nikhil Karnad; David E. Jacobs; Yael Pritch; Inbar Mosseri; Mike Zheng Shou; Neal Wadhwa; Nataniel Ruiz; |
43 | Arbitrary-steps Image Super-resolution Via Diffusion Inversion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance. |
Zongsheng Yue; Kang Liao; Chen Change Loy; |
44 | Breaking The Memory Barrier of Contrastive Loss Via Tile-Based Strategy Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the full instantiation of the similarity matrix demands substantial GPU memory, making large batch training highly resource-intensive. To address this, we propose a tile-based computation strategy that partitions the contrastive loss calculation into small blocks, avoiding full materialization of the similarity matrix. |
Zesen Cheng; Hang Zhang; Kehan Li; Sicong Leng; Zhiqiang Hu; Fei Wu; Deli Zhao; Xin Li; Lidong Bing; |
45 | LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. |
Hongjie Wang; Chih-Yao Ma; Yen-Cheng Liu; Ji Hou; Tao Xu; Jialiang Wang; Felix Juefei-Xu; Yaqiao Luo; Peizhao Zhang; Tingbo Hou; Peter Vajda; Niraj K. Jha; Xiaoliang Dai; |
46 | Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. |
Yuying Ge; Yizhuo Li; Yixiao Ge; Ying Shan; |
47 | VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Besides, the lack of high-quality object-level video instruction data and a comprehensive benchmark further hinders their advancements. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding, i.e., enabling perception and reasoning on any objects throughout the video. |
Yuqian Yuan; Hang Zhang; Wentong Li; Zesen Cheng; Boqiang Zhang; Long Li; Xin Li; Deli Zhao; Wenqiao Zhang; Yueting Zhuang; Jianke Zhu; Lidong Bing; |
48 | TinyFusion: Diffusion Transformers Learned Shallow Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. |
Gongfan Fang; Kunjun Li; Xinyin Ma; Xinchao Wang; |
49 | PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we extend each image into a "static" video and introduce a unified token compression strategy called Progressive Visual Token Compression (PVC), where the tokens of each frame are progressively encoded and adaptively compressed to supplement the information not extracted from previous frames. |
Chenyu Yang; Xuan Dong; Xizhou Zhu; Weijie Su; Jiahao Wang; Hao Tian; Zhe Chen; Wenhai Wang; Lewei Lu; Jifeng Dai; |
50 | SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. |
Hao Li; Changyao Tian; Jie Shao; Xizhou Zhu; Zhaokai Wang; Jinguo Zhu; Wenhan Dou; Xiaogang Wang; Hongsheng Li; Lewei Lu; Jifeng Dai; |
51 | VISTA: Enhancing Long-Duration and High-Resolution Video Understanding By Video Spatiotemporal Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, which is mainly due to the lack of high-quality datasets. To address this issue from a data-centric perspective, we propose VISTA, a simple yet effective video spatiotemporal augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. |
Weiming Ren; Huan Yang; Jie Min; Cong Wei; Wenhu Chen; |
52 | DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, the numerous denoising steps in the robotic diffusion policy and the more dynamic, open-world nature of traffic scenes pose substantial challenges for generating diverse driving actions at a real-time speed. To address these challenges, we propose a novel truncated diffusion policy that incorporates prior multi-mode anchors and truncates the diffusion schedule, enabling the model to learn denoising from anchored Gaussian distribution to the multi-mode driving action distribution. |
Bencheng Liao; Shaoyu Chen; Haoran Yin; Bo Jiang; Cheng Wang; Sixu Yan; Xinbang Zhang; Xiangyu Li; Ying Zhang; Qian Zhang; Xinggang Wang; |
53 | HoVLE: Unleashing The Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most existing monolithic VLMs require tuning pre-trained LLMs to acquire vision abilities, which may degrade their language capabilities. To address this dilemma, this paper presents a novel high-performance monolithic VLM named HoVLE. |
Chenxin Tao; Shiqian Su; Xizhou Zhu; Chenyu Zhang; Zhe Chen; Jiawen Liu; Wenhai Wang; Lewei Lu; Gao Huang; Yu Qiao; Jifeng Dai; |
54 | Generative Video Propagation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Large-scale video generation models have the inherent ability to realistically model natural scenes. In this paper, we demonstrate that through a careful design of a generative video propagation framework, various video tasks can be addressed in a unified way by leveraging the generative power of such models. |
Shaoteng Liu; Tianyu Wang; Jui-Hsien Wang; Qing Liu; Zhifei Zhang; Joon-Young Lee; Yijun Li; Bei Yu; Zhe Lin; Soo Ye Kim; Jiaya Jia; |
55 | Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Track4Gen, a spatially aware video generator that combines video diffusion loss with point tracking across frames, providing enhanced spatial supervision on the diffusion features. |
Hyeonho Jeong; Chun-Hao P. Huang; Jong Chul Ye; Niloy J. Mitra; Duygu Ceylan; |
56 | Video-Bench: Human-Aligned Video Generation Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current video generation benchmarks fall into two main categories: traditional benchmarks, which use metrics and embeddings to evaluate generated video quality across multiple dimensions but often lack alignment with human judgments; and large language model (LLM)-based benchmarks, which, though capable of human-like reasoning, are constrained by a limited understanding of video quality metrics and cross-modal consistency. To address these challenges and establish a benchmark that better aligns with human preferences, this paper introduces Video-Bench, a comprehensive benchmark featuring a rich prompt suite and extensive evaluation dimensions. |
Hui Han; Siyuan Li; Jiaqi Chen; Yiwen Yuan; Yuling Wu; Yufan Deng; Chak Tou Leong; Hanwen Du; Junchen Fu; Youhua Li; Jie Zhang; Chi Zhang; Li-jia Li; Yongxin Ni; |
57 | Keyframe-Guided Creative Video Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce VideoRepainter, a two-stage framework that allows users to inpaint a keyframe using established image-level techniques, then propagate the changes to other frames. |
Yuwei Guo; Ceyuan Yang; Anyi Rao; Chenlin Meng; Omer Bar-Tal; Shuangrui Ding; Maneesh Agrawala; Dahua Lin; Bo Dai; |
58 | Mono-InternVL: Pushing The Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on monolithic Multimodal Large Language Models (MLLMs) that integrate visual encoding and language decoding into a single LLM. |
Gen Luo; Xue Yang; Wenhan Dou; Zhaokai Wang; Jiawen Liu; Jifeng Dai; Yu Qiao; Xizhou Zhu; |
59 | ECBench: Can Multi-modal Foundation Models Understand The Egocentric World? A Holistic Embodied Cognition Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed. To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. |
Ronghao Dang; Yuqian Yuan; Wenqi Zhang; Yifei Xin; Boqiang Zhang; Long Li; Liuyi Wang; Qinyang Zeng; Xin Li; Lidong Bing; |
60 | TriTex: Learning Texture from A Single Mesh Via Triplane Semantic Features Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present TriTex, a novel approach that learns a volumetric texture field from a single textured mesh by mapping semantic features to surface colors. |
Dana Cohen-Bar; Daniel Cohen-Or; Gal Chechik; Yoni Kasten; |
61 | Generative Photomontage Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a framework for creating the desired image by compositing it from various parts of generated images, in essence forming a Generative Photomontage. |
Sean J. Liu; Nupur Kumari; Ariel Shamir; Jun-Yan Zhu; |
62 | MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework (MMAudio). |
Ho Kei Cheng; Masato Ishii; Akio Hayakawa; Takashi Shibuya; Alexander Schwing; Yuki Mitsufuji; |
63 | PEACE: Empowering Geologic Map Holistic Understanding with MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To quantify this gap, we construct **GeoMap-Bench**, the first-ever benchmark for evaluating MLLMs in geologic map understanding, which assesses the full-scale abilities in extracting, referring, grounding, reasoning, and analyzing. To bridge this gap, we introduce **GeoMap-Agent**, the inaugural agent designed for geologic map understanding, which features three modules: Hierarchical Information Extraction (HIE), Domain Knowledge Injection (DKI), and Prompt-enhanced Question Answering (PEQA). |
Yangyu Huang; Tianyi Gao; Haoran Xu; Qihao Zhao; Yang Song; Zhipeng Gui; Tengchao Lv; Hao Chen; Lei Cui; Scarlett Li; Furu Wei; |
64 | Magma: A Foundation Model for Multimodal AI Agents Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. |
Jianwei Yang; Reuben Tan; Qianhui Wu; Ruijie Zheng; Baolin Peng; Yongyuan Liang; Yu Gu; Mu Cai; Seonghyeon Ye; Joel Jang; Yuquan Deng; Jianfeng Gao; |
65 | Synthetic Prior for Few-Shot Drivable Head Avatar Inversion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present SynShot, a novel method for the few-shot inversion of a drivable head avatar based on a synthetic prior. |
Wojciech Zielonka; Stephan J. Garbin; Alexandros Lattas; George Kopanas; Paulo Gotardo; Thabo Beeler; Justus Thies; Timo Bolkart; |
66 | Poly-Autoregressive Prediction for Modeling Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a simple framework for predicting the behavior of an agent in multi-agent settings. |
Neerja Thakkar; Tara Sadjadpour; Jathushan Rajasegeran; Shiry Ginosar; Jitendra Malik; |
67 | DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this issue, we propose a new task–multi-round dual-speaker interaction for 3D talking head generation–which requires models to handle and generate both speaking and listening behaviors in continuous conversation. To solve this task, we introduce DualTalk, a novel unified framework that integrates the dynamic behaviors of speakers and listeners to simulate realistic and coherent dialogue interactions. |
Ziqiao Peng; Yanbo Fan; Haoyu Wu; Xuan Wang; Hongyan Liu; Jun He; Zhaoxin Fan; |
68 | VisionZip: Longer Is Better But Not Necessary in Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. |
Senqiao Yang; Yukang Chen; Zhuotao Tian; Chengyao Wang; Jingyao Li; Bei Yu; Jiaya Jia; |
69 | Scaling Properties of Diffusion Models For Perceptual Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. |
Rahul Ravishankar; Zeeshan Patel; Jathushan Rajasegaran; Jitendra Malik; |
70 | Pose Priors from Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Language is often used to describe physical interaction, yet most 3D human pose estimation methods overlook this rich source of information. We bridge this gap by leveraging large multimodal models (LMMs) as priors for reconstructing contact poses, offering a scalable alternative to traditional methods that rely on human annotations or motion capture data. |
Sanjay Subramanian; Evonne Ng; Lea Müller; Dan Klein; Shiry Ginosar; Trevor Darrell; |
71 | Perturb-and-Revise: Flexible 3D Editing with Generative Trajectories Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose Perturb-and-Revise, which enables a wide variety of NeRF edits. |
Susung Hong; Johanna Karras; Ricardo Martin-Brualla; Ira Kemelmacher-Shlizerman; |
72 | DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce Difix3D+, a novel pipeline designed to enhance 3D reconstruction and novel-view synthesis through single-step diffusion models. |
Jay Zhangjie Wu; Yuxuan Zhang; Haithem Turki; Xuanchi Ren; Jun Gao; Mike Zheng Shou; Sanja Fidler; Zan Gojcic; Huan Ling; |
73 | Detect Any Mirrors: Boosting Learning Reliability on Large-Scale Unlabeled Data with An Iterative Data Engine Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this issue, we first collect a large-scale dataset of approximately 0.4 million mirror-related images from the internet, significantly expanding the data scale for mirror detection. To effectively exploit this unlabeled dataset, we propose the first semi-supervised framework (namely an iterative data engine) consisting of four steps: (1) mirror detection model training, (2) pseudo label prediction, (3) dual guidance scoring, and (4) selection of highly reliable pseudo labels. |
Zhaohu Xing; Lihao Liu; Yijun Yang; Hongqiu Wang; Tian Ye; Sixiang Chen; Wenxue Li; Guang Liu; Lei Zhu; |
74 | Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos. |
Linyi Jin; Richard Tucker; Zhengqi Li; David Fouhey; Noah Snavely; Aleksander Holynski; |
75 | Gaussian Eigen Models for Human Heads Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current personalized neural head avatars face a trade-off: lightweight models lack detail and realism, while high-quality, animatable avatars require significant computational resources, making them unsuitable for commodity devices. To address this gap, we introduce Gaussian Eigen Models (GEM), which provide high-quality, lightweight, and easily controllable head avatars. |
Wojciech Zielonka; Timo Bolkart; Thabo Beeler; Justus Thies; |
76 | Unveiling The Mist Over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To unveil the "mist", we propose Beacon3D, a benchmark for 3D-VL grounding and QA tasks, delivering a perspective shift in the evaluation of 3D-VL understanding. |
Jiangyong Huang; Baoxiong Jia; Yan Wang; Ziyu Zhu; Xiongkun Linghu; Qing Li; Song-Chun Zhu; Siyuan Huang; |
77 | Stable Flow: Vital Layers for Training-Free Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, they exhibit limited generation diversity. In this work, we leverage this limitation to perform consistent image edits via selective injection of attention features. |
Omri Avrahami; Or Patashnik; Ohad Fried; Egor Nemchinov; Kfir Aberman; Dani Lischinski; Daniel Cohen-Or; |
78 | ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce ManiVideo, a novel method for generating consistent and temporally coherent bimanual hand-object manipulation videos from given motion sequences of hands and objects. |
Youxin Pang; Ruizhi Shao; Jiajun Zhang; Hanzhang Tu; Yun Liu; Boyao Zhou; Hongwen Zhang; Yebin Liu; |
79 | BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose to decompose videos into visual primitives — blob video representation, a general representation for controllable video generation. |
Weixi Feng; Chao Liu; Sifei Liu; William Yang Wang; Arash Vahdat; Weili Nie; |
80 | From Slow Bidirectional to Fast Autoregressive Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To enable stable and high-quality distillation, we introduce a student initialization scheme based on teacher’s ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. |
Tianwei Yin; Qiang Zhang; Richard Zhang; William T. Freeman; Fredo Durand; Eli Shechtman; Xun Huang; |
81 | Evaluating Model Perception of Color Illusions in Photorealistic Scenes Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose an automated framework for generating color illusion images, resulting in RCID (Realistic Color Illusion Dataset), a dataset of 19,000 realistic illusion images. |
Lingjun Mao; Zineng Tang; Alane Suhr; |
82 | FastVLM: Efficient Vision Encoding for Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce FastVLM, which achieves an optimized trade-off between resolution, latency, and accuracy by incorporating FastViTHD–a new hybrid vision encoder that outputs fewer tokens and significantly reduces encoding time while processing high-resolution images. |
Pavan Kumar Anasosalu Vasu; Fartash Faghri; Chun-Liang Li; Cem Koc; Nate True; Albert Antony; Gokula Santhanam; James Gabriel; Peter Grasch; Oncel Tuzel; Hadi Pouransari; |
83 | DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To overcome the limitations above, we propose DAGSM, a novel pipeline that generates disentangled human bodies and garments from the given text prompts. |
Jingyu Zhuang; Di Kang; Linchao Bao; Liang Lin; Guanbin Li; |
84 | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches. This work represents a step toward more efficient and versatile vision-language models. |
Yiyang Ma; Xingchao Liu; Xiaokang Chen; Wen Liu; Chengyue Wu; Zhiyu Wu; Zizheng Pan; Zhenda Xie; Haowei Zhang; Xingkai Yu; Liang Zhao; Yisong Wang; Jiaying Liu; Chong Ruan; |
85 | AniGS: Animatable Gaussian Avatar from A Single Image with Inconsistent Gaussian Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing 3D reconstruction methods often struggle to capture fine details in animatable models, while generative approaches for controllable animation, though avoiding explicit 3D modeling, suffer from viewpoint inconsistencies in extreme poses and computational inefficiencies. In this paper, we address these challenges by leveraging the power of generative models to produce detailed multi-view canonical pose images, which help resolve ambiguities in animatable human reconstruction. |
Lingteng Qiu; Shenhao Zhu; Qi Zuo; Xiaodong Gu; Yuan Dong; Junfei Zhang; Chao Xu; Zhe Li; Weihao Yuan; Liefeng Bo; Guanying Chen; Zilong Dong; |
86 | Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel neural rendering framework to accomplish the unposed and extremely sparse-view 3D reconstruction in unbounded 360° scenes. |
Chong Bao; Xiyu Zhang; Zehao Yu; Jiale Shi; Guofeng Zhang; Songyou Peng; Zhaopeng Cui; |
87 | RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: For example, datasets frequently do not capture reference frame comprehension, yet effective spatial reasoning requires understanding whether to reason from ego-, world-, or object-centric perspectives. To address this issue, we introduce RoboSpatial, a large-scale dataset for spatial understanding in robotics. |
Chan Hee Song; Valts Blukis; Jonathan Tremblay; Stephen Tyree; Yu Su; Stan Birchfield; |
88 | SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present SF3D, a novel method for rapid and high-quality textured object mesh reconstruction from a single image in just 0.5 seconds. |
Mark Boss; Zixuan Huang; Aaryaman Vasishta; Varun Jampani; |
89 | DriveGPT4-V2: Harnessing Large Language Model Capabilities for Enhanced Closed-Loop Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike the previous work, DriveGPT4-V1, which focused on open-loop tasks, this study explores the capabilities of LLMs in enhancing closed-loop autonomous driving. |
Zhenhua Xu; Yan Bai; Yujia Zhang; Zhuoling Li; Fei Xia; Kwan-Yee K. Wong; Jianqiang Wang; Hengshuang Zhao; |
90 | F-LMM: Grounding Frozen Large Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we comprehensively evaluate state-of-the-art grounding LMMs across a suite of multimodal question-answering benchmarks, observing drastic performance drops that indicate vanishing general knowledge comprehension and weakened instruction following ability. |
Size Wu; Sheng Jin; Wenwei Zhang; Lumin Xu; Wentao Liu; Wei Li; Chen Change Loy; |
91 | AutoPresent: Designing Structured Visuals from Scratch Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. |
Jiaxin Ge; Zora Zhiruo Wang; Xuhui Zhou; Yi-Hao Peng; Sanjay Subramanian; Qinyue Tan; Maarten Sap; Alane Suhr; Daniel Fried; Graham Neubig; Trevor Darrell; |
92 | HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce HunyuanPortrait, a diffusion-based condition control method that employs implicit representations for highly controllable and lifelike portrait animation. |
Zunnan Xu; Zhentao Yu; Zixiang Zhou; Jun Zhou; Xiaoyu Jin; Fa-ting Hong; Xiaozhong Ji; Junwei Zhu; Chengfei Cai; Shiyu Tang; Qin Lin; Xiu Li; Qinglin Lu; |
93 | DreamOmni: Unified Image Generation and Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in computer vision, while text-to-image (T2I) models have significantly improved generation quality through scaling up, their framework design did not initially consider how to unify with downstream tasks, such as various types of editing. To address this, we introduce DreamOmni, a unified model for image generation and editing. |
Bin Xia; Yuechen Zhang; Jingyao Li; Chengyao Wang; Yitong Wang; Xinglong Wu; Bei Yu; Jiaya Jia; |
94 | Dual Consolidation for Pre-Trained Model-Based Domain-Incremental Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose DUal ConsolidaTion (Duct) to unify and consolidate historical knowledge at both the representation and classifier levels. |
Da-Wei Zhou; Zi-Wen Cai; Han-Jia Ye; Lijun Zhang; De-Chuan Zhan; |
95 | AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we analyze camera motion from a first principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. |
Sherwin Bahmani; Ivan Skorokhodov; Guocheng Qian; Aliaksandr Siarohin; Willi Menapace; Andrea Tagliasacchi; David B. Lindell; Sergey Tulyakov; |
96 | Video Depth Without Video Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. |
Bingxin Ke; Dominik Narnhofer; Shengyu Huang; Lei Ke; Torben Peters; Katerina Fragkiadaki; Anton Obukhov; Konrad Schindler; |
97 | RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Traditional feedback learning for hallucination reduction relies on labor-intensive manual labeling or expensive proprietary models. This leaves the community without foundational knowledge about how to build high-quality feedback with open-source MLLMs. In this work, we introduce RLAIF-V, a novel framework that aligns MLLMs in a fully open-source paradigm. |
Tianyu Yu; Haoye Zhang; Qiming Li; Qixin Xu; Yuan Yao; Da Chen; Xiaoman Lu; Ganqu Cui; Yunkai Dang; Taiwen He; Xiaocheng Feng; Jun Song; Bo Zheng; Zhiyuan Liu; Tat-Seng Chua; Maosong Sun; |
98 | PromptHMR: Promptable Human Mesh Recovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. |
Yufu Wang; Yu Sun; Priyanka Patel; Kostas Daniilidis; Michael J. Black; Muhammed Kocabas; |
99 | SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present SPAR3D, a novel two-stage approach aiming to take the best of both directions. |
Zixuan Huang; Mark Boss; Aaryaman Vasishta; James M. Rehg; Varun Jampani; |
100 | Conical Visual Concentration for Efficient Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this challenge, we conduct an empirical study revealing that all visual tokens are necessary for LVLMs in the shallow layers, and token redundancy progressively increases in the deeper layers. To this end, we propose ViCo, a conical-style visual concentration strategy for LVLMs to boost their efficiency in both training and inference with negligible performance loss. |
Long Xing; Qidong Huang; Xiaoyi Dong; Jiajie Lu; Pan Zhang; Yuhang Zang; Yuhang Cao; Conghui He; Jiaqi Wang; Feng Wu; Dahua Lin; |
101 | Sketch Down The FLOPs: Towards Efficient Networks for Human Sketch Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we first demonstrate that existing state-of-the-art efficient light-weight models designed for photos do not work on sketches. We then propose two sketch-specific components which work in a plug-n-play manner on any efficient photo network to adapt it to work on sketch data. |
Aneeshan Sain; Subhajit Maity; Pinaki Nath Chowdhury; Shubhadeep Koley; Ayan Kumar Bhunia; Yi-Zhe Song; |
102 | Learning Visual Composition Through Improved Semantic Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we focus on simple and scalable approaches. In particular, we demonstrate that by improving weakly labeled data, i.e. captions, we can vastly improve the performance of standard contrastive learning approaches. |
Austin Stone; Hagen Soltau; Robert Geirhos; Xi Yi; Ye Xia; Bingyi Cao; Kaifeng Chen; Abhijit Ogale; Jonathon Shlens; |
103 | DexHandDiff: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce DexHandDiff, an interaction-aware diffusion planning framework for adaptive dexterous manipulation. |
Zhixuan Liang; Yao Mu; Yixiao Wang; Tianxing Chen; Wenqi Shao; Wei Zhan; Masayoshi Tomizuka; Ping Luo; Mingyu Ding; |
104 | DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. |
Wenbo Hu; Xiangjun Gao; Xiaoyu Li; Sijie Zhao; Xiaodong Cun; Yong Zhang; Long Quan; Ying Shan; |
105 | Multimodal Autoregressive Pre-training of Large Vision Encoders Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a novel method for pre-training of large-scale vision encoders. |
Enrico Fini; Mustafa Shukor; Xiujun Li; Philipp Dufter; Michal Klein; David Haldimann; Sai Aitharaju; Victor G. Turrisi da Costa; Louis Béthune; Zhe Gan; Alexander Toshev; Marcin Eichner; Moin Nabi; Yinfei Yang; Joshua Susskind; Alaaeldin El-Nouby; |
106 | Accurate Differential Operators for Hybrid Neural Fields Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This is because they do not yield accurate spatial derivatives needed for these downstream applications. In this work, we propose two ways to circumvent these challenges. |
Aditya Chetan; Guandao Yang; Zichen Wang; Steve Marschner; Bharath Hariharan; |
107 | MatAnyone: Stable Video Matting with Consistent Memory Propagation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Auxiliary-free human video matting methods, which rely solely on input frames, often struggle with complex or ambiguous backgrounds. To tackle this, we propose MatAnyone, a practical framework designed for target-assigned video matting. |
Peiqing Yang; Shangchen Zhou; Jixin Zhao; Qingyi Tao; Chen Change Loy; |
108 | Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While recent advancements produce photorealistic outputs, they frequently struggle to create videos with accurate and consistent object motion, especially in multi-object scenarios. To address these limitations, we propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation. |
Guy Yariv; Yuval Kirstain; Amit Zohar; Shelly Sheynin; Yaniv Taigman; Yossi Adi; Sagie Benaim; Adam Polyak; |
109 | Mask-Adapter: The Devil Is in The Masks for Open-Vocabulary Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Recent open-vocabulary segmentation methods adopt mask generators to predict segmentation masks and leverage pre-trained vision-language models, *e.g.*, CLIP, to classify these masks via mask pooling. Although these approaches show promising results, it is counterintuitive that accurate masks often fail to yield accurate classification results through pooling CLIP image embeddings within the mask regions. In this paper, we reveal the performance limitations of mask pooling and introduce **Mask-Adapter**, a simple yet effective method to address these challenges in open-vocabulary segmentation. Compared to directly using proposal masks, our proposed Mask-Adapter extracts *semantic activation maps* from proposal masks, providing richer contextual information and ensuring alignment between masks and CLIP. Additionally, we propose a *mask consistency loss* that encourages proposal masks with similar IoUs to obtain similar CLIP embeddings to enhance models’ robustness to varying predicted masks. Mask-Adapter integrates seamlessly into open-vocabulary segmentation methods based on mask pooling in a plug-and-play manner, delivering more accurate classification results. |
Yongkang Li; Tianheng Cheng; Bin Feng; Wenyu Liu; Xinggang Wang; |
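For readers unfamiliar with the mask-pooling baseline that this highlight critiques, a minimal sketch is given below: CLIP patch embeddings are averaged inside each proposal mask and the pooled vector is matched against class text embeddings. The inputs and names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def mask_pool_classify(patch_emb, masks, text_emb):
    """Classify proposal masks by pooling CLIP patch embeddings inside each mask.

    patch_emb: (N, D) CLIP patch embeddings for one image (N patches).
    masks:     (M, N) binary proposal masks over the patch grid.
    text_emb:  (C, D) class text embeddings.
    Returns:   (M,) predicted class index per mask.
    """
    counts = masks.sum(axis=1, keepdims=True).clip(min=1)    # avoid division by zero
    pooled = (masks @ patch_emb) / counts                    # (M, D) masked average
    pooled = pooled / (np.linalg.norm(pooled, axis=1, keepdims=True) + 1e-8)
    text = text_emb / (np.linalg.norm(text_emb, axis=1, keepdims=True) + 1e-8)
    logits = pooled @ text.T                                 # (M, C) cosine similarities
    return logits.argmax(axis=1)
```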
110 | UniReal: Universal Image Generation and Editing Via Learning Real-world Dynamics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce UniReal, a unified framework designed to address various image generation and editing tasks. |
Xi Chen; Zhifei Zhang; He Zhang; Yuqian Zhou; Soo Ye Kim; Qing Liu; Yijun Li; Jianming Zhang; Nanxuan Zhao; Yilin Wang; Hui Ding; Zhe Lin; Hengshuang Zhao; |
111 | MARBLE: Material Recomposition and Blending in CLIP-Space Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose MARBLE, a method for performing material blending and recomposing fine-grained material properties by finding material embeddings in CLIP-space and using that to control pre-trained text-to-image models. |
Ta Ying Cheng; Prafull Sharma; Mark Boss; Varun Jampani; |
112 | Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the task of predicting functional 3D scene graphs for real-world indoor environments from posed RGB-D images. |
Chenyangguang Zhang; Alexandros Delitzas; Fangjinhua Wang; Ruida Zhang; Xiangyang Ji; Marc Pollefeys; Francis Engelmann; |
113 | Exploring Timeline Control for Facial Motion Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a new control signal for facial motion generation: timeline control. |
Yifeng Ma; Jinwei Qi; Chaonan Ji; Peng Zhang; Bang Zhang; Zhidong Deng; Liefeng Bo; |
114 | AffordDP: Generalizable Diffusion Policy with Transferable Affordance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the Diffusion Policy with transferable Affordance (AffordDP), designed for generalizable manipulation across novel categories. |
Shijie Wu; Yihang Zhu; Yunao Huang; Kaizhen Zhu; Jiayuan Gu; Jingyi Yu; Ye Shi; Jingya Wang; |
115 | EgoLife: Towards Egocentric Life Assistant Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. |
Jingkang Yang; Shuai Liu; Hongming Guo; Yuhao Dong; Xiamengwei Zhang; Sicheng Zhang; Pengyun Wang; Zitang Zhou; Binzhu Xie; Ziyue Wang; Bei Ouyang; Zhengyu Lin; Marco Cominelli; Zhongang Cai; Bo Li; Yuanhan Zhang; Peiyuan Zhang; Fangzhou Hong; Joerg Widmer; Francesco Gringoli; Lei Yang; Ziwei Liu; |
116 | StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper aims to tackle the problem of photorealistic view synthesis from vehicle sensors data. |
Yunzhi Yan; Zhen Xu; Haotong Lin; Haian Jin; Haoyu Guo; Yida Wang; Kun Zhan; Xianpeng Lang; Hujun Bao; Xiaowei Zhou; Sida Peng; |
117 | Generative Gaussian Splatting for Unbounded 3D City Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose **GaussianCity**, a generative Gaussian splatting framework dedicated to efficiently synthesizing unbounded 3D cities with a single feed-forward pass. |
Haozhe Xie; Zhaoxi Chen; Fangzhou Hong; Ziwei Liu; |
118 | BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose BadToken, the first token-level backdoor attack to MLLMs. |
Zenghui Yuan; Jiawen Shi; Pan Zhou; Neil Zhenqiang Gong; Lichao Sun; |
119 | Reconstruction Vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. |
Jingfeng Yao; Bin Yang; Xinggang Wang; |
120 | GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce GaussTR, a novel Gaussian-based Transformer framework that unifies sparse 3D modeling with foundation model alignment through Gaussian representations to advance 3D spatial understanding. |
Haoyi Jiang; Liu Liu; Tianheng Cheng; Xinjie Wang; Tianwei Lin; Zhizhong Su; Wenyu Liu; Xinggang Wang; |
121 | ProReflow: Progressive Reflow with Decomposed Velocity Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As an effective solution, rectified flow aims to rectify the diffusion process of diffusion models into a straight line for few-step and even one-step generation. However, in this paper, we suggest that the original training pipeline of reflow is not optimal and introduce two techniques to improve it. |
Lei Ke; Haohang Xu; Xuefei Ning; Yu Li; Jiajun Li; Haoling Li; Yuxuan Lin; Dongsheng Jiang; Yujiu Yang; Linfeng Zhang; |
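As context for this highlight, the standard rectified-flow (reflow) objective that ProReflow sets out to improve can be sketched as follows: a velocity network is regressed toward the straight-line direction between a noise sample and a data sample. This is the generic formulation only, not the paper's progressive or decomposed-velocity variant, and the model interface is an assumption.

```python
import torch

def rectified_flow_loss(velocity_model, x1):
    """Generic rectified-flow loss for a batch of data samples x1."""
    x0 = torch.randn_like(x1)                                   # noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                                # point on the straight path
    target_v = x1 - x0                                          # constant velocity of that path
    return ((velocity_model(xt, t) - target_v) ** 2).mean()
```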
122 | MambaVision: A Hybrid Mamba-Transformer Vision Backbone Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a novel hybrid Mamba-Transformer backbone, MambaVision, specifically tailored for vision applications. |
Ali Hatamizadeh; Jan Kautz; |
123 | Generative Sparse-View Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Generative Sparse-view Gaussian Splatting (GS-GS), a general pipeline designed to enhance the rendering quality of 3D/4D Gaussian Splatting (GS) when training views are sparse. |
Hanyang Kong; Xingyi Yang; Xinchao Wang; |
124 | ScaMo: Exploring The Scaling Law in Autoregressive Motion Generation Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a scalable motion generation framework that includes the motion tokenizer Motion FSQ-VAE and a text-prefix autoregressive transformer. |
Shunlin Lu; Jingbo Wang; Zeyu Lu; Ling-Hao Chen; Wenxun Dai; Junting Dong; Zhiyang Dou; Bo Dai; Ruimao Zhang; |
125 | Model Poisoning Attacks to Federated Learning Via Multi-Round Consistency Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we make a key observation that their suboptimal effectiveness arises from only leveraging model-update consistency among malicious clients within individual training rounds, making the attack effect self-cancel across training rounds. |
Yueqi Xie; Minghong Fang; Neil Zhenqiang Gong; |
126 | Seurat: From Moving Points to Depth Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, humans often perceive relative depth intuitively by observing variations in the size and spacing of objects as they move. Inspired by this, we propose a novel method that infers relative depth by examining the spatial relationships and temporal evolution of a set of tracked 2D trajectories. |
Seokju Cho; Jiahui Huang; Seungryong Kim; Joon-Young Lee; |
127 | Enhancing Video-LLM Reasoning Via Agent-of-Thoughts Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose **A**gent-**o**f-**T**houghts **D**istillation (**AoTD**), a method that enhances models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. |
Yudi Shi; Shangzhe Di; Qirui Chen; Weidi Xie; |
128 | VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose a unified framework VideoMage for video customization over both multiple subjects and their interactive motions. |
Chi-Pin Huang; Yen-Siang Wu; Hung-Kai Chung; Kai-Po Chang; Fu-En Yang; Yu-Chiang Frank Wang; |
129 | SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they are primarily based on reconstructing pixel-level details of natural videos, which have substantial temporal redundancy, limiting their capability for semantic representation and sufficient encoding of motion dynamics. To address these issues, this paper introduces a novel SSL approach for video representation learning, dubbed SMILE, by infusing both spatial and motion semantics. |
Fida Mohammad Thoker; Letian Jiang; Chen Zhao; Bernard Ghanem; |
130 | Personalized Preference Fine-tuning of Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This lack of personalization limits the efficacy of these models. To bridge this gap, we introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. |
Meihua Dang; Anikait Singh; Linqi Zhou; Stefano Ermon; Jiaming Song; |
131 | DreamRelation: Bridging Customization and Relation Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we introduce DreamRelation, a framework that disentangles identity and relation learning using a carefully curated dataset. |
Qingyu Shi; Lu Qi; Jianzong Wu; Jinbin Bai; Jingbo Wang; Yunhai Tong; Xiangtai Li; |
132 | You See It, You Got It: Learning 3D Creation on Pose-Free Videos at Scale Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. |
Baorui Ma; Huachen Gao; Haoge Deng; Zhengxiong Luo; Tiejun Huang; Lulu Tang; Xinlong Wang; |
133 | Learning from Streaming Video with Orthogonal Gradients Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks: the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and the task of future video prediction. To address this drop, we propose a geometric modification to standard optimizers, to decorrelate batches by utilising orthogonal gradients during training. |
Tengda Han; Dilara Gokay; Joseph Heyward; Chuhan Zhang; Daniel Zoran; Viorica Patraucean; Joao Carreira; Dima Damen; Andrew Zisserman; |
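One generic way to read "orthogonal gradients" is to remove, before each optimizer step, the component of the current gradient that lies along the previous batch's gradient, so that consecutive (highly correlated) batches push the model in decorrelated directions. The sketch below shows that projection; it is an illustrative construction under that assumption, not necessarily the exact optimizer modification used in the paper.

```python
import torch

def orthogonalize(grad, prev_grad, eps=1e-12):
    """Project grad onto the complement of prev_grad to decorrelate consecutive batches."""
    g, p = grad.flatten(), prev_grad.flatten()
    coeff = torch.dot(g, p) / (torch.dot(p, p) + eps)
    return (g - coeff * p).view_as(grad)

# Usage sketch inside a training loop, where prev_grads holds the last batch's gradients:
# for param, prev in zip(model.parameters(), prev_grads):
#     param.grad = orthogonalize(param.grad, prev)
```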
134 | LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we explore large-scale training for Video LLM with cheap automatic speech recognition (ASR) transcripts. |
Joya Chen; Ziyun Zeng; Yiqi Lin; Wei Li; Zejun Ma; Mike Zheng Shou; |
135 | ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models. |
Shaofei Cai; Zihao Wang; Kewei Lian; Zhancun Mu; Xiaojian Ma; Anji Liu; Yitao Liang; |
136 | Parallel Sequence Modeling Via Generalized Spatial Propagation Network Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. |
Hongjun Wang; Wonmin Byeon; Jiarui Xu; Jinwei Gu; Ka Chun Cheung; Xiaolong Wang; Kai Han; Jan Kautz; Sifei Liu; |
137 | Let Humanoids Hike! Integrative Skill Development on Complex Trails Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose training humanoids to hike on complex trails, fostering integrative skill development across visual perception, decision making, and motor execution. |
Kwan-Yee Lin; Stella X. Yu; |
138 | LamRA: Large Multimodal Model As Your Advanced Retrieval Assistant Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we explore the possibility of re-purposing generative Large Multimodal Models (LMMs) for retrieval. |
Yikun Liu; Yajie Zhang; Jiayin Cai; Xiaolong Jiang; Yao Hu; Jiangchao Yao; Yanfeng Wang; Weidi Xie; |
139 | BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce a novel image-centric 3D perception model, BIP3D, which leverages expressive image features with explicit 3D position encoding to overcome the limitations of point-centric methods. |
Xuewu Lin; Tianwei Lin; Lichao Huang; Hongyu Xie; Zhizhong Su; |
140 | Speedy-Splat: Fast 3D Gaussian Splatting with Sparse Pixels and Sparse Primitives Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, its rendering speed and model size still present bottlenecks, especially in resource-constrained settings. In this paper, we identify and address two key inefficiencies in 3D-GS to substantially improve rendering speed. These improvements also yield the ancillary benefits of reduced model size and training time. |
Alex Hanson; Allen Tu; Geng Lin; Vasu Singla; Matthias Zwicker; Tom Goldstein; |
141 | PUP 3D-GS: Principled Uncertainty Pruning for 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a principled sensitivity pruning score that preserves visual fidelity and foreground details at significantly higher compression ratios than existing approaches. |
Alex Hanson; Allen Tu; Vasu Singla; Mayuka Jayawardhana; Matthias Zwicker; Tom Goldstein; |
142 | AudCast: Audio-Driven Human Video Generation By Cascaded Diffusion Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose AudCast, a generalized audio-driven human video generation framework adopting a cascade Diffusion-Transformers (DiTs) paradigm, which synthesizes holistic human videos based on a reference image and a given audio. |
Jiazhi Guan; Kaisiyuan Wang; Zhiliang Xu; Quanwei Yang; Yasheng Sun; Shengyi He; Borong Liang; Yukang Cao; Yingying Li; Haocheng Feng; Errui Ding; Jingdong Wang; Youjian Zhao; Hang Zhou; Ziwei Liu; |
143 | World-consistent Video Diffusion with Explicit 3D Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. |
Qihang Zhang; Shuangfei Zhai; Miguel Ángel Bautista Martin; Kevin Miao; Alexander Toshev; Joshua Susskind; Jiatao Gu; |
144 | XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing benchmarks usually adopt notably smaller image sizes than real-world RS scenarios, suffer from limited annotation quality, and consider insufficient dimensions of evaluation. To address these issues, we present XLRS-Bench: a comprehensive benchmark for evaluating the perception and reasoning capabilities of MLLMs in ultra-high-resolution RS scenarios. |
Fengxiang Wang; Hongzhen Wang; Zonghao Guo; Di Wang; Yulin Wang; Mingshuo Chen; Qiang Ma; Long Lan; Wenjing Yang; Jing Zhang; Zhiyuan Liu; Maosong Sun; |
145 | DoF-Gaussian: Controllable Depth-of-Field for 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce DoF-Gaussian, a controllable depth-of-field method for 3D-GS. |
Liao Shen; Tianqi Liu; Huiqiang Sun; Jiaqi Li; Zhiguo Cao; Wei Li; Chen Change Loy; |
146 | GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Although 3D semantic Gaussians serve as an object-centric sparse alternative, most of the Gaussians still describe empty regions with low efficiency. To address this, we propose a probabilistic Gaussian superposition model which interprets each Gaussian as a probability distribution of its neighborhood being occupied and conforms to probabilistic multiplication to derive the overall geometry. |
Yuanhui Huang; Amonnut Thammatadatrakoon; Wenzhao Zheng; Yunpeng Zhang; Dalong Du; Jiwen Lu; |
147 | DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present DI-PCG, a novel and efficient method for Inverse PCG from general image conditions. |
Wang Zhao; Yan-Pei Cao; Jiale Xu; Yuejiang Dong; Ying Shan; |
148 | BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce BOLT, a method to boost large VLMs without additional training through a comprehensive study of frame selection strategies. |
Shuming Liu; Chen Zhao; Tianqi Xu; Bernard Ghanem; |
149 | HyperNet Fields: Efficiently Training Hypernetworks Without Ground Truth By Learning Weight Trajectories Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a method to train hypernetworks, without the need for any per-sample ground truth. |
Eric Hedlin; Munawar Hayat; Fatih Porikli; Kwang Moo Yi; Shweta Mahajan; |
150 | VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose VideoHandles as a method for editing 3D object compositions in videos of static scenes with camera motion. |
Juil Koo; Paul Guerrero; Chun-Hao P. Huang; Duygu Ceylan; Minhyuk Sung; |
151 | 3DTopia-XL: Scaling High-quality 3D Asset Generation Via Primitive Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Despite recent advancements in 3D generative models, existing methods still face challenges with optimization speed, geometric fidelity, and the lack of assets for physically based rendering (PBR). In this paper, we introduce 3DTopia-XL, a scalable native 3D generative model designed to overcome these limitations. |
Zhaoxi Chen; Jiaxiang Tang; Yuhao Dong; Ziang Cao; Fangzhou Hong; Yushi Lan; Tengfei Wang; Haozhe Xie; Tong Wu; Shunsuke Saito; Liang Pan; Dahua Lin; Ziwei Liu; |
152 | UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, even within the same domain, current VAD approaches often require large amounts of normal samples to train class-specific models, resulting in poor generalizability and hindering unified evaluation across domains. To address this issue, we propose a generalized few-shot VAD method, UniVAD, capable of detecting anomalies across various domains, with a training-free unified model. |
Zhaopeng Gu; Bingke Zhu; Guibo Zhu; Yingying Chen; Ming Tang; Jinqiao Wang; |
153 | Deformable Radial Kernel Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Deformable Radial Kernel (DRK), which extends Gaussian splatting into a more general and flexible framework. |
Yi-Hua Huang; Ming-Xian Lin; Yang-Tian Sun; Ziyi Yang; Xiaoyang Lyu; Yan-Pei Cao; Xiaojuan Qi; |
154 | NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose NVComposer, a novel approach that eliminates the need for explicit external alignment. |
Lingen Li; Zhaoyang Zhang; Yaowei Li; Jiale Xu; Wenbo Hu; Xiaoyu Li; Weihao Cheng; Jinwei Gu; Tianfan Xue; Ying Shan; |
155 | Towards Universal Soccer Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper aims to develop a comprehensive multi-modal framework for soccer video understanding. Specifically, we make the following contributions: (i) we introduce **SoccerReplay-1988**, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches, with an automated annotation pipeline; (ii) we present the first visual-language foundation model in the soccer domain, **MatchVision**, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks; (iii) we conduct extensive experiments and ablation studies on action classification, commentary generation, and multi-view foul recognition, demonstrating state-of-the-art performance on all of them, substantially outperforming existing models and confirming the superiority of our proposed data and model. |
Jiayuan Rao; Haoning Wu; Hao Jiang; Ya Zhang; Yanfeng Wang; Weidi Xie; |
156 | SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a model-agnostic iterative self-improvement framework (**SILMM**) that can enable LMMs to provide helpful and scalable self-feedback and optimize text-image alignment via Direct Preference Optimization (DPO). |
Leigang Qu; Haochuan Li; Wenjie Wang; Xiang Liu; Juncheng Li; Liqiang Nie; Tat-Seng Chua; |
157 | ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot and lack the capability to generate multi-shot videos with discrete transitions where the same character performs distinct activities across the same or different backgrounds. To address this limitation, we propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. |
Ozgur Kara; Krishna Kumar Singh; Feng Liu; Duygu Ceylan; James M. Rehg; Tobias Hinz; |
158 | ROICtrl: Boosting Instance Control for Visual Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by ROI-Align in object detection, we introduce a complementary operation called ROI-Unpool. |
Yuchao Gu; Yipin Zhou; Yunfan Ye; Yixin Nie; Licheng Yu; Pingchuan Ma; Kevin Qinghong Lin; Mike Zheng Shou; |
159 | RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. |
Yao Mu; Tianxing Chen; Zanxin Chen; Shijia Peng; Zhiqian Lan; Zeyu Gao; Zhixuan Liang; Qiaojun Yu; Yude Zou; Mingkun Xu; Lunkai Lin; Zhiqiang Xie; Mingyu Ding; Ping Luo; |
160 | OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. |
Shufan Li; Konstantinos Kallidromitis; Akash Gokul; Zichun Liao; Yusuke Kato; Kazuki Kozuka; Aditya Grover; |
161 | Multiple Object Tracking As ID Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Therefore, we introduce a new perspective that treats Multiple Object Tracking as an in-context ID Prediction task, transforming the aforementioned object association into an end-to-end trainable task. Based on this, we propose a simple yet effective method termed MOTIP. |
Ruopeng Gao; Ji Qi; Limin Wang; |
162 | Parameterized Blur Kernel Prior Learning for Local Motion Deblurring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing LMD methods rely on manually annotated blur masks and often overlook the blur kernel’s characteristics, which are crucial for accurate restoration. To address these limitations, we propose a novel parameterized motion kernel modeling approach that defines the motion blur kernel with three key parameters: length, angle, and curvature. |
Zhenxuan Fang; Fangfang Wu; Tao Huang; Le Dong; Weisheng Dong; Xin Li; Guangming Shi; |
163 | Argus: A Compact and Versatile Foundation Model for Vision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While existing vision and multi-modal foundation models can handle multiple computer vision tasks, they often suffer from significant limitations, including huge demand for data and computational resources during training and inconsistent performance across vision tasks at deployment time. To address these challenges, we introduce Argus (named after Argus Panoptes, the hundred-eyed, "all-seeing" giant of Greek mythology), a compact and versatile vision foundation model designed to support a wide range of vision tasks through a unified multitask architecture. |
Weiming Zhuang; Chen Chen; Zhizhong Li; Sina Sajadmanesh; Jingtao Li; Jiabo Huang; Vikash Sehwag; Vivek Sharma; Hirotaka Shinozaki; Felan Carlo Garcia; Yihao Zhan; Naohiro Adachi; Ryoji Eki; Michael Spranger; Peter Stone; Lingjuan Lyu; |
164 | MNE-SLAM: Multi-Agent Neural SLAM for Mobile Robots Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose the first distributed multi-agent collaborative SLAM framework with distributed mapping and camera tracking, joint scene representation, intra-to-inter loop closure, and multi-submap fusion. |
Tianchen Deng; Guole Shen; Chen Xun; Shenghai Yuan; Tongxin Jin; Hongming Shen; Yanbo Wang; Jingchuan Wang; Hesheng Wang; Danwei Wang; Weidong Chen; |
165 | Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) build an effective training pipeline to enhance the reasoning capabilities of multi-modal large language models (MLLMs). |
Yuhao Dong; Zuyan Liu; Hai-Long Sun; Jingkang Yang; Winston Hu; Yongming Rao; Ziwei Liu; |
166 | MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called "mesh attention" to enable training at 1024×1024 resolution. |
Yuhan Wang; Fangzhou Hong; Shuai Yang; Liming Jiang; Wayne Wu; Chen Change Loy; |
167 | Robust Multi-Object 4D Generation for In-the-wild Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To rigorously evaluate the quality of scene generation and the accuracy of the motion under multi-object occlusions, we introduce MOSE-PTS, a subset of the challenging MOSE benchmark, which we annotated with high-quality 2D point tracks. |
Wen-Hsuan Chu; Lei Ke; Jianmeng Liu; Mingxiao Huo; Pavel Tokmakov; Katerina Fragkiadaki; |
168 | PCM: Picard Consistency Model for Fast Parallel Sampling of Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a new parallelization scheme, the Picard Consistency Model (PCM), which significantly reduces the number of generation steps in Picard iteration. |
Junhyuk So; Jiwoong Shin; Chaeyeon Jang; Eunhyeok Park; |
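As background for this highlight, Picard iteration refines an entire ODE trajectory as a fixed point, so the drift evaluations within one iteration can run in parallel across timesteps; PCM targets the number of such iterations. The toy solver below illustrates the generic idea with explicit-Euler quadrature and is not the paper's consistency model.

```python
import numpy as np

def picard_solve(f, x0, ts, n_iters=20):
    """Refine a whole trajectory of x'(t) = f(x, t) by fixed-point (Picard) iteration.

    f:  drift function f(x, t) returning an array shaped like x0.
    x0: initial state, shape (D,).
    ts: increasing array of times, shape (T,).
    """
    n = len(ts)
    xs = np.tile(np.asarray(x0, dtype=float), (n, 1))   # initial guess: constant path
    dts = np.diff(ts)
    for _ in range(n_iters):
        # All drift evaluations in this loop are independent and parallelizable.
        drifts = np.stack([f(xs[i], ts[i]) for i in range(n - 1)])
        # x_{i+1} = x_0 + sum_{j <= i} f(x_j, t_j) * dt_j  (Euler quadrature)
        xs[1:] = x0 + np.cumsum(drifts * dts[:, None], axis=0)
    return xs
```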
169 | VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce VideoComp, a benchmark and learning framework for advancing video-text compositionality understanding, aimed at improving vision-language models (VLMs) in fine-grained temporal alignment. |
Dahun Kim; AJ Piergiovanni; Ganesh Mallya; Anelia Angelova; |
170 | Inference-Scale Complexity in ANN-SNN Conversion for High-Performance and Low-Power Applications Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Even efficient ANN-SNN conversion methods necessitate quantized training of ANNs to enhance the effectiveness of the conversion, incurring additional training costs. To address these challenges, we propose an efficient ANN-SNN conversion framework with only inference scale complexity. |
Tong Bu; Maohua Li; Zhaofei Yu; |
171 | SAR3D: Autoregressive 3D Object Generation and Understanding Via Multi-scale 3D VQVAE Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. |
Yongwei Chen; Yushi Lan; Shangchen Zhou; Tengfei Wang; Xingang Pan; |
172 | Can Generative Video Models Help Pose Estimation? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach, InterPose, that leverages the rich priors encoded within pre-trained generative video models. |
Ruojin Cai; Jason Y. Zhang; Philipp Henzler; Zhengqi Li; Noah Snavely; Ricardo Martin-Brualla; |
173 | Alias-Free Latent Diffusion Models: Improving Fractional Shift Equivariance of Diffusion Latent Space Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This hinders their use in applications requiring consistent results. In this work, we redesign LDMs to enhance consistency by making them shift-equivariant. |
Yifan Zhou; Zeqi Xiao; Shuai Yang; Xingang Pan; |
174 | AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: As an alternative, we propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence in feed-forward fashion. |
Felix Wimbauer; Weirong Chen; Dominik Muhle; Christian Rupprecht; Daniel Cremers; |
175 | Enhancing Online Continual Learning with Plug-and-Play State Space Model and Class-Conditional Mixture of Discretization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they often overlook the adaptability of the model, limiting the ability to learn generalizable and discriminative features incrementally from online training data. To address this, we introduce a plug-and-play module, S6MOD, which can be integrated into most existing methods and directly improve adaptability. Specifically, S6MOD introduces an extra branch after the backbone, where a mixture of discretization selectively adjusts parameters in a selective state space model, enriching selective scan patterns such that the model can adaptively select the most sensitive discretization method for current dynamics. We further design a class-conditional routing algorithm for dynamic, uncertainty-based adjustment and implement a contrastive discretization loss to optimize it. |
Sihao Liu; Yibo Yang; Xiaojie Li; David A. Clifton; Bernard Ghanem; |
176 | DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these models often lack effective control over character appearances and interactions, particularly in multi-character scenes. To address these limitations, we propose a new task: customized manga generation and introduce DiffSensei, an innovative framework specifically designed for generating manga with dynamic multi-character control. |
Jianzong Wu; Chao Tang; Jingbo Wang; Yanhong Zeng; Xiangtai Li; Yunhai Tong; |
177 | 3D-Mem: 3D Scene Memory for Embodied Exploration and Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose 3D-Mem, a novel 3D scene memory framework for embodied agents. |
Yuncong Yang; Han Yang; Jiachen Zhou; Peihao Chen; Hongxin Zhang; Yilun Du; Chuang Gan; |
178 | PhysAnimator: Physics-Guided Generative Cartoon Animation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce PhysAnimator, a novel approach for generating physically plausible, anime-stylized animation from static anime illustrations. |
Tianyi Xie; Yiwei Zhao; Ying Jiang; Chenfanfu Jiang; |
179 | DNF: Unconditional 4D Generation with Dictionary-based Neural Fields Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose DNF, a new 4D representation for unconditional generative modeling that efficiently models deformable shapes with disentangled shape and motion while capturing high-fidelity details in the deforming objects. |
Xinyi Zhang; Naiqi Li; Angela Dai; |
180 | SimVS: Simulating World Inconsistencies for Robust View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present an approach for leveraging generative video models to simulate the inconsistencies in the world that can occur during capture. |
Alex Trevithick; Roni Paiss; Philipp Henzler; Dor Verbin; Rundi Wu; Hadi Alzayer; Ruiqi Gao; Ben Poole; Jonathan T. Barron; Aleksander Holynski; Ravi Ramamoorthi; Pratul P. Srinivasan; |
181 | LOGICZSL: Exploring Logic-induced Representation for Compositional Zero-shot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose LOGICZSL, a novel logic-induced learning framework to explicitly model the semantic relationships. |
Peng Wu; Xiankai Lu; Hao Hu; Yongqin Xian; Jianbing Shen; Wenguan Wang; |
182 | RandAR: Decoder-only Autoregressive Visual Generation in Random Orders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. |
Ziqi Pang; Tianyuan Zhang; Fujun Luan; Yunze Man; Hao Tan; Kai Zhang; William T. Freeman; Yu-Xiong Wang; |
183 | UNIALIGN: Scaling Multimodal Alignment Within One Unified Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present UNIALIGN, a unified model to align an arbitrary number of modalities (e.g., image, text, audio, 3D point cloud, etc.) through one encoder and a single training phase. |
Bo Zhou; Liulei Li; Yujia Wang; Huafeng Liu; Yazhou Yao; Wenguan Wang; |
184 | FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images (i.e., as few as 2-8 inputs), which is a challenging yet practical setting in real-world applications. |
Shangzhan Zhang; Jianyuan Wang; Yinghao Xu; Nan Xue; Christian Rupprecht; Xiaowei Zhou; Yujun Shen; Gordon Wetzstein; |
185 | InteractVLM: 3D Interaction Reasoning from 2D Foundational Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. |
Sai Kumar Dwivedi; Dimitrije Antić; Shashank Tripathi; Omid Taheri; Cordelia Schmid; Michael J. Black; Dimitrios Tzionas; |
186 | 3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we present a novel 3D enhancement pipeline, dubbed 3DEnhancer, which employs a multi-view latent diffusion model to enhance coarse 3D inputs while preserving multi-view consistency. |
Yihang Luo; Shangchen Zhou; Yushi Lan; Xingang Pan; Chen Change Loy; |
187 | LT3SD: Latent Trees for 3D Scene Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present LT3SD, a novel latent diffusion model for large-scale 3D scene generation. |
Quan Meng; Lei Li; Matthias Nießner; Angela Dai; |
188 | ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images based on a global text prompt and an anonymous region layout. |
Yifan Pu; Yiming Zhao; Zhicong Tang; Ruihong Yin; Haoxing Ye; Yuhui Yuan; Dong Chen; Jianmin Bao; Sirui Zhang; Yanbin Wang; Lin Liang; Lijuan Wang; Ji Li; Xiu Li; Zhouhui Lian; Gao Huang; Baining Guo; |
189 | MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a system that allows for accurate, fast, and robust estimation of camera parameters and depth maps from casual monocular videos of dynamic scenes. |
Zhengqi Li; Richard Tucker; Forrester Cole; Qianqian Wang; Linyi Jin; Vickie Ye; Angjoo Kanazawa; Aleksander Holynski; Noah Snavely; |
190 | Synchronized Video-to-Audio Generation Via Mel Quantization-Continuum Decomposition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control mature text-to-audio generative diffusion models, this paper presents how to balance the representation of mel-spectrograms in terms of completeness and complexity through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). |
Juncheng Wang; Chao Xu; Cheng Yu; Lei Shang; Zhe Hu; Shujun Wang; Liefeng Bo; |
191 | Scaling Vision Pre-Training to 4K Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce PS3 that scales CLIP-style vision pre-training to 4K resolution with a near-constant cost. |
Baifeng Shi; Boyi Li; Han Cai; Yao Lu; Sifei Liu; Marco Pavone; Jan Kautz; Song Han; Trevor Darrell; Pavlo Molchanov; Hongxu Yin; |
192 | SeaLion: Semantic Part-Aware Latent Point Diffusion Models for 3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, little attention has been given to generating point clouds with point-wise segmentation labels, as well as to developing evaluation metrics for this task. Therefore, in this paper, we present SeaLion, a novel diffusion model designed to generate high-quality and diverse point clouds with fine-grained segmentation labels. |
Dekai Zhu; Yan Di; Stefan Gavranovic; Slobodan Ilic; |
193 | DiC: Rethinking Conv3x3 Designs in Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Based on the architecture, we introduce conditioning improvements including stage-specific embeddings, mid-block condition injection, and conditional gating. |
Yuchuan Tian; Jing Han; Chengcheng Wang; Yuchen Liang; Chao Xu; Hanting Chen; |
194 | Layered Motion Fusion: Lifting Motion Segmentation to 3D in Egocentric Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In fact, as shown in the recent analysis conducted by EPIC Fields, 3D techniques are ineffective when it comes to studying dynamic phenomena, and, in particular, when segmenting moving objects. In this paper, we look into this issue in more detail. |
Vadim Tschernezki; Diane Larlus; Iro Laina; Andrea Vedaldi; |
195 | GG-SSMs: Graph-Generating State Space Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While methods like Mamba and VMamba introduce selective and flexible scanning strategies, they rely on predetermined paths, which fails to efficiently capture complex dependencies. We introduce Graph-Generating State Space Models (GG-SSMs), a novel framework that overcomes these limitations by dynamically constructing graphs based on feature relationships. |
Nikola Zubic; Davide Scaramuzza; |
196 | MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose MIMO, a novel framework which can not only synthesize realistic character videos with controllable attributes (i.e., character, motion and scene) provided by simple user inputs, but also simultaneously achieve advanced scalability to arbitrary characters, generality to novel 3D motions, and applicability to interactive real-world scenes in a unified framework. |
Yifang Men; Yuan Yao; Miaomiao Cui; Liefeng Bo; |
197 | LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, these task-agnostic object features include much redundant information and missing details for the task-relevant area. To tackle these problems, we propose LSceneLLM, an adaptive framework that automatically identifies task-relevant areas by leveraging LLM’s visual preference for different tasks, followed by a plug-and-play scene magnifier module to capture fine-grained details in focused areas. |
Hongyan Zhi; Peihao Chen; Junyan Li; Shuailei Ma; Xinyu Sun; Tianhang Xiang; Yinjie Lei; Mingkui Tan; Chuang Gan; |
198 | One Diffusion to Generate Them All Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce OneDiffusion, a single large-scale diffusion model designed to tackle a wide range of image synthesis and understanding tasks. |
Duong H. Le; Tuan Pham; Sangho Lee; Christopher Clark; Aniruddha Kembhavi; Stephan Mandt; Ranjay Krishna; Jiasen Lu; |
199 | WildAvatar: Learning In-the-wild 3D Avatars from The Web Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose an automatic annotating pipeline with filtering protocols to curate these humans from the web. |
Zihao Huang; Shoukang Hu; Guangcong Wang; Tianqi Liu; Yuhang Zang; Zhiguo Cao; Wei Li; Ziwei Liu; |
200 | DepthCues: Evaluating Monocular Depth Perception in Large Vision Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a new benchmark, DepthCues, designed to evaluate depth cue understanding, and present findings across 20 diverse and representative pre-trained vision models. |
Duolikun Danier; Mehmet Aygün; Changjian Li; Hakan Bilen; Oisin Mac Aodha; |
201 | Doppelgangers++: Improved Visual Disambiguation with Geometric 3D Features Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present Doppelgangers++, a method to enhance doppelganger detection and improve 3D reconstruction accuracy. |
Yuanbo Xiangli; Ruojin Cai; Hanyu Chen; Jeffrey Byrne; Noah Snavely; |
202 | SceneFactor: Factored Latent 3D Diffusion for Controllable 3D Scene Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present SceneFactor, a diffusion-based approach for large-scale 3D scene generation that enables controllable generation and effortless editing. |
Aleksey Bokhovkin; Quan Meng; Shubham Tulsiani; Angela Dai; |
203 | MeshArt: Generating Articulated Meshes with Structure-Guided Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce MeshArt, a hierarchical transformer-based approach to generate articulated 3D meshes with clean, compact geometry, reminiscent of human-crafted 3D models. |
Daoyi Gao; Yawar Siddiqui; Lei Li; Angela Dai; |
204 | DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present our observation that CLIP’s image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. |
Junjie Wang; Bin Chen; Yulin Li; Bin Kang; Yichi Chen; Zhuotao Tian; |
205 | Robust 3D Shape Reconstruction in Zero-Shot from A Single Image in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, their effectiveness is significantly compromised in real-world conditions, due to imperfect object segmentation by off-the-shelf models and the prevalence of occlusions. To effectively address these issues, we propose a unified regression model that integrates segmentation and reconstruction, specifically designed for occlusion-aware 3D shape reconstruction. |
Junhyeong Cho; Kim Youwang; Hunmin Yang; Tae-Hyun Oh; |
206 | GS-DiT: Advancing Video Generation with Dynamic 3D Gaussian Fields Through Efficient Dense 3D Point Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose a novel framework that constructs dynamic 3D Gaussian fields with dense 3D point tracking and renders the Gaussian field for all video frames. |
Weikang Bian; Zhaoyang Huang; Xiaoyu Shi; Yijin Li; Fu-Yun Wang; Hongsheng Li; |
207 | Large-Scale Text-to-Image Model with Inpainting Is A Zero-Shot Subject-Driven Image Generator Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven image generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. |
Chaehun Shin; Jooyoung Choi; Heeseung Kim; Sungroh Yoon; |
208 | Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. |
Haotong Lin; Sida Peng; Jingxiao Chen; Songyou Peng; Jiaming Sun; Minghuan Liu; Hujun Bao; Jiashi Feng; Xiaowei Zhou; Bingyi Kang; |
209 | MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Furthermore, there is currently no publicly accessible dataset specifically designed for analyzing, evaluating, and training models for long video generation. In this paper, we present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation, which addresses these challenges by providing unique contributions: (1) character consistency across scenes, (2) long videos with rich and coherent storylines, and (3) multi-scene narratives. |
Weijia Wu; Mingyu Liu; Zeyu Zhu; Xi Xia; Haoen Feng; Wen Wang; Kevin Qinghong Lin; Chunhua Shen; Mike Zheng Shou; |
210 | Visual Agentic AI for Spatial Reasoning with A Dynamic API Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. |
Damiano Marsili; Rohun Agrawal; Yisong Yue; Georgia Gkioxari; |
211 | AIpparel: A Multimodal Foundation Model for Digital Garments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Yet, the creation of garments remains a time-consuming process, largely due to the manual work involved in designing them. To simplify this process, we introduce AIpparel, a multimodal foundation model for generating and editing sewing patterns. |
Kiyohiro Nakayama; Jan Ackermann; Timur Levent Kesdogan; Yang Zheng; Maria Korosteleva; Olga Sorkine-Hornung; Leonidas J. Guibas; Guandao Yang; Gordon Wetzstein; |
212 | Diffusion Self-Distillation for Zero-Shot Customized Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. |
Shengqu Cai; Eric Ryan Chan; Yunzhi Zhang; Leonidas Guibas; Jiajun Wu; Gordon Wetzstein; |
213 | SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less Than 0.2% Training Cost Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce SAM-I2V, an effective image-to-video upgradation method for cultivating a promptable video segmentation (PVS) model. |
Haiyang Mei; Pengyu Zhang; Mike Zheng Shou; |
214 | LesionLocator: Zero-Shot Universal Tumor Segmentation and Tracking in 3D Whole-Body Imaging Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present LesionLocator, a framework for zero-shot longitudinal lesion tracking and segmentation in 3D medical imaging, establishing the first end-to-end model capable of 4D tracking with dense spatial prompts. |
Maximilian Rokuss; Yannick Kirchhoff; Seval Akbal; Balint Kovacs; Saikat Roy; Constantin Ulrich; Tassilo Wald; Lukas T. Rotkopf; Heinz-Peter Schlemmer; Klaus Maier-Hein; |
215 | Revisiting MAE Pre-training for 3D Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While SSL has revolutionized fields like natural language processing and computer vision, its adoption in 3D medical image computing has been limited by three key pitfalls: Small pre-training dataset sizes, architectures inadequate for 3D medical image analysis, and insufficient evaluation practices. In this paper, we address these issues by i) leveraging a large-scale dataset of 39k 3D brain MRI volumes and ii) using a Residual Encoder U-Net architecture within the state-of-the-art nnU-Net framework. |
Tassilo Wald; Constantin Ulrich; Stanislav Lukyanenko; Andrei Goncharov; Alberto Paderno; Maximilian Miller; Leander Maerkisch; Paul Jaeger; Klaus Maier-Hein; |
216 | POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing LVLM-based reasoning segmentation methods often suffer from imprecise segmentation results and hallucinations in their text responses. This paper introduces POPEN, a novel framework designed to address these issues and achieve improved results. |
Lanyun Zhu; Tianrun Chen; Qianxiong Xu; Xuanyi Liu; Deyi Ji; Haiyang Wu; De Wen Soh; Jun Liu; |
217 | OmniSplat: Taming Feed-Forward 3D Gaussian Splatting for Omnidirectional Images with Editable Capabilities Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose OmniSplat, a training-free fast feed-forward 3DGS generation framework for omnidirectional images. |
Suyoung Lee; Jaeyoung Chung; Kihoon Kim; Jaeyoo Huh; Gunhee Lee; Minsoo Lee; Kyoung Mu Lee; |
218 | ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While some approaches address these limitations by generating excessive trajectory candidates, they necessitate a post-processing stage to identify the most representative modes, a process lacking universal principles and compromising trajectory accuracy. We are thus motivated to introduce ModeSeq, a new multimodal prediction paradigm that models modes as sequences. |
Zikang Zhou; Hengjian Zhou; Haibo Hu; Zihao Wen; Jianping Wang; Yung-Hui Li; Yu-Kai Huang; |
219 | Flowing from Words to Pixels: A Noise-Free Framework for Cross-Modality Evolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a general and simple framework, CrossFlow, for cross-modal flow matching. |
Qihao Liu; Xi Yin; Alan Yuille; Andrew Brown; Mannat Singh; |
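For readers unfamiliar with flow matching, the sketch below shows a minimal, generic rectified-flow-style training step. It is only an illustration of standard flow matching, not the CrossFlow cross-modal formulation summarized above; the names `velocity_net` and `flow_matching_loss` and the toy dimensions are hypothetical.

```python
# Minimal, generic flow-matching training step (standard rectified-flow style;
# not the CrossFlow cross-modal formulation, which maps text embeddings to images).
import torch
import torch.nn as nn

# Small velocity network: input is the interpolated sample plus the time scalar.
velocity_net = nn.Sequential(nn.Linear(65, 128), nn.SiLU(), nn.Linear(128, 64))

def flow_matching_loss(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    """x0: source samples, x1: target samples, both of shape (batch, 64)."""
    t = torch.rand(x0.shape[0], 1)                   # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1                    # linear interpolation path
    v_target = x1 - x0                               # constant velocity along the path
    v_pred = velocity_net(torch.cat([x_t, t], dim=-1))
    return ((v_pred - v_target) ** 2).mean()

# Toy usage with random tensors standing in for paired cross-modal samples.
loss = flow_matching_loss(torch.randn(8, 64), torch.randn(8, 64))
loss.backward()
```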
220 | MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce efficient methods for pointmap matching, camera tracking and local fusion, graph construction and loop closure, and second-order global optimisation. |
Riku Murai; Eric Dexheimer; Andrew J. Davison; |
221 | MVSAnywhere: Zero-Shot Multi-View Stereo Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Training a general-purpose multi-view stereo model is challenging and raises several questions, e.g. how to best make use of transformer-based architectures, how to incorporate additional metadata when there is a variable number of input views, and how to estimate the range of valid depths which can vary considerably across different scenes and is typically not known a priori? To address these issues, we introduce MVSA, a novel and versatile Multi-View Stereo architecture that aims to work Anywhere by generalizing across diverse domains and depth ranges. |
Sergio Izquierdo; Mohamed Sayed; Michael Firman; Guillermo Garcia-Hernando; Daniyar Turmukhambetov; Javier Civera; Oisin Mac Aodha; Gabriel Brostow; Jamie Watson; |
222 | Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce Semantic Library Adaptation (SemLA), a novel framework for training-free, test-time domain adaptation. |
Reza Qorbani; Gianluca Villani; Theodoros Panagiotakopoulos; Marc Botet Colomer; Linus Härenstam-Nielsen; Mattia Segu; Pier Luigi Dovesi; Jussi Karlgren; Daniel Cremers; Federico Tombari; Matteo Poggi; |
223 | Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our objective is to translate continuous sign language into spoken language text. |
Youngjoon Jang; Haran Raajesh; Liliane Momeni; Gül Varol; Andrew Zisserman; |
224 | Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose Vid2Sim, a novel framework that effectively bridges the sim2real gap through a scalable and cost-efficient real2sim pipeline for neural 3D scene reconstruction and simulation. |
Ziyang Xie; Zhizheng Liu; Zhenghao Peng; Wayne Wu; Bolei Zhou; |
225 | Textured Gaussians for Enhanced 3D Scene Appearance Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose a new generalized Gaussian appearance representation that augments each Gaussian with alpha (A), RGB, or RGBA texture maps to model spatially varying color and opacity across the extent of each Gaussian. |
Brian Chao; Hung-Yu Tseng; Lorenzo Porzi; Chen Gao; Tuotuo Li; Qinbo Li; Ayush Saraf; Jia-Bin Huang; Johannes Kopf; Gordon Wetzstein; Changil Kim; |
226 | InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing datasets often lack extensive, high-quality motion and annotation and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements. |
Sirui Xu; Dongting Li; Yucheng Zhang; Xiyan Xu; Qi Long; Ziyin Wang; Yunzhi Lu; Shuchang Dong; Hezi Jiang; Akshat Gupta; Yu-Xiong Wang; Liang-Yan Gui; |
227 | InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. |
Sirui Xu; Hung Yu Ling; Yu-Xiong Wang; Liang-Yan Gui; |
228 | UniPhy: Learning A Unified Constitutive Model for Inverse Physics Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose UniPhy, a common latent-conditioned neural constitutive model that can encode the physical properties of diverse materials. |
Himangi Mittal; Peiye Zhuang; Hsin-Ying Lee; Shubham Tulsiani; |
229 | Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TiKZ, to build highly structured document representations and deliver contextually-grounded responses. |
Han Xiao; Yina Xie; Guanxin Tan; Yinghao Chen; Rui Hu; Ke Wang; Aojun Zhou; Hao Li; Hao Shao; Xudong Lu; Peng Gao; Yafei Wen; Xiaoxin Chen; Shuai Ren; Hongsheng Li; |
230 | HOT: Hadamard-based Optimized Training Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Achieving this goal is highly challenging, as multiple objectives must be considered jointly while maintaining training quality. In this paper, we focus on matrix multiplication, which accounts for the largest portion of training costs, and analyze its backpropagation in detail to identify lightweight techniques that offer the best benefits. |
Seonggon Kim; Juncheol Shin; Seung-taek Woo; Eunhyeok Park; |
231 | Disco4D: Disentangled 4D Human Generation and Animation from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Disco4D, a novel Gaussian Splatting framework for 4D human generation and animation from a single image. |
Hui En Pang; Shuai Liu; Zhongang Cai; Lei Yang; Tianwei Zhang; Ziwei Liu; |
232 | SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, SOLAMI builds 3D autonomous characters from three aspects: 1) Social VLA Architecture: We propose a unified social VLA framework to generate multimodal responses (speech and motion) based on the user’s multimodal input to drive the character for social interaction. |
Jianping Jiang; Weiye Xiao; Zhengyu Lin; Huaizhong Zhang; Tianxiang Ren; Yang Gao; Zhiqian Lin; Zhongang Cai; Lei Yang; Ziwei Liu; |
233 | Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present Neural LightRig, a novel framework that boosts intrinsic estimation by leveraging auxiliary multi-lighting conditions from 2D diffusion priors. |
Zexin He; Tengfei Wang; Xin Huang; Xingang Pan; Ziwei Liu; |
234 | Point Clouds Meets Physics: Dynamic Acoustic Field Fitting Network for Point Cloud Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The limited representational capacity of pure point cloud models continues to constrain the potential of cross-modal fusion methods and performance across various tasks. To address this challenge, we propose a Dynamic Acoustic Field Fitting Network (DAF-Net), inspired by physical acoustic principles. |
Changshuo Wang; Shuting He; Xiang Fang; Jiawei Han; Zhonghang Liu; Xin Ning; Weijun Li; Prayag Tiwari; |
235 | OmniManip: Towards General Robotic Manipulation Via Object-Centric Interaction Primitives As Spatial Constraints Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Fine-tuning VLMs on robotic datasets to create Vision-Language-Action Models (VLA) is a potential solution, but it is hindered by high data collection costs and generalization issues. To address these challenges, we propose a novel object-centric representation that bridges the gap between the VLM’s high-level reasoning and the low-level precision required for manipulation. |
Mingjie Pan; Jiyao Zhang; Tianshu Wu; Yinghao Zhao; Wenlong Gao; Hao Dong; |
236 | GenFusion: Closing The Loop Between Reconstruction and Generation Via Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We found that the source of this phenomenon lies in the misalignment between 3D constraints and generative priors. To address this problem, we propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. |
Sibo Wu; Congrong Xu; Binbin Huang; Andreas Geiger; Anpei Chen; |
237 | Gen3DEval: Using VLLMs for Automatic Evaluation of Generated 3D Objects Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Rapid advancements in text-to-3D generation require robust and scalable evaluation metrics that align closely with human judgment, a need unmet by current metrics such as PSNR and CLIP, which require ground-truth data or focus only on prompt fidelity. To address this, we introduce Gen3DEval, a novel evaluation framework that leverages vision large language models (vLLMs) specifically fine-tuned for 3D object quality assessment. |
Shalini Maiti; Lourdes Agapito; Filippos Kokkinos; |
238 | Goku: Flow Based Video Generative Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. |
Shoufa Chen; Chongjian Ge; Yuqi Zhang; Yida Zhang; Fengda Zhu; Hao Yang; Hongxiang Hao; Hui Wu; Zhichao Lai; Yifei Hu; Ting-Che Lin; Shilong Zhang; Fu Li; Chuan Li; Xing Wang; Yanghua Peng; Peize Sun; Ping Luo; Yi Jiang; Zehuan Yuan; Bingyue Peng; Xiaobing Liu; |
239 | FineVQ: Fine-Grained User Generated Content Video Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the challenges and promote the development of UGC videos, we establish the first large-scale Fine-grained Video quality assessment Database, termed FineVD, which comprises 6104 UGC videos with fine-grained quality scores and descriptions across multiple dimensions. Based on this database, we propose a Fine-grained Video Quality assessment (FineVQ) model to learn the fine-grained quality of UGC videos, with the capabilities of quality rating, quality scoring, and quality attribution. |
Huiyu Duan; Qiang Hu; Jiarui Wang; Liu Yang; Zitong Xu; Lu Liu; Xiongkuo Min; Chunlei Cai; Tianxiao Ye; Xiaoyun Zhang; Guangtao Zhai; |
240 | SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. |
Xuesong Chen; Linjiang Huang; Tao Ma; Rongyao Fang; Shaoshuai Shi; Hongsheng Li; |
241 | FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose FreeSim, a camera simulation method for driving scenes via 3D Gaussian Splatting and diffusion-based image generation. |
Lue Fan; Hao Zhang; Qitai Wang; Hongsheng Li; Zhaoxiang Zhang; |
242 | SceneCrafter: Controllable Multi-View Driving Scene Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To generate paired data for supervising the editing model, we propose a novel framework on top of Prompt-to-Prompt to generate geometrically consistent synthetic paired data with global edits. |
Zehao Zhu; Yuliang Zou; Chiyu Max Jiang; Bo Sun; Vincent Casser; Xiukun Huang; Jiahao Wang; Zhenpei Yang; Ruiqi Gao; Leonidas Guibas; Mingxing Tan; Dragomir Anguelov; |
243 | SegAgent: Exploring Pixel Understanding Capabilities in MLLMs By Imitating Human Annotator Trajectories Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This approach disrupts the MLLM’s text output space, potentially compromising language capabilities and reducing flexibility and extensibility, while failing to reflect the model’s intrinsic pixel-level understanding. Thus, we introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. |
Muzhi Zhu; Yuzhuo Tian; Hao Chen; Chunluan Zhou; Qingpei Guo; Yang Liu; Ming Yang; Chunhua Shen; |
244 | EffiDec3D: An Optimized Decoder for High-Performance and Efficient 3D Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose EffiDec3D, an optimized 3D decoder that employs a channel reduction strategy across all decoder stages, which sets the number of channels to the minimum needed for accurate feature representation. |
Md Mostafijur Rahman; Radu Marculescu; |
245 | IRIS: Inverse Rendering of Indoor Scenes from Low Dynamic Range Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response, we introduce IRIS, an inverse rendering framework that recovers the physically based material, spatially-varying HDR lighting, and camera response functions from multi-view, low-dynamic-range (LDR) images. |
Chih-Hao Lin; Jia-Bin Huang; Zhengqin Li; Zhao Dong; Christian Richardt; Tuotuo Li; Michael Zollhöfer; Johannes Kopf; Shenlong Wang; Changil Kim; |
246 | Distinguish Then Exploit: Source-free Open Set Domain Adaptation Via Weight Barcode Estimation and Sparse Label Assignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on the source-free open set domain adaptation problem, which includes two main challenges, i.e., how to distinguish known and unknown target samples and how to exploit useful source information to provide trustworthy pseudo labels for known target samples. |
Weiming Liu; Jun Dan; Fan Wang; Xinting Liao; Junhao Dong; Hua Yu; Shunjie Dong; Lianyong Qi; |
247 | DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present DirectTriGS, a novel framework designed for 3D object generation with Gaussian Splatting (GS). |
Xiaoliang Ju; Hongsheng Li; |
248 | ArtiFade: Learning to Generate High-quality Subject from Blemished Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This is primarily attributed to the inadequate capability of current techniques in distinguishing subject-related features from disruptive artifacts. In this paper, we introduce ArtiFade to tackle this issue and successfully generate high-quality artifact-free images from blemished datasets. |
Shuya Yang; Shaozhe Hao; Yukang Cao; Kwan-Yee K. Wong; |
249 | Omnia De EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we bring to light that current egocentric video question-answering datasets often include questions that can be answered using only a few frames or commonsense reasoning, without being necessarily grounded in the actual video. |
Chiara Plizzari; Alessio Tonioni; Yongqin Xian; Achin Kulshrestha; Federico Tombari; |
250 | UniGoal: Towards Universal Zero-shot Goal-oriented Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a general framework for universal zero-shot goal-oriented navigation. |
Hang Yin; Xiuwei Xu; Linqing Zhao; Ziwei Wang; Jie Zhou; Jiwen Lu; |
251 | 3D-MVP: 3D Multiview Pretraining for Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose 3D-MVP, a novel approach for 3D Multi-View Pretraining using masked autoencoders. |
Shengyi Qian; Kaichun Mo; Valts Blukis; David F. Fouhey; Dieter Fox; Ankit Goyal; |
252 | BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present BlueLM-V-3B, an algorithm and system co-design approach specifically tailored for the efficient deployment of MLLMs on mobile platforms. |
Xudong Lu; Yinghao Chen; Cheng Chen; Hui Tan; Boheng Chen; Yina Xie; Rui Hu; Guanxin Tan; Renshou Wu; Yan Hu; Yi Zeng; Lei Wu; Liuyang Bian; Zhaoxiong Wang; Long Liu; Yanzhou Yang; Han Xiao; Aojun Zhou; Yafei Wen; Xiaoxin Chen; Shuai Ren; Hongsheng Li; |
253 | 4DTAM: Non-Rigid Tracking and Mapping Via Dynamic Surface Gaussians Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose the first 4D tracking and mapping method that jointly performs camera localization and non-rigid surface reconstruction via differentiable rendering. |
Hidenobu Matsuki; Gwangbin Bae; Andrew J. Davison; |
254 | Accelerating Diffusion Transformer Via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose increment-calibrated caching, a training-free method for DiT acceleration, where the calibration parameters are generated from the pre-trained model itself with low-rank approximation. |
Zhiyuan Chen; Keyi Li; Yifan Jia; Le Ye; Yufei Ma; |
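The entry above relies on low-rank approximation of parameters derived from the pre-trained model itself. As a point of reference only, the snippet below shows plain truncated-SVD low-rank approximation of a weight matrix; it does not reproduce the paper's channel-aware weighting or caching calibration, and the function name `low_rank_approx` is a hypothetical placeholder.

```python
# Generic truncated-SVD low-rank approximation of a weight matrix
# (illustration of the low-rank idea only; not the paper's channel-aware variant).
import numpy as np

def low_rank_approx(w: np.ndarray, rank: int) -> np.ndarray:
    """Return the best rank-`rank` approximation of w in the Frobenius norm."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

# Toy usage: approximate a random 512x512 matrix with rank 32.
w = np.random.randn(512, 512)
w_lr = low_rank_approx(w, rank=32)
err = np.linalg.norm(w - w_lr) / np.linalg.norm(w)
print(f"relative reconstruction error at rank 32: {err:.3f}")
```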
255 | Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. |
Chengyue Wu; Xiaokang Chen; Zhiyu Wu; Yiyang Ma; Xingchao Liu; Zizheng Pan; Wen Liu; Zhenda Xie; Xingkai Yu; Chong Ruan; Ping Luo; |
256 | SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce an effective method, SemGeoMo, for dynamic contextual human motion generation, which fully leverages the text-affordance-joint multi-level semantic and geometric guidance in the generation process, improving the semantic rationality and geometric correctness of generative motions. |
Peishan Cong; Ziyi Wang; Yuexin Ma; Xiangyu Yue; |
257 | Hash3D: Training-free Acceleration for 3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce Hash3D, a universal acceleration for 3D score distillation sampling (SDS) without model training. Central to Hash3D is the observation that images rendered from similar camera positions and diffusion time-steps often have redundant feature maps. |
Xingyi Yang; Songhua Liu; Xinchao Wang; |
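The Hash3D highlight above hinges on reusing feature maps rendered from nearby camera poses and diffusion timesteps. The sketch below illustrates that generic grid-hash caching idea under simplifying assumptions; the `FeatureCache` class, the bin sizes, and the `features` helper are hypothetical and this is not the authors' implementation.

```python
# Generic sketch of reusing cached feature maps across similar camera poses and
# diffusion timesteps (illustrative only; not the Hash3D implementation).
import numpy as np

class FeatureCache:
    def __init__(self, pose_bin: float = 0.1, t_bin: int = 50):
        self.pose_bin = pose_bin   # bin width for camera position (scene units)
        self.t_bin = t_bin         # bin width for diffusion timesteps
        self.store = {}

    def _key(self, cam_pos: np.ndarray, t: int):
        # Quantize position and timestep so nearby queries hash to the same bucket.
        q = tuple(np.round(cam_pos / self.pose_bin).astype(int))
        return (q, t // self.t_bin)

    def lookup(self, cam_pos, t):
        return self.store.get(self._key(cam_pos, t))

    def insert(self, cam_pos, t, feat):
        self.store[self._key(cam_pos, t)] = feat

def features(cam_pos, t, cache, compute_fn):
    """Return cached features when a similar (pose, timestep) was seen, else compute."""
    hit = cache.lookup(cam_pos, t)
    if hit is not None:
        return hit
    feat = compute_fn(cam_pos, t)   # e.g., an expensive diffusion forward pass
    cache.insert(cam_pos, t, feat)
    return feat

# Toy usage with a dummy feature extractor: the second query lands in the same bucket.
cache = FeatureCache()
dummy = lambda p, t: np.random.randn(64, 32, 32)
f1 = features(np.array([0.51, 0.0, 1.0]), 420, cache, dummy)
f2 = features(np.array([0.52, 0.0, 1.0]), 401, cache, dummy)
assert f1 is f2
```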
258 | RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present RePerformer, a novel Gaussian-based representation that unifies playback and re-performance for high-fidelity human-centric volumetric videos. |
Yuheng Jiang; Zhehao Shen; Chengcheng Guo; Yu Hong; Zhuo Su; Yingliang Zhang; Marc Habermann; Lan Xu; |
259 | Event-based Video Super-Resolution Via State Space Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce MamEVSR, a Mamba-based network for event-based VSR that leverages the selective state space model, Mamba. |
Zeyu Xiao; Xinchao Wang; |
260 | Enhanced Then Progressive Fusion with View Graph for Multi-View Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods often encounter challenges such as feature conflicts between views and insufficient enhancement of individual view features, which hinder clustering performance. To address these challenges, we propose a novel framework, EPFMVC, which integrates feature enhancement with progressive fusion to more effectively align multi-view data. |
Zhibin Dong; Meng Liu; Siwei Wang; Ke Liang; Yi Zhang; Suyuan Liu; Jiaqi Jin; Xinwang Liu; En Zhu; |
261 | EgoLM: Multi-Modal Language Model of Egocentric Motions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce EgoLM, a versatile framework designed for egocentric motion understanding using multi-modal data. |
Fangzhou Hong; Vladimir Guzov; Hyo Jin Kim; Yuting Ye; Richard Newcombe; Ziwei Liu; Lingni Ma; |
262 | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Long-form video understanding has been a challenging task due to the high redundancy in video data and the abundance of query-irrelevant information. To tackle this challenge, we propose VideoTree, a training-free framework which builds a query-adaptive and hierarchical video representation for LLM reasoning over long-form videos. |
Ziyang Wang; Shoubin Yu; Elias Stengel-Eskin; Jaehong Yoon; Feng Cheng; Gedas Bertasius; Mohit Bansal; |
263 | FlexDrive: Toward Trajectory Flexibility in Driving Scene Gaussian Splatting Reconstruction and Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Driving scene reconstruction and rendering have advanced significantly using 3D Gaussian Splatting. However, most prior research has focused on the rendering quality along a pre-recorded vehicle path and struggles to generalize to out-of-path viewpoints, which is caused by the lack of high-quality supervision in those out-of-path views. To address this issue, we introduce an Inverse View Warping technique to create compact and high-quality images as supervision for the reconstruction of the out-of-path views, enabling high-quality rendering results for those views. For accurate and robust inverse view warping, a depth bootstrap strategy is proposed to obtain on-the-fly dense depth maps during the optimization process, overcoming the sparsity and incompleteness of LiDAR depth data. Our method achieves superior in-path and out-of-path reconstruction and rendering performance on the widely used Waymo Open dataset. In addition, a simulator-based benchmark is proposed to obtain the out-of-path ground truth and quantitatively evaluate the performance of out-of-path rendering, where our method outperforms previous methods by a significant margin. |
Jingqiu Zhou; Lue Fan; Linjiang Huang; Xiaoyu Shi; Si Liu; Zhaoxiang Zhang; Hongsheng Li; |
264 | PIAD: Pose and Illumination Agnostic Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the Pose and Illumination agnostic Anomaly Detection (PIAD) problem, a generalization of pose-agnostic anomaly detection (PAD). |
Kaichen Yang; Junjie Cao; Zeyu Bai; Zhixun Su; Andrea Tagliasacchi; |
265 | Generating 3D-Consistent Videos from Unposed Internet Photos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We address the problem of generating videos from unposed internet photos. |
Gene Chou; Kai Zhang; Sai Bi; Hao Tan; Zexiang Xu; Fujun Luan; Bharath Hariharan; Noah Snavely; |
266 | TSP-Mamba: The Travelling Salesman Problem Meets Mamba for Image Super-resolution and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we reconsider the local optimal scanning path in Mamba, enhancing the rigid and uniform 1D scan through the local shortest path theory, thus creating a structure-aware Mamba suited for lightweight single-image super-resolution. |
Kun Zhou; Xinyu Lin; Jiangbo Lu; |
267 | Unbiased Video Scene Graph Generation Via Visual and Semantic Dual Debiasing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, VidSGG is challenged by significant biases that skew predictions. To mitigate these biases, we propose a VIsual and Semantic Awareness (VISA) framework for unbiased VidSGG. |
Yanjun Li; Zhaoyang Li; Honghui Chen; Lizhi Xu; |
268 | Diffusion Renderer: Neural Inverse and Forward Rendering with Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Classic physically-based rendering (PBR) accurately simulates the light transport, but relies on precise scene representations–explicit 3D geometry, high-quality material properties, and lighting conditions–that are often impractical to obtain in real-world scenarios. Therefore, we introduce Diffusion Renderer, a neural approach that addresses the dual problem of inverse and forward rendering within a holistic framework. |
Ruofan Liang; Zan Gojcic; Huan Ling; Jacob Munkberg; Jon Hasselgren; Chih-Hao Lin; Jun Gao; Alexander Keller; Nandita Vijaykumar; Sanja Fidler; Zian Wang; |
269 | FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, real-time rendering with 3DGS remains a challenging problem, particularly in large-scale, high-resolution scenes due to the presence of numerous anisotropic Gaussian representations, and it has not been extensively explored. To address this challenge, we introduce FlashGS, an open-source CUDA library with Python bindings, featuring comprehensive algorithm design and optimizations, including redundancy elimination, adaptive scheduling, and efficient pipelining. |
Guofeng Feng; Siyan Chen; Rong Fu; Zimu Liao; Yi Wang; Tao Liu; Boni Hu; Linning Xu; Zhilin Pei; Hengjie Li; Xiuhong Li; Ninghui Sun; Xingcheng Zhang; Bo Dai; |
270 | SketchFusion: Learning Universal Sketch Features Through Fusing Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Through systematic analysis, we uncover two fundamental limitations: Stable Diffusion (SD) struggles to extract meaningful features from abstract sketches (unlike its success with photos), and exhibits a pronounced frequency-domain bias that suppresses essential low-frequency components needed for sketch understanding. Rather than costly retraining, we address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD’s spatial-frequency biases. |
Subhadeep Koley; Tapas Kumar Dutta; Aneeshan Sain; Pinaki Nath Chowdhury; Ayan Kumar Bhunia; Yi-Zhe Song; |
271 | Floating No More: Object-Ground Reconstruction from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This limitation significantly affects 3D-aware image editing applications like shadow rendering and object pose manipulation. To address this issue, we introduce ORG (Object Reconstruction with Ground), a novel task aimed at reconstructing 3D object geometry in conjunction with the ground surface. |
Yunze Man; Yichen Sheng; Jianming Zhang; Liang-Yan Gui; Yu-Xiong Wang; |
272 | Rethinking Reconstruction and Denoising in The Dark: New Perspective, General Architecture and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work introduces a novel approach by rethinking denoising and reconstruction from a "backbone-head" perspective, leveraging the stronger shared parameter space offered by the backbone, compared to the encoder used in existing works. |
Tengyu Ma; Long Ma; Ziye Li; Yuetong Wang; Jinyuan Liu; Chengpei Xu; Risheng Liu; |
273 | DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, we propose DoraCycle, which integrates two multimodal cycles: text-to-image-to-text and image-to-text-to-image. |
Rui Zhao; Weijia Mao; Mike Zheng Shou; |
274 | Be More Specific: Evaluating Object-centric Realism in Synthetic Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we define a new standard for assessing object-centric realism that follows a shape-texture breakdown, and propose the first object-centric realism evaluation dataset for synthetic images. |
Anqi Liang; Ciprian Corneanu; Qianli Feng; Giorgio Giannone; Aleix Martinez; |
275 | Are Images Indistinguishable to Humans Also Indistinguishable to Classifiers? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, through distribution classification tasks, we reveal that, from the perspective of neural network-based classifiers, even advanced diffusion models are still far from this goal. Specifically, classifiers are able to consistently and effortlessly distinguish real images from generated ones across various settings. |
Zebin You; Xinyu Zhang; Hanzhong Guo; Jingdong Wang; Chongxuan Li; |
276 | Volumetric Surfaces: Representing Fuzzy Geometries with Layered Meshes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a novel representation for real-time view synthesis where (P1) the number of sampling locations is small and bounded, (P2) sampling locations are efficiently found via rasterization, and (P3) rendering is sorting-free. |
Stefano Esposito; Anpei Chen; Christian Reiser; Samuel Rota Bulò; Lorenzo Porzi; Katja Schwarz; Christian Richardt; Michael Zollhöfer; Peter Kontschieder; Andreas Geiger; |
277 | Neural Inverse Rendering from Propagating Light Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present the first system for physically based, neural inverse rendering from multi-viewpoint videos of propagating light. |
Anagh Malik; Benjamin Attal; Andrew Xie; Matthew O’Toole; David B. Lindell; |
278 | Continuous Locomotive Crowd Behavior Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a novel method for automatically generating continuous, realistic crowd trajectories with heterogeneous behaviors and interactions among individuals. |
Inhwan Bae; Junoh Lee; Hae-Gon Jeon; |
279 | CoSER: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce CoSER, a novel consistent dense Multiview Text-to-Image Generator for Text-to-3D, achieving both efficiency and quality by meticulously learning neighbor-view coherence and further alleviating ambiguity through the swift traversal of all views. |
Bonan Li; Zicheng Zhang; Xingyi Yang; Xinchao Wang; |
280 | EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. |
Yinan Liang; Ziwei Wang; Xiuwei Xu; Jie Zhou; Jiwen Lu; |
281 | Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. |
Wenxuan Guo; Xiuwei Xu; Ziwei Wang; Jianjiang Feng; Jie Zhou; Jiwen Lu; |
282 | MoSca: Dynamic Gaussian Fusion from Casual Videos Via 4D Motion Scaffolds Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce 4D Motion Scaffolds (MoSca), a modern 4D reconstruction system designed to reconstruct and synthesize novel views of dynamic scenes from monocular videos captured casually in the wild. |
Jiahui Lei; Yijia Weng; Adam W. Harley; Leonidas Guibas; Kostas Daniilidis; |
283 | UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce UniPre3D, the first unified pre-training method that can be seamlessly applied to point clouds of any scale and 3D models of any architecture. |
Ziyi Wang; Yanran Zhang; Jie Zhou; Jiwen Lu; |
284 | Improving Personalized Search with Regularized Low-Rank Parameter Updates Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we show how to effectively adapt the internal representation of a vision-language dual encoder model for personalized vision-language retrieval. |
Fiona Ryan; Josef Sivic; Fabian Caba Heilbron; Judy Hoffman; James M. Rehg; Bryan Russell; |
285 | Distilled Prompt Learning for Incomplete Multimodal Survival Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current approaches tackling incomplete modalities often fall short, as they typically compensate for only a limited part of the knowledge of missing modalities. To address this issue, we propose a Distilled Prompt Learning framework (DisPro) to utilize the strong robustness of Large Language Models (LLMs) to missing modalities, which employs two-stage prompting for compensation of comprehensive information for missing modalities. |
Yingxue Xu; Fengtao Zhou; Chenyu Zhao; Yihui Wang; Can Yang; Hao Chen; |
286 | POSTA: A Go-to Framework for Customized Artistic Poster Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Prior work has explored automatic poster design using deep learning techniques, but these approaches lack text accuracy, user customization, and aesthetic appeal, limiting their applicability in artistic domains such as movies and exhibitions, where both clear content delivery and visual impact are essential. To address these limitations, we present POSTA: a modular framework powered by diffusion models and multimodal large language models (MLLMs) for customized artistic poster generation. |
Haoyu Chen; Xiaojie Xu; Wenbo Li; Jingjing Ren; Tian Ye; Songhua Liu; Ying-Cong Chen; Lei Zhu; Xinchao Wang; |
287 | Scaling Mesh Generation Via Compressive Tokenization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a compressive yet effective mesh tokenization, Blocked and Patchified Tokenization (BPT), facilitating the generation of meshes exceeding 8k faces. |
Haohan Weng; Zibo Zhao; Biwen Lei; Xianghui Yang; Jian Liu; Zeqiang Lai; Zhuo Chen; Yuhong Liu; Jie Jiang; Chunchao Guo; Tong Zhang; Shenghua Gao; C.L. Philip Chen; |
288 | PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models. |
Song Wang; Xiaolu Liu; Lingdong Kong; Jianyun Xu; Chunyong Hu; Gongfan Fang; Wentong Li; Jianke Zhu; Xinchao Wang; |
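Low-rank adaptation, referenced in the PointLoRA highlight above, is a standard technique; below is a minimal, generic LoRA-augmented linear layer in PyTorch. The multi-scale token selection that distinguishes PointLoRA is not shown, and the class name `LoRALinear` and the hyperparameters are illustrative assumptions rather than the paper's implementation.

```python
# Minimal, generic LoRA sketch (not the PointLoRA implementation).
# A frozen linear layer is augmented with a trainable low-rank update:
#   y = W x + (alpha / r) * B (A x),  with A in R^{r x d_in}, B in R^{d_out x r}.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # freeze the pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.01)
        nn.init.zeros_(self.lora_b.weight)   # start as an identity-preserving update
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# Toy usage: wrap a projection layer and run point-cloud-like tokens through it.
layer = LoRALinear(nn.Linear(384, 384), rank=8)
tokens = torch.randn(2, 1024, 384)           # (batch, points/tokens, channels)
out = layer(tokens)
```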
289 | FlexGS: Train Once, Deploy Everywhere with Many-in-One Flexible 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present an elastic inference method for 3DGS. |
Hengyu Liu; Yuehao Wang; Chenxin Li; Ruisi Cai; Kevin Wang; Wuyang Li; Pavlo Molchanov; Peihao Wang; Zhangyang Wang; |
290 | LSNet: See Large, Focus Small Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a "See Large, Focus Small" strategy for lightweight vision network design. |
Ao Wang; Hui Chen; Zijia Lin; Jungong Han; Guiguang Ding; |
291 | Words or Vision: Do Vision-Language Models Have Blind Faith in Text? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Vision-Language Models (VLMs) excel in integrating visual and textual information for vision-centric tasks, but their handling of inconsistencies between modalities is underexplored. We investigate VLMs’ modality preferences when faced with visual data and varied textual inputs in vision-centered settings. By introducing textual variations to four vision-centric tasks and evaluating ten VLMs, we discover a "blind faith in text" phenomenon: VLMs disproportionately trust textual data over visual data when inconsistencies arise, leading to significant performance drops under corrupted text and raising safety concerns. We analyze factors influencing this text bias, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty. |
Ailin Deng; Tri Cao; Zhirui Chen; Bryan Hooi; |
292 | Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask^2DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. |
Tianhao Qi; Jianlong Yuan; Wanquan Feng; Shancheng Fang; Jiawei Liu; SiYu Zhou; Qian He; Hongtao Xie; Yongdong Zhang; |
293 | LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although Low-Rank Adaptation (LoRA) is widely used to efficiently acquire specialized knowledge in MLLMs, it introduces substantial harmful redundancy during visual instruction tuning, which exacerbates the forgetting of general knowledge and degrades downstream task performance. To address this issue, we propose LoRASculpt to eliminate harmful redundant parameters, thereby harmonizing general and specialized knowledge. |
Jian Liang; Wenke Huang; Guancheng Wan; Qu Yang; Mang Ye; |
294 | Revealing Key Details to See Differences: A Novel Prototypical Perspective for Skeleton-based Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we introduce ProtoGCN, a Graph Convolutional Network (GCN)-based model that breaks down the dynamics of entire skeleton sequences into a combination of learnable prototypes representing core motion patterns of action units. |
Hongda Liu; Yunfan Liu; Min Ren; Hao Wang; Yunlong Wang; Zhenan Sun; |
295 | CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D Via Dynamically Optimizing 3D Gaussians Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces CompGS, a novel generative framework that employs 3D Gaussian Splatting (GS) for efficient, compositional text-to-3D content generation. |
Chongjian Ge; Chenfeng Xu; Yuanfeng Ji; Chensheng Peng; Masayoshi Tomizuka; Ping Luo; Mingyu Ding; Varun Jampani; Wei Zhan; |
296 | ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer Via Residual Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Data-driven embodied AI algorithms demand precise, large-scale, human-like manipulation sequences, which are challenging to obtain with conventional reinforcement learning or real-world teleoperation. To address this, we introduce ManipTrans, a novel two-stage method for efficiently transferring human bimanual skills to dexterous robotic hands in simulation. |
Kailin Li; Puhao Li; Tengyu Liu; Yuyang Li; Siyuan Huang; |
297 | SEAL: Semantic Attention Learning for Long Video Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos. |
Lan Wang; Yujia Chen; Du Tran; Vishnu Naresh Boddeti; Wen-Sheng Chu; |
298 | Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, our analysis of the combination paradigm reveals that the current one-shot and independent selection mechanism induces an incompatibility issue between distilled and real images. To address this issue, we introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation. |
Yanda Chen; Gongwei Chen; Miao Zhang; Weili Guan; Liqiang Nie; |
299 | 3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose 3D-AVS, a method for Auto-Vocabulary Segmentation of 3D point clouds for which the vocabulary is unknown and auto-generated for each input at runtime, thus eliminating the human in the loop and typically providing a substantially larger vocabulary for richer annotations. |
Weijie Wei; Osman Ülger; Fatemeh Karimi Nejadasl; Theo Gevers; Martin R. Oswald; |
300 | VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. |
Byung-Kwan Lee; Ryo Hachiuma; Yu-Chiang Frank Wang; Yong Man Ro; Yueh-Hua Wu; |
301 | Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. |
Jiuhai Chen; Jianwei Yang; Haiping Wu; Dianqi Li; Jianfeng Gao; Tianyi Zhou; Bin Xiao; |
302 | Classic Video Denoising in A Machine Learning World: Robust, Fast, and Controllable Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they require manually tuning parameters for each input video, which is not only tedious but also requires skill. We bridge the gap between these two paradigms by proposing a differentiable denoising pipeline based on traditional methods. |
Xin Jin; Simon Niklaus; Zhoutong Zhang; Zhihao Xia; Chunle Guo; Yuting Yang; Jiawen Chen; Chongyi Li; |
303 | SCFlow2: Plug-and-Play Object Pose Refiner with Shape-Constraint Scene Flow Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce SCFlow2, a plug-and-play refinement framework for 6D object pose estimation. |
Qingyuan Wang; Rui Song; Jiaojiao Li; Kerui Cheng; David Ferstl; Yinlin Hu; |
304 | Uncertain Multimodal Intention and Emotion Understanding in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While this uncertainty reflects real-world scenarios, it remains underexplored within the computer vision community, particularly in conjunction with the intrinsic relationship between emotion and intention. To address these challenges, we introduce the Multimodal IntentioN and Emotion Understanding in the Wild (MINE) dataset, comprising over 20,000 topic-specific social media posts with natural modality variations across text, image, video, and audio. |
Qu Yang; Qinghongya Shi; Tongxin Wang; Mang Ye; |
305 | ObjectMover: Generative Object Movement with Video Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present ObjectMover, a generative model that can perform object movement in highly challenging scenes. |
Xin Yu; Tianyu Wang; Soo Ye Kim; Paul Guerrero; Xi Chen; Qing Liu; Zhe Lin; Xiaojuan Qi; |
306 | Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. |
Yongshuo Zong; Qin Zhang; Dongsheng An; Zhihua Li; Xiang Xu; Linghan Xu; Zhuowen Tu; Yifan Xing; Onkar Dabeer; |
307 | ViUniT: Visual Unit Tests for More Robust Visual Programming Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests. |
Artemis Panagopoulou; Honglu Zhou; Silvio Savarese; Caiming Xiong; Chris Callison-Burch; Mark Yatskar; Juan Carlos Niebles; |
308 | Learning Visual Generative Priors Without Text Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We argue that grasping the cross-modality alignment is not a necessity for a sound visual generative prior, whose focus should be on texture modeling. Such a philosophy inspires us to study image-to-image (I2I) generation, where models can learn from in-the-wild images in a self-supervised manner. |
Shuailei Ma; Kecheng Zheng; Ying Wei; Wei Wu; Fan Lu; Yifei Zhang; Chen-Wei Xie; Biao Gong; Jiapeng Zhu; Yujun Shen; |
309 | SketchAgent: Language-Driven Sequential Sketch Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce SketchAgent, a language-driven, sequential sketch generation method that enables users to create, modify, and refine sketches through dynamic, conversational interactions. |
Yael Vinker; Tamar Rott Shaham; Kristine Zheng; Alex Zhao; Judith E Fan; Antonio Torralba; |
310 | Coherent 3D Portrait Video Reconstruction Via Triplane Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a result, neither of these two frameworks is an ideal solution for democratized 3D telepresence. In this work, we address this dilemma and propose a novel solution that maintains both coherent identity and dynamic per-frame appearance to enable the best possible realism. |
Shengze Wang; Xueting Li; Chao Liu; Matthew Chan; Michael Stengel; Henry Fuchs; Shalini De Mello; Koki Nagano; |
311 | Yo’Chameleon: Personalized Vision and Language Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Yo’Chameleon, the first attempt to study personalization for large multimodal models. |
Thao Nguyen; Krishna Kumar Singh; Jing Shi; Trung Bui; Yong Jae Lee; Yuheng Li; |
312 | Foveated Instance Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a foveated instance segmentation (FovealSeg) framework that leverages real-time user gaze data to perform instance segmentation exclusively on instances of interest, resulting in substantial computational savings. |
Hongyi Zeng; Wenxuan Liu; Tianhua Xia; Jinhui Chen; Ziyun Li; Sai Qian Zhang; |
313 | Distilling Multi-modal Large Language Models for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. |
Deepti Hegde; Rajeev Yasarla; Hong Cai; Shizhong Han; Apratim Bhattacharyya; Shweta Mahajan; Litian Liu; Risheek Garrepalli; Vishal M. Patel; Fatih Porikli; |
314 | SaMam: Style-aware State Space Model for Arbitrary Image Style Transfer Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we develop a Mamba-based style transfer framework, termed SaMam. |
Hongda Liu; Longguang Wang; Ye Zhang; Ziru Yu; Yulan Guo; |
315 | Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Yet, the inherently coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to prohibitive memory consumption and computational redundancies. To overcome these bottlenecks, we propose Collaborative Decoding (CoDe), a novel decoding strategy tailored to the VAR framework. |
Zigeng Chen; Xinyin Ma; Gongfan Fang; Xinchao Wang; |
316 | VinaBench: Benchmark for Faithful and Consistent Visual Narratives Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. |
Silin Gao; Sheryl Mathew; Li Mi; Sepideh Mamooler; Mengjie Zhao; Hiromi Wakaki; Yuki Mitsufuji; Syrielle Montariol; Antoine Bosselut; |
317 | EventFly: Event Camera Perception from Ground to The Sky Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce EventFly, a framework for robust cross-platform adaptation in event camera perception. |
Lingdong Kong; Dongyue Lu; Xiang Xu; Lai Xing Ng; Wei Tsang Ooi; Benoit R. Cottereau; |
318 | Hearing Hands: Generating Sounds from Physical Interactions in 3D Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We study the problem of making 3D scene reconstructions interactive by asking the following question: can we predict the sounds of human hands physically interacting with a scene? |
Yiming Dou; Wonseok Oh; Yuqing Luo; Antonio Loquercio; Andrew Owens; |
319 | MangaNinja: Line Art Colorization with Precise Reference Following Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We incorporate two thoughtful designs to ensure precise character detail transcription, including a patch shuffling module to facilitate correspondence learning between the reference color image and the target line art, and a point-driven control scheme to enable fine-grained color matching. |
Zhiheng Liu; Ka Leong Cheng; Xi Chen; Jie Xiao; Hao Ouyang; Kai Zhu; Yu Liu; Yujun Shen; Qifeng Chen; Ping Luo; |
320 | EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, it remains challenging for latent diffusion models to effectively compose multiple semantic concepts into a single, coherent motion sequence. To address this issue, we propose EnergyMoGen, which includes two spectrums of Energy-Based Models: (1) We interpret the diffusion model as a latent-aware energy-based model that generates motions by composing a set of diffusion models in latent space; (2) We introduce a semantic-aware energy model based on cross-attention, which enables semantic composition and adaptive gradient descent for text embeddings. |
Jianrong Zhang; Hehe Fan; Yi Yang; |
321 | DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: More importantly, we introduce the DiN framework, which leverages a diffusion model to handle noisy labels in Med-VQA. |
Erjian Guo; Zhen Zhao; Zicheng Wang; Tong Chen; Yunyi Liu; Luping Zhou; |
322 | ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding Using Captions with Grounded Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although both research directions are essential for developing models with human-level video comprehension, they have largely evolved separately, with distinct benchmarks and architectures. This paper aims to unify these efforts by introducing ViCaS, a new dataset containing thousands of challenging videos, each annotated with detailed, human-written captions and temporally consistent, pixel-accurate masks for multiple objects with phrase grounding. |
Ali Athar; Xueqing Deng; Liang-Chieh Chen; |
323 | PolarFree: Polarization-based Reflection-Free Imaging Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover, to fully exploit the potential of polarization cues for reflection removal, we introduce PolarFree, which leverages a diffusion process to generate reflection-free cues for accurate reflection removal. |
Mingde Yao; Menglu Wang; King-Man Tam; Lingen Li; Tianfan Xue; Jinwei Gu; |
324 | Lux Post Facto: Learning Portrait Performance Relighting with Conditional Video Diffusion and A Hybrid Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce Lux Post Facto, a novel portrait video relighting method that produces both photorealistic and temporally consistent lighting effects. |
Yiqun Mei; Mingming He; Li Ma; Julien Philip; Wenqi Xian; David M George; Xueming Yu; Gabriel Dedic; Ahmet Levent Taşel; Ning Yu; Vishal M. Patel; Paul Debevec; |
325 | Type-R: Automatically Retouching Typos for Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose to retouch erroneous text renderings in the post-processing pipeline. |
Wataru Shimoda; Naoto Inoue; Daichi Haraguchi; Hayato Mitani; Seiichi Uchida; Kota Yamaguchi; |
326 | 3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose 3D Gaussian Unscented Transform (3DGUT), replacing the EWA splatting formulation with the Unscented Transform that approximates the particles through sigma points, which can be projected exactly under any nonlinear projection function. |
Qi Wu; Janick Martinez Esturo; Ashkan Mirzaei; Nicolas Moënne-Loccoz; Zan Gojcic; |
327 | Continuous Adverse Weather Removal Via Degradation-Aware Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The latter hampers the model’s ability to accurately identify and respond to specific types of degradation, limiting its performance across diverse adverse weather conditions. To address these issues, we introduce the Incremental Learning Adverse Weather Removal (ILAWR) framework, which uses a novel degradation-aware distillation strategy for continuous weather removal. |
Xin Lu; Jie Xiao; Yurui Zhu; Xueyang Fu; |
328 | PMA: Towards Parameter-Efficient Point Cloud Understanding Via Point Mamba Adapter Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: It neglects the rich complementary information in the intermediate layer, thereby failing to fully unlock the potential of pre-trained models. To overcome this limitation, we propose an orthogonal solution: Point Mamba Adapter (PMA), which constructs an ordered feature sequence from all layers of the pre-trained model and leverages Mamba to fuse all complementary semantics, thereby promoting comprehensive point cloud understanding. |
Yaohua Zha; Yanzi Wang; Hang Guo; Jinpeng Wang; Tao Dai; Bin Chen; Zhihao Ouyang; Xue Yuerong; Ke Chen; Shu-Tao Xia; |
329 | GLUS: Global-Local Reasoning Unified Into A Single Large Language Model for Video Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes a novel framework utilizing multi-modal large language models (MLLMs) for referring video object segmentation (RefVOS). |
Lang Lin; Xueyang Yu; Ziqi Pang; Yu-Xiong Wang; |
330 | Chat-based Person Retrieval Via Dialogue-Refined Cross-Modal Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in real-world scenarios, it can be challenging to ensure the information completeness of such single-shot text. To address this limitation, we propose chat-based person retrieval (ChatPR), a new paradigm that takes an interactive dialogue as the query, engaging the user in conversational context to progressively refine the query for accurate person retrieval. |
Yang Bai; Yucheng Ji; Min Cao; Jinqiao Wang; Mang Ye; |
331 | MeshGen: Generating PBR Textured Mesh with Render-Enhanced Auto-Encoder and Generative Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce MeshGen, an advanced image-to-3D pipeline that generates high-quality 3D meshes with detailed geometry and physically based rendering (PBR) textures. |
Zilong Chen; Yikai Wang; Wenqiang Sun; Feng Wang; Yiwen Chen; Huaping Liu; |
332 | GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote the generation of diverse solutions for complex reasoning tasks. |
Haoqiang Kang; Enna Sachdeva; Piyush Gupta; Sangjae Bae; Kwonjoon Lee; |
333 | GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a world-model-based framework to exploit the scene evolution for perception. |
Sicheng Zuo; Wenzhao Zheng; Yuanhui Huang; Jie Zhou; Jiwen Lu; |
334 | Geometry in Style: 3D Stylization Via Surface Normal Deformation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Geometry in Style, a new method for identity-preserving mesh stylization. |
Nam Anh Dinh; Itai Lang; Hyunwoo Kim; Oded Stein; Rana Hanocka; |
335 | Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates hallucinations by assembling global features for response generation and local features for visual discrimination simultaneously. |
Wenbin An; Feng Tian; Sicong Leng; Jiahao Nie; Haonan Lin; Qianying Wang; Ping Chen; Xiaoqin Zhang; Shijian Lu; |
336 | Building Vision Models Upon Heat Conduction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Visual representation models leveraging attention mechanisms are challenged by significant computational overhead, particularly when pursuing large receptive fields. In this study, we aim to mitigate this challenge by introducing the Heat Conduction Operator (HCO) built upon the physical heat conduction principle. |
Zhaozhi Wang; Yue Liu; Yunjie Tian; Yunfan Liu; Yaowei Wang; Qixiang Ye; |
337 | Iterative Predictor-Critic Code Decoding for Real-World Image Dehazing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel Iterative Predictor-Critic Code Decoding framework for real-world image dehazing, abbreviated as IPC-Dehaze, which leverages the high-quality codebook prior encapsulated in a pre-trained VQGAN. |
Jiayi Fu; Siyu Liu; Zikun Liu; Chun-Le Guo; Hyunhee Park; Ruiqi Wu; Guoqing Wang; Chongyi Li; |
338 | SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. |
Jianyi Wang; Zhijie Lin; Meng Wei; Yang Zhao; Ceyuan Yang; Chen Change Loy; Lu Jiang; |
339 | Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper attempts to tackle the three challenges jointly. First, inspired by the notable generality of using image-level blending data for image forgery detection, we investigate whether and how video-level blending can be effective in video. We then perform a thorough analysis and identify a previously underexplored temporal forgery artifact: Facial Feature Drift (FFD), which commonly exists across different forgeries. |
Zhiyuan Yan; Yandan Zhao; Shen Chen; Mingyi Guo; Xinghe Fu; Taiping Yao; Shouhong Ding; Yunsheng Wu; Li Yuan; |
340 | Synthetic Data Is An Elegant GIFT for Continual Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose GIFT, a novel continual fine-tuning approach that utilizes synthetic data to overcome catastrophic forgetting in VLMs. |
Bin Wu; Wuxuan Shi; Jinqiao Wang; Mang Ye; |
341 | Mamba4D: Efficient 4D Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Also, recent transformer-based 4D backbones commonly suffer from large computational costs due to their quadratic complexity, particularly for long video sequences. To address these challenges, we propose a novel point cloud video understanding backbone purely based on the State Space Models (SSMs). |
Jiuming Liu; Jinru Han; Lihao Liu; Angelica I. Aviles-Rivero; Chaokang Jiang; Zhe Liu; Hesheng Wang; |
342 | Horizon-GS: Unified 3D Gaussian Splatting for Large-Scale Aerial-to-Ground Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Horizon-GS, a novel approach built upon Gaussian Splatting techniques that tackles unified reconstruction and rendering for aerial and street views. |
Lihan Jiang; Kerui Ren; Mulin Yu; Linning Xu; Junting Dong; Tao Lu; Feng Zhao; Dahua Lin; Bo Dai; |
343 | PatchDEMUX: A Certifiably Robust Framework for Multi-label Classifiers Against Adversarial Patches Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present PatchDEMUX, a certifiably robust framework for multi-label classifiers against adversarial patches. |
Dennis Jacob; Chong Xiang; Prateek Mittal; |
344 | Few-shot Implicit Function Generation Via Equivariance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This is challenging because even for the same signal, the optimal INRs can vary significantly depending on their initializations. To tackle this, we propose EquiGen, a framework that can generate new INRs from limited data. |
Suizhi Huang; Xingyi Yang; Hongtao Lu; Xinchao Wang; |
345 | SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Otherwise, the model’s answers could be inconsistent with its behavior. Therefore, we propose a model that can handle three different tasks: (1) closed-loop driving, (2) vision-language understanding, and (3) language-action alignment. |
Katrin Renz; Long Chen; Elahe Arani; Oleg Sinavski; |
346 | CLIP Under The Microscope: A Fine-Grained Analysis of Multi-Object Representation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Contrastive Language-Image Pre-training (CLIP) models excel in zero-shot classification, yet face challenges in complex multi-object scenarios. This study offers a comprehensive analysis of CLIP’s limitations in these contexts using a specialized dataset, ComCO, designed to evaluate CLIP’s encoders in diverse multi-object scenarios. |
Reza Abbasi; Ali Nazari; Aminreza Sefid; Mohammadali Banayeeanzade; Mohammad Hossein Rohban; Mahdieh Soleymani Baghshah; |
347 | Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Learning from multiple domains is a primary factor that influences the generalization of a single unified robot system. In this paper, we aim to learn the trajectory prediction model by using broad out-of-domain data to improve its performance and generalization ability. |
Jiange Yang; Haoyi Zhu; Yating Wang; Gangshan Wu; Tong He; Limin Wang; |
348 | Hierarchical Flow Diffusion for Efficient Frame Interpolation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose to model bilateral optical flow explicitly by hierarchical diffusion models, which have a much smaller search space in the denoising procedure. |
Yang Hai; Guo Wang; Tan Su; Wenjie Jiang; Yinlin Hu; |
349 | Image Referenced Sketch Colorization Based on Animation Creation Workflow Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, existing methods still face problems: text-guided methods fail to provide accurate color and style references, hint-guided methods still require manual operation, and image-referenced methods are prone to artifacts. To address these limitations, we propose a diffusion-based framework inspired by real-world animation production workflows. |
Dingkun Yan; Xinrui Wang; Zhuoru Li; Suguru Saito; Yusuke Iwasawa; Yutaka Matsuo; Jiaxian Guo; |
350 | Graph Neural Network Combining Event Stream and Periodic Aggregation for Low-Latency Event-based Vision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: On the other hand, asynchronous event graph methods could leverage both, but at the cost of avoiding any form of time accumulation, which limits the prediction accuracy. In this paper, we propose to break this accuracy-latency trade-off with a novel architecture combining an asynchronous accumulation-free event branch and a periodic aggregation branch. |
Manon Dampfhoffer; Thomas Mesquida; Damien Joubert; Thomas Dalgaty; Pascal Vivet; Christoph Posch; |
351 | FSboard: Over 3 Million Characters of ASL Fingerspelling Collected Via Smartphones Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present FSboard, an American Sign Language fingerspelling dataset situated in a mobile text entry use case, collected from 147 paid and consenting Deaf signers using Pixel 4A selfie cameras in a variety of environments. |
Manfred Georg; Garrett Tanzer; Esha Uboweja; Saad Hassan; Maximus Shengelia; Sam Sepah; Sean Forbes; Thad Starner; |
352 | EditAR: Unified Conditional Generation with Autoregressive Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks, e.g., image editing, depth-to-image, edge-to-image, segmentation-to-image. |
Jiteng Mu; Nuno Vasconcelos; Xiaolong Wang; |
353 | Using Diffusion Priors for Video Amodal Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose to tackle video amodal segmentation by formulating it as a conditional generation task, thereby capitalizing on the foundational knowledge in video generative models. |
Kaihua Chen; Deva Ramanan; Tarasha Khurana; |
354 | AesthetiQ: Enhancing Graphic Layout Design Via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a data filtering protocol utilizing our layout-quality heuristics for Aesthetic-Aware Preference Alignment (AAPA) to ensure training happens on high-quality layouts. |
Sohan Patnaik; Rishabh Jain; Balaji Krishnamurthy; Mausoom Sarkar; |
355 | SceneDiffuser++: City-Scale Traffic Simulation Via A Generative World Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose SceneDiffuser++, the first end-to-end generative world model trained on a single loss function capable of point A-to-B simulation on a city scale integrating all the requirements above. |
Shuhan Tan; John Lambert; Hong Jeon; Sakshum Kulshrestha; Yijing Bai; Jing Luo; Dragomir Anguelov; Mingxing Tan; Chiyu Max Jiang; |
356 | Precise, Fast, and Low-cost Concept Erasure in Value Space: Orthogonal Complement Matters Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To improve, we propose a precise, fast, and low-cost concept erasure method, called Adaptive Value Decomposer (AdaVD), which is training-free. |
Yuan Wang; Ouxiang Li; Tingting Mu; Yanbin Hao; Kuien Liu; Xiang Wang; Xiangnan He; |
357 | WonderWorld: Interactive 3D Scene Generation from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present WonderWorld, a novel framework for interactive 3D scene generation that enables users to interactively specify scene contents and layout and see the created scenes at low latency. |
Hong-Xing Yu; Haoyi Duan; Charles Herrmann; William T. Freeman; Jiajun Wu; |
358 | GarmentPile: Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike single-garment manipulation, cluttered scenarios require managing complex garment entanglements and interactions, while maintaining garment cleanliness and manipulation stability. To address these demands, we propose to learn point-level affordance, the dense representation modeling the complex space and multi-modal manipulation candidates, with novel designs for the awareness of garment geometry, structure, and inter-object relations. |
Ruihai Wu; Ziyu Zhu; Yuran Wang; Yue Chen; Jiarui Wang; Hao Dong; |
359 | Structure-from-Motion with A Non-Parametric Camera Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a new generic Structure-from-Motion pipeline, GenSfM, that uses a non-parametric camera projection model. |
Yihan Wang; Linfei Pan; Marc Pollefeys; Viktor Larsson; |
360 | Cheb-GR: Rethinking K-nearest Neighbor Search in Re-ranking for Person Re-identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We rethink the effect of the k-nearest neighbor search and introduce the Chebyshev’s Theorem-guided Graph Re-ranking (Cheb-GR) method, which adopts an adaptive neighbor search guided by Chebyshev’s Theorem in place of the k-nearest neighbor search for efficient neighbor selection. |
Jinxi Yang; He Li; Bo Du; Mang Ye; |
361 | GBC-Splat: Generalizable Gaussian-Based Clothed Human Digitalization Under Sparse RGB Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present an efficient approach for generalizable clothed human digitalization, termed GBC-Splat. |
Hanzhang Tu; Zhanfeng Liao; Boyao Zhou; Shunyuan Zheng; Xilong Zhou; Liuxin Zhang; QianYing Wang; Yebin Liu; |
362 | Dynamic Integration of Task-Specific Adapters for Class Incremental Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel framework called Dynamic Integration of task-specific Adapters (DIA), which comprises two key components: Task-Specific Adapter Integration (TSAI) and Patch-Level Model Alignment. |
Jiashuo Li; Shaokun Wang; Bo Qian; Yuhang He; Xing Wei; Qiang Wang; Yihong Gong; |
363 | SP3D: Boosting Sparsely-Supervised 3D Object Detection Via Accurate Cross-Modal Semantic Prompts Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a boosting strategy, termed SP3D, explicitly utilizing the cross-modal semantic prompts generated from Large Multimodal Models (LMMs) to boost the 3D detector with robust feature discrimination capability under sparse annotation settings. |
Shijia Zhao; Qiming Xia; Xusheng Guo; Pufan Zou; Maoji Zheng; Hai Wu; Chenglu Wen; Cheng Wang; |
364 | LIRM: Large Inverse Rendering Model for Progressive Reconstruction of Shape, Materials and View-dependent Radiance Fields Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Large Inverse Rendering Model (LIRM), a transformer architecture that jointly reconstructs high-quality shape, materials, and radiance fields with view-dependent effects in less than a second. |
Zhengqin Li; Dilin Wang; Ka Chen; Zhaoyang Lv; Thu Nguyen-Phuoc; Milim Lee; Jia-Bin Huang; Lei Xiao; Yufeng Zhu; Carl S. Marshall; Yuheng Ren; Richard Newcombe; Zhao Dong; |
365 | FedSPA: Generalizable Federated Graph Learning Under Homophily Heterogeneity Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose FedSPA, an effective framework that addresses homophily heterogeneity from the perspectives of homophily conflict and homophily bias. |
Zihan Tan; Guancheng Wan; Wenke Huang; He Li; Guibin Zhang; Carl Yang; Mang Ye; |
366 | A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This investigation indicates that the ability to form object-centric representations from the non-object-centric robotics dataset is the key to success for PVMs. Motivated by this discovery, we designed SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck to reduce the number of prototypes to encourage the emergence of objectness as well as cross-view consistency regularization for encouraging multiview invariance. |
Xin Wen; Bingchen Zhao; Yilun Chen; Jiangmiao Pang; Xiaojuan Qi; |
367 | MAD: Memory-Augmented Detection of 3D Objects Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most 3D object detectors are limited to using sensor evidence from a short temporal window (0.1s-0.3s). In this work, we present a simple and effective add-on for enhancing any existing 3D object detector with long-term memory regardless of its sensor modality (e.g., LiDAR, camera) and network architecture. |
Ben Agro; Sergio Casas; Patrick Wang; Thomas Gilles; Raquel Urtasun; |
368 | Generative Omnimatte: Learning to Decompose Video Into Layers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a novel generative layered video decomposition framework to address the omnimatte problem. |
Yao-Chih Lee; Erika Lu; Sarah Rumbley; Michal Geyer; Jia-Bin Huang; Tali Dekel; Forrester Cole; |
369 | Relative Pose Estimation Through Affine Corrections of Monocular Depth Priors Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated conditions. |
Yifan Yu; Shaohui Liu; Rémi Pautrat; Marc Pollefeys; Viktor Larsson; |
370 | RayFlow: Instance-Aware Diffusion Acceleration Via Adaptive Flow Trajectories Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing acceleration methods, while aiming to reduce steps, often compromise sample quality, controllability, or introduce training complexities. Therefore, we propose RayFlow, a novel diffusion framework that addresses these limitations. |
Huiyang Shao; Xin Xia; Yuhong Yang; Yuxi Ren; Xing Wang; Xuefeng Xiao; |
371 | SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose SpatialCLIP, an enhanced version of CLIP with better spatial understanding capabilities. |
Zehan Wang; Sashuai Zhou; Shaoxuan He; Haifeng Huang; Lihe Yang; Ziang Zhang; Xize Cheng; Shengpeng Ji; Tao Jin; Hengshuang Zhao; Zhou Zhao; |
372 | ClearSight: Visual Signal Enhancement for Object Hallucination Mitigation in Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, these methods present two main limitations: (1) bluntly suppressing language priors can compromise coherence and accuracy of generated content, and (2) processing contrastive inputs adds computational load, significantly slowing inference speed. To address these challenges, we propose Visual Amplification Fusion (VAF), a plug-and-play technique that enhances attention to visual signals within the model’s middle layers, where modality fusion predominantly occurs. |
Hao Yin; Guangzong Si; Zilei Wang; |
373 | Lifting The Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, a shift in the dominant flow of visual information is uncovered: (1) in shallow layers, strong interactions are observed between image tokens and instruction tokens, where most visual information is injected into instruction tokens to form cross-modal semantic representations; (2) in deeper layers, image tokens primarily interact with each other, aggregating the remaining visual information to optimize semantic representations within the visual modality. |
Hao Yin; Guangzong Si; Zilei Wang; |
374 | MAGiC-SLAM: Multi-Agent Gaussian Globally Consistent SLAM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, we propose a rigidly deformable 3D Gaussian-based scene representation that dramatically speeds up the system. |
Vladimir Yugay; Theo Gevers; Martin R. Oswald; |
375 | ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation and human-scene reconstruction in an online fashion. |
Zetong Zhang; Manuel Kaufmann; Lixin Xue; Jie Song; Martin R. Oswald; |
376 | CrossOver: 3D Scene Cross-Modal Alignment Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. |
Sayan Deb Sarkar; Ondrej Miksik; Marc Pollefeys; Daniel Barath; Iro Armeni; |
377 | Interactive Medical Image Segmentation: A Benchmark Dataset and Baseline Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce the IMed-361M benchmark dataset, a significant advancement in general IMIS research. |
Junlong Cheng; Bin Fu; Jin Ye; Guoan Wang; Tianbin Li; Haoyu Wang; Ruoyu Li; He Yao; Junren Cheng; Jingwen Li; Yanzhou Su; Min Zhu; Junjun He; |
378 | Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs’ spatial-temporal reasoning with 2D images as input, without modifying the architecture or requiring task-specific fine-tuning. |
Benlin Liu; Yuhao Dong; Yiqin Wang; Zixian Ma; Yansong Tang; Luming Tang; Yongming Rao; Wei-Chiu Ma; Ranjay Krishna; |
379 | AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). |
Khiem Vuong; Anurag Ghosh; Deva Ramanan; Srinivasa Narasimhan; Shubham Tulsiani; |
380 | Conditional Balance: Improving Multi-Conditioning Trade-Offs in Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a novel method to identify sensitivities within the DDPM attention layers, identifying specific layers that correspond to different stylistic aspects. |
Nadav Z. Cohen; Oron Nir; Ariel Shamir; |
381 | I2VGuard: Safeguarding Images Against Misuse in Diffusion-based Image-to-Video Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel approach that applies imperceptible perturbations on images to degrade the quality of the generated videos, thereby protecting images from misuse in white-box image-to-video diffusion models. |
Dongnan Gui; Xun Guo; Wengang Zhou; Yan Lu; |
382 | Motion Prompting: Controlling Video Generation with Motion Trajectories Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we train a video generation model conditioned on spatio-temporally sparse or dense motion trajectories. |
Daniel Geng; Charles Herrmann; Junhwa Hur; Forrester Cole; Serena Zhang; Tobias Pfaff; Tatiana Lopez-Guevara; Yusuf Aytar; Michael Rubinstein; Chen Sun; Oliver Wang; Andrew Owens; Deqing Sun; |
383 | PICO: Reconstructing 3D People In Contact with Objects Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact correspondences on both body and object meshes. |
Alpár Cseke; Shashank Tripathi; Sai Kumar Dwivedi; Arjun S. Lakshmipathy; Agniv Chatterjee; Michael J. Black; Dimitrios Tzionas; |
384 | Towards Consistent Multi-Task Learning: Unlocking The Potential of Task-Specific Parameters Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work points out that task-specific parameters not only capture task-specific information but also influence the gradients propagated to shared parameters, which in turn affects gradient conflicts. Motivated by this insight, we propose ConsMTL, which models MTL as a bi-level optimization problem: in the upper-level optimization, we perform gradient aggregation on shared parameters to find a joint update vector that minimizes gradient conflicts; in the lower-level optimization, we introduce an additional loss for task-specific parameters guiding the k gradients of shared parameters to gradually converge towards the joint update vector. |
Xiaohan Qin; Xiaoxing Wang; Junchi Yan; |
385 | Revisiting Fairness in Multitask Learning: A Performance-Driven Approach for Variance Reduction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While prior works have attempted to mitigate inter-task unfairness through loss-based and gradient-based strategies, they still exhibit imbalanced performance across tasks on common benchmarks. This key observation motivates us to consider performance-level information as an explicit fairness indicator, which can precisely reflect the current optimization status of each task and accordingly help to adjust the gradient aggregation process. Specifically, we utilize the performance variance among tasks as the fairness indicator and introduce a dynamic weighting strategy to gradually reduce the performance variance. Based on this, we propose PIVRG, a novel performance-informed variance reduction gradient aggregation approach. Extensive experiments show that PIVRG achieves SOTA performance across various benchmarks, spanning both supervised learning and reinforcement learning tasks with task numbers ranging from 2 to 40. |
Xiaohan Qin; Xiaoxing Wang; Junchi Yan; |
386 | SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Thus, we propose a new Semi-supervised framework for End-to-end Text Spotting, namely SemiETS that leverages the complementarity of text detection and recognition. |
Dongliang Luo; Hanshen Zhu; Ziyang Zhang; Dingkang Liang; Xudong Xie; Yuliang Liu; Xiang Bai; |
387 | BIMBA: Selective-Scan Compression for Long-Range Video Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce BIMBA, an efficient state-space model to handle long-form videos. |
Md Mohaiminul Islam; Tushar Nagarajan; Huiyu Wang; Gedas Bertasius; Lorenzo Torresani; |
388 | Lifting Motion to The 3D World Via 2D Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce MVLift, a novel approach to predict global 3D motion—including both joint rotations and root trajectories in the world coordinate system—using only 2D pose sequences for training. |
Jiaman Li; C. Karen Liu; Jiajun Wu; |
389 | VLog: Video-Language Models By Generative Retrieval of Narration Vocabulary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: **A Vocabulary Update Strategy**: leveraging generative models to extend the vocabulary for novel events encountered during inference. To validate our approach, we introduce **VidCab-Eval**, a development set requiring concise narrations with reasoning relationships (e.g., before and after). |
Kevin Qinghong Lin; Mike Zheng Shou; |
390 | ShowUI: One Vision-Language-Action Model for GUI Visual Agent Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we develop a vision-language-action model in the digital world, namely ShowUI, which features the following innovations: 1. **UI-Guided Visual Token Selection** to reduce computational costs by formulating screenshots as a UI-connected graph, adaptively identifying their redundant relationships and serving as the criteria for token selection during self-attention blocks. 2. **Interleaved Vision-Language-Action Streaming** that flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency. 3. **Small-Scale High-Quality GUI Instruction-Following Datasets** by careful data curation and employing a resampling strategy to address significant data type imbalances. |
Kevin Qinghong Lin; Linjie Li; Difei Gao; Zhengyuan Yang; Shiwei Wu; Zechen Bai; Stan Weixian Lei; Lijuan Wang; Mike Zheng Shou; |
391 | Meta-Learning Hyperparameters for Parameter Efficient Fine-Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, we observe that fixed hyperparameters, such as intra-layer positions, layer depth, and scaling factors, can considerably hinder PEFT performance, as fine-tuning on RS images proves highly sensitive to these settings. To address this, we propose MetaPEFT, a method incorporating adaptive scalers that dynamically adjust module influence during fine-tuning. |
Zichen Tian; Yaoyao Liu; Qianru Sun; |
392 | VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We design a discretized voting space that accommodates all possible translations and then identify the one shared by nearby points by differentiable voting. |
Yancong Lin; Shiming Wang; Liangliang Nan; Julian Kooij; Holger Caesar; |
393 | GPS As A Control Signal for Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. |
Chao Feng; Ziyang Chen; Aleksander Holynski; Alexei A. Efros; Andrew Owens; |
394 | AniMo: Species-Aware Model for Text-Driven Animal Motion Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Animal motion modeling presents unique challenges due to species diversity, varied morphological structures, and different behavioral patterns in response to similar textual descriptions. To address these challenges, we propose AniMo for text-driven animal motion generation. |
Xuan Wang; Kai Ruan; Xing Zhang; Gaoang Wang; |
395 | FG^2: Fine-Grained Cross-View Localization By Fine-Grained Feature Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel fine-grained cross-view localization method that estimates the 3 Degrees of Freedom pose of a ground-level image in an aerial image of the surroundings by matching fine-grained features between the two images. |
Zimin Xia; Alexandre Alahi; |
396 | Towards Autonomous Micromobility Through Scalable Urban Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present a scalable urban simulation solution to advance autonomous micromobility. |
Wayne Wu; Honglin He; Chaoyuan Zhang; Jack He; Seth Z. Zhao; Ran Gong; Quanyi Li; Bolei Zhou; |
397 | PHGC: Procedural Heterogeneous Graph Completion for Natural Language Task Verification in Egocentric Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Describing rules with natural language provides generalizable applications, but also raises cross-modal heterogeneity and hierarchical misalignment challenges. In this paper, we propose a novel approach termed Procedural Heterogeneous Graph Completion (PHGC), which addresses these challenges with heterogeneous graphs representing the logic in rules and operation flows. |
Xun Jiang; Zhiyi Huang; Xing Xu; Jingkuan Song; Fumin Shen; Heng Tao Shen; |
398 | On The Out-Of-Distribution Generalization of Large Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We investigate the generalization boundaries of current Large Multimodal Models (LMMs) via comprehensive evaluation under out-of-distribution scenarios and domain-specific tasks. |
Xingxuan Zhang; Jiansheng Li; Wenjing Chu; junjia hai; Renzhe Xu; Yuqing Yang; Shikai Guan; Jiazheng Xu; Liping Jing; Peng Cui; |
399 | From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). |
Andrew Szot; Bogdan Mazoure; Omar Attia; Aleksei Timofeev; Harsh Agrawal; Devon Hjelm; Zhe Gan; Zsolt Kira; Alexander Toshev; |
400 | Controllable Human Image Generation with Personalized Multi-Garments Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We present BootControl, a novel framework based on text-to-image diffusion models for controllable human image generation with multiple reference garments. Here, the main … |
Yisol Choi; Sangkyung Kwak; Sihyun Yu; Hyungwon Choi; Jinwoo Shin; |
401 | AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present AnyEdit, a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains. |
Qifan Yu; Wei Chow; Zhongqi Yue; Kaihang Pan; Yang Wu; Xiaoyang Wan; Juncheng Li; Siliang Tang; Hanwang Zhang; Yueting Zhuang; |
402 | Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we build a proper visual language by leveraging diffusion timesteps to learn discrete, recursive visual tokens. |
Kaihang Pan; Wang Lin; Zhongqi Yue; Tenglong Ao; Liyu Jia; Wei Zhao; Juncheng Li; Siliang Tang; Hanwang Zhang; |
403 | WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose WeakMCN, a novel multi-task collaborative network that effectively combines WREC and WRES with a dual-branch architecture. |
Silin Cheng; Yang Liu; Xinwei He; Sebastien Ourselin; Lei Tan; Gen Luo; |
404 | ACL: Activating Capability of Linear Attention for Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, the recursive computation in Mamba’s SSM leads to reduced efficiency. To resolve these issues, we exploit the mathematical congruences between linear attention and SSM within Mamba to propose a novel model, ACL, which leverages new designs to Activate the Capability of Linear attention for IR. |
Yubin Gu; Yuan Meng; Jiayi Ji; Xiaoshuai Sun; |
405 | NitroFusion: High-Fidelity Single-Step Diffusion Through Dynamic Adversarial Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce NitroFusion, a fundamentally different approach to single-step diffusion that achieves high-quality generation through a dynamic adversarial framework. |
Dar-Yen Chen; Hmrishav Bandyopadhyay; Kai Zou; Yi-Zhe Song; |
406 | Free on The Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Traditional TTA methods, however, often rely on costly training or optimization processes, or make unrealistic assumptions about accessing or storing historical training and test data. Instead, this study proposes FreeTTA, a training-free and universally available method that makes no assumptions, to enhance the flexibility of TTA. |
Qiyuan Dai; Sibei Yang; |
407 | Adaptive Part Learning for Fine-Grained Generalized Category Discovery: A Plug-and-Play Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce an adaptive part discovery and learning method, called APL, which generates consistent object parts and their correspondences across different similar images using a set of shared learnable part queries and DINO part priors, without requiring any additional annotations. |
Qiyuan Dai; Hanzhuo Huang; Yu Wu; Sibei Yang; |
408 | VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. |
Shehan Munasinghe; Hanan Gani; Wenqi Zhu; Jiale Cao; Eric Xing; Fahad Shahbaz Khan; Salman Khan; |
409 | Learning on Model Weights Using Tree Experts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While effective, these linear classifiers are computationally expensive, especially when dealing with larger models that have many parameters. To address this, we introduce Probing Experts (ProbeX), a theoretically motivated and lightweight method. |
Eliahu Horwitz; Bar Cavia; Jonathan Kahana; Yedid Hoshen; |
410 | SkillMimic: Learning Basketball Interaction Skills from Demonstrations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce SkillMimic, a unified data-driven framework that fundamentally changes how agents learn interaction skills by eliminating the need for skill-specific rewards. |
Yinhuai Wang; Qihan Zhao; Runyi Yu; Hok Wai Tsui; Ailing Zeng; Jing Lin; Zhengyi Luo; Jiwen Yu; Xiu Li; Qifeng Chen; Jian Zhang; Lei Zhang; Ping Tan; |
411 | One-Way Ticket: Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: From our analysis, we observe redundant computations in the UNet encoders. Our findings suggest that, for T2I diffusion models, decoders are more adept at capturing richer and more explicit semantic information, while encoders can be effectively shared across decoders from diverse time steps. Based on these observations, we introduce the first Time-independent Unified Encoder (TiUE) for the student model UNet architecture, which is a loop-free image generation approach for distilling T2I diffusion models. |
Senmao Li; Lei Wang; Kai Wang; Tao Liu; Jiehang Xie; Joost van de Weijer; Fahad Shahbaz Khan; Shiqi Yang; Yaxing Wang; Jian Yang; |
412 | Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. |
Zaijing Li; Yuquan Xie; Rui Shao; Gongwei Chen; Dongmei Jiang; Liqiang Nie; |
413 | GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce GROVE, a generalized reward framework that enables open-vocabulary physical skill learning without manual engineering or task-specific demonstrations. |
Jieming Cui; Tengyu Liu; Ziyu Meng; Jiale Yu; Ran Song; Wei Zhang; Yixin Zhu; Siyuan Huang; |
414 | VidSeg: Training-free Video Semantic Segmentation Based on Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the first training-free approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models. |
Qian Wang; Abdelrahman Eldesokey; Mohit Mendiratta; Fangneng Zhan; Adam Kortylewski; Christian Theobalt; Peter Wonka; |
415 | AvatarArtist: Open-Domain 4D Avatarization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We select parametric triplanes as the intermediate 4D representation, and propose a practical training paradigm that takes advantage of both generative adversarial networks (GANs) and diffusion models. |
Hongyu Liu; Xuan Wang; Ziyu Wan; Yue Ma; Jingye Chen; Yanbo Fan; Yujun Shen; Yibing Song; Qifeng Chen; |
416 | Ev-3DOD: Pushing The Temporal Boundaries of 3D Object Detection with Event Cameras Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, existing algorithms fail to meet these requirements due to the latency and bandwidth limitations of fixed frame rate sensors, e.g., LiDAR and camera. To address this limitation, we introduce asynchronous event cameras into 3D object detection for the first time. |
Hoonhee Cho; Jae-Young Kang; Youngho Kim; Kuk-Jin Yoon; |
417 | HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Multimodal large language models (MLLMs) promise better comprehension and reasoning but face their own challenges: (1) difficulty in fine-grained defect localization due to the limitations in capturing tiny details, and (2) constraints in providing pixel-wise outputs necessary for precise heatmap generation. To address these challenges, we propose HEIE: a novel MLLM-Based Hierarchical Explainable Image Implausibility Evaluator. |
Fan Yang; Ru Zhen; Jianing Wang; Yanhao Zhang; Haoxiang Chen; Haonan Lu; Sicheng Zhao; Guiguang Ding; |
418 | Decoupling Fine Detail and Global Geometry for Compressed Depth Map Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Second, densely distributed random noise reduces the accuracy of estimating the global geometric structure of the scene. To address these challenges, we propose a novel framework, termed geometry-decoupled network (GDNet), for compressed depth map super-resolution that decouples the high-quality depth map reconstruction process by handling global and detailed geometric features separately. |
Huan Zheng; Wencheng Han; Jianbing Shen; |
419 | VideoDPO: Omni-Preference Alignment for Video Diffusion Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term the OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore and discover that re-weighting these pairs based on the score significantly impacts overall preference alignment. |
Runtao Liu; Haoyu Wu; Ziqiang Zheng; Chen Wei; Yingqing He; Renjie Pi; Qifeng Chen; |
420 | ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Proxy Transformation, suitable for multimodal tasks, to efficiently improve the point cloud manifold. |
Qihang Peng; Henry Zheng; Gao Huang; |
421 | Don’t Shake The Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: End-to-end autonomous driving frameworks enable seamless integration of perception and planning but often rely on one-shot trajectory prediction, which may lead to unstable control and vulnerability to occlusions in single-frame perception. To address this, we propose the Momentum-Aware Driving (MomAD) framework, which introduces trajectory momentum and perception momentum to stabilize and refine trajectory predictions. |
Ziying Song; Caiyan Jia; Lin Liu; Hongyu Pan; Yongchang Zhang; Junming Wang; Xingyu Zhang; Shaoqing Xu; Lei Yang; Yadan Luo; |
422 | SoundVista: Novel-View Ambient Sound Synthesis Via Visual-Acoustic Binding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce SoundVista, a method to generate the ambient sound of an arbitrary scene at novel viewpoints. |
Mingfei Chen; Israel D. Gebru; Ishwarya Ananthabhotla; Christian Richardt; Dejan Markovic; Jake Sandakly; Steven Krenn; Todd Keebler; Eli Shlizerman; Alexander Richard; |
423 | Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose ViewpointRosetta, an approach that unlocks large-scale unpaired ego and exo video data to learn clip-level viewpoint-invariant video representations. |
Mi Luo; Zihui Xue; Alex Dimakis; Kristen Grauman; |
424 | Gaussian Splatting Feature Fields for (Privacy-Preserving) Visual Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we utilize 3D Gaussian Splatting (3DGS)-based representations for accurate and privacy-preserving visual localization. |
Maxime Pietrantoni; Gabriela Csurka; Torsten Sattler; |
425 | EMOE: Modality-Specific Enhanced Dynamic Emotion Experts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these, we propose a novel model, Modality-Specific Enhanced Dynamic Emotion Experts (EMOE), consisting of: (1) Mixture of Modality Experts for dynamically adjusting modality importance based on sample features, and (2) Unimodal Distillation to retain single-modality predictive ability within fused features. |
Yiyang Fang; Wenke Huang; Guancheng Wan; Kehua Su; Mang Ye; |
426 | MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. |
Yilun Zhao; Haowei Zhang; Lujing Xie; Tongyan Hu; Guo Gan; Yitao Long; Zhiyuan Hu; Weiyuan Chen; Chuhan Li; Zhijian Xu; Chengye Wang; Ziyao Shangguan; Zhenwen Liang; Yixin Liu; Chen Zhao; Arman Cohan; |
427 | Skip Tuning: Pre-trained Vision-Language Models Are Effective and Efficient Adapters Themselves Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Nevertheless, in this work, we reveal that freezing the parameters of VLMs while learning the context vectors neither facilitates the transferability of pre-trained knowledge nor significantly improves memory and time efficiency. |
Shihan Wu; Ji Zhang; Pengpeng Zeng; Lianli Gao; Jingkuan Song; Heng Tao Shen; |
428 | DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Taking into account that depth can be regarded as a geometry supplement for RGB images, a straightforward question arises: Do we really need to explicitly encode depth information with neural networks as done for RGB images? Based on this insight, in this paper, we investigate a new way to learn RGBD feature representations and present DFormerv2, a strong RGBD encoder that explicitly uses depth maps as geometry priors rather than encoding depth information with neural networks. |
Bo-Wen Yin; Jiao-Long Cao; Ming-Ming Cheng; Qibin Hou; |
429 | Language-Guided Salient Object Ranking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we observe that when Large Vision-Language Models (LVLMs) describe a scene, they usually focus on the most salient object first, and then discuss the relations as they move on to the next (less salient) one. Based on this observation, we propose a novel Language-Guided Salient Object Ranking approach (named LG-SOR), which utilizes the internal knowledge within the LVLM-generated language descriptions, i.e., semantic relation cues and the implicit entity order cues, to facilitate saliency ranking. |
Fang Liu; Yuhao Liu; Ke Xu; Shuquan Ye; Gerhard Petrus Hancke; Rynson W. H. Lau; |
430 | Olympus: A Universal Task Router for Computer Vision Tasks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce Olympus, a new approach that transforms Multimodal Large Language Models (MLLMs) into a unified framework capable of handling a wide array of computer vision tasks. |
Yuanze Lin; Yunsheng Li; Dongdong Chen; Weijian Xu; Ronald Clark; Philip Torr; |
431 | 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. |
Jianing Yang; Xuweiyi Chen; Nikhil Madaan; Madhavan Iyengar; Shengyi Qian; David F. Fouhey; Joyce Chai; |
432 | Active Data Curation Effectively Distills Large-Scale Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we explore an alternative, yet simple approach—active data curation as effective distillation for contrastive multimodal pretraining. |
Vishaal Udandarao; Nikhil Parthasarathy; Muhammad Ferjad Naeem; Talfan Evans; Samuel Albanie; Federico Tombari; Yongqin Xian; Alessio Tonioni; Olivier J. Henaff; |
433 | MINIMA: Modality Invariant Image Matching Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present MINIMA, a unified image matching framework for multiple cross-modal cases. |
Jiangwei Ren; Xingyu Jiang; Zizhuo Li; Dingkang Liang; Xin Zhou; Xiang Bai; |
434 | A Unified Image-Dense Annotation Generation Model for Underwater Scenes Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. |
Hongkai Lin; Dingkang Liang; Zhenghao Qi; Xiang Bai; |
435 | Robust-MVTON: Learning Cross-Pose Feature Alignment and Fusion for Robust Multi-View Virtual Try-On Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Conversely, implicit encoding-based methods often lose spatial information about clothing, resulting in outputs that lack detail. To overcome these challenges, we propose Robust-MVTON, an end-to-end method for robust and high-quality multi-view try-ons. |
Nannan Zhang; Yijiang Li; Dong Du; Zheng Chong; Zhengwentai Sun; Jianhao Zeng; Yusheng Dai; Zhengyu Xie; Hairui Zhu; Xiaoguang Han; |
436 | MVPortrait: Text-Guided Motion and Emotion Control for Multi-view Vivid Portrait Animation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel two-stage text-guided framework, MVPortrait, to generate expressive multi-view portrait animations that faithfully capture the described motion and emotion. |
Yukang Lin; Hokit Fung; Jianjin Xu; Zeping Ren; Adela S.M. Lau; Guosheng Yin; Xiu Li; |
437 | IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper aims to safeguard portrait photos from unauthorized encoder-based customization. |
Yiren Song; Pei Yang; Hai Ci; Mike Zheng Shou; |
438 | The Photographer’s Eye: Teaching Multimodal Large Language Models to See, and Critique Like Photographers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Such distinctions between general visual understanding (detection, localization, etc.) and aesthetic perception (color, lighting, composition, etc.) pose a significant challenge for existing Multimodal Large Language Models (MLLMs) in comprehending image aesthetics, which is increasingly needed in real-world applications, from image recommendation and enhancement to generation. To fundamentally advance the aesthetic understanding of MLLMs, we introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, distinguished by its large scale, expertise, and diversity. |
Daiqing Qi; Handong Zhao; Jing Shi; Simon Jenni; Yifei Fan; Franck Dernoncourt; Scott Cohen; Sheng Li; |
439 | PartGen: Part-level 3D Generation and Reconstruction with Multi-view Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, most applications and creative workflows require 3D assets to be composed of distinct, meaningful parts that can be independently manipulated. To bridge this gap, we introduce PartGen, a novel approach for generating, from text, images, or unstructured 3D objects, 3D objects composed of meaningful parts. |
Minghao Chen; Roman Shapovalov; Iro Laina; Tom Monnier; Jianyuan Wang; David Novotny; Andrea Vedaldi; |
440 | Language-Guided Image Tokenization for Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). |
Kaiwen Zha; Lijun Yu; Alireza Fathi; David A. Ross; Cordelia Schmid; Dina Katabi; Xiuye Gu; |
441 | Visual Persona: Foundation Model for Full-Body Human Customization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. |
Jisu Nam; Soowon Son; Zhan Xu; Jing Shi; Difan Liu; Feng Liu; Seungryong Kim; Yang Zhou; |
442 | Learning from Neighbors: Category Extrapolation for Long-Tail Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast to existing methods, we offer a novel perspective on long-tail learning, inspired by an observation: datasets with finer granularity tend to be less affected by data imbalance. In this paper, we investigate this phenomenon through both quantitative and qualitative studies, showing that increased granularity enhances the generalization of learned features in tail categories. |
Shizhen Zhao; Xin Wen; Jiahui Liu; Chuofan Ma; Chunfeng Yuan; Xiaojuan Qi; |
443 | Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce the Silent Branding Attack, a novel data poisoning method that manipulates text-to-image diffusion models to generate images containing specific brand logos or symbols without any text triggers. |
Sangwon Jang; June Suk Choi; Jaehyeong Jo; Kimin Lee; Sung Ju Hwang; |
444 | Efficient Long Video Tokenization Via Coordinate-based Patch Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. |
Huiwon Jang; Sihyun Yu; Jinwoo Shin; Pieter Abbeel; Younggyo Seo; |
445 | DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present DesignDiffusion, a simple yet effective framework for the novel task of synthesizing design images from textual descriptions. |
Zhendong Wang; Jianmin Bao; Shuyang Gu; Dong Chen; Wengang Zhou; Houqiang Li; |
446 | AIM-Fair: Advancing Algorithmic Fairness Via Selectively Fine-Tuning Biased Models with Contextual Synthetic Data Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover, many approaches rely on the availability of demographic group labels, which are often costly to annotate. This paper proposes AIM-Fair, aiming to overcome these limitations and harness the potential of cutting-edge generative models in promoting algorithmic fairness. |
Zengqun Zhao; Ziquan Liu; Yu Cao; Shaogang Gong; Ioannis Patras; |
447 | MFogHub: Bridging Multi-Regional and Multi-Satellite Data for Global Marine Fog Detection and Forecasting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Through MFogHub, we aim to advance both the practical monitoring and scientific understanding of marine fog dynamics on a global scale. |
Mengqiu Xu; Kaixin Chen; Heng Guo; Yixiang Huang; Ming Wu; Zhenwei Shi; Chuang Zhang; Jun Guo; |
448 | DiffusionSfM: Predicting Structure and Motion Via Ray Origin and Endpoint Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. |
Qitao Zhao; Amy Lin; Jeff Tan; Jason Y. Zhang; Deva Ramanan; Shubham Tulsiani; |
449 | VODiff: Controlling Object Visibility Order in Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, we observe that this approach treats all objects as being on the same layer and neglects their visibility order, leading to the synthesis of overlapping objects with incorrect occlusions. To address this limitation, we introduce in this paper a new training-free framework that explicitly considers object visibility order and allows users to place overlapping objects in a stack of layers. |
Dong Liang; Jinyuan Jia; Yuhao Liu; Zhanghan Ke; Hongbo Fu; Rynson W. H. Lau; |
450 | DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. |
Leqi Shen; Guoqiang Gong; Tianxiang Hao; Tao He; Yifeng Zhang; Pengzhang Liu; Sicheng Zhao; Jungong Han; Guiguang Ding; |
451 | FlexUOD: The Answer to Real-world Unsupervised Image Outlier Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although a series of unsupervised outlier detection (UOD) approaches have been proposed, they cannot correctly answer this critical question, resulting in performance instability across various real-world (varying contamination factor) scenarios. To address this problem, we propose FlexUOD, which takes a novel contamination factor estimation perspective. |
Zhonghang Liu; Kun Zhou; Changshuo Wang; Wen-Yan Lin; Jiangbo Lu; |
452 | DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most existing methods of 3D clothed human reconstruction from a single image treat the clothed human as a single object without distinguishing between cloth and human body. In this regard, we present DeClotH, which separately reconstructs 3D cloth and human body from a single image. |
Hyeongjin Nam; Donghwan Kim; Jeongtaek Oh; Kyoung Mu Lee; |
453 | BG-Triangle: Bezier Gaussian Triangle for 3D Vectorization and Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a robust and effective discontinuity-aware rendering technique to reduce uncertainties at object boundaries. |
Minye Wu; Haizhao Dai; Kaixin Yao; Tinne Tuytelaars; Jingyi Yu; |
454 | Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To overcome their common issues including unstable deformations and the necessity for careful pre-alignment or high-quality initial 3D correspondences, we introduce Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence. |
Haolin Liu; Xiaohang Zhan; Zizheng Yan; Zhongjin Luo; Yuxin Wen; Xiaoguang Han; |
455 | 3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Without hand-crafted regularizers, they tend to disperse irregularly around the actual surface. To circumvent these issues, we introduce a novel method, named 3D Convex Splatting (3DCS), which leverages 3D smooth convexes as primitives for modeling geometrically-meaningful radiance fields from multi-view images. |
Jan Held; Renaud Vandeghen; Abdullah Hamdi; Adrien Deliege; Anthony Cioppa; Silvio Giancola; Andrea Vedaldi; Bernard Ghanem; Marc Van Droogenbroeck; |
456 | Navigation World Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. |
Amir Bar; Gaoyue Zhou; Danny Tran; Trevor Darrell; Yann LeCun; |
457 | Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Sparsely activated mixture-of-experts (MoE) has recently been demonstrated as a valid solution to training large-scale models with limited resources. Inspired by this, we present Aurora, a GAN-based text-to-image generator that employs a collection of experts to learn feature processing, together with a sparse router to adaptively select the most suitable expert for each feature point. |
Jiapeng Zhu; Ceyuan Yang; Kecheng Zheng; Yinghao Xu; Zifan Shi; Yifei Zhang; Qifeng Chen; Yujun Shen; |
458 | VTON 360: High-Fidelity Virtual Try-On from Any Viewing Direction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose VTON 360, a novel 3D VTON method that addresses the open challenge of achieving high-fidelity VTON that supports any-view rendering. |
Zijian He; Yuwei Ning; Yipeng Qin; Guangrun Wang; Sibei Yang; Liang Lin; Guanbin Li; |
459 | Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, recent findings suggest high semantic similarity between well-trained unimodal encoders, which raises a key question: Are semantically similar embedding spaces separated only by simple projection transformations? To validate this, we propose a novel framework that aligns vision and language using frozen unimodal encoders. |
Mayug Maniparambil; Raiymbek Akshulakov; Yasser Abdelaziz Dahou Djilali; Sanath Narayan; Ankit Singh; Noel E. O’Connor; |
460 | CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the first manual-based appliance manipulation benchmark CheckManual. |
Yuxing Long; Jiyao Zhang; Mingjie Pan; Tianshu Wu; Taewhan Kim; Hao Dong; |
461 | Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recently, the rapid development of AIGC has significantly increased the diversity of fake media spread on the Internet, posing unprecedented threats to social security, politics, law, etc. To detect the increasingly **diverse** malicious fake media in the new era of AIGC, recent studies have proposed to exploit Large Vision Language Models (LVLMs) to design **robust** forgery detectors, owing to their impressive performance on a **wide** range of multimodal tasks. However, a comprehensive benchmark for assessing LVLMs’ discerning capabilities on forgery media is still lacking. To fill this gap, we present Forensics-Bench, a new forgery detection evaluation benchmark suite that assesses LVLMs across massive forgery detection tasks, requiring comprehensive recognition, localization and reasoning capabilities on diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types and forgery models. We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet), highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. |
Jin Wang; Chenghui Lv; Xian Li; Shichao Dong; Huadong Li; Kelu Yao; Chao Li; Wenqi Shao; Ping Luo; |
462 | Motions As Queries: One-Stage Multi-Person Holistic Human Motion Capture Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast, we develop a one-stage multi-person holistic human motion capture system, which 1) employs only one network, enabling significant benefits from end-to-end training on a large-scale dataset; 2) allows the tracking module to improve during training, rather than being limited by a pre-trained tracker; 3) captures the motions of all individuals within a single shot, rather than tracking and estimating each person sequentially. |
Kenkun Liu; Yurong Fu; Weihao Yuan; Jing Lin; Peihao Li; Xiaodong Gu; Lingteng Qiu; Haoqian Wang; Zilong Dong; Xiaoguang Han; |
463 | Dense Match Summarization for Faster Two-view Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we speed up robust two-view relative pose from dense correspondences. |
Jonathan Astermark; Anders Heyden; Viktor Larsson; |
464 | WISNet: Pseudo Label Generation on Unbalanced and Patch Annotated Waste Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we construct a new dataset consisting of 12,208 waste images, upon which seed regions (i.e., patches) are annotated and classified into 21 categories in a crowdsourcing fashion. |
Shifan Zhang; Hongzi Zhu; Yinan He; Minyi Guo; Ziyang Lou; Shan Chang; |
465 | MixerMDM: Learnable Composition of Human Motion Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the proposed merging strategies overlook that the optimal way to combine the generation processes might depend on the particularities of each pre-trained generative model and also the specific textual descriptions. In this context, we introduce MixerMDM, the first learnable model composition technique for combining pre-trained text-conditioned human motion diffusion models. |
Pablo Ruiz-Ponce; German Barquero; Cristina Palmero; Sergio Escalera; José García-Rodríguez; |
466 | Disentangled Pose and Appearance Guidance for Multi-Pose Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods often overlook the fundamental differences between spatial transformations of poses and texture generation for appearance, which makes them prone to overfitting. To address this issue, we propose a multi-pose generation framework driven by disentangled pose and appearance guidance. |
Tengfei Xiao; Yue Wu; Yuelong Li; Can Qin; Maoguo Gong; Qiguang Miao; Wenping Ma; |
467 | FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate how to best work with MLLMs in an object placement task. |
Ian Huang; Yanan Bao; Karen Truong; Howard Zhou; Cordelia Schmid; Leonidas Guibas; Alireza Fathi; |
468 | TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To enhance realism, we introduce a three-stage pose-refinement pipeline that improves hand posture accuracy in generated videos. |
Hongxiang Zhao; Xingchen Liu; Mutian Xu; Yiming Hao; Weikai Chen; Xiaoguang Han; |
469 | IM-Zero: Instance-level Motion Controllable Video Generation in A Zero-shot Manner Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although existing zero-shot video generation methods are an efficient alternative to prohibitively expensive training-based video generation, they cannot generate high-quality and motion-consistent videos under the control of layouts and movement trajectories. To address this problem, we propose a novel zero-shot method named IM-Zero that improves instance-level motion-controllable video generation with enhanced control accuracy, motion consistency, and richness of details. |
Yuyang Huang; Yabo Chen; Li Ding; Xiaopeng Zhang; Wenrui Dai; Junni Zou; Hongkai Xiong; Qi Tian; |
470 | EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. |
Yang Yue; Yulin Wang; Haojun Jiang; Pan Liu; Shiji Song; Gao Huang; |
471 | CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present CheXWorld, the first effort towards a self-supervised world model for radiographic images. |
Yang Yue; Yulin Wang; Chenxin Tao; Pan Liu; Shiji Song; Gao Huang; |
472 | Zero-shot 3D Question Answering Via Voxel-based Dynamic Token Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Common methods such as token pooling reduce visual token usage but often lead to information loss, impairing the model’s ability to preserve visual details essential for 3D question answering tasks. To address this, we propose voxel-based Dynamic Token Compression (DTC), which combines 3D spatial priors and visual semantics to achieve over 90% reduction in visual token usage for current multi-frame VLMs. |
Hsiang-Wei Huang; Fu-Chen Chen; Wenhao Chai; Che-Chun Su; Lu Xia; Sanghun Jung; Cheng-Yen Yang; Jenq-Neng Hwang; Min Sun; Cheng-Hao Kuo; |
473 | ID-Patch: Robust ID Association for Group Photo Personalization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods suffer from limitations such as the reliance on segmentation models, increased runtime, or a high probability of ID leakage. To address these challenges, we propose ID-Patch, a novel method that provides robust association between identities and 2D positions. |
Yimeng Zhang; Tiancheng Zhi; Jing Liu; Shen Sang; Liming Jiang; Qing Yan; Sijia Liu; Linjie Luo; |
474 | Uncertainty-Instructed Structure Injection for Generalizable HD Map Construction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While recent studies demonstrate improved performance, their generalization capability across unfamiliar driving scenes remains unexplored. To tackle this issue, we propose UIGenMap, an uncertainty-instructed structure injection approach for generalizable HD map vectorization, which performs uncertainty resampling over statistical distributions and employs explicit instance features to reduce the excessive reliance on training data. |
Xiaolu Liu; Ruizi Yang; Song Wang; Wentong Li; Junbo Chen; Jianke Zhu; |
475 | Hiding Images in Diffusion Models By Editing Learned Score Functions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address these, we describe a simple yet effective approach that embeds images at specific timesteps in the reverse diffusion process by editing the learned score functions. |
Haoyu Chen; Yunqiao Yang; Nan Zhong; Kede Ma; |
476 | Mamba-Reg: Vision Mamba Also Needs Registers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Similar to findings in Vision Transformers, this paper identifies artifacts that are also present within the feature maps of Vision Mamba. |
Feng Wang; Jiahao Wang; Sucheng Ren; Guoyizhe Wei; Jieru Mei; Wei Shao; Yuyin Zhou; Alan Yuille; Cihang Xie; |
477 | Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. |
Feng Wang; Timing Yang; Yaodong Yu; Sucheng Ren; Guoyizhe Wei; Angtian Wang; Wei Shao; Yuyin Zhou; Alan Yuille; Cihang Xie; |
478 | Parallelized Autoregressive Visual Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a simple yet effective approach for parallelized autoregressive visual generation that improves generation efficiency while preserving the advantages of autoregressive modeling. |
Yuqing Wang; Shuhuai Ren; Zhijie Lin; Yujin Han; Haoyuan Guo; Zhenheng Yang; Difan Zou; Jiashi Feng; Xihui Liu; |
479 | VTON-HandFit: Virtual Try-on for Arbitrary Hand Pose Guided By Hand Priors Embedding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Although diffusion-based image virtual try-on has made considerable progress, emerging approaches still struggle to effectively address the issue of hand occlusion (i.e., clothing regions occluded by the hand part), leading to a notable degradation of try-on performance. To tackle this issue, which widely exists in real-world scenarios, we propose VTON-HandFit, which leverages the power of hand priors to reconstruct the appearance and structure in hand-occlusion cases. |
Yujie Liang; Xiaobin Hu; Boyuan Jiang; Donghao Luo; Xu Peng; Kai Wu; Chengming Xu; Wenhui Han; Taisong Jin; Chengjie Wang; Rongrong Ji; |
480 | Learning Temporally Consistent Video Depth from Video Diffusion Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context. |
Jiahao Shao; Yuanbo Yang; Hongyu Zhou; Youmin Zhang; Yujun Shen; Vitor Guizilini; Yue Wang; Matteo Poggi; Yiyi Liao; |
481 | OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present OverLoCK, the first pure ConvNet backbone architecture that explicitly incorporates a top-down attention mechanism. |
Meng Lou; Yizhou Yu; |
482 | K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we argue that the intrinsic properties of LoRA can effectively guide diffusion models in merging learned subject and style. |
Ziheng Ouyang; Zhen Li; Qibin Hou; |
483 | JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving Via Minimal Real-World Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To overcome both challenges, we propose a plug-and-play method called JiSAM, shorthand for Jittering augmentation, domain-aware backbone and memory-based Sectorized AlignMent. |
Runjian Chen; Wenqi Shao; Bo Zhang; Shaoshuai Shi; Li Jiang; Ping Luo; |
484 | TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces TurboFill, a fast image inpainting model that enhances a few-step text-to-image diffusion model with an inpainting adapter for high-quality and efficient inpainting. |
Liangbin Xie; Daniil Pakhomov; Zhonghao Wang; Zongze Wu; Ziyan Chen; Yuqian Zhou; Haitian Zheng; Zhifei Zhang; Zhe Lin; Jiantao Zhou; Chao Dong; |
485 | Revisiting Backdoor Attacks Against Large Vision-Language Models from Domain Shift Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a new evaluation dimension, backdoor domain generalization, to assess attack robustness under visual and text domain shifts. |
Siyuan Liang; Jiawei Liang; Tianyu Pang; Chao Du; Aishan Liu; Mingli Zhu; Xiaochun Cao; Dacheng Tao; |
486 | HVI: A New Color Space for Low-light Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While converting the images using Hue, Saturation and Value (HSV) color space helps resolve the brightness issue, it introduces significant red and black noise artifacts. To address this issue, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by polarized HS maps and learnable intensity. |
Qingsen Yan; Yixu Feng; Cheng Zhang; Guansong Pang; Kangbiao Shi; Peng Wu; Wei Dong; Jinqiu Sun; Yanning Zhang; |
487 | Sonata: Self-Supervised Learning of Reliable Point Representations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we question whether we have a reliable self-supervised point cloud model that can be used for diverse 3D tasks via simple linear probing, even with limited data and minimal computation. |
Xiaoyang Wu; Daniel DeTone; Duncan Frost; Tianwei Shen; Chris Xie; Nan Yang; Jakob Engel; Richard Newcombe; Hengshuang Zhao; Julian Straub; |
488 | Decompositional Neural Scene Reconstruction with Generative Diffusion Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose DP-Recon, which employs diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each individual object under novel views. |
Junfeng Ni; Yu Liu; Ruijie Lu; Zirui Zhou; Song-Chun Zhu; Yixin Chen; Siyuan Huang; |
489 | Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we move a step forward and design an approach that allows for multimodal queries – composed of both an image and a text – and can search within collections of multimodal documents, where images and text are interleaved. |
Davide Caffagni; Sara Sarto; Marcella Cornia; Lorenzo Baraldi; Rita Cucchiara; |
490 | Leveraging Temporal Cues for Semi-Supervised Multi-View 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This requirement significantly inhibits their deployment in various locations and sensor configurations. To address this gap, we propose a performant semi-supervised framework that leverages unlabeled RGB-only driving sequences – data easily collected with cost-effective RGB cameras – to significantly improve temporal, camera-only 3D detectors. |
Jinhyung Park; Navyata Sanghvi; Hiroki Adachi; Yoshihisa Shibata; Shawn Hunt; Shinya Tanaka; Hironobu Fujiyoshi; Kris Kitani; |
491 | Transfer Your Perspective: Controllable 3D Generation from Any Viewpoint in A Driving Scene Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present the very first solution, using a combination of synthetic collaborative data and real ego-car data. |
Tai-Yu Pan; Sooyoung Jeon; Mengdi Fan; Jinsu Yoo; Zhenyang Feng; Mark Campbell; Kilian Q. Weinberger; Bharath Hariharan; Wei-Lun Chao; |
492 | SVG-IR: Spatially-Varying Gaussian Splatting for Inverse Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel framework called Spatially-Varying Gaussian Inverse Rendering (SVG-IR), aimed at enhancing both NVS and relighting quality. |
Hanxiao Sun; Yupeng Gao; Jin Xie; Jian Yang; Beibei Wang; |
493 | No Pains, More Gains: Recycling Sub-Salient Patches for Efficient High-Resolution Image Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless, many HRIR tasks necessitate the exploration of wider regions to model objects and contexts, which limits their performance in such scenarios. To address this issue, we present a DBPS strategy to enable training with more patches at low computational cost. |
Rong Qin; Xin Liu; Xingyu Liu; Jiaxuan Liu; Jinglei Shi; Liang Lin; Jufeng Yang; |
494 | Boosting The Dual-Stream Architecture in Ultra-High Resolution Segmentation with Resolution-Biased Uncertainty Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most of them overly concentrate on crafting complex pipelines to pursue one of the above objectives separately, limiting model performance in both accuracy and inference cost. In this paper, we suggest simultaneously achieving these objectives by estimating resolution-biased uncertainties in the low-resolution stream. |
Rong Qin; Xingyu Liu; Jinglei Shi; Liang Lin; Jufeng Yang; |
495 | RORem: Training A Robust Object Remover with Human-in-the-Loop Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Such issues are mainly caused by the lack of high-quality paired training data, as well as the self-supervised training paradigm adopted in these methods, which forces the model to in-paint the masked regions, leading to ambiguity between synthesizing the masked objects and restoring the background. To address these issues, we propose a semi-supervised learning strategy with human-in-the-loop to create high-quality paired training data, aiming to train a Robust Object Remover (RORem). |
Ruibin Li; Tao Yang; Song Guo; Lei Zhang; |
496 | Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Generative Densification, an efficient and generalizable densification strategy specifically tailored for feed-forward models. |
Seungtae Nam; Xiangyu Sun; Gyeongjin Kang; Younggeun Lee; Seungjun Oh; Eunbyung Park; |
497 | PARC: A Quantitative Framework Uncovering The Symmetries Within Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce PARC (Prompt Analysis via Reliability and Calibration), a VLM prompt sensitivity analysis framework built on three pillars: (1) plausible prompt variations in both the language and vision domain, (2) a novel model reliability score with built-in guarantees, and (3) a calibration step that enables dataset- and prompt-spanning prompt variation analysis. |
Jenny Schmalfuss; Nadine Chang; Vibashan VS; Maying Shen; Andres Bruhn; Jose M. Alvarez; |
498 | Image Is All You Need to Empower Large-scale Diffusion Models for In-Domain Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Early research typically required training specialized generators for each unique task and domain, often relying on fully-labeled data. Motivated by the powerful generative capabilities and broad applications of diffusion models, we are driven to explore leveraging label-free data to empower these models for in-domain generation. Fine-tuning a pre-trained generative model on domain data is an intuitive but challenging approach and often requires complex manual hyper-parameter adjustments, since the limited diversity of the training data can easily disrupt the model’s original generative capabilities. To address this challenge, we propose a guidance-decoupled prior preservation mechanism to achieve high generative quality and controllability with image-only data, inspired by preserving the pre-trained model from a denoising guidance perspective. We decouple domain-related guidance from the conditional guidance used in classifier-free guidance mechanisms to preserve open-world control guidance and unconditional guidance from the pre-trained model. |
Pu Cao; Feng Zhou; Lu Yang; Tianrui Huang; Qing Song; |
499 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. |
Sanghwan Kim; Rui Xiao; Mariana-Iuliana Georgescu; Stephan Alaniz; Zeynep Akata; |
500 | Mind The Time: Temporally-Controlled Multi-Event Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. |
Ziyi Wu; Aliaksandr Siarohin; Willi Menapace; Ivan Skorokhodov; Yuwei Fang; Varnith Chordia; Igor Gilitschenski; Sergey Tulyakov; |
This table only includes 500 papers selected by our daily digest algorithm. To continue with the full list (~2,800 papers), please visit Paper Digest: CVPR-2025 (Full List).