Paper Digest: CVPR 2026 Papers & Highlights
Note: CVPR-2026 accepted more than 4,000 papers; this page includes only 500 of them, selected by our daily paper digest algorithm. Interested users can read all 4,000 CVPR-2026 papers on a separate page, which takes some time to load.
To search for papers presented at CVPR-2026 on a specific topic, use the search by venue (CVPR-2026) service. To summarize the latest research published at CVPR-2026 on a specific topic, use the review by venue (CVPR-2026) service. If you prefer to browse papers by author, we have a comprehensive list of ~17,000 authors (CVPR-2026). Additionally, you may want to explore our “Best Paper” Digest (CVPR), which lists the most influential CVPR papers since 1988.
Since 2018, Paper Digest has built a foundation of data spanning decades of conferences, journals, and research topics. The platform features a daily digest service that sifts through tens of thousands of new papers, clinical trials, news articles, and community posts, filtering the noise to highlight what matters most to specific interests. Beyond daily updates, dozens of built-in research tools streamline the academic workflow, supporting efficient reading and writing, comprehensive literature reviews, and automated research report generation.
Paper Digest Team
New York City, New York, 10017
team@paperdigest.org
TABLE 1: Paper Digest: CVPR 2026 Papers & Highlights
| # | Paper | Author(s) |
|---|---|---|
| 1 | In Pursuit of Pixel Supervision for Visual Pre-training. Highlight: We present Pixo, a capable self-supervised model trained by purely predicting pixels. | Lihe Yang; Shang-Wen Li; Yang Li; Xinjie Lei; Dong Wang; Abdelrahman Mohamed; Saining Xie; Hengshuang Zhao; Kaiming He; Hu Xu; |
| 2 | ARCache: Mitigating Error Accumulation for Caching-based Acceleration in Autoregressive Video Diffusion Models. Highlight: In such settings, any approximation errors introduced by acceleration tend to propagate over time, resulting in severe error accumulation and progressive degradation of video quality. To address this challenge, we propose ARCache, the first training-free caching-based acceleration framework specifically designed for autoregressive video diffusion models. | Kepan Nan; Wangbo Zhao; Penghao Zhou; Jun Li; Zhenheng Yang; Jian Yang; Ying Tai; |
| 3 | Native and Compact Structured Latents for 3D Generation. Highlight: Recent advancements in 3D generative modeling have significantly improved generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. | Jianfeng XIANG; Xiaoxue Chen; Sicheng Xu; Ruicheng Wang; Zelong Lv; Yu Deng; Hongyuan Zhu; Yue Dong; Hao Zhao; Nicholas Jing Yuan; Jiaolong Yang; |
| 4 | Building A Precise Video Language with Human–AI Oversight. Highlight: We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. | Zhiqiu Lin; Siyuan Cen; Chancharik Mitra; Isaac Li; Yuhan Huang; Yu Ling; Hewei Wang; Irene Pi; Shihang Zhu; Yili Han; Yilun Du; Deva Ramanan; |
| 5 | Omni-Attribute: Open-vocabulary Image Attribute Encoder for Visual Disentanglement and Composition. Highlight: This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. | Tsai-Shien Chen; Aliaksandr Siarohin; Guocheng Qian; Kuan-Chieh Wang; Egor Nemchinov; Moayed Haji Ali; Riza Alp Guler; Willi Menapace; Ivan Skorokhodov; Anil Kag; Jun-Yan Zhu; Sergey Tulyakov; |
| 6 | FRM: Linear-Time 3D Reconstruction Via Test-Time Training. Highlight: We introduce Fast Reconstruction Model, a stateful feed-forward reconstruction model whose bidirectional architecture scales linearly in the number of input views, while matching or surpassing the reconstruction quality of quadratic-time methods. | Haian Jin; Rundi Wu; Tianyuan Zhang; Ruiqi Gao; Jonathan T. Barron; Noah Snavely; Aleksander Holynski; |
| 7 | Cross-View Splatter: Feed-Forward View Synthesis with Georeferenced Images. Highlight: We present Cross-View Splatter, a feed-forward method that predicts pixel-aligned Gaussian splats for outdoor scenes captured both at ground level and by satellite. | Matias Turkulainen; Akshay Krishnan; Filippo Aleotti; Mohamed Sayed; Guillermo Garcia-Hernando; Juho Kannala; Arno Solin; Gabriel Brostow; Daniyar Turmukhambetov; |
| 8 | Towards Hierarchical 3D Spatial Understanding in Vision-Language Models. Highlight: In this paper, we propose a principled hierarchical framework that decomposes the learning of 3D spatial understanding in VLMs into four progressively complex stages, from geometric perception to abstract spatial reasoning. | Huizhi Liang; Yichao Shen; Yu Deng; Sicheng Xu; ZhiYuan Feng; Tong Zhang; Yaobo Liang; Jiaolong Yang; |
| 9 | Improved Mean Flows: On The Challenges of Fastforward Generative Models. Highlight: MeanFlow provides a principled framework for fastforward generative modeling. | ZHENGYANG GENG; Yiyang Lu; Zongze Wu; Eli Shechtman; Zico Kolter; Kaiming He; |
| 10 | Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation. Highlight: We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. | Tairan He; Zi Wang; Haoru Xue; Qingwei Ben; Zhengyi Luo; Wenli Xiao; Ye Yuan; Xingye Da; Fernando Castañeda; Shankar Sastry; Changliu Liu; Guanya Shi; Linxi Fan; Yuke Zhu; |
| 11 | Open-Med-Reasoner: Data Recipes for Multimodal Medical Reasoning. Highlight: We present key insights, describe the data curation strategy, and outline next steps toward developing robust medical vision-language reasoning systems. | Timothy Ossowski; Sheng Zhang; Qianchu Liu; Guanghui Qin; Reuben Tan; Tristan Naumann; Junjie Hu; Hoifung Poon; |
| 12 | Masked-Diffusion Autoencoders for 3D Medical Vision Representation Learning. Highlight: We introduce Masked-Diffusion Autoencoders (MDAE), a self-supervised framework that imposes concurrent spatial masking and diffusion corruption, encouraging the model to learn complementary objectives: masked region reconstruction for structural coherence and visible region denoising for textural characteristics. | Jiachen Tu; Guanghui Qin; Theodore Zhao; Jeya Maria Jose Valanarasu; Sheng Zhang; Tristan Naumann; Fan Lam; Sheng Wang; Hoifung Poon; |
| 13 | 4D-RGPT: Toward Region-level 4D Understanding Via Perceptual Distillation. Highlight: Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) \ourbenchmark, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. | Chiao-An Yang; Ryo Hachiuma; Sifei Liu; Subhashree Radhakrishnan; Raymond A. Yeh; Yu-Chiang Frank Wang; Min-Hung Chen; |
| 14 | Monet: Reasoning in Latent Visual Space Beyond Image and Language. Highlight: In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. | Qixun Wang; Yang Shi; Yifei Wang; Yuanxing Zhang; Pengfei Wan; Kun Gai; Xianghua Ying; Yisen Wang; |
| 15 | Grounded 3D-Aware Spatial Vision-Language Modeling. Highlight: We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities (explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding) within a single framework. | An-Chieh Cheng; Yang Fu; Yatai Ji; Ligeng Zhu; Guanqi Zhan; Zhuoyang Zhang; Zhaojing Yang; Song Han; Yao Lu; Pavlo Molchanov; Vidya Nariyambut Murali; Jan Kautz; Xiaolong Wang; Danny Yin; Sifei Liu; |
| 16 | Back to Basics: Let Denoising Generative Models Denoise. Highlight: In this paper, we suggest that predicting clean data and predicting noised quantities are fundamentally different. | Tianhong Li; Kaiming He; |
| 17 | Global Structure-from-Motion Meets Feedforward Reconstruction. Highlight: However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, and robustness, and typically fall short of classical methods in standard reconstruction settings. In this work, we systematically analyze these limitations and propose a new state-of-the-art Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods. | Linfei Pan; Johannes Schönberger; Marc Pollefeys; |
| 18 | Iris: Bringing Real-World Priors Into Diffusion Model for Monocular Depth Estimation. Highlight: In this paper, we propose **Iris**, a deterministic framework for Monocular Depth Estimation (MDE) that integrates real-world priors into the diffusion model. | Xinhao Cai; Gensheng Pei; Zeren Sun; Yazhou Yao; Fumin Shen; Wenguan Wang; |
| 19 | SpeeDiff: Scalable Pixel-Anchored End-to-End Latent Diffusion Model. Highlight: We present Scalable Pixel-anchored End-to-end Diffusion (SpeeDiff), a latent diffusion method that jointly trains the VAE and the diffusion model from scratch. | Bingliang Zhang; Wenda Chu; Yizhuo Li; Linjie Yang; Yisong Yue; Katie Bouman; Yang Song; Qiushan Guo; |
| 20 | OneThinker: All-in-one Reasoning Model for Image and Video. Highlight: To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. | Kaituo Feng; Manyuan Zhang; Hongyu Li; Kaixuan Fan; shuang chen; Yilei Jiang; Dian Zheng; Peiwen Sun; Yiyuan Zhang; Haoze Sun; Yan Feng; Peng Pei; Xunliang Cai; Xiangyu Yue; |
| 21 | LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis. Highlight: In this paper, we show that much better quality can be obtained by leveraging a strong 3D bias without a 3D representation. | Stanislaw Szymanowicz; Minghao Chen; Jianyuan Wang; Christian Rupprecht; Andrea Vedaldi; |
| 22 | Improving Vision-language Models with Perception-centric Process Reward Models. Highlight: To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors. | Yingqian Min; Kun Zhou; Yifan Li; Yuhuan Wu; Han Peng; Yifan Du; Xin Zhao; Min Yang; Ji-Rong Wen; |
| 23 | GeoSAM2: Unleashing The Power of SAM2 for 3D Part Segmentation. Highlight: We introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation that casts the task as multi-view 2D mask prediction. | Ken Deng; Yunhan Yang; Jingxiang Sun; Xihui Liu; Yebin Liu; Ding Liang; Yan-Pei Cao; |
| 24 | Humanoid Generative Pre-Training for Zero-Shot Motion Tracker. Highlight: We introduce Humanoid-GPT, the first GPT-style humanoid motion Transformer trained with causal attention on a billion-scale motion corpus for whole-body control. | Zekun Qi; Xuchuan Chen; Jilong Wang; Chenghuai Lin; Yunrui Lian; Wenyao Zhang; XinQiang Yu; He Wang; Li Yi; |
| 25 | FlashDecoder: Real-Time Latent-to-Pixel Streaming Decoder with Transformers. Highlight: We introduce FlashDecoder, the first Transformer-based latent-to-pixel video decoder designed for streaming. | Minguk Kang; Suha Kwak; |
| 26 | Reconstructing Functional 3D Scenes from Egocentric Interaction Videos. Highlight: We present FunREC, a method for reconstructing functional 3D digital twins of indoor scenes directly from egocentric RGB-D interaction videos. | Alexandros Delitzas; Chenyangguang Zhang; Alexey Gavryushin; Tommaso Di Mario; Boyang Sun; Rishabh Dabral; Leonidas Guibas; Christian Theobalt; Marc Pollefeys; Francis Engelmann; Daniel Barath; |
| 27 | LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning. Highlight: In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. | Zebin You; Shen Nie; Xiaolu Zhang; JUN ZHOU; Zhiwu Lu; Ji-Rong Wen; Chongxuan Li; |
| 28 | Cupid: Generative 3D Reconstruction Via Joint Object and Pose Modeling. Highlight: We introduce Cupid, a generative 3D reconstruction framework that jointly models the full distribution over both canonical objects and camera poses. | Binbin Huang; Haobin Duan; Yiqun Zhao; Zibo Zhao; Yi Ma; Shenghua Gao; |
| 29 | ARC Is A Vision Problem! Highlight: In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. | Keya Hu; Ali Cy; Linlu Qiu; Delores (Xiaoman) Ding; Runqian Wang; Yeyin Zhu; Jacob Andreas; Kaiming He; |
| 30 | SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence. Highlight: Concretely, we make the following contributions in this paper: (i) we propose **SpatialScore**, the most comprehensive and diverse multimodal spatial intelligence benchmark to date, encompassing various visual data types, input modalities, and QA formats with around 5K manually verified samples across 30 distinct tasks; (ii) we construct **SpatialCorpus**, a large-scale training resource with 331K multimodal QA samples for supervised fine-tuning of Qwen3-VL on spatial understanding; (iii) we develop **SpatialAgent**, a multi-agent system incorporating 12 specialized spatial perception tools, supporting both *Plan-Execute* and *ReAct* reasoning paradigms, improving spatial reasoning in a training-free manner; and (iv) we conduct extensive evaluations on 40 representative MLLMs, revealing persistent challenges in spatial intelligence while demonstrating the effectiveness of our data-driven and agent-based solutions. | Haoning Wu; Xiao Huang; Yaohui Chen; Ya Zhang; Yanfeng Wang; Weidi Xie; |
| 31 | Hint2Gen: Bridging Understanding and Generation Via Code-structured Hints. Highlight: This reveals that the core bottleneck is not reasoning capacity, but the lack of a structured interface to translate high-level reasoning into precise visual output. To bridge this gap, we propose using code-structured visual hints (i.e., SVG/HTML overlays) that explicitly encode reasoning steps directly on the image plane. | Yuanpeng Tu; Yunpeng Chen; Xi Chen; Liang Li; Hengshuang Zhao; |
| 32 | Temporal Equilibrium MeanFlow: Bridging The Scale Gap for One-Step Generation. Highlight: The core issue is a conflict between two opposing forces: terms that amplify variance over long time spans and strong constraints needed near the start of generation, which a fixed sampling strategy cannot reconcile. To resolve this, we propose Temporal Equilibrium MeanFlow (TEMF), which balances these competing demands through two simple yet effective components: (1) a temporal equilibrium weighting function that equalizes gradient influence across all time scales, and (2) a dynamic boundary scheduler that gradually shifts training focus, from stabilizing early steps to refining the full trajectory as training progresses. | Yuanpeng Tu; Yunpeng Chen; Xinyu Zhang; Chao Liao; Hengshuang Zhao; |
| 33 | Demo2Tutorial: From Human Experience to Multimodal Software Tutorials. Highlight: Human experience in digital environments offers a vast, underexplored resource of authentic, untrimmed interactions that contain rich procedural knowledge. We introduce Demo2Tutorial, a framework that transforms this experience, captured via screen recordings and interaction logs, into structured, multimodal software tutorials for teaching both humans and agents. | Zechen Bai; Zhiheng Chen; Yiqi Lin; Kevin Qinghong Lin; Difei Gao; Xiangwu Guo; WANG XIN; Mike Zheng Shou; |
| 34 | Ego2Web: A Web Agent Benchmark Grounded on Egocentric Videos. Highlight: This limitation prevents evaluation in crucial scenarios, such as when an agent must use egocentric visual perception (e.g., via AR glasses) to recognize an object in the user’s surroundings and then complete a related task online (e.g., making a purchase). To address this gap, we introduce Ego2Web, the first benchmark designed to bridge egocentric video perception and multimodal web agent execution. | Shoubin Yu; Lei Shu; Antoine Yang; Yao Fu; Srinivas Sunkara; Maria Wang; Jindong Chen; Mohit Bansal; Boqing Gong; |
| 35 | Spe-BEVHead: Rethinking The Detection Head Design for Bird’s-Eye-View Object Detection. Highlight: This leads to three inherent limitations: (i) a geometric mismatch between the Gaussian kernel used for classification and the real BEV object, (ii) degraded end-to-end performance without Non-Maximum Suppression (NMS), and (iii) sparse supervisory signals. To address these issues, we propose Spe-BEVHead, a detection head specifically tailored for BEV 3D object detection. | Junshu Zhang; Sicheng Zhao; Xin Zhao; Fan Yang; Ruike Chen; Jungong Han; Guiguang Ding; |
| 36 | Retrieving Counterfactuals Improves Visual In-Context Learning. Highlight: We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual examples through targeted, attribute-guided composed image retrieval. | Guangzhi Xiong; Sanchit Sinha; Zhenghao He; Aidong Zhang; |
| 37 | SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation. Highlight: Conversely, tuning only the visual representation leads to semantic confusion in representing the intended action. To address these limitations, we propose SynMotion, a new motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. | Shuai Tan; Biao Gong; Yujie Wei; Shiwei Zhang; Zhuoxin Liu; Ke Ma; Yan Wang; Kecheng Zheng; Xing Zhu; Yujun Shen; Hengshuang Zhao; |
| 38 | Bidirectional Normalizing Flow: From Data to Noise and Back. Highlight: In this work, we introduce Bidirectional Normalizing Flow (**BiFlow**), a new framework that removes the need for an exact analytic inverse by learning a flexible, data-driven reverse model to **approximate** the inverse mapping. | Yiyang Lu; Qiao Sun; Xianbang Wang; Zhicheng Jiang; Hanhong Zhao; Kaiming He; |
| 39 | SURF: Signature-retained Fast Video Generation. Highlight: In this work, we propose **SURF**, an efficient framework for generating high-resolution videos, while maximally keeping the signatures. | Kaixin Ding; Xi Chen; Sihui Ji; Yuan Gao; Liang Hou; Xin Tao; Hengshuang Zhao; |
| 40 | Qwen-Image-Layered: Towards Inherent Editability Via Layer Decomposition. Highlight: In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling inherent editability, where each RGBA layer can be independently manipulated without affecting other content. | Shengming Yin; Zekai Zhang; Zecheng Tang; Kaiyuan Gao; Xiao Xu; Kun Yan; Jiahao Li; Yilei chen; Yuxiang Chen; Heung-Yeung Shum; Lionel Ni; Junyang Lin; Chenfei Wu; |
| 41 | Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective. Highlight: Through controlled perturbations in both RGB and latent domains, we analyze encoder/decoder behaviors and find that decoders depend strongly on high-frequency latent components to recover details, whereas encoders under-represent high-frequency contents, yielding insufficient exposure and underfitting in high-frequency bands for diffusion model training. To address this issue, we introduce FreqWarm, a plug-and-play frequency warm-up curriculum that increases early-stage exposure to high-frequency latent signals during diffusion or flow-matching training, without modifying or retraining the autoencoder. | Bolin Lai; XuDong Wang; Sai Saketh Rambhatla; James Rehg; Zsolt Kira; Rohit Girdhar; Ishan Misra; |
| 42 | Unlocking The Power of Critical Factors for 3D Visual Geometry Estimation. Highlight: Interestingly, per-frame visual geometry estimation approaches typically exhibit weaker multi-frame consistency but demonstrate superior per-frame accuracy compared to multi-frame algorithms. | Guangkai Xu; Hua Geng; Huanyi Zheng; Songyi Yin; Yanlong Sun; Hao Chen; Chunhua Shen; |
| 43 | Proxy3D: Efficient 3D Representations for Vision-Language Models Via Semantic Clustering and Alignment. Highlight: However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. | Jerry Jiang; Haowen Sun; Denis Gudovskiy; Yohei Nakata; Tomoyuki Okuno; Kurt Keutzer; Wenzhao Zheng; |
| 44 | DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning. Highlight: In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework for autonomous driving, performing spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through joint optimization. | Zhe Liu; Runhui Huang; Rui Yang; Siming Yan; Zining Wang; Lu Hou; Di Lin; Xiang Bai; Hengshuang Zhao; |
| 45 | Language Models Can Explain Visual Features Via Steering. Highlight: We leverage the structure of Vision-Language Models and steer individual SAE features in the vision encoder after providing an empty image. | Javier Ferrando; Enrique Lopez-Cuena; Pablo Agustin Martin-Torres; Daniel Hinjos; Anna Arias Duart; Dario Garcia-Gasulla; |
| 46 | R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs Via Bi-Mode Annealing and Reinforce Learning. Highlight: We first introduce bi-mode annealing, a unified training paradigm that constructs a model competent in both reasoning-intensive and direct-answer settings without requiring explicit complexity annotations. Building on this foundation, we propose Bi-mode Policy Optimization (BPO), a lightweight reinforcement learning algorithm that employs a dual-rollout mechanism: for each input, the model generates both thinking and non-thinking responses. | Qi Yang; Bolin Ni; Shiming Xiang; Houwen Peng; |
| 47 | Opening The Sim-to-Real Door for Humanoid Pixel-to-Action Policy Transfer. Highlight: Our approach introduces a staged-reset exploration strategy that stabilizes long-horizon privileged-policy training, and a GRPO-based fine-tuning procedure designed to mitigate partial observability and improve closed-loop consistency in sim-to-real RL. | Haoru Xue; Tairan He; Zi Wang; Qingwei Ben; Wenli Xiao; Zhengyi Luo; Xingye Da; Fernando Castañeda; Guanya Shi; Shankar Sastry; Linxi Fan; Yuke Zhu; |
| 48 | GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation. Highlight: However, existing methods often rely on a single diffusion model to directly map driving actions to videos, which makes learning difficult and leads to physically inconsistent outputs. To overcome these challenges, we propose GenieDrive, a novel framework designed for physics-aware driving video generation. | Zhenya Yang; Zhe Liu; Yuxiang Lu; Liping Hou; Chenxuan Miao; peng siyi; Bailan Feng; Xiang Bai; Hengshuang Zhao; |
| 49 | GDRO: Group-level Reward Post-training Suitable for Diffusion Models. Highlight: 2) For stochasticity, rectified flow is deterministic once the initial noise is fixed. Aiming at these problems and inspired by the effects of group-level rewards from LLMs, we design Group-level Direct Reward Optimization (GDRO). | Yiyang Wang; Xi Chen; Xiaogang Xu; Yu Liu; Hengshuang Zhao; |
| 50 | CGHair: Compact Gaussian Hair Reconstruction with Card Clustering. Highlight: We present a compact pipeline for high-fidelity hair reconstruction from multi-view images. | Haimin Luo; Srinjay Sarkar; Albert Mosella-Montoro; Francisco Vicente Carrasco; Fernando De la Torre; |
| 51 | Hierarchical Action Learning for Weakly-Supervised Action Segmentation. Highlight: Interestingly, we observe that low-level visual and high-level action latent variables evolve at different rates, with low-level visual variables changing rapidly, while high-level action variables evolve more slowly, making them easier to identify. Building on this insight, we propose the Hierarchical Action Learning (**HAL**) model for weakly-supervised action segmentation. | Junxian Huang; Ruichu Cai; Juntao Fang; Hao Zhu; Boyan Xu; Weilin Chen; Zijian Li; Shenghua Gao; |
| 52 | COT-FM: Cluster-wise Optimal Transport Flow Matching. Highlight: We introduce COT-FM, a general framework that reshapes the probability path in Flow Matching (FM) to achieve faster and more reliable generation. | Chiensheng Chiang; Kuan-Hsun Tu; Jia-Wei Liao; Cheng-Fu Chou; Tsung-Wei Ke; |
| 53 | MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation. Highlight: We introduce MeshWeaver, an autoregressive framework that treats mesh generation as a surface weaving process by directly predicting the next vertex instead of independent coordinates. | Jiale Xu; Wang Zhao; Ying Shan; |
| 54 | Weaver: Decoupled Training for Interleaved Multi-modal Generation. Highlight: We introduce Weaver, which frames interleaved generation as an autoregressive planning–visualization process within a unified multi-modal architecture. | Jinbo Xing; Zeyinzi Jiang; Yuxiang Tuo; Chaojie Mao; Xiaotang Gai; Xi Chen; Jingfeng Zhang; Yulin Pan; Zhen Han; Jie Xiao; Keyu Yan; Chen-Wei Xie; Chongyang Zhong; Kai Zhu; Shen Tong; Lianghua Huang; Yu Liu; Yujiu Yang; |
| 55 | VOSR: A Vision-Only Generative Model for Image Super-Resolution. Highlight: To this end, we leverage a pretrained vision encoder to inject semantic cues, and introduce a relaxed unconditional objective that partially uses the low-quality condition to stabilize training. | Rongyuan Wu; Lingchen Sun; Zhengqiang ZHANG; Xiangtao Kong; Jixin Zhao; Shihao Wang; Lei Zhang; |
| 56 | Pointer-CAD: Unifying B-Rep and Command Sequences Via Pointer-based Edges & Faces Selection. Highlight: Further, the discretization of a continuous variable during sketch and extrude operations may result in topological errors. To address these limitations, we present Pointer-CAD, a novel LLM-based CAD generation framework that leverages a pointer-based command sequence representation to explicitly incorporate the geometric information of B-rep models into sequential modeling. | Dacheng Qi; Chenyu Wang; Jingwei Xu; Tianzhe Chu; Zibo Zhao; Wen Liu; Wenrui Ding; Yi Ma; Shenghua Gao; |
| 57 | Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation Across 3-D Constrained Terrains Highlight: This paper presents **Gallant**, a voxel-grid-based framework for humanoid locomotion and local navigation in 3D constrained terrains. |
Qingwei Ben; Botian Xu; Kailin Li; Feiyu Jia; Wentao Zhang; Jingping Wang; Jingbo Wang; Dahua Lin; Jiangmiao Pang; |
| 58 | Enhancing Hands in 3D Whole-Body Pose Estimation with Conditional Hands Modulator Highlight: This difficulty arises from a fundamental supervision gap: whole-body pose estimators are trained on full-body datasets with limited hand diversity, while hand-only estimators, trained on hand-centric datasets, excel at detailed finger articulation but lack global body awareness. To address this, we propose WholeBody++, a modular framework that leverages the strengths of both pre-trained whole-body and hand pose estimators. |
Gyeongsik Moon; |
| 59 | Watch and Learn: Learning to Use Computers from Online Videos Highlight: We present Watch & Learn (W&L), a framework that converts readily available Internet videos of human computer use into executable UI trajectories at scale. |
Chan Hee Song; Yiwen Song; Palash Goyal; Yu Su; Oriana Riva; Hamid Palangi; Tomas Pfister; |
| 60 | EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing Highlight: Moreover, existing offline editing pipelines suffer from high latency, limiting real-time interaction. To address these issues, we present a complete ecosystem for egocentric video editing. |
Runjia Li; Moayed Haji Ali; Ashkan Mirzaei; Chaoyang Wang; Arpit Sahni; Ivan Skorokhodov; Aliaksandr Siarohin; Tomas Jakab; Junlin Han; Sergey Tulyakov; Philip H.S. Torr; Willi Menapace; |
| 61 | DUO-VSR: Dual-Stream Distillation for One-Step Video Super-Resolution Highlight: While distribution matching distillation (DMD) accelerates diffusion models to one-step generation, directly applying it to VSR leads to training instability and degraded results from insufficient supervision. To address these issues, we propose **DUO-VSR**, a three-stage framework centered on a **DU**al-Stream Distillation strategy that integrates distribution matching and adversarial supervision for **O**ne-step VSR. |
Zhengyao Lv; Menghan Xia; Xintao Wang; Kwan-Yee K. Wong; |
| 62 | Learning to Track Instance from Single Nature Language Description Highlight: How can we achieve vision-language (VL) tracking using natural language descriptions from a video sequence **without relying on any bounding-box ground truth**? In this work, we achieve this goal by tackling *self-supervised VL tracking*, which aims to evaluate tracking capabilities guided by natural language descriptions. |
Yaozong Zheng; Bineng Zhong; Qihua Liang; Shuimu Zeng; Haiying Xia; Shuxiang Song; |
| 63 | Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning Highlight: In this work, we propose a novel self-supervised tracking framework that introduces a dual-modal context association mechanism, jointly leveraging fine-grained semantic prompts and contextual noise to drive the model toward learning robust tracking representations. |
Yaozong Zheng; Qihua Liang; Bineng Zhong; Shuimu Zeng; Yuanliang Xue; Ning Li; Shuxiang Song; |
| 64 | LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving Highlight: We show that this asymmetry leads to a significant drop in the performance of the learner. To combat this, we present LEAD, a new high-quality synthetic dataset collected in the CARLA simulator with three key improvements. |
Long Nguyen; Micha Fauth; Bernhard Jaeger; Daniel Dauner; Maximilian Igl; Andreas Geiger; Kashyap Chitta; |
| 65 | Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing Highlight: We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. |
Yusu Qian; Eli Bocek-Rivele; Liangchen Song; Jialing Tong; Yinfei Yang; Jiasen Lu; Wenze Hu; Zhe Gan; |
| 66 | HAVE-Bench: Hierarchical Audio-Visual Evaluation from Perception to Interaction Highlight: To address the limitation that existing benchmarks focus mainly on perception tasks and lack a unified cognitive evaluation framework, we propose the Hierarchical Audio-Visual Evaluation Benchmark (HAVE-Bench). |
Zhong Muyan; Erfei Cui; Sen Xing; Weiyun Wang; Wen Wu; Yuchen Hu; Yanting Zhang; Xiaowei Hu; Wenhai Wang; Chao Zhang; Jifeng Dai; |
| 67 | VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression Highlight: Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. |
Kyle Sargent; Ruiqi Gao; Philipp Henzler; Charles Herrmann; Aleksander Holynski; Li Fei-Fei; Jiajun Wu; Jason Y. Zhang; |
| 68 | Frequency-Aware Flow Matching for High-Quality Image Generation Highlight: As a result, during inference, flow matching models tend to generate low-frequency components (global structure) in the early stages, while high-frequency components (fine details) emerge only later in the reverse process. Building on this insight, we propose Frequency-Aware Flow Matching (FreqFlow), a novel approach that explicitly incorporates frequency-aware conditioning into the flow matching framework via time-dependent adaptive weighting. |
Sucheng Ren; Qihang Yu; Ju He; Xiaohui Shen; Liang-Chieh Chen; |
| 69 | Active Intelligence in Video Avatars Via Closed-loop World Modeling Highlight: Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency—they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. |
Xuanhua He; Tianyu Yang; Ke Cao; Rui-Qi Wu; Meng Cheng; Yong Zhang; Zhuoliang Kang; Xiaoming Wei; Qifeng Chen; |
| 70 | VGGT-Ω Highlight: We present VGGT-Ω, a feed-forward model for 3D reconstruction that substantially advances the state of the art in accuracy, efficiency, and capability for both static and dynamic scenes. |
Jianyuan Wang; Minghao Chen; Shangzhan Zhang; Nikita Karaev; Johannes Schönberger; Patrick Labatut; Piotr Bojanowski; David Novotny; Andrea Vedaldi; Christian Rupprecht; |
| 71 | EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization Highlight: In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. |
Haolan Xu; Keli Cheng; Lei Wang; Ning Bi; Xiaoming Liu; |
| 72 | Any4D: Unified Feed-Forward Metric 4D Reconstruction Highlight: We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. |
Jay Karhade; Nikhil Keetha; Yuchen Zhang; Tanisha Gupta; Akash Sharma; Sebastian Scherer; Deva Ramanan; |
| 73 | Spatial Retrieval Augmented Autonomous Driving Highlight: In contrast, human drivers are able to recall road structure even under poor visibility. To endow models with this recall ability, we propose the spatial retrieval paradigm, introducing offline retrieved geographic images as an additional input. |
Xiaosong Jia; Chenhe Zhang; Yule Jiang; Songbur Wong; Zhiyuan Zhang; chen chen; Shaofeng Zhang; Xuanhe Zhou; Xue Yang; Junchi Yan; Yu-Gang Jiang; |
| 74 | Fast-ThinkAct: Efficient Vision-Language-Action Reasoning Via Verbalizable Latent Planning Highlight: We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. |
Chi-Pin Huang; Yunze Man; Zhiding Yu; Min-Hung Chen; Jan Kautz; Yu-Chiang Frank Wang; Fu-En Yang; |
| 75 | XR-Poser: Accurate Egocentric Human Motion Estimation for AR/VR Highlight: Egocentric 3D human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present XR-Poser, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training. |
Zhenyu Li; Sai Kumar Dwivedi; Filip Maric; Carlos Chacón; Nadine Bertsch; Filippo Arcadu; Tomas Hodan; Michael Ramamonjisoa; Peter Wonka; Amy Zhao; Robin Kips; Cem Keskin; Anastasia Tkach; Chenhongyi Yang; |
| 76 | CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video Highlight: We introduce CubeComposer, a novel spatio-temporal autoregressive diffusion model that natively generates 4K-resolution 360° videos. |
Lingen Li; Guangzhi Wang; Xiaoyu Li; Zhaoyang Zhang; Qi Dou; Jinwei Gu; Tianfan Xue; Ying Shan; |
| 77 | When Token Pruning Is Worse Than Random: Understanding Visual Token Information in VLLMs Highlight: While token pruning offers a promising solution for accelerating inference, this paper identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by **vanishing token information**, where visual tokens progressively lose their salience with increasing network depth. |
Yahong Wang; Juncheng Wu; Zhangkai Ni; Longzhen Yang; Yihang Liu; Chengmei Yang; Ying Wen; Lianghua He; Xianfeng Tang; Hui Liu; Yuyin Zhou; |
| 78 | BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models Highlight: Autoregressive video models are promising for world modeling via next-frame prediction, but they suffer from exposure bias: a mismatch between training on clean contexts and inference on self-generated frames, causing errors to compound and quality to drift over time. We introduce Backwards Aggregation (BAgger), a self-supervised scheme that constructs corrective trajectories from the model’s own rollouts, teaching it to recover from its mistakes. |
Ryan Po; Eric Ryan Chan; Changan Chen; Gordon Wetzstein; |
| 79 | Stabilizing Streaming Video Geometry Via Dynamic Feature Normalization Highlight: Through targeted empirical analysis, we trace this instability to its root cause: fluctuations in latent feature statistics, whose mean and variance directly determine the predicted depth’s scale and shift. Building on this insight, we introduce Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that dynamically and robustly modulates feature statistics to maintain stable geometry over time. |
Xiaoyang Lyu; Muxin Liu; Xiaoshan Wu; Ruicheng Wang; Yihua Huang; Yangtian Sun; Shaoshuai Shi; Xiaojuan Qi; |
| 80 | ZINA: Multimodal Fine-grained Hallucination Detection and Editing Highlight: To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. |
Yuiga Wada; Kazuki Matsuda; Komei Sugiura; Graham Neubig; |
| 81 | DriveLaW: Unifying Planning and Video Generation in A Latent Driving World Highlight: However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. |
Tianze Xia; Yongkang Li; Lijun Zhou; Jingfeng Yao; Kaixin Xiong; Haiyang Sun; Bing Wang; Kun Ma; Guang Chen; Hangjun Ye; Wenyu Liu; Xinggang Wang; |
| 82 | VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations Highlight: We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. |
Maitreya Patel; Jingtao Li; Weiming Zhuang; Yezhou Yang; Lingjuan Lyu; |
| 83 | MMBench-GUI: A Unified Hierarchical Evaluation Framework for Multi-Platform GUI Agents Highlight: We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web. |
Xuehui Wang; Zhenyu Wu; JingJing Xie; Zichen Ding; Bowen Yang; Zehao Li; Zhaoyang Liu; Qingyun Li; Xuan Dong; Zhe Chen; Weiyun Wang; Xiangyu Zhao; Jixuan Chen; Haodong Duan; Tianbao Xie; Chenyu Yang; Shiqian Su; Yue Yu; Yanting Zhang; Xiangyu Yue; Weijie Su; Xizhou Zhu; Wei Shen; Jifeng Dai; Wenhai Wang; |
| 84 | VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement Highlight: Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in the 3D object arrangement task using MLLMs. |
Zhengfei Kuang; Rui Lin; Long Zhao; Gordon Wetzstein; Saining Xie; Sanghyun Woo; |
| 85 | AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend Highlight: We present AMB3R, a multi-view feed-forward model for dense metric-scale 3D reconstruction that addresses diverse 3D vision tasks. |
Hengyi Wang; Lourdes Agapito; |
| 86 | A Frame Is Worth One Token: Efficient Generative World Modeling with Delta Tokens Highlight: To explicitly and efficiently model diverse plausible futures, we introduce DeltaWorld, the first VFM-based world model which shifts from deterministic prediction to the ability to generate multiple plausible futures in a single forward pass. |
Tommie Kerssies; Gabriele Berton; Ju He; Qihang Yu; Wufei Ma; Daan de Geus; Gijs Dubbelman; Liang-Chieh Chen; |
| 87 | Scaling View Synthesis Transformers Highlight: In this work, we conduct a rigorous analysis of the scaling laws for view synthesis transformers and elucidate a series of design choices for training compute-optimal NVS models. |
Evan Kim; Hyunwoo Ryu; Thomas W. Mitchel; Vincent Sitzmann; |
| 88 | Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning Highlight: While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. |
Haoji Zhang; Xin Gu; Jiawen Li; Chixiang Ma; Sule Bai; Chubin Zhang; bowen zhang; zhichao zhou; Dongliang He; Yansong Tang; |
| 89 | Hyperbolic Gramian Volumes for Multimodal Alignment Highlight: We introduce HyperGRAM, a hybrid geometry framework that combines Euclidean discriminative stability with hyperbolic semantic variance through learnable mixing. |
Saiyang Na; Feng Jiang; Qifeng Zhou; Wenliang Zhong; Thao Dang; Yuzhi Guo; Hehuan Ma; Chunyuan Li; Weizhi An; Junzhou Huang; |
| 90 | MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second Highlight: We present MoVieS, a motion-aware view synthesis model that reconstructs 4D dynamic scenes from monocular videos in one second. |
Chenguo Lin; Yuchen Lin; Panwang Pan; Yifan Yu; Tao Hu; Honglei Yan; Katerina Fragkiadaki; Yadong Mu; |
| 91 | One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers Highlight: We introduce Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. |
Moayed Haji Ali; Willi Menapace; Ivan Skorokhodov; Dogyun Park; Anil Kag; Michael Vasilkovsky; Sergey Tulyakov; Vicente Ordonez; Aliaksandr Siarohin; |
| 92 | Structural–Semantic Perception for Diffusion-Guided Temporal Forgery Localization Highlight: However, existing methods face two limitations: (1) localization precision, where one-shot boundary prediction models fail to rectify inherent initial prediction biases, and temporal emphasis overlooks modality-internal semantic forgery cues, resulting in noise-sensitive localization, and (2) cross-dataset generalization, where fixed-scale temporal receptive fields struggle to accommodate varying manipulation durations across real-world scenarios. To address these challenges, we propose a unified framework based on structural–semantic perception and diffusion-guided refinement. |
Ligong Cao; Yeting Guo; Haoang Chi; |
| 93 | AToken: A Unified Tokenizer for Vision Highlight: To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. |
Jiasen Lu; Liangchen Song; Mingze Xu; Byeongjoo Ahn; Yanjun Wang; Chen Chen; Afshin Dehghan; Yinfei Yang; |
| 94 | VSRELL: A Simple Baseline for Video Super-Resolution and Enhancement in Low-Light Environment Highlight: We propose an integrated learning scheme of Video Super-Resolution and Enhancement in Low-Light environment, named VSRELL, which aims to recover a Well-Illuminated High-Resolution (WIHR) sequence from its Low-Light Low-Resolution (LLLR) counterpart. |
Yanming hui; Fanhua Shang; Hongying Liu; Ben Wang; Zhenwei Zhang; Liang Wan; Wei Feng; Tong Xue; Bingqin Lv; |
| 95 | VisionLeaf: Entropy-Guided Leaf-First Reasoning for Efficient and Accurate Think-with-Image Highlight: This challenge primarily arises from the direct use of standard reinforcement learning policies, which do not incorporate improvements for the think-with-image multi-turn conversational scenario. To address this challenge, we propose VisionLeaf, an entropy-guided, tree-based reasoning framework. |
Haokun GUI; Senqiao Yang; Mingkang Zhu; Meng Chu; WU Sitong; Changsheng Lu; Zihao Wang; Zhuotao Tian; Jiaya Jia; |
| 96 | Attend Before Attention: Efficient and Scalable Video Understanding Via Autoregressive Gazing Highlight: We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. |
Baifeng Shi; Stephanie Fu; Long Lian; Hanrong Ye; David Eigen; Aaron Reite; Jan Kautz; Boyi Li; David Chan; Trevor Darrell; Pavlo Molchanov; Danny Yin; |
| 97 | BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement Highlight: To this end, we propose BiEvLight, a hierarchical and task-aware framework that collaboratively optimizes enhancement and denoising by exploiting their intrinsic interdependence. |
Zishu Yao; Xiang-Xiang Su; Shengning Zhou; Guang-Yong Chen; Guodong Fan; Xing Chen; |
| 98 | SpatialTree: How Spatial Intelligence Branches Out in MLLMs Highlight: However, how these abilities are acquired, emerge, and transferred remains largely unknown. To investigate this, we propose SpatialTree, a hierarchical taxonomy that organizes SI into a capability tree—from low-level perception (L1), mental mapping (L2), and mental simulation (L3), to agentic competence (L4). |
Yuxi Xiao; longfei li; Shen Yan; Xinhang Liu; Sida Peng; Yunchao Wei; Xiaowei Zhou; Bingyi Kang; |
| 99 | ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos Highlight: In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. |
Yuantao Chen; Jiahao Chang; Chongjie Ye; Chaoran Zhang; Zhaojie Fang; Chenghong Li; Xiaoguang Han; |
| 100 | Unified Customized Generation By Disentangled Reward Modeling Highlight: To this end, we introduce **USO**, a **U**nified **S**imultaneous **O**ptimization framework to simultaneously unify different customized tasks (i.e., subject and style). |
Shaojin Wu; Mengqi Huang; Yufeng Cheng; wenxu wu; Jiahe Tian; Yiming Luo; Fei Ding; Qian HE; |
| 101 | Scaling Parallel Sequence Models to Vision Foundation Models Highlight: We introduce Compact GSPN (C-GSPN), a ViT block that compresses the propagation space to preserve accuracy while cutting propagation latency by nearly 10×. |
Yitong Jiang; Collin McCarthy; Hongjun Wang; Hanrong Ye; Qi Dou; Tianfan Xue; Jinwei Gu; Jan Kautz; Danny Yin; Pavlo Molchanov; Sifei Liu; |
| 102 | Paper2Figure: A Multi-Agent Collaborative System for Figure Generation Towards Academic Research Paper Highlight: We present Paper2Figure, a dual multi-agent system with an interactive web platform for paper-to-figure generation. |
Siwei Han; Haonian Ji; Siyang Xin; Juanquan Shi; Shi Qiu; Xinyu Ye; Peng Xia; Jiaqi Liu; Zhaorun Chen; Yiyang Zhou; Linjie Li; Lijuan Wang; Huaxiu Yao; |
| 103 | Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs Highlight: This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. |
Sicheng Xu; Yu Deng; Shoukang Hu; Yichuan Wang; Yizhong Zhang; Zhan Chen; Jiaolong Yang; Baining Guo; |
| 104 | From Where Things Are to What They Are For: Benchmarking Spatial–Functional Intelligence in Multimodal LLMs Highlight: While existing benchmarks effectively evaluate the foundational geometric perception capabilities of multimodal LLMs, they fall short of probing the higher-order cognitive abilities essential for grounded intelligence. To bridge this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1500 expert-annotated questions derived from diverse, egocentric indoor video scans. |
Le Zhang; Jihan Yang; Soundarya Krishnan; Jimit Majmudar; Xiou Ge; Prasoon Puri; Prathamesh Saraf; Shruti Bhargava; Dhivya Piraviperumal; Yinan Ling; Cindy Pan; Hong Yu; Aishwarya Agrawal; Bo-Hsiang Tseng; |
| 105 | MimiCAT: Mimic with Correspondence-Aware Cascade-Transformer for Category-Free 3D Pose Transfer Abstract: 3D pose transfer aims to transfer the pose-style of a source mesh to a target character while preserving both the target’s geometry and the source’s pose characteristic. Existing … |
Zenghao Chai; Chen Tang; Yongkang Wong; Xulei Yang; Mohan Kankanhalli; |
| 106 | Vista4D: Video Reshooting with 4D Point Clouds Highlight: We present **Vista4D**, a robust and flexible video reshooting framework that grounds the input video and target cameras in a 4D point cloud. |
Kuan Heng Lin; Zhizheng Liu; Pablo Salamanca; Yash Kant; Ryan Burgert; Yuancheng Xu; Koichi Namekata; Yiwei Zhao; Bolei Zhou; Micah Goldblum; Paul Debevec; Ning Yu; |
| 107 | OpenMMReasoner: Pushing The Frontiers in Multimodal Reasoning with An Open and General Recipe Highlight: In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). |
Kaichen Zhang; Keming Wu; Zuhao Yang; Bo Li; Kairui Hu; Bin Wang; Xingxuan Li; Lidong Bing; |
| 108 | Transition Matching Distillation for Fast Video Generation Highlight: In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. |
Weili Nie; Julius Berner; Nanye Ma; Chao Liu; Saining Xie; Arash Vahdat; |
| 109 | EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation Highlight: Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we introduce **EVATok**, a framework to produce **E**fficient **V**ideo **A**daptive **Tok**enizers. |
Tianwei Xiong; Jun Hao Liew; Zilong Huang; Zhijie Lin; Jiashi Feng; Xihui Liu; |
| 110 | WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing Highlight: Yet, existing benchmarks provide too narrow a scope for evaluation, failing to holistically assess these advanced abilities. To address this, we introduce WiseEdit, a knowledge-intensive benchmark for comprehensive evaluation of cognition- and creativity-informed image editing, featuring substantial task depth and broad knowledge breadth. |
Kaihang Pan; Weile Chen; Haiyi Qiu; Qifan Yu; Wendong Bu; zehan wang; Yun Zhu; Juncheng Li; Siliang Tang; |
| 111 | BinaryAttention: One-Bit Attention for Vision and Diffusion Transformers Highlight: In this paper, with theoretical justification, we indicate that binarization of attention preserves the essential similarity relationships, and propose BinaryAttention, an effective method for fast and accurate 1-bit attention. |
Chaodong XIAO; Zhengqiang ZHANG; Lei Zhang; |
| 112 | ViT$^3$: Unlocking Test-Time Training in Vision Highlight: However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. |
Dongchen Han; Yining Li; Tianyu Li; Zixuan Cao; Ziming Wang; Jun Song; YuCheng YuCheng; Bo Zheng; Gao Huang; |
| 113 | PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing Highlight: Multi-image queries perform 40 to 70% worse across different methods. To overcome these new issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. |
Rohan Mahadev; Joyce Yuan; Patrick Poirson; David Xue; Hao-Yu Wu; Dmitry Kislyuk; |
| 114 | DNF-SR: Dual-Input and Negative-Aware Feature Fine-Tuning for Real-World Image Super-Resolution Highlight: In this study, we propose **DNF-SR**, a **D**ual-input and **N**egative-aware **F**eature fine-tuning method for Real-ISR. |
Shuhao Han; Wenjie Liao; Haotian Fan; Hang Dong; Rui Zhang; Chun-Le Guo; Chongyi Li; |
| 115 | MiniCPM-V 4.5: Cooking Efficient MLLMs Via Architecture, Data, and Training Recipe Highlight: To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. |
Tianyu Yu; Zefan Wang; Chongyi Wang; Fuwei Huang; Wenshuo Ma; Zhihui He; Tianchi Cai; Weize Chen; Yuxiang Huang; Ranchi Zhao; Bokai Xu; Junbo Cui; Yingjing Xu; Liqing Ruan; Luoyuan Zhang; Hanyu Liu; Jingkun Tang; Hongyuan Liu; Qining Guo; Wenhao Hu; Bingxiang He; Jie Zhou; Jie Cai; Ji Qi; Zonghao Guo; Chi Chen; Guoyang Zeng; Yuxuan Li; Ganqu Cui; Ning Ding; Xu Han; Yuan Yao; Zhiyuan Liu; Maosong Sun; |
| 116 | Self-Consistency for LLM-based Motion Trajectory Generation and Verification Highlight: In this work, we study how to adapt self-consistency to visual domains; specifically, we consider the generation and verification of LLM-produced motion graphics trajectories. |
Jiaju Ma; R. Kenny Jones; Jiajun Wu; Maneesh Agrawala; |
| 117 | UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation Highlight: This stems from insufficient cross-modal interaction and limited modal diversity for comprehensive world knowledge representation. To address these limitations, we introduce UnityVideo, a unified framework for world-aware video generation that jointly learns across multiple modalities—segmentation masks, human skeletons, DensePose, optical flow, and depth maps—and training paradigms. |
Jiehui Huang; Yuechen Zhang; Xu He; Yuan Gao; Zhi Cen; Bin Xia; Yan Zhou; Xin Tao; Pengfei Wan; Jiaya Jia; |
| 118 | ArchSym: Detecting 3D-Grounded Architectural Symmetries in The Wild Highlight: Furthermore, due to the inherent scale ambiguity of monocular inputs, which makes localizing the 3D plane an ill-posed problem, many existing works only predict the plane’s orientation. In this paper, we address these limitations by presenting the first framework for detecting *3D-grounded reflectional symmetries* from single, in-the-wild RGB images, focusing on architectural landmarks. |
Hanyu Chen; Ruojin Cai; Steve Marschner; Noah Snavely; |
| 119 | Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images Highlight: Recent self-supervised methods attempt to unify geometry, appearance, and semantics in a feed-forward manner, but they often suffer from weak geometry induction, limited appearance detail, and inconsistencies between geometry and semantics. We introduce **UniSplat**, a feed-forward framework designed to address these limitations through three complementary components. |
bo zhou; Qiuxia Lai; Zeren Sun; Xiangbo Shu; Yazhou Yao; Wenguan Wang; |
| 120 | InterPrior: A Scalable Motion Prior for Physics-Based Human-Object Interactions Highlight: To this end, we introduce InterPrior, a scalable framework that learns a unified control policy, i.e., an interaction motion prior, through large-scale imitation pretraining and reinforcement learning post-training. |
Sirui Xu; Samuel Schulter; Morteza Ziyadi; Xialin He; Xiaohan Fei; Yu-Xiong Wang; Liangyan Gui; |
| 121 | LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World Highlight: This makes them brittle and prone to fail on dynamic egocentric captures. We propose LAMP (**L**ocalization **A**ware **M**ulti-camera **P**eople Tracking): a novel, simple framework to solve this via early disentanglement of observer and target motion. |
Nan Yang; Julian Straub; Fan Zhang; Richard Newcombe; Jakob Engel; Lingni Ma; |
| 122 | LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight Highlight: We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. |
Yunze Man; Shihao Wang; Guowen Zhang; Johan Bjorck; Liangyan Gui; Linxi Fan; Jan Kautz; Yu-Xiong Wang; Zhiding Yu; |
| 123 | Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification Highlight: Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce **AuditDM**, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. |
Qihao Liu; Chengzhi Mao; Yaojie Liu; Alan L. Yuille; Wen-Sheng Chu; |
| 124 | PhyOceanCast: Global Ocean Forecasting with Physics-Informed Diffusion Highlight: Recent deep learning models have achieved notable success, but still face three fundamental challenges: (1) they homogenize ocean variables despite strong physical coupling via equation-of-state relationships; (2) they neglect spherical geometry, resulting in severe distortions at high latitudes; and (3) they struggle to model multi-scale temporal dynamics. We introduce PhyOceanCast, a physics-informed diffusion model that overcomes these limitations through two key innovations. |
Qixiu Li; Xiang Zhu; Xiaoyong Li; Xiaolong Xu; |
| 125 | ORBIT: Benchmarking SfM in The Wild with 360° Video Highlight: Our key insight is to leverage online panoramic 360° video as a source of data from which to construct challenging clips, while still enabling robust ground-truth trajectory recovery. |
Sara Sabour; Richard Tucker; Marcus A. Brubaker; Saurabh Saxena; Junhwa Hur; Andrea Tagliasacchi; Deqing Sun; David J. Fleet; Richard Szeliski; Noah Snavely; |
| 126 | AGENTSAFE: Benchmarking The Safety of Embodied Agents on Hazardous Instructions Highlight: Current safety evaluation benchmarks remain limited: they cover only narrow scopes of hazards and focus primarily on final outcomes, neglecting the agent’s full perception-planning-execution process and thereby obscuring critical failure modes. Therefore, we present SAFE, a benchmark for systematically assessing the safety of embodied VLM agents on hazardous instructions. |
Zonghao Ying; Le Wang; Yisong Xiao; Jiakai Wang; Yuqing Ma; Jinyang Guo; Zhenfei Yin; Mingchuan Zhang; Aishan Liu; Xianglong Liu; |
| 127 | CTCal: Rethinking Text-to-Image Diffusion Models Via Cross-Timestep Self-Calibration Highlight: In this paper, we introduce Cross-Timestep Self-Calibration (CTCal), founded on the supporting observation that establishing accurate text-image alignment within diffusion models becomes progressively more difficult as the timestep increases. |
Xiefan Guo; Xinzhu Ma; Haiyu Zhang; Di Huang; |
| 128 | BulletTime: Decoupled Control of Time and Camera Pose for Video Generation Highlight: We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. |
Yiming Wang; Qihang Zhang; Shengqu Cai; Tong Wu; Jan Ackermann; Zhengfei Kuang; Yang Zheng; Frano Rajič; Siyu Tang; Gordon Wetzstein; |
| 129 | LitePT: Lighter Yet Stronger Point Transformer Highlight: We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. |
Yuanwen Yue; Damien Robert; Jianyuan Wang; Sunghwan Hong; Jan D. Wegner; Christian Rupprecht; Konrad Schindler; |
| 130 | Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation Highlight: Existing policies condition solely on the current observation, neglecting the constraints of the history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). |
Zaijing Li; Bing Hu; Rui Shao; Gongwei Chen; Dongmei Jiang; Pengwei Xie; Jianye Hao; Liqiang Nie; |
| 131 | Seeing Through The Noise: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective Highlight: In this paper, through analyzing the problem from the frequency domain, we pioneer improving performance from a noise-suppression perspective and propose a novel noise-suppression feature pyramid network (NS-FPN), which integrates a low-frequency guided feature purification (LFP) module and a spiral-aware feature sampling (SFS) module into the original FPN structure. |
Maoxun Yuan; Duanni Meng; Ziteng Xi; Tianyi Zhao; Shiji Zhao; Yimian Dai; Xingxing Wei; |
| 132 | General Process Reward Modeling for Robotic Reinforcement Learning Highlight: To address these, we introduce Robo-Dopamine, a novel reward modeling method for learning a general-purpose, step-aware process reward model from multi-view inputs. |
Huajie Tan; Sixiang Chen; Yijie Xu; Zixiao Wang; Cheng Chi; Yuheng Ji; Yaoxu Lyu; Zhongxia Zhao; Xiansheng Chen; Peterson Co; Shaoxuan Xie; Guocai Yao; Pengwei Wang; Zhongyuan Wang; Shanghang Zhang; |
| 133 | Action-Sketcher: From Reasoning to Action Via Visual Sketches for Robotic Manipulation Highlight: Building on *Visual Sketch*, we present **Action-Sketcher**, a VLA framework that operates in a cyclic *See → Think → Sketch → Act* workflow coordinated by an adaptive token-gated strategy for reasoning triggers, sketch revision, and action issuance, thereby supporting reactive corrections and human interaction while preserving real-time action prediction. |
Huajie Tan; Peterson Co; Yijie Xu; Shanyu Rong; Yuheng Ji; Cheng Chi; Xiansheng Chen; Zhongxia Zhao; Pengwei Wang; Zhongyuan Wang; Shanghang Zhang; |
| 134 | Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching Highlight: Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rates. |
Bowen Wen; Shaurya Dewan; Stan Birchfield; |
| 135 | PV-Ground: Text-Guided Point-Voxel Interaction for 3D Visual Grounding Highlight: This paper proposes PV-Ground, a novel 3D VG architecture based on effective text-guided point-voxel feature interaction. |
Junpeng Shang; Feifei Shao; Jun Xiao; Lin Li; Hongwei Wang; Dongfang Ma; |
| 136 | GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing Highlight: Beyond evaluation, we propose GeoMMAgent, a multi-agent framework that strategically integrates retrieval, perception, and reasoning through domain-specific RS models and tools. |
Aoran Xiao; Shihao Cheng; Yonghao Xu; Yexian Ren; Hongruixuan Chen; Naoto Yokoya; |
| 137 | UniLS: End-to-End Audio-Driven Avatars for Unified Listening and Speaking Highlight: This design is not end-to-end, thereby hindering real-time applicability. To address this limitation, we present UniLS, the first end-to-end framework for generating unified speak-listen expressions, driven by only dual-track audio. |
Xuangeng Chu; Ruicong Liu; Yifei Huang; Yun Liu; YICHEN PENG; Bo Zheng; |
| 138 | Plan, Imagine, Then Act: Steering Your VLA with Efficient Visually Grounded Planning Highlight: We present *Visually Grounded Planning*, a general and efficient high-level planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. |
Zhuoyang Zhang; Shang Yang; Qinghao Hu; Luke Huang; James Hou; Yufei Sun; Yao Lu; Song Han; |
| 139 | Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM Highlight: In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. |
Junyuan Mao; Qiankun Li; Linghao Meng; Zhicheng He; Xinliang Zhou; Kun Wang; Yang Liu; Yueming Jin; |
| 140 | Dual Ascent Diffusion for Inverse Problems Highlight: Existing maximum-a-posteriori (MAP) or posterior sampling approaches, however, rely on different computational approximations, leading to inaccurate or suboptimal samples. To address this issue, we introduce a new approach to solving MAP problems with diffusion model priors using a dual ascent optimization framework. |
Minseo Kim; Axel Levy; Gordon Wetzstein; |
| 141 | Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM Highlight: In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. |
Yunsong Wang; Gim Hee Lee; |
| 142 | Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image Highlight: We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. |
Yushi Hu; Reyhane Askari; Melissa Hall; Emily Dinan; Luke Zettlemoyer; Marjan Ghazvininejad; |
| 143 | HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from A Single Image Highlight: In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. |
Hezhen Hu; Wangbo Zhao; Lanqing Guo; Hanwen Jiang; Jonathan Liu; Zhiwen Fan; Kai Wang; Zhangyang Wang; Georgios Pavlakos; |
| 144 | Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video Highlight: We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. |
Zeren Jiang; Chuanxia Zheng; Iro Laina; Diane Larlus; Andrea Vedaldi; |
| 145 | DiP: Taming Diffusion Models in Pixel Space Highlight: In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. |
Zhennan Chen; junwei zhu; Xu Chen; Jiangning Zhang; Xiaobin Hu; Hanzhen Zhao; Chengjie Wang; Jian Yang; Ying Tai; |
| 146 | OmniDocLayout: Towards Diverse Document Layout Generation Via Coarse-to-Fine LLM Learning Highlight: Extensive experiments demonstrate that our approach achieves strong performance across multiple domains in the M⁶Doc dataset, substantially surpassing both existing layout generation experts and several latest general-purpose LLMs. |
Hengrui Kang; Zhuangcheng Gu; Zhiyuan Zhao; Zichen Wen; Bin Wang; Weijia Li; Conghui He; |
| 147 | JarvisEvo: Towards A Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization Highlight: However, two critical challenges persist: (1) instruction hallucination—text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors due to inherent information bottlenecks; (2) reward hacking—dynamic policy optimization against static reward models allows agents to exploit flaws in reward functions. To address these issues, we propose JarvisEvo, a unified image editing agent that emulates an expert human designer by iteratively editing, selecting appropriate tools, evaluating results, and reflecting on its own decisions to refine outcomes. |
yunlong lin; Linqing Wang; Kunjie Lin; Zixu Lin; Kaixiong Gong; Wenbo Li; Bin Lin; Zhenxi Li; Shiyi Zhang; Yuyang Peng; Wenxun Dai; Xinghao Ding; Chunyu Wang; qinglin lu; |
| 148 | FlexiVideo: Variation-Aware Temporal Dynamics Modeling for Efficient Video Understanding Highlight: The recent state-of-the-art model, i.e., Qwen2.5-VL, adopts a fixed two-frame encoding scheme, but our pilot experiments indicate that it encounters a visual confusion problem under high-dynamic frame pairs. To address this issue, we propose FlexiVideo, an efficient MLLM that models temporal dynamics by leveraging visual variation. |
Da Peng; Xuesong Yang; Zonghao Guo; Yichen Zhang; Chi Chen; Yidan Zhang; Yuan Yao; Fang Wan; Wei Ke; Maosong Sun; |
| 149 | CHIRP Dataset: Towards Long-term, Individual-level, Behavioural Monitoring of Bird Populations in The Wild Highlight: This stems from the lack of datasets that cover a range of computer vision tasks necessary to extract biologically meaningful measurements of individual animals. Here, we introduce such a dataset (CHIRP) with a new method (CORVID) for individual re-identification of wild birds. |
Alex Hoi Hang Chan; Neha Singhal; Onur Kocahan; Andrea Meltzer; Saverio Lubrano; Miya Warrington; Michael Griesser; Fumihiro Kano; Hemal Naik; |
| 150 | Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from A Single Image Highlight: We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. |
Joohyun Kwon; Geonhee Sim; Gyeongsik Moon; |
| 151 | R2G: A Multi-View Circuit Graph Benchmark Suite from RTL to GDSII Highlight: We present R2G (RTL-to-GDSII), a standardized benchmark and framework that converts DEF files into typed, heterogeneous, information-preserving circuit graphs and supports node- and edge-level tasks in placement and routing. |
ZEWEI ZHOU; Jiajun Zou; Jiajia Zhang; Ao Yang; Ruichao He; Haozheng Zhou; Ao Liu; Jiawei Liu; Leilei Jin; Shan Shen; Daying Sun; |
| 152 | MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation Highlight: To address these challenges, we propose a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences. Building upon this foundation, we introduce MakeAnything, a framework based on the diffusion transformer (DiT), which leverages fine-tuning to activate the in-context capabilities of DiT for generating consistent procedural sequences. |
Yiren Song; Cheng Liu; Mike Zheng Shou; |
| 153 | OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning Highlight: This paper simplifies OpenVision’s architecture and loss design to enhance its training efficiency. |
Yanqing Liu; Xianhang li; Letian Zhang; Zirui Wang; Zeyu Zheng; Yuyin Zhou; Cihang Xie; |
| 154 | Uncertainty-driven 3D Gaussian Splatting Active Mapping Via Anisotropic Visibility Field Highlight: Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS. To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility of each particle with respect to the training views, and represented using spherical harmonics. |
Shangjie Xue; Jesse Dill; Dhruv Ahuja; Frank Dellaert; Panagiotis Tsiotras; Danfei Xu; |
| 155 | Intrinsic Image Fusion for Multi-View 3D Material Reconstruction Highlight: We introduce Intrinsic Image Fusion, a method that reconstructs high-quality physically based materials from multi-view images. |
Peter Kocsis; Lukas Höllein; Matthias Nießner; |
| 156 | Efficient Hybrid SE(3)-Equivariant Visuomotor Flow Policy Via Spherical Harmonics for Robot Manipulation Highlight: In this work, we propose E3Flow, a novel framework that addresses the critical limitations of equivariant diffusion policies. |
Qinglun Zhang; Shen Cheng; Tian Dan; Haoqiang Fan; Guanghui Liu; Shuaicheng Liu; |
| 157 | OmniGen2: Towards Instruction-Aligned Multimodal Generation Highlight: We introduce **OmniGen2**, a unified multimodal generator designed to follow complex, fine-grained instructions. |
Chenyuan Wu; Jiahao Wang; PengFei Zheng; Ruiran Yan; Shitao Xiao; Xin Luo; Yueze Wang; Wanli Li; Xiyan Jiang; Yexin Liu; Junjie Zhou; Ziyi Xia; Ze Liu; Chaofan Li; Haoge Deng; Kun Luo; Bo Zhang; Jiajun Zhang; Dong Liu; Defu Lian; Xinlong Wang; Zhongyuan Wang; Tiejun Huang; Zheng Liu; |
| 158 | Taming Generative Diffusion Model for Task-Oriented Infrared Imaging Highlight: We present a unified diffusion framework that re-formulates IR restoration as a single-step generative process. |
Tengyu Ma; Zhilong Dai; Yubo Diao; Guanming An; Long Ma; Jinyuan Liu; Risheng Liu; |
| 159 | ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps Highlight: However, their proficiency in tasks requiring both fine-grained visual understanding and spatial reasoning remains underexplored. To bridge this gap, we introduce ReasonMap, a novel benchmark specifically designed to evaluate these capabilities. |
Sicheng Feng; Song Wang; Shuyi Ouyang; Lingdong Kong; Zikai Song; Jianke Zhu; Huan Wang; Xinchao Wang; |
| 160 | Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration Highlight: However, GT can still contain images with inconsistent perceptual fidelity, causing models to converge to the average quality level of the training data rather than achieving the highest perceptual quality attainable. To address these problems, we propose a novel framework, termed **IQPIR**, that introduces an Image Quality Prior (IQP)—extracted from pre-trained No-Reference Image Quality Assessment (NR-IQA) models—to explicitly guide the restoration process toward perceptually optimal outputs. |
Fengyang Xiao; Peng Hu; Lei Xu; XingE Guo; Guanyi Qin; Yuqi Shen; Chengyu Fang; Rihan Zhang; Chunming He; Sina Farsiu; |
| 161 | Learning to Assist: Physics-Grounded Human-Human Control Via Multi-Agent Reinforcement Learning Highlight: In this paper, we formulate the imitation of closely interacting, force-exchanging human–human motion sequences as a multi-agent reinforcement learning problem. |
Yuto Shibata; Kashu Yamazaki; Lalit Jayanti; Yoshimitsu Aoki; Mariko Isogawa; Katerina Fragkiadaki; |
| 162 | FlashLips: 100-FPS Mask-Free Latent Lip-Sync Using Reconstruction Instead of Diffusion or GANs Highlight: We present FlashLips, a two-stage, mask-free lip-sync system that decouples lips control from rendering and achieves real-time performance running at over 100 FPS on a single GPU, while matching the visual quality of larger state-of-the-art models. |
Andreas Zinonos; Michał Stypułkowski; Antoni Bigata Casademunt; Stavros Petridis; Maja Pantic; Nikita Drobyshev; |
| 163 | WEAVE: Unleashing and Benchmarking The In-context Interleaved Comprehension and Generation Highlight: However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. |
Wei Chow; Jiachun Pan; Yongyuan Liang; Mingze Zhou; Xue Song; Liyu Jia; Saining Zhang; Siliang Tang; Juncheng Li; Fengda Zhang; Weijia Wu; Hanwang Zhang; Tat-seng Chua; |
| 164 | Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens Highlight: To this end, we present a Machine Mental Imagery framework, dubbed “Mirage”, which augments VLM decoding with latent visual tokens alongside ordinary text. |
Zeyuan Yang; Xueyang Yu; Delin Chen; Maohao Shen; Chuang Gan; |
| 165 | PhyCritic: Multimodal Critic Models for Physical AI Highlight: However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. |
Tianyi Xiong; Shihao Wang; Guilin Liu; Yi Dong; Ming Li; Heng Huang; Jan Kautz; Zhiding Yu; |
| 166 | Unified Multimodal Models As Auto-Encoders Highlight: In this paper, we argue that both tasks can be connected under a shared Auto-Encoder perspective, where text serves as the intermediate latent representation, bridging the two directions — encoding images into textual semantics (I2T) and decoding text back into images (T2I). |
Zhiyuan Yan; Kaiqing Lin; Zongjian Li; Junyan Ye; Hui Han; Haochen Wang; Zhendong Wang; Bin Lin; Li Hao; Xinyan Xiao; Jingdong Wang; Haifeng Wang; Li Yuan; |
| 167 | Adaptive Depth Lightweight RGB-T Tracking with Holistic Token Routing Highlight: This computational burden constrains real-time performance and limits scalability beyond high-end GPUs. To balance accuracy and efficiency, we propose Adaptive Early-Exit (AEE): we augment the backbone with anytime heads and pair them with a confidence-calibrated early-exit policy that halts inference at the earliest reliable layer, skipping redundant computation. |
Tian Ding; Hongtao Yang; Liangtao Shi; Jun Li; Xiantao Hu; Jian Yang; Ying Tai; |
| 168 | TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast Highlight: This work presents a generalizable framework to transfer relative depth to metric depth. |
Beilei Cui; Yiming Huang; Long Bai; Hongliang Ren; |
| 169 | Flowception: Temporally Expansive Flow Matching for Video Generation Highlight: We present Flowception, a novel non-autoregressive and variable-length video generation framework. |
Tariq Berrada; John Nguyen; Karteek Alahari; Jakob Verbeek; Ricky T. Q. Chen; |
| 170 | MERLIN: Building Low-SNR Robust Multimodal LLMs for Electromagnetic Signals Highlight: A critical fragility arises in low Signal-to-Noise Ratio (SNR) environments, where critical signal features can be obscured, leading to significant performance degradation. To address these challenges, we introduce a tripartite contribution to establish a foundation for MLLMs in the EM domain. |
Junyu Shen; Zhendong She; Chenghanyu Zhang; Yuchuang Sun; Luqing Luo; Dingwei Tan; Zonghao Guo; Bo Guo; Zehua Han; Wupeng Xie; Yaxin Mu; Peng Zhang; Pei Pei Li; Fengxiang Wang; Yangang Sun; Maosong Sun; |
| 171 | SAMTok: Representing Any Mask with Two Words Highlight: However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To solve these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two textual special tokens and reconstructs masks from these tokens with high fidelity. |
yikang zhou; Tao Zhang; Dengxian Gong; Yuanzheng Wu; Ye Tian; Haochen Wang; Haobo Yuan; Jiacong Wang; Lu Qi; Hao Fei; Shunping Ji; Anran Wang; Zhuochen Wang; Yujing Wang; Cheng CHEN; Xiangtai Li; |
| 172 | V-DPM: Video Reconstruction with Dynamic Point Maps Highlight: We argue that DPMs are far more meaningful when applied to videos and introduce V-DPM to demonstrate this. |
Edgar Sucar; Eldar Insafutdinov; Zihang Lai; Andrea Vedaldi; |
| 173 | HoneyBee: Data Recipes for Vision-Language Reasoners Highlight: In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. |
Hritik Bansal; Devendra Singh Sachan; Kai-Wei Chang; Aditya Grover; Gargi Ghosh; Wen-tau Yih; Ramakanth Pasunuru; |
| 174 | Spherical Leech Quantization for Visual Tokenization and Generation Highlight: In this paper, we present a unified formulation of different non-parametric quantization methods through the lens of lattice coding. |
Yue Zhao; Hanwen Jiang; Zhenlin Xu; Chutong Yang; Ehsan Adeli; Philipp Krähenbühl; |
| 175 | ThinkGen: Generalized Thinking for Visual Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM’s CoT reasoning in various generation scenarios. |
Siyu Jiao; Yiheng Lin; Yujie Zhong; Qi She; Wei zhou; Xiaohan Lan; Zilong Huang; Fei Yu; Yingchen Yu; Yunqing Zhao; Yao Zhao; Yunchao Wei; |
| 176 | UniVerse: Empower Unified Generation with Reasoning and Knowledge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current text-to-image (T2I) generation models often struggle with prompts that require complex reasoning or specialized knowledge, failing to accurately interpret implicit user intent. To bridge this gap, we introduce **T2I-Reason**, a large-scale dataset designed to empower text-to-image generation in unified multimodal models (UMMs) with reasoning and knowledge. |
Kaiyue Sun; Weiyang Jin; Chengqi Duan; Rongyao Fang; Xian Liu; Yuwei Niu; Chunwei Wang; Aoxue Li; Xihui Liu; |
| 177 | TrajTok: Learning Trajectory Tokens Enables Better Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While the recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex, external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. |
Chenhao Zheng; Jieyu Zhang; Jianing Zhang; Weikai Huang; Ashutosh Kumar; Quan Kong; Oncel Tuzel; Chun-Liang Li; Ranjay Krishna; |
| 178 | FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing video distillation methods successfully distill multi-step generators into few-step ones, directly applying these approaches to trajectory-controllable video generation results in noticeable degradation in both video quality and trajectory accuracy. To bridge this gap, we introduce FlashMotion, a novel training framework designed for few-step trajectory-controllable video generation. |
Quanhao Li; Zhen Xing; Rui Wang; Haidong Cao; Qi Dai; Daoguo Dong; Zuxuan Wu; |
| 179 | ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose ColaVLA, a unified vision–language–action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. |
Qihang Peng; Xuesong Chen; Chenye Yang; Shaoshuai Shi; Hongsheng Li; |
| 180 | Expanding MmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose EMDUL, a novel approach to expand the volume and diversity of an existing mmWave dataset using unlabeled mmWave data and a LiDAR dataset. |
Zhuoxuan Peng; Boan Zhu; Xingjian Zhang; Wenying Li; S.-H. Gary Chan; |
| 181 | ARES: Unifying Asymmetric RGB-Event Stereo for Probabilistic Scene Flow Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Estimating dense three-dimensional motion in dynamic high-speed scenes remains challenging due to motion blur, illumination variation, and the limited temporal resolution of conventional cameras. We introduce ARES, a unified framework for Asymmetric RGB-Event Stereo that addresses these issues through a hybrid setup where an event camera captures fine-grained temporal dynamics and an RGB camera provides rich spatial structure. |
Jie Long Lee; Gim Hee Lee; |
| 182 | Are Image-to-Video Models Good Zero-Shot Image Editors? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present **IF-Edit** (**I**mage Edit by Generating **F**rames), a tuning-free framework that repurposes pre-trained image-to-video diffusion models for instruction-driven image editing. |
Zechuan Zhang; Zhenyuan Chen; Zongxin Yang; Yi Yang; |
| 183 | GPFlow: Gaussian Prototype Probability Flow for Unsupervised Multi-Modal Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To capture diverse and continuous normality variations, we propose GPFlow, a probability-flow-inspired framework that embeds diverse normal patterns into a latent space of learnable Gaussian prototypes. |
YITING LI; Xulei Yang; Jingyi Liao; Jing Zhang; Fayao Liu; |
| 184 | Real-World Point Tracking with Verifier-Guided Pseudo-Labeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we address the problem of real-world fine-tuning and introduce Verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. |
Görkay Aydemir; Fatma Güney; Weidi Xie; |
| 185 | AutoDebias: An Automated Framework for Detecting and Mitigating Backdoor Biases in Text-to-Image Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing debiasing methods, often designed for natural statistical biases, struggle with these deliberate and subtle injected attacks. We propose AutoDebias, a framework that automatically identifies and mitigates these malicious biases in T2I models without prior knowledge of the specific attack vectors. |
Hongyi Cai; MingKang Dong; Muxin Pu; Moayad Aloqaily; jie li; Xinfeng Li; Jialie Shen; Meikang Qiu; Qingsong Wen;
| 186 | Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we integrate visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency through 1) Adaptive Visual Planning: generating a structured visual checklist that identifies the visual elements whose consistency must be preserved, and 2) Iterative Visual Correction: performing self-reflection guided by the checklist and iteratively refining the generated result. To achieve this, we use supervised fine-tuning to teach the model to plan visual checks, conduct self-reflection, and self-refine, and we use flow-GRPO to further enhance visual consistency through a customized visual-checking reward. |
Zixuan Ye; Quande Liu; Cong Wei; Yuanxing Zhang; Xintao Wang; Pengfei Wan; Kun Gai; Wenhan Luo; |
| 187 | Modeling Cross-vision Synergy for Unified Large Vision Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing unified LVMs primarily pursue functional integration, while overlooking the deeper goal of cross-vision synergy: the ability to reason over complementary priors across visual modalities. To address this, we present PolyV, a unified LVM that achieves cross-vision synergy at both the architectural and training levels. |
Shengqiong Wu; Lanhu Wu; Mingyang Bao; Wenhao Xu; Hanwang Zhang; Shuicheng Yan; Hao Fei; Tat-seng Chua; |
| 188 | From Remember to Transfer: Interpretable Open-World Reasoning in MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: If an agent can capture and reuse these latent patterns, it can infer new actionable knowledge from prior experience, enabling more efficient and flexible task execution. To explore this capability, we propose Echo. |
Chenghao Li; Jun Liu; Songbo Zhang; HuaDong Jian; Hao Ni; LIK-HANG LEE; SUNG BAE BAE; Guoqing Wang; Yang Yang; Chaoning Zhang; |
| 189 | PromptEnhancer: Taming Your Rewriter for Text-to-Image Generation Via Fine-Grained Reward Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these models often struggle to faithfully render complex user prompts, particularly in aspects such as attribute binding, negation, and compositional relationships. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pre-trained T2I model. |
Linqing Wang; zhiyong xu; XiMing Xing; YIJI CHENG; Zhiyuan Zhao; Donghao Li; Tiankai Hang; Zhenxi Li; Jiale Tao; wangqixun wangqixun; Ruihuang Li; Comi Chen; Xin LI; Mingrui Wu; Xinchi Deng; Shuyang Gu; Chunyu Wang; qinglin lu; |
| 190 | Wavelet-based Frame Selection By Detecting Semantic Boundary for Long Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce **W**avelet-based **F**rame **S**election by Detecting **S**emantic **B**oundary (**WFS-SB**), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts—pivotal moments of narrative change that are essential to comprehending the holistic storyline of a video. |
Wang Chen; Yuhui zeng; Yongdong Luo; Tianyu Xie; Luojun Lin; Jiayi Ji; Yan Zhang; Xiawu Zheng; |
| 191 | NaTex: Seamless Texture Generation As Latent Color Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. |
Zeqiang Lai; Yunfei Zhao; Zibo Zhao; Xin Yang; Xin Huang; Jingwei Huang; Xiangyu Yue; Chunchao Guo; |
| 192 | Stable Mean Flow: Lyapunov-Inspired One-Step Flow Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The Mean Flow Matching algorithm is the state-of-the-art for one-step generative models. Building on this idea, we propose the Stable Mean Flow algorithm and introduce a Lyapunov-inspired stability regularizer that enforces local non-expansivity of the single-step transport map. |
Guangxun Zhang; Mason Haberle; Davi Geiger; |
| 193 | Bridging Fidelity-Reality with Controllable One-Step Diffusion for Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent diffusion-based one-step methods have shown remarkable progress in the field of image super-resolution, yet they remain constrained by three critical limitations: (1) inferior fidelity performance caused by the information loss from compression encoding of low-quality (LQ) inputs; (2) insufficient region-discriminative activation of generative priors; (3) misalignment between text prompts and their corresponding semantic regions. To address these limitations, we propose CODSR, a controllable one-step diffusion network for image super-resolution. |
Hao Chen; Junyang Chen; Jinshan Pan; Jiangxin Dong; |
| 194 | DreamOmni2: Multimodal Instruction-based Generation and Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. |
Bin Xia; Bohao Peng; Yuechen Zhang; Junjia Huang; JiyangLiu JiyangLiu; Jingyao Li; Haoru Tan; WU Sitong; Chengyao Wang; Yitong Wang; Bei Yu; Jiaya Jia; |
| 195 | Flow3r: Factored Flow Prediction for Visual Geometry Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Flow3r, a scalable framework for visual geometry learning that leverages flow prediction to guide learning using unlabeled monocular videos. |
Zhongxiao Cong; Qitao Zhao; Minsik Jeon; Shubham Tulsiani; |
| 196 | Particulate: Feed-Forward 3D Object Articulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Particulate, a feed-forward model that, given a single static 3D mesh of an everyday object, predicts its 3D parts, kinematic structure, and articulation parameters. |
Ruining Li; YUXIN YAO; Chuanxia Zheng; Christian Rupprecht; Joan Lasenby; Shangzhe Wu; Andrea Vedaldi; |
| 197 | Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. |
hongyuan chen; Xingyu Chen; Zexiang Xu; Anpei Chen; |
| 198 | What Is It Like to Be A Noise? An Entropy-based Gaussian Noise Regularization for Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a principled, differentiable regularizer that correctly targets the high-mass typical set rather than the high-probability mode. |
Pascal Chang; Kai Lascheit; Jingwei Tang; Markus Gross; Vinicius C. Azevedo; |
| 199 | VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce VINS-120K, the first large-scale dataset for instruction-based UHR image editing, comprising 120K carefully curated triplets of instruction, input image, and edited image. |
Zhizhou Chen; Shanyan Guan; Zhanxin Gao; En Ci; Yanhao Ge; Wei Li; Zhenyu Zhang; Jian Yang; Ying Tai; |
| 200 | FlashPortrait: 6× Faster Infinite Portrait Animation with Adaptive Latent Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6× acceleration in inference speed. |
Shuyuan Tu; Yueming Pan; Yinming Huang; Xintong Han; Zhen Xing; Qi Dai; Kai Qiu; Chong Luo; Zuxuan Wu; |
| 201 | Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. |
Christopher Clark; Jieyu Zhang; Zixian Ma; Jae Sung Park; Rohun Tripathi; Sangho Lee; Reza Salehi; Jason Ren; Chris Dongjoo Kim; Yinuo Yang; Vincent Shao; Yue Yang; Weikai Huang; Ziqi Gao; Taira Anderson; Jianrui Zhang; Jitesh Jain; George Stoica; Ali Farhadi; Ranjay Krishna; |
| 202 | LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose LongVideo-R1, an active, reasoning-equipped multimodal large language model (MLLM) agent designed for efficient video context navigation, avoiding the redundancy of exhaustive search. |
Jihao Qiu; Lingxi Xie; Xinyue Huo; Qi Tian; Qixiang Ye; |
| 203 | Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. |
Zhongxing Xu; Zhonghua Wang; Zhe Qian; Dachuan Shi; feilong tang; Ming Hu; Shiyan Su; Xiaocheng Zou; Wei Feng; Dwarikanath Mahapatra; Yifan Peng; Mingquan Lin; Zongyuan Ge; |
| 204 | Adversarial Style Optimization: Enhancing VLM Jailbreaks By GRPO-based Stylistic Triggers Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, from a safety perspective, their defense mechanisms can be easily bypassed by these specific stylistic triggers, leading to harmful responses. Based on this finding, we propose Adversarial Style Optimization (ASO), a plug-and-play enhancement module to amplify existing visual jailbreaks. |
Bingjun Luo; Jialin Guo; Yue Yao; Xinpeng Ding; |
| 205 | DVGT: Visual Geometry Transformer for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, it still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Visual Geometry Transformer specifically designed for autonomous Driving (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. |
Sicheng Zuo; Zixun Xie; Wenzhao Zheng; Shaoqing Xu; Fang Li; Shengyin Jiang; Long Chen; Zhixin Yang; Jiwen Lu; |
| 206 | FINER: MLLMs Hallucinate Under Fine-grained Negative Queries Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce **FI**ne-grained **NE**gative que**R**ies (**FINER**), alongside two benchmarks: **FINER-CompreCap** and **FINER-DOCCI**. |
Rui Xiao; Sanghwan Kim; Yongqin Xian; Zeynep Akata; Stephan Alaniz; |
| 207 | Omni-Supervised Motion Editing: Balancing Change and Invariance Through Positive-Negative Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The key challenge lies in balancing change (i.e., precisely editing target regions) and invariance (i.e., preserving unedited parts). To address this challenge, we propose an Omni-Supervised Positive-Negative Learning framework, named OmniME. |
Zhenwu Shi; Jingyu Gong; Peiwei Wang; Xingzan Wang; Tianwen Qian; Wenxi Li; Yuan Fang; Jiao Xie; Lizhuang Ma; Shaohui Lin; |
| 208 | VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We find that even state-of-the-art commercial APIs satisfy fewer than 72% of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present **VisionDirector**, a training-free, vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling plus semantic verification/rollback after every edit, and (iv) logs goal-level rewards. |
Meng Chu; Senqiao Yang; Haoxuan Che; Suiyun Zhang; Xichen Zhang; Shaozuo Yu; Haokun GUI; Zhefan Rao; Dandan Tu; Rui Liu; Jiaya Jia; |
| 209 | AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose AlcheMinT, a unified framework that introduces explicit timestamps conditioning for subject-driven video generation. |
Sharath Girish; Viacheslav Ivanov; Tsai-Shien Chen; Hao Chen; Aliaksandr Siarohin; Sergey Tulyakov; |
| 210 | Lightmover: Towards Precise and Efficient Control for Light Movement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. |
Gengze Zhou; Tianyu Wang; Soo Ye Kim; ZHIXIN SHU; Xin Yu; Yannick Hold-Geoffroy; Sumit Chaturvedi; Qi Wu; Zhe Lin; Scott Cohen; |
| 211 | Dynamics-Aware Preference Optimization for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work revisits this issue through the lens of learning dynamics and identifies a core pathology, the squeezing effect, where easy negatives retain large, misaligned gradients despite having negligible loss. To address this, we propose Cooling-Weighted Direct Preference Optimization (CW-DPO), a two-stage framework that first smooths and then stabilizes the alignment process. |
jusheng zhang; Kaitong Cai; Jing Yang; Jian Wang; Keze Wang; |
| 212 | HTC-VLM: Disentangled Hybrid Token Compression for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our work demonstrates that a minimalist hybrid can resolve the efficiency–fidelity dilemma, advancing scalable VLMs. |
jusheng zhang; Xiaoyang Guo; Kaitong Cai; Qinhan Lyu; Yijia Fan; Wenhao Chai; Jian Wang; Keze Wang; |
| 213 | PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This global normalization creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that solves this by aligning the reward structure with the task’s heterogeneous nature. |
Hee Suk Yoon; Eunseop Yoon; Ji Woo Hong; SooHwan Eom; Gwanhyeong Koo; Mark A. Hasegawa-Johnson; Qi Dai; Chong Luo; Chang D. Yoo; |
| 214 | TINA: Text-Free Inversion Attack for Unlearned Text-to-Image Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, identifying such a visual generative pathway is challenging because standard text-guided DDIM inversion is actively resisted by text-centric defenses within the erased model. To address this, we introduce TINA, a novel Text-free INversion Attack, which enforces this visual-only probe by operating under a null-text condition, thereby avoiding existing text-centric defenses. |
Qianlong Xiang; Miao Zhang; Haoyu Zhang; Kun Wang; Junhui Hou; Liqiang Nie; |
| 215 | VGG-T³: Offline Feed-Forward 3D Reconstruction at Scale Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. |
Sven Elflein; Ruilong Li; Sérgio Agostinho; Žan Gojčič; Laura Leal-Taixe; Qunjie Zhou; Aljoša Ošep; |
| 216 | MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paradigm leads to computational inefficiency when scaling, as it requires a separate forward pass for each pair and overlooks potential contextual relationships between multiple queries that can relate to the same context. In this work, we introduce Multi-Turn Contrastive Learning (MuCo), a dialogue-inspired framework that revisits this process. |
Geonmo Gu; Byeongho Heo; Jaemyung Yu; Jaehui Hwang; Taekyung Kim; Sangmin Lee; HeeJae Jun; Yoohoon Kang; Sangdoo Yun; Dongyoon Han; |
| 217 | MeshFlow: Efficient Artistic Mesh Generation Via MeshVAE and Flow-based DiTs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present MeshFlow, a new method for compressing and generating artist-like 3D meshes. |
Weiyu Li; Antoine Toisoul; Tom Monnier; Roman Shapovalov; Rakesh Ranjan; Ping Tan; Andrea Vedaldi; |
| 218 | FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. |
Tobias Kirschstein; Simon Giebenhain; Matthias Nießner; |
| 219 | CompBench: Benchmarking Complex Instruction-guided Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To construct CompBench, we propose an MLLM-human collaborative framework with tailored task pipelines. |
Bohan Jia; Wenxuan Huang; Yuntian Tang; Junbo Qiao; Jincheng Liao; Shaosheng Cao; Fei Zhao; Zhaopeng Feng; Zhouhong Gu; Zhenfei Yin; Lei Bai; Wanli Ouyang; Lin Chen; Fei Zhao; Zihan Wang; Yuan Xie; Shaohui Lin; |
| 220 | GlyphPrinter: Region-Grouped Direct Preference Optimization for Glyph-Accurate Visual Text Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by Direct Preference Optimization (DPO), we propose ***GlyphPrinter***, a preference-based text rendering method that eliminates reliance on explicit reward models. |
Xincheng Shuai; Ziye Li; Henghui Ding; Dacheng Tao; |
| 221 | PSDesigner: Automated Graphic Design with A Human-Like Creative Workflow Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although recent studies have leveraged powerful text-to-image models and MLLMs to assist graphic design, they typically simplify professional workflows, resulting in limited flexibility and intuitiveness. To address these limitations, we propose ***PSDesigner***, an automated graphic design system that emulates the creative workflow of human designers. |
Xincheng Shuai; Song Tang; Yutong Huang; Henghui Ding; Dacheng Tao; |
| 222 | Unlocking Token Rewards Via Training-Free Reward Attribution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose an extremely efficient, training-free method to extract token-level reward signals directly from an existing deep reward model. |
WU Sitong; Haoru Tan; Bin Xia; Xichen Zhang; Jingyao Li; Shaofeng Zhang; Xiaojuan Qi; Bei Yu; Jiaya Jia; |
| 223 | Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While the recent Gaussian-Enhanced Surfels (GES) enable high-performance, sort-free rendering, they suffer from aliasing artifacts and suboptimal reconstruction. To address these limitations, we propose DP-GES, a novel representation that augments opaque surfels with semi-transparent boundaries and leverages Depth Peeling to establish accurate per-pixel ordering. |
Keyang Ye; Hongzhi Wu; Kun Zhou; |
| 224 | InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces InfiniDepth, which represents depth as neural implicit fields. |
Hao Yu; Haotong Lin; Jiawei Wang; Jiaxin Li; Yida Wang; Xueyang Zhang; Yue Wang; Xiaowei Zhou; Ruizhen Hu; Sida Peng; |
| 225 | Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a new data-centric pipeline that leverages a high-fidelity captioner to create SOTA-quality captions and the first Unified Tag System (UTS) that bridges speech, music, and environmental sounds. |
Xuanru Zhou; Yiwen Shao; Wei-Cheng Tseng; Dong Yu; |
| 226 | From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. |
Yuyuan Liu; Yiping Ji; Anjie Le; Jiayuan Zhu; Jiazhen Pan; Can Peng; Jiajun Deng; Fengbei Liu; Junde Wu; |
| 227 | LATTICE: Democratize High-Fidelity 3D Generation at Scale Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present LATTICE, a new framework for high-fidelity 3D asset generation that bridges the quality and scalability gap between 3D and 2D generative models. |
Zeqiang Lai; Yunfei Zhao; Zibo Zhao; Haolin Liu; Qingxiang Lin; Jingwei Huang; Chunchao Guo; Xiangyu Yue; |
| 228 | Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we first reinterpret existing SDE-based GRPO methods from a distance optimization perspective, revealing their underlying mechanism as a form of contrastive learning. Based on this insight, we propose Neighbor GRPO, a novel alignment algorithm that completely bypasses the need for SDEs. |
Dailan He; Guanlin Feng; Xingtong Ge; Yazhe Niu; Yi Zhang; Bingqi Ma; Guanglu Song; Yu Liu; Hongsheng Li; |
| 229 | High-Fidelity Diffusion Face Swapping with ID-Constrained Facial Conditioning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper addresses two key challenges in diffusion-based face swapping: the prioritized preservation of identity over target attributes and the inherent conflict between identity and attribute conditioning. To tackle these issues, we introduce an identity-constrained attribute-tuning framework for face swapping that first ensures identity preservation and then fine-tunes for attribute alignment, achieved through a decoupled condition injection. |
Dailan He; Xiahong Wang; Shulun Wang; Hao Shao; Bingqi Ma; Guanglu Song; Yu Liu; Hongsheng Li; |
| 230 | MotionEdit: Benchmarking and Learning Motion-Centric Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce **MotionEdit**, a novel dataset for motion-centric image editing—the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. |
Yixin Wan; Lei Ke; Wenhao Yu; Kai-Wei Chang; Dong Yu; |
| 231 | VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we introduce VLM-3R, a framework for Vision-Language Models that couples 3D reconstructive instruction tuning with scalable training data curation and a new benchmark for temporal reasoning. |
Zhiwen Fan; Jian Zhang; Renjie Li; Junge Zhang; Runjin Chen; Hezhen Hu; Kevin Wang; Peihao Wang; Huaizhi Qu; Shijie Zhou; Dilin Wang; Zhicheng Yan; Hongyu Xu; Justin Theiss; Tianlong Chen; Jiachen Li; Zhengzhong Tu; Zhangyang Wang; Rakesh Ranjan; |
| 232 | PhysSkin: Real-Time and Generalizable Physics-Based Animation Via Self-Supervised Neural Skinning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Achieving real-time physics-based animation that generalizes across diverse 3D shapes and discretizations remains a fundamental challenge. We introduce PhysSkin, a physics-informed framework that addresses this challenge. |
Yuanhang Lei; Tao Cheng; Xingxuan Li; Boming Zhao; Siyuan Huang; Ruizhen Hu; Peter Yichen Chen; Hujun Bao; Zhaopeng Cui; |
| 233 | SAGE: Scalable Agentic 3D Scene Generation for Embodied AI Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., “pick up a bowl and place it on the table”), understands the intent and automatically generates simulation-ready environments at scale. |
Hongchi Xia; Xuan Li; Max Li; Qianli Ma; Jiashu Xu; Ming-Yu Liu; Yin Cui; Tsung-Yi Lin; Wei-Chiu Ma; Shenlong Wang; Shuran Song; Fangyin Wei; |
| 234 | SAM 3D: 3Dfy Anything in Images Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. |
Xingyu Chen; Fu-Jen Chu; Pierre Gleize; Kevin Liang; Alexander Sax; Hao Tang; Weiyao Wang; Michelle Guo; Thibaut Hardin; Xiang Li; Aohan Lin; Jia-Wei Liu; Ziqi Ma; Anushka Sagar; Bowen Song; Xiaodong Wang; Jianing "Jed" Yang; Bowen Zhang; Piotr Dollár; Georgia Gkioxari; Matt Feiszli; Jitendra Malik; |
| 235 | Learning to Generate Highly Dynamic Videos Using Synthetic Motion Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data in training, which is represented as optical flow and rendered using computer graphics pipelines. |
Wonjoon Jin; Jiyun Won; Janghyeok Han; Qi Dai; Chong Luo; Seung-Hwan Baek; Sunghyun Cho; |
| 236 | ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting. |
Xiaoxue Wu; Xinyuan Chen; Yaohui Wang; Yu Qiao; |
| 237 | Learning to Adapt: Self-Improving Web Agent Via Cognitive-Aware Exploration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing web agents often rely on handcrafted execution pipelines or expensive expert trajectories, limiting their adaptability to complex, dynamic environments. To address these challenges, we propose **SCALE** (**S**elf-**C**ognitive-**A**ware **L**earning and **E**xploration), which leverages three roles, *selector*, *predictor*, and *judger*, to autonomously discover its limitations and expand its cognitive boundaries through environment exploration. |
Weile Chen; Bingchen Miao; Qifan Yu; Wendong Bu; Guoming Wang; Wenqiao Zhang; Shengyu Zhang; Juncheng Li; Siliang Tang; |
| 238 | VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Drawing inspiration from human cognitive memory theory, which distinguishes between short-term visually-dominant memory and long-term semantically-dominant memory, we propose VisMem, a cognitively-aligned framework that equips VLMs with dynamic latent vision memories: a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. |
Xinlei Yu; Chengming Xu; Guibin Zhang; Zhangquan Chen; Yudong Zhang; Yongbo He; Peng-Tao Jiang; Jiangning Zhang; Xiaobin Hu; Shuicheng Yan; |
| 239 | Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Diff4Splat, a feed-forward framework for dynamic scene generation from a single image. |
Panwang Pan; Chenguo Lin; Chenxin Li; Jingjing Zhao; Yuchen Lin; Haopeng Li; Yunlong Lin; Kairun Wen; Yixuan Yuan; Yadong Mu; |
| 240 | ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This leads to semantic conflicts and suboptimal performance in preserving identities and interactions, limiting controllability and applicability. To tackle this issue, we introduce ID-Crafter, a framework for multi-subject video generation that achieves superior identity preservation and semantic coherence. |
Panwang Pan; Jingjing Zhao; Yuchen Lin; Chenguo Lin; Chenxin Li; Hengyu Liu; Tingting Shen; Yadong Mu; |
| 241 | GaussianDWM: Driving World Model Using Language-aligned 3D Gaussians for Scene Understanding and Multi-modal Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, current approaches that represent 3D spatial information with point-cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. |
Tianchen Deng; Xuefeng Chen; Yi Chen; Qu Chen; Yuyao Xu; Lijin Yang; Le Xu; Yu Zhang; Bo Zhang; Wuxiong Huang; Hesheng Wang; |
| 242 | G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. |
Wenbo Hu; Jingli Lin; Yilin Long; Yunlong Ran; Lihan Jiang; Yifan Wang; Chenming Zhu; Runsen Xu; Tai Wang; Jiangmiao Pang; |
| 243 | SoccerMaster: A Vision Foundation Model for Soccer Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper aims to propose a unified framework that enables a single model to handle diverse soccer visual understanding tasks, spanning both fine-grained perception (e.g., athlete detection) and semantic reasoning (e.g., event classification). Concretely, we make the following contributions in this paper: (i) we present **SoccerMaster**, the first soccer-specific vision foundation model that unifies comprehensive understanding tasks within a single framework via **supervised multi-task pretraining**; (ii) we consolidate multiple existing soccer video datasets and develop an automated data curation pipeline, termed **SoccerFactory**, to produce scalable multi-task training annotations; and (iii) we conduct extensive experiments demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, underscoring its breadth and superiority. |
Haolin Yang; Jiayuan Rao; Haoning Wu; Weidi Xie; |
| 244 | STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction and native likelihood estimation. |
Jiatao Gu; Ying Shen; Tianrong Chen; Laurent Dinh; Yuyang Wang; Miguel Ángel Bautista; David Berthelot; Joshua Susskind; Shuangfei Zhai; |
| 245 | LongVT: Incentivizing Thinking with Long Videos Via Native Tool Calling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by how humans comprehend long videos—by first skimming globally and then examining relevant clips for details—we introduce LongVT, an end-to-end agentic framework that sparks Thinking with Long Videos via interleaved Multimodal Chain-of-Tool-Thought. |
Zuhao Yang; Sudong Wang; Kaichen Zhang; Keming Wu; Sicong Leng; Yifan Zhang; Bo Li; Chengwei Qin; Shijian Lu; Xingxuan Li; Lidong Bing; |
| 246 | Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce **Synthetic Object Compositions (SOC)**, an accurate and scalable data synthesis pipeline via a novel object-centric composition strategy. |
Weikai Huang; Jieyu Zhang; Taoyang jia; Chenhao Zheng; Ziqi Gao; Jae Sung Park; Ranjay Krishna; |
| 247 | VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current benchmarks evaluate fine-grained actions in a domain-agnostic manner, making it hard to evaluate models on this task. To address this gap, we introduce VideoNet, a comprehensive benchmark aimed at evaluating the domain-specific, fine-grained action understanding of video models. |
Tanush Yadav; Reza Salehi; Jae Sung Park; Vivek Ramanujan; Hannaneh Hajishirzi; Yejin Choi; Ali Farhadi; Rohun Tripathi; Ranjay Krishna; |
| 248 | UniGen-1.5: Enhancing Image Generation and Editing Through Reward Unification in RL Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing. |
Rui Tian; Mingfei Gao; Haiming Gang; Jiasen Lu; Zhe Gan; Yinfei Yang; Zuxuan Wu; Afshin Dehghan; |
| 249 | Learning from Itself: Mining Internal Knowledge from Vision Language Models for Continual Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Learning from Itself (LfI), which mines CLIP’s internal knowledge to address both challenges. |
Yizheng Gong; Siyue Yu; Waleed Al-Nuaimy; Jimin Xiao; |
| 250 | BiGMINT: Biologically-guided Hierarchical Multimodal Integration for Modeling Multiple Compound Activities in Drug Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present **BiGMINT**, a **Bi**ologically **G**uided **M**ultimodal framework that hierarchically **INT**egrates chemoproteomic and high-content imaging (HCI) data, introducing chemoproteomics-guided phenotypic aggregation, task-aware cross-modal fusion, and protein–protein interaction priors for modeling activities. |
Pushpak Pati; Bo Li; Abbas Khan; Tomé Albuquerque; Steffen Jaensch; Amina Mollaysa; Walid Hassan; Samantha J Allen; Joke Reumers; Helai Mohammad; Scott Oloff; Tommaso Mansi; Rui Liao; Dmytro Lituiev; Zhoubing Xu; |
| 251 | MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. |
Taha Koleilat; Hojat Asgariandehkordi; Omid Nejatimanzari; Berardino Barile; Yiming Xiao; Hassan Rivaz; |
| 252 | FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To construct FAVE, we propose a scalable annotation pipeline that integrates shot boundary detection, automated captioning, and GPT-assisted refinement to produce temporally grounded, high-quality data. |
Weiheng Lu; An Yu; Jian Li; Zhenfei Zhang; Felix X. Ye; Ming-Ching Chang; |
| 253 | SpatialStack: Layered Geometry-Semantic Fusion for 3D VLM Spatial Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While recent efforts have introduced multi-view geometry transformers into VLMs, they typically fuse only the deep-layer features from vision and geometry encoders, discarding rich hierarchical signals and creating a fundamental bottleneck for spatial understanding. To overcome this, we propose SpatialStack, a general hierarchical fusion framework that progressively aligns vision, geometry, and language representations across the model hierarchy. |
Jian Zhang; Shijie Zhou; Bangya LIU; Achuta Kadambi; Zhiwen Fan; |
| 254 | P-Flow: Prompting Visual Effects Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, it is hard and time-consuming for humans to craft a single prompt that accurately specifies these effects, as they require complex temporal reasoning and iterative refinement over time. To address this challenge, we propose P-Flow, a novel training-free framework for customizing dynamic visual effects in video generation without modifying the underlying model. |
Rui Zhao; Mike Zheng Shou; |
| 255 | TeamHOI: Learning A Unified Policy for Cooperative Human-Object Interactions with Any Team Size Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present TeamHOI, a framework that enables a single decentralized policy to handle cooperative HOIs across any number of cooperating agents. |
Stefan Lionar; Gim Hee Lee; |
| 256 | Ego-STAR: Spatiotemporal Hints for Egocentric Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce EgoSTAR, a benchmark for evaluating complex egocentric visual reasoning. |
Arsha Nagrani; Jasper Uijlings; Shyamal Buch; Tobias Weyand; Sudheendra Vijayanarasimhan; Bo Hu; Ramin Mehran; David A. Ross; Cordelia Schmid; |
| 257 | UniRain: Unified Image Deraining with RAG-based Dataset Distillation and Multi-objective Reweighted Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose UniRain, an effective unified image deraining framework capable of restoring images degraded by rain streaks and raindrops under both daytime and nighttime conditions. |
Qianfeng Yang; Qiyuan Guan; Xiang Chen; Jiyu Jin; Guiyue Jin; Jiangxin Dong; |
| 258 | Interact2Ar: Full-Body Human-Human Interaction Generation Via Autoregressive Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Additionally, current diffusion-based approaches generate entire motion sequences simultaneously, limiting their ability to capture the reactive and adaptive nature of human interactions. To address these limitations, we introduce Interact2Ar, the first end-to-end text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. |
Pablo Ruiz-Ponce; Sergio Escalera; Jose Garcia-Rodriguez; Jiankang Deng; Rolandos Alexandros Potamias; |
| 259 | SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. |
Chaojun Ni; Chen Cheng; Xiaofeng Wang; Zheng Zhu; Wenzhao Zheng; Boyuan Wang; Tianrun Chen; Guosheng Zhao; Haoyun Li; Zhehao Dong; Qiang Zhang; Yun Ye; Yang Wang; Guan Huang; Wenjun Mei; |
| 260 | ShowUI-π: Flow-based Generative Models As GUI Dexterous Hands Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we develop ShowUI-π, the first flow-based generative model to serve as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete–Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training Data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g., PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents’ drag capabilities. |
Siyuan Hu; Kevin Qinghong Lin; Mike Zheng Shou; |
| 261 | StreamDiT: Real-Time Streaming Text-to-Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To put the proposed method into practice, we train a StreamDiT model with 4B parameters. |
Akio Kodaira; Tingbo Hou; Ji Hou; Markos Georgopoulos; Felix Juefei-Xu; Masayoshi Tomizuka; Yue Zhao; |
| 262 | EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. |
Wei Chow; Linfeng Li; Lingdong Kong; Zefeng Li; Qi Xu; Hang Song; Tian Ye; Xian Wang; Jinbin Bai; Shilin Xu; Xiangtai Li; Junting Pan; Shaoteng Liu; Ran Zhou; Tianshu Yang; Songhua Liu; |
| 263 | Generative Video Motion Editing with 3D Point Tracks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a track-conditioned V2V framework that enables joint editing of camera and object motion. |
Yao-Chih Lee; Zhoutong Zhang; Gabriel Huang; Jui-Hsien Wang; Joon-Young Lee; Jia-Bin Huang; Eli Shechtman; Zhengqi Li; |
| 264 | Enhancing Vision Language Models for 4D Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address these, we present a QA generation pipeline that focuses on motion-related scene understanding. |
Seokju Cho; Abhishek Badki; Hang Su; Jindong Jiang; Ziyao Zeng; Seungryong Kim; Sifei Liu; Orazio Gallo; |
| 265 | Region-Adaptive Sampling for Diffusion Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Leveraging the flexibility of Diffusion Transformers (DiTs) to handle variable token counts, we propose RAS, a training-free sampling strategy that dynamically assigns different update ratios to image regions based on model focus. |
Ziming Liu; Yifan Yang; Chengruidong Zhang; Yiqi Zhang; Lili Qiu; Yang You; Yuqing Yang; |
| 266 | Exploring Visual Pretraining for Learning Language Intelligence Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: MAPLE universally integrates masked auto-regressive models with various LLM backbones, where the LLMs are incentivized to generate latent hypotheses for the masked regions based on the unmasked regions. |
Zhonghan Zhao; Yiming Zhang; Wenwei Zhang; Haiteng Zhao; Xingguang Wei; Zhangwei Gao; Kuikun Liu; Yuzhe Gu; Size Wu; Haian Huang; Jianfei Gao; Haijun Lv; Demin Song; Yunhua Zhou; Qipeng Guo; Gaoang Wang; Kai Chen; |
| 267 | CineScene: Implicit 3D As Effective Scene Representation for Cinematic Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. |
Kaiyi Huang; Yukun Huang; Yu Li; Jianhong Bai; Xintao Wang; Zinan Lin; Xuefei Ning; Jiwen Yu; Yu Wang; Xihui Liu; |
| 268 | SO-Bench: A Structural Output Evaluation of Multimodal LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-BENCH benchmark. |
Di Feng; Kaixin Ma; Feng Nan; Haofeng Chen; Bohan Zhai; David Griffiths; Mingfei Gao; Zhe Gan; Eshan Verma; Yinfei Yang; Zhifeng Chen; Afshin Dehghan; |
| 269 | Mind The Gap: Transferring Labels to Align Object Detection Datasets Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Label-Aligned Transfer (LAT), a label transfer framework that systematically projects annotations from diverse source datasets into the label space of a target dataset. |
Mikhail Kennerley; Angelica I Aviles-Rivero; Carola-Bibiane Schönlieb; Robby T. Tan; |
| 270 | EffectMaker: Unifying Reasoning and Generation for Customized Visual Effect Creation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present EffectMaker, a unified reasoning–generation framework that enables reference-based VFX customization. |
Shiyuan Yang; Ruihuang Li; Jiale Tao; Shuai Shao; Qinglin Lu; Jing Liao; |
| 271 | ConsistCompose: Unified Multimodal Layout Control for Image Composition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present ConsistCompose, a unified multimodal framework that embeds layout coordinates directly into language prompts, enabling layout-controlled multi-instance image generation from Interleaved Image-Text within a single generative interface. |
Xuanke Shi; Boxuan Li; Xiaoyang Han; Zhongang Cai; Lei Yang; Quan Wang; Dahua Lin; |
| 272 | CountGD++: Generalized Prompting for Open-World Counting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. |
Niki Amini-Naieni; Andrew Zisserman; |
| 273 | CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose CoMo, which aims to learn more precise continuous latent motion from internet-scale videos. |
Jiange Yang; Tom Tomlinson; Haoyi Zhu; Mingyu Liu; Kaijing Ma; Yating Wang; Gangshan Wu; Tong He; Limin Wang; |
| 274 | Exploring Spatial Intelligence from A Generative Perspective Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. |
Muzhi Zhu; Shunyao Jiang; Huanyi Zheng; Zekai Luo; Hao Zhong; Anzhou Li; Kaijun Wang; Jintao Rong; Yang Liu; Hao Chen; Tao Lin; Chunhua Shen; |
| 275 | Match-and-Fuse: Consistent Generation from Unstructured Image Sets Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present Match-and-Fuse, a zero-shot, training-free method for consistent, controlled generation of unstructured image sets: collections that share a common visual element, yet differ in viewpoint, time of capture, and surrounding content. |
Kate Feingold; Omri Kaduri; Tali Dekel; |
| 276 | Recovering Physically Plausible Human-Object Interactions from Monocular Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. |
Dingbang Huang; Etienne Vouga; Qixing Huang; Georgios Pavlakos; |
| 277 | StableMaterials: Enhancing Diversity in Material Generation Via Semi-Supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce **StableMaterials**, a novel approach for generating photorealistic physically-based rendering (PBR) materials that integrates semi-supervised learning with Latent Diffusion Models (LDMs). |
Giuseppe Vecchio; |
| 278 | Confusion-Aware Spectral Regularizer for Long-Tailed Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present a confusion-centric perspective for long-tailed recognition that explicitly focuses on worst-class generalization. |
Ziquan Zhu; Gaojie Jin; Hanruo Zhu; Si-Yuan Lu; Yunxiao Zhang; Zeyu Fu; Ronghui Mu; Guoqiang Zhang; Zhao Sun; Xia Yuhang; Jiaxing Shang; Xiang Li; Lu Liu; Tianjin Huang; |
| 279 | VideoFusion: A Spatio-Temporal Collaborative Network for Multi-modal Video Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Compared to images, videos better align with real-world acquisition scenarios and possess valuable temporal cues. However, existing multi-sensor fusion research predominantly … |
Linfeng Tang; Yeda Wang; Meiqi Gong; Zizhuo Li; Yuxin Deng; Xunpeng Yi; Chunyu Li; Han Xu; Hao Zhang; Jiayi Ma; |
| 280 | AnchorFlow: Training-Free 3D Editing Via Latent Anchor-Aligned Flows Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing approaches often struggle to produce strong or geometrically stable edits, largely due to inconsistent latent anchors introduced by timestep-dependent noise during diffusion sampling. To address these limitations, we introduce AnchorFlow, which is built upon the principle of latent anchor consistency. |
Zhenglin Zhou; Fan Ma; Chengzhuo Gui; Xiaobo Xia; Hehe Fan; Yi Yang; Tat-seng Chua; |
| 281 | Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. |
Junxuan Li; Rawal Khirodkar; Egor Zakharov; Jihyun Lee; Zhaoen Su; Yuan Dong; Julieta Martinez; Kai Li; Qingyang Tan; Takaaki Shiratori; Matthew Hu; Peihong Guo; Xuhua Huang; Zhongshi Jiang; LINGCHEN YANG; Ariyan Zarei; Marco Pesavento; Yichen Xu; Chengan He; He Wen; Giljoo Nam; Teng Deng; Wyatt Borsos; Anjali Thakrar; Jean-Charles Bazin; Rinat Abdrashitov; Carsten Stoll; Ginés Hidalgo; James Booth; Lucy Wang; Xiaowen Ma; Yu Rong; Sairanjith Thalanki; Chen Cao; Christian Häne; Abhishek Kar; Sofien Bouaziz; Jason Saragih; Yaser Sheikh; Shunsuke Saito; |
| 282 | Streaming Video Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. |
Jiaer Xia; Peixian Chen; Mengdan Zhang; Xing Sun; Kaiyang Zhou; |
| 283 | The Image As Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce **Adv-GRPO**, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. |
Weijia Mao; Hao Chen; Zhenheng Yang; Mike Zheng Shou; |
| 284 | Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. |
Yuqing Wang; Chuofan Ma; Zhijie Lin; Yao Teng; Lijun Yu; Shuai Wang; Jiaming Han; Jiashi Feng; Yi Jiang; Xihui Liu; |
| 285 | MoEActok: A MoE-based Action Tokenizer for Vision-Language-Action Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: These approaches typically train a single tokenizer over entire manipulation trajectories, which often comprise multiple distinct skills and thus pose a challenging optimization trade-off. To address this issue, we introduce MoEActok, a novel action tokenizer that employs a mixture-of-experts (MoE) quantizer to produce skill-aware discrete representations for VLA models. |
Chunpu Xu; Zhixuan Liang; Tianshuo Yang; Chi-Min Chan; Yang Xiao; Jessie Wang; Xiaokang Yang; Yao Mu; |
| 286 | Unified Camera Positional Encoding for Controlled Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce **Relative Ray Encoding**, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. |
Cheng Zhang; Boying Li; Meng Wei; Yan-Pei Cao; Camilo Cruz Gambardella; Dinh Phung; Jianfei Cai; |
| 287 | Think Visually, Reason Textually: Vision-Language Synergy in Abstract Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. |
Beichen Zhang; Yuhang Zang; Xiaoyi Dong; Yuhang Cao; Haodong Duan; Dahua Lin; Jiaqi Wang; |
| 288 | WarpTracker: Tracking By Warping Instead of Correlation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose WarpTracker, a novel dense point tracker that eschews cost volumes in favor of warping. |
Zihang Lai; Eldar Insafutdinov; Edgar Sucar; Andrea Vedaldi; |
| 289 | Latent Action Pretraining Meets Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Concretely, we employ inverse- and forward-dynamics models to learn latent action representations from large-scale driving videos, similar to Genie. |
Zhengqing Wang; Saurabh Nair; Prajwal Chidananda; Pujith Kachana; Samuel Li; Matthew Brown; Yasutaka Furukawa; |
| 290 | CycleManip: Enabling Cycle-based Manipulation Via Effective History Perception and Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we explore an important yet underexplored task in robot manipulation: cycle-based manipulation, where robots need to perform cyclic or repetitive actions with an expected terminal time. |
Yi-Lin Wei; Haoran Liao; Yuhao Lin; Pengyue Wang; Zhizhao Liang; Guiliang Liu; Wei-Shi Zheng; |
| 291 | WildPose: A Unified Framework for Robust Pose Estimation in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our key insight is to connect the two powerful paradigms in modern 3D vision: the rich perceptual frontend of feed-forward models and the end-to-end optimization of differentiable bundle adjustment (BA). |
Jianhao Zheng; Liyuan Zhu; Zihan Zhu; Iro Armeni; |
| 292 | Radiance Meshes for Volumetric Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Radiance Meshes for representing radiance fields with constant density tetrahedral cells produced with a Delaunay tetrahedralization. |
Alexander Mai; Trevor Hedstrom; George Kopanas; Janne Kontkanen; Falko Kuester; Jonathan T. Barron; |
| 293 | Spatiotemporal Pyramid Flow Matching for Climate Emulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. |
Jeremy Irvin; Jiaqi Han; Zikui Wang; Abdulaziz Alharbi; Yufei Zhao; Nomin-Erdene Bayarsaikhan; Daniele Visioni; Andrew Y. Ng; Duncan Watson-Parris; |
| 294 | 3D-LATTE: Latent Space 3D Editing from Textual Instructions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a training-free editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. |
Maria Parelli; Michael Oechsle; Michael Niemeyer; Federico Tombari; Andreas Geiger; |
| 295 | VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these approaches often fall short in handling complex instruction-following tasks and scenarios that demand precise temporal modeling, resulting in limited performance in both semantic alignment and temporal reasoning. To address the above challenges, we introduce Instructed Temporal Grounding for Videos (VideoITG), a framework aiming to adaptively customize frame sampling strategies based on user instructions. |
Shihao Wang; Guo Chen; De-An Huang; Zhiqi Li; Minghan Li; Guilin Liu; Jan Kautz; Jose M. Alvarez; Lei Zhang; Zhiding Yu; |
| 296 | PromptStereo: Zero-Shot Stereo Matching Via Structure and Motion Prompts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Prompt Recurrent Unit (PRU), a novel iterative refinement module based on the decoder of monocular depth foundation models. |
Xianqi Wang; Hao Yang; Hangtian Wang; JunDa Cheng; Gangwei Xu; Min Lin; Xin Yang; |
| 297 | Toward Low-Cost Yet Effective Temporal Learning for UAV Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we advocate designing temporal learning components from a more balanced perspective that jointly considers performance gains and computational costs. |
Chaocan Xue; Qihua Liang; Bineng Zhong; Yanting Zu; Yuanliang Xue; Haiying Xia; Shuxiang Song; |
| 298 | Virtual Full-stack Scanning of Brain MRI Via Imputing Any Quantised Code Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing imputation methods often depend on global conditioning or modality-specific designs, which limit their generalisability across patient cohorts and imaging protocols. To address these limitations, we propose CodeBrain, a unified framework that reformulates various “any-to-any” imputation tasks as a region-level full-stack code prediction problem. |
Yicheng Wu; Tao Song; Zhonghua Wu; Jin Ye; Zongyuan Ge; Wenjia Bai; Zhaolin Chen; Jianfei Cai; |
| 299 | Hypergraph-State Collaborative Reasoning for Multi-Object Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing motion estimation approaches face two major limitations: (1) instability caused by noisy or probabilistic predictions, and (2) vulnerability under occlusion, where trajectories often fragment once visual cues disappear. To overcome these issues, we propose a collaborative reasoning framework that enhances motion estimation through joint inference among multiple correlated objects. |
Zikai Song; Junqing Yu; Yi-Ping Phoebe Chen; Wei Yang; Xinchao Wang; |
| 300 | Thinking with Video: Video Generation As A Promising Multimodal Reasoning Paradigm Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) the separation of text and vision as distinct modalities hinders unified multimodal understanding and generation. Therefore, we propose Thinking with Video, a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning. |
Jingqi Tong; Yurong Mou; Hangcheng Li; Mingzhe Li; Yongzhuo Yang; Ming Zhang; Qiguang Chen; Tianyi Liang; Xiaomeng Hu; Yining Zheng; Xinchi Chen; Jun Zhao; Xuanjing Huang; Xipeng Qiu; |
| 301 | Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Edit2Perceive, a unified diffusion framework that adapts editing models for depth, normal, and matting. |
Yiqing Shi; Yiren Song; Mike Zheng Shou; |
| 302 | MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In addition, their task definitions are vague, typically limited to axes such as what to edit or how many references are given, and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, which is carefully designed to assess the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. |
Yuta Oshima; Daiki Miyake; Kohsei Matsutani; Yusuke Iwasawa; Masahiro Suzuki; Yutaka Matsuo; Hiroki Furuta; |
| 303 | HandX+: Scaling Up Text-Conditioned Bimanual Motion Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing whole-body models often overlook the fine-grained details required for natural dexterous behavior, such as finger articulation, contact timing, and inter-hand coordination. We aim to close this gap by introducing a hand-centric animation framework. |
Zimu Zhang; Yucheng Zhang; Xiyan Xu; Ziyin Wang; Sirui Xu; Kai Zhou; Bing Zhou; Chuan Guo; Jian Wang; Yu-Xiong Wang; Liangyan Gui; |
| 304 | Mixture of Style Experts for Diverse Image Stylization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce StyleExpert, a semantic-aware framework based on Mixture of Experts (MoE). |
Shihao Zhu; Ziheng Ouyang; Yijia Kang; Qilong Wang; Mi Zhou; Bo Li; Ming-Ming Cheng; Qibin Hou; |
| 305 | Image Generation from Contextually-Contradictory Prompts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We define this failure mode as contextual contradiction, where one concept implicitly negates another due to entangled associations learned during training. To address this, we propose a stage-aware prompt decomposition framework that guides the denoising process using a sequence of proxy prompts. |
Saar Huberman; Or Patashnik; Omer Dahary; Ron Mokady; Daniel Cohen-Or; |
| 306 | PixelDiT: Pixel Diffusion Transformers for Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. |
Yongsheng Yu; Wei Xiong; Weili Nie; Yichen Sheng; Shiqiu Liu; Jiebo Luo; |
| 307 | Bridge: Basis-Driven Causal Inference Marries VFMs for Domain Generalization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, this paper proposes a novel Basis-driven framework for domain generalization, namely **Bridge**, that incorporates causal inference into object detection. |
Mingbo Hong; Feng Liu; Caroline Gevaert; George Vosselman; Hao Cheng; |
| 308 | Token Warping Helps MLLMs Look from Nearby Viewpoints Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint warping. |
Phillip Y. Lee; Chanho Park; Mingue Park; Seungwoo Yoo; Juil Koo; Minhyuk Sung; |
| 309 | Tokenization Allows Multimodal Large Language Models to Understand, Generate and Edit Architectural Floor Plans Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present HouseMind, a multimodal large language model that unifies floor plan understanding, generation, and editing in one framework. |
Sizhong Qin; Ramon Weber; Xinzheng Lu; |
| 310 | Adaptive 3D Perception Under Sparse Sampling Via Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce A3PRL, an RL-driven adaptive perception framework that closes the loop between LiDAR sensing and tracking. |
Shenghai Yuan; Wei Yihan; Jason Yee; Zhuoran Qiao; Boyang Lou; Enwen Hu; |
| 311 | Inference-time Physics Alignment of Video Generative Models with Latent World Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. |
Jianhao Yuan; Zhang Xiaofeng; Felix Friedrich; Nicolas Beltran-Velez; Melissa Hall; Reyhane Askari; Xiaochuang Han; Nicolas Ballas; Michal Drozdzal; Adriana Romero-Soriano; |
| 312 | Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, since SD exhibits different generative priors at different timesteps, a fixed timestep makes it difficult for these methods to fully leverage the generative priors in SD, leading to suboptimal performance. To address this, we propose a Time-Aware one-step Diffusion Network for Real-ISR (TADSR). |
Tianyi Zhang; Zheng-Peng Duan; Chun-Le Guo; Peng-Tao Jiang; Bo Li; Ming-Ming Cheng; Chongyi Li; |
| 313 | SaPaVe: Towards Active Perception and Manipulation in Vision-Language Action Models for Robot Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. |
Mengzhen Liu; Enshen Zhou; Cheng Chi; Yi Han; Shanyu Rong; Liming Chen; Pengwei Wang; Zhongyuan Wang; Shanghang Zhang; |
| 314 | Multi-Scale Gaussian-Language Map for Embodied Navigation and Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose the multi-scale Gaussian-Language Map (GLMap), which introduces three key designs: (1) explicit geometry, (2) multi-scale semantics covering both instance- and region-level concepts, and (3) a dual-modality interface where each semantic unit jointly stores a natural language description and a 3D Gaussian representation. |
Sixian Zhang; Yiyao Wang; Xinhang Song; Keming Zhang; Zijian Xu; Shuqiang Jiang; |
| 315 | VOLD: Reasoning Transfer from LLMs to Vision-Language Models Via On-Policy Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Conversely, text-based reasoning resources are abundant and scalable, but it remains an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. |
Walid Bousselham; Hilde Kuehne; Cordelia Schmid; |
| 316 | Generative Modeling of Weights: Generalization or Memorization? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: These approaches take trained neural network checkpoints as training data, and aim to generate high-performing neural network weights during inference. In this work, we examine four representative, well-known methods in this emerging area on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. |
Boya Zeng; Yida Yin; Zhiqiu Xu; Zhuang Liu; |
| 317 | Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although existing methods can be directly adapted to the VLM, since they only consider global-aware learning, the fine-grained attribute knowledge is underleveraged, leading to limited acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). |
Kunlun Xu; Haotong Cheng; Jiangmeng Li; Xu Zou; Jiahuan Zhou; |
| 318 | Motion-Aware Animatable Gaussian Avatars Deblurring Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a novel method for directly reconstructing sharp 3D human Gaussian avatars from blurry videos. |
Muyao Niu; Yifan Zhan; Qingtian Zhu; Zhuoxiao Li; Wei Wang; Zhihang Zhong; Xiao Sun; Yinqiang Zheng; |
| 319 | WorldGen: From Text to Traversable and Interactive 3D Worlds Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce WorldGen, a method for generating large, fully formed, navigable 3D worlds from a single text prompt. |
Dilin Wang; Hyunyoung Jung; Tom Monnier; Kihyuk Sohn; Chuhang Zou; Xiaoyu Xiang; Yu-Ying Yeh; Di Liu; Zixuan Huang; Thu Nguyen-Phuoc; Yuchen Fan; Sergiu Oprea; Ziyan Wang; Roman Shapovalov; Nikolaos Sarafianos; Thibault Groueix; Antoine Toisoul; Prithviraj Dhar; Xiao Chu; Minghao Chen; Geon Yeong Park; Rakesh Ranjan; Andrea Vedaldi; |
| 320 | Describe Anything Anywhere At Any Moment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods face a tradeoff, where producing rich open-vocabulary descriptions comes at the expense of real-time performance when these descriptions have to be grounded in 3D. To address these challenges, we propose Describe Anything, Anywhere, at Any Moment (DAAAM), a novel spatio-temporal memory framework for large-scale and real-time 4D scene understanding. |
Nicolas Gorlo; Lukas Schmid; Luca Carlone; |
| 321 | Efficient Frame Selection for Long Video Understanding Via Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This approach runs a high risk of missing critical visual information and constrains performance, especially for long videos. To address this problem, we propose a lightweight frame selection method to identify keyframes and train it via a two-stage strategy. |
Yaxuan Qin; Hefei Li; Wenqi Mu; Yancheng He; |
| 322 | UVU: Improving Multimodal Understanding Via Vision-Language Unified Autoregressive Paradigm Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we observe that pixel-level image patches and textual tokens coexist in raw high-dimensional spaces with inherent input symmetry. |
Zhehan Kan; Xinghua Jiang; Yanlin Liu; Xiaochen Yang; Zhixiang Wei; Shifeng Liu; Yubo Zhu; Qingmin Liao; Wenming Yang; Xin Li; Yinsong Liu; Deqiang Jiang; Xing Sun; |
| 323 | E-RayZer: Self-supervised 3D Reconstruction As Spatial Visual Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. |
Qitao Zhao; Hao Tan; Qianqian Wang; Sai Bi; Kai Zhang; Kalyan Sunkavalli; Shubham Tulsiani; Hanwen Jiang; |
| 324 | DLWM: Dual Latent World Models Enable Holistic Gaussian-centric Pre-training in Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce DLWM, a novel paradigm with Dual Latent World Models specifically designed to enable holistic Gaussian-centric pre-training in autonomous driving in two stages. |
Yiyao Zhu; Ying Xue; Haiming Zhang; Guangfeng Jiang; Wending Zhou; Xu Yan; Jiantao Gao; Yingjie Cai; Bingbing Liu; Zhen Li; Shaojie Shen; |
| 325 | SDUIE: Semi-Supervised Diffusion for Underwater Image Enhancement with Quant-Text Dual Control Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing enhancement methods have achieved promising performance, they typically overlook the subjective nature of visual preferences. To address this gap, we propose SDUIE, a level-aware Semi-supervised Diffusion framework for Underwater Image Enhancement that enables dual control through both quantitative and textual inputs. |
Xiaofeng Cong; Yu-Xin Zhang; Hao Shen; Yeying Jin; Junming Hou; Jie Gui; |
| 326 | Align Images Before You Generate Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Addressing this issue remains challenging, especially without any external geometric or semantic priors during the pure generative inference. In this paper, we introduce CorrAdapter, a plug-and-play adapter that discovers and exploits an innate property of the multi-image diffusion itself, aligning all output images before they are in fact generated. |
Shihua Zhang; Qiuhong Shen; Xinchao Wang; |
| 327 | Bi-directional Autoregressive Diffusion for Large Complex Motion Interpolation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We observe that these limitations arise from both the current full-sequence interpolation strategy and the pixel reconstruction training objective. To solve these challenges, we propose ARVFI, a novel video diffusion-based interpolation method for large complex motion interpolation. |
Yongrui Ma; Shijie Zhao; Mingde Yao; Junlin Li; Li Zhang; Xiaohong Liu; Qi Dou; Jinwei Gu; Tianfan Xue; |
| 328 | The Devil Is in The Details: Enhancing Video Virtual Try-On Via Keyframe-Driven Details Injection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: They also incur high computational costs due to additional interaction modules introduced into DiTs, while the limited scale and quality of existing public datasets also restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. |
Qingdong He; Xueqin Chen; Yanjie Pan; Peng Tang; Pengcheng Xu; Zhenye Gan; Chengjie Wang; Xiaobin Hu; Jiangning Zhang; Yabiao Wang; |
| 329 | MoCoDiff: A Controllable Autoregressive Diffusion Model for Expressive Motion Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose MoCoDiff, a controllable autoregressive diffusion framework that introduces Injection Modulation Controllers (IMC). |
Wenfeng Song; Xuehan Wang; Shuai Li; Yi Chen; Yuting Guo; Zhenyu Wu; Xingliang Jin; Chenglizhao Chen; Fei Hou; Hongyu Wu; Aimin Hao; |
| 330 | MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. |
Albert Dominguez Mantes; Gioele Manno; Martin Weigert; |
| 331 | Tracking By Predicting 3-D Gaussians Over Time Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. |
Tanish Baranwal; Himanshu Singh Singh; Jathushan Rajasegaran; Jitendra Malik; |
| 332 | SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Relying on binary success indicators wastes valuable information in failed trajectories, resulting in low training efficiency. To solve this, we propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework. |
Senyu Fei; Siyin Wang; Li Ji; Ao Li; Shiduo Zhang; Liming Liu; Jinlong Hou; Jingjing Gong; Xianzhong Zhao; Xipeng Qiu; |
| 333 | LIBERO-Plus: A Progressive Robustness Benchmark for Visual-Language-Action Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current simulation-based robustness evaluations suffer from narrow perturbation coverage, manual design constraints, and coarse-grained analysis that fails to reveal when and how models fail. To address this gap, we propose LIBERO-Plus, a comprehensive, automatic, and fine-grained evaluation framework with controlled perturbations across seven dimensions: object layouts, camera viewpoints, robot initial states, language instructions, lighting conditions, background textures, and sensor noise. |
Senyu Fei; Siyin Wang; Junhao Shi; Zihao Dai; Jikun Cai; Pengfang Qian; Li Ji; Xinzhe He; Shiduo Zhang; Zhaoye Fei; Jinlan Fu; Jingjing Gong; Xipeng Qiu; |
| 334 | VideoWorld 2: Learning Transferable Knowledge from Real-world Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a disentangled Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related changes. |
Zhongwei Ren; Yunchao Wei; Xiao Yu; Guixun Luo; Yao Zhao; Bingyi Kang; Jiashi Feng; Xiaojie Jin; |
| 335 | Visual Autoregressive Modeling Via Next Focus Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the conventional method uses uniform scale downsampling to build these pyramids, leading to aliasing artifacts that compromise fine details and introduce unwanted jaggies and moiré patterns. To tackle this issue, we present **FVAR**, which reframes the paradigm from *next-scale prediction* to *next-focus prediction*, mimicking the natural process of camera focusing from blur to clarity. |
Xiaofan Li; Chenming Wu; Yanpeng Sun; Jiaming Zhou; Delin Qu; Yansong Qu; Weihao Bo; Haibao Yu; Dingkang Liang; |
| 336 | Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or specified numbers of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. |
Hanchao Liu; Fang-Lue Zhang; Shining Zhang; Tai-Jiang Mu; Shi-Min Hu; |
| 337 | Scaling Instruction-Based Video Editing with A High-Quality Synthetic Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. |
Qingyan Bai; Qiuyu Wang; Hao Ouyang; Yue Yu; Hanlin Wang; Wen Wang; Ka Leong Cheng; Shuailei Ma; Yanhong Zeng; Zichen Liu; Yinghao Xu; Yujun Shen; Qifeng Chen; |
| 338 | TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, these approaches often neglect the importance of point-to-instance (P2I) relationships in topology reasoning. To address these limitations, we present TopoHR (Topological Hierarchical Representation), a novel end-to-end framework that establishes cyclic interaction between centerline detection and topology reasoning, allowing them to iteratively enhance each other. |
Yifeng Bai; Zhirong Chen; Bo Song; Erkang Cheng; Haibin Ling; |
| 339 | Stable and Efficient Single-Rollout RL for Multimodal Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this sample efficiency-stability trade-off, we introduce **MSSR** (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. |
Rui Liu; Dian Yu; Lei Ke; Haolin Liu; Yujun Zhou; Zhenwen Liang; Haitao Mi; Pratap Tokekar; Dong Yu; |
| 340 | Order Matters: 3D Shape Generation from Sequential VR Sketches Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce VRSketch2Shape, the first framework and multi-category dataset for 3D shape generation from sequential VR sketches. |
Yizi Chen; Sidi Wu; Tianyi Xiao; Nina Wiedemann; Loic Landrieu; |
| 341 | V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V$^{2}$-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. |
Jiancheng Pan; Runze Wang; Tianwen Qian; Mohammad Mahdi; Yanwei Fu; Xiangyang Xue; Xiaomeng Huang; Luc Van Gool; Danda Paudel; Yuqian Fu; |
| 342 | Pushing The Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Perception Encoder-Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. |
Apoorv Vyas; Heng-Jui Chang; Cheng-Fu Yang; Po-Yao Huang; Luya Gao; Julius Richter; Sanyuan Chen; Matthew Le; Piotr Dollár; Christoph Feichtenhofer; Ann Lee; Wei-Ning Hsu; |
| 343 | APPO: Attention-guided Perception Policy Optimization for Video Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Therefore, exploring how to enhance perception ability through reasoning, without the need for expensive fine-grained annotations, is worthwhile. To achieve this goal, we propose APPO, an Attention-guided Perception Policy Optimization algorithm that leverages token-level dense rewards to improve the model's fine-grained perception. |
Henghui Du; Chang Zhou; Xi Chen; Di Hu; |
| 344 | Beyond Single-View Sufficiency: CVBench for Cross-View Human Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing benchmarks for Multimodal Large Language Models (MLLMs) are overwhelmingly predicated on a sufficient-view assumption, rewarding single-view pattern recognition while failing to evaluate cross-view fusion. To address this critical gap, we introduce **CVBench**, a large-scale, multi-task benchmark for cross-view human understanding. |
Tianchen Guo; Chen Liu; Xin Yu; |
| 345 | X-Part: High Fidelity And Structure Coherent Shape Decomposition And Completion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we introduce X-Part, a controllable generative model designed to decompose a holistic 3D object into semantically meaningful and structurally coherent parts with high geometric fidelity. |
Xinhao Yan; Jiachen Xu; Yang Li; Changfeng Ma; Yunhan Yang; Chunshi Wang; Zibo Zhao; Zeqiang Lai; Yunfei Zhao; Zhuo Chen; Chunchao Guo; |
| 346 | PE3R: Perception-Efficient 3D Reconstruction Highlight: However, prevailing methods often suffer from limited generalization, reliance on per-scene optimization, and semantic inconsistencies across viewpoints. To address these limitations, we introduce PE3R, a tuning-free framework for efficient and generalizable 3D semantic reconstruction. |
Jie Hu; Shizun Wang; Xinchao Wang; |
| 347 | Efficient and High-Fidelity Omni Modality Retrieval Highlight: This limitation impedes the development of universal retrieval systems capable of comprehending queries that combine more than two modalities. To advance toward this goal, we present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. |
Chuong Huynh; Manh Luong; Abhinav Shrivastava; |
| 348 | OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting Highlight: In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. |
Hongjia Zhai; Qi Zhang; Xiaokun Pan; Xiyu Zhang; Yitong Dong; Huaqi Zhang; Dan Xu; Guofeng Zhang; |
| 349 | Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models Highlight: In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. |
Mark Endo; Serena Yeung; |
| 350 | Adaptive Action Chunking at Inference-time for Vision-Language-Action Models Highlight: Unfortunately, a dominant trend in current VLA models is an empirical fixed chunk length at inference-time, hindering their superiority and scalability across diverse manipulation tasks. To address this issue, we propose a novel Adaptive Action Chunking (AAC) strategy, which exploits action entropy as the cue to adaptively determine the chunk size based on current predictions. |
Yuanchang Liang; Xiaobo Wang; Kai Wang; Shuo Wang; Xiaojiang Peng; Haoyu Chen; David Chua; Prahlad Vadakkepat; |
| 351 | ChArtist: Generating Pictorial Charts with Unified Spatial and Subject Control Highlight: We present ChArtist, a domain-specific method for generating pictorial charts automatically, offering two distinct types of control: 1) spatial control that aligns well with the chart structure, and 2) subject-driven control that respects the visual characteristics of a reference image. |
Shishi Xiao; Tongyu Zhou; David H. Laidlaw; Gromit Yeuk-Yin Chan; |
| 352 | Photo-Guided Tooth Segmentation on 3D Oral Scan Model Highlight: Thus, we propose a novel Photo-guided 3D Model Tooth Segmentation framework, PMTSeg, that enhances 3D tooth segmentation by integrating texture cues from intraoral photos. |
Shaojie Zhuang; Guangshun Wei; Jiangxin He; Yuanfeng Zhou; |
| 353 | Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning Highlight: Yet these models primarily describe what they perceive and intend to do, rarely questioning whether their planned actions are safe or appropriate. This work introduces Counterfactual VLA (CF-VLA), a self-reflective VLA framework that enables the model to reason about and revise its planned actions before execution. |
Zhenghao Peng; Wenhao Ding; Yurong You; Yuxiao Chen; Wenjie Luo; Thomas Tian; Yulong Cao; Apoorva Sharma; Danfei Xu; Boris Ivanovic; Boyi Li; Yan Wang; Marco Pavone; |
| 354 | Material Magic Wand: Material-Aware Grouping of 3D Parts in Untextured Meshes Highlight: We introduce the problem of material-aware part grouping in untextured meshes. |
Umangi Jain; Vladimir G. Kim; Matheus Gadelha; Igor Gilitschenski; Zhiqin Chen; |
| 355 | InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy Highlight: It is generated through a highly autonomous, fully decoupled, and compositional simulation pipeline that enables flexible task assembly, long-horizon skill composition, and heterogeneous embodiments with minimal manual tuning. |
Yang Tian; Yuyin Yang; Yiman Xie; Zetao Cai; Xu Shi; Ning Gao; Hangxu Liu; Xuekun Jiang; Zherui Qiu; Feng Yuan; Yaping Li; Ping Wang; Junhao Cai; Jia Zeng; Hao Dong; Jiangmiao Pang; |
| 356 | PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes Highlight: To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. |
Christina Ourania Tze; Daniel Dauner; Yiyi Liao; Dzmitry Tsishkou; Andreas Geiger; |
| 357 | Foca-VLA: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation Highlight: We propose Foca-VLA, an end-to-end vision-language-action framework that equips robots with hybrid force-position control and explicit force awareness. |
Yang Li; Zhaxizhuoma Zhaxizhuoma; Hongru Jiang; Junjie Xia; Hongquan Zhang; Jinda Du; Yunsong Zhou; Jia Zeng; Ce Hao; Jieji Ren; Qiaojun Yu; Cewu Lu; Yu Qiao; Jiangmiao Pang; |
| 358 | Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation Highlight: We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. |
Taekyung Ki; Sangwon Jang; Jaehyeong Jo; Jaehong Yoon; Sung Ju Hwang; |
| 359 | Iris: Integrating Language Into Diffusion-based Monocular Depth Estimation Highlight: In this paper, we present Iris and investigate the benefits of our strategy to integrate text descriptions into training and inference of diffusion-based depth estimation models. |
Ziyao Zeng; Jingcheng Ni; Daniel Wang; Patrick Rim; Younjoon Chung; Fengyu Yang; Byung-Woo Hong; Alex Wong; |
| 360 | KV-Tracker: Real-Time Pose Tracking with Transformers Highlight: Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. |
Marwan Taher; Ignacio Alzugaray; Kirill Mazur; Xin Kong; Andrew J. Davison; |
| 361 | GeoDiff4D: Geometry-Aware Diffusion for 4D Head Avatar Reconstruction Highlight: We propose a novel framework that leverages geometry-aware diffusion to distill strong geometry priors for high-fidelity head avatar reconstruction. |
Chao Xu; Xiaochen Zhao; Xiang Deng; Jingxiang Sun; Donglin Di; Zhuo Su; Yebin Liu; |
| 362 | Rethinking Token Reduction for Large Vision-Language Models Highlight: In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. |
Yi Wang; Haofei Zhang; Qihan Huang; Anda Cao; Gongfan Fang; Wei Wang; Xuan Jin; Jie Song; Mingli Song; Xinchao Wang; |
| 363 | VidEoMT: Your ViT Is Secretly Also A Video Segmentation Model Highlight: Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. |
Narges Norouzi; Idil Esen Zulfikar; Niccolò Cavagnero; Tommie Kerssies; Bastian Leibe; Gijs Dubbelman; Daan de Geus; |
| 364 | Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion Highlight: To validate our propositions, three end-to-end lightweight baselines, alongside a set of innovative loss functions, are designed to cover three classic frameworks (CNN, Transformer, GAN). |
Yanglin Deng; Tianyang Xu; Chunyang Cheng; Hui Li; Xiaojun Wu; Josef Kittler; |
| 365 | RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video Highlight: This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. |
Haiyang Mei; Qiming Huang; Hai Ci; Mike Zheng Shou; |
| 366 | The Consistency Critic: Correcting Inconsistencies in Generated Images Via Reference-Guided Attentive Alignment Highlight: In this paper, our aim is to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach and present our ImageCritic. |
Ziheng Ouyang; Yiren Song; Yaoli Liu; Shihao Zhu; Qibin Hou; Ming-Ming Cheng; Mike Zheng Shou; |
| 367 | U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation Highlight: In this work, we introduce **U-Mind**, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. |
Xiang Deng; Feng Gao; Yong Zhang; Youxin Pang; Xu Xiaoming; Zhuoliang Kang; Xiaoming Wei; Yebin Liu; |
| 368 | Coupled Diffusion Sampling for Training-free Multi-view Image Editing Highlight: Given a collection of multi-view images, we perform consistent multi-view editing with a training-free framework using pre-trained 2D editing models and a generative multi-view model. |
Hadi Alzayer; Yunzhi Zhang; Chen Geng; Jia-Bin Huang; Jiajun Wu; |
| 369 | OneSparse: A Unified Framework for Sparse Activation Layers in Vision Models Highlight: Despite conceptual similarities, these paradigms have evolved independently, hindering systematic comparison and the development of modules that exploit their complementary strengths. To bridge this gap, we propose **OneSparse**, a unified framework that reformulates MoE and memory modules under a common abstraction. |
Xingkui Zhu; Dingkang Liang; Cheng Chen; Daoxin Zhang; Lv Hanxiang; Zhe Xu; Yao Hu; Xiang Bai; |
| 370 | RunawayEvil: Jailbreaking The Image-to-Video Generative Models Highlight: Existing attack methods remain confined to single-modal settings, relying solely on isolated text or image perturbations, which severely limits their effectiveness. To bridge this gap, we propose Runaway Evil, the first multimodal jailbreaking framework for I2V models with dynamic evolutionary capability. |
Yueming Lyu; Rufan Qian; Qinglong Liu; Linzhuang Zou; Jie Qin; Songhua Liu; Caifeng Shan; |
| 371 | Diffusion Probe: Generated Image Result Prediction Using CNN Probes Highlight: To address this, we first reveal a strong correlation between the attention distribution in the early diffusion process and the final image quality. Building upon this insight, we introduce **Diffusion Probe**, a pioneering framework that leverages the model’s internal cross-attention maps as a predictive signal. |
Bukun Huang; Benlei Cui; Zhizeng Ye; Xuemei Dong; Tuo Chen; Hui Xue; Dingkang Yang; Longtao Huang; Haiwen Hong; Jingqun Tang; |
| 372 | Unleashing Vision-Language Semantics for Video Deepfake Detection Highlight: However, existing approaches focus on leveraging visual features only, overlooking their most distinctive strength — the rich vision-language semantics embedded in the latent space. We propose VLAForge, a novel DFD framework that unleashes the potential of such cross-modal semantics in enhancing the model’s discriminability in deepfake detection. |
Jiawen Zhu; Yunqi Miao; Xueyi Zhang; Jiankang Deng; Guansong Pang; |
| 373 | From Softmax to Dirichlet: Evidential Learning for Semi-supervised Semantic Segmentation Highlight: In this paper, we propose a novel evidential learning framework to explicitly model the prediction uncertainty for reliable pseudo-label selection. |
Huayu Mai; Rui Sun; Yujia Chen; Wangkai Li; Bingzhou Wang; Aibing Li; Zhangyu He; Yuan Wang; |
| 374 | MeanFlow Transformers with Representation Autoencoders Highlight: In this work, we develop an efficient training and sampling scheme for MF in the latent space of a Representation Autoencoder (RAE), where a pre-trained vision encoder (e.g., DINO) provides semantically rich latents paired with a lightweight decoder. |
Zheyuan Hu; Chieh-Hsin Lai; Ge Wu; Yuki Mitsufuji; Stefano Ermon; |
| 375 | AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Affordance Correspondence Highlight: Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often limited to specific object shapes due to the constrained data diversity. Leveraging powerful 3D generative models and vision foundation models (VFM), the proposed AffordGen framework overcomes this limitation by utilizing the semantic correspondence of meaningful keypoints across large-scale 3D meshes to generate new robot manipulation trajectories. |
Jiawei Zhang; Kaizhe Hu; Yingqian Huang; Yuanchen Ju; Zhengrong Xue; Huazhe Xu; |
| 376 | Robust Remote Sensing Image–Text Retrieval with Noisy Correspondence Highlight: Based on the above observations, we reveal an important but untouched problem in RSITR, i.e., Noisy Correspondence (NC). To overcome these challenges, we propose a novel Robust Remote Sensing Image–Text Retrieval (RRSITR) paradigm that designs a self-paced learning strategy to mimic human cognitive learning patterns, thereby learning from easy to hard on multi-modal data with NC. |
Qiya Song; Yiqiang Xie; Yuan Sun; Renwei Dian; Xudong Kang; |
| 377 | ARGUS: Defending Against Multimodal Indirect Prompt Injection Via Steering Instruction-Following Behavior Highlight: However, we also found that a naive defense direction could be coupled with a utility-degrading direction, and excessive intervention strength harms model performance. To address this, we propose ARGUS, which searches for an optimal defense direction within the safety subspace that decouples from the utility degradation direction, further combining adaptive strength steering to achieve a better safety-utility trade-off. |
Weikai Lu; Ziqian Zeng; Kehua Zhang; Haoran Li; Huiping Zhuang; Ruidong Wang; Cen Chen; Hao Peng; |
| 378 | When Visualizing Is The First Step to Reasoning: MIRA, A Benchmark for Visual Chain-of-Thought Highlight: We propose MIRA (Multimodal Imagination for Reasoning Assessment), a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. |
Yiyang Zhou; Haoqin Tu; Zijun Wang; Zeyu Wang; Niklas Muennighoff; Fan Nie; Chaorui Deng; Shen Yan; Haoqi Fan; Yejin Choi; James Zou; Cihang Xie; Huaxiu Yao; Qinghao Ye; |
| 379 | FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation Highlight: We present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. |
Cheng Peng; Zhuo Su; Liao Wang; Chen Guo; Zhaohu Li; Chengjiang Long; Zheng Lv; Jingxiang Sun; Chenyangguang Zhang; Yebin Liu; |
| 380 | OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory Highlight: In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. |
Zhaochong An; Menglin Jia; Haonan Qiu; Zijian Zhou; Xiaoke Huang; Zhiheng Liu; Weiming Ren; Kumara Kahatapitiya; Ding Liu; Sen He; Chenyang Zhang; Tao Xiang; Fanny Yang; Serge Belongie; Tian Xie; |
| 381 | Same or Not? Enhancing Visual Perception in Vision-Language Models Highlight: …over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. |
Damiano Marsili; Aditya Mehta; Ryan Lin; Georgia Gkioxari; |
| 382 | Semantic Context Matters: Improving Conditioning for Autoregressive Models Highlight: However, extending AR models to controllable image editing remains challenging due to weak and inefficient conditioning strategies, which often lead to suboptimal semantic alignment and visual quality. To address this limitation, we present SCAR, a Semantic-Context-driven method for AutoregRessive models. |
Dongyang Jin; Ryan Xu; Jianhao Zeng; Rui Lan; Yancheng Bai; Lei Sun; Xiangxiang Chu; |
| 383 | Matching Every Pair to Track Every Point: PairFormer for All-Pairs Tracking and Video Trajectory Fields Highlight: To this end, we propose PairFormer, a feed-forward transformer that addresses APT in a single pass. |
Guangyang Wu; Youran Ding; Xinyu Che; Benyuan Sun; Yi Yang; Xiaohong Liu; |
| 384 | Customized Fusion: A Closed-Loop Dynamic Network for Adaptive Multi-Task-Aware Infrared-Visible Image Fusion Highlight: Infrared-visible image fusion aims to integrate complementary information for robust visual understanding, but existing fusion methods struggle with simultaneously adapting to multiple downstream tasks. To address this issue, we propose a Closed-Loop Dynamic Network (CLDyN) that can adaptively respond to the semantic requirements of diverse downstream tasks for task-customized image fusion. |
Zengyi Yang; Yu Liu; Juan Cheng; Zhiqin Zhu; Yafei Zhang; Huafeng Li; |
| 385 | Mesh-Pro: Asynchronous Advantage-guided Ranking Preference Optimization for Artist-style Quadrilateral Mesh Generation Highlight: In this work, we aim to enhance both the training efficiency and generation quality of RL in 3D mesh generation. |
Zhen Zhou; Jian Liu; Biwen Lei; Jing Xu; Haohan Weng; Yiling Zhu; Zhuo Chen; Junfeng Fan; Yunkai Ma; Dazhao Du; Song Guo; Fengshui Jing; Chunchao Guo; |
| 386 | Dynamic Important Example Mining for Reinforcement Finetuning Highlight: We propose **Dynamic Important Example Mining (DIEM)**, a principled and fully automated framework that makes data utilization adaptive throughout RFT. |
Haoru Tan; Sitong Wu; Yanfeng Chen; Shizhen Zhao; Yangtian Sun; Tianjia Liu; Chirui Chang; Shaofeng Zhang; Xingwu Sun; Xiuzhe Wu; Ruobing Xie; Xiaojuan Qi; |
| 387 | Dataset Distillation Via Influence Matching Highlight: Concretely, we introduce a fully differentiable, sample-level influence estimator that quantifies parameter shifts from adding or removing data, without time-consuming inverse-Hessian products or convexity assumptions. |
Haoru Tan; Wang Wang; Sitong Wu; Xiuzhe Wu; Yangtian Sun; Chirui Chang; Shaofeng Zhang; Xiaojuan Qi; |
| 388 | CoD: A Diffusion Foundation Model for Image Compression Highlight: To address it, we introduce CoD, the first Compression-oriented Diffusion foundation model, trained from scratch to enable end-to-end optimization of both compression and generation. |
Zhaoyang Jia; Zihan Zheng; Naifu Xue; Jiahao Li; Bin Li; Zongyu Guo; Xiaoyi Zhang; Houqiang Li; Yan Lu; |
| 389 | Alternative Reprogramming for Service Models Highlight: These models can be less sensitive to the input perturbations ZOO relies on, thereby hindering performance gains. To address these limitations, we propose an Alternative efficient Reprogramming approach for Service models (AReS). |
Yunbei Zhang; Chengyi Cai; Feng Liu; Jihun Hamm; |
| 390 | MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models Highlight: In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. |
Mingrui Wu; Hang Liu; Jiayi Ji; Xiaoshuai Sun; Rongrong Ji; |
| 391 | TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition Highlight: Consequently, although proprietary models have continuously pushed the performance boundary, open-source models, often trained with limited resources and, in practice, the only viable option for many due to privacy regulations, still lag far behind. To bridge this gap, we introduce TRivia, a self-supervised fine-tuning method that enables pretrained VLMs to learn TR directly from unlabeled table images in the wild. |
Junyuan Zhang; Bin Wang; Qintong Zhang; Fan Wu; Zichen Wen; Jialin Lu; Junjie Shan; Ziqi Zhao; Shuya Yang; Ziling Wang; Ziyang Miao; Huaping Zhong; Yuhang Zang; Xiaoyi Dong; Ka-Ho Chow; Conghui He; |
| 392 | See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding Highlight: We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. |
Bo-Yuan Sun; Bo-Wen Yin; Yuan-Ming Li; Xihan Wei; Qibin Hou; |
| 393 | GeoAgent: Learning to Geolocate Everywhere with Reinforced Geographic Characteristics Highlight: This paper presents GeoAgent, a model capable of reasoning closely with humans and deriving fine-grained address conclusions. |
Modi Jin; Yiming Zhang; Bo-Yuan Sun; Dingwen Zhang; Ming-Ming Cheng; Qibin Hou; |
| 394 | Revisiting The Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization Highlight: We study how different Chain-of-Thought (CoT) designs affect the acquisition of the generalizable visual reasoning ability in vision-language models (VLMs). |
Yifan Du; Kun Zhou; Yingqian Min; Yue Ling; Xin Zhao; Youbin Wu; Ji-Rong Wen; |
| 395 | Image-to-Point Cloud Feature Back-projection for Multimodal Training of 3D Semantic Segmentation Highlight: This paper proposes **I**mage-to-**P**oint Cloud **F**eature Back-**P**rojection (**IPFP**), a novel method for training multimodal fusion networks that back-projects aggregated image-feature centers (from non-projection-aligned image pixels) into the point-cloud feature set via the estimated depth map. |
Jiawei Han; Matteo Poggi; Huan Li; Changshuo Wang; Kaiqi Liu; Wei Li; |
| 396 | Rethinking MLLM Itself As A Segmenter with A Single Segmentation Token Highlight: This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, which eliminates the need for external decoders. |
Anqi Zhang; Xiaokang Ji; Guangyu Gao; Jianbo Jiao; Chi Harold Liu; Yunchao Wei; |
| 397 | Do You Have Freestyle? Expressive Humanoid Locomotion Via Audio Control Highlight: We propose RoboPerform, the first unified audio-to-locomotion framework that can directly generate music-driven dance and speech-driven co-speech gestures from audio. |
Zhe Li; Cheng Chi; Yangyang Wei; Boan Zhu; Tao Huang; Zhenguo Sun; Yibo Peng; Pengwei Wang; Zhongyuan Wang; Fangzhou Liu; Chang Xu; Shanghang Zhang; |
| 398 | Meta-CoT: Enhancing Granularity and Generalization in Image Editing Highlight: However, a critical question remains underexplored: what forms of CoT and training strategy can jointly enhance both the understanding granularity and generalization? To address this, we propose Meta-CoT, a paradigm that performs a two-level decomposition of any single-image editing operation with two key properties: (1) Decomposability. |
Shiyi Zhang; Yiji Cheng; Tiankai Hang; Zijin Yin; Runze He; Yu Xu; Wenxun Dai; Yunlong Lin; Chunyu Wang; Qinglin Lu; Yansong Tang; |
| 399 | MapReduce LoRA: Advancing The Pareto Front in Multi-Preference Optimization for Generative Models Highlight: However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). |
Chieh-Yun Chen; Zhonghao Wang; Qi Chen; Zhifan Ye; Min Shi; Yue Zhao; Yinan Zhao; Hui Qu; Wei-An Lin; Yiru Shen; Ajinkya Kale; Irfan Essa; Humphrey Shi; |
| 400 | MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis Highlight: We present **MatPedia**, a foundation model built upon a novel joint RGB-PBR representation that compactly encodes materials into two interdependent latents: one for RGB appearance and one for the four PBR maps encoding complementary physical properties. |
Di Luo; Shuhui Yang; Mingxin Yang; Jiawei Lu; Yixuan Tang; Xintong Han; Zhuo Chen; Beibei Wang; Chunchao Guo; |
| 401 | Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction Highlight: Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors. This work addresses these limitations in two complementary ways. |
Wenfei Guan; Jilin Mei; Tong Shen; Xumin Wu; Shuo Wang; Chen Min; Yu Hu; |
| 402 | Bridging Facial Understanding and Animation Via Language Models Highlight: Text-guided human body animation has advanced rapidly, yet facial animation lags due to the scarcity of well-annotated, text-paired facial corpora. To close this gap, we leverage foundation generative models to synthesize a large, balanced corpus of facial behavior. |
Luchuan Song; Pinxin Liu; Haiyang Liu; Zhenchao Jin; Yunlong Tang; Zichong Xu; Susan Liang; Jing Bi; Jason Corso; Chenliang Xu; |
| 403 | 4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction Highlight: We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. |
Kirill Mazur; Marwan Taher; Andrew J. Davison; |
| 404 | Taming The Long Tail: Rebalancing Adversarial Training Via Adaptive Perturbation Highlight: Furthermore, we show that perturbations can simultaneously address both adversarial vulnerability and class imbalance. Based on these insights, we propose Rebalanced Adversarial Intensity for Long-Tailed Data (RAIL), a plug-and-play framework that adaptively adjusts perturbations during adversarial training. |
Lilin Zhang; Yimo Guo; Li Yue; Jiancheng Shi; Xianggen Liu; |
| 405 | WeDetect: Fast Open-Vocabulary Object Detection As Retrieval Highlight: Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, i.e., matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. |
Shenghao Fu; Yukun Su; Fengyun Rao; Jing Lyu; Xiaohua Xie; Wei-Shi Zheng; |
| 406 | MeshSplatting: Differentiable Rendering with Opaque Meshes Highlight: We present MeshSplatting, a mesh-based reconstruction approach that jointly optimizes geometry and appearance through differentiable rendering. |
Jan Held; Sanghyun Son; Renaud Vandeghen; Daniel Rebain; Matheus Gadelha; Yi Zhou; Anthony Cioppa; Ming Lin; Marc Van Droogenbroeck; Andrea Tagliasacchi; |
| 407 | Visual Diffusion Models Are Geometric Solvers Highlight: In this paper we show that visual diffusion models can serve as effective geometric solvers: they can directly reason about geometric problems by working in pixel space. |
Nir Goren; Shai Yehezkel; Omer Dahary; Andrey Voynov; Or Patashnik; Daniel Cohen-Or; |
| 408 | BrickNet: Graph-Backed Generative Brick Assembly Highlight: We train a language model to generate LEGO®-brick build sequences. |
Peter Kulits; Cordelia Schmid; |
| 409 | Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. |
Haojie Zheng; Shuchen Weng; Jingqi Liu; Siqi Yang; Boxin Shi; Xinlong Wang; |
| 410 | DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. |
Peiying Zhang; Nanxuan Zhao; Matthew Fisher; Yiran Xu; Jing Liao; Difan Liu; |
| 411 | Extending Embodied Question Answering from Perception to Decision Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present EQA-Decision, a large-scale embodied QA dataset that systematically covers four complementary dimensions of embodied reasoning: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. |
Xicheng Gong; Qiwei Li; Peiran Xu; Yadong Mu; |
| 412 | MM-OVSeg: Multimodal Optical–SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing Highlight: To address the cross-modal domain gap and the limited dense prediction capability of current vision–language models, we propose two key designs: a cross-modal unification process for multi-sensor representation alignment, and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned multimodal segmentation. |
Yimin Wei; Aoran Xiao; Hongruixuan Chen; Junshi Xia; Naoto Yokoya;
| 413 | RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark Highlight: Existing evaluation paradigms, which primarily assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. |
Yang Shi; Yuhao Dong; Yue Ding; Yuran Wang; Xuanyu Zhu; Sheng Zhou; Wenting Liu; Haochen Tian; Rundong Wang; Huanqian Wang; Zuyan Liu; Bohan Zeng; Ruizhe Chen; Qixun Wang; Zhuoran Zhang; Xinlong Chen; Chengzhuo Tong; Bozhou Li; Qiang Liu; Haotian Wang; Wenjing Yang; Yuanxing Zhang; Pengfei Wan; YiFan Zhang; Ziwei Liu;
| 414 | Plug-and-Play Incomplete Multi-View Clustering Via Janus-Faced Affinity Learning with Topology Harmonization Highlight: The reliance on carefully-tuned regularization hyper-parameters also usually undermines the model’s practical utility. To alleviate these issues, we propose a plug-and-play IMVC framework named PJFTH that incorporates Janus-faced affinity learning with topology harmonization. |
Shengju Yu; Suyuan Liu; Wenhao Shao; Siwei Wang; Ke Liang; Xihong Yang; Tiejun Li; Xinwang Liu;
| 415 | UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos Highlight: We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision–language–action (VLA) policy and a practical human-data capture setup for universal dexterous hand control. |
Gu Zhang; Qicheng Xu; Haozhe Zhang; Jianhan Ma; Long He; Yiming Bao; Zeyu Ping; Zhecheng Yuan; Chenhao Lu; Chengbo Yuan; Tianhai Liang; Xiaoyu Tian; Maanping Shao; Feihong Zhang; Mingyu Ding; Yang Gao; Hao Zhao; Hang Zhao; Huazhe Xu; |
| 416 | AnimaMimic: Imitating 3D Animation from Video Priors Highlight: We present AnimaMimic, a framework that animates static 3D meshes using motion priors learned from video diffusion models. |
Tianyi Xie; Yunuo Chen; Yaowei Guo; Yin Yang; Bolei Zhou; Demetri Terzopoulos; Ying Jiang; Chenfanfu Jiang; |
| 417 | Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding Highlight: In this work, we evaluate whether discrete DVLMs can serve as a viable alternative to AR models for GUI grounding. |
Shrinidhi Kumbhar; Haofu Liao; Srikar Appalaraju; Kunwar Yashraj Singh;
| 418 | Elastic Weight Consolidation Done Right for Continual Learning Highlight: In this paper, we conduct a systematic analysis of importance estimation in EWC from a gradient-based perspective. |
Xuan Liu; Xiaobin Chang; |
| 419 | LoL: Longer Than Longer, Scaling Video Generation to Hour Highlight: Our analysis reveals that sink-collapse originates from an inherent conflict between the periodic structure of Rotary Position Embedding (RoPE) and the multi-head attention mechanisms prevalent in current generative models. To address it, we propose a lightweight, training-free approach that effectively suppresses this behavior by introducing multi-head RoPE jitter that breaks inter-head attention homogenization and mitigates long-horizon collapse. |
Jiaxing Cui; Jie Wu; Ming Li; Tao Yang; Xiaojie Li; Rui Wang; Andrew Bai; Yuanhao Ban; Cho-Jui Hsieh; |
| 420 | YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal Highlight: In this paper, we present YOSE — You Only Select Essential Tokens, an efficient fine-tuning framework. |
Chenyang Wu; Lina Lei; Fan Li; Chun-Le Guo; Dehong Kong; Xinran Qin; Zhixin Wang; Ming-Ming Cheng; Chongyi Li;
| 421 | AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models Highlight: However, this protocol has two key blind spots: (i) Instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than VFMs’ visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities in a single question, making it difficult to determine whether errors arise from the lack of all required abilities or just one key ability. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs), foundational skills such as localization, depth estimation, and spatial understanding, which collectively support complex visual reasoning tasks. |
Zheda Mai; Arpita Chowdhury; Zihe Wang; Sooyoung Jeon; Lemeng Wang; Jiacheng Hou; Jihyung Kil; Wei-Lun Chao; |
| 422 | Image Diffusion Preview with Consistency Solver Highlight: In this paper, we propose ConsistencySolver, a lightweight, trainable high-order solver derived from general linear multistep methods and optimized via reinforcement learning, which enhances preview quality and consistency. |
Fu-Yun Wang; Hao Zhou; Liangzhe Yuan; Sanghyun Woo; Boqing Gong; Bohyung Han; Ming-Hsuan Yang; Han Zhang; Yukun Zhu; Ting Liu; Long Zhao; |
| 423 | PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image Highlight: Specifically, we propose the first VLM-based physical 3D generative model, along with a new 3D representation that efficiently tokenizes geometry. |
Ziang Cao; Fangzhou Hong; Zhaoxi Chen; Liang Pan; Ziwei Liu; |
| 424 | The Power of Prior: Training-Free Open-Vocabulary Semantic Segmentation with LLaVA Highlight: Thus, in this paper, we explore the utilization of LLaVA for training-free open-vocabulary semantic segmentation. |
Bingfeng Zhang; Siyue Yu; Hui Li; Jiahua Lin; Wenwu Wang; Jimin Xiao; |
| 425 | TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond Highlight: Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting the fidelity and completeness of the generated content. To overcome these challenges, we introduce TEXTRIX, a native 3D attribute generation framework for high-fidelity texture synthesis and downstream applications such as precise 3D part segmentation. |
Yifei Zeng; Bao Yajie; Jiachen Qian; Shuang Wu; Youtian Lin; Hao Zhu; Buyu Li; Feihu Zhang; Xun Cao; Yao Yao; |
| 426 | Mario: Multimodal Graph Reasoning with Large Language Models Highlight: Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address them, we propose Mario, a unified framework that simultaneously resolves both challenges and enables effective LLM-based reasoning over MMGs. |
Yuanfu Sun; Kang Li; Pengkang Guo; Jiajin Liu; Qiaoyu Tan; |
| 427 | Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following Highlight: Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. |
Tianyi Xiong; Yi Ge; Ming Li; Zuolong Zhang; Pranav Kulkarni; Kaishen Wang; Qi He; Zeying Zhu; Chenxi Liu; Ruibo Chen; Tong Zheng; Yanshuo Chen; Xiyao Wang; Renrui Zhang; Wenhu Chen; Heng Huang; |
| 428 | Simple But Effective Triplet-Based Compression Strategies for Compact Visual Localization Highlight: In this paper, we focus on compressing the 3D structure of the scene by selecting a subset of points from a Structure-from-Motion (SfM) point cloud. |
Torsten Sattler; Zuzana Kukelova; |
| 429 | Text-Driven 3D Hand Motion Generation from Sign Language Data Highlight: Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, and finger/hand/arm movements. |
Léore Bensabath; Mathis Petrovich; Gul Varol; |
| 430 | DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching Highlight: This paper introduces the first distillation-compatible learnable feature caching mechanism. |
Chang Zou; Changlin Li; Songtao Liu; Zhao Zhong; Kailin Huang; Linfeng Zhang; |
| 431 | GSV2X: Geometry-Aware Uncertainty Modeling and Orthogonal Fusion for Robust Roadside Perception Highlight: Reliable 3D perception from multi-view roadside sensors hinges on the robust fusion of camera and LiDAR data, a task complicated by geometric misalignments and sensor calibration errors. This paper presents GSV2X, a fusion framework that tackles these challenges through two core contributions. First, to achieve robustness against spatial uncertainty, we lift 2D image features into a unified Bird’s-Eye-View (BEV) space by representing them as 3D Gaussian distributions. |
Jianqiang Xu; Gensheng Pei; Huafeng Liu; Yazhou Yao;
| 432 | Reevaluating The Intra-modal Misalignment Hypothesis in CLIP Highlight: The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated similarities between images. In this study, we question this intra-modal misalignment hypothesis. |
Jonas Herzog; Yue Wang; |
| 433 | When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models Highlight: We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. |
Zhengyang Sun; Yu Chen; Xin Zhou; Xiaofan Li; Xiwu Chen; Dingkang Liang; Xiang Bai; |
| 434 | Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos Highlight: We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. |
Ziren Gong; Xiaohan Li; Fabio Tosi; Jiawei Han; Stefano Mattoccia; Jianfei Cai; Matteo Poggi;
| 435 | MD2E: Modeling Depth-to-Edge Cues for Monocular Metric Depth Estimation Highlight: We propose MD2E, a method that models depth-to-edge cues by deriving edge targets from depth annotations, calibrating metric scale using the spectral score, and using edge predictions to regularize depth boundaries while producing metric depth. |
Chao Ning; Minghe Shen; Naoto Yokoya; |
| 436 | Multi-Scale Speculative Decoding Highlight: In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. |
Elia Peruzzo; Guillaume Sautiere; Amirhossein Habibian; |
| 437 | Linking Modality Isolation in Heterogeneous Collaborative Perception Highlight: Existing alignment methods rely on supervision from spatially overlapping observations and thus fail to handle modality isolation. To address this challenge, we propose CodeAlign, the first efficient, co-occurrence-free alignment framework that smoothly aligns modalities via cross-modal feature-code-feature (FCF) translation. |
Changxing Liu; Zichen Chao; Siheng Chen; |
| 438 | DocPrune: Efficient Document Question Answering Via Background, Question, and Comprehension-aware Token Pruning Highlight: We observe that existing token reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DocPrune, a training-free document token pruning framework designed for efficient long document understanding. |
Joonmyung Choi; Sanghyeok Lee; Jongha Kim; Sehyung Kim; Dohwan Ko; Jihyung Kil; Hyunwoo J. Kim; |
| 439 | StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation Highlight: The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. |
Ke Xing; Longfei Li; Yuyang Yin; Hanwen Liang; Guixun Luo; Chen Fang; Jue Wang; Konstantinos N. Plataniotis; Xiaojie Jin; Yao Zhao; Yunchao Wei;
| 440 | DABO: Difficulty-Aware Bayesian Optimization with Diffusion-Learned Priors Highlight: This approach leads to inefficient resource allocation, wasting budget in simple regions while under-exploring complex, rugged landscapes, and thereby critically undermining both search efficiency and final performance. To address this universal challenge, we introduce DABO, a framework that pioneers difficulty-aware tuning within the efficient context of Freeze-Thaw Bayesian Optimization. |
Mengyang Li; Pinlong Zhao; |
| 441 | AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation Highlight: Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent’s state and task progress in a fully end-to-end and data-driven manner. |
Wenxuan Guo; Xiuwei Xu; Yichen Liu; Xiangyu Li; Hang Yin; Huangxing Chen; Wenzhao Zheng; Jianjiang Feng; Jie Zhou; Jiwen Lu; |
| 442 | Beyond Generation: Advancing Image Editing Priors for Depth and Normal Estimation Highlight: Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by refining their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that is the first to adapt an advanced editing model based on the Diffusion Transformer (DiT) architecture for dense geometry prediction. |
Jiyuan Wang; Chunyu Lin; Lei Sun; Rongying Liu; Lang Nie; Mingxing Li; Kang Liao; Xiangxiang Chu;
| 443 | QuietPrune: Query-Guided Early Token Pruning for Vision-Language Models Highlight: In this paper, we propose QuietPrune, a QUery-guIded Early Token Pruning method to remove redundant visual tokens within VLMs, thereby enhancing computational efficiency. |
Tianxiao Gao; Shanwei Zhao; Shuo Fang; Shiai Zhu; Chenguang Ma; |
| 444 | ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body Highlight: We introduce ViBES (Voice in Behavioral Expression and Synchrony), a conversational 3D agent that jointly plans language and movement and executes dialogue-conditioned body actions. |
Juze Zhang; Changan Chen; Xin Chen; Heng Yu; Tiange Xiang; Ali Sartaz Khan; Shrinidhi Kowshika Lakshmikanth; Ehsan Adeli; |
| 445 | Coverage Optimization for Camera View Selection Highlight: Our key insight is that informative views can be obtained by minimizing a tractable approximation of the Fisher Information Gain, which reduces to favoring viewpoints that cover geometry that has been insufficiently observed by past cameras. |
Timothy Chen; Adam Dai; Maximilian Adang; Grace Gao; Mac Schwager; |
| 446 | Feed-forward Gaussian Registration for Head Avatar Creation and Editing Highlight: We present MATCH (Multi-view Avatars from Topologically Corresponding Heads), a multi-view Gaussian registration method for high-quality head avatar creation and editing. |
Malte Prinzler; Paulo Gotardo; Siyu Tang; Timo Bolkart; |
| 447 | Reliev3R: Relieving Feed-forward 3D Reconstruction from Multi-View Geometric Annotations Highlight: In this paper, we propose Reliev3R, a weakly-supervised paradigm for training feed-forward 3D reconstruction models (FFRMs) from scratch without cost-prohibitive multi-view geometric annotations. |
Youyu Chen; Junjun Jiang; Yueru Luo; Kui Jiang; Xianming Liu; Xu Yan; Dave Zhenyu Chen; |
| 448 | TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection Highlight: In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. |
Jian-Yu Jiang-Lin; Kang-Yang Huang; Ling Zou; Ling Lo; Sheng-Ping Yang; Yu-Wen Tseng; Kun-Hsiang Lin; Chia-Ling Chen; Yu-Ting Ta; Yan-Tsung Wang; Po-Ching Chen; Hongxia Xie; Hong-Han Shuai; Wen-Huang Cheng;
| 449 | Few-shot Acoustic Synthesis with Multimodal Flow Matching Highlight: We introduce FLow-matching ACoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. |
Amandine Brunetto; |
| 450 | TokenTrace: Multi-Concept Attribution Through Watermarked Token Recovery Highlight: In this work, we introduce TokenTrace, a novel proactive watermarking framework for robust, multi-concept attribution. |
Li Zhang; Shruti Agarwal; John Collomosse; Pengtao Xie; Vishal Asnani; |
| 451 | Efficient Video Object Segmentation and Tracking with Recurrent Dynamic Submodel Highlight: Furthermore, both static and dynamic approaches typically focus on visual features of individual frames while neglecting the temporal correlations between them, limiting their performance in handling complex video streams. To address these challenges, we propose Recurrent Dynamic Submodel (RDS), a dynamic architecture that adaptively selects submodel blocks for each frame. |
Weidong Tang; Zhiyuan Liang; Xinyan Wan; Chen Zhu; Zhaopan Xu; Pengfei Zhou; Yan Song; Yang You; Wangbo Zhao; |
| 452 | GenMatter: Perceiving Physical Objects with Generative Matter Models Highlight: Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion and appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. |
Eric Li; Arijit Dasgupta; Yoni Friedman; Mathieu Huot; Vikash Mansinghka; Thomas O'Connell; William Freeman; Joshua B. Tenenbaum; |
| 453 | MV-Fashion: Towards Enabling Virtual Try-On and Size Estimation with Multi-View Paired Data Highlight: Synthetic datasets suffer from a realism gap, whereas real-world captures lack the detailed annotations and paired data required for virtual try-on (VTON) and size estimation tasks. To bridge this gap, we introduce MV-Fashion, a large-scale, multi-view video dataset engineered for domain-specific fashion analysis. |
Hunor Laczko; Libang Jia; Phat Truong; Diego Hernández; Sergio Escalera; Jordi Gonzàlez; Meysam Madadi; |
| 454 | LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes Highlight: We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. |
Ruofan Liang; Norman Müller; Ethan Weber; Duncan Zauss; Nandita Vijaykumar; Peter Kontschieder; Christian Richardt; |
| 455 | Reviving ConvNeXt for Efficient Convolutional Diffusion Models Highlight: Here we introduce the fully convolutional diffusion model (FCDM), a ConvNeXt-inspired backbone redesigned for conditional diffusion modeling. |
Taesung Kwon; Lorenzo Bianchi; Lennart Wittke; Felix Watine; Fabio Carrara; Jong Chul Ye; Romann M. Weber; Vinicius C. Azevedo; |
| 456 | From Manuals to Actions: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation Highlight: In this paper, we introduce ManualVLA, a unified VLA framework built upon a Mixture-of-Transformers (MoT) architecture, enabling coherent collaboration between multimodal manual generation and action execution. |
Chenyang Gu; Jiaming Liu; Hao Chen; Runzhong Huang; Qingpo Wuwu; Xiaoqi Li; Zhuoyang Liu; Ying Li; Renrui Zhang; Peng Jia; Pheng-Ann Heng; Shanghang Zhang; |
| 457 | Multi-view Consistent 3D Gaussian Head Avatars ‘without’ Multi-view Generation Highlight: For high scalability, avatars must be generated from minimal resources, without costly MV studio captures or any 3D data. In this work, we target this challenging minimal-resource setting for 3D head generation. |
Aviral Chharia; Fernando De la Torre; |
| 458 | DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding Highlight: This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured “Analysis, Localization and Reasoning” workflow. |
Hao Yan; Yuliang Liu; Xingchen Liu; Yuyi Zhang; Minghui Liao; Jihao Wu; Wei Chen; Xiang Bai; |
| 459 | PointTPA: Test-Time Parameter Adaptation for 3D Scene Understanding Highlight: We propose Test-time Parameter Adaptation for Point Cloud Scene Perception (PointTPA), a test-time dynamic adaptation framework that constructs input-aware parameters for scene-level point clouds. |
Siyuan Liu; Chaoqun Zheng; Xin Zhou; Tianrui Feng; Dingkang Liang; Xiang Bai; |
| 460 | DyaDiT: A Multi-Modal Diffusion Transformer for Socially-Aware Dyadic Gesture Generation Highlight: We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. |
Yichen Peng; Jyun-Ting Song; Siyeol Jung; Ruofan Liu; Haiyang Liu; Xuangeng Chu; Ruicong Liu; Erwin Wu; Hideki Koike; Kris Kitani;
| 461 | BAMI: Training-Free Bias Mitigation in GUI Grounding Highlight: Utilizing the proposed Masked Prediction Distribution (MPD) attribution method, we identify that the primary sources of errors are twofold: high image resolution (leading to precision bias) and intricate interface elements (resulting in ambiguity bias). To address these challenges, we introduce Bias-Aware Manipulation Inference (BAMI), which incorporates two key manipulations, coarse-to-fine focus and candidate selection, to effectively mitigate these biases. |
Borui Zhang; Bo Zhang; Bo Wang; Wenzhao Zheng; Yuhao Cheng; Liang Tang; Yiqiang Yan; Jie Zhou; Jiwen Lu; |
| 462 | Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models Highlight: We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. |
Zhuan Shi; Alireza Dehghanpour Farashah; Rik de Vries; Golnoosh Farnadi; |
| 463 | AXG-Reasoner: Error Detection and Explanation in Long Task Videos with Vision–Language Models Highlight: In this paper, we address the problem of error reasoning in long task videos, i.e., detecting and explaining errors. |
Shih-Po Lee; Ehsan Elhamifar; |
| 464 | Cross-Domain Few-Shot Segmentation Via Multi-view Progressive Adaptation Highlight: (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. |
Jiahao Nie; Guanqiao Fu; Wenbin An; Yap-Peng Tan; Alex C. Kot; Shijian Lu; |
| 465 | AnyID: Ultra-Fidelity Universal Identity-Preserving Video Generation from Any Visual References Highlight: In response, we present AnyID, an ultra-fidelity identity-preservation video generation framework. |
Jiahao Wang; Hualian Sheng; Sijia Cai; Yuxiao Yang; Weizhan Zhang; Caixia Yan; Bing Deng; Jieping Ye; |
| 466 | ShreddingNet: Coarse-to-Fine Restoration for Multi-Source Shredded Manuscripts Highlight: We propose ShreddingNet, a coarse-to-fine two-stage pipeline for multi-source manuscript restoration that operates without restrictive conditions. |
Haoyang Cui; Hao Jiang; Yadong Mu; |
| 467 | WonderZoom: Multi-Scale 3D World Generation Highlight: We present WonderZoom, a novel approach to generating 3D scenes with content across multiple spatial scales from a single image. |
Jin Cao; Hong-Xing Yu; Jiajun Wu; |
| 468 | PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting Highlight: To address these limitations and achieve semantic-consistent segmentation, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. |
Yixiao Song; Qingyong Li; Wen Wang; Zhicheng Yan; |
| 469 | CrackSSM: Reviving SSMs for Crack Segmentation Via Dynamic Scanning Highlight: This fixed flattening order disrupts spatial continuity and weakens the SSM’s ability to model irregular crack patterns effectively. To address this limitation, we propose CrackSSM, a novel crack-aware segmentation framework featuring a dynamic scanning strategy that adapts the token sequence to the underlying structure of each image. |
Yubin Gu; Boyang Hou; Yuan Meng; Wenting Luo; Jiayi Ji; Xiaoshuai Sun; |
| 470 | DeltaQuant: 4-bit Video Diffusion Models with Spatiotemporal Delta Smoothing Highlight: As recent advances in attention optimization mitigate previous computational bottlenecks, linear layers now dominate both computational cost and inference memory. In this work, we focus on quantizing both weights and activations to 4 bits to accelerate these layers. |
Xingyang Li; Samuel Tesfai; Zhekai Zhang; Haocheng Xi; Shuo Yang; Lvmin Zhang; Yufei Sun; Kelly Peng; Maneesh Agrawala; Ion Stoica; Kurt Keutzer; Jun-Yan Zhu; Song Han; Yujun Lin; Muyang Li; |
| 471 | Leveraging Verifier-Based Reinforcement Learning in Image Editing Highlight: We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and applies it to the downstream editing task. |
Hanzhong Guo; Jie Wu; Jie Liu; Yu Gao; Zilyu Ye; Linxiao Yuan; Xionghui Wang; Yizhou Yu; Weilin Huang; |
| 472 | Skyra: AI-Generated Video Detection Via Grounded Artifact Reasoning Highlight: However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. |
Yifei Li; Wenzhao Zheng; Yanran Zhang; Runze Sun; Yu Zheng; Lei Chen; Jie Zhou; Jiwen Lu; |
| 473 | MotionV2V: Editing Motion in A Video Highlight: In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. |
Ryan Burgert; Charles Herrmann; Forrester Cole; Michael Ryoo; Neal Wadhwa; Andrey Voynov; Nataniel Ruiz; |
| 474 | SIR: Structured Image Representations for Explainable Robot Learning. Highlight: Thus, the representations that drive their behaviour are often opaque, making their decision-making process difficult to interpret. To address this, we introduce Structured Image Representation, a method that leverages Scene Graphs as an intermediate representation for robot policy learning. | Paul Mattes; Jan Schwab; Jens Bosch; Maximilian Li; Nils Blank; Minh-Trung Tang; Moritz Haberland; Rudolf Lioutikov |
| 475 | Beyond The Ground Truth: Enhanced Supervision for Image Restoration. Highlight: However, when addressing real-world degradations, model performance is limited by the quality of ground-truth images in datasets due to practical constraints in data acquisition. To address this limitation, we propose a novel framework that enhances existing ground-truth images to provide higher-quality supervision for real-world restoration. | Donghun Ryou; Inju Ha; Sanghyeok Chu; Bohyung Han |
| 476 | Domain-Skewed Federated Learning with Feature Decoupling and Calibration. Highlight: In this paper, we argue that the domain skew is reflected in the domain-specific biased features of each client, causing the local model’s representations to collapse into a narrow low-dimensional subspace. | Huan Wang; Jun Shen; Jun Yan; Guansong Pang |
| 477 | Open The Motion Door: Atomic Motion Decomposition and Recomposition for Open-Vocabulary Motion Generation. Highlight: We observe that although high-level motion semantics vary widely, many motions share a common set of underlying atomic motions—that is, simple, reusable body-part movements. Building on this insight, we introduce an **Atomic Motion Decomposition and Recomposition** framework for open-vocabulary text-to-motion generation. | Ke Fan; Jiangning Zhang; Ran Yi; Jingyu Gong; Yabiao Wang; Yating Wang; Xin Tan; Chengjie Wang; Lizhuang Ma |
| 478 | EventHub: Data Factory for Generalizable Event-Based Stereo Networks Without Active Sensors. Highlight: We propose EventHub, a novel framework for training deep event-based stereo networks without ground-truth annotations from costly active sensors, relying instead on standard color images. | Luca Bartolomei; Fabio Tosi; Matteo Poggi; Stefano Mattoccia; Guillermo Gallego |
| 479 | Think Before You Drive: World Model-Inspired Multimodal Grounding. Highlight: Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. | Haicheng Liao; Huanming Shen; Bonan Wang; Yong Kang Li; Yihong Tang; Chengyue Wang; Dingyi Zhuang; Kehua Chen; Hai Yang; Cheng-Zhong Xu; Zhenning Li |
| 480 | RxnCaption: Reformulating Reaction Diagram Parsing As Visual Prompt Guided Captioning. Highlight: However, existing chemical reaction data often exist as images within papers, making them neither machine-readable nor usable for training machine learning models. In response to this challenge, we propose the \textbf{RxnCaption} framework for the task of chemical Reaction Diagram Parsing (RxnDP). | Jiahe Song; Chuang Wang; Bowen Jiang; Yinfan Wang; Hao Zheng; Xingjian Wei; Chengjin Liu; Rui Nie; Junyuan Gao; Jiaxing Sun; Yubin Wang; Lijun Wu; Zhenhua Huang; Jiang Wu; Qian Yu; Conghui He |
| 481 | Factorize, Reconstruct, Enhance: A Unified Framework for Multimodal Sentiment Analysis. Highlight: However, existing models are often hampered by two key challenges: insufficient extraction of the multilayer semantics inherent to each modality, and static feature fusion. Therefore, this paper proposes a Multi-factor Factor-Decoupling and Semantics-enhanced Fusion Framework for accurate multimodal sentiment analysis. | Zhilu Yang; Mingcheng Li |
| 482 | Masked Region Transformer for Layered Image Generation and Editing at Scale. Highlight: Despite its importance, this remains an underexplored area at scale. To address this gap, we present the Masked Region Transformer, a 20B-parameter diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. | Zhicong Tang; Jingye Chen; Zhao Zhang; Mohan Zhou; Yuchi Liu; Yifan Pu; Yalong Bai; Ethan Smith; Yuhui Yuan |
| 483 | Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking from Sparse Inertial Sensors and Ranging-based Between-sensor Distances. Highlight: However, these distances can also be used to reconstruct the underlying 3D sensor layout, which in turn provides more informative input for pose reconstruction. We propose Ultra Diffusion Poser, a diffusion model that explicitly models these geometric constraints. | Dominik Hollidt; Tommaso Bendinelli; Christian Holz |
| 484 | CogniVerse: Revolutionizing Multi-modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning. Highlight: However, existing MMRAG frameworks suffer from critical limitations, including noisy and irrelevant retrieval, cross-modal semantic misalignment, lack of adaptive reasoning, and incoherent generation across local and global contexts. We introduce \textbf{CogniVerse}, a novel MMRAG framework that addresses these challenges through a cognitive-inspired, mathematically rigorous approach. | Xiang Fang; Wanlong Fang; Changshuo Wang |
| 485 | DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images Via Diffusion Transformers. Highlight: In this paper, we delve into a novel task: Layer-Wise Decomposition of Alpha-Composited Images, aiming to recover constituent layers from single overlapped images under the condition of semi-transparent/transparent alpha layer non-linear occlusion. | Hang Zhao; Hang Zhao; Qianyu Zhou; Xuequan Lu; Xiangtai Li; Hao Yang; Bo Yang; Yiren Song |
| 486 | M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG. Highlight: We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image–question pairs for evaluating retrieval-augmented VQA across languages and modalities. | David Anugraha; Patrick Irawan; Anshul Singh; En-Shiun Annie Lee; Genta Indra Winata |
| 487 | Language-guided Frequency Modulation for Large Vision-Language Models. Highlight: This limitation hinders fine-grained control of visual representations and complicates their hierarchical alignment with language. To address this issue, we introduce Language-guided Frequency Modulation (LFM), a plug-and-play approach that adaptively refines visual signals in the frequency domain under linguistic guidance. | Shuyi Ouyang; Gongfan Fang; Xinyin Ma; Yen-Wei Chen; Lanfen Lin; Xinchao Wang |
| 488 | Gated Condition Injection Without Multimodal Attention: Towards Controllable Linear-Attention Transformers. Highlight: Yet, our experiments reveal that existing controllable generation frameworks, such as ControlNet and OminiControl, either lack the flexibility to support multiple heterogeneous condition types or suffer from slow convergence on such linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored for linear-attention backbones like SANA. | Yuhe Liu; Zhenxiong Tan; Yujia Hu; Songhua Liu; Xinchao Wang |
| 489 | Merge3D: Efficient 3D Multimodal LLMs Via Joint 2D-3D Token Merging. Highlight: Their primary bottleneck, however, is the substantial computational burden associated with processing multi-view, lengthy visual token sequences. To surmount this challenge, we propose \textbf{Merge3D}, a geometry-aware token merging framework that integrates both 3D geometry and 2D semantic information. | Tianbo Pan; Xingyi Yang; Xinchao Wang |
| 490 | SpotEdit: Selective Region Editing in Diffusion Transformers. Highlight: This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. | Zhibin Qin; Zhenxiong Tan; Zeqing Wang; Songhua Liu; Xinchao Wang |
| 491 | Beyond Soft Label: Dataset Distillation Via Orthogonal Gradient Matching. Highlight: In this paper, we theoretically identify that BN matching mainly aligns the scales of real and synthetic gradients but overlooks their directions. | Deyu Bo; Xinchao Wang |
| 492 | A Unified Benchmark for HOI Evaluation Across Vision-Language Models and HOI-Specific Methods. Highlight: This leads to incorrect penalization, especially for VLMs whose outputs are less constrained, making fair comparison between the two paradigms difficult. To address this limitation, we introduce a multi-choice HOI benchmark with explicitly defined positives and curated negatives, enabling unified and correct evaluation of both VLMs and HOI-specific models. | Qinqian Lei; Bo Wang; Robby T. Tan |
| 493 | iLRM: An Iterative Large 3D Reconstruction Model. Highlight: Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input images to enable compact 3D representations; (2) decomposing global multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. | Gyeongjin Kang; Seungtae Nam; Seung Kwon Yang; Xiangyu Sun; Sameh Khamis; Abdelrahman Mohamed; Eunbyung Park |
| 494 | Multi-view Pyramid Transformer: Look Coarser to See Broader. Highlight: We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. | Gyeongjin Kang; Seung Kwon Yang; Seungtae Nam; Younggeun Lee; Jungwoo Kim; Eunbyung Park |
| 495 | DuoMo: Dual Motion Diffusion for World-Space Human Reconstruction. Highlight: We present DuoMo, a generative method that recovers human motion in world-space coordinates from unconstrained videos with noisy or incomplete observations. | Yufu Wang; Evonne Ng; Soyong Shin; Rawal Khirodkar; Yuan Dong; Zhaoen Su; Jinhyung Park; Kris Kitani; Alexander Richard; Fabian Prada; Michael Zollhoefer |
| 496 | ObjectMorpher: 3D-Aware Image Editing Via Deformable 3DGS. Highlight: We present ObjectMorpher, a unified, interactive framework that converts ambiguous 2D edits into geometry-grounded operations. | Yuhuan Xie; Aoxuan Pan; Yihua Huang; Chirui Chang; Peng Dai; Xin Yu; Xiaojuan Qi |
| 497 | Group Editing: Edit Multiple Images in One Go. Highlight: In this paper, we tackle the problem of performing consistent and unified modifications across a set of related images. | Yue Ma; Xinyu Wang; Qianli Ma; Qinghe Wang; Mingzhe Zheng; Xiangpeng Yang; Hao Li; Chongbo Zhao; Jixuan Ying; Harry Yang; Hongyu Liu; Qifeng Chen |
| 498 | TableMix: Enhancing Multimodal Table Reasoning in MLLMs from A Data-Centric Perspective. Highlight: We argue that a major limitation lies in the pre-training process, which inadvertently weakens the model’s intrinsic reasoning ability and consequently hinders the effectiveness of reinforcement fine-tuning on table reasoning tasks. In this paper, we introduce TableMix, a novel framework that tackles this challenge from a data-centric perspective. | Chaohu Liu; Shida Wang; Yubo Wang; Linli Xu |
| 499 | RetFormer: Multimodal Retrieval for Enhancing Image Recognition. Highlight: Specifically, we introduce RetFormer, a model enhanced with a multimodal knowledge base for storing world knowledge, and a retrieval cross-fusion module designed to establish robust multimodal sample relationships by leveraging content from the knowledge base. | Tianrui Yu; Xiubo Liang; Hongzhi Wang |
| 500 | MoLingo: Motion–Language Alignment for Text-to-Human Motion Generation. Highlight: We introduce MoLingo, a text-to-motion (T2M) model that generates realistic, lifelike human motion by denoising in a continuous latent space. | Yannan He; Garvita Tiwari; Xiaohan Zhang; Pankaj Bora; Tolga Birdal; Jan Lenssen; Gerard Pons-Moll |
This table only includes 500 papers selected by our daily digest algorithm. To continue with the full list (~4,000 papers), please visit Paper Digest: CVPR-2026 (Full List).