Paper Digest: CVPR 2024 Highlights
Note: CVPR 2024 accepted more than 2,700 papers; this page includes only 500 of them, selected by our daily paper digest algorithm. Interested users can choose to read all 2,700 CVPR-2024 papers on a separate page.
To search or review papers within CVPR-2024 related to a specific topic, please use the search by venue (CVPR-2024), review by venue (CVPR-2024), and question answering by venue (CVPR-2024) services. To browse papers by author, here is a list of all authors (CVPR-2024). You may also like to explore our “Best Paper” Digest (CVPR), which lists the most influential CVPR papers since 1988.
Based in New York, Paper Digest is dedicated to producing high-quality text analysis results that people can actually use on a daily basis. Since 2018, we have been serving users across the world with a number of exclusive services to track, search, review, and rewrite scientific literature.
You are welcome to follow us on Twitter and LinkedIn to stay updated on new conference digests.
Paper Digest Team
New York City, New York, 10017
team@paperdigest.org
TABLE 1: Paper Digest: CVPR 2024 Highlights
Paper | Author(s) |
---|---|
1 | MPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. Highlight: In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. |
Qinghao Ye; Haiyang Xu; Jiabo Ye; Ming Yan; Anwen Hu; Haowei Liu; Qi Qian; Ji Zhang; Fei Huang; |
2 | Improved Baselines with Visual Instruction Tuning. Highlight: In this paper, we present the first systematic study to investigate the design choices of LMMs in a controlled setting under the LLaVA framework. |
Haotian Liu; Chunyuan Li; Yuheng Li; Yong Jae Lee; |
3 | WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion. Highlight: WHAM exploits camera angular velocity estimated from a SLAM method, together with human motion, to estimate the body’s global trajectory. We combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions, such as climbing stairs. |
Soyong Shin; Juyong Kim; Eni Halilaj; Michael J. Black; |
4 | TokenHMR: Advancing Human Mesh Recovery with A Tokenized Pose Representation. Highlight: We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. |
Sai Kumar Dwivedi; Yu Sun; Priyanka Patel; Yao Feng; Michael J. Black; |
5 | Generating Illustrated Instructions. Highlight: We introduce a new task of generating "Illustrated Instructions", i.e., visual instructions customized to a user’s needs. |
Sachit Menon; Ishan Misra; Rohit Girdhar; |
6 | ChatPose: Chatting About 3D Human Pose. Highlight: We introduce ChatPose, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. |
Yao Feng; Jing Lin; Sai Kumar Dwivedi; Yu Sun; Priyanka Patel; Michael J. Black; |
7 | EMAGE: Towards Unified Holistic Co-Speech Gesture Generation Via Expressive Masked Audio Gesture Modeling. Highlight: We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hand, and global movements. |
Haiyang Liu; Zihao Zhu; Giorgio Becherini; Yichen Peng; Mingyang Su; You Zhou; Xuefei Zhe; Naoya Iwamoto; Bo Zheng; Michael J. Black; |
8 | WANDR: Intention-guided Human Motion Generation. Highlight: A primary obstacle is the scarcity of training data that combines locomotion with goal reaching. To address this, we introduce WANDR, a data-driven model that takes an avatar’s initial pose and a goal’s 3D position and generates natural human motions that place the end effector (wrist) on the goal location. |
Markos Diomataris; Nikos Athanasiou; Omid Taheri; Xi Wang; Otmar Hilliges; Michael J. Black; |
9 | VAREN: Very Accurate and Realistic Equine Network. Highlight: We introduce VAREN, a novel 3D articulated parametric shape model learned from 3D scans of many real horses. |
Silvia Zuffi; Ylva Mellbin; Ci Li; Markus Hoeschle; Hedvig Kjellström; Senya Polikovsky; Elin Hernlund; Michael J. Black; |
10 | ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts. Highlight: Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary (free-form) visual prompts. |
Mu Cai; Haotian Liu; Siva Karthik Mustikovela; Gregory P. Meyer; Yuning Chai; Dennis Park; Yong Jae Lee; |
11 | A Unified Approach for Text- and Image-guided 4D Scene Generation. Highlight: However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. |
Yufeng Zheng; Xueting Li; Koki Nagano; Sifei Liu; Otmar Hilliges; Shalini De Mello; |
12 | Edit One for All: Interactive Batch Image Editing. Highlight: With the goal of minimizing human supervision in the editing process, this paper presents a novel method for interactive batch image editing using StyleGAN as the medium. |
Thao Nguyen; Utkarsh Ojha; Yuheng Li; Haotian Liu; Yong Jae Lee; |
13 | DiffMorpher: Unleashing The Capability of Diffusion Models for Image Morphing. Highlight: Such a smooth interpolation is intriguing, as it naturally serves as a solution for the image morphing task with many applications. In this work, we address this limitation via DiffMorpher, an approach that enables smooth and natural image interpolation by harnessing the prior knowledge of a pre-trained diffusion model. |
Kaiwen Zhang; Yifan Zhou; Xudong Xu; Bo Dai; Xingang Pan; |
14 | MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes Via Diffusion Prior. Highlight: This is due to two key reasons: (i) independently inpainting constituent images results in view-inconsistent imagery, and (ii) 2D inpainters struggle to ensure high-quality geometry completion and alignment with inpainted RGB images. To overcome these limitations, we propose a novel approach called MVIP-NeRF that harnesses the potential of diffusion priors for NeRF inpainting, addressing both appearance and geometry aspects. |
Honghua Chen; Chen Change Loy; Xingang Pan; |
15 | DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing. Highlight: In our solution, we introduce image prompts in fine-grained image editing, cooperating with the text prompt to better describe the editing content. |
Chong Mou; Xintao Wang; Jiechong Song; Ying Shan; Jian Zhang; |
16 | SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM. Highlight: This work introduces SplaTAM, an approach that, for the first time, leverages explicit volumetric representations, i.e., 3D Gaussians, to enable high-fidelity reconstruction from a single unposed RGB-D camera, surpassing the capabilities of existing methods. |
Nikhil Keetha; Jay Karhade; Krishna Murthy Jatavallabhula; Gengshan Yang; Sebastian Scherer; Deva Ramanan; Jonathon Luiten; |
17 | 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering. Highlight: To achieve real-time dynamic scene rendering while also enjoying high training and storage efficiency, we propose 4D Gaussian Splatting (4D-GS) as a holistic representation for dynamic scenes, rather than applying 3D-GS to each individual frame. |
Guanjun Wu; Taoran Yi; Jiemin Fang; Lingxi Xie; Xiaopeng Zhang; Wei Wei; Wenyu Liu; Qi Tian; Xinggang Wang; |
18 | SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting. Highlight: We introduce a co-designed approach for human portrait relighting that combines a physics-guided architecture with a pre-training framework. |
Hoon Kim; Minje Jang; Wonjun Yoon; Jisoo Lee; Donghyun Na; Sanghyun Woo; |
19 | MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark. Highlight: However, due to the difficulty and cost of collecting and labeling data, existing datasets for this task are either synthetically generated or artificially constructed within a controlled camera network setting, which limits their ability to model real-world dynamics and generalize to diverse camera configurations. To address this issue, we present MTMMC, a real-world, large-scale dataset that includes long video sequences captured by 16 multi-modal cameras in two different environments – campus and factory – across various time, weather, and season conditions. |
Sanghyun Woo; Kwanyong Park; Inkyu Shin; Myungchul Kim; In So Kweon; |
20 | Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models. Highlight: Here we instead focus on the underexplored text-to-4D setting and synthesize dynamic, animated 3D objects using score distillation methods with an additional temporal dimension. |
Huan Ling; Seung Wook Kim; Antonio Torralba; Sanja Fidler; Karsten Kreis; |
21 | Eyes Wide Shut? Exploring The Visual Shortcomings of Multimodal LLMs. Highlight: We further evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. (See the sketch below this entry.) |
Shengbang Tong; Zhuang Liu; Yuexiang Zhai; Yi Ma; Yann LeCun; Saining Xie; |
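To make the MoF idea above concrete, here is a minimal PyTorch sketch of one way to combine the two feature spaces: project CLIP tokens and vision-SSL tokens to the LLM width and interleave them along the sequence. The module name, dimensions, and interleaving scheme are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class InterleavedMoF(nn.Module):
    """Hypothetical interleaved Mixture-of-Features: CLIP tokens and
    vision-SSL (e.g., DINO-style) tokens are projected to the LLM width
    and interleaved along the sequence dimension."""

    def __init__(self, clip_dim: int, ssl_dim: int, llm_dim: int):
        super().__init__()
        self.proj_clip = nn.Linear(clip_dim, llm_dim)
        self.proj_ssl = nn.Linear(ssl_dim, llm_dim)

    def forward(self, clip_tokens: torch.Tensor, ssl_tokens: torch.Tensor) -> torch.Tensor:
        # clip_tokens, ssl_tokens: (batch, num_tokens, dim), equal num_tokens assumed.
        a = self.proj_clip(clip_tokens)
        b = self.proj_ssl(ssl_tokens)
        # Interleave a0, b0, a1, b1, ... -> (batch, 2 * num_tokens, llm_dim),
        # so the LLM sees both feature spaces for every spatial position.
        return torch.stack((a, b), dim=2).flatten(1, 2)
```

The paper studies both additive and interleaved variants; this sketch shows only the interleaved flavor.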
22 | V*: Guided Visual Search As A Core Mechanism in Multimodal LLMs. Highlight: However, the lack of this visual search mechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on important visual details, especially when handling high-resolution and visually crowded images. To address this, we introduce V*, an LLM-guided visual search mechanism that employs the world knowledge in LLMs for efficient visual querying. |
Penghao Wu; Saining Xie; |
23 | InstanceDiffusion: Instance-level Control for Image Generation. Highlight: We introduce InstanceDiffusion, which adds precise instance-level control to text-to-image diffusion models. |
Xudong Wang; Trevor Darrell; Sai Saketh Rambhatla; Rohit Girdhar; Ishan Misra; |
24 | NeRFiller: Completing Scenes Via Generative 3D Inpainting. Highlight: We propose NeRFiller, an approach that completes missing portions of a 3D capture via generative 3D inpainting using off-the-shelf 2D visual generative models. |
Ethan Weber; Aleksander Holynski; Varun Jampani; Saurabh Saxena; Noah Snavely; Abhishek Kar; Angjoo Kanazawa; |
25 | GARField: Group Anything with Radiance Fields. Highlight: We propose Group Anything with Radiance Fields (GARField), an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs. |
Chung Min Kim; Mingxuan Wu; Justin Kerr; Ken Goldberg; Matthew Tancik; Angjoo Kanazawa; |
26 | Generative Proxemics: A Prior for 3D Social Interaction from Images. Highlight: Reconstructing such interaction from images presents challenges because of mutual occlusion and the limited availability of large training datasets. To address this, we present a novel approach that learns a prior over the 3D proxemics of two people in close social interaction and demonstrate its use for single-view 3D reconstruction. |
Lea Müller; Vickie Ye; Georgios Pavlakos; Michael Black; Angjoo Kanazawa; |
27 | HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models. Highlight: Fine-tuning each personalized model needs considerable GPU time investment, and storing a personalized model per subject can be demanding in terms of storage capacity. To overcome these challenges, we propose HyperDreamBooth – a hypernetwork capable of efficiently generating a small set of personalized weights from a single image of a person. |
Nataniel Ruiz; Yuanzhen Li; Varun Jampani; Wei Wei; Tingbo Hou; Yael Pritch; Neal Wadhwa; Michael Rubinstein; Kfir Aberman; |
28 | Hierarchical Patch Diffusion Models for High-Resolution Video Generation. Highlight: In this work, we study patch diffusion models (PDMs) — a diffusion paradigm which models the distribution of patches rather than whole inputs, keeping up to 0.7% of the original pixels. |
Ivan Skorokhodov; Willi Menapace; Aliaksandr Siarohin; Sergey Tulyakov; |
29 | Pix2gestalt: Amodal Segmentation By Synthesizing Wholes. Highlight: We introduce pix2gestalt, a framework for zero-shot amodal segmentation, which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions. |
Ege Ozguroglu; Ruoshi Liu; Dídac Surís; Dian Chen; Achal Dave; Pavel Tokmakov; Carl Vondrick; |
30 | M&M VTO: Multi-Garment Virtual Try-On and Editing. Highlight: We present M&M VTO, a mix-and-match virtual try-on method that takes as input multiple garment images, a text description for garment layout, and an image of a person. |
Luyang Zhu; Yingwei Li; Nan Liu; Hao Peng; Dawei Yang; Ira Kemelmacher-Shlizerman; |
31 | On The Content Bias in Frechet Video Distance. Highlight: Frechet Video Distance (FVD), a prominent metric for evaluating video generation models, is known to conflict with human perception occasionally. In this paper, we aim to explore the extent of FVD’s bias toward frame quality over temporal realism and identify its sources. |
Songwei Ge; Aniruddha Mahapatra; Gaurav Parmar; Jun-Yan Zhu; Jia-Bin Huang; |
32 | DeepCache: Accelerating Diffusion Models for Free. Highlight: In this paper, we introduce DeepCache, a novel training-free paradigm that accelerates diffusion models from the perspective of model architecture. (See the sketch below this entry.) |
Xinyin Ma; Gongfan Fang; Xinchao Wang; |
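The core observation behind this kind of training-free acceleration is that deep U-Net features change slowly across adjacent denoising steps, so they can be cached and reused. Below is a minimal sketch of that idea; the `unet.forward_full` / `unet.forward_shallow` interface is a hypothetical stand-in, since real integrations hook into specific up/down blocks of a concrete U-Net implementation.

```python
import torch

def cached_denoising_loop(unet, x, timesteps, cache_interval=3):
    """Sketch of feature caching across denoising steps: run a full U-Net
    pass only every `cache_interval` steps and reuse the cached deep
    features in between. The two-method interface is an assumption."""
    cached_features = None
    out = x
    for i, t in enumerate(timesteps):
        if cached_features is None or i % cache_interval == 0:
            # Full pass: compute the output and refresh the deep-feature cache.
            out, cached_features = unet.forward_full(out, t)
        else:
            # Cheap pass: only shallow layers run; deep features come from the cache.
            out = unet.forward_shallow(out, t, cached_features)
    return out
```

The speedup comes from skipping the deep (and most expensive) part of the network on the majority of steps, at a small cost in fidelity controlled by the interval.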
33 | Scaling Laws of Synthetic Images for Model Training … for Now. Highlight: In this paper, we study the scaling laws of synthetic images generated by state-of-the-art text-to-image models, for the training of supervised models: image classifiers with label supervision and CLIP with language supervision. |
Lijie Fan; Kaifeng Chen; Dilip Krishnan; Dina Katabi; Phillip Isola; Yonglong Tian; |
34 | Grounded Text-to-Image Synthesis with Attention Refocusing. Highlight: In this paper, we reveal the potential causes in the diffusion model’s cross-attention and self-attention layers. |
Quynh Phung; Songwei Ge; Jia-Bin Huang; |
35 | Image Sculpting: Precise Object Editing with 3D Geometry Control. Highlight: We present Image Sculpting, a new framework for editing 2D images by incorporating tools from 3D geometry and graphics. |
Jiraphon Yenphraphai; Xichen Pan; Sainan Liu; Daniele Panozzo; Saining Xie; |
36 | Overcoming Generic Knowledge Loss with Selective Parameter Update. Highlight: Leveraging the fact that foundation models have initial knowledge on various tasks and domains, we propose a novel approach that, instead of updating all parameters equally, localizes the updates to a sparse set of parameters relevant to the task being learned. |
Wenxuan Zhang; Paul Janson; Rahaf Aljundi; Mohamed Elhoseiny; |
37 | Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text. Highlight: This is challenging due to domain-specific requirements and a lack of high-quality data. To address this challenge, we propose Make-It-Vivid, the first attempt to enable high-quality texture generation from text in UV space. |
Junshu Tang; Yanhong Zeng; Ke Fan; Xuheng Wang; Bo Dai; Kai Chen; Lizhuang Ma; |
38 | CogAgent: A Visual Language Model for GUI Agents. Highlight: In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. |
Wenyi Hong; Weihan Wang; Qingsong Lv; Jiazheng Xu; Wenmeng Yu; Junhui Ji; Yan Wang; Zihan Wang; Yuxiao Dong; Ming Ding; Jie Tang; |
39 | InceptionNeXt: When Inception Meets ConvNeXt. Highlight: Although reducing the kernel size of ConvNeXt can improve speed, it results in significant performance degradation, which poses a challenging problem: how to speed up large-kernel-based CNN models while preserving their performance. To tackle this issue, inspired by Inceptions, we propose to decompose large-kernel depthwise convolution into four parallel branches along the channel dimension, i.e., a small square kernel, two orthogonal band kernels, and an identity mapping. (See the sketch below this entry.) |
Weihao Yu; Pan Zhou; Shuicheng Yan; Xinchao Wang; |
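The four-branch decomposition described above is compact enough to sketch directly. Here is a minimal PyTorch module illustrating it; the channel split ratio and kernel sizes are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Sketch of the four-branch depthwise decomposition: identity mapping,
    a small square kernel, and two orthogonal band kernels, each applied
    to a disjoint slice of the channels."""

    def __init__(self, dim, square_kernel=3, band_kernel=11, branch_ratio=0.125):
        super().__init__()
        gc = int(dim * branch_ratio)  # channels per convolutional branch
        self.dwconv_hw = nn.Conv2d(gc, gc, square_kernel,
                                   padding=square_kernel // 2, groups=gc)
        self.dwconv_w = nn.Conv2d(gc, gc, (1, band_kernel),
                                  padding=(0, band_kernel // 2), groups=gc)
        self.dwconv_h = nn.Conv2d(gc, gc, (band_kernel, 1),
                                  padding=(band_kernel // 2, 0), groups=gc)
        self.split_sizes = (dim - 3 * gc, gc, gc, gc)

    def forward(self, x):
        # Most channels pass through untouched (identity); the rest go through
        # a small square kernel and two cheap orthogonal band kernels that
        # together approximate a large receptive field.
        x_id, x_hw, x_w, x_h = torch.split(x, self.split_sizes, dim=1)
        return torch.cat((x_id, self.dwconv_hw(x_hw),
                          self.dwconv_w(x_w), self.dwconv_h(x_h)), dim=1)
```

Because only a fraction of channels receive any convolution at all, this block is markedly cheaper than a full large-kernel depthwise convolution over all channels.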
40 | Reconstructing Hands in 3D with Transformers. Highlight: We present an approach that can reconstruct hands in 3D from monocular input. |
Georgios Pavlakos; Dandan Shan; Ilija Radosavovic; Angjoo Kanazawa; David Fouhey; Jitendra Malik; |
41 | ShapeWalk: Compositional Shape Editing Through Language-Guided Chains. Highlight: Editing 3D shapes through natural language instructions is a challenging task that requires the comprehension of both language semantics and fine-grained geometric details. To bridge this gap, we introduce ShapeWalk, a synthetic dataset carefully designed to advance the field of language-guided shape editing. |
Habib Slim; Mohamed Elhoseiny; |
42 | Adversarial Text to Continuous Image Generation. Highlight: In this paper, we approach the text-to-image task from a different perspective, where a 2D image is represented as an implicit neural representation (INR). |
Kilichbek Haydarov; Aashiq Muhamed; Xiaoqian Shen; Jovana Lazarevic; Ivan Skorokhodov; Chamuditha Jayanga Galappaththige; Mohamed Elhoseiny; |
43 | Emu Edit: Precise Image Editing Via Recognition and Generation Tasks. Highlight: We present Emu Edit, a multi-task image editing model which sets state-of-the-art results in instruction-based image editing. |
Shelly Sheynin; Adam Polyak; Uriel Singer; Yuval Kirstain; Amit Zohar; Oron Ashual; Devi Parikh; Yaniv Taigman; |
44 | Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer. Highlight: We present a new method for text-driven motion transfer – synthesizing a video that complies with an input text prompt describing the target objects and scene, while maintaining an input video’s motion and scene layout. |
Danah Yatim; Rafail Fridman; Omer Bar-Tal; Yoni Kasten; Tali Dekel; |
45 | MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. Highlight: We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. |
Xiang Yue; Yuansheng Ni; Kai Zhang; Tianyu Zheng; Ruoqi Liu; Ge Zhang; Samuel Stevens; Dongfu Jiang; Weiming Ren; Yuxuan Sun; Cong Wei; Botao Yu; Ruibin Yuan; Renliang Sun; Ming Yin; Boyuan Zheng; Zhenzhu Yang; Yibo Liu; Wenhao Huang; Huan Sun; Yu Su; Wenhu Chen; |
46 | FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects. Highlight: We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking, supporting both model-based and model-free setups. |
Bowen Wen; Wei Yang; Jan Kautz; Stan Birchfield; |
47 | Mosaic-SDF for 3D Generative Models. Highlight: We introduce Mosaic-SDF (M-SDF): a simple 3D shape representation that approximates the Signed Distance Function (SDF) of a given shape by using a set of local grids spread near the shape’s boundary. |
Lior Yariv; Omri Puny; Oran Gafni; Yaron Lipman; |
48 | MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World. Highlight: Current multi-modal large language models, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied large language model that can incorporate multisensory interactive data, including visual, audio, tactile, and thermal information, into large language models, thereby establishing the correlation among words, actions, and percepts. |
Yining Hong; Zishuo Zheng; Peihao Chen; Yian Wang; Junyan Li; Chuang Gan; |
49 | GaussianDreamer: Fast Generation from Text to 3D Gaussians By Bridging 2D and 3D Diffusion Models. Highlight: This paper attempts to bridge the power of the two types of diffusion models via the recent explicit and efficient 3D Gaussian splatting representation. |
Taoran Yi; Jiemin Fang; Junjie Wang; Guanjun Wu; Lingxi Xie; Xiaopeng Zhang; Wenyu Liu; Qi Tian; Xinggang Wang; |
50 | Readout Guidance: Learning Control from Diffusion Features. Highlight: We present Readout Guidance, a method for controlling text-to-image diffusion models with learned signals. |
Grace Luo; Trevor Darrell; Oliver Wang; Dan B Goldman; Aleksander Holynski; |
51 | Breathing Life Into Sketches Using Text-to-Video Priors. Highlight: In this work, we present a method that automatically adds motion to a single-subject sketch (hence, "breathing life into it"), merely by providing a text prompt indicating the desired motion. |
Rinon Gal; Yael Vinker; Yuval Alaluf; Amit Bermano; Daniel Cohen-Or; Ariel Shamir; Gal Chechik; |
52 | Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications. Highlight: We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. |
Yuwen Xiong; Zhiqi Li; Yuntao Chen; Feng Wang; Xizhou Zhu; Jiapeng Luo; Wenhai Wang; Tong Lu; Hongsheng Li; Yu Qiao; Lewei Lu; Jie Zhou; Jifeng Dai; |
53 | Learning Vision from Models Rivals Learning Vision from Data. Highlight: We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images, without any real data. |
Yonglong Tian; Lijie Fan; Kaifeng Chen; Dina Katabi; Dilip Krishnan; Phillip Isola; |
54 | PerAda: Parameter-Efficient Federated Learning Personalization with Generalization Guarantees. Highlight: In this paper, we propose PerAda, a parameter-efficient pFL framework that reduces communication and computational costs and exhibits superior generalization performance, especially under test-time distribution shifts. |
Chulin Xie; De-An Huang; Wenda Chu; Daguang Xu; Chaowei Xiao; Bo Li; Anima Anandkumar; |
55 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark. Highlight: However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. |
Kunchang Li; Yali Wang; Yinan He; Yizhuo Li; Yi Wang; Yi Liu; Zun Wang; Jilan Xu; Guo Chen; Ping Luo; Limin Wang; Yu Qiao; |
56 | InternVL: Scaling Up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. Highlight: In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. |
Zhe Chen; Jiannan Wu; Wenhai Wang; Weijie Su; Guo Chen; Sen Xing; Muyan Zhong; Qinglong Zhang; Xizhou Zhu; Lewei Lu; Bin Li; Ping Luo; Tong Lu; Yu Qiao; Jifeng Dai; |
57 | Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft. Highlight: The challenge of exploration efficiency in such environments makes it difficult for reinforcement-learning-based agents to learn complex tasks. To address this, this paper introduces an advanced learning system, named Auto MC-Reward, that leverages Large Language Models (LLMs) to automatically design dense reward functions, thereby enhancing learning efficiency. |
Hao Li; Xue Yang; Zhaokai Wang; Xizhou Zhu; Jie Zhou; Yu Qiao; Xiaogang Wang; Hongsheng Li; Lewei Lu; Jifeng Dai; |
58 | Prompt-Free Diffusion: Taking "Text" Out of Text-to-Image Diffusion Models. Highlight: In this paper, we take a bold step forward: taking "Text" out of a pretrained T2I diffusion model to reduce the burdensome prompt engineering efforts for users. |
Xingqian Xu; Jiayi Guo; Zhangyang Wang; Gao Huang; Irfan Essa; Humphrey Shi; |
59 | Beyond First-Order Tweedie: Solving Inverse Problems Using Latent Diffusion. Highlight: This paper presents the Second-order Tweedie sampler from Surrogate Loss (STSL), a novel sampler offering efficiency comparable to first-order Tweedie while enabling tractable reverse processes using a second-order approximation. |
Litu Rout; Yujia Chen; Abhishek Kumar; Constantine Caramanis; Sanjay Shakkottai; Wen-Sheng Chu; |
60 | Wonder3D: Single Image to 3D Using Cross-Domain Diffusion. Highlight: In this work, we introduce Wonder3D, a novel method for generating high-fidelity textured meshes from single-view images with remarkable efficiency. |
Xiaoxiao Long; Yuan-Chen Guo; Cheng Lin; Yuan Liu; Zhiyang Dou; Lingjie Liu; Yuexin Ma; Song-Hai Zhang; Marc Habermann; Christian Theobalt; Wenping Wang; |
61 | Fixed Point Diffusion Models. Highlight: We introduce the Fixed Point Diffusion Model (FPDM), a novel approach to image generation that integrates the concept of fixed-point solving into the framework of diffusion-based generative modeling. |
Xingjian Bai; Luke Melas-Kyriazi; |
62 | An Edit Friendly DDPM Noise Space: Inversion and Manipulations. Highlight: However, this native noise space does not possess a convenient structure and is thus challenging to work with in editing tasks. Here, we propose an alternative latent noise space for DDPM that enables a wide range of editing operations via simple means, and present an inversion method for extracting these edit-friendly noise maps for any given image (real or synthetically generated). (See the sketch below this entry.) |
Inbar Huberman-Spiegelglas; Vladimir Kulikov; Tomer Michaeli; |
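A rough sketch of the inversion idea may help: sample each noisy latent independently from the forward process, then solve each DDPM sampling step for the noise map that maps one latent to the next. This is a simplified illustration under assumed inputs (`posterior_mean(x_t, t)` returning the model's denoising mean, and precomputed schedule tensors), not the authors' exact algorithm.

```python
import torch

def edit_friendly_inversion(x0, posterior_mean, sigmas, alphas_bar):
    """Sketch: extract per-step 'edit-friendly' noise maps for an image x0.
    `posterior_mean`, `sigmas`, and `alphas_bar` are assumed to be supplied
    by a concrete DDPM implementation; indexing conventions are illustrative."""
    T = len(sigmas)
    # Sample each x_t *independently* from q(x_t | x_0), rather than along a
    # single trajectory; this statistical choice is what makes the extracted
    # noise maps well-behaved for editing.
    xs = [torch.sqrt(alphas_bar[t]) * x0
          + torch.sqrt(1.0 - alphas_bar[t]) * torch.randn_like(x0)
          for t in range(T)]
    zs = {}
    for t in range(T - 1, 0, -1):
        # Solve the sampling step x_{t-1} = mu_t(x_t) + sigma_t * z_t for z_t,
        # so that re-running DDPM sampling with these z_t reproduces x0.
        zs[t] = (xs[t - 1] - posterior_mean(xs[t], t)) / sigmas[t]
    return zs
```

Edits are then performed by modifying the conditioning or the latents while reusing these fixed noise maps.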
63 | Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model. Highlight: In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. |
Shraman Pramanick; Guangxing Han; Rui Hou; Sayan Nag; Ser-Nam Lim; Nicolas Ballas; Qifan Wang; Rama Chellappa; Amjad Almahairi; |
64 | Splatter Image: Ultra-Fast Single-View 3D Reconstruction. Highlight: We introduce the Splatter Image, an ultra-efficient approach for monocular 3D object reconstruction. |
Stanislaw Szymanowicz; Christian Rupprecht; Andrea Vedaldi; |
65 | PAIR Diffusion: A Comprehensive Multimodal Object-Level Image Editor. Highlight: In this work, we tackle the task by perceiving images as an amalgamation of various objects, and aim to control the properties of each object in a fine-grained manner. |
Vidit Goel; Elia Peruzzo; Yifan Jiang; Dejia Xu; Xingqian Xu; Nicu Sebe; Trevor Darrell; Zhangyang Wang; Humphrey Shi; |
66 | CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation. Highlight: We present CoDi-2, a Multimodal Large Language Model (MLLM) for learning in-context interleaved multimodal representations. |
Zineng Tang; Ziyi Yang; Mahmoud Khademi; Yang Liu; Chenguang Zhu; Mohit Bansal; |
67 | GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from A Single Image. Highlight: In this paper, we propose a generic avatar editing approach that can be universally applied to various 3DMM-driven volumetric head avatars. |
Chong Bao; Yinda Zhang; Yuan Li; Xiyu Zhang; Bangbang Yang; Hujun Bao; Marc Pollefeys; Guofeng Zhang; Zhaopeng Cui; |
68 | ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining. Highlight: We present ZeroRF, a novel per-scene optimization method addressing the challenge of sparse-view 360° reconstruction in neural field representations. |
Ruoxi Shi; Xinyue Wei; Cheng Wang; Hao Su; |
69 | Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers. Highlight: Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video descriptions, subtitles, and individual video frames. |
Tsai-Shien Chen; Aliaksandr Siarohin; Willi Menapace; Ekaterina Deyneka; Hsiang-wei Chao; Byung Eun Jeon; Yuwei Fang; Hsin-Ying Lee; Jian Ren; Ming-Hsuan Yang; Sergey Tulyakov; |
70 | VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning. Highlight: We introduce VSCode, a generalist model with novel 2D prompt learning, to jointly address four SOD tasks and three COD tasks. |
Ziyang Luo; Nian Liu; Wangbo Zhao; Xuguang Yang; Dingwen Zhang; Deng-Ping Fan; Fahad Khan; Junwei Han; |
71 | Gaussian Head Avatar: Ultra High-fidelity Head Avatar Via Dynamic Gaussians. Highlight: In this paper, we propose Gaussian Head Avatar, represented by controllable 3D Gaussians, for high-fidelity head avatar modeling. |
Yuelang Xu; Benwang Chen; Zhe Li; Hongwen Zhang; Lizhen Wang; Zerong Zheng; Yebin Liu; |
72 | PIGEON: Predicting Image Geolocations. Highlight: We present a new geolocalization system that combines semantic geocell creation, multi-task contrastive pretraining, and a novel loss function. |
Lukas Haas; Michal Skreta; Silas Alberti; Chelsea Finn; |
73 | Retrieval-Augmented Egocentric Video Captioning. Highlight: Most prior approaches explore representation learning on egocentric videos only, while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos; (2) for training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets; (3) we train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions; and (4) through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. |
Jilan Xu; Yifei Huang; Junlin Hou; Guo Chen; Yuejie Zhang; Rui Feng; Weidi Xie; |
74 | Visual Program Distillation: Distilling Tools and Programmatic Reasoning Into Vision-Language Models. Highlight: We propose Visual Program Distillation (VPD), an instruction-tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. |
Yushi Hu; Otilia Stretcu; Chun-Ta Lu; Krishnamurthy Viswanathan; Kenji Hata; Enming Luo; Ranjay Krishna; Ariel Fuxman; |
75 | Style Aligned Image Generation Via Shared Attention. Highlight: In this paper, we introduce StyleAligned, a novel technique designed to establish style alignment among a series of generated images. |
Amir Hertz; Andrey Voynov; Shlomi Fruchter; Daniel Cohen-Or; |
76 | Gaussian Shell Maps for Efficient 3D Human Generation. Highlight: Here we introduce Gaussian Shell Maps (GSMs), a framework that connects SOTA generator network architectures with emerging 3D Gaussian rendering primitives using an articulable multi-shell-based scaffold. |
Rameen Abdal; Wang Yifan; Zifan Shi; Yinghao Xu; Ryan Po; Zhengfei Kuang; Qifeng Chen; Dit-Yan Yeung; Gordon Wetzstein; |
77 | FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models By Inverting Stable Diffusion. Highlight: In this work, we propose a new synthetic image detector that uses features obtained by inverting an open-source pre-trained Stable Diffusion model. |
George Cazenavette; Avneesh Sud; Thomas Leung; Ben Usman; |
78 | Mitigating Object Hallucinations in Large Vision-Language Models Through Visual Contrastive Decoding. Highlight: Despite their success, LVLMs still suffer from the issue of object hallucinations, where models generate plausible yet incorrect outputs that include objects that do not exist in the images. To mitigate this issue, we introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs. (See the sketch below this entry.) |
Sicong Leng; Hang Zhang; Guanzheng Chen; Xin Li; Shijian Lu; Chunyan Miao; Lidong Bing; |
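Since the contrastive-decoding mechanism is simple, a short sketch can illustrate it. The idea: compute next-token logits once with the original image and once with a distorted (e.g., heavily noised) image, then amplify the difference. The `alpha` value and the bare argmax are illustrative assumptions; the paper additionally restricts the contrast to a plausible-token set.

```python
import torch

def visual_contrastive_decoding_step(logits_original, logits_distorted, alpha=1.0):
    """Sketch of one VCD-style decoding step over two logit tensors of
    shape (batch, vocab), from the original and distorted visual inputs."""
    # Up-weight evidence grounded in the real image and down-weight tokens
    # the model would also emit from the distorted image, which are more
    # likely driven by language priors (i.e., hallucinations).
    contrastive_logits = (1.0 + alpha) * logits_original - alpha * logits_distorted
    return torch.argmax(contrastive_logits, dim=-1)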
79 | Depth-aware Test-Time Training for Zero-shot Video Object Segmentation. Highlight: In this work, we introduce a test-time training (TTT) strategy to address the problem. |
Weihuang Liu; Xi Shen; Haolun Li; Xiuli Bi; Bo Liu; Chi-Man Pun; Xiaodong Cun; |
80 | SceneTex: High-Quality Texture Synthesis for Indoor Scenes Via Diffusion Priors. Highlight: We propose SceneTex, a novel method for effectively generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors. |
Dave Zhenyu Chen; Haoxuan Li; Hsin-Ying Lee; Sergey Tulyakov; Matthias Nießner; |
81 | DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation Via Diffusion Models. Highlight: We present DreamAvatar, a text-and-shape-guided framework for generating high-quality 3D human avatars with controllable poses. |
Yukang Cao; Yan-Pei Cao; Kai Han; Ying Shan; Kwan-Yee K. Wong; |
82 | Bayes’ Rays: Uncertainty Quantification for Neural Radiance Fields. Highlight: We introduce BayesRays, a post-hoc framework to evaluate uncertainty in any pretrained NeRF without modifying the training process. |
Lily Goli; Cody Reading; Silvia Sellán; Alec Jacobson; Andrea Tagliasacchi; |
83 | Video-P2P: Video Editing with Cross-attention Control. Highlight: For attention control, we introduce a novel decoupled-guidance strategy, which uses different guidance strategies for the source and target prompts. |
Shaoteng Liu; Yuechen Zhang; Wenbo Li; Zhe Lin; Jiaya Jia; |
84 | One-step Diffusion with Distribution Matching Distillation. Highlight: We introduce Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator with minimal impact on image quality. |
Tianwei Yin; Michaël Gharbi; Richard Zhang; Eli Shechtman; Frédo Durand; William T. Freeman; Taesung Park; |
85 | FlowTrack: Revisiting Optical Flow for Long-Range Dense Tracking. Highlight: Conversely, recent advancements in long-range trackers offer extended temporal coverage, but at the cost of spatial sparsity. This paper introduces FlowTrack, a novel framework designed to bridge this gap. |
Seokju Cho; Jiahui Huang; Seungryong Kim; Joon-Young Lee; |
86 | Orthogonal Adaptation for Modular Customization of Diffusion Models. Highlight: In this paper, we address a new problem called Modular Customization, with the goal of efficiently merging customized models that were fine-tuned independently for individual concepts. |
Ryan Po; Guandao Yang; Kfir Aberman; Gordon Wetzstein; |
87 | Towards Language-Driven Video Inpainting Via Multimodal Large Language Models. Highlight: We introduce a new task — language-driven video inpainting, which uses natural language instructions to guide the inpainting process. |
Jianzong Wu; Xiangtai Li; Chenyang Si; Shangchen Zhou; Jingkang Yang; Jiangning Zhang; Yining Li; Kai Chen; Yunhai Tong; Ziwei Liu; Chen Change Loy; |
88 | What You See Is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs. Highlight: In this work, we propose techniques to scale neural volume rendering to the much higher resolution of native 2D images, thereby resolving fine-grained 3D geometry with unprecedented detail. |
Alex Trevithick; Matthew Chan; Towaki Takikawa; Umar Iqbal; Shalini De Mello; Manmohan Chandraker; Ravi Ramamoorthi; Koki Nagano; |
89 | OMG-Seg: Is One Model Good Enough For All Segmentation? Highlight: In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. |
Xiangtai Li; Haobo Yuan; Wei Li; Henghui Ding; Size Wu; Wenwei Zhang; Yining Li; Kai Chen; Chen Change Loy; |
90 | Diffusion Model Alignment Using Direct Preference Optimization. Highlight: We propose Diffusion-DPO, a method to align diffusion models to human preferences by directly optimizing on human comparison data. (See the sketch below this entry.) |
Bram Wallace; Meihua Dang; Rafael Rafailov; Linqi Zhou; Aaron Lou; Senthil Purushwalkam; Stefano Ermon; Caiming Xiong; Shafiq Joty; Nikhil Naik; |
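To give a flavor of how DPO transfers to diffusion models: the denoising error (relative to a frozen reference model) plays the role of an implicit reward, and a logistic loss pushes the trained model to denoise preferred images better than dispreferred ones. Below is a hedged sketch; the input convention (per-sample MSE denoising errors) and `beta` are assumptions, and the paper's exact derivation includes further details.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(err_theta_win, err_ref_win,
                       err_theta_lose, err_ref_lose, beta=0.1):
    """Sketch of a DPO-style objective for diffusion models.

    Each argument is a per-sample denoising error (e.g., MSE between
    predicted and true noise) from the trained model (`theta`) or the
    frozen reference model (`ref`) on the preferred ('win') and
    dispreferred ('lose') images of each comparison pair."""
    # A lower denoising error than the reference acts as a higher implicit reward.
    win_margin = err_theta_win - err_ref_win
    lose_margin = err_theta_lose - err_ref_lose
    # Encourage the model to improve more on preferred than dispreferred samples.
    return -F.logsigmoid(-beta * (win_margin - lose_margin)).mean()
```

The appeal of this formulation is that it needs only offline pairwise preferences, with no separately trained reward model.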
91 | Generative Image Dynamics. Highlight: We present an approach to modeling an image-space prior on scene motion. |
Zhengqi Li; Richard Tucker; Noah Snavely; Aleksander Holynski; |
92 | GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis. Highlight: We present a new approach, termed GPS-Gaussian, for synthesizing novel views of a character in a real-time manner. |
Shunyuan Zheng; Boyao Zhou; Ruizhi Shao; Boning Liu; Shengping Zhang; Liqiang Nie; Yebin Liu; |
93 | Pixel-Aligned Language Model. Highlight: In this work, we aim to develop a vision-language model that can take locations, for example, a set of points or boxes, as either inputs or outputs. |
Jiarui Xu; Xingyi Zhou; Shen Yan; Xiuye Gu; Anurag Arnab; Chen Sun; Xiaolong Wang; Cordelia Schmid; |
94 | Depth Anything: Unleashing The Power of Large-Scale Unlabeled Data. Highlight: This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. |
Lihe Yang; Bingyi Kang; Zilong Huang; Xiaogang Xu; Jiashi Feng; Hengshuang Zhao; |
95 | VideoCon: Robust Video-Language Alignment Via Contrast Captions. Highlight: To this end, we introduce VideoCon, a video-language alignment dataset constructed by a large language model that generates plausible contrast video captions and explanations for differences between original and contrast video captions. |
Hritik Bansal; Yonatan Bitton; Idan Szpektor; Kai-Wei Chang; Aditya Grover; |
96 | PixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction. Highlight: We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D radiance fields, parameterized by 3D Gaussian primitives, from pairs of images. |
David Charatan; Sizhe Lester Li; Andrea Tagliasacchi; Vincent Sitzmann; |
97 | On Scaling Up A Multilingual Vision and Language Model. Highlight: We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of the components and the breadth of its training task mixture. |
Xi Chen; Josip Djolonga; Piotr Padlewski; Basil Mustafa; Soravit Changpinyo; Jialin Wu; Carlos Riquelme Ruiz; Sebastian Goodman; Xiao Wang; Yi Tay; Siamak Shakeri; Mostafa Dehghani; Daniel Salz; Mario Lucic; Michael Tschannen; Arsha Nagrani; Hexiang Hu; Mandar Joshi; Bo Pang; Ceslee Montgomery; Paulina Pietrzyk; Marvin Ritter; AJ Piergiovanni; Matthias Minderer; Filip Pavetic; Austin Waters; Gang Li; Ibrahim Alabdulmohsin; Lucas Beyer; Julien Amelot; Kenton Lee; Andreas Peter Steiner; Yang Li; Daniel Keysers; Anurag Arnab; Yuanzhong Xu; Keran Rong; Alexander Kolesnikov; Mojtaba Seyedhosseini; Anelia Angelova; Xiaohua Zhai; Neil Houlsby; Radu Soricut; |
98 | Visual In-Context Prompting. Highlight: In this paper, we introduce a universal visual in-context prompting framework for both tasks, as shown in Fig. 1. |
Feng Li; Qing Jiang; Hao Zhang; Tianhe Ren; Shilong Liu; Xueyan Zou; Huaizhe Xu; Hongyang Li; Jianwei Yang; Chunyuan Li; Lei Zhang; Jianfeng Gao; |
99 | AssistGUI: Task-Oriented PC Graphical User Interface Automation. Highlight: This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks. |
Difei Gao; Lei Ji; Zechen Bai; Mingyu Ouyang; Peiran Li; Dongxing Mao; Qinchen Wu; Weichen Zhang; Peiyi Wang; Xiangwu Guo; Hengxu Wang; Luowei Zhou; Mike Zheng Shou; |
100 | CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation. Highlight: In this work, we introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP, for the intricate task of semantic segmentation. (See the sketch below this entry.) |
Seokju Cho; Heeseong Shin; Sunghwan Hong; Anurag Arnab; Paul Hongsuck Seo; Seungryong Kim; |
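The "cost" in a cost-based approach of this kind is typically a similarity volume between dense image features and class text embeddings. Here is a minimal sketch of constructing such a volume; the shapes are assumptions, and the paper's learned spatial and class-wise aggregation stages, which operate on this cost, are omitted.

```python
import torch
import torch.nn.functional as F

def clip_cost_volume(image_feats, text_embeds):
    """Sketch: cosine-similarity cost volume between dense CLIP image
    features and class text embeddings.

    Assumed shapes: image_feats (B, D, H, W), text_embeds (T, D)."""
    img = F.normalize(image_feats, dim=1)  # normalize along the feature dim
    txt = F.normalize(text_embeds, dim=1)
    # One similarity map per class prompt: output shape (B, T, H, W).
    return torch.einsum("bdhw,td->bthw", img, txt)
```

Segmentation logits then come from refining this per-class similarity volume rather than from a fixed classifier head, which is what makes the approach open-vocabulary.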
101 | Learning Coupled Dictionaries from Unpaired Data for Image Super-Resolution. Highlight: In this paper, we circumvent the difficulty of image generation and propose an alternative to build the connection between unpaired images in a compact proxy space. |
Longguang Wang; Juncheng Li; Yingqian Wang; Qingyong Hu; Yulan Guo; |
102 | FreeU: Free Lunch in Diffusion U-Net. Highlight: In this paper, we uncover the untapped potential of the diffusion U-Net, which serves as a "free lunch" that substantially improves the generation quality on the fly. (See the sketch below this entry.) |
Chenyang Si; Ziqi Huang; Yuming Jiang; Ziwei Liu; |
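The "free lunch" here amounts to re-weighting the U-Net's internal signals at inference time: strengthening backbone features and damping the low-frequency content of skip connections before they are fused. The sketch below illustrates that idea; the scaling factors, the half-channel convention, and the frequency threshold are illustrative assumptions rather than the paper's exact settings.

```python
import torch

def freeu_style_fusion(backbone_feat, skip_feat, b=1.2, s=0.9, thresh=1):
    """Sketch of a FreeU-style fusion step inside a diffusion U-Net decoder.
    backbone_feat, skip_feat: (B, C, H, W) feature maps to be concatenated."""
    # Amplify the backbone contribution on a subset of channels.
    half = backbone_feat.shape[1] // 2
    backbone_feat = backbone_feat.clone()
    backbone_feat[:, :half] = backbone_feat[:, :half] * b

    # Attenuate low spatial frequencies of the skip connection via FFT,
    # which suppresses redundant low-frequency detail passed around the backbone.
    fft = torch.fft.fftshift(torch.fft.fft2(skip_feat.float()), dim=(-2, -1))
    B, C, H, W = skip_feat.shape
    mask = torch.ones((B, C, H, W), device=skip_feat.device)
    ch, cw = H // 2, W // 2
    mask[..., ch - thresh:ch + thresh, cw - thresh:cw + thresh] = s
    damped = torch.fft.ifft2(torch.fft.ifftshift(fft * mask, dim=(-2, -1))).real

    return torch.cat([backbone_feat, damped.to(backbone_feat.dtype)], dim=1)
```

Because the adjustment is purely an inference-time rescaling, it adds no parameters and requires no retraining.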
103 | Generative Powers of Ten. Highlight: We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. |
Xiaojuan Wang; Janne Kontkanen; Brian Curless; Steven M. Seitz; Ira Kemelmacher-Shlizerman; Ben Mildenhall; Pratul Srinivasan; Dor Verbin; Aleksander Holynski; |
104 | Tune-An-Ellipse: CLIP Has Potential to Find What You Want. Highlight: Our novel, simple yet effective approach, i.e., Differentiable Visual Prompting, enables CLIP to localize in a zero-shot manner: given an image and a text prompt describing an object, we first pick a rendered ellipse from uniformly distributed anchor ellipses on the image grid via visual prompting, then use three loss functions to tune the ellipse coefficients to gradually encapsulate the target region. |
Jinheng Xie; Songhe Deng; Bing Li; Haozhe Liu; Yawen Huang; Yefeng Zheng; Jurgen Schmidhuber; Bernard Ghanem; Linlin Shen; Mike Zheng Shou; |
105 | HUGS: Human Gaussian Splats. Highlight: In this work, we introduce Human Gaussian Splats (HUGS), which represents an animatable human together with the scene using 3D Gaussian Splatting (3DGS). |
Muhammed Kocabas; Jen-Hao Rick Chang; James Gabriel; Oncel Tuzel; Anurag Ranjan; |
106 | Adversarial Distillation Based on Slack Matching and Attribution Region Alignment. Highlight: During the training process, we help the student model better understand the teacher model’s behavior by aligning the attribution region that the student model focuses on with that of the teacher model. Concurrently, we relax the condition of exact matching in KL divergence and replace it with a more flexible matching criterion, thereby enhancing the model’s robustness. |
Shenglin Yin; Zhen Xiao; Mingxuan Song; Jieyi Long; |
107 | SEED-Bench: Benchmarking Multimodal Large Language Models. Highlight: A comprehensive benchmark is imperative for investigating the progress and uncovering the limitations of current MLLMs. In this work, we categorize the capabilities of MLLMs into hierarchical levels, from L_0 to L_4, based on the modalities they can accept and generate, and propose SEED-Bench, a comprehensive benchmark that evaluates the hierarchical capabilities of MLLMs. |
Bohao Li; Yuying Ge; Yixiao Ge; Guangzhi Wang; Rui Wang; Ruimao Zhang; Ying Shan; |
108 | HybridNeRF: Efficient Neural Rendering Via Adaptive Volumetric Surfaces. Highlight: We propose HybridNeRF, a method that leverages the strengths of both representations by rendering most objects as surfaces while modeling the (typically) small fraction of challenging regions volumetrically. |
Haithem Turki; Vasu Agrawal; Samuel Rota Bulò; Lorenzo Porzi; Peter Kontschieder; Deva Ramanan; Michael Zollhöfer; Christian Richardt; |
109 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts. Highlight: In this work, we present Omni-SMoLA, a multimodal architecture that mixes many multimodal experts efficiently and achieves both high specialist and generalist performance. |
Jialin Wu; Xia Hu; Yaqing Wang; Bo Pang; Radu Soricut; |
110 | MobileCLIP: Fast Image-Text Models Through Multi-Modal Reinforced Training. Highlight: In this work, we introduce MobileCLIP – a new family of efficient image-text models optimized for runtime performance, along with a novel and efficient training approach, namely multi-modal reinforced training. |
Pavan Kumar Anasosalu Vasu; Hadi Pouransari; Fartash Faghri; Raviteja Vemulapalli; Oncel Tuzel; |
111 | PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation. Highlight: In this research, we introduce a novel universal proposition learning approach, called panoramic renal pathology segmentation (PrPSeg), designed to comprehensively segment panoramic structures within the kidney by integrating extensive knowledge of kidney anatomy. |
Ruining Deng; Quan Liu; Can Cui; Tianyuan Yao; Jialin Yue; Juming Xiong; Lining Yu; Yifei Wu; Mengmeng Yin; Yu Wang; Shilin Zhao; Yucheng Tang; Haichun Yang; Yuankai Huo; |
112 | LISA: Reasoning Segmentation Via Large Language Model. Highlight: In this work, we propose a new segmentation task — reasoning segmentation. |
Xin Lai; Zhuotao Tian; Yukang Chen; Yanwei Li; Yuhui Yuan; Shu Liu; Jiaya Jia; |
113 | CityDreamer: Compositional Generative Model of Unbounded 3D Cities. Highlight: Additionally, generating 3D cities is more complex than generating 3D natural scenes, since buildings, as objects of the same class, exhibit a wider range of appearances compared to the relatively consistent appearance of objects like trees in natural scenes. To address these challenges, we propose CityDreamer, a compositional generative model designed specifically for unbounded 3D cities. |
Haozhe Xie; Zhaoxi Chen; Fangzhou Hong; Ziwei Liu; |
114 | Putting The Object Back Into Video Object Segmentation. Highlight: We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. |
Ho Kei Cheng; Seoung Wug Oh; Brian Price; Joon-Young Lee; Alexander Schwing; |
115 | Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis. Highlight: Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. |
Willi Menapace; Aliaksandr Siarohin; Ivan Skorokhodov; Ekaterina Deyneka; Tsai-Shien Chen; Anil Kag; Yuwei Fang; Aleksei Stoliar; Elisa Ricci; Jian Ren; Sergey Tulyakov; |
116 | OpenEQA: Embodied Question Answering in The Era of Foundation Models. Highlight: We present a modern formulation of Embodied Question Answering (EQA) as the task of understanding an environment well enough to answer questions about it in natural language. |
Arjun Majumdar; Anurag Ajay; Xiaohan Zhang; Pranav Putta; Sriram Yenamandra; Mikael Henaff; Sneha Silwal; Paul Mcvay; Oleksandr Maksymets; Sergio Arnaud; Karmesh Yadav; Qiyang Li; Ben Newman; Mohit Sharma; Vincent Berges; Shiqi Zhang; Pulkit Agrawal; Yonatan Bisk; Dhruv Batra; Mrinal Kalakrishnan; Franziska Meier; Chris Paxton; Alexander Sax; Aravind Rajeswaran; |
117 | OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation Highlight: In this work, we reexamine the design distinctions and test the limits of what a sparse CNN can achieve. |
Bohao Peng; Xiaoyang Wu; Li Jiang; Yukang Chen; Hengshuang Zhao; Zhuotao Tian; Jiaya Jia; |
118 | Unified Language-driven Zero-shot Domain Adaptation Highlight: We identify the constraints in the existing language-driven zero-shot domain adaptation task, particularly the requirement for domain IDs and domain-specific models, which may restrict flexibility and scalability. To overcome these issues, we propose a new framework for unified language-driven domain adaptation (ULDA), consisting of Hierarchical Context Alignment (HCA), Domain Consistent Representation Learning (DCRL), and Text-Driven Rectifier (TDR). |
Senqiao Yang; Zhuotao Tian; Li Jiang; Jiaya Jia; |
119 | Federated Online Adaptation for Deep Stereo Highlight: We introduce a novel approach for adapting deep stereo networks in a collaborative manner. |
Matteo Poggi; Fabio Tosi; |
120 | Multiplane Prior Guided Few-Shot Aerial Scene Rendering Highlight: The acquisition of dense aerial views is often prohibitive, as unmanned aerial vehicles (UAVs) may face constraints in perspective range and energy. In this work, we introduce Multiplane Prior guided NeRF (MPNeRF), a novel approach tailored for few-shot aerial scene rendering, marking a pioneering effort in this domain. |
Zihan Gao; Licheng Jiao; Lingling Li; Xu Liu; Fang Liu; Puhua Chen; Yuwei Guo; |
121 | TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models Highlight: In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or external modules. |
Haomiao Ni; Bernhard Egger; Suhas Lohit; Anoop Cherian; Ye Wang; Toshiaki Koike-Akino; Sharon X. Huang; Tim K. Marks; |
122 | Probing The 3D Awareness of Visual Foundation Models Highlight: In this work, we analyze the 3D awareness of visual foundation models. |
Mohamed El Banani; Amit Raj; Kevis-Kokitsi Maninis; Abhishek Kar; Yuanzhen Li; Michael Rubinstein; Deqing Sun; Leonidas Guibas; Justin Johnson; Varun Jampani; |
123 | GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians Highlight: We introduce GaussianAvatars, a new method to create photorealistic head avatars that are fully controllable in terms of expression, pose, and viewpoint. |
Shenhan Qian; Tobias Kirschstein; Liam Schoneveld; Davide Davoli; Simon Giebenhain; Matthias Nießner; |
124 | Sieve: Multimodal Dataset Pruning Using Image Captioning Models Highlight: We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs to evaluate the alignment of noisy image-text pairs. |
Anas Mahmoud; Mostafa Elhoushi; Amro Abbas; Yu Yang; Newsha Ardalani; Hugh Leather; Ari S. Morcos; |
125 | Optimizing Diffusion Noise Can Serve As Universal Motion Priors Highlight: We propose Diffusion Noise Optimization (DNO), a new method that effectively leverages existing motion diffusion models as motion priors for a wide range of motion-related tasks. |
Korrawe Karunratanakul; Konpat Preechakul; Emre Aksan; Thabo Beeler; Supasorn Suwajanakorn; Siyu Tang; |
126 | Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection Highlight: Previous techniques mitigate this by reweighting these boxes as pseudo labels, but these boxes can still poison the training process. To resolve this problem, we propose a novel pseudo label refinery framework. |
Zhanwei Zhang; Minghao Chen; Shuai Xiao; Liang Peng; Hengjia Li; Binbin Lin; Ping Li; Wenxiao Wang; Boxi Wu; Deng Cai; |
127 | GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces Highlight: In this paper, we present GaussianShader, a novel method that applies a simplified shading function on 3D Gaussians to enhance the neural rendering in scenes with reflective surfaces while preserving the training and rendering efficiency. |
Yingwenqi Jiang; Jiadong Tu; Yuan Liu; Xifeng Gao; Xiaoxiao Long; Wenping Wang; Yuexin Ma; |
128 | Sequential Modeling Enables Scalable Learning for Large Vision Models Highlight: We introduce a novel sequential modeling approach, which enables learning a Large Vision Model (LVM) without making use of any linguistic data. |
Yutong Bai; Xinyang Geng; Karttikeya Mangalam; Amir Bar; Alan L. Yuille; Trevor Darrell; Jitendra Malik; Alexei A. Efros; |
129 | InstructDiffusion: A Generalist Modeling Interface for Vision Tasks Highlight: We present InstructDiffusion, a unified and generic framework for aligning computer vision tasks with human instructions. |
Zigang Geng; Binxin Yang; Tiankai Hang; Chen Li; Shuyang Gu; Ting Zhang; Jianmin Bao; Zheng Zhang; Houqiang Li; Han Hu; Dong Chen; Baining Guo; |
130 | Seeing The World Through Your Eyes Highlight: In this paper, we reconstruct a radiance field beyond the camera’s line of sight using portrait images containing eye reflections. |
Hadi Alzayer; Kevin Zhang; Brandon Feng; Christopher A. Metzler; Jia-Bin Huang; |
131 | Learning Occupancy for Monocular 3D Object Detection Highlight: In this paper, we propose OccupancyM3D, a method of learning occupancy for monocular 3D detection. |
Liang Peng; Junkai Xu; Haoran Cheng; Zheng Yang; Xiaopei Wu; Wei Qian; Wenxiao Wang; Boxi Wu; Deng Cai; |
132 | VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models Highlight: In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. |
Haoxin Chen; Yong Zhang; Xiaodong Cun; Menghan Xia; Xintao Wang; Chao Weng; Ying Shan; |
133 | VBench: Comprehensive Benchmark Suite for Video Generative Models Highlight: To this end, we present VBench, a comprehensive benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. |
Ziqi Huang; Yinan He; Jiashuo Yu; Fan Zhang; Chenyang Si; Yuming Jiang; Yuanhan Zhang; Tianxing Wu; Qingyang Jin; Nattapol Chanpaisit; Yaohui Wang; Xinyuan Chen; Limin Wang; Dahua Lin; Yu Qiao; Ziwei Liu; |
134 | Control4D: Efficient 4D Portrait Editing with Text Highlight: We introduce Control4D, an innovative framework for editing dynamic 4D portraits using text instructions. |
Ruizhi Shao; Jingxiang Sun; Cheng Peng; Zerong Zheng; Boyao Zhou; Hongwen Zhang; Yebin Liu; |
135 | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding Highlight: For long videos, the computational complexity, memory cost, and long-term temporal connections impose additional challenges. Taking advantage of the Atkinson-Shiffrin memory model, with tokens in Transformers employed as the carriers of memory, in combination with our specially designed memory mechanism, we propose MovieChat to overcome these challenges. |
Enxin Song; Wenhao Chai; Guanhong Wang; Yucheng Zhang; Haoyang Zhou; Feiyang Wu; Haozhe Chi; Xun Guo; Tian Ye; Yanting Zhang; Yan Lu; Jenq-Neng Hwang; Gaoang Wang; |
136 | Self-correcting LLM-controlled Diffusion Models Highlight: In contrast to existing models that aim to generate images only with their best effort, we introduce Self-correcting LLM-controlled Diffusion (SLD). |
Tsung-Han Wu; Long Lian; Joseph E. Gonzalez; Boyi Li; Trevor Darrell; |
137 | See, Say and Segment: Teaching LMMs to Overcome False Premises Highlight: We observe that existing methods that fine-tune an LMM to segment images significantly degrade its ability to reliably determine ("see") whether an object is present and to interact naturally with humans ("say"), a form of catastrophic forgetting. In this work, we propose a cascading and joint training approach for LMMs to solve this task, avoiding catastrophic forgetting of previous skills. |
Tsung-Han Wu; Giscard Biamby; David Chan; Lisa Dunlap; Ritwik Gupta; Xudong Wang; Joseph E. Gonzalez; Trevor Darrell; |
138 | Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution Highlight: Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. |
Shangchen Zhou; Peiqing Yang; Jianyi Wang; Yihang Luo; Chen Change Loy; |
139 | Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects Highlight: We address the problem of building digital twins of unknown articulated objects from two RGBD scans of the object at different articulation states. |
Yijia Weng; Bowen Wen; Jonathan Tremblay; Valts Blukis; Dieter Fox; Leonidas Guibas; Stan Birchfield; |
140 | SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training Highlight: We discover that the lack of consideration for sample-wise affinity consistency across modalities in existing training objectives is the central cause. To address this problem, we propose a novel loss function named Sample-wise affinity Consistency (SaCo) loss, which is designed to enhance such consistency by minimizing the distance between image embedding similarity and text embedding similarity for any two samples. |
Sitong Wu; Haoru Tan; Zhuotao Tian; Yukang Chen; Xiaojuan Qi; Jiaya Jia; |
141 | Prompt Highlighter: Interactive Control for Multi-Modal LLMs Highlight: While manipulating prompt formats could improve outputs, designing specific and precise prompts per task can be challenging and ineffective. To tackle this issue, we introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation. |
Yuechen Zhang; Shengju Qian; Bohao Peng; Shu Liu; Jiaya Jia; |
142 | Intelligent Grimm – Open-ended Visual Storytelling Via Latent Diffusion Models Highlight: In this work, we focus on a novel yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling. |
Chang Liu; Haoning Wu; Yujie Zhong; Xiaoyun Zhang; Yanfeng Wang; Weidi Xie; |
143 | MagicAnimate: Temporally Consistent Human Image Animation Using Diffusion Model Highlight: In this work, we introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, faithfully preserving the reference image, and improving animation fidelity. |
Zhongcong Xu; Jianfeng Zhang; Jun Hao Liew; Hanshu Yan; Jia-Wei Liu; Chenxu Zhang; Jiashi Feng; Mike Zheng Shou; |
144 | ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering Highlight: While recent advances in neural implicit rendering have unlocked unprecedented photorealism for digital avatars, real-time performance has mostly been demonstrated for static scenes only. To address this, we propose ASH, an animatable Gaussian splatting approach for photorealistic rendering of dynamic humans in real time. |
Haokai Pang; Heming Zhu; Adam Kortylewski; Christian Theobalt; Marc Habermann; |
145 | THRONE: An Object-based Hallucination Benchmark for The Free-form Generations of Large Vision-Language Models Highlight: In practice, we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations; rather, the two forms of hallucinations are often anti-correlated. To address this, we propose THRONE, a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. |
Prannay Kaul; Zhizhong Li; Hao Yang; Yonatan Dukler; Ashwin Swaminathan; C. J. Taylor; Stefano Soatto; |
146 | EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation Highlight: However, these decoding mechanisms usually come with high computational costs. To address this concern, we introduce EMCAD, a new efficient multi-scale convolutional attention decoder designed to optimize both performance and computational efficiency. |
Md Mostafijur Rahman; Mustafa Munir; Radu Marculescu; |
147 | OneLLM: One Framework to Align All Modalities with Language Highlight: In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. |
Jiaming Han; Kaixiong Gong; Yiyuan Zhang; Jiaqi Wang; Kaipeng Zhang; Dahua Lin; Yu Qiao; Peng Gao; Xiangyu Yue; |
148 | Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation Highlight: In this work, we propose to decouple video-level referring expression understanding into static and motion perception, with a specific emphasis on enhancing temporal comprehension. |
Shuting He; Henghui Ding; |
149 | Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning Highlight: In this paper, we propose ExpAndable Subspace Ensemble (EASE) for pre-trained model (PTM)-based class-incremental learning (CIL). |
Da-Wei Zhou; Hai-Long Sun; Han-Jia Ye; De-Chuan Zhan; |
150 | IQ-VFI: Implicit Quadratic Motion Estimation for Video Frame Interpolation Highlight: To this end, we propose a novel framework for implicit quadratic video frame interpolation (IQ-VFI), which explores latent acceleration information and accurate intermediate motions via knowledge distillation. |
Mengshun Hu; Kui Jiang; Zhihang Zhong; Zheng Wang; Yinqiang Zheng; |
151 | Transcriptomics-guided Slide Representation Learning in Computational Pathology Highlight: Here, we leverage complementary information from gene expression profiles to guide slide representation learning using multi-modal pre-training. |
Guillaume Jaume; Lukas Oldenburg; Anurag Vaidya; Richard J. Chen; Drew F.K. Williamson; Thomas Peeters; Andrew H. Song; Faisal Mahmood; |
152 | Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction Highlight: However, this multimodal task is particularly challenging due to the different nature of these data: whole-slide images (WSIs) represent a very high-dimensional spatial description of a tumor, while bulk transcriptomics represent a global description of gene expression levels within that tumor. In this context, our work aims to address two key challenges: (1) how can we tokenize transcriptomics in a semantically meaningful and interpretable way? |
Guillaume Jaume; Anurag Vaidya; Richard J. Chen; Drew F.K. Williamson; Paul Pu Liang; Faisal Mahmood; |
153 | RAM-Avatar: Real-time Photo-Realistic Avatar from Monocular Videos with Full-body Control Highlight: To enable robust animation for out-of-distribution poses, we propose a Motion Distribution Align module to compensate for the discrepancies between the training and testing motion distributions. |
Xiang Deng; Zerong Zheng; Yuxiang Zhang; Jingxiang Sun; Chao Xu; Xiaodong Yang; Lizhen Wang; Yebin Liu; |
154 | EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything Highlight: While beneficial, the huge computational cost of the SAM model has limited its wider real-world application. To address this limitation, we propose EfficientSAMs, light-weight SAM models that exhibit decent performance with largely reduced complexity. |
Yunyang Xiong; Bala Varadarajan; Lemeng Wu; Xiaoyu Xiang; Fanyi Xiao; Chenchen Zhu; Xiaoliang Dai; Dilin Wang; Fei Sun; Forrest Iandola; Raghuraman Krishnamoorthi; Vikas Chandra; |
155 | PerceptionGPT: Effectively Fusing Visual Perception Into LLM Highlight: In this paper, we present a novel end-to-end framework named PerceptionGPT, which represents the perception signals using the LLM’s dynamic token embedding. |
Renjie Pi; Lewei Yao; Jiahui Gao; Jipeng Zhang; Tong Zhang; |
156 | Can I Trust Your Answer? Visually Grounded Video Question Answering Highlight: Experiments with different backbones demonstrate that this grounding mechanism improves both grounding and QA. With these efforts, we aim to push towards trustworthy VLMs in VQA systems. |
Junbin Xiao; Angela Yao; Yicong Li; Tat-Seng Chua; |
157 | Digital Life Project: Autonomous 3D Characters with Social Intelligence Highlight: In this work, we present Digital Life Project, a framework utilizing language as the universal medium to build autonomous 3D characters capable of engaging in social interactions and expressing themselves with articulated body motions, thereby simulating life in a digital environment. |
Zhongang Cai; Jianping Jiang; Zhongfei Qing; Xinying Guo; Mingyuan Zhang; Zhengyu Lin; Haiyi Mei; Chen Wei; Ruisi Wang; Wanqi Yin; Liang Pan; Xiangyu Fan; Han Du; Peng Gao; Zhitao Yang; Yang Gao; Jiaqi Li; Tianxiang Ren; Yukun Wei; Xiaogang Wang; Chen Change Loy; Lei Yang; Ziwei Liu; |
158 | VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation Highlight: We present VideoCutLER, a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos. |
Xudong Wang; Ishan Misra; Ziyun Zeng; Rohit Girdhar; Trevor Darrell; |
159 | Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models Highlight: While current MLLMs demonstrate primary low-level visual abilities, from the identification of low-level visual attributes (e.g., clarity, brightness) to the evaluation of image quality, there is still an imperative to further improve their accuracy to substantially alleviate human burdens. To address this, we collect the first dataset consisting of human natural-language feedback on low-level vision. |
Haoning Wu; Zicheng Zhang; Erli Zhang; Chaofeng Chen; Liang Liao; Annan Wang; Kaixin Xu; Chunyi Li; Jingwen Hou; Guangtao Zhai; Geng Xue; Wenxiu Sun; Qiong Yan; Weisi Lin; |
160 | FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations Highlight: We present a generative approach to forecast long-term future human behavior in 3D, requiring only weak supervision from readily available 2D human action data. |
Christian Diller; Thomas Funkhouser; Angela Dai; |
161 | CG-HOI: Contact-Guided 3D Human-Object Interaction Generation Highlight: We propose CG-HOI, the first method to address the task of generating dynamic 3D human-object interactions (HOIs) from text. |
Christian Diller; Angela Dai; |
162 | 3DGS-Avatar: Animatable Avatars Via Deformable 3D Gaussian Splatting Highlight: We introduce an approach that creates animatable human avatars from monocular videos using 3D Gaussian Splatting (3DGS). |
Zhiyin Qian; Shaofei Wang; Marko Mihajlovic; Andreas Geiger; Siyu Tang; |
163 | Grounded Question-Answering in Long Egocentric Videos Highlight: In this paper, we delve into open-ended question-answering (QA) in long egocentric videos, which allows individuals or robots to inquire about their own past visual experiences. |
Shangzhe Di; Weidi Xie; |
164 | TCP: Textual-based Class-aware Prompt Tuning for Visual-Language Model Highlight: However, those textual tokens have limited generalization ability regarding unseen domains, as they cannot dynamically adjust to the distribution of testing classes. To tackle this issue, we present a novel Textual-based Class-aware Prompt tuning (TCP) that explicitly incorporates prior knowledge about classes to enhance their discriminability. |
Hantao Yao; Rui Zhang; Changsheng Xu; |
165 | VCoder: Versatile Vision Encoders for Multimodal Large Language Models Highlight: Working towards developing an accurate MLLM system for perception and reasoning, we propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs. |
Jitesh Jain; Jianwei Yang; Humphrey Shi; |
166 | Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following Highlight: In this work, we introduce a semantic panel as middleware in decoding texts to images, helping the generator better follow instructions. |
Yutong Feng; Biao Gong; Di Chen; Yujun Shen; Yu Liu; Jingren Zhou; |
167 | In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing Highlight: Our core idea is to represent the image using two individual neural radiance fields: one for the in-distribution content and the other for the out-of-distribution object. |
Yiran Xu; Zhixin Shu; Cameron Smith; Seoung Wug Oh; Jia-Bin Huang; |
168 | SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes Highlight: To this end, we introduce SceneFun3D, a large-scale dataset with more than 14.8k highly accurate interaction annotations for 710 high-resolution real-world 3D indoor scenes. |
Alexandros Delitzas; Ayca Takmaz; Federico Tombari; Robert Sumner; Marc Pollefeys; Francis Engelmann; |
169 | GenZI: Zero-Shot 3D Human-Scene Interaction Generation Highlight: We propose GenZI, the first zero-shot approach to generating 3D human-scene interactions. |
Lei Li; Angela Dai; |
170 | Instruct-Imagen: Image Generation with Multi-modal Instruction Highlight: This paper presents Instruct-Imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. |
Hexiang Hu; Kelvin C.K. Chan; Yu-Chuan Su; Wenhu Chen; Yandong Li; Kihyuk Sohn; Yang Zhao; Xue Ben; Boqing Gong; William Cohen; Ming-Wei Chang; Xuhui Jia; |
171 | DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation By Combining 3D GANs and Diffusion Priors Highlight: In this paper, we propose a novel framework, DiffusionGAN3D, which boosts text-guided 3D domain adaptation and generation by combining 3D GANs and diffusion priors. |
Biwen Lei; Kai Yu; Mengyang Feng; Miaomiao Cui; Xuansong Xie; |
172 | Honeybee: Locality-enhanced Projector for Multimodal LLM Highlight: In this study, we first identify two essential projector properties: (i) flexibility in managing the number of visual tokens, crucial for MLLMs’ overall efficiency, and (ii) preservation of local context from visual features, vital for spatial understanding. Based on these findings, we propose a novel projector design that is both flexible and locality-enhanced, effectively satisfying the two desirable properties. |
Junbum Cha; Wooyoung Kang; Jonghwan Mun; Byungseok Roh; |
173 | CapsFusion: Rethinking Image-Text Data at Scale Highlight: To provide higher-quality and more scalable multimodal pretraining data, we propose CapsFusion, an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions. |
Qiying Yu; Quan Sun; Xiaosong Zhang; Yufeng Cui; Fan Zhang; Yue Cao; Xinlong Wang; Jingjing Liu; |
174 | Unsupervised Universal Image Segmentation Highlight: We propose an Unsupervised Universal Segmentation model (U2Seg) adept at performing various image segmentation tasks (instance, semantic, and panoptic) using a novel unified framework. |
Dantong Niu; Xudong Wang; Xinyang Han; Long Lian; Roei Herzig; Trevor Darrell; |
175 | Feedback-Guided Autonomous Driving Highlight: In contrast, learning in humans often involves additional detailed guidance throughout the interactive learning process, where feedback, often via language, provides detailed information as to which part of a trial was performed incorrectly or suboptimally, and why. Motivated by this observation, we introduce an efficient feedback-based framework for improving behavior-cloning-based training of sensorimotor driving agents. |
Jimuyang Zhang; Zanming Huang; Arijit Ray; Eshed Ohn-Bar; |
176 | SPIN: Simultaneous Perception, Interaction and Navigation Highlight: This causes several limitations, such as compounding errors, delays in decision-making, and a lack of whole-body coordination. In this work, we present a reactive mobile manipulation framework that uses an active visual system to consciously perceive and react to its environment. |
Shagun Uppal; Ananye Agarwal; Haoyu Xiong; Kenneth Shaw; Deepak Pathak; |
177 | Neural Clustering Based Visual Representation Learning Highlight: In this work, we propose feature extraction with clustering (FEC), a conceptually elegant yet surprisingly ad-hoc interpretable neural clustering framework, which views feature extraction as a process of selecting representatives from data and thus automatically captures the underlying data distribution. |
Guikun Chen; Xia Li; Yi Yang; Wenguan Wang; |
178 | Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training Highlight: Based on this framework, we propose Prompt-driven Normalization, which adapts the model to different datasets with domain-specific prompts, and Language-guided Categorical Alignment, which unifies the label spaces of multiple datasets by leveraging the relationships between label texts. |
Xiaoyang Wu; Zhuotao Tian; Xin Wen; Bohao Peng; Xihui Liu; Kaicheng Yu; Hengshuang Zhao; |
179 | Point Transformer V3: Simpler, Faster, Stronger Highlight: Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is influenced more by scale than by intricate design. Therefore, we present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that are minor to the overall performance after scaling, such as replacing the precise neighbor search of KNN with an efficient serialized neighbor mapping of point clouds organized in specific patterns. |
Xiaoyang Wu; Li Jiang; Peng-Shuai Wang; Zhijian Liu; Xihui Liu; Yu Qiao; Wanli Ouyang; Tong He; Hengshuang Zhao; |
180 | End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames Highlight: In this paper, we reduce the memory consumption of end-to-end training and manage to scale up the TAD backbone to 1 billion parameters and the input video to 1,536 frames, leading to significantly improved detection performance. |
Shuming Liu; Chen-Lin Zhang; Chen Zhao; Bernard Ghanem; |
181 | DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection Highlight: In this paper, we introduce DetCLIPv3, a high-performing detector that excels not only at open-vocabulary object detection but also at generating hierarchical labels for detected objects. |
Lewei Yao; Renjie Pi; Jianhua Han; Xiaodan Liang; Hang Xu; Wei Zhang; Zhenguo Li; Dan Xu; |
182 | One-Prompt to Segment All Medical Images Highlight: This paper introduces a new paradigm toward universal medical image segmentation, termed ‘One-Prompt Segmentation.’ |
Junde Wu; Min Xu; |
183 | Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning Highlight: In this paper, we propose Dynamic Reversible Dual-Residual Networks, or Dr2Net, a novel family of network architectures that acts as a surrogate network to finetune a pretrained model with substantially reduced memory consumption. |
Chen Zhao; Shuming Liu; Karttikeya Mangalam; Guocheng Qian; Fatimah Zohra; Abdulmohsen Alghannam; Jitendra Malik; Bernard Ghanem; |
184 | Symphonize 3D Semantic Scene Completion with Contextual Instance Queries Highlight: In this paper, we present a novel paradigm termed Symphonies (Scene-from-Insts), which delves into the integration of instance queries to orchestrate 2D-to-3D reconstruction and 3D scene modeling. |
Haoyi Jiang; Tianheng Cheng; Naiyu Gao; Haoyang Zhang; Tianwei Lin; Wenyu Liu; Xinggang Wang; |
185 | Generative Multimodal Models Are In-Context Learners Highlight: In this work, we demonstrate that by effectively scaling up generative multimodal models, their task-agnostic in-context learning capabilities can be significantly enhanced. |
Quan Sun; Yufeng Cui; Xiaosong Zhang; Fan Zhang; Qiying Yu; Yueze Wang; Yongming Rao; Jingjing Liu; Tiejun Huang; Xinlong Wang; |
186 | ViTamin: Designing Scalable Vision Models in The Vision-Language Era Highlight: In this paper, we aim to build an evaluation protocol for vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. |
Jieneng Chen; Qihang Yu; Xiaohui Shen; Alan Yuille; Liang-Chieh Chen; |
187 | VideoLLM-online: Online Video Large Language Model for Streaming Video Highlight: In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time dialogue within a continuous video stream. |
Joya Chen; Zhaoyang Lv; Shiwei Wu; Kevin Qinghong Lin; Chenan Song; Difei Gao; Jia-Wei Liu; Ziteng Gao; Dongxing Mao; Mike Zheng Shou; |
188 | One-Shot Open Affordance Learning with Foundation Models Highlight: We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category but is expected to identify novel objects and affordances. |
Gen Li; Deqing Sun; Laura Sevilla-Lara; Varun Jampani; |
189 | ZeroNVS: Zero-Shot 360-Degree View Synthesis from A Single Image Highlight: We introduce ZeroNVS, a 3D-aware diffusion model for single-image novel view synthesis of in-the-wild scenes. |
Kyle Sargent; Zizhang Li; Tanmay Shah; Charles Herrmann; Hong-Xing Yu; Yunzhi Zhang; Eric Ryan Chan; Dmitry Lagun; Li Fei-Fei; Deqing Sun; Jiajun Wu; |
190 | G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis Highlight: We propose G-HOP, a denoising-diffusion-based generative prior for hand-object interactions that allows modeling both the 3D object and a human hand, conditioned on the object category. |
Yufei Ye; Abhinav Gupta; Kris Kitani; Shubham Tulsiani; |
191 | Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs Highlight: In this work, we investigate strengthening the awareness of video dynamics in diffusion models (DMs) for high-quality T2V generation. |
Hao Fei; Shengqiong Wu; Wei Ji; Hanwang Zhang; Tat-Seng Chua; |
192 | SyncTalk: The Devil Is in The Synchronization for Talking Head Synthesis Highlight: A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic and artificial outcomes. To address the critical issue of synchronization, identified as the "devil" in creating realistic talking heads, we introduce SyncTalk. |
Ziqiao Peng; Wentao Hu; Yue Shi; Xiangyu Zhu; Xiaomei Zhang; Hao Zhao; Jun He; Hongyan Liu; Zhaoxin Fan; |
193 | Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding Highlight: In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. |
Jin-Chuan Shi; Miao Wang; Hao-Bin Duan; Shao-Hua Guan; |
194 | LucidDreamer: Towards High-Fidelity Text-to-3D Generation Via Interval Score Matching Highlight: This paper identifies a notable deficiency in score distillation sampling (SDS): it brings inconsistent and low-quality updating directions to the 3D model, causing an over-smoothing effect. To address this, we propose a novel approach called Interval Score Matching (ISM). |
Yixun Liang; Xin Yang; Jiantao Lin; Haodong Li; Xiaogang Xu; Yingcong Chen; |
195 | Link-Context Learning for Multimodal LLMs Highlight: In this work, we propose link-context learning (LCL), which emphasizes "reasoning from cause and effect" to augment the learning capabilities of MLLMs. |
Yan Tai; Weichen Fan; Zhao Zhang; Ziwei Liu; |
196 | UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes Highlight: We propose UnScene3D, the first fully unsupervised 3D learning approach for class-agnostic 3D instance segmentation of indoor scans. |
David Rozenberszki; Or Litany; Angela Dai; |
197 | Amodal Ground Truth and Completion in The Wild Highlight: This paper studies amodal image segmentation: predicting entire object segmentation masks, including both visible and invisible (occluded) parts. |
Guanqi Zhan; Chuanxia Zheng; Weidi Xie; Andrew Zisserman; |
198 | Shadows Don’t Lie and Lines Can’t Bend! Generative Models Don’t Know Projective Geometry…for Now Highlight: This paper demonstrates that generated images have geometric features different from those of real images. |
Ayush Sarkar; Hanlin Mai; Amitabh Mahapatra; Svetlana Lazebnik; D.A. Forsyth; Anand Bhattad; |
199 | GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding Highlight: However, this approach often results in semantically identical points having dissimilar representations, leading to a high number of false negatives and introducing a semantic conflict problem. To address this issue, we propose GroupContrast, a novel approach that combines segment grouping and semantic-aware contrastive learning. |
Chengyao Wang; Li Jiang; Xiaoyang Wu; Zhuotao Tian; Bohao Peng; Hengshuang Zhao; Jiaya Jia; |
200 | UniMODE: Unified Monocular 3D Object Detection Highlight: However, involving various scenarios of data to train models poses challenges due to their significantly different characteristics, e.g., diverse geometric properties and heterogeneous domain distributions. To address these challenges, we build a detector based on the bird’s-eye-view (BEV) detection paradigm, where the explicit feature projection is beneficial to addressing the geometry learning ambiguity when employing multiple scenarios of data to train detectors. |
Zhuoling Li; Xiaogang Xu; SerNam Lim; Hengshuang Zhao; |
201 | SHINOBI: Shape and Illumination Using Neural Object Decomposition Via BRDF Optimization In-the-wild Highlight: We present SHINOBI, an end-to-end framework for the reconstruction of shape, material, and illumination from object images captured under varying lighting, pose, and background. |
Andreas Engelhardt; Amit Raj; Mark Boss; Yunzhi Zhang; Abhishek Kar; Yuanzhen Li; Deqing Sun; Ricardo Martin Brualla; Jonathan T. Barron; Hendrik P. A. Lensch; Varun Jampani; |
202 | Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models Highlight: In this work, we expose the non-smoothness of diffusion latent spaces by observing noticeable visual fluctuations resulting from minor latent variations. |
Jiayi Guo; Xingqian Xu; Yifan Pu; Zanlin Ni; Chaofei Wang; Manushree Vasu; Shiji Song; Gao Huang; Humphrey Shi; |
203 | Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis Highlight: We present Zero-Painter, a novel training-free framework for layout-conditional text-to-image synthesis that facilitates the creation of detailed and controlled imagery from textual prompts. |
Marianna Ohanyan; Hayk Manukyan; Zhangyang Wang; Shant Navasardyan; Humphrey Shi; |
204 | Brush2Prompt: Contextual Prompt Generator for Object Inpainting Highlight: In this paper, we propose a prompt suggestion model to simplify the process of prompt input. |
Mang Tik Chiu; Yuqian Zhou; Lingzhi Zhang; Zhe Lin; Connelly Barnes; Sohrab Amirghodsi; Eli Shechtman; Humphrey Shi; |
205 | DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars Highlight: We propose a diffusion-based neural renderer that leverages generic 2D priors to produce compelling images of faces. |
Tobias Kirschstein; Simon Giebenhain; Matthias Nießner; |
206 | NOPE: Novel Object Pose Estimation from A Single Image Highlight: The practicality of 3D object pose estimation remains limited for many applications due to the need for prior knowledge of a 3D model and a training period for new objects. To address this limitation, we propose an approach that takes a single image of a new object as input and predicts the relative pose of this object in new images, without prior knowledge of the object’s 3D model and without requiring training time for new objects and categories. |
Van Nguyen Nguyen; Thibault Groueix; Georgy Ponimatkin; Yinlin Hu; Renaud Marlet; Mathieu Salzmann; Vincent Lepetit; |
207 | GigaPose: Fast and Robust Novel Object Pose Estimation Via One Correspondence Highlight: We present GigaPose, a fast, robust, and accurate method for CAD-based novel object pose estimation in RGB images. |
Van Nguyen Nguyen; Thibault Groueix; Mathieu Salzmann; Vincent Lepetit; |
208 | Language-only Training of Zero-shot Composed Image Retrieval Highlight: However, existing zero-shot composed image retrieval (ZS-CIR) methods show limited backbone scalability and generalizability due to the lack of diversity of the input texts during training. We propose a novel CIR framework that uses only language for its training. |
Geonmo Gu; Sanghyuk Chun; Wonjae Kim; Yoohoon Kang; Sangdoo Yun; |
209 | RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D Highlight: In this paper, recognizing that normal and depth information effectively describes scene geometry and can be automatically estimated from images, we propose to learn a generalizable Normal-Depth diffusion model for 3D generation. |
Lingteng Qiu; Guanying Chen; Xiaodong Gu; Qi Zuo; Mutian Xu; Yushuang Wu; Weihao Yuan; Zilong Dong; Liefeng Bo; Xiaoguang Han; |
210 | NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis Highlight: To support interactions with scarcely available data, we propose an automated synthetic data pipeline. |
Nilesh Kulkarni; Davis Rempe; Kyle Genova; Abhijit Kundu; Justin Johnson; David Fouhey; Leonidas Guibas; |
211 | 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling Highlight: Here, we introduce hybrid score distillation sampling, an alternating optimization procedure that blends supervision signals from multiple pre-trained diffusion models and incorporates the benefits of each for high-fidelity text-to-4D generation. |
Sherwin Bahmani; Ivan Skorokhodov; Victor Rong; Gordon Wetzstein; Leonidas Guibas; Peter Wonka; Sergey Tulyakov; Jeong Joon Park; Andrea Tagliasacchi; David B. Lindell; |
212 | CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras Highlight: This paper establishes theoretical limits (Cramér-Rao bounds) on 3D point localization and tracking with PSF-engineered event cameras. |
Sachin Shah; Matthew A. Chan; Haoming Cai; Jingxi Chen; Sakshum Kulshrestha; Chahat Deep Singh; Yiannis Aloimonos; Christopher A. Metzler; |
213 | Rethinking The Objectives of Vector-Quantized Tokenizers for Image Synthesis Highlight: In this paper, we find that improving the reconstruction fidelity of VQ tokenizers does not necessarily improve the generation. |
Yuchao Gu; Xintao Wang; Yixiao Ge; Ying Shan; Mike Zheng Shou; |
214 | EvalCrafter: Benchmarking and Evaluating Large Video Generation Models Highlight: Thus, we propose a novel framework and pipeline for exhaustively evaluating the performance of the generated videos. |
Yaofang Liu; Xiaodong Cun; Xuebo Liu; Xintao Wang; Yong Zhang; Haoxin Chen; Yang Liu; Tieyong Zeng; Raymond Chan; Ying Shan; |
215 | Learning Multi-Dimensional Human Preference for Text-to-Image Generation Highlight: However, preference results vary when humans evaluate images along different aspects. Therefore, to learn multi-dimensional human preferences, we propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models. |
Sixian Zhang; Bohan Wang; Junqiang Wu; Yan Li; Tingting Gao; Di Zhang; Zhongyuan Wang; |
216 | HDRFlow: Real-Time HDR Video Reconstruction with Large Motions Highlight: However, they often struggle to handle large, complex motions and are computationally expensive. To address these challenges, we propose a robust and efficient flow estimator tailored for real-time HDR video reconstruction, named HDRFlow. |
Gangwei Xu; Yujin Wang; Jinwei Gu; Tianfan Xue; Xin Yang; |
217 | HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video Highlight: However, most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or rely heavily on limited 3D hand-object data, restricting their ability to scale and generalize to more unconstrained interaction settings. To address this, we introduce HOLD, the first category-agnostic method that jointly reconstructs an articulated hand and an object from a monocular interaction video. |
Zicong Fan; Maria Parelli; Maria Eleni Kadoglou; Xu Chen; Muhammed Kocabas; Michael J. Black; Otmar Hilliges; |
218 | Privacy-Preserving Optics for Enhancing Protection in Face De-Identification Highlight: While software-level solutions like face de-identification provide a good privacy/utility trade-off, they are vulnerable to sniffing attacks. In this paper, we propose a hardware-level face de-identification method to address this vulnerability. |
Jhon Lopez; Carlos Hinojosa; Henry Arguello; Bernard Ghanem; |
219 | Towards Automated Movie Trailer Generation Highlight: However, the process of creating trailers can be time-consuming and expensive. To streamline this process, we propose an automatic trailer generation framework that generates plausible trailers from a full movie by automating shot selection and composition. |
Dawit Mureja Argaw; Mattia Soldan; Alejandro Pardo; Chen Zhao; Fabian Caba Heilbron; Joon Son Chung; Bernard Ghanem; |
220 | ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding Highlight: However, the methods used by existing frameworks to curate such multimodal data, in particular language descriptions for 3D shapes, are not scalable, and the collected language descriptions are not diverse. To address this, we introduce ULIP-2, a simple yet effective tri-modal pretraining framework that leverages large multimodal models to automatically generate holistic language descriptions for 3D shapes. |
Le Xue; Ning Yu; Shu Zhang; Artemis Panagopoulou; Junnan Li; Roberto Martín-Martín; Jiajun Wu; Caiming Xiong; Ran Xu; Juan Carlos Niebles; Silvio Savarese; |
221 | ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object Highlight: In this work, we introduce generative models as a data source for synthesizing hard images that benchmark deep models’ robustness. |
Chenshuang Zhang; Fei Pan; Junmo Kim; In So Kweon; Chengzhi Mao; |
222 | BANF: Band-Limited Neural Fields for Levels of Detail Reconstruction Highlight: Existing methods that attempt to decompose neural fields in the frequency domain either resort to heuristics or require extensive modifications to the neural field architecture. We show that, via a simple modification, one can obtain low-pass-filtered neural fields, and in turn show how this can be exploited to obtain a frequency decomposition of the entire signal. |
Akhmedkhan Shabanov; Shrisudhan Govindarajan; Cody Reading; Lily Goli; Daniel Rebain; Kwang Moo Yi; Andrea Tagliasacchi; |
223 | ReconFusion: 3D Reconstruction with Diffusion Priors Highlight: We present ReconFusion to reconstruct real-world scenes using only a few photos. |
Rundi Wu; Ben Mildenhall; Philipp Henzler; Keunhong Park; Ruiqi Gao; Daniel Watson; Pratul P. Srinivasan; Dor Verbin; Jonathan T. Barron; Ben Poole; Aleksander Hołyński; |
224 | Neural Fields As Distributions: Signal Processing Beyond Euclidean Space Highlight: However, in contrast to classical discrete digital signal processing, the portfolio of tools to process such representations is still severely limited and restricted to Euclidean domains. In this paper, we address this problem by showing how a probabilistic re-interpretation of neural fields can enable their training and inference processes to become "filter-aware". |
Daniel Rebain; Soroosh Yazdani; Kwang Moo Yi; Andrea Tagliasacchi; |
225 | Ungeneralizable Examples Highlight: In this paper, we extend the concept of unlearnable data to conditional data learnability and introduce UnGeneralizable Examples (UGEs). |
Jingwen Ye; Xinchao Wang; |
226 | Distilled Datamodel with Reverse Gradient Matching Highlight: In this paper, we introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages. |
Jingwen Ye; Ruonan Yu; Songhua Liu; Xinchao Wang; |
227 | GPT4Point: A Unified Framework for Point-Language Understanding and Generation Highlight: Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation, but their understanding of the 3D world is notably deficient, limiting progress in 3D language understanding and generation. To solve this problem, we introduce GPT4Point, a groundbreaking point-language multimodal model designed specifically for unified 3D object understanding and generation within the MLLM framework. |
Zhangyang Qi; Ye Fang; Zeyi Sun; Xiaoyang Wu; Tong Wu; Jiaqi Wang; Dahua Lin; Hengshuang Zhao; |
228 | RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation Highlight: The intermittent nature of auditory signals further poses additional obstacles to inferring the goal information. To address this challenge, we present the Reflective and Imaginative Language Agent (RILA). |
Zeyuan Yang; Jiageng Liu; Peihao Chen; Anoop Cherian; Tim K. Marks; Jonathan Le Roux; Chuang Gan; |
229 | RegionGPT: Towards Region Understanding Vision Language Model Highlight: Vision language models (VLMs) have advanced rapidly through the integration of large language models (LLMs) with image-text pairs, yet they struggle with detailed regional visual understanding due to the limited spatial awareness of the vision encoder and the use of coarse-grained training data that lacks detailed, region-specific captions. To address this, we introduce RegionGPT (RGPT for short), a novel framework designed for complex region-level captioning and understanding. |
Qiushan Guo; Shalini De Mello; Hongxu Yin; Wonmin Byeon; Ka Chun Cheung; Yizhou Yu; Ping Luo; Sifei Liu; |
230 | Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening Highlight: In this paper, we introduce content-adaptive non-local convolution (CANConv), a novel method tailored for remote sensing image pansharpening. |
Yule Duan; Xiao Wu; Haoyu Deng; Liang-Jian Deng; |
231 | FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation Highlight: In this paper, we introduce FRESCO, intra-frame correspondence alongside inter-frame correspondence, to establish a more robust spatial-temporal constraint. |
Shuai Yang; Yifan Zhou; Ziwei Liu; Chen Change Loy; |
232 | Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering Highlight: The rendering is also drastically slowed down by the sequential alpha blending of more splatted Gaussians per pixel. To address these issues, we propose a multi-scale 3D Gaussian splatting algorithm, which maintains Gaussians at different scales to represent the same scene. |
Zhiwen Yan; Weng Fei Low; Yu Chen; Gim Hee Lee; |
233 | Diffusion Models Without Attention Highlight: Current methods, such as patchifying, expedite processes in UNet and Transformer architectures, but at the expense of representational capacity. Addressing this, we introduce the Diffusion State Space Model (DiffuSSM), an architecture that supplants attention mechanisms with a more scalable state space model backbone. |
Jing Nathan Yan; Jiatao Gu; Alexander M. Rush; |
234 | Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we introduce a novel approach for single-view reconstruction that efficiently generates a 3D model from a single image via feed-forward inference. |
Zi-Xin Zou; Zhipeng Yu; Yuan-Chen Guo; Yangguang Li; Ding Liang; Yan-Pei Cao; Song-Hai Zhang; |
235 | UniDepth: Universal Monocular Metric Depth Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a new model UniDepth capable of reconstructing metric 3D scenes from solely single images across domains. |
Luigi Piccinelli; Yung-Hsu Yang; Christos Sakaridis; Mattia Segu; Siyuan Li; Luc Van Gool; Fisher Yu; |
236 | SkillDiffuser: Interpretable Hierarchical Planning Via Skill Abstractions in Diffusion-Based Task Execution Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However generating coherent trajectories from high-level instructions remains challenging especially for long-range composition tasks requiring multiple sequential skills. We propose SkillDiffuser an end-to-end hierarchical planning framework integrating interpretable skill learning with conditional diffusion planning to address this problem. |
Zhixuan Liang; Yao Mu; Hengbo Ma; Masayoshi Tomizuka; Mingyu Ding; Ping Luo; |
237 | SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences. |
Yuanhui Huang; Wenzhao Zheng; Borui Zhang; Jie Zhou; Jiwen Lu; |
238 | RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a lightweight and scalable Regional Point-Language Contrastive learning framework namely RegionPLC for open-world 3D scene understanding aiming to identify and recognize open-set objects and categories. |
Jihan Yang; Runyu Ding; Weipeng Deng; Zhe Wang; Xiaojuan Qi; |
239 | Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we focus on the mainstream vision transformer incorporating patch features for patch-word alignment while addressing the resultant issue of visual patch redundancy and patch ambiguity for semantic alignment. |
Zheren Fu; Lei Zhang; Hou Xia; Zhendong Mao; |
240 | Neural Lineage Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we introduce a novel task known as neural lineage detection aiming at discovering lineage relationships between parent and child models. |
Runpeng Yu; Xinchao Wang; |
241 | Three Pillars Improving Vision Foundation Model Distillation for Lidar Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work instead of focusing only on the distillation method we study the effect of three pillars for distillation: the 3D backbone the pretrained 2D backbone and the pretraining 2D+3D dataset. |
Gilles Puy; Spyros Gidaris; Alexandre Boulch; Oriane Siméoni; Corentin Sautier; Patrick Pérez; Andrei Bursuc; Renaud Marlet; |
242 | Fooling Polarization-Based Vision Using Locally Controllable Polarizing Projection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we warn the community of the vulnerability of polarization-based vision which can be more serious than RGB-based vision. |
Zhuoxiao Li; Zhihang Zhong; Shohei Nobuhara; Ko Nishino; Yinqiang Zheng; |
243 | ASAM: Boosting Segment Anything Model with Adversarial Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces ASAM a novel methodology that amplifies SAM’s performance through adversarial tuning. |
Bo Li; Haoke Xiao; Lv Tang; |
244 | SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving. To create such a dataset we propose an automatic and scalable generation method to generate question-answer pairs knowledge graphs and rationales by instructing the combinations of LLMs and MLLMs. |
Andong Wang; Bo Wu; Sunli Chen; Zhenfang Chen; Haotian Guan; Wei-Ning Lee; Li Erran Li; Chuang Gan; |
245 | ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes ConsistDreamer – a novel framework that lifts 2D diffusion models with 3D awareness and 3D consistency thus enabling high-fidelity instruction-guided scene editing. |
Jun-Kun Chen; Samuel Rota Bulò; Norman Müller; Lorenzo Porzi; Peter Kontschieder; Yu-Xiong Wang; |
246 | Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Unified-IO 2 a multimodal and multi-skill unified model capable of following novel instructions. |
Jiasen Lu; Christopher Clark; Sangho Lee; Zichen Zhang; Savya Khosla; Ryan Marten; Derek Hoiem; Aniruddha Kembhavi; |
247 | OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work for the first time we synergize information from image text and event-data domains and introduce OpenESS to enable scalable ESS in an open-world annotation-efficient manner. |
Lingdong Kong; Youquan Liu; Lai Xing Ng; Benoit R. Cottereau; Wei Tsang Ooi; |
248 | Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we propose the self-supervised generative map (SGM) a modular method that learns the explicit context relation via self-supervised learning. |
Sixian Zhang; Xinyao Yu; Xinhang Song; Xiaohan Wang; Shuqiang Jiang; |
249 | AutoAD III: The Prequel – Back to The Pixels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data and build training and evaluation datasets using these. |
Tengda Han; Max Bain; Arsha Nagrani; Gül Varol; Weidi Xie; Andrew Zisserman; |
250 | Traffic Scene Parsing Through The TSP6K Dataset Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However little effort has been put into improving the traffic monitoring scene understanding mainly due to the lack of specific datasets. To fill this gap we introduce a specialized traffic monitoring dataset termed TSP6K containing images from the traffic monitoring scenario with high-quality pixel-level and instance-level annotations. |
Peng-Tao Jiang; Yuqi Yang; Yang Cao; Qibin Hou; Ming-Ming Cheng; Chunhua Shen; |
251 | VideoBooth: Diffusion-based Video Generation with Image Prompts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we study the task of video generation with image prompts which provide more accurate and direct content control beyond the text prompts. |
Yuming Jiang; Tianxing Wu; Shuai Yang; Chenyang Si; Dahua Lin; Yu Qiao; Chen Change Loy; Ziwei Liu; |
252 | HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset. |
Ce Zhang; Simon Stepputtis; Joseph Campbell; Katia Sycara; Yaqi Xie; |
253 | VILA: On Pre-training for Visual Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we examine the design options for VLM pre-training by augmenting LLM towards VLM through step-by-step controllable comparisons. |
Ji Lin; Hongxu Yin; Wei Ping; Pavlo Molchanov; Mohammad Shoeybi; Song Han; |
254 | YOLO-World: Real-Time Open-Vocabulary Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation we introduce YOLO-World an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. |
Tianheng Cheng; Lin Song; Yixiao Ge; Wenyu Liu; Xinggang Wang; Ying Shan; |
255 | Mip-Splatting: Alias-free 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We find that the source for this phenomenon can be attributed to the lack of 3D frequency constraints and the usage of a 2D dilation filter. To address this problem we introduce a 3D smoothing filter that constrains the size of the 3D Gaussian primitives based on the maximal sampling frequency induced by the input views. |
Zehao Yu; Anpei Chen; Binbin Huang; Torsten Sattler; Andreas Geiger; |
256 | Equivariant Multi-Modality Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However effective training of such fusion models is challenging due to the scarcity of ground truth fusion data. To tackle this issue we propose the Equivariant Multi-Modality imAge fusion (EMMA) paradigm for end-to-end self-supervised learning. |
Zixiang Zhao; Haowen Bai; Jiangshe Zhang; Yulun Zhang; Kai Zhang; Shuang Xu; Dongdong Chen; Radu Timofte; Luc Van Gool; |
257 | FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we present FreeControl a training-free approach for controllable T2I generation that supports multiple conditions architectures and checkpoints simultaneously. |
Sicheng Mo; Fangzhou Mu; Kuan Heng Lin; Yanli Liu; Bochen Guan; Yin Li; Bolei Zhou; |
258 | UniPAD: A Universal Pre-training Paradigm for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we present UniPAD a novel self-supervised learning paradigm applying 3D volumetric differentiable rendering. |
Honghui Yang; Sha Zhang; Di Huang; Xiaoyang Wu; Haoyi Zhu; Tong He; Shixiang Tang; Hengshuang Zhao; Qibo Qiu; Binbin Lin; Xiaofei He; Wanli Ouyang; |
259 | Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We address this gap with MoRE a novel approach for multi-object relocalization and reconstruction in evolving environments. We view these environments as Living Scenes and consider the problem of transforming scans taken at different points in time into a 3D reconstruction of the object instances whose accuracy and completeness increase over time. |
Liyuan Zhu; Shengyu Huang; Konrad Schindler; Iro Armeni; |
260 | Federated Generalized Category Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end we propose a novel Associated Gaussian Contrastive Learning (AGCL) framework based on learnable GMMs which consists of a Client Semantics Association (CSA) and a global-local GMM Contrastive Learning (GCL). |
Nan Pu; Wenjing Li; Xingyuan Ji; Yalan Qin; Nicu Sebe; Zhun Zhong; |
261 | MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis Via Meta-learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We identify two limitations of these defending approaches: i) sub-optimal due to the hand-crafted heuristics for solving the intractable bilevel optimization and ii) lack of robustness against simple data transformations like Gaussian filtering. To solve these challenges we propose MetaCloak which solves the bi-level poisoning problem with a meta-learning framework with an additional transformation sampling process to craft transferable and robust perturbation. |
Yixin Liu; Chenrui Fan; Yutong Dai; Xun Chen; Pan Zhou; Lichao Sun; |
262 | HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose an efficient yet effective framework HumanGaussian that generates high-quality 3D humans with fine-grained geometry and realistic appearance. |
Xian Liu; Xiaohang Zhan; Jiaxiang Tang; Ying Shan; Gang Zeng; Dahua Lin; Xihui Liu; Ziwei Liu; |
263 | Situational Awareness Matters in 3D Vision Language Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. |
Yunze Man; Liang-Yan Gui; Yu-Xiong Wang; |
264 | 3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present 3D Paintbrush a technique for automatically texturing local semantic regions on meshes via text descriptions. |
Dale Decatur; Itai Lang; Kfir Aberman; Rana Hanocka; |
265 | MAS: Multi-view Ancestral Sampling for 3D Motion Generation Using 2D Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Multi-view Ancestral Sampling (MAS) a method for 3D motion generation using 2D diffusion models that were trained on motions obtained from in-the-wild videos. |
Roy Kapon; Guy Tevet; Daniel Cohen-Or; Amit H. Bermano; |
266 | When StyleGAN Meets Stable Diffusion: A W+ Adapter for Personalized Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Text descriptions intended to guide the facial attributes of the synthesized face may fall short owing to the intricate entanglement of identity information with identity-irrelevant facial attributes derived from the reference image. To address these issues we present the novel use of the extended StyleGAN embedding space W+ to achieve enhanced identity preservation and disentanglement for diffusion models. |
Xiaoming Li; Xinyu Hou; Chen Change Loy; |
267 | One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present One-2-3-45++ an innovative method that transforms a single image into a detailed 3D textured mesh in approximately one minute. |
Minghua Liu; Ruoxi Shi; Linghao Chen; Zhuoyang Zhang; Chao Xu; Xinyue Wei; Hansheng Chen; Chong Zeng; Jiayuan Gu; Hao Su; |
268 | It’s All About Your Sketch: Democratising Sketch Control in Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: A pilot study underscores the necessity revealing that deformities in existing models stem from spatial-conditioning. To rectify this we propose an abstraction-aware framework utilising a sketch adapter adaptive time-step sampling and discriminative guidance from a pre-trained fine-grained sketch-based image retrieval model working synergistically to reinforce fine-grained sketch-photo association. |
Subhadeep Koley; Ayan Kumar Bhunia; Deeptanshu Sekhri; Aneeshan Sain; Pinaki Nath Chowdhury; Tao Xiang; Yi-Zhe Song; |
269 | Text-to-Image Diffusion Models Are Great Sketch-Photo Matchmakers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to harness pre-trained diffusion models effectively we introduce a straightforward yet powerful strategy focused on two key aspects: selecting optimal feature layers and utilising visual and textual prompts. |
Subhadeep Koley; Ayan Kumar Bhunia; Aneeshan Sain; Pinaki Nath Chowdhury; Tao Xiang; Yi-Zhe Song; |
270 | How to Handle Sketch-Abstraction in Sketch-Based Image Retrieval? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a novel abstraction-aware sketch-based image retrieval framework capable of handling sketch abstraction at varied levels. |
Subhadeep Koley; Ayan Kumar Bhunia; Aneeshan Sain; Pinaki Nath Chowdhury; Tao Xiang; Yi-Zhe Song; |
271 | You’ll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we question the reliance on sketches alone for fine-grained image retrieval by simultaneously exploring the fine-grained representation capabilities of both sketch and text orchestrating a duet between the two. |
Subhadeep Koley; Ayan Kumar Bhunia; Aneeshan Sain; Pinaki Nath Chowdhury; Tao Xiang; Yi-Zhe Song; |
272 | UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we contribute from two aspects. 1) We propose four architectural guidelines for designing large-kernel ConvNets the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels – they can see wide without going deep. Following such guidelines our proposed large-kernel ConvNet shows leading performance in image recognition (ImageNet accuracy of 88.0%, ADE20K mIoU of 55.6% and COCO box AP of 56.4%) demonstrating better performance and higher speed than the recent powerful competitors. 2) We discover large kernels are the key to unlocking the exceptional performance of ConvNets in domains where they were originally not proficient. |
Xiaohan Ding; Yiyuan Zhang; Yixiao Ge; Sijie Zhao; Lin Song; Xiangyu Yue; Ying Shan; |
273 | Class Tokens Infusion for Weakly Supervised Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work proposes a novel WSSS framework with Class Token Infusion (CTI). |
Sung-Hoon Yoon; Hoyong Kwon; Hyeonseong Kim; Kuk-Jin Yoon; |
274 | AnyDoor: Zero-shot Object-level Image Customization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work presents AnyDoor a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations with desired shapes. |
Xi Chen; Lianghua Huang; Yu Liu; Yujun Shen; Deli Zhao; Hengshuang Zhao; |
275 | LMDrive: Closed-Loop End-to-End Driving with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end this paper introduces LMDrive a novel language-guided end-to-end closed-loop autonomous driving framework. |
Hao Shao; Yuxuan Hu; Letian Wang; Guanglu Song; Steven L. Waslander; Yu Liu; Hongsheng Li; |
276 | Referring Image Editing: Object-level Image Editing Via Referring Expressions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response to this challenge we introduce an object-level generative task called Referring Image Editing (RIE) which enables the identification and editing of specific source objects in an image using text prompts. To tackle this task effectively we propose a tailored framework called ReferDiffusion. |
Chang Liu; Xiangtai Li; Henghui Ding; |
277 | PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose the adaPtive LAyout-semantiC fusion modulE (PLACE) that harnesses pre-trained models to alleviate the aforementioned issues. |
Zhengyao Lv; Yuxiang Wei; Wangmeng Zuo; Kwan-Yee K. Wong; |
278 | Building A Strong Pre-Training Baseline for Universal 3D Large-Scale Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Such inconsiderate consistency greatly hampers the promising path toward a universal pre-training framework: (1) the cross-scene semantic self-conflict i.e. the intense collision between primitive segments of the same semantics from different scenes; (2) the lack of a globally unified bond that pushes cross-scene semantic consistency into 3D representation learning. To address the above challenges we propose a CSC framework that puts scene-level semantic consistency at its heart bridging the connections of similar semantic segments across various scenes. |
Haoming Chen; Zhizhong Zhang; Yanyun Qu; Ruixin Zhang; Xin Tan; Yuan Xie; |
279 | Multi-Attribute Interactions Matter for 3D Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a multi-attribute aware Transformer for 3D visual grounding learning the multi-attribute interactions to refine the intra-modal and inter-modal grounding cues. |
Can Xu; Yuehui Han; Rui Xu; Le Hui; Jin Xie; Jian Yang; |
280 | NeLF-Pro: Neural Light Field Probes for Multi-Scale Novel View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present NeLF-Pro a novel representation to model and reconstruct light fields in diverse natural scenes that vary in extent and spatial granularity. |
Zinuo You; Andreas Geiger; Anpei Chen; |
281 | Diffuse Attend and Segment: Unsupervised Zero-Shot Segmentation Using Stable Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However constructing a model capable of segmenting anything in a zero-shot manner without any annotations is still challenging. In this paper we propose to utilize the self-attention layers in stable diffusion models to achieve this goal because the pre-trained stable diffusion model has learned inherent concepts of objects within its attention layers. |
Junjiao Tian; Lavisha Aggarwal; Andrea Colaco; Zsolt Kira; Mar Gonzalez-Franco; |
282 | Compositional Chain-of-Thought Prompting for Large Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover finetuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this inspired by chain-of-thought methods we propose Compositional Chain-of-Thought (CCoT) a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. |
Chancharik Mitra; Brandon Huang; Trevor Darrell; Roei Herzig; |
283 | Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end we introduce Animatable Gaussians a new avatar representation that leverages powerful 2D CNNs and 3D Gaussian splatting to create high-fidelity avatars. |
Zhe Li; Zerong Zheng; Lizhen Wang; Yebin Liu; |
284 | Discriminative Probing and Tuning for Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment. |
Leigang Qu; Wenjie Wang; Yongqi Li; Hanwang Zhang; Liqiang Nie; Tat-Seng Chua; |
285 | SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Current instruction-based image editing methods such as InstructPix2Pix often fail to produce satisfactory results in complex scenarios due to their dependence on the simple CLIP text encoder in diffusion models. To rectify this, this paper introduces SmartEdit a novel approach to instruction-based image editing that leverages Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities. |
Yuzhou Huang; Liangbin Xie; Xintao Wang; Ziyang Yuan; Xiaodong Cun; Yixiao Ge; Jiantao Zhou; Chao Dong; Rui Huang; Ruimao Zhang; Ying Shan; |
286 | Language Models As Black-Box Optimizers for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However many VLMs rely on proprietary data and are not open-source which restricts the use of white-box approaches for fine-tuning. As such we aim to develop a black-box approach to optimize VLMs through natural language prompts thereby avoiding the need to access model parameters feature embeddings or even output logits. |
Shihong Liu; Samuel Yu; Zhiqiu Lin; Deepak Pathak; Deva Ramanan; |
287 | SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: As a result the content of reproduced high-resolution image may have semantic errors deteriorating the super-resolution performance. To address this issue we present a semantics-aware approach to better preserve the semantic fidelity of generative real-world image super-resolution. |
Rongyuan Wu; Tao Yang; Lingchen Sun; Zhengqiang Zhang; Shuai Li; Lei Zhang; |
288 | Authentic Hand Avatar from A Phone Scan Via Universal Hand Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present a universal hand model (UHM) which 1) can universally represent high-fidelity 3D hand meshes of arbitrary identities (IDs) and 2) can be adapted to each person with a short phone scan for the authentic hand avatar. |
Gyeongsik Moon; Weipeng Xu; Rohan Joshi; Chenglei Wu; Takaaki Shiratori; |
289 | Few-shot Learner Parameterization By Diffusion Time-steps Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end we find an inductive bias that the time-steps of a Diffusion Model (DM) can isolate the nuanced class attributes i.e. as the forward diffusion adds noise to an image at each time-step nuanced attributes are usually lost at an earlier time-step than the spurious attributes that are visually prominent. Building on this we propose Time-step Few-shot (TiF) learner. |
Zhongqi Yue; Pan Zhou; Richang Hong; Hanwang Zhang; Qianru Sun; |
290 | SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The main challenge lies in inferring unknown body shapes appearances and clothing details in areas not visible in the images. To address this we propose SiTH a novel pipeline that uniquely integrates an image-conditioned diffusion model into a 3D mesh reconstruction workflow. |
Hsuan-I Ho; Jie Song; Otmar Hilliges; |
291 | 4D-DRESS: A 4D Dataset of Real-World Human Clothing With Semantic Annotations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While easy to collect synthetic data often fall short in realism and fail to capture authentic clothing dynamics. Addressing this gap we introduce 4D-DRESS the first real-world 4D dataset advancing human clothing research with its high-quality 4D textured scans and garment meshes. |
Wenbo Wang; Hsuan-I Ho; Chen Guo; Boxiang Rong; Artur Grigorev; Jie Song; Juan Jose Zarate; Otmar Hilliges; |
292 | Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Existing fusion methods are typically helpless in dealing with degradations in low-quality source images and cannot interact with multiple subjective and objective needs. To solve this we introduce a novel approach that leverages semantic text guidance for the degradation-aware and interactive image fusion task termed Text-IF. |
Xunpeng Yi; Han Xu; Hao Zhang; Linfeng Tang; Jiayi Ma; |
293 | Robust Emotion Recognition in Context Debiasing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The harmful bias forces the models to rely on spurious correlations between background contexts and emotion labels in likelihood estimation causing severe performance bottlenecks and confounding valuable context priors. In this paper we propose a counterfactual emotion inference (CLEF) framework to address the above issue. |
Dingkang Yang; Kun Yang; Mingcheng Li; Shunli Wang; Shuaibing Wang; Lihua Zhang; |
294 | Towards Accurate Post-training Quantization for Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose an accurate post-training quantization framework of diffusion models (APQ-DM) for efficient image generation. |
Changyuan Wang; Ziwei Wang; Xiuwei Xu; Yansong Tang; Jie Zhou; Jiwen Lu; |
295 | SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a new 4D motion modeling paradigm SurMo that jointly models the temporal dynamics and human appearances in a unified framework with three key designs: 1) Surface-based motion encoding that models 4D human motions with an efficient compact surface-based triplane. |
Tao Hu; Fangzhou Hong; Ziwei Liu; |
296 | GauHuman: Articulated Gaussian Splatting from Monocular Human Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present GauHuman a 3D human model with Gaussian Splatting for both fast training (1-2 minutes) and real-time rendering (up to 189 FPS) compared with existing NeRF-based implicit representation modelling frameworks demanding hours of training and seconds of rendering per frame. |
Shoukang Hu; Tao Hu; Ziwei Liu; |
297 | Physical Backdoor: Towards Temperature-based Backdoor Attacks in The Physical World Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper our team is the first to investigate the security vulnerabilities associated with TIOD in the context of backdoor attacks spanning both the digital and physical realms. |
Wen Yin; Jian Lou; Pan Zhou; Yulai Xie; Dan Feng; Yuhua Sun; Tailai Zhang; Lichao Sun; |
298 | GPT-4V(ision) Is A Human-Aligned Evaluator for Text-to-3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents an automatic versatile and human-aligned evaluation metric for text-to-3D generative models. |
Tong Wu; Guandao Yang; Zhibing Li; Kai Zhang; Ziwei Liu; Leonidas Guibas; Dahua Lin; Gordon Wetzstein; |
299 | Structured Gradient-based Interpretations Via Norm-Regularized Adversarial Training Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we propose to apply adversarial training as an in-processing scheme to train neural networks with structured simple gradient maps. |
Shizhan Gong; Qi Dou; Farzan Farnia; |
300 | MirageRoom: 3D Scene Segmentation with 2D Pre-trained Models By Mirage Projection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we argue that the crux of the matter resides in the basic premise of existing projection strategies that the medium is homogeneous so that projection rays propagate along straight lines and objects behind are occluded by those in front. |
Haowen Sun; Yueqi Duan; Juncheng Yan; Yifan Liu; Jiwen Lu; |
301 | Learning Inclusion Matching for Animation Paint Bucket Colorization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we introduce a new learning-based inclusion matching pipeline which directs the network to comprehend the inclusion relationships between segments rather than relying solely on direct visual correspondences. |
Yuekun Dai; Shangchen Zhou; Qinyue Li; Chongyi Li; Chen Change Loy; |
302 | Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose to improve transformers of a specific modality with irrelevant data from other modalities e.g. improve an ImageNet model with audio or point cloud datasets. |
Yiyuan Zhang; Xiaohan Ding; Kaixiong Gong; Yixiao Ge; Ying Shan; Xiangyu Yue; |
303 | EASE-DETR: Easing The Competition Among Object Queries Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To help the leading query stand out this paper proposes EASE-DETR which eases the competition by introducing bias that favours the leading one. |
Yulu Gao; Yifan Sun; Xudong Ding; Chuyang Zhao; Si Liu; |
304 | Unsegment Anything By Simulating Deformation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Foundation segmentation models while powerful pose a significant risk: they enable users to effortlessly extract any objects from any digital content with a single click potentially leading to copyright infringement or malicious misuse. To mitigate this risk we introduce a new task "Anything Unsegmentable" to grant any image "the right to be unsegmented". |
Jiahao Lu; Xingyi Yang; Xinchao Wang; |
305 | DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel concept of dual and integrated latent topologies (DITTO in short) for implicit 3D reconstruction from noisy and sparse point clouds. |
Jaehyeok Shim; Kyungdon Joo; |
306 | Programmable Motion Generation for Open-Set Motion Control Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response to the complexity of practical motion control we propose and attempt to solve the open-set motion control problem. |
Hanchao Liu; Xiaohang Zhan; Shaoli Huang; Tai-Jiang Mu; Ying Shan; |
307 | Desigen: A Pipeline for Controllable Design Template Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present Desigen an automatic template creation pipeline which generates background images as well as harmonious layout elements over the background. |
Haohan Weng; Danqing Huang; Yu Qiao; Zheng Hu; Chin-Yew Lin; Tong Zhang; C. L. Philip Chen; |
308 | GaussianAvatar: Towards Realistic Human Avatar Modeling from A Single Video Via Animatable 3D Gaussians Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present GaussianAvatar an efficient approach to creating realistic human avatars with dynamic 3D appearances from a single video. |
Liangxiao Hu; Hongwen Zhang; Yuxiang Zhang; Boyao Zhou; Boning Liu; Shengping Zhang; Liqiang Nie; |
309 | Text-Image Alignment for Diffusion-Based Perception Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We find that automatically generated captions can improve text-image alignment and significantly enhance a model’s cross-attention maps leading to better perceptual performance. |
Neehar Kondapaneni; Markus Marks; Manuel Knott; Rogerio Guimaraes; Pietro Perona; |
310 | Accelerating Diffusion Sampling with Optimized Time Steps Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While this is a significant development most sampling methods still employ uniform time steps which is not optimal when using a small number of steps. To address this issue we propose a general framework for designing an optimization problem that seeks more appropriate time steps for a specific numerical ODE solver for DPMs. |
Shuchen Xue; Zhaoqiang Liu; Fei Chen; Shifeng Zhang; Tianyang Hu; Enze Xie; Zhenguo Li; |
311 | PhotoMaker: Customizing Realistic Human Photos Via Stacked ID Embedding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we introduce PhotoMaker an efficient personalized text-to-image generation method which mainly encodes an arbitrary number of input ID images into a stacked ID embedding for preserving ID information. |
Zhen Li; Mingdeng Cao; Xintao Wang; Zhongang Qi; Ming-Ming Cheng; Ying Shan; |
312 | UFOGen: You Forward Once Large Scale Text-to-Image Generation Via Diffusion GANs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Text-to-image diffusion models have demonstrated remarkable capabilities in transforming text prompts into coherent images yet the computational cost of the multi-step inference remains a persistent challenge. To address this issue we present UFOGen a novel generative model designed for ultra-fast one-step text-to-image generation. |
Yanwu Xu; Yang Zhao; Zhisheng Xiao; Tingbo Hou; |
313 | NeuRAD: Neural Rendering for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose NeuRAD a robust novel view synthesis method tailored to dynamic AD data. |
Adam Tonderski; Carl Lindström; Georg Hess; William Ljungbergh; Lennart Svensson; Christoffer Petersson; |
314 | OmniViD: A Generative Framework for Universal Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In contrast natural language processing benefits from a unified output space i.e. text sequences which simplifies the training of powerful foundational language models such as GPT-3 with extensive training corpora. Inspired by this we seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens. |
Junke Wang; Dongdong Chen; Chong Luo; Bo He; Lu Yuan; Zuxuan Wu; Yu-Gang Jiang; |
315 | Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a novel approach pursuing Spatial Adaptation and Temporal Coherence (SATeCo) for video super-resolution. |
Zhikai Chen; Fuchen Long; Zhaofan Qiu; Ting Yao; Wengang Zhou; Jiebo Luo; Tao Mei; |
316 | Motion Blur Decomposition with Cross-shutter Guidance Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper inspired by the complementary exposure characteristics of a global shutter (GS) camera and a rolling shutter (RS) camera we propose to utilize the ordered scanline-wise delay in a rolling shutter image to robustify motion decomposition of a single blurry image. |
Xiang Ji; Haiyang Jiang; Yinqiang Zheng; |
317 | Rolling Shutter Correction with Intermediate Distortion Flow Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes to correct the rolling shutter (RS) distorted images by estimating the distortion flow from the global shutter (GS) to RS directly. |
Mingdeng Cao; Sidi Yang; Yujiu Yang; Yinqiang Zheng; |
318 | PixelRNN: In-pixel Recurrent Neural Networks for End-to-end-optimized Perception with Neural Sensors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Fueled by innovations in stacked image sensor fabrication emerging sensor–processors offer programmability and processing capabilities directly on the sensor. We exploit these capabilities by developing an efficient recurrent neural network architecture PixelRNN that encodes spatio-temporal features on the sensor using purely binary operations. |
Haley M. So; Laurie Bose; Piotr Dudek; Gordon Wetzstein; |
319 | Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite great promise video diffusion models are difficult to control hindering users from applying their creativity rather than amplifying it. To address this challenge we present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models. |
Shengqu Cai; Duygu Ceylan; Matheus Gadelha; Chun-Hao Paul Huang; Tuanfeng Yang Wang; Gordon Wetzstein; |
320 | Posterior Distillation Sampling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Posterior Distillation Sampling (PDS) a novel optimization method for parametric image editing based on diffusion models. |
Juil Koo; Chanho Park; Minhyuk Sung; |
321 | LowRankOcc: Tensor Decomposition and Low-Rank Recovery for Vision-based 3D Semantic Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present a tensor decomposition and low-rank recovery approach (LowRankOcc) for vision-based 3D semantic occupancy prediction. |
Linqing Zhao; Xiuwei Xu; Ziwei Wang; Yunpeng Zhang; Borui Zhang; Wenzhao Zheng; Dalong Du; Jie Zhou; Jiwen Lu; |
322 | Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper focuses on the high computational complexity in Large Language Models (LLMs) a significant challenge in both natural language processing (NLP) and multi-modal tasks. We propose Low-Rank Approximation for Sparse Attention (LoRA-Sparse) an innovative approach that strategically reduces this complexity. |
Lin Song; Yukang Chen; Shuai Yang; Xiaohan Ding; Yixiao Ge; Ying-Cong Chen; Ying Shan; |
323 | GLID: Pre-training A Generalist Encoder-Decoder Vision Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method for better handling various downstream computer vision tasks. |
Jihao Liu; Jinliang Zheng; Yu Liu; Hongsheng Li; |
324 | Functional Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose functional diffusion a generative diffusion model focused on infinite-dimensional function data samples. |
Biao Zhang; Peter Wonka; |
325 | COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However compressed views like TPV representation lose 3D geometry information while raw and sparse OCC representation requires heavy but redundant computational costs. To address the above limitations we propose Compact Occupancy TRansformer (COTR) with a geometry-aware occupancy encoder and a semantic-aware group decoder to reconstruct a compact 3D OCC representation. |
Qihang Ma; Xin Tan; Yanyun Qu; Lizhuang Ma; Zhizhong Zhang; Yuan Xie; |
326 | VMC: Video Motion Customization Using Temporal Attention Adaption for Text-to-Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: For example straightforward extensions of static image customization methods to video often lead to intricate entanglements of appearance and motion data. To tackle this here we present the Video Motion Customization (VMC) framework a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models. |
Hyeonho Jeong; Geon Yeong Park; Jong Chul Ye; |
327 | TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically we propose a Temporal LiDAR Aggregation and Distillation (TLAD) algorithm which leverages historical priors to assign different aggregation steps for different classes. |
Xiaopei Wu; Yuenan Hou; Xiaoshui Huang; Binbin Lin; Tong He; Xinge Zhu; Yuexin Ma; Boxi Wu; Haifeng Liu; Deng Cai; Wanli Ouyang; |
328 | RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We hence release the first real-world large-scale RCooper dataset to spur research on practical roadside cooperative perception including detection and tracking. |
Ruiyang Hao; Siqi Fan; Yingru Dai; Zhenlin Zhang; Chenxi Li; Yuntian Wang; Haibao Yu; Wenxian Yang; Jirui Yuan; Zaiqing Nie; |
329 | LangSplat: 3D Language Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces LangSplat which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. |
Minghan Qin; Wanhua Li; Jiawei Zhou; Haoqian Wang; Hanspeter Pfister; |
330 | Boosting Spike Camera Image Reconstruction from A Perspective of Dealing with Spike Fluctuations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present an approach to deal with spike fluctuations and boost spike camera image reconstruction. |
Rui Zhao; Ruiqin Xiong; Jing Zhao; Jian Zhang; Xiaopeng Fan; Zhaofei Yu; Tiejun Huang; |
331 | BT-Adapter: Video Conversation Is Feasible Without Video Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end we propose Branching Temporal Adapter (BT-Adapter) a novel method for extending image-language pretrained models into the video domain. |
Ruyang Liu; Chen Li; Yixiao Ge; Thomas H. Li; Ying Shan; Ge Li; |
332 | VRP-SAM: SAM with Visual Reference Prompt Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation creating the VRP-SAM model. |
Yanpeng Sun; Jiahui Chen; Shan Zhang; Xinyu Zhang; Qiang Chen; Gang Zhang; Errui Ding; Jingdong Wang; Zechao Li; |
333 | Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce SUPIR (Scaling-UP Image Restoration) a groundbreaking image restoration method that harnesses generative priors and the power of model scaling. |
Fanghua Yu; Jinjin Gu; Zheyuan Li; Jinfan Hu; Xiangtao Kong; Xintao Wang; Jingwen He; Yu Qiao; Chao Dong; |
334 | HHMR: Holistic Hand Mesh Recovery By Enhancing The Multimodal Controllability of Graph Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we extend the ability of controllable generative models to a more comprehensive hand mesh recovery task: direct hand mesh generation, inpainting, reconstruction and fitting in a single framework which we name Holistic Hand Mesh Recovery (HHMR). |
Mengcheng Li; Hongwen Zhang; Yuxiang Zhang; Ruizhi Shao; Tao Yu; Yebin Liu; |
335 | ProxyCap: Real-time Monocular Full-body Capture in World Space Via Human-Centric Proxy-to-Motion Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we introduce ProxyCap a human-centric proxy-to-motion learning scheme to learn world-space motions from a proxy dataset of 2D skeleton sequences and 3D rotational motions. |
Yuxiang Zhang; Hongwen Zhang; Liangxiao Hu; Jiajun Zhang; Hongwei Yi; Shengping Zhang; Yebin Liu; |
336 | GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However the generalizability of existing methods is constrained due to their framework designs and their reliance on 3D data. We address this limitation by introducing Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF) a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics. |
Yunsong Wang; Hanlin Chen; Gim Hee Lee; |
337 | C3: High-Performance and Low-Complexity Neural Compression from A Single Image or Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here we introduce C3 a neural compression method with strong rate-distortion (RD) performance that instead overfits a small model to each image or video separately. |
Hyunjik Kim; Matthias Bauer; Lucas Theis; Jonathan Richard Schwarz; Emilien Dupont; |
338 | Visual Layout Composer: Image-Vector Dual Diffusion Model for Design Layout Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes an image-vector dual diffusion model for generative layout design. |
Mohammad Amin Shabani; Zhaowen Wang; Difan Liu; Nanxuan Zhao; Jimei Yang; Yasutaka Furukawa; |
339 | UnO: Unsupervised Occupancy Fields for Perception and Forecasting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Instead we learn to perceive and forecast a continuous 4D (spatio-temporal) occupancy field with self-supervision from LiDAR data. |
Ben Agro; Quinlan Sykora; Sergio Casas; Thomas Gilles; Raquel Urtasun; |
340 | GeoChat: Grounded Large Vision-Language Model for Remote Sensing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Furthermore the lack of domain-specific multimodal instruction-following data as well as strong backbone models for RS makes it hard for the models to align their behavior with user queries. To address these limitations we propose GeoChat – the first versatile remote sensing VLM that offers multitask conversational capabilities with high-resolution RS images. |
Kartik Kuckreja; Muhammad Sohail Danish; Muzammal Naseer; Abhijit Das; Salman Khan; Fahad Shahbaz Khan; |
341 | Classes Are Not Equal: An Empirical Study on Image Recognition Fairness Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we present an empirical study on image recognition unfairness i.e. extreme class accuracy disparity on balanced data like ImageNet. |
Jiequan Cui; Beier Zhu; Xin Wen; Xiaojuan Qi; Bei Yu; Hanwang Zhang; |
342 | Osprey: Pixel Understanding with Visual Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose Osprey a mask-text instruction tuning approach to extend MLLMs by incorporating fine-grained mask regions into language instruction aiming at achieving pixel-wise visual understanding. |
Yuqian Yuan; Wentong Li; Jian Liu; Dongqi Tang; Xinjie Luo; Chi Qin; Lei Zhang; Jianke Zhu; |
343 | CPR-Coach: Recognizing Composite Error Actions Based on Single-class Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To solve the unavoidable "Single-class Training & Multi-class Testing" problem we propose a human-cognition-inspired framework named ImagineNet to improve the model’s multi-error recognition performance under restricted supervision. |
Shunli Wang; Shuaibing Wang; Dingkang Yang; Mingcheng Li; Haopeng Kuang; Xiao Zhao; Liuzhen Su; Peng Zhai; Lihua Zhang; |
344 | Holistic Autonomous Driving Understanding By Bird’s-Eye-View Injected Multi-Modal Large Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To obtain NuInstruct we propose a novel SQL-based method to generate instruction-response pairs automatically which is inspired by the driving logical progression of humans. |
Xinpeng Ding; Jianhua Han; Hang Xu; Xiaodan Liang; Wei Zhang; Xiaomeng Li; |
345 | TUMTraf V2X Cooperative Perception Dataset Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose CoopDet3D a cooperative multi-modal fusion model and TUMTraf-V2X a perception dataset for the cooperative 3D object detection and tracking task. |
Walter Zimmer; Gerhard Arya Wardana; Suren Sritharan; Xingcheng Zhou; Rui Song; Alois C. Knoll; |
346 | OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In particular we propose to incorporate LLMs and open-vocabulary detectors to distill key information and establish correspondence between multi-modal signals. |
Ganlong Zhao; Guanbin Li; Weikai Chen; Yizhou Yu; |
347 | Structure-Aware Sparse-View X-ray 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose a framework Structure-Aware X-ray Neural Radiodensity Fields (SAX-NeRF) for sparse-view X-ray 3D reconstruction. |
Yuanhao Cai; Jiahao Wang; Alan Yuille; Zongwei Zhou; Angtian Wang; |
348 | AlignSAM: Aligning Segment Anything Model to Open Context Via Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a novel framework termed AlignSAM designed for automatic prompting for aligning SAM to an open context through reinforcement learning. |
Duojun Huang; Xinyu Xiong; Jie Ma; Jichang Li; Zequn Jie; Lin Ma; Guanbin Li; |
349 | RLHF-V: Towards Trustworthy MLLMs Via Behavior Alignment from Fine-grained Correctional Human Feedback Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address the challenge we present RLHF-V which enhances MLLM trustworthiness via behavior alignment from fine-grained correctional human feedback. |
Tianyu Yu; Yuan Yao; Haoye Zhang; Taiwen He; Yifeng Han; Ganqu Cui; Jinyi Hu; Zhiyuan Liu; Hai-Tao Zheng; Maosong Sun; Tat-Seng Chua; |
350 | Loopy-SLAM: Dense Neural SLAM with Loop Closures Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response we introduce Loopy-SLAM that globally optimizes poses and the dense 3D model. |
Lorenzo Liso; Erik Sandström; Vladimir Yugay; Luc Van Gool; Martin R. Oswald; |
351 | DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However the long generation time of such algorithms significantly degrades the user experience. To tackle this problem we propose DreamPropeller a drop-in acceleration algorithm that can be wrapped around any existing text-to-3D generation pipeline based on score distillation. |
Linqi Zhou; Andy Shih; Chenlin Meng; Stefano Ermon; |
352 | PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce PhysGaussian a new method that seamlessly integrates physically grounded Newtonian dynamics within 3D Gaussians to achieve high-quality novel motion synthesis. |
Tianyi Xie; Zeshun Zong; Yuxing Qiu; Xuan Li; Yutao Feng; Yin Yang; Chenfanfu Jiang; |
353 | Unsupervised Keypoints from Pretrained Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Our core idea is to find text embeddings that would cause the generative model to consistently attend to compact regions in images (i.e. keypoints). |
Eric Hedlin; Gopal Sharma; Shweta Mahajan; Xingzhe He; Hossam Isack; Abhishek Kar; Helge Rhodin; Andrea Tagliasacchi; Kwang Moo Yi; |
354 | MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite this real-world video scenarios often feature large unobserved regions where neural representations struggle to achieve realistic completion. To tackle this challenge we introduce MorpheuS a framework for dynamic 360° surface reconstruction from a casually captured RGB-D video. |
Hengyi Wang; Jingwen Wang; Lourdes Agapito; |
355 | Rapid 3D Model Generation with Intuitive 3D Input Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here we propose Deep3DVRSketch the first 3D model generation network that takes 3D VR sketches from novice users as input and generates highly consistent 3D models in multiple categories within seconds, irrespective of the users’ drawing abilities. |
Tianrun Chen; Chaotao Ding; Shangzhan Zhang; Chunan Yu; Ying Zang; Zejian Li; Sida Peng; Lingyun Sun; |
356 | SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Fortunately the recent Segment Anything Model (SAM) has showcased remarkable zero-shot transfer performance which provides a promising solution to tackle this task. Motivated by this we introduce SAM-6D a novel framework designed to realize the task in two steps: instance segmentation and pose estimation. |
Jiehong Lin; Lihua Liu; Dekun Lu; Kui Jia; |
357 | Image Processing GNN: Breaking Rigidity in Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Alternatively we leverage the flexibility of graphs and propose the Image Processing GNN (IPG) model to break the rigidity that dominates previous SR methods. |
Yuchuan Tian; Hanting Chen; Chao Xu; Yunhe Wang; |
358 | Taming Mode Collapse in Score Distillation for Text-to-3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we reveal that the existing score distillation-based text-to-3D generation frameworks degenerate to maximal likelihood seeking on each view independently and thus suffer from the mode collapse problem manifesting as the Janus artifact in practice. |
Peihao Wang; Dejia Xu; Zhiwen Fan; Dilin Wang; Sreyas Mohan; Forrest Iandola; Rakesh Ranjan; Yilei Li; Qiang Liu; Zhangyang Wang; Vikas Chandra; |
359 | A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To mitigate the memory bottleneck we systematically analyze the memory/accuracy trade-off of various efficient methods: factorized attention, parameter-efficient image-to-video adaptation, input masking, and multi-resolution patchification. |
Pinelopi Papalampidi; Skanda Koppula; Shreya Pathak; Justin Chiu; Joe Heyward; Viorica Patraucean; Jiajun Shen; Antoine Miech; Andrew Zisserman; Aida Nematzadeh; |
360 | Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Since for any SDE there always exists an ordinary differential equation (ODE) whose trajectory sampling can deterministically and consistently converge to the desired target point as the SDE, we propose a novel and effective "Consistent3D" method that explores the ODE deterministic sampling prior for text-to-3D generation. |
Zike Wu; Pan Zhou; Xuanyu Yi; Xiaoding Yuan; Hanwang Zhang; |
361 | Multi-modal In-Context Learning Makes An Ego-evolving Scene Text Recognizer Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end we introduce E2STR an STR model trained with context-rich scene text sequences where the sequences are generated via our proposed in-context training strategy. |
Zhen Zhao; Jingqun Tang; Chunhui Lin; Binghong Wu; Can Huang; Hao Liu; Xin Tan; Zhizhong Zhang; Yuan Xie; |
362 | DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we introduce DPMesh an innovative framework for occluded human mesh recovery that capitalizes on the profound knowledge about object structure and spatial relationships embedded in a pre-trained text-to-image diffusion model. |
Yixuan Zhu; Ao Li; Yansong Tang; Wenliang Zhao; Jie Zhou; Jiwen Lu; |
363 | FlowIE: Efficient Image Enhancement Via Rectified Flow Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response we propose FlowIE a simple yet highly effective flow-based image enhancement framework that estimates straight-line paths from an elementary distribution to high-quality images. |
Yixuan Zhu; Wenliang Zhao; Ao Li; Yansong Tang; Jie Zhou; Jiwen Lu; |
364 | Memory-based Adapters for Online 3D Scene Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a new framework for online 3D scene perception. |
Xiuwei Xu; Chong Xia; Ziwei Wang; Linqing Zhao; Yueqi Duan; Jie Zhou; Jiwen Lu; |
365 | Cache Me If You Can: Accelerating Diffusion Models Through Block Caching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we investigate the behavior of the layers within the network and find that 1) the layers’ output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small. |
Felix Wimbauer; Bichen Wu; Edgar Schoenfeld; Xiaoliang Dai; Ji Hou; Zijian He; Artsiom Sanakoyeu; Peizhao Zhang; Sam Tsai; Jonas Kohler; Christian Rupprecht; Daniel Cremers; Peter Vajda; Jialiang Wang; |
366 | 3DSFLabelling: Boosting 3D Scene Flow Estimation By Pseudo Auto-labelling Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a novel approach from the perspective of auto-labelling aiming to generate a large number of 3D scene flow pseudo labels for real-world LiDAR point clouds. |
Chaokang Jiang; Guangming Wang; Jiuming Liu; Hesheng Wang; Zhuang Ma; Zhenqiang Liu; Zhujin Liang; Yi Shan; Dalong Du; |
367 | Towards Generalizable Tumor Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However success in tumor synthesis hinges on creating visually realistic tumors that are generalizable across multiple organs and furthermore on the resulting AI models being capable of detecting real tumors in images sourced from different domains (e.g. hospitals). This paper makes a progressive stride toward generalizable tumor synthesis by leveraging a critical observation: early-stage tumors (< 2 cm) tend to have similar imaging characteristics in computed tomography (CT) whether they originate in the liver, pancreas, or kidneys. |
Qi Chen; Xiaoxi Chen; Haorui Song; Zhiwei Xiong; Alan Yuille; Chen Wei; Zongwei Zhou; |
368 | Text-to-3D Using Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In response this paper proposes GSGEN a novel method that adopts Gaussian Splatting, a recent state-of-the-art representation, for text-to-3D generation. |
Zilong Chen; Feng Wang; Yikai Wang; Huaping Liu; |
369 | DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However due to its reliance on generative adversarial networks (GANs) its generality is limited by the capacity of pretrained GAN models. In this work we extend this editing framework to diffusion models and propose a novel approach DragDiffusion. |
Yujun Shi; Chuhui Xue; Jun Hao Liew; Jiachun Pan; Hanshu Yan; Wenqing Zhang; Vincent Y. F. Tan; Song Bai; |
370 | Rethinking The Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In particular the local interdependence among image pixels caused by up-sampling operators is clearly manifested in synthetic images generated by GAN or diffusion models. Building upon this observation we introduce the concept of Neighboring Pixel Relationships (NPR) as a means to capture and characterize the generalized structural artifacts stemming from up-sampling operations. |
Chuangchuang Tan; Yao Zhao; Shikui Wei; Guanghua Gu; Ping Liu; Yunchao Wei; |
371 | SimDA: Simple Diffusion Adapter for Efficient Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model adapting it to video generation in a parameter-efficient way. |
Zhen Xing; Qi Dai; Han Hu; Zuxuan Wu; Yu-Gang Jiang; |
372 | MindBridge: A Cross-Subject Brain Decoding Framework Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we present a novel approach MindBridge that achieves cross-subject brain decoding by employing only one model. |
Shizun Wang; Songhua Liu; Zhenxiong Tan; Xinchao Wang; |
373 | Relation Rectification in Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To resolve this we introduce a novel task termed Relation Rectification aiming to refine the model to accurately represent a given relationship it initially fails to generate. To address this we propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN). |
Yinwei Wu; Xingyi Yang; Xinchao Wang; |
374 | LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose LayoutLLM an LLM/MLLM-based method for document understanding. |
Chuwei Luo; Yufan Shen; Zhaoqing Zhu; Qi Zheng; Zhi Yu; Cong Yao; |
375 | CLiC: Concept Learning in Context Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: It involves acquiring a visual concept (e.g. an ornament) from a source image and subsequently applying it to an object (e.g. a chair) in a target image. Our key idea is to perform in-context concept learning acquiring the local visual concept within the broader context of the objects they belong to. |
Mehdi Safaee; Aryan Mikaeili; Or Patashnik; Daniel Cohen-Or; Ali Mahdavi-Amiri; |
376 | Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Large Multimodal Models (LMMs) have shown promise in vision-language tasks but struggle with high-resolution input and detailed scene understanding. Addressing these challenges we introduce Monkey to enhance LMM capabilities. |
Zhang Li; Biao Yang; Qiang Liu; Zhiyin Ma; Shuo Zhang; Jingxu Yang; Yabo Sun; Yuliang Liu; Xiang Bai; |
377 | SketchINR: A First Look Into Sketches As Implicit Neural Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose SketchINR to advance the representation of vector sketches with implicit neural models. |
Hmrishav Bandyopadhyay; Ayan Kumar Bhunia; Pinaki Nath Chowdhury; Aneeshan Sain; Tao Xiang; Timothy Hospedales; Yi-Zhe Song; |
378 | Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we democratise 3D content creation enabling precise generation of 3D shapes from abstract sketches while overcoming limitations tied to drawing skills. |
Hmrishav Bandyopadhyay; Subhadeep Koley; Ayan Das; Ayan Kumar Bhunia; Aneeshan Sain; Pinaki Nath Chowdhury; Tao Xiang; Yi-Zhe Song; |
379 | What Sketch Explainability Really Means for Downstream Tasks? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we explore the unique modality of sketch for explainability emphasising the profound impact of human strokes compared to conventional pixel-oriented studies. |
Hmrishav Bandyopadhyay; Pinaki Nath Chowdhury; Ayan Kumar Bhunia; Aneeshan Sain; Tao Xiang; Yi-Zhe Song; |
380 | Unveiling The Power of Audio-Visual Early Fusion Transformers with Dense Interactions Through Masked Modeling Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However training early fusion architectures poses significant challenges as the increased model expressivity requires robust learning frameworks to harness their enhanced capabilities. In this paper we address this challenge by leveraging the masked reconstruction framework previously successful in unimodal settings to train audio-visual encoders with early fusion. |
Shentong Mo; Pedro Morgado; |
381 | Leak and Learn: An Attacker’s Cookbook to Train Using Leaked Data from Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we explore data reconstruction attacks through the lens of training and improving models with leaked data. |
Joshua C. Zhao; Ahaan Dabholkar; Atul Sharma; Saurabh Bagchi; |
382 | T4P: Test-Time Training of Trajectory Prediction Via Masked Autoencoder and Actor-specific Token Memory Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: First, previous works underfit and overfit as they only optimize the last layer of the motion decoder. To this end we employ the masked autoencoder (MAE) for representation learning to encourage complex interaction modeling in the shifted test distribution when updating deeper layers. Second, utilizing the sequential nature of driving data, we propose an actor-specific token memory that enables the test-time learning of actor-wise motion characteristics. |
Daehee Park; Jaeseok Jeong; Sung-Hoon Yoon; Jaewoo Jeong; Kuk-Jin Yoon; |
383 | Bootstrapping SparseFormers from Vision Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. |
Ziteng Gao; Zhan Tong; Kevin Qinghong Lin; Joya Chen; Mike Zheng Shou; |
384 | FedAS: Bridging Inconsistency in Personalized Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This results in under-trained personalized models and impedes the collaborative training stage for other clients. In this paper we present a novel PFL framework named FedAS which uses Federated Parameter-Alignment and Client-Synchronization to overcome the above challenges. |
Xiyuan Yang; Wenke Huang; Mang Ye; |
385 | C2KD: Bridging The Modality Gap for Cross-Modal Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a solution we propose a novel Customized Crossmodal Knowledge Distillation (C^2KD). |
Fushuo Huo; Wenchao Xu; Jingcai Guo; Haozhao Wang; Song Guo; |
386 | DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To provide consistent and controllable editing we propose the image-based video-NeRF editing pipeline with a set of innovative designs, including multi-view multi-pose Score Distillation Sampling (SDS) from both the 2D personalized diffusion prior and 3D diffusion prior, reconstruction losses, text-guided local parts super-resolution, and style transfer. |
Jia-Wei Liu; Yan-Pei Cao; Jay Zhangjie Wu; Weijia Mao; Yuchao Gu; Rui Zhao; Jussi Keppo; Ying Shan; Mike Zheng Shou; |
387 | X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However the adaptation of these models to egocentric videos has been largely unexplored. To address this gap we propose a simple yet effective cross-modal adaptation framework which we call X-MIC. |
Anna Kukleva; Fadime Sener; Edoardo Remelli; Bugra Tekin; Eric Sauser; Bernt Schiele; Shugao Ma; |
388 | Gradient Reweighting: Towards Imbalanced Class-Incremental Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We show that this dual imbalance issue causes skewed gradient updates with biased weights in FC layers thus inducing over/under-fitting and catastrophic forgetting in CIL. Our method addresses it by reweighting the gradients towards balanced optimization and unbiased classifier learning. |
Jiangpeng He; |
389 | MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce MeshGPT a new approach for generating triangle meshes that reflects the compactness typical of artist-created meshes in contrast to dense triangle meshes extracted by iso-surfacing methods from neural fields. |
Yawar Siddiqui; Antonio Alliegro; Alexey Artemov; Tatiana Tommasi; Daniele Sirigatti; Vladislav Rosov; Angela Dai; Matthias Nießner; |
390 | HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce "HallusionBench" a comprehensive benchmark designed for the evaluation of image-context reasoning. |
Tianrui Guan; Fuxiao Liu; Xiyang Wu; Ruiqi Xian; Zongxia Li; Xiaoyu Liu; Xijun Wang; Lichang Chen; Furong Huang; Yaser Yacoob; Dinesh Manocha; Tianyi Zhou; |
391 | NeRFDeformer: NeRF Transformation from A Single View Via 3D Scene Flows Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a method for automatically modifying a NeRF representation based on a single observation of a non-rigid transformed version of the original scene. |
Zhenggang Tang; Zhongzheng Ren; Xiaoming Zhao; Bowen Wen; Jonathan Tremblay; Stan Birchfield; Alexander Schwing; |
392 | GART: Gaussian Articulated Template Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Gaussian Articulated Template Model (GART) an explicit, efficient, and expressive representation for non-rigid articulated subject capturing and rendering from monocular videos. |
Jiahui Lei; Yufu Wang; Georgios Pavlakos; Lingjie Liu; Kostas Daniilidis; |
393 | Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While attempts have been made to develop fast neural rendering approaches for static scenes these methods cannot be simply employed to support realistic facial expressions such as in the case of a dynamic facial performance. To address these challenges we propose a novel fast 3D neural implicit head avatar model that achieves real-time rendering while maintaining fine-grained controllability and high rendering quality. |
Ziqian Bai; Feitong Tan; Sean Fanello; Rohit Pandey; Mingsong Dou; Shichen Liu; Ping Tan; Yinda Zhang; |
394 | Holo-Relighting: Controllable Volumetric Portrait Relighting from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we propose Holo-Relighting a volumetric relighting method that is capable of synthesizing novel viewpoints and novel lighting from a single image. |
Yiqun Mei; Yu Zeng; He Zhang; Zhixin Shu; Xuaner Zhang; Sai Bi; Jianming Zhang; HyunJoon Jung; Vishal M. Patel; |
395 | The Manga Whisperer: Automatically Generating Transcriptions for Comics Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Yet the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work we seek to address this substantial barrier with the aim of ensuring that manga can be appreciated and actively engaged with by everyone. |
Ragav Sachdeva; Andrew Zisserman; |
396 | Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we present a unified framework to predict both point-wise correspondences and shape interpolation between 3D shapes. |
Dongliang Cao; Marvin Eisenberger; Nafie El Amrani; Daniel Cremers; Florian Bernard; |
397 | SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Building upon this technique we propose a new representation that explicitly decomposes the motion and appearance of dynamic scenes into sparse control points and dense Gaussians respectively. |
Yi-Hua Huang; Yang-Tian Sun; Ziyi Yang; Xiaoyang Lyu; Yan-Pei Cao; Xiaojuan Qi; |
398 | Exploring Efficient Asymmetric Blind-Spots for Self-Supervised Denoising in Real-World Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Through the analysis of existing methods we point out that the key to obtaining high-quality and texture-rich results in real-world self-supervised denoising tasks is to train at the original input resolution structure and use asymmetric operations during training and inference. Based on this we propose Asymmetric Tunable Blind-Spot Network (AT-BSN) where the blind-spot size can be freely adjusted thus better balancing noise correlation suppression and image local spatial destruction during training and inference. |
Shiyan Chen; Jiyuan Zhang; Zhaofei Yu; Tiejun Huang; |
399 | DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Inspired by KinectFusion’s incremental alignment and fusion of local TSDF volumes we propose a diffusion-based SDF fusion approach that iteratively diffuses and fuses local TSDF volumes facilitating the generation of an entire room environment. |
Xiaoliang Ju; Zhaoyang Huang; Yijin Li; Guofeng Zhang; Yu Qiao; Hongsheng Li; |
400 | ViT-Lens: Towards Omni-modal Representations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However the success of data-driven vision and language models is costly or even infeasible to reproduce for rare modalities. In this paper we present ViT-Lens that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. |
Weixian Lei; Yixiao Ge; Kun Yi; Jianfeng Zhang; Difei Gao; Dylan Sun; Yuying Ge; Ying Shan; Mike Zheng Shou; |
401 | Data Valuation and Detections in Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces a novel privacy-preserving method for evaluating client contributions and selecting relevant datasets without a pre-specified training algorithm in an FL task. |
Wenqian Li; Shuran Fu; Fengrui Zhang; Yan Pang; |
402 | FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end we propose FreeCustom a novel tuning-free method to generate customized images of multi-concept composition based on reference concepts using only one image per concept as input. |
Ganggui Ding; Canyu Zhao; Wen Wang; Zhen Yang; Zide Liu; Hao Chen; Chunhua Shen; |
403 | DiverGen: Improving Instance Segmentation By Learning Wider Data Distribution with More Diverse Generative Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While recent works have delved into exploiting generative models to create synthetic datasets for data augmentation these approaches do not efficiently harness the full potential of generative models. To address these issues we introduce a more efficient strategy to construct generative datasets for data augmentation termed DiverGen. |
Chengxiang Fan; Muzhi Zhu; Hao Chen; Yang Liu; Weijia Wu; Huaqi Zhang; Chunhua Shen; |
404 | SimAC: A Simple Anti-Customization Method for Protecting Face Privacy Against Text-to-Image Synthesis of Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately most of these methods adopt straightforward designs such as end-to-end optimization with a focus on adversarially maximizing the original training loss thereby neglecting nuanced internal properties intrinsic to the diffusion model and even leading to ineffective optimization in some diffusion time steps. In this paper we strive to bridge this gap by undertaking a comprehensive exploration of these inherent properties to boost the performance of current anti-customization approaches. |
Feifei Wang; Zhentao Tan; Tianyi Wei; Yue Wu; Qidong Huang; |
405 | Noisy One-point Homographies Are Surprisingly Good Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The current state-of-the-art methods therefore seek to reduce the sample size from the original four point correspondences by including additional information such as keypoint orientation/angles or local affine information. In this work we continue in this direction and propose a novel one-point solver that leverages different approximate constraints derived from the same auxiliary information. |
Yaqing Ding; Jonathan Astermark; Magnus Oskarsson; Viktor Larsson; |
406 | TexOct: Generating Textures of 3D Models with Octree-based Diffusion Related Papers Related Patents Related Grants Related Venues |