Paper Digest: ECCV 2024 Highlights
Note: ECCV-2024 accepted more than 2,300 papers; this page includes only 500 of them, selected by our daily paper digest algorithm. Interested users can choose to read all 2,300 ECCV-2024 papers on a separate page.
To search or review papers within ECCV-2024 related to a specific topic, please use the search by venue (ECCV-2024), review by venue (ECCV-2024) and question answering by venue (ECCV-2024) services. To browse papers by author, here is a list of all authors (ECCV-2024). You may also like to explore our “Best Paper” Digest (ECCV), which lists the most influential ECCV papers in recent years.
This list is created by the Paper Digest Team. Experience the cutting-edge capabilities of Paper Digest, an innovative AI-powered research platform that empowers you to write, review, get answers and more. Try us today and unlock the full potential of our services for free!
TABLE 1: Paper Digest: ECCV 2024 Highlights
Paper | Author(s)
1 | Adversarial Diffusion Distillation | Highlight: We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1–4 steps while maintaining high image quality. |
Axel Sauer; Dominik Lorenz; Andreas Blattmann; Robin Rombach; |
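The ADD recipe combines two signals on the student's one-step output: a distillation target from a frozen diffusion teacher and an adversarial critic. Below is a minimal PyTorch sketch of that two-term objective; the stub modules, the MSE surrogate for the distillation term, and the loss weighting are illustrative assumptions, not the paper's actual models.

```python
# Hypothetical sketch of an ADD-style training step: distillation loss from a
# frozen teacher plus an adversarial loss on the student's one-step output.
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Conv2d(3, 3, 3, padding=1)            # one-step generator (stub)
teacher = nn.Conv2d(3, 3, 3, padding=1)            # frozen diffusion teacher (stub)
discriminator = nn.Sequential(nn.Conv2d(3, 8, 4, stride=2), nn.Flatten(),
                              nn.LazyLinear(1))    # real/fake critic (stub)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

noisy = torch.randn(4, 3, 64, 64)                  # noised inputs
with torch.no_grad():
    target = teacher(noisy)                        # teacher's denoised prediction

pred = student(noisy)                              # student's one-step prediction
distill_loss = F.mse_loss(pred, target)            # surrogate for score distillation
adv_loss = -discriminator(pred).mean()             # generator-side adversarial loss
loss = distill_loss + 0.5 * adv_loss               # weighting is a made-up choice
opt.zero_grad(); loss.backward(); opt.step()       # discriminator update omitted
```

In the paper the discriminator operates on features of a pretrained backbone and the distillation target comes from the teacher's denoising trajectory; the stubs above only mirror the two-term structure.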
2 | Factorizing Text-to-Video Generation By Explicit Image Conditioning | Highlight: We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. |
Rohit Girdhar; Mannat Singh; Andrew Brown; Quentin Duval; Samaneh Azadi; Sai Saketh Rambhatla; Mian Akbar Shah; Xi Yin; Devi Parikh; Ishan Misra; |
3 | FlashTex: Fast Relightable Mesh Texturing with LightControlNet | Highlight: We introduce LightControlNet, a new text-to-image model based on the ControlNet architecture, which allows the specification of the desired lighting as a conditioning image to the model. |
Kangle Deng; Timothy Omernick; Alexander B Weiss; Deva Ramanan; Jun-Yan Zhu; Tinghui Zhou; Maneesh Agrawala; |
4 | Distilling Diffusion Models Into Conditional GANs | Highlight: We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference, while preserving image quality. |
MinGuk Kang; Richard Zhang; Connelly Barnes; Sylvain Paris; Suha Kwak; Jaesik Park; Eli Shechtman; Jun-Yan Zhu; Taesung Park; |
5 | LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation | Highlight: In this paper, we introduce Large Multi-View Gaussian Model (LGM), a novel framework designed to generate high-resolution 3D models from text prompts or single-view images. |
Jiaxiang Tang; Zhaoxi Chen; Xiaokang Chen; Tengfei Wang; Gang Zeng; Ziwei Liu; |
6 | AWOL: Analysis WithOut Synthesis Using Language | Highlight: For example, imagine creating a specific type of tree using procedural graphics or a new kind of animal from a statistical shape model. Our key idea is to leverage language to control such existing models to produce novel shapes. |
Silvia Zuffi; Michael J. Black; |
7 | Idea2Img: Iterative Self-Refinement with GPT-4V for Automatic Image Design and Generation | Highlight: We introduce “Idea to Image,” an agent system that enables multimodal iterative self-refinement with GPT-4V for automatic image design and generation. |
Zhengyuan Yang; Jianfeng Wang; Linjie Li; Kevin Lin; Chung-Ching Lin; Zicheng Liu; Lijuan Wang; |
8 | LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | Highlight: This paper presents LLaVA-Plus, a general-purpose multimodal assistant trained using an end-to-end approach that systematically expands the capabilities of large multimodal models (LMMs). |
Shilong Liu; Hao Cheng; Haotian Liu; Hao Zhang; Feng Li; Tianhe Ren; Xueyan Zou; Jianwei Yang; Hang Su; Jun Zhu; Lei Zhang; Jianfeng Gao; Chunyuan Li; |
9 | SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers | Highlight: We present Scalable Interpolant Transformers (SiT), a family of generative models built on the backbone of Diffusion Transformers (DiT). |
Nanye Ma; Mark Goldstein; Michael Albergo; Nicholas M Boffi; Eric Vanden-Eijnden; Saining Xie; |
10 | Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models | Highlight: We present a method to create interpretable concept sliders that enable precise control over attributes in image generations from diffusion models. |
Rohit Gandikota; Joanna Materzynska; Tingrui Zhou; Antonio Torralba; David Bau; |
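A concept slider is, in essence, a low-rank (LoRA) weight offset whose strength is exposed as a user-controlled scalar at inference time. The sketch below shows that mechanism on a single linear layer; the rank, initialization, and scale values are illustrative assumptions.

```python
# Hypothetical concept slider: a LoRA offset applied with a user-chosen scale.
import torch
import torch.nn as nn

class SliderLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base                                   # frozen pretrained layer
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x, scale: float = 0.0):
        # scale is the slider: 0 recovers the original model, +/- shift the attribute
        return self.base(x) + scale * (x @ self.A.T @ self.B.T)

layer = SliderLinear(nn.Linear(320, 320))
x = torch.randn(1, 77, 320)
plus = layer(x, scale=+2.0)    # push the learned attribute one way
minus = layer(x, scale=-2.0)   # and the other way
```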
11 | Video Editing Via Factorized Diffusion Distillation | Highlight: We introduce Emu Video Edit (EVE), a model that establishes a new state of the art in video editing without relying on any supervised video editing data. |
Uriel Singer; Amit Zohar; Yuval Kirstain; Shelly Sheynin; Adam Polyak; Devi Parikh; Yaniv Taigman; |
12 | MathVerse: Does Your Multi-modal LLM Truly See The Diagrams in Visual Math Problems? | Highlight: To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. |
Renrui Zhang; Dongzhi Jiang; Yichi Zhang; Haokun Lin; Ziyu Guo; Pengshuo Qiu; Aojun Zhou; Pan Lu; Kai-Wei Chang; Peng Gao; Hongsheng Li; |
13 | Isomorphic Pruning for Vision Models | Highlight: These heterogeneous substructures usually exhibit diverged parameter scales, weight distributions, and computational topology, introducing considerable difficulty to importance comparison. To overcome this, we present Isomorphic Pruning, a simple approach that demonstrates effectiveness across a range of network architectures such as Vision Transformers and CNNs, and delivers competitive performance across different model sizes. |
Gongfan Fang; Xinyin Ma; Michael Bi Mi; Xinchao Wang; |
14 | Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection | Highlight: In this paper, we develop an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. |
Shilong Liu; Zhaoyang Zeng; Tianhe Ren; Feng Li; Hao Zhang; Jie Yang; Qing Jiang; Chunyuan Li; Jianwei Yang; Hang Su; Jun Zhu; Lei Zhang; |
15 | V-IRL: Grounding Virtual Intelligence in Real Life | Highlight: How can we embody agents in an environment as rich and diverse as the one we inhabit, without the constraints imposed by real hardware and control? Towards this end, we introduce V-IRL: a platform that enables agents to scalably interact with the real world in a virtual yet realistic environment. |
Jihan Yang; Runyu Ding; Ellis L Brown; Xiaojuan Qi; Saining Xie; |
16 | Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis | Highlight: In this paper, we propose GCD, a controllable monocular dynamic view synthesis pipeline that leverages large-scale diffusion priors to, given a video of any scene, generate a synchronous video from any other chosen perspective, conditioned on a set of relative camera pose parameters. |
Basile Van Hoorick; Rundi Wu; Ege Ozguroglu; Kyle Sargent; Ruoshi Liu; Pavel Tokmakov; Achal Dave; Changxi Zheng; Carl Vondrick; |
17 | DriveLM: Driving with Graph Visual Question Answering | Highlight: We instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving. |
Chonghao Sima; Katrin Renz; Kashyap Chitta; Li Chen; Zhang Hanxue; Chengen Xie; Jens Beißwenger; Ping Luo; Andreas Geiger; Hongyang Li; |
18 | Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models | Highlight: In this study, we propose Deep Reward Tuning (DRTune), an algorithm that directly supervises the final output image of a text-to-image diffusion model and back-propagates through the iterative sampling process to the input noise. |
Xiaoshi Wu; Yiming Hao; Manyuan Zhang; Keqiang Sun; Zhaoyang Huang; Guanglu Song; Yu Liu; Hongsheng Li; |
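The highlight describes placing a reward on the final sampled image and back-propagating through the sampler to tune the model. A toy PyTorch sketch of one such training step follows; the two-line sampler, the input-detaching trick, and the reward stub are simplifying assumptions rather than the paper's exact algorithm.

```python
# Toy sketch of deep reward supervision through an iterative sampler.
import torch
import torch.nn as nn

denoiser = nn.Conv2d(3, 3, 3, padding=1)           # stand-in for the diffusion U-Net
reward = lambda img: img.mean()                    # stand-in for a learned reward model
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-5)

x = torch.randn(2, 3, 32, 32)                      # input noise
for t in range(4):                                 # simplified sampling loop
    # Detach only the network input: every step's weights still receive
    # gradient through the additive update, without back-propagating
    # through deeply nested activations.
    x = x - 0.25 * denoiser(x.detach())

loss = -reward(x)                                  # maximize reward on the final image
opt.zero_grad(); loss.backward(); opt.step()
```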
19 | DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in A Single Video | Highlight: We present DINO-Tracker, a new framework for long-term dense tracking in video. |
Narek Tumanyan; Assaf Singer; Shai Bagon; Tali Dekel; |
20 | Discovering Unwritten Visual Classifiers with Large Language Models | Highlight: Moreover, in practical settings, the vocabulary for class names and attributes of specialized concepts will not be known, preventing these methods from performing well on images uncommon in large-scale vision-language datasets. To address these limitations, we present a novel method that discovers interpretable yet discriminative sets of attributes for visual recognition. |
Mia Chiquier; Utkarsh Mall; Carl Vondrick; |
21 | VideoMamba: State Space Model for Efficient Video Understanding | Highlight: Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts the Mamba to the video domain. |
Kunchang Li; Xinhao Li; Yi Wang; Yinan He; Yali Wang; Limin Wang; Yu Qiao; |
22 | How Video Meetings Change Your Expression | Highlight: Such methods are insufficient as they are unable to provide insights beyond obvious dataset biases, and the explanations are useful only if humans themselves are good at the task. Instead, we tackle the problem through the lens of generative domain translation: our method generates a detailed report of learned, input-dependent spatio-temporal features and the extent to which they vary between the domains. |
Sumit Sarin; Utkarsh Mall; Purva Tendulkar; Carl Vondrick; |
23 | EraseDraw: Learning to Insert Objects By Erasing Them from Images | Highlight: Can we build a computational model to perform this task? Inverting the direction of object removal, we obtain high-quality data for learning to insert objects that are spatially, physically, and optically consistent with the surroundings. With this scalable automatic data generation pipeline, we can create a dataset for learning object insertion, which is used to train our proposed text-conditioned diffusion model. |
Alper Canberk; Maksym Bondarenko; Ege Ozguroglu; Ruoshi Liu; Carl Vondrick; |
24 | Flash Cache: Reducing Bias in Radiance Cache Based Inverse Rendering | Highlight: However, these solutions rely on approximations that introduce bias into the renderings and, more importantly, into the gradients used for optimization. We present a method that avoids these approximations while remaining computationally efficient. |
Benjamin Attal; Dor Verbin; Ben Mildenhall; Peter Hedman; Jonathan T Barron; Matthew O’Toole; Pratul Srinivasan; |
25 | Adaptive Human Trajectory Prediction Via Latent Corridors | Highlight: We formalize the problem of context-specific adaptive trajectory prediction and propose a new adaptation approach inspired by prompt tuning called latent corridors. |
Neerja Thakkar; Karttikeya Mangalam; Andrea Bajcsy; Jitendra Malik; |
26 | MVDiffHD: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction | Highlight: This paper presents a neural architecture for 3D object reconstruction that synthesizes dense and high-resolution views of an object given one or a few images without camera poses. |
Shitao Tang; Jiacheng Chen; Dilin Wang; Chengzhou Tang; Fuyang Zhang; Yuchen Fan; Vikas Chandra; Yasutaka Furukawa; Rakesh Ranjan; |
27 | TextDiffuser-2: Unleashing The Power of Language Models for Text Rendering | Highlight: Although existing work has endeavored to enhance the accuracy of text rendering, these methods still suffer from several drawbacks, such as (1) limited flexibility and automation, (2) constrained capability of layout prediction, and (3) restricted diversity. In this paper, we present TextDiffuser-2, aiming to unleash the power of language models for text rendering while taking these three aspects into account. |
Jingye Chen; Yupan Huang; Tengchao Lv; Lei Cui; Qifeng Chen; Furu Wei; |
28 | Goldfish: Vision-Language Understanding of Arbitrarily Long Videos | Highlight: In this paper, we present Goldfish, a methodology tailored for comprehending videos of arbitrary lengths. We also introduce the TVQA-long benchmark, specifically designed to evaluate models’ capabilities in understanding long videos with questions in both vision and text content. |
Kirolos Ataallah; Xiaoqian Shen; Eslam Mohamed Abdelrahman; Essam Sleiman; Mingchen Zhuge; Jian Ding; Deyao Zhu; Jürgen Schmidhuber; Mohamed Elhoseiny; |
29 | The All-Seeing Project V2: Towards General Relation Comprehension of The Open World | Highlight: We present the All-Seeing Project V2: a new model and dataset designed for understanding object relations in images. To facilitate training and evaluation of MLLMs in relation understanding, we created the first high-quality ReC dataset, which is aligned with the format of standard instruction tuning data. |
Weiyun Wang; Yiming Ren; Haowen Luo; Tiantong Li; Chenxiang Yan; Zhe Chen; Wenhai Wang; Qingyun Li; Lewei Lu; Xizhou Zhu; Yu Qiao; Jifeng Dai; |
30 | Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations | Highlight: We introduce Affective Visual Dialog, an emotion explanation and reasoning task as a testbed for research on understanding constructed emotions in response to visually grounded conversations. |
Kilichbek Haydarov; Xiaoqian Shen; Avinash Madasu; Mahmoud Salem; Li-Jia Li; Gamaleldin F Elsayed; Mohamed Elhoseiny; |
31 | Uni3DL: A Unified Model for 3D Vision-Language Understanding | Highlight: We present Uni3DL, a unified model for 3D Vision-Language understanding. |
Xiang Li; Jian Ding; Zhaoyang Chen; Mohamed Elhoseiny; |
32 | Investigating Style Similarity in Diffusion Models | Highlight: We present a framework for understanding and extracting style descriptors from images. |
Gowthami Somepalli; Anubhav Gupta; Kamal Gupta; Shramay Palta; Micah Goldblum; Jonas A. Geiping; Abhinav Shrivastava; Tom Goldstein; |
33 | Vista3D: Unravel The 3D Darkside of A Single Image | Highlight: We embark on the age-old quest: unveiling the hidden dimensions of objects from mere glimpses of their visible parts. To address this, we present Vista3D, a framework that realizes swift and consistent 3D generation within a mere 5 minutes. |
Qiuhong Shen; Xingyi Yang; Michael Bi Mi; Xinchao Wang; |
34 | FlashSplat: 2D to 3D Gaussian Splatting Segmentation Solved Optimally | Highlight: Instead, we propose a straightforward yet globally optimal solver for 3D-GS segmentation. |
Qiuhong Shen; Xingyi Yang; Xinchao Wang; |
35 | I Can’t Believe It’s Not Scene Flow! | Highlight: State-of-the-art scene flow methods broadly fail to describe the motion of small objects, and existing evaluation protocols hide this failure by averaging over many points. To address this limitation, we propose Bucket Normalized EPE, a new class-aware and speed-normalized evaluation protocol that better contextualizes error comparisons between object types that move at vastly different speeds. |
Ishan Khatri; Kyle Vedder; Neehar Peri; Deva Ramanan; James Hays; |
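The proposed protocol replaces one point-averaged EPE number with per-class, per-speed-bucket statistics, so that small or slow-moving objects are not drowned out by the static background. A sketch of such a metric is below; the bucket edges and the exact normalization are assumptions, not the paper's published definition.

```python
# Sketch of a class-aware, speed-bucketed flow metric in the spirit of
# Bucket Normalized EPE.
import numpy as np

def bucket_normalized_epe(pred_flow, gt_flow, classes,
                          bucket_edges=(0.0, 0.4, 1.0, np.inf)):
    """pred_flow, gt_flow: (N, 3) per-point flow vectors; classes: (N,) class ids."""
    epe = np.linalg.norm(pred_flow - gt_flow, axis=1)   # per-point endpoint error
    speed = np.linalg.norm(gt_flow, axis=1)             # ground-truth speed
    out = {}
    for c in np.unique(classes):
        for lo, hi in zip(bucket_edges[:-1], bucket_edges[1:]):
            m = (classes == c) & (speed >= lo) & (speed < hi)
            if m.any():
                if lo == 0.0:
                    val = epe[m].mean()                 # static bucket: raw EPE
                else:
                    val = (epe[m] / speed[m]).mean()    # moving: speed-normalized
                out[(int(c), lo, hi)] = float(val)
    return out

print(bucket_normalized_epe(np.random.randn(1000, 3) * 0.1,
                            np.random.randn(1000, 3) * 0.1,
                            np.random.randint(0, 3, 1000)))
```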
36 | Gaussian Grouping: Segment and Edit Anything in 3D Scenes | Highlight: However, it is solely concentrated on the appearance and geometry modeling, while lacking in fine-grained object-level scene understanding. To address this issue, we propose Gaussian Grouping, which extends Gaussian Splatting to jointly reconstruct and segment anything in open-world 3D scenes. |
Mingqiao Ye; Martin Danelljan; Fisher Yu; Lei Ke; |
37 | GeoWizard: Unleashing The Diffusion Priors for 3D Geometry Estimation from A Single Image | Highlight: In this paper, we demonstrate that generative models, as opposed to traditional discriminative models (e.g., CNNs and Transformers), can effectively address the inherently ill-posed problem. |
Xiao Fu; Wei Yin; Mu Hu; Kaixuan Wang; Yuexin Ma; Ping Tan; Shaojie Shen; Dahua Lin; Xiaoxiao Long; |
38 | "MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training" Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we discuss building performant Multimodal Large Language Models (MLLMs). |
Brandon McKinzie; Zhe Gan; Jean-Philippe Fauconnier; Samuel Dodge; Bowen Zhang; Philipp Dufter; Dhruti Shah; Futang Peng; Anton Belyi; Max A Schwarzer; Hongyu Hè; Xianzhi Du; Haotian Zhang; Karanjeet Singh; Doug Kang; Tom Gunter; Xiang Kong; Aonan Zhang; Jianyu Wang; Chong Wang; Nan Du; Tao Lei; Sam Wiseman; Mark Lee; Zirui Wang; Ruoming Pang; Peter Grasch; Alexander Toshev; Yinfei Yang; |
39 | YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information | Highlight: We propose the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. |
Chien-Yao Wang; I-Hau Yeh; Hong-Yuan Mark Liao; |
40 | LCM-Lookahead for Encoder-based Text-to-Image Personalization | Highlight: These properties present opportunities to leverage fast sampling methods as a shortcut-mechanism, using them to create a preview of denoised outputs through which we can backpropagate image-space losses. In this work, we explore the potential of using such shortcut-mechanisms to guide the personalization of text-to-image models to specific facial identities. |
Rinon Gal; Or Lichter; Elad Richardson; Or Patashnik; Amit Bermano; Gal Chechik; Danny Cohen-Or; |
41 | Lane Graph As Path: Continuity-preserving Path-wise Modeling for Online Lane Graph Construction | Highlight: We argue that the path, which indicates the traffic flow, is the primitive of the lane graph. Motivated by this, we propose to model the lane graph in a novel path-wise manner, which well preserves the continuity of the lane and encodes traffic information for planning. |
Bencheng Liao; Shaoyu Chen; Bo Jiang; Tianheng Cheng; Qian Zhang; Wenyu Liu; Chang Huang; Xinggang Wang; |
42 | DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors | Highlight: Traditional image animation techniques mainly focus on animating natural scenes with stochastic dynamics (e.g. clouds and fluid) or domain-specific motions (e.g. human hair or body motions), thus limiting their applicability to more general visual content. To overcome this limitation, we explore the synthesis of dynamic content for open-domain images, converting them into animated videos. |
Jinbo Xing; Menghan Xia; Yong Zhang; Haoxin Chen; Wangbo Yu; Hanyuan Liu; Gongye Liu; Xintao Wang; Ying Shan; Tien-Tsin Wong; |
43 | Diffusion Models for Open-Vocabulary Segmentation | Highlight: To that end, we present OVDiff, a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation. |
Laurynas Karazija; Iro Laina; Andrea Vedaldi; Christian Rupprecht; |
44 | PointLLM: Empowering Large Language Models to Understand Point Clouds | Highlight: The unprecedented advancements in Large Language Models (LLMs) have shown a profound impact on natural language processing but are yet to fully embrace the realm of 3D understanding. This paper introduces PointLLM, a preliminary effort to fill this gap, empowering LLMs to understand point clouds and offering a new avenue beyond 2D data. |
Runsen Xu; Xiaolong Wang; Tai Wang; Yilun Chen; Jiangmiao Pang; Dahua Lin; |
45 | View Selection for 3D Captioning Via Diffusion Ranking | Highlight: We pinpoint a major challenge: certain rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations. To tackle this, we present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views, where the views with high alignment closely represent the object’s characteristics. |
Tiange Luo; Justin Johnson; Honglak Lee; |
46 | Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images | Highlight: This paper studies visual representation learning with diffusion-generated synthetic images. |
David Junhao Zhang; Mutian Xu; Jay Zhangjie Wu; Chuhui Xue; Wenqing Zhang; Xiaoguang Han; Song Bai; Mike Zheng Shou; |
47 | Teddy: Efficient Large-Scale Dataset Distillation Via Taylor-Approximated Matching | Highlight: However, this approach incurs high memory and time complexity, posing difficulties in scaling up to large datasets such as ImageNet. Addressing these concerns, this paper introduces Teddy, a Taylor-approximated dataset distillation framework designed to handle large-scale datasets and enhance efficiency. |
Ruonan Yu; Songhua Liu; Jingwen Ye; Xinchao Wang; |
48 | CoTracker: It Is Better to Track Together | Highlight: We introduce CoTracker, a transformer-based model that tracks a large number of 2D points in long video sequences. |
Nikita Karaev; Ignacio Rocco; Ben Graham; Natalia Neverova; Andrea Vedaldi; Christian Rupprecht; |
49 | Improving Unsupervised Domain Adaptation: A Pseudo-Candidate Set Approach | Highlight: In this work, we aim to improve target set accuracy in any existing UDA method by introducing an approach that utilizes pseudo-candidate sets for labeling the target data. |
Aveen Dayal; Rishabh Lalla; Linga Reddy Cenkeramaddi; C. Krishna Mohan; Abhinav Kumar; Vineeth N Balasubramanian; |
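A pseudo-candidate set labels each target sample with its top-k most likely classes instead of one hard, possibly wrong, pseudo-label. The sketch below shows one way such a candidate-set loss could look; the teacher/student split, the value of k, and the log-mass objective are our assumptions, not the paper's exact formulation.

```python
# Hypothetical candidate-set loss: push student probability mass onto the
# top-k classes proposed by a frozen teacher.
import torch
import torch.nn.functional as F

def candidate_set_loss(student_logits, teacher_logits, k=3):
    cand = teacher_logits.topk(k, dim=1).indices            # (B, k) candidate labels
    probs = F.softmax(student_logits, dim=1)
    mass = probs.gather(1, cand).sum(dim=1).clamp_min(1e-8)
    return -mass.log().mean()                               # maximize mass on the set

student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)                         # e.g., from a source model
candidate_set_loss(student_logits, teacher_logits).backward()
```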
50 | SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models | Highlight: In this work, we present SparseCtrl to enable flexible structure control with temporally sparse signals, requiring only one or a few inputs. |
Yuwei Guo; Ceyuan Yang; Anyi Rao; Maneesh Agrawala; Dahua Lin; Bo Dai; |
51 | Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection | Highlight: We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. |
Tim Salzmann; Markus Ryll; Alex Bewley; Matthias Minderer; |
52 | Evaluating Text-to-Visual Generation with Image-to-Text Generation | Highlight: To address this, we introduce VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a “Yes” answer to a simple “Does this figure show {text}?” question. |
Zhiqiu Lin; Deepak Pathak; Baiqi Li; Jiayao Li; Xide Xia; Graham Neubig; Pengchuan Zhang; Deva Ramanan; |
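VQAScore reduces alignment scoring to a single probability: how likely a VQA model is to answer “Yes” when asked whether the image shows the text. The sketch below uses a stand-in model to show the scoring logic; the stub and prompt wording are illustrative, and a real implementation would read the probabilities off an actual VQA model.

```python
# Sketch of VQAScore-style scoring with a stand-in VQA model.
import torch

def vqa_logits(image, question):
    """Stub standing in for a real VQA model; returns logits for [No, Yes]."""
    return torch.randn(2)

def vqascore(image, text):
    question = f"Does this figure show '{text}'?"
    logits = vqa_logits(image, question)
    return torch.softmax(logits, dim=0)[1].item()   # P("Yes") as the alignment score

print(vqascore(torch.zeros(3, 224, 224), "a red cube on a blue sphere"))
```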
53 | Context Diffusion: In-Context Aware Image Generation | Highlight: We propose Context Diffusion, a diffusion-based framework that enables image generation models to learn from visual examples presented in context. |
Ivona Najdenkoska; Animesh Sinha; Abhimanyu Dubey; Dhruv Mahajan; Vignesh Ramanathan; Filip Radenovic; |
54 | Segment and Recognize Anything at Any Granularity | Highlight: In this work, we introduce Semantic-SAM, an augmented image segmentation foundation for segmenting and recognizing anything at desired granularities. |
Feng Li; Hao Zhang; Peize Sun; Xueyan Zou; Shilong Liu; Chunyuan Li; Jianwei Yang; Lei Zhang; Jianfeng Gao; |
55 | DiffiT: Diffusion Vision Transformers for Image Generation | Highlight: In this paper, we study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT). |
Ali Hatamizadeh; Jiaming Song; Guilin Liu; Jan Kautz; Arash Vahdat; |
56 | Region-centric Image-Language Pretraining for Open-Vocabulary Detection | Highlight: We present a new open-vocabulary detection approach based on region-centric image-language pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection. |
Dahun Kim; Anelia Angelova; Weicheng Kuo; |
57 | UniIR: Training and Benchmarking Universal Multimodal Information Retrievers | Highlight: Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image. To approach such different information-seeking demands, we introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. |
Cong Wei; Yang Chen; Haonan Chen; Hexiang Hu; Ge Zhang; Jie Fu; Alan Ritter; Wenhu Chen; |
58 | 3D Gaussian Parametric Head Model | Highlight: This paper introduces a novel approach, 3D Gaussian Parametric Head Model, which employs 3D Gaussians to accurately represent the complexities of the human head, allowing precise control over both identity and expression. |
Yuelang Xu; Lizhen Wang; Zerong Zheng; Zhaoqi Su; Yebin Liu; |
59 | EMO: Emote Portrait Alive – Generating Expressive Portrait Videos with Audio2Video Diffusion Model Under Weak Conditions | Highlight: In this work, we tackle the challenge of enhancing the realism and expressiveness in talking head video generation by focusing on the dynamic and nuanced relationship between audio cues and facial movements. |
Linrui Tian; Qi Wang; Bang Zhang; Liefeng Bo; |
60 | When Do We Not Need Larger Vision Models? | Highlight: In this work, we discuss the point beyond which larger vision models are not necessary. |
Baifeng Shi; Ziyang Wu; Maolin Mao; Xin Wang; Trevor Darrell; |
61 | Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment | Highlight: In this paper, we present a method to provide detailed textual and visual explanation of detected misalignments between text-image pairs. |
Brian Gordon; Yonatan Bitton; Yonatan Shafir; Roopal Garg; Xi Chen; Dani Lischinski; Daniel Cohen-Or; Idan Szpektor; |
62 | Images Are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | Highlight: In this paper, we study the harmlessness alignment problem of multimodal large language models (MLLMs). |
Yifan Li; Hangyu Guo; Kun Zhou; Wayne Xin Zhao; Ji-Rong Wen; |
63 | Taming Latent Diffusion Model for Neural Radiance Field Inpainting | Highlight: These two problems are further reinforced with the use of pixel-distance losses. To address these issues, we propose tempering the diffusion model’s stochasticity with per-scene customization and mitigating the textural shift with masked adversarial training. |
Chieh Hubert Lin; Changil Kim; Jia-Bin Huang; Qinbo Li; Chih-Yao Ma; Johannes Kopf; Ming-Hsuan Yang; Hung-Yu Tseng; |
64 | ShapeLLM: Universal 3D Object Understanding for Embodied Interaction | Highlight: This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D point clouds and languages. |
Zekun Qi; Runpei Dong; Shaochen Zhang; Haoran Geng; Chunrui Han; Zheng Ge; Li Yi; Kaisheng Ma; |
65 | Freeview Sketching: View-Aware Fine-Grained Sketch-Based Image Retrieval | Highlight: In this paper, we delve into the intricate dynamics of Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) by addressing a critical yet overlooked aspect – the choice of viewpoint during sketch creation. |
Aneeshan Sain; Pinaki Nath Chowdhury; Subhadeep Koley; Ayan Kumar Bhunia; Yi-Zhe Song; |
66 | DC-Solver: Improving Predictor-Corrector Diffusion Sampler Via Dynamic Compensation | Highlight: In this paper, we introduce a new fast DPM sampler called DC-Solver, which leverages dynamic compensation (DC) to mitigate the misalignment of the predictor-corrector samplers. |
Wenliang Zhao; Haolin Wang; Jie Zhou; Jiwen Lu; |
67 | VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks | Highlight: Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. |
Xiangxiang Chu; Jianlin Su; Bo Zhang; Chunhua Shen; |
68 | Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting | Highlight: This paper aims to tackle the problem of modeling dynamic urban streets for autonomous driving scenes. |
Yunzhi Yan; Haotong Lin; Chenxu Zhou; Weijie Wang; Haiyang Sun; Kun Zhan; Xianpeng Lang; Xiaowei Zhou; Sida Peng; |
69 | NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models | Highlight: In this work, we strive to bridge the divide between VLN-specialized models and LLM-based navigation paradigms, while maintaining the interpretative prowess of LLMs in generating linguistic navigational reasoning. |
Gengze Zhou; Yicong Hong; Zun Wang; Xin Eric Wang; Qi Wu; |
70 | GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation | Highlight: We introduce GRM, a large-scale reconstructor capable of recovering a 3D asset from sparse-view images in around 0.1s. GRM is a feed-forward transformer-based model that efficiently incorporates multi-view information to translate the input pixels into pixel-aligned Gaussians, which are unprojected to create a set of densely distributed 3D Gaussians representing a scene. |
Yinghao Xu; Zifan Shi; Wang Yifan; Hansheng Chen; Ceyuan Yang; Sida Peng; Yujun Shen; Gordon Wetzstein; |
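“Pixel-aligned Gaussians” means each pixel of the input views yields one Gaussian, placed by unprojecting the pixel along its camera ray at a predicted depth. A tiny sketch of that unprojection follows; the intrinsics, the random depth standing in for a depth head, and the per-Gaussian size are stand-in assumptions.

```python
# Sketch of unprojecting per-pixel predictions into 3D Gaussian centers.
import torch

H, W, f = 8, 8, 10.0                                   # tiny image and focal length
v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                      torch.arange(W, dtype=torch.float32), indexing="ij")
rays = torch.stack([(u - W / 2) / f, (v - H / 2) / f, torch.ones_like(u)], dim=-1)
rays = rays / rays.norm(dim=-1, keepdim=True)          # unit ray directions

depth = 1.0 + torch.rand(H, W, 1)                      # stand-in depth prediction
centers = depth * rays                                 # (H, W, 3) Gaussian means
scales = 0.01 * torch.ones(H, W, 3)                    # stand-in per-Gaussian size
print(centers.reshape(-1, 3).shape)                    # one Gaussian per pixel
```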
71 | TrackNeRF: Bundle Adjusting NeRF from Sparse and Noisy Views Via Feature Tracks | Highlight: Closely following bundle adjustment in Structure-from-Motion (SfM), we introduce TrackNeRF for more globally consistent geometry reconstruction and more accurate pose optimization. |
Jinjie Mai; Wenxuan Zhu; Sara Rojas; Jesus Zarzar; Abdullah Hamdi; Guocheng Qian; Bing Li; Silvio Giancola; Bernard Ghanem; |
72 | SPIRE: Semantic Prompt-Driven Image Restoration | Highlight: In this paper, we develop SPIRE, a Semantic and restoration Prompt-driven Image Restoration framework that leverages natural language as a user-friendly interface to control the image restoration process. |
Chenyang Qi; Zhengzhong Tu; Keren Ye; Mauricio Delbracio; Peyman Milanfar; Qifeng Chen; Hossein Talebi; |
73 | Interactive 3D Object Detection with Prompts | Highlight: This paper introduces a novel approach to 3D object annotation, multi-modal interactive 3D object detection, incorporating “prompt in 2D, detect in 3D” and “detect in 3D, refine in 3D” strategies. |
Ruifei Zhang; Xiangru Lin; Wei Zhang; Jincheng Lu; Xuekuan Wang; Xiao Tan; Yingying Li; Errui Ding; Jingdong Wang; Guanbin Li; |
74 | Editable Image Elements for Controllable Synthesis | Highlight: In this work, we propose an image representation that promotes spatial editing of input images using a diffusion model. |
Jiteng Mu; Michaël Gharbi; Richard Zhang; Eli Shechtman; Nuno Vasconcelos; Xiaolong Wang; Taesung Park; |
75 | DOCCI: Descriptions of Connected and Contrasting Images | Highlight: To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. |
Yasumasa Onoe; Sunayana Rane; Zachary E Berger; Yonatan Bitton; Jaemin Cho; Roopal Garg; Alexander Ku; Zarana Parekh; Jordi Pont-Tuset; Garrett Tanzer; Su Wang; Jason M Baldridge; |
76 | Mind The Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models | Highlight: This study addresses the Domain-Class Incremental Learning problem, a realistic but challenging continual learning scenario where both the domain distribution and target classes vary across tasks. To handle these diverse tasks, pre-trained Vision-Language Models (VLMs) are introduced for their strong generalizability. |
Longxiang Tang; Zhuotao Tian; Kai Li; Chunming He; Hantao Zhou; Hengshuang Zhao; Xiu Li; Jiaya Jia; |
77 | Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models | Highlight: We present Surf-D, a novel method for generating high-quality 3D shapes as Surfaces with arbitrary topologies using Diffusion models. |
Zhengming Yu; Zhiyang Dou; Xiaoxiao Long; Cheng Lin; Zekun Li; Yuan Liu; Norman Müller; Taku Komura; Marc Habermann; Christian Theobalt; Xin Li; Wenping Wang; |
78 | MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation | Highlight: In this work, we present MasterWeaver, a test-time tuning-free method designed to generate personalized images with both high identity fidelity and flexible editability. |
Yuxiang Wei; Zhilong Ji; Jinfeng Bai; Hongzhi Zhang; Lei Zhang; Wangmeng Zuo; |
79 | SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic | Highlight: The unique properties of the entities to be generated for SLEDGE, such as their connectivity and variable count per scene, render the naive application of most modern generative models to this task non-trivial. Therefore, together with a systematic study of existing lane graph representations, we introduce a novel raster-to-vector autoencoder. |
Kashyap Chitta; Daniel Dauner; Andreas Geiger; |
80 | Challenging Forgets: Unveiling The Worst-Case Forget Sets in Machine Unlearning | Highlight: Despite various methods for data influence erasure, evaluations have largely focused on random data forgetting, ignoring the vital inquiry into which subset should be chosen to truly gauge the authenticity of unlearning performance. To tackle this issue, we introduce a new evaluative angle for machine unlearning from an adversarial viewpoint. |
Chongyu Fan; Jiancheng Liu; Alfred Hero; Sijia Liu; |
81 | Instant 3D Human Avatar Generation Using Image Diffusion Models | Highlight: We present a method for fast, high-quality 3D human avatar generation from different input modalities, such as images and text prompts, with control over the generated pose and shape. |
Nikos Kolotouros; Thiemo Alldieck; Enric Corona; Eduard Gabriel Bazavan; Cristian Sminchisescu; |
82 | Generating Human Interaction Motions in Scenes with Text Control | Highlight: We present a text-controlled, scene-aware motion generation method based on denoising diffusion models. |
Hongwei Yi; Justus Thies; Michael J. Black; Xue Bin Peng; Davis Rempe; |
83 | MyVLM: Personalizing VLMs for User-Specific Queries | Highlight: However, these models lack an understanding of user-specific concepts. In this work, we take a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts. |
Yuval Alaluf; Elad Richardson; Sergey Tulyakov; Kfir Aberman; Danny Cohen-Or; |
84 | PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation | Highlight: In this paper, we introduce PixArt-Sigma, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. |
Junsong Chen; Chongjian Ge; Enze Xie; Yue Wu; Lewei Yao; Xiaozhe Ren; Zhongdao Wang; Ping Luo; Huchuan Lu; Zhenguo Li; |
85 | VLAD-BuFF: Burst-aware Fast Feature Aggregation for Visual Place Recognition | Highlight: The current state-of-the-art VPR methods rely on VLAD aggregation, which can be trained to learn a weighted contribution of features through their soft assignment to cluster centers. However, this process has two key limitations. Firstly, the feature-to-cluster weighting does not account for over-represented repetitive structures within a cluster, e.g., shadows or window panes; this phenomenon is also referred to as the ‘burstiness’ problem, classically solved by discounting repetitive features before aggregation. Secondly, feature-to-cluster comparisons are compute-intensive for state-of-the-art image encoders with high-dimensional local features. This paper addresses these limitations by introducing VLAD-BuFF with two novel contributions: i) a self-similarity based feature discounting mechanism to learn Burst-aware features within end-to-end VPR training, and ii) Fast Feature aggregation by reducing local feature dimensions specifically through PCA-initialized learnable pre-projection. |
Ahmad Khaliq; Ming Xu; Stephen Hausler; Michael J Milford; Sourav Garg; |
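Burst discounting down-weights features that have many near-duplicates in the same image before aggregation. The sketch below shows one self-similarity-based weighting of this kind; the similarity threshold, temperature, and weighting rule are illustrative assumptions rather than VLAD-BuFF's learned mechanism.

```python
# Sketch of self-similarity-based burst discounting before feature aggregation.
import torch
import torch.nn.functional as F

def burst_weights(feats, tau=0.1, thresh=0.8):
    """feats: (N, D) L2-normalized local features -> (N,) weights."""
    sim = feats @ feats.T                               # pairwise cosine similarity
    soft_count = torch.sigmoid((sim - thresh) / tau).sum(dim=1)
    soft_count = soft_count - torch.sigmoid(torch.tensor((1.0 - thresh) / tau))  # drop self
    return 1.0 / (1.0 + soft_count)                     # repetitive -> small weight

feats = F.normalize(torch.randn(100, 64), dim=1)
w = burst_weights(feats)
pooled = (w[:, None] * feats).sum(dim=0) / w.sum()      # burst-discounted aggregation
```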
86 | Revisit Anything: Visual Place Recognition Via Image Segment Retrieval | Highlight: We propose to use open-set image segmentation to decompose an image into ‘meaningful’ entities (i.e., things and stuff). |
Kartik Garg; Sai Shubodh; Shishir N Y Kolathaya; Madhava Krishna; Sourav Garg; |
87 | Recursive Visual Programming | Highlight: Inspired by human coding practices, we propose Recursive Visual Programming (RVP), which better harnesses the reasoning capacity of LLMs, provides modular code structure between code pieces, and assigns different return types for the sub-problems elegantly. |
Jiaxin Ge; Sanjay Subramanian; Baifeng Shi; Roei Herzig; Trevor Darrell; |
88 | Dolphins: Multimodal Language Model for Driving | Highlight: In this paper, we introduce Dolphins, a novel vision-language model architected to imbibe human-like abilities as a conversational driving assistant. |
Yingzi Ma; Yulong Cao; Jiachen Sun; Marco Pavone; Chaowei Xiao; |
89 | Towards Open-ended Visual Quality Comparison | Highlight: In this work, we extend the edge of emerging large multi-modality models (LMMs) to further advance visual quality comparison into open-ended settings, such that the model 1) can respond to open-range questions on quality comparison, and 2) can provide detailed reasonings beyond direct answers. To train this first-of-its-kind open-source open-ended visual quality comparer, we collect the Co-Instruct-562K dataset from two sources: (a) LLM-merged single image quality description, and (b) GPT-4V “teacher” responses on unlabeled data. |
Haoning Wu; Hanwei Zhu; Zicheng Zhang; Erli Zhang; Chaofeng Chen; Liang Liao; Chunyi Li; Annan Wang; Wenxiu Sun; Qiong Yan; Xiaohong Liu; Guangtao Zhai; Shiqi Wang; Weisi Lin; |
90 | GOEmbed: Gradient Origin Embeddings for Representation Agnostic 3D Feature Learning | Highlight: We propose GOEmbed (Gradient Origin Embeddings), which encodes input 2D images into any 3D representation without requiring a pre-trained image feature extractor, unlike typical prior approaches in which input images are either encoded using 2D features extracted from large pre-trained models, or customized features are designed to handle different 3D representations, or, worse, encoders may not yet be available for specialized 3D neural representations such as MLPs and hash-grids. |
Animesh Karnewar; Roman Shapovalov; Tom Monnier; Andrea Vedaldi; Niloy J. Mitra; David Novotny; |
91 | OccWorld: Learning A 3D Occupancy World Model for Autonomous Driving | Highlight: In this paper, we explore a new framework of learning a world model, OccWorld, in the 3D occupancy space to simultaneously predict the movement of the ego car and the evolution of the surrounding scenes. |
Wenzhao Zheng; Weiliang Chen; Yuanhui Huang; Borui Zhang; Yueqi Duan; Jiwen Lu; |
92 | ZipLoRA: Any Subject in Any Style By Effectively Merging LoRAs | Highlight: We propose ZipLoRA, a method to cheaply and effectively merge independently trained style and subject LoRAs in order to achieve generation of any user-provided subject in any user-provided style. |
Viraj Shah; Nataniel Ruiz; Forrester Cole; Erika Lu; Svetlana Lazebnik; Yuanzhen Li; Varun Jampani; |
93 | LaRa: Efficient Large-Baseline Radiance Fields | Highlight: We propose a method that unifies local and global reasoning in transformer layers, resulting in improved quality and faster convergence. |
Anpei Chen; Haofei Xu; Stefano Esposito; Siyu Tang; Andreas Geiger; |
94 | Towards Open-Ended Visual Recognition with Large Language Models | Highlight: In this paper, we introduce the OmniScient Model (OSM), a novel Large Language Model (LLM) based mask classifier, as a straightforward and effective solution to the aforementioned challenges. |
Qihang Yu; Xiaohui Shen; Liang-Chieh Chen; |
95 | Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation Without Manual Labels | Highlight: In contrast, recent 2D foundation models have demonstrated strong generalization and impressive zero-shot abilities, inspiring us to incorporate these characteristics from 2D models into 3D models. Therefore, we explore the use of image segmentation foundation models to automatically generate high-quality training labels for 3D segmentation models. |
Rui Huang; Songyou Peng; Ayca Takmaz; Federico Tombari; Marc Pollefeys; Shiji Song; Gao Huang; Francis Engelmann; |
96 | ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | Highlight: In this paper, we delve into the influence of training data on LMMs, uncovering three pivotal findings: 1) highly detailed captions enable more nuanced vision-language alignment, significantly boosting the performance of LMMs in diverse benchmarks and surpassing outcomes from brief captions or VQA data; 2) cutting-edge LMMs can be close to the captioning capability of costly human annotators, and open-source LMMs could reach similar quality after lightweight fine-tuning; 3) the performance of LMMs scales with the number of detailed captions, exhibiting remarkable improvements across a range from thousands to millions of captions. |
Lin Chen; Jinsong Li; Xiaoyi Dong; Pan Zhang; Conghui He; Jiaqi Wang; Feng Zhao; Dahua Lin; |
97 | LLaMA-VID: An Image Is Worth 2 Tokens in Large Language Models | Highlight: In this work, we present a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding, called LLaMA-VID. |
Yanwei Li; Chengyao Wang; Jiaya Jia; |
98 | Occupancy As Set of Points | Highlight: In this paper, we explore a novel point representation for 3D occupancy prediction from multi-view images, which is named Occupancy as Set of Points. |
Yiang Shi; Tianheng Cheng; Qian Zhang; Wenyu Liu; Xinggang Wang; |
99 | Multi-Sentence Grounding for Long-term Instructional Video | Highlight: In this paper, we aim to establish an automatic, scalable pipeline for denoising the large-scale instructional dataset and construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep. |
Zeqian Li; Qirui Chen; Tengda Han; Ya Zhang; Yan-Feng Wang; Weidi Xie; |
100 | MMBENCH: Is Your Multi-Modal Model An All-around Player? | Highlight: Meanwhile, subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model’s abilities by incorporating human labor, which is not scalable and may display significant bias. In response to these challenges, we propose MMBench, a bilingual benchmark for assessing the multi-modal capabilities of VLMs. |
Yuan Liu; Haodong Duan; Yuanhan Zhang; Bo Li; Songyang Zhang; Wangbo Zhao; Yike Yuan; Jiaqi Wang; Conghui He; Ziwei Liu; Kai Chen; Dahua Lin; |
101 | CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model | Highlight: In this work, we present the Convolutional Reconstruction Model (CRM), a high-fidelity feed-forward single image-to-3D generative model. |
Zhengyi Wang; Yikai Wang; Yifei Chen; Chendong Xiang; Shuo Chen; Dajiang Yu; Chongxuan Li; Hang Su; Jun Zhu; |
102 | Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation | Highlight: However, few of them incorporate vision-text collaborative optimization. Based on this, we propose the Content-Dependent Transfer to adaptively enhance each text embedding by interacting with the input image, which presents a parameter-efficient way to optimize the text representation. |
Siyu Jiao; Hongguang Zhu; Yunchao Wei; Yao Zhao; Jiannan Huang; Humphrey Shi; |
103 | Rethinking Image-to-Video Adaptation: An Object-centric Perspective | Highlight: In this paper, we propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective. |
Rui Qian; Shuangrui Ding; Dahua Lin; |
104 | LogoSticker: Inserting Logos Into Diffusion Models for Customized Generation | Highlight: In contrast, logos, characterized by unique patterns and textual elements, make it hard for diffusion models to establish shared knowledge, thus presenting a unique challenge. To bridge this gap, we introduce the task of logo insertion. |
Mingkang Zhu; Xi Chen; Zhongdao Wang; Hengshuang Zhao; Jiaya Jia; |
105 | Octopus: Embodied Vision-Language Programmer from Environmental Feedback | Highlight: When integrated into an embodied agent, existing embodied VLM works either output detailed action sequences at the manipulation level or only provide plans at an abstract level, leaving a gap between high-level planning and real-world manipulation. To bridge this gap, we introduce Octopus, an embodied vision-language programmer that uses executable code generation as a medium to connect planning and manipulation. |
Jingkang Yang; Yuhao Dong; Shuai Liu; Bo Li; Ziyue Wang; ChenCheng Jiang; Haoran Tan; Jiamu Kang; Yuanhan Zhang; Kaiyang Zhou; Ziwei Liu; |
106 | ProtoComp: Diverse Point Cloud Completion with Controllable Prototype | Highlight: In this paper, we propose PrototypeCompletion, a novel prototype-based approach for point cloud completion. Additionally, we propose a new metric and test benchmark based on ScanNet200 and KITTI to evaluate the model’s performance in real-world scenarios, aiming to promote future research. |
Xumin Yu; Yanbo Wang; Jie Zhou; Jiwen Lu; |
107 | Co-Student: Collaborating Strong and Weak Students for Sparsely Annotated Object Detection | Highlight: To tackle those two issues, we introduce Co-Student, a novel framework aiming to bridge the gap between SAOD and FAOD via fully exploiting the pseudo-labels from both teacher and student detectors. |
Lianjun Wu; Jiangxiao Han; Zengqiang Zheng; Xinggang Wang; |
108 | The Hard Positive Truth About Vision-Language Compositionality | Highlight: Our work suggests the need for future research to rigorously test and improve CLIP’s understanding of semantic relationships between related “positive” concepts. |
Amita Kamath; Cheng-Yu Hsieh; Kai-Wei Chang; Ranjay Krishna; |
109 | Human Hair Reconstruction with Strand-Aligned 3D Gaussians | Highlight: We introduce a new hair modeling method that uses a dual representation of classical hair strands and 3D Gaussians to produce accurate and realistic strand-based reconstructions from multi-view data. |
Egor Zakharov; Vanessa Sklyarova; Michael J. Black; Giljoo Nam; Justus Thies; Otmar Hilliges; |
110 | Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | Highlight: We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability. |
Chuofan Ma; Yi Jiang; Jiannan Wu; Zehuan Yuan; Xiaojuan Qi; |
111 | Diffusion Bridges for 3D Point Cloud Denoising | Highlight: In this work, we address the task of point cloud denoising using a novel framework adapting Diffusion Schrödinger bridges to unstructured data like point sets. |
Mathias Vogel Hüni; Keisuke Tateno; Marc Pollefeys; Federico Tombari; Marie-Julie Rakotosaona; Francis Engelmann; |
112 | Fully Sparse 3D Occupancy Prediction | Highlight: To bridge the gap, we introduce a novel fully sparse occupancy network, termed SparseOcc. |
Haisong Liu; Yang Chen; Haiguang Wang; Zetong Yang; Tianyu Li; Jia Zeng; Li Chen; Hongyang Li; Limin Wang; |
113 | Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation | Highlight: In this paper, we explore this objective and propose LaVi-Bridge, a pipeline that enables the integration of diverse pre-trained language models and generative vision models for text-to-image generation. |
Shihao Zhao; Shaozhe Hao; Bojia Zi; Huaizhe Xu; Kwan-Yee K. Wong; |
114 | LLMGA: Multimodal Large Language Model Based Generation Assistant Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and response inherent in Large Language Models (LLMs) to assist users in image generation and editing. |
Bin Xia; Shiyin Wang; Yingfan Tao; Yitong Wang; Jiaya Jia;
115 | InsMapper: Exploring Inner-instance Information for Vectorized HD Mapping Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods usually overlook and fail to analyze the important inner-instance correlations between predicted points, impeding further advancements. To address this issue, we investigate the utilization of inner-instance information for vectorized high-definition mapping through transformers, and propose a powerful system named InsMapper, which effectively harnesses inner-instance information with three exquisite designs, including hybrid query generation, inner-instance query fusion, and inner-instance feature aggregation. |
Zhenhua Xu; Kwan-Yee K. Wong; Hengshuang Zhao;
116 | EgoPet: Egomotion and Interaction Data from An Animal’s Perspective Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To advance our understanding and reduce the gap between the capabilities of animals and AI systems, we introduce a dataset of pet egomotion imagery with diverse examples of simultaneous egomotion and multi-agent interaction. |
Amir Bar; Arya Bakhtiar; Danny L Tran; Antonio Loquercio; Jathushan Rajasegaran; Yann LeCun; Amir Globerson; Trevor Darrell;
117 | Local All-Pair Correspondence for Point Tracking Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a highly accurate and efficient model designed for the task of tracking any point (TAP) across video sequences. |
Seokju Cho; Jiahui Huang; Jisu Nam; Honggyu An; Seungryong Kim; Joon-Young Lee; |
118 | Visual Prompting Via Partial Optimal Transport Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Yet, existing methodologies often overlook the synergy between these components, leaving the intricate relationship between them underexplored. To address this, we propose an Optimal Transport-based Label Mapping strategy (OTLM) that effectively reduces distribution migration and lessens the modifications required by the visual prompts. |
Mengyu Zheng; Zhiwei Hao; Yehui Tang; Chang Xu; |
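For intuition, label mapping of this kind can be viewed as transporting probability mass from the pretrained model's output classes to downstream labels under a cost matrix. Below is a minimal entropic Sinkhorn sketch of that idea; it is a generic OT solver, not the paper's partial-OT formulation (which transports only a fraction of the mass), and all names and numbers are illustrative.

```python
import torch

def sinkhorn_plan(cost, a, b, eps=0.05, iters=200):
    """Entropic optimal-transport plan between source mass a (pretrained
    classes) and target mass b (downstream labels) for a given cost matrix.
    Generic Sinkhorn iterations; the paper's partial-OT variant additionally
    relaxes the marginal constraints, which is not shown here."""
    K = torch.exp(-cost / eps)            # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]    # transport plan, rows ~ a, cols ~ b

# Hypothetical example: map 10 pretrained classes onto 3 downstream labels.
cost = torch.rand(10, 3)                  # e.g., 1 - prototype similarity
plan = sinkhorn_plan(cost, torch.full((10,), 0.1), torch.full((3,), 1 / 3))
mapping = plan.argmax(dim=1)              # hard label mapping per source class
```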
119 | Adapt Without Forgetting: Distill Proximity from Dual Teachers in Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a strategy to enhance both the zero-shot transfer ability and adaptability to new data distribution. |
Mengyu Zheng; Yehui Tang; Zhiwei Hao; Kai Han; Yunhe Wang; Chang Xu; |
120 | Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. |
Keen You; Haotian Zhang; Eldon Schoop; Floris Weers; Amanda Swearngin; Jeff Nichols; Yinfei Yang; Zhe Gan; |
121 | PromptIQA: Boosting The Performance and Generalization for No-Reference Image Quality Assessment Via Prompts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a Prompt-based IQA (PromptIQA) that can quickly adapt to new requirements without fine-tuning after training. |
Zewen Chen; Haina Qin; Juan Wang; Chunfeng Yuan; Bing Li; Weiming Hu; Leon Wang; |
122 | "Eyes Closed, Safety On: Protecting Multimodal LLMs Via Image-to-Text Transformation" Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To construct robust MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protecting approach that exploits the inherent safety awareness of MLLMs, and generates safer responses via adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of pre-aligned LLMs in MLLMs. |
Yunhao Gou; Kai Chen; Zhili Liu; Lanqing Hong; Hang Xu; Zhenguo Li; Dit-Yan Yeung; James Kwok; Yu Zhang;
123 | Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this generalization is limited by the narrow range of categories covered by existing datasets, such as NOCS, which also tend to overlook common real-world challenges like occlusion. To tackle these challenges, we introduce Omni6D, a comprehensive RGBD dataset featuring a wide range of categories and varied backgrounds, elevating the task to a more realistic context. |
Mengchen Zhang; Tong Wu; Tai Wang; Tengfei Wang; Ziwei Liu; Dahua Lin; |
124 | 3D Congealing: 3D-Aware Image Alignment in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose 3D Congealing, a novel problem of 3D-aware alignment for 2D images capturing semantically similar objects, and introduce a general framework that tackles the task without assuming shape templates, poses, or any camera parameters. |
Yunzhi Zhang; Zizhang Li; Amit Raj; Andreas Engelhardt; Yuanzhen Li; Tingbo Hou; Jiajun Wu; Varun Jampani; |
125 | ZeST: Zero-Shot Material Transfer from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose ZeST, a method for zero-shot material transfer to an object in the input image given a material exemplar image. |
Ta-Ying Cheng; Prafull Sharma; Andrew Markham; Niki Trigoni; Varun Jampani; |
126 | ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose ReSyncer, a unified and effective framework that synchronizes generalized audio-visual facial information. |
Jiazhi Guan; Zhiliang Xu; Hang Zhou; Kaisiyuan Wang; Shengyi He; Zhanwang Zhang; Borong Liang; Haocheng Feng; Errui Ding; Jingtuo Liu; Jingdong Wang; Youjian Zhao; Ziwei Liu; |
127 | DreamDrone: Text-to-Image Diffusion Models Are Zero-shot Perpetual View Generators Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce DreamDrone, a novel zero-shot and training-free pipeline for generating unbounded flythrough scenes from textual prompts. |
Hanyang Kong; Dongze Lian; Michael Bi Mi; Xinchao Wang; |
128 | SHIC: Shape-Image Correspondences with No Keypoint Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce SHIC, a method to learn canonical maps without manual supervision which achieves better results than supervised methods for most categories. |
Aleksandar Shtedritski; Christian Rupprecht; Andrea Vedaldi; |
129 | Photorealistic Video Generation with Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a diffusion transformer for photorealistic video generation from text prompts. |
Agrim Gupta; Lijun Yu; Kihyuk Sohn; Xiuye Gu; Meera Hahn; Li Fei-Fei; Irfan Essa; Lu Jiang; Jose Lezama; |
130 | ControlNet-XS: Rethinking The Control of Text-to-Image Diffusion Models As Feedback-Control Systems Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we take an existing controlling network (ControlNet) and change the communication between the controlling network and the generation process to be high-frequency and large-bandwidth. By doing so, we are able to considerably improve the quality of the generated images, as well as the fidelity of the control. |
Denis Zavadski; Johann-Friedrich Feiden; Carsten Rother; |
131 | SV3D: Novel Multi-view Synthesis and 3D Generation from A Single Image Using Latent Video Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose SV3D that adapts image-to-video diffusion model for novel multi-view synthesis and 3D generation, thereby leveraging the generalization and multi-view consistency of the video models, while further adding explicit camera control for NVS. |
Vikram Voleti; Chun-Han Yao; Mark Boss; Adam Letts; David Pankratz; Dmitrii Tochilkin; Christian Laforte; Robin Rombach; Varun Jampani; |
132 | DoughNet: A Visual Predictive Model for Topological Manipulation of Deformable Objects Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The ability to accurately predict these topological changes that a specific action might incur is critical for planning interactions with elastoplastic objects. We present DoughNet, a Transformer-based architecture for handling these challenges, consisting of two components. |
Dominik Bauer; Zhenjia Xu; Shuran Song; |
133 | GaussianFormer: Scene As Gaussians for Vision-Based 3D Semantic Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Most existing methods employ dense grids such as voxels as scene representations, which ignore the sparsity of occupancy and the diversity of object scales and thus lead to unbalanced allocation of resources. To address this, we propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians where each Gaussian represents a flexible region of interest and its semantic features. |
Yuanhui Huang; Wenzhao Zheng; Yunpeng Zhang; Jie Zhou; Jiwen Lu; |
134 | Diffusion for Natural Image Matting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we investigate a multi-step iterative approach for the first time to tackle the challenging natural image matting task, and achieve excellent performance by introducing a pixel-level denoising diffusion method (DiffMatte) for the alpha matte refinement. |
Yihan Hu; Yiheng Lin; Wei Wang; Yao Zhao; Yunchao Wei; Humphrey Shi; |
135 | "SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models" Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present , a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, visual embeddings and image scales. |
Ziyi Lin; Dongyang Liu; Renrui Zhang; Peng Gao; Longtian Qiu; Han Xiao; Han Qiu; Wenqi Shao; Keqin Chen; Jiaming Han; Siyuan Huang; Yichi Zhang; Xuming He; Yu Qiao; Hongsheng Li; |
136 | Track2Act: Predicting Point Tracks from Internet Videos Enables Generalizable Robot Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We seek to learn a generalizable goal-conditioned policy that enables diverse robot manipulation — interacting with unseen objects in novel scenes without test-time adaptation. |
Homanga Bharadhwaj; Roozbeh Mottaghi; Abhinav Gupta; Shubham Tulsiani; |
137 | GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussian primitives from 2-4 posed sparse images in ∼0.23 seconds on a single A100 GPU. |
Kai Zhang; Sai Bi; Hao Tan; Yuanbo Xiangli; Nanxuan Zhao; Kalyan Sunkavalli; Zexiang Xu; |
138 | ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a novel task named Unsupervised Concept Extraction (UCE) that considers an unsupervised setting without any human knowledge of the concepts. |
Shaozhe Hao; Kai Han; Zhengyao Lv; Shihao Zhao; Kwan-Yee K. Wong; |
139 | Pixel-GS: Density Control with Pixel-aware Gradient for 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This issue mainly stems from the point cloud growth condition, which considers only the average gradient magnitude of points across observable views, and thus fails to grow large Gaussians that are observable from many viewpoints yet covered only at the boundaries in most of them. To address this, we introduce Pixel-GS, which takes the area covered by the Gaussian in each view into account when computing the growth condition. |
Zheng Zhang; Wenbo Hu; Yixing Lao; Tong He; Hengshuang Zhao; |
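The contrast between the two growth conditions is compact enough to state in code: vanilla 3DGS averages view-space gradient magnitudes uniformly over views, while the pixel-aware variant weights each view by how many pixels the Gaussian covers there. The sketch below is our own illustration under assumed tensor layouts; the names and the threshold are not from the paper.

```python
import torch

def growth_scores(grads, pixel_counts, pixel_aware=True):
    """grads: (num_views, num_gaussians) view-space positional gradient norms.
    pixel_counts: (num_views, num_gaussians) pixels each Gaussian covers per view.
    Vanilla 3DGS averages uniformly over views; the pixel-aware variant
    weights each view by its pixel coverage, so large Gaussians seen from
    many views are no longer suppressed by their low-gradient views."""
    if not pixel_aware:
        return grads.mean(dim=0)                    # vanilla criterion
    weights = pixel_counts / pixel_counts.sum(dim=0).clamp(min=1)
    return (weights * grads).sum(dim=0)             # coverage-weighted average

grads = torch.rand(10, 5)                            # 10 views, 5 Gaussians
pix = torch.randint(0, 100, (10, 5)).float()
should_densify = growth_scores(grads, pix) > 0.02    # illustrative threshold
```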
140 | Watch Your Steps: Local Image and Scene Editing By Text Instructions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a new task, 3D edit localization, to automatically identify the relevant region for an editing task and restrict the edit accordingly. |
Ashkan Mirzaei; Tristan T Aumentado-Armstrong; Marcus A Brubaker; Jonathan Kelly; Alex Levinshtein; Konstantinos G Derpanis; Igor Gilitschenski; |
141 | LivePhoto: Real Image Animation with Text-guided Motion Control Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions. |
Xi Chen; Zhiheng Liu; Mengting Chen; Yutong Feng; Yu Liu; Yujun Shen; Hengshuang Zhao; |
142 | ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present ComboVerse, a 3D generation framework that produces high-quality 3D assets with complex compositions by learning to combine multiple models. |
Yongwei Chen; Tengfei Wang; Tong Wu; Xingang Pan; Kui Jia; Ziwei Liu; |
143 | "X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and Its Emergent Cross-modal Reasoning" Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces an efficient and effective framework that integrates multiple modalities (images, 3D, audio and video) to a frozen LLM and demonstrates an emergent ability for cross-modal reasoning (2+ modality inputs). |
Artemis Panagopoulou; Le Xue; Ning Yu; Junnan Li; Dongxu Li; Shafiq Joty; Ran Xu; Silvio Savarese; Caiming Xiong; Juan Carlos Niebles;
144 | Noise Calibration: Plug-and-play Content-Preserving Video Enhancement Using Pre-trained Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Despite the significant training costs, maintaining consistency of content between the original and enhanced videos remains a major challenge. To tackle this challenge, we propose a novel formulation that considers both visual quality and consistency of content. |
Qinyu Yang; Haoxin Chen; Yong Zhang; Menghan Xia; Xiaodong Cun; Zhixun Su; Ying Shan; |
145 | EAGLES: Efficient Accelerated 3D Gaussians with Lightweight EncodingS Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a technique utilizing quantized embeddings to significantly reduce per-point memory storage requirements and a coarse-to-fine training strategy for a faster and more stable optimization of the Gaussian point clouds. |
Sharath Girish; Kamal Gupta; Abhinav Shrivastava; |
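A quantized-embedding scheme in the spirit described above can be as simple as uniform quantization with a straight-through estimator, so the attribute codes stay trainable while storage drops to a few bits per value. This sketch is only illustrative of the idea; EAGLES' actual quantization and attribute layout may differ.

```python
import torch

def quantize_attributes(latents, n_bins=256):
    """Uniformly quantize per-Gaussian latent attribute vectors to n_bins
    levels, with a straight-through estimator so gradients still flow to
    the underlying continuous values during optimization."""
    lo, hi = latents.min(), latents.max()
    scale = (hi - lo) / (n_bins - 1)
    q = torch.round((latents - lo) / scale) * scale + lo   # de-quantized values
    return latents + (q - latents).detach()                # straight-through

attrs = torch.randn(100_000, 16, requires_grad=True)  # latent per-point codes
q_attrs = quantize_attributes(attrs)
q_attrs.sum().backward()                               # gradients reach attrs
```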
146 | SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we aim to address these major challenges in 3D-VL by examining the potential of systematically upscaling 3D-VL learning in indoor scenes. We introduce SceneVerse, the first million-scale 3D-VL dataset, encompassing indoor scenes and comprising vision-language pairs collected from both human annotations and our scalable scene-graph-based generation approach. |
Baoxiong Jia; Yixin Chen; Huangyue Yu; Yan Wang; Xuesong Niu; Tengyu Liu; Qing Li; Siyuan Huang; |
147 | PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here, we propose a set-based generative model that can simultaneously generate multiple, self-consistent new views, conditioned on any number of views. |
Jason J. Yu; Tristan Aumentado-Armstrong; Fereshteh Forghani; Konstantinos G. Derpanis; Marcus A. Brubaker; |
148 | Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, we introduce the Open-Vocabulary SAM, a SAM-inspired model designed for simultaneous interactive segmentation and recognition, leveraging two unique knowledge transfer modules: SAM2CLIP and CLIP2SAM. |
Haobo Yuan; Xiangtai Li; Chong Zhou; Yining Li; Kai Chen; Chen Change Loy; |
149 | DHR: Dual Features-Driven Hierarchical Rebalancing in Inter- and Intra-Class Regions for Weakly-Supervised Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, WSS faces challenges related to minor classes, since these are overlooked in images containing multiple adjacent classes, a limitation originating from the overfitting of traditional expansion methods like Random Walk. We first address this by employing unsupervised and weakly-supervised feature maps instead of conventional methodologies, allowing for hierarchical mask enhancement. |
Sanghyun Jo; Fei Pan; In-Jae Yu; Kyungsu Kim; |
150 | TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This bias manifests as a disproportionate focus on a singular tag (word) while neglecting other pertinent tags, stemming from CLIP embeddings prioritizing one specific tag in image-text relationships. In this paper, we introduce a novel two-step fine-tuning approach, Text-Tag Self-Distillation (TTD), to address this challenge. |
Sanghyun Jo; Soohyun Ryu; Sungyub Kim; Eunho Yang; Kyungsu Kim; |
151 | HSR: Holistic 3D Human-Scene Reconstruction from Monocular Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this, we introduce a novel and unified framework that simultaneously achieves temporally and spatially coherent 3D reconstruction of static scenes with dynamic humans from monocular RGB videos. Furthermore, we introduce a synthetic dataset for quantitative evaluations. |
Lixin Xue; Chen Guo; Chengwei Zheng; Fangjinhua Wang; Tianjian Jiang; Hsuan-I Ho; Manuel Kaufmann; Jie Song; Otmar Hilliges; |
152 | LLM As Copilot for Coarse-grained Vision-and-Language Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces VLN-Copilot, a framework enabling agents to actively seek assistance when encountering confusion, with the LLM serving as a copilot to facilitate navigation. |
Yanyuan Qiao; Qianyi Liu; Jiajun Liu; Jing Liu; Qi Wu; |
153 | MOD-UV: Learning Mobile Object Detectors from Unlabeled Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose MOD-UV, a Mobile Object Detector learned from Unlabeled Videos only. |
Yihong Sun; Bharath Hariharan; |
154 | Diffusion Models for Monocular Depth Estimation: Overcoming Challenging Conditions Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a novel approach designed to address the complexities posed by challenging, out-of-distribution data in the single-image depth estimation task. |
Fabio Tosi; Pierluigi Zama Ramirez; Matteo Poggi; |
155 | Large Motion Model for Unified Multi-Modal Motion Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present Large Motion Model (LMM), a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model. |
Mingyuan Zhang; Daisheng Jin; Chenyang Gu; Fangzhou Hong; Zhongang Cai; Jingfang Huang; Chongzhi Zhang; Xinying Guo; Lei Yang; Ying He; Ziwei Liu; |
156 | FlexAttention for Efficient High-Resolution Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current high-resolution vision-language models encode images as high-resolution image tokens and exhaustively take all these tokens to compute attention, which significantly increases the computational cost. To address this problem, we propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models. |
Junyan Li; Delin Chen; Tianle Cai; Peihao Chen; Yining Hong; Zhenfang Chen; Yikang Shen; Chuang Gan; |
157 | GIVT: Generative Infinite-Vocabulary Transformers Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce Generative Infinite-Vocabulary Transformers (GIVT), which generate vector sequences with real-valued entries, instead of discrete tokens from a finite vocabulary. |
Michael Tschannen; Cian Eastwood; Fabian Mentzer; |
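Replacing the finite vocabulary with real-valued vectors means the usual softmax head must become a continuous distribution that can be both sampled and scored. A minimal sketch with a Gaussian-mixture output head follows; the dimensions, component count, and diagonal factorization are our assumptions, not necessarily GIVT's exact head.

```python
import torch
import torch.nn as nn

class GMMHead(nn.Module):
    """Continuous output head: predicts a k-component Gaussian mixture per
    step instead of logits over a finite vocabulary (illustrative sketch)."""
    def __init__(self, hidden_dim, token_dim, k=8):
        super().__init__()
        self.k, self.token_dim = k, token_dim
        # Mixture logits, plus means and log-scales for each component.
        self.proj = nn.Linear(hidden_dim, k * (1 + 2 * token_dim))

    def forward(self, h):                      # h: (batch, hidden_dim)
        p = self.proj(h)
        logits, rest = p[:, :self.k], p[:, self.k:]
        mu, log_sigma = rest.view(-1, self.k, 2, self.token_dim).unbind(dim=2)
        mix = torch.distributions.Categorical(logits=logits)
        comp = torch.distributions.Independent(
            torch.distributions.Normal(mu, log_sigma.exp()), 1)
        return torch.distributions.MixtureSameFamily(mix, comp)

head = GMMHead(hidden_dim=512, token_dim=16)
dist = head(torch.randn(4, 512))
next_vec = dist.sample()                          # (4, 16) real-valued "token"
loss = -dist.log_prob(torch.randn(4, 16)).mean()  # teacher-forced NLL
```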
158 | Gaussian Frosting: Editable Complex Radiance Fields with Real-Time Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Gaussian Frosting, a novel mesh-based representation for high-quality rendering and editing of complex 3D effects in real-time. |
Antoine Guédon; Vincent Lepetit;
159 | Strengthening Multimodal Large Language Model with Bootstrapped Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We treat this bias as a “preference” for pretraining statistics, which hinders the model’s grounding in visual input. To mitigate this issue, we propose Bootstrapped Preference Optimization (BPO), which conducts preference learning with datasets containing negative responses bootstrapped from the model itself. |
Renjie Pi; Tianyang Han; Wei Xiong; Jipeng Zhang; Runtao Liu; Rui Pan; Tong Zhang;
160 | DreamDiffusion: High-Quality EEG-to-Image Generation with Temporal Masked Signal Modeling and CLIP Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate thoughts into text. |
Yunpeng Bai; Xintao Wang; Yan-Pei Cao; Yixiao Ge; Chun Yuan; Ying Shan; |
161 | Make Your ViT-based Multi-view 3D Detectors Faster Via Token Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although many sparse query-based methods have already attempted to improve the efficiency of 3D detectors, they neglect to consider the backbone, especially when using Vision Transformers (ViT) for better performance. To tackle this problem, we explore the efficient ViT backbones for multi-view 3D detection via token compression and propose a simple yet effective method called TokenCompression3D (ToC3D). |
Dingyuan Zhang; Dingkang Liang; Zichang Tan; Xiaoqing Ye; Cheng Zhang; Jingdong Wang; Xiang Bai; |
162 | Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Following the cutting-edge perspective of ‘Detecting As Labeling’, we propose a novel paradigm dubbed DAL. |
Junjie Huang; Yun Ye; Zhujin Liang; Yi Shan; Dalong Du; |
163 | FuseTeacher: Modality-fused Encoders Are Strong Vision Supervisors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Since the modality-fused representation augments the image representation with textual information, we conjecture that it is more discriminative and has the potential to be a strong teacher for visual representation learning. In this paper, we validate this hypothesis by experiments and propose a novel method that learns visual representation by modality-fused supervision. |
Chen-Wei Xie; Siyang Sun; Liming Zhao; Pandeng Li; Shuailei Ma; Yun Zheng; |
164 | LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Though 2D diffusion has achieved success, a unified 3D diffusion pipeline remains unsettled. This paper introduces a novel framework called LN3Diff to address this gap and enable fast, high-quality, and generic conditional 3D generation. |
Yushi Lan; Fangzhou Hong; Shuai Yang; Shangchen Zhou; Xuyi Meng; Bo Dai; Xingang Pan; Chen Change Loy; |
165 | TC4D: Trajectory-Conditioned Text-to-4D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here, we propose TC4D: trajectory-conditioned text-to-4D generation, an approach that factors motion into global and local components. |
Sherwin Bahmani; Xian Liu; Wang Yifan; Ivan Skorokhodov; Victor Rong; Ziwei Liu; Xihui Liu; Jeong Joon Park; Sergey Tulyakov; Gordon Wetzstein; Andrea Tagliasacchi; David B Lindell; |
166 | SemGrasp: Semantic Grasp Generation Via Language Aligned Discretization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a novel semantic-based grasp generation method, termed SemGrasp, which generates a static human grasp pose by incorporating semantic information into the grasp representation. |
Kailin Li; Jingbo Wang; Lixin Yang; Cewu Lu; Bo Dai; |
167 | FunQA: Towards Surprising Video Comprehension Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce FunQA, a challenging video question answering (QA) dataset specifically designed to evaluate and enhance the depth of video reasoning based on counter-intuitive and fun videos. |
Binzhu Xie; Sicheng Zhang; Zitang Zhou; Bo Li; Yuanhan Zhang; Jack Hessel; Jingkang Yang; Ziwei Liu; |
168 | MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present MVSGaussian, a new generalizable 3D Gaussian representation approach derived from Multi-View Stereo (MVS) that can efficiently reconstruct unseen scenes. |
Tianqi Liu; Guangcong Wang; Shoukang Hu; Liao Shen; Xinyi Ye; Yuhang Zang; Zhiguo Cao; Wei Li; Ziwei Liu; |
169 | GroupDiff: Diffusion-based Group Portrait Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present GroupDiff, a pioneering effort to tackle group photo editing with three dedicated contributions: 1) Data Engine: Since there are no labeled data for group photo editing, we create a data engine to generate paired data for training. |
Yuming Jiang; Nanxuan Zhao; Qing Liu; Krishna Kumar Singh; Shuai Yang; Chen Change Loy; Ziwei Liu; |
170 | ColorMAE: Exploring Data-independent Masking Strategies in Masked AutoEncoders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a simple yet effective data-independent method, termed ColorMAE, which generates different binary mask patterns by filtering random noise. |
Carlos Hinojosa; Shuming Liu; Bernard Ghanem; |
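Concretely, a data-independent mask of this kind can be produced by filtering plain Gaussian noise and masking the patches with the strongest filtered response; different filters yield differently "colored" noise patterns. In the toy sketch below, a box blur stands in for the paper's filters, and the grid size, kernel, and mask ratio are illustrative.

```python
import torch
import torch.nn.functional as F

def colored_noise_mask(grid=14, mask_ratio=0.75, kernel_size=5):
    """Data-independent masking: filter random noise and mask the patches
    with the largest filtered response. Returns a (grid*grid,) bool mask
    with True = masked patch."""
    noise = torch.randn(1, 1, grid, grid)
    kernel = torch.ones(1, 1, kernel_size, kernel_size) / kernel_size ** 2
    smooth = F.conv2d(noise, kernel, padding=kernel_size // 2).flatten()
    n_mask = int(mask_ratio * grid * grid)
    mask = torch.zeros(grid * grid, dtype=torch.bool)
    mask[smooth.topk(n_mask).indices] = True
    return mask

print(colored_noise_mask().sum())  # tensor(147) for a 14x14 grid at 75%
```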
171 | On Pretraining Data Diversity for Self-Supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. |
Hasan Abed Al Kader Hammoud; Tuhin Das; Fabio Pizzati; Philip Torr; Adel Bibi; Bernard Ghanem; |
172 | VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present VisFocus, an OCR-free method designed to better exploit the vision encoder’s capacity by coupling it directly with the language prompt. |
Ofir Abramovich; Niv Nayman; Sharon Fogel; Inbal Lavi; Ron Litman; Shahar Tsiper; Royee Tichauer; Srikar Appalaraju; Shai Mazor; R. Manmatha; |
173 | FreeInit: Bridging Initialization Gap in Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we delve deep into the noise initialization of video diffusion models, and discover an implicit training-inference gap that accounts for the unsatisfactory inference quality. |
Tianxing Wu; Chenyang Si; Yuming Jiang; Ziqi Huang; Ziwei Liu; |
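The refinement loop in FreeInit repeatedly re-initializes the sampling noise: it keeps the low-frequency band of a re-diffused generation and refreshes the high-frequency band with new Gaussian noise. Below is a sketch of just that frequency-mixing step, assuming a simple box low-pass filter over a 3D FFT; the paper's filter choice and tensor layout may differ.

```python
import torch

def mix_frequencies(diffused_latent, fresh_noise, cutoff=0.25):
    """Combine the low-frequency band of a re-diffused latent with the
    high-frequency band of fresh Gaussian noise via a 3D FFT.
    Tensors are (frames, H, W) per channel for brevity; the box low-pass
    mask is an illustrative stand-in for the paper's filter."""
    freq_a = torch.fft.fftshift(torch.fft.fftn(diffused_latent))
    freq_b = torch.fft.fftshift(torch.fft.fftn(fresh_noise))
    mask = torch.zeros_like(freq_a.real)           # low-pass box around center
    t, h, w = mask.shape
    ct, ch, cw = int(t * cutoff), int(h * cutoff), int(w * cutoff)
    mask[t//2-ct:t//2+ct, h//2-ch:h//2+ch, w//2-cw:w//2+cw] = 1.0
    mixed = freq_a * mask + freq_b * (1 - mask)
    return torch.fft.ifftn(torch.fft.ifftshift(mixed)).real

z0 = mix_frequencies(torch.randn(16, 64, 64), torch.randn(16, 64, 64))
```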
174 | Kalman-Inspired Feature Propagation for Video Face Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These paradigms encounter challenges either in reconstructing facial details or in maintaining temporal consistency. To address these issues, we introduce a novel framework called Kalman-inspired Feature Propagation, designed to maintain a stable face prior over time. |
Ruicheng Feng; Chongyi Li; Chen Change Loy; |
175 | NeuroNCAP: Photorealistic Closed-loop Safety Testing for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a versatile NeRF-based simulator for testing autonomous driving (AD) software systems, designed with a focus on sensor-realistic closed-loop evaluation and the creation of safety-critical scenarios. In this work, we use our simulator to test the responses of AD models to safety-critical scenarios inspired by the European New Car Assessment Programme (Euro NCAP). |
William Ljungbergh; Adam Tonderski; Joakim Johnander; Holger Caesar; Kalle Åström; Michael Felsberg; Christoffer Petersson;
176 | SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce SceneScript, a method that directly produces full scene models as a sequence of structured language commands using an autoregressive, token-based approach. To train SceneScript, we generate and release a large-scale synthetic dataset consisting of 100k high-quality indoor scenes, with photorealistic and ground-truth annotated renders of egocentric scene walkthroughs. |
Armen Avetisyan; Christopher Xie; Henry Howard-Jenkins; Tsun-Yi Yang; Samir Aroudj; Suvam Patra; Fuyang Zhang; Luke Holland; Duncan Frost; Campbell Orme; Jakob Engel; Edward Miller; Richard Newcombe; Vasileios Balntas; |
177 | FSGS: Real-Time Few-shot View Synthesis Using Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In the realm of NeRF-based few-shot view synthesis, there is often a trade-off between the accuracy of the synthesized view and the efficiency of the 3D representation. To tackle this dilemma, we introduce a Few-Shot view synthesis framework based on 3D Gaussian Splatting, which facilitates real-time, photo-realistic synthesis from a minimal number of training views. |
Zehao Zhu; Zhiwen Fan; Yifan Jiang; Zhangyang Wang; |
178 | VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a novel method for building scalable 3D generative models utilizing pre-trained video diffusion models. By unlocking its multi-view generative capabilities through fine-tuning, we generate a large-scale synthetic multi-view dataset to train a feed-forward 3D generative model. |
Junlin Han; Filippos Kokkinos; Philip Torr; |
179 | MegaScenes: Scene-Level View Synthesis at Scale Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we create a large-scale scene-level dataset from Internet photo collections, called MegaScenes, which contains over 100K structure from motion (SfM) reconstructions from around the world. |
Joseph Tung; Gene Chou; Ruojin Cai; Guandao Yang; Kai Zhang; Gordon Wetzstein; Bharath Hariharan; Noah Snavely; |
180 | Tuning-Free Image Customization with Image and Text Guidance Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In response, we introduce a tuning-free framework for simultaneous text-image-guided image customization, enabling precise editing of specific image regions within seconds. |
Pengzhi Li; Qiang Nie; Ying Chen; Xi Jiang; Kai Wu; Yuhuan Lin; Yong Liu; Jinlong Peng; Chengjie Wang; Feng Zheng; |
181 | UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite these developments, a prevalent limitation arises from the use of RGB data in diffusion or reconstruction models, which often results in models with inherent lighting and shadow effects that detract from their realism, limiting their usability in applications that demand accurate relighting. To bridge this gap, we present UniDream, a text-to-3D generation framework that incorporates unified diffusion priors. |
Zexiang Liu; Yangguang Li; Youtian Lin; Xin Yu; Sida Peng; Yan-Pei Cao; Xiaojuan Qi; Xiaoshui Huang; Ding Liang; Wanli Ouyang; |
182 | Training-free Video Temporal Grounding Using Large-scale Pre-trained Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Training-Free Video Temporal Grounding approach that leverages the ability of pre-trained large models. |
Minghang Zheng; Xinhao Cai; Qingchao Chen; Yuxin Peng; Yang Liu; |
183 | Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process. |
Ruofan Liang; Zan Gojcic; Merlin Nimier-David; David Acuna; Nandita Vijaykumar; Sanja Fidler; Zian Wang; |
184 | LingoQA: Video Question Answering for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving. |
Ana-Maria Marcu; Long Chen; Jan Hünermann; Alice Karnsund; Benoit Hanotte; Prajwal Chidananda; Saurabh Nair; Vijay Badrinarayanan; Alex Kendall; Jamie Shotton; Elahe Arani; Oleg Sinavski;
185 | Deblur E-NeRF: NeRF from Motion-Blurred Events Under High-speed or Low-light Conditions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose Deblur e-NeRF, a novel method to directly and effectively reconstruct blur-minimal NeRFs from motion-blurred events generated under high-speed or low-light conditions. |
Weng Fei Low; Gim Hee Lee; |
186 | Augmented Neural Fine-tuning for Efficient Backdoor Purification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose Neural mask Fine-Tuning (NFT), which aims to optimally re-organize the neuron activities in a way that removes the effect of the backdoor. |
Nazmul Karim; Abdullah Al Arafat; Umar Khalid; Zhishan Guo; Nazanin Rahnavard; |
187 | Parrot Captions Teach CLIP to Spot Text Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Our analysis shows that around 50% of images are embedded with visual text content and around 30% of caption words are concurrently embedded in the visual content. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that the visual text is a dominant factor in measuring the LAION-style image-text similarity for these models. |
Yiqi Lin; Conghui He; Alex Jinpeng Wang; Bin Wang; Weijia Li; Mike Zheng Shou; |
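The statistics quoted above boil down to word overlap between a caption and the OCR-detected text in its image. A toy version of such a measure is below; the paper's exact tokenization and matching rules may differ.

```python
def caption_parroting_rate(caption: str, ocr_text: str) -> float:
    """Fraction of caption words that also appear in the image's rendered
    text (as detected by an OCR model): one simple way to quantify
    "parrot captions"."""
    caption_words = caption.lower().split()
    ocr_words = set(ocr_text.lower().split())
    if not caption_words:
        return 0.0
    hits = sum(w in ocr_words for w in caption_words)
    return hits / len(caption_words)

print(caption_parroting_rate("Big Summer Sale poster", "BIG SUMMER SALE"))  # 0.75
```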
188 | Exploiting Dual-Correlation for Multi-frame Time-of-Flight Denoising Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the first learning-based framework for multi-frame ToF denoising. |
Guanting Dong; Yueyi Zhang; Xiaoyan Sun; Zhiwei Xiong; |
189 | Sapiens: Foundation for Human Vision Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present Sapiens, a family of models for four fundamental human-centric vision tasks – 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. |
Rawal Khirodkar; Timur Bagautdinov; Julieta Martinez; Zhaoen Su; Austin T James; Peter Selednik; Stuart Anderson; Shunsuke Saito;
190 | COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To mitigate the ambiguity, existing methods leverage learned human motion priors, which however often result in oversmoothed motions with misaligned 2D projections. To tackle this problem, we propose COIN, a control-inpainting motion diffusion prior that enables fine-grained control to disentangle human and camera motions. |
Jiefeng Li; Ye Yuan; Davis Rempe; Haotian Zhang; Pavlo Molchanov; Cewu Lu; Jan Kautz; Umar Iqbal; |
191 | Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a spatio-temporal equivariant learning framework by considering both spatial and temporal augmentations jointly. |
Deepti Hegde; Suhas Lohit; Kuan-Chuan Peng; Michael J. Jones; Vishal M. Patel; |
192 | Be-Your-Outpainter: Mastering Video Outpainting Through Input-Specific Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce MOTIA (Mastering Video Outpainting Through Input-Specific Adaptation), a diffusion-based pipeline that leverages both the intrinsic data-specific patterns of the source video and the image/video generative prior for effective outpainting. |
Fu-Yun Wang; Xiaoshi Wu; Zhaoyang Huang; Xiaoyu Shi; Dazhong Shen; Guanglu Song; Yu Liu; Hongsheng Li; |
193 | ZoLA: Zero-Shot Creative Long Animation Generation with Short Video Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present ZoLA, a zero-shot method for creative long animation generation with short video diffusion models, and even with short video consistency models (a new family of generative models known for fast, high-quality generation). |
Fu-Yun Wang; Zhaoyang Huang; Qiang Ma; Guanglu Song; Xudong Lu; Weikang Bian; Yijin Li; Yu Liu; Hongsheng Li;
194 | PreSight: Enhancing Autonomous Vehicle Perception with City-Scale NeRF Priors Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In navigating new cities, humans gradually develop a preliminary mental map to supplement real-time perception during subsequent visits. Inspired by this human approach, we introduce a novel framework, PreSight, that leverages past traversals to construct static prior memories, enhancing online perception in later navigations. |
Tianyuan Yuan; Yucheng Mao; Jiawei Yang; Yicheng Liu; Yue Wang; Hang Zhao;
195 | AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack Via Adaptive Shield Prompting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, with the integration of additional modalities, MLLMs are exposed to new vulnerabilities, rendering them prone to structure-based jailbreak attacks, where semantic content (e.g. “harmful text”) is injected into images to mislead MLLMs. In this work, we aim to defend against such threats. |
Yu Wang; Xiaogeng Liu; Yu Li; Muhao Chen; Chaowei Xiao; |
196 | Nonverbal Interaction Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work addresses a new challenge of understanding human nonverbal interaction in social contexts. |
Jianan Wei; Tianfei Zhou; Yi Yang; Wenguan Wang; |
197 | Stitched ViTs Are Flexible Vision Backbones Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we are inspired by stitchable neural networks (SN-Net), which is a new framework that cheaply produces a single model that covers rich subnetworks by stitching pretrained model families, supporting diverse performance-efficiency trade-offs at runtime. |
Zizheng Pan; Jing Liu; Haoyu He; Jianfei Cai; Bohan Zhuang; |
198 | Better Regression Makes Better Test-time Adaptive 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Thus, we explore a new task named test-time domain adaptive 3D object detection and propose Reg-TTA3D, a pseudo-label-based test-time adaptive 3D object detection method. |
Jiakang Yuan; Bo Zhang; Kaixiong Gong; Xiangyu Yue; Botian Shi; Yu Qiao; Tao Chen; |
199 | Diffusion-Guided Weakly Supervised Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To effectively guide ViT in excavating the relation between the patches, we devise the Patch Affinity Consistency (PAC) between the outputs of the original image and the denoised image. |
Sung-Hoon Yoon; Hoyong Kwon; Jaeseok Jeong; Daehee Park; Kuk-Jin Yoon; |
200 | Agent Attention: On The Integration of Softmax and Linear Attention Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a novel attention paradigm, Agent Attention, to strike a favorable balance between computational efficiency and representation power. |
Dongchen Han; Tianzhu Ye; Yizeng Han; Zhuofan Xia; Siyuan Pan; Pengfei Wan; Shiji Song; Gao Huang; |
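The integration works by routing attention through a small set of agent tokens: the agents first aggregate from all keys and values with softmax attention, then every query reads from the agents, cutting the quadratic cost to roughly linear in sequence length. The sketch below uses pooled queries as agents; that choice, and all shapes, are our assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def agent_attention(q, k, v, num_agents=49):
    """Two-stage attention through a small set of agent tokens.
    q, k, v: (batch, seq_len, dim). Agents are pooled from q here (one
    simple choice). Complexity drops from O(N^2) to O(N * num_agents)."""
    b, n, d = q.shape
    # Pool the queries down to `num_agents` agent tokens.
    agents = F.adaptive_avg_pool1d(q.transpose(1, 2), num_agents).transpose(1, 2)
    scale = d ** -0.5
    # Stage 1: agents gather information from all keys/values (softmax attention).
    agent_kv = torch.softmax(agents @ k.transpose(1, 2) * scale, dim=-1) @ v
    # Stage 2: every query reads from the agents (softmax attention again).
    return torch.softmax(q @ agents.transpose(1, 2) * scale, dim=-1) @ agent_kv

x = torch.randn(2, 196, 64)
print(agent_attention(x, x, x).shape)  # torch.Size([2, 196, 64])
```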
201 | Fast Training of Diffusion Transformer with Extreme Masking for 3D Point Clouds Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Motivated by the inherent redundancy of 3D compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer tailored for efficient 3D point cloud generation, which greatly reduces training costs. |
Shentong Mo; Enze Xie; Yue Wu; Junsong Chen; Matthias Niessner; Zhenguo Li; |
202 | Robo-ABC: Affordance Generalization Beyond Categories Via Semantic Correspondence for Robot Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the natural way humans think, we propose Robo-ABC: when confronted with unfamiliar objects that require generalization, the robot can acquire affordance by retrieving objects that share visual and semantic similarities from memory, then mapping the contact points of the retrieved objects to the new object. |
Yuanchen Ju; Kaizhe Hu; Guowei Zhang; Gu Zhang; Mingrun Jiang; Huazhe Xu; |
203 | FRDiff: Feature Reuse for Universal Training-free Acceleration of Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In our work, we introduce an advanced acceleration technique that leverages the temporal redundancy inherent in diffusion models. |
Junhyuk So; Jungwon Lee; Eunhyeok Park; |
204 | LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To bridge this gap, we construct a large-scale RS image-text dataset, LHRS-Align, and an informative RS-specific instruction dataset, LHRS-Instruct, leveraging the extensive volunteered geographic information (VGI) and globally available RS images. Building on this foundation, we introduce LHRS-Bot, an MLLM tailored for RS image understanding through a novel multi-level vision-language alignment strategy and a curriculum learning method. Additionally, we introduce LHRS-Bench, a benchmark for thoroughly evaluating MLLMs’ abilities in RS image understanding. |
Dilxat Muhtar; Zhenshi Li; Feng Gu; Xueliang Zhang; Pengfeng Xiao; |
205 | Real-time 3D-aware Portrait Editing from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work presents a practical method that can efficiently edit a face image following given prompts, like reference images or text descriptions, in a 3D-aware manner. |
Qingyan Bai; Zifan Shi; Yinghao Xu; Hao Ouyang; Qiuyu Wang; Ceyuan Yang; Xuan Wang; Gordon Wetzstein; Yujun Shen; Qifeng Chen; |
206 | SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, they only consider the local positional information within each image token and cannot effectively model the global spatial relations of the underlying scene. To address this challenge, we propose an efficient vision transformer architecture, SpatialFormer, with explicit spatial understanding for generalizable image representation learning. |
Han Xiao; Wenzhao Zheng; Sicheng Zuo; Peng Gao; Jie Zhou; Jiwen Lu; |
207 | DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, a critical limitation in relevant research lies in its predominant focus on gaming environments or simulated settings, thereby lacking the representation of real-world driving scenarios. Therefore, we introduce DriveDreamer, a pioneering world model entirely derived from real-world driving scenarios. |
Xiaofeng Wang; Zheng Zhu; Guan Huang; Chen Xinze; Jiagang Zhu; Jiwen Lu; |
208 | Attention Prompting on Image for Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, previous visual prompting techniques solely process visual inputs without considering text queries, limiting the models’ ability to follow text instructions to complete tasks. To fill this gap, in this work, we propose a new prompting technique named Attention Prompting on Image, which simply overlays a text-query-guided attention heatmap on the original input image and effectively enhances LVLMs on various tasks. |
Runpeng Yu; Weihao Yu; Xinchao Wang; |
209 | DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our analysis reveals that while video score distillation can effectively introduce new content indicated by target text, it can also cause significant structure and motion deviation. To counteract this, we propose to match the space-time self-similarities of the original video and the edited video during the score distillation. |
Hyeonho Jeong; Jinho Chang; Geon Yeong Park; Jong Chul Ye; |
210 | Textual Query-Driven Mask Transformer for Domain Generalized Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a method to tackle Domain Generalized Semantic Segmentation (DGSS) by utilizing domain-invariant semantic knowledge from text embeddings of vision-language models. |
Byeonghyun Pak; Byeongju Woo; Sunghwan Kim; Dae-hwan Kim; Hoseong Kim; |
211 | StableDrag: Stable Dragging for Point-based Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite this great success, the dragging scheme exhibits two major drawbacks, namely inaccurate point tracking and incomplete motion supervision, which may result in unsatisfactory dragging outcomes. To tackle these issues, we build a stable and precise drag-based editing framework, coined StableDrag, by designing a discriminative point tracking method and a confidence-based latent enhancement strategy for motion supervision. |
Yutao Cui; Xiaotong Zhao; Guozhen Zhang; Shengming Cao; Kai Ma; Limin Wang; |
212 | Fast View Synthesis of Casual Videos with Soup-of-Planes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, we build a global static scene model using an extended plane-based scene representation to synthesize temporally coherent novel video. |
Yao-Chih Lee; Zhoutong Zhang; Kevin Blackburn-Matzen; Simon Niklaus; Jianming Zhang; Jia-Bin Huang; Feng Liu; |
213 | Embodied Understanding of Driving Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Hereby, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents’ understanding of driving scenes with large spatial and temporal spans. |
Yunsong Zhou; Linyan Huang; Qingwen Bu; Jia Zeng; Tianyu Li; Hang Qiu; Hongzi Zhu; Minyi Guo; Yu Qiao; Hongyang Li; |
214 | ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While previous years have seen great progress in the 3D reconstruction of humans from monocular videos, few of the state-of-the-art methods are able to handle loose garments that exhibit large non-rigid surface deformations during articulation. This limits the application of such methods to humans that are dressed in standard pants or T-shirts. Our method, ReLoo, overcomes this limitation and reconstructs high-quality 3D models of humans dressed in loose garments from monocular in-the-wild videos. |
Chen Guo; Tianjian Jiang; Manuel Kaufmann; Chengwei Zheng; Julien Valentin; Jie Song; Otmar Hilliges; |
215 | StructLDM: Structured Latent Diffusion for 3D Human Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we explore a more expressive and higher-dimensional latent space for 3D human modeling and propose StructLDM, a diffusion-based unconditional 3D human generative model learned from 2D images. |
Tao Hu; Fangzhou Hong; Ziwei Liu; |
216 | Three Things We Need to Know About Transferring Stable Diffusion to Visual Dense Prediction Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate how to conduct transfer learning to adapt Stable Diffusion to downstream visual dense prediction tasks such as semantic segmentation and depth estimation. |
Manyuan Zhang; Guanglu Song; Xiaoyu Shi; Yu Liu; Hongsheng Li; |
217 | EgoLifter: Open-world 3D Segmentation for Egocentric Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present EgoLifter, a novel system that can automatically segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects. We created a new benchmark on the Aria Digital Twin dataset that quantitatively demonstrates its state-of-the-art performance in open-world 3D segmentation from natural egocentric input. |
Qiao Gu; Zhaoyang Lv; Duncan Frost; Simon Green; Julian Straub; Chris Sweeney; |
218 | How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: For the OOD evaluation, we present two novel visual question-answering (VQA) datasets, each with one variant, designed to test model performance under challenging conditions. |
Haoqin Tu; Chenhang Cui; Zijun Wang; Yiyang Zhou; Bingchen Zhao; Junlin Han; Wangchunshu Zhou; Huaxiu Yao; Cihang Xie; |
219 | Fisher Calibration for Backdoor-Robust Heterogeneous Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we argue that parameters matter to different degrees for distinct distributions, and we instead distinguish between parameters that are meaningful and those that are meaningless for the ideal target distribution. |
Wenke Huang; Mang Ye; Zekun Shi; Bo Du; Dacheng Tao;
220 | Efficient Inference of Vision Instruction-Following Models with Elastic Cache Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Conventional cache management strategies for LLMs focus on cache eviction, which often fails to address the specific needs of multimodal instruction-following models. Recognizing this gap, in this paper, we introduce Elastic Cache, a novel approach that benefits from applying distinct acceleration methods for instruction encoding and output generation stages. |
Zuyan Liu; Benlin Liu; Jiahui Wang; Yuhao Dong; Guangyi Chen; Yongming Rao; Ranjay Krishna; Jiwen Lu; |
221 | Towards Image Ambient Lighting Normalization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a new challenging task termed Ambient Lighting Normalization (ALN), which enables the study of interactions between shadows, unifying image restoration and shadow removal in a broader context. |
Florin-Alexandru Vasluianu; Tim Seizinger; Zongwei Wu; Rakesh Ranjan; Radu Timofte;
222 | LetsMap: Unsupervised Representation Learning for Label-Efficient Semantic BEV Mapping Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, most BEV mapping approaches employ a fully supervised learning paradigm that relies on large amounts of human-annotated BEV ground truth data. In this work, we address this limitation by proposing the first unsupervised representation learning approach to generate semantic BEV maps from a monocular frontal view (FV) image in a label-efficient manner. |
Nikhil Gosala; Kürsat Petek; B Ravi Kiran; Senthil Yogamani; Paulo L. J. Drews-Jr; Wolfram Burgard; Abhinav Valada;
223 | An Image Is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat, and Video-LLaVA. |
Liang Chen; Haozhe Zhao; Tianyu Liu; Shuai Bai; Junyang Lin; Chang Zhou; Baobao Chang; |
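The title's claim suggests attention-guided pruning of visual tokens after an early layer. A hedged sketch of what such plug-and-play pruning can look like is below: rank image tokens by the attention they receive and keep only the top fraction. The function, interfaces, and keep ratio are our own assumptions, not the paper's released code.

```python
import torch

def prune_visual_tokens(hidden, attn, image_slice, keep_ratio=0.5):
    """Drop the image tokens that receive the least attention.
    hidden: (batch, seq, dim) activations after an early layer.
    attn:   (batch, heads, seq, seq) attention weights of that layer.
    image_slice: slice of the sequence occupied by image tokens."""
    # Average attention each image token receives, over heads and queries.
    received = attn.mean(dim=1).mean(dim=1)[:, image_slice]        # (batch, n_img)
    n_keep = max(1, int(received.shape[1] * keep_ratio))
    keep = received.topk(n_keep, dim=1).indices.sort(dim=1).values  # keep order
    img = hidden[:, image_slice, :]
    pruned_img = torch.gather(img, 1, keep.unsqueeze(-1).expand(-1, -1, img.shape[-1]))
    # Re-assemble: tokens before, kept image tokens, tokens after.
    return torch.cat([hidden[:, :image_slice.start],
                      pruned_img,
                      hidden[:, image_slice.stop:]], dim=1)

h = torch.randn(1, 1 + 576 + 32, 4096)        # [BOS] + 576 image + 32 text tokens
a = torch.softmax(torch.randn(1, 32, 609, 609), dim=-1)
print(prune_visual_tokens(h, a, slice(1, 577)).shape)  # (1, 321, 4096)
```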
224 | LongVLM: Efficient Long Video Understanding Via Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Despite successfully providing an overall comprehension of video content, existing VideoLLMs still face challenges in achieving detailed understanding due to overlooking local information in long-term videos. To tackle this challenge, we introduce LongVLM, a simple yet powerful VideoLLM for long video understanding, building upon the observation that long videos often consist of sequential key events, complex actions, and camera movements. |
Yuetian Weng; Mingfei Han; Haoyu He; Xiaojun Chang; Bohan Zhuang; |
225 | BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation Using RGB Frames and Events Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: BlinkVision enables extensive benchmarks on three types of correspondence tasks (i.e., optical flow, point tracking and scene flow estimation) for both image-based methods and event-based methods, leading to new observations, practices, and insights for future research. |
Yijin Li; Yichen Shen; Zhaoyang Huang; Shuo Chen; Weikang Bian; Xiaoyu Shi; Fu-Yun Wang; Keqiang Sun; Hujun Bao; Zhaopeng Cui; Guofeng Zhang; Hongsheng Li; |
226 | PhysAvatar: Learning The Physics of Dressed 3D Avatars from Visual Observations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: PhysAvatar is a novel framework that captures the physics of dressed 3D avatars from visual observations, enabling a wide spectrum of applications, such as (a) animation, (b) relighting, and (c) redressing, with high-fidelity rendering results. |
Yang Zheng; Qingqing Zhao; Guandao Yang; Wang Yifan; Donglai Xiang; Florian Dubost; Dmitry Lagun; Thabo Beeler; Federico Tombari; Leonidas Guibas; Gordon Wetzstein; |
227 | TetraDiffusion: Tetrahedral Diffusion Models for 3D Shape Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Here, we propose TetraDiffusion, a diffusion model that operates on a tetrahedral partitioning of 3D space to enable efficient, high-resolution 3D shape generation. |
Nikolai Kalischek; Torben Peters; Jan Dirk Wegner; Konrad Schindler; |
228 | OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce OpenIns3D, a new 3D-input-only framework for 3D open-vocabulary scene understanding. |
Zhening Huang; Xiaoyang Wu; Xi Chen; Hengshuang Zhao; Lei Zhu; Joan Lasenby; |
229 | Neural Metamorphosis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a new learning paradigm termed Neural Metamorphosis (NeuMeta), which aims to build self-morphable neural networks. |
Xingyi Yang; Xinchao Wang; |
230 | Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our investigation further exposes inherent flaws in its original design, particularly in its ability to identify multiple distinct keys, where distribution shift offers no assistance. Based on these findings and analysis, we present RingID for enhanced multi-key identification. |
Hai Ci; Pei Yang; Yiren Song; Mike Zheng Shou; |
231 | Progressive Proxy Anchor Propagation for Unsupervised Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, relying solely on similarity-based supervision from image-level pretrained models often leads to unreliable guidance due to insufficient patch-level semantic representations. To address this, we propose a Progressive Proxy Anchor Propagation (PPAP) strategy. |
Hyun Seok Seong; WonJun Moon; SuBeen Lee; Jae-Pil Heo; |
232 | AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This study introduces AdaCLIP for the ZSAD task, leveraging a pre-trained vision-language model (VLM), CLIP. |
Yunkang Cao; Jiangning Zhang; Luca Frittoli; Yuqi Cheng; Weiming Shen; Giacomo Boracchi; |
233 | Online Vectorized HD Map Construction Using Geometry Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In our work, we propose GeMap (Geometry Map), which end-to-end learns Euclidean shapes and relations of map instances beyond fundamental perception. |
Zhixin Zhang; Yiyuan Zhang; Xiaohan Ding; Fusheng Jin; Xiangyu Yue; |
234 | Align Before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a model-agnostic and lightweight plugin to mitigate the feature-level misalignment issue, called dynamic feature alignment (NEAT). |
Dingkang Yang; Ke Li; Dongling Xiao; Zedian Shao; Peng Sun; Liang Song; |
235 | Dataset Growth Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This leads to repeated data curation with sub-optimal efficiency. To tackle this challenge, we propose InfoGrowth, an efficient online algorithm for data cleaning and selection, resulting in a growing dataset that stays up to date while maintaining awareness of cleanliness and diversity. |
Ziheng Qin; Zhaopan Xu; Yukun Zhou; Kai Wang; Zangwei Zheng; Zebang Cheng; Hao Tang; Lei Shang; Baigui Sun; Radu Timofte; Xiaojiang Peng; Hongxun Yao; Yang You; |
236 | Generative End-to-End Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose GenAD, a generative framework that casts autonomous driving into a generative modeling problem. |
Wenzhao Zheng; Ruiqi Song; Xianda Guo; Chenming Zhang; Long Chen; |
237 | DMiT: Deformable Mipmapped Tri-Plane Representation for Dynamic Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we introduce a novel framework DMiT (Deformable Mipmapped Tri-Plane) that adopts the mipmaps to render dynamic scenes at various resolutions from novel views. |
Jing-Wen Yang; Jia-Mu Sun; Yong-Liang Yang; Jie Yang; Ying Shan; Yan-Pei Cao; Lin Gao; |
238 | Contribution-based Low-Rank Adaptation with Pre-training Model for Real Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here, we propose a novel, efficient parameter-tuning approach dubbed contribution-based low-rank adaptation (CoLoRA) for multiple image restorations, along with an effective pre-training method with random order degradations (PROD). |
Dongwon Park; Hayeon Kim; Se Young Chun; |
239 | MilliFlow: Scene Flow Estimation on MmWave Radar Point Cloud for Human Motion Sensing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose milliFlow, a novel deep learning approach to estimate scene flow as complementary motion information for mmWave point cloud, serving as an intermediate level of features and directly benefiting downstream human motion sensing tasks. |
Fangqiang Ding; Zhen Luo; Peijun Zhao; Chris Xiaoxuan Lu; |
240 | Improving Video Segmentation Via Dynamic Anchor Queries Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Dynamic Anchor Queries (DAQ) to shorten the transition gap by dynamically generating anchor queries based on the features of potential newly emerging and disappearing candidates. |
Yikang Zhou; Tao Zhang; Xiangtai Li; Shunping Ji; Shuicheng Yan; |
241 | EditShield: Protecting Unauthorized Image Editing By Instruction-guided Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we first propose a protection method against unauthorized modifications from such models. Specifically, EditShield works by adding imperceptible perturbations that can shift the latent representation used in the diffusion process, tricking models into generating unrealistic images with mismatched subjects. |
Ruoxi Chen; Haibo Jin; Yixin Liu; Jinyin Chen; Haohan Wang; Lichao Sun; |
242 | Improving Diffusion Models for Authentic Virtual Try-on in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. |
Yisol Choi; Sangkyung Kwak; Kyungmin Lee; Hyungwon Choi; Jinwoo Shin; |
243 | EchoScene: Indoor Scene Generation Via Information Echo Over Scene Graph Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present EchoScene, an interactive and controllable generative model that generates 3D indoor scenes on scene graphs. |
Guangyao Zhai; Evin Pınar Örnek; Dave Zhenyu Chen; Ruotong Liao; Yan Di; Nassir Navab; Federico Tombari; Benjamin Busam; |
244 | ST-LLM: Large Language Models Are Effective Temporal Learners Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLMs? |
Ruyang Liu; Chen Li; Haoran Tang; Yixiao Ge; Ying Shan; Ge Li; |
245 | Encapsulating Knowledge in One Prompt Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a new knowledge transfer paradigm called Knowledge in One Prompt (KiOP). |
Qi Li; Runpeng Yu; Xinchao Wang; |
246 | HeadGaS: Real-Time Animatable Head Avatars Via 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose HeadGaS, a model that uses 3D Gaussian Splats (3DGS) for 3D head reconstruction and animation. |
Helisa Dhamo; Yinyu Nie; Arthur Moreau; Jifei Song; Richard Shaw; Yiren Zhou; Eduardo Pérez-Pellitero; |
247 | Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models Via Lightweight Erasers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Reliable Concept Erasing via Lightweight Erasers (Receler). |
Chi-Pin Huang; Kai-Po Chang; Chung-Ting Tsai; Yung-Hsuan Lai; Fu-En Yang; Yu-Chiang Frank Wang; |
248 | MotionDirector: Motion Customization of Text-to-Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, the motion concept learned by these methods is often coupled with the limited appearances in the training videos, making it difficult to generalize the customized motion to other appearances. To overcome this challenge, we propose MotionDirector, with a dual-path LoRAs architecture to decouple the learning of appearance and motion. |
Rui Zhao; Yuchao Gu; Jay Zhangjie Wu; David Junhao Zhang; Jia-Wei Liu; Weijia Wu; Jussi Keppo; Mike Zheng Shou; |
249 | AnatoMask: Enhancing Medical Image Segmentation with Reconstruction-guided Self-masking Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose AnatoMask, a novel MIM method that leverages reconstruction loss to dynamically identify and mask out anatomically significant regions to improve pretraining efficacy. |
Yuheng Li; Tianyu Luan; Yizhou Wu; Shaoyan Pan; Yenho Chen; Xiaofeng Yang; |
250 | Flying with Photons: Rendering Novel Views of Propagating Light Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present an imaging and neural rendering technique that seeks to synthesize videos of light propagating through a scene from novel, moving camera viewpoints. |
Anagh Malik; Noah Juravsky; Ryan Po; Gordon Wetzstein; Kiriakos N. Kutulakos; David B. Lindell; |
251 | Deep Diffusion Image Prior for Efficient OOD Adaptation in 3D Inverse Problems Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Under this framework, we propose an efficient adaptation method dubbed D3IP, specified for 3D measurements, which accelerates DDIP by orders of magnitude while achieving superior performance. |
Hyungjin Chung; Jong Chul Ye; |
252 | FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we delve into the generation of high-resolution images from pre-trained diffusion models, addressing persistent challenges, such as repetitive patterns and structural distortions, that emerge when models are applied beyond their trained resolutions. |
Linjiang Huang; Rongyao Fang; Aiping Zhang; Guanglu Song; Si Liu; Yu Liu; Hongsheng Li; |
253 | ObjectDrop: Bootstrapping Counterfactuals for Photorealistic Object Removal and Insertion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: By analyzing the limitations of self-supervised approaches, we propose a practical solution centered on a counterfactual dataset. |
Daniel Winter; Matan Cohen; Shlomi Fruchter; Yael Pritch; Alex Rav-Acha; Yedid Hoshen; |
254 | Towards Multimodal Sentiment Analysis Debiasing Via Bias Purification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These harmful biases potentially mislead models to focus on statistical shortcuts and spurious correlations, causing severe performance bottlenecks. To alleviate these issues, we present a Multimodal Counterfactual Inference Sentiment (MCIS) analysis framework based on causality rather than conventional likelihood. |
Dingkang Yang; Mingcheng Li; Dongling Xiao; Yang Liu; Kun Yang; Zhaoyu Chen; Yuzheng Wang; Peng Zhai; Ke Li; Lihua Zhang; |
255 | ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a new task called 3D reasoning grounding and introduce a new benchmark, ScanReason, which provides over 10K question-answer-location pairs spanning five reasoning types that require the synergy of reasoning and grounding. |
Chenming Zhu; Tai Wang; Wenwei Zhang; Kai Chen; Xihui Liu; |
256 | Radiative Gaussian Splatting for Efficient X-ray Novel View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a 3D Gaussian splatting-based method, namely X-Gaussian, for X-ray novel view synthesis. |
Yuanhao Cai; Yixun Liang; Jiahao Wang; Angtian Wang; Yulun Zhang; Xiaokang Yang; Zongwei Zhou; Alan Yuille; |
257 | Multistain Pretraining for Slide Representation Learning in Pathology Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we introduce a multimodal pretraining strategy for slide representation learning. |
Guillaume Jaume; Anurag J Vaidya; Andrew Zhang; Andrew Song; Richard J Chen; Sharifa Sahai; Dandan Mo; Emilio Madrigal; Long P Le; Faisal Mahmood; |
258 | MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a vector HD-mapping algorithm that formulates the mapping as a tracking task and uses a history of memory latents to ensure consistent reconstructions over time. |
Jiacheng Chen; Yuefan Wu; Jiaqi Tan; Hang Ma; Yasutaka Furukawa; |
259 | AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This difficulty arises from reliance on accurate 2D joint estimations, which are hard to obtain due to occlusions and body contact when people are in close interaction. To address this, we propose a novel method leveraging the personalized implicit neural avatar of each individual as a prior, which significantly improves the robustness and precision of this challenging pose estimation task. |
Feichi Lu; Zijian Dong; Jie Song; Otmar Hilliges; |
260 | NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Our goal is made possible by masking random patches from NeRF’s radiance and density grid and employing a standard 3D Swin Transformer to reconstruct the masked patches. |
Muhammad Zubair Irshad; Sergey Zakharov; Vitor Guizilini; Adrien Gaidon; Zsolt Kira; Rares Ambrus; |
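Entry 260's masking step is easy to picture in code: sample a NeRF into a dense radiance-and-density grid, hide random 3D patches, and ask a 3D encoder to reconstruct them. The sketch below covers only the masking; the grid layout, patch size, and mask ratio are assumed values, not the paper's settings.

```python
import torch

def mask_rgbsigma_grid(grid, patch=8, mask_ratio=0.75):
    """grid: (B, 4, X, Y, Z) holding RGB + density sampled from a trained NeRF.
    Assumes X, Y, Z are divisible by `patch`."""
    b, c, x, y, z = grid.shape
    px, py, pz = x // patch, y // patch, z // patch
    n_patches = px * py * pz
    n_mask = int(n_patches * mask_ratio)

    mask = torch.zeros(b, n_patches, dtype=torch.bool, device=grid.device)
    for i in range(b):
        mask[i, torch.randperm(n_patches, device=grid.device)[:n_mask]] = True

    # Upsample the patch-level mask to voxel resolution, then zero masked voxels.
    mask3d = (mask.view(b, 1, px, py, pz)
                  .repeat_interleave(patch, dim=2)
                  .repeat_interleave(patch, dim=3)
                  .repeat_interleave(patch, dim=4))
    return grid.masked_fill(mask3d, 0.0), mask  # mask kept as reconstruction target
```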
261 | Cascade-Zero123: One Image to Highly Consistent 3D with Self-Prompted Nearby Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, due to the high sparsity of the single input image, Zero-1-to-3 tends to produce geometry and appearance inconsistency across views, especially for complex objects. To tackle this issue, we propose to supply more conditioning information to the generation model in a self-prompted way. |
Yabo Chen; Jiemin Fang; Yuyang Huang; Taoran Yi; Xiaopeng Zhang; Lingxi Xie; Xinggang Wang; Wenrui Dai; Hongkai Xiong; Qi Tian; |
262 | UniProcessor: A Text-induced Unified Low-level Image Processor Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a text-induced Unified image Processor for low-level vision tasks, termed UniProcessor, which can effectively process various degradation types and levels, and support multimodal control. |
Huiyu Duan; Xiongkuo Min; Sijing Wu; Wei Shen; Guangtao Zhai; |
263 | Text to Layer-wise 3D Clothed Human Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Previous works usually encode the human body and clothes as a holistic model and generate the whole model in a single-stage optimization, which makes clothing editing difficult and sacrifices fine-grained control over the generation process. To solve this, we propose a layer-wise clothed human representation combined with a progressive optimization strategy, which produces clothing-disentangled 3D human models while providing control capacity for the generation process. |
Junting Dong; Qi Fang; Zehuan Huang; Xudong XU; Jingbo Wang; Sida Peng; Bo Dai; |
264 | SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, these methods struggle to strike a balance among reference view alignment, spatio-temporal consistency, and motion fidelity under single-view conditions due to the implicit nature of NeRF or the intricate dense Gaussian motion prediction. To address these issues, this paper proposes an efficient, sparse-controlled video-to-4D framework named SC4D that decouples motion and appearance to achieve superior video-to-4D generation. |
Zijie Wu; Chaohui Yu; Yanqin Jiang; Chenjie Cao; Fan Wang; Xiang Bai; |
265 | Towards Reliable Evaluation and Fast Training of Robust Semantic Segmentation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose several problem-specific novel attacks minimizing different metrics in accuracy and mIoU. |
Francesco Croce; Naman D. Singh; Matthias Hein; |
266 | Towards Model-Agnostic Dataset Condensation By Heterogeneous Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Notably, these condensed images tend to be specific to particular models, constraining their versatility and practicality. In response to this limitation, we introduce a novel method, Heterogeneous Model Dataset Condensation (HMDC), designed to produce universally applicable condensed images through cross-model interactions. |
Jun-Yeong Moon; Jung Uk Kim; Gyeong-Moon Park; |
267 | An Explainable Vision Question Answer Model Via Diffusion Chain-of-Thought Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This means that generating explanations solely for the answer can lead to a semantic discrepancy between the content of the explanation and the question-answering content. To address this, we propose a step-by-step reasoning approach to reduce such semantic discrepancies. |
Chunhao Lu; Qiang Lu; Jake Luo; |
268 | Unveiling Advanced Frequency Disentanglement Paradigm for Low-Light Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Leveraging the image Laplace decomposition scheme, we propose a novel low-frequency consistency method, facilitating improved frequency disentanglement optimization. |
Kun Zhou; Xinyu Lin; Wenbo Li; Xiaogang Xu; Yuanhao Cai; Zhonghang Liu; Xiaoguang Han; Jiangbo Lu; |
269 | HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In an era where the volume of data drives the effectiveness of self-supervised learning, the specificity and clarity of data semantics play a crucial role in model training. Addressing this, we introduce HYPerbolic Entailment filtering (HYPE), a novel methodology designed to meticulously extract modality-wise meaningful and well-aligned data from extensive, noisy image-text pair datasets. |
Wonjae Kim; Sanghyuk Chun; Taekyung Kim; Dongyoon Han; Sangdoo Yun; |
270 | Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While previous methods have achieved promising results by refactoring the image synthesizing process, the inverted latent noise code is tightly coupled with the source prompt, limiting the image editability by target text prompts. To address this issue, we propose a novel method called Source Prompt Disentangled Inversion (SPDInv), which aims at reducing the impact of source prompt, thereby enhancing the text-driven image editing performance by employing diffusion models. |
Ruibin Li; Ruihuang Li; Song Guo; Lei Zhang; |
271 | Emergent Visual-Semantic Hierarchies in Image-Text Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a framework for probing and optimizing hierarchical understanding, and contribute a benchmark dataset, constructed automatically via large language models, that facilitates the study of hierarchical knowledge in image-text representations. |
Morris Alper; Hadar Averbuch-Elor; |
272 | Sparse Refinement for Efficient High-Resolution Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce SparseRefine, a novel approach that enhances dense low-resolution predictions with sparse high-resolution refinements. |
Zhijian Liu; Zhuoyang Zhang; Samir Khaki; Shang Yang; Haotian Tang; Chenfeng Xu; Kurt Keutzer; Song Han; |
273 | Factorized Diffusion: Perceptual Illusions By Noise Decomposition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Given a factorization of an image into a sum of linear components, we present a zero-shot method to control each individual component through diffusion model sampling. |
Daniel Geng; Inbum Park; Andrew Owens; |
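Because the components in entry 273 are linear, one natural instantiation is a frequency factorization: at each sampling step, take the low-frequency part of the noise estimate from one prompt and the high-frequency part from another. The Gaussian-blur decomposition and the `denoiser` callable below are illustrative assumptions, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def lowpass(x, k=9, sigma=3.0):
    """Depthwise Gaussian blur: a simple linear low-frequency extractor."""
    ax = (torch.arange(k) - k // 2).float()
    g = torch.exp(-ax ** 2 / (2 * sigma ** 2))
    g = (g / g.sum()).to(x)
    c = x.size(1)
    kx = g.view(1, 1, 1, k).repeat(c, 1, 1, 1)  # horizontal pass weights
    ky = g.view(1, 1, k, 1).repeat(c, 1, 1, 1)  # vertical pass weights
    x = F.conv2d(x, kx, padding=(0, k // 2), groups=c)
    return F.conv2d(x, ky, padding=(k // 2, 0), groups=c)

def factorized_eps(denoiser, x_t, t, prompt_coarse, prompt_fine):
    """Compose one noise estimate from two prompts, one per linear component."""
    eps_coarse = denoiser(x_t, t, prompt_coarse)  # drives low-frequency structure
    eps_fine = denoiser(x_t, t, prompt_fine)      # drives high-frequency detail
    return lowpass(eps_coarse) + (eps_fine - lowpass(eps_fine))
```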
274 | Rotary Position Embedding for Vision Transformer Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This study provides a comprehensive analysis of RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D vision data. |
Byeongho Heo; Song Park; Dongyoon Han; Sangdoo Yun; |
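For readers wanting the gist of entry 274, a common way to extend RoPE to 2D vision data is axial: rotate half of each head's channels by the patch's x index and the other half by its y index. This is a hedged, minimal sketch; the base frequency and interleaved channel layout are assumptions rather than the paper's tuned configuration.

```python
import torch

def rope_2d(q, k, h, w, base=100.0):
    """q, k: (B, heads, h*w, d) patch-token queries/keys, d divisible by 4."""
    d = q.size(-1)
    ys, xs = torch.meshgrid(torch.arange(h, device=q.device),
                            torch.arange(w, device=q.device), indexing="ij")
    pos = torch.stack([xs.flatten(), ys.flatten()], dim=-1).float()  # (n, 2)

    d_axis = d // 2  # channels devoted to each spatial axis
    freqs = base ** (-torch.arange(0, d_axis, 2, device=q.device).float() / d_axis)
    # Per-token rotation angles: x-driven first half, y-driven second half.
    ang = torch.cat([pos[:, :1] * freqs, pos[:, 1:] * freqs], dim=-1)  # (n, d/2)
    cos, sin = ang.cos(), ang.sin()

    def rotate(t):
        t1, t2 = t[..., 0::2], t[..., 1::2]  # interleaved channel pairs
        out = torch.empty_like(t)
        out[..., 0::2] = t1 * cos - t2 * sin
        out[..., 1::2] = t1 * sin + t2 * cos
        return out

    return rotate(q), rotate(k)
```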
275 | MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present MobileDiffusion, an ultra-efficient text-to-image diffusion model obtained through extensive optimizations in both architecture and sampling techniques. |
Yang Zhao; Zhisheng Xiao; Yanwu Xu; Haolin Jia; Tingbo Hou; |
276 | Forget More to Learn More: Domain-specific Feature Unlearning for Semi-supervised and Unsupervised Domain Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In pursuit of greater generalizability and robustness, we present an SSDA framework with a new episodic learning strategy: “learn, forget, then learn more”. |
Hritam Basak; Zhaozheng Yin; |
277 | VideoStudio: Generating Consistent-Content and Multi-Scene Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel framework, namely VideoStudio, for consistent-content and multi-scene video generation. |
Fuchen Long; Zhaofan Qiu; Ting Yao; Tao Mei; |
278 | PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present PartGLEE, a part-level foundation model for locating and identifying both objects and parts in images. |
Junyi Li; Junfeng Wu; Weizhi Zhao; Song Bai; Xiang Bai; |
279 | SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To adapt the VLM from global to local reasoning, we introduce a spatial fine-tuning strategy for label-efficient learning. |
Lukas Hoyer; David Joseph Tan; Muhammad Ferjad Naeem; Luc Van Gool; Federico Tombari; |
280 | Plain-Det: A Plain Multi-Dataset Object Detector Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose Plain-Det, which offers flexibility to accommodate new datasets, robustness in performance across diverse datasets, training efficiency, and compatibility with various detection architectures. |
Cheng Shi; Yuchen Zhu; Sibei Yang; |
281 | Part2Object: Hierarchical Unsupervised 3D Instance Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Existing methods face the challenge of either too loose or too tight clustering, leading to under-segmentation or over-segmentation. To address this issue, we propose Part2Object, hierarchical clustering with object guidance. |
Cheng Shi; Yulin Zhang; Bin Yang; Jiajin Tang; Yuexin Ma; Sibei Yang; |
282 | Learning The Unlearned: Mitigating Feature Suppression in Contrastive Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This issue often leads to indistinguishable representations for visually similar but semantically different inputs, adversely affecting downstream task performance, particularly for tasks requiring rigorous semantic comprehension. To address this challenge, we propose a novel model-agnostic Multistage Contrastive Learning (MCL) framework. |
Jihai Zhang; Xiang Lan; Xiaoye Qu; Yu Cheng; Mengling Feng; Bryan Hooi; |
283 | CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the deficiency, we explore the advanced generative paradigm with distribution perception and propose a novel framework based on the diffusion model, coined Continual Latent Diffusion (CLIFF), which formulates a continual distribution transfer among the object, image, and text latent space probabilistically. |
Wuyang Li; Xinyu Liu; Jiayi Ma; Yixuan Yuan; |
284 | Neural Volumetric World Models for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods for perception and planning in autonomous driving primarily rely on a 2D spatial representation, based on a bird’s eye perspective of the scene, which is insufficient for modeling motion characteristics and decision-making in real-world 3D settings with occlusion, partial observability, subtle motions, and varying terrains. Motivated by this key insight, we present a novel framework for learning end-to-end autonomous driving based on volumetric representations. |
Zanming Huang; Jimuyang Zhang; Eshed Ohn-Bar; |
285 | NuCraft: Crafting High Resolution 3D Semantic Occupancy for Unified 3D Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The root of these limitations lies in the sparsity, noise, and even errors present in the raw data. In this paper, we overcome these challenges by introducing nuCraft, a high-resolution and accurate semantic occupancy dataset derived from nuScenes. |
Benjin Zhu; Zhe Wang; Hongsheng Li; |
286 | GENIXER: Empowering Multimodal Large Language Models As A Powerful Data Generator Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce Genixer, a comprehensive data generation pipeline consisting of four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering MLLMs, and (iv) data generation and filtering. |
Henry Hengyuan Zhao; Pan Zhou; Mike Zheng Shou; |
287 | Region-Native Visual Tokenization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We explore an innovative region-based visual token representation and present the REgion-native AutoencoDER (Reader). |
Mengyu Wang; Yuyao Huang; Henghui Ding; Xinlong Wang; Tiejun Huang; Yao Zhao; Yunchao Wei; Shuicheng Yan; |
288 | Merlin: Empowering Multimodal LLMs with Foresight Minds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: By utilizing the trajectory, a highly structured representation, as a learning objective, we aim to equip the model to understand spatiotemporal dynamics. |
En Yu; Liang Zhao; Yana Wei; Jinrong Yang; Dongming Wu; Lingyu Kong; Haoran Wei; Tiancai Wang; Zheng Ge; Xiangyu Zhang; Wenbing Tao; |
289 | Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although some approaches leverage extensively studied person ReID techniques, they struggle to address the unique challenges posed by wildlife. Therefore, in this paper, we present a unified, multi-species general framework for wildlife ReID. |
Chenyue Li; Shuoyi Chen; Mang Ye; |
290 | Deciphering The Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, Compositional Out of Distribution (C-OoD) generalization, which is a crucial aspect of a model’s ability to understand unseen compositions of known concepts, is relatively unexplored for the CLIP models. Our goal is to address this problem and identify the factors that contribute to the C-OoD in CLIPs. |
Reza Abbasi; Mohammad Rohban; Mahdieh Soleymani Baghshah; |
291 | See and Think: Embodied Agent in Virtual Environment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment. We also collect the STEVE-21K dataset, which includes 600+ vision-environment pairs, 20K knowledge question-answering pairs, and 200+ skill-code pairs. |
Zhonghan Zhao; Xuan Wang; Wenhao Chai; Boyi Li; Shengyu Hao; Shidong Cao; Tian Ye; Gaoang Wang; |
292 | Enhancing Diffusion Models with Text-Encoder Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results, thereby improving the visual quality. |
Chaofeng Chen; Annan Wang; Haoning Wu; Liang Liao; Wenxiu Sun; Qiong Yan; Weisi Lin; |
293 | Deblurring 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Thus, we propose a novel real-time deblurring framework, Deblurring 3D Gaussian Splatting, using a small Multi-Layer Perceptron (MLP) that manipulates the covariance of each 3D Gaussian to model the scene blurriness. |
Byeonghyeon Lee; Howoong Lee; Xiangyu Sun; Usman Ali; Eunbyung Park; |
294 | SEED: A Simple and Effective 3D DETR in Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we propose a Simple and EffEctive 3D DETR method (SEED) for detecting 3D objects from point clouds, which involves a dual query selection (DQS) module and a deformable grid attention (DGA) module. |
Zhe Liu; Jinghua Hou; Xiaoqing Ye; Tong Wang; Jingdong Wang; Xiang Bai; |
295 | MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce MVSplat, an efficient model that, given sparse multi-view images as input, predicts clean feed-forward 3D Gaussians. |
Yuedong Chen; Haofei Xu; Chuanxia Zheng; Bohan Zhuang; Marc Pollefeys; Andreas Geiger; Tat-Jen Cham; Jianfei Cai; |
296 | M&m’s: A Benchmark to Evaluate Tool-Use for Multi-step Multi-modal Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To answer these questions and more, we introduce m&m’s: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable. |
Zixian Ma; Weikai Huang; Jieyu Zhang; Tanmay Gupta; Ranjay Krishna; |
297 | A High-quality Robust Diffusion Framework for Corrupted Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Motivated by DDGAN, our work introduces the first robust-to-outlier diffusion framework. |
Quan Dao; Binh Ta; Tung Pham; Anh Tran; |
298 | Within The Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Dyco, a novel method that utilizes the delta pose sequence to effectively model temporal appearance variations. To validate the effectiveness of our approach, we collect a novel dataset named I3D-Human, focused on capturing temporal changes in clothing appearance under similar poses. |
Yutong Chen; Yifan Zhan; Zhihang Zhong; Wei Wang; Xiao Sun; Yu Qiao; Yinqiang Zheng; |
299 | Disentangling Masked Autoencoders for Unsupervised Domain Generalization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To close the research gap, we propose a novel learning framework designed for UDG, termed the Disentangled Masked AutoEncoder (DisMAE), aiming to discover the disentangled representations that faithfully reveal the intrinsic features and superficial variations without access to the class label. |
An Zhang; Han Wang; Xiang Wang; Tat-Seng Chua; |
300 | High-Fidelity 3D Textured Shapes Generation By Sparse Encoding and Adversarial Decoding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To solve these, we design a 3D generation framework that maintains most of the building blocks of StableDiffusion with minimal adaptations for textured shape generation. Moreover, we clean up data and build a benchmark on the biggest 3D dataset (Objaverse). We release the processed data at https://aigc3d.github.io/gobjaverse/. |
Qi Zuo; Xiaodong Gu; Yuan Dong; Zhengyi Zhao; Weihao Yuan; Qiu Lingteng; Liefeng Bo; Zilong Dong; |
301 | "PoseEmbroider: Towards A 3D, Visual, Semantic-aware Human Pose Representation" Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we combine 3D poses, person’s pictures and textual pose descriptions to produce an enhanced 3D-, visual- and semantic-aware human pose representation. |
Ginger Delmas; Philippe Weinzaepfel; Francesc Moreno-Noguer; Gregory Rogez; |
302 | Bridging The Pathology Domain Gap: Efficiently Adapting CLIP for Pathology Image Analysis with Limited Labeled Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we introduce Path-CLIP, a framework tailored for a swift adaptation of CLIP to various pathology tasks. |
Zhengfeng Lai; Joohi Chauhan; Brittany N. Dugger; Chen-Nee Chuah; |
303 | A Task Is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, achieving both tasks simultaneously is challenging due to differing training strategies. To overcome this challenge, we introduce the first high-quality and versatile inpainting model that excels in multiple inpainting tasks. |
Junhao Zhuang; Yanhong Zeng; Wenran Liu; Chun Yuan; Kai Chen; |
304 | Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper identifies significant redundancy in the query-key interactions within self-attention mechanisms of diffusion transformer models, particularly during the early stages of denoising diffusion steps. In response to this observation, we present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately. |
Yifan Pu; Zhuofan Xia; Jiayi Guo; Dongchen Han; Qixiu Li; Duo Li; Yuhui Yuan; Ji Li; Yizeng Han; Shiji Song; Gao Huang; Xiu Li; |
305 | Reinforcement Learning Meets Visual Odometry Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Our approach introduces a neural network, operating as an agent within the VO pipeline, to make decisions such as keyframe and grid-size selection based on real-time conditions. |
Nico Messikommer; Giovanni Cioffi; Mathias Gehrig; Davide Scaramuzza; |
306 | KFD-NeRF: Rethinking Dynamic NeRF with Kalman Filter Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce KFD-NeRF, a novel dynamic neural radiance field integrated with an efficient and high-quality motion reconstruction framework based on Kalman filtering. |
Yifan Zhan; Zhuoxiao Li; Muyao Niu; Zhihang Zhong; Shohei Nobuhara; Ko Nishino; Yinqiang Zheng; |
307 | PSALM: Pixelwise Segmentation with Large Multi-modal Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To overcome the limitation of LMMs being restricted to textual output, PSALM incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks. This schema includes images, task instructions, conditional prompts, and mask tokens, which enable the model to generate and classify segmentation masks effectively. |
Zheng Zhang; Yeyao Ma; Enming Zhang; Xiang Bai; |
308 | Generalizable Human Gaussians for Sparse View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, modeling 3D humans from sparse views presents formidable hurdles due to the inherent complexity of human geometry, resulting in inaccurate reconstructions of geometry and textures. To tackle this challenge, this paper leverages recent advancements in Gaussian Splatting and introduces a new method to learn generalizable human Gaussians that allows photorealistic and accurate view-rendering of a new human subject from a limited set of sparse views in a feed-forward manner. |
YoungJoong Kwon; Baole Fang; Yixing Lu; Haoye Dong; Cheng Zhang; Francisco Vicente Carrasco; Albert Mosella-Montoro; Jianjin Xu; Shingo J Takagi; Daeil Kim; Aayush Prakash; Fernando de la Torre; |
309 | Language-Image Pre-training with Long Captions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We observe that each sentence within a long caption is very likely to describe the image partially (e.g., an object). Motivated by this, we propose to dynamically sample sub-captions from the text label to construct multiple positive pairs, and introduce a grouping loss to match the embeddings of each sub-caption with its corresponding local image patches in a self-supervised manner. |
Kecheng Zheng; Yifei Zhang; Wei Wu; Fan Lu; Shuailei Ma; Xin Jin; Wei Chen; Yujun Shen; |
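A sketch of the sub-caption sampling idea in entry 309, reduced to a CLIP-style contrastive loss with several positives per image. The sentence splitting, the `encode_text` callable, and the uniform positive weighting are assumptions; the paper's grouping loss additionally matches sub-captions to local image patches, which is omitted here.

```python
import random
import torch
import torch.nn.functional as F

def subcaption_loss(img_emb, encode_text, long_captions, n_sub=2, tau=0.07):
    """img_emb: (B, D) L2-normalized image embeddings; long_captions: B strings.
    Assumes every caption splits into at least `n_sub` sentences."""
    texts = []
    for cap in long_captions:
        sents = [s.strip() for s in cap.split(".") if s.strip()]
        texts += random.sample(sents, n_sub)  # dynamic sub-caption sampling
    txt_emb = F.normalize(encode_text(texts), dim=-1)  # (B * n_sub, D)

    logits = img_emb @ txt_emb.t() / tau  # (B, B * n_sub)
    # Soft targets: each image matches all of its own sub-captions equally.
    b = img_emb.size(0)
    targets = torch.zeros_like(logits)
    for i in range(b):
        targets[i, i * n_sub:(i + 1) * n_sub] = 1.0 / n_sub
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```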
310 | CG-SLAM: Efficient Dense RGB-D SLAM in A Consistent Uncertainty-aware 3D Gaussian Field Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Through an in-depth analysis of Gaussian Splatting, we propose several techniques to construct a consistent and stable 3D Gaussian field suitable for tracking and mapping. |
Jiarui Hu; Xianhao Chen; Boyin Feng; Guanglin Li; Liangjing Yang; Hujun Bao; Guofeng Zhang; Zhaopeng Cui; |
311 | IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we present IDOL for high-quality human-centric joint video-depth generation. |
Yuanhao Zhai; Kevin Lin; Linjie Li; Chung-Ching Lin; Jianfeng Wang; Zhengyuan Yang; David Doermann; Junsong Yuan; Zicheng Liu; Lijuan Wang; |
312 | Lagrangian Hashing for Compressed Neural Field Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Lagrangian Hashing, a representation for neural fields combining the characteristics of fast training NeRF methods that rely on Eulerian grids (i.e. InstantNGP), with those that employ points equipped with features as a way to represent information (e.g. 3D Gaussian Splatting or PointNeRF). |
Shrisudhan Govindarajan; Zeno Sambugaro; Akhmedkhan Shabanov; Towaki Takikawa; Weiwei Sun; Daniel Rebain; Nicola Conci; Kwang Moo Yi; Andrea Tagliasacchi; |
313 | Volumetric Rendering with Baked Quadrature Fields Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel Neural Radiance Field (NeRF) representation for non-opaque scenes that enables fast inference by utilizing textured polygons. |
Gopal Sharma; Daniel Rebain; Kwang Moo Yi; Andrea Tagliasacchi; |
314 | NeRF-XL: NeRF at Any Scale with Multi-GPU Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present NeRF-XL, a principled method for distributing Neural Radiance Fields (NeRFs) across multiple GPUs, thus enabling the training and rendering of NeRFs with an arbitrarily large capacity. |
Ruilong Li; Sanja Fidler; Angjoo Kanazawa; Francis Williams; |
315 | Fast Sprite Decomposition from Animated Graphics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents an approach to decomposing animated graphics into sprites, a set of basic elements or layers. For our study, we construct the Crello Animation dataset from an online design service and define quantitative metrics to measure the quality of the extracted sprites. |
Tomoyuki Suzuki; Kotaro Kikuchi; Kota Yamaguchi; |
316 | MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we observe that Multimodal Large Language Models (MLLMs) can be easily compromised by simple query-relevant images when paired with a malicious text query. |
Xin Liu; Yichen Zhu; Jindong Gu; Yunshi Lan; Chao Yang; Yu Qiao; |
317 | Learning to Adapt SAM for Segmenting Cross-domain Point Clouds Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the remarkable generalization capabilities exhibited by the vision foundation model, SAM, in the realm of image segmentation, our approach leverages the wealth of general knowledge embedded within SAM to unify feature representations across diverse 3D domains and further solves the 3D domain adaptation problem. |
Xidong Peng; Runnan Chen; Feng Qiao; Lingdong Kong; Youquan Liu; Yujing Sun; Tai Wang; Xinge Zhu; Yuexin Ma; |
318 | DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address the problem, we propose a local-to-global fusion network (DVLO) with bi-directional structure alignment. |
Jiuming Liu; Dong Zhuo; Zhiheng Feng; Siting Zhu; Chensheng Peng; Zhe Liu; Hesheng Wang; |
319 | FreeCompose: Generic Zero-Shot Image Composition with Diffusion Prior Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We observe that the pre-trained diffusion models automatically identify simple copy-paste boundary areas as low-density regions during denoising. Building on this insight, we propose to optimize the composed image towards high-density regions guided by the diffusion prior. |
Zhekai Chen; Wen Wang; Zhen Yang; Zeqing Yuan; Hao Chen; Chunhua Shen; |
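Entry 319's "optimize toward high-density regions" step resembles score distillation: repeatedly noise the composed image, query the frozen diffusion model, and nudge the image along the predicted score. The sketch below assumes a diffusers-style UNet and scheduler; the timestep range and update rule are illustrative, not the authors' exact procedure.

```python
import torch

def density_guidance_step(unet, scheduler, image, cond, optimizer):
    """One refinement step. `image` must be a leaf tensor with requires_grad=True;
    `cond` holds the text-encoder hidden states for the composition prompt."""
    t = torch.randint(50, 950, (1,), device=image.device)
    noise = torch.randn_like(image)
    noisy = scheduler.add_noise(image, noise, t)
    with torch.no_grad():
        eps_pred = unet(noisy, t, encoder_hidden_states=cond).sample
    # (eps_pred - noise) is the usual score-distillation gradient direction.
    image.backward(gradient=(eps_pred - noise))
    optimizer.step()
    optimizer.zero_grad()

# Usage sketch (all names assumed):
# image = composed_latent.clone().requires_grad_(True)
# opt = torch.optim.Adam([image], lr=1e-2)
# for _ in range(200):
#     density_guidance_step(unet, scheduler, image, cond, opt)
```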
320 | OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a new multi-view 3D object detector named OPEN, whose main idea is to effectively inject object-wise depth information into the network through our proposed object-wise position embedding. |
Jinghua Hou; Tong Wang; Xiaoqing Ye; Zhe Liu; Shi Gong; Xiao Tan; Errui Ding; Jingdong Wang; Xiang Bai; |
321 | Labeled Data Selection for Category Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We show that changing the labeled data does indeed significantly impact discovery performance. Motivated by this, we propose two new approaches for automatically selecting the most suitable labeled data based on the similarity between the labeled and unlabeled data. |
Bingchen Zhao; Nico Lang; Serge Belongie; Oisin Mac Aodha; |
322 | LISO: Lidar-only Self-Supervised 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a novel self-supervised method to train object detection networks, requiring only unlabeled sequences of lidar point clouds. |
Stefan Andreas Baur; Frank Moosmann; Andreas Geiger; |
323 | Let The Avatar Talk Using Texts Without Paired Training Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces text-driven talking avatar generation, a task that uses text to instruct both the generation and animation of an avatar. |
Xiuzhe Wu; Yang-Tian Sun; Handi Chen; Hang Zhou; Jingdong Wang; Zhengzhe Liu; Xiaojuan Qi; |
324 | MeshAvatar: Learning High-quality Triangular Human Avatars from Multi-view Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a novel pipeline for learning high-quality triangular human avatars from multi-view videos. |
Yushuo Chen; Zerong Zheng; Zhe Li; Chao Xu; Yebin Liu; |
325 | "Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts" Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes an algorithm for automatically labeling 3D objects from 2D point or box prompts, especially focusing on applications in autonomous driving. |
Jianhao Li; Tianyu Sun; Zhongdao Wang; Enze Xie; Bailan Feng; Hongbo Zhang; Ze Yuan; Ke Xu; Jiaheng Liu; Ping Luo; |
326 | Synchronous Diffusion for Unsupervised Smooth Non-Rigid 3D Shape Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, inspired by the success of message passing on graphs, we propose a synchronous diffusion process which we use as regularisation to achieve smoothness in non-rigid 3D shape matching problems. |
Dongliang Cao; Zorah Laehner; Florian Bernard; |
327 | Contrastive Region Guidance: Improving Grounding in Vision-Language Models Without Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source VLMs to respond to visual prompts. |
David Wan; Jaemin Cho; Elias Stengel-Eskin; Mohit Bansal; |
328 | Long-term Temporal Context Gathering for Neural Video Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we address the issue by facilitating the synergy of both long-term and short-term temporal contexts during feature propagation. |
Linfeng Qi; Zhaoyang Jia; Jiahao Li; Bin Li; Houqiang Li; Yan Lu; |
329 | Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel training framework for diffusion models from the perspective of visual modality, which is more generic and fundamental for mitigating memorization. |
Xiao Liu; Xiaoliu Guan; Yu Wu; Jiaxu Miao; |
330 | AdaDistill: Adaptive Knowledge Distillation for Deep Face Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present an adaptive KD approach, namely AdaDistill, for deep face recognition. |
Fadi Boutros; Vitomir Struc; Naser Damer; |
331 | Removing Distributional Discrepancies in Captions Improves Image-Text Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a model designed to improve the prediction of image-text alignment, targeting the challenge of compositional understanding in current visual-language models. |
Mu Cai; Haotian Liu; Yuheng Li; Yijun Li; Eli Shechtman; Zhe Lin; Yong Jae Lee; Krishna Kumar Singh; |
332 | Appearance-based Refinement for Object-Centric Motion Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The goal of this paper is to discover, segment, and track independently moving objects in complex visual scenes. |
Junyu Xie; Weidi Xie; Andrew Zisserman; |
333 | Made to Order: Discovering Monotonic Temporal Changes Via Self-supervised Video Ordering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our objective is to discover and localize monotonic temporal changes in a sequence of images. |
Charig Yang; Weidi Xie; Andrew Zisserman; |
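Entry 333's pretext task can be summarized in a few lines: shuffle per-frame features and train a transformer to recover each frame's original position. Everything below (model size, feature source, names) is an assumption meant only to make the objective concrete.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameOrderer(nn.Module):
    def __init__(self, dim=256, n_frames=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, n_frames)  # classify position 0..n_frames-1

    def forward(self, frame_feats):  # (B, T, dim), frames already shuffled
        return self.head(self.encoder(frame_feats))  # (B, T, T) position logits

def ordering_loss(model, feats):
    """feats: (B, T, dim) per-frame features in their true temporal order;
    assumes T equals the model's n_frames."""
    b, t, _ = feats.shape
    perm = torch.stack([torch.randperm(t, device=feats.device) for _ in range(b)])
    shuffled = torch.gather(feats, 1, perm.unsqueeze(-1).expand_as(feats))
    logits = model(shuffled)  # predict where each shuffled frame came from
    return F.cross_entropy(logits.reshape(b * t, t), perm.reshape(b * t))
```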
334 | SlimFlow: Training Smaller One-Step Diffusion Models with Rectified Flow Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper aims to develop small, efficient one-step diffusion models based on the powerful rectified flow framework, by exploring joint compression of inference steps and model size. |
Yuanzhi Zhu; Xingchao Liu; Qiang Liu; |
335 | RPBG: Towards Robust Neural Point-based Graphics in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we revisit one such influential method, known as Neural Point-based Graphics (NPBG), as our baseline, and propose Robust Point-based Graphics (RPBG). |
Qingtian Zhu; Zizhuang Wei; Zhongtian Zheng; Yifan Zhan; Zhuyu Yao; Jiawang Zhang; Kejian Wu; Yinqiang Zheng; |
336 | MOFA-Video: Controllable Image Animation Via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present MOFA-Video, an advanced controllable image animation method that generates video from the given image using various additional controllable signals (such as human landmark references, manual trajectories, or even another provided video) or their combinations. |
Muyao Niu; Xiaodong Cun; Xintao Wang; Yong Zhang; Ying Shan; Yinqiang Zheng; |
337 | RS-NeRF: Neural Radiance Fields from Rolling Shutter Images Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, their effectiveness is hindered by the Rolling Shutter (RS) effects commonly found in most camera systems. To solve this, we present RS-NeRF, a method designed to synthesize normal images from novel views using input with RS distortions. |
Muyao Niu; Tong Chen; Yifan Zhan; Zhuoxiao Li; Xiang Ji; Yinqiang Zheng; |
338 | Which Model Generated This Image? A Model-Agnostic Approach for Origin Attribution Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we study the origin attribution of generated images in a practical setting where only a few images generated by a source model are available and the source model cannot be accessed. |
Fengyuan Liu; Haochen Luo; Yiming Li; Philip Torr; Jindong Gu; |
339 | Fairness-aware Vision Transformer Via Debiased Self-Attention Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We establish that the existing fairness-aware algorithms designed for CNNs do not perform well on ViT, which highlights the need to develop our novel framework via Debiased Self-Attention (DSA). |
Yao Qiang; Chengyin Li; Prashant Khanduri; Dongxiao Zhu; |
340 | Learning Quantized Adaptive Conditions for Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel and effective approach to reduce trajectory curvature by utilizing adaptive conditions. |
Yuchen Liang; Yuchuan Tian; Lei Yu; Huaao Tang; Jie Hu; Xiangzhong Fang; Hanting Chen; |
341 | CLIP-DPO: Vision-Language Models As A Source of Preference for Fixing Hallucinations in LVLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this and improve their robustness, we present CLIP-DPO, a preference optimization method that leverages contrastively pre-trained Vision-Language (VL) embedding models, such as CLIP, for DPO-based optimization of LVLMs. Instead, starting from the initial pool of supervised fine-tuning data, we generate a diverse set of predictions, which are ranked based on their CLIP image-text similarities, and then filtered using a robust rule-based approach to obtain a set of positive and negative pairs for DPO-based training. |
Yassine Ouali; Adrian Bulat; Brais Martinez; Georgios Tzimiropoulos; |
342 | AdaNAT: Exploring Adaptive Policy for Token-Based Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Moreover, their one-size-fits-all nature cannot flexibly adapt to the diverse characteristics of each individual sample. To address these issues, we propose AdaNAT, a learnable approach that automatically configures a suitable policy tailored for every sample to be generated. |
Zanlin Ni; Yulin Wang; Renping Zhou; Rui Lu; Jiayi Guo; Jinyi Hu; Zhiyuan Liu; Yuan Yao; Gao Huang; |
343 | APL: Anchor-based Prompt Learning for One-stage Weakly Supervised Referring Expression Comprehension Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite its effectiveness, we identify that the visual semantics of RefCLIP are ambiguous and insufficient for weakly supervised REC modeling. To address this issue, we propose a novel method that enriches visual semantics with various prompt information, called anchor-based prompt learning (APL). |
Yaxin Luo; Jiayi Ji; Xiaofu Chen; Yuxin Zhang; Tianhe Ren; Gen Luo; |
344 | HiFi-123: Towards High-fidelity One Image to 3D Content Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce HiFi-123, a method designed for high-fidelity and multi-view consistent 3D generation. |
Wangbo Yu; Li Yuan; Yan-Pei Cao; Xiangjun Gao; Xiaoyu Li; Wenbo Hu; Long Quan; Ying Shan; Yonghong Tian; |
345 | ZigMa: A DiT-style Zigzag Mamba Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The diffusion model has long been plagued by scalability and quadratic complexity issues, especially within transformer-based structures. In this study, we aim to leverage the long sequence modeling capability of a State-Space Model called Mamba to extend its applicability to visual data generation. |
Vincent Tao Hu; Stefan A Baumann; Ming Gui; Olga Grebenkova; Pingchuan Ma; Johannes S Fischer; Bjorn Ommer; |
346 | Controllable Human-Object Interaction Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we address the challenging problem of generating synchronized object motion and human motion guided by language descriptions in 3D scenes. |
Jiaman Li; Alexander Clegg; Roozbeh Mottaghi; Jiajun Wu; Xavier Puig; C. Karen Liu; |
347 | Learning Video Context As Interleaved Multimodal Sequences Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a multimodal language model developed to address the wide range of challenges in understanding video contexts. |
Kevin Qinghong Lin; Pengchuan Zhang; Difei Gao; Xide Xia; Joya Chen; Ziteng Gao; Jinheng Xie; Xuhong Xiao; Mike Zheng Shou; |
348 | T-MAE: Temporal Masked Autoencoders for Point Cloud Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Nevertheless, temporal information, which is inherent in the LiDAR point cloud sequence, is consistently disregarded. To better utilize this property, we propose an effective pre-training strategy, namely Temporal Masked Auto-Encoders (T-MAE), which takes as input temporally adjacent frames and learns temporal dependency. |
Weijie Wei; Fatemeh Karimi Nejadasl; Theo Gevers; Martin R. Oswald; |
349 | Visual Text Generation in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Upon investigation, we find that existing methods, either rendering-based or diffusion-based, can hardly meet all these aspects simultaneously, limiting their application range. Therefore, we propose in this paper a visual text generator (termed SceneVTG), which can produce high-quality text images in the wild. |
Yuanzhi Zhu; Jiawei Liu; Feiyu Gao; Wenyu Liu; Xinggang Wang; Peng Wang; Fei Huang; Cong Yao; Zhibo Yang; |
350 | TTT-MIM: Test-Time Training with Masked Image Modeling for Denoising Distribution Shifts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a test-time training (TTT) method based on masked image modeling (MIM) to improve denoising performance for out-of-distribution images. |
Youssef Mansour; Xuyang Zhong; Serdar Caglar; Reinhard Heckel; |
351 | ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose to unify the two frameworks by employing a generator and a discriminator: the generator is trained to output multiple hypotheses of 6DoF camera pose to represent a distribution and handle multi-mode ambiguity, and the discriminator is trained to identify the hypothesis that best explains the data. |
Hao Tang; Weiyao Wang; Pierre Gleize; Matt Feiszli; |
352 | Temporal Event Stereo Via Joint Learning with Stereoscopic Flow Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Since obtaining ground truth for optical flow during training is challenging, we propose a method that uses only disparity maps to train the stereoscopic flow. |
Hoonhee Cho; Jae-Young Kang; Kuk-Jin Yoon; |
353 | Finding Meaning in Points: Weakly Supervised Semantic Segmentation for Event Cameras Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: As a remedy, we present EV-WSSS: a novel weakly supervised approach for event-based semantic segmentation that utilizes sparse point annotations. |
Hoonhee Cho; Sung-Hoon Yoon; Hyeokjun Kweon; Kuk-Jin Yoon; |
354 | CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents CAT-SAM, a ConditionAl Tuning network that explores few-shot adaptation of SAM toward various challenging downstream domains in a data-efficient manner. |
Aoran Xiao; Weihao Xuan; Heli Qi; Yun Xing; Ruijie Ren; Xiaoqin Zhang; Ling Shao; Shijian Lu; |
355 | PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces PatchRefiner, an advanced framework for metric single image depth estimation aimed at high-resolution real-domain inputs. |
Zhenyu Li; Shariq Farooq Bhat; Peter Wonka; |
356 | LPViT: Low-Power Semi-structured Pruning for Vision Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a new block-structured pruning method to address the resource-intensive nature of ViTs, offering a balanced trade-off between accuracy and hardware acceleration. |
Kaixin Xu; Zhe Wang; Chunyun Chen; Xue Geng; Jie Lin; Xulei Yang; Min Wu; Xiaoli Li; Weisi Lin; |
357 | TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end, we introduce the new task of outdoor 3D dense captioning. We also introduce the TOD3Cap dataset, the first million-scale dataset to our knowledge for 3D dense captioning in outdoor scenes, which contains 2.3M descriptions of 64.3K outdoor objects from 850 scenes in nuScenes. |
Bu Jin; Yupeng Zheng; Pengfei Li; Weize Li; Yuhang Zheng; Sujie Hu; Xinyu Liu; Jinwei Zhu; Zhijie Yan; Haiyang Sun; Kun Zhan; Peng Jia; Xiaoxiao Long; Yilun Chen; Hao Zhao; |
358 | Look Hear: Gaze Prediction for Speech-directed Human Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our study focuses on the incremental prediction of attention as a person is seeing an image and hearing a referring expression defining the object in the scene that should be fixated by gaze. |
Sounak Mondal; Seoyoung Ahn; Zhibo Yang; Niranjan Balasubramanian; Dimitris Samaras; Gregory Zelinsky; Minh Hoai; |
359 | EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present EgoPoseFormer, a simple yet effective transformer-based model for stereo egocentric human pose estimation. |
Chenhongyi Yang; Anastasia Tkach; Shreyas Hampali; Linguang Zhang; Elliot J Crowley; Cem Keskin; |
360 | Diffusion Models As Optimizers for Efficient Planning in Offline RL Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, the practicality of these methods is limited due to the lengthy inference processes they require. In this paper, we address this problem by decomposing the sampling process of diffusion models into two decoupled subprocesses: 1) generating a feasible trajectory, which is a time-consuming process, and 2) optimizing the trajectory. |
Renming Huang; Yunqiang Pei; Guoqing Wang; Yangming Zhang; Yang Yang; Peng Wang; Heng Tao Shen; |
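A schematic reading of the two-step decomposition above, in Python (a sketch only: diffusion.sample and optimizer_step are hypothetical placeholder interfaces, not the authors' code):

```python
def plan_trajectory(diffusion, optimizer_step, cond, coarse_steps=5, refine_iters=20):
    """Sketch of decoupled diffusion planning (illustrative, not the paper's API).

    Step 1 (feasibility): a few expensive reverse-diffusion steps yield a
    valid but rough trajectory. Step 2 (optimality): many cheap refinement
    updates (e.g., gradient steps on a predicted return) polish it.
    """
    traj = diffusion.sample(cond, num_steps=coarse_steps)  # slow, run briefly
    for _ in range(refine_iters):                          # fast, run often
        traj = optimizer_step(traj, cond)
    return traj
```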
361 | Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a novel 3D customization method, dubbed Make-Your-3D, that can personalize high-fidelity and consistent 3D content from only a single image of a subject with a text description within 5 minutes. |
Fangfu Liu; Hanyang Wang; Weiliang Chen; Haowen Sun; Yueqi Duan; |
362 | Unveiling and Mitigating Memorization in Text-to-image Diffusion Models Through Cross Attention Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To elucidate this phenomenon, we further identify and discuss various intrinsic findings of cross-attention that contribute to memorization. Building on these insights, we introduce an innovative approach to detect and mitigate memorization in diffusion models. |
Jie Ren; Yaxin Li; Shenglai Zeng; Han Xu; Lingjuan Lyu; Yue Xing; Jiliang Tang; |
363 | Audio-visual Generalized Zero-shot Learning The Easy Way Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods fell short of effectively capturing the intricate relationship between cross-modal features and class-label embeddings inherent in pre-trained language-aligned embeddings. To circumvent these bottlenecks, we introduce a simple yet effective framework for audio-visual generalized zero-shot learning that aligns audio-visual embeddings with transformed text representations. |
Shentong Mo; Pedro Morgado; |
364 | High-Precision Self-Supervised Monocular Depth Estimation with Rich-Resource Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the Rich-resource Prior Depth estimator (RPrDepth), which only requires a single input image during the inference phase but can still produce highly accurate depth estimations comparable to rich-resource-based methods. |
Jianbing Shen; Wencheng Han; |
365 | DataDream: Few-shot Guided Dataset Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose DataDream, a framework for synthesizing classification datasets that more faithfully represents the real data distribution when guided by few-shot examples of the target classes. |
Jae Myung Kim; Jessica Bader; Stephan Alaniz; Cordelia Schmid; Zeynep Akata; |
366 | Can OOD Object Detectors Learn from Foundation Models? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce SyncOOD, a simple data curation method that capitalizes on the capabilities of large foundation models to automatically extract meaningful OOD data from text-to-image generative models. |
Jiahui Liu; Xin Wen; Shizhen Zhao; Yingxian Chen; Xiaojuan Qi; |
367 | Physics-Free Spectrally Multiplexed Photometric Stereo Under Unknown Spectral Composition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a groundbreaking spectrally multiplexed photometric stereo approach for recovering surface normals of dynamic surfaces without the need for calibrated lighting or sensors, a notable advancement in the field traditionally hindered by stringent prerequisites and spectral ambiguity. |
Satoshi Ikehata; Yuta Asano; |
368 | Lego: Learning to Disentangle and Invert Personalized Concepts Beyond Object Appearance in Text-to-Image Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we introduce Lego, a textual inversion method designed to invert subject entangled concepts from a few example images. |
Saman Motamed; Danda Pani Paudel; Luc Van Gool; |
369 | Arc2Face: A Foundation Model for ID-Consistent Human Faces Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents Arc2Face, an identity-conditioned face foundation model, which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with a higher degree of face similarity than existing models. |
Foivos Paraperas Papantoniou; Alexandros Lattas; Stylianos Moschoglou; Jiankang Deng; Bernhard Kainz; Stefanos Zafeiriou; |
370 | Certifiably Robust Image Watermark Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose the first image watermarks with certified robustness guarantees against removal and forgery attacks. |
Zhengyuan Jiang; Moyang Guo; Yuepeng Hu; Jinyuan Jia; Neil Zhenqiang Gong; |
371 | TRAM: Global Trajectory and Motion of 3D Humans from In-the-wild Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose TRAM, a two-stage method to reconstruct a human’s global trajectory and motion from in-the-wild videos. |
Yufu Wang; Ziyun Wang; Lingjie Liu; Kostas Daniilidis; |
372 | Learning to Robustly Reconstruct Dynamic Scenes from Low-light Spike Streams Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose a bidirectional recurrent-based reconstruction framework to better handle such extreme conditions. |
Liwen Hu; Ziluo Ding; Mianzhi Liu; Lei Ma; Tiejun Huang; |
373 | MTMamba: Enhancing Multi-Task Dense Scene Understanding By Mamba-Based Decoders Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. |
Baijiong Lin; Weisen Jiang; Pengguang Chen; Yu Zhang; Shu Liu; Yingcong Chen; |
374 | SphereHead: Stable 3D Full-head Synthesis with Spherical Tri-plane Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response, we propose SphereHead, a novel tri-plane representation in the spherical coordinate system that fits the human head’s geometric characteristics and efficiently mitigates many of the generated artifacts. |
Heyuan Li; Ce Chen; Tianhao Shi; Yuda Qiu; Sizhe An; Guanying CHEN; Xiaoguang Han; |
375 | Token Compensator: Altering Inference Cost of Vision Transformer Without Re-Tuning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a model arithmetic framework to decouple the compression degrees between the two stages. |
Shibo Jie; Yehui Tang; Jianyuan Guo; Zhi-Hong Deng; Kai Han; Yunhe Wang; |
376 | Dissolving Is Amplifying: Towards Fine-Grained Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce DIA: dissolving is amplifying. |
Jian Shi; Pengyi Zhang; Ni Zhang; Hakim Ghazzai; Peter Wonka; |
377 | ScribblePrompt: Fast and Flexible Interactive Segmentation for Any Biomedical Image Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present ScribblePrompt, a flexible neural-network-based interactive segmentation tool for biomedical imaging that enables human annotators to segment previously unseen structures using scribbles, clicks, and bounding boxes. We showcase ScribblePrompt in an interactive demo, provide code, and release a dataset of scribble annotations at https://scribbleprompt.csail.mit.edu |
Hallee E. Wong; Marianne Rakic; John Guttag; Adrian V. Dalca; |
378 | Generalizing to Unseen Domains Via Text-guided Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Prior approaches have to train a different augmentation network for each possible unseen domain, which is time-inefficient. To overcome these challenges, we benefit from the multimodal embedding space of a pre-trained vision-language model and propose to acquire training-free and domain-invariant augmentations with text descriptions of arbitrarily crafted unseen domains, which do not necessarily match the test domains. |
Daiqing Qi; Handong Zhao; Aidong Zhang; Sheng Li; |
379 | WAS: Dataset and Methods for Artistic Text Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a decoder with the layer-wise momentum query to prevent the model from ignoring stroke regions of special shapes. |
Xudong Xie; Yuzhe Li; Yang Liu; Zhifei Zhang; Zhaowen Wang; Wei Xiong; Xiang Bai; |
380 | Boosting 3D Single Object Tracking with 2D Matching Distillation and 3D Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, learning robust 3D SOT trackers remains challenging due to the limited category-specific point cloud data and the inherent sparsity and incompleteness of LiDAR scans. To tackle these issues, we propose a unified 3D SOT framework that leverages 3D generative pre-training and learns robust 3D matching abilities from 2D pre-trained foundation trackers. |
Qiangqiang Wu; Yan Xia; Jia Wan; Antoni Chan; |
381 | DreamReward: Aligning Human Preference in Text-to-3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a comprehensive framework, coined DreamReward, to learn and improve text-to-3D models from human preference feedback. |
Junliang Ye; Fangfu Liu; Qixiu Li; Zhengyi Wang; Yikai Wang; Xinzhou Wang; Yueqi Duan; Jun Zhu; |
382 | Do Generalised Classifiers Really Work on Human Drawn Sketches? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper, for the first time, marries large foundation models with human sketch understanding. We demonstrate what this brings – a paradigm shift in terms of generalised sketch representation learning (e.g., classification). |
Hmrishav Bandyopadhyay; Pinaki Nath Chowdhury; Aneeshan Sain; Subhadeep Koley; Tao Xiang; Ayan Kumar Bhunia; Yi-Zhe Song; |
383 | N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field, wherein different dimensions within the same high-dimensional feature encode scene properties at varying granularities. |
Yash Bhalgat; Iro Laina; Joao F Henriques; Andrew Zisserman; Andrea Vedaldi; |
384 | GGRt: Towards Generalizable 3D Gaussians Without Pose Priors in Real-Time Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents GGRt, a novel approach to generalizable novel view synthesis that alleviates the need for real camera poses, complexity in processing high-resolution images, and lengthy optimization processes, thus facilitating stronger applicability of 3D Gaussian Splatting (3D-GS) in real-world scenarios. |
Hao Li; Yuanyuan Gao; Dingwen Zhang; Chenming Wu; YALUN DAI; Chen Zhao; Haocheng Feng; Errui Ding; Jingdong Wang; Junwei Han; |
385 | UAV First-Person Viewers Are Radiance Field Learners Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: UAV videos exacerbate these issues with limited viewpoints and significant spatial scale variations, resulting in inadequate detail rendering across diverse scales. In response, we introduce FPV-NeRF, addressing these challenges through three key facets: (1) Temporal consistency. Leveraging spatio-temporal continuity ensures seamless coherence between frames; (2) Global structure. Incorporating various global features during point sampling preserves space integrity; (3) Local granularity. |
Liqi Yan; Qifan Wang; Junhan Zhao; Qiang Guan; Zheng Tang; Jianhui Zhang; Dongfang Liu; |
386 | TurboEdit: Real-time Text-based Disentangled Real Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce an encoder-based iterative inversion technique. |
Zongze Wu; Nicholas I Kolkin; Jonathan Brandt; Richard Zhang; Eli Shechtman; |
387 | MotionChain: Conversational Motion Controllers Via Multimodal Prompts Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we present MotionChain, a conversational human motion controller that generates continuous and long-term human motion through multimodal prompts. |
Biao Jiang; Xin Chen; Chi Zhang; Fukun Yin; Zhuoyuan Li; Gang Yu; Jiayuan Fan; |
388 | UniCode: Learning A Unified Codebook for Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose UniCode, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals. |
Sipeng Zheng; Bohan Zhou; Yicheng Feng; Ye Wang; Zongqing Lu; |
389 | Defect Spectrum: A Granular Look of Large-scale Defect Datasets with Rich Semantics Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce the Defect Spectrum, a comprehensive benchmark that offers precise, semantic-abundant, and large-scale annotations for a wide range of industrial defects. |
Shuai Yang; ZhiFei Chen; Pengguang Chen; Xi Fang; Yixun Liang; Shu Liu; Yingcong Chen; |
390 | CogView3: Finer and Faster Text-to-Image Generation Via Relay Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To tackle the issue, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. |
Wendi Zheng; Jiayan Teng; Zhuoyi Yang; Weihan Wang; Jidong Chen; Xiaotao Gu; Yuxiao Dong; Ming Ding; Jie Tang; |
391 | Analytic-Splatting: Anti-Aliased 3D Gaussian Splatting Via Analytic Integration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we use a conditioned logistic function as the analytic approximation of the cumulative distribution function (CDF) of the Gaussian signal and calculate the integral by subtracting the CDFs. |
Zhihao Liang; Qi Zhang; Wenbo Hu; Ying Feng; Lei ZHU; Kui Jia; |
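The analytic integration above can be sketched as follows: the per-pixel integral of a 1-D Gaussian is a difference of CDF values, and the standard normal CDF admits a logistic approximation (the paper's conditioned logistic function is more elaborate than this textbook form):

```latex
\int_a^b \mathcal{N}(x;\mu,\sigma^2)\,dx
  = \Phi\!\left(\frac{b-\mu}{\sigma}\right) - \Phi\!\left(\frac{a-\mu}{\sigma}\right),
\qquad
\Phi(t) \approx \frac{1}{1+e^{-kt}}, \quad k \approx 1.702 .
```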
392 | BLINK: Multimodal Large Language Models Can See But Not Perceive Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce BLINK, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. |
Xingyu Fu; Yushi Hu; Bangzheng Li; Yu Feng; Haoyu Wang; Xudong Lin; Dan Roth; Noah A Smith; Wei-Chiu Ma; Ranjay Krishna; |
393 | Free-Editor: Zero-shot Text-driven 3D Scene Editing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we introduce a novel, training-free 3D scene editing technique called Free-Editor, which enables users to edit 3D scenes without the need for model retraining during the testing phase. |
Nazmul Karim; Hasan Iqbal; Umar Khalid; Chen Chen; Jing Hua; |
394 | TAPTR: Tracking Any Point with Transformers As Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a simple yet effective approach for Tracking Any Point with TRansformers (TAPTR). |
Hongyang Li; Hao Zhang; Shilong Liu; Zhaoyang Zeng; Tianhe Ren; Feng Li; Lei Zhang; |
395 | VeCLIP: Improving CLIP Training Via Visual-enriched Captions Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This study introduces a scalable pipeline for noisy caption rewriting. |
Zhengfeng Lai; Haotian Zhang; Bowen Zhang; Wentao Wu; Haoping Bai; Aleksei Timofeev; Xianzhi Du; Zhe Gan; Jiulong Shan; Chen-Nee Chuah; Yinfei Yang; Meng Cao; |
396 | Mesh2NeRF: Direct Mesh Supervision for Neural Radiance Field Representation and Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Mesh2NeRF, an approach to derive ground-truth radiance fields from textured meshes for 3D generation tasks. |
Yujin Chen; Yinyu Nie; Benjamin Ummenhofer; Reiner Birkl; Michael Paulitsch; Matthias Müller; Matthias Niessner; |
397 | Raindrop Clarity: A Dual-Focused Dataset for Day and Night Raindrop Removal Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a large-scale, real-world raindrop removal dataset called Raindrop Clarity. |
Yeying Jin; Xin Li; Jiadong Wang; Yan Zhan; Malu Zhang; |
398 | Revisiting Feature Disentanglement Strategy in Diffusion Training and Breaking Conditional Independence Assumption in Sampling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a training framework for feature disentanglement of Diffusion Models (FDiff). |
Wonwoong Cho; Hareesh Ravi; Midhun Harikumar; Vinh Khuc; Krishna Kumar Singh; Jingwan Lu; David Iseri Inouye; Ajinkya Kale; |
399 | Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Recent work has found that state-of-the-art models struggle to depict everyday objects with the true diversity of the real world and have notable gaps between geographic regions. In this work, we aim to increase the diversity of generated images of common objects such that per-region variations are representative of the real world. |
Reyhane Askari Hemmat; Melissa Hall; Alicia Yi Sun; Candace Ross; Michal Drozdzal; Adriana Romero-Soriano; |
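For context, the Vendi Score that this guidance builds on (introduced by Friedman and Dieng) is the exponentiated entropy of the eigenvalues of a normalized similarity matrix over the n generated samples:

```latex
\mathrm{VS}(x_1,\dots,x_n;k) = \exp\!\Big(-\sum_{i=1}^{n}\lambda_i\log\lambda_i\Big),
\qquad \lambda_i \text{ = eigenvalues of } K/n, \;\; K_{ij}=k(x_i,x_j), \;\; 0\log 0 := 0 .
```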
400 | MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we investigate the root cause of VLMs’ biased prediction under the OVD context. |
Kuo Wang; Lechao Cheng; Weikai Chen; Pingping Zhang; Liang Lin; Fan Zhou; Guanbin Li; |
401 | DomainFusion: Generalizing To Unseen Domains with Latent Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing diffusion-based DG methods are restricted to offline augmentation using LDM and suffer from degraded performance and prohibitive computational costs. To address these challenges, we propose DomainFusion to simultaneously achieve knowledge extraction in the latent space and augmentation in the pixel space of the Latent Diffusion Model (LDM) for efficiently and sufficiently exploiting LDM. |
Yuyang Huang; Yabo Chen; Yuchen Liu; Xiaopeng Zhang; Wenrui Dai; Hongkai Xiong; Qi Tian; |
402 | AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To mitigate it, we propose a novel architecture named AlignZeg, which embodies a comprehensive improvement of the segmentation pipeline, including proposal extraction, classification, and correction, to better fit the goal of zero-shot segmentation. |
Jiannan Ge; Lingxi Xie; Hongtao Xie; Pandeng Li; Xiaopeng Zhang; Yongdong Zhang; Qi Tian; |
403 | Text-Anchored Score Composition: Tackling Condition Misalignment in Text-to-Image Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: If this alignment is not satisfied, the final output could be either dominated by one condition, or ambiguity may arise, failing to meet user expectations. To address this issue, we present a training-free approach called Text-Anchored Score Composition (TASC) to further improve the controllability of existing models when provided with partially aligned conditions. |
Luozhou Wang; Guibao Shen; Wenhang Ge; Guangyong Chen; Yijun Li; Yingcong Chen; |
404 | MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this, we propose MetaCap, a method for efficient and high-quality geometry recovery and novel view synthesis given very sparse or even a single view of the human. For evaluating our method under different scenarios, we collect a new dataset containing subjects captured in both a dense camera dome and in-the-wild sparse camera rigs, and demonstrate superior results compared to recent state-of-the-art methods on both public datasets and our new dataset. |
Guoxing Sun; Rishabh Dabral; Pascal Fua; Christian Theobalt; Marc Habermann; |
405 | Large-scale Reinforcement Learning for Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a scalable algorithm for enhancing diffusion models using Reinforcement Learning (RL) with a diverse range of reward functions, including human preference, compositionality, and social diversity over millions of images. |
Yinan Zhang; Eric Tzeng; Yilun Du; Dmitry Kislyuk; |
406 | RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose RGNet which deeply integrates clip retrieval and grounding into a single network capable of processing long videos into multiple granular levels, e.g., clips and frames. |
Tanveer Hannan; Md Mohaiminul Islam; Thomas Seidl; Gedas Bertasius; |
407 | DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose DiffusionDepth, a new approach that reformulates monocular depth estimation as a denoising diffusion process. |
Yiqun Duan; Xianda Guo; Zheng Zhu; |
408 | Bi-TTA: Bidirectional Test-Time Adapter for Remote Physiological Measurement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, 1) TTA algorithms are designed predominantly for classification tasks, making them ill-suited to regression tasks such as rPPG due to inadequate supervision. 2) Tuning pre-trained models in a single-instance manner introduces variability and instability, posing challenges to effectively filtering domain-relevant from domain-irrelevant features while simultaneously preserving the learned information. To overcome these challenges, we present Bi-TTA, a novel expert knowledge-based Bidirectional Test-Time Adapter framework. |
Haodong LI; Hao LU; Yingcong Chen; |
409 | MinD-3D: Reconstruct High-quality 3D Objects in Human Brain Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Recon3DMind, an innovative task aimed at reconstructing 3D visuals from Functional Magnetic Resonance Imaging (fMRI) signals, marking a significant advancement in the fields of cognitive neuroscience and computer vision.To support this pioneering task, we present the fMRI-Shape dataset, which includes data from 14 participants and features 360-degree videos of 3D objects to enable comprehensive fMRI signal capture across various settings, thereby laying a foundation for future research. |
Jianxiong Gao; Yuqian Fu; Yun Wang; Xuelin Qian; Jianfeng Feng; Yanwei Fu; |
410 | HERGen: Elevating Radiology Report Generation with Longitudinal Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing automated approaches are based on a single timestamp and often neglect the critical temporal aspect of patients’ imaging histories, which is essential for accurate longitudinal analysis. To address this gap, we propose a novel History Enhanced Radiology Report Generation (HERGen) framework that employs a group causal transformer to efficiently integrate longitudinal data across patient visits. |
Fuying Wang; Shenghui Du; Lequan Yu; |
411 | GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a robust fusion framework called GraphBEV. |
Ziying Song; Lei Yang; Shaoqing Xu; Lin Liu; Dongyang Xu; Caiyan Jia; Feiyang Jia; Li Wang; |
412 | AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present AutoDIR, an innovative all-in-one image restoration system incorporating latent diffusion. |
Yitong Jiang; Zhaoyang Zhang; Tianfan Xue; Jinwei Gu; |
413 | RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks by utilizing a hierarchical structure of queries that implicitly model the relationships both between and within tasks. |
Jianbing Shen; Chunliang Li; Wencheng Han; Junbo Yin; Sanyuan Zhao; |
414 | On-the-fly Category Discovery for LiDAR Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel on-the-fly category discovery method for LiDAR semantic segmentation, aiming to classify and segment both unknown and known classes instantaneously during test time, achieved solely by learning with known classes in training. |
Hyeonseong Kim; Sung-Hoon Yoon; Minseok Kim; Kuk-Jin Yoon; |
415 | A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Panoptic and instance segmentation networks are often trained with specialized object detection modules, complex loss functions, and ad-hoc post-processing steps to manage the permutation-invariance of the instance masks. This work builds upon Stable Diffusion and proposes a latent diffusion approach for panoptic segmentation, resulting in a simple architecture that omits these complexities. |
Wouter Van Gansbeke; Bert De Brabandere; |
416 | Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Drawing inspiration from the generalization capability and adaptability of emerging foundation models, our work aims to develop one general-purpose NeRF for handling diverse 3D tasks. |
Yonggan Fu; Huaizhi Qu; Zhifan Ye; Chaojian Li; Kevin Zhao; Yingyan (Celine) Lin; |
417 | Momentum Auxiliary Network for Supervised Local Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, these methods cannot replace end-to-end training due to lower accuracy, as gradients only propagate within their local block, creating a lack of information exchange between blocks. To address this issue and establish information transfer across blocks, we propose a Momentum Auxiliary Network (MAN) that establishes a dynamic interaction mechanism. |
Junhao Su; Changpeng Cai; Feiyu Zhu; Chenghao He; Xiaojie Xu; Dongzhi Guan; Chenyang Si; |
418 | HPFF: Hierarchical Locally Supervised Learning with Patch Feature Fusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, this approach can lead to performance lag due to limited interaction between these modules, and the design of auxiliary networks occupies a certain amount of GPU memory. To overcome these limitations, we propose a novel model called HPFF that performs hierarchical locally supervised learning and patch-level feature computation on the auxiliary networks. |
Junhao Su; Chenghao He; Feiyu Zhu; Xiaojie Xu; Dongzhi Guan; Chenyang Si; |
419 | Dual-Path Adversarial Lifting for Domain Shift Correction in Online Test-time Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, motivated by the dual-subband wavelet lifting scheme developed in multi-scale signal processing which is able to efficiently separate the input signals into principal components and noise components, we introduce a dual-path token lifting for domain shift correction in test time adaptation. |
Yushun Tang; Shuoshuo Chen; Zhihe Lu; Xinchao Wang; Zhihai He; |
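The classical wavelet lifting scheme the authors draw on separates a signal into smooth (principal) and detail (noise-like) components with split, predict, and update steps; a minimal Haar-style sketch (illustrative only, not the paper's token-level module):

```python
import numpy as np

def lifting_decompose(x):
    """One level of Haar-style lifting; assumes an even-length 1-D signal."""
    even, odd = x[0::2], x[1::2]
    detail = odd - even          # predict: residual after predicting odd from even
    smooth = even + detail / 2   # update: keep the running local average
    return smooth, detail

def lifting_reconstruct(smooth, detail):
    even = smooth - detail / 2
    odd = even + detail
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x

signal = np.array([4.0, 6.0, 5.0, 9.0])
s, d = lifting_decompose(signal)
assert np.allclose(lifting_reconstruct(s, d), signal)  # lifting is invertible
```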
420 | SWAG: Splatting in The Wild Images with Appearance-conditioned Gaussians Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recently, 3D Gaussian Splatting emerged as a much faster alternative with superior rendering quality and training efficiency, especially for small-scale and object-centric scenarios. Nevertheless, this technique suffers from poor performance on unstructured in-the-wild data. To tackle this, we extend 3D Gaussian Splatting to handle unstructured image collections. |
Hiba Dahmani; Moussab Bennehar; Nathan Piasco; Luis G Roldao Jimenez; Dzmitry Tsishkou; |
421 | Get Your Embedding Space in Order: Domain-Adaptive Regression for Forest Monitoring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a new dataset with aerial and satellite imagery in five countries, covering three forest-related regression tasks. |
Sizhuo Li; Dimitri Gominski; Martin Brandt; Xiaoye Tong; Philippe Ciais; |
422 | Data Augmentation Via Latent Diffusion for Saliency Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel data augmentation method for deep saliency prediction that edits natural images while preserving the complexity and variability of real-world scenes. |
Bahar Aydemir; Deblina Bhattacharjee; Tong Zhang; Mathieu Salzmann; Sabine Süsstrunk; |
423 | SegPoint: Segment Any Point Cloud Via Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a model, called SegPoint, that leverages the reasoning capabilities of a multi-modal Large Language Model (LLM) to produce point-wise segmentation masks across a diverse range of tasks: 1) 3D instruction segmentation, 2) 3D referring segmentation, 3) 3D semantic segmentation, and 4) 3D open-vocabulary semantic segmentation. To advance 3D instruction research, we introduce a new benchmark designed to evaluate segmentation performance from complex and implicit instructional texts, featuring point cloud-instruction pairs. |
Shuting He; Henghui Ding; Xudong Jiang; Bihan Wen; |
424 | Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets and the stringent precision requirements for aligning visual and textual data. To address this pressing need, we introduce GeoText-1652, a new natural language-guided geolocalization benchmark. |
Meng Chu; Zhedong Zheng; Wei Ji; Tingyu Wang; Tat-Seng Chua; |
425 | SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we explore an important scenario in which the input consists of one or a few unposed 2D images of a single object, with little or no overlap. We propose a novel method, SpaRP, to reconstruct a 3D textured mesh and estimate the relative camera poses for these sparse-view images. |
Chao Xu; Ang Li; Linghao Chen; Yulin Liu; Ruoxi Shi; Hao Su; Minghua Liu; |
426 | NOVUM: Neural Object Volumes for Robust Object Classification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we show that explicitly integrating 3D compositional object representations into deep networks for image classification leads to a largely enhanced generalization in out-of-distribution scenarios. |
Artur Jesslen; Guofeng Zhang; Angtian Wang; Wufei Ma; Alan Yuille; Adam Kortylewski; |
427 | Siamese Vision Transformers Are Scalable Audio-visual Learners Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we investigate using an audio-visual siamese network for efficient and scalable audio-visual pretraining. |
Yan-Bo Lin; Gedas Bertasius; |
428 | Synergy of Sight and Semantics: Visual Intention Understanding with CLIP Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a novel framework, Intention Understanding with CLIP (IntCLIP), which utilizes a dual-branch approach. |
Qu Yang; Mang Ye; Dacheng Tao; |
429 | BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a new benchmark, BenchLMM, to assess the robustness of LMMs toward three different styles: artistic image style, imaging sensor style, and application style. |
Rizhao Cai; Zirui Song; Dayan Guan; Zhenhao Chen; Yaohang Li; Xing Luo; Chenyu Yi; Alex Kot; |
430 | GTP-4o: Modality-prompted Heterogeneous Graph Learning for Omni-modal Biomedical Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To tackle these, we propose an innovative Modality-prompted Heterogeneous Graph for Omni-modal Learning (GTP-4o), which embeds the numerous disparate clinical modalities into a unified representation, completes the deficient embedding of missing modality and reformulates the cross-modal learning with a graph-based aggregation. |
Chenxin Li; Xinyu Liu; Cheng Wang; Yifan Liu; Weihao Yu; Jing Shao; Yixuan Yuan; |
431 | Learning Neural Deformation Representation for 4D Dynamic Shape Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite these advances, there are few studies dealing with the generation of 4D dynamic shapes that have the form of 3D objects deforming over time. To bridge this gap, we focus on generating 4D dynamic shapes with an emphasis on both generation quality and efficiency in this paper. |
Gyojin Han; Jiwan Hur; Jaehyun Choi; Junmo Kim; |
432 | UpFusion: Novel View Diffusion from Unposed Sparse View Observations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose UpFusion, a system that can perform novel view synthesis and infer 3D representations for generic objects given a sparse set of reference images without corresponding pose information. |
Bharath Raj Nagoor Kani; Hsin-Ying Lee; Sergey Tulyakov; Shubham Tulsiani; |
433 | You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces a unified and versatile framework (HQNet) for single-stage multi-person multi-task human-centric perception (HCP). To address this gap, we propose the COCO-UniHuman benchmark to enable model development and comprehensive evaluation. |
Sheng Jin; Shuhuai Li; Tong Li; Wentao Liu; Chen Qian; Ping Luo; |
434 | UniFS: Universal Few-shot Instance Perception with Point Representations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Existing few-shot learning methods primarily focus on a restricted set of tasks, presumably due to the challenges involved in designing a generic model capable of representing diverse tasks in a unified manner. In this paper, we propose UniFS, a universal few-shot instance perception model that unifies a wide range of instance perception tasks by reformulating them into a dynamic point representation learning framework. |
Sheng Jin; Ruijie Yao; Lumin Xu; Wentao Liu; Chen Qian; Ji Wu; Ping Luo; |
435 | Think2Drive: Efficient Reinforcement Learning By Thinking with Latent World Model for Autonomous Driving (in CARLA-v2) Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we take the initiative to directly train a neural planner, in the hope of handling corner cases flexibly and effectively. |
Qifeng Li; Xiaosong Jia; Shaobo Wang; Junchi Yan; |
436 | KDProR: A Knowledge-Decoupling Probabilistic Framework for Video-Text Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, those approaches diverge from human learning paradigms, where humans possess the capability to seek and associate knowledge from an open set, rather than rote memorizing all text-video instances. Motivated by this, we attempt to decouple knowledge from retrieval models through multi-grained knowledge stores and identify two significant benefits of our knowledge-decoupling strategy: (1) it ensures a harmonious balance between knowledge memorization and retrieval optimization, thereby improving retrieval performance and (2) it can promote incorporating diverse open-world knowledge to augment video-text retrieval. |
Xianwei Zhuang; Hongxiang Li; Xuxin Cheng; Zhihong Zhu; Yuxin Xie; Yuexian Zou; |
437 | MLPHand: Real Time Multi-View 3D Hand Reconstruction Via MLP Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose MLPHand, a novel method designed for real-time multi-view single hand reconstruction. |
Jian Yang; Jiakun Li; Guoming Li; Huaiyu Wu; Zhen Shen; Zhaoxin Fan; |
438 | MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present MoMA: an open-vocabulary, training-free personalized image model that boasts flexible zero-shot capabilities. |
Kunpeng Song; Yizhe Zhu; Bingchen Liu; Qing Yan; Ahmed Elgammal; Xiao Yang; |
439 | Learning with Counterfactual Explanations for Radiology Report Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To tackle these, we propose a novel CounterFactual Explanations-based framework (CoFE) for radiology report generation. |
Mingjie Li; Haokun Lin; Liang Qiu; Xiaodan Liang; Ling Chen; Abdulmotaleb Elsaddik; Xiaojun Chang; |
440 | AMD: Automatic Multi-step Distillation of Large-scale Vision Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a novel approach named Automatic Multi-step Distillation (AMD) for large-scale vision model compression. |
Cheng Han; Qifan Wang; Sohail A Dianat; Majid Rabbani; Raghuveer Rao; Yi Fang; Qiang Guan; Lifu Huang; Dongfang Liu; |
441 | Curved Diffusion: A Generative Model With Optical Geometry Control Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study introduces a framework that intimately integrates a text-to-image diffusion model with the particular lens geometry used in image rendering. |
Andrey Voynov; Amir Hertz; Moab Arar; Shlomi Fruchter; Daniel Cohen-Or; |
442 | StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Although these models have shown notable progress, there are still three flaws. 1) The unidirectional, auto-regressive generation restricts usability in many scenarios. 2) The additionally introduced story history encoders bring an extremely high computational cost. 3) The story visualization and continuation models are trained and inferred independently, which is not user-friendly. To these ends, we propose a bidirectional, unified, and efficient framework, namely StoryImager. |
Ming Tao; Bingkun Bao; Hao Tang; Yaowei Wang; Changsheng Xu; |
443 | PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Consequently, a simple combination of them cannot guarantee accomplishing both training efficiency and inference efficiency with minimal costs. In this paper, we propose a novel Parallel Yielding Re-Activation (PYRA) method for such a challenge of training-inference efficient task adaptation. |
Yizhe Xiong; Hui Chen; Tianxiang Hao; Zijia Lin; Jungong Han; Yuesong Zhang; Guoxin Wang; Yongjun Bao; Guiguang Ding; |
444 | ConDense: Consistent 2D-3D Pre-training for Dense and Sparse Features from Multi-View Images Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To advance the state of the art in the creation of 3D foundation models, this paper introduces the ConDense framework for 3D pre-training utilizing existing pre-trained 2D networks and large-scale multi-view datasets. |
Xiaoshuai Zhang; Zhicheng Wang; Howard Zhou; Soham Ghosh; Danushen L Gnanapragasam; Varun Jampani; Hao Su; Leonidas Guibas; |
445 | Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a new framework that automatically generates high-quality segmentation masks with their referring expressions as pseudo supervisions for referring image segmentation (RIS). |
Seonghoon Yu; Paul Hongsuck Seo; Jeany Son; |
446 | TIBET: Identifying and Evaluating Biases in Text-to-Image Generative Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a general approach to study and quantify a broad spectrum of biases, for any TTI model and for any prompt, using counterfactual reasoning. |
Aditya Chinchure; Pushkar Shukla; Gaurav Bhatt; Kiri Salij; Kartik Hosanagar; Leonid Sigal; Matthew Turk; |
447 | Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, it suffers from a significant lack of large-scale and diverse datasets, impeding comprehensive model evaluation and curtailing downstream applications. To address these issues, this paper introduces Omni6DPose, a substantial benchmark featuring diversity in object categories, large scale, and variety in object materials. |
Jiyao Zhang; Weiyao Huang; Bo Peng; Mingdong Wu; Fei Hu; Zijian Chen; Bo Zhao; Hao Dong; |
448 | Cross-view Image Geo-localization with Panorama-BEV Co-Retrieval Network Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a new approach for cross-view image geo-localization, i.e., the Panorama-BEV Co-Retrieval Network. |
Junyan Ye; Zhutao Lv; Weijia Li; Jinhua Yu; Haote Yang; Huaping Zhong; Conghui He; |
449 | Adversarial Robustification Via Text-to-Image Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data. |
Daewon Choi; Jongheon Jeong; Huiwon Jang; Jinwoo Shin; |
450 | The Lottery Ticket Hypothesis in Denoising: Towards Semantic-Driven Initialization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We formulate the lottery ticket hypothesis in denoising: randomly initialized Gaussian noise images contain special pixel blocks (winning tickets) that naturally tend to be denoised into specific content independently. |
Jiafeng Mao; Xueting Wang; Kiyoharu Aizawa; |
451 | "Close, But Not There: Boosting Geographic Distance Sensitivity in Visual Place Recognition" Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we formulate how limitations in the Geographic Distance Sensitivity of current VPR embeddings result in a high probability of incorrectly sorting the top-k retrievals, negatively impacting the recall. |
Sergio Izquierdo; Javier Civera; |
452 | SweepNet: Unsupervised Learning Shape Abstraction Via Neural Sweepers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce SweepNet, a novel approach to shape abstraction through sweep surfaces. |
Mingrui Zhao; Yizhi Wang; Fenggen Yu; Changqing Zou; Ali Mahdavi-Amiri; |
453 | DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing text-to-3D methods struggle with this task, as they are designed to generate either non-independent objects or independent objects lacking spatially plausible interactions. Addressing this, we propose DreamDissector, a text-to-3D method capable of generating multiple independent objects with interactions. |
Zizheng Yan; Jiapeng Zhou; Fanpeng Meng; Yushuang Wu; Lingteng Qiu; Zisheng Ye; Shuguang Cui; Guanying CHEN; Xiaoguang Han; |
454 | FisherRF: Active View Selection and Mapping with Radiance Fields Using Fisher Information Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study addresses the challenging problem of active view selection and uncertainty quantification within the domain of Radiance Fields. |
Wen Jiang; BOSHU LEI; Kostas Daniilidis; |
455 | V2X-Real: A Large-Scale Dataset for Vehicle-to-Everything Cooperative Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present V2X-Real, a large-scale dataset that includes a mixture of multiple vehicles and smart infrastructure to facilitate the V2X cooperative perception development with multi-modality sensing data. |
Hao Xiang; Xin Xia; Zhaoliang Zheng; Runsheng Xu; Letian Gao; Zewei Zhou; Xu Han; Xinkai Ji; Mingxi Li; Zonglin Meng; Li Jin; Mingyue Lei; Zhaoyang Ma; Zihang He; Haoxuan Ma; Yunshuang Yuan; Yingqian Zhao; Jiaqi Ma; |
456 | Eliminating Feature Ambiguity for Few-Shot Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents a novel plug-in termed ambiguity elimination network (AENet), which can be plugged into any existing cross attention-based FSS methods. |
Qianxiong Xu; Guosheng Lin; Chen Change Loy; Cheng Long; Ziyue Li; Rui Zhao; |
457 | Improving Adversarial Transferability Via Model Alignment Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a novel model alignment technique aimed at improving a given source model’s ability to generate transferable adversarial perturbations. |
Avery Ma; Amir-massoud Farahmand; Yangchen Pan; Philip Torr; Jindong Gu; |
458 | ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a dynamic Gaussian Splatting method named ManiGaussian for multi-task robotic manipulation, which mines scene dynamics via future scene reconstruction. |
Guanxing Lu; Shiyi Zhang; Ziwei Wang; Changliu Liu; Jiwen Lu; Yansong Tang; |
459 | Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods fail to account for the alignment noise, i.e., narrations irrelevant to the instructional task in videos and unreliable timestamps in narrations. To address these challenges, this work proposes a novel training framework. |
Yuxiao Chen; Kai Li; Wentao Bao; Deep Patel; Yu Kong; Martin Renqiang Min; Dimitris N. Metaxas; |
460 | ReNoise: Real Image Inversion Through Iterative Noising Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce an inversion method with a high quality-to-operation ratio, enhancing reconstruction accuracy without increasing the number of operations. |
Daniel Garibi; Or Patashnik; Andrey Voynov; Hadar Averbuch-Elor; Danny Cohen-Or; |
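One way to read the iterative noising idea: each step of a deterministic sampler's inversion is an implicit equation in the next, noisier latent, which can be refined by fixed-point iteration (a hedged sketch in generic notation; \varphi stands for the sampler's update rule and is not notation from the paper):

```latex
z_{t+1} = \varphi\big(z_t,\ \epsilon_\theta(z_{t+1},\,t{+}1)\big),
\qquad
z_{t+1}^{(k+1)} = \varphi\big(z_t,\ \epsilon_\theta(z_{t+1}^{(k)},\,t{+}1)\big),
\qquad
z_{t+1}^{(0)} = \varphi\big(z_t,\ \epsilon_\theta(z_t,\,t)\big).
```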
461 | 3×2: 3D Object Part Segmentation By 2D Semantic Correspondences Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose to leverage a few annotated 3D shapes or richly annotated 2D datasets to perform 3D object part segmentation. We present our novel approach, termed 3-By-2, which achieves SOTA performance on different benchmarks with various granularity levels. |
Anh Thai; Weiyao Wang; Hao Tang; Stefan Stojanov; James M Rehg; Matt Feiszli; |
462 | DEPICT: Diffusion-Enabled Permutation Importance for Image Classification Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a permutation-based explanation method for image classifiers. |
Sarah Jabbour; Gregory Kondas; Ella Kazerooni; Michael Sjoding; David Fouhey; Jenna Wiens; |
463 | PPAD: Iterative Interactions of Prediction and Planning for End-to-end Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a new interaction mechanism of prediction and planning for end-to-end autonomous driving, called PPAD (Iterative Interaction of Prediction and Planning Autonomous Driving), which considers the timestep-wise interaction to better integrate prediction and planning. |
Zhili Chen; Maosheng Ye; Shuangjie Xu; Tongyi Cao; Qifeng Chen; |
464 | Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The Bird’s-Eye-View (BEV) representation is a critical factor that directly impacts the 3D object detection performance, but the traditional BEV grid representation induces quadratic computational cost as the spatial resolution grows. To address this limitation, we present a new camera-based 3D object detector with high-resolution vector representation: VectorFormer. |
Zhili Chen; Shuangjie Xu; Maosheng Ye; Zian Qian; Xiaoyi Zou; Dit-Yan Yeung; Qifeng Chen; |
465 | Make A Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes a novel self-cascade diffusion model that leverages the knowledge gained from a well-trained low-resolution image/video generation model, enabling rapid adaptation to higher-resolution generation. |
Lanqing Guo; Yingqing He; Haoxin Chen; Menghan Xia; Xiaodong Cun; Yufei Wang; Siyu Huang; Yong Zhang; Xintao Wang; Qifeng Chen; Ying Shan; Bihan Wen; |
466 | GRiT: A Generative Region-to-text Transformer for Object Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. |
Jialian Wu; Jianfeng Wang; Zhengyuan Yang; Zhe Gan; Zicheng Liu; Junsong Yuan; Lijuan Wang; |
467 | Self-Cooperation Knowledge Distillation for Novel Class Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Therefore, these methods suffer from a challenging trade-off between reviewing known classes and discovering novel classes. Based on this observation, we propose a Self-Cooperation Knowledge Distillation (SCKD) method to utilize each training sample (whether known or novel, labeled or unlabeled) for both review and discovery. |
Yuzheng Wang; Zhaoyu Chen; Dingkang Yang; Yunquan Sun; Lizhe Qi; |
468 | Expressive Whole-Body 3D Gaussian Avatar Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present ExAvatar, an expressive whole-body 3D human avatar learned from a short monocular video. |
Gyeongsik Moon; Takaaki Shiratori; Shunsuke Saito; |
469 | UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While these questions can be studied by employing multiple datasets, doing so is challenging due to several discrepancies, e.g., in data formats, map resolution, and semantic annotation types. To address these challenges, we introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria, presenting new opportunities for the vehicle trajectory prediction field. |
Lan Feng; Mohammadhossein Bahari; Kaouther Messaoud; Eloi Zablocki; Matthieu Cord; Alexandre Alahi; |
470 | Implicit Concept Removal of Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this, we utilize the intrinsic geometric characteristics of implicit concepts and present a novel concept removal method based on geometry-driven control. Moreover, we introduce the Implicit Concept Dataset (ICD), a novel image-text dataset imbued with three typical implicit concepts (QR codes, watermarks, and text), reflecting real-life situations where implicit concepts are easily injected. |
Zhili Liu; Kai Chen; Yifan Zhang; Jianhua Han; Lanqing Hong; Hang Xu; Zhenguo Li; Dit-Yan Yeung; James Kwok; |
471 | Towards Stable 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, detection stability cannot be assessed by existing metrics such as mAP and MOTA, and consequently it is less explored by the community. To bridge this gap, this work proposes a new metric that can comprehensively evaluate the stability of 3D detectors in terms of confidence, box localization, extent, and heading. |
Jiabao Wang; Qiang Meng; Guochao Liu; Liujiang Yan; Ke Wang; Ming-Ming Cheng; Qibin Hou; |
472 | 3DFG-PIFu: 3D Feature Grids for Human Digitization from Sparse Views Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To a large extent, this defeats the purpose of having multi-view information since the multi-view task in question is predominantly treated as a single-view task. To resolve this, we introduce 3DFG-PIFu, a pixel-aligned implicit model that exploits multi-view information right from the start and all the way to the end of the pipeline. |
Kennard Yanting Chan; Fayao Liu; Guosheng Lin; Chuan Sheng Foo; Weisi Lin; |
473 | Learn from The Learnt: Source-Free Active Domain Adaptation Via Contrastive Sampling and Visual Persistence Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response, we present learn from the learnt (LFTL), a novel paradigm for SFADA to leverage the learnt knowledge from the source pretrained model and actively iterated models without extra overhead. |
Mengyao Lyu; Tianxiang Hao; Xinhao Xu; Hui Chen; Zijia Lin; Jungong Han; Guiguang Ding; |
474 | Learning Local Pattern Modularization for Point Cloud Reconstruction from Unseen Classes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, reconstruction accuracy and interpretability still leave much room for improvement. To resolve this issue, we introduce a method that learns local pattern modularization for reconstructing 3D shapes in unseen classes, achieving both good generalization ability and high reconstruction accuracy. |
Chao Chen; Yu-Shen Liu; Zhizhong Han; |
475 | AddMe: Zero-shot Group-photo Synthesis By Inserting People Into Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Concretely, existing customization methods struggle to insert facial identities at desired locations in existing images, and it is difficult for existing local image editing methods to deal with facial details. To address these limitations, we propose AddMe, a powerful diffusion-based portrait generator that can insert a given portrait into a desired location in an existing scene image in a zero-shot manner. |
Dongxu Yue; Maomao Li; Yunfei Liu; Ailing Zeng; Tianyu Yang; Qin Guo; Yu Li; |
476 | Bayesian Self-Training for Semi-Supervised 3D Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, inspired by Bayesian deep learning, we first propose a Bayesian self-training framework for semi-supervised 3D semantic segmentation. Employing stochastic inference, we generate an initial set of pseudo-labels and then filter these based on estimated point-wise uncertainty. By constructing a heuristic n-partite matching algorithm, we extend the method to semi-supervised 3D instance segmentation, and finally, with the same building blocks, to dense 3D visual grounding. |
Ozan Unal; Christos Sakaridis; Luc Van Gool; |
477 | Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a dense 3D grounding network ConcreteNet, featuring four novel stand-alone modules that aim to improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. |
Ozan Unal; Christos Sakaridis; Suman Saha; Luc Van Gool; |
478 | To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images … For Now Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Despite the development of safety-driven unlearning techniques to counteract these challenges, doubts about their efficacy persist. To tackle this issue, we introduce an evaluation framework that leverages adversarial prompts to discern the trustworthiness of these safety-driven DMs after they have undergone the process of unlearning harmful concepts. |
Yimeng Zhang; Jinghan Jia; Xin Chen; Aochuan Chen; Yihua Zhang; Jiancheng Liu; Ke Ding; Sijia Liu; |
479 | Restore Anything with Masks: Leveraging Mask Image Modeling for Blind All-in-One Image Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a simple pipeline for all-in-one blind image restoration to Restore Anything with Masks. |
Chujie Qin; Ruiqi Wu; Zikun Liu; Xin Lin; Chun-Le Guo; Hyun Hee Park; Chongyi Li; |
480 | RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present RodinHD, which can generate high-fidelity 3D avatars from a portrait image. |
Bowen Zhang; Yiji Cheng; Chunyu Wang; Ting Zhang; Jiaolong Yang; Yansong Tang; Feng Zhao; Dong Chen; Baining Guo; |
481 | Vary: Scaling Up The Vision Vocabulary for Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Accordingly, we propose Vary, an efficient and productive method to scale up the Vision vocabulary of LVLMs. |
Haoran Wei; Lingyu Kong; Jinyue Chen; Liang Zhao; Zheng Ge; Jinrong Yang; Jianjian Sun; Chunrui Han; Xiangyu Zhang; |
482 | PACE: Pose Annotations in Cluttered Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce PACE (Pose Annotations in Cluttered Environments), a large-scale benchmark designed to advance the development and evaluation of pose estimation methods in cluttered scenarios. |
Yang You; Kai Xiong; Zhening Yang; Zhengxiang Huang; Junwei Zhou; Ruoxi Shi; Zhou Fang; Adam Harley; Leonidas Guibas; Cewu Lu; |
483 | Modeling and Driving Human Body Soundfields Through Acoustic Primitives Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present a framework that allows for high-quality spatial audio generation, capable of rendering the full 3D soundfield generated by a human body, including speech, footsteps, hand-body interactions, and others. |
Chao Huang; Dejan Markovic; Chenliang Xu; Alexander Richard; |
484 | Multiscale Sliced Wasserstein Distances As Perceptual Color Difference Measures Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we describe a perceptual CD measure based on the multiscale sliced Wasserstein distance, which facilitates efficient comparisons between non-local patches of similar color and structure. |
Jiaqi He; Zhihua Wang; Leon Wang; Tsein-I Liu; Yuming Fang; Qilin Sun; Kede Ma; |
485 | Blind Image Deblurring with Noise-robust Kernel Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here, we propose a blind deblurring method based on a noise-robust kernel estimation function and deep image prior (DIP). |
Chanseok Lee; Jeongsol Kim; Seungmin Lee; Jaehwang Jung; Yunje Cho; Taejoong Kim; Taeyong Jo; Myungjun Lee; Mooseok Jang; |
486 | PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: On the other hand, the integration of existing diffusion models, each specialized for different controls and operating in unique latent spaces, poses a challenge due to incompatible image resolutions and latent space embedding structures, hindering their joint use. Addressing these constraints, we present “PanGu-Draw”, a novel latent diffusion model designed for resource-efficient text-to-image synthesis that adeptly accommodates multiple control signals. |
Guansong Lu; Yuanfan Guo; Jianhua Han; Minzhe Niu; Yihan Zeng; Songcen Xu; Wei Zhang; Hang Xu; Zhao Zhong; Zeyi Huang; |
487 | "PointNeRF++: A Multi-scale, Point-based Neural Radiance Field" Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Neural rendering methods based on point clouds do exist, but they do not perform well when the point cloud is sparse or incomplete, which is often the case with real-world data. We overcome these problems with a simple representation that aggregates point clouds at multiple scale levels with sparse voxel grids at different resolutions. |
Weiwei Sun; Eduard Trulls; Yang-Che Tseng; Sneha Sambandam; Gopal Sharma; Andrea Tagliasacchi; Kwang Moo Yi; |
488 | Griffon: Spelling Out All Object Locations at Any Granularity with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Building on this insight, we introduce a novel Language-prompted Localization Dataset to fully unleash the capabilities of LVLMs in fine-grained object perception and precise location awareness. More importantly, we present Griffon, a purely LVLM-based baseline that does not introduce any special tokens, expert models, or additional detection modules. |
Yufei Zhan; Yousong Zhu; Zhiyang Chen; Fan Yang; Ming Tang; Jinqiao Wang; |
489 | TF-FAS: Twofold-Element Fine-Grained Semantic Guidance for Generalizable Face Anti-Spoofing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods only utilize coarse-grained or single-element prompts for fine-tuning FAS tasks, without fully exploring the potential of language supervision, leading to unsatisfactory generalization ability. To address these concerns, we propose a novel framework called TF-FAS, which aims to thoroughly explore and harness twofold-element fine-grained semantic guidance to enhance generalization. |
Xudong Wang; Ke-Yue Zhang; Taiping Yao; Qianyu Zhou; Shouhong Ding; Pingyang Dai; Rongrong Ji; |
490 | ControlCap: Controllable Region-level Captioning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we propose a controllable region-level captioning (ControlCap) approach, which introduces control words to a multimodal model to address the caption degeneration issue. |
Yuzhong Zhao; Liu Yue; Zonghao Guo; Weijia Wu; Chen Gong; Qixiang Ye; Fang Wan; |
491 | Enhancing Tampered Text Detection Through Frequency Feature Fusion and Decomposition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods often fail to integrate this information effectively, thereby compromising RGB detection capabilities and missing the high-frequency details necessary to detect subtle tampering. To address these gaps, we introduce a Feature Fusion and Decomposition Network (FFDN) that combines a Visual Enhancement Module (VEM) with a Wavelet-like Frequency Enhancement (WFE). |
Zhongxi Chen; Shen Chen; Taiping Yao; Ke Sun; Shouhong Ding; Xianming Lin; Liujuan Cao; Rongrong Ji; |
492 | AdaDiff: Accelerating Diffusion Models Through Step-Wise Adaptive Computation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose AdaDiff, an adaptive framework that dynamically allocates computation resources in each sampling step to improve the generation efficiency of diffusion models. |
Shengkun Tang; Yaqing Wang; Caiwen Ding; Yi Liang; Yao Li; Dongkuan Xu; |
493 | SILC: Improving Vision Language Pretraining with Self-Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce SILC, a novel framework for vision language pretraining. |
Muhammad Ferjad Naeem; Yongqin Xian; Xiaohua Zhai; Lukas Hoyer; Luc Van Gool; Federico Tombari; |
494 | Model Stock: All We Need Is Just A Few Fine-tuned Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces an efficient fine-tuning method for large pre-trained models, offering strong in-distribution (ID) and out-of-distribution (OOD) performance. |
Dong-Hwan Jang; Sangdoo Yun; Dongyoon Han; |
495 | Safe-Sim: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, traditional methods for generating such scenarios often fall short in terms of controllability and realism; they also neglect the dynamics of agent interactions. To address these limitations, we introduce Safe-Sim, a novel diffusion-based controllable closed-loop safety-critical simulation framework. |
Wei-Jer Chang; Francesco Pittaluga; Masayoshi Tomizuka; Wei Zhan; Manmohan Chandraker; |
496 | GRA: Detecting Oriented Objects Through Group-wise Rotating and Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a lightweight yet effective Group-wise Rotating and Attention (GRA) module to replace the convolution operations in backbone networks for oriented object detection. |
Jiangshan Wang; Yifan Pu; Yizeng Han; Jiayi Guo; Yiru Wang; Xiu Li; Gao Huang; |
497 | LLaVA-UHD: An LMM Perceiving Any Aspect Ratio and High-Resolution Images Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address these challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images of any aspect ratio and high resolution. |
Zonghao Guo; Ruyi Xu; Yuan Yao; Junbo Cui; Zanlin Ni; Chunjiang Ge; Tat-Seng Chua; Zhiyuan Liu; Gao Huang; |
498 | DiffuMatting: Synthesizing Arbitrary Objects with Matting-level Annotation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Because obtaining highly accurate matting annotations is difficult and labor-intensive, only a limited number of such labels are publicly available. To tackle this challenge, we propose DiffuMatting, which inherits the strong ‘everything’ generation ability of diffusion models and endows it with the power of ‘matting anything’. |
Xiaobin Hu; Xu Peng; Donghao Luo; Xiaozhong Ji; Jinlong Peng; Zhengkai Jiang; Jiangning Zhang; Taisong Jin; Chengjie Wang; Rongrong Ji; |
499 | Text Motion Translator: A Bi-Directional Model for Enhanced 3D Human Motion Generation from Open-Vocabulary Descriptions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To tackle a key challenge in Text2Motion, the scarcity of 3D human motions and their corresponding textual descriptions, we built a novel large-scale 3D human motion dataset, LaViMo, extracted from in-the-wild web videos and action recognition datasets. |
Yijun Qian; Jack Urbanek; Alexander Hauptmann; Jungdam Won; |
500 | DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While reverse diffusion sampling often requires adjustments to the LDM architecture or feature engineering, score distillation offers a simple yet powerful model-agnostic approach, but it is often prone to mode collapse. To address these limitations and leverage the strengths of both approaches, we introduce a novel framework called DreamSampler, which seamlessly integrates these two distinct approaches through the lens of regularized latent optimization. |
Jeongsol Kim; Geon Yeong Park; Jong Chul Ye; |
This table only includes 500 papers selected by our daily digest algorithm. To continue with the full list (~2,300 papers), please visit Paper Digest: ECCV-2024 (Full List).