Paper Digest: AAAI 2026 Papers & Highlights
Note: AAAI-2026 accepted more than 4,000 papers; this page includes only 500 of them, selected by our daily paper digest algorithm. Interested users can read all 4,300 AAAI-2026 papers on a separate page, which takes quite some time to load.
To search for papers presented at AAAI-2026 on a specific topic, please use the search by venue (AAAI-2026) service. To summarize the latest research published at AAAI-2026 on a specific topic, you can use the review by venue (AAAI-2026) service. If you are interested in browsing papers by author, we have a comprehensive list of ~17,700 authors (AAAI-2026). Additionally, you may want to explore our “Best Paper” Digest (AAAI), which lists the most influential AAAI papers since 1982.
As a pioneer in the field since 2018, Paper Digest has curated thousands of such lists, drawing on years of accumulated data across decades of conferences and research topics. To ensure users never miss a breakthrough, our daily digest service sifts through tens of thousands of new papers, clinical trials, news articles, and community posts every day, delivering only what matters most to your specific interests. Beyond discovery, Paper Digest offers built-in research tools to help users read articles, write articles, get answers, conduct literature reviews, and generate research reports more efficiently.
Paper Digest Team
New York City, New York, 10017
team@paperdigest.org
TABLE 1: Paper Digest: AAAI 2026 Papers & Highlights
| # | Paper | Author(s) |
|---|---|---|
| 1 | LENS: Learning to Segment Anything with Unified Reinforced Reasoning. Highlight: However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. | Lianghui Zhu; Bin Ouyang; Yuxuan Zhang; Tianheng Cheng; Rui Hu; Haocheng Shen; Longjin Ran; Xiaoxin Chen; Li Yu; Wenyu Liu; Xinggang Wang |
| 2 | Binary-Gaussian: Compact and Progressive Representation for 3D Gaussian Segmentation. Highlight: Moreover, fine-grained segmentation remains challenging due to label space congestion and the lack of stable multi-granularity control mechanisms. To address these limitations, we propose a coarse-to-fine binary encoding scheme for per-Gaussian category representation, which compresses each feature into a single integer via the binary-to-decimal mapping, drastically reducing memory usage. | An Yang; Chenyu Liu; Jun Du; Jianqing Gao; Jia Pan; Jinshui Hu; Baocai Yin; Bing Yin; Cong Liu |
| 3 | Towards Better Correctness and Efficiency in Code Generation. Highlight: While code large language models have demonstrated remarkable progress in code generation, the generated code often exhibits poor runtime efficiency, limiting its practical application in performance-sensitive scenarios. To address this limitation, we propose an efficiency-oriented reinforcement learning framework guided by a novel performance reward. | Yunlong Feng; Yang Xu; Xiao Xu; Binyuan Hui; Junyang Lin |
| 4 | Reasoning with Exploration: An Entropy Perspective. Highlight: Despite recent advances in enhancing language model (LM) reasoning, most methods lean toward exploitation, and increasingly encounter performance plateaus. In this work, we revisit entropy, a signal of exploration in RL, and examine its relationship to exploratory reasoning in LMs. | Daixuan Cheng; Shaohan Huang; Xuekai Zhu; Bo Dai; Xin Zhao; Zhenliang Zhang; Furu Wei |
| 5 | Game Ground Bench: Probing The Limits of LVLMs in Complex Semantic Grounding Across Game Universes. Highlight: Due to limited data scale, fine-tuning them for gaming scenarios is also challenging. To address this, we propose Game-R1, a novel training method centered on the Grounded Reinforcement Policy Optimization (GRPO) algorithm. | Zhangyang Qi; Jinsong Li; Hongjian Wu; Jiaqi Wang; Hengshuang Zhao |
| 6 | Fault Diagnosis of Irregular Sequences By Adjoint Learning in Continuous-Time Model Space. Highlight: In response, this paper proposes FD by adjoint learning in continuous-time model space. | Xiren Zhou; Chuyang Wei; Ao Chen; Shikang Liu; Xiangyu Wang; Huanhuan Chen |
| 7 | Spikingformer: A Key Foundation Model for Spiking Neural Networks. Highlight: In this paper, we analyze the spike-driven behavior of the residual connection methods in SNNs. | Chenlin Zhou; Liutao Yu; Zhaokun Zhou; Han Zhang; Jiaqi Wang; Huihui Zhou; Zhengyu Ma; Yonghong Tian |
| 8 | MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models. Highlight: However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models’ reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs’ comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. | Jiacheng Ruan; Dan Jiang; Xian Gao; Ting Liu; Yuzhuo Fu; Yangyang Kang |
| 9 | PharmaQA: Prompt-Based Molecular Representation Learning Via Pharmacophore-Oriented Question Answering. Highlight: However, their incorporation into molecular representation learning, particularly in a context reasoning or generalization, remains relatively limited. To address this gap, we propose PharmaQA, a pharmacophore-oriented question answering framework that formulates tailored prompts to extract context-aware molecular semantics. | Chengwei Ai; Qiaozhen Meng; Mengwei Sun; Ruihan Dong; Hongpeng Yang; Shiqiang Ma; Xiaoyi Liu; Cheng Liang; Fei Guo |
| 10 | Reasoning Transfer for An Extremely Low-Resource and Endangered Language: Bridging Languages Through Sample-Efficient Language Understanding. Highlight: We introduce English-Pivoted CoT Training, leveraging the insight that LLMs internally operate in a latent space aligned toward the dominant language. | Khanh-Tung Tran; Barry O’Sullivan; Hoang D. Nguyen |
| 11 | VaccineRAG: Boosting Multimodal Large Language Models’ Immunity to Harmful RAG Samples. Highlight: However, the effectiveness of RAG is frequently hindered by the precision of the retriever: many retrieved samples fed into the generation phase are irrelevant or misleading, posing a critical bottleneck to LLMs’ performance. To address this challenge, we introduce VaccineRAG, a novel Chain-of-Thought-based retrieval-augmented generation dataset. | Qixin Sun; Ziqin Wang; Hengyuan Zhao; Yilin Li; Kaiyou Song; Si Liu; Xiaolin Hu; Qingpei Guo; Linjiang Huang |
| 12 | VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation. Highlight: We present VisionReward, a general framework for learning human visual preferences in both image and video generation. | Jiazheng Xu; Yu Huang; Jiale Cheng; Yuanming Yang; Jiajun Xu; Yuan Wang; Wenbo Duan; Shen Yang; Qunlin Jin; Shurun Li; Jiayan Teng; Zhuoyi Yang; Wendi Zheng; Xiao Liu; Dan Zhang; Ming Ding; Xiaohan Zhang; Shiyu Huang; Xiaotao Gu; Minlie Huang; Jie Tang; Yuxiao Dong |
| 13 | Sample-specific Modality Diagnosis and Cross-modal Enhancement for Incomplete Multimodal Representations. Highlight: In this way, the proposed CMCG method can avoid redundancy and inconsistency, enhancing the consistency and discriminativity of the fused representation. | Junsong Chen; Jiyuan Liu; Suyuan Liu; Wei Zhang; Ao Li; En Zhu; Xinwang Liu |
| 14 | HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches. Highlight: Simply training an agent equipped with multiple search tools using flat reinforcement learning (RL) is a straightforward idea, but it has problems such as low training data efficiency and poor mastery of complex tools. To address the above issue, we propose a hierarchical agentic deep search framework, HierSearch, trained with hierarchical RL. | Jiejun Tan; Zhicheng Dou; Yan Yu; Jiehan Cheng; Lifeng Liu; Jian Xie; Jirong Wen |
| 15 | Dynamic Semantic Tokenization for Time Series Via Elastic Sampling on Physics-aware Perception. Highlight: This paper introduces a novel tokenization framework known as physics-aware tokenization (PATK), designed to implement adaptive time-frequency tokenization via distribution-sensitive sampling strategies. | Huaizhang Liao; Zhixiong Yang; Jingyuan Xia; Yuheng Sun; Yue Zhang; Shengxi Li; Yongxiang Liu |
| 16 | MetaAct-RL: Training Language Models for Reasoning Through Meta-Action-Based Reinforcement Learning. Highlight: To this end, we introduce MetaAct-RL, a new RL framework that frames LMs’ thinking as sequential decision making over meta-actions. | Zhiheng Xi; Yuhui Wang; Yiwen Ding; Guanyu Li; Senjie Jin; Shichun Liu; Jixuan Huang; Dingwen Yang; Jiafu Tang; Boyang Hong; Junjie Ye; Shihan Dou; Ming Zhang; Jian Guan; Wei Wu; Rui Zheng; Tao Gui; Qi Zhang; Xuanjing Huang |
| 17 | TaREx: Reinforcement Learning for Code-Driven Table Reasoning. Highlight: Automatically solving table reasoning tasks remains challenging due to three main factors: (1) diverse and hierarchical table structures that hinder comprehension, (2) the heavy reliance on complex logical and numerical reasoning, which makes purely text-based methods prone to hallucinations, and (3) the necessity of multi-step processing to handle intricate tasks involving multiple and lengthy tables. To address these challenges, we introduce TaREx, a novel framework that unifies table representation, integrates code-driven execution, and supports interactive multi-step reasoning. | Fangyu Lei; Jinxiang Meng; Yiming Huang; Shizhu He; Jun Zhao; Kang Liu |
| 18 | Computer Vision Modeling of The Development of Geometric and Numerical Concepts in Humans. Highlight: Building on this demonstrated cognitive alignment, the current study investigates whether CV models also show developmental alignment: whether their performance improves across training to match the developmental progressions observed in children. In a detailed case study of the ResNet-50 model, we show that this is the case. | Zekun Wang; Sashank Varma |
| 19 | Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices. Highlight: Both large parameter sizes and mismatched kernels cause out-of-memory errors or extremely slow inference on mobile devices. To address this, we propose a low-cost solution that efficiently transfers widely used video VAEs to mobile devices. | Ya Zou; Jingfeng Yao; Siyuan Yu; Shuai Zhang; Wenyu Liu; Xinggang Wang |
| 20 | Scalable Vision-Guided Crop Yield Estimation. Highlight: Existing data collection methods, such as crop cuts in randomly sampled fields at harvest time, are relatively time-consuming. Thus, we propose an approach based on prediction-powered inference (PPI) to supplement these crop cuts with less time-consuming field photos. | Harrison H Li; Medhanie Irgau; Nabil Janmohamed; Karen Solveig Rieckmann; David B. Lobell |
| 21 | Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction. Highlight: We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching. | Yuerong Song; Xiaoran Liu; Ruixiao Li; Zhigeng Liu; Zengfeng Huang; Qipeng Guo; Ziwei He; Xipeng Qiu |
| 22 | Aware First, Think Less: Dynamic Boundary Self-Awareness Drives Significant Gains in Reasoning Efficiency in Large Language Models. Highlight: In this paper, we introduce the Dynamic Reasoning-Boundary Self-Awareness Framework (DR. SAF), which enables LLMs to dynamically assess and adjust their reasoning depth in response to problem complexity. | Qiguang Chen; Dengyun Peng; Jinhao Liu; Huikang Su; Jiannan Guan; Libo Qin; Wanxiang Che |
| 23 | Benchmarking LLMs’ Mathematical Reasoning with Unseen Random Variables Questions. Highlight: Consequently, developing a reliable benchmark that effectively evaluates large language models’ (LLMs) genuine capabilities in mathematical reasoning remains a critical challenge. To address these concerns, we propose RV-Bench, a novel evaluation methodology for Benchmarking LLMs with Random Variables in mathematical reasoning. | Zijin Hong; Hao Wu; Su Dong; Junnan Dong; Yilin Xiao; Yujing Zhang; Zhu Wang; Feiran Huang; Linyi Li; Hongxia Yang; Xiao Huang |
| 24 | A Multi-Agent Conversational Bandit Approach to Online Evaluation and Selection of User-Aligned LLM Responses. Highlight: This study introduces a novel online evaluation framework that employs a multi-agent conversational bandit model to select optimal responses while aligning with user preferences dynamically. | Xiangxiang Dai; Yuejin Xie; Maoli Liu; Xuchuang Wang; Zhuohua Li; Huanyu Wang; John C.S. Lui |
| 25 | Cliqueformer: Model-Based Optimization with Structured Transformers. Highlight: We present Cliqueformer, a transformer-based architecture that learns the black-box function’s structure through functional graphical models (FGM), addressing distribution shift without relying on explicit conservative approaches. | Jakub Grudzien Kuba; Pieter Abbeel; Sergey Levine |
| 26 | Cognitive Enhancement Chain-of-Thought Towards Enhancing Style Learning and Content Preservation for Long Style Transfer. Highlight: To this end, we propose Cognitive enhancement Chain-of-Thought (CeCoT) towards enhancing style learning and content preservation for long style transfer. | Lianwei Wu; Botao Wang; Wenbo An; Tieqiao Li; Xianghua Li |
| 27 | City Sampling for Citizens’ Assemblies. Highlight: We study a two-stage sampling problem faced by practitioners in countries such as Germany, in which constituents’ contact information is stored at a municipal level. | Paul Gölz; Jan Maly; Ulrike Schmidt-Kraepelin; Markus Utke; Philipp C. Verpoort |
| 28 | Towards High-Resolution 3D Anomaly Detection: A Scalable Dataset and Real-Time Framework for Subtle Industrial Defects. Highlight: In industrial point cloud analysis, detecting subtle anomalies demands high-resolution spatial data, yet prevailing benchmarks emphasize low-resolution inputs. To address this disparity, we propose a scalable pipeline for generating realistic and subtle 3D anomalies. | Yuqi Cheng; Yihan Sun; Hui Zhang; Weiming Shen; Yunkang Cao |
| 29 | Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation. Highlight: We propose Anomagic, a zero-shot anomaly generation method that produces semantically coherent anomalies without requiring any exemplar anomalies. | Yuxin Jiang; Wei Luo; Hui Zhang; Qiyu Chen; Haiming Yao; Weiming Shen; Yunkang Cao |
| 30 | ArchRAG: Attributed Community-based Hierarchical Retrieval-Augmented Generation. Highlight: To retrieve the most relevant information from the graph for the question, we build a novel hierarchical index structure for the attributed communities and develop an effective online retrieval method. | Shu Wang; Yixiang Fang; Yingli Zhou; Xilin Liu; Yuchi Ma |
| 31 | Self-NPO: Data-Free Diffusion Model Enhancement Via Truncated Diffusion Fine-Tuning. Highlight: In this work, we propose Self-NPO, specifically truncated diffusion fine-tuning, a data-free approach of negative preference optimization by directly learning from the model itself, eliminating the need for manual data labeling or reward model training. | Fu-Yun Wang; Keqiang Sun; Yao Teng; Xihui Liu; Jiale Yuan; Jiaming Song; Hongsheng Li |
| 32 | EchoMimicV3: 1.3B Parameters Are All You Need for Unified Multi-Modal and Multi-Task Human Animation. Highlight: Moreover, traditional work typically employs separate models for each animation task, increasing costs in multi-task scenarios and worsening the dilemma. To address these limitations, we introduce EchoMimicV3, an efficient framework that unifies multi-task and multi-modal human animation. | Rang Meng; Yan Wang; Weipeng Wu; Ruobing Zheng; Yuming Li; Chenguang Ma |
| 33 | Mitigating Low-Quality Reasoning in MLLMs: Self-Driven Refined Multimodal CoT with Selective Thinking and Step-wise Visual Enhancement. Highlight: In this paper, we discover that Multimodal Large Language Models (MLLMs) possess inherent capabilities to distinguish between simple and difficult queries and enhance task-related visual information, which remain underutilized by existing approaches. Based on this insight, we propose Self-Driven Refined Multimodal CoT (SDR-MCoT), a training-free framework that mitigates these issues through two self-driven modules. | Chongjun Tu; Peng Ye; Dongzhan Zhou; Tao Chen; Wanli Ouyang |
| 34 | Fact2Fiction: Targeted Poisoning Attack to Agentic Fact-checking System. Highlight: The security of these systems is crucial, as compromised fact-checkers can amplify misinformation, but remains largely underexplored. To bridge this gap, this work introduces a novel threat model against such fact-checking systems and presents Fact2Fiction, the first poisoning attack framework targeting SOTA agentic fact-checking systems. | Haorui He; Yupeng Li; Bin Benjamin Zhu; Dacheng Wen; Reynold Cheng; Francis C. M. Lau |
| 35 | Vision Transformers Are Circulant Attention Learners. Highlight: In this paper, we present a novel attention paradigm termed Circulant Attention by exploiting the inherent efficient pattern of self-attention. | Dongchen Han; Tianyu Li; Ziyi Wang; Gao Huang |
| 36 | Who Is Helping Whom? Analyzing Inter-Dependencies to Evaluate Cooperation in Human-AI Teaming. Highlight: Specifically, we are interested in understanding the cooperative behaviors arising within the team when trained agents are paired with humans – a problem that has been overlooked by the existing literature. To formally address this problem, we propose the concept of constructive interdependence – measuring how much agents rely on each other’s actions to achieve the shared goal – as a key metric for evaluating cooperation in human-agent teams. | Upasana Biswas; Vardhan Palod; Siddhant Bhambri; Subbarao Kambhampati |
| 37 | QuoTA: Query-oriented Token Assignment Via CoT Query Decouple for Long Video Comprehension. Highlight: In this paper, we propose QuoTA, an ante-hoc training-free module that extends existing large video-language models (LVLMs) for visual token assignment based on query-oriented frame-level importance assessment. | Yongdong Luo; Wang Chen; Weizhong Huang; Shukang Yin; Haojia Lin; Jinfa Huang; Chaoyou Fu; Jiayi Ji; Xiawu Zheng; Jiebo Luo |
| 38 | Inductive Generative Recommendation Via Retrieval-based Speculation. Highlight: In this paper, we propose SpecGR, a plug-and-play framework that enables GR models to recommend new items in an inductive setting. | Yijie Ding; Jiacheng Li; Julian McAuley; Yupeng Hou |
| 39 | Object-Centric Framework for Video Moment Retrieval. Highlight: In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios requiring detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. | Zongyao Li; Yongkang Wong; Satoshi Yamazaki; Jianquan Liu; Mohan Kankanhalli |
| 40 | EVOKE: Efficient and High-Fidelity EEG-to-Video Reconstruction Via Decoupling Implicit Neural Representation. Highlight: We propose EVOKE, an innovative framework for zero-shot decoding of high-fidelity videos from EEG signals. | Haodong Jing; Panqi Yang; Dongyao Jiang; Zhipeng Liu; Nanning Zheng; Yongqiang Ma |
| 41 | UI-R1: Enhancing Efficient Action Prediction of GUI Agents By Reinforcement Learning. Highlight: Despite its success in language tasks, its application in multimodal domains, particularly in graphic user interface (GUI) agent tasks, remains under-explored. To address this gap, we propose UI-R1, the first framework to investigate how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. | Zhengxi Lu; Yuxiang Chai; Yaxuan Guo; Xi Yin; Liang Liu; Hao Wang; Han Xiao; Shuai Ren; Pengxiang Zhao; Guangyi Liu; Guanjing Xiong; Hongsheng Li |
| 42 | Teaching Large Language Models to Maintain Contextual Faithfulness Via Synthetic Tasks and Reinforcement Learning. Highlight: Teaching large language models (LLMs) to be faithful in the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to reduce faithfulness hallucinations of LLMs across different downstream tasks without human annotations. | Shuzheng Si; Haozhe Zhao; Cheng Gao; Yuzhuo Bai; Zhitong Wang; Bofei Gao; Kangyang Luo; Wenhao Li; Yufei Huang; Gang Chen; Fanchao Qi; Minjia Zhang; Baobao Chang; Maosong Sun |
| 43 | RSA-CR: Resisting Shilling Attacks in Citation Recommendation Via Dumbbell Inductive Learning. Highlight: To address this problem, we theoretically reveal the impact of shilling attacks on citation recommendation and propose three feasible resistance strategies: historical collaborations, significant citations and content constraints. Based on these insights, we introduce RSA-CR, a robust and hybrid citation recommendation algorithm resistant to shilling attacks. | Xiyue Gao; Yukai Liu; Zhuoqi Ma; Xiaotian Qiao; Hui Li; Cai Xu; Kunhua Zhang; Jiangtao Cui |
| 44 | State-Space Hierarchical Compression with Gated Attention and Learnable Sampling for Hour-Long Video Understanding in Large Multimodal Models. Highlight: We propose an efficient framework to compress massive video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion arising from hour-long videos. | Geewook Kim; Minjoon Seo |
| 45 | Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection. Highlight: In this paper, we show that current approaches using large square kernels or transformer-based global modeling aggregate contextual information uniformly across spatial dimensions, leading to feature dilution and localization errors for elongated targets. | Xinbin Yuan; Zhaohui Zheng; Yuxuan Li; Xialei Liu; Li Liu; Xiang Li; Qibin Hou; Ming-Ming Cheng |
| 46 | LLaVA-UHD V2: Exploiting Hierarchical Vision Granularity in MLLMs Via Inverse Semantic Pyramid. Highlight: We attribute this to the inner limitations of ViTs in capturing diverse visual semantic levels. To address this, we present Hierarchical window (Hiwin) transformer as a plug-and-play solution for MLLMs, centered around our inverse semantic pyramid (ISP). | Yipeng Zhang; Yifan Liu; Zonghao Guo; Yidan Zhang; Xuesong Yang; Xiaoying Zhang; Chi Chen; Jun Song; Yuan Yao; Tat-Seng Chua; Maosong Sun |
| 47 | UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning. Highlight: In this paper, we leverage the advanced understanding capabilities of MLLMs to enhance representation learning, and present a novel Universal Multimodal Embedding (UniME-V2) model. | Tiancheng Gu; Kaicheng Yang; Kaichen Zhang; Xiang An; Ziyong Feng; Yueyi Zhang; Weidong Cai; Jiankang Deng; Lidong Bing |
| 48 | ResMAS: Resilience Optimization in LLM-based Multi-agent Systems. Highlight: In this work, we study the resilience of LLM-based MAS under perturbations and find that both the communication topology and prompt design significantly influence system resilience. | Zhilun Zhou; Zihan Liu; Jiahe Liu; Qingyu Shao; Yihan Wang; Kun Shao; Depeng Jin; Fengli Xu |
| 49 | UniHOI: Unified Human-Object Interaction Understanding Via Unified Token Space. Highlight: In the field of human-object interaction (HOI), detection and generation are two dual tasks that have traditionally been addressed separately, hindering the development of comprehensive interaction understanding. To address this, we propose UniHOI, which jointly models HOI detection and generation via a unified token space, thereby effectively promoting knowledge sharing and enhancing generalization. | Panqi Yang; Haodong Jing; Nanning Zheng; Yongqiang Ma |
| 50 | TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation. Highlight: We propose TIDE (Temporal-aware sparse autoencoders for Interpretable Diffusion transformErs), a framework designed to extract sparse, interpretable activation features across timesteps in DiTs. | Victor Shea-Jay Huang; Le Zhuo; Yi Xin; Zhaokai Wang; Fu-Yun Wang; Yuchi Wang; Renrui Zhang; Peng Gao; Hongsheng Li |
| 51 | Generalized-Scale Object Counting with Gradual Query Aggregation. Highlight: We propose GeCo2, an end-to-end few-shot counting and detection method that explicitly addresses the object scale issues. | Jer Pelhan; Alan Lukežič; Matej Kristan |
| 52 | Insert Anything: Image Insertion Via In-Context Editing in DiT. Highlight: This work presents Insert Anything, a unified framework for reference-based image insertion that seamlessly integrates objects from reference images into target scenes under flexible, user-specified control guidance. | Wensong Song; Hong Jiang; Zongxin Yang; Zheqiao Cheng; Ruijie Quan; Yi Yang |
| 53 | Phased One-Step Adversarial Equilibrium for Video Diffusion Models. Highlight: Existing video acceleration methods, adapted from image-based techniques, lack a single-step distillation ability for large-scale video models and task generalization for conditional downstream tasks. To bridge this gap, we propose the Video Phased Adversarial Equilibrium (V-PAE), a distillation framework that enables high-quality, single-step video generation from large-scale video models. | Jiaxiang Cheng; Bing Ma; Xuhua Ren; Hongyi Henry Jin; Kai Yu; Peng Zhang; Wenyue Li; Yuan Zhou; Tianxiang Zheng; Qinglin Lu |
| 54 | Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution. Highlight: In this work, we propose Operations Research Knowledge-based 3D Grounded Task Scheduling (OKS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization for embodied agents. | Dingkang Liang; Cheng Zhang; Xiaopeng Xu; Jianzhong Ju; Zhenbo Luo; Xiang Bai |
| 55 | Towards Acyclic Preference Evaluation of Language Models Via Multiple Evaluators. Highlight: While existing works usually leverage a strong LLM as the judge for comparing LLMs’ responses pairwise, such a single-evaluator approach is vulnerable to cyclic preference, i.e., output A is better than B, B than C, but C is better than A, causing contradictory evaluation results. To address this, we introduce PGED (Preference Graph Ensemble and Denoise), a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for acyclic, non-contradictory evaluation results. | Zhengyu Hu; Jieyu Zhang; Zhihan Xiong; Alexander Ratner; Kaize Ding; Ranjay Krishna |
| 56 | RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users. Highlight: For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user’s mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. | Suyu Ye; Haojun Shi; Darren Shih; Hyokun Yun; Tanya G. Roosta; Tianmin Shu |
| 57 | WorldAgen: Unified State-Action Prediction with Test-Time World Model Training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present WorldAgen, a unified framework that jointly learns world modeling and action prediction while enabling test-time training (TTT) to adapt to new environments. |
Chi Wan; Kangrui Wang; Yuan Si; Pingyue Zhang; Manling Li; |
| 58 | DesireKV: Decoupling Sensitivity and Importance for Reasoning-Aware KV Cache Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose DesireKV, a novel compression framework that first constructs a two-dimensional coordinate system based on attention-derived importance and outlier-based quantization sensitivity. |
Pengyu Cheng; Jiacheng Wang; Tianle Chen; Bei Liu; Xiaofeng Hou; Jiacheng Liu; |
| 59 | SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM-based Multi-Agent Communication Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we present SafeSieve, a progressive and adaptive multi-agent pruning algorithm that dynamically refines the inter-agent communication through a novel dual-mechanism. |
Ruijia Zhang; Xinyan Zhao; Ruixiang Wang; Sigen Chen; Guibin Zhang; An Zhang; Kun Wang; Qingsong Wen; |
| 60 | AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Concretely, we introduce AbductiveMLLM, comprising two synergistic components: REASONER and IMAGINER. |
Boyu Chang; Qi Wang; Xi Guo; Zhixiong Nan; Yazhou Yao; Tianfei Zhou; |
| 61 | DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present DexGraspVLA, a hierarchical framework for robust generalization in language-guided general dexterous grasping and beyond. |
Yifan Zhong; Xuchuan Huang; Ruochong Li; Ceyao Zhang; Zhang Chen; Tianrui Guan; Fanlian Zeng; Ka Nam Lui; Yuyao Ye; Yitao Liang; Yaodong Yang; Yuanpei Chen; |
| 62 | GenVidBench: A 6-Million Benchmark for AI-Generated Video Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we introduce GenVidBench, a challenging AI-generated video detection dataset with several key advantages: 1) Large-scale video collection: The dataset contains 6.78 million videos and is currently the largest dataset for AI-generated video detection. |
Zhenliang Ni; Qiangyu Yan; Mouxiao Huang; Tianning Yuan; Yehui Tang; Hailin Hu; Xinghao Chen; Yunhe Wang; |
| 63 | URPO: A Unified Reward & Policy Optimization Framework for Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a novel framework, Unified Reward & Policy Optimization (URPO), that unifies instruction-following (“player”) and reward modeling (“referee”) into a single model and a single training phase. |
Songshuo Lu; Hua Wang; Zhi Chen; Yaohua Tang; |
| 64 | CP-Router: An Uncertainty-Aware Router Between LLM and LRM Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, LRMs often produce unnecessarily lengthy outputs even for simple queries, leading to inefficiencies or even accuracy degradation compared to LLMs. To address this, we propose CP-Router, a training-free, model-agnostic routing framework that dynamically selects between an LLM and an LRM, demonstrated with multiple-choice question answering (MCQA) prompts. |
Jiayuan Su; Fulin Lin; Zhaopeng Feng; Han Zheng; Teng Wang; Zhenyu Xiao; Xinlong Zhao; Zuozhu Liu; Lu Cheng; Hongwei Wang; |
| 65 | Enhancing Conversational Recommender Systems with Tree-Structured Knowledge and Pretrained Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose PCRS-TKA, a prompt-based framework employing retrieval-augmented generation to integrate PLMs with KGs. |
Yongwen Ren; Chao Wang; Peng Du; Chuan Qin; Dazhong Shen; Hui Xiong; |
| 66 | MUTrack: A Memory-Aware Unified Representation Framework for Visual Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Conversely, trackers that emphasize extensive temporal context provide strong robustness, yet this approach can compromise their short-term adaptability. To bridge this gap, we propose a novel tracker, MUTrack, which comprehensively integrates both long-term and short-term memories into a unified target representation for more robust tracking. |
Weijing Wu; Qihua Liang; Bineng Zhong; Xiaohu Tang; Yufei Tan; Ning Li; Yuanliang Xue; |
| 67 | OptMark: Robust Multi-bit Diffusion Watermarking Via Inference Time Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose OptMark, an optimization-based approach that embeds a robust multi-bit watermark into the intermediate latents of the diffusion denoising process. |
Jiazheng Xing; Hai Ci; Hongbin Xu; Hangjie Yuan; Yong Liu; Mike Zheng Shou; |
| 68 | AgentSwift: Efficient LLM Agent Design Via Value-Guided Hierarchical Search Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The difficulty of this exploration is further exacerbated by inefficient search strategies that struggle to navigate the large design space effectively, making the discovery of novel agents a slow and resource-intensive process. To address these challenges, we propose AgentSwift, a novel framework for automated agent design. |
Yu Li; Lehui Li; Zhihao Wu; Qingmin Liao; Jianye HAO; Kun Shao; Fengli Xu; |
| 69 | Hilbert Curve-Encoded Rotation-Equivariant Oriented Object Detector with Locality-Preserving Spatial Mapping Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by Hilbert curve’s intrinsic locality-preserving property, we propose a flexible Hilbert curve-Encoded Rotation-Equivariant Oriented Object Detector (HERO-Det). |
Qi Ming; Liuqian Wang; Juan Fang; Xudong Zhao; Yucheng Xu; Ziyi Teng; Yue Zhou; Xiaoxi Hu; Xiaohan Zhang; Yufei Guo; |
| 70 | From Solver to Tutor: Evaluating The Pedagogical Intelligence of LLMs with KMP-Bench Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. |
Weikang Shi; Houxing Ren; Junting Pan; Aojun Zhou; Ke Wang; Zimu Lu; Yunqiao Yang; Yuxuan Hu; Linda Wei; Mingjie Zhan; Hongsheng Li; |
| 71 | Endowing Vision-Language Models with System 2 Thinking for Fine-grained Visual Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Based on the observation, we propose System-2 enhanCed visuAl recogNition (SCAN), a novel plug-and-play approach that makes VLMs aware of nuanced differences. |
Yutong Yang; Lifu Huang; Yijie Lin; Xi Peng; Mouxing Yang; |
| 72 | TraveLLaMA: A Multimodal Travel Assistant with Large-Scale Dataset and Structured Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present TraveLLaMA, a specialized multimodal language model designed for comprehensive travel assistance. |
Meng Chu; Yukang Chen; Haokun Gui; Shaozuo Yu; Yi Wang; Jiaya Jia; |
| 73 | Generalizable Drug–Target Interaction Prediction Via ESM-2 Representations and Progressive Contrastive Curriculum Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Ablation studies confirm the complementary benefits of each component, validating their collective contribution to robust and generalizable DTI prediction. Our work underscores the effectiveness of combining pretrained protein language models with structured training curricula and cross-modal contrastive learning for reliable DTI prediction under real-world, distribution-shifted conditions. |
Qianyang Wu; Jingwei Lv; Zilong Zhang; Feifei Cui; |
| 74 | SAGE: Spuriousness-Aware Guided Prompt Exploration for Mitigating Multimodal Bias Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we first theoretically analyze the impact of multimodal spurious bias in zero-shot classification. Based on this insight, we propose Spuriousness-Aware Guided Exploration (SAGE), a simple and effective method that mitigates spurious bias via guided prompt selection. |
Wenqian Ye; Di Wang; Guangtao Zheng; Bohan Liu; Aidong Zhang; |
| 75 | Top-Down Semantic Refinement for Image Captioning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, planning within the vast state space of a VLM presents a significant computational hurdle. Our core contribution, therefore, is the design of a highly efficient Monte Carlo Tree Search (MCTS) algorithm tailored for VLMs. |
Jusheng Zhang; Kaitong Cai; Jing Yang; Jian Wang; Chengpei Tang; Keze Wang; |
| 76 | LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Traditional methods like supervised fine-tuning and reinforcement learning from human feedback are data-intensive and computationally expensive, while static parameter editing struggles with context-dependent errors and catastrophic forgetting. To overcome these limitations, we introduce LLM-CAS, a framework that formulates real-time hallucination correction as a hierarchical reinforcement learning (HRL) problem. |
Jusheng Zhang; Ningyuan Liu; Yijia Fan; Zihao Huang; Qinglin Zeng; Kaitong Cai; Jian Wang; Keze Wang; |
| 77 | MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: These factors limit cost-efficiency and impact. To address this, we propose MetaCipher, a low-cost, multi-agent jailbreak framework that generalizes across LLMs with varying safety measures. |
Boyuan Chen; Minghao Shao; Abdul Basit; Siddharth Garg; Muhammad Shafique; |
| 78 | PCGS: Progressive Compression of 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: For quality, we propose a progressive quantization approach that gradually reduces quantization step sizes to achieve finer modeling of Gaussian attributes. |
Yihang Chen; Mengyao Li; Qianyi Wu; Weiyao Lin; Mehrtash Harandi; Jianfei Cai; |
| 79 | CAMERA: Multi-Matrix Joint Compression for MoE Models Via Micro-Expert Redundancy Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. |
Yuzhuang Xu; Xu Han; Yuanchi Zhang; Yixuan Wang; Yijun Liu; Shiyu Ji; Qingfu Zhu; Wanxiang Che; |
| 80 | UM-Text: A Unified Multimodal Model for Image Understanding and Visual Text Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering the stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing by natural language instructions. |
Lichen Ma; Xiaolong Fu; Gaojing Zhou; Zipeng Guo; Ting Zhu; Yichun Liu; Yu Shi; Jason Li; Junshi Huang; |
| 81 | STAR-1: Safer Alignment of Reasoning LLMs with 1K Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. |
Zijun Wang; Haoqin Tu; Yuhan Wang; Juncheng Wu; Yanqing Liu; Jieru Mei; Brian R. Bartoldson; Bhavya Kailkhura; Cihang Xie; |
| 82 | Robust Lazy Conflict Detection Via Multi-Conflict Extraction and Genetic Diversity Control Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While lazy conflict detection addresses runtime efficiency by predetermining conflict sets offline using a genetic algorithm, it suffers from low conflict coverage, stagnation, and instability. We propose a robust enhancement that integrates multi-conflict extraction and genetic diversity control to overcome these limitations. |
Viet-Man Le; Lukas André Feldgrill; Alexander Felfernig; |
| 83 | Free-Form Scene Editor: Enabling Multi-Round Object Manipulation Like in A 3D Engine Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present FFSE, a 3D-aware autoregressive framework designed to enable intuitive, physically-consistent object editing directly on real-world images. |
Xincheng Shuai; Zhenyuan Qin; Henghui Ding; Dacheng Tao; |
| 84 | State Mamba: Spatiotemporal EEG State-Space Model with Dynamic Brain Alignment for Cross-Subject Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the challenges, we propose State Mamba, a novel spatiotemporal EEG state-space model that explicitly models and aligns neural responses and their spatiotemporal state transitions to learn consistent and generalizable representations across subjects. |
Weining Weng; Yang Gu; Yuan Ma; Yuchen Liu; Yingwei Zhang; Yiqiang Chen; |
| 85 | OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present OpenDriveVLA, a Vision-Language Action (VLA) model designed for end-to-end autonomous driving, built upon open-source large language models. |
Xingcheng Zhou; Xuyuan Han; Feng Yang; Yunpu Ma; Volker Tresp; Alois Knoll; |
| 86 | Drift-aware Collaborative Assistance Mixture of Experts for Heterogeneous Multistream Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods typically assume homogeneous streams and employ static architectures with indiscriminate knowledge fusion, limiting generalizability in complex dynamic environments. To tackle this gap, we propose CAMEL, a dynamic Collaborative Assistance Mixture of Experts Learning framework. |
En Yu; Jie Lu; Kun Wang; Xiaoyu Yang; Guangquan Zhang; |
| 87 | Biologically-Inspired Evolutionary Domain Symbiosis for Few-shot and Zero-shot Point Cloud Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, inspired by natural symbiotic evolution, we propose a Symbiotic Evolution Module (SEM) that models co-adaptation between support and query features through self-correlation and cross-correlation mechanisms. |
Changshuo Wang; Zhijian Hu; Xiang Fang; Zai Yang Yu; Yibin Wu; Mingkun Xu; Yusong Wang; Xingyu Gao; Prayag Tiwari; |
| 88 | PET2Rep: Towards Vision-Language Model-Drived Automated Radiology Report Generation for Positron Emission Tomography Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To bridge the gap, we introduce PET2Rep, a large-scale comprehensive benchmark for evaluation of general and medical VLMs for radiology report generation for PET images. |
Yichi Zhang; Wenbo Zhang; Zehui Ling; Gang Feng; Sisi Peng; Deshu Chen; Yuchen Liu; Hongwei Zhang; Shuqi Wang; Lanlan Li; Limei Han; Yuan Cheng; Zixin Hu; Yuan Qi; Le Xue; |
| 89 | Knowledge-Enhanced Explainable Prompting for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This results in a failure to satisfy the stringent trustworthiness requirements of Explainable Artificial Intelligence (XAI) in high-risk scenarios like healthcare. To address this issue, we propose a Knowledge-Enhanced Explainable Prompting (KEEP) framework that leverages fine-grained domain-specific knowledge to enhance the adaptation process of VLMs across various domains and image modalities. |
Yequan Bie; Andong Tan; Zhixuan Chen; Zhiyuan Cai; Luyang Luo; Hao Chen; |
| 90 | Analyzing and Mitigating Object Hallucination: A Training Bias Perspective Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To understand this issue, we conduct probing experiments on the models’ internal components, revealing that this training bias is primarily located in the language modeling (LM) head, which fails to correctly translate accurate visual representations into textual outputs. Based on these findings, we propose Obliviate, an efficient and lightweight unlearning method designed to mitigate object hallucination via training bias unlearning. |
Yifan Li; Kun Zhou; Xin Zhao; Lei Fang; Jirong Wen; |
| 91 | SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose SpatialActor, a disentangled framework for robust robotic manipulation that explicitly decouples semantics and geometry. |
Hao Shi; Bin Xie; Yingfei Liu; Yang Yue; Tiancai Wang; Haoqiang Fan; Xiangyu Zhang; Gao Huang; |
| 92 | Exploiting Blurry Representations for Event-guided Video Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose BluR-EVSR, a unified framework that implicitly models Blurry Representations and leverages Event cameras to jointly address both blur and resolution degradation for VSR. |
Zeyu Xiao; Xinchao Wang; |
| 93 | Model Editing As A Double-Edged Sword: Steering Agent Behavior Toward Beneficence or Harm Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. |
Baixiang Huang; Zhen Tan; Haoran Wang; Zijie Liu; Dawei Li; Ali Payani; Huan Liu; Tianlong Chen; Kai Shu; |
| 94 | CASL: Curvature-Augmented Self-supervised Learning for 3D Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a Curvature-Augmented Self-supervised Learning (CASL) framework based on a reconstruction paradigm. |
Yaohua Zha; Xue Yuerong; Chunlin Fan; Yuansong Wang; Tao Dai; Ke Chen; Shu-Tao Xia; |
| 95 | Enhancing Spatial Reasoning Through Visual and Textual Thinking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce a method that can enhance Spatial reasoning through Visual and Textual thinking Simultaneously (SpatialVTS). |
Xun Liang; Xin Guo; Zhongming Jin; Weihang Pan; Penghui Shang; Deng Cai; Binbin Lin; Jieping Ye; |
| 96 | Next Patch Prediction for AutoRegressive Visual Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we rethink the NTP for autoregressive image generation and extend it to a novel Next Patch Prediction (NPP) paradigm. |
Yatian Pang; Peng Jin; Shuo Yang; Bin Zhu; Bin Lin; Chaoran Feng; Zhenyu Tang; Liuhan Chen; Francis E. H. Tay; Ser-Nam Lim; Harry Yang; Li Yuan; |
| 97 | What Makes A Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we systematically investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. |
Xiaoran Fan; Zhichao Sun; Yangfan Gao; Jingfei Xiong; Hang Yan; Yifei Cao; Jiajun Sun; Shuo Li; Zhihao Zhang; Zhiheng Xi; Yuhao Zhou; Senjie Jin; Changhao Jiang; Junjie Ye; Ming Zhang; Rui Zheng; Zhenhua Han; Yunke Zhang; Demei Yan; Shaokang Dong; Tao Ji; Tao Gui; |
| 98 | MHA2MLA-VLM: Enabling DeepSeek’s Economical Multi-Head Latent Attention Across Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present MHA2MLA-VLM, a parameter-efficient and multimodal-aware framework for converting off-the-shelf VLMs to MLA. |
Xiaoran Fan; Zhichao Sun; Tao Ji; Lixing Shen; Tao Gui; |
| 99 | GENMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by effective human creative workflow, we propose GENMAC, a multi-agent collaboration framework that enables compositional text-to-video generation. |
Kaiyi Huang; Yukun Huang; Xuefei Ning; Zinan Lin; Yu Wang; Xihui Liu; |
| 100 | Evolving Generalist Virtual Agents with Generative and Associative Memory Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This stems from memory systems that treat experiences as isolated fragments and rely on brittle semantic retrieval, preventing the synthesis of novel solutions from disparate knowledge. To address this, we introduce CA3Mem, a framework inspired by the human hippocampus that organizes experiences into a structured memory graph. |
Zhenkui Zhang; Wendong Bu; Kaihang Pan; Bingchen Miao; Wenqiao Zhang; Guoming Wang; Wei Ji; Rui Tang; Juncheng Li; Siliang Tang; |
| 101 | Splat-SAP: Feed-Forward Gaussian Splatting for Human-Centered Scene with Scale-Aware Point Map Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present Splat-SAP, a feed-forward approach to render novel views of human-centered scenes from binocular cameras with large sparsity. |
Boyao Zhou; Shunyuan Zheng; Zhanfeng Liao; Zihan Ma; Hanzhang Tu; Boning Liu; Yebin Liu; |
| 102 | Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose a novel Sensitivity-aware Task Vector insertion framework (STV) to figure out where and what to insert. |
Ziyu Ma; Chenhui Gou; Yiming Hu; Yong Wang; Bohan Zhuang; Jianfei Cai; |
| 103 | Open-World Object Counting in Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we make the following contributions: we introduce a model, CountVid, for this task. |
Niki Amini-Naieni; Andrew Zisserman; |
| 104 | ConSurv: Multimodal Continual Learning for Survival Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the two challenges of catastrophic forgetting and complex inter-modal interactions between gigapixel whole slide images and genomics, we propose ConSurv, the first multimodal continual learning (MMCL) method for survival analysis. |
Dianzhi Yu; Conghao Xiong; Yankai Chen; Wenqian Cui; Xinni Zhang; Yifei Zhang; Hao Chen; Joseph J. Y. Sung; Irwin King; |
| 105 | Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM As A Judge, and A Lightweight CTF Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: For rapid evaluation, we present CTFTiny, a curated benchmark of 50 representative CTF challenges across binary exploitation, web, reverse engineering, forensics, and cryptography. |
Minghao Shao; Nanda Rani; Kimberly Milner; Haoran Xi; Meet Udeshi; Saksham Aggarwal; Venkata Sai Charan Putrevu; Sandeep K. Shukla; Prashanth Krishnamurthy; Farshad Khorrami; Ramesh Karri; Muhammad Shafique; |
| 106 | VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we propose a novel VFI pipeline, VTinker, which consists of two core components: guided flow upsampling (GFU) and Texture Mapping. |
Chenyang Wu; Jiayi Fu; Chun-Le Guo; Shuhao Han; Chongyi Li; |
| 107 | PanFlow: Decoupled Motion Control for Panoramic Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose PanFlow, a novel approach that exploits the spherical nature of panoramas to decouple the highly dynamic camera rotation from the input optical flow condition, enabling more precise control over large and dynamic motions. |
Cheng Zhang; Hanwen Liang; Donny Y. Chen; Qianyi Wu; Konstantinos N. Plataniotis; Camilo Cruz Gambardella; Jianfei Cai; |
| 108 | Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, current toxic speech datasets are predominantly text-based, limiting the development of models that can capture paralinguistic cues. To address these challenges, we present ToxiAlert-Bench, a large-scale audio dataset comprising over 30,000 audio clips annotated with seven major toxic categories and twenty fine-grained toxic labels. |
Zhongjie Ba; Liang Yi; Peng Cheng; Qingcao Li; Qinglong Wang; Li Lu; |
| 109 | Graph Masked Autoencoder for Multi-view Remote Sensing Data Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This challenge is especially notable in the multi-view remote sensing setting, where high heterogeneity and complex spatial structures increase the difficulty of effective representation learning. To address these issues, we propose Clustering-Guided graph Mask AutoEncoder (CG-MAE), the first framework to extend graph masked autoencoders to multi-view remote sensing clustering. |
Renxiang Guan; Junhong Li; Siwei Wang; Tianrui Liu; Dayu Hu; Miaomiao Li; Xinwang Liu; |
| 110 | MMMamba: A Versatile Cross-Modal in Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose MMMamba, a cross-modal in-context fusion framework for pan-sharpening, with the flexibility to support image super-resolution in a zero-shot manner. |
Yingying Wang; Xuanhua He; Chen Wu; Jialing Huang; Suiyun Zhang; Rui Liu; Xinghao Ding; Haoxuan Che; |
| 111 | Head-Aware KV Cache Compression for Efficient Visual Autoregressive Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale generation paradigm. |
Ziran Qin; Youru Lv; Mingbao Lin; Hang Guo; Zeren Zhang; Danping Zou; Weiyao Lin; |
| 112 | OIDA-QA: A Multimodal Benchmark for Analyzing The Opioid Industry Documents Archive Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The complexity, multimodal nature, and specialized characteristics of these healthcare-related legal and corporate documents necessitate more advanced methods and models tailored to specific data types and detailed annotations, ensuring the precision and professionalism in the analysis. In this paper, we tackle this challenge by organizing the original dataset according to document attributes and constructing a benchmark with 400k training documents and 10k for testing. |
Xuan Shen; Brian Wingenroth; Zichao Wang; Jason Kuen; Wanrong Zhu; Ruiyi Zhang; Yiwei Wang; Lichun Ma; Anqi Liu; Hongfu Liu; Tong Sun; Kevin S. Hawkins; Kate Tasker; G. Caleb Alexander; Jiuxiang Gu; |
| 113 | SpecDiff: Accelerating Diffusion Model Inference with Self-Speculation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we analyze existing feature caching methods from the perspective of information utilization, and point out that relying solely on historical information will lead to constrained accuracy and speed performance. |
Jiayi Pan; Jiaming Xu; Yongkang Zhou; Guohao Dai; |
| 114 | FreLay: Frequency-aware Energy Function for Training-free Layout-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Furthermore, while attention varies over time during the denoising process, existing approaches employ a fixed formulation. To address these challenges, we introduce FreLay, a novel training-free approach equipped with a frequency-aware energy function. |
Bonan Li; Yinhan Hu; Songhua Liu; Zeyu Xiao; Xinchao Wang; |
| 115 | RareAgents: Autonomous Multi-disciplinary Team for Rare Disease Diagnosis and Treatment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current agent frameworks are not well-adapted to real-world clinical scenarios, especially those involving the complex demands of rare diseases. To bridge this gap, we introduce RareAgents, the first LLM-driven multi-disciplinary team decision-support tool designed specifically for the complex clinical context of rare diseases. |
Xuanzhong Chen; Ye Jin; Xiaohao Mao; Lun Wang; Shuyang Zhang; Ting Chen; |
| 116 | Enhancing Retrieval-Augmented Large Vision Language Models Via Knowledge Conflict Mitigation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the retrieved contextual knowledge is usually not aligned with LVLMs’ internal parametric knowledge, leading to knowledge conflicts and further unreliable responses. To tackle this issue, we design KCM, a training-free and plug-and-play framework that can effectively mitigate knowledge conflicts while incorporating MRAG for more accurate LVLM responses. |
Wenbin An; Jiahao Nie; Feng Tian; Mingxiang Cai; Yaqiang Wu; Xiaoqin Zhang; Shijian Lu; |
| 117 | RealUHR: Harnessing Patch-Cascade Flows for Photorealistic Ultra-High-Resolution Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address these, we introduce RealUHR, an efficient and scalable framework for generating photorealistic 4K images. |
Yongsheng Yu; Haitian Zheng; Zhe Lin; Connelly Barnes; Yuqian Zhou; Zhifei Zhang; Jiebo Luo; |
| 118 | What to Ask Next? Probing The Imaginative Reasoning of LLMs with TurtleSoup Puzzles Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing benchmarks, often static or focused on social deduction, fail to capture the dynamic, exploratory nature of this reasoning process. To address this gap, we introduce a comprehensive research framework based on the classic "Turtle Soup" game, integrating a benchmark, an agent, and an evaluation protocol. |
Mengtao Zhou; Sifan Wu; Huan Zhang; Qi Sima; Bang Liu; |
| 119 | Exposing The Cracks: Vulnerabilities of Retrieval-Augmented LLM-based Machine Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: REtrieval-Augmented LLM-based Machine Translation (REAL-MT) shows promise for knowledge-intensive tasks like idiomatic translation, but its reliability under noisy retrieval, a common challenge in real-world deployment, remains poorly understood. To address this gap, we propose a noise synthesis framework and new metrics to systematically evaluate REAL-MT’s reliability across high-, medium-, and low-resource language pairs. |
Yanming Sun; Runzhe Zhan; Chi Seng Cheang; Han Wu; Xuebo Liu; Yuyao Niu; Fengying Ye; Kaixin Lan; Lidia S. Chao; Derek F. Wong; |
| 120 | Audio-Thinker: Guiding Large Audio Language Model When and How to Think Via Reinforcement Learning Highlight: However, the explicit reasoning process has not yet yielded substantial benefits for audio question answering, and effectively leveraging deep reasoning remains an open challenge, with LALMs still falling short of achieving human-level auditory-language reasoning. To address these limitations, we propose Audio-Thinker, a reinforcement learning framework designed to enhance the reasoning capabilities of LALMs through improved adaptability, consistency, and effectiveness. |
Shu Wu; Chenxing Li; Wenfu Wang; Hao Zhang; Hualei Wang; Meng Yu; Dong Yu; |
| 121 | SCALAR: Scale-wise Controllable Visual Autoregressive Learning Highlight: In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a Scale-wise Conditional Decoding mechanism. |
Ryan Xu; Dongyang Jin; Yancheng Bai; Rui Lan; Xu Duan; Lei Sun; Xiangxiang Chu; |
| 122 | DP-NCB: Privacy Preserving Fair Bandits Highlight: Existing privacy-preserving bandit algorithms typically optimize average regret, a utilitarian measure, whereas fairness-aware approaches focus on minimizing Nash regret, which penalizes inequitable reward distributions, but often disregard privacy concerns. To bridge this gap, we introduce Differentially Private Nash Confidence Bound (DP-NCB), a novel and unified algorithmic framework that simultaneously ensures ϵ-differential privacy and achieves order-optimal Nash regret, matching known lower bounds up to logarithmic factors. |
Dhruv Sarkar; Nishant Pandey; Sayak Ray Chowdhury; |
| 123 | FloorPlanFormer: Multi-Task Transformer Network for Floor Plan Recognition with Outer-to-Inner Feature Refinement Highlight: Floor plan recognition requires accurate segmentation and classification of entrance doors, outer contours (walls and windows) and inner contours (various room types), despite strong spatial dependencies and large stylistic differences between different datasets. To overcome these challenges, we propose FloorPlanFormer, a multi-task learning network divided into three phases: the first phase introduces a Swin Transformer backbone with a pixel decoder to extract fine-grained pixel-level semantics; the second phase employs a prompt encoder and a mask decoder, together with a novel Global Contextual Attention Module (GCAM) designed to generate clear, high-quality outer contour masks; the third phase uses a mask transformer decoder to recognize targets and designs a Masked Feature Refinement Module (MFRM) to accurately delineate the inner contour by modeling the relationship between the local inner and outer contours. |
Yun Liang; ZiHao Wu; Run Zheng; Shuai Xie; Bo Hong; Yishen Lin; |
| 124 | Video Mirror Detection with The Motion-in-Depth Cue Highlight: MiD integrates information from visual appearance (image colors and textures), the way objects move around us in 3D space (3D motions), and their relative distance from us (depth) to determine if something is approaching or receding and to support navigation. Motivated by this neuroscience mechanism, we introduce MiD-VMD, the first approach to explicitly model MiD for video mirror detection. |
Alex Warren; Ke Xu; Xin Tian; Gary K. L. Tam; Benjamin W. Wah; Rynson W. H. Lau; |
| 125 | Multiplicative Orthogonal Sequential Editing for Language Models Highlight: Although subsequent methods have made some improvements, they remain within the additive framework and have not fundamentally addressed this limitation. To solve this problem, we analyze it from both statistical and mathematical perspectives and conclude that multiplying the original matrix by an orthogonal matrix does not change the numerical stability of the matrix. |
Hao-Xiang Xu; Jun-Yu Ma; Ziqi Peng; Yuhao Sun; Zhen-Hua Ling; Jia-Chen Gu; |
| 126 | Heterogeneous Uncertainty-Guided Composed Image Retrieval with Fine-Grained Probabilistic Learning Highlight: Probabilistic learning approaches have shown promise in addressing such issues; however, they fall short for CIR due to their instance-level holistic modeling and homogeneous treatments for queries and targets. This paper introduces a Heterogeneous Uncertainty-Guided (HUG) paradigm to overcome these limitations. |
Haomiao Tang; Jinpeng Wang; Minyi Zhao; GuangHao Meng; Ruisheng Luo; Long Chen; Shu-Tao Xia; |
| 127 | PointDGRWKV: Generalizing RWKV-like Architecture to Unseen Domains for Point Cloud Classification Highlight: In this paper, we present the first work that studies the generalizability of RWKV models in DG PCC. |
Hao Yang; Qianyu Zhou; Haijia Sun; Xiangtai Li; Xuequan Lu; Lizhuang Ma; Shuicheng YAN; |
| 128 | SPA: Achieving Consensus in LLM Alignment Via Self-Priority Optimization Highlight: We propose priority alignment, a new alignment paradigm that enforces a strict “trustworthy-before-helpful” ordering: optimization of helpfulness is conditioned on first meeting trustworthy thresholds (e.g., harmlessness or honesty). |
Yue Huang; Xiangqi Wang; Xiangliang Zhang; |
| 129 | Conditional Prompt Learning Via Degradation Perception for Underwater Image Enhancement Highlight: Existing methods simplistically treat various degradations as homogeneous, disregarding their intrinsic connections and causing models to blindly learn, resulting in conflicting optimization goals and visual distortions. To address the above limitations, we propose a Conditional Prompt Learning via Degradation Perception (CPLDP) model, which employs conditional prompts as degradation perception priors and guides underwater image enhancement. |
Mingze Yao; Zhiying Jiang; Xianping Fu; Huibing Wang; |
| 130 | ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks Highlight: In this work, we present ODYSSEY, a unified mobile manipulation framework for agile quadruped robots equipped with manipulators, which seamlessly integrates high-level task planning with low-level whole-body control. |
Kaijun Wang; Liqin Lu; Mingyu Liu; Jianuo Jiang; Zeju Li; Bolin Zhang; Wancai Zheng; Xinyi Yu; Hao Chen; Chunhua Shen; |
| 131 | AutoLink: Autonomous Schema Exploration and Expansion for Scalable Schema Linking in Text-to-SQL at Scale Highlight: We present AutoLink, an autonomous agent framework that reformulates schema linking as an iterative, agent-driven process. |
Ziyang Wang; Yuanlei Zheng; Zhenbiao Cao; Xiaojin Zhang; Zhongyu Wei; Pei Fu; Zhenbo Luo; Wei Chen; Xiang Bai; |
| 132 | Beyond Quadratic: Linear-Time Change Detection with RWKV Highlight: Existing paradigms for remote sensing change detection are caught in a trade-off: CNNs excel at efficiency but lack global context, while Transformers capture long-range dependencies at a prohibitive computational cost. This paper introduces ChangeRWKV, a new architecture that reconciles this conflict. |
Zhenyu Yang; Gensheng Pei; Tao Chen; Xia Yuan; Haofeng Zhang; Xiangbo Shu; Yazhou Yao; |
| 133 | Flora: Effortless Context Construction to Arbitrary Length and Scale Highlight: In this paper, we introduce Flora, an effortless (human/LLM-free) long-context construction strategy. |
Tianxiang Chen; Zhentao Tan; Xiaofan Bo; Yue Wu; Tao Gong; Qi Chu; Jieping Ye; |
| 134 | Towards Efficient and Robust Manipulation Via Multi-Frame Vision-Language-Action Modeling Highlight: We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. |
Hao Li; Shuai Yang; Yilun Chen; Xinyi Chen; Xiaoda Yang; Yang Tian; Hanqing Wang; Tai Wang; Dahua Lin; Feng Zhao; Jiangmiao Pang; |
| 135 | MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model Via Mixture-of-Layers for Efficient Robot Manipulation Highlight: Aligning with the recent breakthrough of the Shallow Brain Hypothesis (SBH) in neuroscience and the mixture of experts in model sparsification, we conceptualize each LLM layer as an expert and propose a Mixture-of-LayEr Vision Language Action model (MoLe-VLA or simply MoLe) architecture for dynamic LLM layer activation. |
Rongyu Zhang; Menghang Dong; Yuan Zhang; Liang Heng; Xiaowei Chi; Gaole Dai; Li Du; Dan Wang; Yuan Du; Shanghang Zhang; |
| 136 | DeOcc-1-to-3: 3D De-Occlusion from A Single Image Via Self-Supervised Multi-View Diffusion Highlight: While recent diffusion-based view synthesis models can generate consistent novel views from a single RGB image, they generally assume fully visible inputs and struggle when parts of the object are occluded, leading to inconsistent views and degraded 3D reconstruction quality. To address this limitation, we propose DeOcc-1-to-3, an end-to-end framework for occlusion-aware multi-view generation. |
Yansong Qu; Shaohui Dai; Xinyang Li; Yuze Wang; You Shen; Shengchuan Zhang; Liujuan Cao; |
| 137 | Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study Highlight: In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. |
Yuqi Zhu; Yi Zhong; Jintian Zhang; Ziheng Zhang; Shuofei Qiao; Yujie Luo; Lun Du; Da Zheng; Ningyu Zhang; Huajun Chen; |
| 138 | Improving Deepfake Detection with Reinforcement Learning-Based Adaptive Data Augmentation Highlight: We argue that existing methods overlook the evolving complexity of real-world forgery patterns, such as facial warping, expression manipulation, and compression artifacts, which cannot be fully simulated by fixed policies. To bridge this gap, we propose CRDA (Curriculum Reinforcement-Learning Data Augmentation), a novel framework that guides the detector to progressively master multi-domain forgery features from simple to complex. |
Yuxuan Chou; Tao Yu; Wen Huang; ZhangYuHeng; Tao Dai; Shu-Tao Xia; |
| 139 | TraceTrans: Translation and Spatial Tracing for Surgical Prediction Highlight: In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. |
Xiyu Luo; Haodong Li; Xinxing Cheng; He Zhao; Yang Hu; Xuan Song; Tianyang Zhang; |
| 140 | CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking Highlight: Despite great success having been achieved, the inherent sparsity of point clouds introduces a dual-redundancy challenge that limits existing trackers: (1) vast spatial redundancy from background noise impairs accuracy, and (2) informational redundancy within the foreground hinders efficiency. To tackle these issues, we propose CompTrack, a novel end-to-end framework that systematically eliminates both forms of redundancy in point clouds. |
Sifan Zhou; Yichao Cao; Jiahao Nie; Yuqian Fu; Ziyu Zhao; Xiaobo Lu; Shuo Wang; |
| 141 | ReCode: Updating Code API Knowledge with Reinforcement Learning Highlight: This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. |
Haoze Wu; Yunzhi Yao; Wenhao Yu; Ningyu Zhang; |
| 142 | Deep Research Arena: The First Exam of LLMs’ Research Abilities Via Seminar-Grounded Tasks Highlight: Despite these strides, evaluating their research capability faithfully is rather challenging due to the difficulty of collecting frontier research questions that genuinely capture researchers’ attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars that capture rich expert discourse and interaction, better reflecting real-world research environments and reducing the risk of data leakage. |
Haiyuan Wan; Chen Yang; Junchi Yu; Meiqi Tu; Jiaxuan Lu; Di Yu; Jianbao Cao; Ben Gao; Jiaqing Xie; Aoran Wang; Wenlong Zhang; Philip Torr; Dongzhan Zhou; |
| 143 | Mitigating Perception Bias: A Training-Free Approach to Enhance LMM for Image Quality Assessment Highlight: In this paper, instead of costly retraining or tuning of an LMM, we propose a training-free debiasing framework in which the image quality prediction is rectified by mitigating the bias caused by image semantics. |
Baoliang Chen; Siyi Pan; Dongxu Wu; Liang Xie; Xiangjie Sui; Lingyu Zhu; Hanwei Zhu; |
| 144 | Rethinking Video-Language Model from The Language Input Perspective Highlight: To this end, in this paper, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts. |
Xiang Fang; Wanlong Fang; Changshuo Wang; Xiaoye Qu; Daizong Liu; |
| 145 | DAWN: Distributed LLM Multi-Agent Workflow Synthesis Highlight: We reveal two fundamental obstacles to generative workflow synthesis in this setting: (i) workflow specialization conflict, where agents optimized for different task distributions generate incompatible communication patterns that resist meaningful aggregation, and (ii) structural communication shift, where locally optimal agent interaction graphs fail to compose into globally coherent multi-agent workflows. To address these challenges, we propose DAWN, a federated framework that integrates two key innovations: Parametric Resonance, which robustly aggregates heterogeneous local updates via layer-wise SVD-based denoising and alignment, and Structural Gravity, which regularizes local workflow generation by penalizing the Fusion Gromov-Wasserstein distance to a set of prototype communication graphs, ensuring global structural coherence without stifling local adaptation. |
Guancheng Wan; Mo Zhou; Ziyi Wang; Xiaoran Shang; Eric Hanchen Jiang; Guibin Zhang; Jinhe Bi; Yunpu Ma; Zaixi Zhang; Ke Liang; Wenke Huang; |
| 146 | Minute-Long Videos with Dual Parallelisms Highlight: Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. |
Zeqing Wang; Bowen Zheng; Xingyi Yang; Zhenxiong Tan; Yuecong Xu; Xinchao Wang; |
| 147 | Enhancing Stability and Fidelity for Zero-Shot TTS with A Multi-Level Evaluator Highlight: Nevertheless, stability and fidelity remain key challenges, manifesting as mispronunciations, audible noise, and quality degradation. To address these issues, we introduce Vox-Evaluator, a multi-level evaluator designed to guide the correction of erroneous speech segments and preference alignment for TTS systems. |
Hualei Wang; Na Li; Chuke Wang; Shu Wu; Zhifeng Li; Dong Yu; |
| 148 | Discrete-Guided Diffusion for Scalable and Safe Multi-Robot Motion Planning Highlight: In contrast, continuous optimization-based planners offer higher-quality paths but suffer from the curse of dimensionality, resulting in poor scalability with respect to the number of robots. This paper tackles the limitations of these two approaches by introducing a novel framework that integrates discrete MAPF solvers with constrained generative diffusion models. |
Jinhao Liang; Sven Koenig; Ferdinando Fioretto; |
| 149 | Learn from Global Correlations: Enhancing Evolutionary Algorithm Via Spectral GNN Highlight: Existing EAs rely heavily on manual parameter settings, and inappropriate parameters might disrupt the exploration-exploitation balance, further impairing model performance. To address these challenges, we propose a novel evolutionary algorithm framework called Graph Neural Evolution (GNE). |
Kaichen Ouyang; Zong Ke; Shengwei Fu; Lingjie Liu; Puning Zhao; Dayu Hu; |
| 150 | ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM Highlight: We first empirically show that their improvements systematically coincide with redistributions of cross-modal attention. Building on this insight, we propose Attention-Steerable Contrastive Decoding (ASCD), which directly steers the attention scores during decoding. |
Yujun Wang; Aniri; Jinhe Bi; Soren Pirk; Yunpu Ma; |
| 151 | From Subtle to Significant: Prompt-Driven Self-Improving Optimization in Test-Time Graph OOD Detection Highlight: To this end, we propose a Self-Improving Graph Out-of-Distribution detector (SIGOOD), which is an unsupervised framework that integrates continuous self-learning with test-time training for effective graph OOD detection. |
Luzhi Wang; Xuanshuo Fu; He Zhang; Chuang Liu; Xiaobao Wang; Hongbo Liu; |
| 152 | SPEED-Q: Staged Processing with Enhanced Distillation Towards Efficient Low-Bit On-Device VLM Quantization Highlight: In this paper, we propose SPEED-Q, a novel Staged Processing with EnhancEd Distillation framework for VLM low-bit weight-only quantization that systematically addresses the following two critical obstacles: (1) significant discrepancies in quantization sensitivity between vision (ViT) and language (LLM) components in VLMs; (2) training instability arising from the reduced numerical precision inherent in low-bit quantization. |
Tianyu Guo; Shanwei Zhao; Shiai Zhu; Chenguang Ma; |
| 153 | Temporal-Consistent Video Restoration with Pre-trained Diffusion Models Highlight: In this paper, we advocate viewing the reverse process in DMs as a function and present a novel Maximum a Posteriori (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors. |
Hengkang Wang; Yang Liu; Huidong Liu; Chien-Chih Wang; Yanhui Guo; Hongdong Li; Bryan Wang; Ju Sun; |
| 154 | AutoTool: Efficient Tool Selection for Large Language Model Agents Highlight: In this work, we propose AutoTool, a novel graph-based framework that bypasses repeated LLM inference by exploiting a key empirical observation: tool usage inertia, the tendency of tool invocations to follow predictable sequential patterns. |
Jingyi Jia; Qinbin Li; |
| 155 | Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs Highlight: Unlike prior studies focusing on isolated biases, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. Specifically, we propose CognitiveAttack, a novel red-teaming framework that adaptively selects optimal ensembles from 154 human social psychology-defined cognitive biases, engineering them into adversarial prompts to effectively compromise LLM safety mechanisms. |
Xikang Yang; Biyu Zhou; Xuehai Tang; Jizhong Han; Songlin Hu; |
| 156 | Does Question Really Matter? The Attribution of Answer Bias in LLM Evaluation Highlight: To this end, we conduct a systematic investigation of the attribution of answer bias, and demonstrate a strong correlation between the degree of data contamination and the severity of answer bias, while the position of options and the popularity of answers have relatively minor effects. |
Boxi Cao; Ruotong Pan; Hongyu Lin; Xianpei Han; Le Sun; |
| 157 | Graph-augmented and Over-smoothing-resistant Contrastive Clustering for Short Text Highlight: While recent studies incorporating contrastive learning and cluster structure optimization have improved performance, their reliance on augmented samples often introduces noise and weakens the capacity of pretrained language models to capture fine-grained semantics. To address these issues, we propose a Graph-augmented and Over-smoothing-resistant Contrastive Clustering framework (GOCC). |
Zijian Zheng; Tao Ai; Yonghe Lu; |
| 158 | Codebook-Empowered Analysis-Friendly Extreme Underwater Image Compression Highlight: This paper introduces a novel vector-quantized (VQ) codebook-driven framework for machine-centric UIC. |
JianHao Wu; Yudong Mao; Qiuping Jiang; |
| 159 | Breaking The Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra Highlight: Existing retrieval methods, such as traditional mass spectral library matching, suffer from limited spectral library coverage, while recent cross-modal representation learning frameworks often encounter modality misalignment, resulting in suboptimal retrieval accuracy and generalization. To address these limitations, we propose GLMR, a Generative Language Model-based Retrieval framework that mitigates the cross-modal misalignment through a two-stage process. |
Yiwen Zhang; Keyan Ding; Yihang Wu; Xiang Zhuang; Yi Yang; Qiang Zhang; Huajun Chen; |
| 160 | Spatial-Spectral Homogeneous Attacks on Physical-World Large Vision-Language Models Highlight: Motivated by the research gap and counter-practice phenomenon, this paper proposes the first practical LVLM attack method based on a novel adversarial patch design, which can achieve physical and digital attack settings without using any LVLM details. |
Daizong Liu; Baoquan Chen; Wei Hu; |
| 161 | Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception Highlight: Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)-SSL method to capture type-invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby enhancing performance over FT in the all-type ADD task. |
Yuankun Xie; Ruibo Fu; Xiaopeng Wang; Zhiyong Wang; Songjun Cao; Long Ma; Haonan Cheng; Long Ye; |
| 162 | LandCraft: Designing The Structured 3D Landscapes Via Text Guidance Highlight: In this paper, we present LandCraft, a novel AI-assisted authoring tool that enables the rapid creation of high-quality landscape scenes based on user descriptions. |
Zhihao Liu; Fang Liu; Weihao Xuan; Naoto Yokoya; |
| 163 | MoReMouse: Monocular Reconstruction of Laboratory Mouse Highlight: To achieve high-fidelity 3D reconstructions, we present three key innovations. |
Yuan Zhong; Jingxiang Sun; Zhongbin Zhang; Liang An; Yebin Liu; |
| 164 | FairGC: Fostering Individual and Group Fairness for Deep Graph Clustering Highlight: In this paper, we introduce for the first time a fairness-aware framework termed FairGC for deep graph clustering, which integrates the dual objectives of individual and group fairness while maintaining accurate clustering results. |
Haodong Zhang; Xinyue Wang; Tao Ren; Yifan Wang; Siyu Yi; Fanchun Meng; Zeyu Ma; Qingqing Long; Wei Ju; |
| 165 | Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal Belief Highlight: In this paper, we propose EAGLE (Expectation of AGgregated internaL bEief), a novel self-evaluation-based calibration method that leverages the internal hidden states of LLMs to derive more accurate confidence scores. |
Zeguan Xiao; Diyang Dou; Boya Xiong; Yun Chen; Guanhua Chen; |
| 166 | Vision-G1: Towards General Reasoning Vision-Language Models Via Reinforcement Learning Highlight: To address this limitation, we collect and assemble a comprehensive RL-ready visual reasoning training dataset encompassing 46 datasets across 13 dimensions of 5 domains, covering a wide range of realistic scenarios such as infographic reasoning, mathematical reasoning, spatial reasoning, and general science reasoning. Based on this dataset, we propose an influence function-based data filtering strategy and a multi-round data curriculum method to iteratively strengthen general visual reasoning abilities. |
Yuheng Zha; Kun Zhou; Yujia Wu; Yushu Wang; Jie Feng; Zhi Xu; Shibo Hao; Zhengzhong Liu; Eric P. Xing; Zhiting Hu; |
| 167 | AcoustoReinforce: Multi-Particle Acoustophoretic Path Planning with Deep Reinforcement Learning Highlight: We introduce AcoustoReinforce, a reinforcement learning-based path planner that autonomously controls the motion of multiple levitated particles. |
Pengyuan Wei; Giorgos Christopoulos; Zhouyang Shen; Jincheng Wang; Joshua Mukherjee; Ryuji Hirayama; Sriram Subramanian; Prateek Mittal; |
| 168 | Boosting Fine-Grained Urban Flow Inference Via Lightweight Architecture and Focalized Optimization Highlight: However, the practical deployment of existing methods is hindered by two key challenges: the prohibitive computational cost of over-parameterized models and the suboptimal performance of conventional loss functions on the highly skewed distribution of urban flows. To address these challenges, we propose a unified solution that synergizes architectural efficiency with adaptive optimization. |
Yuanshao Zhu; Xiangyu Zhao; Zijian Zhang; Xuetao Wei; James Jianqiao Yu; |
| 169 | ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval Highlight: To this end, we propose ConvMix, a mixed-criteria framework to augment conversational dense retrieval, which covers more aspects than existing data augmentation frameworks. |
Fengran Mo; Jinghan Zhang; Yuchen Hui; Jia Ao Sun; Zhichao Xu; Zhan Su; Jian-Yun Nie; |
| 170 | Predicting The Future By Retrieving The Past Highlight: This results in an underutilization of rich patterns from the global history. To bridge this gap, we propose Predicting the Future by Retrieving the Past (PFRP), a novel approach that explicitly integrates global historical data to enhance forecasting accuracy. |
Dazhao Du; Tao Han; Song Guo; |
| 171 | Trainable EEG Interpolation and Structure-Sharing Dual-Path Encoders for Brain-Assisted Target Speaker Extraction Highlight: They also employ single-path encoders that extract only target-relevant features while neglecting complementary, irrelevant ones, limiting discriminability. To address these limitations, this paper proposes a Trainable EEG Interpolation and Structure-sharing Dual-path Encoders network (TIDENet). |
Zhao Lv; Haoran Zhou; Ying Chen; Youdian Gao; Xinhui Li; Ruibo Fu; Cunhang Fan; |
| 172 | BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning Highlight: However, applying CLIP to CIL poses two major challenges: (1) adapting to downstream tasks often requires additional learnable modules, increasing model complexity and susceptibility to forgetting; and (2) while multi-modal representations offer complementary strengths, existing methods have not fully exploited the synergy between visual and textual modalities. To address these issues, we propose BOFA (Bridge-layer Orthogonal Fusion for Adaptation), a novel framework for CIL. |
Lan Li; Tao Hu; Da-Wei Zhou; Jia-Qi Yang; Han-jia Ye; De-Chuan Zhan; |
| 173 | MMBERT: Scaled Mixture-of-Experts Multimodal BERT for Robust Chinese Hate Speech Detection Under Cloaking Perturbations Highlight: In this study, we propose MMBERT, a novel BERT-based multimodal framework that integrates textual, speech, and visual modalities through a Mixture-of-Experts (MoE) architecture. |
Qiyao Xue; Yuchen Dou; Zheyuan Ryan Shi; Xiang Lorraine Li; Wei Gao; |
| 174 | Exploring High-order-aware Prompt Learning for Zero-shot Anomaly Detection Highlight: Relying solely on the current PL paradigm restricts the ability to generate more precise prompts, thereby hindering improved ZSAD performance. To mitigate this issue, this paper proposes a high-order-aware prompt learning framework, termed HiPL, which facilitates the detection of unseen anomalies through generating prompts fortified by hypergraphs. |
Shun Wei; Jielin Jiang; Xiaolong Xu; |
| 175 | BAT: Learning Event-based Optical Flow with Bidirectional Adaptive Temporal Correlation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present BAT, an innovative framework that estimates event-based optical flow using bidirectional adaptive temporal correlation. |
Gangwei Xu; Haotong Lin; Zhaoxing Zhang; Hongcheng Luo; Haiyang Sun; Xin Yang; |
| 176 | Look As You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG Via Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce the Chain-of-Evidence (CoE) paradigm for VD-RAG. |
Shuochen Liu; Pengfei Luo; Chao Zhang; Yuhao Chen; Haotian Zhang; Qi Liu; Xin Kou; Tong Xu; Enhong Chen; |
| 177 | Towards Effective and Efficient Context-aware Nucleus Detection in Histopathology Whole Slide Images Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose an effective and efficient context-aware nucleus detection approach. |
Zhongyi Shui; Honglin Li; Yunlong Zhang; Yuxuan Sun; Yiwen Ye; Pingyi Chen; Ruizhe Guo; Lei Cui; Chenglu Zhu; Lin Yang; |
| 178 | A Theory of Adaptive Scaffolding for LLM-Based Pedagogical Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current LLM systems used in classrooms often lack the solid theoretical foundations found in earlier intelligent tutoring systems. To bridge this gap, we propose a framework that combines Evidence-Centered Design with Social Cognitive Theory and Zone of Proximal Development for adaptive scaffolding in LLM-based agents focused on STEM+C learning. |
Clayton Cohn; Surya Rayala; Namrata Srivastava; Joyce Horn Fonteles; Shruti Jain; Xinying Luo; Divya Mereddy; Naveeduddin Mohammed; Gautam Biswas; |
| 179 | PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address the challenge of balancing controllability and subjectivity, we propose a unified diffusion-based image retouching framework called PerTouch. |
Zewei Chang; Zheng-Peng Duan; Jianxing Zhang; Chun-Le Guo; Siyu Liu; Hyungju Chun; Hyunhee Park; Zikun Liu; Chongyi Li; |
| 180 | PGMamba: A Physical Model-Guided Global Mamba for Underwater Image Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose a novel Physical Model-Guided Global Mamba (PGMamba) that combines the efficient sequential modeling capability of Mamba with underwater imaging physical model. |
Zijun Tan; Chuan Fu; Tan Guo; Zhixiong Nan; Pengzhan Zhou; Xinggan Peng; Fulin Luo; |
| 181 | FastDriveVLA: Efficient End-to-End Driving Via Plug-and-Play Reconstruction-based Token Pruning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. |
Jiajun Cao; Qizhe Zhang; Peidong Jia; Xuhui Zhao; Bo Lan; Xiaoan Zhang; Lizhuo; Xiaobao Wei; Sixiang Chen; Liyun Li; Xianming Liu; Ming Lu; Yang Wang; Shanghang Zhang; |
| 182 | Let’s Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency. |
Xu Liu; Yongheng Zhang; Qiguang Chen; Yao Li; Sheng Wang; Libo Qin; |
| 183 | EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper introduces EasyText, a text rendering framework based on DiT (Diffusion Transformer), which connects denoising latents with multilingual character tokens encoded as character tokens. |
Runnan Lu; Yuxuan Zhang; Jiaming Liu; Haofan Wang; Yiren Song; |
| 184 | AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. |
Jirong Zha; Yuxuan Fan; Tianyu Zhang; Geng Chen; Yingfeng Chen; Chen Gao; Xinlei Chen; |
| 185 | DIMM: Decoupled Multi-hierarchy Kalman Filter Via Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a novel framework, DIMM, to effectively combine estimates from different motion models in each direction, thus increasing the target tracking accuracy. |
Jirong Zha; Yuxuan Fan; Kai Li; Han Li; Chen Gao; Xinlei Chen; |
| 186 | MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they do not consider the dynamic characteristics and temporal dependencies of video frames, as they perceive video understanding as a multi-frame task. To address these challenges, we propose MMG-Vid, a novel training-free visual token pruning framework that removes redundancy by Maximizing Marginal Gains at both segment-level and token-level. |
Junpeng Ma; Qizhe Zhang; Ming Lu; Zhibin Wang; Qiang Zhou; Jun Song; Shanghang Zhang; |
| 187 | TOP-RL: Task-Optimized Progressive Token Pruning with Reinforcement Learning for Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods typically compress visual tokens either at the input stage or in early model layers, ignoring variations across tasks and depths. To address these limitations, we introduce TOP-RL, a Task-Optimized Progressive token pruning framework based on Reinforcement Learning. |
Hengyi Wang; Weiying Xie; Hui Jiang; Yaotao Wei; Kai Jiang; Mingxiang Cao; Chenhe Hao; Leyuan Fang; |
| 188 | 10 Open Challenges Steering The Future of Vision-Language-Action Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we discuss 10 principal milestones in the ongoing development of VLA models: multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. |
Soujanya Poria; Navonil Majumder; Chia-Yu Hung; Amir Ali Bagherzadeh; Chuan Li; Kenneth Kwok; Ziwei Wang; Cheston Tan; Jiajun Wu; David Hsu; |
| 189 | Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, a first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. |
Fangyuan Mao; Aiming Hao; Jintao Chen; Dongxia Liu; Xiaokun Feng; Jiashu Zhu; Meiqi Wu; Chubin Chen; Jiahong Wu; Xiangxiang Chu; |
| 190 | Difficulty-Aware Learning Curve Extrapolation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we argue that task difficulty is a crucial yet neglected dimension for robust LCE. |
Mengyang Li; Pinlong Zhao; |
| 191 | MMPG: MoE-based Adaptive Multi-Perspective Graph Fusion for Protein Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current GNN-based PRL methods typically rely on single-perspective graph construction strategies, which capture partial properties of residue interactions, resulting in incomplete protein representations. To address this limitation, we propose MMPG, a framework that constructs protein graphs from multiple perspectives and adaptively fuses them via Mixture of Experts (MoE) for PRL. |
Yusong Wang; Jialun Shen; Zhihao Wu; Yicheng Xu; Shiyin Tan; Mingkun Xu; Changshuo Wang; Zixing Song; Prayag Tiwari; |
| 192 | Multi-Metric Preference Alignment for Generative Speech Restoration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work investigates the challenges of applying preference-based post-training to this task, focusing on how to define a robust preference signal and curate high-quality data to avoid reward hacking. To address these challenges, we propose a multi-metric preference alignment strategy. |
Junan Zhang; Xueyao Zhang; Jing Yang; Yuancheng Wang; Fan Fan; Zhizheng Wu; |
| 193 | STrans: Spontaneous Architecture Evolution for Adaptive Time Series Forecasting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work introduces STrans (Spontaneous Transformer), a comprehensive neural architecture search framework for time series Transformers that simultaneously explores attention variants, normalization techniques, activation functions, and encoding operations. |
Haoyi Jia; |
| 194 | F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While optimized for generating responses from visual-textual inputs, these models lack explicit supervision to ensure semantic coherence within retrieved fragments. To address this, we propose F2RVLM, a generative retrieval model trained in a two-stage paradigm: (1) supervised fine-tuning to inject fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning with multi-objective rewards to encourage outputs with semantic precision, relevance, and contextual coherence. |
Hanbo Bi; Zhiqiang Yuan; Zexi Jia; Jiapei Zhang; Chongyang Li; Peixiang Luo; Ying Deng; Xiaoyue Duan; Jinchao Zhang; |
| 195 | Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Besides, a key theoretical insight reveals that the vision modality exhibits higher variance than its language counterpart in VLCL during the ZO optimization process, and we propose a modality-aware stabilized ZO strategy, which adopts gradient sign normalization in ZO and constrains vision modality perturbation to further improve performance. |
Ziwei Liu; Borui Kang; Wei Li; Hangjie Yuan; Yanbing Yang; Wenbin Li; Yifan Zhu; Tao Feng; Jun Luo; |
| 196 | TAPO: Dynamic Teacher and Perturbed Answer Injection for Policy Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, focusing solely on final answer correctness while ignoring the reasoning process, along with rigid length penalties, can hinder training stability and output quality. To address these issues, we introduce TAPO, a reinforcement learning framework that enhances optimization signals by modifying sampled completions within training groups. |
Maowei Jiang; Zihang Wang; Qi Wang; Peter Búš; Moquan Cheng; Yifan Wang; Quangao Liu; Ruiqi Li; Pengyu Zeng; Ruikai Liu; Alan Liang; Yansong Xu; Yusong Hu; Chaoran Zhang; Zhiyong Dong; |
| 197 | Query-Routed Activation Editing with Truth-hierarchical Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, previous editing approaches neglect the query‑specific inference pathways that require tailored truthful steering vectors, resulting in suboptimal hallucination mitigation. To address these issues, we propose the Query-Routed Activation Editing (QRAE) framework, which comprises Divergence-sensitive Head Routing (DHR) and Truth-hierarchical Preference Steering (TPS), to fully leverage query-specific semantics for adaptive activation editing. |
Kewei Liao; Tianbo Wang; Yuqing Ma; Zhange Zhang; Zhicheng Geng; Xiaowei Zhao; Jiakai Wang; Xianglong Liu; |
| 198 | Better Datasets Start from RefineLab: Automatic Optimization for High-Quality Dataset Refinement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce RefineLab, the first LLM‑driven framework that automatically refines raw QA textual data into high-quality datasets under a controllable token‑budget constraint. |
Xiaonan Luo; Yue Huang; Ping He; Xiangliang Zhang; |
| 199 | RMO: Towards Better LLM Alignment Via Reshaping Reward Margin Distributions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose Reward Margin Optimization (RMO), a framework that reshapes reward margin distributions during training to improve alignment performance. |
Yanchi Ru; Yue Huang; Xiangliang Zhang; |
| 200 | GUIC: Certified Graph Unlearning with Individual Fairness Guarantees Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While much attention has focused on the technical feasibility of unlearning, its implications for fairness remain largely unexamined. To address this critical gap, this paper introduces GUIC, the first framework that jointly ensures certified unlearning and individual fairness in graph-based models, introducing a novel perspective on responsible model updates in graph unlearning. |
Zichong Wang; Tongliang Liu; Wenbin Zhang; |
| 201 | Fair Graph Learning with Limited Sensitive Attribute Information Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, we introduce an innovative fairness optimization strategy, propose a novel framework named FGLISA, and provide a theoretical perspective linking limited sensitive attribute information access to fairness objectives, thus enabling fair graph learning in real-world applications with limited sensitive attribute information. |
Zichong Wang; Jie Yang; Jun Zhuang; Puqing Jiang; Mingzhe Chen; Ye Hu; Wenbin Zhang; |
| 202 | Bridging The Language Gap: Uncovering and Aligning Shared Circuits for Multi-Hop Reasoning in Multilingual LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The underlying mechanisms for this knowledge gap have remained largely unexplored. In this work, we resolve this question by introducing a mechanistic interpretability framework that traces the causal pathways of multi-hop knowledge reasoning. |
Chenghao Sun; Zhen Huang; Yonggang Zhang; Xinmei Tian; Xu Shen; Jieping Ye; |
| 203 | Towards High-Fidelity 3D Portrait Generation with Rich Details By Cross-View Prior-Aware Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We attribute this issue to the insufficient consideration of cross-view consistency during the diffusion process, resulting in significant disparities between different views and ultimately leading to blurred 3D representations. In this paper, we address this issue by comprehensively exploiting multi-view priors in both the conditioning and diffusion procedures to produce consistent, detail-rich portraits. |
Haoran Wei; Wencheng Han; Xingping Dong; Jianbing Shen; |
| 204 | SCoUT: A Framework for Structured Stereotype Analysis in Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce SCoUT (Stereotype Content-oriented Utility structure via Thurstonian modeling), a closed-loop framework that structurally models, explicitly probes, and functionally steers stereotype dimensions (warmth and competence) in LLMs. |
Jinxuan Wu; Bin Li; Xiangyang Xue; |
| 205 | Logic Unseen: Revealing The Logical Blindspots of Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, their capacity for logical understanding remains significantly underexplored, resulting in critical “logical blindspots” that limit their reliability in practical applications. To systematically diagnose this, we introduce LogicBench, a comprehensive benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios: images, videos, anomaly detection, and medical diagnostics. |
Yuchen Zhou; Jiayu Tang; Shuo Yang; Xiaoyan Xiao; Yuqin Dai; Wenhao Yang; Chao Gou; Xiaobo Xia; Tat-Seng Chua; |
| 206 | LLMs Unleashed: Generating Protocol Code from RFC Specifications Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This makes the automated parsing and comprehension of RFC documents a major challenge in network protocol research. To address this gap, we introduce large language models (LLMs) into the task of automatic network protocol code generation from RFC documents (RFC2Code) and propose a comprehensive evaluation framework to quantitatively assess LLM performance. |
Junfeng Long; Jinshu Su; Biao Han; |
| 207 | Rethinking Irregular Time Series Forecasting: A Simple Yet Effective Baseline Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Second, most existing methods are typically complex and resource-intensive. In this study, we propose a general framework called APN to address these challenges. |
Xvyuan Liu; Xiangfei Qiu; Xingjian Wu; Zhengyu Li; Chenjuan Guo; Jilin Hu; Bin Yang; |
| 208 | Efficient Reasoning for Large Reasoning Language Models Via Certainty-Guided Reflection Suppression Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Certainty-Guided Reflection Suppression (CGRS), a novel method that mitigates overthinking in LRLMs while maintaining reasoning accuracy. |
Jiameng Huang; Baijiong Lin; Guhao Feng; Jierun Chen; Di He; Lu Hou; |
| 209 | Attack The Messages, Not The Agents: A Multi-round Adaptive Stealthy Tampering Framework for LLM-MAS Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose MAST, a Multi-round Adaptive Stealthy Tampering framework designed to exploit communication vulnerabilities within the system. |
Bingyu Yan; Xiaoming Zhang; Ziyi Zhou; Chaozhuo Li; Ruilin Zeng; Yirui Qi; Tianbo Wang; Litian Zhang; |
| 210 | FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate v; (2) the local overfitting of AdamW may cause client drift; and (3) Reinitializing moment estimates (v, m) at each round slows down convergence. To address these challenges, we propose the first Federated AdamW algorithm, called FedAdamW, for training and fine-tuning various large models. |
Junkang Liu; Fanhua Shang; Hongying Liu; Yuxuan Tian; Yuanyuan Liu; Jin Liu; Kewen Zhu; Zhouchen Lin; |
| 211 | When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs’ Toxicity Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While usually associated with friendliness or playfulness, it is observed that emojis may trigger toxic content generation in LLMs. Motivated by this observation, we aim to investigate: (1) whether emojis can clearly enhance the toxicity generation in LLMs and (2) how to interpret this phenomenon. |
Shiyao Cui; Xijia Feng; Yingkang Wang; Junxiao Yang; Zhexin Zhang; Biplab Sikdar; Hongning Wang; Han Qiu; Minlie Huang; |
| 212 | IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces IndexTTS2, which proposes a novel, general, and autoregressive model-friendly method for speech duration control. |
Siyi Zhou; Yiquan Zhou; Yi He; Xun Zhou; Jinchao Wang; Wei Deng; Jingchen Shu; |
| 213 | Towards OOD Generalization in Dynamic Graphs Via Causal Invariant Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although several attempts have been made to tackle these challenges, none has successfully addressed all three simultaneously, and they face various limitations in complex OOD scenarios. To solve these issues, we propose a Dynamic graph Causal Invariant Learning (DyCIL) model for OOD generalization via exploiting invariant spatio-temporal patterns from a causal view. |
Xinxun Zhang; Pengfei Jiao; Mengzhou Gao; Tianpeng Li; Xuan Guo; |
| 214 | Oscillation Inversion: Training-Free Image and Video Enhancement Through Oscillated Latents in Large Flow Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We offer theoretical insights, showing that this behavior arises from oscillatory dynamics in flow models. Building on this understanding, we introduce a simple and fast distribution transfer technique that facilitates training-free image and video editing/enhancement. |
Yan Zheng; Zhenxiao Liang; Xiaoyan Cong; Yi Yang; Lanqing Guo; Yuehao Wang; Peihao Wang; Zhangyang Wang; |
| 215 | Domain-Auxiliary Infrared Moving Small Target Detection By Learning to Overlook Domain Discrepancy Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, contrary to expectations, it is found that simply adding auxiliary samples is not always effective and can even cause performance decline, due to existing infrared domain shift. To overcome this unexpected problem, we propose the first infrared moving small target detection framework with domain-auxiliary supports by Learning to Overlook Domain Discrepancy (Loddis). |
Shengjia Chen; Luping Ji; Shuang Peng; Sicheng Zhu; Mao Ye; |
| 216 | Investigating Prosocial Behavior Theory in LLM Agents Under Policy-Induced Inequities Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing research has primarily relied on static, economically framed paradigms, lacking models that capture the dynamic evolution of prosociality and its sensitivity to structural inequities. To address these gaps, we introduce ProSim, a simulation framework for modeling the prosocial behavior in LLM agents across diverse social conditions. |
Yujia Zhou; Hexi Wang; Qingyao Ai; Zhen Wu; Yiqun Liu; |
| 217 | A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While Large Language Models (LLMs) are powerful, their application in this context faces key challenges: cognitive stimulation dialogue paradigms, a lack of therapeutic reasoning, and static-only user modeling. To address these issues, we propose a principle-driven adaptive policy actualized through a Group Cognitive Stimulation Dialogue (GCSD) system. |
Jiyue Jiang; Yanyu Chen; Pengan Chen; Kai Liu; Jingqi Zhou; Zheyong Zhu; He Hu; Fei Ma; Qi Tian; Chuan Wu; |
| 218 | A Hybrid Space Model for Misaligned Multi-modality Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While Euclidean space excels at preserving local structural details and supporting efficient computation, hyperbolic space is naturally suited for modeling hierarchical relationships due to its geometric properties. Building upon these observations, this paper proposes a unified framework that jointly optimizes image registration and fusion through a dual-space architecture. |
Yi Xiao; Jia Wang; Zhu Liu; Di Wang; Jinyuan Liu; Risheng Liu; |
| 219 | IdentityStory: Taming Your Identity-Preserving Generator for Human-Centric Story Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents IdentityStory, a framework for human-centric story generation that ensures consistent character identity across multiple sequential images. |
Donghao Zhou; Jingyu Lin; Guibao Shen; Quande Liu; Jialin Gao; Lihao Liu; Lan Du; Cunjian Chen; Chi-Wing Fu; Xiaowei Hu; Pheng-Ann Heng; |
| 220 | CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we propose CANDI (Curated test-time adaptation for multivariate time-series ANomaly detection under DIstribution shift), a novel TTA framework that selectively identifies and adapts to potential false positives while preserving pre-trained knowledge. |
HyunGi Kim; Jisoo Mok; Hyungyu Lee; Juhyeon Shin; Sungroh Yoon; |
| 221 | LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation. |
Xiaoran Liu; Yuerong Song; Zhigeng Liu; Zengfeng Huang; Qipeng Guo; Ziwei He; Xipeng Qiu; |
| 222 | AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce AdvBDGen, a generative fine-tuning framework that automatically creates prompt-specific paraphrases as triggers, enabling stealthier and more resilient backdoor attacks in LLM alignment. |
Pankayaraj Pathmanathan; Udari Madhushani Sehwag; Michael-Andrei Panaitescu-Liess; Cho-Yu Jason Chiang; Furong Huang; |
| 223 | IntentMotion: Learning Intent-Aware Human Motion from Language in 3D Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods often overlook realistic physical contact, resulting in visually plausible but physically unrealistic motion, e.g., penetration. To alleviate this, we propose IntentMotion, a novel framework that generates human motion in 3D scenes from natural language instructions by explicitly modeling intent. |
Wenfeng Song; Shi Zheng; Xinyu Zhang; Xingliang Jin; Aimin Hao; Fei Hou; Xia Hou; Shuai Li; |
| 224 | Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing models often neglect the affordance shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. |
Hanqing Wang; Shaoyang Wang; Yiming Zhong; Zemin Yang; Jiamin Wang; Zhiqing Cui; Jiahao Yuan; Yifan Han; Mingyu Liu; Yuexin Ma; |
| 225 | CO²IF: Language-Bridging Hyperspectral-Multispectral Image Fusion with Coordinated and Cross-modal Optimal Transport Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recognizing that textual scene descriptions encapsulate valuable object attributes and contextual information, we introduce the first Language-Bridging framework for Hyperspectral and Multispectral image fusion (CO²IF). |
Mingjin Zhang; Zhongkai Yang; Fei Gao; |
| 226 | Unveiling The Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy Via Texture-Constrained Perturbations and Cross-Modal Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce Multi-Modal Adversarial Synergy (MMAS), a groundbreaking framework that crafts universal, black-box multi-modal attacks against LVLMs. |
Xiang Fang; Wanlong Fang; Changshuo Wang; |
| 227 | Taming The Phantom: Token-Asymmetric Filtering for Hallucination Mitigation in Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, interaction patterns beyond the modality level remain insufficiently explored. In this paper, we conduct a token-level analysis and identify two key phenomena: (1) a small subset of textual tokens in LVLMs exert disproportionate influence in the visual-active layers, surpassing that of the visual modality and potentially misleading visual understanding; (2) while LVLMs can correctly identify key visual information, insufficient focus on these cues can sometimes lead to hallucinations. |
Shuyi Ouyang; Hongyi Wang; Gongfan Fang; Xinyin Ma; Lanfen Lin; Xinchao Wang; |
| 228 | PriorDrive: Enhancing Online HD Mapping with Unified Vector Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The online construction of HD maps using on-board sensors has emerged as a promising solution; however, these methods can be impeded by incomplete data due to occlusions and inclement weather, while their performance in distant regions remains unsatisfying. This paper proposes PriorDrive to address these limitations by directly harnessing the power of various vectorized prior maps, significantly enhancing the robustness and accuracy of online HD map construction. |
Shuang Zeng; Xinyuan Chang; Xinran Liu; Yujian Yuan; Shiyi Liang; Zheng Pan; Mu Xu; Xing Wei; |
| 229 | Driving with Regulation: Trustworthy and Interpretable Decision-Making for Autonomous Driving with Retrieval-Augmented Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present an interpretable, regulation-aware decision-making framework, DriveReg, which enables autonomous vehicles to understand and adhere to region-specific traffic laws and safety guidelines. |
Tianhui Cai; Yifan Liu; Zewei Zhou; Haoxuan Ma; Seth Z. Zhao; Zhiwen Wu; Xu Han; Zhiyu Huang; Jiaqi Ma; |
| 230 | Tracking and Segmenting Anything in Any Modality Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. |
Tianlu Zhang; Qiang Zhang; Guiguang Ding; Jungong Han; |
| 231 | Multimodal Table Understanding with Difficulty-aware Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our empirical analysis reveals that the performance of leading Multimodal Large Language Models (MLLMs) deteriorates markedly as table complexity increases, exposing a critical vulnerability in their ability to perceive and reason over intricate tabular data. To address this challenge, we propose MM-Table-R1, a model enhanced through a difficulty-aware reinforcement learning (RL) post-training strategy. |
Chaohu Liu; Haoyu Cao; YongXiang Hua; Linli Xu; |
| 232 | Reward Redistribution Via Gaussian Process Likelihood Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing reward redistribution methods typically assume that per-step rewards are independent, thus overlooking interdependencies among state–action pairs. In this paper, we propose a Gaussian Process-based Likelihood Reward Redistribution (GP-LRR) framework that addresses this issue by modeling the reward function as a sample from a Gaussian Process (GP), which explicitly captures dependencies between state–action pairs through the kernel function. |
Minheng Xiao; Xian Yu; |
| 233 | Template-Theorems Graph Construction to Enhance Mathematical Reasoning Capabilities of LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In contrast to humans—who can effectively draw upon prior experiences in solving similar problems and retrieve relevant knowledge and theorems from memory—LLMs often struggle to accurately identify analogous problems and to recall or apply appropriate theorems. To overcome these limitations, we introduce a novel framework for constructing a template-theorems knowledge base, leveraging the capabilities of large language models. |
Yarong Lan; Yajing Xu; Huajun Chen; |
| 234 | Personalize Your Gaussian: Consistent 3D Scene Personalization from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Lacking the mechanisms to effectively expand reference information beyond the original view, existing methods of image-conditioned 3DGS personalization often suffer from this viewpoint bias and struggle to produce consistent results. Therefore, in this paper, we present Consistent Personalization for 3D Gaussian Splatting (CP-GS), a framework that progressively propagates the single-view reference appearance to novel perspectives. |
Yuxuan Wang; Xuanyu Yi; Qingshan Xu; Yuan Zhou; Long Chen; Hanwang Zhang; |
| 235 | OwlCap: Harmonizing Motion-Detail for Video Captioning Via HMD-270K and Caption Set Equivalence Reward Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This imbalance results in incomplete captions, which in turn leads to a lack of consistency in video understanding and generation. To address this issue, we propose solutions from two aspects: 1) Data aspect: We constructed the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline: Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). |
Chunlin Zhong; Qiuxia Hou; Zhangjun Zhou; Yanhao Zhang; Shuang Hao; Haonan Lu; He Tang; Xiang Bai; |
| 236 | Evidence-aware Integration and Domain Identification of Spatial Transcriptomics Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In addition, domain identification is a fundamental and critical task in ST, but commonly used models that separate expression learning and clustering often struggle to learn cluster-friendly latent representations effectively. To address these issues, we propose PREST, a prototype-based evidence-aware integration framework for ST data. |
Wei Zhang; Siyu Yi; Lezhi Chen; Yifan Wang; Ziyue Qiao; Yongdao Zhou; Wei Ju; |
| 237 | Image-Text Knowledge Modeling for Unsupervised Multi-Scenario Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose unsupervised multi-scenario (UMS) person re-identification (ReID) as a new task that expands ReID across diverse scenarios (cross-resolution, clothing change, etc.) within a single coherent framework. |
Zhiqi Pang; Lingling Zhao; Yang Liu; Chunyu Wang; Gaurav Sharma; |
| 238 | Feature Attribution for Human Sensing with Radio Signals Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes a Matryoshka-like saliency method, MatryMask, an initial exploration of feature attribution for human sensing with radio signals. |
Shuokang Huang; Julie McCann; |
| 239 | LiDAR-GS++: Improving LiDAR Gaussian Reconstruction Via Diffusion Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these methods exhibit artifacts in extrapolated novel view synthesis due to the incomplete reconstruction from single traversal scans. To address this limitation, we present LiDAR-GS++, a LiDAR Gaussian Splatting reconstruction method enhanced by diffusion priors for real-time and high-fidelity re-simulation on public urban roads. |
Qifeng Chen; Jiarun Liu; Rengan Xie; Tao Tang; Sicong Du; Yiru Zhao; Yuchi Huo; Sheng Yang; |
| 240 | Probing Semantic Insensitivity for Inference-Time Backdoor Defense in Multimodal Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we systematically analyze backdoor vulnerabilities in MLLMs under the FTaaS paradigm, revealing two key phenomena: (1) markedly reduced sensitivity to textual variations when a visual trigger is present, and (2) abnormally stable model confidence even under strong semantic perturbations. |
Xuankun Rong; Wenke Huang; Wenzheng Jiang; Yiming Li; Wenxuan Wang; Mang Ye; |
| 241 | Reasoning Shapes Alignment: Investigating Cultural Alignment in Large Reasoning Models with Cultural Norms Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents the Cultural Norm-based Cultural Alignment (CNCA) framework, which enables models to leverage their powerful reasoning ability to align with cultural norms. Specifically, we propose three methods to automatically mine cultural norms from limited survey data and explore ways to effectively utilize these norms for improving cultural alignment. |
Yuhang Wang; Yanxu Zhu; Jitao Sang; |
| 242 | Multi-agent In-context Coordination Via Decentralized Memory Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, in cooperative Multi-Agent Reinforcement Learning (MARL), where agents must coordinate toward a shared goal, decentralized policy deployment can lead to mismatches in task alignment and reward assignment, limiting the efficiency of policy adaptation. To address this challenge, we introduce Multi-agent In-context Coordination via Decentralized Memory Retrieval (MAICC), a novel approach designed to enhance coordination by fast adaptation. |
Tao Jiang; Zichuan Lin; Lihe Li; Yi-Chen Li; Cong Guan; Lei Yuan; Zongzhang Zhang; Yang Yu; Deheng Ye; |
| 243 | From Points to Coalitions: Hierarchical Contrastive Shapley Values for Prioritizing Data Samples Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Hierarchical Contrastive Data Valuation (HCDV), a three-stage framework that (i) learns a contrastive, geometry-preserving representation, (ii) organizes the data into a balanced coarse-to-fine hierarchy of clusters, and (iii) assigns Shapley-style pay-offs to coalitions via local Monte-Carlo games whose budgets are propagated downward. |
Canran Xiao; Jiabao Dou; Zhiming Lin; Zong Ke; Liwei Hou; |
| 244 | ProCrop: Learning Aesthetic Image Cropping from Professional Compositions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce ProCrop, a retrieval-based method that leverages professional photography to guide cropping decisions. |
Ke Zhang; Tianyu Ding; Jiachen Jiang; Tianyi Chen; Ilya Zharkov; Vishal M. Patel; Luming Liang; |
| 245 | SEAP: Sparse Expert Activation Pruning Unlocks The Brainpower of Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Sparse Expert Activation Pruning (SEAP), a training-free pruning method for large language models. |
Xun Liang; Hanyu Wang; Huayi Lai; Simin Niu; Shichao Song; Jiawei Yang; Jihao Zhao; Feiyu Xiong; Bo Tang; Zhiyu Li; |
| 246 | SpatialLogic-Bench: A Diagnostic Benchmark for Task-Oriented Spatiotemporal Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To mitigate temporal dependency biases, we introduce a dual-task paradigm, presenting image pairs in both chronological and reversed orders while keeping task descriptions consistent. |
Xiaoda Yang; Shenzhou Gao; Can Wang; Jiahe Zhang; Menglan Tang; Jingyang Xue; Sheng Liu; Peijian Zhang; Yao Mu; Xiangyu Yue; |
| 247 | LLMdoctor: Token-Level Flow-Guided Preference Optimization for Efficient Test-Time Alignment of Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces LLMdoctor, a novel framework for efficient test-time alignment that operates via a patient-doctor paradigm. |
Tiesunlong Shen; Rui Mao; Jin Wang; Heming Sun; Jian Zhang; Xuejie Zhang; Erik Cambria; |
| 248 | Explainable Synthetic Image Detection Through Diffusion Timestep Ensembling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we empirically show that different timesteps of DDIM inversion reveal varying subtle distinctions between synthetic and real images that are extractable for detection, taking forms such as Fourier power spectrum high-frequency discrepancies and inter-pixel variance distributions. |
Yixin Wu; Feiran Zhang; Tianyuan Shi; Ruicheng Yin; Zhenghua Wang; Zhenliang Gan; Xiaohua Wang; Changze Lv; Xiaoqing Zheng; Xuanjing Huang; |
| 249 | Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing evaluations are limited by the static assessment that focuses on knowledge tests, the single perspective that centers on user experience, and the open-loop framework that lacks actionable feedback. To address these issues, we propose Ψ-Arena, an interactive framework for comprehensive assessment and optimization of LLM-based counselors, featuring three key characteristics: (1) Realistic arena interactions that simulate real-world counseling through multi-stage dialogues with psychologically profiled NPC clients; (2) Tripartite evaluation that integrates assessments from the client, supervisor, and counselor perspectives; (3) Closed-loop optimization that iteratively improves LLM counselors using diagnostic feedback. |
Shijing Zhu; Zhuang Chen; Guanqun Bi; Binghang Li; Yaxi Deng; Dazhen Wan; Libiao Peng; Xiyao Xiao; Rongsheng Zhang; Tangjie Lv; Zhipeng Hu; FangFang Li; Minlie Huang; |
| 250 | Unveiling The Landscape of Clinical Depression Assessment: From Behavioral Signatures to Psychiatric Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we aim to unveil the landscape of clinical depression assessment. |
Zhuang Chen; Guanqun Bi; Wen Zhang; Jiawei Hu; Aoyun Wang; Xiyao Xiao; Kun Feng; Minlie Huang; |
| 251 | Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, describing facial expressions within videos poses two major challenges for these models: (1) the lack of adequate datasets and benchmarks, and (2) the limited visual token capacity of video MLLMs. To address these issues, this paper introduces a new instruction-following dataset tailored for dynamic facial expression captioning. |
Jiaxing Zhao; Boyuan Sun; Xiang Chen; Xihan Wei; |
| 252 | Can Editing LLMs Inject Harm? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset EditAttack. |
Canyu Chen; Baixiang Huang; Zekun Li; Zhaorun Chen; Shiyang Lai; Xiongxiao Xu; Jia-Chen Gu; Jindong Gu; Huaxiu Yao; Chaowei Xiao; Xifeng Yan; William Yang Wang; Philip Torr; Dawn Song; Kai Shu; |
| 253 | Beyond The Horizon: Decoupling Multi-View UAV Action Recognition Via Partial Order Transfer Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Unlike ground-based scenarios, UAVs capture actions from diverse altitudes, resulting in pronounced appearance discrepancies and reduced recognition robustness. To address this, we introduce a multi-view formulation tailored for UAV altitudes and empirically uncover a distinctive partial order among views, where recognition accuracy consistently declines as altitude increases. |
Wenxuan Liu; Zhuo Zhou; Xuemei Jia; Siyuan Yang; Wenxin Huang; Xian Zhong; Chia-Wen Lin; |
| 254 | FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Realistic and visually appealing details are typically reflected in high-resolution outputs, further amplifying computational demands—especially for single-stage DiT models. To address these challenges, we propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. |
Shilong Zhang; Wenbo Li; Shoufa Chen; Chongjian GE; Peize Sun; Yifu Zhang; Yi Jiang; Zehuan Yuan; Bingyue Peng; Ping Luo; |
| 255 | Generating Attribute-Aware Human Motions from Textual Prompt Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: For evaluation, we introduce a comprehensive dataset containing attribute annotations for text-motion pairs, setting the first benchmark for attribute-aware motion generation. |
Xinghan Wang; Kun Xu; Fei Li; Cao Sheng; JiaZhong Yu; Yadong Mu; |
| 256 | Pushing Rendering Boundaries: Hard Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In detail, we present positional gradient driven HGS, which leverages multi-view significant positional gradients to uncover hard Gaussians. |
Qingshan Xu; Jiequan Cui; Xuanyu Yi; Yuxuan Wang; Yuan Zhou; Yew-Soon Ong; Hanwang Zhang; |
| 257 | NeuSpring: Neural Spring Fields for Reconstruction and Simulation of Deformable Objects from Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we aim to create physical digital twins of deformable objects under interaction. |
Qingshan Xu; Jiao Liu; Shangshu Yu; Yuxuan Wang; Yuan Zhou; Junbao Zhou; Jiequan Cui; Yew-Soon Ong; Hanwang Zhang; |
| 258 | Geometry-Aware Stereo Matching Via Monocular Disparity Distribution Prior and Gradient Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite impressive progress, existing methods lack sufficient correlation priors in ill-posed regions such as occlusions and detailed or reflective regions. In this paper, we propose Geometry Aware Stereo Matching Network (GEAStereo) to enhance geometric structure perception and address this issue. |
Junze Zhang; Luoxi Jing; Yuanyuan Wang; Xueqi Li; Guoli Yang; Songchang Jin; Chunping Qiu; |
| 259 | CHASE: Contextual History for Adaptive and Simple Exploitation in Large Language Model Jailbreaking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Contextual History for Adaptive and Simple Exploitation (CHASE), a novel multi-turn method for Large Language Model (LLM) jailbreaking. |
Zhiqiang Hao; Chuanyi Li; Ye Fan; Jun Cai; Xiao Fu; Shangqi Wang; Hao Shen; Jiao Yin; Jidong Ge; Bin Luo; Vincent Ng; |
| 260 | InterMoE: Individual-Specific 3D Human Interaction Generation Via Dynamic Temporal-Selective MoE Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods often fail to preserve unique individual characteristics or fully adhere to textual descriptions. To address these challenges, we introduce InterMoE, a novel framework built on a Dynamic Temporal-Selective Mixture of Experts. |
Lipeng Wang; Hongxing Fan; Haohua Chen; Zehuan Huang; Lu Sheng; |
| 261 | Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we introduce All-Angles Bench, a carefully human-annotated benchmark with over 2,100 question-answer pairs from 90 diverse, real-world scenes. |
Chun-Hsiao Yeh; Chenyu Wang; Shengbang Tong; Ta-Ying Cheng; Ruoyu Wang; Tianzhe Chu; Yuexiang Zhai; Yubei Chen; Shenghua Gao; Yi Ma; |
| 262 | Scaling-up Perceptual Video Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, in the domain of perceptual video quality assessment (VQA), the potential of data scaling remains largely unexplored due to the scarcity of labeled resources and the insufficient scale of datasets. To address this, we propose OmniVQA, a framework designed to efficiently build high-quality, machine-dominated synthetic multi-modal instruction databases (MIDBs) for VQA. |
Ziheng Jia; Zicheng Zhang; Xiaorong Zhu; Chunyi Li; Jinliang Han; Xiaohong Liu; Guangtao Zhai; Xiongkuo Min; |
| 263 | Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Furthermore, these methods typically fine-tune directly on downstream IQA tasks without explicitly enhancing the model’s native low-level visual quality perception, which may constrain its performance upper bound. In response to these gaps, we propose the multi‐stage RFT IQA framework (Refine-IQA). |
Ziheng Jia; Jiaying Qian; Zicheng Zhang; Zijian Chen; Xiongkuo Min; |
| 264 | DenoDet V2: Phase-Amplitude Cross Denoising for SAR Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose DenoDet V2, which explores a completely novel and different perspective to deconstruct and modulate the features in the transform domain via a carefully designed attention architecture. |
Kang Ni; Minrui Zou; Yuxuan Li; Xiang Li; Kehua Guo; Ming-Ming Cheng; Yimian Dai; |
| 265 | Marginalized Generalized IoU (MGIoU): A Unified Objective Function for Optimizing Convex Parametric Shapes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We extend MGIoU to MGIoU+ that supports optimizing unstructured convex shapes. |
Duy-Tho Le; Trung Pham; Jianfei Cai; Hamid Rezatofighi; |
| 266 | AURORA: Augmented Understanding Via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Furthermore, jointly training for reasoning and segmentation can compromise pixel-level precision. To address these issues, we introduce AURORA, a novel framework designed to enhance genuine reasoning and language comprehension in reference audio-visual segmentation. |
Ziyang Luo; Nian Liu; Fahad Shahbaz Khan; Junwei Han; |
| 267 | DeepOR: A Deep Reasoning Foundation Model for Optimization Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce DeepOR, the first RLLM specifically designed for optimization modeling. |
Ziyang Xiao; Yuan Jessica Wang; Xiongwei Han; Shisi Guan; Jingyan Zhu; Jingrong Xie; Lilin Xu; Han Wu; Wing Yin Yu; Zehua Liu; Xiaojin Fu; Gang Chen; Dongxiang Zhang; |
| 268 | KnowLCP: Knowledge Augmented Lane Change Prediction for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by the emerging paradigm of enhancing model generalizability through domain knowledge, we propose KnowLCP to explicitly model and integrate driving knowledge into the lane change prediction task. |
Yuhuan Lu; Pengpeng Xu; Wei Wang; Zhen Zhang; Han Liu; Xiping Hu; |
| 269 | Lifelong Domain Adaptive 3D Human Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing domain adaptation (DA) paradigms like general DA and source-free DA for 3D HPE overlook the issues of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. |
Qucheng Peng; Hongfei Xue; Pu Wang; Chen Chen; |
| 270 | Many Minds, One Path: LLM-Augmented Consensus Decision for Distributed Control in Multi-Agent Collaborative Stable Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper presents LLMASC, a framework designed to enhance long-term stability in multi-agent collaboration by combining semantic reasoning with decentralized control. |
Zhuohao Yu; Zhe Liu; Tao Ren; Chenxue Wang; Junjie Wang; Qing Wang; |
| 271 | DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While recent State Space Models (SSMs) show promise in precise temporal reasoning, their use in TFL is hindered by ambiguous boundaries, sparse forgeries, and limited long-range modeling. We propose DeformTrace, which enhances SSMs with deformable dynamics and relay mechanisms to address these challenges. |
Xiaodong Zhu; Suting Wang; Yuanming Zheng; Junqi Yang; Yangxu Liao; Yuhong Yang; Weiping Tu; Zhongyuan Wang; |
| 272 | Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces AC3 (Actor-Critic for Continuous Chunks), a novel RL framework that learns to generate high-dimensional, continuous action sequences. |
Jiarui Yang; Bin Zhu; Jingjing Chen; Yu-Gang Jiang; |
| 273 | SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we explore the combination of LLMs with the human skeleton to perform action classification and description. |
Qilang Ye; Yu Zhou; Lian He; Jie Zhang; Xuanming Guo; Jiayu Zhang; Mingkui Tan; Weicheng Xie; Yue Sun; Tao Tan; Xiaochen Yuan; Ghada Khoriba; Zitong Yu; |
| 274 | DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: When adapting to MHQA tasks, it cannot handle the graph structure constraints in KGs and capture the potential inconsistency between CoT and KG paths. To address these limitations, we propose the DPRM (Dual Implicit Process Reward Model). |
Xinyi Wang; Yiping Song; Zhiliang Tian; Bo Liu; Tingjin Luo; Minlie Huang; |
| 275 | MDiff4STR: Mask Diffusion Model for Scene Text Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We show that vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. |
Yongkun Du; Miaomiao Zhao; Songlin Fan; Zhineng Chen; Caiyan Jia; Yu-Gang Jiang; |
| 276 | Towards Robust Text-Attributed Federated Graph Learning: Multimodal Threats and Defense Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce GTAE, a novel attack framework that cascades influence-guided topological perturbations and embedding-level text refinements to generate transferable, modality-agnostic adversarial inputs. |
Zitong Shi; Guancheng Wan; Wenke Huang; Yuxin Wu; Quan Zhang; Mang Ye; |
| 277 | Concept-RuleNet: Grounded Multi-Agent Neurosymbolic Reasoning in Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce a multi-agent system – Concept-RuleNet that reinstates visual grounding while retaining transparent reasoning. |
Sanchit Sinha; Guangzhi Xiong; Zhenghao He; Aidong Zhang; |
| 278 | AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Alternatively, curriculum learning strategies have been explored but frequently encounter challenges, such as difficulty mismatch, reliance on manual curriculum design, and catastrophic forgetting. To address these issues, we propose AdaCuRL, an Adaptive Curriculum Reinforcement Learning framework that integrates coarse-to-fine difficulty estimation with adaptive curriculum scheduling. |
Renda Li; Hailang Huang; Fei Wei; Feng Xiong; Yong Wang; Xiangxiang Chu; |
| 279 | Temporal Calibrating and Distilling for Scene-Text Aware Text-Video Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address them, we propose a temporal scene-text calibrating and distilling (TCD) network for text-video retrieval. |
Zhiqian Zhao; Liang Li; Lei Shen; Xichun Sheng; Yaoqi Sun; Fang Kang; Chenggang Yan; |
| 280 | Principled Analysis of Deep Reinforcement Learning Evaluation and Design Paradigms Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we focus on the key ingredients of this research progress and we analyze the canonical evaluation and design paradigms in reinforcement learning. |
Ezgi Korkmaz; |
| 281 | Interest-driven Deep Multi-modal Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by the concept of interest in the recommendation system, we propose a novel interest-driven deep multi-modal clustering (IDMC) framework. |
Guoliang Zou; Tongji Chen; Sijia Li; Jin Qin; Yangdong Ye; Shizhe Hu; |
| 282 | Neural Video Compression with Reference Hierarchy Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we delve into the inter-frame reference management mechanism in neural video codecs (NVCs). |
Chuanbo Tang; Zhuoyuan Li; Li Li; Dong Liu; Feng Wu; |
| 283 | DialogXpert: Driving Intelligent and Emotion-Aware Conversations Through Online Value-Based Reinforcement Learning with LLM Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Large-language-model (LLM) agents excel at reactive dialogue but struggle with proactive, goal-driven interactions due to myopic decoding and costly planning. We introduce DialogXpert, which leverages a frozen LLM to propose a small, high-quality set of candidate actions per turn and employs a compact Q-network over fixed BERT embeddings trained via temporal-difference learning to select optimal moves within this reduced space. |
Tazeek Bin Abdur Rakib; Ambuj Mehrish; Lay-Ki Soon; Wern Han Lim; Soujanya Poria; |
| 284 | Language Models and Logic Programs for Trustworthy Tax Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose an approach that integrates LLMs with a symbolic solver to calculate tax obligations. |
William Jurayj; Nils Holzenberger; Benjamin Van Durme; |
| 285 | S²Drug: Bridging Protein Sequence and 3D Structure in Contrastive Representation Learning for Virtual Screening Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, directly integrating protein sequences poses challenges due to the redundancy and noise in large-scale protein-ligand datasets. To address these limitations, we propose S²Drug, a two-stage framework that explicitly incorporates protein Sequence information and 3D Structure context in protein-ligand contrastive representation learning. |
Bowei He; Bowen Gao; Yankai Chen; Yanyan Lan; Chen Ma; Philip S. Yu; Ya-Qin Zhang; Wei-Ying Ma; |
| 286 | UDA: Unsupervised Debiasing Alignment for Pair-wise LLM-as-a-Judge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Pairwise evaluation of Large Language Models (LLMs) is a common paradigm, but it is prone to preference bias, where judges systematically favor certain outputs, such as their own. … |
Yang Zhang; Cunxiang Wang; Lindong Wu; Wenbo Yu; Yidong Wang; Guangsheng Bao; Jie Tang; |
| 287 | ProbLog4Fairness: A Neurosymbolic Approach to Modeling and Mitigating Bias Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a set of templates to express different types of bias and show the versatility of our approach on synthetic tabular datasets with known biases. |
Rik Adriaensen; Lucas Van Praet; Jessa Bekker; Robin Manhaeve; Pieter Delobelle; Maarten Buyl; |
| 288 | Dual-Channel Learning Framework for Zero-Shot CircRNA-miRNA Interaction Prediction Via State Space Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, zero-shot prediction requires models to identify new interactions without relying on previously observed samples, imposing stringent requirements on generalization capabilities. To address these limitations, we propose a dual-channel learning framework leveraging State space modeling for Zero-shot CMI prediction (ZeroStem). |
Mengmeng Wei; Lei Wang; Zhu-Hong You; Pengwei Hu; Bowei Zhao; Zhi-An Huang; Yu-An Huang; Haicheng Yi; |
| 289 | Hyperbolic Continuous Structural Entropy for Hierarchical Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose Hyperbolic Continuous Structural Entropy neural networks, namely HypCSE, for structure-enhanced continuous hierarchical clustering. |
Guangjie Zeng; Hao Peng; Angsheng Li; Li Sun; Chunyang Liu; Shengze Li; Yicheng Pan; Philip S. Yu; |
| 290 | ASSIST-3D: Adapted Scene Synthesis for Class-Agnostic 3D Instance Segmentation Highlight: While synthetic data generation offers a promising solution, existing 3D scene synthesis methods fail to simultaneously satisfy geometry diversity, context complexity, and layout reasonability, each essential for this task. To address these needs, we propose an Adapted 3D Scene Synthesis pipeline for class-agnostic 3D Instance SegmenTation, termed as ASSIST-3D, to synthesize proper data for model generalization enhancement. |
Shengchao Zhou; Jiehong Lin; Jiahui Liu; Shizhen Zhao; Chirui Chang; Xiaojuan Qi; |
| 291 | TinyChemVL: Advancing Chemical Vision-Language Models Via Efficient Visual Token Reduction and Complex Reaction Tasks Highlight: In this work, we propose TinyChemVL, an efficient and powerful chemical VLM that leverages visual token reduction and reaction-level tasks to improve model efficiency and reasoning capacity. |
Xuanle Zhao; Shuxin Zeng; Xinyuan Cai; Xiang Cheng; Duzhen Zhang; Xiuyi Chen; Bo Xu; |
| 292 | CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving Highlight: End-to-end planning methods are the de-facto standard of the current autonomous driving system, while the robustness of the data-driven approaches suffers due to the notorious long-tail problem (i.e., rare but safety-critical failure cases). In this work, we explore whether recent diffusion-based video generation methods (a.k.a. world models), paired with structured 3D layouts, can enable a fully automated pipeline to self-correct such failure cases. |
Enhui Ma; Lijun Zhou; Tao Tang; Jiahuan Zhang; Junpeng Jiang; Zhan Zhang; Dong Han; Kun Zhan; Xueyang Zhang; Xianpeng Lang; Haiyang Sun; Xia Zhou; Di Lin; Kaicheng Yu; |
| 293 | Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy Highlight: Although training data are collected from diverse camera perspectives, the models typically predict end-effector poses within the robot base coordinate frame, resulting in spatial inconsistencies. To mitigate this limitation, we introduce the Observation-Centric VLA (OC-VLA) framework, which grounds action predictions directly in the camera observation space. |
Tianyi Zhang; Haonan Duan; Haoran Hao; Yu Qiao; Jifeng Dai; Zhi Hou; |
| 294 | DragNeXt: Rethinking Drag-Based Image Editing Highlight: Thus, by specifying the areas and types of geometric transformations, we can effectively address the ambiguity issue. |
Yuan Zhou; Junbao Zhou; Qingshan Xu; Kesen Zhao; Yuxuan Wang; Hao Fei; Richang Hong; Hanwang Zhang; |
| 295 | ReconVLA: Reconstructive Vision-Language-Action Model As Effective Robot Perceiver Highlight: To guide the visual attention grounding on the correct target, we propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. |
Wenxuan Song; Ziyang Zhou; Han Zhao; Jiayi Chen; Pengxiang Ding; Haodong Yan; Yuxin Huang; Feilong Tang; Donglin Wang; Haoang Li; |
| 296 | Benchmarking LLMs for Political Science: A United Nations Perspective Highlight: We introduce a novel dataset comprising publicly available UN Security Council (UNSC) records from 1994 to 2024, including draft resolutions, voting records, and diplomatic speeches. |
Yueqing Liang; Liangwei Yang; Chen Wang; Congying Xia; Rui Meng; Xiongxiao Xu; Haoran Wang; Ali Payani; Kai Shu; |
| 297 | Equivariant Atomic and Lattice Modeling Using Geometric Deep Learning for Crystal Structure Optimization Highlight: Machine learning (ML) has emerged to alleviate this bottleneck but suffers from two major limitations: (i) existing models operate mainly on atoms, leaving lattice vectors implicit despite their critical role in structural optimization; and (ii) they often rely on multi-stage, non-end-to-end workflows that are prone to error accumulation. Here, we present E³Relax, an end-to-end equivariant graph neural network that maps an unrelaxed crystal directly to its relaxed structure. |
Ziduo Yang; Yi-Ming Zhao; Xian Wang; Wei Zhuo; Xiaoqing Liu; Lei Shen; |
| 298 | Dual-View Inference Attack: Machine Unlearning Amplifies Privacy Exposure Highlight: From an information-theoretic perspective, we introduce the concept of privacy knowledge gain and demonstrate that the dual-view setting allows adversaries to obtain more information than querying either model alone, thereby amplifying privacy leakage. |
Lulu Xue; Shengshan Hu; Linqiang Qian; Peijin Guo; Yechao Zhang; Minghui Li; Yanjun Zhang; Dayong Ye; Leo Yu Zhang; |
| 299 | Selective Weak-to-Strong Generalization Highlight: In this paper, we propose a selective W2SG framework to avoid using weak supervision when unnecessary. |
Hao Lang; Fei Huang; Yongbin Li; |
| 300 | Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding Highlight: Existing robust MLLMs predominantly rely on implicit training/adaptation that focuses solely on visual encoder generalization, suffering from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. |
Jiaqi Tang; Jianmin Chen; Wei Wei; Xiaogang Xu; Runtao Liu; Xiangyu Wu; Qipeng Xie; Jiafei Wu; Lei Zhang; Qifeng Chen; |
| 301 | Policy Search, Retrieval, and Composition Via Task Similarity in Collaborative Agentic Systems Highlight: However, how to query, select, and retrieve policies from a pool of agents, and how to integrate such policies remains a largely unexplored area. This study explores how an agent decides what knowledge to select, from whom, and when and how to integrate it in its own policy in order to accelerate its own learning. |
Saptarshi Nath; Christos Peridis; Eseoghene Benjamin; Xinran Liu; Soheil Kolouri; Peter Kinnell; Zexin Li; Cong Liu; Shirin Dora; Andrea Soltoggio; |
| 302 | Small But Mighty: Dynamic Wavelet Expert-Guided Fine-Tuning of Large-Scale Models for Optical Remote Sensing Object Segmentation Highlight: In this paper, we propose a novel dynamic wavelet expert-guided fine-tuning paradigm with fewer trainable parameters, dubbed WEFT, which efficiently adapts large-scale foundation models to ORSIs segmentation tasks by leveraging the guidance of wavelet experts. |
Yanguang Sun; Chao Wang; Jian Yang; Lei Luo; |
| 303 | Mitigating Self-Preference By Authorship Obfuscation Highlight: In this paper, we investigate strategies to mitigate self-preference by reducing the LM judges’ ability to recognize their own outputs. |
Taslim Mahbub; Shi Feng; |
| 304 | Less Is More: Vision Representation Compression for Efficient Video Generation with Large Language Models Highlight: In this study, we investigate the impact of token redundancy in LLM-based video generation by information-theoretic analysis and propose Vision Representation Compression (VRC), a novel framework designed to achieve more in both performance and efficiency with less video token representations. |
Yucheng Zhou; Jihai Zhang; Guanjie Chen; Jianbing Shen; Yu Cheng; |
| 305 | AlignTrack: Top-Down Spatiotemporal Resolution Alignment for RGB-Event Visual Tracking Highlight: This methodological limitation impedes effective RGB-Event feature alignment and ultimately degrades tracking performance. To overcome this limitation, we propose AlignTrack, a novel tracking framework built upon a Top-Down Alignment (TDA) strategy inspired by the human visual system. |
Chuanyu Sun; Jiqing Zhang; Yang Wang; Yuanchen Wang; Yutong Jiang; Baocai Yin; Xin Yang; |
| 306 | Medical Image Segmentation with Minimal Labeling Effort: How Far Can We Push The Limits? Highlight: We demonstrate for the first time that a medical image segmentation model can achieve near fully supervised performance using only a single annotated image and abundant unlabeled data. We present MedSMILE, a novel framework that synergistically integrates transductive and inductive learning for this extreme one-label semi-supervised setting. |
Yizhe Zhang; |
| 307 | IdeFN: Identifying Unclicked Space False Negatives Via Relaxed Partial Optimal Transport for Conversion Rate Prediction Highlight: However, existing methods implicitly mislabel some unclicked samples with genuine conversion potential as negatives, thereby exacerbating the false negative sample (FNS) problem. To address this, we propose IdeFN, a multi‑task CVR framework that identifies false negatives in the unclicked space to enable CVR prediction across the entire exposure space and leverages CTR as an auxiliary task for shared‑parameter learning. |
Weiyi Zhong; Weiming Liu; Lianyong Qi; Xiaoran Zhao; Xiaolong Xu; Haolong Xiang; Yang Cao; Shichao Pei; Qiang Ni; |
| 308 | T-Rex-Omni: Integrating Negative Visual Prompt in Generic Object Detection Highlight: This positive-only paradigm experiences consistent vulnerability to visually similar but semantically different distractors. We propose T-Rex-Omni, a novel framework that addresses this limitation by incorporating negative visual prompts to negate hard negative distractors. |
Jiazhou Zhou; Qing Jiang; Kanghao Chen; Lutao Jiang; Yuanhuiyi Lyu; Ying-Cong Chen; Lei Zhang; |
| 309 | DISC: Dynamic Feature Selection for Cost-Sensitive Medical Diagnosis Highlight: However, these examinations vary significantly in cost—measured in time, money, or patient discomfort—creating a challenging trade-off between diagnostic accuracy and resource efficiency. To address this issue, we propose a dynamic diagnostic framework that incrementally selects medical examinations based on individual characteristics of each patient. |
Yu-sheng Li; Xincen Duan; Beili Wang; Wei Guo; Han-jia Ye; |
| 310 | Breaking The Trade-Off Between Faithfulness and Expressiveness for Large Language Models Highlight: In this work, to break the trade-off between faithfulness and expressiveness, we propose Collaborative Decoding (CoDe), a novel approach that dynamically integrates output probabilities generated with and without external knowledge. |
Chenxu Yang; Qingyi Si; Lanrui Wang; Zheng Lin; |
| 311 | MoSs: Mixture of Scales for Efficient High-Resolution Autoregressive Image Generation Highlight: Our systematic study uncovers two critical observations: (1) most image regions have stabilized during early drafting stages, making later refinement across the full-scale image token-inefficient; (2) different scales inherently trade off efficiency and fidelity, suggesting that adaptive token dispatch on different scales can focus resources where they yield the greatest quality gains. Motivated by these insights, we propose a training-free Mixture of Scales (MoSs) method for efficient high-resolution autoregressive image generation. |
Yaoxiu Lian; Hao Liang; Zhihong Gou; Yijia Zhang; Jiaming Xu; Guohao Dai; Ningyi Xu; |
| 312 | ICM-Fusion: In-Context Meta-Optimized LoRA Fusion for Multi-Task Adaptation Highlight: Consequently, when the weight data follows a long-tailed distribution, it can lead to forgetting in the fused weights. To address this issue, we propose In-Context Meta LoRA Fusion (ICM-Fusion), a novel framework that synergizes meta-learning with in-context adaptation. |
Yihua Shao; Xiaofeng Lin; Xinwei Long; Siyu Chen; Minxi Yan; Yang Liu; Ziyang Yan; Ao Ma; Hao Tang; Jingcai Guo; |
| 313 | TR-DQ: Time-Rotation Diffusion Quantization Highlight: At the same time, most current approaches fail to account for significant activations that cannot be eliminated, resulting in substantial performance degradation after quantization. To address these issues, we propose Time-Rotation Diffusion Quantization (TR-DQ), a novel quantization method incorporating time-step and rotation-based optimization. |
Yihua Shao; Deyang Lin; Minxi Yan; Siyu Chen; Fanhu Zeng; Minwen Liao; Ao Ma; Ziyang Yan; Haozhe Wang; Yan Wang; Zhi Chen; Xiaofeng Cao; Haotong Qin; Hao Tang; Jingcai Guo; |
| 314 | Membership Inference Attack Against Large Language Model-Based Recommendation Systems: A New Distillation-Based Paradigm Highlight: This paper introduces a novel knowledge distillation-based MIA paradigm tailored for LLM-based recommendation systems. |
Cuihong Li; Xiaowen Huang; Chuanhuan Yin; Jitao Sang; |
| 315 | R-Tuning: Wavelet-Decomposed Replay and Semantic Alignment for Continual Adaptation of Pretrained Time-Series Models Highlight: A key hurdle lies in accessing the original training data, as fine-tuning solely on new data often leads to catastrophic forgetting. To address this issue, we propose Replay Tuning (R-Tuning), a novel framework designed for the continual adaptation of pre-trained time-series models. |
Tianyi Yin; Jingwei Wang; Chenze Wang; Han Wang; Jiexuan Cai; Min Liu; Yunlong Ma; Kun Gao; Yuting Song; Weiming Shen; |
| 316 | Hierarchical Semantic Alignment for Image Clustering Highlight: However, these methods often overlook the inherent ambiguity of nouns, which can distort semantic representations and degrade clustering quality. To address this issue, we propose a hierarChical semAntic alignmEnt method for image clustering, dubbed CAE, which improves clustering performance in a training-free manner. |
Xingyu Zhu; Beier Zhu; Yunfan Li; Junfeng Fang; Shuo Wang; Kesen Zhao; Hanwang Zhang; |
| 317 | ArchetypeTrader: Reinforcement Learning for Selecting and Refining Learnable Strategic Archetypes in Quantitative Trading Highlight: While some approaches ignore demonstrations altogether, others rely on “optimal” yet overly granular trajectories or human-crafted strategies, both of which can overwhelm learning and introduce significant bias, resulting in high variance and significant profit losses. To address these problems, we propose ArchetypeTrader, a novel reinforcement learning framework that automatically selects and refines data-driven trading archetypes distilled from demonstrations. |
Chuqiao Zong; Molei Qin; Haochong Xia; Bo An; |
| 318 | Towards Real-Time Neutral Atom Array Assembly Via Unsupervised Hologram Generation and Path Optimization Highlight: While parallel rearrangement methods using spatial light modulators show promise, they suffer from significant overhead in two sub-tasks: atom-site matching and hologram generation. We propose a framework to address these bottlenecks and enhance the efficiency and fidelity of the assembly process. |
Ge Yan; Yuchen Wang; Junchi Yan; |
| 319 | MathSE: Improving Multimodal Mathematical Reasoning Via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning Highlight: This reliance on fixed teacher-derived datasets not only restricts the model’s ability to adapt to novel or more intricate questions that extend beyond the confines of the training data, but also lacks the iterative depth needed for robust generalization. To overcome these limitations, we propose MathSE, a Mathematical Self-Evolving framework for MLLMs. |
Jinhao Chen; Zhen Yang; Jianxin Shi; Tianyu Wo; Jie Tang; |
| 320 | ViCToR: Improving Visual Comprehension Via Token Reconstruction for Pretraining LMMs Highlight: Large Multimodal Models (LMMs) often face a modality representation gap during pretraining: while language embeddings remain stable, visual representations are highly sensitive to contextual noise (e.g., background clutter). To address this issue, we introduce a visual comprehension stage, which we call ViCToR (Visual Comprehension via Token Reconstruction), a novel pretraining framework for LMMs. |
Yin Xie; Kaicheng Yang; Peirou Liang; Xiang An; Yongle Zhao; Yumeng Wang; Ziyong Feng; Roy Miles; Ismail Elezi; Jiankang Deng; |
| 321 | Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models Highlight: To this end, we propose RACE (Reasoning and Answer Consistency Evaluation), a novel framework specifically tailored for hallucination detection in LRMs. |
Changyue Wang; Weihang Su; Qingyao Ai; Yiqun Liu; |
| 322 | CharBench: Evaluating The Role of Tokenization in Character-Level Tasks Highlight: We present an in-depth analysis of how intrinsic properties of words and their segmentations into tokens correspond to model performance. |
Omri Uzan; Yuval Pinter; |
| 323 | Reality Vs Counterfactual: Multi-World Contrastive Reinforcement Learning for Enhancing MLLM’s Theory of Mind in Egocentric Videos Highlight: In this paper, we propose a contrastive Reinforcement Learning (RL) paradigm that explicitly encourages models to leverage temporal and causal evolutionary patterns in user action sequences to infer the user’s mental states (goals, beliefs, and potential next actions). |
Guiyang Hou; Yihui Fu; Chen Wu; Xiang Huang; Zhe Zheng; Wenqi Zhang; Yongliang Shen; Weiming Lu; |
| 324 | LLMC+: Benchmarking Vision-Language Model Compression with A Plug-and-play Toolkit Highlight: (3) Isolated use of individual compression techniques, without exploring their joint potential. To overcome these gaps, we introduce LLMC+, a comprehensive VLM compression benchmark with a versatile, plug-and-play toolkit. |
Chengtao Lv; Bilang Zhang; Yang Yong; Ruihao Gong; Yushi Huang; Shiqiao Gu; Jiajun Wu; Yumeng Shi; Jinyang Guo; Wenya Wang; |
| 325 | StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model Highlight: However, these methods process the whole audio sequences in a single pass, which poses two major challenges: they tend to perform poorly when handling audio sequences that exceed the training horizon and will suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that outputs facial motions in a streaming manner. |
Yifan Yang; Zhi Cen; Sida Peng; Xiangwei Chen; Yifu Deng; Xinyu Zhu; Fan Jia; Xiaowei Zhou; Hujun Bao; |
| 326 | PASE: Leveraging The Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement Highlight: While language models (LMs) are well-suited for capturing the underlying speech structure through modeling the distribution of discrete tokens, existing approaches are limited in learning from noise-corrupted representations, which can lead to contaminated priors and hallucinations. To overcome these limitations, we propose the Phonologically Anchored Speech Enhancer (PASE), a generative SE framework that leverages the robust phonological prior embedded in the pre-trained WavLM model to mitigate hallucinations. |
Xiaobin Rong; Qinwen Hu; Mansur Yesilbursa; Kamil Wojcicki; Jing Lu; |
| 327 | From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven By Coarse-Grained Annotations Highlight: In contrast, the majority of existing weakly-supervised IML approaches are based on image-level labels, which greatly reduce annotation effort but typically lack precise spatial localization. To address this dilemma, we propose BoxPromptIML, a novel weakly-supervised IML framework that effectively balances annotation cost and localization performance. |
Zhiqing Guo; Dongdong Xi; Songlin Li; Gaobo Yang; |
| 328 | Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond Highlight: However, the connection between sharpness and generalization has not been fully explored for LoRA due to the lack of tools to either empirically seek flat minima or develop theoretical methods. In this work, we propose Flat Minima LoRA (FMLoRA) and its efficient version, EFMLoRA, to seek flat minima for LoRA. |
Jiaxin Deng; Qingcheng Zhu; Junbiao Pang; Linlin Yang; Zhongqian Fu; Baochang Zhang; |
| 329 | COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees Highlight: Previous research adopts the split conformal prediction (SCP) framework to ensure desired coverage of admissible answers by constructing data-driven prediction sets, yet these sets typically contain incorrect candidates, undermining their practical effectiveness. To address this, we introduce COIN, an uncertainty-guarding selection framework that calibrates statistically valid uncertainty thresholds to filter a single generated answer per question under user-specified FDR constraints. |
Zhiyuan Wang; Jinhao Duan; Qingni Wang; Xiaofeng Zhu; Tianlong Chen; Xiaoshuang Shi; Kaidi Xu; |
| 330 | Voices, Faces, and Feelings: Multi-modal Emotion-Cognition Captioning for Mental Health Understanding Highlight: While some studies incorporate emotional features via transfer learning, their connection to mental health conditions remains implicit. To address these issues, we propose ECMC, a novel task that aims at generating natural language descriptions of emotional and cognitive states from multi-modal data, and producing emotion–cognition profiles that improve both the accuracy and interpretability of mental health assessments. |
Zhiyuan Zhou; Yanrong Guo; Shijie Hao; |
| 331 | MathSmith: Towards Extremely Hard Mathematical Reasoning By Forging Synthetic Problems with A Reinforced Policy Highlight: We propose MathSmith, a novel framework for synthesizing challenging mathematical problems to enhance LLM reasoning. |
Shaoxiong Zhan; Yanlin Lai; Ziyu Lu; Dahua Lin; Ziqing Yang; Fei Tan; |
| 332 | X-SAM: From Segment Anything to Any Segmentation Highlight: Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from segment anything to any segmentation. |
Hao Wang; Limeng Qiao; Zequn Jie; Zhijian Huang; Chengjian Feng; Qingfang Zheng; Lin Ma; Xiangyuan Lan; Xiaodan Liang; |
| 333 | On The Misalignment Between Data Learnability and Forgettability in Machine Unlearning Highlight: We report a structural mismatch between a data point’s learnability (how quickly it improves the loss) and its forgettability (how much it anchors the final parameters), an aspect ignored by prior machine unlearning frameworks such as SISA, Fisher-Forget, and influence-based fine-tuning. To make this gap measurable we introduce Unlearning Gradient Sensitivity (UGS), an influence score computable with a single Hutch++ sketch, and derive the Learnability–Forgettability Divergence (LFD), the Jensen–Shannon distance between the model’s learning and forgetting distributions. |
Zijie Pan; Zuobin Ying; Yajie Wang; Wanlei Zhou; |
| 334 | Bonsai: Interpretable Tree-Adaptive Grounded Reasoning Highlight: We introduce Bonsai, a compositional and probabilistic reasoning system that generates adaptable inference trees by retrieving relevant grounding evidence and using it to compute likelihoods of sub-claims derived from broader natural language inferences. |
Kate Sanders; Benjamin Van Durme; |
| 335 | MAISI-v2: Accelerated 3D High-Resolution Medical Image Synthesis with Rectified Flow and Region-specific Contrastive Loss Highlight: In this work, we present MAISI-v2, the first accelerated 3D medical image synthesis framework that integrates rectified flow to enable fast and high-quality generation. |
Can Zhao; Pengfei Guo; Dong Yang; Yufan He; Yucheng Tang; Benjamin Simon; Mason Belue; Stephanie Harmon; Baris Turkbey; Daguang Xu; |
| 336 | O-DisCo-Edit: Object Distortion Control for Unified Realistic Video Editing Highlight: Current methods require different control signals for diverse editing tasks, which complicates model design and demands significant training resources. To address this, we propose O-DisCo-Edit, a unified framework that incorporates a novel object distortion control (O-DisCo). |
Yuqing Chen; Junjie Wang; Lin Liu; Ruihang Chu; Xiaopeng Zhang; Qi Tian; Yujiu Yang; |
| 337 | SSR-SAM: Retrieval-Style Segment Anything Model for Semi-Supervised Ultra-High-Resolution Image Segmentation Highlight: To this end, we propose SSR-SAM, a retrieval-style semi-supervised segmentation framework tailored for UHR images. |
Shijie Li; Yiming Chen; Zhineng Chen; Kai Hu; Xieping Gao; |
| 338 | Multimodal Mixture-of-Experts with Retrieval Augmentation for Protein Active Site Identification Highlight: However, current methods face two critical challenges: vulnerability in single-instance prediction due to sparse training data, and inadequate modality reliability estimation that leads to performance degradation when unreliable modalities dominate fusion processes. To address these challenges, we introduce Multimodal Mixture-of-Experts with Retrieval Augmentation (MERA), the first retrieval-augmented framework for protein active site identification. |
Jiayang Wu; Jiale Zhou; Rubo Wang; Xingyi Zhang; Xun Lin; Tianxu Lv; Leong Hou U; Yefeng Zheng; |
| 339 | Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models Highlight: Recent advancements in Large Video Language Models (LVLMs) have highlighted their potential for multi-modal understanding, yet evaluating their factual grounding in videos remains a critical unsolved challenge. To address this gap, we introduce Video SimpleQA, the first comprehensive benchmark tailored for factuality evaluation in video contexts. |
Meng Cao; Pengfei Hu; Yingyao Wang; Jihao Gu; Haoran Tang; Haoze Zhao; Chen Wang; Jiahua Dong; Wangbo Yu; Ge Zhang; Xiang Li; Ian Reid; Xiaodan Liang; |
| 340 | Steering Visuomotor Policy in Open Worlds Via Cross-View Goal Alignment Highlight: Specifically, we propose a novel cross-view goal alignment framework that allows users to specify target objects using segmentation masks from their camera views rather than the agent’s observations. |
Shaofei Cai; Zhancun Mu; Anji Liu; Yitao Liang; |
| 341 | HAPO: Training Language Models to Reason Concisely Via History-Aware Policy Optimization Highlight: We hypothesize that this limits their ability to progressively make solutions more concise over time. To address this, we present History-Aware Policy Optimization (HAPO), which keeps track of a history state (e.g., the minimum length over previously generated correct responses) for each problem. |
Chengyu Huang; Zhengxin Zhang; Claire Cardie; |
| 342 | OneFont: A Unified Agent for End-to-End Font Creation Highlight: Despite recent advancements in font generation, practitioners still grapple with a laborious trial-and-error workflow. To streamline this, we propose OneFont, an end-to-end framework that interprets user intents via free-form dialogue, seamlessly integrating both glyph synthesis and refinement modules. |
Yingxin Lai; Yufei Liu; Guoqing Yang; Jiaxing Chai; Zhiming Luo; Shaozi Li; |
| 343 | Rademacher Complexity for Distributionally Robust Learning Highlight: The goal of distributionally robust learning is to learn models capable of performing well against distributional shifts, such as latent heterogeneous subpopulations, unknown covariate shifts, or unmodeled temporal effects. |
Zhengyu Zhou; Weiwei Liu; |
| 344 | On The Robustness of Bandit Multiple Testing Highlight: This paper proposes a robust approach for bandit multiple testing, allowing for at most an epsilon fraction of arbitrary distribution corruption, as in Huber’s contamination model. |
Zhengyu Zhou; Weiwei Liu; |
| 345 | A Novel Fine-Tuned CLIP-OOD Detection Method with Double Loss Constraint Through Optimal Transport Semantic Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This phenomenon stems from semantic misalignment and inter-class feature confusion. To address these issues, we propose a novel fine-tuned OOD detection method with the Double loss constraint based on Optimal Transport (DOT-OOD). |
Hengyang Lu; Xin Guo; Shuai Feng; Wenyu Jiang; Yuntao Du; Chang Xia; Chenyou Fan; |
| 346 | DLVINet: Advancing Dual-Lens Video Inpainting Beyond Parallax Constraints Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although preliminary explorations have been conducted, existing methods still face two key challenges: limited exploitation of long-range reference information and inadequate modeling of inter-lens consistency in non-standard binocular systems. In this paper, we propose a novel dual-lens video inpainting framework named DLVINet, which addresses these challenges with two core components. |
Zhiliang Wu; Kun Li; Yunqiu Xu; Hehe Fan; Yi Yang; |
| 347 | Efficient Protein Optimization Via Structure-aware Hamiltonian Dynamics Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Prior sequence-based optimization methods struggle with the high-dimensional complexities due to the epistasis effect and the disregard for structural constraints. To address this, we propose HADES, a Bayesian optimization method utilizing Hamiltonian dynamics to efficiently sample from a structure-aware approximated posterior. |
Jiahao Wang; Shuangjia Zheng; |
| 348 | MARPO: A Reflective Policy Optimization for Multi-Agent Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Multi-Agent Reflective Policy Optimization (MARPO) to alleviate the issue of sample inefficiency in multi-agent reinforcement learning. |
Cuiling Wu; Yaozhong Gan; Junliang Xing; Ying Fu; |
| 349 | Video Camera Trajectory Editing with Generative Rendering from Estimated Geometry Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a novel framework for video camera trajectory editing, enabling the re-synthesis of monocular videos along user-defined camera paths. |
Junyoung Seo; Jisang Han; Jaewoo Jung; Siyoon Jin; JoungBin Lee; Takuya Narihira; Kazumi Fukuda; Takashi Shibuya; Donghoon Ahn; Shoukang Hu; Seungryong Kim; Yuki Mitsufuji; |
| 350 | Learner-Tailored Program Repair: A Solution Generator with Iterative Edit-Driven Retrieval Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In the second stage, we propose a solution-guided program repair method, which fixes the code and provides explanations under the guidance of retrieval solutions. |
Zhenlong Dai; Zhuoluo Zhao; Hengning Wang; Xiu Tang; Sai Wu; Chang Yao; Zhipeng Gao; Jingyuan Chen; |
| 351 | Empowering DINO Representations for Underwater Instance Segmentation Via Aligner and Prompter Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we demonstrate that DINO can serve as an effective feature learner for UIS, and we introduce DiveSeg, a novel framework built upon two insightful components: (1) The AquaStyle Aligner, designed to embed underwater color style features into the DINO fine-tuning process, facilitating better adaptation to the underwater domain. |
Zhiyang Chen; Chen Zhang; Hao Fang; Runmin Cong; |
| 352 | GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This dataset offers comprehensive task coverage, diverse modalities, and rich image-text data. Building upon this dataset, we develop GMAI-VL, a 7B-parameter general medical vision-language model, with a three-stage training strategy that enhances the integration of visual and textual information. |
Tianbin Li; Yanzhou Su; Wei Li; Bin Fu; Zhe Chen; Ziyan Huang; Guoan Wang; Chenglong Ma; Ying Chen; Ming Hu; Yanjun Li; Pengcheng Chen; Shixiang Tang; Xiaowei Hu; Zhongying Deng; Yuanfeng Ji; Jin Ye; Yu Qiao; Junjun He; |
| 353 | MARS: Multi-Agent Adaptive Reasoning with Socratic Guidance for Automated Prompt Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose a Multi-Agent Adaptive Reasoning with Socratic guidance framework (MARS) for APO. |
Jian Zhang; Zhangqi Wang; Haiping Zhu; Kangda Cheng; Kai He; Bo Li; Qika Lin; Jun Liu; Erik Cambria; |
| 354 | How Does Alignment Enhance LLMs’ Multilingual Capabilities? A Language Neurons Perspective Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a ternary classification methodology that categorizes neurons into three types, including language-specific neurons, language-related neurons, and general neurons. |
Shimao Zhang; Zhejian Lai; Xiang Liu; Shuaijie She; Xiao Liu; Yeyun Gong; Shujian Huang; Jiajun Chen; |
| 355 | Split-Layer: Enhancing Implicit Neural Representation By Maximizing The Dimensionality of Feature Space Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While widening the MLP can linearly increase feature space dimensionality, it also leads to a quadratic growth in computational and memory costs. To address this limitation, we propose the split-layer, a novel reformulation of MLP construction. |
Zhicheng Cai; Hao Zhu; Linsen Chen; Qiu Shen; Xun Cao; |
| 356 | Semantic-Augmented Image Clustering Via Adaptive Multi-Modal Collaboration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a novel Semantic-Augmented image Clustering (SAC) method, which transcends the inherent limitations of purely visual representations through the integration of external knowledge. |
Xiaohan Zhang; Chao Zhang; Deng Xu; Hong YU; Chunlin Chen; Huaxiong Li; |
| 357 | Divide-and-Conquer Decoupled Network for Cross-Domain Few-Shot Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, encoder features often entangle domain-relevant and category-relevant information, limiting both generalization and rapid adaptation to new domains. To address this issue, we propose a Divide-and-Conquer Decoupled Network (DCDNet). |
Runmin Cong; Anpeng Wang; Bin Wan; Cong Zhang; Xiaofei Zhou; Wei Zhang; |
| 358 | Exploring Efficient Open-Vocabulary Segmentation in The Remote Sensing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Using this benchmark, we comprehensively evaluate several representative OVS/OVRSIS models and reveal their limitations when directly applied to remote sensing scenarios. Building on these insights, we propose RSKT-Seg, a novel open-vocabulary segmentation framework tailored for remote sensing. |
Bingyu Li; Haocheng Dong; Da Zhang; Zhiyuan Zhao; Hao Sun; Junyu Gao; |
| 359 | GlitchMiner: Mining Glitch Tokens in Large Language Models Via Gradient-based Discrete Optimization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce GlitchMiner, a behavior-driven framework designed to identify glitch tokens by maximizing predictive entropy. |
Zihui Wu; Haichang Gao; Ping Wang; Shudong Zhang; Zhaoxiang Liu; Shiguo Lian; |
| 360 | VirtualEnv: A Platform for Embodied AI Research Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present VirtualEnv, a next-generation simulation platform built on Unreal Engine 5 that enables fine-grained benchmarking of LLMs in embodied and interactive scenarios. |
Kabir Swain; Sijie Han; Ayush Raina; Jin Zhang; Shuang Li; Michael Stopa; Antonio Torralba; |
| 361 | Stability-Aware Reinforcement Learning for Robust Class Integration Test Order Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: These challenges stem from insufficient reward shaping and the lack of reliable oracles for validation. To address these limitations, we propose LM-CITO, a stability-aware RL framework that integrates Lyapunov-guided reward shaping with semantic validation through metamorphic testing (MT). |
Yanru Ding; Yanmei Zhang; Guan Yuan; Shujuan Jiang; Wei Dai; Luciano Baresi; |
| 362 | Mapping on A Budget: Optimizing Spatial Data Collection for ML Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This leaves scientists and policymakers who wish to use SatML for large-scale monitoring uncertain about whether and how to collect additional data to maximize performance. Here, we present the first problem formulation for the optimization of spatial training data in the presence of heterogeneous data collection costs and realistic budget constraints, as well as novel methods for addressing this problem. |
Livia Betti; Farooq Sanni; Gnouyaro Z. Sogoyou; Togbe Agbagla; Cullen Molitor; Tamma Carleton; Esther Rolf; |
| 363 | Heterogeneous Complementary Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although heterogeneous KD approaches have been developed recently to solve these issues, they often incur high computational costs and complex designs, or overly rely on logit alignment, which limits their ability to leverage the complementary features. To overcome these limitations, we propose Heterogeneous Complementary Distillation (HCD), a simple yet effective framework that integrates complementary teacher and student features to align representations in shared logits. |
Liuchi Xu; Hao Zheng; Lu Wang; Lisheng Xu; Jun Cheng; |
| 364 | Noise-Aware Graph-Based Cognitive Diagnostic Framework Through Low-Rank Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Interestingly, a noteworthy phenomenon has been overlooked: even without robustness designs, GCDFs can still learn correct information in noisy environments. In this paper, we conduct a comprehensive empirical analysis of this issue. |
Guixian Zhang; Yanmei Zhang; Guan Yuan; Shang Liu; Xiaojing Du; Debo Cheng; |
| 365 | Knowledge Boundary Discovery for Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Knowledge Boundary Discovery (KBD), a reinforcement learning based framework to explore the knowledge boundaries of the Large Language Models (LLMs). |
Ziquan Wang; Zhongqi Lu; |
| 366 | Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to resolve their queries better. |
Ante Wang; Yujie Lin; Jingyao Liu; Suhang Wu; Hao Liu; Xinyan Xiao; Jinsong Su; |
| 367 | DSCodeBench: A Realistic Benchmark for Data Science Code Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce DSCodeBench, a new benchmark designed to evaluate large language models (LLMs) on complicated and realistic data science code generation tasks. |
Shuyin Ouyang; Dong HUANG; Jingwen Guo; Zeyu Sun; Qihao Zhu; Jie M. Zhang; |
| 368 | VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment Via Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite recent advances, existing VQA models still suffer from two critical limitations: poor generalization to out-of-distribution (OOD) videos and limited explainability, which restrict their applicability in real-world scenarios. To address these challenges, we propose VQAThinker, a reasoning-based VQA framework that leverages large multimodal models (LMMs) with reinforcement learning to jointly model video quality understanding and scoring, emulating human perceptual decision-making. |
Linhan Cao; Wei Sun; Weixia Zhang; Xiangyang Zhu; Jun Jia; Kaiwei Zhang; Dandan Zhu; Guangtao Zhai; Xiongkuo Min; |
| 369 | When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. |
Qilang Ye; Wei Zeng; Meng Liu; Jie Zhang; Yupeng Hu; Zitong Yu; Yu Zhou; |
| 370 | D²Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user’s prompt and spatial redundancy. To address this, we introduce D²Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. |
Evelyn Zhang; Fufu Yu; Aoqi Wu; Zichen Wen; Ke Yan; Shouhong Ding; Biqing Qi; Linfeng Zhang; |
| 371 | GUI-G²: Gaussian Reward Modeling for GUI Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G²), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. |
Fei Tang; Zhangxuan Gu; Zhengxi Lu; Xuyang Liu; Shuheng Shen; Changhua Meng; Wen Wang; Wenqi Zhang; Yongliang Shen; Weiming Lu; Jun Xiao; Yueting Zhuang; |
| 372 | VGGS: VGGT-guided Gaussian Splatting for Efficient and Faithful Sparse-View Surface Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our primary contribution is an anchor-calibrated depth estimation scheme, which yields accurate depth maps. |
Peng Xiang; Liang Han; Hui Zhang; Yu-Shen Liu; Zhizhong Han; |
| 373 | Palimpsest: Reconciling The CISS Trilemma for Incremental Nuclei Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, this task imposes a unique CISS Trilemma: a simultaneous failure to preserve the intricate tissue background (stability), distinguish morphologically similar new nuclei (plasticity), and maintain a constant model size (scalability), all under a strict exemplar-free constraint. To resolve this, we introduce Palimpsest, a novel framework that systematically decouples these conflicting demands. |
Jiajia Li; Huisi Wu; |
| 374 | Multi-view Learning Via Trusted Pairwise Entity Energy Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we focus on the pairwise trusted problem on long-tailed multi-view classification and give a general framework, which considers the trusted pairs instead of trusted annotated data points. |
Yalan Qin; Guorui Feng; Xinpeng Zhang; |
| 375 | On Robustness of Linear Classifiers to Targeted Data Poisoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this setting, we prove that finding the robustness is an NP-Complete problem, even when hypotheses are linear classifiers. To overcome this, we present a technique that finds lower and upper bounds of robustness. |
Nakshatra Gupta; Sumanth Prabhu S; Supratik Chakraborty; Venkatesh R; |
| 376 | H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. |
Hongzhe Bi; Lingxuan Wu; Tianwei Lin; Hengkai Tan; Zhizhong Su; Hang Su; Jun Zhu; |
| 377 | RICo: Refined In-Context Contribution for Automatic Instruction-Tuning Data Selection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Refined Contribution Measurement with In-Context Learning (RICo), a novel gradient-free method that quantifies the fine-grained contribution of individual samples to both task-level and global-level model performance. |
Yixin Yang; Qingxiu Dong; Linli Yao; Fangwei Zhu; Weilin Luo; Bin Wang; Zhifang Sui; |
| 378 | STEP-Nav: Spatial-Temporal Efficient Visual Token Pruning for Vision-and-Language Navigation with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, this tokenization approach incurs substantial computational overhead due to two key inefficiencies: 1) ego-centric camera views often include navigation-irrelevant regions (e.g., sky or distant backgrounds), and 2) high-frame-rate image sequences introduce temporal redundancy. To address these challenges, we propose Spatial-Temporal Efficient Visual Token Pruning (STEP-Nav), a unified framework that simultaneously prunes redundant visual tokens and fine-tunes VLN models to preserve navigation performance. |
Yantao Lu; Shiqi Sun; Ning Liu; Bo Jiang; Ying Zhang; Jinchao Chen; Chenglie Du; |
| 379 | Belief-Driven Value Alignment for Human-Robot Collaboration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing works often assume that the human is perfectly rational and can fully obtain the robot’s belief about the human’s preference. To address this limitation, we propose a Particle Filter-based Hierarchical Dynamic Programming algorithm (PFHDP). |
Saisai Li; Bing Shi; Yiming Xia; Xiao Su; |
| 380 | MTAttack: Multi-Target Backdoor Attacks Against Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we uncover multi-target backdoor attacks, where multiple independent triggers corresponding to different attack targets are added in a single pass of training, posing a greater threat to LVLMs in real-world applications. |
Zihan Wang; Guansong Pang; Wenjun Miao; Jin Zheng; Xiao Bai; |
| 381 | Multi-modal Dynamic Proxy Learning for Personalized Multiple Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current multi-modal solutions suffer from static semantic rigidity: predefined candidate words fail to adapt to dataset-specific concepts, and fixed fusion strategies ignore evolving feature interactions. To overcome these limitations, we propose Multi-DProxy, a novel multi-modal dynamic proxy learning framework that leverages cross-modal alignment through learnable textual proxies. |
Jinfeng Xu; Zheyu Chen; Shuo Yang; Jinze Li; Ziyue Peng; Zewei Liu; Hewei Wang; Jiayi Zhang; Edith C. H. Ngai; |
| 382 | FGD-Align: Pluralistic Alignment for Large Language Models Via Fuzzy Group Decision-Making Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Most existing approaches, such as Direct Preference Optimization (DPO), assume consistent and conflict-free supervision, overlooking the ambiguity, inconsistency, and value trade-offs inherent in real-world preferences—often leading to reduced robustness and exclusion of minority views. To address this, we propose FGD-Align, a novel pluralistic alignment framework grounded in Fuzzy Group Decision-Making theory. |
Weihang Pan; Zhengxu Yu; Yong Wu; Xun Liang; Zhongming Jin; Qiang Fu; Penghui Shang; Binbin Lin; Xiaofei He; Jieping Ye; |
| 383 | SeViL: Semi-supervised Vision-Language Learning with Text Prompt Guiding for Moving Infrared Small Target Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, they have largely overlooked the potential of multi-modal (e.g., vision and text) learning. To address these issues, inspired by prevalent vision-language models, we propose the first semi-supervised vision-language (SeViL) framework with adaptive text prompt guiding. |
Weiwei Duan; Luping Ji; Jianghong Huang; Sicheng Zhu; |
| 384 | Human2Robot: Learning Robot Actions from Paired Human-Robot Videos Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We posit that this limitation stems from a vicious circle of inadequate datasets and the methods they inspire. To break this cycle, we propose a paradigm shift that treats fine-grained human-robot alignment as a conditional video generation problem. |
Sicheng Xie; Haidong Cao; Zejia Weng; Zhen Xing; Haoran Chen; Shiwei Shen; Jiaqi Leng; Zuxuan Wu; Yu-Gang Jiang; |
| 385 | The Avengers: A Routing Recipe for Collective Intelligence in Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Can open-source, smaller models remain competitive across a broad range of tasks? In this paper, we present the Avengers—a lightweight framework that leverages the collective intelligence of these smaller models. |
Yiqun Zhang; Hao Li; Chenxu Wang; Linyao Chen; Qiaosheng Zhang; Peng Ye; Shi Feng; Xinrun Wang; Jia Xu; Lei Bai; Shuyue Hu; |
| 386 | AgentSense: Virtual Sensor Data Generation Using LLM Agents in Simulated Home Environments Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce AgentSense, a virtual data generation pipeline in which agents live out daily routines in simulated smart homes, with behavior guided by Large Language Models (LLMs). |
Zikang Leng; Megha Thukral; Yaqi Liu; Hrudhai Rajasekhar; Shruthi K. Hiremath; Jiaman He; Thomas Plötz; |
| 387 | AHAMask: Reliable Task Specification for Large Audio Language Models Without Instructions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose AHAMask, where we simply mask some of the attention heads in the decoder-only LLM backbone of LALMs, to trigger specific acoustic task functionalities without instructions. |
Yiwei Guo; Bohan Li; Hankun Wang; Zhihan Li; Shuai Wang; Xie Chen; Kai Yu; |
| 388 | Can LLMs Identify Tax Abuse? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We investigate whether large language models can discover and analyze U.S. tax-minimization strategies. |
Andrew Blair-Stanek; Nils Holzenberger; Benjamin Van Durme; |
| 389 | Forgetting By Pruning: Data Deletion in Join Cardinality Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Cardinality Estimation Pruning (CEP), the first unlearning framework specifically designed for multi-table learned CE systems. |
Chaowei He; Yuanjun Liu; Qingzhi Ma; Shenyuan Ren; Xizhao Luo; Lei Zhao; An Liu; |
| 390 | Distributional Priors Guided Diffusion for Generating 3D Molecules in Low Data Regimes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work introduces the Geometric OOD Diffusion Model (GODD), a novel diffusion-based framework that enables training on data-abundant molecular distributions while generalizing to data-scarce distributions under distributional structural shifts. |
Haokai Hong; Wanyu Lin; Ming Yang; Kay Chen Tan; |
| 391 | Multi-Agent VLMs Guided Self-Training with PNU Loss for Low-Resource Offensive Content Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Accurate detection of offensive content on social media demands high-quality labeled data; however, such data is often scarce due to the low prevalence of offensive instances and the high cost of manual annotation. To address this low-resource challenge, we propose a self-training framework that leverages abundant unlabeled data through collaborative pseudo-labeling. |
Han Wang; Deyi Ji; Junyu Lu; Lanyun Zhu; Hailong Zhang; Haiyang Wu; Liqun Liu; Peng Shu; Roy Ka-Wei Lee; |
| 392 | StegaVAR: Privacy-Preserving Video Action Recognition Via Steganographic Domain Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: These methods suffer from (1) low concealment, since they produce visually distorted videos that attract attackers’ attention during transmission, and (2) spatiotemporal disruption, since they degrade the spatiotemporal features essential for accurate VAR. To address these issues, we propose StegaVAR, a novel framework that embeds action videos into ordinary cover videos and directly performs VAR in the steganographic domain for the first time. |
Lixin Chen; Chaomeng Chen; Jiale Zhou; Zhijian Wu; Xun Lin; |
| 393 | Exploiting All Mamba Fusion for Efficient RGB-D Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current RGB-D trackers often copy RGB tracking paradigms, leading to inefficiency due to two-stream architectures that fail to exploit heterogeneous features, and reliance on simplistic or large-parameter fusion methods. To address these challenges, we propose AMTrack, a one-stream RGB-D tracker leveraging Mamba’s linear complexity for simultaneous feature extraction and two-stage cross-modal feature fusion. |
Ge Ying; Dawei Zhang; Chengzhuan Yang; Wei Liu; Sang-Woon Jeon; Hua Wang; Changqin Huang; Zhonglong Zheng; |
| 394 | Beyond I’m Sorry, I Can’t: Dissecting Large-Language-Model Refusal Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We study two public instruction-tuned models (Gemma-2-2B-IT and LLaMA-3.1-8B-IT) using sparse autoencoders (SAEs) trained on residual-stream activations. |
Nirmalendu Prakash; Yeo Wei Jie; Amir Abdullah; Ranjan Satapathy; Erik Cambria; Roy Ka-Wei Lee; |
| 395 | S-DAG: A Subject-Based Directed Acyclic Graph for Multi-Agent Heterogeneous Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work proposes a novel framework that performs fine-grained analysis at the subject level, equipped with a designated multi-agent collaboration strategy for addressing heterogeneous problem reasoning. |
Jiangwen Dong; Zehui Lin; Wanyu Lin; Mingjin Zhang; |
| 396 | Is Word Sense Disambiguation Dead in The LLM Era? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper surveys the role of WSD in the LLM era, drawing on recent studies of encoder-based sense separation and disambiguation, and decoder-based definition selection and generation, as well as multilingual evaluation. |
Roberto Navigli; |
| 397 | The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While prior work has shown that LLMs can represent task patterns and surface-level concepts, it remains unclear whether these models can encode high-level relational concepts and apply them to novel situations through structured comparisons. In this work, we explore this fundamental aspect using proportional and story analogies, and identify three key findings. |
Taewhoo Lee; Minju Song; Chanwoong Yoon; Jungwoo Park; Jaewoo Kang; |
| 398 | Generalized Threshold Optimization with Harmony Multi-Threshold Neurons for Accurate ANN-to-SNN Conversion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To minimize the conversion error, this paper proposes a harmonious mathematical property-based neuron, called the Harmony Multi-Threshold Neuron (H-MT Neuron), which utilizes multiple spikes to minimize residual membrane potentials. |
Wenhan Zhang; Zihan Huang; Tong Bu; Tiejun Huang; Zhaofei Yu; |
| 399 | AStar: Boosting Multimodal Reasoning with Automated Structured Thinking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, search-based methods suffer from computational inefficiency due to extensive solution space exploration, while post-training methods demand substantial data, computational resources, and often exhibit training instability. To address these challenges, we propose **AStar**, a training-free, **A**utomatic **S**tructured **t**hinking paradigm for multimod**a**l **r**easoning. |
Jinyang Wu; Mingkuan Feng; Guocheng Zhai; Shuai Zhang; Zheng Lian; Fangrui Lv; Pengpeng Shao; Ruihan Jin; Zhengqi Wen; Jianhua Tao; |
| 400 | TrackGS: Optimizing COLMAP-Free 3D Gaussian Splatting with Global Track Constraints Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present TrackGS, a novel method to integrate global feature tracks with 3D Gaussian Splatting (3DGS) for COLMAP-free novel view synthesis. |
Dongbo Shi; Shen Cao; Lubin Fan; Bojian Wu; Jinhui Guo; Ligang Liu; Renjie Chen; |
| 401 | Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST) with fully unidirectional architecture, including both speech encoder and LLM. |
Biao Fu; Donglei Yu; Minpeng Liao; Chengxi Li; Xinjie Chen; Yidong Chen; Kai Fan; Xiaodong Shi; |
| 402 | MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, existing CAPTCHA schemes encompass a diverse range of modalities—from static distorted text and obfuscated images to interactive clicks, sliding puzzles, and logic-based questions—yet the community still lacks a unified, large-scale, multimodal benchmark to rigorously evaluate their security robustness. To address this gap, we introduce MCA-Bench, a comprehensive and reproducible benchmarking suite that integrates heterogeneous CAPTCHA types into a single evaluation protocol. |
Zonglin Wu; Yule Xue; Yaoyao Feng; Xiaolong Wang; Yiren Song; |
| 403 | UniMo: Unified Motion Generation and Understanding with Chain of Thought Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, the next-token prediction paradigm in LLMs is ill-suited for motion sequences, causing cumulative prediction errors. To address these limitations, we propose UniMo, a novel framework that integrates motion-language information and interpretable chain of thought (CoT) reasoning into the LLM via supervised fine-tuning (SFT). |
Guocun Wang; Kenkun Liu; Jing Lin; Guorui Song; Jian Li; Xiaoguang Han; |
| 404 | Human-Centric Video Generation Via Collaborative Multi-Modal Conditioning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. |
Liyang Chen; Tianxiang Ma; Jiawei Liu; Bingchuan Li; Zhuowei Chen; Lijie Liu; Xu He; Gen Li; Qian He; Zhiyong Wu; |
| 405 | From Stimuli to Minds: Enhancing Psychological Reasoning in LLMs Via Bilateral Reinforcement Learning Highlight: Large Language Models show promise in emotion understanding, social reasoning, and empathy, yet struggle with psychologically grounded tasks requiring inference of implicit mental states in complex, socially and contextually ambiguous settings. |
Yichao Feng; Haoran Luo; Lang Feng; Shuai Zhao; Anh Tuan Luu; |
| 406 | KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs Highlight: Crucially, existing keyframe selection strategies, especially those based on CLIP similarity, are prone to biases and may inadvertently overlook critical frames, resulting in suboptimal video comprehension. To address these significant challenges, we propose KTV, a novel two-stage framework for efficient and effective training-free video understanding. |
Baiyang Song; Jun Peng; Yuxin Zhang; Guangyao Chen; Feidiao Yang; Jianyuan Guo; |
| 407 | R²D-LPCC: Relevance-Ranking Guided Region-Adaptive Dynamic LiDAR Point Cloud Compression Highlight: This approach neglects the semantic priorities of a scene, resulting in inefficient bit allocation and particularly compromising the reconstruction quality of safety-critical regions, such as pedestrians and vehicles, which are vital to downstream perception tasks. To address these limitations, we propose R²D-LPCC, a relevance-ranking framework for region-adaptive LPCC that prioritizes fidelity in semantically important regions. |
Fangzhe Nan; Frederick W. B. Li; Gary K. L. Tam; Zhaoyi Jiang; Bailin Yang; Jingke Cui; Changshuo Wang; |
| 408 | MIDB: Multilingual Instruction Data Booster for Enhancing Cultural Equality in Multilingual Instruction Synthesis Highlight: In this paper, we propose MIDB, a Multilingual Instruction Data Booster to automatically address the quality issues in multilingual synthesized data. |
Yilun Liu; Chunguang Zhao; Xinhua Yang; Hongyong Zeng; Shimin Tao; Weibin Meng; Minggui He; Yan Yu; Hongxia Ma; Li Zhang; Daimeng Wei; Boxing Chen; |
| 409 | E-Logic Prompt: Unified Energy-Logic Framework for Continual Visual Question Answering Highlight: This oversight can lead to inconsistent reasoning paths and performance degradation across tasks. To address this issue, we propose the E-Logic Prompt framework, which employs energy-based models (EBMs) to model the semantic compatibility between prompts and queries. |
Jiayao Tan; Tianle Liu; Fuyuan Hu; Wei Feng; Liang Wan; |
| 410 | Time-Frequency Token Advantage Clipping for Training Efficient Large Reasoning Model Highlight: Current sequence-level optimizations, like length penalties, are too coarse-grained to distinguish core logic from verbose language, precluding the necessary token-level control for efficient reasoning CoT. To overcome these limitations, we introduce Time-Frequency token Advantage Clipping (TFAC), a novel training framework designed to build efficient large reasoning models via token-level interventions. |
Rong Bao; Bo Wang; Xiao Wang; Hongyu Li; Rui Zheng; Leszek Rutkowski; Qi Zhang; Liang Ding; Dacheng Tao; |
| 411 | Monocular Vehicle Pose and Shape Reconstruction Via Dynamic Context Adaptation and Progressive Geometry Refinement Highlight: Existing methods often suffer from geometric ambiguity in depth estimation and structural hollowness in shape recovery, primarily due to inadequate multi-scale feature aggregation and inflexible prior modeling. To overcome these limitations, MonoVPR is proposed as a novel framework integrating dynamic context adaptation and progressive geometry refinement. |
Wei Li; Long Ji; Ying Wang; Xiao Wu; Zhaoquan Yuan; Penglin Dai; |
| 412 | Test-time Prompt Intervention Highlight: The root cause lies in their post-training, which relies too heavily on outcome reward paradigms, because data for process reward paradigms, which regulate intermediate reasoning steps, is difficult to construct at scale. To address this, we propose PI, a novel framework for Test-time Prompt Intervention. |
Chenxu Yang; Qingyi Si; Mz Dai; Dingyu Yao; Mingyu Zheng; Minghui Chen; Zheng Lin; Weiping Wang; |
| 413 | Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation Highlight: The core insight of this paper is that high-level planning and low-level UI operations require fundamentally distinct types of knowledge. |
Yuxiang Zhou; Jichang Li; Yanhao Zhang; Haonan Lu; Guanbin Li; |
| 414 | Personalize Anything for Free with Diffusion Transformer Highlight: In this paper, we uncover the untapped potential of DiT, where simply replacing denoising tokens with those of a reference subject achieves zero-shot subject reconstruction. |
Haoran Feng; Zehuan Huang; Lin Li; Lu Sheng; |
| 415 | Duplex Rewards Optimization for Test-Time Composed Image Retrieval Highlight: Within the TT-CIR setting, we identify that naively introducing existing TTA methods (e.g., reward-based) into CIR faces two vital challenges: 1) Modification-restricted reward pool, which limits the exploration of semantically relevant candidate rewards; 2) Conservative knowledge feedback, which inhibits the adaptability of rewards to the current data distribution. To address these challenges, we propose a test-time reinforcement learning framework that integrates a Counterfactual-guided Multinomial Sampling (CMS) strategy and a Duplex Rewards Modeling (DRM) module. |
Haoliang Zhou; Feifei Zhang; Changsheng Xu; |
| 416 | Bi-VLM: Binary Post-Training Quantization for Vision-Language Models Highlight: We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and compression objective. |
Xijun Wang; Rayyan Abdalla; Junyun Huang; Chengyuan Zhang; Ruiqi Xian; Dinesh Manocha; |
| 417 | Divide, Conquer and Unite: Hierarchical Style-Recalibrated Prototype Alignment for Federated Medical Segmentation Highlight: 2) Layerwise Style Bias Accumulation: Although utilizing representations can partially align global features, these methods neglect domain-specific biases within intermediate layers, allowing style discrepancies to build up and reduce model robustness. To address these challenges, we propose FedBCS to bridge feature representation gaps via domain-invariant contextual prototypes alignment. |
Xingyue Zhao; Wenke Huang; Xingguang Wang; Haoyu Zhao; Linghao Zhuang; Anwen Jiang; Guancheng Wan; Mang Ye; |
| 418 | Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning Over Foundational Capabilities Highlight: Our empirical insights underline the critical need for developing more versatile LRMs capable of dynamically allocating inference-time compute according to specific task characteristics. |
Weixiang Zhao; Xingyu Sui; Jiahe Guo; Yulin Hu; Yang Deng; Yanyan Zhao; Xuda Zhi; Yongbo Huang; Hao He; Wanxiang Che; Ting Liu; Bing Qin; |
| 419 | CultureRL: Internalizing Cultural Principles in Large Language Models Via Norm-Driven Reinforcement Learning Highlight: In this work, we propose CultureRL, a culture-norm-driven reinforcement learning framework that directly encodes cultural principles into model behavior. |
Weixiang Zhao; Haozhen Li; Yanyan Zhao; Haixiao Liu; Biye Li; Ting Liu; Bing Qin; |
| 420 | Tracking The Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos Highlight: In this work, we propose AMOT, which jointly exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module. |
Jianbo Ma; Hui Luo; Qi Chen; Yuankai Qi; Yumei Sun; Amin Beheshti; Jianlin Zhang; Ming-Hsuan Yang; |
| 421 | DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation Highlight: However, these approaches struggle with generating high-quality videos aligned with complex single-scene descriptions, as visualizing such descriptions involves the coherent composition of multiple objects/events, complex motion synthesis, and character customization with sequential motions. To address these challenges, we propose DREAMRUNNER, a novel story-to-video generation method: First, we structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning and fine-grained object-level layout planning. |
Zun Wang; Jialu Li; Han Lin; Jaehong Yoon; Mohit Bansal; |
| 422 | Subspace-Aware Graph Construction and Contrastive Alignment for Multimodal Recommendation with Large Language Models Highlight: Moreover, in these methods, collaborative signals often dominate and suppress semantic knowledge, which limits its role in representation learning. To address these issues, we propose SCALE, a novel framework that combines subspace-aware graph construction and contrastive alignment for multimodal recommendation with large language models. |
Haodong Li; Lianyong Qi; Weiming Liu; Fan Wang; Chong Li; Shengye Pang; Wenwen Gong; Yanwei Xu; Xiaoxiao Chi; Yang Zhang; Xiaokang Zhou; |
| 423 | Multi-View Differential Mixing and Graph-Guided Structural Region Selection for Cross-Modal Alignment Highlight: Specifically, global matching methods suffer from the over-compression of local features, while local matching methods rarely consider the inherent spatial topology of image patches. To address these limitations, we propose MG-Net, a unified framework with two collaborative modules: Multi-View Differential Mixer (MDM) and Graph-Guided Structural Region Selector (GSRS). |
Linlin Ji; Li Liu; |
| 424 | SplatSSC: Decoupled Depth-Guided Gaussian Splatting for Semantic Scene Completion Highlight: While recent object-centric paradigms significantly improve efficiency by leveraging flexible 3D Gaussian primitives, they still rely heavily on a large number of randomly initialized primitives, which inevitably leads to 1) inefficient primitive initialization and 2) outlier primitives that introduce erroneous artifacts. In this paper, we propose SplatSSC, a novel framework that resolves these limitations with a depth-guided initialization strategy and a principled Gaussian aggregator. |
Rui Qian; Haozhi Cao; Tianchen Deng; Shenghai Yuan; Lihua Xie; |
| 425 | Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models Highlight: We introduce Multi-Faceted Attack (MFA), a framework that systematically uncovers general safety vulnerabilities in leading defense-equipped VLMs, such as GPT-4o, Gemini-Pro, and LLaMA 4. |
Yijun Yang; Lichao Wang; Jianping Zhang; Chi Harold Liu; Lanqing Hong; Qiang Xu; |
| 426 | Beyond N-grams: A Hierarchical Reward Learning Framework for Clinically-Aware Medical Report Generation Highlight: Current methods can write formally fluent sentences but may be factually flawed, introducing serious medical errors known as clinical hallucinations, which make them untrustworthy for diagnosis. To bridge this gap, we introduce HiMed-RL, a Hierarchical Medical Reward Learning Framework designed to explicitly prioritize clinical quality. |
Yuan Wang; Shujian Gao; Jiaxiang Liu; Songtao Jiang; Xia Haoxiang; Xiaotian Zhang; Zhaolu Kang; Yemin Wang; Zuozhu Liu; |
| 427 | MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents Highlight: We propose MemGuide, a two-stage intent-driven memory selection framework: (1) Intent-Aligned Retrieval retrieves goal-consistent QA-formatted memory units; (2) Missing-Slot Guided Filtering reranks units by slot-completion gain via a chain-of-thought reasoner and fine-tuned LLaMA-8B filter. |
Yiming Du; Bingbing Wang; Yang He; Bin Liang; Baojun Wang; Zhongyang Li; Lin Gui; Jeff Z. Pan; Ruifeng Xu; Kam-Fai Wong; |
| 428 | TLAGC: Taylor Linear Attention-Guided Graph Convolutions for Revealing Spatial Domains in Spatial Multi-Omics Data Highlight: Graph-based neural methods capture only local neighborhood information, whereas conventional Transformers, although capable of modelling long-range dependencies, incur prohibitive computational costs on such data. To overcome these limitations, we propose TLAGC, a Taylor-Linear-Attention-Guided Graph Convolutional framework that couples a Taylor-expanded linear attention (TLA) mechanism with graph convolutional networks. |
Aoyun Geng; Chunyan Cui; Yunyun Su; Zhenjie Luo; Feifei Cui; Zilong Zhang; |
| 429 | Large Language Model Unlearning for Source Code Highlight: However, we find its application to source code often tends to spill over, damaging the basic knowledge of programming languages learned by the LLM and degrading the overall capability. To ease this challenge, we propose PROD for precise source code unlearning. |
Xue Jiang; Yihong Dong; Huangzhao Zhang; Tangxinyu Wang; Zheng Fang; Yingwei Ma; Rongyu Cao; Binhua Li; Zhi Jin; Wenpin Jiao; Yongbin Li; Ge Li; |
| 430 | DIET: Machine Unlearning on A Data-Diet Highlight: A further complication arises when unlearning is applied to vision-language models (VLMs), where entangled multimodal representations make targeted forgetting especially challenging. We propose DIET, a principled retain-data-free unlearning method for VLMs that addresses these challenges by leveraging the geometry of hyperbolic space. |
Nilakshan Kunananthaseelan; Jing Wu; Trung Le; Gholamreza Haffari; Mehrtash Harandi; |
| 431 | Generalizable Heterogeneity-aware Federated Feature and Basic-matrix Consistency Learning Highlight: However, the classic FL still faces significant challenges due to feature/model heterogeneity and catastrophic forgetting, which seriously hinder knowledge transfer and cause the forgetting of previous knowledge. To address these important challenges, we propose FBCL, a novel generalizable heterogeneity-aware Federated features and Basic-matrix Consistency Learning to balance intra-domain discriminability and inter-domain generalization. |
Xuan Lai; Luying Zhong; Tianying Lu; Junjie Zhang; Zhiqin Huang; Zheyi Chen; |
| 432 | Optimally Auditing Adversarial Agents Highlight: In this paper, we introduce a general model of audit policy design as a principal-agent game with multiple agents, where the principal commits to an audit policy, and agents collectively choose an equilibrium that minimizes the principal’s utility. |
Sanmay Das; Fang-Yi Yu; Yuang Zhang; |
| 433 | GenPRM: Scaling Test-Time Compute of Process Reward Models Via Generative Reasoning Highlight: In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing judgment for each reasoning step. |
Jian Zhao; Runze Liu; Kaiyan Zhang; Zhimu Zhou; Junqi Gao; Dong Li; Jiafei Lyu; Zhouyi Qian; Biqing Qi; Xiu Li; Bowen Zhou; |
| 434 | NeuralGS: Bridging Neural Fields and 3D Gaussian Splatting for Compact 3D Representations Highlight: In this paper, we aim to develop a simple yet effective method called NeuralGS that compresses the original 3DGS into a compact representation. |
Zhenyu Tang; Chaoran Feng; Xinhua Cheng; Wangbo Yu; Junwu Zhang; Yuan Liu; Xiao-Xiao Long; Wenping Wang; Li Yuan; |
| 435 | SwiftVideo: A Unified Framework for Few-Step Video Generation Through Trajectory-Distribution Alignment Highlight: While many distillation methods that are solely based on trajectory-preserving or distribution-matching have been developed to accelerate video generation models, these approaches often suffer from performance breakdown or increased artifacts in few-step settings. To address these limitations, we propose SwiftVideo, a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. |
Yanxiao Sun; Jiafu Wu; Yun Cao; Chengming Xu; Yabiao Wang; Weijian Cao; Donghao Luo; Chengjie Wang; Yanwei Fu; |
| 436 | Identity-Aware Vision-Language Model for Explainable Face Forgery Detection Highlight: Second, these methods rely heavily on low-level visual cues, making them effective for known forgeries but less reliable against new or unseen manipulation techniques. To address these challenges, we present a novel personalized vision-language model (VLM) that integrates low-level visual artifact analysis and high-level semantic inconsistency detection. |
Junhao Xu; Jingjing Chen; Yang Jiao; Jiacheng Zhang; Zhiyu Tan; Hao Li; Yu-Gang Jiang; |
| 437 | Large Language Models Struggle with Unreasonability in Math Problems Highlight: Instead of recognizing these issues, models frequently proceed as if the problem is well-posed, producing incorrect answers or falling into overthinking and verbose self-correction. To systematically investigate this overlooked vulnerability, we propose the Unreasonable Math Problems (UMP) benchmark, designed to evaluate LLMs’ ability to detect and respond to unreasonable math problem statements. |
Jingyuan Ma; Damai Dai; Zihang Yuan; Rui Li; Weilin Luo; Bin Wang; Qun Liu; Lei Sha; Zhifang Sui; |
| 438 | 3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale Highlight: Despite recent advancements in 3D-text cross-modal alignment, existing state-of-the-art methods still struggle to align fine-grained textual semantics with detailed geometric structures, and their alignment performance degrades significantly when scaling to large-scale 3D databases. To overcome this limitation, we introduce 3DAlign-DAER, a unified framework designed to align text and 3D geometry via the proposed dynamic attention policy and the efficient retrieval strategy, capturing subtle correspondences for diverse cross-modal retrieval and classification tasks. |
Yijia Fan; Jusheng Zhang; Kaitong Cai; Jing Yang; Jian Wang; Keze Wang; |
| 439 | Cost-Effective Communication: An Auction-based Method for Language Agent Interaction Highlight: We argue that “free” communication, by ignoring the principle of scarcity, inherently breeds inefficiency and unnecessary expenses. To address this, we introduce the Dynamic Auction-based Language Agent (DALA), a novel framework that treats communication bandwidth as a scarce and tradable resource. |
Yijia Fan; Jusheng Zhang; Kaitong Cai; Jing Yang; Chengpei Tang; Jian Wang; Keze Wang; |
| 440 | Hearing More with Less: Multi-Modal Retrieval-and-Selection Augmented Conversational LLM-Based ASR Highlight: This paper proposes a multi-modal retrieval-and-selection method named MARS that augments conversational LLM-ASR by enabling it to retrieve and select the most relevant acoustic and textual historical context for the current utterance. |
Bingshen Mu; Hexin Liu; Hongfei Xue; Kun Wei; Lei Xie; |
| 441 | Bridging The Tokenizer Gap: Semantics and Distribution-aware Knowledge Transfer for Unbiased Cross-Tokenizer Distillation Highlight: Cross-tokenizer knowledge distillation, where the teacher and student employ different tokenizers, is becoming increasingly prevalent, yet it poses underexplored challenges: existing methods fail to capture the rich knowledge encoded in teacher logits, as evidenced by the neglect of semantic information, inaccurate and biased logit alignment, and discarding distributional structure, ultimately leading to unfavorable distillation. To address these issues, we propose SeDi, a semantics and distribution-aware knowledge transfer framework tailored for cross-tokenizer distillation. |
Huazheng Wang; Yongcheng Jing; Haifeng Sun; Jingyu Wang; Jianxin Liao; Leszek Rutkowski; Dacheng Tao; |
| 442 | MMIFEvol: Towards Evolutionary Multimodal Instruction Following Highlight: This paper proposes MMIFEvol, a framework for multimodal instruction evolving and benchmarking. |
Haoyu Wang; Sihang Jiang; Xiangru Zhu; Yuyan Chen; Xiaojun Meng; Jiansheng Wei; Yitong Wang; Yanghua Xiao; |
| 443 | 3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation Highlight: In this paper, we propose 3DTeethSAM, an adaptation of the Segment Anything Model 2 (SAM2) for 3D teeth segmentation. |
Zhiguo Lu; Jianwen Lou; Mingjun Ma; Hairong Jin; Youyi Zheng; Kun Zhou; |
| 444 | BitDP: Ultra-low-bit Communication for Data Parallelism in LLM Training Highlight: In this paper, we propose BitDP, an ultra-low-bit gradient quantization system that reduces communication costs by up to 32× while preserving model accuracy with less than 1% performance degradation. |
Xiaozhe Ren; Qiong Luo; |
| 445 | ComLQ: Benchmarking Complex Logical Queries in Information Retrieval Highlight: Thus, these benchmarks cannot be used to sufficiently evaluate the performance of IR models on complex queries in real-world scenarios. To address this problem, we propose a novel method leveraging large language models (LLMs) to construct a new IR dataset, ComLQ, for Complex Logical Queries, which comprises 2,909 queries and 11,251 candidate passages. |
Ganlin Xu; Zhitao Yin; Linghao Zhang; Jiaqing Liang; Weijia Lu; Xiaodong Zhang; Zhifei Yang; Sihang Jiang; Deqing Yang; |
| 446 | Consensus-Driven Multi-Agent Cognitive Reasoning for Enhancing The Emotional Intelligence of Large Language Models Highlight: Benchmarks such as EmoBench attribute this gap to deficiencies in cognitively demanding tasks that require inferring others’ latent mental states, intentions, and emotions in nuanced social contexts. To address this, we propose MACRo, a Multi-Agent Cognitive Reasoning framework that generates a structured Cognitive Chain of Thought comprising Situation, Clue, Thought, Action, and Emotion. |
Geng Tu; Dingming Li; Jun Huang; Ruifeng Xu; |
| 447 | DIFFA: Large Language Diffusion Models Can Listen and Understand Highlight: In this work, we introduce DIFFA, the first diffusion-based large audio-language model designed to perform spoken language understanding. |
Jiaming Zhou; Hongjie Chen; Shiwan Zhao; Jian Kang; Jie Li; Enzhi Wang; Yujie Guo; Haoqin Sun; Hui Wang; Aobo Kong; Yong Qin; Xuelong Li; |
| 448 | FT-MoE: Sustainable-learning Mixture of Experts for Fault-Tolerant Computing Highlight: This is primarily because their homogenization of fault knowledge perception makes it difficult to fully capture diverse and complex fault patterns. To address these challenges, we propose FT-MoE, a sustainable-learning fault-tolerant computing framework based on a dual-path architecture for high-accuracy fault detection and classification. |
Wenjing Xiao; Wenhao Song; Miaojiang Chen; Min Chen; |
| 449 | Thinking Aesthetics Assessment of Image Color Temperature: Models, Datasets and Benchmarks Highlight: Yet, within the existing IAA field, little light has been shed on assessing the aesthetic quality of image color temperature. To bridge this gap, we introduce a new task: Image Color Temperature Aesthetics Assessment (ICTAA). |
Jinguang Cheng; Chunxiao Li; Shuai He; Taiyu Chen; Anlong Ming; |
| 450 | Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers Highlight: However, the generated images often exhibit low visual quality and stylistic inconsistency with the base model, indicating a loss of pretrained knowledge. To alleviate this issue, we construct the Layout Synthesis (LaySyn) dataset, which leverages images synthesized by the base model itself to mitigate the distribution shift from the pretraining data. |
Sida Huang; Siqi Huang; Ping Luo; Hongyuan Zhang; |
| 451 | Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute Highlight: This paper presents a simple, effective, and cost-efficient strategy, named ModelSwitch, to improve LLM performance by scaling test-time compute. |
Jianhao Chen; Zishuo Xun; Bocheng Zhou; Han Qi; Hangfan Zhang; Qiaosheng Zhang; Yang Chen; Wei Hu; Yuzhong Qu; Shuyue Hu; |
| 452 | Improving Large Molecular Language Model Via Relation-aware Multimodal Collaboration Highlight: However, existing LMLMs often suffer from hallucination and limited robustness, largely due to inadequate integration of diverse molecular modalities such as 1D sequences, 2D molecular graphs, and 3D conformations. To address these limitations, we propose CoLLaMo, a large language model-based molecular assistant equipped with a multi-level molecular modality-collaborative projector. |
Jinyoung Park; Minseong Bae; Jeehye Na; Hyunwoo J. Kim; |
| 453 | New Synthetic Goldmine: Hand Joint Angle-Driven EMG Data Generation Framework for Micro-Gesture Recognition Highlight: However, its performance is often limited by the scarcity of labeled EMG data, significant cross-user variability, and poor generalization to unseen gestures. To address these challenges, we propose SeqEMG-GAN, a conditional, sequence-driven generative framework that synthesizes high-fidelity EMG signals from hand joint angle sequences. |
Nana Wang; Suli Wang; Gen Li; Pengfei Ren; Hao Su; |
| 454 | Towards Explainable Video Camouflaged Object Detection: SAM2 with Eventstream-Inspired Data Highlight: Inspired by the human strategy of identifying abnormal movements between frames and the principle of event camera image formation, we propose an eventstream-inspired dual-branch framework for VCOD. |
Hong Zhang; Yixuan Lyu; Hanyang Liu; Jianbo Song; Ding Yuan; Yifan Yang; |
| 455 | Towards A Rigorous Understanding of The Population Dynamics of The NSGA-III: Tight Runtime Bounds Highlight: A central open problem concerns its population dynamics, which involve controlling the maximum number of individuals sharing the same fitness value during the exploration process. In this paper, we make a significant step towards such an understanding by proving tight runtime bounds for NSGA-III on the bi-objective OneMinMax (2-OMM) problem. |
Andre Opris; |
| 456 | Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models Highlight: In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. |
Tianyi Zhou; Johanne Medina; Sanjay Chawla; |
| 457 | Look-Back: Implicit Visual Re-focusing in MLLM Reasoning Highlight: In this work, through an analysis of MLLM attention patterns, we made an intriguing observation: with appropriate guidance, MLLMs can spontaneously re-focus their attention on visual inputs during the later stages of reasoning, even without explicit visual injection. |
Shuo Yang; Yuwei Niu; Yuyang Liu; Yang Ye; Bin Lin; Li Yuan; |
| 458 | AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin Highlight: Fine-tuning large language models (LLMs) improves performance but introduces critical safety vulnerabilities: even minimal harmful data can severely compromise safety measures. |
Shuo Yang; Qihui Zhang; Yuyang Liu; Yue Huang; Xiaojun Jia; Kun-Peng Ning; Jia-Yu Yao; Jigang Wang; Dai Hailiang; Yibing Song; Li Yuan; |
| 459 | MegaCoin: Enhancing Medium-Grained Color Perception for Vision-Language Models Highlight: We explore benchmarking DG methods in the linear probing setup for VLM and show some new insights. |
Ming-Chang Chiu; Shicheng Wen; Pin-Yu Chen; Xuezhe Ma; |
| 460 | MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Furthermore, it triggers catastrophic performance degradation due to interruptions in cross-modal interactions. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. |
Huangbiao Xu; Huanqi Wu; Xiao Ke; Junyi Wu; Rui Xu; Jinglin Xu; |
| 461 | MoBGS: Motion Deblurring Dynamic 3D Gaussian Splatting for Blurry Monocular Video Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present MoBGS, a novel motion deblurring 3D Gaussian Splatting (3DGS) framework capable of reconstructing sharp and high-quality novel spatio-temporal views from blurry monocular videos in an end-to-end manner. |
Minh-Quan Viet Bui; Jongmin Park; Juan Luis Gonzalez; Jaeho Moon; Jihyong Oh; Munchurl Kim; |
| 462 | LoRA in LoRA: Towards Parameter-Efficient Architecture Expansion for Continual Visual Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Yet, existing methods often expand entire layers for each task, leading to significant parameter overhead and poor scalability. To overcome these issues, we introduce LoRA in LoRA (LiLoRA), a highly efficient architecture expansion method tailored for CVIT in MLLMs. |
Chang Che; Ziqi Wang; Pengwan Yang; Cheems Wang; Hui Ma; Zenglin Shi; |
| 463 | Dynamic-Static Collaboration for Unsupervised Domain Adaptive Video-Based Visible-Infrared Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Directly extending existing image-based unsupervised VI-ReID methods to video scenarios by simply averaging frame-level features is suboptimal, as this naive strategy neglects the rich temporal dynamics in video data and leads to unreliable pseudo-labels due to occlusion-induced noise. To overcome these limitations, we propose a Dynamic-Static Collaboration (DSC) framework that explicitly leverages the complementary strengths of motion and appearance cues. |
Jiaxu Leng; Zhengjie Wang; Shuang Li; Xinbo Gao; |
| 464 | MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. |
Liu Liu; Alexandra Schild; Marco Cipriano; Fatimeh Al Ghannam; Freya Tan; Gerard de Melo; Andres Sevtsuk; |
| 465 | Easy for Children, Hard for AI: The Limits of Multimodal LLMs in Early Childhood Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing research on child-related AI largely centers on modeling language, emotion, or behavior, with limited focus on evaluating cognitive tasks relevant to early learning. To address this gap, we propose ChildBench, a multimodal benchmark designed to assess models on tasks inspired by early childhood cognitive development. |
Jingping Liu; Xueyan Wu; Hanxuan Chen; Ziyan Liu; Zhangquan Chen; Ronghao Chen; Huacan Wang; |
| 466 | Empowering Semantic-Sensitive Underwater Image Enhancement with VLM Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, distribution shifts between high-quality enhanced outputs and natural images can hinder semantic cue extraction for downstream vision tasks, thereby limiting the adaptability of existing enhancement models. To address this challenge, this work proposes a new learning mechanism that leverages Vision-Language Models (VLMs) to empower UIE models with semantic-sensitive capabilities. |
Guodong Fan; Shengning Zhou; Genji Yuan; Huiyu Li; Jingchun Zhou; Jinjiang Li; |
| 467 | Text-based Aerial-Ground Person Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), which aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions. |
Xinyu Zhou; Yu Wu; Jiayao Ma; Wenhao Wang; Min Cao; Mang Ye; |
| 468 | Brownian Bridge Augmented Surrogate Simulation and Injection Planning for Geological CO2 Storage Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Prior literature, including numerical optimization methods and surrogate-optimization methods, is limited by real-world GCS requirements of smooth state transitions and goal-directed planning within limited time. To address these limitations, we propose a Brownian Bridge–augmented framework for surrogate simulation and injection planning in GCS and develop two insights: (i) the Brownian bridge as a smooth state regularizer for a better surrogate simulator; (ii) the Brownian bridge as goal-time-conditioned planning guidance for better injection planning. |
Haoyue Bai; Guodong Chen; Wangyang Ying; Xinyuan Wang; Nanxu Gong; Sixun Dong; Giulia Pedrielli; Haoyu Wang; Haifeng Chen; Yanjie Fu; |
| 469 | A Retrieval Augmented Spatio-Temporal Framework for Traffic Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although advanced Spatio-temporal Graph Neural Networks (STGNNs) and pre-trained models have made significant progress in traffic prediction, two critical challenges persist: (i) limited contextual capacity when handling complex spatio-temporal dependencies, and (ii) low predictability at fine-grained spatio-temporal points caused by heterogeneous patterns. Inspired by Retrieval-Augmented Generation (RAG), we propose RAST, a universal framework that integrates retrieval-augmented mechanisms with spatio-temporal modeling to address these challenges. |
Weilin Ruan; Xilin Dang; Ziyu Zhou; Sisuo Lyu; Yuxuan Liang; |
| 470 | WaveEx: Accelerating Flow Matching-based Speech Generation Via Wavelet-guided Extrapolation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose WaveEx, a training-free and plug-in acceleration framework which replaces portions of ODE integration with wavelet-guided extrapolation. |
Xiaoqian Liu; Xiyan Gui; Zhengkun Ge; Yuan Ge; Chang Zou; Jiacheng Liu; Zhikang Niu; Qixi Zheng; Chen Xu; Xie Chen; Tong Xiao; Jingbo Zhu; Linfeng Zhang; |
| 471 | IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing static, termination-oriented evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent’s actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent’s interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. |
Xiaoya Lu; Zeren Chen; Xuhao Hu; Yijin Zhou; Weichen Zhang; Dongrui Liu; Lu Sheng; Jing Shao; |
| 472 | RefleXNet: Targeted Self-Reflection for Accurate Chest X-ray Reporting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods typically rely on global visual features and token-level supervision, limiting their sensitivity to subtle abnormalities and reducing their clinical reliability. To address these challenges, we present Reflective X-ray Network (RefleXNet), which systematically integrates multi-scale visual feature fusion and anatomical relational reasoning with a targeted self-reflective learning strategy. |
Xin Mei; Rui Mao; Xiaoyan Cai; Libin Yang; Erik Cambria; |
| 473 | JRDB-Reasoning: A Difficulty-Graded Benchmark for Visual Reasoning in Robotics Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing visual reasoning benchmarks often suffer from several limitations: they lack a clear definition of reasoning complexity, offer no control over generating questions of varying difficulty or over task customization, and fail to provide structured, step-by-step reasoning annotations (workflows). To bridge these gaps, we formalize reasoning complexity, introduce an adaptive query engine that generates customizable questions of varying complexity with detailed intermediate annotations, and extend the JRDB dataset with human-object interaction and geometric relationship annotations to create JRDB-Reasoning, a benchmark tailored for visual reasoning in human-crowded environments. |
Simindokht Jahangard; Mehrzad Mohammadi; Yi Shen; Zhixi Cai; Hamid Rezatofighi; |
| 474 | TabFlash: Efficient Table Understanding with Progressive Question Conditioning and Token Focusing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing Multimodal Large Language Model (MLLM) approaches often overlook these characteristics, resulting in uninformative and redundant visual representations. To address these issues, we aim to generate visual features that are both informative and compact for improved table understanding. |
Jongha Kim; Minseong Bae; Sanghyeok Lee; Jinsung Yoon; Hyunwoo J. Kim; |
| 475 | DriveSuprim: Towards Precise Trajectory Selection for End-to-End Planning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they face optimization challenges in precisely selecting the best option from thousands of candidates and distinguishing subtle but safety-critical differences, especially in rare and challenging scenarios. We propose DriveSuprim to overcome these challenges and advance the selection-based paradigm through a coarse-to-fine paradigm for progressive candidate filtering, a rotation-based augmentation method to improve robustness in out-of-distribution scenarios, and a self-distillation framework to stabilize training. |
Wenhao Yao; Zhenxin Li; Shiyi Lan; Zi Wang; Xinglong Sun; Jose M. Alvarez; Zuxuan Wu; |
| 476 | PASS: Probabilistic Agentic Supernet Sampling for Interpretable and Adaptive Chest X-Ray Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing tool-augmented agentic systems are limited in the real world by (i) black-box reasoning steps that undermine trust in decision-making and pose safety risks, (ii) poor multimodal integration, which is inherently critical for healthcare tasks, and (iii) rigid and computationally inefficient agentic pipelines. We introduce PASS (Probabilistic Agentic Supernet Sampling), the first multimodal framework to address these challenges for Chest X-Ray (CXR) reasoning. |
Yushi Feng; Junye Du; Yingying Hong; Qifan Wang; Lequan Yu; |
| 477 | TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In order to reduce reliance on costly temporal annotations, we introduce a multi-task prompt fine-tuning approach that seamlessly integrates temporal-sensitive tasks into existing instruction datasets without requiring additional annotations. |
Yunxiao Wang; Meng Liu; Wenqi Liu; Xuemeng Song; Bin Wen; Fan Yang; Tingting Gao; Di Zhang; Guorui Zhou; Liqiang Nie; |
| 478 | Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing approaches often overlook the complexities of long-term interactions and fail to capture users’ subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. |
Zhaopei Huang; Qifeng Dai; Guozheng Wu; Xiaopeng Wu; Xubin Li; Tiezheng Ge; Wenxuan Wang; Qin Jin; |
| 479 | HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, this task poses significant challenges due to self-occlusion of the hands and occlusions caused by interactions with objects. In this paper, we propose HandMCM to address these challenges. |
Wencan Cheng; Gim Hee Lee; |
| 480 | PDE-Driven Spatiotemporal Generative Modeling for Multilead ECG Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Physics-Inspired Partial Differential Equation GAN for Multilead ECG Synthesis (PhysioPDE-GAN), a generative framework designed to model the spatiotemporal structure of multilead ECG signals by incorporating physiological priors and spatial constraints directly into the generative process. |
Yakir Yehuda; Kira Radinsky; |
| 481 | I Have Covered All The Bases Here: Interpreting Reasoning Features in Large Language Models Via Sparse Autoencoders Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To test this hypothesis, we employ Sparse Autoencoders (SAEs), a technique for sparse decomposition of neural network activations into human-interpretable features. We introduce ReasonScore, an automatic metric to identify active SAE features during these reasoning moments. |
Andrey V. Galichin; Alexey Dontsov; Polina Druzhinina; Anton Razzhigaev; Oleg Rogov; Elena Tutubalina; Ivan Oseledets; |
| 482 | Versatile Vision-Language Model for 3D Computed Tomography Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Besides, current MVLMs exhibit constrained voxel-level capabilities, lacking an effective multi-task instruction tuning framework capable of achieving robust performance across various downstream tasks. To address these challenges, we propose CTInstruct, a novel MVLM employing a hybrid ResNet-ViT encoder with multi-granular vision-language pretraining for efficient heterogeneous data modeling, and unified instruction tuning that jointly optimizes discriminative, generative, and voxel-level reasoning for volumetric medical imaging. |
Jiayu Lei; Ziqing Fan; Yanyong Zhang; Weidi Xie; Ya Zhang; Yanfeng Wang; |
| 483 | A Stage-Aware Mixture of Experts Framework for Neurodegenerative Disease Progression Modelling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, modeling this progression remains challenging due to 1) the scarcity of longitudinal data obtained through irregular and infrequent subject visits and 2) the complex interplay of pathological mechanisms across brain regions and disease stages, where traditional models assume fixed mechanisms throughout disease progression. To address these limitations, we propose a novel stage-aware Mixture of Experts (MoE) framework that explicitly models how different contributing mechanisms dominate at different disease stages through time-dependent expert weighting. |
Tiantian He; Keyue Jiang; An Zhao; Anna Schroder; Elinor Thompson; Sonja Soskic; Frederik Barkhof; Daniel C. Alexander; |
| 484 | Balancing Signal and Variance: Adaptive Offline RL Post-Training for VLA Flow Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we theoretically propose an offline RL post-training objective for VLA flow models and induce an efficient and feasible offline RL fine-tuning algorithm, Adaptive Reinforced Flow Matching (ARFM). |
Hongyin Zhang; Shiyuan Zhang; Junxi Jin; Qixin Zeng; Yifan Qiao; Hongchao Lu; Donglin Wang; |
| 485 | Robust Fusion Controller: Degradation-Aware Image Fusion with Fine-Grained Language Instructions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current image fusion methods struggle to adapt to real-world environments encompassing diverse degradations with spatially varying characteristics. To address this challenge, we propose a robust fusion controller (RFC) capable of achieving degradation-aware image fusion through fine-grained language instructions, ensuring its reliable application in adverse environments. |
Hao Zhang; Yanping Zha; Qingwei Zhuang; Zhenfeng Shao; Jiayi Ma; |
| 486 | Priority-Based Graph-Enhanced Reinforcement Learning for Robust Analog Circuit Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address these, we propose a priority-based graph-enhanced RL framework. |
Jintao Li; Zhenxin Chen; Sicheng He; Ao-Jin Li; Shui Yu; |
| 487 | MindCross: Fast New Subject Adaptation with Limited Data for Cross-subject Video Reconstruction from Brain Signals Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To achieve fast and data-efficient new subject adaptation, we propose MindCross, a novel cross-subject brain decoding framework. |
Xuan-Hao Liu; Yan-Kai Liu; Tianyi Zhou; Bao-Liang Lu; Wei-Long Zheng; |
| 488 | Hybrid-DMKG: A Hybrid Reasoning Framework Over Dynamic Multimodal Knowledge Graphs for Multimodal Multihop QA with Knowledge Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our evaluation shows that current MKE methods often struggle to consistently update and reason over multimodal reasoning chains following knowledge edits. To overcome these challenges, we propose Hybrid-DMKG, a hybrid reasoning framework built on a dynamic multimodal knowledge graph (DMKG) to enable accurate multihop reasoning over updated multimodal knowledge. |
Li Yuan; Qingfei Huang; Bingshan Zhu; Yi Cai; Qingbao Huang; Changmeng Zheng; Zikun Deng; Tao Wang; |
| 489 | SIDE: Surrogate Conditional Data Extraction from Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Through extensive experiments on CIFAR-10, CelebA, ImageNet, and LAION-5B, we show that SIDE can successfully extract training data from so-called safe unconditional models, outperforming baseline attacks even on conditional models. Complementing these findings, we present a unified theoretical framework based on informative labels, demonstrating that all forms of conditioning, explicit or surrogate, amplify memorization. |
Yunhao Chen; Shujie Wang; Difan Zou; Xingjun Ma; |
| 490 | TextShield-R1: Reinforced Reasoning for Tampered Text Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, our approach introduces Forensic Continual pre-training, an easy-to-hard curriculum that prepares the MLLM well for tampered text detection by harnessing large-scale, cheap data from natural image forensics and OCR tasks. |
Chenfan Qu; Yiwu Zhong; Jian Liu; Xuekang Zhu; Bohan Yu; Lianwen Jin; |
| 491 | VideoSeg-R1: Reasoning Video Object Segmentation Via Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. |
Zishan Xu; Yifu Guo; Yuquan Lu; Fengyu Yang; Junxin Li; Lihua Cai; |
| 492 | S²Flow: Towards Fast and Authentic Training-Free High-Resolution Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose S²Flow, a training-free framework that enables efficient and authentic high-resolution video generation by jointly exploring Flow-guided Sparse attention and Second-order ODE solution. |
Chaoqun Wang; Shaobo Min; Xu Yang; |
| 493 | Correcting False Alarms from Unseen: Adapting Graph Anomaly Detectors at Test Time Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To bridge the gap, we propose a lightweight, plug-and-play Test-time adaptation framework for correcting Unseen Normal pattErns (TUNE) in GAD. |
Junjun Pan; Yixin Liu; Chuan Zhou; Fei Xiong; Alan Wee-Chung Liew; Shirui Pan; |
| 494 | Instance-Guided Scene Adaptation for Unsupervised Person Search Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose an Instance-Guided Scene Adaptation (IGSA) framework by eradicating scene disparities and focusing the tasks on instances, effectively eliminating the contradiction between person search and domain adaptation. |
Linfeng Qi; Huibing Wang; Jinjia Peng; Xianping Fu; Jiqing Zhang; |
| 495 | Localization-Anchored Instance Discrimination for Domain Adaptive Person Search Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Domain-adaptive person search (DAPS) aims to transfer pedestrian detection and re-identification capabilities from a labeled source domain to an unlabeled target domain, yet faces critical challenges from domain shift: semantic confusion among overlapping instances, over-reliance on shallow features for look-alike targets, and poor discriminability of small-scale instances. To address these issues, we propose the Localization-Anchored Instance Discrimination (LAID) framework, which leverages spatial relationships between bounding boxes as auxiliary signals to enhance instance identity learning. |
Linfeng Qi; Huibing Wang; Jinjia Peng; Jiqing Zhang; |
| 496 | Manipulation Intention Understanding for Zero-Shot Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce an intent-centric image–text dataset generated via reasoning by a Multimodal Large Language Model (MLLM) to better train ZS-CIR models for human manipulation intent understanding. Building on this dataset, we propose De-MINDS, a framework that distills the MLLM’s reasoning ability to capture manipulation intent and enhance models’ comprehension of modified text. |
Yuanmin Tang; Jing Yu; Keke Gai; Gang Xiong; Gaopeng Gou; Meikang Qiu; Qi Wu; |
| 497 | OccamVTS: Distilling Vision Models to 1% Parameters for Time Series Forecasting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose OccamVTS, a knowledge distillation framework that extracts only the essential 1% of predictive information from LVMs into lightweight networks. |
Sisuo Lyu; Siru Zhong; Weilin Ruan; Qingxiang Liu; Qingsong Wen; Hui Xiong; Yuxuan Liang; |
| 498 | Video Spatial Reasoning with Object-Centric 3D Rollout Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing approaches primarily rely on spatially grounded supervised fine-tuning or reinforcement learning, yet we observe that such models often exhibit query-locked reasoning, focusing narrowly on objects explicitly mentioned in the prompt while ignoring critical contextual cues. To address this limitation, we propose Object-Centric 3D Rollout (OCR), a novel strategy that introduces structured perturbations to the 3D geometry of selected objects during training. |
Haoran Tang; Meng Cao; Ruyang Liu; Xiaoxi Liang; Linglong Li; Ge Li; Xiaodan Liang; |
| 499 | An Efficient and Harmonized Framework for Balanced Cross-Domain Feature Integration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a novel framework that utilizes customized models to learn style representations. |
Shaoxu Li; Ye Pan; |
| 500 | Sparse-vDiT: Unleashing The Power of Sparse Attention to Accelerate Video Diffusion Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Crucially, these patterns exhibit strong layer-depth and head-position correlations but show limited dependence on the input content. Leveraging these findings, we propose Sparse-vDiT, a sparsity acceleration framework for vDiT comprising: 1) Pattern-optimized sparse kernels that replace dense attention with computationally efficient implementations for each identified sparsity pattern. |
Pengtao Chen; Xianfang Zeng; Maosen Zhao; Mingzhu Shen; Wei Cheng; Gang Yu; Tao Chen; |
This table only includes 500 papers selected by our daily digest algorithm. To continue with the full list (~4,300 papers), please visit Paper Digest: AAAI-2026 (Full List).