Paper Digest: ACL 2026 Papers & Highlights
Note: ACL-2026 accepts more than 2,400 papers; this page includes only 500 of them, selected by our daily paper digest algorithm. Interested users can read all 2,400 ACL-2026 papers on a separate page, which takes quite some time to load.
To search for papers presented at ACL-2026 on a specific topic, use the search by venue (ACL-2026) service. To summarize the latest research published at ACL-2026 on a specific topic, use the review by venue (ACL-2026) service. If you are interested in browsing papers by author, we have a comprehensive list of ~11,000 authors (ACL-2026). You may also want to explore our “Best Paper” Digest (ACL), which lists the most influential ACL papers since 1981.
Since 2018, Paper Digest has built a foundation of data spanning decades of conferences, journals, and research topics. The platform features a daily digest service that sifts through tens of thousands of new papers, clinical trials, news articles, and community posts, filtering the noise to highlight what matters most to specific interests. Beyond daily updates, dozens of built-in research tools streamline the academic workflow, supporting efficient reading and writing, comprehensive literature reviews, and automated research report generation.
Paper Digest Team
New York City, New York, 10017
team@paperdigest.org
TABLE 1: Paper Digest: ACL 2026 Papers & Highlights
| # | Paper | Author(s) |
|---|---|---|
| 1 | MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL Via Agentic Training Highlight: We present MTSQL-R1, an agentic training framework for long-horizon multi-turn Text-to-SQL. | Taicheng Guo; Hai Wang; Chaochun Liu; Mohsen Golalikhani; Xin Chen; Xiangliang Zhang; Chandan K. Reddy; |
| 2 | DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints Highlight: Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. | Yinger Zhang; Shutong Jiang; Renhao Li; Jianhong Tu; Yang Su; Lianghao Deng; Xudong Guo; ChenXu Lv; Junyang Lin; |
| 3 | From Completion to Editing: Unlocking Context-Aware Code Infilling Via Search-and-Replace Instruction Tuning Highlight: While Chat LLMs offer safety and Agentic workflows provide flexibility, they suffer from performance degradation and prohibitive latency, respectively. To resolve this dilemma, we propose Search-and-Replace Infilling (SRI), a framework that internalizes the agentic verification-and-editing mechanism into a unified, single-pass inference process. | Jiajun Zhang; Zeyu Cui; Jiaxi Yang; Lei Zhang; Yuheng Jing; Zeyao Ma; Tianyi Bai; Zilei Wang; Qiang Liu; Liang Wang; Binyuan Hui; Junyang Lin; |
| 4 | Outcome Accuracy Is Not Enough: Aligning The Reasoning Process of Reward Models Highlight: We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model’s reasoning process and human judgment. | Binghai Wang; Yantao Liu; Yuxuan Liu; Tianyi Tang; Shenzhi Wang; Chang Gao; Chujie Zheng; Yichang Zhang; Le Yu; Shixuan Liu; Tao Gui; Qi Zhang; Xuanjing Huang; Bowen Yu; Fei Huang; Junyang Lin; |
| 5 | LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models Highlight: Masked diffusion language models present a promising paradigm for language modeling, yet the systematic theoretical analysis and comprehensive empirical validation of their alignment on general tasks remain relatively underexplored. In this paper, we identify the primary challenge for this problem: the high variance in Evidence Lower Bound (ELBO)-based likelihood estimates required for preference optimization. | Fengqi Zhu; Rongzhen Wang; Shen Nie; Xiaolu Zhang; Chunwei Wu; Jun Zhou; Yankai Lin; Ji-Rong Wen; Chongxuan Li; |
| 6 | Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories Via Reinforcement Learning Highlight: We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns structured operations, including ADD, UPDATE, DELETE, and NOOP; and an Answer Agent that pre-selects and reasons over relevant entries. | Sikuan Yan; Xiufeng Yang; Zuchao Huang; Ercong Nie; Zifeng Ding; Zonggen Li; Xiaowen Ma; Jinhe Bi; Kristian Kersting; Jeff Z. Pan; Hinrich Schuetze; Volker Tresp; Yunpu Ma; |
| 7 | COSMOS: Connectivity-Oriented Submodular Maximization for Optimal Subgraph Retrieval Highlight: In this paper, we propose **COSMOS** (**C**onnectivity-**O**riented **S**ubmodular **M**aximization for **O**ptimal **S**ubgraph Retrieval), a unified framework that formalizes evidence retrieval as a constrained submodular maximization problem. | Boci Peng; Xiao Liu; Boren Hu; Yun Zhu; Xuanbo Fan; Yanwei Yue; Chunyu Yang; Yan Zhang; |
| 8 | REST: Stress Testing Large Reasoning Models By Asking Multiple Problems at Once Highlight: This single-question setup suffers from two major limitations: (1) vulnerability to data contamination and diminishing difficulty, forcing costly creation of new questions with significant human effort, (2) failure to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present **REST** (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. | Zhuoshi Pan; Qizhi Pei; Yu Li; Zinan Tang; QiYao Sun; H. Vicky Zhao; Conghui He; Lijun Wu; |
| 9 | Frankentext: Stitching Random Text Fragments Into Long-form Narratives Highlight: We introduce Frankentexts, a long-form narrative generation paradigm that treats an LLM as a composer of existing texts rather than as an author. | Chau Minh Pham; Jenna Russell; Dzung Pham; Mohit Iyyer; |
| 10 | Merlin’s Whisper: Enabling Efficient Reasoning in Large Language Models Via Black-box Persuasive Prompting Highlight: This work presents a new approach to mitigating overthinking in LRMs via black-box persuasive prompting. | Heming Xia; Cunxiao Du; Rui Li; Chak Tou Leong; Yongqi Li; Wenjie Li; |
| 11 | Inferring Events from Time Series Using Language Models Highlight: We introduce an automated method for generating tasks that test a model’s ability to reason about events associated with time series data based on sports data, and develop a new benchmarking method. | Mingtian Tan; Mike A Merrill; Zachary Gottesman; Tim Althoff; David Evans; Thomas Hartvigsen; |
| 12 | Video-MMMU: Evaluating Knowledge Acquisition from Multidisciplinary Professional Videos Highlight: However, existing video benchmarks fail to evaluate the knowledge acquisition capabilities of Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-discipline, multi-track benchmark that evaluates LMMs’ ability to acquire knowledge from college-level, educational videos. | Kairui Hu; Penghao Wu; Fanyi Pu; Wang Xiao; Xiang Yue; Bo Li; Yuanhan Zhang; Ziwei Liu; |
| 13 | BadScientist: Can A Research Agent Write Convincing But Unsound Papers That Fool LLM Reviewers? Highlight: We develop a rigorous evaluation framework with formal error guarantees (concentration bounds and calibration analysis), calibrated on real data. | Fengqing Jiang; Yichen Feng; Yuetai Li; Luyao Niu; Basel Alomair; Radha Poovendran; |
| 14 | When One LLM Drools, Multi-LLM Collaboration Rules Highlight: We challenge the status quo of relying solely on a single general-purpose LLM and argue for multi-LLM collaboration to better represent the extensive diversity of data, skills, and people. | Shangbin Feng; Wenxuan Ding; Alisa Liu; Zifeng Wang; Weijia Shi; Yike Wang; Shannon Zejiang Shen; Xiaochuang Han; Hunter Lang; Chen-Yu Lee; Tomas Pfister; Yejin Choi; Yulia Tsvetkov; |
| 15 | RespiraMFM: A Multimodal Foundation Model with Contrastive Audio-Language Alignment for Respiratory Disease Identification Highlight: In this paper, we propose RespiraMFM, a Multimodal Foundation Model that integrates respiratory sounds with patient medical history and symptoms to enhance diagnostic accuracy and disease detection capabilities. | Shakhrul Iman Siam; Tiantian Feng; Jiankun Zhang; Shrikanth Narayanan; Mi Zhang; |
| 16 | Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate Highlight: However, it is compute-intensive, requiring generation of long transcripts before answering questions. To address this inefficiency, we develop a framework that distills multi-agent debate into a single LLM through a two-stage fine-tuning pipeline combining debate structure learning with internalization via dynamic reward scheduling and length clipping. | John Seon Keun Yi; Aaron Mueller; Dokyun Lee; |
| 17 | AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments Highlight: Consequently, they understate the difficulty of real deployments, where uncertainty and noise are ubiquitous and agents must proactively explore the environment to uncover new tools. To bridge this gap, we present AgentGym2, a new evaluation framework with task instances grounded in real-world end-to-end working demands. | Zhiheng Xi; Dingwen Yang; Jiaqi Liu; Jixuan Huang; Honglin Guo; Baodai Huang; Tinggang Chen; Qi Zhang; Zhonghang Lu; Chenyu Liu; Jiajun Sun; Jiazheng Zhang; Dingwei Zhu; Xin Guo; Junzhe Wang; Zhihao Zhang; Yuming Yang; Junjie Ye; Minghe Gao; Dongrui Liu; Jiaming Ji; Guohao Li; Tao Gui; Qi Zhang; Xuanjing Huang; |
| 18 | PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning Highlight: We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. | Jingcheng Hu; Yinmin Zhang; Shijie Shang; Xiaobo Yang; Yue Peng; Zhewei Huang; Hebin Zhou; Xin Wu; Jie Cheng; Fanqi Wan; Xiangwen Kong; Chengyuan Yao; Kaiwen Yan; Ailin Huang; Hongyu Zhou; Qi Han; Zheng Ge; Xiangyu Zhang; Heung-Yeung Shum; |
| 19 | One Tokenizer To Rule Them All: Emergent Language Plasticity Via Multilingual Tokenizers Highlight: In this work, we study what relatively cheap interventions early on in training improve *language plasticity*, or adaptation capabilities of the model post-training to new languages. | Diana Abagyan; Alejandro R. Salamanca; Andres Felipe Cruz-Salinas; Kris Cao; Hangyu Lin; Acyr Locatelli; Marzieh Fadaee; Ahmet Üstün; Sara Hooker; |
| 20 | It’s Not What You Say, It’s How You Say It: Evaluating LLM Responses to Expressions of Belief Highlight: We introduce a typology to systematically evaluate how different EoBs affect whether models follow context versus prior knowledge. | Kevin Du; Clara Kümpel; Michelle Wastl; Alex Warstadt; |
| 21 | Can Factual Opinions Be Edited (Manipulated) in Large Language Models? Highlight: Such manipulation could reshape public images, influence elections, and alter societal views. To systematically assess this threat, we introduce the Factual Opinion Editing with Evidence (FOE) benchmark, which encompasses 261 public figures, 19 issue categories, and 2,178 complete opinion records. | Yuanpu Cao; Ziyi Yin; Fenglong Ma; Jinghui Chen; |
| 22 | Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs As Semantic Encoders Highlight: Prior studies often rely on general-purpose embedding benchmarks (e.g., MTEB) when selecting LLMs, overlooking the unique characteristics of recommendation tasks. To address this gap, we introduce BLaIR, a comprehensive benchmark for evaluating LLMs as semantic encoders in recommendation scenarios. | Yupeng Hou; Jiacheng Li; Xiangjun Fu; Zhankui He; An Yan; Xiusi Chen; Julian McAuley; |
| 23 | Attn-GS: Attention-Guided Context Compression for Efficient Personalized LLMs Highlight: Through preliminary studies on representative personalization tasks, we discover that (a) LLMs’ attention patterns naturally reveal important signals, and (b) fine-tuning enhances LLMs’ ability to distinguish between relevant and irrelevant information. Based on these insights, we propose Attn-GS, an attention-guided context compression framework that leverages attention feedback from a marking model to mark important personalization sentences, then guides a compression model to generate task-relevant, high-quality compressed user contexts. | Shenglai Zeng; Tianqi Zheng; Chuan Tian; Dante Everaert; Yau-Shian Wang; Yupin Huang; Michael J. Morais; Rohit Patki; Jinjin Tian; Xinnan Dai; Kai Guo; Monica Xiao Cheng; Hui Liu; |
| 24 | LLM-VA: Resolving The Jailbreak-Overrefusal Trade-off Via Vector Alignment Highlight: We identify the root cause: LLMs encode the decision to respond (answer vector va) and the judgment of input safety (benign vector vb) as nearly orthogonal directions, treating them as independent processes. We propose LLM-VA, which aligns va with vb through closed-form weight updates, making the model’s willingness to respond causally dependent on its safety assessment, without fine-tuning or architectural changes. | Haonan Zhang; Dongxia Wang; Yi Liu; Kexin Chen; Wenhai Wang; |
| 25 | LOTUS: Evolving Multimodal Unlearning Via Hyperbolic Entailment and Lorentz Transport Highlight: We introduce LOTUS (Lorentz Transport for Unlearning Strategies), a framework for surgical semantic pruning within the Lorentz manifold. | Zekun Wang; Jingjie Zeng; Yingxu Li; Hongfei Lin; Liang Yang; |
| 26 | When Efficiency Becomes A Vulnerability: Computational Cost Attacks on WebAgents Highlight: Adversaries can inject malicious prompts into web pages, causing WebAgents to generate unnecessarily long reasoning processes and incur excessive computational cost, termed Computational Cost Attacks (CCA). In this paper, to systematically study this vulnerability under realistic black-box settings, we propose CostBomb, a generation-then-selection attack framework that leverages large language models to generate diverse adversarial prompts and a reinforcement learning–enhanced selector to identify the most effective perturbations. | Liang-Bo Ning; Yuchen Zhu; Heqing Huang; Xin Wang; Yi Chang; Li Qing; Wenqi Fan; |
| 27 | Selective Knowledge Distillation: Fusing LLM Semantic Strengths with DNN Efficiency for Binary Code Similarity Detection Highlight: Yet, LLM-based BCSD methods are constrained by their large model sizes and high inference latency. To alleviate these limitations, this paper proposes BinSKD. | Shize Zhou; Peiyu Liu; Lirong Fu; Tong Ye; Wenhai Wang; |
| 28 | Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do Highlight: In this paper, we aim to systematically investigate the key question: What can multimodal Chain-of-Thought reasoning do, and where and why does it fall short? | Zhuoran Jin; Kejian Zhu; Hongbang Yuan; Yupu Hao; Pengfei Cao; Yubo Chen; Kang Liu; Jun Zhao; |
| 29 | Taming System Complexity: Demystifying Software Engineering Agents in Diagnosing Linux Kernel Faults Highlight: In this paper, we introduce LinuxFLBench, a FL benchmark constructed from real-world Linux kernel bugs. | Zhenhao Zhou; Zhuochen Huang; Yike He; Chong Wang; Jiajun Wang; Yijian Wu; Xin Peng; Yiling Lou; |
| 30 | WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback Highlight: Traditional alignment methods, relying on human or LLM annotated datasets, are limited by their resource-intensive nature, inherent subjectivity, misalignment with real-world user preferences, and the risk of feedback loops that amplify model biases. To overcome these limitations, we introduce WildFeedback, a novel framework that leverages in-situ user feedback during conversations with LLMs to create preference datasets automatically. | Taiwei Shi; Zhuoer Wang; Longqi Yang; Ying-Chun Lin; Zexue He; Mengting Wan; Pei Zhou; Sujay Kumar Jauhar; Sihao Chen; Shan Xia; Hongfei Zhang; Jieyu Zhao; Xiaofeng Xu; Xia Song; Jennifer Neville; |
| 31 | AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage Highlight: However, the increasing complexity of proposed methods often renders reproduction a labor-intensive endeavor, necessitating profound domain expertise. To address this, we introduce the paper lineage, which systematically mines implicit knowledge from the cited literature. | Xuanle Zhao; Zilin Sang; Yuxuan Li; Qi Shi; Weilun Zhao; Shuo Wang; Duzhen Zhang; Xu Han; Zhiyuan Liu; Maosong Sun; |
| 32 | Value of Information: A Framework for Human–Agent Communication Highlight: Existing approaches either rely on brittle confidence thresholds that require task-specific tuning, or fail to account for the varying stakes of different decisions. We introduce a decision-theoretic framework that resolves this trade-off through the Value of Information (VoI), enabling agents to dynamically weigh the expected utility gain from asking questions against the cognitive cost imposed on users. | Yijiang River Dong; Tiancheng Hu; Zheng Hui; Caiqi Zhang; Ivan Vulić; Andreea Bobu; Nigel Collier; |
| 33 | IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation Highlight: To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. | Bosi Wen; Yilin Niu; Cunxiang Wang; Xiaoying Ling; Ying Zhang; Pei Ke; Hongning Wang; Minlie Huang; |
| 34 | IF-CRITIC: Towards A Fine-Grained LLM Critic for Instruction-Following Evaluation Highlight: To this end, we propose IF-CRITIC, an LLM critic for fine-grained, efficient, and reliable instruction-following evaluation. | Bosi Wen; Yilin Niu; Cunxiang Wang; Pei Ke; Xiaoying Ling; Ying Zhang; Aohan Zeng; Hongning Wang; Minlie Huang; |
| 35 | From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts? Highlight: We propose a multi-concept evaluation setting using concepts such as sentiment, domain, voice, and tense. | Aaron Mueller; Andrew Lee; Shruti Joshi; Ekdeep Singh Lubana; Dhanya Sridhar; Patrik Reizinger; |
| 36 | VFA: Empowering Multilingual MLLMs Via Vision-Free Adaptation Highlight: We propose Vision-Free Adaptation (VFA), a framework that decouples multilingual language enhancement from visual alignment by composing complementary task vectors over a shared LLM backbone. | Yixia Li; Yaqing Shi; Zhiwen Ruan; Dongdong Zhang; Lingjie Jiang; Shaohan Huang; Yun Chen; Guanhua Chen; Furu Wei; |
| 37 | Act-Adaptive Margin: Dynamically Calibrating Reward Models for Subjective Ambiguity Highlight: To improve reward modeling in subjective tasks, this paper proposes AAM (Act-Adaptive Margin), which enhances reward modeling by dynamically calibrating preference margins using the model’s internal parameter knowledge. | Feiteng Fang; Dingwei Chen; Xiang Huang; Ting-En Lin; Yuchuan Wu; Xiong Liu; Jing Ye; Ziqiang Liu; Haonan Zhang; Liang Zhu; Hamid Alinejad-Rokny; Min Yang; Yongbin Li; |
| 38 | Probing Audio-Visual Reasoning in Multimodal Language Models Through The Lens of Audio Highlight: This raises a fundamental question: does a deficiency in low-level audio perception constrain higher-level audio-visual reasoning? To address this, we introduce AV-Odyssey Bench, a comprehensive benchmark of 4,555 meticulously designed problems that integrate text, audio, and visual modalities. | Kaixiong Gong; Kaituo Feng; Bohao Li; Yibing Wang; Mofan Cheng; Shijia Yang; Jiaming Han; Benyou Wang; Yutong Bai; Zhuoran Yang; Xiangyu Yue; |
| 39 | MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning Via Bipartite Matching Highlight: This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. | Changle Qu; Sunhao Dai; Hengyi Cai; Jun Xu; Shuaiqiang Wang; Dawei Yin; |
| 40 | Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts Highlight: This effect leads to increasingly volatile importance-ratio signals and bursty clipping behavior, which consistently precede training collapse. Motivated by this diagnosis, we propose Router-Shift Policy Optimization (RSPO). | Di Zhang; Xun Wu; Shaohan Huang; Lingjie Jiang; Yaru Hao; Li Dong; Zewen Chi; Zhifang Sui; Furu Wei; |
| 41 | Measuring Human Contribution in AI-Assisted Content Generation Highlight: This shift presents notable challenges for the delineation of originality due to the varying degrees of human contribution in AI-assisted works. This study raises the research question of measuring human contribution in AI-assisted content generation and introduces a framework to address this question that is grounded in information theory. | Yueqi Xie; Tao Qi; Jingwei Yi; Xiyuan Yang; Ryan Whalen; Junming Huang; Qian Ding; Yu Xie; Xing Xie; Fangzhao Wu; |
| 42 | AI Use in American Newspapers Is Widespread, Uneven, and Rarely Disclosed Highlight: AI is rapidly transforming journalism, but the extent of its use in published newspaper articles remains unclear. We address this gap by auditing a large-scale dataset of 186K articles from online editions of 1. | Jenna Russell; Marzena Karpinska; Destiny Akinode; James Zhou; Katherine Thai; Bradley Emi; Max Spero; Mohit Iyyer; |
| 43 | What Do Prosody and Text Convey? Characterizing How Meaningful Information Is Distributed Across Multiple Channels Highlight: In this paper, we propose an information-theoretic approach to quantify how much is conveyed by prosody that is not recoverable from text alone, and, crucially, what prosody conveys. | Aditya Yadavalli; Tiago Pimentel; Tamar I Regev; Ethan Gotlieb Wilcox; Alex Warstadt; |
| 44 | Interpretable Traces, Unexpected Outcomes: Investigating The Disconnect in Trace-Based Knowledge Distillation Highlight: To isolate the effect of trace semantics, we design experiments in the Question Answering (QA) domain using a rule-based problem decomposition method. This enables us to create Supervised Fine-Tuning (SFT) datasets for LLMs where each QA problem is paired with either verifiably correct or incorrect CoT traces, while always providing the correct final solution. | Siddhant Bhambri; Upasana Biswas; Subbarao Kambhampati; |
| 45 | MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing Highlight: Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. | Junbo Niu; Zheng Liu; Zhuangcheng Gu; Bin Wang; Linke Ouyang; Zhiyuan Zhao; Tao Chu; Tianyao He; Fan Wu; Qintong Zhang; Zhenjiang Jin; Guang Liang; Rui Zhang; Wenzheng Zhang; Yuan Qu; Zhifei Ren; Yuefeng Sun; Zirui Tang; Boyu Niu; Yuanhong Zheng; Dongsheng Ma; Ziyang Miao; Hejun Dong; Siyi Qian; Junyuan Zhang; Fangdong Wang; Jingzhou Chen; Xiaomeng Zhao; Liqun Wei; Wei Li; Shasha Wang; RuiLiang Xu; Yuanyuan Cao; Lu Chen; Qianqian Wu; Huaiyu Gu; Lindong Lu; Dechen Lin; Shenguanlin; Xuanhe Zhou; Linfeng Zhang; Yuhang Zang; Xiaoyi Dong; Jiaqi Wang; Bo Zhang; Lei Bai; Pei Chu; Weijia Li; Jiang Wu; Lijun Wu; Zhenxiang Li; Guangyu Wang; Zhongying Tu; Chao Xu; Kai Chen; Bowen Zhou; Dahua Lin; Wentao Zhang; Conghui He; |
| 46 | How Should We Enhance The Safety of Large Reasoning Models: An Empirical Study Highlight: In this paper, we present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning (SFT). | Zhexin Zhang; Xian Qi Loye; Victor Shea-Jay Huang; Junxiao Yang; Qi Zhu; Shiyao Cui; Fei Mi; Lifeng Shang; Yingkang Wang; Hongning Wang; Minlie Huang; |
| 47 | Success and Cost Elicit Convention Formation for Efficient Communication Highlight: We present a method to train large multimodal models to form conventions, enabling efficient communication. | Saujas Vaduguru; Yilun Hua; Yoav Artzi; Daniel Fried; |
| 48 | AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following Highlight: In this work, we introduce AdvancedIF, a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs’ ability to follow complex, multi-turn, and system-level instructions. | Yun He; Wenzhe Li; Hejia Zhang; Songlin Li; Karishma Mandyam; Sopan Khosla; Yuanhao Xiong; Nanshu Wang; Xiaoliang Peng; Beibin Li; Shengjie Bi; Shishir G Patil; Qi Qi; Shengyu Feng; Julian Katz-Samuels; Richard Yuanzhe Pang; Sujan Kumar Gonugondla; Hunter Lang; Yue Yu; Yundi Qian; Maryam Fazel-Zarandi; Licheng Yu; Amine Benhalloum; Hany Hassan Awadalla; Manaal Faruqui; |
| 49 | OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents Via Hybrid Validation in Realistic Workflows Highlight: To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. | Qiushi Sun; Mukai Li; Zhoumianze Liu; Zhihui Xie; Fangzhi Xu; Zhangyue Yin; Kanzhi Cheng; Zehao Li; Zichen Ding; Qi Liu; Zhiyong Wu; Zhuosheng Zhang; Ben Kao; Lingpeng Kong; |
| 50 | A Goal Without A Plan Is Just A Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Task Highlight: In this paper, we introduce a plan-and-execute framework and propose EAGLET, an efficient and effective planner training method to enhance the executor agent’s planning abilities without human effort. | Shuzheng Si; Haozhe Zhao; Kangyang Luo; Gang Chen; Fanchao Qi; Minjia Zhang; Baobao Chang; Maosong Sun; |
| 51 | OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models Highlight: We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. | Qiguang Chen; Chengyu Luan; Jiajun Wu; Qiming Yu; Yi Yang; Yizhuo Li; Jingqi Tong; Xiachong Feng; Libo Qin; Wanxiang Che; |
| 52 | FinSight: Towards Real-World Financial Deep Research Highlight: While recent deep research systems excel in open-domain search, they struggle with financial reporting, specifically in handling financial data, ensuring analytical depth, and integrating professional visualizations. To address this, we introduce FinSight, the first multi-agent framework for automated end-to-end generation of professional, multimodal financial reports. | Jiajie Jin; Yuyao Zhang; Yimeng Xu; Hongjin Qian; Yutao Zhu; Zhicheng Dou; |
| 53 | LongVideoAgent: Multi-Agent Reasoning with Long Videos Highlight: We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. | Runtao Liu; Ziyi Liu; Jiaqi Tang; Yue Ma; Renjie Pi; Jipeng Zhang; Qifeng Chen; |
| 54 | Gained in Translation: Privileged Pairwise Judges Enhance Multilingual Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In response, we introduce SP3F (Self-Play with Privileged Pairwise Feedback), a two-stage framework for enhancing multilingual reasoning without any data in the target language(s). |
Lintang Sutawika; Gokul Swamy; Steven Wu; Graham Neubig; |
| 55 | Mechanisms of Prompt-Induced Hallucination in Vision–Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. |
William Rudman; Michal Golovanevsky; Dana Arad; Yonatan Belinkov; Carsten Eickhoff; Ritambhara Singh; Kyle Mahowald; |
| 56 | Long-Chain Reasoning Distillation Via Adaptive Prefix Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This mismatch leads to a gap between the provided supervision signal and the learning capacity of the student model. To address this challenge, we propose Prefix-ALIGNment distillation (P-ALIGN), a framework that fully exploits teacher CoTs for distillation through adaptive prefix alignment. |
Zhenghao Liu; Zhuoyang Wu; Xinze Li; Yukun Yan; Shuo Wang; Zulong Chen; Yu Gu; Ge Yu; Maosong Sun; |
| 57 | TRAC: Teacher-Guided Token Reward with Adaptive Calibration for Robust Policy Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In reality, teacher models exhibit capability limitations and uncertainty, producing noisy signals that make student policies susceptible to reward hacking. To address this, we propose Teacher Reward Adaptive Calibration (TRAC), a robust framework that filters noisy supervision by dynamically modulating teacher influence via a multi-granularity calibration mechanism. |
Sitong Wu; Haoru Tan; Xichen Zhang; Bin Xia; Wenhu Zhang; Xiaojuan Qi; Bei Yu; Jiaya Jia; |
| 58 | Temporal Sampling for Forgotten Reasoning in LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Inspired by the phenomenon of Temporal Forgetting, we propose Temporal Sampling, a simple decoding strategy that draws outputs from multiple checkpoints along the training trajectory. |
Yuetai Li; Zhangchen Xu; Fengqing Jiang; Bhaskar Ramasubramanian; Luyao Niu; Bill Yuchen Lin; Xiang Yue; Radha Poovendran; |
| 59 | Empirical Analysis of Decoding Biases in Masked Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we reveal that prevalent uncertainty-based decoding strategies induce two decoding biases in MDMs: rigid boundary bias and trivial token bias. |
Pengcheng Huang; Tianming Liu; Zhenghao Liu; Yukun Yan; Shuo Wang; Tong Xiao; Zulong Chen; Maosong Sun; |
| 60 | Writing-RL: Advancing Long-form Writing Via Adaptive Curriculum Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To further advance long-form writing, we present Writing-RL: an Adaptive Curriculum Reinforcement Learning framework to advance long-form writing capabilities beyond SFT. |
Xuanyu Lei; Chenliang Li; Yuning Wu; Kaiming Liu; Weizhou Shen; Peng Li; Ming Yan; Fei Huang; Ya-Qin Zhang; Yang Liu; |
| 61 | Attention Basin: Why Contextual Position Matters in Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Crucially, our analysis further reveals that allocating higher attention to critical information is key to enhancing model performance. Based on these insights, we introduce Attention-Driven Reranking (AttnRank), a two-stage framework that (i) estimates a model’s intrinsic positional attention preferences using a small calibration set, and (ii) reorders retrieved documents or few-shot examples to align the most salient content with these high-attention positions. |
Zihao Yi; Zhenqing Ling; Delong Zeng; Haohao Luo; Zhe Xu; Wei Liu; Jian Luan; Wanxia Cao; Ying Shen; |
| 62 | TPS-Bench: Evaluating AI Agents’ Tool Planning & Scheduling Abilities in Compounding Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Given a broad, heterogeneous tool repository, LLM agents must not only select appropriate tools based on task planning analysis but also strategically schedule the execution order to ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of LLM agents in solving such problems that demand Tool Planning and Scheduling. |
Hanwen Xu; Xuyao Huang; Yuzhe Liu; Zhijie Deng; |
| 63 | Instant Personalized Large Language Model Adaptation Via Hypernetwork Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Profile-to-PEFT, a scalable framework that employs a hypernetwork, trained end-to-end, to map a user’s encoded profile directly to a full set of adapter parameters (e.g., LoRA), eliminating per-user training at deployment. |
Zhaoxuan Tan; Zixuan Zhang; Haoyang Wen; Zheng Li; Rongzhi Zhang; Pei Chen; Fengran Mo; Zheyuan Liu; Qingkai Zeng; Qingyu Yin; Meng Jiang; |
| 64 | UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: UNIKIE-BENCH consists of two complementary tracks: a constrained-category KIE track with scenario-predefined schemas that reflect practical application needs, and an open-category KIE track that extracts any key information that is explicitly present in the document. |
Yifan Ji; Zhipeng Xu; Zhenghao Liu; Zulong Chen; Qian Zhang; Zhibo Yang; Junyang Lin; Yu Gu; Ge Yu; Maosong Sun; |
| 65 | SciPedia: Unlocking The Value of Scientific Data for Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: First, we construct a large-scale raw scientific corpus but identify a critical Learnability Gap, revealing that direct pre-training yields negligible gains. To bridge this, we develop a multi-stage pipeline featuring content cleaning and pedagogical augmentation, resulting in SciPedia, a 900B-token corpus. |
Yiwei Qin; Zhen Huang; Tiantian Mi; Weiye Si; Qipeng Guo; Siyuan Feng; Pengfei Liu; |
| 66 | Stratagem: Learning Transferable Reasoning Via Trajectory-Modulated Game Self-Play Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. |
Xiachong Feng; Deyi Yin; Xiaocheng Feng; Yi Jiang; Libo Qin; Yangfan Ye; Lei Huang; Weitao Ma; Qiming Li; Yuxuan Gu; Bing Qin; Lingpeng Kong; |
| 67 | ImCoref-CeS: An Improved Lightweight Pipeline for Coreference Resolution with LLM-based Checker-Splitter Refinement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose ImCoref-CeS, a novel framework that integrates an enhanced supervised model with LLM-based reasoning. |
Kangyang Luo; Yuzhuo Bai; Shuzheng Si; Cheng Gao; Zhitong Wang; Yingli Shen; Wenhao Li; Zhu Liu; Yufeng Han; Jiayi Wu; Cunliang Kong; Maosong Sun; |
| 68 | SPARKLE: A Structured and Plug-and-play Agentic Retrieval Policy for Adaptive RAG Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing methods either rely on frozen large language models (LLMs) without explicit supervision or require costly LLM finetuning. Therefore, we propose SPARKLE, a structured and plug-and-play agentic retrieval policy where an additional proxy model is introduced to control the retrieval process. |
Jinyuan Fang; Zaiqiao Meng; Craig Macdonald; |
| 69 | Backdoor Collapse: Eliminating Unknown Threats Via Known Backdoor Aggregation In Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose Locphylax, a defense framework that requires no prior knowledge of trigger settings. |
Liang Lin; Miao Yu; Moayad Aloqaily; Zhenhong Zhou; Kun Wang; Linsey Pang; Prakhar Mehrotra; Qingsong Wen; |
| 70 | SafeAgent: Safeguarding LLM Agents Via An Automated Risk Simulator Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, in multi-turn, tool-augmented settings, dynamic user interactions, external tool use, and unintended harmful behaviors make robust safety assurance challenging. To address these challenges, we propose SafeAgent, a framework that improves agent safety through fully automated synthetic data generation. |
Xueyang Zhou; Weidong Wang; Lin Lu; Jiawen Shi; Guiyao Tie; Xu Yongtian; Lixing Chen; Pan Zhou; Neil Zhenqiang Gong; Lichao Sun; |
| 71 | Your Reasoning Model Is Secretly A Reward Model – Optimization-Free Verification from Experience Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we study a third source of information—the model’s hidden states—for binary correctness verification in tasks with a reliable success/failure signal (e.g., deterministic checkers or reference-grounded answers). |
Zhenwen Liang; Ruosen Li; Yujun Zhou; Linfeng Song; Dian Yu; Xinya Du; Haitao Mi; Dong Yu; |
| 72 | Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In such environments, the lack of failure cases causes the advantage signal in group-relative algorithms (e.g., GRPO) to vanish, driving policies into mode collapse. To address this, we propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy enforcing structure-preserving exploration. |
Zhenwen Liang; Yujun Zhou; Sidi Lu; Xiangliang Zhang; Haitao Mi; Dong Yu; |
| 73 | Chunks As Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the effectiveness of such approaches is often limited by the low diversity and factual inconsistencies in the generated data. To address these challenges, we propose LongMab, a novel framework that leverages a Multi-Armed Bandit (MAB) rollout strategy to identify the most informative chunks from the given long context for sampling high-quality and diverse responses and constructing preference data pairs for Direct Preference Optimization (DPO) training. |
Shaohua Duan; Pengcheng Huang; Xinze Li; Zhenghao Liu; Xiaoyuan Yi; Yukun Yan; Shuo Wang; Yu Gu; Ge Yu; Maosong Sun; |
| 74 | CAML: A Conflict-Aware Molecular Language Model Merging Framework for Multi-Constraint Molecular Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing paradigms often struggle with this challenge due to catastrophic forgetting or gradient conflicts. To address this, we propose a conflict-aware molecular language model merging framework (CAML). |
Xuanbai Ren; Luoda Tan; Pei Liu; Tengfei Ma; Xiangzheng Fu; Longyue Wang; Yiping Liu; Xiangxiang Zeng; |
| 75 | How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we conduct an empirical study on how memory management choices impact the LLM agents’ behavior, especially their long-term performance. |
Zidi Xiong; Yuping Lin; Wenya Xie; Pengfei He; Zirui Liu; Jiliang Tang; Himabindu Lakkaraju; Zhen Xiang; |
| 76 | SearchGym: Bootstrapping Real-World Search Agents Via Cost-Effective and High-Fidelity Environment Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This misalignment generates corrupted reward signals that destabilize training by penalizing correct reasoning or rewarding hallucination. To address this, we propose SearchGym, a simulation environment designed to bootstrap robust search agents. |
Xichen Zhang; Ziyi He; Yinghao Zhu; Sitong Wu; Shaozuo Yu; Meng Chu; Wenhu Zhang; Haoru Tan; Jiaya Jia; |
| 77 | Follow The Flow: On Information Flow Across Textual Tokens in Text-to-Image Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we investigate how semantic information is distributed across token representations in text-to-image prompts, analyzing it at two levels: (1) in-item representation—whether individual tokens represent their lexical item (i.e., a word or expression conveying a single concept), and (2) cross-item interaction—whether information flows between tokens of different lexical items. |
Guy Kaplan; Michael Toker; Yuval Reif; Yonatan Belinkov; Roy Schwartz; |
| 78 | Musical Score Understanding Benchmark: Evaluating Large Language Models’ Comprehension of Complete Musical Scores Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. |
Congren Dai; Yue Yang; Krinos Li; Huichi Zhou; Shijie Liang; Zhang Bo; Enyang Liu; Ge Jin; Hongran An; Haosen Zhang; Peiyuan Jing; KinHei Lee; Zhenxuan Zhang; Xiaobing Li; Maosong Sun; |
| 79 | RSMeM: Knowledge-Enhanced Memory Evolution for Remote Sensing Agents with Systematic Evaluation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, these failures are seldom consolidated into a reusable experience for subsequent analyses. To address this issue, we introduce RSMeM, a knowledge-enhanced memory evolution mechanism that bootstraps RS agents with pre-distilled domain knowledge and iteratively integrates online experience for robust multi-step tool execution. |
Bingxian Wu; Yu Zhang; Zonghao Guo; Tang Liu; Chen Qian; Yuxiang Lu; Xingbo Du; Yanghao Li; Yidan Zhang; Chi Chen; Ling Yao; Chenghu Zhou; Maosong Sun; |
| 80 | CheckRLM: Effective Knowledge–Thought Coherence Checking in Retrieval-Augmented Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these chains are prone to containing factual errors, particularly in knowledge-intensive tasks. To address this issue, we propose CheckRLM, a framework that improves the reliability of the reasoning process through Retrieval-Augmented Generation (RAG) by timely checking and correcting factual errors. |
Dingling Xu; Ruobing Wang; Qingfei Zhao; Yukun Yan; Zhichun Wang; Daren Zha; Shi Yu; Zhenghao Liu; Shuo Wang; Xu Han; Maosong Sun; |
| 81 | Explicit Trait Inference for Multi-Agent Coordination Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Explicit Trait Inference (ETI), a psychologically grounded method for improving coordination. |
Suhaib Abdurahman; Etsuko Ishii; Katerina Margatina; Divya Bhargavi; Monica Sunkara; Yi Zhang; |
| 82 | Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with An Automated Examiner Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Full-Duplex-Bench-v2 (FDB-v2), a streaming framework that integrates with an automated examiner that enforces staged goals under two pacing setups (Fast vs. Slow). |
Guan-Ting Lin; Shih-Yun Shan Kuan; Jiatong Shi; Kai-Wei Chang; Siddhant Arora; Shinji Watanabe; Hung-yi Lee; |
| 83 | Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model’s capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs’ reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. |
Xiao Liang; Zhong-Zhi Li; Zhenghao Lin; Eric Hanchen Jiang; Hengyuan Zhang; Yelong Shen; Kai-Wei Chang; Ying Nian Wu; Yeyun Gong; Weizhu Chen; |
| 84 | MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Multimodal Large Language Models (MLLMs) have been increasingly used as automatic evaluators—a paradigm known as MLLM-as-a-Judge. |
Sua Lee; Sanghee Park; Jinbae Im; |
| 85 | Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To remain accurate and effective, models must adapt to newly arriving information on the fly. We introduce Online Adaptation to Continual Knowledge Streams (OAKS) to evaluate this capability, establishing a benchmark for online adaptation over streaming, continually updating knowledge. |
Jiyeon Kim; Hyunji Lee; Dylan Zhou; Sue Hyun Park; Seunghyun Yoon; Trung Bui; Franck Dernoncourt; Sungmin Cha; Minjoon Seo; |
| 86 | EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. |
Mukai Li; Linfeng Song; Zhenwen Liang; Jiahao Xu; Shansan Gong; Qi Liu; Haitao Mi; Dong Yu; |
| 87 | Omni-RewardBench: Toward A Comprehensive Evaluation of Generative Reward Models Across Modalities Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing benchmarks are largely text-centric or limited to bimodal tasks, restricting comprehensive assessment for ORMs. To bridge this gap, we introduce Omni-RewardBench, the first benchmark for comprehensive evaluation of ORMs across modalities. |
Chi-Min Chan; Yujin Zhou; Pengcheng Wen; Boqin Yin; Jiaming Ji; Juntao Dai; Wei Xue; Sirui Han; Yike Guo; |
| 88 | Benchmarking Fine-Grained Error Detection in Multimodal Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the research community currently lacks a dedicated benchmark to rigorously assess the error discernment capabilities of these models. To address this gap, we introduce PRMBench-V, a novel benchmark specifically designed to evaluate MPRMs’ proficiency in detecting erroneous reasoning steps across diverse error categories. |
Chi-Min Chan; Han Zhu; Chunyang Jiang; Jiaming Ji; Juntao Dai; Wei Xue; Sirui Han; Yike Guo; |
| 89 | EvoRoute: Experience-Driven Self-Routing LLM Agent Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We formalize this challenge as the Agent System Trilemma: the inherent tension among achieving state-of-the-art performance, minimizing monetary cost, and ensuring rapid task completion. To dismantle this trilemma, we introduce EvoRoute, a self-evolving model routing paradigm that transcends static, pre-defined model assignments. |
Guibin Zhang; Haiyang Yu; Kaiming Yang; Bingli Wu; Fei Huang; Yongbin Li; Shuicheng Yan; |
| 90 | FinCall-Surprise: A Large Scale Multi-modal Benchmark for Earning Surprise Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, progress in this domain has been constrained by a reliance on expensive, proprietary, and text-only data, limiting the development of advanced models. To address this gap, we introduce FinCall-Surprise (Financial Conference Call for Earning Surprise Prediction), the first large-scale, open-source, and multi-modal dataset for earnings surprise prediction. |
Dong Shu; Yanguang Liu; Huopu Zhang; Mengnan Du; |
| 91 | FinChart-Bench: Benchmarking Financial Chart Comprehension in Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce FinChart-Bench, the first benchmark specifically focused on real-world financial charts. |
Dong Shu; Haoyang Yuan; Yuchen Wang; Yanguang Liu; Huopu Zhang; Mengnan Du; |
| 92 | Breaking The Impasse: Dual-Scale Evolutionary Policy Training for Social Language Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Due to the vast strategy space, language agents frequently converge to homogenized behaviors, leading to deterministic match outcomes that eliminate the gradient signals necessary for policy evolution. To tackle this issue, we propose Dual-scale Evolutionary Policy Training (DEPT) for social language games. |
Minzheng Wang; Run Luo; Yanbo Wang; Zichen Liu; Yuqiao Tan; Tao Tan; Nan Xu; Lu Wang; Wenji Mao; |
| 93 | XToM: Exploring The Multilingual Theory of Mind for Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind—the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. |
Chunkit Chan; Yauwai Yim; Hongchuan Zeng; Zhiying Zou; Xinyuan Cheng; Zhifan Sun; Zheye Deng; Kawai Chung; Yuzhuo Ao; Fan Yixiang; Cheng Jiayang; Ercong Nie; Ginny Wong; Helmut Schmid; Hinrich Schuetze; Simon See; Yangqiu Song; |
| 94 | SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose SAC, a neural speech codec with semantic-acoustic dual-stream quantization. |
Wenxi Chen; Ruiqi Yan; Yushen Chen; Zhikang Niu; Ziyang Ma; Xiquan Li; Yuzhe Liang; Wenhanlin; Shunshun Yin; Ming Tao; Xinsheng Wang; Xie Chen; |
| 95 | Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. |
Shunian Chen; Xinyuan Xie; Zheshu Chen; Owen Lee; Liyan Zhao; Zhan Su; Qilin Sun; Benyou Wang; |
| 96 | The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we systematically investigate verbalized calibration in tool-use agents, revealing a fundamental confidence dichotomy driven by tool type. |
Weihao Xuan; Qingcheng Zeng; Heli Qi; Yunze Xiao; Junjue Wang; Naoto Yokoya; |
| 97 | Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. |
Ziyuan Yang; Wenxuan Ding; Shangbin Feng; Yulia Tsvetkov; |
| 98 | Enhancing The Transferability of Jailbreak Attacks on Large Language Models Via Exploiting Reparameterization Invariance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these attacks suffer from poor cross-model transferability, severely limiting their utility on proprietary ones. To address this limitation, we propose Reparameterization Invariance Gradient-based Jailbreak (RIGJ), a natural gradient based framework designed to improve cross-model transferability. |
Ao Wang; Xinghao Yang; Yongshun Gong; Wei Liu; Bao-di Liu; Weifeng Liu; |
| 99 | Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, real-world spreadsheets are often massive in scale, exceeding the input length that LLMs can efficiently process. To address these challenges, we propose SpreadsheetAgent, a two-stage multi-agent framework for spreadsheet understanding that adopts a step-by-step reading and reasoning paradigm. |
Houxing Ren; Mingjie Zhan; Zimu Lu; Ke Wang; Yunqiao Yang; Haotian Hou; Hongsheng Li; |
| 100 | Human or LLM As Standardized Patients? A Comparative Study in Medical Education Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose EasyMED, a multi-agent VSP framework that separates case-grounded information disclosure from response generation to support stable, inquiry-conditioned patient behavior. |
Bingquan Zhang; Xiaoxiao Liu; Yuchi Wang; Zhou Lei; Qianqian Xie; Benyou Wang; |
| 101 | MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing approaches to Visual Chain-of-Thought (VCoT) are often limited by rigid external tools or fail to generate the high-fidelity, strategically-timed diagrams necessary for complex problem-solving. To bridge this gap, we introduce MathCanvas, a comprehensive framework designed to endow unified Large Multimodal Models (LMMs) with intrinsic VCoT capabilities for mathematics. |
Weikang Shi; Aldrich Yu; Rongyao Fang; Houxing Ren; Ke Wang; Aojun Zhou; Changyao Tian; Xinyu Fu; Yuxuan Hu; Zimu Lu; Linjiang Huang; Si Liu; Rui Liu; Hongsheng Li; |
| 102 | Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent evaluations show that large language models (LLMs) frequently fail to challenge users’ harmful beliefs in domains ranging from medical advice to social reasoning. We present a unifying analysis through the lens of pragmatics: these safety failures can be understood and addressed as LLMs exhibiting excessive accommodation and insufficient epistemic vigilance. |
Myra Cheng; Robert D. Hawkins; Dan Jurafsky; |
| 103 | When Correct Is Not Safe: Can We Trust Functionally Correct Patches Generated By Code Agents? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we reveal a novel type of threat to real-world code-agents: functionally correct yet vulnerable (FCV) patches, which pass all test cases but contain vulnerable code. |
Yibo Peng; James Song; Lei Li; Xinyu Yang; Mihai Christodorescu; Ravi Mangal; Corina S. Pasareanu; Haizhong Zheng; Beidi Chen; |
| 104 | AgentOCR: Reimagining Agent History Via Optical Self-Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce AgentOCR, a framework that exploits visual tokens’ superior information density by representing the accumulated observation-action history as a compact rendered image. |
Lang Feng; Fuchao Yang; Feng Chen; Xin Cheng; Haiyang Xu; Zhenglin Wan; Ming Yan; Bo An; |
| 105 | Social Dynamics As Critical Vulnerabilities That Undermine Objective Decision-Making in LLM Collectives Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Large language model (LLM) agents are increasingly acting as human delegates in multi-agent environments, where a representative agent integrates diverse peer perspectives to make a final decision. |
Changgeon Ko; Jisu Shin; Hoyun Song; Huije Lee; Eui Jun Hwang; Jong C. Park; |
| 106 | Make LLMs See Like Investigators, Not Just Think More: The Role of Structured Analysis in Investigative Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Criminal investigators and intelligence analysts have developed structured analytic techniques to evaluate competing hypotheses under incomplete information. |
Jaewook Lee; Myeong-Cheol Kang; Jong-hun Shin; |
| 107 | Mango: Multi-Agent Web Navigation Via Global-View Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Mango, a multi-agent web navigation method that leverages the website structure to dynamically determine optimal starting points. |
Weixi Tong; Yifeng Di; Tianyi Zhang; |
| 108 | Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we observe that current models are susceptible to reward hacking, leading to a substantial overestimation of a model’s reasoning ability. |
Youliang Yuan; Qiuyang Mang; Jingbang Chen; Hong Wan; Xiaoyuan Liu; Junjielong Xu; Jen-tse Huang; Wenxuan Wang; Wenxiang Jiao; Pinjia He; |
| 109 | Gated Differentiable Working Memory for Long-Context Language Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we reframe test-time adaptation as a budget-constrained memory consolidation problem, asking: given limited computational budget, which parts of the context should be consolidated into working memory? |
Lingrui Mei; Shenghua Liu; Yiwei Wang; Yuyao Ge; Baolong Bi; Jiayu Yao; Jun Wan; Ziling Yin; Jiafeng Guo; Xueqi Cheng; |
| 110 | HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Ideally, LLMs should offer informative responses while avoiding the disclosure of harmful and sensitive information. To address these challenges, we introduce HiddenGuard, a novel framework for fine-grained safe generation in LLMs. |
Lingrui Mei; Shenghua Liu; Yiwei Wang; Baolong Bi; Ruibin Yuan; Xueqi Cheng; |
| 111 | ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs drawn from standard cognitive-science accounts of non-declarative memory: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (Conditioned Stimulus–Unconditioned Stimulus (CS–US) associations shaping first decisions). |
Chonghan Qin; Xiachong Feng; Weitao Ma; Xiaocheng Feng; Lingpeng Kong; |
| 112 | Current Agents Fail to Leverage World Model As Tool for Foresight Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. |
Cheng Qian; Emre Can Acikgoz; Bingxuan Li; Xiusi Chen; Yuji Zhang; Bingxiang He; Qinyu Luo; Gokhan Tur; Dilek Hakkani-Tür; Yunzhu Li; Heng Ji; |
| 113 | CHOIR: Harmonizing Structured Persona Diversity for Robust Collaborative LLM Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose CHOIR (Collaborative Harmonization fOr Inference Robustness), a test-time framework that harmonizes a set of demographically perturbed, persona-conditioned reasoning signals into a unified prediction. |
Xiangjue Dong; Cong Wang; Maria Teleki; Millennium Bismay; Ruihong Huang; James Caverlee; |
| 114 | ReFL: Reflective Feedback Learning for Hallucination Detection of Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing hallucination detection methods either depend on external knowledge sources, incurring high computational costs and limiting real-time applicability, or extract the model’s internal states, leading to poor generalization. To address these issues, this paper proposes ReFL, a hallucination detection framework. |
Cunhang Fan; Jun Zhang; Xue Zhang; Shuai Zhang; Zhao Lv; Jianhua Tao; Zhengqi Wen; |
| 115 | Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we prove that more general heuristics can be parameterized by proposing Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. |
Kailai Yang; Xiao Liu; Lei Ji; Hao Li; Xiao Liang; Zhiwei Liu; Yeyun Gong; Peng Cheng; Mao Yang; |
| 116 | Tailored Primitive Initialization Is The Secret Key to Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we describe thinking token patterns with reasoning primitives and argue that initializing LLMs with diverse, high-quality primitives is crucial for stable and efficient RL training. |
Yihang Yao; Guangtao Zeng; Raina Wu; Yang Zhang; Ding Zhao; Zhang-Wei Hong; Chuang Gan; |
| 117 | From Word to World: Can Large Language Models Be Implicit Text-based World Models? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a three-level framework to evaluate LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. |
Yixia Li; Hongru Wang; Jiahao Qiu; Zhenfei Yin; Dongdong Zhang; Cheng Qian; Zeping Li; Xiaoteng Ma; Guanhua Chen; Heng Ji; |
| 118 | OctoTools: A Multi-Agent Framework with Extensible Tools for Complex Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce OctoTools, a training-free, user-friendly, and easily extensible multi-agent framework designed to tackle complex reasoning across diverse domains. |
Pan Lu; Bowen Chen; Sheng Liu; Rahul Thapa; Joseph Boen; James Zou; |
| 119 | Traffic-R1: Reinforced LLMs Bring Human-Like Reasoning to Traffic Signal Control Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Traffic-R1, a 3B-parameter foundation model with human-like reasoning for traffic signal control (TSC), developed via self-exploration and iterative reinforcement of an LLM with expert guidance in a simulated traffic environment. |
Xingchen Zou; Yuhao Yang; Zheng Chen; Xixuan Hao; Yiqi Chen; Chao Huang; Yuxuan Liang; |
| 120 | SynthAgent: Adapting Web Agents with Synthetic Supervision Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose SynthAgent, a fully synthetic supervision framework that aims at improving synthetic data quality via dual refinement of both tasks and trajectories. |
Zhaoyang Wang; Yiming Liang; Xuchao Zhang; Qianhui Wu; Siwei Han; Anson Bastos; Rujia Wang; Chetan Bansal; Baolin Peng; Jianfeng Gao; Saravan Rajmohan; Huaxiu Yao; |
| 121 | WildReward: Learning Reward Models from In-the-Wild Human Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This raises the question: Can we develop reward models directly from in-the-wild interactions? In this work, we explore this possibility by adopting WildChat as an interaction source and proposing a pipeline to extract reliable human feedback, yielding 186k high-quality instances for training WildReward via ordinal regression directly on user feedback without preference pairs. |
Hao Peng; Yunjia Qi; Xiaozhi Wang; Zijun Yao; Lei Hou; Juanzi Li; |
| 122 | OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. |
Deming Ding; Shichun Liu; Enhui Yang; Jiahang Lin; Ziying Chen; Shihan Dou; Honglin Guo; Weiyu Cheng; Pengyu Zhao; Chengjun Xiao; Qunhong Zeng; Qi Zhang; Xuanjing Huang; Qidi Xu; Tao Gui; |
| 123 | HiChunk: Evaluating and Enhancing Retrieval Augmented Generation with Hierarchical Chunking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper first analyzes why existing RAG evaluation benchmarks are inadequate for assessing document chunking quality, specifically due to evidence sparsity. Based on this conclusion, we propose HiCBench, which includes manually annotated multi-level document chunking points, synthesized evidence-dense question-answer (QA) pairs, and their corresponding evidence sources. |
Wensheng Lu; Keyu Chen; Zhifeng Shen; Ruizhi Qiao; Xing Sun; |
| 124 | ReportLogic: Evaluating Logical Quality in Deep Research Reports Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current evaluation frameworks largely overlook this requirement. To bridge this gap, we introduce ReportLogic, a benchmark that quantifies report-level logical quality through a reader-centric lens of auditability. |
Jujia Zhao; Zhaoxin Huan; Zihan Wang; Xiaolu Zhang; Jun Zhou; Suzan Verberne; Zhaochun Ren; |
| 125 | Chaining The Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents’ reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. |
Jiajie Zhang; Xin Lv; Ling Feng; Lei Hou; Juanzi Li; |
| 126 | MedVerse: Efficient and Reliable Medical Reasoning Via DAG-Structured Parallel Execution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, their sequential autoregressive decoding forces inherently parallel clinical reasoning, such as differential diagnosis, into a single linear reasoning path, limiting both efficiency and reliability for complex medical problems. To address this, we propose MedVerse, a reasoning framework for complex medical inference that reformulates medical reasoning as a parallelizable directed acyclic graph (DAG) process based on Petri Net theory. |
Jianwen Chen; Xinyu Yang; Peng Xia; Arian Azarang; Yueh Z Lee; Gang Li; Hongtu Zhu; Yun Li; Beidi Chen; Huaxiu Yao; |
| 127 | Simulated Students in Tutoring Dialogues: Substance or Illusion? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Surprisingly, little work has been done to ensure or even measure the quality of simulated students. In this work, we formally define the student simulation task, propose a set of evaluation metrics that span linguistic, behavioral, and cognitive aspects, and benchmark a wide range of student simulation methods on these metrics. |
Alexander Scarlatos; Jaewook Lee; Simon Woodhead; Andrew Lan; |
| 128 | Experience-driven Multi-turn Reinforcement Learning for GUI Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Experience-driven Multi-turn Policy Optimization (EMPO), which leverages expert trajectories as environment experiences for on-policy multi-turn training. |
Zhengxi Lu; Jiabo Ye; Fei Tang; Yongliang Shen; Haiyang Xu; Ziwei Zheng; Weiming Lu; Ming Yan; Fei Huang; Jun Xiao; Yueting Zhuang; |
| 129 | RLSeek: Evidence-Grounded Reasoning for RAG Hallucination Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This behaviour contrasts with human verification practices in benchmarks such as RAGTruth, where evidence quotation is a prerequisite for determining hallucinated spans. Motivated by this observation, we propose an evidence-grounded RL framework, namely RLSeek, to explicitly enforce active evidence seeking during CoT reasoning by requiring quotation of relevant source segments at each verification step. |
Zhaoheng Huang; Dacheng Wen; Yutao Zhu; Xiaoying Lian; Yushi Liang; Kai Hao; Nan Li; Liangjie Zhang; Qi Zhang; Ji-Rong Wen; Zhicheng Dou; Fangzhao Wu; |
| 130 | Too Long, Do Re-weighting for Efficient LLM Reasoning Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose Thinking Length Data Re-weighting (TLDR), which does not rely on sophisticated data annotations or interpolation between multiple models. |
Zhong-Zhi Li; Xiao Liang; Zihao Tang; Lei Ji; Peijie Wang; Haotian Xu; Xing W; Haizhen Huang; Weiwei Deng; Yeyun Gong; Zhijiang Guo; Xiao Liu; Fei Yin; Cheng-Lin Liu; |
| 131 | Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, they incur significant memory overhead due to the need to retain all MC samples for the gradient computation of non-linear terms in the RL objective, and thus restrict feasible sample sizes, leading to imprecise likelihood approximations and distorted RL objective. To address this, we propose Boundary-Guided Policy Optimization (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. |
Nianyi Lin; Jiajie Zhang; Lei Hou; Juanzi Li; |
| 132 | MaDS: Long-Horizon GUI Automation Via Synergizing Dual-Layer Memory and Multi-Round Debate Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current methods often struggle to distinguish targets in low Signal-to-Noise Ratio (SNR) environments and lack sufficient pre-execution verification to prevent error accumulation. To address this, we propose the Memory-augmented Debate System (MaDS). |
Pengchen Chen; Shi Chen; Qiming Ye; Xinli Chen; Xinran Li; Wei Xiang; |
| 133 | Retrieval Heads Are Dynamic Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we investigate retrieval heads from a dynamic perspective. |
Yuping Lin; Zitao Li; Yue Xing; Pengfei He; Yingqian Cui; Yaliang Li; Bolin Ding; Jingren Zhou; Jiliang Tang; |
| 134 | MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Knowledge Poisoning Attacks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present MM-PoisonRAG, a framework to systematically study the vulnerability of multimodal RAG under knowledge poisoning. |
Hyeonjeong Ha; Qiusi Zhan; Jeonghwan Kim; Dimitrios Bralios; Saikrishna Sanniboina; Nanyun Peng; Kai-Wei Chang; Daniel Kang; Heng Ji; |
| 135 | SimPBL: A Multi-Agent Framework for Project-Based Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose SimPBL, a multi-agent framework with an orchestrator agent that provides adaptive scaffolding from interaction logs and collaborator agents that support project work through boundary-aware collaboration. |
Daniel Zhang-Li; Joy Jia Yin Lim; Binglin Liu; Shangqing Tu; Zijun Yao; Hao Peng; Jifan Yu; Haoxuan Li; Zhanxin Hao; Ye He; Zekun Li; Jiangyi Wang; Lei Hou; Bin Xu; Xin Cong; Zhiyuan Liu; Huiqin Liu; Yu Zhang; Juanzi Li; |
| 136 | Comparing Human and Language Models Sentence Processing Difficulties on Complex Structures Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. |
Samuel Joseph Amouyal; Aya Meltzer-Asscher; Jonathan Berant; |
| 137 | BaseCal: Unsupervised Confidence Calibration Via Base Model Signals Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work proposes two ways to achieve this. |
Hexiang Tan; Wanli Yang; Junwei Zhang; Xin Chen; Rui Tang; Du Su; Jingang Wang; Yuanzhuo Wang; Fei Sun; Xueqi Cheng; |
| 138 | JanusMM: A Benchmark for Self-Deprecation Understanding in Real-World Multimodal Conversations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although self-deprecation is widespread in real-world conversations, the ability of multimodal large language models (MLLMs) to understand it remains underexplored. To fill this gap, we introduce **JanusMM**, the first benchmark designed to evaluate MLLMs’ understanding of self-deprecation in real-world conversations. |
Xinyi Xu; Bingguang Hao; Yongyi Xiong; Zimo Chen; Xinchen Liu; Hongxin Guo; Xuelong Wang; Silin Zhou; Shihan Dou; |
| 139 | Thinking Beyond The Anthropomorphic Paradigm Benefits LLM Research Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Anthropomorphism, or the attribution of human traits to technology, is an automatic and unconscious response that occurs even in those with advanced technical expertise. In this position paper, we analyze hundreds of thousands of research articles to present empirical evidence of the prevalence and growth of anthropomorphic terminology in research on large language models (LLMs). |
Lujain Ibrahim; Myra Cheng; |
| 140 | NSF-SciFy: Mining The NSF Awards Database for Scientific Claims Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce NSF-SciFy, a comprehensive dataset of scientific claims and investigation proposals extracted from National Science Foundation award abstracts. |
Delip Rao; Weiqiu You; Eric Wong; Chris Callison-Burch; |
| 141 | Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Building on FCaps, we propose CLSP, a contrastive language-speech pre-trained model that integrates global and fine-grained supervision, enabling unified representations across multiple granularities. |
Yifan Yang; Bing Han; Hui Wang; Wei Wang; Ziyang Ma; Long Zhou; Zengrui Jin; Guanrou Yang; Tianrui Wang; Xu Tan; Xie Chen; |
| 142 | Privacy-R1: Privacy-Aware Multi-LLM Agent Collaboration Via Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Prior approaches have relied on static pipelines that use LLM rewriting, which shatters linguistic coherence and indiscriminately removes privacy-sensitive information, including task-critical content. We reformulate this challenge (Privacy-Conscious Delegation) as a sequential decision-making problem and introduce a novel reinforcement learning (RL) framework called Privacy-R1 to solve it. |
Zheng Hui; Yijiang River Dong; Sanhanat Sivapiromrat; Ehsan Shareghi; Nigel Collier; |
| 143 | Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce linear probes trained with a Brier score-based loss to provide calibrated uncertainty estimates from reasoning judges’ hidden states, requiring no additional model training. |
Bhaktipriya Radharapu; Eshika Saxena; Kenneth Li; Chenxi Whitehouse; Adina Williams; Nicola Cancedda; |
| 144 | Global Adaptive Momentum Meets Local Personalized Perturbation: Efficient Federated LLM Fine-Tuning with Zeroth-Order Gradients Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In response, we propose a new federated LLM fine-tuning framework, with a holistic revamped design of the entire ZO gradient processing pipeline. |
Zihan Chen; Howard Hao Yang; Tony Quek; Kai Fong Ernest Chong; |
| 145 | Graph-Based Alternatives to LLMs for Human Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Graph-basEd Models for Human Simulation (GEMS) which formulates close-ended simulation as link prediction on a heterogeneous graph of individuals and choices. |
Joseph Suh; Suhong Moon; Serina Chang; |
| 146 | When Efficiency Meets Safety: A Benchmark Security Analysis of KV Cache Compression in Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: (ii) On the other hand, **Vulnerability Paradox** arises under merging-based compression for human-designed Attacks, where aggressive merging in shallow layers triggers functional head collapse, amplifying attack success rates. To address this, we propose **Safe-CAM**, a history-aware, per-head feedback merging strategy that prevents safety degradation while maintaining efficiency. |
Xiaoxiao Ma; Kuofeng Gao; Zeyi Lu; Wenxi Jiang; Hao Fang; Hao Wu; Bin Chen; Shu-Tao Xia; |
| 147 | Retrievals Can Be Detrimental: Unveiling The Backdoor Vulnerability of Retrieval-Augmented Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose BadRDM, the first poisoning framework targeting RDMs, to systematically investigate their vulnerability to backdoor attacks. |
Hao Fang; Xiaohang Sui; Hongyao Yu; Kuofeng Gao; Jiawei Kong; Sijin Yu; Bin Chen; Shu-Tao Xia; |
| 148 | MT3: A Synergistic Multi-Task RL Framework for Specializing MLLMs in Text Image Machine Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Recent advances in large-scale Reinforcement Learning (RL) have improved reasoning in Large Language Models (LLMs) and Multimodal LLMs (MLLMs), but their application to end-to-end TIMT is still underexplored. To bridge this gap, we introduce MT3, a novel Multi-Task RL framework to specialize MLLMs into end-to-end expert TIMT models. |
Zhaopeng Feng; Yupu Liang; Shaosheng Cao; Jiayuan Su; Jiahan Ren; Zhijie Zhou; Wenxuan Huang; Jian Wu; Zuozhu Liu; |
| 149 | Failure Modes in Multi-Hop QA: The Weakest Link Effect and The Recognition Bottleneck Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Whether these failures stem from an inability to locate evidence (recognition failure) or integrate it (synthesis failure) is unclear. We introduce Multi-Focus Attention Instruction (MFAI), a semantic probe to disentangle these mechanisms by explicitly steering attention towards selected positions. |
Meiru Zhang; Zaiqiao Meng; Nigel Collier; |
| 150 | LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose LoVeC (Long-form Verbalized Confidence), a novel reinforcement learning (RL)–based method that trains LLMs to append an on-the-fly numerical confidence score to each generated statement during long-form generation. |
Caiqi Zhang; Xiaochen Zhu; Chengzu Li; Nigel Collier; Andreas Vlachos; |
| 151 | Beyond The Context Window: Scaling Agentic RL Via End-to-end Optimized Context Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing multi-turn RL pipelines suffer from degraded instruction following, excessive rollout costs, and most importantly, strict context limits. In this work, to address these challenges, we introduce summarization-based context management to training. |
Miao Lu; Weiwei Sun; Weihua Du; Zhan Ling; Xuesong Yao; Kang Liu; Jiecao Chen; |
| 152 | ARGUS: Policy-Adaptive Ad Governance Via Evolving Reinforcement with Adversarial Umpiring Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose ARGUS, a policy-adaptive governance system that enables evolving reinforcement through multi-agent adversarial umpiring. |
Deyi Ji; Junyu Lu; Xuanyi Liu; Liqun Liu; Hailong Zhang; Peng Shu; Huan Yu; Jie Jiang; Tianrun Chen; Lanyun Zhu; |
| 153 | Beyond Static Benchmarks: Synthesizing Harmful Content Via Persona-based Simulation for Robust Evaluation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Static benchmarks for harmful content detection face limitations in scalability and diversity, and may also be affected by contamination from web-scale pre-training corpora. To address these issues, we propose a framework for synthesizing harmful content, leveraging persona-guided large language model (LLM) agents. |
Huije Lee; Jisu Shin; Hoyun Song; Changgeon Ko; Jong C. Park; |
| 154 | From Verbatim to Gist: Distilling Pyramidal Multimodal Memory Via Semantic Information Bottleneck for Long-Horizon Video Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose **MM-Mem**, a pyramidal multimodal memory architecture grounded in *Fuzzy-Trace Theory*. |
Niu Lian; Yuting Wang; Hanshu Yao; Jinpeng Wang; Bin Chen; Yaowei Wang; Min Zhang; Shu-Tao Xia; |
| 155 | VEG: Verbal 𝜖-greedy for Semantic Exploration in Multi-Turn RL Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This lack of diversity biases the model toward premature exploitation, hindering the exploration necessary for optimal learning. To address this, we propose VEG (verbal 𝜖-greedy), a novel framework that leverages external feedback as a dynamic control variable to explicitly balance exploration and exploitation within the semantic space. |
Yongchang Hao; Jie Hao; Yongsheng Mei; Ze Ye; Junyi Chai; Bin Guo; Benjamin Z. Yao; Chenlei Guo; Lili Mou; |
| 156 | SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents Via Reconstructing Vulnerability-Introducing Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing benchmarks have provided valuable insights, but they fail to capture scenarios in which vulnerabilities are actually introduced by human developers, making fair comparisons between humans and agents infeasible. We therefore introduce SecureVibeBench, a benchmark of 105 C/C++ secure coding tasks sourced from 41 projects in OSS-Fuzz for code agents. |
Junkai Chen; Huihui Huang; Yunbo Lyu; Junwen An; Jieke Shi; Chengran Yang; Ting Zhang; Haoye Tian; Yikun Li; Zhenhao Li; Xin Zhou; Xing Hu; David Lo; |
| 157 | Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present GenCluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. |
Mehrzad Samadi; Aleksander Ficek; Sean Narenthiran; Siddhartha Jain; Wasi Uddin Ahmad; Somshubra Majumdar; Vahid Noroozi; Boris Ginsburg; |
| 158 | Act As You Think: Reinforcing Consistent Reasoning in Medical Visual Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This degradation is most pronounced in specialized medical modalities (e.g., Fundus, Ultrasound) where base VLMs lack robust understanding, a failure we attribute to a flawed reward mechanism exacerbated by the scarcity of diverse training data. To tackle this, we introduce Med-Zero-17K, a large-scale dataset spanning over 30 modalities and 24 clinically relevant tasks, and the Multi-Consistency Reward (MCR) framework, which explicitly rewards both perceptual grounding and logical coherence. |
Songtao Jiang; Yuan Wang; Ruizhe Chen; Yan Zhang; Ruilin Luo; Bohan Lei; Yeying Jin; Sibo Song; ZhiBo Yang; Jimeng Sun; Jian Wu; Zuozhu Liu; |
| 159 | Crossing The Reward Bridge: Expanding Reinforcement Learning with Verifiable Rewards Across Diverse Domains Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We find their applicability is surprisingly narrow even in structured domains, a limitation that is compounded at scale: rule-based systems can paradoxically degrade in performance as multi-domain, free-form training data increases. To overcome these challenges, we propose a new RLVR framework that uses a generative verifier to provide soft, probabilistic rewards. |
Yi Su; Dian Yu; Linfeng Song; Juntao Li; Haitao Mi; Zhaopeng Tu; Min Zhang; Dong Yu; |
| 160 | Analyzing and Internalizing Complex Policy Documents for LLM Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our analysis shows that workflow-governing policy specifications are the hardest to reason over, and that SFT on gold trajectories with chain-of-thought is data-hungry and struggles at high complexity. We propose Category-Aware Policy Continued Pretraining, an automated pipeline that analyzes policies, extracts key specifications, categorizes them into factual, behavioral, and conditional types, and isolates those driving workflow complexity. |
Jiateng Liu; Zhenhailong Wang; Xiaojiang Huang; Yingjie Li; Xiang Li; Chenlei Guo; Xing Fan; Ruhi Sarikaya; Heng Ji; |
| 161 | Social Story Frames: Contextual Reasoning About Narrative Intent and Reception Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Yet, computational models of reader response are limited, preventing nuanced analyses. To address this gap, we introduce SocialStoryFrames, a formalism for distilling plausible inferences about reader response, such as perceived author intent, explanatory and predictive reasoning, affective responses, and value judgments, using conversational context and a taxonomy grounded in narrative theory, linguistic pragmatics, and psychology. |
Joel Mire; Maria Antoniak; Steven R Wilson; Zexin Ma; Achyutarama R Ganti; Andrew Piper; Maarten Sap; |
| 162 | ET-Agent: Incentivizing Effective Tool-Integrated Reasoning Agent Via Behavior Calibration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose ET-Agent, a training framework for calibrating an agent’s tool-use behavior through two synergistic perspectives: Self-evolving Data Flywheel and Behavior Calibration Training. |
Yifei Chen; Guanting Dong; Zhicheng Dou; |
| 163 | Analytical FFN-to-MoE Restructuring Via Activation Pattern Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose an analytical post-training framework that rapidly restructures FFNs into sparse MoE architectures using only a small calibration dataset. |
Zehua Pei; Hui-Ling Zhen; Lancheng Zou; Xianzhi Yu; Wulong Liu; Sinno Jialin Pan; Mingxuan Yuan; Bei Yu; |
| 164 | S^4: Operationalizing Speech Act Theory for Strategic Semi-Structured Psychiatric Interview Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce S^4, a comprehensive framework grounded in Speech Act Theory, modeling the interview as a unified process of internal strategy (Illocution and Perlocution) and external realization (Locution). |
Guanqun Bi; Zhoufu Liu; Zhuang Chen; Dazhen Wan; Xiyao Xiao; Minlie Huang; |
| 165 | MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Yet most existing evaluations of LLMs in finance remain text-only, monolingual, and largely saturated by current models. To bridge these gaps, we present MultiFinBen, the first expert-annotated multilingual (five languages) and multimodal (text, vision, audio) benchmark for evaluating LLMs in realistic financial contexts. |
Xueqing Peng; Lingfei Qian; Yan Wang; Ruoyu Xiang; Yueru He; Yang Ren; Mingyang Jiang; Vincent Jim Zhang; Yuqing Guo; Jeff Zhao; Huan He; Yi Han; Yun Feng; Yuechen Jiang; Yupeng Cao; Haohang Li; Yangyang Yu; Xiaoyu Wang; Penglei Gao; Shengyuan Lin; Keyi Wang; Shanshan Yang; Yilun Zhao; Zhiwei Liu; Peng Lu; Jerry Huang; Suyuchen Wang; Triantafillos Papadopoulos; Polydoros Giannouris; Efstathia Soufleri; Nuo Chen; Zhiyang Deng; Heming Fu; Yijia Zhao; Mingquan Lin; Meikang Qiu; Kaleb E Smith; Arman Cohan; Xiao-Yang Liu; Jimin Huang; Guojun Xiong; Alejandro Lopez-Lira; Xi Chen; Junichi Tsujii; Jian-Yun Nie; Sophia Ananiadou; Qianqian Xie; |
| 166 | FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes **Fine**-grained **L**anguage-**A**udio **P**retraining (**FineLAP**), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. |
Xiquan Li; Xuenan Xu; Ziyang Ma; Wenxi Chen; Haolin He; Qiuqiang Kong; Xie Chen; |
| 167 | MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we present MeanAudio, a fast and faithful text-to-audio generator capable of rendering realistic sound with only one function evaluation (1-NFE). |
Xiquan Li; Junxi Liu; Yuzhe Liang; Zhikang Niu; Wenxi Chen; Xie Chen; |
| 168 | Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Hard2Verify, a human-annotated, step-level verification benchmark produced with over 500 hours of human labor. |
Shrey Pandit; Austin Xu; Xuan-Phi Nguyen; Yifei Ming; Caiming Xiong; Shafiq Joty; |
| 169 | Reusable Experiences: Latent Routing and Modular Composition in LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose ReX (Reusable eXperience), an experience-centric adaptation framework that treats latent experiences — recurring reasoning patterns and skills — as fundamental units for LLM specialization. |
Shuai Ling; Lizi Liao; Dongmei Jiang; Weili Guan; |
| 170 | CachePrune: Teaching LLMs What Not to Follow Via KV-Cache Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This vulnerability stems from LLMs’ inability to distinguish between data and instructions within a prompt. We propose CachePrune that defends against this attack by identifying and pruning neurons associated with instruction-following, during KV cache encoding of the prompt context. |
Rui Wang; Junda Wu; Yu Xia; Tong Yu; Ruiyi Zhang; Ryan A. Rossi; Subrata Mitra; Lina Yao; Julian McAuley; |
| 171 | Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present an empirical analysis of attention in dLLMs and show that tokens attending more strongly to revealed context exhibit greater generation stability and play a critical role in reasoning. |
Jia Deng; Junyi Li; Xin Zhao; Jinpeng Wang; Hongyu Lu; Ji-Rong Wen; |
| 172 | In-Context Representation Hijacking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce **Doublespeak**, a simple in-context representation hijacking attack against language models. |
Itay Yona; Amir Sarid; Michael Karasik; Yossi Gandelsman; |
| 173 | ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: During the RL stage, we design a novel multi-view ranking reward tailored to the multi-turn nature of listwise ranking. |
Wenhan Liu; Xinyu Ma; Weiwei Sun; Yutao Zhu; Yuchen Li; Dawei Yin; Zhicheng Dou; |
| 174 | Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. |
Lei Hsiung; Tianyu Pang; Yung-Chen Tang; Linyue Song; Tsung-Yi Ho; Pin-Yu Chen; Yaoqing Yang; |
| 175 | Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non-learnable). |
Jiaxi Bi; Tongxu Luo; Wenyu Du; Zhengyang Tang; Benyou Wang; |
| 176 | Optimizing Length Compression in Large Reasoning Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). |
Zhengxiang Cheng; Dongping Chen; Mingyang Fu; Tianyi Zhou; |
| 177 | RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Understanding research papers remains challenging for foundation models due to specialized scientific discourse and complex figures and tables, yet existing benchmarks offer limited fine-grained evaluation at scale. To address this gap, we introduce RPC-Bench, a large-scale question-answering benchmark built from review–rebuttal exchanges of high-quality computer science papers, containing 15K human-verified QA pairs. |
Yelin Chen; Fanjin Zhang; Suping Sun; Yunhe Pang; Yuanchun Wang; Jian Song; XiaoYan Li; Lei Hou; Shu Zhao; Jie Tang; Juanzi Li; |
| 178 | SciCoQA: Quality Assurance for Scientific Paper–Code Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We construct SciCoQA from GitHub issues and reproducibility papers, and propose a synthetic generation pipeline to scale beyond AI to Physics, Quantitative Biology, and other computational sciences. |
Tim Baumgärtner; Iryna Gurevych; |
| 179 | Saber: Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model in Code Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce efficient **S**ampling with **A**daptive acceleration and **B**acktracking **E**nhanced **R**emasking (i.e., **Saber**), a novel training-free sampling algorithm for DLMs that is the first to improve both inference speed and output quality in code generation. |
Yihong Dong; Zhaoyu Ma; Xue Jiang; Zhiyuan Fan; Jiaru Qian; Yongmin Li; Jianha Xiao; Zhi Jin; Ge Li; |
| 180 | RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Critically, RLVR can lead to the capability boundary collapse, narrowing the LLM’s problem-solving scope. To address this problem, we propose RL-PLUS, a novel hybrid-policy optimization approach for LLMs that synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. |
Yihong Dong; Xue Jiang; Yongding Tao; Huanyu Liu; Kechi Zhang; Lili Mou; Rongyu Cao; Yingwei MA; Jue Chen; Binhua Li; Zhi Jin; Fei Huang; Yongbin Li; Ge Li; |
| 181 | PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our experiments show that current LLM agents perform poorly with high error rates, e.g., Qwen-3-30B-Think has an average error rate of 35%. To address this gap, we propose PEARL, a reinforcement-learning framework that (i) augments the language agent with an external preference memory that stores and updates inferred strategies (e.g., attendee priorities, topic importance, time/location preferences), and (ii) optimizes the agent with round-wise rewards that directly supervise decision correctness, ranking quality, and memory usage across rounds. |
Bingxuan Li; Jeonghwan Kim; Cheng Qian; Xiusi Chen; Eitan Anzenberg; Niran Kundapur; Heng Ji; |
| 182 | CRISP: Persistent Concept Unlearning Via Sparse Autoencoders Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. |
Tomer Ashuach; Dana Arad; Aaron Mueller; Martin Tutek; Yonatan Belinkov; |
| 183 | R^3AG: Retriever Routing for Retrieval-Augmented Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This overlooks a critical distinction in RAG: a retrieved document must not only be relevant but also effectively support the generator in producing correct answers. To address this limitation, we propose R³AG, a novel routing framework that explicitly models the dynamic alignment between queries and retriever capabilities. |
Tong Zhao; Yutao Zhu; Yucheng Tian; Zhicheng Dou; |
| 184 | ATIR: Towards Audio-Text Interleaved Contextual Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce the Audio-Text Interleaved contextual Retrieval (ATIR) task, where queries can alternate between audio and text modalities. |
Tong Zhao; Chenghao Zhang; Yutao Zhu; Zhicheng Dou; |
| 185 | NavA3: Understanding Any Instruction, Navigating Anywhere, Finding Anything Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose NavA3, a hierarchical framework divided into two stages: global and local policies. |
Lingfeng Zhang; Xiaoshuai Hao; Yingbo Tang; Haoxiang Fu; Xinyu Zheng; Pengwei Wang; Zhongyuan Wang; Wenbo Ding; Shanghang Zhang; |
| 186 | Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. |
Zifan Jiang; Youngjoon Jang; Liliane Momeni; Gül Varol; Sarah Ebling; Andrew Zisserman; |
| 187 | Probing for Reading Times Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we probe language model representations for human reading times. |
Eleftheria Tsipidi; Samuel Kiegeland; Francesco Ignazio Re; Tianyang Xu; Mario Giulianelli; Karolina Stanczak; Ryan Cotterell; |
| 188 | Reasoning Gets Harder for LLMs Inside A Dialogue Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Experiments on nine LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. |
Ivan Kartáč; Mateusz Lango; Ondrej Dusek; |
| 189 | MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current approaches face three limitations: causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into bidirectional multimodal embedding models. |
Haonan Chen; Hong Liu; Yuping Luo; Liang Wang; Nan Yang; Furu Wei; Zhicheng Dou; |
| 190 | GLARE: Agentic Reasoning for Legal Judgment Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce GLARE, an agentic legal reasoning framework that enables models to actively retrieve and apply external knowledge during decision-making. |
Xinyu Yang; Chenlong Deng; Zhicheng Dou; |
| 191 | Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing evaluation frameworks typically focus on diagnostic accuracy and win-rates and often overlook alignment with patient-specific goals, values, and personalities required for meaningful conversations. To address this, we introduce MedAgent, a novel framework for synthetically generating realistic, multi-turn mental health sensemaking conversations and use it to create the Mental Health Sensemaking Dialogue (MHSD) dataset, comprising over 2,200 patient–LLM conversations. |
Mohit Chandra; Siddharth Sriraman; Harneet Singh Khanuja; Yiqiao Jin; Munmun De Choudhury; |
| 192 | Beyond Timestamps: Bridging Forward and Backward Reasoning in Temporal Numerical and Relational Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a bi-directional evaluation framework consisting of forward generation via Question Answering (QA) and backward verification via Fact Verification (FV). |
Xinying Qian; Ying Zhang; Xuhui Sui; Yu Zhao; Baohang Zhou; Jeff Z. Pan; |
| 193 | UniversalRAG: Retrieval-Augmented Generation Over Corpora of Diverse Modalities and Granularities Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In contrast, real-world queries vary widely in the type of knowledge they require, which a single type of knowledge source cannot address. To address this, we introduce UniversalRAG, an any-to-any RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. |
Woongyeong Yeo; Kangsan Kim; Soyeong Jeong; Jinheon Baek; Sung Ju Hwang; |
| 194 | Know The Known and The Unknown: Reasonable Answer Generation with Knowledge-Informed Citations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they often overlook key challenges such as citation granularity, the awareness of unknown information, and the adoption of effective training strategies. In this paper, we introduce Knowledge-informed Citation (KFC), which addresses these issues through a novel data construction pipeline, a new benchmark, and an innovative training strategy. |
Yichi Zhang; Zhuo Chen; Lingbing Guo; Jun Xu; Mengshu Sun; Zhizhen Liu; Lei Liang; Wen Zhang; Huajun Chen; |
| 195 | GroupToM-Bench: Benchmarking Group Theory of Mind and Nonlinear Social Emergence in MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present ***GroupToM-Bench***, the first multimodal benchmark for group-level ToM, built around a causal chain spanning micro-level BDI states (belief, desire, intention), meso-level group tension and structural constraints, and macro-level outcome prediction and mechanistic attribution. |
Weidong Tang; Jierui Li; Yueling Hou; Zihan Mei; Can Zhang; Xinyan Wan; Zhiyuan Liang; Pengfei Zhou; Yang You; Wangbo Zhao; |
| 196 | Reinforcement Learning for Diffusion LLMs Via Energy-Based Gibbs Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose **Diffusion-Gibbs Alignment (DGA)**, a novel variational framework that reformulates RL for dLLMs as a distribution matching problem. |
Yijia Fan; Jing Yang; Mingyu Liu; Kaitong Cai; Jian Wang; Keze Wang; Jusheng Zhang; |
| 197 | Putting HUMANS First: Efficient LAM Evaluation with Human Preference Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: 85 correlation with human. To better predict preferences, we trained regression models on these selected subsets, achieving 0. |
Woody Haosheng Gan; William Barr Held; Diyi Yang; |
| 198 | Modular Monolingual Adaptation Using Pretrained Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we hypothesize that full model tuning is often unnecessary and propose a more modular approach. |
Nalin Kumar; Ondrej Dusek; |
| 199 | Test-Time Reasoners Are Strategic Multiple-Choice Test-Takers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In all, we challenge claims that partial-input success is always a flaw, so we discuss how reasoning traces could separate problematic data from less problematic reasoning. |
Nishant Balepur; Atrey Desai; Rachel Rudinger; |
| 200 | MMSearch-R1: Incentivizing LMMs to Search Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. |
Jinming Wu; Zihao Deng; Wei Li; Yiding Liu; Bo You; Bo Li; Zejun MA; Ziwei Liu; |
| 201 | Characterizing The Expressivity of Local Attention in Transformers Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although this restriction is usually motivated by efficiency, it has also been found to improve model quality, a phenomenon that has so far lacked a satisfactory explanation. We provide a formal account of this phenomenon in terms of recognizer expressivity. |
Jiaoda Li; Ryan Cotterell; |
| 202 | InsideOut: Measuring and Mitigating Insider–Outsider Bias in Interview Script Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we identify and systematically investigate LLMs’ **insider-outsider bias**, a phenomenon where models position themselves as insiders of mainstream cultures during generation while externalizing less dominant cultures. |
Yixin Wan; Xingrun Chen; Kai-Wei Chang; |
| 203 | LLM-Generated Text May Harm Your Retrieval! A Robust Detection Strategy for Retrieval-Augmented Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we explore the usage paradigms of LLM text detectors for RAG and highlight key limitations of off-the-shelf or directly fine-tuned detectors. |
Zhaoheng Huang; Yutao Zhu; Ji-Rong Wen; Zhicheng Dou; |
| 204 | How to Train A Real-World Silicon Concierge? Internalizing Complex Business Workflow to Only OneModel Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose OneModel, an applicable paradigm shift from external workflows to internalized knowledge representation. |
Yongqi Tong; Xiaoyun Feng; Lyuxin Xue; Jianshe Li; Xin Zhang; Jiang-Ming Yang; |
| 205 | CE-GPPO: Coordinating Entropy Via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Coordinating Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. |
Zhenpeng Su; Leiyu Pan; Minxuan Lv; Yuntao Li; Wenping Hu; Fuzheng Zhang; Kun Gai; Guorui Zhou; |
| 206 | Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We identify three key barriers: limited scale of audio-text corpora, limited coverage of audio attributes in existing caption corpora, and lack of systematic exploration and evaluation. To fill this gap, we present the first principled empirical study of ALP. |
Wei-Cheng Tseng; Xuanru Zhou; Mingyue Huo; Yiwen Shao; Hao Zhang; Dong Yu; |
| 207 | Challenging The Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: The rapid advancement of large reasoning models has saturated existing math benchmarks, underscoring the urgent need for more challenging evaluation frameworks. To address this, we introduce OlymMATH, a rigorously curated, Olympiad-level math benchmark comprising 350 problems, each with parallel English and Chinese versions. |
Haoxiang Sun; Yingqian Min; Zhipeng Chen; Xin Zhao; Ji-Rong Wen; |
| 208 | LENS: LLM-Enabled Narrative Synthesis for Mental Health By Aligning Multimodal Sensing with Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor–text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. |
Wenxuan Xu; Arvind Pillai; Subigya Nepal; Amanda C. Collins; Daniel M Mackin; Michael V. Heinz; Tess Z Griffin; Nicholas C. Jacobson; Andrew Campbell; |
| 209 | LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: For Bangla, existing work provides valuable resources and models; however, they are mostly single-task (e.g., binary hate/offense) with narrow coverage of key dimensions such as type, severity, and target. We address these gaps by introducing *the first multi-task* Bangla hate-speech dataset, *BanglaMultiHate*, one of the largest manually annotated datasets to date. |
Md Arid Hasan; Firoj Alam; Md Fahad Hossain; Usman Naseem; Syed Ishtiaque Ahmed; |
| 210 | VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. |
Wenyi Xiao; Xinchi XU; Leilei Gan; |
| 211 | TeamFusion: Supporting Open-ended Teamwork with Multi-Agent Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present TeamFusion, a multi-agent system designed to support teamwork in open-ended domains by: 1. |
Jiale Liu; Victor Bursztyn; Lin Ai; Haoliang Wang; Sunav Choudhary; Saayan Mitra; Qingyun Wu; |
| 212 | J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. |
Austin Xu; Yilun Zhou; Xuan-Phi Nguyen; Caiming Xiong; Shafiq Joty; |
| 213 | EfficientLLM: Unified Pruning-Aware Pretraining for Auto-Designed Compact Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Distinguished from direct pretraining, which is bounded by the parameter scaling law, this work proposes unified pruning-aware pretraining, focusing on pretraining compact models while preserving the performance of much larger source models, termed EfficientLLM. |
Xingrun Xing; Zheng Liu; Shitao Xiao; Boyan Gao; Yiming Liang; Haokun Lin; Xianlin Zeng; Guoqi Li; Jiajun Zhang; |
| 214 | KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing Reinforcement Learning (RL) approaches typically rely on outcome-oriented rewards, which can inadvertently reinforce fabricated reasoning paths when the final answer is correct. To address this, we propose **Know**ledge-enhanced **RL**, **KnowRL**, a framework that integrates factual supervision directly into the reasoning process. |
Baochang Ren; Shuofei Qiao; Ningyu Zhang; Da Zheng; Huajun Chen; |
| 215 | KoCo-Bench: Can Large Language Models Leverage Domain Knowledge in Software Development? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we present KOCO-bench, a novel benchmark designed for evaluating domain specialization methods in real-world software development. |
Xue Jiang; Ge Li; Jiaru Qian; Xianjie Shi; Chenjie Li; Hao Zhu; Ziyu Wang; Jielun Zhang; Zeyu Zhao; Kechi Zhang; Jia Li; Wenpin Jiao; Zhi Jin; Yihong Dong; |
| 216 | HERMES: KV Cache As Hierarchical Memory for Efficient Streaming Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. |
Haowei Zhang; Shudong Yang; Jinlan Fu; See-Kiong Ng; Xipeng Qiu; |
| 217 | A Multilingual Social Bias Benchmark Incorporating Thinking Processes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we introduce MBTP, a multilingual social bias benchmark that incorporates human-generated pro- and anti-stereotype reasoning as part of the thinking process, and propose a few-shot meta-evaluation method that enables scalable bias assessment without model fine-tuning. |
Masahiro Kaneko; Danushka Bollegala; Timothy Baldwin; |
| 218 | Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Mem-Gallery features high-quality multi-session conversations grounded in both visual and textual information, with long interaction horizons and rich multimodal dependencies. Building on this dataset, we propose a systematic evaluation framework that assesses key memory capabilities along three functional dimensions: memory extraction and test-time adaptation, memory reasoning, and memory knowledge management. |
Yuanchen Bei; Tianxin Wei; Xuying Ning; Yanjun Zhao; Zhining Liu; Xiao Lin; Yada Zhu; Hendrik Hamann; Jingrui He; Hanghang Tong; |
| 219 | Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Accordingly, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. |
Fangda Ye; Kuicai Dong; Xie Zhifei; Yuxin Hu; Yihang Yin; Shurui Huang; Shikai Dong; Chen Zhang; Jianzhu Bao; Shuicheng Yan; |
| 220 | When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present the first principled framework that joins sycophancy and self-bias to mitigate and quantify identity bias in MAD. |
Hyeong Kyu Choi; Jerry Zhu; Sharon Li; |
| 221 | ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Mode Extraction (ModeX), an evaluator-free Best-of-N selection framework that generalizes majority voting to open-ended text generation by identifying the modal output representing the dominant semantic consensus among generated texts. |
Hyeong Kyu Choi; Sharon Li; |
| 222 | Language Models Don’t Know What You Want: Evaluating Personalization in Deep Research Needs Real Users Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. |
Nishant Balepur; Malachi Hamada; Varsha Kishore; Sergey Feldman; Amanpreet Singh; Pao Siangliulue; Joseph Chee Chang; Eunsol Choi; Jordan Lee Boyd-Graber; Aakanksha Naik; |
| 223 | Efficient Test-Time Scaling of Multi-Step Reasoning By Probing Internal States of Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a lightweight alternative for step-level reasoning verification based on probing the internal states of LLMs. |
Jingwei Ni; Ekaterina Fadeeva; Tianyi Wu; Mubashara Akhtar; Jiaheng Zhang; Elliott Ash; Markus Leippold; Timothy Baldwin; See-Kiong Ng; Artem Shelmanov; Mrinmaya Sachan; |
| 224 | Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Bright-Pro, an evaluation framework that assesses the effectiveness of retrievers in agentic search systems. |
Yilun Zhao; Jinbiao Wei; Tingyu Song; Siyue Zhang; Chen Zhao; Arman Cohan; |
| 225 | RouteMoA: Dynamic Routing Without Pre-Inference Boosts Efficient Mixture-of-Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: They also lack model selection criteria and struggle with large model pools, where full inference is costly and can exceed context limits. To address this, we propose **RouteMoA**, an efficient mixture-of-agents framework with dynamic routing. |
Jize Wang; Han Wu; Zhiyuan You; Yiming Song; Yijun Wang; Zifei Shan; Yining Li; Songyang Zhang; Xinyi Le; Cailian Chen; Xinping Guan; Dacheng Tao; |
| 226 | Native Hybrid Attention for Efficient Sequence Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce Native Hybrid Attention (NHA), a novel hybrid architecture of linear and full attention that integrates both intra- and inter-layer hybridization into a unified layer design. |
Jusen Du; Jiaxi Hu; Zhang Tao; Weigao Sun; Yu Cheng; |
| 227 | Understanding The Behaviors of Environment-aware Information Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To facilitate learning over multi-retrieval-step trajectories, we introduce a branching-based rollout technique that improves training stability. |
Ruifeng Yuan; Chaohao Yuan; David Dai; Yu Rong; Hong Cheng; Hou Pong Chan; Chenghao Xiao; |
| 228 | Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While reinforcement learning can help, progress has been limited by two structural bottlenecks: existing open-source agentic training data are narrow in task variety and easily solved; real-world APIs lack diversity and are unstable for large-scale reinforcement learning rollout processes. We address these challenges with SYNTHAGENT, a framework that jointly synthesizes diverse tool-use training data and simulates complete environments. |
Yuanjie Lyu; Chengyu Wang; Lei Shen; Jun Huang; Tong Xu; |
| 229 | SAGE: A Search-AuGmented Evaluation of Large Language Models on Free-Form QA Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Search-AuGmented Evaluation (SAGE), a framework to assess LLM outputs without fixed ground-truth answers. |
Sher Badshah; Ali Emami; Hassan Sajjad; |
| 230 | Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Recently, automated methods (LLM-as-a-judge) shed light on the scalability, but risk bias by relying on one or a few “authority” models. To tackle these issues, we propose Decentralized Arena, a fully automated framework leveraging collective intelligence from all LLMs to evaluate each other. |
Yanbin Yin; Kun Zhou; Zhen Wang; Xiangdong Zhang; Yifei Shao; Shibo Hao; Yi Gu; Jieyuan Liu; Somanshu Singla; Tianyang Liu; Eric P. Xing; Zhengzhong Liu; Haojian Jin; Zhiting Hu; |
| 231 | CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit Highlight: To exploit this temporal redundancy, we introduce Trace Credit to quantify a token’s decoding potential by accumulating historical evidence. Building on this, we propose CreditDecoding, a training-free parallel decoding method that fuses Trace Credit with current logits to boost the confidence of correct but underconfident tokens, thereby accelerating denoising and improving robustness. |
Kangyu Wang; Zhiyun Jiang; Haibo Feng; Weijia Zhao; Lin Liu; Jianguo Li; Zhenzhong Lan; Weiyao Lin; |
| 232 | A Survey of Large Language Model-Based Search Agents Highlight: This survey provides the first systematic analysis of search agents. |
Yunjia Xi; Jianghao Lin; Yongzhao Xiao; Zheli Zhou; Rong Shan; Te Gao; Jiachen Zhu; Weiwen Liu; Yong Yu; Weinan Zhang; |
| 233 | Enabling Agents to Communicate Entirely in Latent Space Highlight: Further compression not only substantially accelerates inference by up to 24× but also maintains competitive performance through an efficient information-preserving mechanism. We position this work as a feasibility study of entirely latent space inter-agent communication, and our results highlight its potential, offering valuable insights for future research. |
Zhuoyun Du; Runze Wang; Huiyu Bai; Zouying Cao; Xiaoyong Zhu; Yu Cheng; Bo Zheng; Wei Chen; Haochao Ying; |
| 234 | Zero-Shot Detection of LLM-Generated Text Using Temperature Sensitivity Highlight: In this paper, we propose that modulating the decoding temperature and monitoring how the probability distributions respond can better probe the intrinsic discrepancies between two types of text. |
Shixuan Ma; Jiahao Li; Zhendong Mao; Quan Wang; |
| 235 | LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding Highlight: This often results in incomplete evidence retrieval and degraded answer quality for multi-page reasoning tasks. To address these limitations, we propose LAD-RAG, a novel Layout-Aware Dynamic RAG framework. |
Zhivar Sourati; Zheng Wang; Marianne Menglin Liu; Yazhe Hu; Mengqing Guo; Sujeeth Bharadwaj; Kyu J. Han; Tao Sheng; Sujith Ravi; Morteza Dehghani; Dan Roth; |
| 236 | Evolving Agents Highlight: We introduce EVA (Evolving Agents), a novel paradigm for autonomous learning driven by pseudo-symbolic abstraction. |
Leonardo Ranaldi; |
| 237 | Constructing Interpretable Features from Compositional Neuron Groups Highlight: However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. |
Or David Shafran; Atticus Geiger; Mor Geva; |
| 238 | LLM Agents in Law: Taxonomy, Applications, and Challenges Highlight: Recently, LLM agents have attracted significant attention as a solution to these challenges, utilizing advanced capabilities such as planning, memory, and tool usage to meet the rigorous standards of legal practice. In this paper, we present a comprehensive survey of LLM agents for legal tasks, analyzing how these architectures bridge the gap between technical capabilities and domain-specific needs. |
Shuang Liu; Ruijia Zhang; Ruoyun Ma; Yujia Deng; Lanyi Zhu; Jiayu Li; Zelong Li; Zhibin Shen; Mengnan Du; |
| 239 | Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models Highlight: This is the first work to study this question by examining if a stronger model can enhance trustworthiness when fine-tuned on a weaker model’s labels, a paradigm we term weak-to-strong trustworthiness. To address this, we introduce two fundamental fine-tuning strategies that leverage trustworthiness regularization during the fine-tuning of the weak model and the weak-to-strong transfer. |
Lillian Sun; Martin Pawelczyk; Zhenting Qi; Aounon Kumar; Himabindu Lakkaraju; |
| 240 | PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning Highlight: We introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. |
Afra Feyza Akyürek; Advait Gosai; Chen Bo Calvin Zhang; Vipul Gupta; Jaehwan Jeong; Anisha Gunjal; Tahseen Rabbani; Maria Mazzone; David Randolph IV; Mohammad Mahmoudi Meymand; Gurshaan Chattha; Paula Rodriguez; Diego A. Mares Buendia; Pavit Singh; Michael Liu; Subodh Chawla; Peter Cline; Lucy Ogaz; Ernesto Gabriel Hernández Montoya; Zihao Wang; Pavi Bhatter; Marcos Ayestaran; Bing Liu; Yunzhong He; |
| 241 | What Makes AI Research Replicable? Executable Knowledge Graphs As Scientific Knowledge Representations Highlight: Furthermore, previous approaches tend to overlook valuable implementation-level code signals and lack structured knowledge representations that support multi-granular retrieval and reuse. To overcome these challenges, we propose Executable Knowledge Graphs (xKG), a pluggable, paper-centric knowledge base that automatically integrates code snippets and technical insights extracted from scientific literature. |
Yujie Luo; Zhuoyun Yu; Xuehai Wang; Yuqi Zhu; Ningyu Zhang; Lanning Wei; Lun Du; Da Zheng; Huajun Chen; |
| 242 | PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records Highlight: In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agent (**PersonalAlign**), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted preferences in vague instructions and anticipate latent routines by user state for proactive assistance. |
Yibo Lyu; Gongwei Chen; Rui Shao; Weili Guan; Liqiang Nie; |
| 243 | Is Human-Like Text Liked By Humans? Multilingual Human Detection and Preference Against AI Highlight: Prior studies have shown that distinguishing text generated by Large Language Models (LLMs) from human-written text is highly challenging for humans, and often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. |
Yuxia Wang; Rui Xing; Jonibek Mansurov; Giovanni Puccetti; Zhuohan Xie; Minh Ngoc Ta; Jiahui Geng; Jinyan Su; Mervat Abassy; Saadeldine Eletter; Kareem Elozeiri; Nurkhan Laiyk; Maiya Goloburda; Tarek Mahmoud; Raj Vardhan Tomar; Alexander Aziz; Ryuto Koike; Masahiro Kaneko; Artem Shelmanov; Ekaterina Artemova; Vladislav Mikhailov; Akim Tsvigun; Alham Fikri Aji; Nizar Habash; Iryna Gurevych; Preslav Nakov; |
| 244 | AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use Highlight: In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. |
Yuanjie Lyu; Chengyu Wang; Haonan Zheng; Yuanhao Yue; Junbing Yan; Ming Wang; Jun Huang; |
| 245 | Revisiting Model Interpolation for Efficient Reasoning Highlight: In this paper, we systematically revisit the simplest merging method that interpolates two weights directly. |
Taiqiang Wu; Runming Yang; Tao Liu; Jiahao Wang; Ngai Wong; |
| 246 | Efficient KL Divergence Estimation Via Truncated Top-K Integration for Large Language Models Highlight: We propose TIKE (Top-k Importance-weighted KL Estimator), which exploits the Zipfian structure of language model distributions: by deterministically integrating over only the top-k tokens, TIKE captures most of the probability mass while effectively reducing memory cost. |
Xinyuan Wang; Luozhijie Jin; Bo Wang; Yuan Li; Zhangyue Yin; Xipeng Qiu; |
| 247 | Are We Using The Right Benchmark: An Evaluation Framework for Visual Token Compression Methods Highlight: In this work, we uncover a counterintuitive yet consistent phenomenon: simple image downsampling outperforms many advanced visual token compression methods across multiple widely used benchmarks. |
Chenfei Liao; Wensong Wang; Zichen Wen; Xu Zheng; Yiyu Wang; Haocong He; Yuanhuiyi Lyu; Lutao Jiang; Xin Zou; Yuqian Fu; Bin Ren; Linfeng Zhang; Xuming Hu; |
| 248 | Deriving Character Logic from Storyline As Codified Decision Trees Highlight: We propose Codified Decision Trees (CDT), a data-driven framework that induces an executable and interpretable decision structure from large-scale narrative data. |
Letian Peng; Kun Zhou; Longfei Yun; Yupeng Hou; Jingbo Shang; |
| 249 | NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning Via Mixture-of-Neurons Highlight: To bridge these gaps, we conduct an in-depth white-box analysis, identifying key neurons (Mixture of Neurons, MoN) and their fluctuation patterns associated with distinct failures. Building upon these insights, we propose NeuReasoner, an explainable, controllable, and unified reasoning framework driven by MoN. |
Haonan Dong; Kehan Jiang; Haoran Ye; Wenhao Zhu; Zhaolu Kang; Guojie Song; |
| 250 | D-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models Highlight: However, existing dLLM policy optimization methods suffer from two critical reliability bottlenecks: (1) reward sparsity, arising from coarse or unverifiable signals that impede accurate advantage calculation; and (2) their probability estimates do not account for the gap to the unbiased expectation over all decoding orders, which are intractable to compute. To mitigate these issues, we propose d-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. |
Leyi Pan; Shuchang Tao; Yunpeng Zhai; Zheyu Fu; Liancheng Fang; Minghua He; Lingzhe Zhang; Zhaoyang Liu; Bolin Ding; Aiwei Liu; Lijie Wen; |
| 251 | MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation Highlight: We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass. |
Parker Riley; Daniel Deutsch; Mara Finkelstein; Colten DiIanni; Juraj Juraska; Markus Freitag; |
| 252 | SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression Highlight: We propose SARA, a hybrid RAG framework that targets answer quality under fixed token budgets by combining natural-language snippets with semantic compression vectors. |
Yiqiao Jin; Kartik Sharma; Vineeth Rakesh; Yingtong Dou; Menghai Pan; Mahashweta Das; Srijan Kumar; |
| 253 | SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding Highlight: We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. |
Yiqiao Jin; Rachneet Kaur; Zhen Zeng; Sumitra Ganesh; Srijan Kumar; |
| 254 | Logical Phase Transitions: Understanding Collapse in LLM Logical Reasoning Highlight: In this study, we present a systematic analysis of logical reasoning under controlled increases in logical complexity, and reveal a previously unrecognized phenomenon, which we term **Logical Phase Transitions**: rather than degrading smoothly, logical reasoning performance remains stable within a regime but collapses abruptly beyond a critical logical depth, mirroring physical phase transitions such as water freezing beyond a critical temperature threshold. |
Xinglang Zhang; Yunyao Zhang; ZeLiang Chen; Junqing Yu; Wei Yang; Zikai Song; |
| 255 | MemRec: Collaborative Memory-Augmented Agentic Recommender System Highlight: Yet, naively utilizing collaborative memory causes severe context overload and introduces noise to downstream LLMs, alongside prohibitive computational costs. To resolve this, we propose MemRec, a framework that architecturally decouples memory management from reasoning. |
Weixin Chen; Yuhan Zhao; Jingyuan Huang; Zihe Ye; Mingxuan Ju; Tong Zhao; Neil Shah; Li Chen; Yongfeng Zhang; |
| 256 | A Survey of Multimodal Mathematical Reasoning: From Perception, Alignment to Reasoning Highlight: A growing body of recent research addresses these issues by integrating structured perception, explicit alignment, and verifiable reasoning within unified frameworks. To establish a clear roadmap for understanding and comparing different MMR approaches, we systematically review them around four fundamental questions: (1) What to extract from multimodal inputs, (2) How to represent and align textual and visual information, (3) How to perform the reasoning, and (4) How to evaluate the correctness of the overall reasoning process. |
Tianyu Yang; Sihong Wu; Yilun Zhao; Zhenwen Liang; Lisen Dai; Chen Zhao; Minhao Cheng; Arman Cohan; Xiangliang Zhang; |
| 257 | UI-Copilot: Advancing Long-Horizon GUI Automation Via Tool-Integrated Policy Optimization Highlight: However, long-horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities, suffering from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework where the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. |
Zhengxi Lu; Fei Tang; Guangyi Liu; Jin Ma; Kaitao Song; Xu Tan; Wenqi Zhang; Weiming Lu; Jun Xiao; Yueting Zhuang; Yongliang Shen; |
| 258 | DARM: Distribution-Aware Reward Modeling By Alleviating Biases from Low Preference-Context Dependency Data Highlight: We show that standard RM training is vulnerable in data subsets where response quality depends only weakly on the context: such instances encourage the RM to ignore the context, leading to context neglect and degraded accuracy. To address this failure mode, we propose Distribution-Aware Reward Modeling (DARM), which augments the RM objective with a conditional mutual information regularizer that maximizes the mutual information between the context and the predicted reward, conditioned on the response. |
Shaofan Liu; Guoqiang Zhang; Shihan Dou; Huiyuan Zheng; Yiming Zhou; Junjie Ye; Shaowen Wang; Shichun Liu; Jiazheng Zhang; Tao Gui; Qi Zhang; Xuanjing Huang; |
| 259 | Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective Highlight: We propose a hierarchical planning framework that analyzes web agents across three layers (i.e., high-level planning, low-level execution, and re-planning), enabling process-based evaluation of reasoning, grounding, and recovery. |
Mohamed Aghzal; Gregory J. Stein; Ziyu Yao; |
| 260 | GTA: Generating Long-horizon Tasks for Web Agents at Scale Highlight: We introduce a scalable framework GTA that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to produce realistic tasks paired with executable trajectories. |
Tenghao Huang; Kung-Hsiang Huang; Prafulla Kumar Choubey; Yilun Zhou; Muhao Chen; Jonathan May; Chien-Sheng Wu; |
| 261 | LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models Highlight: Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting, critical issues that obscure true model capabilities. To address this, we introduce LLMEval-Fair, a framework for dynamic evaluation of LLMs. |
Ming Zhang; Yujiong Shen; Jingyi Deng; Yuhui Wang; Huayu Sha; Kexin Tan; Qiyuan Peng; Yue Zhang; Junzhe Wang; Shichun Liu; Yueyuan Huang; Jingqi Tong; Changhao Jiang; Yilong Wu; Zhihao Zhang; Mingqi Wu; Mingxu Chai; Zhiheng Xi; Shihan Dou; Tao Gui; Qi Zhang; Xuanjing Huang; |
| 262 | Evo-Attacker: Memory-Augmented Reinforcement Learning for Long-Horizon Tool Attacks on LLM-MAS Highlight: Existing tool attacks are limited by domain specificity or fixed and static templates. To address these challenges, we propose Evo-Attacker, which formulates the tool attack as a self-evolving, memory-augmented reinforcement learning process. |
Bingyu Yan; Xiaoming Zhang; JinYu Hou; Chaozhuo Li; Ziyi Zhou; Yiming Hei; Litian Zhang; |
| 263 | AttnPO: Attention-Guided Process Supervision for Efficient Reasoning Highlight: Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose AttnPO, a low-overhead process-supervised RL framework that leverages the model’s intrinsic attention signals for step-level credit assignment. |
Shuaiyi Nie; Dingsiyu; Wenyuan Zhang; Linhao Yu; Tianmeng Yang; Yao Chen; Weichong Yin; Yu Sun; Hua Wu; Tingwen Liu; |
| 264 | Multi-Agent-as-Judge: Aligning LLM-Agent-Based Automated Evaluation with Multi-Dimensional Human Evaluation Highlight: Yet, to date, existing LLM-as-a-judge approaches face two limitations: persona descriptions of agents are often arbitrarily designed, and the frameworks are not generalizable to other tasks. To address these challenges, we propose MAJ-EVAL, a Multi-Agent-as-Judge evaluation framework that can automatically construct multiple evaluator personas with distinct dimensions from relevant text documents (e.g., research papers), instantiate LLM agents with the personas, and engage in-group debates with multi-agents to generate multi-dimensional feedback. |
Jiaju Chen; Yuxuan Lu; Xiaojie Wang; Huimin Zeng; Jing Huang; Jiri Gesi; Ying Xu; Bingsheng Yao; Dakuo Wang; |
| 265 | Mitigating Selection Bias in Large Language Models Via Permutation-Aware GRPO Highlight: Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. |
Jinquan Zheng; Jia Yuan; Jiacheng Yao; Chenyang Gu; Pujun Zheng; Guoxiu He; |
| 266 | Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark Highlight: Unified multimodal models aim to jointly enable visual understanding and generation, yet current benchmarks rarely examine their true integration. |
Kai Zou; Ziqi Huang; Yuhao Dong; Shulin Tian; Dian Zheng; Hongbo Liu; Jingwen He; Bin Liu; Yu Qiao; Ziwei Liu; |
| 267 | PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering Highlight: This leads to assigning positive rewards to correct answers produced from incorrect derivations. To bridge this gap, we introduce **PRIME**, a benchmark for evaluating verifiers on **PR**ocess-outcome alignment verification **I**n **M**athematics and **E**ngineering. |
Xiangfeng Wang; Hangyu Guo; Yanlin Lai; Mitt Huang; Liang Zhao; Chengyuan Yao; Yinmin Zhang; Qi Han; Xiaoxiaoren; Chun Yuan; Tong Xu; Zheng Ge; Xiangyu Zhang; Daxin Jiang; |
| 268 | LOKA: Conflict-Aware LLM Knowledge Update with Adaptive Knowledge Memory Highlight: In this paper, we investigate the problem of LLM knowledge updates, which requires simultaneously unlearning unwanted information and learning new knowledge. |
Binchi Zhang; Zhengzhang Chen; Zaiyi Zheng; Jundong Li; Haifeng Chen; |
| 269 | Closing The Modality Reasoning Gap for Speech Large Language Models Highlight: This gap could be associated with representational drift across Transformer layers and behavior deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. |
Chaoren Wang; Heng Lu; Xueyao Zhang; Shujie Liu; Yan Lu; Jinyu Li; Zhizheng Wu; |
| 270 | MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools Highlight: Existing MCP research covers few servers, depends on costly manual curation, and lacks training support, hindering progress toward real-world deployment. To overcome these limitations, we introduce MCP-Flow, an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training. |
WenHao Wang; Peizhi Niu; Zhao Xu; Zhaoyu Chen; Jian Du; Yaxin Du; Xianghe Pang; Keduan Huang; Yanfeng Wang; Qiang Yan; Siheng Chen; |
| 271 | Efficient Self-Evaluation for Diffusion Language Models Via Sequence Regeneration Highlight: In this work, we propose DiSE, a simple yet effective self-evaluation confidence quantification method for dLLMs. |
Linhao Zhong; Linyu Wu; Wen Wang; Yuling Xi; Chenchen Jing; Jiaheng Zhang; Hao Chen; Chunhua Shen; |
| 272 | Beyond Hard Masks: Progressive Token Evolution for Diffusion Language Models Highlight: In this paper, we propose EvoToken-DLM, a novel diffusion-based language modeling approach that replaces hard binary masks with evolving soft token distributions. |
Linhao Zhong; Linyu Wu; Bozhen Fang; Tianjian Feng; Chenchen Jing; Wen Wang; Jiaheng Zhang; Hao Chen; Chunhua Shen; |
| 273 | Glyph: Scaling Context Windows Via Visual-Text Compression Highlight: To this end, we introduce Glyph, a framework that renders long texts into compact visual pages and processes them with a vision-language model (VLM), allowing a fixed context window to cover substantially more text. |
Jiale Cheng; Yusen Liu; Xinyu Zhang; Yulin Fei; Wenyi Hong; Ruiliang Lyu; Weihan Wang; Zhe Su; Xiaotao Gu; Xiao Liu; Yushi Bai; Jie Tang; Hongning Wang; Minlie Huang; |
| 274 | Scaling Laws for Code: A More Data-Hungry Regime Highlight: Given the fundamental differences like strict syntax between code and NL, it is unclear whether these laws are directly applicable to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0. |
Xianzhen Luo; Wenzhen Zheng; Qingfu Zhu; Rongyi Zhang; Houyi Li; Siming Huang; YuanTao Fan; Wanxiang Che; |
| 275 | When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models Highlight: In this work, we investigate the mechanisms that VLMs use to resolve cross-modal conflicts by introducing WHOOPS-AHA! |
Francesco Ortu; Zhijing Jin; Diego Doimo; Alberto Cazzaniga; |
| 276 | LLM-as-Scheduler: Agentic Workflow Dynamic Scheduling Highlight: In practice, many queries do not need such heavy processing and can be handled well by a single strong agent. To address this inefficiency, we propose LLM-as-Scheduler (LAS), a system that dynamically chooses the right workflow for each query. |
Dawei Xiang; Kexin Chu; Wenyan Xu; Wenhui Zhang; Wei Zhang; |
| 277 | Guided By Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence Highlight: This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that enables LLMs to perform step-by-step reasoning at a low cost, without any reward models or verifiers. |
Amirhosein Ghasemabadi; Keith G. Mills; Baochun Li; Di Niu; |
| 278 | Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads By Excluding Inert Tokens to Mitigate Hallucination Highlight: In this work, we analyze the attention mechanisms via the logit lens, uncovering a distinct anomaly we term **Vocabulary Hijacking**. |
Yangneng Chen; Junlin Li; Weijun Yao; Xilai Ma; Guodong DU; Wenya Wang; Jing Li; |
| 279 | VRPO: Rethinking Value Modeling for Robust RL Under Noisy Supervision in LLM Post-Training Highlight: To better optimize noisy supervision, we propose VRPO, a framework that enhances value modeling for robust RL in LLM post-training. |
Dingwei Zhu; Shihan Dou; Zhiheng Xi; Senjie Jin; Guoqiang Zhang; Jiazheng Zhang; Junjie Ye; Mingxu Chai; Enyu Zhou; Ming Zhang; Yuhui Wang; Caishuang Huang; Chenhao Huang; Yunke Zhang; Yuran Wang; Tao Gui; Qi Zhang; Xipeng Qiu; Xuanjing Huang; |
| 280 | CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems Highlight: We introduce CLARITY, a framework for automatically generating an NL2SQL benchmark with multi-faceted ambiguities and diverse user behaviors across both single- and multi-turn settings. |
Tabinda Sarwar; Farhad Moghimifar; Cong Duy Vu Hoang; Xiaoxiao Ma; Shawn Chang Xu; Fahimeh Saleh; Poorya Zaremoodi; Avirup Sil; Katrin Kirchhoff; |
| 281 | Bidirectional LMs Are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection Highlight: In this paper, we introduce a novel, real-world and large-scale knowledge injection benchmark that continuously evolves without human intervention. |
Yuwei Zhang; Wenhao Yu; Shangbin Feng; Yifan Zhu; Letian Peng; Jayanth Srinivasa; Gaowen Liu; Jingbo Shang; |
| 282 | Agentic Oversight Via Dialectic Reasoning Highlight: To make oversight grounded and scale as capabilities extend, we introduce an Agentic Oversight framework. |
Leonardo Ranaldi; Federico Ranaldi; |
| 283 | Sparse Feature Coactivation Reveals Causal Semantic Modules in Large Language Models Highlight: We identify semantically coherent, context-consistent network components in large language models (LLMs) using coactivation of sparse autoencoder (SAE) features collected from just a handful of prompts. |
Ruixuan Deng; Xiaoyang Hu; Miles Gilberti; Shane Storks; Aman Taxali; Mike Angstadt; Chandra Sripada; Joyce Chai; |
| 284 | Reliable Use of Lemmas Via Eligibility Reasoning and Section-Aware Reinforcement Learning Highlight: We formalize lemma-judging as a structured prediction task: given a statement and a candidate lemma, the model must output a precondition check and a conclusion-utility check, from which a usefulness decision is derived. We present RULES, which encodes this specification via a two-section output and trains with reinforcement learning plus section-aware loss masking to assign penalty to the section responsible for errors. |
Zhikun Xu; Xiaodong Yu; Ben Zhou; Jiang Liu; Jialian Wu; Ze Wang; Ximeng Sun; Hao Chen; Zicheng Liu; |
| 285 | SLR: Automated Synthesis for Scalable Logical Reasoning Highlight: We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. |
Lukas Helff; Ahmad Omar; Felix Friedrich; Antonia Wüst; Hikaru Shindo; Rupert Mitchell; Tim Woydt; Patrick Schramowski; Wolfgang Stammer; Kristian Kersting; |
| 286 | ReCode: Reinforcing Code Generation with Reasoning-Process Rewards Highlight: This work proposes ReCode (Reasoning-Reinforced Code Generation), a novel RL training framework comprising: (1) Contrastive Reasoning-Process Reward Learning (CRPL), which trains a reward model with synthesized optimized and degraded reasoning variants to assess the quality of reasoning process; and (2) Consistency-Gated GRPO (CG-GRPO), which integrates the reasoning-process reward model into RL by gating neural reasoning-process rewards with strict execution outcomes, using execution correctness as a hard gate to mitigate reward hacking. |
Lishui Fan; Yu Zhang; Mouxiang Chen; Zhongxin Liu; |
| 287 | Does RLVR Extend Reasoning Boundaries? Investigating Capability Expansion in Vision-Language Models Highlight: To examine the impact of RLVR on the capability boundaries of Vision-Language Models (VLMs), we introduce Ariadne, a controlled framework based on synthetic maze navigation where the reasoning difficulty is precisely regulated by path length and the number of turns. |
Minghe Shen; Zhuo Zhi; Chonghan Liu; Shuo Xing; Zhengzhong Tu; Che Liu; |
| 288 | COMPASS: Enhancing Agent Long-Horizon Reasoning with Evolving Context Highlight: We identify context management as the central bottleneck: extended histories cause agents to overlook critical evidence or become distracted by irrelevant information, thus failing to replan or reflect on previous mistakes. To address this, we propose COMPASS (Context-Organized Multi-Agent Planning and Strategy System), a lightweight hierarchical framework that separates tactical execution, strategic oversight, and context organization into three specialized components: (1) a Main Agent that performs reasoning and tool use, (2) a Meta-Thinker that monitors progress and issues strategic interventions, and (3) a Context Manager that maintains concise, relevant progress briefs for different reasoning stages. |
Guangya Wan; Mingyang Ling; Xiaoqi Ren; Rujun Han; Sheng Li; Zizhao Zhang; |
| 289 | ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval Highlight: In this paper, we introduce **ReasonEmbed**, a novel text embedding model developed for reasoning-intensive document retrieval. |
Jianlyu Chen; Junwei Lan; Chaofan Li; Defu Lian; Zheng Liu; |
| 290 | Factual Retrieval in LLMs Is A Redundant, Distributed and Non-Contiguous Process Highlight: Large language models (LLMs) store and recall factual knowledge, yet the precise mechanism of how entity representations are transformed to enable specific attribute retrieval remains underexplored. In this work, we investigate this mechanism through the lens of an “attribute-computation path”: a sequence of computational steps over the entity representation required to elicit a target attribute. |
Hail Hochman; Natalie Shapira; Yoav Goldberg; |
| 291 | Counteracting The Matthew Effect in Self-Improvement of LVLMs Through Head-Tail Re-balancing Highlight: Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision–language models (LVLMs), where models explore and learn from successful trajectories iteratively. |
Xin Guo; Zhiheng Xi; Yiwen Ding; Yitao Zhai; Xiaowei Shi; Xunliang Cai; Tao Gui; Qi Zhang; Xuanjing Huang; |
| 292 | Compatibility-Aware Dynamic Fine-Tuning for Large Language Models Highlight: We introduce Compatibility-Aware Dynamic Fine-Tuning (CADFT), a principled extension of DFT that controls sample-level optimization variance. |
Yucheng Zhou; Junwei Sheng; Qianning Wang; Jianbing Shen; |
| 293 | Multimodal Large Language Models for Multi-Subject In-Context Image Generation Highlight: To overcome the data scarcity, we introduce an automatic and scalable data generation pipeline that eliminates the need for manual annotation. |
Yucheng Zhou; Dubing Chen; Huan Zheng; Jianbing Shen; |
| 294 | The Side Effects of Being Smart: Safety Risks in MLLMs’ Multi-Image Reasoning Highlight: As Multimodal Large Language Models (MLLMs) acquire stronger reasoning capabilities to handle complex, multi-image instructions, this advancement may pose new safety risks. We study this problem by introducing MIR-SafetyBench, the first benchmark focused on multi-image reasoning safety, which consists of 2,676 instances across a taxonomy of 9 multi-image relations. |
Renmiao Chen; Yida Lu; Shiyao Cui; Xuan Ouyang; Victor Shea-Jay Huang; Shumin Zhang; Chengwei Pan; Han Qiu; Minlie Huang; |
| 295 | The Pitfalls of KV Cache Compression Highlight: In this paper, we identify several pitfalls that practitioners should be aware of when deploying KV cache compressed LLMs. |
Alex Chen; Renato Geh; Aditya Grover; Guy Van Den Broeck; Daniel Mingyi Israel; |
| 296 | Visually-Guided Policy Optimization for Multimodal Reasoning Highlight: More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. |
Zengbin Wang; Feng Xiong; Liang Lin; Xuecai Hu; Yong Wang; Yanlin Wang; Man Zhang; Xiangxiang Chu; |
| 297 | Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment Highlight: Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model’s current behavior but overlooking more informative ones. Addressing this, we propose Rank–Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. |
Yuming Yang; Mingyoung Lai; Wanxu Zhao; Xiaoran Fan; Zhiheng Xi; Mingqi Wu; Chiyue Huang; Jun Zhao; Haijun Lv; Jian Tong; Yunhua Zhou; Yicheng Zou; Qipeng Guo; Tao Gui; Qi Zhang; Xuanjing Huang; |
| 298 | Protecting Bystander Privacy Via Selective Hearing in Audio LLMs Highlight: We introduce SH-Bench, the first benchmark designed to evaluate selective hearing: a model’s ability to attend to an intended main speaker while refusing to process or reveal information about incidental bystander speech. |
Xiao Zhan; Guangzhi Sun; Jose Such; Phil Woodland; |
| 299 | VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval Highlight: Ablation studies show compatibility with different T2I instruction LLMs, T2I generation models, and downstream LLMs. |
Di Wu; Yixin Wan; Kai-Wei Chang; |
| 300 | Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data Highlight: In this work, we take shopping as a case study and present the first large-scale quantitative evaluation of state-of-the-art LLMs’ ability to accurately simulate human behavior. |
Yuxuan Lu; Jing Huang; Yan Han; Bingsheng Yao; Sisong Bei; Yaochen Xie; Yisi Sang; Qi He; Dakuo Wang; |
| 301 | Nirvana: A Specialized Generalist Model With Task-Aware Memory Mechanism Highlight: In this work, we introduce Nirvana, an SGM featuring specialized memory, linear-time complexity, and test-time task information extraction. |
Yuhua Jiang; Shuang Cheng; Yihao Liu; Ermo Hua; Che Jiang; Weigao Sun; Yu Cheng; Feifei Gao; Biqing Qi; Bowen Zhou; |
| 302 | ACIArena: Toward Unified Evaluation for Agent Cascading Injection Highlight: However, existing studies consider only limited attack strategies and simplified MAS settings, limiting their generalizability and comprehensive evaluation. To bridge this gap, we introduce ACIArena, a unified framework for evaluating the robustness of MAS. |
Hengyu An; Minxi Li; Jinghuai Zhang; Naen Xu; Chunyi Zhou; Changjiang Li; Xiaogang Xu; Tianyu Du; Shouling Ji; |
| 303 | SDAR-VL: Stable and Efficient Block-wise Diffusion for Vision-Language Understanding Highlight: We present SDAR-VL, the first systematic application of block-wise discrete diffusion to large-scale vision-language understanding (VLU), together with an integrated framework for efficient and stable training. |
Shuang Cheng; Yuhua Jiang; Zineng Zhou; Dawei Liu; Tao Wang; Linfeng Zhang; Biqing Qi; Bowen Zhou; |
| 304 | SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes Highlight: Existing preference-based approaches, typically relying on textual perturbations or answer-conditioned rationales, fail to address this challenge as they allow models to exploit language priors to bypass visual grounding. To address this, we propose SceneAlign, a framework that leverages scene graphs as structured visual information to perform controllable structural interventions. |
Chuhan Wang; Xintong Li; Jennifer Yuntong Zhang; Junda Wu; Chengkai Huang; Lina Yao; Julian McAuley; Jingbo Shang; |
| 305 | Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations? Highlight: We examine whether large language models (LLMs) exhibit similar behaviors when assigned high- or low-status personas. |
Anvesh Rao Vijjini; Sagar B. Manjunath; Snigdha Chaturvedi; |
| 306 | MOA: Multi-Objective Alignment for Role-Playing Agents Highlight: We present **MOA** (**M**ulti-**O**bjective **A**lignment), a reinforcement-learning framework that enables multi-dimensional, fine-grained rubric optimization for general RPAs. |
Chonghua Liao; Ke Wang; Yuchuan Wu; Ruoran Li; Fei Huang; Yongbin Li; |
| 307 | FoE: Forest of Errors Makes The First Solution The Best in Large Reasoning Models Highlight: Through comprehensive empirical analysis, we characterize errors as a forest-structured Forest of Errors (FoE) and conclude that FoE makes the First the Best, which is underpinned by rigorous theoretical analysis. Leveraging these insights, we propose RED, a self-guided efficient reasoning framework comprising two components: I) Refining First, which suppresses FoE growth in the first solution; and II) Discarding Subs, which prunes subsequent FoE via dual-consistency. |
Kehan Jiang; Haonan Dong; Zhaolu Kang; Zhengzhou Zhu; Guojie Song; |
| 308 | Can Persona-Prompted LLMs Emulate Subgroup Values? An Empirical Analysis of Generalisability and Fairness in Cultural Alignment Highlight: This paper investigates the more challenging task of fine-grained value alignment: examining whether LLMs can emulate the distinct cultural values of demographic subgroups. |
Bryan Chen Zhengyu Tan; Zhengyuan Liu; Xiaoyuan Yi; Jing Yao; Xing Xie; Nancy F. Chen; Roy Ka-Wei Lee; |
| 309 | VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models Highlight: However, we found a fundamental issue faced by OLLMs: Visual Interference, where models show a bias towards visible text over auditory signals, causing them to hallucinate slide content that was never spoken. To address this, we propose Visually-Anchored Policy Optimization (VAPO), which aims to reshape models’ inference process to follow the human-like “Look-then-Listen” inference chain. |
Rui Hu; Delai Qiu; Yining Wang; Shengping Liu; Jitao Sang; |
| 310 | Maximizing Local Entropy Where It Matters: Prefix-Aware Localized LLM Unlearning Highlight: This global treatment results in unnecessary utility degradation and extends optimization to content-agnostic regions. To address these limitations, we propose PALU (Prefix-Aware Localized Unlearning), a framework driven by a local entropy maximization objective across both temporal and vocabulary dimensions. |
Naixin Zhai; Pengyang Shao; Binbin Zheng; Yonghui Yang; Fei Shen; Long Bai; Xun Yang; |
| 311 | ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs Highlight: We introduce ReGATE (**Re**ference-**G**uided **A**daptive **T**oken **E**lision), an adaptive token pruning method for accelerating MLLM training. |
Chaoyu Li; Yogesh Kulkarni; Pooyan Fazli; |
| 312 | LEDOM: Reverse Language Model Highlight: We propose Reverse Reward, which reranks forward outputs using reverse posterior estimates, and prove that bidirectional scoring penalizes hallucinated reasoning chains whose backward reconstruction degrades. |
Xunjian Yin; Sitao Cheng; Yuxi Xie; Xinyu Hu; Li Lin; Xinyi Wang; Liangming Pan; William Yang Wang; Xiaojun Wan; |
| 313 | Can We Predict Before Executing Machine Learning Agents? Highlight: In this work, we formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. |
Jingsheng Zheng; Jintian Zhang; Yujie Luo; Yuren Mao; Yunjun Gao; Lun Du; Huajun Chen; Ningyu Zhang; |
| 314 | HistLens: Mapping Idea Change Across Concepts and Corpora Highlight: We propose HistLens, a unified, SAE-based framework for multi-concept, multi-corpus conceptual-history analysis. |
Yi Jing; Weiyun Qiu; Yihang Peng; Zhifang Sui; |
| 315 | Illusions of Confidence? Diagnosing LLM Truthfulness Via Neighborhood Consistency Highlight: To validate the efficiency of NCB, we introduce a new cognitive stress-testing protocol that probes output stability under contextual interference. |
Haoming Xu; Ningyuan Zhao; Yunzhi Yao; Weihong Xu; Hongru Wang; Xinle Deng; Shumin Deng; Jeff Z. Pan; Huajun Chen; Ningyu Zhang; |
| 316 | Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA Highlight: We propose Doc-V*, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. |
Yuanlei Zheng; Pei Fu; Hang Li; Ziyang Wang; Yuyi Zhang; Wenyu Ruan; Xiaojin Zhang; Zhongyu Wei; Zhenbo Luo; Jian Luan; Wei Chen; Xiang Bai; |
| 317 | To Lie or Not to Lie? Investigating The Biased Spread of Global Lies By LLMs Highlight: We study how LLMs behave when prompted to spread misinformation across languages and target countries, and introduce GlobalLies, a multilingual parallel dataset of 440 misinformation generation prompt templates and 6,867 entities, spanning 8 languages and 195 countries. |
Zohaib Khan; Mustafa Dogan; Ifeoma Okoh; Pouya Sadeghi; Siddhartha Shrestha; Sergius Justus Chesami Nyah; Mahmoud O. Mokhiamar; Michael J Ryan; Tarek Naous; |
| 318 | Breaking Block Boundaries: Anchor-based History-stable Decoding for Diffusion Large Language Models Highlight: To address this challenge, we systematically investigate the identification of stable tokens and present three key findings: (1) naive lookahead decoding is unreliable, (2) token stability closely correlates with convergence trend, and (3) historical information is isolated. Building on these insights, we propose Anchor-based History-stable Decoding (AHD), a training-free, plug-and-play dynamic decoding strategy. |
Shun Zou; Yong Wang; Zehui Chen; Lin Chen; Chongyang Tao; Feng Zhao; Xiangxiang Chu; |
| 319 | Libra-VLA: Achieving Learning Equilibrium Via Asynchronous Coarse-to-Fine Dual-System Highlight: This strategy overlooks the inherent hierarchy of robotic manipulation, where complex actions can be naturally modeled in a Hybrid Action Space, decomposing into discrete macro-directional reaching and continuous micro-pose alignment, severely widening the semantic-actuation gap and imposing a heavy representational burden on grounding high-level semantics to continuous actions. To address this, we introduce Libra-VLA, a novel Coarse-to-Fine Dual-System VLA architecture. |
Yifei Wei; Linqing Zhong; Yi Liu; Yuxiang Lu; Xindong He; Maoqing Yao; Guanghui Ren; |
| 320 | Beyond Scaling: Measuring and Predicting The Upper Bound of Knowledge Retention in Language Model Pre-Training Highlight: We propose Size-dependent Mutual Information (SMI), an information-theoretic predictor that integrates knowledge frequency, knowledge specificity, and model size to forecast closed-book question answering (QA) accuracy. |
Changhao Jiang; Ming Zhang; Yifei Cao; Junjie Ye; Xiaoran Fan; Shihan Dou; Zhiheng Xi; Jiajun Sun; Yi Dong; Yujiong Shen; Jingqi Tong; Baoyu Fan; Tao Gui; Qi Zhang; Xuanjing Huang; |
| 321 | Contextual Relevance and Adaptive Sampling for LLM-Based Document Reranking Highlight: To efficiently estimate contextual relevance, we propose TS-SetRank, a sampling-based, uncertainty-aware reranking algorithm. |
Jerry Huang; Siddarth Madala; Cheng Niu; Julia Hockenmaier; Tong Zhang; |
| 322 | Different Types of Syntactic Agreement Recruit The Same Units Within Large Language Models Highlight: We investigate whether different syntactic phenomena recruit shared or distinct components in LLMs. |
Daria Kryvosheieva; Andrea Gregor de Varda; Evelina Fedorenko; Greta Tuckute; |
| 323 | Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions Highlight: Vision-language models are increasingly employed as multimodal conversational agents (MCAs) for diverse conversational tasks. |
Yongqi Li; Hao Lang; Tieyun Qian; Yongbin Li; |
| 324 | Evaluating Language Model Pluralism Through In-the-wild Crowd Discussions Highlight: We propose PLURALEVAL, an evaluation framework that assesses LLM pluralism in open-ended generation by comparing outputs against free-form crowd responses. |
Gagan Mundada; Rohan Surana; Nandhini Swaminathan; Bodhisattwa Prasad Majumder; Junda Wu; Julian McAuley; Zhouhang Xie; |
| 325 | WebSynthesis: World Model-Guided Monte Carlo Tree Search for Efficient WebAgent Trajectory Synthesis Highlight: We identify two key challenges: (1) Infrastructure Overhead, where network instability and website access restrictions limit data collection scalability; and (2) Constrained Exploration, where irreversible state transitions preclude tree-based search and thus limit trajectory diversity. To address these challenges, we introduce WebSynthesis, a framework for scalable trajectory synthesis. |
Yifei Gao; Junhong Ye; Yifan Yang; Jiaqi Wang; Yi Zhang; Zhang Ruichen; Jitao Sang; |
| 326 | DPC: Training-Free Text-to-SQL Candidate Selection Via Dual-Paradigm Consistency Highlight: We introduce DPC (Dual-Paradigm Consistency), a multi-agent framework that reformulates SQL selection from a probabilistic guessing task on hidden data into a deterministic verification task on visible data. |
Boyan Li; Ou Ocean Kun Hei; Yue Yu; Yuyu Luo; |
| 327 | A Comprehensive Survey of Process Reward Models: Data Generation, Model Construction, and Usage Highlight: Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment. |
Congmin Zheng; Jiachen Zhu; Zhuoying Ou; Yuxiang Chen; Kangning Zhang; Rong Shan; Zeyu Zheng; Mengyue Yang; Jianghao Lin; Yong Yu; Weinan Zhang; |
| 328 | Don’t Stop Early: Scalable Enterprise Deep Research with Controlled Information Flow and Evidence-Aware Termination Highlight: Enterprise deep research often fails to produce decision-ready reports due to uneven information coverage, context explosion, and premature stopping. We propose a scalable Enterprise Deep Research (EDR) architecture to address these failures. |
Prafulla Kumar Choubey; Kung-Hsiang Huang; Pranav Narayanan Venkit; Jiaxin Zhang; Vaibhav Vats; Yu Li; Xiangyu Peng; Chien-Sheng Wu; |
| 329 | OneRec-Think: In-Text Reasoning for Generative Recommendation Highlight: However, existing generative models (e.g., OneRec) operate as implicit predictors, critically lacking the capacity for explicit and controllable reasoning, a key advantage of LLMs. To bridge this gap, we propose OneRec-Think, a unified framework that seamlessly integrates dialogue, reasoning, and personalized recommendation. |
Zhanyu Liu; Shiyao Wang; Xingmei Wang; Rongzhou Zhang; Jiaxin Deng; Honghui Bao; Jinghao Zhang; Wuchao Li; PengFei Zheng; Xiangyu Wu; Yifei Hu; Qigen Hu; Xinchen Luo; Lejian Ren; Zhang Zixing; Qianqian Wang; Kuo Cai; Yunfan Wu; Hongtao Cheng; Zexuan Cheng; Lu Ren; Huanjie Wang; Yi Su; Ruiming Tang; Kun Gai; Guorui Zhou; |
| 330 | Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models Highlight: In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. |
Tingchen Fu; Yafu Li; Jiawei Gu; Xiaoye Qu; Yu Cheng; |
| 331 | AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation Highlight: We introduce AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process over image tokens, conditioned on language grounded query banks. |
Rui Qian; Chuanhang Deng; Qiang Huang; Jian Xiong; Mingxuan Li; Yingbo Zhou; Wei Zhai; Jintao Chen; Dejing Dou; |
| 332 | AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling Highlight: This paradigm, however, suffers from two key limitations: a static inductive bias that misaligns with the task-dependent preference signals, and a representational mismatch, as the backbone’s optimization for generation leaves its representations ill-suited to fine-grained discrimination. To address this, we propose AdaJudge, a unified framework that jointly adapts representation and aggregation. |
Yongliang Miao; Yangyang Liang; Mengnan Du; |
| 333 | Tears or Cheers? Benchmarking LLMs Via Culturally Elicited Distinct Affective Responses Highlight: These benchmarks remain insufficient to capture the subjective interpretative variance inherent to diverse sociocultural lenses. To address this limitation, we introduce CEDAR, a multimodal benchmark constructed entirely from scenarios capturing **C**ulturally **E**licited **D**istinct **A**ffective **R**esponses. |
Chongyuan Dai; Yaling Shen; Zihan Gao; Jia Li; Yishun Jiang; Yaxiong Wang; Liu Liu; Zongyuan Ge; Jinpeng Hu; |
| 334 | LASA: Language-Agnostic Semantic Alignment at The Semantic Bottleneck for LLM Safety Highlight: Large language models (LLMs) have demonstrated better safety performance in high-resource languages than in low-resource languages. |
Junxiao Yang; Haoran Liu; Jinzhe Tu; Jiale Cheng; Zhexin Zhang; Shiyao Cui; Jiaqi Weng; Jialing Tao; Hui Xue; Hongning Wang; Han Qiu; Minlie Huang; |
| 335 | When Background Matters: Breaking Medical Vision Language Models By Transferable Attack Highlight: Existing medical attacks focus on secondary objectives such as model stealing or adversarial fine-tuning, while transferable attacks from natural images introduce visible distortions that clinicians can easily detect. To address this, we propose MedFocusLeak, a highly transferable black-box multimodal attack that induces incorrect yet clinically plausible diagnoses while keeping perturbations imperceptible. |
Akash Ghosh; Subhadip Baidya; Sriparna Saha; Xiuying Chen; |
| 336 | Improving Long-Context Translation Via Self-Supervised Dual Learning Highlight: We propose LongDu, a self-supervised post-training framework that improves long-document translation reliability via round-trip consistency. |
Shanbo Cheng; Shuaijie She; Yu Bao; Jianbing Zhang; Jiajun Chen; Shujian Huang; |
| 337 | Uncovering Temporal Framing in The News Highlight: We propose a taxonomy of eight temporal frames grounded in prior work on temporality and framing, and we realize it through expert annotation of a multilingual news corpus. |
Tarek Mahmoud; Veronika Solopova; Premtim Sahitaj; Ariana Sahitaj; Max Upravitelev; Mervat Abassy; Hana Fatima Shaikh; Neda Foroutan; Vera Schmitt; Preslav Nakov; |
| 338 | New Terms, New Toxicity: Consensus-based Chinese Neologism Toxicity Detection Via Search-Augmented LLMs Highlight: In this paper, we investigate how to detect implicit toxicity expressed via neologisms. |
Shiyao Cui; QingLin Zhang; Di Wang; Yida Lu; Zhexin Zhang; Jinhua Gao; Jinglin Yang; Min He; Han Qiu; Minlie Huang; |
| 339 | What Is A Protest Anyway? Codebook Conceptualization Is Still A First-order Concern in LLM-era Classification Highlight: In this work, we focus on the steps before and after LLM prompting: conceptualization of the categories to classify and using LLM predictions in downstream statistical inference. |
Andrew Halterman; Katherine A. Keith; |
| 340 | SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding Highlight: Yet, many real-world applications, such as navigating text-based maps or interpreting structured tables, rely heavily on precise sub-token understanding. In this regard, we introduce SubTokenTest, a comprehensive benchmark that assesses sub-token understanding through **practical, utility-driven** tasks. |
Shuyang Hou; Yi Hu; Muhan Zhang; |
| 341 | Are They Lovers or Friends? Evaluating LLMs’ Social Reasoning in English and Korean Dialogues Highlight: To explore their capabilities, we introduce SCRIPTS, a 1. |
Eunsu Kim; Junyeong Park; Juhyun Oh; Kiwoong Park; Seyoung Song; A. Seza Doğruöz; Alice Oh; Najoung Kim; |
| 342 | Teach A Reward Model to Correct Itself: Reward Guided Adversarial Failure Discovery for Robust Reward Modeling Highlight: We introduce a preference-distribution-agnostic procedure that uses the reward model itself to guide controlled decoding toward misspecified responses while preserving the underlying preference class. Building on this discovery mechanism, we propose REFORM, a self-improving RM framework that (i) searches for class-consistent but reward-inconsistent variants and (ii) fine-tunes the RM on a small, targeted augmentation of these failures. |
Pankayaraj Pathmanathan; Furong Huang; |
| 343 | DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection Highlight: In this study, we introduce DetectRL-X, a comprehensive multilingual benchmark designed to evaluate advanced detectors across 8 dimensions. |
Junchao Wu; Yefeng Liu; Chenyu Zhu; Hao Zhang; Zeyu Wu; Tianqi Shi; Yichao Du; Longyue Wang; Weihua Luo; Jinsong Su; Derek F. Wong; |
| 344 | DREAM: Deep Research Evaluation with Agentic Metrics Highlight: We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose **DREAM** (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. |
Elad Ben Avraham; ChangHao Li; Ron Dorfman; Roy Ganz; Oren Nuriel; Amir Dudai; Aviad Aberdam; Noah Flynn; Elman Mansimov; Aditya Kalyanpur; Ron Litman; |
| 345 | GiLT: Augmenting Transformer Language Models with Dependency Graphs Highlight: We propose the Graph-Infused Layers Transformer Language Model (GiLT), which leverages dependency graphs to augment Transformer language models. |
Tianyu Huang; Yida Zhao; Chuyan Zhou; Kewei Tu; |
| 346 | Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Vetting Via Automated Spectral Inspection Highlight: Due to the limited generalization and interpretability of deep learning classifiers, the final vetting of rare celestial object candidates still relies on manually intensive expert visual inspection, which has become a primary bottleneck as modern spectroscopic surveys continue to scale. To bridge this gap, we propose Spec-o3, a tool-augmented vision-language agent that performs astronomer-aligned spectral inspection via interleaved multimodal chain-of-thought reasoning. |
Minghui Jia; Qichao Zhang; Ali Luo; Linjing Li; Shuo Ye; Hailing Lu; Wen Hou; Dongbin Zhao; |
| 347 | Adam’s Law: Textual Frequency Law on Large Language Models Highlight: We propose a novel research direction in terms of textual data frequency, which is an understudied topic. |
Hongyuan Lu; Zixuan Li; Zefan Zhang; Bowen Cao; Wai Lam; |
| 348 | Incentivizing Parametric Knowledge Via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To incentivize the effective use of parametric knowledge, we propose EA-RLVR (Entity-Anchored Reinforcement Learning with Verifiable Rewards), a training framework that optimizes cross-cultural entity translation without relying on external knowledge bases. |
Jiang Zhou; Xiaohu Zhao; Xinwei Wu; Tianyu Dong; Hao Wang; Yangyang Liu; Heng Liu; Linlong Xu; Longyue Wang; Weihua Luo; Deyi Xiong; |
| 349 | Query-Efficient Agentic Graph Extraction Attacks on GraphRAG Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose AGEA (Agentic Graph Extraction Attack), a framework that leverages a novelty-guided exploration–exploitation strategy, external graph memory modules, and a two-stage graph extraction pipeline combining lightweight discovery with LLM-based filtering. |
Shuhua Yang; Jiahao Zhang; Yilong Wang; Dongwon Lee; Suhang Wang; |
| 350 | ReCreate: Reasoning and Creating Domain Agents Driven By Experience Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Such strategies overlook critical evidence explaining why an agent succeeds or fails, and often require high computational costs. To address these limitations, we propose ReCreate, an experience-driven framework for the automatic creation of domain agents. |
Zhezheng Hao; Hong Wang; Jian Luo; Jianqing Zhang; Yuyan Zhou; Qiang Lin; Can Wang; Hande Dong; Jiawei Chen; |
| 351 | Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we conduct comprehensive theoretical and empirical analyses of entropy dynamics in RLVR, offering two main insights: (1) We derive a tight approximation for token-level entropy change at each update step, revealing four governing factors and providing a unified theoretical framework of how existing methods influence entropy; (2) We reveal a fundamental limitation of recent approaches: they rely on heuristic adjustments to one or two of these factors, leaving other relevant factors unconsidered, thus inherently limiting their effectiveness. Motivated by these findings, we propose STEER, a principled entropy-modulation method that adaptively reweighs tokens based on theoretically-estimated entropy variations. |
Zhezheng Hao; Hong Wang; Haoyang Liu; Jian Luo; Jiarui Yu; Hande Dong; Qiang Lin; Can Wang; Jiawei Chen; |
| 352 | Demystifying Data Organization for Enhanced LLM Training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Guided by them, we introduce two novel data ordering methods termed STR and SAW. |
Yalun Dai; Yangyu Huang; Tongshen Yang; Yonghan Wang; Xin Zhang; Wenshan Wu; Qihao Zhao; Hao Li; Yuanyuan Gao; Kim-Hui Yap; Scarlett Li; |
| 353 | MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Long-tail Knowledge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce MINTQA (Multi-hop Question Answering on New and Tail Knowledge), a benchmark designed to evaluate multi-hop QA performance on questions involving 10,479 question-answer pairs for evaluating old/new knowledge and 17,887 pairs for assessing popular/unpopular knowledge, with each question equipped with corresponding sub-questions and answers. |
Jie He; Nan Hu; Wanqiu Long; Jiaoyan Chen; Jeff Z. Pan; |
| 354 | Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. |
Sensen Gao; Shanshan Zhao; Xu Jiang; Lunhao Duan; Yong Xien Chng; Qing-Guo Chen; Weihua Luo; Kaifu Zhang; Jia-Wang Bian; Mingming Gong; |
| 355 | From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at The Repository Level Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification. |
Jia Li; Yuxin Su; Michael R. Lyu; |
| 356 | Language of Thought Shapes Output Diversity in Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we reveal that controlling the language used during model thinking—the *language of thought*—provides a novel and structural source of output diversity. |
Shaoyang Xu; Wenxuan Zhang; |
| 357 | Aligning Large Language Models Via Fully Self-Synthetic Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce Self-Alignment Optimization (SAO), a fully self-synthetic framework for LLM alignment, where all training data, including prompts (i.e., user queries), responses, and preferences, are generated by the model itself. |
Shangjian Yin; Zhepei Wei; Xinyu Zhu; Wei-Lin Chen; Yu Meng; |
| 358 | Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We present Embodied-Reasoner, a reasoning model for interactive embodied tasks. |
Wenqi Zhang; Mengna Wang; Gangao Liu; Huixin Xu; Yiwei Jiang; Yongliang Shen; Guiyang Hou; Zhe Zheng; Hang Zhang; Xin Li; Jiajun Liu; Weiming Lu; Peng Li; Yueting Zhuang; |
| 359 | XY-Tokenizer: Mitigating The Semantic-Acoustic Conflict in Low-Bitrate Speech Codecs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose XY-Tokenizer, a low-bitrate speech codec (around 1 kbps) trained with a structured multi-stage, multi-task strategy that aligns discrete speech representations with text while preserving fine-grained acoustic details for reconstruction. |
Yitian Gong; Luozhijie Jin; Kuangwei Chen; Dong Zhang; Ruifan Deng; Xiaogui Yang; Xin Zhang; Zhaoye Fei; Qinyuan Cheng; Shimin Li; Xipeng Qiu; |
| 360 | TEMA: Anchor The Image, Follow The Text for Multi-Modification Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. In order to address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. |
Zixu Li; Yupeng Hu; Zhiheng Fu; Zhiwei Chen; Yongqi Li; Liqiang Nie; |
| 361 | FedMental: Evaluating Federated Learning for Mental Health Detection from Social Media Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Social media text data are often used to train Machine Learning (ML) models to identify users exhibiting high-risk mental health behaviors. |
Nuredin Ali Abdelkadir; Anjali Ratnam; Zeerak Talat; Stevie Chancellor; |
| 362 | Self-SoftCoT: A Self-Consistent Framework Via Position-Aware Latent Space Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing continuous reasoning approaches, such as SoftCoT, mitigate this but typically rely on external auxiliary models, resulting in complex deployment and fractured inference pipelines. To address these challenges, we propose Self-SoftCoT, a self-contained framework that enables a frozen LLM to internally generate and consume latent thoughts without external assistants. |
Liangliang Dong; Lianlei Shan; Shuaimin Li; |
| 363 | PRInTS: Reward Modeling for Long-Horizon Information Seeking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM’s reasoning across multiple step quality dimensions (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. |
Jaewoo Lee; Archiki Prasad; Justin Chen; Zaid Khan; Elias Stengel-Eskin; Mohit Bansal; |
| 364 | Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: On the other hand, human evaluations, despite their richness, remain costly, inconsistent, and difficult to scale. We tackle this critical barrier by proposing a Dual-Axis Generative Reward Model, which is trained to understand complex interaction dynamics using a detailed taxonomy and an annotated dataset; it produces a single score and, crucially, provides separate evaluations for semantic quality and interaction timing. |
Yifu Chen; Shengpeng Ji; Zhengqing Liu; Qian Chen; Wen Wang; Ziqing Wang; Yangzhuo Li; Tianle Liang; Zhou Zhao; |
| 365 | Standard-to-Dialect Transfer Trends Differ Across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We compare standard-to-dialect transfer in three settings: text models, speech models, and cascaded systems where speech first gets automatically transcribed and then further processed by a text model. |
Verena Blaschke; Miriam Winkler; Barbara Plank; |
| 366 | Unified Thinker: A General Reasoning Core for Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. |
Sashuai Zhou; Qiang Zhou; Jijin Hu; Hanqing Yang; Yue Cao; Junpeng Ma; Yinchao Ma; Jun Song; Tiezheng Ge; Cheng Yu; Bo Zheng; Zhou Zhao; |
| 367 | MED-COREASONER: Reducing Language Disparities in Medical Reasoning Via Language-Informed Co-Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deployment. To bridge this gap, we introduce Med-CoReasoner, a language-informed co-reasoning framework that elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval. |
Fan Gao; Sherry T. Tong; Jiwoong Sohn; Jiahao Huang; Junfeng Jiang; Ding Xia; Piyalitt Ittichaiwong; Kanyakorn Veerakanjana; Hyunjae Kim; Qingyu Chen; Edison Marrese-Taylor; Kazuma Kobayashi; Akiko Aizawa; Irene Li; |
| 368 | AV-Dialog: Spoken Dialogue Models with Audio-Visual Input Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present AV-Dialog, the first multimodal dialog framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. |
Tuochao Chen; Bandhav Veluri; Hongyu Gong; Shyamnath Gollakota; |
| 369 | Why Steering Works: Toward A Unified View of Language Model Parameter Dynamics Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. |
Ziwen Xu; Chenyan WU; Hengyu Sun; Haiwen Hong; Mengru Wang; Yunzhi Yao; Longtao Huang; Hui Xue; Shumin Deng; Zhixuan Chu; Huajun Chen; Ningyu Zhang; |
| 370 | Experience Retrieval-Augmentation with Electronic Health Records Enables Accurate Discharge QA Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, beyond general medical knowledge from open-ended datasets, clinical case-based knowledge is also critical for effective medical reasoning, as it provides context grounded in real-world patient experiences. Motivated by this, we propose the Experience Retrieval-Augmentation (ExpRAG) framework based on Electronic Health Records (EHR), aiming to offer relevant context from other patients’ discharge reports. |
Justice Ou; Tinglin Huang; Yilun Zhao; Ziyang Yu; Peiqing Lu; Yifei Shen; Rex Ying; |
| 371 | Controllable Contamination Detection for Reliable LLM Evaluation with Statistical Guarantees Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: As a result, contaminated data may be mistakenly retained, leading to unreliable evaluation. To address this challenge, we propose FTD (FDR-controlled Training Data detection), a principled framework that detects and filters contaminated evaluation data while providing a statistical guarantee: the proportion of contaminated samples mistakenly retained as clean, the false discovery rate (FDR), is provably controlled below a user-specified threshold. |
Zheng Zhang; Qi Liu; Siyuan Liang; Ning Li; Zirui Hu; Weibo Gao; Rui Li; Zhenya Huang; Leszek Rutkowski; Baosheng Yu; Dacheng Tao; |
| 372 | FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Motivated by recent findings that LLM forgetting mirrors the Ebbinghaus human forgetting curve, we propose FOREVER (FORgEtting curVe-inspired mEmory Replay), a novel CL framework that aligns replay schedules with a model-centric notion of time. |
Yujie Feng; Hao Wang; Jian Li; Xu Chu; Zhaolu Kang; Yiran Liu; Yasha Wang; Philip S. Yu; Xiao-Ming Wu; |
| 373 | SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To enable scalable meta-training under this framework, we propose a novel heuristic solution, first-order bi-level optimization (FOBLO), avoiding heavy computation costs. |
Mahi Luthra; Jiayi Shen; Maxime Poli; Angelo Ortiz Tandazo; Yosuke Higuchi; Youssef Benchekroun; Martin Gleize; Charles-Éric Saint-James; Dongyan Lin; Phillip Rust; Angel Villar-Corrales; Surya; Vanessa Stark; Rashel Moritz; Juan Pino; Yann LeCun; Emmanuel Dupoux; |
| 374 | The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present a comprehensive evaluation of dLLMs (e.g., LLaDA, Dream) across two distinct agentic paradigms: Embodied Agents (requiring long-horizon planning) and Tool-Calling Agents (requiring precise formatting). |
Qingyu Lu; Liang Ding; Kanjian Zhang; Jinxia Zhang; Dacheng Tao; |
| 375 | Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose SkillNav, a modular framework that introduces structured, skill-based reasoning into Transformer-based VLN agents. |
Tianyi Ma; Yue Zhang; Zehao Wang; Parisa Kordjamshidi; |
| 376 | InferenceDynamics: Adaptive LLM Routing Through Structured Capability and Knowledge Profiling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To overcome those challenges, we propose **InferenceDynamics**, a flexible and scalable multi-dimensional routing framework by modeling the capability and knowledge of models. |
Haochen Shi; Tianshi Zheng; Weiqi Wang; Baixuan Xu; Chunyang Li; Chunkit Chan; Tao Fan; Yangqiu Song; |
| 377 | AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Specifically, AutoJudger has three core components: **ability decomposition** to organize evaluation along meaningful capability dimensions, **ability estimation** to maintain an up-to-date quantitative profile of the model competence, and **adaptive question selection** to choose the most informative questions. To operationalize this paradigm, we introduce **A2-Judger**, a novel MLLM-based **A**gentic instantiation of **A**uto**Judger** equipped with semantic-aware retrieval and dynamic memory. |
Xuanwen Ding; Chengjun Pan; Zejun Li; Jiwen Zhang; Siyuan Wang; Zhongyu Wei; |
| 378 | When Agents Look The Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose two complementary metrics to isolate non-mandatory behavioral patterns: Response Pattern Similarity (RPS) for verbal alignment and Action Graph Similarity (AGS) for tool-use habits modeled as directed graphs. |
Chenghao Yang; Yuning Zhang; Zhoufutu Wen; Tao Gong; Jiaheng Liu; Qi Chu; Nenghai Yu; |
| 379 | IntrAgent: An LLM Agent for Content-Grounded Information Retrieval Through Literature Review Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Scientific research relies on accurate information retrieval from literature to support analytical decisions. In this work, we introduce a new task, *INformation reTRieval through literAture reVIEW* (IntraView), which aims to automate fine-grained information retrieval *faithfully* grounded in the provided content in response to research-driven queries, and propose IntrAgent, an LLM-based agent that addresses this challenging task. |
Fengbo Ma; Zixin Rao; Xiaoting Li; Zhetao Chen; Hongyue Sun; Yiping Zhao; Xianyan Chen; Zhen Xiang; |
| 380 | SciMDR: Advancing Scientific Multimodal Document Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. |
Ziyu Chen; Yilun Zhao; Chengye Wang; Rilyn R. Han; Manasi Patwardhan; Arman Cohan; |
| 381 | FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning steps required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. |
Zhuohan Xie; Daniil Orel; Rushil Thareja; Dhruv Sahnan; Hachem Madmoun; Fan Zhang; Debopriyo Banerjee; Georgi Nenkov Georgiev; Xueqing Peng; Lingfei Qian; Jimin Huang; Jinyan Su; Aaryamonvikram Singh; Rui Xing; Rania Elbadry; Chen Xu; Haonan Li; Fajri Koto; Ivan Koychev; Tanmoy Chakraborty; Yuxia Wang; Salem Lahlou; Veselin Stoyanov; Sophia Ananiadou; Preslav Nakov; |
| 382 | Ted-Tok: Maintaining An Evolving Vocabulary for Lifelong Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The tokenizer, a foundational part of the system, is usually assumed to remain fixed in lifelong learning scenarios. In this work, we challenge the validity of this assumption: as language evolves, a static tokenizer fragments newly emerging lexical items, reducing compression efficiency and consequently degrading the model performance. |
Jiameng Huang; Zhi Zhang; Zhenyu He; Jiacheng Sun; Di He; |
| 383 | ACE-Router: Generalizing History-Aware Routing from MCP Tools to The Agent Web Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current architectures face severe scalability and generality bottlenecks. To address this, we propose ACE-Router, a pipeline for training history-aware routers to empower precise navigation in large-scale ecosystems. |
Zhiyuan Yao; Zishan Xu; Yifu Guo; Zhiguang Han; Cheng Yang; Shuo Zhang; Weinan Zhang; Xingshan Zeng; Weiwen Liu; |
| 384 | Can LLMs Learn to Map The World from Local Descriptions? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This study investigates whether LLMs, grounded in locally relative human observations, can construct coherent global spatial cognition by integrating fragmented relational descriptions. |
Sirui Xia; Aili Chen; Xintao Wang; Tinghui Zhu; Yikai Zhang; Jiangjie Chen; Yanghua Xiao; |
| 385 | SCAN: Structured Capability Assessment and Navigation for LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing research has focused on approximating model rankings, such benchmarks fail to provide users and developers with a comprehensive and fine-grained understanding of a specific model’s capabilities. To fill this gap, we propose SCAN (Structured Capability Assessment and Navigation), a practical framework that enables detailed characterization of LLM capabilities through comprehensive and fine-grained evaluation. |
Zongqi Wang; Tianle Gu; Chen Gong; Xin Tian; Siqi Bao; Yujiu Yang; |
| 386 | On The Proper Treatment of Units in Surprisal Theory Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. |
Samuel Kiegeland; Vésteinn Snæbjarnarson; Tim Vieira; Ryan Cotterell; |
| 387 | CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current benchmarks primarily emphasize functional relevance while neglecting code quality. To address this gap, we introduce CoQuIR, the first large-scale, multilingual benchmark specifically designed to evaluate quality-aware code retrieval across four critical dimensions: correctness, efficiency, security, and maintainability. |
Jiahui Geng; Fengyu Cai; Shaobo Cui; Qing Li; Liangwei Chen; Chenyang Lyu; Haonan Li; Derui Zhu; Alexander Pretschner; Heinz Koeppl; Fakhri Karray; |
| 388 | Towards A Mechanistic Understanding of Large Reasoning Models: A Survey of Training, Inference, and Failures Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper provides a comprehensive survey of the mechanistic understanding of LRMs, organizing recent findings into three core dimensions: 1) training dynamics, 2) reasoning mechanisms, and 3) unintended behaviors. By synthesizing these insights, we aim to bridge the gap between black-box performance and mechanistic transparency. |
Yi Hu; Jiaqi Gu; Ruxin Wang; Zijun Yao; Hao Peng; Xiaobao Wu; Jianhui Chen; Muhan Zhang; Liangming Pan; |
| 389 | Shanks: Simultaneous Hearing and Thinking for Spoken Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to user input. |
Cheng-Han Chiang; Xiaofei Wang; Linjie Li; Chung-Ching Lin; Kevin Lin; Shujie Liu; Zhendong Wang; Zhengyuan Yang; Hung-yi Lee; Lijuan Wang; |
| 390 | Improving Retrieval-Augmented Generation Without Taxonomy-based Error Categorization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we propose RePAIR, a response–action learning paradigm that directly maps flawed RAG outputs to error-mitigating action plans without relying on fine-grained error taxonomies or explicit critic supervision. |
Gongbo Zhang; Yifan Peng; Chunhua Weng; |
| 391 | From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While online reinforcement learning (RL), specifically GRPO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient. |
Ziwei Huang; Ying Shu; Fanghao; Quanyu Long; Wenya Wang; Qiushi Guo; Tiezheng Ge; Leilei Gan; |
| 392 | HoWToBench: Holistic Evaluation for LLM’s Capability in Human-level Writing Using Tree of Writing Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Tree-of-Writing (ToW), to resolve the implicit inconsistency often found when LLM-as-a-judge aggregates all sub-features in text evaluation. |
Andrew Zhuoer Feng; Cunxiang Wang; Yu Luo; Lin Fan; Irene Zhou; Zikang Wang; Xiaotao Gu; Jie Tang; Hongning Wang; Minlie Huang; |
| 393 | Mitigating Hallucinations in Large Vision-Language Models Without Performance Degradation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Though efficient, we empirically observe that these methods degrade general generation capacity due to incomplete extraction of hallucination components and non-selective parameter updates. To address these limitations, we propose MPD, a dual-stage framework for mitigating hallucinations without performance degradation. |
Xingyu Zhu; Junfeng Fang; Shuo Wang; Beier Zhu; Zhicai Wang; Yonghui Yang; Xiangnan He; |
| 394 | SAD: A Large-Scale Strategic Argumentative Dialogue Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To support deeper modeling of argumentation dialogue, we present the first large-scale Strategic Argumentative Dialogue dataset, SAD, consisting of 392,822 examples. |
YongKang Liu; Jiayang Yu; Mingyang Wang; Yiqun Zhang; Ercong Nie; Shi Feng; Daling Wang; Kaisong Song; Hinrich Schuetze; |
| 395 | Look Within or Beyond? A Theoretical Comparison Between Parameter-Efficient and Full Fine-Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we provide a theoretical and empirical comparison of PEFT and FFT in terms of representational capacity and robustness. |
YongKang Liu; Xingle Xu; Ercong Nie; Zijing Wang; Shi Feng; Daling Wang; Qian Li; Hinrich Schuetze; |
| 396 | MMSciCode: Real-world Evaluation of Multilingual Multi-Discipline Scientific Research Coding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce MMSciCode, a comprehensive expert-level, multilingual multi-discipline benchmark for evaluating foundation models in scientific code generation. |
Xue Xia; Zheyuan Yang; Arman Cohan; Yilun Zhao; |
| 397 | D2Plan: Dual-Agent Dynamic Global Planning for Complex Retrieval-Augmented Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they face two critical failure modes as the accumulating context becomes flooded with both crucial evidence and irrelevant information: (1) ineffective search chain construction that produces incorrect queries or omits retrieval of critical information, and (2) reasoning hijacking by peripheral evidence that causes models to misidentify distractors as valid evidence. To address these challenges, we propose **D²Plan**, a **D**ual-agent **D**ynamic global **Plan**ning paradigm for complex retrieval-augmented reasoning. |
Kangcheng Luo; Tinglang Wu; Yansong Feng; |
| 398 | HCSpec: Two-Tier Horizontal Cascade Speculative Decoding for High-Efficiency Large Language Model Inference Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current state-of-the-art self-distilled draft models adopt a homogeneous architecture across all drafting positions, failing to account for a critical empirical observation: the expected utility of drafting decays rapidly after the initial positions. To exploit this imbalance, we propose Two-tier Horizontal Cascade Speculative Decoding (HCSpec), a novel framework that organizes heterogeneous, position-specialized draft modules into a horizontal cascade. |
Yizhou Zhang; Siming Chen; Hao Ye; Erhu Feng; |
| 399 | Beyond Examples: Towards Automated Thought-level In-Context Reasoning for Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, traditional ICL struggles with complex reasoning mainly due to superficial, example-level implicit imitation. To address these limitations, we introduce **ThoughtICR**, an automated **Thought**-level **I**n-**C**ontext **R**easoning paradigm that shifts from surface-level examples to more guidance-oriented thought patterns. |
Jinyang Wu; Mingkuan Feng; Shuai Zhang; Feihu Che; Zhengqi Wen; Chonghua Liao; Ling Yang; Haoran Luo; Zheng Lian; Jianhua Tao; |
| 400 | SPARK: Strategic Policy-Aware Exploration Via Dynamic Branching for Long-Horizon Agentic Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Such attempts inherently waste substantial computation budget on trivial steps while failing to guarantee sample quality. To address this, we propose **SPARK** (**S**trategic **P**olicy-**A**ware explo**R**ation via **K**ey-state dynamic branching), a novel framework that selectively branches at critical decision states for resource-efficient exploration. |
Jinyang Wu; Shuo Yang; Yuhao Shen; Shuai Zhang; Zhengqi Wen; Jianhua Tao; |
| 401 | Decoupling Task-Solving and Output Formatting in LLM Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This entanglement creates competing goals for the model, hindering its reasoning capabilities. To address this, we introduce Deco-G, a decoding framework that explicitly decouples format adherence from problem solving. |
Haikang Deng; Po-Nien Kung; Nanyun Peng; |
| 402 | Semantic-Aware Logical Reasoning Via A Semiotic Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing studies largely overlook the interplay between logical complexity and semantic complexity, limiting their robustness under abstract propositions, ambiguous contexts, and conflicting stances, which are central to human reasoning. We propose **LogicAgent**, a semiotic-square–guided framework that jointly addresses these two axes of difficulty. |
Yunyao Zhang; Xinglang Zhang; Junxi Sheng; Wenbing Li; Junqing Yu; Yi-Ping Phoebe Chen; Wei Yang; Zikai Song; |
| 403 | Compressing Then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning. |
Da Li; Yuxiao Luo; Keping Bi; Jiafeng Guo; Wei Yuan; Biao Yang; Yan Wang; Fan Yang; Tingting Gao; Guorui Zhou; |
| 404 | BoYaEval: Evaluating Multimodal Large Language Models on Understanding Ancient Chinese Musical Scores Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce BoYaEval, the first comprehensive benchmark dedicated to deciphering diverse Ancient Chinese musical notations, including five types of ancient Chinese music notation systems. |
Jiajia Li; Weizhi Xue; Yao Yao; Qiwei Li; Chenchong; Zuchao Li; Ping Wang; Hai Zhao; |
| 405 | AAPO: Enhancing The Reasoning Capabilities of LLMs with Advantage Margin Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the existing group relative advantage estimation method still suffers from training inefficiencies, particularly when the estimated advantage approaches zero. To address this limitation, we propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy (CE) loss using advantages enhanced through a margin-based estimation scheme. |
Jian Xiong; Jingbo Zhou; Jingyong Ye; Qiang Huang; Dejing Dou; |
| 406 | MARS2: Scaling Multi-Agent Tree Search Via Reinforcement Learning for Code Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose MARS2 (Multi-Agent Reinforced Tree-Search Scaling), a unified RL framework in which multiple independently-optimized agents collaborate within a shared tree-structured search environment. |
Pengfei Li; Shijie Wang; Fangyuan Li; Yikun Fu; Kaifeng Liu; Kaiyan Zhang; Dazhi Zhang; Yuqiang Li; Biqing Qi; Bowen Zhou; |
| 407 | MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces MauBERT, a multilingual extension of HuBERT that leverages articulatory features for robust cross-lingual phonetic representation learning. |
Angelo Ortiz Tandazo; Manel Khentout; Youssef Benchekroun; Thomas Hueber; Emmanuel Dupoux; |
| 408 | Nature-Inspired Population-Based Evolution of Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Inspired by this principle, this paper formally defines a newly emerging problem: the population-based evolution of large language models (LLMs). We introduce a novel framework that starts with a population of parent LLMs and allows this population to evolve through four key operations: (i) crossover, merging the weights of different parents to create offspring LLMs, (ii) mutation, introducing small, random changes to model weights to foster diversity, (iii) selection, prioritizing high-performing models, and (iv) succession, transferring the learned experience from parent to offspring LLMs. |
Yiqun Zhang; Peng Ye; Xiaocui Yang; Shi Feng; Shufei Zhang; Lei Bai; Wanli Ouyang; Shuyue Hu; |
| 409 | MTRouter: Cost-Aware Multi-Turn LLM Routing with History–Model Joint Embeddings Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose MTRouter, which encodes the interaction history and candidate models into joint history–model embeddings, and learns an outcome estimator from logged trajectories to predict turn-level model utility. |
Yiqun Zhang; Hao Li; Zihan Wang; Shi Feng; Xiaocui Yang; Daling Wang; Bo Zhang; Lei Bai; Shuyue Hu; |
| 410 | Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper describes the system architecture, agent roles, retrieval and scoring methods, knowledge graph schema, and evaluation interfaces that together form the Paper Circle research workflow. |
Komal Kumar; Aman Chadha; Salman Khan; Fahad Shahbaz Khan; Hisham Cholakkal; |
| 411 | MAGNET: Towards Adaptive GUI Agents with Memory-Driven Knowledge Evolution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite these surface changes, we observe that functional semantics and task intents remain fundamentally stable. Building on this insight, we introduce MAGNET, a memory-driven adaptive agent framework with dual-level memory: stationary memory that links diverse visual features to stable functional semantics for robust action grounding and procedural memory that captures stable task intents across varying workflows. |
Libo Sun; Jiwen Zhang; Siyuan Wang; Zhongyu Wei; |
| 412 | BrowseComp-Plus: A Fair and Disentangled Evaluation Benchmark for Deep Search Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce BrowseComp-Plus, a benchmark derived from BrowseComp that employs a fixed, human-verified corpus, enabling controlled retrieval for deep search agents. |
Zijian Chen; Xueguang Ma; Shengyao Zhuang; Ping Nie; Kai Zou; Sahel Sharifymoghaddam; Andrew Liu; Joshua Green; Kshama Patel; Ruoxi Meng; Mingyi Su; Yanxi Li; Haoran Hong; Xinyu Shi; Xuye Liu; Hosna Oyarhoseini; Nandan Thakur; Crystina Zhang; Luyu Gao; Wenhu Chen; Jimmy Lin; |
| 413 | Explainable and Fine-Grained Safeguarding of LLM Multi-Agent Systems Via Bi-Level Graph Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, the lack of interpretability in these methods limits their reliability and real-world applicability. To address these limitations, we propose an explainable and fine-grained safeguarding framework for detecting malicious agents in MAS. |
Junjun Pan; Yixin Liu; Rui Miao; Kaize Ding; Yu Zheng; Quoc Viet Hung Nguyen; Alan Wee-Chung Liew; Shirui Pan; |
| 414 | Reinforcement Learning on Pre-Training Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The latter incentivizes reasoning capabilities with strong generalization, but is constrained by limited data availability due to its reliance on human annotation. To alleviate these issues, we propose Reinforcement Learning on Pre-Training data (RLPT), which combines the advantages of learning from general data and RL. |
Siheng Li; Kejiao Li; Zenan Xu; Guanhua Huang; Kun Li; Haoyuan Wu; Wujiajia; Zihao Zheng; Chenchen Zhang; Kun Shi; Xue Gong; Qi Yi; Ruibin Xiong; Tingqiang Xu; Yuhao Jiang; Jianfeng Yan; Yuyuan Zeng; Guanghui Xu; Jinbao Xue; Zhijiang xu; Zheng Fang; Shuai LI; Qibin Liu; Xiaoxue Li; Zhuoyu Li; Yangyu Tao; Fei Gao; Cheng Jiang; Bochao Wang; Kai Liu; Jianchen Zhu; Wai Lam; Bo Zhou; Di Wang; |
| 415 | CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We apply our framework to a half-dozen published attacks on the Pythia and OLMo family of models, from 70M to 7B parameters. |
Jeffrey George Wang; Jason Wang; Marvin Li; Seth Neel; |
| 416 | CoEvolve: Training LLM Agents Via Agent-Data Mutual Evolution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent’s evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent-data mutual evolution framework that enables LLM agents to improve through closed-loop, interaction-driven training. |
Shidong Yang; Ziyu Ma; Tongwen Huang; Yiming Hu; Yong Wang; Xiangxiang Chu; |
| 417 | KV-Embedding: Training-free Text Embedding Via Internal KV Re-routing in Decoder-only LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While LLMs are powerful embedding backbones, their application in training-free settings faces two structural challenges: causal attention restricts early tokens from accessing subsequent context, and the next-token prediction objective biases representations toward generation rather than semantic compression. To address these limitations, we propose KV-Embedding, a framework that activates the latent representation power of frozen LLMs. |
Yixuan Tang; Yi Yang; |
| 418 | MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these approaches often suffer from representation collapse and expert load imbalance, which negatively impact the potential of LLMs. To address these challenges, we propose a heterogeneous Mixture-of-Adapters (MoA) approach. |
Jie Cao; Tianwei Lin; Bo Yuan; Rolan Yan; Hongyang He; Wenqiao Zhang; Juncheng Li; Dongping Zhang; Siliang Tang; Yueting Zhuang; |
| 419 | WebAggregator: Enhancing Compositional Reasoning Capabilities of Deep Research Agent Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current agentic systems are often retrieval-heavy but reasoning-light, where success is predominantly determined by simple entity-seeking rather than the multi-step aggregation of scattered evidence. To address this, we propose a data synthesis pipeline WebAggregator, designed to shift the agentic paradigm from retrieval-centric to compositional aggregation. |
Rui Wang; Ce Zhang; Jun-Yu Ma; Jianshu Zhang; Hongru Wang; Yi Chen; Boyang Xue; Tianqing Fang; Zhisong Zhang; Hongming Zhang; Haitao Mi; Dong Yu; Kam-Fai Wong; |
| 420 | PAR: Training-Free Positional Perturbation and Attention Recycling for Faithful OCR Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We discover that the model’s reliance on visual grounding diminishes significantly as the generation length increases. To mitigate this, we propose PAR (Positional Perturbation and Attention Recycling), a training-free, inference-time intervention framework. |
Yao Yao; Manwen Liao; Weitian Zhang; Zuchao Li; Hai Zhao; |
| 421 | GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address these, we present GenesisFunc, an automated pipeline for generating FC training data. |
Hao-Xiang Xu; Chong Deng; Jiaqing Liu; Wen Wang; Qian Chen; Lujia Bao; Xiangang Li; Zhen-Hua Ling; |
| 422 | Multi-Granularity Semantic Revision for Large Language Model Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, existing distillation loss functions struggle to align the most informative part due to the complex output distributions of LLMs. To address these problems, we propose a multi-granularity semantic revision method for LLM distillation. |
Xiaoyu Liu; Yun Zhang; Wei Li; Simiao Li; Xudong Huang; Hanting Chen; Yehui Tang; Jie Hu; Zhiwei Xiong; Yunhe Wang; |
| 423 | LLM-ForcedAligner: A Non-Autoregressive and Accurate LLM-Based Forced Aligner for Multilingual and Long-Form Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To bridge the gap, we propose LLM-ForcedAligner, reformulating FA as a slot-filling paradigm: timestamps are treated as discrete indices, and special timestamp tokens are inserted as slots into the transcript. |
Bingshen Mu; Xian Shi; Xiong Wang; Hexin Liu; Jin Xu; Lei Xie; |
| 424 | FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present FineSteer, a novel steering framework that decomposes inference-time steering into two complementary stages—conditional steering and fine-grained vector synthesis—allowing fine-grained control over when and how to steer internal representations. |
Zixuan Weng; Jinghuai Zhang; Kunlin Cai; Ying Li; Peiran Wang; Yuan Tian; |
| 425 | Forest Before Trees: Latent Superposition for Efficient Visual Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Laser, a novel paradigm that reformulates visual deduction via Dynamic Windowed Alignment Learning. |
Yubo Wang; Juntian Zhang; Yichen Wu; Yankai Lin; Nils Lukas; Yuhan Liu; |
| 426 | Language Models Learn Universal Representations of Numbers and Here’s Why You Should Care Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Prior work has shown that large language models (LLMs) often converge to accurate input embedding for numbers, based on sinusoidal representations. In this work, we demonstrate that these representations are in fact strikingly systematic, to the point of being almost perfectly universal: different LLM families develop equivalent sinusoidal structures, and number representations are broadly interchangeable in a large swathe of experimental setups. |
Michal Štefánik; Timothee Mickus; Marek Kadlčík; Bertram Højer; Michal Spiegel; Raúl Vázquez; Aman Sinha; Josef Kuchař; Philipp Mondorf; Pontus Stenetorp; |
| 427 | From Individual to Common: An Early Exploration of Consensus in Non-verifiable Data for Balanced Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce Donkey, a high-quality, non-verifiable dataset where response pairs differ only by subtle nuances. |
Shangjian Yin; Zhouxing Shi; |
| 428 | The GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To address these, we introduce GaoYao, a comprehensive benchmark with 182. |
Yilun Liu; Chunguang Zhao; Mengyao Piao; Lingqi Miao; Shimin Tao; Minggui HE; Chenxin Liu; Zhang Li; Mahongxia; Jiaxin Guo; Chen Liu; Liqun Deng; Jiansheng Wei; Xiaojun Meng; Fanyi Du; Daimeng Wei; Yanghua Xiao; |
| 429 | MSEarth: A Multimodal Benchmark for Earth Science Phenomenon Discovery with MLLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing datasets often rely on synthetic data or simple figure-caption pairs, failing to capture the nuanced reasoning required for real-world applications. To address this, we introduce MSEarth, a multimodal scientific dataset and benchmark curated from high-quality, open-access publications. |
Xiangyu Zhao; Wanghan Xu; Bo Liu; Yuhao Zhou; Fenghua Ling; Ben Fei; Xiaoyu Yue; Lei Bai; Wenlong Zhang; Xiao-Ming Wu; |
| 430 | Mathematical Proof As A Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the high reported accuracy of these advanced models on popular datasets and reliance on purely numerical evaluation often mask their true reasoning shortcomings. To address this, we propose leveraging the inherent rigor and methodological complexity of mathematical proofs as a diagnostic tool to expose these hidden failures. |
Dadi Guo; Jiayu Liu; Zhiyuan Fan; Zhitao He; Haoran Li; Yuxin Li; Yumeng Wang; Yi R. Fung; |
| 431 | Arguments That Alter Minds: LLM Rationales Sway Human (and LLM) Notions of Plausibility Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We investigate the degree to which human (and LLM) plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by LLMs. |
Shramay Palta; Peter A. Rankel; Sarah Wiegreffe; Rachel Rudinger; |
| 432 | Learning Uncertainty from Sequential Internal Dispersion in Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, they suffer from strict assumptions on how hidden states should evolve across layers, and from information loss by solely focusing on last or mean tokens. To address these issues, we present Sequential Internal Variance Representation (SIVR), a supervised hallucination detection framework that leverages token-wise, layer-wise features derived from hidden states. |
Ponhvoan Srey; Xiaobao Wu; Cong-Duy T Nguyen; Anh Tuan Luu; |
| 433 | CUB: Benchmarking Context Utilisation Techniques for Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we develop CUB (Context Utilisation Benchmark) – the first comprehensive benchmark designed to help diagnose CMTs under diverse noisy context conditions within retrieval-augmented generation (RAG). |
Lovisa Hagström; Youna Kim; Haeun Yu; Sang-goo Lee; Richard Johansson; Hyunsoo Cho; Isabelle Augenstein; |
| 434 | CT-FineBench: A Diagnostic Fidelity Benchmark for Fine-Grained Evaluation of CT Report Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Conventional evaluation metrics offer only coarse measures of lexical overlap or entity matching and fail to reflect the granular diagnostic accuracy required for clinical use. To address this gap, we propose CT-FineBench, a benchmark built from CT-RATE and Merlin to evaluate the fine-grained factual consistency of CT reports. |
Ruifeng Yuan; Wanxing Chang; Weiwei Cao; Bowen Shi; Zhongyu Wei; Ling Zhang; Jianpeng Zhang; |
| 435 | The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We find that identical scenarios paired with different user profiles produce systematically divergent emotional interpretations. |
Xi Fang; Weijie Xu; Yuchong Zhang; Scott Nickleach; Stephanie Eckman; Chandan K. Reddy; |
| 436 | Revisiting A Pain in The Neck: A Semantic Reasoning Benchmark for Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present SemanticQA, an evaluation suite designed to assess language models (LMs) in semantic phrase processing tasks. |
Yang Liu; Hongming Li; Melissa Xiaohui Qin; Chao Huang; Qiankun Liu; |
| 437 | PAR2-RAG: Planned Active Retrieval and Reasoning for Multi-Hop Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Planned Active Retrieval and Reasoning RAG (PAR2-RAG), a training-free two-stage framework that separates coverage from commitment. |
Xingyu Li; Rongguang Wang; Yuying Wang; Mengqing Guo; Chenyang Li; Tao Sheng; Sujith Ravi; Dan Roth; |
| 438 | ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Yet existing medical Large Language Models (LLMs) predominantly follow a reactive paradigm, risking diagnostic errors by answering before seeking sufficient details. To bridge this gap, we propose ProMed, a reinforcement learning framework that transitions LLMs toward a proactive paradigm, enabling them to ask clinically valuable questions before decision-making. |
Hongxin Ding; Baixiang Huang; Yue Fang; Weibin Liao; Xinke Jiang; Jinyang Zhang; Yinghao Zhu; Zheng Li; Liantao Ma; Junfeng Zhao; Yasha Wang; |
| 439 | Measuring User’s Mental Models of Speech Translation in Human-AI Collaboration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper studies users’ mental models of speech translation systems through a new framework based on cross-lingual question answering, where users either accept MT output or request professional re-translation to answer questions based on the information presented in a foreign language. |
HyoJung Han; Nishant Balepur; Jordan Lee Boyd-Graber; Marine Carpuat; |
| 440 | Persona-E²: A Human-Grounded Dataset for Personality-Shaped Emotional Responses to Textual Events Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To bridge the gap, we introduce Persona-E² (Persona-Event2Emotion), a large-scale dataset grounded in annotated MBTI and Big Five traits to capture reader-based emotional variations across news, social media, and life narratives. |
Yuqin Yang; Haowu Zhou; Haoran Tu; Zhiwen Hui; Shiqi Yan; HaoYang Li; Dong She; Xianrong Yao; Yang Gao; Zhanpeng Jin; |
| 441 | What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that effectively mitigates label noise amplification. |
Dong Yan; Jian Liang; Yanbo Wang; Shuo Lu; Ran He; Tieniu Tan; |
| 442 | Mind The Gap in Cultural Alignment: Task-Aware Culture Management for Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose CultureManager, a novel pipeline for task-specific cultural alignment. |
Binchi Zhang; Xujiang Zhao; Jundong Li; Haifeng Chen; Zhengzhang Chen; |
| 443 | AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose AgentRouter, a framework that formulates multi-agent QA as a knowledge-graph–guided routing problem supervised by empirical performance signals. |
Zheyuan Zhang; Kaiwen Shi; Zhengqing Yuan; Zehong Wang; Tianyi Ma; Keerthiram Murugesan; Vincent Galassi; Chuxu Zhang; Yanfang Ye; |
| 444 | T⋆: Progressive Block Scaling for Masked Diffusion Language Models Through Trajectory Aware Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present T⋆, a simple TraceRL-based curriculum for progressive block-size scaling in masked diffusion language models (MDMs). |
Hanchen Xia; Baoyou Chen; Yutang Ge; Guojiang Zhao; Siyu Zhu; |
| 445 | A Survey of Reasoning-Intensive Retrieval: Progress and Challenges Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: Reasoning-Intensive Retrieval (RIR) targets retrieval settings where relevance is mediated by latent inferential links between a query and supporting evidence, rather than … |
Yiyang Wei; Tingyu Song; Siyue Zhang; Yilun Zhao; |
| 446 | OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While LLMs have shown promising capabilities in generating believable human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPeRA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. |
Ziyi Wang; Yuxuan Lu; Wenbo Li; Amirali Amini; Bo Sun; Yakov Bart; Weimin Lyu; Jiri Gesi; Tian Wang; Jing Huang; Yu Su; Upol Ehsan; Malihe Alikhani; Toby Jia-Jun Li; Lydia Chilton; Dakuo Wang; |
| 447 | Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To bridge the gap, we present Trajectory2Task, a verifiable data generation pipeline for studying tool use at scale under three realistic user scenarios: ambiguous intent, changing intent, and infeasible intent. |
Ziyi Wang; Yuxuan Lu; Yimeng Zhang; Pei Chen; Ziwei Dong; Jing Huang; Jiri Gesi; Xianfeng Tang; Chen Luo; Qun Liu; Yisi Sang; Hanqing Lu; Manling Li; Jin Lai; Dakuo Wang; |
| 448 | Can AI Be A Good Peer Reviewer? A Survey of Peer Review Process, Evaluation, and The Future Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this survey, we synthesize techniques for (i) peer review generation, including fine-tuning strategies, agent-based systems, RL-based methods, and emerging paradigms to enhance generation; (ii) after-review tasks including rebuttals, meta-review and revision aligned to reviews; and (iii) evaluation methods spanning human-centered, reference-based, LLM-based and aspect-oriented. |
Sihong Wu; Owen Jiang; Yilun Zhao; Tiansheng Hu; Yiling Ma; Kaiyan Zhang; Manasi Patwardhan; Arman Cohan; |
| 449 | Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by this gap, we demonstrate that steering vectors derived solely from text-only LLM backbones can effectively guide and enhance their multimodal counterparts, revealing a novel cross-modal transfer that enables reuse of existing interpretability tools. Using community-standard methods—Sparse Autoencoders (SAE), Mean Shift, and Linear Probing—we validate this transfer effect across diverse MLLM architectures and visual reasoning tasks. |
Woody Haosheng Gan; Deqing Fu; Julian Asilis; Ollie Liu; Vatsal Sharan; Robin Jia; Willie Neiswanger; |
| 450 | Evaluating Answer Leakage Robustness of LLM Tutors Against Adversarial Student Attacks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we study scenarios where students behave adversarially and aim to obtain the correct answer from the tutor. |
Jin Zhao; Marta Knežević; Tanja Käser; |
| 451 | ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose ContextLens, a semi-rule-based framework that leverages LLMs to ground the input context in the legal domain and explicitly identify both known and unknown factors for legal compliance. |
Haoran Li; Yulin Chen; Huihao Jing; Wenbin Hu; Tsz Ho Li; Chanhou Lou; Hong Ting Tsang; Sirui Han; Yangqiu Song; |
| 452 | Erasing Without Remembering: Implicit Knowledge Forgetting in Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we investigate knowledge forgetting in large language models with a focus on its generalisation—ensuring that models forget not only specific training samples but also related implicit knowledge. |
Huazheng Wang; Yongcheng Jing; Haifeng Sun; Yingjie Wang; Jingyu Wang; Jianxin Liao; Dacheng Tao; |
| 453 | SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. |
Jingyu Lu; Yuhan Wang; Fan Zhuo; Xize Cheng; Changhao Pan; Xueyi Pu; Yifu Chen; Chenyuhao Wen; Tianle Liang; Zhou Zhao; |
| 454 | Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Motivated by the theoretical insight that changes in entropy are governed by the covariance between token probabilities and their corresponding advantages, we propose a hyperparameter-free, covariance-weighted optimization method that dynamically down-weights extreme token-level updates via a Gaussian kernel. |
Cheng Wang; Qin Liu; Wenxuan Zhou; Muhao Chen; |
| 455 | Cognitive Alpha Mining Via LLM-Driven Code-Based Evolution Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although different in form, these paradigms share a key limitation: none can conduct broad, structured, and human-like exploration that balances logical consistency with creative leaps. To address this gap, we introduce the Cognitive Alpha Mining Framework (CogAlpha), which combines code-level alpha representation with LLM-driven reasoning and evolutionary search. |
Fengyuan Liu; Yi Huang; Sichun Luo; Yuqi Wang; Yazheng Yang; Xinye Li; Zefa Hu; Junlan Feng; Qi Liu; |
| 456 | LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods struggle with a fundamental trade-off: prioritizing input-language consistency severely hampers reasoning quality, while prioritizing reasoning often leads to unintended language drift toward English. We address this challenge with LANG, a novel framework that leverages language-conditioned hints to guide exploration in non-English reasoning tasks. |
Yuchun Fan; Bei Li; Peiguang Li; Yilin Wang; Yongyu Mu; Jian Yang; Xin Chen; Rongxiang Weng; Jingang Wang; Xunliang Cai; JingBo Zhu; Tong Xiao; |
| 457 | An Exploration of Mamba for Speech Self-Supervised Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While Mamba has demonstrated strong performance in language modeling, its potential as a speech self-supervised learning (SSL) model remains underexplored, with prior studies limited to isolated tasks. To address this, we explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. |
Tzu-Quan Lin; Heng-Cheng Kuo; Tzu-Chieh Wei; Hsi-Chun Cheng; Chun Wei Chen; Hsien-Fu Hsiao; Yu Tsao; Hung-yi Lee; |
| 458 | HSCodeComp: A Realistic and Expert-level Agent Benchmark for Hierarchical Rule Application Highlight: This introduces significant challenges in resolving logical dependencies and disambiguating vague boundaries. To bridge this gap, we introduce HSCodeComp, a novel benchmark derived from e-commerce, requiring agents to assign a unique 10-digit Harmonized System (HS) Code to products by aligning their fuzzy attributes with strict tariff classification rules. |
Tian Lan; Yiqian Yang; Qianghuai Jia; Li Zhu; Hui Jiang; Hang Zhu; Weihua Luo; Longyue Wang; |
| 459 | Powering Verifiable Learning Via Automated Evolutionary Data Synthesis Highlight: In this work, we introduce an evolutionary, task-agnostic, strategy-guided, executably-checkable data synthesis framework that, from minimal seed supervision, jointly synthesizes problems, diverse candidate solutions, and verification artifacts, and iteratively discovers strategies via a consistency-based evaluator that enforces agreement between human-annotated and strategy-induced checks. |
He Du; Bowen Li; Aijun Yang; Siyang He; Qipeng Guo; Kai Chen; Dacheng Tao; |
| 460 | Robust Tool Use Via Fission-GRPO: Learning to Recover from Execution Errors Highlight: In particular, standard reinforcement learning (RL) collapses rich failure experience into sparse negative rewards, while pre-collected error-correction datasets become mismatched to the policy’s evolving failure modes. To bridge this gap, we propose Fission-GRPO, a framework that converts execution errors into on-policy corrective supervision within the RL training loop. |
Zhiwei Zhang; Fei Zhao; Rui Wang; Zezhong Wang; Bin Liang; Jiakang Wang; Yao Hu; Shaosheng Cao; Kam-Fai Wong; |
| 461 | Psyche-R1: Towards Reliable Psychological LLMs Through Unified Empathy, Expertise, and Reasoning Highlight: Recent reasoning-augmented LLMs have achieved remarkable performance in mathematics and programming, while research in the psychological domain has predominantly emphasized emotional support and empathetic dialogue, with limited attention to reasoning mechanisms that are beneficial to generating accurate responses. Therefore, in this paper, we propose Psyche-R1, the first Chinese psychological LLM that jointly integrates empathy, psychological expertise, and reasoning, built upon a novel data curation pipeline. |
Chongyuan Dai; Jinpeng Hu; Hongchang Shi; Zhuo Li; Dan Guo; Xun Yang; Meng Wang; |
| 462 | Quantifying The Impact of Translation Errors on Multilingual LLM Evaluation Highlight: We find that span agreement is non-trivial on naturally occurring benchmark translations, and that target-side translation errors are consistently associated with measurable, percentage-point drops in translated accuracy even after controlling for English correctness and source-side anomalies. |
Klaudia Thellmann; Bernhard Stadler; Michael Färber; Jens Lehmann; |
| 463 | LearnerCoMPASS: Intelligent Tutoring System with Dynamic Cognitive Diagnosis and Multi-Model Path Planning Highlight: This paper introduces LearnerCoMPASS (Cognitive Multi-model Planning Adaptive System), an integrated, end-to-end framework for adaptive learning. |
Ziji Sheng; Guiyao Tie; Weidong Wang; Pan Zhou; Daizong Liu; |
| 464 | Less Languages, Less Tokens: An Efficient Unified Logic Cross-lingual Chain-of-Thought Reasoning Framework Highlight: Moreover, multilingual LLM representations vary strongly by language, hindering direct feature comparisons and effective pruning. To address this, we introduce UL-XCoT, the first efficient unified logic cross-lingual reasoning framework that minimizes redundancy in token usage and latency, yielding the greatest efficiency under limited sampling budgets during inference. |
Chenyuan Zhang; Qiguang Chen; Xie Chen; Zhuotao Tian; Bowen Xing; Meishan Zhang; Libo Qin; Baotian Hu; Min Zhang; |
| 465 | COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs Highlight: We present COMPASS (Company/Organization Policy Alignment Assessment), the first systematic framework for evaluating whether LLMs comply with organizational allowlist and denylist policies. |
Dasol Choi; DongGeon Lee; Brigitta Jesica Kartono; Helena Berndt; Taeyoun Kwon; Joonwon Jang; Haon Park; Hwanjo Yu; Minsuk Kahng; |
| 466 | USB: A Comprehensive and Unified Safety Evaluation Benchmark for Multimodal Large Language Models Highlight: Current benchmarks fail to provide reliable assessments due to limited risk coverage, insufficient scale, and the oversight of complex modality combinations (e.g., cross-modal risks). To address this, we introduce the Unified Safety Benchmark (USB), a comprehensive framework covering 61 risk categories across four distinct modality interactions. |
Baolin Zheng; Guanlin Chen; Qingyang Teng; Hongqiong Zhong; Yingshui Tan; Zhendong Liu; Weixun Wang; Jiaheng Liu; Jian Yang; Huiyun Jing; Jincheng Wei; Wenbo Su; Xiaoyong Zhu; Bo Zheng; Kaifu Zhang; |
| 467 | Beyond Surface Features: Advancing Medical Vision-Language Alignment Via Dynamic Evidence-Guided Preference Optimization Highlight: In this paper, we propose Dynamic Evidence-Guided Preference Optimization (DEPO), a new framework that enables evidence-aware and adaptive preference learning for Med-LVLMs. |
Zixuan Huang; Zhihong Zhu; Xiaolong Liu; Yanchao Hao; Manman Zhang; Zheng Wei; Bowen Xing; Xian Wu; Ye Li; Fen Miao; Yefeng Zheng; |
| 468 | TOWER+: Bridging Generality and Translation Specialization in Multilingual LLMs Highlight: In this paper, we introduce Tower+, a suite of models designed to deliver strong performance on both translation and multilingual general-purpose text capabilities. |
Ricardo Rei; Nuno M Guerreiro; José Pombal; João Alves; Amin Farajian; Pedro Teixeirinha; Andre Martins; |
| 469 | NL ⇒ Schedule: Evaluate Multitask Scheduling Capability of Large Language Models Highlight: Our evaluation of nine state-of-the-art LLMs reveals the limitations of different LLMs in procedure grounding and the strengths of advanced LLMs in global planning via local analysis. To address these shortcomings, we propose Mans, a novel multi-agent framework. |
Wenrui Liao; Weihong Du; Yi Li; Hongru Liang; Wenqiang Lei; |
| 470 | OLA: Output Language Alignment in Code-Switched LLM Interactions Highlight: We introduce OLA, a benchmark to evaluate LLMs’ Output Language Alignment in code-switched interactions. |
Juhyun Oh; Haneul Yoo; Faiz Ghifari Haznitrama; Alice Oh; |
| 471 | When Good OCR Is Not Enough: Benchmarking OCR Robustness for Retrieval-Augmented Generation Highlight: We introduce an OCR benchmark for industrial RAG systems covering 11 challenging document types, including extreme layouts, high-resolution pages, complex or watermarked backgrounds, historical documents with non-standard reading orders, visually decorated text, and documents containing tables and mathematical formulas. |
Lin Sun; Wangdexian; Jingang Huang; Linglin Zhang; Change Jia; Zhengwei Cheng; Xiangzheng Zhang; |
| 472 | CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents Highlight: This neglects a crucial capability: agents’ ability to devise and adjust cost-optimal plans in response to changing environments. To bridge this gap, we introduce **CostBench**, a scalable, cost-centric benchmark designed to evaluate agents’ economic reasoning and replanning abilities. |
Jiayu Liu; Cheng Qian; Zhaochen Su; Qing Zong; Shijue Huang; Bingxiang He; Yi R. Fung; |
| 473 | DPDV: Dual-Pathway and Dual-View Representation Learning for Bridging Information Asymmetry in Text-Video Retrieval Highlight: However, the asymmetry of cross-modal information poses a challenge to accurately establishing retrieval relationships. To overcome this challenge, we propose a novel video retrieval framework, termed the Dual-Pathway and Dual-View model (DPDV), which consists of the Dual-Pathway Partitioning Module (DPPM) for constructing features at an appropriate granularity and the Dual-View Interaction Module (DVIM) for performing effective feature interactions. |
Zequn Xie; Xin Liu; Fangming Feng; Boyun Zhang; Tao Jin; |
| 474 | Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies Highlight: Prior work has focused on the ability of LLMs to **identify** or **classify** fallacies, but their robustness against these fallacies in persuasive contexts remains largely unexplored. To address this gap, we introduce **LoFa** (Logical Fallacy), a comprehensive benchmark to evaluate LLM robustness against fallacies. |
Xudong Shen; li Yuan; Ye Chen; Xin Wu; Yi Cai; Zhiyong Wu; |
| 475 | GRAD: Generalizing RAG Adaptation with Decoding Highlight: To this end, we propose GRAD, an adaptive decoding-time framework that keeps the base generator fixed and composes small, objective-specific guidance at inference. |
Youngwon Lee; Seung-won Hwang; Zhewei Yao; Yuxiong He; |
| 476 | GAMBIT: A Gamified Jailbreak Framework for Multimodal Large Language Models Highlight: If a model can think like a human, can we influence its cognitive-stage decisions so that it proactively completes a jailbreak? To validate this idea, we propose GAMBIT (Gamified Adversarial Multimodal Breakout via Instructional Traps), a novel multimodal jailbreak framework that decomposes and reassembles harmful visual semantics, then constructs a gamified scene that drives the model to explore, reconstruct intent, and answer as part of winning the game. |
Xiangdong Hu; Yangyang Jiang; Qin Hu; Xiaojun Jia; |
| 477 | SCVQ: Sparse-Compensated Vector Quantization for Large Language Models Highlight: Moreover, these methods typically suffer from non-negligible performance degradation under ultra-low bitwidth regimes. To bridge this gap, we propose Sparse-Compensated Vector Quantization (SCVQ), a novel framework designed for high-efficiency LLM vector quantization. |
Zixuan Zhou; Yujun Diao; Zicheng Kong; Dehua Ma; Zhenbo Xu; Pei Pei Li; Zhaofeng He; |
| 478 | PIArena: A Platform for Prompt Injection Evaluation Highlight: For instance, many defenses initially reported as effective were later found to exhibit limited robustness on diverse datasets and attacks. To bridge this gap, we introduce PIArena, a unified and extensible platform for prompt injection evaluation that enables users to easily integrate state-of-the-art attacks and defenses and evaluate them across a variety of existing and new benchmarks. |
Runpeng Geng; Chenlong Yin; Yanting Wang; Ying Chen; Jinyuan Jia; |
| 479 | AT²PO: Agentic Turn-based Policy Optimization Via Tree Search Highlight: In this paper, we present AT²PO (**A**gentic **T**urn-based **P**olicy **O**ptimization via **T**ree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. |
Zefang Zong; Dingwei Chen; Yang Li; Qi Yi; Bo Zhou; Chengming Li; BO Qian; Peng Chen; Jie Jiang; |
| 480 | KARL: Reinforcement Learning for LLM Agents on Multi-Turn Knowledge-Intensive Agentic Tasks Highlight: We introduce KARL (Knowledge-Augmented Reinforcement Learning), a framework that enables LLM agents to dynamically explore structured knowledge sources through multi-turn interactions. |
Xueqiao Sun; Xiao Liu; Bowen Lv; Hanchen Zhang; Bohao Jing; Zehan Qi; Yifan Xu; Yuxiao Dong; Jie Tang; |
| 481 | Domain Generalizable AI Guardrails with Augmented Policy Training Highlight: We propose Augmented Policy Training (APT), a training recipe that enhances guardrail adaptability to unseen policies by using a suite of policy perturbation strategies during training to reduce overfitting and increase generalization. |
Minqian Liu; Ioana Baldini; David Rabinowitz; David S Rosenberg; Sebastian Gehrmann; Mark Dredze; |
| 482 | From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution Highlight: In this work, we investigate the efficiency of code tokenization, in particular from the perspective of data source diversity. |
Pavel Chizhov; Egor Bogomolov; Ivan P. Yamshchikov; |
| 483 | Illusions of The Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation Highlight: In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023–2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1. |
Katelyn X. Mei; Yi-Li Hsu; Minjoon Choi; Zongwan Cao; Chenjun Xu; Bingbing Wen; Su Lin Blodgett; Lucy Lu Wang; |
| 484 | AutoSchemaKG: Autonomous Knowledge Graph Construction Through Dynamic Schema Induction from Web-Scale Corpora Highlight: We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. |
Jiaxin Bai; Wei Fan; Qi Hu; Qing Zong; Chunyang Li; Hong Ting Tsang; Hongyu Luo; Yauwai Yim; Haoyu Huang; Xiao Zhou; Feng Qin; Tianshi Zheng; Xi Peng; Xin Yao; Huiwen Yang; Leijie Wu; JI Yi; Gong Zhang; Renhai Chen; Yangqiu Song; |
| 485 | Can LLM Safety Be Ensured By Constraining Parameter Regions? Highlight: Large language models (LLMs) are often assumed to contain “safety regions” – parameter subsets whose modification directly influences safety behaviors. |
Zongmin Li; Jian Su; Farah Benamara; Aixin Sun; |
| 486 | CODERL+: Improving Code Generation Via Reinforcement with Execution Semantics Alignment Highlight: In this paper, we propose CODERL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation. |
Xue Jiang; Yihong Dong; Mengyang Liu; Deng Hongyi; Tian Wang; Yongding Tao; Zhi Jin; Wenpin Jiao; Ge Li; |
| 487 | Incomplete In-context Learning Highlight: To address IICL, we propose Iterative Judgments and Integrated Prediction (IJIP), a framework with train-free and train-based variants. |
Wenqiang Wang; Wen Yujia; Yan Xiao; Zhifeng Chen; Yangshijie Zhang; Peng Chen; Mingbo Yang; Xiaochun Cao; |
| 488 | Seeing But Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts Highlight: We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. |
Haolei Xu; Haiwen Hong; Hongxing Li; Rui Zhou; Yang Zhang; Longtao Huang; Hui Xue; Yongliang Shen; Weiming Lu; Yueting Zhuang; |
| 489 | Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures Highlight: This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. |
Yutong Gao; Qinglin Meng; Yuan Zhou; Liangming Pan; |
| 490 | Observations and Remedies for Large Language Model Bias in Self-Consuming Performative Loop Highlight: For example, if a model continues to underserve users from a group, less query data will be collected from this particular demographic of users. In this study, we introduce the concept of Self-Consuming Performative Loop (SCPL) and investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. |
Yaxuan Wang; Zhongteng Cai; Yujia Bao; Xueru Zhang; Yang Liu; |
| 491 | Select Before Use: On The Importance of Reference Model Selection in Preference Alignment Highlight: To this end, we propose RewardRank, a simple, effective, training-free metric for estimating initial implicit alignment between reference model and preference objective. |
Muyang Li; Runze Wu; Xiangyu Zhao; Bo Han; Daoyi Dong; Tongliang Liu; |
| 492 | LongTutor: Benchmarking Large Language Models for Long-term Personalized Tutoring Highlight: However, existing evaluations predominantly focus on isolated, short-term interactions, overlooking the inherently long-term nature of learning. To bridge this gap, we introduce LongTutor, a benchmark for long-term personalized tutoring grounded in formative assessment theory. |
Ning Li; Zheng Zhang; Zhenya Huang; Rui Li; Yi Zhan; Yinbo Luo; Qi Liu; Enhong Chen; |
| 493 | CodeEvo: Interaction-Driven Synthesis of Code-centric Data Through Hybrid and Iterative Feedback Highlight: We propose CodeEvo, a dual-agent architecture comprising a Coder for iterative solution synthesis and a Reviewer to orchestrate the generation trajectory. |
Qiushi Sun; Jingyang Gong; Lei Li; Qipeng Guo; Fei Yuan; |
| 494 | TeRA: Vector-based Random Tensor Network for High-Rank Adaptation of Large Language Models Highlight: However, these approaches come with a trade-off, as achieving the expressivity of high-rank weight updates typically comes at the cost of sacrificing the extreme parameter efficiency offered by vector-based techniques. To address this issue, we propose a vector-based random Tensor network for high-Rank Adaptation (TeRA), a novel PEFT method that achieves high-rank weight updates while retaining the parameter efficiency of vector-based PEFT adapters. |
Yuxuan Gu; Wuyang Zhou; Giorgos Iacovides; Danilo Mandic; |
| 495 | FinKario: Event-Enhanced Automated Construction of Financial Knowledge Graph Highlight: However, two key challenges limit their effectiveness: (1) the rapid evolution of market events often outpaces the slow update cycles of existing knowledge bases, (2) the long-form and unstructured nature of financial reports further hinders timely and context-aware integration by LLMs. To address these challenges, we tackle both data and methodological aspects. |
Xiang Li; Penglei Sun; Wanyun Zhou; Zikai Wei; Yongqi Zhang; Xiaowen Chu; |
| 496 | I²B-LPO: Latent Policy Optimization Via Iterative Information Bottleneck Highlight: To this end, we propose Latent Policy Optimization via Iterative Information Bottleneck (I²B-LPO), which shifts from statistical perturbation of token distributions to topological branching of reasoning trajectories. |
Huilin Deng; Hongchen Luo; Yue Zhu; Long Li; Zhuoyue Chen; Xinghao Zhao; Ming LI; Chuyang Zhao; Jihai Zhang; MengChang Wang; Yang Cao; Yu Kang; |
| 497 | Empowering GUI Agents Via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning Highlight: While small open-source MLLMs are cost-efficient and privacy-preserving compared with commercial large models, they suffer from weak planning and limited cross-website generalization. To address these limitations, we introduce the planning experience exploration and utilization (PEEU) method, which autonomously explores environments to discover experiences and utilizes hindsight experience to synthesize strictly aligned, high-level training data. |
Tianyi Men; Zhuoran Jin; Pengfei Cao; Yubo Chen; Kang Liu; Jun Zhao; |
| 498 | CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks Highlight: We introduce CodeJudgeBench, a benchmark explicitly designed to evaluate LLM-as-a-Judge models across three critical coding tasks: code generation, code repair, and unit test generation. |
Hongchao Jiang; Yiming Chen; Yushi Cao; Hung-yi Lee; Robby T. Tan; |
| 499 | LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification Highlight: Specifically, adapting these methods to long contexts presents three key challenges: (1) the excessive memory demands posed by draft models due to large Key-Value (KV) cache; (2) performance degradation resulting from the mismatch between short-context training and long-context inference; and (3) inefficiencies in tree attention mechanisms when managing long token sequences. This work introduces LongSpec, a framework that addresses these challenges through three core innovations: a memory-efficient draft model with a constant-sized KV cache; novel position indices that mitigate the training–inference mismatch; and an attention aggregation strategy that combines fast prefix computation with standard tree attention to enable efficient decoding. |
Penghui Yang; Cunxiao Du; Fengzhuo Zhang; Haonan Wang; Tianyu Pang; Chao Du; Bo An; |
| 500 | RATE: Reviewer Profiling and Annotation-free Training for Expertise Ranking in Peer Review Systems Highlight: Reviewer assignment is increasingly critical yet challenging in the LLM era, where rapid topic shifts render many pre-2023 benchmarks outdated and where proxy signals poorly reflect true reviewer familiarity. We address this evaluation bottleneck by introducing LR-bench, a high-fidelity, up-to-date benchmark curated from 2024–2025 AI/NLP manuscripts with five-level self-assessed familiarity ratings collected via a large-scale email survey, yielding 1,055 expert-annotated paper–reviewer–score annotations. |
Weicong Liu; Zixuan Yang; Yibo Zhao; Xiang Li; |
This table only includes 500 papers selected by our daily digest algorithm. To continue with the full list (~2,400 papers), please visit Paper Digest: ACL-2026 (Full List).