Paper Digest: ACL 2025 Papers & Highlights
Note: ACL-2025 (long, short, and industry tracks) accepted more than 1,800 papers; this page includes only 500 of them, selected by our daily paper digest algorithm. Interested users can instead read all ~1,800 ACL-2025 papers on a separate page, which may take some time to load.
To search for papers presented at ACL-2025 on a specific topic, please use the search by venue (ACL-2025) service. To summarize the latest research published at ACL-2025 on a specific topic, you can use the review by venue (ACL-2025) service. If you prefer to browse papers by author, we provide a comprehensive list of ~8,500 authors (ACL-2025). Using this year’s data, our system also generates a report on recent natural language processing topics. Additionally, you may want to explore our “Best Paper” Digest (ACL), which lists the most influential ACL papers since 1981.
We’ve developed a service – ACL-2025 Research – that synthesizes the latest findings from ACL 2025 into comprehensive reports. We encourage interested users to use it to create tailored reports on other emerging topics.
As a pioneer in the field since 2018, Paper Digest has curated thousands of such lists, drawing on years of accumulated data across decades of conferences and research topics. To ensure you never miss a breakthrough, our daily service sifts through tens of thousands of new papers, clinical trials, news articles, and community posts – delivering only what matters most to your specific interests. Beyond discovery, Paper Digest offers built-in research tools to help users read articles, write articles, get answers, conduct literature reviews, and generate research reports more efficiently.
Paper Digest Team
New York City, New York, 10017
TABLE 1: Paper Digest: ACL 2025 Papers & Highlights
| # | Paper | Author(s) |
|---|---|---|
| 1 | Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. Highlight: In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. | Benjamin Warner; Antoine Chaffin; Benjamin Clavié; Orion Weller; Oskar Hallström; Said Taghadouini; Alexis Gallagher; Raja Biswas; Faisal Ladhak; Tom Aarsen; Griffin Thomas Adams; Jeremy Howard; Iacopo Poli |
| 2 | Demons in The Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models. Highlight: Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a global-batch to loosen this constraint. | Zihan Qiu; Zeyu Huang; Bo Zheng; Kaiyue Wen; Zekun Wang; Rui Men; Ivan Titov; Dayiheng Liu; Jingren Zhou; Junyang Lin |
| 3 | ProcessBench: Identifying Process Errors in Mathematical Reasoning. Highlight: In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. | Chujie Zheng; Zhenru Zhang; Beichen Zhang; Runji Lin; Keming Lu; Bowen Yu; Dayiheng Liu; Jingren Zhou; Junyang Lin |
| 4 | InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training. Highlight: In this paper, we introduce a simple and scalable training method called InSerter, which stands for Interleaved Speech-Text Representation Pre-training. | Dingdong Wang; Jin Xu; Ruihang Chu; Zhifang Guo; Xiong Wang; Jincenzi Wu; Dongchao Yang; Shengpeng Ji; Junyang Lin |
| 5 | Qwen2.5-xCoder: Multi-Agent Collaboration for Multilingual Code Instruction Tuning. Highlight: To bridge the gap among different programming languages, we introduce a novel multi-agent collaboration framework to enhance multilingual instruction tuning for code LLMs, where multiple language-specific intelligent agent components with generation memory work together to transfer knowledge from one language to another efficiently and effectively. | Jian Yang; Wei Zhang; Yibo Miao; Shanghaoran Quan; Zhenhe Wu; Qiyao Peng; Liqun Yang; Tianyu Liu; Zeyu Cui; Binyuan Hui; Junyang Lin |
| 6 | Analyzing and Mitigating Inconsistency in Discrete Speech Tokens for Neural Codec Language Models. Highlight: In this paper, we quantitatively analyze the DRI phenomenon within popular audio tokenizers such as EnCodec. | Wenrui Liu; Zhifang Guo; Jin Xu; Yuanjun Lv; Yunfei Chu; Zemin Liu; Junyang Lin |
| 7 | Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. Highlight: We present NSA, a Natively trained Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. | Jingyang Yuan; Huazuo Gao; Damai Dai; Junyu Luo; Liang Zhao; Zhengyan Zhang; Zhenda Xie; Yuxing Wei; Lean Wang; Zhiping Xiao; Yuqing Wang; Chong Ruan; Ming Zhang; Wenfeng Liang; Wangding Zeng |
| 8 | Towards Effective Extraction and Evaluation of Factual Claims. Highlight: However, the lack of a standardized evaluation framework impedes assessment and comparison of claim extraction methods. To address this gap, we propose a framework for evaluating claim extraction in the context of fact-checking along with automated, scalable, and replicable methods for applying this framework, including novel approaches for measuring coverage and decontextualization. | Dasha Metropolitansky; Jonathan Larson |
| 9 | MPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding. Highlight: However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory and slower inference times, particularly in multi-page document comprehension. In this work, to address these challenges, we propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens, guided by low-resolution global visual features. | Anwen Hu; Haiyang Xu; Liang Zhang; Jiabo Ye; Ming Yan; Ji Zhang; Qin Jin; Fei Huang; Jingren Zhou |
| 10 | Improve Vision Language Model Chain-of-thought Reasoning. Highlight: In this work, we show that training VLM on short answers leads to poor generalization on reasoning tasks that require more detailed explanations. | Ruohong Zhang; Bowen Zhang; Yanghao Li; Haotian Zhang; Zhiqing Sun; Zhe Gan; Yinfei Yang; Ruoming Pang; Yiming Yang |
| 11 | BIG-Bench Extra Hard. Highlight: State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. | Mehran Kazemi; Bahare Fatemi; Hritik Bansal; John Palowitch; Chrysovalantis Anastasiou; Sanket Vaibhav Mehta; Lalit K Jain; Virginia Aglietti; Disha Jindal; Peter Chen; Nishanth Dikkala; Gladys Tyen; Xin Liu; Uri Shalit; Silvia Chiappa; Kate Olszewska; Yi Tay; Vinh Q. Tran; Quoc V Le; Orhan Firat |
| 12 | ACECODER: Acing Coder RL Via Automated Test-Case Synthesis. Highlight: Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. | Huaye Zeng; Dongfu Jiang; Haozhe Wang; Ping Nie; Xiaotong Chen; Wenhu Chen |
| 13 | Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation. Highlight: However, translation introduces language bias and carries over cultural and regional assumptions from the original questions – often testing knowledge irrelevant to the target audience. In this work, we highlight the extent and impact of these biases and present a multilingual evaluation framework that aims to mitigate them through improved translations and annotation practices. | Shivalika Singh; Angelika Romanou; Clémentine Fourrier; David Ifeoluwa Adelani; Jian Gang Ngui; Daniel Vila-Suero; Peerat Limkonchotiwat; Kelly Marchisio; Wei Qi Leong; Yosephine Susanto; Raymond Ng; Shayne Longpre; Sebastian Ruder; Wei-Yin Ko; Antoine Bosselut; Alice Oh; Andre Martins; Leshem Choshen; Daphne Ippolito; Enzo Ferrante; Marzieh Fadaee; Beyza Ermis; Sara Hooker |
| 14 | KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding. Highlight: Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value (KV) cache during inference has emerged as a primary efficiency bottleneck, both in aspects of memory consumption and data transfer bandwidth limitations. To address these challenges, we propose a paradigm called KV-Latent. | Shi Luohe; Zuchao Li; Lefei Zhang; Baoyuan Qi; Liu Guoming; Hai Zhao |
| 15 | Evaluating Language Models As Synthetic Data Generators. Highlight: While prior works have focused on developing effective data generation methods, they lack systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs’ data generation abilities. | Seungone Kim; Juyoung Suk; Xiang Yue; Vijay Viswanathan; Seongyun Lee; Yizhong Wang; Kiril Gashteovski; Carolin Lawrence; Sean Welleck; Graham Neubig |
| 16 | HalluLens: LLM Hallucination Benchmark. Highlight: This paper introduces a comprehensive hallucination benchmark HalluLens, incorporating both extrinsic and intrinsic evaluation tasks, built upon a clear taxonomy of hallucination. | Yejin Bang; Ziwei Ji; Alan Schelten; Anthony Hartshorn; Tara Fowler; Cheng Zhang; Nicola Cancedda; Pascale Fung |
| 17 | CoT-Valve: Length-Compressible Chain-of-Thought Tuning. Highlight: We introduce a new tuning and inference strategy named CoT-Valve, designed to allow models to generate reasoning chains of varying lengths. | Xinyin Ma; Guangnian Wan; Runpeng Yu; Gongfan Fang; Xinchao Wang |
| 18 | Model Extrapolation Expedites Alignment. Highlight: Motivated by the observation that alignment training typically involves only small parameter changes without injecting new knowledge into models, we propose a straightforward method called ExPO (model extrapolation) to expedite LLMs’ alignment with human preferences. | Chujie Zheng; Ziqi Wang; Heng Ji; Minlie Huang; Nanyun Peng |
| 19 | RPO: Retrieval Preference Optimization for Robust Retrieval-Augmented Generation. Highlight: To this end, we introduce **R**etrieval **P**reference **O**ptimization (RPO), a lightweight and effective alignment method to adaptively leverage multi-source knowledge based on retrieval relevance. | Shi-Qi Yan; Quan Liu; Zhen-Hua Ling |
| 20 | MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark. Highlight: This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. | Xiang Yue; Tianyu Zheng; Yuansheng Ni; Yubo Wang; Kai Zhang; Shengbang Tong; Yuxuan Sun; Botao Yu; Ge Zhang; Huan Sun; Yu Su; Wenhu Chen; Graham Neubig |
| 21 | LocAgent: Graph-Guided LLM Agents for Code Localization. Highlight: We introduce LocAgent, a framework that addresses code localization through a graph-guided agent. | Zhaoling Chen; Robert Tang; Gangda Deng; Fang Wu; Jialong Wu; Zhiwei Jiang; Viktor Prasanna; Arman Cohan; Xingyao Wang |
| 22 | M³GQA: A Multi-Entity Multi-Hop Multi-Setting Graph Question Answering Benchmark. Highlight: In order to construct diverse data with semantically correct ground-truth reasoning paths, we introduce a novel reasoning-driven four-step data construction method, including tree sampling, reasoning path backtracking, query creation, and multi-stage refinement and filtering. | Boci Peng; Yongchao Liu; Xiaohe Bo; Jiaxin Guo; Yun Zhu; Xuanbo Fan; Chuntao Hong; Yan Zhang |
| 23 | How to Train Long-Context Language Models (Effectively). Highlight: We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. | Tianyu Gao; Alexander Wettig; Howard Yen; Danqi Chen |
| 24 | Extending LLM Context Window with Adaptive Grouped Positional Encoding: A Training-Free Method. Highlight: In this paper, we propose **Ada**ptive **Gro**uped **P**ositional **E**ncoding (AdaGroPE), a training-free, plug-and-play method to enhance long-context understanding in existing LLMs. | Xinhao Xu; Jiaxin Li; Hui Chen; Zijia Lin; Jungong Han; Guiguang Ding |
| 25 | TheoremExplainAgent: Towards Video-based Multimodal Explanations for LLM Theorem Understanding. Highlight: In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. | Max Ku; Cheuk Hei Chong; Jonathan Leung; Krish Shah; Alvin Yu; Wenhu Chen |
| 26 | F5-TTS: A Fairytaler That Fakes Fluent and Faithful Speech with Flow Matching. Highlight: This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). | Yushen Chen; Zhikang Niu; Ziyang Ma; Keqi Deng; Chunhui Wang; JianZhao JianZhao; Kai Yu; Xie Chen |
| 27 | LongBench V2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks. Highlight: This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. | Yushi Bai; Shangqing Tu; Jiajie Zhang; Hao Peng; Xiaozhi Wang; Xin Lv; Shulin Cao; Jiazheng Xu; Lei Hou; Yuxiao Dong; Jie Tang; Juanzi Li |
| 28 | MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale. Highlight: These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. | Jiawei Guo; Tianyu Zheng; Yizhi Li; Yuelin Bai; Bo Li; Yubo Wang; King Zhu; Graham Neubig; Wenhu Chen; Xiang Yue |
| 29 | RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework. Highlight: This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. | Kunlun Zhu; Yifan Luo; Dingling Xu; Yukun Yan; Zhenghao Liu; Shi Yu; Ruobing Wang; Shuo Wang; Yishan Li; Nan Zhang; Xu Han; Zhiyuan Liu; Maosong Sun |
| 30 | HateDay: Insights from A Global Hate Speech Dataset Representative of A Day on Twitter. Highlight: We introduce HateDay, the first global hate speech dataset representative of social media settings, constructed from a random sample of all tweets posted on September 21, 2022 and covering eight languages and four English-speaking countries. | Manuel Tonneau; Diyi Liu; Niyati Malhotra; Scott A. Hale; Samuel Fraiberger; Victor Orozco-Olvera; Paul Röttger |
| 31 | Cramming 1568 Tokens Into A Single Vector and Back Again: Exploring The Limits of Embedding Space Capacity. Highlight: In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. | Yuri Kuratov; Mikhail Arkhipov; Aydar Bulatov; Mikhail Burtsev |
| 32 | Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement. Highlight: However, most existing methods and datasets remain focused on a narrow spectrum of tasks, such as forecasting or anomaly detection. To bridge this gap, we introduce Time Series Multi-Task Question Answering (Time-MQA), a unified framework that enables natural language queries across multiple time series tasks – numerical analytical tasks and open-ended question answering with reasoning. | Yaxuan Kong; Yiyuan Yang; Yoontae Hwang; Wenjie Du; Stefan Zohren; Zhangyang Wang; Ming Jin; Qingsong Wen |
| 33 | Towards Context-Robust LLMs: A Gated Representation Fine-tuning Approach. Highlight: Specifically, context-robust LLMs should rely on external context only when lacking internal knowledge, identify contradictions between internal and external knowledge, and disregard unhelpful contexts. To achieve this goal, we introduce Grft, a lightweight and plug-and-play gated representation fine-tuning approach. | Shenglai Zeng; Pengfei He; Kai Guo; Tianqi Zheng; Hanqing Lu; Yue Xing; Hui Liu |
| 34 | LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis. Highlight: In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0. | Qingkai Fang; Yan Zhou; Shoutao Guo; Shaolei Zhang; Yang Feng |
| 35 | Multilingual Arbitration: Optimizing Data Pools to Accelerate Multilingual Progress. Highlight: In this study, we propose multilingual arbitration, which exploits performance variations among multiple models for each language. | Ayomide Odumakinde; Daniel D’souza; Pat Verga; Beyza Ermis; Sara Hooker |
| 36 | AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection. Highlight: In this paper, we propose AGrail, a lifelong agent guardrail to enhance LLM agent safety, which features adaptive safety check generation, effective safety check optimization, and tool compatibility & flexibility. | Weidi Luo; Shenghong Dai; Xiaogeng Liu; Suman Banerjee; Huan Sun; Muhao Chen; Chaowei Xiao |
| 37 | EffiVLM-BENCH: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models. Highlight: In this work, we systematically evaluate mainstream acceleration techniques for LVLMs, categorized into token and parameter compression. | Zekun Wang; MingHua Ma; Zexin Wang; Rongchuan Mu; Liping Shan; Ming Liu; Bing Qin |
| 38 | Agentic Knowledgeable Self-awareness. Highlight: Specifically, we propose KnowSelf, a data-centric approach that applies agents with knowledgeable self-awareness like humans. | Shuofei Qiao; Zhisong Qiu; Baochang Ren; Xiaobin Wang; Xiangyuan Ru; Ningyu Zhang; Xiang Chen; Yong Jiang; Pengjun Xie; Fei Huang; Huajun Chen |
| 39 | AgentGym: Evaluating and Training Large Language Model-based Agents Across Diverse Environments. Highlight: However, the community lacks a unified interactive framework that covers diverse environments for comprehensive evaluation of agents, and enables exploration and learning for their self-improvement. To address this, we propose AgentGym, a framework featuring 7 real-world scenarios, 14 environments, and 89 tasks for unified, real-time, and concurrent agent interaction. | Zhiheng Xi; Yiwen Ding; Wenxiang Chen; Boyang Hong; Honglin Guo; Junzhe Wang; Xin Guo; Dingwen Yang; Chenyang Liao; Wei He; Songyang Gao; Lu Chen; Rui Zheng; Yicheng Zou; Tao Gui; Qi Zhang; Xipeng Qiu; Xuanjing Huang; Zuxuan Wu; Yu-Gang Jiang |
| 40 | Autoregressive Speech Synthesis Without Vector Quantization. Highlight: We present MELLE, a novel continuous-valued token based language modeling approach for text-to-speech synthesis (TTS). | Lingwei Meng; Long Zhou; Shujie Liu; Sanyuan Chen; Bing Han; Shujie Hu; Yanqing Liu; Jinyu Li; Sheng Zhao; Xixin Wu; Helen M. Meng; Furu Wei |
| 41 | Rethinking Semantic Parsing for Large Language Models: Enhancing LLM Performance with Semantic Hints. Highlight: In this paper, our empirical findings reveal that, unlike smaller models, directly adding semantic parsing results into LLMs reduces their performance. To overcome this, we propose SENSE, a novel prompting approach that embeds semantic hints within the prompt. | Kaikai An; Shuzheng Si; Helan Hu; Haozhe Zhao; Yuchi Wang; Qingyan Guo; Baobao Chang |
| 42 | Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning. Highlight: In this work, we introduce **MCLM**, a multilingual math benchmark featuring competition-level problems in 55 languages. | Guijin Son; Jiwoo Hong; Hyunwoo Ko; James Thorne |
| 43 | Byte Latent Transformer: Patches Scale Better Than Tokens. Highlight: We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. | Artidoro Pagnoni; Ramakanth Pasunuru; Pedro Rodriguez; John Nguyen; Benjamin Muller; Margaret Li; Chunting Zhou; Lili Yu; Jason E Weston; Luke Zettlemoyer; Gargi Ghosh; Mike Lewis; Ari Holtzman; Srini Iyer |
| 44 | Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models Via A Multi-Paradigm Perspective. Highlight: In this paper, we introduce Chain-of-Reasoning (CoR), a novel unified framework that integrates multiple reasoning paradigms (Natural Language Reasoning (NLR), Algorithmic Reasoning (AR), and Symbolic Reasoning (SR)) to enable synergistic collaboration. | Yiyao Yu; Yuxiang Zhang; Dongdong Zhang; Xiao Liang; Hengyuan Zhang; Xingxing Zhang; Mahmoud Khademi; Hany Hassan Awadalla; Junjie Wang; Yujiu Yang; Furu Wei |
| 45 | Sticking to The Mean: Detecting Sticky Tokens in Text Embedding Models. Highlight: These tokens, when repeatedly inserted into sentences, pull sentence similarity toward a certain value, disrupting the normal distribution of embedding distances and degrading downstream performance. In this paper, we systematically investigate such anomalous tokens, formally defining them and introducing an efficient detection method, Sticky Token Detector (STD), based on sentence and token filtering. | Kexin Chen; Dongxia Wang; Yi Liu; Haonan Zhang; Wenhai Wang |
| 46 | AntiLeakBench: Preventing Data Contamination By Automatically Constructing Benchmarks with Updated Real-World Knowledge. Highlight: However, they fail to guarantee contamination-free evaluation as the newly collected data may contain pre-existing knowledge, and their benchmark updates rely on intensive human labor. To address these issues, we in this paper propose AntiLeak-Bench, an automated anti-leakage benchmarking framework. | Xiaobao Wu; Liangming Pan; Yuxi Xie; Ruiwen Zhou; Shuai Zhao; Yubo Ma; Mingzhe Du; Rui Mao; Anh Tuan Luu; William Yang Wang |
| 47 | HelpSteer3: Human-Annotated Feedback and Edit Data to Empower Inference-Time Scaling in Open-Ended General-Domain Tasks. Highlight: To this end, we collect HelpSteer3 data to train dedicated Feedback and Edit Models that are capable of performing inference-time scaling for open-ended general-domain tasks. | Zhilin Wang; Jiaqi Zeng; Olivier Delalleau; Daniel Egert; Ellie Evans; Hoo-Chang Shin; Felipe Soares; Yi Dong; Oleksii Kuchaiev |
| 48 | HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model. Highlight: Inspired by human problem-solving strategies, this paper introduces HiAgent, a framework that leverages subgoals as memory chunks to manage the working memory of LLM-based agents hierarchically. | Mengkang Hu; Tianxing Chen; Qiguang Chen; Yao Mu; Wenqi Shao; Ping Luo |
| 49 | LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters Through Modality Linear Representation-Steering. Highlight: To this end, we introduce Modality Linear Representation-Steering (MoReS), which re-balances intrinsic modalities by steering visual representations through linear transformations in the visual subspace across each model layer. | Jinhe Bi; Yujun Wang; Haokun Chen; Xun Xiao; Artur Hecker; Volker Tresp; Yunpu Ma |
| 50 | MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark. Highlight: However, the open-source nature of these benchmarks and the broad sources of training data for LLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation. To alleviate this issue, we propose the contamination-free MCQ benchmark called MMLU-CF, which reassesses LLMs’ understanding of world knowledge by averting both unintentional and malicious data contamination. | Qihao Zhao; Yangyu Huang; Tengchao Lv; Lei Cui; Qinzheng Sun; Shaoguang Mao; Xin Zhang; Ying Xin; Qiufeng Yin; Scarlett Li; Furu Wei |
| 51 | CulturalBench: A Robust, Diverse and Challenging Benchmark for Measuring LMs’ Cultural Knowledge Through Human-AI Red-Teaming. Highlight: We introduce CulturalBench: a set of 1,696 human-written and human-verified questions to assess LMs’ cultural knowledge, covering 45 global regions including underrepresented ones like Bangladesh, Zimbabwe, and Peru. | Yu Ying Chiu; Liwei Jiang; Bill Yuchen Lin; Chan Young Park; Shuyue Stella Li; Sahithya Ravi; Mehar Bhatia; Maria Antoniak; Yulia Tsvetkov; Vered Shwartz; Yejin Choi |
| 52 | AndroidGen: Building An Android Language Agent Under Data Scarcity. Highlight: On the other hand, existing LLMs exhibit inadequate completion rates and need a robust data filtration strategy. Given these challenges, we develop a framework called AndroidGen to enhance the capabilities of LLM-based agents under data scarcity. | Hanyu Lai; Junjie Gao; Xiao Liu; Yifan Xu; Shudan Zhang; Yuxiao Dong; Jie Tang |
| 53 | A Survey of Post-Training Scaling in Large Language Models. Highlight: This paper presents a comprehensive survey of post-training scaling, an emergent paradigm aiming to relieve the limitations of traditional pre-training by focusing on the alignment phase, which traditionally accounts for a minor fraction of the total training computation. | Hanyu Lai; Xiao Liu; Junjie Gao; Jiale Cheng; Zehan Qi; Yifan Xu; Shuntian Yao; Dan Zhang; Jinhua Du; Zhenyu Hou; Xin Lv; Minlie Huang; Yuxiao Dong; Jie Tang |
| 54 | Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs. Highlight: We study failures of grounding in the Ubuntu IRC dataset, where participants use text-only communication to resolve technical issues. | Rupak Sarkar; Neha Srikanth; Taylor Pellegrin; Rachel Rudinger; Claire Bonial; Philip Resnik |
| 55 | The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs. Highlight: In this paper, we propose a novel statistical procedure, the Alternative Annotator Test (alt-test), that requires only a modest subset of annotated examples to justify using LLM annotations. | Nitay Calderon; Roi Reichart; Rotem Dror |
| 56 | Why Prompt Design Matters and Works: A Complexity Analysis of Prompt Search Space in LLMs. Highlight: In this paper, we provide a theoretical framework that explains why some prompts succeed while others fail. | Xiang Zhang; Juntai Cao; Chenyu You; Dujian Ding |
| 57 | Disentangling Memory and Reasoning Ability in Large Language Models. Highlight: In this paper, we propose a novel language model inference paradigm that decomposes the complex inference process into two distinct and clear actions: (1) memory recall, which retrieves relevant knowledge in the LLM, and (2) reasoning, which performs reasoning steps based on the recalled knowledge. | Mingyu Jin; Weidi Luo; Sitao Cheng; Xinyi Wang; Wenyue Hua; Ruixiang Tang; William Yang Wang; Yongfeng Zhang |
| 58 | UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench Highlight: Building on UTGenerator, we propose UTBoost, a comprehensive framework for test case augmentation. |
Boxi Yu; Yuxuan Zhu; Pinjia He; Daniel Kang; |
| 59 | InductionBench: LLMs Fail in The Simplest Complexity Class Highlight: Such inductive processes lie at the heart of scientific discovery, as they enable researchers to extract general principles from empirical observations. To assess whether LLMs possess this capacity, we introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs. |
Wenyue Hua; Tyler Wong; Fei Sun; Liangming Pan; Adam Jardine; William Yang Wang; |
| 60 | ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation Highlight: Although existing open-source MLLMs have achieved success in chart understanding tasks, they still face two major challenges when applied to chart-to-code tasks: (1) low executability and poor restoration of chart details in the generated code, and (2) lack of large-scale and diverse training data. To address these challenges, we propose ChartCoder, the first dedicated chart-to-code MLLM, which leverages Code LLMs as the language backbone to enhance the executability of the generated code. |
Xuanle Zhao; Xianzhen Luo; Qi Shi; Chi Chen; Shuo Wang; Zhiyuan Liu; Maosong Sun; |
| 61 | Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention Highlight: We present Dynamic Block-Sparse Attention, an optimized method for retrieval-based many-shot in-context learning. |
Emily Xiao; Chin-Jou Li; Yilin Zhang; Graham Neubig; Amanda Bertsch; |
| 62 | People Who Frequently Use ChatGPT for Writing Tasks Are Accurate and Robust Detectors of AI-generated Text Highlight: In this paper, we study how well humans can detect text generated by commercial LLMs (GPT-4o, Claude, o1). |
Jenna Russell; Marzena Karpinska; Mohit Iyyer; |
| 63 | Enhancing Multimodal Continual Instruction Tuning with BranchLoRA Highlight: In this paper, we identify a critical parameter inefficiency in the MoELoRA framework within the MCIT context. |
Duzhen Zhang; Yong Ren; Zhong-Zhi Li; Yahan Yu; Jiahua Dong; Chenxing Li; Zhilong Ji; Jinfeng Bai; |
| 64 | ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering Highlight: We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners’ real-world usage of models. |
Alexander Miserlis Hoyle; Lorena Calvo-Bartolomé; Jordan Lee Boyd-Graber; Philip Resnik; |
| 65 | Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks Highlight: We introduce **ReDial** (**Re**asoning with **Dial**ect Queries), a benchmark containing 1. |
Fangru Lin; Shaoguang Mao; Emanuele La Malfa; Valentin Hofmann; Adrian de Wynter; Xun Wang; Si-Qing Chen; Michael J. Wooldridge; Janet B. Pierrehumbert; Furu Wei; |
| 66 | Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals’ Subjective Text Perceptions Highlight: Across all tasks, our results suggest that models learn little meaningful connection between sociodemographics and annotation, raising doubts about the current use of LLMs for simulating sociodemographic variation and behaviour. |
Matthias Orlikowski; Jiaxin Pei; Paul Röttger; Philipp Cimiano; David Jurgens; Dirk Hovy; |
| 67 | We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? Highlight: Specifically, we decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric to hierarchically assess inherent issues in LMMs’ reasoning process. |
Runqi Qiao; Qiuna Tan; Guanting Dong; Minhui Wu; Chong Sun; Xiaoshuai Song; Jiapeng Wang; Zhuoma GongQue; Shanglin Lei; YiFan Zhang; Zhe Wei; Miaoxuan Zhang; Runfeng Qiao; Xiao Zong; Yida Xu; Peiqing Yang; Zhimin Bao; Muxi Diao; Chen Li; Honggang Zhang; |
| 68 | V-Oracle: Making Progressive Reasoning in Deciphering Oracle Bones for You and Me Highlight: In this paper, we propose V-Oracle, an innovative framework that utilizes Large Multi-modal Models (LMMs) for interpreting OBS. |
Runqi Qiao; Qiuna Tan; Guanting Dong; Minhui Wu; Jiapeng Wang; YiFan Zhang; Zhuoma GongQue; Chong Sun; Yida Xu; Yadong Xue; Ye Tian; Zhimin Bao; Lan Yang; Chen Li; Honggang Zhang; |
| 69 | GUICourse: From General Vision Language Model to Versatile GUI Agent Highlight: These limitations hinder their effectiveness as practical GUI agents. To address these challenges, we introduce GUICourse, a series of datasets for training visual-based GUI agents using general VLMs. |
Wentong Chen; Junbo Cui; Jinyi Hu; Yujia Qin; Junjie Fang; Yue Zhao; Chongyi Wang; Jun Liu; Guirong Chen; Yupeng Huo; Yuan Yao; Yankai Lin; Zhiyuan Liu; Maosong Sun; |
| 70 | Binary Classifier Optimization for Large Language Model Alignment Highlight: We propose Binary Classifier Optimization (BCO), a technique that effectively aligns LLMs using only binary feedback. |
Seungjae Jung; Gunsoo Han; Daniel Wontae Nam; Kyoung-Woon On; |
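Aligning from binary (thumbs-up/down) feedback can be sketched as binary cross-entropy on an implicit reward derived from policy and reference log-probabilities. This is a minimal sketch of the general setup, not the paper's exact BCO objective (which also involves estimating a reward shift); `shift` here is a hypothetical baseline parameter.

```python
import math

def bco_style_loss(logp_policy, logp_ref, thumbs_up, beta=0.1, shift=0.0):
    """Sketch: treat the implicit reward
    r = beta * (log pi_theta(y|x) - log pi_ref(y|x)) - shift
    as a classifier logit and apply binary cross-entropy against the
    thumbs-up/down label."""
    r = beta * (logp_policy - logp_ref) - shift
    p = 1.0 / (1.0 + math.exp(-r))                      # sigmoid(r)
    y = 1.0 if thumbs_up else 0.0
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
```

Minimizing this loss pushes the policy to upweight thumbs-up responses and downweight thumbs-down ones relative to the reference model.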
| 71 | OpenWebVoyager: Building Multimodal Web Agents Via Iterative Real-World Exploration, Feedback and Optimization Highlight: In this paper, we introduce an innovative multimodal web agent that can autonomously conduct real-world exploration and improve itself. |
Hongliang He; Wenlin Yao; Kaixin Ma; Wenhao Yu; Hongming Zhang; Tianqing Fang; Zhenzhong Lan; Dong Yu; |
| 72 | CodeDPO: Aligning Code Models with Self Generated and Verified Source Code Highlight: However, existing training methods like supervised fine-tuning face key limitations: they do not effectively teach models to prioritize correct over incorrect solutions in ambiguous situations, nor do they effectively optimize the runtime efficiency of the generated code. To address these challenges, we propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency. |
Kechi Zhang; Ge Li; Yihong Dong; Jingjing Xu; Jun Zhang; Jing Su; Yongfei Liu; Zhi Jin; |
| 73 | ShifCon: Enhancing Non-Dominant Language Capabilities with A Shift-based Multilingual Contrastive Framework Highlight: To further enhance the performance of non-dominant languages, we propose ShifCon, a Shift-based multilingual Contrastive framework that aligns the internal forward process of other languages toward that of the dominant one. |
Hengyuan Zhang; Chenming Shang; Sizhe Wang; Dongdong Zhang; Yiyao Yu; Feng Yao; Renliang Sun; Yujiu Yang; Furu Wei; |
| 74 | Bitnet.cpp: Efficient Edge Inference for Ternary LLMs Highlight: Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp. |
Jinheng Wang; Hansong Zhou; Ting Song; Shijie Cao; Yan Xia; Ting Cao; Jianyu Wei; Shuming Ma; Hongyu Wang; Furu Wei; |
| 75 | Nemotron-CC: Transforming Common Crawl Into A Refined Long-Horizon Pretraining Dataset Highlight: In this paper, we show how to achieve better trade-offs between accuracy and data quantity by a combination of classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters. |
Dan Su; Kezhi Kong; Ying Lin; Joseph Jennings; Brandon Norick; Markus Kliegl; Mostofa Patwary; Mohammad Shoeybi; Bryan Catanzaro; |
| 76 | SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition Highlight: In this paper, we introduce SongComposer, a pioneering step towards a unified song composition model that can readily create symbolic lyrics and melodies following instructions. |
Shuangrui Ding; Zihan Liu; Xiaoyi Dong; Pan Zhang; Rui Qian; Junhao Huang; Conghui He; Dahua Lin; Jiaqi Wang; |
| 77 | TreeRL: LLM Reinforcement Learning with On-Policy Tree Search Highlight: We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search for RL training. |
Zhenyu Hou; Ziniu Hu; Yujiang Li; Rui Lu; Jie Tang; Yuxiao Dong; |
| 78 | Hierarchical Document Refinement for Long-context Retrieval-augmented Generation Highlight: Real-world RAG applications often encounter long-context input scenarios, where redundant information and noise result in higher inference costs and reduced performance. To address these challenges, we propose LongRefiner, an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents. |
Jiajie Jin; Xiaoxi Li; Guanting Dong; Yuyao Zhang; Yutao Zhu; Yongkang Wu; Zhonghua Li; Ye Qi; Zhicheng Dou; |
| 79 | DRAMA: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers Highlight: In this work, we introduce DRAMA, a training framework that leverages LLMs to train smaller generalizable dense retrievers. |
Xueguang Ma; Xi Victoria Lin; Barlas Oguz; Jimmy Lin; Wen-tau Yih; Xilun Chen; |
| 80 | BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models Highlight: In this work, we propose a methodology for the automated comparison of language models that uses performance-aware contextual embeddings to find fine-grained features of text where one LM outperforms another. |
Lindia Tjuatja; Graham Neubig; |
| 81 | Sightation Counts: Leveraging Sighted User Feedback in Building A BLV-aligned Dataset of Diagram Descriptions Highlight: In this study, we ask sighted individuals to assess—rather than produce—diagram descriptions generated by vision-language models (VLM) that have been guided with latent supervision via a multi-pass inference. |
Wan Ju Kang; Eunki Kim; Na Min An; Sangryul Kim; Haemin Choi; Ki Hoon Kwak; James Thorne; |
| 82 | Speculative Reward Model Boosts Decision Making Ability of LLMs Cost-Effectively Highlight: To improve LLM decision-making while maintaining efficiency, we propose the Speculative Reward Model (SRM), a plug-and-play framework that seamlessly integrates with existing search strategies. |
Jiawei Gu; Shangsong Liang; |
| 83 | MathAgent: Leveraging A Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection Highlight: Though effective in mathematical problem-solving, MLLMs often struggle with the nuanced task of **identifying and categorizing student errors in multimodal mathematical contexts**. Therefore, we introduce **MathAgent, a novel Mixture-of-Math-Agent framework** specifically designed to address these challenges. |
Yibo Yan; Shen Wang; Jiahao Huo; Philip S. Yu; Xuming Hu; Qingsong Wen; |
| 84 | Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models Highlight: This study addresses the gap by introducing the Generative Psycho-Lexical Approach (GPLA), a scalable, adaptable, and theoretically informed method for constructing value systems. |
Haoran Ye; TianZe Zhang; Yuhang Xie; Liyuan Zhang; Yuanyi Ren; Xin Zhang; Guojie Song; |
| 85 | ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control Highlight: In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker’s voice and enabling arbitrary control and adjustment of speaking style. |
Shengpeng Ji; Qian Chen; Wen Wang; Jialong Zuo; Minghui Fang; Ziyue Jiang; Hai Huang; Zehan Wang; Xize Cheng; Siqi Zheng; Zhou Zhao; |
| 86 | Language-Codec: Bridging Discrete Codec Representations and Speech Language Models Highlight: Consequently, leveraging the characteristics of speech language models, we propose Language-Codec. |
Shengpeng Ji; Minghui Fang; Jialong Zuo; Ziyue Jiang; Dingdong Wang; Hanting Wang; Hai Huang; Zhou Zhao; |
| 87 | Uni-Retrieval: A Multi-Style Retrieval Framework for STEM’s Education Highlight: In this paper, we propose a diverse expression retrieval task tailored to educational scenarios, supporting retrieval based on multiple query styles and expressions. |
Yanhao Jia; Xinyi Wu; Li Hao; Qinglin Zhang; Yuxiao Hu; Shuai Zhao; Wenqi Fan; |
| 88 | Aligning Large Language Models with Implicit Preferences from User-Generated Content Highlight: In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. |
Zhaoxuan Tan; Zheng Li; Tianyi Liu; Haodong Wang; Hyokun Yun; Ming Zeng; Pei Chen; Zhihan Zhang; Yifan Gao; Ruijie Wang; Priyanka Nigam; Bing Yin; Meng Jiang; |
| 89 | CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs Highlight: However, current Large Language Models (LLMs) face limitations in these specialized domains, highlighting the need for the development of comprehensive datasets that can assess, continuously update, and progressively improve these culturally-grounded linguistic competencies through targeted training optimizations. To address this gap, we introduce CKnowEdit, the first-ever Chinese knowledge editing dataset designed to correct linguistic, factual, and logical errors in LLMs. |
Jizhan Fang; Tianhe Lu; Yunzhi Yao; Ziyan Jiang; Xin Xu; Huajun Chen; Ningyu Zhang; |
| 90 | Can We Retrieve Everything All at Once? ARM: An Alignment-Oriented LLM-based Retrieval Method Highlight: To address the *alignment* problem, we introduce an LLM-based retrieval method — ARM, designed to better align questions with the organization of the data collection. |
Peter Baile Chen; Yi Zhang; Mike Cafarella; Dan Roth; |
| 91 | In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents Highlight: In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities—utterances, turns, and sessions—into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs’ cited evidence. |
Zhen Tan; Jun Yan; I-Hung Hsu; Rujun Han; Zifeng Wang; Long Le; Yiwen Song; Yanfei Chen; Hamid Palangi; George Lee; Anand Rajan Iyer; Tianlong Chen; Huan Liu; Chen-Yu Lee; Tomas Pfister; |
| 92 | Fusing Highly Specialized Language Models for Comprehensive Expertise Highlight: In this paper, we aim to “play the dealt cards well” and propose to fuse models that are already highly-specialized directly. |
Ning Ding; Yulin Chen; Ganqu Cui; Xingtai Lv; Weilin Zhao; Kaiyan Zhang; Ruobing Xie; Bowen Zhou; Zhiyuan Liu; Maosong Sun; |
| 93 | Caution for The Environment: Multimodal LLM Agents Are Susceptible to Environmental Distractions Highlight: This paper investigates the faithfulness of multimodal large language model (MLLM) agents in a graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by environmental context. |
Xinbei Ma; Yiting Wang; Yao Yao; Tongxin Yuan; Aston Zhang; Zhuosheng Zhang; Hai Zhao; |
| 94 | Gödel Agent: A Self-Referential Agent Framework for Recursively Self-Improvement Highlight: In this paper, we introduce Gödel Agent, a self-evolving framework inspired by the Gödel Machine, enabling agents to recursively improve themselves without relying on predefined routines or fixed optimization algorithms. |
Xunjian Yin; Xinyi Wang; Liangming Pan; Li Lin; Xiaojun Wan; William Yang Wang; |
| 95 | Aligning Large Language Models to Follow Instructions and Hallucinate Less Via Effective Data Filtering Highlight: Training LLMs on data containing unfamiliar knowledge during the instruction tuning stage can encourage hallucinations. To address this challenge, we introduce NOVA, a novel framework designed to identify high-quality data that aligns well with the LLM’s learned knowledge to reduce hallucinations. |
Shuzheng Si; Haozhe Zhao; Gang Chen; Cheng Gao; Yuzhuo Bai; Zhitong Wang; Kaikai An; Kangyang Luo; Chen Qian; Fanchao Qi; Baobao Chang; Maosong Sun; |
| 96 | Refuse Whenever You Feel Unsafe: Improving Safety in LLMs Via Decoupled Refusal Training Highlight: We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position, significantly enhancing their safety capabilities. |
Youliang Yuan; Wenxiang Jiao; Wenxuan Wang; Jen-tse Huang; Jiahao Xu; Tian Liang; Pinjia He; Zhaopeng Tu; |
| 97 | LLMs Know Their Vulnerabilities: Uncover Safety Gaps Through Natural Distribution Shifts Highlight: In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to natural distribution shifts between attack prompts and original toxic prompts, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. |
Qibing Ren; Hao Li; Dongrui Liu; Zhanxu Xie; Xiaoya Lu; Yu Qiao; Lei Sha; Junchi Yan; Lizhuang Ma; Jing Shao; |
| 98 | Lost in The Context: Insufficient and Distracted Attention to Contexts in Preference Modeling Highlight: These issues undermine the RM’s effectiveness in modeling human preferences. To address these challenges, we propose AttnRM, a novel optimization framework that enables the RM to concentrate on crucial segments of the context. |
Shihan Dou; Jiayi Chen; Chenhao Huang; Feng Chen; Wei Chengzhi; Huiyuan Zheng; Shichun Liu; Yan Liu; Chenxiao Liu; Chao Xin; Lin Yan; Zongzhang Zhang; Tao Gui; Qi Zhang; Xuanjing Huang; |
| 99 | APB: Accelerating Distributed Long-Context Inference By Passing Compressed Context Blocks Across GPUs Highlight: This hinders scaling the inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously. |
Yuxiang Huang; Mingye Li; Xu Han; Chaojun Xiao; Weilin Zhao; Sun Ao; Hao Zhou; Jie Zhou; Zhiyuan Liu; Maosong Sun; |
| 100 | Genetic Instruct: Scaling Up Synthetic Generation of Coding Instructions for Large Language Models Highlight: We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high-quality coding instructions using evolutionary principles. |
Somshubra Majumdar; Vahid Noroozi; Mehrzad Samadi; Sean Narenthiran; Aleksander Ficek; Wasi Uddin Ahmad; Jocelyn Huang; Jagadeesh Balam; Boris Ginsburg; |
| 101 | LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models Via Restoration Distillation Highlight: However, this often leads to degraded performance on short-text tasks, while the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. |
Zican Dong; Junyi Li; Jinhao Jiang; Mingyu Xu; Xin Zhao; Bingning Wang; Weipeng Chen; |
| 102 | Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models Highlight: In this paper, we propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs. |
Zenghui Yuan; Yangming Xu; Jiawen Shi; Pan Zhou; Lichao Sun; |
| 103 | AgentRM: Enhancing Agent Generalization with Reward Modeling Highlight: In this work, we find that finetuning a reward model to guide the policy model is more robust than directly finetuning the policy model. |
Yu Xia; Jingru Fan; Weize Chen; Siyu Yan; Xin Cong; Zhong Zhang; Yaxi Lu; Yankai Lin; Zhiyuan Liu; Maosong Sun; |
| 104 | LLM×MapReduce: Simplified Long-Sequence Processing Using Large Language Models Highlight: We propose a training-free framework that enables large language models (LLMs) to effectively process long texts, using a divide-and-conquer strategy for comprehensive document understanding. |
Zihan Zhou; Chong Li; Xinyi Chen; Shuo Wang; Yu Chao; Zhili Li; Haoyu Wang; Qi Shi; Zhixing Tan; Xu Han; Xiaodong Shi; Zhiyuan Liu; Maosong Sun; |
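The divide-and-conquer strategy for long texts can be sketched generically: split the document into chunks, extract a partial result from each, then merge. In this sketch, `map_fn` and `reduce_fn` stand in for LLM calls; all names are hypothetical and the paper's actual framework is more elaborate.

```python
def map_reduce_answer(document, chunk_size, map_fn, reduce_fn):
    """Sketch of divide-and-conquer long-text processing: chunk the
    document, apply map_fn (an LLM call in practice) to each chunk,
    and merge the partial results with reduce_fn."""
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    partials = [map_fn(chunk) for chunk in chunks]
    return reduce_fn(partials)
```

With plain Python functions standing in for the LLM, the structure is easy to check, e.g. counting occurrences of a character across chunks and summing the partial counts.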
| 105 | MergePrint: Merge-Resistant Fingerprints for Robust Black-box Ownership Verification of Large Language Models Highlight: While fingerprinting techniques have been proposed for verifying model ownership, their resistance to model merging remains unexplored. To address this gap, we propose a novel fingerprinting method, MergePrint, which embeds robust fingerprints capable of surviving model merging. |
Shojiro Yamabe; Futa Kai Waseda; Tsubasa Takahashi; Koki Wataoka; |
| 106 | RAG-Critic: Leveraging Automated Critic-Guided Agentic Workflow for Retrieval Augmented Generation Highlight: In this paper, we propose RAG-Critic, a novel framework that leverages a critic-guided agentic workflow to improve RAG capabilities autonomously. |
Guanting Dong; Jiajie Jin; Xiaoxi Li; Yutao Zhu; Zhicheng Dou; Ji-Rong Wen; |
| 107 | Progressive Multimodal Reasoning Via Active Retrieval Highlight: In this paper, we propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs through Active Retrieval (AR) and Monte Carlo Tree Search (MCTS). |
Guanting Dong; Chenghao Zhang; Mengjie Deng; Yutao Zhu; Zhicheng Dou; Ji-Rong Wen; |
| 108 | RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence Within Generation Highlight: Retrieval-augmented generation (RAG) offers an effective solution by incorporating external knowledge, but existing methods still face several limitations: additional deployment costs of separate retrievers, redundant input tokens from retrieved text chunks, and the lack of joint optimization of retrieval and generation. To address these issues, we propose RetroLLM, a unified framework that integrates retrieval and generation into a single, auto-regressive process, enabling LLMs to directly generate fine-grained evidence from the corpus with constrained decoding. |
Xiaoxi Li; Jiajie Jin; Yujia Zhou; Yongkang Wu; Zhonghua Li; Ye Qi; Zhicheng Dou; |
| 109 | Magnet: Multi-turn Tool-use Data Synthesis and Distillation Via Graph Translation Highlight: However, their performance may be limited in complex, multi-turn interactions involving users and multiple tools. To address this, we propose Magnet, a principled framework for synthesizing high-quality training trajectories to enhance the function calling capability of large language model agents in multi-turn conversations with humans. |
Fan Yin; Zifeng Wang; I-Hung Hsu; Jun Yan; Ke Jiang; Yanfei Chen; Jindong Gu; Long Le; Kai-Wei Chang; Chen-Yu Lee; Hamid Palangi; Tomas Pfister; |
| 110 | AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents Highlight: In this work, we propose AndroidLab as a systematic Android agent framework. |
Yifan Xu; Xiao Liu; Xueqiao Sun; Siyi Cheng; Hao Yu; Hanyu Lai; Shudan Zhang; Dan Zhang; Jie Tang; Yuxiao Dong; |
| 111 | Enhancing Open-Domain Task-Solving Capability of LLMs Via Autonomous Tool Integration from GitHub Highlight: To this end, we introduce the OpenAct benchmark to evaluate open-domain task-solving capability, which is built on human expert consultation and GitHub repositories. |
Bohan Lyu; Xin Cong; Heyang Yu; Pan Yang; Cheng Qian; Zihe Wang; Yujia Qin; Yining Ye; Yaxi Lu; Chen Qian; Zhong Zhang; Yukun Yan; Yankai Lin; Zhiyuan Liu; Maosong Sun; |
| 112 | AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models Highlight: However, existing benchmarks primarily focus on basic abilities using nonverbal methods, such as yes-no and multiple-choice questions. In this paper, we address this gap by introducing AlignMMBench, which provides more nuanced evaluations of alignment capabilities and is the first benchmark specifically designed for Chinese visual contexts. |
Yuhang Wu; Wenmeng Yu; Yean Cheng; Yan Wang; Xiaohan Zhang; Jiazheng Xu; Ming Ding; Yuxiao Dong; |
| 113 | Normalized AOPC: Fixing Misleading Faithfulness Metrics for Feature Attributions Explainability Highlight: Moreover, AOPC scores are difficult to interpret in isolation without knowing the model-specific lower and upper limits. To address these issues, we propose a normalization approach, Normalized AOPC (NAOPC), enabling consistent cross-model evaluations and more meaningful interpretation of individual scores. |
Joakim Edin; Andreas Geert Motzfeldt; Casper L. Christensen; Tuukka Ruotsalo; Lars Maaløe; Maria Maistro; |
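The normalization itself amounts to a min-max rescaling of a raw AOPC score by model-specific lower and upper limits, so that scores become comparable across models. This is a sketch of the core arithmetic only; estimating the limits is the substantive part of the paper.

```python
def normalized_aopc(aopc, aopc_min, aopc_max):
    """Sketch of the min-max idea behind NAOPC: rescale a raw AOPC score
    by the model-specific lower/upper limits, mapping the worst possible
    score to 0 and the best to 1."""
    if aopc_max == aopc_min:
        raise ValueError("degenerate bounds: upper and lower limits coincide")
    return (aopc - aopc_min) / (aopc_max - aopc_min)
```

After this rescaling, a score of 0.5 means "halfway between this model's worst and best achievable AOPC", regardless of which model produced it.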
| 114 | Navigating Rifts in Human-LLM Grounding: Study and Benchmark Highlight: Additionally, we find that early grounding failures predict later interaction breakdowns. Building on these insights, we introduce Rifts, a benchmark derived from publicly available LLM interaction data containing situations where LLMs fail to initiate grounding. |
Omar Shaikh; Hussein Mozannar; Gagan Bansal; Adam Fourney; Eric Horvitz; |
| 115 | LLM Agents Making Agent Tools Highlight: Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, an agentic framework that autonomously transforms papers with code into LLM-compatible tools. |
Georg Wölflein; Dyke Ferber; Daniel Truhn; Ognjen Arandjelovic; Jakob Nikolas Kather; |
| 116 | MultiAgentBench: Evaluating The Collaboration and Competition of LLM Agents Highlight: In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. |
Kunlun Zhu; Hongyi Du; Zhaochen Hong; Xiaocheng Yang; Shuyi Guo; Zhe Wang; Zhenhailong Wang; Cheng Qian; Robert Tang; Heng Ji; Jiaxuan You; |
| 117 | Fairness Through Difference Awareness: Measuring Desired Group Discrimination in LLMs Highlight: Thus, in contrast to most fairness work, we study fairness through the perspective of treating people differently – when it is contextually appropriate to. |
Angelina Wang; Michelle Phan; Daniel E. Ho; Sanmi Koyejo; |
| 118 | Demystifying Small Language Models for Edge Deployment Highlight: This work presents the first comprehensive study of over 60 SLMs such as Microsoft Phi and Google Gemma that are publicly accessible. |
Zhenyan Lu; Xiang Li; Dongqi Cai; Rongjie Yi; Fangming Liu; Wei Liu; Jian Luan; Xiwen Zhang; Nicholas D. Lane; Mengwei Xu; |
| 119 | Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks Highlight: While chain-of-thought and retrieval-augmented generation help break down problems and retrieve knowledge, they still falter on challenging tasks like competitive programming due to frequent reasoning errors and irrelevant retrieval. To address this, we introduce Critic-guided planning with Retrieval-augmentation, CR-Planner, a novel framework that leverages fine-tuned critic models to guide both reasoning and retrieval processes through planning. |
Xingxuan Li; Weiwen Xu; Ruochen Zhao; Fangkai Jiao; Shafiq Joty; Lidong Bing; |
| 120 | Language Models Resist Alignment: Evidence From Data Compression Highlight: Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. |
Jiaming Ji; Kaile Wang; Tianyi Alex Qiu; Boyuan Chen; Jiayi Zhou; Changye Li; Hantao Lou; Josef Dai; Yunhuai Liu; Yaodong Yang; |
| 121 | PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference Highlight: In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). |
Jiaming Ji; Donghai Hong; Borong Zhang; Boyuan Chen; Josef Dai; Boren Zheng; Tianyi Alex Qiu; Jiayi Zhou; Kaile Wang; Boxun Li; Sirui Han; Yike Guo; Yaodong Yang; |
| 122 | INews: A Multimodal Dataset for Modeling Personalized Affective Responses to News Highlight: We introduce iNews, a novel large-scale dataset specifically designed to facilitate the modeling of personalized affective responses to news content. |
Tiancheng Hu; Nigel Collier; |
| 123 | Efficiently Identifying Watermarked Segments in Mixed-Source Texts Highlight: Drawing inspiration from plagiarism detection systems, we propose two novel methods for partial watermark detection. |
Xuandong Zhao; Chenwen Liao; Yu-Xiang Wang; Lei Li; |
| 124 | KG-Agent: An Efficient Autonomous Agent Framework for Complex Reasoning Over Knowledge Graph Highlight: In this paper, we aim to improve the reasoning ability of large language models (LLMs) over knowledge graphs (KGs) to answer complex questions. |
Jinhao Jiang; Kun Zhou; Xin Zhao; Yang Song; Chen Zhu; Hengshu Zhu; Ji-Rong Wen; |
| 125 | Diversity-oriented Data Augmentation with Large Language Models Highlight: In response, we explore data augmentation’s impact on dataset diversity and propose a Diversity-oriented data Augmentation framework (DoAug). |
Zaitian Wang; Jinghan Zhang; Xinhao Zhang; Kunpeng Liu; Pengfei Wang; Yuanchun Zhou; |
| 126 | SceneGenAgent: Precise Industrial Scene Generation with Coding Agent Highlight: While large language models (LLMs) have shown significant progress in generating general 3D scenes from textual descriptions, generating industrial scenes with LLMs poses a unique challenge due to their demand for precise measurements and positioning, requiring complex planning over spatial arrangement. To address this challenge, we introduce SceneGenAgent, an LLM-based agent for generating industrial scenes through C# code. |
Xiao Xia; Dan Zhang; Zibo Liao; Zhenyu Hou; Tianrui Sun; Jing Li; Ling Fu; Yuxiao Dong; |
| 127 | InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning Highlight: In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. |
Zifu Wan; Yaqi Xie; Ce Zhang; Zhiqiu Lin; Zihan Wang; Simon Stepputtis; Deva Ramanan; Katia P. Sycara; |
| 128 | Enhancing Automated Interpretability with Output-Centric Feature Descriptions Highlight: Using steering evaluations, we reveal that current pipelines provide descriptions that fail to capture the causal effect of the feature on outputs. To fix this, we propose efficient, output-centric methods for automatically generating feature descriptions. |
Yoav Gur-Arieh; Roy Mayan; Chen Agassy; Atticus Geiger; Mor Geva; |
| 129 | M-RewardBench: Evaluating Reward Models in Multilingual Settings Highlight: In this work, we conduct a systematic evaluation of several reward models in multilingual settings. |
Srishti Gureja; Lester James Validad Miranda; Shayekh Bin Islam; Rishabh Maheshwary; Drishti Sharma; Gusti Triandi Winata; Nathan Lambert; Sebastian Ruder; Sara Hooker; Marzieh Fadaee; |
| 130 | PIGuard: Prompt Injection Guardrail Via Mitigating Overdefense for Free Highlight: Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose PIGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. |
Hao Li; Xiaogeng Liu; Ning Zhang; Chaowei Xiao; |
| 131 | AutoMixer: Checkpoint Artifacts As Automatic Data Mixers Highlight: In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. |
Ernie Chang; Yang Li; Patrick Huber; Vish Vogeti; David Kant; Yangyang Shi; Vikas Chandra; |
| 132 | The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents Highlight: We propose a novel and orthogonal perspective that reframes agent security from preventing harmful actions to ensuring task alignment, requiring every agent action to serve user objectives. |
Feiran Jia; Tong Wu; Xin Qin; Anna Squicciarini; |
| 133 | EfficientQAT: Efficient Quantization-Aware Training for Large Language Models Highlight: Although quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss, it is often impractical due to the substantial training resources it requires. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. |
Mengzhao Chen; Wenqi Shao; Peng Xu; Jiahao Wang; Peng Gao; Kaipeng Zhang; Ping Luo; |
| 134 | Deliberate Reasoning in Language Models As Structure-Aware Planning with An Accurate World Model Highlight: In this paper, we propose a novel reasoning framework, referred to as Structure-aware Planning with an Accurate World Model (SWAP), that integrates structured knowledge representation with learned planning. |
Siheng Xiong; Ali Payani; Yuan Yang; Faramarz Fekri; |
| 135 | ChatBench: From Static Benchmarks to Human-AI Evaluation Highlight: However, standard benchmarks, such as MMLU, measure LLM capabilities in isolation (i.e., “AI-alone”). Here, we design and conduct a user study to convert MMLU questions into user-AI conversations, by seeding the user with the question and having them carry out a conversation with the LLM to answer their question. |
Serina Chang; Ashton Anderson; Jake M. Hofman; |
| 136 | LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement Highlight: In this paper, we introduce LLaSE-G1, a LLaMA-based language model that incentivizes generalization capabilities for speech enhancement. |
Boyi Kang; Xinfa Zhu; Zihan Zhang; Zhen Ye; Mingshuai Liu; Ziqian Wang; Yike Zhu; Guobin Ma; Jun Chen; Longshuai Xiao; Chao Weng; Wei Xue; Lei Xie; |
| 137 | CaLMQA: Exploring Culturally Specific Long-form Question Answering Across 23 Languages Highlight: We define culturally specific questions as those that refer to concepts unique to one or a few cultures, or have different answers depending on the cultural or regional context. We obtain these questions by crawling naturally-occurring questions from community web forums in high-resource languages, and by hiring native speakers to write questions in under-resourced, rarely-studied languages such as Fijian and Kirundi. |
Shane Arora; Marzena Karpinska; Hung-Ting Chen; Ipsita Bhattacharjee; Mohit Iyyer; Eunsol Choi; |
| 138 | Cooperative or Competitive? Understanding The Interaction Between Attention Heads From A Game Theory Perspective Highlight: To further optimize the interactions among attention heads, we propose a training-free Game-theoretic Attention Calibration (GAC) method. |
Xiaoye Qu; Zengqi Yu; Dongrui Liu; Wei Wei; Daizong Liu; Jianfeng Dong; Yu Cheng; |
| 139 | Bias in Language Models: Beyond Trick Tests and Towards RUTEd Evaluation Highlight: We find that standard bias metrics have no significant correlation with long-form output metrics. |
Kristian Lum; Jacy Reese Anthis; Kevin Robinson; Chirag Nagpal; Alexander Nicholas D’Amour; |
| 140 | Towards Generating Controllable and Solvable Geometry Problem By Leveraging Symbolic Deduction Engine Highlight: In this paper, we introduce a novel task for geometry problem generation and propose a new pipeline method: the Symbolic Deduction Engine-based Geometry Problem Generation framework (SDE-GPG). |
Zhuoxuan Jiang; Tianyang Zhang; Peiyan Peng; Jing Chen; Yinong Xun; Haotian Zhang; Lichi Li; Yong Li; Shaohua Zhang; |
| 141 | Learning to Generate Structured Output with Schema Reinforcement Learning Highlight: We explore various aspects of JSON generation, such as structure understanding, escaping, and natural language description, to determine how to assess and enable LLMs to generate valid responses. Building upon this, we propose SchemaBench, which features around 40K different JSON schemas to obtain and assess models’ abilities in generating valid JSON. |
Yaxi Lu; Haolun Li; Xin Cong; Zhong Zhang; Yesai Wu; Yankai Lin; Zhiyuan Liu; Fangming Liu; Maosong Sun; |
| 142 | VISA: Retrieval Augmented Generation with Visual Source Attribution Highlight: However, existing approaches in RAG primarily link generated content to document-level references, making it challenging for users to locate evidence among multiple content-rich retrieved documents. To address this challenge, we propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution. |
Xueguang Ma; Shengyao Zhuang; Bevan Koopman; Guido Zuccon; Wenhu Chen; Jimmy Lin; |
| 143 | Aligned But Blind: Alignment Increases Implicit Bias By Reducing Awareness of Race Highlight: Not representing race likely fails to activate safety guardrails, leading to unintended biases. Inspired by this insight, we propose a new bias mitigation strategy that works by incentivizing the representation of racial concepts in the early model layers. |
Lihao Sun; Chengzhi Mao; Valentin Hofmann; Xuechunzi Bai; |
| 144 | Literary Evidence Retrieval Via Long-Context Language Models Highlight: How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark where the entire text of a primary source (e.g., The Great Gatsby) is provided to an LLM alongside literary criticism with a missing quotation from that work. |
Katherine Thai; Mohit Iyyer; |
| 145 | Revisiting The Test-Time Scaling of O1-like Models: Do They Truly Possess Test-Time Scaling Capabilities? Highlight: We then compare sequential and parallel scaling strategies on QwQ, R1 and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these insights, we propose “Shortest Majority Vote”, a method that combines parallel scaling strategies with CoT length characteristics, significantly improving models’ test-time scalability compared to conventional majority voting approaches. |
Zhiyuan Zeng; Qinyuan Cheng; Zhangyue Yin; Yunhua Zhou; Xipeng Qiu; |
| 146 | Aristotle: Mastering Logical Reasoning with A Logic-Complete Decompose-Search-Resolve Framework Highlight: This is rooted in the fact that these systems fail to fully leverage the inherent structure of logical tasks throughout the reasoning processes, including decomposition, search, and resolution. To address this, we propose a logic-complete reasoning framework, Aristotle. |
Jundong Xu; Hao Fei; Meng Luo; Qian Liu; Liangming Pan; William Yang Wang; Preslav Nakov; Mong-Li Lee; Wynne Hsu; |
| 147 | ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation Highlight: Previous work has proposed numerous methods to enhance code generation performance, including integrating feedback from the compiler. Inspired by this, we present ReflectionCoder, a novel approach that effectively leverages reflection sequences constructed by integrating compiler feedback to improve one-off code generation performance. |
Houxing Ren; Mingjie Zhan; Zhongyuan Wu; Aojun Zhou; Junting Pan; Hongsheng Li; |
| 148 | S2R: Teaching LLMs to Self-verify and Self-correct Via Reinforcement Learning Highlight: In this work, we introduce S2R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. |
Ruotian Ma; Peisong Wang; Cheng Liu; Xingyan Liu; Jiaqi Chen; Bang Zhang; Xin Zhou; Nan Du; Jia Li; |
| 149 | 500xCompressor: Generalized Prompt Compression for Large Language Models Highlight: However, current methods face challenges such as low compression ratios and potential training-test overlap during evaluation. To address these issues, we propose 500xCompressor, a method that compresses natural language contexts into a minimum of one special token and demonstrates strong generalization ability. |
Zongqian Li; Yixuan Su; Nigel Collier; |
| 150 | Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models Highlight: However, despite these advancements, a comprehensive benchmark for evaluating the performance of LALMs in open-ended audio dialogue understanding is still absent. To address this gap, we propose an **A**udio **D**ialogue **U**nderstanding **Bench**mark **(ADU-Bench)**, which consists of 4 benchmark datasets. |
Kuofeng Gao; Shu-Tao Xia; Ke Xu; Philip Torr; Jindong Gu; |
| 151 | Inferring Functionality of Attention Heads from Their Parameters Highlight: Prior work on investigating their operation mostly focused on analyzing their behavior during inference for specific circuits or tasks. In this work, we seek a comprehensive mapping of the operations they implement in a model. |
Amit Elhelo; Mor Geva; |
| 152 | Improving Factuality with Explicit Working Memory Highlight: Recent works have built upon retrieval-augmented generation to improve factuality through iterative prompting, but these methods are limited by the traditional RAG design. To address these challenges, we introduce Ewe (Explicit Working Memory), a novel approach that enhances factuality in long-form text generation by integrating a working memory that receives real-time feedback from external resources. |
Mingda Chen; Yang Li; Karthik Padthe; Rulin Shao; Alicia Yi Sun; Luke Zettlemoyer; Gargi Ghosh; Wen-tau Yih; |
| 153 | DeAL: Decoding-time Alignment for Large Language Models Highlight: First, the inability to incorporate multiple, custom rewards and reliance on a model developer's view of universal and static principles are key limitations. Second, the reliability of such approaches is also questionable (e.g., susceptibility to jailbreaking even after safety training). To address these issues, we propose DeAL, a framework that allows the user to customize reward functions and enables Decoding-time Alignment of LLMs (DeAL). |
James Y. Huang; Sailik Sengupta; Daniele Bonadiman; Yi-An Lai; Arshit Gupta; Nikolaos Pappas; Saab Mansour; Katrin Kirchhoff; Dan Roth; |
| 154 | SURVEYFORGE: On The Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing Highlight: However, the quality gap between LLM-generated surveys and those written by humans remains significant, particularly in terms of outline quality and citation accuracy. To close these gaps, we introduce SURVEYFORGE, which first generates the outline by analyzing the logical structure of human-written outlines and referring to the retrieved domain-related articles. |
Xiangchao Yan; Shiyang Feng; Jiakang Yuan; Renqiu Xia; Bin Wang; Lei Bai; Bo Zhang; |
| 155 | Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools Highlight: We introduce Agentic Reasoning, a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. |
Junde Wu; Jiayuan Zhu; Yuyuan Liu; Min Xu; Yueming Jin; |
| 156 | TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining Highlight: We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) – orders of magnitude larger than previous continual language modeling benchmarks. |
Jeffrey Li; Mohammadreza Armandpour; Seyed Iman Mirzadeh; Sachin Mehta; Vaishaal Shankar; Raviteja Vemulapalli; Samy Bengio; Oncel Tuzel; Mehrdad Farajtabar; Hadi Pouransari; Fartash Faghri; |
| 157 | Mixtures of In-Context Learners Highlight: We propose Mixtures of In-Context Learners (MoICL), a novel approach that uses subsets of demonstrations to train a set of experts via ICL and learns a weighting function to merge their output distributions via gradient-based optimisation. |
Giwon Hong; Emile Van Krieken; Edoardo Ponti; Nikolay Malkin; Pasquale Minervini; |
| 158 | A Silver Bullet or A Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression Highlight: In this work, we provide an empirical investigation of gist-based context compression methods to improve context processing in large language models. |
Chenlong Deng; Zhisong Zhang; Kelong Mao; Shuaiyi Li; Xinting Huang; Dong Yu; Zhicheng Dou; |
| 159 | Can Language Models Reason About Individualistic Human Values and Preferences? Highlight: To achieve an authentic representation of diversity that respects individuality, we propose individualistic alignment. |
Liwei Jiang; Taylor Sorensen; Sydney Levine; Yejin Choi; |
| 160 | What Are The Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices Highlight: However, our preliminary experiments show that fewer than 35% of samples generated by Qwen-2-72B are multi-hop, and over 40% exhibit poor quality, limiting comprehensive understanding and further research. To address this, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, which integrates a quality verification agent, a single-hop question generation agent, a multiple question sampling strategy, and a multi-hop question merger agent. |
Zhi Chen; Qiguang Chen; Libo Qin; Qipeng Guo; Haijun Lv; Yicheng Zou; Hang Yan; Kai Chen; Dahua Lin; |
| 161 | Synergistic Weak-Strong Collaboration By Aligning Preferences Highlight: Fine-tuning large models for every niche application is often infeasible due to black-box constraints and high computational overhead. To address this, we propose a collaborative framework that pairs a specialized weak model with a general strong model. |
Yizhu Jiao; Xuchao Zhang; Zhaoyang Wang; Yubo Ma; Zhun Deng; Rujia Wang; Chetan Bansal; Saravan Rajmohan; Jiawei Han; Huaxiu Yao; |
| 162 | Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions Highlight: However, human evaluations require significant manual effort. Therefore, we propose Auto-Arena, an innovative framework that automates the entire evaluation process using LLM-powered agents. |
Ruochen Zhao; Wenxuan Zhang; Yew Ken Chia; Weiwen Xu; Deli Zhao; Lidong Bing; |
| 163 | HALoGEN: Fantastic LLM Hallucinations and Where to Find Them Highlight: In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. |
Abhilasha Ravichander; Shrusti Ghela; David Wadden; Yejin Choi; |
| 164 | LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models Highlight: Meanwhile, extending the context window in LLMs through post-pretraining is highly resource-intensive. To address this, we introduce LongRecipe, an efficient training strategy for extending the context window of LLMs, including impactful token analysis, position index transformation, and training optimization strategies. |
Zhiyuan Hu; Yuliang Liu; Jinman Zhao; Suyuchen Wang; WangYan WangYan; Wei Shen; Qing Gu; Anh Tuan Luu; See-Kiong Ng; Zhiwei Jiang; Bryan Hooi; |
| 165 | EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents Highlight: Our results show that current LMs, despite employing working memory and Chain-of-Thought reasoning, achieve only 15% average progress without hints, highlighting their limitations in creativity. To bridge this gap, we propose EscapeAgent, a framework designed to enhance creative reasoning through Foresight (innovative tool use) and Reflection (identifying unsolved tasks). |
Cheng Qian; Peixuan Han; Qinyu Luo; Bingxiang He; Xiusi Chen; Yuji Zhang; Hongyi Du; Jiarui Yao; Xiaocheng Yang; Denghui Zhang; Yunzhu Li; Heng Ji; |
| 166 | JuStRank: Benchmarking LLM Judges for System Ranking Highlight: Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge’s positive or negative bias towards certain systems. |
Ariel Gera; Odellia Boni; Yotam Perlitz; Roy Bar-Haim; Lilach Eden; Asaf Yehudai; |
| 167 | FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models Highlight: In this work, we introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs’ fact-checking capabilities. |
Hongzhan Lin; Yang Deng; Yuxuan Gu; Wenxuan Zhang; Jing Ma; See-Kiong Ng; Tat-Seng Chua; |
| 168 | Don’t Get Lost in The Trees: Streamlining LLM Reasoning By Overcoming Tree Search Exploration Pitfalls Highlight: Recent advancements in tree search algorithms guided by verifiers have significantly enhanced the reasoning capabilities of large language models (LLMs), but at the cost of increased computational resources. In this work, we identify two key challenges contributing to this inefficiency: over-exploration due to redundant states with semantically equivalent content, and under-exploration caused by high variance in verifier scoring leading to frequent trajectory switching. |
Ante Wang; Linfeng Song; Ye Tian; Dian Yu; Haitao Mi; Xiangyu Duan; Zhaopeng Tu; Jinsong Su; Dong Yu; |
| 169 | SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement Highlight: To empower agents to autonomously explore environments, optimize workflows, and enhance their understanding of actions, we propose SynWorld, a framework that allows agents to synthesize possible scenarios with multi-step action invocation within the action space and perform Monte Carlo Tree Search (MCTS) exploration to effectively refine their action knowledge in the current environment. |
Runnan Fang; Xiaobin Wang; Yuan Liang; Shuofei Qiao; Jialong Wu; Zekun Xi; Ningyu Zhang; Yong Jiang; Pengjun Xie; Fei Huang; Huajun Chen; |
| 170 | HybGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases Highlight: In this paper, through our empirical analysis, we identify key insights that show why existing methods may struggle with hybrid question answering (HQA) over SKB. |
Meng-Chieh Lee; Qi Zhu; Costas Mavromatis; Zhen Han; Soji Adeshina; Vassilis N. Ioannidis; Huzefa Rangwala; Christos Faloutsos; |
| 171 | Identifying Reliable Evaluation Metrics for Scientific Text Revision Highlight: Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. |
Leane Jourdan; Nicolas Hernandez; Florian Boudin; Richard Dufour; |
| 172 | Mind The Gap: Static and Interactive Evaluations of Large Audio Models Highlight: However, aligning LAM development with user goals requires a clear understanding of user needs and preferences to establish reliable progress metrics. This study addresses these challenges by introducing an interactive approach to evaluate LAMs and collecting 7,500 LAM interactions from 484 participants. |
Minzhi Li; William Barr Held; Michael J Ryan; Kunat Pipatanakul; Potsawee Manakul; Hao Zhu; Diyi Yang; |
| 173 | Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception Highlight: To mitigate risks in critical domains, we introduce Consistency-based Confidence Calibration (C3), which assesses confidence consistency through question reformulation. |
Shiyu Ni; Keping Bi; Jiafeng Guo; Lulu Yu; Baolong Bi; Xueqi Cheng; |
| 174 | Recent Advances in Speech Language Models: A Survey Highlight: In this context, Speech Language Models (SpeechLMs)—foundation models designed to understand and generate speech—emerge as a promising solution for end-to-end speech interaction. This survey offers a comprehensive overview of recent approaches to building SpeechLMs, outlining their core architectural components, training methodologies, evaluation strategies, and the challenges and potential directions for future research in this rapidly advancing field. |
Wenqian Cui; Dianzhi Yu; Xiaoqi Jiao; Ziqiao Meng; Guangyan Zhang; Qichao Wang; Steven Y. Guo; Irwin King; |
| 175 | VoxEval: Benchmarking The Knowledge Understanding Capabilities of End-to-End Spoken Language Models Highlight: While these models require comprehensive world knowledge for meaningful and reliable human interactions, existing question-answering (QA) benchmarks fall short in evaluating SLMs’ knowledge understanding due to their inability to support end-to-end speech evaluation and account for varied input audio conditions. To address these limitations, we present VoxEval, a novel SpeechQA benchmark that assesses SLMs’ knowledge understanding through pure speech interactions. |
Wenqian Cui; Xiaoqi Jiao; Ziqiao Meng; Irwin King; |
| 176 | RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios Highlight: This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. |
Ruiwen Zhou; Wenyue Hua; Liangming Pan; Sitao Cheng; Xiaobao Wu; En Yu; William Yang Wang; |
| 177 | Automated Structured Radiology Report Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This variability poses challenges for both generation and evaluation: existing models struggle to produce consistent, clinically meaningful reports, and standard evaluation metrics fail to capture the nuances of radiological interpretation. To address this, we introduce Structured Radiology Report Generation (SRRG), a new task that reformulates free-text radiology reports into a standardized format, ensuring clarity, consistency, and structured clinical reporting. |
Jean-Benoit Delbrouck; Justin Xu; Johannes Moll; Alois Thomas; Zhihong Chen; Sophie Ostmeier; Asfandyar Azhar; Kelvin Zhenghao Li; Andrew Johnston; Christian Bluethgen; Eduardo Pontes Reis; Mohamed S Muneer; Maya Varma; Curtis Langlotz; |
| 178 | SYNTHIA: Novel Concept Design with Affordance Composition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce SYNTHIA, a framework for generating novel, functionally coherent designs based on desired affordances. |
Hyeonjeong Ha; Xiaomeng Jin; Jeonghwan Kim; Jiateng Liu; Zhenhailong Wang; Khanh Duy Nguyen; Ansel Blume; Nanyun Peng; Kai-Wei Chang; Heng Ji; |
| 179 | GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. |
Yifan Yang; Zheshu Song; Jianheng Zhuo; Mingyu Cui; Jinpeng Li; Bo Yang; Yexing Du; Ziyang Ma; Xunying Liu; Ziyuan Wang; Ke Li; Shuai Fan; Kai Yu; Wei-Qiang Zhang; Guoguo Chen; Xie Chen; |
| 180 | MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. |
Haochen Xue; Feilong Tang; Ming Hu; Yexin Liu; Qidong Huang; Yulong Li; Chengzhi Liu; Zhongxing Xu; Chong Zhang; Chun-Mei Feng; Yutong Xie; Imran Razzak; Zongyuan Ge; Jionglong Su; Junjun He; Yu Qiao; |
| 181 | Can Indirect Prompt Injection Attacks Be Detected and Removed? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we investigate the feasibility of detecting and removing indirect prompt injection attacks, and we construct a benchmark dataset for evaluation. |
Yulin Chen; Haoran Li; Yuan Sui; Yufei He; Yue Liu; Yangqiu Song; Bryan Hooi; |
| 182 | Defense Against Prompt Injection Attack By Leveraging Attack Techniques Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we invert the intention of prompt injection methods to develop novel defense methods based on previous training-free attack methods, by repeating the attack process but with the original input instruction rather than the injected instruction. |
Yulin Chen; Haoran Li; Zihao Zheng; Dekai Wu; Yangqiu Song; Bryan Hooi; |
| 183 | Fixing Distribution Shifts of LLM Self-Critique Via On-Policy Self-Play Training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose an on-policy reinforcement learning framework to synchronize the reasoning and critique capabilities of language models. |
Rong Bao; Donglei Yu; Kai Fan; Minpeng Liao; |
| 184 | Intuitive Fine-Tuning: Towards Simplifying Alignment Into A Single Process Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we interpret SFT and PO with two sub-processes — *Preference Estimation* and *Transition Optimization* — defined at token level within the Markov Decision Process (MDP). |
Ermo Hua; Biqing Qi; Kaiyan Zhang; Kai Tian; Xingtai Lv; Ning Ding; Bowen Zhou; |
| 185 | LLMs + Persona-Plug = Personalized LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, this retrieval-based strategy may break the continuity of the user history and fail to capture the user’s overall styles and patterns, hence leading to sub-optimal performance. To address these challenges, we propose a novel personalized LLM model, PPlug. |
Jiongnan Liu; Yutao Zhu; Shuting Wang; Xiaochi Wei; Erxue Min; Yu Lu; Shuaiqiang Wang; Dawei Yin; Zhicheng Dou; |
| 186 | Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we aim to (1) explore whether KGs can make LLMs more trustworthy in an open-ended setting, and (2) conduct a comparative analysis to shed light on method design. |
Yuan Sui; Yufei He; Zifeng Ding; Bryan Hooi; |
| 187 | Diffusion Models Through A Global Lens: Are They Culturally Inclusive? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In our work, we introduce CULTDIFF benchmark, evaluating whether state-of-the-art diffusion models can generate culturally specific images spanning ten countries. |
Zahra Bayramli; Ayhan Suleymanzade; Na Min An; Huzama Ahmad; Eunsu Kim; Junyeong Park; James Thorne; Alice Oh; |
| 188 | Can’t See The Forest for The Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce MMSafeAware, the first comprehensive multimodal safety awareness benchmark designed to evaluate MLLMs across 29 safety scenarios with 1,500 carefully curated image-prompt pairs. |
Wenxuan Wang; Xiaoyuan Liu; Kuiyi Gao; Jen-tse Huang; Youliang Yuan; Pinjia He; Shuai Wang; Zhaopeng Tu; |
| 189 | Efficient and Accurate Prompt Optimization: The Benefit of Memory in Exemplar-Guided Reflection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose an Exemplar-Guided Reflection with Memory mechanism (ERM) to realize more efficient and accurate prompt optimization. |
Cilin Yan; Jingyun Wang; Lin Zhang; Ruihui Zhao; Xiaopu Wu; Kai Xiong; Qingsong Liu; Guoliang Kang; Yangyang Kang; |
| 190 | Pre-training Distillation for Large Language Models: A Design Space Exploration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we extend KD to the pre-training phase of LLMs, named pre-training distillation (PD). |
Hao Peng; Xin Lv; Yushi Bai; Zijun Yao; Jiajie Zhang; Lei Hou; Juanzi Li; |
| 191 | Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. |
Hao Peng; Yunjia Qi; Xiaozhi Wang; Zijun Yao; Bin Xu; Lei Hou; Juanzi Li; |
| 192 | Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While existing approaches have explored various decomposition strategies, they often lack effective mechanisms to identify and correct errors in intermediate reasoning steps, leading to cascading error propagation. To address these issues, we propose Table-Critic, a novel multi-agent framework that facilitates collaborative criticism and iterative refinement of the reasoning process until convergence to correct solutions. |
Peiying Yu; Guoxin Chen; Jingjing Wang; |
| 193 | Beyond Prompt Engineering: Robust Behavior Control in LLMs Via Steering Target Atoms Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety. |
Mengru Wang; Ziwen Xu; Shengyu Mao; Shumin Deng; Zhaopeng Tu; Huajun Chen; Ningyu Zhang; |
| 194 | Medical Graph RAG: Evidence-based Medical Large Language Model Via Graph Retrieval-Augmented Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce MedGraphRAG, a novel graph-based Retrieval-Augmented Generation (RAG) framework designed to enhance LLMs in generating evidence-based medical responses, improving safety and reliability with private medical data. |
Junde Wu; Jiayuan Zhu; Yunli Qi; Jingkun Chen; Min Xu; Filippo Menolascina; Yueming Jin; Vicente Grau; |
| 195 | SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper introduces Self-aware Knowledge Retrieval (SeaKR), a novel adaptive RAG model that extracts self-aware uncertainty of LLMs from their internal states. |
Zijun Yao; Weijian Qi; Liangming Pan; Shulin Cao; Linmei Hu; Liu Weichuan; Lei Hou; Juanzi Li; |
| 196 | Synthesizing and Adapting Error Correction Data for Mobile Large Language Model Applications Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we use LLMs to synthesize a high-quality dataset of error correction pairs to evaluate and improve LLMs for mobile applications. |
Yanxiang Zhang; Zheng Xu; Shanshan Wu; Yuanbo Zhang; Daniel Ramage; |
| 197 | Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we explore augmenting standard AI annotator systems with additional tools to improve performance on three challenging response domains: long-form factual, math and code tasks. |
Arduin Findeis; Floris Weers; Guoli Yin; Ke Ye; Ruoming Pang; Tom Gunter; |
| 198 | RAVEN: Robust Advertisement Video Violation Temporal Grounding Via Reinforcement Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose RAVEN, a novel framework that integrates curriculum reinforcement learning with multimodal large language models (MLLMs) to enhance reasoning and cognitive capabilities for violation detection. |
Deyi Ji; Yuekui Yang; Haiyang Wu; Shaoping Ma; Tianrun Chen; Lanyun Zhu; |
| 199 | Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits relate to each other. To address this gap, we study the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. |
Philipp Mondorf; Sondre Wold; Barbara Plank; |
| 200 | Attacking Vision-Language Computer Agents Via Pop-ups Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. |
Yanzhe Zhang; Tao Yu; Diyi Yang; |
| 201 | INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite this progress, the field currently encounters two main challenges: (1) the lack of a comprehensive LLM agent framework adaptable to a variety of financial tasks, and (2) the absence of standardized benchmarks and consistent datasets for assessing agent performance. To tackle these issues, we introduce InvestorBench, the first benchmark specifically designed for evaluating LLM-based agents in diverse financial decision-making contexts. |
Haohang Li; Yupeng Cao; Yangyang Yu; Shashidhar Reddy Javaji; Zhiyang Deng; Yueru He; Yuechen Jiang; Zining Zhu; K.p. Subbalakshmi; Jimin Huang; Lingfei Qian; Xueqing Peng; Jordan W. Suchow; Qianqian Xie; |
| 202 | Sliding Windows Are Not The End: Exploring Full Ranking with Long-Context Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we conduct a comprehensive study of long-context LLMs for ranking tasks in terms of efficiency and effectiveness. |
Wenhan Liu; Xinyu Ma; Yutao Zhu; Ziliang Zhao; Shuaiqiang Wang; Dawei Yin; Zhicheng Dou; |
| 203 | Insight Over Sight: Exploring The Vision-Knowledge Conflicts in Multimodal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper explores the problem of commonsense-level vision-knowledge conflict in Multimodal Large Language Models (MLLMs), where visual information contradicts the model’s internal commonsense knowledge. To study this issue, we introduce an automated framework, augmented with human-in-the-loop quality control, to generate inputs designed to simulate and evaluate these conflicts in MLLMs. |
Xiaoyuan Liu; Wenxuan Wang; Youliang Yuan; Jen-tse Huang; Qiuzhi Liu; Pinjia He; Zhaopeng Tu; |
| 204 | Exploring Compositional Generalization of Multimodal LLMs for Medical Imaging Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current research suggests that multi-task training outperforms single-task as different tasks can benefit each other, but they often overlook the internal relationships within these tasks. To analyze this phenomenon, we attempted to employ **compositional generalization** (CG), which refers to the models’ ability to understand novel combinations by recombining learned elements, as a guiding framework. |
Zhenyang Cai; Junying Chen; Rongsheng Wang; Weihong Wang; Yonglin Deng; Dingjie Song; Yize Chen; Zixu Zhang; Benyou Wang; |
| 205 | NvAgent: Automated Data Visualization from Natural Language Via Collaborative Agent Workflow Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, they often struggle with complex queries that require reasoning across multiple tables. To address this limitation, we propose a collaborative agent workflow, termed **nvAgent**, for NL2Vis. |
Geliang Ouyang; Jingyao Chen; Zhihe Nie; Yi Gui; Yao Wan; Hongyu Zhang; Dongping Chen; |
| 206 | Mitigating Visual Forgetting Via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We observe only a ~2% accuracy drop on MathVista’s test-hard subset, revealing that the model’s textual outputs dominate the subsequent reasoning process. Motivated by this, we propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning. |
Hai-Long Sun; Zhun Sun; Houwen Peng; Han-Jia Ye; |
| 207 | M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Multidimensional Multi-Agent Debate (M-MAD), a systematic LLM-based multi-agent framework for advanced LLM-as-a-judge MT evaluation. |
Zhaopeng Feng; Jiayuan Su; Jiamei Zheng; Jiahan Ren; Yan Zhang; Jian Wu; Hongwei Wang; Zuozhu Liu; |
| 208 | DRPruning: Efficient Large Language Model Pruning Through Distributionally Robust Optimization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Structured pruning reduces model size and speeds up inference but often causes uneven degradation across domains, leading to biased performance. To address this, we propose *DRPruning*, a method that dynamically adjusts the data distribution during training to restore balanced performance across heterogeneous and multi-tasking data. |
Hexuan Deng; Wenxiang Jiao; Xuebo Liu; Jing Li; Min Zhang; Zhaopeng Tu; |
| 209 | Culture Is Not Trivia: Sociocultural Theory for Cultural NLP Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, this leads to a number of recurring limitations: coarse national boundaries fail to capture nuanced differences that lay within them, limited coverage restricts datasets to only a subset of usually highly-represented cultures, and a lack of dynamicity results in static cultural benchmarks that do not change as culture evolves. In this position paper, we argue that these methodological limitations are symptomatic of a theoretical gap. |
Naitian Zhou; David Bamman; Isaac L. Bleaman; |
| 210 | DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing LLM-based review systems face significant challenges, including limited domain expertise, hallucinated reasoning, and a lack of structured evaluation. To address these limitations, we introduce DeepReview, a multi-stage framework designed to emulate expert reviewers by incorporating structured analysis, literature retrieval, and evidence-based argumentation. |
Minjun Zhu; Yixuan Weng; Linyi Yang; Yue Zhang; |
| 211 | LongReward: Improving Long-context Large Language Models with AI Feedback Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To this end, we propose LongReward, a novel method that utilizes an off-the-shelf LLM to provide rewards for long-context model responses from four human-valued dimensions: helpfulness, logicality, faithfulness, and completeness, each with a carefully designed assessment pipeline. |
Jiajie Zhang; Zhongni Hou; Xin Lv; Shulin Cao; Zhenyu Hou; Yilin Niu; Lei Hou; Yuxiao Dong; Ling Feng; Juanzi Li; |
| 212 | AXIS: Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, these agents often suffer from high latency and low reliability due to the extensive sequential UI interactions. To address this issue, we propose AXIS, a novel LLM-based agent framework that prioritizes actions through application programming interfaces (APIs) over UI actions. |
Junting Lu; Zhiyang Zhang; Fangkai Yang; Jue Zhang; Lu Wang; Chao Du; Qingwei Lin; Saravan Rajmohan; Dongmei Zhang; Qi Zhang; |
| 213 | Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: A central question in multilingual language modeling is whether large language models (LLMs) develop a universal concept representation, disentangled from specific languages. In this paper, we address this question by analyzing latent representations (latents) during a word-translation task in transformer-based LLMs. |
Clément Dumas; Chris Wendler; Veniamin Veselovsky; Giovanni Monea; Robert West; |
| 214 | Mitigating Selection Bias with Node Pruning and Auxiliary Options Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce two methods: Bias Node Pruning (BNP), which prunes parameters that contribute to selection bias, and Auxiliary Option Injection (AOI), which introduces an additional answer choice to reduce bias in both white-box and black-box settings. |
Hyeong Kyu Choi; Weijie Xu; Chi Xue; Stephanie Eckman; Chandan K. Reddy; |
| 215 | When People Are Floods: Analyzing Dehumanizing Metaphors in Immigration Discourse with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Grounded in qualitative social science research, we identify seven concepts evoked in immigration discourse (e.g., water or vermin). We propose and evaluate a novel technique that leverages both word-level and document-level signals to measure metaphor with respect to these concepts. |
Julia Mendelsohn; Ceren Budak; |
| 216 | Scalable Vision Language Model Training Via High Quality Data Curation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce SAIL-VL (ScAlable Vision Language Model TraIning via High QuaLity Data Curation), an open-source vision language model (VLM) series achieving state-of-the-art (SOTA) performance at 2B and 8B parameters. |
Hongyuan Dong; Zijian Kang; Weijie Yin; LiangXiao LiangXiao; ChaoFeng ChaoFeng; Ran Jiao; |
| 217 | Improving Automatic Evaluation of Large Language Models (LLMs) in Biomedical Relation Extraction Via LLMs-as-the-Judge Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our findings reveal that it happens mainly because relations extracted by LLMs do not adhere to any standard format. To address this, we propose structured output formatting for LLM-generated responses that helps LLM-Judges to improve their performance by about 15% (on average). |
Md Tahmid Rahman Laskar; Israt Jahan; Elham Dolatabadi; Chun Peng; Enamul Hoque; Jimmy Huang; |
| 218 | Judging The Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. |
Md Tahmid Rahman Laskar; Mohammed Saidul Islam; Ridwan Mahbub; Ahmed Masry; Mizanur Rahman; Amran Bhuiyan; Mir Tafseer Nayeem; Shafiq Joty; Enamul Hoque; Jimmy Huang; |
| 219 | Dynamic and Generalizable Process Reward Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Additionally, static and coarse-grained evaluation criteria struggle to adapt to complex process supervision. To tackle these challenges, we propose Dynamic and Generalizable Process Reward Modeling (DG-PRM), which features a reward tree to capture and store fine-grained, multi-dimensional reward criteria. |
Zhangyue Yin; Qiushi Sun; Zhiyuan Zeng; Qinyuan Cheng; Xipeng Qiu; Xuanjing Huang; |
| 220 | Tracing and Dissecting How LLMs Recall Factual Knowledge for Real World Questions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this study, we introduce a two-dimensional analysis framework—comprising token back-tracing and individual token decoding—to uncover how LLMs conduct factual knowledge recall. |
Yiqun Wang; Chaoqun Wan; Sile Hu; Yonggang Zhang; Xiang Tian; Yaowu Chen; Xu Shen; Jieping Ye; |
| 221 | PIC: Unlocking Long-Form Text Generation Capabilities of Large Language Models Via Position ID Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, beyond the focus on “input-long”, the ability to “output-long” is equally significant, yet it remains underexplored. To address this limitation, we propose a simple, efficient, and plug-in approach, Position ID Compression (PIC), to unlock the long-form text generation potential of LLMs. |
Haoran Que; Wenge Rong; |
| 222 | Do Large Language Models Have An English Accent? Evaluating and Improving The Naturalness of Multilingual LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Despite the importance of this issue, the naturalness of multilingual LLM outputs has received limited attention. In this paper, we address this gap by introducing novel automatic corpus-level metrics to assess the lexical and syntactic naturalness of LLM outputs in a multilingual context. |
Yanzhu Guo; Simone Conia; Zelin Zhou; Min Li; Saloni Potdar; Henry Xiao; |
| 223 | The Impossibility of Fair LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We analyze a variety of technical fairness frameworks and find inherent challenges in each that make the development of a fair LLM intractable. |
Jacy Reese Anthis; Kristian Lum; Michael Ekstrand; Avi Feller; Chenhao Tan; |
| 224 | Sample-Efficient Human Evaluation of Large Language Models Via Maximum Discrepancy Competition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper presents a sample-efficient human evaluation method for LLMs based on the principle of MAximum Discrepancy (MAD) competition. |
Kehua Feng; Keyan Ding; Tan Hongzhi; Kede Ma; Zhihua Wang; Shuangquan Guo; Cheng Yuzhou; Ge Sun; Guozhou Zheng; Qiang Zhang; Huajun Chen; |
| 225 | From Selection to Generation: A Survey of LLM-based Active Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. |
Yu Xia; Subhojyoti Mukherjee; Zhouhang Xie; Junda Wu; Xintong Li; Ryan Aponte; Hanjia Lyu; Joe Barrow; Hongjie Chen; Franck Dernoncourt; Branislav Kveton; Tong Yu; Ruiyi Zhang; Jiuxiang Gu; Nesreen K. Ahmed; Yu Wang; Xiang Chen; Hanieh Deilamsalehy; Sungchul Kim; Zhengmian Hu; Yue Zhao; Nedim Lipka; Seunghyun Yoon; Ting-Hao Kenneth Huang; Zichao Wang; Puneet Mathur; Soumyabrata Pal; Koyel Mukherjee; Zhehao Zhang; Namyong Park; Thien Huu Nguyen; Jiebo Luo; Ryan A. Rossi; Julian McAuley; |
| 226 | Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, prior research highlights that LLMs often overlook input-label mapping information in ICL, relying more on their pre-trained knowledge. To address this issue, we introduce In-Context Contrastive Decoding (ICCD), a novel method that emphasizes input-label mapping by contrasting the output distributions between positive and negative in-context examples. |
Keqin Peng; Liang Ding; Yuanxin Ouyang; Meng Fang; Yancheng Yuan; Dacheng Tao; |
| 227 | BOOKWORLD: From Novels to Interactive Agent Societies for Story Creation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we introduce BookWorld, a comprehensive system for constructing and simulating book-based multi-agent societies. |
Yiting Ran; Xintao Wang; Tian Qiu; Jiaqing Liang; Yanghua Xiao; Deqing Yang; |
| 228 | Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (**MHA2MLA**), which includes two key components: for *partial-RoPE*, we remove RoPE from dimensions of queries and keys that contribute less to the attention scores; for *low-rank approximation*, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. |
Tao Ji; Bin Guo; Yuanbin Wu; Qipeng Guo; Shenlixing Shenlixing; Chenzhan Chenzhan; Xipeng Qiu; Qi Zhang; Tao Gui; |
| 229 | Efficient Pretraining Data Selection for Language Models Via Multi-Actor Collaboration Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While various methods have been proposed to enhance data efficiency, limited research has addressed the inherent conflicts between these approaches to achieve optimal data selection for LM pretraining. To tackle this problem, we propose a multi-actor collaborative data selection mechanism. |
Tianyi Bai; Ling Yang; Zhen Hao Wong; Fupeng Sun; Xinlin Zhuang; Jiahui Peng; Chi Zhang; Lijun Wu; Qiu Jiantao; Wentao Zhang; Binhang Yuan; Conghui He; |
| 230 | FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing methods rely on traditional federated averaging of LoRA adapters, resulting in inexact updates. To address this, we propose Federated Exact LoRA, or FedEx-LoRA, which adds a residual error term to the pre-trained frozen weight matrix. |
Raghav Singhal; Kaustubh Ponkshe; Praneeth Vepakomma; |
| 231 | PaSa: An LLM Agent for Comprehensive Academic Paper Search Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce PaSa, an advanced Paper Search agent powered by large language models. |
Yichen He; Guanhua Huang; Peiyuan Feng; Yuan Lin; Yuchen Zhang; Hang Li; Weinan E; |
| 232 | Computation Mechanism Behind LLM Position Generalization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we show how LLMs enforce certain computational mechanisms to allow for the aforementioned tolerance in position perturbations. |
Chi Han; Heng Ji; |
| 233 | MathFusion: Enhancing Mathematical Problem-solving of LLM Through Instruction Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by human learning processes, where mathematical proficiency develops through systematic exposure to interconnected concepts, we introduce MathFusion, a novel framework that enhances mathematical reasoning through cross-problem instruction synthesis. |
Qizhi Pei; Lijun Wu; Zhuoshi Pan; Yu Li; Honglin Lin; Chenlin Ming; Xin Gao; Conghui He; Rui Yan; |
| 234 | On The Mutual Influence of Gender and Occupation in LLM Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We examine LLM representations of gender for first names in various occupational contexts to study how occupations and the gender perception of first names in LLMs influence each other mutually. |
Haozhe An; Connor Baumler; Abhilasha Sancheti; Rachel Rudinger; |
| 235 | Have We Designed Generalizable Structural Knowledge Promptings? Systematic Evaluation and Rethinking Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To provide a thorough evaluation, we introduce a novel multi-granular, multi-level benchmark called SUBARU, consisting of 9 different tasks with varying levels of granularity and difficulty. |
Yichi Zhang; Zhuo Chen; Lingbing Guo; Yajing Xu; Shaokai Chen; Mengshu Sun; Binbin Hu; Zhiqiang Zhang; Lei Liang; Wen Zhang; Huajun Chen; |
| 236 | When The LM Misunderstood The Human Chuckled: Analyzing Garden Path Effects in Humans and Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we try to answer two questions: 1. What makes garden-path sentences hard to understand for humans? 2. Do the same reasons make garden-path sentences hard for LLMs as well? |
Samuel Joseph Amouyal; Aya Meltzer-Asscher; Jonathan Berant; |
| 237 | TESS 2: A Large-Scale Generalist Diffusion Language Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We introduce TESS 2, a general instruction-following diffusion language model that outperforms contemporary instruction-tuned diffusion models, as well as matches and sometimes exceeds strong autoregressive (AR) models. |
Jaesung Tae; Hamish Ivison; Sachin Kumar; Arman Cohan; |
| 238 | Rethinking Repetition Problems of LLMs in Code Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we formally define structural repetition and propose an efficient decoding approach called RPG, which stands for Repetition Penalization based on Grammar, to alleviate the repetition problems in code generation for LLMs. |
Yihong Dong; Yuchen Liu; Xue Jiang; Bin Gu; Zhi Jin; Ge Li; |
| 239 | KiRAG: Knowledge-Driven Iterative Retriever for Enhancing Retrieval-Augmented Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, their retrieval processes face two key challenges: (1) they can be disrupted by irrelevant documents or factually inaccurate chain-of-thoughts; (2) their retrievers are not designed to dynamically adapt to the evolving information needs in multi-step reasoning, making it difficult to identify and retrieve the missing information required at each iterative step. Therefore, we propose KiRAG, which uses a knowledge-driven iterative retriever model to enhance the retrieval process of iRAG. |
Jinyuan Fang; Zaiqiao Meng; Craig MacDonald; |
| 240 | Substance Over Style: Evaluating Proactive Conversational Coaching Agents Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In contrast, coaching presents unique challenges with initially undefined goals that evolve through multi-turn interactions, subjective evaluation criteria, and mixed-initiative dialogue. In this work, we describe and implement five multi-turn coaching agents that exhibit distinct conversational styles, and evaluate them through a user study, collecting first-person feedback on 155 conversations. |
Vidya Srinivas; Xuhai Xu; Xin Liu; Kumar Ayush; Isaac Galatzer-Levy; Shwetak Patel; Daniel McDuff; Tim Althoff; |
| 241 | Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose directly fine-tuning LLMs to predict response distributions by leveraging unique structural characteristics of survey data. |
Joseph Suh; Erfan Jahanparast; Suhong Moon; Minwoo Kang; Serina Chang; |
| 242 | White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Based on our observations, we propose **Mitigation via Selective Rewrite (MSR)**, a novel bias mitigation strategy that leverages an agency classifier to identify and selectively revise parts of generated texts that demonstrate communal traits. |
Yixin Wan; Kai-Wei Chang; |
| 243 | The Male CEO and The Female Assistant: Evaluation and Mitigation of Gender Biases in Text-To-Image Generation of Dual Subjects Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, significant biases remain when generating images with more than one person. To systematically evaluate this, we propose the **Paired Stereotype Test (PST)** framework, which queries T2I models to depict two individuals assigned with male-stereotyped and female-stereotyped social identities, respectively (e.g., “a CEO” and “an Assistant”). |
Yixin Wan; Kai-Wei Chang; |
| 244 | One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. |
Hyungjoo Chae; Dongjin Kang; Jihyuk Kim; Beong-woo Kwak; Sunghyun Park; Haeju Park; Jinyoung Yeo; Moontae Lee; Kyungjae Lee; |
| 245 | World Modeling Makes A Better Planner: Dual Preference Optimization for Embodied Task Planning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose Dual Preference Optimization (D2PO), a new learning framework that jointly optimizes state prediction and action selection through preference learning, enabling LVLMs to understand environment dynamics for better planning. |
Siyin Wang; Zhaoye Fei; Qinyuan Cheng; Shiduo Zhang; Panpan Cai; Jinlan Fu; Xipeng Qiu; |
| 246 | ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. |
Junjie Ye; Zhengyin Du; Xuesong Yao; Weijian Lin; Yufei Xu; Zehui Chen; Zaiyuan Wang; Sining Zhu; Zhiheng Xi; Siyu Yuan; Tao Gui; Qi Zhang; Xuanjing Huang; Jiecao Chen; |
| 247 | Personalized Generation In Large Model Era: A Survey Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We conceptualize PGen from a unified perspective, systematically formalizing its key components, core objectives, and abstract workflows. Based on this unified perspective, we propose a multi-level taxonomy, offering an in-depth review of technical advancements, commonly used datasets, and evaluation metrics across multiple modalities, personalized contexts, and tasks. |
Yiyan Xu; Jinghao Zhang; Alireza Salemi; Xinting Hu; Wenjie Wang; Fuli Feng; Hamed Zamani; Xiangnan He; Tat-Seng Chua; |
| 248 | DEEPER Insight Into Your User: Directed Persona Refinement for Dynamic Persona Modeling Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, existing methods—whether regenerating personas or incrementally extending them with new behaviors—often fail to achieve sustained improvements in persona quality or future behavior prediction accuracy. To address this, we propose DEEPER, a novel approach for dynamic persona modeling that enables continual persona optimization. |
Aili Chen; Chengyu Du; Jiangjie Chen; Jinghan Xu; Yikai Zhang; Siyu Yuan; Zulong Chen; Liangyue Li; Yanghua Xiao; |
| 249 | Dynamic Scaling of Unit Tests for Code Reward Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our pioneer experiment reveals a positive correlation between the number of unit tests and reward signal quality, with greater benefits observed in more challenging problems. Based on these insights, we propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling. |
Zeyao Ma; Xiaokang Zhang; Jing Zhang; Jifan Yu; Sijia Luo; Jie Tang; |
| 250 | Modeling Uncertainty in Composed Image Retrieval Via Probabilistic Embeddings Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While metric learning methods have shown promise, they rely on deterministic point embeddings that fail to capture the inherent uncertainty in the input data, in which user intentions may be imprecisely specified or open to multiple interpretations. We address this challenge by reformulating CIR through our proposed Composed Probabilistic Embedding (CoPE) framework, which represents both queries and targets as Gaussian distributions in latent space rather than fixed points. |
Haomiao Tang; Jinpeng Wang; Yuang Peng; GuangHao Meng; Ruisheng Luo; Bin Chen; Long Chen; Yaowei Wang; Shu-Tao Xia; |
| 251 | SEUF: Is Unlearning One Expert Enough for Mixture-of-Experts LLMs? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our pilot study shows that the dynamic routing nature of MoE LLMs introduces unique challenges, leading to excessive forgetting, uncontrolled knowledge erasure and substantial utility drops when existing unlearning methods are applied. To address this, we propose a novel Selected-Expert Unlearning Framework (SEUF). |
Haomin Zhuang; Yihua Zhang; Kehan Guo; Jinghan Jia; Gaowen Liu; Sijia Liu; Xiangliang Zhang; |
| 252 | BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we tackle the challenge of embedding realistic human personality traits into LLMs. |
Wenkai Li; Jiarui Liu; Andy Liu; Xuhui Zhou; Mona T. Diab; Maarten Sap; |
| 253 | UnSeenTimeQA: Time-Sensitive Question-Answering Beyond LLMs’ Memorization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces UnSeenTimeQA, a novel data contamination-free time-sensitive question-answering (TSQA) benchmark. |
Md Nayem Uddin; Amir Saeidi; Divij Handa; Agastya Seth; Tran Cao Son; Eduardo Blanco; Steven Corman; Chitta Baral; |
| 254 | Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Previous approaches to learning from errors synthesize training data by solely extrapolating from isolated bad cases, thereby failing to generalize the extensive patterns inherent within these cases. This paper presents Self-Error-Instruct (SEI), a framework that addresses these model weaknesses and synthesizes more generalized targeted training data. |
Erxin Yu; Jing Li; Ming Liao; Qi Zhu; Boyang Xue; Minghui Xu; Baojun Wang; Lanqing Hong; Fei Mi; Lifeng Shang; |
| 255 | GPT-4 As A Homework Tutor Can Improve Student Engagement and Learning Outcomes Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work contributes to the scarce empirical literature on LLM-based interactive homework in real-world educational settings and offers a practical, scalable solution to improve homework in schools. |
Alessandro Vanzo; Sankalan Pal Chowdhury; Mrinmaya Sachan; |
| 256 | Can Community Notes Replace Professional Fact-Checkers? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the extent and nature of dependencies between fact-checking and *helpful* community notes remain unclear. To address these questions, we use language models to annotate a large corpus of Twitter/X community notes with attributes such as topic, cited sources, and whether they refute claims tied to broader misinformation narratives. |
Nadav Borenstein; Greta Warren; Desmond Elliott; Isabelle Augenstein; |
| 257 | Masks Can Be Learned As An Alternative to Experts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we investigate how to sparsify a pre-trained dense large language model into a mixture-of-experts (MoE) architecture for faster inference. |
Peiyu Liu; Tianwen Wei; Bo Zhu; Xin Zhao; Shuicheng Yan; |
| 258 | Decoding By Contrasting Knowledge: Enhancing Large Language Model Confidence on Edited Facts Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a novel approach termed Decoding by Contrasting Knowledge (DeCK). |
Baolong Bi; Shenghua Liu; Lingrui Mei; Yiwei Wang; Junfeng Fang; Pengliang Ji; Xueqi Cheng; |
| 259 | KnowShiftQA: How Robust Are RAG Systems When Textbook Knowledge Shifts in K-12 Education? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, discrepancies between these textbooks and the parametric knowledge inherent in Large Language Models (LLMs) can undermine the effectiveness of RAG systems. To systematically investigate RAG system robustness against such knowledge discrepancies, we introduce KnowShiftQA. |
Tianshi Zheng; Weihan Li; Jiaxin Bai; Weiqi Wang; Yangqiu Song; |
| 260 | “Give Me BF16 or Give Me Death”? Accuracy-Performance Trade-Offs in LLM Quantization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3. |
Eldar Kurtic; Alexandre Noll Marques; Shubhra Pandit; Mark Kurtz; Dan Alistarh; |
| 261 | HumT DumT: Measuring and Controlling Human-like Language in LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce DumT, a method using HumT to systematically control and reduce the degree of human-like tone while preserving model performance. |
Myra Cheng; Sunny Yu; Dan Jurafsky; |
| 262 | Enhancing Human Evaluation in Machine Translation with Comparative Judgement Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Human evaluation is crucial for assessing rapidly evolving language models but is influenced by annotator proficiency and task design. This study explores the integration of comparative judgment into human annotation for machine translation (MT) and evaluates three annotation setups—point-wise Multidimensional Quality Metrics (MQM), side-by-side (S×S) MQM, and its simplified version S×S relative ranking (RR). |
Yixiao Song; Parker Riley; Daniel Deutsch; Markus Freitag; |
| 263 | Enhancing Safe and Controllable Protein Generation Via Knowledge Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: These concerns underscore critical biosafety and ethical challenges. To address these issues, we propose a Knowledge-guided Preference Optimization (KPO) framework that integrates prior knowledge via a Protein Safety Knowledge Graph. |
Yuhao Wang; Keyan Ding; Kehua Feng; Zeyuan Wang; Ming Qin; Xiaotong Li; Qiang Zhang; Huajun Chen; |
| 264 | The AI Gap: How Socioeconomic Status Affects Language Technology Interactions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We find systematic differences across SES groups in language technology usage (i.e., frequency, performed tasks), interaction styles, and topics. |
Elisa Bassignana; Amanda Cercas Curry; Dirk Hovy; |
| 265 | SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While recent efforts explore continuous-space reasoning, they often require full-model fine-tuning and suffer from catastrophic forgetting, limiting their applicability to state-of-the-art LLMs that already perform well in zero-shot settings with a proper instruction. To address this challenge, we propose a novel approach for continuous-space reasoning that does not require modifying the LLM. |
Yige Xu; Xu Guo; Zhiwei Zeng; Chunyan Miao; |
| 266 | Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While these metrics are considered indicative (even if imperfect) of human evaluation for English, their suitability for other languages remains unclear. To address this, we systematically assess evaluation metrics for generation, both n-gram-based and neural-based, to determine their effectiveness across languages and tasks. |
Itai Mondshine; Tzuf Paz-Argaman; Reut Tsarfaty; |
| 267 | Towards Robust and Efficient Federated Low-Rank Adaptation with Heterogeneous Clients Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In response, we introduce LoRA-A^2 (Low Rank Adaptation with Alternating freeze and Adaptive rank selection), which demonstrates robustness in challenging settings with low ranks and high data heterogeneity. |
Jabin Koo; Minwoo Jang; Jungseul Ok; |
| 268 | Attention Entropy Is A Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the underlying reasons and potential mitigations are unclear. In this work, we provide a detailed analysis of this issue and identify that unusually high attention entropy can be a key factor. |
Zhisong Zhang; Yan Wang; Xinting Huang; Tianqing Fang; Hongming Zhang; Chenlong Deng; Shuaiyi Li; Dong Yu; |
| 269 | When to Speak, When to Abstain: Contrastive Decoding with Abstention Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To investigate this challenge, we first present a controlled testbed featuring four distinct knowledge access scenarios, including the aforementioned edge case, revealing that conventional LLM usage exhibits insufficient robustness in handling all instances. Addressing this limitation, we propose Contrastive Decoding with Abstention (CDA), a novel training-free decoding method that allows LLMs to generate responses when relevant knowledge is available and to abstain otherwise. |
Hyuhng Joon Kim; Youna Kim; Sang-goo Lee; Taeuk Kim; |
| 270 | Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While some prior works have explored this issue in the context of LLMs, it presents a unique challenge for MLLMs due to the entangled nature of knowledge across modalities, making comprehensive unlearning more difficult. To address this challenge, we propose Modality Aware Neuron Unlearning (MANU), a novel unlearning framework for MLLMs designed to selectively clip neurons based on their relative importance to the targeted forget data, curated for different modalities. |
Zheyuan Liu; Guangyao Dou; Xiangchi Yuan; Chunhui Zhang; Zhaoxuan Tan; Meng Jiang; |
| 271 | Disentangling Biased Knowledge from Reasoning in Large Language Models Via Machine Unlearning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While these methods have experimentally proven effective, they can still be sub-optimum in fully disentangling biases from reasoning. To address this gap, we propose Selective Disentanglement Unlearning (SDU), a novel unlearning framework that selectively removes biased knowledge while preserving reasoning capabilities. |
Zheyuan Liu; Suraj Maharjan; Fanyou Wu; Rahil Parikh; Belhassen Bayar; Srinivasan H. Sengamedu; Meng Jiang; |
| 272 | Efficient Long Context Language Model Retrieval with Compression Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Specifically, we propose a new compression approach tailored for LCLM retrieval, which is trained to maximize the retrieval performance while minimizing the length of the compressed passages. |
Minju Seo; Jinheon Baek; Seongyun Lee; Sung Ju Hwang; |
| 273 | Memorizing Is Not Enough: Deep Knowledge Injection Through Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper proposes a four-tier knowledge injection framework that systematically defines the levels of knowledge injection: memorization, retrieval, reasoning, and association. |
Ruoxi Xu; Yunjie Ji; Boxi Cao; Yaojie Lu; Hongyu Lin; Xianpei Han; Ben He; Yingfei Sun; Xiangang Li; Le Sun; |
| 274 | Rethinking Reward Model Evaluation Through The Lens of Reward Overoptimization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, existing benchmarks for reward models show a weak correlation with the performance of optimized policies, suggesting that they fail to accurately assess the true capabilities of RMs. To bridge this gap, we explore several evaluation designs through the lens of reward overoptimization, i.e., a phenomenon that captures both how well the reward model aligns with human preferences and the dynamics of the learning signal it provides to the policy. |
Sunghwan Kim; Dongjin Kang; Taeyoon Kwon; Hyungjoo Chae; Dongha Lee; Jinyoung Yeo; |
| 275 | ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although temporal reasoning has raised increasing research attention, comprehensive testing of Allen’s interval relations (e.g., before, after, during), a fundamental framework for temporal relationships, remains underexplored. To fill this gap, we present ChronoSense, a new benchmark for evaluating LLMs’ temporal understanding. |
Duygu Sezen Islakoglu; Jan-Christoph Kalo; |
| 276 | Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We hypothesize that this issue arises because existing VLMs are not explicitly trained to generate texts that are accurately grounded in fine-grained image details. To enhance visual feedback during VLM training, we propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details and aligning them with corresponding text tokens. |
Shengguang Wu; Fan-Yun Sun; Kaiyue Wen; Nick Haber; |
| 277 | Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: With the exponential growth of research facilitated by modern technology and improved accessibility, scientific discoveries have become increasingly fragmented within and across fields. This makes it challenging to assess the significance, novelty, incremental findings, and equivalent ideas between related works, particularly those from different research communities. |
Priyanka Kargupta; Ishika Agarwal; Tal August; Jiawei Han; |
| 278 | Synergizing Unsupervised Episode Detection with LLMs for Large-Scale News Events Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper introduces a novel task, **episode detection**, which identifies episodes within a news corpus of key event articles. |
Priyanka Kargupta; Yunyi Zhang; Yizhu Jiao; Siru Ouyang; Jiawei Han; |
| 279 | Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. |
Priyanka Kargupta; Runchu Tian; Jiawei Han; |
| 280 | TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Additionally, these approaches fail to account for the multi-faceted nature of scientific literature, where a single research paper may contribute to multiple dimensions (e.g., methodology, new tasks, evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a framework that dynamically adapts an LLM-generated taxonomy to a given corpus across multiple dimensions. |
Priyanka Kargupta; Nan Zhang; Yunyi Zhang; Rui Zhang; Prasenjit Mitra; Jiawei Han; |
| 281 | MDCure: A Scalable Pipeline for Multi-Document Instruction-Following Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While LLMs have improved at processing long inputs, MD contexts still present unique difficulties, including management of inter-document dependencies, redundancy, and incoherent structures. To address this challenge, we introduce MDCure, a scalable and effective instruction data generation framework to enhance the MD capabilities of LLMs without the computational cost of pre-training or reliance on human-annotated data. |
Gabrielle Kaili-May Liu; Bowen Shi; Avi Caciularu; Idan Szpektor; Arman Cohan; |
| 282 | HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose Hierarchical Self-Contrastive Rewarding (HSCR), a novel approach that addresses two critical challenges in Med-VLM alignment: 1) Cost-effective generation of high-quality preference data; 2) Capturing nuanced and context-aware preferences for improved alignment. |
Songtao Jiang; Yan Zhang; Yeying Jin; Zhihang Tang; Yangyang Wu; Yang Feng; Jian Wu; Zuozhu Liu; |
| 283 | Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Motivated by Dual Process Theory, which distinguishes between instinctive and deliberate cognitive modes in human reasoning, we propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions, combining fast intuitive judgments with deliberate analytical reasoning to enhance the vision-language reasoning capability of the MLLM. |
Songtao Jiang; Chenyi Zhou; Yan Zhang; Yeying Jin; Zuozhu Liu; |
| 284 | VLSBench: Unveiling Visual Leakage in Multimodal Safety Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Besides, we empirically compare textual and multimodal alignment methods on VLSBench and find that textual alignment is effective enough for multimodal safety scenarios with VSIL, while multimodal alignment is preferable for safety scenarios without VSIL. |
Xuhao Hu; Dongrui Liu; Hao Li; Xuanjing Huang; Jing Shao; |
| 285 | Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While great success has been achieved in building vision models with Contrastive Language-Image Pre-training (CLIP) over Internet-scale image-text pairs, building transferable Graph Neural Networks (GNNs) with CLIP pipeline is challenging because of the scarcity of labeled data and text supervision, different levels of downstream tasks, and the conceptual gaps between domains. In this work, to address these issues, we propose a multi-modal prompt learning paradigm to effectively adapt pre-trained GNN to downstream tasks and data, given only a few semantically labeled samples, each with extremely weak text supervision. |
Zihao Li; Lecheng Zheng; Bowen Jin; Dongqi Fu; Baoyu Jing; Yikun Ban; Jingrui He; Jiawei Han; |
| 286 | A Strategic Coordination Framework of Small LMs Matches Large LMs in Data Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Inspired by collaborative human processes (e.g., peer review), we propose GRA, a framework involving multiple small LMs that aggregates specialized roles across them to achieve the iterative refinement and quality control typically handled by a single large LM. |
Xin Gao; Qizhi Pei; Zinan Tang; Yu Li; Honglin Lin; Jiang Wu; Lijun Wu; Conghui He; |
| 287 | CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we address the unique challenges of unlearning in CLIP, a prominent multimodal model that aligns visual and textual representations. |
Tianyu Yang; Lisen Dai; Xiangqi Wang; Minhao Cheng; Yapeng Tian; Xiangliang Zhang; |
| 288 | HiddenDetect: Detecting Jailbreak Attacks Against Multimodal Large Language Models Via Monitoring Hidden States Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. |
Yilei Jiang; Xinyan Gao; Tianshuo Peng; Yingshui Tan; Xiaoyong Zhu; Bo Zheng; Xiangyu Yue; |
| 289 | REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval: (i) multi-modal documents, (ii) enhanced difficulty, (iii) Realistic-RAG queries and (iv) accurate labeling. |
Navve Wasserman; Roi Pony; Oshri Naparstek; Adi Raz Goldfarb; Eli Schwartz; Udi Barzelay; Leonid Karlinsky; |
| 290 | TestNUC: Enhancing Test-Time Computing Approaches and Scaling Through Neighboring Unlabeled Data Consistency Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work introduces a novel, linearly scaling approach, TestNUC, that improves test-time predictions by leveraging the local consistency of neighboring unlabeled data: it classifies an input instance by considering not only the model’s prediction on that instance but also on neighboring unlabeled instances. |
Henry Peng Zou; Zhengyao Gu; Yue Zhou; Yankai Chen; Weizhi Zhang; Liancheng Fang; Yibo Wang; Yangning Li; Kay Liu; Philip S. Yu; |
| 291 | Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Recently, o1-like models have drawn significant attention, as these models produce long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the quality of these long CoTs and measure the critique abilities of existing LLMs on them, we introduce DeltaBench, which includes long CoTs generated by different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., math, code, general reasoning), to measure the ability to detect errors in long CoT reasoning. |
Yancheng He; Shilong Li; Jiaheng Liu; Weixun Wang; Xingyuan Bu; Ge Zhang; Z.y. Peng; Zhaoxiang Zhang; Zhicheng Zheng; Wenbo Su; Bo Zheng; |
| 292 | Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark to evaluate the factuality of LLMs in answering short questions. Chinese SimpleQA has five main properties (i.e., Chinese, diverse, high-quality, static, and easy-to-evaluate). |
Yancheng He; Shilong Li; Jiaheng Liu; Yingshui Tan; Weixun Wang; Hui Huang; Xingyuan Bu; Hangyu Guo; Chengwei Hu; Boren Zheng; Zhuoran Lin; Dekai Sun; Zhicheng Zheng; Wenbo Su; Bo Zheng; |
| 293 | Sharper and Faster Mean Better: Towards More Efficient Vision-Language Model for Hour-scale Long Video Understanding Highlight: Despite existing multimodal language models showing impressive performance on the video understanding task, extremely long videos still pose significant challenges to a language model’s context length, memory consumption, and computational complexity. To address these issues, we propose a vision-language model named Sophia for long video understanding, which can efficiently handle hour-scale long videos. |
Daoze Zhang; Yuze Zhao; Jintao Huang; Yingda Chen; |
| 294 | CritiQ: Mining Data Quality Criteria from Human Preferences Highlight: We introduce CritiQ, a novel data selection method that automatically mines data-quality criteria from human preferences using only ~30 human-annotated pairs and performs efficient data selection. |
Honglin Guo; Kai Lv; Qipeng Guo; Tianyi Liang; Zhiheng Xi; Demin Song; Qiuyinzhe Zhang; Yu Sun; Kai Chen; Xipeng Qiu; Tao Gui; |
| 295 | Representation Bending for Large Language Model Safety Highlight: This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. |
Ashkan Yousefpour; Taeheon Kim; Ryan Sungmo Kwon; Seungbeen Lee; Wonje Jeung; Seungju Han; Alvin Wan; Harrison Ngan; Youngjae Yu; Jonghyun Choi; |
| 296 | Crowdsource, Crawl, or Generate? Creating SEA-VL, A Multicultural Vision-Language Dataset for Southeast Asia Highlight: Despite Southeast Asia’s (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant high-quality datasets for SEA languages. |
Samuel Cahyawijaya; Holy Lovenia; Joel Ruben Antony Moniz; Tack Hwa Wong; Mohammad Rifqi Farhansyah; Thant Thiri Maung; Frederikus Hudi; David Anugraha; Muhammad Ravi Shulthan Habibi; Muhammad Reza Qorib; Amit Agarwal; Joseph Marvin Imperial; Hitesh Laxmichand Patel; Vicky Feliren; Bahrul Ilmi Nasution; Manuel Antonio Rufino; Genta Indra Winata; Rian Adam Rajagede; Carlos Rafael Catalan; Mohamed Fazli Mohamed Imam; Priyaranjan Pattnayak; Salsabila Zahirah Pranida; Kevin Pratama; Yeshil Bangera; Adisai Na-Thalang; Patricia Nicole Monderin; Yueqi Song; Christian Simon; Lynnette Hui Xian Ng; Richardy Lobo Sapan; Taki Hasan Rafi; Bin Wang; Supryadi; Kanyakorn Veerakanjana; Piyalitt Ittichaiwong; Matthew Theodore Roque; Karissa Vincentio; Takdanai Kreangphet; Phakphum Artkaew; Kadek Hendrawan Palgunadi; Yanzhi Yu; Rochana Prih Hastuti; William Nixon; Mithil Bangera; Adrian Xuan Wei Lim; Aye Hninn Khine; Hanif Muhammad Zhafran; Teddy Ferdinan; Audra Aurora Izzani; Ayushman Singh; Evan Evan; Jauza Akbar Krito; Michael Anugraha; Fenal Ashokbhai Ilasariya; Haochen Li; John Amadeo Daniswara; Filbert Aurelian Tjiaranata; Eryawan Presma Yulianrifat; Can Udomcharoenchaikit; Fadil Risdian Ansori; Mahardika Krisna Ihsani; Giang Nguyen; Anab Maulana Barik; Dan John Velasco; Rifo Ahmad Genadi; Saptarshi Saha; Chengwei Wei; Isaiah Edri W. Flores; Kenneth Chen Ko Han; Anjela Gail D. Santos; Wan Shen Lim; Kaung Si Phyo; Tim Santos; Meisyarah Dwiastuti; Jiayun Luo; Jan Christian Blaise Cruz; Ming Shan Hee; Ikhlasul Akmal Hanif; M.Alif Al Hakim; Muhammad Rizky Sya’ban; Kun Kerdthaisong; Lester James Validad Miranda; Fajri Koto; Tirana Noor Fatyanosa; Alham Fikri Aji; Jostin Jerico Rosal; Jun Kevin; Robert Wijaya; Onno P. Kampman; Ruochen Zhang; Börje F. Karlsson; Peerat Limkonchotiwat; |
| 297 | DiffPO: Diffusion-styled Preference Optimization for Inference Time Alignment of Large Language Models Highlight: In this paper, we propose a novel approach, Diffusion-styled Preference Optimization (DiffPO), which provides an efficient and policy-agnostic solution for aligning LLMs with humans. |
Ruizhe Chen; Wenhao Chai; Zhifei Yang; Xiaotian Zhang; Ziyang Wang; Tony Quek; Joey Tianyi Zhou; Soujanya Poria; Zuozhu Liu; |
| 298 | Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine Highlight: Large Language Models (LLMs) such as ChatGPT demonstrate significant potential in the medical domain and are often evaluated using multiple-choice questions (MCQs) modeled on exams like the USMLE. |
Maxime Griot; Jean Vanderdonckt; Demet Yuksel; Coralie Hemptinne; |
| 299 | Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding Highlight: In this paper, we discover that code-switching in red-teaming queries, a practice common in natural language, can effectively elicit undesirable behaviors of LLMs. |
Haneul Yoo; Yongjin Yang; Hwaran Lee; |
| 300 | LoGU: Long-form Generation with Uncertainty Expressions Highlight: In this work, we introduce the task of Long-form Generation with Uncertainty (LoGU). |
Ruihan Yang; Caiqi Zhang; Zhisong Zhang; Xinting Huang; Sen Yang; Nigel Collier; Dong Yu; Deqing Yang; |
| 301 | Distilling An End-to-End Voice Assistant Without Instruction Training Data Highlight: Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. |
William Barr Held; Yanzhe Zhang; Weiyan Shi; Minzhi Li; Michael J Ryan; Diyi Yang; |
| 302 | SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLMs Highlight: To this end, we introduce SynthesizeMe, an approach to inducing synthetic user personas from user interactions for personalized reward modeling. |
Michael J Ryan; Omar Shaikh; Aditri Bhagirath; Daniel Frees; William Barr Held; Diyi Yang; |
| 303 | Movie101v2: Improved Movie Narration Benchmark Highlight: Unlike standard video captioning, movie narration involves not only describing key visual details but also inferring plots that unfold across multiple movie shots, presenting distinct and complex challenges. To advance this field, we introduce Movie101v2, a large-scale, bilingual dataset with enhanced data quality specifically designed for movie narration. |
Zihao Yue; Yepeng Zhang; Ziheng Wang; Qin Jin; |
| 304 | Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation? Highlight: In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. |
Leyi Pan; Aiwei Liu; Shiyu Huang; Yijian Lu; Xuming Hu; Lijie Wen; Irwin King; Philip S. Yu; |
| 305 | Hybrid Preferences: Learning to Route Instances for Human Vs. AI Feedback Highlight: In this work, we introduce HyPER, a Hybrid Preference routER that defers an annotation to either humans or LMs, achieving better annotation quality while reducing the cost of human-only annotation. |
Lester James Validad Miranda; Yizhong Wang; Yanai Elazar; Sachin Kumar; Valentina Pyatkin; Faeze Brahman; Noah A. Smith; Hannaneh Hajishirzi; Pradeep Dasigi; |
| 306 | OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models Highlight: Our work identifies the key ingredients for building a top-tier code LLM: optimized heuristic rules for data cleaning and deduplication, effective recall of code-related text corpora, and high-quality synthetic data for both the annealing and supervised fine-tuning stages. By offering this level of openness, we aim to broaden access to all aspects of a top-tier code LLM, with OpenCoder serving as both a powerful model and an open foundation to accelerate research and enable reproducible advancements in code intelligence. |
Siming Huang; Tianhao Cheng; Jason Klein Liu; Weidi Xu; Jiaran Hao; Liuyihan Song; Yang Xu; Jian Yang; Jiaheng Liu; Chenchen Zhang; Linzheng Chai; Ruifeng Yuan; Xianzhen Luo; Qiufeng Wang; YuanTao Fan; Qingfu Zhu; Zhaoxiang Zhang; Yang Gao; Jie Fu; Qian Liu; Houyi Li; Ge Zhang; Yuan Qi; Xu Yinghui; Wei Chu; Zili Wang; |
| 307 | OS-Genesis: Automating GUI Agent Trajectory Construction Via Reverse Task Synthesis Highlight: Further, these approaches exhibit significant gaps between the generated data and online environments, alongside limited data diversity. To address these issues, we introduce OS-Genesis, a novel GUI data synthesis pipeline that overcomes the challenges above. |
Qiushi Sun; Kanzhi Cheng; Zichen Ding; Chuanyang Jin; Yian Wang; Fangzhi Xu; Zhenyu Wu; Chengyou Jia; Liheng Chen; Zhoumianze Liu; Ben Kao; Guohao Li; Junxian He; Yu Qiao; Zhiyong Wu; |
| 308 | WebWalker: Benchmarking LLMs in Web Traversal Highlight: However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. |
Jialong Wu; Wenbiao Yin; Yong Jiang; Zhenglin Wang; Zekun Xi; Runnan Fang; Linhai Zhang; Yulan He; Deyu Zhou; Pengjun Xie; Fei Huang; |
| 309 | TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition Highlight: Although promising, these additional components often add complexity to the training and inference process, contravening the efficiency that PEFT is designed to deliver. Considering this, we introduce an innovative PEFT method, **TeamLoRA**, consisting of a collaboration and competition module for LoRA experts, thus achieving the right balance of effectiveness and efficiency: **(i)** For *collaboration*, we introduce a novel knowledge sharing and organization mechanism designed to optimize hierarchical learning while enhancing the efficiency of model training and inference. |
Tianwei Lin; Jiang Liu; Wenqiao Zhang; Yang Dai; Haoyuan Li; Zhelun Yu; Wanggui He; Juncheng Li; Jiannan Guo; Hao Jiang; Siliang Tang; Yueting Zhuang; |
| 310 | Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models Highlight: Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose four dimensions to evaluate data quality: professionalism, readability, reasoning, and cleanliness. |
Xinlin Zhuang; Jiahui Peng; Ren Ma; Yinfan Wang; Tianyi Bai; Xingjian Wei; Qiu Jiantao; Chi Zhang; Ying Qian; Conghui He; |
| 311 | Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models Highlight: This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in LLMs. |
Zixiang Xu; Yanbo Wang; Yue Huang; Xiuying Chen; Jieyu Zhao; Meng Jiang; Xiangliang Zhang; |
| 312 | BRIGHTER: BRIdging The Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages Highlight: In this paper, we present BRIGHTER, a collection of multi-labeled, emotion-annotated datasets in 28 different languages and across several domains. |
Shamsuddeen Hassan Muhammad; Nedjma Ousidhoum; Idris Abdulmumin; Jan Philip Wahle; Terry Ruas; Meriem Beloucif; Christine de Kock; Nirmal Surange; Daniela Teodorescu; Ibrahim Said Ahmad; David Ifeoluwa Adelani; Alham Fikri Aji; Felermino D. M. A. Ali; Ilseyar Alimova; Vladimir Araujo; Nikolay Babakov; Naomi Baes; Ana-Maria Bucur; Andiswa Bukula; Guanqun Cao; Rodrigo Tufiño; Rendi Chevi; Chiamaka Ijeoma Chukwuneke; Alexandra Ciobotaru; Daryna Dementieva; Murja Sani Gadanya; Robert Geislinger; Bela Gipp; Oumaima Hourrane; Oana Ignat; Falalu Ibrahim Lawan; Rooweither Mabuya; Rahmad Mahendra; Vukosi Marivate; Alexander Panchenko; Andrew Piper; Charles Henrique Porto Ferreira; Vitaly Protasov; Samuel Rutunda; Manish Shrivastava; Aura Cristina Udrea; Lilian Diana Awuor Wanzare; Sophie Wu; Florian Valentin Wunderlich; Hanif Muhammad Zhafran; Tianhui Zhang; Yi Zhou; Saif M. Mohammad; |
| 313 | SHARE: An SLM-based Hierarchical Action CorREction Assistant for Text-to-SQL Highlight: In this work, we propose **SHARE**, a **S**LM-based **H**ierarchical **A**ction cor**RE**ction assistant that enables LLMs to perform more precise error localization and efficient correction. |
Ge Qu; Jinyang Li; Bowen Qin; Xiaolong Li; Nan Huo; Chenhao Ma; Reynold Cheng; |
| 314 | Dolphin: Moving Towards Closed-loop Auto-research Through Thinking, Practice, and Feedback Highlight: To further move towards the ultimate goal (i.e., automatic scientific research), in this paper, we introduce Dolphin, a closed-loop LLM-driven framework to enhance the automation level of scientific research. |
Jiakang Yuan; Xiangchao Yan; Bo Zhang; Tao Chen; Botian Shi; Wanli Ouyang; Yu Qiao; Lei Bai; Bowen Zhou; |
| 315 | Cuckoo: An IE Free Rider Hatched By Massive Nutrition in LLM’s Nest Highlight: We show that IE models can act as free riders on LLM resources by reframing next-token prediction into extraction for tokens already present in the context. |
Letian Peng; Zilong Wang; Feng Yao; Jingbo Shang; |
| 316 | CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models Highlight: Nevertheless, current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in joint cross-lingual and cross-modal scenarios. Motivated by this, we introduce a novel joint Cross-lingual and Cross-modal Hallucinations benchmark (CCHall) to fill this gap. |
Yongheng Zhang; Xu Liu; Ruoxi Zhou; Qiguang Chen; Hao Fei; Wenpeng Lu; Libo Qin; |
| 317 | X-TURING: Towards An Enhanced and Efficient Turing Test for Long-Term Dialogue Agents Highlight: This paper proposes X-Turing, which enhances the original test with a burst dialogue pattern, allowing more dynamic exchanges using consecutive messages. |
Weiqi Wu; Hongqiu Wu; Hai Zhao; |
| 318 | Maximizing The Effectiveness of Larger BERT Models for Compression Highlight: Through Canonical Correlation Analysis, we identify that these methods fail to fully exploit the potential advantages of larger teachers. To address this, we propose an improved distillation approach that effectively enhances knowledge transfer. |
Wen-Shu Fan; Su Lu; Shangyu Xing; Xin-Chun Li; De-Chuan Zhan; |
| 319 | Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings Highlight: To address the gap, we propose ContextualJudgeBench, a judge benchmark with 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios. |
Austin Xu; Srijan Bansal; Yifei Ming; Semih Yavuz; Shafiq Joty; |
| 320 | Position-aware Automatic Circuit Discovery Highlight: This limits their ability to capture cross-positional interactions or mechanisms that vary across positions. To address this gap, we propose two improvements to incorporate positionality into circuits, even on tasks containing variable-length examples. |
Tal Haklay; Hadas Orgad; David Bau; Aaron Mueller; Yonatan Belinkov; |
| 321 | Improving Medical Large Vision-Language Models with Abnormal-Aware Feedback Highlight: Specifically, we collect a Medical Abnormalities Unveiling (MAU) dataset and propose a two-stage training method for UMed-LVLM. |
Yucheng Zhou; Lingran Song; Jianbing Shen; |
| 322 | Learning Sparsity for Effective and Efficient Music Performance Question Answering Highlight: However, existing Music AVQA methods often rely on dense and unoptimized representations, leading to inefficiencies in the isolation of key information, the reduction of redundancy, and the prioritization of critical samples. To address these challenges, we introduce Sparsify, a sparse learning framework specifically designed for Music AVQA. |
Xingjian Diao; Tianzhen Yang; Chunhui Zhang; Weiyi Wu; Ming Cheng; Jiang Gui; |
| 323 | Keys to Robust Edits: From Theoretical Insights to Practical Advances Highlight: Our solution introduces Robust Edit Pathway (REP), a plug-and-play module that (1) disentangles editing keys from native model representations, and (2) dynamically adjusts keys via contrastive learning to achieve a robustness-specificity balance. |
Jianhao Yan; Futing Wang; Yun Luo; Yafu Li; Yue Zhang; |
| 324 | Towards Effective and Efficient Continual Pre-training of Large Language Models Highlight: Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks. In this paper, we comprehensively study its key designs to balance new abilities with retention of the original abilities, and present an effective CPT method that can greatly improve the Chinese language ability and scientific reasoning ability of LLMs. |
Jie Chen; Zhipeng Chen; Jiapeng Wang; Kun Zhou; Yutao Zhu; Jinhao Jiang; Yingqian Min; Xin Zhao; Zhicheng Dou; Jiaxin Mao; Yankai Lin; Ruihua Song; Jun Xu; Xu Chen; Rui Yan; Zhewei Wei; Di Hu; Wenbing Huang; Ji-Rong Wen; |
| 325 | LegalAgentBench: Evaluating LLM Agents in Legal Domain Highlight: However, existing general-domain benchmarks are unable to fully capture the complexity and subtle nuances inherent in real-world judicial cognition and decision-making. Therefore, we propose LegalAgentBench, a comprehensive benchmark specifically designed to evaluate LLM Agents in the Chinese legal domain. |
Haitao Li; Junjie Chen; Jingli Yang; Qingyao Ai; Wei Jia; Youfeng Liu; Kai Lin; Yueyue Wu; Guozhi Yuan; Yiran Hu; Wuyue Wang; Yiqun Liu; Minlie Huang; |
| 326 | Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs Highlight: We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. |
Sumanth Doddapaneni; Mohammed Safi Ur Rahman Khan; Dilip Venkatesh; Raj Dabre; Anoop Kunchukuttan; Mitesh M Khapra; |
| 327 | NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering Highlight: In this work, we propose NeuSym-RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. |
Ruisheng Cao; Hanchong Zhang; Tiancheng Huang; Zhangyi Kang; Yuxin Zhang; Liangtai Sun; Hanqi Li; Yuxun Miao; Shuai Fan; Lu Chen; Kai Yu; |
| 328 | Enhancing Text Editing for Grammatical Error Correction: Arabic As A Case Study Highlight: In this paper, we introduce a text editing approach that derives edit tags directly from data, eliminating the need for language-specific edits. |
Bashar Alhafni; Nizar Habash; |
| 329 | From Informal to Formal – Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs Highlight: Research in AI-based formal mathematical reasoning has shown an unstoppable growth trend. |
Jialun Cao; Yaojie Lu; Meiziniu Li; Haoyang Ma; Haokun Li; Mengda He; Cheng Wen; Le Sun; Hongyu Zhang; Shengchao Qin; Shing-Chi Cheung; Cong Tian; |
| 330 | DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing Highlight: In this paper, we release DREsS, a large-scale, standard dataset for rubric-based automated essay scoring with 48. |
Haneul Yoo; Jieun Han; So-Yeon Ahn; Alice Oh; |
| 331 | OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use Highlight: With the evolution of multi-modal large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based agents that automate tasks on computers, mobile phones, and web browsers, operating within the environments and interfaces (e.g., the Graphical User Interface (GUI) and Command Line Interface (CLI)) provided by operating systems (OS), have significantly advanced. This paper presents a comprehensive survey on these advanced agents, designated as OS Agents. |
Xueyu Hu; Tao Xiong; Biao Yi; Zishu Wei; Ruixuan Xiao; Yurun Chen; Jiasheng Ye; Meiling Tao; Xiangxin Zhou; Ziyu Zhao; Yuhuai Li; Shengze Xu; Shenzhi Wang; Xinchen Xu; Shuofei Qiao; Zhaokai Wang; Kun Kuang; Tieyong Zeng; Liang Wang; Jiwei Li; Yuchen Eleanor Jiang; Wangchunshu Zhou; Guoyin Wang; Keting Yin; Zhou Zhao; Hongxia Yang; Fan Wu; Shengyu Zhang; Fei Wu; |
| 332 | SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation Abstract: Key-Value (KV) cache has become a bottleneck of LLMs for long-context generation. Despite the numerous efforts in this area, the optimization for the decoding phase is generally … |
Jialong Wu; Zhenglin Wang; Linhai Zhang; Yilong Lai; Yulan He; Deyu Zhou; |
| 333 | Language Model Probabilities Are Not Calibrated in Numeric Contexts Highlight: Some statements have one well-defined continuation (e.g., “the Eiffel Tower is in [Paris]”), whereas others have a natural distribution over multiple options (e.g., “the weighted coin flip was [Heads/Tails]”). We argue that language model (LM) outputs should capture these natural distributions. |
Charles Lovering; Michael Krumdick; Viet Dac Lai; Varshini Reddy; Seth Ebner; Nilesh Kumar; Rik Koncel-Kedziorski; Chris Tanner; |
| 334 | SkillAggregation: Reference-free LLM-Dependent Aggregation Highlight: A new method called SkillAggregation is proposed, which learns to combine estimates from LLM judges without needing additional data or ground truth. |
Guangzhi Sun; Anmol Kagrecha; Potsawee Manakul; Phil Woodland; Mark Gales; |
| 335 | Knowledge Boundary of Large Language Models: A Survey Highlight: In this survey, we propose a comprehensive definition of the LLM knowledge boundary and introduce a formalized taxonomy categorizing knowledge into four distinct types. |
Moxin Li; Yong Zhao; Wenxuan Zhang; Shuaiyi Li; Wenya Xie; See-Kiong Ng; Tat-Seng Chua; Yang Deng; |
| 336 | Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence Highlight: In this work, we repurpose a relation extraction dataset (e.g., Re-DocRED) to design controlled experiments that quantify the impact of heuristic biases, such as a preference for shorter documents, on retrievers like Dragon+ and Contriever. |
Mohsen Fayyaz; Ali Modarressi; Hinrich Schuetze; Nanyun Peng; |
| 337 | A Large-Scale Real-World Evaluation of An LLM-Based Virtual Teaching Assistant Highlight: In this study, we develop an LLM-based VTA and deploy it in an introductory AI programming course with 477 graduate students. |
Sunjun Kweon; Sooyohn Nam; Hyunseung Lim; Hwajung Hong; Edward Choi; |
| 338 | Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases Highlight: Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when two conditions are met: the formal language should capture the dependency structures present in natural language, and it should remain within the computational limitations of the model architecture. |
Michael Y. Hu; Jackson Petty; Chuan Shi; William Merrill; Tal Linzen; |
| 339 | Unanswerability Evaluation for Retrieval Augmented Generation Highlight: In this paper, we introduce UAEval4RAG, a comprehensive framework designed to evaluate whether RAG systems effectively handle unanswerable queries specific to a given knowledge base. |
Xiangyu Peng; Prafulla Kumar Choubey; Caiming Xiong; Chien-Sheng Wu; |
| 340 | Neuron-Level Sequential Editing for Large Language Models Highlight: Existing model editing methods, especially those that alter model parameters, typically focus on single-round editing and often face significant challenges in sequential model editing, most notably issues of model forgetting and failure. To address these challenges, we introduce a new model editing method, Neuron-level Sequential Editing (NSE), tailored for sequential model editing. |
Houcheng Jiang; Junfeng Fang; Tianyu Zhang; Baolong Bi; An Zhang; Ruipeng Wang; Tao Liang; Xiang Wang; |
| 341 | Information Locality As An Inductive Bias for Neural Language Models Highlight: In the case of neural language models (LMs), debates persist as to whether these biases align with or diverge from human processing constraints. To address this issue, we propose a quantitative framework that allows for controlled investigations into the nature of these biases. |
Taiga Someya; Anej Svete; Brian DuSell; Timothy J. O’Donnell; Mario Giulianelli; Ryan Cotterell; |
| 342 | Low-Bit Quantization Favors Undertrained LLMs Highlight: This poses a potential challenge for low-bit quantization in the future and highlights the need for awareness of a model’s training level when evaluating low-bit quantization research. To facilitate future research on this problem, we release all the 1500+ quantized checkpoints used in this work at https://huggingface. |
Xu Ouyang; Tao Ge; Thomas Hartvigsen; Zhisong Zhang; Haitao Mi; Dong Yu; |
| 343 | Segment-Based Attention Masking for GPTs Highlight: In this work, attention is masked based on the known block structure at the prefill phase, followed by the conventional token-by-token autoregressive process. |
Shahar Katz; Liran Ringel; Yaniv Romano; Lior Wolf; |
| 344 | Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models Highlight: In contrast, humans can quickly form impressions of a model’s capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. |
Fan Zhang; Shulin Tian; Ziqi Huang; Yu Qiao; Ziwei Liu; |
| 345 | CULEMO: Cultural Lenses on Emotion – Benchmarking LLMs for Cross-Cultural Emotion Understanding Highlight: However, existing emotion benchmarks suffer from two major shortcomings: (1) they largely rely on keyword-based emotion recognition, overlooking crucial cultural dimensions required for deeper emotion understanding, and (2) many are created by translating English-annotated data into other languages, leading to potentially unreliable evaluation. To address these issues, we introduce Cultural Lenses on Emotion (CuLEmo), the first benchmark designed to evaluate culture-aware emotion prediction across six languages: Amharic, Arabic, English, German, Hindi, and Spanish. |
Tadesse Destaw Belay; Ahmed Haj Ahmed; Alvin C Grissom Ii; Iqra Ameer; Grigori Sidorov; Olga Kolesnikova; Seid Muhie Yimam; |
| 346 | Sparse Latents Steer Retrieval-Augmented Generation Highlight: In this paper, we leverage Sparse Autoencoders (SAEs) within the LLaMA Scope to uncover sparse, interpretable latents that govern RAG behaviors. |
Chunlei Xin; Shuheng Zhou; Huijia Zhu; Weiqiang Wang; Xuanang Chen; Xinyan Guan; Yaojie Lu; Hongyu Lin; Xianpei Han; Le Sun; |
| 347 | SEA: Low-Resource Safety Alignment for Multimodal Large Language Models Via Synthetic Embeddings Highlight: Existing low-resource safety alignment methods, including textual alignment, have been found to struggle with the safety risks posed by additional modalities. To address this, we propose Synthetic Embedding augmented safety Alignment (SEA), which optimizes embeddings of the additional modality through gradient updates to expand textual datasets. |
Weikai Lu; Hao Peng; Huiping Zhuang; Cen Chen; Ziqian Zeng; |
| 348 | Interpret and Improve In-Context Learning Via The Lens of Input-Label Mappings Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, the internal mechanisms behind ICL remain under-explored, particularly the mappings between inputs and labels. In this work, we reverse-engineer ICL by examining input-label mappings: what they are within LLMs, where they function, and how LLMs utilize them. |
Chenghao Sun; Zhen Huang; Yonggang Zhang; Le Lu; Houqiang Li; Xinmei Tian; Xu Shen; Jieping Ye; |
| 349 | STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our key contribution is exploring the interpolation between structured and unstructured pruning, to propose a novel structured-then-unstructured (STUN) approach that outperforms both structured and unstructured pruning, especially for MoEs. |
Jaeseong Lee; Seung-won Hwang; Aurick Qiao; Daniel F Campos; Zhewei Yao; Yuxiong He; |
| 350 | Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning with Human Feedback (RLHF) to enhance the semantic understanding of SLMs. |
Guan-Ting Lin; Prashanth Gurunath Shivakumar; Aditya Gourav; Yile Gu; Ankur Gandhe; Hung-yi Lee; Ivan Bulyko; |
| 351 | SDPO: Segment-Level Direct Preference Optimization for Social Agents Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: While these methods consider multiple turns across entire sessions, they are often overly coarse-grained, introducing training noise, and lack robust theoretical support. To resolve these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which dynamically selects key segments within interactions to optimize multi-turn agent behavior. |
Aobo Kong; Wentao Ma; Shiwan Zhao; Yongbin Li; Yuchuan Wu; Ke Wang; Xiaoqian Liu; Qicheng Li; Yong Qin; Fei Huang; |
| 352 | PsyDial: A Large-scale Long-term Conversational Dataset for Mental Health Support Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Although removing personally identifiable information is feasible, this process is labor-intensive. To address these challenges, we propose a novel privacy-preserving data reconstruction method that reconstructs real-world client-counselor dialogues while mitigating privacy concerns. |
Huachuan Qiu; Zhenzhong Lan; |
| 353 | Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we propose the Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning, EMMA-X. |
Qi Sun; Pengfei Hong; Tej Deep Pala; Vernon Toh; U-Xuan Tan; Deepanway Ghosal; Soujanya Poria; |
| 354 | VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. |
Yikun Wang; Siyin Wang; Qinyuan Cheng; Zhaoye Fei; Liang Ding; Qipeng Guo; Dacheng Tao; Xipeng Qiu; |
| 355 | Structure-aware Domain Knowledge Injection for Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper introduces a pioneering methodology, termed StructTuning, to efficiently transform foundation Large Language Models (LLMs) into domain specialists. |
Kai Liu; Ze Chen; Zhihang Fu; Wei Zhang; Rongxin Jiang; Fan Zhou; Yaowu Chen; Yue Wu; Jieping Ye; |
| 356 | AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We investigate various LLM-based evaluation methods on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks. |
Yilun Zhao; Weiyuan Chen; Zhijian Xu; Manasi Patwardhan; Chengye Wang; Yixin Liu; Lovekesh Vig; Arman Cohan; |
| 357 | Unique Hard Attention: A Tale of Two Sides Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: When multiple positions achieve the maximum score, either the rightmost or the leftmost of those is chosen. In this paper, we highlight the importance of this seeming triviality. |
Selim Jerad; Anej Svete; Jiaoda Li; Ryan Cotterell; |
| 358 | The Essence of Contextual Understanding in Theory of Mind: A Study on Question Answering with Story Characters Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Unfortunately, this aspect is largely overlooked in existing benchmarks for evaluating machines’ ToM capabilities, due to their usage of short narratives without global context, especially personal background of characters. In this paper, we verify the importance of comprehensive contextual understanding about personal backgrounds in ToM and assess the performance of LLMs in such complex scenarios. |
Chulun Zhou; Qiujing Wang; Mo Yu; Xiaoqian Yue; Rui Lu; Jiangnan Li; Yifan Zhou; Shunchi Zhang; Jie Zhou; Wai Lam; |
| 359 | The Harmonic Structure of Information Contours Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: These fluctuations are often explained by factors such as syntactic constraints, stylistic choices, or audience design. In this work, we explore an alternative perspective: that these fluctuations may be influenced by an implicit linguistic pressure towards periodicity, where the information rate oscillates at regular intervals, potentially across multiple frequencies simultaneously. |
Eleftheria Tsipidi; Samuel Kiegeland; Franz Nowak; Tianyang Xu; Ethan Wilcox; Alex Warstadt; Ryan Cotterell; Mario Giulianelli; |
| 360 | Vulnerability of LLMs to Vertically Aligned Text Manipulations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we investigate the impact of vertical text input on the performance of various LLMs across multiple text classification datasets and analyze the underlying causes. |
Zhecheng Li; Yiwei Wang; Bryan Hooi; Yujun Cai; Zhen Xiong; Nanyun Peng; Kai-Wei Chang; |
| 361 | Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To bridge the gap between validation loss and downstream task capabilities, in this work, we introduce Capability Salience Vector, which decomposes the overall loss and assigns different importance weights to tokens to assess a specific meta-capability, aligning the validation loss with downstream task performance in terms of the model’s capabilities. |
Qiming Ge; Shuhao Xing; Songyang Gao; Yunhua Zhou; Yicheng Zou; Songyang Zhang; Zhi Chen; Hang Yan; Qi Zhang; Qipeng Guo; Kai Chen; |
| 362 | Rethinking The Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we focus on a standard and realistic scaling setting: majority voting. |
Yexiang Liu; Zekun Li; Zhi Fang; Nan Xu; Ran He; Tieniu Tan; |
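The majority-voting setting studied in entry 362 can be sketched in a few lines (a minimal illustration; the sampled answers below are hypothetical, not from the paper):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among sampled completions."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Five sampled reasoning chains that end in final answers:
samples = ["42", "41", "42", "42", "37"]
print(majority_vote(samples))  # -> 42
```

Test-time scaling here simply means drawing more samples before voting; the paper analyzes when a given prompting strategy makes this vote more or less reliable.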
| 363 | Advancing Zero-shot Text-to-Speech Intelligibility Across Diverse Domains Via Preference Alignment Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a new dataset, named the Intelligibility Preference Speech Dataset (INTP), and extend the Direct Preference Optimization (DPO) framework to accommodate diverse TTS architectures. |
Xueyao Zhang; Yuancheng Wang; Chaoren Wang; Ziniu Li; Zhuo Chen; Zhizheng Wu; |
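The DPO objective that INTP builds on is standard; a minimal sketch for one preference pair follows (the log-probabilities are illustrative placeholders, not TTS model outputs):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair: negative log-sigmoid of
    the scaled difference in policy-vs-reference log-probability margins."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss equals log(2) when the policy matches the reference, and shrinks
# as the policy favors the preferred sample more than the reference does.
print(round(dpo_loss(0.0, 0.0, 0.0, 0.0), 4))  # -> 0.6931
```

Extending this to diverse TTS architectures, as the entry describes, amounts to defining the preferred/rejected pairs over intelligibility judgments rather than text quality.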
| 364 | Transferring Textual Preferences to Vision-Language Understanding Through Model Merging Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper explores a training-free alternative by merging text-based reward models (RMs) with LVLMs to create VLRMs. |
Chen-An Li; Tzu-Han Lin; Yun-Nung Chen; Hung-yi Lee; |
| 365 | Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce a novel approach where LLMs are fine-tuned to generate a sequence of Diverse Chains of Thought (DCoT) within a single inference step, which is fundamentally different from prior work that primarily operate on parallel CoT generations. |
Haritz Puerto; Tilek Chubakov; Xiaodan Zhu; Harish Tayyar Madabushi; Iryna Gurevych; |
| 366 | UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a benchmark to evaluate whether video-large language models (Video-LLMs) can naturally process continuous first-person visual observations like humans, enabling recall, perception, reasoning, and navigation. |
Baining Zhao; Jianjie Fang; Zichao Dai; Ziyou Wang; Jirong Zha; Weichen Zhang; Chen Gao; Yue Wang; Jinqiang Cui; Xinlei Chen; Yong Li; |
| 367 | Visual Evidence Prompting Mitigates Hallucinations in Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Our investigation suggests that such hallucinations often stem from the deficiencies in fine-grained comprehension on the visual aspect, particularly when visual scenes exhibit appearance or semantic similarities (e.g., bicycle vs. motorcycles, baseball bat vs. baseball). In this work, we show such hallucination is naturally mitigated via a novel method called visual evidence prompting, utilizing small visual models to complement the LVLMs. |
Wei Li; Zhen Huang; Houqiang Li; Le Lu; Yang Lu; Xinmei Tian; Xu Shen; Jieping Ye; |
| 368 | Croppable Knowledge Graph Embedding Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we propose a novel KGE training framework MED. |
Yushan Zhu; Wen Zhang; Zhiqiang Liu; Mingyang Chen; Lei Liang; Huajun Chen; |
| 369 | Boosting LLM’s Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce a Knowledge-enhanced reasoning framework for Molecular Structure Elucidation (K-MSE), leveraging Monte Carlo Tree Search for test-time scaling as a plugin. |
Xiang Zhuang; Bin Wu; Jiyu Cui; Kehua Feng; Xiaotong Li; Huabin Xing; Keyan Ding; Qiang Zhang; Huajun Chen; |
| 370 | Mind The Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To this end, we introduce Multi-Cultural Set of Inappropriate Gestures and Nonverbal Signs (MC-SIGNS), a dataset of 288 gesture-country pairs annotated for offensiveness, cultural significance, and contextual factors across 25 gestures and 85 countries. |
Akhila Yerukola; Saadia Gabriel; Nanyun Peng; Maarten Sap; |
| 371 | Colloquial Singaporean English Style Transfer with Fine-Grained Explainable Control Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Style transfer between Singlish and Standard (formal) English is vital for various applications, yet existing methods often lack explainability and fine-grained control. To fill this gap, we contribute in two key ways. |
Jinggui Liang; Dung Vo; Yap Hong Xian; Hai Leong Chieu; Kian Ming A. Chai; Jing Jiang; Lizi Liao; |
| 372 | Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: We observe a novel phenomenon, *contextual entrainment*, across a wide range of language models (LMs) and prompt settings, providing a new mechanistic perspective on how LMs become distracted by “irrelevant” contextual information in the input prompt. |
Jingcheng Niu; Xingdi Yuan; Tong Wang; Hamidreza Saghir; Amir H. Abdi; |
| 373 | Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we provide experimental results on the BEIR dataset using the open-source Lucene search library that explicate the tradeoffs between HNSW and flat indexes (including quantized variants) from the perspectives of indexing time, query evaluation performance, and retrieval quality. |
Jimmy Lin; |
| 374 | Advancing Collaborative Debates with Role Differentiation Through Multi-Agent Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, in some given tasks, obtaining domain knowledge related to task characteristics and identifying the strengths of different LLMs is hard. To solve these problems, we propose a Multi-LLM Cooperation (MLC) framework with automatic role assignment capabilities. |
Haoran Li; Ziyi Su; Yun Xue; Zhiliang Tian; Yiping Song; Minlie Huang; |
| 375 | Pixel-Level Reasoning Segmentation Via Multi-turn Conversations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Such systems cannot reason at the pixel level and comprehend dynamic user intent that changes over interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, tracking evolving user intent via multi-turn interactions for fine-grained segmentation. |
Dexian Cai; Xiaocui Yang; YongKang Liu; Daling Wang; Shi Feng; Yifei Zhang; Soujanya Poria; |
| 376 | Large Language Models Struggle to Describe The Haystack Without Human Help: A Social Science-Inspired Evaluation of Topic Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: This paper provides best practices—there is no one right model; the choice of model is situation-specific—and suggests potential improvements for scalable LLM-based topic models. |
Zongxia Li; Lorena Calvo-Bartolomé; Alexander Miserlis Hoyle; Paiheng Xu; Daniel Kofi Stephens; Juan Francisco Fung; Alden Dima; Jordan Lee Boyd-Graber; |
| 377 | Gumbel Reranking: Differentiable End-to-End Reranker Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Existing distillation-based approaches suffer from training-inference misalignment and fail to capture interdependencies among candidate documents. To overcome these limitations, we reframe the reranking process as an attention-mask problem and propose Gumbel Reranking, an end-to-end training framework for rerankers aimed at minimizing the training-inference gap. |
Siyuan Huang; Zhiyuan Ma; Jintao Du; Changhua Meng; Weiqiang Wang; Jingwen Leng; Minyi Guo; Zhouhan Lin; |
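The Gumbel trick at the heart of such rerankers can be sketched as stochastic top-k selection over reranker scores (a simplified, non-relaxed illustration; the paper itself turns this into a differentiable attention-mask objective):

```python
import math
import random

def gumbel_topk(scores, k, rng):
    """Perturb each score with Gumbel(0, 1) noise and keep the k largest
    indices: equivalent to sampling k items without replacement in
    proportion to their softmax probabilities."""
    perturbed = [s - math.log(-math.log(rng.random())) for s in scores]
    return sorted(range(len(scores)), key=lambda i: -perturbed[i])[:k]

rng = random.Random(0)
picked = gumbel_topk([5.0, 0.1, 4.0, 0.2], k=2, rng=rng)
print(picked)  # two candidate-document indices, biased toward high scores
```

Because the perturbed scores are continuous, replacing the hard `sorted` step with a soft relaxation is what makes end-to-end training of the reranker possible.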
| 378 | Personalized Text Generation with Contrastive Activation Steering Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While these approaches have advanced the field, they suffer from two critical limitations: (1) the entanglement of content semantics and stylistic patterns in historical texts impedes accurate modeling of user-specific writing preferences; and (2) scalability challenges arising from both RAG’s inference latency caused by retrieval operations and PEFT’s parameter storage requirements for per-user models. To overcome these limitations, we propose StyleVector, a training-free framework that disentangles and represents personalized writing style as a vector in LLM’s activation space, enabling style-steered generation during inference without requiring costly retrieval or parameter storage. |
Jinghao Zhang; Yuting Liu; Wenjie Wang; Qiang Liu; Shu Wu; Liang Wang; Tat-Seng Chua; |
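The core idea of representing style as a direction in activation space can be sketched with plain lists (the toy activations and steering coefficient are hypothetical; real values would come from an LLM's hidden states at a chosen layer):

```python
def style_vector(user_acts, generic_acts):
    """Contrastive style vector: per-dimension difference between the mean
    activations on a user's texts and on generic texts."""
    dims = len(user_acts[0])
    mean = lambda rows, d: sum(r[d] for r in rows) / len(rows)
    return [mean(user_acts, d) - mean(generic_acts, d) for d in range(dims)]

def steer(hidden, vec, alpha=2.0):
    """Steered decoding step: shift a hidden state along the style vector."""
    return [h + alpha * v for h, v in zip(hidden, vec)]

user = [[1.0, 0.5], [1.5, 1.0]]      # toy activations on user-written text
generic = [[0.0, 0.25], [0.25, 0.0]]  # toy activations on generic text
vec = style_vector(user, generic)
print(steer([0.0, 0.0], vec))  # -> [2.25, 1.25]
```

The training-free property follows directly: the vector is computed once from forward passes and added at inference, with no retrieval and no per-user parameters.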
| 379 | Synthesizing Post-Training Data for LLMs Through Multi-Agent Simulation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To fill this gap, inspired by the recent success of using LLMs to simulate human society, we propose MATRIX, a multi-agent simulator that automatically generates diverse text-based scenarios, capturing a wide range of real-world human needs in a realistic and scalable manner. Leveraging these outputs, we introduce a novel scenario-driven instruction generator MATRIX-Gen for controllable and highly realistic data synthesis. |
Shuo Tang; Xianghe Pang; Zexi Liu; Bohan Tang; Rui Ye; Tian Jin; Xiaowen Dong; Yanfeng Wang; Siheng Chen; |
| 380 | Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix Modality alignment framework leveraging Multimodal Large Language Models (MLLMs). |
Yupu Liang; Yaping Zhang; Zhiyang Zhang; Yang Zhao; Lu Xiang; Chengqing Zong; Yu Zhou; |
| 381 | LongSafety: Evaluating Long-Context Safety of Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, the safety of LLMs in long-context tasks remains under-explored, leaving a significant gap in both evaluation and improvement of their safety. To address this, we introduce LongSafety, the first comprehensive benchmark specifically designed to evaluate LLM safety in open-ended long-context tasks. |
Yida Lu; Jiale Cheng; Zhexin Zhang; Shiyao Cui; Cunxiang Wang; Xiaotao Gu; Yuxiao Dong; Jie Tang; Hongning Wang; Minlie Huang; |
| 382 | Activation Steering Decoding: Mitigating Hallucination in Large Vision-Language Models Through Bidirectional Hidden State Intervention Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we explore a novel perspective on hallucination mitigation by examining the intermediate activations of LVLMs during generation. |
Jingran Su; Jingfan Chen; Hongxin Li; Yuntao Chen; Li Qing; Zhaoxiang Zhang; |
| 383 | IOPO: Empowering LLMs with Complex Instruction Following Via Input-Output Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To this end, this paper introduces Trace, a benchmark for improving and evaluating the complex instruction-following ability, which consists of 120K training data and 1K evaluation data. |
Xinghua Zhang; Haiyang Yu; Cheng Fu; Fei Huang; Yongbin Li; |
| 384 | UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we explore how to unify dense retrieval and response generation for large language models in conversation. |
Fengran Mo; Yifan Gao; Chuan Meng; Xin Liu; Zhuofeng Wu; Kelong Mao; Zhengyang Wang; Pei Chen; Zheng Li; Xian Li; Bing Yin; Meng Jiang; |
| 385 | Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: This paper introduces Light-R1, an open-source suite for training long reasoning models using a reproducible and cost-effective methodology. |
Liang Wen; Yunke Cai; Fenrui Xiao; Xin He; Qi An; Zhenyu Duan; Yimin Du; Junchen Liu; Tanglifu Tanglifu; Xiaowei Lv; Haosheng Zou; Yongchao Deng; Shousheng Jia; Xiangzheng Zhang; |
| 386 | Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Our results indicate that a reliable diversity measure should properly account for both inter-sample differences and the information density in the sample space. Building on this, we propose NovelSum, a new diversity metric based on sample-level “novelty.” |
Yuming Yang; Yang Nan; Junjie Ye; Shihan Dou; Xiao Wang; Shuo Li; Huijie Lv; Tao Gui; Qi Zhang; Xuanjing Huang; |
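The intuition behind a novelty-based diversity measure can be sketched with a toy score (this greedy nearest-neighbor sum only illustrates the "novelty" idea; it is not the NovelSum formula):

```python
import math

def novelty_sum(points):
    """Toy diversity score: each sample contributes its distance to the
    nearest earlier sample, so near-duplicates add almost nothing."""
    total = 0.0
    for i in range(1, len(points)):
        total += min(math.dist(points[i], points[j]) for j in range(i))
    return total

spread = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]    # well-separated samples
clumped = [(0.0, 0.0), (0.01, 0.0), (0.0, 0.01)]  # near-duplicates
print(novelty_sum(spread) > novelty_sum(clumped))  # -> True
```

A plain pairwise-distance average would also reward spread, but crediting only the nearest neighbor is what makes the score sensitive to information density, as the entry emphasizes.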
| 387 | CFBench: A Comprehensive Constraints-Following Benchmark for LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Existing evaluations mainly focus on fragmented constraints or narrow scenarios, but they overlook the comprehensiveness and authenticity of constraints from the user’s perspective. To bridge this gap, we propose CFBench, a large-scale Chinese Comprehensive Constraints Following Benchmark for LLMs, featuring 1,000 curated samples that cover more than 200 real-life scenarios and over 50 NLP tasks. |
Tao Zhang; ChengLIn Zhu; Yanjun Shen; Wenjing Luo; Yan Zhang; Hao Liang; Tao Zhang; Fan Yang; Mingan Lin; Yujing Qiao; Weipeng Chen; Bin Cui; Wentao Zhang; Zenan Zhou; |
| 388 | SocialEval: Evaluating Social Intelligence of Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To this end, we propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts. |
Jinfeng Zhou; Yuxuan Chen; Yihan Shi; Xuanming Zhang; Leqi Lei; Yi Feng; Zexuan Xiong; Miao Yan; Xunzhi Wang; Yaru Cao; Jianing Yin; Shuai Wang; Quanyu Dai; Zhenhua Dong; Hongning Wang; Minlie Huang; |
| 389 | TC-RAG: Turing-Complete RAG’s Case Study on Medical LLM Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing approaches to RAG fall short by neglecting system state variables, which are crucial for ensuring adaptive control, retrieval halting, and system convergence. In this paper, we introduce Turing-Complete RAG (TC-RAG), a novel framework, established through rigorous proof, that addresses these challenges by incorporating a Turing-complete system to manage state variables, thereby enabling more efficient and accurate knowledge retrieval. |
Xinke Jiang; Yue Fang; Rihong Qiu; Haoyu Zhang; Yongxin Xu; Hao Chen; Wentao Zhang; Ruizhe Zhang; Yuchen Fang; Xinyu Ma; Xu Chu; Junfeng Zhao; Yasha Wang; |
| 390 | HyKGE: A Hypothesis Knowledge Graph Enhanced RAG Framework for Accurate and Reliable Medical LLMs Responses Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we investigate the retrieval-augmented generation (RAG) based on Knowledge Graphs (KGs) to improve the accuracy and reliability of Large Language Models (LLMs). |
Xinke Jiang; Ruizhe Zhang; Yongxin Xu; Rihong Qiu; Yue Fang; Zhiyuan Wang; Jinyi Tang; Hongxin Ding; Xu Chu; Junfeng Zhao; Yasha Wang; |
| 391 | T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation Via Fine-grained AI Feedback Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current state-of-the-art T2A models still struggle to satisfy human preferences for prompt-following and acoustic quality when generating complex multi-event audio. To improve the performance of the model in these high-level applications, we propose to enhance the basic capabilities of the model with AI feedback learning. |
Zehan Wang; Ke Lei; Chen Zhu; Jiawei Huang; Sashuai Zhou; Luping Liu; Xize Cheng; Shengpeng Ji; Zhenhui Ye; Tao Jin; Zhou Zhao; |
| 392 | MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we develop a Multilingual End-to-end Meta-Evaluation RAG benchmark MEMERAG. |
María Andrea Cruz Blandón; Jayasimha Talur; Bruno Charron; Dong Liu; Saab Mansour; Marcello Federico; |
| 393 | Words of Warmth: Trust and Sociability Norms for Over 26k English Words Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce Words of Warmth, the first large-scale repository of manually derived word–warmth (as well as word–trust and word–sociability) associations for over 26k English words. |
Saif M. Mohammad; |
| 394 | How to Mitigate Overfitting in Weak-to-strong Generalization? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To mitigate overfitting in weak-to-strong generalization, we propose a two-stage framework that simultaneously improves the quality of supervision signals and the quality of input questions. |
Junhao Shi; Qinyuan Cheng; Zhaoye Fei; Yining Zheng; Qipeng Guo; Xipeng Qiu; |
| 395 | Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we investigate applying ReFT to complex reasoning tasks. |
Chenxi Huang; Shaotian Yan; Liang Xie; Binbin Lin; Sinan Fan; Yue Xin; Deng Cai; Chen Shen; Jieping Ye; |
| 396 | R2D2: Remembering, Replaying and Dynamic Decision Making with A Reflective Agentic Memory Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures. Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect. |
Tenghao Huang; Kinjal Basu; Ibrahim Abdelaziz; Pavan Kapanipathi; Jonathan May; Muhao Chen; |
| 397 | MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification. |
Linzhuang Sun; Hao Liang; Jingxuan Wei; Bihui Yu; Tianpeng Li; Fan Yang; Zenan Zhou; Wentao Zhang; |
| 398 | L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose L4Q, a method that integrates Quantization-Aware Training (QAT) with LoRA. |
Hyesung Jeon; Yulhwa Kim; Jae-Joon Kim; |
| 399 | SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose **S**ubtask-**o**riented **R**einforced **F**ine-**T**uning (**SoRFT**), a novel training approach to enhance the issue resolving capability of LLMs. |
Zexiong Ma; Chao Peng; Pengfei Gao; Xiangxin Meng; Yanzhen Zou; Bing Xie; |
| 400 | The Impact of Token Granularity on The Predictive Power of Language Model Surprisal Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: One factor that has been overlooked in cognitive modeling is the granularity of subword tokens, which explicitly encodes information about word length and frequency, and ultimately influences the quality of vector representations that are learned. This paper presents experiments that manipulate the token granularity and evaluate its impact on the ability of surprisal to account for processing difficulty of naturalistic text and garden-path constructions. |
Byung-Doh Oh; William Schuler; |
| 401 | Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Abstract: The capabilities of recent large language models (LLMs) to generate high-quality content indistinguishable by humans from human-written texts raises many concerns regarding their … |
Aneta Zugecova; Dominik Macko; Ivan Srba; Robert Moro; Jakub Kopál; Katarína Marcinčinová; Matúš Mesarčík; |
| 402 | TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), which combines CoT reasoning with regression-aware training. |
Cheng-Han Chiang; Hung-yi Lee; Michal Lukasik; |
| 403 | Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce Browsing Lost Unformed Recollections, a tip-of-the-tongue known-item search and reasoning benchmark for general AI assistants. |
Sky CH-Wang; Darshan Girish Deshpande; Smaranda Muresan; Anand Kannappan; Rebecca Qian; |
| 404 | Energy Considerations of Large Language Model Inference and Efficiency Optimizations Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we systematically analyze the energy implications of common inference efficiency optimizations across diverse Natural Language Processing (NLP) and generative Artificial Intelligence (AI) workloads, including conversational AI and code generation. |
Jared Fernandez; Clara Na; Vashisth Tiwari; Yonatan Bisk; Sasha Luccioni; Emma Strubell; |
| 405 | TARGA: Targeted Synthetic Data Generation for Practical Reasoning Over Structured Data Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, existing methods encounter two significant challenges: reliance on extensive manually annotated datasets and limited generalization capability to unseen examples. To tackle these issues, we propose Targeted Synthetic Data Generation (Targa), a practical framework that dynamically generates high-relevance synthetic data without manual annotation. |
Xiang Huang; Jiayu Shen; Shanshan Huang; Sitao Cheng; Xiaxia Wang; Yuzhong Qu; |
| 406 | Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose UniQuanF (Unified Quantization with Flexible Mapping), an accurate quantization method for LLMs. |
Seungcheol Park; Jeongin Bae; Beomseok Kwon; Minjun Kim; Byeongwook Kim; Se Jung Kwon; U Kang; Dongsoo Lee; |
| 407 | Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: While repository-level benchmarks like ReposVul and VulEval introduce interprocedural context, they remain computationally expensive, lack pairwise evaluation of vulnerability fixes, and explore limited context retrieval, limiting their practicality. We introduce JITVul, a JIT vulnerability detection benchmark linking each function to its vulnerability-introducing and fixing commits. |
Alperen Yildiz; Sin G Teo; Yiling Lou; Yebo Feng; Chong Wang; Dinil Mon Divakaran; |
| 408 | MasRouter: Learning to Route LLMs for Multi-Agent Systems Highlight: In response to this challenge, we first introduce the problem of Multi-Agent System Routing (MASR), which integrates all components of MAS into a unified routing framework. Toward this goal, we propose MasRouter, the first high-performing, cost-effective, and inductive MASR solution. |
Yanwei Yue; Guibin Zhang; Boyang Liu; Guancheng Wan; Kun Wang; Dawei Cheng; Yiyan Qi; |
| 409 | Guiding Not Forcing: Enhancing The Transferability of Jailbreaking Attacks on LLMs Via Removing Superfluous Constraints Highlight: Through a detailed analysis of the optimization process, we introduce a novel conceptual framework to elucidate transferability and identify superfluous constraints—specifically, the response pattern constraint and the token tail constraint—as significant barriers to improved transferability. |
Junxiao Yang; Zhexin Zhang; Shiyao Cui; Hongning Wang; Minlie Huang; |
| 410 | Faster Speculative Decoding Via Effective Draft Decoder with Pruned Candidate Tree Highlight: Specifically, we found that the confidence scores predicted by the draft model are well-calibrated with the acceptance probability of draft tokens. |
Huanran Zheng; Xiaoling Wang; |
| 411 | Real-time Factuality Assessment from Adversarial Feedback Highlight: To this end, we develop a novel pipeline that leverages natural language feedback from a RAG-based detector to iteratively modify real-time news into deceptive variants that challenge LLMs. |
Sanxing Chen; Yukun Huang; Bhuwan Dhingra; |
| 412 | CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention Highlight: In this paper, inspired by observing the disparities in cross-modal attention patterns across languages, we propose Cross-Lingual Attention Intervention for Mitigating multilingual object hallucination (CLAIM) in LVLMs, a novel near training-free method by aligning attention patterns. |
Zekai Ye; Qiming Li; Xiaocheng Feng; Libo Qin; Yichong Huang; Baohang Li; Kui Jiang; Yang Xiang; Zhirui Zhang; Yunfei Lu; Duyu Tang; Dandan Tu; Bing Qin; |
| 413 | Literature Meets Data: A Synergistic Approach to Hypothesis Generation Highlight: While both have proven effective in generating novel and plausible hypotheses, it remains an open question whether they can complement each other. To address this, we develop the first method that combines literature-based insights with data to perform LLM-powered hypothesis generation. |
Haokun Liu; Yangqiaoyu Zhou; Mingxuan Li; Chenfei Yuan; Chenhao Tan; |
| 414 | Mitigating Posterior Salience Attenuation in Long-Context LLMs with Positional Contrastive Decoding Highlight: Motivated by this, we propose the training-free Positional Contrastive Decoding (PCD) that contrasts the logits derived from long-aware attention with those from designed local-aware attention, enabling the model to focus on the gains introduced by large-scale short-to-long training. |
Zikai Xiao; Ziyang Wang; Wen Ma; Yan Zhang; Wei Shen; WangYan WangYan; Luqi Gong; Zuozhu Liu; |
| 415 | Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering Highlight: We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. |
William Jurayj; Jeffrey Cheng; Benjamin Van Durme; |
| 416 | Can A Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model Highlight: 4 (TOD), BFCL V3 (LA), and API-Bank (LA)—and our analyses reveal that specialized approaches excel in one domain but underperform in the other. To bridge this chasm, we introduce **CoALM** (**C**onversational **A**gentic **L**anguage **M**odel), a unified approach that integrates both conversational and agentic capabilities. |
Emre Can Acikgoz; Jeremiah Greer; Akul Datta; Ze Yang; William Zeng; Oussama Elachqar; Emmanouil Koukoumidis; Dilek Hakkani-Tür; Gokhan Tur; |
| 417 | Boosting Long-Context Information Seeking Via Query-Guided Activation Refilling Highlight: In this paper, we propose a method for processing long-context information-seeking tasks via query-guided ACtivation REfilling (ACRE). |
Hongjin Qian; Zheng Liu; Peitian Zhang; Zhicheng Dou; Defu Lian; |
| 418 | Why Safeguarded Ships Run Aground? Aligned Large Language Models’ Safety Mechanisms Tend to Be Anchored in The Template Region Highlight: In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs. |
Chak Tou Leong; Qingyu Yin; Jian Wang; Wenjie Li; |
| 419 | Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models Highlight: This factuality ability is crucial in determining whether these models can be deployed and applied safely and compliantly within specific regions. To address these challenges and better evaluate the factuality ability of LLMs to answer short questions, we introduce the Chinese SafetyQA benchmark. |
Yingshui Tan; Boren Zheng; Baihui Zheng; Kerui Cao; Huiyun Jing; Jincheng Wei; Jiaheng Liu; Yancheng He; Wenbo Su; Xiaoyong Zhu; Bo Zheng; Kaifu Zhang; |
| 420 | HACo-Det: A Study Towards Fine-Grained Machine-Generated Text Detection Under Human-AI Coauthoring Highlight: Specifically, we propose a dataset, HACo-Det, which contains human-AI coauthored texts produced via an automatic pipeline, with word-level attribution labels. |
Zhixiong Su; Yichen Wang; Herun Wan; Zhaohan Zhang; Minnan Luo; |
| 421 | Praetor: A Fine-Grained Generative LLM Evaluator with Instance-Level Customizable Evaluation Criteria Highlight: Despite this, existing LLM evaluators suffer from limited use scenarios and poor flexibility. To mitigate these issues, we propose Praetor, a fine-grained generative LLM evaluator with instance-level customizable evaluation criteria. |
Yongqi Leng; Renren Jin; Yue Chen; Zhuowen Han; Ling Shi; Jianxiang Peng; Lei Yang; Juesi Xiao; Deyi Xiong; |
| 422 | Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: A Case Study in Kazakh Highlight: Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. |
Nurkhan Laiyk; Daniil Orel; Rituraj Joshi; Maiya Goloburda; Yuxia Wang; Preslav Nakov; Fajri Koto; |
| 423 | BOOKCOREF: Coreference Resolution at Book Scale Highlight: When it comes to evaluating long texts, however, existing benchmarks, such as LitBank, remain limited in length and do not adequately assess system capabilities at the book scale, i.e., when co-referring mentions span hundreds of thousands of tokens. To fill this gap, we first put forward a novel automatic pipeline that produces high-quality Coreference Resolution annotations on full narrative texts. Then, we adopt this pipeline to create the first book-scale coreference benchmark, BOOKCOREF, with an average document length of more than 200,000 tokens. |
Giuliano Martinelli; Tommaso Bonomo; Pere-Lluís Huguet Cabot; Roberto Navigli; |
| 424 | Causal Estimation of Tokenisation Bias Highlight: In this work, we quantify one particular type of tokenisation bias: the effect of including or not including a subword (e.g., ⟨hello⟩) in a tokeniser’s vocabulary on the probability a trained model assigns to the corresponding characters (i.e., “hello”). |
Pietro Lesci; Clara Meister; Thomas Hofmann; Andreas Vlachos; Tiago Pimentel; |
| 425 | OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction Highlight: However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the role’s voice traits (e.g., voice style and emotions), which play a crucial role in interaction and enable more immersive experiences in realistic scenarios. Towards this goal, we propose OmniCharacter, the first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. |
Haonan Zhang; Run Luo; Xiong Liu; Yuchuan Wu; Ting-En Lin; Pengpeng Zeng; Qiang Qu; Feiteng Fang; Min Yang; Lianli Gao; Jingkuan Song; Fei Huang; Yongbin Li; |
| 426 | Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models Highlight: To this end, we propose an environment-guided neural-symbolic self-training framework named ENVISIONS. |
Fangzhi Xu; Qiushi Sun; Kanzhi Cheng; Jun Liu; Yu Qiao; Zhiyong Wu; |
| 427 | Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning Highlight: Recognizing the intrinsic noise and uncertainty of self-supervision, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. |
Fangzhi Xu; Hang Yan; Chang Ma; Haiteng Zhao; Qiushi Sun; Kanzhi Cheng; Junxian He; Jun Liu; Zhiyong Wu; |
| 428 | 𝜙-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation Highlight: Building on this, we propose a novel decoding strategy, named 𝜙-Decoding. |
Fangzhi Xu; Hang Yan; Chang Ma; Haiteng Zhao; Jun Liu; Qika Lin; Zhiyong Wu; |
| 429 | ReLearn: Unlearning Via Learning for Large Language Models Highlight: Moreover, existing evaluation metrics overemphasize contextual forgetting while inadequately assessing response fluency and relevance. To address these challenges, we propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning, along with a comprehensive evaluation framework. |
Haoming Xu; Ningyuan Zhao; Liming Yang; Sendong Zhao; Shumin Deng; Mengru Wang; Bryan Hooi; Nay Oo; Huajun Chen; Ningyu Zhang; |
| 430 | Direct Prompt Optimization with Continuous Representations Highlight: In this paper, we model the prompt optimization problem by the probability distribution of the prompt and present a novel approach that integrates greedy strategies into optimization with continuous representations. |
Yangkun Wang; Zihan Wang; Jingbo Shang; |
| 431 | T2I-FactualBench: Benchmarking The Factuality of Text-to-Image Models with Knowledge-Intensive Concepts Highlight: In this work, we present T2I-FactualBench—the largest benchmark to date in terms of the number of concepts and prompts specifically designed to evaluate the factuality of knowledge-intensive concept generation. |
Ziwei Huang; Wanggui He; Quanyu Long; Yandi Wang; Haoyuan Li; Zhelun Yu; Fangxun Shu; Weilong Dai; Hao Jiang; Fei Wu; Leilei Gan; |
| 432 | Length Controlled Generation for Black-box LLMs Highlight: In this paper, we propose a novel iterative sampling framework for text length control, integrating the Metropolis-Hastings algorithm with an importance sampling acceleration strategy. |
Yuxuan Gu; Wenjie Wang; Xiaocheng Feng; Weihong Zhong; Kun Zhu; Lei Huang; Ting Liu; Bing Qin; Tat-Seng Chua; |
| 433 | YuLan-Mini: Pushing The Limits of Open Data-efficient Language Model Highlight: In this paper, we explore the key bottlenecks and designs during pre-training, and make the following contributions: (1) a comprehensive investigation into the factors contributing to training instability; (2) a robust optimization approach designed to mitigate training instability effectively; (3) an elaborate data pipeline that integrates data synthesis, data curriculum, and data selection. |
Hu Yiwen; Huatong Song; Jie Chen; Jia Deng; Jiapeng Wang; Kun Zhou; Yutao Zhu; Jinhao Jiang; Zican Dong; Yang Lu; Xu Miao; Xin Zhao; Ji-Rong Wen; |
| 434 | KRISTEVA: Close Reading As A Novel Task for Benchmarking Interpretive Reasoning Highlight: With KRISTEVA, we propose three progressively more difficult sets of tasks to approximate different elements of the close reading process, which we use to test how well LLMs understand and reason about literary works: 1) extracting stylistic features, 2) retrieving relevant contextual information from parametric knowledge, and 3) multi-hop reasoning between style and external contexts. |
Peiqi Sui; Juan Diego Rodriguez; Philippe Laban; J. Dean Murphy; Joseph P. Dexter; Richard Jean So; Samuel Baker; Pramit Chaudhuri; |
| 435 | SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation Highlight: Current vision-language models may grasp basic spatial cues and simple directions (e.g., left, right, front, back), but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. To address this gap, we develop SPHERE (Spatial Perception and Hierarchical Evaluation of REasoning), a hierarchical evaluation framework supported by a new human-annotated dataset. |
Wenyu Zhang; Wei En Ng; Lixin Ma; Yuwen Wang; Junqi Zhao; Allison Koenecke; Boyang Li; Wanglu Wanglu; |
| 436 | Contextual Experience Replay for Self-Improvement of Language Agents Highlight: Moreover, current LLM agents are not designed to continually learn from past experiences during inference time, which could be crucial for them to gain these environment-specific experiences. To address this, we propose Contextual Experience Replay (CER), a training-free framework to enable efficient self-improvement for language agents in their context window. |
Yitao Liu; Chenglei Si; Karthik R Narasimhan; Shunyu Yao; |
| 437 | Can Graph Descriptive Order Affect Solving Graph Problems with LLMs? Highlight: In this study, we present the first comprehensive analysis of how the order of graph descriptions impacts LLM performance. |
Yuyao Ge; Shenghua Liu; Baolong Bi; Yiwei Wang; Lingrui Mei; Wenjie Feng; Lizhe Chen; Xueqi Cheng; |
| 438 | CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges Highlight: Specifically, their judgments may become inconsistent when the option positions or ID tokens are swapped, compromising the effectiveness and fairness of the evaluation result. To address this challenge, we introduce CalibraEval, a novel label-free method for mitigating selection bias during inference. |
Haitao Li; Junjie Chen; Qingyao Ai; Zhumin Chu; Yujia Zhou; Qian Dong; Yiqun Liu; |
| 439 | Just Go Parallel: Improving The Multilingual Capabilities of Large Language Models Highlight: In this work, we conduct a systematic study on the impact of adding parallel data on LLMs’ multilingual capabilities, focusing specifically on translation and multilingual common-sense reasoning. |
Muhammad Reza Qorib; Junyi Li; Hwee Tou Ng; |
| 440 | A Reality Check on Context Utilisation for Retrieval-Augmented Generation Highlight: We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance. |
Lovisa Hagström; Sara Vera Marjanovic; Haeun Yu; Arnav Arora; Christina Lioma; Maria Maistro; Pepa Atanasova; Isabelle Augenstein; |
| 441 | METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling Highlight: In this work, we build a vision-language model (VLM) based multi-agent framework for effective automatic chart generation. |
Bingxuan Li; Yiwei Wang; Jiuxiang Gu; Kai-Wei Chang; Nanyun Peng; |
| 442 | Learning to Rewrite: Generalized LLM-Generated Text Detection Highlight: We introduce Learning2Rewrite, a novel framework to detect LLM-generated text with exceptional generalization to unseen domains. |
Wei Hao; Ran Li; Weiliang Zhao; Junfeng Yang; Chengzhi Mao; |
| 443 | STaR-SQL: Self-Taught Reasoner for Text-to-SQL Highlight: In this paper, we introduce Self-Taught Reasoner for text-to-SQL (STaR-SQL), a novel approach that reframes SQL query generation as a reasoning-driven process. |
Mingqian He; Yongliang Shen; Wenqi Zhang; Qiuying Peng; Jun Wang; Weiming Lu; |
| 444 | TigerLLM – A Family of Bangla Large Language Models Highlight: A few initiatives have attempted to create open-source Bangla LLMs, but their performance still lags behind high-resource languages and their reproducibility is limited. To address this gap, we introduce TigerLLM – a family of Bangla LLMs. |
Nishat Raihan; Marcos Zampieri; |
| 445 | Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs Highlight: We introduce PALM, a year-long community-driven project covering all 22 Arab countries. |
Fakhraddin Alwajih; Abdellah El Mekki; Samar Mohamed Magdy; AbdelRahim A. Elmadany; Omer Nacar; El Moatez Billah Nagoudi; Reem Abdel-Salam; Hanin Atwany; Youssef Nafea; Abdulfattah Mohammed Yahya; Rahaf Alhamouri; Hamzah A. Alsayadi; Hiba Zayed; Sara Shatnawi; Serry Sibaee; Yasir Ech-chammakhy; Walid Al-Dhabyani; Marwa Mohamed Ali; Imen Jarraya; Ahmed Oumar El-Shangiti; Aisha Alraeesi; Mohammed Anwar AL-Ghrawi; Abdulrahman S. Al-Batati; Elgizouli Mohamed; Noha Taha Elgindi; Muhammed Saeed; Houdaifa Atou; Issam Ait Yahia; Abdelhak Bouayad; Mohammed Machrouh; Amal Makouar; Dania Alkawi; Mukhtar Mohamed; Safaa Taher Abdelfadil; Amine Ziad Ounnoughene; Anfel Rouabhia; Rwaa Assi; Ahmed Sorkatti; Mohamedou Cheikh Tourad; Anis Koubaa; Ismail Berrada; Mustafa Jarrar; Shady Shehata; Muhammad Abdul-Mageed; |
| 446 | A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior Highlight: Standard modeling approaches, however, overlook much of the spatio-temporal dynamics involved in reading by relying on aggregated reading measurements—typically only focusing on fixation durations—and employing models with strong simplifying assumptions. In this paper, we propose a generative model that captures not only how long fixations last, but also where they land and when they occur. |
Francesco Ignazio Re; Andreas Opedal; Glib Manaiev; Mario Giulianelli; Ryan Cotterell; |
| 447 | Around The World in 24 Hours: Probing LLM Knowledge of Time and Place Highlight: In this paper, we present the first evaluation of the ability of language models to jointly reason over time and space. |
Carolin Holtermann; Paul Röttger; Anne Lauscher; |
| 448 | SCALE: Towards Collaborative Content Analysis in Social Science with Large Language Model Agents and Human Intervention Highlight: In this paper, we introduce SCALE, a novel multi-agent framework that effectively **S**imulates **C**ontent **A**nalysis via **L**arge language model (LLM) ag**E**nts. |
Chengshuai Zhao; Zhen Tan; Chau-Wai Wong; Xinyan Zhao; Tianlong Chen; Huan Liu; |
| 449 | Towards Harmonized Uncertainty Estimation for Large Language Models Highlight: While recent efforts have made significant advancements by leveraging the internal logic and linguistic features of LLMs to estimate uncertainty scores, our empirical analysis highlights the failure of these methods to strike a harmonized estimation among indication, balance, and calibration, which hinders their broader capability for accurate uncertainty estimation. To address this challenge, we propose CUE (Corrector for Uncertainty Estimation): a straightforward yet effective method that employs a lightweight model trained on data aligned with the target LLM’s performance to adjust uncertainty scores. |
Rui Li; Jing Long; Muge Qi; Heming Xia; Lei Sha; Peiyi Wang; Zhifang Sui; |
| 450 | ToolCoder: A Systematic Code-Empowered Tool Learning Framework for Large Language Models Highlight: We propose ToolCoder, a novel framework that reformulates tool learning as a code generation task. |
Hanxing Ding; Shuchang Tao; Liang Pang; Zihao Wei; Jinyang Gao; Bolin Ding; Huawei Shen; Xueqi Cheng; |
| 451 | Lost in Literalism: How Supervised Training Shapes Translationese in LLMs Highlight: In this work, we systematically evaluate the prevalence of translationese in LLM-generated translations and investigate its roots during supervised training. We introduce methods to mitigate these biases, including polishing golden references and filtering unnatural training instances. |
Yafu Li; Ronghao Zhang; Zhilin Wang; Huajian Zhang; Leyang Cui; Yongjing Yin; Tong Xiao; Yue Zhang; |
| 452 | LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates Highlight: Recent findings reveal that much of the knowledge in a Transformer-based Large Language Model (LLM) is encoded in its feed-forward (FFN) layers, where each FFN layer can be interpreted as the summation of sub-updates, each corresponding to a weighted column vector from the FFN’s value parameter matrix that often encodes human-interpretable concepts. In light of this, we hypothesize that model performance and behaviors can be further enhanced and controlled by modulating the contributions of these sub-updates based on their relevance to the input or target output style, and propose LLMBraces, a novel and efficient method that computes relevance scores associated with value vectors in FFN layers and leverages these scores to dynamically adjust the contribution of sub-updates. |
Ying Shen; Lifu Huang; |
| 453 | Revisit Self-Debugging with Self-Generated Tests for Code Generation Highlight: We propose and analyze two distinct paradigms for the self-debugging process: post-execution and in-execution self-debugging. |
Xiancai Chen; Zhengwei Tao; Kechi Zhang; Changzhi Zhou; Xinyu Zhang; Wanli Gu; Yuanpeng He; Mengdi Zhang; Xunliang Cai; Haiyan Zhao; Zhi Jin; |
| 454 | Improving Model Factuality with Fine-grained Critique-based Evaluator Highlight: Factuality evaluation aims to detect factual errors produced by language models (LMs) and hence guide the development of more factual models. Towards this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback. |
Yiqing Xie; Wenxuan Zhou; Pradyot Prakash; Di Jin; Yuning Mao; Quintin Fettes; Arya Talebzadeh; Sinong Wang; Han Fang; Carolyn Rose; Daniel Fried; Hejia Zhang; |
| 455 | PlanningArena: A Modular Benchmark for Multidimensional Evaluation of Planning and Tool Learning Highlight: Recent studies have revealed that the performance of LLMs can be significantly improved by integrating external tools. Based on this, we propose a benchmark framework called PlanningArena, which aims to simulate real application scenarios and provide a series of apps and API tools that may be involved in the actual planning process. |
Zihan Zheng; Tianle Cui; Chuwen Xie; Jiahui Pan; Qianglong Chen; Lewei He; |
| 456 | IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in The Absence of Tabular Data Highlight: While recent LLM-based methods excel at identifying commonly known causal relations, they fail to uncover novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. |
Tao Feng; Lizhen Qu; Niket Tandon; Gholamreza Haffari; |
| 457 | On The Reliability of Large Language Models for Causal Discovery Highlight: This study investigates the efficacy of Large Language Models (LLMs) in causal discovery. |
Tao Feng; Lizhen Qu; Niket Tandon; Zhuang Li; Xiaoxi Kang; Gholamreza Haffari; |
| 458 | GIFT-SW: Gaussian Noise Injected Fine-Tuning of Salient Weights for LLMs Highlight: Recent studies have shown that a small subset of weights significantly impacts performance. Based on this observation, we introduce a novel PEFT method, called Gaussian noise Injected Fine Tuning of Salient Weights (GIFT-SW). |
Maxim Zhelnin; Viktor Moskvoretskii; Egor Shvetsov; Maria Krylova; Venediktov Egor; Zuev Aleksandr; Evgeny Burnaev; |
| 459 | Enhancing Transformers for Generalizable First-Order Logical Entailment Highlight: Interestingly, our results revealed the mismatch of positional encoding and other design choices of transformer architectures in previous practices. Motivated by this, we propose **TEGA**, a logic-aware architecture that significantly improves the performance in generalizable first-order logical entailment. |
Tianshi Zheng; Jiazheng Wang; Zihao Wang; Jiaxin Bai; Hang Yin; Zheye Deng; Yangqiu Song; Jianxin Li; |
| 460 | Pre-Training Curriculum for Multi-Token Prediction in Language Models Highlight: However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. |
Ansar Aynetdinov; Alan Akbik; |
| 461 | Optimizing Question Semantic Space for Dynamic Retrieval-Augmented Multi-hop Question Answering Highlight: In this paper, we propose Optimizing Question Semantic Space for Dynamic Retrieval-Augmented Multi-hop Question Answering (Q-DREAM). |
Linhao Ye; Lang Yu; Zhikai Lei; Qin Chen; Jie Zhou; Liang He; |
| 462 | EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning Via Step-wise Intention-Driven Product Association Highlight: In this paper, we step forward by formally defining the task of E-commerce Script Planning (EcomScript) as three sequential subtasks. |
Weiqi Wang; Limeng Cui; Xin Liu; Sreyashi Nag; Wenju Xu; Chen Luo; Sheikh Muhammad Sarwar; Yang Li; Hansu Gu; Hui Liu; Changlong Yu; Jiaxin Bai; Yifan Gao; Haiyang Zhang; Qi He; Shuiwang Ji; Yangqiu Song; |
| 463 | MARS: Benchmarking The Metaphysical Reasoning Abilities of Language Models with A Multi-task Evaluation Dataset Highlight: Despite its fundamental significance, this ability remains underexplored due to the complexity of modeling infinite possible changes in an event and their associated distributions, coupled with the lack of benchmark data with situational transitions. Addressing these gaps, we propose a novel formulation of ***reasoning with distributional changes as a three-step discriminative process***, termed as ***MetAphysical ReaSoning***. |
Weiqi Wang; Yangqiu Song; |
| 464 | RATIONALYST: Pre-training Process-Supervision for Improving Reasoning Highlight: The reasoning steps generated by LLMs might be incomplete, as they mimic logical leaps common in everyday communication found in their pre-training data: underlying rationales are frequently left implicit (unstated). To address this challenge, we introduce RATIONALYST, a model for process-supervision of reasoning based on pre-training on a vast collection of rationale annotations extracted from unlabeled data. |
Dongwei Jiang; Guoxuan Wang; Yining Lu; Andrew Wang; Jingyu Zhang; Chuyu Liu; Benjamin Van Durme; Daniel Khashabi; |
| 465 | FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation Highlight: To this end, this paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model’s parametric knowledge and retrieved context. |
Qinggang Zhang; Zhishang Xiang; Yilin Xiao; Le Wang; Junhui Li; Xinrun Wang; Jinsong Su; |
| 466 | SConU: Selective Conformal Uncertainty in Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose a novel approach termed Selective Conformal Uncertainty (SConU), which, for the first time, implements significance tests, by developing two conformal p-values that are instrumental in determining whether a given sample deviates from the uncertainty distribution of the calibration set at a specific manageable risk level. |
Zhiyuan Wang; Qingni Wang; Yue Zhang; Tianlong Chen; Xiaofeng Zhu; Xiaoshuang Shi; Kaidi Xu; |
| 467 | CoT-based Synthesizer: Enhancing LLM Performance Through Answer Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we propose a novel inference scaling strategy, CoT-based Synthesizer, which leverages CoT reasoning to synthesize superior answers by analyzing complementary information from multiple candidate responses, even when all candidates are flawed. |
Bohan Zhang; Xiaokang Zhang; Jing Zhang; Jifan Yu; Sijia Luo; Jie Tang; |
| 468 | What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we conduct systematic research in a challenging area: the automatic evaluation of book-length stories (>100K tokens). |
Dingyi Yang; Qin Jin; |
| 469 | Design Choices for Extending The Context Length of Visual Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Moreover, existing open-source VLMs lack systematic exploration into extending their context length, and commercial models often provide limited details. To tackle this, we aim to establish an effective solution that enhances long context performance of VLMs while preserving their capacities in short context scenarios. |
Mukai Li; Lei Li; Shansan Gong; Qi Liu; |
| 470 | MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Leveraging multi-agentic frameworks to enhance large language models (LLMs) has demonstrated significant potential recently, with most existing studies focusing on prompting and developing workflows with frozen LLMs. In this paper, we aim to further unleash the power of such multi-agentic frameworks for post-training LLMs for better collaboration. |
Chanwoo Park; Seungju Han; Xingzhi Guo; Asuman E. Ozdaglar; Kaiqing Zhang; Joo-Kyung Kim; |
| 471 | Exploring Forgetting in Large Language Model Pre-Training Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Based on our revised assessment of forgetting metrics, we explored low-cost, straightforward methods to mitigate forgetting during the pre-training phase. |
Chonghua Liao; Ruobing Xie; Xingwu Sun; Haowen Sun; Zhanhui Kang; |
| 472 | WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, existing RAG frameworks are primarily designed for text-based LLMs and rely on Automatic Speech Recognition to process speech input, which discards crucial audio information, risks transcription errors, and increases computational overhead. Therefore, we introduce WavRAG, the first retrieval augmented generation framework with native, end-to-end audio support. |
Yifu Chen; Shengpeng Ji; Haoxiao Wang; Ziqing Wang; Siyu Chen; Jinzheng He; Jin Xu; Zhou Zhao; |
| 473 | Dynamic Parallel Tree Search for Efficient LLM Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The challenges of accelerating the ToT lie in the frequent switching of reasoning focus, and the redundant exploration of suboptimal solutions. To alleviate this dilemma, we propose Dynamic Parallel Tree Search (DPTS), a novel parallelism framework that aims to dynamically optimize the reasoning path in inference. |
Yifu Ding; Wentao Jiang; Shunyu Liu; Yongcheng Jing; Jinyang Guo; Yingjie Wang; Jing Zhang; Zengmao Wang; Ziwei Liu; Bo Du; Xianglong Liu; Dacheng Tao; |
| 474 | MMDEND: Dendrite-Inspired Multi-Branch Multi-Compartment Parallel Spiking Neuron for Sequence Modeling Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Though parallel spiking neurons are an efficient solution, their number of parameters is often tied to the hidden dimension or sequence length, which makes current parallel neurons unsuitable for large architectures. To address these issues, we propose **MMDEND**: a Multi-Branch Multi-Compartment Parallel Spiking Dendritic Neuron. |
Kexin Wang; Yuhong Chou; Di Shang; Shijie Mei; Jiahong Zhang; Yanbin Huang; Man Yao; Bo Xu; Guoqi Li; |
| 475 | BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we investigate whether BFS can achieve competitive performance in large-scale theorem proving tasks. |
Ran Xin; Chenguang Xi; Jie Yang; Feng Chen; Hang Wu; Xia Xiao; Yifan Sun; Shen Zheng; Ming Ding; |
| 476 | Beyond Output Matching: Bidirectional Alignment for Enhanced In-Context Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Based on the finding that the performance of ICL is highly sensitive to the selection of demonstration examples, we propose Bidirectional Alignment (BiAlign) to fully leverage the models’ preferences for ICL examples to improve the ICL abilities of student models. |
Chengwei Qin; Wenhan Xia; Fangkai Jiao; Chen Chen; Yuchen Hu; Bosheng Ding; Ruirui Chen; Shafiq Joty; |
| 477 | FlashAudio: Rectified Flow for Fast and High-Fidelity Text-to-Audio Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we introduce FlashAudio with rectified flows to learn straight flow for fast simulation. |
Huadai Liu; Jialei Wang; Rongjie Huang; Yang Liu; Heng Lu; Zhou Zhao; Wei Xue; |
| 478 | MIR: Methodology Inspiration Retrieval for Scientific Research Problems Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We address the challenge of retrieving prior work whose concepts can inspire solutions for a given research problem, a task we define as Methodology Inspiration Retrieval (MIR). |
Aniketh Garikaparthi; Manasi Patwardhan; Aditya Sanjiv Kanade; Aman Hassan; Lovekesh Vig; Arman Cohan; |
| 479 | ExploraCoder: Advancing Code Generation for Multiple Unseen APIs Via Planning and Chained Exploration Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Inspired by exploratory programming paradigm in human behavior, we propose **ExploraCoder**, a training-free framework that empowers LLMs to invoke multiple unseen APIs in code solution by (1) planning a complex problem into several API invocation subtasks, and (2) experimenting with correct API usage at intermediate steps through a novel chain-of-API-exploration. |
Yunkun Wang; Yue Zhang; Zhen Qin; Chen Zhi; Binhua Li; Fei Huang; Yongbin Li; Shuiguang Deng; |
| 480 | What Makes A Good Natural Language Prompt? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a property- and human-centric framework for evaluating prompt quality, encompassing 21 properties categorized into six dimensions. |
Do Xuan Long; Duy Dinh; Ngoc-Hai Nguyen; Kenji Kawaguchi; Nancy F. Chen; Shafiq Joty; Min-Yen Kan; |
| 481 | UMedSum: A Unified Framework for Clinical Abstractive Summarization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Clinical abstractive summarization struggles to balance faithfulness and informativeness, sacrificing key information or introducing confabulations. Techniques like in-context learning and fine-tuning have improved overall summary quality orthogonally, without considering the above issue. |
Aishik Nagar; Yutong Liu; Andy T. Liu; Viktor Schlegel; Vijay Prakash Dwivedi; Arun-Kumar Kaliya-Perumal; Guna Pratheep Kalanchiam; Yili Tang; Robby T. Tan; |
| 482 | Finding The Sweet Spot: Preference Data Construction for Scaling Preference Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this work, we aim to scale up the number of on-policy samples via repeated random sampling to improve alignment performance. |
Yao Xiao; Hai Ye; Linyao Chen; Hwee Tou Ng; Lidong Bing; Xiaoli Li; Roy Ka-Wei Lee; |
| 483 | LLMs Can Simulate Standardized Patients Via Agent Coevolution Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, this focus has overlooked the critical need for patient agents to learn a standardized presentation pattern that transforms data into human-like patient responses through unsupervised simulations. To address this gap, we propose EvoPatient, a novel simulated patient framework in which a patient agent and doctor agents simulate the diagnostic process through multi-turn dialogues, simultaneously gathering experience to improve the quality of both questions and answers, ultimately enabling human doctor training. |
Zhuoyun Du; Lujie Zheng; Renjun Hu; Yuyang Xu; Xiawei Li; Ying Sun; Wei Chen; Jian Wu; Haolei Cai; Haochao Ying; |
| 484 | Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: To this end, we introduce Asclepius, a novel Med-MLLM benchmark that comprehensively assesses Med-MLLMs in terms of: distinct medical specialties (cardiovascular, gastroenterology, etc.) and different diagnostic capacities (perception, disease analysis, etc.). |
Jie Liu; Wenxuan Wang; Su Yihang; Jingyuan Huang; Yudi Zhang; Cheng-Yi Li; Wenting Chen; Xiaohan Xing; Kao-Jung Chang; Linlin Shen; Michael R. Lyu; |
| 485 | ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: And existing pre-trained backdoor attacks are idealized in practice due to resource access constraints. Therefore, we establish ELBA-Bench, a comprehensive and unified framework that allows attackers to inject backdoors through parameter-efficient fine-tuning (e.g., LoRA) or without fine-tuning techniques (e.g., in-context learning). |
Xuxu Liu; Siyuan Liang; Mengya Han; Yong Luo; Aishan Liu; Xiantao Cai; Zheng He; Dacheng Tao; |
| 486 | CodeIF: Benchmarking The Instruction-Following Capabilities of Large Language Models for Code Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we introduce CodeIF, the first benchmark specifically designed to assess the abilities of LLMs to adhere to task-oriented instructions within diverse code generation scenarios. |
Kaiwen Yan; Hongcheng Guo; Xuanqing Shi; Shaosheng Cao; Donglin Di; Zhoujun Li; |
| 487 | Capture The Key in Reasoning to Enhance CoT Distillation Generalization Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, previous distillation methods typically involve supervised fine-tuning student SLMs only on correct CoTs data produced by teacher LLMs, resulting in students struggling to learn the key, instead imitating the teacher’s reasoning forms and making errors or omissions in reasoning. To address these issues, drawing an analogy to human learning, where analyzing mistakes according to correct solutions often reveals the crucial steps leading to successes or failures, we propose mistakE-Driven key reasonIng step distillaTion (EDIT), a novel method that further aids SLMs learning key reasoning steps rather than mere simple fine-tuning. |
Chengwei Dai; Kun Li; Wei Zhou; Songlin Hu; |
| 488 | Game Development As Human-LLM Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We propose a data synthesis pipeline based on LLM to generate game script-code pairs and interactions from a few manually crafted seed data. |
Jiale Hong; Hongqiu Wu; Hai Zhao; |
| 489 | Towards Multi-System Log Anomaly Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: Additionally, these models often encounter the **“identical shortcut”** predicament, erroneously predicting normal classes when confronted with rare anomaly logs due to reconstruction errors. To address these issues, we propose **MLAD**, a novel **M**ulti-system **L**og **A**nomaly **D**etection model incorporating semantic relational reasoning. |
Boyang Wang; Runqiang Zang; Hongcheng Guo; Shun Zhang; Shaosheng Cao; Donglin Di; Zhoujun Li; |
| 490 | Confidence V.s. Critique: A Decomposition of Self-Correction Capability for LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: To have a deeper understanding of self-correction, we endeavor to decompose, evaluate, and analyze the self-correction behaviors of LLMs. By enumerating and analyzing answer correctness before and after self-correction, we decompose the self-correction capability into confidence (being confident to correct answers) and critique (turning wrong answers to correct) capabilities, and propose two metrics from a probabilistic perspective to measure these 2 capabilities, along with another metric for overall self-correction capability evaluation. |
Zhe Yang; Yichang Zhang; Yudong Wang; Ziyao Xu; Junyang Lin; Zhifang Sui; |
| 491 | Document-Level Text Generation with Minimum Bayes Risk Decoding Using Optimal Transport Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this paper, we investigate the adaptation of Minimum Bayes Risk (MBR) decoding for document-level text generation tasks. |
Yuu Jinnai; |
| 492 | Extending Complex Logical Queries on Uncertain Knowledge Graphs Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: The study of machine learning-based logical query-answering enables reasoning with large-scale and incomplete knowledge graphs. |
Weizhi Fei; Zihao Wang; Hang Yin; Yang Duan; Yangqiu Song; |
| 493 | MMBoundary: Advancing MLLM Knowledge Boundary Awareness Through Reasoning Step Confidence Calibration Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. |
Zhitao He; Sandeep Polisetty; Zhiyuan Fan; Yuchen Huang; Shujin Wu; Yi R. Fung; |
| 494 | Can MLLMs Understand The Deep Implication Behind Chinese Images? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: However, there is a lack of work evaluating MLLM for higher-order perception and understanding of Chinese visual content. To address this, we introduce the CII-Bench, which aims to assess MLLMs’ such capabilities for Chinese images. |
Chenhao Zhang; Xi Feng; Yuelin Bai; Xeron Du; Jinchang Hou; Kaixin Deng; Guangzeng Han; Qinrui Li; Bingli Wang; Jiaheng Liu; Xingwei Qu; Yifei Zhang; Qixuan Zhao; Yiming Liang; Ziqiang Liu; Feiteng Fang; Min Yang; Wenhao Huang; Chenghua Lin; Ge Zhang; Shiwen Ni; |
| 495 | DNASpeech: A Contextualized and Situated Text-to-Speech Dataset with Dialogues, Narratives and Actions Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we propose contextualized and situated text-to-speech (CS-TTS), a novel TTS task to promote more accurate and customized speech generation using prompts with Dialogues, Narratives, and Actions (DNA). |
Chuanqi Cheng; Hongda Sun; Bo Du; Shuo Shang; Xinrong Hu; Rui Yan; |
| 496 | Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models’ Uncertainty? Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, it remains unclear whether LLMs consistently use these markers to reflect their intrinsic confidence due to the difficulty of quantifying uncertainty associated with various markers. To address this gap, we first define ***marker confidence*** as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. |
Jiayu Liu; Qing Zong; Weiqi Wang; Yangqiu Song; |
| 497 | Unveiling Language-Specific Features in Large Language Models Via Sparse Autoencoders Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: We introduce a novel metric to assess the monolinguality of features obtained from SAEs, discovering that some features are strongly related to specific languages. |
Boyi Deng; Yu Wan; Baosong Yang; Yidan Zhang; Fuli Feng; |
| 498 | Redundancy Principles for MLLMs Benchmarks Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: In this paper, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains. |
Zicheng Zhang; Xiangyu Zhao; Xinyu Fang; Chunyi Li; Xiaohong Liu; Xiongkuo Min; Haodong Duan; Kai Chen; Guangtao Zhai; |
| 499 | CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Save Highlight: Inspired by how evidence is assessed in the legal domain, we design a rigorous framework to assess different levels of faithfulness of cognitive statements and introduce the CogniBench dataset where we reveal insightful statistics. |
Xiaqiang Tang; Jian Li; Keyu Hu; Nan Du; Xiaolong Li; Xi Zhang; Weigao Sun; Sihong Xie; |
| 500 | FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation Related Papers Related Patents Related Grants Related Venues Related Experts View Save Highlight: However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. |
Wei Li; Xin Zhang; Zhongxin Guo; Shaoguang Mao; Wen Luo; Guangyue Peng; Yangyu Huang; Houfeng Wang; Scarlett Li; |
This table only includes 500 papers selected by our daily digest algorithm. To continue with the full list (~1,800 papers), please visit Paper Digest: ACL-2025 (Full List).