PAPER DIGEST
Most Influential ACM MULTIMEDIA 2025 Paper · 2026-03 edition

Manipulating Multimodal Agents Via Cross-Modal Prompt Injection

Le Wang, Zonghao Ying, Tianyuan Zhang, Siyuan Liang, Shengshan Hu, Mingchuan Zhang, Aishan Liu, Xianglong Liu

Venue
ACM International Conference on Multimedia (ACM MULTIMEDIA) 2025
Recognition
Most Influential ACM MULTIMEDIA 2025 Paper (Rank No. 14)
Edition
2026-03
Impact factor
3
Certificate ID
0022ba615f463644

Abstract

The emergence of multimodal large language models has redefined the agent paradigm by integrating language and vision modalities with external data sources, enabling agents to better interpret human instructions and execute increasingly complex tasks. However, in this paper, we identify a critical yet previously overlooked security vulnerability in multimodal agents: cross-modal prompt injection attacks. To exploit this vulnerability, we propose CrossInject, a novel attack framework in which an attacker embeds adversarial perturbations across multiple modalities to align with target malicious content, allowing external instructions to hijack the agent's decision-making process and execute unauthorized tasks. Our approach incorporates two coordinated components. First, we introduce Visual Latent Alignment, in which we optimize adversarial features toward the malicious instructions in the visual embedding space using a text-to-image generative model, ensuring that adversarial images subtly encode cues for malicious task execution. Second, we present Textual Guidance Enhancement, in which a large language model is leveraged to construct the black-box defensive system prompt through adversarial meta-prompting and then to generate a malicious textual command based on it, steering the agent's output toward better compliance with the attacker's requests. Extensive experiments demonstrate that our method outperforms state-of-the-art attacks, achieving at least a +30.1% increase in attack success rate across diverse tasks. Furthermore, we validate our attack's effectiveness on real-world multimodal autonomous agents, highlighting its potential implications for safety-critical applications. Code is available at https://github.com/Larry0454/CrossInject.
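
Illustrative sketch (not the authors' code): the abstract describes Visual Latent Alignment only at a high level, so the snippet below is a toy illustration of the general idea of pulling an image's embedding toward a target embedding under a bounded perturbation. Everything in it is an assumption made for illustration: the linear map W stands in for a real visual encoder, target_emb stands in for the embedding of an image that a text-to-image model might produce from an instruction, and the names align_embedding, embedding_distance, eps, alpha, and steps are hypothetical. The signed-gradient update with an L-infinity budget is a common choice in the adversarial-example literature and may differ from what CrossInject actually uses.

```python
import numpy as np


def embedding_distance(W, x, target_emb):
    """Squared L2 distance between the toy embedding W @ x and the target."""
    residual = W @ x - target_emb
    return float(residual @ residual)


def align_embedding(W, x, target_emb, eps=8 / 255, alpha=None, steps=50):
    """Perturb x within an L-infinity ball of radius eps so that the toy
    embedding W @ (x + delta) moves toward target_emb.

    W is a hypothetical linear encoder, not a real vision model.
    """
    if alpha is None:
        alpha = eps / 10  # small step relative to the budget (assumed)
    delta = np.zeros_like(x)
    best_delta = delta.copy()
    best_loss = embedding_distance(W, x, target_emb)
    for _ in range(steps):
        # Gradient of ||W(x + delta) - t||^2 with respect to delta.
        residual = W @ (x + delta) - target_emb
        grad = 2.0 * W.T @ residual
        # Signed-gradient descent step.
        delta = delta - alpha * np.sign(grad)
        # Project back onto the perturbation budget.
        delta = np.clip(delta, -eps, eps)
        # Keep the perturbed image inside the valid pixel range [0, 1].
        delta = np.clip(x + delta, 0.0, 1.0) - x
        loss = embedding_distance(W, x + delta, target_emb)
        if loss < best_loss:
            best_delta, best_loss = delta.copy(), loss
    return np.clip(x + best_delta, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 64)) / 8.0   # toy encoder: 64 "pixels" to 16 dims
    x = rng.uniform(size=64)              # clean image, flattened
    target_emb = W @ rng.uniform(size=64) # embedding of a stand-in target image
    x_adv = align_embedding(W, x, target_emb)
    print("distance before:", embedding_distance(W, x, target_emb))
    print("distance after: ", embedding_distance(W, x_adv, target_emb))
    print("max pixel change:", np.max(np.abs(x_adv - x)))
```

In this toy setting the perturbation stays within the eps budget while the embedding distance shrinks; a real encoder is nonlinear, so the gradient would come from automatic differentiation rather than the closed form used here.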
