MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning

Ziliang Gan, Dong Zhang, Haohan Li, Yang Wu, Xueyuan Lin, Ji Liu, Haipang Wu, Chaoyou Fu, Zenglin Xu, Rongjunchen Zhang, Yong Dai

Venue: ACM International Conference on Multimedia (ACM MULTIMEDIA) 2025
Recognition: Most Influential ACM MULTIMEDIA 2025 Paper (Rank No. 9)
Edition: 2026-03
Impact factor: 3
Certificate ID: bba8d9ed8405f1fd

Abstract

To date, there is a notable lack of rigorous benchmarks for assessing Multimodal Large Language Models (MLLMs) in the financial domain, a field characterized by specialized financial charts and complex domain-specific expertise. To address this gap, we introduce MME-Finance, the first comprehensive bilingual multimodal benchmark tailored for financial analysis. MME-Finance comprises 4,751 meticulously curated samples: 2,274 open-ended questions, 2,000 binary-choice questions, and 477 multi-turn questions. To mitigate bias when LLMs act as judges, we also develop an evaluation framework that improves alignment with human judgments by incorporating visual context into the multimodal assessment pipeline. We comprehensively evaluate 31 popular MLLMs on their perception, reasoning, and cognitive capabilities. Gemini 2.5 Pro achieves the highest accuracy on open-ended and multi-turn questions, at 79.28% and 85.71%, respectively. Among open-source models, InternVL3-78B attains 71.24% accuracy on open-ended questions, whereas Qwen2.5-VL-72B achieves an F1 score of 88.73% on binary-choice questions. The results indicate that state-of-the-art MLLMs demonstrate considerable overall competence, yet exhibit significant deficiencies in fine-grained visual perception and in understanding domain-specific financial images. Source code is available at https://github.com/HiThink-Research/MME-Finance.
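The abstract reports binary-choice results as an F1 score but does not say how that score is computed. The following is a minimal sketch, assuming each model answer has already been normalized to one of two label strings and that "yes" is treated as the positive class. Both the label scheme and the helper name `binary_f1` are hypothetical and not taken from the paper or the MME-Finance repository.

```python
from typing import Sequence


def binary_f1(predictions: Sequence[str], labels: Sequence[str], positive: str = "yes") -> float:
    """F1 score for a two-label task, treating `positive` as the positive class.

    Hypothetical helper: assumes answers were already normalized to label strings
    such as "yes"/"no"; this is not the benchmark's official scoring code.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    pairs = list(zip(predictions, labels))
    # Count true positives, false positives, and false negatives for the positive class.
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)

    # With no true positives, precision and recall are 0 (or undefined), so F1 is 0.
    if tp == 0:
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    preds = ["yes", "no", "yes", "yes"]
    gold = ["yes", "no", "no", "yes"]
    print(binary_f1(preds, gold))  # 2 TP, 1 FP, 0 FN -> precision 2/3, recall 1 -> F1 0.8
```

Because binary F1 depends on which class counts as positive, a benchmark may instead report a macro-averaged F1 over both classes. The sketch fixes one choice only for illustration.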
