MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models


Measuring divergent and convergent multimodal association intelligence.

11,497 Instances | Remote-Item & In-Context Association Tasks | Hierarchical Ability Annotations | LLM-as-a-Judge Evaluation

NeurIPS 2025 Datasets and Benchmarks Track (Poster)

Zimeng Huang1,2, Jinxin Ke1, Xiaoxuan Fan3, Yufeng Yang1, Yang Liu1, Liu Zhonghan1, Zedi Wang1, Junteng Dai1, Haoyi Jiang1, Yuyu Zhou3, Keze Wang1, Ziliang Chen2*

1Sun Yat-sen University 2Peng Cheng Laboratory 3Jinan University *Corresponding Author


Abstract

Large Vision-Language Models (LVLMs) have exhibited remarkable progress. However, deficiencies remain compared to human intelligence, such as hallucination and shallow pattern matching. In this work, we aim to evaluate a fundamental yet underexplored intelligence: association, a cornerstone of human cognition for creative thinking and knowledge integration. Current benchmarks, often limited to closed-ended tasks, fail to capture the complexity of open-ended association reasoning vital for real-world applications. To address this, we present MM-OPERA, a systematic benchmark with 11,497 instances across two open-ended tasks: Remote-Item Association (RIA) and In-Context Association (ICA), aligning association intelligence evaluation with human psychometric principles. It challenges LVLMs to resemble the spirit of divergent thinking and convergent associative reasoning through free-form responses and explicit reasoning paths. We deploy tailored LLM-as-a-Judge strategies to evaluate open-ended outputs, applying process-reward-informed judgment to dissect reasoning with precision. Extensive empirical studies on state-of-the-art LVLMs, including sensitivity analysis of task instances, validity analysis of LLM-as-a-Judge strategies, and diversity analysis across abilities, domains, languages, cultures, etc., provide a comprehensive and nuanced understanding of the limitations of current LVLMs in associative reasoning, paving the way for more human-like and general-purpose AI.

Contribution Highlights

  • MM-OPERA: We introduce a benchmark of 11,497 instances for evaluating association reasoning, centered on Remote-Item Association (RIA) and In-Context Association (ICA) tasks inspired by classic psychometric studies. It spans 13 analytical dimensions to enable comprehensive assessment.
  • LLM-as-a-Judge Strategies: We design tailored LLM-as-a-Judge methods that assess both response quality and reasoning processes, enabling fine-grained and reliable scoring.
  • Profound Findings: Our analysis reveals key limitations of current LVLMs and highlights the critical role of association reasoning in advancing real-world, general-purpose AI.

Open-Ended Association Tasks

MM-OPERA prioritizes free-form responses, employing reference answers as heuristic quality benchmarks rather than rigid correctness criteria.

Human-Grounded Taxonomy

It spans 13 associative dimensions and diverse cultural, linguistic, and thematic contexts.

Process-Aware Evaluation

We deploy tailored LLM-as-a-Judge strategies to evaluate open-ended outputs, applying process-reward-informed judgment to dissect reasoning with precision.

Benchmark at a Glance

Overview of MM-OPERA tasks and reasoning paths
Figure 1: An overview of MM-OPERA. The RIA task challenges models to discover meaningful connections between unrelated elements, while the ICA task requires transferring relationship patterns from a context pair to a query item to generate an appropriate target. The reference answer represents just one possible valid response. The association reasoning paths are used to evaluate the coherence and depth of the step-by-step reasoning process.

MM-OPERA comprises 11,497 instances across two core tasks (Figure 1): Remote-Item Association (RIA), testing the ability to link distant concepts with structured reasoning, and In-Context Association (ICA), probing pattern recognition within in-context learning. It prioritizes free-form responses, employing reference answers as heuristic quality benchmarks rather than rigid correctness criteria.

Association Tasks

Remote-Item Association (RIA)

  • Instruction: "Describe each image briefly. Analyze and explore the relation between the two images, identifying any possible connections, themes, or shared elements."
  • The reference answer follows Armadillo -> Natural Armor -> Protection and Kevlar -> Bulletproof Vest -> Protection to reveal the shared concept "Protection".
  • The RIA dataset includes Multiple-Image variants where identical concepts appear in different images, enabling controlled sensitivity testing of LVLMs' visual perception.
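For concreteness, an RIA instance can be pictured as a record like the following. This is a minimal sketch; the field names are hypothetical illustrations of the structure described above, not the released schema.

```python
# Hypothetical RIA instance layout (field names are our own illustration).
ria_instance = {
    "images": ["armadillo.jpg", "kevlar.jpg"],
    "instruction": (
        "Describe each image briefly. Analyze and explore the relation "
        "between the two images, identifying any possible connections, "
        "themes, or shared elements."
    ),
    # Reference reasoning paths, one per image, converging on a shared concept.
    "reference_paths": [
        ["Armadillo", "Natural Armor", "Protection"],
        ["Kevlar", "Bulletproof Vest", "Protection"],
    ],
    "shared_concept": "Protection",
}
```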

In-Context Association (ICA)

  • Instruction: 1) Briefly describe Image 1.1, Image 1.2, and Image 2.1. 2) Analyze the relationship between Image 1.1 and Image 1.2. 3) Design Image 2.2 so that its relationship with Image 2.1 mirrors that between Image 1.1 and Image 1.2.
  • The task requires transferring relationship patterns from a context pair to a query item to generate an appropriate target.
  • The ICA dataset employs a circular evaluation strategy, where each set of four images generates four distinct questions, each requiring the model to reason about one image based on the relationships established by the other three.
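A minimal Python sketch of one reading of the circular strategy: each rotation holds out one image of the four and asks the model to reason toward it from the other three. The function name and list-based representation are our own illustration, not the released evaluation harness.

```python
from typing import List, Tuple

def circular_questions(images: List[str]) -> List[Tuple[List[str], str]]:
    """Turn one 4-image ICA set into four (context, target) questions.

    For each rotation, three images establish the relationship pattern,
    and the held-out image is the one the model must reason toward.
    """
    assert len(images) == 4
    questions = []
    for i in range(4):
        context = images[:i] + images[i + 1:]  # the other three images
        target = images[i]                     # the image to be inferred
        questions.append((context, target))
    return questions

# Example: one ICA set yields four distinct questions.
for ctx, tgt in circular_questions(
    ["img1_1.png", "img1_2.png", "img2_1.png", "img2_2.png"]
):
    print("context:", ctx, "-> infer:", tgt)
```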

Statistics of MM-OPERA

Statistics of MM-OPERA
Figure 2: Statistics of MM-OPERA.

(a) Hierarchical ability taxonomy consists of 3 levels, refining perceptual and conceptual associations. We report each ability’s frequency as a percentage of total label occurrences to better represent the dataset’s distribution.
(b) Three relationship types capturing diverse associative connections.
(c) The number of hops in the association reasoning path, quantifying different associative reasoning complexity.
(d) Different cultures, (e) 15 languages, and (f) 22 topic domains ensure broad cultural, linguistic, and thematic diversity.

Hierarchical Ability Taxonomy

We develop a hierarchical annotation framework to systematically evaluate multimodal associative reasoning abilities.

Level-1 (L-1) divides associative abilities into two fundamental categories: Perception and Conception. Level-2 (L-2) further refines these categories into Recognition, Context, Interaction, Logic, Semantic, and Reasoning. Level-3 (L-3) provides the most granular classification with thirteen specific dimensions:

  • Visual Similarity: Associations based on visual features like shape, color, texture, and appearance.
  • Semantic Object: High-level semantic recognition of objects, including fine-grained identification in specific contexts.
  • Contextual Sensory Cues: Perceptual associations based on visual details like tone, lighting, and spatial layout.
  • Scene Contextualization: Understanding of overall scene context, including atmosphere and purpose.
  • Abstract Interpretation: Recognition of abstract concepts and symbolic patterns.
  • Social Insight: Understanding emotions and interactions between people in visual scenes.
  • Relational Perception: Comprehension of spatial and logical relationships between objects.
  • Functional Links: Associations based on functional relationships between concepts.
  • Causal Connections: Associations based on cause-and-effect relationships.
  • Thematic Links: Associations within the same theme or context.
  • Cultural Reference: Associations based on cultural knowledge and specific contexts.
  • Hierarchical Association: Vertical associations between abstract and concrete concepts.
  • Analogical Reasoning: Associations based on structural, feature, or pattern similarities.

This hierarchical framework enables systematic evaluation of LVLMs' associative abilities across different cognitive levels, from basic sensory processing to sophisticated abstract reasoning. The progression from L-1 to L-3 mirrors human cognitive development and provides a comprehensive structure for analyzing multimodal understanding capabilities.
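For downstream analysis, the taxonomy labels can be kept as plain constants, as in the sketch below. The variable names are our own choosing, and the mapping of L-3 dimensions onto L-2 categories follows Figure 2(a) and is not reproduced here.

```python
# Ability labels at each taxonomy level (names copied from the text above).
L1 = ("Perception", "Conception")
L2 = ("Recognition", "Context", "Interaction", "Logic", "Semantic", "Reasoning")
L3 = (
    "Visual Similarity", "Semantic Object", "Contextual Sensory Cues",
    "Scene Contextualization", "Abstract Interpretation", "Social Insight",
    "Relational Perception", "Functional Links", "Causal Connections",
    "Thematic Links", "Cultural Reference", "Hierarchical Association",
    "Analogical Reasoning",
)

def is_valid_label(label: str) -> bool:
    """Check that an annotation uses a known ability label."""
    return label in L1 + L2 + L3
```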

Dataset and Curation

  • MM-OPERA comprises 11,497 instances across two core tasks: Remote-Item Association (RIA) and In-Context Association (ICA).
  • The RIA dataset includes Multiple-Image variants where identical concepts appear in different images, enabling controlled sensitivity testing of LVLMs' visual perception, and over 25% of the instances exhibit unique concept pairs.
  • We identify three relationship categories: Relation, Mutual Element, and Metaphor, capturing diverse associative connections.
  • Our dataset deliberately incorporates various cultures, 15 languages with their unique linguistic devices, and 22 topic domains.
  • 33.35% of the questions and reference answers were adapted from the Remote Associates Test and 4.01% from LI-RAT, with the remaining instances built from images collected from the Internet; all fine-grained annotations were manually constructed and revised to ensure consistency and accuracy.

Evaluation Framework

Responses are graded automatically by LLM judges using two complementary strategies:

  • Regular LLM-as-a-Judge scoring: We adopt unified scoring rubrics that evaluate the association quality of responses, prioritizing depth, coherence, and insight over mere correctness. The holistic score ranges from 0 to 4, and we report the Score Rate (SR), reflecting general performance; the High Score Rates (HR-3, HR-4), measuring the ability to converge on an optimal solution; and ΔHR = HR-3 − HR-4, a proxy for creative divergent thinking, i.e., generating valid but non-optimal associations (a computational sketch of these metrics appears at the end of this section).
  • Process-Reward LLM-as-a-Judge: The judge reformats model responses into an association path (s_1, s_2, ..., s_n), where each step t is assessed for Reasonableness R_t, Distinctiveness D_t, and Knowledgeability K_t. The step score is s_t = α · R_t · D_t + (1 − α) · K_t, and the reasoning score is S_r = Σ_t δ^t · s_t (α = 0.9, δ = 0.9).
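To make the aggregation concrete, here is a minimal Python sketch of the process-reward scoring above. It assumes R_t, D_t, and K_t are already-normalized judge ratings and that the discount starts at δ^0 for the first step; the function name and these normalization details are our assumptions, not the released evaluation code.

```python
from typing import List, Tuple

def process_reward_score(
    steps: List[Tuple[float, float, float]],  # (R_t, D_t, K_t) per step
    alpha: float = 0.9,
    delta: float = 0.9,
) -> float:
    """Aggregate per-step judgments into a reasoning score S_r.

    Each step t contributes s_t = alpha * R_t * D_t + (1 - alpha) * K_t,
    discounted by delta**t so that later hops count slightly less.
    """
    score = 0.0
    for t, (r, d, k) in enumerate(steps):
        s_t = alpha * r * d + (1 - alpha) * k
        score += (delta ** t) * s_t
    return score

# Example: a three-hop association path as judged by the LLM.
path = [(1.0, 0.8, 0.9), (0.9, 0.7, 0.8), (0.8, 0.9, 0.6)]
print(f"S_r = {process_reward_score(path):.3f}")
```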

We employ GPT-4o (gpt-4o-2024-08-06) and DeepSeek-V3 as a mixed LLM-as-a-Judge engine for scoring. To establish a human baseline, 24 undergraduate and graduate students answered subsets of 485 RIA and 436 ICA questions.
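The reported metrics follow directly from the 0-4 rubric. The sketch below assumes SR is the mean score expressed as a percentage of the 4-point maximum and HR-k is the percentage of responses scoring at least k (so ΔHR is exactly the share of score-3, valid-but-non-optimal responses); these normalizations are our reading of the metric names, not quoted definitions.

```python
from typing import List

def judge_metrics(scores: List[int]) -> dict:
    """Summarize holistic judge scores in {0, 1, 2, 3, 4}."""
    n = len(scores)
    sr = 100.0 * sum(scores) / (4 * n)             # Score Rate: mean / max
    hr3 = 100.0 * sum(s >= 3 for s in scores) / n  # High Score Rate, score >= 3
    hr4 = 100.0 * sum(s >= 4 for s in scores) / n  # High Score Rate, score = 4
    return {"SR": sr, "HR-3": hr3, "HR-4": hr4, "ΔHR": hr3 - hr4}

# Example: eight judged responses.
print(judge_metrics([4, 3, 3, 2, 1, 4, 0, 3]))
```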

Access Code on GitHub

Results Snapshot

| Model | RIA SR (%) | RIA HR-4 (%) | ICA SR (%) | ICA HR-4 (%) | ΔHR (RIA / ICA) |
| --- | --- | --- | --- | --- | --- |
| Gemini-2.5-Pro-Preview | 60.05 | 23.89 | 63.09 | 12.85 | 17.86 / 28.30 |
| o4-mini | 60.33 | 19.86 | 61.55 | 10.24 | 18.03 / 26.36 |
| Gemini-2.0-Flash-Thinking-Exp | 59.11 | 17.73 | 61.42 | 9.74 | 18.87 / 28.14 |
| GPT-4o | 59.72 | 10.89 | 58.26 | 6.27 | 17.94 / 23.35 |
| Yi-VL-34B | 45.25 | 4.97 | 54.39 | 1.30 | 14.66 / 18.23 |
| Human baseline* | 61.88 | 22.84 | 68.69 | 31.65 | 26.13 / 29.82 |

*Human results are averaged over 24 participants and sampled data subsets.


LVLMs Far Below Humans in Association Reasoning. MM-OPERA reveals the formidable challenges that associative reasoning poses for current LVLMs. While the latest models, such as o4-mini and recent Gemini releases, show improved performance, with SR approaching the human baseline, they still fall short of achieving high-quality associations. For instance, on the RIA task, o4-mini achieves an HR-4 of 19.86% compared to humans' 22.84%, and on the ICA task, Gemini-2.5-Pro-Preview reaches an HR-4 of 12.85% against humans' 31.65%, demonstrating that sophisticated associative reasoning remains at the cutting edge of LVLM capabilities. Case studies in Appendix C illuminate key limitations, such as cross-domain knowledge retrieval deficiencies and perceptual misalignments.

Creativity Gap in Divergent Thinking. The ΔHR metric highlights divergent thinking: most models score 12%-20%, showing some ability to generate reasonable yet non-optimal associations. The latest Gemini models lead among LVLMs (18.87% and 28.30% on the two tasks), but humans outperform them with both higher ΔHR (26.13% and 29.82%) and higher HR-3 scores, demonstrating a superior balance of creativity and accuracy, an area where LVLMs remain limited.

ICA: Dual Challenge of Pattern Abstraction and Transfer. Most models perform better on RIA than on ICA, highlighting the difficulty of pattern-based associative reasoning. ICA requires not only connecting concepts but also abstracting and transferring relationship patterns to new contexts, a complex process demanding advanced meta-reasoning. Notably, the latest Gemini models, Yi-VL-34B, and GLM-4V perform better on ICA than on RIA, suggesting that certain architectures excel at specific associative reasoning tasks. These distinctions may stem from more effective pattern extraction or transfer mechanisms, warranting further investigation.

Conservative Reasoning vs. Associative Flexibility. Analysis shows an inverse correlation between model constraints and associative ability. Gemini-1.5-Flash (55.86% SR on RIA), optimized for speed, outperforms Gemini-1.5-Pro (45.34% SR), despite Pro's larger size and focus on detailed reasoning. Examination of 500 random RIA samples (Figure 3) shows that Pro's conservative behavior, prioritizing factuality and ethics, led to 1-point scores on nearly 20% of RIA questions (responses such as "unrelated"), versus under 10% for Flash, which tended to offer superficial connections where Pro declined. Thus, factuality checks and ethical considerations, while improving reliability on complex tasks, can limit performance on creative association.

Paper & Supplementary Material

BibTeX

@inproceedings{huang2025mmopera,
  title={{MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models}},
  author={Zimeng Huang and Jinxin Ke and Xiaoxuan Fan and Yufeng Yang and Yang Liu and Liu Zhonghan and Zedi Wang and Junteng Dai and Haoyi Jiang and Yuyu Zhou and Keze Wang and Ziliang Chen},
  booktitle={Advances in Neural Information Processing Systems 38},
  year={2025}
}