Within each subject, the best scores among image representations are highlighted in blue, and the best among text representations in red. Overall, GPT-4 Turbo performs best on images and Claude-3 Opus performs best on text. Across all four subjects in IsoBench, multimodal foundation models show a strong preference for text representations: the gap in accuracy between the best text representation and its isomorphic image representation can be as large as 28.7%.
Illustration of IsoCombination (IsoCB) and IsoScratchPad (IsoSP). IsoCB combines all representations provided by a user and constructs one unified prompt for a foundation model. IsoSP is a two-step prompting method, where a foundation model first describes an image and then uses the textual description as the sole representation for a given task.
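The two prompting strategies can be sketched roughly as follows. This is an illustrative outline only: `call_model` is a hypothetical wrapper around a foundation-model API, and the prompt templates are placeholders, not the exact prompts used in the paper.

```python
def iso_combination(call_model, representations, task_prompt):
    """IsoCB sketch: merge all user-provided representations into one unified prompt.

    `representations` maps a representation name (e.g. "adjacency matrix")
    to its textual form; names and format here are illustrative.
    """
    parts = [task_prompt]
    for name, content in representations.items():
        parts.append(f"[{name}]\n{content}")
    return call_model(prompt="\n\n".join(parts))


def iso_scratchpad(call_model, image, task_prompt):
    """IsoSP sketch: two-step prompting through an intermediate text description."""
    # Step 1: the foundation model describes the image in text.
    description = call_model(prompt="Describe this image in detail.", image=image)
    # Step 2: the textual description becomes the sole representation for the task.
    return call_model(prompt=f"{task_prompt}\n\n{description}")
```

In IsoSP, only the first call needs a multimodal model; the second call operates purely on text, which is what lets the method trade an image input for an isomorphic textual one.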
The best prompting method for each setting is highlighted in red, and improvements over image-only prompts are shown in green parentheses. Both methods improve performance relative to image representations, and in certain domains IsoCombination also improves performance relative to text representations.
Full IsoBench results, including additional API-access models (GPT-3.5 Turbo and PaLM-2) and open-source models (LLaMA-2 70B and LLaVA-1.5 13B).
@inproceedings{fu2024isobench,
title={{I}so{B}ench: Benchmarking Multimodal Foundation Models on Isomorphic Representations},
author={Deqing Fu and Ruohao Guo and Ghazal Khalighinejad and Ollie Liu and Bhuwan Dhingra and Dani Yogatama and Robin Jia and Willie Neiswanger},
booktitle={First Conference on Language Modeling (COLM)},
year={2024},
note={First four authors contributed equally.}
}