We evaluate a range of LVLMs, covering both closed- and open-source models. All evaluations are conducted in a zero-shot setting to assess each model's ability to generate accurate answers on our benchmark without fine-tuning or few-shot demonstrations. For every model, we use its default prompt for multiple-choice or open-ended QA, if one is available.
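For concreteness, the sketch below illustrates how a zero-shot query might be assembled for the two settings. The prompt wording and the lettered option format are illustrative assumptions, not the exact templates used by any particular model.

```python
# Illustrative only: a minimal sketch of zero-shot query construction.
# The exact instruction wording is an assumption for illustration.

def build_multi_choice_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice question with lettered options (A, B, C, ...)."""
    lettered = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    )
    return (
        f"{question}\n{lettered}\n"
        "Answer with the letter of the single best option."
    )

def build_open_qa_prompt(question: str) -> str:
    """Open-ended QA: the image and question are passed with no demonstrations."""
    return question
```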
| Model | Size | Date | Safe (%) | Effective (%) | Safe & Effective (%) | Multi-Choice (%) |
|---|---|---|---|---|---|---|
| GPT-4V(ision) | - | 2024-04-29 | 53.29 | 69.46 | 23.35 | 38.92 |
| GPT-4o | - | 2024-05-06 | 50.90 | 95.81 | 46.71 | 41.32 |
| Gemini 1.5 Pro | - | 2024-04-30 | 52.10 | 91.62 | 45.51 | 47.31 |
| LLaVA-1.6-34B | 34B | 2024-04-29 | 40.72 | 95.81 | 37.13 | 52.69 |
| Gemini 1.0 Pro | - | 2024-04-29 | 27.54 | 92.22 | 25.15 | 34.13 |
| LLaVA-1.5-7B | 7.2B | 2024-04-29 | 21.56 | 87.43 | 16.17 | 33.53 |
| LLaVA-1.5-13B | 13.4B | 2024-04-29 | 22.16 | 91.62 | 19.76 | 32.93 |
| Qwen-VL-7B-Chat | 9.6B | 2024-04-29 | 41.32 | 82.63 | 29.94 | 20.96 |
| mPLUG-OWL2 | 8.2B | 2024-04-29 | 22.16 | 90.42 | 17.37 | 28.14 |
| MiniGPT4-v2 | 8B | 2024-04-29 | 41.92 | 81.44 | 32.93 | 27.54 |
| CogVLM | 17B | 2024-04-29 | 22.75 | 91.02 | 20.96 | 27.54 |
| InstructBLIP2-T5-XL | 4B | 2024-04-29 | 8.38 | 51.50 | 1.80 | - |
| InstructBLIP2-T5-XXL | 12B | 2024-04-29 | 11.98 | 51.50 | 4.79 | - |
| InstructBLIP2-7B | 8B | 2024-04-29 | 24.55 | 51.50 | 4.19 | - |
| InstructBLIP2-13B | 14B | 2024-04-29 | 19.76 | 49.10 | 4.19 | - |
| Random Choice | - | 2024-04-29 | - | - | - | 24.95 |
Overall results of different models on the SIUO benchmark. The best-performing model in each category is in bold, and the second best is underlined. Except for the multiple-choice questions, all results are based on manual evaluation.
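As a reading aid for the table, the sketch below shows how the reported percentages could be derived from per-sample judgments: Safe and Effective are the rates of responses judged safe and effective, respectively, Safe & Effective is their conjunction, and Multi-Choice is exact-match accuracy on the chosen option. The record field names (`safe`, `effective`, `predicted`, `answer`) are assumed for illustration and do not reflect the benchmark's actual annotation schema.

```python
# Illustrative sketch: deriving the table's percentages from per-sample labels.
# Field names are assumptions, not the benchmark's annotation schema.

def summarize(records: list[dict]) -> dict[str, float]:
    """Aggregate human judgments into the three open-QA metrics."""
    n = len(records)
    safe = sum(r["safe"] for r in records)                      # judged safe
    effective = sum(r["effective"] for r in records)            # judged effective
    both = sum(r["safe"] and r["effective"] for r in records)   # jointly safe and effective
    return {
        "Safe (%)": 100.0 * safe / n,
        "Effective (%)": 100.0 * effective / n,
        "Safe & Effective (%)": 100.0 * both / n,
    }

def multi_choice_accuracy(records: list[dict]) -> float:
    """Exact-match accuracy of the chosen option letter."""
    correct = sum(r["predicted"] == r["answer"] for r in records)
    return 100.0 * correct / len(records)
```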