Open FlagEval-VLM Leaderboard
Welcome to the Open FlagEval-VLM Leaderboard! The Open FlagEval-VLM Leaderboard is designed to track, rank, and evaluate open Visual Large Language Models (VLMs). It is powered by the FlagEval platform, which provides the compute resources and the runtime environment. Based on the open-source datasets it integrates, the leaderboard organizes model capabilities into six dimensions: Mathematics, Visual, Chart, General, Text, and Chinese. Together these form the evaluation suite.
| T | Model | Average ⬆️ | CMMMU | MMMU | OCRBench | MMMU_Pro_standard | MMMU_Pro_vision | MathVision | CII-Bench | Blink | Type | Precision | #Params (B) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Baseline | 92.75 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | baseline | N/A | |
| 🟢 | gemini-2.0-pro-exp-02-05 | 60.35 | 61.11 | 62.89 | 86.19 | 45.9 | 40.54 | 53.79 | 67.72 | 64.7 | pretrained | float16 | 0 |
| 🟢 | Gemini-1.5-Pro | 57 | 50 | 61.78 | 80.4 | 47.8 | 47.7 | 44.64 | 59.61 | 64.11 | pretrained | float16 | 0 |
| 🟢 | claude-3-7-sonnet-20250219 | 56.98 | 49.11 | 52.33 | 79.82 | 38.61 | 68.03 | 45.38 | 59.26 | 63.34 | pretrained | float16 | 0 |
| 🟢 | Doubao-Pro-Vision-32k-241028 | 56.98 | 61.89 | 62.33 | 82.6 | 44.59 | 40.06 | 35.56 | 67.97 | 60.81 | pretrained | float16 | 0 |
| 🟢 | Claude-3.5-Sonnet-20241022 | 55.84 | 50.78 | 56.11 | 79.3 | 42.14 | 56.65 | 38.3 | 63.07 | 60.34 | pretrained | float16 | 0 |
| 🟢 | GPT-4o-20241120 | 54.3 | 48.67 | 60.4 | 80.4 | 43.22 | 45.4 | 29.61 | 61.05 | 65.65 | pretrained | float16 | 0 |
| 🟢 | GPT-4o-20240806 | 53.04 | 50.78 | 57.22 | 80.4 | 37.17 | 46.53 | 28.79 | 59.22 | 64.19 | pretrained | float16 | 0 |
| 🟢 | Qwen2-VL-72B-Instruct | 53.01 | 52.78 | 60.89 | 83.2 | 41.33 | 34.41 | 26.35 | 67.84 | 57.29 | pretrained | float16 | 0 |
| 🟢 | Step-1V-32k | 52.76 | 47.33 | 52.44 | 84.8 | 35.66 | 59.08 | 25.86 | 58.82 | 58.13 | pretrained | float16 | 0 |
| 🟢 | Qwen-VL-Max | 50.56 | 49.94 | 56.89 | 84.57 | 39.25 | 31.79 | 26.88 | 58.76 | 56.4 | pretrained | float16 | 0 |
| 🟢 | Molmo-72B-0924 | 49.6 | 48.33 | 51.89 | 73.3 | 36.65 | 60.69 | 24.38 | 52.55 | 49.03 | pretrained | float16 | 0 |
| 🟢 | Yi-Vision | 49.3 | 51.44 | 57.64 | 81.8 | 36.67 | 27.47 | 22.19 | 54.51 | 62.66 | pretrained | float16 | 0 |
| 🟢 | LLaVA-Onevision-72B | 48.4 | 47.78 | 56 | 74.4 | 37.21 | 31.94 | 25.1 | 57.78 | 56.95 | pretrained | float16 | 0 |
| 🟢 | NVLM-D-72B | 48.34 | 50.22 | 58.22 | 79.5 | 36.84 | 38.73 | 20.26 | 55.42 | 47.5 | pretrained | float16 | 0 |
| 🟢 | GLM-4V-Plus | 46.79 | 43.56 | 54.44 | 81.66 | 37.19 | 23.47 | 17.66 | 60.86 | 55.44 | pretrained | float16 | 0 |
| 🟢 | Aria | 45.28 | 43.56 | 48.67 | 71.9 | 32.08 | 52.02 | 15.56 | 46.27 | 52.18 | pretrained | float16 | 0 |
| 🟢 | Claude3-Opus-20240229 | 44.58 | 40.44 | 47 | 70 | 30.46 | 52 | 24.61 | 47.71 | 44.42 | pretrained | float16 | 0 |
| 🟢 | Llama-3.2-90B-Vision-Instruct | 44.43 | 41.78 | 54.67 | 71.1 | 38.09 | 23.58 | 21.51 | 55.82 | 48.87 | pretrained | float16 | 0 |
| 🟢 | InternVL2-Llama3-76B | 44.26 | 42 | 55.89 | 77.7 | 36.3 | 13.82 | 17.07 | 53.99 | 57.34 | pretrained | float16 | 0 |
| 🟢 | Gemini-1.5-Flash | 44.13 | 42.89 | 48.78 | 74.3 | 31.97 | 33.02 | 21.15 | 45.23 | 55.71 | pretrained | float16 | 0 |
| 🟢 | InternVL2-8B | 43.86 | 42.56 | 47.56 | 74.2 | 31.27 | 35.55 | 19.77 | 51.9 | 48.03 | pretrained | float16 | 0 |
| 🟢 | Qwen2-VL-7B-Instruct | 43.55 | 44.78 | 50.44 | 82.9 | 33.93 | 17.69 | 17.34 | 51.24 | 50.08 | pretrained | float16 | 0 |
| 🟢 | Pixtral-12B-2409 | 43.37 | 35.89 | 48.67 | 68.1 | 31.5 | 57.51 | 21.55 | 31.63 | 52.13 | pretrained | float16 | 0 |
| 🟢 | GPT-4o-mini-20240718 | 43.32 | 39.44 | 49.78 | 75 | 31.39 | 23.35 | 26.95 | 45.75 | 54.87 | pretrained | float16 | 0 |
| 🟢 | yi.daiteng01 | 42.91 | 39.56 | 44.33 | 68.1 | 30.77 | 52.69 | 15.2 | 45.36 | 47.26 | pretrained | float16 | 0 |
| 🟢 | Molmo-7B-D | 42.07 | 39.78 | 43 | 71.6 | 26.07 | 54.97 | 17.43 | 40.13 | 43.56 | pretrained | float16 | 0 |
| 🟢 | MiniCPM-V-2.6 | 41.02 | 38.89 | 45.11 | 80.6 | 28.38 | 23.01 | 15.1 | 47.58 | 49.5 | pretrained | float16 | 0 |
| 🟢 | LLaVA-OneVision-7B | 36.75 | 37.11 | 45.33 | 60.5 | 28.67 | 13.82 | 16.68 | 42.88 | 49.03 | pretrained | float16 | 0 |
| 🟢 | Qwen2-VL-2B-Instruct | 35.69 | 33.89 | 41.44 | 75.5 | 26.82 | 13.58 | 14.34 | 39.48 | 40.45 | pretrained | float16 | 0 |
| 🟢 | Qwen/Qwen2-VL-2B-Instruct | 35.31 | 35.89 | 39.67 | 77.2 | 27.11 | 9.83 | 14.77 | 39.22 | 38.82 | pretrained | float16 | 2.21 |
| 🟢 | Phi-3.5-Vision-Instruct | 34.63 | 28.44 | 44 | 60.9 | 24.28 | 11.5 | 14.47 | 36.08 | 57.39 | pretrained | float16 | 0 |
| 🟢 | Idefics3-8B-Llama3 | 34.55 | 33.89 | 42.22 | 55.3 | 27.86 | 13.53 | 16.25 | 39.22 | 48.13 | pretrained | float16 | 0 |
| 🟢 | XGen-MM-Instruct-Interleave-v1.5 | 33.29 | 27.56 | 40 | 55.5 | 25.49 | 12.66 | 20.56 | 36.99 | 47.55 | pretrained | float16 | 0 |
| 🟢 | InternVL2-2B | 32.25 | 29.22 | 32.89 | 71.4 | 20.4 | 10.81 | 14.01 | 38.95 | 40.35 | pretrained | float16 | 0 |
| 🟢 | deepseek-ai/Janus-Pro-7B | 31.19 | 32.33 | 36.78 | 60.9 | 23.76 | 12.37 | 13.98 | 27.19 | 42.24 | pretrained | float16 | 0 |
| 🟢 | Llama-3.2-11B-Vision-Instruct | 29.53 | 28.89 | 38.33 | 62.2 | 26.53 | 33.93 | 16.71 | 1.44 | 28.2 | pretrained | float16 | 0 |
| 🟢 | LLaVA-OneVision-0.5B | 28.58 | 27 | 33.11 | 58.7 | 17.28 | 11.97 | 13.29 | 28.5 | 38.77 | pretrained | float16 | 0 |
| 🟢 | Mono-InternVL-2B | 28.19 | 28.56 | 28.22 | 69.9 | 16.53 | 10.69 | 12.53 | 23.4 | 35.72 | pretrained | float16 | 0 |
| 🟢 | Janus-1.3B | 25.65 | 25.44 | 30 | 49.2 | 15.09 | 10.75 | 14.21 | 21.96 | 38.56 | pretrained | float16 | 0 |
The Goal of FlagEval-VLM Leaderboard
Thank you for taking part in the evaluation. Going forward, we will keep improving the FlagEval-VLM Leaderboard and maintaining an open ecosystem. Developers are welcome to join the discussion on evaluation methods, tools, and datasets, so that together we can build a more scientific and fair leaderboard.
Context
The FlagEval-VLM Leaderboard is a leaderboard for visual large language models. We hope it fosters a more open ecosystem in which VLM developers can participate and contribute to the progress of the field. To keep comparisons fair, all models are evaluated on the FlagEval platform using standardized GPUs and a unified runtime environment.
How it works
We evaluate models on 9 key benchmarks using FlagEvalMM (https://github.com/flageval-baai/FlagEvalMM), an open-source evaluation framework designed to comprehensively assess multimodal models. It provides a standardized way to evaluate models that work with multiple modalities (text, images, video) across a variety of tasks and metrics.
- ChartQA - a large-scale benchmark covering 9.6K manually written questions and 23.1K questions generated from manually written chart summaries.
- Blink - a benchmark containing 14 visual perception tasks that can be solved by humans “within a blink”.
- CMMU - a benchmark for Chinese multi-modal multi-type question understanding and reasoning.
- CMMMU - a new Chinese Massive Multi-discipline Multimodal Understanding benchmark designed to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context.
- MMMU - a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.
- MMMU_Pro (standard & vision) - a more robust multi-discipline multimodal understanding benchmark.
- OCRBench - a comprehensive evaluation benchmark designed to assess the OCR capabilities of Large Multimodal Models.
- MathVision - a meticulously curated collection of 3,040 high-quality mathematical problems with visual contexts sourced from real math competitions.
- CII-Bench - a new benchmark measuring the higher-order perceptual, reasoning and comprehension abilities of MLLMs when presented with complex Chinese implication images.
For all of these evaluations, a higher score is better. Accuracy is used as the evaluation metric, and it is primarily calculated according to the methodology outlined in each benchmark's original paper.
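To make the scoring concrete, here is a small sketch in Python. The exact-match accuracy helper and the assumption that a model's leaderboard Average is the unweighted mean of its eight benchmark scores are illustrative only; the official scoring is implemented inside FlagEvalMM and follows each benchmark's own methodology.

```python
# Illustration only: the official scoring is implemented inside FlagEvalMM
# and follows each benchmark's original methodology.
from statistics import mean

def exact_match_accuracy(predictions, references):
    """Percentage of predictions that exactly match the reference answers."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

print(exact_match_accuracy(["B", "C", "A"], ["B", "C", "D"]))  # ~66.67

# Per-benchmark scores taken from the gemini-2.0-pro-exp-02-05 row above.
scores = {
    "CMMMU": 61.11, "MMMU": 62.89, "OCRBench": 86.19,
    "MMMU_Pro_standard": 45.9, "MMMU_Pro_vision": 40.54,
    "MathVision": 53.79, "CII-Bench": 67.72, "Blink": 64.7,
}
# Assumption: the leaderboard Average is the unweighted mean of the eight
# benchmark scores, which reproduces the ~60.35 shown in the table.
print(round(mean(scores.values()), 2))
```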
Details and logs
You can find:
- detailed numerical results in the results Hugging Face dataset: https://huggingface.co/datasets/open-cn-llm-leaderboard/vlm_results
- community queries and running status in the requests Hugging Face dataset: https://huggingface.co/datasets/open-cn-llm-leaderboard/vlm_requests
Reproducibility
An example of evaluating LLaVA with vLLM as the backend:
```bash
flagevalmm --tasks tasks/mmmu/mmmu_val.py \
    --exec model_zoo/vlm/api_model/model_adapter.py \
    --model llava-hf/llava-onevision-qwen2-7b-ov-chat-hf \
    --num-workers 8 \
    --output-dir ./results/llava-onevision-qwen2-7b-ov-chat-hf \
    --backend vllm \
    --extra-args "--limit-mm-per-prompt image=10 --max-model-len 32768"
```
Useful links
Evaluation Queue for the FlagEval VLM Leaderboard
Models added here will be automatically evaluated on the FlagEval cluster.
Currently, we offer two ways to evaluate a model: API calls and private deployments.
- If you choose to evaluate via an API call, you need to provide the model's API endpoint, its name, and the corresponding API key.
- If you choose to evaluate an open-source model directly through Hugging Face, you do not need to fill in the Model online api url and Model online api key fields.
Open API model Integration Documentation
For models accessed via API calls (such as the OpenAI API, Anthropic API, etc.), the integration process is straightforward and only requires providing the necessary configuration information:
- model_name: Name of the model to use
- api_key: API access key
- api_base: Base URL for the API service
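As an illustration, the configuration for an API-accessed model reduces to the three fields above. The dictionary below is a hypothetical sketch; its structure and all values are placeholders, not the exact submission format.

```python
# Hypothetical sketch of the information an API-based submission needs.
# All values are placeholders; only the three fields listed above are required.
api_model_config = {
    "model_name": "my-vlm-2025-01",            # name of the model to call
    "api_key": "sk-...",                       # API access key (keep it secret)
    "api_base": "https://api.example.com/v1",  # base URL of the API service
}
```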
Adding a Custom Model to the Platform
This guide explains how to integrate your custom model into the platform by implementing a model adapter and a run.sh script. We use the Qwen-VL implementation as a reference example.
Overview
To add your custom model, you need to:
- Create a custom dataset class
- Implement a model adapter class
- Set up the initialization and inference pipeline
Step-by-Step Implementation
Here is an example: model_adapter.py
1. Create a Custom Dataset Class for Preprocessing
Inherit from ServerDataset to handle data loading:
```python
# model_adapter.py
class CustomDataset(ServerDataset):
    def __getitem__(self, index):
        data = self.get_data(index)
        question_id = data["question_id"]
        img_path = data["img_path"]
        qs = data["question"]

        # Extract the <imageN> markers from the question and keep only the
        # image paths that the question actually references.
        qs, idx = process_images_symbol(qs)
        idx = set(idx)
        img_path_idx = []
        for i in idx:
            if i < len(img_path):
                img_path_idx.append(img_path[i])
            else:
                print("[warning] image index out of range")
        return question_id, img_path_idx, qs
```
The function get_data returns a structure like this:

```python
{
    "img_path": "A list where each element is an absolute path to an image that can be read directly using PIL, cv2, etc.",
    "question": "A string containing the question, where image positions are marked with <image1> <image2>",
    "question_id": "question_id",
    "type": "A string indicating the type of question",
}
```
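The process_images_symbol helper used above comes from the reference Qwen-VL adapter; it returns the (possibly rewritten) question together with the indices of the <imageN> markers it found. A minimal stand-in, written as our assumption of that behavior rather than the framework's actual implementation, might look like this:

```python
import re

def process_images_symbol(question: str, placeholder: str = "<image>"):
    """Hypothetical stand-in: collect the zero-based indices of <image1>,
    <image2>, ... markers and replace each marker with a generic placeholder."""
    indices = [int(n) - 1 for n in re.findall(r"<image(\d+)>", question)]
    cleaned = re.sub(r"<image\d+>", placeholder, question)
    return cleaned, indices
```

With this stand-in, a question such as "Compare <image1> and <image2>." would yield the cleaned question plus the indices [0, 1], which is how CustomDataset selects the matching image paths.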
2. Implement Model Adapter
Inherit from BaseModelAdapter and implement the required methods:
- model_init: is responsible for model initialization and serves as the entry point for model loading and setup.
- run_one_task: implements the inference pipeline, handling data processing and result generation for a single evaluation task.
```python
# model_adapter.py
class ModelAdapter(BaseModelAdapter):
    def model_init(self, task_info: Dict):
        ckpt_path = task_info["model_path"]
        # Initialize the model and processor here.
        # Load your pre-trained model and any required processing tools
        # using the provided checkpoint path.

    def run_one_task(self, task_name: str, meta_info: Dict[str, Any]):
        results = []
        data_loader = self.create_data_loader(
            CustomDataset, task_name, batch_size=1, num_workers=0
        )
        for question_id, img_path, qs in data_loader:
            # Perform model inference here.
            # Use the model to generate the `answer` variable for the given
            # inputs (question_id, image paths, question).
            results.append({"question_id": question_id, "answer": answer})

        # Save the inference results. Use the provided meta_info and rank
        # parameters to manage result storage as needed.
        self.save_result(results, meta_info, rank=rank)
```
Note:
- results is a list of dictionaries.
- Each dictionary must contain two keys:
  - question_id: identifies the specific question.
  - answer: contains the model's prediction/output.
- After collecting all results, they are saved using save_result().
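For example, the results list passed to save_result() might look like the following (the question IDs and answers are made-up placeholders):

```python
results = [
    {"question_id": "validation_Math_0001", "answer": "B"},
    {"question_id": "validation_Art_0042", "answer": "The painting depicts a harbor at dusk."},
]
```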
3. Launch Script (run.sh)
run.sh is the entry script for launching model evaluation; it sets the required environment variables and starts the evaluation program.
```bash
#!/bin/bash
# run.sh: invoked with the server IP and port as the first two arguments;
# any remaining arguments are forwarded to model_adapter.py.
current_file="$0"
current_dir="$(dirname "$current_file")"
SERVER_IP=$1
SERVER_PORT=$2
PYTHONPATH=$current_dir:$PYTHONPATH python "$current_dir/model_adapter.py" \
    --server_ip "$SERVER_IP" --server_port "$SERVER_PORT" "${@:3}"
```