MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. X Yue, Y Ni, K Zhang, T Zheng, R Liu, G Zhang, S Stevens, D Jiang, ... 🏆 CVPR 2024 (Best Paper Finalist), 2023 | 511 | 2023 |
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. Y Wang, X Ma, G Zhang, Y Ni, A Chandra, S Guo, W Ren, A Arulraj, X He, ... 🏆 NeurIPS D&B 2024 (Spotlight), 2024 | 111* | 2024 |
A Comprehensive Study of Knowledge Editing for Large Language Models. N Zhang, Y Yao, B Tian, P Wang, S Deng, M Wang, Z Xi, S Mao, J Zhang, ... arXiv preprint arXiv:2401.01286, 2024 | 97* | 2024 |
EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models. P Wang, N Zhang, B Tian, Z Xi, Y Yao, Z Xu, M Wang, S Mao, X Wang, ... ACL SDT 2024, 2023 | 79* | 2023 |
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark. X Yue, T Zheng, Y Ni, Y Wang, K Zhang, S Tong, Y Sun, M Yin, B Yu, ... arXiv preprint arXiv:2409.02813, 2024 | 26 | 2024 |
VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation. X He, D Jiang, G Zhang, M Ku, A Soni, S Siu, H Chen, A Chandra, Z Jiang, ... EMNLP Main 2024, 2024 | 19 | 2024 |
GenAI Arena: An Open Evaluation Platform for Generative Models. D Jiang, M Ku, T Li, Y Ni, S Sun, R Fan, W Chen. NeurIPS D&B 2024, 2024 | 9 | 2024 |
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks. J Chen, T Liang, S Siu, Z Wang, K Wang, Y Wang, Y Ni, W Zhu, Z Jiang, ... arXiv preprint arXiv:2410.10563, 2024 | 3 | 2024 |
II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models. Z Liu, F Fang, X Feng, X Du, C Zhang, Z Wang, Y Bai, Q Zhao, L Fan, ... NeurIPS D&B 2024, 2024 | 3 | 2024 |