
CMMLU consists of multiple-choice questions drawn from Chinese textbooks. It has been used to evaluate Chinese LLMs, including Qwen-72B and Yi-Chat. For simplicity, we conduct the evaluation in a zero-shot setting.
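As a concrete illustration, a zero-shot evaluation of a multiple-choice benchmark like CMMLU starts from a plain prompt with no solved examples. The template and sample question below are illustrative assumptions, not the exact format used in the paper:

```python
# Sketch of zero-shot prompt construction for a CMMLU-style
# four-choice question. The template is an assumption.

def build_zero_shot_prompt(question: str, choices: list[str]) -> str:
    """Format one four-choice question as a zero-shot prompt."""
    labels = "ABCD"
    lines = [f"Question: {question}"]
    for label, choice in zip(labels, choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:")  # the model is expected to continue with A/B/C/D
    return "\n".join(lines)

prompt = build_zero_shot_prompt(
    "Which city is the capital of China?",
    ["Shanghai", "Beijing", "Guangzhou", "Chengdu"],
)
print(prompt)
```

The model's completion after "Answer:" is then matched against the gold label.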

CMMLU is a comprehensive Chinese evaluation benchmark designed specifically to assess the knowledge and reasoning capabilities of language models in a Chinese context.
CMMLU covers 67 topics ranging from elementary subjects to advanced professional levels.
It includes natural sciences that require calculation and reasoning, humanities and social sciences that require knowledge, and Chinese driving rules that require everyday common sense.
In addition, many tasks in CMMLU have China-specific answers that may not be universally applicable in other regions or languages. It is therefore a fully China-centric benchmark for evaluating models in Chinese.

As the capabilities of large language models (LLMs) continue to improve, evaluating their performance has become both more important and more challenging. This paper addresses the problem for Chinese by introducing CMMLU, a comprehensive Chinese benchmark covering natural sciences, social sciences, engineering, and humanities. We conducted a thorough evaluation of more than 20 contemporary multilingual and Chinese LLMs to assess their performance across different subjects and settings. The results show that most existing LLMs struggle to reach even 60% accuracy, the passing score on Chinese exams. This highlights that there is still much room for improvement in LLMs' capabilities. In addition, we conducted extensive experiments to identify factors that affect model performance and to propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models in a Chinese context.

To this end, researchers have created various benchmarks designed to assess the capabilities of different models (Wang et al., 2019a;b; Lin et al., 2022; Zellers et al., 2019; Hendrycks et al., 2021b; Chen et al., 2021). Specifically, Hendrycks et al. (2021a) proposed MMLU, a benchmark covering tasks ranging from elementary mathematics and computer science to management and law, which can be used to comprehensively measure the knowledge embedded in LLMs. Because its multiple-choice format is easy to evaluate and it covers a wide range of subject areas, it has been widely adopted as a standard tool for evaluating the knowledge encoded in LLMs. However, this benchmark is in English, which limits its usefulness for evaluating LLMs in other languages. Although some researchers (OpenAI, 2023) have tried to automatically translate it to evaluate LLMs in other languages, the dataset's inherent bias toward Western (especially American) culture makes it inappropriate, or even unsuitable, for evaluating LLMs across different cultures and languages.

In this article, we propose CMMLU (Figure 1), a comprehensive Chinese assessment suite designed specifically to evaluate the advanced knowledge and reasoning abilities of LLMs in a Chinese language and cultural context. CMMLU covers a wide range of disciplines, including 67 topics spanning elementary to advanced professional levels. It includes disciplines that require computational expertise, such as physics and mathematics, as well as disciplines within the humanities and social sciences. Because of their specific contextual nuances and wording, many of these tasks are not easily translated from other languages. In addition, many tasks in CMMLU have answers specific to China that may not be universally applicable or considered correct in other regions or languages.

We evaluated GPT-4, ChatGPT, and more than 20 advanced open-source multilingual and Chinese LLMs on CMMLU. The results show that most models struggle to achieve 60% accuracy, while random-guessing accuracy is 25%. Notably, GPT-4 reaches an average accuracy of 71%. These findings highlight that LLMs still have considerable room for improvement in their Chinese knowledge and language understanding.
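The 25% random baseline follows directly from the four-choice format: a uniform guesser picks the correct label one time in four. A minimal sketch of the accuracy computation, using simulated labels rather than real CMMLU data:

```python
# Accuracy of exact-match predictions on a four-choice benchmark,
# plus a simulated random-guessing baseline (toy data, not CMMLU).
import random

def accuracy(preds: list[str], golds: list[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

random.seed(0)
n = 100_000
golds = [random.choice("ABCD") for _ in range(n)]
random_preds = [random.choice("ABCD") for _ in range(n)]

print(round(accuracy(random_preds, golds), 2))  # ≈ 0.25
```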

In addition, through extensive experiments, we found that: (1) most existing models do not benefit from chain-of-thought prompts on CMMLU; (2) few-shot examples help base models understand the task and enhance their reasoning ability, but do not help models that have undergone supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF); (3) LLMs perform worse on questions containing negation than on those without, though recently released models mitigate this gap through better pre-training data or fine-tuning; (4) questions with sub-options (Section 4.2) are difficult for all existing LLMs; even GPT-4's accuracy drops by 20% on such questions.


Original paper: https://arxiv.org/html/2306.09212v2
Github: https://github.com/haonan-li/CMMLU
