June 24, 2024
BEIJING – Artificial intelligence performed strongly in Chinese language and literature and in English, but scored poorly in mathematics, according to a study that used a range of chatbot tools to generate answers to this year's national college entrance exam, or gaokao.
Researchers from the Shanghai Artificial Intelligence Laboratory had six open-source AI models, as well as GPT-4o, the latest version released by leading company OpenAI, take the test that most Chinese high school students must sit to gain admission to domestic universities.
Results released by the laboratory on Wednesday show that the AI test-takers achieved an average accuracy rate of 67 percent in Chinese language and literature and 81 percent in English. In mathematics, however, they answered only 36 percent of the questions correctly.
The top scorer was domestic company Alibaba’s latest multilingual language model, known as Qwen2-72B, which got about 72 percent of the questions right, followed by GPT-4o and a model launched by the Shanghai Artificial Intelligence Laboratory itself on June 4.
Researchers said the exams include not only multiple-choice, fill-in-the-blank and single-answer questions, but also open-response questions, such as writing a short essay on a given theme. Each answer sheet was reviewed by at least three tutors, who were not told the special identity of the test-takers until they had finished grading.
Graders commented that the AI tools were better at comprehending Chinese text written in a contemporary style but struggled to understand classical Chinese passages. Few of them were able to use techniques such as quoting adages when writing essays.
“On the math test, their written responses tended to be disorganized and confusing, and an answer could be correct despite errors in the working. They also exhibited a strong ability to memorize formulas but were unable to apply them swiftly to problem-solving,” the graders said.
AI participants also had mediocre results in the preliminary round of the 2024 Alibaba Global Mathematics Competition. Organizers said this month that the average score of the more than 500 AI teams was 18 out of 120, and that the highest AI score was only 34, compared with the top human score of 113.
Cao Sanxing, deputy dean of Communication University of China's Institute for Internet Information Research, said the AI models' poor performance in math does not necessarily signal weak reasoning and calculation capabilities.
“At present, AI training on math problems is not the sector's primary focus, and the majority of resources have been devoted to feeding human language materials into AI models, hence the higher scores in Chinese and English,” Cao said.
Despite AI's high marks in language-related subjects, Cao said AI-generated content still contains obvious flaws, such as contradictory statements, and lacks deep thinking.
Xu Yi, a graduate student at Renmin University of China's Gaoling School of Artificial Intelligence, said that AI's biggest strength at present is summarizing information by analyzing vast amounts of data, which explains its outstanding performance in generating text.
“However, AI is less capable of logical thinking or creating completely novel content,” he added.
Xiong Bingqi, director of the 21st Century Education Research Institute, also attributed the lower math score to a shortage of math-focused training for AI models.
“In the meantime, the emergence of AI shows that it is now important for students to not only memorize knowledge, but also learn to innovate and foster critical thinking abilities,” he said.