U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs
Konstantin Chernyshev and 5 other authors
Abstract: The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, focus primarily on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored.
To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of the problems being multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release $\mu$-MATH, a dataset for evaluating the LLMs' capabilities in judging solutions.
The evaluation of general-domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, and an even lower 45% on visual problems. Solution assessment also proves challenging for LLMs, with the best LLM judge attaining an F1-score of 80% on $\mu$-MATH.
Submission history
From: Konstantin Chernyshev
[v1] Wed, 4 Dec 2024 10:44:50 UTC (2,620 KB)
[v2] Fri, 6 Dec 2024 08:29:43 UTC (2,620 KB)