Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study

stp2y, January 7, 2025

[Submitted on 26 Aug 2024 (v1), last revised 3 Jan 2025 (this version, v4)]

Authors: Liuchang Xu and 8 other authors

Abstract: The emergence of large language models such as ChatGPT, Gemini, and others highlights the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been thoroughly assessed. This study addresses this gap by introducing a new multi-task spatial evaluation dataset designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset includes twelve distinct task types, such as spatial understanding and simple route planning, each with verified and accurate answers. We evaluated multiple models, including OpenAI’s gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI’s glm-4, Anthropic’s claude-3-sonnet-20240229, and MoonShot’s moonshot-v1-8k, using a two-phase testing approach. First, we conducted zero-shot testing. Then, we categorized the dataset by difficulty and performed prompt-tuning tests. Results show that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For instance, the Chain-of-Thought (CoT) strategy increased gpt-4o’s accuracy in simple route planning from 12.4% to 87.5%, while a one-shot strategy improved moonshot-v1-8k’s accuracy in mapping tasks from 10.1% to 76.3%.
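
As a rough illustration of how per-task and overall accuracy figures like those in the abstract could be tallied, here is a minimal Python sketch. It is not the authors' evaluation code: the record format, the task names, and the choice of an unweighted (macro) average across task types are all assumptions made for the example, and the abstract does not say how the 71.3% average was computed.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task_type: accuracy} from (task_type, is_correct) pairs.

    The (task_type, is_correct) record shape is hypothetical.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for task, ok in records:
        totals[task] += 1
        correct[task] += int(ok)
    return {task: correct[task] / totals[task] for task in totals}


def macro_average(accuracies):
    """Unweighted mean over task types (an assumed averaging scheme)."""
    return sum(accuracies.values()) / len(accuracies)


# Toy graded answers; task names are placeholders, not the paper's labels.
records = [
    ("route_planning", True),
    ("route_planning", False),
    ("place_name_recognition", True),
    ("place_name_recognition", True),
]

acc = per_task_accuracy(records)
overall = macro_average(acc)
print(acc, overall)
```

Running the same tally once on zero-shot answers and once on CoT or one-shot answers would give the kind of before/after comparison the abstract reports for individual tasks.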

Submission history

From: Shuo Zhao
[v1] Mon, 26 Aug 2024 17:25:16 UTC (6,000 KB)
[v2] Wed, 28 Aug 2024 13:19:36 UTC (6,000 KB)
[v3] Mon, 2 Sep 2024 11:59:05 UTC (6,002 KB)
[v4] Fri, 3 Jan 2025 03:03:32 UTC (7,550 KB)
