PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

stp2yJanuary 23, 20250 Comments

AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning

[Submitted on 10 Sep 2024 (v1), last revised 17 Jan 2025 (this version, v2)]

View a PDF of the paper titled PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation, by Ilya Gusev

View PDF
HTML (experimental)

Abstract:We introduce a benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model that assumes a specific character role, an interrogator model that simulates user behavior, and several judge models that evaluate conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of the model capabilities in interactive scenarios.

Submission history

From: Ilya Gusev [view email]
[v1]
Tue, 10 Sep 2024 19:00:44 UTC (314 KB)
[v2]
Fri, 17 Jan 2025 21:11:03 UTC (1,437 KB)

Source link
lol

By stp2y