LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts

AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning


View a PDF of the paper titled LLM Detectors Still Fall Short of Real World: Case of LLM-Generated Short News-Like Posts, by Henrique Da Silva Gameiro and 2 other authors

View PDF

Abstract:With the emergence of widely available powerful LLMs, disinformation generated by large Language Models (LLMs) has become a major concern. Historically, LLM detectors have been touted as a solution, but their effectiveness in the real world is still to be proven. In this paper, we focus on an important setting in information operations — short news-like posts generated by moderately sophisticated attackers.

We demonstrate that existing LLM detectors, whether zero-shot or purpose-trained, are not ready for real-world use in that setting. All tested zero-shot detectors perform inconsistently with prior benchmarks and are highly vulnerable to sampling temperature increase, a trivial attack absent from recent benchmarks. A purpose-trained detector generalizing across LLMs and unseen attacks can be developed, but it fails to generalize to new human-written texts.

We argue that the former indicates domain-specific benchmarking is needed, while the latter suggests a trade-off between the adversarial evasion resilience and overfitting to the reference human text, with both needing evaluation in benchmarks and currently absent. We believe this suggests a re-consideration of current LLM detector benchmarking approaches and provides a dynamically extensible benchmark to allow it (this https URL).

Submission history

From: Andrei Kucharavy [view email]
[v1]
Thu, 5 Sep 2024 06:55:13 UTC (4,449 KB)
[v2]
Fri, 27 Sep 2024 16:04:40 UTC (4,449 KB)



Source link
lol

By stp2y

Leave a Reply

Your email address will not be published. Required fields are marked *

No widgets found. Go to Widget page and add the widget in Offcanvas Sidebar Widget Area.