Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language


View a PDF of the paper titled Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language, by Anthony Costarelli and 2 other authors


Abstract: As Large Language Models (LLMs) become increasingly integrated into our daily lives, the potential for harm from deceptive behavior underscores the need to faithfully interpret their decision-making. While traditional probing methods have shown some effectiveness, they remain best suited to narrowly scoped tasks, and more comprehensive explanations are still needed. To this end, we investigate meta-models: an architecture in which a "meta-model" takes activations from an "input-model" and answers natural language questions about the input-model's behaviors. We evaluate the meta-model's ability to generalize by training it on selected task types and assessing its out-of-distribution performance in deceptive scenarios. Our findings show that meta-models generalize well to out-of-distribution tasks and point towards opportunities for future research in this area.
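The architecture described above can be sketched in a few lines: activations extracted from the input-model are projected into the meta-model's embedding space and concatenated with the embedded question tokens, so the meta-model can attend over both. This is an illustrative sketch only; all names, dimensions, and the random "embeddings" are assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D_INPUT = 16   # hidden size of the input-model (assumed)
D_META = 8     # embedding size of the meta-model (assumed)

def input_model_activations(n_tokens: int) -> np.ndarray:
    """Stand-in for hidden states extracted from the input-model."""
    return rng.normal(size=(n_tokens, D_INPUT))

# Learned linear projection mapping input-model activations into
# the meta-model's embedding space (randomly initialized here).
W_proj = rng.normal(size=(D_INPUT, D_META))

def embed_question(question: str) -> np.ndarray:
    """Toy whitespace-token embedding for the natural-language question."""
    tokens = question.split()
    return rng.normal(size=(len(tokens), D_META))

def build_meta_model_input(activations: np.ndarray, question: str) -> np.ndarray:
    """Concatenate projected activations with question-token embeddings."""
    projected = activations @ W_proj          # (n_tokens, D_META)
    q_embeds = embed_question(question)       # (q_tokens, D_META)
    # The meta-model would attend over [projected activations; question tokens]
    # and generate a natural-language answer about the input-model's behavior.
    return np.concatenate([projected, q_embeds], axis=0)

acts = input_model_activations(n_tokens=4)
seq = build_meta_model_input(acts, "Is the input-model being deceptive?")
print(seq.shape)  # 4 activation rows + 5 question tokens -> (9, 8)
```

In practice the projection would be trained jointly with (or into) a pretrained meta-model, and the activations would come from chosen layers of the input-model rather than being sampled.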

Submission history

From: Mat Allen [view email]
[v1]
Thu, 3 Oct 2024 13:25:15 UTC (85 KB)
[v2]
Sat, 5 Oct 2024 19:06:07 UTC (85 KB)


