Stochastic Principal-Agent Problems: Efficient Computation and Learning



Authors: Jiarui Gan and 3 other authors


Abstract: We introduce a stochastic principal-agent model. A principal and an agent interact in a stochastic environment, each privy to observations about the state that are not available to the other. The principal has the power of commitment, both to elicit information from the agent and to provide signals about her own information. The players communicate with each other and then select actions independently. Each of them receives a payoff based on the state and their joint action, and the environment transitions to a new state. The interaction continues over a finite time horizon, and both players are far-sighted, aiming to maximize their total payoffs over that horizon. The model encompasses, as special cases, extensive-form games (EFGs) and stochastic games of incomplete information, partially observable Markov decision processes (POMDPs), and other forms of sequential principal-agent interactions, including Bayesian persuasion and automated mechanism design problems.
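To make the setup concrete, the following is a minimal sketch of how such a finite-horizon environment could be encoded. It is not taken from the paper, and all names (StochasticPAGame and its fields) are hypothetical; it only records the ingredients named in the abstract: private observations for each player, payoffs that depend on the state and the joint action, and a stochastic transition.

```python
# Illustrative sketch only (not from the paper); all names are hypothetical.
# Each player gets a private observation of the state, payoffs depend on the
# state and the joint action, and the state then transitions stochastically.
from dataclasses import dataclass
from typing import Dict, Tuple

State, Obs, Action = int, int, int

@dataclass
class StochasticPAGame:
    horizon: int
    # private observation of each player, as a function of the state
    principal_obs: Dict[State, Obs]
    agent_obs: Dict[State, Obs]
    # stage payoffs u(s, a_principal, a_agent)
    principal_payoff: Dict[Tuple[State, Action, Action], float]
    agent_payoff: Dict[Tuple[State, Action, Action], float]
    # transition kernel: (s, a_principal, a_agent) -> {next state: probability}
    transition: Dict[Tuple[State, Action, Action], Dict[State, float]]
```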

We consider both the computation and learning of the principal’s optimal policy. Since the general problem, which subsumes POMDPs, is intractable, we explore algorithmic solutions under hindsight observability, where the state and the interaction history are revealed at the end of each step. Though the problem becomes more amenable under this condition, the number of possible histories remains exponential in the length of the time horizon, making approaches for EFG-based models infeasible. We present an efficient algorithm based on the inducible value sets. The algorithm computes an $\epsilon$-approximate optimal policy in time polynomial in $1/\epsilon$. Additionally, we show an efficient learning algorithm for an episodic reinforcement learning setting where the transition probabilities are unknown. The algorithm guarantees sublinear regret $\tilde{O}(T^{2/3})$ for both players over $T$ episodes.
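As a rough illustration of why discretizing value sets gives a runtime polynomial in $1/\epsilon$, here is a toy backward-induction sketch. It is not the paper's algorithm: it assumes deterministic transitions, no private observations, and no incentive constraints, and simply tabulates, per state, the best principal value attainable for each agent continuation value snapped to an $\epsilon$-grid.

```python
# Toy sketch, loosely in the spirit of value-set backward induction; NOT the
# paper's algorithm. Assumes deterministic transitions, no private observations,
# and no incentive constraints, purely to show the eps-grid idea.
from collections import defaultdict

def value_set_dp(horizon, states, joint_actions, payoff_p, payoff_a, trans, eps=0.1):
    """Return, for each state, a map {agent value on eps-grid: best principal value}."""
    F = {s: {0.0: 0.0} for s in states}   # after the last step there is no future payoff
    for _ in range(horizon):              # backward induction over the time horizon
        F_new = {}
        for s in states:
            best = defaultdict(lambda: float("-inf"))
            for (ap, aa) in joint_actions:
                s_next = trans[(s, ap, aa)]
                for va_next, vp_next in F[s_next].items():
                    va = payoff_a[(s, ap, aa)] + va_next
                    vp = payoff_p[(s, ap, aa)] + vp_next
                    key = round(va / eps) * eps   # snap agent value to the eps-grid
                    if vp > best[key]:
                        best[key] = vp
            F_new[s] = dict(best)
        F = F_new
    return F
```

With bounded stage payoffs, each per-state table holds on the order of (horizon)/$\epsilon$ grid points, so every backward step touches only polynomially many (state, action, grid point) triples; this is the intuition behind the $1/\epsilon$ dependence in the runtime.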

Submission history

From: Jiarui Gan
[v1]
Tue, 6 Jun 2023 16:20:44 UTC (21 KB)
[v2]
Sun, 17 Dec 2023 13:34:46 UTC (23 KB)
[v3]
Thu, 12 Sep 2024 10:22:10 UTC (30 KB)


