Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation

Authors: Fengdi Che and 8 other authors

Abstract: We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision processes. Notably, using only a target network or an over-parameterized model does not provide such a convergence guarantee. Additionally, we extend our results to learning with truncated trajectories, showing that convergence is achievable for all tasks with minor modifications, akin to value truncation for the final states in trajectories. Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird's counterexample and a Four-room task. Furthermore, we explore the control setting, demonstrating that similar convergence conditions apply to Q-learning.
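
To make the setup concrete, here is a minimal sketch of TD(0) prediction with linear function approximation and a periodically synced target network. This is not the authors' algorithm or experimental code; the toy dynamics, feature matrix, dimensions, and hyperparameters are all illustrative assumptions, and over-parameterization is mimicked simply by using more features than states.

```python
import numpy as np

# Minimal sketch (illustrative assumptions only): TD(0) prediction with
# linear function approximation and a periodically synced target network.
# feat_dim > n_states mimics an over-parameterized feature representation.

rng = np.random.default_rng(0)

n_states = 5
feat_dim = 8                                  # more features than states
phi = rng.normal(size=(n_states, feat_dim))   # feature matrix, one row per state

gamma = 0.9        # discount factor
alpha = 0.05       # step size
sync_every = 50    # target-network update period

w = np.zeros(feat_dim)   # online weights
w_target = w.copy()      # target-network weights, frozen between syncs

def step(s):
    """Toy MDP dynamics: uniformly random next state, reward 1 in the last state."""
    s_next = rng.integers(n_states)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return r, s_next

s = 0
for t in range(5000):
    r, s_next = step(s)
    # The bootstrap target uses the frozen target-network weights,
    # not the online weights.
    td_target = r + gamma * phi[s_next] @ w_target
    td_error = td_target - phi[s] @ w
    w += alpha * td_error * phi[s]
    if (t + 1) % sync_every == 0:
        w_target = w.copy()   # periodic hard sync of the target network
    s = s_next

print("estimated state values:", phi @ w)
```

The key design point the sketch illustrates is that the bootstrap target is computed from a frozen copy of the weights, so the online update behaves like regression toward a fixed target between syncs; the paper's analysis concerns when this stabilizes bootstrapping under off-policy data and over-parameterization.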

Submission history

From: Fengdi Che
[v1] Fri, 31 May 2024 17:36:16 UTC (4,652 KB)
[v2] Fri, 4 Oct 2024 18:04:33 UTC (4,654 KB)


