Statistical Efficiency of Distributional Temporal Difference Learning, by Yang Peng and 2 other authors
Abstract: Distributional reinforcement learning (DRL) has achieved empirical success in various domains. One core task in DRL is distributional policy evaluation, which involves estimating the return distribution $\eta^\pi$ for a given policy $\pi$. Distributional temporal difference learning has accordingly been proposed as an extension of temporal difference (TD) learning in classic RL. In the tabular case, \citet{rowland2018analysis} and \citet{rowland2023analysis} proved the asymptotic convergence of two instances of distributional TD, namely categorical temporal difference learning (CTD) and quantile temporal difference learning (QTD), respectively. In this paper, we go a step further and analyze the finite-sample performance of distributional TD. To facilitate theoretical analysis, we propose non-parametric distributional TD learning (NTD). For a $\gamma$-discounted infinite-horizon tabular Markov decision process, we show that NTD needs $\tilde{O}\left(\frac{1}{\varepsilon^{2p}(1-\gamma)^{2p+1}}\right)$ iterations to achieve an $\varepsilon$-optimal estimator with high probability, when the estimation error is measured by the $p$-Wasserstein distance. This sample complexity bound is minimax optimal up to logarithmic factors in the case of the $1$-Wasserstein distance. To achieve this, we establish a novel Freedman's inequality in Hilbert spaces, which would be of independent interest. In addition, we revisit CTD, showing that the same non-asymptotic convergence bounds hold for CTD in the case of the $p$-Wasserstein distance for $p \geq 1$.
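To make the object of study concrete, below is a minimal illustrative sketch of a tabular categorical TD (CTD) update of the kind analyzed in the abstract: each state keeps a categorical estimate of its return distribution on a fixed support, and a sampled transition produces a shifted-and-projected bootstrap target. The support spacing, step size, discount factor, and the `probs`/`atoms` data layout are assumptions made for illustration; this is not the paper's exact algorithm or its NTD variant.

```python
# Minimal sketch of a tabular categorical TD (CTD) step, assuming a fixed,
# equally spaced support z_1 < ... < z_K. Step size and discount are placeholders.
import numpy as np

def categorical_projection(atoms, target_support, target_probs):
    """Project a categorical distribution on target_support onto the fixed
    atoms using the standard triangular-kernel (nearest-neighbor) projection."""
    K = len(atoms)
    delta = atoms[1] - atoms[0]            # assumes equally spaced atoms
    projected = np.zeros(K)
    clipped = np.clip(target_support, atoms[0], atoms[-1])
    pos = (clipped - atoms[0]) / delta     # fractional index of each target atom
    lower = np.floor(pos).astype(int)
    upper = np.minimum(lower + 1, K - 1)
    frac = pos - lower
    # split each target atom's mass between its two nearest fixed atoms
    np.add.at(projected, lower, target_probs * (1.0 - frac))
    np.add.at(projected, upper, target_probs * frac)
    return projected

def ctd_update(probs, s, r, s_next, atoms, gamma=0.9, alpha=0.1):
    """One stochastic CTD update at state s from a sampled transition (s, r, s_next).
    probs[s] holds the current categorical estimate of the return distribution at s."""
    # bootstrap target: shift/scale the next-state support, keep its probabilities
    target_support = r + gamma * atoms
    target = categorical_projection(atoms, target_support, probs[s_next])
    probs[s] = (1.0 - alpha) * probs[s] + alpha * target
    return probs
```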
Submission history
From: Yang Peng
[v1] Sat, 9 Mar 2024 06:19:53 UTC (24 KB)
[v2] Thu, 14 Mar 2024 09:24:51 UTC (36 KB)
[v3] Wed, 23 Oct 2024 07:26:07 UTC (36 KB)