Posterior Sampling for Reinforcement Learning

The dataset used in the paper is a randomly generated finite-horizon Markov decision process (MDP) with state space S, action space A, and horizon τ.
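
As a rough illustration of what such an environment might look like, the sketch below samples a small tabular MDP. The description above does not say how the MDP is generated, so the sampling distributions are assumptions: Dirichlet(1, ..., 1) transition rows and uniform [0, 1] rewards. The function name `sample_random_mdp` and its parameters are hypothetical.

```python
import numpy as np


def sample_random_mdp(n_states, n_actions, horizon, seed=0):
    """Sample a tabular finite-horizon MDP (illustrative sketch only).

    Assumptions not taken from the source: transition rows are drawn from
    a flat Dirichlet distribution and mean rewards from Uniform[0, 1].
    """
    rng = np.random.default_rng(seed)
    # P[s, a] is a probability vector over next states, shape (S, A, S).
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    # R[s, a] is the mean reward for taking action a in state s, shape (S, A).
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    return {"P": P, "R": R, "horizon": horizon}


# Hypothetical sizes: |S| = 5, |A| = 3, tau = 10.
mdp = sample_random_mdp(n_states=5, n_actions=3, horizon=10)
```

Here `mdp["P"][s, a]` gives the next-state distribution for state `s` and action `a`, and each episode would run for `mdp["horizon"]` steps.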

BibTeX: