Stochastic MDP

The dataset used in this paper is a stochastic MDP with |S| = 4 states and |A| = 4 actions. One state is designated as the terminal state, and one of the remaining three states is designated as the starting state. The transition probabilities and the reward function are randomly generated.
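
As an illustration of the description above, the following is a minimal sketch of how such an MDP could be constructed. It is not the generation procedure used for this dataset: the function names `make_random_mdp` and `rollout`, the sampling distributions (Dirichlet for transitions, uniform on [0, 1] for rewards), the absorbing zero-reward terminal state, the uniform random policy, and the seed are all assumptions made for the example.

```python
import numpy as np


def make_random_mdp(n_states=4, n_actions=4, seed=0):
    """Build a random stochastic MDP (hypothetical helper, not the dataset's own code)."""
    rng = np.random.default_rng(seed)

    # Assumption: the terminal state is drawn uniformly, and the start state
    # is drawn uniformly from the remaining states.
    terminal = int(rng.integers(n_states))
    non_terminal = [s for s in range(n_states) if s != terminal]
    start = int(rng.choice(non_terminal))

    # Assumption: P[s, a] is a Dirichlet(1, ..., 1) sample over next states,
    # so every row is a valid probability distribution.
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

    # Assumption: the terminal state is absorbing.
    P[terminal] = 0.0
    P[terminal, :, terminal] = 1.0

    # Assumption: rewards R[s, a] are uniform on [0, 1], zero at the terminal state.
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    R[terminal] = 0.0

    return P, R, start, terminal


def rollout(P, R, start, terminal, seed=0, max_steps=50):
    """Sample one episode under a uniform random policy (also an assumption)."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    state, episode = start, []
    for _ in range(max_steps):
        if state == terminal:
            break
        action = int(rng.integers(n_actions))
        next_state = int(rng.choice(n_states, p=P[state, action]))
        episode.append((state, action, float(R[state, action]), next_state))
        state = next_state
    return episode


P, R, start, terminal = make_random_mdp()
episode = rollout(P, R, start, terminal)
```

Here `P` has shape (4, 4, 4) and `R` has shape (4, 4); each element of `episode` is a (state, action, reward, next state) tuple.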

BibTeX: