The dataset used in the paper is a concave bandit problem, in which the learner aims to select satisficing arms (arms whose mean reward exceeds a given threshold) as frequently as possible.
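To make the notion of a satisficing arm concrete, here is a minimal Python sketch. It is not taken from the paper: the arm means, the threshold, the pull sequence, and the helper names `satisficing_arms` and `satisficing_rate` are all hypothetical, chosen only to illustrate the definition above.

```python
def satisficing_arms(means, threshold):
    """Return the indices of arms whose mean reward exceeds the threshold."""
    return [i for i, mu in enumerate(means) if mu > threshold]


def satisficing_rate(pulls, means, threshold):
    """Return the fraction of pulls that selected a satisficing arm."""
    if not pulls:
        return 0.0
    good = set(satisficing_arms(means, threshold))
    return sum(1 for arm in pulls if arm in good) / len(pulls)


# Hypothetical example: mean rewards that rise and then fall across the arms,
# a hypothetical threshold, and a hypothetical sequence of pulled arm indices.
means = [0.2, 0.5, 0.7, 0.6, 0.3]
threshold = 0.55
pulls = [2, 3, 0, 2, 1, 3, 2, 2]

print(satisficing_arms(means, threshold))
print(satisficing_rate(pulls, means, threshold))
```

Under these assumed values, only the arms at indices 2 and 3 have means above the threshold, so a learner does well by this criterion when most of its pulls land on those two arms.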
BibTeX: