Frequentist multi-armed bandits for complex reward models

Haoran Chen; Weibing Deng; Keqin Liu; Ting Wu

doi:10.1117/12.2627482

22 April 2022 Frequentist multi-armed bandits for complex reward models

Haoran Chen, Weibing Deng, Keqin Liu, Ting Wu

Proceedings Volume 12163, International Conference on Statistics, Applied Mathematics, and Computing Science (CSAMCS 2021); 121631P (2022) https://doi.org/10.1117/12.2627482
Event: International Conference on Statistics, Applied Mathematics, and Computing Science (CSAMCS 2021), 2021, Nanjing, China

Abstract

In the classical Multi-Armed Bandit (MAB) problem, a player selects one out of a set of arms to play at each time, without knowing the reward models. At each time, the player selects one arm to play, aiming to maximize the total expected reward over 𝑇 times. The regret of an algorithm is the expected total loss after 𝑇 steps compared to the ideal scenario of knowing reward models. When the distributions of the arm reward are heavy-tailed, it is difficult to learn which arm has the best reward. In this paper, we introduce an algorithm based on the idea of Upper Confidence Bound (UCB) and prove that the algorithm achieves a sublinear growth of regret for heavy-tailed reward distributions. Furthermore, we consider MAB with gap periods as a dynamic model requiring that the arm will get into a gap period without offering reward immediately after being played. This model finds a broad application area in Internet advertising. Clearly the player should avoid to choose an arm until it gets out of the gap period. We extend the algorithm framework of Deterministic Sequencing of Exploration and Exploitation (DSEE) to the MAB model with gap periods, with regret reaching a growth of optimal order 𝑂ሺlog 𝑇ሻ for light-tailed distributions and a sublinear growth the for the heavy-tailed.

Citation Download Citation

Haoran Chen, Weibing Deng, Keqin Liu, and Ting Wu "Frequentist multi-armed bandits for complex reward models", Proc. SPIE 12163, International Conference on Statistics, Applied Mathematics, and Computing Science (CSAMCS 2021), 121631P (22 April 2022); https://doi.org/10.1117/12.2627482

ACCESS THE FULL ARTICLE

INSTITUTIONAL
Select your institution to access the SPIE Digital Library.

SELECT YOUR INSTITUTION

PERSONAL
Sign in with your SPIE account to access your personal subscriptions or to use specific features such as save to my library, sign up for alerts, save searches, etc.

PERSONAL SIGN IN

No SPIE Account? Create one

PURCHASE THIS CONTENT

SUBSCRIBE TO DIGITAL LIBRARY

50 downloads per 1-year subscription

Members: $195

Non-members: $335 ADD TO CART

25 downloads per 1 - year subscription

Members: $145

Non-members: $250 ADD TO CART

PURCHASE SINGLE ARTICLE

Includes PDF, HTML & Video, when available