This document discusses multi-armed bandit algorithms, focusing on the Upper Confidence Bound (UCB) method for balancing exploration and exploitation. UCB selects the option with the highest sum of its empirical reward rate and a confidence term that shrinks as the option accumulates impressions; this term encourages exploration of under-sampled, seemingly suboptimal options. The document gives the UCB formula and shows how the algorithm applies to multi-armed bandit problems such as A/B testing several options whose payoff rates are unknown.
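The selection rule described above can be sketched with the standard UCB1 variant, which scores each arm as its empirical mean plus \(\sqrt{2 \ln N / n_i}\) (where \(N\) is the total number of plays and \(n_i\) the plays of arm \(i\)). The function names and the Bernoulli payoff rates below are illustrative assumptions, not taken from the document:

```python
import math
import random

def ucb1_select(counts, rewards):
    """Return the index of the arm with the highest UCB1 score:
    empirical mean + sqrt(2 * ln(total plays) / plays of this arm)."""
    # Play every arm once before the formula is well-defined
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    return max(
        range(len(counts)),
        key=lambda i: rewards[i] / counts[i]
                      + math.sqrt(2.0 * math.log(total) / counts[i]),
    )

def run_bandit(true_rates, steps=5000, seed=0):
    """Simulate Bernoulli arms with the given (hypothetical) payoff rates."""
    rng = random.Random(seed)
    k = len(true_rates)
    counts = [0] * k        # impressions per arm
    rewards = [0.0] * k     # accumulated payoff per arm
    for _ in range(steps):
        arm = ucb1_select(counts, rewards)
        counts[arm] += 1
        if rng.random() < true_rates[arm]:
            rewards[arm] += 1.0
    return counts

# Example: three options with assumed payoff rates; UCB1 should
# concentrate most impressions on the best arm while still
# occasionally sampling the others.
counts = run_bandit([0.2, 0.5, 0.8])
```

Because the confidence term decays only logarithmically, the suboptimal arms keep receiving occasional impressions, which is what distinguishes UCB from a pure greedy rule.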