Chapter 2 discusses the multi-armed (k-armed) bandit problem, in which an agent repeatedly selects among several actions whose reward distributions are unknown. It surveys strategies for balancing exploration and exploitation, including the ε-greedy method, upper-confidence-bound (UCB) action selection, optimistic initial values, and gradient bandit algorithms. The chapter also emphasizes adapting the estimation method in nonstationary settings (e.g., using a constant step size instead of sample averages) and introduces contextual (associative) bandit problems.
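As a concrete illustration, here is a minimal sketch of ε-greedy action selection with incremental sample-average value updates, in the spirit of the chapter's simple bandit algorithm. The function name `run_bandit` and all parameter defaults are illustrative assumptions, not taken from the chapter itself.

```python
import numpy as np

def run_bandit(k=10, eps=0.1, steps=1000, seed=0):
    """Epsilon-greedy on a k-armed Gaussian bandit (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, k)   # unknown true action values
    Q = np.zeros(k)                    # action-value estimates
    N = np.zeros(k)                    # selection counts per action
    rewards = np.empty(steps)
    for t in range(steps):
        if rng.random() < eps:
            a = rng.integers(k)                          # explore: random action
        else:
            a = rng.choice(np.flatnonzero(Q == Q.max())) # exploit, ties broken randomly
        r = rng.normal(q_true[a], 1.0)                   # sample a reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]   # incremental sample-average update
        rewards[t] = r
    return rewards.mean()

print(run_bandit())
```

For a nonstationary problem, the sample-average step would typically be replaced by a constant step-size update, `Q[a] += alpha * (r - Q[a])`, so that recent rewards are weighted more heavily than older ones.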