Deep Reinforcement Learning
and its Applications
Yuxi Li 

yuxili@gmail.com

2021.04.28
Deep Reinforcement Learning and Its Applications
Reinforcement Learning (RL)
at each time step, an RL agent

• receives a state 

• selects an action

• receives a reward 

• transitions into a new state 

objective: maximize long-term reward
[Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT Press.]
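The loop above maps directly to code. A minimal sketch (illustration only, not from the slides), assuming a Gym-style environment API where step() returns (next_state, reward, done, info) and a placeholder select_action function:

```python
# Minimal agent-environment interaction loop (pre-0.26 Gym API assumed).
# `env` and `select_action` are placeholders, not from the slides.
def run_episode(env, select_action, gamma=0.99):
    state = env.reset()                      # agent receives a state
    ret, discount, done = 0.0, 1.0, False
    while not done:
        action = select_action(state)        # selects an action
        state, reward, done, _ = env.step(action)  # reward + new state
        ret += discount * reward             # accumulate long-term reward
        discount *= gamma
    return ret                               # objective: maximize this
```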
“Reinforcement”
is from classical conditioning
• The term “reinforcement” in the context of animal learning came into use in
the 1927 English translation of Pavlov’s monograph on conditioned reflexes. 

• Pavlov described reinforcement as the strengthening of a pattern of
behavior due to an animal receiving a stimulus—a reinforcer—in an
appropriate temporal relationship with another stimulus or with a response.
[Pic from Internet]
[Sutton earned a BA in psychology from Stanford University in 1978]
[Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT Press.]
[David Silver RL Course]
[Li, Y. (2017). Deep Reinforcement Learning: An Overview. ArXiv.]
[Figure: Venn diagram relating the fields — deep reinforcement learning sits at the intersection of deep learning and reinforcement learning; reinforcement, supervised, and unsupervised learning are kinds of machine learning, which is part of artificial intelligence]

machine learning techniques: artificial neural networks, association rule learning, Bayesian networks, clustering, decision tree learning, genetic algorithms, inductive logic programming, reinforcement learning, representation learning, rule-based machine learning, similarity and metric learning, sparse dictionary learning, support vector machines

AI topics: problem solving (search, constraint satisfaction); knowledge, reasoning, and planning (logical agents, first-order logic, planning and acting, knowledge representation, probabilistic reasoning, decision making); learning (learning from examples, knowledge in learning, learning probabilistic models, reinforcement learning); communication, perceiving, and acting (natural language processing, perception, robotics)
[Li, Y. (2019). Reinforcement Learning Applications. ArXiv.]
loosely speaking (not strictly correct)

• supervised learning makes predictions 

• myopic

• reinforcement learning makes decisions

• long-term thinking
RL + DL = AI
[David Silver, Deep Reinforcement Learning from AlphaGo to AlphaStar]
From Deep Q-Networks (DQN)
to Agent57
[Badia, A. P., et al. (2020). Agent57: Outperforming the Atari human benchmark. ArXiv.]
[Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533. ]
First return, then explore: Go-Explore
[Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth Stanley and Jeff Clune, First return, then explore, Nature, February 2021]
Classical conditioning
• The term "reinforcement" appeared in
the 1927 English translation of Pavlov's
monograph on conditioned reflexes

• Pavlov described reinforcement as the
strengthening of a pattern of behavior
when an animal receives a stimulus,
i.e., a reinforcer, in an appropriate
temporal relationship with another
stimulus or with a response.
[based on Silver et al. (2016, 2017, 2018), Li (2017)]
Dota [OpenAI (2019)]
StarCraft [Vinyals et al. (2019)]
Poker [Moravcik et al. (2017)]
Catch The Flag [Jaderberg et al. (2019)]
Curling [Won et al. (2020)]
Hide-and-Seek [Baker et al. (2020)]
see next slide
Potential applications of techniques in games
games correspond to fundamental problems in CS/AI and relate to combinatorial optimization,
NP-hard problems, control, and operations research
Libratus paper mentioned:

• business strategy

• negotiation

• strategic pricing

• finance

• cybersecurity

• military applications

• auctions

• pricing
AlphaGo papers mentioned:

• general game-playing

• classical planning

• partially observed planning

• scheduling

• constraint satisfaction

DeepStack paper mentioned:

• defending strategic resources

• robust decision making for medical treatment recommendations
[Brown, N. and Sandholm, T. (2017). Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science.]
[Jaderberg, M.,et al. (2019). Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364:859–865.]
[Silver, D.,et al. (2017). Mastering the game of Go without human knowledge. Nature, 550:354–359. ]
[Baker, B., et al. (2020). Emergent tool use from multi-agent autocurricula. In ICLR. (hide-and-seek)]
[OpenAI (2019). Dota 2 with large scale deep reinforcement learning. ArXiv.]
[Silver, D., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144. ]
[Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489. ]
[Moravcik, M., et al. (2017). DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513.]
[Vinyals, O., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575:350–354. ]
Games are to AI
what
fruit flies are to genetics.
[David Silver RL Course]
Deep Reinforcement Learning and Its Applications
•brief introduction

•successful applications

•challenges and opportunities
Multi-Armed Bandits
• 25–30% improvement in click-through rate

• 18% revenue lift on the landing page
[Agarwal, A. et al. (2016). Making contextual decisions with low technical debt. ArXiv.]
[Csaba Szepesvári, Bandits - DLRLSS 2019]
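To make the exploration behind such gains concrete, here is a minimal Thompson-sampling sketch for Bernoulli rewards such as clicks; the arm count, step count, and click rates are made up for illustration:

```python
import numpy as np

def thompson_sampling(n_arms, pull, steps=10_000, seed=0):
    """Bernoulli Thompson sampling with Beta(1, 1) priors.
    `pull(a)` returns a 0/1 reward (e.g., a click) for arm a."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_arms)                 # successes + 1
    beta = np.ones(n_arms)                  # failures + 1
    for _ in range(steps):
        a = int(np.argmax(rng.beta(alpha, beta)))  # sample, pick best
        r = pull(a)
        alpha[a] += r
        beta[a] += 1 - r
    return alpha / (alpha + beta)           # posterior-mean click rates

# toy usage: the third arm has the highest true click-through rate
true_ctr = [0.020, 0.025, 0.030]
rng = np.random.default_rng(1)
estimates = thompson_sampling(3, lambda a: rng.binomial(1, true_ctr[a]))
```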
Decision Service
[Agarwal et al. (2016)]
abstractions
• explore to collect the
data

• log the data correctly

• learn a good model

• deploy it in the
application
• the Client Library implements the Explore abstraction 

• implements various exploration policies, addressing F1 

• the Join Service implements the Log abstraction 

• joins rewards to decisions

• produces correct exploration data to address F2 

• enforces a uniform delay before releasing data to the Learn
component to avoid delay-related biases, addressing F1 

• the exploration data is copied to the Store for offline
experimentation 

• the Online Learner implements the Learn abstraction 

• incorporates data continuously and checkpoints models to
the Deploy component at a configurable rate, addressing F3 

• evaluates arbitrary policies in real-time

• enables advanced monitoring and safeguards, addressing F4

• the Store implements the Deploy abstraction

• provides model and data storage 

• the Offline Learner uses data for offline experimentation

• such as tuning hyper-parameters, evaluating other learning
algorithms or policy classes, changing the reward metric,
etc., counterfactually accurate 

• the Feature Generator eases usability by auto-generating
features
failures
• (F1) partial feedback and
bias

• (F2) incorrect data
collection 

• (F3) changes in the
environment 

• (F4) weak monitoring and
debugging
https://github.com/Microsoft/mwt-ds http://ds.microsoft.com
2019 Inaugural ACM SIGAI Industry Award for Excellence in Artificial Intelligence
multi-world testing
vs A/B testing
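The core idea behind multi-world testing is off-policy (counterfactual) evaluation: with logged action propensities, a single exploration dataset can score many candidate policies, whereas A/B testing needs live traffic for each variant. A minimal inverse-propensity-scoring (IPS) sketch, with hypothetical log fields:

```python
def ips_value(logs, policy):
    """Estimate the average reward `policy` would have earned, from logs.
    Each entry is (context, action, reward, propensity), where propensity
    is the probability the logging policy assigned to the logged action."""
    total = 0.0
    for context, action, reward, propensity in logs:
        if policy(context) == action:     # new policy agrees with the log
            total += reward / propensity  # reweight to remove logging bias
        # disagreements contribute 0
    return total / len(logs)
```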
Lessons from contextual bandit learning
in a customer support bot
• consider starting with imitation learning 

• consider simplified action spaces 

• don’t be afraid of principled exploration 

• try to support changes in environment 

• cautiously close the loop 

• consider reward engineering and
shaping 

• use a separate logging channel
• use and extend existing systems 

• pay attention to effective sample size (see the sketch below) 

• avoid ϵ-greedy

• regularize towards the logging policy,
increases the effective sample size,
resulting in shorter confidence
intervals and reduced overfitting 

• design an architecture suited to RL 

• balance randomness with
predictability
[Karampatziakis, N., et al. (2019). Lessons from real-world reinforcement learning in a customer support bot. In RL4RealLife.]
• Microsoft Virtual Agent

• scenarios: intent disambiguation, contextual recommendations
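One way to make the effective-sample-size advice above concrete: with importance weights w_i = π(a_i|x_i) / μ(a_i|x_i) between the target policy π and the logging policy μ, a standard estimate is ESS = (Σ w_i)² / Σ w_i². This formula is a common choice, not quoted from the paper:

```python
import numpy as np

def effective_sample_size(target_probs, logging_probs):
    """ESS = (sum w)^2 / sum w^2 for importance weights w = pi / mu.
    Equals n when the two policies match; shrinks as they diverge."""
    w = np.asarray(target_probs, float) / np.asarray(logging_probs, float)
    return float(w.sum() ** 2 / (w ** 2).sum())
```

Regularizing the learned policy toward the logging policy keeps the weights near 1, which is exactly why it increases the effective sample size and shortens confidence intervals.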
RL Applications @ Microsoft
• Personalizer, part of Azure Cognitive Services, within Azure AI platform

• making its way into more Microsoft products and services: Windows, Edge browser, Xbox 

• developers can plug Azure Cognitive Services into apps and websites 

• engineers can use Autonomous systems to refine manufacturing processes 

• Azure Machine Learning previews cloud-based RL offerings for data scientists and ML
professionals 

• Metrics Advisor incorporates feedback and makes models more adaptive to a customer’s
dataset, which helps detect more subtle anomalies in sensors, production processes or
business metrics 

• recommendation, adaptive to COVID-19 pandemic, find the optimal jitter buffer for a video
meeting, help determine when to reboot or remediate virtual machines, … 

• wide applications: 

• deliver tailored recommendations to small grocery stores across Mexico

• manipulate unstable coin bags for a bank in Russia

• collaborate with human players in games for a UK company
With reinforcement learning, Microsoft brings a new class of AI solutions to customers, https://blogs.microsoft.com/ai/reinforcement-learning/
RecSim: A Configurable Simulation Platform for
Recommender Systems
[Ie, E. et al. (2019). Reinforcement learning for slate-based recommender systems: A tractable decomposition and practical methodology. ArXiv. ]
[Ie, E. et al. (2019). Recsim - a configurable recommender systems environment. in RL4RealLife]
[Chen, M. et al. (2019). Top-k off-policy correction for a reinforce recommender system. in WSDM]
[Zhao, X., Xia, L., Tang, J., and Yin, D. (2019). Reinforcement learning for online information seeking. ACM SIGWEB Newsletter (SIGWEB).]
Facebook ReAgent
features 

• data preprocessing 

• feature normalization

• deep RL model implementation 

• multi-GPU training 

• counterfactual policy evaluation 

• optimized serving

• tested algorithms 

applications
• delivering more relevant notifications

• optimizing streaming video bit rates

• improving M suggestions in Messenger
Horizon: The first open source reinforcement learning platform for large-scale products and services, https://code.fb.com/ml-applications/horizon/
pipeline
• timeline generation, runs across thousands of CPUs

• training, runs across many GPUs

• serving, spans thousands of machines
[https://github.com/facebookresearch/ReAgent]
[Gauci, J. et al. Horizon: Facebook’s open source applied reinforcement learning platform. In RL4RealLife, 2019]
Ride-Hailing Order Dispatching
at DiDi via Reinforcement Learning
INFORMS 2019 Wagner Prize Winner [Qin, Z. T., et al. (2020). Ride-hailing order dispatching at DiDi via reinforcement learning. INFORMS Journal on Applied Analytics, 50(5):272–286.]
challenges
• dynamic and stochastic supply and demand

• system response time

• reliability

• multiple business objectives

• driver-centric objective: maximize the total income of the
drivers on the platform 

• passenger-centric objective: minimize the average pickup
distance of all the assigned orders

• marketplace efficiency metrics

• response rate

• fulfillment rate

• production requirements and constraints 

• computational efficiency 

• system reliability

• changing business requirements
solution approaches
• combinatorial optimization, myopic

• semi-Markov decision process

• tabular temporal difference learning

• deep (hierarchical) RL 

• transfer learning
[Tony Qin talk, https://www.arlseminar.com/speakers/]
[Tony Qin tutorials, https://tonyzqin.wordpress.com]
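The tabular temporal-difference approach listed above is simple to state in code. A TD(0) sketch over hypothetical (grid cell, time bucket) states; a real dispatching system would discount by trip duration and add much more:

```python
from collections import defaultdict

def td0_update(V, state, reward, next_state, alpha=0.05, gamma=0.99):
    """One TD(0) backup: V(s) += alpha * (r + gamma * V(s') - V(s))."""
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return td_error

V = defaultdict(float)  # keyed by (grid_cell, time_bucket)
# e.g., a trip from cell 12 at hour 9 to cell 40 at hour 10 earning 25.0
td0_update(V, (12, 9), 25.0, (40, 10))
```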
Autonomous navigation of stratospheric
balloons using reinforcement learning
• a cost-effective platform for communication, Earth observation,
gathering meteorological data and other applications 

• to navigate, a flight controller ascends and descends to find and
follow favourable wind currents

• vertical motion by pumping air ballast in and out of a fixed-
volume envelope

• horizontal motion by the winds 

• station-keeping, maintaining the balloon within a range of its
station, for communication

• input signal: wind speed and solar elevation

• challenges: imperfect data, sparse wind measurements resulting
in partial observability, power management 

• neither conventional methods nor human intervention suffice
• use reinforcement learning to train a flight controller from
simulations 

• use data augmentation and a self-correcting design to
overcome imperfect data 

• robust to the natural diversity in stratospheric winds 

• 39-day controlled experiment over the Pacific Ocean
[Bellemare, M. G., et al. (2020). Autonomous navigation of stratospheric
balloons using reinforcement learning. Nature, 588:77–82.]
Saying goodbye to Loon, https://medium.com/loon-for-all/loon-draft-c3fcebc11f3f, Jan 22, 2021
[Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., and Hutter, M. (2020). Learning quadrupedal locomotion over challenging terrain. Science Robotics, 5. ]
Learning quadrupedal locomotion over challenging terrain
[Mao, H., et al. (2019). Park: An open platform for learning augmented computer systems. In NeurIPS.]
[Mirhoseini, A., et al. (2020). Chip placement with deep reinforcement learning. ArXiv. ]
[Zoph, B. and Le, Q. V. (2017). Neural architecture search with reinforcement learning. In ICLR.] [Cubuk, E. et al. (2019). Autoaugment: Learning augmentation policies from data. In CVPR.]
[Mirhoseini, A. et al. (2017). Device placement optimization with reinforcement learning. In ICML.]
AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
[John D. Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Sergey Levine, Quoc V. Le, Honglak Lee, and Aleksandra Faust. Evolving Reinforcement Learning Algorithms, ICLR 2021]
[Esteban Real, Chen Liang, David R. So and Quoc V. Le. AutoML-Zero: Evolving Machine Learning Algorithms From Scratch, ICML 2020]
Evolving Reinforcement Learning Algorithms (AutoRL)
Combinatorial Optimization
[Chen, X. and Tian, Y. (2019). Learning to perform local rewriting for combinatorial optimization. In NeurIPS. ]
[Kool, W., van Hoof, H., and Welling, M. (2019). Attention, learn to solve routing problems! In ICLR.]
[Lu, H., Zhang, X., and Yang, S. (2020). A learning-based
iterative method for solving vehicle routing problems. In ICLR. ]
• ML alongside optimization algorithms (branch and bound)

• end-to-end learning

• learning to configure algorithms (hyper-parameters)
[Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. (2021) Machine Learning for Combinatorial
Optimization: a Methodological Tour d’Horizon. European Journal of Operational Research 290(2):405-421.]
SMARTS: Scalable
Multi-Agent RL
Training School for
Autonomous Driving
[Zhou, M., et al. (2020). SMARTS: Scalable multi-agent reinforcement learning training school for autonomous driving. In Conference on Robot Learning (CoRL). ]
Wuji: Automatic Online Combat Game Testing
Using Evolutionary Deep Reinforcement Learning
• crash bugs

• stuck bugs

• logic bugs

• gaming balance bugs

• user experience bugs
[Zheng, Y. et al. (2019). Wuji: Automatic online combat game testing using evolutionary deep reinforcement learning. In ASE 2019.]
ACM SIGSOFT Distinguished Paper Award
• automatic game testing framework

• evolutionary algorithm

• deep RL

• multi-objective optimization
RL for Instructional Sequencing
• RL has been most successful in cases where it has been
constrained with ideas and theories from cognitive
psychology and the learning sciences
[Doroudi, S., Aleven, V., and Brunskill, E. (2019). Where’s the reward? a review of reinforcement learning for instructional sequencing.
International Journal of Artificial Intelligence in Education, 29:568–620.]
• elementary school children learn to manipulate money 

• four principal regions: wallet location, repository location, object
location, text location

• ITS dynamically proposes to students the exercises currently
making maximal learning progress

• targeting to maximize intrinsic motivation and learning efficiency
[Oudeyer, P.-Y.,et al. (2016). Intrinsic motivation, curiosity and learning: theory and applications in educational technologies. Progress in brain research, Elsevier, 229:257– 284. ]
Intrinsic motivation is
defined as doing an activity
for its inherent satisfaction,
fun or challenge, rather than
for external products,
pressures or rewards.
Flow
the psychology of optimal experience
A good life is one that is
characterized by complete
absorption in what one does.
• challenge-skill balance

• action-awareness merging

• clear goals

• unambiguous feedback

• concentration on the task at hand

• sense of control

• loss of self-consciousness 

• transformation of time 

• an autotelic experience
[Nakamura, J. and Csikszentmihalyi, M. (2014). The concept of flow. In Csikszentmihalyi, M., editor, Flow and the Foundations of Positive Psychology, pages 239–263. Springer. ]
A Generic Approach to Challenge Modeling for
the Procedural Creation of Video Game Levels
• vertical arrows represent the challenging events
(holes) as unit impulses

• the curve represents the amount of accumulated
challenge in the time window
• anxiety depends on the accumulated challenge 

• fun as a response to increasing anxiety
• define reward function with quantitative challenge and fun 

• use RL to procedurally generate content for Super Mario
[Sorenson, N., Pasquier, P., and DiPaola, S. (2011). A generic approach to challenge modeling for the procedural creation of video game levels.
IEEE Transaction on Computational Intelligence and AI in Games, 3(3):229–244.]
Mobile Healthcare
• micro-randomized trial, collecting data for offline analysis

• Just-in-time adaptive interventions (JITAIs), the next generation of
mobile healthcare delivery that is automated, scalable, evidence-driven
and inexpensive. 

• iterating between offline analysis and online personalization

• more data and better algorithms over time, gradually improve JITAIs

• use positive psychology to improve adoption, engagement and effect?
• chronic diseases: migraine, diabetes, obesity, etc.

• mental health: schizophrenia, depression, anxiety, etc.

• wellness: fitness management, sedentary behavior, etc.
[Menictas, M., Rabbi, M., Klasnja, P., and Murphy, S. (2019). Artificial intelligence
decision-making in mobile health. The Biochemist, 41(5):20–24.]
FinRL: A Deep Reinforcement Learning Library for
Automated Stock Trading in Quantitative Finance
[FinRL: https://github.com/AI4Finance-LLC/FinRL-Library]
[Wang, J. et al. (2019). Alphastock: A buying-winners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks. In KDD.]
The AI Economist:
Improving Equality and Productivity
with AI-Driven Tax Policies
[Stephan Zheng, Alexander Trott, Sunil Srinivasa, Nikhil Naik, Melvin Gruesbeck, David C. Parkes, and Richard
Socher, (2020) The AI Economist: Improving Equality and Productivity with AI-Driven Tax Policies. ArXiv.]
Data center cooling using model-predictive control
[Warren B. Powell (2019). From Reinforcement Learning to Optimal Control: A unified framework for sequential decisions. arXiv]
[Shin, J., Badgwell, T.A., Liu, K., Lee, J.H., 2019. Reinforcement learning - overview of recent progress and implications for process control. CCE 127, 282–294 (2020).]
[R. Nian, J. Liu and B. Huang, A review On reinforcement learning: Introduction and applications in industrial process control, Computers and Chemical Engineering 139 (2020) ]
notes for RL:

• the goal can be
complex, can have
safety constraints

• there can be
stability constraints

• there can be state
constraints

• failure can be
avoided with safety
constraints during
execution
[Lazic, N., et al. (2018). Data center cooling using model-predictive control. In NeurIPS.]
Gym-ANM: Reinforcement learning environments for active network management tasks in electricity distribution systems
[Robin Henry and Damien Ernst, (2021). Gym-ANM: Reinforcement learning environments for active network management tasks in electricity distribution systems, ArXiv.]
[Zhavoronkov, A., et al. (2019). Deep learning enables rapid identification of potent ddr1 kinase inhibitors. Nature Biotechnology, 37:1038–1040. ]
Drug design
AlphaFold
• what a protein does largely depends on its unique 3D structure 

• protein folding problem: figure out what shapes proteins fold into

• a grand challenge in biology for the past 50 years
[https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology]
COVID-19
• RL is a promising framework to combat epidemics

• epidemics are a fruitful application area for RL to make
substantial real life impact 

• communities of RL and AI, epidemiology and public health,
and economics should collaborate to combat epidemics
[Yinyu Ye, Optimization and operations research in mitigation of a pandemic, 2020]
[Yuxi Li, Combat epidemics with reinforcement learning, https://attain.ai, 2020]
•brief introduction

•successful applications

•challenges and opportunities
Software engineering for machine learning
• the nine stages of the machine learning workflow 

• data-oriented: collection, cleaning, and labeling

• model-oriented: model requirements, feature engineering, training, evaluation, deployment, and monitoring

• many feedback loops in the workflow 

• larger feedback arrows: model evaluation and monitoring may loop back to any of the previous stages

• smaller feedback arrow: model training may loop back to feature engineering, e.g., in representation learning

• three aspects of the AI domain that make it fundamentally different from prior software application domains: 

• discovering, managing, and versioning the data needed for machine learning applications is much more complex
and difficult than other types of software engineering

• model customization and model reuse require very different skills than are typically found in software teams

• AI components are more difficult to handle as distinct modules than traditional software components — models
may be “entangled” in complex ways and experience non-monotonic error behavior
[Amershi, S., et al. (2019). Software engineering for machine learning: A case study. In ICSE.]
Hidden Technical Debt in
Machine Learning Systems
• boundary erosion

• entanglement

• hidden feedback loops

• undeclared consumers

• data dependencies 

• configuration issues

• changes in the external world

• system-level anti-patterns
[Sculley, D., et al. (2014). Machine learning: The high interest credit card of technical debt.
In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop)]
• potential solutions

• refactoring code

• improving unit tests

• deleting dead code

• reducing dependencies

• tightening APIs

• improving documentation
• technical debt, the long-term costs that accumulate when expedient but
suboptimal decisions are made in the short term

• ML systems incur massive hidden costs at the system level, beyond the
basic code complexity issues of traditional software systems
Challenges in Deploying Machine Learning
[Andrei Paleyes, Raoul-Gabriel Urma, Neil D. Lawrence, (2020) Challenges
in Deploying Machine Learning: a Survey of Case Studies, ArXiv.]
Data set and software engineering
• data set for reinforcement learning?

• in particular, for contextual bandits

• something for RL like ImageNet for deep learning 

• software engineering for reinforcement learning? 

• Personalizer / Decision Service by Microsoft
RL competitions
• AWS DeepRacer League

• Flatland: Multi-Agent Reinforcement Learning on Trains

• KDD 2020 Cup Learning to dispatch and reposition on a
mobility-on-demand platform

• Learning to Run a Power Network Challenge

• SMARTS Competition of Autonomous Driving
https://github.com/seungjaeryanlee/awesome-rl-competitions
Offline RL / Batch RL
• (a) online RL: the policy π_k is updated with streaming data collected by π_k itself 

• (b) off-policy RL: the agent's experience is appended to a data buffer 𝒟 (also called a replay
buffer), and each new policy π_k collects additional data, such that 𝒟 is composed of
samples from π_0, π_1, . . . , π_k, and all of this data is used to train an updated new policy π_{k+1} 

• (c) offline RL: employs a dataset 𝒟 collected by some (potentially unknown) behavior policy
π_β. The dataset is collected once and is not altered during training, which makes it feasible
to use large previously collected datasets. The training process does not interact with the MDP
at all, and the policy is only deployed after being fully trained.
[Sergey Levine, Aviral Kumar, George Tucker and Justin Fu. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. ArXiv. ]
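A minimal sketch of setting (c) in code: the dataset is fixed and training never calls the environment. This is plain tabular Q-learning over logged tuples; practical offline RL methods add corrections for distribution shift (e.g., pessimism or constraints toward π_β) that this sketch omits:

```python
import numpy as np

def offline_q_learning(dataset, n_states, n_actions,
                       alpha=0.1, gamma=0.99, epochs=50):
    """Tabular Q-learning over a fixed buffer of (s, a, r, s2, done)
    tuples collected once by a behavior policy; no env interaction."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        for s, a, r, s2, done in dataset:
            target = r if done else r + gamma * Q[s2].max()
            Q[s, a] += alpha * (target - Q[s, a])
    return Q.argmax(axis=1)  # greedy policy, deployed only after training
```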
The challenges of real-world RL
• being able to learn on live systems from limited samples

• dealing with unknown and potentially large delays in the system actuators, sensors, or
rewards

• learning and acting in high-dimensional state and action spaces

• reasoning about system constraints that should never or rarely be violated 

• interacting with systems that are partially observable, which can alternatively be viewed
as systems that are non-stationary or stochastic

• learning from multi-objective or poorly specified reward functions

• being able to provide actions in real-time, especially for systems with high control frequencies 

• training off-line from the fixed logs of an external behavior policy 

• providing system operators with explainable policies
[Dulac-Arnold, G., Mankowitz, D., and Hester, T. (2019). Challenges of real-world reinforcement learning. In RL4RealLife.]
A Model for Motivation, Engagement,
and Thriving in the User Experience
• The basic psychological needs of autonomy, competence and relatedness mediate
positive user experience outcomes such as engagement, motivation and thriving.

• As such, they constitute specific measurable parameters for which designers can
design in order to foster these outcomes within different spheres of experience.

• self-determination theory, positive psychology, helpful for HCI? reward function?
[Peters, D., et al. (2018). Designing for motivation, engagement and wellbeing in digital experience. Frontier in Psychology, 9(797).]
The foundation of
efficient robot learning
• sample efficient, requiring relatively
few training examples

• generalizable, applicable to many
situations other than the one(s) it
learned 

• compositional, represented in a
form that allows it to be combined
with previous knowledge

• incremental, capable of adding new
knowledge and abilities over time
[Kaelbling, L. P. (2020). The foundation of efficient robot learning. Science, 369(6506):915–916. ]
Prior knowledge/structure in machine learning
bitter or better lesson?
• Richard Sutton: The biggest lesson that can be read
from 70 years of AI research is that general methods
that leverage computation are ultimately the most
effective, and by a large margin.

• chess, computer Go, speech recognition, computer
vision

• leverage the great power of general purpose methods:
search and learning

• the actual contents of minds are very complex

• build in only the meta-methods that can find and
capture the arbitrary complexity

• let our methods search for good approximations,
not by us
• Rodney Brooks: we have to take into account the
total cost of any solution, and that so far they have
all required substantial amounts of human ingenuity

• Convolutional Neural Networks, designed by
humans to manage translational invariance

• issues like color constancy, to avoid recognizing a
traffic stop sign with some pieces of tape on it as a
45 mph speed limit sign by CNN

• network architecture design

• massive data sets, amount of computation, power
consumption

• Moore’s Law slows down; breakdown of Dennard
scaling 

• special purpose computer architecture needs human
analysis
Brooks, R. (2019). A better lesson. https://rodneybrooks.com/a-better-lesson/ 

Sutton, R. (2019). The bitter lesson. http://incompleteideas.net/IncIdeas/BitterLesson.html.
• Thomas Dietterich: both are right

• Richard, we have achieved significant advances in
performance by replacing (some kinds of) human
engineering with machine learning from big data.

• Rodney, we need to find better ways of encoding
knowledge into network structure (or other prior
constraints).
Yoshua Bengio,  From System 1 Deep Learning to System 2 Deep Learning, Posner lecture at NeurIPS’2019
The AI community borrows ideas from psychology
that underlie a Nobel Prize in economics.
Reinforcement Learning,
Fast and Slow
• sample inefficiency
• the requirement for
incremental parameter
adjustment 

• maximize generalization and
avoid overwriting the effects of
earlier learning 

• inductive bias
• bias–variance trade-off: the
stronger assumption, the less
data
[Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., and Hassabis, D. (2019). Reinforcement learning, fast and slow. Trends in Cognitive Sciences, 23(5):408–422.]
• episodic deep RL
• fast learning through episodic memory

• form useful internal representations or embeddings
of each new observation 

• meta-RL
• speed up deep RL by learning to learn 

• narrow hypothesis space

• episodic meta-RL 

• fast learning arises from, and is enabled by, slow
learning 

• relevance to neuroscience and psychology
Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh
Goyal, Yoshua Bengio, (2021). Toward Causal Representation Learning, Proceedings of the IEEE.
[Elias Bareinboim, Causal Reinforcement Learning, ICML 2020 Tutorial. https://crl.causalai.net/]
Causality
Representation
[Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.]
[Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. (2021) A Comprehensive
Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4-24, Jan. 2021]
[Yao Ma and Jiliang Tang. (2020) Deep Learning on Graphs. Cambridge University Press]
Interpretability
[Belle, V. and Papantonis, I. (2020). Principles and practice of explainable machine learning. ArXiv.]
[W. James Murdoch, Chandan Singh, Karl Kumbier, Reza
Abbasi-Asl and Bin Yu. (2019) Definitions, methods, and
applications in interpretable machine learning. PNAS
116 (44) 22071–22080]
[Finale Doshi-Velez and Been Kim. (2017) Towards A
Rigorous Science of Interpretable Machine Learning]
[Alharin, A., Doan, T.-N., and Sartipi, M. (2020). Reinforcement learning
interpretation methods: A survey. IEEE Access, 8:171058 – 171077.]
Guidelines for RL in healthcare
• access to all variables influencing decision
making
• the effective sample size

• larger if the learned policies are close to the
clinician policies 

• most reliable for refining existing practices
rather than discovering new treatment approaches 

• the possibilities for mismatch between the actual
decision and the proposed decision grow with the
number of decisions in the patient’s history 

• interrogate RL-learned policies 

• to assess whether they will behave prospectively
as intended 

• consider problem formulation, reward function
definition, data recording or preprocessing,
transferability to new scenarios, interpretability
[Gottesman, O. et al. (2019). Guidelines for reinforcement learning in healthcare. Nature Medicine, 25:14–18.]
problem scale - combinations
effective sample size in off-policy evaluation
Do no harm: a roadmap for
responsible machine learning for health care
[Wiens, J. et al. (2019). Do no harm: a roadmap for responsible
machine learning for health care. Nature Medicine, 25:1337–1340.]
Hippocratic oath for AI?
Constraints
• satisfying constraints

• during exploration and operation

• tradeoff multiple objectives

• autonomous car: safety, efficiency
and comfort

• combating COVID-19 vs maintaining
economic productivity
[Dulac-Arnold, G., Mankowitz, D., and Hester, T. (2019). Challenges of real-world reinforcement learning. In RL4RealLife.]
[Csaba Szepesvári. (2020). Constrained MDPs and the reward hypothesis, https://readingsml.blogspot.com/2020/03/constrained-mdps-and-reward-hypothesis.html]
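One standard way to handle such tradeoffs, in the constrained-MDP spirit of the post cited above, is a Lagrangian relaxation: optimize reward minus λ-weighted constraint costs and adjust the multipliers by dual ascent. A hedged sketch of just the scalarization and the multiplier update, with made-up signal names:

```python
def scalarized_reward(reward, costs, lambdas):
    """Constrained objective as a single reward: r - sum_i lambda_i * c_i.
    `costs` are per-step constraint signals (e.g., discomfort, risk)."""
    return reward - sum(l * c for l, c in zip(lambdas, costs))

def update_multiplier(lmbda, avg_cost, budget, lr=0.01):
    """Dual ascent: raise lambda while the cost budget is exceeded,
    lower it (but never below 0) when the constraint is slack."""
    return max(0.0, lmbda + lr * (avg_cost - budget))
```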
AI creates a new type of business
• lower gross margins due to heavy
cloud infrastructure usage and ongoing
human support

• scaling challenges due to the thorny
problem of edge cases

• weaker defensive moats due to the
commoditization of AI models and
challenges with data network effects

• AI companies appear, increasingly, to
combine elements of both software and
services with gross margins, scaling,
and defensibility that may represent a
new class of business entirely.
[Andreessen Horowitz blog. The new business of AI (and how it's different from traditional software).
https://a16z.com/2020/02/16/the-new-business-of-ai-and-how-its-different-from-traditional-software/]
practical advice for founders: 

• eliminate model complexity
as much as possible

• choose problem domains
carefully to reduce data
complexity

• plan for high variable costs

• embrace services

• plan for change in the tech stack

• build defensibility the old-
fashioned way
MLOps: From Model-centric to Data-centric AI
• MLOp’s most important task is to make high quality data
available through all stages of the ML project lifecycle.

• AI system = Code + Data

• Model-centric AI: How can you change the model (code) to
improve performance?

• Data-centric AI: How can you systematically change your
data (input x or labels y) to improve performance?

• Important frontier: MLOps tools to make data-centric AI an
efficient and systematic process.
[Andrew Ng, A Chat with Andrew on MLOps: From Model-centric to Data-centric AI, https://tinyurl.com/8dzjmexd]
Bridging AI’s POC
to production gap
• small data algorithms include synthetic data generation, e.g.,
Generative Adversarial Networks (GANs), one/few-shot learning,
e.g., GPT-3, self-supervised learning, transfer learning,
anomaly detection, etc.

• will your model generalize to a different dataset than
what it was trained on?

• a model that works in a published paper often does not work
in production

• production AI projects require more than ML code
• manage the change the technology brings: budget
enough time, identify all stakeholders, provide
reassurance, explain what's happening and why, right-size the first project

• key technical tools: explainable AI, auditing
[Ng, A. (2020). Bridging AI's proof-of-concept to production gap. https://tinyurl.com/u45zer7j]
SHOULD I GET INTO RL?

• This is RL, that is not RL 

• Dangers of RL 

• Safety guarantees 

• RL is done 

• RL does not work 

RL IS PROBLEMATIC!? 

• Testing on training data? 

• Generalization & RL 

• Focus on simulators 

• Bad problem 

• Speed of RL
[Csaba Szepesvári, DL Day talk @ KDD 2020, https://sites.ualberta.ca/~szepesva/talks.html]
META MYTHS 

• Breaking curses 

• Data/generality wins 

• SOTA
NEIGHBORS OF RL 

• Alternatives ≫ RL 

• Self-supervised learning
and RL 

• Causality
Dimitri Bertsekas: cautiously positive
• There are enough methods to try with a reasonable chance of success for most types of optimization problems. 

• There are no methods that are guaranteed to work for all or even most problems. 

• We can begin to address practical problems of unimaginable difficulty! 

• There is an exciting journey ahead! 

• see more from slides and the new book on RL and Optimal Control, http://web.mit.edu/dimitrib/www/RLbook.html
Warren Powell: Sequential Decision Analytics
four classes of policies 

• Policy function
approximations (PFAs): 

• parameterized
policies

• Cost function
approximations (CFAs)

• upper confidence
bounding (UCB)

• Value function
approximations (VFAs)

• Q-learning

• Direct lookaheads
(DLAs)

• Monte Carlo tree
search

• which one is best? 

• meta-learning?
[https://castlelab.princeton.edu/sda/]
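UCB, listed above under cost function approximations, fits in a few lines. A minimal UCB1 sketch (empirical mean plus a sqrt(2 ln t / n) exploration bonus), for illustration only:

```python
import math

def ucb1_choose(means, counts, t):
    """UCB1: pick the arm maximizing mean + sqrt(2 ln t / n).
    Arms never tried (n == 0) are chosen first."""
    best_arm, best_score = 0, float("-inf")
    for arm, (mean, n) in enumerate(zip(means, counts)):
        score = math.inf if n == 0 else mean + math.sqrt(2 * math.log(t) / n)
        if score > best_score:
            best_arm, best_score = arm, score
    return best_arm
```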
How would Rich Sutton like this?
Foundation
Rich Sutton, AI Debate 2, https://www.youtube.com/watch?v=VOI3Bb3p4GM, around 35'

• David Marr's three levels at which any information processing machine must be understood:
computational theory, representation and algorithm, and hardware implementation. 

• AI has surprisingly little computational theory. 

• The big ideas are mostly at the middle level of representation and algorithm.

• Reinforcement learning is the first computational theory of intelligence.
• RL is explicitly about the goal, the whats and whys of intelligence.
Alekh Agarwal, Akshay Krishnamurthy,
and John Langford

• FOCS 2020 tutorial on the Theoretical
Foundations of Reinforcement Learning 

• https://hunch.net/~tforl/
McKinsey: It's time for businesses to chart a course for reinforcement learning
https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/its-time-for-businesses-to-chart-a-course-for-reinforcement-learning
Harvard Business Review: Why AI That Teaches Itself to Achieve a Goal Is the Next Big Thing

How to Spot an Opportunity for Reinforcement Learning

• make a list

• consider other options

• be careful what you wish for

• ask whether it’s worth it

• prepare to be patient
https://hbr.org/2021/04/why-ai-that-teaches-itself-to-achieve-a-goal-is-the-next-big-thing
When is RL helpful?
• when big data are available, from the model, a good simulator, or interaction

• natural science and engineering

• usually with clear objective function, with a standard answer, straightforward to evaluate

• AlphaGo

• combinatorial optimization, operations research, optimal control, drug design, etc.

• social science and humanities

• usually “human in the loop”, usually influenced by psychology, behavioural science, etc.,
subjective, may not have a standard answer, may not be easy to evaluate

• game design and evaluation, education

• concepts from psychology, e.g., intrinsic motivation and self-determination theory, may serve
as a bridge connecting RL/AI with social science and art, e.g., by defining the reward function
Applications everywhere
•Reinforcement learning solves sequential
decision making problems.

•Reinforcement learning intelligently
automates previously manually designed
strategies, e.g., those based on heuristics.
[Yuxi Li, Deep Reinforcement Learning: An Overview. ArXiv. https://bit.ly/2AidXm1]
[Figure: map of reinforcement learning applications across economic sectors — games, robotics, healthcare, business management, science, engineering, humanities, finance, education, energy, transportation, computer systems; example topics include Atari, Go, poker, StarCraft, game theory, PCG, testing, gamification; perception, planning, navigation, locomotion, sim-to-real; smart grid, power mgmt, data center; VRP, inventory, AGV, logistics, manufacture; resource mgmt, neural arch., computer vision, NLP, software, hardware, networks; maths, physics, chemistry, bio, psychology, neural sci., OR, optimal ctrl; music, drawing; recommender, sequencing, motivation, DTRs, mobile; scheduling, pricing, trading, portfolio opt., risk mgmt; e-commerce, customer mgmt; process ctrl, maintenance; traffic signal, order matching, V2X]
Resources
• Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction
(2nd Edition). MIT Press
• David Silver, Reinforcement Learning, 2015
• Reinforcement Learning course from Univ. of Alberta on Coursera
• OpenAI Spinning Up in Deep RL
• Deepmind & UCL Advanced Deep Learning and Reinforcement Learning
• Sergey Levine, UC Berkeley, Deep Reinforcement Learning
• https://github.com/ShangtongZhang/reinforcement-learning-an-introduction
• Yuxi Li, Deep Reinforcement Learning: An Overview, arXiv, 2017
• Yuxi Li, Reinforcement Learning Applications, arXiv, 2019 (plan to update soon)
• Yuxi Li, Resources for Deep Reinforcement Learning, medium.com
Reinforcement
Learning
for Real Life
• Machine Learning
Journal Special Issue
• ICML 2021 workshop
• 2020 virtual workshop
• ICML 2019 workshop
RL4RealLife Workshop @ ICML 2021
https://sites.google.com/view/RL4RealLife
Co-chairs
RL Foundation Panel
RL + RecSys Panel
RL + Robotics Panel
RL + OR Panel
RL Explainability &
Interpretability Panel
CFP: Deadline: June 12
TBA: RL Research-to-
RealLife Gap Panel
Machine Learning Journal Special Issue
Reinforcement Learning for Real Life
accepted so far, more to come 

• Dealing with multiple experts and non-stationarity in inverse reinforcement learning: an application
to real-life problems

• Partially observable environment estimation with uplift inference for reinforcement learning based
recommendation

• Automatic discovery of interpretable planning strategies

• Challenges of Real-World Reinforcement Learning: Definitions, Benchmarks & Analysis

• IntelligentPooling: Practical Thompson Sampling for mHealth

• Bandit Algorithms to Personalize Educational Chatbots

• Air Learning: A Deep Reinforcement Learning Gym for Autonomous Aerial Robot Visual Navigation

• Grounded Action Transformation for Sim-to-Real Reinforcement Learning

• Inverse Reinforcement Learning in Contextual MDPs

• Lessons on off-policy methods from a notification component of a chatbot
Not widely commercialized yet. Why?
• no “AlphaGo moment” in practice, no killer application, yet

• still challenges with implementation, algorithms and theory

• software engineering, system deployment, technical debt

• business model, software + service, gross margins, scaling, defensive moats

• resources, still insufficient, talent, compute, funding

• investment, chicken-egg, long-term thinking, trial-and-error spirit

• slow adoption of new technology, esp. by traditional industrial sectors

• technical route, AI+ or +AI? collaborate with domain experts

• learning curve, steeper than deep learning

• more education and training, necessary for all, from engineer to CEO
(Deep)RL doesn’t work?
Lots of successful stories already.
It requires accumulation of knowledge and experience,
resources including talent and compute, and patience.
[modified a pic from Internet]
The time for reinforcement learning is coming.
may not be another AlphaGo moment
or a killer application
may be permeating slowly and gradually
promising yet challenging

More Related Content

PPTX
Reinforcement Learning, Application and Q-Learning
PDF
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
PDF
Reinforcement learning-ebook-part1
PDF
Deep Q-Learning
PDF
An introduction to reinforcement learning
PPTX
Intro to Deep Reinforcement Learning
PDF
Reinforcement Learning Tutorial | Edureka
PDF
Reinforcement Learning using OpenAI Gym
Reinforcement Learning, Application and Q-Learning
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
Reinforcement learning-ebook-part1
Deep Q-Learning
An introduction to reinforcement learning
Intro to Deep Reinforcement Learning
Reinforcement Learning Tutorial | Edureka
Reinforcement Learning using OpenAI Gym

What's hot (20)

PPTX
Deep Reinforcement Learning
PDF
An introduction to deep reinforcement learning
PDF
Deep Reinforcement Learning
PPTX
Reinforcement Learning : A Beginners Tutorial
PDF
Deep reinforcement learning
PPTX
Reinforcement Learning
PPTX
Deep Reinforcement Learning
PDF
Reinforcement learning, Q-Learning
PPTX
Reinforcement learning
PDF
Reinforcement Learning 4. Dynamic Programming
PPTX
An introduction to reinforcement learning
PDF
RLCode와 A3C 쉽고 깊게 이해하기
PPTX
Reinforcement Learning
PPT
Reinforcement Learning Q-Learning
PPT
Reinforcement learning 7313
PDF
Reinforcement Learning 1. Introduction
PDF
Introduction of Deep Reinforcement Learning
PDF
Continuous control with deep reinforcement learning (DDPG)
PDF
Reinforcement Learning 8: Planning and Learning with Tabular Methods
Deep Reinforcement Learning
An introduction to deep reinforcement learning
Deep Reinforcement Learning
Reinforcement Learning : A Beginners Tutorial
Deep reinforcement learning
Reinforcement Learning
Deep Reinforcement Learning
Reinforcement learning, Q-Learning
Reinforcement learning
Reinforcement Learning 4. Dynamic Programming
An introduction to reinforcement learning
RLCode와 A3C 쉽고 깊게 이해하기
Reinforcement Learning
Reinforcement Learning Q-Learning
Reinforcement learning 7313
Reinforcement Learning 1. Introduction
Introduction of Deep Reinforcement Learning
Continuous control with deep reinforcement learning (DDPG)
Reinforcement Learning 8: Planning and Learning with Tabular Methods
Ad

Similar to Deep Reinforcement Learning and Its Applications (20)

PDF
Artificial Collective Intelligence
PPTX
pptvuvubhbhaszvgsgsvxhbughbghbgbhhhhhhh.pptx
PDF
PDF
Introduction to reinforcement learning
PDF
Reinforcement learning in a nutshell
PPTX
mlcgfxfgtyufuyhjfxcgvhbgfasghjgfghj.pptx
PPTX
What Can RL do.pptx
PDF
A Journey to Reinforcement Learning
PPTX
Deep Reinforcement Leaning In Machine Learning
PDF
A brief overview of Reinforcement Learning applied to games
PPTX
Reinforcement learning
PDF
Horizon: Deep Reinforcement Learning at Scale
PDF
從 Atari/AlphaGo/ChatGPT 談深度強化學習及通用人工智慧
PPTX
AI for energy: the uncertain promising opportunity
PDF
Shanghai deep learning meetup 4
DOCX
Reinforcement Learning Literature review - apr2019/feb2021 (with zip file)
PDF
anintroductiontoreinforcementlearning-180912151720.pdf
PPTX
Introduction to Reinforcement Learning.pptx
PDF
My Robot Can Learn -Using Reinforcement Learning to Teach my Robot
PDF
RL presentation
Artificial Collective Intelligence
pptvuvubhbhaszvgsgsvxhbughbghbgbhhhhhhh.pptx
Introduction to reinforcement learning
Reinforcement learning in a nutshell
mlcgfxfgtyufuyhjfxcgvhbgfasghjgfghj.pptx
What Can RL do.pptx
A Journey to Reinforcement Learning
Deep Reinforcement Leaning In Machine Learning
A brief overview of Reinforcement Learning applied to games
Reinforcement learning
Horizon: Deep Reinforcement Learning at Scale
從 Atari/AlphaGo/ChatGPT 談深度強化學習及通用人工智慧
AI for energy: the uncertain promising opportunity
Shanghai deep learning meetup 4
Reinforcement Learning Literature review - apr2019/feb2021 (with zip file)
anintroductiontoreinforcementlearning-180912151720.pdf
Introduction to Reinforcement Learning.pptx
My Robot Can Learn -Using Reinforcement Learning to Teach my Robot
RL presentation
Ad

More from Bill Liu (20)

PDF
Walk Through a Real World ML Production Project
PDF
Redefining MLOps with Model Deployment, Management and Observability in Produ...
PDF
Productizing Machine Learning at the Edge
PPTX
Transformers in Vision: From Zero to Hero
PDF
Deep AutoViML For Tensorflow Models and MLOps Workflows
PDF
Metaflow: The ML Infrastructure at Netflix
PDF
Practical Crowdsourcing for ML at Scale
PDF
Building large scale transactional data lake using apache hudi
PDF
Big Data and AI in Fighting Against COVID-19
PDF
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
PDF
Build computer vision models to perform object detection and classification w...
PDF
Causal Inference in Data Science and Machine Learning
PDF
Weekly #106: Deep Learning on Mobile
PDF
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
PDF
AISF19 - On Blending Machine Learning with Microeconomics
PDF
AISF19 - Travel in the AI-First World
PDF
AISF19 - Unleash Computer Vision at the Edge
PDF
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
PDF
Toronto meetup 20190917
PPTX
Feature Engineering for NLP
Walk Through a Real World ML Production Project
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Productizing Machine Learning at the Edge
Transformers in Vision: From Zero to Hero
Deep AutoViML For Tensorflow Models and MLOps Workflows
Metaflow: The ML Infrastructure at Netflix
Practical Crowdsourcing for ML at Scale
Building large scale transactional data lake using apache hudi
Big Data and AI in Fighting Against COVID-19
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Build computer vision models to perform object detection and classification w...
Causal Inference in Data Science and Machine Learning
Weekly #106: Deep Learning on Mobile
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
AISF19 - On Blending Machine Learning with Microeconomics
AISF19 - Travel in the AI-First World
AISF19 - Unleash Computer Vision at the Edge
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
Toronto meetup 20190917
Feature Engineering for NLP

Recently uploaded (20)

PDF
Empathic Computing: Creating Shared Understanding
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPT
Teaching material agriculture food technology
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Encapsulation theory and applications.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Approach and Philosophy of On baking technology
Empathic Computing: Creating Shared Understanding
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Understanding_Digital_Forensics_Presentation.pptx
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Spectral efficient network and resource selection model in 5G networks
Network Security Unit 5.pdf for BCA BBA.
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Teaching material agriculture food technology
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
MYSQL Presentation for SQL database connectivity
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Encapsulation theory and applications.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
The AUB Centre for AI in Media Proposal.docx
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Dropbox Q2 2025 Financial Results & Investor Presentation
Approach and Philosophy of On baking technology

Deep Reinforcement Learning and Its Applications

  • 1. Deep Reinforcement Learning and its Applications Yuxi Li yuxili@gmail.com 2021.04.28
  • 3. Reinforcement Learning (RL) at each time step, an RL agent • receives a state • selects an action • receives a reward • transitions into a new state objective: maximize long term reward [Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT Press.]
  • 4. “Reinforcement” is from classical conditioning • The term “reinforcement” in the context of animal learning came into use in the 1927 English translation of Pavlov’s monograph on conditioned reflexes. • Pavlov described reinforcement as the strengthening of a pattern of behavior due to an animal receiving a stimulus—a reinforcer—in an appropriate temporal relationship with another stimulus or with a response. [Pic from Internet] [Sutton BA in psychology from Stanford University in 1978] [Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT Press.]
  • 6. [Li, Y. (2017). Deep Reinforcement Learning: An Overview. ArXiv.] reinforcement learning supervised learning unsupervised learning deep learning artificial intelligence machine learning deep reinforcement learning artificial neural networks association rule learning Bayesian networks clustering decision tree learning genetic algorithms inductive logic programming reinforcement learning representation learning rule-based machine learning similarity and metric learning sparse dictionary learning support vector machines problem solving search constraint satisfaction knowledge, reasoning, and planning logical agents first-order logic planning and acting knowledge representation probabilistic reasoning decision making learning learning from examples knowledge in learning learning probabilistic models reinforcement learning communication, perceiving, and acting natural language processing perception robotics [Li, Y. (2019). Reinforcement Learning Applications. ArXiv.] in a usual sense (not perfectly correct) • supervised learning makes predictions • myopic • reinforcement learning makes decisions • long-term thinking
  • 7. RL+ DL = AI [David Silver, Deep Reinforcement Learning from AlphaGo to AlphaStar]
  • 8. From Deep Q-Networks (DQN) to Agent57 [Badia, A. P.,et al. (2020). Agent57: Outperforming the atari human benchmark. ArXiv.] [Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533. ]
  • 9. First return, then explore: Go-Explore [Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth Stanley and Jeff Clune, First return, then explore, Nature, February 2021]
  • 11. Dota [OpenAI (2019)] StarCraft [Vinyals et al. (2019)] Poker [Moravcik et al. (2017)] Catch The Flag [Jaderberg et al. (2019)] Curling [Won et al. (2020)] Hide-and-Seek [Baker et al. (2020)] see next slide
  • 12. Potential applications of techniques in games games correspond to fundamental problems in CS/AI, relate to combinatorial optimization, NP-hard problems, control, and operations research Libratus paper mentioned: • business strategy • negotiation • strategic pricing • finance • cybersecurity • military applications • auctions • pricing AlphaGo papers mentioned: • general game-playing • classical planning • partially observed planning • scheduling • constraint satisfaction DeepStack paper mentioned: • defending strategic resources • robust decision making for medical treatment recommendations [Brown, N. and Sandholm, T. (2017). Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science.] [Jaderberg, M.,et al. (2019). Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364:859–865.] [Silver, D.,et al. (2017). Mastering the game of Go without human knowledge. Nature, 550:354–359. ] [Baker, B., et al. (2020). Emergent tool use from multi-agent autocurricula. In ICLR. (hide-and-seek)] [OpenAI (2019). Dota 2 with large scale deep reinforcement learning. ArXiv.] [Silver, D., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144. ] [Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489. ] [Moravcik, M., et al. (2017). DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513.] [Vinyals, O., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575:350–354. ]
  • 13. Games to AI is like fruit flies to genetics. [David Silver RL Course]
  • 16. Multi-Arm Bandits • 25-30% improvements for click- through rate • 18% revenue lift in the landing page [Agarwal, A. et al. (2016). Making contextual decisions with low technical debt. ArXiv.]
  • 17. [Csaba Szepesvári, Bandits - DLRLSS 2019]
  • 18. Decision Service [Agarwal et al. (2016)] abstractions • explore to collect the data • log the data correctly • learn a good model • deploy it in the application • the Client Library implements the Explore abstraction • implements various exploration policies, addressing F1 • the Join Service implements the Log abstraction • joins rewards to decisions • produces correct exploration data to address F2 • enforces a uniform delay before releasing data to the Learn component to avoid delay-related biases, addressing F1 • the exploration data is copied to the Store for offline experimentation • the Online Learner implements the Learn abstraction • incorporates data continuously and checkpoints models to the Deploy component at a configurable rate, addressing F3 • evaluates arbitrary policies in real-time • enables advanced monitoring and safeguards, addressing F4 • the Store implements the Deploy abstraction • provides model and data storage • the Offline Learner uses data for offline experimentation • such as tuning hyper-parameters, evaluating other learning algorithms or policy classes, changing the reward metric, etc., counterfactually accurate • the Feature Generator eases usability by auto-generating features failures • (F1) partial feedback and bias • (F2) incorrect data collection • (F3) changes in the environment • (F4) weak monitoring and debugging https://github.com/Microsoft/mwt-ds http://ds.microsoft.com 2019 Inaugural ACM SIGAI Industry Award for Excellence in Artificial Intelligence multi-world testing vs A/B testing
• 19. Lessons from contextual bandit learning in a customer support bot
• Microsoft Virtual Agent • scenarios: intent disambiguation, contextual recommendations
• consider starting with imitation learning
• consider simplified action spaces
• don't be afraid of principled exploration
• try to support changes in environment
• cautiously close the loop
• consider reward engineering and shaping
• use a separate logging channel
• use and extend existing systems
• pay attention to effective sample size (see the sketch below)
• avoid ε-greedy
• regularize towards the logging policy; this increases the effective sample size, resulting in shorter confidence intervals and reduced overfitting
• design an architecture suited to RL
• balance randomness with predictability
[Karampatziakis, N., et al. (2019). Lessons from real-world reinforcement learning in a customer support bot. In RL4RealLife.]
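To illustrate the effective-sample-size point, here is a minimal inverse-propensity-scoring sketch: when the learned policy stays close to the logging policy, the importance weights stay near 1 and the ESS stays near n. Names are illustrative.

```python
import numpy as np

def ips_value_and_ess(rewards, target_probs, logging_probs):
    """Estimate a new policy's value from logged bandit data via
    inverse propensity scoring, and report the effective sample size
    of the importance weights."""
    w = np.asarray(target_probs) / np.asarray(logging_probs)
    value = float(np.mean(w * np.asarray(rewards)))
    ess = float(w.sum() ** 2 / (w ** 2).sum())
    return value, ess
```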
• 20. RL Applications @ Microsoft
• Personalizer, part of Azure Cognitive Services, within the Azure AI platform
• making its way into more Microsoft products and services: Windows, the Edge browser, Xbox
• developers can plug Azure Cognitive Services into apps and websites
• engineers can use Autonomous Systems to refine manufacturing processes
• Azure Machine Learning previews cloud-based RL offerings for data scientists and ML professionals
• Metrics Advisor incorporates feedback and makes models more adaptive to a customer's dataset, which helps detect more subtle anomalies in sensors, production processes or business metrics
• recommendation, adaptive to the COVID-19 pandemic; finding the optimal jitter buffer for a video meeting; helping determine when to reboot or remediate virtual machines, …
• wide applications:
• deliver tailored recommendations to small grocery stores across Mexico
• manipulate unstable coin bags for a bank in Russia
• collaborate with human players in games for a UK company
With reinforcement learning, Microsoft brings a new class of AI solutions to customers, https://blogs.microsoft.com/ai/reinforcement-learning/
• 21. RecSim: A Configurable Simulation Platform for Recommender Systems
[Ie, E., et al. (2019). Reinforcement learning for slate-based recommender systems: A tractable decomposition and practical methodology. ArXiv.]
[Ie, E., et al. (2019). RecSim: A configurable recommender systems environment. In RL4RealLife.]
[Chen, M., et al. (2019). Top-k off-policy correction for a REINFORCE recommender system. In WSDM.]
[Zhao, X., Xia, L., Tang, J., and Yin, D. (2019). Reinforcement learning for online information seeking. ACM SIGWEB Newsletter (SIGWEB).]
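A sketch of the per-sample weight in the top-k off-policy correction of Chen et al. (2019), simplified: the standard importance ratio between the new policy and the logging policy is multiplied by a correction accounting for the item appearing anywhere in a size-K slate.

```python
def topk_correction_weight(pi_a, beta_a, K):
    """Weight for one logged (state, action) pair in off-policy REINFORCE
    with top-K recommendation: importance ratio times the top-K correction
    K * (1 - pi)^(K - 1), which boosts items the new policy still finds
    unlikely and damps items it already recommends confidently."""
    importance = pi_a / beta_a            # pi: new policy, beta: logging policy
    topk = K * (1.0 - pi_a) ** (K - 1)    # slate-level correction factor
    return importance * topk
```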
• 22. Facebook ReAgent
features
• data preprocessing • feature normalization • deep RL model implementation • multi-GPU training • counterfactual policy evaluation • optimized serving • tested algorithms
applications
• delivering more relevant notifications • optimizing streaming video bit rates • improving M suggestions in Messenger
pipeline
• timeline generation, runs across thousands of CPUs • training, runs across many GPUs • serving, spans thousands of machines
Horizon: The first open source reinforcement learning platform for large-scale products and services, https://code.fb.com/ml-applications/horizon/
[https://github.com/facebookresearch/ReAgent]
[Gauci, J., et al. (2019). Horizon: Facebook's open source applied reinforcement learning platform. In RL4RealLife.]
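Counterfactual policy evaluation is central to shipping RL safely without online experiments; below is a doubly robust estimator as one standard example (ReAgent implements several estimators; this sketch is not its API). `q_hat` and `v_hat` come from an assumed learned reward model.

```python
import numpy as np

def doubly_robust(rewards, q_hat, v_hat, target_probs, logging_probs):
    """Doubly robust off-policy value estimate on logged bandit data:
    a model-based estimate v_hat corrected by an importance-weighted
    residual; consistent if either the model or the weights are right."""
    w = np.asarray(target_probs) / np.asarray(logging_probs)
    return float(np.mean(v_hat + w * (np.asarray(rewards) - q_hat)))
```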
• 23. Ride-Hailing Order Dispatching at DiDi via Reinforcement Learning
INFORMS 2019 Wagner Prize Winner
challenges
• dynamic and stochastic supply and demand • system response time • reliability
• multiple business objectives
• driver-centric objective: maximize the total income of the drivers on the platform
• passenger-centric objective: minimize the average pickup distance of all the assigned orders
• marketplace efficiency metrics: response rate, fulfillment rate
• production requirements and constraints: computational efficiency, system reliability, changing business requirements
solution approaches (see the TD(0) sketch below)
• combinatorial optimization, myopic • semi-Markov decision process • tabular temporal difference learning • deep (hierarchical) RL • transfer learning
[Qin, Z. T., et al. (2020). Ride-hailing order dispatching at DiDi via reinforcement learning. INFORMS Journal on Applied Analytics, 50(5):272–286.]
[Tony Qin talk, https://www.arlseminar.com/speakers/]
[Tony Qin tutorials, https://tonyzqin.wordpress.com]
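A minimal tabular TD(0) sketch in the spirit of the dispatching work, where a value table over (zone, time) cells estimates a driver's expected future income; the state encoding and constants are illustrative.

```python
def td0_update(V, state, reward, next_state, alpha=0.05, gamma=0.99):
    """One tabular TD(0) step: move V[state] toward the bootstrapped
    target reward + gamma * V[next_state]. A dispatcher can then score
    a candidate (driver, order) match by immediate income plus the
    value of the driver's destination cell."""
    target = reward + gamma * V.get(next_state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
    return V

# toy usage: state = (zone_id, time_bucket)
V = {}
td0_update(V, state=(12, 9), reward=8.5, next_state=(30, 10))
```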
• 24. Autonomous navigation of stratospheric balloons using reinforcement learning
• a cost-effective platform for communication, Earth observation, gathering meteorological data and other applications
• to navigate, a flight controller ascends and descends to find and follow favourable wind currents
• vertical motion by pumping air ballast in and out of a fixed-volume envelope; horizontal motion by the winds
• station-keeping: maintaining the balloon within a range of its station, for communication
• input signals: wind speed and solar elevation
• challenges: imperfect data, sparse wind measurements resulting in partial observability, power management
• neither conventional methods nor human intervention suffice
• use reinforcement learning to train a flight controller from simulations
• use data augmentation and a self-correcting design to overcome imperfect data
• robust to the natural diversity in stratospheric winds
• 39-day controlled experiment over the Pacific Ocean
[Bellemare, M. G., et al. (2020). Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588:77–82.]
Saying goodbye to Loon, https://medium.com/loon-for-all/loon-draft-c3fcebc11f3f, Jan 22, 2021
  • 25. [Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., and Hutter, M. (2020). Learning quadrupedal locomotion over challenging terrain. Science Robotics, 5. ] Learning quadrupedal locomotion over challenging terrain
  • 26. [Mao, H., et al. (2019). Park: An open platform for learning augmented computer systems. In NeurIPS.]
• 27. [Mirhoseini, A., et al. (2020). Chip placement with deep reinforcement learning. ArXiv.] [Zoph, B. and Le, Q. V. (2017). Neural architecture search with reinforcement learning. In ICLR.] [Cubuk, E., et al. (2019). AutoAugment: Learning augmentation policies from data. In CVPR.] [Mirhoseini, A., et al. (2017). Device placement optimization with reinforcement learning. In ICML.]
• 28. AutoML-Zero: Evolving Machine Learning Algorithms From Scratch; Evolving Reinforcement Learning Algorithms (AutoRL)
[Esteban Real, Chen Liang, David R. So and Quoc V. Le. (2020). AutoML-Zero: Evolving Machine Learning Algorithms From Scratch. ICML.]
[John D. Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Sergey Levine, Quoc V. Le, Honglak Lee, and Aleksandra Faust. (2021). Evolving Reinforcement Learning Algorithms. ICLR.]
• 29. Combinatorial Optimization
• end-to-end learning
• learning to configure algorithms (hyper-parameters)
• ML alongside optimization algorithms (branch and bound)
[Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. (2021). Machine Learning for Combinatorial Optimization: a Methodological Tour d'Horizon. European Journal of Operational Research, 290(2):405–421.]
[Chen, X. and Tian, Y. (2019). Learning to perform local rewriting for combinatorial optimization. In NeurIPS.]
[Kool, W., van Hoof, H., and Welling, M. (2019). Attention, learn to solve routing problems! In ICLR.]
[Lu, H., Zhang, X., and Yang, S. (2020). A learning-based iterative method for solving vehicle routing problems. In ICLR.]
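As a concrete instance of end-to-end learning for routing, here is a sketch of the REINFORCE signal with a greedy-rollout baseline in the spirit of Kool et al. (2019); the tours are assumed to come from a sampled and a greedy decoding of some policy network (not shown).

```python
import numpy as np

def tour_length(coords, tour):
    """Length of a closed tour over 2-D city coordinates."""
    ordered = coords[np.asarray(tour)]
    return float(np.linalg.norm(np.roll(ordered, -1, axis=0) - ordered, axis=1).sum())

def reinforce_signal(coords, sampled_tour, greedy_tour):
    """Advantage for the policy-gradient update: how much shorter the
    sampled tour is than the greedy baseline rollout."""
    return tour_length(coords, greedy_tour) - tour_length(coords, sampled_tour)
```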
  • 30. SMARTS: Scalable Multi-Agent RL Training School for Autonomous Driving [Zhou, M., et al. (2020). SMARTS: Scalable multi-agent reinforcement learning training school for autonomous driving. In Conference on Robot Learning (CoRL). ]
• 31. Wuji: Automatic Online Combat Game Testing Using Evolutionary Deep Reinforcement Learning
ACM SIGSOFT Distinguished Paper Award
bug types: • crash bugs • stuck bugs • logic bugs • gaming balance bugs • user experience bugs
approach: • automatic game testing framework • evolutionary algorithm • deep RL • multi-objective optimization
[Zheng, Y., et al. (2019). Wuji: Automatic online combat game testing using evolutionary deep reinforcement learning. In ASE.]
• 32. RL for Instructional Sequencing
• RL has been most successful in cases where it has been constrained with ideas and theories from cognitive psychology and the learning sciences
[Doroudi, S., Aleven, V., and Brunskill, E. (2019). Where's the reward? A review of reinforcement learning for instructional sequencing. International Journal of Artificial Intelligence in Education, 29:568–620.]
• 33. • elementary school children learn to manipulate money
• four principal regions: wallet location, repository location, object location, text location
• the ITS dynamically proposes to students the exercises currently making maximal learning progress
• aiming to maximize intrinsic motivation and learning efficiency
• intrinsic motivation is defined as doing an activity for its inherent satisfaction, fun or challenge, rather than for external products, pressures or rewards
[Oudeyer, P.-Y., et al. (2016). Intrinsic motivation, curiosity and learning: theory and applications in educational technologies. Progress in Brain Research, Elsevier, 229:257–284.]
• 34. Flow: the psychology of optimal experience
A good life is one that is characterized by complete absorption in what one does.
• challenge-skill balance • action-awareness merging • clear goals • unambiguous feedback • concentration on the task at hand • sense of control • loss of self-consciousness • transformation of time • an autotelic experience
[Nakamura, J. and Csikszentmihalyi, M. (2014). The concept of flow. In Csikszentmihalyi, M., editor, Flow and the Foundations of Positive Psychology, pages 239–263. Springer.]
• 35. A Generic Approach to Challenge Modeling for the Procedural Creation of Video Game Levels
• vertical arrows represent the challenging events (holes) as unit impulses
• the curve represents the amount of accumulated challenge in the time window
• anxiety depends on the accumulated challenge
• fun as a response to increasing anxiety
• define the reward function with quantitative challenge and fun (see the sketch below)
• use RL to procedurally generate content for Super Mario
[Sorenson, N., Pasquier, P., and DiPaola, S. (2011). A generic approach to challenge modeling for the procedural creation of video game levels. IEEE Transactions on Computational Intelligence and AI in Games, 3(3):229–244.]
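A loose numerical reading of the challenge model, assuming challenge events decay exponentially within the time window; the decay constant and discretization are illustrative, not the paper's exact formulation.

```python
import numpy as np

def accumulated_challenge(event_times, horizon, decay=0.9, dt=1.0):
    """Treat challenge events (e.g., holes) as unit impulses and
    accumulate them with exponential decay; the resulting curve is a
    simple proxy for the anxiety that the reward function responds to."""
    t = np.arange(0.0, horizon, dt)
    c = np.zeros_like(t)
    for i in range(1, len(t)):
        impulse = sum(1.0 for e in event_times if abs(t[i] - e) < dt / 2)
        c[i] = decay * c[i - 1] + impulse
    return t, c
```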
• 36. Mobile Healthcare
• micro-randomized trials, collecting data for offline analysis
• just-in-time adaptive interventions (JITAIs): the next generation of mobile healthcare delivery that is automated, scalable, evidence-driven and inexpensive
• iterating between offline analysis and online personalization
• more data and better algorithms over time gradually improve JITAIs
• use positive psychology to improve adoption, engagement and effect?
• chronic diseases: migraine, diabetes, obesity, etc.
• mental health: schizophrenia, depression, anxiety, etc.
• wellness: fitness management, sedentary behavior, etc.
[Menictas, M., Rabbi, M., Klasnja, P., and Murphy, S. (2019). Artificial intelligence decision-making in mobile health. The Biochemist, 41(5):20–24.]
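Micro-randomized decisions of this kind are often made with bandit algorithms; below is a minimal Bernoulli Thompson-sampling sketch for choosing between intervention options (e.g., send a prompt or not). The Beta(1,1) priors and option set are illustrative.

```python
import random

def thompson_choice(successes, failures):
    """Bernoulli Thompson sampling: draw each option's success rate from
    its Beta posterior and act on the option with the largest draw, so
    exploration tracks posterior uncertainty."""
    draws = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])

# toy usage over two options: (no prompt, prompt)
option = thompson_choice(successes=[4, 7], failures=[16, 13])
```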
• 37. FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance
[FinRL: https://github.com/AI4Finance-LLC/FinRL-Library]
[Wang, J., et al. (2019). AlphaStock: A buying-winners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks. In KDD.]
  • 38. The AI Economist: Improving Equality and Productivity with AI-Driven Tax Policies [Stephan Zheng, Alexander Trott, Sunil Srinivasa, Nikhil Naik, Melvin Gruesbeck, David C. Parkes, and Richard Socher, (2020) The AI Economist: Improving Equality and Productivity with AI-Driven Tax Policies. ArXiv.]
• 39. Data center cooling using model-predictive control
notes for RL in process control:
• the goal can be complex and can have safety constraints
• there can be stability constraints
• there can be state constraints
• failure can be avoided with safety constraints during execution
[Lazic, N., et al. (2018). Data center cooling using model-predictive control. In NeurIPS.]
[Warren B. Powell (2019). From Reinforcement Learning to Optimal Control: A unified framework for sequential decisions. ArXiv.]
[Shin, J., Badgwell, T. A., Liu, K., and Lee, J. H. (2019). Reinforcement learning - overview of recent progress and implications for process control. Computers and Chemical Engineering, 127:282–294.]
[R. Nian, J. Liu and B. Huang (2020). A review on reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139.]
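A toy model-predictive-control sketch for the cooling setting: simulate each candidate action over a short horizon with an assumed learned model `predict(temp, action) -> (next_temp, energy)`, reject trajectories that violate the temperature constraint, and pick the cheapest feasible action. Purely illustrative of the constrained-control pattern, not the cited system.

```python
import math

def mpc_action(temp, actions, predict, horizon=10, t_max=27.0):
    """One MPC step: constant-action rollouts, a hard state constraint on
    temperature, and minimal predicted energy among feasible candidates."""
    best, best_cost = None, math.inf
    for a in actions:
        t, cost, feasible = temp, 0.0, True
        for _ in range(horizon):
            t, energy = predict(t, a)
            cost += energy
            if t > t_max:          # safety/state constraint violated
                feasible = False
                break
        if feasible and cost < best_cost:
            best, best_cost = a, cost
    return best
```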
• 40. Gym-ANM: Reinforcement learning environments for active network management tasks in electricity distribution systems
[Robin Henry and Damien Ernst (2021). Gym-ANM: Reinforcement learning environments for active network management tasks in electricity distribution systems. ArXiv.]
• 41. Drug design
[Zhavoronkov, A., et al. (2019). Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nature Biotechnology, 37:1038–1040.]
• 42. AlphaFold
• what a protein does largely depends on its unique 3D structure
• protein folding problem: figure out what shapes proteins fold into
• a grand challenge in biology for the past 50 years
[https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology]
• 43. COVID-19
• RL is a promising framework to combat epidemics
• epidemics are a fruitful application area for RL to make substantial real-life impact
• communities of RL and AI, epidemiology and public health, and economics should collaborate to combat epidemics
[Yinyu Ye, Optimization and operations research in mitigation of a pandemic, 2020]
[Yuxi Li, Combat epidemics with reinforcement learning, https://attain.ai, 2020]
• 45. Software engineering for machine learning
• the nine stages of the machine learning workflow
• data-oriented: collection, cleaning, and labeling
• model-oriented: model requirements, feature engineering, training, evaluation, deployment, and monitoring
• many feedback loops in the workflow
• larger feedback arrows: model evaluation and monitoring may loop back to any of the previous stages
• smaller feedback arrow: model training may loop back to feature engineering, e.g., in representation learning
• three aspects of the AI domain that make it fundamentally different from prior software application domains:
• discovering, managing, and versioning the data needed for machine learning applications is much more complex and difficult than in other types of software engineering
• model customization and model reuse require very different skills than are typically found in software teams
• AI components are more difficult to handle as distinct modules than traditional software components; models may be "entangled" in complex ways and experience non-monotonic error behavior
[Amershi, S., et al. (2019). Software engineering for machine learning: A case study. In ICSE.]
• 46. Hidden Technical Debt in Machine Learning Systems
• technical debt: the long-term costs that accumulate when expedient but suboptimal decisions are made in the short term
• ML systems incur massive hidden costs at the system level, beyond the basic code complexity issues of traditional software systems
issues: • boundary erosion • entanglement • hidden feedback loops • undeclared consumers • data dependencies • configuration issues • changes in the external world • system-level anti-patterns
potential solutions: • refactoring code • improving unit tests • deleting dead code • reducing dependencies • tightening APIs • improving documentation
[Sculley, D., et al. (2014). Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).]
  • 47. Challenges in Deploying Machine Learning [Andrei Paleyes, Raoul-Gabriel Urma, Neil D. Lawrence, (2020) Challenges in Deploying Machine Learning: a Survey of Case Studies, ArXiv.]
  • 48. Data set and software engineering • data set for reinforcement learning? • in particular, for contextual bandits • something for RL like ImageNet for deep learning • software engineering for reinforcement learning? • Personalizer / Decision Service by Microsoft
• 49. RL competitions
• AWS DeepRacer League
• Flatland: Multi-Agent Reinforcement Learning on Trains
• KDD Cup 2020: Learning to dispatch and reposition on a mobility-on-demand platform
• Learning to Run a Power Network Challenge
• SMARTS Competition of Autonomous Driving
https://github.com/seungjaeryanlee/awesome-rl-competitions
• 50. Offline RL / Batch RL
• (a) online RL: the policy π_k is updated with streaming data collected by π_k itself
• (b) off-policy RL: the agent's experience is appended to a data buffer D (also called a replay buffer), and each new policy π_k collects additional data, such that D is composed of samples from π_0, π_1, …, π_k, and all of this data is used to train an updated new policy π_{k+1}
• (c) offline RL: employs a dataset D collected by some (potentially unknown) behavior policy π_β; the dataset is collected once and is not altered during training, which makes it feasible to use large previously collected datasets; the training process does not interact with the MDP at all, and the policy is only deployed after being fully trained
[Sergey Levine, Aviral Kumar, George Tucker and Justin Fu. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. ArXiv.]
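A skeleton contrasting offline training with the online loops above, assuming a dataset of (s, a, r, s') tuples and some update rule `q_update` (e.g., a conservative Q-learning step); names are illustrative.

```python
import random

def train_offline(dataset, q_update, steps=10000, batch_size=256):
    """Offline RL skeleton: the dataset D was collected once by a
    behavior policy and never grows; training only samples from D,
    and the learned policy is deployed after training finishes."""
    for _ in range(steps):
        batch = random.sample(dataset, batch_size)  # (s, a, r, s') tuples
        q_update(batch)  # no environment interaction anywhere in the loop
```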
• 51. The challenges of real-world RL
• being able to learn on live systems from limited samples
• dealing with unknown and potentially large delays in the system actuators, sensors, or rewards
• learning and acting in high-dimensional state and action spaces
• reasoning about system constraints that should never or rarely be violated
• interacting with systems that are partially observable, which can alternatively be viewed as systems that are non-stationary or stochastic
• learning from multi-objective or poorly specified reward functions
• being able to provide actions in real time, especially for systems with high control frequencies
• training offline from the fixed logs of an external behavior policy
• providing system operators with explainable policies
[Dulac-Arnold, G., Mankowitz, D., and Hester, T. (2019). Challenges of real-world reinforcement learning. In RL4RealLife.]
• 52. A Model for Motivation, Engagement, and Thriving in the User Experience
• the basic psychological needs of autonomy, competence and relatedness mediate positive user experience outcomes such as engagement, motivation and thriving
• as such, they constitute specific measurable parameters for which designers can design in order to foster these outcomes within different spheres of experience
• self-determination theory and positive psychology: helpful for HCI? for the reward function?
[Peters, D., et al. (2018). Designing for motivation, engagement and wellbeing in digital experience. Frontiers in Psychology, 9(797).]
• 53. The foundation of efficient robot learning
• sample efficient: requiring relatively few training examples
• generalizable: applicable to many situations other than the one(s) it learned from
• compositional: represented in a form that allows it to be combined with previous knowledge
• incremental: capable of adding new knowledge and abilities over time
[Kaelbling, L. P. (2020). The foundation of efficient robot learning. Science, 369(6506):915–916.]
• 54. Prior knowledge/structure in machine learning: bitter or better lesson?
• Richard Sutton: the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin
• chess, computer Go, speech recognition, computer vision
• leverage the great power of general-purpose methods: search and learning
• the actual contents of minds are very complex
• build in only the meta-methods that can find and capture the arbitrary complexity
• let our methods search for good approximations, rather than doing it ourselves
• Rodney Brooks: we have to take into account the total cost of any solution, and so far they have all required substantial amounts of human ingenuity
• convolutional neural networks, designed by humans to manage translational invariance
• issues like color constancy, needed to avoid a CNN recognizing a stop sign with some pieces of tape on it as a 45 mph speed limit sign
• network architecture design
• massive data sets, amount of computation, power consumption
• Moore's Law slows down; breakdown of Dennard scaling
• special-purpose computer architecture needs human analysis
• Thomas Dietterich: both are right
• Richard: we have achieved significant advances in performance by replacing (some kinds of) human engineering with machine learning from big data
• Rodney: we need to find better ways of encoding knowledge into network structure (or other prior constraints)
Sutton, R. (2019). The bitter lesson. http://incompleteideas.net/IncIdeas/BitterLesson.html
Brooks, R. (2019). A better lesson. https://rodneybrooks.com/a-better-lesson/
• 55. Yoshua Bengio, From System 1 Deep Learning to System 2 Deep Learning, Posner lecture at NeurIPS 2019
The AI community borrows ideas from psychology that underlie a Nobel Prize in economics (Kahneman's System 1 and System 2 thinking).
• 56. Reinforcement Learning, Fast and Slow
• sample inefficiency
• the requirement for incremental parameter adjustment, to maximize generalization and avoid overwriting the effects of earlier learning
• inductive bias; bias–variance trade-off: the stronger the assumptions, the less data needed
• episodic deep RL: fast learning through episodic memory; form useful internal representations or embeddings of each new observation (see the sketch below)
• meta-RL: speed up deep RL by learning to learn; narrow the hypothesis space
• episodic meta-RL: fast learning arises from, and is enabled by, slow learning
• relevance to neuroscience and psychology
[Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., and Hassabis, D. (2019). Reinforcement learning, fast and slow. Trends in Cognitive Sciences, 23(5):408–422.]
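A sketch of the episodic-memory idea behind fast RL: estimate a value by nearest-neighbor lookup over stored embeddings and returns, so a single good experience can immediately influence behavior. The memory layout is an assumption for illustration.

```python
import numpy as np

def episodic_value(memory_keys, memory_returns, embedding, k=5):
    """Episodic control estimate: average the returns of the k stored
    states whose embeddings are closest to the current one."""
    dists = np.linalg.norm(memory_keys - embedding, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(memory_returns[nearest].mean())
```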
• 57. Causality
[Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio (2021). Toward Causal Representation Learning. Proceedings of the IEEE.]
[Elias Bareinboim, Causal Reinforcement Learning, ICML 2020 Tutorial. https://crl.causalai.net/]
  • 58. Representation Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, Yoshua Bengio, (2021). Toward Causal Representation Learning, Proceedings of the IEEE. [Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.] [Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. (2021) A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4-24, Jan. 2021] [Yao Ma and Jiliang Tang. (2020) Deep Learning on Graphs. Cambridge University Press]
  • 59. Interpretability [Belle, V. and Papantonis, I. (2020). Principles and practice of explainable machine learning. ArXiv.] [W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl and Bin Yu. (2019) Definitions, methods, and applications in interpretable machine learning. PNAS 116 (44) 22071–22080] [Finale Doshi-Velez and Been Kim. (2017) Towards A Rigorous Science of Interpretable Machine Learning] [Alharin, A., Doan, T.-N., and Sartipi, M. (2020). Reinforcement learning interpretation methods: A survey. IEEE Access, 8:171058 – 171077.]
• 60. Guidelines for RL in healthcare
• access to all variables influencing decision making
• the effective sample size: larger if the learned policies are close to the clinician policies
• most reliable for refining existing practices rather than discovering new treatment approaches
• the possibilities for mismatch between the actual decision and the proposed decision grow with the number of decisions in the patient's history
• interrogate RL-learned policies to assess whether they will behave prospectively as intended
• consider problem formulation, reward function definition, data recording or preprocessing, transferability to new scenarios, interpretability
(figure notes: problem scale - combinations; effective sample size in off-policy evaluation)
[Gottesman, O., et al. (2019). Guidelines for reinforcement learning in healthcare. Nature Medicine, 25:14–18.]
  • 61. Do no harm: a roadmap for responsible machine learning for health care [Wiens, J. et al. (2019). Do no harm: a roadmap for responsible machine learning for health care. Nature Medicine, 25:1337–1340.] Hippocratic oath for AI?
• 62. Constraints
• satisfying constraints during exploration and operation
• trading off multiple objectives
• autonomous car: safety, efficiency and comfort
• combating COVID-19 vs maintaining economic productivity
[Dulac-Arnold, G., Mankowitz, D., and Hester, T. (2019). Challenges of real-world reinforcement learning. In RL4RealLife.]
[Csaba Szepesvári (2020). Constrained MDPs and the reward hypothesis. https://readingsml.blogspot.com/2020/03/constrained-mdps-and-reward-hypothesis.html]
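One common way to handle such constraints is Lagrangian relaxation: fold the constraint cost into the reward and adapt the multiplier by dual ascent. A minimal sketch, with the cost budget and learning rate as illustrative parameters.

```python
def shaped_reward(reward, cost, lam):
    """Constrained RL via Lagrangian relaxation: the agent maximizes
    reward minus lambda times the constraint cost."""
    return reward - lam * cost

def dual_ascent(lam, avg_cost, budget, lr=0.01):
    """Raise lambda when the policy's average cost exceeds the budget,
    lower it otherwise; lambda stays non-negative."""
    return max(0.0, lam + lr * (avg_cost - budget))
```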
• 63. AI creates a new type of business
• lower gross margins due to heavy cloud infrastructure usage and ongoing human support
• scaling challenges due to the thorny problem of edge cases
• weaker defensive moats due to the commoditization of AI models and challenges with data network effects
• AI companies appear, increasingly, to combine elements of both software and services, with gross margins, scaling, and defensibility that may represent a new class of business entirely
practical advice for founders:
• eliminate model complexity as much as possible
• choose problem domains carefully to reduce data complexity
• plan for high variable costs
• embrace services
• plan for change in the tech stack
• build defensibility the old-fashioned way
[Andreessen Horowitz blog. The new business of AI (and how it's different from traditional software). https://a16z.com/2020/02/16/the-new-business-of-ai-and-how-its-different-from-traditional-software/]
• 64. MLOps: From Model-centric to Data-centric AI
• MLOps' most important task is to make high-quality data available through all stages of the ML project lifecycle
• AI system = Code + Data
• model-centric AI: how can you change the model (code) to improve performance?
• data-centric AI: how can you systematically change your data (inputs x or labels y) to improve performance?
• important frontier: MLOps tools to make data-centric AI an efficient and systematic process
[Andrew Ng, A Chat with Andrew on MLOps: From Model-centric to Data-centric AI, https://tinyurl.com/8dzjmexd]
Bridging AI's POC-to-production gap
• small-data algorithms include synthetic data generation (e.g., generative adversarial networks, GANs), one/few-shot learning (e.g., GPT-3), self-supervised learning, transfer learning, anomaly detection, etc.
• will your model generalize to a different dataset than what it was trained on? a model that works in a published paper often does not work in production
• production AI projects require more than ML code
• manage the change the technology brings: budget enough time, identify all stakeholders, provide reassurance, explain what's happening and why, right-size the first project
• key technical tools: explainable AI, auditing
[Ng, A. (2020). Bridging AI's proof-of-concept to production gap. https://tinyurl.com/u45zer7j]
• 65. [Csaba Szepesvári, DL Day talk @ KDD 2020, https://sites.ualberta.ca/~szepesva/talks.html]
SHOULD I GET INTO RL? • This is RL, that is not RL • Dangers of RL • Safety guarantees • RL is done • RL does not work
RL IS PROBLEMATIC!? • Testing on training data? • Generalization & RL • Focus on simulators • Bad problem • Speed of RL
META MYTHS • Breaking curses • Data/generality wins • SOTA
NEIGHBORS OF RL • Alternatives ≫ RL • Self-supervised learning and RL • Causality
• 66. Dimitri Bertsekas: cautiously positive
• There are enough methods to try with a reasonable chance of success for most types of optimization problems.
• There are no methods that are guaranteed to work for all or even most problems.
• We can begin to address practical problems of unimaginable difficulty!
• There is an exciting journey ahead!
• see more in the slides and the new book on RL and Optimal Control, http://web.mit.edu/dimitrib/www/RLbook.html
• 67. Warren Powell: Sequential Decision Analytics
four classes of policies (schematic stubs below)
• policy function approximations (PFAs): parameterized policies
• cost function approximations (CFAs): upper confidence bounding (UCB)
• value function approximations (VFAs): Q-learning
• direct lookaheads (DLAs): Monte Carlo tree search
• which one is best? meta-learning?
How would Rich Sutton like this?
[https://castlelab.princeton.edu/sda/]
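Schematic stubs for the four policy classes, purely illustrative (all names hypothetical), to show how differently each class computes a decision:

```python
import numpy as np

def pfa(state, theta):
    """Policy function approximation: a directly parameterized rule."""
    return float(np.dot(theta, state))

def cfa(costs, bonus):
    """Cost function approximation: optimize a modified objective,
    e.g., UCB-style cost minus an uncertainty bonus."""
    return int(np.argmin(np.asarray(costs) - np.asarray(bonus)))

def vfa(state, Q, actions):
    """Value function approximation: act greedily w.r.t. a learned Q."""
    return max(actions, key=lambda a: Q[(state, a)])

def dla(state, actions, rollout, n=100):
    """Direct lookahead: evaluate actions by simulating the future;
    Monte Carlo tree search is the prominent instance."""
    return max(actions, key=lambda a: sum(rollout(state, a) for _ in range(n)))
```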
• 68. Foundation
Rich Sutton, AI Debate 2, https://www.youtube.com/watch?v=VOI3Bb3p4GM, around 35'
• David Marr's three levels at which any information processing machine must be understood: computational theory, representation and algorithm, and hardware implementation
• AI has surprisingly little computational theory
• the big ideas are mostly at the middle level of representation and algorithm
• reinforcement learning is the first computational theory of intelligence
• RL is explicitly about the goal, the whats and whys of intelligence
Alekh Agarwal, Akshay Krishnamurthy, and John Langford, FOCS 2020 tutorial on the Theoretical Foundations of Reinforcement Learning, https://hunch.net/~tforl/
• 70. Harvard Business Review: Why AI That Teaches Itself to Achieve a Goal Is the Next Big Thing
How to spot an opportunity for reinforcement learning:
• make a list
• consider other options
• be careful what you wish for
• ask whether it's worth it
• prepare to be patient
https://hbr.org/2021/04/why-ai-that-teaches-itself-to-achieve-a-goal-is-the-next-big-thing
• 71. When is RL helpful?
• when big data are available: from the model, a good simulator, or interaction
• natural science and engineering
• usually with a clear objective function and a standard answer, straightforward to evaluate
• AlphaGo; combinatorial optimization, operations research, optimal control, drug design, etc.
• social science and humanities
• usually "human in the loop"; usually influenced by psychology, behavioural science, etc.; subjective; may not have a standard answer; may not be easy to evaluate
• game design and evaluation, education
• concepts from psychology, e.g., intrinsic motivation and self-determination theory, may serve as a bridge connecting RL/AI with social science and art, e.g., by defining the reward function
• 72. Applications everywhere
• Reinforcement learning solves sequential decision making problems.
• Reinforcement learning intelligently automates previously manually designed strategies, e.g., those based on heuristics.
• 73. [Figure: map of reinforcement learning applications across games, robotics, healthcare, business, management, science, engineering, humanities, finance, education, energy, transportation, and computer systems; example topics: Atari, Go, poker, StarCraft, game theory, PCG, testing, gamification, perception, planning, navigation, locomotion, sim-to-real, smart grid, power mgmt, data center, VRP, inventory, AGV, resource mgmt, neural arch., computer vision, NLP, software, hardware, networks, maths, physics, chemistry, bio, psychology, neural sci., OR, optimal ctrl, music, drawing, economic sectors, recommender, sequencing, motivation, DTRs, mobile, scheduling, pricing, trading, portfolio opt., risk mgmt, e-commerce, customer mgmt, process ctrl, maintenance, traffic signal, order matching, V2X, logistics, manufacture]
[Yuxi Li, Deep Reinforcement Learning: An Overview. ArXiv. https://bit.ly/2AidXm1]
• 74. Resources
• Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT Press
• David Silver, Reinforcement Learning, 2015
• Reinforcement Learning course from Univ. of Alberta on Coursera
• OpenAI Spinning Up in Deep RL
• DeepMind & UCL Advanced Deep Learning and Reinforcement Learning
• Sergey Levine, UC Berkeley, Deep Reinforcement Learning
• https://github.com/ShangtongZhang/reinforcement-learning-an-introduction
• Yuxi Li, Deep Reinforcement Learning: An Overview, arXiv, 2017
• Yuxi Li, Reinforcement Learning Applications, arXiv, 2019 (plan to update soon)
• Yuxi Li, Resources for Deep Reinforcement Learning, medium.com
  • 75. Reinforcement Learning for Real Life • Machine Learning Journal Special Issue • ICML 2021 workshop • 2020 virtual workshop • ICML 2019 workshop
• 76. RL4RealLife Workshop @ ICML 2021, https://sites.google.com/view/RL4RealLife
Co-chairs
Panels: RL Foundation • RL + RecSys • RL + Robotics • RL + OR • RL Explainability & Interpretability • TBA: RL Research-to-RealLife Gap
CFP deadline: June 12
  • 77. Machine Learning Journal Special Issue Reinforcement Learning for Real Life accepted so far, more to come • Dealing with multiple experts and non-stationarity in inverse reinforcement learning: an application to real-life problems • Partially observable environment estimation with uplift inference for reinforcement learning based recommendation • Automatic discovery of interpretable planning strategies • Challenges of Real-World Reinforcement Learning: Definitions, Benchmarks & Analysis • IntelligentPooling: Practical Thompson Sampling for mHealth • Bandit Algorithms to Personalize Educational Chatbots • Air Learning: A Deep Reinforcement Learning Gym for Autonomous Aerial Robot Visual Navigation • Grounded Action Transformation for Sim-to-Real Reinforcement Learning • Inverse Reinforcement Learning in Contextual MDPs • Lessons on off-policy methods from a notification component of a chatbot
• 78. Not widely commercialized yet. Why?
• no "AlphaGo moment" in practice, no killer application, yet
• still challenges with implementation, algorithms and theory
• software engineering, system deployment, technical debt
• business model: software + services, gross margins, scaling, defensive moats
• resources still insufficient: talent, compute, funding
• investment: chicken-and-egg, long-term thinking, trial-and-error spirit
• slow adoption of new technology, esp. by traditional industrial sectors
• technical route: AI+ or +AI? collaborate with domain experts
• learning curve, steeper than deep learning
• more education and training, necessary for all, from engineer to CEO
  • 79. (Deep)RL doesn’t work? Lots of successful stories already. It requires accumulation of knowledge and experience, resources including talents and compute, and patience. [modified a pic from Internet]
  • 80. The time for reinforcement learning is coming. may not be another AlphaGo moment or a killer application may be permeating slowly and gradually promising yet challenging