Deep Reinforcement Learning
and its Applications
Yuxi Li 

yuxili@gmail.com

2021.04.28
Deep Reinforcement Learning and Its Applications
Reinforcement Learning (RL)
at each time step, an RL agent

• receives a state 

• selects an action

• receives a reward 

• transitions into a new state 

objective: maximize long-term reward
[Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT Press.]
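The loop above maps directly to code. A minimal sketch (illustration only, not from the slides), assuming a Gym-style environment API where step() returns (next_state, reward, done, info) and a placeholder select_action function:

```python
# Minimal agent-environment interaction loop (pre-0.26 Gym API assumed).
# `env` and `select_action` are placeholders, not from the slides.
def run_episode(env, select_action, gamma=0.99):
    state = env.reset()                      # agent receives a state
    ret, discount, done = 0.0, 1.0, False
    while not done:
        action = select_action(state)        # selects an action
        state, reward, done, _ = env.step(action)  # reward + new state
        ret += discount * reward             # accumulate long-term reward
        discount *= gamma
    return ret                               # objective: maximize this
```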
“Reinforcement”
is from classical conditioning
• The term “reinforcement” in the context of animal learning came into use in
the 1927 English translation of Pavlov’s monograph on conditioned reflexes. 

• Pavlov described reinforcement as the strengthening of a pattern of
behavior due to an animal receiving a stimulus—a reinforcer—in an
appropriate temporal relationship with another stimulus or with a response.
[Pic from Internet]
[Sutton earned a BA in psychology from Stanford University in 1978]
[Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT Press.]
[David Silver RL Course]
[Li, Y. (2017). Deep Reinforcement Learning: An Overview. ArXiv.]
[Figure: Venn diagram relating the fields — deep reinforcement learning sits at the intersection of deep learning and reinforcement learning; reinforcement, supervised, and unsupervised learning are kinds of machine learning, which is part of artificial intelligence]

machine learning techniques: artificial neural networks, association rule learning, Bayesian networks, clustering, decision tree learning, genetic algorithms, inductive logic programming, reinforcement learning, representation learning, rule-based machine learning, similarity and metric learning, sparse dictionary learning, support vector machines

AI topics: problem solving (search, constraint satisfaction); knowledge, reasoning, and planning (logical agents, first-order logic, planning and acting, knowledge representation, probabilistic reasoning, decision making); learning (learning from examples, knowledge in learning, learning probabilistic models, reinforcement learning); communication, perceiving, and acting (natural language processing, perception, robotics)
[Li, Y. (2019). Reinforcement Learning Applications. ArXiv.]
loosely speaking (not strictly correct)

• supervised learning makes predictions 

• myopic

• reinforcement learning makes decisions

• long-term thinking
RL + DL = AI
[David Silver, Deep Reinforcement Learning from AlphaGo to AlphaStar]
From Deep Q-Networks (DQN)
to Agent57
[Badia, A. P., et al. (2020). Agent57: Outperforming the Atari human benchmark. ArXiv.]
[Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533. ]
First return, then explore: Go-Explore
[Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth Stanley and Jeff Clune, First return, then explore, Nature, February 2021]
Classical conditioning
• The term "reinforcement" appeared in
the 1927 English translation of Pavlov's
monograph on conditioned reflexes

• Pavlov described reinforcement as the
strengthening of a pattern of behavior
when an animal receives a stimulus,
i.e., a reinforcer, in an appropriate
temporal relationship with another
stimulus or with a response.
[based on Silver et al. (2016, 2017, 2018), Li (2017)]
Dota [OpenAI (2019)]
StarCraft [Vinyals et al. (2019)]
Poker [Moravcik et al. (2017)]
Catch The Flag [Jaderberg et al. (2019)]
Curling [Won et al. (2020)]
Hide-and-Seek [Baker et al. (2020)]
see next slide
Potential applications of techniques in games
games correspond to fundamental problems in CS/AI and relate to combinatorial optimization,
NP-hard problems, control, and operations research
Libratus paper mentioned:

• business strategy

• negotiation

• strategic pricing

• finance

• cybersecurity

• military applications

• auctions

• pricing
AlphaGo papers mentioned:

• general game-playing

• classical planning

• partially observed planning

• scheduling

• constraint satisfaction

DeepStack paper mentioned:

• defending strategic resources

• robust decision making for medical treatment recommendations
[Brown, N. and Sandholm, T. (2017). Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science.]
[Jaderberg, M.,et al. (2019). Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364:859–865.]
[Silver, D.,et al. (2017). Mastering the game of Go without human knowledge. Nature, 550:354–359. ]
[Baker, B., et al. (2020). Emergent tool use from multi-agent autocurricula. In ICLR. (hide-and-seek)]
[OpenAI (2019). Dota 2 with large scale deep reinforcement learning. ArXiv.]
[Silver, D., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144. ]
[Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489. ]
[Moravcik, M., et al. (2017). DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513.]
[Vinyals, O., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575:350–354. ]
Games are to AI
what
fruit flies are to genetics.
[David Silver RL Course]
Deep Reinforcement Learning and Its Applications
•brief introduction

•successful applications

•challenges and opportunities
Multi-Armed Bandits
• 25–30% improvement in click-through rate

• 18% revenue lift on the landing page
[Agarwal, A. et al. (2016). Making contextual decisions with low technical debt. ArXiv.]
[Csaba Szepesvári, Bandits - DLRLSS 2019]
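To make the exploration behind such gains concrete, here is a minimal Thompson-sampling sketch for Bernoulli rewards such as clicks; the arm count, step count, and click rates are made up for illustration:

```python
import numpy as np

def thompson_sampling(n_arms, pull, steps=10_000, seed=0):
    """Bernoulli Thompson sampling with Beta(1, 1) priors.
    `pull(a)` returns a 0/1 reward (e.g., a click) for arm a."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_arms)                 # successes + 1
    beta = np.ones(n_arms)                  # failures + 1
    for _ in range(steps):
        a = int(np.argmax(rng.beta(alpha, beta)))  # sample, pick best
        r = pull(a)
        alpha[a] += r
        beta[a] += 1 - r
    return alpha / (alpha + beta)           # posterior-mean click rates

# toy usage: the third arm has the highest true click-through rate
true_ctr = [0.020, 0.025, 0.030]
rng = np.random.default_rng(1)
estimates = thompson_sampling(3, lambda a: rng.binomial(1, true_ctr[a]))
```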
Decision Service
[Agarwal et al. (2016)]
abstractions
• explore to collect the
data

• log the data correctly

• learn a good model

• deploy it in the
application
• the Client Library implements the Explore abstraction 

• implements various exploration policies, addressing F1 

• the Join Service implements the Log abstraction 

• joins rewards to decisions

• produces correct exploration data to address F2 

• enforces a uniform delay before releasing data to the Learn
component to avoid delay-related biases, addressing F1 

• the exploration data is copied to the Store for offline
experimentation 

• the Online Learner implements the Learn abstraction 

• incorporates data continuously and checkpoints models to
the Deploy component at a configurable rate, addressing F3 

• evaluates arbitrary policies in real-time

• enables advanced monitoring and safeguards, addressing F4

• the Store implements the Deploy abstraction

• provides model and data storage 

• the Offline Learner uses data for offline experimentation

• such as tuning hyper-parameters, evaluating other learning
algorithms or policy classes, changing the reward metric,
etc., counterfactually accurate 

• the Feature Generator eases usability by auto-generating
features
failures
• (F1) partial feedback and
bias

• (F2) incorrect data
collection 

• (F3) changes in the
environment 

• (F4) weak monitoring and
debugging
https://github.com/Microsoft/mwt-ds http://ds.microsoft.com
2019 Inaugural ACM SIGAI Industry Award for Excellence in Artificial Intelligence
multi-world testing
vs A/B testing
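The core idea behind multi-world testing is off-policy (counterfactual) evaluation: with logged action propensities, a single exploration dataset can score many candidate policies, whereas A/B testing needs live traffic for each variant. A minimal inverse-propensity-scoring (IPS) sketch, with hypothetical log fields:

```python
def ips_value(logs, policy):
    """Estimate the average reward `policy` would have earned, from logs.
    Each entry is (context, action, reward, propensity), where propensity
    is the probability the logging policy assigned to the logged action."""
    total = 0.0
    for context, action, reward, propensity in logs:
        if policy(context) == action:     # new policy agrees with the log
            total += reward / propensity  # reweight to remove logging bias
        # disagreements contribute 0
    return total / len(logs)
```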
Lessons from contextual bandit learning
in a customer support bot
• consider starting with imitation learning 

• consider simplified action spaces 

• don’t be afraid of principled exploration 

• try to support changes in environment 

• cautiously close the loop 

• consider reward engineering and
shaping 

• use a separate logging channel
• use and extend existing systems 

• pay attention to effective sample size (see the sketch below) 

• avoid ϵ-greedy

• regularize towards the logging policy,
increases the effective sample size,
resulting in shorter confidence
intervals and reduced overfitting 

• design an architecture suited to RL 

• balance randomness with
predictability
[Karampatziakis, N., et al. (2019). Lessons from real-world reinforcement learning in a customer support bot. In RL4RealLife.]
• Microsoft Virtual Agent

• scenarios: intent disambiguation, contextual recommendations
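One way to make the effective-sample-size advice above concrete: with importance weights w_i = π(a_i|x_i) / μ(a_i|x_i) between the target policy π and the logging policy μ, a standard estimate is ESS = (Σ w_i)² / Σ w_i². This formula is a common choice, not quoted from the paper:

```python
import numpy as np

def effective_sample_size(target_probs, logging_probs):
    """ESS = (sum w)^2 / sum w^2 for importance weights w = pi / mu.
    Equals n when the two policies match; shrinks as they diverge."""
    w = np.asarray(target_probs, float) / np.asarray(logging_probs, float)
    return float(w.sum() ** 2 / (w ** 2).sum())
```

Regularizing the learned policy toward the logging policy keeps the weights near 1, which is exactly why it increases the effective sample size and shortens confidence intervals.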
RL Applications @ Microsoft
• Personalizer, part of Azure Cognitive Services, within Azure AI platform

• making its way into more Microsoft products and services: Windows, Edge browser, Xbox 

• developers can plug Azure Cognitive Services into apps and websites 

• engineers can use Autonomous systems to refine manufacturing processes 

• Azure Machine Learning previews cloud-based RL offerings for data scientists and ML
professionals 

• Metrics Advisor incorporates feedback and makes models more adaptive to a customer’s
dataset, which helps detect more subtle anomalies in sensors, production processes or
business metrics 

• recommendation, adaptive to COVID-19 pandemic, find the optimal jitter buffer for a video
meeting, help determine when to reboot or remediate virtual machines, … 

• wide applications: 

• deliver tailored recommendations to small grocery stores across Mexico

• manipulate unstable coin bags for a bank in Russia

• collaborate with human players in games for a UK company
With reinforcement learning, Microsoft brings a new class of AI solutions to customers, https://blogs.microsoft.com/ai/reinforcement-learning/
RecSim: A Configurable Simulation Platform for
Recommender Systems
[Ie, E. et al. (2019). Reinforcement learning for slate-based recommender systems: A tractable decomposition and practical methodology. ArXiv. ]
[Ie, E. et al. (2019). Recsim - a configurable recommender systems environment. in RL4RealLife]
[Chen, M. et al. (2019). Top-k off-policy correction for a reinforce recommender system. in WSDM]
[Zhao, X., Xia, L., Tang, J., and Yin, D. (2019). Reinforcement learning for online information seeking. ACM SIGWEB Newsletter (SIGWEB).]
Facebook ReAgent
features 

• data preprocessing 

• feature normalization

• deep RL model implementation 

• multi-GPU training 

• counterfactual policy evaluation 

• optimized serving

• tested algorithms 

applications
• delivering more relevant notifications

• optimizing streaming video bit rates

• improving M suggestions in Messenger
Horizon: The first open source reinforcement learning platform for large-scale products and services, https://code.fb.com/ml-applications/horizon/
pipeline
• timeline generation, runs across thousands of CPUs

• training, runs across many GPUs

• serving, spans thousands of machines
[https://github.com/facebookresearch/ReAgent]
[Gauci, J. et al. Horizon: Facebook’s open source applied reinforcement learning platform. In RL4RealLife, 2019]
Ride-Hailing Order Dispatching
at DiDi via Reinforcement Learning
INFORMS 2019 Wagner Prize Winner [Qin, Z. T., et al. (2020). Ride-hailing order dispatching at DiDi via reinforcement learning. INFORMS Journal on Applied Analytics, 50(5):272–286.]
challenges
• dynamic and stochastic supply and demand

• system response time

• reliability

• multiple business objectives

• driver-centric objective: maximize the total income of the
drivers on the platform 

• passenger-centric objective: minimize the average pickup
distance of all the assigned orders

• marketplace efficiency metrics

• response rate

• fulfillment rate

• production requirements and constraints 

• computational efficiency 

• system reliability

• changing business requirements
solution approaches
• combinatorial optimization, myopic

• semi-Markov decision process

• tabular temporal difference learning

• deep (hierarchical) RL 

• transfer learning
[Tony Qin talk, https://www.arlseminar.com/speakers/]
[Tony Qin tutorials, https://tonyzqin.wordpress.com]
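The tabular temporal-difference approach listed above is simple to state in code. A TD(0) sketch over hypothetical (grid cell, time bucket) states; a real dispatching system would discount by trip duration and add much more:

```python
from collections import defaultdict

def td0_update(V, state, reward, next_state, alpha=0.05, gamma=0.99):
    """One TD(0) backup: V(s) += alpha * (r + gamma * V(s') - V(s))."""
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return td_error

V = defaultdict(float)  # keyed by (grid_cell, time_bucket)
# e.g., a trip from cell 12 at hour 9 to cell 40 at hour 10 earning 25.0
td0_update(V, (12, 9), 25.0, (40, 10))
```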
Autonomous navigation of stratospheric
balloons using reinforcement learning
• a cost-effective platform for communication, Earth observation,
gathering meteorological data and other applications 

• to navigate, a flight controller ascends and descends to find and
follow favourable wind currents

• vertical motion by pumping air ballast in and out of a fixed-
volume envelope

• horizontal motion by the winds 

• station-keeping, maintaining the balloon within a range of its
station, for communication

• input signal: wind speed and solar elevation

• challenges: imperfect data, sparse wind measurements resulting
in partial observability, power management 

• neither conventional methods nor human intervention suffice
• use reinforcement learning to train a flight controller from
simulations 

• use data augmentation and a self-correcting design to
overcome imperfect data 

• robust to the natural diversity in stratospheric winds 

• 39-day controlled experiment over the Pacific Ocean
[Bellemare, M. G., et al. (2020). Autonomous navigation of stratospheric
balloons using reinforcement learning. Nature, 588:77–82.]
Saying goodbye to Loon, https://medium.com/loon-for-all/loon-draft-c3fcebc11f3f, Jan 22, 2021
[Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., and Hutter, M. (2020). Learning quadrupedal locomotion over challenging terrain. Science Robotics, 5. ]
Learning quadrupedal locomotion over challenging terrain
[Mao, H., et al. (2019). Park: An open platform for learning augmented computer systems. In NeurIPS.]
[Mirhoseini, A., et al. (2020). Chip placement with deep reinforcement learning. ArXiv. ]
[Zoph, B. and Le, Q. V. (2017). Neural architecture search with reinforcement learning. In ICLR.] [Cubuk, E. et al. (2019). Autoaugment: Learning augmentation policies from data. In CVPR.]
[Mirhoseini, A. et al. (2017). Device placement optimization with reinforcement learning. In ICML.]
AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
[John D. Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Sergey Levine, Quoc V. Le, Honglak Lee, and Aleksandra Faust. Evolving Reinforcement Learning Algorithms, ICLR 2021]
[Esteban Real, Chen Liang, David R. So and Quoc V. Le. AutoML-Zero: Evolving Machine Learning Algorithms From Scratch, ICML 2020]
Evolving Reinforcement Learning Algorithms (AutoRL)
Combinatorial Optimization
[Chen, X. and Tian, Y. (2019). Learning to perform local rewriting for combinatorial optimization. In NeurIPS. ]
[Kool, W., van Hoof, H., and Welling, M. (2019). Attention, learn to solve routing problems! In ICLR.]
[Lu, H., Zhang, X., and Yang, S. (2020). A learning-based
iterative method for solving vehicle routing problems. In ICLR. ]
• ML alongside optimization algorithms (branch and bound)

• end-to-end learning

• learning to configure algorithms (hyper-parameters)
[Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. (2021) Machine Learning for Combinatorial
Optimization: a Methodological Tour d’Horizon. European Journal of Operational Research 290(2):405-421.]
SMARTS: Scalable
Multi-Agent RL
Training School for
Autonomous Driving
[Zhou, M., et al. (2020). SMARTS: Scalable multi-agent reinforcement learning training school for autonomous driving. In Conference on Robot Learning (CoRL). ]
Wuji: Automatic Online Combat Game Testing
Using Evolutionary Deep Reinforcement Learning
• crash bugs

• stuck bugs

• logic bugs

• gaming balance bugs

• user experience bugs
[Zheng, Y. et al. (2019). Wuji: Automatic online combat game testing using evolutionary deep reinforcement learning. In ASE 2019.]
ACM SIGSOFT Distinguished Paper Award
• automatic game testing framework

• evolutionary algorithm

• deep RL

• multi-objective optimization
RL for Instructional Sequencing
• RL has been most successful in cases where it has been
constrained with ideas and theories from cognitive
psychology and the learning sciences
[Doroudi, S., Aleven, V., and Brunskill, E. (2019). Where’s the reward? a review of reinforcement learning for instructional sequencing.
International Journal of Artificial Intelligence in Education, 29:568–620.]
• elementary school children learn to manipulate money 

• four principal regions: wallet location, repository location, object
location, text location

• ITS dynamically proposes to students the exercises currently
making maximal learning progress

• targeting to maximize intrinsic motivation and learning efficiency
[Oudeyer, P.-Y.,et al. (2016). Intrinsic motivation, curiosity and learning: theory and applications in educational technologies. Progress in brain research, Elsevier, 229:257– 284. ]
Intrinsic motivation is
defined as doing an activity
for its inherent satisfaction,
fun or challenge, rather than
for external products,
pressures or rewards.
Flow
the psychology of optimal experience
A good life is one that is
characterized by complete
absorption in what one does.
• challenge-skill balance

• action-awareness merging

• clear goals

• unambiguous feedback

• concentration on the task at hand

• sense of control

• loss of self-consciousness 

• transformation of time 

• an autotelic experience
[Nakamura, J. and Csikszentmihalyi, M. (2014). The concept of flow. In Csikszentmihalyi, M., editor, Flow and the Foundations of Positive Psychology, pages 239–263. Springer. ]
A Generic Approach to Challenge Modeling for
the Procedural Creation of Video Game Levels
• vertical arrows represent the challenging events
(holes) as unit impulses

• the curve represents the amount of accumulated
challenge in the time window
• anxiety depends on the accumulated challenge 

• fun as a response to increasing anxiety
• define reward function with quantitative challenge and fun 

• use RL to procedurally generate content for Super Mario
[Sorenson, N., Pasquier, P., and DiPaola, S. (2011). A generic approach to challenge modeling for the procedural creation of video game levels.
IEEE Transaction on Computational Intelligence and AI in Games, 3(3):229–244.]
Mobile Healthcare
• micro-randomized trial, collecting data for offline analysis

• Just-in-time adaptive interventions (JITAIs), the next generation of
mobile healthcare delivery that is automated, scalable, evidence-driven
and inexpensive. 

• iterating between offline analysis and online personalization

• more data and better algorithms over time, gradually improve JITAIs

• use positive psychology to improve adoption, engagement and effect?
• chronic diseases: migraine, diabetes, obesity, etc.

• mental health: schizophrenia, depression, anxiety, etc.

• wellness: fitness management, sedentary behavior, etc.
[Menictas, M., Rabbi, M., Klasnja, P., and Murphy, S. (2019). Artificial intelligence
decision-making in mobile health. The Biochemist, 41(5):20–24.]
FinRL: A Deep Reinforcement Learning Library for
Automated Stock Trading in Quantitative Finance
[FinRL: https://github.com/AI4Finance-LLC/FinRL-Library]
[Wang, J. et al. (2019). Alphastock: A buying-winners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks. In KDD.]
The AI Economist:
Improving Equality and Productivity
with AI-Driven Tax Policies
[Stephan Zheng, Alexander Trott, Sunil Srinivasa, Nikhil Naik, Melvin Gruesbeck, David C. Parkes, and Richard
Socher, (2020) The AI Economist: Improving Equality and Productivity with AI-Driven Tax Policies. ArXiv.]
Data center cooling using model-predictive control
[Warren B. Powell (2019). From Reinforcement Learning to Optimal Control: A unified framework for sequential decisions. arXiv]
[Shin, J., Badgwell, T.A., Liu, K., Lee, J.H., 2019. Reinforcement learning - overview of recent progress and implications for process control. CCE 127, 282–294 (2020).]
[R. Nian, J. Liu and B. Huang, A review On reinforcement learning: Introduction and applications in industrial process control, Computers and Chemical Engineering 139 (2020) ]
notes for RL:

• the goal can be
complex, can have
safety constraints

• there can be
stability constraints

• there can be state
constraints

• failure can be
avoided with safety
constraints during
execution
[Lazic, N., et al. (2018). Data center cooling using model-predictive control. In NeurIPS.]
Gym-ANM: Reinforcement learning environments for active network management tasks in electricity distribution systems
[Robin Henry and Damien Ernst, (2021). Gym-ANM: Reinforcement learning environments for active network management tasks in electricity distribution systems, ArXiv.]
[Zhavoronkov, A., et al. (2019). Deep learning enables rapid identification of potent ddr1 kinase inhibitors. Nature Biotechnology, 37:1038–1040. ]
Drug design
AlphaFold
• what a protein does largely depends on its unique 3D structure 

• protein folding problem: figure out what shapes proteins fold into

• a grand challenge in biology for the past 50 years
[https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology]
COVID-19
• RL is a promising framework to combat epidemics

• epidemics are a fruitful application area for RL to make
substantial real life impact 

• communities of RL and AI, epidemiology and public health,
and economics should collaborate to combat epidemics
[Yinyu Ye, Optimization and operations research in mitigation of a pandemic, 2020]
[Yuxi Li, Combat epidemics with reinforcement learning, https://attain.ai, 2020]
•brief introduction

•successful applications

•challenges and opportunities
Software engineering for machine learning
• the nine stages of the machine learning workflow 

• data-oriented: collection, cleaning, and labeling

• model-oriented: model requirements, feature engineering, training, evaluation, deployment, and monitoring

• many feedback loops in the workflow 

• larger feedback arrows: model evaluation and monitoring may loop back to any of the previous stages

• smaller feedback arrow: model training may loop back to feature engineering, e.g., in representation learning

• three aspects of the AI domain that make it fundamentally different from prior software application domains: 

• discovering, managing, and versioning the data needed for machine learning applications is much more complex
and difficult than other types of software engineering

• model customization and model reuse require very different skills than are typically found in software teams

• AI components are more difficult to handle as distinct modules than traditional software components — models
may be “entangled” in complex ways and experience non-monotonic error behavior
[Amershi, S., et al. (2019). Software engineering for machine learning: A case study. In ICSE.]
Hidden Technical Debt in
Machine Learning Systems
• boundary erosion

• entanglement

• hidden feedback loops

• undeclared consumers

• data dependencies 

• configuration issues

• changes in the external world

• system-level anti-patterns
[Sculley, D., et al. (2014). Machine learning: The high interest credit card of technical debt.
In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop)]
• potential solutions

• refactoring code

• improving unit tests

• deleting dead code

• reducing dependencies

• tightening APIs

• improving documentation
• technical debt, the long-term costs that accumulate when expedient but
suboptimal decisions are made in the short term

• ML systems incur massive hidden costs at the system level, beyond the
basic code complexity issues of traditional software systems
Challenges in Deploying Machine Learning
[Andrei Paleyes, Raoul-Gabriel Urma, Neil D. Lawrence, (2020) Challenges
in Deploying Machine Learning: a Survey of Case Studies, ArXiv.]
Data set and software engineering
• data set for reinforcement learning?

• in particular, for contextual bandits

• something for RL like ImageNet for deep learning 

• software engineering for reinforcement learning? 

• Personalizer / Decision Service by Microsoft
RL competitions
• AWS DeepRacer League

• Flatland: Multi-Agent Reinforcement Learning on Trains

• KDD 2020 Cup Learning to dispatch and reposition on a
mobility-on-demand platform

• Learning to Run a Power Network Challenge

• SMARTS Competition of Autonomous Driving
https://github.com/seungjaeryanlee/awesome-rl-competitions
Offline RL / Batch RL
• (a) online RL: the policy π_k is updated with streaming data collected by π_k itself 

• (b) off-policy RL: the agent's experience is appended to a data buffer 𝒟 (also called a replay
buffer), and each new policy π_k collects additional data, such that 𝒟 is composed of
samples from π_0, π_1, . . . , π_k, and all of this data is used to train an updated new policy π_{k+1} 

• (c) offline RL: employs a dataset 𝒟 collected by some (potentially unknown) behavior policy
π_β. The dataset is collected once and is not altered during training, which makes it feasible
to use large previously collected datasets. The training process does not interact with the MDP
at all, and the policy is only deployed after being fully trained.
[Sergey Levine, Aviral Kumar, George Tucker and Justin Fu. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. ArXiv. ]
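A minimal sketch of setting (c) in code: the dataset is fixed and training never calls the environment. This is plain tabular Q-learning over logged tuples; practical offline RL methods add corrections for distribution shift (e.g., pessimism or constraints toward π_β) that this sketch omits:

```python
import numpy as np

def offline_q_learning(dataset, n_states, n_actions,
                       alpha=0.1, gamma=0.99, epochs=50):
    """Tabular Q-learning over a fixed buffer of (s, a, r, s2, done)
    tuples collected once by a behavior policy; no env interaction."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        for s, a, r, s2, done in dataset:
            target = r if done else r + gamma * Q[s2].max()
            Q[s, a] += alpha * (target - Q[s, a])
    return Q.argmax(axis=1)  # greedy policy, deployed only after training
```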
The challenges of real-world RL
• being able to learn on live systems from limited samples

• dealing with unknown and potentially large delays in the system actuators, sensors, or
rewards

• learning and acting in high-dimensional state and action spaces

• reasoning about system constraints that should never or rarely be violated 

• interacting with systems that are partially observable, which can alternatively be viewed
as systems that are non-stationary or stochastic

• learning from multi-objective or poorly specified reward functions

• being able to provide actions in real-time, especially for systems with high control frequencies 

• training off-line from the fixed logs of an external behavior policy 

• providing system operators with explainable policies
[Dulac-Arnold, G., Mankowitz, D., and Hester, T. (2019). Challenges of real-world reinforcement learning. In RL4RealLife.]
A Model for Motivation, Engagement,
and Thriving in the User Experience
• The basic psychological needs of autonomy, competence and relatedness mediate
positive user experience outcomes such as engagement, motivation and thriving.

• As such, they constitute specific measurable parameters for which designers can
design in order to foster these outcomes within different spheres of experience.

• self-determination theory, positive psychology, helpful for HCI? reward function?
[Peters, D., et al. (2018). Designing for motivation, engagement and wellbeing in digital experience. Frontier in Psychology, 9(797).]
The foundation of
efficient robot learning
• sample efficient, requiring relatively
few training examples

• generalizable, applicable to many
situations other than the one(s) it
learned 

• compositional, represented in a
form that allows it to be combined
with previous knowledge

• incremental, capable of adding new
knowledge and abilities over time
[Kaelbling, L. P. (2020). The foundation of efficient robot learning. Science, 369(6506):915–916. ]
Prior knowledge/structure in machine learning
bitter or better lesson?
• Richard Sutton: The biggest lesson that can be read
from 70 years of AI research is that general methods
that leverage computation are ultimately the most
effective, and by a large margin.

• chess, computer Go, speech recognition, computer
vision

• leverage the great power of general purpose methods:
search and learning

• the actual contents of minds are very complex

• build in only the meta-methods that can find and
capture the arbitrary complexity

• let our methods search for good approximations,
not by us
• Rodney Brooks: we have to take into account the
total cost of any solution, and that so far they have
all required substantial amounts of human ingenuity

• Convolutional Neural Networks, designed by
humans to manage translational invariance

• issues like color constancy, to avoid recognizing a
traffic stop sign with some pieces of tape on it as a
45 mph speed limit sign by CNN

• network architecture design

• massive data sets, amount of computation, power
consumption

• Moore’s Law slows down; breakdown of Dennard
scaling 

• special purpose computer architecture needs human
analysis
Brooks, R. (2019). A better lesson. https://rodneybrooks.com/a-better-lesson/ 

Sutton, R. (2019). The bitter lesson. http://incompleteideas.net/IncIdeas/BitterLesson.html.
• Thomas Dietterich: both are right

• Richard, we have achieved significant advances in
performance by replacing (some kinds of) human
engineering with machine learning from big data.

• Rodney, we need to find better ways of encoding
knowledge into network structure (or other prior
constraints).
Yoshua Bengio,  From System 1 Deep Learning to System 2 Deep Learning, Posner lecture at NeurIPS’2019
The AI community borrows ideas from psychology
that underlie a Nobel Prize in economics.
Reinforcement Learning,
Fast and Slow
• sample inefficiency
• the requirement for
incremental parameter
adjustment 

• maximize generalization and
avoid overwriting the effects of
earlier learning 

• inductive bias
• bias–variance trade-off: the
stronger assumption, the less
data
[Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., and Hassabis, D. (2019). Reinforcement learning, fast and slow. Trends in Cognitive Sciences, 23(5):408–422.]
• episodic deep RL
• fast learning through episodic memory

• form useful internal representations or embeddings
of each new observation 

• meta-RL
• speed up deep RL by learning to learn 

• narrow hypothesis space

• episodic meta-RL 

• fast learning arises from, and is enabled by, slow
learning 

• relevance to neuroscience and psychology
Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh
Goyal, Yoshua Bengio, (2021). Toward Causal Representation Learning, Proceedings of the IEEE.
[Elias Bareinboim, Causal Reinforcement Learning, ICML 2020 Tutorial. https://crl.causalai.net/]
Causality
Representation
[Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.]
[Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. (2021) A Comprehensive
Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4-24, Jan. 2021]
[Yao Ma and Jiliang Tang. (2020) Deep Learning on Graphs. Cambridge University Press]
Interpretability
[Belle, V. and Papantonis, I. (2020). Principles and practice of explainable machine learning. ArXiv.]
[W. James Murdoch, Chandan Singh, Karl Kumbier, Reza
Abbasi-Asl and Bin Yu. (2019) Definitions, methods, and
applications in interpretable machine learning. PNAS
116 (44) 22071–22080]
[Finale Doshi-Velez and Been Kim. (2017) Towards A
Rigorous Science of Interpretable Machine Learning]
[Alharin, A., Doan, T.-N., and Sartipi, M. (2020). Reinforcement learning
interpretation methods: A survey. IEEE Access, 8:171058 – 171077.]
Guidelines for RL in healthcare
• access to all variables influencing decision
making
• the effective sample size

• larger if the learned policies are close to the
clinician policies 

• most reliable for refining existing practices
rather than discovering new treatment approaches 

• the possibilities for mismatch between the actual
decision and the proposed decision grow with the
number of decisions in the patient’s history 

• interrogate RL-learned policies 

• to assess whether they will behave prospectively
as intended 

• consider problem formulation, reward function
definition, data recording or preprocessing,
transferability to new scenarios, interpretability
[Gottesman, O. et al. (2019). Guidelines for reinforcement learning in healthcare. Nature Medicine, 25:14–18.]
problem scale - combinations
effective sample size in off-policy evaluation
Do no harm: a roadmap for
responsible machine learning for health care
[Wiens, J. et al. (2019). Do no harm: a roadmap for responsible
machine learning for health care. Nature Medicine, 25:1337–1340.]
Hippocratic oath for AI?
Constraints
• satisfying constraints

• during exploration and operation

• tradeoff multiple objectives

• autonomous car: safety, efficiency
and comfort

• combating COVID-19 vs maintaining
economic productivity
[Dulac-Arnold, G., Mankowitz, D., and Hester, T. (2019). Challenges of real-world reinforcement learning. In RL4RealLife.]
[Csaba Szepesvári. (2020). Constrained MDPs and the reward hypothesis, https://readingsml.blogspot.com/2020/03/constrained-mdps-and-reward-hypothesis.html]
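One standard way to handle such tradeoffs, in the constrained-MDP spirit of the post cited above, is a Lagrangian relaxation: optimize reward minus λ-weighted constraint costs and adjust the multipliers by dual ascent. A hedged sketch of just the scalarization and the multiplier update, with made-up signal names:

```python
def scalarized_reward(reward, costs, lambdas):
    """Constrained objective as a single reward: r - sum_i lambda_i * c_i.
    `costs` are per-step constraint signals (e.g., discomfort, risk)."""
    return reward - sum(l * c for l, c in zip(lambdas, costs))

def update_multiplier(lmbda, avg_cost, budget, lr=0.01):
    """Dual ascent: raise lambda while the cost budget is exceeded,
    lower it (but never below 0) when the constraint is slack."""
    return max(0.0, lmbda + lr * (avg_cost - budget))
```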
AI creates a new type of business
• lower gross margins due to heavy
cloud infrastructure usage and ongoing
human support

• scaling challenges due to the thorny
problem of edge cases

• weaker defensive moats due to the
commoditization of AI models and
challenges with data network effects

• AI companies appear, increasingly, to
combine elements of both software and
services with gross margins, scaling,
and defensibility that may represent a
new class of business entirely.
[Andreessen Horowitz blog. The new business of AI (and how it's different from traditional software).
https://a16z.com/2020/02/16/the-new-business-of-ai-and-how-its-different-from-traditional-software/]
practical advice for founders: 

• eliminate model complexity
as much as possible

• choose problem domains
carefully to reduce data
complexity

• plan for high variable costs

• embrace services

• plan for change in the tech stack

• build defensibility the old-
fashioned way
MLOps: From Model-centric to Data-centric AI
• MLOp’s most important task is to make high quality data
available through all stages of the ML project lifecycle.

• AI system = Code + Data

• Model-centric AI: How can you change the model (code) to
improve performance?

• Data-centric AI: How can you systematically change your
data (input x or labels y) to improve performance?

• Important frontier: MLOps tools to make data-centric AI an
efficient and systematic process.
[Andrew Ng, A Chat with Andrew on MLOps: From Model-centric to Data-centric AI, https://tinyurl.com/8dzjmexd]
Bridging AI’s POC
to production gap
• small data algorithms include synthetic data generation, e.g.,
Generative Adversarial Networks (GANs), one/few-shot learning,
e.g., GPT-3, self-supervised learning, transfer learning,
anomaly detection, etc.

• will your model generalize to a different dataset than
what it was trained on?

• a model that works in a published paper often does not work
in production

• production AI projects require more than ML code
• manage the change the technology brings: budget
enough time, identify all stakeholders, provide
reassurance, explain what's happening and why, right-size the first project

• key technical tools: explainable AI, auditing
[Ng, A. (2020). Bridging AI's proof-of-concept to production gap. https://tinyurl.com/u45zer7j]
SHOULD I GET INTO RL?

• This is RL, that is not RL 

• Dangers of RL 

• Safety guarantees 

• RL is done 

• RL does not work 

RL IS PROBLEMATIC!? 

• Testing on training data? 

• Generalization & RL 

• Focus on simulators 

• Bad problem 

• Speed of RL
[Csaba Szepesvári, DL Day talk @ KDD 2020, https://sites.ualberta.ca/~szepesva/talks.html]
META MYTHS 

• Breaking curses 

• Data/generality wins 

• SOTA
NEIGHBORS OF RL 

• Alternatives ≫ RL 

• Self-supervised learning
and RL 

• Causality
Dimitri Bertsekas: cautiously positive
• There are enough methods to try with a reasonable chance of success for most types of optimization problems. 

• There are no methods that are guaranteed to work for all or even most problems. 

• We can begin to address practical problems of unimaginable difficulty! 

• There is an exciting journey ahead! 

• see more from slides and the new book on RL and Optimal Control, http://web.mit.edu/dimitrib/www/RLbook.html
Warren Powell: Sequential Decision Analytics
four classes of policies 

• Policy function
approximations (PFAs): 

• parameterized
policies

• Cost function
approximations (CFAs)

• upper confidence
bounding (UCB)

• Value function
approximations (VFAs)

• Q-learning

• Direct lookaheads
(DLAs)

• Monte Carlo tree
search

• which one is best? 

• meta-learning?
[https://castlelab.princeton.edu/sda/]
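UCB, listed above under cost function approximations, fits in a few lines. A minimal UCB1 sketch (empirical mean plus a sqrt(2 ln t / n) exploration bonus), for illustration only:

```python
import math

def ucb1_choose(means, counts, t):
    """UCB1: pick the arm maximizing mean + sqrt(2 ln t / n).
    Arms never tried (n == 0) are chosen first."""
    best_arm, best_score = 0, float("-inf")
    for arm, (mean, n) in enumerate(zip(means, counts)):
        score = math.inf if n == 0 else mean + math.sqrt(2 * math.log(t) / n)
        if score > best_score:
            best_arm, best_score = arm, score
    return best_arm
```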
How would Rich Sutton like this?
Foundation
Rich Sutton, AI Debate 2, https://www.youtube.com/watch?v=VOI3Bb3p4GM, around 35'

• David Marr's three levels at which any information processing machine must be understood:
computational theory, representation and algorithm, and hardware implementation. 

• AI has surprisingly little computational theory. 

• The big ideas are mostly at the middle level of representation and algorithm.

• Reinforcement learning is the first computational theory of intelligence.
• RL is explicitly about the goal, the whats and whys of intelligence.
Alekh Agarwal, Akshay Krishnamurthy,
and John Langford

• FOCS 2020 tutorial on the Theoretical
Foundations of Reinforcement Learning 

• https://hunch.net/~tforl/
McKinsey: It's time for businesses to chart a course for reinforcement learning
https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/its-time-for-businesses-to-chart-a-course-for-reinforcement-learning
Harvard Business Review: Why AI That Teaches Itself to Achieve a Goal Is the Next Big Thing

How to Spot an Opportunity for Reinforcement Learning

• make a list

• consider other options

• be careful what you wish for

• ask whether it’s worth it

• prepare to be patient
https://hbr.org/2021/04/why-ai-that-teaches-itself-to-achieve-a-goal-is-the-next-big-thing
When is RL helpful?
• when big data are available, from the model, a good simulator, or interaction

• natural science and engineering

• usually with clear objective function, with a standard answer, straightforward to evaluate

• AlphaGo

• combinatorial optimization, operations research, optimal control, drug design, etc.

• social science and humanities

• usually “human in the loop”, usually influenced by psychology, behavioural science, etc.,
subjective, may not have a standard answer, may not be easy to evaluate

• game design and evaluation, education

• concepts from psychology, e.g., intrinsic motivation and self-determination theory, may serve
as a bridge connecting RL/AI with social science and art, e.g., by defining the reward function
Applications everywhere
•Reinforcement learning solves sequential
decision making problems.

•Reinforcement learning intelligently
automates previously manually designed
strategies, e.g., those based on heuristics.
[Yuxi Li, Deep Reinforcement Learning: An Overview. ArXiv. https://bit.ly/2AidXm1]
[Figure: map of reinforcement learning applications across economic sectors — games, robotics, healthcare, business management, science, engineering, humanities, finance, education, energy, transportation, computer systems; example topics include Atari, Go, poker, StarCraft, game theory, PCG, testing, gamification; perception, planning, navigation, locomotion, sim-to-real; smart grid, power mgmt, data center; VRP, inventory, AGV, logistics, manufacture; resource mgmt, neural arch., computer vision, NLP, software, hardware, networks; maths, physics, chemistry, bio, psychology, neural sci., OR, optimal ctrl; music, drawing; recommender, sequencing, motivation, DTRs, mobile; scheduling, pricing, trading, portfolio opt., risk mgmt; e-commerce, customer mgmt; process ctrl, maintenance; traffic signal, order matching, V2X]
Resources
• Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction
(2nd Edition). MIT Press
• David Silver, Reinforcement Learning, 2015
• Reinforcement Learning course from Univ. of Alberta on Coursera
• OpenAI Spinning Up in Deep RL
• Deepmind & UCL Advanced Deep Learning and Reinforcement Learning
• Sergey Levine, UC Berkeley, Deep Reinforcement Learning
• https://github.com/ShangtongZhang/reinforcement-learning-an-introduction
• Yuxi Li, Deep Reinforcement Learning: An Overview, arXiv, 2017
• Yuxi Li, Reinforcement Learning Applications, arXiv, 2019 (plan to update soon)
• Yuxi Li, Resources for Deep Reinforcement Learning, medium.com
Reinforcement
Learning
for Real Life
• Machine Learning
Journal Special Issue
• ICML 2021 workshop
• 2020 virtual workshop
• ICML 2019 workshop
RL4RealLife Workshop @ ICML 2021
https://sites.google.com/view/RL4RealLife
Co-chairs
RL Foundation Panel
RL + RecSys Panel
RL + Robotics Panel
RL + OR Panel
RL Explainability &
Interpretability Panel
CFP: Deadline: June 12
TBA: RL Research-to-
RealLife Gap Panel
Machine Learning Journal Special Issue
Reinforcement Learning for Real Life
accepted so far, more to come 

• Dealing with multiple experts and non-stationarity in inverse reinforcement learning: an application
to real-life problems

• Partially observable environment estimation with uplift inference for reinforcement learning based
recommendation

• Automatic discovery of interpretable planning strategies

• Challenges of Real-World Reinforcement Learning: Definitions, Benchmarks & Analysis

• IntelligentPooling: Practical Thompson Sampling for mHealth

• Bandit Algorithms to Personalize Educational Chatbots

• Air Learning: A Deep Reinforcement Learning Gym for Autonomous Aerial Robot Visual Navigation

• Grounded Action Transformation for Sim-to-Real Reinforcement Learning

• Inverse Reinforcement Learning in Contextual MDPs

• Lessons on off-policy methods from a notification component of a chatbot
Not widely commercialized yet. Why?
• no “AlphaGo moment” in practice, no killer application, yet

• still challenges with implementation, algorithms and theory

• software engineering, system deployment, technical debt

• business model, software + service, gross margins, scaling, defensive moats

• resources, still insufficient, talent, compute, funding

• investment, chicken-egg, long-term thinking, trial-and-error spirit

• slow adoption of new technology, esp. by traditional industrial sectors

• technical route, AI+ or +AI? collaborate with domain experts

• learning curve, steeper than deep learning

• more education and training, necessary for all, from engineer to CEO
(Deep)RL doesn’t work?
Lots of successful stories already.
It requires accumulation of knowledge and experience,
resources including talent and compute, and patience.
[modified a pic from Internet]
The time for reinforcement learning is coming.
may not be another AlphaGo moment
or a killer application
may be permeating slowly and gradually
promising yet challenging

More Related Content

PPTX
Reinforcement Learning, Application and Q-Learning
PDF
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
PDF
Reinforcement learning-ebook-part1
PDF
Deep Q-Learning
PDF
An introduction to reinforcement learning
PPTX
Intro to Deep Reinforcement Learning
PDF
Reinforcement Learning Tutorial | Edureka
PDF
Reinforcement Learning using OpenAI Gym
Reinforcement Learning, Application and Q-Learning
MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL) by Lex Fridman
Reinforcement learning-ebook-part1
Deep Q-Learning
An introduction to reinforcement learning
Intro to Deep Reinforcement Learning
Reinforcement Learning Tutorial | Edureka
Reinforcement Learning using OpenAI Gym

What's hot (20)

PPTX
Deep Reinforcement Learning
PDF
An introduction to deep reinforcement learning
PDF
Deep Reinforcement Learning
PPTX
Reinforcement Learning : A Beginners Tutorial
PDF
Deep reinforcement learning
PPTX
Reinforcement Learning
PPTX
Deep Reinforcement Learning
PDF
Reinforcement learning, Q-Learning
PPTX
Reinforcement learning
PDF
Reinforcement Learning 4. Dynamic Programming
PPTX
An introduction to reinforcement learning
PDF
RLCode와 A3C 쉽고 깊게 이해하기
PPTX
Reinforcement Learning
PPT
Reinforcement Learning Q-Learning
PPT
Reinforcement learning 7313
PDF
Reinforcement Learning 1. Introduction
PDF
Introduction of Deep Reinforcement Learning
PDF
Continuous control with deep reinforcement learning (DDPG)
PDF
Reinforcement Learning 8: Planning and Learning with Tabular Methods
Deep Reinforcement Learning
An introduction to deep reinforcement learning
Deep Reinforcement Learning
Reinforcement Learning : A Beginners Tutorial
Deep reinforcement learning
Reinforcement Learning
Deep Reinforcement Learning
Reinforcement learning, Q-Learning
Reinforcement learning
Reinforcement Learning 4. Dynamic Programming
An introduction to reinforcement learning
RLCode와 A3C 쉽고 깊게 이해하기
Reinforcement Learning
Reinforcement Learning Q-Learning
Reinforcement learning 7313
Reinforcement Learning 1. Introduction
Introduction of Deep Reinforcement Learning
Continuous control with deep reinforcement learning (DDPG)
Reinforcement Learning 8: Planning and Learning with Tabular Methods
Ad

Similar to Deep Reinforcement Learning and Its Applications (20)

PDF
Artificial Collective Intelligence
PPTX
pptvuvubhbhaszvgsgsvxhbughbghbgbhhhhhhh.pptx
PDF
PDF
Introduction to reinforcement learning
PDF
Reinforcement learning in a nutshell
PPTX
mlcgfxfgtyufuyhjfxcgvhbgfasghjgfghj.pptx
PPTX
What Can RL do.pptx
PDF
A Journey to Reinforcement Learning
PPTX
Deep Reinforcement Leaning In Machine Learning
PDF
A brief overview of Reinforcement Learning applied to games
PPTX
Reinforcement learning
PDF
Horizon: Deep Reinforcement Learning at Scale
PDF
從 Atari/AlphaGo/ChatGPT 談深度強化學習及通用人工智慧
PPTX
AI for energy: the uncertain promising opportunity
PDF
Shanghai deep learning meetup 4
DOCX
Reinforcement Learning Literature review - apr2019/feb2021 (with zip file)
PDF
anintroductiontoreinforcementlearning-180912151720.pdf
PPTX
Introduction to Reinforcement Learning.pptx
PDF
My Robot Can Learn -Using Reinforcement Learning to Teach my Robot
PDF
RL presentation
Artificial Collective Intelligence
pptvuvubhbhaszvgsgsvxhbughbghbgbhhhhhhh.pptx
Introduction to reinforcement learning
Reinforcement learning in a nutshell
mlcgfxfgtyufuyhjfxcgvhbgfasghjgfghj.pptx
What Can RL do.pptx
A Journey to Reinforcement Learning
Deep Reinforcement Leaning In Machine Learning
A brief overview of Reinforcement Learning applied to games
Reinforcement learning
Horizon: Deep Reinforcement Learning at Scale
從 Atari/AlphaGo/ChatGPT 談深度強化學習及通用人工智慧
AI for energy: the uncertain promising opportunity
Shanghai deep learning meetup 4
Reinforcement Learning Literature review - apr2019/feb2021 (with zip file)
anintroductiontoreinforcementlearning-180912151720.pdf
Introduction to Reinforcement Learning.pptx
My Robot Can Learn -Using Reinforcement Learning to Teach my Robot
RL presentation
Ad

More from Bill Liu (20)

PDF
Walk Through a Real World ML Production Project
PDF
Redefining MLOps with Model Deployment, Management and Observability in Produ...
PDF
Productizing Machine Learning at the Edge
PPTX
Transformers in Vision: From Zero to Hero
PDF
Deep AutoViML For Tensorflow Models and MLOps Workflows
PDF
Metaflow: The ML Infrastructure at Netflix
PDF
Practical Crowdsourcing for ML at Scale
PDF
Building large scale transactional data lake using apache hudi
PDF
Big Data and AI in Fighting Against COVID-19
PDF
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
PDF
Build computer vision models to perform object detection and classification w...
PDF
Causal Inference in Data Science and Machine Learning
PDF
Weekly #106: Deep Learning on Mobile
PDF
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
PDF
AISF19 - On Blending Machine Learning with Microeconomics
PDF
AISF19 - Travel in the AI-First World
PDF
AISF19 - Unleash Computer Vision at the Edge
PDF
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
PDF
Toronto meetup 20190917
PPTX
Feature Engineering for NLP
Walk Through a Real World ML Production Project
Redefining MLOps with Model Deployment, Management and Observability in Produ...
Productizing Machine Learning at the Edge
Transformers in Vision: From Zero to Hero
Deep AutoViML For Tensorflow Models and MLOps Workflows
Metaflow: The ML Infrastructure at Netflix
Practical Crowdsourcing for ML at Scale
Building large scale transactional data lake using apache hudi
Big Data and AI in Fighting Against COVID-19
Highly-scalable Reinforcement Learning RLlib for Real-world Applications
Build computer vision models to perform object detection and classification w...
Causal Inference in Data Science and Machine Learning
Weekly #106: Deep Learning on Mobile
Weekly #105: AutoViz and Auto_ViML Visualization and Machine Learning
AISF19 - On Blending Machine Learning with Microeconomics
AISF19 - Travel in the AI-First World
AISF19 - Unleash Computer Vision at the Edge
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
Toronto meetup 20190917
Feature Engineering for NLP

Recently uploaded (20)

PDF
Empathic Computing: Creating Shared Understanding
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPT
Teaching material agriculture food technology
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Encapsulation theory and applications.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Approach and Philosophy of On baking technology
Empathic Computing: Creating Shared Understanding
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Understanding_Digital_Forensics_Presentation.pptx
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Spectral efficient network and resource selection model in 5G networks
Network Security Unit 5.pdf for BCA BBA.
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Teaching material agriculture food technology
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
MYSQL Presentation for SQL database connectivity
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Encapsulation theory and applications.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
The AUB Centre for AI in Media Proposal.docx
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Dropbox Q2 2025 Financial Results & Investor Presentation
Approach and Philosophy of On baking technology

Deep Reinforcement Learning and Its Applications

  • 1. Deep Reinforcement Learning and its Applications Yuxi Li yuxili@gmail.com 2021.04.28
  • 3. Reinforcement Learning (RL) at each time step, an RL agent • receives a state • selects an action • receives a reward • transitions into a new state objective: maximize long term reward [Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT Press.]
  • 4. “Reinforcement” is from classical conditioning • The term “reinforcement” in the context of animal learning came into use in the 1927 English translation of Pavlov’s monograph on conditioned reflexes. • Pavlov described reinforcement as the strengthening of a pattern of behavior due to an animal receiving a stimulus—a reinforcer—in an appropriate temporal relationship with another stimulus or with a response. [Pic from Internet] [Sutton BA in psychology from Stanford University in 1978] [Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT Press.]
  • 6. [Li, Y. (2017). Deep Reinforcement Learning: An Overview. ArXiv.] reinforcement learning supervised learning unsupervised learning deep learning artificial intelligence machine learning deep reinforcement learning artificial neural networks association rule learning Bayesian networks clustering decision tree learning genetic algorithms inductive logic programming reinforcement learning representation learning rule-based machine learning similarity and metric learning sparse dictionary learning support vector machines problem solving search constraint satisfaction knowledge, reasoning, and planning logical agents first-order logic planning and acting knowledge representation probabilistic reasoning decision making learning learning from examples knowledge in learning learning probabilistic models reinforcement learning communication, perceiving, and acting natural language processing perception robotics [Li, Y. (2019). Reinforcement Learning Applications. ArXiv.] in a usual sense (not perfectly correct) • supervised learning makes predictions • myopic • reinforcement learning makes decisions • long-term thinking
  • 7. RL+ DL = AI [David Silver, Deep Reinforcement Learning from AlphaGo to AlphaStar]
  • 8. From Deep Q-Networks (DQN) to Agent57 [Badia, A. P.,et al. (2020). Agent57: Outperforming the atari human benchmark. ArXiv.] [Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533. ]
  • 9. First return, then explore: Go-Explore [Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth Stanley and Jeff Clune, First return, then explore, Nature, February 2021]
  • 11. Dota [OpenAI (2019)] StarCraft [Vinyals et al. (2019)] Poker [Moravcik et al. (2017)] Catch The Flag [Jaderberg et al. (2019)] Curling [Won et al. (2020)] Hide-and-Seek [Baker et al. (2020)] see next slide
  • 12. Potential applications of techniques in games games correspond to fundamental problems in CS/AI, relate to combinatorial optimization, NP-hard problems, control, and operations research Libratus paper mentioned: • business strategy • negotiation • strategic pricing • finance • cybersecurity • military applications • auctions • pricing AlphaGo papers mentioned: • general game-playing • classical planning • partially observed planning • scheduling • constraint satisfaction DeepStack paper mentioned: • defending strategic resources • robust decision making for medical treatment recommendations [Brown, N. and Sandholm, T. (2017). Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science.] [Jaderberg, M.,et al. (2019). Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364:859–865.] [Silver, D.,et al. (2017). Mastering the game of Go without human knowledge. Nature, 550:354–359. ] [Baker, B., et al. (2020). Emergent tool use from multi-agent autocurricula. In ICLR. (hide-and-seek)] [OpenAI (2019). Dota 2 with large scale deep reinforcement learning. ArXiv.] [Silver, D., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144. ] [Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489. ] [Moravcik, M., et al. (2017). DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513.] [Vinyals, O., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575:350–354. ]
  • 13. Games to AI is like fruit flies to genetics. [David Silver RL Course]
  • 16. Multi-Arm Bandits • 25-30% improvements for click- through rate • 18% revenue lift in the landing page [Agarwal, A. et al. (2016). Making contextual decisions with low technical debt. ArXiv.]
  • 17. [Csaba Szepesvári, Bandits - DLRLSS 2019]
  • 18. Decision Service [Agarwal et al. (2016)] abstractions • explore to collect the data • log the data correctly • learn a good model • deploy it in the application • the Client Library implements the Explore abstraction • implements various exploration policies, addressing F1 • the Join Service implements the Log abstraction • joins rewards to decisions • produces correct exploration data to address F2 • enforces a uniform delay before releasing data to the Learn component to avoid delay-related biases, addressing F1 • the exploration data is copied to the Store for offline experimentation • the Online Learner implements the Learn abstraction • incorporates data continuously and checkpoints models to the Deploy component at a configurable rate, addressing F3 • evaluates arbitrary policies in real-time • enables advanced monitoring and safeguards, addressing F4 • the Store implements the Deploy abstraction • provides model and data storage • the Offline Learner uses data for offline experimentation • such as tuning hyper-parameters, evaluating other learning algorithms or policy classes, changing the reward metric, etc., counterfactually accurate • the Feature Generator eases usability by auto-generating features failures • (F1) partial feedback and bias • (F2) incorrect data collection • (F3) changes in the environment • (F4) weak monitoring and debugging https://github.com/Microsoft/mwt-ds http://ds.microsoft.com 2019 Inaugural ACM SIGAI Industry Award for Excellence in Artificial Intelligence multi-world testing vs A/B testing
• 19. Lessons from contextual bandit learning in a customer support bot
• Microsoft Virtual Agent • scenarios: intent disambiguation, contextual recommendations
• consider starting with imitation learning
• consider simplified action spaces
• don't be afraid of principled exploration
• try to support changes in environment
• cautiously close the loop
• consider reward engineering and shaping
• use a separate logging channel
• use and extend existing systems
• pay attention to effective sample size (see the sketch below)
• avoid ε-greedy
• regularize towards the logging policy; this increases the effective sample size, resulting in shorter confidence intervals and reduced overfitting
• design an architecture suited to RL
• balance randomness with predictability
[Karampatziakis, N., et al. (2019). Lessons from real-world reinforcement learning in a customer support bot. In RL4RealLife.]
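To illustrate the effective-sample-size point, here is a minimal inverse-propensity-scoring sketch: when the learned policy stays close to the logging policy, the importance weights stay near 1 and the ESS stays near n. Names are illustrative.

```python
import numpy as np

def ips_value_and_ess(rewards, target_probs, logging_probs):
    """Estimate a new policy's value from logged bandit data via
    inverse propensity scoring, and report the effective sample size
    of the importance weights."""
    w = np.asarray(target_probs) / np.asarray(logging_probs)
    value = float(np.mean(w * np.asarray(rewards)))
    ess = float(w.sum() ** 2 / (w ** 2).sum())
    return value, ess
```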
• 20. RL Applications @ Microsoft
• Personalizer, part of Azure Cognitive Services, within the Azure AI platform
• making its way into more Microsoft products and services: Windows, the Edge browser, Xbox
• developers can plug Azure Cognitive Services into apps and websites
• engineers can use Autonomous Systems to refine manufacturing processes
• Azure Machine Learning previews cloud-based RL offerings for data scientists and ML professionals
• Metrics Advisor incorporates feedback and makes models more adaptive to a customer's dataset, which helps detect more subtle anomalies in sensors, production processes or business metrics
• recommendation, adaptive to the COVID-19 pandemic; finding the optimal jitter buffer for a video meeting; helping determine when to reboot or remediate virtual machines, …
• wide applications:
• deliver tailored recommendations to small grocery stores across Mexico
• manipulate unstable coin bags for a bank in Russia
• collaborate with human players in games for a UK company
With reinforcement learning, Microsoft brings a new class of AI solutions to customers, https://blogs.microsoft.com/ai/reinforcement-learning/
• 21. RecSim: A Configurable Simulation Platform for Recommender Systems
[Ie, E., et al. (2019). Reinforcement learning for slate-based recommender systems: A tractable decomposition and practical methodology. ArXiv.]
[Ie, E., et al. (2019). RecSim: A configurable recommender systems environment. In RL4RealLife.]
[Chen, M., et al. (2019). Top-k off-policy correction for a REINFORCE recommender system. In WSDM.]
[Zhao, X., Xia, L., Tang, J., and Yin, D. (2019). Reinforcement learning for online information seeking. ACM SIGWEB Newsletter (SIGWEB).]
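A sketch of the per-sample weight in the top-k off-policy correction of Chen et al. (2019), simplified: the standard importance ratio between the new policy and the logging policy is multiplied by a correction accounting for the item appearing anywhere in a size-K slate.

```python
def topk_correction_weight(pi_a, beta_a, K):
    """Weight for one logged (state, action) pair in off-policy REINFORCE
    with top-K recommendation: importance ratio times the top-K correction
    K * (1 - pi)^(K - 1), which boosts items the new policy still finds
    unlikely and damps items it already recommends confidently."""
    importance = pi_a / beta_a            # pi: new policy, beta: logging policy
    topk = K * (1.0 - pi_a) ** (K - 1)    # slate-level correction factor
    return importance * topk
```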
• 22. Facebook ReAgent
features
• data preprocessing • feature normalization • deep RL model implementation • multi-GPU training • counterfactual policy evaluation • optimized serving • tested algorithms
applications
• delivering more relevant notifications • optimizing streaming video bit rates • improving M suggestions in Messenger
pipeline
• timeline generation, runs across thousands of CPUs • training, runs across many GPUs • serving, spans thousands of machines
Horizon: The first open source reinforcement learning platform for large-scale products and services, https://code.fb.com/ml-applications/horizon/
[https://github.com/facebookresearch/ReAgent]
[Gauci, J., et al. (2019). Horizon: Facebook's open source applied reinforcement learning platform. In RL4RealLife.]
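Counterfactual policy evaluation is central to shipping RL safely without online experiments; below is a doubly robust estimator as one standard example (ReAgent implements several estimators; this sketch is not its API). `q_hat` and `v_hat` come from an assumed learned reward model.

```python
import numpy as np

def doubly_robust(rewards, q_hat, v_hat, target_probs, logging_probs):
    """Doubly robust off-policy value estimate on logged bandit data:
    a model-based estimate v_hat corrected by an importance-weighted
    residual; consistent if either the model or the weights are right."""
    w = np.asarray(target_probs) / np.asarray(logging_probs)
    return float(np.mean(v_hat + w * (np.asarray(rewards) - q_hat)))
```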
• 23. Ride-Hailing Order Dispatching at DiDi via Reinforcement Learning
INFORMS 2019 Wagner Prize Winner
challenges
• dynamic and stochastic supply and demand • system response time • reliability
• multiple business objectives
• driver-centric objective: maximize the total income of the drivers on the platform
• passenger-centric objective: minimize the average pickup distance of all the assigned orders
• marketplace efficiency metrics: response rate, fulfillment rate
• production requirements and constraints: computational efficiency, system reliability, changing business requirements
solution approaches (see the TD(0) sketch below)
• combinatorial optimization, myopic • semi-Markov decision process • tabular temporal difference learning • deep (hierarchical) RL • transfer learning
[Qin, Z. T., et al. (2020). Ride-hailing order dispatching at DiDi via reinforcement learning. INFORMS Journal on Applied Analytics, 50(5):272–286.]
[Tony Qin talk, https://www.arlseminar.com/speakers/]
[Tony Qin tutorials, https://tonyzqin.wordpress.com]
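A minimal tabular TD(0) sketch in the spirit of the dispatching work, where a value table over (zone, time) cells estimates a driver's expected future income; the state encoding and constants are illustrative.

```python
def td0_update(V, state, reward, next_state, alpha=0.05, gamma=0.99):
    """One tabular TD(0) step: move V[state] toward the bootstrapped
    target reward + gamma * V[next_state]. A dispatcher can then score
    a candidate (driver, order) match by immediate income plus the
    value of the driver's destination cell."""
    target = reward + gamma * V.get(next_state, 0.0)
    V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
    return V

# toy usage: state = (zone_id, time_bucket)
V = {}
td0_update(V, state=(12, 9), reward=8.5, next_state=(30, 10))
```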
• 24. Autonomous navigation of stratospheric balloons using reinforcement learning
• a cost-effective platform for communication, Earth observation, gathering meteorological data and other applications
• to navigate, a flight controller ascends and descends to find and follow favourable wind currents
• vertical motion by pumping air ballast in and out of a fixed-volume envelope; horizontal motion by the winds
• station-keeping: maintaining the balloon within a range of its station, for communication
• input signals: wind speed and solar elevation
• challenges: imperfect data, sparse wind measurements resulting in partial observability, power management
• neither conventional methods nor human intervention suffice
• use reinforcement learning to train a flight controller from simulations
• use data augmentation and a self-correcting design to overcome imperfect data
• robust to the natural diversity in stratospheric winds
• 39-day controlled experiment over the Pacific Ocean
[Bellemare, M. G., et al. (2020). Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588:77–82.]
Saying goodbye to Loon, https://medium.com/loon-for-all/loon-draft-c3fcebc11f3f, Jan 22, 2021
  • 25. [Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., and Hutter, M. (2020). Learning quadrupedal locomotion over challenging terrain. Science Robotics, 5. ] Learning quadrupedal locomotion over challenging terrain
  • 26. [Mao, H., et al. (2019). Park: An open platform for learning augmented computer systems. In NeurIPS.]
• 27. [Mirhoseini, A., et al. (2020). Chip placement with deep reinforcement learning. ArXiv.] [Zoph, B. and Le, Q. V. (2017). Neural architecture search with reinforcement learning. In ICLR.] [Cubuk, E., et al. (2019). AutoAugment: Learning augmentation policies from data. In CVPR.] [Mirhoseini, A., et al. (2017). Device placement optimization with reinforcement learning. In ICML.]
• 28. AutoML-Zero: Evolving Machine Learning Algorithms From Scratch; Evolving Reinforcement Learning Algorithms (AutoRL)
[Esteban Real, Chen Liang, David R. So and Quoc V. Le. (2020). AutoML-Zero: Evolving Machine Learning Algorithms From Scratch. ICML.]
[John D. Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Sergey Levine, Quoc V. Le, Honglak Lee, and Aleksandra Faust. (2021). Evolving Reinforcement Learning Algorithms. ICLR.]
• 29. Combinatorial Optimization
• end-to-end learning
• learning to configure algorithms (hyper-parameters)
• ML alongside optimization algorithms (branch and bound)
[Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. (2021). Machine Learning for Combinatorial Optimization: a Methodological Tour d'Horizon. European Journal of Operational Research, 290(2):405–421.]
[Chen, X. and Tian, Y. (2019). Learning to perform local rewriting for combinatorial optimization. In NeurIPS.]
[Kool, W., van Hoof, H., and Welling, M. (2019). Attention, learn to solve routing problems! In ICLR.]
[Lu, H., Zhang, X., and Yang, S. (2020). A learning-based iterative method for solving vehicle routing problems. In ICLR.]
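As a concrete instance of end-to-end learning for routing, here is a sketch of the REINFORCE signal with a greedy-rollout baseline in the spirit of Kool et al. (2019); the tours are assumed to come from a sampled and a greedy decoding of some policy network (not shown).

```python
import numpy as np

def tour_length(coords, tour):
    """Length of a closed tour over 2-D city coordinates."""
    ordered = coords[np.asarray(tour)]
    return float(np.linalg.norm(np.roll(ordered, -1, axis=0) - ordered, axis=1).sum())

def reinforce_signal(coords, sampled_tour, greedy_tour):
    """Advantage for the policy-gradient update: how much shorter the
    sampled tour is than the greedy baseline rollout."""
    return tour_length(coords, greedy_tour) - tour_length(coords, sampled_tour)
```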
  • 30. SMARTS: Scalable Multi-Agent RL Training School for Autonomous Driving [Zhou, M., et al. (2020). SMARTS: Scalable multi-agent reinforcement learning training school for autonomous driving. In Conference on Robot Learning (CoRL). ]
• 31. Wuji: Automatic Online Combat Game Testing Using Evolutionary Deep Reinforcement Learning
ACM SIGSOFT Distinguished Paper Award
bug types: • crash bugs • stuck bugs • logic bugs • gaming balance bugs • user experience bugs
approach: • automatic game testing framework • evolutionary algorithm • deep RL • multi-objective optimization
[Zheng, Y., et al. (2019). Wuji: Automatic online combat game testing using evolutionary deep reinforcement learning. In ASE.]
• 32. RL for Instructional Sequencing
• RL has been most successful in cases where it has been constrained with ideas and theories from cognitive psychology and the learning sciences
[Doroudi, S., Aleven, V., and Brunskill, E. (2019). Where's the reward? A review of reinforcement learning for instructional sequencing. International Journal of Artificial Intelligence in Education, 29:568–620.]
• 33. • elementary school children learn to manipulate money
• four principal regions: wallet location, repository location, object location, text location
• the ITS dynamically proposes to students the exercises currently making maximal learning progress
• aiming to maximize intrinsic motivation and learning efficiency
• intrinsic motivation is defined as doing an activity for its inherent satisfaction, fun or challenge, rather than for external products, pressures or rewards
[Oudeyer, P.-Y., et al. (2016). Intrinsic motivation, curiosity and learning: theory and applications in educational technologies. Progress in Brain Research, Elsevier, 229:257–284.]
• 34. Flow: the psychology of optimal experience
A good life is one that is characterized by complete absorption in what one does.
• challenge-skill balance • action-awareness merging • clear goals • unambiguous feedback • concentration on the task at hand • sense of control • loss of self-consciousness • transformation of time • an autotelic experience
[Nakamura, J. and Csikszentmihalyi, M. (2014). The concept of flow. In Csikszentmihalyi, M., editor, Flow and the Foundations of Positive Psychology, pages 239–263. Springer.]
• 35. A Generic Approach to Challenge Modeling for the Procedural Creation of Video Game Levels
• vertical arrows represent the challenging events (holes) as unit impulses
• the curve represents the amount of accumulated challenge in the time window
• anxiety depends on the accumulated challenge
• fun as a response to increasing anxiety
• define the reward function with quantitative challenge and fun (see the sketch below)
• use RL to procedurally generate content for Super Mario
[Sorenson, N., Pasquier, P., and DiPaola, S. (2011). A generic approach to challenge modeling for the procedural creation of video game levels. IEEE Transactions on Computational Intelligence and AI in Games, 3(3):229–244.]
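A loose numerical reading of the challenge model, assuming challenge events decay exponentially within the time window; the decay constant and discretization are illustrative, not the paper's exact formulation.

```python
import numpy as np

def accumulated_challenge(event_times, horizon, decay=0.9, dt=1.0):
    """Treat challenge events (e.g., holes) as unit impulses and
    accumulate them with exponential decay; the resulting curve is a
    simple proxy for the anxiety that the reward function responds to."""
    t = np.arange(0.0, horizon, dt)
    c = np.zeros_like(t)
    for i in range(1, len(t)):
        impulse = sum(1.0 for e in event_times if abs(t[i] - e) < dt / 2)
        c[i] = decay * c[i - 1] + impulse
    return t, c
```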
• 36. Mobile Healthcare
• micro-randomized trials, collecting data for offline analysis
• just-in-time adaptive interventions (JITAIs): the next generation of mobile healthcare delivery that is automated, scalable, evidence-driven and inexpensive
• iterating between offline analysis and online personalization
• more data and better algorithms over time gradually improve JITAIs
• use positive psychology to improve adoption, engagement and effect?
• chronic diseases: migraine, diabetes, obesity, etc.
• mental health: schizophrenia, depression, anxiety, etc.
• wellness: fitness management, sedentary behavior, etc.
[Menictas, M., Rabbi, M., Klasnja, P., and Murphy, S. (2019). Artificial intelligence decision-making in mobile health. The Biochemist, 41(5):20–24.]
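Micro-randomized decisions of this kind are often made with bandit algorithms; below is a minimal Bernoulli Thompson-sampling sketch for choosing between intervention options (e.g., send a prompt or not). The Beta(1,1) priors and option set are illustrative.

```python
import random

def thompson_choice(successes, failures):
    """Bernoulli Thompson sampling: draw each option's success rate from
    its Beta posterior and act on the option with the largest draw, so
    exploration tracks posterior uncertainty."""
    draws = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])

# toy usage over two options: (no prompt, prompt)
option = thompson_choice(successes=[4, 7], failures=[16, 13])
```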
• 37. FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance
[FinRL: https://github.com/AI4Finance-LLC/FinRL-Library]
[Wang, J., et al. (2019). AlphaStock: A buying-winners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks. In KDD.]
  • 38. The AI Economist: Improving Equality and Productivity with AI-Driven Tax Policies [Stephan Zheng, Alexander Trott, Sunil Srinivasa, Nikhil Naik, Melvin Gruesbeck, David C. Parkes, and Richard Socher, (2020) The AI Economist: Improving Equality and Productivity with AI-Driven Tax Policies. ArXiv.]
• 39. Data center cooling using model-predictive control
notes for RL in process control:
• the goal can be complex and can have safety constraints
• there can be stability constraints
• there can be state constraints
• failure can be avoided with safety constraints during execution
[Lazic, N., et al. (2018). Data center cooling using model-predictive control. In NeurIPS.]
[Warren B. Powell (2019). From Reinforcement Learning to Optimal Control: A unified framework for sequential decisions. ArXiv.]
[Shin, J., Badgwell, T. A., Liu, K., and Lee, J. H. (2019). Reinforcement learning - overview of recent progress and implications for process control. Computers and Chemical Engineering, 127:282–294.]
[R. Nian, J. Liu and B. Huang (2020). A review on reinforcement learning: Introduction and applications in industrial process control. Computers and Chemical Engineering, 139.]
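A toy model-predictive-control sketch for the cooling setting: simulate each candidate action over a short horizon with an assumed learned model `predict(temp, action) -> (next_temp, energy)`, reject trajectories that violate the temperature constraint, and pick the cheapest feasible action. Purely illustrative of the constrained-control pattern, not the cited system.

```python
import math

def mpc_action(temp, actions, predict, horizon=10, t_max=27.0):
    """One MPC step: constant-action rollouts, a hard state constraint on
    temperature, and minimal predicted energy among feasible candidates."""
    best, best_cost = None, math.inf
    for a in actions:
        t, cost, feasible = temp, 0.0, True
        for _ in range(horizon):
            t, energy = predict(t, a)
            cost += energy
            if t > t_max:          # safety/state constraint violated
                feasible = False
                break
        if feasible and cost < best_cost:
            best, best_cost = a, cost
    return best
```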
• 40. Gym-ANM: Reinforcement learning environments for active network management tasks in electricity distribution systems
[Robin Henry and Damien Ernst (2021). Gym-ANM: Reinforcement learning environments for active network management tasks in electricity distribution systems. ArXiv.]
• 41. Drug design
[Zhavoronkov, A., et al. (2019). Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nature Biotechnology, 37:1038–1040.]
• 42. AlphaFold
• what a protein does largely depends on its unique 3D structure
• protein folding problem: figure out what shapes proteins fold into
• a grand challenge in biology for the past 50 years
[https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology]
• 43. COVID-19
• RL is a promising framework to combat epidemics
• epidemics are a fruitful application area for RL to make substantial real-life impact
• communities of RL and AI, epidemiology and public health, and economics should collaborate to combat epidemics
[Yinyu Ye, Optimization and operations research in mitigation of a pandemic, 2020]
[Yuxi Li, Combat epidemics with reinforcement learning, https://attain.ai, 2020]
• 45. Software engineering for machine learning
• the nine stages of the machine learning workflow
• data-oriented: collection, cleaning, and labeling
• model-oriented: model requirements, feature engineering, training, evaluation, deployment, and monitoring
• many feedback loops in the workflow
• larger feedback arrows: model evaluation and monitoring may loop back to any of the previous stages
• smaller feedback arrow: model training may loop back to feature engineering, e.g., in representation learning
• three aspects of the AI domain that make it fundamentally different from prior software application domains:
• discovering, managing, and versioning the data needed for machine learning applications is much more complex and difficult than in other types of software engineering
• model customization and model reuse require very different skills than are typically found in software teams
• AI components are more difficult to handle as distinct modules than traditional software components; models may be "entangled" in complex ways and experience non-monotonic error behavior
[Amershi, S., et al. (2019). Software engineering for machine learning: A case study. In ICSE.]
• 46. Hidden Technical Debt in Machine Learning Systems
• technical debt: the long-term costs that accumulate when expedient but suboptimal decisions are made in the short term
• ML systems incur massive hidden costs at the system level, beyond the basic code complexity issues of traditional software systems
issues: • boundary erosion • entanglement • hidden feedback loops • undeclared consumers • data dependencies • configuration issues • changes in the external world • system-level anti-patterns
potential solutions: • refactoring code • improving unit tests • deleting dead code • reducing dependencies • tightening APIs • improving documentation
[Sculley, D., et al. (2014). Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).]
  • 47. Challenges in Deploying Machine Learning [Andrei Paleyes, Raoul-Gabriel Urma, Neil D. Lawrence, (2020) Challenges in Deploying Machine Learning: a Survey of Case Studies, ArXiv.]
  • 48. Data set and software engineering • data set for reinforcement learning? • in particular, for contextual bandits • something for RL like ImageNet for deep learning • software engineering for reinforcement learning? • Personalizer / Decision Service by Microsoft
• 49. RL competitions
• AWS DeepRacer League
• Flatland: Multi-Agent Reinforcement Learning on Trains
• KDD Cup 2020: Learning to dispatch and reposition on a mobility-on-demand platform
• Learning to Run a Power Network Challenge
• SMARTS Competition of Autonomous Driving
https://github.com/seungjaeryanlee/awesome-rl-competitions
• 50. Offline RL / Batch RL
• (a) online RL: the policy π_k is updated with streaming data collected by π_k itself
• (b) off-policy RL: the agent's experience is appended to a data buffer D (also called a replay buffer), and each new policy π_k collects additional data, such that D is composed of samples from π_0, π_1, …, π_k, and all of this data is used to train an updated new policy π_{k+1}
• (c) offline RL: employs a dataset D collected by some (potentially unknown) behavior policy π_β; the dataset is collected once and is not altered during training, which makes it feasible to use large previously collected datasets; the training process does not interact with the MDP at all, and the policy is only deployed after being fully trained
[Sergey Levine, Aviral Kumar, George Tucker and Justin Fu. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. ArXiv.]
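A skeleton contrasting offline training with the online loops above, assuming a dataset of (s, a, r, s') tuples and some update rule `q_update` (e.g., a conservative Q-learning step); names are illustrative.

```python
import random

def train_offline(dataset, q_update, steps=10000, batch_size=256):
    """Offline RL skeleton: the dataset D was collected once by a
    behavior policy and never grows; training only samples from D,
    and the learned policy is deployed after training finishes."""
    for _ in range(steps):
        batch = random.sample(dataset, batch_size)  # (s, a, r, s') tuples
        q_update(batch)  # no environment interaction anywhere in the loop
```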
• 51. The challenges of real-world RL
• being able to learn on live systems from limited samples
• dealing with unknown and potentially large delays in the system actuators, sensors, or rewards
• learning and acting in high-dimensional state and action spaces
• reasoning about system constraints that should never or rarely be violated
• interacting with systems that are partially observable, which can alternatively be viewed as systems that are non-stationary or stochastic
• learning from multi-objective or poorly specified reward functions
• being able to provide actions in real time, especially for systems with high control frequencies
• training offline from the fixed logs of an external behavior policy
• providing system operators with explainable policies
[Dulac-Arnold, G., Mankowitz, D., and Hester, T. (2019). Challenges of real-world reinforcement learning. In RL4RealLife.]
• 52. A Model for Motivation, Engagement, and Thriving in the User Experience
• the basic psychological needs of autonomy, competence and relatedness mediate positive user experience outcomes such as engagement, motivation and thriving
• as such, they constitute specific measurable parameters for which designers can design in order to foster these outcomes within different spheres of experience
• self-determination theory and positive psychology: helpful for HCI? for the reward function?
[Peters, D., et al. (2018). Designing for motivation, engagement and wellbeing in digital experience. Frontiers in Psychology, 9(797).]
• 53. The foundation of efficient robot learning
• sample efficient: requiring relatively few training examples
• generalizable: applicable to many situations other than the one(s) it learned from
• compositional: represented in a form that allows it to be combined with previous knowledge
• incremental: capable of adding new knowledge and abilities over time
[Kaelbling, L. P. (2020). The foundation of efficient robot learning. Science, 369(6506):915–916.]
• 54. Prior knowledge/structure in machine learning: bitter or better lesson?
• Richard Sutton: the biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin
• chess, computer Go, speech recognition, computer vision
• leverage the great power of general-purpose methods: search and learning
• the actual contents of minds are very complex
• build in only the meta-methods that can find and capture the arbitrary complexity
• let our methods search for good approximations, rather than doing it ourselves
• Rodney Brooks: we have to take into account the total cost of any solution, and so far they have all required substantial amounts of human ingenuity
• convolutional neural networks, designed by humans to manage translational invariance
• issues like color constancy, needed to avoid a CNN recognizing a stop sign with some pieces of tape on it as a 45 mph speed limit sign
• network architecture design
• massive data sets, amount of computation, power consumption
• Moore's Law slows down; breakdown of Dennard scaling
• special-purpose computer architecture needs human analysis
• Thomas Dietterich: both are right
• Richard: we have achieved significant advances in performance by replacing (some kinds of) human engineering with machine learning from big data
• Rodney: we need to find better ways of encoding knowledge into network structure (or other prior constraints)
Sutton, R. (2019). The bitter lesson. http://incompleteideas.net/IncIdeas/BitterLesson.html
Brooks, R. (2019). A better lesson. https://rodneybrooks.com/a-better-lesson/
• 55. Yoshua Bengio, From System 1 Deep Learning to System 2 Deep Learning, Posner lecture at NeurIPS 2019
The AI community borrows ideas from psychology that underlie a Nobel Prize in economics (Kahneman's System 1 and System 2 thinking).
• 56. Reinforcement Learning, Fast and Slow
• sample inefficiency
• the requirement for incremental parameter adjustment, to maximize generalization and avoid overwriting the effects of earlier learning
• inductive bias; bias–variance trade-off: the stronger the assumptions, the less data needed
• episodic deep RL: fast learning through episodic memory; form useful internal representations or embeddings of each new observation (see the sketch below)
• meta-RL: speed up deep RL by learning to learn; narrow the hypothesis space
• episodic meta-RL: fast learning arises from, and is enabled by, slow learning
• relevance to neuroscience and psychology
[Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., and Hassabis, D. (2019). Reinforcement learning, fast and slow. Trends in Cognitive Sciences, 23(5):408–422.]
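A sketch of the episodic-memory idea behind fast RL: estimate a value by nearest-neighbor lookup over stored embeddings and returns, so a single good experience can immediately influence behavior. The memory layout is an assumption for illustration.

```python
import numpy as np

def episodic_value(memory_keys, memory_returns, embedding, k=5):
    """Episodic control estimate: average the returns of the k stored
    states whose embeddings are closest to the current one."""
    dists = np.linalg.norm(memory_keys - embedding, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(memory_returns[nearest].mean())
```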
• 57. Causality
[Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio (2021). Toward Causal Representation Learning. Proceedings of the IEEE.]
[Elias Bareinboim, Causal Reinforcement Learning, ICML 2020 Tutorial. https://crl.causalai.net/]
  • 58. Representation Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, Yoshua Bengio, (2021). Toward Causal Representation Learning, Proceedings of the IEEE. [Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.] [Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. (2021) A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4-24, Jan. 2021] [Yao Ma and Jiliang Tang. (2020) Deep Learning on Graphs. Cambridge University Press]
  • 59. Interpretability [Belle, V. and Papantonis, I. (2020). Principles and practice of explainable machine learning. ArXiv.] [W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl and Bin Yu. (2019) Definitions, methods, and applications in interpretable machine learning. PNAS 116 (44) 22071–22080] [Finale Doshi-Velez and Been Kim. (2017) Towards A Rigorous Science of Interpretable Machine Learning] [Alharin, A., Doan, T.-N., and Sartipi, M. (2020). Reinforcement learning interpretation methods: A survey. IEEE Access, 8:171058 – 171077.]
• 60. Guidelines for RL in healthcare
• access to all variables influencing decision making
• the effective sample size: larger if the learned policies are close to the clinician policies
• most reliable for refining existing practices rather than discovering new treatment approaches
• the possibilities for mismatch between the actual decision and the proposed decision grow with the number of decisions in the patient's history
• interrogate RL-learned policies to assess whether they will behave prospectively as intended
• consider problem formulation, reward function definition, data recording or preprocessing, transferability to new scenarios, interpretability
(figure notes: problem scale - combinations; effective sample size in off-policy evaluation)
[Gottesman, O., et al. (2019). Guidelines for reinforcement learning in healthcare. Nature Medicine, 25:14–18.]
  • 61. Do no harm: a roadmap for responsible machine learning for health care [Wiens, J. et al. (2019). Do no harm: a roadmap for responsible machine learning for health care. Nature Medicine, 25:1337–1340.] Hippocratic oath for AI?
• 62. Constraints
• satisfying constraints during exploration and operation
• trading off multiple objectives
• autonomous car: safety, efficiency and comfort
• combating COVID-19 vs maintaining economic productivity
[Dulac-Arnold, G., Mankowitz, D., and Hester, T. (2019). Challenges of real-world reinforcement learning. In RL4RealLife.]
[Csaba Szepesvári (2020). Constrained MDPs and the reward hypothesis. https://readingsml.blogspot.com/2020/03/constrained-mdps-and-reward-hypothesis.html]
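One common way to handle such constraints is Lagrangian relaxation: fold the constraint cost into the reward and adapt the multiplier by dual ascent. A minimal sketch, with the cost budget and learning rate as illustrative parameters.

```python
def shaped_reward(reward, cost, lam):
    """Constrained RL via Lagrangian relaxation: the agent maximizes
    reward minus lambda times the constraint cost."""
    return reward - lam * cost

def dual_ascent(lam, avg_cost, budget, lr=0.01):
    """Raise lambda when the policy's average cost exceeds the budget,
    lower it otherwise; lambda stays non-negative."""
    return max(0.0, lam + lr * (avg_cost - budget))
```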
• 63. AI creates a new type of business
• lower gross margins due to heavy cloud infrastructure usage and ongoing human support
• scaling challenges due to the thorny problem of edge cases
• weaker defensive moats due to the commoditization of AI models and challenges with data network effects
• AI companies appear, increasingly, to combine elements of both software and services, with gross margins, scaling, and defensibility that may represent a new class of business entirely
practical advice for founders:
• eliminate model complexity as much as possible
• choose problem domains carefully to reduce data complexity
• plan for high variable costs
• embrace services
• plan for change in the tech stack
• build defensibility the old-fashioned way
[Andreessen Horowitz blog. The new business of AI (and how it's different from traditional software). https://a16z.com/2020/02/16/the-new-business-of-ai-and-how-its-different-from-traditional-software/]
• 64. MLOps: From Model-centric to Data-centric AI
• MLOps' most important task is to make high-quality data available through all stages of the ML project lifecycle
• AI system = Code + Data
• model-centric AI: how can you change the model (code) to improve performance?
• data-centric AI: how can you systematically change your data (inputs x or labels y) to improve performance?
• important frontier: MLOps tools to make data-centric AI an efficient and systematic process
[Andrew Ng, A Chat with Andrew on MLOps: From Model-centric to Data-centric AI, https://tinyurl.com/8dzjmexd]
Bridging AI's POC-to-production gap
• small-data algorithms include synthetic data generation (e.g., generative adversarial networks, GANs), one/few-shot learning (e.g., GPT-3), self-supervised learning, transfer learning, anomaly detection, etc.
• will your model generalize to a different dataset than what it was trained on? a model that works in a published paper often does not work in production
• production AI projects require more than ML code
• manage the change the technology brings: budget enough time, identify all stakeholders, provide reassurance, explain what's happening and why, right-size the first project
• key technical tools: explainable AI, auditing
[Ng, A. (2020). Bridging AI's proof-of-concept to production gap. https://tinyurl.com/u45zer7j]
• 65. [Csaba Szepesvári, DL Day talk @ KDD 2020, https://sites.ualberta.ca/~szepesva/talks.html]
SHOULD I GET INTO RL? • This is RL, that is not RL • Dangers of RL • Safety guarantees • RL is done • RL does not work
RL IS PROBLEMATIC!? • Testing on training data? • Generalization & RL • Focus on simulators • Bad problem • Speed of RL
META MYTHS • Breaking curses • Data/generality wins • SOTA
NEIGHBORS OF RL • Alternatives ≫ RL • Self-supervised learning and RL • Causality
• 66. Dimitri Bertsekas: cautiously positive
• There are enough methods to try with a reasonable chance of success for most types of optimization problems.
• There are no methods that are guaranteed to work for all or even most problems.
• We can begin to address practical problems of unimaginable difficulty!
• There is an exciting journey ahead!
• see more in the slides and the new book on RL and Optimal Control, http://web.mit.edu/dimitrib/www/RLbook.html
• 67. Warren Powell: Sequential Decision Analytics
four classes of policies (schematic stubs below)
• policy function approximations (PFAs): parameterized policies
• cost function approximations (CFAs): upper confidence bounding (UCB)
• value function approximations (VFAs): Q-learning
• direct lookaheads (DLAs): Monte Carlo tree search
• which one is best? meta-learning?
How would Rich Sutton like this?
[https://castlelab.princeton.edu/sda/]
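Schematic stubs for the four policy classes, purely illustrative (all names hypothetical), to show how differently each class computes a decision:

```python
import numpy as np

def pfa(state, theta):
    """Policy function approximation: a directly parameterized rule."""
    return float(np.dot(theta, state))

def cfa(costs, bonus):
    """Cost function approximation: optimize a modified objective,
    e.g., UCB-style cost minus an uncertainty bonus."""
    return int(np.argmin(np.asarray(costs) - np.asarray(bonus)))

def vfa(state, Q, actions):
    """Value function approximation: act greedily w.r.t. a learned Q."""
    return max(actions, key=lambda a: Q[(state, a)])

def dla(state, actions, rollout, n=100):
    """Direct lookahead: evaluate actions by simulating the future;
    Monte Carlo tree search is the prominent instance."""
    return max(actions, key=lambda a: sum(rollout(state, a) for _ in range(n)))
```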
• 68. Foundation
Rich Sutton, AI Debate 2, https://www.youtube.com/watch?v=VOI3Bb3p4GM, around 35'
• David Marr's three levels at which any information processing machine must be understood: computational theory, representation and algorithm, and hardware implementation
• AI has surprisingly little computational theory
• the big ideas are mostly at the middle level of representation and algorithm
• reinforcement learning is the first computational theory of intelligence
• RL is explicitly about the goal, the whats and whys of intelligence
Alekh Agarwal, Akshay Krishnamurthy, and John Langford, FOCS 2020 tutorial on the Theoretical Foundations of Reinforcement Learning, https://hunch.net/~tforl/
• 70. Harvard Business Review: Why AI That Teaches Itself to Achieve a Goal Is the Next Big Thing
How to spot an opportunity for reinforcement learning:
• make a list
• consider other options
• be careful what you wish for
• ask whether it's worth it
• prepare to be patient
https://hbr.org/2021/04/why-ai-that-teaches-itself-to-achieve-a-goal-is-the-next-big-thing
• 71. When is RL helpful?
• when big data are available: from the model, a good simulator, or interaction
• natural science and engineering
• usually with a clear objective function and a standard answer, straightforward to evaluate
• AlphaGo; combinatorial optimization, operations research, optimal control, drug design, etc.
• social science and humanities
• usually "human in the loop"; usually influenced by psychology, behavioural science, etc.; subjective; may not have a standard answer; may not be easy to evaluate
• game design and evaluation, education
• concepts from psychology, e.g., intrinsic motivation and self-determination theory, may serve as a bridge connecting RL/AI with social science and art, e.g., by defining the reward function
• 72. Applications everywhere
• Reinforcement learning solves sequential decision making problems.
• Reinforcement learning intelligently automates previously manually designed strategies, e.g., those based on heuristics.
• 73. [Figure: map of reinforcement learning applications across games, robotics, healthcare, business, management, science, engineering, humanities, finance, education, energy, transportation, and computer systems; example topics: Atari, Go, poker, StarCraft, game theory, PCG, testing, gamification, perception, planning, navigation, locomotion, sim-to-real, smart grid, power mgmt, data center, VRP, inventory, AGV, resource mgmt, neural arch., computer vision, NLP, software, hardware, networks, maths, physics, chemistry, bio, psychology, neural sci., OR, optimal ctrl, music, drawing, economic sectors, recommender, sequencing, motivation, DTRs, mobile, scheduling, pricing, trading, portfolio opt., risk mgmt, e-commerce, customer mgmt, process ctrl, maintenance, traffic signal, order matching, V2X, logistics, manufacture]
[Yuxi Li, Deep Reinforcement Learning: An Overview. ArXiv. https://bit.ly/2AidXm1]
• 74. Resources
• Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT Press
• David Silver, Reinforcement Learning, 2015
• Reinforcement Learning course from Univ. of Alberta on Coursera
• OpenAI Spinning Up in Deep RL
• DeepMind & UCL Advanced Deep Learning and Reinforcement Learning
• Sergey Levine, UC Berkeley, Deep Reinforcement Learning
• https://github.com/ShangtongZhang/reinforcement-learning-an-introduction
• Yuxi Li, Deep Reinforcement Learning: An Overview, arXiv, 2017
• Yuxi Li, Reinforcement Learning Applications, arXiv, 2019 (plan to update soon)
• Yuxi Li, Resources for Deep Reinforcement Learning, medium.com
  • 75. Reinforcement Learning for Real Life • Machine Learning Journal Special Issue • ICML 2021 workshop • 2020 virtual workshop • ICML 2019 workshop
• 76. RL4RealLife Workshop @ ICML 2021, https://sites.google.com/view/RL4RealLife
Co-chairs
Panels: RL Foundation • RL + RecSys • RL + Robotics • RL + OR • RL Explainability & Interpretability • TBA: RL Research-to-RealLife Gap
CFP deadline: June 12
  • 77. Machine Learning Journal Special Issue Reinforcement Learning for Real Life accepted so far, more to come • Dealing with multiple experts and non-stationarity in inverse reinforcement learning: an application to real-life problems • Partially observable environment estimation with uplift inference for reinforcement learning based recommendation • Automatic discovery of interpretable planning strategies • Challenges of Real-World Reinforcement Learning: Definitions, Benchmarks & Analysis • IntelligentPooling: Practical Thompson Sampling for mHealth • Bandit Algorithms to Personalize Educational Chatbots • Air Learning: A Deep Reinforcement Learning Gym for Autonomous Aerial Robot Visual Navigation • Grounded Action Transformation for Sim-to-Real Reinforcement Learning • Inverse Reinforcement Learning in Contextual MDPs • Lessons on off-policy methods from a notification component of a chatbot
• 78. Not widely commercialized yet. Why?
• no "AlphaGo moment" in practice, no killer application, yet
• still challenges with implementation, algorithms and theory
• software engineering, system deployment, technical debt
• business model: software + services, gross margins, scaling, defensive moats
• resources still insufficient: talent, compute, funding
• investment: chicken-and-egg, long-term thinking, trial-and-error spirit
• slow adoption of new technology, esp. by traditional industrial sectors
• technical route: AI+ or +AI? collaborate with domain experts
• learning curve, steeper than deep learning
• more education and training, necessary for all, from engineer to CEO
  • 79. (Deep)RL doesn’t work? Lots of successful stories already. It requires accumulation of knowledge and experience, resources including talents and compute, and patience. [modified a pic from Internet]
  • 80. The time for reinforcement learning is coming. may not be another AlphaGo moment or a killer application may be permeating slowly and gradually promising yet challenging