Machine Learning for
                    Computer Games
                    John E. Laird & Michael van Lent
                      Game Developers Conference
                              March 10, 2005
                    http://ai.eecs.umich.edu/soar/gdc2005


Laird & van Lent           GDC 2005: AI Learning Techniques Tutorial   Page 1
Advertisement
  Artificial Intelligence and Interactive Digital Entertainment
    Conference (AIIDE)
         • June 1-3, Marina Del Rey, CA
         • Invited Speakers:
                   •   Doug Church
                   •   Chris Crawford
                   •   Damian Isla (Halo)
                   •   W. Bingham Gordon
                   •   Craig Reynolds
                   •   Jonathan Schaeffer
                   •   Will Wright

  • www.aiide.org


Laird & van Lent                       GDC 2005: AI Learning Techniques Tutorial   Page 2
Who are We?
   • John Laird (laird@umich.edu)
          •    Professor, University of Michigan, since 1986
          •    Ph.D., Carnegie Mellon University, 1983
          •    Teaching: Game Design and Development for seven years
          •    Research: Human-level AI, Cognitive Architecture, Machine Learning
          •    Applications: Military Simulations and Computer Games

   • Michael van Lent (vanlent@ict.usc.edu)
          • Project Leader, Institute for Creative Technology, University of Southern
            California
          • Ph.D., University of Michigan, 2000
           • Research: Combining AI and commercial game techniques for immersive
             training simulations.
          • Research Scientist on Full Spectrum Command & Full Spectrum Warrior


Laird & van Lent                 GDC 2005: AI Learning Techniques Tutorial              Page 3
Goals for Tutorial
     1.      What is machine learning?
            •      What are the main concepts underlying machine learning?
            •      What are the main approaches to machine learning?
            •      What are the main issues in using machine learning?
     2.      When should it be used in games?
            •      How can it improve a game?
            •      Examples of possible applications of ML to games
            •      When shouldn’t ML be used?
     3.      How do you use it in games?
            •      Important ML techniques that might be useful in computer games.
            •      Examples of machine learning used in actual games.




Laird & van Lent                   GDC 2005: AI Learning Techniques Tutorial         Page 4
What this is not…
     • Not about using learning for board & card games
            • Chess, backgammon, checkers, Othello, poker, blackjack,
              bridge, hearts, …
                   • Usually assumes small set of moves, perfect information, …
            • But a good place to look to learn ML techniques
     • Not a cookbook of how to apply ML to your game
            • No C++ code




Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial    Page 5
Tutorial Overview
     I.         Introduction to learning and games [.75 hour] {JEL}
     II.        Overview of machine learning field [.75 hour] {MvL}
     III. Analysis of specific learning mechanisms [3 hours total]
            •      Decision Trees [.5 hour] {MvL}
            •      Neural Networks [.5 hour] {JEL}
            •      Genetic Algorithms [.5 hour] {MvL}
            •      Bayesian Networks [.5 hour] {MvL}
            •      Reinforcement Learning [1 hour] {JEL}
     IV. Advanced Techniques [1 hour]
            •      Episodic Memory [.3 hour] {JEL}
            •      Behavior capture [.3 hour] {MvL}
            •      Player modeling [.3 hour] {JEL}
     V.         Questions and Discussion [.5 hour] {MvL & JEL}

Laird & van Lent                  GDC 2005: AI Learning Techniques Tutorial   Page 6
Part I
                      Introduction
                           John Laird




Laird & van Lent   GDC 2005: AI Learning Techniques Tutorial   Page 7
What is learning?
   • Learning:
          • “The act, process, or experience of gaining knowledge or skill.”
   • Our general definition:
          • The capture and transformation of information into a usable
            form to improve performance.
   • Possible definitions for games
          • The appearance of improvement in game AI performance
            through experience.
          • Games that get better the longer you play them
          • Games that adjust their tactics and strategy to the player
          • Games that let you train your own characters
          • Really cool games


Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial    Page 8
Why Learning for Games?
  • Improved Game Play
  • Cheaper AI development
         • Avoid programming behaviors by hand
  • Reduce Runtime Computation
         • Replace repeated planning with cached knowledge
  • Marketing Hype




Laird & van Lent           GDC 2005: AI Learning Techniques Tutorial   Page 9
Improved Game Play I
     • Better AI behavior:
            •      More variable
            •      More believable
            •      More challenging
            •      More robust
     • More personalized experience & more replayability
            • AI develops as human develops
            • AI learns model of player and counters player strategies
     • Dynamic Difficulty Adjustment
            • Learns properties of player and dynamically changes game to
              maximize enjoyment

Laird & van Lent                 GDC 2005: AI Learning Techniques Tutorial   Page 10
Improved Game Play II
           • New types of game play
                   • Training characters
                      • Black & White, Lionhead Studios
                   • Create a model of you to compete against others
                      • Forza Motorsport, Microsoft Game Studios for XBOX




Laird & van Lent                    GDC 2005: AI Learning Techniques Tutorial   Page 11
Marketing Hype
      • Not only does it learn from its own mistakes, it also learns from
        yours! You might be able to outthink it the first time, but will
        you outthink it the second, third, and fourth?

     • “Check out the revolutionary A.I. Drivatar™ technology:
       Train your own A.I. "Drivatars" to use the same racing
       techniques you do, so they can race for you in competitions or
       train new drivers on your team. Drivatar technology is the
        foundation of the human-like A.I. in Forza Motorsport.”

     • “Your creature learns from you the entire time. From the way
       you treat your people to the way you act toward your creature, it
       remembers everything you do and its future personality will be
       based on your actions.” Preview of Black and White
Laird & van Lent           GDC 2005: AI Learning Techniques Tutorial    Page 12
Why Not Learning for Games?
  • Worse Game Play
  • More expensive AI Development
  • Increased Runtime Computation
  • Marketing Hype Backfire




Laird & van Lent      GDC 2005: AI Learning Techniques Tutorial   Page 13
Worse Game Play: Less Control
     •    Behavior isn’t directly controlled by game designer
     •    Difficult to validate & predict all future behaviors
     •    AI can get stuck in a rut from learning
     •    Learning can take a long time to visibly change behavior
      •    If the AI learns from a stupid player, you get stupid behavior
            • “Imagine a track filled with drivers as bad as you are, barreling
              into corners way too hot, and trading paint at every opportunity
               possible; sounds fun to us.” – Forza Motorsport




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial    Page 14
Why Not Learning for Games?
  • Worse Game Play
  • More expensive AI Development
         •   Lack of programmers with machine learning experience
         •   More time to develop, test & debug learning algorithms
         •   More time to test range of behaviors
         •   Difficult to “tweak” learned behavior
  • Increased Runtime Computation
         • Computational and memory overhead of learning algorithm
  • Marketing Hype Backfire
         • Prior failed attempts


Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 15
Marketing Hype
     • “I seriously doubt that BC3K is the first title to employ
       this technology at any level. The game has been hyped
       so much that 'neural net' to a casual gamer just became
       another buzzword and something to look forward to. At
       least that's my opinion.”
     • Derek Smart




Laird & van Lent        GDC 2005: AI Learning Techniques Tutorial   Page 16
Alternatives to Learning
                             (How to Fake it)
     • Pre-program in multiple levels of performance
            • Dynamically switch between levels as player advances
            • Provides pre-defined set of behaviors that can be tested
     • Swap in new components [more incremental]
            • Add in more transitions and/or states to a FSM
            • Add in new rules in a rule-based system
     • Change parameters during game play
             •      The number of mistakes the system makes
            •      Accuracy in shooting
            •      Reaction time to seeing enemy
            •      Aggressiveness, …
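
      As a concrete illustration of the parameter-changing bullet above, here is a minimal
      Python sketch (the class name, parameters, and numbers are hypothetical, not from any
      shipped game): the AI "fakes" adaptation by nudging a few bounded parameters between rounds.

import random

class FakeAdaptiveAI:
    """Fakes learning by tuning a few behavior parameters between rounds."""

    def __init__(self):
        self.shot_accuracy = 0.5   # probability that a shot hits
        self.reaction_time = 0.8   # seconds before reacting to the player

    def adjust(self, player_won_last_round):
        # Hand-tuned, bounded adjustments: easy to test, no learning algorithm involved.
        step = 0.05 if player_won_last_round else -0.05
        self.shot_accuracy = min(0.9, max(0.2, self.shot_accuracy + step))
        self.reaction_time = min(1.5, max(0.3, self.reaction_time - step))

    def fire(self):
        return random.random() < self.shot_accuracy

ai = FakeAdaptiveAI()
ai.adjust(player_won_last_round=True)   # player is winning, so the AI gets slightly tougher
print(ai.shot_accuracy, ai.reaction_time)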

Laird & van Lent                 GDC 2005: AI Learning Techniques Tutorial   Page 17
Indirect Adaptation [Manslow]
     • Gather pre-defined data for use by AI decision making
            •      What is my kill rate with each type of weapon?
            •      What is my kill rate in each room?
            •      Where is the enemy most likely to be?
            •      Does opponent usually pass on the left or the right?
             •      How early does the enemy usually attack?


     • AI “Behavior” code doesn’t change
            • Makes testing much easier


     • AI adapts in selected, but significant ways
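
      A small sketch of the same idea in code (hypothetical names throughout): the behavior
      code never changes, it only consults statistics that are updated while the game is played.

from collections import defaultdict

class CombatStats:
    """Pre-defined statistics gathered during play for the AI to consult."""

    def __init__(self):
        self.shots = defaultdict(int)   # weapon -> shots fired at the player
        self.kills = defaultdict(int)   # weapon -> kills scored with that weapon

    def record_shot(self, weapon, killed):
        self.shots[weapon] += 1
        if killed:
            self.kills[weapon] += 1

    def kill_rate(self, weapon):
        return self.kills[weapon] / self.shots[weapon] if self.shots[weapon] else 0.0

def choose_weapon(stats, available):
    # Fixed, testable decision rule: pick the weapon with the best observed kill rate.
    return max(available, key=stats.kill_rate)

stats = CombatStats()
stats.record_shot("rocket", killed=False)
stats.record_shot("shotgun", killed=True)
print(choose_weapon(stats, ["rocket", "shotgun"]))   # -> shotgun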

Laird & van Lent                   GDC 2005: AI Learning Techniques Tutorial   Page 18
Where can we use learning?
     • AIs
            • Change behavior of AI entities
     • Game environment
            • Optimize the game rules, terrain, interface,
              infrastructure, …




Laird & van Lent            GDC 2005: AI Learning Techniques Tutorial   Page 19
When can we use learning?
     • Off-line: during development
            • Train AIs against experts, terrain, each other
            • Automated game balancing, testing, …
     • On-line: during game play
            • AIs adapt to player, environment
            • Dynamic difficulty adjustment




Laird & van Lent            GDC 2005: AI Learning Techniques Tutorial   Page 20
Basic ML Techniques
     • Learning by observation of human behavior
            • Replicate special individual performance
            • Capture variety in expertise, personalities and cultures
            • AI learns from human’s performance
     • Learning by instruction
            • Non programmers instructing AI behavior
            • Player teaches AI to do his bidding
     • Learning from experience
            • Play against other AI and human testers during development
                   • Improve behavior and find bogus behavior
            • Play against the environment
                   • Find places to avoid, hide, ambush, etc.
            • Adapt tactics and strategy to human opponent
Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial          Page 21
Learning by Observation
         [Passive Learning, Imitation, Behavior Capture]
   [Diagram: an Expert or Player interacts with the Game through an Environmental
   Interface (parameters & sensors in, motor commands out). The observed play is
   logged in an Observation Trace Database, which a Learning Algorithm turns into
   AI code/knowledge for the Game AI.]

Laird & van Lent         GDC 2005: AI Learning Techniques Tutorial                  Page 22
Learning by Training

   [Diagram: the Game AI plays in the Game while a Developer or Player supplies
   instruction or a training signal to a Learning Algorithm, which feeds new or
   corrected knowledge back into the Game AI.]

Laird & van Lent                GDC 2005: AI Learning Techniques Tutorial               Page 23
Learning by Experience
                                 [Active Learning]

   [Diagram: the Game AI acts in the Game; a Critic evaluates the outcome and sends
   a training signal or reward to a Learning Algorithm, which uses the features
   involved in learning to feed new or corrected knowledge back into the Game AI.]




Laird & van Lent                       GDC 2005: AI Learning Techniques Tutorial                 Page 24
Game AI Levels
     • Low-level actions & Movement
     • Situational Assessment (Pattern Recognition)
     • Tactical Behavior
     • Strategic Behavior
     • Opponent Model




Laird & van Lent       GDC 2005: AI Learning Techniques Tutorial   Page 25
Actions & Movement: Off Line
     • Capture styles of drivers/fighters/skiers
            • More complex than motion capture
            • Includes when to transition from one animation to another
     • Train AI against environment:
            • ReVolt: genetic algorithm to discover optimal racing paths




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial    Page 26
Actions & Movement: On Line
     • Capture style of player for later competition
            • Forza Motorsport
     • Learn new paths for AI from humans:
            • Command & Conquer Renegade: internal version noticed
              paths taken by humans after terrain changes.




Laird & van Lent            GDC 2005: AI Learning Techniques Tutorial   Page 27
Demos/Example
     • Michiel van de Panne & Ken Alton [UBC]
             • Driving Examples: http://www.cs.ubc.ca/~kalton/icra.html
     • Andrew Ng [Stanford]
             • Helicopter: http://www.robotics.stanford.edu/~ang/




Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial   Page 28
Learning Situational Assessment
     • Learn whether a situation is good or bad
            • Creating an internal model of the environment and relating it to goals
     • Concepts that will be useful in decision making and planning
            • Can learn during development or from experience
     • Examples
            • Exposure areas (used in path planning)
            • Hiding places, sniping places, dangerous places
            • Properties of objects (edible, destructible, valuable, …)




Laird & van Lent                  GDC 2005: AI Learning Techniques Tutorial            Page 29
Learning Tactical Behavior
     • Selecting and executing appropriate tactics
            • Engage, Camp, Sneak, Run, Ambush, Flee, Flee and
              Ambush, Get Weapon, Flank Enemy, Find Enemy, Explore
     • What weapons work best and when
            • Against other weapons, in what environment, …
     • Train up teammates to fight your style, understand your
       commands, …
            • (see talk by John Funge, Xiaoyuan Tu – iKuni, Inc.)
            • Thursday 3:30pm – AK Peters Booth (962)




Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial   Page 30
Learning Strategic Behavior
     • Selecting and executing appropriate strategy
            • Allocation of resources for gathering, tech., defensive,
              offensive
            • Where to place defenses
            • When to attack, who to attack, where to attack, how to attack
            • Leads to a hierarchy of goals




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 31
Settlers of Catan:
                          Michael Pfeiffer
     • Used hierarchical reinforcement learning
            • Co-evolutionary approach
            • Offline: 3000-8000 training games
     • Learned primitive actions:
            • Trading, placing roads, …




     • Learning & prior knowledge gave best results

Laird & van Lent          GDC 2005: AI Learning Techniques Tutorial   Page 32
Review
     • When can we use learning?
            • Off-line
            • On-line
     • Where can we use learning?
            •      Low-level actions
            •      Movement
            •      Situational Assessment
            •      Tactical Behavior
            •      Strategic Behavior
            •      Opponent Model
     • Types of Learning?
            • Learning from experience
            • Learning from training/instruction
            • Learning by observation


Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial   Page 33
References
     • General Machine Learning Overviews:
            • Mitchell: Machine Learning, McGraw Hill, 1997
            • Russell and Norvig: Artificial Intelligence: A Modern
              Approach, 2003
            • AAAI’s page on machine learning:
                    • http://www.aaai.org/Pathfinder/html/machine.html

     • Machine Learning for Games
             • http://www.gameai.com/ - Steve Woodcock’s labor of love
            • AI Game Programming Wisdom
            • AI Game Programming Wisdom 2
            • M. Pfeiffer: Machine Learning Applications in Computer
              Games, MSc Thesis, Graz University of Technology, 2003
            • Nick Palmer: Machine Learning in Games Development:
                    • http://ai-depot.com/GameAI/Learning.html

Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial   Page 34
Part II
                   Overview of Machine Learning
                            Michael van Lent




Laird & van Lent          GDC 2005: AI Learning Techniques Tutorial   Page 35
Talk Overview
     • Machine Learning Background
     • Machine Learning: “The Big Picture”
     • Challenges in applying machine learning
     • Outline for ML Technique presentations




Laird & van Lent      GDC 2005: AI Learning Techniques Tutorial   Page 36
AI for Games
     • Game AI
            • Entertainment is the central goal
                   • The player should win, but only after a close fight
            • Constraints of commercial development
                   • Development schedule, budget, CPU time, memory footprint
                   • Quality assurance
            • The public face of AI?
     • Academic AI
            • Exploring new ideas is the central goal
                   • Efficiency and optimality are desirable
            • Constraints of academic research
                   • Funding, publishing, teaching, tenure
                   • Academics also work on a budget and schedule
            • The next generation of AI techniques?


Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial   Page 37
Talk Overview
     • Machine Learning Background
     • Machine Learning “The Big Picture”
     • Challenges in applying machine learning
     • Outline for ML Technique presentations




Laird & van Lent      GDC 2005: AI Learning Techniques Tutorial   Page 38
AI: a learning-centric view
     Artificial Intelligence requires:
     • Architecture and algorithms
     • Knowledge
     • Interface to the environment




Laird & van Lent          GDC 2005: AI Learning Techniques Tutorial   Page 39
AI: a learning-centric view
     Artificial Intelligence requires:
     • Architecture and algorithms
            •      Search algorithms
            •      Logical & probabilistic inference
            •      Planners
            •      Expert system shells
            •      Cognitive architectures
            •      Machine learning techniques
     • Knowledge
     • Interface to the environment


Laird & van Lent                  GDC 2005: AI Learning Techniques Tutorial   Page 40
AI: a learning-centric view
     Artificial Intelligence requires:
     • Architecture and algorithms
     • Knowledge
            • Knowledge representation
            • Knowledge acquisition
     • Interface to the environment




Laird & van Lent            GDC 2005: AI Learning Techniques Tutorial   Page 41
AI: a learning-centric view
     Artificial Intelligence requires:
     • Architecture and algorithms
     • Knowledge
            • Knowledge representation
                   •   Finite state machines
                   •   Rule-based systems
                   •   Propositional & first-order logic
                   •   Operators
                   •   Decision trees
                   •   Classifiers
                   •   Neural networks
                   •   Bayesian networks
            • Knowledge acquisition
     • Interface to the environment
Laird & van Lent                        GDC 2005: AI Learning Techniques Tutorial   Page 42
AI: a learning-centric view
     Artificial Intelligence requires:
     • Architecture and algorithms
     • Knowledge
            • Knowledge representation
            • Knowledge acquisition
                   • Programming
                   • Knowledge engineering
                   • Machine Learning

     • Interface to the environment




Laird & van Lent                   GDC 2005: AI Learning Techniques Tutorial   Page 43
AI: a learning-centric view
     Artificial Intelligence requires:
     • Architecture and algorithms
     • Knowledge
     • Interface to the environment
            • Sensing
            • Acting




Laird & van Lent          GDC 2005: AI Learning Techniques Tutorial   Page 44
AI: a learning-centric view
     Artificial Intelligence requires:
     • Architecture and algorithms
     • Knowledge
     • Interface to the environment
            • Sensing
                   •   Robotic sensors (sonar, vision, IR, laser, radar)
                   •   Machine vision
                   •   Speech recognition
                   •   Examples
                   •   Environment features
                   •   World models
            • Acting

Laird & van Lent                        GDC 2005: AI Learning Techniques Tutorial   Page 45
AI: a learning-centric view
     Artificial Intelligence requires:
     • Architecture and algorithms
     • Knowledge
     • Interface to the environment
            • Sensing
            • Acting
                   •   Navigation
                   •   Locomotion
                   •   Speech generation
                   •   Robotic actuators




Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial   Page 46
Talk Overview
     • Machine Learning Background
     • Machine Learning “The Big Picture”
     • Challenges in applying machine learning
     • Outline for ML Technique presentations




Laird & van Lent      GDC 2005: AI Learning Techniques Tutorial   Page 47
The Big Picture
     Many different ways to group machine learning fields:
          (in a somewhat general to specific order)
            • by Problem
                   • What is the fundamental problem being addressed?
                   • Broad categorization that groups techniques into a few large classes
            • by Feedback
                   • How much information is given about the right answer?
                   • The more information the easier the learning problem
            • by Knowledge Representation
                   • How is the learned knowledge represented/stored/used?
                   • Tends to be the basis for a technique’s common name
            • by Knowledge Source
                   • Where is the input coming from and in what format?
                   • Somewhat orthogonal to the other groupings


Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial             Page 48
Machine Learning by Problem
     • Classification
            • Classify “instances” as one of a discrete set of “categories”
            • Input is often a list of examples
     • Clustering
            • Given a data set, identify meaningful “clusters”
                   • Unsupervised learning

     • Optimization
            • Given a function f(x) = y, find an input x with a high y value
                   • Supervised learning
            • Classification can be cast as an optimization problem
                   • Function is number of correct classifications on some test set of examples



Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial                   Page 49
Classification Problems
     • Task:
            • Classify “instances” as one of a discrete set of “categories”
     • Input: set of features about the instance to be classified
            • Inanimate = <true, false>
            • Allegiance = <friendly, neutral, enemy>
            • FoodValue = <none, low, medium, high>
     • Output: the category this object fits into
            • Is this object edible? Yes, No
            • How should I react to this object? Attack, Ignore, Heal
     • Examples are often split into two data sets
            • Training data
            • Test data

Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial       Page 50
Example Problem
     Classify how I should react to an object in the world
            •      Facts about any given object include:
                   •   Inanimate = <true, false>
                   •   Allegiance = < friendly, neutral, enemy>
                   •   FoodValue = < none, low, medium, high>
                   •   Health = <low, medium, full>
                   •   RelativeHealth = <weaker, same, stronger>
            •      Output categories include:
                   •   Reaction = Attack
                   •   Reaction = Ignore
                   •   Reaction = Heal
                   •   Reaction = Eat

     •       Inanimate=false, Allegiance=enemy, RelativeHealth=weaker => Reaction=Attack
     •       Inanimate=true, FoodValue=medium => Reaction=Eat
     •       Inanimate=false, Allegiance=friendly, Health=low => Reaction=Heal
     •       Inanimate=false, Allegiance=neutral, RelativeHealth=weaker => Reaction=?
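
      To make the format concrete, here is one purely illustrative way to encode such instances
      and hand-written classification rules in Python (attribute and value names taken from this
      slide); a learning algorithm's job would be to produce the rule table automatically from
      labeled examples rather than having it written by hand.

def matches(condition, instance):
    # A condition is a dict of required attribute values; it fires if all of them match.
    return all(instance.get(attr) == value for attr, value in condition.items())

rules = [
    ({"Inanimate": False, "Allegiance": "enemy", "RelativeHealth": "weaker"}, "Attack"),
    ({"Inanimate": True, "FoodValue": "medium"}, "Eat"),
    ({"Inanimate": False, "Allegiance": "friendly", "Health": "low"}, "Heal"),
]

def classify(instance):
    for condition, reaction in rules:
        if matches(condition, instance):
            return reaction
    return "Ignore"   # default reaction when no rule fires

# The open question on the last line above: no rule matches, so the default applies.
print(classify({"Inanimate": False, "Allegiance": "neutral", "RelativeHealth": "weaker"}))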
Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial             Page 51
Clustering
     Given a list of data points, group them into clusters
            • Like classification without the categories identified
            • Facts about any given object include:
                    •   Inanimate = <true, false>
                    •   Allegiance = < friendly, neutral, enemy>
                    •   FoodValue = < none, low, medium, high>
                    •   Health = <low, medium, full>
                    •   RelativeHealth = <weaker, same, stronger>
            • No categories pre-defined
     • Find a way to group the following into two groups:
            •      Inanimate=false, Allegiance=enemy, RelativeHealth=weaker
            •      Inanimate=true, FoodValue=medium
            •      Inanimate=false, Allegiance=friendly, Health=low
            •      Inanimate=false, Allegiance=neutral, RelativeHealth=weaker



Laird & van Lent                         GDC 2005: AI Learning Techniques Tutorial   Page 52
Optimization
     • Task:
            • Given a function f(x) = y, find an input with a high y value
            • Input (x) can take many forms
                   • Feature string
                   • Set of classification rules
                   • Parse trees of code

     • Example:
             • Let x be an RTS build order: x = [n1, n2, n3, n4, n5, n6, n7, n8]
                    • Each ni names the unit or building to construct as the next action
                   • If a unit or building isn’t available go on to the next action
            • f([n1, n2, n3, n4, n5, n6, n7, n8]) = firepower of resulting units
            • Optimize the build order for maximum firepower
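
      A sketch of this optimization setting in Python: the firepower function below is a made-up
      stand-in for a real game evaluation, and the optimizer is a plain hill climber that swaps
      two build steps and keeps the change whenever f(x) does not get worse.

import random

def firepower(build_order):
    # Hypothetical evaluation: units only count if their prerequisite building came first.
    score, have_barracks, have_factory = 0, False, False
    for item in build_order:
        if item == "barracks":
            have_barracks = True
        elif item == "factory":
            have_factory = True
        elif item == "rifleman" and have_barracks:
            score += 1
        elif item == "tank" and have_factory:
            score += 4
        elif item == "turret":
            score += 2
    return score

def hill_climb(order, iterations=1000):
    best = list(order)
    for _ in range(iterations):
        candidate = list(best)
        i, j = random.sample(range(len(candidate)), 2)   # swap two build steps
        candidate[i], candidate[j] = candidate[j], candidate[i]
        if firepower(candidate) >= firepower(best):
            best = candidate
    return best

start = ["tank", "rifleman", "tank", "barracks", "turret", "factory", "rifleman", "rifleman"]
best = hill_climb(start)
print(best, firepower(best))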


Laird & van Lent                       GDC 2005: AI Learning Techniques Tutorial      Page 53
Machine Learning by Feedback
     • Supervised Learning
            • Correct output is available
            • In Black & White: Examples of things to attack
     • Reinforcement Learning
            • Feedback is available but not correct output
            • In Black & White: Getting slapped for attacking something
     • Unsupervised Learning
            • No hint about correct outputs
            • In Black & White: Just looking for groupings of objects




Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial    Page 54
Supervised Learning
     • Learning algorithm gets the right answers
            • List of examples as input
            • “Teacher” who can be asked for answers
     • Induction
            • Generalize from available examples
                   • If X is true in every example X must always be true
            • Often used to learn decision trees and rules
     • Explanation-based Learning
     • Case-based Learning



Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial   Page 55
Reinforcement Learning
     • Learning algorithm gets positive/negative feedback
            • Evaluation function
            • Rewards from the environment
     • Back propagation
            • Pass a reward back across the previous steps
            • Often paired with Neural Networks
     • Genetic algorithm
            • Parallel search for a very positive solution
            • Optimization technique
     • Q learning
            • Learn the value of taking an action at a state
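
      Q-learning gets a full treatment later in the tutorial, so here is just a one-step sketch
      of the update it performs (state and action names are hypothetical): the value of taking an
      action in a state is nudged toward the observed reward plus the discounted value of the
      best follow-up action.

from collections import defaultdict

Q = defaultdict(float)      # (state, action) -> estimated value
ALPHA, GAMMA = 0.1, 0.9     # learning rate and discount factor

def q_update(state, action, reward, next_state, actions):
    # Standard one-step Q-learning backup.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

actions = ["camp", "flank", "charge"]
q_update("low_health", "camp", reward=1.0, next_state="full_health", actions=actions)
print(Q[("low_health", "camp")])   # -> 0.1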

Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 56
Unsupervised Learning
     • Learning algorithm gets little or no feedback
            • Don’t learn right or wrong answers
     • Just recognize interesting patterns of data
            • Similar to data mining
     • Clustering is a prime example
     • Most difficult class of learning problems




Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial   Page 57
Machine Learning by Knowledge
                           Representation
     • Decision Trees
            • Classification procedure
            • Generally learned by induction
     • Rules
            • Flexible representation with multiple uses
            • Learned by induction, genetic algorithms
     • Neural Networks
            • Simulates layers of neurons
             • Often paired with back propagation
     • Stochastic Models
            • Learning probabilistic networks
            • Takes advantage of prior knowledge

Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 58
Machine Learning by Knowledge
                              Source
     • Examples
            • Supervised Learning
     • Environment
            • Supervised or Reinforcement Learning
     • Observation
            • Supervised Learning
     • Instruction
            • Supervised or Reinforcement Learning
     • Data points
            • Unsupervised Learning


Laird & van Lent            GDC 2005: AI Learning Techniques Tutorial   Page 59
A Formatting Problem
     • Machine learning doesn’t generate knowledge
             • Transfers knowledge present in the input into a more usable
               form
     • Examples => Decision Trees
     • Observations => Rules
     • Data => Clusters




Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial   Page 60
Talk Overview
     • Machine Learning Background
     • Machine Learning “The Big Picture”
     • Challenges in applying machine learning
     • Outline for ML Technique presentations




Laird & van Lent      GDC 2005: AI Learning Techniques Tutorial   Page 61
Challenges
     • What is being learned?
     • Where to get good inputs?
     • What’s the right learning technique?
     • When to stop learning?
     • How to QA learning?




Laird & van Lent       GDC 2005: AI Learning Techniques Tutorial   Page 62
What is being learned?
     • What are you trying to learn?
            • Often useful to have a sense of good answers in advance
            • Machine learning often finds more/better variations
            • Novel, unexpected solutions don’t appear often
     • What are the right features?
            • This can be the difference between success and failure
            • Balance what’s available, what’s useful
            • If features are too good there’s nothing to learn
     • What’s the right knowledge representation?
            • Again, difference between success and failure
            • Must be capable of representing the solution

Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 63
Where to get good inputs?
     • Getting good examples is essential
            • Need enough for useful generalization
            • Need to avoid examples that represent only a subset of the
              space
            • Creating a long list of examples can take a lot of time
     • Human experts
            • Observations, Logs, Traces
            • Examples
     • Other AI systems
            • AI prototypes
            • Similar games

Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial    Page 64
What’s the right learning technique?
     • This often falls out of the other decisions
     • Knowledge representations tend to be associated with
       techniques
            • Decision trees go with induction
            • Neural networks go with back propagation
            • Stochastic models go with Bayesian learning
     • Often valuable to try out more than one approach




Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial   Page 65
When to stop learning?
     • Sometimes more learning is not better
            • More learning might not improve the solution
            • More learning might result in a worse solution
     • Overfitting
            •      Learned knowledge is too specific to the provided examples
            •      Looks very good on training data
            •      Can look good on test data
            •      Doesn’t generalize to new inputs




Laird & van Lent                  GDC 2005: AI Learning Techniques Tutorial     Page 66
How to QA learning?
     • Central challenge in applying machine learning to games
            • Adds a big element of variability into the player’s experience
            • Adds an additional risk factor to the development process
     • Offline learning
            • The result can undergo standard play testing
            • Might be hard or impossible to debug learned knowledge
                   • Neural networks are difficult to understand

     • Online learning
            • Constrain the space learning can explore
                   • Carefully design and bound the knowledge representation
                    • Consider “instincts” or rules that learned knowledge can’t violate
            • Allow players to activate/deactivate learning

Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial           Page 67
Talk Overview
     • Machine Learning Background
     • Machine Learning “The Big Picture”
     • Challenges in applying machine learning
     • Non-learning learning
     • Outline for ML mechanism presentations
            •      Decision Trees
            •      Neural Networks
            •      Genetic Algorithms
            •      Bayesian Networks
            •      Reinforcement Learning


Laird & van Lent                 GDC 2005: AI Learning Techniques Tutorial   Page 68
Outline
     • Background
     • Technical Overview
     • Example
     • Games that have used this mechanism
     • Pros, Cons & Challenges
     • References




Laird & van Lent      GDC 2005: AI Learning Techniques Tutorial   Page 69
General Machine Learning References
     • Artificial Intelligence: A Modern Approach
            • Russell & Norvig
     • Machine Learning
            • Mitchell
     • Gameai.com
     • AI Game Programming Wisdom books
     • Game Programming Gems




Laird & van Lent            GDC 2005: AI Learning Techniques Tutorial   Page 70
Decision Trees &
                    Rule Induction
                       Michael van Lent




Laird & van Lent    GDC 2005: AI Learning Techniques Tutorial   Page 71
The Big Picture
     • Problem
            • Classification
     • Feedback
            • Supervised learning
            • Reinforcement learning
     • Knowledge Representation
            • Decision tree
            • Rules
     • Knowledge Source
            • Examples


Laird & van Lent               GDC 2005: AI Learning Techniques Tutorial   Page 72
Decision Trees
     • Nodes represent attribute tests
            • One child for each possible value of the attribute
     • Leaves represent classifications
     • Classify by descending from root to a leaf
             •      At the root, test the attribute associated with that node
            •      Descend the branch corresponding to the instance’s value
            •      Repeat for subtree rooted at the new node
            •      When a leaf is reached return the classification of that leaf
     • Decision tree is a disjunction of conjunctions of
       constraints on the attribute values of an instance
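
      To see the descent concretely, here is a small sketch (not tutorial code) that stores a
      decision tree as nested tuples and dictionaries and classifies an instance by walking from
      the root to a leaf; the tree mirrors the running example on the following slides.

# Internal nodes are ("attribute", {value: subtree}); leaves are category strings.
tree = ("Allegiance", {
    "friendly": ("Health", {"low": "Heal", "medium": "Heal", "full": "Ignore"}),
    "neutral":  ("Health", {"low": "Heal", "medium": "Ignore", "full": "Ignore"}),
    "enemy":    "Attack",
})

def classify(node, instance):
    while isinstance(node, tuple):            # descend until a leaf is reached
        attribute, branches = node
        node = branches[instance[attribute]]  # follow the branch for this instance's value
    return node

print(classify(tree, {"Allegiance": "friendly", "Health": "low"}))   # -> Heal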

Laird & van Lent                   GDC 2005: AI Learning Techniques Tutorial       Page 73
Example Problem
     Classify how I should react to an object in the world
            •      Facts about any given object include:
                   •   Allegiance = < friendly, neutral, enemy>
                   •   Health = <low, medium, full>
                   •   Animate = <true, false>
                   •   RelativeHealth = <weaker, same, stronger>
            •      Output categories include:
                   •   Reaction = Attack
                   •   Reaction = Ignore
                   •   Reaction = Heal
                   •   Reaction = Eat
                   •   Reaction = Run

     •       <friendly, low, true, weaker> => Heal
     •       <neutral, low, true, same> => Heal
     •       <enemy, low, true, stronger> => Attack
     •       <enemy, medium, true, weaker> => Attack
Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial   Page 74
Classifying with a Decision Tree
      Allegiance?
        Friendly -> Health?
          Low    -> Heal
          Medium -> Heal
          Full   -> Ignore
        Neutral -> Health?
          Low    -> Heal
          Medium -> Ignore
          Full   -> Ignore
        Enemy -> Attack

Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial            Page 75
Classifying with a Decision Tree
      Health?
        Low -> Allegiance?
          Friendly -> Heal
          Neutral  -> Heal
          Enemy    -> Ignore
        Medium -> Allegiance?
          Friendly -> Heal
          Neutral  -> Ignore
          Enemy    -> Ignore
        Full -> Attack

Laird & van Lent                            GDC 2005: AI Learning Techniques Tutorial                    Page 76
Decision Trees are good when:
     • Inputs are attribute-value pairs
            • With fairly small number of values
            • Numeric or continuous values cause problems
                   • Can extend algorithms to learn thresholds

     • Outputs are discrete output values
            • Again fairly small number of values
            • Difficult to represent numeric or continuous outputs
     • Disjunction is required
            • Decision trees easily handle disjunction
     • Training examples contain errors
            • Learning decision trees
            • More later

Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial   Page 77
Learning Decision Trees
     • Decision trees are usually learned by induction
            • Generalize from examples
            • Induction doesn’t guarantee correct decision trees
     • Bias towards smaller decision trees
            • Occam’s Razor: Prefer simplest theory that fits the data
            • Too expensive to find the very smallest decision tree
     • Learning is non-incremental
            • Need to store all the examples
     • ID3 is the basic learning algorithm
            • C4.5 is an updated and extended version


Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 78
Induction
     • If X is true in every example X must always be true
            • More examples are better
            • Errors in examples cause difficulty
            • Note that induction can result in errors
     • Inductive learning of Decision Trees
            • Create a decision tree that classifies the available examples
            • Use this decision tree to classify new instances
             • Avoid overfitting the available examples
                    • One root-to-leaf path for each example
                   • Perfect on the examples, not so good on new instances




Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial   Page 79
Induction requires Examples
     • Where do examples come from?
            • Programmer/designer provides examples
            • Observe a human’s decisions
      • # of examples needed depends on difficulty of the concept
            • More is always better
     • Training set vs. Testing set
            • Train on most (75%) of the examples
            • Use the rest to validate the learned decision trees




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 80
ID3 Learning Algorithm
     • ID3 has two parameters
            • List of examples
            • List of attributes to be tested
     • Generates tree recursively
            • Chooses attribute that best divides the examples at each step

      ID3(examples, attributes)
            if all examples are in the same category then
                    return a leaf node with that category
            if attributes is empty then
                    return a leaf node with the most common category in examples
            best = Choose-Attribute(examples, attributes)
            tree = new tree with best as the root attribute test
            foreach value vi of best
                    examples_vi = subset of examples with best == vi
                    subtree = ID3(examples_vi, attributes – best)
                    add a branch to tree with best == vi and subtree beneath it
            return tree
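
      A compact, runnable Python version of the same recursive procedure, offered as a sketch
      rather than production code (simplified in a few ways: branches are only created for
      attribute values that actually appear in the examples, and there is no pruning).

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def information_gain(examples, attribute):
    # examples is a list of (features_dict, label) pairs
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {features[attribute] for features, _ in examples}:
        subset = [label for features, label in examples if features[attribute] == value]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, attributes):
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                       # all examples in the same category
        return labels[0]
    if not attributes:                              # no attributes left: majority category
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    branches = {}
    for value in {features[best] for features, _ in examples}:
        subset = [(f, label) for f, label in examples if f[best] == value]
        branches[value] = id3(subset, [a for a in attributes if a != best])
    return (best, branches)

examples = [({"Allegiance": "friendly", "Health": "low"}, "Heal"),
            ({"Allegiance": "enemy", "Health": "low"}, "Attack")]
print(id3(examples, ["Allegiance", "Health"]))   # learned tree splits on Allegiance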

Laird & van Lent                         GDC 2005: AI Learning Techniques Tutorial   Page 81
Examples
      • <friendly, low, true, weaker> => Heal
      • <neutral, full, false, same> => Eat
      • <enemy, low, true, weaker> => Eat
      • <enemy, low, true, same> => Attack
      • <neutral, low, true, weaker> => Heal
      • <enemy, medium, true, stronger> => Run
      • <friendly, full, true, same> => Ignore
      • <neutral, full, true, stronger> => Ignore
      • <enemy, full, true, same> => Run
      • <enemy, medium, true, weaker> => Attack
      • <friendly, full, true, weaker> => Ignore
      • <neutral, full, false, stronger> => Ignore
      • <friendly, medium, true, stronger> => Heal

      • 13 examples total: 3 Heal, 2 Eat, 2 Attack, 4 Ignore, 2 Run




Laird & van Lent                       GDC 2005: AI Learning Techniques Tutorial    Page 82
Entropy
      • Entropy: how “mixed” a set of examples is
             • All one category: Entropy = 0
             • Evenly divided among c categories: Entropy = log2(c)
      • For a set of examples S, Entropy(S) = Σi –pi log2(pi),
        where pi is the proportion of S belonging to class i
             • 13 examples with 3 heal, 2 attack, 2 eat, 4 ignore, 2 run
                    • Entropy([3,2,2,4,2]) = 2.258
             • 13 examples, all 13 heal
                    • Entropy([13,0,0,0,0]) = 0
             • Maximum entropy here is log2(5) = 2.322
                    • 5 is the number of categories
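
      A quick numeric check of these figures (a throwaway sketch, not tutorial code):

import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c)

print(entropy([3, 2, 2, 4, 2]))    # ~2.258
print(entropy([13, 0, 0, 0, 0]))   # 0.0
print(math.log2(5))                # ~2.322, the maximum for five categories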




Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial   Page 83
Information Gain
     • Information Gain measures the reduction in Entropy
             • Gain(S,A) = Entropy(S) – Σv (|Sv| / |S|) · Entropy(Sv)
     • Example: 13 examples: Entropy([3,2,2,4,2]) = 2.258
            • Information gain of Allegiance = <friendly, neutral, enemy>
                   •   Allegiance = friendly for 4 examples [2,0,0,2,0]
                   •   Allegiance = neutral for 4 examples [1,1,0,2,0]
                   •   Allegiance = enemy for 5 examples [0,1,2,0,2]
                   •   Gain(S,Allegiance) = 0.903
            • Information gain of Animate = <true, false>
                   • Animate = true for 11 examples [3,1,2,3,2]
                   • Animate = false for 2 examples [0,1,0,1,0]
                   • Gain(S,Animate) = 0.216
            • Allegiance has a higher information gain than Animate
                   • So choose allegiance as the next attribute to be tested
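
      And the corresponding gain calculation, again just as a sketch to verify the 0.903 figure
      for Allegiance (the split counts are taken from this slide; entropy does not depend on the
      order of the categories within each count vector):

import math

def entropy(counts):                     # same helper as in the previous sketch
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c)

def gain(total_counts, splits):
    total = sum(total_counts)
    remainder = sum((sum(s) / total) * entropy(s) for s in splits)
    return entropy(total_counts) - remainder

splits_by_allegiance = [[2, 0, 0, 2, 0],   # friendly
                        [1, 1, 0, 2, 0],   # neutral
                        [0, 1, 2, 0, 2]]   # enemy
print(gain([3, 2, 2, 4, 2], splits_by_allegiance))   # ~0.903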

Laird & van Lent                       GDC 2005: AI Learning Techniques Tutorial   Page 84
Learning Example
     • Information gain of Allegiance
            • 0.903

     • Information gain of Health
            • 0.853

     • Information gain of Animate
            • 0.216

     • Information gain of RelativeHealth
            • 0.442


     • So Allegiance should be the root test


Laird & van Lent        GDC 2005: AI Learning Techniques Tutorial   Page 85
Decision tree so far
      Allegiance?
        Friendly -> ?
        Neutral  -> ?
        Enemy    -> ?

Laird & van Lent        GDC 2005: AI Learning Techniques Tutorial       Page 86
Allegiance = friendly
     • Four examples have allegiance = friendly
            •      Two categorized as Heal
            •      Two categorized as Ignore
            •      We’ll denote this now as [# of Heal, # of Ignore]
            •      Entropy = 1.0
     • Which of the remaining features has the highest info
       gain?
            • Health: low [1,0], medium [1,0], full [0,2] => Gain is 1.0
            • Animate: true [2,2], false [0,0] => Gain is 0
            • RelativeHealth: weaker [1,1], same [0,1], stronger [1,0] =>
              Gain is 0.5
     • Health is the best (and final) choice

Laird & van Lent                   GDC 2005: AI Learning Techniques Tutorial   Page 87
Decision tree so far
      Allegiance?
        Friendly -> Health?
          Low    -> Heal
          Medium -> Heal
          Full   -> Ignore
        Neutral -> ?
        Enemy   -> ?

Laird & van Lent                         GDC 2005: AI Learning Techniques Tutorial       Page 88
Allegiance = enemy
     • Five examples have allegiance = enemy
            •      One categorized as Eat
            •      Two categorized as Attack
            •      Two categorized as Run
            •      We’ll denote this now as [# of Eat, # of Attack, # of Run]
            •      Entropy = 1.5
     • Which of the remaining features has the highest info gain?
            • Health: low [1,1,0], medium [0,1,1], full [0,0,1] => Gain is 0.7
            • Animate: true [1,2,2], false [0,0,0] => Gain is 0
            • RelHealth: weaker [1,1,0], same [0,1,1], stronger [0,0,1] => Gain is 0.7
     • Health and RelativeHealth are equally good choices




Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial          Page 89
Decision tree so far
      Allegiance?
        Friendly -> Health?
          Low    -> Heal
          Medium -> Heal
          Full   -> Ignore
        Neutral -> ?
        Enemy -> Health?
          Low    -> ?
          Medium -> ?
          Full   -> Run

Laird & van Lent                         GDC 2005: AI Learning Techniques Tutorial                         Page 90
Final Decision Tree
      Allegiance?
        Friendly -> Health?
          Low    -> Heal
          Medium -> Heal
          Full   -> Ignore
        Neutral -> RelHealth?
          Weaker   -> Heal
          Same     -> Eat
          Stronger -> Ignore
        Enemy -> Health?
          Low -> RelHealth?
            Weaker   -> Eat
            Same     -> Attack
            Stronger -> Attack
          Medium -> RelHealth?
            Weaker   -> Attack
            Same     -> Attack
            Stronger -> Run
          Full -> Run

Laird & van Lent                         GDC 2005: AI Learning Techniques Tutorial                            Page 91
Generalization
     • Previously unseen examples can be classified
            • Each path through the decision tree doesn’t test every feature
             • <neutral, low, false, stronger> => Ignore
     • Some leaves don’t have corresponding examples
            •      (Allegiance=enemy) & (Health=low) & (RelHealth=stronger)
            •      Don’t have any examples of this case
            •      Generalize from the closest example
             •      <enemy, low, true, same> => Attack
            •      Guess that: <enemy, low, false, stronger> => Attack




Laird & van Lent                 GDC 2005: AI Learning Techniques Tutorial   Page 92
Decision trees in Black & White
     • Creature learns to predict the player’s reactions
            • Instead of categories, range [-1 to 1] of predicted feedback
            • Extending decision trees for continuous values
                   • Divide into discrete categories
                   • …

     • Creature generates examples by experimenting
            • Try something and record the feedback (tummy rub, slap…)
            • Starts to look like reinforcement learning
     • Challenges encountered
            • Ensuring everything that can be learned is reasonable
            • Matching actions with player feedback


Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial   Page 93
Decision Trees and Rules
     • Decision trees can easily be translated into rules
            • and vice versa
      Allegiance?
        Friendly -> Health?
          Low    -> Heal
          Medium -> Heal
          Full   -> Ignore
        Neutral -> Health?
          Low    -> Heal
          Medium -> Ignore
          Full   -> Ignore
        Enemy -> Attack

     If (Allegiance=friendly) & ((Health=low) | (Health=medium)) then Heal
      If (Allegiance=friendly) & (Health=full) then Ignore
     If (Allegiance=neutral) & (Health=low) then Heal
     …
     If (Allegiance=enemy) then Attack
Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial            Page 94
Rule Induction
     • Specific to General Induction
            • First example creates a very specific rule
            • Additional examples are used to generalize the rule
            • If rule becomes too general create a new, disjunctive rule
     • Version Spaces
            • Start with a very specific rule and a very general rule
            • Each new example either
                   • Makes the specific rule more general
                   • Makes the general rule more specific
            • The specific and general rules meet at the solution




Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial   Page 95
Learning Example
     • First example: <friendly, low, true, weaker> => Heal
            • If (Allegiance=friendly) & (Health=low) & (Animate=true) &
              (RelHealth=weaker) then Heal


     • Second example: <neutral, low, true, weaker> => Heal
            • If (Health=low) & (Animate=true) & (RelHealth=weaker) then Heal
                • Overgeneralization?
            • If ((Allegiance=friendly) | (Allegiance=neutral)) & (Health=low) &
              (Animate=true) & (RelHealth=weaker) then Heal


     • Third example: <friendly, medium, true, stronger> =>
       Heal
            • If ((Allegiance=friendly) | (Allegiance=neutral)) & ((Health=low) |
              (Health=medium)) & (Animate=true) & ((RelHealth=weaker) |
              (RelHealth=stronger)) then Heal
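
      A sketch of this specific-to-general step in Python, keeping the internal disjunctions
      exactly as on this slide (a rule is represented as a mapping from each attribute to the
      set of values it accepts):

def new_rule(example):
    # The first example yields a maximally specific rule: one allowed value per attribute.
    return {attr: {value} for attr, value in example.items()}

def generalize(rule, example):
    # Each further positive example adds its values to the allowed sets (a disjunction).
    for attr, value in example.items():
        rule[attr].add(value)
    return rule

rule = new_rule({"Allegiance": "friendly", "Health": "low",
                 "Animate": True, "RelHealth": "weaker"})            # first Heal example
rule = generalize(rule, {"Allegiance": "neutral", "Health": "low",
                         "Animate": True, "RelHealth": "weaker"})    # second Heal example
rule = generalize(rule, {"Allegiance": "friendly", "Health": "medium",
                         "Animate": True, "RelHealth": "stronger"})  # third Heal example
print(rule)   # matches the disjunctive rule at the bottom of this slide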


Laird & van Lent                   GDC 2005: AI Learning Techniques Tutorial        Page 96
Advanced Topics
     • Boosting
            • Manipulate the set of training examples
            • Increase the representation of incorrectly classified examples
     • Ensembles of classifiers
            • Learn multiple classifiers (i.e. multiple decision trees)
                    • All the classifiers vote on the correct answer (one possible approach)
            • “Bagging”: break the training set into overlapping subsets
                   • Learn a classifier for each subset
            • Learn classifiers using different subsets of features
                   • Or different subsets of categories
            • Ensembles can be more accurate than a single classifier



Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial           Page 97
Games that use inductive learning
     • Decision Trees
            • Black & White
     • Rules




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 98
Inductive Learning Evaluation
     • Pros
            •      Decision trees and rules are human understandable
            •      Handle noisy data fairly well
            •      Incremental learning
            •      Online learning is feasible
     • Cons
            • Need many, good examples
            • Overfitting can be an issue
            • Learned decision trees may contain errors
     • Challenges
            • Picking the right features
            • Getting good examples


Laird & van Lent                  GDC 2005: AI Learning Techniques Tutorial   Page 99
References
     •    Mitchell: Machine Learning, McGraw Hill, 1997.
     •    Russell and Norvig: Artificial Intelligence: A Modern Approach, Prentice Hall, 1995.
     •    Quinlan: Induction of decision trees, Machine Learning 1:81-106, 1986.
     •    Quinlan: Combining instance-based and model-based learning,10th International
          Conference on Machine Learning, 1993.
     •    AI Game Programming Wisdom.
     •    AI Game Programming Wisdom 2.




Laird & van Lent                   GDC 2005: AI Learning Techniques Tutorial                Page 100
Neural Networks

                           John Laird




Laird & van Lent    GDC 2005: AI Learning Techniques Tutorial   Page 101
Inspiration
    • Mimic natural intelligence
           • Networks of simple neurons
           • Highly interconnected
           • Adjustable weights on
             connections
           • Learn rather than program
    • Architecture is different
           • Brain is massively parallel
                   • ~10^12 neurons
           • Neurons are slow
                   • Fire 10-100 times a second




Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial   Page 102
Simulated Neuron
  • Neurons are simple computational devices whose power
    comes from how they are connected together
         • Abstractions of real neurons
  • Each neuron has:
         • Inputs/activation from other neurons (aj) [-1, +1]
         • Weights of input (Wi,j) [-1, +1]
         • Output to other neurons (ai)

        [Diagram: input a_j, weighted by W_i,j, feeds Neuron_i, which produces output a_i]




Laird & van Lent                GDC 2005: AI Learning Techniques Tutorial   Page 103
Simulated Neuron
     • Neuron calculates the weighted sum of its inputs (in_i)
       • in_i = Σ_j W_i,j a_j
     • Threshold function g(in_i) calculates the output (a_i)
             • Step function: a_i = 1 if in_i > t, else a_i = 0
             • Sigmoid: a_i = 1 / (1 + e^(-in_i))
     • Output becomes input for the next layer of neurons

        [Diagram: inputs a_j, weighted by W_i,j, are summed to in_i; the output is a_i = g(in_i)]

Laird & van Lent                               GDC 2005: AI Learning Techniques Tutorial       Page 104
Network Structure
     • A single neuron can represent AND and OR, but not XOR
            • Combinations of neurons are more powerful
     • Neurons are usually organized in layers
            • Input layer: takes external input
            • Hidden layer(s)
            • Output layer: produces external output

                                              [Diagram: fully connected layers: Input, Hidden, Output]
Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial            Page 105
Feed-forward vs. recurrent
     • Feed-forward: outputs only connect to later layers
            • Learning is easier




     • Recurrent: outputs connect to earlier layers
            • Internal state




Laird & van Lent               GDC 2005: AI Learning Techniques Tutorial   Page 106
Neural Network for a FPS-bot
                                                          [Inputs: Enemy, Dead, Sound, Low Health]
     • Four input neurons
            • One input for each condition
     • Two-neuron hidden layer
            • Fully connected
            • Forces generalization
     • Five output neurons
            • One output for each action
            • Choose the action with the highest output
            • Or use probabilistic action selection
                                                          [Outputs: Attack, Retreat, Wander, Chase, Spawn]



Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial                                  Page 107
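A sketch of the 4-2-5 bot network described above. The weights are random placeholders rather than learned values, and the helper names are illustrative.

  import math, random

  ACTIONS = ["Attack", "Retreat", "Wander", "Chase", "Spawn"]
  random.seed(0)
  w_hidden = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]   # 4 inputs -> 2 hidden
  w_output = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(5)]   # 2 hidden -> 5 outputs

  def sigmoid(x):
      return 1.0 / (1.0 + math.exp(-x))

  def layer(inputs, weights):
      return [sigmoid(sum(w * a for w, a in zip(row, inputs))) for row in weights]

  def choose_action(enemy, dead, sound, low_health):
      hidden = layer([enemy, dead, sound, low_health], w_hidden)
      outputs = layer(hidden, w_output)
      return ACTIONS[outputs.index(max(outputs))]    # pick the highest-output action

  print(choose_action(1, 0, 1, 0))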
Learning Weights: Back Propagation
     • Learning from examples
            • Examples consist of input and correct output (t)
     • Learn if network’s output doesn’t match correct output
            • Adjust weights to reduce difference
            • Only change weights a small amount (the learning rate α)
     • Basic neuron learning rule
            •      W_i,j = W_i,j + ΔW_i,j
            •      ΔW_i,j = α (t - o) a_j
            •      If the output is too high, (t - o) is negative, so W_i,j will be reduced
            •      If the output is too low, (t - o) is positive, so W_i,j will be increased
            •      If a_j is negative, the opposite happens

Laird & van Lent                    GDC 2005: AI Learning Techniques Tutorial      Page 108
Back propagation algorithm
          Repeat
            Foreach example e in examples do
                 O = Run-Network(network, e)
                 // Calculate error term for the output layer
                 Foreach neuron k in the output layer do
                       Err_k = o_k (1 - o_k)(t_k - o_k)
                 // Calculate error term for the hidden layer
                 Foreach neuron h in the hidden layer do
                       Err_h = o_h (1 - o_h) Σ_k W_k,h Err_k
                 // Update the weights of all neurons
                 Foreach weight W_i,j do
                       W_i,j = W_i,j + α x_i,j Err_j
          Until network has converged

Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 109
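A runnable sketch of the algorithm above for one hidden layer, using NumPy and the slide's sigmoid error terms; the XOR data set, layer sizes, and bias handling are just an illustration, not part of the tutorial.

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  rng = np.random.default_rng(1)
  X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)   # last column = bias input
  T = np.array([[0], [1], [1], [0]], dtype=float)                           # target outputs
  W_h = rng.uniform(-1, 1, (3, 4))    # input(+bias) -> 4 hidden units
  W_o = rng.uniform(-1, 1, (5, 1))    # hidden(+bias) -> 1 output unit
  alpha = 0.5                         # learning rate

  def forward(x):
      h = np.append(sigmoid(x @ W_h), 1.0)       # hidden activations plus a bias unit
      return h, sigmoid(h @ W_o)

  for epoch in range(20000):
      for x, t in zip(X, T):
          h, o = forward(x)
          err_o = o * (1 - o) * (t - o)          # Err_k = o_k(1-o_k)(t_k-o_k)
          err_h = h * (1 - h) * (W_o @ err_o)    # Err_h = o_h(1-o_h) Σ_k W_k,h Err_k
          W_o += alpha * np.outer(h, err_o)      # W = W + α x Err
          W_h += alpha * np.outer(x, err_h[:-1]) # drop the bias unit's error term

  print(np.round([forward(x)[1][0] for x in X], 2))
  # Should approach [0, 1, 1, 0]; a different seed may need more epochs.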
Neural Net Example
     • Single neuron to represent OR
            • Two inputs
            • One output (1 if either input is 1)
            • Step function (if weighted sum > 0.5, output a_i = 1)

        Inputs: a_1 = 1 (weight W_1 = 0.1), a_2 = 0 (weight W_2 = 0.6)
        Σ_j W_j a_j = 0.1,  g(0.1) = 0  ->  output 0 (target 1)

     • Error, so training occurs


Laird & van Lent               GDC 2005: AI Learning Techniques Tutorial       Page 110
Neural Net Example
     • W_j = W_j + ΔW_j
     • ΔW_j = α (t - o) a_j

     • W_1 = 0.1 + 0.1(1 - 0)(1) = 0.2
     • W_2 = 0.6 + 0.1(1 - 0)(0) = 0.6

        Inputs: a_1 = 0, a_2 = 1; weights W_1 = 0.2, W_2 = 0.6
        Σ_j W_j a_j = 0.6,  g(0.6) = 1  ->  output 1 (target 1)

     • No error, so no training occurs

Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial       Page 111
Neural Net Example
        Inputs: a_1 = 1, a_2 = 0; weights W_1 = 0.2, W_2 = 0.6
        Σ_j W_j a_j = 0.2,  g(0.2) = 0  ->  output 0 (target 1)

     • Error, so training occurs
     • W_1 = 0.2 + 0.1(1 - 0)(1) = 0.3
     • W_2 = 0.6 + 0.1(1 - 0)(0) = 0.6

        Inputs: a_1 = 1, a_2 = 1; weights W_1 = 0.3, W_2 = 0.6
        Σ_j W_j a_j = 0.9,  g(0.9) = 1  ->  output 1 (target 1)

Laird & van Lent               GDC 2005: AI Learning Techniques Tutorial       Page 112
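A tiny script that replays the OR example above (threshold 0.5, learning rate 0.1) using the W_j = W_j + α(t - o)a_j rule.

  alpha, threshold = 0.1, 0.5
  w = [0.1, 0.6]

  def output(inputs):
      return 1 if sum(wj * aj for wj, aj in zip(w, inputs)) > threshold else 0

  for inputs in [(1, 0), (0, 1), (1, 0), (1, 1)]:      # the sequence shown above
      target = 1 if any(inputs) else 0                 # OR of the two inputs
      o = output(inputs)
      for j, aj in enumerate(inputs):                  # W_j = W_j + alpha*(t - o)*a_j
          w[j] += alpha * (target - o) * aj
      print(inputs, "output", o, "weights now", w)
  # W_1 goes 0.1 -> 0.2 -> 0.3 across the two errors, matching the worked example.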
Using Neural Networks in Games
     • Classification/function approximation
            • In game or during development
     • Learning to predict the reward associated with a state
            • Can be the core of reinforcement learning
     • Situational Assessment/Classification
            • Feelings toward objects in world or other players
            • Black & White, BattleCruiser: 3000AD (BC3K)
     • Predict enemy action




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 113
Neural Network Example Systems
     • BattleCruiser: 3000AD
            • Guide NPC: Negotiation, trading, combat
     • Black & White
            • Teach creatures desires and preferences
     • Creatures
            • Creature behavior control
     • Dirt Track Racing
            • Race track driving control
     • Heavy Gear
            • Multiple NNs for control


Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial   Page 114
NN Example: B & W
        Three inputs feed a single "Hunger" perceptron:
          • Low Energy:   Source = 0.2, Weight = 0.8, Value = Source * Weight = 0.16
          • Tasty Food:   Source = 0.4, Weight = 0.2, Value = Source * Weight = 0.08
          • Unhappiness:  Source = 0.7, Weight = 0.2, Value = Source * Weight = 0.14
        Sum = 0.16 + 0.08 + 0.14 = 0.38, passed through a threshold to give the Hunger desire
Laird & van Lent                        GDC 2005: AI Learning Techniques Tutorial               Page 115
Neural Networks Evaluation
     • Advantages
            • Handle errors well
            • Graceful degradation
            • Can learn novel solutions
     • Disadvantages
            •      Feed forward doesn’t have memory of prior events
            •      Can’t understand how or why the learned network works
            •      Usually requires experimentation with parameters
            •      Learning takes lots of processing
                    • Incremental so learning during play might be possible
            • Run time cost is related to number of connections
     • Challenges
            • Picking the right features
            • Picking the right learning parameters
            • Getting lots of data


Laird & van Lent                         GDC 2005: AI Learning Techniques Tutorial   Page 116
References
 • General AI Neural Network References:
        • Mitchell: Machine Learning, McGraw Hill, 1997
        • Russell and Norvig: Artificial Intelligence: A Modern Approach, Prentice Hall, 2003
        • Hertz, Krogh & Palmer: Introduction to the theory of neural computation, Addison-
          Wesley, 1991
        • Cowan & Sharp: Neural nets and artificial intelligence, Daedalus 117:85-121, 1988
 • Neural Networks in Games:
        • Penny Sweetser, How to Build Neural Networks for Games
               • AI Programming Wisdom 2
        • Mat Buckland, Neural Networks in Plain English, AI-Junkie.com
        • John Manslow, Imitating Random Variations in Behavior using Neural Networks
               • AI Programming Wisdom, p. 624
        • Alex Champandard, The Dark Art of Neural Networks
               • AI Programming Wisdom, p. 640
        • John Manslow, Using a Neural Network in a Game: A Concrete Example
               • Game Programming Gems 2



Laird & van Lent                       GDC 2005: AI Learning Techniques Tutorial          Page 117
Genetic Algorithms

                        Michael van Lent




Laird & van Lent      GDC 2005: AI Learning Techniques Tutorial   Page 118
Background
     • Evolution creates individuals with higher fitness
            • Population of individuals
                   • Each individual has a genetic code
            • Successful individuals (higher fitness) more likely to breed
                   • Certain codes result in higher fitness
                   • Very hard to know ahead which combination of genes = high fitness
            • Children combine traits of parents
                   • Crossover
                   • Mutation

     • Optimize through artificial evolution
            • Define fitness according to the function to be optimized
            • Encode possible solutions as individual genetic codes
            • Evolve better solutions through simulated evolution

Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial           Page 119
The Big Picture
     • Problem
            • Optimization
            • Classification

     • Feedback
            • Reinforcement learning
     • Knowledge Representation
            • Feature String
            • Classifiers
            • Code (Genetic Programming)
     • Knowledge Source
            • Evaluation function


Laird & van Lent                GDC 2005: AI Learning Techniques Tutorial   Page 120
Genes
     • Gene is typically a string of symbols
            • Frequently a bit string
            • Gene can be a simple function or program
                    • Evolutionary programming

     • Challenges in gene representation
            • Every possible gene should encode a valid solution
     • Common representation
            •      Coefficients
            •      Weights for state transitions in a FSM
            •      Classifiers
            •      Code (Genetic Programming)
            •      Neural network weights


Laird & van Lent                    GDC 2005: AI Learning Techniques Tutorial   Page 121
Classifiers
     • Classification rules encoded as bit strings
            •      Bits 1-3: Allegiance (1=friendly, 2=neutral, 3=enemy)
            •      Bits 4-6: Health (4=low, 5=medium, 6=full)
            •      Bits 7-8: Animate (7=true, 8=false)
            •      Bits 9-11: RelHealth (9=weaker, 10=same, 11=stronger)
            •      Bits 12-16: Action(Attack, Ignore, Heal, Eat, Run)
     • Example
            • If ((Allegiance=friendly) | (Allegiance=neutral)) & ((Health=low) |
              (Health=medium)) & (Animate=true) & ((RelHealth=weaker) |
              (RelHealth=stronger)) then Heal
            • 110 110 10 101 00100
            • Need to ensure that bits 12-16 are mutually exclusive


Laird & van Lent                 GDC 2005: AI Learning Techniques Tutorial          Page 122
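A small decoding sketch for the bit-string layout above; the field table mirrors the bit assignments on the slide.

  FIELDS = [("Allegiance", ["friendly", "neutral", "enemy"]),
            ("Health",     ["low", "medium", "full"]),
            ("Animate",    ["true", "false"]),
            ("RelHealth",  ["weaker", "same", "stronger"]),
            ("Action",     ["Attack", "Ignore", "Heal", "Eat", "Run"])]

  def decode(bits):
      bits = bits.replace(" ", "")
      rule, i = {}, 0
      for name, values in FIELDS:
          chunk = bits[i:i + len(values)]
          rule[name] = [v for v, b in zip(values, chunk) if b == "1"]
          i += len(values)
      return rule

  print(decode("110 110 10 101 00100"))
  # {'Allegiance': ['friendly', 'neutral'], 'Health': ['low', 'medium'],
  #  'Animate': ['true'], 'RelHealth': ['weaker', 'stronger'], 'Action': ['Heal']}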
Genetic Algorithm
          initialize population p with random genes
          repeat
                 foreach pi in p
                   fi = fitness(pi)
                 p’ = empty population
                 repeat
                   parent1 = select(p, f)
                   parent2 = select(p, f)
                   child1, child2 = crossover(parent1, parent2)
                   if (random < mutate_probability)
                           child1 = mutate(child1)
                   if (random < mutate_probability)
                           child2 = mutate(child2)
                   add child1, child2 to p’
                 until p’ is full
                 p = p’


     • Fitness(gene): the fitness function
     • Select(population,fitness): weighted selection of parents
     • Crossover(gene,gene): crosses over two genes
     • Mutate(gene): randomly mutates a gene

Laird & van Lent                            GDC 2005: AI Learning Techniques Tutorial   Page 123
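A runnable Python version of the loop above over 16-bit classifier strings. The fitness function is a stand-in; in a game it would score how well the decoded rule actually performs.

  import random

  GENE_LEN, POP_SIZE, MUTATE_P = 16, 20, 0.1

  def random_gene():
      return "".join(random.choice("01") for _ in range(GENE_LEN))

  def fitness(gene):
      return gene.count("1")                               # placeholder fitness function

  def select(pop, fits):
      return random.choices(pop, weights=fits, k=1)[0]     # fitness-weighted selection

  def crossover(p1, p2):
      a, b = sorted(random.sample(range(GENE_LEN), 2))     # two random points
      return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

  def mutate(gene):
      i = random.randrange(GENE_LEN)                       # flip one random bit
      return gene[:i] + ("1" if gene[i] == "0" else "0") + gene[i + 1:]

  population = [random_gene() for _ in range(POP_SIZE)]
  for generation in range(50):
      fits = [fitness(g) for g in population]
      next_pop = []
      while len(next_pop) < POP_SIZE:
          c1, c2 = crossover(select(population, fits), select(population, fits))
          if random.random() < MUTATE_P: c1 = mutate(c1)
          if random.random() < MUTATE_P: c2 = mutate(c2)
          next_pop += [c1, c2]
      population = next_pop

  print(max(population, key=fitness))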
Genetic Operators
     • Crossover
            • Select two points at random
            • Swap genes between two points




     • Mutate
            • Small probability of randomly changing each part of a gene




Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial   Page 124
Example: Evaluation
     • Initial Population:
            • 110 110 10 110 01000
                    (friendly | neutral) & (low | medium) & (true) & (weaker | same) => Ignore
            • 001 010 00 101 00100
                    (enemy) & (medium) & (weaker | stronger) => Heal
            • 010 001 11 111 10000
                    (neutral) & (full) => Attack
            • 000 101 01 010 00010
                    (low | full) & (false) & (same) => Eat

     • Evaluation:
            •      110 110 10 110 01000: Fitness score = 47
            •      010 001 11 111 10000: Fitness score = 23
            •      000 101 01 010 00010: Fitness score = 39
            •      001 010 00 101 00100: Fitness score = 12


Laird & van Lent                         GDC 2005: AI Learning Techniques Tutorial              Page 125
Example: Genetic Operators
     • Crossover:
            Parents:
            • 110 110 10 110 01000
            • 000 101 01 010 00010
            Crossover after bit 7:
            • 110 110 1 | 1 010 00010
            • 000 101 0 | 0 110 01000
     • Mutations
            • 110 110 11 011 00010
            • 000 101 00 110 01000
     • Evaluate the new population
     • Repeat


Laird & van Lent               GDC 2005: AI Learning Techniques Tutorial   Page 126
Advanced Topics
     • Competitive evaluation
            • Evaluate each gene against the rest of the population
     • Genetic programming
            • Each gene is a chunk of code
            • Generally represented as a parse tree
     • Punctuated Equilibria
            • Evolve multiple parallel populations
            • Occasionally swap members
            • Identifies a wider range of high fitness solutions




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 127
Games that use GAs
     • Creatures
            • Creatures 2
            • Creatures 3
            • Creatures Adventures
     • Seaman
     • Nooks & Crannies
     • Return Fire II




Laird & van Lent            GDC 2005: AI Learning Techniques Tutorial   Page 128
Genetic Algorithm Evaluation
     • Pros
            • Powerful optimization technique
                   • Parallel search of the space
            • Can learn novel solutions
            • No examples required to learn
     • Cons
            • Evolution takes lots of processing
                   • Not very feasible for online learning
            • Can’t guarantee an optimal solution
            • May find uninteresting but high fitness solutions
     • Challenges
            • Finding correct representation can be tricky
                   • The richer the representation, the bigger the search space
            • Fitness function must be carefully chosen
Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial   Page 129
References
     •    Mitchell: Machine Learning, McGraw Hill, 1997.
     •    Holland: Adaptation in natural and artificial systems, MIT Press 1975.
     •    Back: Evolutionary algorithms in theory and practice, Oxford University Press 1996.
     •    Booker, Goldberg, & Holland: Classifier systems and genetic algorithms, Artificial
          Intelligence 40: 235-282, 1989.
     •    AI Game Programming Wisdom.
     •    AI Game Programming Wisdom 2.




Laird & van Lent                   GDC 2005: AI Learning Techniques Tutorial                   Page 130
Bayesian Learning
                        Michael van Lent




Laird & van Lent     GDC 2005: AI Learning Techniques Tutorial   Page 131
The Big Picture
     • Problem
            • Classification
            • Stochastic Modeling
     • Feedback
            • Supervised learning
     • Knowledge Representation
            • Bayesian classifiers
            • Bayesian Networks
     • Knowledge Source
            • Examples


Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 132
Background
     • Most learning approaches learn a single best guess
            • Learning algorithm selects a single hypothesis
            • Hypothesis = Decision tree, rule set, neural network…
     • Probabilistic learning
            •      Learn the probability that a hypothesis is correct
            •      Identify the most probable hypothesis
            •      Competitive with other learning techniques
            •      A single example doesn’t eliminate any hypothesis
     • Notation
            •      P(h): probability that hypothesis h is correct
            •      P(D): probability of seeing data set D
            •      P(D|h): probability of seeing data set D given that h is correct
            •      P(h|D): probability that h is correct given that D is seen


Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial       Page 133
Bayes Rule
     • Bayes rule is the foundation of Bayesian learning

                      P(h|D) = P(D|h) P(h) / P(D)

     • As P(D|h) increases, so does P(h|D)
     • As P(h) increases, so does P(h|D)
     • As P(D) increases, probability of P(h|D) decreases




Laird & van Lent          GDC 2005: AI Learning Techniques Tutorial   Page 134
Example
     • A monster has two attacks, A and B:
            • Attack A does 11-20 damage and is used 10% of the time
            • Attack B does 16–115 damage and is used 90% of the time
            • You have counters A’ (for attack A) and B’ (for attack B)
     • If an attack does 16-20 damage, which counter to use?
            • P(A|damage=16-20) greater or less than 50%?


     • We don’t know P(A|16-20)
            • We do know P(A), P(B), P(16-20|A), P(16-20|B)
            • We only need P(16-20)
            • P(16-20) = P(A) P(16-20|A) + P(B) P(16-20|B)

Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial    Page 135
Example (cont’d)
     • Some probabilities
            •      P(A) = 10%
            •      P(B) = 90%
            •      P(16-20|A) = 50%
            •      P(16-20|B) = 5%

                       P(A | 16-20) = P(16-20 | A) P(A) / P(16-20)
                                    = (0.5)(0.1) / [(0.1)(0.5) + (0.9)(0.05)]
                                    = 0.05 / (0.05 + 0.045) = 0.05 / 0.095 = 0.5263 = 52.63%

     • So counter A’ is the slightly better choice
Laird & van Lent                       GDC 2005: AI Learning Techniques Tutorial      Page 136
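The same computation, spelled out in a few lines of Python:

  p_A, p_B = 0.10, 0.90
  p_dmg_given_A, p_dmg_given_B = 0.50, 0.05

  p_dmg = p_A * p_dmg_given_A + p_B * p_dmg_given_B     # P(16-20) = 0.095
  p_A_given_dmg = p_dmg_given_A * p_A / p_dmg           # Bayes rule
  print(round(p_A_given_dmg, 4))                        # 0.5263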
Bayes Optimal Classifier
     • Given data D, what’s the probability that a new example
       falls into category c
     • P(example=c|D) or P(c|D)
     • Best classification is highest P(c|D)
                      c* = argmax_{ci∈C} P(ci|D) = argmax_{ci∈C} Σ_{hj∈H} P(ci|hj) P(hj|D)



     • This approach tends to be computationally expensive
            • Space of hypothesis is generally very large




Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial   Page 137
Example Problem
     Classify how I should react to an object in the world
            •      Facts about any given object include:
                   •   Allegiance = < friendly, neutral, enemy>
                   •   Health = <low, medium, full>
                   •   Animate = <true, false>
                   •   RelativeHealth = <weaker, same, stronger>
            •      Output categories include:
                   •   Reaction = Attack
                   •   Reaction = Ignore
                   •   Reaction = Heal
                   •   Reaction = Eat
                   •   Reaction = Run

     •       <friendly, low, true, weaker> => Heal
     •       <neutral, low, true, same> => Heal
     •       <enemy, low, true, stronger> => Attack
     •       <enemy, medium, true, weaker> => Attack
Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial   Page 138
Naïve Bayes Classifier
     • Each example is a set of feature values
            • friendly, low, true, weaker
     • Given a set of feature values, find the most probable category
     • Which is highest:
            •      P(Attack | friendly, low, true, weaker)
            •      P(Ignore | friendly, low, true, weaker)
            •      P(Heal | friendly, low, true, weaker)
            •      P(Eat | friendly, low, true, weaker)
            •      P(Run | friendly, low, true, weaker)


                                    c_nb = argmax_{ci∈C} P(ci | f1, f2, f3, f4)




Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial   Page 139
Calculating Naïve Bayes Classifier
                          c_nb = argmax_{ci∈C} P(ci | f1, f2, f3, f4)

                               = argmax_{ci∈C} P(f1, f2, f3, f4 | ci) P(ci) / P(f1, f2, f3, f4)

                               = argmax_{ci∈C} P(f1, f2, f3, f4 | ci) P(ci)

     • Simplifying assumption: each feature in the example is independent given the category
            • The value of Allegiance doesn’t affect the value of Health, Animate, or
              RelativeHealth

                          P(f1, f2, f3, f4 | ci) = Π_j P(fj | ci)

                          c_nb = argmax_{ci∈C} P(ci) Π_j P(fj | ci)


Laird & van Lent                   GDC 2005: AI Learning Techniques Tutorial       Page 140
Example
     • Slightly modified 13 examples:
            •      <friendly, low, true, weaker> => Heal
            •      <neutral, full, false, stronger> => Eat
            •      <enemy, low, true, weaker> => Eat
            •      <enemy, low, true, same> => Attack
            •      <neutral, low, true, weaker> => Heal
            •      <enemy, medium, true, stronger> => Run
            •      <friendly, full, true, same> => Ignore
            •      <neutral, full, true, stronger> => Ignore
            •      <enemy, full, true, same> => Run
            •      <enemy, medium, true, weaker> => Attack
            •      <enemy, low, true, weaker> => Ignore
            •      <neutral, full, false, stronger> => Ignore
            •      <friendly, medium, true, stronger> => Heal
     • Estimate the most likely classification of:
            • <enemy, full, true, stronger>



Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial   Page 141
Example
     • Need to calculate:
            • P(Attack| <enemy, full, true, stronger>)
                   = P(Attack) P(enemy|Attack) P(full|Attack) P(true|Attack) P(stronger|Attack)

            • P(Ignore| <enemy, full, true, stronger>)
                   = P(Ignore) P(enemy|Ignore) P(full|Ignore) P(true|Ignore) P(stronger|Ignore)

            • P(Heal| <enemy, full, true, stronger>)
                   = P(Heal) P(enemy|Heal) P(full|Heal) P(true|Heal) P(stronger|Heal)

            • P(Eat| <enemy, full, true, stronger>)
                   = P(Eat) P(enemy|Eat) P(full|Eat) P(true|Eat) P(stronger|Eat)

            • P(Run| <enemy, full, true, stronger>)
                   = P(Run) P(enemy|Run) P(full|Run) P(true|Run) P(stronger|Run)



Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial                   Page 142
Example (cont’d)
            • P(Ignore| <enemy, full, true, stronger>)
                   = P(Ignore) P(enemy|Ignore) P(full|Ignore) P(true|Ignore) P(stronger|Ignore)

                   P(Ignore) = 4 of 13 examples = 4/13 = 31%
                   P(enemy|Ignore) = 1 of 4 examples = ¼ = 25%
                   P(full|Ignore) = 3 of 4 examples = ¾ = 75%
                   P(true|Ignore) = 3 of 4 examples = ¾ = 75%
                   P(stronger|Ignore) = 2 of 4 examples = 2/4 = 50%

                   P(Ignore| <enemy, full, true, stronger>) = 2.2%




Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial                   Page 143
Example (cont’d)
            • P(Run| <enemy, full, true, stronger>)
                   = P(Run) P(enemy|Run) P(full|Run) P(true|Run) P(stronger|Run)

                   P(Run) = 2 of 13 examples = 2/13 = 15%
                   P(enemy|Run) = 2 of 2 examples = 100%
                   P(full|Run) = 1 of 2 examples = 50%
                   P(true|Run) = 2 of 2 examples = 100%
                   P(stronger|Run) = 1 of 2 examples = 50%

                   P(Run| <enemy, full, true, stronger>) = 3.8%




Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial     Page 144
Result
     •    P(Ignore| <enemy, full, true, stronger>) = 2.2%
     •    P(Run| <enemy, full, true, stronger>) = 3.8%
     •    P(Eat| <enemy, full, true, stronger>) = 0.1%
     •    P(Heal| <enemy, full, true, stronger>) = 0%
     •    P(Attack| <enemy, full, true, stronger>) = 0%

     • So the Naïve Bayes Classifier says Run is most probable
        • 63% chance that Run is correct
        • 36% chance that Ignore is correct
        • 1% chance that Eat is correct


Laird & van Lent          GDC 2005: AI Learning Techniques Tutorial   Page 145
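A sketch of the calculation above as code, using the 13 training examples from the earlier slide and the unnormalized P(c_i) Π_j P(f_j | c_i) score.

  from collections import Counter, defaultdict

  EXAMPLES = [
      (("friendly", "low",    "true",  "weaker"),   "Heal"),
      (("neutral",  "full",   "false", "stronger"), "Eat"),
      (("enemy",    "low",    "true",  "weaker"),   "Eat"),
      (("enemy",    "low",    "true",  "same"),     "Attack"),
      (("neutral",  "low",    "true",  "weaker"),   "Heal"),
      (("enemy",    "medium", "true",  "stronger"), "Run"),
      (("friendly", "full",   "true",  "same"),     "Ignore"),
      (("neutral",  "full",   "true",  "stronger"), "Ignore"),
      (("enemy",    "full",   "true",  "same"),     "Run"),
      (("enemy",    "medium", "true",  "weaker"),   "Attack"),
      (("enemy",    "low",    "true",  "weaker"),   "Ignore"),
      (("neutral",  "full",   "false", "stronger"), "Ignore"),
      (("friendly", "medium", "true",  "stronger"), "Heal"),
  ]

  category_counts = Counter(c for _, c in EXAMPLES)
  feature_counts = defaultdict(Counter)        # (category, feature index) -> value counts
  for features, category in EXAMPLES:
      for j, value in enumerate(features):
          feature_counts[(category, j)][value] += 1

  def naive_bayes(features):
      scores = {}
      for c, n_c in category_counts.items():
          p = n_c / len(EXAMPLES)                          # P(c_i)
          for j, value in enumerate(features):
              p *= feature_counts[(c, j)][value] / n_c     # product of P(f_j | c_i)
          scores[c] = p
      return scores

  scores = naive_bayes(("enemy", "full", "true", "stronger"))
  print(max(scores, key=scores.get))    # Run scores highest, matching the result above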
Estimating Probabilities
     • Need lots of examples for accurate estimates
     • With only 13 examples:
            • No example of:
                   •   Health=full for Attack category
                   •   RelativeHealth=Stronger for Attack
                   •   Allegiance=enemy for Heal
                   •   Health=full for Heal
            • Only two examples of Run
                   • P(f1|Run) can only be 0%, 50%, or 100%
                   • What if the true probability is 16.2%?

     • Need to add a factor to probability estimates that:
            • Prevents missing examples from dominating
            • Estimates what might happen with more examples

Laird & van Lent                       GDC 2005: AI Learning Techniques Tutorial   Page 146
m-estimate
     • Solution: m-estimate
            • Establish a prior estimate p
                   • Expert input
                   • Assume uniform distribution
            • Estimate the probability as:
                                          (n_c + m p) / (n + m)
            • m is the equivalent sample size
            • Augment n observed samples with m virtual samples
     • If there are no examples (nc = 0) estimate is still > 0%
     • If p(run) = 20% and m = 10 then P(full|Run):
            • Goes from 50% (1 of 2 examples)
            • to 25%:  (1 + 10(0.2)) / (2 + 10) = 3/12 = 0.25
Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial            Page 147
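The m-estimate as a one-line helper, reproducing the P(full|Run) adjustment above:

  def m_estimate(n_c, n, p, m):
      return (n_c + m * p) / (n + m)        # n_c observed hits, n samples, prior p, m virtual samples

  print(m_estimate(n_c=1, n=2, p=0.2, m=10))   # P(full|Run): 0.5 -> 0.25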
Bayesian Networks
     • Graph structure encoding causality between variables
            • Directed, acyclic graph
            • A -> B indicates that A directly influences B
                   • Positive or negative influence



        [Diagram: Attack A and Attack B nodes with arcs down to Damage 11-15, Damage 16-20, and Damage 21-115 nodes]




Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial    Page 148
Another Bayesian Network
        [Bayesian network diagram:
           Intruder (P(I) = 10%) and Rat (P(R) = 40%) both influence Noise:
              P(N|I,R) = 95%, P(N|I,not R) = 30%, P(N|not I,R) = 60%, P(N|not I,not R) = 2%
           Noise influences Guard1 Report (P(G1|N) = 90%, P(G1|not N) = 5%)
           and Guard2 Report (P(G2|N) = 70%, P(G2|not N) = 1%)]
     • Inference on Bayesian Networks can determine probability of unknown nodes
       (Intruder) given some known values
            • If Guard2 reports but Guard 1 doesn’t, what’s the probability of Intruder?

Laird & van Lent                    GDC 2005: AI Learning Techniques Tutorial              Page 149
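A brute-force inference-by-enumeration sketch that answers the question on this slide with the probabilities shown; practical systems use more efficient inference, but enumeration is enough for a network this small.

  from itertools import product

  P_I, P_R = 0.10, 0.40
  P_N = {(True, True): 0.95, (True, False): 0.30,
         (False, True): 0.60, (False, False): 0.02}
  P_G1_given_N = {True: 0.90, False: 0.05}
  P_G2_given_N = {True: 0.70, False: 0.01}

  def joint(i, r, n, g1, g2):
      p = (P_I if i else 1 - P_I) * (P_R if r else 1 - P_R)
      p *= P_N[(i, r)] if n else 1 - P_N[(i, r)]
      p *= P_G1_given_N[n] if g1 else 1 - P_G1_given_N[n]
      p *= P_G2_given_N[n] if g2 else 1 - P_G2_given_N[n]
      return p

  # P(Intruder | Guard2 reports, Guard1 does not): sum out Rat and Noise.
  num = sum(joint(True, r, n, False, True) for r, n in product([True, False], repeat=2))
  den = sum(joint(i, r, n, False, True) for i, r, n in product([True, False], repeat=3))
  print(round(num / den, 3))    # roughly 0.16 with these numbers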
Learning Bayesian Networks
     • Learning the topology of Bayesian networks
            • Search the space of network topologies
                    •   Adding arcs, deleting arcs, reversing arcs
                    •   Are independent nodes in the network independent in the data?
                    •   Does the network explain the data?
                    •   Need to weight towards fewer arcs

     • Learning the probabilities of Bayesian networks
            •      Experts are good at constructing networks
            •      Experts aren’t as good at filling in probabilities
            •      Expectation Maximization (EM) algorithm
            •      Gibbs Sampling



Laird & van Lent                        GDC 2005: AI Learning Techniques Tutorial       Page 150
Bayesian Learning Evaluation
     • Pros
            •      Takes advantage of prior knowledge
            •      Probabilistic predictions (prediction confidence)
            •      Handles noise well
            •      Incremental learning
     • Cons
            • Less effective with low number of examples
            • Can be computationally expensive
     • Challenges
            • Identifying the right features
            • Getting a large number of good examples

Laird & van Lent                   GDC 2005: AI Learning Techniques Tutorial   Page 151
References
     •    Mitchell: Machine Learning, McGraw Hill, 1997.
     •    Russell and Norvig: Artificial Intelligence: A Modern Approach, Prentice Hall, 1995.
     •    AI Game Programming Wisdom.




Laird & van Lent                   GDC 2005: AI Learning Techniques Tutorial                Page 152
Reinforcement Learning

                                            John Laird

  Thanks for online reference material to: Satinder Singh, Yijue Hou & Patrick Doyle




Laird & van Lent                  GDC 2005: AI Learning Techniques Tutorial            Page 153
Outline of Reinforcement Learning
     • What is it?
     • When is it useful?
     • Examples from games
     • Analysis




Laird & van Lent       GDC 2005: AI Learning Techniques Tutorial   Page 154
Reinforcement Learning
     • A set of problems, not a single technique:
            • Adaptive Dynamic Programming
            • Temporal Difference Learning
            • Q learning
     • Cover story for Neural Networks, Decision Trees, etc.
     • Best for tuning behaviors
            • Often requires many training trials to converge
     • Very general technique applicable to many problems
            • Backgammon, poker, helicopter flying, truck & car driving




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 155
Reinforcement Learning
     • Agent receives some reward/punishment for behavior
            • Is not told directly what to do or what not to do
                   • Only whether it has done well or poorly
            • Reward can be intermittent and is often delayed
            • Must solve the temporal credit assignment problem
            • How can it learn to select actions that occur before reward?
        [Diagram: the Game environment feeds a Critic, which produces a Reward; the Learning Algorithm uses the reward to add new or corrected knowledge to the Game AI]
Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial   Page 156
Deathmatch Example
     • Learn to kill the enemy better

     • Possible rewards for Halo
            • +10 kill enemy
            • -3 killed

     • State features
            •      Health, enemy health
            •      Weapon, enemy weapon
            •      Relative position and facing of enemy
            •      Absolute and relative speeds
            •      Relative positions of nearby obstacles

        [Diagram: an alternating state -> action -> state -> action trajectory]




Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial      Page 157
Two Approaches to Reinforcement Learning
     • Passive learning = behavior cloning
            • Examples of behavior are presented to learner
                   • Learn a model of a human player
            • Tries to learn a single optimal policy
     • Active learning = learning from experience
            • Agent is trying to perform task and learn at same time
            • Must trade off exploration vs. exploitation
            • Can train against itself or against humans




Laird & van Lent                    GDC 2005: AI Learning Techniques Tutorial   Page 158
What can be Learned?
     • Utility Function:
            • How good is a state?
            • The utility of state si: U(si)
            • Choose action to maximize expected utility of result
     • Action-Value:
            • How good is a given action for a given state?
            • The expected utility of performing action aj in state si: V(si,aj)
            • Choose action with best expected utility for current state




Laird & van Lent               GDC 2005: AI Learning Techniques Tutorial     Page 159
Utility Function for States: U(si)
     • Agent chooses action to maximize expected utility
            • One step look-ahead


        [Diagram: a grid of neighboring states marked + or -; the agent looks one step ahead and moves toward the highest expected utility]


     • Agent must have a “model” of environment
            • Possible transitions from state to state
            • Can be learned or preprogrammed


Laird & van Lent               GDC 2005: AI Learning Techniques Tutorial   Page 160
Trivial Example: Maze Learning




Laird & van Lent     GDC 2005: AI Learning Techniques Tutorial   Page 161
Learning State Utility Function: U(si)
        [Maze diagram annotated with learned state utilities: values range from about .80 in the squares farthest from the goal up to .99 in the square next to the goal, increasing along open paths toward the goal]


Laird & van Lent     GDC 2005: AI Learning Techniques Tutorial                           Page 162
Action Value Function: V(si,aj)
     • Agent chooses action that is best for current state
            • Just compare operators – not state


        [Diagram: the actions available from the current state, each marked with its own value]




     • Agent doesn’t need a “model” of environment
            • But must learn separate value for each state-action pair



Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 163
Learning Action-Value Function:
                               V(si,aj)
        [Maze diagram fragment: learned action values for the moves out of one state, e.g. .85 and .83]




Laird & van Lent           GDC 2005: AI Learning Techniques Tutorial         Page 164
Review of Dimensions
     • Source of learning data                • What is learned
        • Passive                               • State utility function
        • Active                                • Action-value




Laird & van Lent       GDC 2005: AI Learning Techniques Tutorial           Page 165
Passive Utility Function Approaches
     • Least Mean Squares (LMS)
     • Adaptive Dynamic Programming (ADP)
            • Requires a model (M) for learning
     • Temporal Difference Learning (TDL)
            • Model free learning (uses model for decision making, but not
              for learning).




Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial   Page 166
Learning State Utility Function (U)
     • Assume k states in the world
     • Agent keeps:
            • An estimate U of the utility of each state (k)
            • A table N of how many times each state was seen (k)
            • A table M (the model) of the transition probabilities (k x k)
                   • likelihood of moving from each state to another state
        [Transition diagram: S1 -.6-> S2, S1 -.4-> S3, S2 -.3-> S3, S2 -.7-> S4, S3 -1-> S4, S4 -1-> S1]

                 M        S1    S2    S3    S4
                 S1        0    .6    .4     0
                 S2        0     0    .3    .7
                 S3        0     0     0     1
                 S4        1     0     0     0

Laird & van Lent                                 GDC 2005: AI Learning Techniques Tutorial                  Page 167
Adaptive Dynamic Programming (ADP)
          • Utility = reward and probability of future reward
           • U(i) = R(i) + Σ_j M_ij U(j)

                 M        S1    S2    S3    S4        Initial utilities:
                 S1        0    .6    .4     0          S1 = .5
                 S2        0    .2    .3    .5          S2 = .6
                 S3        0     0     0     1          S3 = .2
                 S4        1     0     0     0          S4 = .1

            In state S2 and get reward .3:
            U(2) = .3 + 0(.5) + .2(.6) + .3(.2) + .5(.1)
                 = .3 + 0 + .12 + .06 + .05
                 = .53

            Exact, but inefficient in large search spaces
            Requires sweeping through the complete state space
Laird & van Lent                  GDC 2005: AI Learning Techniques Tutorial                                 Page 168
Temporal Difference Learning
     • Approximate ADP
            • Adjust the estimated utility value of the current state based
              on its immediate reward and the estimated value of the next
              state.
     • U(i) = U(i) + α(R(i) + U(j) - U(i))
            • α is the learning rate
            • If α continually decreases, U will converge




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 169
Temporal Difference Example
          • Utility = reward and probability of future reward
          • U(i) = U(i) + α(R(i) + U(j) - U(i)), here with α = .5

            [Transition diagram and initial utilities as on the previous slide:
             S1 = .5, S2 = .6, S3 = .2, S4 = .1]

          In state S2, get reward .3, go to state S3:
          U(2) = .6 + .5 * (.3 + .2 - .6)
               = .6 + .5 * (-.1)
               = .6 - .05
               = .55


Laird & van Lent                 GDC 2005: AI Learning Techniques Tutorial                                Page 170
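The update above in a few lines of Python, reproducing the worked numbers (α = .5):

  U = {"S1": 0.5, "S2": 0.6, "S3": 0.2, "S4": 0.1}
  alpha = 0.5

  def td_update(i, reward, j):
      U[i] += alpha * (reward + U[j] - U[i])   # U(i) = U(i) + alpha(R(i) + U(j) - U(i))

  td_update("S2", 0.3, "S3")    # in S2, get reward .3, move to S3
  print(U["S2"])                # 0.55, as in the worked example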
TD vs. ADP
    • ADP learns faster
    • ADP is less variable
    • TD is simpler
    • TD has less computation/observation
    • TD does not require a model during learning
    • TD biases update to observed successor instead of all




Laird & van Lent       GDC 2005: AI Learning Techniques Tutorial   Page 171
Active Learning State Utilities: ADP
     • Active learning must decide which action to take and
       update based on what it does.
     • Extend model M to give the probability of a transition
       from a state i to a state j, given an action a.
     • Utility is maximum of
     • U(i) = R(i) + max_a [ Σ_j M^a_ij U(j) ]




Laird & van Lent       GDC 2005: AI Learning Techniques Tutorial   Page 172
Active Learning State-Action Functions
                  (Q-Learning)
     • Combines situation and action:


        [Diagram: each state stores a separate value for every available action]

     • Q(a, i) = expected utility of taking action a in state i
     • U(i) = max_a Q(a, i)



Laird & van Lent         GDC 2005: AI Learning Techniques Tutorial   Page 173
Q Learning
     • ADP version: Q(a, i) = R(i) + Σ_j M^a_ij max_a' Q(a', j)

        [Diagram: Q(a, S1) = .7 for the action leading to S2; the actions available in S2 have Q values .6, .7, and .9]

     • TD version: Q(a, i) = Q(a, i) + α(R(i) + γ(max_a' Q(a', j) - Q(a, i)))
            • If α is .1, γ is .9, and R(1) = 0:
            • Q(a, 1) = .7 + .1 * (0 + .9 * (max(.6, .7, .9) - .7))
                      = .7 + .1 * .9 * (.9 - .7)
                      = .7 + .018
                      = .718


     • Selection is biased by expected utilities
            • Balances exploration vs. exploitation
            • With experience, bias more toward higher values

Laird & van Lent                     GDC 2005: AI Learning Techniques Tutorial   Page 174
Q-Learning
     • Q-Learning is the first provably convergent direct
       adaptive optimal control algorithm
     • Great impact on the field of modern RL
            • smaller representation than models
            • automatically focuses attention to where it is needed, i.e., no
              sweeps through state space




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial     Page 175
Q Learning Algorithm
     For each pair (a, s), initialize Q(a, s)
     Observe the current state s
     Loop forever
     {
            Select an action a and execute it
             a = argmax_a Q(s, a)

            Receive immediate reward r and observe the new state s’
            Update Q(s, a):
             Q(s, a) = Q(s, a) + α (r + γ max_a' Q(s', a') - Q(s, a))

            s=s’
     }

Laird & van Lent                         GDC 2005: AI Learning Techniques Tutorial   Page 176
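A minimal tabular Q-learning sketch of the loop above, run on a toy five-state corridor (reach the right end for +1 reward). The corridor, the parameter values, and the epsilon-greedy action selection (discussed earlier as exploration vs. exploitation) are illustrative additions, not from the tutorial.

  import random

  N_STATES, ACTIONS = 5, ("left", "right")
  alpha, gamma, epsilon = 0.1, 0.9, 0.1
  Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

  def step(s, a):
      s2 = max(0, s - 1) if a == "left" else min(N_STATES - 1, s + 1)
      return s2, (1.0 if s2 == N_STATES - 1 else 0.0)     # +1 reward at the right end

  def select_action(s):
      if random.random() < epsilon:                        # exploration
          return random.choice(ACTIONS)
      return max(ACTIONS, key=lambda a: Q[(s, a)])         # exploitation: argmax_a Q(s, a)

  s = 0
  for _ in range(5000):
      a = select_action(s)
      s2, r = step(s, a)
      best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
      Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # the update rule above
      s = 0 if r > 0 else s2                               # restart the episode at the goal

  print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
  # States 0-3 learn "right"; state 4 is the terminal/restart state.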
Summary Comparison
     State Utility Function
      • Requires a model
      • More general / faster learning
             • Learns about states
      • Slower execution
             • Must compute follow-on states
      • If it has a model of the reward, doesn’t need the environment
      • Useful for worlds with a model
             • Maze worlds, board games, …

     State-Action (Q) Function
      • Model free
      • Less general / slower learning
             • Must learn state-action combinations
      • Faster execution
      • Preferred for complex worlds where a model isn’t available



Laird & van Lent                GDC 2005: AI Learning Techniques Tutorial                 Page 177
Anark Software
                 Galapagos
                                                   • Player trains creature by
                                                     manipulating environment
                                                   • Creature learns from pain,
                                                     death, and reward for
                                                     movement
                                                   • Learns to move and classify
                                                     objects in world based on their
                                                     pain/death.




Laird & van Lent        GDC 2005: AI Learning Techniques Tutorial                  Page 178
Challenges
     • Exploring the possibilities
     • Picking the right representation
     • Large state spaces
     • Infrequent reward
     • Inter-dependence of actions
     • Complex data structures
     • Dynamic worlds
     • Setting parameters


Laird & van Lent        GDC 2005: AI Learning Techniques Tutorial   Page 179
Exploration vs. Exploitation
     • Problem: with a large space of possible actions, the agent might never
       experience many of them if it learns too quickly.
     • Exploration: try out actions to gather information
     • Exploitation: use current knowledge to improve behavior
     • Compromise:
            • Random selection, but bias the choice toward the best actions
            • Over time, bias more and more toward the best actions




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 180
Picking the Right Representation
     • Too few features and it is impossible to learn
            • E.g., learning to drive without being able to sense acceleration or speed.
     • Too many features and exact (table-based) representations become impractical
            • See next section




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 181
Large state spaces:
                       Curse of Dimensionality
     • Look-up Table for Q value
        • AIW 2, pp. 597
        • OK for 2-3 variables
        • Fast learning, but lots of memory
     • Issues:
            • Hard to get data that covers the states enough times to learn accurate utility
              functions
            • Probably many different states have similar utility
            • Data structures for storing utility functions can be very large
            • State-action approaches (Q-learning) exacerbate the problem
     • Deathmatch example:
            • Health [10], Enemy Health [10], Relative Distance [10], Relative Heading
              [10], Relative Opponent Heading [10], Weapon [5], Ammo [10], Power
              ups [4], Enemy Power ups [4], My Speed [4], His Speed [4], Distances to
              Walls [5,5,5,5]
            • 8 x 10^14 states


Laird & van Lent                  GDC 2005: AI Learning Techniques Tutorial             Page 182
Solution:
     • Approximate state space with some function
     • Neural Networks, Decision Tree, Nearest Neighbor,
       Bayesian Network, …
            • Can be slower than lookup table but much more compact




Laird & van Lent            GDC 2005: AI Learning Techniques Tutorial   Page 183
Function Approximation:
                      Neural Networks
     • Use all features as input with utility as output



     State Features &                                                   Utility Estimate
     Action
     (Q Learning)




            • Output could be actions and their utilities?

Laird & van Lent            GDC 2005: AI Learning Techniques Tutorial                  Page 184
Geisler – FPS Offline Learning
     Input Features:
            •      Closest Enemy Health
            •      Number of Enemies in Sector 1
            •      Number of Enemies in Sector 2
            •      Number of Enemies in Sector 3
            •      Number of Enemies in Sector 4
            •      Player Health
            •      Closest Goal Distance
            •      Closest Goal Sector
            •      Closest Enemy Sector
            •      Distance to Closest Enemy
            •      Current Move Direction
            •      Current Face Direction
     Output
            •      Accelerate
            •      Move Direction
            •      Facing Direction
            •      Jumping
        [Diagram: the area around the bot (700 feet) divided into Sectors 1-4]

     •    Tested with Neural Networks, Decision Trees, and Naïve Bayes



Laird & van Lent                        GDC 2005: AI Learning Techniques Tutorial                         Page 185
                                                                         Partial Decision Tree for Accelerate
        [Figure: partial decision tree for the Accelerate output. The root tests Health (1-3, 4-6, 7-9, 10); subtrees test EnemySector, #EnemySector1, #EnemySector3, ClosestGoal, EnemyDistance, EnemyHealth, CurrentMove, and CurrentFace, with Yes/No example counts at the internal nodes and YES/NO leaves]
Laird & van Lent                                        GDC 2005: AI Learning Techniques Tutorial            Page 186
Results – Error Rates
      [Figure: two charts comparing the learners on the imitation task. Left: “Accelerate?”; right: “Move Direction”. Each plots test-set error rate (0–45%) against training-set size (100–5,000 examples) for Baseline, ID3, NB (Naïve Bayes), and ANN.]
Laird & van Lent                                               GDC 2005: AI Learning Techniques Tutorial                                                 Page 187
Infrequent reward
     • Problem:
             • If feedback comes only at the end of a long sequence of actions, it is hard
               to learn the utilities of early situations
     • Solution
            • Provide intermediate rewards
     • Example: FPS
            •      +1 for hitting enemy in FPS deathmatch
            •      -1 for getting hit by enemy
            •      +.5 for getting behind enemy
            •      +.4 for being in place with good visibility but little exposure
     • Risks
             • The agent may learn to pursue the intermediate rewards instead of the final reward



Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial          Page 188
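
As a concrete illustration of the intermediate rewards listed above, here is a minimal sketch of a shaped reward signal for a deathmatch bot; the weights come from the slide, the event names are hypothetical.

def shaped_reward(events):
    """events: set of strings describing what happened this tick."""
    r = 0.0
    if "hit_enemy" in events:
        r += 1.0
    if "got_hit" in events:
        r -= 1.0
    if "behind_enemy" in events:
        r += 0.5
    if "good_vantage_point" in events:   # good visibility, little exposure
        r += 0.4
    return r

print(shaped_reward({"hit_enemy", "behind_enemy"}))   # 1.5

Keeping the intermediate rewards small relative to the final reward reduces the risk noted above of the agent chasing them instead of the final goal.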
Maze Learning
      [Figure: a grid maze with a goal square worth +100; the square adjacent to the goal has a learned value of +90, and values propagate backward through the maze as learning proceeds.]
Laird & van Lent   GDC 2005: AI Learning Techniques Tutorial      Page 189
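
A minimal sketch of how a +100 goal value propagates backward under a 0.9 discount factor; the corridor here is made up, but the square adjacent to the goal ends up worth about +90, as in the figure.

GOAL_REWARD = 100.0
GAMMA = 0.9
N = 6                                  # corridor cells 0..5, goal at cell 5

V = [0.0] * N
V[N - 1] = GOAL_REWARD                 # the goal square itself is worth +100
for _ in range(50):                    # value iteration to convergence
    for s in range(N - 2, -1, -1):
        V[s] = GAMMA * V[s + 1]        # no step reward, just the discounted goal value
print([round(v, 1) for v in V])        # [59.0, 65.6, 72.9, 81.0, 90.0, 100.0]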
Many Related Actions
      • If you try to learn them all at once, learning is very slow
      • Train them up one at a time:
             • See section 10.4 in AI Game Programming Wisdom 2, p. 596




Laird & van Lent            GDC 2005: AI Learning Techniques Tutorial   Page 190
Dynamic world
     •       Problem:
             •      If the world or the reward structure changes suddenly, the system cannot respond quickly
      •       Solution:
             1. Continual exploration to detect changes
             2. If the changes are major, restart learning




Laird & van Lent                 GDC 2005: AI Learning Techniques Tutorial   Page 191
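
A minimal sketch of the two solutions above for a tabular Q-learner: keep a floor on exploration so changes are noticed, and wipe the table when a sustained spike in prediction error suggests a major change. The detection rule here is an assumption, not something prescribed by the slides.

import random

class AdaptiveQLearner:
    def __init__(self, epsilon_min=0.05, reset_threshold=5.0):
        self.q = {}                        # (state, action) -> estimated value
        self.epsilon = 0.3
        self.epsilon_min = epsilon_min     # continual-exploration floor
        self.reset_threshold = reset_threshold
        self.recent_errors = []

    def act(self, state, actions):
        if random.random() < max(self.epsilon, self.epsilon_min):
            return random.choice(actions)                      # keep exploring
        return max(actions, key=lambda a: self.q.get((state, a), 0.0))

    def note_prediction_error(self, td_error):
        self.recent_errors = (self.recent_errors + [abs(td_error)])[-20:]
        if len(self.recent_errors) == 20 and \
           sum(self.recent_errors) / 20 > self.reset_threshold:
            self.q.clear()                 # major change detected: restart learning
            self.epsilon = 0.3
            self.recent_errors = []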
Major Change in World
      [Figure: a grid of learned state values (roughly .80–.99, increasing toward the goal). After a major change in the world, large parts of this value table become stale and must be relearned.]
Laird & van Lent       GDC 2005: AI Learning Techniques Tutorial                           Page 192
Setting Parameters
      • Learning Rate: α
             •      If too high, learning might not converge (it skips over the solution)
             •      If too low, learning converges slowly
             •      Lower it over time, e.g. k^n: .95^n = .95, .90, .86, .81, .77, …
             •      For deterministic worlds and state transitions, .1–.2 works well
      • Discount Factor: γ
            • Affects how “greedy” agent is for short term vs. long-term
              reward
            • .9-.95 is good for larger problems
      • Best-Action Selection Probability (1 − ε in ε-greedy selection)
             • Increase it as the game progresses so the agent takes advantage of what it has learned
             • e.g. 1 − k^n


Laird & van Lent                   GDC 2005: AI Learning Techniques Tutorial    Page 193
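
The schedules above can be written down directly; the decay constant and the step counts below are illustrative.

K = 0.95
GAMMA = 0.9                     # discount factor; .9-.95 suggested for larger problems

def alpha(n):                   # learning rate that decays as k^n
    return K ** n

def p_best_action(n):           # probability of taking the current best action, 1 - k^n
    return 1.0 - K ** n

for n in range(1, 6):
    print(n, round(alpha(n), 2), round(p_best_action(n), 2))
# 1 0.95 0.05
# 2 0.9 0.1
# 3 0.86 0.14
# 4 0.81 0.19
# 5 0.77 0.23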
Analysis
     • Advantages:
            • Excellent for tuning parameters & control problems
            • Can handle noise
            • Can balance exploration vs. exploitation
     • Disadvantages
             • Can be slow if the space of possible representations is large
             • Has trouble with changing concepts
     • Challenges:
            •      Choosing the right approach: utility vs. action-value
            •      Choosing the right features
            •      Choosing the right function approximation (NN, DT, …)
            •      Choosing the right learning parameters
            •      Choosing the right reward function


Laird & van Lent                    GDC 2005: AI Learning Techniques Tutorial   Page 194
References
     • John Manslow: Using Reinforcement Learning to Solve AI
       Control Problems: AI Programming Wisdom 2, p. 591
     • Benjamin Geisler, An Empirical Study of Machine Learning
       Algorithms Applied to Modeling Player Behavior in a “First
        Person Shooter” Video Game, Master’s Thesis, University of Wisconsin, 2002.




Laird & van Lent         GDC 2005: AI Learning Techniques Tutorial   Page 195
Episodic Learning
                                       [Andrew Nuxoll]
     • What is it?
            • Not facts or procedures but memories of specific events
            • Recording and recalling of experiences with the world
     • Why study it?
            • No comprehensive computational models of episodic learning
            • No cognitive architectural models of episodic learning
                   • If not architectural, interferes with other reasoning
            • Episodic learning will expand cognitive abilities
                   •   Personal history and identity
                   •   Memories that can be used for future decision making & learning
                   •   Necessary for reflection, debriefing, etc.
                   •   Without it we are trying to build crippled AI systems
            • Mother of all case-based reasoning problems.
Laird & van Lent                       GDC 2005: AI Learning Techniques Tutorial         Page 196
Characteristics of Episodic Memory
  1.      Architectural:
         •         The mechanism is used for all tasks and does not compete with reasoning.
  2.      Automatic:
         •         Memories are created without effort or deliberate rehearsal.
  3.      Autonoetic:
         •         A retrieved memory is distinguished from current sensing.
  4.      Autobiographical:
          •         The episode is remembered from the rememberer’s own perspective.
  5.      Variable Duration:
         •         The time period spanned by a memory is not fixed.
  6.      Temporally Indexed:
         •         The rememberer has a sense of the time when the episode occurred.


Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial         Page 197
Advantages of Episodic Memory
     • Improves AI behavior
            • Creates a personal history that impacts behavior
                   • Knows what it has done – avoid repetition
            • Helps identify significant changes to the world
                   • Compare current situation to memory
            • Creates virtual sensors of previously seen aspects of the
              world
             • Helps explain behavior
                    • History of the goals and subgoals it attempted
             • Provides the basis for a simple model of the environment
            • Supports other learning mechanisms



Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial   Page 198
Why and why not Episodic Memory?
      • Advantages:
             • General capability that can be reused on many projects.
      • Disadvantages:
             • Can be replaced with code customized for specific needs.
             • Might be costly in memory and retrieval.
             • Might be difficult to identify what to store.




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 199
Implementing Episodic Memory
     • Encoding
            • When is an episode stored?
            • What is stored and what is available for cuing retrieval?
     • Storage
            • How is it stored for efficient insertion and query?
     • Retrieval
            • What is used to cue the retrieval?
            • How is the retrieval efficiently performed?
            • What is retrieved?




Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 200
Possible Approach
     • When encode:
             • Every encounter between an NPC and the player
            • If NPC goal/subgoal is achieved
     • What to store:
            • Where, when, what other entities around, difficulty of
              achievement, objects that were used, …
            • Pointer to next episode
     • Retrieve based on:
            • Time, goal, objects, place

      • Can create an efficient hash- or tree-based retrieval (see the sketch after this slide).


Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 201
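
A minimal sketch of the approach above: each episode records where, when, which goal, what was around, and what was used, with a pointer to the next episode, and simple hash indexes make retrieval by goal, object, or place cheap. All field names are hypothetical.

from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Episode:
    time: int
    place: str
    goal: str
    entities: tuple
    objects_used: tuple
    difficulty: float
    next_episode: "Episode | None" = None      # pointer to the next episode

class EpisodicMemory:
    def __init__(self):
        self.episodes = []
        self.by_goal = defaultdict(list)       # hash indexes for cheap retrieval
        self.by_place = defaultdict(list)
        self.by_object = defaultdict(list)

    def store(self, ep):
        if self.episodes:
            self.episodes[-1].next_episode = ep
        self.episodes.append(ep)
        self.by_goal[ep.goal].append(ep)
        self.by_place[ep.place].append(ep)
        for obj in ep.objects_used:
            self.by_object[obj].append(ep)

    def retrieve(self, goal=None, place=None, obj=None):
        # start from the narrowest hash index available, then filter
        if goal is not None:
            candidates = self.by_goal.get(goal, [])
        elif place is not None:
            candidates = self.by_place.get(place, [])
        elif obj is not None:
            candidates = self.by_object.get(obj, [])
        else:
            candidates = self.episodes
        if place is not None:
            candidates = [e for e in candidates if e.place == place]
        if obj is not None:
            candidates = [e for e in candidates if obj in e.objects_used]
        return candidates[-1] if candidates else None    # most recent match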
Soar Structure
      [Figure: the Soar architecture. The decision procedure, rule matcher, and GUI operate over a long-term procedural memory of production rules and a short-term declarative (working) memory connected to perception and action; an episodic learning module adds a separate episodic memory alongside these.]
Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial                 Page 202
Implementation Big Picture
      [Figure: episodic memory implementation, step 1 of 5 (encoding: initiation). Production rules operate over working memory, which has input, output, cue, and retrieved buffers. Encoding of a new episode is initiated when the agent takes an action.]
Laird & van Lent           GDC 2005: AI Learning Techniques Tutorial           Page 203
Implementation Big Picture
      [Figure: step 2 of 5 (encoding: content). The entire working memory is stored in the episode.]
Laird & van Lent                GDC 2005: AI Learning Techniques Tutorial           Page 204
Implementation Big Picture
      [Figure: step 3 of 5 (storage: episode structure). Episodes are stored in a separate episodic memory by the episodic learning module.]
Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial                  Page 205
Implementation Big Picture
      [Figure: step 4 of 5 (retrieval: initiation/cue). The cue is placed in an architecture-specific buffer in working memory.]
Laird & van Lent               GDC 2005: AI Learning Techniques Tutorial                  Page 206
Implementation Big Picture
      [Figure: step 5 of 5 (retrieval). The closest partial match to the cue is retrieved from episodic memory into the retrieved buffer.]
Laird & van Lent            GDC 2005: AI Learning Techniques Tutorial                  Page 207
Storage of Episodes
      [Figure: all episodes are indexed by a single “Uber-Tree” over working-memory structure.]
Laird & van Lent         GDC 2005: AI Learning Techniques Tutorial   Page 208
Alternative Approach
     • Observation:
            • Many items don’t change from one episode to next
            • Can reconstruct episode from individual facts
            • Eliminate costly episode structure
     • New representation
             • For each item, store the ranges of episodes during which it exists
     • New match
            • Trace through Über tree with cue to find all matching ranges
            • Compute score for merged ranges – pick best
            • Reconstruct episode by searching Über with episode number



Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 209
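
A minimal sketch of the range-based representation described above: for each item, store the episode ranges during which it held; to retrieve, merge the ranges matched by the cue, score them, and pick the best. The scoring rule (number of cue items present) is an assumption.

from collections import defaultdict

class RangeStore:
    def __init__(self):
        self.ranges = defaultdict(list)   # item -> list of [start, end] episode ranges

    def record(self, item, episode):
        r = self.ranges[item]
        if r and r[-1][1] == episode - 1:
            r[-1][1] = episode            # extend the open range
        else:
            r.append([episode, episode])  # start a new range

    def retrieve(self, cue_items):
        score = defaultdict(int)          # episode number -> number of cue items present
        for item in cue_items:
            for start, end in self.ranges[item]:
                for ep in range(start, end + 1):
                    score[ep] += 1
        if not score:
            return None
        best = max(score, key=lambda ep: (score[ep], ep))   # prefer later episodes on ties
        items = [item for item, rs in self.ranges.items()
                 if any(s <= best <= e for s, e in rs)]     # reconstruct the episode
        return best, items

store = RangeStore()
for ep in range(5, 8):
    store.record("door-open", ep)         # held in episodes 5-7
for ep in range(80, 86):
    store.record("door-open", ep)         # and 80-85
store.record("enemy-near", 82)
print(store.retrieve(["door-open", "enemy-near"]))          # episode 82 wins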
Storage


      [Figure: in the new representation each item in the Uber-Tree records the episode ranges during which it held, e.g. 5–7 and 80–85.]




Laird & van Lent   GDC 2005: AI Learning Techniques Tutorial           Page 210
Retrieval


      [Figure: retrieval traces the cue through the Uber-Tree to collect the matching episode ranges, e.g. 5–7 and 80–85.]




Laird & van Lent   GDC 2005: AI Learning Techniques Tutorial           Page 211
Cue
      [Figure: matching a cue against the range store. The episode ranges matched by each cue element (e.g. 5–7, 55–65, 80–85, 90–95), weighted by activation, are merged; a score is computed for each merged range (e.g. 3, 15, 46, 12, 49, 37) and the best-scoring episode is retrieved.]
Laird & van Lent           GDC 2005: AI Learning Techniques Tutorial                 Page 212
Memory Usage
      [Figure: “Memory Usage Comparison” – memory allocated (bytes, 0 to roughly 8,000,000) plotted against decision cycles (0 to 70,000) for the old and new implementations (Old Impl vs. New Impl).]
Laird & van Lent                                       GDC 2005: AI Learning Techniques Tutorial        Page 213
Conclusion
 • Explore the use of episodic memory as a general capability
        • Inspired by psychology
        • Constrained by computation and memory




Laird & van Lent          GDC 2005: AI Learning Techniques Tutorial   Page 214
Learning by Observation


                          Michael van Lent




Laird & van Lent        GDC 2005: AI Learning Techniques Tutorial   Page 215
Background
     • Goal: Learn rules to perform a task from watching an
       expert
            • Real time interaction with the game (agent-based approach)
            • Learning what goals to select & how to achieve them
     • AI agents require lots of knowledge
            • TacAir-Soar: 8000+ rules
            • Quake II agent: 800+ rules
     • Knowledge acquisition for these agents is expensive
             • 15 person-years for TacAir-Soar
     • Learning is a cheaper alternative?


Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial   Page 216
Continuum of Approaches




      [Figure: a continuum from Standard Knowledge Acquisition through Learning by Observation to Unsupervised Learning. Moving along it, the effort required of the expert and programmer decreases while the research effort required increases.]



Laird & van Lent            GDC 2005: AI Learning Techniques Tutorial                        Page 217
The Big Picture
     • Problem
            • Task performance cast as classification
     • Feedback
            • Supervised learning
     • Knowledge Representation
            • Rules
            • Decision trees

     • Knowledge Source
            • Observations of an expert
            • Annotations


Laird & van Lent                GDC 2005: AI Learning Techniques Tutorial   Page 218
Knowledge Representation
     • Rules encoding operators
     • Operator Hierarchy
     • Operator consists of:
            • Pre-conditions (potentially disjunctive)
                   • Includes negated test for goal-achieved feature
            • Conditional Actions
                   • Action attribute and value (pass-through action values)
            • Goal conditions (potentially disjunctive)
                   • Create goal-achieved feature
                   • Persistent and non-persistent goal-achieved features

     • Task and Domain parameters are widely used to
       generalize the learned knowledge
Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial   Page 219
Operator Conditions
     • Pre-conditions
            • Positive instance from each observed operator selection
     • Action conditions
            • Positive instance from each observed action performance
            • Recent-changes heuristic can be applied

     • Goal conditions
            • Positive instance from each observed operator termination
            • Recent-changes heuristic can be applied

     • Action attributes and values
            • Attribute taken directly from expert actions
            • Value can be constant or “pass-through”

Laird & van Lent                 GDC 2005: AI Learning Techniques Tutorial   Page 220
KnoMic
      [Figure: the KnoMic system. The expert interacts with ModSAF through an environmental interface (parameters & sensors in, output commands out). Observation generation combines the interface data with the expert’s annotations into observation traces. Specific-to-general induction performs operator classification and produces operator conditions, and production generation turns the learned knowledge into Soar productions for the Soar architecture.]
Laird & van Lent                   GDC 2005: AI Learning Techniques Tutorial                        Page 221
Observation Trace
     • At each time step record
            • Sensor input changes
                   • List of attributes and values
            • Output commands
                   • List of attributes and values
            • Operator annotations
                   • List of active operators
                                                    # Add Sensor Input for Decision Cycle 2
                                                    set Add_Sensor_Input(2,0) [list observe io input-link vehicle radar-mode tws-man ]
                                                    set Add_Sensor_Input(2,1) [list observe io input-link vehicle elapsed-time value 5938 ]
                                                    set Add_Sensor_Input(2,3) [list observe io input-link vehicle altitude value 1 ]

                                                    # Remove Sensor Input for Decision Cycle 2
                                                    set Remove_Sensor_Input(2,0) [list observe io input-link vehicle radar-mode *unknown* ]
                                                    set Remove_Sensor_Input(2,1) [list observe io input-link vehicle elapsed-time value 0 ]
                                                    set Remove_Sensor_Input(2,3) [list observe io input-link vehicle altitude value 0 ]

                                                    # Expert Actions for Decision Cycle 3
                                                    set Expert_Action_List(3) [list [list mvl-load-weapon-bay station-1 ] ]

                                                    # Expert Goal Stack for Decision Cycle 3
                                                    set Expert_Goal_Stack(3) [list init-agent station-1 ]




Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial                                                          Page 222
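
The trace entries shown above are easy to gather into per-decision-cycle records. The sketch below assumes exactly the Tcl-generated format on this slide; the regular expressions are illustrative.

import re
from collections import defaultdict

ADD_RE = re.compile(r"set Add_Sensor_Input\((\d+),\d+\) \[list (.+?) \]")
ACT_RE = re.compile(r"set Expert_Action_List\((\d+)\) \[list (.+?) \]")

def parse_trace(lines):
    cycles = defaultdict(lambda: {"sensor_adds": [], "actions": []})
    for line in lines:
        m = ADD_RE.search(line)
        if m:
            cycles[int(m.group(1))]["sensor_adds"].append(m.group(2).split())
            continue
        m = ACT_RE.search(line)
        if m:
            cycles[int(m.group(1))]["actions"].append(m.group(2))
    return cycles

trace = [
    "set Add_Sensor_Input(2,0) [list observe io input-link vehicle radar-mode tws-man ]",
    "set Expert_Action_List(3) [list [list mvl-load-weapon-bay station-1 ] ]",
]
print(dict(parse_trace(trace)))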
Racetrack & Intercept Behavior

      [Figure: the observed racetrack and intercept behavior. The racetrack consists of fly to waypoint, fly inbound leg, and fly outbound leg. Intercept leads to the Employ-weapons operator, which decomposes into Select-missile, Get-missile-lar, Achieve-proximity, Launch-missile (Lock-radar, Get-steering-circle, Fire-missile), Wait-for-missile-to-clear, and Support-missile, before returning to fly to waypoint.]
Laird & van Lent          GDC 2005: AI Learning Techniques Tutorial                        Page 223
Learning Example
     First selection of Fly-inbound-leg
            •      Radar Mode = TWS
            •      Altitude = 20,102
            •      Compass = 52
            •      Wind Speed = 3
            •      Waypoint Direction = 52
            •      Waypoint Distance = 1,996
            •      Near Parameter = 2,000

     Initial pre-conditions
            •      Radar Mode = TWS
            •      Altitude = 20,102
            •      Compass = 52
            •      Wind Speed = 3
            •      Waypoint Direction = 52
            •      Waypoint Distance = 1,996
            •      Compass == Waypoint Direction
            •      Waypoint Distance < Near Parameter


Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial   Page 224
Learning Example
      First instance of Fly-inbound-leg
             •      Radar Mode = TWS
             •      Altitude = 20,102
             •      Compass = 52
             •      Wind Speed = 3
             •      Waypoint Direction = 52
             •      Waypoint Distance = 1,996
             •      Near Parameter = 2,000

      Second instance of Fly-inbound-leg
             •      Radar Mode = TWS
             •      Altitude = 19,975
             •      Compass = 268
             •      Waypoint Direction = 270
             •      Waypoint Distance = 1,987
             •      Near Parameter = 2,000

      Initial pre-conditions
             •      Radar Mode = TWS
             •      Altitude = 20,102
             •      Compass = 52
             •      Wind Speed = 3
             •      Waypoint Direction = 52
             •      Waypoint Distance = 1,996
             •      Compass == Waypoint Direction
             •      Waypoint Distance < Near Parameter

      Revised pre-conditions
             •      Radar Mode = TWS
             •      Altitude = 19,975 – 20,102
             •      Waypoint Distance = 1,987 – 1,996
             •      Compass == Waypoint Direction
             •      Waypoint Distance < Near Parameter


Laird & van Lent                      GDC 2005: AI Learning Techniques Tutorial                   Page 225
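
The generalization step in the example above can be sketched as follows: symbolic values that agree are kept, numeric values are widened into ranges, features that disappear are dropped, and features covered by a relational condition (here Compass == Waypoint Direction) drop their absolute values. This is a simplified reconstruction, not the KnoMic code.

def generalize(cond, instance, relational=()):
    covered = {f for rel in relational for f in rel}     # features in kept relations
    new = {}
    for feat, val in cond.items():
        if feat in covered or feat not in instance:
            continue                                      # rely on the relation / drop
        obs = instance[feat]
        lo, hi = val if isinstance(val, tuple) else (val, val)
        if isinstance(lo, (int, float)) and isinstance(obs, (int, float)):
            new[feat] = (min(lo, obs), max(hi, obs))      # widen numeric range
        elif lo == obs:
            new[feat] = lo                                # symbolic match: keep
        # otherwise the values disagree: drop the condition
    return new

first  = {"RadarMode": "TWS", "Altitude": 20102, "Compass": 52,
          "WindSpeed": 3, "WaypointDirection": 52, "WaypointDistance": 1996}
second = {"RadarMode": "TWS", "Altitude": 19975, "Compass": 268,
          "WaypointDirection": 270, "WaypointDistance": 1987}

print(generalize(first, second,
                 relational=[("Compass", "WaypointDirection")]))
# {'RadarMode': 'TWS', 'Altitude': (19975, 20102), 'WaypointDistance': (1987, 1996)}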
Results 2: Efficiency
      [Figure: bar chart of development time in minutes (0–400) for KnoMic (10x), KnoMic, and several knowledge engineers (KE 1, KE 2, KE Projected, KE 3, KE 4), split into time to learn the task and time to encode the knowledge.]
Laird & van Lent                               GDC 2005: AI Learning Techniques Tutorial                          Page 226
Evaluation
     • Pros
            • Observations are fairly easy to get
            • Suitable for online learning (learn after each session)
            • AI can learn to imitate players
     • Cons
            • Only more efficient for large rule sets?
            • Experts need to annotate the observation logs
     • Challenges
            • Identifying the right features
            • Making sure you have enough observations



Laird & van Lent              GDC 2005: AI Learning Techniques Tutorial   Page 227
References
     •    Learning Task Performance Knowledge by Observation
            •      University of Michigan dissertation
     •    Knowledge Capture Conference (K-CAP).
     •    IJCAI Workshop on Modeling Others from Observation.
     •    AI Game Programming Wisdom.




Laird & van Lent                          GDC 2005: AI Learning Techniques Tutorial   Page 228
Learning Player Models
                               John Laird




Laird & van Lent        GDC 2005: AI Learning Techniques Tutorial   Page 229
Learning Player Model
     • Create an internal model of what player might do
     • Allows AI to adapt to player’s tactics & strategy
     • Tactics
             • Player is usually found in rooms b, c, & f
            • Player prefers using the rocket launcher
            • Patterns of players’ moves
                   • When they block, attack, retreat, combinations of moves, etc.
     • Strategy
        • Likelihood of player attacking from a given direction
        • Enemy tends to concentrate on technology and defense vs.
          exploration and attack


Laird & van Lent                    GDC 2005: AI Learning Techniques Tutorial        Page 230
Two Parts to Player Model
     • Representation of player’s behavior
            • Built up during playing


     • Tactics that test player model and generate AI behavior


     Multiple approaches for each of these




Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial   Page 231
Simple Representation of Behavior
    • Predefine set of traits
           • Always runs
           • Prefers dark rooms
           • Never blocks
    • Simply count during game play
           • Doesn’t track changes in style
    • Limited horizon of past values
           • Frequency of using attack – range, melee, …
            • TraitValue = α * ObservedValue + (1 − α) * OldTraitValue
            • α = learning rate, which determines the influence of each
              observation (a sketch follows this slide)

Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial   Page 232
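
A minimal sketch of the trait update above: an exponential moving average whose learning rate controls how much each new observation counts. The trait tracked here is illustrative.

def update_trait(old_value, observed, alpha=0.1):
    # exponential moving average of an observed trait
    return alpha * observed + (1.0 - alpha) * old_value

# e.g. fraction of the player's attacks that were ranged, starting at 0.5
trait = 0.5
for observed in (1.0, 1.0, 0.0, 1.0):     # 1 = ranged attack, 0 = melee
    trait = update_trait(trait, observed, alpha=0.2)
    print(round(trait, 3))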
Using Traits
     • Pick traits your AI tactics code can use (or create
       tactics that can use the traits you gather).
     • Tradeoff: level of detail vs. computation/complexity
            • Prefers dark rooms that have one entrance
             • More specialized traits give better predictions, but are more complex and
               have less data behind each one




Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial   Page 233
Markov Decision Process (MDP)
                           or N-Grams
     • Build up a probabilistic state transition network that
       describes player’s behavior

      [Figure: a state-transition diagram over the player’s moves, with learned transition probabilities on the edges (e.g. Punch: .6, Punch: .4, Kick: .7, Kick: .4, Block: .6, Block: .3, Punch: .7, Rest: .3).]
Laird & van Lent                    GDC 2005: AI Learning Techniques Tutorial     Page 234
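
A minimal sketch of such a model: count the observed transitions between the player's moves, normalize them to probabilities, and predict the most likely next move. The move names are illustrative.

from collections import defaultdict

class MoveModel:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, prev_move, move):
        self.counts[prev_move][move] += 1

    def predict(self, prev_move):
        nxt = self.counts.get(prev_move)
        if not nxt:
            return None
        total = sum(nxt.values())
        # return (most likely next move, its estimated probability)
        return max(((m, c / total) for m, c in nxt.items()), key=lambda x: x[1])

model = MoveModel()
for prev, cur in [("punch", "kick"), ("punch", "kick"), ("punch", "block"),
                  ("kick", "punch"), ("block", "rest")]:
    model.observe(prev, cur)
print(model.predict("punch"))     # ('kick', 0.666...)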
Other Models
     • Any decision making system:
            • Neural networks
            • Decision tree
            • Rule-based system


     • Train with situation/action pairs


     • Use AI’s behavior as model of opponent
            • Chess, checkers, …



Laird & van Lent            GDC 2005: AI Learning Techniques Tutorial   Page 235
Using Player Model
             • Test for trait values and provide a direct response
            • If player is likely to kick then block.
            • If player attacks very late, don’t build defenses early on.
             • Predict the player’s behavior and search for the best response
            • Can use general look-ahead/mini-max/alpha-beta search
            • Doesn’t work with highly random games (Backgammon, Sorry)

      [Figure: a small look-ahead tree alternating my moves (“Me”) and the opponent’s predicted moves (“Him”).]




Laird & van Lent               GDC 2005: AI Learning Techniques Tutorial    Page 236
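
A minimal sketch of the first use above: look up the predicted next move in the model and respond with a direct counter. The counter table is illustrative, and MoveModel is the sketch from the previous section.

COUNTERS = {"kick": "block", "punch": "dodge", "block": "grab", "rest": "attack"}

def choose_response(move_model, players_last_move, default="attack"):
    prediction = move_model.predict(players_last_move)
    if prediction is None:
        return default
    predicted_move, probability = prediction
    if probability < 0.5:             # not confident enough: play it safe
        return default
    return COUNTERS.get(predicted_move, default)

# e.g. choose_response(model, "punch") -> 'block' with the MoveModel trained earlier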
Anticipation
Dennis (Thresh) Fong:
“Say my opponent walks into a room. I'm visualizing him walking in,
  picking up the weapon. On his way out, I'm waiting at the doorway
  and I fire a rocket two seconds before he even rounds the corner. A
  lot of people rely strictly on aim, but everybody has their bad aim
  days. So even if I'm having a bad day, I can still pull out a win.
  That's why I've never lost a tournament.”
Newsweek, 11/21/99


Wayne Gretzky:
“Some people skate to the puck. I skate to where the puck is going to
  be.”

Laird & van Lent         GDC 2005: AI Learning Techniques Tutorial   Page 237
      [Figure sequence, pages 238–246: the Quakebot anticipation example. The bot predicts which room the opponent is heading for and, at each step, compares its own distance to that room with the opponent’s: 1 vs. 1, 2 vs. 2, 2 vs. 2, then 1 (but through a hall) vs. 3, and finally 0 vs. 4 – the bot gets there first and sets an ambush.]
Laird & van Lent        GDC 2005: AI Learning Techniques Tutorial       Pages 238–246
Adaptive Anticipation
     • Opponent might have different weapon preferences
            • Influences which weapons he pursues, which rooms he goes to
     • Gather data on opponent’s weapon preferences
            • Quakebot notices when opponent changes weapons
            • Use derived preferences for predicting opponent’s behavior
            • Dynamically modifies anticipation with experience




Laird & van Lent             GDC 2005: AI Learning Techniques Tutorial     Page 247
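
A minimal sketch of the weapon-preference tracking described above: count the opponent's observed weapon switches and use the derived preferences to rank the rooms he is likely to head for. The names and structure are illustrative.

from collections import Counter

class OpponentWeaponModel:
    def __init__(self):
        self.switches = Counter()

    def saw_switch_to(self, weapon):
        self.switches[weapon] += 1

    def preference(self, weapon):
        total = sum(self.switches.values())
        return self.switches[weapon] / total if total else 0.0

    def rank_rooms(self, rooms_with_weapons):
        # rooms_with_weapons: {room: weapon available there}
        return sorted(rooms_with_weapons,
                      key=lambda room: self.preference(rooms_with_weapons[room]),
                      reverse=True)

weapon_model = OpponentWeaponModel()
for w in ["rocket", "rocket", "railgun", "rocket"]:
    weapon_model.saw_switch_to(w)
print(weapon_model.rank_rooms({"room_b": "rocket", "room_c": "railgun", "room_f": "shotgun"}))
# ['room_b', 'room_c', 'room_f']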
References
     • Ryan Houlette: Player Modeling for Adaptive Games: AI
       Programming Wisdom 2, p. 557
     • John Manslow: Learning and Adaptation: AI Programming
       Wisdom, p. 559
     • Francois Laramee: Using N-Gram Statistical Models to Predict
        Player Behavior: AI Programming Wisdom, p. 596
     • John Laird, It Knows What You're Going to Do: Adding
       Anticipation to a Quakebot. Agents 2001 Conference.




Laird & van Lent         GDC 2005: AI Learning Techniques Tutorial   Page 248
Tutorial Overview
     I.         Introduction to learning and games [.75 hour] {JEL}
     II.        Overview of machine learning field [.75 hour] {MvL}
     III. Analysis of specific learning mechanisms [3 hours total]
            •      Decision Trees [.5 hour] {MvL}
            •      Neural Networks [.5 hour] {JEL}
            •      Genetic Algorithms [.5 hour] {MvL}
            •      Bayesian Networks [.5 hour] {MvL}
            •      Reinforcement Learning [1 hour] {JEL}
     IV. Advanced Techniques [1 hour]
            •      Episodic Memory [.3 hour] {JEL}
            •      Behavior capture [.3 hour] {MvL}
            •      Player modeling [.3 hour] {JEL}
     V.         Questions and Discussion [.5 hour] {MvL & JEL}

Laird & van Lent                  GDC 2005: AI Learning Techniques Tutorial   Page 249

  • 53. Optimization • Task: • Given a function f(x) = y, find an input with a high y value • Input (x) can take many forms • Feature string • Set of classification rules • Parse trees of code • Example: • Let x be a RTS build order x = [n1, n2, n3, n4, n5, n6, n7, n8] • ni means build unit or building n as the next action • If a unit or building isn’t available go on to the next action • f([n1, n2, n3, n4, n5, n6, n7, n8]) = firepower of resulting units • Optimize the build order for maximum firepower Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 53
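As a concrete illustration of casting build-order tuning as black-box optimization, here is a minimal Python sketch. The unit names, firepower values, and prerequisite rules are invented for illustration (not from any real RTS), and random restarts stand in for a real optimizer such as a genetic algorithm.

```python
import random

# Hypothetical firepower values and prerequisites, invented for illustration.
FIREPOWER = {"barracks": 0, "marine": 6, "factory": 0, "tank": 25}
REQUIRES  = {"marine": "barracks", "tank": "factory",
             "barracks": None, "factory": None}

def fitness(build_order):
    """f(x): total firepower of the units a build order produces.
    Actions whose prerequisite building isn't available yet are skipped,
    mirroring the slide's 'go on to the next action' rule."""
    built, total = set(), 0
    for item in build_order:
        prereq = REQUIRES[item]
        if prereq is not None and prereq not in built:
            continue            # prerequisite missing: skip this action
        built.add(item)
        total += FIREPOWER[item]
    return total

# Crude random-restart search over 8-action build orders.
choices = list(FIREPOWER)
best = max((tuple(random.choice(choices) for _ in range(8))
            for _ in range(2000)), key=fitness)
print(best, fitness(best))
```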
  • 54. Machine Learning by Feedback • Supervised Learning • Correct output is available • In Black & White: Examples of things to attack • Reinforcement Learning • Feedback is available but not correct output • In Black & White: Getting slapped for attacking something • Unsupervised Learning • No hint about correct outputs • In Black & White: Just looking for groupings of objects Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 54
  • 55. Supervised Learning • Learning algorithm gets the right answers • List of examples as input • “Teacher” who can be asked for answers • Induction • Generalize from available examples • If X is true in every example X must always be true • Often used to learn decision trees and rules • Explanation-based Learning • Case-based Learning Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 55
  • 56. Reinforcement Learning • Learning algorithm gets positive/negative feedback • Evaluation function • Rewards from the environment • Back propagation • Pass a reward back across the previous steps • Often paired with Neural Networks • Genetic algorithm • Parallel search for a very positive solution • Optimization technique • Q learning • Learn the value of taking an action at a state Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 56
  • 57. Unsupervised Learning • Learning algorithm gets little or no feedback • Don’t learn right or wrong answers • Just recognize interesting patterns of data • Similar to data mining • Clustering is a prime example • Most difficult class of learning problems Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 57
• 58. Machine Learning by Knowledge Representation • Decision Trees • Classification procedure • Generally learned by induction • Rules • Flexible representation with multiple uses • Learned by induction, genetic algorithms • Neural Networks • Simulates layers of neurons • Often paired with back propagation • Stochastic Models • Learning probabilistic networks • Takes advantage of prior knowledge Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 58
  • 59. Machine Learning by Knowledge Source • Examples • Supervised Learning • Environment • Supervised or Reinforcement Learning • Observation • Supervised Learning • Instruction • Supervised or Reinforcement Learning • Data points • Unsupervised Learning Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 59
• 60. A Formatting Problem • Machine learning doesn’t generate knowledge • Transfers knowledge present in the input into a more usable form • Examples => Decision Trees • Observations => Rules • Data => Clusters Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 60
  • 61. Talk Overview • Machine Learning Background • Machine Learning “The Big Picture” • Challenges in applying machine learning • Outline for ML Technique presentations Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 61
  • 62. Challenges • What is being learned? • Where to get good inputs? • What’s the right learning technique? • When to stop learning? • How to QA learning? Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 62
  • 63. What is being learned? • What are you trying to learn? • Often useful to have a sense of good answers in advance • Machine learning often finds more/better variations • Novel, unexpected solutions don’t appear often • What are the right features? • This can be the difference between success and failure • Balance what’s available, what’s useful • If features are too good there’s nothing to learn • What’s the right knowledge representation? • Again, difference between success and failure • Must be capable of representing the solution Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 63
  • 64. Where to get good inputs? • Getting good examples is essential • Need enough for useful generalization • Need to avoid examples that represent only a subset of the space • Creating a long list of examples can take a lot of time • Human experts • Observations, Logs, Traces • Examples • Other AI systems • AI prototypes • Similar games Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 64
  • 65. What’s the right learning technique? • This often falls out of the other decisions • Knowledge representations tend to be associated with techniques • Decision trees go with induction • Neural networks go with back propagation • Stochastic models go with Bayesian learning • Often valuable to try out more than one approach Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 65
  • 66. When to stop learning? • Sometimes more learning is not better • More learning might not improve the solution • More learning might result in a worse solution • Overfitting • Learned knowledge is too specific to the provided examples • Looks very good on training data • Can look good on test data • Doesn’t generalize to new inputs Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 66
• 67. How to QA learning? • Central challenge in applying machine learning to games • Adds a big element of variability into the player’s experience • Adds an additional risk factor to the development process • Offline learning • The result can undergo standard play testing • Might be hard or impossible to debug learned knowledge • Neural networks are difficult to understand • Online learning • Constrain the space learning can explore • Carefully design and bound the knowledge representation • Consider “instincts” or rules that learned knowledge can’t violate • Allow players to activate/deactivate learning Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 67
  • 68. Talk Overview • Machine Learning Background • Machine Learning “The Big Picture” • Challenges in applying machine learning • Non-learning learning • Outline for ML mechanism presentations • Decision Trees • Neural Networks • Genetic Algorithms • Bayesian Networks • Reinforcement Learning Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 68
  • 69. Outline • Background • Technical Overview • Example • Games that have used this mechanism • Pros, Cons & Challenges • References Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 69
  • 70. General Machine Learning References • Artificial Intelligence: A Modern Approach • Russell & Norvig • Machine Learning • Mitchell • Gameai.com • AI Game Programming Wisdom books • Game Programming Gems Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 70
  • 71. Decision Trees & Rule Induction Michael van Lent Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 71
  • 72. The Big Picture • Problem • Classification • Feedback • Supervised learning • Reinforcement learning • Knowledge Representation • Decision tree • Rules • Knowledge Source • Examples Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 72
• 73. Decision Trees • Nodes represent attribute tests • One child for each possible value of the attribute • Leaves represent classifications • Classify by descending from root to a leaf • At the root, test the attribute associated with that node • Descend the branch corresponding to the instance’s value • Repeat for the subtree rooted at the new node • When a leaf is reached, return the classification of that leaf • A decision tree is a disjunction of conjunctions of constraints on the attribute values of an instance Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 73
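A minimal Python sketch of that descent procedure. The tree literal roughly follows the Allegiance/Health example used on the next slides (the exact leaves of the Neutral branch are inferred, so treat them as illustrative).

```python
# A node is either a leaf (a category string) or an attribute test with
# one subtree per attribute value.
tree = ("Allegiance", {
    "friendly": ("Health", {"low": "Heal", "medium": "Heal", "full": "Ignore"}),
    "neutral":  ("Health", {"low": "Heal", "medium": "Ignore", "full": "Ignore"}),
    "enemy":    "Attack",
})

def classify(node, instance):
    """Descend from the root to a leaf, following the branch that matches
    the instance's value for each tested attribute."""
    while not isinstance(node, str):       # internal node: (attribute, branches)
        attribute, branches = node
        node = branches[instance[attribute]]
    return node                            # leaf: the classification

print(classify(tree, {"Allegiance": "friendly", "Health": "low"}))   # Heal
print(classify(tree, {"Allegiance": "enemy", "Health": "full"}))     # Attack
```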
  • 74. Example Problem Classify how I should react to an object in the world • Facts about any given object include: • Allegiance = < friendly, neutral, enemy> • Health = <low, medium, full> • Animate = <true, false> • RelativeHealth = <weaker, same, stronger> • Output categories include: • Reaction = Attack • Reaction = Ignore • Reaction = Heal • Reaction = Eat • Reaction = Run • <friendly, low, true, weaker> => Heal • <neutral, low, true, same> => Heal • <enemy, low, true, stronger> => Attack • <enemy, medium, true, weaker> => Attack Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 74
• 75. Classifying with a Decision Tree (diagram)
    Allegiance?
        Friendly → Health?  (Low → Heal, Medium → Heal, Full → Ignore)
        Neutral → Health?  (Low → Heal, Medium → Ignore, Full → Ignore)
        Enemy → Attack
Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 75
• 76. Classifying with a Decision Tree (diagram: an alternative tree for the same concept, with Health? tested at the root; one branch leads directly to Attack while the other two branches test Allegiance?, with leaves drawn from Heal and Ignore) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 76
  • 77. Decision Trees are good when: • Inputs are attribute-value pairs • With fairly small number of values • Numeric or continuous values cause problems • Can extend algorithms to learn thresholds • Outputs are discrete output values • Again fairly small number of values • Difficult to represent numeric or continuous outputs • Disjunction is required • Decision trees easily handle disjunction • Training examples contain errors • Learning decision trees • More later Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 77
  • 78. Learning Decision Trees • Decision trees are usually learned by induction • Generalize from examples • Induction doesn’t guarantee correct decision trees • Bias towards smaller decision trees • Occam’s Razor: Prefer simplest theory that fits the data • Too expensive to find the very smallest decision tree • Learning is non-incremental • Need to store all the examples • ID3 is the basic learning algorithm • C4.5 is an updated and extended version Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 78
• 79. Induction • If X is true in every example, X must always be true • More examples are better • Errors in examples cause difficulty • Note that induction can result in errors • Inductive learning of Decision Trees • Create a decision tree that classifies the available examples • Use this decision tree to classify new instances • Avoid overfitting the available examples • A separate root-to-leaf path for each example • Perfect on the examples, not so good on new instances Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 79
• 80. Induction requires Examples • Where do examples come from? • Programmer/designer provides examples • Observe a human’s decisions • # of examples needed depends on difficulty of concept • More is always better • Training set vs. Testing set • Train on most (75%) of the examples • Use the rest to validate the learned decision trees Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 80
• 81. ID3 Learning Algorithm • ID3 has two parameters • List of examples • List of attributes to be tested • Generates tree recursively • Chooses attribute that best divides the examples at each step
    ID3(examples, attributes)
        if all examples are in the same category then
            return a leaf node with that category
        if attributes is empty then
            return a leaf node with the most common category in examples
        best = Choose-Attribute(examples, attributes)
        tree = new tree with best as root attribute test
        foreach value vi of best
            examplesi = subset of examples with best == vi
            subtree = ID3(examplesi, attributes - best)
            add a branch to tree with best == vi and subtree beneath it
        return tree
Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 81
• 82. Examples
    • <friendly, low, true, weaker> => Heal
    • <neutral, full, false, same> => Eat
    • <enemy, low, true, weaker> => Eat
    • <enemy, low, true, same> => Attack
    • <neutral, low, true, weaker> => Heal
    • <enemy, medium, true, stronger> => Run
    • <friendly, full, true, same> => Ignore
    • <neutral, full, true, stronger> => Ignore
    • <enemy, full, true, same> => Run
    • <enemy, medium, true, weaker> => Attack
    • <friendly, full, true, weaker> => Ignore
    • <neutral, full, false, stronger> => Ignore
    • <friendly, medium, true, stronger> => Heal
    • 13 examples in total: 3 Heal, 2 Eat, 2 Attack, 4 Ignore, 2 Run
Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 82
• 83. Entropy • Entropy: how “mixed” a set of examples is • All one category: Entropy = 0 • Evenly divided among k categories: Entropy = log2 k • Given a set of examples S, Entropy(S) = -Σi pi log2 pi, where pi is the proportion of S belonging to class i • 13 examples with 3 heal, 2 attack, 2 eat, 4 ignore, 2 run • Entropy([3,2,2,4,2]) = 2.258 • 13 examples with all 13 heal • Entropy([13,0,0,0,0]) = 0 • Maximum entropy is log2 5 = 2.322 • 5 is the number of categories Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 83
• 84. Information Gain • Information Gain measures the reduction in Entropy • Gain(S,A) = Entropy(S) - Σ v∈Values(A) (|Sv|/|S|) Entropy(Sv) • Example: 13 examples: Entropy([3,2,2,4,2]) = 2.258 • Information gain of Allegiance = <friendly, neutral, enemy> • Allegiance = friendly for 4 examples [2,0,0,2,0] • Allegiance = neutral for 4 examples [1,1,0,2,0] • Allegiance = enemy for 5 examples [0,1,2,0,2] • Gain(S,Allegiance) = 0.903 • Information gain of Animate = <true, false> • Animate = true for 11 examples [3,1,2,3,2] • Animate = false for 2 examples [0,1,0,1,0] • Gain(S,Animate) = 0.216 • Allegiance has a higher information gain than Animate • So choose Allegiance as the next attribute to be tested Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 84
  • 85. Learning Example • Information gain of Allegiance • 0.903 • Information gain of Health • 0.853 • Information gain of Animate • 0.216 • Information gain of RelativeHealth • 0.442 • So Allegiance should be the root test Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 85
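A small Python sketch of the entropy and information-gain computations over the 13 training examples. Run as written, it reproduces the values quoted on the slides (entropy 2.258; gains 0.903, 0.853, 0.216, and 0.442).

```python
from math import log2
from collections import Counter

# (Allegiance, Health, Animate, RelativeHealth) -> category, the 13 examples above
examples = [
    (("friendly", "low",    "true",  "weaker"),   "Heal"),
    (("neutral",  "full",   "false", "same"),     "Eat"),
    (("enemy",    "low",    "true",  "weaker"),   "Eat"),
    (("enemy",    "low",    "true",  "same"),     "Attack"),
    (("neutral",  "low",    "true",  "weaker"),   "Heal"),
    (("enemy",    "medium", "true",  "stronger"), "Run"),
    (("friendly", "full",   "true",  "same"),     "Ignore"),
    (("neutral",  "full",   "true",  "stronger"), "Ignore"),
    (("enemy",    "full",   "true",  "same"),     "Run"),
    (("enemy",    "medium", "true",  "weaker"),   "Attack"),
    (("friendly", "full",   "true",  "weaker"),   "Ignore"),
    (("neutral",  "full",   "false", "stronger"), "Ignore"),
    (("friendly", "medium", "true",  "stronger"), "Heal"),
]
ATTRS = ["Allegiance", "Health", "Animate", "RelativeHealth"]

def entropy(exs):
    counts = Counter(cat for _, cat in exs)
    n = len(exs)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(exs, attr):
    i = ATTRS.index(attr)
    values = {feats[i] for feats, _ in exs}
    remainder = sum(
        len(sub) / len(exs) * entropy(sub)
        for v in values
        for sub in [[e for e in exs if e[0][i] == v]]
    )
    return entropy(exs) - remainder

print(round(entropy(examples), 3))            # 2.258
for a in ATTRS:
    print(a, round(gain(examples, a), 3))     # 0.903, 0.853, 0.216, 0.442
```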
  • 86. Decision tree so far Allegiance? Friendly Neutral Enemy ? ? ? Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 86
  • 87. Allegiance = friendly • Four examples have allegiance = friendly • Two categorized as Heal • Two categorized as Ignore • We’ll denote this now as [# of Heal, # of Ignore] • Entropy = 1.0 • Which of the remaining features has the highest info gain? • Health: low [1,0], medium [1,0], full [0,2] => Gain is 1.0 • Animate: true [2,2], false [0,0] => Gain is 0 • RelativeHealth: weaker [1,1], same [0,1], stronger [1,0] => Gain is 0.5 • Health is the best (and final) choice Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 87
• 88. Decision tree so far (diagram)
    Allegiance?
        Friendly → Health?  (Low → Heal, Medium → Heal, Full → Ignore)
        Neutral → ?
        Enemy → ?
Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 88
  • 89. Allegiance = enemy • Five examples have allegiance = enemy • One categorized as Eat • Two categorized as Attack • Two categorized as Run • We’ll denote this now as [# of Eat, # of Attack, # of Run] • Entropy = 1.5 • Which of the remaining features has the highest info gain? • Health: low [1,1,0], medium [0,1,1], full [0,0,1] => Gain is 0.7 • Animate: true [1,2,2], false [0,0,0] => Gain is 0 • RelHealth: weaker [1,1,0], same [0,1,1], stronger [0,0,1] => Gain is 0.7 • Health and RelativeHealth are equally good choices Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 89
• 90. Decision tree so far (diagram)
    Allegiance?
        Friendly → Health?  (Low → Heal, Medium → Heal, Full → Ignore)
        Neutral → ?
        Enemy → Health?  (Low → ?, Medium → ?, Full → Run)
Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 90
• 91. Final Decision Tree (diagram)
    Allegiance? at the root:
        Friendly → Health?  (Low → Heal, Medium → Heal, Full → Ignore)
        Neutral → RelHealth?  (leaves Heal, Eat, Ignore)
        Enemy → Health?  (Full → Run; Low and Medium each test RelHealth?, with leaves drawn from Eat, Attack, and Run)
Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 91
  • 92. Generalization • Previously unseen examples can be classified • Each path through the decision tree doesn’t test every feature • <neutral, low, false, stronger> => Eat • Some leaves don’t have corresponding examples • (Allegiance=enemy) & (Health=low) & (RelHealth=stronger) • Don’t have any examples of this case • Generalize from the closest example • <enemy, low, false, same> => Attack • Guess that: <enemy, low, false, stronger> => Attack Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 92
  • 93. Decision trees in Black & White • Creature learns to predict the player’s reactions • Instead of categories, range [-1 to 1] of predicted feedback • Extending decision trees for continuous values • Divide into discrete categories • … • Creature generates examples by experimenting • Try something and record the feedback (tummy rub, slap…) • Starts to look like reinforcement learning • Challenges encountered • Ensuring everything that can be learned is reasonable • Matching actions with player feedback Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 93
• 94. Decision Trees and Rules • Decision trees can easily be translated into rules, and vice versa • For the Allegiance/Health tree shown earlier:
    If (Allegiance=friendly) & ((Health=low) | (Health=medium)) then Heal
    If (Allegiance=friendly) & (Health=full) then Ignore
    If (Allegiance=neutral) & (Health=low) then Heal
    …
    If (Allegiance=enemy) then Attack
Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 94
  • 95. Rule Induction • Specific to General Induction • First example creates a very specific rule • Additional examples are used to generalize the rule • If rule becomes too general create a new, disjunctive rule • Version Spaces • Start with a very specific rule and a very general rule • Each new example either • Makes the specific rule more general • Makes the general rule more specific • The specific and general rules meet at the solution Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 95
  • 96. Learning Example • First example: <friendly, low, true, weaker> => Heal • If (Allegiance=friendly) & (Health=low) & (Animate=true) & (RelHealth=weaker) then Heal • Second example: <neutral, low, true, weaker> => Heal • If (Health=low) & (Animate=true) & (RelHealth=weaker) then Heal • Overgeneralization? • If ((Allegiance=friendly) | (Allegiance=neutral)) & (Health=low) & (Animate=true) & (RelHealth=weaker) then Heal • Third example: <friendly, medium, true, stronger> => Heal • If ((Allegiance=friendly) | (Allegiance=neutral)) & ((Health=low) | (Health=medium)) & (Animate=true) & ((RelHealth=weaker) | (RelHealth=stronger)) then Heal Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 96
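A minimal Python sketch of this specific-to-general generalization, representing a rule as a set of allowed values per attribute. Unlike the slide's second step, the sketch always keeps a disjunction rather than dropping an attribute entirely; the helper names are my own.

```python
# A rule is {attribute: set of allowed values}; generalizing unions in the
# values of a new example with the same category.
ATTRS = ["Allegiance", "Health", "Animate", "RelHealth"]

def specific_rule(example):
    return {a: {v} for a, v in zip(ATTRS, example)}

def generalize(rule, example):
    for a, v in zip(ATTRS, example):
        rule[a].add(v)
    return rule

def show(rule, category):
    conds = " & ".join("(%s=%s)" % (a, "|".join(sorted(rule[a]))) for a in ATTRS)
    return "If %s then %s" % (conds, category)

rule = specific_rule(("friendly", "low", "true", "weaker"))
print(show(rule, "Heal"))
generalize(rule, ("neutral", "low", "true", "weaker"))
print(show(rule, "Heal"))
generalize(rule, ("friendly", "medium", "true", "stronger"))
print(show(rule, "Heal"))   # matches the third, most general rule on the slide
```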
  • 97. Advanced Topics • Boosting • Manipulate the set of training examples • Increase the representation of incorrectly classified examples • Ensembles of classifiers • Learn multiple classifiers (i.e. multiple decision trees) • All the classifiers vote on the correct answer (only one approach) • “Bagging”: break the training set into overlapping subsets • Learn a classifier for each subset • Learn classifiers using different subsets of features • Or different subsets of categories • Ensembles can be more accurate than a single classifier Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 97
  • 98. Games that use inductive learning • Decision Trees • Black & White • Rules Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 98
  • 99. Inductive Learning Evaluation • Pros • Decision trees and rules are human understandable • Handle noisy data fairly well • Incremental learning • Online learning is feasible • Cons • Need many, good examples • Overfitting can be an issue • Learned decision trees may contain errors • Challenges • Picking the right features • Getting good examples Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 99
  • 100. References • Mitchell: Machine Learning, McGraw Hill, 1997. • Russell and Norvig: Artificial Intelligence: A Modern Approach, Prentice Hall, 1995. • Quinlan: Induction of decision trees, Machine Learning 1:81-106, 1986. • Quinlan: Combining instance-based and model-based learning,10th International Conference on Machine Learning, 1993. • AI Game Programming Wisdom. • AI Game Programming Wisdom 2. Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 100
  • 101. Neural Networks John Laird Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 101
• 102. Inspiration • Mimic natural intelligence • Networks of simple neurons • Highly interconnected • Adjustable weights on connections • Learn rather than program • Architecture is different • Brain is massively parallel • ~10^12 neurons • Neurons are slow • Fire 10-100 times a second Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 102
• 103. Simulated Neuron • Neurons are simple computational devices whose power comes from how they are connected together • Abstractions of real neurons • Each neuron has: • Inputs/activation from other neurons (aj) [-1, +1] • Weights of input (Wi,j) [-1, +1] • Output to other neurons (ai) • (diagram: inputs aj arrive over weights Wi,j at Neuroni, which produces output ai) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 103
• 104. Simulated Neuron • Neuron calculates the weighted sum of its inputs: ini = Σj Wi,j aj • Threshold function g(ini) calculates the output ai • Step function: if ini > t then ai = 1, else ai = 0 • Sigmoid: ai = 1/(1 + e^-ini) • Output becomes input for the next layer of neurons Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 104
• 105. Network Structure • A single neuron can represent AND or OR, but not XOR • Combinations of neurons are more powerful • Neurons are usually organized as layers • Input layer: takes external input • Hidden layer(s) • Output layer: produces external output • (diagram: Input → Hidden → Output) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 105
  • 106. Feed-forward vs. recurrent • Feed-forward: outputs only connect to later layers • Learning is easier • Recurrent: outputs connect to earlier layers • Internal state Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 106
• 107. Neural Network for a FPS-bot • Four input neurons (Enemy, Dead, Sound, Low Health) • One input for each condition • Two neuron hidden layer • Fully connected • Forces generalization • Five output neurons (Attack, Wander, Spawn, Retreat, Chase) • One output for each action • Choose action with highest output • Probabilistic action selection Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 107
• 108. Learning Weights: Back Propagation • Learning from examples • Examples consist of input and correct output (t) • Learn if network’s output doesn’t match correct output • Adjust weights to reduce difference • Only change weights a small amount (learning rate α) • Basic neuron learning • Wi,j = Wi,j + ΔWi,j • ΔWi,j = α(t-o)aj • If output is too high, (t-o) is negative so Wi,j will be reduced • If output is too low, (t-o) is positive so Wi,j will be increased • If aj is negative the opposite happens Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 108
• 109. Back propagation algorithm
    Repeat
        Foreach e in examples do
            O = Run-Network(network, e)
            // Calculate error term for output layer
            Foreach neuron k in the output layer do
                Errk = ok(1-ok)(tk-ok)
            // Calculate error term for hidden layer
            Foreach neuron h in the hidden layer do
                Errh = oh(1-oh) Σk Wk,h Errk
            // Update weights of all neurons
            Foreach neuron do
                Wi,j = Wi,j + α xi,j Errj
    Until network has converged
Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 109
• 110. Neural Net Example • Single neuron to represent OR • Two inputs • One output (1 if either input is 1) • Step function (if weighted sum > 0.5, output a = 1) • Inputs (1, 0) with weights (0.1, 0.6): Σ Wj aj = 0.1, g(0.1) = 0 • Error, so training occurs Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 110
• 111. Neural Net Example • Wj = Wj + ΔWj • ΔWj = α(t-o)aj • W1 = 0.1 + 0.1(1-0)1 = 0.2 • W2 = 0.6 + 0.1(1-0)0 = 0.6 • Inputs (0, 1) with weights (0.2, 0.6): Σ Wj aj = 0.6, g(0.6) = 1 • No error, so no training occurs Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 111
• 112. Neural Net Example • Inputs (1, 0) with weights (0.2, 0.6): Σ Wj aj = 0.2, g(0.2) = 0 • Error, so training occurs • W1 = 0.2 + 0.1(1-0)1 = 0.3 • W2 = 0.6 + 0.1(1-0)0 = 0.6 • Inputs (1, 1) with weights (0.3, 0.6): Σ Wj aj = 0.9, g(0.9) = 1 Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 112
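A minimal Python sketch of the single-neuron OR training just worked through, assuming the same step threshold of 0.5, learning rate 0.1, and initial weights 0.1 and 0.6.

```python
def step(weighted_sum, threshold=0.5):
    return 1 if weighted_sum > threshold else 0

def train_step(weights, inputs, target, rate=0.1):
    """One application of the update rule: Wj <- Wj + rate * (t - o) * aj."""
    output = step(sum(w * a for w, a in zip(weights, inputs)))
    return [w + rate * (target - output) * a for w, a in zip(weights, inputs)]

weights = [0.1, 0.6]
# Present OR examples in the order used on the slides; target is 1 if either input is 1.
for inputs in [(1, 0), (0, 1), (1, 0), (1, 1)]:
    target = 1 if any(inputs) else 0
    weights = train_step(weights, inputs, target)
    print(inputs, [round(w, 2) for w in weights])
# The weights move from [0.1, 0.6] to [0.2, 0.6] to [0.3, 0.6], as on the slides.
```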
  • 113. Using Neural Networks in Games • Classification/function approximation • In game or during development • Learning to predict the reward associated with a state • Can be the core of reinforcement learning • Situational Assessment/Classification • Feelings toward objects in world or other players • Black & White BC3K • Predict enemy action Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 113
  • 114. Neural Network Example Systems • BattleCruiser: 3000AD • Guide NPC: Negotiation, trading, combat • Black & White • Teach creatures desires and preferences • Creatures • Creature behavior control • Dirt Track Racing • Race track driving control • Heavy Gear • Multiple NNs for control Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 114
• 115. NN Example: B & W • Low Energy: Source = 0.2, Weight = 0.8, Value = Source * Weight = 0.16 • Tasty Food: Source = 0.4, Weight = 0.2, Value = Source * Weight = 0.08 • Unhappiness: Source = 0.7, Weight = 0.2, Value = Source * Weight = 0.14 • Hunger = threshold(0.16 + 0.08 + 0.14) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 115
  • 116. Neural Networks Evaluation • Advantages • Handle errors well • Graceful degradation • Can learn novel solutions • Disadvantages • Feed forward doesn’t have memory of prior events • Can’t understand how or why the learned network works • Usually requires experimentation with parameters • Learning takes lots of processing • Incremental so learning during play might be possible • Run time cost is related to number of connections • Challenges • Picking the right features • Picking the right learning parameters • Getting lots of data Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 116
  • 117. References • General AI Neural Network References: • Mitchell: Machine Learning, McGraw Hill, 1997 • Russell and Norvig: Artificial Intelligence: A Modern Approach, Prentice Hall, 2003 • Hertz, Krogh & Palmer: Introduction to the theory of neural computation, Addison- Wesley, 1991 • Cowan & Sharp: Neural nets and artificial intelligence, Daedalus 117:85-121, 1988 • Neural Networks in Games: • Penny Sweetser, How to Build Neural Networks for Games • AI Programming Wisdom 2 • Mat Buckland, Neural Networks in Plain English, AI-Junkie.com • John Manslow, Imitating Random Variations in Behavior using Neural Networks • AI Programming Wisdom, p. 624 • Alex Champandard, The Dark Art of Neural Networks • AI Programming Wisdom, p. 640 • John Manslow, Using a Neural Network in a Game: A Concrete Example • Game Programming Gems 2 Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 117
  • 118. Genetic Algorithms Michael van Lent Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 118
  • 119. Background • Evolution creates individuals with higher fitness • Population of individuals • Each individual has a genetic code • Successful individuals (higher fitness) more likely to breed • Certain codes result in higher fitness • Very hard to know ahead which combination of genes = high fitness • Children combine traits of parents • Crossover • Mutation • Optimize through artificial evolution • Define fitness according to the function to be optimized • Encode possible solutions as individual genetic codes • Evolve better solutions through simulated evolution Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 119
  • 120. The Big Picture • Problem • Optimization • Classification • Feedback • Reinforcement learning • Knowledge Representation • Feature String • Classifiers • Code (Genetic Programming) • Knowledge Source • Evaluation function Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 120
  • 121. Genes • Gene is typically a string of symbols • Frequently a bit string • Gene can be a simple function or program • Evolutionary programming • Challenges in gene representation • Every possible gene should encode a valid solution • Common representation • Coefficients • Weights for state transitions in a FSM • Classifiers • Code (Genetic Programming) • Neural network weights Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 121
  • 122. Classifiers • Classification rules encoded as bit strings • Bits 1-3: Allegiance (1=friendly, 2=neutral, 3=enemy) • Bits 4-6: Health (4=low, 5=medium, 6=full) • Bits 7-8: Animate (7=true, 8=false) • Bits 9-11: RelHealth (9=weaker, 10=same, 11=stronger) • Bits 12-16: Action(Attack, Ignore, Heal, Eat, Run) • Example • If ((Allegiance=friendly) | (Allegiance=neutral)) & ((Health=low) | (Health=medium)) & (Animate=true) & ((RelHealth=weaker) | (RelHealth=stronger)) then Heal • 110 110 10 101 00100 • Need to ensure that bits 12-16 are mutually exclusive Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 122
• 123. Genetic Algorithm
    initialize population p with random genes
    repeat
        foreach pi in p
            fi = fitness(pi)
        repeat
            parent1 = select(p, f)
            parent2 = select(p, f)
            child1, child2 = crossover(parent1, parent2)
            if (random < mutate_probability) child1 = mutate(child1)
            if (random < mutate_probability) child2 = mutate(child2)
            add child1, child2 to p'
        until p' is full
        p = p'
• Fitness(gene): the fitness function • Select(population, fitness): weighted selection of parents • Crossover(gene, gene): crosses over two genes • Mutate(gene): randomly mutates a gene Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 123
• 124. Genetic Operators • Crossover • Select two points at random • Swap genes between the two points • Mutate • Small probability of randomly changing each part of a gene Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 124
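A minimal Python sketch of these two operators applied to the 16-bit classifier strings from the earlier slide. The mutation rate is an arbitrary illustrative value.

```python
import random

def crossover(parent1, parent2):
    """Two-point crossover: pick two cut points and swap the middle segment."""
    i, j = sorted(random.sample(range(len(parent1) + 1), 2))
    child1 = parent1[:i] + parent2[i:j] + parent1[j:]
    child2 = parent2[:i] + parent1[i:j] + parent2[j:]
    return child1, child2

def mutate(gene, rate=0.05):
    """Flip each bit independently with a small probability."""
    return "".join(b if random.random() > rate else str(1 - int(b)) for b in gene)

parent1 = "1101101011001000"   # 110 110 10 110 01000
parent2 = "0001010101000010"   # 000 101 01 010 00010
c1, c2 = crossover(parent1, parent2)
print(mutate(c1), mutate(c2))
```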
• 125. Example: Evaluation • Initial Population: • 110 110 10 110 01000: (friendly | neutral) & (low | medium) & (true) & (weaker | same) => Ignore • 001 010 00 101 00100: (enemy) & (medium) & (weaker | stronger) => Heal • 010 001 11 111 10000: (neutral) & (full) & (true | false) & (weaker | same | stronger) => Attack • 000 101 01 010 00010: (low | full) & (false) & (same) => Eat • Evaluation: • 110 110 10 110 01000: Fitness score = 47 • 010 001 11 111 10000: Fitness score = 23 • 000 101 01 010 00010: Fitness score = 39 • 001 010 00 101 00100: Fitness score = 12 Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 125
• 126. Example: Genetic Operators • Crossover: • Parent 1: 110 110 10 110 01000 • Parent 2: 000 101 01 010 00010 • Crossover after bit 7: • Child 1: 110 110 1 | 1 010 00010 • Child 2: 000 101 0 | 0 110 01000 • Mutations • 110 110 11 011 00010 • 000 101 00 110 01000 • Evaluate the new population • Repeat Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 126
• 127. Advanced Topics • Competitive evaluation • Evaluate each gene against the rest of the population • Genetic programming • Each gene is a chunk of code • Generally represented as a parse tree • Punctuated Equilibria • Evolve multiple parallel populations • Occasionally swap members • Identifies a wider range of high fitness solutions Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 127
  • 128. Games that use GAs • Creatures • Creatures 2 • Creatures 3 • Creatures Adventures • Seaman • Nooks & Crannies • Return Fire II Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 128
  • 129. Genetic Algorithm Evaluation • Pros • Powerful optimization technique • Parallel search of the space • Can learn novel solutions • No examples required to learn • Cons • Evolution takes lots of processing • Not very feasible for online learning • Can’t guarantee an optimal solution • May find uninteresting but high fitness solutions • Challenges • Finding correct representation can be tricky • The richer the representation, the bigger the search space • Fitness function must be carefully chosen Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 129
  • 130. References • Mitchell: Machine Learning, McGraw Hill, 1997. • Holland: Adaptation in natural and artificial systems, MIT Press 1975. • Back: Evolutionary algorithms in theory and practice, Oxford University Press 1996. • Booker, Goldberg, & Holland: Classifier systems and genetic algorithms, Artificial Intelligence 40: 235-282, 1989. • AI Game Programming Wisdom. • AI Game Programming Wisdom 2. Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 130
  • 131. Bayesian Learning Michael van Lent Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 131
  • 132. The Big Picture • Problem • Classification • Stochastic Modeling • Feedback • Supervised learning • Knowledge Representation • Bayesian classifiers • Bayesian Networks • Knowledge Source • Examples Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 132
  • 133. Background • Most learning approaches learn a single best guess • Learning algorithm selects a single hypothesis • Hypothesis = Decision tree, rule set, neural network… • Probabilistic learning • Learn the probability that a hypothesis is correct • Identify the most probable hypothesis • Competitive with other learning techniques • A single example doesn’t eliminate any hypothesis • Notation • P(h): probability that hypothesis h is correct • P(D): probability of seeing data set D • P(D|h): probability of seeing data set D given that h is correct • P(h|D): probability that h is correct given that D is seen Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 133
• 134. Bayes Rule • Bayes rule is the foundation of Bayesian learning • P(h|D) = P(D|h) P(h) / P(D) • As P(D|h) increases, so does P(h|D) • As P(h) increases, so does P(h|D) • As P(D) increases, P(h|D) decreases Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 134
  • 135. Example • A monster has two attacks, A and B: • Attack A does 11-20 damage and is used 10% of the time • Attack B does 16–115 damage and is used 90% of the time • You have counters A’ (for attack A) and B’ (for attack B) • If an attack does 16-20 damage, which counter to use? • P(A|damage=16-20) greater or less than 50%? • We don’t know P(A|16-20) • We do know P(A), P(B), P(16-20|A), P(16-20|B) • We only need P(16-20) • P(16-20) = P(A) P(16-20|A) + P(B) P(16-20|B) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 135
• 136. Example (cont’d) • Some probabilities • P(A) = 10% • P(B) = 90% • P(16-20|A) = 50% • P(16-20|B) = 5% • P(A|16-20) = P(16-20|A) P(A) / P(16-20) = (0.5)(0.1) / [(0.1)(0.5) + (0.9)(0.05)] = 0.05 / 0.095 = 0.5263 = 52.63% • So counter A’ is the slightly better choice Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 136
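The same calculation as a few lines of Python, using only the probabilities given above.

```python
p_a, p_b = 0.10, 0.90                        # how often each attack is used
p_dmg_given_a, p_dmg_given_b = 0.50, 0.05    # P(16-20 damage | attack)

p_dmg = p_a * p_dmg_given_a + p_b * p_dmg_given_b   # P(16-20) = 0.095
p_a_given_dmg = p_dmg_given_a * p_a / p_dmg         # Bayes rule
print(round(p_a_given_dmg, 4))                       # 0.5263, so counter A' wins narrowly
```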
• 137. Bayes Optimal Classifier • Given data D, what’s the probability that a new example falls into category c • P(example=c|D) or P(c|D) • Best classification is the highest P(c|D): max over ci∈C of P(ci|D) = max over ci∈C of Σ hj∈H P(ci|hj) P(hj|D) • This approach tends to be computationally expensive • Space of hypotheses is generally very large Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 137
  • 138. Example Problem Classify how I should react to an object in the world • Facts about any given object include: • Allegiance = < friendly, neutral, enemy> • Health = <low, medium, full> • Animate = <true, false> • RelativeHealth = <weaker, same, stronger> • Output categories include: • Reaction = Attack • Reaction = Ignore • Reaction = Heal • Reaction = Eat • Reaction = Run • <friendly, low, true, weaker> => Heal • <neutral, low, true, same> => Heal • <enemy, low, true, stronger> => Attack • <enemy, medium, true, weaker> => Attack Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 138
• 139. Naïve Bayes Classifier • Each example is a set of feature values • friendly, low, true, weaker • Given a set of feature values, find the most probable category • Which is highest: • P(Attack | friendly, low, true, weaker) • P(Ignore | friendly, low, true, weaker) • P(Heal | friendly, low, true, weaker) • P(Eat | friendly, low, true, weaker) • P(Run | friendly, low, true, weaker) • cNB = max over ci∈C of P(ci | f1, f2, f3, f4) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 139
• 140. Calculating Naïve Bayes Classifier • cNB = max over ci∈C of P(ci | f1, f2, f3, f4) = max over ci∈C of P(f1, f2, f3, f4 | ci) P(ci) / P(f1, f2, f3, f4) = max over ci∈C of P(f1, f2, f3, f4 | ci) P(ci) • Simplifying assumption: each feature in the example is independent • Value of Allegiance doesn’t affect value of Health, Animate, or RelativeHealth • P(f1, f2, f3, f4 | ci) = Πj P(fj | ci) • So cNB = max over ci∈C of P(ci) Πj P(fj | ci) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 140
  • 141. Example • Slightly modified 13 examples: • <friendly, low, true, weaker> => Heal • <neutral, full, false, stronger> => Eat • <enemy, low, true, weaker> => Eat • <enemy, low, true, same> => Attack • <neutral, low, true, weaker> => Heal • <enemy, medium, true, stronger> => Run • <friendly, full, true, same> => Ignore • <neutral, full, true, stronger> => Ignore • <enemy, full, true, same> => Run • <enemy, medium, true, weaker> => Attack • <enemy, low, true, weaker> => Ignore • <neutral, full, false, stronger> => Ignore • <friendly, medium, true, stronger> => Heal • Estimate the most likely classification of: • <enemy, full, true, stronger> Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 141
  • 142. Example • Need to calculate: • P(Attack| <enemy, full, true, stronger>) = P(Attack) P(enemy|Attack) P(full|Attack) P(true|Attack) P(stronger|Attack) • P(Ignore| <enemy, full, true, stronger>) = P(Ignore) P(enemy|Ignore) P(full|Ignore) P(true|Ignore) P(stronger|Ignore) • P(Heal| <enemy, full, true, stronger>) = P(Heal) P(enemy|Heal) P(full|Heal) P(true|Heal) P(stronger|Heal) • P(Eat| <enemy, full, true, stronger>) = P(Eat) P(enemy|Eat) P(full|Eat) P(true|Eat) P(stronger|Eat) • P(Run| <enemy, full, true, stronger>) = P(Run) P(enemy|Run) P(full|Run) P(true|Run) P(stronger|Run) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 142
  • 143. Example (cont’d) • P(Ignore| <enemy, full, true, stronger>) = P(Ignore) P(enemy|Ignore) P(full|Ignore) P(true|Ignore) P(stronger|Ignore) P(Ignore) = 4 of 13 examples = 4/13 = 31% P(enemy|Ignore) = 1 of 4 examples = ¼ = 25% P(full|Ignore) = 3 of 4 examples = ¾ = 75% P(true|Ignore) = 3 of 4 examples = ¾ = 75% P(stronger|Ignore) = 2 of 4 examples = 2/4 = 50% P(Ignore| <enemy, full, true, stronger>) = 2.2% Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 143
  • 144. Example (cont’d) • P(Run| <enemy, full, true, stronger>) = P(Run) P(enemy|Run) P(full|Run) P(true|Run) P(stronger|Run) P(Run) = 2 of 13 examples = 2/13 = 15% P(enemy|Run) = 2 of 2 examples = 100% P(full|Run) = 1 of 2 examples = 50% P(true|Run) = 2 of 2 examples = 100% P(stronger|Run) = 1 of 2 examples = 50% P(Run| <enemy, full, true, stronger>) = 3.8% Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 144
• 145. Result • P(Ignore| <enemy, full, true, stronger>) = 2.2% • P(Run| <enemy, full, true, stronger>) = 3.8% • P(Eat| <enemy, full, true, stronger>) = 0.1% • P(Heal| <enemy, full, true, stronger>) = 0% • P(Attack| <enemy, full, true, stronger>) = 0% • So the Naïve Bayes Classification says Run is most probable • 63% chance that Run is correct • 36% chance that Ignore is correct • 1% chance that Eat is correct Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 145
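A minimal Python sketch of the naïve Bayes scoring over the modified 13 examples, estimating each probability as a simple observed fraction (no m-estimate yet). It reproduces the Run and Ignore scores worked out above.

```python
from collections import Counter

# The modified 13 examples: (Allegiance, Health, Animate, RelHealth) -> category
examples = [
    (("friendly", "low",    "true",  "weaker"),   "Heal"),
    (("neutral",  "full",   "false", "stronger"), "Eat"),
    (("enemy",    "low",    "true",  "weaker"),   "Eat"),
    (("enemy",    "low",    "true",  "same"),     "Attack"),
    (("neutral",  "low",    "true",  "weaker"),   "Heal"),
    (("enemy",    "medium", "true",  "stronger"), "Run"),
    (("friendly", "full",   "true",  "same"),     "Ignore"),
    (("neutral",  "full",   "true",  "stronger"), "Ignore"),
    (("enemy",    "full",   "true",  "same"),     "Run"),
    (("enemy",    "medium", "true",  "weaker"),   "Attack"),
    (("enemy",    "low",    "true",  "weaker"),   "Ignore"),
    (("neutral",  "full",   "false", "stronger"), "Ignore"),
    (("friendly", "medium", "true",  "stronger"), "Heal"),
]

def naive_bayes_scores(query):
    class_counts = Counter(cat for _, cat in examples)
    scores = {}
    for cat, n_cat in class_counts.items():
        score = n_cat / len(examples)                 # P(c)
        for i, value in enumerate(query):             # times each P(f_j | c)
            matches = sum(1 for feats, c in examples
                          if c == cat and feats[i] == value)
            score *= matches / n_cat
        scores[cat] = score
    return scores

scores = naive_bayes_scores(("enemy", "full", "true", "stronger"))
for cat, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(cat, round(s, 3))   # Run ~0.038 and Ignore ~0.022: the slide's 3.8% and 2.2%
```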
  • 146. Estimating Probabilities • Need lots of examples for accurate estimates • With only 13 examples: • No example of: • Health=full for Attack category • RelativeHealth=Stronger for Attack • Allegiance=enemy for Heal • Health=full for Heal • Only two examples of Run • P(f1|Run) can only be 0%, 50%, or 100% • What if the true probability is 16.2%? • Need to add a factor to probability estimates that: • Prevents missing examples from dominating • Estimates what might happen with more examples Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 146
• 147. m-estimate • Solution: m-estimate • Establish a prior estimate p • Expert input • Assume uniform distribution • Estimate the probability as (nc + mp) / (n + m) • m is the equivalent sample size • Augment n observed samples with m virtual samples • If there are no examples (nc = 0) the estimate is still > 0% • If p(run) = 20% and m = 10 then P(full|Run): • Goes from 50% (1 of 2 examples) • to 25%: (1 + 10(.2)) / (2 + 10) = 3/12 = 0.25 Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 147
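The m-estimate as a one-line Python helper, applied to the P(full|Run) example above.

```python
def m_estimate(n_c, n, p, m):
    """Blend the observed fraction n_c/n with a prior p backed by m virtual samples."""
    return (n_c + m * p) / (n + m)

# P(full | Run): 1 of 2 observed examples, prior p = 0.2, m = 10 virtual samples
print(m_estimate(1, 2, 0.2, 10))   # 0.25 instead of the raw 0.5
```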
• 148. Bayesian Networks • Graph structure encoding causality between variables • Directed, acyclic graph • A → B indicates that A directly influences B • Positive or negative influence • (diagram: an Attack node with values A and B pointing to a Damage node with values 11-15, 16-20, and 21-115) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 148
• 149. Another Bayesian Network • Intruder (P(I) = 10%) and Rat (P(R) = 40%) both influence Noise: P(N|I,R) = 95%, P(N|I,not R) = 30%, P(N|not I,R) = 60%, P(N|not I,not R) = 2% • Noise influences Guard1 Report (P(G1|N) = 90%, P(G1|not N) = 5%) and Guard2 Report (P(G2|N) = 70%, P(G2|not N) = 1%) • Inference on Bayesian Networks can determine the probability of unknown nodes (Intruder) given some known values • If Guard2 reports but Guard1 doesn’t, what’s the probability of Intruder? Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 149
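One way to answer that query is inference by enumeration: sum the joint probability over the hidden variables (Rat and Noise) for each value of Intruder. The sketch below uses the CPT numbers from the slide; the answer it prints is the result of that enumeration, not a figure quoted on the slides.

```python
from itertools import product

P_I, P_R = 0.10, 0.40
P_N = {(True, True): 0.95, (True, False): 0.30,
       (False, True): 0.60, (False, False): 0.02}   # P(Noise | Intruder, Rat)
P_G1 = {True: 0.90, False: 0.05}                     # P(Guard1 reports | Noise)
P_G2 = {True: 0.70, False: 0.01}                     # P(Guard2 reports | Noise)

def joint(i, r, n, g1, g2):
    p = (P_I if i else 1 - P_I) * (P_R if r else 1 - P_R)
    p *= P_N[(i, r)] if n else 1 - P_N[(i, r)]
    p *= P_G1[n] if g1 else 1 - P_G1[n]
    p *= P_G2[n] if g2 else 1 - P_G2[n]
    return p

# P(Intruder | Guard2 reports, Guard1 does not), summing out Rat and Noise
evidence = dict(g1=False, g2=True)
num = sum(joint(True, r, n, **evidence) for r, n in product([True, False], repeat=2))
den = sum(joint(i, r, n, **evidence) for i, r, n in product([True, False], repeat=3))
print(round(num / den, 3))   # about 0.16: an intruder is still fairly unlikely
```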
• 150. Learning Bayesian Networks • Learning the topology of Bayesian networks • Search the space of network topologies • Adding arcs, deleting arcs, reversing arcs • Are independent nodes in the network independent in the data? • Does the network explain the data? • Need to weight towards fewer arcs • Learning the probabilities of Bayesian networks • Experts are good at constructing networks • Experts aren’t as good at filling in probabilities • Expectation Maximization (EM) algorithm • Gibbs Sampling Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 150
  • 151. Bayesian Learning Evaluation • Pros • Takes advantage of prior knowledge • Probabilistic predictions (prediction confidence) • Handles noise well • Incremental learning • Cons • Less effective with low number of examples • Can be computationally expensive • Challenges • Identifying the right features • Getting a large number of good examples Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 151
  • 152. References • Mitchell: Machine Learning, McGraw Hill, 1997. • Russell and Norvig: Artificial Intelligence: A Modern Approach, Prentice Hall, 1995. • AI Game Programming Wisdom. Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 152
  • 153. Reinforcement Learning John Laird Thanks for online reference material to: Satinder Singh, Yijue Hou & Patrick Doyle Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 153
  • 154. Outline of Reinforcement Learning • What is it? • When is it useful? • Examples from games • Analysis Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 154
  • 155. Reinforcement Learning • A set of problems, not a single technique: • Adaptive Dynamic Programming • Temporal Difference Learning • Q learning • Cover story for Neural Networks, Decision Trees, etc. • Best for tuning behaviors • Often requires many training trials to converge • Very general technique applicable to many problems • Backgammon, poker, helicopter flying, truck & car driving Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 155
• 156. Reinforcement Learning • Agent receives some reward/punishment for behavior • Is not told directly what to do or what not to do • Only whether it has done well or poorly • Reward can be intermittent and is often delayed • Must solve the temporal credit assignment problem • How can it learn to select actions that occur before reward? • (diagram: the game feeds a critic, the critic produces a reward, and the learning algorithm turns the reward into new or corrected knowledge for the game AI) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 156
• 157. Deathmatch Example • Learn to kill the enemy better • Possible rewards for Halo: • +10 kill enemy • -3 killed • State features • Health, enemy health • Weapon, enemy weapon • Relative position and facing of enemy • Absolute and relative speeds • Relative positions of nearby obstacles • (diagram: a sequence of state → action → state → action steps) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 157
• 158. Two Approaches to Reinforcement Learning • Passive learning = behavior cloning • Examples of behavior are presented to learner • Learn a model of a human player • Tries to learn a single optimal policy • Active learning = learning from experience • Agent is trying to perform task and learn at same time • Must trade off exploration vs. exploitation • Can train against itself or against humans Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 158
  • 159. What can be Learned? • Utility Function: • How good is a state? • The utility of state si: U(si) • Choose action to maximize expected utility of result • Action-Value: • How good is a given action for a given state? • The expected utility of performing action aj in state si: V(si,aj) • Choose action with best expected utility for current state Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 159
• 160. Utility Function for States: U(si) • Agent chooses action to maximize expected utility • One step look-ahead • Agent must have a “model” of environment • Possible transitions from state to state • Can be learned or preprogrammed Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 160
  • 161. Trivial Example: Maze Learning Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 161
• 162. Learning State Utility Function: U(si) • (figure: the maze annotated with learned utility values, rising from about .80 in cells far from the goal to .99 in the cell next to it) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 162
• 163. Action Value Function: V(si,aj) • Agent chooses action that is best for current state • Just compare operators – not states • Agent doesn’t need a “model” of environment • But must learn separate value for each state-action pair Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 163
• 164. Learning Action-Value Function: V(si,aj) • (figure: a maze cell with a learned value for each available action, e.g. .83, .85, .83) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 164
• 165. Review of Dimensions • Source of learning data: Passive vs. Active • What is learned: State utility function vs. Action-value Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 165
  • 166. Passive Utility Function Approaches • Least Mean Squares (LMS) • Adaptive Dynamic Programming (ADP) • Requires a model (M) for learning • Temporal Difference Learning (TDL) • Model free learning (uses model for decision making, but not for learning). Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 166
• 167. Learning State Utility Function (U) • Assume k states in the world • Agent keeps: • An estimate U of the utility of each state (k) • A table N of how many times each state was seen (k) • A table M (the model) of the transition probabilities (k x k) • likelihood of moving from each state to another state • (diagram and table: a four-state example, e.g. M gives S1 → S1 with probability .6 and S1 → S2 with .4, S2 → S2 with .3 and S2 → S3 with .7, S3 → S4 with 1, and S4 → S1 with 1) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 167
• 168. Adaptive Dynamic Programming (ADP) • Utility = immediate reward plus probability-weighted future reward • U(i) = R(i) + Σj Mij U(j) • Initial Utilities: S1 = .5, S2 = .6, S3 = .2, S4 = .1 • Transition probabilities out of S2: M2j = [0, .2, .3, .5] • In state S2, receiving reward .3: U(S2) = .3 + 0(.5) + .2(.6) + .3(.2) + .5(.1) = .3 + 0 + .12 + .06 + .05 = .53 • Exact, but inefficient in large search spaces • Requires sweeping through the complete space Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 168
• 169. Temporal Difference Learning • Approximate ADP • Adjust the estimated utility value of the current state based on its immediate reward and the estimated value of the next state • U(i) = U(i) + α(R(i) + U(j) - U(i)) • α is the learning rate • if α continually decreases, U will converge Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 169
• 170. Temporal Difference Example • Utility = reward and probability of future reward • U(i) = U(i) + α(R(i) + U(j) - U(i)) • Initial Utilities: S1 = .5, S2 = .6, S3 = .2, S4 = .1 • In state S2, get reward .3, go to state S3 (α = .5): U(S2) = .6 + .5(.3 + .2 - .6) = .6 + .5(-.1) = .6 - .05 = .55 Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 170
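The same TD update as a tiny Python function, reproducing the .55 result above.

```python
def td_update(U, state, reward, next_state, alpha):
    """U(i) <- U(i) + alpha * (R(i) + U(j) - U(i)), the TD rule from the slide."""
    U[state] += alpha * (reward + U[next_state] - U[state])
    return U

U = {"S1": 0.5, "S2": 0.6, "S3": 0.2, "S4": 0.1}
td_update(U, "S2", 0.3, "S3", alpha=0.5)
print(round(U["S2"], 3))   # 0.55, matching the worked example
```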
  • 171. TD vs. ADP • ADP learns faster • ADP is less variable • TD is simpler • TD has less computation/observation • TD does not require a model during learning • TD biases update to observed successor instead of all Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 171
• 172. Active Learning State Utilities: ADP • Active learning must decide which action to take and update based on what it does • Extend model M to give the probability of a transition from a state i to a state j, given an action a • Utility uses the maximum over actions: U(i) = R(i) + maxa [Σj Maij U(j)] Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 172
• 173. Active Learning State-Action Functions (Q-Learning) • Combines situation and action • Q(a,i) = expected utility of using action a in state i • U(i) = maxa Q(a, i) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 173
• 174. Q Learning • ADP version: Q(a, i) = R(i) + Σj Maij maxa' Q(a', j) • TD version: Q(a, i) <- Q(a, i) + α(R(i) + γ(maxa' Q(a', j) - Q(a, i))) • Example: Q(a, S1) = .7 and the successor state S2 has action values .6, .7, .9 • If α is .1, γ is .9, and R(S1) = 0: Q(a, S1) = .7 + .1(0 + .9(max(.6, .7, .9) - .7)) = .7 + .1 * .9 * (.9 - .7) = .7 + .018 = .718 • Selection is biased by expected utilities • Balances exploration vs. exploitation • With experience, bias more toward higher values Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 174
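The TD-style Q update as written on the slide (with γ applied to the whole difference), in a few lines of Python that reproduce the .718 result.

```python
def q_update(q_sa, reward, next_action_values, alpha=0.1, gamma=0.9):
    """Q(a,i) <- Q(a,i) + alpha * (R(i) + gamma * (max_a' Q(a',j) - Q(a,i)))."""
    return q_sa + alpha * (reward + gamma * (max(next_action_values) - q_sa))

print(round(q_update(0.7, 0.0, [0.6, 0.7, 0.9]), 3))   # 0.718
```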
  • 175. Q-Learning • Q-Learning is the first provably convergent direct adaptive optimal control algorithm • Great impact on the field of modern RL • smaller representation than models • automatically focuses attention to where it is needed, i.e., no sweeps through state space Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 175
• 176. Q Learning Algorithm
    For each pair (s, a), initialize Q(s, a)
    Observe the current state s
    Loop forever {
        Select an action a = argmaxa Q(s, a) and execute it
        Receive immediate reward r and observe the new state s'
        Update Q(s, a):
            Q(s, a) = Q(s, a) + α(r + γ maxa' Q(s', a') - Q(s, a))
        s = s'
    }
Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 176
• 177. Summary Comparison • State Utility Function: • Requires model • More general/faster learning • Learns about states • Slower execution • Must compute follow-on states • If it has a model of reward, doesn’t need the environment • Useful for worlds with a model • Maze worlds, board games, … • State-Action: • Model free • Less general/slower learning • Must learn state-action combinations • Faster execution • Preferred for complex worlds where a model isn’t available Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 177
  • 178. Anark Software Galapagos • Player trains creature by manipulating environment • Creature learns from pain, death, and reward for movement • Learns to move and classify objects in world based on their pain/death. Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 178
  • 179. Challenges • Exploring the possibilities • Picking the right representation • Large state spaces • Infrequent reward • Inter-dependence of actions • Complex data structures • Dynamic worlds • Setting parameters Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 179
• 180. Exploration vs. Exploitation • Problem: If there is a large space of possible actions, the agent might never experience many of them if it learns too quickly • Exploration: try out actions • Exploitation: use knowledge to improve behavior • Compromise: • Random selection, but bias choice toward the best actions • Over time, bias more and more toward the best actions Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 180
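One simple way to implement that compromise is epsilon-greedy selection with a decaying exploration rate; a minimal sketch follows. The action names, Q values, and decay schedule are invented for illustration, and softmax selection over Q values would be another common choice.

```python
import random

def choose_action(q_values, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(list(q_values))      # explore: try any action
    return max(q_values, key=q_values.get)        # exploit: best-known action

q = {"attack": 0.7, "retreat": 0.2, "wander": 0.1}
epsilon = 1.0
for step in range(1000):
    action = choose_action(q, epsilon)
    epsilon = max(0.05, epsilon * 0.995)   # decay: bias more toward the best actions over time
    # ... take the action, observe the reward, update q ...
```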
• 181. Picking the Right Representation • Too few features and it is impossible to learn • e.g. learning to drive when you can’t sense acceleration or speed • Too many features and you can’t use exact (table-based) representations • See next section Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 181
• 182. Large State Spaces: Curse of Dimensionality • Look-up table for Q values • AI Programming Wisdom 2, p. 597 • OK for 2-3 variables • Fast learning, but lots of memory • Issues: • Hard to get data that covers each state enough times to learn accurate utility functions • Probably many different states have similar utility • Data structures for storing utility functions can be very large • State-action approaches (Q-learning) exacerbate the problem • Deathmatch example: • Health [10], Enemy Health [10], Relative Distance [10], Relative Heading [10], Relative Opponent Heading [10], Weapon [5], Ammo [10], Power ups [4], Enemy Power ups [4], My Speed [4], His Speed [4], Distances to Walls [5, 5, 5, 5] • 8 × 10^14 states Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 182
• 183. Solution • Approximate the state space with some function • Neural networks, decision trees, nearest neighbor, Bayesian networks, … • Can be slower than a lookup table but much more compact Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 183
• 184. Function Approximation: Neural Networks • Use all the state features as input, with the utility as output • (Diagram: for Q-learning, the network takes the state features and the action as input and produces a utility estimate as output) • The output could instead be the set of actions and their utilities Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 184
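As the simplest instance of the function-approximation idea (a linear approximator rather than a full neural network), the Q-value of each action can be a weighted sum of the state features, with one weight vector per action adjusted by gradient descent on the TD error; this is a hedged sketch, not the setup used in the work cited on the following slides:

def q_approx(weights, features):
    # Q(s, a) approximated as a weighted sum of the state features for action a.
    return sum(w * f for w, f in zip(weights, features))

def td_update(weights, features, reward, best_next_q, alpha=0.01, gamma=0.9):
    # Move the weights toward the TD target r + gamma * max_a' Q(s', a').
    td_error = reward + gamma * best_next_q - q_approx(weights, features)
    return [w + alpha * td_error * f for w, f in zip(weights, features)]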
• 185. Geisler – FPS Offline Learning • (Diagram: the area around the player is divided into four sectors with a 700-foot radius) • Input features: • Closest Enemy Health • Number of Enemies in Sector 1 • Number of Enemies in Sector 2 • Number of Enemies in Sector 3 • Number of Enemies in Sector 4 • Player Health • Closest Goal Distance • Closest Goal Sector • Closest Enemy Sector • Distance to Closest Enemy • Current Move Direction • Current Face Direction • Output: • Accelerate • Move Direction • Facing Direction • Jumping • Tested with Neural Networks, Decision Trees, and Naïve Bayes Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 185
• 186. Partial Decision Tree for Accelerate • (Figure: decision tree fragment rooted at Health (branches 1-3, 4-6, 7-9, 10), with internal tests on EnemySector, #EnemySector1, #EnemySector3, ClosestGoal, EnemyDistance, EnemyHealth, CurrentMove, and CurrentFace, and YES/NO leaves labeled with counts of positive and negative training examples) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 186
• 187. Results – Error Rates • (Charts: test-set error rate (0–45%) vs. training-set size (100–5000) for the Accelerate? and Move Direction outputs, comparing Baseline, ID3, Naïve Bayes, and ANN) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 187
• 188. Infrequent Reward • Problem: • If feedback comes only at the end of a long sequence of actions, it is hard to learn the utilities of early situations • Solution: • Provide intermediate rewards • Example: FPS deathmatch (see the sketch below) • +1 for hitting the enemy • -1 for getting hit by the enemy • +.5 for getting behind the enemy • +.4 for being in a place with good visibility but little exposure • Risks: • Learning to achieve the intermediate rewards instead of the final reward Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 188
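A sketch of how those intermediate rewards might be combined into a single shaping function for a deathmatch bot; the event names are hypothetical, the numeric values are the ones from the slide:

def shaped_reward(events):
    # Sum intermediate rewards so the learner gets feedback before the match ends.
    table = {
        "hit_enemy":       +1.0,
        "got_hit":         -1.0,
        "behind_enemy":    +0.5,
        "good_visibility": +0.4,   # good visibility but little exposure
    }
    return sum(table.get(e, 0.0) for e in events)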
• 189. Maze Learning • (Figure: maze world with rewards of +100 and +90) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 189
• 190. Many Related Actions • If you try to learn them all at once, learning is very slow • Train them up one at a time • See section 10.4 in AI Programming Wisdom 2, p. 596 Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 190
• 191. Dynamic World • Problem: • If the world or the reward changes suddenly, the system can't respond • Solutions: 1. Continual exploration to detect changes 2. If there are major changes, restart learning Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 191
• 192. Major Change in World • (Figure: grid of learned state utilities, ranging from roughly .80 to .99, illustrating the utility landscape before and after a major change in the world) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 192
• 193. Setting Parameters • Learning rate: α • If too high, might not converge (skips over the solution) • If too low, converges slowly • Lower it over time: k^n, e.g. .95^n = .95, .90, .86, .81, .77, … • For deterministic worlds and state transitions, .1-.2 works well • Discount factor: γ • Affects how "greedy" the agent is for short-term vs. long-term reward • .9-.95 is good for larger problems • Best-action selection probability: ε • Increase it as the game progresses so the agent takes advantage of what it has learned • ε = 1 − k^n Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 193
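A sketch of the decaying schedules above: the learning rate follows α_n = k^n while the probability of picking the current best action follows ε_n = 1 − k^n:

def schedules(n, k=0.95):
    # Learning rate decays toward 0; greedy-selection probability rises toward 1.
    alpha = k ** n              # .95, .90, .86, .81, .77, ... for n = 1, 2, 3, ...
    epsilon_best = 1 - k ** n
    return alpha, epsilon_best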
• 194. Analysis • Advantages: • Excellent for tuning parameters & control problems • Can handle noise • Can balance exploration vs. exploitation • Disadvantages: • Can be slow if the space of possible representations is large • Has trouble with changing concepts • Challenges: • Choosing the right approach: utility vs. action-value • Choosing the right features • Choosing the right function approximation (NN, DT, …) • Choosing the right learning parameters • Choosing the right reward function Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 194
• 195. References • John Manslow: Using Reinforcement Learning to Solve AI Control Problems, AI Programming Wisdom 2, p. 591 • Benjamin Geisler: An Empirical Study of Machine Learning Algorithms Applied to Modeling Player Behavior in a "First Person Shooter" Video Game, Master's Thesis, University of Wisconsin, 2002 Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 195
  • 196. Episodic Learning [Andrew Nuxoll] • What is it? • Not facts or procedures but memories of specific events • Recording and recalling of experiences with the world • Why study it? • No comprehensive computational models of episodic learning • No cognitive architectural models of episodic learning • If not architectural, interferes with other reasoning • Episodic learning will expand cognitive abilities • Personal history and identity • Memories that can be used for future decision making & learning • Necessary for reflection, debriefing, etc. • Without it we are trying to build crippled AI systems • Mother of all case-based reasoning problems. Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 196
  • 197. Characteristics of Episodic Memory 1. Architectural: • The mechanism is used for all tasks and does not compete with reasoning. 2. Automatic: • Memories are created without effort or deliberate rehearsal. 3. Autonoetic: • A retrieved memory is distinguished from current sensing. 4. Autobiographical: • The episode is remembered from own perspective. 5. Variable Duration: • The time period spanned by a memory is not fixed. 6. Temporally Indexed: • The rememberer has a sense of the time when the episode occurred. Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 197
• 198. Advantages of Episodic Memory • Improves AI behavior • Creates a personal history that impacts behavior • Knows what it has done – avoids repetition • Helps identify significant changes in the world • Compare the current situation to memory • Creates virtual sensors of previously seen aspects of the world • Helps explain behavior • History of the goals and subgoals it attempted • Provides the basis of a simple model of the environment • Supports other learning mechanisms Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 198
• 199. Why and Why Not Episodic Memory? • Advantages: • General capability that can be reused on many projects • Disadvantages: • Can be replaced with code customized for specific needs • Might be difficult to identify what to store • Might be costly in memory and retrieval Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 199
  • 200. Implementing Episodic Memory • Encoding • When is an episode stored? • What is stored and what is available for cuing retrieval? • Storage • How is it stored for efficient insertion and query? • Retrieval • What is used to cue the retrieval? • How is the retrieval efficiently performed? • What is retrieved? Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 200
• 201. Possible Approach • When to encode: • Every encounter between an NPC and the player • When an NPC goal/subgoal is achieved • What to store: • Where, when, what other entities were around, difficulty of achievement, objects that were used, … • Pointer to the next episode • Retrieve based on: • Time, goal, objects, place • Can create efficient hash- or tree-based retrieval (see the sketch below) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 201
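A sketch of that storage/retrieval approach: an episode record plus simple indexes keyed on goal and place (all class and field names are illustrative, not taken from the actual Soar implementation):

from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Episode:
    when: int                        # game time of the encounter
    where: str                       # place / room identifier
    goal: str                        # goal or subgoal that was active or achieved
    entities: tuple = ()             # other entities that were around
    objects: tuple = ()              # objects that were used
    difficulty: float = 0.0
    next_episode: "Episode" = None   # pointer to the next episode

class EpisodicMemory:
    def __init__(self):
        self.by_goal = defaultdict(list)
        self.by_place = defaultdict(list)

    def store(self, ep):             # call on each encounter or goal achievement
        self.by_goal[ep.goal].append(ep)
        self.by_place[ep.where].append(ep)

    def retrieve(self, goal=None, where=None):
        candidates = self.by_goal.get(goal, []) if goal else self.by_place.get(where, [])
        return max(candidates, key=lambda e: e.when, default=None)   # most recent match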
• 202. Soar Structure • (Diagram: long-term procedural memory holding production rules, connected through a rule matcher and decision procedure to short-term declarative memory; perception feeds in and actions flow out; episodic learning records into a separate episodic memory; a GUI and other modules attach alongside) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 202
• 203. Implementation Big Picture – Encoding: Initiation • (Diagram: production rules in long-term procedural memory operate over working memory, which has input, output, cue, and retrieved areas) • An episode is recorded when the agent takes an action Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 203
• 204. Implementation Big Picture – Encoding: Content • The entire working memory is stored in the episode Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 204
• 205. Implementation Big Picture – Storage: Episode Structure • Episodes are stored in a separate episodic memory Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 205
• 206. Implementation Big Picture – Retrieval: Initiation/Cue • The cue is placed in an architecture-specific buffer Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 206
• 207. Implementation Big Picture – Retrieval • The closest partial match is retrieved Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 207
  • 208. Storage of Episodes “Uber-Tree” Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 208
• 209. Alternative Approach • Observation: • Many items don't change from one episode to the next • Can reconstruct an episode from individual facts • Eliminates the costly episode structure • New representation: • For each item, store the ranges of episodes in which it exists • New match: • Trace through the Uber-tree with the cue to find all matching ranges • Compute a score for the merged ranges – pick the best • Reconstruct the episode by searching the Uber-tree with the episode number (see the sketch below) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 209
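A sketch of the range-based representation: each item records the ranges of episodes during which it held, and a cue is scored by how many of its items were present in each candidate episode (a simplification of the actual Uber-tree implementation):

class RangeStore:
    def __init__(self):
        self.ranges = {}                      # item -> list of (start, end) episode ranges

    def add(self, item, start, end):
        self.ranges.setdefault(item, []).append((start, end))

    def best_episode(self, cue_items):
        # Score each episode by how many cue items existed in it; return the best.
        scores = {}
        for item in cue_items:
            for start, end in self.ranges.get(item, []):
                for ep in range(start, end + 1):
                    scores[ep] = scores.get(ep, 0) + 1
        return max(scores, key=scores.get) if scores else None

# e.g. store.add("door-open", 5, 7); store.add("door-open", 80, 85)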
• 210. Storage • (Figure: an item in the Uber-tree annotated with the episode ranges 5–7 and 80–85 during which it was present) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 210
• 211. Retrieval • (Figure: a cue traced through the Uber-tree, matching the stored ranges 5–7 and 80–85) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 211
• 212. Cue Merge Activation • (Figure: worked example in which the ranges matched by each cue item are merged and an activation score is computed for each candidate episode; the highest-scoring episode is selected) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 212
• 213. Memory Usage • (Chart: memory allocated (bytes, 0–8,000,000) vs. decision cycles (0–70,000), comparing the old and new implementations) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 213
  • 214. Conclusion • Explore use of episodic memory as general capability • Inspired by psychology • Constrained by computation and memory Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 214
  • 215. Learning by Observation Michael van Lent Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 215
• 216. Background • Goal: learn rules to perform a task from watching an expert • Real-time interaction with the game (agent-based approach) • Learning both which goals to select & how to achieve them • AI agents require lots of knowledge • TacAir-Soar: 8000+ rules • Quake II agent: 800+ rules • Knowledge acquisition for these agents is expensive • 15 person-years for TacAir-Soar • Learning is a cheaper alternative? Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 216
• 217. Continuum of Approaches • (Diagram: a continuum from standard knowledge acquisition, which demands the most expert and programmer effort, through learning by observation, to unsupervised learning, which demands the most research effort) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 217
  • 218. The Big Picture • Problem • Task performance cast as classification • Feedback • Supervised learning • Knowledge Representation • Rules • Decision trees • Knowledge Source • Observations of an expert • Annotations Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 218
  • 219. Knowledge Representation • Rules encoding operators • Operator Hierarchy • Operator consists of: • Pre-conditions (potentially disjunctive) • Includes negated test for goal-achieved feature • Conditional Actions • Action attribute and value (pass-through action values) • Goal conditions (potentially disjunctive) • Create goal-achieved feature • Persistent and non-persistent goal-achieved features • Task and Domain parameters are widely used to generalize the learned knowledge Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 219
  • 220. Operator Conditions • Pre-conditions • Positive instance from each observed operator selection • Action conditions • Positive instance from each observed action performance • Recent-changes heuristic can be applied • Goal conditions • Positive instance from each observed operator termination • Recent-changes heuristic can be applied • Action attributes and values • Attribute taken directly from expert actions • Value can be constant or “pass-through” Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 220
• 221. KnoMic • (Diagram: the expert acts in ModSAF through an environmental interface; parameters & sensors, output commands, and annotations are captured by observation generation into observation traces; specific-to-general induction over the traces, together with operator classification, yields learned operator conditions, and production generation turns these into Soar productions for the Soar architecture) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 221
• 222. Observation Trace • At each time step record: • Sensor input changes (list of attributes and values) • Output commands (list of attributes and values) • Operator annotations (list of active operators)
# Add Sensor Input for Decision Cycle 2
set Add_Sensor_Input(2,0) [list observe io input-link vehicle radar-mode tws-man ]
set Add_Sensor_Input(2,1) [list observe io input-link vehicle elapsed-time value 5938 ]
set Add_Sensor_Input(2,3) [list observe io input-link vehicle altitude value 1 ]
# Remove Sensor Input for Decision Cycle 2
set Remove_Sensor_Input(2,0) [list observe io input-link vehicle radar-mode *unknown* ]
set Remove_Sensor_Input(2,1) [list observe io input-link vehicle elapsed-time value 0 ]
set Remove_Sensor_Input(2,3) [list observe io input-link vehicle altitude value 0 ]
# Expert Actions for Decision Cycle 3
set Expert_Action_List(3) [list [list mvl-load-weapon-bay station-1 ] ]
# Expert Goal Stack for Decision Cycle 3
set Expert_Goal_Stack(3) [list init-agent station-1 ]
Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 222
• 223. Racetrack & Intercept Behavior • (Diagram: operator hierarchies for the racetrack behavior – fly inbound leg, fly to waypoint, fly outbound leg, fly to waypoint – and for Intercept, whose Employ-weapons operator decomposes into Select-missile, Get-missile-lar, Achieve-proximity, Launch-missile (Lock-radar, Get-steering-circle, Fire-missile), Wait-for-missile-to-clear, and Support-missile) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 223
• 224. Learning Example • First selection of Fly-inbound-leg: • Radar Mode = TWS • Altitude = 20,102 • Compass = 52 • Wind Speed = 3 • Waypoint Direction = 52 • Waypoint Distance = 1,996 • Near Parameter = 2,000 • Initial pre-conditions: • Radar Mode = TWS • Altitude = 20,102 • Compass = 52 • Wind Speed = 3 • Waypoint Direction = 52 • Waypoint Distance = 1,996 • Compass == Waypoint Direction • Waypoint Distance < Near Parameter Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 224
• 225. Learning Example
First instance of Fly-inbound-leg: • Radar Mode = TWS • Altitude = 20,102 • Compass = 52 • Wind Speed = 3 • Waypoint Direction = 52 • Waypoint Distance = 1,996 • Near Parameter = 2,000
Second instance of Fly-inbound-leg: • Radar Mode = TWS • Altitude = 19,975 • Compass = 268 • Waypoint Direction = 270 • Waypoint Distance = 1,987 • Near Parameter = 2,000
Initial pre-conditions: • Radar Mode = TWS • Altitude = 20,102 • Compass = 52 • Wind Speed = 3 • Waypoint Direction = 52 • Waypoint Distance = 1,996 • Compass == Waypoint Direction • Waypoint Distance < Near Parameter
Revised pre-conditions: • Radar Mode = TWS • Altitude = 19,975 – 20,102 • Waypoint Distance = 1,987 – 1,996 • Compass == Waypoint Direction • Waypoint Distance < Near Parameter
Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 225
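The generalization step can be sketched as follows: conditions that still match are kept, numeric conditions are widened into ranges, and conditions that cannot be reconciled are dropped. This is a toy stand-in for KnoMic's specific-to-general induction and does not reproduce its relational conditions such as Compass == Waypoint Direction:

def generalize(preconds, instance):
    revised = {}
    for feature, cond in preconds.items():
        value = instance.get(feature)
        if isinstance(cond, tuple) and isinstance(value, (int, float)):
            lo, hi = cond
            revised[feature] = (min(lo, value), max(hi, value))   # widen numeric range
        elif value == cond:
            revised[feature] = cond                               # constant still matches
        # otherwise the condition is dropped
    return revised

# First instance seeds the conditions; numeric features start as degenerate ranges.
initial = {"radar_mode": "TWS", "altitude": (20102, 20102), "waypoint_distance": (1996, 1996)}
second = {"radar_mode": "TWS", "altitude": 19975, "waypoint_distance": 1987}
# generalize(initial, second) -> altitude (19975, 20102), waypoint_distance (1987, 1996)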
• 226. Results 2: Efficiency • (Chart: minutes spent encoding knowledge and learning the task (0–400) for KnoMic (10x), KnoMic, knowledge engineers KE 1–4, and a projected KE) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 226
  • 227. Evaluation • Pros • Observations are fairly easy to get • Suitable for online learning (learn after each session) • AI can learn to imitate players • Cons • Only more efficient for large rule sets? • Experts need to annotate the observation logs • Challenges • Identifying the right features • Making sure you have enough observations Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 227
  • 228. References • Learning Task Performance Knowledge by Observation • University of Michigan dissertation • Knowledge Capture Conference (K-CAP). • IJCAI Workshop on Modeling Others from Observation. • AI Game Programming Wisdom. Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 228
  • 229. Learning Player Models John Laird Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 229
• 230. Learning Player Model • Create an internal model of what the player might do • Allows the AI to adapt to the player's tactics & strategy • Tactics: • Player is usually found in rooms b, c, & f • Player prefers using the rocket launcher • Patterns of the player's moves • When they block, attack, retreat, use combinations of moves, etc. • Strategy: • Likelihood of the player attacking from a given direction • Enemy tends to concentrate on technology and defense vs. exploration and attack Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 230
• 231. Two Parts to Player Model • Representation of the player's behavior • Built up during play • Tactics that test the player model and generate AI behavior • There are multiple approaches for each of these Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 231
• 232. Simple Representation of Behavior • Predefine a set of traits • Always runs • Prefers dark rooms • Never blocks • Simply counting during game play doesn't track changes in style • Alternative: a limited horizon of past values • E.g., frequency of using attacks – ranged, melee, … • TraitValue = α × ObservedValue + (1 − α) × OldTraitValue (see the sketch below) • α = learning rate, which determines the influence of each observation Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 232
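A sketch of the trait update above, which is just an exponential moving average over observations:

class PlayerTraits:
    # Predefined traits tracked as exponential moving averages of observations.
    def __init__(self, trait_names, learning_rate=0.1):
        self.alpha = learning_rate                 # influence of each new observation
        self.values = {name: 0.0 for name in trait_names}

    def observe(self, trait, observed_value):
        old = self.values[trait]
        self.values[trait] = self.alpha * observed_value + (1 - self.alpha) * old

# traits = PlayerTraits(["uses_ranged_attack", "blocks", "prefers_dark_rooms"])
# traits.observe("uses_ranged_attack", 1.0)   # player just used a ranged attack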
• 233. Using Traits • Pick traits your AI tactics code can use (or create tactics that can use the traits you gather) • Tradeoff: level of detail vs. computation/complexity • E.g., "prefers dark rooms that have one entrance" • More specialized traits give better prediction, but are more complex and have less data behind them Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 233
• 234. Markov Decision Process (MDP) or N-Grams • Build up a probabilistic state-transition network that describes the player's behavior (see the sketch below) • (Diagram: example network whose transitions are labeled with observed action probabilities such as Punch .6, Punch .4, Kick .7, Block .6, Punch .7, Kick .4, Rest .3, Block .3) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 234
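A sketch of an N-gram version of this idea: count how often each player action follows the last N−1 observed actions and predict the most likely next one (illustrative names; not the specific network on the slide):

from collections import defaultdict, deque

class NGramModel:
    def __init__(self, n=2):
        self.history = deque(maxlen=n - 1)
        self.counts = defaultdict(lambda: defaultdict(int))   # context -> action -> count

    def observe(self, action):
        context = tuple(self.history)
        self.counts[context][action] += 1      # update transition statistics
        self.history.append(action)

    def predict(self):
        # Most likely next player action given the recent context (None if unseen).
        dist = self.counts.get(tuple(self.history))
        return max(dist, key=dist.get) if dist else None

# model = NGramModel(n=3)
# for a in ["punch", "punch", "kick"]: model.observe(a)
# model.predict() returns the most likely action after the last two observed moves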
  • 235. Other Models • Any decision making system: • Neural networks • Decision tree • Rule-based system • Train with situation/action pairs • Use AI’s behavior as model of opponent • Chess, checkers, … Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 235
• 236. Using Player Model • Test for values and provide a direct response • If the player is likely to kick, then block • If the player attacks very late, don't build defenses early on • Predict the player's behavior and search for the best response • Can use general look-ahead/mini-max/alpha-beta search over alternating opponent ("Him") and AI ("Me") moves • Doesn't work well with highly random games (Backgammon, Sorry) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 236
  • 237. Anticipation Dennis (Thresh) Fong: “Say my opponent walks into a room. I'm visualizing him walking in, picking up the weapon. On his way out, I'm waiting at the doorway and I fire a rocket two seconds before he even rounds the corner. A lot of people rely strictly on aim, but everybody has their bad aim days. So even if I'm having a bad day, I can still pull out a win. That's why I've never lost a tournament.” Newsweek, 11/21/99 Wayne Gretzky: “Some people skate to the puck. I skate to where the puck is going to be.” Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 237
  • 238. ? Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 238
  • 239. Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 239
  • 240. His Distance: 1 My Distance: 1 Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 240
  • 241. His Distance: 2 My Distance: 2 Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 241
  • 242. His Distance: 2 My Distance: 2 Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 242
  • 243. His Distance: 3 My Distance: 1 (but hall) Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 243
  • 244. His Distance: 4 My Distance: 0 Ambush! Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 244
  • 245. Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 245
  • 246. Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 246
  • 247. Adaptive Anticipation • Opponent might have different weapon preferences • Influences which weapons he pursues, which rooms he goes to • Gather data on opponent’s weapon preferences • Quakebot notices when opponent changes weapons • Use derived preferences for predicting opponent’s behavior • Dynamically modifies anticipation with experience Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 247
  • 248. References • Ryan Houlette: Player Modeling for Adaptive Games: AI Programming Wisdom 2, p. 557 • John Manslow: Learning and Adaptation: AI Programming Wisdom, p. 559 • Francois Laramee: Using N-Gram Statistical Models to Predict Play Behavior: AI Programming Wisdom, p. 596 • John Laird, It Knows What You're Going to Do: Adding Anticipation to a Quakebot. Agents 2001 Conference. Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 248
  • 249. Tutorial Overview I. Introduction to learning and games [.75 hour] {JEL} II. Overview of machine learning field [.75 hour] {MvL} III. Analysis of specific learning mechanisms [3 hours total] • Decision Trees [.5 hour] {MvL} • Neural Networks [.5 hour] {JEL} • Genetic Algorithms [.5 hour] {MvL} • Bayesian Networks [.5 hour] {MvL} • Reinforcement Learning [1 hour] {JEL} IV. Advanced Techniques [1 hour] • Episodic Memory [.3 hour] {JEL} • Behavior capture [.3 hour] {MvL} • Player modeling [.3 hour] {JEL} V. Questions and Discussion [.5 hour] {MvL & JEL} Laird & van Lent GDC 2005: AI Learning Techniques Tutorial Page 249