Introduction to Machine Learning

                                                            Jinhyuk Choi
Human-Computer Interaction Lab @ Information and Communications University
Contents
   Concepts of Machine Learning

   Multilayer Perceptrons

   Decision Trees

   Bayesian Networks
What is Machine Learning?
   Large storage / large amount of data

   Looks random but certain patterns
       Web log data
       Medical record
       Network optimization
       Bioinformatics
       Machine vision
       Speech recognition…

   No complete identification of the process
       A good or useful approximation
What is Machine Learning?
Definition
   Programming computers to optimize a
    performance criterion using example data or past
    experience

   Role of Statistics
       Inference from a sample
   Role of Computer science
       Efficient algorithms to solve the optimization problem
       Representing and evaluating the model for inference
   Descriptive (training) / predictive (generalization)
                              Learning from Human-generated data??
What is Machine Learning?
Concept Learning

• Inducing general functions from specific training examples (positive or
  negative)
• Looking for the hypothesis that best fits the training examples

   Objects                                          Concept
   eyes, nose, legs, reproductive ability,          Bird
   wings, beak, feathers, …                         boolean function:
   inanimate objects, …                               Bird(animal) → “true” or “false”




• Concepts:
- describing some subset of objects or events defined over a larger set
    - a boolean-valued function
What is Machine Learning?
Concept Learning

   Inferring a boolean-valued function from training examples of its input and
    output

                                   Hypothesis 1


                                   Hypothesis 2




                                     Concept
                                                        Web log data
                                                        Medical record
                                                        Network optimization
                               Positive examples        Bioinformatics
                               Negative examples        Machine vision
                                                        Speech recognition…
What is Machine Learning?
Learning Problem Design

   Do you enjoy sports ?
     Learn to predict the value of “EnjoySports” for an arbitrary day, based on
      the value of its other attributes




   What problem?
     Why learning?
   Attributes selection
     Effective?
     Enough?
   What learning algorithm?
Applications
   Learning associations
   Classification
   Regression
   Unsupervised learning
   Reinforcement learning
Examples (1)

   TV program preference inference based on web usage data


      Web page #1                                   TV Program #1
      Web page #2                                   TV Program #2
      Web page #3                 Classifier        TV Program #3
      Web page #4       1                       2   TV Program #4
          ….                                             ….


                                      3


     What are we supposed to do at each step?
Examples (2)
  from a HW of Neural Networks Class (KAIST-2002)

     Function approximation (Mexican hat)

        f_3(x_1, x_2) = sin( 2π √(x_1^2 + x_2^2) ),    x_1, x_2 ∈ [−1, 1]
Examples (3)
from a HW of Machine Learning Class (ICU-2006)

   Face image classification
Examples (4)
from a HW of Machine Learning Class (ICU-2006)
Examples (5)
from a HW of Machine Learning Class (ICU-2006)

   Sensay
Examples (6)




A. Krause et al., “Unsupervised, Dynamic Identification of Physiological and Activity Context in Wearable
Computing”, ISWC 2005
#1. Multilayer Perceptrons
Neural Network?




                  VS.   Adaline
                        MLP
                        SOM
                        Hopfield network
                        RBFN
                        Bifurcating neuron networks
                        …
Multilayer Networks of Sigmoid Units




                             • Supervised learning
                             • 2-layer
                             • Fully connected




                      Really looks like the brain??
Sigmoid Unit
The back-propagation algorithm
  Network model

      Input layer                  hidden layer                  output layer

          x_i        --- v_ji --->     y_j        --- w_kj --->      o_k

                     y_j = s( Σ_i v_ji x_i )          o_k = s( Σ_j w_kj y_j )

  Error function:    E(v, w) = (1/2) Σ_k ( t_k − o_k )^2

      Stochastic gradient descent
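
To make this notation concrete, here is a minimal MATLAB/Octave sketch (not from the slides) of one forward pass and the error for a single training case; the layer sizes, input, target, and random weights are illustrative assumptions.

   % One forward pass of a 2-layer sigmoid network (illustrative sizes and values)
   s = @(a) 1 ./ (1 + exp(-a));      % sigmoid unit
   x = rand(3, 1);                   % inputs x_i            (3 input units)
   t = [1; 0];                       % targets t_k           (2 output units)
   v = randn(4, 3);                  % input-to-hidden weights v_ji (4 hidden units)
   w = randn(2, 4);                  % hidden-to-output weights w_kj
   y = s(v * x);                     % hidden activations  y_j = s( sum_i v_ji x_i )
   o = s(w * y);                     % outputs             o_k = s( sum_j w_kj y_j )
   E = 0.5 * sum((t - o).^2)         % error  E(v,w) = 1/2 * sum_k (t_k - o_k)^2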
Gradient-Descent Function Minimization
Gradient-descent function minimization

 In order to find a vector parameter x that minimizes a function f(x) …
     Start with a random initial value x = x_0.
     Determine the direction of steepest descent in the parameter space:

        ∇f = ( ∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n )

     Move a step in that direction:

        x_(i+1) = x_i − η ∇f

     Repeat the above two steps until x no longer changes.


 For gradient-descent to work…
     The function to be minimized should be continuous.
     The function should not have too many local minima.
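
A minimal gradient-descent sketch in MATLAB/Octave (not from the slides); the quadratic test function, random starting point, and learning rate η = 0.1 are arbitrary choices.

   % Gradient descent on a simple two-variable function (illustrative choices)
   f     = @(x) (x(1) - 2)^2 + 3*(x(2) + 1)^2;    % function to minimize
   gradf = @(x) [2*(x(1) - 2); 6*(x(2) + 1)];     % its gradient (df/dx_1, df/dx_2)
   x   = randn(2, 1);                             % random initial value x_0
   eta = 0.1;                                     % learning rate (step size)
   for i = 1:1000
       step = eta * gradf(x);
       x    = x - step;                           % x_(i+1) = x_i - eta * grad f(x_i)
       if norm(step) < 1e-6, break; end           % stop when x no longer changes
   end
   f(x)                                           % value at the minimum found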
Back-propagation
Derivation of back-propagation algorithm

Adjustment of w_kj :

    ∂E/∂w_kj = ∂/∂w_kj [ (1/2) Σ_k (t_k − o_k)^2 ]
             = (1/2) ∂/∂w_kj ( t_k − s( Σ_j w_kj y_j ) )^2
             = (1/2) · [ −2 (t_k − o_k) ] · o_k (1 − o_k) · y_j
             = − y_j o_k (1 − o_k) (t_k − o_k)

    Δw_kj = − η ∂E/∂w_kj = η o_k (1 − o_k)(t_k − o_k) y_j = η δ_k^o y_j ,
            where δ_k^o ≡ o_k (1 − o_k)(t_k − o_k)
Derivation of back-propagation algorithm
   Adjustment of v_ji :

    ∂E/∂v_ji = ∂/∂v_ji [ (1/2) Σ_k (t_k − o_k)^2 ]
             = (1/2) Σ_k ∂/∂v_ji ( t_k − s( Σ_j w_kj y_j ) )^2
             = (1/2) Σ_k ∂/∂v_ji ( t_k − s( Σ_j w_kj s( Σ_i v_ji x_i ) ) )^2
             = − x_i y_j (1 − y_j) Σ_k w_kj o_k (1 − o_k)(t_k − o_k)

    Δv_ji = − η ∂E/∂v_ji = η y_j (1 − y_j) [ Σ_k w_kj o_k (1 − o_k)(t_k − o_k) ] x_i
          = η y_j (1 − y_j) [ Σ_k w_kj δ_k^o ] x_i = η δ_j^y x_i ,
            where δ_j^y ≡ y_j (1 − y_j) Σ_k w_kj δ_k^o
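
A minimal MATLAB/Octave sketch of these two update rules for one training case, reusing the notation above; the sizes, data, and learning rate are illustrative assumptions.

   % One stochastic gradient-descent step of back-propagation (illustrative data)
   s = @(a) 1 ./ (1 + exp(-a));
   eta = 0.5;                                % learning rate
   x = rand(3, 1);  t = [1; 0];              % one training case
   v = randn(4, 3); w = randn(2, 4);         % current weights
   y = s(v * x);    o = s(w * y);            % forward pass
   delta_o = o .* (1 - o) .* (t - o);        % delta_k^o = o_k (1 - o_k)(t_k - o_k)
   delta_y = y .* (1 - y) .* (w' * delta_o); % delta_j^y = y_j (1 - y_j) sum_k w_kj delta_k^o
   w = w + eta * delta_o * y';               % Delta w_kj = eta * delta_k^o * y_j
   v = v + eta * delta_y * x';               % Delta v_ji = eta * delta_j^y * x_i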
Backpropagation
Batch learning vs. Incremental learning




Batch standard backprop proceeds as follows:
  Initialize the weights W.
  Repeat the following steps:
    Process all the training data D_L to compute the gradient
       of the average error function AQ(D_L, W).
    Update the weights by subtracting the gradient times the learning rate.

Incremental standard backprop can be done as follows:
  Initialize the weights W.
  Repeat the following steps for j = 1 to N_L:
    Process one training case (y_j, X_j) to compute the gradient
       of the error (loss) function Q(y_j, X_j, W).
    Update the weights by subtracting the gradient times the learning rate.
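
The contrast can be sketched on a toy model; everything below (the linear model, data, and learning rate) is an illustrative assumption, not part of the slides.

   % Batch vs. incremental updates on a toy linear model with squared loss
   X = rand(2, 20); y = rand(1, 20);                    % 20 training cases
   grad_Q = @(W, xj, yj) (W*xj - yj) * xj';             % gradient of Q = 1/2 (W*xj - yj)^2
   W = zeros(1, 2); eta = 0.1; NL = size(X, 2);

   % Batch: accumulate the gradient over all of D_L, then update once
   G = zeros(size(W));
   for j = 1:NL, G = G + grad_Q(W, X(:,j), y(j)); end
   W = W - eta * (G / NL);                              % gradient of the average error AQ(D_L, W)

   % Incremental: update after each training case (y_j, X_j)
   for j = 1:NL
       W = W - eta * grad_Q(W, X(:,j), y(j));
   end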
Training
Overfitting
#2. Decision Trees
Introduction
                  Divide & conquer

                  Hierarchical model

                  Sequence of
                   recursive splits

                  Decision node vs.
                   leaf node

                  Advantage
                      Interpretability
                          IF-THEN rules
Divide and Conquer
   Internal decision nodes
       Univariate: Uses a single attribute, xi
           Numeric xi : Binary split : xi > wm
           Discrete xi : n-way split for n possible values
       Multivariate: Uses all attributes, x

   Leaves
       Classification: Class labels, or proportions
        Regression: Numeric; the average of the r values reaching the leaf, or a local fit

   Learning
       Construction of the tree using training examples
       Looking for the simplest tree among the trees that code the training
        data without error
           Based on heuristics
           NP-complete
           “Greedy”; find the best split recursively (Breiman et al, 1984; Quinlan, 1986, 1993)
Classification Trees

   Split is main procedure for tree
    construction
       By impurity measure

   For node m, N_m instances reach m, and N_m^i of them belong to class C_i

       P(C_i | x, m) is estimated by p_m^i = N_m^i / N_m                  To be pure!!!

   Node m is pure if p_m^i is 0 or 1

   Measure of impurity is entropy:   I_m = − Σ_{i=1}^{K} p_m^i log2 p_m^i
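
A small MATLAB/Octave helper for this impurity measure, computed from the per-class counts at a node; the example counts are made up (the [9,5] case matches the entropy example later in the deck).

   % Entropy impurity of a node from per-class counts N_m^i
   impurity = @(c) -sum((c/sum(c)) .* log2(c/sum(c) + (c == 0)));   % 0*log2(0) treated as 0
   impurity([7 0])     % pure node            -> 0
   impurity([5 5])     % evenly mixed node    -> 1
   impurity([9 5])     % the [9,5] example    -> 0.940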
Representation




   Each node specifies a test of some attribute of the instance

   Each branch correspond to one of the possible values for this
    attribute
Best Split
   If node m is pure, generate a leaf and stop, otherwise split
    and continue recursively

   Impurity after split: N_mj of the N_m instances take branch j, and N_mj^i of them belong to C_i

       P(C_i | x, m, j) is estimated by p_mj^i = N_mj^i / N_mj

       I'_m = − Σ_{j=1}^{n} (N_mj / N_m) Σ_{i=1}^{K} p_mj^i log2 p_mj^i




   Find the variable and split that minimize impurity (among all
    variables, and among split positions for numeric variables)
Q) “Which attribute should be tested at the root of the tree?”
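
Reusing the entropy helper from the sketch above, the impurity after a candidate split is the branch-size-weighted sum of branch impurities; a minimal sketch with made-up branch counts.

   % Impurity after a split: weighted sum of branch entropies (example counts)
   impurity = @(c) -sum((c/sum(c)) .* log2(c/sum(c) + (c == 0)));
   branches = {[3 2], [4 0], [2 3]};            % per-branch class counts N_mj^i (assumed)
   Nm = sum(cellfun(@sum, branches));           % N_m = instances reaching node m
   Im_split = 0;
   for j = 1:numel(branches)
       Nmj = sum(branches{j});
       Im_split = Im_split + (Nmj/Nm) * impurity(branches{j});   % (N_mj/N_m) * I_mj
   end
   Im_split                                     % weighted impurity after the split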
Top-Down Induction of Decision Trees
Entropy
   “Measure of uncertainty”
   “Expected number of bits to resolve uncertainty”

   Suppose Pr{X = 0} = 1/8
     If the other events are equally likely, the number of events is 8. To indicate
      one out of so many events, one needs lg 8 = 3 bits.
   Consider a binary random variable X s.t. Pr{X = 0} = 0.1.

       The expected number of bits:   0.1 lg(1/0.1) + (1 − 0.1) lg( 1/(1 − 0.1) )

   In general, if a random variable X has c values with probabilities p_1, …, p_c:

       The expected number of bits:   H = Σ_{i=1}^{c} p_i lg(1/p_i) = − Σ_{i=1}^{c} p_i lg p_i
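
For the Pr{X = 0} = 0.1 case, a one-line MATLAB/Octave check (the slide leaves the value implicit; it comes out to about 0.47 bits).

   p = 0.1;  H = p*log2(1/p) + (1 - p)*log2(1/(1 - p))   % expected number of bits, ~0.469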
Entropy
Example

   14 examples
                  Entropy([9,5])
                   (9 /14) log 2 (9 /14)  (5 /14) log 2 (5 /14)  0.940

         Entropy = 0 : all members belong to one class (all positive or all negative)
         Entropy = 1 : equal numbers of positive & negative
         0 < Entropy < 1 : unequal numbers of positive & negative
Information Gain

   Measures the expected reduction in entropy caused by partitioning
    the examples
Information Gain
 ICU-Student tree, candidate split:

                 Gender
          Male            Female
           IQ             Height

 Root:            # of samples = 100, # of positive samples = 50, Entropy = 1
 Left (Male):     # of samples = 50,  # of positive samples = 40, Entropy = 0.72
 Right (Female):  # of samples = 50,  # of positive samples = 10, Entropy = 0.72
 On average:      Entropy = 0.5 * 0.72 + 0.5 * 0.72 = 0.72
 Reduction in entropy = 1 − 0.72 = 0.28  →  Information gain
Training Examples
Selecting the Next Attribute
Partially learned tree
Hypothesis Space Search
   Hypothesis space: the set of
    all possible decision trees

   DT is guided by information
    gain measure.




     Occam’s razor ??
Overfitting




•   Why “over”-fitting?
    – A model can become more complex than the true target
      function (concept) when it tries to satisfy noisy data as well
Avoiding over-fitting the data
   Two classes of approaches to avoid overfitting
       Stop growing the tree earlier.
       Post-prune the tree after overfitting

   Ok, but how to determine the optimal size of a tree?
       Use validation examples to evaluate the effect of pruning (stopping)
       Use a statistical test to estimate the effect of pruning (stopping)
       Use a measure of complexity for encoding decision tree.


   Approaches based on the second strategy (post-pruning)
       Reduced-error pruning
       Rule post-pruning
Rule Extraction from Trees

C4.5Rules
(Quinlan, 1993)
#3. Bayesian Networks
Bayes’ Rule
Introduction


                          posterior = prior × likelihood / evidence

                    P(C | x) = P(C) p(x | C) / p(x)

 P(C = 0) + P(C = 1) = 1
 p(x) = p(x | C = 1) P(C = 1) + p(x | C = 0) P(C = 0)
 P(C = 0 | x) + P(C = 1 | x) = 1
Bayes’ Rule: K>2 Classes
Introduction

            P(C_i | x) = p(x | C_i) P(C_i) / p(x)
                       = p(x | C_i) P(C_i) / Σ_{k=1}^{K} p(x | C_k) P(C_k)

  P(C_i) ≥ 0  and  Σ_{i=1}^{K} P(C_i) = 1

 choose C_i if P(C_i | x) = max_k P(C_k | x)
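
A minimal MATLAB/Octave sketch of this decision rule; the likelihoods and priors below are made-up numbers.

   % Pick the class with the largest posterior (illustrative numbers)
   like  = [0.2 0.5 0.1];                        % p(x | C_i) for i = 1..K
   prior = [0.3 0.3 0.4];                        % P(C_i)
   post  = like .* prior / sum(like .* prior)    % P(C_i | x) by Bayes' rule
   [~, best] = max(post)                         % choose C_i if P(C_i|x) = max_k P(C_k|x)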
Bayesian Networks
Introduction

   Graphical models, probabilistic networks
       causality and influence

   Nodes are hypotheses (random vars) and the prob corresponds to our
    belief in the truth of the hypothesis

   Arcs are direct influences between hypotheses

   The structure is represented as a directed acyclic graph (DAG)
       Representation of the dependencies among random variables

   The parameters are the conditional probs in the arcs


        a small set of probabilities,                    all possible
        relating only neighboring           B.N.         combinations of
        nodes                                            circumstances
Bayesian Networks
Introduction




   Learning
       Inducing a graph
           From prior knowledge
           From structure learning
       Estimating parameters
           EM
   Inference
       Beliefs from evidences
           Especially among the nodes not directly connected
Structure
Introduction

   Initial configuration of BN
       Root nodes
         Prior probabilities
       Non-root nodes
         Conditional probabilities given all possible combinations of direct
          predecessors


                     P(a)                                P(b)
                            A                    B
            P(c|a)
                        C                D       P(d|a,b), P(d|a,¬b), P(d|¬a,b), P(d|¬a,¬b)
            P(c|¬a)
                                P(e|d)       E
                                P(e|¬d)
Causes and Bayes’ Rule
  Introduction




                           Diagnostic inference:
              diagnostic   Knowing that the grass is wet,
                           what is the probability that rain is
 causal                    the cause?

                           P(R | W) = P(W | R) P(R) / P(W)
                                    = P(W | R) P(R) / [ P(W | R) P(R) + P(W | ~R) P(~R) ]
                                    = (0.9 × 0.4) / (0.9 × 0.4 + 0.2 × 0.6) = 0.75
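
A one-line MATLAB/Octave check of this number:

   P_R_given_W = (0.9*0.4) / (0.9*0.4 + 0.2*0.6)   % = 0.75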
Causal vs Diagnostic Inference
Introduction


                                   Causal inference: If the
                                   sprinkler is on, what is the
                                   probability that the grass is wet?

                                   P(W|S) = P(W|R,S) P(R|S) +
                                           P(W|~R,S) P(~R|S)
                                    = P(W|R,S) P(R) +
                                           P(W|~R,S) P(~R)
                                    = 0.95*0.4 + 0.9*0.6 = 0.92


 Diagnostic inference: If the grass is wet, what is the probability
 that the sprinkler is on?  P(S|W) = 0.35 > P(S) = 0.2
 P(S|R,W) = 0.21
 Explaining away: Knowing that it has rained
         decreases the probability that the sprinkler is on.
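
And a check of the causal inference above:

   P_W_given_S = 0.95*0.4 + 0.9*0.6   % = 0.92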
Bayesian Networks: Causes
Introduction


                    Causal inference:
                    P(W|C) = P(W|R,S) P(R,S|C) +
                           P(W|~R,S) P(~R,S|C) +
                           P(W|R,~S) P(R,~S|C) +
                           P(W|~R,~S) P(~R,~S|C)

                    and use the fact that
                     P(R,S|C) = P(R|C) P(S|C)

                           Diagnostic: P(C|W ) = ?
Bayesian Nets: Local structure
Introduction




                                              P(F | C) = ?

       P(X_1, …, X_d) = Π_{i=1}^{d} P( X_i | parents(X_i) )
Bayesian Networks: Inference
Introduction


   P (C,S,R,W,F ) = P (C ) P (S |C ) P (R |C ) P (W |R,S ) P (F |R )

   P (C,F ) = ∑S ∑R ∑W P (C,S,R,W,F )

   P (F |C) = P (C,F ) / P(C )   Not efficient!


   Belief propagation (Pearl, 1988)
   Junction trees (Lauritzen and Spiegelhalter, 1988)
       Independence assumption
Inference
Evidence & Belief Propagation
   Evidence – values of observed nodes
       V3 = T, V6 = 3
   Our belief in what the value of Vi 'should' be changes.
   This belief is propagated.

                          V1
                     V3        V2
                          V4
                     V5        V6

    As if the CPTs became:

        P(V3):  V3=T   1.0             P(V6|V2):        V2=T   V2=F
                V3=F   0.0                       V6=1   0.0    0.0
                                                 V6=2   0.0    0.0
                                                 V6=3   1.0    1.0
Belief Propagation
                                        Bayes Law:   P(A | B) = P(B | A) P(A) / P(B)

             “Causal” (π) message                         “Diagnostic” (λ) message
     Going down an arrow: sum out the parent        Going up an arrow: Bayes Law,
                                                    with a 1/α normalizing constant

                                              * some figures from: Peter Lucas BN lecture course
The π Messages

• What are the π messages?
• For simplicity, let the nodes be binary

        V1       P(V1):   V1=T   0.8       The message passes on information.
                          V1=F   0.2       What information? Observe:

                                           P(V2) = P(V2 | V1=T) P(V1=T)
                                                 + P(V2 | V1=F) P(V1=F)
        V2       P(V2|V1):      V1=T  V1=F
                         V2=T   0.4   0.9     The information needed is the distribution
                         V2=F   0.6   0.1     of V1, i.e. π(V1)

                  π messages capture information passed from parent to child
The λ Messages

• We know what the π messages are
• What about λ?
                 Assume E = { V2 } and compute by Bayes' rule:
       V1
                 P(V1 | V2) = P(V1) P(V2 | V1) / P(V2) = α P(V1) P(V2 | V1)

                 The information not available at V1 is P(V2 | V1). It is
       V2        passed upwards by a λ-message. Again, this is not in general
                 exactly the CPT, but the belief based on evidence down the tree.
Belief Propagation

       U1                                          U2

                 π(U1)                    π(U2)
                 λ(U1)                    λ(U2)

                             V

                 π(V1)                    π(V2)
                 λ(V1)                    λ(V2)

       V1                                          V2
Evidence & Belief

                    V1           Evidence



    Belief    V3         V2



                    V4



              V5         V6

   Evidence

                    Works for classification ??
Naive Bayes’ Classifier




    Given C, xj are independent:

          p(x|C) = p(x1|C) p(x2|C) ... p(xd|C)
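
A minimal MATLAB/Octave sketch of the resulting classifier for binary features; the class priors and per-feature likelihoods below are made-up numbers, not from the slides.

   % Naive Bayes with two classes and three binary features (illustrative numbers)
   prior = [0.6 0.4];              % P(C = 1), P(C = 2)
   like  = [0.8 0.3;               % like(j, c) = p(x_j = 1 | C = c)
            0.5 0.9;
            0.1 0.4];
   x = [1 0 1];                    % observed feature vector
   post = prior;
   for j = 1:numel(x)              % p(x|C) = p(x1|C) p(x2|C) ... p(xd|C)
       if x(j) == 1, post = post .* like(j, :);
       else          post = post .* (1 - like(j, :));
       end
   end
   post = post / sum(post)         % P(C | x)
   [~, chosen] = max(post)         % classify x as the most probable class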
Application Procedures
For classification
   MLP
       Data collection & Pre-processing (Training data / Test data)
       Decision node selection (output node)
       Network training
       Generalization
       Parameter tuning & Pruning
       Final network
   Decision Trees
       Data collection & Pre-processing (Training data / Test data)
       Decision attribute selection
       Tree construction
       Pruning
       Final tree
   Bayesian Networks
       Data collection & Pre-processing (Training data / Test data)
       Structure configuration
             Prior knowledge
       Parameter learning
       Decision node selection
       Inference (classification)
             Evidence & belief
       Final network
Simulation
   Simulation Packages
       WEKA (JAVA)
           http://www.cs.waikato.ac.nz/ml/weka/
       FullBNT (MATLAB)
           http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
       MSBNx
           http://research.microsoft.com/msbn/
       MATLAB Neural Networks Toolbox
           http://www.mathworks.com/products/neuralnet/
       C4.5
           http://www.rulequest.com/Personal/
WEKA
FullBNT
   clear all

   N = 4;                              % number of nodes
   dag = zeros(N,N);                   % empty adjacency matrix for the network structure
   C = 1; S = 2; R = 3; W = 4;         % name each node
   dag(C,[R S]) = 1;                   % specify the network structure
   dag(R,W) = 1;
   dag(S,W) = 1;

   %discrete_nodes = 1:N;
   node_sizes = 2*ones(1,N);           % number of values each node can take
   %node_sizes = [4 2 3 5];
   %onodes = [];
   %bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);

   bnet = mk_bnet(dag, node_sizes, 'names', {'C','S','R','W'}, 'discrete', 1:4);
   %C = bnet.names('cloudy'); % bnet.names is an associative array
   %bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);

   %%%%%% Specified Parameters
   %bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
   %bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
   %bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
   %bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
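
One possible continuation, running an exact inference on this network; this is a sketch assuming the standard BNT calls (jtree_inf_engine, enter_evidence, marginal_nodes) and the CPD values from the commented lines above, so check it against the BNT documentation.

   %%%%%% Inference sketch (CPDs taken from the commented lines above)
   bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
   bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
   bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
   bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);

   engine = jtree_inf_engine(bnet);        % junction-tree inference engine
   evidence = cell(1, N);
   evidence{W} = 2;                        % observe W = true (state 2)
   engine = enter_evidence(engine, evidence);
   marg = marginal_nodes(engine, S);       % marginal P(S | W = true)
   marg.T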
MSBNx
References
   Textbooks
       Ethem ALPAYDIN, Introduction to Machine Learning, The MIT Press, 2004
       Tom Mitchell, Machine Learning, McGraw Hill, 1997
       Neapolitan, R.E., Learning Bayesian Networks, Prentice Hall, 2003

   Materials
       Serafín Moral, Learning Bayesian Networks, University of Granada, Spain
       Zheng Rong Yang, Connectionism, Exeter University
       KyuTae Cho, Jeong Ki Yoo, HeeJin Lee, Uncertainty in AI, Probabilistic reasoning,
        Especially for Bayesian Networks
       Gary Bradski, Sebastian Thrun, Bayesian Networks in Computer Vision, Stanford
        University

   Recommended Textbooks
       Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
       J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1992
       Haykin, Simon S., Neural networks : a comprehensive foundation, Prentice Hall, 1999
       Jensen, Finn V., Bayesian networks and decision graphs, Springer, 2007
