GENETIC PROGRAMMING FOR GENERATING PROTOTYPES IN CLASSIFICATION PROBLEMS
Presented by: Tarundeep Dhot, Dept of ECE, Concordia University

ACKNOWLEDGEMENTS
This presentation is based on a research paper by L.P. Cordella, C. De Stefano, F. Fontanella and A. Marcelli, published at the 2005 IEEE Congress on Evolutionary Computation. This presentation is solely meant for educational purposes.

SALIENT FEATURES
• A GP-based approach for generating prototypes in a classification problem is proposed.
• The set of prototypes is encoded by a multitree, i.e. a set of trees that together represent the chromosome.
• Chromosomes are of variable length, which allows classification in problems whose classes contain further subclasses.

WHAT IS CLASSIFICATION?
• Categorization: the act of distributing things into classes or categories of the same type.
• Classification problems undergo a supervised training phase: a set of labeled data samples, each indicating the class it belongs to, is provided to the system.
• Based on such data, unknown data is classified using rules, decision trees or mathematical functions.
Fig: Training Phase. Labeled data sets (Type X: X_1, X_2; Type Y: Y_1) are fed to the system / classifier.

PROPOSED APPROACH
• Provides a new GP-based method for determining the prototypes in a c-class problem, where c ≥ 2.
• The prototypes describing samples belonging to the c different classes consist of logical expressions.
• Each prototype represents a cluster of samples in the training set.
• Each prototype consists of a set of assertions (logical predicates) connected by Boolean operators. Each assertion places a condition on the value of a particular feature of a sample.
• A class can be represented by a variable number of logical expressions, and the number of predicates in each expression is also variable. The devised approach is therefore able to automatically find subclasses present in the data set.
Fig: Prototypes 1, 2 and 3 shown with sample groups X_1 X_2 X_3, Y_1 Y_2 Y_3 and Z_1 Z_2.

PROPOSED APPROACH (cont.)
• The set of prototypes describing the classes makes up a single individual of the evolving population.
• Each prototype is encoded as a derivation tree, so an individual is a multitree (a list of trees).
• Classification consists of attributing a sample to one of the classes, i.e. associating the sample with one of the prototypes.
• The fitness value of an individual is the recognition rate it obtains on the training set.
• Selection is based on the fitness values of individuals.
• At the end of the process, the best individual obtained constitutes the set of prototypes to be used for the considered application.

DESCRIPTION OF THE APPROACH
• The approach works with a set of prototypes.
• Each prototype characterizes a class or subclass.
• Each class or subclass consists of a set of logical expressions.
• Each expression may contain a variable number of predicates.
• Each predicate establishes a condition on the value of a particular feature.
• If all predicates of an expression are satisfied by the values in the feature vector describing a sample, we say the expression matches the sample.
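
The matching rule can be illustrated with a minimal sketch. It assumes predicates are simple threshold tests on one feature and that an expression is a plain conjunction of its predicates; the helper names greater_than, less_than and matches are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Sequence

# A predicate is a condition on one feature of a sample (assumed form).
Predicate = Callable[[Sequence[float]], bool]


def greater_than(feature_index: int, threshold: float) -> Predicate:
    """Hypothetical predicate: the feature value must exceed the threshold."""
    return lambda sample: sample[feature_index] > threshold


def less_than(feature_index: int, threshold: float) -> Predicate:
    """Hypothetical predicate: the feature value must be below the threshold."""
    return lambda sample: sample[feature_index] < threshold


def matches(expression: List[Predicate], sample: Sequence[float]) -> bool:
    """An expression matches a sample when every predicate is satisfied."""
    return all(predicate(sample) for predicate in expression)


# Expression with two predicates: feature 0 > 2.0 AND feature 1 < 5.0
expression = [greater_than(0, 2.0), less_than(1, 5.0)]
print(matches(expression, [3.0, 4.0]))  # True
print(matches(expression, [3.0, 6.0]))  # False
```

The number of predicates in an expression is simply the length of the list, which is the quantity used when several expressions match the same sample.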

DESCRIPTION OF THE APPROACH (cont.)
CLASSIFICATION: Given a data set and a set of labeled expressions, each sample of the data set is matched against the set of expressions and is either assigned to one of them or rejected. Different cases may occur:
• The sample matches ONE expression: it is assigned to it.
• The sample matches more than one expression with a DIFFERENT number of predicates: it is assigned to the expression with the SMALLEST number of predicates.
• The sample matches more than one expression with the SAME number of predicates and different labels: the sample is REJECTED.
• The sample matches NO expression: the sample is REJECTED.
• The sample matches more than one expression with equal label: the sample is assigned to the class the expressions belong to.
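
One possible reading of these cases is sketched below. The slide does not say how the cases combine, so the sketch assumes that only the matching expressions with the fewest predicates are considered and that the sample is rejected when those disagree on the label. LabeledExpression and classify are hypothetical names, and None stands for a rejected sample.

```python
from typing import Callable, List, NamedTuple, Optional, Sequence


class LabeledExpression(NamedTuple):
    label: str                                  # class label of the expression
    n_predicates: int                           # number of predicates it contains
    matches: Callable[[Sequence[float]], bool]  # True if it matches the sample


def classify(sample: Sequence[float],
             expressions: List[LabeledExpression]) -> Optional[str]:
    """Return the assigned class label, or None if the sample is rejected."""
    matched = [e for e in expressions if e.matches(sample)]
    if not matched:
        return None  # no expression matches: reject
    # Assumption: prefer the matching expressions with the fewest predicates.
    smallest = min(e.n_predicates for e in matched)
    labels = {e.label for e in matched if e.n_predicates == smallest}
    # A single label among them means assignment; conflicting labels mean reject.
    return labels.pop() if len(labels) == 1 else None


expressions = [
    LabeledExpression("X", 1, lambda s: s[0] > 0),
    LabeledExpression("Y", 2, lambda s: s[0] > 0 and s[1] > 0),
    LabeledExpression("Y", 1, lambda s: s[1] < -5),
]
print(classify([1, 1], expressions))    # X (fewest predicates wins)
print(classify([1, -6], expressions))   # None (same size, different labels)
print(classify([-1, 0], expressions))   # None (no match)
print(classify([-1, -6], expressions))  # Y (single match)
```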

IMPLEMENTATION OF EVOLUTIONARY APPROACH
• GP starts by randomly generating a population of p individuals.
• Phenotype: a string containing logical expressions, each one encoded as a derivation tree.
• Chromosome of an individual: a multitree.
• The number of trees in an individual's chromosome is referred to as the length of the individual.
• The length of each individual in the initial population lies in [2, L_max].
• To generate a new population, the best e individuals are selected and copied to the new generation (elitism).
• Tournament selection is used to choose the remaining (p - e)/2 couples. This controls loss of diversity and selection intensity.
• Crossover is applied to the selected couples with probability p_c.
• Mutation is applied to individuals with probability p_m.
• Finally, these individuals are added to the new population.
• The process is repeated for N_G generations.
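
The generational loop described above might be sketched as follows. This is a generic skeleton under stated assumptions, not the authors' implementation: individuals are toy bit lists standing in for multitrees, the tournament size k is an assumed extra parameter, and all function names (evolve, tournament, new_bits, one_point, flip) are hypothetical.

```python
import random


def tournament(population, fitnesses, k, rng):
    """Pick k random individuals and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]


def evolve(new_individual, fitness, crossover, mutate,
           p=20, e=2, p_c=0.8, p_m=0.1, n_generations=30, k=3, seed=0):
    """Elitism plus tournament selection of (p - e)/2 couples per generation."""
    if (p - e) % 2:
        raise ValueError("p - e must be even so couples refill the population")
    rng = random.Random(seed)
    population = [new_individual(rng) for _ in range(p)]
    for _ in range(n_generations):
        fitnesses = [fitness(ind) for ind in population]
        ranked = sorted(range(p), key=lambda i: fitnesses[i], reverse=True)
        next_population = [population[i] for i in ranked[:e]]  # elitism
        for _ in range((p - e) // 2):
            a = tournament(population, fitnesses, k, rng)
            b = tournament(population, fitnesses, k, rng)
            if rng.random() < p_c:       # crossover with probability p_c
                a, b = crossover(a, b, rng)
            # mutation with probability p_m, applied per offspring here
            next_population.extend(
                mutate(c, rng) if rng.random() < p_m else c for c in (a, b))
        population = next_population
    return max(population, key=fitness)


# Toy stand-ins (assumptions): bit lists instead of multitrees.
def new_bits(rng):
    return [rng.randint(0, 1) for _ in range(8)]


def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def flip(ind, rng):
    out = list(ind)                      # copy so elites are never altered
    out[rng.randrange(len(out))] ^= 1
    return out


best = evolve(new_bits, sum, one_point, flip)
print(best, sum(best))
```

Because the e best individuals are copied unchanged, the best fitness in the population cannot decrease from one generation to the next.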

LEARNING CLASSIFICATION RULES
Before implementing the proposed evolutionary paradigm, the following steps must be carried out:
• Structure Definition: definition of the structure to be evolved.
• Training Phase and Fitness Function: choice of the fitness function.
• Genetic Operators: definition of the genetic operators.

LEARNING CLASSIFICATION RULES (cont.)
STRUCTURE DEFINITION: The implementation requires a program generator that provides syntactically correct programs and an interpreter to execute them.
• PROGRAM GENERATOR: based on a grammar written for S-expressions. A grammar G is defined as a quadruple G = (T, N, S, P), where T and N are disjoint finite alphabets (terminal and non-terminal symbols), S is the starting symbol and P is the set of production rules used to define strings.
• INTERPRETER: implemented by an automaton that computes Boolean functions. The automaton accepts an expression as input and returns TRUE or FALSE depending on whether or not the expression matches the sample.
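
A generator and interpreter of this kind could look like the sketch below. The grammar shown in the docstring is an illustrative assumption (the slides do not list the production rules), S-expressions are represented as nested Python tuples, and generate and evaluate are hypothetical names.

```python
import random


def generate(rng, n_features, depth=2):
    """Derive a random expression from a hypothetical grammar:
         E -> (and E E) | (or E E) | (not E) | P
         P -> (> f c) | (< f c)      f: feature index, c: constant
    """
    if depth == 0 or rng.random() < 0.3:
        op = rng.choice([">", "<"])
        return (op, rng.randrange(n_features), round(rng.uniform(0, 10), 2))
    op = rng.choice(["and", "or", "not"])
    if op == "not":
        return (op, generate(rng, n_features, depth - 1))
    return (op,
            generate(rng, n_features, depth - 1),
            generate(rng, n_features, depth - 1))


def evaluate(expr, sample):
    """Interpreter: return True if the expression matches the sample."""
    op = expr[0]
    if op == "and":
        return evaluate(expr[1], sample) and evaluate(expr[2], sample)
    if op == "or":
        return evaluate(expr[1], sample) or evaluate(expr[2], sample)
    if op == "not":
        return not evaluate(expr[1], sample)
    if op == ">":
        return sample[expr[1]] > expr[2]
    if op == "<":
        return sample[expr[1]] < expr[2]
    raise ValueError(f"unknown operator: {op!r}")


expr = ("and", (">", 0, 2.0), ("not", ("<", 1, 1.0)))
print(evaluate(expr, [3.0, 4.0]))  # True
print(evaluate(expr, [3.0, 0.5]))  # False
print(generate(random.Random(1), n_features=2))
```

Generating expressions by expanding production rules means every tree is syntactically valid by construction, so the interpreter never has to handle malformed input from the generator.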

LEARNING CLASSIFICATION RULES (cont.)
TRAINING PHASE AND FITNESS FUNCTION:
• The system is trained with a set containing N_tr patterns. The training set is used for evaluating the fitness of the individuals in the population.
• Training set samples are assigned to the expressions belonging to each individual.
• Each valid expression of an individual is labeled with the class most widely represented in the corresponding cluster.
• The recognition rate of each individual is evaluated using a classifier. This rate is assigned as the fitness value of the individual.
• To favor individuals that obtain good performance with a smaller number of expressions, the fitness of each individual is increased by 0.1/N_c, where N_c is the number of expressions in the individual.
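
The fitness computation can be sketched as below, under simplifying assumptions: each training sample is assigned to the matching expression with the fewest predicates (the first one on ties), unmatched samples count as errors, and an expression is treated as valid when its cluster is non-empty. The function name fitness and the (n_predicates, match_function) representation are hypothetical.

```python
from collections import Counter


def fitness(expressions, samples, labels):
    """expressions: list of (n_predicates, match_function) pairs.

    Returns recognition rate on the training set plus the 0.1 / N_c bonus.
    """
    clusters = [Counter() for _ in expressions]
    for sample, label in zip(samples, labels):
        matched = [i for i, (_, match) in enumerate(expressions) if match(sample)]
        if matched:
            # Assumed tie-break: fewest predicates, then first in the list.
            chosen = min(matched, key=lambda i: expressions[i][0])
            clusters[chosen][label] += 1
    # Labeling each cluster with its majority class makes exactly the
    # majority-class samples of that cluster correctly recognized.
    correct = sum(max(cluster.values()) for cluster in clusters if cluster)
    recognition_rate = correct / len(samples)
    return recognition_rate + 0.1 / len(expressions)


samples = [[1.0], [2.0], [8.0], [9.0]]
labels = ["a", "a", "b", "a"]
individual = [(1, lambda s: s[0] < 5), (1, lambda s: s[0] >= 5)]
print(round(fitness(individual, samples, labels), 3))
```

In this toy case three of the four samples are recognized, so the recognition rate is 0.75, and with N_c = 2 the bonus is 0.1/2 = 0.05.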

LEARNING CLASSIFICATION RULES (cont.)
GENETIC OPERATORS: Crossover and mutation are used as the genetic operators.
Crossover: applied to two chromosomes C_1 and C_2, it yields two new chromosomes by swapping parts of the tree lists of the initial chromosomes.
Fig: (a) and (b) Chromosomes C_1 and C_2, of length 4 and 3. (c) and (d) Chromosomes obtained after crossover at t_1 = 2 and t_2 = 1.
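
The figure's example can be reproduced with a short sketch. It assumes the swap exchanges the tails of the two tree lists after the crossover points t_1 and t_2; the slide only says that lists are swapped, so this reading is an assumption, and the tree names are placeholders.

```python
def crossover(c1, c2, t1, t2):
    """Swap the tails of two tree lists after positions t1 and t2 (assumed)."""
    child1 = c1[:t1] + c2[t2:]
    child2 = c2[:t2] + c1[t1:]
    return child1, child2


# Chromosomes of length 4 and 3, crossover points t1 = 2 and t2 = 1.
c1 = ["A1", "A2", "A3", "A4"]
c2 = ["B1", "B2", "B3"]
child1, child2 = crossover(c1, c2, t1=2, t2=1)
print(child1)  # ['A1', 'A2', 'B2', 'B3']
print(child2)  # ['B1', 'A3', 'A4']
```

Because t_1 and t_2 are chosen independently, the offspring can differ in length from their parents, which is how the number of expressions per individual changes during evolution.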

LEARNING CLASSIFICATION RULES (cont.)
GENETIC OPERATORS:
Mutation: applied independently to every tree of the chromosome C with probability p_m. More specifically, the mutation operator is applied by randomly choosing a single non-terminal node T_i in a given tree T_s. If there are n non-terminal nodes in a chromosome C, the probability of mutating each single node of C is p_m / n.
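
The per-node probability can be illustrated with a small sketch that only selects which non-terminal nodes would be mutated; what replaces a selected node is not described on the slide and is left out. The function name nodes_to_mutate is hypothetical.

```python
import random


def nodes_to_mutate(n_nonterminals, p_m, rng):
    """Return indices of non-terminal nodes selected for mutation.

    Each of the n nodes is selected independently with probability p_m / n.
    """
    per_node = p_m / n_nonterminals
    return [i for i in range(n_nonterminals) if rng.random() < per_node]


rng = random.Random(0)
counts = [len(nodes_to_mutate(10, 0.3, rng)) for _ in range(20000)]
print(sum(counts) / len(counts))  # close to p_m = 0.3
```

With n nodes each selected with probability p_m / n, the expected number of mutated nodes per chromosome is n * (p_m / n) = p_m, regardless of how large the chromosome is.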

EXPERIMENTAL RESULTS
The proposed approach was tested on three standard databases (IRIS, BUPA and Vehicle, available on the UCI website) and compared with another GP-based approach. The proposed method achieved higher recognition rates on all three data sets.

Table: Average recognition rates (%). R_1: proposed classifier. R_2: comparison classifier.

Data Set    R_2      R_1
IRIS        98.67    99.4
BUPA        69.87    74.3
Vehicle     61.75    66.5

CONCLUSIONS
• A GP-based approach to prototype generation and classification is proposed.
• A population is evolved in which each individual consists of a set of possible prototypes of the classes present in the data to be analyzed.
• A prototype consists of a set of logical expressions.
• The recognition rate obtained using each set of prototypes is used as the fitness function controlling evolution.
• The method is able to automatically cluster data.
• The method offers greater flexibility than other methods because of its dynamic labeling mechanism for logical expressions.
• The comparison with another method shows better results for the proposed GP approach.
THANK YOU !!
