Learning Sets of Rules
SEQUENTIAL COVERING ALGORITHMS
LEARNING RULE SETS
LEARNING FIRST ORDER RULES
LEARNING SETS OF FIRST ORDER RULES
Learning Rules
• One of the most expressive and human-readable representations for learned hypotheses (target functions) is a set of production rules (if-then rules).
• Rules can be derived from other representations:
  • Decision trees
  • Genetic algorithms
• or they can be learned directly (direct methods).
• An important aspect of direct rule-learning algorithms is that they can learn sets of first-order rules, which have much more representational power than the propositional rules that can be derived from decision trees.
Propositional versus First-Order Logic
• Propositional logic does not include variables and thus cannot express general relations among the values of the attributes.
• Example 1: in propositional logic, you can write:
  • IF (Father1=Bob) ^ (Name2=Bob) ^ (Female1=True) THEN Daughter1,2=True.
  This rule applies only to a specific family!
• Example 2: in first-order logic, you can write:
  • IF Father(y,x) ^ Female(y) THEN Daughter(x,y)
  This rule (which you cannot write in propositional logic) applies to any family!
Sequential Covering Algorithms
• A sequential covering algorithm (SCA) is any of a family of algorithms that learn a rule set by learning one rule, removing the data it covers, and iterating the process on the remaining examples.
• Example (explains why it is called sequential covering):
  • Assume a subroutine Learn-one-rule
  • Input: a set of positive and negative examples
  • Expected output: a rule that covers many positive and few negative examples
  • Requirement: the output rule should have high accuracy; it need not have high coverage
• Approach:
  • Learn a rule and remove the positive examples it covers
  • Invoke the subroutine again to learn a second rule from the remaining positive examples
  • Iterate as many times as required, accumulating the rules as a disjunctive set of rules
Learning Propositional Rules: Sequential Covering Algorithms
Sequential-Covering(Target_attribute, Attributes, Examples, Threshold)
• Learned_rules ← { }
• Rule ← Learn-one-rule(Target_attribute, Attributes, Examples)
• While Performance(Rule, Examples) > Threshold, do
  • Learned_rules ← Learned_rules + Rule
  • Examples ← Examples − {examples correctly classified by Rule}
  • Rule ← Learn-one-rule(Target_attribute, Attributes, Examples)
• Learned_rules ← sort Learned_rules according to Performance over Examples
• Return Learned_rules
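To make the control flow concrete, here is a minimal Python sketch of the Sequential-Covering pseudocode above. It is a sketch under stated assumptions, not the canonical implementation: Learn-one-rule and Performance are passed in as placeholder functions (they are described on later slides), examples are assumed to be dictionaries of attribute values, and a rule is assumed to be a (preconditions, prediction) pair.

```python
# Minimal sketch of Sequential-Covering. Assumptions: examples are dicts of
# attribute values, a rule is (preconditions_dict, predicted_value), and
# learn_one_rule / performance are supplied by the caller.

def covers_correctly(rule, example, target_attribute):
    preconditions, prediction = rule
    matches = all(example.get(a) == v for a, v in preconditions.items())
    return matches and example.get(target_attribute) == prediction

def sequential_covering(target_attribute, attributes, examples,
                        learn_one_rule, performance, threshold):
    learned_rules = []
    rule = learn_one_rule(target_attribute, attributes, examples)
    while performance(rule, examples) > threshold:
        learned_rules.append(rule)
        # Remove the examples the new rule classifies correctly.
        examples = [ex for ex in examples
                    if not covers_correctly(rule, ex, target_attribute)]
        rule = learn_one_rule(target_attribute, attributes, examples)
    # Sort so the most accurate rules are consulted first at classification time.
    learned_rules.sort(key=lambda r: performance(r, examples), reverse=True)
    return learned_rules
```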
Learning Propositional Rules: Sequential Covering Algorithms
• The algorithm is called a sequential covering algorithm because it sequentially learns a set of rules that together cover the whole set of positive examples.
• It has the advantage of reducing the problem of learning a disjunctive set of rules to a sequence of simpler problems, each requiring that a single conjunctive rule be learned.
• The final set of rules is sorted so that the most accurate rules are considered first at classification time.
• However, because it does not backtrack, this algorithm is not guaranteed to find the smallest or best set of rules ---> Learn-one-rule must be very effective!
Learning Propositional Rules: Learn-one-rule
General-to-Specific Search:
• Consider the most general rule (hypothesis), which matches every instance in the training set.
• Repeat
  • Add the attribute test that most improves rule performance measured over the training set.
• Until the hypothesis reaches an acceptable level of performance.
• The difference between this algorithm and ID3 is that it follows only the single best descendant at each step.
• It has no backtracking, and hence there is a chance of making a suboptimal choice.
• A solution is to use beam search rather than keeping only the single best candidate.
General-to-Specific Beam Search (CN2):
• Rather than considering a single candidate at each search step, keep track of the k best candidates.
• The beam keeps track of the most promising alternatives to the current top-rated hypothesis.
• Learn-one-rule(Target_attribute, Attributes, Examples, k)
  • Returns a single rule that covers some of the examples.
  • Conducts a general-to-specific greedy beam search for the best rule, guided by the performance metric.
Implementation of Learn-one-rule using general-to-specific beam search
Learn-one-rule(Target_attribute, Attributes, Examples, k)
1. Best-Hypothesis ← the most general hypothesis ∅ (no constraints)
2. Candidate-Hypotheses ← {Best-Hypothesis}
3. While Candidate-Hypotheses is not empty, do
   a. Generate the next, more specific candidate hypotheses
      • All-Constraints ← the set of all constraints of the form (a = v), where a is a member of Attributes and v is a value of a occurring in Examples
      • New-Candidate-Hypotheses ← for each h in Candidate-Hypotheses and each c in All-Constraints, the specialization of h obtained by adding the constraint c
      • Remove from New-Candidate-Hypotheses any hypotheses that are duplicates, inconsistent, or not maximally specific
   b. Update Best-Hypothesis
      • For all h in New-Candidate-Hypotheses do
        • If Performance(h, Examples, Target_attribute) > Performance(Best-Hypothesis, Examples, Target_attribute)
          then Best-Hypothesis ← h
   c. Update Candidate-Hypotheses
      • Candidate-Hypotheses ← the k best members of New-Candidate-Hypotheses, according to the Performance measure
4. Return a rule of the form
   "IF Best-Hypothesis THEN prediction"
   where prediction is the most frequent value of Target_attribute among the examples that match Best-Hypothesis.

Performance(h, Examples, Target_attribute)
• h-examples ← the subset of Examples that match h
• Return −Entropy(h-examples), where entropy is computed with respect to Target_attribute

Variations:
1. Learn rules only for the positive examples, and treat anything not covered as negative (negation as failure).
2. Use a single positive example as a seed, and consider only the attribute constraints that this example satisfies.
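Below is a hedged Python sketch of the Learn-one-rule beam search just described, using the negative-entropy Performance measure from the slide. The dictionary-based hypothesis representation is an assumption made for this sketch; with k = 1 the same code reduces to the greedy single-descendant search described earlier.

```python
import math
from collections import Counter

def entropy(examples, target_attribute):
    """Entropy of the target-attribute values among the given examples."""
    counts = Counter(ex[target_attribute] for ex in examples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def performance(hypothesis, examples, target_attribute):
    """-Entropy of the examples matched by the hypothesis (higher is better)."""
    matched = [ex for ex in examples
               if all(ex.get(a) == v for a, v in hypothesis.items())]
    if not matched:
        return float("-inf")
    return -entropy(matched, target_attribute)

def learn_one_rule(target_attribute, attributes, examples, k):
    best = {}                                   # most general hypothesis: no constraints
    candidates = [best]
    all_constraints = {(a, ex[a]) for ex in examples for a in attributes}
    while candidates:
        # a. Generate the next, more specific candidate hypotheses.
        new_candidates = []
        for h in candidates:
            for a, v in all_constraints:
                if a not in h:                  # skip inconsistent specializations
                    specialized = {**h, a: v}
                    if specialized not in new_candidates:   # drop duplicates
                        new_candidates.append(specialized)
        # b. Update the best hypothesis seen so far.
        for h in new_candidates:
            if (performance(h, examples, target_attribute) >
                    performance(best, examples, target_attribute)):
                best = h
        # c. Keep only the k best candidates (the beam).
        new_candidates.sort(key=lambda h: performance(h, examples, target_attribute),
                            reverse=True)
        candidates = new_candidates[:k]
    # Predict the most frequent target value among examples matching Best-Hypothesis.
    matched = [ex for ex in examples
               if all(ex.get(a) == v for a, v in best.items())]
    prediction = Counter(ex[target_attribute] for ex in matched).most_common(1)[0][0]
    return (best, prediction)
```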
Comments and Variations regarding the Basic Rule Learning Algorithms
• Sequential versus simultaneous covering: sequential covering algorithms (CN2) make a larger number of independent choices than simultaneous covering ones (ID3). Which is preferable depends on how much training data is available.
• Direction of the search: CN2 uses a general-to-specific search strategy. Other systems (GOLEM) use a specific-to-general strategy. General-to-specific search has the advantage of having a single maximally general hypothesis from which to start.
• Generate-then-test versus example-driven: CN2 (Clark and Niblett's Learn-one-rule) is a generate-then-test method, so the impact of noise is minimized.
  • Other methods (Find-S, Candidate-Elimination) are example-driven. Generate-then-test systems are more robust to noise, whereas example-driven methods can be misled by a single noisy training example.
Comments and Variations regarding the Basic Rule Learning Algorithms, Cont'd
• Post-pruning: preconditions can be removed from a rule whenever this leads to improved performance over a set of pruning examples distinct from the training set.
• Performance measure: different evaluation functions can be used, for example:
  • Relative frequency: nc/n, where n is the number of examples the rule matches and nc the number it classifies correctly.
  • m-estimate of accuracy: (nc + m·p)/(n + m), where p is the prior probability of the predicted class and m is a weight on that prior (used in certain versions of CN2).
  • Entropy (used in the original CN2).
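As a small illustration of the first two measures, here is a hedged Python sketch; the (preconditions, prediction) rule shape matches the earlier sketches, and the names n, nc, m, and p follow the slide (nc correct classifications out of n matched examples, p the class prior, m the weight on that prior).

```python
def _matched(rule, examples):
    preconditions, _ = rule
    return [ex for ex in examples
            if all(ex.get(a) == v for a, v in preconditions.items())]

def relative_frequency(rule, examples, target_attribute):
    """nc / n: fraction of matched examples the rule classifies correctly."""
    _, prediction = rule
    matched = _matched(rule, examples)
    n = len(matched)
    nc = sum(ex.get(target_attribute) == prediction for ex in matched)
    return nc / n if n else 0.0

def m_estimate(rule, examples, target_attribute, m, p):
    """(nc + m*p) / (n + m): relative frequency smoothed toward the prior p."""
    _, prediction = rule
    matched = _matched(rule, examples)
    n = len(matched)
    nc = sum(ex.get(target_attribute) == prediction for ex in matched)
    return (nc + m * p) / (n + m)
```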
Learning Sets of First-Order Rules: FOIL (Quinlan, 1990)
FOIL is similar to the propositional rule-learning approach except for the following:
• FOIL accommodates first-order rules and thus needs to handle variables in the rule preconditions.
• FOIL uses a special performance measure (Foil_Gain, shown below) which takes into account the different variable bindings.
• FOIL seeks only rules that predict when the target literal is True (instead of predicting when it is True or when it is False).
• FOIL performs a simple hill-climbing search rather than a beam search.
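The slide names Foil_Gain without defining it; for reference, the standard definition from Quinlan (1990) evaluates the effect of adding a candidate literal L to a rule R:

```latex
\mathrm{Foil\_Gain}(L, R) \;=\; t \left( \log_2 \frac{p_1}{p_1 + n_1} \;-\; \log_2 \frac{p_0}{p_0 + n_0} \right)
```

where p0 and n0 are the numbers of positive and negative bindings of rule R, p1 and n1 are the corresponding counts for the specialized rule R + L, and t is the number of positive bindings of R that are still covered after adding L.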
Induction as Inverted Deduction
• Let D be a set of training examples, each of the form <xi, f(xi)>. Then learning is the problem of discovering a hypothesis h such that the classification f(xi) of each training instance xi follows deductively from the hypothesis h, the description of xi, and any other background knowledge B known to the system.
Example:
• xi: Male(Bob), Female(Sharon), Father(Sharon, Bob)
• f(xi): Child(Bob, Sharon)
• B: Parent(u,v) ← Father(u,v)
• We want to find h such that (B ∧ h ∧ xi) ⊢ f(xi). Two hypotheses that satisfy this are:
  h1: Child(u,v) ← Father(v,u)
  h2: Child(u,v) ← Parent(v,u)
• Training example (attribute-value form):
  Name1=Sharon, Mother1=Louise, Father1=Bob, Male1=False, Female1=True,
  Name2=Bob, Father2=Victor, Mother2=Nora, Male2=True, Female2=False,
  Daughter1,2=True
• Propositional rule:
  IF (Father1=Bob) ^ (Name2=Bob) ^ (Female1=True) THEN Daughter1,2=True
• First-order rule:
  IF Father(y, x) ^ Female(y) THEN Daughter(x, y)
• Exercise: IF Father(y, z) ^ Mother(z, x) ^ Female(y) THEN __?__(x, y)
  Answer: Granddaughter
Basic Definitions from First-Order Logic
• Constants (capitalized symbols)
  • Ex: Bob, Sharon
• Variables (lowercase symbols)
  • Ex: x, y, z
• Predicate symbols (capitalized symbols)
  • Ex: Female, Male, Married
• Function symbols (lowercase symbols)
  • Ex: age
• Term: any constant, variable, or function applied to a term
  • Example: Mary, x, age(Mary), age(x)
• Literal: any predicate, or its negation, applied to terms
  • Example: Married(Bob) → positive literal
    ¬Greater_Than(age(Sharon), 20) → negative literal
• Ground literal: a literal that contains no variables
• Clause: a disjunction of literals
Basic Definitions from First-Order Logic
• Horn clause: a clause with at most one positive literal
  • H ∨ ¬L1 ∨ ¬L2 ∨ … ∨ ¬Ln
  • Equivalently, H ← (L1 ∧ L2 ∧ … ∧ Ln), i.e., IF (L1 ∧ L2 ∧ … ∧ Ln) THEN H
• In a Horn clause:
  • H is the head or consequent of the Horn clause.
  • (L1 ∧ L2 ∧ … ∧ Ln) is the body or antecedents of the Horn clause.
• Well-formed expressions are composed of:
  1. Constants  2. Variables  3. Functions  4. Predicates
  • Example: Female(Joe)
• Substitution: a function that replaces variables by terms
  • Example: {x/3, y/z} replaces x by 3 and y by z
  • In general, if θ is a substitution and L is a literal, Lθ denotes the result of applying θ to L.
• Unifying substitution: for two literals L1 and L2, a substitution θ such that L1θ = L2θ
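To make the substitution notation concrete, here is a small Python sketch; the tuple-based literal representation and the lowercase-variable convention mirror the slide's conventions but are assumptions of this sketch, not part of any particular library.

```python
# Literals are nested tuples, e.g. ("Father", "y", "x"); variables are
# lowercase strings, constants are capitalized strings (slide convention).

def is_variable(term):
    return isinstance(term, str) and term[:1].islower()

def apply_substitution(literal, theta):
    """Return L·theta: replace each variable in the literal using theta."""
    return tuple(theta.get(t, t) if is_variable(t)
                 else (apply_substitution(t, theta) if isinstance(t, tuple) else t)
                 for t in literal)

# Example: {y -> Sharon, x -> Bob} applied to Father(y, x)
theta = {"y": "Sharon", "x": "Bob"}
print(apply_substitution(("Father", "y", "x"), theta))   # ('Father', 'Sharon', 'Bob')

# theta is a unifying substitution for L1 and L2 if L1·theta == L2·theta.
L1 = ("Daughter", "x", "y")
L2 = ("Daughter", "Sharon", "y")
theta2 = {"x": "Sharon"}
print(apply_substitution(L1, theta2) == apply_substitution(L2, theta2))  # True
```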
Learning Sets of First-Order Rules (FOIL)
FOIL(Target_Predicate, Predicates, Examples)
• Pos ← examples for which Target_Predicate is True
• Neg ← examples for which Target_Predicate is False
• Learned_rules ← { }
• While Pos is not empty, do            ; outer loop: learn a new rule
  • NewRule ← the rule that predicts Target_Predicate with no preconditions
  • NewRuleNeg ← Neg
  • While NewRuleNeg is not empty, do   ; inner loop: add a new literal to specialize NewRule
    • Candidate_literals ← generate candidate new literals for NewRule, based on Predicates
    • Best_literal ← argmax over L ∈ Candidate_literals of Foil_Gain(L, NewRule)
    • Add Best_literal to the preconditions of NewRule
    • NewRuleNeg ← subset of NewRuleNeg that satisfies NewRule's preconditions
  • Learned_rules ← Learned_rules + NewRule
  • Pos ← Pos − {members of Pos covered by NewRule}
• Return Learned_rules

Comments:
• Each iteration of the outer loop adds a new rule to Learned_rules, performing a specific-to-general search at the level of the rule set (each added rule generalizes the overall disjunction).
• The inner loop performs a finer-grained, general-to-specific hill-climbing search to determine the exact definition of each rule.
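Finally, a highly simplified Python skeleton of FOIL's two nested loops, following the pseudocode above. Generating first-order candidate literals and counting variable bindings are the hard parts of a real implementation, so they are left as placeholder functions (generate_candidate_literals, covers, gain) supplied by the caller; the example format with a "label" field is an assumption of this sketch, and foil_gain follows the standard Quinlan (1990) definition shown earlier.

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """t * (log2(p1/(p1+n1)) - log2(p0/(p0+n0))), per the definition above."""
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

def foil(target_predicate, predicates, examples,
         generate_candidate_literals, covers, gain):
    pos = [ex for ex in examples if ex["label"]]        # Target_Predicate is True
    neg = [ex for ex in examples if not ex["label"]]    # Target_Predicate is False
    learned_rules = []
    while pos:                                          # outer loop: learn a new rule
        new_rule = (target_predicate, [])               # head with empty preconditions
        new_rule_neg = list(neg)
        while new_rule_neg:                             # inner loop: specialize the rule
            candidates = generate_candidate_literals(new_rule, predicates)
            best_literal = max(candidates,
                               key=lambda L: gain(L, new_rule, pos, new_rule_neg))
            new_rule[1].append(best_literal)            # add to preconditions
            new_rule_neg = [ex for ex in new_rule_neg if covers(new_rule, ex)]
        learned_rules.append(new_rule)
        pos = [ex for ex in pos if not covers(new_rule, ex)]
    return learned_rules
```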