Mining Association Rules
Mohamed G. Elfeky
Introduction
Data mining is the discovery of knowledge and useful information from the large amounts of data stored in databases.
Association rules describe association relationships among the attributes in the set of relevant data.
Rules
Body ==> Consequent [ Support , Confidence ]
Body: represents the examined data.
Consequent: represents a discovered property for the examined data.
Support: the percentage of the records satisfying both the body and the consequent.
Confidence: the ratio of the records satisfying both the body and the consequent to those satisfying the body.
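These definitions can be made concrete with a short sketch. The helper names and the basket data below are illustrative, not from the slides:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(body, consequent, transactions):
    """Records satisfying both body and consequent, over records satisfying the body."""
    return support(body | consequent, transactions) / support(body, transactions)

# Illustrative basket data (assumption, not from the slides).
baskets = [{"tea", "milk", "sugar"}, {"tea", "sugar"}, {"milk"}, {"tea", "milk"}]
print(support({"tea", "milk"}, baskets))                # 0.5: 2 of 4 baskets
print(confidence({"tea", "milk"}, {"sugar"}, baskets))  # 0.5: 1 of the 2 matching baskets
```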
Association Rules Examples
Basket Data
Tea ^ Milk ==> Sugar [0.3 , 0.9]
Relational Data
x.diagnosis = Heart ^ x.sex = Male ==> x.age > 50 [0.4 , 0.7]
Object-Oriented Data
s.hobbies = { sport , art } ==> s.age() = Young [0.5 , 0.8]
Topics of Discussion
Formal Statement of the Problem
Different Algorithms
- AIS
- SETM
- Apriori
- AprioriTid
- AprioriHybrid
Performance Analysis
Formal Statement of the Problem
I = { i1 , i2 , … , im } is a set of items.
D is a set of transactions T.
Each transaction T is a set of items (a subset of I).
TID is a unique identifier associated with each transaction.
The problem is to generate all association rules whose support and confidence are greater than the user-specified minimum support and minimum confidence.
Problem Decomposition
The problem can be decomposed into two subproblems:
1. Find all sets of items (itemsets) whose support (number of transactions containing them) is greater than the minimum support; these are the large itemsets.
2. Use the large itemsets to generate the desired rules: for each large itemset l, find all non-empty proper subsets, and for each subset a generate the rule a ==> (l - a) if its confidence is greater than the minimum confidence.
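Subproblem 2 can be sketched directly from this description. The function name is mine, and the data is the running example database used on the later slides:

```python
from itertools import combinations

def gen_rules(large_itemset, transactions, min_conf):
    """For each non-empty proper subset a of l, emit a ==> (l - a)
    when its confidence reaches min_conf."""
    l = frozenset(large_itemset)
    supp = lambda s: sum(1 for t in transactions if s <= t)
    rules = []
    for k in range(1, len(l)):                      # proper subsets only
        for a in map(frozenset, combinations(l, k)):
            conf = supp(l) / supp(a)
            if conf >= min_conf:
                rules.append((set(a), set(l - a), conf))
    return rules

# The running example database from the later slides (items only; TIDs omitted).
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
rules = gen_rules({2, 3, 5}, db, 1.0)   # yields {2 3} ==> {5} and {3 5} ==> {2}
```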
General Algorithm
1. In the first pass, the support of each individual item is counted, and the large ones are determined.
2. In each subsequent pass, the large itemsets determined in the previous pass are used to generate new itemsets called candidate itemsets.
3. The support of each candidate itemset is counted, and the large ones are determined.
4. This process continues until no new large itemsets are found.
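The four steps form a level-wise loop; the candidate-generation step is what distinguishes the algorithms that follow. A minimal sketch, with a naive placeholder generator of my own (not one of the named algorithms):

```python
from itertools import combinations

def large_itemsets(transactions, min_support, candidates):
    """Level-wise skeleton: `candidates(prev_large, k)` is supplied per algorithm."""
    def count(itemsets):
        return {c: sum(1 for t in transactions if c <= t) for c in itemsets}
    items = {frozenset([i]) for t in transactions for i in t}
    large = {c for c, n in count(items).items() if n >= min_support}  # pass 1
    all_large, k = set(large), 2
    while large:                                  # passes 2, 3, ... until empty
        cand = candidates(large, k)
        large = {c for c, n in count(cand).items() if n >= min_support}
        all_large |= large
        k += 1
    return all_large

# Naive placeholder (assumption): all k-subsets of items seen in prev_large.
def naive_candidates(prev_large, k):
    pool = sorted({i for s in prev_large for i in s})
    return {frozenset(c) for c in combinations(pool, k)}

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = large_itemsets(db, 2, naive_candidates)  # 4 + 4 + 1 large itemsets
```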
AIS Algorithm
Candidate itemsets are generated and counted on-the-fly as the database is scanned.
1. For each transaction, it is determined which of the large itemsets of the previous pass are contained in this transaction.
2. New candidate itemsets are generated by extending these large itemsets with other items in this transaction.
The disadvantage is that this results in unnecessarily generating and counting too many candidate itemsets that turn out to be small.
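The two steps above can be sketched for a single pass. This is my sketch (items are assumed comparable, so each extension uses only "later" items and is generated once); on the example from the next slide it reproduces the C2 table:

```python
def ais_pass(transactions, prev_large):
    """One AIS pass: extend large itemsets of the previous pass with other
    items of the same transaction, counting candidates on the fly."""
    counts = {}
    for t in transactions:
        contained = [l for l in prev_large if l <= t]   # step 1
        for l in contained:                             # step 2
            for item in t:
                if item > max(l):                       # extend with later items only
                    c = l | {item}
                    counts[c] = counts.get(c, 0) + 1
    return counts

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
l1 = {frozenset({i}) for i in (1, 2, 3, 5)}             # L1 from the next slide
c2 = ais_pass(db, l1)                                   # 8 candidates, as in C2
```

Note that {1 4} and {3 4} are generated even though {4} is not large; this is exactly the wasted work the slide calls out.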
Example
Database:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

L1:
Itemset  Support
{1}      2
{2}      3
{3}      3
{5}      3

C2 (* = large):
Itemset  Support
{1 3}*   2
{1 4}    1
{3 4}    1
{2 3}*   2
{2 5}*   3
{3 5}*   2
{1 2}    1
{1 5}    1

C3 (* = large):
Itemset  Support
{1 3 4}  1
{2 3 5}* 2
{1 3 5}  1
SETM Algorithm
Candidate itemsets are generated on-the-fly as the database is scanned, but counted at the end of the pass.
1. New candidate itemsets are generated the same way as in the AIS algorithm, but the TID of the generating transaction is saved with the candidate itemset in a sequential structure.
2. At the end of the pass, the support count of candidate itemsets is determined by aggregating this sequential structure.
It has the same disadvantage as the AIS algorithm. Another disadvantage is that for each candidate itemset, there are as many entries as its support value.
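A sketch of one SETM pass (my function name; candidate generation is the same as in the AIS sketch, but counting is deferred). On the next slide's example it produces the 13 (itemset, TID) entries of C2:

```python
from collections import Counter

def setm_pass(transactions, prev_large):
    """One SETM pass: append (candidate, TID) pairs to a sequential structure
    while scanning, then aggregate support counts at the end of the pass."""
    pairs = []
    for tid, t in transactions:
        for l in (l for l in prev_large if l <= t):
            for item in t:
                if item > max(l):
                    pairs.append((l | {item}, tid))
    support = Counter(c for c, _ in pairs)          # end-of-pass aggregation
    return pairs, support

db = [(100, {1, 3, 4}), (200, {2, 3, 5}), (300, {1, 2, 3, 5}), (400, {2, 5})]
l1 = {frozenset({i}) for i in (1, 2, 3, 5)}
pairs, support = setm_pass(db, l1)     # 13 entries, one per supporting transaction
```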
Example
Database:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

L1:
Itemset  Support
{1}      2
{2}      3
{3}      3
{5}      3

C2:
Itemset  TID
{1 3}    100
{1 4}    100
{3 4}    100
{2 3}    200
{2 5}    200
{3 5}    200
{1 2}    300
{1 3}    300
{1 5}    300
{2 3}    300
{2 5}    300
{3 5}    300
{2 5}    400

C3:
Itemset  TID
{1 3 4}  100
{2 3 5}  200
{1 3 5}  300
{2 3 5}  300
Apriori Algorithm
Candidate itemsets are generated using only the large itemsets of the previous pass, without considering the transactions in the database.
1. The set of large itemsets from the previous pass is joined with itself to generate all itemsets whose size is larger by one.
2. Each generated itemset that has a subset which is not large is deleted. The remaining itemsets are the candidates.
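The join-and-prune step can be sketched compactly (the set-comprehension join is my implementation choice; the classic description joins sorted prefixes, which yields the same candidates):

```python
from itertools import combinations

def apriori_gen(prev_large, k):
    """Join L(k-1) with itself, then prune candidates that have
    a (k-1)-subset which is not large."""
    joined = {a | b for a in prev_large for b in prev_large if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(s) in prev_large for s in combinations(c, k - 1))}

l2 = {frozenset(s) for s in [{1, 3}, {2, 3}, {2, 5}, {3, 5}]}
c3 = apriori_gen(l2, 3)
# Join gives {1 2 3}, {1 3 5}, {2 3 5}; pruning drops the first two
# (since {1 2} and {1 5} are not large), leaving only {2 3 5}.
```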
Example
Database:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

L1:
Itemset  Support
{1}      2
{2}      3
{3}      3
{5}      3

C2 (* = large):
Itemset  Support
{1 2}    1
{1 3}*   2
{1 5}    1
{2 3}*   2
{2 5}*   3
{3 5}*   2

Join of L2 with itself: {1 2 3}, {1 3 5}, {2 3 5}
After pruning, C3 (* = large):
Itemset  Support
{2 3 5}* 2
AprioriTid Algorithm
The database is not used at all for counting the support of candidate itemsets after the first pass.
1. The candidate itemsets are generated the same way as in the Apriori algorithm.
2. Another set C’ is generated, in which each member holds the TID of a transaction together with the candidate itemsets present in that transaction. This set is used to count the support of each candidate itemset.
The advantage is that the number of entries in C’ may be smaller than the number of transactions in the database, especially in the later passes.
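A simplified sketch of one AprioriTid pass: a candidate is counted for a transaction when all of its (k-1)-subsets appear in that transaction's C’ entry (for Apriori-pruned candidates this agrees with the original generator-based test, since every such subset is large and hence was a candidate). The data below is C’2 from the next slide:

```python
def aprioritid_pass(c_prev, candidates):
    """One AprioriTid pass: count support from C'(k-1) instead of the
    database, and build C'(k) for the next pass."""
    counts = {c: 0 for c in candidates}
    c_next = []
    for tid, sets in c_prev:
        present = {c for c in candidates
                   if all(c - {i} in sets for i in c)}  # every (k-1)-subset in t
        for c in present:
            counts[c] += 1
        if present:
            c_next.append((tid, present))
    return counts, c_next

f = frozenset
c2_prime = [(100, {f({1, 3})}),
            (200, {f({2, 3}), f({2, 5}), f({3, 5})}),
            (300, {f({1, 2}), f({1, 3}), f({1, 5}),
                   f({2, 3}), f({2, 5}), f({3, 5})}),
            (400, {f({2, 5})})]
counts, c3_prime = aprioritid_pass(c2_prime, {f({2, 3, 5})})
# Only TIDs 200 and 300 contribute, so C'3 has two entries and {2 3 5} has support 2.
```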
Example
Database:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

L1:
Itemset  Support
{1}      2
{2}      3
{3}      3
{5}      3

C2 (* = large):
Itemset  Support
{1 2}    1
{1 3}*   2
{1 5}    1
{2 3}*   2
{2 5}*   3
{3 5}*   2

C3 (* = large):
Itemset  Support
{2 3 5}* 2

C’2:
TID  Itemsets
100  {1 3}
200  {2 3}, {2 5}, {3 5}
300  {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
400  {2 5}

C’3:
TID  Itemsets
200  {2 3 5}
300  {2 3 5}
Performance Analysis
[performance comparison charts omitted]
AprioriHybrid Algorithm
Performance Analysis shows that:
1. Apriori does better than AprioriTid in the
earlier passes.
2. AprioriTid does better than Apriori in the
later passes.
Hence, a hybrid algorithm can be
designed that uses Apriori in the initial
passes and switches to AprioriTid when it
expects that the set C’ will fit in memory.