Association rules and frequent pattern growth algorithms

CIS 435
Francisco E. Figueroa
Executive Summary
In recent years, we have witnessed exponential growth in the amount of data
generated and stored in all fields, including science, business, and retailing. Data mining
can be defined as the process of applying computational techniques to find
patterns in data, generating knowledge and wisdom that create new value for
companies. By conducting association rule mining on the given historical sales data, we can
provide actionable intelligence to the business leadership team so that the
store can be prepared for the heavy snowstorm.
Association Rules Overview
The goal of association rule mining is to identify all frequent itemsets above a user-specified
threshold (called support) and then to generate all association rules above another threshold
(called confidence) using these frequent itemsets as input. Association analysis is useful for
discovering relationships hidden in large data sets. The uncovered relationships can be
represented in the form of association rules or sets of frequent items. Retailers can use
these rules to identify new business opportunities for cross-selling products to
their customers. For example, the following rule could be extracted from the data:
{milk} ---> {bread}. The rule suggests that a strong relationship exists between the sale of milk
and bread because many customers who buy milk also buy bread. An association rule is an
implication expression of the form X ---> Y, where X and Y are disjoint itemsets.
Strength, Confidence and Lift
The strength of an association rule can be measured in terms of its support and
confidence. Support determines how often a rule is applicable to a given data set, while
confidence determines how frequently items in Y appear in transactions that contain X. Support
is an important measure because a rule that has very low support may occur simply by chance
and is likely to be uninteresting from a business perspective. Support can therefore be used to
eliminate uninteresting rules.
Confidence measures the reliability of the inference made by the rule. For a given rule
X ---> Y, the higher the confidence, the more likely it is for Y to be present in transactions that
contain X. It also provides an estimate of the conditional probability of Y given X. The inference
made by an association rule suggests a strong co-occurrence relationship between the items in the
antecedent and the consequent of the rule.
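As a minimal sketch of the two definitions above, the following Python computes support and confidence for the rule {milk} ---> {bread}. The transactions are hypothetical toy data, not the sales file analyzed in this report, and the function names are chosen only for illustration.

```python
# Hypothetical toy transactions (not the report's sales data).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "beer"},
    {"milk"},
    {"bread"},
    {"milk", "bread", "soap"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(X and Y) / support(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

rule_support = support({"milk", "bread"}, transactions)
rule_confidence = confidence({"milk"}, {"bread"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here three of the five baskets contain both items, and four contain milk, so the rule applies to 60% of the transactions and holds in 75% of the baskets that contain milk.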
Lift is equal to the confidence factor divided by the expected confidence. A credible
rule has a relatively large confidence factor, a relatively large level of support, and a value of
lift greater than 1. Rules having a high level of confidence but little support should be
interpreted with caution. (SAS, 2000)

In other words, the lift of the rule X ---> Y is the confidence of the rule divided by
the expected confidence, assuming that the itemsets are independent. We can then say that:
- a lift greater than 1 indicates that X and Y appear together more often than
expected; this means that the occurrence of X has a positive effect on the occurrence of Y, or
that X is positively correlated with Y.
- a lift smaller than 1 indicates that X and Y appear together less often than expected;
this means that the occurrence of X has a negative effect on the occurrence of Y, or that X is
negatively correlated with Y.
- a lift near 1 indicates that X and Y appear together almost as often as expected;
this means that the occurrence of X has almost no effect on the occurrence of Y, or that X and Y
have almost no correlation.
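The three cases can be illustrated with a short Python sketch. The baskets and item names below are hypothetical and were chosen so that each case appears once; the expected confidence is taken to be the support of the consequent, which is what the independence assumption implies.

```python
def lift(antecedent, consequent, transactions):
    """Confidence of X -> Y divided by the expected confidence (support of Y)."""
    x, y = set(antecedent), set(consequent)
    n = len(transactions)
    count_x = sum(x <= t for t in transactions)
    count_y = sum(y <= t for t in transactions)
    count_xy = sum((x | y) <= t for t in transactions)
    confidence = count_xy / count_x
    expected_confidence = count_y / n
    return confidence / expected_confidence

# Hypothetical baskets: a and b always occur together, a and c never do.
correlated = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
# Hypothetical baskets in which a and b occur independently of each other.
independent = [{"a", "b"}, {"a"}, {"b"}, set()]

lift_positive = lift({"a"}, {"b"}, correlated)   # greater than 1
lift_negative = lift({"a"}, {"c"}, correlated)   # smaller than 1
lift_neutral = lift({"a"}, {"b"}, independent)   # equal to 1
print(lift_positive, lift_negative, lift_neutral)
```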
Apriori
The Apriori algorithm was proposed for mining frequent itemsets to obtain strong Boolean
association rules. A frequent itemset is a set of items that occurs in the transactions with at
least a minimum specified support. A strong rule is one that satisfies both minimum support and
minimum confidence. The Apriori algorithm uses an iterative level-wise search, where k-itemsets
(itemsets that contain k items) are used to explore (k+1)-itemsets, to mine frequent itemsets from
a transactional database for Boolean association rules. The procedure is to first find the set of
frequent 1-itemsets (k=1). This set is denoted L1. L1 is then used to find the set of frequent
2-itemsets, L2, which is in turn used to find L3, and so on, until no more frequent k-itemsets can
be found. Each iteration involves two steps: 1) generate candidate k-itemsets and 2) determine
the support of each candidate using the transaction database. Infrequent itemsets are then pruned
and strong rules are generated from the frequent itemsets.
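The level-wise search described above can be sketched in Python as follows. This is a simplified illustration written for this report, not the Weka implementation used for the results; the function name `apriori`, the toy baskets and the 0.4 threshold are assumptions, and rule generation from the frequent itemsets is left out.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    def keep_frequent(candidates):
        # Step 2: count each candidate against the database and prune.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {item for t in transactions for item in t}
    level = keep_frequent({frozenset([i]) for i in items})  # L1
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Step 1: join frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = keep_frequent(candidates)  # Lk
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical toy baskets and threshold.
baskets = [{"milk", "bread"}, {"milk", "bread", "beer"}, {"milk"},
           {"bread"}, {"milk", "bread", "soap"}]
frequent_itemsets = apriori(baskets, min_support=0.4)
for itemset, count in sorted(frequent_itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), count)
```

With these baskets, beer and soap are pruned at the first level, so the search stops after L2 with {milk, bread} as the only frequent 2-itemset.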
FP Growth
FP-Growth is an improvement on Apriori designed to eliminate some of its heaviest bottlenecks.
It does so by using a structure called an FP-Tree. In the FP-Tree, each node represents an item
and its current count, and each branch represents a different association. The algorithm can be
divided into five steps: first, count the occurrences of every item across all transactions;
second, apply the support threshold to discard infrequent items; third, sort the remaining items
by count in descending order; fourth, build the tree by inserting each transaction with its items
in the order in which they appear in the sorted list; and fifth, traverse every branch of the
tree and include in the associations only the nodes whose count passes the threshold. The biggest
advantages of FP-Growth are that the algorithm needs to read the file only twice, removes the
need to generate the pairs to be counted, and does not require as much memory as Apriori.
(Alfaro, 2016)
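The first four steps (the two passes over the data that build the FP-Tree) can be sketched in Python as below. This is an illustrative sketch only: the class and function names are hypothetical, the baskets are toy data, ties in item counts are broken alphabetically as an arbitrary choice, and the fifth step (mining the tree) is not shown.

```python
from collections import Counter

class FPNode:
    """One tree node: an item, its count, and its child nodes."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1 (steps 1-3): count items, apply the threshold, sort by count.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}
    order = sorted(frequent, key=lambda item: (-frequent[item], item))
    rank = {item: position for position, item in enumerate(order)}

    # Pass 2 (step 4): insert each transaction along a path from the root,
    # with its frequent items arranged in the sorted order.
    root = FPNode(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, order

# Hypothetical toy baskets.
baskets = [["water", "soap", "beer"], ["water", "soap"],
           ["water", "beer"], ["water", "generator"]]
root, order = build_fp_tree(baskets, min_count=2)
print(order)
```

Because transactions that share their most frequent items share a path from the root, the tree is usually much smaller than the raw transaction list, which is where the memory saving comes from.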
Top 10 Products
When applying Apriori (MetricType: confidence; numrules: 40; car: True), we obtained the following
products: bath tissue, hat, water, soap, beer, flashlights, rock salt, protein bars,
blankets and milk.

Top 10 Association Rules
When applying Apriori and FP Growth we obtained the following results:
FP Growth: MetricType: confidence; numrules: 40
1. [WATER=T]: 99 ==> [Soap=T]: 99 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.99)
2. [Soap=T]: 99 ==> [WATER=T]: 99 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.99)
3. [Beer=T]: 88 ==> [WATER=T]: 88 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.88)
4. [Flashlights=T]: 77 ==> [WATER=T]: 77 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.77)
5. [Milk=T]: 64 ==> [WATER=T]: 64 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.64)
6. [Blankets=T]: 64 ==> [WATER=T]: 64 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.64)
7. [Beer=T]: 88 ==> [Soap=T]: 88 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.88)
8. [Flashlights=T]: 77 ==> [Soap=T]: 77 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.77)
9. [Milk=T]: 64 ==> [Soap=T]: 64 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.64)
10. [Blankets=T]: 64 ==> [Soap=T]: 64 <conf:(1)> lift:(1.01) lev:(0.01) conv:(0.64)
Apriori: MetricType: confidence; numrules 40; car:True
1. Bath Tissue=T 55 ==> Hat=T 55 conf:(1)
2. WATER=T Bath Tissue=T 54 ==> Hat=T 54 conf:(1)
3. Bath Tissue=T Soap=T 54 ==> Hat=T 54 conf:(1)
4. WATER=T Bath Tissue=T Soap=T 54 ==> Hat=T 54 conf:(1)
5. Beer=T Bath Tissue=T 48 ==> Hat=T 48 conf:(1)
6. WATER=T Beer=T Bath Tissue=T 48 ==> Hat=T 48 conf:(1)
7. Beer=T Bath Tissue=T Soap=T 48 ==> Hat=T 48 conf:(1)
8. WATER=T Beer=T Bath Tissue=T Soap=T 48 ==> Hat=T 48 conf:(1)
9. Flashlights=T Bath Tissue=T 39 ==> Hat=T 39 conf:(1)
10. Flashlights=T WATER=T Bath Tissue=T 39 ==> Hat=T 39 conf:(1)
We can see that water, soap, beer, and flashlights are strong products.
Top 2 Products Purchased
FP Growth found 19 rules associated with “Generator”. A lift greater than 1 tells us the
degree to which the two occurrences depend on one another, and makes those rules
potentially useful for predicting the consequent in future data sets. In addition, conviction
compares how often the rule would be expected to be incorrect if the antecedent and consequent
were independent with how often it is actually incorrect, so higher values indicate a more
dependable rule. Based on those measures, we found that water and soap are the top two products
purchased with the “Generator”: each rule has a confidence of 1, a lift of 1.01 and a conviction
of 0.1. Beer is another strong product to purchase with the “Generator”: its rule has a lift of
1.14 and a conviction of 1.2, the highest values among the rules shown.
FPGrowth found 19 rules (displaying top 19)
Showing only rules that contain: Generator
1. [Generator=T]: 10 ==> [WATER=T]: 10 <conf:(1)> lift:(1.01) lev:(0) conv:(0.1)
2. [Generator=T]: 10 ==> [Soap=T]: 10 <conf:(1)> lift:(1.01) lev:(0) conv:(0.1)
3. [Generator=T]: 10 ==> [Beer=T]: 10 <conf:(1)> lift:(1.14) lev:(0.01) conv:(1.2)
4. [Generator=T]: 10 ==> [WATER=T, Soap=T]: 10 <conf:(1)> lift:(1.01) lev:(0) conv:(0.1)
5. [WATER=T, Generator=T]: 10 ==> [Soap=T]: 10 <conf:(1)> lift:(1.01) lev:(0) conv:(0.1)
6. [Soap=T, Generator=T]: 10 ==> [WATER=T]: 10 <conf:(1)> lift:(1.01) lev:(0) conv:(0.1)
7. [Generator=T]: 10 ==> [WATER=T, Beer=T]: 10 <conf:(1)> lift:(1.14) lev:(0.01) conv:(1.2)
8. [WATER=T, Generator=T]: 10 ==> [Beer=T]: 10 <conf:(1)> lift:(1.14) lev:(0.01) conv:(1.2)
9. [Beer=T, Generator=T]: 10 ==> [WATER=T]: 10 <conf:(1)> lift:(1.01) lev:(0) conv:(0.1)
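The lift, leverage and conviction values in the listing can be reproduced from the raw counts with the Python sketch below. It rests on two assumptions that the report does not state: that the data set contains 100 transactions, and that conviction was computed with a +1 smoothing term in the denominator. With those assumptions the numbers agree with the output above, but this is an inference from the values, not a description of the tool's internals.

```python
def rule_metrics(n, count_x, count_y, count_xy):
    """Confidence, lift, leverage and smoothed conviction for X -> Y."""
    confidence = count_xy / count_x
    lift = confidence / (count_y / n)
    leverage = count_xy / n - (count_x / n) * (count_y / n)
    # Textbook conviction is P(X) * P(not Y) / P(X and not Y). The +1 is an
    # assumed smoothing term so that confidence-1 rules do not divide by zero.
    conviction = (count_x * (n - count_y) / n) / (count_x - count_xy + 1)
    return confidence, lift, leverage, conviction

N = 100  # assumed number of transactions; not stated in the report

# Counts taken from the listings: Generator=10, WATER=99, Beer=88.
generator_water = rule_metrics(N, count_x=10, count_y=99, count_xy=10)
generator_beer = rule_metrics(N, count_x=10, count_y=88, count_xy=10)

for name, metrics in [("WATER", generator_water), ("Beer", generator_beer)]:
    conf, lift, lev, conv = metrics
    print(f"Generator -> {name}: conf={conf:.2f} lift={lift:.2f} "
          f"lev={lev:.2f} conv={conv:.2f}")
```

Water appears in almost every basket, so the expected confidence is already 0.99 and the lift stays close to 1; beer is less common, which is why the same confidence of 1 gives a higher lift.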
9 Large Itemsets
To obtain large itemsets that contain 9 items (L(9)), Weka had to be configured with the
following parameters: Apriori, CAR: True, lowerboundMinSupport: 0.1, metricType: confidence,
minMetric: 0.09, numrules: 400, outputitemsets: true. We obtained the following large itemsets
L(9).
Large Itemsets L(9):
Rock salt=T WATER=T Snow shovels=T Blankets=T Protien Bars=T Bath Tissue=T Soap=T Hygine Products=T Milk=T 10
Flashlights=T WATER=T Blankets=T Canned food=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Milk=T 10
Flashlights=T WATER=T Blankets=T Canned food=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Bread=T 10
Flashlights=T WATER=T Blankets=T Canned food=T Protien Bars=T Beer=T Bath Tissue=T Milk=T Bread=T 10
Flashlights=T WATER=T Blankets=T Canned food=T Protien Bars=T Beer=T Soap=T Milk=T Bread=T 10
Flashlights=T WATER=T Blankets=T Canned food=T Protien Bars=T Bath Tissue=T Soap=T Milk=T Bread=T 10
Flashlights=T WATER=T Blankets=T Canned food=T Beer=T Bath Tissue=T Soap=T Milk=T Bread=T 13
Flashlights=T WATER=T Blankets=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Milk=T Bread=T 10
Flashlights=T WATER=T Canned food=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Milk=T Bread=T 10
Flashlights=T Blankets=T Canned food=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Milk=T Bread=T 10
WATER=T Snow shovels=T Blankets=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Hygine Products=T Milk=T
WATER=T Blankets=T Canned food=T Protien Bars=T Beer=T Bath Tissue=T Soap=T Milk=T Bread=T 10
Real-World Association Rules - Healthcare
The Institute for Integrated and Intelligent Systems (IIIS) implemented a system prototype,
named the CSCP system, that applies association rule mining to an (assumed) patient
database to discover patterns of diseases that a patient might carry. The patterns
recognised by this implementation can improve healthcare services and help
medical researchers further explore trends among correlated diseases. The
technique allows the IIIS to generate correlations among diseases. (Rashid)
Real-World Association Rules - Retailing
Retailers collect data every day – such as transactional data, customer demographics
and product sales based on parameters such as seasons and festivals. To convert this data into
knowledge and wisdom, it is necessary to discover and understand the underlying patterns
involved in the organisation’s operations from these data. Analysis of past transaction data is a
commonly used approach in order to improve the quality of such decisions. Extraction of
frequent itemsets is essential towards mining interesting patterns from datasets. A typical
usage scenario for searching frequent patterns is the so called “market basket analysis” that
involves analysing the transactional data of a supermarket or retail store in order to determine
which products are purchased together and how often and also examine customer purchase
preferences. (Prasad, 2011)
Real-World Association Rules - Finance
Bankruptcy prediction is very important for any organization. The approach uses the financial
statement, which comprises both the balance sheet and the income statement, to predict
bankruptcy: financial analysis is applied to the statement, which is then used to build a
bankruptcy prediction model. The association rule mining algorithm improves the efficiency of
the proposed method by providing relevant results based on the associations between the
businesses’ financial statements. (Martin, 2011)
References:
Rashid, M., Hoque, T., Sattar, A. Association Rules Mining Based Clinical Observations.
Institute for Integrated and Intelligent Systems (IIIS). Retrieved from
https://arxiv.org/pdf/1401.2571.pdf
Kouris, I. N., Makris, C., Theodoridis, E., Tsakalidis, A. Association Rules Mining for Retail
Organizations. Retrieved from
http://www.igi-global.com/viewtitlesample.aspx?id=13583&ptid=362&t=association+rules+mining+for+retail+organizations
Prasad, P., Malik, L. Using Association Rule Mining for Extracting Product Sales Patterns in
Retail Store Transactions. 2011. International Journal on Computer Science and Engineering.
Retrieved from http://www.enggjournals.com/ijcse/doc/IJCSE11-03-05-185.pdf
SAS. The Assoc Procedure. Retrieved from
http://support.sas.com/documentation/onlinedoc/miner/em43/assoc.pdf
Martin, A., Manjula, M., Venkatesan, P. A Business Intelligence Model to Predict Bankruptcy
using Financial Domain Ontology with Association Rule Mining Algorithm. 2011. International
Journal of Computer Science. Retrieved from https://arxiv.org/pdf/1109.1087.pdf
Alfaro, F., Solano, J. Apriori vs. FP-Growth for Frequent Item Set Mining. 2016. Retrieved from
http://singularities.com/blog/2015/08/apriori-vs-fpgrowth-for-frequent-item-set-mining
