Data Science with R
Unit V (Part-1) : Association Rules
M. Narasimha Raju
Asst Professor, Dept. of Computer Science & Engineering
Shri Vishnu Engineering College for Women (A),
Association Rules:
 Association Rules
 Overview
 Apriori Algorithm
 Evaluation of Candidate Rules
 Applications of Association Rules
 Example
 Validation and Testing
Regression
 Linear Regression
 Logistic Regression
 Reasons to Choose and Cautions
Overview
 Given a large collection of transactions, in which each transaction consists of one or more items, association rules examine which items are frequently bought together and discover a list of rules that describe this purchasing behavior.
 The goal with association rules is to discover interesting
relationships among the items.
 The relationships that are interesting depend both on the business
context and the nature of the algorithm being used for the
discovery.
The general logic behind association rules
Association
 Each of the uncovered rules is in the form X → Y, meaning that when item X is observed, item Y is also observed. In this case, the left-hand side (LHS) of the rule is X, and the right-hand side (RHS) of the rule is Y.
 Using association rules, patterns can be discovered from the
data that allow the association rule algorithms to disclose
rules of related product purchases.
Market basket analysis
 Each transaction can be viewed as the shopping basket of a customer
that contains one or more items. This is also known as an itemset.
 The term itemset refers to a collection of items or individual entities
that contain some kind of relationship.
 This could be a set of retail items purchased together in one
transaction, a set of hyperlinks clicked on by one user in a single
session, or a set of tasks done in one day.
 An itemset containing k items is called a k-itemset, denoted {item 1, item 2, …, item k}.
 Computation of the association rules is typically based on itemsets.
Apriori - Support
 The Apriori algorithm pioneered the use of support for pruning itemsets and controlling the exponential growth of candidate itemsets.
 Given an itemset L, the support of L is the percentage of
transactions that contain L.
 For example, if 80% of all transactions contain itemset
{bread}, then the support of {bread} is 0.8.
 Similarly, if 60% of all transactions contain itemset {bread,
butter}, then the support of {bread, butter} is 0.6.
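As a quick illustration (a toy five-basket database, not taken from the text), support can be computed by direct counting:

```r
# Toy transaction database (hypothetical): each element is one basket
transactions <- list(
  c("bread", "butter"),
  c("bread", "butter", "jam"),
  c("bread", "butter", "milk"),
  c("bread"),
  c("milk")
)

# Support of an itemset = fraction of transactions containing all its items
support <- function(itemset, transactions) {
  mean(sapply(transactions, function(t) all(itemset %in% t)))
}

support("bread", transactions)               # 0.8
support(c("bread", "butter"), transactions)  # 0.6
```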
Minimum support
 A frequent itemset is one whose items appear together in the transactions often enough, as measured against a minimum support.
 If the minimum support is set at 0.5, any itemset can be considered
a frequent itemset if at least 50% of the transactions contain this
itemset.
 The support of a frequent itemset must therefore be greater than or equal to the minimum support.
 If an itemset is considered frequent, then any subset of the
frequent itemset must also be frequent.
 This is referred to as the Apriori property (or downward closure
property).
Frequent itemsets
 If 60% of the transactions contain {bread,jam}, then at least 60% of all the transactions contain {bread}, and at least 60% contain {jam}.
 In other words, when the support of {bread,jam} is 0.6, the support of {bread} and the support of {jam} are each at least 0.6.
 If itemset {B,C,D} is frequent, then all the subsets of this itemset ({B}, {C}, {D}, {B,C}, {B,D}, {C,D}) must also be frequent itemsets.
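To make the downward-closure property concrete, the subsets of {B,C,D} can be enumerated in base R (a small sketch):

```r
# All non-empty proper subsets of a frequent 3-itemset {B, C, D};
# by the Apriori property, every one of them must also be frequent
items <- c("B", "C", "D")
subsets <- unlist(
  lapply(1:(length(items) - 1), function(k) combn(items, k, simplify = FALSE)),
  recursive = FALSE
)
subsets   # {B}, {C}, {D}, {B,C}, {B,D}, {C,D}
```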
Apriori Algorithm
 The Apriori algorithm takes a bottom-up iterative approach to uncovering the
frequent itemsets by first determining all the possible items (or 1-itemsets, for
example {bread}, {eggs}, {milk}, …) and then identifying which among them are
frequent.
 Assuming the minimum support threshold (or the minimum support criterion)
is set at 0.5, the algorithm identifies and retains those itemsets that appear in at
least 50% of all transactions and discards (or “prunes away”) the itemsets that
have a support less than 0.5 or appear in fewer than 50% of the transactions.
 The word prune is used like it would be in gardening, where unwanted
branches of a bush are clipped away.
Apriori algorithm
 In the next iteration of the Apriori algorithm, the identified frequent 1-itemsets are
paired into 2-itemsets (for example, {bread,eggs}, {bread,milk}, {eggs,milk}, …) and
again evaluated to identify the frequent 2-itemsets among them.
 At each iteration, the algorithm checks whether the support criterion can be met; if it
can, the algorithm grows the itemset, repeating the process until it runs out of
support or until the itemsets reach a predefined length.
 Let variable Ck be the set of candidate k-itemsets and variable Lk be the set of k-
itemsets that satisfy the minimum support. Given a transaction database D, a
minimum support threshold δ, and an optional parameter N indicating the
maximum length an itemset could reach, Apriori iteratively computes frequent
itemsets Lk+1 based on Lk.
Apriori algorithm
 The first step of the Apriori algorithm is to identify the frequent itemsets by
starting with each item in the transactions that meets the predefined
minimum support threshold δ.
 These itemsets are 1-itemsets denoted as L1, as each 1-itemset contains only
one item.
 Next, the algorithm grows the itemsets by joining L1 onto itself to form new,
grown 2-itemsets denoted as L2 and determines the support of each 2-itemset
in L2.
 Those itemsets that do not meet the minimum support threshold δ are
pruned away.
 The growing and pruning process is repeated until no itemsets meet the
minimum support threshold.
 Once completed, the output of the Apriori algorithm is the collection of all the frequent k-itemsets.
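The growing-and-pruning loop can be sketched in base R, reusing the toy baskets from the support example (a toy illustration of the iteration structure only; apriori_toy and its candidate-generation details are hypothetical, and the real algorithm prunes candidates before counting):

```r
# Toy Apriori: all frequent itemsets at minimum support min_sup
apriori_toy <- function(transactions, min_sup) {
  sup <- function(s) mean(sapply(transactions, function(t) all(s %in% t)))
  items <- sort(unique(unlist(transactions)))
  Lk <- Filter(function(s) sup(s) >= min_sup, as.list(items))  # frequent 1-itemsets
  frequent <- Lk
  while (length(Lk) > 0) {
    k <- length(Lk[[1]])
    Ck <- list()  # candidate (k+1)-itemsets grown from pairs in Lk
    for (i in seq_along(Lk)) for (j in seq_along(Lk)) {
      cand <- sort(union(Lk[[i]], Lk[[j]]))
      if (length(cand) == k + 1 &&
          !any(vapply(Ck, identical, logical(1), cand))) {
        Ck[[length(Ck) + 1]] <- cand
      }
    }
    Lk <- Filter(function(s) sup(s) >= min_sup, Ck)  # prune by support
    frequent <- c(frequent, Lk)
  }
  frequent
}

apriori_toy(transactions, min_sup = 0.5)
# {bread}, {butter}, {bread, butter} for the toy baskets
```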
Evaluation of Candidate Rules
 Confidence is defined as the measure of certainty or
trustworthiness associated with each discovered rule.
 Confidence is the percentage of transactions that contain both X and Y out of all the transactions that contain X.
 For example, if {bread, eggs, milk} has a support of 0.15 and {bread, eggs} also has a support of 0.15, the confidence of rule {bread, eggs} → {milk} is 1, which means that 100% of the time a customer buys bread and eggs, milk is bought as well.
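In other words, Confidence(X → Y) = Support(X ∪ Y) / Support(X). A short sketch on the toy baskets defined earlier:

```r
# Confidence of rule X -> Y = support of X and Y together / support of X
confidence <- function(lhs, rhs, transactions) {
  support(c(lhs, rhs), transactions) / support(lhs, transactions)
}

confidence("bread", "butter", transactions)  # 0.6 / 0.8 = 0.75
```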
Evaluation of Candidate Rules
 A relationship may be thought of as interesting when the algorithm
identifies the relationship with a measure of confidence greater than or
equal to a predefined threshold.
 This predefined threshold is called the minimum confidence.
 Lift measures how many times more often X and Y occur together than
expected if they are statistically independent of each other.
 Lift is a measure of how X and Y are really related rather than coincidentally happening together: Lift(X → Y) = Support(X, Y) / (Support(X) × Support(Y)).
 Lift is 1 if X and Y are statistically independent of each other. In contrast, a lift of X → Y greater than 1 indicates that there is some usefulness to the rule. A larger value of lift suggests a greater strength of the association between X and Y.
Evaluation of Candidate Rules
 Assuming 1,000 transactions, with {milk,eggs} appearing in 300 of them, {milk} appearing in 500, and {eggs} appearing in 400, Lift(milk → eggs) = 0.3 / (0.5 × 0.4) = 1.5.
 If {bread} appears in 400 transactions and {milk,bread} appears in 400, then Lift(milk → bread) = 0.4 / (0.5 × 0.4) = 2.
 Therefore, it can be concluded that milk and bread have a stronger association than milk and eggs.
 Leverage measures the difference in the probability of X and Y appearing together
in the dataset compared to what would be expected if X and Y were statistically
independent of each other.
 Leverage is 0 when X and Y are statistically independent of each other.
 If X and Y have some kind of relationship, the leverage would be greater than zero.
 A larger leverage value indicates a stronger relationship between X and Y.
 For the previous example, Leverage(milk → eggs) = 0.3 − (0.5 × 0.4) = 0.1 and Leverage(milk → bread) = 0.4 − (0.5 × 0.4) = 0.2.
 This again confirms that milk and bread have a stronger association than milk and eggs.
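Both worked examples are easy to verify with plain arithmetic on the quoted supports:

```r
# Supports quoted above (out of 1,000 transactions)
s_milk <- 0.5; s_eggs <- 0.4; s_bread <- 0.4
s_milk_eggs <- 0.3; s_milk_bread <- 0.4

# Lift(X -> Y) = Support(X, Y) / (Support(X) * Support(Y))
s_milk_eggs  / (s_milk * s_eggs)    # 1.5
s_milk_bread / (s_milk * s_bread)   # 2.0

# Leverage(X -> Y) = Support(X, Y) - Support(X) * Support(Y)
s_milk_eggs  - s_milk * s_eggs      # 0.1
s_milk_bread - s_milk * s_bread    # 0.2
```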
Applications of Association Rules
 Broad-scale approaches to better merchandising—what
products should be included in or excluded from the inventory
each month
 Cross-merchandising between products and high-margin or high-
ticket items
 Physical or logical placement of product within related categories
of products
 Promotional programs—multiple product purchase incentives
managed through a loyalty card program
Recommendation Systems
 Many online service providers such as Amazon and Netflix use
recommender systems.
 Recommender systems can use association rules to discover
related products or identify customers who have similar interests.
 For example, association rules may suggest that customers who have bought product A have also bought product B, or may identify customers who have bought products A, B, and C as the most similar to a given customer.
 These findings provide opportunities for retailers to cross-sell
their products.
Clickstream analysis
 Clickstream analysis refers to the analytics on data related to web
browsing and user clicks, which is stored on the client or the server
side.
 Web usage log files generated on web servers contain huge amounts of
information, and association rules can potentially give useful
knowledge to web usage data analysts.
 For example, association rules may suggest that website visitors who
land on page X click on links A, B, and C much more often than links D,
E, and F.
 This observation provides valuable insight on how to better
personalize and recommend the content to site visitors.
An Example: Transactions in a Grocery Store
 Using R and the arules and arulesViz packages
 The Groceries Dataset
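A typical setup for the example looks like this (a sketch; the install step is only needed the first time):

```r
# install.packages(c("arules", "arulesViz"))  # once, if not installed
library(arules)      # Apriori implementation and transactions class
library(arulesViz)   # visualization of association rules

data("Groceries")    # grocery point-of-sale data shipped with arules
summary(Groceries)
```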
 The class of the dataset is transactions,
as defined by the arules package. The
transactions class contains three slots:
 transactionInfo: A data frame with
vectors of the same length as the
number of transactions
 itemInfo: A data frame to store item
labels
 data: A binary incidence matrix that indicates which item labels appear in each transaction
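These slots can be inspected with the accessors that arules provides (a short sketch):

```r
class(Groceries)            # "transactions"
dim(Groceries)              # number of transactions x number of items
head(itemInfo(Groceries))   # item labels plus category levels
inspect(Groceries[1:2])     # the first two baskets
```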
Frequent Itemset Generation
 The apriori() function from the arules package implements the Apriori algorithm to create frequent itemsets.
 Note that, by default, the apriori() function executes all the iterations at once.
 Assume that the minimum support threshold is set to 0.02 based on management
discretion.
 Because the dataset contains 9,835 transactions, an itemset should appear at least 197 times (0.02 × 9,835 ≈ 196.7) to be considered a frequent itemset.
 The first iteration of the Apriori algorithm computes the support of each product in the
dataset and retains those products that satisfy the minimum support.
 The following code identifies 59 frequent 1-itemsets that satisfy the minimum support.
 The parameters of apriori() specify the minimum and maximum lengths of the itemsets,
the minimum support threshold, and the target indicating the type of association mined.
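A sketch of the corresponding call (the arguments follow the description above; counts may vary with the arules version):

```r
# Frequent 1-itemsets at minimum support 0.02
itemsets <- apriori(Groceries,
                    parameter = list(minlen = 1, maxlen = 1,
                                     support = 0.02,
                                     target = "frequent itemsets"))
summary(itemsets)                                  # 59 itemsets at this threshold
inspect(head(sort(itemsets, by = "support"), 10))  # ten most frequent items
```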
Rule Generation and Visualization
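The rules discussed on the following slides can be generated along these lines (a sketch; the 0.001 support and 0.6 confidence thresholds are assumed to be the ones that yield the 2,918 rules mentioned below):

```r
# Mine association rules at minimum support 0.001 and confidence 0.6
rules <- apriori(Groceries,
                 parameter = list(support = 0.001,
                                  confidence = 0.6,
                                  target = "rules"))
summary(rules)   # 2,918 rules at these thresholds
```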
plot(rules)
 The scatterplot shows that, of the 2,918 rules generated from the
Groceries dataset, the highest lift occurs at a low support and a
low confidence.
 Entering plot(rules@quality) displays a scatterplot matrix (Figure 5-4) comparing the support, confidence, and lift of the 2,918 rules. In the matrix, lift is proportional to confidence and shows several linear groupings.
 Lift(X → Y) = Confidence(X → Y) / Support(Y).
 When the support of Y remains the same, lift is proportional to confidence, and the slope of the linear trend is the reciprocal of Support(Y).
Validation and Testing
 After gathering the output rules, it may become necessary to
use one or more methods to validate the results in the
business context for the sample dataset.
 The first approach can be established through statistical
measures such as confidence, lift, and leverage.
 Rules that involve mutually independent items or cover few
transactions are considered uninteresting because they may
capture spurious relationships.
 Confidence measures the chance that X and Y appear together in relation to the
chance X appears.
 Confidence can be used to identify the interestingness of the rules.
 Lift and leverage both compare the support of X and Y against their individual
support.
 While mining data with association rules, some rules generated could be purely
coincidental.
 For example, if 95% of customers buy X and 90% of customers buy Y, then X and Y would occur together at least 85% of the time (0.95 + 0.90 − 1 = 0.85), even if there is no relationship between the two.
 Measures like lift and leverage help ensure that the rules identified are genuinely interesting rather than coincidental.
Diagnostics
 Although the Apriori algorithm is easy to understand and implement,
some of the rules generated are uninteresting or practically useless.
 Additionally, some of the rules may be generated due to coincidental
relationships between the variables.
 Measures like confidence, lift, and leverage should be used along with
human insights to address this problem.
 The Apriori algorithm reduces the computational workload by only
examining itemsets that meet the specified minimum threshold.
 However, depending on the size of the dataset, the Apriori algorithm
can be computationally expensive.
 For each level of support, the algorithm requires a scan of the entire
database to obtain the result.
Approaches to improve Apriori’s efficiency:
 Partitioning: Any itemset that is potentially frequent in a transaction
database must be frequent in at least one of the partitions of the transaction
database.
 Sampling: This extracts a subset of the data with a lower support threshold and uses the subset to perform association rule mining (see the sketch after this list).
 Transaction reduction: A transaction that does not contain frequent k-
itemsets is useless in subsequent scans and therefore can be ignored.
 Hash-based itemset counting: If the corresponding hashing bucket count of
a k-itemset is below a certain threshold, the k-itemset cannot be frequent.
 Dynamic itemset counting: Only add new candidate itemsets when all of
their subsets are estimated to be frequent.
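As a concrete illustration of the sampling idea, arules can sample transactions directly (a rough sketch; the subset size and the lowered threshold are hypothetical choices):

```r
# Sampling: mine a random subset at a lowered support threshold, then
# recount the surviving itemsets on the full database
set.seed(42)
subset_tx <- sample(Groceries, 2000)   # roughly a fifth of the transactions

candidates <- apriori(subset_tx,
                      parameter = list(support = 0.015,  # lowered threshold
                                       target = "frequent itemsets"))

# Keep only the candidates that meet the original 0.02 threshold overall
full_sup <- support(items(candidates), Groceries)
verified <- candidates[full_sup >= 0.02]
```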
Data Science with R
Unit V (Part-2) : Association Rules With R Programming
M. Narasimha Raju
Asst Professor, Dept. of Computer Science & Engineering
Shri Vishnu Engineering College for Women (A),
Thank You