Data Science with R
Unit V (Part-1) : Association Rules
M. Narasimha Raju
Asst Professor, Dept. of Computer Science & Engineering
Shri Vishnu Engineering College for Women (A),
Association Rules:
 Association Rules
 Overview
 Apriori Algorithm
 Evaluation of Candidate Rules
 Applications of Association Rules
 Example
 Validation and Testing
Regression
 Linear Regression
 Logistic Regression
 Reasons to Choose and Cautions
Overview
 Given a large collection of transactions, in which each transaction consists of one or more items, association rules examine which items are frequently bought together and discover a list of rules that describe this purchasing behavior.
 The goal with association rules is to discover interesting
relationships among the items.
 The relationships that are interesting depend both on the business
context and the nature of the algorithm being used for the
discovery.
The general logic behind association rules
Association
 Each of the uncovered rules is in the form X → Y, meaning that when item X is observed, item Y is also observed. In this case, the left-hand side (LHS) of the rule is X, and the right-hand side (RHS) of the rule is Y.
 Using association rules, patterns can be discovered from the
data that allow the association rule algorithms to disclose
rules of related product purchases.
Market basket analysis
 Each transaction can be viewed as the shopping basket of a customer
that contains one or more items. This is also known as an itemset.
 The term itemset refers to a collection of items or individual entities
that contain some kind of relationship.
 This could be a set of retail items purchased together in one
transaction, a set of hyperlinks clicked on by one user in a single
session, or a set of tasks done in one day.
 An itemset containing k items is called a k-itemset, denoted {item 1, item 2, …, item k}.
 Computation of the association rules is typically based on itemsets.
Apriori - Support
 The Apriori algorithm pioneered the use of support for pruning itemsets and controlling the exponential growth of candidate itemsets.
 Given an itemset L, the support of L is the percentage of
transactions that contain L.
 For example, if 80% of all transactions contain itemset
{bread}, then the support of {bread} is 0.8.
 Similarly, if 60% of all transactions contain itemset {bread,
butter}, then the support of {bread, butter} is 0.6.
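As a quick illustration (a toy five-basket database, not taken from the text), support can be computed by direct counting:

```r
# Toy transaction database (hypothetical): each element is one basket
transactions <- list(
  c("bread", "butter"),
  c("bread", "butter", "jam"),
  c("bread", "butter", "milk"),
  c("bread"),
  c("milk")
)

# Support of an itemset = fraction of transactions containing all its items
support <- function(itemset, transactions) {
  mean(sapply(transactions, function(t) all(itemset %in% t)))
}

support("bread", transactions)               # 0.8
support(c("bread", "butter"), transactions)  # 0.6
```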
Minimum support
 A frequent itemset is one whose items appear together in the transactions often enough, as measured against a minimum support.
 If the minimum support is set at 0.5, any itemset can be considered
a frequent itemset if at least 50% of the transactions contain this
itemset.
 The support of a frequent itemset must therefore be greater than or equal to the minimum support.
 If an itemset is considered frequent, then any subset of the
frequent itemset must also be frequent.
 This is referred to as the Apriori property (or downward closure
property).
Frequent itemsets
 If 60% of the transactions contain {bread,jam}, then at least 60% of all the transactions contain {bread}, and at least 60% contain {jam}.
 In other words, when the support of {bread,jam} is 0.6, the support of {bread} and the support of {jam} are each at least 0.6.
 If itemset {B,C,D} is frequent, then all the subsets of this itemset ({B}, {C}, {D}, {B,C}, {B,D}, {C,D}) must also be frequent itemsets.
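To make the downward-closure property concrete, the subsets of {B,C,D} can be enumerated in base R (a small sketch):

```r
# All non-empty proper subsets of a frequent 3-itemset {B, C, D};
# by the Apriori property, every one of them must also be frequent
items <- c("B", "C", "D")
subsets <- unlist(
  lapply(1:(length(items) - 1), function(k) combn(items, k, simplify = FALSE)),
  recursive = FALSE
)
subsets   # {B}, {C}, {D}, {B,C}, {B,D}, {C,D}
```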
Apriori Algorithm
 The Apriori algorithm takes a bottom-up iterative approach to uncovering the
frequent itemsets by first determining all the possible items (or 1-itemsets, for
example {bread}, {eggs}, {milk}, …) and then identifying which among them are
frequent.
 Assuming the minimum support threshold (or the minimum support criterion)
is set at 0.5, the algorithm identifies and retains those itemsets that appear in at
least 50% of all transactions and discards (or “prunes away”) the itemsets that
have a support less than 0.5 or appear in fewer than 50% of the transactions.
 The word prune is used like it would be in gardening, where unwanted
branches of a bush are clipped away.
Apriori algorithm
 In the next iteration of the Apriori algorithm, the identified frequent 1-itemsets are
paired into 2-itemsets (for example, {bread,eggs}, {bread,milk}, {eggs,milk}, …) and
again evaluated to identify the frequent 2-itemsets among them.
 At each iteration, the algorithm checks whether the support criterion can be met; if it
can, the algorithm grows the itemset, repeating the process until it runs out of
support or until the itemsets reach a predefined length.
 Let variable Ck be the set of candidate k-itemsets and variable Lk be the set of k-
itemsets that satisfy the minimum support. Given a transaction database D, a
minimum support threshold δ, and an optional parameter N indicating the
maximum length an itemset could reach, Apriori iteratively computes frequent
itemsets Lk+1 based on Lk.
Apriori algorithm
 The first step of the Apriori algorithm is to identify the frequent itemsets by
starting with each item in the transactions that meets the predefined
minimum support threshold δ.
 These itemsets are 1-itemsets denoted as L1, as each 1-itemset contains only
one item.
 Next, the algorithm grows the itemsets by joining L1 onto itself to form new,
grown 2-itemsets denoted as L2 and determines the support of each 2-itemset
in L2.
 Those itemsets that do not meet the minimum support threshold δ are
pruned away.
 The growing and pruning process is repeated until no itemsets meet the
minimum support threshold.
 Once completed, the output of the Apriori algorithm is the collection of all the frequent k-itemsets.
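The growing-and-pruning loop can be sketched in base R, reusing the toy baskets from the support example (a toy illustration of the iteration structure only; apriori_toy and its candidate-generation details are hypothetical, and the real algorithm prunes candidates before counting):

```r
# Toy Apriori: all frequent itemsets at minimum support min_sup
apriori_toy <- function(transactions, min_sup) {
  sup <- function(s) mean(sapply(transactions, function(t) all(s %in% t)))
  items <- sort(unique(unlist(transactions)))
  Lk <- Filter(function(s) sup(s) >= min_sup, as.list(items))  # frequent 1-itemsets
  frequent <- Lk
  while (length(Lk) > 0) {
    k <- length(Lk[[1]])
    Ck <- list()  # candidate (k+1)-itemsets grown from pairs in Lk
    for (i in seq_along(Lk)) for (j in seq_along(Lk)) {
      cand <- sort(union(Lk[[i]], Lk[[j]]))
      if (length(cand) == k + 1 &&
          !any(vapply(Ck, identical, logical(1), cand))) {
        Ck[[length(Ck) + 1]] <- cand
      }
    }
    Lk <- Filter(function(s) sup(s) >= min_sup, Ck)  # prune by support
    frequent <- c(frequent, Lk)
  }
  frequent
}

apriori_toy(transactions, min_sup = 0.5)
# {bread}, {butter}, {bread, butter} for the toy baskets
```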
Evaluation of Candidate Rules
 Confidence is defined as the measure of certainty or
trustworthiness associated with each discovered rule.
 Confidence is the percentage of transactions that contain both X and Y out of all the transactions that contain X.
 For example, if {bread, eggs, milk} has a support of 0.15 and {bread, eggs} also has a support of 0.15, the confidence of rule {bread, eggs} → {milk} is 1, which means that 100% of the time a customer buys bread and eggs, milk is bought as well.
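In other words, Confidence(X → Y) = Support(X ∪ Y) / Support(X). A short sketch on the toy baskets defined earlier:

```r
# Confidence of rule X -> Y = support of X and Y together / support of X
confidence <- function(lhs, rhs, transactions) {
  support(c(lhs, rhs), transactions) / support(lhs, transactions)
}

confidence("bread", "butter", transactions)  # 0.6 / 0.8 = 0.75
```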
Evaluation of Candidate Rules
 A relationship may be thought of as interesting when the algorithm
identifies the relationship with a measure of confidence greater than or
equal to a predefined threshold.
 This predefined threshold is called the minimum confidence.
 Lift measures how many times more often X and Y occur together than
expected if they are statistically independent of each other.
 Lift is a measure of how X and Y are really related rather than coincidentally happening together: Lift(X → Y) = Support(X, Y) / (Support(X) × Support(Y)).
 Lift is 1 if X and Y are statistically independent of each other. In contrast, a lift of X → Y greater than 1 indicates that there is some usefulness to the rule. A larger value of lift suggests a greater strength of the association between X and Y.
Evaluation of Candidate Rules
 Assuming 1,000 transactions, with {milk,eggs} appearing in 300 of them, {milk} appearing in 500, and {eggs} appearing in 400, Lift(milk → eggs) = 0.3 / (0.5 × 0.4) = 1.5.
 If {bread} appears in 400 transactions and {milk,bread} appears in 400, then Lift(milk → bread) = 0.4 / (0.5 × 0.4) = 2.
 Therefore, it can be concluded that milk and bread have a stronger association than milk and eggs.
 Leverage measures the difference in the probability of X and Y appearing together
in the dataset compared to what would be expected if X and Y were statistically
independent of each other.
 Leverage is 0 when X and Y are statistically independent of each other.
 If X and Y have some kind of relationship, the leverage would be greater than zero.
 A larger leverage value indicates a stronger relationship between X and Y.
 For the previous example, Leverage(milk → eggs) = 0.3 − (0.5 × 0.4) = 0.1 and Leverage(milk → bread) = 0.4 − (0.5 × 0.4) = 0.2.
 This again confirms that milk and bread have a stronger association than milk and eggs.
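Both worked examples are easy to verify with plain arithmetic on the quoted supports:

```r
# Supports quoted above (out of 1,000 transactions)
s_milk <- 0.5; s_eggs <- 0.4; s_bread <- 0.4
s_milk_eggs <- 0.3; s_milk_bread <- 0.4

# Lift(X -> Y) = Support(X, Y) / (Support(X) * Support(Y))
s_milk_eggs  / (s_milk * s_eggs)    # 1.5
s_milk_bread / (s_milk * s_bread)   # 2.0

# Leverage(X -> Y) = Support(X, Y) - Support(X) * Support(Y)
s_milk_eggs  - s_milk * s_eggs      # 0.1
s_milk_bread - s_milk * s_bread    # 0.2
```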
Applications of Association Rules
 Broad-scale approaches to better merchandising—what
products should be included in or excluded from the inventory
each month
 Cross-merchandising between products and high-margin or high-
ticket items
 Physical or logical placement of product within related categories
of products
 Promotional programs—multiple product purchase incentives
managed through a loyalty card program
Recommendation Systems
 Many online service providers such as Amazon and Netflix use
recommender systems.
 Recommender systems can use association rules to discover
related products or identify customers who have similar interests.
 For example, association rules may suggest that customers who have bought product A have also bought product B, or may identify customers who have bought products A, B, and C as the most similar to a given customer.
 These findings provide opportunities for retailers to cross-sell
their products.
Clickstream analysis
 Clickstream analysis refers to the analytics on data related to web
browsing and user clicks, which is stored on the client or the server
side.
 Web usage log files generated on web servers contain huge amounts of
information, and association rules can potentially give useful
knowledge to web usage data analysts.
 For example, association rules may suggest that website visitors who
land on page X click on links A, B, and C much more often than links D,
E, and F.
 This observation provides valuable insight on how to better
personalize and recommend the content to site visitors.
An Example: Transactions in a Grocery Store
 Using R and the arules and arulesViz packages
 The Groceries Dataset
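A typical setup for the example looks like this (a sketch; the install step is only needed the first time):

```r
# install.packages(c("arules", "arulesViz"))  # once, if not installed
library(arules)      # Apriori implementation and transactions class
library(arulesViz)   # visualization of association rules

data("Groceries")    # grocery point-of-sale data shipped with arules
summary(Groceries)
```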
 The class of the dataset is transactions,
as defined by the arules package. The
transactions class contains three slots:
 transactionInfo: A data frame with
vectors of the same length as the
number of transactions
 itemInfo: A data frame to store item
labels
 data: A binary incidence matrix that indicates which item labels appear in each transaction
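These slots can be inspected with the accessors that arules provides (a short sketch):

```r
class(Groceries)            # "transactions"
dim(Groceries)              # number of transactions x number of items
head(itemInfo(Groceries))   # item labels plus category levels
inspect(Groceries[1:2])     # the first two baskets
```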
Frequent Itemset Generation
 The apriori() function from the arules package implements the Apriori algorithm to create frequent itemsets.
 Note that, by default, the apriori() function executes all the iterations at once.
 Assume that the minimum support threshold is set to 0.02 based on management
discretion.
 Because the dataset contains 9,835 transactions, an itemset should appear at least 197 times (0.02 × 9,835 ≈ 196.7) to be considered a frequent itemset.
 The first iteration of the Apriori algorithm computes the support of each product in the
dataset and retains those products that satisfy the minimum support.
 The following code identifies 59 frequent 1-itemsets that satisfy the minimum support.
 The parameters of apriori() specify the minimum and maximum lengths of the itemsets,
the minimum support threshold, and the target indicating the type of association mined.
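A sketch of the corresponding call (the arguments follow the description above; counts may vary with the arules version):

```r
# Frequent 1-itemsets at minimum support 0.02
itemsets <- apriori(Groceries,
                    parameter = list(minlen = 1, maxlen = 1,
                                     support = 0.02,
                                     target = "frequent itemsets"))
summary(itemsets)                                  # 59 itemsets at this threshold
inspect(head(sort(itemsets, by = "support"), 10))  # ten most frequent items
```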
Rule Generation and Visualization
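The rules discussed on the following slides can be generated along these lines (a sketch; the 0.001 support and 0.6 confidence thresholds are assumed to be the ones that yield the 2,918 rules mentioned below):

```r
# Mine association rules at minimum support 0.001 and confidence 0.6
rules <- apriori(Groceries,
                 parameter = list(support = 0.001,
                                  confidence = 0.6,
                                  target = "rules"))
summary(rules)   # 2,918 rules at these thresholds
```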
plot(rules)
 The scatterplot shows that, of the 2,918 rules generated from the
Groceries dataset, the highest lift occurs at a low support and a
low confidence.
 Entering plot(rules@quality) displays a scatterplot matrix (Figure 5-4) comparing the support, confidence, and lift of the 2,918 rules. In the matrix, lift is proportional to confidence and shows several linear groupings.
 Lift(X → Y) = Confidence(X → Y) / Support(Y).
 When the support of Y remains the same, lift is proportional to confidence, and the slope of the linear trend is the reciprocal of Support(Y).
Validation and Testing
 After gathering the output rules, it may become necessary to
use one or more methods to validate the results in the
business context for the sample dataset.
 The first approach can be established through statistical
measures such as confidence, lift, and leverage.
 Rules that involve mutually independent items or cover few
transactions are considered uninteresting because they may
capture spurious relationships.
 Confidence measures the chance that X and Y appear together in relation to the
chance X appears.
 Confidence can be used to identify the interestingness of the rules.
 Lift and leverage both compare the support of X and Y against their individual
support.
 While mining data with association rules, some rules generated could be purely
coincidental.
 For example, if 95% of customers buy X and 90% of customers buy Y, then X and Y would occur together at least 85% of the time (0.95 + 0.90 − 1 = 0.85), even if there is no relationship between the two.
 Measures like lift and leverage help ensure that the rules identified are genuinely interesting rather than coincidental.
Diagnostics
 Although the Apriori algorithm is easy to understand and implement,
some of the rules generated are uninteresting or practically useless.
 Additionally, some of the rules may be generated due to coincidental
relationships between the variables.
 Measures like confidence, lift, and leverage should be used along with
human insights to address this problem.
 The Apriori algorithm reduces the computational workload by only
examining itemsets that meet the specified minimum threshold.
 However, depending on the size of the dataset, the Apriori algorithm
can be computationally expensive.
 For each level of support, the algorithm requires a scan of the entire
database to obtain the result.
Approaches to improve Apriori’s efficiency:
 Partitioning: Any itemset that is potentially frequent in a transaction
database must be frequent in at least one of the partitions of the transaction
database.
 Sampling: This extracts a subset of the data with a lower support threshold and uses the subset to perform association rule mining (see the sketch after this list).
 Transaction reduction: A transaction that does not contain frequent k-
itemsets is useless in subsequent scans and therefore can be ignored.
 Hash-based itemset counting: If the corresponding hashing bucket count of
a k-itemset is below a certain threshold, the k-itemset cannot be frequent.
 Dynamic itemset counting: Only add new candidate itemsets when all of
their subsets are estimated to be frequent.
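As a concrete illustration of the sampling idea, arules can sample transactions directly (a rough sketch; the subset size and the lowered threshold are hypothetical choices):

```r
# Sampling: mine a random subset at a lowered support threshold, then
# recount the surviving itemsets on the full database
set.seed(42)
subset_tx <- sample(Groceries, 2000)   # roughly a fifth of the transactions

candidates <- apriori(subset_tx,
                      parameter = list(support = 0.015,  # lowered threshold
                                       target = "frequent itemsets"))

# Keep only the candidates that meet the original 0.02 threshold overall
full_sup <- support(items(candidates), Groceries)
verified <- candidates[full_sup >= 0.02]
```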
Data Science with R
Unit V (Part-2) : Association Rules With R Programming
M. Narasimha Raju
Asst Professor, Dept. of Computer Science & Engineering
Shri Vishnu Engineering College for Women (A),
Thank You