DATA MINING & DATA WAREHOUSES
S4 BCA – KU – MODULE II NOTES ( PREPARED BY VINEETH P )
CHRIST NAGAR COLLEGE , MARANALLOOR
SYLLABUS - MODULE II
What is Market Basket Analysis
Market basket analysis is a data mining technique used by retailers to increase sales by better
understanding customer purchasing patterns. It involves analysing large data sets, such as
purchase history, to reveal product groupings, as well as products that are likely to be purchased
together.
The adoption of market basket analysis was aided by the advent of electronic point-of-sale (POS)
systems. Compared to handwritten records kept by store owners, the digital records generated by
POS systems made it easier for applications to process and analyse large volumes of purchase
data.
TYPES OF MARKET BASKET ANALYSIS
Retailers should understand the following types of market basket analysis:
•Predictive market basket analysis. This type considers items purchased in sequence to
determine what a customer is likely to buy next, supporting cross-selling.
•Differential market basket analysis. This type considers data across different stores, as well as
purchases from different customer groups during different times of the day, month or year. If a
rule holds in one dimension, such as store, time period or customer group, but does not hold in
the others, analysts can determine the factors responsible for the exception. These insights can
lead to new product offers that drive higher sales.
ALGORITHM FOR MARKET BASKET ANALYSIS
In market basket analysis, association rules are used to predict the likelihood of products being
purchased together. Association rules count the frequency of items that occur together,
seeking to find associations that occur far more often than expected.
Algorithms that use association rules include AIS, SETM and Apriori. The Apriori algorithm is
commonly cited by data scientists in research articles about market basket analysis; it identifies
frequent itemsets in the database and then extends them, level by level, to larger itemsets.
Well Known Example
Amazon's website provides a well-known example of market basket analysis. On a
product page, Amazon presents users with related products under the headings
"Frequently bought together" and "Customers who bought this item also
bought."
Benefits of Market-Basket Analysis
Market basket analysis can increase sales and customer satisfaction. Using data
to determine that products are often purchased together, retailers can optimize
product placement, offer special deals and create new product bundles to
encourage further sales of these combinations.
These improvements can generate additional sales for the retailer, while making
the shopping experience more productive and valuable for customers. As a result
of market basket analysis, customers may feel a stronger sentiment or brand
loyalty toward the company.
APRIORI ALGORITHM - INTRODUCTION
R. Agrawal and R. Srikant are the creators of the Apriori algorithm. They created it in 1994 for
mining frequent itemsets using Boolean association rules. The algorithm has
found great use in performing Market Basket Analysis, allowing businesses to sell their products
more effectively.
The use of this algorithm is not limited to market basket analysis. Various fields, like healthcare,
education, etc., also use it. Its widespread use is primarily due to its simple yet effective
implementation: it exploits the properties of previously found frequent itemsets, which greatly
improves the efficiency of level-wise generation of frequent item-sets.
Terms in Apriori Algorithm
•Itemset
An item-set is a set of items taken together. An itemset containing k unique items is
called a k-itemset. Typically, an itemset contains at least two items.
•Frequent Itemset
The next important concept is the frequent itemset. A frequent itemset is an itemset
that occurs frequently in the transactions, i.e., at least as often as a minimum support
threshold. For example, a frequent itemset could be {bread, butter}, {chips, cold drink},
{laptop, antivirus software}, etc.
Support is a metric that indicates how often products or items are purchased together
(in a single transaction). Confidence indicates how often one item is purchased in the
transactions that also contain another item.
Support(X) means – how many times item X was purchased out of the total number of
transactions
Support(X ^ Y) means – how many times items X and Y were purchased together out of
the total number of transactions
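As a minimal sketch (not part of the original notes; the transactions shown are illustrative),
these counts can be computed directly from a list of transactions in Python:

    def support_count(itemset, transactions):
        # Number of transactions that contain every item in `itemset`.
        return sum(1 for t in transactions if itemset <= t)

    transactions = [
        {"bread", "butter"},
        {"bread", "milk"},
        {"bread", "butter", "milk"},
        {"chips", "cold drink"},
    ]
    print(support_count({"bread"}, transactions))            # Support(bread) = 3
    print(support_count({"bread", "butter"}, transactions))  # Support(bread ^ butter) = 2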
Process of extracting frequent item-sets
Mining frequent item-sets is the process of identifying them, using specific thresholds
for Support and Confidence to define which item-sets count as frequent. The difficulty,
however, is finding the correct threshold values for these metrics.
Normally the minimum support threshold (called min_sup) will be given in the problem
statement itself.
To further explain the Apriori Algorithm, we need to understand Association
Rule Mining. The Apriori algorithm works by finding relationships among
numerous items in a dataset. The method known as association rule mining
makes this discovery.
For example, in a supermarket, a pattern emerges where people buy certain
items together. To make the example more concrete, let's assume that individuals
buy cold drinks and chips together. Similarly, customers often put notebooks and
pens together in a purchase.
Through association rule mining, you, as a supermarket owner, can leverage
identified relationships to boost sales. Strategies like packaging associated
products together, placing them in close proximity, offering group discounts, and
optimizing inventory management can lead to increased profits.
Support of an Item
Support indicates an item’s popularity, calculated by counting the transactions
where that particular item was present. For item ‘Z,’ its Support would be the
number of times the item was purchased, as the transaction data indicates.
Sometimes, this count is divided by the total number of transactions to make
the number easily representable. Let’s understand Support with an example.
Suppose there is transaction data for a day having 1,000 transactions.
The items you are interested in are apples, oranges, and apples+oranges (a
combination item). Now, you count the transactions where these items were
bought and find that the counts for apples, oranges, and apples+oranges are 200,
150, and 100, respectively.
The formula for Support is-
Support (Z) = Transactions containing item Z / Total transactions
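Plugging in the counts above:
Support(Apples) = 200 / 1000 = 0.20
Support(Oranges) = 150 / 1000 = 0.15
Support(Apples ^ Oranges) = 100 / 1000 = 0.10
These values (0.20 and 0.15 in particular) are reused in the Confidence and Lift examples
that follow.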
In the Apriori algorithm, such a metric is used to calculate the “support” for
different items and item-sets to establish that the frequency of the item-sets is
enough to be considered for generating candidate item-sets for the next iteration.
Here, the support threshold plays a crucial role as it’s used to define items/item-
sets that are not frequent enough.
Confidence of a rule
This key metric is used in the Apriori algorithm to indicate the probability of an
item ‘Z’ being purchased if a customer has bought an item ‘Y’. If you notice, a
conditional probability is being calculated here: the conditional probability that
item Z appears in a transaction, given that item Y appears in the same
transaction. Therefore, the formula for calculating Confidence is
Confidence(Y → Z) = P(Z|Y) = P(Y and Z) / P(Y)
It can also be written as
Confidence(Y → Z) = Support(Y ∪ Z) / Support(Y)
Ex:
Confidence (Apples → Oranges) = 100 / 200 = 0.5
[ Meaning that when apples are purchased, there is a 50% chance that the
customer also buys oranges ]
Lift to determine strength of a rule
Lift denotes the strength of an association rule. Suppose you need to calculate
the Lift(Y → Z); then you can do so by dividing Confidence(Y → Z) by Support(Z),
i.e.,
Lift(Y -> Z) = Confidence(Y -> Z) / Support(Z)
Another way of calculating Lift is by considering Support of (Y, Z) and dividing by
Support(Y)*Support(Z), i.e., it’s the ratio of Support of two items occurring together
to the Support of the individual items multiplied together.
In the above example, the Lift for Apples → Oranges would be the following-
Lift(Apple -> Orange) = Confidence(Apple -> Orange) / Support(Orange)
Lift(Apple -> Orange) = 0.5 / 0.15
Lift(Apple -> Orange) = 3.33
Interpreting Lift Value
❖A Lift value of 1 generally indicates randomness, suggesting independent
items, and the association rule can be disregarded.
❖A value above 1 signifies a positive association, indicating that two items will
likely be purchased together.
❖Conversely, a value below 1 indicates a negative association, suggesting that
the two items are more likely to be purchased separately.
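As a quick check of these formulas, here is a minimal Python sketch (written for these
notes; the variable names are illustrative) that reproduces the apples/oranges numbers:

    total = 1000
    count_apples, count_oranges, count_both = 200, 150, 100

    support_apples = count_apples / total      # 0.20
    support_oranges = count_oranges / total    # 0.15
    support_both = count_both / total          # 0.10

    confidence = count_both / count_apples     # 100 / 200 = 0.5
    lift = confidence / support_oranges        # 0.5 / 0.15 = 3.33
    # Equivalently: lift = support_both / (support_apples * support_oranges)
    print(round(confidence, 2), round(lift, 2))  # 0.5 3.33

A lift of 3.33 (> 1) indicates a positive association between apples and oranges.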
Steps in Apriori Algorithm
1. Start
2. Define the minimum threshold
3. Create a list of frequent items
4. Create candidate item-sets
5. Calculate the support of each candidate
6. Prune the candidate item-sets
7. Repeat steps 4 to 6 until no new frequent item-sets can be generated ( iteration )
8. Generate association rules
9. Evaluate association rules
10. Stop
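These steps can be summarised in a compact Python sketch (written for these notes, not
taken from a library; the function and variable names are illustrative). It generates frequent
item-sets level by level using the join and prune steps:

    from itertools import combinations

    def apriori(transactions, min_sup):
        # transactions: list of sets of items; min_sup: minimum support COUNT.
        # Returns a dict mapping each frequent itemset to its support count.
        items = {i for t in transactions for i in t}
        counts = {frozenset([i]): sum(1 for t in transactions if i in t) for i in items}
        Lk = {s: c for s, c in counts.items() if c >= min_sup}   # frequent 1-itemsets
        frequent = dict(Lk)
        k = 2
        while Lk:
            # Join step: merge frequent (k-1)-itemsets that share k-2 items.
            candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
            # Prune step: every (k-1)-subset of a candidate must itself be frequent.
            candidates = {c for c in candidates
                          if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
            # Count support of the surviving candidates in one pass over the data.
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            Lk = {s: n for s, n in counts.items() if n >= min_sup}
            frequent.update(Lk)
            k += 1
        return frequent

Association rules are then generated from the returned frequent item-sets and evaluated
with the Confidence (and Lift) formulas above.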
Example Problem
Consider the following dataset; we will find the frequent itemsets and generate association rules
for them.
Minimum support count is 2.
Minimum confidence is 60%.
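The dataset (from the GeeksforGeeks article cited in the References, which this
walkthrough follows) is:

TID | Items
T1  | I1, I2, I5
T2  | I2, I4
T3  | I2, I3
T4  | I1, I2, I4
T5  | I1, I3
T6  | I2, I3
T7  | I1, I3
T8  | I1, I2, I3, I5
T9  | I1, I2, I3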
Step-1: K=1
(I) Create a table containing the support count of each
item present in the dataset – called C1 (candidate set)
(II) Compare each candidate set item’s support count with
the minimum support count (here min_support=2; if the
support_count of a candidate set item is less than
min_support, then remove that item). This gives us
itemset L1
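Reconstructed from the cited example, C1 is as follows; since every count meets
min_support = 2, L1 is identical to C1:

Itemset | Support count
{I1}    | 6
{I2}    | 7
{I3}    | 6
{I4}    | 2
{I5}    | 2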
Step-2: K=2
•Generate candidate set C2 using L1 (this is
called the join step). The condition for joining Lk-1
with Lk-1 is that the itemsets should have (K-2)
elements in common.
•Check whether all subsets of each itemset are
frequent or not, and if not frequent, remove that
itemset. (Example: the subsets of {I1, I2} are {I1} and {I2};
they are frequent. Check this for each itemset.)
•Now find the support count of these itemsets by
searching the dataset.
(II) Compare each candidate (C2) support count with the minimum support count (here
min_support=2; if the support_count of a candidate set item is less than min_support,
then remove that item). This gives us itemset L2.
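Reconstructed from the cited example, the C2 counts and the resulting L2 are:

Itemset  | Support count
{I1, I2} | 4
{I1, I3} | 4
{I1, I4} | 1
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
{I3, I4} | 0
{I3, I5} | 1
{I4, I5} | 0

L2 keeps the itemsets with count >= 2: {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}.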
Step-3:
•Generate candidate set C3 using L2 (join step).
The condition for joining Lk-1 with Lk-1 is that the
itemsets should have (K-2) elements in common.
So here, for L2, the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3},
{I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5} and
{I2, I3, I5}.
•Check whether all subsets of these itemsets are
frequent or not, and if not, remove that
itemset. (Here the subsets of {I1, I2, I3} are {I1,
I2}, {I2, I3} and {I1, I3}, which are frequent. For {I2, I3,
I4}, the subset {I3, I4} is not frequent, so remove it.
Similarly check every itemset.)
•Find the support count of the remaining itemsets
by searching the dataset.
(II) Compare each candidate (C3) support count with the minimum support
count (here min_support=2; if the support_count of a candidate set item is
less than min_support, then remove that item). This gives us
itemset L3.
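Reconstructed from the cited example, the candidates surviving the prune step are
{I1, I2, I3} and {I1, I2, I5}, each with a support count of 2; both meet min_support = 2, so
L3 = { {I1, I2, I3}, {I1, I2, I5} }.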
Step-4:
•Generate candidate set C4 using L3 (join step).
The condition for joining Lk-1 with Lk-1 (K=4) is that
the itemsets should have (K-2) elements in common. So here,
for L3, the first 2 elements (items) should match.
•Check whether all subsets of these itemsets are frequent or
not. (Here the itemset formed by joining L3 is {I1, I2, I3,
I5}; its subsets include {I1, I3, I5}, which is not
frequent.) So there is no itemset in C4.
•We stop here because no further frequent itemsets are
found.
Thus, we have discovered all the frequent item-
sets. Now the generation of strong association rules
comes into the picture. For that we need to calculate the
confidence of each rule.
Confidence –
A confidence of 60% means that 60% of the
customers who purchased milk and bread also
bought butter.
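Taking the frequent itemset {I1, I2, I5} (support count 2) as an example, the candidate
rules and their confidences, reconstructed from the cited example, are:
{I1, I2} → {I5}: confidence = 2/4 = 50%
{I1, I5} → {I2}: confidence = 2/2 = 100%
{I2, I5} → {I1}: confidence = 2/2 = 100%
{I1} → {I2, I5}: confidence = 2/6 = 33%
{I2} → {I1, I5}: confidence = 2/7 = 28%
{I5} → {I1, I2}: confidence = 2/2 = 100%
With minimum confidence = 60%, the strong rules are {I1, I5} → {I2}, {I2, I5} → {I1} and
{I5} → {I1, I2}.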
Limitations of Apriori Algorithm
•Computational complexity.
•Time & space overhead.
•Difficulty handling sparse data.
•Limited discovery of complex patterns.
•Higher memory usage.
•Bias of minimum support threshold.
•Inability to handle numeric data.
•Lack of incorporation of context.
Ways to Improve the efficiency of Apriori Algorithm
Several variations of the Apriori algorithm have been proposed with the aim of improving
the efficiency of the original algorithm. They are as follows −
The hash-based technique (hashing itemsets into corresponding buckets) − A hash-
based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k
> 1. For instance, when scanning each transaction in the database to generate the
frequent 1-itemsets, L1, from the candidate 1-itemsets in C1, we can also generate the
2-itemsets of each transaction, hash (i.e., map) them into the buckets of a hash
table structure, and increase the corresponding bucket counts. A 2-itemset whose bucket
count is below the support threshold cannot be frequent and can therefore be removed
from the candidate set.
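A minimal sketch of this idea (the toy transactions, hash function and bucket count are
illustrative choices, not prescribed by the technique):

    from itertools import combinations

    transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}]  # toy data
    min_sup = 2
    NUM_BUCKETS = 7
    buckets = [0] * NUM_BUCKETS

    def bucket_of(pair):
        # Any deterministic hash of the pair works; sorting makes it order-independent.
        return hash(tuple(sorted(pair))) % NUM_BUCKETS

    # During the same pass that counts candidate 1-itemsets:
    for t in transactions:
        for pair in combinations(t, 2):
            buckets[bucket_of(pair)] += 1

    # A 2-itemset whose bucket count is already below min_sup cannot be frequent,
    # so it is dropped from C2 without another database scan.
    C2 = {frozenset(p) for t in transactions for p in combinations(t, 2)}
    C2 = {c for c in C2 if buckets[bucket_of(c)] >= min_sup}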
Transaction reduction − A transaction that does not contain any frequent k-itemsets cannot
contain any frequent (k + 1)-itemsets. Thus, such a transaction can be marked or deleted
from further consideration, because subsequent scans of the database for j-itemsets, where j >
k, will not need it.
Partitioning − A partitioning technique can be used that requires only two database scans to
mine the frequent itemsets. It consists of two phases. In Phase I, the algorithm subdivides the
transactions of D into n non-overlapping partitions. If the minimum support threshold for
transactions in D is min_sup, then the minimum support count for a partition is min_sup ×
the number of transactions in that partition.
For each partition, all frequent itemsets within the partition are discovered. These
are referred to as local frequent itemsets. The procedure employs a special data
structure that, for each itemset, records the TIDs of the transactions containing the
items in the itemset. This enables it to find all of the local frequent k-itemsets, for k
= 1, 2, ..., in just one scan of the database.
A local frequent itemset may or may not be frequent with respect to the entire database, D.
However, any itemset that is potentially frequent with respect to D must occur as a frequent
itemset in at least one of the partitions. Therefore, all local frequent itemsets are candidate
itemsets with respect to D. The collection of frequent itemsets from all partitions forms the
global candidate itemsets for D. In Phase II, a second scan of D is conducted, in which the
actual support of each candidate is assessed to determine the global frequent itemsets.
Sampling − The fundamental idea of the sampling approach is to select a random
sample S of the given data D, and then search for frequent itemsets in S rather than D.
In this method, we trade off some degree of accuracy against efficiency. The sample
size of S is chosen such that the search for frequent itemsets in S can be completed in main
memory, and therefore only one scan of the transactions in S is needed overall.
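A minimal sketch of the sampling idea (reusing the apriori function sketched earlier; the
sample size and the lowered threshold factor are illustrative assumptions, commonly used
to reduce the chance of missing itemsets that are frequent in D but not in S):

    import random

    def sampled_frequent_itemsets(transactions, min_sup_fraction, sample_size):
        # Mine a random sample S entirely in main memory.
        S = random.sample(transactions, sample_size)
        # Use a slightly lowered support threshold on the sample.
        lowered = 0.9 * min_sup_fraction
        return apriori(S, max(1, int(lowered * len(S))))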
References
Apriori Algorithm In Data Mining : Methods, Examples, and More (analytixlabs.co.in)
https://www.geeksforgeeks.org/apriori-algorithm/