2. 2
Frequent patterns refer to sets of items, subsequences, or substructures that appear frequently together in a dataset.
Use of frequent pattern mining: identifying products that are often purchased together to optimize inventory, improve sales strategies, and design better promotion campaigns.
4. 4
What is market basket analysis?
Market basket analysis is a data mining technique used by retailers to increase sales by better understanding customer purchasing patterns. It involves analyzing large data sets, such as purchase history, to reveal product groupings, as well as products that are likely to be purchased together.
Marketing analysis uses data mining techniques to understand customer behavior, preferences, and market trends to improve decision-making in marketing strategies.
6. 6
• Data Marts: These are smaller, more focused data repositories derived from the data warehouse, designed to meet the needs of specific business departments or functions.
• OLAP (Online Analytical Processing) Tools: OLAP tools allow users to analyze data in multiple dimensions, providing deeper insights and supporting complex analytical queries.
• End-User Access Tools: These are reporting and analysis tools, such as dashboards or Business Intelligence (BI) tools, that enable business users to access and analyze the data.
8. 8
Market Basket Analysis (MBA)
A popular data mining technique. Goal: find associations between products bought together.
Example Rule: If a customer buys bread and butter, they are likely to buy milk.
---
5. Applications in Marketing
Cross-selling and Up-selling: Recommend related products.
Customer Segmentation: Group customers for targeted advertising.
Churn Prediction: Identify customers likely to leave.
Personalized Marketing: Offer deals based on purchase history.
Campaign Management: Evaluate the success of promotional campaigns.
---
6. Benefits
Better understanding of customer needs.
Increased sales and customer loyalty.
More effective and efficient marketing strategies.
---
9. 9
• Data mining concepts are in use for Sales and Marketing to provide better customer service, to improve cross-selling opportunities, and to increase direct mail response rates.
• Customer Retention: data mining makes it possible to identify purchasing patterns and predict which customers are likely to defect.
• Risk Assessment and Fraud detection also use data mining concepts to identify inappropriate or unusual behavior.
Market basket analysis mainly works with the ASSOCIATION RULE {IF} -> {THEN}.
• IF means Antecedent: An antecedent is an item found within the data
• THEN means Consequent: A consequent is an item found in combination with the
antecedent.
10. 10
Support
Support is a measure of how frequently the items appear in the dataset. It helps to identify the most common items or itemsets in the dataset.
SUPPORT: It is calculated as the number of transactions containing the itemset divided by the total number of transactions made: support(A) = transactions containing A / total transactions.
11. 11
Confidence
Confidence is a measure of the reliability of the inference made by a rule. It quantifies the likelihood of finding item B in transactions under the condition that the transaction already contains item A.
CONFIDENCE: It is calculated to determine whether a product sells on its own or through combined sales: confidence(A → B) = support(A ∪ B) / support(A), i.e., combined transactions divided by the antecedent's individual transactions.
12. 12
Lift
Lift measures the strength of a rule over the random co-occurrence of the itemset, providing a metric to understand how much more likely item B is to be bought when item A is bought compared to if B was bought independently.
LIFT: Lift is calculated as the ratio of the rule's confidence to the standalone support of B: lift(A → B) = confidence(A → B) / support(B) = support(A ∪ B) / (support(A) × support(B)).
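To make the three measures concrete, here is a minimal sketch in Python that computes support, confidence, and lift for a rule A → B over a small set of baskets. The transactions and item names are invented for illustration and are not from the slides.

```python
# A minimal sketch (illustrative data, not from the slides) computing
# support, confidence, and lift for a rule A -> B over toy baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
    {"butter", "milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    # Reliability of A -> B: support(A and B) / support(A).
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # confidence(A -> B) / support(B); values > 1 indicate that A and B
    # co-occur more often than if they were bought independently.
    return confidence(antecedent, consequent) / support(consequent)

A, B = {"bread", "butter"}, {"milk"}
print(support(A | B), confidence(A, B), lift(A, B))  # 0.2 0.5 0.625
```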
13. 13
Mining techniques are methods used to discover patterns, relationships, and insights from large
datasets. Some common mining techniques include:
*Types of Mining Techniques*
1. *Frequent Pattern Mining*: Discovers frequent patterns and relationships in data.
2. *Association Rule Mining*: Generates rules that describe relationships between items.
3. *Classification*: Predicts the class or category of an item based on its attributes.
4. *Clustering*: Groups similar items together based on their attributes.
5. *Anomaly Detection*: Identifies unusual or outlier data points.
14. 14
*Applications*
1. *Market Basket Analysis*: Identifies products that are frequently purchased together.
2. *Recommendation Systems*: Suggests products based on user behavior.
3. *Customer Segmentation*: Groups customers based on their behavior and attributes.
4. *Fraud Detection*: Identifies unusual patterns in data that may indicate fraud.
*Benefits*
5. *Improved Decision-Making*: Mining techniques help businesses make informed decisions.
6. *Increased Efficiency*: Automated pattern discovery saves time and resources.
7. *Enhanced Customer Insights*: Mining techniques provide valuable insights into customer
behavior.
*Challenges*
8. *Data Quality*: Poor data quality can affect the accuracy of mining results.
9. *Scalability*: Handling large datasets can be computationally expensive.
10. *Interpretability*: Understanding and interpreting mining results can be challenging.
15. 15
---
What is Apriori Algorithm?
It’s used for Association Rule Mining — finding frequent itemsets in a database and deriving rules (like "If people buy milk, they also buy bread").
Works on the principle:
"If an itemset is frequent, all its subsets must also be frequent." (This is called the Apriori Property.)
The Apriori algorithm is a popular algorithm for mining frequent itemsets and generating association rules. Here's a step-by-step overview (a minimal code sketch follows these steps):
*Apriori Algorithm Steps*
1. *Generate Candidate Itemsets*: Generate candidate itemsets from the dataset.
2. *Calculate Support*: Calculate the support for each candidate itemset.
3. *Prune Itemsets*: Prune itemsets that do not meet the minimum support threshold.
4. *Repeat*: Repeat steps 1-3 until no more frequent itemsets can be generated.
5. *Generate Association Rules*: Generate association rules from the frequent itemsets.
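The steps above can be condensed into a short Python sketch. This is an illustrative implementation, not the slides' own code: the function name, the representation of transactions as a list of sets, and the fractional min_support threshold are assumptions.

```python
from itertools import combinations

# A minimal, illustrative Apriori sketch following the steps above:
# generate candidates, count support, prune, and repeat.
def apriori(transactions, min_support):
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # Frequent 1-itemsets.
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) / n >= min_support}
    all_frequent, k = set(freq), 2
    while freq:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # Apriori property: drop candidates with an infrequent subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # Keep candidates meeting the minimum support threshold.
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) / n >= min_support}
        all_frequent |= freq
        k += 1
    return all_frequent

# Example: with min_support 0.5 on three baskets.
print(apriori([{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk"}], 0.5))
```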
16. 16
2-Step Apriori Algorithm Process
Step 1: Find Frequent 1-itemsets and 2-itemsets
Scan the database and count the support (how often it appears) for single
items and pairs of items.
Keep only those items/pairs that meet the minimum support threshold.
Step 2: Generate Association Rules
From the frequent 2-itemsets, generate rules.
Check if the rules meet the minimum confidence threshold.
17. 17
Step 1: Find Frequent 1-itemsets and 2-itemsets
1-itemsets:
A (4 times), B (3 times), C (3 times), D (1 time), E (2 times)
Keep A, B, C, E (because D has support 1 < 2)
2-itemsets:
(A, B): 2 times
(A, C): 3 times
(A, E): 1 time
(B, C): 2 times
(B, E): 2 times
(C, E): 1 time
Keep (A, B), (A, C), (B, C), (B, E)
Step 2: Generate Rules
(A → C), (C → A), (B → C), (C → B), etc.
Check Confidence for each rule (Confidence = Support(Itemset) / Support(Antecedent)).
Example: Confidence(A → C) = Support(A, C) / Support(A) = 3/4 = 0.75
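As a quick check of the arithmetic, the snippet below recomputes the rule confidences from the support counts listed on this slide; the dictionary layout and names are an illustrative assumption.

```python
# Recomputing the confidences above from the support counts on this slide.
count = {("A",): 4, ("B",): 3, ("C",): 3, ("E",): 2,
         ("A", "B"): 2, ("A", "C"): 3, ("B", "C"): 2, ("B", "E"): 2}

def conf(antecedent, itemset):
    # Confidence = Support(Itemset) / Support(Antecedent).
    return count[itemset] / count[antecedent]

print(conf(("A",), ("A", "C")))  # A -> C: 3/4 = 0.75
print(conf(("C",), ("A", "C")))  # C -> A: 3/3 = 1.0
print(conf(("B",), ("B", "C")))  # B -> C: 2/3 ~ 0.67
```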
23. 23
Advantages of Apriori algorithm:
1. Simplicity & ease of implementation
2. The rules are easy for humans to read
3. Works well on unlabelled data
4. Flexibility & customisability
5. Extensions for multiple use cases can be created easily
6. The algorithm is widely used & studied
Disadvantages of Apriori algorithm:
1. Computational complexity: Requires many database scans.
2. Higher memory usage: Assumes the transaction database is memory resident.
3. It needs to generate a huge number of candidate sets.
4. Limited discovery of complex patterns
24. 24
Improving the efficiency of the Apriori Algorithm:
Here are some of the methods to improve the efficiency of the Apriori algorithm (a sketch of the transaction-reduction idea follows this list):
1. Hash-Based Technique: This method uses a hash-based structure called a hash table for generating the k-itemsets and their corresponding counts. It uses a hash function for generating the table.
2. Transaction Reduction: This method reduces the number of transactions scanned in iterations. The transactions which do not contain frequent items are marked or removed.
3. Partitioning: This method requires only two database scans to mine the frequent itemsets. It says that for any itemset to be potentially frequent in the database, it should be frequent in at least one of the partitions of the database.
4. Sampling: This method picks a random sample S from database D and then searches for frequent itemsets in S. It may be possible to lose a globally frequent itemset. This can be reduced by lowering the min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked start point of the database during the scanning of the database.
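As referenced above, here is a minimal sketch of the transaction-reduction idea: once the frequent items are known, transactions containing none of them cannot contribute to any longer frequent itemset and can be dropped before later scans. The function name and inputs are assumptions for illustration.

```python
# A sketch of transaction reduction (names are illustrative): drop
# transactions that share no item with the current frequent-item set.
def reduce_transactions(transactions, frequent_items):
    return [t for t in transactions if t & frequent_items]

baskets = [{"A", "B"}, {"C"}, {"A", "D"}]
print(reduce_transactions(baskets, {"A", "B"}))  # {"C"} is removed
```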
25. 25
Frequent Pattern-growth Algorithm
FP-growth is an algorithm for mining frequent patterns that uses a divide-and-conquer approach. The FP Growth algorithm was developed by Han in 2000. It constructs a tree-like data structure called the frequent pattern (FP) tree, where each node represents an item in a frequent pattern, and its children represent its immediate sub-patterns. By scanning the dataset only twice, FP-growth can efficiently mine all frequent itemsets without generating candidate itemsets explicitly. It is particularly suitable for datasets with long patterns and relatively low support thresholds.
26. 26
Working of the FP Growth Algorithm
The working of the FP Growth algorithm in data mining can be summarized in the following steps:
Scan the database:
In this step, the algorithm scans the input dataset to determine the frequency of each item. This determines the order in which items are added to the FP tree, with the most frequent items added first.
Sort items:
In this step, the items in the dataset are sorted in descending order of frequency. The infrequent items that do not meet the minimum support threshold are removed from the dataset. This helps to reduce the dataset's size and improve the algorithm's efficiency.
Construct the FP-tree:
In this step, the FP-tree is constructed. The FP-tree is a compact data structure that stores the frequent itemsets and their support counts.
27. 27
Generate frequent itemsets:
Once the FP-tree has been constructed, frequent itemsets can be generated by recursively mining the tree. Starting at the bottom of the tree, the algorithm finds all combinations of frequent itemsets that satisfy the minimum support threshold.
Generate association rules:
Once all frequent itemsets have been generated, the algorithm post-processes the generated frequent itemsets to generate association rules, which can be used to identify interesting relationships between the items in the dataset.
28. 28
FP Tree
The FP-tree (Frequent Pattern tree) is a data structure used in the FP Growth algorithm for frequent pattern mining. It represents the frequent itemsets in the input dataset compactly and efficiently. The FP tree consists of the following components (a structural sketch in code follows this list):
Root Node:
The root node of the FP-tree represents an empty set. It has no associated item but a pointer to the first node of each item in the tree.
Item Node:
Each item node in the FP-tree represents a unique item in the dataset. It stores the item name and the frequency count of the item in the dataset.
Header Table:
The header table lists all the unique items in the dataset, along with their frequency counts. It is used to track each item's location in the FP tree.
Child Node:
Each child node of an item node represents an item that co-occurs, in at least one transaction in the dataset, with the item the parent node represents.
Node Link:
The node-link is a pointer that connects each item in the header table to the first node of that item in the FP-tree. It is used to traverse the conditional pattern base of each item during the mining process.
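The components above map naturally onto a small data structure. The following sketch is illustrative; the class and field names are assumptions, not from the slides.

```python
# An illustrative sketch of the FP-tree components described above.
class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item name (None for the root / empty set)
        self.count = 1          # frequency count of this item on this path
        self.parent = parent    # pointer back toward the root
        self.children = {}      # item -> child node (co-occurring items)
        self.node_link = None   # next node of the same item in the tree

# The header table maps each unique item to the first node of that item,
# so the node-link chain can be traversed during mining:
# header_table = {item_name: first_FPNode_of_item}
```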
29. 29
Construction Steps
1. Scan the database once to count item frequencies.
2. Discard infrequent items (below min support).
3. Sort items in each transaction by frequency (descending).
4. Insert transactions into the tree: shared prefixes are merged; frequencies are updated.
32. 32
Step 2: Sort Items in Each Transaction by Frequency (Descending)
The items in each transaction are arranged in descending order of their respective frequencies. After insertion of the relevant items, the set L looks like this:
L = {K : 5, E : 4, M : 3, O : 4, Y : 3}
33. 33
Step 3: Build the FP-Tree
Start with a null root node, and add transactions one by one (a code sketch replaying these insertions follows the walkthrough).
Inserting the set {K, E, M, O, Y}:
34. 34
Inserting the set {K, E, O, Y}:
Up to the insertion of the elements K and E, the support count is simply increased by 1. On inserting O, we can see that there is no direct link between E and O; therefore a new node for the item O is initialized with support count 1, and item E is linked to this new node. On inserting Y, we first initialize a new node for the item Y with support count 1 and link the new node of O to the new node of Y.
35. 35
Inserting the set {K, E, M}:
The support count of each element is simply increased by 1.
36. 36
Inserting the set {K, M, Y}:
As in the previous insertions, first the support count of K is increased, then new nodes for M and Y are initialized and linked accordingly.
37. 37
Inserting the set {K, E, O}:
Here simply the support counts of the respective elements are increased. Note that
the support count of the new node of item O is increased.
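Putting the walkthrough together, the sketch below replays the five insertions from the slides above, using the FPNode class sketched earlier; the insert function itself is an illustrative assumption.

```python
# Replaying the insertions from the walkthrough, using the FPNode class
# from the earlier sketch. The ordered sets are taken from the slides.
root = FPNode(None, None)   # null root node (represents the empty set)
header_table = {}           # item -> first node of that item

def insert(node, transaction):
    for item in transaction:
        if item in node.children:
            node.children[item].count += 1      # shared prefix: bump count
        else:
            child = FPNode(item, node)          # new node with count 1
            node.children[item] = child
            if item in header_table:            # thread the node-link chain
                link = header_table[item]
                while link.node_link is not None:
                    link = link.node_link
                link.node_link = child
            else:
                header_table[item] = child
        node = node.children[item]

for ordered in (["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"],
                ["K", "E", "M"], ["K", "M", "Y"], ["K", "E", "O"]):
    insert(root, ordered)

print(root.children["K"].count)  # 5, matching K : 5 in the set L
```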
38. 38
Multilevel Association Rules:
Association rules generated from mining data at different levels of abstraction are called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
Rules at a high concept level may add to common sense, while rules at a low concept level may not always be useful.
39. 39
Need for Multidimensional Rules:
• Sometimes at the low data level, data does not show any significant pattern, but there is useful information hiding behind it.
• The aim is to find the hidden information in or between levels of abstraction.
40. 40
Multidimensional Association Rules:
In multidimensional association rules, attributes can be categorical or quantitative.
• Quantitative attributes are numeric and have an implicit ordering among values.
• Numeric attributes should be discretized.
• A multidimensional association rule consists of more than one dimension.
• Example: buys(X, “IBM Laptop computer”) ⇒ buys(X, “HP Inkjet Printer”)
41. 41
Multilevel Association Rules
Multilevel association rules involve finding relationships between items at different levels of abstraction in a hierarchical structure. For instance, in a retail scenario, products can be organized into categories such as "Electronics" and "Home Appliances," which can be further divided into subcategories like "Mobile Phones" and "Refrigerators."
1. Hierarchy of Items: Items are organized in a hierarchical manner. For example:
● Level 1: Electronics, Home Appliances
● Level 2: Mobile Phones, Laptops (under Electronics), Refrigerators, Washing Machines (under Home Appliances)
● Level 3: Specific brands or models of mobile phones, laptops, etc.
2. Support and Confidence: At different levels, the support (frequency of itemsets) and confidence (reliability of the association) are calculated (a small code sketch follows).
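As a small illustration of point 2, the sketch below maps items up a concept hierarchy before counting, so support can be measured at a higher level of abstraction; the hierarchy, item names, and transactions are invented for illustration.

```python
# An illustrative sketch of computing support at a higher concept level:
# items are mapped up the hierarchy before counting, so both laptop
# brands count toward the "Laptops" category.
hierarchy = {"IBM laptop": "Laptops", "HP laptop": "Laptops",
             "Samsung refrigerator": "Refrigerators"}

transactions = [{"IBM laptop"}, {"HP laptop"}, {"Samsung refrigerator"}]

def level_support(category):
    # Replace each item with its parent category, keeping unknowns as-is.
    mapped = [{hierarchy.get(i, i) for i in t} for t in transactions]
    return sum(category in t for t in mapped) / len(mapped)

print(level_support("Laptops"))  # 2/3, while each brand alone has support 1/3
```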