Data Mining
Zahra Pourbahman and Behnaz Sadat Motavali
Supervisor: Dr. Alireza Bagheri
Advanced Database Course
November 2016
1/59
Outline
1. Introduction
2. DM Methods
3. Complementary Information
4. Conclusion
2/59
Introduction
What is Data Mining and why is it
important?
1
3/59
Data Mining
 Extracting or mining knowledge from large amounts of data
 Knowledge discovery in databases
4/59
Why Data
Mining?
5/59
Process of
Knowledge
Discovery
6/59
Data
Mining
Methods
Data Mining Methods
 Predictive:
▸ Classification (e.g. SVM)
▸ Regression (e.g. Linear)
 Descriptive:
▸ Clustering (e.g. K-Means)
▸ Association Rules (e.g. Apriori)
7/59
DM Methods
What are Classification, Clustering, Association
Rules and Regression in Data Mining?
2
8/59
Classification
9/59
Classification Problem
 Given: a training set, i.e. a labeled set of N input-output pairs D = {(x^(i), y^(i))}, 1 ≤ i ≤ N
 y^(i) ∈ {1, …, K}
 Goal: given an input x as test data, assign it to
one of the K classes
 Examples:
▸ Spam filter
▸ Shape recognition
10/59
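To make the setup concrete, here is a minimal Python sketch of learning from labeled pairs and assigning a test input to one of K classes. The two-feature data is synthetic and purely illustrative, and scikit-learn's SVC is used since SVM is the classifier covered in the following slides.

```python
# A minimal sketch of the classification setup: fit on labeled pairs
# (x_i, y_i), then assign an unseen test input to one of K classes.
# The data here is made up for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))                        # N = 100 inputs, 2 features
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # K = 2 classes

clf = SVC(kernel="linear").fit(X_train, y_train)
x_test = np.array([[0.5, -0.2]])
print(clf.predict(x_test))   # the predicted class for the test input
```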
Learning and Decision Boundary
 Assume that the training data is perfectly linearly separable
 Note that we seek w such that
w^T x^(n) ≥ 0 when y^(n) = +1
w^T x^(n) < 0 when y^(n) = −1
i.e. y^(n) w^T x^(n) ≥ 0 for every correctly classified sample; the error to minimize sums the violations over the set ℳ of misclassified samples: E(w) = −Σ_{n∈ℳ} y^(n) w^T x^(n)
11/59
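The criterion above can be checked directly: a sample is classified correctly exactly when y^(n) w^T x^(n) ≥ 0. A small NumPy sketch of the misclassification test and the resulting error, with arbitrary made-up values for w, X, and y:

```python
# Checking y_n * w^T x_n >= 0 and accumulating the perceptron-style error
# E(w) = -sum over misclassified samples of y_n * w^T x_n.
# w, X, and y below are arbitrary illustrative values.
import numpy as np

w = np.array([1.0, -0.5])
X = np.array([[2.0, 1.0], [-1.0, 0.5], [0.5, 2.0]])
y = np.array([+1, -1, +1])

margins = y * (X @ w)                # y_n * w^T x_n for every sample
misclassified = margins < 0          # violated constraints
E = -np.sum(margins[misclassified])  # error contributed by the violations
print(misclassified, E)
```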
Margin
 Which line should we select as the
boundary to provide better
generalization capability?
 A larger margin provides better
generalization to unseen data
 Choose the hyperplane that is farthest
from all training samples
 The largest-margin hyperplane has equal
distance to the nearest sample of
each class
14/59
Hard-Margin
Support Vector
Machine
(SVM)
 When the training samples are not linearly separable, the hard-margin problem has no
solution.
16/59
Beyond Linear Separability
 Noise in the linearly separable
classes
 Overlapping classes that can be
approximately separated by a
linear boundary
17/59
Beyond Linear
Separability:
Soft-Margin
SVM
 Soft margin:
Maximizing a margin while trying to minimize the distance
between misclassified points and their correct margin plane
 SVM with slack variables:
Allows samples to fall within the margin, but penalizes them
18/59
Soft-Margin
SVM:
Parameter C is a tradeoff parameter:
 small C allows margin constraints to be easily
ignored → large margin
 large C makes constraints hard to ignore
→ narrow margin
 C = ∞ enforces all constraints: hard margin
19/59
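The effect of C can be seen with scikit-learn's soft-margin SVC. This sketch, on synthetic overlapping classes and arbitrary C values, shows that a smaller C leaves more samples inside the margin, i.e. more support vectors:

```python
# Sweeping the tradeoff parameter C of a soft-margin linear SVM.
# Small C -> constraints easily ignored -> wide margin, many support vectors;
# large C -> constraints enforced -> narrow margin, fewer support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.2, size=(50, 2)),   # two overlapping classes
               rng.normal(+1.0, 1.2, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, len(clf.support_))   # number of support vectors shrinks as C grows
```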
Support Vectors
 Hard-margin support vectors: SVs = {x^(i) | α_i > 0}
 The direction of the hyperplane can be found based on the support
vectors alone:
w = Σ_{i∈SVs} α_i y^(i) x^(i)
20/59
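As a sanity check, scikit-learn exposes the quantities in this sum: SVC's dual_coef_ attribute holds α_i y^(i) for each support vector. A sketch on separable synthetic data (a large C approximating a hard margin) that rebuilds w from the support vectors alone:

```python
# Recovering w = sum_{i in SVs} alpha_i y_i x_i from the support vectors only.
# SVC.dual_coef_ stores alpha_i * y_i per support vector; data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 0.5, size=(30, 2)),
               rng.normal(+2.0, 0.5, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

clf = SVC(kernel="linear", C=1e6).fit(X, y)          # large C ~ hard margin
w = clf.dual_coef_.ravel() @ clf.support_vectors_    # sum_i alpha_i y_i x_i
print(np.allclose(w, clf.coef_.ravel()))             # matches the fitted w
```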
Classifying
New Samples
Using only
SVs in SVM
Classification of a new sample x: ŷ = sign(w^T x + b) = sign(Σ_{i∈SVs} α_i y^(i) (x^(i))^T x + b)
21/59
Clustering
22/59
Clusty.com
23/59
Clustering Problem
 We have a set of unlabeled data points {x^(i)},
1 ≤ i ≤ N, and we intend to find groups of
similar objects (based on the observed
features)
24/59
K-means
Clustering  Given: the number of clusters K and a set of
unlabeled data 𝒳 = {x^(1), …, x^(N)}
 Goal: find groups of data points 𝒞 = {𝒞₁, 𝒞₂, …, 𝒞_K}
 Hard partitioning:
∀j, 𝒞_j ≠ ∅
∀i ≠ j, 𝒞_i ∩ 𝒞_j = ∅
⋃_k 𝒞_k = 𝒳
 Intra-cluster distances are small (compared with
inter-cluster distances)
25/59
Distortion measure
 Our goal is to find 𝒞 = {𝒞₁, 𝒞₂, …, 𝒞_K} and centroids {μ₁, …, μ_K} so as to minimize the distortion J(𝒞) = Σ_{k=1}^{K} Σ_{x∈𝒞_k} ‖x − μ_k‖²
26/59
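A direct translation of J into code, a sketch assuming the cluster assignments and centroids are already given as arrays:

```python
# The distortion measure: the sum of squared distances from every point to
# the centroid of the cluster it is assigned to.
import numpy as np

def distortion(X, labels, mu):
    # J(C) = sum_n || x_n - mu_{c(n)} ||^2, with c(n) the cluster of x_n
    return float(np.sum((X - mu[labels]) ** 2))
```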
K-means
Algorithm
Select K random points μ₁, μ₂, …, μ_K as the clusters' initial
centroids.
 Repeat until convergence (or another stopping
criterion):
 for i = 1 to N:
 Assign x^(i) to the closest cluster
 for k = 1 to K:
 Centroid update: μ_k ← mean of the points assigned to 𝒞_k
(a sketch follows below)
27/59
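A compact NumPy sketch of this loop. It is illustrative only and does not handle the corner case of a cluster losing all of its points:

```python
# K-means: alternate between assigning each point to its nearest centroid
# and moving each centroid to the mean of its assigned points.
import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # random initial centroids
    for _ in range(n_iters):
        # assignment step: distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: centroid of each cluster (assumes no cluster is empty)
        new_mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_mu, mu):                     # converged
            break
        mu = new_mu
    return labels, mu
```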
K-means Algorithm
Step by Step
28/59
Assigning data to clusters Updating means
[Bishop]
29/59
Summary
of the First
Part
Data Mining
 Classification (labeled, linearly separable data):
hard-margin SVM maximizes the margin;
soft-margin SVM handles noisy data and
overlapping classes
 Clustering (unlabeled data):
K-means alternates between assigning data
to clusters and updating the centroids
30/59
Association
Rules
31/59
Association Rules:
 Frequent Patterns
▹ Frequent Itemset
▹ Frequent Sequential Pattern
▹ …
 Relation Between Data
32/59
Example
The itemset {Milk, Bread} has size k = 2
33/59
Support & Confidence
{Milk} → {Bread} [Support = 50%, Confidence = 100%]
ID  Items                          k
1   {Milk, Bread, Meat}            3
2   {Sugar, Bread, Eggs}           3
3   {Milk, Sugar, Bread, Butter}   4
4   {Bread, Butter}                2
34/59
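These two numbers can be computed directly from the table above; a short Python sketch:

```python
# Support and confidence of {Milk} -> {Bread} over the four transactions
# from the table above.
transactions = [
    {"Milk", "Bread", "Meat"},
    {"Sugar", "Bread", "Eggs"},
    {"Milk", "Sugar", "Bread", "Butter"},
    {"Bread", "Butter"},
]
X, Y = {"Milk"}, {"Bread"}
n_X  = sum(X <= t for t in transactions)        # transactions containing X
n_XY = sum((X | Y) <= t for t in transactions)  # transactions containing X and Y

support = n_XY / len(transactions)              # 2/4 = 50%
confidence = n_XY / n_X                         # 2/2 = 100%
print(support, confidence)
```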
Association Rules
An association rule has the form X → Y with X ∩ Y = ∅
Rules must satisfy minimum support (minsup) and minimum confidence (minconf)
Note that Y → X is not the same rule as X → Y
35/59
Example
{Milk, Sugar} → {Bread}
36/59
Frequent Pattern
{Milk, Bread} has support = 2
An itemset of size k has 2^k − 1 nonempty subsets; naively counting supports over N transactions and A items takes on the order of (2^k − 1) × N × A comparisons
 So we need an algorithm to decrease that → Apriori
37/59
Apriori
Algorithm  Used to find frequent itemsets
 Uses candidate generation
 Uses prior knowledge
 Level-wise search
 Uses minimum support
 Apriori property: all nonempty subsets of a frequent
itemset must also be frequent.
At level k, frequent k-itemsets are found;
these are then used to explore (k+1)-itemsets (see the sketch below)
38/59
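A compact sketch of the level-wise search in plain Python, with transactions represented as sets; the transaction list at the bottom is made up for illustration:

```python
# Apriori: find frequent 1-itemsets, then repeatedly join frequent k-itemsets
# into (k+1)-candidates, prune with the Apriori property, and count supports.
from itertools import combinations

def apriori(transactions, min_support):
    def count(itemset):
        return sum(itemset <= t for t in transactions)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if count(frozenset([i])) >= min_support}
    frequent, k = set(level), 2
    while level:
        # join step: unions of two frequent (k-1)-itemsets of size exactly k
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c for c in candidates if count(c) >= min_support}
        frequent |= level
        k += 1
    return frequent

txns = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
        {"I1", "I2", "I4"}, {"I1", "I3"}]            # made-up transactions
print(apriori(txns, min_support=2))
```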
Apriori Algorithm
Step by Step
39/59
Example
11 items
5 transactions
40/59
Example …
Without minsup there are 55 candidate 2-itemsets;
with minsup only 10 remain
41/59
Example …
Combine frequent k-itemsets to
generate (k+1)-itemset candidates:
{I1, I4} + {I2, I4} → {I1, I2, I4} ✓
{I1, I4} + {I2, I5} → {I1, I2, I4, I5} ✗ (size k + 2, rejected)
42/59
Without the Apriori algorithm:
C(11,1) + C(11,2) + C(11,3) = 11 + 55 + 165 = 231 candidate itemsets
Using the Apriori algorithm:
11 + 10 + 6 = 27 candidates
43/59
Association
Rules
 Once the frequent patterns are found, it is time to generate association rules
 Each frequent k-itemset yields 2^k − 2 candidate rules
(every nonempty proper subset can serve as the antecedent X)
 Each candidate must then pass the minimum-confidence check (see the sketch below)
44/59
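A sketch of this rule-generation step, reusing the sets-as-transactions representation from the earlier sketches. It assumes the given itemset is frequent, so all subset counts are nonzero:

```python
# Generate the 2^k - 2 candidate rules X -> Y from one frequent k-itemset
# and keep those meeting the minimum confidence.
from itertools import combinations

def rules_from(itemset, transactions, min_conf):
    def count(s):
        return sum(s <= t for t in transactions)

    kept = []
    for r in range(1, len(itemset)):          # every nonempty proper subset as X
        for X in map(frozenset, combinations(itemset, r)):
            conf = count(itemset) / count(X)  # confidence of X -> (itemset - X)
            if conf >= min_conf:
                kept.append((set(X), set(itemset - X), conf))
    return kept

txns = [{"Milk", "Bread", "Meat"}, {"Sugar", "Bread", "Eggs"},
        {"Milk", "Sugar", "Bread", "Butter"}, {"Bread", "Butter"}]
print(rules_from(frozenset({"Milk", "Bread"}), txns, min_conf=0.7))
```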
Regression
Classification
Clustering
Association
Rules
Regression
45/59
Regression
 While classification predicts categorical labels,
regression is used for prediction
 of numeric, continuous values
 It models the relation between independent and dependent
variables
46/59
Linear Regression
Equation: y = w₀ + w₁x₁ + … + w_p x_p
Simple equation: y = w₀ + w₁x
Error: SSE = Σᵢ (yᵢ − (w₀ + w₁xᵢ))²
Least-squares estimates of w₀ and w₁: w₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)², w₀ = ȳ − w₁x̄
47/59
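A sketch that plugs small made-up numbers into the closed-form least-squares estimates above:

```python
# Fitting y = w0 + w1*x by least squares using the closed-form estimates.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up predictor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # made-up responses

w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print(w0, w1)               # fitted intercept and slope
print(w0 + w1 * 6.0)        # predict a continuous value for x = 6
```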
Example
48/59
Regression
(continued)
 The class label should be linearly related to the attribute;
when this is unclear, use the correlation coefficient to check
 Some nonlinear regressions can be converted to linear form
 Logistic regression is a generalized linear model that
models class probability (see the sketch below)
 Decision trees become regression trees by predicting
continuous values rather than class labels
49/59
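For the logistic-regression point above, a minimal scikit-learn sketch on made-up one-feature data, showing that it outputs class probabilities rather than a numeric value:

```python
# Logistic regression: a generalized linear model that predicts the
# probability of a class label via the logistic (sigmoid) function.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one made-up predictor
y = np.array([0, 0, 1, 1])                   # binary class labels

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[2.5]]))          # [P(y=0), P(y=1)] for x = 2.5
```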
Complementary
Information
Data Mining Tools, Usage and Types
3
50/59
Tools
Usage
Types
51/59
DM Tools  Business Software:
 IBM Intelligent Miner
 SAS Enterprise Miner
 Microsoft SQL Server 2005
 SPSS Clementine
 …
 Open Source Software:
 Rapid-I RapidMiner
 Weka
 …
52/59
DM Usage
 Bank
 Financial issues
 High-quality data
 Granting loans
 Financial services
 Reducing risks
 Money
Laundering and
Financial damages
 Marketing
 Massive data
 Increasing fast
 E-commerce
 Shopping patterns
 Service quality
 Customer
satisfaction
 Advertising
 Discount
 Bioinformatics
 Laboratory
information
 Protein structures
(Gene)
 Massive number of
sequences
 Need for computer
algorithms to
analyze them
 Accurate
53/59
DM Types
Text Mining
• No tables
• Books, articles, texts
• Semi-structured data
• Information retrieval and
databases
• Key words
• Massive data and text
Web Mining
• Massive Unstructured,
Semi-structured,
Multimedia data
• Links, advertisements
• Poor quality, changing
• Web structure, Content,
Web usage Mining
• Search engines
Multimedia Mining
• Voice, image, video
• Nature of the data
• Key words or
Patterns and shapes
Graph Mining
• Electronic circuits,
image, web and …
• Graph search algorithm
• Difference, index
• Social network
analysis
Spatial Mining
• Medical images,
VLSI layers
• Location based
• Efficient techniques
54/59
Conclusion
Challenges and Conclusion.
4
55/59
Challenges
 Individual or single-purpose systems
 Scalable and interactive systems
 Standardization of data mining languages
 Complex data
 Distributed and real-time data mining
56/59
Review
Introduction
• Data Mining
• Knowledge Discovery
• DM Methods
DM Methods
• Classification
• Clustering
• Association Rules
• Regression
Complementary Information
• DM Tools
• DM Usage
• DM Types
Conclusion
• Challenges
57/59
References
 C. M. Bishop; Pattern Recognition and Machine Learning; Springer, 2006.
 Jiawei Han, Micheline Kamber; Data Mining: Concepts and Techniques, Second Edition; Elsevier, 2006.
 J. Fürnkranz et al.; Foundations of Rule Learning (Cognitive Technologies); Springer-Verlag Berlin Heidelberg, 2012.
 Abraham Silberschatz, Henry F. Korth, S. Sudarshan; Database System Concepts, Sixth Edition; McGraw-Hill, 2010.
 Mehdi Esmaeili; Data Mining Concepts and Techniques (in Persian); Niaz Danesh, Tir 1391 (July 2012).
58/59
Thanks!
Any questions?
You can find us at z.poorbahman1@yahoo.com & bs.motavali@yahoo.com
😉
59/59