Data Mining
Zahra Pourbahman and Behnaz Sadat Motavali
Supervisor: Dr. Alireza Bagheri
Advanced Database Course
November 2016
1/59
Outline
1. Introduction
2. DM Methods
3. Complementary Information
4. Conclusion
2/59
Introduction
What is Data Mining and why is it
important?
1
3/59
Data Mining
 Extracting or mining knowledge from large amounts of data
 Knowledge discovery in databases
4/59
Why Data
Mining?
5/59
Process of
Knowledge
Discovery
6/59
Data
Mining
Methods
Data Mining Methods
 Predictive:
▸ Classification (e.g. SVM)
▸ Regression (e.g. Linear)
 Descriptive:
▸ Clustering (e.g. K-Means)
▸ Association Rules (e.g. Apriori)
7/59
DM Methods
What are Classification, Clustering, Association
Rules and Regression in Data Mining?
2
8/59
Classification
9/59
Classification Problem
 Given: a training set, i.e. a labeled set of N input-output pairs D = {(x^(i), y^(i))}, 1 ≤ i ≤ N
 y^(i) ∈ {1, …, K}
 Goal: given an input x as test data, assign it to
one of the K classes
 Examples:
▸ Spam filter
▸ Shape recognition
10/59
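To make the setup concrete, here is a minimal Python sketch of learning from labeled pairs and assigning a test input to one of K classes. The two-feature data is synthetic and purely illustrative, and scikit-learn's SVC is used since SVM is the classifier covered in the following slides.

```python
# A minimal sketch of the classification setup: fit on labeled pairs
# (x_i, y_i), then assign an unseen test input to one of K classes.
# The data here is made up for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))                        # N = 100 inputs, 2 features
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # K = 2 classes

clf = SVC(kernel="linear").fit(X_train, y_train)
x_test = np.array([[0.5, -0.2]])
print(clf.predict(x_test))   # the predicted class for the test input
```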
Learning and Decision Boundary
 Assume that the training data is perfectly linearly separable
 Note that we seek w such that
w^T x^(n) ≥ 0 when y^(n) = +1
w^T x^(n) < 0 when y^(n) = −1
i.e. y^(n) w^T x^(n) ≥ 0 for every correctly classified sample; the error to minimize sums the violations over the set ℳ of misclassified samples: E(w) = −Σ_{n∈ℳ} y^(n) w^T x^(n)
11/59
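The criterion above can be checked directly: a sample is classified correctly exactly when y^(n) w^T x^(n) ≥ 0. A small NumPy sketch of the misclassification test and the resulting error, with arbitrary made-up values for w, X, and y:

```python
# Checking y_n * w^T x_n >= 0 and accumulating the perceptron-style error
# E(w) = -sum over misclassified samples of y_n * w^T x_n.
# w, X, and y below are arbitrary illustrative values.
import numpy as np

w = np.array([1.0, -0.5])
X = np.array([[2.0, 1.0], [-1.0, 0.5], [0.5, 2.0]])
y = np.array([+1, -1, +1])

margins = y * (X @ w)                # y_n * w^T x_n for every sample
misclassified = margins < 0          # violated constraints
E = -np.sum(margins[misclassified])  # error contributed by the violations
print(misclassified, E)
```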
Margin
 Which line should we select as the
boundary to provide better
generalization capability?
 A larger margin provides better
generalization to unseen data
 Choose the hyperplane that is farthest
from all training samples
 The largest-margin hyperplane has equal
distance to the nearest sample of
each class
14/59
Hard-Margin
Support Vector
Machine
(SVM)
 When the training samples are not linearly separable, the hard-margin problem has no
solution.
16/59
Beyond Linear Separability
 Noise in the linearly separable
classes
 Overlapping classes that can be
approximately separated by a
linear boundary
17/59
Beyond Linear
Separability:
Soft-Margin
SVM
 Soft margin:
Maximizing a margin while trying to minimize the distance
between misclassified points and their correct margin plane
 SVM with slack variables:
Allows samples to fall within the margin, but penalizes them
18/59
Soft-Margin
SVM:
Parameter C is a tradeoff parameter:
 small C allows margin constraints to be easily
ignored → large margin
 large C makes constraints hard to ignore
→ narrow margin
 C = ∞ enforces all constraints: hard margin
19/59
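The effect of C can be seen with scikit-learn's soft-margin SVC. This sketch, on synthetic overlapping classes and arbitrary C values, shows that a smaller C leaves more samples inside the margin, i.e. more support vectors:

```python
# Sweeping the tradeoff parameter C of a soft-margin linear SVM.
# Small C -> constraints easily ignored -> wide margin, many support vectors;
# large C -> constraints enforced -> narrow margin, fewer support vectors.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.2, size=(50, 2)),   # two overlapping classes
               rng.normal(+1.0, 1.2, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, len(clf.support_))   # number of support vectors shrinks as C grows
```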
Support Vectors
 Hard-margin support vectors: SVs = {x^(i) | α_i > 0}
 The direction of the hyperplane can be found based on the support
vectors alone:
w = Σ_{i∈SVs} α_i y^(i) x^(i)
20/59
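As a sanity check, scikit-learn exposes the quantities in this sum: SVC's dual_coef_ attribute holds α_i y^(i) for each support vector. A sketch on separable synthetic data (a large C approximating a hard margin) that rebuilds w from the support vectors alone:

```python
# Recovering w = sum_{i in SVs} alpha_i y_i x_i from the support vectors only.
# SVC.dual_coef_ stores alpha_i * y_i per support vector; data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2.0, 0.5, size=(30, 2)),
               rng.normal(+2.0, 0.5, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

clf = SVC(kernel="linear", C=1e6).fit(X, y)          # large C ~ hard margin
w = clf.dual_coef_.ravel() @ clf.support_vectors_    # sum_i alpha_i y_i x_i
print(np.allclose(w, clf.coef_.ravel()))             # matches the fitted w
```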
Classifying
New Samples
Using only
SVs in SVM
Classification of a new sample x: ŷ = sign(w^T x + b) = sign(Σ_{i∈SVs} α_i y^(i) (x^(i))^T x + b)
21/59
Clustering
22/59
Clusty.com
23/59
Clustering Problem
 We have a set of unlabeled data points {x^(i)},
1 ≤ i ≤ N, and we intend to find groups of
similar objects (based on the observed
features)
24/59
K-means
Clustering  Given: the number of clusters K and a set of
unlabeled data 𝒳 = {x^(1), …, x^(N)}
 Goal: find groups of data points 𝒞 = {𝒞₁, 𝒞₂, …, 𝒞_K}
 Hard partitioning:
∀j, 𝒞_j ≠ ∅
∀i ≠ j, 𝒞_i ∩ 𝒞_j = ∅
⋃_k 𝒞_k = 𝒳
 Intra-cluster distances are small (compared with
inter-cluster distances)
25/59
Distortion measure
 Our goal is to find 𝒞 = {𝒞₁, 𝒞₂, …, 𝒞_K} and centroids {μ₁, …, μ_K} so as to minimize the distortion J(𝒞) = Σ_{k=1}^{K} Σ_{x∈𝒞_k} ‖x − μ_k‖²
26/59
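A direct translation of J into code, a sketch assuming the cluster assignments and centroids are already given as arrays:

```python
# The distortion measure: the sum of squared distances from every point to
# the centroid of the cluster it is assigned to.
import numpy as np

def distortion(X, labels, mu):
    # J(C) = sum_n || x_n - mu_{c(n)} ||^2, with c(n) the cluster of x_n
    return float(np.sum((X - mu[labels]) ** 2))
```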
K-means
Algorithm
Select K random points μ₁, μ₂, …, μ_K as the clusters' initial
centroids.
 Repeat until convergence (or another stopping
criterion):
 for i = 1 to N:
 Assign x^(i) to the closest cluster
 for k = 1 to K:
 Centroid update: μ_k ← mean of the points assigned to 𝒞_k
(a sketch follows below)
27/59
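A compact NumPy sketch of this loop. It is illustrative only and does not handle the corner case of a cluster losing all of its points:

```python
# K-means: alternate between assigning each point to its nearest centroid
# and moving each centroid to the mean of its assigned points.
import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # random initial centroids
    for _ in range(n_iters):
        # assignment step: distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: centroid of each cluster (assumes no cluster is empty)
        new_mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_mu, mu):                     # converged
            break
        mu = new_mu
    return labels, mu
```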
K-means Algorithm
Step by Step
28/59
Assigning data to clusters Updating means
[Bishop]
29/59
Summary
of the First
Part
Data Mining
 Classification (labeled, linearly separable data):
hard-margin SVM maximizes the margin;
soft-margin SVM handles noisy data and
overlapping classes
 Clustering (unlabeled data):
K-means alternates between assigning data
to clusters and updating the centroids
30/59
Association
Rules
31/59
Association Rules:
 Frequent Patterns
▹ Frequent Itemset
▹ Frequent Sequential Pattern
▹ …
 Relation Between Data
32/59
Example
The itemset {Milk, Bread} has size k = 2
33/59
Support & Confidence
{Milk} → {Bread} [Support = 50%, Confidence = 100%]
ID  Items                          k
1   {Milk, Bread, Meat}            3
2   {Sugar, Bread, Eggs}           3
3   {Milk, Sugar, Bread, Butter}   4
4   {Bread, Butter}                2
34/59
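These two numbers can be computed directly from the table above; a short Python sketch:

```python
# Support and confidence of {Milk} -> {Bread} over the four transactions
# from the table above.
transactions = [
    {"Milk", "Bread", "Meat"},
    {"Sugar", "Bread", "Eggs"},
    {"Milk", "Sugar", "Bread", "Butter"},
    {"Bread", "Butter"},
]
X, Y = {"Milk"}, {"Bread"}
n_X  = sum(X <= t for t in transactions)        # transactions containing X
n_XY = sum((X | Y) <= t for t in transactions)  # transactions containing X and Y

support = n_XY / len(transactions)              # 2/4 = 50%
confidence = n_XY / n_X                         # 2/2 = 100%
print(support, confidence)
```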
Association Rules
An association rule has the form X → Y with X ∩ Y = ∅
Rules must satisfy minimum support (minsup) and minimum confidence (minconf)
Note that Y → X is not the same rule as X → Y
35/59
Example
{Milk, Sugar} → {Bread}
36/59
Frequent Pattern
{Milk, Bread} has support = 2
An itemset of size k has 2^k − 1 nonempty subsets; naively counting supports over N transactions and A items takes on the order of (2^k − 1) × N × A comparisons
 So we need an algorithm to decrease that → Apriori
37/59
Apriori
Algorithm  Used to find frequent itemsets
 Uses candidate generation
 Uses prior knowledge
 Level-wise search
 Uses minimum support
 Apriori property: all nonempty subsets of a frequent
itemset must also be frequent.
At level k, frequent k-itemsets are found;
these are then used to explore (k+1)-itemsets (see the sketch below)
38/59
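A compact sketch of the level-wise search in plain Python, with transactions represented as sets; the transaction list at the bottom is made up for illustration:

```python
# Apriori: find frequent 1-itemsets, then repeatedly join frequent k-itemsets
# into (k+1)-candidates, prune with the Apriori property, and count supports.
from itertools import combinations

def apriori(transactions, min_support):
    def count(itemset):
        return sum(itemset <= t for t in transactions)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if count(frozenset([i])) >= min_support}
    frequent, k = set(level), 2
    while level:
        # join step: unions of two frequent (k-1)-itemsets of size exactly k
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c for c in candidates if count(c) >= min_support}
        frequent |= level
        k += 1
    return frequent

txns = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
        {"I1", "I2", "I4"}, {"I1", "I3"}]            # made-up transactions
print(apriori(txns, min_support=2))
```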
Apriori Algorithm
Step by Step
39/59
Example
11 items
5 transactions
40/59
Example …
Without minsup there are 55 candidate 2-itemsets;
with minsup only 10 remain
41/59
Example …
Combine frequent k-itemsets to
generate (k+1)-itemset candidates:
{I1, I4} + {I2, I4} → {I1, I2, I4} ✓
{I1, I4} + {I2, I5} → {I1, I2, I4, I5} ✗ (size k + 2, rejected)
42/59
Without the Apriori algorithm:
C(11,1) + C(11,2) + C(11,3) = 11 + 55 + 165 = 231 candidate itemsets
Using the Apriori algorithm:
11 + 10 + 6 = 27 candidates
43/59
Association
Rules
 Once the frequent patterns are found, it is time to generate association rules
 Each frequent k-itemset yields 2^k − 2 candidate rules
(every nonempty proper subset can serve as the antecedent X)
 Each candidate must then pass the minimum-confidence check (see the sketch below)
44/59
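A sketch of this rule-generation step, reusing the sets-as-transactions representation from the earlier sketches. It assumes the given itemset is frequent, so all subset counts are nonzero:

```python
# Generate the 2^k - 2 candidate rules X -> Y from one frequent k-itemset
# and keep those meeting the minimum confidence.
from itertools import combinations

def rules_from(itemset, transactions, min_conf):
    def count(s):
        return sum(s <= t for t in transactions)

    kept = []
    for r in range(1, len(itemset)):          # every nonempty proper subset as X
        for X in map(frozenset, combinations(itemset, r)):
            conf = count(itemset) / count(X)  # confidence of X -> (itemset - X)
            if conf >= min_conf:
                kept.append((set(X), set(itemset - X), conf))
    return kept

txns = [{"Milk", "Bread", "Meat"}, {"Sugar", "Bread", "Eggs"},
        {"Milk", "Sugar", "Bread", "Butter"}, {"Bread", "Butter"}]
print(rules_from(frozenset({"Milk", "Bread"}), txns, min_conf=0.7))
```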
Regression
Classification
Clustering
Association
Rules
Regression
45/59
Regression
 While classification predicts categorical labels,
regression is used for prediction
 of numeric, continuous values
 It models the relation between independent and dependent
variables
46/59
Linear Regression
Equation: y = w₀ + w₁x₁ + … + w_p x_p
Simple equation: y = w₀ + w₁x
Error: SSE = Σᵢ (yᵢ − (w₀ + w₁xᵢ))²
Least-squares estimates of w₀ and w₁: w₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)², w₀ = ȳ − w₁x̄
47/59
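A sketch that plugs small made-up numbers into the closed-form least-squares estimates above:

```python
# Fitting y = w0 + w1*x by least squares using the closed-form estimates.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up predictor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # made-up responses

w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print(w0, w1)               # fitted intercept and slope
print(w0 + w1 * 6.0)        # predict a continuous value for x = 6
```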
Example
48/59
Regression
(continued)
 The class label should be linearly related to the attribute;
when this is unclear, use the correlation coefficient to check
 Some nonlinear regressions can be converted to linear form
 Logistic regression is a generalized linear model that
models class probability (see the sketch below)
 Decision trees become regression trees by predicting
continuous values rather than class labels
49/59
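For the logistic-regression point above, a minimal scikit-learn sketch on made-up one-feature data, showing that it outputs class probabilities rather than a numeric value:

```python
# Logistic regression: a generalized linear model that predicts the
# probability of a class label via the logistic (sigmoid) function.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one made-up predictor
y = np.array([0, 0, 1, 1])                   # binary class labels

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[2.5]]))          # [P(y=0), P(y=1)] for x = 2.5
```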
Complementary
Information
Data Mining Tools, Usage and Types
3
50/59
Tools
Usage
Types
51/59
DM Tools  Business Software:
 IBM Intelligent Miner
 SAS Enterprise Miner
 Microsoft SQL Server 2005
 SPSS Clementine
 …
 Open Source Software:
 Rapid-I RapidMiner
 Weka
 …
52/59
DM Usage
 Bank
 Financial issues
 High-quality data
 Granting loans
 Financial services
 Reducing risks
 Money
Laundering and
Financial damages
 Marketing
 Massive data
 Increasing fast
 E-commerce
 Shopping patterns
 Service quality
 Customer
satisfaction
 Advertising
 Discount
 Bioinformatics
 Laboratory
information
 Protein structures
(Gene)
 Massive number of
sequences
 Need for computer
algorithms to
analyze them
 Accurate
53/59
DM Types
Text Mining
• No tables
• Books, articles, texts
• Semi-structured data
• Information retrieval and
databases
• Key words
• Massive data and text
Web Mining
• Massive Unstructured,
Semi-structured,
Multimedia data
• Links, advertisements
• Poor quality, changing
• Web structure, Content,
Web usage Mining
• Search engines
Multimedia Mining
• Voice, image, video
• Nature of the data
• Key words or
Patterns and shapes
Graph Mining
• Electronic circuits,
image, web and …
• Graph search algorithm
• Difference, index
• Social network
analysis
Spatial Mining
• Medical images,
VLSI layers
• Location based
• Efficient techniques
54/59
Conclusion
Challenges and Conclusion.
4
55/59
Challenges
 Individual or single-purpose systems
 Scalable and interactive systems
 Standardization of data mining languages
 Complex data
 Distributed and real-time data mining
56/59
Review
Introduction
• Data Mining
• Knowledge Discovery
• DM Methods
DM Methods
• Classification
• Clustering
• Association Rules
• Regression
Complementary Information
• DM Tools
• DM Usage
• DM Types
Conclusion
• Challenges
57/59
References
 C. M. Bishop; Pattern Recognition and Machine Learning; Springer, 2006.
 Jiawei Han, Micheline Kamber; Data Mining: Concepts and Techniques, Second Edition; Elsevier, 2006.
 J. Fürnkranz et al.; Foundations of Rule Learning (Cognitive Technologies); Springer-Verlag Berlin Heidelberg, 2012.
 Abraham Silberschatz, Henry F. Korth, S. Sudarshan; Database System Concepts, Sixth Edition; McGraw-Hill, 2010.
 Mehdi Esmaeili; Data Mining Concepts and Techniques (in Persian); Niaz Danesh, Tir 1391 (July 2012).
58/59
Thanks!
Any questions?
You can find us at z.poorbahman1@yahoo.com & bs.motavali@yahoo.com
😉
59/59