Data mining
DATA MINING
 The process of semiautomatically analyzing large
databases to find useful patterns.
 Knowledge discovered can be represented by:
 A set of rules (degree of support and confidence)
 Equations relating different variables to each other
 Other mechanisms for predicting outcomes
 Most widely used applications:
 Prediction: e.g., whether a person is a good credit risk.
 Association: books that tend to be bought together.
CLASSIFICATION
 Given that items belong to one of several classes,
and given past instances (training instances) of
items along with the classes to which they belong,
the problem is to predict the class to which a new
item belongs.
CLASSIFICATION
 Classification can be done by finding rules that partition
the given data into disjoint groups.
 A case study: Credit-card company
 The company assigns a credit-worthiness level of excellent,
good, average, or bad to each of a sample set of current customers.
 Then it attempts to find rules that classify its current
customers into those classes.
CLASSIFICATION
 For example:
∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
∀ person P, P.degree = bachelors or
(P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
The process of building a classifier starts from a sample of
data: the training set.
For each tuple in the training set, the class to which the tuple
belongs is already known.
There are several ways of building a classifier…
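As a minimal illustration (not part of the original slides), the two example rules above can be written directly as a small Python function; the field names degree and income and the fallback class are assumptions made for this sketch.

def classify_credit(person):
    """Classify a customer dict with 'degree' and 'income' keys using the two slide rules."""
    if person["degree"] == "masters" and person["income"] > 75_000:
        return "excellent"
    if person["degree"] == "bachelors" or 25_000 <= person["income"] <= 75_000:
        return "good"
    return "unclassified"   # the slides give no rule for the remaining classes

print(classify_credit({"degree": "masters", "income": 90_000}))    # excellent
print(classify_credit({"degree": "bachelors", "income": 40_000}))  # good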
DECISION-TREE CLASSIFIERS
 Each leaf node has an associated class
 Each internal node has a predicate (or more generally, a
function)
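A minimal sketch of this structure in Python, assuming binary splits (one child for the predicate being true, one for false); the field names and the tiny example tree are illustrative, not from the slides.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    label: Optional[str] = None                          # set only on leaf nodes
    predicate: Optional[Callable[[dict], bool]] = None   # set only on internal nodes
    true_child: Optional["Node"] = None
    false_child: Optional["Node"] = None

def classify(node: Node, item: dict) -> str:
    if node.label is not None:                           # leaf: return its associated class
        return node.label
    nxt = node.true_child if node.predicate(item) else node.false_child
    return classify(nxt, item)

tree = Node(predicate=lambda p: p["income"] > 75_000,
            true_child=Node(label="excellent"),
            false_child=Node(label="good"))
print(classify(tree, {"income": 90_000}))                # excellent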
BUILDING DECISION-TREE CLASSIFIERS
 The most common way: a greedy algorithm
 Works recursively, starting at the root with all training
instances associated with it, and building the tree
downward.
 At each node, if all (or almost all) training instances
associated with it belong to the same class => the
node becomes a leaf node associated with that class.
 Otherwise, a partitioning attribute and partitioning
condition must be selected to create child nodes.
BEST SPLITS
 To judge the benefit of picking a particular attribute
and condition for partitioning of the data at a node,
we measure the purity of the data at the children
resulting from partitioning by that attribute.
 The attribute and condition that result in the
maximum purity are chosen.
 The purity of a set S of training instances can be
measured in several ways…
BEST SPLITS
 The information gain due to a particular split of S into S1, S2, ..., Sr:
Information-gain(S, {S1, S2, ..., Sr}) = purity(S) − purity(S1, S2, ..., Sr)
 The information content of a particular split can be
defined in terms of entropy as:
Information-content(S, {S1, S2, ..., Sr}) = − Σ (i = 1..r) (|Si| / |S|) · log2(|Si| / |S|)
 The best split for an attribute is the one that gives the
maximum information-gain ratio, defined as:
Information-gain(S, {S1, S2, ..., Sr}) / Information-content(S, {S1, S2, ..., Sr})
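The following hedged Python sketch evaluates these formulas with entropy as the purity measure (the slides allow several measures); the class labels and the example split into S1 and S2 are made up for illustration.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum of p_i * log2(p_i)."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, parts):
    """purity(S) minus the size-weighted purity of the child sets S1..Sr."""
    n = len(labels)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in parts)

def information_content(labels, parts):
    """-sum of (|Si|/|S|) * log2(|Si|/|S|) over the child sets."""
    n = len(labels)
    return -sum(len(p) / n * math.log2(len(p) / n) for p in parts)

def gain_ratio(labels, parts):
    return information_gain(labels, parts) / information_content(labels, parts)

S  = ["good", "good", "excellent", "bad", "good", "excellent"]
S1 = ["excellent", "excellent"]            # e.g. instances with income > 75,000
S2 = ["good", "good", "bad", "good"]       # e.g. instances with income <= 75,000
print(information_gain(S, [S1, S2]), gain_ratio(S, [S1, S2]))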
FINDING BEST SPLITS
 How an attribute is split depends on its type.
Continuous values can be ordered, like numbers
(income)
Categorical values have no meaningful order (degree)
 For a continuous-valued attribute, to find the best binary split:
first sort the attribute values in the training instances,
then compute the information gain obtained by splitting
at each candidate value (training instance values 1, 10, 15, 25 => split
points 1, 10, 15); see the sketch below.
 For a categorical attribute, a child can be created for each value.
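A hedged sketch of this procedure for a continuous attribute, using entropy-based information gain; the split condition "value <= v" is an assumption, and the candidate split points reproduce the slide's example values.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, parts):
    n = len(labels)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in parts)

def best_binary_split(values, labels):
    """Try 'attribute <= v' at every candidate point; return (v, gain) of the best one."""
    pairs = sorted(zip(values, labels))
    candidates = sorted(set(values))[:-1]        # 1, 10, 15, 25 -> split points 1, 10, 15
    best = None
    for v in candidates:
        left  = [lbl for val, lbl in pairs if val <= v]
        right = [lbl for val, lbl in pairs if val > v]
        gain = information_gain(labels, [left, right])
        if best is None or gain > best[1]:
            best = (v, gain)
    return best

print(best_binary_split([1, 10, 15, 25], ["bad", "good", "good", "excellent"]))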
DECISION-TREE CONSTRUCTION ALGORITHM
 Evaluate different attributes and different
partitioning conditions, and pick the one that results
in maximum information-gain ratio.
 The same procedure works recursively on each of
the sets resulting from the split.
 The recursion stops when the purity of a set is 0 or
sufficiently high.
DECISION-TREE CONSTRUCTION ALGORITHM
procedure GrowTree(S)
    Partition(S);

procedure Partition(S)
    if (purity(S) > p or |S| < s) then
        return;
    for each attribute A
        evaluate splits on attribute A;
    use the best split found (across all attributes) to partition S into S1, S2, ..., Sr;
    for i = 1, 2, ..., r
        Partition(Si);
For each leaf node we generate a rule: the conjunction of all the split
conditions on the path to the leaf implies the class of the leaf, for example:
degree = masters and income > 75,000 => excellent
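Below is a minimal runnable Python rendering of the GrowTree / Partition pseudocode, not the authors' implementation: the thresholds p and s, entropy-based gain ratio, binary numeric splits, and the data representation (a list of (feature_dict, label) pairs) are all assumptions made for this sketch.

import math
from collections import Counter

P, S_MIN = 0.9, 2       # assumed stopping thresholds: purity(S) > p or |S| < s

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, parts):
    n = len(labels)
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts)
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in parts)
    return gain / split_info

def purity(labels):
    """Fraction of instances in the most common class (1.0 = perfectly pure)."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def partition(instances, attributes):
    """instances: list of (feature_dict, class_label); returns a nested-dict tree."""
    labels = [lbl for _, lbl in instances]
    if purity(labels) > P or len(instances) < S_MIN:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    best = None
    for a in attributes:                         # evaluate splits on each attribute
        for v in sorted({f[a] for f, _ in instances})[:-1]:
            left  = [x for x in instances if x[0][a] <= v]
            right = [x for x in instances if x[0][a] > v]
            g = gain_ratio(labels, [[l for _, l in left], [l for _, l in right]])
            if best is None or g > best[0]:
                best = (g, a, v, left, right)
    if best is None:                             # no usable split: fall back to a leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, a, v, left, right = best
    return {"split": (a, v),
            "le": partition(left, attributes),
            "gt": partition(right, attributes)}

def grow_tree(instances, attributes):
    return partition(instances, attributes)

tree = grow_tree([({"income": 30}, "good"), ({"income": 90}, "excellent"),
                  ({"income": 20}, "bad"),  ({"income": 80}, "excellent")], ["income"])
print(tree)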
OTHER TYPES OF CLASSIFIERS
 There are several types of classifiers:
Neural-net classifiers
Bayesian classifiers
Support vector machine
Bayesian classifiers compute:
P(cj | d) = p(d | cj) · p(cj) / p(d)
where P(cj | d) is the probability that instance d belongs to class cj,
p(d | cj) is the probability of generating instance d given class cj,
p(cj) is the probability of occurrence of class cj, and
p(d) is the probability of instance d occurring.
Naïve Bayesian classifiers assume the attributes of d are independent given the class:
p(d | cj) = p(d1 | cj) * p(d2 | cj) * … * p(dn | cj)
The class with the maximum probability is the predicted class for instance d.
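A small numeric illustration of the naïve Bayes rule above; the two classes, two binary attributes, and all probabilities are made-up values, and p(d) is dropped because it is the same for every class when ranking.

p_c        = {"good": 0.7, "bad": 0.3}      # p(cj)
p_d1_given = {"good": 0.6, "bad": 0.2}      # p(d1 = 1 | cj)
p_d2_given = {"good": 0.3, "bad": 0.8}      # p(d2 = 1 | cj)

def naive_bayes_score(cj, d1, d2):
    """p(d | cj) * p(cj), with p(d | cj) factored attribute by attribute."""
    p1 = p_d1_given[cj] if d1 else 1 - p_d1_given[cj]
    p2 = p_d2_given[cj] if d2 else 1 - p_d2_given[cj]
    return p1 * p2 * p_c[cj]

scores = {c: naive_bayes_score(c, d1=1, d2=0) for c in p_c}
print(max(scores, key=scores.get))           # the class with maximum probability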
THE SUPPORT VECTOR MACHINE (SVM)
 We are given a training set of points whose class is known.
 We need to build a classifier of points using these training
points.
 Suppose there is a line such that all points in class A lie to one side and
all points in class B lie to the other.
 The SVM classifier chooses the line whose distance from the
nearest point in either class is maximum: the maximum
margin line
(Figure: X marks points in class A; O marks points in class B.)
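As a brief sketch of this idea, the snippet below trains a linear (maximum-margin) SVM with scikit-learn, assuming that library is available; the point coordinates for classes A and B are invented for illustration.

from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2],       # class A ("X" points)
     [5, 5], [6, 5], [5, 6]]       # class B ("O" points)
y = ["A", "A", "A", "B", "B", "B"]

clf = SVC(kernel="linear")         # linear kernel -> a separating line (hyperplane)
clf.fit(X, y)
print(clf.predict([[2, 2], [6, 6]]))   # -> ['A' 'B']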
REGRESSION
 Deals with the prediction of a value, rather than a class.
 Given values for a set of variables X1, X2, …, Xn, we wish to
predict the value of the variable Y.
 Linear regression:
Y = a0 + a1*X1 + a2*X2 + … + an*Xn
 Curve fitting (may be only approximate)
 Regression aims to find coefficients that give the best
possible fit.
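A minimal sketch of fitting the coefficients Y = a0 + a1*X1 + a2*X2 by least squares with NumPy (assumed available); the X and Y values are invented for illustration.

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])   # columns X1, X2
Y = np.array([6.0, 5.0, 12.0, 11.0])

A = np.column_stack([np.ones(len(X)), X])        # prepend a column of 1s for a0
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)   # least-squares (best-fit) solution
a0, a1, a2 = coeffs
print(a0, a1, a2)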
VALIDATING A CLASSIFIER
 Measure the classification error of a classifier before deciding to use it.
 A set of test cases where the outcome is already known is used.
 The quality of a classifier can be measured in several ways:
1. Accuracy: (t-pos+t-neg)/(pos+neg)
2. Recall (Sensitivity): t-pos/pos
3. Precision: t-pos/(t-pos+f-pos)
4. Specificity: t-neg/neg
Which of these should be used depends on the needs of the
application.
It's a bad idea to use exactly the same set of test cases to train as
well as to measure the quality of the classifier.
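As a small worked example, the sketch below computes the four measures above from hypothetical true/false positive and negative counts on a held-out test set.

def classifier_quality(t_pos, f_neg, t_neg, f_pos):
    pos, neg = t_pos + f_neg, t_neg + f_pos      # actual positives and negatives
    return {
        "accuracy":    (t_pos + t_neg) / (pos + neg),
        "recall":      t_pos / pos,               # sensitivity
        "precision":   t_pos / (t_pos + f_pos),
        "specificity": t_neg / neg,
    }

print(classifier_quality(t_pos=40, f_neg=10, t_neg=45, f_pos=5))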