Data Mining:
Classification
Classification and Prediction
• What is classification? What is prediction?
• Issues regarding classification and prediction
• Classification by decision tree induction
Classification vs. Prediction
• Classification:
– predicts categorical class labels
– constructs a model from the training set and the class labels
of a classifying attribute, then uses that model to classify new data
• Prediction:
– models continuous-valued functions, i.e., predicts
unknown or missing values
• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
Classification Learning: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes, one of which
is the class
• Find a model for the class attribute as a function
of the values of the other attributes
• Goal: previously unseen records should be
assigned a class as accurately as possible
– Use a test set to estimate the accuracy of the model
– Often, the given data set is divided into training and test
sets, with the training set used to build the model and the test
set used to validate it.
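The division into training and test sets described above can be sketched as follows. This is a minimal illustration, not the lecture's own procedure: the function name `holdout_split`, the 30% test fraction, and the fixed seed are all assumptions chosen for the example.

```python
import random


def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of records and split it into (training, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled records form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


# Ten placeholder record ids stand in for real records.
records = list(range(10))
training_set, test_set = holdout_split(records, test_fraction=0.3, seed=42)
```

Shuffling before splitting avoids a test set that only reflects the order in which records were collected; the two lists never share a record.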
Illustrating Classification Learning
[Figure: the Training Set is fed to a Learning algorithm, which learns a Model (induction); the Model is then applied to the Test Set (deduction).]

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
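The induction and deduction steps in the figure can be sketched with the records from the slide. The learner here is a deliberately trivial majority-class baseline, used only as a stand-in because the slide does not name a learning algorithm; the function names are assumptions made for this example.

```python
from collections import Counter

# Training records from the slide: (Attrib1, Attrib2, Attrib3 in thousands, Class).
TRAINING_SET = [
    ("Yes", "Large", 125, "No"),
    ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),
    ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),
    ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"),
    ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),
    ("No", "Small", 90, "Yes"),
]

# Unlabeled records from the slide: (Attrib1, Attrib2, Attrib3 in thousands).
TEST_SET = [
    ("No", "Small", 55),
    ("Yes", "Medium", 80),
    ("Yes", "Large", 110),
    ("No", "Small", 95),
    ("No", "Large", 67),
]


def learn_majority_model(training_set):
    """Induction (baseline stand-in): return the most frequent class label."""
    counts = Counter(record[-1] for record in training_set)
    return counts.most_common(1)[0][0]


def apply_model(model, records):
    """Deduction: assign the learned label to every unlabeled record."""
    return [model for _ in records]


model = learn_majority_model(TRAINING_SET)
predictions = apply_model(model, TEST_SET)
```

A real learner would use the attribute values; the baseline ignores them, which makes it a useful floor against which any learned model can be compared.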
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions
as legitimate or fraudulent
• Classifying secondary structures of protein
as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance,
weather, entertainment, sports, etc.
Classification - A Two-Step Process
• Model construction: describing a set of predetermined classes
– Building the Classifier or Model
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
– Using Classifier for Classification
– Estimate accuracy of the model
• The known label of each test sample is compared with the classified result
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• The test set must be independent of the training set; otherwise the
accuracy estimate is over-optimistic and over-fitting goes undetected
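The accuracy rate defined above reduces to a short calculation. The sketch below assumes labels are given as two parallel lists; the function name and the sample labels are illustrative only.

```python
def accuracy_rate(known_labels, predicted_labels):
    """Return the percentage of test samples whose predicted label matches the known label."""
    if len(known_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not known_labels:
        raise ValueError("at least one test sample is required")
    correct = sum(
        1 for known, predicted in zip(known_labels, predicted_labels)
        if known == predicted
    )
    return 100.0 * correct / len(known_labels)


# Illustrative labels: three of the four predictions match.
known = ["No", "Yes", "No", "No"]
predicted = ["No", "No", "No", "No"]
rate = accuracy_rate(known, predicted)
```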
Classification Process (1): Model Construction
Example: Loan application
The data classification process: Learning: Training data are analyzed by a
classification algorithm. Here, the class label attribute is loan decision, and the
learned model or classifier is represented in the form of classification rules.
Classification Process (2): Use the Model in Prediction
Classification: Test data are used to estimate the accuracy of the classification
rules. If the accuracy is considered acceptable, the rules can be applied to the
classification of new data tuples.
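The two steps of the loan example can be sketched as an ordered rule list that is first checked against test data and then applied to a new tuple. The slides do not list the actual rules, so the attributes (`age`, `income`), the rules, the test tuples, and the 70% acceptance threshold below are all hypothetical.

```python
# Hypothetical classification rules: (condition, loan decision), tried in order.
RULES = [
    (lambda t: t["income"] == "high", "safe"),
    (lambda t: t["age"] == "youth", "risky"),
]
DEFAULT_DECISION = "safe"  # used when no rule fires (an assumption)


def classify(applicant):
    """Return the decision of the first rule whose condition holds."""
    for condition, decision in RULES:
        if condition(applicant):
            return decision
    return DEFAULT_DECISION


# Hypothetical labeled test tuples: (attributes, known loan decision).
TEST_DATA = [
    ({"age": "youth", "income": "low"}, "risky"),
    ({"age": "senior", "income": "high"}, "safe"),
    ({"age": "middle_aged", "income": "low"}, "risky"),
    ({"age": "youth", "income": "high"}, "safe"),
]

# Step 2a: estimate accuracy on the test data.
correct = sum(1 for applicant, label in TEST_DATA if classify(applicant) == label)
accuracy = correct / len(TEST_DATA)

# Step 2b: apply the rules to new tuples only if the accuracy is acceptable.
ACCEPTABLE_ACCURACY = 0.7
rules_accepted = accuracy >= ACCEPTABLE_ACCURACY
new_decision = classify({"age": "youth", "income": "medium"}) if rules_accepted else None
```

Keeping the rules as data rather than nested conditionals makes it easy to inspect which rule fired, which is one reason rule-based classifiers are considered interpretable.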
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim
is to establish the existence of classes or
clusters in the data
Issues (1): Data Preparation
• Data cleaning
– Preprocess data in order to reduce noise and handle
missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
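Two of the preparation steps above, handling missing values and normalizing, can be sketched for a single numeric attribute. Mean imputation and min-max scaling are only one common choice each, not something the slides prescribe, and the sample values are made up for the example.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot impute when every value is missing")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly onto the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if lowest == highest:
        return [0.0 for _ in values]  # a constant attribute carries no spread
    return [(v - lowest) / (highest - lowest) for v in values]


# Made-up attribute values with one missing entry.
incomes = [60.0, None, 100.0, 80.0]
cleaned = fill_missing_with_mean(incomes)
normalized = min_max_normalize(cleaned)
```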
Issues (2): Evaluating Classification Methods
• Predictive accuracy
• Speed
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability:
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
The problem
• Given a set of training cases/objects and their attribute
values, try to determine the target attribute value of new
examples.
– Classification
– Prediction
• Use a decision tree to predict categories for new events.
• Use training data to build the decision tree.
[Figure: Training Events and Categories are used to build the Decision Tree; New Events are passed through the Decision Tree to obtain a Category.]
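Using a decision tree to predict the category of a new event can be sketched as a walk from the root to a leaf. The tree below is written by hand for illustration and is not learned from the training data; its attribute names are borrowed from the earlier example table, and the nested-dictionary layout is an assumption.

```python
# Hand-written illustrative tree: inner nodes are dicts, leaves are category labels.
TREE = {
    "attribute": "Attrib1",
    "branches": {
        "Yes": "No",
        "No": {
            "attribute": "Attrib2",
            "branches": {"Small": "Yes", "Medium": "No", "Large": "Yes"},
        },
    },
}


def predict(tree, event):
    """Follow the branch matching the event's attribute value until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        value = event[node["attribute"]]
        node = node["branches"][value]
    return node


category_a = predict(TREE, {"Attrib1": "Yes", "Attrib2": "Medium"})
category_b = predict(TREE, {"Attrib1": "No", "Attrib2": "Small"})
```

An attribute value with no matching branch raises `KeyError` here; a fuller implementation would need a policy for unseen values.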

Lect8 Classification & prediction
