Module 3
• Introduction to Classification and Prediction
• Issues regarding classification and prediction
• Decision Tree: ID3, C4.5
• Naive Bayes Classifier
• Classification and prediction are two forms of data analysis that extract models describing important data classes or predicting future data trends.
• Classification models (classifiers) predict categorical labels (discrete, unordered).
• Prediction models (predictors) model continuous-valued functions.
• For example, we can build a classification model to categorize bank loan applications as either safe or risky, while a prediction model may be built to predict the expenditures of potential customers on computer equipment given their income and occupation.
Classification and prediction have numerous applications, including:
• fraud detection
• performance prediction
• medical diagnosis
What Is Classification?
• Data classification is a two-step process, consisting of a
learning step (where a classification model is constructed) and
a classification step (where the model is used to predict class
labels for given data).
• In the first step, a classifier is built describing a predetermined
set of data classes or concepts. This is the learning step (or
training phase), where a classification algorithm builds the
classifier by analyzing or “learning from” a training set made
up of database tuples and their associated class labels.
• A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes, A1, A2, ..., An, respectively.
• Each tuple, X, is assumed to belong to a predefined
class as determined by another database attribute
called the class label attribute.
• The class label attribute is discrete-valued and
unordered. It is categorical (or nominal) in that each
value serves as a category or class.
• The individual tuples making up the training set are
referred to as training tuples and are randomly
sampled from the database under analysis.
• Data tuples can be referred to as samples,
examples, instances, data points, or objects.
• Because the class label of each training tuple is
provided, this step is also known as supervised
learning (i.e., the learning of the classifier is
“supervised” in that it is told to which class each
training tuple belongs).
• This first step of the classification process can also be viewed as
the learning of a mapping or function, y = f (X), that can predict
the associated class label y of a given tuple X.
• Typically, this mapping is represented in the form of
classification rules, decision trees, or mathematical formulae.
• In our example, the mapping is represented as classification rules that identify loan applications as being either safe or risky.
• The rules can be used to categorize future data tuples, as well
as provide deeper insight into the data contents. They also
provide a compressed data representation.
“What about classification accuracy?”
• In the second step the model is used for classification.
• First, the predictive accuracy of the classifier is estimated.
• A test set is used, made up of test tuples and their associated
class labels.
• They are independent of the training tuples, meaning that they were not used to construct the classifier.
• The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier (a sketch of this computation follows below).
• The associated class label of each test tuple is compared with
the learned classifier’s class prediction for that tuple.
• If the accuracy of the classifier is considered acceptable, the
classifier can be used to classify future data tuples for which the
class label is not known.
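A minimal sketch of this accuracy estimate (the names classifier and test_set are illustrative assumptions, not from the slides):

```python
# Minimal sketch: estimating classifier accuracy on an independent test set.
# `classifier` is any callable mapping an attribute tuple to a class label;
# `test_set` is a list of (tuple, true_label) pairs.
def accuracy(classifier, test_set):
    correct = sum(1 for x, label in test_set if classifier(x) == label)
    return 100.0 * correct / len(test_set)   # percentage correctly classified
```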
What Is Prediction?
• Data prediction is a two-step process, similar to that of data classification.
– First, construct a model
– Second, use model to predict unknown value
• The attribute for which values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered). This attribute can be referred to simply as the predicted attribute.
• Suppose that, in our example, we instead wanted to
predict the amount that would be “safe” for the bank
to loan an applicant.
Issues regarding classification and prediction
1. Preparing data for classification and prediction
• Data cleaning
– Preprocess data in order to reduce noise and handle
missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
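As one hedged illustration of the transformation step, a minimal sketch of min-max normalization (the income values are hypothetical):

```python
# Minimal sketch of min-max normalization, one common data transformation
# used to prepare numeric attributes for classification or prediction.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

incomes = [30000, 40000, 75000, 120000]   # hypothetical attribute values
print(min_max_normalize(incomes))         # [0.0, 0.111..., 0.5, 1.0]
```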
• Prediction can also be viewed as a mapping or function, y = f(X), where X is the input (e.g., a tuple describing a loan applicant) and the output y is a continuous or ordered value (such as the predicted amount that the bank can safely loan the applicant).
• As with classification, the training set used to build a predictor
should not be used to assess its accuracy.
• An independent test set should be used instead.
• The accuracy of a predictor is estimated by computing an error
based on the difference between the predicted value and the
actual known value of y for each of the test tuples, X.
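A minimal sketch of such an error estimate, using mean absolute error as one common choice of error measure (names are illustrative):

```python
# Minimal sketch: estimating a predictor's error on an independent test set
# as the mean absolute difference between predicted and actual y values
# (squared error is another common choice).
def mean_absolute_error(predictor, test_set):
    return sum(abs(predictor(x) - y) for x, y in test_set) / len(test_set)
```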
2. Comparing Classification and Prediction Methods
• Accuracy: This refers to the ability of the model to correctly classify or predict the class label of new or previously unseen data.
• Speed: This refers to the computational costs involved in generating and using the model.
• Robustness: This is the ability of the model to make correct predictions given noisy data or data with missing values.
• Scalability: This refers to the ability to construct the model efficiently given large amounts of data.
• Interpretability: This refers to the level of understanding and insight that is provided by the model.
3. Issues in Classification
• Missing data: Missing data values cause problems during both the training phase and the classification process itself.
• There are many approaches to handling missing data:
– Ignore the missing data.
– Assume a value for the missing data.
– Assume a special value for the missing data.
DECISION TREE-BASED ALGORITHMS
• The decision tree approach is most useful in
classification problems.
• With this technique, a tree is constructed to model
the classification process.
• Once the tree is built, it is applied to each tuple in
the database and results in a classification for that
tuple.
• There are two basic steps in the technique: building
the tree and applying the tree to the database.
• Definition: Given a database D = {t1, ..., tn}, where ti = (ti1, ..., tih), and a database schema containing the attributes {A1, A2, ..., Ah}. Also given is a set of classes C = {C1, ..., Cm}.
• A decision tree (DT) or classification tree is a tree associated with D that has the following properties:
• Each internal node is labeled with an attribute, Ai.
• Each arc is labeled with a predicate that can be applied to the attribute associated with its parent.
• Each leaf node is labeled with a class, Cj.
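A minimal sketch of this structure (assuming categorical attributes, so each arc predicate is an equality test on the parent's attribute, and tuples represented as attribute-to-value dictionaries):

```python
# Minimal sketch of the decision tree structure defined above.
from dataclasses import dataclass, field

@dataclass
class Leaf:
    label: str                                    # class Cj

@dataclass
class Node:
    attribute: str                                # splitting attribute Ai
    children: dict = field(default_factory=dict)  # arc predicate value -> subtree

def classify(tree, tuple_):
    """Apply the DT to a tuple: follow the arc whose predicate holds."""
    while isinstance(tree, Node):
        tree = tree.children[tuple_[tree.attribute]]
    return tree.label
```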
• Solving the classification problem using decision
trees is a two-step process:
1. Decision tree induction: Construct a DT using
training data.
2. For each ti ∈ D, apply the DT to determine its class.
Advantages
• DTs certainly are easy to use and efficient.
• Rules can be generated that are easy to interpret
and understand.
• They scale well for large databases because the
tree size is independent of the database size.
Disadvantages
• Do not easily handle continuous data.
• Handling missing data is difficult because correct branches in the tree cannot be taken.
• Since the DT is constructed from the training data,
overfitting may occur. This can be overcome via tree
pruning.
• Finally, correlations among attributes in the database
are ignored by the DT process.
Algorithm
• This recursive algorithm builds the tree in a top-down
fashion by examining the training data.
• Using the initial training data, the "best" splitting
attribute is chosen first.
• Algorithms differ in how they determine the "best
attribute" and its "best predicates" to use for splitting.
• Once this has been determined, the node and its arcs
are created and added to the created tree.
• The algorithm continues recursively by adding new
subtrees to each branching arc.
• The algorithm terminates when some "stopping criterion" is reached.
• Again, each algorithm determines when to stop the
tree differently.
• One simple approach would be to stop when the
tuples in the reduced training set all belong to the
same class.
• This class is then used to label the leaf node
created.
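A minimal sketch of this generic top-down procedure, reusing the Leaf and Node classes sketched earlier (the choose_best_attribute parameter is left abstract, since algorithms differ in how they pick the "best" split):

```python
# Minimal sketch of generic top-down decision tree induction.
from collections import Counter

def build_tree(training, attributes, choose_best_attribute):
    labels = [label for _, label in training]
    # Stopping criterion: all tuples in the reduced training set belong to the
    # same class (or no attributes remain); label the leaf with the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Leaf(Counter(labels).most_common(1)[0][0])
    best = choose_best_attribute(training, attributes)
    node = Node(best)
    remaining = [a for a in attributes if a != best]
    # Create one arc, and recursively one subtree, per value of the split attribute.
    for value in {x[best] for x, _ in training}:
        subset = [(x, y) for x, y in training if x[best] == value]
        node.children[value] = build_tree(subset, remaining, choose_best_attribute)
    return node
```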
The following issues are faced by most DT algorithms:
• Choosing splitting attributes
• Ordering of splitting attributes
• Splits
• Tree structure
• Training data
• Stopping criteria
• Pruning
ID3
• The ID3 technique for building a decision tree is based on information theory and attempts to minimize the expected number of comparisons.
• The basic strategy used by ID3 is to choose splitting attributes with the highest
information gain first.
• The concept used to quantify information is called entropy.
• Entropy is used to measure the amount of uncertainty or surprise or randomness
in a set of data.
• Certainly, when all data in a set belong to a single class, there is no uncertainty. In
this case the entropy is zero.
• The objective of decision tree classification is to iteratively partition the given
data set into subsets where all elements in each final subset belong to the same
class.
DEFINITION
Given probabilities p1, p2, ..., ps, where $\sum_{i=1}^{s} p_i = 1$, entropy is defined as
$H(p_1, p_2, \ldots, p_s) = \sum_{i=1}^{s} p_i \log(1/p_i)$
• Each step in ID3 chooses the state that orders splitting the
most.
• A database state is completely ordered if all tuples in it are
in the same class.
• ID3 chooses the splitting attribute with the highest gain in
information, where gain is defined as the difference
between how much information is needed to make a
correct classification before the split versus how much
information is needed after the split.
• This is calculated by determining the differences
between the entropies of the original dataset and the
weighted sum of the entropies from each of the
subdivided datasets.
• The ID3 algorithm calculates the gain of a particular split S, which divides D into subsets D1, ..., Ds, by the following formula:
$Gain(D, S) = H(D) - \sum_{i=1}^{s} P(D_i) H(D_i)$
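A minimal sketch of these two quantities (training tuples are assumed to be (attribute-dict, class-label) pairs; base-2 logarithms are used, as is common for ID3):

```python
# Minimal sketch of entropy H(D) and information gain for a categorical split.
import math
from collections import Counter

def entropy(tuples):
    counts = Counter(label for _, label in tuples)
    total = len(tuples)
    # sum of p * log2(1/p) over the class probabilities p
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def gain(tuples, attribute):
    """H(D) minus the weighted sum of the entropies of the split's subsets."""
    subsets = {}
    for x, y in tuples:
        subsets.setdefault(x[attribute], []).append((x, y))
    weighted = sum(len(s) / len(tuples) * entropy(s) for s in subsets.values())
    return entropy(tuples) - weighted
```

Plugged into the earlier build_tree sketch, ID3's heuristic is simply the attribute with the largest gain.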
Naive Bayes Classifier
• Bayesian classifiers predict class membership probabilities,
such as the probability that a given tuple belongs to a
particular class.
• Bayesian classification is based on Bayes’ theorem.
• The naive Bayesian classifier is a simple Bayesian classifier.
• Bayesian classifiers have also exhibited high accuracy and
speed when applied to large databases.
Bayes’ Theorem
• Let X be a data tuple.
• In Bayesian terms, X is considered “evidence.”
• It is described by measurements made on a set of n attributes.
• Let H be some hypothesis, such as that the data tuple X
belongs to a specified class C.
For classification problems:
• Determine P(H|X), the probability that the hypothesis H holds
given the “evidence” or observed data tuple X.
• In other words, it is the probability that tuple X belongs to
class C, given the attribute description of X.
Bayes’ Theorem
• P(H|X) is the posterior probability, or a posteriori
probability, of H conditioned on X.
• E.g., suppose our data tuples are confined to customers described by the attributes age and income, and that X is a 35-year-old customer with an income of $40,000. Suppose that H is the hypothesis that our customer will buy a computer.
• Then P(H|X) reflects the probability that customer X
will buy a computer given that we know the
customer’s age and income.
• P(H) is the prior probability, or a priori probability, of H.
• E.g., this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter.
• The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.
• P(X|H) is the posterior probability of X conditioned
on H.
• E.g., it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.
• P(X) is the prior probability of X.
• Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.
• “How are these probabilities estimated?” P(H), P(X|H), and P(X) may be estimated from the given data.
• Bayes’ theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X):
$P(H|X) = \frac{P(X|H) P(H)}{P(X)}$
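A tiny worked instance with hypothetical numbers (none of these figures come from the slides): suppose P(H) = 0.5, P(X|H) = 0.2, and P(X) = 0.15.

```python
# Hypothetical illustration of Bayes' theorem: half of all customers buy a
# computer, 20% of buyers match X's profile, and 15% of all customers do.
p_h, p_x_given_h, p_x = 0.5, 0.2, 0.15
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 3))   # 0.667
```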
Naive Bayes Classifier
• As P(X) is constant for all classes, only P(X|Ci)P(Ci)
need be maximized.
• Note that the class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|, where |Ci,D| is the number of training tuples of class Ci in D.
• Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption of class-conditional independence is made: the values of the attributes are presumed conditionally independent of one another, given the class label of the tuple. Thus,
$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$
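A minimal sketch of the resulting classifier for categorical attributes, using the estimates above, P(Ci) = |Ci,D|/|D| and per-attribute relative frequencies for P(xk|Ci) (the helper names are illustrative):

```python
# Minimal sketch of a naive Bayes classifier over categorical attributes.
from collections import Counter, defaultdict

def train_naive_bayes(training):
    class_counts = Counter(label for _, label in training)
    value_counts = defaultdict(Counter)   # (class, attribute) -> value counts
    for x, label in training:
        for attr, value in x.items():
            value_counts[(label, attr)][value] += 1
    priors = {c: n / len(training) for c, n in class_counts.items()}
    return priors, value_counts, class_counts

def predict(model, x):
    priors, value_counts, class_counts = model
    def score(c):
        # P(X) is constant across classes, so maximize P(X|Ci) * P(Ci).
        p = priors[c]
        for attr, value in x.items():
            p *= value_counts[(c, attr)][value] / class_counts[c]
        return p
    return max(priors, key=score)
```

In practice a Laplacian correction (adding one to each count) is applied so that a single zero count does not zero out the whole product.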