Evaluating Classifiers’ Performance
In A Constrained Environment
Anna Olecka
olecka@rutcor.rutgers.edu
Fleet Boston Financial
Database Marketing Department
1075 Main St.
Waltham, MA 02451
RUTCOR
Rutgers University
640 Bartholomew Road
Piscataway, NJ 08854-8003
ABSTRACT
In this paper we focus on the methodology of finding a classifier
with minimal cost in the presence of additional performance
constraints. ROCCH analysis, where accuracy and cost are
intertwined in the solution space, was a revolutionary tool for
two-class problems. We propose an alternative formulation, as an
optimization problem of the kind commonly used in Operations
Research. This approach extends the ROCCH analysis to allow for
locating optimal solutions while outside constraints are present.
Similarly to the ROCCH analysis, we combine cost and class
distribution in defining the objective function. Rather than
focusing on slopes of the edges in the convex hull of the solution
space, however, we treat cost as an objective function to be
minimized over the solution space by selecting the best performing
classifier(s) (one or more vertices of the solution space). The
Linear Programming framework provides a theoretical and
computational methodology for finding the vertex (classifier)
which minimizes the objective function.
1. INTRODUCTION
Consider a problem where classifiers' performance has to be
evaluated taking into account additional constraints related to
error rates. Such constraints often arise from implementation.
They could, for example, involve a limited workforce to resolve
cases of suspected fraud, a limited size of a direct mail campaign,
or restrictions on the cost of incentives for responders. The
application example used throughout this paper involves an
attrition model for a bank: the bank plans a calling campaign to
lower the attrition rate among its customers.
Naturally, one of the implementation concerns is the limited
availability of resources, such as phone representatives.
Traditionally, a model is selected first and summarized by a
modeler into several performance buckets. Then, the
implementation team will "eyeball" the thresholds of the
performance buckets and pick the threshold that most closely
matches the constrained resources. Such business practice often
results in a sub-optimal solution being selected. If the constraints
are known a priori, they can be built into the system evaluating
classifiers' performance. If they are not known until
implementation time, a system analogous to the ROCCH could be
built to select the best classifier at that time. To this end, we
propose an evaluation system that can deal with additional
constraints related to prediction errors. We will also show how to
apply such a system in the business scenario described above.
Provost and Fawcett have shown in [3] that some specific metrics
frequently used in Machine Learning (e.g., workforce constraints
and the Neyman-Pearson decision criterion) are optimized by the
ROCCH method. The optimization approach proposed here extends
their results to any linear constraint. In fact, our approach can be
applied to an unlimited number of constraints, as long as they
remain linear. Numerically finding the intersection points of such
additional constraints with the ROCCH can be computationally
tedious. A mathematical programming approach provides efficient
tools for finding optimal solutions without explicitly calculating
all the intersection points.
ROCCH analysis is a powerful and widely accepted tool for
visualizing and optimizing classification problems with two
classes. It plots performance of all classifiers under consideration
in a two-dimensional space, with false positive rate on one axis
and true positive rate on the other. It measures classifiers’
performance for various costs of misclassification and under
various class distributions. It allows for visual representation of
classifiers and enables quick decisions in choosing the right
classifier for the given costs. The main advantage of this
methodology is its flexibility under varying conditions. One
drawback of this methodology is that it is somewhat rigid in
looking for an optimal classifier. It forms an optimal slope by
considering an optimal combination of class probabilities and
error costs, and then looks for one of two scenarios: either an
edge of the convex hull with a slope equal to the optimal one, or
a vertex between two edges where the difference between the
edge slopes and the optimal slope changes sign.
Similarly to the ROCCH, we construct the convex hull of all
classifiers in the error space. Convexity of the error space is
accomplished in the following steps. First we note that a convex
combination of two classifiers is also a viable classifier. Then, we
remove dominated classifiers. Finally, we impose additional
performance constraints, if any. Those additional constraints also
form a convex region. The intersection of the two regions is a new
convex hull, the potential solution space. Finally, similarly to the
ROCCH, we treat a combination of costs and class probabilities as
an objective function to be minimized over the solution space.
Due to the existence of additional constraints, however, iterating
over slopes of the edges of the convex hull is no longer
computationally efficient, because not all vertices of the convex
hull are known. The theory of linear programming provides
computational tools for finding the optimal solution(s) without
explicit knowledge of all vertices.
2. ROC CONVEX HULL & HYBRID CLASSIFIERS
Suppose we construct a series of k-1 classifiers by varying a
positive decision threshold, starting from "never" and gradually
increasing the probability of decision "Yes". By allowing more
instances to be classified as positive, each new threshold will
increase the number of correctly classified positive instances, but
it may also increase the frequency of false positive classifications.
In the space (FP rate, TP rate), each new classifier will be
positioned up and to the right of the previous one. Connecting
consecutive points with line segments generates a convex set,
where each vertex represents a classifier (Figure 1).
Table 1. Neural net model for bank attrition

    Cutoff / Threshold    FP rate p(Y | n)    TP rate p(Y | y)
    Never                 0                   0
    0.54                  0.06                0.25
    0.48                  0.14                0.43
    0.42                  0.23                0.58
    0.36                  0.33                0.70
    0.30                  0.42                0.82
    0.24                  0.53                0.90
    Always                1                   1
Table 1 represents a set of classifiers obtained from a neural
network model for a bank attrition problem. Figure 1 shows the
resulting convex set.
Figure 1. ROC curve for the neural net attrition model
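The construction is easy to carry out on a scored dataset. The
following is a minimal sketch (not from the paper): scores and
labels are hypothetical arrays of model outputs and true class
indicators, and the thresholds play the role of the cutoffs in
Table 1.

    import numpy as np

    def roc_points(scores, labels, thresholds):
        # One (FP rate, TP rate) point per decision threshold.
        scores = np.asarray(scores)
        labels = np.asarray(labels, dtype=bool)   # True = positive class
        points = []
        for t in thresholds:                      # e.g. 0.54, 0.48, ..., 0.24
            decide_yes = scores >= t
            fp_rate = decide_yes[~labels].mean()  # p(Y | n)
            tp_rate = decide_yes[labels].mean()   # p(Y | y)
            points.append((fp_rate, tp_rate))
        return points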
Figure 2 shows a set of classifiers obtained by varying thresholds
for two logistic regression models for the attrition data. In this
representation, we can visually recognize dominated classifiers.
They are positioned inside the convex region, while potential
candidates for optimal solution are on the boundary. It is easy to
see [3] that under any cost structure there is a classifier on the
boundary which will outperform a dominated classifier.
Figure 2. Two competing logistic regression models
Figure 3 shows a hybrid classifier obtained by removing
dominated points. Any point positioned inside the bounded
region can be outperformed by some point on the boundary. The
boundary forms a hybrid classifier, or a set of potential candidates
for an optimal solution.
Figure 3. ROCCH for the two logistic regression models
2.1 Remark on convexity
The hybrid classifier obtained from our two logistic regression
models formed a convex set in the (FP rate, TP rate) plane. In
general, this doesn’t have to be the case. Figure 4 provides such
an example at point B, where the region boundary transitions from
the neural network model to the logistic model: the boundary
"dips" below the line from A to C. This can be "fixed" by
creating a new classifier B' on the (A, C) line. For example, if we
want a point halfway between A and C, the new classifier is
obtained by using classifiers A and C randomly, each with
probability 0.5. Provost and Fawcett describe a similar scheme,
using random sampling to create a new classifier.
In general, any convex combination of two classifiers becomes a
new classifier. For any point X on a line segment created by
classifiers π1 and π2, we can always construct a new classifier
corresponding to that point. We start by describing such a point
as a convex combination:
X = α*π1 + (1 − α)*π2, where 0 ≤ α ≤ 1
Then, we randomly partition the population being classified into
two groups, in proportions α and 1 − α, and apply the appropriate
model to each group.
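As a sketch, the construction can be packaged as a higher-order
function; pi1 and pi2 are assumed to be callables returning a
Yes/No decision for an instance (names are ours, not the paper's):

    import random

    def convex_combination(pi1, pi2, alpha):
        # Use pi1 with probability alpha, pi2 otherwise; in expectation
        # the error rates are alpha*(x1, y1) + (1 - alpha)*(x2, y2).
        def classify(instance):
            return pi1(instance) if random.random() < alpha else pi2(instance)
        return classify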
Figure 4. Boundary of ROC space for two classifiers is not
always convex
A new, convex boundary for the attrition problem is shown in
Figure 5.
Figure 5. Convexity is obtained by creating convex
combinations of existing classifiers
3. BASIC TERMINOLOGY
The following terminology is used throughout the remaining
sections.
− Two classes: positive (y) and negative (n), with probabilities
p and 1-p respectively
− Classification decision: positive (Y) and negative (N)
− FP, TP, FN, TN represent number of instances of each kind:
false positive, true positive, false negative and true negative
respectively
− Rates of those instances are represented as follows:
FP_rate = p(Y|n) TP_rate = p(Y|y)
FN_rate = p (N|y) TN_rate = p(N|n)
− In the ROC space, the horizontal axis represents FP_rate, and
the vertical axis represents TP_rate. We will use x to denote
FP_rate and y to denote TP_rate
− Unit cost of a false positive error = c(Y|n)
− Unit cost of a false negative error = c(N|y)
In defining the cost, we need to start with the expected number of
errors and transform the resulting formula into the ROC space
terms, where the variables are error rates. The following scheme
visualizes interdependencies between terms.
Table 2. Two class classification scheme

    Class (prior)        Decision Y       Decision N
    Class y (p*M)        TP, rate y       FN, rate 1-y
    Class n ((1-p)*M)    FP, rate x       TN, rate 1-x
Let M be the total number of instances being classified. For given
p, x and y we can calculate the expected number of classified
cases of each kind as follows.
TP = M*p*y
FN = M*p*(1-y)
FP = M* (1-p)*x
TN = M*(1-p)*(1-x)
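These formulas translate directly into code; a small helper of our
own, for illustration:

    def expected_counts(M, p, x, y):
        # M = population size, p = positive-class prior,
        # x = FP rate, y = TP rate
        TP = M * p * y
        FN = M * p * (1 - y)
        FP = M * (1 - p) * x
        TN = M * (1 - p) * (1 - x)
        return TP, FN, FP, TN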
4. COST FUNCTION
We now apply the cost function to the attrition problem and look
for a classifier on the boundary of the convex hull which will
minimize the total cost.
A false negative classification means that we won't recognize an
attriting customer and will lose the account. The cost of losing a
customer is tied to the net income after taxes (NIAT) this customer
brings to the bank. In addition, both types of error generate labor
costs related to preventive action. The line of business decided to
assign error costs as shown in Table 3.
Table 3. Misclassification costs assignment

    Error                        Cost per unit
    False positive c(Y|n)        $30
    False negative c(N|y)        $2,575
Given the cost of a false negative error, c(N | y), and of a false
positive error, c(Y | n), the total expected cost is

EC = c(Y | n) * E(FP) + c(N | y)* E(FN)

where E(FN) is the expected number of false negatives and
E(FP) is the expected number of false positives.
We want to minimize the expected total cost in terms of the
decision variables x and y.
Minimize EC = c(Y | n) * E(FP) + c(N | y)* E(FN)
= c(Y | n) * (1-p) * M * x + c(N | y) * p* M * (1 – y )
= c(Y | n) * (1-p) * M * x + c(N | y) * p * M - c(N | y)*p*M*y
After subtracting the constant term (not dependent on the decision
variables) and dividing by M, this is equivalent to:

Minimize (EC - c(N | y)*p*M)/M = c(Y | n)*(1-p)*x - c(N | y)*p*y
This is equivalent to maximizing the opposite function
Maximize C’ = - c(Y | n)*(1-p)*x + c(N | y)*p*y
C’ in space (x,y) is a collection of parallel lines with slopes
depending on misclassification costs and the a-priori probability
of the positive class. In the attrition example we are analyzing,
p=3.75%. We are now ready to calculate slopes for the cost lines.
m = [c(Y | n)*(1-p)] / [c(N | y)*p] = 30*(1-0.0375) / (2575*0.0375) ≈ 0.3
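The same arithmetic in code, using the paper's figures (a sketch;
the variable names are ours):

    c_fp, c_fn, p = 30.0, 2575.0, 0.0375   # c(Y|n), c(N|y), attrition rate
    coef_x = -c_fp * (1 - p)               # coefficient of x in C': -28.875
    coef_y = c_fn * p                      # coefficient of y in C':  96.5625
    m = -coef_x / coef_y                   # iso-cost slope, approximately 0.30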
Intercepts of those lines vary with the position of each line on the
plane. The higher a line lies on the plane (the larger its intercept),
the higher the value of C'. Since our objective is to maximize C',
we can start at the bottom of the convex hull and move the line up
as far as possible. Each point of the region touching the line in a
given position has the same cost. This defines iso-performing
lines.
We want to find a point in the convex region that the line touches
in its highest position possible. Figure 6 shows the convex hull for
the attrition problem, and cost function moving up through the
region. The highest position of the cost line is obtained at vertex
A. The theory of Linear Programming shows that, for a convex
and bounded region, the objective function is maximized either at
one of the vertices, or on a line segment joining two vertices
(Bazaraa [1]). Thus, we can always pick one or more best
performing classifiers.
Figure 6. Cost functions traversing the ROC convex hull
As in the ROCCH analysis, we have iso-performing lines, which
help visualize performance of classifiers under various cost
structures. In the Linear Programming setting, however, the
feasible region can incorporate any additional constraints. We are
gaining flexibility and control.
5. ADDING ADDITIONAL CONSTRAINTS
In the attrition example, some implicit constraints are already
built into the ROC space. Error rates are non-negative and cannot
exceed 1. Those constraints bound the convex set and guarantee
existence of an optimal solution. In addition, a planned calling
campaign is limited by the availability of customer service
resources. The phone call scenario included approaching a
customer to find out if their intention was indeed to leave the
bank, and attempting to entice them to stay. Naturally, there are
limited resources the bank can devote to this undertaking.
Traditionally, modelers would
select the best model, and any additional constraints would be
imposed at the implementation time, based on the pre-selected
model. But if a modeler is aware of the constraints, they can be
built into the system evaluating model’s performance. The line of
business, for which these classifiers were developed, determined
that they could handle calling 20% of their customer base. This
resulted in the constraint
FP + TP ≤ 0.2*M
In order to plot the constraint in the (x, y) error rate space, we
need to formulate it in terms of the decision variables x and y.
Substituting FP = M*(1-p)*x and TP = M*p*y and dividing by M gives

(1-p)*x + p*y ≤ 0.2
Geometrically, this inequality is represented by a half plane in the
(x,y) space. Figure 7 shows a case of the attrition problem with
an additional constraint. The feasible region is now the
intersection of the original convex set with this half plane: the
area to the left of (below) the dotted line.
Figure 7. The capacity constraint bounds the feasible region
Our optimal classifier is no longer feasible, since it is outside of
the new feasible region.
We need to “slide” the cost line back down to the feasible region.
It will touch the feasible region at the point where the hybrid
classifier intersects the constraint line. As noted earlier, this point
defines a new classifier. The new classifier is now optimal, as
shown in Figure 8.
Figure 8. New optimal solution
In the next section we show how to find the new vertex. Here, we
will point out an actual implementation of the new solution as
outlined by Fawcett and Provost in [3].
Assume that the intersection point divides the line segment
between A and B at ratio α. To obtain the new classifier, we can
proceed as follows.
1. With probability α use classifier A
2. With probability 1- α use classifier B
If A and B were obtained from the same model by varying the
decision threshold, this process can be simplified by finding an
appropriate threshold between A and B.
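A small sketch of the computation (function and variable names
are ours): if the binding constraint line is a1*x + a2*y = b and
A, B are the hull vertices on either side of it, the ratio α solves
a one-dimensional linear equation.

    def mixing_ratio(A, B, a1, a2, b):
        # Alpha such that alpha*A + (1 - alpha)*B lies on a1*x + a2*y = b.
        (xa, ya), (xb, yb) = A, B
        return (b - a1 * xb - a2 * yb) / (a1 * (xa - xb) + a2 * (ya - yb))

The combined classifier then uses A with probability α and B with
probability 1 − α, exactly as in the two steps above.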
6. LINEAR PROGRAMMING FORMULATION
A linear program is a constrained optimization problem, where
the objective function, as well as all constraints, are linear. We
need to select values for all decision variables so that all
constraints are satisfied and the objective function is minimized
(or maximized). In this case, we need to pick a point(s) in the
ROC space which will minimize the cost function (or maximize
the modified cost function). Decision variables are then the
coordinates (x,y) of points in the ROC space. Some of the
constraints arise from the classifiers’ performance.
In the ROC space, points under consideration need to be below
the boundary of the convex region. Additional constraints are
usually related to implementation and/or quality issues.
Finally, we have non-negativity constraints and bounding
constraints, since the ROC variables have to be between 0 and 1.
A canonical form of a linear (maximization) program takes the
following format.
Maximize    C = Σ_j c_j * x_j

subject to  Σ_j a_ij * x_j ≤ b_i,    i = 1, …, k
            x_j ≥ 0,                 j = 1, …, d
The theory of linear programming assures us that if the set of
constraints forms a convex and bounded set (called the feasible
region), then an optimal solution is found on the boundary of the
feasible region [1]. In our two-dimensional case, a solution can
be found at a vertex, or on an edge joining two vertices. Note that
the feasible set is created as a conjunction of several linear
inequalities. Not all vertices are known explicitly. A number of
computational techniques have been designed to find optimal
solutions without explicitly iterating over the vertices. More
details can be found in [1] and [2].
The theory of Linear Programming also aids an analysis of a
solution found, if we want to play some “what if” scenarios. At a
point of optimality, some constraints will be satisfied as
"binding"; that is, the left-hand side will be equal to the
right-hand side. Others will be satisfied as an inequality, leaving
“slack”, or room for improvement.
In a two-dimensional case, if two constraints are found binding at
an optimal solution, the classifier at the intersection of those
constraints is optimal. An analysis of slack at neighboring points
(called marginal analysis or sensitivity analysis), often provides
insight into alternative solutions and how close they are to
optimality. All commercially available optimization packages,
including a module that comes with Excel, will provide slack
information for all constraints. We show an example of an
Excel sensitivity analysis report in the Appendix. In that example,
the capacity constraint, which was based on workforce
availability, is binding. That means that any further improvement
of classifier’s performance will require additional workforce
resources. Additional resources will provide slack on the capacity
constraint. This will allow the optimal solution to move along the
convex hull boundary in a direction improving the objective
function.
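A toy illustration of binding constraints and slack (not the
attrition data; it assumes SciPy's HiGHS-backed linprog, which
reports per-constraint slack in res.slack and dual values in
res.ineqlin.marginals):

    from scipy.optimize import linprog

    # maximize 3x + 2y  subject to  x <= 2,  y <= 2,  x + y <= 3
    res = linprog(c=[-3, -2], A_ub=[[1, 0], [0, 1], [1, 1]], b_ub=[2, 2, 3],
                  bounds=[(0, None), (0, None)], method="highs")
    for i, s in enumerate(res.slack):
        status = "binding" if abs(s) < 1e-9 else "slack = %.2f" % s
        # sign flipped because we minimized the negated objective
        print("constraint %d: %s, shadow price = %.2f"
              % (i, status, -res.ineqlin.marginals[i]))
    # constraints 0 and 2 bind; constraint 1 (y <= 2) has slack 1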
We will now formulate the problem of looking for optimal
classifier in the presence of additional constraints, as a linear
programming problem. We already have a formal representation
of the objective function.
Maximize C’(x, y) = - c(Y | n)*(1-p)*x + c(N | y)*p*y
Now we need to formulate the set of constraints related to the
classifiers and add a set of additional, bounding and non-
negativity constraints.
Consider two classifiers Pi and Pj on the boundary, with error rates
(xi, yi) and (xj, yj), and assume that xi ≤ xj. The line segment
through points Pi and Pj has the slope

m = (yj – yi)/(xj – xi)

and the equation y – yi = m*(x – xi).
The feasible region is positioned below the line, so it is
determined by a collection of inequalities
y – yi ≤ ((yj – yi)/ (xj – xi) )*( x – xi)
which can be rearranged as
- (yj – yi) x + (xj – xi) y ≤ - xi (yj – yi) + yi (xj – xi)
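The rearranged inequality maps one-to-one onto the A_ub @ [x, y]
<= b_ub form that LP solvers expect; a sketch (the function name is
ours):

    def edge_constraints(points):
        # points: boundary classifiers (x_i, y_i), sorted by increasing FP rate
        A_ub, b_ub = [], []
        for (xi, yi), (xj, yj) in zip(points, points[1:]):
            A_ub.append([-(yj - yi), xj - xi])
            b_ub.append(-xi * (yj - yi) + yi * (xj - xi))
        return A_ub, b_ub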
The optimization problem can now be defined as follows.

Given classifiers Pi with error rates (xi, yi), i = 1, 2, …, k,
such that xi ≤ xi+1,

Maximize C'(x, y) = - c(Y | n)*(1-p)*x + c(N | y)*p*y

such that

    a1_ij*x + a2_ij*y ≤ b_ij    for i = 1, 2, …, k-1; j = i + 1
    π1_l*x + π2_l*y ≤ q_l       for l = 1, 2, …, l0
    0 ≤ x ≤ 1
    0 ≤ y ≤ 1

where

    a1_ij = -(yj – yi),   a2_ij = (xj – xi)
    b_ij = -xi*(yj – yi) + yi*(xj – xi)

and π1_l, π2_l, q_l form the additional constraints.
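The whole formulation fits in a few lines of SciPy (a sketch; the
paper itself used Excel's solver, and linprog minimizes, so the
objective is negated). The hull points below come from Table 1; the
edge_constraints construction from the previous sketch is inlined
so the example is self-contained.

    from scipy.optimize import linprog

    def solve_roc_lp(points, c_fp, c_fn, p, extra_A=(), extra_b=()):
        # Edge rows: -(yj - yi)*x + (xj - xi)*y <= -xi*(yj - yi) + yi*(xj - xi)
        A_ub = [[-(yj - yi), xj - xi]
                for (xi, yi), (xj, yj) in zip(points, points[1:])]
        b_ub = [-xi * (yj - yi) + yi * (xj - xi)
                for (xi, yi), (xj, yj) in zip(points, points[1:])]
        A_ub += [list(row) for row in extra_A]   # any additional constraints
        b_ub += list(extra_b)
        c = [c_fp * (1 - p), -c_fn * p]          # minimize -C'(x, y)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(0, 1), (0, 1)], method="highs")
        return res.x, -res.fun                   # optimal (x, y) and C'

    points = [(0, 0), (0.06, 0.25), (0.14, 0.43), (0.23, 0.58), (0.33, 0.70),
              (0.42, 0.82), (0.53, 0.90), (0.64, 0.96), (0.75, 0.98),
              (0.88, 0.99), (1, 1)]
    print(solve_roc_lp(points, c_fp=30, c_fn=2575, p=0.0375))
    # optimum at (0.64, 0.96), the vertex ROCCH analysis would also select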
6.1 Computational considerations
In the absence of additional constraints, all vertices are known a
priori. In such a case it is computationally efficient to explicitly
calculate the expected cost of each classifier and choose the one
with the smallest cost, rather than actually slide the lines over the
feasible region.
When additional constraints are introduced, the situation changes.
Finding new vertices, formed by the additional intersection points,
can get cumbersome. Fortunately, the theory and practice of
mathematical programming provides efficient algorithms for
finding optimal solutions without a need to iterate over all
vertices. There are also efficient solvers on the market, which can
solve optimization problems. An add-on to Excel provides one
such solver. It is efficient for problems in small dimensions as the
ones described here, and it is available in almost any business
setting. Figure 9 shows the original attrition problem, without the
additional constraint, solved in Excel. Spreadsheet cell labeled
Cost contains the cost formula. Value of that cell is maximized,
by changing decision variables x and y, as long as all constraints
are satisfied. Point (0.64, 0.96) minimizes the objective function.
Figure 9. Excel optimizer solves the optimization problem

    Maximizing Modified Cost Function: -c(Y | n)*(1-p)*x + c(N | y)*p*y

    Error cost per unit                     Decision variables
    c(Y|n)   c(N|y)   Attrition rate p     Max     x      y
    $30      $2,575   0.0375               74.29   0.64   0.96

    Data points and constraints
    FP rate x1   TP rate y1   Slope m   - m*x + y        - m*x1 + y1
    0            0
    0.06         0.25         3.83      -1.48       <=   0.00
    0.14         0.43         2.31      -0.51       <=   0.10
    0.23         0.58         1.66      -0.09       <=   0.19
    0.33         0.70         1.35       0.10       <=   0.26
    0.42         0.82         1.28       0.15       <=   0.29
    0.53         0.90         0.72       0.50       <=   0.52
    0.64         0.96         0.55       0.61       <=   0.61
    0.75         0.98         0.21       0.82       <=   0.82
    0.88         0.99         0.08       0.91       <=   0.92
    1.00         1.00         0.04       0.93       <=   0.96
Slopes of line segments on the boundary are calculated in the third
column. Note that at point (0.64, 0.96) the slope changes from 0.55
to 0.21. As noted earlier, the slope of the cost line is 0.3. So point
(0.64, 0.96) would have been selected as optimal by the traditional
ROCCH analysis as well.
Figure 10 shows a new solution, after the additional capacity
constraint was added. The new solution is (0.15, 0.44).
Figure 10. Excel solution incorporating the capacity constraint

    Maximizing Modified Cost Function: -c(Y | n)*(1-p)*x + c(N | y)*p*y

    Error cost per unit            Decision variables
    c(Y|n)   c(N|y)   Rate p      Max     x      y
    $30      $2,575   0.0375      38.33   0.15   0.44

    Data points and constraints
    FP rate x1   TP rate y1   Slope m   - m*x + y        - m*x1 + y1
    0            0
    0.06         0.25         3.83      -0.14       <=   0.00
    0.14         0.43         2.31       0.09       <=   0.10
    0.23         0.58         1.66       0.19       <=   0.19
    0.33         0.70         1.35       0.24       <=   0.26
    0.42         0.82         1.28       0.25       <=   0.29
    0.53         0.90         0.72       0.33       <=   0.52
    0.64         0.96         0.55       0.36       <=   0.61
    0.75         0.98         0.21       0.41       <=   0.82
    0.88         0.99         0.08       0.43       <=   0.92
    1.00         1.00         0.04       0.44       <=   0.96

    Capacity constraint: (1-p)*x + p*y <= 0.20;  value 0.16 <= 0.16
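The constrained solve can be reproduced outside Excel as well; a
short sketch reusing the solve_roc_lp function and points list from
the sketch in the formulation above (assumed to be in scope), with
the capacity row appended:

    p, q = 0.0375, 0.20     # q = capacity fraction; 20% as stated in the text
    xy, best = solve_roc_lp(points, c_fp=30, c_fn=2575, p=p,
                            extra_A=[[1 - p, p]],  # (1-p)*x + p*y <= q
                            extra_b=[q])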
6.2 Note on practical considerations
For all practical purposes, a business setting often prefers speed
and expediency of delivery to strict optimality, as long as the
classifier's performance is near optimal. In the case of the attrition
problem we were lucky, in that one of the existing classifiers,
(0.14, 0.43), was in close proximity to the optimal solution
(0.15, 0.44) and still within the feasible region. Selecting a
near-optimal solution was, in this case, a simple decision, because
this was a relatively simple problem with just one additional
constraint.
6.3 New performance constraints
While analyzing the new optimal solution, we notice that the true
positive rate is 0.44. In other words, out of all instances of the
positive class, only 44% (less than half) are classified correctly.
This is not a desirable performance: 56% of defecting customers
remain unrecognized and, without being contacted, will leave the
bank. Solution analysis (see Appendix) shows that the capacity
constraint is binding at the optimal point. We cannot increase the
True Positive rate y, without leaving the feasible region and
violating the capacity constraint. Any further improvements in the
attritors’ recognition rate will require additional resources so that
the capacity constraint can be relaxed. The optimization template
set up in Excel allowed us to play a number of “what if”
scenarios. In a final compromise, the line of business decided to
relax the capacity constraint to allow for 30% of the population to
be classified as positive. In return, we added a new “missed
opportunity” constraint, which required that the True Positive rate
is no less than 60%. The optimal solution for this problem with
two additional constraints is shown in Figure 11. The feasible
region is now the triangle contained between three lines: the
classifier constraint, the performance constraint y ≥ 0.6, and the
relaxed capacity constraint

(1-p)*x + p*y ≤ 0.3
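Both additional constraints drop straight into the same sketch
(again reusing solve_roc_lp and points from Section 6; y ≥ 0.6 is
rewritten as -y ≤ -0.6 to fit the ≤ form):

    p = 0.0375
    extra_A = [[1 - p, p],   # relaxed capacity: (1-p)*x + p*y <= 0.3
               [0, -1]]      # missed opportunity: y >= 0.6
    extra_b = [0.3, -0.6]
    xy, best = solve_roc_lp(points, c_fp=30, c_fn=2575, p=p,
                            extra_A=extra_A, extra_b=extra_b)
    # with the rounded Table 1 points this lands near (0.29, 0.64),
    # close to the (0.29, 0.65) optimum reported in Figure 12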
Figure 11. Feasible region with two additional constraints
The optimal solution (0.29, 0.65) found by the Excel Optimizer is
shown in Figure 12.
Figure 12. Excel solution with two additional constraints

    Maximizing Modified Cost Function: -c(Y | n)*(1-p)*x + c(N | y)*p*y

    Error cost per unit            Decision variables
    c(Y|n)   c(N|y)   Rate p      Max     x      y
    $30      $2,575   0.0375      54.39   0.29   0.65

    Data points and constraints
    FP rate x1   TP rate y1   Slope m   - m*x + y        - m*x1 + y1
    0            0
    0.06         0.25         3.83      -0.45       <=   0.00
    0.14         0.43         2.31      -0.01       <=   0.10
    0.23         0.58         1.66       0.17       <=   0.19
    0.33         0.70         1.35       0.26       <=   0.26
    0.42         0.82         1.28       0.28       <=   0.29
    0.53         0.90         0.72       0.44       <=   0.52
    0.64         0.96         0.55       0.49       <=   0.61
    0.75         0.98         0.21       0.59       <=   0.82
    0.88         0.99         0.08       0.63       <=   0.92
    1.00         1.00         0.04       0.64       <=   0.96

    Capacity constraint: (1-p)*x + p*y <= 0.30;  value 0.30 <= 0.30
    Missed opportunity constraint: y >= 0.60;  value 0.65 >= 0.60
Other examples of frequently used performance constraints could
include:
− limited size of a direct mail campaign
− restrictions in processing capacity for responders to a
campaign
− limited expense of incentives for responders
− limited capacity of response processing systems (indirectly
resulting in limiting a campaign size)
In each case, we need to express the constraints in terms of
decision variables to create the new convex hull to intersect with
the one created by the classifiers set.
Note that in this formulation we do not need to know the
intersection points of the ROCCH with the constraints' convex hull.
The optimal solution can be found with no explicit knowledge of
vertices of the combined convex hull.
7. CONCLUSION
Linear Programming is part of a broad field of constrained
optimization called Mathematical Programming. Mathematical
Programming is a rich discipline and its applications have been
growing fast in the past several decades.
Once a problem has been framed as a mathematical programming
problem, we gain a powerful tool. More importantly, we can draw
from that field’s rich legacy. Some future developments may, for
example, include multi-dimensional problems where a
classification problem has more than two classes. Computational
difficulties grow fast with problem’s dimensions, but techniques
and methods of Linear Programming can help overcome
computational issues. We could also consider problems involving
a non-linear cost function. The theory of Non-linear Programming
may guide us in minimizing non-linear costs, such as piecewise
linear or quadratic ones. Other areas of pursuit may include
methods to deal with classifier and/or constraint uncertainty.
Stochastic Programming methods could perhaps be employed to
find optimal solutions when the standard errors of classifiers need
to be taken into account.
At the same time we are not losing the main benefits of ROCCH.
We maintain the benefit of visualization, except the cost function
now “slides” along the convex hull. The benefit of being able to
choose the optimal classifier at run time if constraints change
can be maintained as well. The selection process would proceed
as follows:
- Input new costs and constraints
- Intersect a constraints space with the ROC space
- Input all competing points
- Solve the optimization problem
8. ACKNOWLEDGEMENTS
Thanks to all who helped to inspire, develop and formulate the
above thoughts. In particular my gratitude goes to Endre Boros,
Stan Matwin and Vera Helman who generously shared their
knowledge and experience and provided feedback throughout the
various stages of this work.
9. REFERENCES
[1] Bazaraa, Mokhtar S., et al. Linear Programming and
Network Flows. John Wiley & Sons, Inc., 1997.
[2] Murty, Katta G. Operations Research: Deterministic
Optimization Models. Prentice Hall, Inc., 1995.
[3] Provost, F., Fawcett, T. Robust Classification Systems
for Imprecise Environments. Proceedings of the Fifteenth
National Conference on Artificial Intelligence (AAAI-98), 1998.
[4] Provost, F., Fawcett, T. Robust Classification for
Imprecise Environments. Machine Learning, Vol. 42, No. 3,
2001, pp. 203-231.
10. APPENDIX
Answer report and sensitivity analysis report produced by an
Excel Optimizer for the attrition problem with capacity constraint.
Microsoft Excel 8.0e Answer Report
Worksheet: [ROCCH1.xls]Optimization
Report Created: 2/28/2002 6:16:12 PM

Target Cell (Max)
    Cell    Name    Original Value    Final Value
    $D$4    Max     0.00              38.33

Adjustable Cells
    Cell    Name    Original Value    Final Value
    $E$4    x       0.00              0.15
    $F$4    y       0.00              0.44

Constraints
    Cell     Name             Cell Value    Formula          Status         Slack
    $D$7     - mx + y          0.00         $D$7<=$F$7       Binding        0
    $D$8     - mx + y         -0.14         $D$8<=$F$8       Not Binding    0.14
    $D$9     - mx + y          0.09         $D$9<=$F$9       Not Binding    0.01
    $D$10    - mx + y          0.19         $D$10<=$F$10     Binding        0
    $D$11    - mx + y          0.24         $D$11<=$F$11     Not Binding    0.02
    $D$12    - mx + y          0.25         $D$12<=$F$12     Not Binding    0.04
    $D$13    - mx + y          0.33         $D$13<=$F$13     Not Binding    0.19
    $D$14    - mx + y          0.36         $D$14<=$F$14     Not Binding    0.25
    $D$15    - mx + y          0.41         $D$15<=$F$15     Not Binding    0.41
    $D$16    - mx + y          0.43         $D$16<=$F$16     Not Binding    0.49
    $D$20    (1-p)*x + p*y     0.16         $D$20<=$F$20     Binding        0
    $F$4     y                 0.44         $F$4<=1          Not Binding    0.56
    $E$4     x                 0.15         $E$4<=1          Not Binding    0.85

Microsoft Excel 8.0e Sensitivity Report
Worksheet: [ROCCH1.xls]Optimization
Report Created: 2/28/2002 6:20:01 PM

Adjustable Cells
    Cell    Name    Final Value    Reduced Cost    Objective Coefficient    Allowable Increase    Allowable Decrease
    $E$4    x       0.15           0               -28.9                    2507.31               131.01
    $F$4    y       0.44           0               96.6                     1E+30                 79.12

Constraints
    Cell     Name             Final Value    Shadow Price    Constraint R.H. Side    Allowable Increase    Allowable Decrease
    $D$7     - mx + y          0.00          0               0                       1E+30                 0
    $D$8     - mx + y         -0.14          0               0                       1E+30                 0.14
    $D$9     - mx + y          0.09          0               0.10                    1E+30                 0.01
    $D$10    - mx + y          0.19          91.77           0.19                    0.01                  0.47
    $D$11    - mx + y          0.24          0               0.26                    1E+30                 0.02
    $D$12    - mx + y          0.25          0               0.29                    1E+30                 0.04
    $D$13    - mx + y          0.33          0               0.52                    1E+30                 0.19
    $D$14    - mx + y          0.36          0               0.61                    1E+30                 0.25
    $D$15    - mx + y          0.41          0               0.82                    1E+30                 0.41
    $D$16    - mx + y          0.43          0               0.92                    1E+30                 0.49
    $D$20    (1-p)*x + p*y     0.16          127.87          0.16                    0.08                  0.01
More Related Content

PDF
Flavours of Physics Challenge: Transfer Learning approach
PDF
ProbErrorBoundROM_MC2015
PPTX
Fuzzy Model Presentation
PPT
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...
PDF
Adaptive response surface by kriging using pilot points for structural reliab...
PDF
Ica 2013021816274759
PPTX
Fuzzy presenta
PPT
Introduction to mars_2009
Flavours of Physics Challenge: Transfer Learning approach
ProbErrorBoundROM_MC2015
Fuzzy Model Presentation
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...
Adaptive response surface by kriging using pilot points for structural reliab...
Ica 2013021816274759
Fuzzy presenta
Introduction to mars_2009

Similar to Evaluating Classifiers' Performance KDD2002 (20)

DOC
POST OPTIMALITY ANALYSIS.doc
PDF
1607.01152.pdf
PDF
FurtherInvestegationOnProbabilisticErrorBounds_final
PDF
Further investegationonprobabilisticerrorbounds final
PDF
Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...
PDF
35000120030_Aritra Kundu_Operations Research.pdf
PDF
CounterFactual Explanations.pdf
PDF
h264_publication_1
PPT
RBHF_SDM_2011_Jie
PDF
International Journal of Humanities and Social Science Invention (IJHSSI)
PDF
ABDO_MLROM_PHYSOR2016
DOCX
Trust Region Algorithm - Bachelor Dissertation
PDF
A parsimonious SVM model selection criterion for classification of real-world ...
PDF
MultiLevelROM2_Washinton
PDF
Aco
PDF
Study on Evaluation of Venture Capital Based onInteractive Projection Algorithm
PDF
Multiple Target Machine Learning Prediction of Capacity Curves of Reinforced ...
PDF
CFM Challenge - Course Project
PDF
Iaetsd protecting privacy preserving for cost effective adaptive actions
POST OPTIMALITY ANALYSIS.doc
1607.01152.pdf
FurtherInvestegationOnProbabilisticErrorBounds_final
Further investegationonprobabilisticerrorbounds final
Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...
35000120030_Aritra Kundu_Operations Research.pdf
CounterFactual Explanations.pdf
h264_publication_1
RBHF_SDM_2011_Jie
International Journal of Humanities and Social Science Invention (IJHSSI)
ABDO_MLROM_PHYSOR2016
Trust Region Algorithm - Bachelor Dissertation
A parsimonious SVM model selection criterion for classification of real-world ...
MultiLevelROM2_Washinton
Aco
Study on Evaluation of Venture Capital Based onInteractive Projection Algorithm
Multiple Target Machine Learning Prediction of Capacity Curves of Reinforced ...
CFM Challenge - Course Project
Iaetsd protecting privacy preserving for cost effective adaptive actions
Ad

Evaluating Classifiers' Performance KDD2002

  • 1. Evaluating Classifiers’ Performance In A Constrained Environment Anna Olecka olecka@rutcor.rutgers.edu Fleet Boston Financial Database Marketing Department 1075 Main St. Waltham, MA 02451 RUTCOR Rutgers University 640 Bartholomew Road Piscataway, NJ 08854-8003 ABSTRACT In this paper, we focus on methodology of finding a classifier with a minimal cost in presence of additional performance constraints. ROCCH analysis, where accuracy and cost are intertwined in the solution space, was a revolutionary tool for two-class problems. We propose an alternative formulation, as an optimization problem, commonly used in Operations Research. This approach extends the ROCCH analysis to allow for locating optimal solutions while outside constraints are present. Similarly to the ROCCH analysis, we combine cost and class distribution while defining the objective function. Rather than focusing on slopes of the edges in the convex hull of the solution space, however, we treat cost as an objective function to be minimized over the solution space, by selecting the best performing classifier(s) (one or more vertex in the solution space). The Linear Programming framework provides a theoretical and computational methodology for finding the vertex (classifier) which minimizes the objective function. 1. INTRODUCTION Consider a problem, where classifiers’ performance has to be evaluated taking into account additional constraints related to error rates. Such constraints often arise from implementation. They could, for example, involve a limited workforce to resolve cases of suspected fraud, a limited size of a direct mail campaign, or restrictions in cost of incentives for responders. An application example used throughout this paper involves an attrition model for a bank. A bank plans a calling campaign to lower attrition rate among its customers. Naturally, one of the implementation concerns is limited availability of resources, such as phone representatives. Traditionally, a model is selected first and summarized by a modeler into several performance buckets. Then, the implementation team will “eyeball” the thresholds of the performance buckets and pick the threshold that matches constrained resources the closest. Such business practice often results in a sub-optimal solution being selected. If the constraints are known a-priori, they could be built into the system evaluating classifiers’ performance. If they are not known till the implementation time, a system analogous to the ROCCH could be built to select the best classifier at that time. To this end, we are proposing an evaluation system that can deal with additional constraints related to prediction errors. We will also show how to apply such system in a business scenario described above. Provost and Fawcett have shown in [3] that some specific metrics frequently used in Machine Learning (eg, workforce constraints, and the Neyman-Pearson decision criterion), are optimized by the ROCCH method. Optimization approach proposed here extends their results to any linear constraint. In fact, our approach can be applied to an unlimited number of constraints, as long as they remain linear. Finding numerically intersection points of such additional constraints with ROCCH can be computationally tedious. A mathematical programming approach provides efficient tools for finding optimal solutions without explicitly calculating all the intersection points. 
ROCCH analysis is a powerful and widely accepted tool for visualizing and optimizing classification problems with two classes. It plots performance of all classifiers under consideration in a two-dimensional space, with false positive rate on one axis and true positive rate on the other. It measures classifiers’ performance for various costs of misclassification and under various class distributions. It allows for visual representation of classifiers and enables quick decisions in choosing the right classifier for the given costs. The main advantage of this methodology is its flexibility under varying conditions. One drawback of this methodology is that it is somewhat rigid in looking for an optimal classifier. It forms an optimal slope by considering an optimal combination of class probabilities and error costs, and then looks for one of the two scenarios. Either an edge of the convex hull with a slope equal to the optimal one, or for a vertex between two edges where a difference between the edges slope and the optimal slope changes sign. Similarly to the ROCCH, we construct the convex hull of all classifiers in the error space. Convexity of the error space is Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGKDD ’02, July 23-26, 2002, Edmonton, Alberta, Canada. Copyright 2002 ACM 1-58113-567-X/02/0007…$5.00.
  • 2. accomplished in the following steps. First we note that a convex combination of two classifiers is also a viable classifier. Then, we remove dominated classifiers. Finally, we impose additional performance constraints if any. Those additional constraints also form a convex hull. Intersection of the two regions, is a new convex hull, the potential solution space. Finally, similarly to the ROCCH, we treat a combination of costs and class probabilities as an objective function to be minimized over the solution space. Due to existence of additional constraints, however, iterating slopes of the edges of the convex hull is no longer computationally efficient because not all vertices of the convex hull are known. The theory of linear programming provides computational tools for finding the optimal solution(s) without explicit knowledge of all vertices. 2. ROC CONVEX HULL & HYBRID CLASSIFIERS Suppose we construct a series of k-1 classifiers by varying a positive decision threshold from ”never”, and gradually increasing probability of decision “Yes”. By allowing more instances to be classified as positive, each new threshold will increase the number of correctly classified positive instances, but it may also increase frequency of false positive classification. In the space (FP rate, TP rate), each new classifier will be positioned up and to the right from the previous one. Connecting each pair of points with a line segment generates a convex set, where each vertex represents a classifier (Figure1). . Table 1. Neural net model for bank attrition Table 1 represents a set of classifiers obtained from a neural networks model for a bank attrition problem. Figure 1 shows the resulting convex set Figure 1. ROC curve for the neural net attrition model Figure 2 shows a set of classifiers obtained by varying thresholds for two logistic regression models for the attrition data. In this representation, we can visually recognize dominated classifiers. They are positioned inside the convex region, while potential candidates for optimal solution are on the boundary. It is easy to see [3], that under any cost structure, there is a classifier on the boundary, which will outperform a dominated classifier. Figure 2. Two competing logistic regression models Figure 3 shows a hybrid classifier obtained by removing dominated points. Any point positioned inside the bounded region, can be outperformed by some point on the boundary. The boundary forms a hybrid classifier, or a set of potential candidates for an optimal solution. Figure 3. ROCCH for the two logistic regression models 2.1 Remark on convexity The hybrid classifier obtained from our two logistic regression models formed a convex set in the (FP rate, TP rate) plane. In general, this doesn’t have to be the case. Figure 4 provides such an example. Point B, where the region boundary transitions from the neural networks model to the logistic model. The boundary “dips” below the line from A to C. This can be “fixed” by creating a new classifier B’ on the (A, C) line. For example, if we want a point half way between A and C, the new classifier is obtained by using classifiers A and B randomly, each with probability 0.5. Provost and Fawcett describe a similar scheme, using random sampling to create a new classifier. In general any convex combination of two classifiers becomes a new classifier. For any point X on a line segment created by classifiers π1, π2, , we can always construct a new classifier, which would correspond to such point. 
We start by describing such point as convex combination ROC surves for two logistic regression models 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.2 0.4 0.6 0.8 1 FP Rate TP Ra te Logistic2 Logistic1 Hybrid classifier for two logistic regression models 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.2 0.4 0.6 0.8 1 FP Rate TPRate Logistic2 Logistic1 Cutoff / Threshold FP rate p (Y | n) TP rate p (Y | y) Never 0 0 0.54 0.06 0.25 0.48 0.14 0.43 0.42 0.23 0.58 0.36 0.33 0.70 0.30 0.42 0.82 0.24 0.53 0.90 Always 1 1 ROC Curve for Attrition Data 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 FP Rate TPRate
  • 3. X=απ1+ (1−α)π2 where 0 ≤ α ≤1 Then, we randomly partition the population being classified into 2 groups in proportions α, 1- α and apply an appropriate model for each group. Figure 4. Boundary of ROC space for two classifiers is not always convex A new, convex boundary for the attrition problem is shown in Figure 5. Figure 5. Convexity is obtained by creating convex combinations of existing classifiers 3. BASIC TERMINOLOGY The following terminology is used throughout the remaining sections. − Two classes: positive (y) and negative (n) with probabilities respectively p and 1-p − Classification decision: positive (Y) and negative (N) − FP, TP, FN, TN represent number of instances of each kind: false positive, true positive, false negative and true negative respectively − Rates of those instances are represented as follows: FP_rate = p(Y|n) TP_rate = p(Y|y) FN_rate = p (N|y) TN_rate = p(N|n) − In the ROC space, the horizontal axis represents FP_rate, and the vertical axis represents TP_rate. We will use x to denote FP_rate and y to denote TP_rate − Unit cost of a false positive error = c(Y|n) − Unit cost of a false negative error = c(N|y) In defining the cost, we need to start with the expected number of errors and transform the resulting formula into the ROC space terms, where the variables are error rates. The following scheme visualizes interdependencies between terms. Table 2. Two class classification scheme Let M be the total number of instances being classified. For given p, x and y we can calculate the expected number of classified cases of each kind as follows. TP = M*p*y FN = M*p*(1-y) FP = M* (1-p)*x TN = M*(1-p)*(1-x) 4. COST FUNCTION We now apply the cost function to the attrition problem and look for a classifier on the boundaries of the convex hull, which will minimize the total cost. A false negative classification means that we won’t recognize an attriting customer and loose the account. The cost of loosing a customer is tied to net income after taxes (NIAT) this customer brings to the bank. In addition, both types of error generate labor costs related preventive action. The line of business decided to assign error costs as shown in Table3. Table 3. Misclassification costs assignment Given cost of a false negative error c(N | y), and a false positive error c(Y | n), the total expected cost EC = c(Y | n) * E(FP) + c(N | y)* E(FN) Where E(FN) is the expected number of false negatives and E(FP) is the expected number of false positives. ROC Curves for Logistic and Neural Nets models C B A 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.2 0.4 0.6 0.8 1 FP Rate TPRate Logistic2 Neural Net 1 False positive c(Y|n) False negative c(N|y) 30$ 2,575$ Error Cost per unit Two class classification schem e Population M Class n (1-p)*M Class y p*M Decision Y (TP) Decision N (FN) Decision Y (FP) Decision N (TN) FN rate 1-y TP rate y TN rate 1-x FP rate x p 1-p Hybrid classifier for Logistic2 and Neural Nets models B' C A 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.2 0.4 0.6 0.8 1 FP Rate TPRate Logistic2 Neural Net 1
  • 4. We want to minimize the expected total cost in term of decision variables x and y. Minimize EC = c(Y | n) * E(FP) + c(N | y)* E(FN) = c(Y | n) * (1-p) * M * x + c(N | y) * p* M * (1 – y ) = c(Y | n) * (1-p) * M * x + c(N | y) * p * M - c(N | y)*p*M*y After subtracting the constant (not dependent on the decision variables) term and dividing by M, this is equivalent to: Minimize (EC - c(N | y)*p )/M = = c(Y | n)*(1-p)*x - c(N | y)*p*y This is equivalent to maximizing the opposite function Maximize C’ = - c(Y | n)*(1-p)*x + c(N | y)*p*y C’ in space (x,y) is a collection of parallel lines with slopes depending on misclassification costs and the a-priori probability of the positive class. In the attrition example we are analyzing, p=3.75%. We are now ready to calculate slopes for the cost lines. m = [c(Y | n)*(1-p)] / [c(N | y)*p] = = 30*(1-0.0375) / (2575*0.0375) = 0.3 Intercepts of those lines vary with the position of each line on the plane. The higher the position of the line on the plane in relation to y, the higher will the value of C’ be. Since our objective is to maximize C’, we can start at the bottom of the convex hull and move the line up, as far as possible. Each point of the region touching the line in a given position has the same cost. This defines iso-performing lines. We want to find a point in the convex region that the line touches in its highest position possible. Figure 6 shows the convex hull for the attrition problem, and cost function moving up through the region. The highest position of the cost line is obtained at vertex A. The theory of Linear Programming shows, that for a convex and bounded region, the objective function is maximized either at one of the vertices, or on a line segment joining two vertices Bazaara [1]. Thus, we can always pick one or more best performing classifiers. Figure 6. Cost functions traversing the ROC convex hull As in the ROCCH analysis, we have iso-performing lines, which help visualize performance of classifiers under various cost structures. In the Linear Programming setting, however, the feasible region can incorporate any additional constraints. We are gaining flexibility and control. 5. ADDING ADDITIONAL CONSTRAINTS In the attrition example, some implicit constraints are already built into the ROC space. Error rates are non-negative and cannot exceed 1. Those constraints bound the convex set and guarantee existence of an optimal solution. In addition, a planned calling campaign is limited by customer service resources availability. A phone call scenario included approaching a customer to find out if indeed their intention was to leave the bank, and to attempt to entice them to stay. Naturally, there are limited resources bank can devote to this undertaking. Traditionally, modelers would select the best model, and any additional constraints would be imposed at the implementation time, based on the pre-selected model. But if a modeler is aware of the constraints, they can be built into the system evaluating model’s performance. The line of business, for which these classifiers were developed, determined that they could handle calling 20% of their customer base. This resulted in the constraint FP + TP ≤ 0.2*M In order to plot the constraint in the (x, y) error rate space, we need to formulate it in terms of the decision variables x and y. (1-p)*x - p*y ≤ 0.2 Geometrically, this inequality is represented by a half plane in the (x,y) space. Figure 7 shows a case of the attrition problem with an additional constraint. 
The new convex set is the intersection of the original one with the half plane. The feasible region is now an intersection of the previous convex set with the new one: the area to the left (down) of the dotted line. Figure 7. The capacity constraint bounds the feasible region Our optimal classifier is no longer feasible, since it is outside of the new feasible region. We need to “slide” the cost line back down to the feasible region. It will touch the feasible region at the point where the hybrid classifier intersects the constraint line. As noted earlier, this point A(.64,.96) Optimal solution 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 FP rate TP rate Attrition model with workforce capacity constraints 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 FP rate TPrate Hybrid classifier Capacity constraint
  • 5. defines a new classifier. The new classifier is now optimal, as shown in Figure 8. Figure 8. New optimal solution In the next section we show how to find the new vertex. Here, we will point out an actual implementation of the new solution as outlined by Fawcett and Provost in [3]. Assume that the intersection point divides the line segment between A and B at ratio α. To obtain the new classifier, we can proceed as follows. 1. With probability α use classifier A 2. With probability 1- α use classifier B If A and B were obtained from the same model by varying the decision threshold, this process can be simplified by finding an appropriate threshold between A and B. 6. LINEAR PROGRAMMING FORMULATION A linear program is a constrained optimization problem, where the objective function, as well as all constraints, are linear. We need to select values for all decision variables so that all constraints are satisfied and the objective function is minimized (or maximized). In this case, we need to pick a point(s) in the ROC space which will minimize the cost function (or maximize the modified cost function). Decision variables are then the coordinates (x,y) of points in the ROC space. Some of the constraints arise from the classifiers’ performance. In the ROC space, points under consideration need to be below the boundary of the convex region. Additional constraints are usually related to implementation and/or quality issues. Finally, we have non-negativity constraints and bounding constraints, since the ROC variables have to be between 0 and 1. A canonical form of a linear (maximization) program takes the following format. Maximize C = ∑j cj x j Subject to ∑j aij xj ≤bi i = 1,…, k. xj ≥0 j = 1,…, d The theory of linear programming assures us, that if the set of constraints forms a convex and bounded set (called the feasible region), then an optimal solution is found on the boundaries of the feasible region [1]. In our - two dimensional - case, a solution can be found at a vertex, or on an edge joining two vertices. Note that the feasible set is created as a conjunction of several linear inequalities. Not all vertices are known explicitly. A number of computational techniques have been designed to find optimal solutions without explicitly iterating over the vertices. More details can be found in [1] and [2]. The theory of Linear Programming also aids an analysis of a solution found, if we want to play some “what if” scenarios. At a point of optimality, some constraints will be satisfied as “binding”. That is, the left-hand-side will be equal to the right hand side. Others will be satisfied as an inequality, leaving “slack”, or room for improvement. In a two-dimensional case, if two constraints are found binding at an optimal solution, the classifier at the intersection of those constraints is optimal. An analysis of slack at neighboring points (called marginal analysis or sensitivity analysis), often provides insight into alternative solutions and how close they are to optimality. All commercially available optimization packages, including a module that comes with Excel, will provide slack information for all constraints. We are showing an example of Excel sensitivity analysis report in the Appendix. In that example, the capacity constraint, which was based on workforce availability, is binding. That means that any further improvement of classifier’s performance will require additional workforce resources. Additional resources will provide slack on the capacity constraint. 
This will allow the optimal solution to move along the convex hull boundary in a direction improving the objective function. We will now formulate the problem of looking for optimal classifier in the presence of additional constraints, as a linear programming problem. We already have a formal representation of the objective function. Maximize C’(x, y) = - c(Y | n)*(1-p)*x + c(N | y)*p*y Now we need to formulate the set of constraints related to the classifiers and add a set of additional, bounding and non- negativity constraints. Given two classifiers Pi and Pj on the boundary, with error rates (xj, yj) and (xi, yi). Assume that xi ≤ xj .A line segment through points Pi, Pj has the slope m = (yj – yi)/ (xj – xi) and the equation y – yi = m*( x – xi) The feasible region is positioned below the line, so it is determined by a collection of inequalities y – yi ≤ ((yj – yi)/ (xj – xi) )*( x – xi) which can be rearranged as - (yj – yi) x + (xj – xi) y ≤ - xi (yj – yi) + yi (xj – xi) Optimal solution change under capacity constraints New optimal solution 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 FP rate TPrate
  • 6. The optimization problem can now be defined as follows. Given classifiers Pi with error rates (xi, yi), i = 1,2,…k, such that xi ≤ xi+1 Maximize C’(x, y) = - c(Y | n)*(1-p)*x + c(N | y)*p*y Such that a1 ij x + a2 ij y ≤ bij for i = 1,2,…k; j = i +1 π1 l x + π2 l y ≤ ql l = 1,2, l0 0 ≤x ≤1 0 ≤y ≤1 where a1 ij = - (yj – yi) , a2 ij = (xj – xi) bij = - xi (yj – yi) + yi (xj – xi) and π1 l , π2 l and ql form additional constraints 6.1 Computational considerations In the absence of additional constraints, all vertices are known a priori. In such case, it is computationally efficient, to explicitly calculate expected cost of each classifier and choose the one with the smallest cost, rather than actually slide the lines over the feasible region. When additional constraints are introduced, the situation changes. Finding new vertices, formed by the additional intersection points, can get cumbersome. Fortunately, the theory and practice of mathematical programming provides efficient algorithms for finding optimal solutions without a need to iterate over all vertices. There are also efficient solvers on the market, which can solve optimization problems. An add-on to Excel provides one such solver. It is efficient for problems in small dimensions as the ones described here, and it is available in almost any business setting. Figure 9 shows the original attrition problem, without the additional constraint, solved in Excel. Spreadsheet cell labeled Cost contains the cost formula. Value of that cell is maximized, by changing decision variables x and y, as long as all constraints are satisfied. Point (0.64, 0.96) minimizes the objective function. Figure 9. Excel optimizer solves the optimization problem Slopes of line segments on the boundaries are calculated in column 3. Note that at point (0.64, 0.96), slope changes from 0.55 to 0. 21. As noted earlier, slope of the cost line is 0.3. So point (0.64, 0.96) would have been selected as optimal by the traditional ROCCH analysis as well. Figure 10 shows a new solution, after the additional capacity constraint was added. The new solution is (0.15, 0.44). Figure 10. Excel solution incorporating the capacity constraint 6.2 Note on practical considerations For all practical purposes, a business setting often prefers speed and expediency of delivery, to an optimal solution, as classifier’s performance is near optimal. In case of the attrition problem, we were lucky, in that one of the existing classifiers (0.14, 0.43) was in close proximity of the optimal solution (0.15, 0.44) and still within the feasible region. A selection of near optimal solution was, in this case, a simple decision because this was a relatively simple problem with just one additional constraint. 6.3 New performance constraints While analyzing the new optimal solution, we notice that the true positive rate is 0.44. In other words, out of all instances of the positive class, only 44% (less than a half) is classified correctly. This is not a desirable performance. 54% of defecting customers remain unrecognized and without being contacted will leave the bank. Solution analysis (see Appendix) shows that the capacity constraint is binding at the optimal point. We cannot increase the True Positive rate y, without leaving the feasible region and violating the capacity constraint. Any further improvements in the attritors’ recognition rate will require additional resources so that the capacity constraint can be relaxed. 
The optimization template set up in Excel allowed us to play a number of "what if" scenarios. In a final compromise, the line of business decided to relax the capacity constraint to allow for 30% of the population to be classified as positive. In return, we added a new "missed opportunity" constraint, which required that the true positive rate be no less than 60%.
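Reusing solve_roc_lp, points, and p from the sketch in Section 6.1, the compromise scenario can be checked directly; the true positive floor y ≥ 0.6 is rewritten as -y ≤ -0.6 to fit the ≤ form of the constraints:

```python
# Relaxed capacity (<= 0.30) plus the missed-opportunity constraint y >= 0.60.
res2 = solve_roc_lp(points, 30.0, 2575.0, p,
                    extra_A=[[1 - p, p], [0.0, -1.0]],
                    extra_b=[0.30, -0.60])
print(res2.x, -res2.fun)   # roughly (0.29, 0.65) and 54, as in Figure 12
```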
The optimal solution for this problem with two additional constraints is shown in Figure 11. The feasible region is now the triangle contained between three lines: the classifier constraint, the performance constraint y ≥ 0.6, and the relaxed capacity constraint (1-p)*x + p*y ≤ 0.3.

Figure 11. Feasible region with two additional constraints

[Figure 11. "Attrition model with capacity and performance constraints": ROC plot (FP rate vs. TP rate) of the triangular feasible region.]

The optimal solution (0.29, 0.65) found by the Excel optimizer is shown in Figure 12.

Figure 12. Excel solution with two additional constraints

[Figure 12 shows the spreadsheet with both additional constraints: the relaxed capacity constraint (1-p)*x + p*y ≤ 0.30 is binding at 0.30, the missed-opportunity constraint y ≥ 0.60 holds with y = 0.65, and the decision variables are x = 0.29, y = 0.65, with objective value 54.39.]

Other examples of frequently used performance constraints could include:
− limited size of a direct mail campaign
− restrictions in processing capacity for responders to a campaign
− limited expense of incentives for responders
− limited capacity of response processing systems (indirectly limiting a campaign size)

In each case, we need to express the constraints in terms of the decision variables, creating a new convex hull to intersect with the one created by the set of classifiers. Note that in this formulation we do not need to know the intersection points of the ROCCH with the constraints' convex hull: the optimal solution can be found with no explicit knowledge of the vertices of the combined convex hull.

7. CONCLUSION

Linear Programming is part of a broad field of constrained optimization called Mathematical Programming. Mathematical Programming is a rich discipline whose applications have grown fast over the past several decades. Once a problem has been framed as a mathematical programming problem, we gain a powerful tool and, more importantly, can draw on that field's rich legacy. Future developments may, for example, include multi-dimensional problems, where a classification problem has more than two classes. Computational difficulties grow quickly with the problem's dimension, but the techniques and methods of Linear Programming can help overcome these computational issues. We could also consider problems involving a non-linear cost function; the theory of Non-linear Programming may guide us in minimizing non-linear costs, such as piecewise linear or quadratic ones. Other areas of pursuit include methods to deal with classifier and/or constraint uncertainty: Stochastic Programming methods could perhaps be employed to find optimal solutions when the standard errors of classifiers need to be taken into account.

At the same time we do not lose the main benefits of ROCCH. We maintain the benefit of visualization, except that the cost function now "slides" along the convex hull. The benefit of being able to choose the optimal classifier at run time, if constraints change, can be maintained as well. The selection process would proceed as follows:
- Input new costs and constraints
- Intersect the constraints space with the ROC space
- Input all competing points
- Solve the optimization problem

8. ACKNOWLEDGEMENTS

Thanks to all who helped to inspire, develop and formulate the above thoughts. In particular, my gratitude goes to Endre Boros, Stan Matwin and Vera Helman, who generously shared their knowledge and experience and provided feedback throughout the various stages of this work.

9. REFERENCES

[1] Bazaraa, Mokhtar S., et al. Linear Programming and Network Flows. John Wiley & Sons, Inc., 1997.
[2] Murty, Katta G. Operations Research: Deterministic Optimization Models. Prentice Hall, Inc., 1995.
[3] Provost, F., Fawcett, T. Robust Classification Systems for Imprecise Environments. Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), 1998.
[4] Provost, F., Fawcett, T. Robust Classification for Imprecise Environments. Machine Learning, Vol. 42, No. 3, 2001, pp. 203–231.
10. APPENDIX

Answer report and sensitivity report produced by the Excel optimizer for the attrition problem with the capacity constraint.

Microsoft Excel 8.0e Answer Report
Worksheet: [ROCCH1.xls]Optimization Report
Created: 2/28/2002 6:16:12 PM

Target Cell (Max)
Cell    Name   Original Value   Final Value
$D$4    Max    0.00             38.33

Adjustable Cells
Cell    Name   Original Value   Final Value
$E$4    x      0.00             0.15
$F$4    y      0.00             0.44

Constraints
Cell     Name            Cell Value   Formula         Status        Slack
$D$7     - mx + y         0.00        $D$7<=$F$7      Binding       0
$D$8     - mx + y        -0.14        $D$8<=$F$8      Not Binding   0.14
$D$9     - mx + y         0.09        $D$9<=$F$9      Not Binding   0.01
$D$10    - mx + y         0.19        $D$10<=$F$10    Binding       0
$D$11    - mx + y         0.24        $D$11<=$F$11    Not Binding   0.02
$D$12    - mx + y         0.25        $D$12<=$F$12    Not Binding   0.04
$D$13    - mx + y         0.33        $D$13<=$F$13    Not Binding   0.19
$D$14    - mx + y         0.36        $D$14<=$F$14    Not Binding   0.25
$D$15    - mx + y         0.41        $D$15<=$F$15    Not Binding   0.41
$D$16    - mx + y         0.43        $D$16<=$F$16    Not Binding   0.49
$D$20    (1-p)*x + p*y    0.16        $D$20<=$F$20    Binding       0
$F$4     y                0.44        $F$4<=1         Not Binding   0.56
$E$4     x                0.15        $E$4<=1         Not Binding   0.85

Microsoft Excel 8.0e Sensitivity Report
Worksheet: [ROCCH1.xls]Optimization Report
Created: 2/28/2002 6:20:01 PM

Adjustable Cells
                 Final   Reduced   Objective     Allowable   Allowable
Cell    Name     Value   Cost      Coefficient   Increase    Decrease
$E$4    x        0.15    0         -28.9         2507.31     131.01
$F$4    y        0.44    0         96.6          1E+30       79.12

Constraints
                          Final   Shadow   Constraint   Allowable   Allowable
Cell     Name             Value   Price    R.H. Side    Increase    Decrease
$D$7     - mx + y          0.00   0        0            1E+30       0
$D$8     - mx + y         -0.14   0        0            1E+30       0.14
$D$9     - mx + y          0.09   0        0.10         1E+30       0.01
$D$10    - mx + y          0.19   91.77    0.19         0.01        0.47
$D$11    - mx + y          0.24   0        0.26         1E+30       0.02
$D$12    - mx + y          0.25   0        0.29         1E+30       0.04
$D$13    - mx + y          0.33   0        0.52         1E+30       0.19
$D$14    - mx + y          0.36   0        0.61         1E+30       0.25
$D$15    - mx + y          0.41   0        0.82         1E+30       0.41
$D$16    - mx + y          0.43   0        0.92         1E+30       0.49
$D$20    (1-p)*x + p*y     0.16   127.87   0.16         0.08        0.01
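The shadow prices above can also be recovered as the dual values of an LP solver. Continuing the scipy sketch from Section 6.1 (assuming a scipy version whose HiGHS interface exposes marginals on the inequality constraints; the sign is flipped because the sketch minimizes -C'):

```python
# res is the capacity-constrained solution from the Section 6.1 sketch.
duals = -res.ineqlin.marginals   # one dual value per A_ub row
print(duals[-1])   # capacity constraint: roughly 128 (Excel reports 127.87)
```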