Evaluating Classifiers’ Performance
In A Constrained Environment
Anna Olecka
olecka@rutcor.rutgers.edu
Fleet Boston Financial
Database Marketing Department
1075 Main St.
Waltham, MA 02451
RUTCOR
Rutgers University
640 Bartholomew Road
Piscataway, NJ 08854-8003
ABSTRACT
In this paper we focus on the methodology of finding a classifier
with minimal cost in the presence of additional performance
constraints. ROCCH analysis, where accuracy and cost are
intertwined in the solution space, was a revolutionary tool for
two-class problems. We propose an alternative formulation, as an
optimization problem of the kind commonly used in Operations
Research. This approach extends the ROCCH analysis to allow for
locating optimal solutions while outside constraints are present.
Similarly to the ROCCH analysis, we combine cost and class
distribution in defining the objective function. Rather than
focusing on slopes of the edges in the convex hull of the solution
space, however, we treat cost as an objective function to be
minimized over the solution space by selecting the best performing
classifier(s) (one or more vertices of the solution space). The
Linear Programming framework provides a theoretical and
computational methodology for finding the vertex (classifier)
which minimizes the objective function.
1. INTRODUCTION
Consider a problem where classifiers' performance has to be
evaluated taking into account additional constraints related to
error rates. Such constraints often arise from implementation.
They could, for example, involve a limited workforce to resolve
cases of suspected fraud, a limited size of a direct mail campaign,
or restrictions on the cost of incentives for responders. The
application example used throughout this paper involves an
attrition model for a bank: the bank plans a calling campaign to
lower the attrition rate among its customers.
Naturally, one of the implementation concerns is the limited
availability of resources, such as phone representatives.
Traditionally, a model is selected first and summarized by a
modeler into several performance buckets. Then, the
implementation team will "eyeball" the thresholds of the
performance buckets and pick the threshold that most closely
matches the constrained resources. Such business practice often
results in a sub-optimal solution being selected. If the constraints
are known a priori, they can be built into the system evaluating
classifiers' performance. If they are not known until
implementation time, a system analogous to the ROCCH could be
built to select the best classifier at that time. To this end, we
propose an evaluation system that can deal with additional
constraints related to prediction errors. We will also show how to
apply such a system in the business scenario described above.
Provost and Fawcett have shown in [3] that some specific metrics
frequently used in Machine Learning (e.g., workforce constraints
and the Neyman-Pearson decision criterion) are optimized by the
ROCCH method. The optimization approach proposed here extends
their results to any linear constraint. In fact, our approach can be
applied to an unlimited number of constraints, as long as they
remain linear. Numerically finding the intersection points of such
additional constraints with the ROCCH can be computationally
tedious. A mathematical programming approach provides efficient
tools for finding optimal solutions without explicitly calculating
all the intersection points.
ROCCH analysis is a powerful and widely accepted tool for
visualizing and optimizing classification problems with two
classes. It plots performance of all classifiers under consideration
in a two-dimensional space, with false positive rate on one axis
and true positive rate on the other. It measures classifiers’
performance for various costs of misclassification and under
various class distributions. It allows for visual representation of
classifiers and enables quick decisions in choosing the right
classifier for the given costs. The main advantage of this
methodology is its flexibility under varying conditions. One
drawback of this methodology is that it is somewhat rigid in
looking for an optimal classifier. It forms an optimal slope by
considering an optimal combination of class probabilities and
error costs, and then looks for one of two scenarios: either an
edge of the convex hull with a slope equal to the optimal one, or
a vertex between two edges where the difference between the
edge slopes and the optimal slope changes sign.
Similarly to the ROCCH, we construct the convex hull of all
classifiers in the error space. Convexity of the error space is
accomplished in the following steps. First we note that a convex
combination of two classifiers is also a viable classifier. Then, we
remove dominated classifiers. Finally, we impose additional
performance constraints, if any. Those additional constraints also
form a convex region. The intersection of the two regions is a new
convex hull, the potential solution space. Finally, similarly to the
ROCCH, we treat a combination of costs and class probabilities as
an objective function to be minimized over the solution space.
Due to the existence of additional constraints, however, iterating
over slopes of the edges of the convex hull is no longer
computationally efficient, because not all vertices of the convex
hull are known. The theory of linear programming provides
computational tools for finding the optimal solution(s) without
explicit knowledge of all vertices.
2. ROC CONVEX HULL & HYBRID CLASSIFIERS
Suppose we construct a series of k-1 classifiers by varying a
positive decision threshold, starting from "never" and gradually
increasing the probability of decision "Yes". By allowing more
instances to be classified as positive, each new threshold will
increase the number of correctly classified positive instances, but
it may also increase the frequency of false positive classifications.
In the space (FP rate, TP rate), each new classifier will be
positioned up and to the right of the previous one. Connecting
consecutive points with line segments generates a convex set,
where each vertex represents a classifier (Figure 1).
Table 1. Neural net model for bank attrition

    Cutoff / Threshold    FP rate p(Y | n)    TP rate p(Y | y)
    Never                 0                   0
    0.54                  0.06                0.25
    0.48                  0.14                0.43
    0.42                  0.23                0.58
    0.36                  0.33                0.70
    0.30                  0.42                0.82
    0.24                  0.53                0.90
    Always                1                   1
Table 1 represents a set of classifiers obtained from a neural
network model for a bank attrition problem. Figure 1 shows the
resulting convex set.
Figure 1. ROC curve for the neural net attrition model
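The construction is easy to carry out on a scored dataset. The
following is a minimal sketch (not from the paper): scores and
labels are hypothetical arrays of model outputs and true class
indicators, and the thresholds play the role of the cutoffs in
Table 1.

    import numpy as np

    def roc_points(scores, labels, thresholds):
        # One (FP rate, TP rate) point per decision threshold.
        scores = np.asarray(scores)
        labels = np.asarray(labels, dtype=bool)   # True = positive class
        points = []
        for t in thresholds:                      # e.g. 0.54, 0.48, ..., 0.24
            decide_yes = scores >= t
            fp_rate = decide_yes[~labels].mean()  # p(Y | n)
            tp_rate = decide_yes[labels].mean()   # p(Y | y)
            points.append((fp_rate, tp_rate))
        return points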
Figure 2 shows a set of classifiers obtained by varying thresholds
for two logistic regression models for the attrition data. In this
representation, we can visually recognize dominated classifiers.
They are positioned inside the convex region, while potential
candidates for optimal solution are on the boundary. It is easy to
see [3] that under any cost structure there is a classifier on the
boundary which will outperform a dominated classifier.
Figure 2. Two competing logistic regression models
Figure 3 shows a hybrid classifier obtained by removing
dominated points. Any point positioned inside the bounded
region can be outperformed by some point on the boundary. The
boundary forms a hybrid classifier, or a set of potential candidates
for an optimal solution.
Figure 3. ROCCH for the two logistic regression models
2.1 Remark on convexity
The hybrid classifier obtained from our two logistic regression
models formed a convex set in the (FP rate, TP rate) plane. In
general, this doesn’t have to be the case. Figure 4 provides such
an example at point B, where the region boundary transitions from
the neural network model to the logistic model: the boundary
"dips" below the line from A to C. This can be "fixed" by
creating a new classifier B' on the (A, C) line. For example, if we
want a point halfway between A and C, the new classifier is
obtained by using classifiers A and C randomly, each with
probability 0.5. Provost and Fawcett describe a similar scheme,
using random sampling to create a new classifier.
In general, any convex combination of two classifiers becomes a
new classifier. For any point X on a line segment created by
classifiers π1 and π2, we can always construct a new classifier
corresponding to that point. We start by describing such a point
as a convex combination:
X = α*π1 + (1 − α)*π2, where 0 ≤ α ≤ 1
Then, we randomly partition the population being classified into
two groups, in proportions α and 1 − α, and apply the appropriate
model to each group.
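As a sketch, the construction can be packaged as a higher-order
function; pi1 and pi2 are assumed to be callables returning a
Yes/No decision for an instance (names are ours, not the paper's):

    import random

    def convex_combination(pi1, pi2, alpha):
        # Use pi1 with probability alpha, pi2 otherwise; in expectation
        # the error rates are alpha*(x1, y1) + (1 - alpha)*(x2, y2).
        def classify(instance):
            return pi1(instance) if random.random() < alpha else pi2(instance)
        return classify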
Figure 4. Boundary of ROC space for two classifiers is not
always convex
A new, convex boundary for the attrition problem is shown in
Figure 5.
Figure 5. Convexity is obtained by creating convex
combinations of existing classifiers
3. BASIC TERMINOLOGY
The following terminology is used throughout the remaining
sections.
− Two classes: positive (y) and negative (n), with probabilities
p and 1-p respectively
− Classification decision: positive (Y) and negative (N)
− FP, TP, FN, TN represent number of instances of each kind:
false positive, true positive, false negative and true negative
respectively
− Rates of those instances are represented as follows:
FP_rate = p(Y|n) TP_rate = p(Y|y)
FN_rate = p (N|y) TN_rate = p(N|n)
− In the ROC space, the horizontal axis represents FP_rate, and
the vertical axis represents TP_rate. We will use x to denote
FP_rate and y to denote TP_rate
− Unit cost of a false positive error = c(Y|n)
− Unit cost of a false negative error = c(N|y)
In defining the cost, we need to start with the expected number of
errors and transform the resulting formula into the ROC space
terms, where the variables are error rates. The following scheme
visualizes interdependencies between terms.
Table 2. Two class classification scheme

    Class (prior)        Decision Y       Decision N
    Class y (p*M)        TP, rate y       FN, rate 1-y
    Class n ((1-p)*M)    FP, rate x       TN, rate 1-x
Let M be the total number of instances being classified. For given
p, x and y we can calculate the expected number of classified
cases of each kind as follows.
TP = M*p*y
FN = M*p*(1-y)
FP = M* (1-p)*x
TN = M*(1-p)*(1-x)
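These formulas translate directly into code; a small helper of our
own, for illustration:

    def expected_counts(M, p, x, y):
        # M = population size, p = positive-class prior,
        # x = FP rate, y = TP rate
        TP = M * p * y
        FN = M * p * (1 - y)
        FP = M * (1 - p) * x
        TN = M * (1 - p) * (1 - x)
        return TP, FN, FP, TN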
4. COST FUNCTION
We now apply the cost function to the attrition problem and look
for a classifier on the boundary of the convex hull which will
minimize the total cost.
A false negative classification means that we won't recognize an
attriting customer and will lose the account. The cost of losing a
customer is tied to the net income after taxes (NIAT) this customer
brings to the bank. In addition, both types of error generate labor
costs related to preventive action. The line of business decided to
assign error costs as shown in Table 3.
Table 3. Misclassification costs assignment

    Error                        Cost per unit
    False positive c(Y|n)        $30
    False negative c(N|y)        $2,575
Given the cost of a false negative error, c(N | y), and of a false
positive error, c(Y | n), the total expected cost is

EC = c(Y | n) * E(FP) + c(N | y)* E(FN)

where E(FN) is the expected number of false negatives and
E(FP) is the expected number of false positives.
We want to minimize the expected total cost in terms of the
decision variables x and y.
Minimize EC = c(Y | n) * E(FP) + c(N | y)* E(FN)
= c(Y | n) * (1-p) * M * x + c(N | y) * p* M * (1 – y )
= c(Y | n) * (1-p) * M * x + c(N | y) * p * M - c(N | y)*p*M*y
After subtracting the constant term (not dependent on the decision
variables) and dividing by M, this is equivalent to:

Minimize (EC - c(N | y)*p*M)/M = c(Y | n)*(1-p)*x - c(N | y)*p*y
This is equivalent to maximizing the opposite function
Maximize C’ = - c(Y | n)*(1-p)*x + c(N | y)*p*y
C’ in space (x,y) is a collection of parallel lines with slopes
depending on misclassification costs and the a-priori probability
of the positive class. In the attrition example we are analyzing,
p=3.75%. We are now ready to calculate slopes for the cost lines.
m = [c(Y | n)*(1-p)] / [c(N | y)*p] = 30*(1-0.0375) / (2575*0.0375) ≈ 0.3
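The same arithmetic in code, using the paper's figures (a sketch;
the variable names are ours):

    c_fp, c_fn, p = 30.0, 2575.0, 0.0375   # c(Y|n), c(N|y), attrition rate
    coef_x = -c_fp * (1 - p)               # coefficient of x in C': -28.875
    coef_y = c_fn * p                      # coefficient of y in C':  96.5625
    m = -coef_x / coef_y                   # iso-cost slope, approximately 0.30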
Intercepts of those lines vary with the position of each line on the
plane. The higher a line lies on the plane (the larger its intercept),
the higher the value of C'. Since our objective is to maximize C',
we can start at the bottom of the convex hull and move the line up
as far as possible. Each point of the region touching the line in a
given position has the same cost. This defines iso-performing
lines.
We want to find a point in the convex region that the line touches
in its highest position possible. Figure 6 shows the convex hull for
the attrition problem, and cost function moving up through the
region. The highest position of the cost line is obtained at vertex
A. The theory of Linear Programming shows that, for a convex
and bounded region, the objective function is maximized either at
one of the vertices, or on a line segment joining two vertices
(Bazaraa [1]). Thus, we can always pick one or more best
performing classifiers.
Figure 6. Cost functions traversing the ROC convex hull
As in the ROCCH analysis, we have iso-performing lines, which
help visualize performance of classifiers under various cost
structures. In the Linear Programming setting, however, the
feasible region can incorporate any additional constraints. We are
gaining flexibility and control.
5. ADDING ADDITIONAL CONSTRAINTS
In the attrition example, some implicit constraints are already
built into the ROC space. Error rates are non-negative and cannot
exceed 1. Those constraints bound the convex set and guarantee
existence of an optimal solution. In addition, a planned calling
campaign is limited by the availability of customer service
resources. The phone call scenario included approaching a
customer to find out if their intention was indeed to leave the
bank, and attempting to entice them to stay. Naturally, there are
limited resources the bank can devote to this undertaking.
Traditionally, modelers would
select the best model, and any additional constraints would be
imposed at the implementation time, based on the pre-selected
model. But if a modeler is aware of the constraints, they can be
built into the system evaluating model’s performance. The line of
business, for which these classifiers were developed, determined
that they could handle calling 20% of their customer base. This
resulted in the constraint
FP + TP ≤ 0.2*M
In order to plot the constraint in the (x, y) error rate space, we
need to formulate it in terms of the decision variables x and y.
Substituting FP = M*(1-p)*x and TP = M*p*y and dividing by M gives

(1-p)*x + p*y ≤ 0.2
Geometrically, this inequality is represented by a half plane in the
(x,y) space. Figure 7 shows a case of the attrition problem with
an additional constraint. The feasible region is now the
intersection of the original convex set with this half plane: the
area to the left of (below) the dotted line.
Figure 7. The capacity constraint bounds the feasible region
Our optimal classifier is no longer feasible, since it is outside of
the new feasible region.
We need to “slide” the cost line back down to the feasible region.
It will touch the feasible region at the point where the hybrid
classifier intersects the constraint line. As noted earlier, this point
defines a new classifier. The new classifier is now optimal, as
shown in Figure 8.
Figure 8. New optimal solution
In the next section we show how to find the new vertex. Here, we
will point out an actual implementation of the new solution as
outlined by Fawcett and Provost in [3].
Assume that the intersection point divides the line segment
between A and B at ratio α. To obtain the new classifier, we can
proceed as follows.
1. With probability α use classifier A
2. With probability 1- α use classifier B
If A and B were obtained from the same model by varying the
decision threshold, this process can be simplified by finding an
appropriate threshold between A and B.
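A small sketch of the computation (function and variable names
are ours): if the binding constraint line is a1*x + a2*y = b and
A, B are the hull vertices on either side of it, the ratio α solves
a one-dimensional linear equation.

    def mixing_ratio(A, B, a1, a2, b):
        # Alpha such that alpha*A + (1 - alpha)*B lies on a1*x + a2*y = b.
        (xa, ya), (xb, yb) = A, B
        return (b - a1 * xb - a2 * yb) / (a1 * (xa - xb) + a2 * (ya - yb))

The combined classifier then uses A with probability α and B with
probability 1 − α, exactly as in the two steps above.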
6. LINEAR PROGRAMMING FORMULATION
A linear program is a constrained optimization problem, where
the objective function, as well as all constraints, are linear. We
need to select values for all decision variables so that all
constraints are satisfied and the objective function is minimized
(or maximized). In this case, we need to pick a point(s) in the
ROC space which will minimize the cost function (or maximize
the modified cost function). Decision variables are then the
coordinates (x,y) of points in the ROC space. Some of the
constraints arise from the classifiers’ performance.
In the ROC space, points under consideration need to be below
the boundary of the convex region. Additional constraints are
usually related to implementation and/or quality issues.
Finally, we have non-negativity constraints and bounding
constraints, since the ROC variables have to be between 0 and 1.
A canonical form of a linear (maximization) program takes the
following format.
Maximize    C = Σ_j c_j * x_j

subject to  Σ_j a_ij * x_j ≤ b_i,    i = 1, …, k
            x_j ≥ 0,                 j = 1, …, d
The theory of linear programming assures us that if the set of
constraints forms a convex and bounded set (called the feasible
region), then an optimal solution is found on the boundary of the
feasible region [1]. In our two-dimensional case, a solution can
be found at a vertex, or on an edge joining two vertices. Note that
the feasible set is created as a conjunction of several linear
inequalities. Not all vertices are known explicitly. A number of
computational techniques have been designed to find optimal
solutions without explicitly iterating over the vertices. More
details can be found in [1] and [2].
The theory of Linear Programming also aids an analysis of a
solution found, if we want to play some “what if” scenarios. At a
point of optimality, some constraints will be satisfied as
"binding"; that is, the left-hand side will be equal to the
right-hand side. Others will be satisfied as an inequality, leaving
“slack”, or room for improvement.
In a two-dimensional case, if two constraints are found binding at
an optimal solution, the classifier at the intersection of those
constraints is optimal. An analysis of slack at neighboring points
(called marginal analysis or sensitivity analysis), often provides
insight into alternative solutions and how close they are to
optimality. All commercially available optimization packages,
including a module that comes with Excel, will provide slack
information for all constraints. We show an example of an
Excel sensitivity analysis report in the Appendix. In that example,
the capacity constraint, which was based on workforce
availability, is binding. That means that any further improvement
of classifier’s performance will require additional workforce
resources. Additional resources will provide slack on the capacity
constraint. This will allow the optimal solution to move along the
convex hull boundary in a direction improving the objective
function.
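A toy illustration of binding constraints and slack (not the
attrition data; it assumes SciPy's HiGHS-backed linprog, which
reports per-constraint slack in res.slack and dual values in
res.ineqlin.marginals):

    from scipy.optimize import linprog

    # maximize 3x + 2y  subject to  x <= 2,  y <= 2,  x + y <= 3
    res = linprog(c=[-3, -2], A_ub=[[1, 0], [0, 1], [1, 1]], b_ub=[2, 2, 3],
                  bounds=[(0, None), (0, None)], method="highs")
    for i, s in enumerate(res.slack):
        status = "binding" if abs(s) < 1e-9 else "slack = %.2f" % s
        # sign flipped because we minimized the negated objective
        print("constraint %d: %s, shadow price = %.2f"
              % (i, status, -res.ineqlin.marginals[i]))
    # constraints 0 and 2 bind; constraint 1 (y <= 2) has slack 1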
We will now formulate the problem of looking for optimal
classifier in the presence of additional constraints, as a linear
programming problem. We already have a formal representation
of the objective function.
Maximize C’(x, y) = - c(Y | n)*(1-p)*x + c(N | y)*p*y
Now we need to formulate the set of constraints related to the
classifiers and add a set of additional, bounding and non-
negativity constraints.
Consider two classifiers Pi and Pj on the boundary, with error rates
(xi, yi) and (xj, yj), and assume that xi ≤ xj. The line segment
through points Pi and Pj has the slope

m = (yj – yi)/(xj – xi)

and the equation y – yi = m*(x – xi).
The feasible region is positioned below the line, so it is
determined by a collection of inequalities
y – yi ≤ ((yj – yi)/ (xj – xi) )*( x – xi)
which can be rearranged as
- (yj – yi) x + (xj – xi) y ≤ - xi (yj – yi) + yi (xj – xi)
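The rearranged inequality maps one-to-one onto the A_ub @ [x, y]
<= b_ub form that LP solvers expect; a sketch (the function name is
ours):

    def edge_constraints(points):
        # points: boundary classifiers (x_i, y_i), sorted by increasing FP rate
        A_ub, b_ub = [], []
        for (xi, yi), (xj, yj) in zip(points, points[1:]):
            A_ub.append([-(yj - yi), xj - xi])
            b_ub.append(-xi * (yj - yi) + yi * (xj - xi))
        return A_ub, b_ub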
The optimization problem can now be defined as follows.

Given classifiers Pi with error rates (xi, yi), i = 1, 2, …, k,
such that xi ≤ xi+1,

Maximize C'(x, y) = - c(Y | n)*(1-p)*x + c(N | y)*p*y

such that

    a1_ij*x + a2_ij*y ≤ b_ij    for i = 1, 2, …, k-1; j = i + 1
    π1_l*x + π2_l*y ≤ q_l       for l = 1, 2, …, l0
    0 ≤ x ≤ 1
    0 ≤ y ≤ 1

where

    a1_ij = -(yj – yi),   a2_ij = (xj – xi)
    b_ij = -xi*(yj – yi) + yi*(xj – xi)

and π1_l, π2_l, q_l form the additional constraints.
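The whole formulation fits in a few lines of SciPy (a sketch; the
paper itself used Excel's solver, and linprog minimizes, so the
objective is negated). The hull points below come from Table 1; the
edge_constraints construction from the previous sketch is inlined
so the example is self-contained.

    from scipy.optimize import linprog

    def solve_roc_lp(points, c_fp, c_fn, p, extra_A=(), extra_b=()):
        # Edge rows: -(yj - yi)*x + (xj - xi)*y <= -xi*(yj - yi) + yi*(xj - xi)
        A_ub = [[-(yj - yi), xj - xi]
                for (xi, yi), (xj, yj) in zip(points, points[1:])]
        b_ub = [-xi * (yj - yi) + yi * (xj - xi)
                for (xi, yi), (xj, yj) in zip(points, points[1:])]
        A_ub += [list(row) for row in extra_A]   # any additional constraints
        b_ub += list(extra_b)
        c = [c_fp * (1 - p), -c_fn * p]          # minimize -C'(x, y)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(0, 1), (0, 1)], method="highs")
        return res.x, -res.fun                   # optimal (x, y) and C'

    points = [(0, 0), (0.06, 0.25), (0.14, 0.43), (0.23, 0.58), (0.33, 0.70),
              (0.42, 0.82), (0.53, 0.90), (0.64, 0.96), (0.75, 0.98),
              (0.88, 0.99), (1, 1)]
    print(solve_roc_lp(points, c_fp=30, c_fn=2575, p=0.0375))
    # optimum at (0.64, 0.96), the vertex ROCCH analysis would also select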
6.1 Computational considerations
In the absence of additional constraints, all vertices are known a
priori. In such a case it is computationally efficient to explicitly
calculate the expected cost of each classifier and choose the one
with the smallest cost, rather than actually slide the lines over the
feasible region.
When additional constraints are introduced, the situation changes.
Finding new vertices, formed by the additional intersection points,
can get cumbersome. Fortunately, the theory and practice of
mathematical programming provides efficient algorithms for
finding optimal solutions without a need to iterate over all
vertices. There are also efficient solvers on the market, which can
solve optimization problems. An add-on to Excel provides one
such solver. It is efficient for problems in small dimensions as the
ones described here, and it is available in almost any business
setting. Figure 9 shows the original attrition problem, without the
additional constraint, solved in Excel. Spreadsheet cell labeled
Cost contains the cost formula. Value of that cell is maximized,
by changing decision variables x and y, as long as all constraints
are satisfied. Point (0.64, 0.96) minimizes the objective function.
Figure 9. Excel optimizer solves the optimization problem

    Maximizing Modified Cost Function: -c(Y | n)*(1-p)*x + c(N | y)*p*y

    Error cost per unit                     Decision variables
    c(Y|n)   c(N|y)   Attrition rate p     Max     x      y
    $30      $2,575   0.0375               74.29   0.64   0.96

    Data points and constraints
    FP rate x1   TP rate y1   Slope m   - m*x + y        - m*x1 + y1
    0            0
    0.06         0.25         3.83      -1.48       <=   0.00
    0.14         0.43         2.31      -0.51       <=   0.10
    0.23         0.58         1.66      -0.09       <=   0.19
    0.33         0.70         1.35       0.10       <=   0.26
    0.42         0.82         1.28       0.15       <=   0.29
    0.53         0.90         0.72       0.50       <=   0.52
    0.64         0.96         0.55       0.61       <=   0.61
    0.75         0.98         0.21       0.82       <=   0.82
    0.88         0.99         0.08       0.91       <=   0.92
    1.00         1.00         0.04       0.93       <=   0.96
Slopes of line segments on the boundary are calculated in the third
column. Note that at point (0.64, 0.96) the slope changes from 0.55
to 0.21. As noted earlier, the slope of the cost line is 0.3. So point
(0.64, 0.96) would have been selected as optimal by the traditional
ROCCH analysis as well.
Figure 10 shows a new solution, after the additional capacity
constraint was added. The new solution is (0.15, 0.44).
Figure 10. Excel solution incorporating the capacity constraint

    Maximizing Modified Cost Function: -c(Y | n)*(1-p)*x + c(N | y)*p*y

    Error cost per unit            Decision variables
    c(Y|n)   c(N|y)   Rate p      Max     x      y
    $30      $2,575   0.0375      38.33   0.15   0.44

    Data points and constraints
    FP rate x1   TP rate y1   Slope m   - m*x + y        - m*x1 + y1
    0            0
    0.06         0.25         3.83      -0.14       <=   0.00
    0.14         0.43         2.31       0.09       <=   0.10
    0.23         0.58         1.66       0.19       <=   0.19
    0.33         0.70         1.35       0.24       <=   0.26
    0.42         0.82         1.28       0.25       <=   0.29
    0.53         0.90         0.72       0.33       <=   0.52
    0.64         0.96         0.55       0.36       <=   0.61
    0.75         0.98         0.21       0.41       <=   0.82
    0.88         0.99         0.08       0.43       <=   0.92
    1.00         1.00         0.04       0.44       <=   0.96

    Capacity constraint: (1-p)*x + p*y <= 0.20;  value 0.16 <= 0.16
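The constrained solve can be reproduced outside Excel as well; a
short sketch reusing the solve_roc_lp function and points list from
the sketch in the formulation above (assumed to be in scope), with
the capacity row appended:

    p, q = 0.0375, 0.20     # q = capacity fraction; 20% as stated in the text
    xy, best = solve_roc_lp(points, c_fp=30, c_fn=2575, p=p,
                            extra_A=[[1 - p, p]],  # (1-p)*x + p*y <= q
                            extra_b=[q])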
6.2 Note on practical considerations
For all practical purposes, a business setting often prefers speed
and expediency of delivery to strict optimality, as long as the
classifier's performance is near optimal. In the case of the attrition
problem we were lucky, in that one of the existing classifiers,
(0.14, 0.43), was in close proximity to the optimal solution
(0.15, 0.44) and still within the feasible region. Selecting a
near-optimal solution was, in this case, a simple decision, because
this was a relatively simple problem with just one additional
constraint.
6.3 New performance constraints
While analyzing the new optimal solution, we notice that the true
positive rate is 0.44. In other words, out of all instances of the
positive class, only 44% (less than half) are classified correctly.
This is not a desirable performance: 56% of defecting customers
remain unrecognized and, without being contacted, will leave the
bank. Solution analysis (see Appendix) shows that the capacity
constraint is binding at the optimal point. We cannot increase the
True Positive rate y, without leaving the feasible region and
violating the capacity constraint. Any further improvements in the
attritors’ recognition rate will require additional resources so that
the capacity constraint can be relaxed. The optimization template
set up in Excel allowed us to play a number of “what if”
scenarios. In a final compromise, the line of business decided to
relax the capacity constraint to allow for 30% of the population to
be classified as positive. In return, we added a new “missed
opportunity” constraint, which required that the True Positive rate
is no less than 60%. The optimal solution for this problem with
two additional constraints is shown in Figure 11. The feasible
region is now the triangle contained between three lines: the
classifier constraint, the performance constraint y ≥ 0.6, and the
relaxed capacity constraint

(1-p)*x + p*y ≤ 0.3
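Both additional constraints drop straight into the same sketch
(again reusing solve_roc_lp and points from Section 6; y ≥ 0.6 is
rewritten as -y ≤ -0.6 to fit the ≤ form):

    p = 0.0375
    extra_A = [[1 - p, p],   # relaxed capacity: (1-p)*x + p*y <= 0.3
               [0, -1]]      # missed opportunity: y >= 0.6
    extra_b = [0.3, -0.6]
    xy, best = solve_roc_lp(points, c_fp=30, c_fn=2575, p=p,
                            extra_A=extra_A, extra_b=extra_b)
    # with the rounded Table 1 points this lands near (0.29, 0.64),
    # close to the (0.29, 0.65) optimum reported in Figure 12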
Figure 11. Feasible region with two additional constraints
The optimal solution (0.29, 0.65) found by the Excel Optimizer is
shown in Figure 12.
Figure 12. Excel solution with two additional constraints

    Maximizing Modified Cost Function: -c(Y | n)*(1-p)*x + c(N | y)*p*y

    Error cost per unit            Decision variables
    c(Y|n)   c(N|y)   Rate p      Max     x      y
    $30      $2,575   0.0375      54.39   0.29   0.65

    Data points and constraints
    FP rate x1   TP rate y1   Slope m   - m*x + y        - m*x1 + y1
    0            0
    0.06         0.25         3.83      -0.45       <=   0.00
    0.14         0.43         2.31      -0.01       <=   0.10
    0.23         0.58         1.66       0.17       <=   0.19
    0.33         0.70         1.35       0.26       <=   0.26
    0.42         0.82         1.28       0.28       <=   0.29
    0.53         0.90         0.72       0.44       <=   0.52
    0.64         0.96         0.55       0.49       <=   0.61
    0.75         0.98         0.21       0.59       <=   0.82
    0.88         0.99         0.08       0.63       <=   0.92
    1.00         1.00         0.04       0.64       <=   0.96

    Capacity constraint: (1-p)*x + p*y <= 0.30;  value 0.30 <= 0.30
    Missed opportunity constraint: y >= 0.60;  value 0.65 >= 0.60
Other examples of frequently used performance constraints could
include:
− limited size of a direct mail campaign
− restrictions in processing capacity for responders to a
campaign
− limited expense of incentives for responders
− limited capacity of response processing systems (indirectly
resulting in limiting a campaign size)
In each case, we need to express the constraints in terms of
decision variables to create the new convex hull to intersect with
the one created by the classifiers set.
Note that in this formulation we do not need to know the
intersection points of the ROCCH with the constraints' convex hull.
The optimal solution can be found with no explicit knowledge of
vertices of the combined convex hull.
7. CONCLUSION
Linear Programming is part of a broad field of constrained
optimization called Mathematical Programming. Mathematical
Programming is a rich discipline and its applications have been
growing fast in the past several decades.
Once a problem has been framed as a mathematical programming
problem, we gain a powerful tool. More importantly, we can draw
from that field’s rich legacy. Some future developments may, for
example, include multi-dimensional problems where a
classification problem has more than two classes. Computational
difficulties grow fast with problem’s dimensions, but techniques
and methods of Linear Programming can help overcome
computational issues. We could also consider problems involving
a non-linear cost function. The theory of Non-linear Programming
may guide us in minimizing non-linear costs, such as piecewise
linear or quadratic ones. Other areas of pursuit may include
methods to deal with classifier and/or constraint uncertainty.
Stochastic Programming methods could perhaps be employed to
find optimal solutions when the standard errors of classifiers need
to be taken into account.
At the same time we are not losing the main benefits of ROCCH.
We maintain the benefit of visualization, except the cost function
now “slides” along the convex hull. The benefit of being able to
choose the optimal classifier at run time if constraints change
can be maintained as well. The selection process would proceed
as follows:
- Input new costs and constraints
- Intersect a constraints space with the ROC space
- Input all competing points
- Solve the optimization problem
8. ACKNOWLEDGEMENTS
Thanks to all who helped to inspire, develop and formulate the
above thoughts. In particular my gratitude goes to Endre Boros,
Stan Matwin and Vera Helman who generously shared their
knowledge and experience and provided feedback throughout the
various stages of this work.
9. REFERENCES
[1] Bazaraa, Mokhtar S., et al. Linear Programming and
Network Flows. John Wiley & Sons, Inc., 1997.
[2] Murty, Katta G. Operations Research: Deterministic
Optimization Models. Prentice Hall, Inc., 1995.
[3] Provost, F., Fawcett, T. Robust Classification Systems
for Imprecise Environments. Proceedings of the Fifteenth
National Conference on Artificial Intelligence (AAAI-98), 1998.
[4] Provost, F., Fawcett, T. Robust Classification for
Imprecise Environments. Machine Learning, Vol. 42, No. 3,
2001, pp. 203-231.
10. APPENDIX
Answer report and sensitivity analysis report produced by an
Excel Optimizer for the attrition problem with capacity constraint.
Microsoft Excel 8.0e Answer Report
Worksheet: [ROCCH1.xls]Optimization
Report Created: 2/28/2002 6:16:12 PM

Target Cell (Max)
    Cell    Name    Original Value    Final Value
    $D$4    Max     0.00              38.33

Adjustable Cells
    Cell    Name    Original Value    Final Value
    $E$4    x       0.00              0.15
    $F$4    y       0.00              0.44

Constraints
    Cell     Name             Cell Value    Formula          Status         Slack
    $D$7     - mx + y          0.00         $D$7<=$F$7       Binding        0
    $D$8     - mx + y         -0.14         $D$8<=$F$8       Not Binding    0.14
    $D$9     - mx + y          0.09         $D$9<=$F$9       Not Binding    0.01
    $D$10    - mx + y          0.19         $D$10<=$F$10     Binding        0
    $D$11    - mx + y          0.24         $D$11<=$F$11     Not Binding    0.02
    $D$12    - mx + y          0.25         $D$12<=$F$12     Not Binding    0.04
    $D$13    - mx + y          0.33         $D$13<=$F$13     Not Binding    0.19
    $D$14    - mx + y          0.36         $D$14<=$F$14     Not Binding    0.25
    $D$15    - mx + y          0.41         $D$15<=$F$15     Not Binding    0.41
    $D$16    - mx + y          0.43         $D$16<=$F$16     Not Binding    0.49
    $D$20    (1-p)*x + p*y     0.16         $D$20<=$F$20     Binding        0
    $F$4     y                 0.44         $F$4<=1          Not Binding    0.56
    $E$4     x                 0.15         $E$4<=1          Not Binding    0.85

Microsoft Excel 8.0e Sensitivity Report
Worksheet: [ROCCH1.xls]Optimization
Report Created: 2/28/2002 6:20:01 PM

Adjustable Cells
    Cell    Name    Final Value    Reduced Cost    Objective Coefficient    Allowable Increase    Allowable Decrease
    $E$4    x       0.15           0               -28.9                    2507.31               131.01
    $F$4    y       0.44           0               96.6                     1E+30                 79.12

Constraints
    Cell     Name             Final Value    Shadow Price    Constraint R.H. Side    Allowable Increase    Allowable Decrease
    $D$7     - mx + y          0.00          0               0                       1E+30                 0
    $D$8     - mx + y         -0.14          0               0                       1E+30                 0.14
    $D$9     - mx + y          0.09          0               0.10                    1E+30                 0.01
    $D$10    - mx + y          0.19          91.77           0.19                    0.01                  0.47
    $D$11    - mx + y          0.24          0               0.26                    1E+30                 0.02
    $D$12    - mx + y          0.25          0               0.29                    1E+30                 0.04
    $D$13    - mx + y          0.33          0               0.52                    1E+30                 0.19
    $D$14    - mx + y          0.36          0               0.61                    1E+30                 0.25
    $D$15    - mx + y          0.41          0               0.82                    1E+30                 0.41
    $D$16    - mx + y          0.43          0               0.92                    1E+30                 0.49
    $D$20    (1-p)*x + p*y     0.16          127.87          0.16                    0.08                  0.01
More Related Content

PDF
Flavours of Physics Challenge: Transfer Learning approach
PDF
ProbErrorBoundROM_MC2015
PPTX
Fuzzy Model Presentation
PPT
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...
PDF
Adaptive response surface by kriging using pilot points for structural reliab...
PDF
Ica 2013021816274759
PPTX
Fuzzy presenta
PPT
Introduction to mars_2009
Flavours of Physics Challenge: Transfer Learning approach
ProbErrorBoundROM_MC2015
Fuzzy Model Presentation
Forecasting Default Probabilities in Emerging Markets and Dynamical Regula...
Adaptive response surface by kriging using pilot points for structural reliab...
Ica 2013021816274759
Fuzzy presenta
Introduction to mars_2009

Similar to Evaluating Classifiers' Performance KDD2002 (20)

DOC
POST OPTIMALITY ANALYSIS.doc
PDF
1607.01152.pdf
PDF
FurtherInvestegationOnProbabilisticErrorBounds_final
PDF
Further investegationonprobabilisticerrorbounds final
PDF
Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...
PDF
35000120030_Aritra Kundu_Operations Research.pdf
PDF
CounterFactual Explanations.pdf
PDF
h264_publication_1
PPT
RBHF_SDM_2011_Jie
PDF
International Journal of Humanities and Social Science Invention (IJHSSI)
PDF
ABDO_MLROM_PHYSOR2016
DOCX
Trust Region Algorithm - Bachelor Dissertation
PDF
A parsimonious SVM model selection criterion for classification of real-world ...
PDF
MultiLevelROM2_Washinton
PDF
Aco
PDF
Study on Evaluation of Venture Capital Based onInteractive Projection Algorithm
PDF
Multiple Target Machine Learning Prediction of Capacity Curves of Reinforced ...
PDF
CFM Challenge - Course Project
PDF
Iaetsd protecting privacy preserving for cost effective adaptive actions
POST OPTIMALITY ANALYSIS.doc
1607.01152.pdf
FurtherInvestegationOnProbabilisticErrorBounds_final
Further investegationonprobabilisticerrorbounds final
Machine Learning Performance Evaluation: Tips and Pitfalls - Jose Hernandez O...
35000120030_Aritra Kundu_Operations Research.pdf
CounterFactual Explanations.pdf
h264_publication_1
RBHF_SDM_2011_Jie
International Journal of Humanities and Social Science Invention (IJHSSI)
ABDO_MLROM_PHYSOR2016
Trust Region Algorithm - Bachelor Dissertation
A parsimonious SVM model selection criterion for classification of real-world ...
MultiLevelROM2_Washinton
Aco
Study on Evaluation of Venture Capital Based onInteractive Projection Algorithm
Multiple Target Machine Learning Prediction of Capacity Curves of Reinforced ...
CFM Challenge - Course Project
Iaetsd protecting privacy preserving for cost effective adaptive actions
Ad

Evaluating Classifiers' Performance KDD2002

  • 1. Evaluating Classifiers’ Performance In A Constrained Environment Anna Olecka olecka@rutcor.rutgers.edu Fleet Boston Financial Database Marketing Department 1075 Main St. Waltham, MA 02451 RUTCOR Rutgers University 640 Bartholomew Road Piscataway, NJ 08854-8003 ABSTRACT In this paper, we focus on methodology of finding a classifier with a minimal cost in presence of additional performance constraints. ROCCH analysis, where accuracy and cost are intertwined in the solution space, was a revolutionary tool for two-class problems. We propose an alternative formulation, as an optimization problem, commonly used in Operations Research. This approach extends the ROCCH analysis to allow for locating optimal solutions while outside constraints are present. Similarly to the ROCCH analysis, we combine cost and class distribution while defining the objective function. Rather than focusing on slopes of the edges in the convex hull of the solution space, however, we treat cost as an objective function to be minimized over the solution space, by selecting the best performing classifier(s) (one or more vertex in the solution space). The Linear Programming framework provides a theoretical and computational methodology for finding the vertex (classifier) which minimizes the objective function. 1. INTRODUCTION Consider a problem, where classifiers’ performance has to be evaluated taking into account additional constraints related to error rates. Such constraints often arise from implementation. They could, for example, involve a limited workforce to resolve cases of suspected fraud, a limited size of a direct mail campaign, or restrictions in cost of incentives for responders. An application example used throughout this paper involves an attrition model for a bank. A bank plans a calling campaign to lower attrition rate among its customers. Naturally, one of the implementation concerns is limited availability of resources, such as phone representatives. Traditionally, a model is selected first and summarized by a modeler into several performance buckets. Then, the implementation team will “eyeball” the thresholds of the performance buckets and pick the threshold that matches constrained resources the closest. Such business practice often results in a sub-optimal solution being selected. If the constraints are known a-priori, they could be built into the system evaluating classifiers’ performance. If they are not known till the implementation time, a system analogous to the ROCCH could be built to select the best classifier at that time. To this end, we are proposing an evaluation system that can deal with additional constraints related to prediction errors. We will also show how to apply such system in a business scenario described above. Provost and Fawcett have shown in [3] that some specific metrics frequently used in Machine Learning (eg, workforce constraints, and the Neyman-Pearson decision criterion), are optimized by the ROCCH method. Optimization approach proposed here extends their results to any linear constraint. In fact, our approach can be applied to an unlimited number of constraints, as long as they remain linear. Finding numerically intersection points of such additional constraints with ROCCH can be computationally tedious. A mathematical programming approach provides efficient tools for finding optimal solutions without explicitly calculating all the intersection points. 
ROCCH analysis is a powerful and widely accepted tool for visualizing and optimizing classification problems with two classes. It plots performance of all classifiers under consideration in a two-dimensional space, with false positive rate on one axis and true positive rate on the other. It measures classifiers’ performance for various costs of misclassification and under various class distributions. It allows for visual representation of classifiers and enables quick decisions in choosing the right classifier for the given costs. The main advantage of this methodology is its flexibility under varying conditions. One drawback of this methodology is that it is somewhat rigid in looking for an optimal classifier. It forms an optimal slope by considering an optimal combination of class probabilities and error costs, and then looks for one of the two scenarios. Either an edge of the convex hull with a slope equal to the optimal one, or for a vertex between two edges where a difference between the edges slope and the optimal slope changes sign. Similarly to the ROCCH, we construct the convex hull of all classifiers in the error space. Convexity of the error space is Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGKDD ’02, July 23-26, 2002, Edmonton, Alberta, Canada. Copyright 2002 ACM 1-58113-567-X/02/0007…$5.00.
  • 2. accomplished in the following steps. First we note that a convex combination of two classifiers is also a viable classifier. Then, we remove dominated classifiers. Finally, we impose additional performance constraints if any. Those additional constraints also form a convex hull. Intersection of the two regions, is a new convex hull, the potential solution space. Finally, similarly to the ROCCH, we treat a combination of costs and class probabilities as an objective function to be minimized over the solution space. Due to existence of additional constraints, however, iterating slopes of the edges of the convex hull is no longer computationally efficient because not all vertices of the convex hull are known. The theory of linear programming provides computational tools for finding the optimal solution(s) without explicit knowledge of all vertices. 2. ROC CONVEX HULL & HYBRID CLASSIFIERS Suppose we construct a series of k-1 classifiers by varying a positive decision threshold from ”never”, and gradually increasing probability of decision “Yes”. By allowing more instances to be classified as positive, each new threshold will increase the number of correctly classified positive instances, but it may also increase frequency of false positive classification. In the space (FP rate, TP rate), each new classifier will be positioned up and to the right from the previous one. Connecting each pair of points with a line segment generates a convex set, where each vertex represents a classifier (Figure1). . Table 1. Neural net model for bank attrition Table 1 represents a set of classifiers obtained from a neural networks model for a bank attrition problem. Figure 1 shows the resulting convex set Figure 1. ROC curve for the neural net attrition model Figure 2 shows a set of classifiers obtained by varying thresholds for two logistic regression models for the attrition data. In this representation, we can visually recognize dominated classifiers. They are positioned inside the convex region, while potential candidates for optimal solution are on the boundary. It is easy to see [3], that under any cost structure, there is a classifier on the boundary, which will outperform a dominated classifier. Figure 2. Two competing logistic regression models Figure 3 shows a hybrid classifier obtained by removing dominated points. Any point positioned inside the bounded region, can be outperformed by some point on the boundary. The boundary forms a hybrid classifier, or a set of potential candidates for an optimal solution. Figure 3. ROCCH for the two logistic regression models 2.1 Remark on convexity The hybrid classifier obtained from our two logistic regression models formed a convex set in the (FP rate, TP rate) plane. In general, this doesn’t have to be the case. Figure 4 provides such an example. Point B, where the region boundary transitions from the neural networks model to the logistic model. The boundary “dips” below the line from A to C. This can be “fixed” by creating a new classifier B’ on the (A, C) line. For example, if we want a point half way between A and C, the new classifier is obtained by using classifiers A and B randomly, each with probability 0.5. Provost and Fawcett describe a similar scheme, using random sampling to create a new classifier. In general any convex combination of two classifiers becomes a new classifier. For any point X on a line segment created by classifiers π1, π2, , we can always construct a new classifier, which would correspond to such point. 
We start by describing such point as convex combination ROC surves for two logistic regression models 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.2 0.4 0.6 0.8 1 FP Rate TP Ra te Logistic2 Logistic1 Hybrid classifier for two logistic regression models 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.2 0.4 0.6 0.8 1 FP Rate TPRate Logistic2 Logistic1 Cutoff / Threshold FP rate p (Y | n) TP rate p (Y | y) Never 0 0 0.54 0.06 0.25 0.48 0.14 0.43 0.42 0.23 0.58 0.36 0.33 0.70 0.30 0.42 0.82 0.24 0.53 0.90 Always 1 1 ROC Curve for Attrition Data 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 FP Rate TPRate
  • 3. X=απ1+ (1−α)π2 where 0 ≤ α ≤1 Then, we randomly partition the population being classified into 2 groups in proportions α, 1- α and apply an appropriate model for each group. Figure 4. Boundary of ROC space for two classifiers is not always convex A new, convex boundary for the attrition problem is shown in Figure 5. Figure 5. Convexity is obtained by creating convex combinations of existing classifiers 3. BASIC TERMINOLOGY The following terminology is used throughout the remaining sections. − Two classes: positive (y) and negative (n) with probabilities respectively p and 1-p − Classification decision: positive (Y) and negative (N) − FP, TP, FN, TN represent number of instances of each kind: false positive, true positive, false negative and true negative respectively − Rates of those instances are represented as follows: FP_rate = p(Y|n) TP_rate = p(Y|y) FN_rate = p (N|y) TN_rate = p(N|n) − In the ROC space, the horizontal axis represents FP_rate, and the vertical axis represents TP_rate. We will use x to denote FP_rate and y to denote TP_rate − Unit cost of a false positive error = c(Y|n) − Unit cost of a false negative error = c(N|y) In defining the cost, we need to start with the expected number of errors and transform the resulting formula into the ROC space terms, where the variables are error rates. The following scheme visualizes interdependencies between terms. Table 2. Two class classification scheme Let M be the total number of instances being classified. For given p, x and y we can calculate the expected number of classified cases of each kind as follows. TP = M*p*y FN = M*p*(1-y) FP = M* (1-p)*x TN = M*(1-p)*(1-x) 4. COST FUNCTION We now apply the cost function to the attrition problem and look for a classifier on the boundaries of the convex hull, which will minimize the total cost. A false negative classification means that we won’t recognize an attriting customer and loose the account. The cost of loosing a customer is tied to net income after taxes (NIAT) this customer brings to the bank. In addition, both types of error generate labor costs related preventive action. The line of business decided to assign error costs as shown in Table3. Table 3. Misclassification costs assignment Given cost of a false negative error c(N | y), and a false positive error c(Y | n), the total expected cost EC = c(Y | n) * E(FP) + c(N | y)* E(FN) Where E(FN) is the expected number of false negatives and E(FP) is the expected number of false positives. ROC Curves for Logistic and Neural Nets models C B A 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.2 0.4 0.6 0.8 1 FP Rate TPRate Logistic2 Neural Net 1 False positive c(Y|n) False negative c(N|y) 30$ 2,575$ Error Cost per unit Two class classification schem e Population M Class n (1-p)*M Class y p*M Decision Y (TP) Decision N (FN) Decision Y (FP) Decision N (TN) FN rate 1-y TP rate y TN rate 1-x FP rate x p 1-p Hybrid classifier for Logistic2 and Neural Nets models B' C A 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.2 0.4 0.6 0.8 1 FP Rate TPRate Logistic2 Neural Net 1
  • 4. We want to minimize the expected total cost in term of decision variables x and y. Minimize EC = c(Y | n) * E(FP) + c(N | y)* E(FN) = c(Y | n) * (1-p) * M * x + c(N | y) * p* M * (1 – y ) = c(Y | n) * (1-p) * M * x + c(N | y) * p * M - c(N | y)*p*M*y After subtracting the constant (not dependent on the decision variables) term and dividing by M, this is equivalent to: Minimize (EC - c(N | y)*p )/M = = c(Y | n)*(1-p)*x - c(N | y)*p*y This is equivalent to maximizing the opposite function Maximize C’ = - c(Y | n)*(1-p)*x + c(N | y)*p*y C’ in space (x,y) is a collection of parallel lines with slopes depending on misclassification costs and the a-priori probability of the positive class. In the attrition example we are analyzing, p=3.75%. We are now ready to calculate slopes for the cost lines. m = [c(Y | n)*(1-p)] / [c(N | y)*p] = = 30*(1-0.0375) / (2575*0.0375) = 0.3 Intercepts of those lines vary with the position of each line on the plane. The higher the position of the line on the plane in relation to y, the higher will the value of C’ be. Since our objective is to maximize C’, we can start at the bottom of the convex hull and move the line up, as far as possible. Each point of the region touching the line in a given position has the same cost. This defines iso-performing lines. We want to find a point in the convex region that the line touches in its highest position possible. Figure 6 shows the convex hull for the attrition problem, and cost function moving up through the region. The highest position of the cost line is obtained at vertex A. The theory of Linear Programming shows, that for a convex and bounded region, the objective function is maximized either at one of the vertices, or on a line segment joining two vertices Bazaara [1]. Thus, we can always pick one or more best performing classifiers. Figure 6. Cost functions traversing the ROC convex hull As in the ROCCH analysis, we have iso-performing lines, which help visualize performance of classifiers under various cost structures. In the Linear Programming setting, however, the feasible region can incorporate any additional constraints. We are gaining flexibility and control. 5. ADDING ADDITIONAL CONSTRAINTS In the attrition example, some implicit constraints are already built into the ROC space. Error rates are non-negative and cannot exceed 1. Those constraints bound the convex set and guarantee existence of an optimal solution. In addition, a planned calling campaign is limited by customer service resources availability. A phone call scenario included approaching a customer to find out if indeed their intention was to leave the bank, and to attempt to entice them to stay. Naturally, there are limited resources bank can devote to this undertaking. Traditionally, modelers would select the best model, and any additional constraints would be imposed at the implementation time, based on the pre-selected model. But if a modeler is aware of the constraints, they can be built into the system evaluating model’s performance. The line of business, for which these classifiers were developed, determined that they could handle calling 20% of their customer base. This resulted in the constraint FP + TP ≤ 0.2*M In order to plot the constraint in the (x, y) error rate space, we need to formulate it in terms of the decision variables x and y. (1-p)*x - p*y ≤ 0.2 Geometrically, this inequality is represented by a half plane in the (x,y) space. Figure 7 shows a case of the attrition problem with an additional constraint. 
The new convex set is the intersection of the original one with the half plane. The feasible region is now an intersection of the previous convex set with the new one: the area to the left (down) of the dotted line. Figure 7. The capacity constraint bounds the feasible region Our optimal classifier is no longer feasible, since it is outside of the new feasible region. We need to “slide” the cost line back down to the feasible region. It will touch the feasible region at the point where the hybrid classifier intersects the constraint line. As noted earlier, this point A(.64,.96) Optimal solution 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 FP rate TP rate Attrition model with workforce capacity constraints 0 0.2 0.4 0.6 0.8 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 FP rate TPrate Hybrid classifier Capacity constraint
  • 5. defines a new classifier. The new classifier is now optimal, as shown in Figure 8. Figure 8. New optimal solution In the next section we show how to find the new vertex. Here, we will point out an actual implementation of the new solution as outlined by Fawcett and Provost in [3]. Assume that the intersection point divides the line segment between A and B at ratio α. To obtain the new classifier, we can proceed as follows. 1. With probability α use classifier A 2. With probability 1- α use classifier B If A and B were obtained from the same model by varying the decision threshold, this process can be simplified by finding an appropriate threshold between A and B. 6. LINEAR PROGRAMMING FORMULATION A linear program is a constrained optimization problem, where the objective function, as well as all constraints, are linear. We need to select values for all decision variables so that all constraints are satisfied and the objective function is minimized (or maximized). In this case, we need to pick a point(s) in the ROC space which will minimize the cost function (or maximize the modified cost function). Decision variables are then the coordinates (x,y) of points in the ROC space. Some of the constraints arise from the classifiers’ performance. In the ROC space, points under consideration need to be below the boundary of the convex region. Additional constraints are usually related to implementation and/or quality issues. Finally, we have non-negativity constraints and bounding constraints, since the ROC variables have to be between 0 and 1. A canonical form of a linear (maximization) program takes the following format. Maximize C = ∑j cj x j Subject to ∑j aij xj ≤bi i = 1,…, k. xj ≥0 j = 1,…, d The theory of linear programming assures us, that if the set of constraints forms a convex and bounded set (called the feasible region), then an optimal solution is found on the boundaries of the feasible region [1]. In our - two dimensional - case, a solution can be found at a vertex, or on an edge joining two vertices. Note that the feasible set is created as a conjunction of several linear inequalities. Not all vertices are known explicitly. A number of computational techniques have been designed to find optimal solutions without explicitly iterating over the vertices. More details can be found in [1] and [2]. The theory of Linear Programming also aids an analysis of a solution found, if we want to play some “what if” scenarios. At a point of optimality, some constraints will be satisfied as “binding”. That is, the left-hand-side will be equal to the right hand side. Others will be satisfied as an inequality, leaving “slack”, or room for improvement. In a two-dimensional case, if two constraints are found binding at an optimal solution, the classifier at the intersection of those constraints is optimal. An analysis of slack at neighboring points (called marginal analysis or sensitivity analysis), often provides insight into alternative solutions and how close they are to optimality. All commercially available optimization packages, including a module that comes with Excel, will provide slack information for all constraints. We are showing an example of Excel sensitivity analysis report in the Appendix. In that example, the capacity constraint, which was based on workforce availability, is binding. That means that any further improvement of classifier’s performance will require additional workforce resources. Additional resources will provide slack on the capacity constraint. 
This will allow the optimal solution to move along the convex hull boundary in a direction improving the objective function. We will now formulate the problem of looking for optimal classifier in the presence of additional constraints, as a linear programming problem. We already have a formal representation of the objective function. Maximize C’(x, y) = - c(Y | n)*(1-p)*x + c(N | y)*p*y Now we need to formulate the set of constraints related to the classifiers and add a set of additional, bounding and non- negativity constraints. Given two classifiers Pi and Pj on the boundary, with error rates (xj, yj) and (xi, yi). Assume that xi ≤ xj .A line segment through points Pi, Pj has the slope m = (yj – yi)/ (xj – xi) and the equation y – yi = m*( x – xi) The feasible region is positioned below the line, so it is determined by a collection of inequalities y – yi ≤ ((yj – yi)/ (xj – xi) )*( x – xi) which can be rearranged as - (yj – yi) x + (xj – xi) y ≤ - xi (yj – yi) + yi (xj – xi) Optimal solution change under capacity constraints New optimal solution 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 FP rate TPrate
  • 6. The optimization problem can now be defined as follows. Given classifiers Pi with error rates (xi, yi), i = 1,2,…k, such that xi ≤ xi+1 Maximize C’(x, y) = - c(Y | n)*(1-p)*x + c(N | y)*p*y Such that a1 ij x + a2 ij y ≤ bij for i = 1,2,…k; j = i +1 π1 l x + π2 l y ≤ ql l = 1,2, l0 0 ≤x ≤1 0 ≤y ≤1 where a1 ij = - (yj – yi) , a2 ij = (xj – xi) bij = - xi (yj – yi) + yi (xj – xi) and π1 l , π2 l and ql form additional constraints 6.1 Computational considerations In the absence of additional constraints, all vertices are known a priori. In such case, it is computationally efficient, to explicitly calculate expected cost of each classifier and choose the one with the smallest cost, rather than actually slide the lines over the feasible region. When additional constraints are introduced, the situation changes. Finding new vertices, formed by the additional intersection points, can get cumbersome. Fortunately, the theory and practice of mathematical programming provides efficient algorithms for finding optimal solutions without a need to iterate over all vertices. There are also efficient solvers on the market, which can solve optimization problems. An add-on to Excel provides one such solver. It is efficient for problems in small dimensions as the ones described here, and it is available in almost any business setting. Figure 9 shows the original attrition problem, without the additional constraint, solved in Excel. Spreadsheet cell labeled Cost contains the cost formula. Value of that cell is maximized, by changing decision variables x and y, as long as all constraints are satisfied. Point (0.64, 0.96) minimizes the objective function. Figure 9. Excel optimizer solves the optimization problem Slopes of line segments on the boundaries are calculated in column 3. Note that at point (0.64, 0.96), slope changes from 0.55 to 0. 21. As noted earlier, slope of the cost line is 0.3. So point (0.64, 0.96) would have been selected as optimal by the traditional ROCCH analysis as well. Figure 10 shows a new solution, after the additional capacity constraint was added. The new solution is (0.15, 0.44). Figure 10. Excel solution incorporating the capacity constraint 6.2 Note on practical considerations For all practical purposes, a business setting often prefers speed and expediency of delivery, to an optimal solution, as classifier’s performance is near optimal. In case of the attrition problem, we were lucky, in that one of the existing classifiers (0.14, 0.43) was in close proximity of the optimal solution (0.15, 0.44) and still within the feasible region. A selection of near optimal solution was, in this case, a simple decision because this was a relatively simple problem with just one additional constraint. 6.3 New performance constraints While analyzing the new optimal solution, we notice that the true positive rate is 0.44. In other words, out of all instances of the positive class, only 44% (less than a half) is classified correctly. This is not a desirable performance. 54% of defecting customers remain unrecognized and without being contacted will leave the bank. Solution analysis (see Appendix) shows that the capacity constraint is binding at the optimal point. We cannot increase the True Positive rate y, without leaving the feasible region and violating the capacity constraint. Any further improvements in the attritors’ recognition rate will require additional resources so that the capacity constraint can be relaxed. 
The optimization template set up in Excel allowed us to play a number of "what if" scenarios. In a final compromise, the line of business decided to relax the capacity constraint to allow for 30% of the population to be classified as positive. In return, we added a new "missed opportunity" constraint, which required that the true positive rate be no less than 60%.
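Reusing solve_roc_lp, points, and p from the sketch in Section 6.1, the compromise scenario can be checked directly; the true positive floor y ≥ 0.6 is rewritten as -y ≤ -0.6 to fit the ≤ form of the constraints:

```python
# Relaxed capacity (<= 0.30) plus the missed-opportunity constraint y >= 0.60.
res2 = solve_roc_lp(points, 30.0, 2575.0, p,
                    extra_A=[[1 - p, p], [0.0, -1.0]],
                    extra_b=[0.30, -0.60])
print(res2.x, -res2.fun)   # roughly (0.29, 0.65) and 54, as in Figure 12
```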
The optimal solution for this problem with two additional constraints is shown in Figure 11. The feasible region is now the triangle contained between three lines: the classifier constraint, the performance constraint y ≥ 0.6, and the relaxed capacity constraint (1-p)*x + p*y ≤ 0.3.

Figure 11. Feasible region with two additional constraints

[Figure 11. "Attrition model with capacity and performance constraints": ROC plot (FP rate vs. TP rate) of the triangular feasible region.]

The optimal solution (0.29, 0.65) found by the Excel optimizer is shown in Figure 12.

Figure 12. Excel solution with two additional constraints

[Figure 12 shows the spreadsheet with both additional constraints: the relaxed capacity constraint (1-p)*x + p*y ≤ 0.30 is binding at 0.30, the missed-opportunity constraint y ≥ 0.60 holds with y = 0.65, and the decision variables are x = 0.29, y = 0.65, with objective value 54.39.]

Other examples of frequently used performance constraints could include:
− limited size of a direct mail campaign
− restrictions in processing capacity for responders to a campaign
− limited expense of incentives for responders
− limited capacity of response processing systems (indirectly limiting a campaign size)

In each case, we need to express the constraints in terms of the decision variables, creating a new convex hull to intersect with the one created by the set of classifiers. Note that in this formulation we do not need to know the intersection points of the ROCCH with the constraints' convex hull: the optimal solution can be found with no explicit knowledge of the vertices of the combined convex hull.

7. CONCLUSION

Linear Programming is part of a broad field of constrained optimization called Mathematical Programming. Mathematical Programming is a rich discipline whose applications have grown fast over the past several decades. Once a problem has been framed as a mathematical programming problem, we gain a powerful tool and, more importantly, can draw on that field's rich legacy. Future developments may, for example, include multi-dimensional problems, where a classification problem has more than two classes. Computational difficulties grow quickly with the problem's dimension, but the techniques and methods of Linear Programming can help overcome these computational issues. We could also consider problems involving a non-linear cost function; the theory of Non-linear Programming may guide us in minimizing non-linear costs, such as piecewise linear or quadratic ones. Other areas of pursuit include methods to deal with classifier and/or constraint uncertainty: Stochastic Programming methods could perhaps be employed to find optimal solutions when the standard errors of classifiers need to be taken into account.

At the same time we do not lose the main benefits of ROCCH. We maintain the benefit of visualization, except that the cost function now "slides" along the convex hull. The benefit of being able to choose the optimal classifier at run time, if constraints change, can be maintained as well. The selection process would proceed as follows:
- Input new costs and constraints
- Intersect the constraints space with the ROC space
- Input all competing points
- Solve the optimization problem

8. ACKNOWLEDGEMENTS

Thanks to all who helped to inspire, develop and formulate the above thoughts. In particular, my gratitude goes to Endre Boros, Stan Matwin and Vera Helman, who generously shared their knowledge and experience and provided feedback throughout the various stages of this work.

9. REFERENCES

[1] Bazaraa, Mokhtar S., et al. Linear Programming and Network Flows. John Wiley & Sons, Inc., 1997.
[2] Murty, Katta G. Operations Research: Deterministic Optimization Models. Prentice Hall, Inc., 1995.
[3] Provost, F., Fawcett, T. Robust Classification Systems for Imprecise Environments. Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), 1998.
[4] Provost, F., Fawcett, T. Robust Classification for Imprecise Environments. Machine Learning, Vol. 42, No. 3, 2001, pp. 203–231.
10. APPENDIX

Answer report and sensitivity report produced by the Excel optimizer for the attrition problem with the capacity constraint.

Microsoft Excel 8.0e Answer Report
Worksheet: [ROCCH1.xls]Optimization Report
Created: 2/28/2002 6:16:12 PM

Target Cell (Max)
Cell    Name   Original Value   Final Value
$D$4    Max    0.00             38.33

Adjustable Cells
Cell    Name   Original Value   Final Value
$E$4    x      0.00             0.15
$F$4    y      0.00             0.44

Constraints
Cell     Name            Cell Value   Formula         Status        Slack
$D$7     - mx + y         0.00        $D$7<=$F$7      Binding       0
$D$8     - mx + y        -0.14        $D$8<=$F$8      Not Binding   0.14
$D$9     - mx + y         0.09        $D$9<=$F$9      Not Binding   0.01
$D$10    - mx + y         0.19        $D$10<=$F$10    Binding       0
$D$11    - mx + y         0.24        $D$11<=$F$11    Not Binding   0.02
$D$12    - mx + y         0.25        $D$12<=$F$12    Not Binding   0.04
$D$13    - mx + y         0.33        $D$13<=$F$13    Not Binding   0.19
$D$14    - mx + y         0.36        $D$14<=$F$14    Not Binding   0.25
$D$15    - mx + y         0.41        $D$15<=$F$15    Not Binding   0.41
$D$16    - mx + y         0.43        $D$16<=$F$16    Not Binding   0.49
$D$20    (1-p)*x + p*y    0.16        $D$20<=$F$20    Binding       0
$F$4     y                0.44        $F$4<=1         Not Binding   0.56
$E$4     x                0.15        $E$4<=1         Not Binding   0.85

Microsoft Excel 8.0e Sensitivity Report
Worksheet: [ROCCH1.xls]Optimization Report
Created: 2/28/2002 6:20:01 PM

Adjustable Cells
                 Final   Reduced   Objective     Allowable   Allowable
Cell    Name     Value   Cost      Coefficient   Increase    Decrease
$E$4    x        0.15    0         -28.9         2507.31     131.01
$F$4    y        0.44    0         96.6          1E+30       79.12

Constraints
                          Final   Shadow   Constraint   Allowable   Allowable
Cell     Name             Value   Price    R.H. Side    Increase    Decrease
$D$7     - mx + y          0.00   0        0            1E+30       0
$D$8     - mx + y         -0.14   0        0            1E+30       0.14
$D$9     - mx + y          0.09   0        0.10         1E+30       0.01
$D$10    - mx + y          0.19   91.77    0.19         0.01        0.47
$D$11    - mx + y          0.24   0        0.26         1E+30       0.02
$D$12    - mx + y          0.25   0        0.29         1E+30       0.04
$D$13    - mx + y          0.33   0        0.52         1E+30       0.19
$D$14    - mx + y          0.36   0        0.61         1E+30       0.25
$D$15    - mx + y          0.41   0        0.82         1E+30       0.41
$D$16    - mx + y          0.43   0        0.92         1E+30       0.49
$D$20    (1-p)*x + p*y     0.16   127.87   0.16         0.08        0.01
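The shadow prices above can also be recovered as the dual values of an LP solver. Continuing the scipy sketch from Section 6.1 (assuming a scipy version whose HiGHS interface exposes marginals on the inequality constraints; the sign is flipped because the sketch minimizes -C'):

```python
# res is the capacity-constrained solution from the Section 6.1 sketch.
duals = -res.ineqlin.marginals   # one dual value per A_ub row
print(duals[-1])   # capacity constraint: roughly 128 (Excel reports 127.87)
```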