Combining Pattern Classifiers: Methods and Algorithms. Ludmila I. Kuncheva
Copyright © 2004 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form
or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as
permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior
written permission of the Publisher, or authorization through payment of the appropriate per-copy fee
to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400,
fax 978-646-8600, or on the web at www.copyright.com. Requests to the Publisher for permission
should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street,
Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts
in preparing this book, they make no representations or warranties with respect to the accuracy or
completeness of the contents of this book and specifically disclaim any implied warranties of
merchantability or fitness for a particular purpose. No warranty may be created or extended by sales
representatives or written sales materials. The advice and strategies contained herein may not be
suitable for your situation. You should consult with a professional where appropriate. Neither the
publisher nor author shall be liable for any loss of profit or any other commercial damages, including
but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services please contact our Customer Care
Department within the U.S. at 877-762-2974, outside the U.S. at 317-572-3993 or fax 317-572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print,
however, may not be available in electronic format.
Library of Congress Cataloging-in-Publication Data:
Kuncheva, Ludmila I. (Ludmila Ilieva), 1959–
Combining pattern classifiers: methods and algorithms/Ludmila I. Kuncheva.
p. cm.
“A Wiley-Interscience publication.”
Includes bibliographical references and index.
ISBN 0-471-21078-1 (cloth)
1. Pattern recognition systems. 2. Image processing–Digital techniques. I. Title.
TK7882.P3K83 2004
006.4–dc22
2003056883
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
Contents
Preface xiii
Acknowledgments xvii
Notation and Acronyms xix
1 Fundamentals of Pattern Recognition 1
1.1 Basic Concepts: Class, Feature, and Data Set / 1
1.1.1 Pattern Recognition Cycle / 1
1.1.2 Classes and Class Labels / 3
1.1.3 Features / 3
1.1.4 Data Set / 5
1.2 Classifier, Discriminant Functions, and Classification Regions / 5
1.3 Classification Error and Classification Accuracy / 8
1.3.1 Calculation of the Error / 8
1.3.2 Training and Testing Data Sets / 9
1.3.3 Confusion Matrices and Loss Matrices / 10
1.4 Experimental Comparison of Classifiers / 12
1.4.1 McNemar and Difference of Proportion Tests / 13
1.4.2 Cochran’s Q Test and F-Test / 16
1.4.3 Cross-Validation Tests / 18
1.4.4 Experiment Design / 22
1.5 Bayes Decision Theory / 25
1.5.1 Probabilistic Framework / 25
1.5.2 Normal Distribution / 26
1.5.3 Generate Your Own Data / 27
1.5.4 Discriminant Functions and Decision Boundaries / 30
1.5.5 Bayes Error / 32
1.5.6 Multinomial Selection Procedure for Comparing Classifiers / 34
1.6 Taxonomy of Classifier Design Methods / 35
1.7 Clustering / 37
Appendix 1A K-Hold-Out Paired t-Test / 39
Appendix 1B K-Fold Cross-Validation Paired t-Test / 40
Appendix 1C 5 × 2cv Paired t-Test / 41
Appendix 1D 500 Generations of Training/Testing Data and
Calculation of the Paired t-Test Statistic / 42
Appendix 1E Data Generation: Lissajous Figure Data / 42
2 Base Classifiers 45
2.1 Linear and Quadratic Classifiers / 45
2.1.1 Linear Discriminant Classifier / 45
2.1.2 Quadratic Discriminant Classifier / 46
2.1.3 Using Data Weights with a Linear Discriminant Classifier
and Quadratic Discriminant Classifier / 47
2.1.4 Regularized Discriminant Analysis / 48
2.2 Nonparametric Classifiers / 50
2.2.1 Multinomial Classifiers / 51
2.2.2 Parzen Classifier / 54
2.3 The k-Nearest Neighbor Rule / 56
2.3.1 Theoretical Background / 56
2.3.2 Finding k-nn Prototypes / 59
2.3.3 k-nn Variants / 64
2.4 Tree Classifiers / 68
2.4.1 Binary Versus Nonbinary Splits / 71
2.4.2 Selection of the Feature for a Node / 71
2.4.3 Stopping Criterion / 74
2.4.4 Pruning Methods / 77
2.5 Neural Networks / 82
2.5.1 Neurons / 83
2.5.2 Rosenblatt’s Perceptron / 85
2.5.3 MultiLayer Perceptron / 86
2.5.4 Backpropagation Training of MultiLayer Perceptron / 89
Appendix 2A Matlab Code for Tree Classifiers / 95
Appendix 2B Matlab Code for Neural Network Classifiers / 99
3 Multiple Classifier Systems 101
3.1 Philosophy / 101
3.1.1 Statistical / 102
3.1.2 Computational / 103
3.1.3 Representational / 103
3.2 Terminologies and Taxonomies / 104
3.2.1 Fusion and Selection / 106
3.2.2 Decision Optimization and Coverage Optimization / 106
3.2.3 Trainable and Nontrainable Ensembles / 107
3.3 To Train or Not to Train? / 107
3.3.1 Tips for Training the Ensemble / 107
3.3.2 Idea of Stacked Generalization / 109
3.4 Remarks / 109
4 Fusion of Label Outputs 111
4.1 Types of Classifier Outputs / 111
4.2 Majority Vote / 112
4.2.1 Democracy in Classifier Combination / 112
4.2.2 Limits on the Majority Vote Accuracy: An Example / 116
4.2.3 Patterns of Success and Failure / 117
4.3 Weighted Majority Vote / 123
4.4 Naive Bayes Combination / 126
4.5 Multinomial Methods / 128
4.5.1 Behavior Knowledge Space Method / 128
4.5.2 Wernecke’s Method / 129
4.6 Probabilistic Approximation / 131
4.6.1 Calculation of the Probability Estimates / 134
4.6.2 Construction of the Tree / 135
4.7 Classifier Combination Using Singular Value Decomposition / 140
4.8 Conclusions / 144
Appendix 4A Matan’s Proof for the Limits on the Majority
Vote Accuracy / 146
Appendix 4B Probabilistic Approximation of the
Joint pmf for Class-Label Outputs / 148
5 Fusion of Continuous-Valued Outputs 151
5.1 How Do We Get Probability Outputs? / 152
5.1.1 Probabilities Based on Discriminant Scores / 152
5.1.2 Probabilities Based on Counts: Laplace Estimator / 154
5.2 Class-Conscious Combiners / 157
5.2.1 Nontrainable Combiners / 157
5.2.2 Trainable Combiners / 163
5.3 Class-Indifferent Combiners / 170
5.3.1 Decision Templates / 170
5.3.2 Dempster–Shafer Combination / 175
5.4 Where Do the Simple Combiners Come from? / 177
5.4.1 Conditionally Independent Representations / 177
5.4.2 A Bayesian Perspective / 179
5.4.3 The Supra Bayesian Approach / 183
5.4.4 Kullback–Leibler Divergence / 184
5.4.5 Consensus Theory / 186
5.5 Comments / 187
Appendix 5A Calculation of λ for the Fuzzy Integral
Combiner / 188
6 Classifier Selection 189
6.1 Preliminaries / 189
6.2 Why Classifier Selection Works / 190
6.3 Estimating Local Competence Dynamically / 192
6.3.1 Decision-Independent Estimates / 192
6.3.2 Decision-Dependent Estimates / 193
6.3.3 Tie-Break for Classifiers with Equal Competences / 195
6.4 Preestimation of the Competence Regions / 196
6.4.1 Clustering / 197
6.4.2 Selective Clustering / 197
6.5 Selection or Fusion? / 199
6.6 Base Classifiers and Mixture of Experts / 200
7 Bagging and Boosting 203
7.1 Bagging / 203
7.1.1 Origins: Bagging Predictors / 203
7.1.2 Why Does Bagging Work? / 204
7.1.3 Variants of Bagging / 207
7.2 Boosting / 212
7.2.1 Origins: Algorithm Hedge(b) / 212
7.2.2 AdaBoost Algorithm / 215
7.2.3 arc-x4 Algorithm / 215
7.2.4 Why Does AdaBoost Work? / 217
7.2.5 Variants of Boosting / 221
7.3 Bias-Variance Decomposition / 222
7.3.1 Bias, Variance, and Noise of the Classification Error / 223
7.3.2 Decomposition of the Error / 226
7.3.3 How Do Bagging and Boosting Affect Bias and Variance? / 229
7.4 Which Is Better: Bagging or Boosting? / 229
Appendix 7A Proof of the Error Bound on the Training Set for
AdaBoost (Two Classes) / 230
Appendix 7B Proof of the Error Bound on the Training Set for
AdaBoost (c Classes) / 234
8 Miscellanea 237
8.1 Feature Selection / 237
8.1.1 Natural Grouping / 237
8.1.2 Random Selection / 237
8.1.3 Nonrandom Selection / 238
8.1.4 Genetic Algorithms / 240
8.1.5 Ensemble Methods for Feature Selection / 242
8.2 Error Correcting Output Codes / 244
8.2.1 Code Designs / 244
8.2.2 Implementation Issues / 247
8.2.3 Error Correcting Output Codes, Voting, and
Decision Templates / 248
8.2.4 Soft Error Correcting Output Code Labels and
Pairwise Classification / 249
8.2.5 Comments and Further Directions / 250
8.3 Combining Clustering Results / 251
8.3.1 Measuring Similarity Between Partitions / 251
8.3.2 Evaluating Clustering Algorithms / 254
8.3.3 Cluster Ensembles / 257
Appendix 8A Exhaustive Generation of Error Correcting
Output Codes / 264
Appendix 8B Random Generation of Error Correcting
Output Codes / 264
Appendix 8C Model Explorer Algorithm for Determining
the Number of Clusters c / 265
9 Theoretical Views and Results 267
9.1 Equivalence of Simple Combination Rules / 267
9.1.1 Equivalence of MINIMUM and MAXIMUM Combiners
for Two Classes / 267
9.1.2 Equivalence of MAJORITY VOTE and MEDIAN
Combiners for Two Classes and Odd Number
of Classifiers / 268
9.2 Added Error for the Mean Combination Rule / 269
9.2.1 Added Error of an Individual Classifier / 269
9.2.2 Added Error of the Ensemble / 273
9.2.3 Relationship Between the Individual Outputs’ Correlation
and the Ensemble Error / 275
9.2.4 Questioning the Assumptions and Identifying
Further Problems / 276
9.3 Added Error for the Weighted Mean Combination / 279
9.3.1 Error Formula / 280
9.3.2 Optimal Weights for Independent Classifiers / 282
9.4 Ensemble Error for Normal and Uniform Distributions
of the Outputs / 283
9.4.1 Individual Error / 287
9.4.2 Minimum and Maximum / 287
9.4.3 Mean / 288
9.4.4 Median and Majority Vote / 288
9.4.5 Oracle / 290
9.4.6 Example / 290
10 Diversity in Classifier Ensembles 295
10.1 What Is Diversity? / 295
10.1.1 Diversity in Biology / 296
10.1.2 Diversity in Software Engineering / 298
10.1.3 Statistical Measures of Relationship / 298
10.2 Measuring Diversity in Classifier Ensembles / 300
10.2.1 Pairwise Measures / 300
10.2.2 Nonpairwise Measures / 301
10.3 Relationship Between Diversity and Accuracy / 306
10.3.1 Example / 306
10.3.2 Relationship Patterns / 309
10.4 Using Diversity / 314
10.4.1 Diversity for Finding Bounds and Theoretical
Relationships / 314
10.4.2 Diversity for Visualization / 315
10.4.3 Overproduce and Select / 315
10.4.4 Diversity for Building the Ensemble / 322
10.5 Conclusions: Diversity of Diversity / 322
Appendix 10A Equivalence Between the Averaged Disagreement
Measure Dav and Kohavi–Wolpert Variance
KW / 323
Appendix 10B Matlab Code for Some Overproduce
and Select Algorithms / 325
References 329
Index 347
Preface
Everyday life throws at us an endless number of pattern recognition problems:
smells, images, voices, faces, situations, and so on. Most of these problems we
solve at a sensory level or intuitively, without an explicit method or algorithm.
As soon as we are able to provide an algorithm the problem becomes trivial and
we happily delegate it to the computer. Indeed, machines have confidently replaced
humans in many formerly difficult or impossible, now just tedious pattern recog-
nition tasks such as mail sorting, medical test reading, military target recognition,
signature verification, meteorological forecasting, DNA matching, fingerprint
recognition, and so on.
In the past, pattern recognition focused on designing single classifiers. This book
is about combining the “opinions” of an ensemble of pattern classifiers in the hope
that the new opinion will be better than the individual ones. “Vox populi, vox Dei.”
The field of combining classifiers is like a teenager: full of energy, enthusiasm,
spontaneity, and confusion; undergoing quick changes and obstructing the attempts
to bring some order to its cluttered box of accessories. When I started writing this
book, the field was small and tidy, but it has grown so rapidly that I am faced
with the Herculean task of cutting out a (hopefully) useful piece of this rich,
dynamic, and loosely structured discipline. This will explain why some methods
and algorithms are only sketched, mentioned, or even left out and why there is a
chapter called “Miscellanea” containing a collection of important topics that I
could not fit anywhere else.
The book is not intended as a comprehensive survey of the state of the art of the
whole field of combining classifiers. Its purpose is less ambitious and more practical:
to expose and illustrate some of the important methods and algorithms. The majority
of these methods are well known within the pattern recognition and machine
learning communities, albeit scattered across various literature sources and dis-
guised under different names and notations. Yet some of the methods and algorithms
in the book are less well known. My choice was guided by how intuitive, simple, and
effective the methods are. I have tried to give sufficient detail so that the methods can
be reproduced from the text. For some of them, simple Matlab code is given as well.
The code is not foolproof nor is it optimized for time or other efficiency criteria. Its
sole purpose is to enable the reader to experiment. Matlab was seen as a suitable
language for such illustrations because it often looks like executable pseudocode.
I have refrained from making strong recommendations about the methods and
algorithms. The computational examples given in the book, with real or artificial
data, should not be regarded as a guide for preferring one method to another. The
examples are meant to illustrate how the methods work. Making an extensive experi-
mental comparison is beyond the scope of this book. Besides, the fairness of such a
comparison rests on the conscientiousness of the designer of the experiment. J.A.
Anderson says it beautifully [89]
There appears to be imbalance in the amount of polish allowed for the techniques.
There is a world of difference between “a poor thing – but my own” and “a poor
thing but his”.
The book is organized as follows. Chapter 1 gives a didactic introduction into the
main concepts in pattern recognition, Bayes decision theory, and experimental com-
parison of classifiers. Chapter 2 contains methods and algorithms for designing the
individual classifiers, called the base classifiers, to be used later as an ensemble.
Chapter 3 discusses some philosophical questions in combining classifiers includ-
ing: “Why should we combine classifiers?” and “How do we train the ensemble?”
Being a quickly growing area, combining classifiers is difficult to put into unified
terminology, taxonomy, or a set of notations. New methods appear that have to
be accommodated within the structure. This makes it look like patchwork rather
than a tidy hierarchy. Chapters 4 and 5 summarize the classifier fusion methods
when the individual classifiers give label outputs or continuous-value outputs.
Chapter 6 is a brief summary of a different approach to combining classifiers termed
classifier selection. The two most successful methods for building classifier ensem-
bles, bagging and boosting, are explained in Chapter 7. A compilation of topics is
presented in Chapter 8. We talk about feature selection for the ensemble, error-cor-
recting output codes (ECOC), and clustering ensembles. Theoretical results found in
the literature are presented in Chapter 9. Although the chapter lacks coherence, it
was considered appropriate to put together a list of such results along with the details
of their derivation. The need of a general theory that underpins classifier combi-
nation has been acknowledged regularly, but such a theory does not exist as yet.
The collection of results in Chapter 9 can be regarded as a set of jigsaw pieces await-
ing further work. Diversity in classifier combination is a controversial issue. It is a
necessary component of a classifier ensemble and yet its role in the ensemble per-
formance is ambiguous. Little has been achieved by measuring diversity and
employing it for building the ensemble. In Chapter 10 we review the studies in
diversity and join in the debate about its merit.
The book is suitable for postgraduate students and researchers in mathematics,
statistics, computer science, engineering, operations research, and other related dis-
ciplines. Some knowledge of the areas of pattern recognition and machine learning
will be beneficial.
The quest for a structure and a self-identity of the classifier combination field will
continue. Take a look at any book for teaching introductory statistics; there is hardly
much difference in the structure and the ordering of the chapters, even the sections
and subsections. Compared to this, the classifier combination area is pretty chaotic.
Curiously enough, questions like “Who needs a taxonomy?!” were raised at the
Discussion of the last edition of the International Workshop on Multiple Classifier
Systems, MCS 2003. I believe that we do need an explicit and systematic description
of the field. How otherwise are we going to teach the newcomers, place our own
achievements in the bigger framework, or keep track of the new developments?
This book is an attempt to tidy up a piece of the classifier combination realm,
maybe just the attic. I hope that, among the antiques, you will find new tools and,
more importantly, new applications for the old tools.
LUDMILA I. KUNCHEVA
Bangor, Gwynedd, United Kingdom
September 2003
Acknowledgments
Many thanks to my colleagues from the School of Informatics, University of Wales,
Bangor, for their support throughout this project. Special thanks to my friend Chris
Whitaker with whom I have shared publications, bottles of wine, and many inspiring
and entertaining discussions on multiple classifiers, statistics, and life. I am indebted
to Bob Duin and Franco Masulli for their visits to Bangor and for sharing with me
their time, expertise, and exciting ideas about multiple classifier systems. A great
part of the material in this book was inspired by the Workshops on Multiple Classi-
fier Systems held annually since 2000. Many thanks to Fabio Roli, Josef Kittler, and
Terry Windeatt for keeping these workshops alive and fun. I am truly grateful to my
husband Roumen and my daughters Diana and Kamelia for their love, patience, and
understanding.
L. I. K.
Notation and Acronyms
CART classification and regression trees
LDC linear discriminant classifier
MCS multiple classifier systems
PCA principal component analysis
pdf probability density function
pmf probability mass function
QDC quadratic discriminant classifier
RDA regularized discriminant analysis
SVD singular value decomposition
P(A) a general notation for the probability of an event A
P(A|B) a general notation for the probability of an event A conditioned by
an event B
D a classifier ensemble, D = {D1, . . . , DL}
Di an individual classifier from the ensemble
E the expectation operator
F an aggregation method or formula for combining classifier
outputs
I(a, b) indicator function taking value 1 if a = b, and 0 otherwise
I(l(zj), ωi) example of using the indicator function: takes value 1 if the label of
zj is ωi, and 0 otherwise
l(zj) the class label of zj
P(ωk) the prior probability for class ωk to occur
P(ωk|x) the posterior probability that the true class is ωk, given x ∈ R^n
p(x|ωk) the class-conditional probability density function for x, given class ωk
R^n the n-dimensional real space
s the vector of the class labels produced by the ensemble, s = [s1, . . . , sL]^T
si the class label produced by classifier Di for a given input x, si ∈ Ω
V the variance operator
x a feature vector, x = [x1, . . . , xn]^T, x ∈ R^n (column vectors are
always assumed)
Z the data set, Z = {z1, . . . , zN}, zj ∈ R^n, usually with known labels
for all zj
Ω the set of class labels, Ω = {ω1, . . . , ωc}
ωk a class label
1 Fundamentals of Pattern Recognition
1.1 BASIC CONCEPTS: CLASS, FEATURE, AND DATA SET
1.1.1 Pattern Recognition Cycle
Pattern recognition is about assigning labels to objects. Objects are described by a
set of measurements called also attributes or features. Current research builds
upon foundations laid out in the 1960s and 1970s. A series of excellent books of
that time shaped up the outlook of the field including [1–10]. Many of them were
subsequently translated into Russian. Because pattern recognition is faced with
the challenges of solving real-life problems, in spite of decades of productive
research, elegant modern theories still coexist with ad hoc ideas, intuition and gues-
sing. This is reflected in the variety of methods and techniques available to the
researcher.
Figure 1.1 shows the basic tasks and stages of pattern recognition. Suppose that
there is a hypothetical User who presents us with the problem and a set of data. Our
task is to clarify the problem, translate it into pattern recognition terminology, solve
it, and communicate the solution back to the User.
If the data set is not given, an experiment is planned and a data set is collected.
The relevant features have to be nominated and measured. The feature set should be
as large as possible, containing even features that may not seem too relevant at this
stage. They might be relevant in combination with other features. The limitations for
the data collection usually come from the financial side of the project. Another pos-
sible reason for such limitations could be that some features cannot be easily
measured, for example, features that require damaging or destroying the object,
medical tests requiring an invasive examination when there are counterindications
for it, and so on.
There are two major types of pattern recognition problems: unsupervised and
supervised. In the unsupervised category (called also unsupervised learning), the
problem is to discover the structure of the data set if there is any. This usually
means that the User wants to know whether there are groups in the data, and what
characteristics make the objects similar within the group and different across the
Fig. 1.1 The pattern recognition cycle.
groups. Many clustering algorithms have been and are being developed for unsuper-
vised learning. The choice of an algorithm is a matter of designer’s preference.
Different algorithms might come up with different structures for the same set of
data. The curse and the blessing of this branch of pattern recognition is that there
is no ground truth against which to compare the results. The only indication of
how good the result is, is probably the subjective estimate of the User.
In the supervised category (called also supervised learning), each object in the
data set comes with a preassigned class label. Our task is to train a classifier to
do the labeling “sensibly.” Most often the labeling process cannot be described in
an algorithmic form. So we supply the machine with learning skills and present
the labeled data to it. The classification knowledge learned by the machine in this
process might be obscure, but the recognition accuracy of the classifier will be
the judge of its adequacy.
Features are not all equally relevant. Some of them are important only in relation
to others and some might be only “noise” in the particular context. Feature selection
and extraction are used to improve the quality of the description.
Selection, training, and testing of a classifier model form the core of supervised
pattern recognition. As the dashed and dotted lines in Figure 1.1 show, the loop of
tuning the model can be closed at different places. We may decide to use the same
classifier model and re-do the training only with different parameters, or to change
the classifier model as well. Sometimes feature selection and extraction are also
involved in the loop.
When a satisfactory solution has been reached, we can offer it to the User for
further testing and application.
1.1.2 Classes and Class Labels
Intuitively, a class contains similar objects, whereas objects from different classes
are dissimilar. Some classes have a clear-cut meaning, and in the simplest case
are mutually exclusive. For example, in signature verification, the signature is either
genuine or forged. The true class is one of the two, no matter that we might not be
able to guess correctly from the observation of a particular signature. In other pro-
blems, classes might be difficult to define, for example, the classes of left-handed
and right-handed people. Medical research generates data that are particularly difficult
to interpret because of the natural variability of the object of study. For
example, it is often desirable to distinguish between low risk, medium risk, and
high risk, but we can hardly define sharp discrimination criteria between these
class labels.
We shall assume that there are c possible classes in the problem, labeled ω1 to ωc,
organized as a set of labels Ω = {ω1, . . . , ωc}, and that each object belongs to one
and only one class.
1.1.3 Features
As mentioned before, objects are described by characteristics called features. The
features might be qualitative or quantitative as illustrated on the diagram in
Figure 1.2. Discrete features with a large number of possible values are treated as
quantitative. Qualitative (categorical) features are those with a small number of
possible values, either with or without gradations. A branch of pattern recognition,
called syntactic pattern recognition (as opposed to statistical pattern recognition),
deals exclusively with qualitative features [3].
Statistical pattern recognition operates with numerical features. These include,
for example, systolic blood pressure, speed of the wind, company’s net profit in
the past 12 months, gray-level intensity of a pixel. The feature values for a given
object are arranged as an n-dimensional vector x = [x1, . . . , xn]^T ∈ R^n. The real
space R^n is called the feature space, each axis corresponding to a physical feature.
Real-number representation (x ∈ R^n) requires a methodology to convert qualitative
features into quantitative. Typically, such methodologies are highly subjective and
heuristic. For example, sitting an exam is a methodology to quantify students' learning
progress. There are also unmeasurable features that we, humans, can assess
intuitively but can hardly explain. These include sense of humor, intelligence, and
beauty. For the purposes of this book, we shall assume that all features have
numerical expressions.
Sometimes an object can be represented by multiple subsets of features. For
example, in identity verification, three different sensing modalities can be used
[11]: frontal face, face profile, and voice. Specific feature subsets are measured
for each modality and then the feature vector is composed of three subvectors,
x = [x^(1), x^(2), x^(3)]^T. We call this distinct pattern representation after Kittler et al.
[11]. As we shall see later, an ensemble of classifiers can be built using distinct
pattern representation, one classifier on each feature subset.
Fig. 1.2 Types of features.
1.1.4 Data Set
The information to design a classifier is usually in the form of a labeled data set
Z = {z1, . . . , zN}, zj ∈ R^n. The class label of zj is denoted by l(zj) ∈ Ω,
j = 1, . . . , N. Figure 1.3 shows a set of examples of handwritten digits, which
have to be labeled by the machine into 10 classes. To construct a data set, the
black and white images have to be transformed into feature vectors. It is not always
easy to formulate the n features to be used in the problem. In the example in
Figure 1.3, various discriminatory characteristics can be nominated, using also
various transformations on the image. Two possible features are, for example, the
number of vertical strokes and the number of circles in the image of the digit.
Nominating a good set of features predetermines to a great extent the success of a
pattern recognition system. In this book we assume that the features have been
already defined and measured and we have a ready-to-use data set Z.
1.2 CLASSIFIER, DISCRIMINANT FUNCTIONS, AND
CLASSIFICATION REGIONS
Fig. 1.3 Examples of handwritten digits.
A classifier is any function

D : R^n → Ω    (1.1)

In the "canonical model of a classifier" [2] shown in Figure 1.4, we consider a set
of c discriminant functions G = {g1(x), . . . , gc(x)},

gi : R^n → R,  i = 1, . . . , c    (1.2)

each yielding a score for the respective class. Typically (and most naturally), x is
labeled in the class with the highest score. This labeling choice is called the
maximum membership rule, that is,

D(x) = ωi ∈ Ω  ⇔  gi(x) = max_{k=1,...,c} {gk(x)}    (1.3)

Ties are broken randomly, that is, x is assigned randomly to one of the tied classes.
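The maximum membership rule is straightforward to program. The following minimal Matlab sketch (not taken from the book's appendices) labels one object from a precomputed vector of discriminant scores, breaking ties at random as described above; the score values are hypothetical.

% Maximum membership rule (1.3) with random tie-breaking.
% g(i) holds the discriminant score g_i(x) for the current object x.
g = [0.2, 0.7, 0.7];                  % example scores for c = 3 classes (assumed)
tied = find(g == max(g));             % indices of the classes with the highest score
label = tied(randi(numel(tied)));     % pick one of the tied classes at random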
The discriminant functions partition the feature space R^n into c (not necessarily
compact) decision regions or classification regions denoted by R1, . . . , Rc:

Ri = { x | x ∈ R^n, gi(x) = max_{k=1,...,c} gk(x) },  i = 1, . . . , c    (1.4)
The decision region for class ωi is the set of points for which the ith discriminant
function has the highest score. According to the maximum membership rule (1.3),
all points in decision region Ri are assigned to class ωi. The decision regions are
specified by the classifier D, or, equivalently, by the discriminant functions G.
The boundaries of the decision regions are called classification boundaries, and
contain the points for which the highest discriminant function scores tie. A point on
the boundary can be assigned to any of the bordering classes. If a decision region Ri
contains data points from the labeled set Z with true class label ωj, j ≠ i, the classes
ωi and ωj are called overlapping. Note that classes which overlap for a particular
partition of the feature space (defined by a certain classifier D) can be nonoverlapping
if the feature space is partitioned in another way. If in Z there are no identical
points with different class labels, we can always partition the feature space into
classification regions so that the classes are nonoverlapping. Generally, the smaller
the overlapping, the better the classifier.
Fig. 1.4 Canonical model of a classifier. The double arrows denote the n-dimensional input
vector x, the outputs of the boxes are the discriminant function values gi(x) (scalars), and the
output of the maximum selector is the class label ωk ∈ Ω assigned according to the maximum
membership rule.
Example: Classification Regions. A 15-point two-class problem is depicted in
Figure 1.5. The feature space R^2 is divided into two classification regions: R1 is
shaded (class ω1: squares) and R2 is not shaded (class ω2: dots). For two classes
we can use only one discriminant function instead of two:

g(x) = g1(x) − g2(x)    (1.5)

and assign class ω1 if g(x) is positive and class ω2 if it is negative. For this example,
we have drawn the classification boundary produced by the linear discriminant
function

g(x) = −7x1 + 4x2 + 21 = 0    (1.6)

Notice that any line in R^2 is a linear discriminant function for any two-class
problem in R^2. Generally, any set of functions g1(x), . . . , gc(x) (linear or nonlinear)
is a set of discriminant functions. It is another matter how successfully these
discriminant functions separate the classes.
Let G* = {g*1(x), . . . , g*c(x)} be a set of optimal (in some sense) discriminant
functions. We can obtain infinitely many sets of optimal discriminant functions
from G* by applying a transformation f(g*i(x)) that preserves the order of the
function values for every x ∈ R^n. For example, f(z) can be a log(z), √z for positive
definite g*(x), a^z for a > 1, and so on. Applying the same f to all discriminant
functions in G*, we obtain an equivalent set of discriminant functions. Using the
maximum membership rule (1.3), x will be labeled to the same class by any of the
equivalent sets of discriminant functions.
Fig. 1.5 A two-class example with a linear discriminant function.
If the classes in Z can be separated completely from each other by a hyperplane
(a point in R, a line in R^2, a plane in R^3), they are called linearly separable. The two
classes in Figure 1.5 are not linearly separable because of the dot at (5, 6.6), which is
on the wrong side of the discriminant function.
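As a quick numerical check (a sketch added here, not part of the original example), the discriminant function (1.6) can be evaluated at that dot: g(5, 6.6) is positive, so the maximum membership rule assigns the point to class ω1 although its true label is ω2.

% Evaluating the linear discriminant function (1.6) at the point (5, 6.6).
x  = [5; 6.6];
w  = [-7; 4];  w0 = 21;                      % g(x) = -7*x1 + 4*x2 + 21
gx = w' * x + w0;                            % gives 12.4 > 0
if gx > 0, label = 1; else label = 2; end    % labeled omega_1; true class is omega_2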
1.3 CLASSIFICATION ERROR AND CLASSIFICATION ACCURACY
It is important to know how well our classifier performs. The performance of a
classifier is a compound characteristic, whose most important component is the
classification accuracy. If we were able to try the classifier on all possible input
objects, we would know exactly how accurate it is. Unfortunately, this is hardly a
possible scenario, so an estimate of the accuracy has to be used instead.
1.3.1 Calculation of the Error
Assume that a labeled data set Zts of size Nts × n is available for testing the accuracy
of our classifier, D. The most natural way to calculate an estimate of the error is to
run D on all the objects in Zts and find the proportion of misclassified objects

Error(D) = Nerror / Nts    (1.7)

where Nerror is the number of misclassifications committed by D. This is called the
counting estimator of the error rate because it is based on the count of misclassifi-
cations. Let sj ∈ Ω be the class label assigned by D to object zj. The counting
estimator can be rewritten as

Error(D) = (1/Nts) Σ_{j=1}^{Nts} [1 − I(l(zj), sj)],  zj ∈ Zts    (1.8)

where I(a, b) is an indicator function taking value 1 if a = b and 0 if a ≠ b.
Error(D) is also called the apparent error rate. Dual to this characteristic is the
apparent classification accuracy, which is calculated by 1 − Error(D).
To look at the error from a probabilistic point of view, we can adopt the following
model. The classifier commits an error with probability PD on any object x ∈ R^n
(a wrong but useful assumption). Then the number of errors has a binomial distri-
bution with parameters (PD, Nts). An estimate of PD is

P̂D = Nerror / Nts    (1.9)
which is in fact the counting error, Error(D), defined above. If Nts and PD satisfy the
rule of thumb: Nts > 30, P̂D · Nts > 5 and (1 − P̂D) · Nts > 5, the binomial distri-
bution can be approximated by a normal distribution. The 95 percent confidence
interval for the error is

[ P̂D − 1.96 √(P̂D(1 − P̂D)/Nts),  P̂D + 1.96 √(P̂D(1 − P̂D)/Nts) ]    (1.10)

There are so-called smoothing modifications of the counting estimator [12] whose
purpose is to reduce the variance of the estimate of PD. The binary indicator function
I(a, b) in Eq. (1.8) is replaced by a smoothing function taking values in the interval
[0, 1] ⊂ R.
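A minimal Matlab sketch of the counting estimator (1.7) and the confidence interval (1.10); the vectors true_labels and assigned_labels are assumed to hold the class indices for the Nts testing objects.

% Counting estimator of the error and its 95 percent confidence interval.
Nts   = numel(true_labels);
P_hat = sum(assigned_labels ~= true_labels) / Nts;   % apparent error rate, Eq. (1.7)
half  = 1.96 * sqrt(P_hat * (1 - P_hat) / Nts);      % normal approximation, Eq. (1.10)
ci    = [P_hat - half, P_hat + half];                % 95 percent confidence interval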
1.3.2 Training and Testing Data Sets
Suppose that we have a data set Z of size N × n, containing n-dimensional feature
vectors describing N objects. We would like to use as much as possible of the data to
build the classifier (training), and also as much as possible unseen data to test its
performance more thoroughly (testing). However, if we use all data for training
and the same data for testing, we might overtrain the classifier so that it perfectly
learns the available data and fails on unseen data. That is why it is important to
have a separate data set on which to examine the final product. The main alternatives
for making the best use of Z can be summarized as follows.
• Resubstitution (R-method). Design classifier D on Z and test it on Z. P̂D is
optimistically biased.
• Hold-out (H-method). Traditionally, split Z into halves, use one half for train-
ing, and the other half for calculating P̂D. P̂D is pessimistically biased. Splits in
other proportions are also used. We can swap the two subsets, get another esti-
mate P̂D, and average the two. A version of this method is the data shuffle,
where we do L random splits of Z into training and testing parts and average
all L estimates of PD calculated on the respective testing parts.
• Cross-validation (called also the rotation method or π-method). We choose an
integer K (preferably a factor of N) and randomly divide Z into K subsets of
size N/K. Then we use one subset to test the performance of D trained on
the union of the remaining K − 1 subsets. This procedure is repeated K
times, choosing a different part for testing each time. To get the final value
of P̂D we average the K estimates. When K = N, the method is called the
leave-one-out (or U-method). A Matlab sketch of this procedure is given
after this list.
• Bootstrap. This method is designed to correct the optimistic bias of the
R-method. This is done by randomly generating L sets of cardinality N from
the original set Z, with replacement. Then we assess and average the error
rate of the classifiers built on these sets.
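The cross-validation procedure above can be sketched in Matlab as follows. The functions train_classifier and classify_data are hypothetical placeholders standing in for any base classifier; Z is the N × n data matrix and labels the N × 1 vector of class labels.

% K-fold cross-validation estimate of P_D (a sketch, assuming K divides N).
K   = 10;
N   = size(Z, 1);
idx = randperm(N);                        % shuffle the objects once
fold = ceil((1:N) / (N / K));             % fold index for each shuffled object
err = zeros(1, K);
for k = 1:K
  te = idx(fold == k);                    % testing part for this rotation
  tr = idx(fold ~= k);                    % remaining K-1 parts for training
  D  = train_classifier(Z(tr,:), labels(tr));   % hypothetical training routine
  guessed  = classify_data(D, Z(te,:));         % hypothetical labeling routine
  true_te  = labels(te);
  err(k)   = mean(guessed(:) ~= true_te(:));
end
P_hat = mean(err);                        % averaged cross-validation estimate

Setting K = N turns the same loop into the leave-one-out estimate.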
The question about the best way to organize the training/testing experiment has
been around for a long time [13]. Pattern recognition has now outgrown the stage
where the computation resource was the decisive factor as to which method to
use. However, even with modern computing technology, the problem has not disap-
peared. The ever growing sizes of the data sets collected in different fields of science
and practice pose a new challenge. We are back to using the good old hold-out
method, first because the others might be too time-consuming, and secondly,
because the amount of data might be so excessive that small parts of it will suffice
for training and testing. For example, consider a data set obtained from retail analy-
sis, which involves hundreds of thousands of transactions. Using an estimate of the
error over, say, 10,000 data points, can conveniently shrink the confidence interval,
and make the estimate reliable enough.
It is now becoming a common practice to use three instead of two data sets: one
for training, one for validation, and one for testing. As before, the testing set remains
unseen during the training process. The validation data set acts as pseudo-testing.
We continue the training process until the performance improvement on the training
set is no longer matched by a performance improvement on the validation set. At
this point the training should be stopped so as to avoid overtraining. Not all data
sets are large enough to allow for a validation part to be cut out. Many of the
data sets from the UCI Machine Learning Repository Database (at
http://www.ics.uci.edu/~mlearn/MLRepository.html), often used as benchmarks in pattern
recognition and machine learning, may be unsuitable for a three-way split into
training/validation/testing. The reason is that the data subsets will be too small
and the estimates of the error on these subsets would be unreliable. Then stopping
the training at the point suggested by the validation set might be inadequate, the esti-
mate of the testing accuracy might be inaccurate, and the classifier might be poor
because of the insufficient training data.
When multiple training and testing sessions are carried out, there is the question
about which of the classifiers built during this process we should use in the end. For
example, in a 10-fold cross-validation, we build 10 different classifiers using differ-
ent data subsets. The above methods are only meant to give us an estimate of the
accuracy of a certain model built for the problem at hand. We rely on the assumption
that the classification accuracy will change smoothly with the changes in the size of
the training data [14]. Therefore, if we are happy with the accuracy and its variability
across different training subsets, we may decide finally to train a single classifier on
the whole data set. Alternatively, we may keep the classifiers built throughout the
training and consider using them together in an ensemble, as we shall see later.
1.3.3 Confusion Matrices and Loss Matrices
To find out how the errors are distributed across the classes we construct a confusion
matrix using the testing data set, Zts. The entry aij of such a matrix denotes the
number of elements from Zts whose true class is ωi, and which are assigned by D
to class ωj.
The confusion matrix for the linear classifier for the 15-point data depicted in
Figure 1.5 is given as

                    D(x)
True Class       ω1     ω2
     ω1           7      0
     ω2           1      7

The estimate of the classifier's accuracy can be calculated as the trace of the
matrix divided by the total sum of the entries, (7 + 7)/15 in this case. The additional
information that the confusion matrix provides is where the misclassifications have
occurred. This is important for problems with a large number of classes, because
a large off-diagonal entry of the matrix might indicate a difficult two-class problem
that needs to be tackled separately.
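A confusion matrix can be accumulated in a few lines of Matlab; true_labels and assigned_labels are assumed to be vectors of class indices 1, . . . , c for the testing set.

% Confusion matrix: entry (i,j) counts objects of true class i labeled as class j by D.
c  = max([true_labels(:); assigned_labels(:)]);
CM = zeros(c);
for j = 1:numel(true_labels)
  CM(true_labels(j), assigned_labels(j)) = CM(true_labels(j), assigned_labels(j)) + 1;
end
accuracy = trace(CM) / sum(CM(:));   % e.g., (7 + 7)/15 for the matrix above

For the two-class example above this returns the 2 × 2 matrix just shown and an accuracy of 14/15.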
Example: Confusion Matrix for the Letter Data. The Letter data set available
from the UCI Machine Learning Repository Database contains data extracted
from 20,000 black and white images of capital English letters. Sixteen numerical
features describe each image (N = 20,000, c = 26, n = 16). For the purpose of
this illustration we used the hold-out method. The data set was randomly split
into halves. One half was used for training a linear classifier, and the other half
was used for testing. The labels of the testing data were matched to the labels
obtained from the classifier, and the 26 × 26 confusion matrix was constructed. If
the classifier was ideal and all labels matched, the confusion matrix would be
diagonal.
Table 1.1 shows the row in the confusion matrix corresponding to class “H.” The
entries show the number of times that true “H” is mistaken for the letter in the
respective column. The boldface number is the diagonal entry showing how many
times “H” has been correctly recognized. Thus, from the total of 379 examples of
“H” in the testing set, only 165 have been labeled correctly by the classifier.
Curiously, the largest number of mistakes, 37, is for the letter "O."
The errors in classification are not equally costly. To account for the different
costs of mistakes, we introduce the loss matrix. We define a loss matrix with entries
TABLE 1.1 The “H”-Row in the Confusion Matrix for the Letter Data Set Obtained
from a Linear Classifier Trained on 10,000 Points.
“H” mistaken for: A B C D E F G H I J K L M
No of times: 2 12 0 27 0 2 1 165 0 0 26 0 1
“H” mistaken for: N O P Q R S T U V W X Y Z
No of times: 31 37 4 8 17 1 1 13 3 1 27 0 0
λij denoting the loss incurred by assigning label ωi, given that the true label of the
object is ωj. If the classifier is unsure about the label, it may refuse to make a
decision. An extra class (called refuse-to-decide), denoted ωc+1, can be added to
Ω. Choosing ωc+1 should be less costly than choosing a wrong class. For a problem
with c original classes and a refuse option, the loss matrix is of size (c + 1) × c. Loss
matrices are usually specified by the user. A zero-one (0-1) loss matrix is defined
as λij = 0 for i = j and λij = 1 for i ≠ j, that is, all errors are equally costly.
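For illustration, a zero-one loss matrix with a refuse-to-decide row can be written down directly in Matlab; the constant loss r for refusing is a hypothetical value chosen by the user.

% Zero-one loss matrix for c classes, extended with a refuse-to-decide row.
c = 3;  r = 0.3;                     % assumed values for the illustration
Lambda = ones(c) - eye(c);           % 0 on the diagonal (correct), 1 off it (error)
Lambda = [Lambda; r * ones(1, c)];   % (c+1)-by-c: last row is the refuse option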
1.4 EXPERIMENTAL COMPARISON OF CLASSIFIERS
There is no single “best” classifier. Classifiers applied to different problems and
trained by different algorithms perform differently [15–17]. Comparative studies
are usually based on extensive experiments using a number of simulated and real
data sets. Dietterich [14] details four important sources of variation that have to
be taken into account when comparing classifier models.
1. The choice of the testing set. Different testing sets may rank the same classifiers
differently, even though the classifiers have equal accuracy across the whole population.
Therefore it is dangerous to draw conclusions from a single testing exper-
iment, especially when the data size is small.
2. The choice of the training set. Some classifier models are called instable [18]
because small changes in the training set can cause substantial changes of the
classifier trained on this set. Examples of instable classifiers are decision tree
classifiers and some neural networks. (Note, all classifier models mentioned
will be discussed later.) Instable classifiers are versatile models that are
capable of adapting, so that all training examples are correctly classified.
The instability of such classifiers is in a way the pay-off for their versatility.
As we shall see later, instable classifiers play a major role in classifier
ensembles. Here we note that the variability with respect to the training
data has to be accounted for.
3. The internal randomness of the training algorithm. Some training algorithms
have a random component. This might be the initialization of the parameters
of the classifier, which are then fine-tuned (e.g., backpropagation algorithm
for training neural networks), or a random-based procedure for tuning the clas-
sifier (e.g., a genetic algorithm). Thus the trained classifier might be different
for the same training set and even for the same initialization of the parameters.
4. The random classification error. Dietterich [14] considers the possibility of
having mislabeled objects in the testing data as the fourth source of variability.
The above list suggests that multiple training and testing sets should be used, and
multiple training runs should be carried out for classifiers whose training has a
stochastic element.
Let {D1, . . . , DL} be a set of classifiers tested on the same data set
Zts = {z1, . . . , zNts}. Let the classification results be organized in a binary Nts × L
matrix whose (i, j)th entry is 0 if classifier Dj has misclassified vector zi, and 1 if Dj
has produced the correct class label l(zi).
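Such a matrix is easy to build when the assigned labels are stored classifier by classifier. In the sketch below, S is assumed to be an Nts × L matrix with S(i, j) being the label assigned by Dj to zi, and true_labels is the vector of true labels.

% Binary Nts-by-L matrix of correct (1) / incorrect (0) outputs.
correct = double(S == repmat(true_labels(:), 1, size(S, 2)));
% Example summaries of the kind collected in Table 1.3:
per_classifier = sum(correct, 1);           % number of correct labels per classifier
all_wrong      = sum(all(correct == 0, 2)); % objects misclassified by every classifier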
Example: Arranging the Classifier Output Results. Figure 1.6 shows the classi-
fication regions of three classifiers trained to recognize two banana-shaped classes.1
A
training set was generated consisting of 100 points, 50 in each class, and another set of
the same size was generated for testing. The points are uniformly distributed along the
banana shapes with a superimposed normal distribution with a standard deviation 1.5.
The figure gives the gradient-shaded regions and a scatter plot of the testing data.
The confusion matrices of the classifiers for the testing data are shown in Table 1.2.
The classifier models and their training will be discussed further in the book. We are
now only interested in comparing the accuracies.
The matrix with the correct/incorrect outputs is summarized in Table 1.3. The
number of possible combinations of zeros and ones at the outputs of the three
classifiers for a given object, for this example, is 2^L = 8. The table shows the
0–1 combinations and the number of times they occur in Zts.
The data in the table is then used to test statistical hypotheses about the equival-
ence of the accuracies.
1.4.1 McNemar and Difference of Proportion Tests
Suppose we have two trained classifiers that have been run on the same testing data
giving testing accuracies of 98 and 96 percent, respectively. Can we claim that the
first classifier is significantly better than the second?
¹ To generate the training and testing data sets we used the gendatb command from the PRTOOLS tool-
box for Matlab [19]. This toolbox has been developed by Prof. R. P. W. Duin and his group (Pattern Rec-
ognition Group, Department of Applied Physics, Delft University of Technology) as a free aid for
researchers in pattern recognition and machine learning. Available at
http://www.ph.tn.tudelft.nl/~bob/PRTOOLS.html. Version 2 was used throughout this book.
Fig. 1.6 The decision regions found by the three classifiers.
1.4.1.1 McNemar Test. The testing results for two classifiers D1 and D2 are
expressed in Table 1.4.
The null hypothesis, H0, is that there is no difference between the accuracies of
the two classifiers. If the null hypothesis is correct, then the expected counts for both
off-diagonal entries in Table 1.4 are (1/2)(N01 + N10). The discrepancy between the
expected and the observed counts is measured by the following statistic

χ² = (|N01 − N10| − 1)² / (N01 + N10)    (1.11)

which is approximately distributed as χ² with 1 degree of freedom. The "−1" in the
numerator is a continuity correction [14]. The simplest way to carry out the test is to
calculate χ² and compare it with the tabulated χ² value for, say, level of significance
0.05.²
² The level of significance of a statistical test is the probability of rejecting H0 when it is true, that is, the
probability to “convict the innocent.” This error is called Type I error. The alternative error, when we do
not reject H0 when it is in fact incorrect, is called Type II error. The corresponding name for it would be
“free the guilty.” Both errors are needed in order to characterize a statistical test. For example, if we
always accept H0, there will be no Type I error at all. However, in this case the Type II error might be
large. Ideally, both errors should be small.
TABLE 1.2 Confusion Matrices and Total Accuracies for the
Three Classifiers on the Banana Data.

        LDC              9-nn             Parzen
      42    8          44    6          47    3
       8   42           2   48           5   45
   84% correct      92% correct      92% correct

LDC, linear discriminant classifier; 9-nn, nine nearest neighbor.

TABLE 1.3 Correct/Incorrect Outputs for the Three Classifiers
on the Banana Data: "0" Means Misclassification, "1" Means
Correct Classification.

D1 = LDC    D2 = 9-nn    D3 = Parzen    Number
    1            1             1           80
    1            1             0            2
    1            0             1            0
    1            0             0            2
    0            1             1            9
    0            1             0            1
    0            0             1            3
    0            0             0            3
   84           92            92          100

LDC, linear discriminant classifier; 9-nn, nine nearest neighbor.
Then if χ² > 3.841, we reject the null hypothesis and accept that the two classi-
fiers have significantly different accuracies.

1.4.1.2 Difference of Two Proportions. Denote the two proportions of inter-
est to be the accuracies of the two classifiers, estimated from Table 1.4 as

p1 = (N11 + N10)/Nts;   p2 = (N11 + N01)/Nts    (1.12)

We can use Eq. (1.10) to calculate the 95 percent confidence intervals of the two
accuracies, and if they are not overlapping, we can conclude that the accuracies
are significantly different.
A shorter way would be to consider just one random variable, d = p1 − p2. We can
approximate the two binomial distributions by normal distributions (given that
Nts ≥ 30 and p1·Nts > 5, (1 − p1)·Nts > 5, p2·Nts > 5, and (1 − p2)·Nts > 5).
If the two errors are independent, then d is a normally distributed random vari-
able. Under a null hypothesis, H0, of equal p1 and p2, the following statistic

z = (p1 − p2) / √(2p(1 − p)/Nts)    (1.13)

has (approximately) a standard normal distribution, where p = (1/2)(p1 + p2) is the
pooled sample proportion. The null hypothesis is rejected if |z| > 1.96 (a two-
sided test with a level of significance of 0.05).
Note that this test assumes that the testing experiments are done on independently
drawn testing samples of size Nts. In our case, the classifiers use the same testing
data, so it is more appropriate to use a paired or matched test. Dietterich shows
[14] that with a correction for this dependence, we arrive at a statistic that is the
square root of χ² in Eq. (1.11). Since the above z statistic is commonly used in
the machine learning literature, Dietterich investigated experimentally how badly
z is affected by the violation of the independence assumption. He recommends
using the McNemar test rather than the difference of proportions test.
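Both statistics are easy to compute from the 0/1 correctness vectors of the two classifiers on the common testing set; the Matlab sketch below assumes c1 and c2 are such vectors (1 for a correct label, 0 for an error).

% McNemar statistic (1.11) and the difference-of-proportions statistic (1.13).
Nts = numel(c1);
N01 = sum(c1 == 0 & c2 == 1);                  % D1 wrong, D2 correct
N10 = sum(c1 == 1 & c2 == 0);                  % D1 correct, D2 wrong
chi2 = (abs(N01 - N10) - 1)^2 / (N01 + N10);   % compare with 3.841
p1 = mean(c1);  p2 = mean(c2);  p = (p1 + p2) / 2;
z  = (p1 - p2) / sqrt(2 * p * (1 - p) / Nts);  % compare |z| with 1.96

For the LDC and 9-nn columns of Table 1.3, this gives the values computed in the example below.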
Example: Comparison of Two Classifiers on the Banana Data. Consider the
linear discriminant classifier (LDC) and the nine-nearest neighbor classifier (9-nn)
for the banana data. Using Table 1.3, we can construct the two-way table with
TABLE 1.4 The 2 × 2 Relationship Table with Counts.

                    D2 correct (1)    D2 wrong (0)
D1 correct (1)          N11               N10
D1 wrong (0)            N01               N00

Total, N11 + N10 + N01 + N00 = Nts.
EXPERIMENTAL COMPARISON OF CLASSIFIERS 15
36. counts needed for the calculation of x2
in Eq. (1.11).
N11 ¼ 80 þ 2 ¼ 82 N10 ¼ 0 þ 2 ¼ 2
N01 ¼ 9 þ 1 ¼ 10 N00 ¼ 3 þ 3 ¼ 6
From Eq. (1.11)
x2
¼
(j10 2j 1)2
10 þ 2
¼
49
12
4:0833 (1:14)
Since the calculated x2
is greater than the tabulated value of 3.841, we reject the
null hypothesis and accept that LDC and 9-nn are significantly different. Applying
the difference of proportions test to the same pair of classifiers gives p ¼
(0:84 þ 0:92)=2 ¼ 0:88, and
z ¼
0:84 0:92
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
(2 0:88 0:12)=(100
p
)
1:7408 (1:15)
In this case jzj is smaller than the tabulated value of 1.96, so we cannot reject the null
hypothesis and claim that LDC and 9-nn have significantly different accuracies.
Which of the two decisions do we trust? The McNemar test takes into account the
fact that the same testing set Zts was used whereas the difference of proportions does
not. Therefore, we can accept the decision of the McNemar test. (Would it not have
been better if the two tests agreed?)
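For completeness, both statistics can be computed in a few lines of Matlab directly from the counts in the relationship table. The sketch below is only an illustration with my own variable names (it is not part of the book's appendices); it uses the counts from the banana data example, so it reproduces the values in Eqs. (1.14) and (1.15).

% McNemar test and difference-of-proportions test for two classifiers
% (illustrative sketch; counts taken from the banana data example)
N11 = 82; N10 = 2; N01 = 10; N00 = 6;          % entries of the 2 x 2 table
Nts = N11 + N10 + N01 + N00;                   % size of the testing set (100)
chi2 = (abs(N01 - N10) - 1)^2 / (N01 + N10);   % McNemar statistic, Eq. (1.11)
reject_mcnemar = chi2 > 3.841;                 % chi-square, 1 d.f., alpha = 0.05
p1 = (N11 + N10) / Nts;                        % accuracy of D1, Eq. (1.12)
p2 = (N11 + N01) / Nts;                        % accuracy of D2, Eq. (1.12)
p  = (p1 + p2) / 2;                            % pooled sample proportion
z  = (p1 - p2) / sqrt(2 * p * (1 - p) / Nts);  % Eq. (1.13)
reject_proportions = abs(z) > 1.96;            % two-sided test, alpha = 0.05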
1.4.2 Cochran’s Q Test and F-Test
To compare L > 2 classifiers on the same testing data, the Cochran's Q test or the F-test can be used.
1.4.2.1 Cochran’s Q Test. Cochran’s Q test is proposed for measuring whether
there are significant differences in L proportions measured on the same data [20].
This test is used in Ref. [21] in the context of comparing classifier accuracies. Let
pi denote the classification accuracy of classifier Di. We shall test the hypothesis
for no difference between the classification accuracies (equal proportions):
H0 : p1 = p2 = · · · = pL    (1.16)

If there is no difference, then the following statistic is distributed approximately as χ² with L − 1 degrees of freedom,

QC = (L − 1) [ L ∑_{i=1}^{L} Gi² − T² ] / [ L T − ∑_{j=1}^{Nts} (Lj)² ]    (1.17)
where Gi is the number of objects out of Nts correctly classified by Di, i = 1, . . . , L; Lj is the number of classifiers out of L that correctly classified object zj ∈ Zts; and T is the total number of correct votes among the L classifiers,

T = ∑_{i=1}^{L} Gi = ∑_{j=1}^{Nts} Lj    (1.18)
To test H0 we compare the calculated QC with the tabulated value of χ² for L − 1 degrees of freedom and the desired level of significance. If the calculated value is
greater than the tabulated value, we reject the null hypothesis and accept that
there are significant differences among the classifiers. We can apply pairwise tests
to find out which pair (or pairs) of classifiers are significantly different.
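As an illustration of how the quantities in Eqs. (1.17) and (1.18) are obtained, the following Matlab sketch computes QC from a binary correctness matrix C (an Nts × L matrix with C(j, i) = 1 if classifier Di labels object zj correctly). The matrix C, the random values used to fill it, and the variable names are my own assumptions; this is not part of the book's appendices.

% Cochran's Q test for L classifiers tested on the same data (illustrative sketch)
C = rand(100, 3) < 0.9;            % assumed Nts x L correctness matrix (random example)
[Nts, L] = size(C);
G  = sum(C, 1);                    % G_i: number of objects correctly classified by D_i
Lj = sum(C, 2);                    % L_j: number of classifiers correct on object z_j
T  = sum(G);                       % total number of correct votes, Eq. (1.18)
QC = (L - 1) * (L * sum(G.^2) - T^2) / (L * T - sum(Lj.^2));   % Eq. (1.17)
% compare QC with the tabulated chi-square value for L-1 degrees of freedom,
% e.g. 5.991 for L = 3 at level of significance 0.05
reject_H0 = QC > 5.991;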
1.4.2.2 F-Test. Looney [22] proposed a method for testing L independent classifiers on the same testing set. The sample estimates of the accuracies, p̂1, . . . , p̂L, and p̂ are found and used to calculate the sum of squares for the classifiers

SSA = Nts ∑_{i=1}^{L} p̂i² − Nts L p̂²    (1.19)

and the sum of squares for the objects

SSB = (1/L) ∑_{j=1}^{Nts} (Lj)² − L Nts p̂²    (1.20)

The total sum of squares is

SST = Nts L p̂ (1 − p̂)    (1.21)

and the sum of squares for classification–object interaction is

SSAB = SST − SSA − SSB    (1.22)

The calculated F value is obtained by

MSA = SSA / (L − 1);   MSAB = SSAB / [(L − 1)(Nts − 1)];   Fcal = MSA / MSAB    (1.23)

We check the validity of H0 by comparing our Fcal with the tabulated value of an F-distribution with degrees of freedom (L − 1) and (L − 1) × (Nts − 1). If Fcal is greater than the tabulated F-value, we reject the null hypothesis H0 and can further search for pairs of classifiers that differ significantly. Looney [22] suggests we use the same Fcal but with adjusted degrees of freedom (called the F+ test).
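Continuing the sketch given for Cochran's Q test, the same correctness matrix C is sufficient to obtain Fcal from Eqs. (1.19) to (1.23). Again, this is only an illustrative fragment with assumed variable names, not the book's code.

% F-test for L classifiers tested on the same data (illustrative sketch)
% C is the Nts x L binary correctness matrix introduced above
[Nts, L] = size(C);
p_i   = mean(C, 1);                            % accuracy estimates of the L classifiers
p_bar = mean(p_i);                             % overall average accuracy
Lj    = sum(C, 2);                             % number of classifiers correct per object
SSA   = Nts * sum(p_i.^2) - Nts * L * p_bar^2; % Eq. (1.19)
SSB   = sum(Lj.^2) / L - L * Nts * p_bar^2;    % Eq. (1.20)
SST   = Nts * L * p_bar * (1 - p_bar);         % Eq. (1.21)
SSAB  = SST - SSA - SSB;                       % Eq. (1.22)
MSA   = SSA / (L - 1);                         % Eq. (1.23)
MSAB  = SSAB / ((L - 1) * (Nts - 1));
Fcal  = MSA / MSAB;
% compare Fcal with the tabulated F value for (L-1) and (L-1)*(Nts-1) degrees of freedom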
Example: Cochran's Q Test and F-Test for Multiple Classifiers. For the three classifiers on the banana data, LDC, 9-nn, and Parzen, we use Table 1.3 to calculate T = 84 + 92 + 92 = 268, and subsequently

QC = 2 × [3 (84² + 92² + 92²) − 268²] / [3 × 268 − (80 × 9 + 11 × 4 + 6 × 1)] ≈ 3.7647    (1.24)

The tabulated value of χ² for L − 1 = 2 degrees of freedom and level of significance 0.05 is 5.991. Since the calculated value is smaller than that, we cannot reject H0.
For the F-test, the results are

SSA = 100 × (0.84² + 2 × 0.92² − 3 × 0.8933²) ≈ 0.4445
SSB = (1/3) × (80 × 9 + 11 × 4 + 6 × 1) − 3 × 100 × 0.8933² ≈ 17.2712
SST = 100 × 3 × 0.8933 × 0.1067 ≈ 28.5945
SSAB = 28.5945 − 0.4445 − 17.2712 = 10.8788
MSA = 0.4445/2 ≈ 0.2223
MSAB = 10.8788/(2 × 99) ≈ 0.0549
Fcal = 0.2223/0.0549 ≈ 4.0492

The tabulated F-value for degrees of freedom 2 and (2 × 99) = 198 is 3.09. In
this example the F-test disagrees with the Cochran’s Q test, suggesting that we
can reject H0 and accept that there is a significant difference among the three com-
pared classifiers.
Looney [22] recommends the F-test because it is the less conservative of the two.
Indeed, in our example, the F-test did suggest difference between the three classi-
fiers whereas Cochran’s Q test did not. Looking at the scatterplots and the classifi-
cation regions, it seems more intuitive to agree with the tests that do indicate
difference: the McNemar test (between LDC and 9-nn) and the F-test (among all
three classifiers).
1.4.3 Cross-Validation Tests
We now consider several ways to account for the variability of the training and test-
ing data.
1.4.3.1 K-Hold-Out Paired t-Test. According to Ref. [14], this test is widely
used for comparing algorithms in machine learning. Consider two classifier models,
A and B, and a data set Z. The data set is split into training and testing subsets,
usually 2/3 for training and 1/3 for testing (the hold-out method). Classifiers A
and B are trained on the training set and tested on the testing set. Denote the
observed testing accuracies as PA and PB, respectively. This process is repeated K times (typical value of K is 30), and the testing accuracies are tagged with superscripts (i), i = 1, . . . , K. Thus a set of K differences is obtained, P^(1) = P_A^(1) − P_B^(1) to P^(K) = P_A^(K) − P_B^(K). The assumption that we make is that the set of differences is an independently drawn sample from an approximately normal distribution. Then, under the null hypothesis (H0: equal accuracies), the following statistic has a t-distribution with K − 1 degrees of freedom

t = P̄ √K / √( ∑_{i=1}^{K} (P^(i) − P̄)² / (K − 1) )    (1.25)

where P̄ = (1/K) ∑_{i=1}^{K} P^(i). If the calculated t is greater than the tabulated value for the chosen level of significance and K − 1 degrees of freedom, we reject H0 and accept that there are significant differences in the two compared classifier models.
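A minimal Matlab sketch of the statistic in Eq. (1.25) is given below. The vectors PA and PB of testing accuracies are assumed to have been collected beforehand (here they are filled with illustrative random values, which are my own assumption); the complete experiment, including training and testing of the two models, is in Appendix 1A.

% Paired t-test on K pairs of testing accuracies (illustrative sketch)
K  = 30;
PA = 0.92 + 0.01 * randn(1, K);      % assumed testing accuracies of model A
PB = 0.91 + 0.01 * randn(1, K);      % assumed testing accuracies of model B
P    = PA - PB;                      % differences P^(i)
Pbar = mean(P);                      % average difference
t    = Pbar * sqrt(K) / sqrt(sum((P - Pbar).^2) / (K - 1));   % Eq. (1.25)
% compare |t| with the tabulated t value for K-1 degrees of freedom,
% e.g. 2.045 for K = 30 at level of significance 0.05 (two-tailed test)
reject_H0 = abs(t) > 2.045;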
Dietterich argues that the above design might lead to deceptive results because
the assumption of the independently drawn sample is invalid. The differences are
dependent because the data sets used to train the classifier models and the sets to
estimate the testing accuracies in each of the K runs are overlapping. This is
found to be a severe drawback.
1.4.3.2 K-Fold Cross-Validation Paired t-Test. This is an alternative of the
above procedure, which avoids the overlap of the testing data. The data set is split
into K parts of approximately equal sizes, and each part is used in turn for testing of a
classifier built on the pooled remaining K − 1 parts. The resultant differences are
again assumed to be an independently drawn sample from an approximately normal
distribution. The same statistic t, as in Eq. (1.25), is calculated and compared with
the tabulated value.
Only part of the problem is resolved by this experimental set-up. The testing sets
are independent, but the training sets are overlapping again. Besides, the testing set
sizes might become too small, which entails high variance of the estimates.
1.4.3.3 Dietterich's 5 × 2-Fold Cross-Validation Paired t-Test (5 × 2cv). Dietterich [14] suggests a testing procedure that consists of repeating a two-fold cross-validation procedure five times. In each cross-validated run, we split the data into training and testing halves. Classifier models A and B are trained first on half #1, and tested on half #2, giving observed accuracies P_A^(1) and P_B^(1), respectively. By swapping the training and testing halves, estimates P_A^(2) and P_B^(2) are obtained. The differences are respectively

P^(1) = P_A^(1) − P_B^(1)
and

P^(2) = P_A^(2) − P_B^(2)

The estimated mean and variance of the differences, for this two-fold cross-validation run, are calculated as

P̄ = (P^(1) + P^(2)) / 2,   s² = (P^(1) − P̄)² + (P^(2) − P̄)²    (1.26)

Let P_i^(1) denote the difference P^(1) in the ith run, and s_i² denote the estimated variance for run i, i = 1, . . . , 5. The proposed t̃ statistic is

t̃ = P_1^(1) / √( (1/5) ∑_{i=1}^{5} s_i² )    (1.27)

Note that only one of the ten differences that we will calculate throughout this experiment is used in the numerator of the formula. It is shown in Ref. [14] that under the null hypothesis, t̃ has approximately a t distribution with five degrees of freedom.
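The statistic itself takes only a few lines once the ten accuracy estimates are available. The Matlab sketch below is an illustration with my own variable names; the accuracies are those of the 5 × 2cv example that follows (Table 1.6), so the computed value should match t̃ ≈ 1.0690.

% 5 x 2cv paired t-test statistic (illustrative sketch)
% rows are the five runs; columns are the two testing halves of each run
PA = [93 93; 92 93; 90 88; 94 91; 93 92] / 100;   % accuracies of 9-nn (Table 1.6)
PB = [91 94; 89 93; 90 90; 94 88; 93 90] / 100;   % accuracies of Parzen (Table 1.6)
P      = PA - PB;                                 % differences P_i^(k)
Pbar   = mean(P, 2);                              % mean difference per run, Eq. (1.26)
s2     = sum((P - repmat(Pbar, 1, 2)).^2, 2);     % variance estimate per run, Eq. (1.26)
ttilde = P(1, 1) / sqrt(mean(s2));                % Eq. (1.27); only P_1^(1) in the numerator
% compare |ttilde| with the tabulated t value for five degrees of freedom (2.571 at 0.05)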
Example: Comparison of Two Classifier Models Through Cross-Validation
Tests. The banana data set used in the previous examples is suitable for experiment-
ing here because we can generate as many as necessary independent data sets from
the same distribution. We chose the 9-nn and Parzen classifiers. The Matlab code for
the three cross-validation methods discussed above is given in Appendices 1A to 1C
at the end of this chapter. PRTOOLS toolbox for Matlab, version 2 [19], was used to
train and test the two classifiers.
K-Hold-Out Paired t-Test. The training and testing data sets used in the previous
example were pooled and the K-hold-out paired t-test was run with K = 30, as explained above. We chose to divide the data set into halves instead of a 2/3 to 1/3 split. The test statistic (1.25) was found to be t = 1.9796. At level of significance 0.05, and degrees of freedom K − 1 = 29, the tabulated value is 2.045
(two-tailed test). Since the calculated value is smaller than the tabulated value,
we cannot reject the null hypothesis. This test suggests that 9-nn and Parzen classi-
fiers do not differ in accuracy on the banana data. The averaged accuracies over the
30 runs were 92.5 percent for 9-nn, and 91.83 percent for Parzen.
K-Fold Cross-Validation Paired t-Test. We ran a 10-fold cross-validation for the
set of 200 data points, so each testing set consisted of 20 objects. The ten testing
accuracies for 9-nn and Parzen are shown in Table 1.5.
From Eq. (1.25) we found t = 1.0000. At level of significance 0.05, and degrees of freedom K − 1 = 9, the tabulated value is 2.262 (two-tailed test). Again, since the
calculated value is smaller than the tabulated value, we cannot reject the null
hypothesis, and we accept that 9-nn and Parzen do not differ in accuracy on the
banana data. The averaged accuracies over the 10 splits were 91.50 percent for
9-nn, and 92.00 percent for Parzen.
5 × 2cv. The results of the five cross-validation runs are summarized in Table 1.6. Using (1.27), t̃ = 1.0690. Comparing it with the tabulated value of 2.571 (level of significance 0.05, two-tailed test, five degrees of freedom), we again conclude that there is no difference in the accuracies of 9-nn and Parzen. The averaged accuracies across the 10 estimates (5 runs × 2 estimates in each) were 91.90 for 9-nn and 91.20 for Parzen.
Looking at the averaged accuracies in all three tests, it is tempting to conclude
that 9-nn is marginally better than Parzen on this data. In many publications differ-
ences in accuracy are claimed on even smaller discrepancies. However, none of the
three tests suggested that the difference is significant.
To re-confirm this result we ran a larger experiment where we did generate inde-
pendent training and testing data sets from the same distribution, and applied the
paired t-test as in Eq. (1.25). Now the assumptions of independence are satisfied
and the test should be accurate. The Matlab code for this experiment is given in
Appendix 1D at the end of this chapter. Five hundred training and testing samples,
of size 100 each, were generated. The averaged accuracy over the 500 runs was
91.61 percent for 9-nn and 91.60 percent for the Parzen classifier. The t-statistic
was calculated to be 0.1372 (we can use the standard normal distribution in this case because K = 500 ≫ 30). The value is smaller than 1.96 (tabulated value for level of significance 0.05). Therefore we cannot conclude that there is a significant difference between the two models on this data set.

TABLE 1.5 Accuracies (in %) of 9-nn and Parzen Using a 10-Fold Cross-Validation on the Banana Data.

Sample #              1    2    3    4    5    6    7    8    9   10
9-nn (model A)       90   95   95   95   95   90  100   80   85   90
Parzen (model B)     90   95   95   95   95   90  100   85   85   90
PA − PB               0    0    0    0    0    0    0   −5    0    0

TABLE 1.6 Accuracies (in %), Differences (in %), and Variances s² of 9-nn (A) and Parzen (B) Using a 5 × 2-Fold Cross-Validation on the Banana Data.

Exp #    P_A^(1)   P_B^(1)   P^(1)   P_A^(2)   P_B^(2)   P^(2)       s²
1           93        91       2        93        94      −1     0.00045
2           92        89       3        93        93       0     0.00045
3           90        90       0        88        90      −2     0.00020
4           94        94       0        91        88       3     0.00045
5           93        93       0        92        90       2     0.00020
It is intuitively clear that simple models or stable classifiers are less likely to be
overtrained than more sophisticated models. However, simple models might not be
versatile enough to fit complex classification boundaries. More complex models
(e.g., neural networks and prototype-based classifiers) have a better flexibility but
require more system resources and are prone to overtraining. What do “simple”
and “complex” mean in this context? The main aspects of complexity can be sum-
marized as [23]
. training time and training complexity;
. memory requirements (e.g., the number of the parameters of the classifier that
are needed for its operation); and
. running complexity.
1.4.4 Experiment Design
When talking about experiment design, I cannot refrain from quoting again and
again a masterpiece of advice by George Nagy titled “Candide’s practical principles
of experimental pattern recognition” [24]:
Comparison of Classification Accuracies
Comparisons against algorithms proposed by others are distasteful and should be
avoided. When this is not possible, the following Theorem of Ethical Data Selection
may prove useful. Theorem: There exists a set of data for which a candidate algorithm
is superior to any given rival algorithm. This set may be constructed by omitting from
the test set any pattern which is misclassified by the candidate algorithm.
Replication of Experiments
Since pattern recognition is a mature discipline, the replication of experiments on new
data by independent research groups, a fetish in the physical and biological sciences, is
unnecessary. Concentrate instead on the accumulation of novel, universally applicable
algorithms. Casey’s Caution: Do not ever make your experimental data available to
others; someone may find an obvious solution that you missed.
Albeit meant to be satirical, the above principles are surprisingly widespread and
closely followed! Speaking seriously now, the rest of this section gives some prac-
tical tips and recommendations.
Example: Which Is the “Best” Result? Testing should be carried out on pre-
viously unseen data. Let D(r) be a classifier with a parameter r such that varying
r leads to different training accuracies. To account for this variability, here we
use a randomly drawn 1000 objects from the Letter data set. The remaining
19,000 objects were used for testing. A quadratic discriminant classifier (QDC) from PRTOOLS is used.³ We vary the regularization parameter r, r ∈ [0, 1], which specifies to what extent we make use of the data. For r = 0 there is no regularization; we have more accuracy on the training data and less certainty that the classifier will perform well on unseen data. For r = 1, the classifier might be less accurate on the training data, but can be expected to perform at the same rate on unseen data.
This dilemma can be transcribed into everyday language as “specific expertise” ver-
sus “common sense.” If the classifier is trained to expertly recognize a certain data
set, it might have this data-specific expertise and little common sense. This will
show as high testing error. Conversely, if the classifier is trained to have good com-
mon sense, even if not overly successful on the training data, we might expect it to
have common sense with any data set drawn from the same distribution.
In the experiment r was decreased for 20 steps, starting with r0 = 0.4 and taking r_{k+1} to be 0.8 r_k. Figure 1.7 shows the training and the testing errors for the 20 steps.
This example is intended to demonstrate the overtraining phenomenon in the pro-
cess of varying a parameter, therefore we will look at the tendencies in the error
curves. While the training error decreases steadily with r, the testing error decreases
to a certain point, and then increases again. This increase indicates overtraining, that
is, the classifier becomes too much of a data-specific expert and loses common
sense. A common mistake in this case is to declare that the quadratic discriminant
classifier has a testing error of 21.37 percent (the minimum in the bottom plot). The mistake is that the testing set was used to find the best value of r.

Fig. 1.7 Example of overtraining: Letter data set.

³ Discussed in Chapter 2.
Let us use the difference of proportions test for the errors of the classifiers. The
testing error of our quadratic classifier (QDC) at the final 20th step (corresponding to
the minimum training error) is 23.38 percent. Assume that the competing classifier
has a testing error on this data set of 22.00 percent. Table 1.7 summarizes the results
from two experiments. Experiment 1 compares the best testing error found for QDC,
21.37 percent, with the rival classifier’s error of 22.00 percent. Experiment 2 com-
pares the end error of 23.38 percent (corresponding to the minimum training error of
QDC), with the 22.00 percent error. The testing data size in both experiments is
Nts = 19,000.
The results suggest that we would decide differently if we took the best testing
error rather than the testing error corresponding to the best training error. Exper-
iment 2 is the fair comparison in this case.
A point raised by Duin [16] is that the performance of a classifier depends upon
the expertise and the willingness of the designer. There is not much to be done for
classifiers with fixed structures and training procedures (called “automatic” classi-
fiers in Ref. [16]). For classifiers with many training parameters, however, we can
make them work or fail due to designer choices. Keeping in mind that there are
no rules defining a fair comparison of classifiers, here are a few (non-Candide’s)
guidelines:
1. Pick the training procedures in advance and keep them fixed during training.
When publishing, give enough detail so that the experiment is reproducible by
other researchers.
2. Compare modified versions of classifiers with the original (nonmodified) clas-
sifier. For example, a distance-based modification of k-nearest neighbors
(k-nn) should be compared with the standard k-nn first, and then with other
classifier models, for example, neural networks. If a slight modification of a
certain model is being compared with a totally different classifier, then it is
not clear who deserves the credit, the modification or the original model itself.
3. Make sure that all the information about the data is utilized by all classifiers to
the largest extent possible. For example, a clever initialization of a prototype-
based classifier such as the learning vector quantization (LVQ) can make it the favorite among a group of equivalent but randomly initialized prototype clas-
sifiers.
TABLE 1.7 Comparison of Testing Errors of Two Classifiers.

                 e1 (%)    e2 (%)      z     |z| > 1.96?    Outcome
Experiment 1      21.37     22.00   −2.11        Yes        Different (e1 < e2)
Experiment 2      23.38     22.00    4.54        Yes        Different (e1 > e2)
4. Make sure that the testing set has not been seen at any stage of the training.
5. If possible, give also the complexity of the classifier: training and running
times, memory requirements, and so on.
1.5 BAYES DECISION THEORY
1.5.1 Probabilistic Framework
Although many types of uncertainty exist, the probabilistic model fits surprisingly well in most pattern recognition problems. We assume that the class label ω is a random variable taking values in the set of class labels Ω = {ω1, . . . , ωc}. The prior probabilities, P(ωi), i = 1, . . . , c, constitute the probability mass function of the variable ω:

0 ≤ P(ωi) ≤ 1   and   ∑_{i=1}^{c} P(ωi) = 1    (1.28)
We can construct a classifier based on this information only. To make the smallest
possible number of mislabelings, we should always label an object with the class of
the highest prior probability.
However, by measuring the relevant characteristics of the objects, organized as the vector x ∈ R^n, we should be able to make a more accurate decision about this particular object. Assume that the objects from class ωi are distributed in R^n according to the class-conditional probability density function (pdf) p(x|ωi), where p(x|ωi) ≥ 0 for all x ∈ R^n, and

∫_{R^n} p(x|ωi) dx = 1,   i = 1, . . . , c    (1.29)

The likelihood of x ∈ R^n is given by the unconditional pdf

p(x) = ∑_{i=1}^{c} P(ωi) p(x|ωi)    (1.30)

Given the prior probabilities and the class-conditional pdfs we can calculate the posterior probability that the true class label of the measured x is ωi using the Bayes formula

P(ωi|x) = P(ωi) p(x|ωi) / p(x) = P(ωi) p(x|ωi) / ∑_{j=1}^{c} P(ωj) p(x|ωj)    (1.31)
Equation (1.31) gives the probability mass function of the class label variable ω for the observed x. The decision for that particular x should be made with respect to the posterior probability. Choosing the class with the highest posterior probability will lead to the smallest possible mistake when classifying x.
The probability model described above is valid for the discrete case as well. Let x be a discrete variable with possible values in V = {v1, . . . , vs}. The only difference from the continuous-valued case is that instead of class-conditional pdf, we use class-conditional probability mass functions (pmf), P(x|ωi), giving the probability that a particular value from V occurs if we draw at random an object from class ωi. For all pmfs,

0 ≤ P(x|ωi) ≤ 1 for all x ∈ V,   and   ∑_{j=1}^{s} P(vj|ωi) = 1    (1.32)
1.5.2 Normal Distribution
An important example of a class-conditional pdf is the normal distribution, denoted p(x|ωi) ~ N(μi, Σi), where μi ∈ R^n and Σi are the parameters of the distribution. μi is the mean of class ωi, and Σi is an n × n covariance matrix. The class-conditional pdf is calculated as

p(x|ωi) = [1 / ((2π)^{n/2} √|Σi|)] exp( −(1/2) (x − μi)^T Σi^{−1} (x − μi) )    (1.33)

where |Σi| is the determinant of Σi. For the one-dimensional case, x and μi are scalars, and Σi reduces to the variance of x for class ωi, denoted σi². Equation (1.33) simplifies to

p(x|ωi) = [1 / (√(2π) σi)] exp( −(1/2) ((x − μi)/σi)² )    (1.34)

The normal distribution (also called the Gaussian distribution) is the most natural assumption reflecting the following situation: there is an "ideal prototype" of class ωi (a point in R^n) and all class members are distorted versions of it. Small distortions are more likely to occur than large distortions, causing more objects to be located in the close vicinity of the ideal prototype than far away from it. The prototype is represented by the population mean μi and the scatter of the points around it is associated with the covariance matrix Σi.
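As a small numerical illustration of Eq. (1.33), the following Matlab sketch evaluates the normal class-conditional pdf at a point x for an assumed mean and covariance matrix; the particular numbers are arbitrary and serve only as an example.

% Evaluate the normal class-conditional pdf of Eq. (1.33) (illustrative sketch)
x  = [0.5; 1.0];                 % point at which the pdf is evaluated (arbitrary)
mu = [0; 0];                     % assumed class mean
S  = [1.0 0.5; 0.5 2.0];         % assumed covariance matrix (positive definite)
n  = numel(x);
p_x = 1 / ((2*pi)^(n/2) * sqrt(det(S))) * ...
      exp(-0.5 * (x - mu)' * inv(S) * (x - mu));   % Eq. (1.33)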
Example: Data Cloud Shapes and the Corresponding Covariance
Matrices. Figure 1.8 shows four two-dimensional data sets generated from normal
distributions with different covariance matrices as displayed underneath the respect-
ive scatterplot.
Plots (a) and (b) are generated with independent (noninteracting) features, that is,
the data cloud is either spherical (subplot (a)), or stretched along the coordinate axes
(subplot (b)). Notice that for these cases the off-diagonal entries of the covariance
matrix are zeros. Subplots (c) and (d) represent cases where the features are
dependent.
In the case of independent features we can decompose the n-dimensional pdf as a product of n one-dimensional pdfs. Let σik² be the kth diagonal entry of the covariance matrix Σi and μik be the kth component of μi. Then

p(x|ωi) = [1 / ((2π)^{n/2} √|Σi|)] exp( −(1/2) (x − μi)^T Σi^{−1} (x − μi) )
        = ∏_{k=1}^{n} [1 / (√(2π) σik)] exp( −(1/2) ((xk − μik)/σik)² )    (1.35)
The cumulative distribution function for a random variable X ∈ R with a normal distribution, F(z) = P(X ≤ z), is available in tabulated form from any statistical textbook.⁴
1.5.3 Generate Your Own Data
Trivial though it might be, sometimes you need a piece of code to generate your own
data set with specified probabilistic characteristics.
Fig. 1.8 Normally distributed data sets with mean [0, 0]^T and different covariance matrices shown underneath.

⁴ F(z) can be approximated with error at most 0.005 for 0 ≤ z ≤ 2.2 as [25]

F(z) = 0.5 + z(4.4 − z)/10
1.5.3.1 Noisy Geometric Figures. The following example suggests one possible way for achieving this. (The Matlab code is shown in Appendix 1E.)
Suppose we want to generate two classes in R² with prior probabilities 0.6 and 0.4, respectively. Each class will be distributed along a piece of a parametric curve. Let class ω1 have a skeleton of a Lissajous figure with a parameter t, such that

x = a sin(nt),   y = b cos(t),   t ∈ [−π, π]    (1.36)

Pick a = b = 1 and n = 2. The Lissajous figure is shown in Figure 1.9a.
Let class ω2 be shaped as a segment of a line with a parametric equation

x = t,   y = at + b,   for t ∈ [0.3, 1.5]    (1.37)

Let us pick a = −1.4 and b = 1.5. The segment is depicted in Figure 1.9a.
We shall draw random samples with uniform distributions along the skeletons with overlaid normal distributions of specified variances. For ω1 we shall use σ² = 0.005 on both axes and a diagonal covariance matrix. For ω2, we shall use σ1² = 0.01 (1.5 − x)² and σ2² = 0.001. We chose σ1 to vary with x so that smaller x values exhibit larger variance. To design the data set, select the total number of data points T, and follow the list of steps below. The normal distributions for the example are generated within the code. Only the standard (uniform) random generator of Matlab will be used.
Fig. 1.9 (a) The skeletons of the two classes to be generated. (b) The generated data set.

1. Generate a random number r ∈ [0, 1].
2. If r < 0.6, then proceed to generate a point from ω1.
   (a) Generate randomly t in the interval [−π, π].
   (b) Find the point (x, y) on the curve using Eq. (1.36).
   (c) To superimpose the noise generate a series of triples of random numbers u, v within [−3σ, 3σ], and w ∈ [0, 1], until the following condition, coming from the multivariate normal distribution formula (1.33), is met

       w < [1/(2πσ²)] exp( −(1/2) (u² + v²)/σ² )    (1.38)

       where σ² = 0.005.
   (d) Add the new point (x + u, y + v) and a label ω1 for it to the data set.
3. Otherwise (r ≥ 0.6) proceed to generate a point from ω2.
   (a) Generate randomly t in the interval [0.3, 1.5].
   (b) Find the point (x, y) on the line using Eq. (1.37).
   (c) To superimpose the noise generate a series of triples of random numbers

       u ∈ [−3σ1, 3σ1],   v ∈ [−3σ2, 3σ2],   w ∈ [0, 1]

       until the following condition is met

       w < [1/(2πσ1σ2)] exp( −(1/2) (u²/σ1² + v²/σ2²) )    (1.39)

       where σ1² = 0.01 (1.5 − x)² and σ2² = 0.001.
   (d) Add the new point (x + u, y + v) and a label ω2 for it to the data set.
Any pdfs can be simulated in a similar way.
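A minimal Matlab sketch of step 2 (one point from class ω1) is shown below; it is only an illustration of the rejection condition (1.38) with my own variable names, and the complete generator for both classes is the book's Appendix 1E.

% Generate one noisy point on the Lissajous skeleton of class omega_1 (illustrative sketch)
a = 1; b = 1; n = 2; s2 = 0.005;          % parameters given in the text
t = -pi + 2*pi*rand;                      % step 2(a): random t in [-pi, pi]
x = a*sin(n*t); y = b*cos(t);             % step 2(b): point on the curve, Eq. (1.36)
s = sqrt(s2);
accepted = 0;
while ~accepted
    u = -3*s + 6*s*rand;                  % candidate noise on the first axis
    v = -3*s + 6*s*rand;                  % candidate noise on the second axis
    w = rand;
    accepted = w < 1/(2*pi*s2) * exp(-0.5*(u^2 + v^2)/s2);   % condition (1.38)
end
point = [x + u, y + v];                   % step 2(d): noisy point labeled omega_1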
1.5.3.2 Rotated Check-Board Data. The Matlab code below generates a data
set with complicated decision boundaries. The data is two-dimensional and spans
the unit square [0, 1] × [0, 1]. The classes are placed as the white and the black squares of a check-board and then the whole board is rotated at an angle α. A parameter "a" specifies the side of the individual squares. For example, if a = 0.5, then before rotation, there will be four squares in total. Figure 1.10 shows a data set of 100,000 points generated from the Matlab code for two sets of input parameters.
The properties that make this data set interesting for experimental purposes are:
. The two classes are perfectly separable, therefore zero error is the target for
both training and testing.
. The classification regions for the same class are disjoint.
. The boundaries are not parallel to the coordinate axes.
. The classification performance will be highly dependent on the sample size.
The Matlab code for N data points is:
function [d,labd]=gendatcb(N,a,alpha)
% Generate N points of the rotated check-board data.
%   a     - side of the individual squares
%   alpha - rotation angle of the board
d=rand(N,2);                                  % uniform points in the unit square
d_transformed=[d(:,1)*cos(alpha)-d(:,2)*sin(alpha), ...
    d(:,1)*sin(alpha)+d(:,2)*cos(alpha)];     % rotate the points by angle alpha
s=ceil(d_transformed(:,1)/a)+floor(d_transformed(:,2)/a);   % index of the square
labd=2-mod(s,2);                              % class labels 1 and 2 in a check-board pattern
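A data set like the ones in Figure 1.10 can then be produced with a call along the following lines; the particular values of a and alpha are my own illustration, as the settings used for the figure are not stated in the text.

[d, labd] = gendatcb(100000, 0.5, pi/6);            % hypothetical square size and angle
plot(d(labd == 1, 1), d(labd == 1, 2), 'k.'); hold on
plot(d(labd == 2, 1), d(labd == 2, 2), 'c.'); axis square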
Fig. 1.10 Rotated check-board data (100,000 points in each plot).

1.5.4 Discriminant Functions and Decision Boundaries

The class with the highest posterior probability is the most natural choice for a given x. Therefore the posterior probabilities can be used directly as the discriminant functions, that is,

gi(x) = P(ωi|x),   i = 1, . . . , c    (1.40)

Hence we rewrite the maximum membership rule (1.3) as

D(x) = ωi* ∈ Ω   ⟺   P(ωi*|x) = max_{i=1,...,c} {P(ωi|x)}    (1.41)

In fact, a set of discriminant functions leading to the same classification regions would be

gi(x) = P(ωi) p(x|ωi),   i = 1, . . . , c    (1.42)

because the denominator of Eq. (1.31) is the same for all i, and so will not change the ranking order of gi values. Another useful set of discriminant functions derived from the posterior probabilities is

gi(x) = log[ P(ωi) p(x|ωi) ],   i = 1, . . . , c    (1.43)
Example: Decision/Classification Boundaries. Let x ∈ R. Figure 1.11 shows two sets of discriminant functions for three normally distributed classes with

P(ω1) = 0.45,   p(x|ω1) ~ N(4, (2.0)²)
P(ω2) = 0.35,   p(x|ω2) ~ N(5, (1.2)²)
P(ω3) = 0.20,   p(x|ω3) ~ N(7, (1.0)²)

The first set (top plot) depicts a set of functions (1.42), P(ωi)p(x|ωi), i = 1, 2, 3. The classification boundaries are marked with bullets on the x-axis. The posterior probabilities (1.40) are depicted in the bottom plot. The classification regions specified by the boundaries are displayed with different shades of gray in the bottom plot. Note that the same regions are found in both plots.

Fig. 1.11 (a) Plot of two equivalent sets of discriminant functions: P(ω1)p(x|ω1) (the thin line), P(ω2)p(x|ω2) (the dashed line), and P(ω3)p(x|ω3) (the thick line). (b) Plot of the three posterior probability functions P(ω1|x) (the thin line), P(ω2|x) (the dashed line), and P(ω3|x) (the thick line). In both plots x ∈ [0, 10].
Sometimes more than two discriminant functions might tie at the boundaries. Ties are resolved randomly.
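For the three-class example above, the discriminant functions (1.42) and the posterior probabilities (1.40) can be traced numerically on a grid, and the classification regions recovered by taking the maximum at every grid point. The Matlab sketch below is an illustration with my own variable names; it is not the code used to produce Figure 1.11.

% Discriminant functions and decision regions for the example of Figure 1.11 (sketch)
x      = linspace(0, 10, 1001);                 % grid on [0, 10]
priors = [0.45 0.35 0.20];
mu     = [4 5 7];
sigma  = [2.0 1.2 1.0];
g = zeros(3, numel(x));
for i = 1:3
    pdf_i  = 1 ./ (sqrt(2*pi)*sigma(i)) .* exp(-0.5*((x - mu(i))/sigma(i)).^2);  % Eq. (1.34)
    g(i,:) = priors(i) * pdf_i;                 % discriminant functions, Eq. (1.42)
end
posteriors = g ./ repmat(sum(g, 1), 3, 1);      % Eq. (1.31); same regions as g
[gmax, region] = max(g, [], 1);                 % label of the winning class at each grid point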
1.5.5 Bayes Error
Let D* be a classifier that always assigns the class label with the largest posterior probability. Since for every x we can only be correct with probability

P(ωi*|x) = max_{i=1,...,c} {P(ωi|x)}    (1.44)

there is some inevitable error. The overall probability of error of D* is the sum of the errors of the individual xs weighted by their likelihood values, p(x); that is,

Pe(D*) = ∫_{R^n} [1 − P(ωi*|x)] p(x) dx    (1.45)

It is convenient to split the integral into c integrals, one on each classification region. For this case class ωi will be specified by the region's label. Then

Pe(D*) = ∑_{i=1}^{c} ∫_{Ri*} [1 − P(ωi|x)] p(x) dx    (1.46)

where Ri* is the classification region for class ωi, Ri* ∩ Rj* = ∅ for any j ≠ i, and ∪_{i=1}^{c} Ri* = R^n. Substituting Eq. (1.31) into Eq. (1.46) and taking into account that ∑_{i=1}^{c} ∫_{Ri*} = ∫_{R^n},

Pe(D*) = ∑_{i=1}^{c} ∫_{Ri*} [1 − P(ωi)p(x|ωi)/p(x)] p(x) dx    (1.47)
       = ∫_{R^n} p(x) dx − ∑_{i=1}^{c} ∫_{Ri*} P(ωi)p(x|ωi) dx    (1.48)
       = 1 − ∑_{i=1}^{c} ∫_{Ri*} P(ωi)p(x|ωi) dx    (1.49)

Note that Pe(D*) = 1 − Pc(D*), where Pc(D*) is the overall probability of correct classification of D*.
Consider a different classifier, D, which produces classification regions R1, . . . , Rc, Ri ∩ Rj = ∅ for any j ≠ i and ∪_{i=1}^{c} Ri = R^n. Regardless of the way the regions are formed, the error of D is

Pe(D) = ∑_{i=1}^{c} ∫_{Ri} [1 − P(ωi|x)] p(x) dx    (1.50)
The error of D* is the smallest possible error, called the Bayes error. The example below illustrates this concept.

Example: Bayes Error. Consider the simple case of x ∈ R and Ω = {ω1, ω2}. Figure 1.12 displays the discriminant functions in the form gi(x) = P(ωi)p(x|ωi), i = 1, 2, x ∈ [0, 10].
For two classes,

P(ω1|x) = 1 − P(ω2|x)    (1.51)

and Pe(D*) in Eq. (1.46) becomes

Pe(D*) = ∫_{R1*} [1 − P(ω1|x)] p(x) dx + ∫_{R2*} [1 − P(ω2|x)] p(x) dx    (1.52)
       = ∫_{R1*} P(ω2|x) p(x) dx + ∫_{R2*} P(ω1|x) p(x) dx    (1.53)
       = ∫_{R1*} P(ω2)p(x|ω2) dx + ∫_{R2*} P(ω1)p(x|ω1) dx    (1.54)

Fig. 1.12 Plot of two discriminant functions P(ω1)p(x|ω1) (left curve) and P(ω2)p(x|ω2) (right curve) for x ∈ [0, 10]. The light gray area corresponds to the Bayes error, incurred if the optimal decision boundary (denoted by •) is used. The dark gray area corresponds to the additional error when another boundary (denoted by ○) is used.
By design, the classification regions of D* correspond to the true highest posterior probabilities. The bullet on the x-axis in Figure 1.12 splits R into R1* (to the left) and R2* (to the right). According to Eq. (1.54), the Bayes error will be the area under P(ω2)p(x|ω2) in R1* plus the area under P(ω1)p(x|ω1) in R2*. The total area corresponding to the Bayes error is marked in light gray. If the boundary is shifted to the left or right, additional error will be incurred. We can think of this boundary as the result from classifier D, which is an imperfect approximation of D*. The shifted boundary, depicted by an open circle, is called in this example the "real" boundary. Region R1 is therefore R1* extended to the right. The error calculated through Eq. (1.54) is the area under P(ω2)p(x|ω2) in the whole of R1, and extra error will be incurred, measured by the area shaded in dark gray. Therefore, using the true posterior probabilities or an equivalent set of discriminant functions guarantees the smallest possible error rate, called the Bayes error.
Since the true probabilities are never available in practice, it is impossible to cal-
culate the exact Bayes error or design the perfect Bayes classifier. Even if the prob-
abilities were given, it would be difficult to find the classification regions in R^n and
calculate the integrals. Therefore, we rely on estimates of the error as discussed
in Section 1.3.
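To make the last point concrete, the Bayes error of the three-class example of Figure 1.11 can be approximated by numerical integration of Eq. (1.49), replacing the integral by a sum over a grid assumed to be wide enough to capture practically all of the probability mass. The Matlab sketch below is such an approximation, not an exact calculation.

% Numerical approximation of the Bayes error for the example of Figure 1.11 (sketch)
x      = linspace(-5, 15, 20001);  dx = x(2) - x(1);   % grid covering practically all the mass
priors = [0.45 0.35 0.20];
mu     = [4 5 7];
sigma  = [2.0 1.2 1.0];
g = zeros(3, numel(x));
for i = 1:3
    g(i,:) = priors(i) ./ (sqrt(2*pi)*sigma(i)) .* exp(-0.5*((x - mu(i))/sigma(i)).^2);
end
bayes_error = 1 - sum(max(g, [], 1)) * dx;      % Eq. (1.49) with the integral taken as a sum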
1.5.6 Multinomial Selection Procedure for Comparing Classifiers
Alsing et al. [26] propose a different view of classification performance. The classi-
fiers are compared on a labeled data set, relative to each other in order to identify
which classifier has most often been closest to the true class label. We assume
that each classifier gives at its output a set of c posterior probabilities, one for
each class, guessing the chance of that class being the true label for the input vector
x. Since we use labeled data, the posterior probabilities for the correct label of x are
sorted and the classifier with the largest probability is nominated as the winner for
this x.
Suppose we have classifiers D1, . . . , DL to be compared on a data set Z of size N.
The multinomial selection procedure consists of the following steps.
1. For i = 1, . . . , c,
   (a) Use only the Ni data points whose true label is ωi. Initialize an Ni × L performance array T.
   (b) For every point zj, such that l(zj) = ωi, find the estimates of the posterior probability P(ωi|zj) guessed by each classifier. Identify the largest posterior probability, store a value of 1 for the winning classifier Dq by setting T(j, q) = 1 and values 0 for the remaining L − 1 classifiers, T(j, k) = 0, k = 1, . . . , L, k ≠ q.
   (c) Calculate an estimate of each classifier being the winner for class ωi assuming that the number of winnings follows a binomial distribution. The estimate of this probability will be the total number of 1s stored
56. reports of the German losses, according to a prisoner captured later,
gave 600 killed, wounded, and missing.
IN THE PICARDY BATTLE
Franco-American positions south of the Somme and on the Avre
were officially mentioned for the first time in the French War Office
report of April 24, indicating that forces of the United States were
there on the battlefront resisting the great German offensive. The
report stated that an intense bombardment of the positions all along
this front was followed by an attack directed against Hangard-en-
Santerre, the region of Hailles, and Senecat Wood. The Germans
were repulsed almost everywhere.
Formal announcement that American troops sent to reinforce the
allied armies had taken part in the fighting was made by the War
Department in its weekly review of the situation issued on April 29.
Our own forces, the statement read, have taken part in the battle.
American units are in the area east of Amiens. During the
engagements which have raged in this area they have acquitted
themselves well.
UNDER INTENSE FIRE
Another heavy attack was launched by the Germans against the
Americans in the vicinity of Villers-Bretonneux on April 30. It was
repulsed with heavy losses for the enemy. The German
bombardment opened at 5 o'clock in the afternoon and was directed
especially against the Americans, who were supported on the north
and south by the French. The fire was intense, and at the end of two
hours the German commander sent forward three battalions of
infantry. There was hand-to-hand fighting all along the line, as a
result of which the enemy was thrust back, his dead and wounded
lying on the ground in all directions. The French troops were full of
praise for the manner in which the Americans conducted themselves
57. under trying circumstances, especially in view of the fact that they
are fighting at one of the most difficult points on the battlefront. The
American losses were rather severe.
The gallantry of the 300 American engineers who were caught in the
opening of the German offensive on March 21 was the subject of a
dispatch from General Pershing made public by the War Department
on April 19. The engineers were among the forces hastily gathered
by Major Gen. Sanderson Carey, the British commander, who
stopped the gap in the line when General Gough's army was driven
back. [See diagram on Page 389.] During the period of thirteen days
covered by General Pershing's report, the engineers were almost
continuously in action. They were in the very thick of the hardest
days of the great German drive in Picardy.
General Pershing embodied in his report a communication from
General Rawlinson, commander of the British 5th Army, in which the
latter declared that it has been largely due to your assistance that
the enemy is checked. The report covered the fighting period from
March 21 to April 3. The former date marked the beginning of the
Ludendorff offensive along the whole front from La Fère to Croisilles.
It showed that while under shellfire the American engineers
destroyed material dumps at Chaulnes, that they fell back with the
British forces to Moreuil, where the commands laid out trench work,
and were then assigned to a sector of the defensive line at Demuin,
and to a position near Warfusee-Abancourt.
During the period of thirteen days covered by the report the
American engineers had two officers killed and three wounded, while
twenty men were killed, fifty-two wounded, and forty-five reported
missing.
STORY OF CAREY EPISODE
A correspondent of The Associated Press at the front gave this
account of the part played by Americans in the historic episode
58. under General Carey:
A disastrous-looking gap appeared In the 5th Army south
of Hamel in the later stages of the opening battle. The
Germans had crossed the Somme at Hamel and had a
clear path for a sweep southwestward.
No troops were available to throw into the opening. A
certain Brigadier General was commissioned by Major
Gen. Gough, commander of the 5th Army, to gather up
every man he could find and to hold the gap at any cost.
The General called upon the American and Canadian
engineers, cooks, chauffeurs, road workmen, anybody he
could find; gave them guns, pistols, any available weapon,
and rushed them into the gap in trucks, on horseback, or
on mule-drawn limbers.
A large number of machine guns from a machine-gun
school near by were confiscated. Only a few men,
however, knew how to operate the weapons, and they had
to be worked by amateurs with one instructor for every
ten or twelve guns. The Americans did especially well in
handling this arm.
For two days the detachment held the mile and a half gap.
At the end of the second day the commander, having gone
forty-eight hours without sleep, collapsed. The situation of
the detachment looked desperate.
While all were wondering what would happen next, a
dusty automobile came bounding along the road from the
north. It contained Brig. Gen. Carey, who had been home
on leave and who was trying to find his headquarters.
The General was commandeered by the detachment and
he was found to be just the commander needed. He is an
old South African soldier of the daredevil type. He is
59. famous among his men for the scrapes and escapades of
his school-boy life as well as for his daring exploits in
South Africa.
Carey took the detachment in hand and led it in a series
of attacks and counterattacks which left no time for
sleeping and little for eating. He gave neither his men nor
the enemy a rest, attacking first on the north, then in the
centre, then on the south—harassing the enemy
unceasingly with the idea of convincing the Germans that
a large force opposed them.
Whenever the Germans tried to feel him out with an
attack at one point, Carey parried with a thrust
somewhere else, even if it took his last available man, and
threw the Germans on the defensive.
The spirit of Carey's troops was wonderful. The work they
did was almost super-natural. It would have been
impossible with any body of men not physical giants, but
the Americans and Canadians gloried in it. They crammed
every hour of the day full of fighting. It was a constantly
changing battle, kaleidoscopic, free-for-all, catch-as-catch-
can. The Germans gained ground. Carey and his men
were back at them, hungry for more punishment. At the
end of the sixth day, dog-tired and battle-worn, but still
full of fight, the detachment was relieved by a fresh
battalion which had come up from the rear.
STAFF CHANGES
Major Gen. James W. McAndrew, it was announced on May 3, was
appointed Chief of Staff of the American expeditionary force in
succession to Brig. Gen. James G. Harbord, who was assigned to a
command in the field. Other changes on General Pershing's staff
60. included the appointment of Lieut. Col. Robert C. Davis as Adjutant
General, and Colonel Merritte W. Ireland as Surgeon General.
The General Staff of the American expeditionary forces in France, as
the result of several changes in personnel, consisted on May 14,
1918, of the following:
Commander: General John J. Pershing
Aid de Camp: Colonel James L. Collins
Aid de Camp: Colonel Carl Boyd
Aid de Camp: Colonel M. C. Shallenberger
Chief of Staff: Major Gen. J. W. McAndrew
Adjutant: Lieut. Col. Robert C. Davis
Inspector: Brig. Gen. Andre W. Brewster
Judge Advocate: Brig. Gen. Walter A. Bethel
Quartermaster: Brig. Gen. Harry L. Rogers
Surgeon: Colonel Merritte W. Ireland
Engineer: Brig. Gen. Harry Taylor
Ordnance Officer: Brig. Gen. C. B. Wheeler
Signal Officer: Brig. Gen. Edgar Russell
Aviation Officer: Brig. Gen. B. D. Foulois
President Wilson on May 4 pardoned two soldiers of the American
expeditionary force who had been condemned to death by a military
court-martial in France for sleeping on sentry duty and commuted to
nominal prison terms the death sentences imposed on two others for
disobeying orders.
HEALTH OF THE SOLDIERS
Major Hugh H. Young, director of the work of dealing with
communicable blood diseases in our army in France, made this
61. striking statement on May 12 regarding the freedom of the American
expeditionary force from such diseases:
In making plans for this department of medical work in
France it had been calculated by the medical authorities in
Washington to have ten 1,000-bed hospitals, in which a
million men could receive treatment, but with 500,000
Americans in France there is not one of the five allotted
Americans in any of the hospitals now running, and only
500 cases of this type of disease needing hospital
treatment, instead of the expected 5,000.
In other words, instead of having 1 per cent. of our
soldiers in hospitals from social diseases, as had been
expected, the actual number is only one-tenth of 1 per
cent. There is no reason to doubt that this record will be
maintained. The hospitals prepared for this special
treatment are to be used for other cases.
This means that the American Army is the cleanest in the world. The
results, according to Major Young, have been achieved by preventive
steps taken by the American medical directors, coupled with the co-
operation of the men.
62. T
Overseas Forces More Than Half a
Million
Preparing for an Army of 3,000,000
he overseas fighting forces of the United States have been
increasing at a much more rapid rate than the public was aware
of. Early in May the number of our men in France was in excess of
500,000. A great increase in the ultimate size of the army was
further indicated when the War Department asked the House Military
Affairs Committee for a new appropriation of $15,000,000,000.
Mr. Baker, Secretary of War, appeared before the committee on April
23 and, after describing the results of his inspection of the army in
France, said that the size of the army that the United States would
send abroad was entirely dependent upon the shipping situation.
Troops were already moving to France at an accelerated rate.
President Wilson, through Mr. Baker, presented the House Military
Affairs Committee on May 2 with proposals for increasing the army.
The President asked that all limits be removed on the number of
men to be drafted for service. Mr. Baker said that he declined to
discuss the numbers of the proposed army for the double reason
that any number implies a limit, and the only possible limit is our
ability to equip and transport men, which is constantly on the
increase.
The Administration's plans were submitted in detail on May 3, when
the committee began the preparation of the army appropriation bill
carrying $15,000,000,000 to finance the army during the fiscal year
ending June 30, 1919. Mr. Baker again refused to go into the
question of figures, but it became known at the Capitol that the
63. estimates he submitted were based on a force of not fewer than
3,000,000 men and 160,000 officers in the field by July 1, 1919. The
plan contemplated having 130,000 officers and 2,168,000 men, or a
total of 2,298,000, in the field and in camps by July 1, 1918, and
approximately an additional million in the field before June 30, 1919.
Mr. Baker said that all the army camps and cantonments were to be
materially enlarged, to take care of the training of the men to be
raised in the next twelve months. The General Staff had this
question under careful consideration, and the idea was to increase
the size of existing training camps rather than to establish new
camps. These camps, it was estimated, already had facilities for
training close to a million men at one time.
The Secretary of War also made it clear that the total of
$15,000,000,000 involved in the estimates as revised for the new
army bill did not cover the whole cost of the army for the next fiscal
year. The $15,000,000,000, he explained, was in addition to the
large sums that would be carried in the Fortifications Appropriation
bill, which covers the cost of heavy ordnance both in the United
States and overseas. Nor did it include the Military Academy bill. It
was emphasized that, although estimates were submitted on the
basis of an army of a certain size, Congress was being asked for
blanket authority for the President to raise all the men needed, and
the approximate figures of $15,000,000,000 could be increased by
deficiency appropriations.
It was brought out in the committee that the transportation service
had improved and that the War Department was able to send more
men to France each month. It was estimated that if transport
facilities continued to improve, close to 1,500,000 fighting men
would be on the western front by Dec. 31, 1918. The United States
had now in camp and in the field, it was explained to the committee,
the following enlisted men and officers:
Enlisted men 1,765,000
64. Officers 120,000
——
Total 1,885,000
Provost Marshal General Crowder announced on May 8 that
1,227,000 Americans had been called to the colors under the
Selective Draft act, thereby indicating approximately the strength of
the national army. Additional calls during May for men to be in camp
by June 2 affected something like 366,600 registrants under the
draft law. These men were largely intended to fill up the camps at
home, replacing the seasoned personnel from the divisions
previously training there. With the increase of the number of
divisions in France, the flow of replacement troops was increasing
proportionately.
In regard to the number of men in France, Mr. Baker on May 8 made
the following important announcement:
In January I told the Senate committee that there was
strong likelihood that early in the present year 500,000
American troops would be dispatched to France. I cannot
either now or perhaps later discuss the number of
American troops in France, but I am glad to be able to say
that the forecast I made in January has been surpassed.
This was the first official utterance indicating even indirectly the
number of men sent abroad. The first force to go was never
described except as a division, although as a matter of fact it was
constituted into two divisions soon after its arrival in France.
An Associated Press dispatch dated May 17 announced that troops of
the new American Army had arrived within the zone of the British
forces in Northern France and were completing their training in the
area occupied by the armies which were blocking the path of the
Germans to the Channel ports. The British officers who were training
the Americans stated that the men from overseas were of the finest
65. material. The newcomers were warmly greeted by the British troops
and were reported to be full of enthusiasm.
66. I
American Troops in Central France
By Laurence Jerrold
This friendly British view of our soldiers in France is from the pen of a noted war
correspondent of The London Morning Post
have recently visited the miniature America now installed in
France, and installed in the most French part of Central France.
There is nothing more French than these ancient towns with historic
castles, moats, dungeons, and torture chambers, these old villages,
where farms are sometimes still battlemented like small castles, and
this countryside where living is easy and pleasant. On to this heart
of France has descended a whole people from across the ocean, a
people that hails from New England and California, from Virginia and
Illinois. The American Army has taken over this heart of France, and
is teaching it to go some. Townsfolk and villagers enjoy being
taught. The arrival of the American Army is a revelation to them.
I was surprised at first to find how fresh a novelty an allied army
was in this part of France. Then I remembered that these little towns
and villages have in the last few months for the first time seen allies
of France. The ports where the American troops land have seen
many other allies; they saw, indeed, in August, 1914, some of the
first British troops land, whose reception remains in the recollection
of the inhabitants as a scene of such fervor and loving enthusiasm
as had never been known before and probably will not be known
again. In fact, to put it brutally, French ports are blasé. But this
Central France for the first time welcomes allied troops. It is true
they had seen some Russians, but the least said of them now the
better. Some of the Russians are still there, hewing wood for three
francs a day per head, and behaving quite peaceably.
67. These old towns and villages look upon the American Army in their
midst as the greatest miracle they have ever known, and a greater
one than they ever could have dreamed of. One motors through
scores of little towns and villages where the American soldier, in his
khaki, his soft hat, (which I am told is soon to be abolished,) and his
white gaiters, swarms. The villagers put up bunting, calico signs,
flags, and have stocks of American canned goods to show in their
shop windows. The children, when bold, play with the American
soldiers, and the children that are more shy just venture to go up
and touch an American soldier's leg. Very old peasant ladies put on
their Sunday black and go out walking and in some mysterious way
talking with American soldiers. The village Mayor turns out and
makes a speech utterly incomprehensible to the American soldier,
whenever a fresh contingent of the latter arrives. The 1919 class,
just called up, plays bugles and shouts Good morning when an
American car comes by.
Vice versa, this Central France is perhaps even more of a miracle to
the American troops than the American troops are to it. To watch the
American trooper from Arkansas or Chicago being shown over a
castle which is not only older than the United States, but was in its
prime under Louis XII., and dates back to a Roman fortress now
beneath it, is a wonderful sight. Here the American soldier shows
himself a charming child. There is nothing of the Innocents Abroad
about him. I heard scarcely anything (except about telephones and
railways) of any American brag of modernism in this ancient part of
France. On the contrary, the soldier is learning with open eyes, and
trying to learn with open ears, all these wonders of the past among
which he has been suddenly put. The officer, too, even the educated
officer, is beautifully astonished at all this past, which he had read
about, but which, quite possibly, he didn't really believe to exist. The
American officers who speak French—and there are some of them,
coming chiefly from the Southern States—are, of course, heroes in
every town, and sought after in cafés at recreation hours by every
French officer and man. Those who do not know French are learning
it, and I remember a picturesque sight, that of a very elderly, prim
68. French governess in black, teaching French to American subalterns
in a Y.M.C.A. canteen.
A great French preacher the other day, in his sermon in a Paris
church, said that this coming to France of millions of English troops
and future millions of American troops may mean eventually one of
the greatest changes in Continental Europe the world has ever
known. His words never seemed to me so full of meaning as they
did when I was among the Americans in the heart of France. There,
of course, the contrast is infinitely greater than it can be in the
France which our own troops are occupying and defending. These
young, fresh, hustling, keen Americans, building up numerous works
of all kinds to prepare for defending France, have brought with them
Chinese labor and negro labor; and Chinese and negroes and
German and Austrian prisoners all work in these American camps
under American officers' orders. Imagine what an experience, what
a miracle, indeed, this spectacle seems to the country-folk of this old
French soil, who have always lived very quietly, who never wanted to
go anywhere else, and who knew, indeed, that France had allies
fighting and working for her, but had never seen any of them until
these Americans came across three thousand miles of ocean.
Something of a miracle, also, is what our new allies are
accomplishing. They are doing everything on a huge scale. I saw
aviation camps, training camps, aviation schools, vast tracts where
barracks were being put up, railways built, telegraphs and
telephones installed by Chinese labor, negro labor, German prisoners'
labor, under the direction of American skilled workmen, who are in
France by the thousand. There are Y.M.C.A. canteens, Red Cross
canteens, clubs for officers and for men, theatres and cinemas for
the army, and a prodigious amount of food—all come from America.
The hams alone I saw strung up in one canteen would astonish the
boches. American canned goods, meat, fruit, condensed milk, meal,
c., have arrived in France in stupendous quantities. No body of
American troops land in France until what is required for their
sustenance several weeks ahead is already stored in France. Only
69. the smallest necessaries are bought on the spot, and troops passing
through England on their way to France are strictly forbidden, both
officers and men, to buy any article of food whatsoever in England.
As for the quality, the American has nothing to complain of, so far as
I could see. All pastry, cakes, sweets are henceforth prohibited
throughout civilian France, but the American troops rightly have all
these things in plenty. I saw marvelous cakes and tarts, which would
create a run on any Paris or London teashop, and the lady who
manages one American Red Cross canteen (by the way, she is an
Englishwoman, and is looked up to by the American military
authorities as one of the best organizers they have met) explained to
me wonderful recipes they have for making jam with honey and
preserved fruit. The bread, of course, they make themselves, and, as
is right, it is pure white flour bread, such as no civilian knows
nowadays.
One motors through scores of villages and more, and every little old
French spot swarms with American Tommies billeted in cottages and
farmhouses. Many of them marched straight to their billets from
their landing port, and the experience is as wonderful for them, just
spirited over from the wilds of America, as it is for the villagers who
welcome these almost fabulous allies. But it is the engineering,
building, and machinery works the Americans are putting up which
are the most astonishing. Gangs of workers have come over in
thousands. Many of these young chaps are college men, Harvard or
Princeton graduates. They dig and toil as efficiently as any laborer,
and perhaps with more zeal. One American Major told me with glee
how a party of these young workers arrived straight from America at
3:30 P. M., and started digging at 5 A. M. next morning. And they
liked it; it tickled them to death. Many of these drafts, in fact, were
sick and tired of inaction in ports before their departure from
America, and they welcomed work in France as if it were some great
game.
Perhaps the biggest work of all that the Americans are doing is a certain
aviation camp and school. In a few months it has neared completion,
and when it is finished it will, I believe, be the biggest of its kind in
the world. There pilots are trained, and trained in numbers which I
may not say, but which are comforting. The number of airplanes
they use merely for training, which also I must not state, is in itself
remarkable. Training pilots is the one essential thing, I was told by
the C.O. These flying men—or boys—who have, of course, already
been broken in in America, do an additional course in France, and
when they leave the aviation camp I saw they are absolutely ready
for air fighting at the front. This is the finishing school. The aviators
go through eight distinct courses in this school. They are perfected
in flying, in observation, in bombing, in machine-gun firing. On even
a cloudy and windy day the air overhead buzzes with these young
American fliers, all getting into the pink of condition to do their
stunts at the front. They seemed to me as keen as our own flying
men, and as well disciplined. They live in the camp, and it requires
moving heaven and earth for one of them to get leave to go even to
the nearest little quiet old town.
The impression is the same of the American bases in France as of
the American front in France. I found there and here one distinctive
characteristic, the total absence of bluff. I was never once told that
we were going to be shown how to win the war. I was never once
told that America is going to win the war. I never heard that
American men and machines are better than ours, but I did hear
almost apologies from American soldiers because they had not come
into the war sooner. They are, I believe, spending now more money
than we are—indeed, the pay of their officers is about double that of
ours. I said something about the cost. "Yes, but you see, we must
make up for lost time," was all the American General said. And he
told me about the splendid training work that is being done now in
the States by British and French officers who have gone out there
knowing what war is, and who teach American officers and men
from first-hand experience. This particular General hoped that by
this means in a very short time American troops arriving in France
may be sent much more quickly to the front than is now the case.
An impression of complete, businesslike determination is what one
gets when visiting the Americans in France. A discipline even stricter
than that which applies to British and French troops is enforced. In
towns, officers, for instance, are not allowed out after 9 P. M. Some
towns where subalterns discovered the wine of the country have
instantly been put out of bounds. No officer, on any pretext
whatsoever, is allowed to go to Paris, except on official business.
From the camps they are not even allowed to go to the neighboring
towns. They have, to put it quite frankly, a reputation of wild
Americanism to live down, and they sometimes surprise the French
by their seriousness. It is a striking sight to see American officers
and men flocking into tiny little French Protestant churches on
Sundays in this Catholic heart of France. The congregation is a
handful of old French Huguenots, and the ancient, rigid French
pasteur never in his life preached to so many, and certainly never to
soldiers from so far. They come from so far, and from such various
parts, these Americans, and for France, as well as for themselves, it
is a wonderful experience. I was told that the postal censors who
read the letters of the American expeditionary force are required to
know forty-seven languages. Of these languages the two least used
are Chinese and German.
American Shipbuilders Break All Records
Charles M. Schwab Speeds the Work
[Month Ended May 15, 1918]
All shipbuilding records have been broken by American builders
in the last month. On May 14 it was announced that the first
million tons of ships had been completed and delivered to the United
States Government under the direction of the Shipping Board. The
actual figures on May 11 showed the number of ships to be 159,
aggregating 1,108,621 tons. More than half of this tonnage was
delivered since Jan. 1, 1918. Most of these ships were requisitioned
on the ways or in contract form when the United States entered the
war. This result had been anticipated in the monthly records, which
showed a steady increase in the tonnage launched:
Month.       Number of Ships Launched.   Aggregate Tonnage.
January                 11                      91,541
February                16                     123,100
March                   21                     166,700
The rapidity with which ships are being produced was shown by the
breaking of the world's record on April 20 and in turn the breaking of
this record on May 5. On the former date the 8,800-ton steel
steamship West Lianga was launched at Seattle, Wash., fifty-five
working days from the date the keel was laid. This was then the
world's record. But on May 5 at Camden, N. J., the steel freight
steamship Tuckahoe, of 5,548 tons, was launched twenty-seven days
after the keel was laid.
Ten days after this extraordinary achievement the Tuckahoe was
finished and furnished and ready for sea—another record feat.
Charles M. Schwab, Chairman of the Board of Directors of the
Bethlehem Steel Corporation, was on April 16, 1918, appointed
Director General of the Emergency Fleet Corporation to speed up the
Government's shipbuilding program. He was invested with practically
unlimited powers over all construction work in shipyards producing
vessels for the Emergency Fleet Corporation. Charles Piez in
consequence ceased to be General Manager of the Corporation,
remaining, however, as Vice President to supervise administrative
details of construction and placing contracts.
Mr. Schwab, who was the fifth man to be put in charge of the
shipbuilding program, was not desirous of accepting the position
when first approached because he considered his work in producing
steel of first importance in the carrying out of the nation's war
program. But after a conference with President Wilson, Edward N.
Hurley, Chairman of the Shipping Board; Bainbridge Colby, another
member of the board, and Charles Piez, he decided to accept the
new position.
Almost the first thing Mr. Schwab did was to move his headquarters
to Philadelphia as the centre of the steel-shipbuilding region, taking
with him all the division chiefs of the Fleet Corporation directly
connected with construction work and about 2,000 employes. The
Shipping Board and Mr. Piez retained their offices in Washington with
1,500 subordinates and employes. As a further step toward
decentralization it was arranged to move the operating department,
including agencies such as the Interallied Ship Control Committee,
headed by P. A. S. Franklin, to New York City.
The original cost-plus contract under which the Submarine Boat
Corporation of Newark was to build 160 ships of 5,000 tons for the
Government was canceled by Mr. Schwab as an experiment to
determine whether shipyards operating under lump-sum contracts
and accepting all responsibility for providing materials could make
greater speed in construction than those operating with Government
money, such as the Hog Island yards. The result was to increase the
cost of each of the 160 ships from $787,500 to $960,000.
A request for an appropriation of $2,223,835,000 for the 1919
program was presented by Mr. Hurley and Mr. Schwab to the House
Appropriations Committee on May 8.
Of this total $1,386,100,000 was for construction of ships and
$652,000,000 for the purchasing and requisitioning of plants and
material in connection with the building program.
Third Liberty Loan Oversubscribed
Approximately 17,000,000 Buyers
When the Third Liberty Loan, raised to finance America's war
needs, closed on May 4, 1918, the subscriptions were well
over $4,000,000,000, a billion in excess of the amount called for.
The total was announced on May 17 as $4,170,019,650. Secretary
McAdoo stated that he would allot bonds in full on all subscriptions.
The loan was regarded as the most successful ever floated by any
nation, not so much because of the volume of sales, but because of
the wide distribution of the loan. Approximately 17,000,000
individuals subscribed, that is, about one person in every six in the
United States. The number of buyers in the Third Loan exceeded
those in the Second by 7,000,000 and those in the First by
12,500,000.
The campaign throughout the country was conducted with all the
thoroughness of a great political struggle, with the difference that
there were no contending parties and all forces were marshaled to
make the loan a success. Nor was the campaign merely a display of
efficient organization and vigorous propaganda. It had many
features of dramatic and picturesque interest, not only in the large
cities, but in almost every smaller centre of the nation. A noonday
rally of 50,000 men and women in Wall Street, New York, on the
closing day, was typical. An eyewitness described it thus:
The Police Department Band appeared and the band of
the 15th Coast Artillery from Fort Hamilton. Taking
advantage of the occasion, James Montgomery Flagg now
appeared in his studio van on the southern fringe of the
Broad Street crowd. A girl with him played something on
the cornet. It was a good deal like a show on the Midway
at a Western county fair. But this was no faker—one of the
most famous artists in America, throwing in a signed
sketch of whoever bought Liberty bonds. Those near him
began pushing and crowding to take advantage of the
offer.
And now, suddenly, a tremendous racket up the street
toward Broadway. Who comes?
Cheer on cheer, now. It is the Anzacs. Twelve long,
rangy fellows, officers all, six or seven of them with the
little brass A on the shoulder, which signifies service at
Gallipoli and in Flanders. They are members of the
contingent of 500 which arrived here yesterday on its way
to the battlefields of France. They run lightly up the Sub-
Treasury steps and take their stand in a group beside the
soldier band.
And now they all come—all the actors in the drama of the
day. Governor Whitman, bareheaded, solemn-faced; Rabbi
Stephen Wise, with his rugged face and his shock of blue-
black hair; Mme. Schumann-Heink, panting a little with
excitement; Auguste Bouilliz, baritone of the Royal Opera
of Brussels, who later is to thrill them all with his singing
of the Marseillaise; Cecil Arden, in a shining helmet and
draped in the Union Jack, come to sing God Save the
King, while the sunburned Australian officers stand like
statues at salute; Oscar Straus, and then—
Yee-ee-ee-eee.
Oh, how they cheered! For the Blue Devils of France had
poured out of the door of the Sub-Treasury and, with the
fitful sun shining once more and gleaming on their
bayonets, were running down the steps in two lines, past
the Anzacs, past the soldier band, to draw up in ranks at
the bottom.
Lieutenant de Moal speaks. What does he say? Who
knows? But he is wildly cheered, just the same, as he
gives way to Governor Whitman.
There are gatherings like this, though not so large, all
over our land today, cries the Governor. In every town
and city we Americans are gathered together at this
moment to demonstrate that we are behind our army,
behind our navy, behind our President.
The cheers that acclaimed his mention of the President
drowned his voice for several moments.
Here are the Australians, he cries, pointing to the
Anzac officers. They have brought us a message, but
we are going to give them a message, too.
As the Governor stepped back to cheers that rocked the
street, Lieutenant de Moal barked a sharp order, and the
Blue Devils shouldered their guns with fixed bayonets,
the six trumpeters ta-ra-ta-raed, and the soldiers of
France moved off up the sidewalk lane to the side door of
the Stock Exchange, where all business was suspended
during the fifteen minutes of their visit on the floor.
Four of the Anzacs meanwhile were taken from their
ranks on the steps of the building up to the pedestal of
the statue of Washington, which was used as speaker's
platform, and Captain Frank McCallam made a brief
address.
We haven't many men left, he said simply. And it is up
to you people to help us out to the best of your ability.
More cheers, and then Cecil Arden sang God Save the
King. The American regulars fired a blank volley over the
heads of the crowd, and the kids scrambled for the empty
shells.
Following Wise and Straus, Bouilliz, the Belgian baritone,
sang the Marseillaise, and then, after the soldier band
had played Where Do We Go from Here, Boys? Mme.
Schumann-Heink advanced and sang the national anthem,
following it up with an appeal that was the climax to the
play.
Less exciting but more impressive was the parade on April 26, when
thousands of mothers who had sent their sons to the front marched
in a column of 35,000 men and women in the Liberty Day parade in
New York City. This day had been proclaimed as such by President
Wilson for the people of the United States to assemble in their
respective communities and liberally pledge anew their financial
support to sustain the nation's cause, and to hold patriotic
demonstrations in every city, town, and hamlet throughout the land.
The challenge of the mothers was inscribed on one of the banners
they carried: We give our sons—they give their lives—what do you
give?
Remarkable as was the appearance of these mothers with the little
service flags over their shoulders, many of them so old that they
marched with difficulty, the spectators who flanked the line of march
along Fifth Avenue from Washington Square to Fifty-ninth Street
found it even more thrilling to note that so very many of them,
whether they were mothers or young wives, or just young girls
proud of the brothers that had gone forth to service—so very many
of them carried service flags with three and four and five and even
six stars, and occasionally a glint of the sun would even carry the
eye to a gold star, which meant, whenever it appeared, a veil of
mourning for a wooden cross somewhere in France.
Among the minor but ingenious forms of publicity was the Liberty
Loan ball which was rolled from Buffalo to New York, a distance of
470 miles, and which ended its journey of three weeks on May 4 at
the City Hall. The ball was a large steel shell covered with canvas.
Every community that reached or exceeded its quota to the loan was
entitled to raise a flag of honor specially designed for the purpose.
At least 32,000 communities gained the honor and raised the flag.
To strengthen the financial basis of the nation's war industries and
use monetary resources to the best advantage the War Finance
Corporation bill was passed by Congress and approved by President
Wilson on April 5, 1918. The two main purposes of the act are to
provide credits for industries and enterprises necessary or
contributory to the prosecution of the war and to supervise new
issues of capital. The act creates the War Finance Corporation,
consisting of the Secretary of the Treasury and four additional persons, with
$500,000,000 capital stock, all subscribed by the United States.
Banks and trust companies financing war industries or enterprises
may receive advances from the corporation.
Former War Loans of the United States
A Historical Retrospect
The United States Government asked for $2,000,000,000 on the First
Liberty Loan in the Spring of 1917, and $3,034,000,000 was
subscribed by over 4,000,000 subscribers. For the Second Loan, near
the end of 1917, $3,000,000,000 was sought, and $4,617,532,300
was subscribed by 9,420,000 subscribers.
The Guaranty Trust Company of New York in a recent brochure
reviewed the history of the various war loans of the United States,
beginning with the Revolutionary loans, as follows:
When the patriots at Lexington fired the shot heard 'round the
world, the thirteen Colonies found themselves suddenly in
the midst of war, but with practically no funds in their Treasuries.
The Continental Congress was without power to raise money by
taxation, and had to depend upon credit bills and requisitions drawn
against the several Colonies. France was the first foreign country to
come to the aid of struggling America, the King of France himself
advancing us our first loan. All told, France's loan was $6,352,500;
Holland loaned us $1,304,000; and Spain assisted us with $174,017.
Our loan from France was repaid between 1791 and 1795 to the
Revolutionary Government of France; the Holland loan during the
same period in five annual installments, and the Spanish loan in
1792-3.
Our first domestic war loan of £6,000 was made in 1775, and the
loan was taken at par. A year and a half later found Congress
laboring under unusual difficulties. Boston and New York were held
by the enemy, the patriot forces were retreating, and the people
were as little inclined to submit to domestic taxation as they had
formerly been to taxation without representation. To raise funds
even a lottery was attempted. In October, 1776, Congress authorized
a second loan for $5,000,000. It was not a pronounced success, only
$3,787,000 being raised in twelve months. In 1778 fourteen issues
of paper money were authorized as the only way to meet the
expenses of the army. By the end of the year 1779 Congress had
issued $200,000,000 in paper money, while a like amount had been
issued by the several States. In 1781, as a result of this financing
and of the general situation, Continental bills of credit had fallen 99
per cent.
Then came Robert Morris, that genius of finance, who found ways to
raise the money which assured the triumph of the American cause.
By straining his personal credit, which was higher than that of the
Government, he borrowed upon his own individual security on every
hand. On one occasion he borrowed from the commander of the
French fleet, securing the latter with his personal obligation. If
Morris and other patriotic citizens had not rendered such assistance
to the Government, some of the most important campaigns of the
Revolutionary War would have been impossible. Following came the
Bank of Pennsylvania, which issued its notes—in effect, loans—to
provide rations and equipment for Washington's army at Valley
Forge. These notes were secured by bills of exchange drawn against
our envoys abroad, but it was never seriously intended that they
should be presented for payment. The bank was a tremendous
success in securing the money necessary to carry out its patriotic
purposes, and was practically the first bank of issue in this country.
With the actual establishment of the United States and the adoption
of the Constitution, Alexander Hamilton came forward with a funding
scheme by which the various debts owed to foreign countries, to
private creditors, and to the several States were combined. In 1791,
on a specie basis, our total debt was $75,000,000. The paper dollar
was practically valueless and the people were forced to give the
Government adequate powers to raise money and to impose taxes.
Between that date and 1812 thirteen tariff bills were passed to raise
money to meet public expenditures and pay off the national debt.
THE WAR OF 1812
For some time previous to the actual outbreak of the War of 1812
hostilities had been predicted. In a measure, this enabled Congress
to prepare for it. And although the war did not begin until June of
1812, as early as March of that year a loan of $11,000,000, bearing
6 per cent. at par, to be paid off within 12 years from the beginning
of 1813, was authorized. Of this, however, only $2,150,000 was
issued, and all was redeemed by 1817. The next year a loan of
$16,000,000 was authorized and subscribed. This was followed, in
August, by a loan of $7,500,000 which sold at 88-1/4 per cent.
At the end of the war the total loans negotiated by the Government
aggregated $88,000,000. The nation's public debt, as a result of this
war, was increased to $127,334,933 in 1816. By 1835, either by
redemptions or maturity, it was all paid.
MEXICAN WAR LOANS
The Mexican War net debt incurred by the United States was
approximately $49,000,000 and was financed by loans in the form of
Treasury notes and Government stock. The Treasury notes, under
the act of 1846, totaled $7,687,800 and the stock $4,999,149. The
latter paid 6 per cent. interest. By act of 1847 Treasury notes to the
amount of $26,122,100 were issued, bearing interest in the
discretion of the Secretary of the Treasury, reimbursable one and
two years after date, and convertible into United States stock at 6
per cent. They were redeemable after Dec. 31, 1867. Economic
developments following this war led to a period of extraordinary
industrial prosperity which lasted for several years. A change in the
fiscal policy of the Government, with overexpansion of industry,
however, resulted in a panic in 1857 and a Treasury deficit in 1858.
The debt contracted in consequence of the Mexican War was
redeemed in full by 1874.
The situation had not improved to any great extent when Lincoln
took office on March 4, 1861, and by mid-November of that year a
panic was in full swing. The outbreak of the civil war found the
Treasury empty and the financial machinery of the Government
seriously disorganized. Public credit was low, the public mind was
disturbed, and raising money was difficult. In 1862 the Legal Tender
act was passed, authorizing an issue of $150,000,000 of legal-tender
notes, and an issue of bonds in the amount of $500,000,000 was
authorized.
This proved to be a most popular loan. The bonds were subject to
redemption after five years and were payable in twenty years. They
bore interest at 6 per cent., payable semi-annually, and were issued
in denominations of $50, $100, $500, and $1,000. Through one
agent, Jay Cooke, a genius at distribution, who employed 2,850 sub-
agents and advertised extensively, this loan was placed directly with
the people at par in currency. Altogether the aggregate of this loan
was $514,771,600. Later in that year Congress authorized a second
issue of Treasury notes in the amount of $150,000,000 at par, with
interest at 6 per cent.; in January, 1863, a third issue of
$100,000,000 was authorized, which was increased in March to
$150,000,000, at 5 per cent. interest. These issues were referred to
as the one and two year issues of 1863.
DEFICIT IN 1862
In December, 1862, Congress had to face a deficit of $277,000,000
and unpaid requisitions amounting to $47,000,000. By the close of
1863 nearly $400,000,000 had been raised by bond sales. A further
loan act, passed March 3, 1864, provided for an issue of
$200,000,000 of 5 per cent. bonds known as ten-fortys, but of this
total only $73,337,000 was disposed of. Subsequently, on June 30,
1864, a great public loan of $200,000,000 was authorized. This was
an issue of Treasury notes, payable at any time not exceeding three
years, and bearing interest at 7-3/10 per cent. Notes amounting to
$828,800,000 were sold. The aggregate of Government loans during
the civil war footed up a total of $2,600,700,000; and on Sept. 1,
1865, the public debt closely approached $3,000,000,000, less than
one-half of which was funded.
Civil war loans, with one exception, which sold at 89-3/10, were all
placed at par in currency, subject to commissions ranging from an
eighth to one per cent. to distributing bankers. The average interest
nominally paid by the Government on its bonds during the war was
slightly under 6 per cent. Owing to payment being made in currency,
however, the rate was, in reality, much higher. With the conclusion of
the war, the reduction of the public debt was undertaken, and it has
continued with but two interruptions to date.
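To make the currency point concrete, here is a hedged illustration; the figures are assumed for the arithmetic only (a greenback worth 60 cents in specie, interest reckoned in coin) and are not taken from the article. A bond bought at par with depreciated currency cost its buyer less in specie than its face value, so the same coin interest represented a higher rate on the real outlay:

\[
\text{effective rate} = \frac{\text{annual interest in coin}}{\text{specie value of the purchase price}}
= \frac{0.06 \times \$100}{0.60 \times \$100} = \frac{\$6}{\$60} = 10\ \text{per cent.}
\]

On such a valuation the nominal 6 per cent. becomes an effective 10 per cent., which is the sense in which the rate paid was, in reality, much higher.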
Heavy tax receipts for several years after the close of the war
potentially enabled the Government to reduce its debt. Indeed, from
1866 to 1891, each year's ordinary receipts exceeded
disbursements, and enabled the Government to lighten its financial
burdens. In 1866 the decrease in the net debt was $120,395,408; in
1867, $127,884,952; in 1868, $27,297,798; in 1869, $48,081,540; in
1870, $101,601,917; in 1871, $84,175,888; in 1872, $97,213,538,
and in 1873, $44,318,470.
Through refunding operations—in addition to bonds and short-time
obligations redeemed with surplus revenues—the Government paid
off, up to 1879, $535,000,000 bonds bearing interest at from 5 to 6
per cent. In this year the credit of the Government was on a 4 per
cent. basis, and a year later on a 3-1/4 per cent. basis, against a
maximum basis of 15-1/2 per cent. in 1864.
Between 1881 and 1887 the Government paid off, either with
surplus revenues or by conversion, $618,000,000 of interest-bearing
debt. In 1891 all bonds then redeemable were retired, and on July 1,
1893, the public debt amounted to less than one-third of the
maximum outstanding in 1865. In 1900 the Government converted
$445,900,000 bonds out of an aggregate of $839,000,000
convertible under the refunding act passed by Congress in that year.
And further conversions in 1903, 1905, and 1907 brought the grand
total up to $647,250,150—a result which earned for the Government
a net annual saving in interest account of $16,551,037.
SPANISH WAR LOANS
The United States is a debt-paying nation. Hence, America's credit,
despite occasional fluctuations, has steadily risen, and our national
debt has sold on a lower income basis than that of any other nation
in the world.
Following the sinking of the Maine in Havana Harbor, in 1898,
Congress authorized an issue of $200,000,000 3 per cent. ten-
twenty-year bonds. Of this aggregate $198,792,660 were sold by the
Government at par. So popular was this loan, it was oversubscribed
seven times. During the year 1898, following the allotment to the
public, this issue sold at a premium, the price going to 107-3/4, and,
during the next year, to 110-3/4. After the war ended, the
Government, in accordance with its unvarying custom, began to pay
off this debt; but, despite the Secretary of the Treasury's offer to buy
these bonds, he succeeded in purchasing only about $20,000,000 of
them.