Neural Networks and Deep Learning
A Textbook
Charu C. Aggarwal
Charu C. Aggarwal
IBM T. J. Watson Research Center
International Business Machines
Yorktown Heights, NY, USA
ISBN 978-3-319-94462-3 ISBN 978-3-319-94463-0 (eBook)
https://doi.org/10.1007/978-3-319-94463-0
Library of Congress Control Number: 2018947636
© Springer International Publishing AG, part of Springer Nature 2018
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on
microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, com-
puter software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and
therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be
true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or
implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher
remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my wife Lata, my daughter Sayani,
and my late parents Dr. Prem Sarup and Mrs. Pushplata Aggarwal.
Preface
“Any A.I. smart enough to pass a Turing test is smart enough to know to fail
it.”—Ian McDonald
Neural networks were developed to simulate the human nervous system for machine
learning tasks by treating the computational units in a learning model in a manner similar
to human neurons. The grand vision of neural networks is to create artificial intelligence
by building machines whose architecture simulates the computations in the human ner-
vous system. This is obviously not a simple task because the computational power of the
fastest computer today is a minuscule fraction of the computational power of a human
brain. Neural networks were developed soon after the advent of computers in the fifties and
sixties. Rosenblatt’s perceptron algorithm was seen as a fundamental cornerstone of neural
networks, which caused an initial excitement about the prospects of artificial intelligence.
However, after the initial euphoria, there was a period of disappointment in which the data
hungry and computationally intensive nature of neural networks was seen as an impediment
to their usability. Eventually, at the turn of the century, greater data availability and increasing
computational power led to increased successes of neural networks, and this area
was reborn under the new label of “deep learning.” Although we are still far from the day
that artificial intelligence (AI) is close to human performance, there are specific domains
like image recognition, self-driving cars, and game playing, where AI has matched or ex-
ceeded human performance. It is also hard to predict what AI might be able to do in the
future. For example, few computer vision experts would have thought two decades ago that
any automated system could ever perform an intuitive task like categorizing an image more
accurately than a human.
Neural networks are theoretically capable of learning any mathematical function with
sufficient training data, and some variants like recurrent neural networks are known to be
Turing complete. Turing completeness refers to the fact that a neural network can simulate
any learning algorithm, given sufficient training data. The sticking point is that the amount
of data required to learn even simple tasks is often extraordinarily large, which causes a
corresponding increase in training time (if we assume that enough training data is available
in the first place). For example, the training time for image recognition, which is a simple
task for a human, can be on the order of weeks even on high-performance systems. Fur-
thermore, there are practical issues associated with the stability of neural network training,
which are being resolved even today. Nevertheless, given that the speed of computers is
expected to increase rapidly over time, and fundamentally more powerful paradigms like
quantum computing are on the horizon, the computational issue might not eventually turn
out to be quite as critical as imagined.
Although the biological analogy of neural networks is an exciting one and evokes com-
parisons with science fiction, the mathematical understanding of neural networks is a more
mundane one. The neural network abstraction can be viewed as a modular approach of
enabling learning algorithms that are based on continuous optimization on a computational
graph of dependencies between the input and output. To be fair, this is not very different
from traditional work in control theory; indeed, some of the methods used for optimization
in control theory are strikingly similar to (and historically preceded) the most fundamental
algorithms in neural networks. However, the large amounts of data available in recent years
together with increased computational power have enabled experimentation with deeper
architectures of these computational graphs than was previously possible. The resulting
success has changed the broader perception of the potential of deep learning.
The chapters of the book are organized as follows:
1. The basics of neural networks: Chapter 1 discusses the basics of neural network design.
Many traditional machine learning models can be understood as special cases of neural
learning. Understanding the relationship between traditional machine learning and
neural networks is the first step to understanding the latter. The simulation of various
machine learning models with neural networks is provided in Chapter 2. This will give
the analyst a feel of how neural networks push the envelope of traditional machine
learning algorithms.
2. Fundamentals of neural networks: Although Chapters 1 and 2 provide an overview
of the training methods for neural networks, a more detailed understanding of the
training challenges is provided in Chapters 3 and 4. Chapters 5 and 6 present radial-
basis function (RBF) networks and restricted Boltzmann machines.
3. Advanced topics in neural networks: A lot of the recent success of deep learning is a
result of the specialized architectures for various domains, such as recurrent neural
networks and convolutional neural networks. Chapters 7 and 8 discuss recurrent and
convolutional neural networks. Several advanced topics like deep reinforcement learn-
ing, neural Turing mechanisms, and generative adversarial networks are discussed in
Chapters 9 and 10.
We have taken care to include some of the “forgotten” architectures like RBF networks
and Kohonen self-organizing maps because of their potential in many applications. The
book is written for graduate students, researchers, and practitioners. Numerous exercises
are available along with a solution manual to aid in classroom teaching. Where possible, an
application-centric view is highlighted in order to give the reader a feel for the technology.
Throughout this book, a vector or a multidimensional data point is annotated with a bar,
such as $\overline{X}$ or $\overline{y}$. A vector or multidimensional point may be denoted by either small letters
or capital letters, as long as it has a bar. Vector dot products are denoted by centered dots,
such as $\overline{X} \cdot \overline{Y}$. A matrix is denoted in capital letters without a bar, such as $R$. Throughout
the book, the $n \times d$ matrix corresponding to the entire training data set is denoted by
$D$, with $n$ documents and $d$ dimensions. The individual data points in $D$ are therefore
$d$-dimensional row vectors. On the other hand, vectors with one component for each data
point are usually $n$-dimensional column vectors. An example is the $n$-dimensional column
vector $\overline{y}$ of class variables of $n$ data points. An observed value $y_i$ is distinguished from a
predicted value $\hat{y}_i$ by a circumflex at the top of the variable.
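As a small, purely illustrative example of these conventions (the numbers are arbitrary), a training set with $n = 3$ points in $d = 2$ dimensions would be written as

$$
D = \begin{pmatrix} 1 & 0\\ 2 & 1\\ 0 & 3 \end{pmatrix} \in \mathbb{R}^{3 \times 2},
\qquad
\overline{y} = \begin{pmatrix} +1\\ -1\\ +1 \end{pmatrix},
$$

so that each row of $D$ is a 2-dimensional data point, $y_2 = -1$ is an observed class variable, and $\hat{y}_2$ denotes the corresponding predicted value.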
Yorktown Heights, NY, USA Charu C. Aggarwal
Acknowledgments
I would like to thank my family for their love and support during the busy time spent
in writing this book. I would also like to thank my manager Nagui Halim for his support
during the writing of this book.
Several figures in this book have been provided by the courtesy of various individuals
and institutions. The Smithsonian Institution made the image of the Mark I perceptron
(cf. Figure 1.5) available at no cost. Saket Sathe provided the outputs in Chapter 7 for
the tiny Shakespeare data set, based on code available/described in [233, 580]. Andrew
Zisserman provided Figures 8.12 and 8.16 in the section on convolutional visualizations.
Another visualization of the feature maps in the convolution network (cf. Figure 8.15) was
provided by Matthew Zeiler. NVIDIA provided Figure 9.10 on the convolutional neural
network for self-driving cars in Chapter 9, and Sergey Levine provided the image on self-
learning robots (cf. Figure 9.9) in the same chapter. Alec Radford provided Figure 10.8,
which appears in Chapter 10. Alex Krizhevsky provided Figure 8.9(b) containing AlexNet.
This book has benefitted from significant feedback and several collaborations that I have
had with numerous colleagues over the years. I would like to thank Quoc Le, Saket Sathe,
Karthik Subbian, Jiliang Tang, and Suhang Wang for their feedback on various portions of
this book. Shuai Zheng provided feedback on the section on regularized autoencoders in
Chapter 4. I received feedback on the sections on autoencoders from Lei Cai and Hao Yuan.
Feedback on the chapter on convolutional neural networks was provided by Hongyang Gao,
Shuiwang Ji, and Zhengyang Wang. Shuiwang Ji, Lei Cai, Zhengyang Wang and Hao Yuan
also reviewed Chapters 3 and 7 and suggested several edits. They also suggested the
ideas of using Figures 8.6 and 8.7 for elucidating the convolution/deconvolution operations.
For their collaborations, I would like to thank Tarek F. Abdelzaher, Jinghui Chen, Jing
Gao, Quanquan Gu, Manish Gupta, Jiawei Han, Alexander Hinneburg, Thomas Huang,
Nan Li, Huan Liu, Ruoming Jin, Daniel Keim, Arijit Khan, Latifur Khan, Mohammad M.
Masud, Jian Pei, Magda Procopiuc, Guojun Qi, Chandan Reddy, Saket Sathe, Jaideep Sri-
vastava, Karthik Subbian, Yizhou Sun, Jiliang Tang, Min-Hsuan Tsai, Haixun Wang, Jiany-
ong Wang, Min Wang, Suhang Wang, Joel Wolf, Xifeng Yan, Mohammed Zaki, ChengXiang
Zhai, and Peixiang Zhao. I would also like to thank my advisor James B. Orlin for his guid-
ance during my early years as a researcher.
I would like to thank Lata Aggarwal for helping me with some of the figures created
using PowerPoint graphics in this book. My daughter, Sayani, was helpful in incorporating
special effects (e.g., image color, contrast, and blurring) in several JPEG images used at
various places in this book.
Contents
1 An Introduction to Neural Networks 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Humans Versus Computers: Stretching the Limits
of Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 The Basic Architecture of Neural Networks . . . . . . . . . . . . . . . . . . 4
1.2.1 Single Computational Layer: The Perceptron . . . . . . . . . . . . . 5
1.2.1.1 What Objective Function Is the Perceptron Optimizing? . 8
1.2.1.2 Relationship with Support Vector Machines . . . . . . . . . 10
1.2.1.3 Choice of Activation and Loss Functions . . . . . . . . . . 11
1.2.1.4 Choice and Number of Output Nodes . . . . . . . . . . . . 14
1.2.1.5 Choice of Loss Function . . . . . . . . . . . . . . . . . . . . 14
1.2.1.6 Some Useful Derivatives of Activation Functions . . . . . . 16
1.2.2 Multilayer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.3 The Multilayer Network as a Computational Graph . . . . . . . . . 20
1.3 Training a Neural Network with Backpropagation . . . . . . . . . . . . . . . 21
1.4 Practical Issues in Neural Network Training . . . . . . . . . . . . . . . . . . 24
1.4.1 The Problem of Overfitting . . . . . . . . . . . . . . . . . . . . . . . 25
1.4.1.1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4.1.2 Neural Architecture and Parameter Sharing . . . . . . . . . 27
1.4.1.3 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.4.1.4 Trading Off Breadth for Depth . . . . . . . . . . . . . . . . 27
1.4.1.5 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . 28
1.4.2 The Vanishing and Exploding Gradient Problems . . . . . . . . . . . 28
1.4.3 Difficulties in Convergence . . . . . . . . . . . . . . . . . . . . . . . . 29
1.4.4 Local and Spurious Optima . . . . . . . . . . . . . . . . . . . . . . . 29
1.4.5 Computational Challenges . . . . . . . . . . . . . . . . . . . . . . . . 29
1.5 The Secrets to the Power of Function Composition . . . . . . . . . . . . . . 30
1.5.1 The Importance of Nonlinear Activation . . . . . . . . . . . . . . . . 32
1.5.2 Reducing Parameter Requirements with Depth . . . . . . . . . . . . 34
1.5.3 Unconventional Neural Architectures . . . . . . . . . . . . . . . . . . 35
1.5.3.1 Blurring the Distinctions Between Input, Hidden,
and Output Layers . . . . . . . . . . . . . . . . . . . . . . . 35
1.5.3.2 Unconventional Operations and Sum-Product Networks . . 36
1.6 Common Neural Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.6.1 Simulating Basic Machine Learning with Shallow Models . . . . . . 37
1.6.2 Radial Basis Function Networks . . . . . . . . . . . . . . . . . . . . 37
1.6.3 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 38
1.6.4 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 38
1.6.5 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . 40
1.6.6 Hierarchical Feature Engineering and Pretrained Models . . . . . . . 42
1.7 Advanced Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.7.1 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.7.2 Separating Data Storage and Computations . . . . . . . . . . . . . . 45
1.7.3 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . 45
1.8 Two Notable Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1.8.1 The MNIST Database of Handwritten Digits . . . . . . . . . . . . . 46
1.8.2 The ImageNet Database . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.10 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.10.1 Video Lectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1.10.2 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2 Machine Learning with Shallow Neural Networks 53
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.2 Neural Architectures for Binary Classification Models . . . . . . . . . . . . 55
2.2.1 Revisiting the Perceptron . . . . . . . . . . . . . . . . . . . . . . . . 56
2.2.2 Least-Squares Regression . . . . . . . . . . . . . . . . . . . . . . . . 58
2.2.2.1 Widrow-Hoff Learning . . . . . . . . . . . . . . . . . . . . . 59
2.2.2.2 Closed Form Solutions . . . . . . . . . . . . . . . . . . . . . 61
2.2.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.2.3.1 Alternative Choices of Activation and Loss . . . . . . . . . 63
2.2.4 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . 63
2.3 Neural Architectures for Multiclass Models . . . . . . . . . . . . . . . . . . 65
2.3.1 Multiclass Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.3.2 Weston-Watkins SVM . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.3.3 Multinomial Logistic Regression (Softmax Classifier) . . . . . . . . . 68
2.3.4 Hierarchical Softmax for Many Classes . . . . . . . . . . . . . . . . . 69
2.4 Backpropagated Saliency for Feature Selection . . . . . . . . . . . . . . . . 70
2.5 Matrix Factorization with Autoencoders . . . . . . . . . . . . . . . . . . . . 70
2.5.1 Autoencoder: Basic Principles . . . . . . . . . . . . . . . . . . . . . . 71
2.5.1.1 Autoencoder with a Single Hidden Layer . . . . . . . . . . 72
2.5.1.2 Connections with Singular Value Decomposition . . . . . . 74
2.5.1.3 Sharing Weights in Encoder and Decoder . . . . . . . . . . 74
2.5.1.4 Other Matrix Factorization Methods . . . . . . . . . . . . . 76
2.5.2 Nonlinear Activations . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.5.3 Deep Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.5.4 Application to Outlier Detection . . . . . . . . . . . . . . . . . . . . 80
2.5.5 When the Hidden Layer Is Broader than the Input Layer . . . . . . 81
2.5.5.1 Sparse Feature Learning . . . . . . . . . . . . . . . . . . . . 81
2.5.6 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
2.5.7 Recommender Systems: Row Index to Row Value Prediction . . . . 83
2.5.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
2.6 Word2vec: An Application of Simple Neural Architectures . . . . . . . . . . 87
2.6.1 Neural Embedding with Continuous Bag of Words . . . . . . . . . . 87
2.6.2 Neural Embedding with Skip-Gram Model . . . . . . . . . . . . . . . 90
2.6.3 Word2vec (SGNS) Is Logistic Matrix Factorization . . . . . . . . . . 95
2.6.4 Vanilla Skip-Gram Is Multinomial Matrix Factorization . . . . . . . 98
2.7 Simple Neural Architectures for Graph Embeddings . . . . . . . . . . . . . 98
2.7.1 Handling Arbitrary Edge Counts . . . . . . . . . . . . . . . . . . . . 100
2.7.2 Multinomial Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
2.7.3 Connections with DeepWalk and Node2vec . . . . . . . . . . . . . . 100
2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
2.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
2.9.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
2.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3 Training Deep Neural Networks 105
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.2 Backpropagation: The Gory Details . . . . . . . . . . . . . . . . . . . . . . . 107
3.2.1 Backpropagation with the Computational Graph Abstraction . . . . 107
3.2.2 Dynamic Programming to the Rescue . . . . . . . . . . . . . . . . . 111
3.2.3 Backpropagation with Post-Activation Variables . . . . . . . . . . . 113
3.2.4 Backpropagation with Pre-activation Variables . . . . . . . . . . . . 115
3.2.5 Examples of Updates for Various Activations . . . . . . . . . . . . . 117
3.2.5.1 The Special Case of Softmax . . . . . . . . . . . . . . . . . 117
3.2.6 A Decoupled View of Vector-Centric Backpropagation . . . . . . . . 118
3.2.7 Loss Functions on Multiple Output Nodes and Hidden Nodes . . . . 121
3.2.8 Mini-Batch Stochastic Gradient Descent . . . . . . . . . . . . . . . . 121
3.2.9 Backpropagation Tricks for Handling Shared Weights . . . . . . . . 123
3.2.10 Checking the Correctness of Gradient Computation . . . . . . . . . 124
3.3 Setup and Initialization Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 125
3.3.1 Tuning Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . 125
3.3.2 Feature Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.3.3 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
3.4 The Vanishing and Exploding Gradient Problems . . . . . . . . . . . . . . . 129
3.4.1 Geometric Understanding of the Effect of Gradient Ratios . . . . . . 130
3.4.2 A Partial Fix with Activation Function Choice . . . . . . . . . . . . 133
3.4.3 Dying Neurons and “Brain Damage” . . . . . . . . . . . . . . . . . . 133
3.4.3.1 Leaky ReLU . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.4.3.2 Maxout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.5 Gradient-Descent Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.5.1 Learning Rate Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
3.5.2 Momentum-Based Learning . . . . . . . . . . . . . . . . . . . . . . . 136
3.5.2.1 Nesterov Momentum . . . . . . . . . . . . . . . . . . . . . 137
3.5.3 Parameter-Specific Learning Rates . . . . . . . . . . . . . . . . . . . 137
3.5.3.1 AdaGrad . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.5.3.2 RMSProp . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.5.3.3 RMSProp with Nesterov Momentum . . . . . . . . . . . . . 139
3.5.3.4 AdaDelta . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.5.3.5 Adam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
3.5.4 Cliffs and Higher-Order Instability . . . . . . . . . . . . . . . . . . . 141
3.5.5 Gradient Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
3.5.6 Second-Order Derivatives . . . . . . . . . . . . . . . . . . . . . . . . 143
3.5.6.1 Conjugate Gradients and Hessian-Free Optimization . . . . 145
3.5.6.2 Quasi-Newton Methods and BFGS . . . . . . . . . . . . . . 148
3.5.6.3 Problems with Second-Order Methods: Saddle Points . . . 149
3.5.7 Polyak Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
3.5.8 Local and Spurious Minima . . . . . . . . . . . . . . . . . . . . . . . 151
3.6 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
3.7 Practical Tricks for Acceleration and Compression . . . . . . . . . . . . . . 156
3.7.1 GPU Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
3.7.2 Parallel and Distributed Implementations . . . . . . . . . . . . . . . 158
3.7.3 Algorithmic Tricks for Model Compression . . . . . . . . . . . . . . 160
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
3.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
3.9.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
3.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4 Teaching Deep Learners to Generalize 169
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
4.2 The Bias-Variance Trade-Off . . . . . . . . . . . . . . . . . . . . . . . . . . 174
4.2.1 Formal View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.3 Generalization Issues in Model Tuning and Evaluation . . . . . . . . . . . . 178
4.3.1 Evaluating with Hold-Out and Cross-Validation . . . . . . . . . . . . 179
4.3.2 Issues with Training at Scale . . . . . . . . . . . . . . . . . . . . . . 180
4.3.3 How to Detect Need to Collect More Data . . . . . . . . . . . . . . . 181
4.4 Penalty-Based Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 181
4.4.1 Connections with Noise Injection . . . . . . . . . . . . . . . . . . . . 182
4.4.2 L1-Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
4.4.3 L1- or L2-Regularization? . . . . . . . . . . . . . . . . . . . . . . . . 184
4.4.4 Penalizing Hidden Units: Learning Sparse Representations . . . . . . 185
4.5 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
4.5.1 Bagging and Subsampling . . . . . . . . . . . . . . . . . . . . . . . . 186
4.5.2 Parametric Model Selection and Averaging . . . . . . . . . . . . . . 187
4.5.3 Randomized Connection Dropping . . . . . . . . . . . . . . . . . . . 188
4.5.4 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
4.5.5 Data Perturbation Ensembles . . . . . . . . . . . . . . . . . . . . . . 191
4.6 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
4.6.1 Understanding Early Stopping from the Variance Perspective . . . . 192
4.7 Unsupervised Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
4.7.1 Variations of Unsupervised Pretraining . . . . . . . . . . . . . . . . . 197
4.7.2 What About Supervised Pretraining? . . . . . . . . . . . . . . . . . 197
4.8 Continuation and Curriculum Learning . . . . . . . . . . . . . . . . . . . . . 199
4.8.1 Continuation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 199
4.8.2 Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.9 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.10 Regularization in Unsupervised Applications . . . . . . . . . . . . . . . . . 201
4.10.1 Value-Based Penalization: Sparse Autoencoders . . . . . . . . . . . . 202
4.10.2 Noise Injection: De-noising Autoencoders . . . . . . . . . . . . . . . 202
4.10.3 Gradient-Based Penalization: Contractive Autoencoders . . . . . . . 204
4.10.4 Hidden Probabilistic Structure: Variational Autoencoders . . . . . . 207
4.10.4.1 Reconstruction and Generative Sampling . . . . . . . . . . 210
4.10.4.2 Conditional Variational Autoencoders . . . . . . . . . . . . 212
4.10.4.3 Relationship with Generative Adversarial Networks . . . . 213
4.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
4.12 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
4.12.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
4.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
5 Radial Basis Function Networks 217
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
5.2 Training an RBF Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
5.2.1 Training the Hidden Layer . . . . . . . . . . . . . . . . . . . . . . . . 221
5.2.2 Training the Output Layer . . . . . . . . . . . . . . . . . . . . . . . 222
5.2.2.1 Expression with Pseudo-Inverse . . . . . . . . . . . . . . . 224
5.2.3 Orthogonal Least-Squares Algorithm . . . . . . . . . . . . . . . . . . 224
5.2.4 Fully Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . 225
5.3 Variations and Special Cases of RBF Networks . . . . . . . . . . . . . . . . 226
5.3.1 Classification with Perceptron Criterion . . . . . . . . . . . . . . . . 226
5.3.2 Classification with Hinge Loss . . . . . . . . . . . . . . . . . . . . . . 227
5.3.3 Example of Linear Separability Promoted by RBF . . . . . . . . . . 227
5.3.4 Application to Interpolation . . . . . . . . . . . . . . . . . . . . . . . 228
5.4 Relationship with Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . 229
5.4.1 Kernel Regression as a Special Case of RBF Networks . . . . . . . . 229
5.4.2 Kernel SVM as a Special Case of RBF Networks . . . . . . . . . . . 230
5.4.3 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
6 Restricted Boltzmann Machines 235
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
6.1.1 Historical Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . 236
6.2 Hopfield Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
6.2.1 Optimal State Configurations of a Trained Network . . . . . . . . . 238
6.2.2 Training a Hopfield Network . . . . . . . . . . . . . . . . . . . . . . 240
6.2.3 Building a Toy Recommender and Its Limitations . . . . . . . . . . 241
6.2.4 Increasing the Expressive Power of the Hopfield Network . . . . . . 242
6.3 The Boltzmann Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.3.1 How a Boltzmann Machine Generates Data . . . . . . . . . . . . . . 244
6.3.2 Learning the Weights of a Boltzmann Machine . . . . . . . . . . . . 245
6.4 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . 247
6.4.1 Training the RBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.4.2 Contrastive Divergence Algorithm . . . . . . . . . . . . . . . . . . . 250
6.4.3 Practical Issues and Improvisations . . . . . . . . . . . . . . . . . . . 251
6.5 Applications of Restricted Boltzmann Machines . . . . . . . . . . . . . . . . 251
6.5.1 Dimensionality Reduction and Data Reconstruction . . . . . . . . . 252
6.5.2 RBMs for Collaborative Filtering . . . . . . . . . . . . . . . . . . . . 254
6.5.3 Using RBMs for Classification . . . . . . . . . . . . . . . . . . . . . . 257
6.5.4 Topic Models with RBMs . . . . . . . . . . . . . . . . . . . . . . . . 260
6.5.5 RBMs for Machine Learning with Multimodal Data . . . . . . . . . 262
6.6 Using RBMs Beyond Binary Data Types . . . . . . . . . . . . . . . . . . . . 263
6.7 Stacking Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 264
6.7.1 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 266
6.7.2 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
6.7.3 Deep Boltzmann Machines and Deep Belief Networks . . . . . . . . 267
6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
6.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
6.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
7 Recurrent Neural Networks 271
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.1.1 Expressiveness of Recurrent Networks . . . . . . . . . . . . . . . . . 274
7.2 The Architecture of Recurrent Neural Networks . . . . . . . . . . . . . . . . 274
7.2.1 Language Modeling Example of RNN . . . . . . . . . . . . . . . . . 277
7.2.1.1 Generating a Language Sample . . . . . . . . . . . . . . . . 278
7.2.2 Backpropagation Through Time . . . . . . . . . . . . . . . . . . . . 280
7.2.3 Bidirectional Recurrent Networks . . . . . . . . . . . . . . . . . . . . 283
7.2.4 Multilayer Recurrent Networks . . . . . . . . . . . . . . . . . . . . . 284
7.3 The Challenges of Training Recurrent Networks . . . . . . . . . . . . . . . . 286
7.3.1 Layer Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
7.4 Echo-State Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.5 Long Short-Term Memory (LSTM) . . . . . . . . . . . . . . . . . . . . . . . 292
7.6 Gated Recurrent Units (GRUs) . . . . . . . . . . . . . . . . . . . . . . . . . 295
7.7 Applications of Recurrent Neural Networks . . . . . . . . . . . . . . . . . . 297
7.7.1 Application to Automatic Image Captioning . . . . . . . . . . . . . . 298
7.7.2 Sequence-to-Sequence Learning and Machine Translation . . . . . . 299
7.7.2.1 Question-Answering Systems . . . . . . . . . . . . . . . . . 301
7.7.3 Application to Sentence-Level Classification . . . . . . . . . . . . . . 303
7.7.4 Token-Level Classification with Linguistic Features . . . . . . . . . . 304
7.7.5 Time-Series Forecasting and Prediction . . . . . . . . . . . . . . . . 305
7.7.6 Temporal Recommender Systems . . . . . . . . . . . . . . . . . . . . 307
7.7.7 Secondary Protein Structure Prediction . . . . . . . . . . . . . . . . 309
7.7.8 End-to-End Speech Recognition . . . . . . . . . . . . . . . . . . . . . 309
7.7.9 Handwriting Recognition . . . . . . . . . . . . . . . . . . . . . . . . 309
7.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.9.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
7.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
8 Convolutional Neural Networks 315
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
8.1.1 Historical Perspective and Biological Inspiration . . . . . . . . . . . 316
8.1.2 Broader Observations About Convolutional Neural Networks . . . . 317
8.2 The Basic Structure of a Convolutional Network . . . . . . . . . . . . . . . 318
8.2.1 Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.2.2 Strides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.2.3 Typical Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.2.4 The ReLU Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
8.2.5 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
8.2.6 Fully Connected Layers . . . . . . . . . . . . . . . . . . . . . . . . . 327
8.2.7 The Interleaving Between Layers . . . . . . . . . . . . . . . . . . . . 328
8.2.8 Local Response Normalization . . . . . . . . . . . . . . . . . . . . . 330
8.2.9 Hierarchical Feature Engineering . . . . . . . . . . . . . . . . . . . . 331
8.3 Training a Convolutional Network . . . . . . . . . . . . . . . . . . . . . . . 332
8.3.1 Backpropagating Through Convolutions . . . . . . . . . . . . . . . . 333
8.3.2 Backpropagation as Convolution with Inverted/Transposed Filter . . 334
8.3.3 Convolution/Backpropagation as Matrix Multiplications . . . . . . . 335
8.3.4 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
8.4 Case Studies of Convolutional Architectures . . . . . . . . . . . . . . . . . . 338
8.4.1 AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
8.4.2 ZFNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
8.4.3 VGG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
8.4.4 GoogLeNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
8.4.5 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
8.4.6 The Effects of Depth . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
8.4.7 Pretrained Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
8.5 Visualization and Unsupervised Learning . . . . . . . . . . . . . . . . . . . 352
8.5.1 Visualizing the Features of a Trained Network . . . . . . . . . . . . 353
8.5.2 Convolutional Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 357
8.6 Applications of Convolutional Networks . . . . . . . . . . . . . . . . . . . . 363
8.6.1 Content-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . 363
8.6.2 Object Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
8.6.3 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
8.6.4 Natural Language and Sequence Learning . . . . . . . . . . . . . . . 366
8.6.5 Video Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
8.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
8.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
8.8.1 Software Resources and Data Sets . . . . . . . . . . . . . . . . . . . 370
8.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
9 Deep Reinforcement Learning 373
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
9.2 Stateless Algorithms: Multi-Armed Bandits . . . . . . . . . . . . . . . . . . 375
9.2.1 Naïve Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
9.2.2 ε-Greedy Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 376
9.2.3 Upper Bounding Methods . . . . . . . . . . . . . . . . . . . . . . . . 376
9.3 The Basic Framework of Reinforcement Learning . . . . . . . . . . . . . . . 377
9.3.1 Challenges of Reinforcement Learning . . . . . . . . . . . . . . . . . 379
9.3.2 Simple Reinforcement Learning for Tic-Tac-Toe . . . . . . . . . . . . 380
9.3.3 Role of Deep Learning and a Straw-Man Algorithm . . . . . . . . . 380
9.4 Bootstrapping for Value Function Learning . . . . . . . . . . . . . . . . . . 383
9.4.1 Deep Learning Models as Function Approximators . . . . . . . . . . 384
9.4.2 Example: Neural Network for Atari Setting . . . . . . . . . . . . . . 386
9.4.3 On-Policy Versus Off-Policy Methods: SARSA . . . . . . . . . . . . 387
9.4.4 Modeling States Versus State-Action Pairs . . . . . . . . . . . . . . . 389
9.5 Policy Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
9.5.1 Finite Difference Methods . . . . . . . . . . . . . . . . . . . . . . . . 392
9.5.2 Likelihood Ratio Methods . . . . . . . . . . . . . . . . . . . . . . . . 393
9.5.3 Combining Supervised Learning with Policy Gradients . . . . . . . . 395
9.5.4 Actor-Critic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 395
9.5.5 Continuous Action Spaces . . . . . . . . . . . . . . . . . . . . . . . . 397
9.5.6 Advantages and Disadvantages of Policy Gradients . . . . . . . . . . 397
9.6 Monte Carlo Tree Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
9.7 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
9.7.1 AlphaGo: Championship Level Play at Go . . . . . . . . . . . . . . . 399
9.7.1.1 Alpha Zero: Enhancements to Zero Human Knowledge . . 402
9.7.2 Self-Learning Robots . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
9.7.2.1 Deep Learning of Locomotion Skills . . . . . . . . . . . . . 404
9.7.2.2 Deep Learning of Visuomotor Skills . . . . . . . . . . . . . 406
9.7.3 Building Conversational Systems: Deep Learning for Chatbots . . . 407
9.7.4 Self-Driving Cars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
9.7.5 Inferring Neural Architectures with Reinforcement Learning . . . . . 412
9.8 Practical Challenges Associated with Safety . . . . . . . . . . . . . . . . . . 413
9.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
9.10 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
9.10.1 Software Resources and Testbeds . . . . . . . . . . . . . . . . . . . . 416
9.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
10 Advanced Topics in Deep Learning 419
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
10.2 Attention Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
10.2.1 Recurrent Models of Visual Attention . . . . . . . . . . . . . . . . . 422
10.2.1.1 Application to Image Captioning . . . . . . . . . . . . . . . 424
10.2.2 Attention Mechanisms for Machine Translation . . . . . . . . . . . . 425
10.3 Neural Networks with External Memory . . . . . . . . . . . . . . . . . . . . 429
10.3.1 A Fantasy Video Game: Sorting by Example . . . . . . . . . . . . . 430
10.3.1.1 Implementing Swaps with Memory Operations . . . . . . . 431
10.3.2 Neural Turing Machines . . . . . . . . . . . . . . . . . . . . . . . . . 432
10.3.3 Differentiable Neural Computer: A Brief Overview . . . . . . . . . . 437
10.4 Generative Adversarial Networks (GANs) . . . . . . . . . . . . . . . . . . . 438
10.4.1 Training a Generative Adversarial Network . . . . . . . . . . . . . . 439
10.4.2 Comparison with Variational Autoencoder . . . . . . . . . . . . . . . 442
10.4.3 Using GANs for Generating Image Data . . . . . . . . . . . . . . . . 442
10.4.4 Conditional Generative Adversarial Networks . . . . . . . . . . . . . 444
10.5 Competitive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
10.5.1 Vector Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
10.5.2 Kohonen Self-Organizing Map . . . . . . . . . . . . . . . . . . . . . 450
10.6 Limitations of Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 453
10.6.1 An Aspirational Goal: One-Shot Learning . . . . . . . . . . . . . . . 453
10.6.2 An Aspirational Goal: Energy-Efficient Learning . . . . . . . . . . . 455
10.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
10.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
10.8.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
10.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
Bibliography 459
Index 493
Author Biography
Charu C. Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM
T. J. Watson Research Center in Yorktown Heights, New York. He completed his under-
graduate degree in Computer Science from the Indian Institute of Technology at Kan-
pur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996.
He has worked extensively in the field of data mining. He has pub-
lished more than 350 papers in refereed conferences and journals
and authored over 80 patents. He is the author or editor of 18
books, including textbooks on data mining, recommender systems,
and outlier analysis. Because of the commercial value of his patents,
he has thrice been designated a Master Inventor at IBM. He is a
recipient of an IBM Corporate Award (2003) for his work on bio-
terrorist threat detection in data streams, a recipient of the IBM
Outstanding Innovation Award (2008) for his scientific contribu-
tions to privacy technology, and a recipient of two IBM Outstanding
Technical Achievement Awards (2009, 2015) for his work on data streams/high-dimensional
data. He received the EDBT 2014 Test of Time Award for his work on condensation-based
privacy-preserving data mining. He is also a recipient of the IEEE ICDM Research Con-
tributions Award (2015), which is one of the two highest awards for influential research
contributions in the field of data mining.
He has served as the general co-chair of the IEEE Big Data Conference (2014) and as
the program co-chair of the ACM CIKM Conference (2015), the IEEE ICDM Conference
(2015), and the ACM KDD Conference (2016). He served as an associate editor of the IEEE
Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate
editor of the IEEE Transactions on Big Data, an action editor of the Data Mining and
Knowledge Discovery Journal, and an associate editor of the Knowledge and Information
Systems Journal. He serves as the editor-in-chief of the ACM Transactions on Knowledge
Discovery from Data as well as the ACM SIGKDD Explorations. He serves on the advisory
board of the Lecture Notes on Social Networks, a publication by Springer. He has served as
the vice-president of the SIAM Activity Group on Data Mining and is a member of the SIAM
industry committee. He is a fellow of the SIAM, ACM, and the IEEE, for “contributions to
knowledge discovery and data mining algorithms.”
Chapter 1
An Introduction to Neural Networks
“Thou shalt not make a machine to counterfeit a human mind.”—Frank Herbert
1.1 Introduction
Artificial neural networks are popular machine learning techniques that simulate the mech-
anism of learning in biological organisms. The human nervous system contains cells, which
are referred to as neurons. The neurons are connected to one another with the use of ax-
ons and dendrites, and the connecting regions between axons and dendrites are referred to
as synapses. These connections are illustrated in Figure 1.1(a). The strengths of synaptic
connections often change in response to external stimuli. This change is how learning takes
place in living organisms.
This biological mechanism is simulated in artificial neural networks, which contain com-
putation units that are referred to as neurons. Throughout this book, we will use the term
“neural networks” to refer to artificial neural networks rather than biological ones. The
computational units are connected to one another through weights, which serve the same
[Figure: panels (a) Biological neural network and (b) Artificial neural network; labels in the drawing mark the neuron, its axon, and dendrites with synaptic weights w1–w5.]
Figure 1.1: The synaptic connections between neurons. The image in (a) is from “The Brain:
Understanding Neurobiology Through the Study of Addiction” [598]. Copyright © 2000 by
BSCS & Videodiscovery. All rights reserved. Used with permission.
role as the strengths of synaptic connections in biological organisms. Each input to a neuron
is scaled with a weight, which affects the function computed at that unit. This architecture
is illustrated in Figure 1.1(b). An artificial neural network computes a function of the inputs
by propagating the computed values from the input neurons to the output neuron(s) and
using the weights as intermediate parameters. Learning occurs by changing the weights con-
necting the neurons. Just as external stimuli are needed for learning in biological organisms,
the external stimulus in artificial neural networks is provided by the training data contain-
ing examples of input-output pairs of the function to be learned. For example, the training
data might contain pixel representations of images (input) and their annotated labels (e.g.,
carrot, banana) as the output. These training data pairs are fed into the neural network by
using the input representations to make predictions about the output labels. The training
data provides feedback to the correctness of the weights in the neural network depending
on how well the predicted output (e.g., probability of carrot) for a particular input matches
the annotated output label in the training data. One can view the errors made by the neural
network in the computation of a function as a kind of unpleasant feedback in a biological
organism, leading to an adjustment in the synaptic strengths. Similarly, the weights between
neurons are adjusted in a neural network in response to prediction errors. The goal of chang-
ing the weights is to modify the computed function to make the predictions more correct in
future iterations. Therefore, the weights are changed carefully in a mathematically justified
way so as to reduce the error in computation on that example. By successively adjusting
the weights between neurons over many input-output pairs, the function computed by the
neural network is refined over time so that it provides more accurate predictions. Therefore,
if the neural network is trained with many different images of bananas, it will eventually
be able to properly recognize a banana in an image it has not seen before. This ability to
accurately compute functions of unseen inputs by training over a finite set of input-output
pairs is referred to as model generalization. The primary usefulness of all machine learning
models is gained from their ability to generalize their learning from seen training data to
unseen examples.
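As a concrete, if highly simplified, illustration of this error-driven adjustment of weights, the following toy sketch implements a single artificial neuron that computes a weighted combination of its inputs and corrects its weights whenever its prediction disagrees with the annotated label. The sketch is purely illustrative (the function names and data are invented here, and the update shown is a simple perceptron-style correction of the kind introduced in Section 1.2), not an excerpt from any particular implementation.

```python
import numpy as np

def train_single_neuron(X, y, epochs=100, lr=0.1):
    """Toy error-driven training of one neuron with a sign activation.

    X is an (n, d) array of inputs and y an (n,) array of +1/-1 labels.
    The weights play the role of synaptic strengths: they are nudged
    only when the predicted label disagrees with the annotated label.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            prediction = np.sign(np.dot(w, xi))   # forward computation
            if prediction != yi:                  # prediction error ("unpleasant feedback")
                w += lr * yi * xi                 # adjust the synaptic weights
    return w

# Tiny example: the labels are the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.5, -1.0], [-1.0, 0.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = train_single_neuron(X, y)
print(np.sign(X @ w))   # reproduces y on this toy data
```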
The biological comparison is often criticized as a very poor caricature of the workings
of the human brain; nevertheless, the principles of neuroscience have often been useful in
designing neural network architectures. A different view is that neural networks are built
as higher-level abstractions of the classical models that are commonly used in machine
learning. In fact, the most basic units of computation in the neural network are inspired by
traditional machine learning algorithms like least-squares regression and logistic regression.
Neural networks gain their power by putting together many such basic units, and learning
the weights of the different units jointly in order to minimize the prediction error. From
this point of view, a neural network can be viewed as a computational graph of elementary
units in which greater power is gained by connecting them in particular ways. When a
neural network is used in its most basic form, without hooking together multiple units, the
learning algorithms often reduce to classical machine learning models (see Chapter 2). The
real power of a neural model over classical methods is unleashed when these elementary
computational units are combined, and the weights of the elementary models are trained
using their dependencies on one another. By combining multiple units, one is increasing the
power of the model to learn more complicated functions of the data than are inherent in the
elementary models of basic machine learning. The way in which these units are combined
also plays a role in the power of the architecture, and requires some understanding and
insight from the analyst. Furthermore, sufficient training data is also required in order to
learn the larger number of weights in these expanded computational graphs.
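To make the “computational graph of elementary units” view more tangible, the short sketch below wires three logistic-regression-like units into a miniature two-layer network and evaluates the composed function on an input. Only the forward computation is shown; the joint training of all weights is deferred to Section 1.3 and Chapter 3, and every name and number in the sketch is chosen here purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, w, b):
    """One elementary unit: the same computation as logistic regression."""
    return sigmoid(np.dot(w, x) + b)

def tiny_network(x, W1, b1, w2, b2):
    """Compose elementary units: two hidden units feed a single output unit."""
    hidden = np.array([logistic_unit(x, W1[i], b1[i]) for i in range(len(b1))])
    return logistic_unit(hidden, w2, b2)

# Illustrative parameters; in practice all of them are learned jointly from data.
W1 = np.array([[2.0, -1.0], [-1.5, 1.0]])   # weights of the two hidden units
b1 = np.array([0.0, 0.5])                   # biases of the hidden units
w2 = np.array([1.0, -2.0])                  # weights of the output unit
b2 = 0.25                                   # bias of the output unit

x = np.array([0.8, -0.3])
print(tiny_network(x, W1, b1, w2, b2))      # composed prediction in (0, 1)
```

The power described above comes from learning W1, b1, w2, and b2 together, so that the hidden units can discover intermediate features that make the output unit's prediction more accurate.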
[Figure: plot of accuracy against amount of data, with one curve for deep learning and one for conventional machine learning.]
Figure 1.2: An illustrative comparison of the accuracy of a typical machine learning al-
gorithm with that of a large neural network. Deep learners become more attractive than
conventional methods primarily when sufficient data/computational power is available. Re-
cent years have seen an increase in data availability and computational power, which has
led to a “Cambrian explosion” in deep learning technology.
1.1.1 Humans Versus Computers: Stretching the Limits
of Artificial Intelligence
Humans and computers are inherently suited to different types of tasks. For example, com-
puting the cube root of a large number is very easy for a computer, but it is extremely
difficult for humans. On the other hand, a task such as recognizing the objects in an image
is a simple matter for a human, but has traditionally been very difficult for an automated
learning algorithm. It is only in recent years that deep learning has shown an accuracy on
some of these tasks that exceeds that of a human. In fact, the recent results by deep learning
algorithms that surpass human performance [184] in (some narrow tasks on) image recog-
nition would not have been considered likely by most computer vision experts as recently
as 10 years ago.
Many deep learning architectures that have shown such extraordinary performance are
not created by indiscriminately connecting computational units. The superior performance
of deep neural networks mirrors the fact that biological neural networks gain much of their
power from depth as well. Furthermore, biological networks are connected in ways we do not
fully understand. In the few cases that the biological structure is understood at some level,
significant breakthroughs have been achieved by designing artificial neural networks along
those lines. A classical example of this type of architecture is the use of the convolutional
neural network for image recognition. This architecture was inspired by Hubel and Wiesel’s
experiments [212] in 1959 on the organization of the neurons in the cat’s visual cortex. The
precursor to the convolutional neural network was the neocognitron [127], which was directly
based on these results.
The human neuronal connection structure has evolved over millions of years to optimize
survival-driven performance; survival is closely related to our ability to merge sensation and
intuition in a way that is currently not possible with machines. Biological neuroscience [232]
is a field that is still very much in its infancy, and only a limited amount is known about how
the brain truly works. Therefore, it is fair to suggest that the biologically inspired success
of convolutional neural networks might be replicated in other settings, as we learn more
about how the human brain works [176]. A key advantage of neural networks over tradi-
tional machine learning is that the former provides a higher-level abstraction of expressing
semantic insights about data domains by architectural design choices in the computational
graph. The second advantage is that neural networks provide a simple way to adjust the
complexity of a model by adding or removing neurons from the architecture according to
the availability of training data or computational power. A large part of the recent suc-
cess of neural networks is explained by the fact that the increased data availability and
computational power of modern computers has outgrown the limits of traditional machine
learning algorithms, which fail to take full advantage of what is now possible. This situation
is illustrated in Figure 1.2. The performance of traditional machine learning remains better
at times for smaller data sets because of the wider choice of available models, greater ease of model interpretation,
and the tendency to hand-craft interpretable features that incorporate domain-specific in-
sights. With limited data, the best of a very wide diversity of models in machine learning
will usually perform better than a single class of models (like neural networks). This is one
reason why the potential of neural networks was not realized in the early years.
The “big data” era has been enabled by the advances in data collection technology; vir-
tually everything we do today, including purchasing an item, using the phone, or clicking on
a site, is collected and stored somewhere. Furthermore, the development of powerful Graph-
ics Processor Units (GPUs) has enabled increasingly efficient processing on such large data
sets. These advances largely explain the recent success of deep learning using algorithms
that are only slightly adjusted from the versions that were available two decades back.
Furthermore, these recent adjustments to the algorithms have been enabled by increased
speed of computation, because reduced run-times enable efficient testing (and subsequent
algorithmic adjustment). If it requires a month to test an algorithm, at most twelve varia-
tions can be tested in a year on a single hardware platform. This situation has historically
constrained the intensive experimentation required for tweaking neural-network learning
algorithms. The rapid advances associated with the three pillars of improved data, compu-
tation, and experimentation have resulted in an increasingly optimistic outlook about the
future of deep learning. By the end of this century, it is expected that computers will have
the power to train neural networks with as many neurons as the human brain. Although
it is hard to predict what the true capabilities of artificial intelligence will be by then, our
experience with computer vision should prepare us to expect the unexpected.
Chapter Organization
This chapter is organized as follows. The next section introduces single-layer and multi-layer
networks. The different types of activation functions, output nodes, and loss functions are
discussed. The backpropagation algorithm is introduced in Section 1.3. Practical issues in
neural network training are discussed in Section 1.4. Some key points on how neural networks
gain their power with specific choices of activation functions are discussed in Section 1.5. The
common architectures used in neural network design are discussed in Section 1.6. Advanced
topics in deep learning are discussed in Section 1.7. Some notable benchmarks used by the
deep learning community are discussed in Section 1.8. A summary is provided in Section 1.9.
1.2 The Basic Architecture of Neural Networks
In this section, we will introduce single-layer and multi-layer neural networks. In the single-
layer network, a set of inputs is directly mapped to an output by using a generalized variation
of a linear function. This simple instantiation of a neural network is also referred to as the
perceptron. In multi-layer neural networks, the neurons are arranged in layered fashion, in
which the input and output layers are separated by a group of hidden layers. This layer-wise
architecture of the neural network is also referred to as a feed-forward network. This section
will discuss both single-layer and multi-layer networks.
Figure 1.3: The basic architecture of the perceptron: (a) perceptron without bias; (b) perceptron with bias. The inputs x1 . . . x5 are transmitted with weights w1 . . . w5 to an output node that computes a summation and produces the prediction y; in (b), a bias neuron that always transmits +1 contributes the bias b.
1.2.1 Single Computational Layer: The Perceptron
The simplest neural network is referred to as the perceptron. This neural network contains
a single input layer and an output node. The basic architecture of the perceptron is shown
in Figure 1.3(a). Consider a situation where each training instance is of the form (X, y),
where each X = [x1, . . . xd] contains d feature variables, and y ∈ {−1, +1} contains the
observed value of the binary class variable. By “observed value” we refer to the fact that it
is given to us as a part of the training data, and our goal is to predict the class variable for
cases in which it is not observed. For example, in a credit-card fraud detection application,
the features might represent various properties of a set of credit card transactions (e.g.,
amount and frequency of transactions), and the class variable might represent whether or
not this set of transactions is fraudulent. Clearly, in this type of application, one would have
historical cases in which the class variable is observed, and other (current) cases in which
the class variable has not yet been observed but needs to be predicted.
The input layer contains d nodes that transmit the d features X = [x1 . . . xd] with
edges of weight W = [w1 . . . wd] to an output node. The input layer does not perform
any computation in its own right. The linear function W · X = Σ_{i=1}^d wi xi is computed at the output node. Subsequently, the sign of this real value is used in order to predict the dependent variable of X. Therefore, the prediction ŷ is computed as follows:

ŷ = sign{W · X} = sign{ Σ_{j=1}^d wj xj }   (1.1)
The sign function maps a real value to either +1 or −1, which is appropriate for binary
classification. Note the circumflex on top of the variable y to indicate that it is a predicted
value rather than an observed value. The error of the prediction is therefore E(X) = y − ŷ,
which is one of the values drawn from the set {−2, 0, +2}. In cases where the error value
E(X) is nonzero, the weights in the neural network need to be updated in the (negative)
direction of the error gradient. As we will see later, this process is similar to that used in
various types of linear models in machine learning. In spite of the similarity of the perceptron
with respect to traditional machine learning models, its interpretation as a computational
unit is very useful because it allows us to put together multiple units in order to create far
more powerful models than are available in traditional machine learning.
The architecture of the perceptron is shown in Figure 1.3(a), in which a single input layer
transmits the features to the output node. The edges from the input to the output contain
the weights w1 . . . wd with which the features are multiplied and added at the output node.
Subsequently, the sign function is applied in order to convert the aggregated value into a
class label. The sign function serves the role of an activation function. Different choices
of activation functions can be used to simulate different types of models used in machine
learning, like least-squares regression with numeric targets, the support vector machine,
or a logistic regression classifier. Most of the basic machine learning models can be easily
represented as simple neural network architectures. It is a useful exercise to model traditional
machine learning techniques as neural architectures, because it provides a clearer picture of
how deep learning generalizes traditional machine learning. This point of view is explored
in detail in Chapter 2. It is noteworthy that the perceptron contains two layers, although
the input layer does not perform any computation and only transmits the feature values.
The input layer is not included in the count of the number of layers in a neural network.
Since the perceptron contains a single computational layer, it is considered a single-layer
network.
In many settings, there is an invariant part of the prediction, which is referred to as
the bias. For example, consider a setting in which the feature variables are mean centered,
but the mean of the binary class prediction from {−1, +1} is not 0. This will tend to occur
in situations in which the binary class distribution is highly imbalanced. In such a case,
the aforementioned approach is not sufficient for prediction. We need to incorporate an
additional bias variable b that captures this invariant part of the prediction:
ŷ = sign{W · X + b} = sign{ Σ_{j=1}^d wj xj + b }   (1.2)
The bias can be incorporated as the weight of an edge by using a bias neuron. This is
achieved by adding a neuron that always transmits a value of 1 to the output node. The
weight of the edge connecting the bias neuron to the output node provides the bias variable.
An example of a bias neuron is shown in Figure 1.3(b). Another approach that works well
with single-layer architectures is to use a feature engineering trick in which an additional
feature is created with a constant value of 1. The coefficient of this feature provides the bias,
and one can then work with Equation 1.1. Throughout this book, biases will not be explicitly
used (for simplicity in architectural representations) because they can be incorporated with
bias neurons. The details of the training algorithms remain the same by simply treating the
bias neurons like any other neuron with a fixed activation value of 1. Therefore, the following discussion works with the predictive assumption of Equation 1.1, which does not explicitly use biases.
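The constant-feature trick described above is easy to make concrete. The following minimal sketch (the data values and names are illustrative, not from the text) appends a column of 1s to a data matrix so that the last entry of the weight vector plays the role of the bias b, allowing Equation 1.1 to be used unchanged:

import numpy as np

# Two hypothetical training points with d = 3 features each
X = np.array([[0.5, -1.2, 3.0],
              [1.1,  0.4, -0.7]])
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append a constant feature of 1
W_aug = np.zeros(X_aug.shape[1])                  # last entry of W_aug acts as the bias b
y_hat = np.sign(X_aug @ W_aug)                    # prediction of Equation 1.1 with an implicit bias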
At the time that the perceptron algorithm was proposed by Rosenblatt [405], these op-
timizations were performed in a heuristic way with actual hardware circuits, and it was not
presented in terms of a formal notion of optimization in machine learning (as is common
today). However, the goal was always to minimize the error in prediction, even if a for-
mal optimization formulation was not presented. The perceptron algorithm was, therefore,
heuristically designed to minimize the number of misclassifications, and convergence proofs
were available that provided correctness guarantees of the learning algorithm in simplified
settings. Therefore, we can still write the (heuristically motivated) goal of the perceptron
algorithm in least-squares form with respect to all training instances in a data set D containing feature-label pairs:

Minimize_W  L = Σ_{(X,y)∈D} (y − ŷ)² = Σ_{(X,y)∈D} ( y − sign{W · X} )²
This type of minimization objective function is also referred to as a loss function. As we
will see later, almost all neural network learning algorithms are formulated with the use
of a loss function. As we will learn in Chapter 2, this loss function looks a lot like least-
squares regression. However, the latter is defined for continuous-valued target variables,
and the corresponding loss is a smooth and continuous function of the variables. On the
other hand, for the least-squares form of the objective function, the sign function is non-
differentiable, with step-like jumps at specific points. Furthermore, the sign function takes
on constant values over large portions of the domain, and therefore the exact gradient takes
on zero values at differentiable points. This results in a staircase-like loss surface, which
is not suitable for gradient-descent. The perceptron algorithm (implicitly) uses a smooth
approximation of the gradient of this objective function with respect to each example:
∇Lsmooth = Σ_{(X,y)∈D} (y − ŷ) X   (1.3)
Note that the above gradient is not a true gradient of the staircase-like surface of the (heuris-
tic) objective function, which does not provide useful gradients. Therefore, the staircase is
smoothed out into a sloping surface defined by the perceptron criterion. The properties of the
perceptron criterion will be described in Section 1.2.1.1. It is noteworthy that concepts like
the “perceptron criterion” were proposed later than the original paper by Rosenblatt [405]
in order to explain the heuristic gradient-descent steps. For now, we will assume that the
perceptron algorithm optimizes some unknown smooth function with the use of gradient
descent.
Although the above objective function is defined over the entire training data, the train-
ing algorithm of neural networks works by feeding each input data instance X into the
network one by one (or in small batches) to create the prediction ŷ. The weights are then
updated, based on the error value E(X) = (y − ŷ). Specifically, when the data point X is
fed into the network, the weight vector W is updated as follows:
W ⇐ W + α(y − ŷ)X (1.4)
The parameter α regulates the learning rate of the neural network. The perceptron algorithm
repeatedly cycles through all the training examples in random order and iteratively adjusts
the weights until convergence is reached. A single training data point may be cycled through
many times. Each such cycle is referred to as an epoch. One can also write the gradient-
descent update in terms of the error E(X) = (y − ŷ) as follows:
W ⇐ W + αE(X)X (1.5)
The basic perceptron algorithm can be considered a stochastic gradient-descent method,
which implicitly minimizes the squared error of prediction by performing gradient-descent
updates with respect to randomly chosen training points. The assumption is that the neural
network cycles through the points in random order during training and changes the weights
with the goal of reducing the prediction error on that point. It is easy to see from Equa-
tion 1.5 that non-zero updates are made to the weights only when y ≠ ŷ, which occurs only
when errors are made in prediction. In mini-batch stochastic gradient descent, the aforemen-
tioned updates of Equation 1.5 are implemented over a randomly chosen subset of training
points S:
W ⇐ W + α Σ_{X∈S} E(X) X   (1.6)
Figure 1.4: Examples of linearly separable and inseparable data in two classes (a linear separator W · X = 0 is shown for the separable case)
The advantages of using mini-batch stochastic gradient descent are discussed in Section 3.2.8
of Chapter 3. An interesting quirk of the perceptron is that it is possible to set the learning
rate α to 1, because the learning rate only scales the weights.
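To make the training procedure concrete, the following minimal sketch (variable names are illustrative) implements the updates of Equation 1.5, cycling through the training points in random order for a fixed number of epochs; as noted above, the learning rate is set to 1 because it only scales the weights:

import numpy as np

def train_perceptron(X, y, alpha=1.0, num_epochs=100):
    # X: one training point per row; y: labels drawn from {-1, +1}
    W = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in np.random.permutation(len(y)):
            error = y[i] - np.sign(np.dot(W, X[i]))  # E(X) = y - y_hat, drawn from {-2, 0, +2}
            W = W + alpha * error * X[i]             # Equation 1.5; non-zero only on mistakes
    return W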
The type of model proposed in the perceptron is a linear model, in which the equation
W ·X = 0 defines a linear hyperplane. Here, W = (w1 . . . wd) is a d-dimensional vector that
is normal to the hyperplane. Furthermore, the value of W · X is positive for values of X on
one side of the hyperplane, and it is negative for values of X on the other side. This type of
model performs particularly well when the data is linearly separable. Examples of linearly
separable and inseparable data are shown in Figure 1.4.
The perceptron algorithm is good at classifying data sets like the one shown on the
left-hand side of Figure 1.4, when the data is linearly separable. On the other hand, it tends
to perform poorly on data sets like the one shown on the right-hand side of Figure 1.4. This
example shows the inherent modeling limitation of a perceptron, which necessitates the use
of more complex neural architectures.
Since the original perceptron algorithm was proposed as a heuristic minimization of
classification errors, it was particularly important to show that the algorithm converges
to reasonable solutions in some special cases. In this context, it was shown [405] that the
perceptron algorithm always converges to provide zero error on the training data when
the data are linearly separable. However, the perceptron algorithm is not guaranteed to
converge in instances where the data are not linearly separable. For reasons discussed in
the next section, the perceptron might sometimes arrive at a very poor solution with data
that are not linearly separable (in comparison with many other learning algorithms).
1.2.1.1 What Objective Function Is the Perceptron Optimizing?
As discussed earlier in this chapter, the original perceptron paper by Rosenblatt [405] did
not formally propose a loss function. In those years, these implementations were achieved
using actual hardware circuits. The original Mark I perceptron was intended to be a machine
rather than an algorithm, and custom-built hardware was used to create it (cf. Figure 1.5).
The general goal was to minimize the number of classification errors with a heuristic update
process (in hardware) that changed weights in the “correct” direction whenever errors were
made. This heuristic update strongly resembled gradient descent but it was not derived
as a gradient-descent method. Gradient descent is defined only for smooth loss functions
in algorithmic settings, whereas the hardware-centric approach was designed in a more heuristic way with binary outputs.

Figure 1.5: The perceptron algorithm was originally implemented using hardware circuits. The image depicts the Mark I perceptron machine built in 1958. (Courtesy: Smithsonian Institute)

Many of the binary and circuit-centric principles were
inherited from the McCulloch-Pitts model [321] of the neuron. Unfortunately, binary signals
are not amenable to continuous optimization.
Can we find a smooth loss function, whose gradient turns out to be the perceptron
update? The number of classification errors in a binary classification problem can be written
in the form of a 0/1 loss function for training data point (Xi, yi) as follows:
L_i^(0/1) = (1/2) (yi − sign{W · Xi})² = 1 − yi · sign{W · Xi}   (1.7)
The simplification to the right-hand side of the above objective function is obtained by setting both yi² and sign{W · Xi}² to 1, since they are obtained by squaring a value drawn from {−1, +1}. However, this objective function is not differentiable, because it has a staircase-like shape, especially when it is added over multiple points. Note that the 0/1 loss above is dominated by the term −yi · sign{W · Xi}, in which the sign function causes most of
the problems associated with non-differentiability. Since neural networks are defined by
gradient-based optimization, we need to define a smooth objective function that is respon-
sible for the perceptron updates. It can be shown [41] that the updates of the perceptron
implicitly optimize the perceptron criterion. This objective function is defined by dropping
the sign function in the above 0/1 loss and setting negative values to 0 in order to treat all
correct predictions in a uniform and lossless way:
Li = max{−yi(W · Xi), 0} (1.8)
The reader is encouraged to use calculus to verify that the gradient of this smoothed objec-
tive function leads to the perceptron update, and the update of the perceptron is essentially
W ⇐ W − α∇W Li. The modified loss function to enable gradient computation of a non-
differentiable function is also referred to as a smoothed surrogate loss function. Almost all
continuous optimization-based learning methods (such as neural networks) with discrete
outputs (such as class labels) use some type of smoothed surrogate loss function.
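As a concrete check of the claim above, the following sketch (names are illustrative) computes the perceptron criterion of Equation 1.8 for a single point and its gradient with respect to W; the gradient is −yi · Xi when the point is misclassified and zero otherwise, which is exactly the direction used by the perceptron update:

import numpy as np

def perceptron_criterion(W, X_i, y_i):
    return max(-y_i * np.dot(W, X_i), 0.0)          # Equation 1.8

def perceptron_criterion_gradient(W, X_i, y_i):
    if y_i * np.dot(W, X_i) < 0:                    # loss is active (misclassified point)
        return -y_i * X_i
    return np.zeros_like(W)                         # correct predictions contribute no gradient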
Figure 1.6: Perceptron criterion versus hinge loss (loss plotted against the value of W · X for a positive-class instance)
Although the aforementioned perceptron criterion was reverse engineered by working
backwards from the perceptron updates, the nature of this loss function exposes some of
the weaknesses of the updates in the original algorithm. An interesting observation about the
perceptron criterion is that one can set W to the zero vector irrespective of the training data
set in order to obtain the optimal loss value of 0. In spite of this fact, the perceptron updates
continue to converge to a clear separator between the two classes in linearly separable cases;
after all, a separator between the two classes provides a loss value of 0 as well. However,
the behavior for data that are not linearly separable is rather arbitrary, and the resulting
solution is sometimes not even a good approximate separator of the classes. The direct
sensitivity of the loss to the magnitude of the weight vector can dilute the goal of class
separation; it is possible for updates to worsen the number of misclassifications significantly
while improving the loss. This is an example of how surrogate loss functions might sometimes
not fully achieve their intended goals. Because of this fact, the approach is not stable and
can yield solutions of widely varying quality.
Several variations of the learning algorithm were therefore proposed for inseparable data,
and a natural approach is to always keep track of the best solution in terms of the number of
misclassifications [128]. This approach of always keeping the best solution in one’s “pocket”
is referred to as the pocket algorithm. Another well-performing variant incorporates the notion of margin in the loss function, which yields an algorithm identical to the linear
support vector machine. For this reason, the linear support vector machine is also referred
to as the perceptron of optimal stability.
1.2.1.2 Relationship with Support Vector Machines
The perceptron criterion is a shifted version of the hinge-loss used in support vector ma-
chines (see Chapter 2). The hinge loss looks even more similar to the zero-one loss criterion
of Equation 1.7, and is defined as follows:
L_i^svm = max{1 − yi(W · Xi), 0}   (1.9)
Note that the perceptron does not keep the constant term of 1 on the right-hand side of
Equation 1.7, whereas the hinge loss keeps this constant within the maximization function.
This change does not affect the algebraic expression for the gradient, but it does change
which points are lossless and should not cause an update. The relationship between the
perceptron criterion and the hinge loss is shown in Figure 1.6. This similarity becomes
particularly evident when the perceptron updates of Equation 1.6 are rewritten as follows:
W ⇐ W + α Σ_{(X,y)∈S⁺} y X   (1.10)
Here, S⁺ is defined as the set of all misclassified training points X ∈ S that satisfy the condition y(W · X) < 0. This update seems to look somewhat different from the perceptron, because the perceptron uses the error E(X) for the update, which is replaced with y in the update above. A key point is that the (integer) error value E(X) = (y − sign{W · X}) ∈ {−2, +2} can never be 0 for misclassified points in S⁺. Therefore, we have E(X) = 2y for misclassified points, and E(X) can be replaced with y in the updates after absorbing the factor of 2 within the learning rate. This update is identical to that used by the primal support vector machine (SVM) algorithm [448], except that the updates are performed only for the misclassified points in the perceptron, whereas the SVM also uses the marginally correct points near the decision boundary for updates. Note that the SVM uses the condition y(W · X) < 1 [instead of using the condition y(W · X) < 0] to define S⁺, which is one of
the key differences between the two algorithms. This point shows that the perceptron is
fundamentally not very different from well-known machine learning algorithms like the
support vector machine in spite of its different origins. Freund and Schapire provide a
beautiful exposition of the role of margin in improving stability of the perceptron and also
its relationship with the support vector machine [123]. It turns out that many traditional
machine learning models can be viewed as minor variations of shallow neural architectures
like the perceptron. The relationships between classical machine learning models and shallow
neural networks are described in detail in Chapter 2.
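The closeness of the two algorithms can also be seen in code. The sketch below (illustrative, with regularization omitted) implements the mini-batch update of Equation 1.10; changing the margin used to define S⁺ from 0 to 1 turns the perceptron update into the update associated with the hinge loss:

import numpy as np

def minibatch_update(W, X_batch, y_batch, alpha=0.1, margin=0.0):
    # S+ contains the points with y (W . X) < margin:
    # margin = 0 gives the perceptron condition, margin = 1 gives the hinge-loss (SVM) condition
    active = y_batch * (X_batch @ W) < margin
    return W + alpha * (y_batch[active] @ X_batch[active])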
1.2.1.3 Choice of Activation and Loss Functions
The choice of activation function is a critical part of neural network design. In the case of the
perceptron, the choice of the sign activation function is motivated by the fact that a binary
class label needs to be predicted. However, it is possible to have other types of situations
where different target variables may be predicted. For example, if the target variable to be
predicted is real, then it makes sense to use the identity activation function, and the resulting
algorithm is the same as least-squares regression. If it is desirable to predict a probability
of a binary class, it makes sense to use a sigmoid function for activating the output node, so
that the prediction ŷ indicates the probability that the observed value, y, of the dependent
variable is 1. The negative logarithm of |y/2 − 0.5 + ŷ| is used as the loss, assuming that y is coded from {−1, +1}. If ŷ is the probability that y is 1, then |y/2 − 0.5 + ŷ| is the probability that the correct value is predicted. This assertion is easy to verify by examining the two cases where y is −1 or +1. This loss function can be shown to be representative of the negative
log-likelihood of the training data (see Section 2.2.3 of Chapter 2).
The importance of nonlinear activation functions becomes significant when one moves
from the single-layered perceptron to the multi-layered architectures discussed later in this
chapter. Different types of nonlinear functions such as the sign, sigmoid, or hyperbolic tan-
gents may be used in various layers. We use the notation Φ to denote the activation function:
ŷ = Φ(W · X) (1.11)
Therefore, a neuron really computes two functions within the node, which is why we have
incorporated the summation symbol Σ as well as the activation symbol Φ within a neuron.
The break-up of the neuron computations into two separate values is shown in Figure 1.7.
Figure 1.7: Pre-activation and post-activation values within a neuron. The pre-activation value is ah = W · X, and the post-activation value is h = Φ(ah).
The value computed before applying the activation function Φ(·) will be referred to as the
pre-activation value, whereas the value computed after applying the activation function is
referred to as the post-activation value. The output of a neuron is always the post-activation
value, although the pre-activation variables are often used in different types of analyses, such
as the computations of the backpropagation algorithm discussed later in this chapter. The
pre-activation and post-activation values of a neuron are shown in Figure 1.7.
The most basic activation function Φ(·) is the identity or linear activation, which provides
no nonlinearity:
Φ(v) = v
The linear activation function is often used in the output node, when the target is a real
value. It is even used for discrete outputs when a smoothed surrogate loss function needs
to be set up.
The classical activation functions that were used early in the development of neural
networks were the sign, sigmoid, and the hyperbolic tangent functions:
Φ(v) = sign(v) (sign function)
Φ(v) = 1/(1 + e^{−v})   (sigmoid function)
Φ(v) = (e^{2v} − 1)/(e^{2v} + 1)   (tanh function)
While the sign activation can be used to map to binary outputs at prediction time, its
non-differentiability prevents its use for creating the loss function at training time. For
example, while the perceptron uses the sign function for prediction, the perceptron crite-
rion in training only requires linear activation. The sigmoid activation outputs a value in
(0, 1), which is helpful in performing computations that should be interpreted as probabil-
ities. Furthermore, it is also helpful in creating probabilistic outputs and constructing loss
functions derived from maximum-likelihood models. The tanh function has a shape simi-
lar to that of the sigmoid function, except that it is horizontally re-scaled and vertically
translated/re-scaled to [−1, 1]. The tanh and sigmoid functions are related as follows (see
Exercise 3):
tanh(v) = 2 · sigmoid(2v) − 1
The tanh function is preferable to the sigmoid when the outputs of the computations are de-
sired to be both positive and negative. Furthermore, its mean-centering and larger gradient
(because of stretching) with respect to the sigmoid makes it easier to train.

Figure 1.8: Various activation functions

The sigmoid and the
tanh functions have been the historical tools of choice for incorporating nonlinearity in the
neural network. In recent years, however, a number of piecewise linear activation functions
have become more popular:
Φ(v) = max{v, 0} (Rectified Linear Unit [ReLU])
Φ(v) = max {min [v, 1] , −1} (hard tanh)
The ReLU and hard tanh activation functions have largely replaced the sigmoid and soft
tanh activation functions in modern neural networks because of the ease in training multi-
layered neural networks with these activation functions.
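For reference, these activation functions can be written in a few lines of NumPy; the sketch below is vectorized over an array of pre-activation values v:

import numpy as np

def sign_activation(v): return np.sign(v)
def sigmoid(v):         return 1.0 / (1.0 + np.exp(-v))
def tanh_activation(v): return (np.exp(2 * v) - 1) / (np.exp(2 * v) + 1)   # equivalent to np.tanh(v)
def relu(v):            return np.maximum(v, 0)
def hard_tanh(v):       return np.maximum(np.minimum(v, 1), -1)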
Pictorial representations of all the aforementioned activation functions are illustrated
in Figure 1.8. It is noteworthy that all activation functions shown here are monotonic.
Furthermore, other than the identity activation function, most¹ of the other activation functions saturate at large absolute values of the argument, where increasing the argument further does not change the activation much.
As we will see later, such nonlinear activation functions are also very useful in multilayer
networks, because they help in creating more powerful compositions of different types of
functions. Many of these functions are referred to as squashing functions, as they map the
outputs from an arbitrary range to bounded outputs. The use of a nonlinear activation plays
a fundamental role in increasing the modeling power of a network. If a network used only
linear activations, it would not provide better modeling power than a single-layer linear
network. This issue is discussed in Section 1.5.
¹The ReLU shows asymmetric saturation.
Figure 1.9: An example of multiple outputs for categorical classification with the use of a softmax layer. The hidden-layer outputs v1, v2, and v3 are converted into the class probabilities P(y=blue), P(y=green), and P(y=red) by the softmax layer.
1.2.1.4 Choice and Number of Output Nodes
The choice and number of output nodes is also tied to the activation function, which in
turn depends on the application at hand. For example, if k-way classification is intended,
k output values can be used, with a softmax activation function with respect to outputs
v = [v1, . . . , vk] at the nodes in a given layer. Specifically, the activation function for the ith
output is defined as follows:
Φ(v)i = exp(vi) / Σ_{j=1}^k exp(vj)   ∀i ∈ {1, . . . , k}   (1.12)
It is helpful to think of these k values as the values output by k nodes, in which the in-
puts are v1 . . . vk. An example of the softmax function with three outputs is illustrated in
Figure 1.9, and the values v1, v2, and v3 are also shown in the same figure. Note that the
three outputs correspond to the probabilities of the three classes, and they convert the three
outputs of the final hidden layer into probabilities with the softmax function. The final hid-
den layer often uses linear (identity) activations, when it is input into the softmax layer.
Furthermore, there are no weights associated with the softmax layer, since it is only con-
verting real-valued outputs into probabilities. The use of softmax with a single hidden layer
of linear activations exactly implements a model, which is referred to as multinomial logistic
regression [6]. Similarly, many variations like multi-class SVMs can be easily implemented
with neural networks. Another example of a case in which multiple output nodes are used is
the autoencoder, in which each input data point is fully reconstructed by the output layer.
The autoencoder can be used to implement matrix factorization methods like singular value
decomposition. This architecture will be discussed in detail in Chapter 2. The simplest neu-
ral networks that simulate basic machine learning algorithms are instructive because they
lie on the continuum between traditional machine learning and deep networks. By exploring
these architectures, one gets a better idea of the relationship between traditional machine
learning and neural networks, and also the advantages provided by the latter.
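A minimal sketch of the softmax activation of Equation 1.12 is shown below. Subtracting the maximum value before exponentiating is an implementation detail not discussed in the text; it leaves the probabilities unchanged but avoids numerical overflow:

import numpy as np

def softmax(v):
    exp_v = np.exp(v - np.max(v))     # shift for numerical stability
    return exp_v / np.sum(exp_v)      # Equation 1.12: probabilities over the k classes

probabilities = softmax(np.array([2.0, 1.0, -1.0]))   # sums to 1 over k = 3 classes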
1.2.1.5 Choice of Loss Function
The choice of the loss function is critical in defining the outputs in a way that is sensitive
to the application at hand. For example, least-squares regression with numeric outputs
requires a simple squared loss of the form (y − ŷ)² for a single training instance with target
y and prediction ŷ. One can also use other types of loss like hinge loss for y ∈ {−1, +1} and
real-valued prediction ŷ (with identity activation):
L = max{0, 1 − y · ŷ} (1.13)
The hinge loss can be used to implement a learning method, which is referred to as a support
vector machine.
For multiway predictions (like predicting word identifiers or one of multiple classes),
the softmax output is particularly useful. However, a softmax output is probabilistic, and
therefore it requires a different type of loss function. In fact, for probabilistic predictions,
two different types of loss functions are used, depending on whether the prediction is binary
or whether it is multiway:
1. Binary targets (logistic regression): In this case, it is assumed that the observed
value y is drawn from {−1, +1}, and the prediction ŷ is an arbitrary numerical value obtained using the identity activation function. In such a case, the loss function for a single
instance with observed value y and real-valued prediction ŷ (with identity activation)
is defined as follows:
L = log(1 + exp(−y · ŷ)) (1.14)
This type of loss function implements a fundamental machine learning method, re-
ferred to as logistic regression. Alternatively, one can use a sigmoid activation function
to output ŷ ∈ (0, 1), which indicates the probability that the observed value y is 1.
Then, the negative logarithm of |y/2 − 0.5 + ŷ| provides the loss, assuming that y is
coded from {−1, 1}. This is because |y/2 − 0.5 + ŷ| indicates the probability that the
prediction is correct. This observation illustrates that one can use various combina-
tions of activation and loss functions to achieve the same result.
2. Categorical targets: In this case, if ŷ1 . . . ŷk are the probabilities of the k classes
(using the softmax activation of Equation 1.12), and the rth class is the ground-truth
class, then the loss function for a single instance is defined as follows:
L = −log(ŷr) (1.15)
This type of loss function implements multinomial logistic regression, and it is re-
ferred to as the cross-entropy loss. Note that binary logistic regression is identical to
multinomial logistic regression, when the value of k is set to 2 in the latter.
The key point to remember is that the nature of the output nodes, the activation function,
and the loss function depend on the application at hand. Furthermore, these choices also
depend on one another. Even though the perceptron is often presented as the quintessential
representative of single-layer networks, it is only a single representative out of a very large
universe of possibilities. In practice, one rarely uses the perceptron criterion as the loss
function. For discrete-valued outputs, it is common to use softmax activation with cross-
entropy loss. For real-valued outputs, it is common to use linear activation with squared
loss. Generally, cross-entropy loss is easier to optimize than squared loss.
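The two probabilistic losses above are equally short to write down in code. The sketch below (illustrative names) implements the logistic loss of Equation 1.14 for binary targets in {−1, +1} and the cross-entropy loss of Equation 1.15 for categorical targets:

import numpy as np

def logistic_loss(y, y_hat):
    # y in {-1, +1}; y_hat is a real-valued prediction with identity activation
    return np.log(1.0 + np.exp(-y * y_hat))          # Equation 1.14

def cross_entropy_loss(class_probabilities, r):
    # class_probabilities: softmax outputs; r: index of the ground-truth class
    return -np.log(class_probabilities[r])           # Equation 1.15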
Figure 1.10: The derivatives of various activation functions
1.2.1.6 Some Useful Derivatives of Activation Functions
Most neural network learning is primarily related to gradient-descent with activation func-
tions. For this reason, the derivatives of these activation functions are used repeatedly in
this book, and gathering them in a single place for future reference is useful. This section
provides details on the derivatives of these activation functions. Later chapters will extensively
refer to these results.
1. Linear and sign activations: The derivative of the linear activation function is 1 at
all places. The derivative of sign(v) is 0 at all values of v other than at v = 0,
where it is discontinuous and non-differentiable. Because of the zero gradient and
non-differentiability of this activation function, it is rarely used in the loss function
even when it is used for prediction at testing time. The derivatives of the linear and
sign activations are illustrated in Figure 1.10(a) and (b), respectively.
2. Sigmoid activation: The derivative of sigmoid activation is particularly simple, when
it is expressed in terms of the output of the sigmoid, rather than the input. Let o be
the output of the sigmoid function with argument v:
o = 1/(1 + exp(−v))   (1.16)

Then, one can write the derivative of the activation as follows:

∂o/∂v = exp(−v)/(1 + exp(−v))²   (1.17)
The key point is that this derivative can be written more conveniently in terms of the output:

∂o/∂v = o(1 − o)   (1.18)
The derivative of the sigmoid is often used as a function of the output rather than the
input. The derivative of the sigmoid activation function is illustrated in Figure 1.10(c).
3. Tanh activation: As in the case of the sigmoid activation, the tanh activation is often
used as a function of the output o rather than the input v:
o = (exp(2v) − 1)/(exp(2v) + 1)   (1.19)

One can then compute the gradient as follows:

∂o/∂v = 4 · exp(2v)/(exp(2v) + 1)²   (1.20)

One can also write this derivative in terms of the output o:

∂o/∂v = 1 − o²   (1.21)
The derivative of the tanh activation is illustrated in Figure 1.10(d).
4. ReLU and hard tanh activations: The ReLU takes on a partial derivative value of 1
for non-negative values of its argument, and 0, otherwise. The hard tanh function
takes on a partial derivative value of 1 for values of the argument in [−1, +1] and 0,
otherwise. The derivatives of the ReLU and hard tanh activations are illustrated in
Figure 1.10(e) and (f), respectively.
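The derivatives listed above translate directly into code. The sketch below expresses the sigmoid and tanh derivatives in terms of the post-activation output o (Equations 1.18 and 1.21), and the ReLU and hard tanh derivatives in terms of the pre-activation value v:

import numpy as np

def sigmoid_derivative(o):   return o * (1 - o)                          # Equation 1.18
def tanh_derivative(o):      return 1 - o ** 2                           # Equation 1.21
def relu_derivative(v):      return (v >= 0).astype(float)               # 1 for non-negative arguments
def hard_tanh_derivative(v): return ((v >= -1) & (v <= 1)).astype(float) # 1 inside [-1, +1]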
1.2.2 Multilayer Neural Networks
Multilayer neural networks contain more than one computational layer. The perceptron
contains an input and output layer, of which the output layer is the only computation-
performing layer. The input layer transmits the data to the output layer, and all com-
putations are completely visible to the user. Multilayer neural networks contain multiple
computational layers; the additional intermediate layers (between input and output) are
referred to as hidden layers because the computations performed are not visible to the user.
The specific architecture of multilayer neural networks is referred to as feed-forward net-
works, because successive layers feed into one another in the forward direction from input
to output. The default architecture of feed-forward networks assumes that all nodes in one
layer are connected to those of the next layer. Therefore, the architecture of the neural
network is almost fully defined, once the number of layers and the number/type of nodes in
each layer have been defined. The only remaining detail is the loss function that is optimized
in the output layer. Although the perceptron algorithm uses the perceptron criterion, this
is not the only choice. It is extremely common to use softmax outputs with cross-entropy
loss for discrete prediction and linear outputs with squared loss for real-valued prediction.
As in the case of single-layer networks, bias neurons can be used both in the hidden
layers and in the output layers. Examples of multilayer networks with or without the bias
neurons are shown in Figure 1.11(a) and (b), respectively. In each case, the neural network
contains three layers. Note that the input layer is often not counted, because it simply
transmits the data and no computation is performed in that layer. If a neural network
contains p1 . . . pk units in each of its k layers, then the (column) vector representations of
these outputs, denoted by h1 . . . hk have dimensionalities p1 . . . pk. Therefore, the number
of units in each layer is referred to as the dimensionality of that layer.
Figure 1.11: The basic architecture of a feed-forward network with two hidden layers and a single output layer: (a) without bias neurons; (b) with bias neurons; (c) scalar units with scalar weights on the connections; (d) vector units with weight matrices (of sizes 5 × 3, 3 × 3, and 3 × 1) on the connections. Even though each unit contains a single scalar variable, one often represents all units within a single layer as a single vector unit. Vector units are often represented as rectangles and have connection matrices between them.
Figure 1.12: An example of an autoencoder with multiple outputs. The output of the constricted middle hidden layer provides the reduced representation.
The weights of the connections between the input layer and the first hidden layer are
contained in a matrix W1 with size d × p1, whereas the weights between the rth hidden layer and the (r + 1)th hidden layer are contained in the pr × pr+1 matrix Wr+1. If the output layer contains o nodes, then the final matrix Wk+1 is of size pk × o. The d-dimensional input vector x is transformed into the outputs using the following recursive equations:

h1 = Φ(W1ᵀ x)   [Input to Hidden Layer]
hp+1 = Φ(Wp+1ᵀ hp)   ∀p ∈ {1 . . . k − 1}   [Hidden to Hidden Layer]
o = Φ(Wk+1ᵀ hk)   [Hidden to Output Layer]
Here, the activation functions like the sigmoid function are applied in element-wise fashion
to their vector arguments. However, some activation functions such as the softmax (which
are typically used in the output layers) naturally have vector arguments. Even though each
unit of a neural network contains a single variable, many architectural diagrams combine
the units in a single layer to create a single vector unit, which is represented as a rectangle
rather than a circle. For example, the architectural diagram in Figure 1.11(c) (with scalar
units) has been transformed to a vector-based neural architecture in Figure 1.11(d). Note
that the connections between the vector units are now matrices. Furthermore, an implicit
assumption in the vector-based neural architecture is that all units in a layer use the same
activation function, which is applied in element-wise fashion to that layer. This constraint is
usually not a problem, because most neural architectures use the same activation function
throughout the computational pipeline, with the only deviation caused by the nature of
the output layer. Throughout this book, neural architectures in which units contain vector
variables will be depicted with rectangular units, whereas scalar variables will correspond
to circular units.
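The layer-wise recursion above amounts to a short loop in code. The following sketch (names are illustrative) applies the same element-wise activation phi at every layer for simplicity, even though, as noted above, the output layer often uses a different activation such as the softmax:

import numpy as np

def forward_pass(x, weights, phi):
    # weights holds the matrices W1, ..., Wk+1 with sizes d x p1, p1 x p2, ..., pk x o
    h = x
    for W in weights:
        h = phi(W.T @ h)     # h_{p+1} = phi(W_{p+1}^T h_p); the last iteration produces the output
    return h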
Note that the aforementioned recurrence equations and vector architectures are valid
only for layer-wise feed-forward networks, and cannot always be used for unconventional
architectural designs. It is possible to have all types of unconventional designs in which
inputs might be incorporated in intermediate layers, or the topology might allow connections
between non-consecutive layers. Furthermore, the functions computed at a node may not
always be in the form of a combination of a linear function and an activation. It is possible
to have all types of arbitrary computational functions at nodes.
Although a very classical type of architecture is shown in Figure 1.11, it is possible to
vary on it in many ways, such as allowing multiple output nodes. These choices are often
determined by the goals of the application at hand (e.g., classification or dimensionality
reduction). A classical example of the dimensionality reduction setting is the autoencoder,
which recreates the outputs from the inputs. Therefore, the number of outputs and inputs
is equal, as shown in Figure 1.12. The constricted hidden layer in the middle outputs the
reduced representation of each instance. As a result of this constriction, there is some loss in
the representation, which typically corresponds to the noise in the data. The outputs of the
hidden layers correspond to the reduced representation of the data. In fact, a shallow variant
of this scheme can be shown to be mathematically equivalent to a well-known dimensionality
reduction method known as singular value decomposition. As we will learn in Chapter 2,
increasing the depth of the network results in inherently more powerful reductions.
Although a fully connected architecture is able to perform well in many settings, better
performance is often achieved by pruning many of the connections or sharing them in an
insightful way. Typically, these insights are obtained by using a domain-specific understand-
ing of the data. A classical example of this type of weight pruning and sharing is that of
the convolutional neural network architecture (cf. Chapter 8), in which the architecture is
carefully designed in order to conform to the typical properties of image data. Such an ap-
proach minimizes the risk of overfitting by incorporating domain-specific insights (or bias).
As we will discuss later in this book (cf. Chapter 4), overfitting is a pervasive problem in
neural network design, so that the network often performs very well on the training data,
but it generalizes poorly to unseen test data. This problem occurs when the number of free
parameters (which is typically equal to the number of weight connections) is too large
compared to the size of the training data. In such cases, the large number of parameters
memorize the specific nuances of the training data, but fail to recognize the statistically
significant patterns for classifying unseen test data. Clearly, increasing the number of nodes
in the neural network tends to encourage overfitting. Much recent work has been focused
both on the architecture of the neural network as well as on the computations performed
within each node in order to minimize overfitting. Furthermore, the way in which the neu-
ral network is trained also has an impact on the quality of the final solution. Many clever
methods, such as pretraining (cf. Chapter 4), have been proposed in recent years in order to
improve the quality of the learned solution. This book will explore these advanced training
methods in detail.
1.2.3 The Multilayer Network as a Computational Graph
It is helpful to view a neural network as a computational graph, which is constructed by
piecing together many basic parametric models. Neural networks are fundamentally more
powerful than their building blocks because the parameters of these models are learned
jointly to create a highly optimized composition function of these models. The common use
of the term “perceptron” to refer to the basic unit of a neural network is somewhat mis-
leading, because there are many variations of this basic unit that are leveraged in different
settings. In fact, it is far more common to use logistic units (with sigmoid activation) and
piecewise/fully linear units as building blocks of these models.
A multilayer network evaluates compositions of functions computed at individual nodes.
A path of length 2 in the neural network in which the function f(·) follows g(·) can be
considered a composition function f(g(·)). Furthermore, if g1(·), g2(·) . . . gk(·) are the func-
tions computed in layer m, and a particular layer-(m + 1) node computes f(·), then the
composition function computed by the layer-(m + 1) node in terms of the layer-m inputs
is f(g1(·), . . . gk(·)). The use of nonlinear activation functions is the key to increasing the
power of multiple layers. If all layers use an identity activation function, then a multilayer
network can be shown to simplify to linear regression. It has been shown [208] that a net-
work with a single hidden layer of nonlinear units (with a wide ranging choice of squashing
functions like the sigmoid unit) and a single (linear) output layer can compute almost
any “reasonable” function. As a result, neural networks are often referred to as universal
function approximators, although this theoretical claim is not always easy to translate into
practical usefulness. The main issue is that the number of hidden units required to do so
is rather large, which increases the number of parameters to be learned. This results in
practical problems in training the network with a limited amount of data. In fact, deeper
networks are often preferred because they reduce the number of hidden units in each layer
as well as the overall number of parameters.
The “building block” description is particularly appropriate for multilayer neural net-
works. Very often, off-the-shelf software for building neural networks² provides analysts
²Examples include Torch [572], Theano [573], and TensorFlow [574].
with access to these building blocks. The analyst is able to specify the number and type of
units in each layer along with an off-the-shelf or customized loss function. A deep neural
network containing tens of layers can often be described in a few hundred lines of code.
All the learning of the weights is done automatically by the backpropagation algorithm that
uses dynamic programming to work out the complicated parameter update steps of the
underlying computational graph. The analyst does not have to spend the time and effort
to explicitly work out these steps. This makes the process of trying different types of ar-
chitectures relatively painless for the analyst. Building a neural network with many of the
off-the-shelf software packages is often compared to a child constructing a toy from building blocks
that appropriately fit with one another. Each block is like a unit (or a layer of units) with a
particular type of activation. Much of this ease in training neural networks is attributable
to the backpropagation algorithm, which shields the analyst from explicitly working out the
parameter update steps of what is actually an extremely complicated optimization problem.
Working out these steps is often the most difficult part of most machine learning algorithms,
and an important contribution of the neural network paradigm is to bring modular thinking
into machine learning. In other words, the modularity in neural network design translates
to modularity in learning its parameters; the specific name for the latter type of modularity
is “backpropagation.” This makes the design of neural networks more of an (experienced)
engineer’s task rather than a mathematical exercise.
1.3 Training a Neural Network with Backpropagation
In the single-layer neural network, the training process is relatively straightforward because
the error (or loss function) can be computed as a direct function of the weights, which
allows easy gradient computation. In the case of multi-layer networks, the problem is that
the loss is a complicated composition function of the weights in earlier layers. The gradient
of a composition function is computed using the backpropagation algorithm. The backprop-
agation algorithm leverages the chain rule of differential calculus, which computes the error
gradients in terms of summations of local-gradient products over the various paths from a
node to the output. Although this summation has an exponential number of components
(paths), one can compute it efficiently using dynamic programming. The backpropagation
algorithm is a direct application of dynamic programming. It contains two main phases,
referred to as the forward and backward phases, respectively. The forward phase is required
to compute the output values and the local derivatives at various nodes, and the backward
phase is required to accumulate the products of these local values over all paths from the
node to the output:
1. Forward phase: In this phase, the inputs for a training instance are fed into the neural
network. This results in a forward cascade of computations across the layers, using
the current set of weights. The final predicted output can be compared to that of the
training instance and the derivative of the loss function with respect to the output is
computed. The derivative of this loss now needs to be computed with respect to the
weights in all layers in the backwards phase.
2. Backward phase: The main goal of the backward phase is to learn the gradient of the
loss function with respect to the different weights by using the chain rule of differen-
tial calculus. These gradients are used to update the weights. Since these gradients
are learned in the backward direction, starting from the output node, this learning
process is referred to as the backward phase. Consider a sequence of hidden units
Figure 1.13: Illustration of the chain rule in computational graphs. The weight w feeds into a node computing f(w); its output (denoted y and z on the two branches) feeds into nodes computing p = g(y) and q = h(z), which are combined by the output node into o = K(p, q) = K(g(f(w)), h(f(w))). The products of node-specific partial derivatives along paths from weight w to output o are aggregated:

∂o/∂w = (∂o/∂p) · (∂p/∂w) + (∂o/∂q) · (∂q/∂w)   [Multivariable Chain Rule]
      = (∂o/∂p) · (∂p/∂y) · (∂y/∂w) + (∂o/∂q) · (∂q/∂z) · (∂z/∂w)   [Univariate Chain Rule]
      = ∂K(p, q)/∂p · g′(y) · f′(w) [first path] + ∂K(p, q)/∂q · h′(z) · f′(w) [second path]

The resulting value yields the derivative of output o with respect to weight w. Only two paths between input and output exist in this simplified example.
h1, h2, . . . , hk followed by output o, with respect to which the loss function L is com-
puted. Furthermore, assume that the weight of the connection from hidden unit hr to
hr+1 is w(hr,hr+1). Then, in the case that a single path exists from h1 to o, one can
derive the gradient of the loss function with respect to any of these edge weights using
the chain rule:
∂L/∂w(hr−1,hr) = (∂L/∂o) · [ (∂o/∂hk) · Π_{i=r}^{k−1} (∂hi+1/∂hi) ] · (∂hr/∂w(hr−1,hr))   ∀r ∈ 1 . . . k   (1.22)
The aforementioned expression assumes that only a single path from h1 to o exists in
the network, whereas an exponential number of paths might exist in reality. A gener-
alized variant of the chain rule, referred to as the multivariable chain rule, computes
the gradient in a computational graph, where more than one path might exist. This is
achieved by adding the composition along each of the paths from h1 to o. An example
of the chain rule in a computational graph with two paths is shown in Figure 1.13.
Therefore, one generalizes the above expression to the case where a set P of paths
exist from hr to o:
∂L/∂w(hr−1,hr) = { (∂L/∂o) · Σ_{[hr,hr+1,...,hk,o] ∈ P} [ (∂o/∂hk) · Π_{i=r}^{k−1} (∂hi+1/∂hi) ] } · (∂hr/∂w(hr−1,hr))   (1.23)

The term in curly braces is what backpropagation computes; it equals Δ(hr, o) = ∂L/∂hr.
The computation of ∂hr/∂w(hr−1,hr) on the right-hand side is straightforward and will be discussed below (cf. Equation 1.27). However, the path-aggregated term above [annotated by Δ(hr, o) = ∂L/∂hr] is aggregated over an exponentially increasing number
of paths (with respect to path length), which seems to be intractable at first sight. A
key point is that the computational graph of a neural network does not have cycles,
and it is possible to compute such an aggregation in a principled way in the backwards
direction by first computing Δ(hk, o) for nodes hk closest to o, and then recursively
computing these values for nodes in earlier layers in terms of the nodes in later layers.
Furthermore, the value of Δ(o, o) for each output node is initialized as follows:
Δ(o, o) = ∂L/∂o   (1.24)
This type of dynamic programming technique is used frequently to efficiently compute
all types of path-centric functions in directed acyclic graphs, which would otherwise
require an exponential number of operations. The recursion for Δ(hr, o) can be derived
using the multivariable chain rule:
Δ(hr, o) = ∂L/∂hr = Σ_{h : hr⇒h} (∂L/∂h) · (∂h/∂hr) = Σ_{h : hr⇒h} (∂h/∂hr) · Δ(h, o)   (1.25)
Since each h is in a later layer than hr, Δ(h, o) has already been computed by the time
Δ(hr, o) is evaluated. However, we still need to evaluate ∂h/∂hr in order to compute Equa-
tion 1.25. Consider a situation in which the edge joining hr to h has weight w(hr,h),
and let ah be the value computed in hidden unit h just before applying the activation
function Φ(·). In other words, we have h = Φ(ah), where ah is a linear combination of
the inputs to h from units in earlier layers. Then, by the univariate chain rule,
the following expression for ∂h/∂hr can be derived:
$$
\frac{\partial h}{\partial h_r}
= \frac{\partial h}{\partial a_h}\cdot\frac{\partial a_h}{\partial h_r}
= \frac{\partial \Phi(a_h)}{\partial a_h}\cdot w_{(h_r,h)}
= \Phi'(a_h)\cdot w_{(h_r,h)}
$$
This value of ∂h/∂hr is used in Equation 1.25, which is repeated recursively in the back-
wards direction, starting with the output node. The corresponding updates in the
backwards direction are as follows:
$$
\Delta(h_r, o) = \sum_{h : h_r \Rightarrow h} \Phi'(a_h)\cdot w_{(h_r,h)}\cdot \Delta(h, o)
\qquad (1.26)
$$
Therefore, gradients are successively accumulated in the backwards direction, and
each node is processed exactly once in a backwards pass. Note that the computation
of Equation 1.25 (which requires a number of operations proportional to the number of
outgoing edges) needs to be repeated for each incoming edge into the node in order to
compute the gradient with respect to all edge weights. Finally, Equation 1.23 requires the
computation of ∂hr/∂w(hr−1,hr), which is easily computed as follows:

$$
\frac{\partial h_r}{\partial w_{(h_{r-1},h_r)}} = h_{r-1}\cdot \Phi'(a_{h_r}) \qquad (1.27)
$$
Here, the key gradient that is backpropagated is the derivative of the loss with respect to the
layer activations, and the gradient with respect to the weights is easy to compute for any
edge incident on the corresponding unit.
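To make Equations 1.24-1.27 concrete, the following sketch traces the Δ-recursion through a small fully connected network in vectorized form. It is only an illustrative sketch, not code from this book: the use of NumPy, sigmoid activations in every layer (including the output), the squared loss, and the layer sizes are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def forward(x, weights):
    """Forward pass: store pre-activations a and post-activations h = Phi(a) per layer."""
    a_vals, h_vals = [], [x]
    for W in weights:
        a = W @ h_vals[-1]          # linear combination of inputs (pre-activation)
        a_vals.append(a)
        h_vals.append(sigmoid(a))   # post-activation value h = Phi(a)
    return a_vals, h_vals

def backward(a_vals, h_vals, weights, dL_do):
    """Backward pass in the Delta(h_r, o) = dL/dh_r formulation (Eqs. 1.24-1.27)."""
    grads = [None] * len(weights)
    delta = dL_do                                  # Eq. 1.24: Delta(o, o) = dL/do
    for i in reversed(range(len(weights))):
        local = sigmoid_prime(a_vals[i]) * delta   # Phi'(a_h) * Delta(h, o)
        grads[i] = np.outer(local, h_vals[i])      # Eqs. 1.23 + 1.27: Delta * Phi'(a) * h_{r-1}
        delta = weights[i].T @ local               # Eq. 1.26: Delta for the previous layer
    return grads

# Toy usage: 3 inputs -> 4 hidden -> 2 hidden -> 1 output, squared loss L = ||o - y||^2 / 2.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4)), rng.standard_normal((1, 2))]
x, y = rng.standard_normal(3), np.array([1.0])
a_vals, h_vals = forward(x, weights)
dL_do = h_vals[-1] - y                             # derivative of the squared loss at the output
gradients = backward(a_vals, h_vals, weights, dL_do)
```

Each layer is visited exactly once in the backward loop, mirroring the dynamic programming argument above.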
It is noteworthy that the dynamic programming recursion of Equation 1.26 can be
computed in multiple ways, depending on which variables one uses for intermediate chaining.
All these recursions are equivalent in terms of the final result of backpropagation. In the
following, we give an alternative version of the dynamic programming recursion, which is
more commonly seen in textbooks. Note that Equation 1.23 uses the variables in the hidden
layers as the “chain” variables for the dynamic programming recursion. One can also use
the pre-activation values of the variables as the intermediate ("chain") variables. The
pre-activation value of a neuron is obtained after applying the linear transformation to its
inputs, but before applying the activation function. The pre-activation value of the hidden
variable h = Φ(ah) is ah. The differences between the pre-activation and post-activation values
within a neuron are shown in Figure 1.7. Therefore, instead of Equation 1.23, one can use
the following chain rule:
$$
\frac{\partial L}{\partial w_{(h_{r-1},h_r)}}
= \underbrace{\frac{\partial L}{\partial o}\cdot \Phi'(a_o)\cdot
\left[\sum_{[h_r,h_{r+1},\ldots,h_k,o]\in P}
\frac{\partial a_o}{\partial a_{h_k}}\prod_{i=r}^{k-1}\frac{\partial a_{h_{i+1}}}{\partial a_{h_i}}\right]}_{\text{Backpropagation computes } \delta(h_r,o)=\partial L/\partial a_{h_r}}
\cdot
\underbrace{\frac{\partial a_{h_r}}{\partial w_{(h_{r-1},h_r)}}}_{h_{r-1}}
\qquad (1.28)
$$
Here, we have introduced the notation δ(hr, o) = ∂L/∂ahr instead of Δ(hr, o) = ∂L/∂hr for setting
up the recursive equation. The value of δ(o, o) = ∂L/∂ao is initialized as follows:
$$
\delta(o, o) = \frac{\partial L}{\partial a_o} = \Phi'(a_o)\cdot\frac{\partial L}{\partial o}
\qquad (1.29)
$$
Then, one can use the multivariable chain rule to set up a similar recursion:
$$
\delta(h_r, o) = \frac{\partial L}{\partial a_{h_r}}
= \sum_{h : h_r \Rightarrow h}
\underbrace{\frac{\partial L}{\partial a_h}}_{\delta(h,o)}\cdot
\underbrace{\frac{\partial a_h}{\partial a_{h_r}}}_{\Phi'(a_{h_r})\, w_{(h_r,h)}}
= \Phi'(a_{h_r})\sum_{h : h_r \Rightarrow h} w_{(h_r,h)}\cdot \delta(h, o)
\qquad (1.30)
$$
This form of the recursion is the one found more commonly in textbooks discussing backpropagation.
The partial derivative of the loss with respect to the weight is then computed using δ(hr, o)
as follows:

$$
\frac{\partial L}{\partial w_{(h_{r-1},h_r)}} = \delta(h_r, o)\cdot h_{r-1} \qquad (1.31)
$$
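The pre-activation (δ) form of the recursion in Equations 1.29-1.31 differs only in which quantity is carried backwards. The sketch below is again an illustrative, assumption-laden example rather than prescribed code; it reuses the sigmoid_prime helper, the toy network, and the variables (a_vals, h_vals, weights, dL_do, gradients) from the previous sketch, and checks that both formulations produce identical gradients.

```python
import numpy as np

def backward_delta(a_vals, h_vals, weights, dL_do):
    """Backward pass in the pre-activation form delta(h_r, o) = dL/da_{h_r} (Eqs. 1.29-1.31)."""
    grads = [None] * len(weights)
    delta = sigmoid_prime(a_vals[-1]) * dL_do        # Eq. 1.29: delta(o, o) = Phi'(a_o) * dL/do
    for i in reversed(range(len(weights))):
        grads[i] = np.outer(delta, h_vals[i])        # Eq. 1.31: dL/dw = delta(h_r, o) * h_{r-1}
        if i > 0:
            # Eq. 1.30: delta(h_r, o) = Phi'(a_{h_r}) * sum_h w_{(h_r, h)} * delta(h, o)
            delta = sigmoid_prime(a_vals[i - 1]) * (weights[i].T @ delta)
    return grads

# Both formulations yield the same gradients on the toy network from the previous sketch.
grads_delta = backward_delta(a_vals, h_vals, weights, dL_do)
assert all(np.allclose(g1, g2) for g1, g2 in zip(gradients, grads_delta))
```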
As with the single-layer network, the process of updating the weights is repeated to conver-
gence by repeatedly cycling through the training data in epochs. A neural network may
sometimes require thousands of epochs through the training data to learn the weights at
the different nodes. A detailed description of the backpropagation algorithm and associated
issues is provided in Chapter 3; in this chapter, we provide only a brief discussion of these issues.
1.4 Practical Issues in Neural Network Training
In spite of the formidable reputation of neural networks as universal function approximators,
considerable challenges remain with respect to actually training neural networks to provide
this level of performance. These challenges are primarily related to several practical problems
associated with training, the most important one of which is overfitting.
1.4.1 The Problem of Overfitting
The problem of overfitting refers to the fact that fitting a model to a particular training
data set does not guarantee that it will provide good prediction performance on unseen test
data, even if the model predicts the targets on the training data perfectly. In other words,
there is always a gap between the training and test data performance, which is particularly
large when the models are complex and the data set is small.
In order to understand this point, consider a simple single-layer neural network on a
data set with five attributes, where we use the identity activation to learn a real-valued
target variable. This architecture is almost identical to that of Figure 1.3, except that the
identity activation function is used in order to predict a real-valued target. Therefore, the
network tries to learn the following function:
$$
\hat{y} = \sum_{i=1}^{5} w_i \cdot x_i \qquad (1.32)
$$
Consider a situation in which the observed target value is real and is always twice the
value of the first attribute, whereas other attributes are completely unrelated to the target.
However, we have only four training instances, which is one less than the number of features
(free parameters). For example, the training instances could be as follows:
x1 x2 x3 x4 x5 y
1 1 0 0 0 2
2 0 1 0 0 4
3 0 0 1 0 6
4 0 0 0 1 8
The correct parameter vector in this case is W = [2, 0, 0, 0, 0] based on the known rela-
tionship between the first feature and target. The training data also provides zero error
with this solution, although the relationship needs to be learned from the given instances
since it is not given to us a priori. However, the problem is that the number of training
points is fewer than the number of parameters and it is possible to find an infinite number
of solutions with zero error. For example, the parameter set [0, 2, 4, 6, 8] also provides zero
error on the training data. However, if we used this solution on unseen test data, it is likely
to provide very poor performance because the learned parameters are spuriously inferred
and are unlikely to generalize well to new points in which the target is twice the first at-
tribute (and other attributes are random). This type of spurious inference is caused by the
paucity of training data, where random nuances are encoded into the model. As a result,
the solution does not generalize well to unseen test data. This situation is analogous to
rote learning, which is highly predictive on the training data but not on unseen
test data. Increasing the number of training instances improves the generalization power
of the model, whereas increasing the complexity of the model reduces its generalization
power. At the same time, when a lot of training data is available, an overly simple model
is unlikely to capture complex relationships between the features and target. A good rule
of thumb is that the total number of training data points should be at least 2 to 3 times
larger than the number of parameters in the neural network, although the precise number
of data instances depends on the specific model at hand. In general, models with a larger
number of parameters are said to have high capacity, and they require a larger amount of
data in order to gain generalization power to unseen test data. The notion of overfitting is
often understood in the trade-off between bias and variance in machine learning. The key
take-away from the notion of bias-variance trade-off is that one does not always win with
more powerful (i.e., less biased) models when working with limited training data, because
of the higher variance of these models. For example, if we change the training data in the
table above to a different set of four points, we are likely to learn a completely different set
of parameters (from the random nuances of those points). This new model is likely to yield
a completely different prediction on the same test instance as compared to the predictions
using the first training data set. This type of variation in the prediction of the same test
instance using different training data sets is a manifestation of model variance, which also
adds to the error of the model; after all, both predictions of the same test instance could not
possibly be correct. More complex models have the drawback of seeing spurious patterns
in random nuances, especially when the training data are insufficient. One must be careful
to pick an optimum point when deciding the complexity of the model. These notions are
described in detail in Chapter 4.
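This can be verified numerically. The sketch below is purely illustrative (NumPy and the specific test point are assumptions): it shows that both the intended solution and the spurious one achieve zero training error on the four instances of the table, yet give very different predictions on an unseen point whose target is twice its first attribute.

```python
import numpy as np

# The four training instances from the table above: five attributes, target y = 2 * x1.
X = np.array([[1, 1, 0, 0, 0],
              [2, 0, 1, 0, 0],
              [3, 0, 0, 1, 0],
              [4, 0, 0, 0, 1]], dtype=float)
y = np.array([2, 4, 6, 8], dtype=float)

w_true = np.array([2, 0, 0, 0, 0], dtype=float)       # the intended relationship
w_spurious = np.array([0, 2, 4, 6, 8], dtype=float)   # also fits the training data exactly

print(X @ w_true - y)       # [0. 0. 0. 0.]  -> zero training error
print(X @ w_spurious - y)   # [0. 0. 0. 0.]  -> zero training error as well

# An unseen (hypothetical) test point whose correct target is 2 * x1 = 10:
x_test = np.array([5, 0.3, -0.2, 0.1, 0.4])
print(x_test @ w_true)      # 10.0 (correct)
print(x_test @ w_spurious)  # 3.6  (far from the correct value of 10)
```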
Neural networks have always been known to theoretically be powerful enough to ap-
proximate any function [208]. However, the lack of data availability can result in poor
performance; this is one of the reasons that neural networks only recently achieved promi-
nence. The greater availability of data has revealed the advantages of neural networks over
traditional machine learning (cf. Figure 1.2). In general, neural networks require careful
design to minimize the harmful effects of overfitting, even when a large amount of data is
available. This section provides an overview of some of the design methods used to mitigate
the impact of overfitting.
1.4.1.1 Regularization
Since a larger number of parameters causes overfitting, a natural approach is to constrain
the model to use fewer non-zero parameters. In the previous example, if we constrain the
vector W to have only one non-zero component out of five components, it will correctly
obtain the solution [2, 0, 0, 0, 0]. Smaller absolute values of the parameters also tend to
overfit less. Since it is hard to constrain the values of the parameters, the softer approach
of adding the penalty λ‖W‖^p to the loss function is used. The value of p is typically set to
2, which leads to Tikhonov regularization. In general, the squared value of each parameter
(multiplied with the regularization parameter λ > 0) is added to the objective function.
The practical effect of this change is that a quantity proportional to λwi is subtracted from
the update of the parameter wi. An example of a regularized version of Equation 1.6 for
mini-batch S and update step-size α > 0 is as follows:

$$
W \Leftarrow W(1-\alpha\lambda) + \alpha \sum_{X \in S} E(X)\, X \qquad (1.33)
$$
Here, E(X) = (y − ŷ) represents the current error between the observed and predicted values
of training instance X. One can view this type of penalization as a kind of weight decay
during the updates. Regularization is particularly important when the amount of available
data is limited. A neat biological interpretation of regularization is that it corresponds to
gradual forgetting, as a result of which “less important” (i.e., noisy) patterns are removed.
In general, it is often advisable to use more complex models with regularization rather than
simpler models without regularization.
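A minimal sketch of the weight-decay update of Equation 1.33 for a single linear unit is given below. The mini-batch contents, the learning rate, the regularization parameter, and the use of NumPy are illustrative assumptions rather than values prescribed by the text.

```python
import numpy as np

def regularized_update(W, batch_X, batch_y, alpha=0.01, lam=0.1):
    """One mini-batch update with L2 regularization (weight decay), per Equation 1.33."""
    errors = batch_y - batch_X @ W          # E(X) = y - y_hat for each instance in the batch
    # Shrink the weights by (1 - alpha * lam), then apply the usual error-driven update.
    return W * (1 - alpha * lam) + alpha * (batch_X.T @ errors)

# Toy usage with the same "target = twice the first attribute" setup as above.
rng = np.random.default_rng(1)
W = rng.standard_normal(5)
batch_X = rng.standard_normal((8, 5))
batch_y = 2 * batch_X[:, 0]
W = regularized_update(W, batch_X, batch_y)
```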
As a side note, the general form of Equation 1.33 is used by many regularized machine
learning models like least-squares regression (cf. Chapter 2), where E(X) is replaced by the
error-function of that specific model. Interestingly, weight decay is only sparingly used in the
single-layer perceptron3
because it can sometimes cause overly rapid forgetting with a small
number of recently misclassified training points dominating the weight vector; the main
issue is that the perceptron criterion is already a degenerate loss function with a minimum
value of 0 at W = 0 (unlike its hinge-loss or least-squares cousins). This quirk is a legacy
of the fact that the single-layer perceptron was originally defined in terms of biologically
inspired updates rather than in terms of carefully thought-out loss functions. Convergence
to an optimal solution was never guaranteed other than in linearly separable cases. For the
single-layer perceptron, some other regularization techniques, which are discussed below,
are more commonly used.
1.4.1.2 Neural Architecture and Parameter Sharing
The most effective way of building a neural network is by constructing the architecture of the
neural network after giving some thought to the underlying data domain. For example, the
successive words in a sentence are often related to one another, whereas the nearby pixels
in an image are typically related. These types of insights are used to create specialized
architectures for text and image data with fewer parameters. Furthermore, many of the
parameters might be shared. For example, a convolutional neural network uses the same
set of parameters to learn the characteristics of a local block of the image. The recent
advancements in the use of neural networks like recurrent neural networks and convolutional
neural networks are examples of this phenomenon.
1.4.1.3 Early Stopping
Another common form of regularization is early stopping, in which the gradient descent is
ended after only a few iterations. One way to decide the stopping point is by holding out a
part of the training data, and then testing the error of the model on the held-out set. The
gradient-descent approach is terminated when the error on the held-out set begins to rise.
Early stopping essentially confines the learned parameters to a smaller neighborhood around
their initial values. From this point of view, early stopping acts as
a regularizer because it effectively restricts the parameter space.
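The procedure just described can be sketched as a generic training loop. The helper callables train_one_epoch and validation_error are hypothetical placeholders supplied by the caller, and the small patience window is a common practical refinement assumed here rather than something mandated by the text.

```python
def train_with_early_stopping(train_one_epoch, validation_error, max_epochs=1000, patience=5):
    """Run gradient descent, but stop when the held-out (validation) error begins to rise.

    Both arguments are caller-supplied callables (hypothetical here): the first performs one
    epoch of gradient descent, the second returns the error on the held-out portion of the
    training data.
    """
    best_error, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        error = validation_error()
        if error < best_error:
            best_error, epochs_without_improvement = error, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break        # held-out error has stopped improving: terminate early
    return best_error
```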
1.4.1.4 Trading Off Breadth for Depth
As discussed earlier, a two-layer neural network can be used as a universal function approx-
imator [208], if a large number of hidden units are used within the hidden layer. It turns out
that networks with more layers (i.e., greater depth) tend to require far fewer units per layer
because the composition functions created by successive layers make the neural network
more powerful. Increased depth is a form of regularization, as the features in later layers
are forced to obey a particular type of structure imposed by the earlier layers. Increased
constraints reduce the capacity of the network, which is helpful when there are limitations
on the amount of available data. A brief explanation of this type of behavior is given in
Section 1.5. The number of units in each layer can typically be reduced to such an extent
that a deep network often has far fewer parameters even when added up over the greater
number of layers. This observation has led to an explosion in research on the topic of deep
learning.
3Weight decay is generally used with other loss functions in single-layer models and in all multi-layer
models with a large number of parameters.
Even though deep networks have fewer problems with respect to overfitting, they come
with a different family of problems associated with ease of training. In particular, the loss
derivatives with respect to the weights in different layers of the network tend to have vastly
different magnitudes, which causes challenges in properly choosing step sizes. Different
manifestations of this undesirable behavior are referred to as the vanishing and exploding
gradient problems. Furthermore, deep networks often take unreasonably long to converge.
These issues and design choices will be discussed later in this section and at several places
throughout the book.
1.4.1.5 Ensemble Methods
A variety of ensemble methods like bagging are used in order to increase the generalization
power of the model. These methods are applicable not just to neural networks but to
any type of machine learning algorithm. However, in recent years, a number of ensemble
methods that are specifically focused on neural networks have also been proposed. Two
such methods include Dropout and Dropconnect. These methods can be combined with
many neural network architectures to obtain an additional accuracy improvement of about
2% in many real settings. However, the precise improvement depends on the type of data and
the nature of the underlying training. For example, normalizing the activations in hidden
layers can reduce the effectiveness of Dropout methods, although one can gain from the
normalization itself. Ensemble methods are discussed in Chapter 4.
1.4.2 The Vanishing and Exploding Gradient Problems
While increasing depth often reduces the number of parameters of the network, it leads to
different types of practical issues. Propagating backwards using the chain rule has its draw-
backs in networks with a large number of layers in terms of the stability of the updates. In
particular, the updates in earlier layers can either be negligibly small (vanishing gradient) or
they can be increasingly large (exploding gradient) in certain types of neural network archi-
tectures. This is primarily caused by the chain-like product computation in Equation 1.23,
which can either exponentially increase or decay over the length of the path. In order to
understand this point, consider a situation in which we have a multi-layer network with one
neuron in each layer. Each local derivative along a path can be shown to be the product of
the weight and the derivative of the activation function. The overall backpropagated deriva-
tive is the product of these values. If each such value is randomly distributed, and has an
expected value less than 1, the product of these derivatives in Equation 1.23 will drop off ex-
ponentially fast with path length. If the individual values on the path have expected values
greater than 1, it will typically cause the gradient to explode. Even if the local derivatives
are randomly distributed with an expected value of exactly 1, the overall derivative will
typically show instability depending on how the values are actually distributed. In other
words, the vanishing and exploding gradient problems are rather natural to deep networks,
which makes their training process unstable.
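A quick numerical illustration of this instability (a sketch under assumed weight scales, not an experiment from the text) multiplies the layer-wise factors, each the product of a weight and a sigmoid derivative, along a long path of single-neuron layers:

```python
import numpy as np

rng = np.random.default_rng(2)
depth, trials = 100, 1000

for weight_scale in (0.5, 2.0, 20.0):     # arbitrary illustrative weight magnitudes
    logs = []
    for _ in range(trials):
        w = weight_scale * rng.standard_normal(depth)   # one weight per edge on the path
        a = rng.standard_normal(depth)                  # pre-activation at each layer
        s = 1.0 / (1.0 + np.exp(-a))
        local = w * s * (1.0 - s)                       # local factor: weight times sigmoid derivative
        logs.append(np.sum(np.log10(np.abs(local))))    # log10 |product of local factors|
    # Strongly negative average -> vanishing gradient; positive average -> exploding gradient.
    print(weight_scale, np.mean(logs))
```

Because the sigmoid derivative is at most 0.25, the product typically collapses toward zero even for moderately sized weights, and only quite large weights push the average factor above one and make the gradient explode.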
Many solutions have been proposed to address this issue. For example, a sigmoid activa-
tion often encourages the vanishing gradient problem, because its derivative is at most 0.25
over all values of its argument (see Exercise 7), and is extremely small at saturation. A ReLU
activation unit is known to be less likely to create a vanishing gradient problem because its
derivative is always 1 for positive values of the argument. More discussions of this issue are
provided in Chapter 3. Aside from the use of the ReLU, a whole host of gradient-descent
tricks are used to improve the convergence behavior of the training process. In particular, the use
of adaptive learning rates and conjugate gradient methods can help in many cases. Further-
more, a recent technique called batch normalization is helpful in addressing some of these
issues. These techniques are discussed in Chapter 3.
1.4.3 Difficulties in Convergence
Sufficiently fast convergence of the optimization process is difficult to achieve with very
deep networks, as depth leads to increased resistance to the training process in terms of
letting the gradients smoothly flow through the network. This problem is somewhat related
to the vanishing gradient problem, but has its own unique characteristics. Therefore, some
“tricks” have been proposed in the literature for these cases, including the use of gating
networks and residual networks [184]. These methods are discussed in Chapters 7 and 8,
respectively.
1.4.4 Local and Spurious Optima
The loss function of a neural network is highly nonlinear and typically has many local
optima. When the parameter space is large and there are many local optima, it makes sense
to spend some effort in picking good initialization points. One such method for improving
neural network initialization is referred to as pretraining. The basic idea is to use either
supervised or unsupervised training on shallow sub-networks of the original network in
order to create the initial weights. This type of pretraining is done in a greedy and layerwise
fashion in which a single layer of the network is trained at one time in order to learn
the initialization points of that layer. This type of approach provides initialization points
that ignore drastically irrelevant parts of the parameter space to begin with. Furthermore,
unsupervised pretraining often tends to avoid problems associated with overfitting. The
basic idea here is that some of the minima in the loss function are spurious optima because
they are exhibited only in the training data and not in the test data. Using unsupervised
pretraining tends to move the initialization point closer to the basin of “good” optima in
the test data. This is an issue associated with model generalization. Methods for pretraining
are discussed in Section 4.7 of Chapter 4.
Interestingly, the notion of spurious optima is often viewed from the lens of model gen-
eralization in neural networks. This is a different perspective from traditional optimization.
In traditional optimization, one does not focus on the differences in the loss functions of
the training and test data, but on the shape of the loss function in only the training data.
Surprisingly, the problem of local optima (from a traditional perspective) is a smaller issue
in neural networks than one might normally expect from such a nonlinear function. Most
of the time, the nonlinearity causes problems during the training process itself (e.g., failure
to converge), rather than getting stuck in a local minimum.
1.4.5 Computational Challenges
A significant challenge in neural network design is the running time required to train the
network. It is not uncommon to require weeks to train neural networks in the text and image
domains. In recent years, advances in hardware technology such as Graphics Processor Units
(GPUs) have helped to a significant extent. GPUs are specialized hardware processors that
can significantly speed up the kinds of operations commonly used in neural networks. In
this sense, some algorithmic frameworks like Torch are particularly convenient because they
have GPU support tightly integrated into the platform.
Although algorithmic advancements have played a role in the recent excitement around
deep learning, a lot of the gains have come from the fact that the same algorithms can do
much more on modern hardware. Faster hardware also supports algorithmic development,
because one needs to repeatedly test computationally intensive algorithms to understand
what works and what does not. For example, a recent neural model such as the long short-
term memory has changed only modestly [150] since it was first proposed in 1997 [204]. Yet,
the potential of this model has been recognized only recently because of the advances in
computational power of modern machines and algorithmic tweaks associated with improved
experimentation.
One convenient property of the vast majority of neural network models is that most of
the computational heavy lifting is front loaded during the training phase, and the prediction
phase is often computationally efficient, because it requires a small number of operations
(depending on the number of layers). This is important because the prediction phase is
often far more time-critical compared to the training phase. For example, it is far more
important to classify an image in real time (with a pre-built model), although the actual
building of that model might have required a few weeks over millions of images. Methods
have also been designed to compress trained networks in order to enable their deployment
in mobile and space-constrained settings. These issues are discussed in Chapter 3.
1.5 The Secrets to the Power of Function Composition
Even though the biological metaphor sounds like an exciting way to intuitively justify the
computational power of a neural network, it does not provide a complete picture of the
settings in which neural networks perform well. At its most basic level, a neural network is
a computational graph that performs compositions of simpler functions to provide a more
complex function. Much of the power of deep learning arises from the fact that repeated
composition of multiple nonlinear functions has significant expressive power. Even though
the work in [208] shows that a single composition of a large number of squashing functions
can approximate almost any function, this approach requires an extremely large number
of units (i.e., parameters) in the network. This increases the capacity of the network, which
causes overfitting unless the data set is extremely large. Much of the power of deep learning
arises from the fact that the repeated composition of certain types of functions increases the
representation power of the network, and therefore reduces the parameter space required for
learning.
Not all base functions are equally good at achieving this goal. In fact, the nonlinear
squashing functions used in neural networks are not arbitrarily chosen, but are carefully
designed because of certain types of properties. For example, imagine a situation in which
the identity activation function is used in each layer, so that only linear functions are
computed. In such a case, the resulting neural network is no stronger than a single-layer,
linear network:
Theorem 1.5.1 A multi-layer network that uses only the identity activation function in
all its layers reduces to a single-layer network performing linear regression.
Proof: Consider a network containing k hidden layers; it therefore contains a total of
(k + 1) computational layers (including the output layer). The corresponding (k + 1) weight
matrices between successive layers are denoted by W1 . . . Wk+1. Let x be the d-dimensional
column vector corresponding to the input, h1 . . . hk be the column vectors corresponding to
the hidden layers, and o be the m-dimensional column vector corresponding to the output.
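Although the rest of the argument is algebraic, the claim is easy to check numerically. The sketch below (the layer widths, random weights, and use of NumPy are arbitrary illustrative choices) verifies that a stack of identity-activation layers computes exactly the same function as the single matrix obtained by multiplying the weight matrices together:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 6, 3
widths = [d, 10, 7, 5, m]                       # input, three hidden layers, output
Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(len(widths) - 1)]

x = rng.standard_normal(d)

# Multi-layer forward pass with the identity activation in every layer.
h = x
for W in Ws:
    h = W @ h

# Equivalent single matrix: the product W_{k+1} W_k ... W_1.
W_single = Ws[-1]
for W in reversed(Ws[:-1]):
    W_single = W_single @ W

assert np.allclose(h, W_single @ x)             # the deep linear network is just one linear map
```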
Random documents with unrelated
content Scribd suggests to you:
prend cette chose tellement fanée
et empuantie de toutes les puanteurs
dans sa tendre et blanche main.
Arrachée par le talent du poète,
dans un doux accord avec le beau,
une larme lentement s'écoule
et tendrement fait sa part dans l'œuvre commune:
nul lecteur qui n'y laisse une tache!
O pensée grandiose et puissante!
O résultat merveilleux!
Qu'il est béni des dieux le poète
qui possède un si noble talent!
Grands et Petits, Pauvres et Riches,
cette crasse est l'œuvre de tous!
Ah! celui qui vit encore dans l'osbcurité,
qui lutte pour se hausser jusqu'au laurier,
assurément sent, dans sa brûlante ardeur,
un désir lui tirailler le sein.
Dieu bon, implore-t-il chaque jour,
Accorde-moi ce bonheur indicible:
fais que mes pauvres livres de vers
soient aussi gras et crasseux!
Mais si les poètes aspirent aux embrassements «de la grande
impudique
Qui tient dans ses bras l'univers,
s'ils sont tellement avides du bruit qu'ils ouvrent leur escarcelle toute
grande à la popularité, cette «gloire en gros sous», il n'en est point
de même des vrais amants des livres, de ceux qui ne les font pas,
mais qui les achètent, les parent, les enchâssent, en délectent leurs
doigts, leurs yeux, et parfois leur esprit.
Ecoutez la tirade mise par un poète anglais dans la bouche d'un
bibliophile qui a prêté à un infidèle ami une reliure de Trautz-
Bauzonnet et qui ne l'a jamais revue:
Une fois prêté, un livre est perdu...
Prêter des livres! Parbleu, je n'y consentirai
plus.
Vos prêteurs faciles ne sont que des fous que je
redoute.
Si les gens veulent des livres, par le grand
Grolier, qu'ils les achètent!
Qui est-ce qui prête sa femme lorsqu'il peut se
dispenser du prêt?
Nos femmes seront-elles donc tenues pour plus
que nos livres chères?
Nous en préserve de Thou! Jamais plus de livres
ne prêterai.
Ne dirait-on pas que c'est pour ce bibliophile échaudé que fut faite
cette imitation supérieurement réussie des inscriptions dont les
écoliers sont prodigues sur leurs rudiments et Selectæ:
Qui ce livre volera,
Pro suis criminibus
Au gibet il dansera,
Pedibus penditibus.
Ce châtiment n'eût pas dépassé les mérites de celui contre lequel
Lebrun fit son épigramme «à un Abbé qui aimait les lettres et un peu
trop mes livres»:
Non, tu n'es point de ces abbés ignares,
Qui n'ont jamais rien lu que le Missel:
Des bons écrits tu savoures le sel,
Et te connais en livres beaux et rares.
Trop bien le sais! car, lorsqu'à pas de loup
Tu viens chez moi feuilleter coup sur coup
Mes Elzévirs, ils craignent ton approche.
Dans ta mémoire il en reste beaucoup;
Beaucoup aussi te restent dans la poche.
Un amateur de livres de nuance libérale pourrait adopter pour devise
cette inscription mise à l'entrée d'une bibliothèque populaire
anglaise:
Tolle, aperi, recita, ne lœdas, claude, rapine!
ce qui, traduit librement, signifie: «Prends, ouvre, lis, n'abîme pas,
referme, mais surtout mets en place!»
Punch, le Charivari d'Outre-Manche, en même temps qu'il incarne
pour les Anglais notre Polichinelle et le Pulcinello des Italiens,
résume à merveille la question. Voici, dit-il, «la tenue des livres
enseignée en une leçon:—Ne les prêtez pas.»
VII
C'est qu'ils sont précieux, non pas tant par leur valeur intrinsèque,—
bien que certains d'entre eux représentent plus que leur poids d'or,—
que parce qu'on les aime, d'amour complexe peut-être, mais à coup
sûr d'amour vrai.
«Accordez-moi, seigneur, disait un ancien (c'est Jules Janin qui
rapporte ces paroles), une maison pleine de livres, un jardin plein de
fleurs!—Voulez-vous, disait-il encore, un abrégé de toutes les
misères humaines, regardez un malheureux qui vend ses livres:
Bibliothecam vendat.»
Si le malheureux vend ses livres parce qu'il y est contraint, non pas
par un caprice, une toquade de spéculation, une saute de goût,
passant de la bibliophilie à l'iconophilie ou à la faïençomanie ou à
tout autre dada frais éclos dans sa cervelle, ou encore sous le coup
d'une passionnette irrésistible dont quelques mois auront bientôt usé
l'éternité, comme il advint à Asselineau qui se défit de sa
bibliothèque pour suivre une femme et qui peu après se défit de la
femme pour se refaire une bibliothèque, si c'est, dis-je, par misère
pure, il faut qu'il soit bien marqué par le destin et qu'il ait de triples
galons dans l'armée des Pas-de-Chance, car les livres aiment ceux
qui les aiment et, le plus souvent leur portent bonheur. Témoin, pour
n'en citer qu'un, Grotius, qui s'échappa de prison en se mettant dans
un coffre à livres, lequel faisait la navette entre sa maison et sa
geôle, apportant et remportant les volumes qu'il avait obtenu de
faire venir de la fameuse bibliothèque formée à grands frais et avec
tant de soins, pour lui «et ses amis».
Richard de Bury, évêque de Durham et chancelier d'Angleterre, qui
vivait au XIVe siècle, rapporte, dans son Philobiblon, des vers latins
de John Salisbury, dont voici le sens:
Nul main que le fer a touchée n'est propre à
manier les livres,
ni celui dont le cœur regarde l'or avec trop de
joie;
les mêmes hommes n'aiment pas à la fois les
livres et l'argent,
et ton troupeau, ô Epicure, a pour les livres du
dégoût;
les avares et les amis des livre ne vont guère de
compagnie,
et ne demeurent point, tu peux m'en croire, en
paix sous le même toit.
«Personne donc, en conclut un peu vite le bon Richard de Bury, ne
peut servir en même temps les livres et Mammon».
Il reprend ailleurs: «Ceux qui sont férus de l'amour des livres font
bon marché du monde et des richesses».
Les temps sont quelque peu changés; il est en notre vingtième siècle
des amateurs dont on ne saurait dire s'ils estiment des livres
précieux pour en faire un jour une vente profitable, ou s'ils
dépensent de l'argent à accroître leur bibliothèque pour la seule
satisfaction de leurs goûts de collectionneur et de lettré.
Toujours est-il que le Philobiblon n'est qu'un long dithyrambe en
prose, naïf et convaincu, sur les livres et les joies qu'ils procurent. J'y
prends au hasard quelques phrases caractéristiques, qui, enfouies
dans ce vieux livre peu connu en France, n'ont pas encore eu le
temps de devenir banales parmi nous.
«Les livres nous charment lorsque la prospérité nous sourit; ils nous
réconfortent comme des amis inséparables lorsque la fortune
orageuse fronce le sourcil sur nous.»
Voilà une pensée qui a été exprimée bien des fois et que nous
retrouverons encore; mais n'a-t-elle pas un tour original qui lui
donne je ne sais quel air imprévu de nouveauté?
Le chapitre XV de l'ouvrage traite des «avantages de l'amour des
livres.» On y lit ceci:
«Il passe le pouvoir de l'intelligence humaine, quelque largement
qu'elle ait pu boire à la fontaine de Pégase, de développer
pleinement le titre du présent chapitre. Quand on parlerait avec la
langue des hommes et des anges, quand on serait devenu un
Mercure, un Tullius ou un Cicéron, quand on aurait acquis la douceur
de l'éloquence lactée de Tite-Live, on aurait encore à s'excuser de
bégayer comme Moïse, ou à confesser avec Jérémie qu'on n'est
qu'un enfant et qu'on ne sait point parler.»
Après ce début, qui s'étonnera que Richard de Bury fasse un devoir
à tous les honnêtes gens d'acheter des livres et de les aimer. «Il
n'est point de prix élevé qui doive empêcher quelqu'un d'acheter des
livres s'il a l'argent qu'on en demande, à moins que ce ne soit pour
résister aux artifices du vendeur ou pour attendre une plus favorable
occasion d'achat... Qu'on doive acheter les livres avec joie et les
vendre à regret, c'est à quoi Salomon, le soleil de l'humanité, nous
exhorte dans les Proverbes: «Achète la vérité, dit-il, et ne vends pas
la sagesse.»
On ne s'attendait guère, j'imagine, à voir Salomon dans cette affaire.
Et pourtant quoi de plus naturel que d'en appeler à l'auteur de la
Sagesse en une question qui intéresse tous les sages?
«Une bibliothèque prudemment composée est plus précieuse que
toutes les richesses, et nulle des choses qui sont désirables ne
sauraient lui être comparée. Quiconque donc se pique d'être zélé
pour la vérité, le bonheur, la sagesse ou la science, et même pour la
foi, doit nécessairement devenir un ami des livres.»
En effet, ajoute-t-il, en un élan croissant d'enthousiasme, «les livres
sont des maîtres qui nous instruisent sans verges ni férules, sans
paroles irritées, sans qu'il faille leur donner ni habits, ni argent. Si
vous venez à eux, ils ne dorment point; si vous questionnez et vous
enquérez auprès d'eux, ils ne se récusent point; ils ne grondent
point si vous faites des fautes; ils ne se moquent point de vous si
vous êtes ignorant. O livres, seuls êtres libéraux et libres, qui donnez
à tous ceux qui vous demandent, et affranchissez tous ceux qui vous
servent fidèlement!»
C'est pourquoi «les Princes, les prélats, les juges, les docteurs, et
tous les autres dirigeants de l'Etat, d'autant qu'ils ont plus que les
autres besoin de sagesse, doivent plus que les autres montrer du
zèle pour ces vases où la sagesse est contenue.»
Tel était l'avis du grand homme d'Etat Gladstone, qui acheta plus de
trente cinq mille volumes au cours de sa longue vie. «Un
collectionneur de livres, disait-il, dans une lettre adressée au fameux
libraire londonien Quaritch (9 septembre 1896), doit, suivant l'idée
que je m'en fais, posséder les six qualités suivantes: appétit, loisir,
fortune, science, discernement et persévérance.» Et plus loin:
«Collectionner des livres peut avoir ses ridicules et ses excentricités.
Mais, en somme, c'est un élément revivifiant dans une société
criblée de tant de sources de corruption.»
VIII
Cependant les livres, jusque dans la maison du bibliophile, ont un
implacable ennemi: c'est la femme. Je les entends se plaindre du
traitement que la maîtresse du logis, dès qu'elle en a l'occasion, leur
fait subir:
«La femme, toujours jalouse de l'amour qu'on nous porte, est
impossible à jamais apaiser. Si elle nous aperçoit dans quelque coin,
sans autre protection que la toile d'une araignée morte, elle nous
insulte et nous ravale, le sourcil froncé, la parole amère, affirmant
que, de tout le mobilier de la maison, nous seuls ne sommes pas
nécessaires; elle se plaint que nous ne soyons utiles à rien dans le
ménage, et elle conseille de nous convertir promptement en riches
coiffures, en soie, en pourpre deux fois teinte, en robes et en
fourrures, en laine et en toile. A dire vrai sa haine ne serait pas sans
motifs si elle pouvait voir le fond de nos cœurs, si elle avait écouté
nos secrets conseils, si elle avait lu le livre de Théophraste ou celui
de Valerius, si seulement elle avait écouté le XXVe chapitre de
l'Ecclésiaste avec des oreilles intelligentes.» (Richard de Bury.)
M. Octave Uzanne rappelle, dans les Zigs-Zags d'un Curieux, un mot
du bibliophile Jacob, frappé en manière de proverbe et qui est bien
en situation ici:
Amours de femme et de bouquin,
Ne se chantent pas au même lutrin.
Et il ajoute fort à propos: «La passion bouquinière n'admet pas de
partage; c'est un peu, il faut le dire, une passion de retraite, un
refuge extrême à cette heure de la vie où l'homme, déséquilibré par
les cahots de l'existence mondaine, s'écrie, à l'exemple de Thomas
Moore: Je n'avais jusqu'ici pour lire que les regards des femmes, et
c'est la folie qu'ils m'ont enseignée!»
Cette incapacité des femmes, sauf de rares exceptions, à goûter les
joies du bibliophile, a été souvent remarquée. Une d'elles—et c'est
ce qui rend la citation piquante—Mme Emile de Girardin, écrivait dans
la chronique qu'elle signait à la Presse du pseudonyme de Vicomte
de Launay:
«Voyez ce beau salon d'étude, ce boudoir charmant; admirez-le dans
ses détails, vous y trouverez tout ce qui peut séduire, tout ce que
vous pouvez désirer, excepté deux choses pourtant: un beau livre et
un joli tableau. Il n'y a peut-être pas dix femmes à Paris chez
lesquelles ces deux raretés puissent être admirées.»
C'est dans le même ordre d'idées que l'américain Hawthorne, le fils
de l'auteur du Faune de Marbre et de tant d'autres ouvrages où une
sereine philosophie se pare des agréments de la fiction, a écrit ces
lignes curieuses:
«Cœlebs, grand amateur de bouquins, se rase devant son miroir, et
monologue sur la femme qui, d'après son expérience, jeune ou
vieille, laide ou belle, est toujours le diable.» Et Cœlebs finit en se
donnant à lui-même ces conseils judicieux: «Donc, épouse tes livres!
Il ne recherche point d'autre maîtresse, l'homme sage qui regarde,
non la surface, mais le fond des choses. Les livres ne flirtent ni ne
feignent; ne boudent ni ne taquinent; ils ne se plaignent pas, ils
disent les choses, mais ils s'abstiennent de vous les demander.
»Que les livres soient ton harem, et toi leur Grand Turc. De rayon en
rayon, ils attendent tes faveurs, silencieux et soumis! Jamais la
jalousie ne les agite. Je n'ai nulle part rencontré Vénus, et j'accorde
qu'elle est belle; toujours est-il qu'elle n'est pas de beaucoup si
accommodante qu'eux.»
IX
Comment n'aimerait-on pas les livres? Il en est pour tous les goûts,
ainsi qu'un auteur du Chansonnier des Grâces le fait chanter à un
libraire vaudevillesque (1820):
Venez, lecteurs, chez un libraire
De vous servir toujours jaloux;
Vos besoins ainsi que vos goûts
Chez moi pourront se satisfaire.
J'offre la Grammaire aux auteurs,
Des Vers à nos jeunes poëtes;
L'Esprit des lois aux procureurs,
L'Essai sur l'homme à nos coquettes...
Aux plus célèbres gastronomes
Je donne Racine et Boileau!
La Harpe aux chanteurs de caveau,
Les Nuits d'Young aux astronomes;
J'ai Descartes pour les joueurs,
Voiture pour toutes les belles,
Lucrèce pour les amateurs,
Martial pour les demoiselles.
Pour le plaideur et l'adversaire
J'aurai l'avocat Patelin;
Le malade et le médecin
Chez moi consulteront Molière:
Pour un sexe trop confiant
Je garde le Berger fidèle;
Et pour le malheureux amant
Je réserverai la Pucelle.
Armand Gouffé était d'un autre avis lorsqu'il fredonnait:
Un sot avec cent mille francs
Peut se passer de livres.
Mais les sots très riches ont généralement juste assez d'esprit pour
retrancher et masquer leur sottise derrière l'apparat imposant d'une
grande bibliothèque, où les bons livres consacrés par le temps et le
jugement universel se partagent les rayons avec les ouvrages à la
mode. Car si, comme le dit le proverbe allemand, «l'âne n'est pas
savant parce qu'il est chargé de livres», il est des cas où l'amas des
livres peut cacher un moment la nature de l'animal.
C'est en pensant aux amateurs de cet acabit que Chamfort a formulé
cette maxime: «L'espoir n'est souvent au cœur que ce que la
bibliothèque d'un château est à la personne du maître.»
Lilly, le fameux auteur d'Euphues, disait: «Aie ton cabinet plein de
livres plutôt que ta bourse pleine d'argent». Le malheur est que
remplir l'un a vite fait de vider l'autre, si les sources dont celle-ci
s'alimente ne sont pas d'une abondance continue.
L'historien Gibbon allait plus loin lorsqu'il déclarait qu'il n'échangerait
pas le goût de la lecture contre tous les trésors de l'Inde. De même
Macaulay, qui aurait mieux aimé être un pauvre homme avec des
livres qu'un grand roi sans livres.
Bien avant eux, Claudius Clément, dans son traité latin des
bibliothèques, tant privées que publiques, émettait, avec des
restrictions de sage morale, une idée semblable: «Il y a peu de
dépenses, de profusions, je dirais même de prodigalités plus
louables que celles qu'on fait pour les livres, lorsqu'en eux on
cherche un refuge, la volupté de l'âme, l'honneur, la pureté des
mœurs, la doctrine et un renom immortel.»
«L'or, écrivait Pétrarque à son frère Gérard, l'argent, les pierres
précieuses, les vêtements de pourpre, les domaines, les tableaux,
les chevaux, toutes les autres choses de ce genre offrent un plaisir
changeant et de surface: les livres nous réjouissent jusqu'aux
moëlles.»
C'est encore Pétrarque qui traçait ce tableau ingénieux et charmant:
«J'ai des amis dont la société m'est extrêmement agréable; ils sont
de tous les âges et de tous les pays. Ils se sont distingués dans les
conseils et sur les champs de bataille, et ont obtenu de grands
honneurs par leur connaissance des sciences. Il est facile de trouver
accès près d'eux; en effet ils sont toujours à mon service, je les
admets dans ma société ou les congédie quand il me plaît. Ils ne
sont jamais importuns, et ils répondent aussitôt à toutes les
questions que je leur pose. Les uns me racontent les événements
des siècles passés, les autres me révèlent les secrets de la nature. Il
en est qui m'apprennent à vivre, d'autres à mourir. Certains, par leur
vivacité, chassent mes soucis et répandent en moi la gaieté: d'autres
donnent du courage à mon âme, m'enseignant la science si
importante de contenir ses désirs et de ne compter absolument que
sur soi. Bref, ils m'ouvrent les différentes avenues de tous les arts et
de toutes les sciences, et je peux, sans risque, me fier à eux en
toute occasion. En retour de leurs services, ils ne me demandent
que de leur fournir une chambre commode dans quelque coin de
mon humble demeure, où ils puissent reposer en paix, car ces amis-
là trouvent plus de charmes à la tranquillité de la retraite qu'au
tumulte de la société.»
Il faut comparer ce morceau au passage où notre Montaigne, après
avoir parlé du commerce des hommes et de l'amour des femmes,
dont il dit: «l'un est ennuyeux par sa rareté, l'aultre se flestrit par
l'usage», déclare que celui des livres «est bien plus seur et plus à
nous; il cède aux premiers les aultres advantages, mais il a pour sa
part la constance et facilité de son service... Il me console en la
vieillesse et en la solitude; il me descharge du poids d'une oysiveté
ennuyeuse et me desfaict à toute heure des compagnies qui me
faschent; il esmousse les poinctures de la douleur, si elle n'est du
tout extrême et maistresse. Pour me distraire d'une imagination
importune, il n'est que de recourir aux livres...
«Le fruict que je tire des livres... j'en jouïs, comme les avaricieux des
trésors, pour sçavoir que j'en jouïray quand il me plaira: mon âme se
rassasie et contente de ce droit de possession... Il ne se peult dire
combien je me repose et séjourne en ceste considération qu'ils sont
à mon côté pour me donner du plaisir à mon heure, et à
recognoistre combien ils portent de secours à ma vie. C'est la
meilleure munition que j'aye trouvé à cest humain voyage; et plainds
extrêmement les hommes d'entendement qui l'ont à dire.»
Sur ce thème, les variations sont infinies et rivalisent d'éclat et
d'ampleur.
Le roi d'Egypte Osymandias, dont la mémoire inspira à Shelley un
sonnet si beau, avait inscrit au-dessus de sa «librairie»:
Pharmacie de l'âme.
«Une chambre sans livres est un corps sans âme», disait Cicéron.
«La poussière des bibliothèques est une poussière féconde»,
renchérit Werdet.
«Les livres ont toujours été la passion des honnêtes gens», affirme
Ménage.
Sir John Herschel était sûrement de ces honnêtes gens dont parle le
bel esprit érudit du XVIIe siècle, car il fait cette déclaration, que
Gibbon eût signée:
«Si j'avais à demander un goût qui pût me conserver ferme au
milieu des circonstances les plus diverses et être pour moi une
source de bonheur et de gaieté à travers la vie et un bouclier contre
ses maux, quelque adverses que pussent être les circonstances et de
quelques rigueurs que le monde pût m'accabler, je demanderais le
goût de la lecture.»
«Autant vaut tuer un homme que détruire un bon livre», s'écrie
Milton; et ailleurs, en un latin superbe que je renonce à traduire:
Et totum rapiunt me, mea vita, libri.
«Pourquoi, demandait Louis XIV au maréchal de Vivonne, passez-
vous autant de temps avec vos livres?—Sire, c'est pour qu'ils
donnent à mon esprit le coloris, la fraîcheur et la vie que donnent à
mes joues les excellentes perdrix de Votre Majesté.»
Voilà une aimable réponse de commensal et de courtisan. Mais
combien d'enthousiastes se sentiraient choqués de cet épicuréisme
flatteur et léger! Ce n'est pas le poète anglais John Florio, qui
écrivait au commencement du même siècle, dont on eût pu attendre
une explication aussi souriante et dégagée. Il le prend plutôt au
tragique, quand il s'écrie:
«Quels pauvres souvenirs sont statues, tombes et autres
monuments que les hommes érigent aux princes, et qui restent en
des lieux fermés où quelques-uns à peine les voient, en comparaison
des livres, qui aux yeux du monde entier montrent comment ces
princes vécurent, tandis que les autres monuments montrent où ils
gisent!»
C'est à dessein, je le répète, que j'accumule les citations d'auteurs
étrangers. Non seulement, elles ont moins de chances d'être
connues, mais elles possèdent je ne sais quelle saveur d'exotisme
qu'on ne peut demander à nos écrivains nationaux.
Ecoutons Isaac Barrow exposer sagement la leçon de son
expérience:
«Celui qui aime les livres ne manque jamais d'un ami fidèle, d'un
conseiller salutaire, d'un gai compagnon, d'un soutien efficace. En
étudiant, en pensant, en lisant, l'on peut innocemment se distraire et
agréablement se récréer dans toutes les saisons comme dans toutes
les fortunes.»
Jeremy Collier, pensant de même, ne s'exprime guère autrement:
«Les livres sont un guide dans la jeunesse et une récréation dans le
grand âge. Ils nous soutiennent dans la solitude et nous empêchent
d'être à charge à nous-mêmes. Ils nous aident à oublier les ennuis
qui nous viennent des hommes et des choses; ils calment nos soucis
et nos passions; ils endorment nos déceptions. Quand nous sommes
las des vivants, nous pouvons nous tourner vers les morts: ils n'ont
dans leur commerce, ni maussaderie, ni orgueil, ni arrière-pensée.»
Parmi les joies que donnent les livres, celle de les rechercher, de les
pourchasser chez les libraires et les bouquinistes, n'est pas la
moindre. On a écrit des centaines de chroniques, des études, des
traités et des livres sur ce sujet spécial. La Physiologie des quais de
Paris, de M. Octave Uzanne, est connue de tous ceux qui
s'intéressent aux bouquins. On se rappelle moins un brillant article
de Théodore de Banville, qui parut jadis dans un supplément
littéraire du Figaro; aussi me saura-t-on gré d'en citer ce joli
passage:
«Sur le quai Voltaire, il y aurait de quoi regarder et s'amuser
pendant toute une vie; mais sans tourner, comme dit Hésiode,
autour du chêne et du rocher, je veux nommer tout de suite ce qui
est le véritable sujet, l'attrait vertigineux, le charme invincible: c'est
le Livre ou, pour parler plus exactement, le Bouquin. Il y a sur le
quai de nombreuses boutiques, dont les marchands, véritables
bibliophiles, collectionnent, achètent dans les ventes, et offrent aux
consommateurs de beaux livres à des prix assez honnêtes. Mais ce
n'est pas là ce que veut l'amateur, le fureteur, le découvreur de
trésors mal connus. Ce qu'il veut, c'est trouver pour des sous, pour
rien, dans les boîtes posées sur le parapet, des livres, des bouquins
qui ont—ou qui auront—un grand prix, ignoré du marchand.
«Et à ce sujet, un duel, qui n'a pas eu de commencement et n'aura
pas de fin, recommence et se continue sans cesse entre le marchand
et l'amateur. Le libraire, qui, naturellement, veut vendre cher sa
marchandise, se hâte de retirer des boîtes et de porter dans la
boutique tout livre soupçonné d'avoir une valeur; mais par une force
étrange et surnaturelle, le Livre s'arrange toujours pour revenir, on
ne sait pas comment ou par quels artifices, dans les boîtes du
parapet. Car lui aussi a ses opinions; il veut être acheté par
l'amateur, avec des sous, et surtout et avant tout, par amour!»
C'est ainsi que M. Jean Rameau, poète et bibliophile, raconte qu'il a
trouvé, en cette année 1901, dans une boîte des quais, à vingt-cinq
centimes, quatre volumes, dont le dos élégamment fleuri portait un
écusson avec la devise: Boutez en avant. C'était un abrégé du
Faramond de la Calprenède, et les quatre volumes avaient appartenu
à la Du Barry, dont le Boutez en avant est suffisamment
caractéristique. Que fit le poète, lorsqu'il se fut renseigné auprès du
baron de Claye, qui n'hésite point sur ces questions? Il alla dès sept
heures du matin se poster devant l'étalage, avala le brouillard de la
Seine, s'en imprégna et y développa des «rhumatismes atroces»
jusqu'à onze heures du matin,—car le bouquiniste, ami du
nonchaloir, ne vint pas plus tôt,—prit les volumes et «bouta une
pièce d'un franc» en disant: «Vous allez me laisser ça pour quinze
sous, hein?»—«Va pour quinze sous!» fit le bouquiniste bonhomme!
Et le poète s'enfuit avec son butin, et aussi, par surcroît, «avec un
petit frisson de gloire».
Puisque nous sommes sur le quai Voltaire, ne le quittons pas sans le
regarder à travers la lunette d'un poète dont le nom, Gabriel Marc,
n'éveille pas de retentissants échos, mais qui, depuis 1875, année
où il publiait ses Sonnets parisiens, a dû parfois éprouver l'émotion—
amère et douce—exprimée en trait final dans le gracieux tableau
qu'il intitule: En bouquinant.
Le quai Voltaire est un véritable musée
En plein soleil. Partout, pour charmer les
regards,
Armes, bronzes, vitraux, estampes, objets d'art,
Et notre flânerie est sans cesse amusée.
Avec leur reliure ancienne et presque usée,
Voici les manuscrits sauvés par le hasard;
Puis les livres: Montaigne, Hugo, Chénier,
Ponsard,
Ou la petite toile au Salon refusée.
Le ciel bleuâtre et clair noircit à l'horizon.
Le pêcheur à la ligne a jeté l'hameçon;
Et la Seine se ride aux souffles de la brise.
Ou la petite toile au Salon refusée.
On bouquine. On revoit, sous la poudre des
temps,
Tous les chers oubliés; et parfois, ô surprise!
Le volume de vers que l'on fit à vingt ans.
Un autre contemporain, Mr. J. Rogers Rees, qui a écrit tout un livre
sur les plaisirs du bouquineur (the Pleasures of a Bookworm), trouve
dans le commerce des livres une source de fraternité et de solidarité
humaines. «Un grand amour pour les livres, dit-il, a en soi, dans
tous les temps, le pouvoir d'élargir le cœur et de le remplir de
facultés sympathiques plus larges et véritablement éducatrices.»
Un poète américain, Mr. C. Alex. Nelson, termine une pièce à laquelle
il donne ce titre français: Les Livres, par une prière naïve, dont les
deux derniers vers sont aussi en français dans le texte:
Les amoureux du livre, tous d'un cœur
reconnaissant,
toujours exhalèrent une prière unique:
Que le bon Dieu préserve les livres
et sauve la Société!
Le vieux Chaucer ne le prenait pas de si haut: doucement et
poétiquement il avouait que l'attrait des livres était moins puissant
sur son cœur que l'attrait de la nature.
Je voudrais pouvoir mettre dans mon essai de traduction un peu du
charme poétique qui, comme un parfum très ancien, mais persistant
et d'autant plus suave, se dégage de ces vers dans le texte original.
Quant à moi, bien que je ne sache que peu de
chose,
à lire dans les livres je me délecte,
et j'y donne ma foi et ma pleine croyance,
et dans mon cœur j'en garde le respect
si sincèrement qu'il n'y a point de plaisir
qui puisse me faire quitter mes livres,
si ce n'est, quelques rares fois, le jour saint,
sauf aussi, sûrement, lorsque, le mois de mai
venu, j'entends les oiseaux chanter,
et que les fleurs commencent à surgir,—
alors adieu mon livre et ma dévotion!
Comment encore conserver en mon français sans rimes et
péniblement rythmé l'harmonie légère et gracieuse, pourtant si nette
et précise, de ce délicieux couplet d'une vieille chanson populaire,
que tout Anglais sait par cœur:
Oh! un livre et, dans l'ombre un coin,
soit à la maison, soit dehors,
les vertes feuilles chuchotant sur ma tête,
ou les cris de la rue autour de moi;
là où je puisse lire tout à mon aise
aussi bien du neuf que du vieux!
Car un brave et bon livre à parcourir
vaut pour moi mieux que de l'or!
Mais il faut s'arrêter dans l'éloge. Je ne saurais mieux conclure, sur
ce sujet entraînant, qu'en prenant à mon compte et en offrant aux
autres ces lignes d'un homme qui fut, en son temps, le «prince de la
critique» et dont le nom même commence à être oublié. Nous
pouvons tous, amis, amoureux, dévots ou maniaques du livre, nous
écrier avec Jules Janin:
«O mes livres! mes économies et mes amours! une fête à mon foyer,
un repos à l'ombre du vieil arbre, mes compagnons de voyage!... et
puis, quand tout sera fini pour moi, les témoins de ma vie et de mon
labeur!»
Neural Networks And Deep Learning Charu C Aggarwal
X
A côté de ceux qui adorent les livres, les chantent et les bénissent, il
y a ceux qui les détestent, les dénigrent et leur crient anathème; et
ceux-ci ne sont pas les moins passionnés.
On voit nettement la transition, le passage d'un de ces deux
sentiments à l'autre, en même temps que leur foncière identité, dans
ces vers de Jean Richepin (Les Blasphèmes):
Peut-être, ô Solitude, est-ce toi qui délivres
De cette ardente soif que l'ivresse des livres
Ne saurait étancher aux flots de son vin noir.
J'en ai bu comme si j'étais un entonnoir,
De ce vin fabriqué, de ce vin lamentable;
J'en ai bu jusqu'à choir lourdement sous la
table,
A pleine gueule, à plein amour, à plein cerveau.
Mais toujours, au réveil, je sentais de nouveau
L'inextinguible soif dans ma gorge plus rêche.
On ne s'étonnera pas, je pense, que sa gorge étant plus rêche, le
poète songe à la mieux rafraîchir et achète, pour ce, des livres
superbes qui lui mériteront, quand on écrira sa biographie définitive,
un chapitre, curieux entre maint autre, intitulé: «Richepin,
bibliophile.»
D'une veine plus froide et plus méprisante, mais, après tout, peu
dissemblable, sort cette boutade de Baudelaire (Œuvres
posthumes):
«L'homme d'esprit, celui qui ne s'accordera jamais avec personne,
doit s'appliquer à aimer la conversation des imbéciles et la lecture
des mauvais livres. Il en tirera des jouissances amères qui
compenseront largement sa fatigue.»
L'auteur du traité De la Bibliomanie n'y met point tant de finesse. Il
déclare tout à trac que «la folle passion des livres entraîne souvent
au libertinage et à l'incrédulité».
Encore faudrait-il savoir où commence «la folle passion», car le
même écrivain (Bollioud-Mermet) ne peut s'empêcher, un peu plus
loin, de reconnaître que «les livres simplement agréables
contiennent, ainsi que les plus sérieux, des leçons utiles pour les
cœurs droits et pour les bons esprits».
Pétrarque avait déjà exprimé une pensée analogue dans son élégant
latin de la Renaissance: «Les livres mènent certaines personnes à la
science, et certaines autres à la folie, lorsque celles-ci en absorbent
plus qu'elles ne peuvent digérer.»
Libri quosdam ad scientiam, quosdam ad insaniam deduxere, dum
plus hauriunt quam digerunt.
This recalls a neat saying attributed to the painter Doyen about a
man more erudite than judicious: "His head is the shop of a
bookseller who is moving out."
It is, in short, a question of choice. It has been repeated often
enough since Seneca, and it had surely been said more than once
before him: "What matters is not to have many books, but to have
good ones."
That is not the point of view of the bibliomaniacs; but we are not
concerned with them for the moment. As for discerning bibliophiles,
even those whom the book delights in itself far more than by what it
contains, they are quite willing to have many, but above all to have
beautiful ones, coming as close as possible to perfection; and rather
than admit flawed or mediocre copies to their shelves, they too would
take for their motto: Pauca sed bona.
"One of the diseases of this age," says an Englishman (Barnaby Rich),
"is the multitude of books, which so overload the reader that he can
no longer digest the abundance of idle matter hatched and brought
into the world every day, in forms as varied as the very features of
the authors' faces."
To own many is largesse;
To study few is wisdom,
declares a proverb quoted by Jules Janin.
Michel de Montaigne, who turned books to account as well as any man
in the world and who spoke of them in the enthusiastic and grateful
terms quoted earlier, nevertheless makes reservations, though only as
regards physical development and health.
"Books," he says, "have many qualities agreeable to those who know
how to choose them; but no good comes without pains; it is a pleasure
that is not clean and pure, any more than the others; it has its
inconveniences, and weighty ones; the soul is exercised by it, but
the body remains without action, grows heavy and languishes."
Even the soul comes at last to weariness and disgust, as the English
poet Crabbe observes: "Books cannot always please, however good they
may be; the mind does not always crave its food."
An Italian proverb brings us back, with a lively and original phrase,
to the moralists' theory of good and bad reading: "No thief is worse
than a bad book."
What thief, indeed, ever thought of stealing innocence, purity,
beliefs, noble impulses? And the moralists assure us that there are
books which strip the soul of all these.
"Better had he never been born," exclaims Walter Scott, "who reads to
arrive at doubt, who reads to arrive at contempt of what is good."
A contemporary English writer, Mr. Lowell, gives an ingenious turn to
the expression of a similar idea when he writes:
"Cato's advice, Cum bonis ambula, Walk with the good, is just as true
if extended to books, for they too impart, by insensible degrees,
their own nature to the mind that converses with them. Either they
lift us up, or they drag us down."
The wise, who weigh the pros and the cons and, holding to a happy
medium, credit books with an influence sometimes good, sometimes bad,
often nil, according to their nature and to the readers' turn of
mind, are, I believe, the most numerous.
The Hellenist Egger brings to the statement of this judiciously
balanced opinion a tone of enthusiasm from which one can guess that
he forgives the book all its misdeeds for the joys and the help it
knows how to give.
"The greatest personage who, for perhaps 3,000 years, has kept the
world talking about him, by turns giant or pygmy, proud or modest,
enterprising or timid, able to take on every form and every role,
capable in turn of enlightening minds or of perverting them, of
stirring the passions or of calming them, a maker of factions or a
reconciler of parties, a true Proteus whom no definition can capture:
such is the Book."
A little-known moralist of the eighteenth century, L.-C. d'Arc,
author of a book entitled Mes Loisirs, whom I have quoted elsewhere,
dreads excess in reading, that "labor of the lazy," as it has rather
aptly been called:
"Reading is the nourishment of the mind and sometimes the tomb of
genius."
"He who reads much runs the risk of thinking only as others have
thought."
Welcome to our website – the perfect destination for book lovers and
knowledge seekers. We believe that every book holds a new world,
offering opportunities for learning, discovery, and personal growth.
That’s why we are dedicated to bringing you a diverse collection of
books, ranging from classic literature and specialized publications to
self-development guides and children's books.
More than just a book-buying platform, we strive to be a bridge
connecting you with timeless cultural and intellectual values. With an
elegant, user-friendly interface and a smart search system, you can
quickly find the books that best suit your interests. Additionally,
our special promotions and home delivery services help you save time
and fully enjoy the pleasure of reading.
Join us on a journey of knowledge exploration, passion nurturing, and
personal growth every day!
ebookbell.com

Neural Networks And Deep Learning Charu C Aggarwal

  • 1. Neural Networks And Deep Learning Charu C Aggarwal download https://ebookbell.com/product/neural-networks-and-deep-learning- charu-c-aggarwal-59041922 Explore and download more ebooks at ebookbell.com
  • 2. Here are some recommended products that we believe you will be interested in. You can click the link to download. Neural Networks And Deep Learning A Textbook Charu C Aggarwal https://ebookbell.com/product/neural-networks-and-deep-learning-a- textbook-charu-c-aggarwal-49464354 Neural Networks And Deep Learning A Textbook 2nd Edition 2nd Ed 2023 Charu C Aggarwal https://ebookbell.com/product/neural-networks-and-deep-learning-a- textbook-2nd-edition-2nd-ed-2023-charu-c-aggarwal-50699788 Neural Networks And Deep Learning Theoretical Insights And Frameworks Dr Vishwas Mishra https://ebookbell.com/product/neural-networks-and-deep-learning- theoretical-insights-and-frameworks-dr-vishwas-mishra-56221402 Neural Networks And Deep Learning A Textbook Charu C Aggarwal https://ebookbell.com/product/neural-networks-and-deep-learning-a- textbook-charu-c-aggarwal-7166102
  • 3. Neural Networks And Deep Learning Deep Learning Explained To Your Granny A Visual Introduction For Beginners Who Want To Make Their Own Deep Learning Neural Network Pat Nakamoto https://ebookbell.com/product/neural-networks-and-deep-learning-deep- learning-explained-to-your-granny-a-visual-introduction-for-beginners- who-want-to-make-their-own-deep-learning-neural-network-pat- nakamoto-11565930 Neural Networks And Deep Learning A Textbook 2nd Edition 2nd Edition Charu C Aggarwal https://ebookbell.com/product/neural-networks-and-deep-learning-a- textbook-2nd-edition-2nd-edition-charu-c-aggarwal-50687826 Neural Networks And Deep Learning Deep Learning Explained To Your Granny A Visual Introduction For Beginners Who Want To Make Their Own Deep Learning Neural Network Machine Learning Nakamoto https://ebookbell.com/product/neural-networks-and-deep-learning-deep- learning-explained-to-your-granny-a-visual-introduction-for-beginners- who-want-to-make-their-own-deep-learning-neural-network-machine- learning-nakamoto-36065668 Neural Networks And Deep Learning Deep Learning Explained To Your Granny A Visual Introduction For Beginners Who Want To Make Their Own Deep Learning Neural Network Machine Learning Pat Nakamoto https://ebookbell.com/product/neural-networks-and-deep-learning-deep- learning-explained-to-your-granny-a-visual-introduction-for-beginners- who-want-to-make-their-own-deep-learning-neural-network-machine- learning-pat-nakamoto-42058928 Neural Networks And Deep Learning Deep Learning Explained To Your Granny A Visual Introduction For Beginners Who Want To Make Their Own Deep Learning Neural Network Machine Learning Nakamoto https://ebookbell.com/product/neural-networks-and-deep-learning-deep- learning-explained-to-your-granny-a-visual-introduction-for-beginners- who-want-to-make-their-own-deep-learning-neural-network-machine- learning-nakamoto-36068052
  • 4. Neural Networks And Deep Learning Deep Learning Explained To Your Granny A Visual Introduction For Beginners Who Want To Make Their Own Deep Learning Neural Network Machine Learning Pat Nakamoto https://ebookbell.com/product/neural-networks-and-deep-learning-deep- learning-explained-to-your-granny-a-visual-introduction-for-beginners- who-want-to-make-their-own-deep-learning-neural-network-machine- learning-pat-nakamoto-37277310
  • 6. Neural Networks and Deep Learning
  • 7. Charu C. Aggarwal Neural Networks and Deep Learning A Textbook 123
  • 8. Charu C. Aggarwal IBM T. J. Watson Research Center International Business Machines Yorktown Heights, NY, USA ISBN 978-3-319-94462-3 ISBN 978-3-319-94463-0 (eBook) https://doi.org/10.1007/978-3-319-94463-0 Library of Congress Control Number: 2018947636 c Springer International Publishing AG, part of Springer Nature 2018 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, com- puter software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
  • 9. To my wife Lata, my daughter Sayani, and my late parents Dr. Prem Sarup and Mrs. Pushplata Aggarwal.
  • 10. Preface “Any A.I. smart enough to pass a Turing test is smart enough to know to fail it.”—Ian McDonald Neural networks were developed to simulate the human nervous system for machine learning tasks by treating the computational units in a learning model in a manner similar to human neurons. The grand vision of neural networks is to create artificial intelligence by building machines whose architecture simulates the computations in the human nervous system. This is obviously not a simple task because the computational power of the fastest computer today is a minuscule fraction of the computational power of a human brain. Neural networks were developed soon after the advent of computers in the fifties and sixties. Rosenblatt’s perceptron algorithm was seen as a fundamental cornerstone of neural networks, which caused an initial excitement about the prospects of artificial intelligence. However, after the initial euphoria, there was a period of disappointment in which the data-hungry and computationally intensive nature of neural networks was seen as an impediment to their usability. Eventually, at the turn of the century, greater data availability and increasing computational power led to increased successes of neural networks, and this area was reborn under the new label of “deep learning.” Although we are still far from the day that artificial intelligence (AI) is close to human performance, there are specific domains like image recognition, self-driving cars, and game playing, where AI has matched or exceeded human performance. It is also hard to predict what AI might be able to do in the future. For example, few computer vision experts would have thought two decades ago that any automated system could ever perform an intuitive task like categorizing an image more accurately than a human. Neural networks are theoretically capable of learning any mathematical function with sufficient training data, and some variants like recurrent neural networks are known to be Turing complete. Turing completeness refers to the fact that a neural network can simulate any learning algorithm, given sufficient training data. The sticking point is that the amount of data required to learn even simple tasks is often extraordinarily large, which causes a corresponding increase in training time (if we assume that enough training data is available in the first place). For example, the training time for image recognition, which is a simple task for a human, can be on the order of weeks even on high-performance systems. Furthermore, there are practical issues associated with the stability of neural network training, which are being resolved even today. Nevertheless, given that the speed of computers is
  • 11. expected to increase rapidly over time, and fundamentally more powerful paradigms like quantum computing are on the horizon, the computational issue might not eventually turn out to be quite as critical as imagined. Although the biological analogy of neural networks is an exciting one and evokes comparisons with science fiction, the mathematical understanding of neural networks is a more mundane one. The neural network abstraction can be viewed as a modular approach of enabling learning algorithms that are based on continuous optimization on a computational graph of dependencies between the input and output. To be fair, this is not very different from traditional work in control theory; indeed, some of the methods used for optimization in control theory are strikingly similar to (and historically preceded) the most fundamental algorithms in neural networks. However, the large amounts of data available in recent years together with increased computational power have enabled experimentation with deeper architectures of these computational graphs than was previously possible. The resulting success has changed the broader perception of the potential of deep learning. The chapters of the book are organized as follows: 1. The basics of neural networks: Chapter 1 discusses the basics of neural network design. Many traditional machine learning models can be understood as special cases of neural learning. Understanding the relationship between traditional machine learning and neural networks is the first step to understanding the latter. The simulation of various machine learning models with neural networks is provided in Chapter 2. This will give the analyst a feel of how neural networks push the envelope of traditional machine learning algorithms. 2. Fundamentals of neural networks: Although Chapters 1 and 2 provide an overview of the training methods for neural networks, a more detailed understanding of the training challenges is provided in Chapters 3 and 4. Chapters 5 and 6 present radial-basis function (RBF) networks and restricted Boltzmann machines. 3. Advanced topics in neural networks: A lot of the recent success of deep learning is a result of the specialized architectures for various domains, such as recurrent neural networks and convolutional neural networks. Chapters 7 and 8 discuss recurrent and convolutional neural networks. Several advanced topics like deep reinforcement learning, neural Turing mechanisms, and generative adversarial networks are discussed in Chapters 9 and 10. We have taken care to include some of the “forgotten” architectures like RBF networks and Kohonen self-organizing maps because of their potential in many applications. The book is written for graduate students, researchers, and practitioners. Numerous exercises are available along with a solution manual to aid in classroom teaching. Where possible, an application-centric view is highlighted in order to give the reader a feel for the technology. Throughout this book, a vector or a multidimensional data point is annotated with a bar, such as X or y. A vector or multidimensional point may be denoted by either small letters or capital letters, as long as it has a bar. Vector dot products are denoted by centered dots, such as X · Y . A matrix is denoted in capital letters without a bar, such as R. Throughout the book, the n × d matrix corresponding to the entire training data set is denoted by D, with n documents and d dimensions. 
The individual data points in D are therefore d-dimensional row vectors. On the other hand, vectors with one component for each data
  • 12. PREFACE IX point are usually n-dimensional column vectors. An example is the n-dimensional column vector y of class variables of n data points. An observed value yi is distinguished from a predicted value ŷi by a circumflex at the top of the variable. Yorktown Heights, NY, USA Charu C. Aggarwal
  • 13. Acknowledgments I would like to thank my family for their love and support during the busy time spent in writing this book. I would also like to thank my manager Nagui Halim for his support during the writing of this book. Several figures in this book have been provided by the courtesy of various individuals and institutions. The Smithsonian Institution made the image of the Mark I perceptron (cf. Figure 1.5) available at no cost. Saket Sathe provided the outputs in Chapter 7 for the tiny Shakespeare data set, based on code available/described in [233, 580]. Andrew Zisserman provided Figures 8.12 and 8.16 in the section on convolutional visualizations. Another visualization of the feature maps in the convolution network (cf. Figure 8.15) was provided by Matthew Zeiler. NVIDIA provided Figure 9.10 on the convolutional neural network for self-driving cars in Chapter 9, and Sergey Levine provided the image on self- learning robots (cf. Figure 9.9) in the same chapter. Alec Radford provided Figure 10.8, which appears in Chapter 10. Alex Krizhevsky provided Figure 8.9(b) containing AlexNet. This book has benefitted from significant feedback and several collaborations that I have had with numerous colleagues over the years. I would like to thank Quoc Le, Saket Sathe, Karthik Subbian, Jiliang Tang, and Suhang Wang for their feedback on various portions of this book. Shuai Zheng provided feedbback on the section on regularized autoencoders in Chapter 4. I received feedback on the sections on autoencoders from Lei Cai and Hao Yuan. Feedback on the chapter on convolutional neural networks was provided by Hongyang Gao, Shuiwang Ji, and Zhengyang Wang. Shuiwang Ji, Lei Cai, Zhengyang Wang and Hao Yuan also reviewed the Chapters 3 and 7, and suggested several edits. They also suggested the ideas of using Figures 8.6 and 8.7 for elucidating the convolution/deconvolution operations. For their collaborations, I would like to thank Tarek F. Abdelzaher, Jinghui Chen, Jing Gao, Quanquan Gu, Manish Gupta, Jiawei Han, Alexander Hinneburg, Thomas Huang, Nan Li, Huan Liu, Ruoming Jin, Daniel Keim, Arijit Khan, Latifur Khan, Mohammad M. Masud, Jian Pei, Magda Procopiuc, Guojun Qi, Chandan Reddy, Saket Sathe, Jaideep Sri- vastava, Karthik Subbian, Yizhou Sun, Jiliang Tang, Min-Hsuan Tsai, Haixun Wang, Jiany- ong Wang, Min Wang, Suhang Wang, Joel Wolf, Xifeng Yan, Mohammed Zaki, ChengXiang Zhai, and Peixiang Zhao. I would also like to thank my advisor James B. Orlin for his guid- ance during my early years as a researcher. XI
  • 14. XII ACKNOWLEDGMENTS I would like to thank Lata Aggarwal for helping me with some of the figures created using PowerPoint graphics in this book. My daughter, Sayani, was helpful in incorporating special effects (e.g., image color, contrast, and blurring) in several JPEG images used at various places in this book.
  • 15. Contents 1 An Introduction to Neural Networks 1 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Humans Versus Computers: Stretching the Limits of Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 The Basic Architecture of Neural Networks . . . . . . . . . . . . . . . . . . 4 1.2.1 Single Computational Layer: The Perceptron . . . . . . . . . . . . . 5 1.2.1.1 What Objective Function Is the Perceptron Optimizing? . 8 1.2.1.2 Relationship with Support Vector Machines . . . . . . . . . 10 1.2.1.3 Choice of Activation and Loss Functions . . . . . . . . . . 11 1.2.1.4 Choice and Number of Output Nodes . . . . . . . . . . . . 14 1.2.1.5 Choice of Loss Function . . . . . . . . . . . . . . . . . . . . 14 1.2.1.6 Some Useful Derivatives of Activation Functions . . . . . . 16 1.2.2 Multilayer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 17 1.2.3 The Multilayer Network as a Computational Graph . . . . . . . . . 20 1.3 Training a Neural Network with Backpropagation . . . . . . . . . . . . . . . 21 1.4 Practical Issues in Neural Network Training . . . . . . . . . . . . . . . . . . 24 1.4.1 The Problem of Overfitting . . . . . . . . . . . . . . . . . . . . . . . 25 1.4.1.1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . 26 1.4.1.2 Neural Architecture and Parameter Sharing . . . . . . . . . 27 1.4.1.3 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . 27 1.4.1.4 Trading Off Breadth for Depth . . . . . . . . . . . . . . . . 27 1.4.1.5 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . 28 1.4.2 The Vanishing and Exploding Gradient Problems . . . . . . . . . . . 28 1.4.3 Difficulties in Convergence . . . . . . . . . . . . . . . . . . . . . . . . 29 1.4.4 Local and Spurious Optima . . . . . . . . . . . . . . . . . . . . . . . 29 1.4.5 Computational Challenges . . . . . . . . . . . . . . . . . . . . . . . . 29 1.5 The Secrets to the Power of Function Composition . . . . . . . . . . . . . . 30 1.5.1 The Importance of Nonlinear Activation . . . . . . . . . . . . . . . . 32 1.5.2 Reducing Parameter Requirements with Depth . . . . . . . . . . . . 34 1.5.3 Unconventional Neural Architectures . . . . . . . . . . . . . . . . . . 35 1.5.3.1 Blurring the Distinctions Between Input, Hidden, and Output Layers . . . . . . . . . . . . . . . . . . . . . . . 35 1.5.3.2 Unconventional Operations and Sum-Product Networks . . 36 XIII
  • 16. XIV CONTENTS 1.6 Common Neural Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 37 1.6.1 Simulating Basic Machine Learning with Shallow Models . . . . . . 37 1.6.2 Radial Basis Function Networks . . . . . . . . . . . . . . . . . . . . 37 1.6.3 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 38 1.6.4 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 38 1.6.5 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . 40 1.6.6 Hierarchical Feature Engineering and Pretrained Models . . . . . . . 42 1.7 Advanced Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 1.7.1 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 44 1.7.2 Separating Data Storage and Computations . . . . . . . . . . . . . . 45 1.7.3 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . 45 1.8 Two Notable Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 1.8.1 The MNIST Database of Handwritten Digits . . . . . . . . . . . . . 46 1.8.2 The ImageNet Database . . . . . . . . . . . . . . . . . . . . . . . . . 47 1.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 1.10 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 1.10.1 Video Lectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 1.10.2 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 1.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 2 Machine Learning with Shallow Neural Networks 53 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 2.2 Neural Architectures for Binary Classification Models . . . . . . . . . . . . 55 2.2.1 Revisiting the Perceptron . . . . . . . . . . . . . . . . . . . . . . . . 56 2.2.2 Least-Squares Regression . . . . . . . . . . . . . . . . . . . . . . . . 58 2.2.2.1 Widrow-Hoff Learning . . . . . . . . . . . . . . . . . . . . . 59 2.2.2.2 Closed Form Solutions . . . . . . . . . . . . . . . . . . . . . 61 2.2.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 2.2.3.1 Alternative Choices of Activation and Loss . . . . . . . . . 63 2.2.4 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . 63 2.3 Neural Architectures for Multiclass Models . . . . . . . . . . . . . . . . . . 65 2.3.1 Multiclass Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . 65 2.3.2 Weston-Watkins SVM . . . . . . . . . . . . . . . . . . . . . . . . . . 67 2.3.3 Multinomial Logistic Regression (Softmax Classifier) . . . . . . . . . 68 2.3.4 Hierarchical Softmax for Many Classes . . . . . . . . . . . . . . . . . 69 2.4 Backpropagated Saliency for Feature Selection . . . . . . . . . . . . . . . . 70 2.5 Matrix Factorization with Autoencoders . . . . . . . . . . . . . . . . . . . . 70 2.5.1 Autoencoder: Basic Principles . . . . . . . . . . . . . . . . . . . . . . 71 2.5.1.1 Autoencoder with a Single Hidden Layer . . . . . . . . . . 72 2.5.1.2 Connections with Singular Value Decomposition . . . . . . 74 2.5.1.3 Sharing Weights in Encoder and Decoder . . . . . . . . . . 74 2.5.1.4 Other Matrix Factorization Methods . . . . . . . . . . . . . 76 2.5.2 Nonlinear Activations . . . . . . . . . . . . . . . . . . . . . . . . . . 76 2.5.3 Deep Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 2.5.4 Application to Outlier Detection . . . . . . . . . . . . . 
. . . . . . . 80 2.5.5 When the Hidden Layer Is Broader than the Input Layer . . . . . . 81 2.5.5.1 Sparse Feature Learning . . . . . . . . . . . . . . . . . . . . 81 2.5.6 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
  • 17. CONTENTS XV 2.5.7 Recommender Systems: Row Index to Row Value Prediction . . . . 83 2.5.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 2.6 Word2vec: An Application of Simple Neural Architectures . . . . . . . . . . 87 2.6.1 Neural Embedding with Continuous Bag of Words . . . . . . . . . . 87 2.6.2 Neural Embedding with Skip-Gram Model . . . . . . . . . . . . . . . 90 2.6.3 Word2vec (SGNS) Is Logistic Matrix Factorization . . . . . . . . . . 95 2.6.4 Vanilla Skip-Gram Is Multinomial Matrix Factorization . . . . . . . 98 2.7 Simple Neural Architectures for Graph Embeddings . . . . . . . . . . . . . 98 2.7.1 Handling Arbitrary Edge Counts . . . . . . . . . . . . . . . . . . . . 100 2.7.2 Multinomial Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 2.7.3 Connections with DeepWalk and Node2vec . . . . . . . . . . . . . . 100 2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 2.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 2.9.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 2.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 3 Training Deep Neural Networks 105 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 3.2 Backpropagation: The Gory Details . . . . . . . . . . . . . . . . . . . . . . . 107 3.2.1 Backpropagation with the Computational Graph Abstraction . . . . 107 3.2.2 Dynamic Programming to the Rescue . . . . . . . . . . . . . . . . . 111 3.2.3 Backpropagation with Post-Activation Variables . . . . . . . . . . . 113 3.2.4 Backpropagation with Pre-activation Variables . . . . . . . . . . . . 115 3.2.5 Examples of Updates for Various Activations . . . . . . . . . . . . . 117 3.2.5.1 The Special Case of Softmax . . . . . . . . . . . . . . . . . 117 3.2.6 A Decoupled View of Vector-Centric Backpropagation . . . . . . . . 118 3.2.7 Loss Functions on Multiple Output Nodes and Hidden Nodes . . . . 121 3.2.8 Mini-Batch Stochastic Gradient Descent . . . . . . . . . . . . . . . . 121 3.2.9 Backpropagation Tricks for Handling Shared Weights . . . . . . . . 123 3.2.10 Checking the Correctness of Gradient Computation . . . . . . . . . 124 3.3 Setup and Initialization Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 125 3.3.1 Tuning Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . 125 3.3.2 Feature Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 126 3.3.3 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 3.4 The Vanishing and Exploding Gradient Problems . . . . . . . . . . . . . . . 129 3.4.1 Geometric Understanding of the Effect of Gradient Ratios . . . . . . 130 3.4.2 A Partial Fix with Activation Function Choice . . . . . . . . . . . . 133 3.4.3 Dying Neurons and “Brain Damage” . . . . . . . . . . . . . . . . . . 133 3.4.3.1 Leaky ReLU . . . . . . . . . . . . . . . . . . . . . . . . . . 133 3.4.3.2 Maxout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 3.5 Gradient-Descent Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 3.5.1 Learning Rate Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 3.5.2 Momentum-Based Learning . . . . . . . . . . . . . . . . . . . . . . . 136 3.5.2.1 Nesterov Momentum . . . . . . . . . . . . . . . . . . . . . 137 3.5.3 Parameter-Specific Learning Rates . . . . . . . . . . . . . . . . . . . 
137 3.5.3.1 AdaGrad . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 3.5.3.2 RMSProp . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 3.5.3.3 RMSProp with Nesterov Momentum . . . . . . . . . . . . . 139
  • 18. XVI CONTENTS 3.5.3.4 AdaDelta . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 3.5.3.5 Adam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 3.5.4 Cliffs and Higher-Order Instability . . . . . . . . . . . . . . . . . . . 141 3.5.5 Gradient Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 3.5.6 Second-Order Derivatives . . . . . . . . . . . . . . . . . . . . . . . . 143 3.5.6.1 Conjugate Gradients and Hessian-Free Optimization . . . . 145 3.5.6.2 Quasi-Newton Methods and BFGS . . . . . . . . . . . . . . 148 3.5.6.3 Problems with Second-Order Methods: Saddle Points . . . 149 3.5.7 Polyak Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 3.5.8 Local and Spurious Minima . . . . . . . . . . . . . . . . . . . . . . . 151 3.6 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 3.7 Practical Tricks for Acceleration and Compression . . . . . . . . . . . . . . 156 3.7.1 GPU Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 3.7.2 Parallel and Distributed Implementations . . . . . . . . . . . . . . . 158 3.7.3 Algorithmic Tricks for Model Compression . . . . . . . . . . . . . . 160 3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 3.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 3.9.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 3.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 4 Teaching Deep Learners to Generalize 169 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 4.2 The Bias-Variance Trade-Off . . . . . . . . . . . . . . . . . . . . . . . . . . 174 4.2.1 Formal View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 4.3 Generalization Issues in Model Tuning and Evaluation . . . . . . . . . . . . 178 4.3.1 Evaluating with Hold-Out and Cross-Validation . . . . . . . . . . . . 179 4.3.2 Issues with Training at Scale . . . . . . . . . . . . . . . . . . . . . . 180 4.3.3 How to Detect Need to Collect More Data . . . . . . . . . . . . . . . 181 4.4 Penalty-Based Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 181 4.4.1 Connections with Noise Injection . . . . . . . . . . . . . . . . . . . . 182 4.4.2 L1-Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 4.4.3 L1- or L2-Regularization? . . . . . . . . . . . . . . . . . . . . . . . . 184 4.4.4 Penalizing Hidden Units: Learning Sparse Representations . . . . . . 185 4.5 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 4.5.1 Bagging and Subsampling . . . . . . . . . . . . . . . . . . . . . . . . 186 4.5.2 Parametric Model Selection and Averaging . . . . . . . . . . . . . . 187 4.5.3 Randomized Connection Dropping . . . . . . . . . . . . . . . . . . . 188 4.5.4 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 4.5.5 Data Perturbation Ensembles . . . . . . . . . . . . . . . . . . . . . . 191 4.6 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 4.6.1 Understanding Early Stopping from the Variance Perspective . . . . 192 4.7 Unsupervised Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 4.7.1 Variations of Unsupervised Pretraining . . . . . . . . . . . . . . . . . 197 4.7.2 What About Supervised Pretraining? . . . . . . . . . . . . . . . . . 
197 4.8 Continuation and Curriculum Learning . . . . . . . . . . . . . . . . . . . . . 199 4.8.1 Continuation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 199 4.8.2 Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 4.9 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
  • 19. CONTENTS XVII 4.10 Regularization in Unsupervised Applications . . . . . . . . . . . . . . . . . 201 4.10.1 Value-Based Penalization: Sparse Autoencoders . . . . . . . . . . . . 202 4.10.2 Noise Injection: De-noising Autoencoders . . . . . . . . . . . . . . . 202 4.10.3 Gradient-Based Penalization: Contractive Autoencoders . . . . . . . 204 4.10.4 Hidden Probabilistic Structure: Variational Autoencoders . . . . . . 207 4.10.4.1 Reconstruction and Generative Sampling . . . . . . . . . . 210 4.10.4.2 Conditional Variational Autoencoders . . . . . . . . . . . . 212 4.10.4.3 Relationship with Generative Adversarial Networks . . . . 213 4.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 4.12 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 4.12.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 4.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 5 Radial Basis Function Networks 217 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 5.2 Training an RBF Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 5.2.1 Training the Hidden Layer . . . . . . . . . . . . . . . . . . . . . . . . 221 5.2.2 Training the Output Layer . . . . . . . . . . . . . . . . . . . . . . . 222 5.2.2.1 Expression with Pseudo-Inverse . . . . . . . . . . . . . . . 224 5.2.3 Orthogonal Least-Squares Algorithm . . . . . . . . . . . . . . . . . . 224 5.2.4 Fully Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . 225 5.3 Variations and Special Cases of RBF Networks . . . . . . . . . . . . . . . . 226 5.3.1 Classification with Perceptron Criterion . . . . . . . . . . . . . . . . 226 5.3.2 Classification with Hinge Loss . . . . . . . . . . . . . . . . . . . . . . 227 5.3.3 Example of Linear Separability Promoted by RBF . . . . . . . . . . 227 5.3.4 Application to Interpolation . . . . . . . . . . . . . . . . . . . . . . . 228 5.4 Relationship with Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . 229 5.4.1 Kernel Regression as a Special Case of RBF Networks . . . . . . . . 229 5.4.2 Kernel SVM as a Special Case of RBF Networks . . . . . . . . . . . 230 5.4.3 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 5.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 6 Restricted Boltzmann Machines 235 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 6.1.1 Historical Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . 236 6.2 Hopfield Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 6.2.1 Optimal State Configurations of a Trained Network . . . . . . . . . 238 6.2.2 Training a Hopfield Network . . . . . . . . . . . . . . . . . . . . . . 240 6.2.3 Building a Toy Recommender and Its Limitations . . . . . . . . . . 241 6.2.4 Increasing the Expressive Power of the Hopfield Network . . . . . . 242 6.3 The Boltzmann Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243 6.3.1 How a Boltzmann Machine Generates Data . . . . . . . . . . . . . . 244 6.3.2 Learning the Weights of a Boltzmann Machine . . . . . . . . . . . . 245 6.4 Restricted Boltzmann Machines . . . . . 
. . . . . . . . . . . . . . . . . . . . 247 6.4.1 Training the RBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249 6.4.2 Contrastive Divergence Algorithm . . . . . . . . . . . . . . . . . . . 250 6.4.3 Practical Issues and Improvisations . . . . . . . . . . . . . . . . . . . 251
  • 20. XVIII CONTENTS 6.5 Applications of Restricted Boltzmann Machines . . . . . . . . . . . . . . . . 251 6.5.1 Dimensionality Reduction and Data Reconstruction . . . . . . . . . 252 6.5.2 RBMs for Collaborative Filtering . . . . . . . . . . . . . . . . . . . . 254 6.5.3 Using RBMs for Classification . . . . . . . . . . . . . . . . . . . . . . 257 6.5.4 Topic Models with RBMs . . . . . . . . . . . . . . . . . . . . . . . . 260 6.5.5 RBMs for Machine Learning with Multimodal Data . . . . . . . . . 262 6.6 Using RBMs Beyond Binary Data Types . . . . . . . . . . . . . . . . . . . . 263 6.7 Stacking Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 264 6.7.1 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 266 6.7.2 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 267 6.7.3 Deep Boltzmann Machines and Deep Belief Networks . . . . . . . . 267 6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268 6.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268 6.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270 7 Recurrent Neural Networks 271 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271 7.1.1 Expressiveness of Recurrent Networks . . . . . . . . . . . . . . . . . 274 7.2 The Architecture of Recurrent Neural Networks . . . . . . . . . . . . . . . . 274 7.2.1 Language Modeling Example of RNN . . . . . . . . . . . . . . . . . 277 7.2.1.1 Generating a Language Sample . . . . . . . . . . . . . . . . 278 7.2.2 Backpropagation Through Time . . . . . . . . . . . . . . . . . . . . 280 7.2.3 Bidirectional Recurrent Networks . . . . . . . . . . . . . . . . . . . . 283 7.2.4 Multilayer Recurrent Networks . . . . . . . . . . . . . . . . . . . . . 284 7.3 The Challenges of Training Recurrent Networks . . . . . . . . . . . . . . . . 286 7.3.1 Layer Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . 289 7.4 Echo-State Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290 7.5 Long Short-Term Memory (LSTM) . . . . . . . . . . . . . . . . . . . . . . . 292 7.6 Gated Recurrent Units (GRUs) . . . . . . . . . . . . . . . . . . . . . . . . . 295 7.7 Applications of Recurrent Neural Networks . . . . . . . . . . . . . . . . . . 297 7.7.1 Application to Automatic Image Captioning . . . . . . . . . . . . . . 298 7.7.2 Sequence-to-Sequence Learning and Machine Translation . . . . . . 299 7.7.2.1 Question-Answering Systems . . . . . . . . . . . . . . . . . 301 7.7.3 Application to Sentence-Level Classification . . . . . . . . . . . . . . 303 7.7.4 Token-Level Classification with Linguistic Features . . . . . . . . . . 304 7.7.5 Time-Series Forecasting and Prediction . . . . . . . . . . . . . . . . 305 7.7.6 Temporal Recommender Systems . . . . . . . . . . . . . . . . . . . . 307 7.7.7 Secondary Protein Structure Prediction . . . . . . . . . . . . . . . . 309 7.7.8 End-to-End Speech Recognition . . . . . . . . . . . . . . . . . . . . . 309 7.7.9 Handwriting Recognition . . . . . . . . . . . . . . . . . . . . . . . . 309 7.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310 7.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310 7.9.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311 7.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
312
  • 21. CONTENTS XIX 8 Convolutional Neural Networks 315 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315 8.1.1 Historical Perspective and Biological Inspiration . . . . . . . . . . . 316 8.1.2 Broader Observations About Convolutional Neural Networks . . . . 317 8.2 The Basic Structure of a Convolutional Network . . . . . . . . . . . . . . . 318 8.2.1 Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322 8.2.2 Strides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324 8.2.3 Typical Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324 8.2.4 The ReLU Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325 8.2.5 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326 8.2.6 Fully Connected Layers . . . . . . . . . . . . . . . . . . . . . . . . . 327 8.2.7 The Interleaving Between Layers . . . . . . . . . . . . . . . . . . . . 328 8.2.8 Local Response Normalization . . . . . . . . . . . . . . . . . . . . . 330 8.2.9 Hierarchical Feature Engineering . . . . . . . . . . . . . . . . . . . . 331 8.3 Training a Convolutional Network . . . . . . . . . . . . . . . . . . . . . . . 332 8.3.1 Backpropagating Through Convolutions . . . . . . . . . . . . . . . . 333 8.3.2 Backpropagation as Convolution with Inverted/Transposed Filter . . 334 8.3.3 Convolution/Backpropagation as Matrix Multiplications . . . . . . . 335 8.3.4 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 337 8.4 Case Studies of Convolutional Architectures . . . . . . . . . . . . . . . . . . 338 8.4.1 AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339 8.4.2 ZFNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341 8.4.3 VGG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342 8.4.4 GoogLeNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345 8.4.5 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347 8.4.6 The Effects of Depth . . . . . . . . . . . . . . . . . . . . . . . . . . . 350 8.4.7 Pretrained Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351 8.5 Visualization and Unsupervised Learning . . . . . . . . . . . . . . . . . . . 352 8.5.1 Visualizing the Features of a Trained Network . . . . . . . . . . . . 353 8.5.2 Convolutional Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 357 8.6 Applications of Convolutional Networks . . . . . . . . . . . . . . . . . . . . 363 8.6.1 Content-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . 363 8.6.2 Object Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . 364 8.6.3 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 8.6.4 Natural Language and Sequence Learning . . . . . . . . . . . . . . . 366 8.6.5 Video Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 367 8.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368 8.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368 8.8.1 Software Resources and Data Sets . . . . . . . . . . . . . . . . . . . 370 8.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371 9 Deep Reinforcement Learning 373 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373 9.2 Stateless Algorithms: Multi-Armed Bandits . . . . . . . . . . . . . . . . . . 
375 9.2.1 Naı̈ve Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376 9.2.2 -Greedy Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 376 9.2.3 Upper Bounding Methods . . . . . . . . . . . . . . . . . . . . . . . . 376 9.3 The Basic Framework of Reinforcement Learning . . . . . . . . . . . . . . . 377 9.3.1 Challenges of Reinforcement Learning . . . . . . . . . . . . . . . . . 379
  • 22. XX CONTENTS 9.3.2 Simple Reinforcement Learning for Tic-Tac-Toe . . . . . . . . . . . . 380 9.3.3 Role of Deep Learning and a Straw-Man Algorithm . . . . . . . . . 380 9.4 Bootstrapping for Value Function Learning . . . . . . . . . . . . . . . . . . 383 9.4.1 Deep Learning Models as Function Approximators . . . . . . . . . . 384 9.4.2 Example: Neural Network for Atari Setting . . . . . . . . . . . . . . 386 9.4.3 On-Policy Versus Off-Policy Methods: SARSA . . . . . . . . . . . . 387 9.4.4 Modeling States Versus State-Action Pairs . . . . . . . . . . . . . . . 389 9.5 Policy Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391 9.5.1 Finite Difference Methods . . . . . . . . . . . . . . . . . . . . . . . . 392 9.5.2 Likelihood Ratio Methods . . . . . . . . . . . . . . . . . . . . . . . . 393 9.5.3 Combining Supervised Learning with Policy Gradients . . . . . . . . 395 9.5.4 Actor-Critic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 395 9.5.5 Continuous Action Spaces . . . . . . . . . . . . . . . . . . . . . . . . 397 9.5.6 Advantages and Disadvantages of Policy Gradients . . . . . . . . . . 397 9.6 Monte Carlo Tree Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398 9.7 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399 9.7.1 AlphaGo: Championship Level Play at Go . . . . . . . . . . . . . . . 399 9.7.1.1 Alpha Zero: Enhancements to Zero Human Knowledge . . 402 9.7.2 Self-Learning Robots . . . . . . . . . . . . . . . . . . . . . . . . . . . 404 9.7.2.1 Deep Learning of Locomotion Skills . . . . . . . . . . . . . 404 9.7.2.2 Deep Learning of Visuomotor Skills . . . . . . . . . . . . . 406 9.7.3 Building Conversational Systems: Deep Learning for Chatbots . . . 407 9.7.4 Self-Driving Cars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410 9.7.5 Inferring Neural Architectures with Reinforcement Learning . . . . . 412 9.8 Practical Challenges Associated with Safety . . . . . . . . . . . . . . . . . . 413 9.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414 9.10 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414 9.10.1 Software Resources and Testbeds . . . . . . . . . . . . . . . . . . . . 416 9.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416 10 Advanced Topics in Deep Learning 419 10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419 10.2 Attention Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421 10.2.1 Recurrent Models of Visual Attention . . . . . . . . . . . . . . . . . 422 10.2.1.1 Application to Image Captioning . . . . . . . . . . . . . . . 424 10.2.2 Attention Mechanisms for Machine Translation . . . . . . . . . . . . 425 10.3 Neural Networks with External Memory . . . . . . . . . . . . . . . . . . . . 429 10.3.1 A Fantasy Video Game: Sorting by Example . . . . . . . . . . . . . 430 10.3.1.1 Implementing Swaps with Memory Operations . . . . . . . 431 10.3.2 Neural Turing Machines . . . . . . . . . . . . . . . . . . . . . . . . . 432 10.3.3 Differentiable Neural Computer: A Brief Overview . . . . . . . . . . 437 10.4 Generative Adversarial Networks (GANs) . . . . . . . . . . . . . . . . . . . 438 10.4.1 Training a Generative Adversarial Network . . . . . . . . . . . . . . 439 10.4.2 Comparison with Variational Autoencoder . . . . . . . . . . . . . . . 442 10.4.3 Using GANs for Generating Image Data . . 
. . . . . . . . . . . . . . 442 10.4.4 Conditional Generative Adversarial Networks . . . . . . . . . . . . . 444 10.5 Competitive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449 10.5.1 Vector Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . 450 10.5.2 Kohonen Self-Organizing Map . . . . . . . . . . . . . . . . . . . . . 450
  • 23. CONTENTS XXI 10.6 Limitations of Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 453 10.6.1 An Aspirational Goal: One-Shot Learning . . . . . . . . . . . . . . . 453 10.6.2 An Aspirational Goal: Energy-Efficient Learning . . . . . . . . . . . 455 10.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456 10.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457 10.8.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458 10.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458 Bibliography 459 Index 493
  • 24. Author Biography Charu C. Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his under- graduate degree in Computer Science from the Indian Institute of Technology at Kan- pur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996. He has worked extensively in the field of data mining. He has pub- lished more than 350 papers in refereed conferences and journals and authored over 80 patents. He is the author or editor of 18 books, including textbooks on data mining, recommender systems, and outlier analysis. Because of the commercial value of his patents, he has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio- terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contribu- tions to privacy technology, and a recipient of two IBM Outstanding Technical Achievement Awards (2009, 2015) for his work on data streams/high-dimensional data. He received the EDBT 2014 Test of Time Award for his work on condensation-based privacy-preserving data mining. He is also a recipient of the IEEE ICDM Research Con- tributions Award (2015), which is one of the two highest awards for influential research contributions in the field of data mining. He has served as the general co-chair of the IEEE Big Data Conference (2014) and as the program co-chair of the ACM CIKM Conference (2015), the IEEE ICDM Conference (2015), and the ACM KDD Conference (2016). He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the IEEE Transactions on Big Data, an action editor of the Data Mining and Knowledge Discovery Journal, and an associate editor of the Knowledge and Information Systems Journal. He serves as the editor-in-chief of the ACM Transactions on Knowledge Discovery from Data as well as the ACM SIGKDD Explorations. He serves on the advisory board of the Lecture Notes on Social Networks, a publication by Springer. He has served as the vice-president of the SIAM Activity Group on Data Mining and is a member of the SIAM industry committee. He is a fellow of the SIAM, ACM, and the IEEE, for “contributions to knowledge discovery and data mining algorithms.” XXIII
Chapter 1

An Introduction to Neural Networks

“Thou shalt not make a machine to counterfeit a human mind.”—Frank Herbert

1.1 Introduction

Artificial neural networks are popular machine learning techniques that simulate the mechanism of learning in biological organisms. The human nervous system contains cells, which are referred to as neurons. The neurons are connected to one another with the use of axons and dendrites, and the connecting regions between axons and dendrites are referred to as synapses. These connections are illustrated in Figure 1.1(a). The strengths of synaptic connections often change in response to external stimuli. This change is how learning takes place in living organisms.

Figure 1.1: The synaptic connections between neurons. Panel (a) shows a biological neural network; panel (b) shows an artificial neural network whose inputs are scaled by weights w1 . . . w5. The image in (a) is from “The Brain: Understanding Neurobiology Through the Study of Addiction [598].” Copyright © 2000 by BSCS Videodiscovery. All rights reserved. Used with permission.

© Springer International Publishing AG, part of Springer Nature 2018. C. C. Aggarwal, Neural Networks and Deep Learning, https://doi.org/10.1007/978-3-319-94463-0_1

This biological mechanism is simulated in artificial neural networks, which contain computation units that are referred to as neurons. Throughout this book, we will use the term “neural networks” to refer to artificial neural networks rather than biological ones. The computational units are connected to one another through weights, which serve the same
  • 26. 2 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS role as the strengths of synaptic connections in biological organisms. Each input to a neuron is scaled with a weight, which affects the function computed at that unit. This architecture is illustrated in Figure 1.1(b). An artificial neural network computes a function of the inputs by propagating the computed values from the input neurons to the output neuron(s) and using the weights as intermediate parameters. Learning occurs by changing the weights con- necting the neurons. Just as external stimuli are needed for learning in biological organisms, the external stimulus in artificial neural networks is provided by the training data contain- ing examples of input-output pairs of the function to be learned. For example, the training data might contain pixel representations of images (input) and their annotated labels (e.g., carrot, banana) as the output. These training data pairs are fed into the neural network by using the input representations to make predictions about the output labels. The training data provides feedback to the correctness of the weights in the neural network depending on how well the predicted output (e.g., probability of carrot) for a particular input matches the annotated output label in the training data. One can view the errors made by the neural network in the computation of a function as a kind of unpleasant feedback in a biological organism, leading to an adjustment in the synaptic strengths. Similarly, the weights between neurons are adjusted in a neural network in response to prediction errors. The goal of chang- ing the weights is to modify the computed function to make the predictions more correct in future iterations. Therefore, the weights are changed carefully in a mathematically justified way so as to reduce the error in computation on that example. By successively adjusting the weights between neurons over many input-output pairs, the function computed by the neural network is refined over time so that it provides more accurate predictions. Therefore, if the neural network is trained with many different images of bananas, it will eventually be able to properly recognize a banana in an image it has not seen before. This ability to accurately compute functions of unseen inputs by training over a finite set of input-output pairs is referred to as model generalization. The primary usefulness of all machine learning models is gained from their ability to generalize their learning from seen training data to unseen examples. The biological comparison is often criticized as a very poor caricature of the workings of the human brain; nevertheless, the principles of neuroscience have often been useful in designing neural network architectures. A different view is that neural networks are built as higher-level abstractions of the classical models that are commonly used in machine learning. In fact, the most basic units of computation in the neural network are inspired by traditional machine learning algorithms like least-squares regression and logistic regression. Neural networks gain their power by putting together many such basic units, and learning the weights of the different units jointly in order to minimize the prediction error. From this point of view, a neural network can be viewed as a computational graph of elementary units in which greater power is gained by connecting them in particular ways. 
When a neural network is used in its most basic form, without hooking together multiple units, the learning algorithms often reduce to classical machine learning models (see Chapter 2). The real power of a neural model over classical methods is unleashed when these elementary computational units are combined, and the weights of the elementary models are trained using their dependencies on one another. By combining multiple units, one is increasing the power of the model to learn more complicated functions of the data than are inherent in the elementary models of basic machine learning. The way in which these units are combined also plays a role in the power of the architecture, and requires some understanding and insight from the analyst. Furthermore, sufficient training data is also required in order to learn the larger number of weights in these expanded computational graphs.
  • 27. 1.1. INTRODUCTION 3 ACCURACY AMOUNT OF DATA DEEP LEARNING CONVENTIONAL MACHINE LEARNING Figure 1.2: An illustrative comparison of the accuracy of a typical machine learning al- gorithm with that of a large neural network. Deep learners become more attractive than conventional methods primarily when sufficient data/computational power is available. Re- cent years have seen an increase in data availability and computational power, which has led to a “Cambrian explosion” in deep learning technology. 1.1.1 Humans Versus Computers: Stretching the Limits of Artificial Intelligence Humans and computers are inherently suited to different types of tasks. For example, com- puting the cube root of a large number is very easy for a computer, but it is extremely difficult for humans. On the other hand, a task such as recognizing the objects in an image is a simple matter for a human, but has traditionally been very difficult for an automated learning algorithm. It is only in recent years that deep learning has shown an accuracy on some of these tasks that exceeds that of a human. In fact, the recent results by deep learning algorithms that surpass human performance [184] in (some narrow tasks on) image recog- nition would not have been considered likely by most computer vision experts as recently as 10 years ago. Many deep learning architectures that have shown such extraordinary performance are not created by indiscriminately connecting computational units. The superior performance of deep neural networks mirrors the fact that biological neural networks gain much of their power from depth as well. Furthermore, biological networks are connected in ways we do not fully understand. In the few cases that the biological structure is understood at some level, significant breakthroughs have been achieved by designing artificial neural networks along those lines. A classical example of this type of architecture is the use of the convolutional neural network for image recognition. This architecture was inspired by Hubel and Wiesel’s experiments [212] in 1959 on the organization of the neurons in the cat’s visual cortex. The precursor to the convolutional neural network was the neocognitron [127], which was directly based on these results. The human neuronal connection structure has evolved over millions of years to optimize survival-driven performance; survival is closely related to our ability to merge sensation and intuition in a way that is currently not possible with machines. Biological neuroscience [232] is a field that is still very much in its infancy, and only a limited amount is known about how the brain truly works. Therefore, it is fair to suggest that the biologically inspired success of convolutional neural networks might be replicated in other settings, as we learn more about how the human brain works [176]. A key advantage of neural networks over tradi- tional machine learning is that the former provides a higher-level abstraction of expressing semantic insights about data domains by architectural design choices in the computational graph. The second advantage is that neural networks provide a simple way to adjust the
  • 28. 4 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS complexity of a model by adding or removing neurons from the architecture according to the availability of training data or computational power. A large part of the recent suc- cess of neural networks is explained by the fact that the increased data availability and computational power of modern computers has outgrown the limits of traditional machine learning algorithms, which fail to take full advantage of what is now possible. This situation is illustrated in Figure 1.2. The performance of traditional machine learning remains better at times for smaller data sets because of more choices, greater ease of model interpretation, and the tendency to hand-craft interpretable features that incorporate domain-specific in- sights. With limited data, the best of a very wide diversity of models in machine learning will usually perform better than a single class of models (like neural networks). This is one reason why the potential of neural networks was not realized in the early years. The “big data” era has been enabled by the advances in data collection technology; vir- tually everything we do today, including purchasing an item, using the phone, or clicking on a site, is collected and stored somewhere. Furthermore, the development of powerful Graph- ics Processor Units (GPUs) has enabled increasingly efficient processing on such large data sets. These advances largely explain the recent success of deep learning using algorithms that are only slightly adjusted from the versions that were available two decades back. Furthermore, these recent adjustments to the algorithms have been enabled by increased speed of computation, because reduced run-times enable efficient testing (and subsequent algorithmic adjustment). If it requires a month to test an algorithm, at most twelve varia- tions can be tested in an year on a single hardware platform. This situation has historically constrained the intensive experimentation required for tweaking neural-network learning algorithms. The rapid advances associated with the three pillars of improved data, compu- tation, and experimentation have resulted in an increasingly optimistic outlook about the future of deep learning. By the end of this century, it is expected that computers will have the power to train neural networks with as many neurons as the human brain. Although it is hard to predict what the true capabilities of artificial intelligence will be by then, our experience with computer vision should prepare us to expect the unexpected. Chapter Organization This chapter is organized as follows. The next section introduces single-layer and multi-layer networks. The different types of activation functions, output nodes, and loss functions are discussed. The backpropagation algorithm is introduced in Section 1.3. Practical issues in neural network training are discussed in Section 1.4. Some key points on how neural networks gain their power with specific choices of activation functions are discussed in Section 1.5. The common architectures used in neural network design are discussed in Section 1.6. Advanced topics in deep learning are discussed in Section 1.7. Some notable benchmarks used by the deep learning community are discussed in Section 1.8. A summary is provided in Section 1.9. 1.2 The Basic Architecture of Neural Networks In this section, we will introduce single-layer and multi-layer neural networks. 
In the single- layer network, a set of inputs is directly mapped to an output by using a generalized variation of a linear function. This simple instantiation of a neural network is also referred to as the perceptron. In multi-layer neural networks, the neurons are arranged in layered fashion, in which the input and output layers are separated by a group of hidden layers. This layer-wise architecture of the neural network is also referred to as a feed-forward network. This section will discuss both single-layer and multi-layer networks.
Figure 1.3: The basic architecture of the perceptron. Panel (a) shows the perceptron without bias; panel (b) shows the perceptron with a bias neuron.

1.2.1 Single Computational Layer: The Perceptron

The simplest neural network is referred to as the perceptron. This neural network contains a single input layer and an output node. The basic architecture of the perceptron is shown in Figure 1.3(a). Consider a situation where each training instance is of the form (X, y), where each X = [x1, . . . , xd] contains d feature variables, and y ∈ {−1, +1} contains the observed value of the binary class variable. By “observed value” we refer to the fact that it is given to us as a part of the training data, and our goal is to predict the class variable for cases in which it is not observed. For example, in a credit-card fraud detection application, the features might represent various properties of a set of credit card transactions (e.g., amount and frequency of transactions), and the class variable might represent whether or not this set of transactions is fraudulent. Clearly, in this type of application, one would have historical cases in which the class variable is observed, and other (current) cases in which the class variable has not yet been observed but needs to be predicted.

The input layer contains d nodes that transmit the d features X = [x1 . . . xd] with edges of weight W = [w1 . . . wd] to an output node. The input layer does not perform any computation in its own right. The linear function W · X = ∑_{i=1}^{d} w_i x_i is computed at the output node. Subsequently, the sign of this real value is used in order to predict the dependent variable of X. Therefore, the prediction ŷ is computed as follows:

ŷ = sign{W · X} = sign{ ∑_{j=1}^{d} w_j x_j }    (1.1)

The sign function maps a real value to either +1 or −1, which is appropriate for binary classification. Note the circumflex on top of the variable y to indicate that it is a predicted value rather than an observed value. The error of the prediction is therefore E(X) = y − ŷ, which is one of the values drawn from the set {−2, 0, +2}. In cases where the error value E(X) is nonzero, the weights in the neural network need to be updated in the (negative) direction of the error gradient. As we will see later, this process is similar to that used in various types of linear models in machine learning. In spite of the similarity of the perceptron with respect to traditional machine learning models, its interpretation as a computational unit is very useful because it allows us to put together multiple units in order to create far more powerful models than are available in traditional machine learning.
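The prediction of Equation 1.1 and the error E(X) = y − ŷ can be computed directly. The following NumPy sketch is illustrative only (the variable names, array shapes, and the convention of mapping a zero score to +1 are assumptions, not specifications from the text):

```python
import numpy as np

def perceptron_predict(W, X):
    """Equation 1.1: y_hat = sign{W . X} for each row of X, mapping ties at
    zero to +1 so that the output is always drawn from {-1, +1}."""
    return np.where(X @ W >= 0, 1, -1)

# Toy data: d = 3 features, 2 training instances.
W = np.array([0.5, -1.0, 0.25])
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
y = np.array([1, 1])
y_hat = perceptron_predict(W, X)
print(y_hat)         # [ 1 -1]
print(y - y_hat)     # the error E(X) for each instance, drawn from {-2, 0, +2}
```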
  • 30. 6 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS The architecture of the perceptron is shown in Figure 1.3(a), in which a single input layer transmits the features to the output node. The edges from the input to the output contain the weights w1 . . . wd with which the features are multiplied and added at the output node. Subsequently, the sign function is applied in order to convert the aggregated value into a class label. The sign function serves the role of an activation function. Different choices of activation functions can be used to simulate different types of models used in machine learning, like least-squares regression with numeric targets, the support vector machine, or a logistic regression classifier. Most of the basic machine learning models can be easily represented as simple neural network architectures. It is a useful exercise to model traditional machine learning techniques as neural architectures, because it provides a clearer picture of how deep learning generalizes traditional machine learning. This point of view is explored in detail in Chapter 2. It is noteworthy that the perceptron contains two layers, although the input layer does not perform any computation and only transmits the feature values. The input layer is not included in the count of the number of layers in a neural network. Since the perceptron contains a single computational layer, it is considered a single-layer network. In many settings, there is an invariant part of the prediction, which is referred to as the bias. For example, consider a setting in which the feature variables are mean centered, but the mean of the binary class prediction from {−1, +1} is not 0. This will tend to occur in situations in which the binary class distribution is highly imbalanced. In such a case, the aforementioned approach is not sufficient for prediction. We need to incorporate an additional bias variable b that captures this invariant part of the prediction: ŷ = sign{W · X + b} = sign{ d j=1 wjxj + b} (1.2) The bias can be incorporated as the weight of an edge by using a bias neuron. This is achieved by adding a neuron that always transmits a value of 1 to the output node. The weight of the edge connecting the bias neuron to the output node provides the bias variable. An example of a bias neuron is shown in Figure 1.3(b). Another approach that works well with single-layer architectures is to use a feature engineering trick in which an additional feature is created with a constant value of 1. The coefficient of this feature provides the bias, and one can then work with Equation 1.1. Throughout this book, biases will not be explicitly used (for simplicity in architectural representations) because they can be incorporated with bias neurons. The details of the training algorithms remain the same by simply treating the bias neurons like any other neuron with a fixed activation value of 1. Therefore, the following will work with the predictive assumption of Equation 1.1, which does not explicitly uses biases. At the time that the perceptron algorithm was proposed by Rosenblatt [405], these op- timizations were performed in a heuristic way with actual hardware circuits, and it was not presented in terms of a formal notion of optimization in machine learning (as is common today). However, the goal was always to minimize the error in prediction, even if a for- mal optimization formulation was not presented. 
The perceptron algorithm was, therefore, heuristically designed to minimize the number of misclassifications, and convergence proofs were available that provided correctness guarantees of the learning algorithm in simplified settings. Therefore, we can still write the (heuristically motivated) goal of the perceptron algorithm in least-squares form with respect to all training instances in a data set D con-
taining feature-label pairs:

Minimize_W L = ∑_{(X,y)∈D} (y − ŷ)² = ∑_{(X,y)∈D} (y − sign{W · X})²

This type of minimization objective function is also referred to as a loss function. As we will see later, almost all neural network learning algorithms are formulated with the use of a loss function. As we will learn in Chapter 2, this loss function looks a lot like least-squares regression. However, the latter is defined for continuous-valued target variables, and the corresponding loss is a smooth and continuous function of the variables. On the other hand, for the least-squares form of the objective function, the sign function is non-differentiable, with step-like jumps at specific points. Furthermore, the sign function takes on constant values over large portions of the domain, and therefore the exact gradient takes on zero values at differentiable points. This results in a staircase-like loss surface, which is not suitable for gradient-descent. The perceptron algorithm (implicitly) uses a smooth approximation of the gradient of this objective function with respect to each example:

∇L_smooth = ∑_{(X,y)∈D} (y − ŷ)X    (1.3)

Note that the above gradient is not a true gradient of the staircase-like surface of the (heuristic) objective function, which does not provide useful gradients. Therefore, the staircase is smoothed out into a sloping surface defined by the perceptron criterion. The properties of the perceptron criterion will be described in Section 1.2.1.1. It is noteworthy that concepts like the “perceptron criterion” were proposed later than the original paper by Rosenblatt [405] in order to explain the heuristic gradient-descent steps. For now, we will assume that the perceptron algorithm optimizes some unknown smooth function with the use of gradient descent.

Although the above objective function is defined over the entire training data, the training algorithm of neural networks works by feeding each input data instance X into the network one by one (or in small batches) to create the prediction ŷ. The weights are then updated, based on the error value E(X) = (y − ŷ). Specifically, when the data point X is fed into the network, the weight vector W is updated as follows:

W ⇐ W + α(y − ŷ)X    (1.4)

The parameter α regulates the learning rate of the neural network. The perceptron algorithm repeatedly cycles through all the training examples in random order and iteratively adjusts the weights until convergence is reached. A single training data point may be cycled through many times. Each such cycle is referred to as an epoch. One can also write the gradient-descent update in terms of the error E(X) = (y − ŷ) as follows:

W ⇐ W + αE(X)X    (1.5)

The basic perceptron algorithm can be considered a stochastic gradient-descent method, which implicitly minimizes the squared error of prediction by performing gradient-descent updates with respect to randomly chosen training points. The assumption is that the neural network cycles through the points in random order during training and changes the weights with the goal of reducing the prediction error on that point. It is easy to see from Equation 1.5 that non-zero updates are made to the weights only when y ≠ ŷ, which occurs only
  • 32. 8 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS when errors are made in prediction. In mini-batch stochastic gradient descent, the aforemen- tioned updates of Equation 1.5 are implemented over a randomly chosen subset of training points S: W ⇐ W + α X∈S E(X)X (1.6) LINEARLY SEPARABLE NOT LINEARLY SEPARABLE W X = 0 Figure 1.4: Examples of linearly separable and inseparable data in two classes The advantages of using mini-batch stochastic gradient descent are discussed in Section 3.2.8 of Chapter 3. An interesting quirk of the perceptron is that it is possible to set the learning rate α to 1, because the learning rate only scales the weights. The type of model proposed in the perceptron is a linear model, in which the equation W ·X = 0 defines a linear hyperplane. Here, W = (w1 . . . wd) is a d-dimensional vector that is normal to the hyperplane. Furthermore, the value of W · X is positive for values of X on one side of the hyperplane, and it is negative for values of X on the other side. This type of model performs particularly well when the data is linearly separable. Examples of linearly separable and inseparable data are shown in Figure 1.4. The perceptron algorithm is good at classifying data sets like the one shown on the left-hand side of Figure 1.4, when the data is linearly separable. On the other hand, it tends to perform poorly on data sets like the one shown on the right-hand side of Figure 1.4. This example shows the inherent modeling limitation of a perceptron, which necessitates the use of more complex neural architectures. Since the original perceptron algorithm was proposed as a heuristic minimization of classification errors, it was particularly important to show that the algorithm converges to reasonable solutions in some special cases. In this context, it was shown [405] that the perceptron algorithm always converges to provide zero error on the training data when the data are linearly separable. However, the perceptron algorithm is not guaranteed to converge in instances where the data are not linearly separable. For reasons discussed in the next section, the perceptron might sometimes arrive at a very poor solution with data that are not linearly separable (in comparison with many other learning algorithms). 1.2.1.1 What Objective Function Is the Perceptron Optimizing? As discussed earlier in this chapter, the original perceptron paper by Rosenblatt [405] did not formally propose a loss function. In those years, these implementations were achieved using actual hardware circuits. The original Mark I perceptron was intended to be a machine rather than an algorithm, and custom-built hardware was used to create it (cf. Figure 1.5).
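The point-wise updates of Equations 1.4–1.5 and the mini-batch update of Equation 1.6 translate into a short training loop. The sketch below is a minimal illustration under assumed settings (the learning rate, epoch count, initialization, and toy data are arbitrary choices, not prescriptions from the text):

```python
import numpy as np

def predict(W, X):
    return np.where(X @ W >= 0, 1, -1)          # sign{W . X} for each row of X

def train_perceptron(X, y, alpha=1.0, epochs=25, seed=0):
    """Point-wise updates of Equation 1.5: W <= W + alpha * E(X) * X."""
    rng = np.random.default_rng(seed)
    W = np.zeros(X.shape[1])
    for _ in range(epochs):                      # one epoch = one pass over the data
        for i in rng.permutation(len(y)):        # cycle through points in random order
            error = y[i] - (1 if X[i] @ W >= 0 else -1)   # E(X) in {-2, 0, +2}
            if error != 0:                       # non-zero updates only on mistakes
                W = W + alpha * error * X[i]
    return W

def minibatch_update(W, X_batch, y_batch, alpha=1.0):
    """Mini-batch update of Equation 1.6: W <= W + alpha * sum_{X in S} E(X) * X."""
    errors = y_batch - predict(W, X_batch)
    return W + alpha * (errors @ X_batch)

# Linearly separable toy data (labels follow the sign of the first feature).
X = np.array([[2.0, 1.0], [1.5, -1.0], [-1.0, 0.5], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
W = train_perceptron(X, y)
print(predict(W, X))                             # recovers the training labels
```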
  • 33. 1.2. THE BASIC ARCHITECTURE OF NEURAL NETWORKS 9 The general goal was to minimize the number of classification errors with a heuristic update process (in hardware) that changed weights in the “correct” direction whenever errors were made. This heuristic update strongly resembled gradient descent but it was not derived as a gradient-descent method. Gradient descent is defined only for smooth loss functions in algorithmic settings, whereas the hardware-centric approach was designed in a more Figure 1.5: The perceptron algorithm was originally implemented using hardware circuits. The image depicts the Mark I perceptron machine built in 1958. (Courtesy: Smithsonian Institute) heuristic way with binary outputs. Many of the binary and circuit-centric principles were inherited from the McCulloch-Pitts model [321] of the neuron. Unfortunately, binary signals are not prone to continuous optimization. Can we find a smooth loss function, whose gradient turns out to be the perceptron update? The number of classification errors in a binary classification problem can be written in the form of a 0/1 loss function for training data point (Xi, yi) as follows: L (0/1) i = 1 2 (yi − sign{W · Xi})2 = 1 − yi · sign{W · Xi} (1.7) The simplification to the right-hand side of the above objective function is obtained by set- ting both y2 i and sign{W ·Xi}2 to 1, since they are obtained by squaring a value drawn from {−1, +1}. However, this objective function is not differentiable, because it has a staircase- like shape, especially when it is added over multiple points. Note that the 0/1 loss above is dominated by the term −yisign{W · Xi}, in which the sign function causes most of the problems associated with non-differentiability. Since neural networks are defined by gradient-based optimization, we need to define a smooth objective function that is respon- sible for the perceptron updates. It can be shown [41] that the updates of the perceptron implicitly optimize the perceptron criterion. This objective function is defined by dropping the sign function in the above 0/1 loss and setting negative values to 0 in order to treat all correct predictions in a uniform and lossless way: Li = max{−yi(W · Xi), 0} (1.8) The reader is encouraged to use calculus to verify that the gradient of this smoothed objec- tive function leads to the perceptron update, and the update of the perceptron is essentially
  • 34. 10 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS W ⇐ W − α∇W Li. The modified loss function to enable gradient computation of a non- differentiable function is also referred to as a smoothed surrogate loss function. Almost all continuous optimization-based learning methods (such as neural networks) with discrete outputs (such as class labels) use some type of smoothed surrogate loss function. LOSS PERCEPTRON CRITERION HINGE LOSS 1 0 VALUE OF W X FOR POSITIVE CLASS INSTANCE Figure 1.6: Perceptron criterion versus hinge loss Although the aforementioned perceptron criterion was reverse engineered by working backwards from the perceptron updates, the nature of this loss function exposes some of the weaknesses of the updates in the original algorithm. An interesting observation about the perceptron criterion is that one can set W to the zero vector irrespective of the training data set in order to obtain the optimal loss value of 0. In spite of this fact, the perceptron updates continue to converge to a clear separator between the two classes in linearly separable cases; after all, a separator between the two classes provides a loss value of 0 as well. However, the behavior for data that are not linearly separable is rather arbitrary, and the resulting solution is sometimes not even a good approximate separator of the classes. The direct sensitivity of the loss to the magnitude of the weight vector can dilute the goal of class separation; it is possible for updates to worsen the number of misclassifications significantly while improving the loss. This is an example of how surrogate loss functions might sometimes not fully achieve their intended goals. Because of this fact, the approach is not stable and can yield solutions of widely varying quality. Several variations of the learning algorithm were therefore proposed for inseparable data, and a natural approach is to always keep track of the best solution in terms of the number of misclassifications [128]. This approach of always keeping the best solution in one’s “pocket” is referred to as the pocket algorithm. Another highly performing variant incorporates the notion of margin in the loss function, which creates an identical algorithm to the linear support vector machine. For this reason, the linear support vector machine is also referred to as the perceptron of optimal stability. 1.2.1.2 Relationship with Support Vector Machines The perceptron criterion is a shifted version of the hinge-loss used in support vector ma- chines (see Chapter 2). The hinge loss looks even more similar to the zero-one loss criterion of Equation 1.7, and is defined as follows: Lsvm i = max{1 − yi(W · Xi), 0} (1.9) Note that the perceptron does not keep the constant term of 1 on the right-hand side of Equation 1.7, whereas the hinge loss keeps this constant within the maximization function. This change does not affect the algebraic expression for the gradient, but it does change
which points are lossless and should not cause an update. The relationship between the perceptron criterion and the hinge loss is shown in Figure 1.6. This similarity becomes particularly evident when the perceptron updates of Equation 1.6 are rewritten as follows:

W ⇐ W + α ∑_{(X,y)∈S+} yX    (1.10)

Here, S+ is defined as the set of all misclassified training points X ∈ S that satisfy the condition y(W · X) < 0. This update seems to look somewhat different from the perceptron, because the perceptron uses the error E(X) for the update, which is replaced with y in the update above. A key point is that the (integer) error value E(X) = (y − sign{W · X}) ∈ {−2, +2} can never be 0 for misclassified points in S+. Therefore, we have E(X) = 2y for misclassified points, and E(X) can be replaced with y in the updates after absorbing the factor of 2 within the learning rate. This update is identical to that used by the primal support vector machine (SVM) algorithm [448], except that the updates are performed only for the misclassified points in the perceptron, whereas the SVM also uses the marginally correct points near the decision boundary for updates. Note that the SVM uses the condition y(W · X) < 1 [instead of using the condition y(W · X) < 0] to define S+, which is one of the key differences between the two algorithms. This point shows that the perceptron is fundamentally not very different from well-known machine learning algorithms like the support vector machine in spite of its different origins. Freund and Schapire provide a beautiful exposition of the role of margin in improving stability of the perceptron and also its relationship with the support vector machine [123]. It turns out that many traditional machine learning models can be viewed as minor variations of shallow neural architectures like the perceptron. The relationships between classical machine learning models and shallow neural networks are described in detail in Chapter 2.

1.2.1.3 Choice of Activation and Loss Functions

The choice of activation function is a critical part of neural network design. In the case of the perceptron, the choice of the sign activation function is motivated by the fact that a binary class label needs to be predicted. However, it is possible to have other types of situations where different target variables may be predicted. For example, if the target variable to be predicted is real, then it makes sense to use the identity activation function, and the resulting algorithm is the same as least-squares regression. If it is desirable to predict a probability of a binary class, it makes sense to use a sigmoid function for activating the output node, so that the prediction ŷ indicates the probability that the observed value, y, of the dependent variable is 1. The negative logarithm of |y/2 − 0.5 + ŷ| is used as the loss, assuming that y is coded from {−1, 1}. If ŷ is the probability that y is 1, then |y/2 − 0.5 + ŷ| is the probability that the correct value is predicted. This assertion is easy to verify by examining the two cases where y is −1 or +1. This loss function can be shown to be representative of the negative log-likelihood of the training data (see Section 2.2.3 of Chapter 2).

The importance of nonlinear activation functions becomes significant when one moves from the single-layered perceptron to the multi-layered architectures discussed later in this chapter.
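To make the comparison in Section 1.2.1.2 concrete, the following sketch evaluates the perceptron criterion (Equation 1.8) and the hinge loss (Equation 1.9), and selects the update set S+ with the two conditions discussed above. This is only an illustration with assumed names and toy data; it is not the full support vector machine, which would also include regularization.

```python
import numpy as np

def perceptron_criterion(W, X, y):
    return np.maximum(-y * (X @ W), 0.0).sum()          # Equation 1.8

def hinge_loss(W, X, y):
    return np.maximum(1.0 - y * (X @ W), 0.0).sum()     # Equation 1.9

def batch_update(W, X, y, alpha=0.1, margin=0.0):
    """W <= W + alpha * sum_{X in S+} y * X, where S+ contains the points with
    y * (W . X) < margin (0 for the perceptron, 1 for the SVM-style update)."""
    in_splus = (y * (X @ W) < margin).astype(float)
    return W + alpha * ((in_splus * y) @ X)

W = np.array([1.0, 0.0])
X = np.array([[0.5, 0.0],     # correctly classified, but inside the margin
              [-3.0, 1.0]])   # correctly classified, well outside the margin
y = np.array([1, -1])
print(batch_update(W, X, y, margin=0.0))   # perceptron: no point qualifies, W unchanged
print(batch_update(W, X, y, margin=1.0))   # SVM-style: the marginal point still triggers an update
```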
Different types of nonlinear functions such as the sign, sigmoid, or hyperbolic tan- gents may be used in various layers. We use the notation Φ to denote the activation function: ŷ = Φ(W · X) (1.11) Therefore, a neuron really computes two functions within the node, which is why we have incorporated the summation symbol Σ as well as the activation symbol Φ within a neuron. The break-up of the neuron computations into two separate values is shown in Figure 1.7.
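A neuron's two-step computation in Equation 1.11 (a linear combination followed by an activation) can be made explicit in a few lines. The sketch below is illustrative only; the sigmoid choice and the variable names are assumptions rather than part of the text.

```python
import numpy as np

def neuron(W, X, phi=lambda v: 1.0 / (1.0 + np.exp(-v))):
    """Compute a single neuron: the linear value a_h = W . X followed by
    the activated output h = phi(a_h), as in Equation 1.11."""
    a_h = np.dot(W, X)      # value before applying the activation
    h = phi(a_h)            # value after applying the activation (the neuron's output)
    return a_h, h

W = np.array([0.4, -0.2, 0.1])
X = np.array([1.0, 2.0, 3.0])
print(neuron(W, X))         # e.g., (0.3, 0.574...); both values are useful in later analyses
```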
Figure 1.7: Pre-activation and post-activation values within a neuron. The pre-activation value is a_h = W · X, and the post-activation value is h = Φ(a_h) = Φ(W · X).

The value computed before applying the activation function Φ(·) will be referred to as the pre-activation value, whereas the value computed after applying the activation function is referred to as the post-activation value. The output of a neuron is always the post-activation value, although the pre-activation variables are often used in different types of analyses, such as the computations of the backpropagation algorithm discussed later in this chapter. The pre-activation and post-activation values of a neuron are shown in Figure 1.7.

The most basic activation function Φ(·) is the identity or linear activation, which provides no nonlinearity:

Φ(v) = v

The linear activation function is often used in the output node, when the target is a real value. It is even used for discrete outputs when a smoothed surrogate loss function needs to be set up. The classical activation functions that were used early in the development of neural networks were the sign, sigmoid, and the hyperbolic tangent functions:

Φ(v) = sign(v)    (sign function)
Φ(v) = 1/(1 + e^{−v})    (sigmoid function)
Φ(v) = (e^{2v} − 1)/(e^{2v} + 1)    (tanh function)

While the sign activation can be used to map to binary outputs at prediction time, its non-differentiability prevents its use for creating the loss function at training time. For example, while the perceptron uses the sign function for prediction, the perceptron criterion in training only requires linear activation. The sigmoid activation outputs a value in (0, 1), which is helpful in performing computations that should be interpreted as probabilities. Furthermore, it is also helpful in creating probabilistic outputs and constructing loss functions derived from maximum-likelihood models. The tanh function has a shape similar to that of the sigmoid function, except that it is horizontally re-scaled and vertically translated/re-scaled to [−1, 1]. The tanh and sigmoid functions are related as follows (see Exercise 3):

tanh(v) = 2 · sigmoid(2v) − 1

The tanh function is preferable to the sigmoid when the outputs of the computations are desired to be both positive and negative. Furthermore, its mean-centering and larger gradient (because of stretching) with respect to sigmoid makes it easier to train. The sigmoid and the tanh functions have been the historical tools of choice for incorporating nonlinearity in the neural network. In recent years, however, a number of piecewise linear activation functions have become more popular:

Φ(v) = max{v, 0}    (Rectified Linear Unit [ReLU])
Φ(v) = max{min[v, 1], −1}    (hard tanh)

The ReLU and hard tanh activation functions have largely replaced the sigmoid and soft tanh activation functions in modern neural networks because of the ease in training multi-layered neural networks with these activation functions. Pictorial representations of all the aforementioned activation functions are illustrated in Figure 1.8.

Figure 1.8: Various activation functions

It is noteworthy that all activation functions shown here are monotonic. Furthermore, other than the identity activation function, most¹ of the other activation functions saturate at large absolute values of the argument at which increasing further does not change the activation much. As we will see later, such nonlinear activation functions are also very useful in multilayer networks, because they help in creating more powerful compositions of different types of functions. Many of these functions are referred to as squashing functions, as they map the outputs from an arbitrary range to bounded outputs. The use of a nonlinear activation plays a fundamental role in increasing the modeling power of a network. If a network used only linear activations, it would not provide better modeling power than a single-layer linear network. This issue is discussed in Section 1.5.

¹The ReLU shows asymmetric saturation.
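The activation functions listed above can be implemented directly. The sketch below is illustrative (not from the text); it also checks the stated identity tanh(v) = 2 · sigmoid(2v) − 1 numerically.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def tanh(v):
    return (np.exp(2 * v) - 1) / (np.exp(2 * v) + 1)    # same result as np.tanh(v)

def relu(v):
    return np.maximum(v, 0.0)                            # Rectified Linear Unit

def hard_tanh(v):
    return np.maximum(np.minimum(v, 1.0), -1.0)

v = np.linspace(-3, 3, 7)
# The tanh is a re-scaled and translated sigmoid: tanh(v) = 2*sigmoid(2v) - 1.
print(np.allclose(tanh(v), 2 * sigmoid(2 * v) - 1))      # True
print(relu(v))
print(hard_tanh(v))
```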
  • 38. 14 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS x4 x3 x2 x1 INPUT LAYER HIDDEN LAYER x5 P(y=green) OUTPUTS P(y=blue) P(y=red) v1 v2 v3 SOFTMAX LAYER Figure 1.9: An example of multiple outputs for categorical classification with the use of a softmax layer 1.2.1.4 Choice and Number of Output Nodes The choice and number of output nodes is also tied to the activation function, which in turn depends on the application at hand. For example, if k-way classification is intended, k output values can be used, with a softmax activation function with respect to outputs v = [v1, . . . , vk] at the nodes in a given layer. Specifically, the activation function for the ith output is defined as follows: Φ(v)i = exp(vi) k j=1 exp(vj) ∀i ∈ {1, . . . , k} (1.12) It is helpful to think of these k values as the values output by k nodes, in which the in- puts are v1 . . . vk. An example of the softmax function with three outputs is illustrated in Figure 1.9, and the values v1, v2, and v3 are also shown in the same figure. Note that the three outputs correspond to the probabilities of the three classes, and they convert the three outputs of the final hidden layer into probabilities with the softmax function. The final hid- den layer often uses linear (identity) activations, when it is input into the softmax layer. Furthermore, there are no weights associated with the softmax layer, since it is only con- verting real-valued outputs into probabilities. The use of softmax with a single hidden layer of linear activations exactly implements a model, which is referred to as multinomial logistic regression [6]. Similarly, many variations like multi-class SVMs can be easily implemented with neural networks. Another example of a case in which multiple output nodes are used is the autoencoder, in which each input data point is fully reconstructed by the output layer. The autoencoder can be used to implement matrix factorization methods like singular value decomposition. This architecture will be discussed in detail in Chapter 2. The simplest neu- ral networks that simulate basic machine learning algorithms are instructive because they lie on the continuum between traditional machine learning and deep networks. By exploring these architectures, one gets a better idea of the relationship between traditional machine learning and neural networks, and also the advantages provided by the latter. 1.2.1.5 Choice of Loss Function The choice of the loss function is critical in defining the outputs in a way that is sensitive to the application at hand. For example, least-squares regression with numeric outputs
  • 39. 1.2. THE BASIC ARCHITECTURE OF NEURAL NETWORKS 15 requires a simple squared loss of the form (y − ŷ)2 for a single training instance with target y and prediction ŷ. One can also use other types of loss like hinge loss for y ∈ {−1, +1} and real-valued prediction ŷ (with identity activation): L = max{0, 1 − y · ŷ} (1.13) The hinge loss can be used to implement a learning method, which is referred to as a support vector machine. For multiway predictions (like predicting word identifiers or one of multiple classes), the softmax output is particularly useful. However, a softmax output is probabilistic, and therefore it requires a different type of loss function. In fact, for probabilistic predictions, two different types of loss functions are used, depending on whether the prediction is binary or whether it is multiway: 1. Binary targets (logistic regression): In this case, it is assumed that the observed value y is drawn from {−1, +1}, and the prediction ŷ is a an arbitrary numerical value on using the identity activation function. In such a case, the loss function for a single instance with observed value y and real-valued prediction ŷ (with identity activation) is defined as follows: L = log(1 + exp(−y · ŷ)) (1.14) This type of loss function implements a fundamental machine learning method, re- ferred to as logistic regression. Alternatively, one can use a sigmoid activation function to output ŷ ∈ (0, 1), which indicates the probability that the observed value y is 1. Then, the negative logarithm of |y/2 − 0.5 + ŷ| provides the loss, assuming that y is coded from {−1, 1}. This is because |y/2 − 0.5 + ŷ| indicates the probability that the prediction is correct. This observation illustrates that one can use various combina- tions of activation and loss functions to achieve the same result. 2. Categorical targets: In this case, if ŷ1 . . . ŷk are the probabilities of the k classes (using the softmax activation of Equation 1.9), and the rth class is the ground-truth class, then the loss function for a single instance is defined as follows: L = −log(ŷr) (1.15) This type of loss function implements multinomial logistic regression, and it is re- ferred to as the cross-entropy loss. Note that binary logistic regression is identical to multinomial logistic regression, when the value of k is set to 2 in the latter. The key point to remember is that the nature of the output nodes, the activation function, and the loss function depend on the application at hand. Furthermore, these choices also depend on one another. Even though the perceptron is often presented as the quintessential representative of single-layer networks, it is only a single representative out of a very large universe of possibilities. In practice, one rarely uses the perceptron criterion as the loss function. For discrete-valued outputs, it is common to use softmax activation with cross- entropy loss. For real-valued outputs, it is common to use linear activation with squared loss. Generally, cross-entropy loss is easier to optimize than squared loss.
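The softmax activation of Equation 1.12 and the loss functions of Equations 1.13–1.15 can be written in a few lines. The sketch below is illustrative only; the names are assumptions, y is coded from {−1, +1} for the binary losses, and the softmax subtracts the maximum argument purely for numerical stability (an implementation detail not discussed in the text).

```python
import numpy as np

def softmax(v):
    """Equation 1.12: convert k real-valued outputs into probabilities."""
    z = np.exp(v - np.max(v))        # subtracting max(v) does not change the result
    return z / z.sum()

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2                      # real-valued targets (least-squares)

def hinge_loss(y, y_hat):
    return max(0.0, 1.0 - y * y_hat)             # Equation 1.13 (support vector machine)

def logistic_loss(y, y_hat):
    return np.log(1.0 + np.exp(-y * y_hat))      # Equation 1.14 (logistic regression)

def cross_entropy_loss(probs, r):
    return -np.log(probs[r])                     # Equation 1.15 (r = ground-truth class)

p = softmax(np.array([2.0, 1.0, 0.1]))           # probabilities of three classes
print(p, p.sum())                                # the probabilities sum to 1
print(squared_loss(1, 0.8), hinge_loss(1, 0.8),
      logistic_loss(1, 0.8), cross_entropy_loss(p, 0))
```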
Figure 1.10: The derivatives of various activation functions

1.2.1.6 Some Useful Derivatives of Activation Functions

Most neural network learning is primarily related to gradient-descent with activation functions. For this reason, the derivatives of these activation functions are used repeatedly in this book, and gathering them in a single place for future reference is useful. This section provides details on the derivatives of these activation functions. Later chapters will extensively refer to these results.

1. Linear and sign activations: The derivative of the linear activation function is 1 at all places. The derivative of sign(v) is 0 at all values of v other than at v = 0, where it is discontinuous and non-differentiable. Because of the zero gradient and non-differentiability of this activation function, it is rarely used in the loss function even when it is used for prediction at testing time. The derivatives of the linear and sign activations are illustrated in Figure 1.10(a) and (b), respectively.

2. Sigmoid activation: The derivative of sigmoid activation is particularly simple, when it is expressed in terms of the output of the sigmoid, rather than the input. Let o be the output of the sigmoid function with argument v:

o = 1/(1 + exp(−v))    (1.16)

Then, one can write the derivative of the activation as follows:

∂o/∂v = exp(−v)/(1 + exp(−v))²    (1.17)
  • 41. 1.2. THE BASIC ARCHITECTURE OF NEURAL NETWORKS 17 The key point is that this sigmoid can be written more conveniently in terms of the outputs: ∂o ∂v = o(1 − o) (1.18) The derivative of the sigmoid is often used as a function of the output rather than the input. The derivative of the sigmoid activation function is illustrated in Figure 1.10(c). 3. Tanh activation: As in the case of the sigmoid activation, the tanh activation is often used as a function of the output o rather than the input v: o = exp(2v) − 1 exp(2v) + 1 (1.19) One can then compute the gradient as follows: ∂o ∂v = 4 · exp(2v) (exp(2v) + 1)2 (1.20) One can also write this derivative in terms of the output o: ∂o ∂v = 1 − o2 (1.21) The derivative of the tanh activation is illustrated in Figure 1.10(d). 4. ReLU and hard tanh activations: The ReLU takes on a partial derivative value of 1 for non-negative values of its argument, and 0, otherwise. The hard tanh function takes on a partial derivative value of 1 for values of the argument in [−1, +1] and 0, otherwise. The derivatives of the ReLU and hard tanh activations are illustrated in Figure 1.10(e) and (f), respectively. 1.2.2 Multilayer Neural Networks Multilayer neural networks contain more than one computational layer. The perceptron contains an input and output layer, of which the output layer is the only computation- performing layer. The input layer transmits the data to the output layer, and all com- putations are completely visible to the user. Multilayer neural networks contain multiple computational layers; the additional intermediate layers (between input and output) are referred to as hidden layers because the computations performed are not visible to the user. The specific architecture of multilayer neural networks is referred to as feed-forward net- works, because successive layers feed into one another in the forward direction from input to output. The default architecture of feed-forward networks assumes that all nodes in one layer are connected to those of the next layer. Therefore, the architecture of the neural network is almost fully defined, once the number of layers and the number/type of nodes in each layer have been defined. The only remaining detail is the loss function that is optimized in the output layer. Although the perceptron algorithm uses the perceptron criterion, this is not the only choice. It is extremely common to use softmax outputs with cross-entropy loss for discrete prediction and linear outputs with squared loss for real-valued prediction. As in the case of single-layer networks, bias neurons can be used both in the hidden layers and in the output layers. Examples of multilayer networks with or without the bias neurons are shown in Figure 1.11(a) and (b), respectively. In each case, the neural network
  • 42. 18 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS contains three layers. Note that the input layer is often not counted, because it simply transmits the data and no computation is performed in that layer. If a neural network contains p1 . . . pk units in each of its k layers, then the (column) vector representations of these outputs, denoted by h1 . . . hk have dimensionalities p1 . . . pk. Therefore, the number of units in each layer is referred to as the dimensionality of that layer. INPUT LAYER HIDDEN LAYER OUTPUT LAYER y x4 x3 x2 x1 x5 INPUT LAYER HIDDEN LAYER y OUTPUT LAYER +1 +1 BIAS NEURONS +1 BIAS NEURON x4 x3 x2 x1 x5 s n o r u e n s a i b h t i W ) b ( s n o r u e n s a i b o N ) a ( y x4 x3 x2 x1 x5 h11 h12 h13 h23 h22 h21 h1 h2 X SCALAR WEIGHTS ON CONNECTIONS WEIGHT MATRICES ON CONNECTIONS y X h1 h2 X 5 X 3 MATRIX 3 X 3 MATRIX 3 X 1 MATRIX Figure 1.11: The basic architecture of a feed-forward network with two hidden layers and a single output layer. Even though each unit contains a single scalar variable, one often represents all units within a single layer as a single vector unit. Vector units are often represented as rectangles and have connection matrices between them. INPUT LAYER HIDDEN LAYER OUTPUT LAYER xI 4 xI 3 xI 2 xI 1 xI 5 OUTPUT OF THIS LAYER PROVIDES REDUCED REPRESENTATION x4 x3 x2 x1 x5 Figure 1.12: An example of an autoencoder with multiple outputs
  • 43. 1.2. THE BASIC ARCHITECTURE OF NEURAL NETWORKS 19 The weights of the connections between the input layer and the first hidden layer are contained in a matrix W1 with size d × p1, whereas the weights between the rth hidden layer and the (r + 1)th hidden layer are denoted by the pr × pr+1 matrix denoted by Wr. If the output layer contains o nodes, then the final matrix Wk+1 is of size pk × o. The d-dimensional input vector x is transformed into the outputs using the following recursive equations: h1 = Φ(WT 1 x) [Input to Hidden Layer] hp+1 = Φ(WT p+1hp) ∀p ∈ {1 . . . k − 1} [Hidden to Hidden Layer] o = Φ(WT k+1hk) [Hidden to Output Layer] Here, the activation functions like the sigmoid function are applied in element-wise fashion to their vector arguments. However, some activation functions such as the softmax (which are typically used in the output layers) naturally have vector arguments. Even though each unit of a neural network contains a single variable, many architectural diagrams combine the units in a single layer to create a single vector unit, which is represented as a rectangle rather than a circle. For example, the architectural diagram in Figure 1.11(c) (with scalar units) has been transformed to a vector-based neural architecture in Figure 1.11(d). Note that the connections between the vector units are now matrices. Furthermore, an implicit assumption in the vector-based neural architecture is that all units in a layer use the same activation function, which is applied in element-wise fashion to that layer. This constraint is usually not a problem, because most neural architectures use the same activation function throughout the computational pipeline, with the only deviation caused by the nature of the output layer. Throughout this book, neural architectures in which units contain vector variables will be depicted with rectangular units, whereas scalar variables will correspond to circular units. Note that the aforementioned recurrence equations and vector architectures are valid only for layer-wise feed-forward networks, and cannot always be used for unconventional architectural designs. It is possible to have all types of unconventional designs in which inputs might be incorporated in intermediate layers, or the topology might allow connections between non-consecutive layers. Furthermore, the functions computed at a node may not always be in the form of a combination of a linear function and an activation. It is possible to have all types of arbitrary computational functions at nodes. Although a very classical type of architecture is shown in Figure 1.11, it is possible to vary on it in many ways, such as allowing multiple output nodes. These choices are often determined by the goals of the application at hand (e.g., classification or dimensionality reduction). A classical example of the dimensionality reduction setting is the autoencoder, which recreates the outputs from the inputs. Therefore, the number of outputs and inputs is equal, as shown in Figure 1.12. The constricted hidden layer in the middle outputs the reduced representation of each instance. As a result of this constriction, there is some loss in the representation, which typically corresponds to the noise in the data. The outputs of the hidden layers correspond to the reduced representation of the data. In fact, a shallow variant of this scheme can be shown to be mathematically equivalent to a well-known dimensionality reduction method known as singular value decomposition. 
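The recursive equations given above for a layer-wise feed-forward network amount to a loop over the weight matrices. The following sketch is a minimal illustration (the layer sizes, random initialization, and sigmoid activation are assumptions made for the example, not choices prescribed in the text):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def feed_forward(x, weight_matrices, activation=sigmoid):
    """Compute the output of a layer-wise feed-forward network.

    weight_matrices[r] has shape (p_r, p_{r+1}); the input x has shape (d,).
    Each layer computes h_{r+1} = activation(W_{r+1}^T h_r)."""
    h = x
    for W in weight_matrices:
        h = activation(W.T @ h)
    return h

rng = np.random.default_rng(0)
d, p1, p2, o = 5, 3, 3, 1                       # dimensionalities of the layers
weights = [rng.standard_normal((d, p1)),        # W1: input layer to hidden layer 1
           rng.standard_normal((p1, p2)),       # W2: hidden layer 1 to hidden layer 2
           rng.standard_normal((p2, o))]        # W3: hidden layer 2 to output layer
x = rng.standard_normal(d)
print(feed_forward(x, weights))                 # the network output
```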
As we will learn in Chapter 2, increasing the depth of the network results in inherently more powerful reductions. Although a fully connected architecture is able to perform well in many settings, better performance is often achieved by pruning many of the connections or sharing them in an insightful way. Typically, these insights are obtained by using a domain-specific understand- ing of the data. A classical example of this type of weight pruning and sharing is that of
  • 44. 20 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS the convolutional neural network architecture (cf. Chapter 8), in which the architecture is carefully designed in order to conform to the typical properties of image data. Such an ap- proach minimizes the risk of overfitting by incorporating domain-specific insights (or bias). As we will discuss later in this book (cf. Chapter 4), overfitting is a pervasive problem in neural network design, so that the network often performs very well on the training data, but it generalizes poorly to unseen test data. This problem occurs when the number of free parameters, (which is typically equal to the number of weight connections), is too large compared to the size of the training data. In such cases, the large number of parameters memorize the specific nuances of the training data, but fail to recognize the statistically significant patterns for classifying unseen test data. Clearly, increasing the number of nodes in the neural network tends to encourage overfitting. Much recent work has been focused both on the architecture of the neural network as well as on the computations performed within each node in order to minimize overfitting. Furthermore, the way in which the neu- ral network is trained also has an impact on the quality of the final solution. Many clever methods, such as pretraining (cf. Chapter 4), have been proposed in recent years in order to improve the quality of the learned solution. This book will explore these advanced training methods in detail. 1.2.3 The Multilayer Network as a Computational Graph It is helpful to view a neural network as a computational graph, which is constructed by piecing together many basic parametric models. Neural networks are fundamentally more powerful than their building blocks because the parameters of these models are learned jointly to create a highly optimized composition function of these models. The common use of the term “perceptron” to refer to the basic unit of a neural network is somewhat mis- leading, because there are many variations of this basic unit that are leveraged in different settings. In fact, it is far more common to use logistic units (with sigmoid activation) and piecewise/fully linear units as building blocks of these models. A multilayer network evaluates compositions of functions computed at individual nodes. A path of length 2 in the neural network in which the function f(·) follows g(·) can be considered a composition function f(g(·)). Furthermore, if g1(·), g2(·) . . . gk(·) are the func- tions computed in layer m, and a particular layer-(m + 1) node computes f(·), then the composition function computed by the layer-(m + 1) node in terms of the layer-m inputs is f(g1(·), . . . gk(·)). The use of nonlinear activation functions is the key to increasing the power of multiple layers. If all layers use an identity activation function, then a multilayer network can be shown to simplify to linear regression. It has been shown [208] that a net- work with a single hidden layer of nonlinear units (with a wide ranging choice of squashing functions like the sigmoid unit) and a single (linear) output layer can compute almost any “reasonable” function. As a result, neural networks are often referred to as universal function approximators, although this theoretical claim is not always easy to translate into practical usefulness. The main issue is that the number of hidden units required to do so is rather large, which increases the number of parameters to be learned. 
This results in practical problems in training the network with a limited amount of data. In fact, deeper networks are often preferred because they reduce the number of hidden units in each layer as well as the overall number of parameters.

The “building block” description is particularly appropriate for multilayer neural networks. Very often, off-the-shelf software for building neural networks2 provides analysts with access to these building blocks.

2 Examples include Torch [572], Theano [573], and TensorFlow [574].
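As a concrete illustration of this building-block style, the sketch below assembles a small multilayer network from off-the-shelf layer and activation blocks. It uses PyTorch, a successor of the Torch framework mentioned in the footnote; the layer sizes, loss, and data here are arbitrary illustrative assumptions rather than choices made in the book.

```python
import torch
import torch.nn as nn

# A small multilayer network assembled from off-the-shelf building blocks:
# each nn.Linear is a layer of units, and each nn.ReLU is the activation
# applied to that layer's outputs.
model = nn.Sequential(
    nn.Linear(5, 16),   # 5 input features -> 16 hidden units
    nn.ReLU(),
    nn.Linear(16, 8),   # second hidden layer
    nn.ReLU(),
    nn.Linear(8, 1),    # single output unit
)

loss_fn = nn.MSELoss()                                   # off-the-shelf loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step: forward pass, loss, backpropagation, parameter update.
x = torch.randn(32, 5)                                   # a mini-batch of 32 instances
y = torch.randn(32, 1)
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()                                          # gradients via backpropagation
optimizer.step()
```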
The analyst is able to specify the number and type of units in each layer along with an off-the-shelf or customized loss function. A deep neural network containing tens of layers can often be described in a few hundred lines of code. All the learning of the weights is done automatically by the backpropagation algorithm, which uses dynamic programming to work out the complicated parameter update steps of the underlying computational graph. The analyst does not have to spend the time and effort to explicitly work out these steps. This makes the process of trying different types of architectures relatively painless for the analyst. Building a neural network with many of the off-the-shelf software packages is often compared to a child constructing a toy from building blocks that appropriately fit with one another. Each block is like a unit (or a layer of units) with a particular type of activation. Much of this ease in training neural networks is attributable to the backpropagation algorithm, which shields the analyst from explicitly working out the parameter update steps of what is actually an extremely complicated optimization problem. Working out these steps is often the most difficult part of most machine learning algorithms, and an important contribution of the neural network paradigm is to bring modular thinking into machine learning. In other words, the modularity in neural network design translates to modularity in learning its parameters; the specific name for the latter type of modularity is “backpropagation.” This makes the design of neural networks more of an (experienced) engineer’s task than a mathematical exercise.

1.3 Training a Neural Network with Backpropagation

In the single-layer neural network, the training process is relatively straightforward because the error (or loss function) can be computed as a direct function of the weights, which allows easy gradient computation. In the case of multi-layer networks, the problem is that the loss is a complicated composition function of the weights in earlier layers. The gradient of a composition function is computed using the backpropagation algorithm. The backpropagation algorithm leverages the chain rule of differential calculus, which computes the error gradients in terms of summations of local-gradient products over the various paths from a node to the output. Although this summation has an exponential number of components (paths), one can compute it efficiently using dynamic programming. The backpropagation algorithm is a direct application of dynamic programming. It contains two main phases, referred to as the forward and backward phases, respectively. The forward phase is required to compute the output values and the local derivatives at various nodes, and the backward phase is required to accumulate the products of these local values over all paths from the node to the output:

1. Forward phase: In this phase, the inputs for a training instance are fed into the neural network. This results in a forward cascade of computations across the layers, using the current set of weights. The final predicted output can be compared to that of the training instance, and the derivative of the loss function with respect to the output is computed. The derivative of this loss now needs to be computed with respect to the weights in all layers in the backward phase.
2. Backward phase: The main goal of the backward phase is to learn the gradient of the loss function with respect to the different weights by using the chain rule of differential calculus. These gradients are used to update the weights. Since these gradients are learned in the backward direction, starting from the output node, this learning process is referred to as the backward phase. Consider a sequence of hidden units
h_1, h_2, …, h_k followed by output o, with respect to which the loss function L is computed. Furthermore, assume that the weight of the connection from hidden unit h_r to h_{r+1} is w_{(h_r, h_{r+1})}. Then, in the case that a single path exists from h_1 to o, one can derive the gradient of the loss function with respect to any of these edge weights using the chain rule:

$$\frac{\partial L}{\partial w_{(h_{r-1},h_r)}} = \frac{\partial L}{\partial o}\cdot\left[\frac{\partial o}{\partial h_k}\prod_{i=r}^{k-1}\frac{\partial h_{i+1}}{\partial h_i}\right]\cdot\frac{\partial h_r}{\partial w_{(h_{r-1},h_r)}} \qquad \forall r \in 1 \ldots k \tag{1.22}$$

The aforementioned expression assumes that only a single path from h_1 to o exists in the network, whereas an exponential number of paths might exist in reality. A generalized variant of the chain rule, referred to as the multivariable chain rule, computes the gradient in a computational graph, where more than one path might exist. This is achieved by adding the composition along each of the paths from h_1 to o. An example of the chain rule in a computational graph with two paths is shown in Figure 1.13.

[Figure 1.13: Illustration of the chain rule in computational graphs. The input weight w feeds the node f(w); its value y = z = f(w) is passed to the two nodes g(y) and h(z), whose outputs p = g(y) and q = h(z) are combined by K(p, q), so that the output is o = K(p, q) = K(g(f(w)), h(f(w))), an unwieldy composition function. The figure works out
$$\frac{\partial o}{\partial w} = \frac{\partial o}{\partial p}\cdot\frac{\partial p}{\partial w} + \frac{\partial o}{\partial q}\cdot\frac{\partial q}{\partial w} \qquad \text{[Multivariable Chain Rule]}$$
$$= \frac{\partial o}{\partial p}\cdot\frac{\partial p}{\partial y}\cdot\frac{\partial y}{\partial w} + \frac{\partial o}{\partial q}\cdot\frac{\partial q}{\partial z}\cdot\frac{\partial z}{\partial w} \qquad \text{[Univariate Chain Rule]}$$
$$= \underbrace{\frac{\partial K(p,q)}{\partial p}\cdot g'(y)\cdot f'(w)}_{\text{First path}} + \underbrace{\frac{\partial K(p,q)}{\partial q}\cdot h'(z)\cdot f'(w)}_{\text{Second path}}$$
The products of node-specific partial derivatives along paths from weight w to output o are aggregated. The resulting value yields the derivative of output o with respect to weight w. Only two paths between input and output exist in this simplified example.]

Therefore, one generalizes the above expression to the case where a set 𝒫 of paths exists from h_r to o:

$$\frac{\partial L}{\partial w_{(h_{r-1},h_r)}} = \frac{\partial L}{\partial o}\cdot\underbrace{\left[\sum_{[h_r,h_{r+1},\ldots,h_k,o]\in \mathcal{P}}\frac{\partial o}{\partial h_k}\prod_{i=r}^{k-1}\frac{\partial h_{i+1}}{\partial h_i}\right]}_{\text{Backpropagation computes } \Delta(h_r,o)=\frac{\partial L}{\partial h_r}}\cdot\frac{\partial h_r}{\partial w_{(h_{r-1},h_r)}} \tag{1.23}$$
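To make the two-path example of Figure 1.13 concrete, the sketch below uses arbitrarily chosen functions f, g, h, and K (the book does not specify numerical ones), sums the local-gradient products over the two paths from w to o as in Equation 1.23, and checks the result against a finite-difference estimate of ∂o/∂w.

```python
import math

# Hypothetical concrete choices for the functions in Figure 1.13.
f = lambda w: w * w          # y = z = f(w)
g = lambda y: math.sin(y)    # p = g(y)
h = lambda z: math.cos(z)    # q = h(z)
K = lambda p, q: p * q       # o = K(p, q)

df = lambda w: 2 * w
dg = lambda y: math.cos(y)
dh = lambda z: -math.sin(z)
dK_dp = lambda p, q: q
dK_dq = lambda p, q: p

w = 0.7
y = z = f(w)
p, q = g(y), h(z)

# Sum of local-derivative products over the two paths from w to o.
path1 = dK_dp(p, q) * dg(y) * df(w)
path2 = dK_dq(p, q) * dh(z) * df(w)
analytic = path1 + path2

# Finite-difference check of do/dw.
o = lambda w: K(g(f(w)), h(f(w)))
eps = 1e-6
numeric = (o(w + eps) - o(w - eps)) / (2 * eps)

print(analytic, numeric)     # the two values agree closely
```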
The computation of ∂h_r/∂w_{(h_{r-1},h_r)} on the right-hand side is straightforward and will be discussed below (cf. Equation 1.27). However, the path-aggregated term above [annotated by Δ(h_r, o) = ∂L/∂h_r] is aggregated over an exponentially increasing number of paths (with respect to path length), which seems to be intractable at first sight. A key point is that the computational graph of a neural network does not have cycles, and it is possible to compute such an aggregation in a principled way in the backward direction by first computing Δ(h_k, o) for nodes h_k closest to o, and then recursively computing these values for nodes in earlier layers in terms of the nodes in later layers. Furthermore, the value of Δ(o, o) for each output node is initialized as follows:

$$\Delta(o,o) = \frac{\partial L}{\partial o} \tag{1.24}$$

This type of dynamic programming technique is used frequently to efficiently compute all types of path-centric functions in directed acyclic graphs, which would otherwise require an exponential number of operations. The recursion for Δ(h_r, o) can be derived using the multivariable chain rule:

$$\Delta(h_r,o) = \frac{\partial L}{\partial h_r} = \sum_{h: h_r \Rightarrow h}\frac{\partial L}{\partial h}\cdot\frac{\partial h}{\partial h_r} = \sum_{h: h_r \Rightarrow h}\frac{\partial h}{\partial h_r}\,\Delta(h,o) \tag{1.25}$$

Since each h is in a later layer than h_r, Δ(h, o) has already been computed while evaluating Δ(h_r, o). However, we still need to evaluate ∂h/∂h_r in order to compute Equation 1.25. Consider a situation in which the edge joining h_r to h has weight w_{(h_r,h)}, and let a_h be the value computed in hidden unit h just before applying the activation function Φ(·). In other words, we have h = Φ(a_h), where a_h is a linear combination of its inputs from earlier-layer units incident on h. Then, by the univariate chain rule, the following expression for ∂h/∂h_r can be derived:

$$\frac{\partial h}{\partial h_r} = \frac{\partial h}{\partial a_h}\cdot\frac{\partial a_h}{\partial h_r} = \frac{\partial \Phi(a_h)}{\partial a_h}\cdot w_{(h_r,h)} = \Phi'(a_h)\cdot w_{(h_r,h)}$$

This value of ∂h/∂h_r is used in Equation 1.25, which is repeated recursively in the backward direction, starting with the output node. The corresponding updates in the backward direction are as follows:

$$\Delta(h_r,o) = \sum_{h: h_r \Rightarrow h}\Phi'(a_h)\cdot w_{(h_r,h)}\cdot\Delta(h,o) \tag{1.26}$$

Therefore, gradients are successively accumulated in the backward direction, and each node is processed exactly once in a backward pass. Note that the computation of Equation 1.25 (which requires a number of operations proportional to the number of outgoing edges) needs to be repeated for each incoming edge into the node in order to compute the gradient with respect to all edge weights. Finally, Equation 1.23 requires the computation of ∂h_r/∂w_{(h_{r-1},h_r)}, which is easily computed as follows:

$$\frac{\partial h_r}{\partial w_{(h_{r-1},h_r)}} = h_{r-1}\cdot\Phi'(a_{h_r}) \tag{1.27}$$

Here, the key gradient that is backpropagated is the derivative with respect to the layer activations, and the gradient with respect to the weights is easy to compute for any edge incident on the corresponding unit.
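The backward recursion of Equations 1.24–1.27 can be sketched directly in code. The example below is a minimal illustration for a small fully connected network with assumed sigmoid hidden activations, an identity output unit, and a squared loss; these concrete choices (and the layer sizes and data) are illustrative assumptions, not taken from the book. The resulting weight gradient is checked against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
dsigmoid = lambda a: sigmoid(a) * (1.0 - sigmoid(a))       # Phi'(a) for sigmoid units

# Assumed toy setup: 3 inputs, two sigmoid hidden layers, one linear output,
# squared loss L = 0.5 * (o - y)^2.
sizes = [3, 4, 3, 1]
W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
x = rng.standard_normal(3)
y = 1.5

def forward(W):
    h, a_list, h_list = x, [], [x]
    for l, Wl in enumerate(W):
        a = Wl @ h
        h = a if l == len(W) - 1 else sigmoid(a)           # identity activation at output
        a_list.append(a)
        h_list.append(h)
    return a_list, h_list

a_list, h_list = forward(W)
o = h_list[-1][0]
L = 0.5 * (o - y) ** 2

# Backward phase: Delta(o, o) = dL/do (Equation 1.24), then the recursion of
# Equation 1.26: Delta(h_r) = sum_h Phi'(a_h) * w_(h_r, h) * Delta(h).
delta = [None] * len(W)                 # delta[l] holds Delta for the units of layer l+1
delta[-1] = np.array([o - y])           # output layer (identity activation)
for l in range(len(W) - 2, -1, -1):
    phi_next = np.ones_like(a_list[l + 1]) if l + 1 == len(W) - 1 else dsigmoid(a_list[l + 1])
    delta[l] = W[l + 1].T @ (phi_next * delta[l + 1])

# Weight gradients via Equations 1.23 and 1.27:
# dL/dw_(h_{r-1}, h_r) = Delta(h_r) * Phi'(a_{h_r}) * h_{r-1}.
grads = []
for l in range(len(W)):
    phi = np.ones_like(a_list[l]) if l == len(W) - 1 else dsigmoid(a_list[l])
    grads.append(np.outer(delta[l] * phi, h_list[l]))

# Finite-difference check on one arbitrary weight.
eps = 1e-6
W_pert = [Wl.copy() for Wl in W]
W_pert[0][1, 2] += eps
_, h_pert = forward(W_pert)
L_pert = 0.5 * (h_pert[-1][0] - y) ** 2
print(grads[0][1, 2], (L_pert - L) / eps)   # these two values agree closely
```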
It is noteworthy that the dynamic programming recursion of Equation 1.26 can be computed in multiple ways, depending on which variables one uses for intermediate chaining. All these recursions are equivalent in terms of the final result of backpropagation. In the following, we give an alternative version of the dynamic programming recursion, which is more commonly seen in textbooks. Note that Equation 1.23 uses the variables in the hidden layers as the “chain” variables for the dynamic programming recursion. One can also use the pre-activation values of the variables for the chain rule. The pre-activation value in a neuron is obtained after applying the linear transform (but before applying the activation function), and these pre-activation values serve as the intermediate variables. The pre-activation value of the hidden variable h = Φ(a_h) is a_h. The differences between the pre-activation and post-activation values within a neuron are shown in Figure 1.7. Therefore, instead of Equation 1.23, one can use the following chain rule:

$$\frac{\partial L}{\partial w_{(h_{r-1},h_r)}} = \underbrace{\frac{\partial L}{\partial o}\cdot\Phi'(a_o)\cdot\left[\sum_{[h_r,h_{r+1},\ldots,h_k,o]\in \mathcal{P}}\frac{\partial a_o}{\partial a_{h_k}}\prod_{i=r}^{k-1}\frac{\partial a_{h_{i+1}}}{\partial a_{h_i}}\right]}_{\text{Backpropagation computes } \delta(h_r,o)=\frac{\partial L}{\partial a_{h_r}}}\cdot\underbrace{\frac{\partial a_{h_r}}{\partial w_{(h_{r-1},h_r)}}}_{h_{r-1}} \tag{1.28}$$

Here, we have introduced the notation δ(h_r, o) = ∂L/∂a_{h_r} instead of Δ(h_r, o) = ∂L/∂h_r for setting up the recursive equation. The value of δ(o, o) = ∂L/∂a_o is initialized as follows:

$$\delta(o,o) = \frac{\partial L}{\partial a_o} = \Phi'(a_o)\cdot\frac{\partial L}{\partial o} \tag{1.29}$$

Then, one can use the multivariable chain rule to set up a similar recursion:

$$\delta(h_r,o) = \frac{\partial L}{\partial a_{h_r}} = \sum_{h: h_r \Rightarrow h}\underbrace{\frac{\partial L}{\partial a_h}}_{\delta(h,o)}\cdot\underbrace{\frac{\partial a_h}{\partial a_{h_r}}}_{\Phi'(a_{h_r})\,w_{(h_r,h)}} = \Phi'(a_{h_r})\sum_{h: h_r \Rightarrow h} w_{(h_r,h)}\cdot\delta(h,o) \tag{1.30}$$

This recursion is the one found more commonly in textbooks discussing backpropagation. The partial derivative of the loss with respect to the weight is then computed using δ(h_r, o) as follows:

$$\frac{\partial L}{\partial w_{(h_{r-1},h_r)}} = \delta(h_r,o)\cdot h_{r-1} \tag{1.31}$$

As with the single-layer network, the process of updating the weights is repeated to convergence by repeatedly cycling through the training data in epochs. A neural network may sometimes require thousands of epochs through the training data to learn the weights at the different nodes. A detailed description of the backpropagation algorithm and associated issues is provided in Chapter 3. In this chapter, we provide only a brief discussion of these issues.

1.4 Practical Issues in Neural Network Training

In spite of the formidable reputation of neural networks as universal function approximators, considerable challenges remain with respect to actually training neural networks to provide this level of performance. These challenges are primarily related to several practical problems associated with training, the most important one of which is overfitting.
1.4.1 The Problem of Overfitting

The problem of overfitting refers to the fact that fitting a model to a particular training data set does not guarantee that it will provide good prediction performance on unseen test data, even if the model predicts the targets on the training data perfectly. In other words, there is always a gap between the training and test data performance, which is particularly large when the models are complex and the data set is small. In order to understand this point, consider a simple single-layer neural network on a data set with five attributes, where we use the identity activation to learn a real-valued target variable. This architecture is almost identical to that of Figure 1.3, except that the identity activation function is used in order to predict a real-valued target. Therefore, the network tries to learn the following function:

$$\hat{y} = \sum_{i=1}^{5} w_i\cdot x_i \tag{1.32}$$

Consider a situation in which the observed target value is real and is always twice the value of the first attribute, whereas the other attributes are completely unrelated to the target. However, we have only four training instances, which is one less than the number of features (free parameters). For example, the training instances could be as follows:

x1  x2  x3  x4  x5   y
 1   1   0   0   0   2
 2   0   1   0   0   4
 3   0   0   1   0   6
 4   0   0   0   1   8

The correct parameter vector in this case is W = [2, 0, 0, 0, 0], based on the known relationship between the first feature and the target. The training data also yield zero error with this solution, although the relationship needs to be learned from the given instances since it is not given to us a priori. However, the problem is that the number of training points is fewer than the number of parameters, and it is possible to find an infinite number of solutions with zero error. For example, the parameter set [0, 2, 4, 6, 8] also provides zero error on the training data. However, if we used this solution on unseen test data, it would likely provide very poor performance because the learned parameters are spuriously inferred and are unlikely to generalize well to new points in which the target is twice the first attribute (and the other attributes are random). This type of spurious inference is caused by the paucity of training data, in which random nuances become encoded into the model. As a result, the solution does not generalize well to unseen test data. This situation is similar to learning by rote, which is highly predictive for the training data but not predictive for unseen test data. Increasing the number of training instances improves the generalization power of the model, whereas increasing the complexity of the model reduces its generalization power. At the same time, when a lot of training data is available, an overly simple model is unlikely to capture complex relationships between the features and the target. A good rule of thumb is that the total number of training data points should be at least 2 to 3 times larger than the number of parameters in the neural network, although the precise number of data instances depends on the specific model at hand. In general, models with a larger number of parameters are said to have high capacity, and they require a larger amount of data in order to gain generalization power to unseen test data.
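The sketch below reproduces this toy example numerically: both W = [2, 0, 0, 0, 0] and W = [0, 2, 4, 6, 8] fit the four training instances exactly, yet they disagree wildly on a new point generated by the same rule y = 2·x1. The test point itself is an arbitrary choice, not taken from the book.

```python
import numpy as np

# Four training instances with five attributes; the target is always 2 * x1.
X_train = np.array([[1, 1, 0, 0, 0],
                    [2, 0, 1, 0, 0],
                    [3, 0, 0, 1, 0],
                    [4, 0, 0, 0, 1]], dtype=float)
y_train = np.array([2, 4, 6, 8], dtype=float)

W_true     = np.array([2, 0, 0, 0, 0], dtype=float)   # the "correct" relationship
W_spurious = np.array([0, 2, 4, 6, 8], dtype=float)   # also fits the training data

print(X_train @ W_true - y_train)       # [0. 0. 0. 0.]  -> zero training error
print(X_train @ W_spurious - y_train)   # [0. 0. 0. 0.]  -> zero training error

# A hypothetical unseen test point following the same rule (true y = 2 * 5 = 10).
x_test = np.array([5, 0, 0, 0, 0], dtype=float)
print(x_test @ W_true)       # 10.0 -> generalizes correctly
print(x_test @ W_spurious)   # 0.0  -> the spurious solution fails on unseen data
```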
The notion of overfitting is often understood in terms of the trade-off between bias and variance in machine learning. The key take-away from the bias-variance trade-off is that one does not always win with more powerful (i.e., less biased) models when working with limited training data, because of the higher variance of these models. For example, if we change the training data in the table above to a different set of four points, we are likely to learn a completely different set of parameters (from the random nuances of those points). This new model is likely to yield a completely different prediction on the same test instance as compared to the predictions using the first training data set. This type of variation in the prediction of the same test instance using different training data sets is a manifestation of model variance, which also adds to the error of the model; after all, both predictions of the same test instance could not possibly be correct. More complex models have the drawback of seeing spurious patterns in random nuances, especially when the training data are insufficient. One must be careful to pick an optimum point when deciding the complexity of the model. These notions are described in detail in Chapter 4.

Neural networks have always been known to be theoretically powerful enough to approximate any function [208]. However, the lack of data availability can result in poor performance; this is one of the reasons that neural networks only recently achieved prominence. The greater availability of data has revealed the advantages of neural networks over traditional machine learning (cf. Figure 1.2). In general, neural networks require careful design to minimize the harmful effects of overfitting, even when a large amount of data is available. This section provides an overview of some of the design methods used to mitigate the impact of overfitting.

1.4.1.1 Regularization

Since a larger number of parameters causes overfitting, a natural approach is to constrain the model to use fewer non-zero parameters. In the previous example, if we constrain the vector W to have only one non-zero component out of five components, it will correctly obtain the solution [2, 0, 0, 0, 0]. Smaller absolute values of the parameters also tend to overfit less. Since it is hard to constrain the values of the parameters, the softer approach of adding the penalty λ||W||^p to the loss function is used. The value of p is typically set to 2, which leads to Tikhonov regularization. In general, the squared value of each parameter (multiplied with the regularization parameter λ > 0) is added to the objective function. The practical effect of this change is that a quantity proportional to λw_i is subtracted from the update of the parameter w_i. An example of a regularized version of Equation 1.6 for mini-batch S and update step-size α > 0 is as follows:

$$W \Leftarrow W(1-\alpha\lambda) + \alpha\sum_{X\in S} E(X)\,X \tag{1.33}$$

Here, E(X) represents the current error (y − ŷ) between the observed and predicted values of training instance X. One can view this type of penalization as a kind of weight decay during the updates. Regularization is particularly important when the amount of available data is limited. A neat biological interpretation of regularization is that it corresponds to gradual forgetting, as a result of which “less important” (i.e., noisy) patterns are removed. In general, it is often advisable to use more complex models with regularization rather than simpler models without regularization.
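A minimal sketch of the weight-decay update of Equation 1.33 is given below, assuming a linear unit with error E(X) = y − ŷ and arbitrarily chosen values for the step size α, regularization parameter λ, mini-batch size, and synthetic data; these concrete values are illustrative assumptions rather than prescriptions from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary toy data: 5 features, targets generated as y = 2 * x1 plus a little noise.
X = rng.standard_normal((100, 5))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(100)

W = np.zeros(5)
alpha, lam, batch_size = 0.01, 0.01, 10

for epoch in range(200):
    for start in range(0, len(X), batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        errors = yb - Xb @ W        # E(X) = (y - y_hat) for each instance in the batch
        # Regularized update of Equation 1.33:
        # W <= W * (1 - alpha * lam) + alpha * sum_{X in S} E(X) * X
        W = W * (1 - alpha * lam) + alpha * errors @ Xb

print(W)   # close to [2, 0, 0, 0, 0]; the decay term keeps the spurious coefficients small
```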
As a side note, the general form of Equation 1.33 is used by many regularized machine learning models like least-squares regression (cf. Chapter 2), where E(X) is replaced by the error-function of that specific model. Interestingly, weight decay is only sparingly used in the
single-layer perceptron3 because it can sometimes cause overly rapid forgetting, with a small number of recently misclassified training points dominating the weight vector; the main issue is that the perceptron criterion is already a degenerate loss function with a minimum value of 0 at W = 0 (unlike its hinge-loss or least-squares cousins). This quirk is a legacy of the fact that the single-layer perceptron was originally defined in terms of biologically inspired updates rather than in terms of carefully thought-out loss functions. Convergence to an optimal solution was never guaranteed other than in linearly separable cases. For the single-layer perceptron, some other regularization techniques, which are discussed below, are more commonly used.

3 Weight decay is generally used with other loss functions in single-layer models and in all multi-layer models with a large number of parameters.

1.4.1.2 Neural Architecture and Parameter Sharing

The most effective way of building a neural network is by constructing the architecture of the neural network after giving some thought to the underlying data domain. For example, the successive words in a sentence are often related to one another, whereas the nearby pixels in an image are typically related. These types of insights are used to create specialized architectures for text and image data with fewer parameters. Furthermore, many of the parameters might be shared. For example, a convolutional neural network uses the same set of parameters to learn the characteristics of a local block of the image. The recent advancements in the use of neural networks like recurrent neural networks and convolutional neural networks are examples of this phenomenon.

1.4.1.3 Early Stopping

Another common form of regularization is early stopping, in which the gradient descent is ended after only a few iterations. One way to decide the stopping point is by holding out a part of the training data, and then testing the error of the model on the held-out set. The gradient-descent approach is terminated when the error on the held-out set begins to rise. Early stopping essentially reduces the parameter space to a smaller neighborhood around the initial values of the parameters. From this point of view, early stopping acts as a regularizer because it effectively restricts the parameter space.

1.4.1.4 Trading Off Breadth for Depth

As discussed earlier, a two-layer neural network can be used as a universal function approximator [208], if a large number of hidden units are used within the hidden layer. It turns out that networks with more layers (i.e., greater depth) tend to require far fewer units per layer, because the composition functions created by successive layers make the neural network more powerful. Increased depth is a form of regularization, as the features in later layers are forced to obey a particular type of structure imposed by the earlier layers. Increased constraints reduce the capacity of the network, which is helpful when there are limitations on the amount of available data. A brief explanation of this type of behavior is given in Section 1.5. The number of units in each layer can typically be reduced to such an extent that a deep network often has far fewer parameters even when added up over the greater number of layers. This observation has led to an explosion in research on the topic of deep learning.
Even though deep networks have fewer problems with respect to overfitting, they come with a different family of problems associated with ease of training. In particular, the loss derivatives with respect to the weights in different layers of the network tend to have vastly different magnitudes, which causes challenges in properly choosing step sizes. Different manifestations of this undesirable behavior are referred to as the vanishing and exploding gradient problems. Furthermore, deep networks often take unreasonably long to converge. These issues and design choices will be discussed later in this section and at several places throughout the book.

1.4.1.5 Ensemble Methods

A variety of ensemble methods like bagging are used in order to increase the generalization power of the model. These methods are applicable not just to neural networks but to any type of machine learning algorithm. However, in recent years, a number of ensemble methods that are specifically focused on neural networks have also been proposed. Two such methods include Dropout and Dropconnect. These methods can be combined with many neural network architectures to obtain an additional accuracy improvement of about 2% in many real settings. However, the precise improvement depends on the type of data and the nature of the underlying training. For example, normalizing the activations in hidden layers can reduce the effectiveness of Dropout methods, although one can gain from the normalization itself. Ensemble methods are discussed in Chapter 4.

1.4.2 The Vanishing and Exploding Gradient Problems

While increasing depth often reduces the number of parameters of the network, it leads to different types of practical issues. Propagating backwards using the chain rule has its drawbacks in networks with a large number of layers in terms of the stability of the updates. In particular, the updates in earlier layers can either be negligibly small (vanishing gradient) or they can be increasingly large (exploding gradient) in certain types of neural network architectures. This is primarily caused by the chain-like product computation in Equation 1.23, which can either exponentially increase or decay over the length of the path. In order to understand this point, consider a situation in which we have a multi-layer network with one neuron in each layer. Each local derivative along a path can be shown to be the product of the weight and the derivative of the activation function. The overall backpropagated derivative is the product of these values. If each such value is randomly distributed with an expected value less than 1, the product of these derivatives in Equation 1.23 will drop off exponentially fast with path length. If the individual values on the path have expected values greater than 1, it will typically cause the gradient to explode. Even if the local derivatives are randomly distributed with an expected value of exactly 1, the overall derivative will typically show instability, depending on how the values are actually distributed. In other words, the vanishing and exploding gradient problems are rather natural to deep networks, which makes their training process unstable.

Many solutions have been proposed to address this issue. For example, a sigmoid activation often encourages the vanishing gradient problem, because its derivative is less than 0.25 at all values of its argument (see Exercise 7), and is extremely small at saturation; this decay is illustrated by the sketch below.
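The sketch below is a toy illustration with one unit per layer and assumed unit weights: it multiplies the local derivatives Φ′(a)·w along a path and shows how quickly the backpropagated product shrinks for the sigmoid, whose derivative never exceeds 0.25. The specific depths and random pre-activations are arbitrary choices.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
dsigmoid = lambda a: sigmoid(a) * (1.0 - sigmoid(a))   # never exceeds 0.25

rng = np.random.default_rng(0)

# One neuron per layer, all weights assumed to be 1, random pre-activations.
for depth in [5, 10, 20, 40]:
    a = rng.standard_normal(depth)       # pre-activation at each layer along the path
    local = 1.0 * dsigmoid(a)            # local derivative = weight * Phi'(a)
    print(depth, np.prod(local))         # the product decays exponentially with depth
```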
A ReLU activation unit is known to be less likely to create a vanishing gradient problem because its derivative is always 1 for positive values of the argument. More discussions on this issue are provided in Chapter 3. Aside from the use of the ReLU, a whole host of gradient-descent tricks are used to improve the convergence behavior of the problem. In particular, the use
of adaptive learning rates and conjugate gradient methods can help in many cases. Furthermore, a recent technique called batch normalization is helpful in addressing some of these issues. These techniques are discussed in Chapter 3.

1.4.3 Difficulties in Convergence

Sufficiently fast convergence of the optimization process is difficult to achieve with very deep networks, as depth leads to increased resistance to the training process in terms of letting the gradients smoothly flow through the network. This problem is somewhat related to the vanishing gradient problem, but has its own unique characteristics. Therefore, some “tricks” have been proposed in the literature for these cases, including the use of gating networks and residual networks [184]. These methods are discussed in Chapters 7 and 8, respectively.

1.4.4 Local and Spurious Optima

The optimization function of a neural network is highly nonlinear and has many local optima. When the parameter space is large, and there are many local optima, it makes sense to spend some effort in picking good initialization points. One such method for improving neural network initialization is referred to as pretraining. The basic idea is to use either supervised or unsupervised training on shallow sub-networks of the original network in order to create the initial weights. This type of pretraining is done in a greedy and layerwise fashion, in which a single layer of the network is trained at a time in order to learn the initialization points of that layer. This type of approach provides initialization points that ignore drastically irrelevant parts of the parameter space to begin with. Furthermore, unsupervised pretraining often tends to avoid problems associated with overfitting. The basic idea here is that some of the minima in the loss function are spurious optima because they are exhibited only in the training data and not in the test data. Using unsupervised pretraining tends to move the initialization point closer to the basin of “good” optima in the test data. This is an issue associated with model generalization. Methods for pretraining are discussed in Section 4.7 of Chapter 4.

Interestingly, the notion of spurious optima is often viewed through the lens of model generalization in neural networks. This is a different perspective from traditional optimization. In traditional optimization, one does not focus on the differences in the loss functions of the training and test data, but rather on the shape of the loss function over only the training data. Surprisingly, the problem of local optima (from a traditional perspective) is a smaller issue in neural networks than one might normally expect from such a nonlinear function. Most of the time, the nonlinearity causes problems during the training process itself (e.g., failure to converge), rather than getting stuck in a local minimum.

1.4.5 Computational Challenges

A significant challenge in neural network design is the running time required to train the network. It is not uncommon to require weeks to train neural networks in the text and image domains. In recent years, advances in hardware technology such as Graphics Processor Units (GPUs) have helped to a significant extent. GPUs are specialized hardware processors that can significantly speed up the kinds of operations commonly used in neural networks.
In this sense, some algorithmic frameworks like Torch are particularly convenient because they have GPU support tightly integrated into the platform.
Although algorithmic advancements have played a role in the recent excitement around deep learning, a lot of the gains have come from the fact that the same algorithms can do much more on modern hardware. Faster hardware also supports algorithmic development, because one needs to repeatedly test computationally intensive algorithms to understand what works and what does not. For example, a recent neural model such as the long short-term memory has changed only modestly [150] since it was first proposed in 1997 [204]. Yet, the potential of this model has been recognized only recently because of the advances in the computational power of modern machines and the algorithmic tweaks associated with improved experimentation.

One convenient property of the vast majority of neural network models is that most of the computational heavy lifting is front-loaded during the training phase, while the prediction phase is often computationally efficient, because it requires a small number of operations (depending on the number of layers). This is important because the prediction phase is often far more time-critical than the training phase. For example, it is far more important to classify an image in real time (with a pre-built model), although the actual building of that model might have required a few weeks over millions of images. Methods have also been designed to compress trained networks in order to enable their deployment in mobile and space-constrained settings. These issues are discussed in Chapter 3.

1.5 The Secrets to the Power of Function Composition

Even though the biological metaphor sounds like an exciting way to intuitively justify the computational power of a neural network, it does not provide a complete picture of the settings in which neural networks perform well. At its most basic level, a neural network is a computational graph that performs compositions of simpler functions to provide a more complex function. Much of the power of deep learning arises from the fact that repeated composition of multiple nonlinear functions has significant expressive power. Even though the work in [208] shows that a single composition of a large number of squashing functions can approximate almost any function, this approach requires an extremely large number of units (i.e., parameters) in the network. This increases the capacity of the network, which causes overfitting unless the data set is extremely large. Much of the power of deep learning arises from the fact that the repeated composition of certain types of functions increases the representation power of the network, and therefore reduces the parameter space required for learning. Not all base functions are equally good at achieving this goal. In fact, the nonlinear squashing functions used in neural networks are not arbitrarily chosen, but are carefully designed because of certain types of properties. For example, imagine a situation in which the identity activation function is used in each layer, so that only linear functions are computed. In such a case, the resulting neural network is no stronger than a single-layer, linear network:

Theorem 1.5.1 A multi-layer network that uses only the identity activation function in all its layers reduces to a single-layer network performing linear regression.

Proof: Consider a network containing k hidden layers, which therefore contains a total of (k + 1) computational layers (including the output layer).
The corresponding (k + 1) weight matrices between successive layers are denoted by W_1 … W_{k+1}. Let x be the d-dimensional column vector corresponding to the input, h_1 … h_k be the column vectors corresponding to the hidden layers, and o be the m-dimensional column vector corresponding to the output.
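A quick numerical check of Theorem 1.5.1: stacking several identity-activation (purely linear) layers yields exactly the same input-output map as a single collapsed weight matrix W_{k+1} ⋯ W_1. The layer sizes below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes: d = 4 inputs, two hidden layers, m = 2 outputs (arbitrary choices).
sizes = [4, 6, 5, 2]
W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]

x = rng.standard_normal(4)

# Multi-layer network with identity activations: h_l = W_l h_{l-1}.
h = x
for Wl in W:
    h = Wl @ h                # no nonlinearity applied

# Equivalent single-layer network: one collapsed matrix W_3 W_2 W_1.
W_collapsed = W[2] @ W[1] @ W[0]
print(np.allclose(h, W_collapsed @ x))   # True: the deep linear net is a single linear model
```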
  • 55. Random documents with unrelated content Scribd suggests to you:
  • 56. prend cette chose tellement fanée et empuantie de toutes les puanteurs dans sa tendre et blanche main. Arrachée par le talent du poète, dans un doux accord avec le beau, une larme lentement s'écoule et tendrement fait sa part dans l'œuvre commune: nul lecteur qui n'y laisse une tache! O pensée grandiose et puissante! O résultat merveilleux! Qu'il est béni des dieux le poète qui possède un si noble talent! Grands et Petits, Pauvres et Riches, cette crasse est l'œuvre de tous! Ah! celui qui vit encore dans l'osbcurité, qui lutte pour se hausser jusqu'au laurier, assurément sent, dans sa brûlante ardeur, un désir lui tirailler le sein. Dieu bon, implore-t-il chaque jour, Accorde-moi ce bonheur indicible: fais que mes pauvres livres de vers soient aussi gras et crasseux! Mais si les poètes aspirent aux embrassements «de la grande impudique Qui tient dans ses bras l'univers, s'ils sont tellement avides du bruit qu'ils ouvrent leur escarcelle toute grande à la popularité, cette «gloire en gros sous», il n'en est point de même des vrais amants des livres, de ceux qui ne les font pas, mais qui les achètent, les parent, les enchâssent, en délectent leurs doigts, leurs yeux, et parfois leur esprit. Ecoutez la tirade mise par un poète anglais dans la bouche d'un bibliophile qui a prêté à un infidèle ami une reliure de Trautz- Bauzonnet et qui ne l'a jamais revue:
  • 57. Une fois prêté, un livre est perdu... Prêter des livres! Parbleu, je n'y consentirai plus. Vos prêteurs faciles ne sont que des fous que je redoute. Si les gens veulent des livres, par le grand Grolier, qu'ils les achètent! Qui est-ce qui prête sa femme lorsqu'il peut se dispenser du prêt? Nos femmes seront-elles donc tenues pour plus que nos livres chères? Nous en préserve de Thou! Jamais plus de livres ne prêterai. Ne dirait-on pas que c'est pour ce bibliophile échaudé que fut faite cette imitation supérieurement réussie des inscriptions dont les écoliers sont prodigues sur leurs rudiments et Selectæ: Qui ce livre volera, Pro suis criminibus Au gibet il dansera, Pedibus penditibus. Ce châtiment n'eût pas dépassé les mérites de celui contre lequel Lebrun fit son épigramme «à un Abbé qui aimait les lettres et un peu trop mes livres»: Non, tu n'es point de ces abbés ignares, Qui n'ont jamais rien lu que le Missel: Des bons écrits tu savoures le sel, Et te connais en livres beaux et rares. Trop bien le sais! car, lorsqu'à pas de loup Tu viens chez moi feuilleter coup sur coup Mes Elzévirs, ils craignent ton approche. Dans ta mémoire il en reste beaucoup; Beaucoup aussi te restent dans la poche.
  • 58. Un amateur de livres de nuance libérale pourrait adopter pour devise cette inscription mise à l'entrée d'une bibliothèque populaire anglaise: Tolle, aperi, recita, ne lœdas, claude, rapine! ce qui, traduit librement, signifie: «Prends, ouvre, lis, n'abîme pas, referme, mais surtout mets en place!» Punch, le Charivari d'Outre-Manche, en même temps qu'il incarne pour les Anglais notre Polichinelle et le Pulcinello des Italiens, résume à merveille la question. Voici, dit-il, «la tenue des livres enseignée en une leçon:—Ne les prêtez pas.»
  • 59. VII C'est qu'ils sont précieux, non pas tant par leur valeur intrinsèque,— bien que certains d'entre eux représentent plus que leur poids d'or,— que parce qu'on les aime, d'amour complexe peut-être, mais à coup sûr d'amour vrai. «Accordez-moi, seigneur, disait un ancien (c'est Jules Janin qui rapporte ces paroles), une maison pleine de livres, un jardin plein de fleurs!—Voulez-vous, disait-il encore, un abrégé de toutes les misères humaines, regardez un malheureux qui vend ses livres: Bibliothecam vendat.» Si le malheureux vend ses livres parce qu'il y est contraint, non pas par un caprice, une toquade de spéculation, une saute de goût, passant de la bibliophilie à l'iconophilie ou à la faïençomanie ou à tout autre dada frais éclos dans sa cervelle, ou encore sous le coup d'une passionnette irrésistible dont quelques mois auront bientôt usé l'éternité, comme il advint à Asselineau qui se défit de sa bibliothèque pour suivre une femme et qui peu après se défit de la femme pour se refaire une bibliothèque, si c'est, dis-je, par misère pure, il faut qu'il soit bien marqué par le destin et qu'il ait de triples galons dans l'armée des Pas-de-Chance, car les livres aiment ceux qui les aiment et, le plus souvent leur portent bonheur. Témoin, pour n'en citer qu'un, Grotius, qui s'échappa de prison en se mettant dans un coffre à livres, lequel faisait la navette entre sa maison et sa geôle, apportant et remportant les volumes qu'il avait obtenu de faire venir de la fameuse bibliothèque formée à grands frais et avec tant de soins, pour lui «et ses amis». Richard de Bury, évêque de Durham et chancelier d'Angleterre, qui vivait au XIVe siècle, rapporte, dans son Philobiblon, des vers latins
  • 60. de John Salisbury, dont voici le sens: Nul main que le fer a touchée n'est propre à manier les livres, ni celui dont le cœur regarde l'or avec trop de joie; les mêmes hommes n'aiment pas à la fois les livres et l'argent, et ton troupeau, ô Epicure, a pour les livres du dégoût; les avares et les amis des livre ne vont guère de compagnie, et ne demeurent point, tu peux m'en croire, en paix sous le même toit. «Personne donc, en conclut un peu vite le bon Richard de Bury, ne peut servir en même temps les livres et Mammon». Il reprend ailleurs: «Ceux qui sont férus de l'amour des livres font bon marché du monde et des richesses». Les temps sont quelque peu changés; il est en notre vingtième siècle des amateurs dont on ne saurait dire s'ils estiment des livres précieux pour en faire un jour une vente profitable, ou s'ils dépensent de l'argent à accroître leur bibliothèque pour la seule satisfaction de leurs goûts de collectionneur et de lettré. Toujours est-il que le Philobiblon n'est qu'un long dithyrambe en prose, naïf et convaincu, sur les livres et les joies qu'ils procurent. J'y prends au hasard quelques phrases caractéristiques, qui, enfouies dans ce vieux livre peu connu en France, n'ont pas encore eu le temps de devenir banales parmi nous. «Les livres nous charment lorsque la prospérité nous sourit; ils nous réconfortent comme des amis inséparables lorsque la fortune orageuse fronce le sourcil sur nous.»
  • 61. Voilà une pensée qui a été exprimée bien des fois et que nous retrouverons encore; mais n'a-t-elle pas un tour original qui lui donne je ne sais quel air imprévu de nouveauté? Le chapitre XV de l'ouvrage traite des «avantages de l'amour des livres.» On y lit ceci: «Il passe le pouvoir de l'intelligence humaine, quelque largement qu'elle ait pu boire à la fontaine de Pégase, de développer pleinement le titre du présent chapitre. Quand on parlerait avec la langue des hommes et des anges, quand on serait devenu un Mercure, un Tullius ou un Cicéron, quand on aurait acquis la douceur de l'éloquence lactée de Tite-Live, on aurait encore à s'excuser de bégayer comme Moïse, ou à confesser avec Jérémie qu'on n'est qu'un enfant et qu'on ne sait point parler.» Après ce début, qui s'étonnera que Richard de Bury fasse un devoir à tous les honnêtes gens d'acheter des livres et de les aimer. «Il n'est point de prix élevé qui doive empêcher quelqu'un d'acheter des livres s'il a l'argent qu'on en demande, à moins que ce ne soit pour résister aux artifices du vendeur ou pour attendre une plus favorable occasion d'achat... Qu'on doive acheter les livres avec joie et les vendre à regret, c'est à quoi Salomon, le soleil de l'humanité, nous exhorte dans les Proverbes: «Achète la vérité, dit-il, et ne vends pas la sagesse.» On ne s'attendait guère, j'imagine, à voir Salomon dans cette affaire. Et pourtant quoi de plus naturel que d'en appeler à l'auteur de la Sagesse en une question qui intéresse tous les sages? «Une bibliothèque prudemment composée est plus précieuse que toutes les richesses, et nulle des choses qui sont désirables ne sauraient lui être comparée. Quiconque donc se pique d'être zélé pour la vérité, le bonheur, la sagesse ou la science, et même pour la foi, doit nécessairement devenir un ami des livres.» En effet, ajoute-t-il, en un élan croissant d'enthousiasme, «les livres sont des maîtres qui nous instruisent sans verges ni férules, sans
  • 62. paroles irritées, sans qu'il faille leur donner ni habits, ni argent. Si vous venez à eux, ils ne dorment point; si vous questionnez et vous enquérez auprès d'eux, ils ne se récusent point; ils ne grondent point si vous faites des fautes; ils ne se moquent point de vous si vous êtes ignorant. O livres, seuls êtres libéraux et libres, qui donnez à tous ceux qui vous demandent, et affranchissez tous ceux qui vous servent fidèlement!» C'est pourquoi «les Princes, les prélats, les juges, les docteurs, et tous les autres dirigeants de l'Etat, d'autant qu'ils ont plus que les autres besoin de sagesse, doivent plus que les autres montrer du zèle pour ces vases où la sagesse est contenue.» Tel était l'avis du grand homme d'Etat Gladstone, qui acheta plus de trente cinq mille volumes au cours de sa longue vie. «Un collectionneur de livres, disait-il, dans une lettre adressée au fameux libraire londonien Quaritch (9 septembre 1896), doit, suivant l'idée que je m'en fais, posséder les six qualités suivantes: appétit, loisir, fortune, science, discernement et persévérance.» Et plus loin: «Collectionner des livres peut avoir ses ridicules et ses excentricités. Mais, en somme, c'est un élément revivifiant dans une société criblée de tant de sources de corruption.»
  • 63. VIII Cependant les livres, jusque dans la maison du bibliophile, ont un implacable ennemi: c'est la femme. Je les entends se plaindre du traitement que la maîtresse du logis, dès qu'elle en a l'occasion, leur fait subir: «La femme, toujours jalouse de l'amour qu'on nous porte, est impossible à jamais apaiser. Si elle nous aperçoit dans quelque coin, sans autre protection que la toile d'une araignée morte, elle nous insulte et nous ravale, le sourcil froncé, la parole amère, affirmant que, de tout le mobilier de la maison, nous seuls ne sommes pas nécessaires; elle se plaint que nous ne soyons utiles à rien dans le ménage, et elle conseille de nous convertir promptement en riches coiffures, en soie, en pourpre deux fois teinte, en robes et en fourrures, en laine et en toile. A dire vrai sa haine ne serait pas sans motifs si elle pouvait voir le fond de nos cœurs, si elle avait écouté nos secrets conseils, si elle avait lu le livre de Théophraste ou celui de Valerius, si seulement elle avait écouté le XXVe chapitre de l'Ecclésiaste avec des oreilles intelligentes.» (Richard de Bury.) M. Octave Uzanne rappelle, dans les Zigs-Zags d'un Curieux, un mot du bibliophile Jacob, frappé en manière de proverbe et qui est bien en situation ici: Amours de femme et de bouquin, Ne se chantent pas au même lutrin. Et il ajoute fort à propos: «La passion bouquinière n'admet pas de partage; c'est un peu, il faut le dire, une passion de retraite, un refuge extrême à cette heure de la vie où l'homme, déséquilibré par les cahots de l'existence mondaine, s'écrie, à l'exemple de Thomas
  • 64. Moore: Je n'avais jusqu'ici pour lire que les regards des femmes, et c'est la folie qu'ils m'ont enseignée!» Cette incapacité des femmes, sauf de rares exceptions, à goûter les joies du bibliophile, a été souvent remarquée. Une d'elles—et c'est ce qui rend la citation piquante—Mme Emile de Girardin, écrivait dans la chronique qu'elle signait à la Presse du pseudonyme de Vicomte de Launay: «Voyez ce beau salon d'étude, ce boudoir charmant; admirez-le dans ses détails, vous y trouverez tout ce qui peut séduire, tout ce que vous pouvez désirer, excepté deux choses pourtant: un beau livre et un joli tableau. Il n'y a peut-être pas dix femmes à Paris chez lesquelles ces deux raretés puissent être admirées.» C'est dans le même ordre d'idées que l'américain Hawthorne, le fils de l'auteur du Faune de Marbre et de tant d'autres ouvrages où une sereine philosophie se pare des agréments de la fiction, a écrit ces lignes curieuses: «Cœlebs, grand amateur de bouquins, se rase devant son miroir, et monologue sur la femme qui, d'après son expérience, jeune ou vieille, laide ou belle, est toujours le diable.» Et Cœlebs finit en se donnant à lui-même ces conseils judicieux: «Donc, épouse tes livres! Il ne recherche point d'autre maîtresse, l'homme sage qui regarde, non la surface, mais le fond des choses. Les livres ne flirtent ni ne feignent; ne boudent ni ne taquinent; ils ne se plaignent pas, ils disent les choses, mais ils s'abstiennent de vous les demander. »Que les livres soient ton harem, et toi leur Grand Turc. De rayon en rayon, ils attendent tes faveurs, silencieux et soumis! Jamais la jalousie ne les agite. Je n'ai nulle part rencontré Vénus, et j'accorde qu'elle est belle; toujours est-il qu'elle n'est pas de beaucoup si accommodante qu'eux.»
  • 65. IX Comment n'aimerait-on pas les livres? Il en est pour tous les goûts, ainsi qu'un auteur du Chansonnier des Grâces le fait chanter à un libraire vaudevillesque (1820): Venez, lecteurs, chez un libraire De vous servir toujours jaloux; Vos besoins ainsi que vos goûts Chez moi pourront se satisfaire. J'offre la Grammaire aux auteurs, Des Vers à nos jeunes poëtes; L'Esprit des lois aux procureurs, L'Essai sur l'homme à nos coquettes... Aux plus célèbres gastronomes Je donne Racine et Boileau! La Harpe aux chanteurs de caveau, Les Nuits d'Young aux astronomes; J'ai Descartes pour les joueurs, Voiture pour toutes les belles, Lucrèce pour les amateurs, Martial pour les demoiselles. Pour le plaideur et l'adversaire J'aurai l'avocat Patelin; Le malade et le médecin Chez moi consulteront Molière: Pour un sexe trop confiant Je garde le Berger fidèle; Et pour le malheureux amant Je réserverai la Pucelle.
  • 66. Armand Gouffé était d'un autre avis lorsqu'il fredonnait: Un sot avec cent mille francs Peut se passer de livres. Mais les sots très riches ont généralement juste assez d'esprit pour retrancher et masquer leur sottise derrière l'apparat imposant d'une grande bibliothèque, où les bons livres consacrés par le temps et le jugement universel se partagent les rayons avec les ouvrages à la mode. Car si, comme le dit le proverbe allemand, «l'âne n'est pas savant parce qu'il est chargé de livres», il est des cas où l'amas des livres peut cacher un moment la nature de l'animal. C'est en pensant aux amateurs de cet acabit que Chamfort a formulé cette maxime: «L'espoir n'est souvent au cœur que ce que la bibliothèque d'un château est à la personne du maître.» Lilly, le fameux auteur d'Euphues, disait: «Aie ton cabinet plein de livres plutôt que ta bourse pleine d'argent». Le malheur est que remplir l'un a vite fait de vider l'autre, si les sources dont celle-ci s'alimente ne sont pas d'une abondance continue. L'historien Gibbon allait plus loin lorsqu'il déclarait qu'il n'échangerait pas le goût de la lecture contre tous les trésors de l'Inde. De même Macaulay, qui aurait mieux aimé être un pauvre homme avec des livres qu'un grand roi sans livres. Bien avant eux, Claudius Clément, dans son traité latin des bibliothèques, tant privées que publiques, émettait, avec des restrictions de sage morale, une idée semblable: «Il y a peu de dépenses, de profusions, je dirais même de prodigalités plus louables que celles qu'on fait pour les livres, lorsqu'en eux on cherche un refuge, la volupté de l'âme, l'honneur, la pureté des mœurs, la doctrine et un renom immortel.» «L'or, écrivait Pétrarque à son frère Gérard, l'argent, les pierres précieuses, les vêtements de pourpre, les domaines, les tableaux, les chevaux, toutes les autres choses de ce genre offrent un plaisir
  • 67. changeant et de surface: les livres nous réjouissent jusqu'aux moëlles.» C'est encore Pétrarque qui traçait ce tableau ingénieux et charmant: «J'ai des amis dont la société m'est extrêmement agréable; ils sont de tous les âges et de tous les pays. Ils se sont distingués dans les conseils et sur les champs de bataille, et ont obtenu de grands honneurs par leur connaissance des sciences. Il est facile de trouver accès près d'eux; en effet ils sont toujours à mon service, je les admets dans ma société ou les congédie quand il me plaît. Ils ne sont jamais importuns, et ils répondent aussitôt à toutes les questions que je leur pose. Les uns me racontent les événements des siècles passés, les autres me révèlent les secrets de la nature. Il en est qui m'apprennent à vivre, d'autres à mourir. Certains, par leur vivacité, chassent mes soucis et répandent en moi la gaieté: d'autres donnent du courage à mon âme, m'enseignant la science si importante de contenir ses désirs et de ne compter absolument que sur soi. Bref, ils m'ouvrent les différentes avenues de tous les arts et de toutes les sciences, et je peux, sans risque, me fier à eux en toute occasion. En retour de leurs services, ils ne me demandent que de leur fournir une chambre commode dans quelque coin de mon humble demeure, où ils puissent reposer en paix, car ces amis- là trouvent plus de charmes à la tranquillité de la retraite qu'au tumulte de la société.» Il faut comparer ce morceau au passage où notre Montaigne, après avoir parlé du commerce des hommes et de l'amour des femmes, dont il dit: «l'un est ennuyeux par sa rareté, l'aultre se flestrit par l'usage», déclare que celui des livres «est bien plus seur et plus à nous; il cède aux premiers les aultres advantages, mais il a pour sa part la constance et facilité de son service... Il me console en la vieillesse et en la solitude; il me descharge du poids d'une oysiveté ennuyeuse et me desfaict à toute heure des compagnies qui me faschent; il esmousse les poinctures de la douleur, si elle n'est du tout extrême et maistresse. Pour me distraire d'une imagination importune, il n'est que de recourir aux livres...
  • 68. «Le fruict que je tire des livres... j'en jouïs, comme les avaricieux des trésors, pour sçavoir que j'en jouïray quand il me plaira: mon âme se rassasie et contente de ce droit de possession... Il ne se peult dire combien je me repose et séjourne en ceste considération qu'ils sont à mon côté pour me donner du plaisir à mon heure, et à recognoistre combien ils portent de secours à ma vie. C'est la meilleure munition que j'aye trouvé à cest humain voyage; et plainds extrêmement les hommes d'entendement qui l'ont à dire.» Sur ce thème, les variations sont infinies et rivalisent d'éclat et d'ampleur. Le roi d'Egypte Osymandias, dont la mémoire inspira à Shelley un sonnet si beau, avait inscrit au-dessus de sa «librairie»: Pharmacie de l'âme. «Une chambre sans livres est un corps sans âme», disait Cicéron. «La poussière des bibliothèques est une poussière féconde», renchérit Werdet. «Les livres ont toujours été la passion des honnêtes gens», affirme Ménage. Sir John Herschel était sûrement de ces honnêtes gens dont parle le bel esprit érudit du XVIIe siècle, car il fait cette déclaration, que Gibbon eût signée: «Si j'avais à demander un goût qui pût me conserver ferme au milieu des circonstances les plus diverses et être pour moi une source de bonheur et de gaieté à travers la vie et un bouclier contre ses maux, quelque adverses que pussent être les circonstances et de quelques rigueurs que le monde pût m'accabler, je demanderais le goût de la lecture.» «Autant vaut tuer un homme que détruire un bon livre», s'écrie Milton; et ailleurs, en un latin superbe que je renonce à traduire:
  • 69. Et totum rapiunt me, mea vita, libri. «Pourquoi, demandait Louis XIV au maréchal de Vivonne, passez- vous autant de temps avec vos livres?—Sire, c'est pour qu'ils donnent à mon esprit le coloris, la fraîcheur et la vie que donnent à mes joues les excellentes perdrix de Votre Majesté.» Voilà une aimable réponse de commensal et de courtisan. Mais combien d'enthousiastes se sentiraient choqués de cet épicuréisme flatteur et léger! Ce n'est pas le poète anglais John Florio, qui écrivait au commencement du même siècle, dont on eût pu attendre une explication aussi souriante et dégagée. Il le prend plutôt au tragique, quand il s'écrie: «Quels pauvres souvenirs sont statues, tombes et autres monuments que les hommes érigent aux princes, et qui restent en des lieux fermés où quelques-uns à peine les voient, en comparaison des livres, qui aux yeux du monde entier montrent comment ces princes vécurent, tandis que les autres monuments montrent où ils gisent!» C'est à dessein, je le répète, que j'accumule les citations d'auteurs étrangers. Non seulement, elles ont moins de chances d'être connues, mais elles possèdent je ne sais quelle saveur d'exotisme qu'on ne peut demander à nos écrivains nationaux. Ecoutons Isaac Barrow exposer sagement la leçon de son expérience: «Celui qui aime les livres ne manque jamais d'un ami fidèle, d'un conseiller salutaire, d'un gai compagnon, d'un soutien efficace. En étudiant, en pensant, en lisant, l'on peut innocemment se distraire et agréablement se récréer dans toutes les saisons comme dans toutes les fortunes.» Jeremy Collier, pensant de même, ne s'exprime guère autrement: «Les livres sont un guide dans la jeunesse et une récréation dans le grand âge. Ils nous soutiennent dans la solitude et nous empêchent
  • 70. d'être à charge à nous-mêmes. Ils nous aident à oublier les ennuis qui nous viennent des hommes et des choses; ils calment nos soucis et nos passions; ils endorment nos déceptions. Quand nous sommes las des vivants, nous pouvons nous tourner vers les morts: ils n'ont dans leur commerce, ni maussaderie, ni orgueil, ni arrière-pensée.» Parmi les joies que donnent les livres, celle de les rechercher, de les pourchasser chez les libraires et les bouquinistes, n'est pas la moindre. On a écrit des centaines de chroniques, des études, des traités et des livres sur ce sujet spécial. La Physiologie des quais de Paris, de M. Octave Uzanne, est connue de tous ceux qui s'intéressent aux bouquins. On se rappelle moins un brillant article de Théodore de Banville, qui parut jadis dans un supplément littéraire du Figaro; aussi me saura-t-on gré d'en citer ce joli passage: «Sur le quai Voltaire, il y aurait de quoi regarder et s'amuser pendant toute une vie; mais sans tourner, comme dit Hésiode, autour du chêne et du rocher, je veux nommer tout de suite ce qui est le véritable sujet, l'attrait vertigineux, le charme invincible: c'est le Livre ou, pour parler plus exactement, le Bouquin. Il y a sur le quai de nombreuses boutiques, dont les marchands, véritables bibliophiles, collectionnent, achètent dans les ventes, et offrent aux consommateurs de beaux livres à des prix assez honnêtes. Mais ce n'est pas là ce que veut l'amateur, le fureteur, le découvreur de trésors mal connus. Ce qu'il veut, c'est trouver pour des sous, pour rien, dans les boîtes posées sur le parapet, des livres, des bouquins qui ont—ou qui auront—un grand prix, ignoré du marchand. «Et à ce sujet, un duel, qui n'a pas eu de commencement et n'aura pas de fin, recommence et se continue sans cesse entre le marchand et l'amateur. Le libraire, qui, naturellement, veut vendre cher sa marchandise, se hâte de retirer des boîtes et de porter dans la boutique tout livre soupçonné d'avoir une valeur; mais par une force
• 71. and supernatural force, the Book always contrives to return, no one knows how or by what artifices, to the boxes on the parapet. For it too has its opinions; it wants to be bought by the amateur, with pennies, and above all and before all, for love!"

It is thus that M. Jean Rameau, poet and bibliophile, relates that in this year 1901 he found in one of the boxes on the quays, at twenty-five centimes, four volumes whose elegantly flowered spines bore an escutcheon with the motto Boutez en avant. It was an abridgment of La Calprenède's Faramond, and the four volumes had belonged to the Du Barry, of whom the Boutez en avant is characteristic enough. What did the poet do, once he had made inquiries of the Baron de Claye, who never hesitates on such questions? From seven o'clock in the morning he posted himself before the stall, swallowed the fog of the Seine, soaked it in and developed "atrocious rheumatisms" from it until eleven, for the bookstall keeper, a friend of idleness, came no earlier; he took the volumes and "thrust forward a one-franc piece," saying, "You'll let me have this for fifteen sous, eh?" "Fifteen sous it is!" said the good-natured dealer. And the poet fled with his booty and also, into the bargain, "with a little thrill of glory."

Since we are on the Quai Voltaire, let us not leave it without looking at it through the eyeglass of a poet whose name, Gabriel Marc, awakens no resounding echoes, but who, since 1875, the year in which he published his Sonnets parisiens, must at times have felt the emotion, bitter and sweet, expressed in the closing line of the graceful picture he entitles En bouquinant:

The Quai Voltaire is a veritable museum
In the open sun. Everywhere, to charm the eye,
Weapons, bronzes, stained glass, prints, objets d'art,
And our stroll finds endless amusement.

With their old and almost worn-out bindings,
Here are the manuscripts saved by chance;
• 72. Then the books: Montaigne, Hugo, Chénier, Ponsard,
Or the little canvas refused at the Salon.

The clear bluish sky darkens at the horizon.
The angler has cast his hook;
And the Seine ripples under the breath of the breeze.

We browse the boxes. We see again, beneath the dust of time,
All the dear forgotten ones; and sometimes, O surprise!
The volume of verse we wrote at twenty.

Another contemporary, Mr. J. Rogers Rees, who has written a whole book on the pleasures of the book hunter (The Pleasures of a Bookworm), finds in the company of books a source of human fraternity and solidarity. "A great love of books," he says, "has in itself, at all times, the power to enlarge the heart and fill it with broader and truly educative sympathies." An American poet, Mr. C. Alex. Nelson, ends a piece to which he gives the French title Les Livres with a naive prayer, whose last two lines are also in French in the original:

The lovers of the book, all with grateful hearts, have always breathed a single prayer:
Que le bon Dieu préserve les livres et sauve la Société!

Old Chaucer did not take so lofty a tone: gently and poetically he confessed that the attraction of books was less powerful over his heart than the attraction of nature. I wish I could put into my attempt at a translation a little of the poetic charm which, like a very old but persistent
• 73. and all the sweeter perfume, rises from these lines in the original text:

As for me, although I know but little, I delight in reading in books, and I give them my faith and full belief, and in my heart I hold them in such sincere reverence that there is no pleasure that could make me leave my books, except, a few rare times, on the holy day, and except also, surely, when the month of May has come and I hear the birds sing and the flowers begin to spring up; then farewell my book and my devotion!

How, again, am I to preserve in my unrhymed and laboriously rhythmed French the light and graceful, yet so clean and precise, harmony of this delightful stanza from an old popular song that every Englishman knows by heart:

Oh! a book and, in the shade, a nook, whether indoors or out,
the green leaves whispering overhead, or the street cries all about;
where I may read all at my ease both the new and the old!
For a brave good book to browse is worth more to me than gold!

But the praise must stop somewhere. I could not conclude better, on this engaging subject, than by taking for my own account, and offering to others, these lines from a man who was, in his day, the "prince of critics" and whose very name is beginning to be forgotten. We may all, friends, lovers, devotees, or maniacs of the book, exclaim with Jules Janin:
• 74. "O my books! my savings and my loves! a festival at my fireside, a rest in the shade of the old tree, my traveling companions!... and then, when all is over for me, the witnesses of my life and of my labor!"
• 76. X

Beside those who adore books, sing them, and bless them, there are those who detest them, disparage them, and cry anathema upon them; and the latter are not the less passionate. One sees clearly the transition, the passage from one of these two feelings to the other, together with their fundamental identity, in these lines by Jean Richepin (Les Blasphèmes):

Perhaps, O Solitude, it is you who deliver us
From that burning thirst which the drunkenness of books
Cannot quench with the floods of its black wine.
I have drunk of it as if I were a funnel,
That manufactured wine, that lamentable wine;
I have drunk of it till I dropped heavily under the table,
With full mouth, full love, full brain.
But always, on waking, I felt once more
The unquenchable thirst in my throat, grown harsher.

No one will be surprised, I think, that his throat being harsher, the poet should think of refreshing it better, and to that end buy superb books which will earn him, when his definitive biography comes to be written, a chapter, curious among many others, entitled "Richepin, bibliophile."

From a colder and more contemptuous vein, but after all not a very different one, comes this quip of Baudelaire's (Œuvres posthumes): "The man of wit, he who will never agree with anyone, should apply himself to loving the conversation of imbeciles and the reading
• 77. of bad books. He will draw from them bitter enjoyments that will amply repay his weariness."

The author of the treatise De la Bibliomanie shows no such finesse. He declares point-blank that "the mad passion for books often leads to libertinage and unbelief." One would still need to know where "the mad passion" begins, for the same writer (Bollioud-Mermet) cannot help acknowledging, a little further on, that "merely agreeable books contain, like the most serious ones, useful lessons for upright hearts and sound minds."

Petrarch had already expressed a similar thought in his elegant Renaissance Latin: "Books have led some to knowledge and others to madness, when they take in more than they can digest." Libri quosdam ad scientiam, quosdam ad insaniam deduxere, dum plus hauriunt quam digerunt. This recalls a pretty remark attributed to the painter Doyen about a man more erudite than judicious: "His head is the shop of a bookseller who is moving out."

It is, in short, a question of choice. It has been repeated often since Seneca, and had surely been said more than once before him: "What matters is not to have many books, but to have good ones." That is not the point of view of the bibliomaniacs; but we are not concerned with them for the moment. As for the fastidious bibliophiles, even those whom the book delights far more in itself than for what it contains, they are quite willing to have many, but above all to have beautiful ones, coming as close as possible to
• 78. perfection; and rather than admit flawed or mediocre copies onto their shelves, they too would take the motto Pauca sed bona.

"One of the diseases of this age," says an Englishman (Barnaby Rich), "is the multitude of books, which so overburden the reader that he can no longer digest the abundance of idle matter hatched and brought into the world every day, in forms as varied as the very features of the authors' faces."

To have many of them is largesse;
To study few of them is wisdom,

declares a proverb quoted by Jules Janin. Michel Montaigne, who turned books to account as much as any man alive and who spoke of them in the enthusiastic and grateful terms quoted above, nevertheless makes reservations, though only where physical development and health are concerned. "Books," he says, "have many qualities agreeable to those who know how to choose them; but no good comes without pain; it is a pleasure that is not clean and pure, any more than the others; it has its inconveniences, and weighty ones; the soul is exercised by it, but the body remains without action, grows dejected and sorrowful."

The soul itself comes to weariness and distaste, as the English poet Crabbe observes: "Books cannot always please, however good they may be; the mind does not always crave its food." An Italian proverb brings us back, with a lively and original phrase, to the moralists' theory of good and bad reading: "No thief worse than a bad book." What thief, indeed, ever thought of stealing innocence, purity, beliefs, noble impulses? And the moralists assure us that there are books that strip the soul of all these.
• 79. "Better had he never been born," exclaims Walter Scott, "who reads to arrive at doubt, who reads to arrive at contempt of the good." A contemporary English writer, Mr. Lowell, gives an ingenious turn to the expression of a similar idea when he writes: "Cato's counsel, Cum bonis ambula, Walk with the good, is quite as true if extended to books, for they too, by insensible degrees, impart their own nature to the mind that converses with them. They either raise us up or drag us down."

The wise, who weigh the pros and the cons and, keeping to a happy medium, grant books an influence sometimes good, sometimes bad, often nil, according to their nature and the readers' turn of mind, are, I believe, the most numerous. The Hellenist Egger puts into the formulation of this judiciously balanced opinion a tone of enthusiasm from which one can guess that he forgives the book all its misdeeds for the sake of the joys and the help it knows how to give. "The greatest personage who, for perhaps 3,000 years, has kept the world talking about him, by turns giant or pygmy, proud or modest, enterprising or timid, able to assume every form and every role, capable in turn of enlightening minds or perverting them, of stirring the passions or calming them, maker of factions or conciliator of parties, a true Proteus whom no definition can grasp: it is the Book."

A little-known moralist of the eighteenth century, L.-C. d'Arc, author of a book entitled Mes Loisirs, which I have quoted elsewhere, dreads excess in reading, that "labor of the lazy," as it has rather justly been called: "Reading is the food of the mind and sometimes the tomb of genius." "He who reads much exposes himself to thinking only after others."