Neural Networks and Deep Learning
A Textbook
Charu C. Aggarwal
Charu C. Aggarwal
IBM T. J. Watson Research Center
International Business Machines
Yorktown Heights, NY, USA
ISBN 978-3-319-94462-3 ISBN 978-3-319-94463-0 (eBook)
https://doi.org/10.1007/978-3-319-94463-0
Library of Congress Control Number: 2018947636
© Springer International Publishing AG, part of Springer Nature 2018
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on
microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, com-
puter software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and
therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be
true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or
implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher
remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my wife Lata, my daughter Sayani,
and my late parents Dr. Prem Sarup and Mrs. Pushplata Aggarwal.
Preface
“Any A.I. smart enough to pass a Turing test is smart enough to know to fail
it.”—Ian McDonald
Neural networks were developed to simulate the human nervous system for machine
learning tasks by treating the computational units in a learning model in a manner similar
to human neurons. The grand vision of neural networks is to create artificial intelligence
by building machines whose architecture simulates the computations in the human ner-
vous system. This is obviously not a simple task because the computational power of the
fastest computer today is a minuscule fraction of the computational power of a human
brain. Neural networks were developed soon after the advent of computers in the fifties and
sixties. Rosenblatt’s perceptron algorithm was seen as a fundamental cornerstone of neural
networks, which caused an initial excitement about the prospects of artificial intelligence.
However, after the initial euphoria, there was a period of disappointment in which the data
hungry and computationally intensive nature of neural networks was seen as an impediment
to their usability. Eventually, at the turn of the century, greater data availability and increasing
computational power led to increased successes of neural networks, and this area
was reborn under the new label of “deep learning.” Although we are still far from the day
that artificial intelligence (AI) is close to human performance, there are specific domains
like image recognition, self-driving cars, and game playing, where AI has matched or ex-
ceeded human performance. It is also hard to predict what AI might be able to do in the
future. For example, few computer vision experts would have thought two decades ago that
any automated system could ever perform an intuitive task like categorizing an image more
accurately than a human.
Neural networks are theoretically capable of learning any mathematical function with
sufficient training data, and some variants like recurrent neural networks are known to be
Turing complete. Turing completeness refers to the fact that a neural network can simulate
any learning algorithm, given sufficient training data. The sticking point is that the amount
of data required to learn even simple tasks is often extraordinarily large, which causes a
corresponding increase in training time (if we assume that enough training data is available
in the first place). For example, the training time for image recognition, which is a simple
task for a human, can be on the order of weeks even on high-performance systems. Fur-
thermore, there are practical issues associated with the stability of neural network training,
which are being resolved even today. Nevertheless, given that the speed of computers is
expected to increase rapidly over time, and fundamentally more powerful paradigms like
quantum computing are on the horizon, the computational issue might not eventually turn
out to be quite as critical as imagined.
Although the biological analogy of neural networks is an exciting one and evokes com-
parisons with science fiction, the mathematical understanding of neural networks is a more
mundane one. The neural network abstraction can be viewed as a modular approach of
enabling learning algorithms that are based on continuous optimization on a computational
graph of dependencies between the input and output. To be fair, this is not very different
from traditional work in control theory; indeed, some of the methods used for optimization
in control theory are strikingly similar to (and historically preceded) the most fundamental
algorithms in neural networks. However, the large amounts of data available in recent years
together with increased computational power have enabled experimentation with deeper
architectures of these computational graphs than was previously possible. The resulting
success has changed the broader perception of the potential of deep learning.
The chapters of the book are organized as follows:
1. The basics of neural networks: Chapter 1 discusses the basics of neural network design.
Many traditional machine learning models can be understood as special cases of neural
learning. Understanding the relationship between traditional machine learning and
neural networks is the first step to understanding the latter. The simulation of various
machine learning models with neural networks is provided in Chapter 2. This will give
the analyst a feel of how neural networks push the envelope of traditional machine
learning algorithms.
2. Fundamentals of neural networks: Although Chapters 1 and 2 provide an overview
of the training methods for neural networks, a more detailed understanding of the
training challenges is provided in Chapters 3 and 4. Chapters 5 and 6 present radial-
basis function (RBF) networks and restricted Boltzmann machines.
3. Advanced topics in neural networks: A lot of the recent success of deep learning is a
result of the specialized architectures for various domains, such as recurrent neural
networks and convolutional neural networks. Chapters 7 and 8 discuss recurrent and
convolutional neural networks. Several advanced topics like deep reinforcement learn-
ing, neural Turing mechanisms, and generative adversarial networks are discussed in
Chapters 9 and 10.
We have taken care to include some of the “forgotten” architectures like RBF networks
and Kohonen self-organizing maps because of their potential in many applications. The
book is written for graduate students, researchers, and practitioners. Numerous exercises
are available along with a solution manual to aid in classroom teaching. Where possible, an
application-centric view is highlighted in order to give the reader a feel for the technology.
Throughout this book, a vector or a multidimensional data point is annotated with a bar,
such as $\overline{X}$ or $\overline{y}$. A vector or multidimensional point may be denoted by either small letters
or capital letters, as long as it has a bar. Vector dot products are denoted by centered dots,
such as $\overline{X} \cdot \overline{Y}$. A matrix is denoted in capital letters without a bar, such as $R$. Throughout
the book, the $n \times d$ matrix corresponding to the entire training data set is denoted by
$D$, with $n$ documents and $d$ dimensions. The individual data points in $D$ are therefore
$d$-dimensional row vectors. On the other hand, vectors with one component for each data
point are usually $n$-dimensional column vectors. An example is the $n$-dimensional column
vector $\overline{y}$ of class variables of $n$ data points. An observed value $y_i$ is distinguished from a
predicted value $\hat{y}_i$ by a circumflex at the top of the variable.
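As a small, purely illustrative example of these conventions (the numbers are arbitrary), a training set with $n = 3$ points in $d = 2$ dimensions would be written as

$$
D = \begin{pmatrix} 1 & 0\\ 2 & 1\\ 0 & 3 \end{pmatrix} \in \mathbb{R}^{3 \times 2},
\qquad
\overline{y} = \begin{pmatrix} +1\\ -1\\ +1 \end{pmatrix},
$$

so that each row of $D$ is a 2-dimensional data point, $y_2 = -1$ is an observed class variable, and $\hat{y}_2$ denotes the corresponding predicted value.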
Yorktown Heights, NY, USA Charu C. Aggarwal
Acknowledgments
I would like to thank my family for their love and support during the busy time spent
in writing this book. I would also like to thank my manager Nagui Halim for his support
during the writing of this book.
Several figures in this book have been provided by the courtesy of various individuals
and institutions. The Smithsonian Institution made the image of the Mark I perceptron
(cf. Figure 1.5) available at no cost. Saket Sathe provided the outputs in Chapter 7 for
the tiny Shakespeare data set, based on code available/described in [233, 580]. Andrew
Zisserman provided Figures 8.12 and 8.16 in the section on convolutional visualizations.
Another visualization of the feature maps in the convolution network (cf. Figure 8.15) was
provided by Matthew Zeiler. NVIDIA provided Figure 9.10 on the convolutional neural
network for self-driving cars in Chapter 9, and Sergey Levine provided the image on self-
learning robots (cf. Figure 9.9) in the same chapter. Alec Radford provided Figure 10.8,
which appears in Chapter 10. Alex Krizhevsky provided Figure 8.9(b) containing AlexNet.
This book has benefitted from significant feedback and several collaborations that I have
had with numerous colleagues over the years. I would like to thank Quoc Le, Saket Sathe,
Karthik Subbian, Jiliang Tang, and Suhang Wang for their feedback on various portions of
this book. Shuai Zheng provided feedback on the section on regularized autoencoders in
Chapter 4. I received feedback on the sections on autoencoders from Lei Cai and Hao Yuan.
Feedback on the chapter on convolutional neural networks was provided by Hongyang Gao,
Shuiwang Ji, and Zhengyang Wang. Shuiwang Ji, Lei Cai, Zhengyang Wang and Hao Yuan
also reviewed Chapters 3 and 7 and suggested several edits. They also suggested the
ideas of using Figures 8.6 and 8.7 for elucidating the convolution/deconvolution operations.
For their collaborations, I would like to thank Tarek F. Abdelzaher, Jinghui Chen, Jing
Gao, Quanquan Gu, Manish Gupta, Jiawei Han, Alexander Hinneburg, Thomas Huang,
Nan Li, Huan Liu, Ruoming Jin, Daniel Keim, Arijit Khan, Latifur Khan, Mohammad M.
Masud, Jian Pei, Magda Procopiuc, Guojun Qi, Chandan Reddy, Saket Sathe, Jaideep Sri-
vastava, Karthik Subbian, Yizhou Sun, Jiliang Tang, Min-Hsuan Tsai, Haixun Wang, Jiany-
ong Wang, Min Wang, Suhang Wang, Joel Wolf, Xifeng Yan, Mohammed Zaki, ChengXiang
Zhai, and Peixiang Zhao. I would also like to thank my advisor James B. Orlin for his guid-
ance during my early years as a researcher.
I would like to thank Lata Aggarwal for helping me with some of the figures created
using PowerPoint graphics in this book. My daughter, Sayani, was helpful in incorporating
special effects (e.g., image color, contrast, and blurring) in several JPEG images used at
various places in this book.
Contents
1 An Introduction to Neural Networks 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Humans Versus Computers: Stretching the Limits
of Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 The Basic Architecture of Neural Networks . . . . . . . . . . . . . . . . . . 4
1.2.1 Single Computational Layer: The Perceptron . . . . . . . . . . . . . 5
1.2.1.1 What Objective Function Is the Perceptron Optimizing? . 8
1.2.1.2 Relationship with Support Vector Machines . . . . . . . . . 10
1.2.1.3 Choice of Activation and Loss Functions . . . . . . . . . . 11
1.2.1.4 Choice and Number of Output Nodes . . . . . . . . . . . . 14
1.2.1.5 Choice of Loss Function . . . . . . . . . . . . . . . . . . . . 14
1.2.1.6 Some Useful Derivatives of Activation Functions . . . . . . 16
1.2.2 Multilayer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.3 The Multilayer Network as a Computational Graph . . . . . . . . . 20
1.3 Training a Neural Network with Backpropagation . . . . . . . . . . . . . . . 21
1.4 Practical Issues in Neural Network Training . . . . . . . . . . . . . . . . . . 24
1.4.1 The Problem of Overfitting . . . . . . . . . . . . . . . . . . . . . . . 25
1.4.1.1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4.1.2 Neural Architecture and Parameter Sharing . . . . . . . . . 27
1.4.1.3 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.4.1.4 Trading Off Breadth for Depth . . . . . . . . . . . . . . . . 27
1.4.1.5 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . 28
1.4.2 The Vanishing and Exploding Gradient Problems . . . . . . . . . . . 28
1.4.3 Difficulties in Convergence . . . . . . . . . . . . . . . . . . . . . . . . 29
1.4.4 Local and Spurious Optima . . . . . . . . . . . . . . . . . . . . . . . 29
1.4.5 Computational Challenges . . . . . . . . . . . . . . . . . . . . . . . . 29
1.5 The Secrets to the Power of Function Composition . . . . . . . . . . . . . . 30
1.5.1 The Importance of Nonlinear Activation . . . . . . . . . . . . . . . . 32
1.5.2 Reducing Parameter Requirements with Depth . . . . . . . . . . . . 34
1.5.3 Unconventional Neural Architectures . . . . . . . . . . . . . . . . . . 35
1.5.3.1 Blurring the Distinctions Between Input, Hidden,
and Output Layers . . . . . . . . . . . . . . . . . . . . . . . 35
1.5.3.2 Unconventional Operations and Sum-Product Networks . . 36
1.6 Common Neural Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.6.1 Simulating Basic Machine Learning with Shallow Models . . . . . . 37
1.6.2 Radial Basis Function Networks . . . . . . . . . . . . . . . . . . . . 37
1.6.3 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 38
1.6.4 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 38
1.6.5 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . 40
1.6.6 Hierarchical Feature Engineering and Pretrained Models . . . . . . . 42
1.7 Advanced Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.7.1 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.7.2 Separating Data Storage and Computations . . . . . . . . . . . . . . 45
1.7.3 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . 45
1.8 Two Notable Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
1.8.1 The MNIST Database of Handwritten Digits . . . . . . . . . . . . . 46
1.8.2 The ImageNet Database . . . . . . . . . . . . . . . . . . . . . . . . . 47
1.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.10 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
1.10.1 Video Lectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1.10.2 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2 Machine Learning with Shallow Neural Networks 53
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.2 Neural Architectures for Binary Classification Models . . . . . . . . . . . . 55
2.2.1 Revisiting the Perceptron . . . . . . . . . . . . . . . . . . . . . . . . 56
2.2.2 Least-Squares Regression . . . . . . . . . . . . . . . . . . . . . . . . 58
2.2.2.1 Widrow-Hoff Learning . . . . . . . . . . . . . . . . . . . . . 59
2.2.2.2 Closed Form Solutions . . . . . . . . . . . . . . . . . . . . . 61
2.2.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.2.3.1 Alternative Choices of Activation and Loss . . . . . . . . . 63
2.2.4 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . 63
2.3 Neural Architectures for Multiclass Models . . . . . . . . . . . . . . . . . . 65
2.3.1 Multiclass Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.3.2 Weston-Watkins SVM . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.3.3 Multinomial Logistic Regression (Softmax Classifier) . . . . . . . . . 68
2.3.4 Hierarchical Softmax for Many Classes . . . . . . . . . . . . . . . . . 69
2.4 Backpropagated Saliency for Feature Selection . . . . . . . . . . . . . . . . 70
2.5 Matrix Factorization with Autoencoders . . . . . . . . . . . . . . . . . . . . 70
2.5.1 Autoencoder: Basic Principles . . . . . . . . . . . . . . . . . . . . . . 71
2.5.1.1 Autoencoder with a Single Hidden Layer . . . . . . . . . . 72
2.5.1.2 Connections with Singular Value Decomposition . . . . . . 74
2.5.1.3 Sharing Weights in Encoder and Decoder . . . . . . . . . . 74
2.5.1.4 Other Matrix Factorization Methods . . . . . . . . . . . . . 76
2.5.2 Nonlinear Activations . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.5.3 Deep Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.5.4 Application to Outlier Detection . . . . . . . . . . . . . . . . . . . . 80
2.5.5 When the Hidden Layer Is Broader than the Input Layer . . . . . . 81
2.5.5.1 Sparse Feature Learning . . . . . . . . . . . . . . . . . . . . 81
2.5.6 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
2.5.7 Recommender Systems: Row Index to Row Value Prediction . . . . 83
2.5.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
2.6 Word2vec: An Application of Simple Neural Architectures . . . . . . . . . . 87
2.6.1 Neural Embedding with Continuous Bag of Words . . . . . . . . . . 87
2.6.2 Neural Embedding with Skip-Gram Model . . . . . . . . . . . . . . . 90
2.6.3 Word2vec (SGNS) Is Logistic Matrix Factorization . . . . . . . . . . 95
2.6.4 Vanilla Skip-Gram Is Multinomial Matrix Factorization . . . . . . . 98
2.7 Simple Neural Architectures for Graph Embeddings . . . . . . . . . . . . . 98
2.7.1 Handling Arbitrary Edge Counts . . . . . . . . . . . . . . . . . . . . 100
2.7.2 Multinomial Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
2.7.3 Connections with DeepWalk and Node2vec . . . . . . . . . . . . . . 100
2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
2.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
2.9.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
2.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3 Training Deep Neural Networks 105
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.2 Backpropagation: The Gory Details . . . . . . . . . . . . . . . . . . . . . . . 107
3.2.1 Backpropagation with the Computational Graph Abstraction . . . . 107
3.2.2 Dynamic Programming to the Rescue . . . . . . . . . . . . . . . . . 111
3.2.3 Backpropagation with Post-Activation Variables . . . . . . . . . . . 113
3.2.4 Backpropagation with Pre-activation Variables . . . . . . . . . . . . 115
3.2.5 Examples of Updates for Various Activations . . . . . . . . . . . . . 117
3.2.5.1 The Special Case of Softmax . . . . . . . . . . . . . . . . . 117
3.2.6 A Decoupled View of Vector-Centric Backpropagation . . . . . . . . 118
3.2.7 Loss Functions on Multiple Output Nodes and Hidden Nodes . . . . 121
3.2.8 Mini-Batch Stochastic Gradient Descent . . . . . . . . . . . . . . . . 121
3.2.9 Backpropagation Tricks for Handling Shared Weights . . . . . . . . 123
3.2.10 Checking the Correctness of Gradient Computation . . . . . . . . . 124
3.3 Setup and Initialization Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 125
3.3.1 Tuning Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . 125
3.3.2 Feature Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.3.3 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
3.4 The Vanishing and Exploding Gradient Problems . . . . . . . . . . . . . . . 129
3.4.1 Geometric Understanding of the Effect of Gradient Ratios . . . . . . 130
3.4.2 A Partial Fix with Activation Function Choice . . . . . . . . . . . . 133
3.4.3 Dying Neurons and “Brain Damage” . . . . . . . . . . . . . . . . . . 133
3.4.3.1 Leaky ReLU . . . . . . . . . . . . . . . . . . . . . . . . . . 133
3.4.3.2 Maxout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.5 Gradient-Descent Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.5.1 Learning Rate Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
3.5.2 Momentum-Based Learning . . . . . . . . . . . . . . . . . . . . . . . 136
3.5.2.1 Nesterov Momentum . . . . . . . . . . . . . . . . . . . . . 137
3.5.3 Parameter-Specific Learning Rates . . . . . . . . . . . . . . . . . . . 137
3.5.3.1 AdaGrad . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.5.3.2 RMSProp . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3.5.3.3 RMSProp with Nesterov Momentum . . . . . . . . . . . . . 139
3.5.3.4 AdaDelta . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.5.3.5 Adam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
3.5.4 Cliffs and Higher-Order Instability . . . . . . . . . . . . . . . . . . . 141
3.5.5 Gradient Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
3.5.6 Second-Order Derivatives . . . . . . . . . . . . . . . . . . . . . . . . 143
3.5.6.1 Conjugate Gradients and Hessian-Free Optimization . . . . 145
3.5.6.2 Quasi-Newton Methods and BFGS . . . . . . . . . . . . . . 148
3.5.6.3 Problems with Second-Order Methods: Saddle Points . . . 149
3.5.7 Polyak Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
3.5.8 Local and Spurious Minima . . . . . . . . . . . . . . . . . . . . . . . 151
3.6 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
3.7 Practical Tricks for Acceleration and Compression . . . . . . . . . . . . . . 156
3.7.1 GPU Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
3.7.2 Parallel and Distributed Implementations . . . . . . . . . . . . . . . 158
3.7.3 Algorithmic Tricks for Model Compression . . . . . . . . . . . . . . 160
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
3.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
3.9.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
3.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4 Teaching Deep Learners to Generalize 169
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
4.2 The Bias-Variance Trade-Off . . . . . . . . . . . . . . . . . . . . . . . . . . 174
4.2.1 Formal View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.3 Generalization Issues in Model Tuning and Evaluation . . . . . . . . . . . . 178
4.3.1 Evaluating with Hold-Out and Cross-Validation . . . . . . . . . . . . 179
4.3.2 Issues with Training at Scale . . . . . . . . . . . . . . . . . . . . . . 180
4.3.3 How to Detect Need to Collect More Data . . . . . . . . . . . . . . . 181
4.4 Penalty-Based Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 181
4.4.1 Connections with Noise Injection . . . . . . . . . . . . . . . . . . . . 182
4.4.2 L1-Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
4.4.3 L1- or L2-Regularization? . . . . . . . . . . . . . . . . . . . . . . . . 184
4.4.4 Penalizing Hidden Units: Learning Sparse Representations . . . . . . 185
4.5 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
4.5.1 Bagging and Subsampling . . . . . . . . . . . . . . . . . . . . . . . . 186
4.5.2 Parametric Model Selection and Averaging . . . . . . . . . . . . . . 187
4.5.3 Randomized Connection Dropping . . . . . . . . . . . . . . . . . . . 188
4.5.4 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
4.5.5 Data Perturbation Ensembles . . . . . . . . . . . . . . . . . . . . . . 191
4.6 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
4.6.1 Understanding Early Stopping from the Variance Perspective . . . . 192
4.7 Unsupervised Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
4.7.1 Variations of Unsupervised Pretraining . . . . . . . . . . . . . . . . . 197
4.7.2 What About Supervised Pretraining? . . . . . . . . . . . . . . . . . 197
4.8 Continuation and Curriculum Learning . . . . . . . . . . . . . . . . . . . . . 199
4.8.1 Continuation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 199
4.8.2 Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.9 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
4.10 Regularization in Unsupervised Applications . . . . . . . . . . . . . . . . . 201
4.10.1 Value-Based Penalization: Sparse Autoencoders . . . . . . . . . . . . 202
4.10.2 Noise Injection: De-noising Autoencoders . . . . . . . . . . . . . . . 202
4.10.3 Gradient-Based Penalization: Contractive Autoencoders . . . . . . . 204
4.10.4 Hidden Probabilistic Structure: Variational Autoencoders . . . . . . 207
4.10.4.1 Reconstruction and Generative Sampling . . . . . . . . . . 210
4.10.4.2 Conditional Variational Autoencoders . . . . . . . . . . . . 212
4.10.4.3 Relationship with Generative Adversarial Networks . . . . 213
4.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
4.12 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
4.12.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
4.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
5 Radial Basis Function Networks 217
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
5.2 Training an RBF Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
5.2.1 Training the Hidden Layer . . . . . . . . . . . . . . . . . . . . . . . . 221
5.2.2 Training the Output Layer . . . . . . . . . . . . . . . . . . . . . . . 222
5.2.2.1 Expression with Pseudo-Inverse . . . . . . . . . . . . . . . 224
5.2.3 Orthogonal Least-Squares Algorithm . . . . . . . . . . . . . . . . . . 224
5.2.4 Fully Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . 225
5.3 Variations and Special Cases of RBF Networks . . . . . . . . . . . . . . . . 226
5.3.1 Classification with Perceptron Criterion . . . . . . . . . . . . . . . . 226
5.3.2 Classification with Hinge Loss . . . . . . . . . . . . . . . . . . . . . . 227
5.3.3 Example of Linear Separability Promoted by RBF . . . . . . . . . . 227
5.3.4 Application to Interpolation . . . . . . . . . . . . . . . . . . . . . . . 228
5.4 Relationship with Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . 229
5.4.1 Kernel Regression as a Special Case of RBF Networks . . . . . . . . 229
5.4.2 Kernel SVM as a Special Case of RBF Networks . . . . . . . . . . . 230
5.4.3 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
6 Restricted Boltzmann Machines 235
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
6.1.1 Historical Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . 236
6.2 Hopfield Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
6.2.1 Optimal State Configurations of a Trained Network . . . . . . . . . 238
6.2.2 Training a Hopfield Network . . . . . . . . . . . . . . . . . . . . . . 240
6.2.3 Building a Toy Recommender and Its Limitations . . . . . . . . . . 241
6.2.4 Increasing the Expressive Power of the Hopfield Network . . . . . . 242
6.3 The Boltzmann Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.3.1 How a Boltzmann Machine Generates Data . . . . . . . . . . . . . . 244
6.3.2 Learning the Weights of a Boltzmann Machine . . . . . . . . . . . . 245
6.4 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . 247
6.4.1 Training the RBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6.4.2 Contrastive Divergence Algorithm . . . . . . . . . . . . . . . . . . . 250
6.4.3 Practical Issues and Improvisations . . . . . . . . . . . . . . . . . . . 251
6.5 Applications of Restricted Boltzmann Machines . . . . . . . . . . . . . . . . 251
6.5.1 Dimensionality Reduction and Data Reconstruction . . . . . . . . . 252
6.5.2 RBMs for Collaborative Filtering . . . . . . . . . . . . . . . . . . . . 254
6.5.3 Using RBMs for Classification . . . . . . . . . . . . . . . . . . . . . . 257
6.5.4 Topic Models with RBMs . . . . . . . . . . . . . . . . . . . . . . . . 260
6.5.5 RBMs for Machine Learning with Multimodal Data . . . . . . . . . 262
6.6 Using RBMs Beyond Binary Data Types . . . . . . . . . . . . . . . . . . . . 263
6.7 Stacking Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 264
6.7.1 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 266
6.7.2 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
6.7.3 Deep Boltzmann Machines and Deep Belief Networks . . . . . . . . 267
6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
6.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
6.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
7 Recurrent Neural Networks 271
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.1.1 Expressiveness of Recurrent Networks . . . . . . . . . . . . . . . . . 274
7.2 The Architecture of Recurrent Neural Networks . . . . . . . . . . . . . . . . 274
7.2.1 Language Modeling Example of RNN . . . . . . . . . . . . . . . . . 277
7.2.1.1 Generating a Language Sample . . . . . . . . . . . . . . . . 278
7.2.2 Backpropagation Through Time . . . . . . . . . . . . . . . . . . . . 280
7.2.3 Bidirectional Recurrent Networks . . . . . . . . . . . . . . . . . . . . 283
7.2.4 Multilayer Recurrent Networks . . . . . . . . . . . . . . . . . . . . . 284
7.3 The Challenges of Training Recurrent Networks . . . . . . . . . . . . . . . . 286
7.3.1 Layer Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
7.4 Echo-State Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7.5 Long Short-Term Memory (LSTM) . . . . . . . . . . . . . . . . . . . . . . . 292
7.6 Gated Recurrent Units (GRUs) . . . . . . . . . . . . . . . . . . . . . . . . . 295
7.7 Applications of Recurrent Neural Networks . . . . . . . . . . . . . . . . . . 297
7.7.1 Application to Automatic Image Captioning . . . . . . . . . . . . . . 298
7.7.2 Sequence-to-Sequence Learning and Machine Translation . . . . . . 299
7.7.2.1 Question-Answering Systems . . . . . . . . . . . . . . . . . 301
7.7.3 Application to Sentence-Level Classification . . . . . . . . . . . . . . 303
7.7.4 Token-Level Classification with Linguistic Features . . . . . . . . . . 304
7.7.5 Time-Series Forecasting and Prediction . . . . . . . . . . . . . . . . 305
7.7.6 Temporal Recommender Systems . . . . . . . . . . . . . . . . . . . . 307
7.7.7 Secondary Protein Structure Prediction . . . . . . . . . . . . . . . . 309
7.7.8 End-to-End Speech Recognition . . . . . . . . . . . . . . . . . . . . . 309
7.7.9 Handwriting Recognition . . . . . . . . . . . . . . . . . . . . . . . . 309
7.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.9.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
7.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
8 Convolutional Neural Networks 315
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
8.1.1 Historical Perspective and Biological Inspiration . . . . . . . . . . . 316
8.1.2 Broader Observations About Convolutional Neural Networks . . . . 317
8.2 The Basic Structure of a Convolutional Network . . . . . . . . . . . . . . . 318
8.2.1 Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
8.2.2 Strides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.2.3 Typical Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
8.2.4 The ReLU Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
8.2.5 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
8.2.6 Fully Connected Layers . . . . . . . . . . . . . . . . . . . . . . . . . 327
8.2.7 The Interleaving Between Layers . . . . . . . . . . . . . . . . . . . . 328
8.2.8 Local Response Normalization . . . . . . . . . . . . . . . . . . . . . 330
8.2.9 Hierarchical Feature Engineering . . . . . . . . . . . . . . . . . . . . 331
8.3 Training a Convolutional Network . . . . . . . . . . . . . . . . . . . . . . . 332
8.3.1 Backpropagating Through Convolutions . . . . . . . . . . . . . . . . 333
8.3.2 Backpropagation as Convolution with Inverted/Transposed Filter . . 334
8.3.3 Convolution/Backpropagation as Matrix Multiplications . . . . . . . 335
8.3.4 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
8.4 Case Studies of Convolutional Architectures . . . . . . . . . . . . . . . . . . 338
8.4.1 AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
8.4.2 ZFNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
8.4.3 VGG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
8.4.4 GoogLeNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
8.4.5 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
8.4.6 The Effects of Depth . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
8.4.7 Pretrained Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
8.5 Visualization and Unsupervised Learning . . . . . . . . . . . . . . . . . . . 352
8.5.1 Visualizing the Features of a Trained Network . . . . . . . . . . . . 353
8.5.2 Convolutional Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 357
8.6 Applications of Convolutional Networks . . . . . . . . . . . . . . . . . . . . 363
8.6.1 Content-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . 363
8.6.2 Object Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
8.6.3 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
8.6.4 Natural Language and Sequence Learning . . . . . . . . . . . . . . . 366
8.6.5 Video Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
8.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
8.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
8.8.1 Software Resources and Data Sets . . . . . . . . . . . . . . . . . . . 370
8.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
9 Deep Reinforcement Learning 373
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
9.2 Stateless Algorithms: Multi-Armed Bandits . . . . . . . . . . . . . . . . . . 375
9.2.1 Naïve Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
9.2.2 ε-Greedy Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 376
9.2.3 Upper Bounding Methods . . . . . . . . . . . . . . . . . . . . . . . . 376
9.3 The Basic Framework of Reinforcement Learning . . . . . . . . . . . . . . . 377
9.3.1 Challenges of Reinforcement Learning . . . . . . . . . . . . . . . . . 379
9.3.2 Simple Reinforcement Learning for Tic-Tac-Toe . . . . . . . . . . . . 380
9.3.3 Role of Deep Learning and a Straw-Man Algorithm . . . . . . . . . 380
9.4 Bootstrapping for Value Function Learning . . . . . . . . . . . . . . . . . . 383
9.4.1 Deep Learning Models as Function Approximators . . . . . . . . . . 384
9.4.2 Example: Neural Network for Atari Setting . . . . . . . . . . . . . . 386
9.4.3 On-Policy Versus Off-Policy Methods: SARSA . . . . . . . . . . . . 387
9.4.4 Modeling States Versus State-Action Pairs . . . . . . . . . . . . . . . 389
9.5 Policy Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
9.5.1 Finite Difference Methods . . . . . . . . . . . . . . . . . . . . . . . . 392
9.5.2 Likelihood Ratio Methods . . . . . . . . . . . . . . . . . . . . . . . . 393
9.5.3 Combining Supervised Learning with Policy Gradients . . . . . . . . 395
9.5.4 Actor-Critic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 395
9.5.5 Continuous Action Spaces . . . . . . . . . . . . . . . . . . . . . . . . 397
9.5.6 Advantages and Disadvantages of Policy Gradients . . . . . . . . . . 397
9.6 Monte Carlo Tree Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
9.7 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
9.7.1 AlphaGo: Championship Level Play at Go . . . . . . . . . . . . . . . 399
9.7.1.1 Alpha Zero: Enhancements to Zero Human Knowledge . . 402
9.7.2 Self-Learning Robots . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
9.7.2.1 Deep Learning of Locomotion Skills . . . . . . . . . . . . . 404
9.7.2.2 Deep Learning of Visuomotor Skills . . . . . . . . . . . . . 406
9.7.3 Building Conversational Systems: Deep Learning for Chatbots . . . 407
9.7.4 Self-Driving Cars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
9.7.5 Inferring Neural Architectures with Reinforcement Learning . . . . . 412
9.8 Practical Challenges Associated with Safety . . . . . . . . . . . . . . . . . . 413
9.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
9.10 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
9.10.1 Software Resources and Testbeds . . . . . . . . . . . . . . . . . . . . 416
9.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
10 Advanced Topics in Deep Learning 419
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
10.2 Attention Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
10.2.1 Recurrent Models of Visual Attention . . . . . . . . . . . . . . . . . 422
10.2.1.1 Application to Image Captioning . . . . . . . . . . . . . . . 424
10.2.2 Attention Mechanisms for Machine Translation . . . . . . . . . . . . 425
10.3 Neural Networks with External Memory . . . . . . . . . . . . . . . . . . . . 429
10.3.1 A Fantasy Video Game: Sorting by Example . . . . . . . . . . . . . 430
10.3.1.1 Implementing Swaps with Memory Operations . . . . . . . 431
10.3.2 Neural Turing Machines . . . . . . . . . . . . . . . . . . . . . . . . . 432
10.3.3 Differentiable Neural Computer: A Brief Overview . . . . . . . . . . 437
10.4 Generative Adversarial Networks (GANs) . . . . . . . . . . . . . . . . . . . 438
10.4.1 Training a Generative Adversarial Network . . . . . . . . . . . . . . 439
10.4.2 Comparison with Variational Autoencoder . . . . . . . . . . . . . . . 442
10.4.3 Using GANs for Generating Image Data . . . . . . . . . . . . . . . . 442
10.4.4 Conditional Generative Adversarial Networks . . . . . . . . . . . . . 444
10.5 Competitive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
10.5.1 Vector Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
10.5.2 Kohonen Self-Organizing Map . . . . . . . . . . . . . . . . . . . . . 450
10.6 Limitations of Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 453
10.6.1 An Aspirational Goal: One-Shot Learning . . . . . . . . . . . . . . . 453
10.6.2 An Aspirational Goal: Energy-Efficient Learning . . . . . . . . . . . 455
10.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
10.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
10.8.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
10.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
Bibliography 459
Index 493
Author Biography
Charu C. Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM
T. J. Watson Research Center in Yorktown Heights, New York. He completed his under-
graduate degree in Computer Science from the Indian Institute of Technology at Kan-
pur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996.
He has worked extensively in the field of data mining. He has pub-
lished more than 350 papers in refereed conferences and journals
and authored over 80 patents. He is the author or editor of 18
books, including textbooks on data mining, recommender systems,
and outlier analysis. Because of the commercial value of his patents,
he has thrice been designated a Master Inventor at IBM. He is a
recipient of an IBM Corporate Award (2003) for his work on bio-
terrorist threat detection in data streams, a recipient of the IBM
Outstanding Innovation Award (2008) for his scientific contribu-
tions to privacy technology, and a recipient of two IBM Outstanding
Technical Achievement Awards (2009, 2015) for his work on data streams/high-dimensional
data. He received the EDBT 2014 Test of Time Award for his work on condensation-based
privacy-preserving data mining. He is also a recipient of the IEEE ICDM Research Con-
tributions Award (2015), which is one of the two highest awards for influential research
contributions in the field of data mining.
He has served as the general co-chair of the IEEE Big Data Conference (2014) and as
the program co-chair of the ACM CIKM Conference (2015), the IEEE ICDM Conference
(2015), and the ACM KDD Conference (2016). He served as an associate editor of the IEEE
Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate
editor of the IEEE Transactions on Big Data, an action editor of the Data Mining and
Knowledge Discovery Journal, and an associate editor of the Knowledge and Information
Systems Journal. He serves as the editor-in-chief of the ACM Transactions on Knowledge
Discovery from Data as well as the ACM SIGKDD Explorations. He serves on the advisory
board of the Lecture Notes on Social Networks, a publication by Springer. He has served as
the vice-president of the SIAM Activity Group on Data Mining and is a member of the SIAM
industry committee. He is a fellow of the SIAM, ACM, and the IEEE, for “contributions to
knowledge discovery and data mining algorithms.”
Chapter 1
An Introduction to Neural Networks
“Thou shalt not make a machine to counterfeit a human mind.”—Frank Herbert
1.1 Introduction
Artificial neural networks are popular machine learning techniques that simulate the mech-
anism of learning in biological organisms. The human nervous system contains cells, which
are referred to as neurons. The neurons are connected to one another with the use of ax-
ons and dendrites, and the connecting regions between axons and dendrites are referred to
as synapses. These connections are illustrated in Figure 1.1(a). The strengths of synaptic
connections often change in response to external stimuli. This change is how learning takes
place in living organisms.
This biological mechanism is simulated in artificial neural networks, which contain com-
putation units that are referred to as neurons. Throughout this book, we will use the term
“neural networks” to refer to artificial neural networks rather than biological ones. The
computational units are connected to one another through weights, which serve the same
[Figure: panels (a) Biological neural network and (b) Artificial neural network; labels in the drawing mark the neuron, its axon, and dendrites with synaptic weights w1–w5.]
Figure 1.1: The synaptic connections between neurons. The image in (a) is from “The Brain:
Understanding Neurobiology Through the Study of Addiction” [598]. Copyright © 2000 by
BSCS & Videodiscovery. All rights reserved. Used with permission.
role as the strengths of synaptic connections in biological organisms. Each input to a neuron
is scaled with a weight, which affects the function computed at that unit. This architecture
is illustrated in Figure 1.1(b). An artificial neural network computes a function of the inputs
by propagating the computed values from the input neurons to the output neuron(s) and
using the weights as intermediate parameters. Learning occurs by changing the weights con-
necting the neurons. Just as external stimuli are needed for learning in biological organisms,
the external stimulus in artificial neural networks is provided by the training data contain-
ing examples of input-output pairs of the function to be learned. For example, the training
data might contain pixel representations of images (input) and their annotated labels (e.g.,
carrot, banana) as the output. These training data pairs are fed into the neural network by
using the input representations to make predictions about the output labels. The training
data provides feedback to the correctness of the weights in the neural network depending
on how well the predicted output (e.g., probability of carrot) for a particular input matches
the annotated output label in the training data. One can view the errors made by the neural
network in the computation of a function as a kind of unpleasant feedback in a biological
organism, leading to an adjustment in the synaptic strengths. Similarly, the weights between
neurons are adjusted in a neural network in response to prediction errors. The goal of chang-
ing the weights is to modify the computed function to make the predictions more correct in
future iterations. Therefore, the weights are changed carefully in a mathematically justified
way so as to reduce the error in computation on that example. By successively adjusting
the weights between neurons over many input-output pairs, the function computed by the
neural network is refined over time so that it provides more accurate predictions. Therefore,
if the neural network is trained with many different images of bananas, it will eventually
be able to properly recognize a banana in an image it has not seen before. This ability to
accurately compute functions of unseen inputs by training over a finite set of input-output
pairs is referred to as model generalization. The primary usefulness of all machine learning
models is gained from their ability to generalize their learning from seen training data to
unseen examples.
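As a concrete, if highly simplified, illustration of this error-driven adjustment of weights, the following toy sketch implements a single artificial neuron that computes a weighted combination of its inputs and corrects its weights whenever its prediction disagrees with the annotated label. The sketch is purely illustrative (the function names and data are invented here, and the update shown is a simple perceptron-style correction of the kind introduced in Section 1.2), not an excerpt from any particular implementation.

```python
import numpy as np

def train_single_neuron(X, y, epochs=100, lr=0.1):
    """Toy error-driven training of one neuron with a sign activation.

    X is an (n, d) array of inputs and y an (n,) array of +1/-1 labels.
    The weights play the role of synaptic strengths: they are nudged
    only when the predicted label disagrees with the annotated label.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            prediction = np.sign(np.dot(w, xi))   # forward computation
            if prediction != yi:                  # prediction error ("unpleasant feedback")
                w += lr * yi * xi                 # adjust the synaptic weights
    return w

# Tiny example: the labels are the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.5, -1.0], [-1.0, 0.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = train_single_neuron(X, y)
print(np.sign(X @ w))   # reproduces y on this toy data
```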
The biological comparison is often criticized as a very poor caricature of the workings
of the human brain; nevertheless, the principles of neuroscience have often been useful in
designing neural network architectures. A different view is that neural networks are built
as higher-level abstractions of the classical models that are commonly used in machine
learning. In fact, the most basic units of computation in the neural network are inspired by
traditional machine learning algorithms like least-squares regression and logistic regression.
Neural networks gain their power by putting together many such basic units, and learning
the weights of the different units jointly in order to minimize the prediction error. From
this point of view, a neural network can be viewed as a computational graph of elementary
units in which greater power is gained by connecting them in particular ways. When a
neural network is used in its most basic form, without hooking together multiple units, the
learning algorithms often reduce to classical machine learning models (see Chapter 2). The
real power of a neural model over classical methods is unleashed when these elementary
computational units are combined, and the weights of the elementary models are trained
using their dependencies on one another. By combining multiple units, one is increasing the
power of the model to learn more complicated functions of the data than are inherent in the
elementary models of basic machine learning. The way in which these units are combined
also plays a role in the power of the architecture, and requires some understanding and
insight from the analyst. Furthermore, sufficient training data is also required in order to
learn the larger number of weights in these expanded computational graphs.
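To make the “computational graph of elementary units” view more tangible, the short sketch below wires three logistic-regression-like units into a miniature two-layer network and evaluates the composed function on an input. Only the forward computation is shown; the joint training of all weights is deferred to Section 1.3 and Chapter 3, and every name and number in the sketch is chosen here purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, w, b):
    """One elementary unit: the same computation as logistic regression."""
    return sigmoid(np.dot(w, x) + b)

def tiny_network(x, W1, b1, w2, b2):
    """Compose elementary units: two hidden units feed a single output unit."""
    hidden = np.array([logistic_unit(x, W1[i], b1[i]) for i in range(len(b1))])
    return logistic_unit(hidden, w2, b2)

# Illustrative parameters; in practice all of them are learned jointly from data.
W1 = np.array([[2.0, -1.0], [-1.5, 1.0]])   # weights of the two hidden units
b1 = np.array([0.0, 0.5])                   # biases of the hidden units
w2 = np.array([1.0, -2.0])                  # weights of the output unit
b2 = 0.25                                   # bias of the output unit

x = np.array([0.8, -0.3])
print(tiny_network(x, W1, b1, w2, b2))      # composed prediction in (0, 1)
```

The power described above comes from learning W1, b1, w2, and b2 together, so that the hidden units can discover intermediate features that make the output unit's prediction more accurate.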
[Figure: plot of accuracy against amount of data, with one curve for deep learning and one for conventional machine learning.]
Figure 1.2: An illustrative comparison of the accuracy of a typical machine learning al-
gorithm with that of a large neural network. Deep learners become more attractive than
conventional methods primarily when sufficient data/computational power is available. Re-
cent years have seen an increase in data availability and computational power, which has
led to a “Cambrian explosion” in deep learning technology.
1.1.1 Humans Versus Computers: Stretching the Limits
of Artificial Intelligence
Humans and computers are inherently suited to different types of tasks. For example, com-
puting the cube root of a large number is very easy for a computer, but it is extremely
difficult for humans. On the other hand, a task such as recognizing the objects in an image
is a simple matter for a human, but has traditionally been very difficult for an automated
learning algorithm. It is only in recent years that deep learning has shown an accuracy on
some of these tasks that exceeds that of a human. In fact, the recent results by deep learning
algorithms that surpass human performance [184] in (some narrow tasks on) image recog-
nition would not have been considered likely by most computer vision experts as recently
as 10 years ago.
Many deep learning architectures that have shown such extraordinary performance are
not created by indiscriminately connecting computational units. The superior performance
of deep neural networks mirrors the fact that biological neural networks gain much of their
power from depth as well. Furthermore, biological networks are connected in ways we do not
fully understand. In the few cases that the biological structure is understood at some level,
significant breakthroughs have been achieved by designing artificial neural networks along
those lines. A classical example of this type of architecture is the use of the convolutional
neural network for image recognition. This architecture was inspired by Hubel and Wiesel’s
experiments [212] in 1959 on the organization of the neurons in the cat’s visual cortex. The
precursor to the convolutional neural network was the neocognitron [127], which was directly
based on these results.
The human neuronal connection structure has evolved over millions of years to optimize
survival-driven performance; survival is closely related to our ability to merge sensation and
intuition in a way that is currently not possible with machines. Biological neuroscience [232]
is a field that is still very much in its infancy, and only a limited amount is known about how
the brain truly works. Therefore, it is fair to suggest that the biologically inspired success
of convolutional neural networks might be replicated in other settings, as we learn more
about how the human brain works [176]. A key advantage of neural networks over tradi-
tional machine learning is that the former provides a higher-level abstraction of expressing
semantic insights about data domains by architectural design choices in the computational
graph. The second advantage is that neural networks provide a simple way to adjust the
complexity of a model by adding or removing neurons from the architecture according to
the availability of training data or computational power. A large part of the recent suc-
cess of neural networks is explained by the fact that the increased data availability and
computational power of modern computers has outgrown the limits of traditional machine
learning algorithms, which fail to take full advantage of what is now possible. This situation
is illustrated in Figure 1.2. The performance of traditional machine learning remains better
at times for smaller data sets because of the wider choice of available models, greater ease of model interpretation,
and the tendency to hand-craft interpretable features that incorporate domain-specific in-
sights. With limited data, the best of a very wide diversity of models in machine learning
will usually perform better than a single class of models (like neural networks). This is one
reason why the potential of neural networks was not realized in the early years.
The “big data” era has been enabled by the advances in data collection technology; vir-
tually everything we do today, including purchasing an item, using the phone, or clicking on
a site, is collected and stored somewhere. Furthermore, the development of powerful Graph-
ics Processor Units (GPUs) has enabled increasingly efficient processing on such large data
sets. These advances largely explain the recent success of deep learning using algorithms
that are only slightly adjusted from the versions that were available two decades back.
Furthermore, these recent adjustments to the algorithms have been enabled by increased
speed of computation, because reduced run-times enable efficient testing (and subsequent
algorithmic adjustment). If it requires a month to test an algorithm, at most twelve varia-
tions can be tested in a year on a single hardware platform. This situation has historically
constrained the intensive experimentation required for tweaking neural-network learning
algorithms. The rapid advances associated with the three pillars of improved data, compu-
tation, and experimentation have resulted in an increasingly optimistic outlook about the
future of deep learning. By the end of this century, it is expected that computers will have
the power to train neural networks with as many neurons as the human brain. Although
it is hard to predict what the true capabilities of artificial intelligence will be by then, our
experience with computer vision should prepare us to expect the unexpected.
Chapter Organization
This chapter is organized as follows. The next section introduces single-layer and multi-layer
networks. The different types of activation functions, output nodes, and loss functions are
discussed. The backpropagation algorithm is introduced in Section 1.3. Practical issues in
neural network training are discussed in Section 1.4. Some key points on how neural networks
gain their power with specific choices of activation functions are discussed in Section 1.5. The
common architectures used in neural network design are discussed in Section 1.6. Advanced
topics in deep learning are discussed in Section 1.7. Some notable benchmarks used by the
deep learning community are discussed in Section 1.8. A summary is provided in Section 1.9.
1.2 The Basic Architecture of Neural Networks
In this section, we will introduce single-layer and multi-layer neural networks. In the single-
layer network, a set of inputs is directly mapped to an output by using a generalized variation
of a linear function. This simple instantiation of a neural network is also referred to as the
perceptron. In multi-layer neural networks, the neurons are arranged in layered fashion, in
which the input and output layers are separated by a group of hidden layers. This layer-wise
architecture of the neural network is also referred to as a feed-forward network. This section
will discuss both single-layer and multi-layer networks.
Figure 1.3: The basic architecture of the perceptron: (a) perceptron without bias; (b) perceptron with bias. The inputs x1 . . . x5 are transmitted with weights w1 . . . w5 to an output node that computes a summation and produces the prediction y; in (b), a bias neuron that always transmits +1 contributes the bias b.
1.2.1 Single Computational Layer: The Perceptron
The simplest neural network is referred to as the perceptron. This neural network contains
a single input layer and an output node. The basic architecture of the perceptron is shown
in Figure 1.3(a). Consider a situation where each training instance is of the form (X, y),
where each X = [x1, . . . xd] contains d feature variables, and y ∈ {−1, +1} contains the
observed value of the binary class variable. By “observed value” we refer to the fact that it
is given to us as a part of the training data, and our goal is to predict the class variable for
cases in which it is not observed. For example, in a credit-card fraud detection application,
the features might represent various properties of a set of credit card transactions (e.g.,
amount and frequency of transactions), and the class variable might represent whether or
not this set of transactions is fraudulent. Clearly, in this type of application, one would have
historical cases in which the class variable is observed, and other (current) cases in which
the class variable has not yet been observed but needs to be predicted.
The input layer contains d nodes that transmit the d features X = [x1 . . . xd] with
edges of weight W = [w1 . . . wd] to an output node. The input layer does not perform
any computation in its own right. The linear function W · X = Σ_{i=1}^d wi xi is computed at the output node. Subsequently, the sign of this real value is used in order to predict the dependent variable of X. Therefore, the prediction ŷ is computed as follows:

ŷ = sign{W · X} = sign{ Σ_{j=1}^d wj xj }   (1.1)
The sign function maps a real value to either +1 or −1, which is appropriate for binary
classification. Note the circumflex on top of the variable y to indicate that it is a predicted
value rather than an observed value. The error of the prediction is therefore E(X) = y − ŷ,
which is one of the values drawn from the set {−2, 0, +2}. In cases where the error value
E(X) is nonzero, the weights in the neural network need to be updated in the (negative)
direction of the error gradient. As we will see later, this process is similar to that used in
various types of linear models in machine learning. In spite of the similarity of the perceptron
with respect to traditional machine learning models, its interpretation as a computational
unit is very useful because it allows us to put together multiple units in order to create far
more powerful models than are available in traditional machine learning.
The architecture of the perceptron is shown in Figure 1.3(a), in which a single input layer
transmits the features to the output node. The edges from the input to the output contain
the weights w1 . . . wd with which the features are multiplied and added at the output node.
Subsequently, the sign function is applied in order to convert the aggregated value into a
class label. The sign function serves the role of an activation function. Different choices
of activation functions can be used to simulate different types of models used in machine
learning, like least-squares regression with numeric targets, the support vector machine,
or a logistic regression classifier. Most of the basic machine learning models can be easily
represented as simple neural network architectures. It is a useful exercise to model traditional
machine learning techniques as neural architectures, because it provides a clearer picture of
how deep learning generalizes traditional machine learning. This point of view is explored
in detail in Chapter 2. It is noteworthy that the perceptron contains two layers, although
the input layer does not perform any computation and only transmits the feature values.
The input layer is not included in the count of the number of layers in a neural network.
Since the perceptron contains a single computational layer, it is considered a single-layer
network.
In many settings, there is an invariant part of the prediction, which is referred to as
the bias. For example, consider a setting in which the feature variables are mean centered,
but the mean of the binary class prediction from {−1, +1} is not 0. This will tend to occur
in situations in which the binary class distribution is highly imbalanced. In such a case,
the aforementioned approach is not sufficient for prediction. We need to incorporate an
additional bias variable b that captures this invariant part of the prediction:
ŷ = sign{W · X + b} = sign{ Σ_{j=1}^d wj xj + b }   (1.2)
The bias can be incorporated as the weight of an edge by using a bias neuron. This is
achieved by adding a neuron that always transmits a value of 1 to the output node. The
weight of the edge connecting the bias neuron to the output node provides the bias variable.
An example of a bias neuron is shown in Figure 1.3(b). Another approach that works well
with single-layer architectures is to use a feature engineering trick in which an additional
feature is created with a constant value of 1. The coefficient of this feature provides the bias,
and one can then work with Equation 1.1. Throughout this book, biases will not be explicitly
used (for simplicity in architectural representations) because they can be incorporated with
bias neurons. The details of the training algorithms remain the same by simply treating the
bias neurons like any other neuron with a fixed activation value of 1. Therefore, the following discussion works with the predictive assumption of Equation 1.1, which does not explicitly use biases.
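The constant-feature trick described above is easy to make concrete. The following minimal sketch (the data values and names are illustrative, not from the text) appends a column of 1s to a data matrix so that the last entry of the weight vector plays the role of the bias b, allowing Equation 1.1 to be used unchanged:

import numpy as np

# Two hypothetical training points with d = 3 features each
X = np.array([[0.5, -1.2, 3.0],
              [1.1,  0.4, -0.7]])
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append a constant feature of 1
W_aug = np.zeros(X_aug.shape[1])                  # last entry of W_aug acts as the bias b
y_hat = np.sign(X_aug @ W_aug)                    # prediction of Equation 1.1 with an implicit bias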
At the time that the perceptron algorithm was proposed by Rosenblatt [405], these op-
timizations were performed in a heuristic way with actual hardware circuits, and it was not
presented in terms of a formal notion of optimization in machine learning (as is common
today). However, the goal was always to minimize the error in prediction, even if a for-
mal optimization formulation was not presented. The perceptron algorithm was, therefore,
heuristically designed to minimize the number of misclassifications, and convergence proofs
were available that provided correctness guarantees of the learning algorithm in simplified
settings. Therefore, we can still write the (heuristically motivated) goal of the perceptron
algorithm in least-squares form with respect to all training instances in a data set D containing feature-label pairs:

Minimize_W  L = Σ_{(X,y)∈D} (y − ŷ)² = Σ_{(X,y)∈D} ( y − sign{W · X} )²
This type of minimization objective function is also referred to as a loss function. As we
will see later, almost all neural network learning algorithms are formulated with the use
of a loss function. As we will learn in Chapter 2, this loss function looks a lot like least-
squares regression. However, the latter is defined for continuous-valued target variables,
and the corresponding loss is a smooth and continuous function of the variables. On the
other hand, for the least-squares form of the objective function, the sign function is non-
differentiable, with step-like jumps at specific points. Furthermore, the sign function takes
on constant values over large portions of the domain, and therefore the exact gradient takes
on zero values at differentiable points. This results in a staircase-like loss surface, which
is not suitable for gradient-descent. The perceptron algorithm (implicitly) uses a smooth
approximation of the gradient of this objective function with respect to each example:
∇Lsmooth = Σ_{(X,y)∈D} (y − ŷ) X   (1.3)
Note that the above gradient is not a true gradient of the staircase-like surface of the (heuris-
tic) objective function, which does not provide useful gradients. Therefore, the staircase is
smoothed out into a sloping surface defined by the perceptron criterion. The properties of the
perceptron criterion will be described in Section 1.2.1.1. It is noteworthy that concepts like
the “perceptron criterion” were proposed later than the original paper by Rosenblatt [405]
in order to explain the heuristic gradient-descent steps. For now, we will assume that the
perceptron algorithm optimizes some unknown smooth function with the use of gradient
descent.
Although the above objective function is defined over the entire training data, the train-
ing algorithm of neural networks works by feeding each input data instance X into the
network one by one (or in small batches) to create the prediction ŷ. The weights are then
updated, based on the error value E(X) = (y − ŷ). Specifically, when the data point X is
fed into the network, the weight vector W is updated as follows:
W ⇐ W + α(y − ŷ)X (1.4)
The parameter α regulates the learning rate of the neural network. The perceptron algorithm
repeatedly cycles through all the training examples in random order and iteratively adjusts
the weights until convergence is reached. A single training data point may be cycled through
many times. Each such cycle is referred to as an epoch. One can also write the gradient-
descent update in terms of the error E(X) = (y − ŷ) as follows:
W ⇐ W + αE(X)X (1.5)
The basic perceptron algorithm can be considered a stochastic gradient-descent method,
which implicitly minimizes the squared error of prediction by performing gradient-descent
updates with respect to randomly chosen training points. The assumption is that the neural
network cycles through the points in random order during training and changes the weights
with the goal of reducing the prediction error on that point. It is easy to see from Equa-
tion 1.5 that non-zero updates are made to the weights only when y ≠ ŷ, which occurs only
when errors are made in prediction. In mini-batch stochastic gradient descent, the aforemen-
tioned updates of Equation 1.5 are implemented over a randomly chosen subset of training
points S:
W ⇐ W + α Σ_{X∈S} E(X) X   (1.6)
Figure 1.4: Examples of linearly separable and inseparable data in two classes (a linear separator W · X = 0 is shown for the separable case)
The advantages of using mini-batch stochastic gradient descent are discussed in Section 3.2.8
of Chapter 3. An interesting quirk of the perceptron is that it is possible to set the learning
rate α to 1, because the learning rate only scales the weights.
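To make the training procedure concrete, the following minimal sketch (variable names are illustrative) implements the updates of Equation 1.5, cycling through the training points in random order for a fixed number of epochs; as noted above, the learning rate is set to 1 because it only scales the weights:

import numpy as np

def train_perceptron(X, y, alpha=1.0, num_epochs=100):
    # X: one training point per row; y: labels drawn from {-1, +1}
    W = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in np.random.permutation(len(y)):
            error = y[i] - np.sign(np.dot(W, X[i]))  # E(X) = y - y_hat, drawn from {-2, 0, +2}
            W = W + alpha * error * X[i]             # Equation 1.5; non-zero only on mistakes
    return W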
The type of model proposed in the perceptron is a linear model, in which the equation
W ·X = 0 defines a linear hyperplane. Here, W = (w1 . . . wd) is a d-dimensional vector that
is normal to the hyperplane. Furthermore, the value of W · X is positive for values of X on
one side of the hyperplane, and it is negative for values of X on the other side. This type of
model performs particularly well when the data is linearly separable. Examples of linearly
separable and inseparable data are shown in Figure 1.4.
The perceptron algorithm is good at classifying data sets like the one shown on the
left-hand side of Figure 1.4, when the data is linearly separable. On the other hand, it tends
to perform poorly on data sets like the one shown on the right-hand side of Figure 1.4. This
example shows the inherent modeling limitation of a perceptron, which necessitates the use
of more complex neural architectures.
Since the original perceptron algorithm was proposed as a heuristic minimization of
classification errors, it was particularly important to show that the algorithm converges
to reasonable solutions in some special cases. In this context, it was shown [405] that the
perceptron algorithm always converges to provide zero error on the training data when
the data are linearly separable. However, the perceptron algorithm is not guaranteed to
converge in instances where the data are not linearly separable. For reasons discussed in
the next section, the perceptron might sometimes arrive at a very poor solution with data
that are not linearly separable (in comparison with many other learning algorithms).
1.2.1.1 What Objective Function Is the Perceptron Optimizing?
As discussed earlier in this chapter, the original perceptron paper by Rosenblatt [405] did
not formally propose a loss function. In those years, these implementations were achieved
using actual hardware circuits. The original Mark I perceptron was intended to be a machine
rather than an algorithm, and custom-built hardware was used to create it (cf. Figure 1.5).
The general goal was to minimize the number of classification errors with a heuristic update
process (in hardware) that changed weights in the “correct” direction whenever errors were
made. This heuristic update strongly resembled gradient descent but it was not derived
as a gradient-descent method. Gradient descent is defined only for smooth loss functions
in algorithmic settings, whereas the hardware-centric approach was designed in a more heuristic way with binary outputs.

Figure 1.5: The perceptron algorithm was originally implemented using hardware circuits. The image depicts the Mark I perceptron machine built in 1958. (Courtesy: Smithsonian Institute)

Many of the binary and circuit-centric principles were
inherited from the McCulloch-Pitts model [321] of the neuron. Unfortunately, binary signals
are not amenable to continuous optimization.
Can we find a smooth loss function, whose gradient turns out to be the perceptron
update? The number of classification errors in a binary classification problem can be written
in the form of a 0/1 loss function for training data point (Xi, yi) as follows:
L_i^(0/1) = (1/2) (yi − sign{W · Xi})² = 1 − yi · sign{W · Xi}   (1.7)
The simplification to the right-hand side of the above objective function is obtained by setting both yi² and sign{W · Xi}² to 1, since they are obtained by squaring a value drawn from {−1, +1}. However, this objective function is not differentiable, because it has a staircase-like shape, especially when it is added over multiple points. Note that the 0/1 loss above is dominated by the term −yi · sign{W · Xi}, in which the sign function causes most of
the problems associated with non-differentiability. Since neural networks are defined by
gradient-based optimization, we need to define a smooth objective function that is respon-
sible for the perceptron updates. It can be shown [41] that the updates of the perceptron
implicitly optimize the perceptron criterion. This objective function is defined by dropping
the sign function in the above 0/1 loss and setting negative values to 0 in order to treat all
correct predictions in a uniform and lossless way:
Li = max{−yi(W · Xi), 0} (1.8)
The reader is encouraged to use calculus to verify that the gradient of this smoothed objec-
tive function leads to the perceptron update, and the update of the perceptron is essentially
W ⇐ W − α∇W Li. The modified loss function to enable gradient computation of a non-
differentiable function is also referred to as a smoothed surrogate loss function. Almost all
continuous optimization-based learning methods (such as neural networks) with discrete
outputs (such as class labels) use some type of smoothed surrogate loss function.
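As a concrete check of the claim above, the following sketch (names are illustrative) computes the perceptron criterion of Equation 1.8 for a single point and its gradient with respect to W; the gradient is −yi · Xi when the point is misclassified and zero otherwise, which is exactly the direction used by the perceptron update:

import numpy as np

def perceptron_criterion(W, X_i, y_i):
    return max(-y_i * np.dot(W, X_i), 0.0)          # Equation 1.8

def perceptron_criterion_gradient(W, X_i, y_i):
    if y_i * np.dot(W, X_i) < 0:                    # loss is active (misclassified point)
        return -y_i * X_i
    return np.zeros_like(W)                         # correct predictions contribute no gradient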
Figure 1.6: Perceptron criterion versus hinge loss (loss plotted against the value of W · X for a positive-class instance)
Although the aforementioned perceptron criterion was reverse engineered by working
backwards from the perceptron updates, the nature of this loss function exposes some of
the weaknesses of the updates in the original algorithm. An interesting observation about the
perceptron criterion is that one can set W to the zero vector irrespective of the training data
set in order to obtain the optimal loss value of 0. In spite of this fact, the perceptron updates
continue to converge to a clear separator between the two classes in linearly separable cases;
after all, a separator between the two classes provides a loss value of 0 as well. However,
the behavior for data that are not linearly separable is rather arbitrary, and the resulting
solution is sometimes not even a good approximate separator of the classes. The direct
sensitivity of the loss to the magnitude of the weight vector can dilute the goal of class
separation; it is possible for updates to worsen the number of misclassifications significantly
while improving the loss. This is an example of how surrogate loss functions might sometimes
not fully achieve their intended goals. Because of this fact, the approach is not stable and
can yield solutions of widely varying quality.
Several variations of the learning algorithm were therefore proposed for inseparable data,
and a natural approach is to always keep track of the best solution in terms of the number of
misclassifications [128]. This approach of always keeping the best solution in one’s “pocket”
is referred to as the pocket algorithm. Another well-performing variant incorporates the notion of margin in the loss function, which yields an algorithm identical to the linear
support vector machine. For this reason, the linear support vector machine is also referred
to as the perceptron of optimal stability.
1.2.1.2 Relationship with Support Vector Machines
The perceptron criterion is a shifted version of the hinge-loss used in support vector ma-
chines (see Chapter 2). The hinge loss looks even more similar to the zero-one loss criterion
of Equation 1.7, and is defined as follows:
L_i^svm = max{1 − yi(W · Xi), 0}   (1.9)
Note that the perceptron does not keep the constant term of 1 on the right-hand side of
Equation 1.7, whereas the hinge loss keeps this constant within the maximization function.
This change does not affect the algebraic expression for the gradient, but it does change
which points are lossless and should not cause an update. The relationship between the
perceptron criterion and the hinge loss is shown in Figure 1.6. This similarity becomes
particularly evident when the perceptron updates of Equation 1.6 are rewritten as follows:
W ⇐ W + α Σ_{(X,y)∈S⁺} y X   (1.10)
Here, S⁺ is defined as the set of all misclassified training points X ∈ S that satisfy the condition y(W · X) < 0. This update seems to look somewhat different from the perceptron, because the perceptron uses the error E(X) for the update, which is replaced with y in the update above. A key point is that the (integer) error value E(X) = (y − sign{W · X}) ∈ {−2, +2} can never be 0 for misclassified points in S⁺. Therefore, we have E(X) = 2y for misclassified points, and E(X) can be replaced with y in the updates after absorbing the factor of 2 within the learning rate. This update is identical to that used by the primal support vector machine (SVM) algorithm [448], except that the updates are performed only for the misclassified points in the perceptron, whereas the SVM also uses the marginally correct points near the decision boundary for updates. Note that the SVM uses the condition y(W · X) < 1 [instead of using the condition y(W · X) < 0] to define S⁺, which is one of
the key differences between the two algorithms. This point shows that the perceptron is
fundamentally not very different from well-known machine learning algorithms like the
support vector machine in spite of its different origins. Freund and Schapire provide a
beautiful exposition of the role of margin in improving stability of the perceptron and also
its relationship with the support vector machine [123]. It turns out that many traditional
machine learning models can be viewed as minor variations of shallow neural architectures
like the perceptron. The relationships between classical machine learning models and shallow
neural networks are described in detail in Chapter 2.
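The closeness of the two algorithms can also be seen in code. The sketch below (illustrative, with regularization omitted) implements the mini-batch update of Equation 1.10; changing the margin used to define S⁺ from 0 to 1 turns the perceptron update into the update associated with the hinge loss:

import numpy as np

def minibatch_update(W, X_batch, y_batch, alpha=0.1, margin=0.0):
    # S+ contains the points with y (W . X) < margin:
    # margin = 0 gives the perceptron condition, margin = 1 gives the hinge-loss (SVM) condition
    active = y_batch * (X_batch @ W) < margin
    return W + alpha * (y_batch[active] @ X_batch[active])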
1.2.1.3 Choice of Activation and Loss Functions
The choice of activation function is a critical part of neural network design. In the case of the
perceptron, the choice of the sign activation function is motivated by the fact that a binary
class label needs to be predicted. However, it is possible to have other types of situations
where different target variables may be predicted. For example, if the target variable to be
predicted is real, then it makes sense to use the identity activation function, and the resulting
algorithm is the same as least-squares regression. If it is desirable to predict a probability
of a binary class, it makes sense to use a sigmoid function for activating the output node, so
that the prediction ŷ indicates the probability that the observed value, y, of the dependent
variable is 1. The negative logarithm of |y/2 − 0.5 + ŷ| is used as the loss, assuming that y is coded from {−1, +1}. If ŷ is the probability that y is 1, then |y/2 − 0.5 + ŷ| is the probability that the correct value is predicted. This assertion is easy to verify by examining the two cases where y is −1 or +1. This loss function can be shown to be representative of the negative
log-likelihood of the training data (see Section 2.2.3 of Chapter 2).
The importance of nonlinear activation functions becomes significant when one moves
from the single-layered perceptron to the multi-layered architectures discussed later in this
chapter. Different types of nonlinear functions such as the sign, sigmoid, or hyperbolic tan-
gents may be used in various layers. We use the notation Φ to denote the activation function:
ŷ = Φ(W · X) (1.11)
Therefore, a neuron really computes two functions within the node, which is why we have
incorporated the summation symbol Σ as well as the activation symbol Φ within a neuron.
The break-up of the neuron computations into two separate values is shown in Figure 1.7.
Figure 1.7: Pre-activation and post-activation values within a neuron. The pre-activation value is ah = W · X, and the post-activation value is h = Φ(ah).
The value computed before applying the activation function Φ(·) will be referred to as the
pre-activation value, whereas the value computed after applying the activation function is
referred to as the post-activation value. The output of a neuron is always the post-activation
value, although the pre-activation variables are often used in different types of analyses, such
as the computations of the backpropagation algorithm discussed later in this chapter. The
pre-activation and post-activation values of a neuron are shown in Figure 1.7.
The most basic activation function Φ(·) is the identity or linear activation, which provides
no nonlinearity:
Φ(v) = v
The linear activation function is often used in the output node, when the target is a real
value. It is even used for discrete outputs when a smoothed surrogate loss function needs
to be set up.
The classical activation functions that were used early in the development of neural
networks were the sign, sigmoid, and the hyperbolic tangent functions:
Φ(v) = sign(v) (sign function)
Φ(v) = 1/(1 + e^{−v})   (sigmoid function)
Φ(v) = (e^{2v} − 1)/(e^{2v} + 1)   (tanh function)
While the sign activation can be used to map to binary outputs at prediction time, its
non-differentiability prevents its use for creating the loss function at training time. For
example, while the perceptron uses the sign function for prediction, the perceptron crite-
rion in training only requires linear activation. The sigmoid activation outputs a value in
(0, 1), which is helpful in performing computations that should be interpreted as probabil-
ities. Furthermore, it is also helpful in creating probabilistic outputs and constructing loss
functions derived from maximum-likelihood models. The tanh function has a shape simi-
lar to that of the sigmoid function, except that it is horizontally re-scaled and vertically
translated/re-scaled to [−1, 1]. The tanh and sigmoid functions are related as follows (see
Exercise 3):
tanh(v) = 2 · sigmoid(2v) − 1
The tanh function is preferable to the sigmoid when the outputs of the computations are de-
sired to be both positive and negative. Furthermore, its mean-centering and larger gradient
(because of stretching) with respect to the sigmoid makes it easier to train.

Figure 1.8: Various activation functions

The sigmoid and the
tanh functions have been the historical tools of choice for incorporating nonlinearity in the
neural network. In recent years, however, a number of piecewise linear activation functions
have become more popular:
Φ(v) = max{v, 0} (Rectified Linear Unit [ReLU])
Φ(v) = max {min [v, 1] , −1} (hard tanh)
The ReLU and hard tanh activation functions have largely replaced the sigmoid and soft
tanh activation functions in modern neural networks because of the ease in training multi-
layered neural networks with these activation functions.
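For reference, these activation functions can be written in a few lines of NumPy; the sketch below is vectorized over an array of pre-activation values v:

import numpy as np

def sign_activation(v): return np.sign(v)
def sigmoid(v):         return 1.0 / (1.0 + np.exp(-v))
def tanh_activation(v): return (np.exp(2 * v) - 1) / (np.exp(2 * v) + 1)   # equivalent to np.tanh(v)
def relu(v):            return np.maximum(v, 0)
def hard_tanh(v):       return np.maximum(np.minimum(v, 1), -1)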
Pictorial representations of all the aforementioned activation functions are illustrated
in Figure 1.8. It is noteworthy that all activation functions shown here are monotonic.
Furthermore, other than the identity activation function, most¹ of the other activation functions saturate at large absolute values of the argument, where increasing the argument further does not change the activation much.
As we will see later, such nonlinear activation functions are also very useful in multilayer
networks, because they help in creating more powerful compositions of different types of
functions. Many of these functions are referred to as squashing functions, as they map the
outputs from an arbitrary range to bounded outputs. The use of a nonlinear activation plays
a fundamental role in increasing the modeling power of a network. If a network used only
linear activations, it would not provide better modeling power than a single-layer linear
network. This issue is discussed in Section 1.5.
¹The ReLU shows asymmetric saturation.
Figure 1.9: An example of multiple outputs for categorical classification with the use of a softmax layer. The hidden-layer outputs v1, v2, and v3 are converted into the class probabilities P(y=blue), P(y=green), and P(y=red) by the softmax layer.
1.2.1.4 Choice and Number of Output Nodes
The choice and number of output nodes is also tied to the activation function, which in
turn depends on the application at hand. For example, if k-way classification is intended,
k output values can be used, with a softmax activation function with respect to outputs
v = [v1, . . . , vk] at the nodes in a given layer. Specifically, the activation function for the ith
output is defined as follows:
Φ(v)i = exp(vi) / Σ_{j=1}^k exp(vj)   ∀i ∈ {1, . . . , k}   (1.12)
It is helpful to think of these k values as the values output by k nodes, in which the in-
puts are v1 . . . vk. An example of the softmax function with three outputs is illustrated in
Figure 1.9, and the values v1, v2, and v3 are also shown in the same figure. Note that the
three outputs correspond to the probabilities of the three classes, and they convert the three
outputs of the final hidden layer into probabilities with the softmax function. The final hid-
den layer often uses linear (identity) activations, when it is input into the softmax layer.
Furthermore, there are no weights associated with the softmax layer, since it is only con-
verting real-valued outputs into probabilities. The use of softmax with a single hidden layer
of linear activations exactly implements a model, which is referred to as multinomial logistic
regression [6]. Similarly, many variations like multi-class SVMs can be easily implemented
with neural networks. Another example of a case in which multiple output nodes are used is
the autoencoder, in which each input data point is fully reconstructed by the output layer.
The autoencoder can be used to implement matrix factorization methods like singular value
decomposition. This architecture will be discussed in detail in Chapter 2. The simplest neu-
ral networks that simulate basic machine learning algorithms are instructive because they
lie on the continuum between traditional machine learning and deep networks. By exploring
these architectures, one gets a better idea of the relationship between traditional machine
learning and neural networks, and also the advantages provided by the latter.
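A minimal sketch of the softmax activation of Equation 1.12 is shown below. Subtracting the maximum value before exponentiating is an implementation detail not discussed in the text; it leaves the probabilities unchanged but avoids numerical overflow:

import numpy as np

def softmax(v):
    exp_v = np.exp(v - np.max(v))     # shift for numerical stability
    return exp_v / np.sum(exp_v)      # Equation 1.12: probabilities over the k classes

probabilities = softmax(np.array([2.0, 1.0, -1.0]))   # sums to 1 over k = 3 classes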
1.2.1.5 Choice of Loss Function
The choice of the loss function is critical in defining the outputs in a way that is sensitive
to the application at hand. For example, least-squares regression with numeric outputs
requires a simple squared loss of the form (y − ŷ)² for a single training instance with target
y and prediction ŷ. One can also use other types of loss like hinge loss for y ∈ {−1, +1} and
real-valued prediction ŷ (with identity activation):
L = max{0, 1 − y · ŷ} (1.13)
The hinge loss can be used to implement a learning method, which is referred to as a support
vector machine.
For multiway predictions (like predicting word identifiers or one of multiple classes),
the softmax output is particularly useful. However, a softmax output is probabilistic, and
therefore it requires a different type of loss function. In fact, for probabilistic predictions,
two different types of loss functions are used, depending on whether the prediction is binary
or whether it is multiway:
1. Binary targets (logistic regression): In this case, it is assumed that the observed
value y is drawn from {−1, +1}, and the prediction ŷ is an arbitrary numerical value obtained using the identity activation function. In such a case, the loss function for a single
instance with observed value y and real-valued prediction ŷ (with identity activation)
is defined as follows:
L = log(1 + exp(−y · ŷ)) (1.14)
This type of loss function implements a fundamental machine learning method, re-
ferred to as logistic regression. Alternatively, one can use a sigmoid activation function
to output ŷ ∈ (0, 1), which indicates the probability that the observed value y is 1.
Then, the negative logarithm of |y/2 − 0.5 + ŷ| provides the loss, assuming that y is
coded from {−1, 1}. This is because |y/2 − 0.5 + ŷ| indicates the probability that the
prediction is correct. This observation illustrates that one can use various combina-
tions of activation and loss functions to achieve the same result.
2. Categorical targets: In this case, if ŷ1 . . . ŷk are the probabilities of the k classes
(using the softmax activation of Equation 1.12), and the rth class is the ground-truth
class, then the loss function for a single instance is defined as follows:
L = −log(ŷr) (1.15)
This type of loss function implements multinomial logistic regression, and it is re-
ferred to as the cross-entropy loss. Note that binary logistic regression is identical to
multinomial logistic regression, when the value of k is set to 2 in the latter.
The key point to remember is that the nature of the output nodes, the activation function,
and the loss function depend on the application at hand. Furthermore, these choices also
depend on one another. Even though the perceptron is often presented as the quintessential
representative of single-layer networks, it is only a single representative out of a very large
universe of possibilities. In practice, one rarely uses the perceptron criterion as the loss
function. For discrete-valued outputs, it is common to use softmax activation with cross-
entropy loss. For real-valued outputs, it is common to use linear activation with squared
loss. Generally, cross-entropy loss is easier to optimize than squared loss.
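The two probabilistic losses above are equally short to write down in code. The sketch below (illustrative names) implements the logistic loss of Equation 1.14 for binary targets in {−1, +1} and the cross-entropy loss of Equation 1.15 for categorical targets:

import numpy as np

def logistic_loss(y, y_hat):
    # y in {-1, +1}; y_hat is a real-valued prediction with identity activation
    return np.log(1.0 + np.exp(-y * y_hat))          # Equation 1.14

def cross_entropy_loss(class_probabilities, r):
    # class_probabilities: softmax outputs; r: index of the ground-truth class
    return -np.log(class_probabilities[r])           # Equation 1.15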
Figure 1.10: The derivatives of various activation functions
1.2.1.6 Some Useful Derivatives of Activation Functions
Most neural network learning is primarily related to gradient-descent with activation func-
tions. For this reason, the derivatives of these activation functions are used repeatedly in
this book, and gathering them in a single place for future reference is useful. This section
provides details on the derivatives of these activation functions. Later chapters will extensively
refer to these results.
1. Linear and sign activations: The derivative of the linear activation function is 1 at
all places. The derivative of sign(v) is 0 at all values of v other than at v = 0,
where it is discontinuous and non-differentiable. Because of the zero gradient and
non-differentiability of this activation function, it is rarely used in the loss function
even when it is used for prediction at testing time. The derivatives of the linear and
sign activations are illustrated in Figure 1.10(a) and (b), respectively.
2. Sigmoid activation: The derivative of sigmoid activation is particularly simple, when
it is expressed in terms of the output of the sigmoid, rather than the input. Let o be
the output of the sigmoid function with argument v:
o = 1/(1 + exp(−v))   (1.16)

Then, one can write the derivative of the activation as follows:

∂o/∂v = exp(−v)/(1 + exp(−v))²   (1.17)
The key point is that this derivative can be written more conveniently in terms of the output:

∂o/∂v = o(1 − o)   (1.18)
The derivative of the sigmoid is often used as a function of the output rather than the
input. The derivative of the sigmoid activation function is illustrated in Figure 1.10(c).
3. Tanh activation: As in the case of the sigmoid activation, the tanh activation is often
used as a function of the output o rather than the input v:
o = (exp(2v) − 1)/(exp(2v) + 1)   (1.19)

One can then compute the gradient as follows:

∂o/∂v = 4 · exp(2v)/(exp(2v) + 1)²   (1.20)

One can also write this derivative in terms of the output o:

∂o/∂v = 1 − o²   (1.21)
The derivative of the tanh activation is illustrated in Figure 1.10(d).
4. ReLU and hard tanh activations: The ReLU takes on a partial derivative value of 1
for non-negative values of its argument, and 0, otherwise. The hard tanh function
takes on a partial derivative value of 1 for values of the argument in [−1, +1] and 0,
otherwise. The derivatives of the ReLU and hard tanh activations are illustrated in
Figure 1.10(e) and (f), respectively.
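The derivatives listed above translate directly into code. The sketch below expresses the sigmoid and tanh derivatives in terms of the post-activation output o (Equations 1.18 and 1.21), and the ReLU and hard tanh derivatives in terms of the pre-activation value v:

import numpy as np

def sigmoid_derivative(o):   return o * (1 - o)                          # Equation 1.18
def tanh_derivative(o):      return 1 - o ** 2                           # Equation 1.21
def relu_derivative(v):      return (v >= 0).astype(float)               # 1 for non-negative arguments
def hard_tanh_derivative(v): return ((v >= -1) & (v <= 1)).astype(float) # 1 inside [-1, +1]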
1.2.2 Multilayer Neural Networks
Multilayer neural networks contain more than one computational layer. The perceptron
contains an input and output layer, of which the output layer is the only computation-
performing layer. The input layer transmits the data to the output layer, and all com-
putations are completely visible to the user. Multilayer neural networks contain multiple
computational layers; the additional intermediate layers (between input and output) are
referred to as hidden layers because the computations performed are not visible to the user.
The specific architecture of multilayer neural networks is referred to as feed-forward net-
works, because successive layers feed into one another in the forward direction from input
to output. The default architecture of feed-forward networks assumes that all nodes in one
layer are connected to those of the next layer. Therefore, the architecture of the neural
network is almost fully defined, once the number of layers and the number/type of nodes in
each layer have been defined. The only remaining detail is the loss function that is optimized
in the output layer. Although the perceptron algorithm uses the perceptron criterion, this
is not the only choice. It is extremely common to use softmax outputs with cross-entropy
loss for discrete prediction and linear outputs with squared loss for real-valued prediction.
As in the case of single-layer networks, bias neurons can be used both in the hidden
layers and in the output layers. Examples of multilayer networks with or without the bias
neurons are shown in Figure 1.11(a) and (b), respectively. In each case, the neural network
contains three layers. Note that the input layer is often not counted, because it simply
transmits the data and no computation is performed in that layer. If a neural network
contains p1 . . . pk units in each of its k layers, then the (column) vector representations of
these outputs, denoted by h1 . . . hk have dimensionalities p1 . . . pk. Therefore, the number
of units in each layer is referred to as the dimensionality of that layer.
Figure 1.11: The basic architecture of a feed-forward network with two hidden layers and a single output layer: (a) without bias neurons; (b) with bias neurons; (c) scalar units with scalar weights on the connections; (d) vector units with weight matrices (of sizes 5 × 3, 3 × 3, and 3 × 1) on the connections. Even though each unit contains a single scalar variable, one often represents all units within a single layer as a single vector unit. Vector units are often represented as rectangles and have connection matrices between them.
Figure 1.12: An example of an autoencoder with multiple outputs. The output of the constricted middle hidden layer provides the reduced representation.
The weights of the connections between the input layer and the first hidden layer are
contained in a matrix W1 with size d × p1, whereas the weights between the rth hidden layer and the (r + 1)th hidden layer are contained in the pr × pr+1 matrix Wr+1. If the output layer contains o nodes, then the final matrix Wk+1 is of size pk × o. The d-dimensional input vector x is transformed into the outputs using the following recursive equations:

h1 = Φ(W1ᵀ x)   [Input to Hidden Layer]
hp+1 = Φ(Wp+1ᵀ hp)   ∀p ∈ {1 . . . k − 1}   [Hidden to Hidden Layer]
o = Φ(Wk+1ᵀ hk)   [Hidden to Output Layer]
Here, the activation functions like the sigmoid function are applied in element-wise fashion
to their vector arguments. However, some activation functions such as the softmax (which
are typically used in the output layers) naturally have vector arguments. Even though each
unit of a neural network contains a single variable, many architectural diagrams combine
the units in a single layer to create a single vector unit, which is represented as a rectangle
rather than a circle. For example, the architectural diagram in Figure 1.11(c) (with scalar
units) has been transformed to a vector-based neural architecture in Figure 1.11(d). Note
that the connections between the vector units are now matrices. Furthermore, an implicit
assumption in the vector-based neural architecture is that all units in a layer use the same
activation function, which is applied in element-wise fashion to that layer. This constraint is
usually not a problem, because most neural architectures use the same activation function
throughout the computational pipeline, with the only deviation caused by the nature of
the output layer. Throughout this book, neural architectures in which units contain vector
variables will be depicted with rectangular units, whereas scalar variables will correspond
to circular units.
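The layer-wise recursion above amounts to a short loop in code. The following sketch (names are illustrative) applies the same element-wise activation phi at every layer for simplicity, even though, as noted above, the output layer often uses a different activation such as the softmax:

import numpy as np

def forward_pass(x, weights, phi):
    # weights holds the matrices W1, ..., Wk+1 with sizes d x p1, p1 x p2, ..., pk x o
    h = x
    for W in weights:
        h = phi(W.T @ h)     # h_{p+1} = phi(W_{p+1}^T h_p); the last iteration produces the output
    return h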
Note that the aforementioned recurrence equations and vector architectures are valid
only for layer-wise feed-forward networks, and cannot always be used for unconventional
architectural designs. It is possible to have all types of unconventional designs in which
inputs might be incorporated in intermediate layers, or the topology might allow connections
between non-consecutive layers. Furthermore, the functions computed at a node may not
always be in the form of a combination of a linear function and an activation. It is possible
to have all types of arbitrary computational functions at nodes.
Although a very classical type of architecture is shown in Figure 1.11, it is possible to
vary on it in many ways, such as allowing multiple output nodes. These choices are often
determined by the goals of the application at hand (e.g., classification or dimensionality
reduction). A classical example of the dimensionality reduction setting is the autoencoder,
which recreates the outputs from the inputs. Therefore, the number of outputs and inputs
is equal, as shown in Figure 1.12. The constricted hidden layer in the middle outputs the
reduced representation of each instance. As a result of this constriction, there is some loss in
the representation, which typically corresponds to the noise in the data. The outputs of the
hidden layers correspond to the reduced representation of the data. In fact, a shallow variant
of this scheme can be shown to be mathematically equivalent to a well-known dimensionality
reduction method known as singular value decomposition. As we will learn in Chapter 2,
increasing the depth of the network results in inherently more powerful reductions.
Although a fully connected architecture is able to perform well in many settings, better
performance is often achieved by pruning many of the connections or sharing them in an
insightful way. Typically, these insights are obtained by using a domain-specific understand-
ing of the data. A classical example of this type of weight pruning and sharing is that of
the convolutional neural network architecture (cf. Chapter 8), in which the architecture is
carefully designed in order to conform to the typical properties of image data. Such an ap-
proach minimizes the risk of overfitting by incorporating domain-specific insights (or bias).
As we will discuss later in this book (cf. Chapter 4), overfitting is a pervasive problem in
neural network design, so that the network often performs very well on the training data,
but it generalizes poorly to unseen test data. This problem occurs when the number of free
parameters (which is typically equal to the number of weight connections) is too large
compared to the size of the training data. In such cases, the large number of parameters
memorize the specific nuances of the training data, but fail to recognize the statistically
significant patterns for classifying unseen test data. Clearly, increasing the number of nodes
in the neural network tends to encourage overfitting. Much recent work has been focused
both on the architecture of the neural network as well as on the computations performed
within each node in order to minimize overfitting. Furthermore, the way in which the neu-
ral network is trained also has an impact on the quality of the final solution. Many clever
methods, such as pretraining (cf. Chapter 4), have been proposed in recent years in order to
improve the quality of the learned solution. This book will explore these advanced training
methods in detail.
1.2.3 The Multilayer Network as a Computational Graph
It is helpful to view a neural network as a computational graph, which is constructed by
piecing together many basic parametric models. Neural networks are fundamentally more
powerful than their building blocks because the parameters of these models are learned
jointly to create a highly optimized composition function of these models. The common use
of the term “perceptron” to refer to the basic unit of a neural network is somewhat mis-
leading, because there are many variations of this basic unit that are leveraged in different
settings. In fact, it is far more common to use logistic units (with sigmoid activation) and
piecewise/fully linear units as building blocks of these models.
A multilayer network evaluates compositions of functions computed at individual nodes.
A path of length 2 in the neural network in which the function f(·) follows g(·) can be
considered a composition function f(g(·)). Furthermore, if g1(·), g2(·) . . . gk(·) are the func-
tions computed in layer m, and a particular layer-(m + 1) node computes f(·), then the
composition function computed by the layer-(m + 1) node in terms of the layer-m inputs
is f(g1(·), . . . gk(·)). The use of nonlinear activation functions is the key to increasing the
power of multiple layers. If all layers use an identity activation function, then a multilayer
network can be shown to simplify to linear regression. It has been shown [208] that a net-
work with a single hidden layer of nonlinear units (with a wide ranging choice of squashing
functions like the sigmoid unit) and a single (linear) output layer can compute almost
any “reasonable” function. As a result, neural networks are often referred to as universal
function approximators, although this theoretical claim is not always easy to translate into
practical usefulness. The main issue is that the number of hidden units required to do so
is rather large, which increases the number of parameters to be learned. This results in
practical problems in training the network with a limited amount of data. In fact, deeper
networks are often preferred because they reduce the number of hidden units in each layer
as well as the overall number of parameters.
The “building block” description is particularly appropriate for multilayer neural net-
works. Very often, off-the-shelf software for building neural networks² provides analysts
²Examples include Torch [572], Theano [573], and TensorFlow [574].
with access to these building blocks. The analyst is able to specify the number and type of
units in each layer along with an off-the-shelf or customized loss function. A deep neural
network containing tens of layers can often be described in a few hundred lines of code.
All the learning of the weights is done automatically by the backpropagation algorithm that
uses dynamic programming to work out the complicated parameter update steps of the
underlying computational graph. The analyst does not have to spend the time and effort
to explicitly work out these steps. This makes the process of trying different types of ar-
chitectures relatively painless for the analyst. Building a neural network with many of the
off-the-shelf software packages is often compared to a child constructing a toy from building blocks
that appropriately fit with one another. Each block is like a unit (or a layer of units) with a
particular type of activation. Much of this ease in training neural networks is attributable
to the backpropagation algorithm, which shields the analyst from explicitly working out the
parameter update steps of what is actually an extremely complicated optimization problem.
Working out these steps is often the most difficult part of most machine learning algorithms,
and an important contribution of the neural network paradigm is to bring modular thinking
into machine learning. In other words, the modularity in neural network design translates
to modularity in learning its parameters; the specific name for the latter type of modularity
is “backpropagation.” This makes the design of neural networks more of an (experienced)
engineer’s task rather than a mathematical exercise.
1.3 Training a Neural Network with Backpropagation
In the single-layer neural network, the training process is relatively straightforward because
the error (or loss function) can be computed as a direct function of the weights, which
allows easy gradient computation. In the case of multi-layer networks, the problem is that
the loss is a complicated composition function of the weights in earlier layers. The gradient
of a composition function is computed using the backpropagation algorithm. The backprop-
agation algorithm leverages the chain rule of differential calculus, which computes the error
gradients in terms of summations of local-gradient products over the various paths from a
node to the output. Although this summation has an exponential number of components
(paths), one can compute it efficiently using dynamic programming. The backpropagation
algorithm is a direct application of dynamic programming. It contains two main phases,
referred to as the forward and backward phases, respectively. The forward phase is required
to compute the output values and the local derivatives at various nodes, and the backward
phase is required to accumulate the products of these local values over all paths from the
node to the output:
1. Forward phase: In this phase, the inputs for a training instance are fed into the neural
network. This results in a forward cascade of computations across the layers, using
the current set of weights. The final predicted output can be compared to that of the
training instance and the derivative of the loss function with respect to the output is
computed. The derivative of this loss now needs to be computed with respect to the
weights in all layers in the backwards phase.
2. Backward phase: The main goal of the backward phase is to learn the gradient of the
loss function with respect to the different weights by using the chain rule of differen-
tial calculus. These gradients are used to update the weights. Since these gradients
are learned in the backward direction, starting from the output node, this learning
process is referred to as the backward phase. Consider a sequence of hidden units
Figure 1.13: Illustration of the chain rule in computational graphs. The weight w feeds into a node computing f(w); its output (denoted y and z on the two branches) feeds into nodes computing p = g(y) and q = h(z), which are combined by the output node into o = K(p, q) = K(g(f(w)), h(f(w))). The products of node-specific partial derivatives along paths from weight w to output o are aggregated:

∂o/∂w = (∂o/∂p) · (∂p/∂w) + (∂o/∂q) · (∂q/∂w)   [Multivariable Chain Rule]
      = (∂o/∂p) · (∂p/∂y) · (∂y/∂w) + (∂o/∂q) · (∂q/∂z) · (∂z/∂w)   [Univariate Chain Rule]
      = ∂K(p, q)/∂p · g′(y) · f′(w) [first path] + ∂K(p, q)/∂q · h′(z) · f′(w) [second path]

The resulting value yields the derivative of output o with respect to weight w. Only two paths between input and output exist in this simplified example.
h1, h2, . . . , hk followed by output o, with respect to which the loss function L is com-
puted. Furthermore, assume that the weight of the connection from hidden unit hr to
hr+1 is w(hr,hr+1). Then, in the case that a single path exists from h1 to o, one can
derive the gradient of the loss function with respect to any of these edge weights using
the chain rule:
∂L/∂w(hr−1,hr) = (∂L/∂o) · [ (∂o/∂hk) · Π_{i=r}^{k−1} (∂hi+1/∂hi) ] · (∂hr/∂w(hr−1,hr))   ∀r ∈ 1 . . . k   (1.22)
The aforementioned expression assumes that only a single path from h1 to o exists in
the network, whereas an exponential number of paths might exist in reality. A gener-
alized variant of the chain rule, referred to as the multivariable chain rule, computes
the gradient in a computational graph, where more than one path might exist. This is
achieved by adding the composition along each of the paths from h1 to o. An example
of the chain rule in a computational graph with two paths is shown in Figure 1.13.
Therefore, one generalizes the above expression to the case where a set P of paths
exist from hr to o:
∂L/∂w(hr−1,hr) = { (∂L/∂o) · Σ_{[hr,hr+1,...,hk,o] ∈ P} [ (∂o/∂hk) · Π_{i=r}^{k−1} (∂hi+1/∂hi) ] } · (∂hr/∂w(hr−1,hr))   (1.23)

The term in curly braces is what backpropagation computes; it equals Δ(hr, o) = ∂L/∂hr.
The computation of ∂hr/∂w(hr−1,hr) on the right-hand side is straightforward and will be discussed below (cf. Equation 1.27). However, the path-aggregated term above [annotated by Δ(hr, o) = ∂L/∂hr] is aggregated over an exponentially increasing number
of paths (with respect to path length), which seems to be intractable at first sight. A
key point is that the computational graph of a neural network does not have cycles,
and it is possible to compute such an aggregation in a principled way in the backwards
direction by first computing Δ(hk, o) for nodes hk closest to o, and then recursively
computing these values for nodes in earlier layers in terms of the nodes in later layers.
Furthermore, the value of Δ(o, o) for each output node is initialized as follows:
Δ(o, o) = ∂L/∂o   (1.24)
This type of dynamic programming technique is used frequently to efficiently compute
all types of path-centric functions in directed acyclic graphs, which would otherwise
require an exponential number of operations. The recursion for Δ(hr, o) can be derived
using the multivariable chain rule:
Δ(hr, o) = ∂L/∂hr = Σ_{h : hr⇒h} (∂L/∂h) · (∂h/∂hr) = Σ_{h : hr⇒h} (∂h/∂hr) · Δ(h, o)   (1.25)
Since each h is in a later layer than hr, Δ(h, o) has already been computed by the time
Δ(hr, o) is evaluated. However, we still need to evaluate ∂h/∂hr in order to compute Equa-
tion 1.25. Consider a situation in which the edge joining hr to h has weight w(hr,h),
and let ah be the value computed in hidden unit h just before applying the activation
function Φ(·). In other words, we have h = Φ(ah), where ah is a linear combination of
the inputs to h from units in earlier layers. Then, by the univariate chain rule,
the following expression for ∂h/∂hr can be derived:
$$
\frac{\partial h}{\partial h_r}
= \frac{\partial h}{\partial a_h}\cdot\frac{\partial a_h}{\partial h_r}
= \frac{\partial \Phi(a_h)}{\partial a_h}\cdot w_{(h_r,h)}
= \Phi'(a_h)\cdot w_{(h_r,h)}
$$
This value of ∂h/∂hr is used in Equation 1.25, which is repeated recursively in the back-
wards direction, starting with the output node. The corresponding updates in the
backwards direction are as follows:
$$
\Delta(h_r, o) = \sum_{h : h_r \Rightarrow h} \Phi'(a_h)\cdot w_{(h_r,h)}\cdot \Delta(h, o)
\qquad (1.26)
$$
Therefore, gradients are successively accumulated in the backwards direction, and
each node is processed exactly once in a backwards pass. Note that the computation
of Equation 1.25 (which requires a number of operations proportional to the number of
outgoing edges) needs to be repeated for each incoming edge into the node in order to
compute the gradient with respect to all edge weights. Finally, Equation 1.23 requires the
computation of ∂hr/∂w(hr−1,hr), which is easily computed as follows:

$$
\frac{\partial h_r}{\partial w_{(h_{r-1},h_r)}} = h_{r-1}\cdot \Phi'(a_{h_r}) \qquad (1.27)
$$
Here, the key gradient that is backpropagated is the derivative of the loss with respect to the
layer activations, and the gradient with respect to the weights is easy to compute for any
edge incident on the corresponding unit.
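To make Equations 1.24-1.27 concrete, the following sketch traces the Δ-recursion through a small fully connected network in vectorized form. It is only an illustrative sketch, not code from this book: the use of NumPy, sigmoid activations in every layer (including the output), the squared loss, and the layer sizes are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def forward(x, weights):
    """Forward pass: store pre-activations a and post-activations h = Phi(a) per layer."""
    a_vals, h_vals = [], [x]
    for W in weights:
        a = W @ h_vals[-1]          # linear combination of inputs (pre-activation)
        a_vals.append(a)
        h_vals.append(sigmoid(a))   # post-activation value h = Phi(a)
    return a_vals, h_vals

def backward(a_vals, h_vals, weights, dL_do):
    """Backward pass in the Delta(h_r, o) = dL/dh_r formulation (Eqs. 1.24-1.27)."""
    grads = [None] * len(weights)
    delta = dL_do                                  # Eq. 1.24: Delta(o, o) = dL/do
    for i in reversed(range(len(weights))):
        local = sigmoid_prime(a_vals[i]) * delta   # Phi'(a_h) * Delta(h, o)
        grads[i] = np.outer(local, h_vals[i])      # Eqs. 1.23 + 1.27: Delta * Phi'(a) * h_{r-1}
        delta = weights[i].T @ local               # Eq. 1.26: Delta for the previous layer
    return grads

# Toy usage: 3 inputs -> 4 hidden -> 2 hidden -> 1 output, squared loss L = ||o - y||^2 / 2.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4)), rng.standard_normal((1, 2))]
x, y = rng.standard_normal(3), np.array([1.0])
a_vals, h_vals = forward(x, weights)
dL_do = h_vals[-1] - y                             # derivative of the squared loss at the output
gradients = backward(a_vals, h_vals, weights, dL_do)
```

Each layer is visited exactly once in the backward loop, mirroring the dynamic programming argument above.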
It is noteworthy that the dynamic programming recursion of Equation 1.26 can be
computed in multiple ways, depending on which variables one uses for intermediate chaining.
All these recursions are equivalent in terms of the final result of backpropagation. In the
following, we give an alternative version of the dynamic programming recursion, which is
more commonly seen in textbooks. Note that Equation 1.23 uses the variables in the hidden
layers as the “chain” variables for the dynamic programming recursion. One can also use
the pre-activation values of the variables as the intermediate ("chain") variables. The
pre-activation value of a neuron is obtained after applying the linear transformation to its
inputs, but before applying the activation function. The pre-activation value of the hidden
variable h = Φ(ah) is ah. The differences between the pre-activation and post-activation values
within a neuron are shown in Figure 1.7. Therefore, instead of Equation 1.23, one can use
the following chain rule:
$$
\frac{\partial L}{\partial w_{(h_{r-1},h_r)}}
= \underbrace{\frac{\partial L}{\partial o}\cdot \Phi'(a_o)\cdot
\left[\sum_{[h_r,h_{r+1},\ldots,h_k,o]\in P}
\frac{\partial a_o}{\partial a_{h_k}}\prod_{i=r}^{k-1}\frac{\partial a_{h_{i+1}}}{\partial a_{h_i}}\right]}_{\text{Backpropagation computes } \delta(h_r,o)=\partial L/\partial a_{h_r}}
\cdot
\underbrace{\frac{\partial a_{h_r}}{\partial w_{(h_{r-1},h_r)}}}_{h_{r-1}}
\qquad (1.28)
$$
Here, we have introduced the notation δ(hr, o) = ∂L/∂ahr instead of Δ(hr, o) = ∂L/∂hr for setting
up the recursive equation. The value of δ(o, o) = ∂L/∂ao is initialized as follows:
$$
\delta(o, o) = \frac{\partial L}{\partial a_o} = \Phi'(a_o)\cdot\frac{\partial L}{\partial o}
\qquad (1.29)
$$
Then, one can use the multivariable chain rule to set up a similar recursion:
$$
\delta(h_r, o) = \frac{\partial L}{\partial a_{h_r}}
= \sum_{h : h_r \Rightarrow h}
\underbrace{\frac{\partial L}{\partial a_h}}_{\delta(h,o)}\cdot
\underbrace{\frac{\partial a_h}{\partial a_{h_r}}}_{\Phi'(a_{h_r})\, w_{(h_r,h)}}
= \Phi'(a_{h_r})\sum_{h : h_r \Rightarrow h} w_{(h_r,h)}\cdot \delta(h, o)
\qquad (1.30)
$$
This form of the recursion is the one found more commonly in textbooks discussing backpropagation.
The partial derivative of the loss with respect to the weight is then computed using δ(hr, o)
as follows:

$$
\frac{\partial L}{\partial w_{(h_{r-1},h_r)}} = \delta(h_r, o)\cdot h_{r-1} \qquad (1.31)
$$
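The pre-activation (δ) form of the recursion in Equations 1.29-1.31 differs only in which quantity is carried backwards. The sketch below is again an illustrative, assumption-laden example rather than prescribed code; it reuses the sigmoid_prime helper, the toy network, and the variables (a_vals, h_vals, weights, dL_do, gradients) from the previous sketch, and checks that both formulations produce identical gradients.

```python
import numpy as np

def backward_delta(a_vals, h_vals, weights, dL_do):
    """Backward pass in the pre-activation form delta(h_r, o) = dL/da_{h_r} (Eqs. 1.29-1.31)."""
    grads = [None] * len(weights)
    delta = sigmoid_prime(a_vals[-1]) * dL_do        # Eq. 1.29: delta(o, o) = Phi'(a_o) * dL/do
    for i in reversed(range(len(weights))):
        grads[i] = np.outer(delta, h_vals[i])        # Eq. 1.31: dL/dw = delta(h_r, o) * h_{r-1}
        if i > 0:
            # Eq. 1.30: delta(h_r, o) = Phi'(a_{h_r}) * sum_h w_{(h_r, h)} * delta(h, o)
            delta = sigmoid_prime(a_vals[i - 1]) * (weights[i].T @ delta)
    return grads

# Both formulations yield the same gradients on the toy network from the previous sketch.
grads_delta = backward_delta(a_vals, h_vals, weights, dL_do)
assert all(np.allclose(g1, g2) for g1, g2 in zip(gradients, grads_delta))
```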
As with the single-layer network, the process of updating the weights is repeated to conver-
gence by repeatedly cycling through the training data in epochs. A neural network may
sometimes require thousands of epochs through the training data to learn the weights at
the different nodes. A detailed description of the backpropagation algorithm and associated
issues is provided in Chapter 3; in this chapter, we provide only a brief discussion of these issues.
1.4 Practical Issues in Neural Network Training
In spite of the formidable reputation of neural networks as universal function approximators,
considerable challenges remain with respect to actually training neural networks to provide
this level of performance. These challenges are primarily related to several practical problems
associated with training, the most important one of which is overfitting.
1.4.1 The Problem of Overfitting
The problem of overfitting refers to the fact that fitting a model to a particular training
data set does not guarantee that it will provide good prediction performance on unseen test
data, even if the model predicts the targets on the training data perfectly. In other words,
there is always a gap between the training and test data performance, which is particularly
large when the models are complex and the data set is small.
In order to understand this point, consider a simple single-layer neural network on a
data set with five attributes, where we use the identity activation to learn a real-valued
target variable. This architecture is almost identical to that of Figure 1.3, except that the
identity activation function is used in order to predict a real-valued target. Therefore, the
network tries to learn the following function:
$$
\hat{y} = \sum_{i=1}^{5} w_i \cdot x_i \qquad (1.32)
$$
Consider a situation in which the observed target value is real and is always twice the
value of the first attribute, whereas other attributes are completely unrelated to the target.
However, we have only four training instances, which is one less than the number of features
(free parameters). For example, the training instances could be as follows:
x1 x2 x3 x4 x5 y
1 1 0 0 0 2
2 0 1 0 0 4
3 0 0 1 0 6
4 0 0 0 1 8
The correct parameter vector in this case is W = [2, 0, 0, 0, 0] based on the known rela-
tionship between the first feature and target. The training data also provides zero error
with this solution, although the relationship needs to be learned from the given instances
since it is not given to us a priori. However, the problem is that the number of training
points is fewer than the number of parameters and it is possible to find an infinite number
of solutions with zero error. For example, the parameter set [0, 2, 4, 6, 8] also provides zero
error on the training data. However, if we used this solution on unseen test data, it is likely
to provide very poor performance because the learned parameters are spuriously inferred
and are unlikely to generalize well to new points in which the target is twice the first at-
tribute (and other attributes are random). This type of spurious inference is caused by the
paucity of training data, where random nuances are encoded into the model. As a result,
the solution does not generalize well to unseen test data. This situation is analogous to
rote learning, which is highly predictive on the training data but not on unseen
test data. Increasing the number of training instances improves the generalization power
of the model, whereas increasing the complexity of the model reduces its generalization
power. At the same time, when a lot of training data is available, an overly simple model
is unlikely to capture complex relationships between the features and target. A good rule
of thumb is that the total number of training data points should be at least 2 to 3 times
larger than the number of parameters in the neural network, although the precise number
of data instances depends on the specific model at hand. In general, models with a larger
number of parameters are said to have high capacity, and they require a larger amount of
data in order to gain generalization power to unseen test data. The notion of overfitting is
often understood in the trade-off between bias and variance in machine learning. The key
take-away from the notion of bias-variance trade-off is that one does not always win with
more powerful (i.e., less biased) models when working with limited training data, because
of the higher variance of these models. For example, if we change the training data in the
table above to a different set of four points, we are likely to learn a completely different set
of parameters (from the random nuances of those points). This new model is likely to yield
a completely different prediction on the same test instance as compared to the predictions
using the first training data set. This type of variation in the prediction of the same test
instance using different training data sets is a manifestation of model variance, which also
adds to the error of the model; after all, both predictions of the same test instance could not
possibly be correct. More complex models have the drawback of seeing spurious patterns
in random nuances, especially when the training data are insufficient. One must be careful
to pick an optimum point when deciding the complexity of the model. These notions are
described in detail in Chapter 4.
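This can be verified numerically. The sketch below is purely illustrative (NumPy and the specific test point are assumptions): it shows that both the intended solution and the spurious one achieve zero training error on the four instances of the table, yet give very different predictions on an unseen point whose target is twice its first attribute.

```python
import numpy as np

# The four training instances from the table above: five attributes, target y = 2 * x1.
X = np.array([[1, 1, 0, 0, 0],
              [2, 0, 1, 0, 0],
              [3, 0, 0, 1, 0],
              [4, 0, 0, 0, 1]], dtype=float)
y = np.array([2, 4, 6, 8], dtype=float)

w_true = np.array([2, 0, 0, 0, 0], dtype=float)       # the intended relationship
w_spurious = np.array([0, 2, 4, 6, 8], dtype=float)   # also fits the training data exactly

print(X @ w_true - y)       # [0. 0. 0. 0.]  -> zero training error
print(X @ w_spurious - y)   # [0. 0. 0. 0.]  -> zero training error as well

# An unseen (hypothetical) test point whose correct target is 2 * x1 = 10:
x_test = np.array([5, 0.3, -0.2, 0.1, 0.4])
print(x_test @ w_true)      # 10.0 (correct)
print(x_test @ w_spurious)  # 3.6  (far from the correct value of 10)
```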
Neural networks have always been known to theoretically be powerful enough to ap-
proximate any function [208]. However, the lack of data availability can result in poor
performance; this is one of the reasons that neural networks only recently achieved promi-
nence. The greater availability of data has revealed the advantages of neural networks over
traditional machine learning (cf. Figure 1.2). In general, neural networks require careful
design to minimize the harmful effects of overfitting, even when a large amount of data is
available. This section provides an overview of some of the design methods used to mitigate
the impact of overfitting.
1.4.1.1 Regularization
Since a larger number of parameters causes overfitting, a natural approach is to constrain
the model to use fewer non-zero parameters. In the previous example, if we constrain the
vector W to have only one non-zero component out of five components, it will correctly
obtain the solution [2, 0, 0, 0, 0]. Smaller absolute values of the parameters also tend to
overfit less. Since it is hard to constrain the values of the parameters, the softer approach
of adding the penalty λ‖W‖^p to the loss function is used. The value of p is typically set to
2, which leads to Tikhonov regularization. In general, the squared value of each parameter
(multiplied with the regularization parameter λ > 0) is added to the objective function.
The practical effect of this change is that a quantity proportional to λwi is subtracted from
the update of the parameter wi. An example of a regularized version of Equation 1.6 for
mini-batch S and update step-size α > 0 is as follows:

$$
W \Leftarrow W(1-\alpha\lambda) + \alpha \sum_{X \in S} E(X)\, X \qquad (1.33)
$$
Here, E(X) = (y − ŷ) represents the current error between the observed and predicted values
of training instance X. One can view this type of penalization as a kind of weight decay
during the updates. Regularization is particularly important when the amount of available
data is limited. A neat biological interpretation of regularization is that it corresponds to
gradual forgetting, as a result of which “less important” (i.e., noisy) patterns are removed.
In general, it is often advisable to use more complex models with regularization rather than
simpler models without regularization.
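A minimal sketch of the weight-decay update of Equation 1.33 for a single linear unit is given below. The mini-batch contents, the learning rate, the regularization parameter, and the use of NumPy are illustrative assumptions rather than values prescribed by the text.

```python
import numpy as np

def regularized_update(W, batch_X, batch_y, alpha=0.01, lam=0.1):
    """One mini-batch update with L2 regularization (weight decay), per Equation 1.33."""
    errors = batch_y - batch_X @ W          # E(X) = y - y_hat for each instance in the batch
    # Shrink the weights by (1 - alpha * lam), then apply the usual error-driven update.
    return W * (1 - alpha * lam) + alpha * (batch_X.T @ errors)

# Toy usage with the same "target = twice the first attribute" setup as above.
rng = np.random.default_rng(1)
W = rng.standard_normal(5)
batch_X = rng.standard_normal((8, 5))
batch_y = 2 * batch_X[:, 0]
W = regularized_update(W, batch_X, batch_y)
```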
As a side note, the general form of Equation 1.33 is used by many regularized machine
learning models like least-squares regression (cf. Chapter 2), where E(X) is replaced by the
error-function of that specific model. Interestingly, weight decay is only sparingly used in the
single-layer perceptron3
because it can sometimes cause overly rapid forgetting with a small
number of recently misclassified training points dominating the weight vector; the main
issue is that the perceptron criterion is already a degenerate loss function with a minimum
value of 0 at W = 0 (unlike its hinge-loss or least-squares cousins). This quirk is a legacy
of the fact that the single-layer perceptron was originally defined in terms of biologically
inspired updates rather than in terms of carefully thought-out loss functions. Convergence
to an optimal solution was never guaranteed other than in linearly separable cases. For the
single-layer perceptron, some other regularization techniques, which are discussed below,
are more commonly used.
1.4.1.2 Neural Architecture and Parameter Sharing
The most effective way of building a neural network is by constructing the architecture of the
neural network after giving some thought to the underlying data domain. For example, the
successive words in a sentence are often related to one another, whereas the nearby pixels
in an image are typically related. These types of insights are used to create specialized
architectures for text and image data with fewer parameters. Furthermore, many of the
parameters might be shared. For example, a convolutional neural network uses the same
set of parameters to learn the characteristics of a local block of the image. The recent
advancements in the use of neural networks like recurrent neural networks and convolutional
neural networks are examples of this phenomenon.
1.4.1.3 Early Stopping
Another common form of regularization is early stopping, in which the gradient descent is
ended after only a few iterations. One way to decide the stopping point is by holding out a
part of the training data, and then testing the error of the model on the held-out set. The
gradient-descent approach is terminated when the error on the held-out set begins to rise.
Early stopping essentially confines the learned parameters to a smaller neighborhood around
their initial values. From this point of view, early stopping acts as
a regularizer because it effectively restricts the parameter space.
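The procedure just described can be sketched as a generic training loop. The helper callables train_one_epoch and validation_error are hypothetical placeholders supplied by the caller, and the small patience window is a common practical refinement assumed here rather than something mandated by the text.

```python
def train_with_early_stopping(train_one_epoch, validation_error, max_epochs=1000, patience=5):
    """Run gradient descent, but stop when the held-out (validation) error begins to rise.

    Both arguments are caller-supplied callables (hypothetical here): the first performs one
    epoch of gradient descent, the second returns the error on the held-out portion of the
    training data.
    """
    best_error, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        error = validation_error()
        if error < best_error:
            best_error, epochs_without_improvement = error, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break        # held-out error has stopped improving: terminate early
    return best_error
```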
1.4.1.4 Trading Off Breadth for Depth
As discussed earlier, a two-layer neural network can be used as a universal function approx-
imator [208], if a large number of hidden units are used within the hidden layer. It turns out
that networks with more layers (i.e., greater depth) tend to require far fewer units per layer
because the composition functions created by successive layers make the neural network
more powerful. Increased depth is a form of regularization, as the features in later layers
are forced to obey a particular type of structure imposed by the earlier layers. Increased
constraints reduce the capacity of the network, which is helpful when there are limitations
on the amount of available data. A brief explanation of this type of behavior is given in
Section 1.5. The number of units in each layer can typically be reduced to such an extent
that a deep network often has far fewer parameters even when added up over the greater
number of layers. This observation has led to an explosion in research on the topic of deep
learning.
3Weight decay is generally used with other loss functions in single-layer models and in all multi-layer
models with a large number of parameters.
Even though deep networks have fewer problems with respect to overfitting, they come
with a different family of problems associated with ease of training. In particular, the loss
derivatives with respect to the weights in different layers of the network tend to have vastly
different magnitudes, which causes challenges in properly choosing step sizes. Different
manifestations of this undesirable behavior are referred to as the vanishing and exploding
gradient problems. Furthermore, deep networks often take unreasonably long to converge.
These issues and design choices will be discussed later in this section and at several places
throughout the book.
1.4.1.5 Ensemble Methods
A variety of ensemble methods like bagging are used in order to increase the generalization
power of the model. These methods are applicable not just to neural networks but to
any type of machine learning algorithm. However, in recent years, a number of ensemble
methods that are specifically focused on neural networks have also been proposed. Two
such methods include Dropout and Dropconnect. These methods can be combined with
many neural network architectures to obtain an additional accuracy improvement of about
2% in many real settings. However, the precise improvement depends on the type of data and
the nature of the underlying training. For example, normalizing the activations in hidden
layers can reduce the effectiveness of Dropout methods, although one can gain from the
normalization itself. Ensemble methods are discussed in Chapter 4.
1.4.2 The Vanishing and Exploding Gradient Problems
While increasing depth often reduces the number of parameters of the network, it leads to
different types of practical issues. Propagating backwards using the chain rule has its draw-
backs in networks with a large number of layers in terms of the stability of the updates. In
particular, the updates in earlier layers can either be negligibly small (vanishing gradient) or
they can be increasingly large (exploding gradient) in certain types of neural network archi-
tectures. This is primarily caused by the chain-like product computation in Equation 1.23,
which can either exponentially increase or decay over the length of the path. In order to
understand this point, consider a situation in which we have a multi-layer network with one
neuron in each layer. Each local derivative along a path can be shown to be the product of
the weight and the derivative of the activation function. The overall backpropagated deriva-
tive is the product of these values. If each such value is randomly distributed, and has an
expected value less than 1, the product of these derivatives in Equation 1.23 will drop off ex-
ponentially fast with path length. If the individual values on the path have expected values
greater than 1, it will typically cause the gradient to explode. Even if the local derivatives
are randomly distributed with an expected value of exactly 1, the overall derivative will
typically show instability depending on how the values are actually distributed. In other
words, the vanishing and exploding gradient problems are rather natural to deep networks,
which makes their training process unstable.
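A quick numerical illustration of this instability (a sketch under assumed weight scales, not an experiment from the text) multiplies the layer-wise factors, each the product of a weight and a sigmoid derivative, along a long path of single-neuron layers:

```python
import numpy as np

rng = np.random.default_rng(2)
depth, trials = 100, 1000

for weight_scale in (0.5, 2.0, 20.0):     # arbitrary illustrative weight magnitudes
    logs = []
    for _ in range(trials):
        w = weight_scale * rng.standard_normal(depth)   # one weight per edge on the path
        a = rng.standard_normal(depth)                  # pre-activation at each layer
        s = 1.0 / (1.0 + np.exp(-a))
        local = w * s * (1.0 - s)                       # local factor: weight times sigmoid derivative
        logs.append(np.sum(np.log10(np.abs(local))))    # log10 |product of local factors|
    # Strongly negative average -> vanishing gradient; positive average -> exploding gradient.
    print(weight_scale, np.mean(logs))
```

Because the sigmoid derivative is at most 0.25, the product typically collapses toward zero even for moderately sized weights, and only quite large weights push the average factor above one and make the gradient explode.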
Many solutions have been proposed to address this issue. For example, a sigmoid activa-
tion often encourages the vanishing gradient problem, because its derivative is at most 0.25
over all values of its argument (see Exercise 7), and is extremely small at saturation. A ReLU
activation unit is known to be less likely to create a vanishing gradient problem because its
derivative is always 1 for positive values of the argument. More discussions of this issue are
provided in Chapter 3. Aside from the use of the ReLU, a whole host of gradient-descent
tricks are used to improve the convergence behavior of the training process. In particular, the use
of adaptive learning rates and conjugate gradient methods can help in many cases. Further-
more, a recent technique called batch normalization is helpful in addressing some of these
issues. These techniques are discussed in Chapter 3.
1.4.3 Difficulties in Convergence
Sufficiently fast convergence of the optimization process is difficult to achieve with very
deep networks, as depth leads to increased resistance to the training process in terms of
letting the gradients smoothly flow through the network. This problem is somewhat related
to the vanishing gradient problem, but has its own unique characteristics. Therefore, some
“tricks” have been proposed in the literature for these cases, including the use of gating
networks and residual networks [184]. These methods are discussed in Chapters 7 and 8,
respectively.
1.4.4 Local and Spurious Optima
The loss function of a neural network is highly nonlinear and typically has many local
optima. When the parameter space is large and there are many local optima, it makes sense
to spend some effort in picking good initialization points. One such method for improving
neural network initialization is referred to as pretraining. The basic idea is to use either
supervised or unsupervised training on shallow sub-networks of the original network in
order to create the initial weights. This type of pretraining is done in a greedy and layerwise
fashion in which a single layer of the network is trained at one time in order to learn
the initialization points of that layer. This type of approach provides initialization points
that ignore drastically irrelevant parts of the parameter space to begin with. Furthermore,
unsupervised pretraining often tends to avoid problems associated with overfitting. The
basic idea here is that some of the minima in the loss function are spurious optima because
they are exhibited only in the training data and not in the test data. Using unsupervised
pretraining tends to move the initialization point closer to the basin of “good” optima in
the test data. This is an issue associated with model generalization. Methods for pretraining
are discussed in Section 4.7 of Chapter 4.
Interestingly, the notion of spurious optima is often viewed from the lens of model gen-
eralization in neural networks. This is a different perspective from traditional optimization.
In traditional optimization, one does not focus on the differences in the loss functions of
the training and test data, but on the shape of the loss function in only the training data.
Surprisingly, the problem of local optima (from a traditional perspective) is a smaller issue
in neural networks than one might normally expect from such a nonlinear function. Most
of the time, the nonlinearity causes problems during the training process itself (e.g., failure
to converge), rather than getting stuck in a local minimum.
1.4.5 Computational Challenges
A significant challenge in neural network design is the running time required to train the
network. It is not uncommon to require weeks to train neural networks in the text and image
domains. In recent years, advances in hardware technology such as Graphics Processor Units
(GPUs) have helped to a significant extent. GPUs are specialized hardware processors that
can significantly speed up the kinds of operations commonly used in neural networks. In
this sense, some algorithmic frameworks like Torch are particularly convenient because they
have GPU support tightly integrated into the platform.
Although algorithmic advancements have played a role in the recent excitement around
deep learning, a lot of the gains have come from the fact that the same algorithms can do
much more on modern hardware. Faster hardware also supports algorithmic development,
because one needs to repeatedly test computationally intensive algorithms to understand
what works and what does not. For example, a recent neural model such as the long short-
term memory has changed only modestly [150] since it was first proposed in 1997 [204]. Yet,
the potential of this model has been recognized only recently because of the advances in
computational power of modern machines and algorithmic tweaks associated with improved
experimentation.
One convenient property of the vast majority of neural network models is that most of
the computational heavy lifting is front loaded during the training phase, and the prediction
phase is often computationally efficient, because it requires a small number of operations
(depending on the number of layers). This is important because the prediction phase is
often far more time-critical compared to the training phase. For example, it is far more
important to classify an image in real time (with a pre-built model), although the actual
building of that model might have required a few weeks over millions of images. Methods
have also been designed to compress trained networks in order to enable their deployment
in mobile and space-constrained settings. These issues are discussed in Chapter 3.
1.5 The Secrets to the Power of Function Composition
Even though the biological metaphor sounds like an exciting way to intuitively justify the
computational power of a neural network, it does not provide a complete picture of the
settings in which neural networks perform well. At its most basic level, a neural network is
a computational graph that performs compositions of simpler functions to provide a more
complex function. Much of the power of deep learning arises from the fact that repeated
composition of multiple nonlinear functions has significant expressive power. Even though
the work in [208] shows that a single composition of a large number of squashing functions
can approximate almost any function, this approach requires an extremely large number
of units (i.e., parameters) in the network. This increases the capacity of the network, which
causes overfitting unless the data set is extremely large. Much of the power of deep learning
arises from the fact that the repeated composition of certain types of functions increases the
representation power of the network, and therefore reduces the parameter space required for
learning.
Not all base functions are equally good at achieving this goal. In fact, the nonlinear
squashing functions used in neural networks are not arbitrarily chosen, but are carefully
designed because of certain types of properties. For example, imagine a situation in which
the identity activation function is used in each layer, so that only linear functions are
computed. In such a case, the resulting neural network is no stronger than a single-layer,
linear network:
Theorem 1.5.1 A multi-layer network that uses only the identity activation function in
all its layers reduces to a single-layer network performing linear regression.
Proof: Consider a network containing k hidden layers; it therefore contains a total of
(k + 1) computational layers (including the output layer). The corresponding (k + 1) weight
matrices between successive layers are denoted by W1 . . . Wk+1. Let x be the d-dimensional
column vector corresponding to the input, h1 . . . hk be the column vectors corresponding to
the hidden layers, and o be the m-dimensional column vector corresponding to the output.
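Although the rest of the argument is algebraic, the claim is easy to check numerically. The sketch below (the layer widths, random weights, and use of NumPy are arbitrary illustrative choices) verifies that a stack of identity-activation layers computes exactly the same function as the single matrix obtained by multiplying the weight matrices together:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 6, 3
widths = [d, 10, 7, 5, m]                       # input, three hidden layers, output
Ws = [rng.standard_normal((widths[i + 1], widths[i])) for i in range(len(widths) - 1)]

x = rng.standard_normal(d)

# Multi-layer forward pass with the identity activation in every layer.
h = x
for W in Ws:
    h = W @ h

# Equivalent single matrix: the product W_{k+1} W_k ... W_1.
W_single = Ws[-1]
for W in reversed(Ws[:-1]):
    W_single = W_single @ W

assert np.allclose(h, W_single @ x)             # the deep linear network is just one linear map
```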
Random documents with unrelated
content Scribd suggests to you:
prend cette chose tellement fanée
et empuantie de toutes les puanteurs
dans sa tendre et blanche main.
Arrachée par le talent du poète,
dans un doux accord avec le beau,
une larme lentement s'écoule
et tendrement fait sa part dans l'œuvre commune:
nul lecteur qui n'y laisse une tache!
O pensée grandiose et puissante!
O résultat merveilleux!
Qu'il est béni des dieux le poète
qui possède un si noble talent!
Grands et Petits, Pauvres et Riches,
cette crasse est l'œuvre de tous!
Ah! celui qui vit encore dans l'osbcurité,
qui lutte pour se hausser jusqu'au laurier,
assurément sent, dans sa brûlante ardeur,
un désir lui tirailler le sein.
Dieu bon, implore-t-il chaque jour,
Accorde-moi ce bonheur indicible:
fais que mes pauvres livres de vers
soient aussi gras et crasseux!
Mais si les poètes aspirent aux embrassements «de la grande
impudique
Qui tient dans ses bras l'univers,
s'ils sont tellement avides du bruit qu'ils ouvrent leur escarcelle toute
grande à la popularité, cette «gloire en gros sous», il n'en est point
de même des vrais amants des livres, de ceux qui ne les font pas,
mais qui les achètent, les parent, les enchâssent, en délectent leurs
doigts, leurs yeux, et parfois leur esprit.
Ecoutez la tirade mise par un poète anglais dans la bouche d'un
bibliophile qui a prêté à un infidèle ami une reliure de Trautz-
Bauzonnet et qui ne l'a jamais revue:
Une fois prêté, un livre est perdu...
Prêter des livres! Parbleu, je n'y consentirai
plus.
Vos prêteurs faciles ne sont que des fous que je
redoute.
Si les gens veulent des livres, par le grand
Grolier, qu'ils les achètent!
Qui est-ce qui prête sa femme lorsqu'il peut se
dispenser du prêt?
Nos femmes seront-elles donc tenues pour plus
que nos livres chères?
Nous en préserve de Thou! Jamais plus de livres
ne prêterai.
Ne dirait-on pas que c'est pour ce bibliophile échaudé que fut faite
cette imitation supérieurement réussie des inscriptions dont les
écoliers sont prodigues sur leurs rudiments et Selectæ:
Qui ce livre volera,
Pro suis criminibus
Au gibet il dansera,
Pedibus penditibus.
Ce châtiment n'eût pas dépassé les mérites de celui contre lequel
Lebrun fit son épigramme «à un Abbé qui aimait les lettres et un peu
trop mes livres»:
Non, tu n'es point de ces abbés ignares,
Qui n'ont jamais rien lu que le Missel:
Des bons écrits tu savoures le sel,
Et te connais en livres beaux et rares.
Trop bien le sais! car, lorsqu'à pas de loup
Tu viens chez moi feuilleter coup sur coup
Mes Elzévirs, ils craignent ton approche.
Dans ta mémoire il en reste beaucoup;
Beaucoup aussi te restent dans la poche.
Un amateur de livres de nuance libérale pourrait adopter pour devise
cette inscription mise à l'entrée d'une bibliothèque populaire
anglaise:
Tolle, aperi, recita, ne lœdas, claude, rapine!
ce qui, traduit librement, signifie: «Prends, ouvre, lis, n'abîme pas,
referme, mais surtout mets en place!»
Punch, le Charivari d'Outre-Manche, en même temps qu'il incarne
pour les Anglais notre Polichinelle et le Pulcinello des Italiens,
résume à merveille la question. Voici, dit-il, «la tenue des livres
enseignée en une leçon:—Ne les prêtez pas.»
VII
C'est qu'ils sont précieux, non pas tant par leur valeur intrinsèque,—
bien que certains d'entre eux représentent plus que leur poids d'or,—
que parce qu'on les aime, d'amour complexe peut-être, mais à coup
sûr d'amour vrai.
«Accordez-moi, seigneur, disait un ancien (c'est Jules Janin qui
rapporte ces paroles), une maison pleine de livres, un jardin plein de
fleurs!—Voulez-vous, disait-il encore, un abrégé de toutes les
misères humaines, regardez un malheureux qui vend ses livres:
Bibliothecam vendat.»
Si le malheureux vend ses livres parce qu'il y est contraint, non pas
par un caprice, une toquade de spéculation, une saute de goût,
passant de la bibliophilie à l'iconophilie ou à la faïençomanie ou à
tout autre dada frais éclos dans sa cervelle, ou encore sous le coup
d'une passionnette irrésistible dont quelques mois auront bientôt usé
l'éternité, comme il advint à Asselineau qui se défit de sa
bibliothèque pour suivre une femme et qui peu après se défit de la
femme pour se refaire une bibliothèque, si c'est, dis-je, par misère
pure, il faut qu'il soit bien marqué par le destin et qu'il ait de triples
galons dans l'armée des Pas-de-Chance, car les livres aiment ceux
qui les aiment et, le plus souvent leur portent bonheur. Témoin, pour
n'en citer qu'un, Grotius, qui s'échappa de prison en se mettant dans
un coffre à livres, lequel faisait la navette entre sa maison et sa
geôle, apportant et remportant les volumes qu'il avait obtenu de
faire venir de la fameuse bibliothèque formée à grands frais et avec
tant de soins, pour lui «et ses amis».
Richard de Bury, évêque de Durham et chancelier d'Angleterre, qui
vivait au XIVe siècle, rapporte, dans son Philobiblon, des vers latins
de John Salisbury, dont voici le sens:
Nul main que le fer a touchée n'est propre à
manier les livres,
ni celui dont le cœur regarde l'or avec trop de
joie;
les mêmes hommes n'aiment pas à la fois les
livres et l'argent,
et ton troupeau, ô Epicure, a pour les livres du
dégoût;
les avares et les amis des livre ne vont guère de
compagnie,
et ne demeurent point, tu peux m'en croire, en
paix sous le même toit.
«Personne donc, en conclut un peu vite le bon Richard de Bury, ne
peut servir en même temps les livres et Mammon».
Il reprend ailleurs: «Ceux qui sont férus de l'amour des livres font
bon marché du monde et des richesses».
Les temps sont quelque peu changés; il est en notre vingtième siècle
des amateurs dont on ne saurait dire s'ils estiment des livres
précieux pour en faire un jour une vente profitable, ou s'ils
dépensent de l'argent à accroître leur bibliothèque pour la seule
satisfaction de leurs goûts de collectionneur et de lettré.
Toujours est-il que le Philobiblon n'est qu'un long dithyrambe en
prose, naïf et convaincu, sur les livres et les joies qu'ils procurent. J'y
prends au hasard quelques phrases caractéristiques, qui, enfouies
dans ce vieux livre peu connu en France, n'ont pas encore eu le
temps de devenir banales parmi nous.
«Les livres nous charment lorsque la prospérité nous sourit; ils nous
réconfortent comme des amis inséparables lorsque la fortune
orageuse fronce le sourcil sur nous.»
Voilà une pensée qui a été exprimée bien des fois et que nous
retrouverons encore; mais n'a-t-elle pas un tour original qui lui
donne je ne sais quel air imprévu de nouveauté?
Le chapitre XV de l'ouvrage traite des «avantages de l'amour des
livres.» On y lit ceci:
«Il passe le pouvoir de l'intelligence humaine, quelque largement
qu'elle ait pu boire à la fontaine de Pégase, de développer
pleinement le titre du présent chapitre. Quand on parlerait avec la
langue des hommes et des anges, quand on serait devenu un
Mercure, un Tullius ou un Cicéron, quand on aurait acquis la douceur
de l'éloquence lactée de Tite-Live, on aurait encore à s'excuser de
bégayer comme Moïse, ou à confesser avec Jérémie qu'on n'est
qu'un enfant et qu'on ne sait point parler.»
Après ce début, qui s'étonnera que Richard de Bury fasse un devoir
à tous les honnêtes gens d'acheter des livres et de les aimer. «Il
n'est point de prix élevé qui doive empêcher quelqu'un d'acheter des
livres s'il a l'argent qu'on en demande, à moins que ce ne soit pour
résister aux artifices du vendeur ou pour attendre une plus favorable
occasion d'achat... Qu'on doive acheter les livres avec joie et les
vendre à regret, c'est à quoi Salomon, le soleil de l'humanité, nous
exhorte dans les Proverbes: «Achète la vérité, dit-il, et ne vends pas
la sagesse.»
On ne s'attendait guère, j'imagine, à voir Salomon dans cette affaire.
Et pourtant quoi de plus naturel que d'en appeler à l'auteur de la
Sagesse en une question qui intéresse tous les sages?
«Une bibliothèque prudemment composée est plus précieuse que
toutes les richesses, et nulle des choses qui sont désirables ne
sauraient lui être comparée. Quiconque donc se pique d'être zélé
pour la vérité, le bonheur, la sagesse ou la science, et même pour la
foi, doit nécessairement devenir un ami des livres.»
En effet, ajoute-t-il, en un élan croissant d'enthousiasme, «les livres
sont des maîtres qui nous instruisent sans verges ni férules, sans
paroles irritées, sans qu'il faille leur donner ni habits, ni argent. Si
vous venez à eux, ils ne dorment point; si vous questionnez et vous
enquérez auprès d'eux, ils ne se récusent point; ils ne grondent
point si vous faites des fautes; ils ne se moquent point de vous si
vous êtes ignorant. O livres, seuls êtres libéraux et libres, qui donnez
à tous ceux qui vous demandent, et affranchissez tous ceux qui vous
servent fidèlement!»
C'est pourquoi «les Princes, les prélats, les juges, les docteurs, et
tous les autres dirigeants de l'Etat, d'autant qu'ils ont plus que les
autres besoin de sagesse, doivent plus que les autres montrer du
zèle pour ces vases où la sagesse est contenue.»
Tel était l'avis du grand homme d'Etat Gladstone, qui acheta plus de
trente cinq mille volumes au cours de sa longue vie. «Un
collectionneur de livres, disait-il, dans une lettre adressée au fameux
libraire londonien Quaritch (9 septembre 1896), doit, suivant l'idée
que je m'en fais, posséder les six qualités suivantes: appétit, loisir,
fortune, science, discernement et persévérance.» Et plus loin:
«Collectionner des livres peut avoir ses ridicules et ses excentricités.
Mais, en somme, c'est un élément revivifiant dans une société
criblée de tant de sources de corruption.»
VIII
Cependant les livres, jusque dans la maison du bibliophile, ont un
implacable ennemi: c'est la femme. Je les entends se plaindre du
traitement que la maîtresse du logis, dès qu'elle en a l'occasion, leur
fait subir:
«La femme, toujours jalouse de l'amour qu'on nous porte, est
impossible à jamais apaiser. Si elle nous aperçoit dans quelque coin,
sans autre protection que la toile d'une araignée morte, elle nous
insulte et nous ravale, le sourcil froncé, la parole amère, affirmant
que, de tout le mobilier de la maison, nous seuls ne sommes pas
nécessaires; elle se plaint que nous ne soyons utiles à rien dans le
ménage, et elle conseille de nous convertir promptement en riches
coiffures, en soie, en pourpre deux fois teinte, en robes et en
fourrures, en laine et en toile. A dire vrai sa haine ne serait pas sans
motifs si elle pouvait voir le fond de nos cœurs, si elle avait écouté
nos secrets conseils, si elle avait lu le livre de Théophraste ou celui
de Valerius, si seulement elle avait écouté le XXVe chapitre de
l'Ecclésiaste avec des oreilles intelligentes.» (Richard de Bury.)
M. Octave Uzanne rappelle, dans les Zigs-Zags d'un Curieux, un mot
du bibliophile Jacob, frappé en manière de proverbe et qui est bien
en situation ici:
Amours de femme et de bouquin,
Ne se chantent pas au même lutrin.
Et il ajoute fort à propos: «La passion bouquinière n'admet pas de
partage; c'est un peu, il faut le dire, une passion de retraite, un
refuge extrême à cette heure de la vie où l'homme, déséquilibré par
les cahots de l'existence mondaine, s'écrie, à l'exemple de Thomas
Moore: Je n'avais jusqu'ici pour lire que les regards des femmes, et
c'est la folie qu'ils m'ont enseignée!»
Cette incapacité des femmes, sauf de rares exceptions, à goûter les
joies du bibliophile, a été souvent remarquée. Une d'elles—et c'est
ce qui rend la citation piquante—Mme Emile de Girardin, écrivait dans
la chronique qu'elle signait à la Presse du pseudonyme de Vicomte
de Launay:
«Voyez ce beau salon d'étude, ce boudoir charmant; admirez-le dans
ses détails, vous y trouverez tout ce qui peut séduire, tout ce que
vous pouvez désirer, excepté deux choses pourtant: un beau livre et
un joli tableau. Il n'y a peut-être pas dix femmes à Paris chez
lesquelles ces deux raretés puissent être admirées.»
C'est dans le même ordre d'idées que l'américain Hawthorne, le fils
de l'auteur du Faune de Marbre et de tant d'autres ouvrages où une
sereine philosophie se pare des agréments de la fiction, a écrit ces
lignes curieuses:
«Cœlebs, grand amateur de bouquins, se rase devant son miroir, et
monologue sur la femme qui, d'après son expérience, jeune ou
vieille, laide ou belle, est toujours le diable.» Et Cœlebs finit en se
donnant à lui-même ces conseils judicieux: «Donc, épouse tes livres!
Il ne recherche point d'autre maîtresse, l'homme sage qui regarde,
non la surface, mais le fond des choses. Les livres ne flirtent ni ne
feignent; ne boudent ni ne taquinent; ils ne se plaignent pas, ils
disent les choses, mais ils s'abstiennent de vous les demander.
»Que les livres soient ton harem, et toi leur Grand Turc. De rayon en
rayon, ils attendent tes faveurs, silencieux et soumis! Jamais la
jalousie ne les agite. Je n'ai nulle part rencontré Vénus, et j'accorde
qu'elle est belle; toujours est-il qu'elle n'est pas de beaucoup si
accommodante qu'eux.»
IX
Comment n'aimerait-on pas les livres? Il en est pour tous les goûts,
ainsi qu'un auteur du Chansonnier des Grâces le fait chanter à un
libraire vaudevillesque (1820):
Venez, lecteurs, chez un libraire
De vous servir toujours jaloux;
Vos besoins ainsi que vos goûts
Chez moi pourront se satisfaire.
J'offre la Grammaire aux auteurs,
Des Vers à nos jeunes poëtes;
L'Esprit des lois aux procureurs,
L'Essai sur l'homme à nos coquettes...
Aux plus célèbres gastronomes
Je donne Racine et Boileau!
La Harpe aux chanteurs de caveau,
Les Nuits d'Young aux astronomes;
J'ai Descartes pour les joueurs,
Voiture pour toutes les belles,
Lucrèce pour les amateurs,
Martial pour les demoiselles.
Pour le plaideur et l'adversaire
J'aurai l'avocat Patelin;
Le malade et le médecin
Chez moi consulteront Molière:
Pour un sexe trop confiant
Je garde le Berger fidèle;
Et pour le malheureux amant
Je réserverai la Pucelle.
Armand Gouffé était d'un autre avis lorsqu'il fredonnait:
Un sot avec cent mille francs
Peut se passer de livres.
Mais les sots très riches ont généralement juste assez d'esprit pour
retrancher et masquer leur sottise derrière l'apparat imposant d'une
grande bibliothèque, où les bons livres consacrés par le temps et le
jugement universel se partagent les rayons avec les ouvrages à la
mode. Car si, comme le dit le proverbe allemand, «l'âne n'est pas
savant parce qu'il est chargé de livres», il est des cas où l'amas des
livres peut cacher un moment la nature de l'animal.
C'est en pensant aux amateurs de cet acabit que Chamfort a formulé
cette maxime: «L'espoir n'est souvent au cœur que ce que la
bibliothèque d'un château est à la personne du maître.»
Lilly, le fameux auteur d'Euphues, disait: «Aie ton cabinet plein de
livres plutôt que ta bourse pleine d'argent». Le malheur est que
remplir l'un a vite fait de vider l'autre, si les sources dont celle-ci
s'alimente ne sont pas d'une abondance continue.
L'historien Gibbon allait plus loin lorsqu'il déclarait qu'il n'échangerait
pas le goût de la lecture contre tous les trésors de l'Inde. De même
Macaulay, qui aurait mieux aimé être un pauvre homme avec des
livres qu'un grand roi sans livres.
Bien avant eux, Claudius Clément, dans son traité latin des
bibliothèques, tant privées que publiques, émettait, avec des
restrictions de sage morale, une idée semblable: «Il y a peu de
dépenses, de profusions, je dirais même de prodigalités plus
louables que celles qu'on fait pour les livres, lorsqu'en eux on
cherche un refuge, la volupté de l'âme, l'honneur, la pureté des
mœurs, la doctrine et un renom immortel.»
«L'or, écrivait Pétrarque à son frère Gérard, l'argent, les pierres
précieuses, les vêtements de pourpre, les domaines, les tableaux,
les chevaux, toutes les autres choses de ce genre offrent un plaisir
changeant et de surface: les livres nous réjouissent jusqu'aux
moëlles.»
C'est encore Pétrarque qui traçait ce tableau ingénieux et charmant:
«J'ai des amis dont la société m'est extrêmement agréable; ils sont
de tous les âges et de tous les pays. Ils se sont distingués dans les
conseils et sur les champs de bataille, et ont obtenu de grands
honneurs par leur connaissance des sciences. Il est facile de trouver
accès près d'eux; en effet ils sont toujours à mon service, je les
admets dans ma société ou les congédie quand il me plaît. Ils ne
sont jamais importuns, et ils répondent aussitôt à toutes les
questions que je leur pose. Les uns me racontent les événements
des siècles passés, les autres me révèlent les secrets de la nature. Il
en est qui m'apprennent à vivre, d'autres à mourir. Certains, par leur
vivacité, chassent mes soucis et répandent en moi la gaieté: d'autres
donnent du courage à mon âme, m'enseignant la science si
importante de contenir ses désirs et de ne compter absolument que
sur soi. Bref, ils m'ouvrent les différentes avenues de tous les arts et
de toutes les sciences, et je peux, sans risque, me fier à eux en
toute occasion. En retour de leurs services, ils ne me demandent
que de leur fournir une chambre commode dans quelque coin de
mon humble demeure, où ils puissent reposer en paix, car ces amis-
là trouvent plus de charmes à la tranquillité de la retraite qu'au
tumulte de la société.»
Il faut comparer ce morceau au passage où notre Montaigne, après
avoir parlé du commerce des hommes et de l'amour des femmes,
dont il dit: «l'un est ennuyeux par sa rareté, l'aultre se flestrit par
l'usage», déclare que celui des livres «est bien plus seur et plus à
nous; il cède aux premiers les aultres advantages, mais il a pour sa
part la constance et facilité de son service... Il me console en la
vieillesse et en la solitude; il me descharge du poids d'une oysiveté
ennuyeuse et me desfaict à toute heure des compagnies qui me
faschent; il esmousse les poinctures de la douleur, si elle n'est du
tout extrême et maistresse. Pour me distraire d'une imagination
importune, il n'est que de recourir aux livres...
«Le fruict que je tire des livres... j'en jouïs, comme les avaricieux des
trésors, pour sçavoir que j'en jouïray quand il me plaira: mon âme se
rassasie et contente de ce droit de possession... Il ne se peult dire
combien je me repose et séjourne en ceste considération qu'ils sont
à mon côté pour me donner du plaisir à mon heure, et à
recognoistre combien ils portent de secours à ma vie. C'est la
meilleure munition que j'aye trouvé à cest humain voyage; et plainds
extrêmement les hommes d'entendement qui l'ont à dire.»
Sur ce thème, les variations sont infinies et rivalisent d'éclat et
d'ampleur.
Le roi d'Egypte Osymandias, dont la mémoire inspira à Shelley un
sonnet si beau, avait inscrit au-dessus de sa «librairie»:
Pharmacie de l'âme.
«Une chambre sans livres est un corps sans âme», disait Cicéron.
«La poussière des bibliothèques est une poussière féconde»,
renchérit Werdet.
«Les livres ont toujours été la passion des honnêtes gens», affirme
Ménage.
Sir John Herschel était sûrement de ces honnêtes gens dont parle le
bel esprit érudit du XVIIe siècle, car il fait cette déclaration, que
Gibbon eût signée:
«Si j'avais à demander un goût qui pût me conserver ferme au
milieu des circonstances les plus diverses et être pour moi une
source de bonheur et de gaieté à travers la vie et un bouclier contre
ses maux, quelque adverses que pussent être les circonstances et de
quelques rigueurs que le monde pût m'accabler, je demanderais le
goût de la lecture.»
«Autant vaut tuer un homme que détruire un bon livre», s'écrie
Milton; et ailleurs, en un latin superbe que je renonce à traduire:
Et totum rapiunt me, mea vita, libri.
«Pourquoi, demandait Louis XIV au maréchal de Vivonne, passez-
vous autant de temps avec vos livres?—Sire, c'est pour qu'ils
donnent à mon esprit le coloris, la fraîcheur et la vie que donnent à
mes joues les excellentes perdrix de Votre Majesté.»
Voilà une aimable réponse de commensal et de courtisan. Mais
combien d'enthousiastes se sentiraient choqués de cet épicuréisme
flatteur et léger! Ce n'est pas le poète anglais John Florio, qui
écrivait au commencement du même siècle, dont on eût pu attendre
une explication aussi souriante et dégagée. Il le prend plutôt au
tragique, quand il s'écrie:
«Quels pauvres souvenirs sont statues, tombes et autres
monuments que les hommes érigent aux princes, et qui restent en
des lieux fermés où quelques-uns à peine les voient, en comparaison
des livres, qui aux yeux du monde entier montrent comment ces
princes vécurent, tandis que les autres monuments montrent où ils
gisent!»
C'est à dessein, je le répète, que j'accumule les citations d'auteurs
étrangers. Non seulement, elles ont moins de chances d'être
connues, mais elles possèdent je ne sais quelle saveur d'exotisme
qu'on ne peut demander à nos écrivains nationaux.
Ecoutons Isaac Barrow exposer sagement la leçon de son
expérience:
«Celui qui aime les livres ne manque jamais d'un ami fidèle, d'un
conseiller salutaire, d'un gai compagnon, d'un soutien efficace. En
étudiant, en pensant, en lisant, l'on peut innocemment se distraire et
agréablement se récréer dans toutes les saisons comme dans toutes
les fortunes.»
Jeremy Collier, pensant de même, ne s'exprime guère autrement:
«Les livres sont un guide dans la jeunesse et une récréation dans le
grand âge. Ils nous soutiennent dans la solitude et nous empêchent
d'être à charge à nous-mêmes. Ils nous aident à oublier les ennuis
qui nous viennent des hommes et des choses; ils calment nos soucis
et nos passions; ils endorment nos déceptions. Quand nous sommes
las des vivants, nous pouvons nous tourner vers les morts: ils n'ont
dans leur commerce, ni maussaderie, ni orgueil, ni arrière-pensée.»
Parmi les joies que donnent les livres, celle de les rechercher, de les
pourchasser chez les libraires et les bouquinistes, n'est pas la
moindre. On a écrit des centaines de chroniques, des études, des
traités et des livres sur ce sujet spécial. La Physiologie des quais de
Paris, de M. Octave Uzanne, est connue de tous ceux qui
s'intéressent aux bouquins. On se rappelle moins un brillant article
de Théodore de Banville, qui parut jadis dans un supplément
littéraire du Figaro; aussi me saura-t-on gré d'en citer ce joli
passage:
«Sur le quai Voltaire, il y aurait de quoi regarder et s'amuser
pendant toute une vie; mais sans tourner, comme dit Hésiode,
autour du chêne et du rocher, je veux nommer tout de suite ce qui
est le véritable sujet, l'attrait vertigineux, le charme invincible: c'est
le Livre ou, pour parler plus exactement, le Bouquin. Il y a sur le
quai de nombreuses boutiques, dont les marchands, véritables
bibliophiles, collectionnent, achètent dans les ventes, et offrent aux
consommateurs de beaux livres à des prix assez honnêtes. Mais ce
n'est pas là ce que veut l'amateur, le fureteur, le découvreur de
trésors mal connus. Ce qu'il veut, c'est trouver pour des sous, pour
rien, dans les boîtes posées sur le parapet, des livres, des bouquins
qui ont—ou qui auront—un grand prix, ignoré du marchand.
«Et à ce sujet, un duel, qui n'a pas eu de commencement et n'aura
pas de fin, recommence et se continue sans cesse entre le marchand
et l'amateur. Le libraire, qui, naturellement, veut vendre cher sa
marchandise, se hâte de retirer des boîtes et de porter dans la
boutique tout livre soupçonné d'avoir une valeur; mais par une force
étrange et surnaturelle, le Livre s'arrange toujours pour revenir, on
ne sait pas comment ou par quels artifices, dans les boîtes du
parapet. Car lui aussi a ses opinions; il veut être acheté par
l'amateur, avec des sous, et surtout et avant tout, par amour!»
C'est ainsi que M. Jean Rameau, poète et bibliophile, raconte qu'il a
trouvé, en cette année 1901, dans une boîte des quais, à vingt-cinq
centimes, quatre volumes, dont le dos élégamment fleuri portait un
écusson avec la devise: Boutez en avant. C'était un abrégé du
Faramond de la Calprenède, et les quatre volumes avaient appartenu
à la Du Barry, dont le Boutez en avant est suffisamment
caractéristique. Que fit le poète, lorsqu'il se fut renseigné auprès du
baron de Claye, qui n'hésite point sur ces questions? Il alla dès sept
heures du matin se poster devant l'étalage, avala le brouillard de la
Seine, s'en imprégna et y développa des «rhumatismes atroces»
jusqu'à onze heures du matin,—car le bouquiniste, ami du
nonchaloir, ne vint pas plus tôt,—prit les volumes et «bouta une
pièce d'un franc» en disant: «Vous allez me laisser ça pour quinze
sous, hein?»—«Va pour quinze sous!» fit le bouquiniste bonhomme!
Et le poète s'enfuit avec son butin, et aussi, par surcroît, «avec un
petit frisson de gloire».
Puisque nous sommes sur le quai Voltaire, ne le quittons pas sans le
regarder à travers la lunette d'un poète dont le nom, Gabriel Marc,
n'éveille pas de retentissants échos, mais qui, depuis 1875, année
où il publiait ses Sonnets parisiens, a dû parfois éprouver l'émotion—
amère et douce—exprimée en trait final dans le gracieux tableau
qu'il intitule: En bouquinant.
Le quai Voltaire est un véritable musée
En plein soleil. Partout, pour charmer les
regards,
Armes, bronzes, vitraux, estampes, objets d'art,
Et notre flânerie est sans cesse amusée.
Avec leur reliure ancienne et presque usée,
Voici les manuscrits sauvés par le hasard;
Puis les livres: Montaigne, Hugo, Chénier,
Ponsard,
Ou la petite toile au Salon refusée.
Le ciel bleuâtre et clair noircit à l'horizon.
Le pêcheur à la ligne a jeté l'hameçon;
Et la Seine se ride aux souffles de la brise.
Ou la petite toile au Salon refusée.
On bouquine. On revoit, sous la poudre des
temps,
Tous les chers oubliés; et parfois, ô surprise!
Le volume de vers que l'on fit à vingt ans.
Un autre contemporain, Mr. J. Rogers Rees, qui a écrit tout un livre
sur les plaisirs du bouquineur (the Pleasures of a Bookworm), trouve
dans le commerce des livres une source de fraternité et de solidarité
humaines. «Un grand amour pour les livres, dit-il, a en soi, dans
tous les temps, le pouvoir d'élargir le cœur et de le remplir de
facultés sympathiques plus larges et véritablement éducatrices.»
Un poète américain, Mr. C. Alex. Nelson, termine une pièce à laquelle
il donne ce titre français: Les Livres, par une prière naïve, dont les
deux derniers vers sont aussi en français dans le texte:
Les amoureux du livre, tous d'un cœur
reconnaissant,
toujours exhalèrent une prière unique:
Que le bon Dieu préserve les livres
et sauve la Société!
Le vieux Chaucer ne le prenait pas de si haut: doucement et
poétiquement il avouait que l'attrait des livres était moins puissant
sur son cœur que l'attrait de la nature.
Je voudrais pouvoir mettre dans mon essai de traduction un peu du
charme poétique qui, comme un parfum très ancien, mais persistant
et d'autant plus suave, se dégage de ces vers dans le texte original.
Quant à moi, bien que je ne sache que peu de
chose,
à lire dans les livres je me délecte,
et j'y donne ma foi et ma pleine croyance,
et dans mon cœur j'en garde le respect
si sincèrement qu'il n'y a point de plaisir
qui puisse me faire quitter mes livres,
si ce n'est, quelques rares fois, le jour saint,
sauf aussi, sûrement, lorsque, le mois de mai
venu, j'entends les oiseaux chanter,
et que les fleurs commencent à surgir,—
alors adieu mon livre et ma dévotion!
Comment encore conserver en mon français sans rimes et
péniblement rythmé l'harmonie légère et gracieuse, pourtant si nette
et précise, de ce délicieux couplet d'une vieille chanson populaire,
que tout Anglais sait par cœur:
Oh! un livre et, dans l'ombre un coin,
soit à la maison, soit dehors,
les vertes feuilles chuchotant sur ma tête,
ou les cris de la rue autour de moi;
là où je puisse lire tout à mon aise
aussi bien du neuf que du vieux!
Car un brave et bon livre à parcourir
vaut pour moi mieux que de l'or!
Mais il faut s'arrêter dans l'éloge. Je ne saurais mieux conclure, sur
ce sujet entraînant, qu'en prenant à mon compte et en offrant aux
autres ces lignes d'un homme qui fut, en son temps, le «prince de la
critique» et dont le nom même commence à être oublié. Nous
pouvons tous, amis, amoureux, dévots ou maniaques du livre, nous
écrier avec Jules Janin:
«O mes livres! mes économies et mes amours! une fête à mon foyer,
un repos à l'ombre du vieil arbre, mes compagnons de voyage!... et
puis, quand tout sera fini pour moi, les témoins de ma vie et de mon
labeur!»
Neural Networks And Deep Learning Charu C Aggarwal
X
A côté de ceux qui adorent les livres, les chantent et les bénissent, il
y a ceux qui les détestent, les dénigrent et leur crient anathème; et
ceux-ci ne sont pas les moins passionnés.
On voit nettement la transition, le passage d'un de ces deux
sentiments à l'autre, en même temps que leur foncière identité, dans
ces vers de Jean Richepin (Les Blasphèmes):
Peut-être, ô Solitude, est-ce toi qui délivres
De cette ardente soif que l'ivresse des livres
Ne saurait étancher aux flots de son vin noir.
J'en ai bu comme si j'étais un entonnoir,
De ce vin fabriqué, de ce vin lamentable;
J'en ai bu jusqu'à choir lourdement sous la
table,
A pleine gueule, à plein amour, à plein cerveau.
Mais toujours, au réveil, je sentais de nouveau
L'inextinguible soif dans ma gorge plus rêche.
On ne s'étonnera pas, je pense, que sa gorge étant plus rêche, le
poète songe à la mieux rafraîchir et achète, pour ce, des livres
superbes qui lui mériteront, quand on écrira sa biographie définitive,
un chapitre, curieux entre maint autre, intitulé: «Richepin,
bibliophile.»
D'une veine plus froide et plus méprisante, mais, après tout, peu
dissemblable, sort cette boutade de Baudelaire (Œuvres
posthumes):
«L'homme d'esprit, celui qui ne s'accordera jamais avec personne,
doit s'appliquer à aimer la conversation des imbéciles et la lecture
des mauvais livres. Il en tirera des jouissances amères qui
compenseront largement sa fatigue.»
L'auteur du traité De la Bibliomanie n'y met point tant de finesse. Il
déclare tout à trac que «la folle passion des livres entraîne souvent
au libertinage et à l'incrédulité».
Encore faudrait-il savoir où commence «la folle passion», car le
même écrivain (Bollioud-Mermet) ne peut s'empêcher, un peu plus
loin, de reconnaître que «les livres simplement agréables
contiennent, ainsi que les plus sérieux, des leçons utiles pour les
cœurs droits et pour les bons esprits».
Pétrarque avait déjà exprimé une pensée analogue dans son élégant
latin de la Renaissance: «Les livres mènent certaines personnes à la
science, et certaines autres à la folie, lorsque celles-ci en absorbent
plus qu'elles ne peuvent digérer.»
Libri quosdam ad scientiam, quosdam ad insaniam deduxere, dum
plus hauriunt quam digerunt.
This recalls a neat saying attributed to the painter Doyen about a
man more erudite than judicious: "His head is the shop of a
bookseller who is moving out."
It is, in short, a question of choice. It has been repeated often
enough since Seneca, and it had surely been said more than once
before him: "What matters is not to have many books, but to have
good ones."
That is not the point of view of the bibliomaniacs; but we are not
concerned with them for the moment. As for discerning bibliophiles,
even those whom the book delights in itself far more than by what it
contains, they are quite willing to have many, but above all to have
beautiful ones, coming as close as possible to perfection; and rather
than admit flawed or mediocre copies to their shelves, they too would
take for their motto: Pauca sed bona.
"One of the diseases of this age," says an Englishman (Barnaby Rich),
"is the multitude of books, which so overload the reader that he can
no longer digest the abundance of idle matter hatched and brought
into the world every day, in forms as varied as the very features of
the authors' faces."
To own many is largesse;
To study few is wisdom,
declares a proverb quoted by Jules Janin.
Michel de Montaigne, who turned books to account as well as any man
in the world and who spoke of them in the enthusiastic and grateful
terms quoted earlier, nevertheless makes reservations, though only as
regards physical development and health.
"Books," he says, "have many qualities agreeable to those who know
how to choose them; but no good comes without pains; it is a pleasure
that is not clean and pure, any more than the others; it has its
inconveniences, and weighty ones; the soul is exercised by it, but
the body remains without action, grows heavy and languishes."
Even the soul comes at last to weariness and disgust, as the English
poet Crabbe observes: "Books cannot always please, however good they
may be; the mind does not always crave its food."
An Italian proverb brings us back, with a lively and original phrase,
to the moralists' theory of good and bad reading: "No thief is worse
than a bad book."
What thief, indeed, ever thought of stealing innocence, purity,
beliefs, noble impulses? And the moralists assure us that there are
books which strip the soul of all these.
"Better had he never been born," exclaims Walter Scott, "who reads to
arrive at doubt, who reads to arrive at contempt of what is good."
A contemporary English writer, Mr. Lowell, gives an ingenious turn to
the expression of a similar idea when he writes:
"Cato's advice, Cum bonis ambula, Walk with the good, is just as true
if extended to books, for they too impart, by insensible degrees,
their own nature to the mind that converses with them. Either they
lift us up, or they drag us down."
The wise, who weigh the pros and the cons and, holding to a happy
medium, credit books with an influence sometimes good, sometimes bad,
often nil, according to their nature and to the readers' turn of
mind, are, I believe, the most numerous.
The Hellenist Egger brings to the statement of this judiciously
balanced opinion a tone of enthusiasm from which one can guess that
he forgives the book all its misdeeds for the joys and the help it
knows how to give.
"The greatest personage who, for perhaps 3,000 years, has kept the
world talking about him, by turns giant or pygmy, proud or modest,
enterprising or timid, able to take on every form and every role,
capable in turn of enlightening minds or of perverting them, of
stirring the passions or of calming them, a maker of factions or a
reconciler of parties, a true Proteus whom no definition can capture:
such is the Book."
A little-known moralist of the eighteenth century, L.-C. d'Arc,
author of a book entitled Mes Loisirs, whom I have quoted elsewhere,
dreads excess in reading, that "labor of the lazy," as it has rather
aptly been called:
"Reading is the nourishment of the mind and sometimes the tomb of
genius."
"He who reads much runs the risk of thinking only as others have
thought."
Welcome to our website – the perfect destination for book lovers and
knowledge seekers. We believe that every book holds a new world,
offering opportunities for learning, discovery, and personal growth.
That’s why we are dedicated to bringing you a diverse collection of
books, ranging from classic literature and specialized publications to
self-development guides and children's books.
More than just a book-buying platform, we strive to be a bridge
connecting you with timeless cultural and intellectual values. With an
elegant, user-friendly interface and a smart search system, you can
quickly find the books that best suit your interests. Additionally,
our special promotions and home delivery services help you save time
and fully enjoy the pleasure of reading.
Join us on a journey of knowledge exploration, passion nurturing, and
personal growth every day!
ebookbell.com

Neural Networks And Deep Learning Charu C Aggarwal

  • 1. Neural Networks And Deep Learning Charu C Aggarwal download https://ebookbell.com/product/neural-networks-and-deep-learning- charu-c-aggarwal-59041922 Explore and download more ebooks at ebookbell.com
  • 2. Here are some recommended products that we believe you will be interested in. You can click the link to download. Neural Networks And Deep Learning A Textbook Charu C Aggarwal https://ebookbell.com/product/neural-networks-and-deep-learning-a- textbook-charu-c-aggarwal-49464354 Neural Networks And Deep Learning A Textbook 2nd Edition 2nd Ed 2023 Charu C Aggarwal https://ebookbell.com/product/neural-networks-and-deep-learning-a- textbook-2nd-edition-2nd-ed-2023-charu-c-aggarwal-50699788 Neural Networks And Deep Learning Theoretical Insights And Frameworks Dr Vishwas Mishra https://ebookbell.com/product/neural-networks-and-deep-learning- theoretical-insights-and-frameworks-dr-vishwas-mishra-56221402 Neural Networks And Deep Learning A Textbook Charu C Aggarwal https://ebookbell.com/product/neural-networks-and-deep-learning-a- textbook-charu-c-aggarwal-7166102
  • 3. Neural Networks And Deep Learning Deep Learning Explained To Your Granny A Visual Introduction For Beginners Who Want To Make Their Own Deep Learning Neural Network Pat Nakamoto https://ebookbell.com/product/neural-networks-and-deep-learning-deep- learning-explained-to-your-granny-a-visual-introduction-for-beginners- who-want-to-make-their-own-deep-learning-neural-network-pat- nakamoto-11565930 Neural Networks And Deep Learning A Textbook 2nd Edition 2nd Edition Charu C Aggarwal https://ebookbell.com/product/neural-networks-and-deep-learning-a- textbook-2nd-edition-2nd-edition-charu-c-aggarwal-50687826 Neural Networks And Deep Learning Deep Learning Explained To Your Granny A Visual Introduction For Beginners Who Want To Make Their Own Deep Learning Neural Network Machine Learning Nakamoto https://ebookbell.com/product/neural-networks-and-deep-learning-deep- learning-explained-to-your-granny-a-visual-introduction-for-beginners- who-want-to-make-their-own-deep-learning-neural-network-machine- learning-nakamoto-36065668 Neural Networks And Deep Learning Deep Learning Explained To Your Granny A Visual Introduction For Beginners Who Want To Make Their Own Deep Learning Neural Network Machine Learning Pat Nakamoto https://ebookbell.com/product/neural-networks-and-deep-learning-deep- learning-explained-to-your-granny-a-visual-introduction-for-beginners- who-want-to-make-their-own-deep-learning-neural-network-machine- learning-pat-nakamoto-42058928 Neural Networks And Deep Learning Deep Learning Explained To Your Granny A Visual Introduction For Beginners Who Want To Make Their Own Deep Learning Neural Network Machine Learning Nakamoto https://ebookbell.com/product/neural-networks-and-deep-learning-deep- learning-explained-to-your-granny-a-visual-introduction-for-beginners- who-want-to-make-their-own-deep-learning-neural-network-machine- learning-nakamoto-36068052
  • 4. Neural Networks And Deep Learning Deep Learning Explained To Your Granny A Visual Introduction For Beginners Who Want To Make Their Own Deep Learning Neural Network Machine Learning Pat Nakamoto https://ebookbell.com/product/neural-networks-and-deep-learning-deep- learning-explained-to-your-granny-a-visual-introduction-for-beginners- who-want-to-make-their-own-deep-learning-neural-network-machine- learning-pat-nakamoto-37277310
  • 6. Neural Networks and Deep Learning
  • 7. Charu C. Aggarwal Neural Networks and Deep Learning A Textbook 123
  • 8. Charu C. Aggarwal IBM T. J. Watson Research Center International Business Machines Yorktown Heights, NY, USA ISBN 978-3-319-94462-3 ISBN 978-3-319-94463-0 (eBook) https://doi.org/10.1007/978-3-319-94463-0 Library of Congress Control Number: 2018947636 c Springer International Publishing AG, part of Springer Nature 2018 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, com- puter software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
  • 9. To my wife Lata, my daughter Sayani, and my late parents Dr. Prem Sarup and Mrs. Pushplata Aggarwal.
  • 10. Preface “Any A.I. smart enough to pass a Turing test is smart enough to know to fail it.”—Ian McDonald Neural networks were developed to simulate the human nervous system for machine learning tasks by treating the computational units in a learning model in a manner similar to human neurons. The grand vision of neural networks is to create artificial intelligence by building machines whose architecture simulates the computations in the human nervous system. This is obviously not a simple task because the computational power of the fastest computer today is a minuscule fraction of the computational power of a human brain. Neural networks were developed soon after the advent of computers in the fifties and sixties. Rosenblatt’s perceptron algorithm was seen as a fundamental cornerstone of neural networks, which caused an initial excitement about the prospects of artificial intelligence. However, after the initial euphoria, there was a period of disappointment in which the data-hungry and computationally intensive nature of neural networks was seen as an impediment to their usability. Eventually, at the turn of the century, greater data availability and increasing computational power led to increased successes of neural networks, and this area was reborn under the new label of “deep learning.” Although we are still far from the day that artificial intelligence (AI) is close to human performance, there are specific domains like image recognition, self-driving cars, and game playing, where AI has matched or exceeded human performance. It is also hard to predict what AI might be able to do in the future. For example, few computer vision experts would have thought two decades ago that any automated system could ever perform an intuitive task like categorizing an image more accurately than a human. Neural networks are theoretically capable of learning any mathematical function with sufficient training data, and some variants like recurrent neural networks are known to be Turing complete. Turing completeness refers to the fact that a neural network can simulate any learning algorithm, given sufficient training data. The sticking point is that the amount of data required to learn even simple tasks is often extraordinarily large, which causes a corresponding increase in training time (if we assume that enough training data is available in the first place). For example, the training time for image recognition, which is a simple task for a human, can be on the order of weeks even on high-performance systems. Furthermore, there are practical issues associated with the stability of neural network training, which are being resolved even today. Nevertheless, given that the speed of computers is
  • 11. expected to increase rapidly over time, and fundamentally more powerful paradigms like quantum computing are on the horizon, the computational issue might not eventually turn out to be quite as critical as imagined. Although the biological analogy of neural networks is an exciting one and evokes comparisons with science fiction, the mathematical understanding of neural networks is a more mundane one. The neural network abstraction can be viewed as a modular approach of enabling learning algorithms that are based on continuous optimization on a computational graph of dependencies between the input and output. To be fair, this is not very different from traditional work in control theory; indeed, some of the methods used for optimization in control theory are strikingly similar to (and historically preceded) the most fundamental algorithms in neural networks. However, the large amounts of data available in recent years together with increased computational power have enabled experimentation with deeper architectures of these computational graphs than was previously possible. The resulting success has changed the broader perception of the potential of deep learning. The chapters of the book are organized as follows: 1. The basics of neural networks: Chapter 1 discusses the basics of neural network design. Many traditional machine learning models can be understood as special cases of neural learning. Understanding the relationship between traditional machine learning and neural networks is the first step to understanding the latter. The simulation of various machine learning models with neural networks is provided in Chapter 2. This will give the analyst a feel of how neural networks push the envelope of traditional machine learning algorithms. 2. Fundamentals of neural networks: Although Chapters 1 and 2 provide an overview of the training methods for neural networks, a more detailed understanding of the training challenges is provided in Chapters 3 and 4. Chapters 5 and 6 present radial-basis function (RBF) networks and restricted Boltzmann machines. 3. Advanced topics in neural networks: A lot of the recent success of deep learning is a result of the specialized architectures for various domains, such as recurrent neural networks and convolutional neural networks. Chapters 7 and 8 discuss recurrent and convolutional neural networks. Several advanced topics like deep reinforcement learning, neural Turing mechanisms, and generative adversarial networks are discussed in Chapters 9 and 10. We have taken care to include some of the “forgotten” architectures like RBF networks and Kohonen self-organizing maps because of their potential in many applications. The book is written for graduate students, researchers, and practitioners. Numerous exercises are available along with a solution manual to aid in classroom teaching. Where possible, an application-centric view is highlighted in order to give the reader a feel for the technology. Throughout this book, a vector or a multidimensional data point is annotated with a bar, such as X or y. A vector or multidimensional point may be denoted by either small letters or capital letters, as long as it has a bar. Vector dot products are denoted by centered dots, such as X · Y . A matrix is denoted in capital letters without a bar, such as R. Throughout the book, the n × d matrix corresponding to the entire training data set is denoted by D, with n documents and d dimensions. 
The individual data points in D are therefore d-dimensional row vectors. On the other hand, vectors with one component for each data
  • 12. PREFACE IX point are usually n-dimensional column vectors. An example is the n-dimensional column vector y of class variables of n data points. An observed value yi is distinguished from a predicted value ŷi by a circumflex at the top of the variable. Yorktown Heights, NY, USA Charu C. Aggarwal
  • 13. Acknowledgments I would like to thank my family for their love and support during the busy time spent in writing this book. I would also like to thank my manager Nagui Halim for his support during the writing of this book. Several figures in this book have been provided by the courtesy of various individuals and institutions. The Smithsonian Institution made the image of the Mark I perceptron (cf. Figure 1.5) available at no cost. Saket Sathe provided the outputs in Chapter 7 for the tiny Shakespeare data set, based on code available/described in [233, 580]. Andrew Zisserman provided Figures 8.12 and 8.16 in the section on convolutional visualizations. Another visualization of the feature maps in the convolution network (cf. Figure 8.15) was provided by Matthew Zeiler. NVIDIA provided Figure 9.10 on the convolutional neural network for self-driving cars in Chapter 9, and Sergey Levine provided the image on self- learning robots (cf. Figure 9.9) in the same chapter. Alec Radford provided Figure 10.8, which appears in Chapter 10. Alex Krizhevsky provided Figure 8.9(b) containing AlexNet. This book has benefitted from significant feedback and several collaborations that I have had with numerous colleagues over the years. I would like to thank Quoc Le, Saket Sathe, Karthik Subbian, Jiliang Tang, and Suhang Wang for their feedback on various portions of this book. Shuai Zheng provided feedbback on the section on regularized autoencoders in Chapter 4. I received feedback on the sections on autoencoders from Lei Cai and Hao Yuan. Feedback on the chapter on convolutional neural networks was provided by Hongyang Gao, Shuiwang Ji, and Zhengyang Wang. Shuiwang Ji, Lei Cai, Zhengyang Wang and Hao Yuan also reviewed the Chapters 3 and 7, and suggested several edits. They also suggested the ideas of using Figures 8.6 and 8.7 for elucidating the convolution/deconvolution operations. For their collaborations, I would like to thank Tarek F. Abdelzaher, Jinghui Chen, Jing Gao, Quanquan Gu, Manish Gupta, Jiawei Han, Alexander Hinneburg, Thomas Huang, Nan Li, Huan Liu, Ruoming Jin, Daniel Keim, Arijit Khan, Latifur Khan, Mohammad M. Masud, Jian Pei, Magda Procopiuc, Guojun Qi, Chandan Reddy, Saket Sathe, Jaideep Sri- vastava, Karthik Subbian, Yizhou Sun, Jiliang Tang, Min-Hsuan Tsai, Haixun Wang, Jiany- ong Wang, Min Wang, Suhang Wang, Joel Wolf, Xifeng Yan, Mohammed Zaki, ChengXiang Zhai, and Peixiang Zhao. I would also like to thank my advisor James B. Orlin for his guid- ance during my early years as a researcher. XI
  • 14. XII ACKNOWLEDGMENTS I would like to thank Lata Aggarwal for helping me with some of the figures created using PowerPoint graphics in this book. My daughter, Sayani, was helpful in incorporating special effects (e.g., image color, contrast, and blurring) in several JPEG images used at various places in this book.
  • 15. Contents 1 An Introduction to Neural Networks 1 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Humans Versus Computers: Stretching the Limits of Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 The Basic Architecture of Neural Networks . . . . . . . . . . . . . . . . . . 4 1.2.1 Single Computational Layer: The Perceptron . . . . . . . . . . . . . 5 1.2.1.1 What Objective Function Is the Perceptron Optimizing? . 8 1.2.1.2 Relationship with Support Vector Machines . . . . . . . . . 10 1.2.1.3 Choice of Activation and Loss Functions . . . . . . . . . . 11 1.2.1.4 Choice and Number of Output Nodes . . . . . . . . . . . . 14 1.2.1.5 Choice of Loss Function . . . . . . . . . . . . . . . . . . . . 14 1.2.1.6 Some Useful Derivatives of Activation Functions . . . . . . 16 1.2.2 Multilayer Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 17 1.2.3 The Multilayer Network as a Computational Graph . . . . . . . . . 20 1.3 Training a Neural Network with Backpropagation . . . . . . . . . . . . . . . 21 1.4 Practical Issues in Neural Network Training . . . . . . . . . . . . . . . . . . 24 1.4.1 The Problem of Overfitting . . . . . . . . . . . . . . . . . . . . . . . 25 1.4.1.1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . 26 1.4.1.2 Neural Architecture and Parameter Sharing . . . . . . . . . 27 1.4.1.3 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . 27 1.4.1.4 Trading Off Breadth for Depth . . . . . . . . . . . . . . . . 27 1.4.1.5 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . 28 1.4.2 The Vanishing and Exploding Gradient Problems . . . . . . . . . . . 28 1.4.3 Difficulties in Convergence . . . . . . . . . . . . . . . . . . . . . . . . 29 1.4.4 Local and Spurious Optima . . . . . . . . . . . . . . . . . . . . . . . 29 1.4.5 Computational Challenges . . . . . . . . . . . . . . . . . . . . . . . . 29 1.5 The Secrets to the Power of Function Composition . . . . . . . . . . . . . . 30 1.5.1 The Importance of Nonlinear Activation . . . . . . . . . . . . . . . . 32 1.5.2 Reducing Parameter Requirements with Depth . . . . . . . . . . . . 34 1.5.3 Unconventional Neural Architectures . . . . . . . . . . . . . . . . . . 35 1.5.3.1 Blurring the Distinctions Between Input, Hidden, and Output Layers . . . . . . . . . . . . . . . . . . . . . . . 35 1.5.3.2 Unconventional Operations and Sum-Product Networks . . 36 XIII
  • 16. XIV CONTENTS 1.6 Common Neural Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 37 1.6.1 Simulating Basic Machine Learning with Shallow Models . . . . . . 37 1.6.2 Radial Basis Function Networks . . . . . . . . . . . . . . . . . . . . 37 1.6.3 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . 38 1.6.4 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 38 1.6.5 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . 40 1.6.6 Hierarchical Feature Engineering and Pretrained Models . . . . . . . 42 1.7 Advanced Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 1.7.1 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 44 1.7.2 Separating Data Storage and Computations . . . . . . . . . . . . . . 45 1.7.3 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . 45 1.8 Two Notable Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 1.8.1 The MNIST Database of Handwritten Digits . . . . . . . . . . . . . 46 1.8.2 The ImageNet Database . . . . . . . . . . . . . . . . . . . . . . . . . 47 1.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 1.10 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 1.10.1 Video Lectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 1.10.2 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 1.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 2 Machine Learning with Shallow Neural Networks 53 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 2.2 Neural Architectures for Binary Classification Models . . . . . . . . . . . . 55 2.2.1 Revisiting the Perceptron . . . . . . . . . . . . . . . . . . . . . . . . 56 2.2.2 Least-Squares Regression . . . . . . . . . . . . . . . . . . . . . . . . 58 2.2.2.1 Widrow-Hoff Learning . . . . . . . . . . . . . . . . . . . . . 59 2.2.2.2 Closed Form Solutions . . . . . . . . . . . . . . . . . . . . . 61 2.2.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 2.2.3.1 Alternative Choices of Activation and Loss . . . . . . . . . 63 2.2.4 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . 63 2.3 Neural Architectures for Multiclass Models . . . . . . . . . . . . . . . . . . 65 2.3.1 Multiclass Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . 65 2.3.2 Weston-Watkins SVM . . . . . . . . . . . . . . . . . . . . . . . . . . 67 2.3.3 Multinomial Logistic Regression (Softmax Classifier) . . . . . . . . . 68 2.3.4 Hierarchical Softmax for Many Classes . . . . . . . . . . . . . . . . . 69 2.4 Backpropagated Saliency for Feature Selection . . . . . . . . . . . . . . . . 70 2.5 Matrix Factorization with Autoencoders . . . . . . . . . . . . . . . . . . . . 70 2.5.1 Autoencoder: Basic Principles . . . . . . . . . . . . . . . . . . . . . . 71 2.5.1.1 Autoencoder with a Single Hidden Layer . . . . . . . . . . 72 2.5.1.2 Connections with Singular Value Decomposition . . . . . . 74 2.5.1.3 Sharing Weights in Encoder and Decoder . . . . . . . . . . 74 2.5.1.4 Other Matrix Factorization Methods . . . . . . . . . . . . . 76 2.5.2 Nonlinear Activations . . . . . . . . . . . . . . . . . . . . . . . . . . 76 2.5.3 Deep Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 2.5.4 Application to Outlier Detection . . . . . . . . . . . . . 
. . . . . . . 80 2.5.5 When the Hidden Layer Is Broader than the Input Layer . . . . . . 81 2.5.5.1 Sparse Feature Learning . . . . . . . . . . . . . . . . . . . . 81 2.5.6 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
  • 17. CONTENTS XV 2.5.7 Recommender Systems: Row Index to Row Value Prediction . . . . 83 2.5.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 2.6 Word2vec: An Application of Simple Neural Architectures . . . . . . . . . . 87 2.6.1 Neural Embedding with Continuous Bag of Words . . . . . . . . . . 87 2.6.2 Neural Embedding with Skip-Gram Model . . . . . . . . . . . . . . . 90 2.6.3 Word2vec (SGNS) Is Logistic Matrix Factorization . . . . . . . . . . 95 2.6.4 Vanilla Skip-Gram Is Multinomial Matrix Factorization . . . . . . . 98 2.7 Simple Neural Architectures for Graph Embeddings . . . . . . . . . . . . . 98 2.7.1 Handling Arbitrary Edge Counts . . . . . . . . . . . . . . . . . . . . 100 2.7.2 Multinomial Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 2.7.3 Connections with DeepWalk and Node2vec . . . . . . . . . . . . . . 100 2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 2.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 2.9.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 2.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 3 Training Deep Neural Networks 105 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 3.2 Backpropagation: The Gory Details . . . . . . . . . . . . . . . . . . . . . . . 107 3.2.1 Backpropagation with the Computational Graph Abstraction . . . . 107 3.2.2 Dynamic Programming to the Rescue . . . . . . . . . . . . . . . . . 111 3.2.3 Backpropagation with Post-Activation Variables . . . . . . . . . . . 113 3.2.4 Backpropagation with Pre-activation Variables . . . . . . . . . . . . 115 3.2.5 Examples of Updates for Various Activations . . . . . . . . . . . . . 117 3.2.5.1 The Special Case of Softmax . . . . . . . . . . . . . . . . . 117 3.2.6 A Decoupled View of Vector-Centric Backpropagation . . . . . . . . 118 3.2.7 Loss Functions on Multiple Output Nodes and Hidden Nodes . . . . 121 3.2.8 Mini-Batch Stochastic Gradient Descent . . . . . . . . . . . . . . . . 121 3.2.9 Backpropagation Tricks for Handling Shared Weights . . . . . . . . 123 3.2.10 Checking the Correctness of Gradient Computation . . . . . . . . . 124 3.3 Setup and Initialization Issues . . . . . . . . . . . . . . . . . . . . . . . . . . 125 3.3.1 Tuning Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . 125 3.3.2 Feature Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 126 3.3.3 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 3.4 The Vanishing and Exploding Gradient Problems . . . . . . . . . . . . . . . 129 3.4.1 Geometric Understanding of the Effect of Gradient Ratios . . . . . . 130 3.4.2 A Partial Fix with Activation Function Choice . . . . . . . . . . . . 133 3.4.3 Dying Neurons and “Brain Damage” . . . . . . . . . . . . . . . . . . 133 3.4.3.1 Leaky ReLU . . . . . . . . . . . . . . . . . . . . . . . . . . 133 3.4.3.2 Maxout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 3.5 Gradient-Descent Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 3.5.1 Learning Rate Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 3.5.2 Momentum-Based Learning . . . . . . . . . . . . . . . . . . . . . . . 136 3.5.2.1 Nesterov Momentum . . . . . . . . . . . . . . . . . . . . . 137 3.5.3 Parameter-Specific Learning Rates . . . . . . . . . . . . . . . . . . . 
137 3.5.3.1 AdaGrad . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 3.5.3.2 RMSProp . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 3.5.3.3 RMSProp with Nesterov Momentum . . . . . . . . . . . . . 139
  • 18. XVI CONTENTS 3.5.3.4 AdaDelta . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 3.5.3.5 Adam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 3.5.4 Cliffs and Higher-Order Instability . . . . . . . . . . . . . . . . . . . 141 3.5.5 Gradient Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 3.5.6 Second-Order Derivatives . . . . . . . . . . . . . . . . . . . . . . . . 143 3.5.6.1 Conjugate Gradients and Hessian-Free Optimization . . . . 145 3.5.6.2 Quasi-Newton Methods and BFGS . . . . . . . . . . . . . . 148 3.5.6.3 Problems with Second-Order Methods: Saddle Points . . . 149 3.5.7 Polyak Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 3.5.8 Local and Spurious Minima . . . . . . . . . . . . . . . . . . . . . . . 151 3.6 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 3.7 Practical Tricks for Acceleration and Compression . . . . . . . . . . . . . . 156 3.7.1 GPU Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 3.7.2 Parallel and Distributed Implementations . . . . . . . . . . . . . . . 158 3.7.3 Algorithmic Tricks for Model Compression . . . . . . . . . . . . . . 160 3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 3.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 3.9.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 3.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 4 Teaching Deep Learners to Generalize 169 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 4.2 The Bias-Variance Trade-Off . . . . . . . . . . . . . . . . . . . . . . . . . . 174 4.2.1 Formal View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 4.3 Generalization Issues in Model Tuning and Evaluation . . . . . . . . . . . . 178 4.3.1 Evaluating with Hold-Out and Cross-Validation . . . . . . . . . . . . 179 4.3.2 Issues with Training at Scale . . . . . . . . . . . . . . . . . . . . . . 180 4.3.3 How to Detect Need to Collect More Data . . . . . . . . . . . . . . . 181 4.4 Penalty-Based Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . 181 4.4.1 Connections with Noise Injection . . . . . . . . . . . . . . . . . . . . 182 4.4.2 L1-Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 4.4.3 L1- or L2-Regularization? . . . . . . . . . . . . . . . . . . . . . . . . 184 4.4.4 Penalizing Hidden Units: Learning Sparse Representations . . . . . . 185 4.5 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 4.5.1 Bagging and Subsampling . . . . . . . . . . . . . . . . . . . . . . . . 186 4.5.2 Parametric Model Selection and Averaging . . . . . . . . . . . . . . 187 4.5.3 Randomized Connection Dropping . . . . . . . . . . . . . . . . . . . 188 4.5.4 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 4.5.5 Data Perturbation Ensembles . . . . . . . . . . . . . . . . . . . . . . 191 4.6 Early Stopping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 4.6.1 Understanding Early Stopping from the Variance Perspective . . . . 192 4.7 Unsupervised Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193 4.7.1 Variations of Unsupervised Pretraining . . . . . . . . . . . . . . . . . 197 4.7.2 What About Supervised Pretraining? . . . . . . . . . . . . . . . . . 
197 4.8 Continuation and Curriculum Learning . . . . . . . . . . . . . . . . . . . . . 199 4.8.1 Continuation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 199 4.8.2 Curriculum Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 200 4.9 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
  • 19. CONTENTS XVII 4.10 Regularization in Unsupervised Applications . . . . . . . . . . . . . . . . . 201 4.10.1 Value-Based Penalization: Sparse Autoencoders . . . . . . . . . . . . 202 4.10.2 Noise Injection: De-noising Autoencoders . . . . . . . . . . . . . . . 202 4.10.3 Gradient-Based Penalization: Contractive Autoencoders . . . . . . . 204 4.10.4 Hidden Probabilistic Structure: Variational Autoencoders . . . . . . 207 4.10.4.1 Reconstruction and Generative Sampling . . . . . . . . . . 210 4.10.4.2 Conditional Variational Autoencoders . . . . . . . . . . . . 212 4.10.4.3 Relationship with Generative Adversarial Networks . . . . 213 4.11 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 4.12 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 4.12.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 4.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 5 Radial Basis Function Networks 217 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 5.2 Training an RBF Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 5.2.1 Training the Hidden Layer . . . . . . . . . . . . . . . . . . . . . . . . 221 5.2.2 Training the Output Layer . . . . . . . . . . . . . . . . . . . . . . . 222 5.2.2.1 Expression with Pseudo-Inverse . . . . . . . . . . . . . . . 224 5.2.3 Orthogonal Least-Squares Algorithm . . . . . . . . . . . . . . . . . . 224 5.2.4 Fully Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . 225 5.3 Variations and Special Cases of RBF Networks . . . . . . . . . . . . . . . . 226 5.3.1 Classification with Perceptron Criterion . . . . . . . . . . . . . . . . 226 5.3.2 Classification with Hinge Loss . . . . . . . . . . . . . . . . . . . . . . 227 5.3.3 Example of Linear Separability Promoted by RBF . . . . . . . . . . 227 5.3.4 Application to Interpolation . . . . . . . . . . . . . . . . . . . . . . . 228 5.4 Relationship with Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . 229 5.4.1 Kernel Regression as a Special Case of RBF Networks . . . . . . . . 229 5.4.2 Kernel SVM as a Special Case of RBF Networks . . . . . . . . . . . 230 5.4.3 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 5.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 5.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 6 Restricted Boltzmann Machines 235 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 6.1.1 Historical Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . 236 6.2 Hopfield Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 6.2.1 Optimal State Configurations of a Trained Network . . . . . . . . . 238 6.2.2 Training a Hopfield Network . . . . . . . . . . . . . . . . . . . . . . 240 6.2.3 Building a Toy Recommender and Its Limitations . . . . . . . . . . 241 6.2.4 Increasing the Expressive Power of the Hopfield Network . . . . . . 242 6.3 The Boltzmann Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243 6.3.1 How a Boltzmann Machine Generates Data . . . . . . . . . . . . . . 244 6.3.2 Learning the Weights of a Boltzmann Machine . . . . . . . . . . . . 245 6.4 Restricted Boltzmann Machines . . . . . 
. . . . . . . . . . . . . . . . . . . . 247 6.4.1 Training the RBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249 6.4.2 Contrastive Divergence Algorithm . . . . . . . . . . . . . . . . . . . 250 6.4.3 Practical Issues and Improvisations . . . . . . . . . . . . . . . . . . . 251
  • 20. XVIII CONTENTS 6.5 Applications of Restricted Boltzmann Machines . . . . . . . . . . . . . . . . 251 6.5.1 Dimensionality Reduction and Data Reconstruction . . . . . . . . . 252 6.5.2 RBMs for Collaborative Filtering . . . . . . . . . . . . . . . . . . . . 254 6.5.3 Using RBMs for Classification . . . . . . . . . . . . . . . . . . . . . . 257 6.5.4 Topic Models with RBMs . . . . . . . . . . . . . . . . . . . . . . . . 260 6.5.5 RBMs for Machine Learning with Multimodal Data . . . . . . . . . 262 6.6 Using RBMs Beyond Binary Data Types . . . . . . . . . . . . . . . . . . . . 263 6.7 Stacking Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 264 6.7.1 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 266 6.7.2 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 267 6.7.3 Deep Boltzmann Machines and Deep Belief Networks . . . . . . . . 267 6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268 6.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268 6.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270 7 Recurrent Neural Networks 271 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271 7.1.1 Expressiveness of Recurrent Networks . . . . . . . . . . . . . . . . . 274 7.2 The Architecture of Recurrent Neural Networks . . . . . . . . . . . . . . . . 274 7.2.1 Language Modeling Example of RNN . . . . . . . . . . . . . . . . . 277 7.2.1.1 Generating a Language Sample . . . . . . . . . . . . . . . . 278 7.2.2 Backpropagation Through Time . . . . . . . . . . . . . . . . . . . . 280 7.2.3 Bidirectional Recurrent Networks . . . . . . . . . . . . . . . . . . . . 283 7.2.4 Multilayer Recurrent Networks . . . . . . . . . . . . . . . . . . . . . 284 7.3 The Challenges of Training Recurrent Networks . . . . . . . . . . . . . . . . 286 7.3.1 Layer Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . 289 7.4 Echo-State Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290 7.5 Long Short-Term Memory (LSTM) . . . . . . . . . . . . . . . . . . . . . . . 292 7.6 Gated Recurrent Units (GRUs) . . . . . . . . . . . . . . . . . . . . . . . . . 295 7.7 Applications of Recurrent Neural Networks . . . . . . . . . . . . . . . . . . 297 7.7.1 Application to Automatic Image Captioning . . . . . . . . . . . . . . 298 7.7.2 Sequence-to-Sequence Learning and Machine Translation . . . . . . 299 7.7.2.1 Question-Answering Systems . . . . . . . . . . . . . . . . . 301 7.7.3 Application to Sentence-Level Classification . . . . . . . . . . . . . . 303 7.7.4 Token-Level Classification with Linguistic Features . . . . . . . . . . 304 7.7.5 Time-Series Forecasting and Prediction . . . . . . . . . . . . . . . . 305 7.7.6 Temporal Recommender Systems . . . . . . . . . . . . . . . . . . . . 307 7.7.7 Secondary Protein Structure Prediction . . . . . . . . . . . . . . . . 309 7.7.8 End-to-End Speech Recognition . . . . . . . . . . . . . . . . . . . . . 309 7.7.9 Handwriting Recognition . . . . . . . . . . . . . . . . . . . . . . . . 309 7.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310 7.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310 7.9.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311 7.10 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
312
  • 21. CONTENTS XIX 8 Convolutional Neural Networks 315 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315 8.1.1 Historical Perspective and Biological Inspiration . . . . . . . . . . . 316 8.1.2 Broader Observations About Convolutional Neural Networks . . . . 317 8.2 The Basic Structure of a Convolutional Network . . . . . . . . . . . . . . . 318 8.2.1 Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322 8.2.2 Strides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324 8.2.3 Typical Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324 8.2.4 The ReLU Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325 8.2.5 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326 8.2.6 Fully Connected Layers . . . . . . . . . . . . . . . . . . . . . . . . . 327 8.2.7 The Interleaving Between Layers . . . . . . . . . . . . . . . . . . . . 328 8.2.8 Local Response Normalization . . . . . . . . . . . . . . . . . . . . . 330 8.2.9 Hierarchical Feature Engineering . . . . . . . . . . . . . . . . . . . . 331 8.3 Training a Convolutional Network . . . . . . . . . . . . . . . . . . . . . . . 332 8.3.1 Backpropagating Through Convolutions . . . . . . . . . . . . . . . . 333 8.3.2 Backpropagation as Convolution with Inverted/Transposed Filter . . 334 8.3.3 Convolution/Backpropagation as Matrix Multiplications . . . . . . . 335 8.3.4 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . 337 8.4 Case Studies of Convolutional Architectures . . . . . . . . . . . . . . . . . . 338 8.4.1 AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339 8.4.2 ZFNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341 8.4.3 VGG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342 8.4.4 GoogLeNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345 8.4.5 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347 8.4.6 The Effects of Depth . . . . . . . . . . . . . . . . . . . . . . . . . . . 350 8.4.7 Pretrained Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351 8.5 Visualization and Unsupervised Learning . . . . . . . . . . . . . . . . . . . 352 8.5.1 Visualizing the Features of a Trained Network . . . . . . . . . . . . 353 8.5.2 Convolutional Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 357 8.6 Applications of Convolutional Networks . . . . . . . . . . . . . . . . . . . . 363 8.6.1 Content-Based Image Retrieval . . . . . . . . . . . . . . . . . . . . . 363 8.6.2 Object Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . 364 8.6.3 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 8.6.4 Natural Language and Sequence Learning . . . . . . . . . . . . . . . 366 8.6.5 Video Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 367 8.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368 8.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368 8.8.1 Software Resources and Data Sets . . . . . . . . . . . . . . . . . . . 370 8.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371 9 Deep Reinforcement Learning 373 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373 9.2 Stateless Algorithms: Multi-Armed Bandits . . . . . . . . . . . . . . . . . . 
375 9.2.1 Naı̈ve Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376 9.2.2 -Greedy Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 376 9.2.3 Upper Bounding Methods . . . . . . . . . . . . . . . . . . . . . . . . 376 9.3 The Basic Framework of Reinforcement Learning . . . . . . . . . . . . . . . 377 9.3.1 Challenges of Reinforcement Learning . . . . . . . . . . . . . . . . . 379
  • 22. XX CONTENTS 9.3.2 Simple Reinforcement Learning for Tic-Tac-Toe . . . . . . . . . . . . 380 9.3.3 Role of Deep Learning and a Straw-Man Algorithm . . . . . . . . . 380 9.4 Bootstrapping for Value Function Learning . . . . . . . . . . . . . . . . . . 383 9.4.1 Deep Learning Models as Function Approximators . . . . . . . . . . 384 9.4.2 Example: Neural Network for Atari Setting . . . . . . . . . . . . . . 386 9.4.3 On-Policy Versus Off-Policy Methods: SARSA . . . . . . . . . . . . 387 9.4.4 Modeling States Versus State-Action Pairs . . . . . . . . . . . . . . . 389 9.5 Policy Gradient Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391 9.5.1 Finite Difference Methods . . . . . . . . . . . . . . . . . . . . . . . . 392 9.5.2 Likelihood Ratio Methods . . . . . . . . . . . . . . . . . . . . . . . . 393 9.5.3 Combining Supervised Learning with Policy Gradients . . . . . . . . 395 9.5.4 Actor-Critic Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 395 9.5.5 Continuous Action Spaces . . . . . . . . . . . . . . . . . . . . . . . . 397 9.5.6 Advantages and Disadvantages of Policy Gradients . . . . . . . . . . 397 9.6 Monte Carlo Tree Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398 9.7 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399 9.7.1 AlphaGo: Championship Level Play at Go . . . . . . . . . . . . . . . 399 9.7.1.1 Alpha Zero: Enhancements to Zero Human Knowledge . . 402 9.7.2 Self-Learning Robots . . . . . . . . . . . . . . . . . . . . . . . . . . . 404 9.7.2.1 Deep Learning of Locomotion Skills . . . . . . . . . . . . . 404 9.7.2.2 Deep Learning of Visuomotor Skills . . . . . . . . . . . . . 406 9.7.3 Building Conversational Systems: Deep Learning for Chatbots . . . 407 9.7.4 Self-Driving Cars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410 9.7.5 Inferring Neural Architectures with Reinforcement Learning . . . . . 412 9.8 Practical Challenges Associated with Safety . . . . . . . . . . . . . . . . . . 413 9.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414 9.10 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414 9.10.1 Software Resources and Testbeds . . . . . . . . . . . . . . . . . . . . 416 9.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416 10 Advanced Topics in Deep Learning 419 10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419 10.2 Attention Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421 10.2.1 Recurrent Models of Visual Attention . . . . . . . . . . . . . . . . . 422 10.2.1.1 Application to Image Captioning . . . . . . . . . . . . . . . 424 10.2.2 Attention Mechanisms for Machine Translation . . . . . . . . . . . . 425 10.3 Neural Networks with External Memory . . . . . . . . . . . . . . . . . . . . 429 10.3.1 A Fantasy Video Game: Sorting by Example . . . . . . . . . . . . . 430 10.3.1.1 Implementing Swaps with Memory Operations . . . . . . . 431 10.3.2 Neural Turing Machines . . . . . . . . . . . . . . . . . . . . . . . . . 432 10.3.3 Differentiable Neural Computer: A Brief Overview . . . . . . . . . . 437 10.4 Generative Adversarial Networks (GANs) . . . . . . . . . . . . . . . . . . . 438 10.4.1 Training a Generative Adversarial Network . . . . . . . . . . . . . . 439 10.4.2 Comparison with Variational Autoencoder . . . . . . . . . . . . . . . 442 10.4.3 Using GANs for Generating Image Data . . 
. . . . . . . . . . . . . . 442 10.4.4 Conditional Generative Adversarial Networks . . . . . . . . . . . . . 444 10.5 Competitive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449 10.5.1 Vector Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . 450 10.5.2 Kohonen Self-Organizing Map . . . . . . . . . . . . . . . . . . . . . 450
  • 23. CONTENTS XXI 10.6 Limitations of Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 453 10.6.1 An Aspirational Goal: One-Shot Learning . . . . . . . . . . . . . . . 453 10.6.2 An Aspirational Goal: Energy-Efficient Learning . . . . . . . . . . . 455 10.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456 10.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457 10.8.1 Software Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458 10.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458 Bibliography 459 Index 493
  • 24. Author Biography Charu C. Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his under- graduate degree in Computer Science from the Indian Institute of Technology at Kan- pur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996. He has worked extensively in the field of data mining. He has pub- lished more than 350 papers in refereed conferences and journals and authored over 80 patents. He is the author or editor of 18 books, including textbooks on data mining, recommender systems, and outlier analysis. Because of the commercial value of his patents, he has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio- terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contribu- tions to privacy technology, and a recipient of two IBM Outstanding Technical Achievement Awards (2009, 2015) for his work on data streams/high-dimensional data. He received the EDBT 2014 Test of Time Award for his work on condensation-based privacy-preserving data mining. He is also a recipient of the IEEE ICDM Research Con- tributions Award (2015), which is one of the two highest awards for influential research contributions in the field of data mining. He has served as the general co-chair of the IEEE Big Data Conference (2014) and as the program co-chair of the ACM CIKM Conference (2015), the IEEE ICDM Conference (2015), and the ACM KDD Conference (2016). He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the IEEE Transactions on Big Data, an action editor of the Data Mining and Knowledge Discovery Journal, and an associate editor of the Knowledge and Information Systems Journal. He serves as the editor-in-chief of the ACM Transactions on Knowledge Discovery from Data as well as the ACM SIGKDD Explorations. He serves on the advisory board of the Lecture Notes on Social Networks, a publication by Springer. He has served as the vice-president of the SIAM Activity Group on Data Mining and is a member of the SIAM industry committee. He is a fellow of the SIAM, ACM, and the IEEE, for “contributions to knowledge discovery and data mining algorithms.” XXIII
Chapter 1

An Introduction to Neural Networks

“Thou shalt not make a machine to counterfeit a human mind.”—Frank Herbert

1.1 Introduction

Artificial neural networks are popular machine learning techniques that simulate the mechanism of learning in biological organisms. The human nervous system contains cells, which are referred to as neurons. The neurons are connected to one another with the use of axons and dendrites, and the connecting regions between axons and dendrites are referred to as synapses. These connections are illustrated in Figure 1.1(a). The strengths of synaptic connections often change in response to external stimuli. This change is how learning takes place in living organisms.

Figure 1.1: The synaptic connections between neurons. Panel (a) shows a biological neural network; panel (b) shows an artificial neural network whose inputs are scaled by weights w1 . . . w5. The image in (a) is from “The Brain: Understanding Neurobiology Through the Study of Addiction [598].” Copyright © 2000 by BSCS Videodiscovery. All rights reserved. Used with permission.

© Springer International Publishing AG, part of Springer Nature 2018. C. C. Aggarwal, Neural Networks and Deep Learning, https://doi.org/10.1007/978-3-319-94463-0_1

This biological mechanism is simulated in artificial neural networks, which contain computation units that are referred to as neurons. Throughout this book, we will use the term “neural networks” to refer to artificial neural networks rather than biological ones. The computational units are connected to one another through weights, which serve the same
  • 26. 2 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS role as the strengths of synaptic connections in biological organisms. Each input to a neuron is scaled with a weight, which affects the function computed at that unit. This architecture is illustrated in Figure 1.1(b). An artificial neural network computes a function of the inputs by propagating the computed values from the input neurons to the output neuron(s) and using the weights as intermediate parameters. Learning occurs by changing the weights con- necting the neurons. Just as external stimuli are needed for learning in biological organisms, the external stimulus in artificial neural networks is provided by the training data contain- ing examples of input-output pairs of the function to be learned. For example, the training data might contain pixel representations of images (input) and their annotated labels (e.g., carrot, banana) as the output. These training data pairs are fed into the neural network by using the input representations to make predictions about the output labels. The training data provides feedback to the correctness of the weights in the neural network depending on how well the predicted output (e.g., probability of carrot) for a particular input matches the annotated output label in the training data. One can view the errors made by the neural network in the computation of a function as a kind of unpleasant feedback in a biological organism, leading to an adjustment in the synaptic strengths. Similarly, the weights between neurons are adjusted in a neural network in response to prediction errors. The goal of chang- ing the weights is to modify the computed function to make the predictions more correct in future iterations. Therefore, the weights are changed carefully in a mathematically justified way so as to reduce the error in computation on that example. By successively adjusting the weights between neurons over many input-output pairs, the function computed by the neural network is refined over time so that it provides more accurate predictions. Therefore, if the neural network is trained with many different images of bananas, it will eventually be able to properly recognize a banana in an image it has not seen before. This ability to accurately compute functions of unseen inputs by training over a finite set of input-output pairs is referred to as model generalization. The primary usefulness of all machine learning models is gained from their ability to generalize their learning from seen training data to unseen examples. The biological comparison is often criticized as a very poor caricature of the workings of the human brain; nevertheless, the principles of neuroscience have often been useful in designing neural network architectures. A different view is that neural networks are built as higher-level abstractions of the classical models that are commonly used in machine learning. In fact, the most basic units of computation in the neural network are inspired by traditional machine learning algorithms like least-squares regression and logistic regression. Neural networks gain their power by putting together many such basic units, and learning the weights of the different units jointly in order to minimize the prediction error. From this point of view, a neural network can be viewed as a computational graph of elementary units in which greater power is gained by connecting them in particular ways. 
When a neural network is used in its most basic form, without hooking together multiple units, the learning algorithms often reduce to classical machine learning models (see Chapter 2). The real power of a neural model over classical methods is unleashed when these elementary computational units are combined, and the weights of the elementary models are trained using their dependencies on one another. By combining multiple units, one is increasing the power of the model to learn more complicated functions of the data than are inherent in the elementary models of basic machine learning. The way in which these units are combined also plays a role in the power of the architecture, and requires some understanding and insight from the analyst. Furthermore, sufficient training data is also required in order to learn the larger number of weights in these expanded computational graphs.
  • 27. 1.1. INTRODUCTION 3 ACCURACY AMOUNT OF DATA DEEP LEARNING CONVENTIONAL MACHINE LEARNING Figure 1.2: An illustrative comparison of the accuracy of a typical machine learning al- gorithm with that of a large neural network. Deep learners become more attractive than conventional methods primarily when sufficient data/computational power is available. Re- cent years have seen an increase in data availability and computational power, which has led to a “Cambrian explosion” in deep learning technology. 1.1.1 Humans Versus Computers: Stretching the Limits of Artificial Intelligence Humans and computers are inherently suited to different types of tasks. For example, com- puting the cube root of a large number is very easy for a computer, but it is extremely difficult for humans. On the other hand, a task such as recognizing the objects in an image is a simple matter for a human, but has traditionally been very difficult for an automated learning algorithm. It is only in recent years that deep learning has shown an accuracy on some of these tasks that exceeds that of a human. In fact, the recent results by deep learning algorithms that surpass human performance [184] in (some narrow tasks on) image recog- nition would not have been considered likely by most computer vision experts as recently as 10 years ago. Many deep learning architectures that have shown such extraordinary performance are not created by indiscriminately connecting computational units. The superior performance of deep neural networks mirrors the fact that biological neural networks gain much of their power from depth as well. Furthermore, biological networks are connected in ways we do not fully understand. In the few cases that the biological structure is understood at some level, significant breakthroughs have been achieved by designing artificial neural networks along those lines. A classical example of this type of architecture is the use of the convolutional neural network for image recognition. This architecture was inspired by Hubel and Wiesel’s experiments [212] in 1959 on the organization of the neurons in the cat’s visual cortex. The precursor to the convolutional neural network was the neocognitron [127], which was directly based on these results. The human neuronal connection structure has evolved over millions of years to optimize survival-driven performance; survival is closely related to our ability to merge sensation and intuition in a way that is currently not possible with machines. Biological neuroscience [232] is a field that is still very much in its infancy, and only a limited amount is known about how the brain truly works. Therefore, it is fair to suggest that the biologically inspired success of convolutional neural networks might be replicated in other settings, as we learn more about how the human brain works [176]. A key advantage of neural networks over tradi- tional machine learning is that the former provides a higher-level abstraction of expressing semantic insights about data domains by architectural design choices in the computational graph. The second advantage is that neural networks provide a simple way to adjust the
  • 28. 4 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS complexity of a model by adding or removing neurons from the architecture according to the availability of training data or computational power. A large part of the recent suc- cess of neural networks is explained by the fact that the increased data availability and computational power of modern computers has outgrown the limits of traditional machine learning algorithms, which fail to take full advantage of what is now possible. This situation is illustrated in Figure 1.2. The performance of traditional machine learning remains better at times for smaller data sets because of more choices, greater ease of model interpretation, and the tendency to hand-craft interpretable features that incorporate domain-specific in- sights. With limited data, the best of a very wide diversity of models in machine learning will usually perform better than a single class of models (like neural networks). This is one reason why the potential of neural networks was not realized in the early years. The “big data” era has been enabled by the advances in data collection technology; vir- tually everything we do today, including purchasing an item, using the phone, or clicking on a site, is collected and stored somewhere. Furthermore, the development of powerful Graph- ics Processor Units (GPUs) has enabled increasingly efficient processing on such large data sets. These advances largely explain the recent success of deep learning using algorithms that are only slightly adjusted from the versions that were available two decades back. Furthermore, these recent adjustments to the algorithms have been enabled by increased speed of computation, because reduced run-times enable efficient testing (and subsequent algorithmic adjustment). If it requires a month to test an algorithm, at most twelve varia- tions can be tested in an year on a single hardware platform. This situation has historically constrained the intensive experimentation required for tweaking neural-network learning algorithms. The rapid advances associated with the three pillars of improved data, compu- tation, and experimentation have resulted in an increasingly optimistic outlook about the future of deep learning. By the end of this century, it is expected that computers will have the power to train neural networks with as many neurons as the human brain. Although it is hard to predict what the true capabilities of artificial intelligence will be by then, our experience with computer vision should prepare us to expect the unexpected. Chapter Organization This chapter is organized as follows. The next section introduces single-layer and multi-layer networks. The different types of activation functions, output nodes, and loss functions are discussed. The backpropagation algorithm is introduced in Section 1.3. Practical issues in neural network training are discussed in Section 1.4. Some key points on how neural networks gain their power with specific choices of activation functions are discussed in Section 1.5. The common architectures used in neural network design are discussed in Section 1.6. Advanced topics in deep learning are discussed in Section 1.7. Some notable benchmarks used by the deep learning community are discussed in Section 1.8. A summary is provided in Section 1.9. 1.2 The Basic Architecture of Neural Networks In this section, we will introduce single-layer and multi-layer neural networks. 
In the single- layer network, a set of inputs is directly mapped to an output by using a generalized variation of a linear function. This simple instantiation of a neural network is also referred to as the perceptron. In multi-layer neural networks, the neurons are arranged in layered fashion, in which the input and output layers are separated by a group of hidden layers. This layer-wise architecture of the neural network is also referred to as a feed-forward network. This section will discuss both single-layer and multi-layer networks.
Figure 1.3: The basic architecture of the perceptron. Panel (a) shows the perceptron without bias; panel (b) shows the perceptron with a bias neuron.

1.2.1 Single Computational Layer: The Perceptron

The simplest neural network is referred to as the perceptron. This neural network contains a single input layer and an output node. The basic architecture of the perceptron is shown in Figure 1.3(a). Consider a situation where each training instance is of the form (X, y), where each X = [x1, . . . , xd] contains d feature variables, and y ∈ {−1, +1} contains the observed value of the binary class variable. By “observed value” we refer to the fact that it is given to us as a part of the training data, and our goal is to predict the class variable for cases in which it is not observed. For example, in a credit-card fraud detection application, the features might represent various properties of a set of credit card transactions (e.g., amount and frequency of transactions), and the class variable might represent whether or not this set of transactions is fraudulent. Clearly, in this type of application, one would have historical cases in which the class variable is observed, and other (current) cases in which the class variable has not yet been observed but needs to be predicted.

The input layer contains d nodes that transmit the d features X = [x1 . . . xd] with edges of weight W = [w1 . . . wd] to an output node. The input layer does not perform any computation in its own right. The linear function W · X = ∑_{i=1}^{d} w_i x_i is computed at the output node. Subsequently, the sign of this real value is used in order to predict the dependent variable of X. Therefore, the prediction ŷ is computed as follows:

ŷ = sign{W · X} = sign{ ∑_{j=1}^{d} w_j x_j }    (1.1)

The sign function maps a real value to either +1 or −1, which is appropriate for binary classification. Note the circumflex on top of the variable y to indicate that it is a predicted value rather than an observed value. The error of the prediction is therefore E(X) = y − ŷ, which is one of the values drawn from the set {−2, 0, +2}. In cases where the error value E(X) is nonzero, the weights in the neural network need to be updated in the (negative) direction of the error gradient. As we will see later, this process is similar to that used in various types of linear models in machine learning. In spite of the similarity of the perceptron with respect to traditional machine learning models, its interpretation as a computational unit is very useful because it allows us to put together multiple units in order to create far more powerful models than are available in traditional machine learning.
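The prediction of Equation 1.1 and the error E(X) = y − ŷ can be computed directly. The following NumPy sketch is illustrative only (the variable names, array shapes, and the convention of mapping a zero score to +1 are assumptions, not specifications from the text):

```python
import numpy as np

def perceptron_predict(W, X):
    """Equation 1.1: y_hat = sign{W . X} for each row of X, mapping ties at
    zero to +1 so that the output is always drawn from {-1, +1}."""
    return np.where(X @ W >= 0, 1, -1)

# Toy data: d = 3 features, 2 training instances.
W = np.array([0.5, -1.0, 0.25])
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
y = np.array([1, 1])
y_hat = perceptron_predict(W, X)
print(y_hat)         # [ 1 -1]
print(y - y_hat)     # the error E(X) for each instance, drawn from {-2, 0, +2}
```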
  • 30. 6 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS The architecture of the perceptron is shown in Figure 1.3(a), in which a single input layer transmits the features to the output node. The edges from the input to the output contain the weights w1 . . . wd with which the features are multiplied and added at the output node. Subsequently, the sign function is applied in order to convert the aggregated value into a class label. The sign function serves the role of an activation function. Different choices of activation functions can be used to simulate different types of models used in machine learning, like least-squares regression with numeric targets, the support vector machine, or a logistic regression classifier. Most of the basic machine learning models can be easily represented as simple neural network architectures. It is a useful exercise to model traditional machine learning techniques as neural architectures, because it provides a clearer picture of how deep learning generalizes traditional machine learning. This point of view is explored in detail in Chapter 2. It is noteworthy that the perceptron contains two layers, although the input layer does not perform any computation and only transmits the feature values. The input layer is not included in the count of the number of layers in a neural network. Since the perceptron contains a single computational layer, it is considered a single-layer network. In many settings, there is an invariant part of the prediction, which is referred to as the bias. For example, consider a setting in which the feature variables are mean centered, but the mean of the binary class prediction from {−1, +1} is not 0. This will tend to occur in situations in which the binary class distribution is highly imbalanced. In such a case, the aforementioned approach is not sufficient for prediction. We need to incorporate an additional bias variable b that captures this invariant part of the prediction: ŷ = sign{W · X + b} = sign{ d j=1 wjxj + b} (1.2) The bias can be incorporated as the weight of an edge by using a bias neuron. This is achieved by adding a neuron that always transmits a value of 1 to the output node. The weight of the edge connecting the bias neuron to the output node provides the bias variable. An example of a bias neuron is shown in Figure 1.3(b). Another approach that works well with single-layer architectures is to use a feature engineering trick in which an additional feature is created with a constant value of 1. The coefficient of this feature provides the bias, and one can then work with Equation 1.1. Throughout this book, biases will not be explicitly used (for simplicity in architectural representations) because they can be incorporated with bias neurons. The details of the training algorithms remain the same by simply treating the bias neurons like any other neuron with a fixed activation value of 1. Therefore, the following will work with the predictive assumption of Equation 1.1, which does not explicitly uses biases. At the time that the perceptron algorithm was proposed by Rosenblatt [405], these op- timizations were performed in a heuristic way with actual hardware circuits, and it was not presented in terms of a formal notion of optimization in machine learning (as is common today). However, the goal was always to minimize the error in prediction, even if a for- mal optimization formulation was not presented. 
The perceptron algorithm was, therefore, heuristically designed to minimize the number of misclassifications, and convergence proofs were available that provided correctness guarantees of the learning algorithm in simplified settings. Therefore, we can still write the (heuristically motivated) goal of the perceptron algorithm in least-squares form with respect to all training instances in a data set D con-
taining feature-label pairs:

Minimize_W L = ∑_{(X,y)∈D} (y − ŷ)² = ∑_{(X,y)∈D} (y − sign{W · X})²

This type of minimization objective function is also referred to as a loss function. As we will see later, almost all neural network learning algorithms are formulated with the use of a loss function. As we will learn in Chapter 2, this loss function looks a lot like least-squares regression. However, the latter is defined for continuous-valued target variables, and the corresponding loss is a smooth and continuous function of the variables. On the other hand, for the least-squares form of the objective function, the sign function is non-differentiable, with step-like jumps at specific points. Furthermore, the sign function takes on constant values over large portions of the domain, and therefore the exact gradient takes on zero values at differentiable points. This results in a staircase-like loss surface, which is not suitable for gradient-descent. The perceptron algorithm (implicitly) uses a smooth approximation of the gradient of this objective function with respect to each example:

∇L_smooth = ∑_{(X,y)∈D} (y − ŷ)X    (1.3)

Note that the above gradient is not a true gradient of the staircase-like surface of the (heuristic) objective function, which does not provide useful gradients. Therefore, the staircase is smoothed out into a sloping surface defined by the perceptron criterion. The properties of the perceptron criterion will be described in Section 1.2.1.1. It is noteworthy that concepts like the “perceptron criterion” were proposed later than the original paper by Rosenblatt [405] in order to explain the heuristic gradient-descent steps. For now, we will assume that the perceptron algorithm optimizes some unknown smooth function with the use of gradient descent.

Although the above objective function is defined over the entire training data, the training algorithm of neural networks works by feeding each input data instance X into the network one by one (or in small batches) to create the prediction ŷ. The weights are then updated, based on the error value E(X) = (y − ŷ). Specifically, when the data point X is fed into the network, the weight vector W is updated as follows:

W ⇐ W + α(y − ŷ)X    (1.4)

The parameter α regulates the learning rate of the neural network. The perceptron algorithm repeatedly cycles through all the training examples in random order and iteratively adjusts the weights until convergence is reached. A single training data point may be cycled through many times. Each such cycle is referred to as an epoch. One can also write the gradient-descent update in terms of the error E(X) = (y − ŷ) as follows:

W ⇐ W + αE(X)X    (1.5)

The basic perceptron algorithm can be considered a stochastic gradient-descent method, which implicitly minimizes the squared error of prediction by performing gradient-descent updates with respect to randomly chosen training points. The assumption is that the neural network cycles through the points in random order during training and changes the weights with the goal of reducing the prediction error on that point. It is easy to see from Equation 1.5 that non-zero updates are made to the weights only when y ≠ ŷ, which occurs only
  • 32. 8 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS when errors are made in prediction. In mini-batch stochastic gradient descent, the aforemen- tioned updates of Equation 1.5 are implemented over a randomly chosen subset of training points S: W ⇐ W + α X∈S E(X)X (1.6) LINEARLY SEPARABLE NOT LINEARLY SEPARABLE W X = 0 Figure 1.4: Examples of linearly separable and inseparable data in two classes The advantages of using mini-batch stochastic gradient descent are discussed in Section 3.2.8 of Chapter 3. An interesting quirk of the perceptron is that it is possible to set the learning rate α to 1, because the learning rate only scales the weights. The type of model proposed in the perceptron is a linear model, in which the equation W ·X = 0 defines a linear hyperplane. Here, W = (w1 . . . wd) is a d-dimensional vector that is normal to the hyperplane. Furthermore, the value of W · X is positive for values of X on one side of the hyperplane, and it is negative for values of X on the other side. This type of model performs particularly well when the data is linearly separable. Examples of linearly separable and inseparable data are shown in Figure 1.4. The perceptron algorithm is good at classifying data sets like the one shown on the left-hand side of Figure 1.4, when the data is linearly separable. On the other hand, it tends to perform poorly on data sets like the one shown on the right-hand side of Figure 1.4. This example shows the inherent modeling limitation of a perceptron, which necessitates the use of more complex neural architectures. Since the original perceptron algorithm was proposed as a heuristic minimization of classification errors, it was particularly important to show that the algorithm converges to reasonable solutions in some special cases. In this context, it was shown [405] that the perceptron algorithm always converges to provide zero error on the training data when the data are linearly separable. However, the perceptron algorithm is not guaranteed to converge in instances where the data are not linearly separable. For reasons discussed in the next section, the perceptron might sometimes arrive at a very poor solution with data that are not linearly separable (in comparison with many other learning algorithms). 1.2.1.1 What Objective Function Is the Perceptron Optimizing? As discussed earlier in this chapter, the original perceptron paper by Rosenblatt [405] did not formally propose a loss function. In those years, these implementations were achieved using actual hardware circuits. The original Mark I perceptron was intended to be a machine rather than an algorithm, and custom-built hardware was used to create it (cf. Figure 1.5).
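The point-wise updates of Equations 1.4–1.5 and the mini-batch update of Equation 1.6 translate into a short training loop. The sketch below is a minimal illustration under assumed settings (the learning rate, epoch count, initialization, and toy data are arbitrary choices, not prescriptions from the text):

```python
import numpy as np

def predict(W, X):
    return np.where(X @ W >= 0, 1, -1)          # sign{W . X} for each row of X

def train_perceptron(X, y, alpha=1.0, epochs=25, seed=0):
    """Point-wise updates of Equation 1.5: W <= W + alpha * E(X) * X."""
    rng = np.random.default_rng(seed)
    W = np.zeros(X.shape[1])
    for _ in range(epochs):                      # one epoch = one pass over the data
        for i in rng.permutation(len(y)):        # cycle through points in random order
            error = y[i] - (1 if X[i] @ W >= 0 else -1)   # E(X) in {-2, 0, +2}
            if error != 0:                       # non-zero updates only on mistakes
                W = W + alpha * error * X[i]
    return W

def minibatch_update(W, X_batch, y_batch, alpha=1.0):
    """Mini-batch update of Equation 1.6: W <= W + alpha * sum_{X in S} E(X) * X."""
    errors = y_batch - predict(W, X_batch)
    return W + alpha * (errors @ X_batch)

# Linearly separable toy data (labels follow the sign of the first feature).
X = np.array([[2.0, 1.0], [1.5, -1.0], [-1.0, 0.5], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
W = train_perceptron(X, y)
print(predict(W, X))                             # recovers the training labels
```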
  • 33. 1.2. THE BASIC ARCHITECTURE OF NEURAL NETWORKS 9 The general goal was to minimize the number of classification errors with a heuristic update process (in hardware) that changed weights in the “correct” direction whenever errors were made. This heuristic update strongly resembled gradient descent but it was not derived as a gradient-descent method. Gradient descent is defined only for smooth loss functions in algorithmic settings, whereas the hardware-centric approach was designed in a more Figure 1.5: The perceptron algorithm was originally implemented using hardware circuits. The image depicts the Mark I perceptron machine built in 1958. (Courtesy: Smithsonian Institute) heuristic way with binary outputs. Many of the binary and circuit-centric principles were inherited from the McCulloch-Pitts model [321] of the neuron. Unfortunately, binary signals are not prone to continuous optimization. Can we find a smooth loss function, whose gradient turns out to be the perceptron update? The number of classification errors in a binary classification problem can be written in the form of a 0/1 loss function for training data point (Xi, yi) as follows: L (0/1) i = 1 2 (yi − sign{W · Xi})2 = 1 − yi · sign{W · Xi} (1.7) The simplification to the right-hand side of the above objective function is obtained by set- ting both y2 i and sign{W ·Xi}2 to 1, since they are obtained by squaring a value drawn from {−1, +1}. However, this objective function is not differentiable, because it has a staircase- like shape, especially when it is added over multiple points. Note that the 0/1 loss above is dominated by the term −yisign{W · Xi}, in which the sign function causes most of the problems associated with non-differentiability. Since neural networks are defined by gradient-based optimization, we need to define a smooth objective function that is respon- sible for the perceptron updates. It can be shown [41] that the updates of the perceptron implicitly optimize the perceptron criterion. This objective function is defined by dropping the sign function in the above 0/1 loss and setting negative values to 0 in order to treat all correct predictions in a uniform and lossless way: Li = max{−yi(W · Xi), 0} (1.8) The reader is encouraged to use calculus to verify that the gradient of this smoothed objec- tive function leads to the perceptron update, and the update of the perceptron is essentially
  • 34. 10 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS W ⇐ W − α∇W Li. The modified loss function to enable gradient computation of a non- differentiable function is also referred to as a smoothed surrogate loss function. Almost all continuous optimization-based learning methods (such as neural networks) with discrete outputs (such as class labels) use some type of smoothed surrogate loss function. LOSS PERCEPTRON CRITERION HINGE LOSS 1 0 VALUE OF W X FOR POSITIVE CLASS INSTANCE Figure 1.6: Perceptron criterion versus hinge loss Although the aforementioned perceptron criterion was reverse engineered by working backwards from the perceptron updates, the nature of this loss function exposes some of the weaknesses of the updates in the original algorithm. An interesting observation about the perceptron criterion is that one can set W to the zero vector irrespective of the training data set in order to obtain the optimal loss value of 0. In spite of this fact, the perceptron updates continue to converge to a clear separator between the two classes in linearly separable cases; after all, a separator between the two classes provides a loss value of 0 as well. However, the behavior for data that are not linearly separable is rather arbitrary, and the resulting solution is sometimes not even a good approximate separator of the classes. The direct sensitivity of the loss to the magnitude of the weight vector can dilute the goal of class separation; it is possible for updates to worsen the number of misclassifications significantly while improving the loss. This is an example of how surrogate loss functions might sometimes not fully achieve their intended goals. Because of this fact, the approach is not stable and can yield solutions of widely varying quality. Several variations of the learning algorithm were therefore proposed for inseparable data, and a natural approach is to always keep track of the best solution in terms of the number of misclassifications [128]. This approach of always keeping the best solution in one’s “pocket” is referred to as the pocket algorithm. Another highly performing variant incorporates the notion of margin in the loss function, which creates an identical algorithm to the linear support vector machine. For this reason, the linear support vector machine is also referred to as the perceptron of optimal stability. 1.2.1.2 Relationship with Support Vector Machines The perceptron criterion is a shifted version of the hinge-loss used in support vector ma- chines (see Chapter 2). The hinge loss looks even more similar to the zero-one loss criterion of Equation 1.7, and is defined as follows: Lsvm i = max{1 − yi(W · Xi), 0} (1.9) Note that the perceptron does not keep the constant term of 1 on the right-hand side of Equation 1.7, whereas the hinge loss keeps this constant within the maximization function. This change does not affect the algebraic expression for the gradient, but it does change
which points are lossless and should not cause an update. The relationship between the perceptron criterion and the hinge loss is shown in Figure 1.6. This similarity becomes particularly evident when the perceptron updates of Equation 1.6 are rewritten as follows:

W ⇐ W + α ∑_{(X,y)∈S+} yX    (1.10)

Here, S+ is defined as the set of all misclassified training points X ∈ S that satisfy the condition y(W · X) < 0. This update seems to look somewhat different from the perceptron, because the perceptron uses the error E(X) for the update, which is replaced with y in the update above. A key point is that the (integer) error value E(X) = (y − sign{W · X}) ∈ {−2, +2} can never be 0 for misclassified points in S+. Therefore, we have E(X) = 2y for misclassified points, and E(X) can be replaced with y in the updates after absorbing the factor of 2 within the learning rate. This update is identical to that used by the primal support vector machine (SVM) algorithm [448], except that the updates are performed only for the misclassified points in the perceptron, whereas the SVM also uses the marginally correct points near the decision boundary for updates. Note that the SVM uses the condition y(W · X) < 1 [instead of using the condition y(W · X) < 0] to define S+, which is one of the key differences between the two algorithms. This point shows that the perceptron is fundamentally not very different from well-known machine learning algorithms like the support vector machine in spite of its different origins. Freund and Schapire provide a beautiful exposition of the role of margin in improving stability of the perceptron and also its relationship with the support vector machine [123]. It turns out that many traditional machine learning models can be viewed as minor variations of shallow neural architectures like the perceptron. The relationships between classical machine learning models and shallow neural networks are described in detail in Chapter 2.

1.2.1.3 Choice of Activation and Loss Functions

The choice of activation function is a critical part of neural network design. In the case of the perceptron, the choice of the sign activation function is motivated by the fact that a binary class label needs to be predicted. However, it is possible to have other types of situations where different target variables may be predicted. For example, if the target variable to be predicted is real, then it makes sense to use the identity activation function, and the resulting algorithm is the same as least-squares regression. If it is desirable to predict a probability of a binary class, it makes sense to use a sigmoid function for activating the output node, so that the prediction ŷ indicates the probability that the observed value, y, of the dependent variable is 1. The negative logarithm of |y/2 − 0.5 + ŷ| is used as the loss, assuming that y is coded from {−1, 1}. If ŷ is the probability that y is 1, then |y/2 − 0.5 + ŷ| is the probability that the correct value is predicted. This assertion is easy to verify by examining the two cases where y is −1 or +1. This loss function can be shown to be representative of the negative log-likelihood of the training data (see Section 2.2.3 of Chapter 2).

The importance of nonlinear activation functions becomes significant when one moves from the single-layered perceptron to the multi-layered architectures discussed later in this chapter.
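To make the comparison in Section 1.2.1.2 concrete, the following sketch evaluates the perceptron criterion (Equation 1.8) and the hinge loss (Equation 1.9), and selects the update set S+ with the two conditions discussed above. This is only an illustration with assumed names and toy data; it is not the full support vector machine, which would also include regularization.

```python
import numpy as np

def perceptron_criterion(W, X, y):
    return np.maximum(-y * (X @ W), 0.0).sum()          # Equation 1.8

def hinge_loss(W, X, y):
    return np.maximum(1.0 - y * (X @ W), 0.0).sum()     # Equation 1.9

def batch_update(W, X, y, alpha=0.1, margin=0.0):
    """W <= W + alpha * sum_{X in S+} y * X, where S+ contains the points with
    y * (W . X) < margin (0 for the perceptron, 1 for the SVM-style update)."""
    in_splus = (y * (X @ W) < margin).astype(float)
    return W + alpha * ((in_splus * y) @ X)

W = np.array([1.0, 0.0])
X = np.array([[0.5, 0.0],     # correctly classified, but inside the margin
              [-3.0, 1.0]])   # correctly classified, well outside the margin
y = np.array([1, -1])
print(batch_update(W, X, y, margin=0.0))   # perceptron: no point qualifies, W unchanged
print(batch_update(W, X, y, margin=1.0))   # SVM-style: the marginal point still triggers an update
```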
Different types of nonlinear functions such as the sign, sigmoid, or hyperbolic tan- gents may be used in various layers. We use the notation Φ to denote the activation function: ŷ = Φ(W · X) (1.11) Therefore, a neuron really computes two functions within the node, which is why we have incorporated the summation symbol Σ as well as the activation symbol Φ within a neuron. The break-up of the neuron computations into two separate values is shown in Figure 1.7.
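A neuron's two-step computation in Equation 1.11 (a linear combination followed by an activation) can be made explicit in a few lines. The sketch below is illustrative only; the sigmoid choice and the variable names are assumptions rather than part of the text.

```python
import numpy as np

def neuron(W, X, phi=lambda v: 1.0 / (1.0 + np.exp(-v))):
    """Compute a single neuron: the linear value a_h = W . X followed by
    the activated output h = phi(a_h), as in Equation 1.11."""
    a_h = np.dot(W, X)      # value before applying the activation
    h = phi(a_h)            # value after applying the activation (the neuron's output)
    return a_h, h

W = np.array([0.4, -0.2, 0.1])
X = np.array([1.0, 2.0, 3.0])
print(neuron(W, X))         # e.g., (0.3, 0.574...); both values are useful in later analyses
```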
Figure 1.7: Pre-activation and post-activation values within a neuron. The pre-activation value is a_h = W · X, and the post-activation value is h = Φ(a_h) = Φ(W · X).

The value computed before applying the activation function Φ(·) will be referred to as the pre-activation value, whereas the value computed after applying the activation function is referred to as the post-activation value. The output of a neuron is always the post-activation value, although the pre-activation variables are often used in different types of analyses, such as the computations of the backpropagation algorithm discussed later in this chapter. The pre-activation and post-activation values of a neuron are shown in Figure 1.7.

The most basic activation function Φ(·) is the identity or linear activation, which provides no nonlinearity:

Φ(v) = v

The linear activation function is often used in the output node, when the target is a real value. It is even used for discrete outputs when a smoothed surrogate loss function needs to be set up. The classical activation functions that were used early in the development of neural networks were the sign, sigmoid, and the hyperbolic tangent functions:

Φ(v) = sign(v)    (sign function)
Φ(v) = 1/(1 + e^{−v})    (sigmoid function)
Φ(v) = (e^{2v} − 1)/(e^{2v} + 1)    (tanh function)

While the sign activation can be used to map to binary outputs at prediction time, its non-differentiability prevents its use for creating the loss function at training time. For example, while the perceptron uses the sign function for prediction, the perceptron criterion in training only requires linear activation. The sigmoid activation outputs a value in (0, 1), which is helpful in performing computations that should be interpreted as probabilities. Furthermore, it is also helpful in creating probabilistic outputs and constructing loss functions derived from maximum-likelihood models. The tanh function has a shape similar to that of the sigmoid function, except that it is horizontally re-scaled and vertically translated/re-scaled to [−1, 1]. The tanh and sigmoid functions are related as follows (see Exercise 3):

tanh(v) = 2 · sigmoid(2v) − 1

The tanh function is preferable to the sigmoid when the outputs of the computations are desired to be both positive and negative. Furthermore, its mean-centering and larger gradient (because of stretching) with respect to sigmoid makes it easier to train. The sigmoid and the tanh functions have been the historical tools of choice for incorporating nonlinearity in the neural network. In recent years, however, a number of piecewise linear activation functions have become more popular:

Φ(v) = max{v, 0}    (Rectified Linear Unit [ReLU])
Φ(v) = max{min[v, 1], −1}    (hard tanh)

The ReLU and hard tanh activation functions have largely replaced the sigmoid and soft tanh activation functions in modern neural networks because of the ease in training multi-layered neural networks with these activation functions. Pictorial representations of all the aforementioned activation functions are illustrated in Figure 1.8.

Figure 1.8: Various activation functions

It is noteworthy that all activation functions shown here are monotonic. Furthermore, other than the identity activation function, most¹ of the other activation functions saturate at large absolute values of the argument at which increasing further does not change the activation much. As we will see later, such nonlinear activation functions are also very useful in multilayer networks, because they help in creating more powerful compositions of different types of functions. Many of these functions are referred to as squashing functions, as they map the outputs from an arbitrary range to bounded outputs. The use of a nonlinear activation plays a fundamental role in increasing the modeling power of a network. If a network used only linear activations, it would not provide better modeling power than a single-layer linear network. This issue is discussed in Section 1.5.

¹The ReLU shows asymmetric saturation.
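The activation functions listed above can be implemented directly. The sketch below is illustrative (not from the text); it also checks the stated identity tanh(v) = 2 · sigmoid(2v) − 1 numerically.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def tanh(v):
    return (np.exp(2 * v) - 1) / (np.exp(2 * v) + 1)    # same result as np.tanh(v)

def relu(v):
    return np.maximum(v, 0.0)                            # Rectified Linear Unit

def hard_tanh(v):
    return np.maximum(np.minimum(v, 1.0), -1.0)

v = np.linspace(-3, 3, 7)
# The tanh is a re-scaled and translated sigmoid: tanh(v) = 2*sigmoid(2v) - 1.
print(np.allclose(tanh(v), 2 * sigmoid(2 * v) - 1))      # True
print(relu(v))
print(hard_tanh(v))
```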
  • 38. 14 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS x4 x3 x2 x1 INPUT LAYER HIDDEN LAYER x5 P(y=green) OUTPUTS P(y=blue) P(y=red) v1 v2 v3 SOFTMAX LAYER Figure 1.9: An example of multiple outputs for categorical classification with the use of a softmax layer 1.2.1.4 Choice and Number of Output Nodes The choice and number of output nodes is also tied to the activation function, which in turn depends on the application at hand. For example, if k-way classification is intended, k output values can be used, with a softmax activation function with respect to outputs v = [v1, . . . , vk] at the nodes in a given layer. Specifically, the activation function for the ith output is defined as follows: Φ(v)i = exp(vi) k j=1 exp(vj) ∀i ∈ {1, . . . , k} (1.12) It is helpful to think of these k values as the values output by k nodes, in which the in- puts are v1 . . . vk. An example of the softmax function with three outputs is illustrated in Figure 1.9, and the values v1, v2, and v3 are also shown in the same figure. Note that the three outputs correspond to the probabilities of the three classes, and they convert the three outputs of the final hidden layer into probabilities with the softmax function. The final hid- den layer often uses linear (identity) activations, when it is input into the softmax layer. Furthermore, there are no weights associated with the softmax layer, since it is only con- verting real-valued outputs into probabilities. The use of softmax with a single hidden layer of linear activations exactly implements a model, which is referred to as multinomial logistic regression [6]. Similarly, many variations like multi-class SVMs can be easily implemented with neural networks. Another example of a case in which multiple output nodes are used is the autoencoder, in which each input data point is fully reconstructed by the output layer. The autoencoder can be used to implement matrix factorization methods like singular value decomposition. This architecture will be discussed in detail in Chapter 2. The simplest neu- ral networks that simulate basic machine learning algorithms are instructive because they lie on the continuum between traditional machine learning and deep networks. By exploring these architectures, one gets a better idea of the relationship between traditional machine learning and neural networks, and also the advantages provided by the latter. 1.2.1.5 Choice of Loss Function The choice of the loss function is critical in defining the outputs in a way that is sensitive to the application at hand. For example, least-squares regression with numeric outputs
  • 39. 1.2. THE BASIC ARCHITECTURE OF NEURAL NETWORKS 15 requires a simple squared loss of the form (y − ŷ)2 for a single training instance with target y and prediction ŷ. One can also use other types of loss like hinge loss for y ∈ {−1, +1} and real-valued prediction ŷ (with identity activation): L = max{0, 1 − y · ŷ} (1.13) The hinge loss can be used to implement a learning method, which is referred to as a support vector machine. For multiway predictions (like predicting word identifiers or one of multiple classes), the softmax output is particularly useful. However, a softmax output is probabilistic, and therefore it requires a different type of loss function. In fact, for probabilistic predictions, two different types of loss functions are used, depending on whether the prediction is binary or whether it is multiway: 1. Binary targets (logistic regression): In this case, it is assumed that the observed value y is drawn from {−1, +1}, and the prediction ŷ is a an arbitrary numerical value on using the identity activation function. In such a case, the loss function for a single instance with observed value y and real-valued prediction ŷ (with identity activation) is defined as follows: L = log(1 + exp(−y · ŷ)) (1.14) This type of loss function implements a fundamental machine learning method, re- ferred to as logistic regression. Alternatively, one can use a sigmoid activation function to output ŷ ∈ (0, 1), which indicates the probability that the observed value y is 1. Then, the negative logarithm of |y/2 − 0.5 + ŷ| provides the loss, assuming that y is coded from {−1, 1}. This is because |y/2 − 0.5 + ŷ| indicates the probability that the prediction is correct. This observation illustrates that one can use various combina- tions of activation and loss functions to achieve the same result. 2. Categorical targets: In this case, if ŷ1 . . . ŷk are the probabilities of the k classes (using the softmax activation of Equation 1.9), and the rth class is the ground-truth class, then the loss function for a single instance is defined as follows: L = −log(ŷr) (1.15) This type of loss function implements multinomial logistic regression, and it is re- ferred to as the cross-entropy loss. Note that binary logistic regression is identical to multinomial logistic regression, when the value of k is set to 2 in the latter. The key point to remember is that the nature of the output nodes, the activation function, and the loss function depend on the application at hand. Furthermore, these choices also depend on one another. Even though the perceptron is often presented as the quintessential representative of single-layer networks, it is only a single representative out of a very large universe of possibilities. In practice, one rarely uses the perceptron criterion as the loss function. For discrete-valued outputs, it is common to use softmax activation with cross- entropy loss. For real-valued outputs, it is common to use linear activation with squared loss. Generally, cross-entropy loss is easier to optimize than squared loss.
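The softmax activation of Equation 1.12 and the loss functions of Equations 1.13–1.15 can be written in a few lines. The sketch below is illustrative only; the names are assumptions, y is coded from {−1, +1} for the binary losses, and the softmax subtracts the maximum argument purely for numerical stability (an implementation detail not discussed in the text).

```python
import numpy as np

def softmax(v):
    """Equation 1.12: convert k real-valued outputs into probabilities."""
    z = np.exp(v - np.max(v))        # subtracting max(v) does not change the result
    return z / z.sum()

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2                      # real-valued targets (least-squares)

def hinge_loss(y, y_hat):
    return max(0.0, 1.0 - y * y_hat)             # Equation 1.13 (support vector machine)

def logistic_loss(y, y_hat):
    return np.log(1.0 + np.exp(-y * y_hat))      # Equation 1.14 (logistic regression)

def cross_entropy_loss(probs, r):
    return -np.log(probs[r])                     # Equation 1.15 (r = ground-truth class)

p = softmax(np.array([2.0, 1.0, 0.1]))           # probabilities of three classes
print(p, p.sum())                                # the probabilities sum to 1
print(squared_loss(1, 0.8), hinge_loss(1, 0.8),
      logistic_loss(1, 0.8), cross_entropy_loss(p, 0))
```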
Figure 1.10: The derivatives of various activation functions

1.2.1.6 Some Useful Derivatives of Activation Functions

Most neural network learning is primarily related to gradient-descent with activation functions. For this reason, the derivatives of these activation functions are used repeatedly in this book, and gathering them in a single place for future reference is useful. This section provides details on the derivatives of these activation functions. Later chapters will extensively refer to these results.

1. Linear and sign activations: The derivative of the linear activation function is 1 at all places. The derivative of sign(v) is 0 at all values of v other than at v = 0, where it is discontinuous and non-differentiable. Because of the zero gradient and non-differentiability of this activation function, it is rarely used in the loss function even when it is used for prediction at testing time. The derivatives of the linear and sign activations are illustrated in Figure 1.10(a) and (b), respectively.

2. Sigmoid activation: The derivative of sigmoid activation is particularly simple, when it is expressed in terms of the output of the sigmoid, rather than the input. Let o be the output of the sigmoid function with argument v:

o = 1/(1 + exp(−v))    (1.16)

Then, one can write the derivative of the activation as follows:

∂o/∂v = exp(−v)/(1 + exp(−v))²    (1.17)
  • 41. 1.2. THE BASIC ARCHITECTURE OF NEURAL NETWORKS 17 The key point is that this sigmoid can be written more conveniently in terms of the outputs: ∂o ∂v = o(1 − o) (1.18) The derivative of the sigmoid is often used as a function of the output rather than the input. The derivative of the sigmoid activation function is illustrated in Figure 1.10(c). 3. Tanh activation: As in the case of the sigmoid activation, the tanh activation is often used as a function of the output o rather than the input v: o = exp(2v) − 1 exp(2v) + 1 (1.19) One can then compute the gradient as follows: ∂o ∂v = 4 · exp(2v) (exp(2v) + 1)2 (1.20) One can also write this derivative in terms of the output o: ∂o ∂v = 1 − o2 (1.21) The derivative of the tanh activation is illustrated in Figure 1.10(d). 4. ReLU and hard tanh activations: The ReLU takes on a partial derivative value of 1 for non-negative values of its argument, and 0, otherwise. The hard tanh function takes on a partial derivative value of 1 for values of the argument in [−1, +1] and 0, otherwise. The derivatives of the ReLU and hard tanh activations are illustrated in Figure 1.10(e) and (f), respectively. 1.2.2 Multilayer Neural Networks Multilayer neural networks contain more than one computational layer. The perceptron contains an input and output layer, of which the output layer is the only computation- performing layer. The input layer transmits the data to the output layer, and all com- putations are completely visible to the user. Multilayer neural networks contain multiple computational layers; the additional intermediate layers (between input and output) are referred to as hidden layers because the computations performed are not visible to the user. The specific architecture of multilayer neural networks is referred to as feed-forward net- works, because successive layers feed into one another in the forward direction from input to output. The default architecture of feed-forward networks assumes that all nodes in one layer are connected to those of the next layer. Therefore, the architecture of the neural network is almost fully defined, once the number of layers and the number/type of nodes in each layer have been defined. The only remaining detail is the loss function that is optimized in the output layer. Although the perceptron algorithm uses the perceptron criterion, this is not the only choice. It is extremely common to use softmax outputs with cross-entropy loss for discrete prediction and linear outputs with squared loss for real-valued prediction. As in the case of single-layer networks, bias neurons can be used both in the hidden layers and in the output layers. Examples of multilayer networks with or without the bias neurons are shown in Figure 1.11(a) and (b), respectively. In each case, the neural network
  • 42. 18 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS contains three layers. Note that the input layer is often not counted, because it simply transmits the data and no computation is performed in that layer. If a neural network contains p1 . . . pk units in each of its k layers, then the (column) vector representations of these outputs, denoted by h1 . . . hk have dimensionalities p1 . . . pk. Therefore, the number of units in each layer is referred to as the dimensionality of that layer. INPUT LAYER HIDDEN LAYER OUTPUT LAYER y x4 x3 x2 x1 x5 INPUT LAYER HIDDEN LAYER y OUTPUT LAYER +1 +1 BIAS NEURONS +1 BIAS NEURON x4 x3 x2 x1 x5 s n o r u e n s a i b h t i W ) b ( s n o r u e n s a i b o N ) a ( y x4 x3 x2 x1 x5 h11 h12 h13 h23 h22 h21 h1 h2 X SCALAR WEIGHTS ON CONNECTIONS WEIGHT MATRICES ON CONNECTIONS y X h1 h2 X 5 X 3 MATRIX 3 X 3 MATRIX 3 X 1 MATRIX Figure 1.11: The basic architecture of a feed-forward network with two hidden layers and a single output layer. Even though each unit contains a single scalar variable, one often represents all units within a single layer as a single vector unit. Vector units are often represented as rectangles and have connection matrices between them. INPUT LAYER HIDDEN LAYER OUTPUT LAYER xI 4 xI 3 xI 2 xI 1 xI 5 OUTPUT OF THIS LAYER PROVIDES REDUCED REPRESENTATION x4 x3 x2 x1 x5 Figure 1.12: An example of an autoencoder with multiple outputs
  • 43. 1.2. THE BASIC ARCHITECTURE OF NEURAL NETWORKS 19 The weights of the connections between the input layer and the first hidden layer are contained in a matrix W1 with size d × p1, whereas the weights between the rth hidden layer and the (r + 1)th hidden layer are denoted by the pr × pr+1 matrix denoted by Wr. If the output layer contains o nodes, then the final matrix Wk+1 is of size pk × o. The d-dimensional input vector x is transformed into the outputs using the following recursive equations: h1 = Φ(WT 1 x) [Input to Hidden Layer] hp+1 = Φ(WT p+1hp) ∀p ∈ {1 . . . k − 1} [Hidden to Hidden Layer] o = Φ(WT k+1hk) [Hidden to Output Layer] Here, the activation functions like the sigmoid function are applied in element-wise fashion to their vector arguments. However, some activation functions such as the softmax (which are typically used in the output layers) naturally have vector arguments. Even though each unit of a neural network contains a single variable, many architectural diagrams combine the units in a single layer to create a single vector unit, which is represented as a rectangle rather than a circle. For example, the architectural diagram in Figure 1.11(c) (with scalar units) has been transformed to a vector-based neural architecture in Figure 1.11(d). Note that the connections between the vector units are now matrices. Furthermore, an implicit assumption in the vector-based neural architecture is that all units in a layer use the same activation function, which is applied in element-wise fashion to that layer. This constraint is usually not a problem, because most neural architectures use the same activation function throughout the computational pipeline, with the only deviation caused by the nature of the output layer. Throughout this book, neural architectures in which units contain vector variables will be depicted with rectangular units, whereas scalar variables will correspond to circular units. Note that the aforementioned recurrence equations and vector architectures are valid only for layer-wise feed-forward networks, and cannot always be used for unconventional architectural designs. It is possible to have all types of unconventional designs in which inputs might be incorporated in intermediate layers, or the topology might allow connections between non-consecutive layers. Furthermore, the functions computed at a node may not always be in the form of a combination of a linear function and an activation. It is possible to have all types of arbitrary computational functions at nodes. Although a very classical type of architecture is shown in Figure 1.11, it is possible to vary on it in many ways, such as allowing multiple output nodes. These choices are often determined by the goals of the application at hand (e.g., classification or dimensionality reduction). A classical example of the dimensionality reduction setting is the autoencoder, which recreates the outputs from the inputs. Therefore, the number of outputs and inputs is equal, as shown in Figure 1.12. The constricted hidden layer in the middle outputs the reduced representation of each instance. As a result of this constriction, there is some loss in the representation, which typically corresponds to the noise in the data. The outputs of the hidden layers correspond to the reduced representation of the data. In fact, a shallow variant of this scheme can be shown to be mathematically equivalent to a well-known dimensionality reduction method known as singular value decomposition. 
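The recursive equations given above for a layer-wise feed-forward network amount to a loop over the weight matrices. The following sketch is a minimal illustration (the layer sizes, random initialization, and sigmoid activation are assumptions made for the example, not choices prescribed in the text):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def feed_forward(x, weight_matrices, activation=sigmoid):
    """Compute the output of a layer-wise feed-forward network.

    weight_matrices[r] has shape (p_r, p_{r+1}); the input x has shape (d,).
    Each layer computes h_{r+1} = activation(W_{r+1}^T h_r)."""
    h = x
    for W in weight_matrices:
        h = activation(W.T @ h)
    return h

rng = np.random.default_rng(0)
d, p1, p2, o = 5, 3, 3, 1                       # dimensionalities of the layers
weights = [rng.standard_normal((d, p1)),        # W1: input layer to hidden layer 1
           rng.standard_normal((p1, p2)),       # W2: hidden layer 1 to hidden layer 2
           rng.standard_normal((p2, o))]        # W3: hidden layer 2 to output layer
x = rng.standard_normal(d)
print(feed_forward(x, weights))                 # the network output
```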
As we will learn in Chapter 2, increasing the depth of the network results in inherently more powerful reductions. Although a fully connected architecture is able to perform well in many settings, better performance is often achieved by pruning many of the connections or sharing them in an insightful way. Typically, these insights are obtained by using a domain-specific understand- ing of the data. A classical example of this type of weight pruning and sharing is that of
  • 44. 20 CHAPTER 1. AN INTRODUCTION TO NEURAL NETWORKS the convolutional neural network architecture (cf. Chapter 8), in which the architecture is carefully designed in order to conform to the typical properties of image data. Such an ap- proach minimizes the risk of overfitting by incorporating domain-specific insights (or bias). As we will discuss later in this book (cf. Chapter 4), overfitting is a pervasive problem in neural network design, so that the network often performs very well on the training data, but it generalizes poorly to unseen test data. This problem occurs when the number of free parameters, (which is typically equal to the number of weight connections), is too large compared to the size of the training data. In such cases, the large number of parameters memorize the specific nuances of the training data, but fail to recognize the statistically significant patterns for classifying unseen test data. Clearly, increasing the number of nodes in the neural network tends to encourage overfitting. Much recent work has been focused both on the architecture of the neural network as well as on the computations performed within each node in order to minimize overfitting. Furthermore, the way in which the neu- ral network is trained also has an impact on the quality of the final solution. Many clever methods, such as pretraining (cf. Chapter 4), have been proposed in recent years in order to improve the quality of the learned solution. This book will explore these advanced training methods in detail. 1.2.3 The Multilayer Network as a Computational Graph It is helpful to view a neural network as a computational graph, which is constructed by piecing together many basic parametric models. Neural networks are fundamentally more powerful than their building blocks because the parameters of these models are learned jointly to create a highly optimized composition function of these models. The common use of the term “perceptron” to refer to the basic unit of a neural network is somewhat mis- leading, because there are many variations of this basic unit that are leveraged in different settings. In fact, it is far more common to use logistic units (with sigmoid activation) and piecewise/fully linear units as building blocks of these models. A multilayer network evaluates compositions of functions computed at individual nodes. A path of length 2 in the neural network in which the function f(·) follows g(·) can be considered a composition function f(g(·)). Furthermore, if g1(·), g2(·) . . . gk(·) are the func- tions computed in layer m, and a particular layer-(m + 1) node computes f(·), then the composition function computed by the layer-(m + 1) node in terms of the layer-m inputs is f(g1(·), . . . gk(·)). The use of nonlinear activation functions is the key to increasing the power of multiple layers. If all layers use an identity activation function, then a multilayer network can be shown to simplify to linear regression. It has been shown [208] that a net- work with a single hidden layer of nonlinear units (with a wide ranging choice of squashing functions like the sigmoid unit) and a single (linear) output layer can compute almost any “reasonable” function. As a result, neural networks are often referred to as universal function approximators, although this theoretical claim is not always easy to translate into practical usefulness. The main issue is that the number of hidden units required to do so is rather large, which increases the number of parameters to be learned. 
This results in practical problems in training the network with a limited amount of data. In fact, deeper networks are often preferred because they reduce the number of hidden units in each layer as well as the overall number of parameters.

The “building block” description is particularly appropriate for multilayer neural networks. Very often, off-the-shelf software for building neural networks2 provides analysts with access to these building blocks.

2 Examples include Torch [572], Theano [573], and TensorFlow [574].
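As a concrete illustration of this building-block style, the sketch below assembles a small multilayer network from off-the-shelf layer and activation blocks. It uses PyTorch, a successor of the Torch framework mentioned in the footnote; the layer sizes, loss, and data here are arbitrary illustrative assumptions rather than choices made in the book.

```python
import torch
import torch.nn as nn

# A small multilayer network assembled from off-the-shelf building blocks:
# each nn.Linear is a layer of units, and each nn.ReLU is the activation
# applied to that layer's outputs.
model = nn.Sequential(
    nn.Linear(5, 16),   # 5 input features -> 16 hidden units
    nn.ReLU(),
    nn.Linear(16, 8),   # second hidden layer
    nn.ReLU(),
    nn.Linear(8, 1),    # single output unit
)

loss_fn = nn.MSELoss()                                   # off-the-shelf loss function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step: forward pass, loss, backpropagation, parameter update.
x = torch.randn(32, 5)                                   # a mini-batch of 32 instances
y = torch.randn(32, 1)
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()                                          # gradients via backpropagation
optimizer.step()
```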
The analyst is able to specify the number and type of units in each layer along with an off-the-shelf or customized loss function. A deep neural network containing tens of layers can often be described in a few hundred lines of code. All the learning of the weights is done automatically by the backpropagation algorithm, which uses dynamic programming to work out the complicated parameter update steps of the underlying computational graph. The analyst does not have to spend the time and effort to explicitly work out these steps. This makes the process of trying different types of architectures relatively painless for the analyst. Building a neural network with many of the off-the-shelf software packages is often compared to a child constructing a toy from building blocks that appropriately fit with one another. Each block is like a unit (or a layer of units) with a particular type of activation. Much of this ease in training neural networks is attributable to the backpropagation algorithm, which shields the analyst from explicitly working out the parameter update steps of what is actually an extremely complicated optimization problem. Working out these steps is often the most difficult part of most machine learning algorithms, and an important contribution of the neural network paradigm is to bring modular thinking into machine learning. In other words, the modularity in neural network design translates to modularity in learning its parameters; the specific name for the latter type of modularity is “backpropagation.” This makes the design of neural networks more of an (experienced) engineer’s task than a mathematical exercise.

1.3 Training a Neural Network with Backpropagation

In the single-layer neural network, the training process is relatively straightforward because the error (or loss function) can be computed as a direct function of the weights, which allows easy gradient computation. In the case of multi-layer networks, the problem is that the loss is a complicated composition function of the weights in earlier layers. The gradient of a composition function is computed using the backpropagation algorithm. The backpropagation algorithm leverages the chain rule of differential calculus, which computes the error gradients in terms of summations of local-gradient products over the various paths from a node to the output. Although this summation has an exponential number of components (paths), one can compute it efficiently using dynamic programming. The backpropagation algorithm is a direct application of dynamic programming. It contains two main phases, referred to as the forward and backward phases, respectively. The forward phase is required to compute the output values and the local derivatives at various nodes, and the backward phase is required to accumulate the products of these local values over all paths from the node to the output:

1. Forward phase: In this phase, the inputs for a training instance are fed into the neural network. This results in a forward cascade of computations across the layers, using the current set of weights. The final predicted output can be compared to that of the training instance, and the derivative of the loss function with respect to the output is computed. The derivative of this loss now needs to be computed with respect to the weights in all layers in the backward phase.
2. Backward phase: The main goal of the backward phase is to learn the gradient of the loss function with respect to the different weights by using the chain rule of differential calculus. These gradients are used to update the weights. Since these gradients are learned in the backward direction, starting from the output node, this learning process is referred to as the backward phase. Consider a sequence of hidden units
h_1, h_2, …, h_k followed by output o, with respect to which the loss function L is computed. Furthermore, assume that the weight of the connection from hidden unit h_r to h_{r+1} is w_{(h_r, h_{r+1})}. Then, in the case that a single path exists from h_1 to o, one can derive the gradient of the loss function with respect to any of these edge weights using the chain rule:

$$\frac{\partial L}{\partial w_{(h_{r-1},h_r)}} = \frac{\partial L}{\partial o}\cdot\left[\frac{\partial o}{\partial h_k}\prod_{i=r}^{k-1}\frac{\partial h_{i+1}}{\partial h_i}\right]\cdot\frac{\partial h_r}{\partial w_{(h_{r-1},h_r)}} \qquad \forall r \in 1 \ldots k \tag{1.22}$$

The aforementioned expression assumes that only a single path from h_1 to o exists in the network, whereas an exponential number of paths might exist in reality. A generalized variant of the chain rule, referred to as the multivariable chain rule, computes the gradient in a computational graph, where more than one path might exist. This is achieved by adding the composition along each of the paths from h_1 to o. An example of the chain rule in a computational graph with two paths is shown in Figure 1.13.

[Figure 1.13: Illustration of the chain rule in computational graphs. The input weight w feeds the node f(w); its value y = z = f(w) is passed to the two nodes g(y) and h(z), whose outputs p = g(y) and q = h(z) are combined by K(p, q), so that the output is o = K(p, q) = K(g(f(w)), h(f(w))), an unwieldy composition function. The figure works out
$$\frac{\partial o}{\partial w} = \frac{\partial o}{\partial p}\cdot\frac{\partial p}{\partial w} + \frac{\partial o}{\partial q}\cdot\frac{\partial q}{\partial w} \qquad \text{[Multivariable Chain Rule]}$$
$$= \frac{\partial o}{\partial p}\cdot\frac{\partial p}{\partial y}\cdot\frac{\partial y}{\partial w} + \frac{\partial o}{\partial q}\cdot\frac{\partial q}{\partial z}\cdot\frac{\partial z}{\partial w} \qquad \text{[Univariate Chain Rule]}$$
$$= \underbrace{\frac{\partial K(p,q)}{\partial p}\cdot g'(y)\cdot f'(w)}_{\text{First path}} + \underbrace{\frac{\partial K(p,q)}{\partial q}\cdot h'(z)\cdot f'(w)}_{\text{Second path}}$$
The products of node-specific partial derivatives along paths from weight w to output o are aggregated. The resulting value yields the derivative of output o with respect to weight w. Only two paths between input and output exist in this simplified example.]

Therefore, one generalizes the above expression to the case where a set 𝒫 of paths exists from h_r to o:

$$\frac{\partial L}{\partial w_{(h_{r-1},h_r)}} = \frac{\partial L}{\partial o}\cdot\underbrace{\left[\sum_{[h_r,h_{r+1},\ldots,h_k,o]\in \mathcal{P}}\frac{\partial o}{\partial h_k}\prod_{i=r}^{k-1}\frac{\partial h_{i+1}}{\partial h_i}\right]}_{\text{Backpropagation computes } \Delta(h_r,o)=\frac{\partial L}{\partial h_r}}\cdot\frac{\partial h_r}{\partial w_{(h_{r-1},h_r)}} \tag{1.23}$$
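To make the two-path example of Figure 1.13 concrete, the sketch below uses arbitrarily chosen functions f, g, h, and K (the book does not specify numerical ones), sums the local-gradient products over the two paths from w to o as in Equation 1.23, and checks the result against a finite-difference estimate of ∂o/∂w.

```python
import math

# Hypothetical concrete choices for the functions in Figure 1.13.
f = lambda w: w * w          # y = z = f(w)
g = lambda y: math.sin(y)    # p = g(y)
h = lambda z: math.cos(z)    # q = h(z)
K = lambda p, q: p * q       # o = K(p, q)

df = lambda w: 2 * w
dg = lambda y: math.cos(y)
dh = lambda z: -math.sin(z)
dK_dp = lambda p, q: q
dK_dq = lambda p, q: p

w = 0.7
y = z = f(w)
p, q = g(y), h(z)

# Sum of local-derivative products over the two paths from w to o.
path1 = dK_dp(p, q) * dg(y) * df(w)
path2 = dK_dq(p, q) * dh(z) * df(w)
analytic = path1 + path2

# Finite-difference check of do/dw.
o = lambda w: K(g(f(w)), h(f(w)))
eps = 1e-6
numeric = (o(w + eps) - o(w - eps)) / (2 * eps)

print(analytic, numeric)     # the two values agree closely
```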
The computation of ∂h_r/∂w_{(h_{r-1},h_r)} on the right-hand side is straightforward and will be discussed below (cf. Equation 1.27). However, the path-aggregated term above [annotated by Δ(h_r, o) = ∂L/∂h_r] is aggregated over an exponentially increasing number of paths (with respect to path length), which seems to be intractable at first sight. A key point is that the computational graph of a neural network does not have cycles, and it is possible to compute such an aggregation in a principled way in the backward direction by first computing Δ(h_k, o) for nodes h_k closest to o, and then recursively computing these values for nodes in earlier layers in terms of the nodes in later layers. Furthermore, the value of Δ(o, o) for each output node is initialized as follows:

$$\Delta(o,o) = \frac{\partial L}{\partial o} \tag{1.24}$$

This type of dynamic programming technique is used frequently to efficiently compute all types of path-centric functions in directed acyclic graphs, which would otherwise require an exponential number of operations. The recursion for Δ(h_r, o) can be derived using the multivariable chain rule:

$$\Delta(h_r,o) = \frac{\partial L}{\partial h_r} = \sum_{h: h_r \Rightarrow h}\frac{\partial L}{\partial h}\cdot\frac{\partial h}{\partial h_r} = \sum_{h: h_r \Rightarrow h}\frac{\partial h}{\partial h_r}\,\Delta(h,o) \tag{1.25}$$

Since each h is in a later layer than h_r, Δ(h, o) has already been computed while evaluating Δ(h_r, o). However, we still need to evaluate ∂h/∂h_r in order to compute Equation 1.25. Consider a situation in which the edge joining h_r to h has weight w_{(h_r,h)}, and let a_h be the value computed in hidden unit h just before applying the activation function Φ(·). In other words, we have h = Φ(a_h), where a_h is a linear combination of its inputs from earlier-layer units incident on h. Then, by the univariate chain rule, the following expression for ∂h/∂h_r can be derived:

$$\frac{\partial h}{\partial h_r} = \frac{\partial h}{\partial a_h}\cdot\frac{\partial a_h}{\partial h_r} = \frac{\partial \Phi(a_h)}{\partial a_h}\cdot w_{(h_r,h)} = \Phi'(a_h)\cdot w_{(h_r,h)}$$

This value of ∂h/∂h_r is used in Equation 1.25, which is repeated recursively in the backward direction, starting with the output node. The corresponding updates in the backward direction are as follows:

$$\Delta(h_r,o) = \sum_{h: h_r \Rightarrow h}\Phi'(a_h)\cdot w_{(h_r,h)}\cdot\Delta(h,o) \tag{1.26}$$

Therefore, gradients are successively accumulated in the backward direction, and each node is processed exactly once in a backward pass. Note that the computation of Equation 1.25 (which requires a number of operations proportional to the number of outgoing edges) needs to be repeated for each incoming edge into the node in order to compute the gradient with respect to all edge weights. Finally, Equation 1.23 requires the computation of ∂h_r/∂w_{(h_{r-1},h_r)}, which is easily computed as follows:

$$\frac{\partial h_r}{\partial w_{(h_{r-1},h_r)}} = h_{r-1}\cdot\Phi'(a_{h_r}) \tag{1.27}$$

Here, the key gradient that is backpropagated is the derivative with respect to the layer activations, and the gradient with respect to the weights is easy to compute for any edge incident on the corresponding unit.
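The backward recursion of Equations 1.24–1.27 can be sketched directly in code. The example below is a minimal illustration for a small fully connected network with assumed sigmoid hidden activations, an identity output unit, and a squared loss; these concrete choices (and the layer sizes and data) are illustrative assumptions, not taken from the book. The resulting weight gradient is checked against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
dsigmoid = lambda a: sigmoid(a) * (1.0 - sigmoid(a))       # Phi'(a) for sigmoid units

# Assumed toy setup: 3 inputs, two sigmoid hidden layers, one linear output,
# squared loss L = 0.5 * (o - y)^2.
sizes = [3, 4, 3, 1]
W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
x = rng.standard_normal(3)
y = 1.5

def forward(W):
    h, a_list, h_list = x, [], [x]
    for l, Wl in enumerate(W):
        a = Wl @ h
        h = a if l == len(W) - 1 else sigmoid(a)           # identity activation at output
        a_list.append(a)
        h_list.append(h)
    return a_list, h_list

a_list, h_list = forward(W)
o = h_list[-1][0]
L = 0.5 * (o - y) ** 2

# Backward phase: Delta(o, o) = dL/do (Equation 1.24), then the recursion of
# Equation 1.26: Delta(h_r) = sum_h Phi'(a_h) * w_(h_r, h) * Delta(h).
delta = [None] * len(W)                 # delta[l] holds Delta for the units of layer l+1
delta[-1] = np.array([o - y])           # output layer (identity activation)
for l in range(len(W) - 2, -1, -1):
    phi_next = np.ones_like(a_list[l + 1]) if l + 1 == len(W) - 1 else dsigmoid(a_list[l + 1])
    delta[l] = W[l + 1].T @ (phi_next * delta[l + 1])

# Weight gradients via Equations 1.23 and 1.27:
# dL/dw_(h_{r-1}, h_r) = Delta(h_r) * Phi'(a_{h_r}) * h_{r-1}.
grads = []
for l in range(len(W)):
    phi = np.ones_like(a_list[l]) if l == len(W) - 1 else dsigmoid(a_list[l])
    grads.append(np.outer(delta[l] * phi, h_list[l]))

# Finite-difference check on one arbitrary weight.
eps = 1e-6
W_pert = [Wl.copy() for Wl in W]
W_pert[0][1, 2] += eps
_, h_pert = forward(W_pert)
L_pert = 0.5 * (h_pert[-1][0] - y) ** 2
print(grads[0][1, 2], (L_pert - L) / eps)   # these two values agree closely
```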
It is noteworthy that the dynamic programming recursion of Equation 1.26 can be computed in multiple ways, depending on which variables one uses for intermediate chaining. All these recursions are equivalent in terms of the final result of backpropagation. In the following, we give an alternative version of the dynamic programming recursion, which is more commonly seen in textbooks. Note that Equation 1.23 uses the variables in the hidden layers as the “chain” variables for the dynamic programming recursion. One can also use the pre-activation values of the variables for the chain rule. The pre-activation value in a neuron is obtained after applying the linear transform (but before applying the activation function), and these pre-activation values serve as the intermediate variables. The pre-activation value of the hidden variable h = Φ(a_h) is a_h. The differences between the pre-activation and post-activation values within a neuron are shown in Figure 1.7. Therefore, instead of Equation 1.23, one can use the following chain rule:

$$\frac{\partial L}{\partial w_{(h_{r-1},h_r)}} = \underbrace{\frac{\partial L}{\partial o}\cdot\Phi'(a_o)\cdot\left[\sum_{[h_r,h_{r+1},\ldots,h_k,o]\in \mathcal{P}}\frac{\partial a_o}{\partial a_{h_k}}\prod_{i=r}^{k-1}\frac{\partial a_{h_{i+1}}}{\partial a_{h_i}}\right]}_{\text{Backpropagation computes } \delta(h_r,o)=\frac{\partial L}{\partial a_{h_r}}}\cdot\underbrace{\frac{\partial a_{h_r}}{\partial w_{(h_{r-1},h_r)}}}_{h_{r-1}} \tag{1.28}$$

Here, we have introduced the notation δ(h_r, o) = ∂L/∂a_{h_r} instead of Δ(h_r, o) = ∂L/∂h_r for setting up the recursive equation. The value of δ(o, o) = ∂L/∂a_o is initialized as follows:

$$\delta(o,o) = \frac{\partial L}{\partial a_o} = \Phi'(a_o)\cdot\frac{\partial L}{\partial o} \tag{1.29}$$

Then, one can use the multivariable chain rule to set up a similar recursion:

$$\delta(h_r,o) = \frac{\partial L}{\partial a_{h_r}} = \sum_{h: h_r \Rightarrow h}\underbrace{\frac{\partial L}{\partial a_h}}_{\delta(h,o)}\cdot\underbrace{\frac{\partial a_h}{\partial a_{h_r}}}_{\Phi'(a_{h_r})\,w_{(h_r,h)}} = \Phi'(a_{h_r})\sum_{h: h_r \Rightarrow h} w_{(h_r,h)}\cdot\delta(h,o) \tag{1.30}$$

This recursion is the one found more commonly in textbooks discussing backpropagation. The partial derivative of the loss with respect to the weight is then computed using δ(h_r, o) as follows:

$$\frac{\partial L}{\partial w_{(h_{r-1},h_r)}} = \delta(h_r,o)\cdot h_{r-1} \tag{1.31}$$

As with the single-layer network, the process of updating the weights is repeated to convergence by repeatedly cycling through the training data in epochs. A neural network may sometimes require thousands of epochs through the training data to learn the weights at the different nodes. A detailed description of the backpropagation algorithm and associated issues is provided in Chapter 3. In this chapter, we provide only a brief discussion of these issues.

1.4 Practical Issues in Neural Network Training

In spite of the formidable reputation of neural networks as universal function approximators, considerable challenges remain with respect to actually training neural networks to provide this level of performance. These challenges are primarily related to several practical problems associated with training, the most important one of which is overfitting.
1.4.1 The Problem of Overfitting

The problem of overfitting refers to the fact that fitting a model to a particular training data set does not guarantee that it will provide good prediction performance on unseen test data, even if the model predicts the targets on the training data perfectly. In other words, there is always a gap between the training and test data performance, which is particularly large when the models are complex and the data set is small. In order to understand this point, consider a simple single-layer neural network on a data set with five attributes, where we use the identity activation to learn a real-valued target variable. This architecture is almost identical to that of Figure 1.3, except that the identity activation function is used in order to predict a real-valued target. Therefore, the network tries to learn the following function:

$$\hat{y} = \sum_{i=1}^{5} w_i\cdot x_i \tag{1.32}$$

Consider a situation in which the observed target value is real and is always twice the value of the first attribute, whereas the other attributes are completely unrelated to the target. However, we have only four training instances, which is one less than the number of features (free parameters). For example, the training instances could be as follows:

x1  x2  x3  x4  x5   y
 1   1   0   0   0   2
 2   0   1   0   0   4
 3   0   0   1   0   6
 4   0   0   0   1   8

The correct parameter vector in this case is W = [2, 0, 0, 0, 0], based on the known relationship between the first feature and the target. The training data also yield zero error with this solution, although the relationship needs to be learned from the given instances since it is not given to us a priori. However, the problem is that the number of training points is fewer than the number of parameters, and it is possible to find an infinite number of solutions with zero error. For example, the parameter set [0, 2, 4, 6, 8] also provides zero error on the training data. However, if we used this solution on unseen test data, it would likely provide very poor performance because the learned parameters are spuriously inferred and are unlikely to generalize well to new points in which the target is twice the first attribute (and the other attributes are random). This type of spurious inference is caused by the paucity of training data, in which random nuances become encoded into the model. As a result, the solution does not generalize well to unseen test data. This situation is similar to learning by rote, which is highly predictive for the training data but not predictive for unseen test data. Increasing the number of training instances improves the generalization power of the model, whereas increasing the complexity of the model reduces its generalization power. At the same time, when a lot of training data is available, an overly simple model is unlikely to capture complex relationships between the features and the target. A good rule of thumb is that the total number of training data points should be at least 2 to 3 times larger than the number of parameters in the neural network, although the precise number of data instances depends on the specific model at hand. In general, models with a larger number of parameters are said to have high capacity, and they require a larger amount of data in order to gain generalization power to unseen test data.
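The sketch below reproduces this toy example numerically: both W = [2, 0, 0, 0, 0] and W = [0, 2, 4, 6, 8] fit the four training instances exactly, yet they disagree wildly on a new point generated by the same rule y = 2·x1. The test point itself is an arbitrary choice, not taken from the book.

```python
import numpy as np

# Four training instances with five attributes; the target is always 2 * x1.
X_train = np.array([[1, 1, 0, 0, 0],
                    [2, 0, 1, 0, 0],
                    [3, 0, 0, 1, 0],
                    [4, 0, 0, 0, 1]], dtype=float)
y_train = np.array([2, 4, 6, 8], dtype=float)

W_true     = np.array([2, 0, 0, 0, 0], dtype=float)   # the "correct" relationship
W_spurious = np.array([0, 2, 4, 6, 8], dtype=float)   # also fits the training data

print(X_train @ W_true - y_train)       # [0. 0. 0. 0.]  -> zero training error
print(X_train @ W_spurious - y_train)   # [0. 0. 0. 0.]  -> zero training error

# A hypothetical unseen test point following the same rule (true y = 2 * 5 = 10).
x_test = np.array([5, 0, 0, 0, 0], dtype=float)
print(x_test @ W_true)       # 10.0 -> generalizes correctly
print(x_test @ W_spurious)   # 0.0  -> the spurious solution fails on unseen data
```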
The notion of overfitting is often understood in terms of the trade-off between bias and variance in machine learning. The key take-away from the bias-variance trade-off is that one does not always win with more powerful (i.e., less biased) models when working with limited training data, because of the higher variance of these models. For example, if we change the training data in the table above to a different set of four points, we are likely to learn a completely different set of parameters (from the random nuances of those points). This new model is likely to yield a completely different prediction on the same test instance as compared to the predictions using the first training data set. This type of variation in the prediction of the same test instance using different training data sets is a manifestation of model variance, which also adds to the error of the model; after all, both predictions of the same test instance could not possibly be correct. More complex models have the drawback of seeing spurious patterns in random nuances, especially when the training data are insufficient. One must be careful to pick an optimum point when deciding the complexity of the model. These notions are described in detail in Chapter 4.

Neural networks have always been known to be theoretically powerful enough to approximate any function [208]. However, the lack of data availability can result in poor performance; this is one of the reasons that neural networks only recently achieved prominence. The greater availability of data has revealed the advantages of neural networks over traditional machine learning (cf. Figure 1.2). In general, neural networks require careful design to minimize the harmful effects of overfitting, even when a large amount of data is available. This section provides an overview of some of the design methods used to mitigate the impact of overfitting.

1.4.1.1 Regularization

Since a larger number of parameters causes overfitting, a natural approach is to constrain the model to use fewer non-zero parameters. In the previous example, if we constrain the vector W to have only one non-zero component out of five components, it will correctly obtain the solution [2, 0, 0, 0, 0]. Smaller absolute values of the parameters also tend to overfit less. Since it is hard to constrain the values of the parameters, the softer approach of adding the penalty λ||W||^p to the loss function is used. The value of p is typically set to 2, which leads to Tikhonov regularization. In general, the squared value of each parameter (multiplied with the regularization parameter λ > 0) is added to the objective function. The practical effect of this change is that a quantity proportional to λw_i is subtracted from the update of the parameter w_i. An example of a regularized version of Equation 1.6 for mini-batch S and update step-size α > 0 is as follows:

$$W \Leftarrow W(1-\alpha\lambda) + \alpha\sum_{X\in S} E(X)\,X \tag{1.33}$$

Here, E(X) represents the current error (y − ŷ) between the observed and predicted values of training instance X. One can view this type of penalization as a kind of weight decay during the updates. Regularization is particularly important when the amount of available data is limited. A neat biological interpretation of regularization is that it corresponds to gradual forgetting, as a result of which “less important” (i.e., noisy) patterns are removed. In general, it is often advisable to use more complex models with regularization rather than simpler models without regularization.
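A minimal sketch of the weight-decay update of Equation 1.33 is given below, assuming a linear unit with error E(X) = y − ŷ and arbitrarily chosen values for the step size α, regularization parameter λ, mini-batch size, and synthetic data; these concrete values are illustrative assumptions rather than prescriptions from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary toy data: 5 features, targets generated as y = 2 * x1 plus a little noise.
X = rng.standard_normal((100, 5))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(100)

W = np.zeros(5)
alpha, lam, batch_size = 0.01, 0.01, 10

for epoch in range(200):
    for start in range(0, len(X), batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        errors = yb - Xb @ W        # E(X) = (y - y_hat) for each instance in the batch
        # Regularized update of Equation 1.33:
        # W <= W * (1 - alpha * lam) + alpha * sum_{X in S} E(X) * X
        W = W * (1 - alpha * lam) + alpha * errors @ Xb

print(W)   # close to [2, 0, 0, 0, 0]; the decay term keeps the spurious coefficients small
```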
As a side note, the general form of Equation 1.33 is used by many regularized machine learning models like least-squares regression (cf. Chapter 2), where E(X) is replaced by the error-function of that specific model. Interestingly, weight decay is only sparingly used in the
single-layer perceptron3 because it can sometimes cause overly rapid forgetting, with a small number of recently misclassified training points dominating the weight vector; the main issue is that the perceptron criterion is already a degenerate loss function with a minimum value of 0 at W = 0 (unlike its hinge-loss or least-squares cousins). This quirk is a legacy of the fact that the single-layer perceptron was originally defined in terms of biologically inspired updates rather than in terms of carefully thought-out loss functions. Convergence to an optimal solution was never guaranteed other than in linearly separable cases. For the single-layer perceptron, some other regularization techniques, which are discussed below, are more commonly used.

3 Weight decay is generally used with other loss functions in single-layer models and in all multi-layer models with a large number of parameters.

1.4.1.2 Neural Architecture and Parameter Sharing

The most effective way of building a neural network is by constructing the architecture of the neural network after giving some thought to the underlying data domain. For example, the successive words in a sentence are often related to one another, whereas the nearby pixels in an image are typically related. These types of insights are used to create specialized architectures for text and image data with fewer parameters. Furthermore, many of the parameters might be shared. For example, a convolutional neural network uses the same set of parameters to learn the characteristics of a local block of the image. The recent advancements in the use of neural networks like recurrent neural networks and convolutional neural networks are examples of this phenomenon.

1.4.1.3 Early Stopping

Another common form of regularization is early stopping, in which the gradient descent is ended after only a few iterations. One way to decide the stopping point is by holding out a part of the training data, and then testing the error of the model on the held-out set. The gradient-descent approach is terminated when the error on the held-out set begins to rise. Early stopping essentially reduces the parameter space to a smaller neighborhood around the initial values of the parameters. From this point of view, early stopping acts as a regularizer because it effectively restricts the parameter space.

1.4.1.4 Trading Off Breadth for Depth

As discussed earlier, a two-layer neural network can be used as a universal function approximator [208], if a large number of hidden units are used within the hidden layer. It turns out that networks with more layers (i.e., greater depth) tend to require far fewer units per layer, because the composition functions created by successive layers make the neural network more powerful. Increased depth is a form of regularization, as the features in later layers are forced to obey a particular type of structure imposed by the earlier layers. Increased constraints reduce the capacity of the network, which is helpful when there are limitations on the amount of available data. A brief explanation of this type of behavior is given in Section 1.5. The number of units in each layer can typically be reduced to such an extent that a deep network often has far fewer parameters even when added up over the greater number of layers. This observation has led to an explosion in research on the topic of deep learning.
Even though deep networks have fewer problems with respect to overfitting, they come with a different family of problems associated with ease of training. In particular, the loss derivatives with respect to the weights in different layers of the network tend to have vastly different magnitudes, which causes challenges in properly choosing step sizes. Different manifestations of this undesirable behavior are referred to as the vanishing and exploding gradient problems. Furthermore, deep networks often take unreasonably long to converge. These issues and design choices will be discussed later in this section and at several places throughout the book.

1.4.1.5 Ensemble Methods

A variety of ensemble methods like bagging are used in order to increase the generalization power of the model. These methods are applicable not just to neural networks but to any type of machine learning algorithm. However, in recent years, a number of ensemble methods that are specifically focused on neural networks have also been proposed. Two such methods include Dropout and Dropconnect. These methods can be combined with many neural network architectures to obtain an additional accuracy improvement of about 2% in many real settings. However, the precise improvement depends on the type of data and the nature of the underlying training. For example, normalizing the activations in hidden layers can reduce the effectiveness of Dropout methods, although one can gain from the normalization itself. Ensemble methods are discussed in Chapter 4.

1.4.2 The Vanishing and Exploding Gradient Problems

While increasing depth often reduces the number of parameters of the network, it leads to different types of practical issues. Propagating backwards using the chain rule has its drawbacks in networks with a large number of layers in terms of the stability of the updates. In particular, the updates in earlier layers can either be negligibly small (vanishing gradient) or they can be increasingly large (exploding gradient) in certain types of neural network architectures. This is primarily caused by the chain-like product computation in Equation 1.23, which can either exponentially increase or decay over the length of the path. In order to understand this point, consider a situation in which we have a multi-layer network with one neuron in each layer. Each local derivative along a path can be shown to be the product of the weight and the derivative of the activation function. The overall backpropagated derivative is the product of these values. If each such value is randomly distributed with an expected value less than 1, the product of these derivatives in Equation 1.23 will drop off exponentially fast with path length. If the individual values on the path have expected values greater than 1, it will typically cause the gradient to explode. Even if the local derivatives are randomly distributed with an expected value of exactly 1, the overall derivative will typically show instability, depending on how the values are actually distributed. In other words, the vanishing and exploding gradient problems are rather natural to deep networks, which makes their training process unstable.

Many solutions have been proposed to address this issue. For example, a sigmoid activation often encourages the vanishing gradient problem, because its derivative is less than 0.25 at all values of its argument (see Exercise 7), and is extremely small at saturation; this decay is illustrated by the sketch below.
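The sketch below is a toy illustration with one unit per layer and assumed unit weights: it multiplies the local derivatives Φ′(a)·w along a path and shows how quickly the backpropagated product shrinks for the sigmoid, whose derivative never exceeds 0.25. The specific depths and random pre-activations are arbitrary choices.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
dsigmoid = lambda a: sigmoid(a) * (1.0 - sigmoid(a))   # never exceeds 0.25

rng = np.random.default_rng(0)

# One neuron per layer, all weights assumed to be 1, random pre-activations.
for depth in [5, 10, 20, 40]:
    a = rng.standard_normal(depth)       # pre-activation at each layer along the path
    local = 1.0 * dsigmoid(a)            # local derivative = weight * Phi'(a)
    print(depth, np.prod(local))         # the product decays exponentially with depth
```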
A ReLU activation unit is known to be less likely to create a vanishing gradient problem because its derivative is always 1 for positive values of the argument. More discussions on this issue are provided in Chapter 3. Aside from the use of the ReLU, a whole host of gradient-descent tricks are used to improve the convergence behavior of the problem. In particular, the use
of adaptive learning rates and conjugate gradient methods can help in many cases. Furthermore, a recent technique called batch normalization is helpful in addressing some of these issues. These techniques are discussed in Chapter 3.

1.4.3 Difficulties in Convergence

Sufficiently fast convergence of the optimization process is difficult to achieve with very deep networks, as depth leads to increased resistance to the training process in terms of letting the gradients smoothly flow through the network. This problem is somewhat related to the vanishing gradient problem, but has its own unique characteristics. Therefore, some “tricks” have been proposed in the literature for these cases, including the use of gating networks and residual networks [184]. These methods are discussed in Chapters 7 and 8, respectively.

1.4.4 Local and Spurious Optima

The optimization function of a neural network is highly nonlinear and has many local optima. When the parameter space is large, and there are many local optima, it makes sense to spend some effort in picking good initialization points. One such method for improving neural network initialization is referred to as pretraining. The basic idea is to use either supervised or unsupervised training on shallow sub-networks of the original network in order to create the initial weights. This type of pretraining is done in a greedy and layerwise fashion, in which a single layer of the network is trained at a time in order to learn the initialization points of that layer. This type of approach provides initialization points that ignore drastically irrelevant parts of the parameter space to begin with. Furthermore, unsupervised pretraining often tends to avoid problems associated with overfitting. The basic idea here is that some of the minima in the loss function are spurious optima because they are exhibited only in the training data and not in the test data. Using unsupervised pretraining tends to move the initialization point closer to the basin of “good” optima in the test data. This is an issue associated with model generalization. Methods for pretraining are discussed in Section 4.7 of Chapter 4.

Interestingly, the notion of spurious optima is often viewed through the lens of model generalization in neural networks. This is a different perspective from traditional optimization. In traditional optimization, one does not focus on the differences in the loss functions of the training and test data, but rather on the shape of the loss function over only the training data. Surprisingly, the problem of local optima (from a traditional perspective) is a smaller issue in neural networks than one might normally expect from such a nonlinear function. Most of the time, the nonlinearity causes problems during the training process itself (e.g., failure to converge), rather than getting stuck in a local minimum.

1.4.5 Computational Challenges

A significant challenge in neural network design is the running time required to train the network. It is not uncommon to require weeks to train neural networks in the text and image domains. In recent years, advances in hardware technology such as Graphics Processor Units (GPUs) have helped to a significant extent. GPUs are specialized hardware processors that can significantly speed up the kinds of operations commonly used in neural networks.
In this sense, some algorithmic frameworks like Torch are particularly convenient because they have GPU support tightly integrated into the platform.
Although algorithmic advancements have played a role in the recent excitement around deep learning, a lot of the gains have come from the fact that the same algorithms can do much more on modern hardware. Faster hardware also supports algorithmic development, because one needs to repeatedly test computationally intensive algorithms to understand what works and what does not. For example, a recent neural model such as the long short-term memory has changed only modestly [150] since it was first proposed in 1997 [204]. Yet, the potential of this model has been recognized only recently because of the advances in the computational power of modern machines and the algorithmic tweaks associated with improved experimentation.

One convenient property of the vast majority of neural network models is that most of the computational heavy lifting is front-loaded during the training phase, while the prediction phase is often computationally efficient, because it requires a small number of operations (depending on the number of layers). This is important because the prediction phase is often far more time-critical than the training phase. For example, it is far more important to classify an image in real time (with a pre-built model), although the actual building of that model might have required a few weeks over millions of images. Methods have also been designed to compress trained networks in order to enable their deployment in mobile and space-constrained settings. These issues are discussed in Chapter 3.

1.5 The Secrets to the Power of Function Composition

Even though the biological metaphor sounds like an exciting way to intuitively justify the computational power of a neural network, it does not provide a complete picture of the settings in which neural networks perform well. At its most basic level, a neural network is a computational graph that performs compositions of simpler functions to provide a more complex function. Much of the power of deep learning arises from the fact that repeated composition of multiple nonlinear functions has significant expressive power. Even though the work in [208] shows that a single composition of a large number of squashing functions can approximate almost any function, this approach requires an extremely large number of units (i.e., parameters) in the network. This increases the capacity of the network, which causes overfitting unless the data set is extremely large. Much of the power of deep learning arises from the fact that the repeated composition of certain types of functions increases the representation power of the network, and therefore reduces the parameter space required for learning. Not all base functions are equally good at achieving this goal. In fact, the nonlinear squashing functions used in neural networks are not arbitrarily chosen, but are carefully designed because of certain types of properties. For example, imagine a situation in which the identity activation function is used in each layer, so that only linear functions are computed. In such a case, the resulting neural network is no stronger than a single-layer, linear network:

Theorem 1.5.1 A multi-layer network that uses only the identity activation function in all its layers reduces to a single-layer network performing linear regression.

Proof: Consider a network containing k hidden layers, which therefore contains a total of (k + 1) computational layers (including the output layer).
The corresponding (k + 1) weight matrices between successive layers are denoted by W_1 … W_{k+1}. Let x be the d-dimensional column vector corresponding to the input, h_1 … h_k be the column vectors corresponding to the hidden layers, and o be the m-dimensional column vector corresponding to the output.
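A quick numerical check of Theorem 1.5.1: stacking several identity-activation (purely linear) layers yields exactly the same input-output map as a single collapsed weight matrix W_{k+1} ⋯ W_1. The layer sizes below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes: d = 4 inputs, two hidden layers, m = 2 outputs (arbitrary choices).
sizes = [4, 6, 5, 2]
W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]

x = rng.standard_normal(4)

# Multi-layer network with identity activations: h_l = W_l h_{l-1}.
h = x
for Wl in W:
    h = Wl @ h                # no nonlinearity applied

# Equivalent single-layer network: one collapsed matrix W_3 W_2 W_1.
W_collapsed = W[2] @ W[1] @ W[0]
print(np.allclose(h, W_collapsed @ x))   # True: the deep linear net is a single linear model
```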
  • 55. Random documents with unrelated content Scribd suggests to you:
  • 56. prend cette chose tellement fanée et empuantie de toutes les puanteurs dans sa tendre et blanche main. Arrachée par le talent du poète, dans un doux accord avec le beau, une larme lentement s'écoule et tendrement fait sa part dans l'œuvre commune: nul lecteur qui n'y laisse une tache! O pensée grandiose et puissante! O résultat merveilleux! Qu'il est béni des dieux le poète qui possède un si noble talent! Grands et Petits, Pauvres et Riches, cette crasse est l'œuvre de tous! Ah! celui qui vit encore dans l'osbcurité, qui lutte pour se hausser jusqu'au laurier, assurément sent, dans sa brûlante ardeur, un désir lui tirailler le sein. Dieu bon, implore-t-il chaque jour, Accorde-moi ce bonheur indicible: fais que mes pauvres livres de vers soient aussi gras et crasseux! Mais si les poètes aspirent aux embrassements «de la grande impudique Qui tient dans ses bras l'univers, s'ils sont tellement avides du bruit qu'ils ouvrent leur escarcelle toute grande à la popularité, cette «gloire en gros sous», il n'en est point de même des vrais amants des livres, de ceux qui ne les font pas, mais qui les achètent, les parent, les enchâssent, en délectent leurs doigts, leurs yeux, et parfois leur esprit. Ecoutez la tirade mise par un poète anglais dans la bouche d'un bibliophile qui a prêté à un infidèle ami une reliure de Trautz- Bauzonnet et qui ne l'a jamais revue:
  • 57. Une fois prêté, un livre est perdu... Prêter des livres! Parbleu, je n'y consentirai plus. Vos prêteurs faciles ne sont que des fous que je redoute. Si les gens veulent des livres, par le grand Grolier, qu'ils les achètent! Qui est-ce qui prête sa femme lorsqu'il peut se dispenser du prêt? Nos femmes seront-elles donc tenues pour plus que nos livres chères? Nous en préserve de Thou! Jamais plus de livres ne prêterai. Ne dirait-on pas que c'est pour ce bibliophile échaudé que fut faite cette imitation supérieurement réussie des inscriptions dont les écoliers sont prodigues sur leurs rudiments et Selectæ: Qui ce livre volera, Pro suis criminibus Au gibet il dansera, Pedibus penditibus. Ce châtiment n'eût pas dépassé les mérites de celui contre lequel Lebrun fit son épigramme «à un Abbé qui aimait les lettres et un peu trop mes livres»: Non, tu n'es point de ces abbés ignares, Qui n'ont jamais rien lu que le Missel: Des bons écrits tu savoures le sel, Et te connais en livres beaux et rares. Trop bien le sais! car, lorsqu'à pas de loup Tu viens chez moi feuilleter coup sur coup Mes Elzévirs, ils craignent ton approche. Dans ta mémoire il en reste beaucoup; Beaucoup aussi te restent dans la poche.
  • 58. Un amateur de livres de nuance libérale pourrait adopter pour devise cette inscription mise à l'entrée d'une bibliothèque populaire anglaise: Tolle, aperi, recita, ne lœdas, claude, rapine! ce qui, traduit librement, signifie: «Prends, ouvre, lis, n'abîme pas, referme, mais surtout mets en place!» Punch, le Charivari d'Outre-Manche, en même temps qu'il incarne pour les Anglais notre Polichinelle et le Pulcinello des Italiens, résume à merveille la question. Voici, dit-il, «la tenue des livres enseignée en une leçon:—Ne les prêtez pas.»
  • 59. VII C'est qu'ils sont précieux, non pas tant par leur valeur intrinsèque,— bien que certains d'entre eux représentent plus que leur poids d'or,— que parce qu'on les aime, d'amour complexe peut-être, mais à coup sûr d'amour vrai. «Accordez-moi, seigneur, disait un ancien (c'est Jules Janin qui rapporte ces paroles), une maison pleine de livres, un jardin plein de fleurs!—Voulez-vous, disait-il encore, un abrégé de toutes les misères humaines, regardez un malheureux qui vend ses livres: Bibliothecam vendat.» Si le malheureux vend ses livres parce qu'il y est contraint, non pas par un caprice, une toquade de spéculation, une saute de goût, passant de la bibliophilie à l'iconophilie ou à la faïençomanie ou à tout autre dada frais éclos dans sa cervelle, ou encore sous le coup d'une passionnette irrésistible dont quelques mois auront bientôt usé l'éternité, comme il advint à Asselineau qui se défit de sa bibliothèque pour suivre une femme et qui peu après se défit de la femme pour se refaire une bibliothèque, si c'est, dis-je, par misère pure, il faut qu'il soit bien marqué par le destin et qu'il ait de triples galons dans l'armée des Pas-de-Chance, car les livres aiment ceux qui les aiment et, le plus souvent leur portent bonheur. Témoin, pour n'en citer qu'un, Grotius, qui s'échappa de prison en se mettant dans un coffre à livres, lequel faisait la navette entre sa maison et sa geôle, apportant et remportant les volumes qu'il avait obtenu de faire venir de la fameuse bibliothèque formée à grands frais et avec tant de soins, pour lui «et ses amis». Richard de Bury, évêque de Durham et chancelier d'Angleterre, qui vivait au XIVe siècle, rapporte, dans son Philobiblon, des vers latins
  • 60. de John Salisbury, dont voici le sens: Nul main que le fer a touchée n'est propre à manier les livres, ni celui dont le cœur regarde l'or avec trop de joie; les mêmes hommes n'aiment pas à la fois les livres et l'argent, et ton troupeau, ô Epicure, a pour les livres du dégoût; les avares et les amis des livre ne vont guère de compagnie, et ne demeurent point, tu peux m'en croire, en paix sous le même toit. «Personne donc, en conclut un peu vite le bon Richard de Bury, ne peut servir en même temps les livres et Mammon». Il reprend ailleurs: «Ceux qui sont férus de l'amour des livres font bon marché du monde et des richesses». Les temps sont quelque peu changés; il est en notre vingtième siècle des amateurs dont on ne saurait dire s'ils estiment des livres précieux pour en faire un jour une vente profitable, ou s'ils dépensent de l'argent à accroître leur bibliothèque pour la seule satisfaction de leurs goûts de collectionneur et de lettré. Toujours est-il que le Philobiblon n'est qu'un long dithyrambe en prose, naïf et convaincu, sur les livres et les joies qu'ils procurent. J'y prends au hasard quelques phrases caractéristiques, qui, enfouies dans ce vieux livre peu connu en France, n'ont pas encore eu le temps de devenir banales parmi nous. «Les livres nous charment lorsque la prospérité nous sourit; ils nous réconfortent comme des amis inséparables lorsque la fortune orageuse fronce le sourcil sur nous.»
  • 61. Voilà une pensée qui a été exprimée bien des fois et que nous retrouverons encore; mais n'a-t-elle pas un tour original qui lui donne je ne sais quel air imprévu de nouveauté? Le chapitre XV de l'ouvrage traite des «avantages de l'amour des livres.» On y lit ceci: «Il passe le pouvoir de l'intelligence humaine, quelque largement qu'elle ait pu boire à la fontaine de Pégase, de développer pleinement le titre du présent chapitre. Quand on parlerait avec la langue des hommes et des anges, quand on serait devenu un Mercure, un Tullius ou un Cicéron, quand on aurait acquis la douceur de l'éloquence lactée de Tite-Live, on aurait encore à s'excuser de bégayer comme Moïse, ou à confesser avec Jérémie qu'on n'est qu'un enfant et qu'on ne sait point parler.» Après ce début, qui s'étonnera que Richard de Bury fasse un devoir à tous les honnêtes gens d'acheter des livres et de les aimer. «Il n'est point de prix élevé qui doive empêcher quelqu'un d'acheter des livres s'il a l'argent qu'on en demande, à moins que ce ne soit pour résister aux artifices du vendeur ou pour attendre une plus favorable occasion d'achat... Qu'on doive acheter les livres avec joie et les vendre à regret, c'est à quoi Salomon, le soleil de l'humanité, nous exhorte dans les Proverbes: «Achète la vérité, dit-il, et ne vends pas la sagesse.» On ne s'attendait guère, j'imagine, à voir Salomon dans cette affaire. Et pourtant quoi de plus naturel que d'en appeler à l'auteur de la Sagesse en une question qui intéresse tous les sages? «Une bibliothèque prudemment composée est plus précieuse que toutes les richesses, et nulle des choses qui sont désirables ne sauraient lui être comparée. Quiconque donc se pique d'être zélé pour la vérité, le bonheur, la sagesse ou la science, et même pour la foi, doit nécessairement devenir un ami des livres.» En effet, ajoute-t-il, en un élan croissant d'enthousiasme, «les livres sont des maîtres qui nous instruisent sans verges ni férules, sans
  • 62. paroles irritées, sans qu'il faille leur donner ni habits, ni argent. Si vous venez à eux, ils ne dorment point; si vous questionnez et vous enquérez auprès d'eux, ils ne se récusent point; ils ne grondent point si vous faites des fautes; ils ne se moquent point de vous si vous êtes ignorant. O livres, seuls êtres libéraux et libres, qui donnez à tous ceux qui vous demandent, et affranchissez tous ceux qui vous servent fidèlement!» C'est pourquoi «les Princes, les prélats, les juges, les docteurs, et tous les autres dirigeants de l'Etat, d'autant qu'ils ont plus que les autres besoin de sagesse, doivent plus que les autres montrer du zèle pour ces vases où la sagesse est contenue.» Tel était l'avis du grand homme d'Etat Gladstone, qui acheta plus de trente cinq mille volumes au cours de sa longue vie. «Un collectionneur de livres, disait-il, dans une lettre adressée au fameux libraire londonien Quaritch (9 septembre 1896), doit, suivant l'idée que je m'en fais, posséder les six qualités suivantes: appétit, loisir, fortune, science, discernement et persévérance.» Et plus loin: «Collectionner des livres peut avoir ses ridicules et ses excentricités. Mais, en somme, c'est un élément revivifiant dans une société criblée de tant de sources de corruption.»
  • 63. VIII Cependant les livres, jusque dans la maison du bibliophile, ont un implacable ennemi: c'est la femme. Je les entends se plaindre du traitement que la maîtresse du logis, dès qu'elle en a l'occasion, leur fait subir: «La femme, toujours jalouse de l'amour qu'on nous porte, est impossible à jamais apaiser. Si elle nous aperçoit dans quelque coin, sans autre protection que la toile d'une araignée morte, elle nous insulte et nous ravale, le sourcil froncé, la parole amère, affirmant que, de tout le mobilier de la maison, nous seuls ne sommes pas nécessaires; elle se plaint que nous ne soyons utiles à rien dans le ménage, et elle conseille de nous convertir promptement en riches coiffures, en soie, en pourpre deux fois teinte, en robes et en fourrures, en laine et en toile. A dire vrai sa haine ne serait pas sans motifs si elle pouvait voir le fond de nos cœurs, si elle avait écouté nos secrets conseils, si elle avait lu le livre de Théophraste ou celui de Valerius, si seulement elle avait écouté le XXVe chapitre de l'Ecclésiaste avec des oreilles intelligentes.» (Richard de Bury.) M. Octave Uzanne rappelle, dans les Zigs-Zags d'un Curieux, un mot du bibliophile Jacob, frappé en manière de proverbe et qui est bien en situation ici: Amours de femme et de bouquin, Ne se chantent pas au même lutrin. Et il ajoute fort à propos: «La passion bouquinière n'admet pas de partage; c'est un peu, il faut le dire, une passion de retraite, un refuge extrême à cette heure de la vie où l'homme, déséquilibré par les cahots de l'existence mondaine, s'écrie, à l'exemple de Thomas
  • 64. Moore: Je n'avais jusqu'ici pour lire que les regards des femmes, et c'est la folie qu'ils m'ont enseignée!» Cette incapacité des femmes, sauf de rares exceptions, à goûter les joies du bibliophile, a été souvent remarquée. Une d'elles—et c'est ce qui rend la citation piquante—Mme Emile de Girardin, écrivait dans la chronique qu'elle signait à la Presse du pseudonyme de Vicomte de Launay: «Voyez ce beau salon d'étude, ce boudoir charmant; admirez-le dans ses détails, vous y trouverez tout ce qui peut séduire, tout ce que vous pouvez désirer, excepté deux choses pourtant: un beau livre et un joli tableau. Il n'y a peut-être pas dix femmes à Paris chez lesquelles ces deux raretés puissent être admirées.» C'est dans le même ordre d'idées que l'américain Hawthorne, le fils de l'auteur du Faune de Marbre et de tant d'autres ouvrages où une sereine philosophie se pare des agréments de la fiction, a écrit ces lignes curieuses: «Cœlebs, grand amateur de bouquins, se rase devant son miroir, et monologue sur la femme qui, d'après son expérience, jeune ou vieille, laide ou belle, est toujours le diable.» Et Cœlebs finit en se donnant à lui-même ces conseils judicieux: «Donc, épouse tes livres! Il ne recherche point d'autre maîtresse, l'homme sage qui regarde, non la surface, mais le fond des choses. Les livres ne flirtent ni ne feignent; ne boudent ni ne taquinent; ils ne se plaignent pas, ils disent les choses, mais ils s'abstiennent de vous les demander. »Que les livres soient ton harem, et toi leur Grand Turc. De rayon en rayon, ils attendent tes faveurs, silencieux et soumis! Jamais la jalousie ne les agite. Je n'ai nulle part rencontré Vénus, et j'accorde qu'elle est belle; toujours est-il qu'elle n'est pas de beaucoup si accommodante qu'eux.»
  • 65. IX Comment n'aimerait-on pas les livres? Il en est pour tous les goûts, ainsi qu'un auteur du Chansonnier des Grâces le fait chanter à un libraire vaudevillesque (1820): Venez, lecteurs, chez un libraire De vous servir toujours jaloux; Vos besoins ainsi que vos goûts Chez moi pourront se satisfaire. J'offre la Grammaire aux auteurs, Des Vers à nos jeunes poëtes; L'Esprit des lois aux procureurs, L'Essai sur l'homme à nos coquettes... Aux plus célèbres gastronomes Je donne Racine et Boileau! La Harpe aux chanteurs de caveau, Les Nuits d'Young aux astronomes; J'ai Descartes pour les joueurs, Voiture pour toutes les belles, Lucrèce pour les amateurs, Martial pour les demoiselles. Pour le plaideur et l'adversaire J'aurai l'avocat Patelin; Le malade et le médecin Chez moi consulteront Molière: Pour un sexe trop confiant Je garde le Berger fidèle; Et pour le malheureux amant Je réserverai la Pucelle.
  • 66. Armand Gouffé était d'un autre avis lorsqu'il fredonnait: Un sot avec cent mille francs Peut se passer de livres. Mais les sots très riches ont généralement juste assez d'esprit pour retrancher et masquer leur sottise derrière l'apparat imposant d'une grande bibliothèque, où les bons livres consacrés par le temps et le jugement universel se partagent les rayons avec les ouvrages à la mode. Car si, comme le dit le proverbe allemand, «l'âne n'est pas savant parce qu'il est chargé de livres», il est des cas où l'amas des livres peut cacher un moment la nature de l'animal. C'est en pensant aux amateurs de cet acabit que Chamfort a formulé cette maxime: «L'espoir n'est souvent au cœur que ce que la bibliothèque d'un château est à la personne du maître.» Lilly, le fameux auteur d'Euphues, disait: «Aie ton cabinet plein de livres plutôt que ta bourse pleine d'argent». Le malheur est que remplir l'un a vite fait de vider l'autre, si les sources dont celle-ci s'alimente ne sont pas d'une abondance continue. L'historien Gibbon allait plus loin lorsqu'il déclarait qu'il n'échangerait pas le goût de la lecture contre tous les trésors de l'Inde. De même Macaulay, qui aurait mieux aimé être un pauvre homme avec des livres qu'un grand roi sans livres. Bien avant eux, Claudius Clément, dans son traité latin des bibliothèques, tant privées que publiques, émettait, avec des restrictions de sage morale, une idée semblable: «Il y a peu de dépenses, de profusions, je dirais même de prodigalités plus louables que celles qu'on fait pour les livres, lorsqu'en eux on cherche un refuge, la volupté de l'âme, l'honneur, la pureté des mœurs, la doctrine et un renom immortel.» «L'or, écrivait Pétrarque à son frère Gérard, l'argent, les pierres précieuses, les vêtements de pourpre, les domaines, les tableaux, les chevaux, toutes les autres choses de ce genre offrent un plaisir
  • 67. changeant et de surface: les livres nous réjouissent jusqu'aux moëlles.» C'est encore Pétrarque qui traçait ce tableau ingénieux et charmant: «J'ai des amis dont la société m'est extrêmement agréable; ils sont de tous les âges et de tous les pays. Ils se sont distingués dans les conseils et sur les champs de bataille, et ont obtenu de grands honneurs par leur connaissance des sciences. Il est facile de trouver accès près d'eux; en effet ils sont toujours à mon service, je les admets dans ma société ou les congédie quand il me plaît. Ils ne sont jamais importuns, et ils répondent aussitôt à toutes les questions que je leur pose. Les uns me racontent les événements des siècles passés, les autres me révèlent les secrets de la nature. Il en est qui m'apprennent à vivre, d'autres à mourir. Certains, par leur vivacité, chassent mes soucis et répandent en moi la gaieté: d'autres donnent du courage à mon âme, m'enseignant la science si importante de contenir ses désirs et de ne compter absolument que sur soi. Bref, ils m'ouvrent les différentes avenues de tous les arts et de toutes les sciences, et je peux, sans risque, me fier à eux en toute occasion. En retour de leurs services, ils ne me demandent que de leur fournir une chambre commode dans quelque coin de mon humble demeure, où ils puissent reposer en paix, car ces amis- là trouvent plus de charmes à la tranquillité de la retraite qu'au tumulte de la société.» Il faut comparer ce morceau au passage où notre Montaigne, après avoir parlé du commerce des hommes et de l'amour des femmes, dont il dit: «l'un est ennuyeux par sa rareté, l'aultre se flestrit par l'usage», déclare que celui des livres «est bien plus seur et plus à nous; il cède aux premiers les aultres advantages, mais il a pour sa part la constance et facilité de son service... Il me console en la vieillesse et en la solitude; il me descharge du poids d'une oysiveté ennuyeuse et me desfaict à toute heure des compagnies qui me faschent; il esmousse les poinctures de la douleur, si elle n'est du tout extrême et maistresse. Pour me distraire d'une imagination importune, il n'est que de recourir aux livres...
  • 68. «Le fruict que je tire des livres... j'en jouïs, comme les avaricieux des trésors, pour sçavoir que j'en jouïray quand il me plaira: mon âme se rassasie et contente de ce droit de possession... Il ne se peult dire combien je me repose et séjourne en ceste considération qu'ils sont à mon côté pour me donner du plaisir à mon heure, et à recognoistre combien ils portent de secours à ma vie. C'est la meilleure munition que j'aye trouvé à cest humain voyage; et plainds extrêmement les hommes d'entendement qui l'ont à dire.» Sur ce thème, les variations sont infinies et rivalisent d'éclat et d'ampleur. Le roi d'Egypte Osymandias, dont la mémoire inspira à Shelley un sonnet si beau, avait inscrit au-dessus de sa «librairie»: Pharmacie de l'âme. «Une chambre sans livres est un corps sans âme», disait Cicéron. «La poussière des bibliothèques est une poussière féconde», renchérit Werdet. «Les livres ont toujours été la passion des honnêtes gens», affirme Ménage. Sir John Herschel était sûrement de ces honnêtes gens dont parle le bel esprit érudit du XVIIe siècle, car il fait cette déclaration, que Gibbon eût signée: «Si j'avais à demander un goût qui pût me conserver ferme au milieu des circonstances les plus diverses et être pour moi une source de bonheur et de gaieté à travers la vie et un bouclier contre ses maux, quelque adverses que pussent être les circonstances et de quelques rigueurs que le monde pût m'accabler, je demanderais le goût de la lecture.» «Autant vaut tuer un homme que détruire un bon livre», s'écrie Milton; et ailleurs, en un latin superbe que je renonce à traduire:
  • 69. Et totum rapiunt me, mea vita, libri. «Pourquoi, demandait Louis XIV au maréchal de Vivonne, passez- vous autant de temps avec vos livres?—Sire, c'est pour qu'ils donnent à mon esprit le coloris, la fraîcheur et la vie que donnent à mes joues les excellentes perdrix de Votre Majesté.» Voilà une aimable réponse de commensal et de courtisan. Mais combien d'enthousiastes se sentiraient choqués de cet épicuréisme flatteur et léger! Ce n'est pas le poète anglais John Florio, qui écrivait au commencement du même siècle, dont on eût pu attendre une explication aussi souriante et dégagée. Il le prend plutôt au tragique, quand il s'écrie: «Quels pauvres souvenirs sont statues, tombes et autres monuments que les hommes érigent aux princes, et qui restent en des lieux fermés où quelques-uns à peine les voient, en comparaison des livres, qui aux yeux du monde entier montrent comment ces princes vécurent, tandis que les autres monuments montrent où ils gisent!» C'est à dessein, je le répète, que j'accumule les citations d'auteurs étrangers. Non seulement, elles ont moins de chances d'être connues, mais elles possèdent je ne sais quelle saveur d'exotisme qu'on ne peut demander à nos écrivains nationaux. Ecoutons Isaac Barrow exposer sagement la leçon de son expérience: «Celui qui aime les livres ne manque jamais d'un ami fidèle, d'un conseiller salutaire, d'un gai compagnon, d'un soutien efficace. En étudiant, en pensant, en lisant, l'on peut innocemment se distraire et agréablement se récréer dans toutes les saisons comme dans toutes les fortunes.» Jeremy Collier, pensant de même, ne s'exprime guère autrement: «Les livres sont un guide dans la jeunesse et une récréation dans le grand âge. Ils nous soutiennent dans la solitude et nous empêchent
  • 70. d'être à charge à nous-mêmes. Ils nous aident à oublier les ennuis qui nous viennent des hommes et des choses; ils calment nos soucis et nos passions; ils endorment nos déceptions. Quand nous sommes las des vivants, nous pouvons nous tourner vers les morts: ils n'ont dans leur commerce, ni maussaderie, ni orgueil, ni arrière-pensée.» Parmi les joies que donnent les livres, celle de les rechercher, de les pourchasser chez les libraires et les bouquinistes, n'est pas la moindre. On a écrit des centaines de chroniques, des études, des traités et des livres sur ce sujet spécial. La Physiologie des quais de Paris, de M. Octave Uzanne, est connue de tous ceux qui s'intéressent aux bouquins. On se rappelle moins un brillant article de Théodore de Banville, qui parut jadis dans un supplément littéraire du Figaro; aussi me saura-t-on gré d'en citer ce joli passage: «Sur le quai Voltaire, il y aurait de quoi regarder et s'amuser pendant toute une vie; mais sans tourner, comme dit Hésiode, autour du chêne et du rocher, je veux nommer tout de suite ce qui est le véritable sujet, l'attrait vertigineux, le charme invincible: c'est le Livre ou, pour parler plus exactement, le Bouquin. Il y a sur le quai de nombreuses boutiques, dont les marchands, véritables bibliophiles, collectionnent, achètent dans les ventes, et offrent aux consommateurs de beaux livres à des prix assez honnêtes. Mais ce n'est pas là ce que veut l'amateur, le fureteur, le découvreur de trésors mal connus. Ce qu'il veut, c'est trouver pour des sous, pour rien, dans les boîtes posées sur le parapet, des livres, des bouquins qui ont—ou qui auront—un grand prix, ignoré du marchand. «Et à ce sujet, un duel, qui n'a pas eu de commencement et n'aura pas de fin, recommence et se continue sans cesse entre le marchand et l'amateur. Le libraire, qui, naturellement, veut vendre cher sa marchandise, se hâte de retirer des boîtes et de porter dans la boutique tout livre soupçonné d'avoir une valeur; mais par une force
• 71. and supernatural force, the Book always contrives to return, no one knows how or by what artifices, to the boxes on the parapet. For it too has its opinions; it wants to be bought by the amateur, with pennies, and above all and before all, for love!"

It is thus that M. Jean Rameau, poet and bibliophile, relates that in this year 1901 he found in one of the boxes on the quays, at twenty-five centimes, four volumes whose elegantly flowered spines bore an escutcheon with the motto Boutez en avant. It was an abridgment of La Calprenède's Faramond, and the four volumes had belonged to the Du Barry, of whom the Boutez en avant is characteristic enough. What did the poet do, once he had made inquiries of the Baron de Claye, who never hesitates on such questions? From seven o'clock in the morning he posted himself before the stall, swallowed the fog of the Seine, soaked it in and developed "atrocious rheumatisms" from it until eleven, for the bookstall keeper, a friend of idleness, came no earlier; he took the volumes and "thrust forward a one-franc piece," saying, "You'll let me have this for fifteen sous, eh?" "Fifteen sous it is!" said the good-natured dealer. And the poet fled with his booty and also, into the bargain, "with a little thrill of glory."

Since we are on the Quai Voltaire, let us not leave it without looking at it through the eyeglass of a poet whose name, Gabriel Marc, awakens no resounding echoes, but who, since 1875, the year in which he published his Sonnets parisiens, must at times have felt the emotion, bitter and sweet, expressed in the closing line of the graceful picture he entitles En bouquinant:

The Quai Voltaire is a veritable museum
In the open sun. Everywhere, to charm the eye,
Weapons, bronzes, stained glass, prints, objets d'art,
And our stroll finds endless amusement.

With their old and almost worn-out bindings,
Here are the manuscripts saved by chance;
• 72. Then the books: Montaigne, Hugo, Chénier, Ponsard,
Or the little canvas refused at the Salon.

The clear bluish sky darkens at the horizon.
The angler has cast his hook;
And the Seine ripples under the breath of the breeze.

We browse the boxes. We see again, beneath the dust of time,
All the dear forgotten ones; and sometimes, O surprise!
The volume of verse we wrote at twenty.

Another contemporary, Mr. J. Rogers Rees, who has written a whole book on the pleasures of the book hunter (The Pleasures of a Bookworm), finds in the company of books a source of human fraternity and solidarity. "A great love of books," he says, "has in itself, at all times, the power to enlarge the heart and fill it with broader and truly educative sympathies." An American poet, Mr. C. Alex. Nelson, ends a piece to which he gives the French title Les Livres with a naive prayer, whose last two lines are also in French in the original:

The lovers of the book, all with grateful hearts, have always breathed a single prayer:
Que le bon Dieu préserve les livres et sauve la Société!

Old Chaucer did not take so lofty a tone: gently and poetically he confessed that the attraction of books was less powerful over his heart than the attraction of nature. I wish I could put into my attempt at a translation a little of the poetic charm which, like a very old but persistent
• 73. and all the sweeter perfume, rises from these lines in the original text:

As for me, although I know but little, I delight in reading in books, and I give them my faith and full belief, and in my heart I hold them in such sincere reverence that there is no pleasure that could make me leave my books, except, a few rare times, on the holy day, and except also, surely, when the month of May has come and I hear the birds sing and the flowers begin to spring up; then farewell my book and my devotion!

How, again, am I to preserve in my unrhymed and laboriously rhythmed French the light and graceful, yet so clean and precise, harmony of this delightful stanza from an old popular song that every Englishman knows by heart:

Oh! a book and, in the shade, a nook, whether indoors or out,
the green leaves whispering overhead, or the street cries all about;
where I may read all at my ease both the new and the old!
For a brave good book to browse is worth more to me than gold!

But the praise must stop somewhere. I could not conclude better, on this engaging subject, than by taking for my own account, and offering to others, these lines from a man who was, in his day, the "prince of critics" and whose very name is beginning to be forgotten. We may all, friends, lovers, devotees, or maniacs of the book, exclaim with Jules Janin:
• 74. "O my books! my savings and my loves! a festival at my fireside, a rest in the shade of the old tree, my traveling companions!... and then, when all is over for me, the witnesses of my life and of my labor!"
• 76. X

Beside those who adore books, sing them, and bless them, there are those who detest them, disparage them, and cry anathema upon them; and the latter are not the less passionate. One sees clearly the transition, the passage from one of these two feelings to the other, together with their fundamental identity, in these lines by Jean Richepin (Les Blasphèmes):

Perhaps, O Solitude, it is you who deliver us
From that burning thirst which the drunkenness of books
Cannot quench with the floods of its black wine.
I have drunk of it as if I were a funnel,
That manufactured wine, that lamentable wine;
I have drunk of it till I dropped heavily under the table,
With full mouth, full love, full brain.
But always, on waking, I felt once more
The unquenchable thirst in my throat, grown harsher.

No one will be surprised, I think, that his throat being harsher, the poet should think of refreshing it better, and to that end buy superb books which will earn him, when his definitive biography comes to be written, a chapter, curious among many others, entitled "Richepin, bibliophile."

From a colder and more contemptuous vein, but after all not a very different one, comes this quip of Baudelaire's (Œuvres posthumes): "The man of wit, he who will never agree with anyone, should apply himself to loving the conversation of imbeciles and the reading
• 77. of bad books. He will draw from them bitter enjoyments that will amply repay his weariness."

The author of the treatise De la Bibliomanie shows no such finesse. He declares point-blank that "the mad passion for books often leads to libertinage and unbelief." One would still need to know where "the mad passion" begins, for the same writer (Bollioud-Mermet) cannot help acknowledging, a little further on, that "merely agreeable books contain, like the most serious ones, useful lessons for upright hearts and sound minds."

Petrarch had already expressed a similar thought in his elegant Renaissance Latin: "Books have led some to knowledge and others to madness, when they take in more than they can digest." Libri quosdam ad scientiam, quosdam ad insaniam deduxere, dum plus hauriunt quam digerunt. This recalls a pretty remark attributed to the painter Doyen about a man more erudite than judicious: "His head is the shop of a bookseller who is moving out."

It is, in short, a question of choice. It has been repeated often since Seneca, and had surely been said more than once before him: "What matters is not to have many books, but to have good ones." That is not the point of view of the bibliomaniacs; but we are not concerned with them for the moment. As for the fastidious bibliophiles, even those whom the book delights far more in itself than for what it contains, they are quite willing to have many, but above all to have beautiful ones, coming as close as possible to
• 78. perfection; and rather than admit flawed or mediocre copies onto their shelves, they too would take the motto Pauca sed bona.

"One of the diseases of this age," says an Englishman (Barnaby Rich), "is the multitude of books, which so overburden the reader that he can no longer digest the abundance of idle matter hatched and brought into the world every day, in forms as varied as the very features of the authors' faces."

To have many of them is largesse;
To study few of them is wisdom,

declares a proverb quoted by Jules Janin. Michel Montaigne, who turned books to account as much as any man alive and who spoke of them in the enthusiastic and grateful terms quoted above, nevertheless makes reservations, though only where physical development and health are concerned. "Books," he says, "have many qualities agreeable to those who know how to choose them; but no good comes without pain; it is a pleasure that is not clean and pure, any more than the others; it has its inconveniences, and weighty ones; the soul is exercised by it, but the body remains without action, grows dejected and sorrowful."

The soul itself comes to weariness and distaste, as the English poet Crabbe observes: "Books cannot always please, however good they may be; the mind does not always crave its food." An Italian proverb brings us back, with a lively and original phrase, to the moralists' theory of good and bad reading: "No thief worse than a bad book." What thief, indeed, ever thought of stealing innocence, purity, beliefs, noble impulses? And the moralists assure us that there are books that strip the soul of all these.
• 79. "Better had he never been born," exclaims Walter Scott, "who reads to arrive at doubt, who reads to arrive at contempt of the good." A contemporary English writer, Mr. Lowell, gives an ingenious turn to the expression of a similar idea when he writes: "Cato's counsel, Cum bonis ambula, Walk with the good, is quite as true if extended to books, for they too, by insensible degrees, impart their own nature to the mind that converses with them. They either raise us up or drag us down."

The wise, who weigh the pros and the cons and, keeping to a happy medium, grant books an influence sometimes good, sometimes bad, often nil, according to their nature and the readers' turn of mind, are, I believe, the most numerous. The Hellenist Egger puts into the formulation of this judiciously balanced opinion a tone of enthusiasm from which one can guess that he forgives the book all its misdeeds for the sake of the joys and the help it knows how to give. "The greatest personage who, for perhaps 3,000 years, has kept the world talking about him, by turns giant or pygmy, proud or modest, enterprising or timid, able to assume every form and every role, capable in turn of enlightening minds or perverting them, of stirring the passions or calming them, maker of factions or conciliator of parties, a true Proteus whom no definition can grasp: it is the Book."

A little-known moralist of the eighteenth century, L.-C. d'Arc, author of a book entitled Mes Loisirs, which I have quoted elsewhere, dreads excess in reading, that "labor of the lazy," as it has rather justly been called: "Reading is the food of the mind and sometimes the tomb of genius." "He who reads much exposes himself to thinking only after others."