Linear Algebra
Lecture slides for Chapter 2 of Deep Learning
Ian Goodfellow
2016-06-24
(Goodfellow 2016)
About this chapter
• Not a comprehensive survey of all of linear algebra
• Focused on the subset most relevant to deep
learning
• Larger subset: e.g., Linear Algebra by Georgi Shilov
Scalars
• A scalar is a single number
• Integers, real numbers, rational numbers, etc.
• We denote it with italic font:
a, n, x
Vectors
• A vector is a 1-D array of numbers:
$$\boldsymbol{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \tag{2.1}$$
• Can be real, binary, integer, etc.
• Example notation for type and size: $\boldsymbol{x} \in \mathbb{R}^n$
Vectors: A vector is an array of numbers arranged in order. We can identify each individual number by its index in that ordering. Typically we give vectors lower case names written in bold typeface, such as $\boldsymbol{x}$. The elements of the vector are identified by writing its name in italic typeface, with a subscript. The first element of $\boldsymbol{x}$ is $x_1$, the second element is $x_2$ and so on. We also need to say what kind of numbers are stored in the vector. If each element is in $\mathbb{R}$, and the vector has $n$ elements, then the vector lies in the set formed by taking the Cartesian product of $\mathbb{R}$ $n$ times, denoted as $\mathbb{R}^n$. When we need to explicitly identify the elements of a vector, we write them as a column enclosed in square brackets, as in Eq. 2.1.
We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.
Sometimes we need to index a set of elements of a vector. In this case, we define a set containing the indices and write the set as a subscript. For example, to access $x_1$, $x_3$ and $x_6$, we define the set $S = \{1, 3, 6\}$ and write $\boldsymbol{x}_S$. We use the $-$ sign to index the complement of a set. For example $\boldsymbol{x}_{-1}$ is the vector containing all elements of $\boldsymbol{x}$ except for $x_1$, and $\boldsymbol{x}_{-S}$ is the vector containing all of the elements of $\boldsymbol{x}$ except for $x_1$, $x_3$ and $x_6$.
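As a rough illustration of the vector indexing notation, the sketch below uses NumPy (an assumption; the slides name no library). NumPy counts positions from 0 while the slides count from 1, so the set S = {1, 3, 6} maps to positions 0, 2 and 5. The variable names `positions`, `x_S`, `x_minus_1` and `x_minus_S` are illustrative only.

```python
import numpy as np

# A vector in R^6 (values chosen only for illustration).
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# The slides use 1-based indices; NumPy uses 0-based positions.
S = [1, 3, 6]                         # index set from the text
positions = [i - 1 for i in S]        # shift to 0-based positions

x_S = x[positions]                    # x_S: the elements x1, x3, x6
x_minus_1 = np.delete(x, 0)           # x_{-1}: everything except x1
x_minus_S = np.delete(x, positions)   # x_{-S}: everything except x1, x3, x6

print(x_S)        # [10. 30. 60.]
print(x_minus_S)  # [20. 40. 50.]
```

`np.delete` returns a new array, so `x` itself is left unchanged.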
Matrices
• A matrix is a 2-D array of numbers (the first index gives the row, the second the column):
$$\begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix} \tag{2.2}$$
• Example notation for type and shape: $\boldsymbol{A} \in \mathbb{R}^{m \times n}$
Matrices: A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one. We usually give matrices upper-case variable names with bold typeface, such as $\boldsymbol{A}$. If a real-valued matrix $\boldsymbol{A}$ has a height of $m$ and a width of $n$, then we say that $\boldsymbol{A} \in \mathbb{R}^{m \times n}$. We usually identify the elements of a matrix using its name in italic but not bold font, with the row and column indices as subscripts, e.g. $A_{1,1}$. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets, as in Eq. 2.2.
Sometimes we may need to index matrix-valued expressions that are not just a single letter. In this case, we use subscripts after the expression, but do not convert anything to lower case. For example, $f(\boldsymbol{A})_{i,j}$ gives element $(i, j)$ of the matrix computed by applying the function $f$ to $\boldsymbol{A}$.
Tensors: In some cases we will need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. We denote a tensor named "A" with this typeface: $\mathsf{A}$. We identify the element of $\mathsf{A}$ at coordinates $(i, j, k)$ by writing $\mathsf{A}_{i,j,k}$.
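The matrix notation above can be tried out with a small NumPy sketch (NumPy is an assumption here, and the names `a_12`, `row_2`, `col_1` and `f_A_31` are illustrative). Positions are 0-based in NumPy, so $A_{1,2}$ in the slides is `A[0, 1]` in code.

```python
import numpy as np

# A real-valued matrix with height m = 3 and width n = 2, i.e. A in R^{3x2}.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
m, n = A.shape

a_12 = A[0, 1]       # A_{1,2}: row 1, column 2
row_2 = A[1, :]      # A_{2,:}: the second row
col_1 = A[:, 0]      # A_{:,1}: the first column

# f(A)_{i,j}: index the result of applying a function f to A.
f = np.exp
f_A_31 = f(A)[2, 0]  # element (3, 1) of f(A)
```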
Tensors
• A tensor is an array of numbers, that may have
• zero dimensions, and be a scalar
• one dimension, and be a vector
• two dimensions, and be a matrix
• or more dimensions.
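A minimal sketch of the list above, assuming NumPy arrays stand in for tensors: the `ndim` attribute reports the number of dimensions (axes) of each array. The array contents are placeholders.

```python
import numpy as np

scalar = np.array(3.5)                # zero dimensions
vector = np.array([1.0, 2.0, 3.0])    # one dimension
matrix = np.zeros((2, 3))             # two dimensions
tensor = np.zeros((2, 3, 4))          # three dimensions

for name, arr in [("scalar", scalar), ("vector", vector),
                  ("matrix", matrix), ("tensor", tensor)]:
    print(name, arr.ndim, arr.shape)

# Element at coordinates (i, j, k) = (1, 2, 3) in 1-based notation.
tensor[0, 1, 2] = 7.0
```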
Matrix Transpose
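$$\boldsymbol{A} = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \\ A_{3,1} & A_{3,2} \end{bmatrix} \Rightarrow \boldsymbol{A}^\top = \begin{bmatrix} A_{1,1} & A_{2,1} & A_{3,1} \\ A_{1,2} & A_{2,2} & A_{3,2} \end{bmatrix}$$
Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal.
One important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. See Fig. 2.1 for a graphical depiction of this operation. We denote the transpose of a matrix $\boldsymbol{A}$ as $\boldsymbol{A}^\top$, and it is defined such that
$$(\boldsymbol{A}^\top)_{i,j} = A_{j,i}. \tag{2.3}$$
Vectors can be thought of as matrices that contain only one column. The transpose of a vector is therefore a matrix with only one row.
Matrix product operations have many useful properties that make mathematical analysis of matrices more convenient. For example, matrix multiplication is distributive:
$$\boldsymbol{A}(\boldsymbol{B} + \boldsymbol{C}) = \boldsymbol{A}\boldsymbol{B} + \boldsymbol{A}\boldsymbol{C}. \tag{2.6}$$
It is also associative:
$$\boldsymbol{A}(\boldsymbol{B}\boldsymbol{C}) = (\boldsymbol{A}\boldsymbol{B})\boldsymbol{C}. \tag{2.7}$$
Matrix multiplication is not commutative (the condition $\boldsymbol{A}\boldsymbol{B} = \boldsymbol{B}\boldsymbol{A}$ does not always hold), unlike scalar multiplication. However, the dot product between two vectors is commutative:
$$\boldsymbol{x}^\top \boldsymbol{y} = \boldsymbol{y}^\top \boldsymbol{x}. \tag{2.8}$$
The transpose of a matrix product has a simple form:
$$(\boldsymbol{A}\boldsymbol{B})^\top = \boldsymbol{B}^\top \boldsymbol{A}^\top. \tag{2.9}$$
This allows us to demonstrate Eq. 2.8, by exploiting the fact that the value of such a product is a scalar and therefore equal to its own transpose: $\boldsymbol{x}^\top \boldsymbol{y} = (\boldsymbol{x}^\top \boldsymbol{y})^\top = \boldsymbol{y}^\top \boldsymbol{x}$.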
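The transpose identities can be spot-checked numerically. The sketch below assumes NumPy and uses random matrices with a fixed seed; it illustrates Eqs. 2.3, 2.8 and 2.9 on one example and is not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, illustrative data only
A = rng.standard_normal((3, 2))
B = rng.standard_normal((2, 4))
x = rng.standard_normal(5)
y = rng.standard_normal(5)

# Eq. 2.3: (A^T)_{i,j} = A_{j,i}
At = A.T
assert At.shape == (2, 3)
assert all(At[i, j] == A[j, i] for i in range(2) for j in range(3))

# Eq. 2.8: x^T y = y^T x
assert np.isclose(x @ y, y @ x)

# Eq. 2.9: (AB)^T = B^T A^T
assert np.allclose((A @ B).T, B.T @ A.T)
```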
Matrix (Dot) Product
The matrix product of matrices $\boldsymbol{A}$ and $\boldsymbol{B}$ is a third matrix $\boldsymbol{C}$. In order for this product to be defined, $\boldsymbol{A}$ must have the same number of columns as $\boldsymbol{B}$ has rows. If $\boldsymbol{A}$ is of shape $m \times n$ and $\boldsymbol{B}$ is of shape $n \times p$, then $\boldsymbol{C}$ is of shape $m \times p$. We can write the matrix product just by placing two or more matrices together, e.g.
$$\boldsymbol{C} = \boldsymbol{A}\boldsymbol{B}. \tag{2.4}$$
The product operation is defined by
$$C_{i,j} = \sum_k A_{i,k} B_{k,j}. \tag{2.5}$$
Note that the standard product of two matrices is not just a matrix containing the product of the individual elements. Such an operation exists and is called the element-wise product or Hadamard product, and is denoted as $\boldsymbol{A} \odot \boldsymbol{B}$.
The dot product between two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ of the same dimensionality is the matrix product $\boldsymbol{x}^\top \boldsymbol{y}$. We can think of the matrix product $\boldsymbol{C} = \boldsymbol{A}\boldsymbol{B}$ as computing $C_{i,j}$ as the dot product between row $i$ of $\boldsymbol{A}$ and column $j$ of $\boldsymbol{B}$.
[Figure: shape diagram for $\boldsymbol{C} = \boldsymbol{A}\boldsymbol{B}$. An $m \times p$ result equals an $m \times n$ matrix times an $n \times p$ matrix; the inner dimensions $n$ must match.]
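To make Eq. 2.5 concrete, here is a deliberately slow sketch that computes the matrix product entry by entry and can be compared against NumPy's built-in `@` operator. The function name `matmul_by_definition` is hypothetical, and the loop form is for illustration only; real code would use `A @ B` directly.

```python
import numpy as np

def matmul_by_definition(A, B):
    """Compute C = AB from C_{i,j} = sum_k A_{i,k} B_{k,j} (Eq. 2.5)."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError("columns of A must match rows of B")
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            # Dot product between row i of A and column j of B.
            C[i, j] = sum(A[i, k] * B[k, j] for k in range(n))
    return C

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # shape (2, 3)
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])           # shape (3, 2)

C = matmul_by_definition(A, B)       # shape (2, 2)
hadamard = A * A                     # element-wise (Hadamard) product, same shape as A
```

Note that `*` on NumPy arrays is the element-wise product, not the matrix product.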
Identity Matrix
$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
Figure 2.2: Example identity matrix: This is $\boldsymbol{I}_3$.
Linear algebra offers a powerful tool called matrix inversion that allows us to solve $\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b}$ (Eq. 2.11) for many values of $\boldsymbol{A}$.
To describe matrix inversion, we first need to define the concept of an identity matrix. An identity matrix is a matrix that does not change any vector when we multiply that vector by that matrix. We denote the identity matrix that preserves $n$-dimensional vectors as $\boldsymbol{I}_n$. Formally, $\boldsymbol{I}_n \in \mathbb{R}^{n \times n}$, and
$$\forall \boldsymbol{x} \in \mathbb{R}^n, \; \boldsymbol{I}_n \boldsymbol{x} = \boldsymbol{x}. \tag{2.20}$$
The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero. See Fig. 2.2 for an example.
Systems of Equations
We do not develop a comprehensive list of useful properties of the matrix product here, but be aware that many more exist.
We now know enough linear algebra notation to write down a system of linear equations:
$$\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b} \tag{2.11}$$
where $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ is a known matrix, $\boldsymbol{b} \in \mathbb{R}^m$ is a known vector, and $\boldsymbol{x} \in \mathbb{R}^n$ is a vector of unknown variables we would like to solve for. Each element $x_i$ of $\boldsymbol{x}$ is one of these unknown variables. Each row of $\boldsymbol{A}$ and each element of $\boldsymbol{b}$ provide another constraint. We can rewrite Eq. 2.11 as:
$$\boldsymbol{A}_{1,:}\boldsymbol{x} = b_1 \tag{2.12}$$
$$\boldsymbol{A}_{2,:}\boldsymbol{x} = b_2 \tag{2.13}$$
$$\dots \tag{2.14}$$
$$\boldsymbol{A}_{m,:}\boldsymbol{x} = b_m \tag{2.15}$$
or, even more explicitly, as:
$$A_{1,1}x_1 + A_{1,2}x_2 + \dots + A_{1,n}x_n = b_1 \tag{2.16}$$
$$A_{2,1}x_1 + A_{2,2}x_2 + \dots + A_{2,n}x_n = b_2 \tag{2.17}$$
$$\dots \tag{2.18}$$
$$A_{m,1}x_1 + A_{m,2}x_2 + \dots + A_{m,n}x_n = b_m. \tag{2.19}$$
Matrix-vector product notation provides a more compact representation for equations of this form.
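The three ways of writing the system (compact, row by row, and fully expanded) describe the same constraints. The sketch below, assuming NumPy and a made-up 3 x 2 example, builds each form and compares them; `row_form` and `scalar_form` are illustrative names.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [0.0, 1.0]])     # m = 3 equations, n = 2 unknowns
x = np.array([1.0, 2.0])       # a candidate solution
b = A @ x                      # compact form (Eq. 2.11); here b = [4, 7, 2]

# Row form (Eqs. 2.12 to 2.15): A_{i,:} x = b_i
row_form = np.array([A[i, :] @ x for i in range(A.shape[0])])

# Fully expanded form (as in Eq. 2.16): sum_j A_{i,j} x_j = b_i
scalar_form = np.array([sum(A[i, j] * x[j] for j in range(A.shape[1]))
                        for i in range(A.shape[0])])

assert np.allclose(row_form, b)
assert np.allclose(scalar_form, b)
```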
Solving Systems of Equations
• A linear system of equations can have:
• No solution
• Many solutions
• Exactly one solution: this means multiplication by
the matrix is an invertible function
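One way to tell the three cases apart numerically is a rank test (the Rouché-Capelli criterion). The slides do not mention this test, so treat the sketch below as an added illustration: `classify_system` is a hypothetical helper, and `np.linalg.matrix_rank` relies on a floating-point tolerance, so borderline matrices may be misclassified.

```python
import numpy as np

def classify_system(A, b):
    """Classify Ax = b as having 'none', 'one', or 'many' solutions."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    rank_A = np.linalg.matrix_rank(A)
    rank_Ab = np.linalg.matrix_rank(np.hstack([A, b]))
    if rank_A < rank_Ab:
        return "none"     # b lies outside the column space of A
    if rank_A == A.shape[1]:
        return "one"      # columns of A are linearly independent
    return "many"         # consistent system with free variables

print(classify_system([[1, 0], [0, 1]], [1, 2]))   # one
print(classify_system([[1, 1], [1, 1]], [1, 2]))   # none
print(classify_system([[1, 1], [2, 2]], [1, 2]))   # many
```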
Matrix Inversion
• Matrix inverse: the matrix inverse of $\boldsymbol{A}$ is denoted as $\boldsymbol{A}^{-1}$, and it is defined as the matrix such that
$$\boldsymbol{A}^{-1}\boldsymbol{A} = \boldsymbol{I}_n. \tag{2.21}$$
• Solving a system using an inverse: we can now solve Eq. 2.11 by the following steps:
$$\boldsymbol{A}\boldsymbol{x} = \boldsymbol{b} \tag{2.22}$$
$$\boldsymbol{A}^{-1}\boldsymbol{A}\boldsymbol{x} = \boldsymbol{A}^{-1}\boldsymbol{b} \tag{2.23}$$
$$\boldsymbol{I}_n\boldsymbol{x} = \boldsymbol{A}^{-1}\boldsymbol{b} \tag{2.24}$$
Since $\boldsymbol{I}_n\boldsymbol{x} = \boldsymbol{x}$, this gives $\boldsymbol{x} = \boldsymbol{A}^{-1}\boldsymbol{b}$.
• Numerically unstable, but useful for abstract analysis
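The following sketch, assuming NumPy and a small made-up system, follows the inverse-based steps literally and then solves the same system with `np.linalg.solve`, which does not form the inverse explicitly. In line with the slide's warning about numerical instability, the solver route is usually the one to prefer in practice.

```python
import numpy as np

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])
b = np.array([10.0, 12.0])

A_inv = np.linalg.inv(A)
assert np.allclose(A_inv @ A, np.eye(2))   # Eq. 2.21: A^{-1} A = I_n

x_via_inverse = A_inv @ b                  # x = A^{-1} b
x_via_solve = np.linalg.solve(A, b)        # solves Ax = b without forming A^{-1}

assert np.allclose(x_via_inverse, x_via_solve)
```

For this example both routes agree on x = [1, 2]; the difference only matters for poorly conditioned matrices.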
Invertibility
• Matrix can’t be inverted if…
• More rows than columns
• More columns than rows
• Redundant rows/columns (“linearly dependent”,
“low rank”)
Norms
• Functions that measure how “large” a vector is
• Similar to a distance between zero and the point
represented by the vector
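Formally, the $L^p$ norm is given by
$$\|\boldsymbol{x}\|_p = \left( \sum_i |x_i|^p \right)^{\frac{1}{p}}$$
for $p \in \mathbb{R}$, $p \geq 1$.
Norms, including the $L^p$ norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector $\boldsymbol{x}$ measures the distance from the origin to the point $\boldsymbol{x}$. More rigorously, a norm is any function $f$ that satisfies the following properties:
• $f(\boldsymbol{x}) = 0 \Rightarrow \boldsymbol{x} = \boldsymbol{0}$
• $f(\boldsymbol{x} + \boldsymbol{y}) \leq f(\boldsymbol{x}) + f(\boldsymbol{y})$ (the triangle inequality)
• $\forall \alpha \in \mathbb{R}, \; f(\alpha\boldsymbol{x}) = |\alpha| f(\boldsymbol{x})$
The $L^2$ norm, with $p = 2$, is known as the Euclidean norm. It is simply the Euclidean distance from the origin to the point identified by $\boldsymbol{x}$.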
Norms
• $L^p$ norm:
$$\|\boldsymbol{x}\|_p = \left( \sum_i |x_i|^p \right)^{\frac{1}{p}}$$
• Most popular norm: $L^2$ norm, $p = 2$
• $L^1$ norm, $p = 1$:
$$\|\boldsymbol{x}\|_1 = \sum_i |x_i| \tag{2.31}$$
• Max norm, infinite $p$:
$$\|\boldsymbol{x}\|_\infty = \max_i |x_i| \tag{2.32}$$
Sometimes we need to measure the size of a vector. In machine learning, we usually measure the size of vectors using a function called a norm.
The squared $L^2$ norm can be undesirable because it increases very slowly near the origin. In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations, but retains mathematical simplicity: the $L^1$ norm, which simplifies to Eq. 2.31. The $L^1$ norm is commonly used in machine learning when the difference between zero and nonzero elements is very important. Every time an element of $\boldsymbol{x}$ moves away from 0 by $\epsilon$, the $L^1$ norm increases by $\epsilon$.
We sometimes measure the size of the vector by counting its number of nonzero elements. Some authors refer to this function as the "$L^0$ norm," but this is incorrect terminology. The number of non-zero entries in a vector is not a norm, because scaling the vector by $\alpha$ does not change the number of nonzero entries. The $L^1$ norm is often used as a substitute for the number of nonzero entries.
One other norm that commonly arises in machine learning is the $L^\infty$ norm, also known as the max norm. This norm simplifies to the absolute value of the element with the largest magnitude in the vector (Eq. 2.32).
Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm
$$\|\boldsymbol{A}\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}.$$
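The norms discussed above can be computed directly from their definitions. In this sketch (NumPy assumed), `lp_norm` is a hypothetical helper that follows the $L^p$ formula, the nonzero count is included to show why the so-called "L0 norm" fails the scaling property, and the Frobenius line uses the standard definition: the square root of the sum of squared entries.

```python
import numpy as np

def lp_norm(x, p):
    """L^p norm: (sum_i |x_i|^p)^(1/p), defined here for p >= 1."""
    if p < 1:
        raise ValueError("p must be >= 1 for a norm")
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0, 0.0])

l1 = lp_norm(x, 1)                # 7.0
l2 = lp_norm(x, 2)                # 5.0
l_inf = np.max(np.abs(x))         # 4.0, the max norm
nonzeros = np.count_nonzero(x)    # 2, the so-called "L0 norm" (not a true norm)

# Scaling by alpha scales a norm by |alpha| but leaves the nonzero count unchanged.
alpha = -2.0
assert np.isclose(lp_norm(alpha * x, 1), abs(alpha) * l1)
assert np.count_nonzero(alpha * x) == nonzeros

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
frobenius = np.sqrt(np.sum(A ** 2))   # 5.0
```

The same values are available through `np.linalg.norm` (for example with `ord=1`, `ord=np.inf`, or `'fro'` for matrices).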
Special Matrices and Vectors
• Unit vector: a vector with unit norm:
$$\|\boldsymbol{x}\|_2 = 1. \tag{2.36}$$
• Symmetric matrix: any matrix that is equal to its own transpose:
$$\boldsymbol{A} = \boldsymbol{A}^\top. \tag{2.35}$$
• Orthogonal matrix: a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:
$$\boldsymbol{A}^\top\boldsymbol{A} = \boldsymbol{A}\boldsymbol{A}^\top = \boldsymbol{I}. \tag{2.37}$$
This implies that
$$\boldsymbol{A}^{-1} = \boldsymbol{A}^\top. \tag{2.38}$$
We may derive a machine learning algorithm in terms of arbitrary matrices, but obtain a less expensive (and less descriptive) algorithm by restricting some matrices to be diagonal. Not all diagonal matrices need be square. It is possible to construct a rectangular diagonal matrix. Non-square diagonal matrices do not have inverses but it is still possible to multiply by them cheaply. For a non-square diagonal matrix $\boldsymbol{D}$, the product $\boldsymbol{D}\boldsymbol{x}$ will involve scaling each element of $\boldsymbol{x}$, and either concatenating some zeros to the result if $\boldsymbol{D}$ is taller than it is wide, or discarding some of the last elements of the vector if $\boldsymbol{D}$ is wider than it is tall.
Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, if $\boldsymbol{A}$ is a matrix of distance measurements, with $A_{i,j}$ giving the distance from point $i$ to point $j$, then $A_{i,j} = A_{j,i}$ because distance functions are symmetric.
A vector $\boldsymbol{x}$ and a vector $\boldsymbol{y}$ are orthogonal to each other if $\boldsymbol{x}^\top\boldsymbol{y} = 0$. If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other. In $\mathbb{R}^n$, at most $n$ vectors may be mutually orthogonal with nonzero norm. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.
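A short sketch of the three special objects, assuming NumPy. The example points, the rotation angle and the variable names are made up for illustration; the rotation matrix is just one convenient example of an orthogonal matrix.

```python
import numpy as np

# Unit vector: rescale any nonzero vector to have L2 norm 1 (Eq. 2.36).
v = np.array([3.0, 4.0])
u = v / np.linalg.norm(v)
assert np.isclose(np.linalg.norm(u), 1.0)

# Symmetric matrix: pairwise distances satisfy A_{i,j} = A_{j,i} (Eq. 2.35).
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
diffs = points[:, None, :] - points[None, :, :]
D = np.linalg.norm(diffs, axis=-1)
assert np.allclose(D, D.T)

# Orthogonal matrix: a 2-D rotation satisfies Q^T Q = Q Q^T = I (Eq. 2.37),
# so its inverse is simply its transpose (Eq. 2.38).
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(Q.T @ Q, np.eye(2))
assert np.allclose(Q @ Q.T, np.eye(2))
assert np.allclose(np.linalg.inv(Q), Q.T)
```

Because the inverse of an orthogonal matrix is its transpose, inverting it costs no more than rearranging its entries.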
Eigendecomposition
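• Eigenvector and eigenvalue:
$$\boldsymbol{A}\boldsymbol{v} = \lambda\boldsymbol{v}. \tag{2.39}$$
• Eigendecomposition of a diagonalizable matrix:
$$\boldsymbol{A} = \boldsymbol{V}\,\mathrm{diag}(\boldsymbol{\lambda})\,\boldsymbol{V}^{-1}. \tag{2.40}$$
• Every real symmetric matrix has a real, orthogonal eigendecomposition:
$$\boldsymbol{A} = \boldsymbol{Q}\boldsymbol{\Lambda}\boldsymbol{Q}^\top. \tag{2.41}$$
One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.
An eigenvector of a square matrix $\boldsymbol{A}$ is a non-zero vector $\boldsymbol{v}$ such that multiplication by $\boldsymbol{A}$ alters only the scale of $\boldsymbol{v}$ (Eq. 2.39). The scalar $\lambda$ is known as the eigenvalue corresponding to this eigenvector. (One can also find a left eigenvector such that $\boldsymbol{v}^\top\boldsymbol{A} = \lambda\boldsymbol{v}^\top$, but we are usually concerned with right eigenvectors.)
If $\boldsymbol{v}$ is an eigenvector of $\boldsymbol{A}$, then so is any rescaled vector $s\boldsymbol{v}$ for $s \in \mathbb{R}$, $s \neq 0$. Moreover, $s\boldsymbol{v}$ still has the same eigenvalue. For this reason, we usually only look for unit eigenvectors.
If $\boldsymbol{A}$ has $n$ linearly independent eigenvectors $\boldsymbol{v}^{(1)}, \dots, \boldsymbol{v}^{(n)}$ with corresponding eigenvalues $\lambda_1, \dots, \lambda_n$, we may concatenate the eigenvectors to form a matrix $\boldsymbol{V}$ with one eigenvector per column: $\boldsymbol{V} = [\boldsymbol{v}^{(1)}, \dots, \boldsymbol{v}^{(n)}]$. Likewise, we can concatenate the eigenvalues to form a vector $\boldsymbol{\lambda} = [\lambda_1, \dots, \lambda_n]^\top$. The eigendecomposition of $\boldsymbol{A}$ is then given by Eq. 2.40.
We have seen that constructing matrices with specific eigenvalues and eigenvectors allows us to stretch space in desired directions. However, we often want to decompose matrices into their eigenvalues and eigenvectors. Doing so can help us to analyze certain properties of the matrix, much as decomposing an integer into its prime factors can help us understand the behavior of that integer.
Not every matrix can be decomposed into eigenvalues and eigenvectors. In some cases, the decomposition exists, but may involve complex rather than real numbers. In this book, we usually need to decompose only a specific class of matrices that have a simple decomposition. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues (Eq. 2.41), where $\boldsymbol{Q}$ is an orthogonal matrix composed of eigenvectors of $\boldsymbol{A}$, and $\boldsymbol{\Lambda}$ is a diagonal matrix. The eigenvalue $\Lambda_{i,i}$ is associated with the eigenvector in column $i$ of $\boldsymbol{Q}$.
Figure caption: An example of the effect of eigenvectors and eigenvalues. Here, we have a matrix $\boldsymbol{A}$ with two orthonormal eigenvectors, $\boldsymbol{v}^{(1)}$ with eigenvalue $\lambda_1$ and $\boldsymbol{v}^{(2)}$ with eigenvalue $\lambda_2$. (Left) We plot the set of all unit vectors $\boldsymbol{u} \in \mathbb{R}^2$ as a unit circle. (Right) We plot the set of all points $\boldsymbol{A}\boldsymbol{u}$. By observing the way that $\boldsymbol{A}$ distorts the unit circle, we can see that it scales space in direction $\boldsymbol{v}^{(i)}$ by $\lambda_i$.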
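Both decompositions can be checked on a small example. The sketch below assumes NumPy: `np.linalg.eigh` is its routine for symmetric (Hermitian) matrices and returns orthonormal eigenvectors as columns, while `np.linalg.eig` handles general square matrices. The 2 x 2 matrix is a made-up example.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])      # real symmetric, so Eq. 2.41 applies

# eigh returns eigenvalues in ascending order and eigenvectors as columns of Q.
eigenvalues, Q = np.linalg.eigh(A)

# Each column satisfies A v = lambda v (Eq. 2.39).
for lam, v in zip(eigenvalues, Q.T):
    assert np.allclose(A @ v, lam * v)

# Reassemble A = Q Lambda Q^T (Eq. 2.41) and confirm Q is orthogonal.
Lambda = np.diag(eigenvalues)
assert np.allclose(Q @ Lambda @ Q.T, A)
assert np.allclose(Q.T @ Q, np.eye(2))

# The general form A = V diag(lambda) V^{-1} (Eq. 2.40) via np.linalg.eig.
lam_general, V = np.linalg.eig(A)
assert np.allclose(V @ np.diag(lam_general) @ np.linalg.inv(V), A)
```

For this matrix the eigenvalues are 1 and 3, so multiplication by `A` stretches space by a factor of 3 along one eigenvector and leaves the other direction unchanged.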
Effect of Eigenvalues
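[Figure: effect of eigenvectors and eigenvalues, in two panels with both axes running from −3 to 3. Left panel, "Before multiplication": vertical axis $x_1$, showing the unit circle and the eigenvector $\boldsymbol{v}^{(1)}$. Right panel, "After multiplication": vertical axis $x'_1$, showing $\boldsymbol{v}^{(1)}$ and its scaled image $\lambda_1\boldsymbol{v}^{(1)}$.]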
Singular Value Decomposition
• Similar to eigendecomposition
• More general; matrix need not be square
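The singular value decomposition is more generally applicable. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.
Recall that the eigendecomposition involves analyzing a matrix $\boldsymbol{A}$ to discover a matrix $\boldsymbol{V}$ of eigenvectors and a vector of eigenvalues $\boldsymbol{\lambda}$ such that we can rewrite $\boldsymbol{A}$ as
$$\boldsymbol{A} = \boldsymbol{V}\,\mathrm{diag}(\boldsymbol{\lambda})\,\boldsymbol{V}^{-1}. \tag{2.42}$$
The singular value decomposition is similar, except this time we will write $\boldsymbol{A}$ as a product of three matrices:
$$\boldsymbol{A} = \boldsymbol{U}\boldsymbol{D}\boldsymbol{V}^\top. \tag{2.43}$$
Suppose that $\boldsymbol{A}$ is an $m \times n$ matrix. Then $\boldsymbol{U}$ is defined to be an $m \times m$ matrix, $\boldsymbol{D}$ to be an $m \times n$ matrix, and $\boldsymbol{V}$ to be an $n \times n$ matrix.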
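The shapes in Eq. 2.43 can be confirmed with NumPy's `np.linalg.svd` (an assumption; the slides name no library). NumPy returns the singular values as a 1-D array `s` and returns $V^\top$ rather than $V$, so the sketch rebuilds the rectangular matrix `D` by hand before multiplying the factors back together.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # m = 3, n = 2 (not square)
m, n = A.shape

# full_matrices=True gives U (m x m) and V^T (n x n); s holds the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build the m x n matrix D with the singular values on its main diagonal.
D = np.zeros((m, n))
D[:len(s), :len(s)] = np.diag(s)

assert U.shape == (m, m) and D.shape == (m, n) and Vt.shape == (n, n)
assert np.allclose(U @ D @ Vt, A)   # Eq. 2.43: A = U D V^T
```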
Moore-Penrose Pseudoinverse
• If the equation has:
• Exactly one solution: this is the same as the inverse.
• No solution: this gives us the solution with the
smallest error
• Many solutions: this gives us the solution with the
smallest norm of x.
When $\boldsymbol{A}$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $\boldsymbol{x} = \boldsymbol{A}^+\boldsymbol{y}$ with minimal Euclidean norm $\|\boldsymbol{x}\|_2$ among all possible solutions.
When $\boldsymbol{A}$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $\boldsymbol{x}$ for which $\boldsymbol{A}\boldsymbol{x}$ is as close as possible to $\boldsymbol{y}$ in terms of Euclidean norm $\|\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\|_2$.
Computing the Pseudoinverse
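The SVD allows the computation of the pseudoinverse:
$$\boldsymbol{A}^+ = \boldsymbol{V}\boldsymbol{D}^+\boldsymbol{U}^\top \tag{2.47}$$
where $\boldsymbol{D}^+$ is obtained from $\boldsymbol{D}$: take the reciprocal of its non-zero entries, then transpose.
The Moore-Penrose pseudoinverse allows us to make some headway in these cases. The pseudoinverse of $\boldsymbol{A}$ is defined as a matrix
$$\boldsymbol{A}^+ = \lim_{\alpha \searrow 0} (\boldsymbol{A}^\top\boldsymbol{A} + \alpha\boldsymbol{I})^{-1}\boldsymbol{A}^\top. \tag{2.46}$$
Practical algorithms for computing the pseudoinverse are not based on this definition, but rather on the formula in Eq. 2.47, where $\boldsymbol{U}$, $\boldsymbol{D}$ and $\boldsymbol{V}$ are the singular value decomposition of $\boldsymbol{A}$, and the pseudoinverse $\boldsymbol{D}^+$ of a diagonal matrix $\boldsymbol{D}$ is obtained by taking the reciprocal of its non-zero elements then taking the transpose of the resulting matrix.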
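The SVD-based recipe can be sketched as below, assuming NumPy. `pinv_via_svd` is a hypothetical helper, and its cutoff `tol` for treating a singular value as zero is an illustrative choice rather than anything the slides specify; in practice `np.linalg.pinv` does the same job.

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Pseudoinverse A^+ = V D^+ U^T (Eq. 2.47); tol is an illustrative cutoff."""
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    m, n = A.shape
    # D^+ is n x m: reciprocal of the non-zero singular values, then transpose.
    D_plus = np.zeros((n, m))
    for i, sigma in enumerate(s):
        if sigma > tol:
            D_plus[i, i] = 1.0 / sigma
    return Vt.T @ D_plus @ U.T

# More rows than columns: no exact solution, so we get the least-squares x.
A_tall = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 0.0])
x_ls = pinv_via_svd(A_tall) @ y

# More columns than rows: many solutions, so we get the minimum-norm x.
A_wide = np.array([[1.0, 1.0]])
x_min = pinv_via_svd(A_wide) @ np.array([2.0])

# The limit definition (Eq. 2.46), approximated with a small alpha.
alpha = 1e-8
approx = np.linalg.inv(A_tall.T @ A_tall + alpha * np.eye(2)) @ A_tall.T
assert np.allclose(approx, pinv_via_svd(A_tall), atol=1e-6)
```

For the wide system $x_1 + x_2 = 2$, every point on that line is a solution; the pseudoinverse picks the one closest to the origin.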
Trace
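The trace operator gives the sum of all of the diagonal entries of a matrix:
$$\mathrm{Tr}(\boldsymbol{A}) = \sum_i A_{i,i}. \tag{2.48}$$
The trace operator is useful for a variety of reasons. Some operations that are difficult to specify without resorting to summation notation can be specified using matrix products and the trace operator.
Writing an expression in terms of the trace operator opens up opportunities to manipulate the expression using many useful identities. For example, the trace operator is invariant to the transpose operator:
$$\mathrm{Tr}(\boldsymbol{A}) = \mathrm{Tr}(\boldsymbol{A}^\top). \tag{2.50}$$
The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:
$$\mathrm{Tr}(\boldsymbol{A}\boldsymbol{B}\boldsymbol{C}) = \mathrm{Tr}(\boldsymbol{C}\boldsymbol{A}\boldsymbol{B}) = \mathrm{Tr}(\boldsymbol{B}\boldsymbol{C}\boldsymbol{A}) \tag{2.51}$$
or more generally,
$$\mathrm{Tr}\left(\prod_{i=1}^{n} \boldsymbol{F}^{(i)}\right) = \mathrm{Tr}\left(\boldsymbol{F}^{(n)} \prod_{i=1}^{n-1} \boldsymbol{F}^{(i)}\right). \tag{2.52}$$
This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ and $\boldsymbol{B} \in \mathbb{R}^{n \times m}$, we have $\mathrm{Tr}(\boldsymbol{A}\boldsymbol{B}) = \mathrm{Tr}(\boldsymbol{B}\boldsymbol{A})$.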
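The trace identities are easy to spot-check numerically. This sketch assumes NumPy and random matrices with a fixed seed; the shapes are chosen so that the three cyclic products are all defined yet have different sizes.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, illustrative data only
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 2))

M = A @ B @ C                        # a square 2 x 2 product
assert np.isclose(np.trace(M), sum(M[i, i] for i in range(2)))   # Eq. 2.48
assert np.isclose(np.trace(M), np.trace(M.T))                    # Eq. 2.50

# Eq. 2.51: cyclic permutations keep the trace, even though the products
# have different shapes (2 x 2, 4 x 4 and 3 x 3 here).
t_abc = np.trace(A @ B @ C)
t_cab = np.trace(C @ A @ B)
t_bca = np.trace(B @ C @ A)
assert np.isclose(t_abc, t_cab) and np.isclose(t_abc, t_bca)
```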
Learning linear algebra
• Do a lot of practice problems
• Start out with lots of summation signs and indexing
into individual entries
• Eventually you will be able to mostly use matrix
and vector product notation quickly and easily

More Related Content

PPT
LinearAlgebra.ppt
PPTX
My Lecture Notes from Linear Algebra
PPT
Matrices
PPT
Linear Algebra and Matrix
PPT
Linear algebra notes 1
PPT
3.3 graph systems of linear inequalities
PDF
9.1 Systems of Linear Equations
PDF
vector spaces notes.pdf
LinearAlgebra.ppt
My Lecture Notes from Linear Algebra
Matrices
Linear Algebra and Matrix
Linear algebra notes 1
3.3 graph systems of linear inequalities
9.1 Systems of Linear Equations
vector spaces notes.pdf

What's hot (20)

PPTX
MATRICES
PPTX
Equations of Straight Lines
PPT
Systems of Equations by Elimination
PDF
PDF
Systems of linear equations in three variables
PDF
2.1 Basics of Functions and Their Graphs
PPT
Linear Algebra
PPTX
System of linear inequalities
PPTX
8.1 intro to functions
PPT
Matrix and its operation (addition, subtraction, multiplication)
PDF
Matrices and Determinants
PPT
Set theory-ppt
PPTX
4.3 The Definite Integral
PPT
Systems of Linear Equations
PPTX
Vector Spaces,subspaces,Span,Basis
PPTX
Relations & Functions
PPTX
Gauss jordan method.pptx
PPTX
Intro to probability
PPTX
Matrix presentation By DHEERAJ KATARIA
PPTX
Mathematical Induction
MATRICES
Equations of Straight Lines
Systems of Equations by Elimination
Systems of linear equations in three variables
2.1 Basics of Functions and Their Graphs
Linear Algebra
System of linear inequalities
8.1 intro to functions
Matrix and its operation (addition, subtraction, multiplication)
Matrices and Determinants
Set theory-ppt
4.3 The Definite Integral
Systems of Linear Equations
Vector Spaces,subspaces,Span,Basis
Relations & Functions
Gauss jordan method.pptx
Intro to probability
Matrix presentation By DHEERAJ KATARIA
Mathematical Induction
Ad

Similar to 02 linear algebra (20)

PDF
Linear_Algebra_final.pdf
PPTX
matrix algebra
PDF
1 linear algebra matrices
PDF
Book linear
PDF
Appendix B Matrices And Determinants
PDF
matrix-algebra-for-engineers (1).pdf
PPT
Matrix algebra
PPTX
Bba i-bm-u-2- matrix -
PPTX
Linear Algebra and Matlab tutorial
PDF
ppt power point presentation physics.pdf
PPTX
04 Chapter MATLAB linear algebra review
PPTX
2 Chapter Two matrix algebra and its application.pptx
PPT
Matrix and its applications by mohammad imran
PPT
chap01987654etghujh76687976jgtfhhhgve.ppt
PDF
Matrices & Determinants
PPTX
matrix further mahmatix for betc level 5.pptx
PPTX
Beginning direct3d gameprogrammingmath05_matrices_20160515_jintaeks
PDF
Introduction To Matrix
PPT
systems of linear equations & matrices
Linear_Algebra_final.pdf
matrix algebra
1 linear algebra matrices
Book linear
Appendix B Matrices And Determinants
matrix-algebra-for-engineers (1).pdf
Matrix algebra
Bba i-bm-u-2- matrix -
Linear Algebra and Matlab tutorial
ppt power point presentation physics.pdf
04 Chapter MATLAB linear algebra review
2 Chapter Two matrix algebra and its application.pptx
Matrix and its applications by mohammad imran
chap01987654etghujh76687976jgtfhhhgve.ppt
Matrices & Determinants
matrix further mahmatix for betc level 5.pptx
Beginning direct3d gameprogrammingmath05_matrices_20160515_jintaeks
Introduction To Matrix
systems of linear equations & matrices
Ad

More from Ronald Teo (16)

PDF
Mc td
PDF
07 regularization
PDF
PDF
06 mlp
PDF
PDF
04 numerical
PPTX
Eac4f222d9d468a0c29a71a3830a5c60 c5_w3l08-attentionmodel
PDF
Intro rl
PDF
Lec7 deeprlbootcamp-svg+scg
PDF
Lec5 advanced-policy-gradient-methods
PDF
Lec6 nuts-and-bolts-deep-rl-research
PDF
Lec4b pong from_pixels
PDF
Lec4a policy-gradients-actor-critic
PDF
Lec3 dqn
PDF
Lec2 sampling-based-approximations-and-function-fitting
PDF
Lec1 intro-mdps-exact-methods
Mc td
07 regularization
06 mlp
04 numerical
Eac4f222d9d468a0c29a71a3830a5c60 c5_w3l08-attentionmodel
Intro rl
Lec7 deeprlbootcamp-svg+scg
Lec5 advanced-policy-gradient-methods
Lec6 nuts-and-bolts-deep-rl-research
Lec4b pong from_pixels
Lec4a policy-gradients-actor-critic
Lec3 dqn
Lec2 sampling-based-approximations-and-function-fitting
Lec1 intro-mdps-exact-methods

Recently uploaded (20)

PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PPTX
Big Data Technologies - Introduction.pptx
PDF
Machine learning based COVID-19 study performance prediction
PPT
Teaching material agriculture food technology
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Encapsulation theory and applications.pdf
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Empathic Computing: Creating Shared Understanding
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
sap open course for s4hana steps from ECC to s4
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
“AI and Expert System Decision Support & Business Intelligence Systems”
Big Data Technologies - Introduction.pptx
Machine learning based COVID-19 study performance prediction
Teaching material agriculture food technology
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Encapsulation theory and applications.pdf
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Empathic Computing: Creating Shared Understanding
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Per capita expenditure prediction using model stacking based on satellite ima...
Reach Out and Touch Someone: Haptics and Empathic Computing
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Dropbox Q2 2025 Financial Results & Investor Presentation
sap open course for s4hana steps from ECC to s4
Digital-Transformation-Roadmap-for-Companies.pptx
Mobile App Security Testing_ A Comprehensive Guide.pdf

02 linear algebra

  • 1. Linear Algebra Lecture slides for Chapter 2 of Deep Learning Ian Goodfellow 2016-06-24
  • 2. (Goodfellow 2016) About this chapter • Not a comprehensive survey of all of linear algebra • Focused on the subset most relevant to deep learning • Larger subset: e.g., Linear Algebra by Georgi Shilov
  • 3. (Goodfellow 2016) Scalars • A scalar is a single number • Integers, real numbers, rational numbers, etc. • We denote it with italic font: a, n, x
  • 4. (Goodfellow 2016) Vectors • A vector is a 1-D array of numbers: • Can be real, binary, integer, etc. • Example notation for type and size: rder. We can identify each individual number by its index in that ordering. ypically we give vectors lower case names written in bold typeface, such s x. The elements of the vector are identified by writing its name in italic ypeface, with a subscript. The first element of x is x1, the second element x2 and so on. We also need to say what kind of numbers are stored in he vector. If each element is in R, and the vector has n elements, then the ector lies in the set formed by taking the Cartesian product of R n times, enoted as Rn. When we need to explicitly identify the elements of a vector, e write them as a column enclosed in square brackets: x = 2 6 6 6 4 x1 x2 ... xn 3 7 7 7 5 . (2.1) We can think of vectors as identifying points in space, with each element ving the coordinate along a different axis. ometimes we need to index a set of elements of a vector. In this case, we efine a set containing the indices and write the set as a subscript. For xample, to access x1, x3 and x6, we define the set S = {1, 3, 6} and write S. We use the sign to index the complement of a set. For example x 1 is he vector containing all elements of x except for x1, and x S is the vector ontaining all of the elements of x except for x1, x3 and x6. Matrices: A matrix is a 2-D array of numbers, so each element is identified by wo indices instead of just one. We usually give matrices upper-case variable ames with bold typeface, such as A. If a real-valued matrix A has a height • Vectors: A vector is an array of order. We can identify each indiv Typically we give vectors lower as x. The elements of the vector typeface, with a subscript. The fi is x2 and so on. We also need t the vector. If each element is in R vector lies in the set formed by t denoted as Rn. When we need to
  • 5. (Goodfellow 2016) Matrices • A matrix is a 2-D array of numbers: • Example notation for type and shape: A = 2 4 A1,1 A1,2 A2,1 A2,2 A3,1 A3,2 3 5 ) A> =  A1,1 A2,1 A3,1 A1,2 A2,2 A3,2 The transpose of the matrix can be thought of as a mirror image across the nal. th column of A. When we need to explicitly identify the elements of a x, we write them as an array enclosed in square brackets:  A1,1 A1,2 A2,1 A2,2 . (2.2) times we may need to index matrix-valued expressions that are not just gle letter. In this case, we use subscripts after the expression, but do onvert anything to lower case. For example, f(A)i,j gives element (i, j) e matrix computed by applying the function f to A. ors: In some cases we will need an array with more than two axes. In eneral case, an array of numbers arranged on a regular grid with a ble number of axes is known as a tensor. We denote a tensor named “A” this typeface: A. We identify the element of A at coordinates (i, j, k) riting Ai,j,k. ndices and write the set as a subscript x6, we define the set S = {1, 3, 6} and x the complement of a set. For example ents of x except for x1, and x S is the of x except for x1, x3 and x6. ray of numbers, so each element is identifi . We usually give matrices upper-case va h as A. If a real-valued matrix A has a we say that A 2 Rm⇥n. We usually id g its name in italic but not bold font, an Column Row
  • 6. (Goodfellow 2016) Tensors • A tensor is an array of numbers, that may have • zero dimensions, and be a scalar • one dimension, and be a vector • two dimensions, and be a matrix • or more dimensions.
  • 7. (Goodfellow 2016) Matrix Transpose CHAPTER 2. LINEAR ALGEBRA A = 2 4 A1,1 A1,2 A2,1 A2,2 A3,1 A3,2 3 5 ) A> =  A1,1 A2,1 A3,1 A1,2 A2,2 A3,2 Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal. the i-th column of A. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets:  A1,1 A1,2 A2,1 A2,2 . (2.2) rtant operation on matrices is the transpose. The transpose of a mirror image of the matrix across a diagonal line, called the main ning down and to the right, starting from its upper left corner. See graphical depiction of this operation. We denote the transpose of a A>, and it is defined such that (A> )i,j = Aj,i. (2.3) an be thought of as matrices that contain only one column. The a vector is therefore a matrix with only one row. Sometimes we 33 ations have many useful properties that make mathematical more convenient. For example, matrix multiplication is A(B + C) = AB + AC. (2.6) A(BC) = (AB)C. (2.7) is not commutative (the condition AB = BA does not alar multiplication. However, the dot product between two : x> y = y> x. (2.8) matrix product has a simple form: (AB)> = B> A> . (2.9) onstrate Eq. 2.8, by exploiting the fact that the value of
  • 8. (Goodfellow 2016) Matrix (Dot) Product duct of matrices A and B is a third matrix C. In order ned, A must have the same number of columns as B has ⇥ n and B is of shape n ⇥ p, then C is of shape m ⇥ p. product just by placing two or more matrices together, C = AB. (2.4) n is defined by Ci,j = X k Ai,kBk,j. (2.5) d product of two matrices is not just a matrix containing dual elements. Such an operation exists and is called the Hadamard product, and is denoted as A B. en two vectors x and y of the same dimensionality is the can think of the matrix product C = AB as computing etween row i of A and column j of B. = •m p m p n n Must match e defined, A must have the same number of columns as B has pe m ⇥ n and B is of shape n ⇥ p, then C is of shape m ⇥ p. atrix product just by placing two or more matrices together, C = AB. (2.4) ration is defined by Ci,j = X k Ai,kBk,j. (2.5) andard product of two matrices is not just a matrix containing ndividual elements. Such an operation exists and is called the t or Hadamard product, and is denoted as A B. between two vectors x and y of the same dimensionality is the . We can think of the matrix product C = AB as computing uct between row i of A and column j of B. 34
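  • 9. (Goodfellow 2016) Identity Matrix
    $$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
    Figure 2.2: Example identity matrix: This is $I_3$.
    (Remaining rows of the expanded system:)
    $$A_{2,1}x_1 + A_{2,2}x_2 + \cdots + A_{2,n}x_n = b_2$$
    $$\vdots$$
    $$A_{m,1}x_1 + A_{m,2}x_2 + \cdots + A_{m,n}x_n = b_m.$$
    Matrix-vector product notation provides a more compact representation for equations of this form.
    Inverse Matrices: Linear algebra offers a powerful tool called matrix inversion that allows us to analytically solve Eq. 2.11 for many values of $A$.
    To describe matrix inversion, we first need to define the concept of an identity matrix. An identity matrix is a matrix that does not change any vector when we multiply that vector by that matrix. We denote the identity matrix that preserves $n$-dimensional vectors as $I_n$. Formally, $I_n \in \mathbb{R}^{n \times n}$, and
    $$\forall x \in \mathbb{R}^n, \; I_n x = x. \tag{2.20}$$
    The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero. See Fig. 2.2 for an example.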
  • 10. (Goodfellow 2016) Systems of Equations
    We have listed only a handful of useful properties of the matrix product here, but the reader should be aware that many more exist.
    We now know enough linear algebra notation to write down a system of linear equations:
    $$Ax = b \tag{2.11}$$
    where $A \in \mathbb{R}^{m \times n}$ is a known matrix, $b \in \mathbb{R}^m$ is a known vector, and $x \in \mathbb{R}^n$ is a vector of unknown variables we would like to solve for. Each element $x_i$ of $x$ is one of these unknown variables. Each row of $A$ and each element of $b$ provide another constraint. We can rewrite Eq. 2.11 as:
    $$A_{1,:}x = b_1 \tag{2.12}$$
    $$A_{2,:}x = b_2 \tag{2.13}$$
    $$\vdots \tag{2.14}$$
    $$A_{m,:}x = b_m \tag{2.15}$$
    or, even more explicitly, as:
    $$A_{1,1}x_1 + A_{1,2}x_2 + \cdots + A_{1,n}x_n = b_1 \tag{2.16}$$
  • 11. (Goodfellow 2016) Solving Systems of Equations • A linear system of equations can have: • No solution • Many solutions • Exactly one solution: this means multiplication by the matrix is an invertible function
  • 12. (Goodfellow 2016) Matrix Inversion
    • Matrix inverse: the matrix inverse of $A$ is denoted as $A^{-1}$, and it is defined as the matrix such that
      $$A^{-1}A = I_n. \tag{2.21}$$
    • Solving a system using an inverse: we can solve Eq. 2.11 by the following steps:
      $$Ax = b \tag{2.22}$$
      $$A^{-1}Ax = A^{-1}b \tag{2.23}$$
      $$I_n x = A^{-1}b \tag{2.24}$$
      Since $I_n x = x$, this gives $x = A^{-1}b$.
    • Numerically unstable, but useful for abstract analysis
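The steps in Eqs. 2.22 to 2.24 can be traced on a 2 x 2 system. The sketch below is an illustration only: it uses the closed-form inverse of a 2 x 2 matrix (a standard formula, not taken from the slides) and exact rational arithmetic from Python's `fractions` module so that the check is not affected by rounding. The names `inverse_2x2` and `matvec` are hypothetical helpers.

```python
from fractions import Fraction


def inverse_2x2(A):
    """Closed-form inverse of a 2 x 2 integer matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        # Linearly dependent rows/columns: the matrix cannot be inverted.
        raise ValueError("matrix is singular; no inverse exists")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]


def matvec(A, x):
    """Matrix-vector product Ax."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


A = [[2, 1], [1, 3]]
b = [3, 5]

A_inv = inverse_2x2(A)
x = matvec(A_inv, b)          # x = A^{-1} b
print(x, matvec(A, x) == b)   # multiplying back by A recovers b
```

As the slide notes, forming the inverse explicitly is mainly useful for analysis; with floating-point inputs, practical software typically solves the system directly instead.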
  • 13. (Goodfellow 2016) Invertibility • Matrix can’t be inverted if… • More rows than columns • More columns than rows • Redundant rows/columns (“linearly dependent”, “low rank”)
  • 14. (Goodfellow 2016) Norms
    • Functions that measure how “large” a vector is
    • Similar to a distance between zero and the point represented by the vector
    Formally, the $L^p$ norm is given by
    $$\|x\|_p = \left( \sum_i |x_i|^p \right)^{\frac{1}{p}}$$
    for $p \in \mathbb{R}$, $p \geq 1$.
    Norms, including the $L^p$ norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector $x$ measures the distance from the origin to the point $x$. More rigorously, a norm is any function $f$ that satisfies the following properties:
    • $f(x) = 0 \Rightarrow x = 0$
    • $f(x + y) \leq f(x) + f(y)$ (the triangle inequality)
    • $\forall \alpha \in \mathbb{R}, \; f(\alpha x) = |\alpha| f(x)$
    The $L^2$ norm, with $p = 2$, is known as the Euclidean norm. It is simply the Euclidean distance from the origin to the point identified by $x$.
  • 15. (Goodfellow 2016) Norms
    • $L^p$ norm: $\|x\|_p = \left( \sum_i |x_i|^p \right)^{\frac{1}{p}}$
    • Most popular norm: $L^2$ norm, $p = 2$
    • $L^1$ norm, $p = 1$: $\|x\|_1 = \sum_i |x_i|$
    • Max norm, infinite $p$: $\|x\|_\infty = \max_i |x_i|$
    Sometimes we need to measure the size of a vector. In machine learning, we usually measure the size of vectors using a function called a norm. Norms, including the $L^p$ norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector $x$ measures the distance from the origin to the point $x$.
    The squared $L^2$ norm may be undesirable because it increases very slowly near the origin. In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations, but retains mathematical simplicity: the $L^1$ norm. The $L^1$ norm may be simplified to
    $$\|x\|_1 = \sum_i |x_i|. \tag{2.31}$$
    The $L^1$ norm is commonly used in machine learning when the difference between zero and nonzero elements is very important. Every time an element of $x$ moves away from 0 by $\epsilon$, the $L^1$ norm increases by $\epsilon$.
    We sometimes measure the size of the vector by counting its number of nonzero elements. Some authors refer to this function as the “$L^0$ norm,” but this is incorrect terminology. The number of non-zero entries in a vector is not a norm, because scaling the vector by $\alpha$ does not change the number of nonzero entries. The $L^1$ norm is often used as a substitute for the number of nonzero entries.
    One other norm that commonly arises in machine learning is the $L^\infty$ norm, also known as the max norm. This norm simplifies to the absolute value of the element with the largest magnitude in the vector,
    $$\|x\|_\infty = \max_i |x_i|. \tag{2.32}$$
    Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm
    $$\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}.$$
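The norms discussed on this slide are short to write out directly. The sketch below is plain Python with hypothetical function names; the Frobenius norm uses the standard definition (square root of the sum of squared entries), which is an assumption here because the formula is cut off in the slide text. It also illustrates why counting nonzero entries is not a norm: the count is unchanged when the vector is rescaled.

```python
import math


def lp_norm(x, p):
    """L^p norm: (sum_i |x_i|^p)^(1/p), defined for p >= 1."""
    if p < 1:
        raise ValueError("p must be >= 1 for the L^p norm")
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


def l1_norm(x):
    """L^1 norm: sum of absolute values (Eq. 2.31)."""
    return sum(abs(xi) for xi in x)


def max_norm(x):
    """L^infinity (max) norm: largest absolute value (Eq. 2.32)."""
    return max(abs(xi) for xi in x)


def count_nonzero(x):
    """Number of nonzero entries; NOT a norm (ignores scaling)."""
    return sum(1 for xi in x if xi != 0)


def frobenius_norm(A):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(a * a for row in A for a in row))


x = [3.0, -4.0, 0.0]
scaled = [10 * xi for xi in x]

print(lp_norm(x, 2), l1_norm(x), max_norm(x))
print(count_nonzero(x), count_nonzero(scaled))   # same count after scaling
print(l1_norm(scaled) / l1_norm(x))              # a true norm scales by |alpha|
```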
  • 16. (Goodfellow 2016) Special Matrices and Vectors
    • Unit vector: $\|x\|_2 = 1$
    • Symmetric matrix: $A = A^\top$
    • Orthogonal matrix: $A^\top A = A A^\top = I$, so $A^{-1} = A^\top$
    In many cases, we may derive some very general machine learning algorithm in terms of arbitrary matrices, but obtain a less expensive (and less descriptive) algorithm by restricting some matrices to be diagonal.
    Not all diagonal matrices need be square. It is possible to construct a rectangular diagonal matrix. Non-square diagonal matrices do not have inverses but it is still possible to multiply by them cheaply. For a non-square diagonal matrix $D$, the product $Dx$ will involve scaling each element of $x$, and either concatenating some zeros to the result if $D$ is taller than it is wide, or discarding some of the last elements of the vector if $D$ is wider than it is tall.
    A symmetric matrix is any matrix that is equal to its own transpose:
    $$A = A^\top. \tag{2.35}$$
    Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, if $A$ is a matrix of distance measurements, with $A_{i,j}$ giving the distance from point $i$ to point $j$, then $A_{i,j} = A_{j,i}$ because distance functions are symmetric.
    A unit vector is a vector with unit norm:
    $$\|x\|_2 = 1. \tag{2.36}$$
    A vector $x$ and a vector $y$ are orthogonal to each other if $x^\top y = 0$. If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other. In $\mathbb{R}^n$, at most $n$ vectors may be mutually orthogonal with nonzero norm. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.
    An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:
    $$A^\top A = A A^\top = I. \tag{2.37}$$
    This implies that
    $$A^{-1} = A^\top, \tag{2.38}$$
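A 2-D rotation matrix is a convenient example of an orthogonal matrix. The sketch below (plain Python; the rotation angle and helper names are arbitrary choices for illustration) checks Eq. 2.37 numerically and confirms that each column is a unit vector and that the two columns are orthogonal.

```python
import math


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def is_identity(M, tol=1e-12):
    """True if M equals the identity matrix up to a small tolerance."""
    return all(math.isclose(M[i][j], 1.0 if i == j else 0.0, abs_tol=tol)
               for i in range(len(M)) for j in range(len(M)))


theta = math.pi / 6   # arbitrary angle chosen for the example
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

col0 = [Q[0][0], Q[1][0]]
col1 = [Q[0][1], Q[1][1]]

print(is_identity(matmul(transpose(Q), Q)))        # Q^T Q = I
print(is_identity(matmul(Q, transpose(Q))))        # Q Q^T = I
print(math.sqrt(sum(v * v for v in col0)))         # unit norm, up to rounding
print(sum(a * b for a, b in zip(col0, col1)))      # orthogonal: close to 0
```

Because $Q^\top Q = I$, the transpose acts as the inverse, which is why orthogonal matrices are cheap to invert.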
  • 17. (Goodfellow 2016) Eigendecomposition
    • Eigenvector and eigenvalue: $Av = \lambda v$
    • Eigendecomposition of a diagonalizable matrix: $A = V \operatorname{diag}(\lambda) V^{-1}$
    • Every real symmetric matrix has a real, orthogonal eigendecomposition: $A = Q \Lambda Q^\top$
    One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.
    An eigenvector of a square matrix $A$ is a non-zero vector $v$ such that multiplication by $A$ alters only the scale of $v$:
    $$Av = \lambda v. \tag{2.39}$$
    The scalar $\lambda$ is known as the eigenvalue corresponding to this eigenvector. (One can also find a left eigenvector such that $v^\top A = \lambda v^\top$, but we are usually concerned with right eigenvectors.)
    If $v$ is an eigenvector of $A$, then so is any rescaled vector $sv$ for $s \in \mathbb{R}$, $s \neq 0$. Moreover, $sv$ still has the same eigenvalue. For this reason, we usually only look for unit eigenvectors.
    Figure caption: An example of the effect of eigenvectors and eigenvalues. Here, we have a matrix $A$ with two orthonormal eigenvectors, $v^{(1)}$ with eigenvalue $\lambda_1$ and $v^{(2)}$ with eigenvalue $\lambda_2$. (Left) We plot the set of all unit vectors $u \in \mathbb{R}^2$ as a unit circle. (Right) We plot the set of all points $Au$. By observing the way that $A$ distorts the unit circle, we can see that it scales space in direction $v^{(i)}$ by $\lambda_i$.
    We may concatenate all of the eigenvectors to form a matrix $V$ with one eigenvector per column: $V = [v^{(1)}, \ldots, v^{(n)}]$. Likewise, we can concatenate the eigenvalues to form a vector $\lambda = [\lambda_1, \ldots, \lambda_n]^\top$. The eigendecomposition of $A$ is then given by
    $$A = V \operatorname{diag}(\lambda) V^{-1}. \tag{2.40}$$
    We have seen that constructing matrices with specific eigenvalues and eigenvectors allows us to stretch space in desired directions. However, we often want to decompose matrices into their eigenvalues and eigenvectors. Doing so can help us to analyze certain properties of the matrix, much as decomposing an integer into its prime factors can help us understand the behavior of that integer.
    Not every matrix can be decomposed into eigenvalues and eigenvectors. In some cases, the decomposition exists, but may involve complex rather than real numbers. Fortunately, in this book, we usually need to decompose only a specific class of matrices that have a simple decomposition. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:
    $$A = Q \Lambda Q^\top, \tag{2.41}$$
    where $Q$ is an orthogonal matrix composed of eigenvectors of $A$, and $\Lambda$ is a diagonal matrix. The eigenvalue $\Lambda_{i,i}$ is associated with the eigenvector in column $i$ of $Q$.
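For a small real symmetric matrix the decomposition in Eq. 2.41 can be verified by hand. The sketch below takes the matrix `[[2, 1], [1, 2]]` as an illustrative example (it is not from the slides); its unit eigenvectors and eigenvalues are written down directly rather than computed by an eigenvalue algorithm, and the code only checks that they satisfy Eq. 2.39 and reconstruct the matrix.

```python
import math


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


A = [[2.0, 1.0], [1.0, 2.0]]        # real symmetric example matrix

s = 1.0 / math.sqrt(2.0)
Q = [[s,  s],                       # columns are unit eigenvectors:
     [s, -s]]                       # (1, 1)/sqrt(2) and (1, -1)/sqrt(2)
eigenvalues = [3.0, 1.0]
Lambda = [[3.0, 0.0], [0.0, 1.0]]   # diagonal matrix of eigenvalues

# Check A v = lambda v for each column of Q (Eq. 2.39).
for i, lam in enumerate(eigenvalues):
    v = [Q[0][i], Q[1][i]]
    Av = matvec(A, v)
    print(all(math.isclose(Av[k], lam * v[k]) for k in range(2)))

# Check A = Q Lambda Q^T (Eq. 2.41).
reconstructed = matmul(matmul(Q, Lambda), transpose(Q))
print(reconstructed)
```

Multiplying by this matrix stretches space by a factor of 3 along the first eigenvector and leaves the second direction unchanged, which matches the geometric picture described for the unit circle.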
  • 18. (Goodfellow 2016) Effect of Eigenvalues
    [Figure: two panels with axes ranging from −3 to 3. Left panel, “Before multiplication”: the eigenvector $v^{(1)}$. Right panel, “After multiplication”: $v^{(1)}$ together with its scaled image $\lambda_1 v^{(1)}$.]
  • 19. (Goodfellow 2016) Singular Value Decomposition
    • Similar to eigendecomposition
    • More general; matrix need not be square
    The SVD is more generally applicable. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.
    Recall that the eigendecomposition involves analyzing a matrix $A$ to discover a matrix $V$ of eigenvectors and a vector of eigenvalues $\lambda$ such that we can rewrite $A$ as
    $$A = V \operatorname{diag}(\lambda) V^{-1}. \tag{2.42}$$
    The singular value decomposition is similar, except this time we will write $A$ as a product of three matrices:
    $$A = U D V^\top. \tag{2.43}$$
    Suppose that $A$ is an $m \times n$ matrix. Then $U$ is defined to be an $m \times m$ matrix, $D$ to be an $m \times n$ matrix, and $V$ to be an $n \times n$ matrix.
  • 20. (Goodfellow 2016) Moore-Penrose Pseudoinverse
    • If the equation $Ax = y$ has:
      • Exactly one solution: this is the same as the inverse.
      • No solution: this gives us the solution with the smallest error $\|Ax - y\|_2$.
      • Many solutions: this gives us the solution with the smallest norm of $x$.
    Practical algorithms for computing the pseudoinverse are not based on its definition, but rather on the formula
    $$A^+ = V D^+ U^\top, \tag{2.47}$$
    where $U$, $D$ and $V$ are the singular value decomposition of $A$, and the pseudoinverse $D^+$ of a diagonal matrix $D$ is obtained by taking the reciprocal of its non-zero elements then taking the transpose of the resulting matrix.
    When $A$ has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution $x = A^+ y$ with minimal Euclidean norm $\|x\|_2$ among all possible solutions.
    When $A$ has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the $x$ for which $Ax$ is as close as possible to $y$ in terms of Euclidean norm $\|Ax - y\|_2$.
  • 21. (Goodfellow 2016) Computing the Pseudoinverse
    The Moore-Penrose pseudoinverse allows us to make some headway in these cases. The pseudoinverse of $A$ is defined as a matrix
    $$A^+ = \lim_{\alpha \searrow 0} (A^\top A + \alpha I)^{-1} A^\top. \tag{2.46}$$
    Practical algorithms for computing the pseudoinverse are not based on this definition, but rather the formula
    $$A^+ = V D^+ U^\top, \tag{2.47}$$
    where $U$, $D$ and $V$ are the singular value decomposition of $A$, and the pseudoinverse $D^+$ of a diagonal matrix $D$ is obtained by taking the reciprocal of its non-zero elements then taking the transpose of the resulting matrix.
    The SVD allows the computation of the pseudoinverse: take the reciprocal of the non-zero entries of $D$.
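The recipe in Eq. 2.47 can be sketched in a few lines once the SVD factors are available. The plain-Python example below assumes that `U`, `D` and `V` have already been computed by some SVD routine (none is implemented here); to keep the example self-contained it uses the trivial case where `U` and `V` are identity matrices, so that `A` equals the rectangular diagonal matrix `D`. The tolerance `tol` used to decide which entries count as non-zero is an assumption of this sketch, as are the helper names.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def diag_pinv(D, tol=1e-12):
    """Pseudoinverse of an m x n diagonal matrix: reciprocal of the
    non-zero diagonal entries, placed in the transposed (n x m) shape."""
    m, n = len(D), len(D[0])
    D_plus = [[0.0] * m for _ in range(n)]
    for i in range(min(m, n)):
        if abs(D[i][i]) > tol:       # tol is an assumed cutoff for "non-zero"
            D_plus[i][i] = 1.0 / D[i][i]
    return D_plus


def pinv_from_svd(U, D, V):
    """A^+ = V D^+ U^T (Eq. 2.47), given SVD factors of A."""
    return matmul(matmul(V, diag_pinv(D)), transpose(U))


# Trivial SVD used only for illustration: U and V are identities, so A = D.
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
D = [[2.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]                # 2 x 3, one non-zero singular value

A = matmul(matmul(U, D), transpose(V))
A_plus = pinv_from_svd(U, D, V)      # 3 x 2
print(A_plus)
print(matmul(matmul(A, A_plus), A) == A)   # A A^+ A reproduces A
```

In practice a numerical library would supply both the SVD and a ready-made pseudoinverse; the point of the sketch is only the "take reciprocals, then transpose" step.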
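  • 22. (Goodfellow 2016) Trace
    The trace operator gives the sum of all of the diagonal entries of a matrix:
    $$\operatorname{Tr}(A) = \sum_i A_{i,i}. \tag{2.48}$$
    The trace operator is useful for a variety of reasons. Some operations that are difficult to specify without resorting to summation notation can be specified using matrix products and the trace operator.
    Writing an expression in terms of the trace operator opens up opportunities to manipulate the expression using many useful identities. For example, the trace operator is invariant to the transpose operator:
    $$\operatorname{Tr}(A) = \operatorname{Tr}(A^\top). \tag{2.50}$$
    The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:
    $$\operatorname{Tr}(ABC) = \operatorname{Tr}(CAB) = \operatorname{Tr}(BCA) \tag{2.51}$$
    or more generally,
    $$\operatorname{Tr}\left( \prod_{i=1}^{n} F^{(i)} \right) = \operatorname{Tr}\left( F^{(n)} \prod_{i=1}^{n-1} F^{(i)} \right). \tag{2.52}$$
    This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$, we have $\operatorname{Tr}(AB) = \operatorname{Tr}(BA)$.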
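The cyclic property of the trace is easy to check when the two products have different shapes. The following plain-Python sketch uses arbitrary integer matrices chosen for illustration: `A` is 2 x 3 and `B` is 3 x 2, so `AB` is 2 x 2 while `BA` is 3 x 3, yet their traces agree. The helper names are hypothetical.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def trace(A):
    """Sum of the diagonal entries of a square matrix (Eq. 2.48)."""
    if any(len(row) != len(A) for row in A):
        raise ValueError("trace is only defined here for square matrices")
    return sum(A[i][i] for i in range(len(A)))


A = [[1, 2, 3],
     [4, 5, 6]]            # 2 x 3
B = [[1, 0],
     [2, 1],
     [0, 3]]               # 3 x 2

AB = matmul(A, B)          # 2 x 2
BA = matmul(B, A)          # 3 x 3

print(trace(AB), trace(BA))                 # equal despite different shapes
print(trace(AB) == trace(transpose(AB)))    # invariance to transpose (Eq. 2.50)
```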
  • 23. (Goodfellow 2016) Learning linear algebra • Do a lot of practice problems • Start out with lots of summation signs and indexing into individual entries • Eventually you will be able to mostly use matrix and vector product notation quickly and easily