Machine Learning
Math Essentials

Jeff Howbert

Introduction to Machine Learning

Winter 2012

Areas of math essential to machine learning
Machine learning is part of both statistics and computer science
– Probability
– Statistical inference
– Validation
– Estimates of error, confidence intervals
Linear algebra
– Hugely useful for compact representation of linear transformations on data
– Dimensionality reduction techniques
Optimization theory
Why worry about the math?
There are lots of easy-to-use machine learning packages out there.
After this course, you will know how to apply several of the most general-purpose algorithms.
HOWEVER
To get really useful results, you need good mathematical intuitions about certain general machine learning principles, as well as the inner workings of the individual algorithms.
Why worry about the math?
These intuitions will allow you to:
– Choose the right algorithm(s) for the problem
– Make good choices on parameter settings, validation strategies
– Recognize over- or underfitting
– Troubleshoot poor / ambiguous results
– Put appropriate bounds of confidence / uncertainty on results
– Do a better job of coding algorithms or incorporating them into more complex analysis pipelines
Notation
a ∈ A        set membership: a is member of set A
| B |        cardinality: number of items in set B
|| v ||      norm: length of vector v
∑            summation
∫            integral
ℜ            the set of real numbers
ℜⁿ           real number space of dimension n
             n = 2 : plane or 2-space
             n = 3 : 3- (dimensional) space
             n > 3 : n-space or hyperspace

Notation
x, y, z,     vector (bold, lower case)
u, v
A, B, X      matrix (bold, upper case)
y = f( x )   function (map): assigns unique value in range of y to each value in domain of x
dy / dx      derivative of y with respect to single variable x
y = f( x )   function on multiple variables, i.e. a vector of variables; function in n-space
∂y / ∂xi     partial derivative of y with respect to element i of vector x
The concept of probability

Intuition:
In some process, several outcomes are possible. When the process is repeated a large number of times, each outcome occurs with a characteristic relative frequency, or probability. If a particular outcome happens more often than another outcome, we say it is more probable.

The concept of probability
Arises in two contexts:
In actual repeated experiments.
– Example: You record the color of 1000 cars driving by. 57 of them are green. You estimate the probability of a car being green as 57 / 1000 = 0.057.
In idealized conceptions of a repeated process.
– Example: You consider the behavior of an unbiased six-sided die. The expected probability of rolling a 5 is 1 / 6 = 0.1667.
– Example: You need a model for how people’s heights are distributed. You choose a normal distribution (bell-shaped curve) to represent the expected relative probabilities.
Probability spaces
A probability space is a random process or experiment with
three components:
– Ω, the set of possible outcomes O
number of possible outcomes = | Ω | = N

– F, the set of possible events E
  an event comprises 0 to N outcomes
  number of possible events = | F | = 2^N
– P, the probability distribution
  function mapping each outcome and event to a real number between 0 and 1 (the probability of O or E)
  probability of an event is the sum of probabilities of the possible outcomes in the event

Axioms of probability
1. Non-negativity:
   for any event E ∈ F, p( E ) ≥ 0
2. All possible outcomes:
   p( Ω ) = 1
3. Additivity of disjoint events:
   for all events E, E′ ∈ F where E ∩ E′ = ∅,
   p( E ∪ E′ ) = p( E ) + p( E′ )

Types of probability spaces
Define | Ω | = number of possible outcomes
Discrete space: | Ω | is finite
– Analysis involves summations ( ∑ )
Continuous space: | Ω | is infinite
– Analysis involves integrals ( ∫ )

Example of discrete probability space
Single roll of a six-sided die
– 6 possible outcomes: O = 1, 2, 3, 4, 5, or 6
– 2^6 = 64 possible events
  example: E = ( O ∈ { 1, 3, 5 } ), i.e. outcome is odd
– If die is fair, then probabilities of outcomes are equal
  p( 1 ) = p( 2 ) = p( 3 ) = p( 4 ) = p( 5 ) = p( 6 ) = 1 / 6
  example: probability of event E = ( outcome is odd ) is
  p( 1 ) + p( 3 ) + p( 5 ) = 1 / 2

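A quick way to make these numbers concrete is to enumerate the outcome set and sum outcome probabilities over an event. A minimal sketch in Python (not from the slides; the fair-die probabilities are exactly those given above):

```python
from fractions import Fraction

# Outcome set for a single roll of a fair six-sided die
omega = [1, 2, 3, 4, 5, 6]
p = {o: Fraction(1, 6) for o in omega}   # equal outcome probabilities

# Event E = "outcome is odd" = { 1, 3, 5 }
E = {o for o in omega if o % 2 == 1}

# Probability of an event = sum of probabilities of its outcomes
p_E = sum(p[o] for o in E)
print(p_E)               # 1/2
print(2 ** len(omega))   # 64 possible events (size of the event set F)
```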
Example of discrete probability space
Three consecutive flips of a coin
– 8 possible outcomes: O = HHH, HHT, HTH, HTT, THH, THT, TTH, TTT
– 2^8 = 256 possible events
  example: E = ( O ∈ { HHT, HTH, THH } ), i.e. exactly two flips are heads
  example: E = ( O ∈ { THT, TTT } ), i.e. the first and third flips are tails
– If coin is fair, then probabilities of outcomes are equal
  p( HHH ) = p( HHT ) = p( HTH ) = p( HTT ) = p( THH ) = p( THT ) = p( TTH ) = p( TTT ) = 1 / 8
  example: probability of event E = ( exactly two heads ) is
  p( HHT ) + p( HTH ) + p( THH ) = 3 / 8
Example of continuous probability space
Height of a randomly chosen American male
– Infinite number of possible outcomes: O has some single value in range 2 feet to 8 feet
– Infinite number of possible events
  example: E = ( O | O < 5.5 feet ), i.e. individual chosen is less than 5.5 feet tall
– Probabilities of outcomes are not equal, and are described by a continuous function, p( O )

Example of continuous probability space
Height of a randomly chosen American male
– Probabilities of outcomes O are not equal, and are described by a continuous function, p( O )
– p( O ) is a relative, not an absolute probability
  p( O ) for any particular O is zero
  ∫ p( O ) from O = -∞ to ∞ (i.e. area under curve) is 1
  example: p( O = 5′8″ ) > p( O = 6′2″ )
  example: p( O < 5′6″ ) = ( ∫ p( O ) from O = -∞ to 5′6″ ) ≈ 0.25

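The ≈ 0.25 figure above is the area under the height pdf to the left of 5′6″. A hedged sketch using a normal model; the mean of 68 inches and standard deviation of 3 inches are illustrative assumptions (chosen so the result roughly matches the slide), not values given in the slides:

```python
from scipy.stats import norm

# Hypothetical normal model for adult male height, in inches (assumed parameters)
mean, sd = 68.0, 3.0

# p( O < 5'6" ): area under the pdf from -infinity to 66 inches
print(norm.cdf(66, loc=mean, scale=sd))   # about 0.25 with these parameters

# p( O = any exact value ) is zero; only densities can be compared
print(norm.pdf(68, loc=mean, scale=sd) > norm.pdf(74, loc=mean, scale=sd))  # True: 5'8" is denser than 6'2"
```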
Probability distributions
Discrete: probability mass function (pmf)
  example: sum of two fair dice
Continuous: probability density function (pdf)
  example: waiting time between eruptions of Old Faithful (minutes)
(figures: bar plot of the pmf and curve of the pdf, with probability on the vertical axis)
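The pmf for the sum of two fair dice can be built by brute-force enumeration; a small sketch (the plot on the original slide is not reproduced here):

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of rolling two fair dice
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability mass function: p( sum = s )
pmf = {s: Fraction(c, 36) for s, c in sorted(counts.items())}
for s, prob in pmf.items():
    print(s, prob)              # peaks at p( 7 ) = 6/36 = 1/6

assert sum(pmf.values()) == 1   # a pmf sums to 1
```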
Random variables
A random variable X is a function that associates a number x with each outcome O of a process
– Common notation: X( O ) = x, or just X = x
Basically a way to redefine (usually simplify) a probability space to a new probability space
– X must obey axioms of probability (over the possible values of x)
– X can be discrete or continuous
Example: X = number of heads in three flips of a coin
– Possible values of X are 0, 1, 2, 3
– p( X = 0 ) = p( X = 3 ) = 1 / 8
  p( X = 1 ) = p( X = 2 ) = 3 / 8
– Size of space (number of “outcomes”) reduced from 8 to 4
Example: X = average height of five randomly chosen American men
– Size of space unchanged (X can range from 2 feet to 8 feet), but pdf of X different than for single man
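The redefinition of the coin-flip space by X = number of heads can be checked directly; a minimal sketch:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# All 8 equally likely outcomes of three flips of a fair coin
outcomes = list(product("HT", repeat=3))

# Random variable X( O ) = number of heads in outcome O
X = Counter(o.count("H") for o in outcomes)

# Distribution of X over its 4 possible values 0..3
for x in sorted(X):
    print(x, Fraction(X[x], len(outcomes)))   # 0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8
```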
Multivariate probability distributions
Scenario
– Several random processes occur (doesn’t matter whether in parallel or in sequence)
– Want to know probabilities for each possible combination of outcomes
Can describe as joint probability of several random variables
– Example: two processes whose outcomes are represented by random variables X and Y. Probability that process X has outcome x and process Y has outcome y is denoted as:
  p( X = x, Y = y )
Example of multivariate distribution
joint probability: p( X = minivan, Y = European ) = 0.1481

Multivariate probability distributions
Marginal probability
– Probability distribution of a single variable in a
joint distribution
– Example: two random variables X and Y:
p( X = x ) = ∑b=all values of Y p( X = x, Y = b )
Conditional probability
– Probability distribution of one variable given
that another variable takes a certain value
– Example: two random variables X and Y:
  p( X = x | Y = y ) = p( X = x, Y = y ) / p( Y = y )
Example of marginal probability
marginal probability: p( X = minivan ) = 0.0741 + 0.1111 + 0.1481 = 0.3333

Example of conditional probability
conditional probability: p( Y = European | X = minivan ) =
0.1481 / ( 0.0741 + 0.1111 + 0.1481 ) = 0.4443

(figure: 3-D bar chart of the joint probabilities, with probability on the vertical axis, X = model type { SUV, minivan, sedan, sport } and Y = manufacturer { American, Asian, European })
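The marginal and conditional calculations above can be reproduced from the three joint probabilities the slides give for minivans. A sketch; note that assigning 0.0741 to American and 0.1111 to Asian is an assumption, since the full joint table is only shown as a figure:

```python
# Joint probabilities p( X = minivan, Y = y ) taken from the slides;
# the American/Asian assignment of 0.0741 and 0.1111 is an assumption.
joint_minivan = {"American": 0.0741, "Asian": 0.1111, "European": 0.1481}

# Marginal: p( X = minivan ) = sum over Y of p( X = minivan, Y = y )
p_minivan = sum(joint_minivan.values())
print(p_minivan)                              # 0.3333

# Conditional: p( Y = European | X = minivan ) = joint / marginal
print(joint_minivan["European"] / p_minivan)  # about 0.444
```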
Continuous multivariate distribution
Same concepts of joint, marginal, and conditional
probabilities apply (except use integrals)
Example: three-component Gaussian mixture in two
dimensions

Expected value
Given:
A discrete random variable X, with possible values x = x1, x2, … xn
Probabilities p( X = xi ) that X takes on the various values of xi
A function yi = f( xi ) defined on X
The expected value of f is the probability-weighted
“average” value of f( xi ):
E( f ) = ∑i p( xi ) ⋅ f( xi )
Example of expected value
Process: game where one card is drawn from the deck
– If face card, dealer pays you $10
– If not a face card, you pay dealer $4
Random variable X = { face card, not face card }
– p( face card ) = 3/13
– p( not face card ) = 10/13
Function f( X ) is payout to you
– f( face card ) = 10
– f( not face card ) = -4
Expected value of payout is:
E( f ) = ∑i p( xi ) ⋅ f( xi ) = 3/13 ⋅ 10 + 10/13 ⋅ -4 = -0.77
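The payout calculation can be written directly from the definition E( f ) = ∑i p( xi ) ⋅ f( xi ); a minimal sketch:

```python
from fractions import Fraction

# Probabilities and payouts for the card game described above
p = {"face card": Fraction(3, 13), "not face card": Fraction(10, 13)}
f = {"face card": 10, "not face card": -4}

# Expected payout: probability-weighted average of f
E_f = sum(p[x] * f[x] for x in p)
print(E_f, float(E_f))   # -10/13, about -0.77
```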
Expected value in continuous spaces
E( f ) = ∫x = a → b p( x ) ⋅ f( x ) dx

Common forms of expected value (1)
Mean (μ)
f( xi ) = xi ⇒ μ = E( f ) = ∑i p( xi ) ⋅ xi
– Average value of X = xi, taking into account probability of the various xi
– Most common measure of “center” of a distribution
Compare to formula for mean of an actual sample:
  μ = ( 1 / N ) ∑i=1..N xi

Common forms of expected value (2)
Variance (σ²)
f( xi ) = ( xi - μ )² ⇒ σ² = ∑i p( xi ) ⋅ ( xi - μ )²
– Average value of squared deviation of X = xi from mean μ, taking into account probability of the various xi
– Most common measure of “spread” of a distribution
– σ is the standard deviation
Compare to formula for variance of an actual sample:
  σ² = ( 1 / ( N - 1 ) ) ∑i=1..N ( xi - μ )²

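Both the distribution forms and the sample forms of the mean and variance are easy to check numerically; a sketch using the fair-die pmf from earlier and an illustrative small sample (the sample values are not from the slides):

```python
import numpy as np

# Distribution form: mean and variance of a fair six-sided die
xs = np.arange(1, 7)
p = np.full(6, 1 / 6)
mu = np.sum(p * xs)                  # E[X] = 3.5
var = np.sum(p * (xs - mu) ** 2)     # E[(X - mu)^2] = 35/12, about 2.9167
print(mu, var)

# Sample forms, on an illustrative sample
sample = np.array([2.0, 4.0, 4.0, 5.0, 7.0])
print(sample.mean())                 # (1/N) * sum of xi
print(sample.var(ddof=1))            # (1/(N-1)) * sum of (xi - mean)^2
```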
Common forms of expected value (3)
Covariance
f( xi ) = ( xi - μx ), g( yi ) = ( yi - μy ) ⇒ cov( x, y ) = ∑i p( xi , yi ) ⋅ ( xi - μx ) ⋅ ( yi - μy )
– Measures tendency for x and y to deviate from their means in same (or opposite) directions at same time
(figures: scatter plots illustrating high (positive) covariance and no covariance)
Compare to formula for covariance of actual samples:
  cov( x, y ) = ( 1 / ( N - 1 ) ) ∑i=1..N ( xi - μx ) ⋅ ( yi - μy )
Correlation
Pearson’s correlation coefficient is covariance normalized
by the standard deviations of the two variables
  corr( x, y ) = cov( x, y ) / ( σx σy )
– Always lies in range -1 to 1
– Only reflects linear dependence between variables
(figures: scatter plots illustrating linear dependence with noise, linear dependence without noise, and various nonlinear dependencies)
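Sample covariance and Pearson correlation follow the formulas above; a minimal numpy sketch with illustrative data (y is constructed as a noisy linear function of x, not data from the slides):

```python
import numpy as np

# Illustrative paired samples
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

cov_xy = np.cov(x, y, ddof=1)[0, 1]                  # sample covariance
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))   # normalize by the two standard deviations
print(corr_xy)
print(np.corrcoef(x, y)[0, 1])                       # same value; strong positive correlation, near 1
```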
Complement rule
Given: event A, which can occur or not

p( not A ) = 1 - p( A )

(Venn diagram: Ω split into A and not A; areas represent relative probabilities)
Product rule
Given: events A and B, which can co-occur (or not)

p( A, B ) = p( A | B ) ⋅ p( B )
(same expression given previously to define conditional probability)
(Venn diagram: Ω split into regions ( A, B ), ( A, not B ), ( not A, B ), ( not A, not B ); areas represent relative probabilities)
Example of product rule
Probability that a man has white hair (event A)
and is over 65 (event B)
– p( B ) = 0.18
– p( A | B ) = 0.78
– p( A, B ) = p( A | B ) ⋅ p( B ) =
0.78 ⋅ 0.18 =
0.14

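The white-hair calculation is just the product rule applied to the two given numbers; a one-line check:

```python
p_B = 0.18          # p( over 65 )
p_A_given_B = 0.78  # p( white hair | over 65 )

p_A_and_B = p_A_given_B * p_B   # product rule
print(round(p_A_and_B, 2))      # 0.14
```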
Rule of total probability
Given: events A and B, which can co-occur (or not)

p( A ) = p( A, B ) + p( A, not B )
(same expression given previously to define marginal probability)
(same Venn diagram of A and B within Ω as above; areas represent relative probabilities)
Independence
Given: events A and B, which can co-occur (or not)

p( A | B ) = p( A )

or

p( A, B ) = p( A ) ⋅ p( B )
(same Venn diagram of A and B within Ω as above; areas represent relative probabilities)
Examples of independence / dependence
Independence:
– Outcomes on multiple rolls of a die
– Outcomes on multiple flips of a coin
– Height of two unrelated individuals
– Probability of getting a king on successive draws from a deck, if card from each draw is replaced
Dependence:
– Height of two related individuals
– Duration of successive eruptions of Old Faithful
– Probability of getting a king on successive draws from a deck, if card from each draw is not replaced
Example of independence vs. dependence
Independence: All manufacturers have identical product mix. p( X = x | Y = y ) = p( X = x ).
Dependence: American manufacturers love SUVs, European manufacturers don’t.

Bayes rule
A way to find conditional probabilities for one variable when
conditional probabilities for another variable are known.
p( B | A ) = p( A | B ) ⋅ p( B ) / p( A )
where p( A ) = p( A, B ) + p( A, not B )

(same Venn diagram of A and B within Ω as above; areas represent relative probabilities)
Bayes rule
posterior probability ∝ likelihood × prior probability
p( B | A ) = p( A | B ) ⋅ p( B ) / p( A )

(same Venn diagram of A and B within Ω as above; areas represent relative probabilities)
Example of Bayes rule
Marie is getting married tomorrow at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year. Unfortunately, the weatherman is forecasting rain for tomorrow. When it actually rains, the weatherman has forecast rain 90% of the time. When it doesn't rain, he has forecast rain 10% of the time. What is the probability it will rain on the day of Marie's wedding?
Event A: The weatherman has forecast rain.
Event B: It rains.
We know:
– p( B ) = 5 / 365 = 0.0137 [ It rains 5 days out of the year. ]
– p( not B ) = 360 / 365 = 0.9863
– p( A | B ) = 0.9 [ When it rains, the weatherman has forecast rain 90% of the time. ]
– p( A | not B ) = 0.1 [ When it does not rain, the weatherman has forecast rain 10% of the time. ]
Example of Bayes rule, cont’d.

We want to know p( B | A ), the probability it will rain on the day of Marie's wedding, given a forecast for rain by the weatherman. The answer can be determined from Bayes rule:
1. p( B | A ) = p( A | B ) ⋅ p( B ) / p( A )
2. p( A ) = p( A | B ) ⋅ p( B ) + p( A | not B ) ⋅ p( not B ) = (0.9)(0.0137) + (0.1)(0.9863) = 0.111
3. p( B | A ) = (0.9)(0.0137) / 0.111 = 0.111
The result seems unintuitive but is correct. Even when the weatherman predicts rain, it rains only about 11% of the time. Despite the weatherman's gloomy prediction, it is unlikely Marie will get rained on at her wedding.

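The wedding-day calculation maps directly onto the rule of total probability plus Bayes rule; a minimal sketch using the numbers above:

```python
# Given quantities from the example
p_rain = 5 / 365            # p( B )
p_dry = 1 - p_rain          # p( not B )
p_fc_given_rain = 0.9       # p( A | B )
p_fc_given_dry = 0.1        # p( A | not B )

# Total probability: p( A ) = p( A | B ) p( B ) + p( A | not B ) p( not B )
p_forecast = p_fc_given_rain * p_rain + p_fc_given_dry * p_dry

# Bayes rule: p( B | A ) = p( A | B ) p( B ) / p( A )
p_rain_given_forecast = p_fc_given_rain * p_rain / p_forecast
print(round(p_rain_given_forecast, 3))   # about 0.111
```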
Probabilities: when to add, when to multiply
ADD: When you want to allow for occurrence of
any of several possible outcomes of a single
process. Comparable to logical OR.
MULTIPLY: When you want to allow for simultaneous occurrence of particular outcomes from more than one process. Comparable to logical AND.
– But only if the processes are independent.

Linear algebra applications
1) Operations on or between vectors and matrices
2) Coordinate transformations
3) Dimensionality reduction
4) Linear regression
5) Solution of linear systems of equations
6) Many others
Applications 1) – 4) are directly relevant to this course. Today we’ll start with 1).
Why vectors and matrices?
Most common form of data organization for machine learning is a 2D array, where
– rows represent samples (records, items, datapoints)
– columns represent attributes (features, variables)
Natural to think of each sample as a vector of attributes, and the whole array as a matrix
(each row of the table below is a vector; the whole table is a matrix)

Refund   Marital Status   Taxable Income   Cheat
Yes      Single           125K             No
No       Married          100K             No
No       Single           70K              No
Yes      Married          120K             No
No       Divorced         95K              Yes
No       Married          60K              No
Yes      Divorced         220K             No
No       Single           85K              Yes
No       Married          75K              No
No       Single           90K              Yes
Vectors
Definition: an n-tuple of values (usually real numbers).
– n referred to as the dimension of the vector
– n can be any positive integer, from 1 to infinity
Can be written in column form or row form
– Column form is conventional
– Vector elements referenced by subscript
  x = ⎛ x1 ⎞        xᵀ = ( x1 ⋯ xn )        ( ᵀ means “transpose” )
      ⎜ ⋮  ⎟
      ⎝ xn ⎠
Vectors
Can think of a vector as:
– a point in space or
– a directed line segment with a magnitude and
direction

Vector arithmetic
Addition of two vectors
– add corresponding elements

  z = x + y = ( x1 + y1, …, xn + yn )ᵀ
– result is a vector
Scalar multiplication of a vector
– multiply each element by scalar
  y = a x = ( a x1, …, a xn )ᵀ
– result is a vector

Vector arithmetic
Dot product of two vectors
– multiply corresponding elements, then add products
  a = x ⋅ y = ∑i=1..n xi yi
– result is a scalar
Dot product alternative form (θ is the angle between x and y)
  a = x ⋅ y = || x || || y || cos( θ )
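The vector operations on the last few slides map one-to-one onto numpy; a minimal sketch with illustrative vectors (not from the slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([3.0, 0.0, 4.0])

print(x + y)              # elementwise addition
print(2.5 * x)            # scalar multiplication
print(np.dot(x, y))       # dot product: sum of xi * yi = 11
print(np.linalg.norm(y))  # norm || y || = 5

# Alternative form of the dot product: || x || || y || cos(theta)
cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.linalg.norm(x) * np.linalg.norm(y) * cos_theta)   # equals x . y = 11
```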
Matrices
Definition: an m x n two-dimensional array of
values (usually real numbers).
– m rows
– n columns
Matrix referenced by two-element subscript
– first element in subscript is row
– second element in subscript is column
– example: A24 or a24 is the element in the second row, fourth column of A

  A = ⎛ a11 … a1n ⎞
      ⎜  ⋮  ⋱  ⋮  ⎟
      ⎝ am1 … amn ⎠
Matrices
A vector can be regarded as special case of a
matrix, where one of matrix dimensions = 1.
Matrix transpose (denoted T)
– swap columns and rows
row 1 becomes column 1, etc.

– m x n matrix becomes n x m matrix
– example:

  A = ⎛ 2  7  -1  0  3 ⎞        Aᵀ = ⎛  2   4 ⎞
      ⎝ 4  6  -3  1  8 ⎠             ⎜  7   6 ⎟
                                     ⎜ -1  -3 ⎟
                                     ⎜  0   1 ⎟
                                     ⎝  3   8 ⎠
Matrix arithmetic
Addition of two matrices
– matrices must be same size
– add corresponding elements:

cij = aij + bij
– result is a matrix of same size

  C = A + B = ⎛ a11 + b11 … a1n + b1n ⎞
              ⎜     ⋮     ⋱     ⋮     ⎟
              ⎝ am1 + bm1 … amn + bmn ⎠

Scalar multiplication of a matrix
– multiply each element by scalar:
  bij = d ⋅ aij
– result is a matrix of same size

  B = d ⋅ A = ⎛ d ⋅ a11 … d ⋅ a1n ⎞
              ⎜    ⋮    ⋱    ⋮    ⎟
              ⎝ d ⋅ am1 … d ⋅ amn ⎠
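The transpose, addition, and scalar-multiplication rules can be verified with the 2 x 5 example matrix from the transpose slide; a short numpy sketch:

```python
import numpy as np

# Example matrix from the transpose slide (2 x 5)
A = np.array([[2, 7, -1, 0, 3],
              [4, 6, -3, 1, 8]])

print(A.T)                  # transpose: 5 x 2, rows become columns
print(A + A)                # matrix addition: elementwise, same-size matrices
print(3 * A)                # scalar multiplication: every element times 3
print(A.shape, A.T.shape)   # (2, 5) (5, 2)
```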
Matrix arithmetic
Matrix-matrix multiplication
– vector-matrix multiplication is just a special case
TO THE BOARD!!
Multiplication is associative
  A ⋅ ( B ⋅ C ) = ( A ⋅ B ) ⋅ C
Multiplication is not commutative
  A ⋅ B ≠ B ⋅ A (generally)
Transposition rule:
  ( A ⋅ B )ᵀ = Bᵀ ⋅ Aᵀ
Matrix arithmetic
RULE: In any chain of matrix multiplications, the
column dimension of one matrix in the chain must
match the row dimension of the following matrix
in the chain.
Examples:  A is 3x5,  B is 5x5,  C is 3x1
Right:  A ⋅ B ⋅ Aᵀ     Cᵀ ⋅ A ⋅ B     Aᵀ ⋅ A ⋅ B     C ⋅ Cᵀ ⋅ A
Wrong:  A ⋅ B ⋅ A      C ⋅ A ⋅ B      A ⋅ Aᵀ ⋅ B     Cᵀ ⋅ C ⋅ A
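The dimension-matching rule is easy to check with numpy shapes, using matrices of the sizes in the example (the entries are arbitrary):

```python
import numpy as np

A = np.ones((3, 5))   # A is 3 x 5
B = np.ones((5, 5))   # B is 5 x 5
C = np.ones((3, 1))   # C is 3 x 1

print((A @ B @ A.T).shape)    # (3, 3), a "right" chain
print((C.T @ A @ B).shape)    # (1, 5)
print((A.T @ A @ B).shape)    # (5, 5)

# A "wrong" chain raises an error: columns of A.B (5) do not match rows of A (3)
try:
    A @ B @ A
except ValueError as e:
    print("shape mismatch:", e)
```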
Vector projection
Orthogonal projection of y onto x
– Can take place in any space of dimensionality ≥ 2
– Unit vector in direction of x is x / || x ||
– Length of projection of y in direction of x is || y || ⋅ cos( θ ), where θ is the angle between x and y
– Orthogonal projection of y onto x is the vector
  projx( y ) = x ⋅ || y || ⋅ cos( θ ) / || x || = [ ( x ⋅ y ) / || x ||² ] x   (using dot product alternative form)

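The projection formula projx( y ) = [ ( x ⋅ y ) / || x ||² ] x is a one-liner in numpy; a minimal sketch with illustrative vectors (not from the slides):

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([2.0, 6.0])

# Orthogonal projection of y onto x: [ (x . y) / ||x||^2 ] x
proj = (np.dot(x, y) / np.dot(x, x)) * x
print(proj)                                # [3.6, 4.8]

# The residual y - proj is orthogonal to x
print(np.isclose(np.dot(y - proj, x), 0))  # True
```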
Optimization theory topics
Maximum likelihood
Expectation maximization
Gradient descent

