Reading
Pattern Recognition
and Machine Learning
§3.3 (Bayesian Linear Regression)
Christopher M. Bishop
Introduced by: Yusuke Oda (NAIST)
@odashi_t
2013/6/5 2013 © Yusuke Oda AHC-Lab, IS, NAIST 1
Agenda
 3.3 Bayesian Linear Regression ベイズ線形回帰
– 3.3.1 Parameter distribution パラメータの分布
– 3.3.2 Predictive distribution 予測分布
– 3.3.3 Equivalent kernel 等価カーネル
Agenda
 3.3 Bayesian Linear Regression ベイズ線形回帰
– 3.3.1 Parameter distribution パラメータの分布
– 3.3.2 Predictive distribution 予測分布
– 3.3.3 Equivalent kernel 等価カーネル
Bayesian Linear Regression
 Maximum Likelihood (ML)
– The appropriate number of basis functions (≃ model complexity)
depends on the size of the data set.
– Adding a regularization term can control model complexity.
– But how should we determine
the coefficient of the regularization term?
Bayesian Linear Regression
 Maximum Likelihood (ML)
– Using ML to determine the coefficient of the regularization term
... a bad choice
• This always leads to excessively complex models (= over-fitting)
– Using independent hold-out data to determine model complexity
(see §1.3)
... computationally expensive
... wasteful of valuable data
In the case of the previous slide,
λ always becomes 0
when ML is used to determine λ.
Bayesian Linear Regression
 Bayesian treatment of linear regression
– Avoids the over-fitting problem of ML.
– Leads to automatic methods of determining model complexity
using the training data alone.
 What do we do?
– Introduce the prior distribution p(w) and the likelihood p(t | w).
• The model parameters w are treated as a random variable with a probability distribution, rather than as a point estimate.
– Calculate the posterior distribution using Bayes' theorem:
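In symbols, with \mathbf{t} denoting the vector of observed targets:
p(w | \mathbf{t}) \propto p(\mathbf{t} | w) \, p(w)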
Agenda
 3.3 Bayesian Linear Regression ベイズ線形回帰
– 3.3.1 Parameter distribution パラメータの分布
– 3.3.2 Predictive distribution 予測分布
– 3.3.3 Equivalent kernel 等価カーネル
Note: Marginal / Conditional Gaussians
 Marginal Gaussian distribution for x
 Conditional Gaussian distribution for y given x
 Marginal distribution of y
 Conditional distribution of x given y
Given:
p(x) = N(x | \mu, \Lambda^{-1}),   p(y | x) = N(y | A x + b, L^{-1})
Then:
p(y) = N(y | A \mu + b, L^{-1} + A \Lambda^{-1} A^T)
p(x | y) = N(x | \Sigma \{ A^T L (y - b) + \Lambda \mu \}, \Sigma)
where
\Sigma = (\Lambda + A^T L A)^{-1}
Parameter Distribution
 Remember the likelihood function given by §3.1.1:
– This is the exponential of a quadratic function of w
 The corresponding conjugate prior is given by
a Gaussian distribution:
(the noise precision β is treated as a known parameter)
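For reference, the likelihood (PRML eq. 3.10) and the conjugate prior (3.48) referred to here:
p(\mathbf{t} | w) = \prod_{n=1}^{N} N(t_n | w^T \phi(x_n), \beta^{-1})
p(w) = N(w | m_0, S_0)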
Parameter Distribution
 Now, given the likelihood (3.10) and the conjugate prior (3.48):
 The posterior distribution is then obtained by using (2.116), with mean and covariance shown below.
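The resulting posterior (PRML eqs. 3.49–3.51):
p(w | \mathbf{t}) = N(w | m_N, S_N)
m_N = S_N ( S_0^{-1} m_0 + \beta \Phi^T \mathbf{t} )
S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi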
Online Learning- Parameter Distribution
 If data points arrive sequentially, the design matrix for each new
observation has only one row: \Phi = \phi(x_N)^T.
 Treating the posterior after the first N − 1 points as the prior for the
N-th observation (x_N, t_N), we obtain the update formula for online learning
shown below.
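A sketch of the resulting update, which follows directly from (3.50)–(3.51) with the previous posterior N(w | m_{N-1}, S_{N-1}) used as the prior:
S_N^{-1} = S_{N-1}^{-1} + \beta \, \phi(x_N) \phi(x_N)^T
m_N = S_N ( S_{N-1}^{-1} m_{N-1} + \beta \, \phi(x_N) t_N )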
Simple Gaussian Prior- Parameter Distribution
 If the prior distribution is a zero-mean isotropic Gaussian
governed by a single precision parameter α:
 The corresponding posterior distribution then has the mean and covariance
shown below.
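For reference, the formulas referred to on this slide (PRML eqs. 3.52–3.54):
p(w | \alpha) = N(w | 0, \alpha^{-1} I)
m_N = \beta S_N \Phi^T \mathbf{t}
S_N^{-1} = \alpha I + \beta \Phi^T \Phi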
Relationship with MSSE- Parameter Distribution
 The log of the posterior distribution is the sum of the log likelihood
and the log of the prior, as a function of w (eq. 3.55, shown below).
 If the prior distribution is given by (3.52), the following two procedures are equivalent:
– Maximization of (3.55) with respect to w
– Minimization of the sum-of-squares error (MSSE) function
with the addition of a quadratic regularization term, with coefficient λ = α/β
Equivalent
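For reference, the log posterior referred to above (PRML eq. 3.55):
\ln p(w | \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 - \frac{\alpha}{2} w^T w + \text{const}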
Example- Parameter Distribution
 Straight-line fitting
– Model function: y(x, w) = w_0 + w_1 x
– True function: f(x, a) = a_0 + a_1 x
– Error: additive Gaussian noise on the targets
– Goal: To recover the values of a_0, a_1
from such data
– Prior distribution: zero-mean isotropic Gaussian (3.52)
(A small numerical sketch of this example is given below.)
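The following NumPy sketch reproduces this example numerically; the constants (a_0 = −0.3, a_1 = 0.5, α = 2.0, β = 25, 20 data points) follow the book's setup for this figure but should be treated as assumptions of this sketch rather than part of the slide.

```python
import numpy as np

# Bayesian update for straight-line fitting y(x, w) = w0 + w1 * x
# with prior p(w) = N(0, alpha^{-1} I) and known noise precision beta.
alpha, beta = 2.0, 25.0          # assumed hyperparameters (beta = 1 / 0.2**2)
a0, a1 = -0.3, 0.5               # assumed "true" parameters to be recovered

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
t = a0 + a1 * x + rng.normal(scale=1.0 / np.sqrt(beta), size=x.shape)

Phi = np.column_stack([np.ones_like(x), x])   # design matrix, phi(x) = [1, x]

# Posterior p(w | t) = N(w | m_N, S_N), cf. eqs. (3.53)-(3.54)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

print("posterior mean:", m_N)    # should lie close to [a0, a1]
print("posterior cov :", S_N)
```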
Generalized Gaussian Prior- Parameter Distribution
 We can generalize the
Gaussian prior with respect to its exponent q (eq. 3.56, shown below).
 The case q = 2 corresponds
to the Gaussian,
and only in this case is the
prior conjugate to the likelihood (3.10).
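For reference, the generalized prior (PRML eq. 3.56):
p(w | \alpha) = \left[ \frac{q}{2} \left( \frac{\alpha}{2} \right)^{1/q} \frac{1}{\Gamma(1/q)} \right]^M \exp\left( -\frac{\alpha}{2} \sum_{j=1}^{M} |w_j|^q \right)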
Agenda
 3.3 Bayesian Linear Regression ベイズ線形回帰
– 3.3.1 Parameter distribution パラメータの分布
– 3.3.2 Predictive distribution 予測分布
– 3.3.3 Equivalent kernel 等価カーネル
Predictive Distribution
 Let's consider making predictions of t directly
for new values of x.
 In order to obtain this, we need to evaluate the
predictive distribution:
 This formula is typically written as a marginalization over w
(summing out w):
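For reference, this is PRML eq. (3.57):
p(t | \mathbf{t}, \alpha, \beta) = \int p(t | w, \beta) \, p(w | \mathbf{t}, \alpha, \beta) \, dw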
Predictive Distribution
 The conditional distribution of the target variable is given by
p(t | x, w, \beta) = N(t | w^T \phi(x), \beta^{-1}).
 The posterior distribution of the weights is given by (3.49).
 Accordingly, the result of (3.57) is obtained by using (2.115), as shown below.
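The resulting predictive distribution (PRML eqs. 3.58–3.59):
p(t | x, \mathbf{t}, \alpha, \beta) = N(t | m_N^T \phi(x), \sigma_N^2(x))
\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)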
Predictive Distribution
 Now we discuss the variance of the predictive distribution (3.59):
– As additional data points are observed, the posterior distribution
becomes narrower: \sigma_{N+1}^2(x) \le \sigma_N^2(x)
– The 2nd term of (3.59) goes to zero in the limit N → ∞,
so the predictive variance approaches the noise level 1/β.
Additive noise
governed by the precision parameter β (the 1/β term).
The second term depends on the mapping vectors
\phi(x_n) of the observed data points (uncertainty in w).
Predictive Distribution
Example- Predictive Distribution
 Gaussian regression with a sine curve
– Basis functions: 9 Gaussian curves
Mean of predictive distribution
Standard deviation of
predictive distribution
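A minimal sketch of computing the predictive mean and standard deviation shown in these plots, using 9 Gaussian basis functions on noisy sinusoidal data; the basis width, α, β, and sample sizes here are assumptions of this sketch, not the book's exact settings.

```python
import numpy as np

# Predictive mean and variance for Bayesian linear regression with
# 9 Gaussian basis functions (plus a bias) on noisy sinusoidal data.
alpha, beta = 2.0, 25.0                       # assumed hyperparameters
centers = np.linspace(0.0, 1.0, 9)
width = 0.1                                   # assumed basis width

def design(x):
    """Rows are [1, exp(-(x - c_j)^2 / (2 * width^2)) for each center c_j]."""
    x = np.atleast_1d(x)[:, None]
    return np.hstack([np.ones((len(x), 1)),
                      np.exp(-0.5 * ((x - centers) / width) ** 2)])

rng = np.random.default_rng(1)
x_train = rng.uniform(0.0, 1.0, size=10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.shape)

Phi = design(x_train)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train            # eqs. (3.53)-(3.54)

x_new = np.linspace(0.0, 1.0, 5)
P = design(x_new)
mean = P @ m_N                                       # predictive mean, eq. (3.58)
var = 1.0 / beta + np.sum((P @ S_N) * P, axis=1)     # predictive variance, eq. (3.59)
print(np.c_[x_new, mean, np.sqrt(var)])
```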
Example- Predictive Distribution
 Gaussian regression with a sine curve
Example- Predictive Distribution
 Gaussian regression with a sine curve
Problem of Localized Basis- Predictive Distribution
 Polynomial regression
 Gaussian regression
Which is better?
Problem of Localized Basis- Predictive Distribution
 If we use localized basis functions such as Gaussians,
then in regions away from the basis function centers
the contribution from the 2nd term of (3.59) goes to zero.
 Accordingly, the predictive variance reduces to the noise contribution
1/β alone. This is not a good result: the model becomes overly confident
exactly in regions where it has seen no data.
Large contribution
Small contribution
Problem of Localized Basis- Predictive Distribution
 This problem (arising from the choice of localized basis functions)
can be avoided by adopting an alternative Bayesian approach
to regression known as a Gaussian process.
– See §6.4.
Case of Unknown Precision- Predictive Distribution
 If both w and β are treated as unknown, then
we can introduce a conjugate prior distribution and
corresponding posterior distribution in the form of a Gaussian-gamma
distribution:
 And then the predictive distribution takes the form of a Student's t-distribution:
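The conjugate prior referred to here takes a normal-gamma form (cf. PRML Exercise 3.12); a sketch:
p(w, \beta) = N(w | m_0, \beta^{-1} S_0) \, \mathrm{Gam}(\beta | a_0, b_0)
and marginalizing over both w and β yields a Student's t predictive distribution.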
Agenda
 3.3 Bayesian Linear Regression ベイズ線形回帰
– 3.3.1 Parameter distribution パラメータの分布
– 3.3.2 Predictive distribution 予測分布
– 3.3.3 Equivalent kernel 等価カーネル
Equivalent Kernel
 If we substitute the posterior mean solution (3.53) into the
expression (3.3), the predictive mean can be written:
 This formula can be viewed as a linear combination of the target values t_n:
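For reference, the predictive mean written out (PRML eq. 3.60):
y(x, m_N) = m_N^T \phi(x) = \beta \, \phi(x)^T S_N \Phi^T \mathbf{t} = \sum_{n=1}^{N} \beta \, \phi(x)^T S_N \phi(x_n) \, t_n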
Equivalent Kernel
 where the coefficient of each target value t_n is the equivalent kernel
k(x, x_n), defined below.
 This function is called the smoother matrix or the equivalent kernel.
 Regression functions which make predictions by taking linear
combinations of the training set target values are known as
linear smoothers.
 We can also predict t for a new input vector x using the equivalent
kernel directly, instead of first computing the basis function parameters w.
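For reference, PRML eqs. (3.61)–(3.62):
y(x, m_N) = \sum_{n=1}^{N} k(x, x_n) \, t_n,   where   k(x, x') = \beta \, \phi(x)^T S_N \phi(x')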
Example 1- Equivalent Kernel
 Equivalent kernel with Gaussian regression
 The equivalent kernel depends on the set of basis functions and on the
data set.
Equivalent Kernel
 The equivalent kernel expresses the contribution of each data point t_n
to the predictive mean at x.
 The covariance between y(x) and y(x') can be expressed using the
equivalent kernel:
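For reference, PRML eq. (3.63):
\mathrm{cov}[y(x), y(x')] = \beta^{-1} k(x, x')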
Large contribution
Small contribution
Properties of Equivalent Kernel- Equivalent Kernel
 The equivalent kernel has a localization property even when the basis
functions themselves are not localized.
 The equivalent kernel sums to one over the data points, for all x:
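For reference, PRML eq. (3.64):
\sum_{n=1}^{N} k(x, x_n) = 1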
Polynomial / Sigmoid
Example 2- Equivalent Kernel
 Equivalent kernel with polynomial regression
– Moving parameter:
Example 2- Equivalent Kernel
 Equivalent kernel with polynomial regression
– Moving parameter:
Example 2- Equivalent Kernel
 Equivalent kernel with polynomial regression
– Moving parameter:
Properties of Equivalent Kernel- Equivalent Kernel
 The equivalent kernel satisfies an important property shared by
kernel functions in general:
– A kernel function can be expressed as an inner product with
respect to a vector ψ(x) of nonlinear functions:
– In the case of the equivalent kernel, ψ(x) is given below:
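For reference, PRML eq. (3.65) and the corresponding \psi(x):
k(x, z) = \psi(x)^T \psi(z),   where   \psi(x) = \beta^{1/2} S_N^{1/2} \phi(x)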
Thank you!
zzz...