Machine Learning Basics
Lecture 1: Linear Regression
Princeton University COS 495
Instructor: Yingyu Liang
Machine learning basics
What is machine learning?
• “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
------- Machine Learning, Tom Mitchell, 1997
Example 1: image classification
Task: determine if the image is indoor or outdoor
Performance measure: probability of misclassification
Example 1: image classification
[Figure: sample images, one labeled “indoor” and one “outdoor”]
Experience/Data: images with labels
Example 1: image classification
• Some terminology
• Training data: the images given for learning
• Test data: the images to be classified
• Binary classification: classify into two classes
Example 1: image classification (multi-class)
ImageNet figure borrowed from vision.stanford.edu
Example 2: clustering images
Task: partition the images into 2 groups
Performance: similarities within groups
Data: a set of images
Example 2: clustering images
• Some terminology
• Unlabeled data vs labeled data
• Supervised learning vs unsupervised learning
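The slides leave the clustering algorithm unspecified; as an illustration, here is a minimal k-means sketch in Python/NumPy for the two-group task above. All names and data below are hypothetical stand-ins for image feature vectors.

```python
import numpy as np

def kmeans(X, k=2, n_iters=100, seed=0):
    """Minimal k-means: partition the rows of X into k groups by
    alternating nearest-centroid assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Squared l2 distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Each centroid becomes the mean of its group
        # (assumes no group goes empty; fine for this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Two synthetic, well-separated blobs standing in for image feature vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
labels, centroids = kmeans(X, k=2)
```

The within-group similarity criterion on the previous slide corresponds to the within-cluster sum of squared distances, which each k-means iteration does not increase.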
Math formulation
[Figure: an indoor image goes through “Extract features” to a color histogram over the red, green, and blue channels]
Feature vector: $x_i$    Label: $y_i = 0$ (indoor)
Math formulation
[Figure: an outdoor image goes through “Extract features” to a color histogram over the red, green, and blue channels]
Feature vector: $x_j$    Label: $y_j = 1$ (outdoor)
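A sketch of the “extract features” step above: a color histogram records how pixel intensities are distributed in each of the red, green, and blue channels, and the concatenated counts form the feature vector. The image below is a random array standing in for a real photo.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Concatenate per-channel intensity histograms into one feature vector.
    `image` is an (H, W, 3) uint8 array with channels R, G, B."""
    features = []
    for channel in range(3):  # red, green, blue
        hist, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 256))
        features.append(hist / hist.sum())  # normalize away the image size
    return np.concatenate(features)  # shape: (3 * bins,)

# Synthetic stand-in for an image; a real pipeline would load one from disk.
image = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
x_i = color_histogram(image)  # the feature vector x_i from the slide
```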
Math formulation
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$
• Find $y = f(x)$ using the training data
• s.t. $f$ is correct on test data
What kind of functions?
Math formulation
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$
• Find $y = f(x) \in \mathcal{H}$ using the training data
• s.t. $f$ is correct on test data
Hypothesis class
Math formulation
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$
• Find $y = f(x) \in \mathcal{H}$ using the training data
• s.t. $f$ is correct on test data
Connection between training data and test data?
Math formulation
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ using the training data
• s.t. $f$ is correct on test data i.i.d. from distribution $D$
They have the same distribution
(i.i.d.: independent and identically distributed)
Math formulation
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ using the training data
• s.t. $f$ is correct on test data i.i.d. from distribution $D$
What kind of performance measure?
Math formulation
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ using the training data
• s.t. the expected loss is small:
$L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$    Various loss functions
Math formulation
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ using the training data
• s.t. the expected loss is small:
$L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$
• Examples of loss functions:
• 0-1 loss: $l(f, x, y) = \mathbb{I}[f(x) \ne y]$, so $L(f) = \Pr[f(x) \ne y]$
• $l_2$ loss: $l(f, x, y) = [f(x) - y]^2$, so $L(f) = \mathbb{E}[f(x) - y]^2$
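Both losses translate directly into code; a small sketch with hypothetical labels and predictions:

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 where the prediction is wrong, 0 where it is right.
    Its expectation is Pr[f(x) != y]."""
    return (y_pred != y_true).astype(float)

def l2_loss(y_pred, y_true):
    """l2 (squared) loss: [f(x) - y]^2."""
    return (y_pred - y_true) ** 2

# Hypothetical labels and predictions, purely for illustration.
y_true = np.array([0, 1, 1, 0])
y_hard = np.array([0, 1, 0, 0])          # class predictions
y_soft = np.array([0.1, 0.9, 0.4, 0.2])  # real-valued predictions
print(zero_one_loss(y_hard, y_true).mean())  # 0.25
print(l2_loss(y_soft, y_true).mean())        # 0.105
```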
Math formulation
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ using the training data
• s.t. the expected loss is small:
$L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$    How to use?
Math formulation
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss
$\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$
• s.t. the expected loss $L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$ is small
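Putting the pieces together: empirical risk minimization picks, from the hypothesis class, the function with the smallest average loss on the training data. A minimal sketch, assuming a hypothetical finite hypothesis class of threshold classifiers so the minimum can be taken by enumeration:

```python
import numpy as np

def empirical_loss(f, xs, ys, loss):
    """L_hat(f) = (1/n) * sum_i loss(f, x_i, y_i)."""
    return np.mean([loss(f, x, y) for x, y in zip(xs, ys)])

# A toy finite hypothesis class H: threshold classifiers on a scalar feature.
H = [lambda x, t=t: int(x > t) for t in np.linspace(0.0, 1.0, 11)]
zero_one = lambda f, x, y: float(f(x) != y)  # 0-1 loss

xs = np.array([0.10, 0.35, 0.40, 0.70, 0.80, 0.90])
ys = np.array([0, 0, 0, 1, 1, 1])

# ERM: pick the hypothesis in H with the smallest empirical loss.
f_hat = min(H, key=lambda f: empirical_loss(f, xs, ys, zero_one))
```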
Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class 𝓗 and loss function 𝑙
• Optimization: minimize the empirical loss
Wait…
• Why handcraft the feature vector $x$?
• We can use prior knowledge to design suitable features
• Can the computer learn features from the raw images instead?
• Learning features directly from the raw images: Representation Learning
• Deep Learning ⊆ Representation Learning ⊆ Machine Learning ⊆ Artificial Intelligence
Wait…
• Does machine learning 1-2-3 include all approaches?
• It includes many, but not all
• Our current focus will be machine learning 1-2-3
Example: Stock Market Prediction
[Figure: synthetic stock prices for “Orange”, “MacroHard”, and “Ackermann”, 2013-2016. Disclaimer: synthetic data/in another parallel universe]
A sliding window over time serves as the input $x$ (see the sketch below); note such data are non-i.i.d.
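A sketch of the sliding-window construction, using a synthetic price series in keeping with the slide's disclaimer: each input $x$ is a window of consecutive past prices and the target $y$ is the next price.

```python
import numpy as np

def sliding_windows(series, window=5):
    """Turn a 1-D price series into (x, y) pairs: each x is a window of
    `window` consecutive prices and y is the price right after it.
    Consecutive windows overlap, so the pairs are NOT i.i.d."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

# Synthetic prices (the slide's data is explicitly synthetic as well).
prices = 100 + np.cumsum(np.random.default_rng(0).normal(0, 1, 200))
X, y = sliding_windows(prices, window=5)  # X: (195, 5), y: (195,)
```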
Linear regression
Real data: prostate cancer study by Stamey et al. (1989)
[Figure borrowed from The Elements of Statistical Learning: scatterplot matrix of the prostate cancer data]
$y$: prostate specific antigen
$(x_1, \dots, x_8)$: clinical measures
Linear regression
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ (the hypothesis class $\mathcal{H}$) that minimizes
$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$
($l_2$ loss; also called mean squared error)
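In code, the objective $\hat{L}(f_w)$ is just the mean squared error of the linear predictions. A small sketch on synthetic data (the data-generating model below is assumed for illustration):

```python
import numpy as np

def empirical_loss(w, X, y):
    """L_hat(f_w) = (1/n) * sum_i (w^T x_i - y_i)^2, one example per row of X."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)

# Synthetic data drawn from a known linear model, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, size=100)

print(empirical_loss(w_true, X, y))       # near the noise variance, 0.01
print(empirical_loss(np.zeros(3), X, y))  # much larger
```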
Linear regression: optimization
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2$
• Let $X$ be the matrix whose $i$-th row is $x_i^T$, and let $y$ be the vector $(y_1, \dots, y_n)^T$. Then
$\hat{L}(f_w) = \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2 = \frac{1}{n} \|Xw - y\|_2^2$
Linear regression: optimization
• Set the gradient to 0 to get the minimizer:
$\nabla_w \hat{L}(f_w) = \nabla_w \frac{1}{n} \|Xw - y\|_2^2 = 0$
$\nabla_w [(Xw - y)^T (Xw - y)] = 0$
$\nabla_w [w^T X^T X w - 2 w^T X^T y + y^T y] = 0$
$2 X^T X w - 2 X^T y = 0$
$w = (X^T X)^{-1} X^T y$ (assuming $X^T X$ is invertible)
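The closed-form minimizer translates directly into NumPy. A sketch, assuming $X^T X$ is invertible as in the derivation; solving the linear system is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equations X^T X w = X^T y for w.
    Assumes X^T X is invertible (X has full column rank)."""
    # np.linalg.solve is numerically preferable to computing the inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, size=100)

w_hat = fit_linear_regression(X, y)  # close to w_true
# np.linalg.lstsq also handles the rank-deficient case:
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```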
Linear regression: optimization
• Algebraic view of the minimizer
• If $X$ is square and invertible, just solve $Xw = y$ to get $w = X^{-1} y$
• But typically $X$ is a tall matrix (more examples than features), so $Xw = y$ has no exact solution; multiplying both sides by $X^T$ gives the square system $X^T X w = X^T y$
Normal equation: $w = (X^T X)^{-1} X^T y$
Linear regression with bias
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $f_{w,b}(x) = w^T x + b$ (with bias term $b$) to minimize the loss
• Reduce to the case without bias:
• Let $w' = [w; b]$ and $x' = [x; 1]$
• Then $f_{w,b}(x) = w^T x + b = (w')^T x'$
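The reduction is one line in code: append a constant 1 to every feature vector, then reuse the solver from before. A sketch on synthetic data:

```python
import numpy as np

def add_bias_column(X):
    """Map each row x to x' = [x; 1], so that w'^T x' = w^T x + b."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + 3.0 + rng.normal(0, 0.1, size=100)  # true b = 3

X_prime = add_bias_column(X)
w_prime = np.linalg.solve(X_prime.T @ X_prime, X_prime.T @ y)
w_hat, b_hat = w_prime[:-1], w_prime[-1]  # b_hat is close to 3.0
```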