Gradient Descent
Natural Language Processing
Emory University
Jinho D. Choi
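Supervised Learning

Training data: $(X, Y) = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
$\hat{y} = f(x)$ predicts the output of $x$ ($x$: input, $\hat{y}$: prediction).
Output $y = \pm 1$: binomial distribution.

Expected risk (unknown!), with loss function $\ell$ and joint distribution $P(x, y)$:
$$E(f) = \int \ell(\hat{y}; y) \cdot P(x, y) \, dx \, dy$$

Empirical risk (minimize!):
$$\hat{E}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(\hat{y}_i; y_i)$$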
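The empirical risk above is just an average of per-instance losses, which a few lines of Python can make concrete. This is a minimal sketch: the 0-1 loss, the toy predictor `sign_of_first_feature`, and the data are all hypothetical choices made for illustration, not taken from the slides.

```python
def empirical_risk(f, loss, xs, ys):
    """Average loss of predictor f over the n training pairs (x_i, y_i)."""
    n = len(xs)
    return sum(loss(f(x), y) for x, y in zip(xs, ys)) / n


def zero_one_loss(y_hat, y):
    """Hypothetical loss for illustration: 1 for a wrong label, 0 otherwise."""
    return 0.0 if y_hat == y else 1.0


def sign_of_first_feature(x):
    """Hypothetical predictor f(x): the sign of the first feature, as +1 or -1."""
    return 1.0 if x[0] >= 0 else -1.0


# Toy data with labels y = +1 or -1; the third instance is misclassified.
xs = [[2.0], [-1.0], [0.5], [-3.0]]
ys = [1.0, -1.0, -1.0, -1.0]

risk = empirical_risk(sign_of_first_feature, zero_one_loss, xs, ys)
print(risk)  # 0.25
```

The expected risk cannot be computed this way because the joint distribution is unknown, which is why training minimizes the empirical risk instead.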
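Linear Prediction

Linear function, where $\phi(x)$ is the feature vector of $x$:
$$\hat{y} = f(x) = w^T \phi(x) = w^T x$$

Least squares loss:
$$\ell(\hat{y}; y) = \frac{1}{2} (\hat{y} - y)^2$$
$$\ell(w, x; y) = \frac{1}{2} (w^T x - y)^2$$

Find a weight vector that minimizes the loss:
$$\hat{E}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(\hat{y}_i; y_i)$$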
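As a small worked instance of the linear function and the least squares loss, the sketch below evaluates both for one weight vector and one feature vector. The numbers are made up for illustration.

```python
def predict(w, x):
    """Linear prediction y_hat = w^T x for equal-length vectors w and x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def least_squares_loss(w, x, y):
    """l(w, x; y) = 1/2 * (w^T x - y)^2."""
    return 0.5 * (predict(w, x) - y) ** 2


w = [0.5, -1.0]   # hypothetical weight vector
x = [2.0, 1.0]    # hypothetical feature vector
y = 1.0           # gold output

y_hat = predict(w, x)               # 0.5 * 2.0 + (-1.0) * 1.0 = 0.0
loss = least_squares_loss(w, x, y)  # 0.5 * (0.0 - 1.0)^2 = 0.5
print(y_hat, loss)
```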
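Gradient Descent

$$w_{t+1} \leftarrow w_t - \eta_t \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial w} \ell(w_t, x_i; y_i)$$
where $\eta_t$ is the learning rate and $\frac{\partial}{\partial w} \ell$ is the derivative of the loss.

Minimize the loss: derivative → 0.
Global optimum? Guaranteed when the problem is a convex optimization.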
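Gradient Descent

How often is the weight vector updated? Once per pass over all $n$ instances:
$$w_{t+1} \leftarrow w_t - \eta_t \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial w} \ell(w_t, x_i; y_i)$$

For the least squares loss:
$$\ell(w, x; y) = \frac{1}{2} (w^T x - y)^2$$
$$\frac{\partial}{\partial w} \ell(w, x; y) = \frac{\partial}{\partial w} \frac{1}{2} (w^T x - y)^2 = (w^T x - y) x$$

Substituting the derivative into the update:
$$w_{t+1} \leftarrow w_t - \eta_t \frac{1}{n} \sum_{i=1}^{n} (w_t^T x_i - y_i) x_i$$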
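The batch update for least squares can be sketched in plain Python as follows. The data (points on the line y = 2x + 1, with a constant bias feature), the fixed learning rate, and the number of passes are assumptions chosen so the example converges; the slides do not specify them.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def batch_gradient(w, xs, ys):
    """(1/n) * sum_i (w^T x_i - y_i) x_i, the averaged least squares gradient."""
    n = len(xs)
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = dot(w, x) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / n
    return grad


def gradient_descent(xs, ys, eta, epochs):
    """Batch gradient descent: one weight update per pass over all instances."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = batch_gradient(w, xs, ys)                # uses all n instances
        w = [wj - eta * gj for wj, gj in zip(w, grad)]  # single update
    return w


# Hypothetical data: y = 2 * x + 1, with x[0] = 1.0 acting as the bias feature.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]

w_batch = gradient_descent(xs, ys, eta=0.1, epochs=2000)
print(w_batch)  # close to [1.0, 2.0]
```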
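Stochastic Gradient Descent

Batch update:
$$w_{t+1} \leftarrow w_t - \eta_t \frac{1}{n} \sum_{i=1}^{n} (w_t^T x_i - y_i) x_i$$

Stochastic update, where the weight vector is updated for every instance:
$$w_{t+1} \leftarrow w_t - \eta_t (w_t^T x_i - y_i) x_i$$

Example:
$w_0 \leftarrow 0$
$w_0^T x_1 > 0$: $w_1 \leftarrow w_0 - \eta (w_0^T x_1 + 1) x_1$
$w_1^T x_2 < 0$: $w_2 \leftarrow w_1 - \eta (w_1^T x_2 + 1) x_2$
$w_2^T x_3 < 0$: $w_3 \leftarrow w_2 - \eta (w_2^T x_3 - 1) x_3$
$w_3^T x_4 > 0$: $w_4 \leftarrow w_3 - \eta (w_3^T x_4 - 1) x_4$
[Figure: diagram with labels 0, +, and -]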
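A minimal sketch of the per-instance update for least squares is shown below. It visits the instances in a fixed order rather than sampling them at random, and the data, learning rate, and number of passes are assumptions made for the example.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def sgd_least_squares(xs, ys, eta, epochs):
    """Stochastic gradient descent: the weights change after every instance."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):       # fixed order (an assumption; often shuffled)
            err = dot(w, x) - y        # w_t^T x_i - y_i
            w = [wj - eta * err * xj for wj, xj in zip(w, x)]
    return w


# Hypothetical data: y = 2 * x + 1, with x[0] = 1.0 acting as the bias feature.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [1.0, 3.0, 5.0, 7.0]

w_sgd = sgd_least_squares(xs, ys, eta=0.05, epochs=1000)
print(w_sgd)  # close to [1.0, 2.0]
```

Each pass performs n small updates instead of one averaged update, which is the only difference from the batch version.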
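Perceptron

Stochastic gradient descent:
$$w_{t+1} \leftarrow w_t - \eta_t \nabla \ell$$

Least squares:
$$\ell(w, x; y) = \frac{1}{2} (w^T x - y)^2$$
$$\nabla \ell = (w^T x - y) x$$

Perceptron:
$$\ell(w, x; y) = \max\{0, -w^T x \cdot y\}$$
$$\nabla \ell = \begin{cases} -x \cdot y & w^T x \cdot y < 0 \\ 0 & \text{otherwise} \end{cases}$$

Perceptron update:
$$w_{t+1} \leftarrow w_t + \eta_t \begin{cases} x \cdot y & w_t^T x \cdot y < 0 \\ 0 & \text{otherwise} \end{cases}$$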
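The perceptron update can be sketched as below. One deliberate deviation is labeled in the code: the test uses `<= 0` instead of the strict `< 0`, because with the weights initialized to zero the strict test would never trigger a first update. The data, learning rate, and number of passes are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def perceptron(xs, ys, eta=1.0, epochs=10):
    """Perceptron training with labels y in {-1, +1}."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Assumption: <= 0 rather than < 0, so that w = 0 can be updated.
            if y * dot(w, x) <= 0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


# Hypothetical linearly separable data; x[0] = 1.0 is the bias feature.
xs = [[1.0, 2.0, 1.0], [1.0, 1.0, 3.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
ys = [1.0, 1.0, -1.0, -1.0]

w_perceptron = perceptron(xs, ys)
print(w_perceptron)  # [1.0, 2.0, 1.0]
```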
Averaged Perceptron

The final hyperplane may be overfitted to later instances.
Take the average of all hyperplanes, including ones that are not updated.
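Averaged Perceptron

Direct average of all weight vectors:
$$w_{t+1} \leftarrow w_t + \eta_t (x \cdot y) \quad \text{if } w_t^T x \cdot y < 0$$
$$w \leftarrow \frac{1}{c} \sum_{t=0}^{c-1} w_t$$
Sparse vector? Summing every $w_t$ touches all dimensions at each step, even when $x$ is sparse.

Incremental form:
Initialization: $c \leftarrow 1$
Update rule:
$$w_{t+1} \leftarrow w_t + \eta_t (x \cdot y)$$
$$v_{t+1} \leftarrow v_t + \eta_t \cdot c \, (x \cdot y)$$
For every instance: $c \leftarrow c + 1$
Averaged weights: $w \leftarrow w - \frac{1}{c} \cdot v$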
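To check that the incremental form yields the same result as averaging every weight vector, the sketch below implements both and compares them. As an assumption, the update test uses `<= 0` instead of `< 0` so that zero-initialized weights can be updated; the data and hyperparameters are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def averaged_perceptron(xs, ys, eta=1.0, epochs=5):
    """Incremental form: keep w, a counter-weighted sum v, and the counter c."""
    w = [0.0] * len(xs[0])
    v = [0.0] * len(xs[0])
    c = 1
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            if y * dot(w, x) <= 0:  # assumption: <= 0 instead of < 0
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                v = [vj + eta * c * y * xj for vj, xj in zip(v, x)]
            c += 1                  # incremented for every instance
    return [wj - vj / c for wj, vj in zip(w, v)]


def naive_averaged_perceptron(xs, ys, eta=1.0, epochs=5):
    """Direct form: sum w_0, w_1, ... after every instance, then divide."""
    w = [0.0] * len(xs[0])
    total = list(w)                 # starts with w_0
    count = 1
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            if y * dot(w, x) <= 0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
            total = [tj + wj for tj, wj in zip(total, w)]  # dense at every step
            count += 1
    return [tj / count for tj in total]


# Hypothetical data; x[0] = 1.0 is the bias feature.
xs = [[1.0, 1.0], [1.0, -1.0], [1.0, 2.0], [1.0, -2.0]]
ys = [1.0, -1.0, 1.0, -1.0]

avg_fast = averaged_perceptron(xs, ys)
avg_naive = naive_averaged_perceptron(xs, ys)
print(avg_fast, avg_naive)
```

The incremental form only touches `w` and `v` when an update happens, so with sparse `x` the per-instance cost stays proportional to the number of active features.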
Multinomial Perceptron

Binomial distribution requires 1 hyperplane to separate 2 classes.
How many for $m$ classes?
Multinomial distribution requires $m$ hyperplanes to separate $m$ classes.
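Multinomial Perceptron

5 features (including bias)

Binomial: $y \in \{-1, 1\}$
$w = [a \;\; b \;\; c \;\; d \;\; e]$
$x = [1 \;\; 0 \;\; 0 \;\; 1 \;\; 0]$
$w^T x = a + d$
$$\hat{y} = \begin{cases} 1 & w^T x \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

Multinomial: $y \in \{0, 1, 2, 3\}$
$w = [a_0 \; a_1 \; a_2 \; a_3 \;\; b_0 \; b_1 \; b_2 \; b_3 \;\; c_0 \; c_1 \; c_2 \; c_3 \;\; d_0 \; d_1 \; d_2 \; d_3 \;\; e_0 \; e_1 \; e_2 \; e_3]$
$w_y^T x = a_y + d_y$
$$\hat{y} = \arg\max_y w_y^T x$$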
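The two prediction rules can be sketched with the same flattened weight layout as above, where each feature owns one weight per class. The concrete weight values are made up for illustration, and the index formula `j * m + y` is an assumption about how the flattened vector is addressed.

```python
def predict_binomial(w, x):
    """y_hat = 1 if w^T x >= 0, otherwise -1."""
    score = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if score >= 0 else -1


def class_score(w_flat, x, y, m):
    """w_y^T x, where feature j's weight for class y sits at index j * m + y."""
    return sum(w_flat[j * m + y] * xj for j, xj in enumerate(x))


def predict_multinomial(w_flat, x, m):
    """y_hat = argmax over classes y of w_y^T x (first maximum wins on ties)."""
    scores = [class_score(w_flat, x, y, m) for y in range(m)]
    return max(range(m), key=lambda y: scores[y])


x = [1, 0, 0, 1, 0]  # features a and d are active

# Binomial: w = [a, b, c, d, e], so w^T x = a + d = 0.5 - 2.0.
w_bin = [0.5, 1.0, 1.0, -2.0, 3.0]
y_bin = predict_binomial(w_bin, x)

# Multinomial with m = 4 classes: only the a and d weights matter for this x.
a = [0.1, 0.4, -0.2, 0.0]
b = [9.0, 9.0, 9.0, 9.0]
c = [9.0, 9.0, 9.0, 9.0]
d = [0.3, -0.5, 0.9, 0.1]
e = [9.0, 9.0, 9.0, 9.0]
w_flat = a + b + c + d + e
y_multi = predict_multinomial(w_flat, x, m=4)

print(y_bin, y_multi)  # -1 2
```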
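Binomial vs. Multinomial Perceptron

Binomial, if $w_t^T x \cdot y < 0$:
$$w_{t+1} \leftarrow w_t + \eta_t (x \cdot y)$$

Multinomial, if $y \ne \hat{y}$:
$$w_{y,t+1} \leftarrow w_{y,t} + \eta_t \cdot x$$
$$w_{\hat{y},t+1} \leftarrow w_{\hat{y},t} - \eta_t \cdot x$$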
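A minimal sketch of the multinomial update follows, keeping one weight vector per class in a list instead of a single flattened vector. The three-class data, the learning rate, and the number of passes are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def predict(W, x):
    """argmax over classes y of w_y^T x (first maximum wins on ties)."""
    scores = [dot(w_y, x) for w_y in W]
    return max(range(len(W)), key=lambda y: scores[y])


def multinomial_perceptron(xs, ys, m, eta=1.0, epochs=10):
    """On a mistake, move w_y toward x and w_y_hat away from x."""
    W = [[0.0] * len(xs[0]) for _ in range(m)]
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            y_hat = predict(W, x)
            if y_hat != y:
                W[y] = [wj + eta * xj for wj, xj in zip(W[y], x)]
                W[y_hat] = [wj - eta * xj for wj, xj in zip(W[y_hat], x)]
    return W


# Hypothetical data: a bias feature followed by one indicator per class.
xs = [[1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
ys = [0, 1, 2]

W = multinomial_perceptron(xs, ys, m=3)
predictions = [predict(W, x) for x in xs]
print(predictions)  # [0, 1, 2]
```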
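Hinge Loss

Perceptron:
$$\ell(w, x; y) = \max\{0, -w^T x \cdot y\}$$
$$\nabla \ell = \begin{cases} -x \cdot y & w^T x \cdot y < 0 \\ 0 & \text{otherwise} \end{cases}$$

Hinge loss:
$$\ell(w, x; y) = \max\{0, 1 - w^T x \cdot y\}$$
$$\nabla \ell = \begin{cases} -x \cdot y & w^T x \cdot y < 1 \\ 0 & \text{otherwise} \end{cases}$$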
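The difference between the two losses shows up on an instance that is classified correctly but with a small margin. The sketch below uses hypothetical values with margin 0.5: the perceptron loss is zero, while the hinge loss still penalizes the instance and produces a nonzero gradient.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def perceptron_loss(w, x, y):
    """max{0, -w^T x * y}: zero whenever the instance is classified correctly."""
    return max(0.0, -y * dot(w, x))


def hinge_loss(w, x, y):
    """max{0, 1 - w^T x * y}: zero only when the margin is at least 1."""
    return max(0.0, 1.0 - y * dot(w, x))


def hinge_gradient(w, x, y):
    """-x * y if w^T x * y < 1, otherwise the zero vector."""
    if y * dot(w, x) < 1:
        return [-y * xj for xj in x]
    return [0.0] * len(x)


w = [0.5, 0.0]   # hypothetical weights
x = [1.0, 2.0]   # hypothetical instance
y = 1.0          # margin: w^T x * y = 0.5

p_loss = perceptron_loss(w, x, y)
h_loss = hinge_loss(w, x, y)
h_grad = hinge_gradient(w, x, y)
print(p_loss, h_loss, h_grad)  # 0.0 0.5 [-1.0, -2.0]
```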
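Adaptive Gradient Descent

Update condition:
Perceptron: if $w_t^T x \cdot y < 0$
Hinge loss: if $w_t^T x \cdot y < 1$

Standard update:
$$w_{t+1} \leftarrow w_t + \eta_t (x \cdot y)$$

Adaptive update, where $\circ$ is the element-wise product and the square root and division are element-wise:
$$g_{t+1} \leftarrow g_t + x \circ x$$
$$w_{t+1} \leftarrow w_t + \frac{\eta}{\rho + \sqrt{g_{t+1}}} \cdot (x \cdot y)$$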
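The adaptive update can be sketched with the hinge condition as follows. As assumptions, `g` is accumulated only when the update condition fires (the gradient is zero otherwise), and the values of `eta`, `rho`, the number of passes, and the data are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def adagrad_hinge(xs, ys, eta=1.0, rho=0.1, epochs=10):
    """Per-dimension learning rates eta / (rho + sqrt(g)) with the hinge condition."""
    w = [0.0] * len(xs[0])
    g = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            if y * dot(w, x) < 1:                           # hinge loss condition
                g = [gj + xj * xj for gj, xj in zip(g, x)]  # g + x o x
                w = [wj + eta / (rho + math.sqrt(gj)) * y * xj
                     for wj, gj, xj in zip(w, g, x)]
    return w


# Hypothetical linearly separable data; x[0] = 1.0 is the bias feature.
xs = [[1.0, 2.0, 1.0], [1.0, 1.0, 3.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
ys = [1.0, 1.0, -1.0, -1.0]

w_adagrad = adagrad_hinge(xs, ys)
margins = [y * dot(w_adagrad, x) for x, y in zip(xs, ys)]
print(w_adagrad, margins)
```

Dimensions that have accumulated large squared feature values take smaller steps, while rarely active dimensions keep a step size close to `eta / rho`.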

CS571: Gradient Descent
