Exploring Optimization in Vowpal
Wabbit
-Shiladitya Sen
Vowpal Wabbit
• Online
• Open Source
• Machine Learning Library
• Has achieved record-breaking speed by implementation of
• Parallel Processing
• Caching
• Hashing, etc.
• A “true” library:
offers a wide range of
machine learning and
optimization algorithms
Machine Learning Models
• Linear Regressor ( --loss_function squared)
• Logistic Regressor (--loss_function logistic)
• SVM (--loss_function hinge)
• Neural Networks ( --nn <arg> )
• Matrix Factorization
• Latent Dirichlet Allocation ( --lda <arg> )
• Active Learning ( --active_learning)
Regularization
• L1 Regularization ( --l1 <arg> )
• L2 Regularization ( --l2 <arg> )
Optimization Algorithms
• Online Gradient Descent ( default )
• Conjugate Gradient ( --conjugate_gradient )
• L-BFGS ( --bfgs )
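As an added illustration (not part of the original slides), the sketch below drives the vw binary from Python with a few of the switches listed above. It assumes vw is installed and on the PATH and that a file train.vw in VW input format exists; the file names and regularization values are placeholders.

```python
# Minimal sketch: invoking Vowpal Wabbit with some of the switches listed above.
# Assumes the `vw` binary is on PATH and `train.vw` is an existing file in VW
# input format; both are illustrative assumptions.
import subprocess

cmd = [
    "vw",
    "-d", "train.vw",               # training data in VW format
    "--loss_function", "squared",   # linear regression loss
    "--l1", "1e-6",                 # L1 regularization strength
    "--l2", "1e-6",                 # L2 regularization strength
    "-c", "--passes", "5",          # cache file + multiple passes (online GD)
    "-f", "model.vw",               # where to store the trained model
]
subprocess.run(cmd, check=True)
```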
Optimization
The Convex Definition
Convex Sets
Definition:
A subset $C$ of $\mathbb{R}^n$ is said to be a convex set if
$(1-\lambda)\,x + \lambda\, y \in C$, where $x, y \in C$ and $\lambda \in (0,1)$.
Convex Functions:
A real-valued function $f : X \to \mathbb{R}$, where $X$ is a convex set in $\mathbb{R}^n$,
is said to be a convex function if its epigraph, defined as
$\{\,(x, \mu) \mid x \in X,\ \mu \ge f(x)\,\}$, is a convex set.
It can be proved from the definition of convex functions
that such a function has no spurious local minima.
In other words…
Any local minimum is a global minimum,
so there is at most one minimum value to find.
Loss functions which are convex therefore help in optimization for
Machine Learning.
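As an added example (not on the slide), the squared loss minimized by VW's linear regressor is convex in the weight vector $w$, since its Hessian is positive semi-definite:

```latex
% Squared loss for a single example (x, y) as a function of the weights w.
\[
  J(w) = \left(y - w^{\top} x\right)^{2},
  \qquad
  \nabla^{2} J(w) = 2\, x x^{\top} \succeq 0 ,
\]
% so J is convex in w, and any local minimum of the training loss is global.
```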
Optimization
Algorithm I : Online Gradient Descent
What the batch implementation of
Gradient Descent (GD) does
How does Batch-version of GD work?
• Expresses total loss J as a function of a set of
parameters : x
• Calculates $\nabla J(x)$ as the direction of steepest
ascent; so $-\nabla J(x)$ is the direction of steepest descent
• Takes a calculated step α in that direction to
reach a new point, with new co-ordinate values
of x
Algorithm: $x_{t+1} = x_t - \alpha\, \nabla J(x_t)$
This continues until the required tolerance is achieved.
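A minimal NumPy sketch of this batch update for squared loss; the data matrix, step size α and stopping rule are illustrative assumptions, not VW internals.

```python
# Minimal sketch of batch gradient descent for squared loss (illustrative,
# not VW's actual implementation).
import numpy as np

def batch_gd(X, y, alpha=0.01, n_iters=1000, tol=1e-8):
    x = np.zeros(X.shape[1])                # parameter vector ("x" on the slide)
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ x - y)        # ∇J(x) over the whole dataset
        x_new = x - alpha * grad            # step in direction of steepest descent
        if np.linalg.norm(x_new - x) < tol: # required tolerance achieved
            return x_new
        x = x_new
    return x
```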
What is the online implementation of GD?
How does online GD work?
1. Takes a point $p_t$ from the dataset
2. Using the existing hypothesis, predicts a value for $p_t$
3. True value is revealed
4. Calculates the error J as a function of the parameters
x for point $p_t$
5. Evaluates $\nabla J(x_t)$
6. Takes a step in the direction of steepest descent: $-\nabla J(x_t)$
7. Updates the parameters as: $x_{t+1} = x_t - \eta_t\, \nabla J(x_t)$
8. Moves on to the next point $p_{t+1}$
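A minimal NumPy sketch of steps 1-8 for squared loss; the data format and the fixed step size η are illustrative assumptions, not VW's actual code path.

```python
# Minimal sketch of online (stochastic) gradient descent for squared loss,
# following steps 1-8 above; illustrative, not VW's actual code path.
import numpy as np

def online_gd(points, eta=0.1):
    """points: iterable of (features, true_value) pairs."""
    x = None                                  # parameters
    for features, true_value in points:       # 1. take a point p_t
        f = np.asarray(features, dtype=float)
        if x is None:
            x = np.zeros_like(f)
        prediction = x @ f                    # 2. predict with current hypothesis
        error = prediction - true_value       # 3.-4. true value revealed, error J
        grad = 2 * error * f                  # 5. evaluate ∇J(x_t)
        x = x - eta * grad                    # 6.-7. step of steepest descent
    return x                                  # 8. loop moves on to the next point
```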
Looking Deeper into Online GD
• Essentially calculates the error function J(x)
independently for each point, as opposed to
calculating J(x) as the sum of all errors as in the
batch (offline) implementation of GD
• To achieve accuracy, Online GD takes multiple
passes through the dataset
(Continued…)
Still deeper…
• So that convergence is reached, the step η is
reduced with each pass (a cache file, -c, is used
for the multiple passes). In VW, the schedule is
implemented with the following options:
-l [--learning_rate] λ [10]
--power_t p [0.5]
--initial_t i [1]
--decay_learning_rate d [1]
Learning rate used on the t-th example of pass e:
$\eta_e(t) = \lambda \cdot d^{\,e} \cdot \left(\frac{i}{i + t}\right)^{p}$
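A small sketch of this decay schedule as written above; the function name and the way the example counter t is fed in are assumptions for illustration, not VW source.

```python
# Sketch of the decayed learning-rate schedule described above; the exact
# bookkeeping inside VW differs, so treat this as an illustration of the
# formula, not as VW source.
def learning_rate(example_count, epoch, lam=10.0, p=0.5, i=1.0, d=1.0):
    """lam = --learning_rate, p = --power_t, i = --initial_t,
    d = --decay_learning_rate, epoch = completed passes over the data."""
    return lam * (d ** epoch) * (i / (i + example_count)) ** p

# Example: the step size shrinks as more examples are seen.
print(learning_rate(example_count=0, epoch=0))     # 10.0
print(learning_rate(example_count=100, epoch=1))   # ≈ 0.995
```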
So why Online GD?
• It takes less space…
• And my system needs its space!
Optimization
Algorithm II: Method of Conjugate
Gradients
What is wrong with Gradient Descent?
•Often takes steps in the same direction
•Convergence Issues
Convergence Problems:
The need for Conjugate Gradients:
Wouldn’t it be wonderful if we never needed more than one
step in a given direction to minimize the error along that direction?
This is where Conjugate Gradient comes in…
Method of Orthogonal Directions
• In an (n+1) dimensional vector space where J is
defined with n parameters, at most n linearly
independent directions for parameters exist
• Error function may have a component in at most n
linearly independent (orthogonal) directions
• Intended: A step in each of these directions
i.e. at most n steps to minimize the error
• Problem: not solvable for orthogonal directions, since the
required step sizes cannot be computed without already knowing the solution
Conjugate Directions:
If $d_i, d_j$ are search directions:
Orthogonal: $d_i^{T} d_j = 0$
Conjugate (with respect to $A$): $d_i^{T} A\, d_j = 0$
How do we get the conjugate
directions?
• We first choose n mutually orthogonal
directions: $u_1, u_2, \ldots, u_n$
• We then calculate each $d_i$ as:
$d_i = u_i + \sum_{k=1}^{i-1} \beta_{ik}\, d_k$, where $\beta_{ik} = -\dfrac{u_i^{T} A\, d_k}{d_k^{T} A\, d_k}$
• This subtracts out any components of $u_i$ which are not
A-orthogonal to $d_1, d_2, \ldots, d_{i-1}$, to calculate $d_i$
(a sketch of this construction follows).
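A short sketch of this conjugate Gram-Schmidt construction; the helper name and the 2x2 test matrix are illustrative assumptions.

```python
# Sketch of conjugate Gram-Schmidt: turning mutually orthogonal directions
# u_1..u_n into A-conjugate directions d_1..d_n (illustrative helper, not VW code).
import numpy as np

def conjugate_directions(A, U):
    """A: symmetric positive-definite (n, n); U: rows are the orthogonal u_i."""
    D = []
    for u in U:
        d = u.astype(float).copy()
        for dk in D:
            beta = -(u @ A @ dk) / (dk @ A @ dk)   # subtract non-A-orthogonal part
            d += beta * dk
        D.append(d)
    return np.array(D)

A = np.array([[3.0, 1.0], [1.0, 2.0]])
D = conjugate_directions(A, np.eye(2))
print(D[0] @ A @ D[1])                             # ≈ 0: directions are A-conjugate
```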
So what is Method of Conjugate Gradients?
• If we set $u_i$ to $r_i$, the gradient at the i-th step,
we have the Method of Conjugate Gradients.
• This works because $r_i$ and $r_j$ are linearly independent for $i \neq j$.
• The step size in the direction $d_i$ is found by an
exact line search.
The Algorithm for Conjugate Gradient:
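For reference, a minimal sketch of the standard conjugate gradient iteration for minimizing $J(x) = \tfrac{1}{2}x^{T}Ax - b^{T}x$ (equivalently, solving $Ax = b$ with $A$ symmetric positive-definite); an illustration, not VW's implementation.

```python
# Minimal sketch of the (linear) conjugate gradient iteration for A x = b,
# with A symmetric positive-definite. Illustrative only.
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x                       # residual = -∇J(x), first search direction
    d = r.copy()
    for _ in range(len(b)):             # at most n steps in exact arithmetic
        alpha = (r @ r) / (d @ A @ d)   # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d            # new direction, A-conjugate to the old ones
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))         # ≈ [0.0909, 0.6364]
```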
Requirement for Preconditioning:
• Round-off errors lead to slight deviations
from the conjugate directions
• As a result, Conjugate Gradient is
implemented iteratively
• To minimize number of iterations,
preconditioning is done on the vector space
What is Pre-conditioning?
• The vector space is modified by multiplying by a
matrix $M^{-1}$, where M is a symmetric,
positive-definite matrix.
• This leads to a better clustering of the
eigenvalues and a faster convergence.
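As an added illustration of one common choice, a Jacobi (diagonal) preconditioner takes M = diag(A); the matrix below is a made-up ill-conditioned example.

```python
# Sketch of a simple Jacobi (diagonal) preconditioner: one common choice of M
# for which M^{-1} is cheap to apply. Illustrative; not VW's preconditioner.
import numpy as np

def jacobi_preconditioner(A):
    """M = diag(A) is symmetric positive-definite when A is; return M^{-1}."""
    return np.diag(1.0 / np.diag(A))

A = np.array([[100.0, 1.0], [1.0, 1.0]])
M_inv = jacobi_preconditioner(A)
print(np.linalg.cond(A))          # ≈ 101: ill-conditioned original system
print(np.linalg.cond(M_inv @ A))  # ≈ 2.7: preconditioned system is much better
```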
Optimization
Algorithm III: L-BFGS
Why think linearly?
Newton’s Method proposes a step based on a quadratic
(non-linear) local model of J, as opposed to the linear
model used in GD and CG.
This leads to a faster convergence…
Newton’s Method:
2nd-order Taylor's series expansion of $J(x + \Delta x)$:
$J(x + \Delta x) \approx J(x) + \nabla J(x)^{T} \Delta x + \tfrac{1}{2}\, \Delta x^{T}\, \nabla^{2} J(x)\, \Delta x$
Minimizing with respect to $\Delta x$, we get:
$\Delta x = -[\nabla^{2} J(x)]^{-1} \nabla J(x)$
In iterative form:
$x_{n+1} = x_n - [\nabla^{2} J(x_n)]^{-1} \nabla J(x_n)$
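A minimal sketch of this Newton iteration; the quadratic test objective $J(x) = \tfrac{1}{2}x^{T}Ax - b^{T}x$ and its derivatives are illustrative assumptions.

```python
# Minimal sketch of the Newton iteration above; the test objective and its
# derivatives are illustrative assumptions.
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b          # ∇J(x)
hess = lambda x: A                  # ∇²J(x)

def newton(x0, n_iters=20, tol=1e-10):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        step = np.linalg.solve(hess(x), grad(x))   # [∇²J(x)]^{-1} ∇J(x)
        x = x - step                               # x_{n+1} = x_n - step
        if np.linalg.norm(step) < tol:
            break
    return x

print(newton([0.0, 0.0]))   # ≈ [0.0909, 0.6364]: one Newton step suffices here
```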
What is this BFGS Algorithm?
• Complications in calculating $[H]^{-1} = [\nabla^{2} J(x)]^{-1}$
led to Quasi-Newton Methods, the most popular
among which is BFGS.
• Named after Broyden-Fletcher-Goldfarb-
Shanno
• Maintains an approximate matrix $B \approx [\nabla^{2} J(x)]^{-1}$ and
updates B upon each iteration
BFGS Algorithm:
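For reference, a sketch of the textbook BFGS update for the inverse-Hessian approximation B; this is the standard formula, not necessarily the exact form shown on the original slide.

```python
# Sketch of the BFGS update for the inverse-Hessian approximation B
# (standard textbook formula, shown for illustration; not VW source).
import numpy as np

def bfgs_update(B, s, y):
    """B: current inverse-Hessian approximation.
    s = x_{k+1} - x_k (the step), y = ∇J(x_{k+1}) - ∇J(x_k) (gradient change)."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s)) \
           + rho * np.outer(s, s)
```

Starting from B = I, repeated application of this update after each line-search step gradually builds up curvature information about J.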
Memory is a limited asset
• In Vowpal Wabbit, the version of BFGS
implemented is L-BFGS
• In L-BFGS, not all of the previous updates to B are
stored in memory
• At a particular iteration i, only the last m
updates are stored and used to make the new
update (see the sketch below)
• Also, the step size η at each step is calculated
by an inexact line search satisfying the Wolfe
conditions.
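A sketch of the L-BFGS two-loop recursion that makes this possible: it applies the inverse-Hessian approximation to the gradient using only the last m (s, y) pairs, so B itself is never formed. Function and variable names are illustrative, not VW's.

```python
# Sketch of the L-BFGS two-loop recursion; illustrative, not VW source.
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """s_list, y_list: the last m steps and gradient differences (newest last)."""
    q = np.array(grad, dtype=float)
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # first loop: newest -> oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
    if s_list:                                             # initial scaling of H_0
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):  # second loop
        rho = 1.0 / (y @ s)
        beta = rho * (y @ q)
        q += (alpha - beta) * s
    return -q                                              # search direction ≈ -B ∇J
```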