Paper Report for SDM course in 2016
Ad Click Prediction: a View from the Trenches
(Online Machine Learning)
Presenters: 蔡宗倫, 洪紹嚴, 蔡佳盈
Date: 2016/12/22
https://aci.info/2014/07/12/the-data-explosion-in-2014-minute-by-minute-infographic/
Reading a 2 GB file (4 million records, 200 variables):

READ DATA        | Time          | Memory
read.csv         | 264.5 (secs)  | 8.73 (GB)
fread            | 33.18 (secs)  | 2.98 (GB)
read.big.matrix  | 205.03 (secs) | 0.2 (MB)

Fitting a linear model (lm) after each read method (X: not completed):

lm               | Time          | Memory
read.csv         | X             | X
fread            | X             | X
read.big.matrix  | 2.72 (mins)   | 83.6 (MB)
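As a rough illustration (not the presenters' script), the three readers compared above can be invoked as below in R; the file name train.csv and the backing-file options are placeholders.

```r
# Compare three ways to read a large CSV in R; timings depend on the machine.
library(data.table)   # provides fread()
library(bigmemory)    # provides read.big.matrix()

t_csv <- system.time(d_csv <- read.csv("train.csv"))      # base R, everything held in RAM
t_dt  <- system.time(d_dt  <- fread("train.csv"))         # data.table, multi-threaded parser
t_bm  <- system.time(d_bm  <- read.big.matrix(
           "train.csv", header = TRUE, type = "double",
           backingfile = "train.bin",                      # file-backed matrix,
           descriptorfile = "train.desc"))                 # so almost nothing stays in RAM

# A linear model can then be fit on the big.matrix out of core,
# e.g. with biganalytics::biglm.big.matrix().
```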
Problem: training a model on Big Data (TB, PB, ZB) runs into memory and time/accuracy limits.
Solutions:
• Parallel computation: Hadoop, MapReduce, Spark (TB, PB, ZB)
• R packages: read.table, bigmemory, ff (GB)
• Online learning algorithms
Online learning algorithms (applied to logistic regression):
• AOGD: Adaptive Online Gradient Descent (2007, IBM)
• TG: Truncated Gradient (2009, Microsoft)
• FOBOS: Forward-Backward Splitting (2009, Google)
• RDA: Regularized Dual Averaging (2010, Microsoft)
• FTRL-Proximal: Follow-the-Regularized-Leader Proximal (2011, Google)
The online setting: train the model on Big Data (TB, PB, ZB), then renew its weights as new data arrives.
Problem: memory and time/accuracy.
Two ideas from the diagram: sparsity (LASSO) and SGD/OGD (as used for NN/GBM).
The same family of online learning algorithms (AOGD, TG, FOBOS, RDA, FTRL-Proximal) is then combined with logistic regression; the "+ =" in the original diagram marks this combination.
Online Gradient Descent (OGD)
A class of algorithms used in online convex optimization.
• The setting can be formulated as a repeated game between a player and an adversary.
• At round $t$, the player chooses an action $x_t$ from some convex subset $K$ of $\mathbb{R}^n$, and then the adversary chooses a convex loss function $f_t$.
• The regret after $T$ rounds is
$$\mathcal{R}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in K} \sum_{t=1}^{T} f_t(x),$$
where $x$ is any fixed action.
A central question is how the regret grows with the number of rounds of the game.
Zinkevich considered the following gradient descent algorithm, with step size $\eta_t = \Theta(1/\sqrt{t})$:
1: Initialize $x_1$ arbitrarily.
2: for $t = 1$ to $T$ do
3:   Predict $x_t$, observe $f_t$.
4:   Update $x_{t+1} = \Pi_K\big(x_t - \eta_{t+1} \nabla f_t(x_t)\big)$.
5: end for
Here, $\Pi_K(v)$ denotes the Euclidean projection of $v$ onto the convex set $K$. A small sketch of this loop follows.
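A minimal R sketch of projected OGD (mine, not from the slides), assuming K is the unit L2 ball and the simple quadratic loss f_t(x) = ½‖x − z_t‖², so the gradient at x_t is x_t − z_t:

```r
# Projected online gradient descent with step size eta_t = eta0 / sqrt(t).
project_K <- function(v) {              # Euclidean projection onto the unit L2 ball
  nv <- sqrt(sum(v^2))
  if (nv > 1) v / nv else v
}

ogd <- function(zs, eta0 = 1) {         # zs: one row per round (the z_t's)
  x <- rep(0, ncol(zs))                 # step 1: initialize x_1 (here: the origin)
  for (t in seq_len(nrow(zs))) {        # step 2: for t = 1 to T
    grad <- x - zs[t, ]                 # step 3: observe f_t via its gradient at x_t
    x <- project_K(x - (eta0 / sqrt(t)) * grad)   # step 4: projected gradient step
  }
  x
}

# Example: ogd(matrix(rnorm(200), ncol = 2))
```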
Forward-Backward Splitting (FOBOS)
(1) Loss function of logistic regression:
$$l(W, X) = \sum_{t=1}^{n} \log\big(1 + e^{-y_t (W^{T} x_t)}\big)$$
Batch gradient descent update:
$$W_{t+1} = W_t - \eta\, \frac{\partial\, l(W_t, X)}{\partial W_t}$$
Online gradient descent update (one example at a time):
$$W_{t+1} = W_t - \eta\, g_t, \qquad g_t = \nabla l(W_t, x_t)$$
(2) The FOBOS update can be split into two parts (reconstructed below):
• First part: the fine-tuning happens around the result of the gradient step, $W_{t+\frac{1}{2}}$.
• Second part: handles the regularization $r(w) = \lambda\lVert w\rVert_1$ and produces sparsity.
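The update itself appears only as an image in the slides; reconstructed from Duchi & Singer [2], the two parts are

$$W_{t+\frac{1}{2}} = W_t - \eta_t\, g_t$$
$$W_{t+1} = \arg\min_{w}\Big(\tfrac{1}{2}\lVert w - W_{t+\frac{1}{2}}\rVert_2^2 + \eta_{t+\frac{1}{2}}\, r(w)\Big)$$

With $r(w) = \lambda\lVert w\rVert_1$, the second step has a closed form (coordinate-wise soft-thresholding), which is what produces the sparsity.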
(3) A sufficient condition for the minimizer of (2): 0 belongs to its subgradient set.
(4) Using this (the intermediate equation is shown only as an image in the slides), (3) can be rewritten in terms of $W_t$, the gradient $g_t$, and $\partial r(W_{t+1})$.
(5) In other words, rearranging (4), the new iterate is built from:
• the pre-iteration state $W_t$ and the gradient (the forward, explicit part), and
• the regularization information of the current iteration, $\partial r(W_{t+1})$ (the backward, implicit part).
A code sketch of one such step follows.
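A hedged R sketch of a single FOBOS step for the logistic loss with r(w) = λ‖w‖₁; the soft-thresholding below is the closed form of the backward step, and the variable names are mine, not the presenters':

```r
# One FOBOS step: forward gradient step, then backward (proximal) soft-thresholding.
soft_threshold <- function(v, tau) sign(v) * pmax(abs(v) - tau, 0)

fobos_step <- function(w, x, y, eta, lambda) {   # y is coded as -1 / +1
  g      <- -y * x / (1 + exp(y * sum(w * x)))   # gradient of log(1 + exp(-y * w'x))
  w_half <- w - eta * g                          # forward: plain gradient step -> W_{t+1/2}
  soft_threshold(w_half, eta * lambda)           # backward: L1 proximal step -> W_{t+1}
}

# Example: fobos_step(w = rep(0, 3), x = c(1, 0, 2), y = 1, eta = 0.1, lambda = 0.05)
```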
FOBOS, RDA, FTRL-Proximal
(A): the accumulated past gradients
(B): the regularization functions
(C): the proximal term, where $Q_s$ is tied to the learning rate (it guarantees the fine-tuning does not move too far from 0 or from the previously iterated solutions)
$\Psi(x) = \lambda\lVert x\rVert_1$ (a non-smooth convex function)
$\Phi_t$: a certain subgradient of $\Psi(x)$
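The labeled equation is only shown as an image in the slides; a reconstruction of the FTRL-Proximal update in the notation of [4] and [5], with the three labeled pieces (the slides write $Q_s$ for what is $\sigma_s$ here):

$$w_{t+1} = \arg\min_{w}\Big(\underbrace{g_{1:t}\cdot w}_{(A)} + \underbrace{\lambda_1\lVert w\rVert_1}_{(B)} + \underbrace{\tfrac{1}{2}\sum_{s=1}^{t}\sigma_s\lVert w - w_s\rVert_2^2}_{(C)}\Big), \qquad g_{1:t}=\sum_{s=1}^{t} g_s,\quad \sigma_{1:t}=\tfrac{1}{\eta_t}.$$

The three methods instantiate this template differently; in particular, RDA centers the proximal term at the origin, while FTRL-Proximal centers it at the past iterates $w_s$.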
• OGD is not sparse enough; FOBOS produces better (sparser) features.
• Gradient-descent-style methods have better accuracy; RDA strikes a better balance between accuracy and sparsity, and its sparsity is even better.
• The key difference is how the accumulated L1 penalty is handled.
• FTRL-Proximal combines the accuracy of FOBOS with the sparsity of RDA.
Per-Coordinate
The same model is retrained as data arrives; the row of numbers under each equation is the slide's annotation for the individual coordinates.
• f(x) = 0.5A + 1.1B + 3.8C + 0.1D + 11E + 41F   (1 2 3 4)
• f(x) = 0.4A + 0.8B + 3.8C + 0.8D + 0E + 41F   (1 2 3 4 / 8 5 7 3)
• f(x) = 0.4A + 1.2B + 3.5C + 0.9D + 0.3E + 41F   (1 2 3 4 / 8 5 7 3)
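The point of the example is that different coordinates (features) see very different amounts of data, so a single global learning rate is wasteful. The paper [5] therefore gives each coordinate its own rate:

$$\eta_{t,i} = \frac{\alpha}{\beta + \sqrt{\sum_{s=1}^{t} g_{s,i}^{2}}}$$

where $g_{s,i}$ is coordinate $i$ of the gradient at round $s$ and $\alpha$, $\beta$ are tuning parameters; rarely updated coordinates keep a large step size, frequently updated ones get a small step size.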
Putting it together: train a logistic regression model on Big Data (TB, PB, ZB), and as new data arrives, renew the weights per-coordinate; memory and time/accuracy are handled with sparsity (LASSO) and SGD/OGD (as for NN/GBM). The update rule is FTRL-Proximal (2011, Google), built on FOBOS (2009, Google) and RDA (2010, Microsoft); a code sketch follows.
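A compact R sketch of that combination: per-coordinate FTRL-Proximal for logistic regression, following Algorithm 1 of [5]. The hyperparameter defaults and variable names are illustrative, not taken from the slides.

```r
# Per-coordinate FTRL-Proximal (lazy weights, per-coordinate learning rates).
ftrl_new <- function(d, alpha = 0.1, beta = 1, l1 = 1, l2 = 1) {
  list(z = rep(0, d), n = rep(0, d), alpha = alpha, beta = beta, l1 = l1, l2 = l2)
}

ftrl_weights <- function(m) {                      # recover w from the state (z, n)
  w <- rep(0, length(m$z))
  act <- abs(m$z) > m$l1                           # coordinates the L1 penalty did not zero out
  w[act] <- -(m$z[act] - sign(m$z[act]) * m$l1) /
             ((m$beta + sqrt(m$n[act])) / m$alpha + m$l2)
  w
}

ftrl_update <- function(m, x, y) {                 # x: feature vector, y in {0, 1}
  w <- ftrl_weights(m)
  p <- 1 / (1 + exp(-sum(w * x)))                  # predicted click probability
  g <- (p - y) * x                                 # gradient of the logistic loss
  sigma <- (sqrt(m$n + g^2) - sqrt(m$n)) / m$alpha # per-coordinate rate adjustment
  m$z <- m$z + g - sigma * w
  m$n <- m$n + g^2
  m
}

# Example: m <- ftrl_new(3); m <- ftrl_update(m, c(1, 0, 1), 1); ftrl_weights(m)
```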
R package: FTRLProximal
https://www.kaggle.com/c/avazu-ctr-prediction
Prediction result (data set size: 5.87 GB)
References
[1] John Langford, Lihong Li & Tong Zhang. Sparse Online Learning via Truncated Gradient. Journal of Machine Learning Research, 2009.
[2] John Duchi & Yoram Singer. Efficient Online and Batch Learning using Forward Backward Splitting. Journal of Machine Learning Research, 2009.
[3] Lin Xiao. Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization. Journal of Machine Learning Research, 2010.
[4] H. B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. In AISTATS, 2011.
[5] H. Brendan McMahan, Gary Holt, D. Sculley et al. Ad Click Prediction: a View from the Trenches. In KDD, 2013.
[6] Peter Bartlett, Elad Hazan, and Alexander Rakhlin. Adaptive online gradient descent. Technical Report UCB/EECS-2007-82, EECS Department, University of California, Berkeley, Jun 2007.
[7] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928–936, 2003.
Editor's Notes
• #13: 1. The player aims to ensure that the total loss is not much larger than the smallest total loss of any fixed action x. 2. The difference between the total loss and its optimal value for a fixed action is known as the "regret". 3. Many problems of online prediction of individual sequences can be viewed as special cases of online convex optimization, including prediction with expert advice, sequential probability assignment, and sequential investment.
• #14: 1. $\Pi_K(v)$ is the point that achieves the smallest Euclidean distance from $v$ to the set $K$.