Modern Classification Techniques
Mark Landry
Austin Machine Learning Meetup
1/19/2015
Overview
• Problem & Data
– Click-through rate prediction for online auctions
– 40 million rows
– Sparse data characteristics
– Down-sampled
• Methods
– Logistic regression
– Sparse feature handling
– Hash trick
– Online learning
– Online gradient descent
– Adaptive learning rate
– Regularization (L1 & L2)
• Solution characteristics
– Fast: 20 minutes
– Efficient: ~4GB RAM
– Robust: Easy to extend
– Accurate: competitive with factorization machines, particularly when extended to key interactions
Two Data Sets
• Primary use case: click logs
– 40 million rows
– 20 columns
– Values are stored in dense fashion, but the feature space is sparse
• For highly informative feature types (URL/site), 70% of features have 3 or fewer instances
– Note: negatives have been down-sampled
• Extended to separate use case: clinical + genomic
– 4k rows
– 1300 columns
– Mix of dense and sparse features
Methods and objectives
• Logistic regression: accuracy/base algorithm
• Stochastic gradient descent: optimization
• Adaptive learning rate: accuracy, speed
• Regularization (L1 & L2): generalized solution
• Online learning: speed
• Sparse feature handling: memory efficiency
• Hash trick: memory efficiency, robustness
Implementation Infrastructure
• From scratch: no machine learning libraries
• Maintain vectors for
– Features (1/0)
– Weights
– Feature Counts
• Each vector will use the same index scheme
• The hash trick means we can immediately find the index of any feature, and we bound the vector size (more later; a sketch follows below)
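As a rough illustration, a minimal sketch of those parallel vectors and the shared index scheme (assumed names, not the talk's actual code):

D = 2 ** 20        # bound on the index space (see the hash trick slide)
w = [0.0] * D      # weights, one slot per hashed feature index
n = [0.0] * D      # counts of how often each feature index has been seen

def feature_indices(row, D=D):
    # map a {column_name: value} record to hashed indices; the 1/0 feature
    # vector is implicit: an index counts as 1 if it appears in this list
    return [abs(hash(name + '_' + str(value))) % D for name, value in row.items()]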
Logistic Regression
• Natural fit for probability problems (0/1)
– 1 / (1 + exp(-sum(weight*feature)))
– Solves based on log odds
– Better calibrated than many other algorithms (particularly decision trees), which is useful for the real-time bidding problem
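A minimal sketch of that prediction in Python (the bounding of the dot product is an added safeguard against overflow, not from the slides):

from math import exp

def predict(weights, features):
    # logistic prediction: 1 / (1 + exp(-w.x)), with 1/0 feature values
    wTx = sum(w_i * x_i for w_i, x_i in zip(weights, features))
    wTx = max(min(wTx, 35.0), -35.0)   # keep exp() from overflowing
    return 1.0 / (1.0 + exp(-wTx))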
Sparse Features
• Every value receives its own column, indicating its presence/absence (1/0)
• So 1 / (1 + exp(-sum(weight*feature))) resolves to 1 / (1 + exp(-sum(weight))) over only the features present in each instance
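Sketched in Python, the prediction then only needs the indices of the features that are present (hypothetical helper, matching the parallel vectors above):

from math import exp

def predict_sparse(w, present_indices):
    # with 1/0 features, sum(weight*feature) reduces to summing the weights
    # of the features present in this instance
    wTx = sum(w[i] for i in present_indices)
    wTx = max(min(wTx, 35.0), -35.0)   # keep exp() from overflowing
    return 1.0 / (1.0 + exp(-wTx))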
Hash Trick
• The hash trick allows quick access into parallel arrays that hold key information for your model
• Example: use the native Python hash('string') to cast a string into a large integer
• Bound the parameter space by using modulo
– E.g. abs(hash('string')) % (2 ** 20)
– The modulus (2 ** 20 here) is a parameter; set it as large as your system can handle
– Why set it larger? Fewer hash collisions
– Keep features separate: abs(hash(feature_name + 'string')) % (2 ** 20)
• Any hash function can have collisions. The particular function used here is fast, but much more likely to encounter a collision than MurmurHash or something more elaborate.
• So a speed/accuracy tradeoff dictates which function to use. The larger the bit space (modulus), the fewer the hash collisions.
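Putting that together, a minimal sketch of the hashed lookup (note: Python 3 randomizes the built-in string hash per process, so set PYTHONHASHSEED if you need reproducible indices across runs):

D = 2 ** 20   # modulus, i.e. the size of the bounded index space

def hashed_index(feature_name, value, D=D):
    # prefixing with the feature name keeps identical values from different
    # columns apart by construction; chance collisions remain possible
    return abs(hash(feature_name + '_' + str(value))) % D

# usage (hypothetical feature): w[hashed_index('site', 'example.com')]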
Online Learning
• Learn one record at a time
– A prediction is always available at any point, and it is the best possible given the data the algorithm has seen
– No need to retrain to take in more data
• Though you may still want to
• Depending on the learning rate used, you may want to iterate through the data set more than once
• Fast: VW approaches the speed of the network interface
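A minimal sketch of the online loop, assuming a CSV input with a label column named 'click' (an assumed name) and reusing the feature_indices and predict_sparse helpers sketched above:

import csv

def stream_learn(path, w, n, learn_one):
    # one record at a time: predict first, then update, so a prediction (and
    # an out-of-sample error estimate) is available at every point
    with open(path) as f:
        for row in csv.DictReader(f):
            y = float(row.pop('click'))       # assumed label column name
            indices = feature_indices(row)    # hashed 1/0 features
            p = predict_sparse(w, indices)
            learn_one(w, n, indices, p, y)    # e.g. the update sketched later
    return w, n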
OGD/SGD: online gradient descent
Gradient descent
Optimization algorithms are required to minimize the loss in logistic regression.
Gradient descent and its many variants are a popular choice, especially with large-scale data.
Visualization (in R)
library(animation)
par(mar = c(4, 4, 2, 0.1))
grad.desc()
ani.options(nmax = 50)
par(mar = c(4, 4, 2, 0.1))
f2 = function(x, y) sin(1/2 * x^2 - 1/4 * y^2 + 3) * cos(2 * x + 1 - exp(y))
grad.desc(f2, c(-2, -2, 2, 2), c(-1, 0.5), gamma = 0.3, tol = 1e-04)
ani.options(nmax = 70)
par(mar = c(4, 4, 2, 0.1))
f2 = function(x, y) sin(1/2 * x^2 - 1/4 * y^2 + 3) * cos(2 * x + 1 - exp(y))
grad.desc(f2, c(-2, -2, 2, 2), c(-1, 0.5), gamma = 0.1, tol = 1e-04)
# interesting comparison: https://imgur.com/a/Hqolp
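For comparison, a minimal Python sketch of the same descent loop on f2, using a fixed step size (gamma) and a finite-difference gradient (not from the slides):

from math import sin, cos, exp

def f2(x, y):
    return sin(0.5 * x**2 - 0.25 * y**2 + 3) * cos(2 * x + 1 - exp(y))

def grad_desc(f, x, y, gamma=0.1, tol=1e-4, nmax=70, h=1e-6):
    for _ in range(nmax):
        gx = (f(x + h, y) - f(x - h, y)) / (2 * h)   # numerical partials
        gy = (f(x, y + h) - f(x, y - h)) / (2 * h)
        x, y = x - gamma * gx, y - gamma * gy        # step against the gradient
        if (gx * gx + gy * gy) ** 0.5 < tol:         # stop when the gradient is small
            break
    return x, y, f(x, y)

print(grad_desc(f2, -1.0, 0.5))   # same start point as the R call above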
Other common optimization algorithms
• ADAGRAD
– Still slightly sensitive to the choice of η (the global learning rate)
• ADADELTA
• Newton's Method
• Quasi-Newton
• Momentum
Adaptive learning rate
• A difficulty in using SGD is finding a good learning rate
• An adaptive learning rate adjusts the step size automatically as training proceeds
– ADAGRAD is one adaptive method
• Simple learning rate in example code
– alpha / (sqrt(n) + 1)
• Where n is the number of times a specific feature has been encountered
– w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.)
• The full weight update scales the change by the learning rate of that specific feature (sketched below)
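Spelled out as a sketch, using the parallel w/n vectors from the infrastructure slide (alpha = 0.1 is just an assumed default):

from math import sqrt

def learn_one(w, n, indices, p, y, alpha=0.1):
    # shrink the step for features that have already been seen many times
    for i in indices:                  # only the features present (value 1)
        w[i] -= (p - y) * alpha / (sqrt(n[i]) + 1.0)
        n[i] += 1.0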
Regularization (L1 & L2)
• Regularization attempts to ensure the robustness of a solution
• Enforces a penalty term on the coefficients of a model, guiding toward a simpler solution
• L1: drives parameter values to exactly 0
• L2: guides parameters to be close to 0, but not exactly 0
• In practice, these ensure large coefficients are not applied to rare features (see the sketch below)
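The talk's latest code handles L1/L2 via FTRL-proximal; purely as an illustration of the idea, a plain SGD update with explicit penalty terms might look like this (the L1/L2 strengths are assumed values):

from math import sqrt

def learn_one_regularized(w, n, indices, p, y, alpha=0.1, L1=1.0, L2=1.0):
    # loss gradient (p - y) plus an L1 subgradient and an L2 term; this is
    # an illustration only, not the FTRL-proximal update used in the talk's code
    for i in indices:
        sign = 1.0 if w[i] >= 0 else -1.0
        g = (p - y) + L1 * sign + L2 * w[i]
        w[i] -= g * alpha / (sqrt(n[i]) + 1.0)
        n[i] += 1.0

Note that this naive subgradient form rarely lands a weight exactly at zero, which is one motivation for FTRL-proximal.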
Related Tools
• Vowpal Wabbit
– Implements all of these features, plus far more
– Command line tool
– SVMlight-like data format
– Source code available on GitHub with a fairly open license
• Straight Python implementation (see code references slide)
• glmnet, for R: L1/L2 regression, sparse
• scikit-learn, Python ML library: ridge, elastic net (L1+L2), SGD (can specify logistic regression)
• H2O, Java tool; many techniques used, particularly in deep learning
• Many of these techniques are used in neural networks, particularly deep learning
Code References
• Introductory version: online logistic regression, hash trick, adaptive learning rate
– Kaggle forum post
• Data set is available on that competition’s data page
• But you can easily adapt the code to work for your data set by changing the train and test file names (lines 25-26) and the names of the id and output columns (104-107, 129-130)
– Direct link to python code from forum post
– Github version of the same python code
• Latest version: adds FTRL-proximal (including SGD, L1/L2 regularization), epochs, and automatic interaction handling
– Kaggle forum post
– Direct link to python code from forum post (version 3)
– Github version of the same python code
Additional References
• Overall process
– Google paper, FTRL proximal and practical observations
– Facebook paper, includes logistic regression and trees, feature handling, down-sampling
• Follow The Regularized Leader Proximal (Google)
• Optimization
– Stochastic gradient descent: examples and guidance (Microsoft)
– ADADELTA and discussion of additional optimization algorithms (Google/NYU intern)
– Comparison Visualization
• Hash trick:
– The Wikipedia page offers a decent introduction
– general description and list of references, from VW author