The Impact of Big Data on
Classic Machine Learning
Algorithms
Thomas Jensen, Senior Business Analyst @ Expedia
Who am I?
• Senior Business Analyst @ Expedia
• Working within the competitive
intelligence unit
• Responsible for:
• Algorithms that score new hotels
• Algorithms that predict room nights sold on existing Expedia hotels
• Scraping competitor sites
• Other stuff….
The Promise of Big Data
Real-time data
Data-driven decisions
More accurate and robust models
Granularity
Big Data Challenges
• Data processing: not going to talk about this.
• Speed at which to use data: how fast should we update algorithms?
• How do we train algorithms on data sets that do not fit into memory?
Big Data Challenges
Figure: the data science Venn diagram. Taken from: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Classification - Logistic Regression
• One classic task in machine learning / statistics is to classify some
objects/events/decisions correctly
• Examples are:
• Customer churn
• Click behavior
• Purchase behavior
• ….
• One of the most popular algorithms to carry out these tasks is logistic
regression
What is logistic regression?
• Logistic regression attaches probabilities to individual outcomes,
showing how likely they are to belong to one class or the other
• $\Pr(y = 1 \mid x) = \frac{1}{1 + e^{-x\beta}}$
• The challenge is to choose the
optimal beta(s)
• To do that we minimize a cost
function
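To make the formula and the cost function concrete, here is a minimal sketch in plain Python. It assumes the cost function is the average log loss (negative log-likelihood), which the slide does not name explicitly; the function names and the toy data are made up for illustration.

```python
import math

def predict_proba(x, beta):
    """Pr(y = 1 | x) = 1 / (1 + exp(-x . beta))."""
    z = sum(xj * bj for xj, bj in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(X, y, beta, eps=1e-12):
    """Average negative log-likelihood over all rows (assumed cost function)."""
    total = 0.0
    for xi, yi in zip(X, y):
        # Clip the probability so log() never sees exactly 0 or 1.
        p = min(max(predict_proba(xi, beta), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)

# Toy data: the first column is a constant 1 acting as the intercept.
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]]
y = [1, 0, 1, 0]

print(predict_proba([1.0, 0.0], [0.0, 0.0]))  # probability when x . beta = 0
print(log_loss(X, y, [0.0, 0.0]))             # cost with all-zero betas
print(log_loss(X, y, [0.0, 2.0]))             # cost with betas that fit the toy data better
```

Choosing the optimal betas then means searching for the vector that makes `log_loss` as small as possible.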
Why Use Logistic Regression?
• It is a simple and well-understood algorithm
• Outputs probabilities
• There are tried and tested methods to estimate the parameters
• It is flexible – can handle a number of different inputs, and feature
transformations
Usual Approaches
• Batch training (offline approach)
• Get all the data and train the algorithm in one go
• Disadvantages when data is big
• Requires all data to be loaded into memory
• Periodic retraining is necessary
• Very time consuming with big data!
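A rough sketch of what batch training looks like in code, assuming NumPy and plain gradient descent. The deck does not say which solver the batch version uses, so gradient descent is a stand-in chosen for brevity, and `train_batch` and the synthetic data are hypothetical.

```python
import numpy as np

def train_batch(X, y, alpha=0.5, n_iter=200):
    """Full-batch gradient descent for logistic regression.

    Every iteration touches the whole matrix X, which is why
    all rows have to be in memory at once.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))  # predictions for all rows
        grad = X.T @ (p - y) / len(y)         # average gradient of the log loss
        beta -= alpha * grad
    return beta

# Synthetic data: an intercept column plus two random features.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
true_beta = np.array([0.0, 2.0, -1.0])
y = (X @ true_beta > 0).astype(float)

beta = train_batch(X, y)
accuracy = np.mean(((X @ beta) > 0) == (y == 1))
print(beta, accuracy)
```

Each of the `n_iter` passes multiplies the full data matrix, so both memory use and run time grow with the number of rows.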
Batch Training
Examples of Logistic Regression in Industry
Settings – Real Time Bidding
• RTB algorithms are usually based on logistic regression
• Whether or not to bid on a user is determined by the probability that the user will click on an ad
• Each day billions of bids are
processed
• Each bid has to be processed
within 80 milliseconds
Examples of Logistic Regression in Industry
Settings – Fraud Detection
Detecting Fraudulent Credit Card
Transactions
• The probability that a transaction
is using a stolen credit card is
typically estimated with logistic
regression
• Billions of transactions are
analyzed each day
How Slow is the Batch Version of Logistic
Regression?
One target variable and two feature vectors.
All randomly generated.
A Real World Problem
• Some stats on the training job in the pipeline:
• Runs training jobs on a per country basis
• Longest running job lasts ~9 hours
• Shortest running job lasts ~3 hours
• There are often convergence failures
• What we need is an algorithm that:
• Can reduce training time
• Is robust to convergence failures
A Big Data Friendly Approach
Online Training
• Pass each data point sequentially through the algorithm
• Only requires one data point at a time in memory
• Allows for on-the-fly training of the algorithm
Online Learning
• We want to learn a vector of
weights
• Initialize all weights. Begin loop:
1. Get training example
2. Make a prediction for the target
variable
3. Learn the true value of the
target
4. Update the weights and go to 1
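The four steps above can be sketched in plain Python as follows. This assumes a logistic regression model with a log-loss gradient update; the `stream` generator is a hypothetical stand-in for a real data feed, and the step size `alpha=0.1` is an arbitrary choice.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stream(n, seed=0):
    """Hypothetical data source: yields one (features, label) pair at a time."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [1.0, rng.gauss(0, 1), rng.gauss(0, 1)]  # leading 1.0 is the intercept
        y = 1 if 2.0 * x[1] - x[2] > 0 else 0
        yield x, y

def train_online(examples, n_features, alpha=0.1):
    weights = [0.0] * n_features                  # initialize all weights
    mistakes = 0
    seen = 0
    for x, y in examples:                         # 1. get training example
        z = sum(w * xj for w, xj in zip(weights, x))
        p = sigmoid(z)                            # 2. make a prediction
        mistakes += int((p >= 0.5) != (y == 1))   # 3. learn the true value
        seen += 1
        for j in range(n_features):               # 4. update the weights
            weights[j] -= alpha * (p - y) * x[j]
    return weights, mistakes / seen

weights, error_rate = train_online(stream(5000), n_features=3)
print(weights, error_rate)
```

Because each prediction is made before the label is used for the update, `error_rate` is a running estimate of out-of-sample error, and only one example is held in memory at any time.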
Online Learning
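• Initialize all weights. Begin loop:
Repeat {
  For i = 1 to m {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} \mathrm{cost}(\theta, (x_i, y_i))$  (for every j)
  }
}
• $\frac{\partial}{\partial \theta_j} \mathrm{cost}(\theta, (x_i, y_i))$: the partial derivative of the cost function
• $\mathrm{cost}(\theta, (x_i, y_i))$: the cost function given theta and row i, i.e. how wrong are we?
• $\alpha$: the step size, i.e. how fast we should descend the gradient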
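For logistic regression the partial derivative in the update rule has a simple closed form. The derivation below is a sketch that assumes the per-row cost is the log loss and that $\theta$ plays the role of the betas in the logistic formula; $h_\theta$ and $x_{ij}$ (feature j of row i) are notation introduced here, not taken from the slides.

```latex
\begin{align*}
h_\theta(x_i) &= \frac{1}{1 + e^{-x_i \theta}} \\
\mathrm{cost}(\theta, (x_i, y_i)) &= -\,y_i \log h_\theta(x_i) - (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr) \\
\frac{\partial}{\partial \theta_j} \mathrm{cost}(\theta, (x_i, y_i)) &= \bigl(h_\theta(x_i) - y_i\bigr)\, x_{ij} \\
\theta_j &:= \theta_j - \alpha \bigl(h_\theta(x_i) - y_i\bigr)\, x_{ij}
\end{align*}
```

As a worked step: with $\theta = (0, 0)$, $x_i = (1, 2)$, $y_i = 1$ and $\alpha = 0.1$, the prediction is $h_\theta(x_i) = 0.5$, the gradient is $(0.5 - 1) \cdot (1, 2) = (-0.5, -1)$, and the updated weights are $\theta = (0.05, 0.1)$.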
Online Learning
• Approaches the minimum of the cost function in a jumpy manner and
never actually settles on the minimum.
Batch vs. Online Learning
Data
Size: 4.8GB
Rows: 500,000
Columns: 5000
[Bar chart: training time for Batch, SGDClassifier and Sofia-ml; axis runs from 0 to 120, unit not stated]
*Times include reading data and training algorithm
Online Learning Vs. Batch
Online Learning
• When we have a continuous
stream of data
• When it is important to update
the algorithm in real time – can
hit a moving target
• When training speed is
important
• Parameters are “jumpy” around
the optimal values
Batch
• When it is very important to get
the exact optimal values
• When data can fit in memory
• When training time is not of the
essence
Popular Online Learning Libraries
• Sofia-ml (C/C++)
• Requires data in SVMlight format
• Has implementations of SVM, neural networks and logistic regression
• Supports classification and ranking
• Vowpal Wabbit (C/C++)
• Requires data in its own VW format
• Has implementations of the most popular loss functions
• Supports classification, ranking and regression
• Pandas + scikit-learn (Python)
• Pandas has a nice function for reading files in batches (read_csv with chunksize)
• Can handle sparse and non-sparse matrices
• scikit-learn has an SGD classifier that can fit the model in batches (partial_fit)
• Supports classification, ranking and regression
Thomas Jensen. Machine Learning
