Real-Time Machine Learning at
             Industrial scale
                          ... the battle of accuracy vs latency

                                        tumra.com
                                         @tumra
                                                                9th October 2012
TUMRA LTD, Building 3, Chiswick Park,
566 Chiswick High Road, W4 5YA                              Michael Cutler @cotdp
$ whoami
Michael Cutler (@cotdp)
●   Previously at British Sky Broadcasting
    ○   Last 7 years in R&D
    ○   Created several patented systems & algorithms
    ○   Kicked off ‘Big Data’ initiative at Sky in 2008

●   Co-founder & CTO @ TUMRA since March '12
    ○   Real-time big data science platform
    ○   Alpha-testing with selected clients
Agenda
●   Background
●   Real-Time vs Batch processing
●   Accuracy vs Latency
●   Use Cases
     ○ eCommerce

     ○ Financial Services

     ○ Media

●   Questions
Background
Big Data is "in vogue", but what does it mean:
 ● Distributed processing

 ● Massively scalable

 ● Commodity



Apache Hadoop is the "kernel" of the Big Data OS:
 ● Distributed Filesystem (HDFS)

 ● Parallel Processing (Map/Reduce, YARN)
Background (cont'd)
Solving problems with Big Data is hard:
 ● Tools are all low-level (Pig, Hive etc.)

 ● Skills are hard to find



What is "Data Science":
● Understanding data & solving problems

● Applies the following skills:

   ○   Statistical Analysis
   ○   Machine Learning
   ○   Communicating Results
Real-Time vs
Batch processing
Batch - Hoppers, Bins, Buckets




 Credit: http://bit.ly/Q71u4W
Real-Time - Flows & Streams




                          Credit: http://bit.ly/NOslqf
Real-Time vs Batch processing
Similarities to the Industrial Revolution:
 ● From handicraft to Batch & Real-Time

 ● Complexity increases



Need for "Real-Time":
● Wherever the variation can change faster

  than you can retrain models
● When you can't pre-compute everything

  ahead of time
Accuracy vs Latency
Accuracy vs Latency
Netflix Prize winning entry :-
● Ensemble of hundreds of models

● Massively compute intensive solution

● Marginally better than much simpler models




IBM won the KDD Cup 2009 (Orange) :-
 ● IBM Watson team won by sheer brute force

 ● Used a "one of everything" approach

   generating hundreds of models
Accuracy vs Latency (cont'd)
Mathematical navel-gazing:
● Often the factor we're optimising for isn't

  the thing we measure improvement in:
   ○ User ratings vs. customer longevity/value

   ○ Overfitting outliers vs. missing clear Fraud




Given the choice between a "best guess" now,
and a "marginally better" answer later, I'd take
the "best guess" every time.
However, that doesn't mean...
Accuracy vs Latency (cont'd)
It's a trade-off:
 ● Sometimes "best guess" is good enough,

 ● Other times we can wait for the accuracy,

 ● And of course, occasionally we want both!



Key objective:
 ● Most appropriate solution for the use-case

 ● Hybrid solutions part batch, part real-time
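
A minimal sketch of the "part batch, part real-time" idea, assuming a simple event counter: a periodic batch job recomputes exact totals from full history, while a real-time layer only tracks what has arrived since. The class and method names (HybridCounter, on_event, load_batch) are hypothetical and are not taken from the talk.

```python
from collections import Counter


class HybridCounter:
    """Serve answers as: batch result + real-time delta since the last batch run."""

    def __init__(self):
        self.batch_view = Counter()      # recomputed periodically from full history
        self.realtime_delta = Counter()  # events seen since the last batch run

    def on_event(self, key):
        # Real-time path: cheap incremental update, available immediately.
        self.realtime_delta[key] += 1

    def query(self, key):
        # Low-latency read: merge the two views at query time.
        return self.batch_view[key] + self.realtime_delta[key]

    def load_batch(self, full_counts):
        # Batch path: assumes the batch job has covered every event seen so far,
        # so the real-time delta can be discarded.
        self.batch_view = Counter(full_counts)
        self.realtime_delta.clear()


views = HybridCounter()
for product in ["cat-food", "cat-food", "dog-toy"]:
    views.on_event(product)
before_batch = views.query("cat-food")

views.load_batch({"cat-food": 2, "dog-toy": 1})
views.on_event("cat-food")
after_batch = views.query("cat-food")
```

The read path never waits for the batch job; the batch job only has to be correct, not fast.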
Use Case
eCommerce
Use Case - eCommerce
Objective - Increase profits
How:
●   Match potential customers to the right products
●   Personalise user experience on web & email
●   Customer lifecycle management

Method:
●   Ensemble of real-time models
●   Collect lots of implicit feedback data
Use Case - eCommerce (cont'd)
Detail:
●   Clustering - behavior, demographics
●   Simple predictors - keywords to products
●   Bayesian Bandit - blend the output

Requirements:
●   Predictions in < 50 ms
●   Online learning models
●   Occasional batch updates are OK
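
One common reading of "Bayesian Bandit" is Thompson sampling with a Beta posterior per arm. The sketch below assumes that reading, treats each underlying model as an arm, and uses made-up click rates to simulate feedback; it is an illustration, not the implementation described in the talk.

```python
import random


class BetaBandit:
    """Thompson-sampling bandit over a fixed set of arms (here: candidate models)."""

    def __init__(self, arms, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per arm, stored as [alpha, beta].
        self.params = {arm: [1, 1] for arm in arms}

    def choose(self):
        # Draw one plausible success rate per arm and serve the best draw.
        draws = {arm: self.rng.betavariate(a, b) for arm, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, arm, reward):
        # Online update from implicit feedback: 1 = click, 0 = no click.
        if reward:
            self.params[arm][0] += 1
        else:
            self.params[arm][1] += 1


# Hypothetical click rates, used only to simulate user feedback.
true_rates = {"clustering": 0.02, "keywords": 0.10}
bandit = BetaBandit(true_rates, seed=42)
sim = random.Random(7)
for _ in range(5000):
    arm = bandit.choose()
    bandit.update(arm, 1 if sim.random() < true_rates[arm] else 0)

# How often each model was served.
pulls = {arm: a + b - 2 for arm, (a, b) in bandit.params.items()}
```

Both choose and update are constant-time per arm, which is what makes this kind of blender compatible with a tight latency budget and online learning.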
Real-Time Machine Learning at Industrial scale (University of Oxford, 9th Oct 2012)
When eCommerce
    #FAILs
I've only ever bought Cat food...
... wait there's more, no Cat food
Even Amazon can #FAIL
Use Case
Financial Services
Use Case - Financial Services
Objective - Reduce Fraud
How:
●   Compute patterns/predictors for individuals
●   Cluster individuals and recompute for clusters
●   Compute baselines across all data

Method:
●   Hybrid and Hierarchical Clustering models
●   Simple predictors for individuals, clusters & baseline
Use Case - Financial Services
Detail:
●   CHEAT!!! ... Cluster to nearest centroid
     ○ will degrade over time (Hunchback Clusters)

●   Use simple metrics to alert (stddev)

Requirements:
●   Ability to alert/intervene in near real-time (< 1 second)
●   Adapt to rapid changes (within baseline & clusters)
●   Periodic batch processing to recompute clusters
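
The "cheat" above can be sketched as follows, assuming Euclidean distance to precomputed centroids and a running standard deviation per cluster (Welford's algorithm). The centroid values, the 3-sigma threshold, the 10-sample warm-up, and the choice not to learn from flagged transactions are all illustrative assumptions.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the closest centroid (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


class RunningStats:
    """Welford's online algorithm for mean and sample standard deviation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_anomalous(self, x, threshold=3.0, warmup=10):
        sd = self.stddev()
        return self.n >= warmup and sd > 0 and abs(x - self.mean) > threshold * sd


# Hypothetical centroids from the last batch run: (avg amount, transactions/day).
centroids = [(20.0, 1.0), (500.0, 5.0)]
stats = [RunningStats() for _ in centroids]


def score(profile, amount):
    """Assign to the nearest cluster, then flag amounts far from that cluster's mean."""
    c = nearest_centroid(profile, centroids)
    alert = stats[c].is_anomalous(amount)
    if not alert:
        stats[c].update(amount)  # only learn from transactions that were not flagged
    return c, alert


normal = [score((25.0, 2.0), a) for a in [18.0, 19.0, 20.0, 21.0, 22.0] * 4]
suspicious = score((25.0, 2.0), 900.0)
```

Because the centroids are frozen between batch runs, assignments drift as behaviour changes, which is why the periodic recompute is listed as a requirement.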
Use Case - Financial Services
Use Case
 Media
Use Case - Media
Objective - Generating Metadata
Why:
●   Drive second screen applications
●   Create new streams of information for resale

How:
●   Video / Audio analysis
●   Closed caption or subtitle text processing
●   Knowledgebase :- People, Places, Products & Things
Use Case - Media (cont'd)
Method:
●   Natural Language Processing
    ○   Named Entity Recognition
    ○   Topic Extraction & Disambiguation
●   Graph databases & algorithms

Requirements:
●   Responses in < 1 second
●   Ability to learn new 'Things'
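
As a rough illustration of linking subtitle text to a knowledgebase, the sketch below uses a greedy longest-match dictionary lookup. This is a stand-in for real Named Entity Recognition, which typically relies on statistical models and disambiguation; the knowledgebase entries, identifiers, and the learn_thing helper are hypothetical.

```python
import re

# Hypothetical knowledgebase: lower-cased surface form -> (entity id, type)
KNOWLEDGEBASE = {
    "london": ("kb:London", "Place"),
    "university of oxford": ("kb:University_of_Oxford", "Place"),
    "oxford": ("kb:Oxford", "Place"),
    "netflix": ("kb:Netflix", "Product"),
}


def learn_thing(name, entity_id, entity_type):
    """Add a new 'Thing' at runtime, with no retraining step."""
    KNOWLEDGEBASE[name.lower()] = (entity_id, entity_type)


def spot_entities(text):
    """Greedy longest-match lookup of knowledgebase names in a line of text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    max_words = max(len(name.split()) for name in KNOWLEDGEBASE)
    found, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, then shrink.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in KNOWLEDGEBASE:
                found.append(KNOWLEDGEBASE[candidate])
                i += n
                break
        else:
            i += 1  # no entity starts at this word
    return found


subtitle = "He left London for the University of Oxford."
entities = spot_entities(subtitle)

learn_thing("TUMRA", "kb:TUMRA", "Thing")
later = spot_entities("A talk by TUMRA")
```

Longest-match keeps "University of Oxford" from being reported as plain "Oxford"; it does nothing for genuinely ambiguous names, which is where the disambiguation step comes in.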
Example of 12,000 entities from our Knowledgebase...
Summary
Summary
Key points:
●   Clear move towards distributed algorithms
●   Low latency is often more valuable than accuracy
●   Trade-offs are dependent on the use-cases

Further reading:
●   Apache Mahout - http://mahout.apache.org/
●   Storm Project - http://storm-project.net/
●   Data Science London - http://datasciencelondon.org/
●   Machine Learning Meetup - http://bit.ly/w8V8f6
Almost finished!
Introducing TUMRA Labs
API access to some of our real-time models:
●   Probabilistic Demographics

Coming Soon:
●   Language detection
●   Sentiment analysis
●   Metadata Generation


      Free to sign up and easy to get started!
              http://labs.tumra.com/
Questions?
  Work          Personal
tumra.com      cotdp.com
 @tumra         @cotdp

