Big Data Analytics:
Production-Ready Flows &
Waze Use Cases
By Daniel Marcous
Google, Waze, Data Wizard
dmarcous@gmail/google.com
Rules
1. Interactive is interesting.
2. If you've got something to say, say it!
3. Be open-minded - I'm sure I've got something to learn from you, and I hope you've got something to learn from me.
What’s a Data Wizard you ask?
Gain Actionable Insights!
What’s here?
What’s here?
Methodology
Deploying big models to production - step by step
Pitfalls
What to look out for in both methodology and code
Use Cases
Showing off what we actually do in Waze Analytics
Based on tough lessons learned, and on recommendations and input from Google experts.
Why Big Data?
Google in just 1 minute:
● 1,000 new devices
● 3M searches
● 100 hours
● 1B activated devices
● 100M GB of search content
10+ Years of Tackling Big Data Problems
[Timeline, 2002-2015]
● Google papers: GFS, MapReduce, BigTable, Dremel, Flume Java, Millwheel, Pub/Sub
● Open source: Apache Beam, TensorFlow
● Google Cloud products: BigQuery, Pub/Sub, Dataflow, Bigtable
“Google is living a few years in the
future and sending the rest of us
messages”
Doug Cutting, Hadoop Co-Creator
Why Big ML?
Bigger is better
● More processing power
○ Grid-search all the parameters you ever wanted.
○ Cross-validate in parallel with no extra effort.
● Keep training until you hit 0
○ Some models cannot overfit when optimising until training error is 0.
■ RF - more trees
■ ANN - more iterations
● Handle BIG data
○ Tons of training data (if you have it) - no need for sampling on the wrong populations!
○ Millions of features? Easy… (text processing with TF-IDF)
○ Some models (ANN) can't do well without training on a lot of data.
Challenges
Bigger is harder
● Skill gap - big data engineer (Scala/Java) vs. researcher/PhD (Python/R)
● Curse of dimensionality
○ Some algorithms require exponential time/memory as dimensions grow
○ Harder, and more important, to tell what's gold and what's noise
○ Unbalanced data goes a long way with more records
● Big model != small model
○ Different parameter settings
○ Different metric readings
■ Different implementations (distributed vs. central memory)
■ Different programming languages (heuristics)
○ Different populations trained on (sampling)
Solution = Workflow
Measure first, optimize
second.
Before you start
● Create example input
○ Raw input
● Create example output
○ Featured input
○ Prediction rows
● Set up your metrics
○ Derived from business needs
○ Confusion matrix
○ Precision / recall
■ Per-class metrics
○ AUC
○ Coverage
■ Number of subjects affected
■ Sometimes measured as average precision per K random subjects
Remember : Desired short-term behaviour does not imply long-term behaviour.
[Slide diagram: Measure → Preprocess (parse, clean, join, etc.) → naive feature matrix]
Preprocess
● Naive feature matrix
○ Parse (Text -> RDD[Object] -> DataFrame)
○ Clean (remove outliers / bad records)
○ Join
○ Remove non-features
● Get real data
● Create a baseline dataset for training
○ Add some basic features
■ Day of week / hour / etc.
○ Write a READABLE CSV that you can start working with.
Preprocess
Case Class RDD to DataFrame
RDD[String] to Case Class RDD
String row to object
Preprocess
Parse string to Object with
java.sql types
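The code screenshots from these slides aren't in the transcript; here is a minimal Spark 1.6-era Scala sketch of the same steps (the Ride case class, paths and column layout are assumptions):

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// A case class describing one raw record
case class Ride(userId: Long, city: String, startTime: Timestamp, lengthKm: Double)

object Preprocess {
  // String row to object, parsing dates into java.sql types
  def parseRow(line: String): Ride = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    val Array(userId, city, start, length) = line.split(",")
    Ride(userId.toLong, city, new Timestamp(fmt.parse(start).getTime), length.toDouble)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("preprocess"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val raw = sc.textFile("/data/rides/*.csv") // RDD[String]
    val rides = raw.map(parseRow)              // RDD[String] to case class RDD
    val df = rides.toDF()                      // case class RDD to DataFrame
    df.write.parquet("/data/naive_feature_matrix")
  }
}
```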
Metric Generation
Craft useful metrics.
Per-class metrics
Confusion matrix by hand
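A sketch of these metrics over a predictions DataFrame, computed by hand (the label/prediction column names are assumptions):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Confusion matrix by hand: one row per true label, one column per prediction
def confusionMatrix(predictions: DataFrame): DataFrame =
  predictions.groupBy("label").pivot("prediction").count()

// Per-class recall from raw counts; precision is the same computation
// with groupBy("prediction") instead of groupBy("label")
def perClassRecall(predictions: DataFrame): DataFrame =
  predictions.groupBy("label").agg(
    (sum(when(col("label") === col("prediction"), 1).otherwise(0)) / count("*"))
      .alias("recall"))
```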
Monitor.
Visualise - the easiest way to measure quickly
● Set up your dashboard
○ Amounts of input data
■ Before / after joining
○ Amounts of output data
○ Metrics (see "Measure first, optimize second")
● Different model comparison - what's best, when and where
● Timeseries analysis
○ Anomaly detection - does a metric suddenly, drastically change?
○ Impact analysis - did deploying a model have a significant effect on a metric?
Shiny
● Web application framework for R.
○ Introduces user interaction to analysis
○ Combines ad-hoc testing with R's statistical / modeling power
● Turns R function wrappers into interactive dashboard elements.
○ Generates the HTML, CSS and JS behind the scenes, so you only write R.
● Get started
● Get inspired
● Shiny @Waze
Dashboard monitoring
The dashboard should support picking different models and comparing metrics.
Pick models to compare
Statistical tests on distributions: t.test / AUC
Dashboard monitoring
The dashboard should support time-series anomaly detection and impact analysis (when deploying a new model).
Start small and grow.
Reduce the problem
● Trade-off: time to market vs. loss of accuracy
● Sample data
○ Is random actually what you want? (see the sketch after this list)
■ Keep label distributions
■ Keep important features' distributions
● Test everything you believe worthy
○ Choose a model
○ Choose features (important when you go big)
■ Leave the borderline-significant ones in
○ Test different parameter configurations (you'll need to validate your choice later)
Remember : This isn't your production model. You're only getting a sense of the data for now.
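A minimal sketch of label-preserving sampling with the Spark 1.6 DataFrame API (featureMatrix, the label values and the fractions are illustrative):

```scala
// Plain random sampling can distort rare classes; sampleBy keeps
// the label distribution you ask for (a fraction per label value).
val fractions = Map("click" -> 0.1, "no_click" -> 0.1)
val sample = featureMatrix.stat.sampleBy("label", fractions, seed = 42L)
```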
Getting a feel
Exploring a dataset with R.
Dividing data into training and testing sets.
Random partitioning
Getting a feel
Logistic regression and basic variable
selection with R.
Logistic regression
Variable significance test
Getting a feel
Advanced variable selection with
regularisation techniques in R.
Intercepts - by significance
No intercept = not entered into the model
Getting a feel
Trying modeling techniques in R.
Root mean square error
Lower = better (~ kinda)
Fit a gradient boosted
trees model
Getting a feel
Modeling bigger data with R, using
parallelism.
Fit and combine 6 random forest
models (10k trees each) in parallel
Start with a flow.
Basic moving parts
[Flow diagram] Data source 1 … Data source N → Preprocess → Feature matrix → Training → Models 1..N → Scoring → Predictions 1..N → Serving DB, with a feedback loop back into the data sources. A Dashboard monitors the flow; a Conf holds the user/model assignments.
Flow motives
● Only one job for preprocessing
○ Used in both training and serving - reduces the risk of training on the wrong population
○ Should also be used before sampling when experimenting on a smaller scale
○ When data sources differ between training and serving (real-time vs. batch, for example), use interfaces!
● Saving training & scoring feature matrices aside
○ Try new algorithms / parameters on the same data
○ Measure changes on the same data as used in production
Reusable flow code
Create a feature generation interface
and some UDFs with Spark. Use later
for both training and scoring purposes
with minor changes.
SparkSQL UDFs
Implement feature generation -
decouples training and serving
Data cleaning work
Create a feature generation interface
and some UDFs with Spark. Use later
for both training and scoring purposes
with minor changes.
Reusable flow code
Generate feature matrix
Blackbox from app view
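The Spark code from these slides isn't in the transcript; a minimal sketch of the idea, with illustrative UDFs and column names:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// One interface for feature generation, shared by training and scoring -
// the two flows differ only in what they do with the resulting matrix.
trait FeatureGenerator {
  def generateFeatures(raw: DataFrame): DataFrame
}

object RideFeatures extends FeatureGenerator {
  // SparkSQL UDFs: feature engineering and data cleaning work
  private val dayOfWeek = udf((ts: java.sql.Timestamp) => {
    val cal = java.util.Calendar.getInstance()
    cal.setTime(ts)
    cal.get(java.util.Calendar.DAY_OF_WEEK)
  })
  private val clipKm = udf((km: Double) => math.min(math.max(km, 0.0), 500.0))

  // Blackbox from the app's view: raw rows in, feature matrix out
  override def generateFeatures(raw: DataFrame): DataFrame =
    raw.withColumn("day_of_week", dayOfWeek(col("start_time")))
       .withColumn("length_km_clean", clipKm(col("length_km")))
       .drop("length_km") // remove non-features
}
```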
Good ML code trumps
performance.
Why so many parts, you ask?
● Scaling
● Fault tolerance
○ Failed preprocessing / training doesn't affect the serving model
○ Rerun only the failed parts
● Different logical parts - different processes (see "Clean Code" by Uncle Bob)
○ Easier to read
○ Easier to change code - targeted changes only affect their specific process
○ One input, one output (almost…)
● Easier to tweak and deploy changes
Test your infrastructure.
@Test
● Supposed to happen throughout development; if not - now is the time to make sure you have it!
○ Data read correctly
■ Null rates?
○ Features calculated correctly
■ Does my complex join / function / logic return what it should?
○ Access
■ Can I access all the data sources from my "production" account?
○ Formats
■ Adapt for variance in non-structured formats such as JSON
○ Required latency
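A minimal sketch, assuming ScalaTest and a local SparkContext, of the kinds of checks listed above (record layouts are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class InfraSuite extends FunSuite {
  private val sc = new SparkContext(
    new SparkConf().setMaster("local[2]").setAppName("infra-test"))

  test("malformed rows are counted, not silently dropped") {
    val lines = sc.parallelize(Seq("1,tlv,3.2", "bad_row"))
    val parsed = lines.map(l => scala.util.Try(l.split(",")(2).toDouble))
    assert(parsed.filter(_.isFailure).count() === 1) // null/error rate check
  }

  test("complex join returns what it should") {
    val left = sc.parallelize(Seq((1, "a"), (2, "b")))
    val right = sc.parallelize(Seq((1, "x")))
    assert(left.join(right).count() === 1)
  }
}
```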
Set up a baseline.
Start with a neutral launch
● Take a snapshot of your metric reads:
○ The ones you chose earlier in the process as important to you
■ Confusion matrix
■ Weighted average % classified correctly
■ % subject coverage
● Latency
○ Building the feature matrix on the last day's data takes X minutes
○ Training takes X hours
○ Serving predictions on Y records takes X seconds
You are here:
Remember : You are running with a naive model. Everything better than the old model / random is OK.
Go to work.
Coffee recommended at this point.
Optimize
What? How?
● Grid search over parameters
● Evaluate metrics
○ Using a Spark predefined Evaluator
○ Using user-defined metrics
● Cross-validate everything
● Tweak preprocessing (mainly features)
○ Feature engineering
○ Feature transformers
■ Discretize / Normalise
○ Feature selectors
○ In Apache Spark 1.6
● Tweak training
○ Different models
○ Different model parameters
Spark ML
Building an easy to use wrapper
around training and serving.
Build model pipeline, train, evaluate
Not necessarily a random split
Spark ML
Building a training pipeline with
spark.ml.
Create dummy variables
Required response label format
The ML model itself
Labels back to readable format
Assembled training pipeline
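The spark.ml code from these slides isn't in the transcript; a sketch of such a pipeline (featureMatrix and the column names are assumptions):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, OneHotEncoder, StringIndexer, VectorAssembler}

// Create dummy variables for a categorical feature
val cityIndexer = new StringIndexer().setInputCol("city").setOutputCol("city_idx")
val cityEncoder = new OneHotEncoder().setInputCol("city_idx").setOutputCol("city_vec")

// Required response label format: indexed doubles
val labelIndexer = new StringIndexer()
  .setInputCol("label").setOutputCol("label_idx")
  .fit(featureMatrix)

// Assemble feature columns into a single vector
val assembler = new VectorAssembler()
  .setInputCols(Array("city_vec", "day_of_week", "length_km_clean"))
  .setOutputCol("features")

// The ML model itself
val rf = new RandomForestClassifier()
  .setLabelCol("label_idx").setFeaturesCol("features")

// Labels back to a readable format
val labelConverter = new IndexToString()
  .setInputCol("prediction").setOutputCol("predicted_label")
  .setLabels(labelIndexer.labels)

// Assembled training pipeline
val pipeline = new Pipeline()
  .setStages(Array(cityIndexer, cityEncoder, labelIndexer, assembler, rf, labelConverter))

// Not necessarily a random split - consider splitting by time instead
val Array(train, test) = featureMatrix.randomSplit(Array(0.8, 0.2), seed = 42L)
val model = pipeline.fit(train)
```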
Spark ML
Cross-validate, grid search params and
evaluate metrics.
Grid search with reference to
ML model stage (RF)
Metrics to evaluate
Yes, you can definitely extend
and add your own metrics.
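A sketch of the cross-validation step, reusing pipeline and rf from the previous sketch (grid values are illustrative):

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Grid search, with direct reference to the RF stage of the pipeline
val grid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(100, 500))
  .addGrid(rf.maxDepth, Array(5, 10))
  .build()

// Metrics to evaluate; extend Evaluator to add your own metrics
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label_idx").setMetricName("f1")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(grid)
  .setNumFolds(5)

val cvModel = cv.fit(train)
```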
Spark ML
Score a feature matrix and parse
output.
Get probability for predicted class
(default is a probability vector for all classes)
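A sketch of the scoring step, reusing cvModel from above (scoringMatrix and the output columns are assumptions):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// The "probability" column is a vector over all classes;
// keep only the probability of the predicted class.
val probOfPredicted = udf((probs: Vector, prediction: Double) => probs(prediction.toInt))

val scored = cvModel.transform(scoringMatrix)
  .withColumn("confidence", probOfPredicted(col("probability"), col("prediction")))
  .select("user_id", "predicted_label", "confidence")
```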
A/B
Test your changes
● Same data, different results
○ Use preprocessed feature matrix (same one used for current model)
● Best testing - production A/B test
○ Use current production model and new model in parallel
● Metrics improvements (Remember your dashboard?)
○ Time series analysis of metrics
○ Compare metrics over different code versions (improved preprocessing / modeling)
● Deploy / Revert = Update user assignments
○ Based on new metrics / feedback loop if possible
Compare to baseline
A/B Infrastructures
Setting up a very basic A/B testing infrastructure on top of the modeling wrapper presented earlier.
The conf holds a mapping of: model -> user_id/subject list
Score in parallel (inside a map)
Distributed = awesome.
Fancy Scala union of all score files
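A sketch of such an infrastructure, assuming two already-trained PipelineModels and the featureMatrix from earlier (model ids and user lists are illustrative):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Conf: model id -> assigned user ids (in the deck this mapping lives
// in an external conf, not in code)
val assignments: Map[String, Seq[Long]] = Map(
  "current"   -> Seq(1L, 2L, 3L),
  "candidate" -> Seq(4L, 5L, 6L))
val models: Map[String, PipelineModel] = Map(
  "current" -> currentModel, "candidate" -> candidateModel)

// Score each model on its own population, tagging rows with the model id
val scoredPerModel: Iterable[DataFrame] = assignments.map { case (id, users) =>
  models(id)
    .transform(featureMatrix.where(col("user_id").isin(users: _*)))
    .withColumn("model", lit(id))
}

// One "fancy union" over all score sets, ready for dashboard comparison
val allScores = scoredPerModel.reduce(_ unionAll _)
```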
Watch. Iterate.
● Respond to anomalies (alerts) on metric reads
● Try out new stuff
○ Tech versions (e.g. new Spark version)
○ New data sources
○ New features
● When you find something interesting - “Go to Work.”
Constant improvement
Remember : Trends and industries change, re-training on new data is not a bad thing.
Ad-Hoc statistics
● If you wrote your code right, you can easily reuse it in a notebook !
● Answer ad-hoc questions
○ How many predictions did you output last month?
○ How many new users had a prediction with probability > 0.7?
○ How accurate were we on last month's predictions? (join with real data)
● No need to compile anything!
Enter Apache Zeppelin
Playing with it
Setting up Zeppelin to use our jars.
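In Zeppelin of that era, jars were loaded through the %dep interpreter before the first Spark paragraph runs; a sketch (artifact coordinates and paths are illustrative):

```scala
%dep
z.reset()
// Published dependency from Maven coordinates
z.load("com.databricks:spark-csv_2.11:1.5.0")
// Our own compiled flow code
z.load("/path/to/big-ml-flow-assembly.jar")
```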
Playing with it
Read a Parquet file, show statistics, register it as a table and run SparkSQL on it.
Parquet - already has a schema inside
For usage in SparkSQL
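A sketch of such a notebook paragraph (the Parquet path and table name are assumptions):

```scala
// Parquet already carries its schema - no parsing needed
val df = sqlContext.read.parquet("/data/naive_feature_matrix")
df.describe().show()               // quick statistics
df.registerTempTable("features")   // for usage in SparkSQL / %sql paragraphs
sqlContext.sql("SELECT COUNT(*) FROM features").show()
```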
Playing with it
Using spark-csv by Databricks.
CSV to DataFrame by
Databricks
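A sketch of reading a CSV with spark-csv (pre-Spark 2.0; the path is an assumption):

```scala
// spark-csv by Databricks: CSV straight to DataFrame
val csv = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/baseline.csv")
```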
Playing with it
Using user-compiled code inside a notebook.
Bring your own code
Technological pitfalls
Keep in mind
● Code produced with
○ Apache Spark 1.6 / Scala 2.11.4
● RDD vs. DataFrame
○ Enter the "Dataset API" (v2.0+)
● mllib vs. spark.ml
○ Always use spark.ml if the functionality exists
● Algorithmic richness
● Using Parquet
○ Intermediate outputs
● Unbalanced partitions
○ Stuck on reduce
○ Stragglers
● Output size
○ Coalesce to desired size
● DataFrame windows - buggy
○ Write your own over RDD
● Parameter tuning
○ spark.sql.shuffle.partitions
○ Executors
○ Driver vs. executor memory
Putting it all together
Work Process
Step by step for deploying your big ML
workflows to production, ready for
operations and optimisations.
1. Measure first, optimize second.
a. Define metrics.
b. Preprocess data (using examples)
c. Monitor. (dashboard setup)
2. Start small and grow.
3. Start with a flow.
a. Good ML code trumps performance.
b. Test your infrastructure.
4. Set up a baseline.
5. Go to work.
a. Optimize.
b. A/B.
i. Test new flow in parallel to existing flow.
ii. Update user assignments.
6. Watch. Iterate. (see 5.)
Code:
https://github.com/dmarcous/BigMLFlow/
Slides:
http://www.slideshare.net/DanielMarcous/productionready-big-ml-workflows-from-zero-to-hero
Use Cases
What does Waze do with all its data?
Trending Locations / Day of Week Breakdown
Opening Hours Inference
Optimising - Ad clicks / Time from drive start
Time to Content (US) - Day of week / Category
Irregular Events / Anomaly Detection
Major events causing out-of-the-ordinary traffic, road blocks, etc., affecting large numbers of users.
Dangerous Places - Clustering
Find most dangerous areas / streets, using custom developed clustering algorithms
● Alert authorities / users
● Compare & share with 3rd parties (NYPD)
Parking Places Detection
Parking entrance
Parking lot
Street parking
Server Distribution Optimisation
Calculate the optimal distribution of routing servers according to geographical load.
● Better experience - faster response times
● Saves money - no need for redundant elastic scaling of servers
Text Mining - Topic Analysis
Topic 1 - ETA: wazers, eta, con, zona, usando, real, tiempo, carretera
Topic 2 - Unusual: usual, traffic, stay, today, times, clear, slower, accident
Topic 3 - Share info: road, driving, info, using, area, realtime, sharing, soci
Topic 4 - Reports: social, drivers, reporting, helped, nearby, traffic, jam, drive
Topic 5 - Jams: still, will, update, drive, delay, add, jammed, near
Topic 6 - Voice: morgan, ang, freeman, kanan, voice, meter, kan, masuk
Text Mining - New Version Impressions
● Text analysis - stemming / stopword detection etc.
● Topic modeling
● Sentiment analysis
Waze V4 update :
● Good - “redesign”, ”smarter”, “cleaner”, “improved”
● Bad - “stuck”
Overall very positive score!
Text Mining - Store Sentiments
Text Mining - Sentiment by Time & Place
Production-Ready BIG ML Workflows - from zero to hero
Daniel Marcous
dmarcous@google.com
dmarcous@gmail.com