Real Time Machine Learning
Visualization with Spark
Chester Chen, Ph.D
Sr. Manager, Data Science & Engineering
GoPro, Inc.
Hadoop Summit, San Jose 2016
Who am I ?
• Sr. Manager of Data Science & Engineering at GoPro
• Founder and Organizer of SF Big Analytics Meetup (4500+ members)
• Previous Employment:
– Alpine Data, Tinga, Clearwell/Symantec, AltaVista, Ascent Media, ClearStory Systems,
WebWare.
• Experience with Spark
– Exposed to Spark since Spark 0.6
– Architect for Alpine Spark Integration on Spark 1.1, 1.3 and 1.5.x
• Hadoop Distribution
– CDH, HDP and MapR
Real Time Machine Learning Visualization with Spark
Growing data needs
Lightning-fast cluster computing
Real Time ML Visualization with Spark
http://spark.apache.org/
Iris data set, K-Means clustering with K=3
[Figure: scatter plot of sepal width vs petal length, showing Cluster 0, Cluster 1, Cluster 2 and the centroids]
[Figure: the same plot with a "distance" annotation]
What is K-Means ?
• Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector,
• k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk}
• The clusters are determined by minimizing the within-cluster sum of squares (WCSS), that is, the sum of squared distances from each point to the center of its own cluster. In other words, the objective is to find
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
• where μi is the mean of points in Si.
• https://en.wikipedia.org/wiki/K-means_clustering
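To make the objective concrete, here is a minimal plain-Scala sketch (no Spark) of the cost function and one assign-then-update iteration. The names `KMeansSketch`, `sqDist`, `closest`, `cost` and `step` are illustrative assumptions, not part of the talk or of Spark MLlib.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Cost: each point contributes its squared distance to its nearest center.
  def cost(points: Seq[Point], centers: Seq[Point]): Double =
    points.map(p => sqDist(p, centers(closest(p, centers)))).sum

  // One iteration: assign every point to its nearest center, then move each
  // center to the mean of its points (an empty cluster keeps its center).
  def step(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val groups = points.groupBy(p => closest(p, centers))
    centers.indices.map { i =>
      groups.get(i) match {
        case Some(ps) => ps.transpose.map(_.sum / ps.size).toVector
        case None     => centers(i)
      }
    }
  }
}
```

Calling `step` repeatedly and plotting `cost` after each call gives the kind of cost-per-iteration curve that the rest of the talk visualizes.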
Visualization Cost
[Figure: line chart of Cost vs Iteration; x axis: iteration, 0 to 25; y axis: cost, 35 to 38.5]
Real Time ML Visualization
• Use Cases
– Use visualization to determine whether to end the
training early
• Need a way to visualize the training process
including the convergence, clustering or residual
plots, etc.
• Need a way to stop the training and save the current
model
• Need a way to disable or enable the visualization
Real Time ML Visualization with Spark
DEMO
How to Enable Real Time ML Visualization ?
• A callback interface for Spark Machine Learning Algorithm to send
messages
– Algorithms decide when and what message to send
– Algorithms don’t care how the message is delivered
• A task channel to handle the message delivery from Spark Driver to
Spark Client
– It doesn’t care about the content of the message or who sent the message
• The message is delivered from Spark Client to Browser
– We use HTML5 Server-Sent Events (SSE) and HTTP Chunked Response
(PUSH)
– Pull is possible, but requires a message Queue
• Visualization using JavaScript Frameworks Plot.ly and D3
Spark Job in Yarn-Cluster mode
[Diagram; labels: Application Host with Spark Client (Command Line, Rest API, Servlet); Hadoop Cluster with Yarn-Container (Spark Driver, Spark Job, Spark Context, Spark ML algorithm)]
Spark Job in Yarn-Cluster mode
[Diagram; labels: Application Host with Spark Client (Command Line, Rest API, Servlet); Hadoop Cluster with Spark Job (App Context, Spark ML Algorithms, ML Listener, Message Logger)]
Spark Job in Yarn-Cluster mode
[Diagram; labels: Browser; Web/Rest API Server; Application Host with Spark Client; Hadoop Cluster with Spark Job (App Context, Spark ML Algorithms, ML Listener, Message Logger); Akka link]
Enable Real Time ML Visualization
[Diagram; labels: Browser (SSE, Plotly, D3); Web Server / Rest API Server (Chunked Response); Spark Client; Hadoop Cluster with Spark Job (App Context, Message Logger, Task Channel, Spark ML Algorithms, ML Listener); Akka links]
Machine Learning Listeners
Callback Interface: ML Listener
trait MLListener {
def onMessage(message: => Any)
}
Callback Interface: MLListenerSupport
trait MLListenerSupport {
  // rest of code
  def sendMessage(message: => Any): Unit = {
    if (listenerEnabled()) {
      listeners.foreach(l => l.onMessage(message))
    }
  }
}
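The slides elide most of the trait, so the following is only a self-contained sketch of how the callback pair could fit together. The `listeners` buffer, the `enabled` flag and the `CountingAlgo` class are assumptions added for illustration; `addListener`, `enableListener`, `listenerEnabled` and `sendMessage` follow the names used on the slides. The point it shows is that the by-name parameter (`message: => Any`) defers building the message, so a disabled listener costs nothing.

```scala
import scala.collection.mutable.ListBuffer

trait MLListener {
  def onMessage(message: => Any): Unit
}

trait MLListenerSupport {
  // Assumed internal state; the slides only show "// rest of code".
  private val listeners = ListBuffer.empty[MLListener]
  private var enabled = false

  def addListener(l: MLListener): this.type = { listeners += l; this }
  def enableListener(flag: Boolean): this.type = { enabled = flag; this }
  def listenerEnabled(): Boolean = enabled

  // The message expression is passed on unevaluated; it runs only when a
  // listener actually reads it.
  def sendMessage(message: => Any): Unit =
    if (listenerEnabled()) listeners.foreach(_.onMessage(message))
}

// Hypothetical algorithm used only to exercise the trait.
class CountingAlgo extends MLListenerSupport {
  var statsBuilt = 0 // counts how often the message body was evaluated
  def run(iterations: Int): Unit =
    (1 to iterations).foreach { i =>
      sendMessage { statsBuilt += 1; s"iteration $i" }
    }
}
```

One consequence of this design is that a listener which reads `message` twice evaluates it twice, so a listener should bind it once, as a single `message match { ... }` does.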
KMeansExt: KMeans with MLListener
class KMeansExt private (…) extends Serializable
with Logging
with MLListenerSupport {
...
}
KMeansExt: KMeans with MLListener
case class KMeansCoreStats(iteration: Int, centers: Array[Vector], cost: Double)

private def runAlgorithm(data: RDD[VectorWithNorm]): KMeansModel = {
  ...
  while (!stopIteration &&
         iteration < maxIterations && !activeRuns.isEmpty) {
    ...
    // publish the per-iteration stats only when visualization is on
    if (listenerEnabled()) {
      sendMessage(KMeansCoreStats(…))
    }
    ...
  }
}
KMeans ML Listener
class KMeansListener(columnNames: List[String],
                     data: RDD[Vector],
                     logger: MessageLogger) extends MLListener {
  var sampleDataOpt: Option[Array[Vector]] = None

  override def onMessage(message: => Any): Unit = {
    message match {
      case coreStats: KMeansCoreStats =>
        // sample the data once and reuse the sample for every iteration
        if (sampleDataOpt.isEmpty)
          sampleDataOpt = Some(data.takeSample(withReplacement = false, num = 100))
        // use the KMeans model of the current iteration to predict sample cluster indexes
        val kMeansModel = new KMeansModel(coreStats.centers)
        val cluster = sampleDataOpt.get.map(vector => (vector.toArray, kMeansModel.predict(vector)))
        val msg = KMeansStats(…)
        logger.sendBroadCastMessage(MLConstants.KMEANS_CENTER, msg)
      case _ =>
        println(" message lost")
    }
  }
}
KMeans Spark Job Setup
val appCtxOpt: Option[ApplicationContext] = …
val kMeans = new KMeansExt().setK(numClusters)
  .setEpsilon(epsilon)
  .setMaxIterations(maxIterations)
  .enableListener(enableVisualization)
  .addListener(new KMeansListener(...))
// let user commands reach the running job
appCtxOpt.foreach(_.addTaskObserver(new MLTaskObserver(kMeans, logger)))
kMeans.run(vectors)
ML Task Observer
• Receives commands from the user to update a running Spark job
• Once it receives an UpdateTask command through the notify call, it performs the
necessary update operation
trait TaskObserver {
def notify (task: UpdateTaskCmd)
}
class MLTaskObserver(support: MLListenerSupport, logger: MessageLogger )
extends TaskObserver {
//implement notify
}
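The slides leave `notify` unimplemented, so here is one possible shape for it, as a sketch only. The command cases (`StopTraining`, `SetVisualization`), the `TrainingControl` stand-in and `ObserverDemo.train` are hypothetical names invented for this example; only `TaskObserver`, `notify` and `UpdateTaskCmd` come from the slides.

```scala
// Hypothetical command set; the slides only name UpdateTaskCmd.
sealed trait UpdateTaskCmd
case object StopTraining extends UpdateTaskCmd
final case class SetVisualization(enabled: Boolean) extends UpdateTaskCmd

trait TaskObserver {
  def notify(task: UpdateTaskCmd): Unit
}

// Minimal stand-in for the control flags an algorithm would expose.
class TrainingControl {
  @volatile var stopIteration = false
  @volatile var listenerOn = true
}

// Translates each command into a change of the control flags.
class ControlObserver(control: TrainingControl) extends TaskObserver {
  def notify(task: UpdateTaskCmd): Unit = task match {
    case StopTraining           => control.stopIteration = true
    case SetVisualization(flag) => control.listenerOn = flag
  }
}

object ObserverDemo {
  // Loops until maxIterations or until stopIteration is set; a user command
  // is simulated at iteration `stopAt`. Returns the iterations executed.
  def train(control: TrainingControl, maxIterations: Int,
            stopAt: Int, observer: TaskObserver): Int = {
    var iteration = 0
    while (!control.stopIteration && iteration < maxIterations) {
      iteration += 1
      if (iteration == stopAt) observer.notify(StopTraining)
    }
    iteration
  }
}
```

The flags are marked `@volatile` because, in a real job, the command would arrive on a different thread than the one running the training loop.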
Logistic Regression MLListener
class LogisticRegression(…) extends MLListenerSupport {
def train(data: RDD[(Double, Vector)]): LogisticRegressionModel= {
// initialization code
val (rawWeights, loss) = OWLQN.runOWLQN( …)
generateLORModel(…)
}
}
Logistic Regression MLListener
object OWLQN extends Logging {
  def runOWLQN(/*args*/, mlSupport: Option[MLListenerSupport]): (Vector, Array[Double]) = {
    // the cost function carries the listener support so it can report each evaluation
    val costFun = new CostFun(data, mlSupport, IterationState(), /*other args */)
    val states: Iterator[lbfgs.State] =
      lbfgs.iterations(
        new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector
      )
    …
  }
}
Logistic Regression MLListener
In Cost function :
override def calculate(weights: BDV[Double]): (Double, BDV[Double]) = {
  // a stop request from the user short-circuits the evaluation
  val shouldStop = mlSupport.exists(_.stopIteration)
  if (!shouldStop) {
    …
    mlSupport.filter(_.listenerEnabled()).foreach { s =>
      s.sendMessage((iState.iteration, w, loss))
    }
    …
  }
  else {
    …
  }
}
Task Communication Channel
Task Channel : Akka Messaging
[Diagram; labels: Spark Application with Application Context, Actor System, Messager Actor, Task Channel Actor, SparkContext and Spark tasks; Akka links]
Task Channel : Akka messaging
[Diagram; labels: Browser (SSE, Plotly, D3); Web Server / Rest API Server (Chunked Response); Spark Client; Hadoop Cluster with Spark Job (App Context, Message Logger, Task Channel, Spark ML Algorithms, ML Listener); Akka links]
Push To The Browser
HTTP Chunked Response and SSE
[Diagram; labels: Browser (SSE, Plotly, D3); Web Server / Rest API Server (Chunked Response); Spark Client; Hadoop Cluster with Spark Job (App Context, Message Logger, Task Channel, Spark ML Algorithms, ML Listener); Akka links]
HTML5 Server-Sent Events (SSE)
• Server-Sent Events (SSE) provide one-way messaging from the server to the browser
– An event is an update that the server pushes to the web page automatically
• Register an event source (JavaScript)
var source = new EventSource(url);
• The callback onmessage receives each event
source.onmessage = function(message){...}
• Data format: every line starts with "data:" and ends with \n; a blank line (\n\n) ends the event
data: {\n
data: "key" : "value"\n
data: }\n\n
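As a small illustration of the wire format, the sketch below turns a (possibly multi-line) payload into one SSE event. The object name `Sse` and the method `event` are assumptions made for this example, and the payload is assumed to use `\n` line breaks only.

```scala
object Sse {
  // Formats one SSE event: every payload line gets a "data: " prefix and a
  // trailing newline, and one extra newline (a blank line) ends the event.
  def event(payload: String): String =
    payload.split("\n", -1).map(line => s"data: $line\n").mkString + "\n"
}
```

On the browser side, the lines of a multi-line event are joined with newlines before `onmessage` fires, so a JSON payload split across several `data:` lines arrives as a single string.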
HTTP Chunked Response
• Spray Rest Server supports Chunked Response
val responseStart =
  HttpResponse(entity = HttpEntity(`text/event-stream`, s"data: Start\n"))
requestCtx.responder ! ChunkedResponseStart(responseStart).withAck(Messages.Ack)
val nextChunk = MessageChunk(s"data: $r \n\n")
requestCtx.responder ! nextChunk.withAck(Messages.Ack)
requestCtx.responder ! MessageChunk(s"data: Finished \n\n")
requestCtx.responder ! ChunkedMessageEnd
Push vs. Pull
Push
• Pros
– The data is streamed (pushed) to the browser via chunked response
– There is no need for a data queue, but data can be lost if it is not consumed
– Multiple pages can be pushed to at the same time, which allows multiple visualization
views
• Cons
– With a slow network, a slow browser and fast iterations, the data might all show up in
the browser at once, rather than as a nice iteration-by-iteration display
– If you control the chunked response by network acknowledgement, the
visualization may not show up at all, because data is not pushed while the
acknowledgements are slow
Push vs. Pull
Pull
• Pros
– Messages do not get lost, since they can be temporarily stored in the message
queue
– The visualization renders at an even pace
• Cons
– The browser needs to periodically request updates from the server
– A message queue is needed to hold messages until they are consumed
– Hard to support rendering on multiple pages with a simple message queue
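The pull trade-offs can be sketched with a small buffer that sits between the message producer and a polling endpoint. `PullBuffer`, `publish` and `drain` are hypothetical names for this example and are not taken from the talk; a single polling consumer is assumed.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.mutable.ListBuffer

// Hypothetical buffer between the task channel and a polling REST endpoint.
class PullBuffer[A] {
  private val queue = new ConcurrentLinkedQueue[A]()

  // Called by the producer for every training message.
  def publish(msg: A): Unit = { queue.add(msg); () }

  // Called by a polling request: removes and returns up to `max` messages
  // in arrival order.
  def drain(max: Int): List[A] = {
    val out = ListBuffer.empty[A]
    while (out.size < max && !queue.isEmpty) {
      // poll() may return null if another consumer emptied the queue first
      Option(queue.poll()).foreach(out += _)
    }
    out.toList
  }
}
```

Because `drain` removes what it returns, a second page polling the same buffer would miss those messages, which is the multi-page limitation listed among the cons.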
Visualization: Plot.ly + D3
[Figures: Cost vs. Iteration; ArrTime vs. Distance; ArrTime vs. DepTime; Alpine Workflow]
Use Plot.ly to render graph
function showCost(dataParsed) {
var costTrace = { … };
var data = [ costTrace ];
var costLayout = {
xaxis: {…},
yaxis: {…},
title: …
};
Plotly.newPlot('cost', data, costLayout);
}
Real Time ML Visualization: Summary
• Training a machine learning model involves a lot of experimentation, so
we need a way to visualize the training process.
• We presented a system to enable real time machine learning
visualization with Spark:
– Gives visibility into the training of a model
– Allows us to monitor the convergence of the algorithms during training
– Lets us stop the iterations when convergence is good enough.
Thank You
Chester Chen
chesterxgchen@yahoo.com
LinkedIn
https://www.linkedin.com/in/chester-chen-3205992
SlideShare
http://www.slideshare.net/ChesterChen/presentations
demo video
https://youtu.be/DkbYNYQhrao



Editor's Notes

  • #5: Here’s what we saw: the business was indeed growing and the product line was expanding in number and sophistication, but we were becoming more than a camera company. We had a growing ecosystem of software and services. We had a rich media side of the business that was growing in social and various media distribution channels. We’re moving now into advanced capture and, with drones, entirely new categories. All of this leads to the Big Data landscape that we have today. So we brought together a team of badasses from companies like LinkedIn, Apple, Oracle, and Splice Machine to tackle the problem. Thus formed the Data Science and Engineering team at GoPro.
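  • #8: Steps: (1) choose initial centers; (2) compute the distance d from each point to every centroid and assign the point to the nearest one; (3) choose new centers; (4) convergence is reached when the centroids no longer change.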
  • #22: Once we define the MLListener Support, we can gather stats at initial, iteration and final step and call: sendMessage(gatherKMeansStats(/*…*/))
  • #31: Turn into picture
  • #37: Two slides
  • #38: Two slides
  • #42: Share contact info? Link to slides again?