Powering Machine Learning
EXPERIENCES WITH STREAMING & MICRO-BATCH FOR ONLINE LEARNING
Swaminathan Sundararaman
FlinkForward 2017
2
The Challenge of Today’s Analytics Trajectory
Edges benefit from real-time online learning and/or inference
IoT is Driving Explosive Growth in Data Volume
[Diagram: data flows from "things" through the edge and network to the datacenter/cloud data lake.]
3
Real-Time Intelligence: Online Algorithm Advantages
• Real-world data is unpredictable and bursty
o Data behavior changes (different time of day, special events, flash crowds, etc.)
• Data behavior changes require retraining & model updates
o Updating models offline can be expensive (compute, retraining)
• Online algorithms retrain on the fly with real-time data
o Lightweight, with low compute and memory requirements
o Better accuracy through continuous learning
• Online algorithms are more accurate, especially when data behavior changes
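To make "retrain on the fly" concrete, here is a minimal sketch of the kind of per-sample update an online SVM can perform, assuming a Pegasos-style hinge-loss SGD step (illustrative code, not the algorithm used later in this deck; the regularization constant is an assumption):

// One online SVM update: O(d) time and memory per incoming sample.
final case class SVMState(w: Vector[Double], t: Long)

def update(state: SVMState, x: Vector[Double], y: Double, lambda: Double = 1e-3): SVMState = {
  val t      = state.t + 1
  val eta    = 1.0 / (lambda * t)                                   // decaying step size
  val margin = y * state.w.zip(x).map { case (wi, xi) => wi * xi }.sum
  val shrunk = state.w.map(_ * (1.0 - eta * lambda))                // regularization shrinkage
  val w =
    if (margin < 1.0) shrunk.zip(x).map { case (wi, xi) => wi + eta * y * xi } // hinge subgradient step
    else shrunk
  SVMState(w, t)
}

// Example: fold a stream of labeled samples into a model, one sample at a time.
// val model = samples.foldLeft(SVMState(Vector.fill(dim)(0.0), 0L)) {
//   case (s, (x, y)) => update(s, x, y)
// }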
4
Experience Building ML Algorithms on Flink 1.0
• Built both offline (batch) and online algorithms
o Batch algorithms (examples: KMeans, PCA, and Random Forest)
o Online algorithms (examples: Online KMeans, Online SVM)
• Uses many of the Flink DataStream primitives
o The DataStream APIs are sufficient and the primitives are generic enough for ML algorithms
o CoFlatMaps, Windows, Collect, Iterations, etc. (see the windowing sketch after this list)
• We have also added Python Streaming API support in Flink and are working with dataArtisans to contribute it to upstream Flink.
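As a small illustration of the Windows primitive in this setting (a hedged sketch of my own, not the speaker's code; the Partial type and the count of four learners are assumptions), partial models can be collected with a count window and averaged into a shared model:

import org.apache.flink.streaming.api.scala._

// Stand-in for a partial model emitted by one parallel learner.
case class Partial(weights: Vector[Double], merged: Long)

object WindowedModelAverage {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Pretend these are partial models emitted by four parallel learners.
    val partials: DataStream[Partial] = env.fromElements(
      Partial(Vector(0.2, 0.4), 1), Partial(Vector(0.4, 0.2), 1),
      Partial(Vector(0.1, 0.5), 1), Partial(Vector(0.3, 0.3), 1))

    partials
      .countWindowAll(4)                                 // wait for one partial per learner
      .reduce { (a, b) =>
        Partial(a.weights.zip(b.weights).map { case (x, y) => x + y }, a.merged + b.merged)
      }
      .map(p => Partial(p.weights.map(_ / p.merged), 1)) // element-wise average
      .print()

    env.execute("Windowed model averaging sketch")
  }
}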
5
Example: Online SVM Algorithm
/* Co-map to update local model(s) when new data arrives and also
   create the shared model when a pre-defined threshold is met */
private case class SVMModelCoMap(...) {
  /* flatMap1 processes new elements and updates the local model */
  def flatMap1(data: LabeledVector[Double], out: Collector[Model]) {
    . . .
  }

  /* flatMap2 accumulates local models and creates a new model
     (with decay) once all local models are received */
  def flatMap2(currentModel: Model, out: Collector[Model]) {
    . . .
  }
}

object OnlineSVM {
  . . .
  def main(args: Array[String]): Unit = {
    // initialize input arguments and connectors
    . . .
  }
}
[Diagram: a DataStream fans out to parallel flatMap1 (FM1) and flatMap2 (FM2) instances across task slots; aggregated and local models are combined with a decay factor.]
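The slide shows only the co-map class and the job skeleton; the sketch below is one plausible way such a co-map is wired into a DataStream job (my own hedged reconstruction: the stand-in Sample and Model types, the decay logic, and the broadcast of shared models back to the learners are assumptions, and a production job would likely close the loop with Flink iterations):

import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Stand-ins for the deck's LabeledVector[Double] and Model types.
case class Sample(label: Double, features: Vector[Double])
case class Model(weights: Vector[Double])

// Same flatMap1/flatMap2 split as SVMModelCoMap above, with the learning logic stubbed out.
class SVMCoFlatMap(decay: Double) extends CoFlatMapFunction[Sample, Model, Model] {
  private var local = Model(Vector(0.0, 0.0))

  // flatMap1: update the local model from a newly arrived sample, emit it periodically.
  override def flatMap1(sample: Sample, out: Collector[Model]): Unit = {
    // ... gradient step on `local` using `sample` (elided) ...
    out.collect(local)
  }

  // flatMap2: fold a received shared model into the local one with a decay factor.
  override def flatMap2(shared: Model, out: Collector[Model]): Unit = {
    local = Model(local.weights.zip(shared.weights).map {
      case (l, s) => decay * l + (1.0 - decay) * s
    })
  }
}

object OnlineSVMWiring {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val samples: DataStream[Sample] = env.fromElements(
      Sample(1.0, Vector(0.3, 0.7)), Sample(-1.0, Vector(0.9, 0.1)))

    val sharedModels: DataStream[Model] = env.fromElements(Model(Vector(0.0, 0.0)))

    // Broadcast shared models so every parallel learner merges the same aggregate.
    samples
      .connect(sharedModels.broadcast)
      .flatMap(new SVMCoFlatMap(decay = 0.9))
      .print()

    env.execute("Online SVM wiring sketch")
  }
}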
6
Telco Example: Measuring SLA Violations
• A server providing VoD services to VLC (i.e., media player) clients
o Clients request videos of different sizes at different times
o Server statistics are used to predict violations
• SLA violation: the service level drops below a predetermined threshold
From the paper describing this test bed: clients can access a VoD service running on the server side. A similar setup was investigated in the authors' previous work [1][5]; compared to that work, this setup assumes a stream of learning examples from at least one client in order to build an online service-quality prediction model for the clients in real time. Device statistics X are collected at the server while the service is operational; they refer to operating-system-level metrics such as CPU utilization, the number of running processes, the rate of context switches, and free memory. In contrast, service-level metrics Y on the client side refer to application-level statistics such as the video frame rate. The metrics X and Y are fed in real time to the Service Predictor module.
[Fig. 2: test-bed setup — server device statistics X form the dataset; client-side service metrics Y supply the labels for training.]
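One hedged sketch of how such a stream can be turned into training examples (the field names and the frame-rate threshold below are illustrative assumptions, not values from the deck or the paper): a sample is labeled as an SLA violation whenever the client-side service level, here the video frame rate, drops below the predetermined threshold, and that label is attached to the server-side device statistics observed at the same time.

// Labeling server-side statistics X with client-side service metrics Y.
case class DeviceStats(features: Vector[Double])   // X: collected at the server
case class ServiceMetrics(frameRateFps: Double)    // Y: measured at the client

/** Returns (label, features): +1.0 marks an SLA violation, -1.0 marks normal service. */
def toTrainingExample(x: DeviceStats, y: ServiceMetrics,
                      minFrameRateFps: Double = 18.0): (Double, Vector[Double]) = {
  val label = if (y.frameRateFps < minFrameRateFps) 1.0 else -1.0
  (label, x.features)
}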
7
Dataset
(https://arxiv.org/pdf/1509.01386.pdf)
• Load patterns – Flashcrowd, Periodic
• Delivered to Flink and Spark as a live stream in the experiments
• Device statistics (features), grouped by category:
o CPU Utilization: CPU Idle, CPU User, CPU System, CPU IO_Wait
o Memory/Swap: Mem Used, Mem Committed, Swap Used, Swap Cached
o I/O Transactions: Read transactions/s, Write transactions/s, Bytes Read/s, Bytes Written/s
o Block I/O Operations: Block Reads/s, Block Writes/s
o Process Statistics: New Processes/s, Context Switches/s
o Network Statistics: Received packets/s, Transmitted Packets/s, Received Data (KB)/s, Transmitted Data (KB)/s, Interface Utilization %
8
Fixed workloads – Online vs Offline (Batch)

Load Scenario   | Offline (LibSVM) Accuracy | Offline (Pegasos) Accuracy | Online SVM Accuracy
flashcrowd_load | 0.843                     | 0.915                      | 0.943
periodic_load   | 0.788                     | 0.867                      | 0.927
constant_load   | 0.999                     | 0.999                      | 0.999
poisson_load    | 0.963                     | 0.963                      | 0.971

When the load pattern remains static (unchanged), online algorithms can be as accurate as offline algorithms.
9
Online SVM vs Batch (Offline) SVM – both in Flink

[Chart: accumulated error rate (0–25%) over time for network SLA violation prediction, comparing Offline-SVM and Online-SVM. Partway through, the load changes from the training workload to a real-world workload. Until retraining occurs, the changed data leaves the offline model less accurate, while the online algorithm retrains on the fly and reduces its error rate.]

Online algorithms quickly adapt to workload changes
10
Throughput: Online SVM in Streams and Micro-Batch

Throughput for processing samples with 256 attributes from Kafka (thousands of operations per second):

Number of Nodes | Spark 2.0 | Flink 1.0.3 | Speedup
1               | 23.32     | 44.69       | 1.9x
2               | 26.58     | 85.11       | 3.2x
4               | 46.29     | 173.91      | 3.8x
8               | 39.44     | 333.33      | 8.5x

Notable performance improvement over the micro-batch based solution
11
Latency: Online SVM in Streams & Micro-batch

[Chart: per-sample latency (seconds, log scale from 0.01 to 100) over time for Spark Streaming with 10s, 1s, 0.1s, and 0.01s micro-batches vs. Flink 1.0.3.]

Low and predictable latency, as needed at the edge
12
Conclusions

Edge computing & online learning are needed for real-time analytics
• Edge computing minimizes excessive latency and reaction time
• Online learning can dynamically adapt to changing data and behavior

Online machine learning with streaming on Flink
• Supports low-latency processing and scales across multiple nodes
• Using real-world data, we demonstrated improved accuracy over offline algorithms
13
Parallel Machines
The Machine Learning Management Solution
info@parallelmachines.com