Powering Machine Learning
EXPERIENCES WITH STREAMING & MICRO-BATCH FOR ONLINE LEARNING
Swaminathan Sundararaman
FlinkForward 2017
2
The Challenge of Today’s Analytics Trajectory
Edges benefit from real-time online learning and/or inference
IoT is Driving Explosive Growth in Data Volume
[Diagram: data flows from "things" through the edge and network to the datacenter/cloud data lake.]
3
Real-Time Intelligence: Online Algorithm Advantages
• Real-world data is unpredictable and bursty
o Data behavior changes (different time of day, special events, flash crowds, etc.)
• Data behavior changes require retraining & model updates
o Updating models offline can be expensive (compute, retraining)
• Online algorithms retrain on the fly with real-time data
o Lightweight, with low compute and memory requirements
o Better accuracy through continuous learning
• Online algorithms are more accurate, especially when data behavior changes
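To make "retrain on the fly" concrete, here is a minimal sketch of the kind of per-sample update an online SVM can perform, assuming a Pegasos-style hinge-loss SGD step (illustrative code, not the algorithm used later in this deck; the regularization constant is an assumption):

// One online SVM update: O(d) time and memory per incoming sample.
final case class SVMState(w: Vector[Double], t: Long)

def update(state: SVMState, x: Vector[Double], y: Double, lambda: Double = 1e-3): SVMState = {
  val t      = state.t + 1
  val eta    = 1.0 / (lambda * t)                                   // decaying step size
  val margin = y * state.w.zip(x).map { case (wi, xi) => wi * xi }.sum
  val shrunk = state.w.map(_ * (1.0 - eta * lambda))                // regularization shrinkage
  val w =
    if (margin < 1.0) shrunk.zip(x).map { case (wi, xi) => wi + eta * y * xi } // hinge subgradient step
    else shrunk
  SVMState(w, t)
}

// Example: fold a stream of labeled samples into a model, one sample at a time.
// val model = samples.foldLeft(SVMState(Vector.fill(dim)(0.0), 0L)) {
//   case (s, (x, y)) => update(s, x, y)
// }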
4
Experience Building ML Algorithms on Flink 1.0
• Built both offline (batch) and online algorithms
o Batch algorithms (examples: KMeans, PCA, and Random Forest)
o Online algorithms (examples: Online KMeans, Online SVM)
• Uses many of the Flink DataStream primitives
o The DataStream APIs are sufficient and the primitives are generic enough for ML algorithms
o CoFlatMaps, Windows, Collect, Iterations, etc. (see the windowing sketch after this list)
• We have also added Python Streaming API support in Flink and are working with dataArtisans to contribute it to upstream Flink.
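As a small illustration of the Windows primitive in this setting (a hedged sketch of my own, not the speaker's code; the Partial type and the count of four learners are assumptions), partial models can be collected with a count window and averaged into a shared model:

import org.apache.flink.streaming.api.scala._

// Stand-in for a partial model emitted by one parallel learner.
case class Partial(weights: Vector[Double], merged: Long)

object WindowedModelAverage {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Pretend these are partial models emitted by four parallel learners.
    val partials: DataStream[Partial] = env.fromElements(
      Partial(Vector(0.2, 0.4), 1), Partial(Vector(0.4, 0.2), 1),
      Partial(Vector(0.1, 0.5), 1), Partial(Vector(0.3, 0.3), 1))

    partials
      .countWindowAll(4)                                 // wait for one partial per learner
      .reduce { (a, b) =>
        Partial(a.weights.zip(b.weights).map { case (x, y) => x + y }, a.merged + b.merged)
      }
      .map(p => Partial(p.weights.map(_ / p.merged), 1)) // element-wise average
      .print()

    env.execute("Windowed model averaging sketch")
  }
}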
5
Example: Online SVM Algorithm
/* Co-map to update local model(s) when new data arrives and also
   create the shared model when a pre-defined threshold is met */
private case class SVMModelCoMap(...) {
  /* flatMap1 processes new elements and updates the local model */
  def flatMap1(data: LabeledVector[Double], out: Collector[Model]) {
    . . .
  }

  /* flatMap2 accumulates local models and creates a new model
     (with decay) once all local models are received */
  def flatMap2(currentModel: Model, out: Collector[Model]) {
    . . .
  }
}

object OnlineSVM {
  . . .
  def main(args: Array[String]): Unit = {
    // initialize input arguments and connectors
    . . .
  }
}
[Diagram: a DataStream fans out to parallel flatMap1 (FM1) and flatMap2 (FM2) instances across task slots; aggregated and local models are combined with a decay factor.]
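The slide shows only the co-map class and the job skeleton; the sketch below is one plausible way such a co-map is wired into a DataStream job (my own hedged reconstruction: the stand-in Sample and Model types, the decay logic, and the broadcast of shared models back to the learners are assumptions, and a production job would likely close the loop with Flink iterations):

import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Stand-ins for the deck's LabeledVector[Double] and Model types.
case class Sample(label: Double, features: Vector[Double])
case class Model(weights: Vector[Double])

// Same flatMap1/flatMap2 split as SVMModelCoMap above, with the learning logic stubbed out.
class SVMCoFlatMap(decay: Double) extends CoFlatMapFunction[Sample, Model, Model] {
  private var local = Model(Vector(0.0, 0.0))

  // flatMap1: update the local model from a newly arrived sample, emit it periodically.
  override def flatMap1(sample: Sample, out: Collector[Model]): Unit = {
    // ... gradient step on `local` using `sample` (elided) ...
    out.collect(local)
  }

  // flatMap2: fold a received shared model into the local one with a decay factor.
  override def flatMap2(shared: Model, out: Collector[Model]): Unit = {
    local = Model(local.weights.zip(shared.weights).map {
      case (l, s) => decay * l + (1.0 - decay) * s
    })
  }
}

object OnlineSVMWiring {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val samples: DataStream[Sample] = env.fromElements(
      Sample(1.0, Vector(0.3, 0.7)), Sample(-1.0, Vector(0.9, 0.1)))

    val sharedModels: DataStream[Model] = env.fromElements(Model(Vector(0.0, 0.0)))

    // Broadcast shared models so every parallel learner merges the same aggregate.
    samples
      .connect(sharedModels.broadcast)
      .flatMap(new SVMCoFlatMap(decay = 0.9))
      .print()

    env.execute("Online SVM wiring sketch")
  }
}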
6
Telco Example: Measuring SLA Violations
• A server providing VoD services to VLC (i.e., media player) clients
o Clients request videos of different sizes at different times
o Server statistics are used to predict violations
• SLA violation: the service level drops below a predetermined threshold
From the paper describing this test bed: clients can access a VoD service running on the server side. A similar setup was investigated in the authors' previous work [1][5]; compared to that work, this setup assumes a stream of learning examples from at least one client in order to build an online service-quality prediction model for the clients in real time. Device statistics X are collected at the server while the service is operational; they refer to operating-system-level metrics such as CPU utilization, the number of running processes, the rate of context switches, and free memory. In contrast, service-level metrics Y on the client side refer to application-level statistics such as the video frame rate. The metrics X and Y are fed in real time to the Service Predictor module.
[Fig. 2: test-bed setup — server device statistics X form the dataset; client-side service metrics Y supply the labels for training.]
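One hedged sketch of how such a stream can be turned into training examples (the field names and the frame-rate threshold below are illustrative assumptions, not values from the deck or the paper): a sample is labeled as an SLA violation whenever the client-side service level, here the video frame rate, drops below the predetermined threshold, and that label is attached to the server-side device statistics observed at the same time.

// Labeling server-side statistics X with client-side service metrics Y.
case class DeviceStats(features: Vector[Double])   // X: collected at the server
case class ServiceMetrics(frameRateFps: Double)    // Y: measured at the client

/** Returns (label, features): +1.0 marks an SLA violation, -1.0 marks normal service. */
def toTrainingExample(x: DeviceStats, y: ServiceMetrics,
                      minFrameRateFps: Double = 18.0): (Double, Vector[Double]) = {
  val label = if (y.frameRateFps < minFrameRateFps) 1.0 else -1.0
  (label, x.features)
}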
7
Dataset
(https://arxiv.org/pdf/1509.01386.pdf)
• Load patterns – Flashcrowd, Periodic
• Delivered to Flink and Spark as a live stream in the experiments
• Device statistics (features), grouped by category:
o CPU Utilization: CPU Idle, CPU User, CPU System, CPU IO_Wait
o Memory/Swap: Mem Used, Mem Committed, Swap Used, Swap Cached
o I/O Transactions: Read transactions/s, Write transactions/s, Bytes Read/s, Bytes Written/s
o Block I/O Operations: Block Reads/s, Block Writes/s
o Process Statistics: New Processes/s, Context Switches/s
o Network Statistics: Received packets/s, Transmitted Packets/s, Received Data (KB)/s, Transmitted Data (KB)/s, Interface Utilization %
8
Fixed workloads – Online vs Offline (Batch)

Load Scenario   | Offline (LibSVM) Accuracy | Offline (Pegasos) Accuracy | Online SVM Accuracy
flashcrowd_load | 0.843                     | 0.915                      | 0.943
periodic_load   | 0.788                     | 0.867                      | 0.927
constant_load   | 0.999                     | 0.999                      | 0.999
poisson_load    | 0.963                     | 0.963                      | 0.971

When the load pattern remains static (unchanged), online algorithms can be as accurate as offline algorithms.
9
Online SVM vs Batch (Offline) SVM – both in Flink

[Chart: accumulated error rate (0–25%) over time for network SLA violation prediction, comparing Offline-SVM and Online-SVM. Partway through, the load changes from the training workload to a real-world workload. Until retraining occurs, the changed data leaves the offline model less accurate, while the online algorithm retrains on the fly and reduces its error rate.]

Online algorithms quickly adapt to workload changes
10
Throughput: Online SVM in Streams and Micro-Batch

Throughput for processing samples with 256 attributes from Kafka (thousands of operations per second):

Number of Nodes | Spark 2.0 | Flink 1.0.3 | Speedup
1               | 23.32     | 44.69       | 1.9x
2               | 26.58     | 85.11       | 3.2x
4               | 46.29     | 173.91      | 3.8x
8               | 39.44     | 333.33      | 8.5x

Notable performance improvement over the micro-batch based solution
11
Latency: Online SVM in Streams & Micro-batch

[Chart: per-sample latency (seconds, log scale from 0.01 to 100) over time for Spark Streaming with 10s, 1s, 0.1s, and 0.01s micro-batches vs. Flink 1.0.3.]

Low and predictable latency, as needed at the edge
12
Conclusions

Edge computing & online learning are needed for real-time analytics
• Edge computing minimizes excessive latency and reaction time
• Online learning can dynamically adapt to changing data and behavior

Online machine learning with streaming on Flink
• Supports low-latency processing and scales across multiple nodes
• Using real-world data, we demonstrated improved accuracy over offline algorithms
13
Parallel Machines
The Machine Learning Management Solution
info@parallelmachines.com