Apache Samoa: Mining Big Data Streams with Apache Flink

APACHE SAMOA:
MINING BIG DATA STREAMS
WITH APACHE FLINK
Albert Bifet @abifet
12 October 2015

APACHE SAMOA 0.3.0
• Released July 2015

Streaming Predictive Analytics on
Apache Flink
Author:
Foteini Beligianni
Examiner:
Vladimir Vlassov
Supervisors:
Seif Haridi
Paris Carbone
A thesis submitted for the degree of Master of Science in
Distributed Systems and Services

REAL-TIME ANALYTICS

APACHE SAMOA VISION
• Distributed stream mining platform
• Library of state-of-the-art algorithms 
for practitioners
• Development and collaboration framework 
for researchers
• Algorithms & Systems

IMPORTANCE
• Example: spam detection in comments on Yahoo News
• Trends change over time
• Need to retrain the model with new data

INTERNET OF THINGS
• Figure 3: EMC Digital Universe, 2014

BIG DATA STREAM
• Volume + Velocity (+ Variety)
• Too large for single commodity server main memory
• Too fast for single commodity server CPU
• A solution should be:
  • Distributed
  • Scalable

BIG DATA
PROCESSING ENGINES
• Low latency
• High latency (not real time)
Apache Storm
Storm characteristics for real-time data processing workloads:
1 Fast
2 Scalable
3 Fault-tolerant
4 Reliable
5 Easy to operate
Apache Samza (from LinkedIn)
Storm and Samza are fairly similar. Both systems provide:
1 a partitioned stream model,
2 a distributed execution environment,
3 an API for stream processing,
4 fault tolerance,
5 Kafka integration
Real-time computation: streaming computation
MapReduce limitations
Example: how to compute in real time (latency less than 1 second):
1 predictions
2 frequent items, such as Twitter hashtags
3 sentiment analysis
Apache Spark Streaming

DATA SCIENCE
[Figure 1: image labelled "data scientist"]

MACHINE LEARNING
• Classification
• Regression
• Clustering
• Frequent Pattern Mining

STREAMING MODEL
• Sequence is potentially infinite
• High amount of data, high speed of arrival
• Change over time (concept drift)
• Approximation algorithms (small error with high probability)
• Single pass, one data item at a time
• Sub-linear space and time per data item
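
The single-pass and bounded-memory constraints listed above can be illustrated with a classic frequent-items summary. The sketch below is a minimal Misra-Gries counter in Java. It is not part of SAMOA, and the class and method names are hypothetical. It keeps at most k - 1 counters, so memory does not grow with the stream length, and after n items each estimate undercounts the true frequency by at most n / k.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical illustration: Misra-Gries summary for frequent items in one pass. */
final class FrequentItemsSketch {
    private final int capacity;                          // at most k - 1 counters are kept
    private final Map<String, Long> counters = new HashMap<>();
    private long seen = 0;                               // number of items processed so far

    FrequentItemsSketch(int k) {
        if (k < 2) throw new IllegalArgumentException("k must be at least 2");
        this.capacity = k - 1;
    }

    /** Processes one item; the item itself is never stored or revisited. */
    void add(String item) {
        seen++;
        Long current = counters.get(item);
        if (current != null) {
            counters.put(item, current + 1);
        } else if (counters.size() < capacity) {
            counters.put(item, 1L);
        } else {
            // No free counter: decrement every counter and drop those that reach zero.
            Iterator<Map.Entry<String, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> e = it.next();
                if (e.getValue() == 1L) it.remove();
                else e.setValue(e.getValue() - 1);
            }
        }
    }

    /** Returns a count that is never above the true count and at most seen / k below it. */
    long estimate(String item) {
        return counters.getOrDefault(item, 0L);
    }

    long itemsSeen() {
        return seen;
    }
}
```

With k = 3, for example, any hashtag that makes up more than a third of the stream is guaranteed to keep a counter.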

TAXONOMY
Data Mining
• Distributed
  • Batch (Hadoop): Mahout
  • Stream (Storm, S4, Samza): SAMOA
• Non Distributed
  • Batch: R, WEKA, …
  • Stream: MOA

ARCHITECTURE
An adapter for integrating Apache Flink into Apache SAMOA was implemented
in the scope of this master thesis, with the main parts of its implementation being
addressed in this section. With the use of our adapter, ML algorithms can be
executed on top of Apache Flink. The implemented adapter will be used for the
evaluation of the ML pipelines and HT algorithm variations.
Figure 20: Apache SAMOA’s high level architecture.

STATUS
• Parallel algorithms
  • Classification (Vertical Hoeffding Tree)
  • Clustering (CluStream)
  • Regression (Adaptive Model Rules)
• Execution engines

IS SAMOA USEFUL FOR YOU?
• Only if you need to deal with:
• Large fast data
• Evolving process (model updates)
• What is happening now?
• Use feedback in real-time
• Adapt to changes faster

GROUPINGS
• Key Grouping (hashing)
• Shuffle Grouping (round-robin)
• All Grouping (broadcast)
[Figure: two PEs, the second with four PEIs]
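
The three groupings differ only in how a destination instance is chosen for each event. The Java sketch below illustrates that choice under simple assumptions: destinations are numbered 0 to parallelism - 1, and keys are strings. It is not SAMOA's implementation, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the three groupings; not SAMOA's implementation. */
final class GroupingSketch {
    private final int parallelism;   // number of destination instances
    private int next = 0;            // round-robin cursor used by shuffle grouping

    GroupingSketch(int parallelism) {
        if (parallelism < 1) throw new IllegalArgumentException("parallelism must be positive");
        this.parallelism = parallelism;
    }

    /** Key grouping: events with the same key always reach the same instance. */
    int keyGrouping(String key) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    /** Shuffle grouping: events are spread evenly, one instance after another. */
    int shuffleGrouping() {
        int target = next;
        next = (next + 1) % parallelism;
        return target;
    }

    /** All grouping: every instance receives a copy of the event. */
    List<Integer> allGrouping() {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) targets.add(i);
        return targets;
    }
}
```

Key grouping suits stateful processing per key, shuffle grouping suits load balancing of stateless work, and all grouping suits control messages that every instance must see.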

ML DEVELOPER API
Processing Item
Processor
Stream

ML DEVELOPER API
TopologyBuilder builder;
// First source and the stream it emits
Processor sourceOne = new SourceProcessor();
builder.addProcessor(sourceOne);
Stream streamOne = builder.createStream(sourceOne);
// Second source and the stream it emits
Processor sourceTwo = new SourceProcessor();
builder.addProcessor(sourceTwo);
Stream streamTwo = builder.createStream(sourceTwo);
// The join reads streamOne with shuffle grouping and streamTwo with key grouping
Processor join = new JoinProcessor();
builder.addProcessor(join)
    .connectInputShuffle(streamOne)
    .connectInputKey(streamTwo);

DECISION TREE
• Nodes are tests on attributes
• Branches are possible outcomes
• Leaves are class assignments
[Figure: decision tree for "Car deal?" with tests Road Tested? (Yes/No), Mileage? (High/Low) and Age? (Old/Recent), and leaves ✅/❌; labels: Class, Instance, Attributes]

HOEFFDING TREE
• Sample of stream enough for near optimal decision
• Estimate merit of alternatives from prefix of stream
• Choose sample size based on statistical principles
• When to expand a leaf?
• Let $x_1$ be the most informative attribute, $x_2$ the second most informative one
• Hoeffding bound: split if $G(x_1, x_2) > \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
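
The split rule above can be sketched in a few lines of Java. The helper below is hypothetical and is not taken from SAMOA or MOA. It assumes that G(x1, x2) denotes the gap in merit between the best and second-best attributes, that R is the range of the merit function (for information gain, the logarithm of the number of classes), that delta is the allowed probability of a wrong decision, and that n is the number of instances seen at the leaf.

```java
/** Hypothetical helper illustrating the Hoeffding bound; not SAMOA's code. */
final class HoeffdingBoundSketch {

    /** Computes epsilon = sqrt(R^2 * ln(1 / delta) / (2 * n)). */
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** Splits when the gap between the best and second-best merit exceeds epsilon. */
    static boolean shouldSplit(double bestMerit, double secondBestMerit,
                               double range, double delta, long n) {
        return (bestMerit - secondBestMerit) > epsilon(range, delta, n);
    }
}
```

Because epsilon shrinks in proportion to 1 / sqrt(n), a leaf that cannot yet separate two close attributes only needs to wait for more instances before the test is repeated.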

PARALLEL DECISION TREES
• Which kind of parallelism?
  • Task
  • Data
    • Horizontal
    • Vertical
[Figure: Data, with axes labelled Attributes and Instances]

HORIZONTAL PARALLELISM
Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010
• Single attribute tracked in multiple nodes
• Aggregation to compute splits
[Figure: Stream, three Stats nodes and Model; edges labelled Instances, Histograms and Model Updates]

HOEFFDING TREE
PROFILING
• CPU time for training, with 100 nominal and 100 numeric attributes:
  • Learn: 70 %
  • Split: 24 %
  • Other: 6 %

VERTICAL PARALLELISM
• Single attribute tracked in single node
[Figure: Stream, Model and three Stats nodes; edges labelled Attributes and Splits]
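
The idea that each attribute is tracked by exactly one node can be sketched as follows. This Java example is a hypothetical illustration, not SAMOA's VHT code: it assumes attributes are identified by their index and assigned to statistics nodes by hashing that index, in the spirit of key grouping.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of vertical parallelism; not SAMOA's VHT code. */
final class VerticalPartitionSketch {

    /** Maps an attribute index to the statistics node that owns it. */
    static int ownerOf(int attributeIndex, int statsNodes) {
        return Math.floorMod(Integer.hashCode(attributeIndex), statsNodes);
    }

    /**
     * Splits one instance into per-node lists of attribute indices, so that each
     * statistics node only sees, and keeps counts for, its own attributes.
     */
    static List<List<Integer>> partition(double[] attributes, int statsNodes) {
        List<List<Integer>> perNode = new ArrayList<>();
        for (int node = 0; node < statsNodes; node++) perNode.add(new ArrayList<>());
        for (int i = 0; i < attributes.length; i++) {
            perNode.get(ownerOf(i, statsNodes)).add(i);
        }
        return perNode;
    }
}
```

Since no attribute is shared between nodes, each node can evaluate candidate splits for its own attributes without exchanging statistics with the others.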

ADVANTAGES OF VERTICAL
• High number of attributes => high level of parallelism (e.g., documents)
• Vs task parallelism
  • Parallelism observed immediately
• Vs horizontal parallelism
  • Reduced memory usage (no model replication)
  • Parallelized split computation

VERTICAL HOEFFDING TREE
[Figure: VHT topology with components Source (n), Model (n), Stats (n) and Evaluator (1); streams labelled Instance, Control, Split and Result; connections use Shuffle Grouping, Key Grouping and All Grouping]

ACCURACY
• Very close and very high accuracy
[Figure: No. Leaf Nodes, VHT2, tree-100]

PERFORMANCE
[Figure: Profiling Results for text-10000 with 100000 instances; execution time (seconds) per classifier (MHT, VHT2-par-3), broken down into t_calc, t_comm and t_serial]
• Throughput
  • VHT2-par-3: 2631 inst/sec
  • MHT: 507 inst/sec

REPLICATED MODEL VHT
(RMVHT)
4 ALGORITHM IMPLEMENTATION
4.1.2 Replicated Model of VHT Algorithm (RmVHT)

COMPARISON NATIVE VHT
6 EXPERIMENTAL EVALUATION
Figure 22: Prequential classification error of Flink’s native VHT, SAMOA’s VHT
and RmVHT algorithm for the UCI Forest Covertype data set. Flink’s native
VHT has a data source with parallelism equal to 1.

COMPARISON NATIVE VHT
Figure 25: Prequential classification error of Flink’s native VHT, SAMOA’s VHT
and RmVHT algorithm for the UCI Forest Covertype data set. Flink’s native
VHT has a data source with parallelism equal to 8.

COMPARISON NATIVE VHT
The Higgs data set is a synthetic data set, a detailed description of which is
presented in Appendix Section A.2.1. In general we observe that Higgs is not
well suited to classification with a DT classifier. As we see
in Figure 27, SAMOA’s VHT learns more slowly than Flink’s native VHT but achieves
a lower prequential classification error at the end. Flink’s VHT, on the other hand,
learns faster at the beginning, but its prequential classification error then
remains stable and slightly higher than SAMOA’s.
Figure 27: Prequential classification error of Flink’s native VHT, SAMOA’s VHT
and RmVHT algorithm for UCI-HIGGS data set.
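
Prequential error is measured by testing on each instance before training on it. The Java sketch below shows the plain cumulative form of that loop. It is an illustration only: the interface, the toy majority-class learner and all names are hypothetical, and the thesis may use a windowed or faded variant rather than this cumulative one.

```java
/** Hypothetical minimal online-classifier contract used only for this sketch. */
interface OnlineClassifier {
    int predict(double[] attributes);
    void train(double[] attributes, int label);
}

/** Toy learner that always predicts the most frequent label seen so far. */
final class MajorityClassSketch implements OnlineClassifier {
    private final int[] counts;

    MajorityClassSketch(int numClasses) {
        counts = new int[numClasses];
    }

    public int predict(double[] attributes) {
        int best = 0;
        for (int c = 1; c < counts.length; c++) {
            if (counts[c] > counts[best]) best = c;
        }
        return best;
    }

    public void train(double[] attributes, int label) {
        counts[label]++;
    }
}

final class PrequentialSketch {
    /** Test-then-train: each instance is first used for testing, then for training. */
    static double errorRate(OnlineClassifier model, double[][] xs, int[] ys) {
        int mistakes = 0;
        for (int i = 0; i < xs.length; i++) {
            if (model.predict(xs[i]) != ys[i]) mistakes++;  // test first
            model.train(xs[i], ys[i]);                      // then train
        }
        return xs.length == 0 ? 0.0 : (double) mistakes / xs.length;
    }
}
```

Every instance contributes to both evaluation and learning, so no separate hold-out set is needed and the error curve can be plotted as the stream progresses.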

As we observe in Figure 31, for the Waveform21 data set SAMOA’s VHT outperforms
Flink’s native VHT implementation. Moreover, we see that SAMOA’s VHT
learns more slowly but achieves a lower classification error at the end, whereas Flink’s
native VHT learns faster, reducing the classification error very quickly, but then
its error remains stable.
Figure 31: Classification error of VHT and RmVHT classifier, for Waveform 21-attribute
data set on Apache Flink and Apache SAMOA.
In Figure 32, we observe that for the Led data set Flink’s native VHT outper-

• Native VHT is faster than SAMOA VHT
• Native VHT is more accurate than SAMOA VHT in real datasets
• Future work for native VHT: stress test with nominal attributes, and use Gini Impurity

CONCLUSIONS
• Streaming is the future and is happening now
• Mining big data streams is an open field
• SAMOA: A Platform for Mining Big Data Streams
• Available and open-source (incubating @ASF)
http://samoa.incubator.apache.org
• A platform for collaboration and research on 
distributed stream mining

OPEN CHALLENGES
• Distributed stream mining algorithms
• Active & semi-supervised learning + crowdsourcing
• Millions of classes (e.g., Wikipedia pages)
• Multi-target learning
• System issues (load balancing, communication)
• Programming paradigms and abstractions

THE TEAM
Albert Bifet
Matthieu Morel
Gianmarco De Francisci Morales
Arinto Murdopo
Nicolas Kourtellis
Olivier Van Laere

THANKS!
https://samoa.incubator.apache.org
@ApacheSAMOA
