Mining Big Data Streams with APACHE SAMOA

MINING BIG DATA STREAMS
WITH APACHE SAMOA
Albert Bifet @abifet
#J_OnTheBeach
Malaga, 20 May 2016

REALTIME ANALYTICS
eal time analytics

REALTIME ANALYTICS
real time analytics

APACHE SA(MOA)VISION
• Data Stream mining platform
• Library of state-of-the-art algorithms 
for practitioners
• Development and collaboration framework 
for researchers
• Algorithms & Systems

IMPORTANCE
• Example: spam detection in
comments onYahoo News
• Trends change in time
• Need to retrain model with
new data
Importance$of$O
•  As$spam$trends$change
retrain$the$model$with

INTERNET OF THINGS
• EMC Digital Universe, 2014
digital universe
Figure 3: EMC Digital Universe, 2014
7

BIG DATA STREAM
• Volume +Velocity (+Variety)
• Too large for single commodity
server main memory
• Too fast for single commodity
server CPU
• A solution should be:
• Distributed
• Scalable

BIG DATA
PROCESSING ENGINES
• Low latency
• High Latency (Not real time)
apache storm
Storm characteristics for real-time data processing workloads
1 Fast
2 Scalable
3 Fault-tolerant
4 Reliable
5 Easy to operate
apache samza from linkedin
Storm and Samza are fairly similar. Both systems provide:
1 a partitioned stream model,
2 a distributed execution environment,
3 an API for stream processing,
4 fault tolerance,
5 Kafka integration
real time computation: streaming computation
MapReduce Limitations
Example
How compute in real time (latency less than 1 second):
1 predictions
2 frequent items as Twitter hashtags
3 sentiment analysis
14
apache spark streaming

MACHINE LEARNING
• Classiﬁcation
• Regression
• Clustering
• Frequent Pattern Mining

MOA
• {M}assive {O}nline {A}nalysis is a framework for online learning
from data streams.
• It is closely related to WEKA
• It includes a collection of ofﬂine and online as well as tools for
evaluation:
• classiﬁcation, regression
• clustering, frequent pattern mining
• Easy to extend, design and run experiments
{M}assive {O}nline {A}
MOA (Bifet et al. 20
{M}assive {O}nline {A}nalysis is a framework
learning from data streams.
It is closely related to WEKA

STREAM SETTING
• Process an example at a time,and
inspect it only once (at most)
• Use a limited amount of memory
• Work in a limited amount of
time
• Be ready to predict at any point

STREAM EVALUATION
• Holdout Evaluation
• InterleavedTest-Then-Train or
Prequential

STREAM EVALUATION
Holdout an independent
test set
• Apply the current decision model
to the test set, at regular time
intervals
• The loss estimated in the holdout
is an unbiased estimator

STREAM EVALUATION
Prequential Evaluation
• The error of a model is computed
from the sequence of examples.
• For each example in the stream, the
actual model makes a prediction based
only on the example attribute-values.

COMMAND LINE
• java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar
moa.DoTask "EvaluatePeriodicHeldOutTest -l
DecisionStump -s generators.WaveformGenerator -n
100000 -i 100000000 -f 1000000" > dsresult.csv
• This command creates a comma separated values file:
• training the DecisionStump classifier on the WaveformGenerator data,
• using the first 100 thousand examples for testing,
• training on a total of 100 million examples,
• and testing every one million examples

STREAMING MODEL
• Sequence is potentially inﬁnite
• High amount of data, high speed of arrival
• Change over time (concept drift)
• Approximation algorithms 
(small error with high probability)
• Single pass, one data item at a time
• Sub-linear space and time per data item

TAXONOMY
Data
Mining
Distributed
Batch
Hadoop
Mahout
Stream
Storm, S4,
Samza
SAMOA
Non
Distributed
Batch
R,
WEKA,
…
Stream
MOA

ARCHITECTURE
An adapter for integrating Apache Flink into Apache SAMOA was implemente
n scope of this master thesis, with the main parts of its implementation bein
addressed in this section. With the use of our adapter, ML algorithms can b
executed on top of Apache Flink. The implemented adapter will be used for th
evaluation of the ML pipelines and HT algorithm variations.
Figure 20: Apache SAMOA’s high level architecture.

STATUSSTATUS
• Parallel algorithms
• Classiﬁcation (Vertical HoeffdingTree)
• Clustering (CluStream)
• Regression (Adaptive Model Rules)
• Execution engines

IS SAMOA USEFUL FORYOU?
• Only if you need to deal with:
• Large fast data
• Evolving process (model updates)
• What is happening now?
• Use feedback in real-time
• Adapt to changes faster

ML DEVELOPER API
Processing Item
Processor
Stream

ML DEVELOPER API
TopologyBuilder builder;
Processor sourceOne = new SourceProcessor();
builder.addProcessor(sourceOne);
Stream streamOne = builder.createStream(sourceOne);
Processor sourceTwo = new SourceProcessor();
builder.addProcessor(sourceTwo);
Stream streamTwo = builder.createStream(sourceTwo);
Processor join = new JoinProcessor());
builder.addProcessor(join)
.connectInputShufﬂe(streamOne)
.connectInputKey(streamTwo);

DECISIONTREE
• Nodes are tests on attributes
• Branches are possible
outcomes
• Leafs are class assignments 
  Class
Instance
Attributes
Road
Tested?
Mileage?
Age?
NoYes
High
✅
❌
Low
OldRecent
✅ ❌
Car deal?

HOEFFDINGTREE
• Sample of stream enough for near optimal decision
• Estimate merit of alternatives from preﬁx of stream
• Choose sample size based on statistical principles
• When to expand a leaf?
• Let x1 be the most informative attribute, 
x2 the second most informative one
• Hoeffding bound: split if G(x1, x2) > ✏ =
r
R2 ln(1/ )
2n
P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00

PARALLEL DECISIONTREES
• Which kind of parallelism?
• Task
• Data
• Horizontal
• Vertical
Data
Attributes
Instances

HORIZONTAL PARALLELISM
Y. Ben-Haim and E.Tom-Tov,“A Streaming Parallel DecisionTree Algorithm,” JMLR, vol. 11, pp.
849–872, 2010
Stats
Stats
Stats
Stream
Histograms
Model
Instances
Model Updates
Aggregation to
compute splits
Single attribute
tracked in
multiple node
32

HOEFFDINGTREE
PROFILING
Other
6 %
Split
24 %
Learn
70 %
CPU time for training 
100 nominal and 100
numeric attributes

VERTICAL PARALLELISM
Single attribute tracked
in single node
Stats
Stats
Stats
Stream
Model
Attributes
Splits

ADVANTAGES OFVERTICAL
• High number of attributes => high level of parallelism 
(e.g., documents)
• Vs task parallelism
• Parallelism observed immediately
• Vs horizontal parallelism
• Reduced memory usage (no model replication)
• Parallelized split computation

VERTICAL HOEFFDINGTREE
Control
Split
Result
Source (n) Model (n) Stats (n) Evaluator (1)
InstanceStream
Shuffle Grouping
Key Grouping
All Grouping

ACCURACY
No. Leaf Nodes VHT2 –
tree-100
30
Very close and
very high accuracy

PERFORMANCE
35
0
50
100
150
200
250
MHT VHT2-par-3
ExecutionTime(seconds)
Classifier
Profiling Results for text-10000
with 100000 instances
t_calc
t_comm
t_serial
Throughput
VHT2-par-3: 2631 inst/sec
MHT : 507 inst/sec

SUMMARY
• Streaming is an importantV of Big Data
• Mining big data streams is an open ﬁeld
• MOA: Massive Online Analytics
• Available and open-source http://moa.cms.waikato.ac.nz/
• SAMOA:A Platform for Mining Big Data Streams
• Available and open-source (incubating @ASF) 
http://samoa.incubator.apache.org

OPEN CHALLENGES
• Distributed stream mining algorithms
• Active & semi-supervised learning + crowdsourcing
• Millions of classes (e.g.,Wikipedia pages)
• Multi-target learning
• System issues (load balancing, communication)
• Programming paradigms and abstractions

SAMOATEAM
Albert 
Bifet
Matthieu 
Morel
Gianmarco 
De Francisci Morales
Arinto 
Murdopo
Nicolas 
Kourtellis
Olivier 
Van Laere

THANKS!
https://samoa.incubator.apache.org
@ApacheSAMOA

Mining Big Data Streams with APACHE SAMOA

More Related Content

What's hot (20)

Similar to Mining Big Data Streams with APACHE SAMOA (20)

More from Albert Bifet (19)

Recently uploaded (20)

Mining Big Data Streams with APACHE SAMOA