MINING BIG DATA STREAMS
WITH APACHE SAMOA
Albert Bifet @abifet
#J_OnTheBeach
Malaga, 20 May 2016
MOTIVATION
REALTIME ANALYTICS
APACHE SA(MOA) VISION
• Data Stream mining platform
• Library of state-of-the-art algorithms for practitioners
• Development and collaboration framework for researchers
• Algorithms & Systems
IMPORTANCE
• Example: spam detection in comments on Yahoo News
• Trends change over time
• Need to retrain the model with new data
INTERNET OF THINGS
• EMC Digital Universe, 2014
Figure 3: EMC Digital Universe, 2014
BIG DATA STREAM
• Volume + Velocity (+ Variety)
• Too large for single commodity server main memory
• Too fast for single commodity server CPU
• A solution should be:
• Distributed
• Scalable
BIG DATA
PROCESSING ENGINES
• Low latency
• High Latency (Not real time)
Apache Storm
Storm characteristics for real-time data processing workloads
1 Fast
2 Scalable
3 Fault-tolerant
4 Reliable
5 Easy to operate
Apache Samza (from LinkedIn)
Storm and Samza are fairly similar. Both systems provide:
1 a partitioned stream model,
2 a distributed execution environment,
3 an API for stream processing,
4 fault tolerance,
5 Kafka integration
Real-time computation: streaming computation
MapReduce limitations
Example: how to compute in real time (latency less than 1 second):
1 predictions
2 frequent items, such as Twitter hashtags
3 sentiment analysis
Apache Spark Streaming
MACHINE LEARNING
• Classification
• Regression
• Clustering
• Frequent Pattern Mining
WHAT IS MOA?
MOA
• {M}assive {O}nline {A}nalysis is a framework for online learning from data streams.
• It is closely related to WEKA
• It includes a collection of offline and online methods, as well as tools for evaluation:
• classification, regression
• clustering, frequent pattern mining
• Easy to extend, design and run experiments
STREAM SETTING
• Process one example at a time, and inspect it only once (at most)
• Use a limited amount of memory
• Work in a limited amount of time
• Be ready to predict at any point
STREAM EVALUATION
• Holdout Evaluation
• Interleaved Test-Then-Train or Prequential
STREAM EVALUATION
Holdout: hold out an independent test set
• Apply the current decision model to the test set at regular time intervals
• The loss estimated on the holdout set is an unbiased estimator
STREAM EVALUATION
Prequential Evaluation
• The error of a model is computed from the sequence of examples.
• For each example in the stream, the current model makes a prediction based only on the example's attribute values.
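The test-then-train loop described above can be sketched in a few lines of Java. This is a minimal illustration, not MOA's evaluator: `PrequentialSketch`, `MajorityClassLearner` and `prequentialError` are hypothetical names, and the learner is a deliberately trivial majority-class predictor so that the evaluation order (predict first, then train on the same example) stays in focus.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal prequential (interleaved test-then-train) evaluation sketch. */
final class PrequentialSketch {

    /** Hypothetical toy learner: predicts the most frequent label seen so far. */
    static final class MajorityClassLearner {
        private final Map<Integer, Integer> counts = new HashMap<>();

        /** The attributes are ignored by this toy model. */
        int predict(double[] attributes) {
            int best = 0;        // default label before any training
            int bestCount = -1;
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            return best;
        }

        void train(double[] attributes, int label) {
            counts.merge(label, 1, Integer::sum);
        }
    }

    /** Returns the prequential error rate over a labelled stream. */
    static double prequentialError(double[][] attributes, int[] labels) {
        MajorityClassLearner learner = new MajorityClassLearner();
        int mistakes = 0;
        for (int i = 0; i < labels.length; i++) {
            // 1. Test: predict from the attribute values only.
            int predicted = learner.predict(attributes[i]);
            if (predicted != labels[i]) {
                mistakes++;
            }
            // 2. Train: only now is the label revealed to the model.
            learner.train(attributes[i], labels[i]);
        }
        return labels.length == 0 ? 0.0 : (double) mistakes / labels.length;
    }

    public static void main(String[] args) {
        double[][] x = {{0.1}, {0.4}, {0.3}, {0.9}, {0.2}};
        int[] y = {1, 1, 0, 1, 1};
        System.out.println("prequential error = " + prequentialError(x, y));
    }
}
```

On this toy stream the learner makes two mistakes in five examples (the first example and the single 0 label), so the prequential error is 0.4.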
CLUSTERING
COMMAND LINE
• java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar
moa.DoTask "EvaluatePeriodicHeldOutTest -l
DecisionStump -s generators.WaveformGenerator -n
100000 -i 100000000 -f 1000000" > dsresult.csv
• This command creates a comma separated values file:
• training the DecisionStump classifier on the WaveformGenerator data,
• using the first 100 thousand examples for testing,
• training on a total of 100 million examples,
• and testing every one million examples
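To make the schedule implied by those parameters concrete, here is a small Java sketch. It assumes the held-out test set is scored once after every `sampleFrequency` training examples; `PeriodicHeldOutSketch` and `evaluationSchedule` are hypothetical names and do not come from MOA.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a periodic held-out evaluation schedule (assumed semantics). */
final class PeriodicHeldOutSketch {

    /** Training-example counts at which the held-out test set would be scored. */
    static List<Long> evaluationSchedule(long trainSize, long sampleFrequency) {
        if (sampleFrequency <= 0) {
            throw new IllegalArgumentException("sampleFrequency must be positive");
        }
        List<Long> schedule = new ArrayList<>();
        for (long trained = sampleFrequency; trained <= trainSize; trained += sampleFrequency) {
            schedule.add(trained);
        }
        return schedule;
    }

    public static void main(String[] args) {
        // Values taken from the command line above: -i 100000000 -f 1000000
        List<Long> schedule = evaluationSchedule(100_000_000L, 1_000_000L);
        System.out.println("evaluation points: " + schedule.size());
    }
}
```

Under that assumption, training on 100 million examples and testing every one million examples gives 100 evaluation points.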
WHAT IS APACHE SAMOA?
STREAMING MODEL
• Sequence is potentially infinite
• High amount of data, high speed of arrival
• Change over time (concept drift)
• Approximation algorithms (small error with high probability)
• Single pass, one data item at a time
• Sub-linear space and time per data item
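As one illustration of a single-pass, sub-linear-space algorithm, the sketch below implements the Misra-Gries frequent-items summary. This technique is not named on the slides; it is swapped in here as a well-known example, and its error bound is deterministic rather than probabilistic: with at most k - 1 counters, each estimate undercounts the true frequency by at most N/k after N items. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Misra-Gries frequent-items summary: one pass, at most k - 1 counters. */
final class MisraGriesSketch {
    private final int capacity;
    private final Map<String, Long> counters = new HashMap<>();

    MisraGriesSketch(int k) {
        if (k < 2) {
            throw new IllegalArgumentException("k must be at least 2");
        }
        this.capacity = k - 1;
    }

    /** Processes one stream item. */
    void offer(String item) {
        Long current = counters.get(item);
        if (current != null) {
            counters.put(item, current + 1);
        } else if (counters.size() < capacity) {
            counters.put(item, 1L);
        } else {
            // No free counter: decrement all counters and drop those reaching zero.
            Iterator<Map.Entry<String, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Long> e = it.next();
                if (e.getValue() == 1L) {
                    it.remove();
                } else {
                    e.setValue(e.getValue() - 1);
                }
            }
        }
    }

    /** Lower-bound estimate of how often the item has been seen. */
    long estimate(String item) {
        return counters.getOrDefault(item, 0L);
    }

    public static void main(String[] args) {
        MisraGriesSketch sketch = new MisraGriesSketch(3);
        for (String tag : new String[] {"a", "b", "a", "c", "a", "d", "a"}) {
            sketch.offer(tag);
        }
        System.out.println("estimate(a) = " + sketch.estimate("a"));
    }
}
```

In the example stream "a" occurs 4 times out of 7 items and is estimated at 3, within the N/k = 7/3 bound for k = 3.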
TAXONOMY
Data Mining
• Distributed
  • Batch: Hadoop, Mahout
  • Stream: Storm, S4, Samza; SAMOA
• Non-distributed
  • Batch: R, WEKA, …
  • Stream: MOA
ARCHITECTURE
An adapter for integrating Apache Flink into Apache SAMOA was implemented in scope of this master thesis, with the main parts of its implementation being addressed in this section. With the use of our adapter, ML algorithms can be executed on top of Apache Flink. The implemented adapter will be used for the evaluation of the ML pipelines and HT algorithm variations.
Figure 20: Apache SAMOA’s high level architecture.
STATUS
• Parallel algorithms
• Classification (Vertical Hoeffding Tree)
• Clustering (CluStream)
• Regression (Adaptive Model Rules)
• Execution engines
IS SAMOA USEFUL FOR YOU?
• Only if you need to deal with:
• Large fast data
• Evolving process (model updates)
• What is happening now?
• Use feedback in real-time
• Adapt to changes faster
ML DEVELOPER API
Processing Item
Processor
Stream
ML DEVELOPER API
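// builder is assumed to be initialised by the enclosing task
TopologyBuilder builder;
// first source and its output stream
Processor sourceOne = new SourceProcessor();
builder.addProcessor(sourceOne);
Stream streamOne = builder.createStream(sourceOne);
// second source and its output stream
Processor sourceTwo = new SourceProcessor();
builder.addProcessor(sourceTwo);
Stream streamTwo = builder.createStream(sourceTwo);
// join reads streamOne with shuffle grouping and streamTwo with key grouping
Processor join = new JoinProcessor();
builder.addProcessor(join)
    .connectInputShuffle(streamOne)
    .connectInputKey(streamTwo);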
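The difference between the two connection types used above can be illustrated with a toy router. This is a sketch under stated assumptions, not SAMOA's implementation: `GroupingSketch`, `ShuffleRouter` and `routeByKey` are hypothetical names, shuffle grouping is modelled as round-robin, and key grouping as a hash of the key modulo the parallelism.

```java
/** Toy illustration of shuffle grouping versus key grouping. */
final class GroupingSketch {

    /** Shuffle grouping, modelled here as round-robin (an assumption). */
    static final class ShuffleRouter {
        private final int parallelism;
        private int next = 0;

        ShuffleRouter(int parallelism) {
            if (parallelism < 1) {
                throw new IllegalArgumentException("parallelism must be positive");
            }
            this.parallelism = parallelism;
        }

        /** Picks the next target instance, ignoring the message content. */
        int route() {
            int target = next;
            next = (next + 1) % parallelism;
            return target;
        }
    }

    /** Key grouping: messages with equal keys always reach the same instance. */
    static int routeByKey(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        ShuffleRouter shuffle = new ShuffleRouter(3);
        for (int i = 0; i < 4; i++) {
            System.out.println("shuffle -> instance " + shuffle.route());
        }
        System.out.println("key 'user-42' -> instance " + routeByKey("user-42", 3));
    }
}
```

Shuffle grouping balances load but gives no locality; key grouping guarantees that all state for one key lives on a single instance, which is what a join on a key needs.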
VERTICAL HOEFFDING TREE (VHT)
DECISION TREE
• Nodes are tests on attributes
• Branches are possible outcomes
• Leaves are class assignments
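[Figure: example decision tree for the question "Car deal?". An instance consists of attributes and a class; the tree tests Road Tested? (Yes/No), Mileage? (High/Low) and Age? (Old/Recent), with ✅/❌ leaves]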
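The three bullet points about decision trees map directly onto a small data structure. The Java sketch below is illustrative only: the class names are hypothetical, and the shape of `exampleTree()` is an assumed arrangement of the three attribute tests, not a transcription of the tree drawn on the slide.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal decision tree: nodes test attributes, branches are outcomes, leaves are classes. */
final class DecisionTreeSketch {

    interface Node {
        boolean classify(Map<String, String> instance);
    }

    /** Leaf: a class assignment. */
    static final class Leaf implements Node {
        private final boolean label;

        Leaf(boolean label) {
            this.label = label;
        }

        public boolean classify(Map<String, String> instance) {
            return label;
        }
    }

    /** Internal node: a test on one attribute with one branch per outcome. */
    static final class AttributeTest implements Node {
        private final String attribute;
        private final Map<String, Node> branches = new HashMap<>();

        AttributeTest(String attribute) {
            this.attribute = attribute;
        }

        AttributeTest branch(String outcome, Node child) {
            branches.put(outcome, child);
            return this;
        }

        public boolean classify(Map<String, String> instance) {
            Node child = branches.get(instance.get(attribute));
            if (child == null) {
                throw new IllegalArgumentException("no branch for attribute " + attribute);
            }
            return child.classify(instance);
        }
    }

    /** Hypothetical tree shape, chosen for illustration only. */
    static Node exampleTree() {
        return new AttributeTest("roadTested")
            .branch("no", new Leaf(false))
            .branch("yes", new AttributeTest("mileage")
                .branch("low", new Leaf(true))
                .branch("high", new AttributeTest("age")
                    .branch("recent", new Leaf(true))
                    .branch("old", new Leaf(false))));
    }

    public static void main(String[] args) {
        Map<String, String> car = new HashMap<>();
        car.put("roadTested", "yes");
        car.put("mileage", "high");
        car.put("age", "recent");
        System.out.println("car deal? " + exampleTree().classify(car));
    }
}
```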
HOEFFDING TREE
• Sample of stream enough for near optimal decision
• Estimate merit of alternatives from prefix of stream
• Choose sample size based on statistical principles
• When to expand a leaf?
• Let x1 be the most informative attribute, x2 the second most informative one
• Hoeffding bound: split if $G(x_1, x_2) > \epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$
P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00
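The split rule can be checked numerically with a short Java sketch. It reads G(x1, x2) as the difference in merit between the best and second-best attribute, as in the cited paper, with R the range of the merit measure, δ the allowed failure probability and n the number of examples seen at the leaf. `HoeffdingBoundSketch`, `epsilon` and `shouldSplit` are hypothetical names, and the numbers in `main` are made up for illustration.

```java
/** Hoeffding bound and the resulting split decision for a tree leaf. */
final class HoeffdingBoundSketch {

    /** epsilon = sqrt(R^2 * ln(1/delta) / (2n)). */
    static double epsilon(double range, double delta, long n) {
        if (n <= 0 || delta <= 0.0 || delta >= 1.0) {
            throw new IllegalArgumentException("need n > 0 and 0 < delta < 1");
        }
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** Split when the observed merit gap exceeds epsilon. */
    static boolean shouldSplit(double bestMerit, double secondBestMerit,
                               double range, double delta, long n) {
        return bestMerit - secondBestMerit > epsilon(range, delta, n);
    }

    public static void main(String[] args) {
        // Illustrative values: range 1 (e.g. information gain with two classes),
        // delta = 1e-7, and 1000 examples seen at the leaf.
        double eps = epsilon(1.0, 1e-7, 1000);
        System.out.println("epsilon = " + eps);
        System.out.println("split on gap 0.15? " + shouldSplit(0.30, 0.15, 1.0, 1e-7, 1000));
        System.out.println("split on gap 0.05? " + shouldSplit(0.30, 0.25, 1.0, 1e-7, 1000));
    }
}
```

With these values ε is roughly 0.09, so a merit gap of 0.15 justifies a split while a gap of 0.05 means the leaf should keep collecting examples. Because ε shrinks as 1/√n, any real gap is eventually detected.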
PARALLEL DECISION TREES
• Which kind of parallelism?
• Task
• Data
• Horizontal
• Vertical
[Figure: the data, laid out as instances by attributes]
HORIZONTAL PARALLELISM
Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010
[Figure: horizontal parallelism. Labels: Stream, Instances, Stats (×3), Histograms, Model, Model Updates, Aggregation to compute splits]
• Single attribute tracked in multiple nodes
HOEFFDING TREE PROFILING
CPU time for training (100 nominal and 100 numeric attributes):
• Learn: 70 %
• Split: 24 %
• Other: 6 %
VERTICAL PARALLELISM
• Single attribute tracked in a single node
[Figure: vertical parallelism. Labels: Stream, Model, Attributes, Stats (×3), Splits]
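The idea that each attribute is tracked by exactly one node can be sketched as follows. This is a simplified illustration, not the VHT implementation: it assumes attributes are assigned to statistics nodes by attribute index modulo the number of nodes, and all names (`VerticalPartitionSketch`, `AttributeEvent`, `ownerOf`, `decompose`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits an instance into per-attribute events owned by single stats nodes. */
final class VerticalPartitionSketch {

    /** One per-attribute message derived from an instance. */
    static final class AttributeEvent {
        final int attributeIndex;
        final double value;
        final int classLabel;

        AttributeEvent(int attributeIndex, double value, int classLabel) {
            this.attributeIndex = attributeIndex;
            this.value = value;
            this.classLabel = classLabel;
        }
    }

    /** Key grouping on the attribute index: each attribute has exactly one owner. */
    static int ownerOf(int attributeIndex, int statsNodes) {
        return attributeIndex % statsNodes;
    }

    /** Returns, for each stats node, the attribute events it would receive. */
    static List<List<AttributeEvent>> decompose(double[] attributes, int classLabel,
                                                int statsNodes) {
        List<List<AttributeEvent>> perNode = new ArrayList<>();
        for (int node = 0; node < statsNodes; node++) {
            perNode.add(new ArrayList<>());
        }
        for (int i = 0; i < attributes.length; i++) {
            perNode.get(ownerOf(i, statsNodes))
                   .add(new AttributeEvent(i, attributes[i], classLabel));
        }
        return perNode;
    }

    public static void main(String[] args) {
        List<List<AttributeEvent>> perNode =
            decompose(new double[] {1.0, 2.0, 3.0, 4.0, 5.0}, 1, 2);
        for (int node = 0; node < perNode.size(); node++) {
            System.out.println("stats node " + node + " receives "
                + perNode.get(node).size() + " attribute events");
        }
    }
}
```

Because an attribute's statistics live on one node only, that node can evaluate candidate splits for its attributes locally, and the model itself never has to be replicated.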
ADVANTAGES OF VERTICAL
• High number of attributes => high level of parallelism (e.g., documents)
• Vs task parallelism
• Parallelism observed immediately
• Vs horizontal parallelism
• Reduced memory usage (no model replication)
• Parallelized split computation
VERTICAL HOEFFDING TREE
[Figure: VHT topology. Components: Source (n), Model (n), Stats (n), Evaluator (1). Streams: Instance, Control, Split, Result. Groupings: Shuffle Grouping, Key Grouping, All Grouping]
ACCURACY
[Figure: accuracy plot for VHT2 on tree-100; axis label: No. Leaf Nodes]
• Very close and very high accuracy
PERFORMANCE
[Figure: profiling results for text-10000 with 100000 instances. Execution time (seconds, axis 0 to 250) per classifier (MHT, VHT2-par-3), broken down into t_calc, t_comm and t_serial]
Throughput
• VHT2-par-3: 2631 inst/sec
• MHT: 507 inst/sec
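As a quick worked check of these figures, the ratio of the two throughputs gives the speedup, and dividing the 100000 instances by each throughput gives an approximate run time. The second step assumes the throughput values refer to the same run as the profiling chart, which the slide does not state explicitly.

```latex
\[
\text{speedup} = \frac{2631\ \text{inst/sec}}{507\ \text{inst/sec}} \approx 5.19
\]
\[
t_{\text{VHT2-par-3}} \approx \frac{100000}{2631} \approx 38.0\ \text{s},
\qquad
t_{\text{MHT}} \approx \frac{100000}{507} \approx 197.2\ \text{s}
\]
```

Both times fall inside the 0 to 250 second range of the chart's axis.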
SUMMARY
• Streaming is an important V of Big Data
• Mining big data streams is an open field
• MOA: Massive Online Analysis
• Available and open-source http://moa.cms.waikato.ac.nz/
• SAMOA: A Platform for Mining Big Data Streams
• Available and open-source (incubating @ASF) http://samoa.incubator.apache.org
OPEN CHALLENGES
• Distributed stream mining algorithms
• Active & semi-supervised learning + crowdsourcing
• Millions of classes (e.g., Wikipedia pages)
• Multi-target learning
• System issues (load balancing, communication)
• Programming paradigms and abstractions
SAMOA TEAM
Albert Bifet
Matthieu Morel
Gianmarco De Francisci Morales
Arinto Murdopo
Nicolas Kourtellis
Olivier Van Laere
SUPPORTING
ORGANISATIONS
THANKS!
https://samoa.incubator.apache.org
@ApacheSAMOA
