Streaming & Parallel Decision Tree in Flink
anwar.rizal @anrizal
Outline
Motivation
Architecture
Decision Trees
Implementation
Conclusion
Motivation
Need a classifier system on streaming data
• The data used for learning come as a stream
• So are the data to be classified
Motivation
$90 $90 $120 $90 $90 $150 $200
$90 $75 $90 $90 $90 $90 $90
$120 $90 Sold out Sold out $75 $90 $90
$120 $90 $90 $90 $100 $90 $120
(predicted) to increase zero to two days
(predicted) to increase this week
(predicted) to increase next week
Motivation
FRA – NYC
FRA - LON
FRA - MEX
Need attention: revenue decrease
Need attention: passenger decrease
Need attention: revenue decrease, cost increase
Motivation: MUST
The classifier is kept fresh
• No need for separate batch learning/evaluation
• The feedback is taken into account in real time, regularly
The classifier can be introspected
• Transparent model structure (e.g. know the tree, information gain for each split point)
• Known expected performance (accuracy, precision, recall, AUC)
Seamless support for the machine learning workflow
• Data preprocessing: up/down sampling, imputations, …
• Feature selection
• Model evaluation, cross validation
Motivation: NICE TO HAVE
The classifier is immediately available
• The classifier can already predict during learning
• When a learning phase terminates, it starts another cycle of learning
The classifier has a meta-learning capability
• The classifier has several models with different parameters
• It is possible to learn about the learning capability of the models
Motivation
Cycle of learning: Learning, Classifying during Learning, End of Learning, Classifying, New Learning
Motivation
[Figure: labeled points feed the Stream Learner, which produces the Classifier; the Classifying Application uses the Classifier to turn unlabeled points into predicted points]
Decision Trees
From origin to recent developments
“Understand data by asking a sequence of questions”
Classification and Regression Trees (CART) by Breiman et al. in 1984
“Pool decision trees to improve generalization”
Random Forests by Breiman in 1999
“Let’s play: pose estimation for XBox’s Kinect”
Shotton et al. 2011
Decision Trees
Streaming Decision Trees
“A classifier for streaming data with a bound”
Hoeffding Tree (VFDT), Domingos & Hulten 2000
“Use of Approximate Histograms for Decision Tree”
Streaming and Parallel Decision Tree, Ben-Haim & Tom-Tov 2010
Train a decision tree - get the intuition!
[Figure: Reservation Subspace, advance purchase vs. class (FIRST, BUSINESS, ECONOMY). Groups of travelers: busy procrastinators, foreseeing businessmen, tourists, Brad Pitt, save money for the company. Supervision: each point is labeled Business or Leisure.]
Classifying - get the intuition!
[Figure: Reservation Subspace, advance purchase vs. class (FIRST, BUSINESS, ECONOMY). A new point is classified as Business or Leisure, together with a confidence measure.]
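The classification step can be sketched as a walk down the tree that returns the leaf's majority label and its share of the leaf's points as the confidence measure. This is a minimal sketch, not the talk's implementation: the `Tree`, `Split` and `Leaf` types, the feature encoding (index 0 = advance purchase in days, index 1 = class as a number) and the example tree are assumptions made for illustration.

```scala
object ClassifySketch {
  // Hypothetical tree model: an internal node splits on one numeric feature,
  // a leaf keeps the (non-empty) label counts seen during learning.
  sealed trait Tree
  final case class Split(feature: Int, threshold: Double, left: Tree, right: Tree) extends Tree
  final case class Leaf(counts: Map[String, Int]) extends Tree

  // Walk down the tree; at the leaf, return the majority label and the
  // fraction of the leaf's points that carry it (the confidence measure).
  @scala.annotation.tailrec
  def classify(tree: Tree, point: Vector[Double]): (String, Double) = tree match {
    case Split(f, t, l, r) => classify(if (point(f) <= t) l else r, point)
    case Leaf(counts) =>
      val (label, n) = counts.maxBy(_._2)
      (label, n.toDouble / counts.values.sum)
  }

  // Made-up example: bookings at most 7 days ahead are mostly Business.
  val example: Tree = Split(0, 7.0,
    Leaf(Map("Business" -> 8, "Leisure" -> 2)),
    Leaf(Map("Business" -> 1, "Leisure" -> 9)))
}
```

With this example tree, a booking made 3 days ahead falls into the left leaf, so it is classified as Business with a confidence of 8 out of 10 points.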
Decision tree - node optimization
[Figure: Reservation Subspace, advance purchase vs. class (FIRST, BUSINESS, ECONOMY)]
Information Gain
Decision Trees
Streaming Decision Trees
The batch version of decision trees requires a view of the full learning data set
In streaming
• each point can only be seen once
• the processing should be fast; we cannot afford too much disk access
Decision Trees
Streaming Decision Tree – get the intuition!
Instead of using every point, the points are compressed
The real position of each point is then approximated
Streaming decision tree - get the intuition!
[Figure: Reservation Subspace, advance purchase vs. class (FIRST, BUSINESS, ECONOMY). Points are compressed into groups (busy procrastinators, foreseeing businessmen, tourists), each labeled Business or Leisure (supervision) and carrying a count of points nearby.]
Decision Trees
Streaming Decision Tree – the Question
“How to find split points for a decision tree?”
[Figure: histogram for label 1 / feature 1; x axis: Feature 1 (bins at 2, 5, 7.5, 9, 11), y axis: Count]
Decision Trees
Compressing Data
An approximate histogram is built for each label/feature
[Figure: approximate histograms of feature 1 per label (label 1 with bins at 2, 5, 7.5, 9, 11; label n with bins at 4, 6, 8.5, 10, 13) and the Total histogram (bins at 1, 3.5, 7, 11, 14)]
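One way to build such an approximate histogram, in the spirit of the Ben-Haim & Tom-Tov update step, is to add every point as a singleton bin and, when the bin budget is exceeded, fuse the two closest bins into their count-weighted centroid. The sketch below assumes that scheme; the names `Bin`, `shrink` and `update` are illustrative and do not come from the talk.

```scala
object HistogramSketch {
  // One bin: a centroid position and the number of points it stands for.
  final case class Bin(center: Double, count: Long)

  // Fuse the two closest adjacent bins into their count-weighted centroid
  // until at most maxBins remain. Expects bins sorted by center.
  @scala.annotation.tailrec
  def shrink(bins: Vector[Bin], maxBins: Int): Vector[Bin] =
    if (bins.size <= maxBins || bins.size < 2) bins
    else {
      val i = (0 until bins.size - 1).minBy(k => bins(k + 1).center - bins(k).center)
      val (a, b) = (bins(i), bins(i + 1))
      val n = a.count + b.count
      val fused = Bin((a.center * a.count + b.center * b.count) / n, n)
      shrink(bins.patch(i, Vector(fused), 2), maxBins)
    }

  // Add one point: bump an existing bin with the same center, or insert a
  // singleton bin at the sorted position, then shrink back to maxBins.
  def update(bins: Vector[Bin], x: Double, maxBins: Int): Vector[Bin] = {
    val (lower, upper) = bins.span(_.center < x)
    val inserted = upper.headOption match {
      case Some(b) if b.center == x => (lower :+ b.copy(count = b.count + 1)) ++ upper.tail
      case _                        => (lower :+ Bin(x, 1L)) ++ upper
    }
    shrink(inserted, maxBins)
  }
}
```

For instance, feeding the points 23, 19, 10, 16, 36, 2 into an empty histogram with at most 5 bins keeps 2, 10, 23 and 36 as singleton bins and fuses 16 and 19 into one bin at 17.5 with count 2.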
Decision Trees
Prepare Split Candidates (1/2)
For each feature, all histograms of that feature are merged
[Figure: Total histogram, bins at 1, 3.5, 7, 11, 14]
Choose the split candidates so that every interval between two consecutive candidates contains the same number of points (the colored areas all have the same size)
[Figure: Total histogram with split candidates u1 and u2]
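The two steps above (merging the histograms of a feature, then placing equal-count split candidates) can be sketched as follows. This is a simplified sketch under stated assumptions: the cumulative count is interpolated linearly between bin centers, which is cruder than the interpolation used in the Ben-Haim & Tom-Tov paper, and the names `Bin`, `merge`, `cumulativeAtCenters` and `splitCandidates` are hypothetical.

```scala
object SplitCandidateSketch {
  final case class Bin(center: Double, count: Long)

  // Merge two histograms of the same feature: pool the bins, sort them, and
  // repeatedly fuse the two closest bins until at most maxBins remain.
  def merge(h1: Vector[Bin], h2: Vector[Bin], maxBins: Int): Vector[Bin] = {
    @scala.annotation.tailrec
    def shrink(bins: Vector[Bin]): Vector[Bin] =
      if (bins.size <= maxBins || bins.size < 2) bins
      else {
        val i = (0 until bins.size - 1).minBy(k => bins(k + 1).center - bins(k).center)
        val (a, b) = (bins(i), bins(i + 1))
        val n = a.count + b.count
        shrink(bins.patch(i, Vector(Bin((a.center * a.count + b.center * b.count) / n, n)), 2))
      }
    shrink((h1 ++ h2).sortBy(_.center))
  }

  // Estimated number of points at or below each bin center:
  // all earlier bins plus half of the bin itself.
  def cumulativeAtCenters(bins: Vector[Bin]): Vector[Double] = {
    val before = bins.scanLeft(0.0)(_ + _.count).init
    before.zip(bins).map { case (s, b) => s + b.count / 2.0 }
  }

  // Returns parts - 1 candidates so that each interval holds roughly
  // total / parts points. Expects a non-empty histogram sorted by center.
  def splitCandidates(bins: Vector[Bin], parts: Int): Vector[Double] = {
    val cum = cumulativeAtCenters(bins)
    val total = bins.map(_.count).sum.toDouble
    (1 until parts).toVector.map { j =>
      val target = total * j / parts
      val i = cum.lastIndexWhere(_ <= target)
      if (i < 0) bins.head.center                      // target lies before the first center
      else if (i == bins.size - 1) bins.last.center    // target lies after the last center
      else {
        val z = (target - cum(i)) / (cum(i + 1) - cum(i))
        bins(i).center + z * (bins(i + 1).center - bins(i).center)
      }
    }
  }
}
```

On a made-up histogram with bins at 1, 3, 5 and 7 holding 10 points each, asking for 4 parts yields the candidates 2, 4 and 6.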
Decision Trees
Prepare Split Candidates (2/2)
Find the split point that maximizes the information gain, using
• the split candidates
• the histograms per feature/label
[Figure: Total histogram with split candidates u1 and u2]
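Selecting the split can be sketched as: for every candidate, estimate from the per-label histograms how many points of each label fall on either side, compute the information gain (parent entropy minus the weighted entropy of the two sides), and keep the best candidate. The sketch below makes a simplifying assumption that is not from the talk: all points of a bin are counted at the bin's center. The names `Bin`, `entropy`, `countBelow`, `informationGain` and `bestSplit` are hypothetical.

```scala
object BestSplitSketch {
  final case class Bin(center: Double, count: Long)

  // Shannon entropy, in bits, of a vector of class counts.
  def entropy(counts: Seq[Double]): Double = {
    val total = counts.sum
    if (total <= 0) 0.0
    else counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)
    }.sum
  }

  // Simplification: every point of a bin is assumed to sit at the bin center.
  def countBelow(bins: Vector[Bin], u: Double): Double =
    bins.filter(_.center <= u).map(_.count).sum.toDouble

  // Information gain of splitting one feature at u, given one histogram per label.
  def informationGain(byLabel: Map[String, Vector[Bin]], u: Double): Double = {
    val labels = byLabel.keys.toVector
    val all = labels.map(l => byLabel(l).map(_.count).sum.toDouble)
    val left = labels.map(l => countBelow(byLabel(l), u))
    val right = all.zip(left).map { case (a, l) => a - l }
    val n = all.sum
    if (n <= 0) 0.0
    else entropy(all) - (left.sum / n) * entropy(left) - (right.sum / n) * entropy(right)
  }

  // The candidate with the highest information gain, with that gain.
  def bestSplit(byLabel: Map[String, Vector[Bin]], candidates: Seq[Double]): Option[(Double, Double)] =
    if (candidates.isEmpty) None
    else Some(candidates.map(u => (u, informationGain(byLabel, u))).maxBy(_._2))

  // Made-up histograms of advance purchase (days) for two labels.
  val example: Map[String, Vector[Bin]] = Map(
    "Business" -> Vector(Bin(2.0, 4L), Bin(5.0, 4L)),
    "Leisure"  -> Vector(Bin(20.0, 4L), Bin(30.0, 4L)))
}
```

With the made-up histograms in `example` and the candidates 3.5, 10 and 25, the split at 10 separates the two labels completely, so it is the one selected, with a gain of one bit.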
Decision Trees
Determine Split
The intuition is not exactly precise
[Figure: Reservation Subspace, advance purchase vs. class (FIRST, BUSINESS, ECONOMY), points labeled Business or Leisure (supervision)]
• The histograms can no longer be used for further splits
• And of course, we have already lost the original data
Decision Trees
Subsequent Split – get the intuition!
A different data set is used for each iteration*
* If there is not enough data, the same data can be re-injected instead; Kafka is very good for this
Implementation
Stream Learner
We use two Kafka streams:
• One for the labeled data stream
• One for the tree developed so far (the topic is also used by classifying applications)
Both are needed because each message must be annotated with the tree built so far
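The annotation step can be illustrated without Flink as an object with one callback per input stream, in the style of a co-flat-map: tree updates only change local state, and every labeled point is emitted together with the latest tree seen. This is a plain-Scala sketch; `Point`, `Tree`, `AnnotatedPoint`, `Annotator` and `replay` are stand-ins invented for the example, not the talk's classes.

```scala
object AnnotateSketch {
  final case class Point(features: Vector[Double], label: String)
  final case class Tree(version: Int) // stand-in for the real tree structure
  final case class AnnotatedPoint(point: Point, tree: Tree)

  // One callback per input stream, as in a co-flat-map.
  final class Annotator(initial: Tree) {
    private var latest: Tree = initial

    // Labeled point: emit it together with the latest tree seen so far.
    def onPoint(p: Point): Seq[AnnotatedPoint] = Seq(AnnotatedPoint(p, latest))

    // Tree update: remember it, emit nothing.
    def onTree(t: Tree): Seq[AnnotatedPoint] = { latest = t; Seq.empty }
  }

  // Replay an interleaved sequence of events from the two streams, in order.
  def replay(events: Seq[Either[Point, Tree]], initial: Tree): Seq[AnnotatedPoint] = {
    val annotator = new Annotator(initial)
    events.flatMap {
      case Left(p)  => annotator.onPoint(p)
      case Right(t) => annotator.onTree(t)
    }
  }
}
```

A point that arrives before the first tree update is annotated with the initial tree; points that arrive afterwards carry the updated one. How a real job should treat points that arrive before any tree exists is a design choice this sketch does not settle.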
Implementation
Code Outline
import java.util.concurrent.TimeUnit
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Kafka sources; their construction is not shown on the slide
val kafkaDataStream: DataStream[Point] = ???
val kafkaTreeStream: DataStream[Node] = ???

// annotate each message with the latest tree
val annotatedDataStream: DataStream[AnnotatedPoint] =
  (kafkaDataStream connect kafkaTreeStream) flatMap (new AnnotateMessageCoFlatMap(…))

// create histograms per feature / node, merged over one-minute windows
val histograms = annotatedDataStream
  .map { p => toSingletonHistograms(p) }
  .timeWindowAll(Time.of(1, TimeUnit.MINUTES))
  .reduce { (n1, n2) => mergeHistogram(n1, n2) }

// merge the windowed histograms per node id
val mergedHistograms = histograms
  .keyBy(_.id)
  .reduce { (n1, n2) => mergeHistogram(n1, n2) }

// split the nodes that have enough points
val newTree = mergedHistograms
  .filter(hs => haveEnoughPoints(hs) && toSplit(hs))
  .map { n =>
    val splitPoint = maxInformationGain(calculateSplitCandidates(n))
    // the slide is cut off here
val histogram ➔ accumulate
var histogram ➔ re-accumulate from 0
Conclusions
Summary
Streaming algorithms based on approximate histograms are explained
The streaming decision tree algorithms make interesting classifier properties possible: freshness and continuous learning
Flink together with Kafka allows a clean implementation of the algorithm
Conclusions
Next Steps
Random Forests:
• Trees with randomly selected features at each level
• Trees with different spans of data (trees with more but older data might behave worse than trees with less but fresher data: forgetting capabilities)
• Providing information on which type of tree behaves better at a given period of time (meta learning)
Thanks!
Credit to: Yiqing Yan (Eurecom) & Tianshu Yang (Telecom Bretagne), Amadeus Interns

Anwar Rizal – Streaming & Parallel Decision Tree in Flink