real time big data management
Albert Bifet (@abifet)
Paris, 7 October 2015
albert.bifet@telecom-paristech.fr
data streams
Big Data & Real Time
hadoop
Hadoop architecture deals with datasets, not data streams.
apache s4
Apache S4
apache storm
Storm from Twitter
apache storm
Stream, Spout, Bolt, Topology
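To make these four concepts concrete, here is a minimal sketch in Scala against Storm's (then backtype.storm) Java API; RandomSentenceSpout, SplitSentenceBolt and WordCountBolt are hypothetical user-defined classes:

import backtype.storm.{Config, LocalCluster}
import backtype.storm.topology.TopologyBuilder
import backtype.storm.tuple.Fields

// a topology wires spouts (stream sources) to bolts (processing steps)
val builder = new TopologyBuilder
builder.setSpout("sentences", new RandomSentenceSpout, 2)
builder.setBolt("split", new SplitSentenceBolt, 4)
  .shuffleGrouping("sentences")                 // subscribe to the spout's stream
builder.setBolt("count", new WordCountBolt, 4)
  .fieldsGrouping("split", new Fields("word")) // partition the stream by word

val cluster = new LocalCluster
cluster.submitTopology("word-count", new Config, builder.createTopology())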
apache storm
Storm characteristics for real-time data processing workloads:
1 Fast
2 Scalable
3 Fault-tolerant
4 Reliable
5 Easy to operate
apache kafka from linkedin
Apache Kafka is a fast, scalable, durable, and fault-tolerant
publish-subscribe messaging system.
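As a sketch of the publish side, a minimal Scala producer using Kafka's Java client (the topic name "tweets" is hypothetical):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")

// publish one message to the "tweets" topic; consumers subscribe independently
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("tweets", "key", "hello stream"))
producer.close()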
apache samza from linkedin
Storm and Samza are fairly similar. Both systems provide:
1 a partitioned stream model,
2 a distributed execution environment,
3 an API for stream processing,
4 fault tolerance,
5 Kafka integration
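A Samza job is written as a StreamTask whose process method is invoked once per incoming message; a minimal sketch in Scala, with hypothetical output system/stream names:

import org.apache.samza.system.{IncomingMessageEnvelope, OutgoingMessageEnvelope, SystemStream}
import org.apache.samza.task.{MessageCollector, StreamTask, TaskCoordinator}

class SplitterTask extends StreamTask {
  override def process(envelope: IncomingMessageEnvelope,
                       collector: MessageCollector,
                       coordinator: TaskCoordinator): Unit = {
    val line = envelope.getMessage.asInstanceOf[String]
    // forward every word to a (hypothetical) Kafka output stream
    line.split(" ").foreach { word =>
      collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "words"), word))
    }
  }
}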
apache spark streaming
Spark Streaming is an extension of Spark that allows
processing data streams using micro-batches of data.
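A minimal sketch of the micro-batch model in Scala: the stream is cut into 1-second batches, and each batch is processed with the usual Spark operators:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamWordCount")
val ssc = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches

// word count over each micro-batch read from a socket
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()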
apache flink motivation
1 Real time computation: streaming computation
2 Fast, as there is no need to write to disk
3 Easy to write code
real time computation: streaming computation
MapReduce Limitations
Example
How to compute in real time (latency less than 1 second):
1 predictions
2 frequent items, such as Twitter hashtags
3 sentiment analysis
easy to write code
case class Word(word: String, frequency: Int)

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()

DataStream API (streaming):

val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()
what is apache flink?
Figure 1: Apache Flink Overview
batch and streaming engines
Figure 2: Batch, streaming and hybrid data processing engines.
batch comparison
Figure 3: Comparison between Hadoop, Spark and Flink.
streaming comparison
Figure 4: Comparison between Storm, Spark and Flink.
scala language
• What is Scala?
• object oriented
• functional
• What is Scala like?
• Scala is compatible
• Scala is concise
• Scala is high-level
• Scala is statically typed
short course on scala
• Easy to use: includes an interpreter
• Variables:
• val: immutable (preferable)
• var: mutable
• Scala treats everything as objects with methods
• Scala has first-class functions
• Functions:
def max(x: Int, y: Int): Int = {
  if (x > y) x
  else y
}
def max2(x: Int, y: Int) = if (x > y) x else y
short course on scala
• Functional:
args.foreach((arg: String) => println(arg))
args.foreach(arg => println(arg))
args.foreach(println)
• Imperative:
for (arg <- args) println(arg)
• Scala achieves a conceptual simplicity by treating
everything, from arrays to expressions, as objects with
methods.
(1).+(2)
greetStrings(0) = "Hello"
greetStrings.update(0, "Hello")
val numNames2 = Array.apply("zero", "one", "two")
short course on scala
• Array: mutable sequence of objects that share the same type
• List: immutable sequence of objects that share the same
type
• Tuple: immutable sequence of objects that does not share
the same type
val pair = (99, "Luftballons")
println(pair._1)
println(pair._2)
short course on scala
• Sets and maps
var jetSet = Set("Boeing", "Airbus")
jetSet += "Lear"
println(jetSet.contains("Cessna"))

import scala.collection.mutable.Map
val treasureMap = Map[Int, String]()
treasureMap += (1 -> "Go to island.")
treasureMap += (2 -> "Find big X on ground.")
treasureMap += (3 -> "Dig.")
println(treasureMap(2))
short course on scala
• Functional style
• Does not contain any var
def printArgs(args: Array[String]): Unit = {
  var i = 0
  while (i < args.length) {
    println(args(i))
    i += 1
  }
}

def printArgs(args: Array[String]): Unit = {
  for (arg <- args)
    println(arg)
}

def printArgs(args: Array[String]): Unit = {
  args.foreach(println)
}

def formatArgs(args: Array[String]) = args.mkString("\n")
println(formatArgs(args))
short course on scala
• Prefer vals, immutable objects, and methods without side
effects.
• Use vars, mutable objects, and methods with side effects
when you have a specific need and justification for them.
• In a Scala program, a semicolon at the end of a statement is
usually optional.
• A singleton object definition looks like a class definition,
except instead of the keyword class you use the keyword
object .
• Scala provides a trait, scala.Application:
object FallWinterSpringSummer extends Application {
  for (season <- List("fall", "winter", "spring"))
    println(season + ": " + calculate(season))
}
short course on scala
• Scala has first-class functions: you can write down
functions as unnamed literals and then pass them around as
values.
(x: Int) => x + 1
• Short forms of function literals:
someNumbers.filter((x: Int) => x > 0)
someNumbers.filter((x) => x > 0)
someNumbers.filter(x => x > 0)
someNumbers.filter(_ > 0)
someNumbers.foreach(x => println(x))
someNumbers.foreach(println _)
short course on scala
• Zipping lists: zip and unzip
• The zip operation takes two lists and forms a list of pairs.
• A useful special case is to zip a list with its index, done
most efficiently with the zipWithIndex method.
• Mapping over lists: map, flatMap and foreach
• Filtering lists: filter, partition, find, takeWhile, dropWhile,
and span
• Folding lists: /: and :\, or foldLeft and foldRight.
(z /: List(a, b, c))(op)
equals
op(op(op(z, a), b), c)
• Sorting lists: sortWith
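For concreteness, the same operations on small lists (expected results in comments):

val xs = List(1, 2, 3)
val ys = List("a", "b", "c")

xs zip ys             // List((1,"a"), (2,"b"), (3,"c"))
ys.zipWithIndex       // List(("a",0), ("b",1), ("c",2))
(0 /: xs)(_ + _)      // 6, same as xs.foldLeft(0)(_ + _)
(xs :\ 0)(_ + _)      // 6, same as xs.foldRight(0)(_ + _)
xs.sortWith(_ > _)    // List(3, 2, 1)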
apache flink architecture
references
Apache Flink Documentation
http://dataartisans.github.io/flink-training/
api introduction
Flink programs
1 Input from source
2 Apply operations
3 Output to sink
batch and streaming apis
1 DataSet API
• Example: Map/Reduce paradigm
2 DataStream API
• Example: Live Stock Feed
streaming and batch comparison
architecture overview
Figure 5: The JobManager is the coordinator of the Flink system;
TaskManagers are the workers that execute parts of the parallel
programs.
client
1 Optimize
2 Construct job graph
3 Pass job graph to job manager
4 Retrieve job results
job manager
1 Parallelization: Create Execution Graph
2 Scheduling: Assign tasks to task managers
3 State: Supervise the execution
task manager
1 Operations are split up into tasks depending on the specified
parallelism
2 Each parallel instance of an operation runs in a separate
task slot
3 The scheduler may run several tasks from different
operators in one task slot
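How operators map to parallel tasks can be steered from user code; a minimal sketch, assuming the Scala DataStream API's setParallelism methods and the Splitter function from the streaming word count example later in the deck:

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)          // default: four parallel instances per operator

val words = env.socketTextStream("localhost", 9999)
  .flatMap(new Splitter())
  .setParallelism(2)           // per-operator override: two parallel subtasks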
component stack
1 API layer: implements multiple APIs that create operator
DAGs for their programs. Each API needs to provide utilities
(serializers, comparators) that describe the interaction
between its data types and the runtime.
2 Optimizer and common API layer: takes programs in the form
of operator DAGs. The operators are specific (e.g., Map,
Join, Filter, Reduce, …), but are data type agnostic.
3 Runtime layer: receives a program in the form of a
JobGraph. A JobGraph is a generic parallel data flow with
arbitrary tasks that consume and produce data streams.
flink topologies
Flink programs
1 Input from source
2 Apply operations
3 Output to sink
sources (selection)
• Collection-based
• fromCollection
• fromElements
• File-based
• TextInputFormat
• CsvInputFormat
• Other
• SocketInputFormat
• KafkaInputFormat
• Databases
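A few of these sources in the Scala DataSet API, as a sketch (the paths are hypothetical):

val env = ExecutionEnvironment.getExecutionEnvironment

// collection-based
val nums  = env.fromCollection(Seq(1, 2, 3))
val words = env.fromElements("foo", "bar")

// file-based (TextInputFormat / CsvInputFormat under the hood)
val lines = env.readTextFile("hdfs:///path/to/input")
val pairs = env.readCsvFile[(String, Int)]("file:///path/to/data.csv")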
sinks (selection)
• File-based
• TextOutputFormat
• CsvOutputFormat
• PrintOutput
• Others
• SocketOutputFormat
• KafkaOutputFormat
• Databases
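And the matching output side, as a sketch (counts stands in for a DataSet such as the one built in the word count examples; the paths are hypothetical):

counts.writeAsText("file:///path/to/out")     // TextOutputFormat
counts.writeAsCsv("file:///path/to/out.csv")  // CsvOutputFormat
counts.print()                                // print to stdout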
apache flink algorithms
flink skeleton program
1 Obtain an ExecutionEnvironment,
2 Load/create the initial data,
3 Specify transformations on this data,
4 Specify where to put the results of your computations,
5 Trigger the program execution
java wordcount example
public class WordCountExample {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env =
      ExecutionEnvironment.getExecutionEnvironment();

    DataSet<String> text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?");

    DataSet<Tuple2<String, Integer>> wordCounts = text
      .flatMap(new LineSplitter())
      .groupBy(0)
      .sum(1);

    wordCounts.print();
  }
  ....
}
java wordcount example
public class WordCountExample {
  public static void main(String[] args) throws Exception {
    ....
  }

  public static class LineSplitter implements
      FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line,
        Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
scala wordcount example
import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?")

    val counts = text.flatMap
        { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)

    counts.print()
  }
}
java 8 wordcount example
public class WordCountExample {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env =
      ExecutionEnvironment.getExecutionEnvironment();

    DataSet<String> text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?");

    text.map(line -> line.split(" "))
      .flatMap((String[] wordArray,
                Collector<Tuple2<String, Integer>> out)
        -> Arrays.stream(wordArray)
             .forEach(t -> out.collect(new Tuple2<>(t, 1)))
      )
      .groupBy(0)
      .sum(1)
      .print();
  }
}
data streams algorithms
flink skeleton program
1 Obtain a StreamExecutionEnvironment,
2 Load/create the initial data,
3 Specify transformations on this data,
4 Specify where to put the results of your computations,
5 Trigger the program execution
java wordcount example
public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
      StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<Tuple2<String, Integer>> dataStream = env
      .socketTextStream("localhost", 9999)
      .flatMap(new Splitter())
      .groupBy(0)
      .sum(1);

    dataStream.print();
    env.execute("Socket Stream WordCount");
  }
  ....
}
java wordcount example
public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    ....
  }

  public static class Splitter implements
      FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String sentence,
        Collector<Tuple2<String, Integer>> out) throws Exception {
      for (String word : sentence.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
scala wordcount example
object WordCount {
  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)

    val counts = text.flatMap { _.toLowerCase.split("\\W+")
        filter { _.nonEmpty } }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)

    counts.print
    env.execute("Scala Socket Stream WordCount")
  }
}
obtain a streamexecutionenvironment
The StreamExecutionEnvironment is the basis for all Flink programs.

StreamExecutionEnvironment.getExecutionEnvironment
StreamExecutionEnvironment.createLocalEnvironment(parallelism)
StreamExecutionEnvironment.createRemoteEnvironment(host: String,
  port: Int, parallelism: Int, jarFiles: String*)

env.socketTextStream(host, port)
env.fromElements(elements...)
env.addSource(sourceFunction)
specify transformations on this data
• Map
• FlatMap
• Filter
• Reduce
• Fold (see the sketch after the Union slide below)
• Union
3) specify transformations on this data
Map
Takes one element and produces one element.
data.map { x => x.toInt }
3) specify transformations on this data
FlatMap
Takes one element and produces zero, one, or more elements.
data.flatMap { str => str.split(" ") }
3) specify transformations on this data
Filter
Evaluates a boolean function for each element and retains
those for which the function returns true.
data.filter { _ > 1000 }
3) specify transformations on this data
Reduce
Combines a group of elements into a single element by
repeatedly combining two elements into one. Reduce may be
applied on a full data set, or on a grouped data set.
data.reduce { _ + _ }
3) specify transformations on this data
Union
Produces the union of two data sets.
data.union(data2)
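3) specify transformations on this data
Fold
Combines elements with an initial value, folding each element
into an accumulator. Fold appears in the transformation list
above but has no slide of its own; a minimal sketch, assuming
the Scala API's fold(initialValue)(fun) signature:
data.fold("start") { (acc, x) => acc + "-" + x }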
window operators
The user has different ways of using the result of a window
operation:
• windowedDataStream.flatten() - streams the results
element-wise and returns a DataStream<T> where T is the
type of the underlying windowed stream
• windowedDataStream.getDiscretizedStream() -
returns a DataStream<StreamWindow<T>> for applying
some advanced logic on the stream windows itself.
• Calling any window transformation further transforms the
windows, while preserving the windowing logic.
dataStream.window(Time.of(5, TimeUnit.SECONDS))
  .every(Time.of(1, TimeUnit.SECONDS));

dataStream.window(Count.of(100))
  .every(Time.of(1, TimeUnit.MINUTES));
gelly: flink graph api
• Gelly is a Java Graph API for Flink.
• In Gelly, graphs can be transformed and modified using
high-level functions similar to the ones provided by the
batch processing API.
• In Gelly, a Graph is represented by a DataSet of vertices and
a DataSet of edges.
• The Graph nodes are represented by the Vertex type. A
Vertex is defined by a unique ID and a value.
// create a new vertex with a Long ID and a String value
Vertex<Long, String> v = new Vertex<Long, String>(1L, "foo");

// create a new vertex with a Long ID and no value
Vertex<Long, NullValue> v =
  new Vertex<Long, NullValue>(1L, NullValue.getInstance());
gelly: flink graph api
• The graph edges are represented by the Edge type.
• An Edge is defined by a source ID (the ID of the source
Vertex), a target ID (the ID of the target Vertex) and an
optional value.
• The source and target IDs should be of the same type as the
Vertex IDs. Edges with no value have a NullValue value type.
Edge<Long, Double> e = new Edge<Long, Double>(1L, 2L, 0.5);

// reverse the source and target of this edge
Edge<Long, Double> reversed = e.reverse();

Double weight = e.getValue(); // weight = 0.5
table api - relational queries
• Flink provides an API that allows specifying operations
using SQL-like expressions.
• Instead of manipulating DataSet or DataStream you work
with Table on which relational operations can be performed.
import org.apache.flink.api.scala._
import org.apache.flink.api.scala.table._

case class WC(word: String, count: Int)

val input = env.fromElements(WC("hello", 1),
  WC("hello", 1), WC("ciao", 1))
val expr = input.toTable
val result = expr.groupBy('word)
  .select('word, 'count.sum as 'count).toDataSet[WC]