real time big data management
Albert Bifet (@abifet)
Paris, 7 October 2015
albert.bifet@telecom-paristech.fr
data streams
Big Data & Real Time
hadoop
Hadoop architecture deals with datasets, not data streams.
apache s4
Apache S4
apache storm
Storm from Twitter
apache storm
Stream, Spout, Bolt, Topology
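To make these four concepts concrete, here is a minimal sketch in Scala against Storm's (then backtype.storm) Java API; RandomSentenceSpout, SplitSentenceBolt and WordCountBolt are hypothetical user-defined classes:

import backtype.storm.{Config, LocalCluster}
import backtype.storm.topology.TopologyBuilder
import backtype.storm.tuple.Fields

// a topology wires spouts (stream sources) to bolts (processing steps)
val builder = new TopologyBuilder
builder.setSpout("sentences", new RandomSentenceSpout, 2)
builder.setBolt("split", new SplitSentenceBolt, 4)
  .shuffleGrouping("sentences")                 // subscribe to the spout's stream
builder.setBolt("count", new WordCountBolt, 4)
  .fieldsGrouping("split", new Fields("word")) // partition the stream by word

val cluster = new LocalCluster
cluster.submitTopology("word-count", new Config, builder.createTopology())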
apache storm
Storm characteristics for real-time data processing workloads:
1 Fast
2 Scalable
3 Fault-tolerant
4 Reliable
5 Easy to operate
apache kafka from linkedin
Apache Kafka is a fast, scalable, durable, and fault-tolerant
publish-subscribe messaging system.
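As a sketch of the publish side, a minimal Scala producer using Kafka's Java client (the topic name "tweets" is hypothetical):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")

// publish one message to the "tweets" topic; consumers subscribe independently
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("tweets", "key", "hello stream"))
producer.close()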
apache samza from linkedin
Storm and Samza are fairly similar. Both systems provide:
1 a partitioned stream model,
2 a distributed execution environment,
3 an API for stream processing,
4 fault tolerance,
5 Kafka integration
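A Samza job is written as a StreamTask whose process method is invoked once per incoming message; a minimal sketch in Scala, with hypothetical output system/stream names:

import org.apache.samza.system.{IncomingMessageEnvelope, OutgoingMessageEnvelope, SystemStream}
import org.apache.samza.task.{MessageCollector, StreamTask, TaskCoordinator}

class SplitterTask extends StreamTask {
  override def process(envelope: IncomingMessageEnvelope,
                       collector: MessageCollector,
                       coordinator: TaskCoordinator): Unit = {
    val line = envelope.getMessage.asInstanceOf[String]
    // forward every word to a (hypothetical) Kafka output stream
    line.split(" ").foreach { word =>
      collector.send(new OutgoingMessageEnvelope(new SystemStream("kafka", "words"), word))
    }
  }
}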
apache spark streaming
Spark Streaming is an extension of Spark that allows
processing data streams using micro-batches of data.
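A minimal sketch of the micro-batch model in Scala: the stream is cut into 1-second batches, and each batch is processed with the usual Spark operators:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamWordCount")
val ssc = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches

// word count over each micro-batch read from a socket
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()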
apache flink motivation
1 Real time computation: streaming computation
2 Fast, as there is no need to write to disk
3 Easy to write code
real time computation: streaming computation
MapReduce Limitations
Example
How to compute in real time (latency less than 1 second):
1 predictions
2 frequent items, such as Twitter hashtags
3 sentiment analysis
easy to write code
case class Word(word: String, frequency: Int)

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()

DataStream API (streaming):

val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()
what is apache flink?
Figure 1: Apache Flink Overview
batch and streaming engines
Figure 2: Batch, streaming and hybrid data processing engines.
batch comparison
Figure 3: Comparison between Hadoop, Spark and Flink.
streaming comparison
Figure 4: Comparison between Storm, Spark and Flink.
scala language
• What is Scala?
• object oriented
• functional
• What is Scala like?
• Scala is compatible
• Scala is concise
• Scala is high-level
• Scala is statically typed
short course on scala
• Easy to use: includes an interpreter
• Variables:
• val: immutable (preferable)
• var: mutable
• Scala treats everything as objects with methods
• Scala has first-class functions
• Functions:
def max(x: Int, y: Int): Int = {
  if (x > y) x
  else y
}
def max2(x: Int, y: Int) = if (x > y) x else y
short course on scala
• Functional:
args.foreach((arg: String) => println(arg))
args.foreach(arg => println(arg))
args.foreach(println)
• Imperative:
for (arg <- args) println(arg)
• Scala achieves a conceptual simplicity by treating
everything, from arrays to expressions, as objects with
methods.
(1).+(2)
greetStrings(0) = "Hello"
greetStrings.update(0, "Hello")
val numNames2 = Array.apply("zero", "one", "two")
short course on scala
• Array: mutable sequence of objects that share the same type
• List: immutable sequence of objects that share the same
type
• Tuple: immutable sequence of objects that does not share
the same type
val pair = (99, "Luftballons")
println(pair._1)
println(pair._2)
short course on scala
• Sets and maps
var jetSet = Set("Boeing", "Airbus")
jetSet += "Lear"
println(jetSet.contains("Cessna"))

import scala.collection.mutable.Map
val treasureMap = Map[Int, String]()
treasureMap += (1 -> "Go to island.")
treasureMap += (2 -> "Find big X on ground.")
treasureMap += (3 -> "Dig.")
println(treasureMap(2))
short course on scala
• Functional style
• Does not contain any var
def printArgs(args: Array[String]): Unit = {
  var i = 0
  while (i < args.length) {
    println(args(i))
    i += 1
  }
}

def printArgs(args: Array[String]): Unit = {
  for (arg <- args)
    println(arg)
}

def printArgs(args: Array[String]): Unit = {
  args.foreach(println)
}

def formatArgs(args: Array[String]) = args.mkString("\n")
println(formatArgs(args))
short course on scala
• Prefer vals, immutable objects, and methods without side
effects.
• Use vars, mutable objects, and methods with side effects
when you have a specific need and justification for them.
• In a Scala program, a semicolon at the end of a statement is
usually optional.
• A singleton object definition looks like a class definition,
except instead of the keyword class you use the keyword
object .
• Scala provides a trait, scala.Application:
object FallWinterSpringSummer extends Application {
  for (season <- List("fall", "winter", "spring"))
    println(season + ": " + calculate(season))
}
short course on scala
• Scala has first-class functions: you can write down
functions as unnamed literals and then pass them around as
values.
(x: Int) => x + 1
• Short forms of function literals:
someNumbers.filter((x: Int) => x > 0)
someNumbers.filter((x) => x > 0)
someNumbers.filter(x => x > 0)
someNumbers.filter(_ > 0)
someNumbers.foreach(x => println(x))
someNumbers.foreach(println _)
short course on scala
• Zipping lists: zip and unzip
• The zip operation takes two lists and forms a list of pairs.
• A useful special case is to zip a list with its index, done
most efficiently with the zipWithIndex method.
• Mapping over lists: map, flatMap and foreach
• Filtering lists: filter, partition, find, takeWhile, dropWhile,
and span
• Folding lists: /: and :\, or foldLeft and foldRight.
(z /: List(a, b, c))(op)
equals
op(op(op(z, a), b), c)
• Sorting lists: sortWith
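For concreteness, the same operations on small lists (expected results in comments):

val xs = List(1, 2, 3)
val ys = List("a", "b", "c")

xs zip ys             // List((1,"a"), (2,"b"), (3,"c"))
ys.zipWithIndex       // List(("a",0), ("b",1), ("c",2))
(0 /: xs)(_ + _)      // 6, same as xs.foldLeft(0)(_ + _)
(xs :\ 0)(_ + _)      // 6, same as xs.foldRight(0)(_ + _)
xs.sortWith(_ > _)    // List(3, 2, 1)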
apache flink architecture
references
Apache Flink Documentation
http://dataartisans.github.io/flink-training/
api introduction
Flink programs
1 Input from source
2 Apply operations
3 Output to sink
batch and streaming apis
1 DataSet API
• Example: Map/Reduce paradigm
2 DataStream API
• Example: Live Stock Feed
streaming and batch comparison
architecture overview
Figure 5: The JobManager is the coordinator of the Flink system;
TaskManagers are the workers that execute parts of the parallel
programs.
client
1 Optimize
2 Construct job graph
3 Pass job graph to job manager
4 Retrieve job results
job manager
1 Parallelization: Create Execution Graph
2 Scheduling: Assign tasks to task managers
3 State: Supervise the execution
task manager
1 Operations are split up into tasks depending on the specified
parallelism
2 Each parallel instance of an operation runs in a separate
task slot
3 The scheduler may run several tasks from different
operators in one task slot
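How operators map to parallel tasks can be steered from user code; a minimal sketch, assuming the Scala DataStream API's setParallelism methods and the Splitter function from the streaming word count example later in the deck:

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)          // default: four parallel instances per operator

val words = env.socketTextStream("localhost", 9999)
  .flatMap(new Splitter())
  .setParallelism(2)           // per-operator override: two parallel subtasks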
component stack
1 API layer: implements multiple APIs that create operator
DAGs for their programs. Each API needs to provide utilities
(serializers, comparators) that describe the interaction
between its data types and the runtime.
2 Optimizer and common API layer: takes programs in the form
of operator DAGs. The operators are specific (e.g., Map,
Join, Filter, Reduce, …), but are data type agnostic.
3 Runtime layer: receives a program in the form of a
JobGraph. A JobGraph is a generic parallel data flow with
arbitrary tasks that consume and produce data streams.
flink topologies
Flink programs
1 Input from source
2 Apply operations
3 Output to sink
sources (selection)
• Collection-based
• fromCollection
• fromElements
• File-based
• TextInputFormat
• CsvInputFormat
• Other
• SocketInputFormat
• KafkaInputFormat
• Databases
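A few of these sources in the Scala DataSet API, as a sketch (the paths are hypothetical):

val env = ExecutionEnvironment.getExecutionEnvironment

// collection-based
val nums  = env.fromCollection(Seq(1, 2, 3))
val words = env.fromElements("foo", "bar")

// file-based (TextInputFormat / CsvInputFormat under the hood)
val lines = env.readTextFile("hdfs:///path/to/input")
val pairs = env.readCsvFile[(String, Int)]("file:///path/to/data.csv")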
sinks (selection)
• File-based
• TextOutputFormat
• CsvOutputFormat
• PrintOutput
• Others
• SocketOutputFormat
• KafkaOutputFormat
• Databases
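And the matching output side, as a sketch (counts stands in for a DataSet such as the one built in the word count examples; the paths are hypothetical):

counts.writeAsText("file:///path/to/out")     // TextOutputFormat
counts.writeAsCsv("file:///path/to/out.csv")  // CsvOutputFormat
counts.print()                                // print to stdout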
apache flink algorithms
flink skeleton program
1 Obtain an ExecutionEnvironment,
2 Load/create the initial data,
3 Specify transformations on this data,
4 Specify where to put the results of your computations,
5 Trigger the program execution
java wordcount example
public class WordCountExample {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env =
      ExecutionEnvironment.getExecutionEnvironment();

    DataSet<String> text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?");

    DataSet<Tuple2<String, Integer>> wordCounts = text
      .flatMap(new LineSplitter())
      .groupBy(0)
      .sum(1);

    wordCounts.print();
  }
  ....
}
java wordcount example
public class WordCountExample {
  public static void main(String[] args) throws Exception {
    ....
  }

  public static class LineSplitter implements
      FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String line,
        Collector<Tuple2<String, Integer>> out) {
      for (String word : line.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
scala wordcount example
import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?")

    val counts = text.flatMap
        { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)

    counts.print()
  }
}
java 8 wordcount example
public class WordCountExample {
  public static void main(String[] args) throws Exception {
    final ExecutionEnvironment env =
      ExecutionEnvironment.getExecutionEnvironment();

    DataSet<String> text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?");

    text.map(line -> line.split(" "))
      .flatMap((String[] wordArray,
                Collector<Tuple2<String, Integer>> out)
        -> Arrays.stream(wordArray)
             .forEach(t -> out.collect(new Tuple2<>(t, 1)))
      )
      .groupBy(0)
      .sum(1)
      .print();
  }
}
data streams algorithms
flink skeleton program
1 Obtain a StreamExecutionEnvironment,
2 Load/create the initial data,
3 Specify transformations on this data,
4 Specify where to put the results of your computations,
5 Trigger the program execution
java wordcount example
public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
      StreamExecutionEnvironment.getExecutionEnvironment();

    DataStream<Tuple2<String, Integer>> dataStream = env
      .socketTextStream("localhost", 9999)
      .flatMap(new Splitter())
      .groupBy(0)
      .sum(1);

    dataStream.print();
    env.execute("Socket Stream WordCount");
  }
  ....
}
java wordcount example
public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    ....
  }

  public static class Splitter implements
      FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String sentence,
        Collector<Tuple2<String, Integer>> out) throws Exception {
      for (String word : sentence.split(" ")) {
        out.collect(new Tuple2<String, Integer>(word, 1));
      }
    }
  }
}
scala wordcount example
object WordCount {
  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)

    val counts = text.flatMap { _.toLowerCase.split("\\W+")
        filter { _.nonEmpty } }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)

    counts.print
    env.execute("Scala Socket Stream WordCount")
  }
}
obtain a streamexecutionenvironment
The StreamExecutionEnvironment is the basis for all Flink programs.

StreamExecutionEnvironment.getExecutionEnvironment
StreamExecutionEnvironment.createLocalEnvironment(parallelism)
StreamExecutionEnvironment.createRemoteEnvironment(host: String,
  port: Int, parallelism: Int, jarFiles: String*)

env.socketTextStream(host, port)
env.fromElements(elements...)
env.addSource(sourceFunction)
specify transformations on this data
• Map
• FlatMap
• Filter
• Reduce
• Fold (see the sketch after the Union slide below)
• Union
3) specify transformations on this data
Map
Takes one element and produces one element.
data.map { x => x.toInt }
3) specify transformations on this data
FlatMap
Takes one element and produces zero, one, or more elements.
data.flatMap { str => str.split(" ") }
3) specify transformations on this data
Filter
Evaluates a boolean function for each element and retains
those for which the function returns true.
data.filter { _ > 1000 }
3) specify transformations on this data
Reduce
Combines a group of elements into a single element by
repeatedly combining two elements into one. Reduce may be
applied on a full data set, or on a grouped data set.
data.reduce { _ + _ }
3) specify transformations on this data
Union
Produces the union of two data sets.
data.union(data2)
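3) specify transformations on this data
Fold
Combines elements with an initial value, folding each element
into an accumulator. Fold appears in the transformation list
above but has no slide of its own; a minimal sketch, assuming
the Scala API's fold(initialValue)(fun) signature:
data.fold("start") { (acc, x) => acc + "-" + x }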
window operators
The user has different ways of using the result of a window
operation:
• windowedDataStream.flatten() - streams the results
element-wise and returns a DataStream<T> where T is the
type of the underlying windowed stream
• windowedDataStream.getDiscretizedStream() -
returns a DataStream<StreamWindow<T>> for applying
some advanced logic on the stream windows itself.
• Calling any window transformation further transforms the
windows, while preserving the windowing logic.
dataStream.window(Time.of(5, TimeUnit.SECONDS))
  .every(Time.of(1, TimeUnit.SECONDS));

dataStream.window(Count.of(100))
  .every(Time.of(1, TimeUnit.MINUTES));
gelly: flink graph api
• Gelly is a Java Graph API for Flink.
• In Gelly, graphs can be transformed and modified using
high-level functions similar to the ones provided by the
batch processing API.
• In Gelly, a Graph is represented by a DataSet of vertices and
a DataSet of edges.
• The Graph nodes are represented by the Vertex type. A
Vertex is defined by a unique ID and a value.
// create a new vertex with a Long ID and a String value
Vertex<Long, String> v = new Vertex<Long, String>(1L, "foo");

// create a new vertex with a Long ID and no value
Vertex<Long, NullValue> v =
  new Vertex<Long, NullValue>(1L, NullValue.getInstance());
gelly: flink graph api
• The graph edges are represented by the Edge type.
• An Edge is defined by a source ID (the ID of the source
Vertex), a target ID (the ID of the target Vertex) and an
optional value.
• The source and target IDs should be of the same type as the
Vertex IDs. Edges with no value have a NullValue value type.
Edge<Long, Double> e = new Edge<Long, Double>(1L, 2L, 0.5);

// reverse the source and target of this edge
Edge<Long, Double> reversed = e.reverse();

Double weight = e.getValue(); // weight = 0.5
table api - relational queries
• Flink provides an API that allows specifying operations
using SQL-like expressions.
• Instead of manipulating DataSet or DataStream you work
with Table on which relational operations can be performed.
import org.apache.flink.api.scala._
import org.apache.flink.api.scala.table._

case class WC(word: String, count: Int)

val input = env.fromElements(WC("hello", 1),
  WC("hello", 1), WC("ciao", 1))
val expr = input.toTable
val result = expr.groupBy('word)
  .select('word, 'count.sum as 'count).toDataSet[WC]