JEEConf 2017 - In-Memory Data Streams With Hazelcast Jet

In-Memory Data Streams With Hazelcast Jet
NEIL STEVENSON
neil@hazelcast.com
27th May 2017
13:25-14:10

© 2017 Hazelcast Inc. Confidential & Proprietary
Outline
• Hazelcast
• → The company, the software, and my role
• Background
• → Why stream at all ?
• Java 8 streams
• → What did Java 8 add to Java 7
• → Why isn’t this good enough ?
• Hazelcast Jet, part #1
• → Introduction and outline architecture
• → Low level abstractions : directed acyclic graphs
• A sample application, available to download : not Word Count
• Hazelcast Jet, part #2
• → Higher level abstractions → distributed java.util.stream

Hazelcast : The company, the software and my role
The Company
Founded in 2008, based out of Palo Alto, California with offices worldwide
Provides commercial support and value-add subscription features for open source Hazelcast software
The Software
Apache 2 licensed, available to download from GitHub, from https://hazelcast.org or https://hazelcast.com
My Role
Solutions Architect – help customers, give talks, drink coffee, write code, drink coffee

Part 1 – Fast Big Data
DAG = Directed Acyclic Graph
Model the flow of data from processing stage to processing stage
→ a stream of data, potentially infinite
→ process as it comes in, don’t save first, maybe never save
→ enrich, deplete, filter, split, etc as data passes through
→ at memory speeds, no waiting for disks

Part 1 – Fast Big Data
Stream and Fast In-Memory Batch Processing
[Diagram: Ingest from IoT, Social Networks and Enterprise Applications (stream) and from Databases/Hazelcast IMDG and HDFS/Spark (batch), with Enrichment Databases alongside; Output to Alerts, Enterprise Applications, Interactive Analytics and Databases/Hazelcast IMDG]

Jet : Directed Acyclic Graph
VERTEX
The vertex is just the processing node in a pipeline.
→ Input comes in from somewhere, the first stage or the previous stage
→ Output goes out somewhere, the last stage or the next stage
→ Stateless or stateful
→ Split, filter, enrich, deplete, fan-out, fan-in the data, many possibilities

Jet : Directed Acyclic Graph
EDGE
The edge is just the data transmission in the pipeline.
→ Out of one processor into the next one
→ Out of one processor into the next ones
→ The next processor can be on any JVM, local or distributed routing
→ Back-pressure system throttles producer when consumer cannot keep up
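The back-pressure idea can be sketched in plain Java with a bounded queue standing in for the edge. This is only an illustration of the principle, not how Jet implements it; the class name, the queue capacity and the one-millisecond delay are assumptions made for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BackPressureSketch {
    /**
     * Runs a fast producer against a slower consumer over a bounded queue.
     * Returns {items consumed, largest backlog seen by the consumer}.
     */
    static int[] run(int items, int capacity) {
        BlockingQueue<Integer> edge = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) {
                    edge.put(i);   // blocks while the queue is full, throttling the producer
                }
                edge.put(-1);      // end-of-stream marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        int consumed = 0;
        int maxBacklog = 0;
        try {
            while (true) {
                maxBacklog = Math.max(maxBacklog, edge.size());
                if (edge.take() < 0) {
                    break;         // end of stream reached
                }
                consumed++;
                Thread.sleep(1);   // simulate a consumer that cannot keep up
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] {consumed, maxBacklog};
    }
}
```

However fast the producer is, the backlog never exceeds the queue capacity, so memory use stays bounded.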

Part 1 – Jet Engine
Stream Processing
Traditional processing is based on calculations on stored data
Stream processing is about calculations prior to storage
Streams are immutable
Streams may be infinite
The “pipeline” paradigm, (input →process →output)
Pipeline stages are lambdas : (x, y) -> {return x * y;}
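As a rough illustration of the input → process → output idea, stages written as lambdas can be chained with the standard java.util.function types. The stage names and the doubling step are invented for this sketch.

```java
import java.util.function.BinaryOperator;
import java.util.function.Function;

class PipelineSketch {
    /** Builds a three-stage pipeline: clean the input, parse it, then process it. */
    static Function<String, Integer> build() {
        BinaryOperator<Integer> multiply = (x, y) -> { return x * y; }; // the lambda from the slide

        Function<String, String> input = String::trim;            // input stage
        Function<String, Integer> parse = Integer::parseInt;      // processing stage
        Function<Integer, Integer> output = x -> multiply.apply(x, 2); // output stage

        return input.andThen(parse).andThen(output);
    }
}
```

Calling `PipelineSketch.build().apply(" 21 ")` passes the text through each stage in turn.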

Part 1 – Jet Engine
What does it do ?
Stream Processing
In-memory
Distributed

Example 1 : Word Count
Word Count is the “hello world” of stream processing:
The Problem
 Count how many times each word occurs in some text
 Trivial, but shows some major concepts
Input
 Hamlet’s Soliloquy
1: To be, or not to be, that is the Question:
2: Whether ’tis Nobler in the mind to suffer
3: The Slings and Arrows of outragious Fortune,
4: Or to take Armes against a Sea of troubles,
Output
the=23
to=14
and=13
be=4
…

Set<Map.Entry<Integer, String>> entrySet = sourceMap.entrySet();
Map<String, Integer> wordCounts = entrySet.stream()
    .flatMap(m ->
        Stream.of(Constants.WORDS_PATTERN.split(m.getValue())))
    .map(String::toLowerCase)
    .filter(m -> m.length() >= 5)
    .collect(toMap(
        key -> key,
        value -> 1,
        Integer::sum));
In Java we would basically iterate and tally
How can the JVM optimise?
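For comparison, "iterate and tally" might look like the loop below. It is a sketch only: the splitting pattern is an assumption, since Constants.WORDS_PATTERN is not shown on the slides, and the length filter from the stream version is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class TallySketch {
    // Assumed delimiter: any run of non-word characters
    private static final Pattern NON_WORD = Pattern.compile("\\W+");

    /** Counts how many times each lower-cased word occurs across all lines. */
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : NON_WORD.split(line.toLowerCase())) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // add 1 to the running total
                }
            }
        }
        return counts;
    }
}
```

The loop fixes the order of work up front, which is part of why it gives the JVM little room to spread the counting across CPUs.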

Input → (text) → Tokenizer → (word) → Reducer → (word, count) → Output
Tokenizer: split the text into words; for each word emit (word)
Reducer: collect running totals; once everything is finished, emit all pairs of (word, count)
But really this is just a pipeline, so a DAG

[Diagram: Input → (text) → Tokenizer → (word) → Reducer → (word, count) → Output]
Using queues between vertices allows each to run in parallel, at their own speed
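A small stand-alone sketch of that arrangement: the tokenizer runs on its own thread and hands words to the reducer through a queue, so neither waits for the other to finish the whole input. The class name, the end-of-stream marker and the splitting pattern are assumptions for the example; Jet's own queues and threading are not shown here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueuedWordCount {
    // End-of-stream marker; it can never be a real word because it contains a non-word character
    private static final String DONE = "\u0000DONE";

    static Map<String, Integer> count(List<String> lines) {
        BlockingQueue<String> words = new LinkedBlockingQueue<>();

        // Tokenizer vertex: splits lines and emits one word at a time
        Thread tokenizer = new Thread(() -> {
            for (String line : lines) {
                for (String word : line.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
            words.add(DONE);
        });
        tokenizer.start();

        // Reducer vertex: keeps running totals until the tokenizer signals the end
        Map<String, Integer> counts = new HashMap<>();
        try {
            for (String word = words.take(); !word.equals(DONE); word = words.take()) {
                counts.merge(word, 1, Integer::sum);
            }
            tokenizer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counts;
    }
}
```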

[Diagram: Input → two parallel Tokenizers → Reducer → Output (word, count)]
We can exploit multiple CPUs because lines can be processed in parallel

[Diagram: Input → two parallel Tokenizers → (word) → two parallel Reducers → Output]
Use routing algorithms to select the next vertex or vertices
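One simple routing rule, shown here purely as an illustration, is to hash the word and take the remainder by the number of reducers, so every occurrence of a word reaches the same reducer. Whether Jet routes exactly this way is not stated on the slides; the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;

class RoutingSketch {
    /** Picks a reducer index for a word; the same word always maps to the same index. */
    static int route(String word, int reducerCount) {
        return Math.floorMod(word.hashCode(), reducerCount);
    }

    /** Places each word in the inbox of the reducer chosen by route(). */
    static List<List<String>> distribute(List<String> words, int reducerCount) {
        List<List<String>> inboxes = new ArrayList<>();
        for (int i = 0; i < reducerCount; i++) {
            inboxes.add(new ArrayList<>());
        }
        for (String word : words) {
            inboxes.get(route(word, reducerCount)).add(word);
        }
        return inboxes;
    }
}
```

Because a given word only ever lands on one reducer, each reducer's running total for that word is already complete.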

[Diagram: two Nodes, each with Input, two Tokenizers, two Reducers, two Combiners and Output]
Distribute!!
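The reducer and combiner split can be sketched as two steps: each node reduces its own words to partial counts, then the partial counts are merged. This is a single-JVM illustration with invented names, not Jet code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CombinerSketch {
    /** Reducer: counts the words seen on one node. */
    static Map<String, Integer> reduceLocally(List<String> words) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : words) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    /** Combiner: adds together the partial counts from every node. */
    static Map<String, Integer> combine(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }
}
```

Only the small partial maps need to cross the network, rather than every individual word.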

Example 2 : Foreign Currency
The Problem :
Time-series foreign exchange prices.
We want to compute moving averages in various ways
→ last n measurements, last 15, last 50, etc
Why ?
→ rapidly changing data
→ time-to-market benefits from fast processing
Why ?
→ gives a clearer view of the trend
Why ?
→ to demonstrate a different architecture pattern
→ processing a stream of data, don’t save first then analyse
→ partitioning a stream of data, for scaling

The Data
For convenience, we’re using end of day prices rather than live prices, so frequency is one
sample per 24x60x60x1000 milliseconds. And only for the Euro.
<gesmes:Sender>
<gesmes:name>European Central Bank</gesmes:name>
</gesmes:Sender>
<Cube>
<Cube time="2017-04-20">
<Cube currency="USD" rate="1.0745"/>
<Cube currency="JPY" rate="117.16"/>
<Cube currency="BGN" rate="1.9558"/>
<Cube currency="CZK" rate="26.907"/>
<Cube currency="DKK" rate="7.4381"/>
<Cube currency="GBP" rate="0.8392"/>
<Cube currency="HUF" rate="313.5"/>
<Cube currency="PLN" rate="4.2588"/>
<Cube currency="RON" rate="4.5405"/>
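A feed in this shape could be turned into (currency, rate) pairs with the JDK's built-in XML parser, roughly as below. This is a sketch that assumes a complete, well-formed document; the excerpt above is cut short, and the class and method names are invented for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class EcbRatesSketch {
    /** Extracts currency → rate from every Cube element that carries a currency attribute. */
    static Map<String, Double> parseRates(String xml) {
        try {
            NodeList cubes = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("Cube");
            Map<String, Double> rates = new LinkedHashMap<>();
            for (int i = 0; i < cubes.getLength(); i++) {
                Element cube = (Element) cubes.item(i);
                // The outer Cube elements only group by date; skip them
                if (cube.hasAttribute("currency")) {
                    rates.put(cube.getAttribute("currency"),
                            Double.parseDouble(cube.getAttribute("rate")));
                }
            }
            return rates;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse rates", e);
        }
    }
}
```

Each resulting pair, combined with the base currency, gives one (from,to,price) item for the stream.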

[Diagram: Input: FX feed → (from,to,price) → Last n Window]
One Solution
Input arrives as a stream of individual prices, e.g. “EUR,GBP,0.8392”
Collate these into batches of n per pair
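A "last n" window per currency pair might be kept as below. This is a stand-alone sketch rather than the demo's LastN processor, and it assumes a sliding window: once n prices have arrived, every new price produces a fresh batch of the latest n.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LastNWindowSketch {
    private final int n;
    private final Map<String, Deque<Double>> windows = new HashMap<>();

    LastNWindowSketch(int n) {
        this.n = n;
    }

    /**
     * Accepts one "from,to,price" item. Returns the latest n prices for that pair
     * (oldest first) once n have been seen, otherwise an empty list.
     */
    List<Double> accept(String csv) {
        String[] parts = csv.split(",");
        String pair = parts[0] + "," + parts[1];
        Deque<Double> window = windows.computeIfAbsent(pair, key -> new ArrayDeque<>());
        window.addLast(Double.parseDouble(parts[2]));
        if (window.size() > n) {
            window.removeFirst(); // drop the oldest price
        }
        return window.size() == n ? new ArrayList<>(window) : new ArrayList<>();
    }
}
```

The window is stateful, but the state is per pair, which is what later allows the stream to be partitioned.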

[Diagram: Input: FX feed → (from,to,price) → Last n Window → n * (from,to,price) → Simple Average and Weighted Average]
One Solution
Send a self-contained parcel of work to each calculator
A batch of n prices for a pair, e.g. “EUR,GBP,0.8392, 0.8391, 0.8390, 0.8389, …”
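Given such a batch, the two calculators could be as simple as the sketch below. The weighting scheme is an assumption (linear weights, with the newest price counting most), since the slides do not say which weights the demo uses.

```java
import java.util.List;

class AverageSketch {
    /** Simple average: every price in the batch counts equally. */
    static double simple(List<Double> prices) {
        double sum = 0;
        for (double price : prices) {
            sum += price;
        }
        return sum / prices.size();
    }

    /** Weighted average with assumed linear weights: oldest price 1, newest price n. */
    static double weighted(List<Double> prices) {
        double weightedSum = 0;
        int weightTotal = 0;
        for (int i = 0; i < prices.size(); i++) {
            int weight = i + 1;
            weightedSum += weight * prices.get(i);
            weightTotal += weight;
        }
        return weightedSum / weightTotal;
    }
}
```

Because each batch carries everything the calculation needs, the calculators themselves can stay stateless.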

[Diagram: Input: FX feed → (from,to,price) → Last n Window → n * (from,to,price) → Simple Average and Weighted Average → (from,to,average) → Output: Store A and Output: Store B]
One Solution
Stream out the averages…
Your output is someone else’s input

[Diagram: Input: FX feed, partitioned by currency. (from,USD,price) and (from,CAD,price) each flow to their own Last n Window → n * (from,USD,price) or n * (from,CAD,price) → Simple Average and Weighted Average → (from,USD,average) or (from,CAD,average) → Output: Store A and Output: Store B]
One Solution
Partitioning provides performance. Send US Dollars and Canadian Dollars to different processor clones

One Solution
DEMO
https://github.com/neilstevenson/jeeconf2017

Jet Engine
Jet capability is easy to add to IMDG
Two steps and you’re ready to submit jobs!
<dependency>
    <groupId>com.hazelcast.jet</groupId>
    <artifactId>hazelcast-jet</artifactId>
    <version>0.3.1</version>
</dependency>

@Bean
public JetInstance jetInstance(Config config) {
    JetConfig jetConfig = new JetConfig();
    jetConfig.setHazelcastConfig(config);
    return Jet.newJetInstance(jetConfig);
}

Jet Engine
Jet capability is the processing, but what about the start and end of the pipelines ?
A source creates output without input.
A sink consumes input without output.
Where it goes is just a matter of plumbing
→ Hazelcast IMDG, IMap and IList
→ Kafka
→ HDFS
→ flat files
→ sockets
→ easy to write your own, they’re just vertices
implement process() to consume input
implement complete() to generate output
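The process()/complete() split can be illustrated with a small stand-in interface. MiniVertex below is hypothetical and is not Jet's Processor API; it only mirrors the two method names mentioned above to show a reducer that consumes items one by one and emits its totals at the end.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

class ReducerSketch {
    /** Hypothetical stand-in for a vertex; not the real Jet interface. */
    interface MiniVertex<I, O> {
        void process(I item);                  // called for every input item
        void complete(Consumer<O> downstream); // called once, when input is exhausted
    }

    /** Collects running totals, then emits all (word, count) pairs. */
    static class CountingReducer implements MiniVertex<String, Map.Entry<String, Integer>> {
        private final Map<String, Integer> totals = new TreeMap<>();

        @Override
        public void process(String word) {
            totals.merge(word, 1, Integer::sum);
        }

        @Override
        public void complete(Consumer<Map.Entry<String, Integer>> downstream) {
            totals.entrySet().forEach(downstream);
        }
    }
}
```

A source would do all its work in complete(), and a sink all its work in process().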

Jet Engine
DAG construction is easy(?)
Create vertices, and edges to link them
public MaDAG(final int last) {
    // Source vertex: read the historic currency prices from an IMap
    Vertex mapSource = this.newVertex("mapSource",
            Processors.readMap(Constants.MAP_HISTORIC_CURRENCY));

    // Window vertex, fed by an edge partitioned on the key from MaKeyExtractor
    Vertex lastN = this.newVertex("lastN", new LastNProcessorSupplier(last));
    this.edge(Edge.between(mapSource, lastN).partitioned(new MaKeyExtractor()));

    // Averaging vertex, fed from output 0 of the window vertex
    Vertex sma = this.newVertex("sma", SmaProcessor::new);
    this.edge(Edge.from(lastN, 0).to(sma));

    // Sink vertex: write the averages to another IMap
    Vertex smaMapSink = this.newVertex("smaMapSink",
            Processors.writeMap(Constants.MAP_SMA));
    this.edge(Edge.between(sma, smaMapSink));
}
But is there any easier way ?

Jet Engine
java.util.stream
An easier(?) way to construct a pipeline
Change from Java 8
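Set<Map.Entry<Integer, String>> entrySet = sourceMap.entrySet();
Map<String, Integer> wordCounts = entrySet.stream()
    .flatMap(m ->
        Stream.of(Constants.WORDS_PATTERN.split(m.getValue())))
    .map(String::toLowerCase)
    .filter(m -> m.length() >= 5)
    .collect(toMap(
        key -> key,
        value -> 1,
        Integer::sum));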

Jet Engine
com.hazelcast.jet.stream
An easier(?) way to construct a pipeline
Change to Jet
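IStreamMap<Integer, String> streamMap = IStreamMap.streamMap(sourceMap);
IMap<String, Integer> wordCounts = streamMap.stream()
    .flatMap(m ->
        Stream.of(Constants.WORDS_PATTERN.split(m.getValue())))
    .map(String::toLowerCase)
    .filter(m -> m.length() >= 5)
    .collect(toIMap(
        key -> key,
        value -> 1,
        Integer::sum));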
More thinking than typing

Jet Engine
DAG v java.util.stream
Jet provides a java.util.stream interface – high-level constructs
like Java 8’s collect(), distinct(), filter(), reduce(), sorted() etc
but run distributed
Or use the DAG approach, for a low-level, fine-grained approach
Or mix & match
Vertex tokenize = dag.newVertex("tokenize",
    flatMap((String line) ->
        traverseArray(delimiter.split(line.toLowerCase()))
            .filter(word -> !word.isEmpty())));

Jet Engine
DAG v java.util.stream
Vertex tokenize = dag.newVertex("tokenize",
    flatMap((String line) -> traverseArray(delimiter.split(line.toLowerCase()))
        .filter(word -> !word.isEmpty())));
Here filter implements
java.util.stream.Stream<T>
java.util.stream.Stream.filter(Predicate<? super T> predicate)
But the Jet version is
com.hazelcast.jet.stream.DistributedStream<T>
com.hazelcast.jet.stream.DistributedStream.filter(
com.hazelcast.jet.Distributed.Predicate<? super T> predicate)
So you can send copies to the grid to execute, remotely and in parallel
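What makes a lambda sendable is that its target type is Serializable. The sketch below shows the idea with plain Java serialization inside one JVM; SerializablePredicate is a hypothetical interface written for this example, assumed to be similar in spirit to Jet's distributed predicate, and no grid is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Predicate;

class SerializableLambdaSketch {
    /** Hypothetical: an ordinary Predicate that can also be serialized. */
    interface SerializablePredicate<T> extends Predicate<T>, Serializable { }

    /** The same test as the tokenizer's filter, but with a serializable target type. */
    static SerializablePredicate<String> nonEmpty() {
        return word -> !word.isEmpty();
    }

    /** Turns the lambda into bytes, as would be needed to ship it elsewhere. */
    static byte[] toBytes(SerializablePredicate<String> predicate) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(predicate);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rebuilds the lambda from bytes, as the receiving side would. */
    @SuppressWarnings("unchecked")
    static SerializablePredicate<String> fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (SerializablePredicate<String>) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A lambda assigned to a plain java.util.function.Predicate would fail at writeObject, which is why the distributed version needs its own functional interfaces.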

Jet Engine
Architecture

Jet Roadmap
Feature : Description
Robust Stream Processing : Processing guarantees for stream processing | Streaming-specific features (windowing, triggering)
High Performance
Hazelcast Integrations : JCache | Map and Cache events using partition ring buffer | CQ Cache | Projection and Predicate for Map source
Management Center : Management and monitoring features for Jet
More Connectors : JMS | JDBC
Cloud Deployment : Pivotal Cloud Foundry | OpenShift

Performance
Fastest in town!

Performance
Run the graph on as many machines as necessary or available
→ Fan-out the input
→ Send from node to node, local or distributed
→ Fan-in the output

Conclusions
Stream Processing
• Suitable when data arrives too fast to process after storing, or where you don’t care to store
• Needs a much more functional programming style than traditional Java
• → lambdas feature heavily
• Java streams are OK, and might be all you need
• → makes good use of a single machine
• Jet streams are better for bigger volumes
• → makes use of multiple machines
• Jet is from Hazelcast
• → easy to get going, deploy to bare metal or any cloud
• Alternatives exist, such as Spark and Flink
• → Jet is open source, written in Java, faster, and needs no ZooKeeper

The End
https://github.com/neilstevenson/jeeconf2017
neil@hazelcast.com
https://jet.hazelcast.org/
https://github.com/hazelcast/hazelcast-jet
Stack Overflow “hazelcast-jet” or Google Group
https://gitter.im/hazelcast/home
